Count number of nucleotide bases (letters) in file with multiple entries
조회 수: 3 (최근 30일)
이전 댓글 표시
Below is an example of the type of sequences I'm dealing with. The sequences are truncated for readability. Some IDs correspond to one sequence only whereas others to more than one squence. The objective is to count the number of nucleotides (letters) per ID.
>AAEL001292::intron | Intron sequence
GTGGG
>AAEL001312::intron | Intron sequence
GTAAGAAC
>AAEL001320::intron | Intron sequence
GTAAGACCTACTCA
>AAEL001807::intron | Intron sequence
GTACT
>AAEL002085::intron | Intron sequence
GTTAGTTGTA
>AAEL002085::intron | Intron sequence
GTAAGT
>AAEL002085::intron | Intron sequence
GTT
>AAEL002085::intron | Intron sequence
GTACGA
>AAEL002633::intron | Intron sequence
GTAAGAAAAAT
I've managed to generate a table like:
AAEL001292 5
AAEL001312 8
AAEL001320 14
AAEL001807 5
AAEL002085 10
AAEL002085 6
AAEL002085 3
AAEL002085 6
AAEL002633 11
The desired output is:
AAEL001292 5
AAEL001312 8
AAEL001320 14
AAEL001807 5
AAEL002085 25 # sum of number of nuceotides under to ID AAEL002085
AAEL002633 11
Any suggestions will be most helpful.
댓글 수: 1
dpb
2024년 8월 30일
Be much easier if you would attach a data file for folks to work with...as for hints, without a data file from which to see just what you're starting with, I'd venture using a table and @doc:rowfun or groupsummary even with grouping variables will give you the desired results directly.
채택된 답변
Voss
2024년 8월 30일
% this is what the attached file contains:
type file.txt
data = readlines('file.txt');
ID = regexp(data(1:2:end),'>(.*?):','tokens','once');
ID = vertcat(ID{:});
introns = data(2:2:end);
[G,ID] = findgroups(ID);
F = @(x)sum(cellfun(@numel,x));
count = groupsummary(introns,G,F);
T = table(ID,count)
댓글 수: 0
추가 답변 (2개)
dpb
2024년 8월 31일
편집: dpb
2024년 9월 1일
I'll convert the comment to an Answer because I believe it is/will be part of the required solution and is probably easier to clean the data to match the calculation assumptions than to modify the calculations to match the varying number of rows per group ID.
ADDENDUM
Another way to approach the cleanup would be to compute the line number difference between the lines beginning with ">" -- if it is >2, then that group has one or more wrapped lines.
data = readlines('grp1_introns.txt');
data(strlength(data)==0)=[]; % remove blank lines, if any
data(1:12)
ix=find(startsWith(data,'>'));
dx=[diff(ix);0];
iy=find(dx>2);
whos ix dx iy
[ix(1:10) dx(1:10) iy(1:10)]
NOTA BENE: There are wrapped lines with more than two lines; the sequence at line 16 is three while that at line 32 is 8 since the differences are 4 and 9, respectfully.
One will need to iterate through and combine all those rows in the counting, either by catenating the lines into one long one and removing the wrapped lines so the file follows the assumption of only one line or by summing the lengths of the lines between groups.
Well, let's see if I can manage to keep the indexing straight without too much trouble...
markfordelete=[]; % accumulator for marked rows
for i=1:numel(iy) % iterate over the elements that are multiple lines
l1=ix(iy(i))+1; % the first row in the group past the header
l2=l1+dx(iy(i))-2; % the last row in group
try
data(l1)=join(data(l1:l2),''); % put those rows together, no spaces
markfordelete=cat(1,markfordelete,[l1+1:l2].'); % keep list of rows to delete
catch
e=lasterror; e.message % for debugging
[i iy(i) ix(iy(i)) dx(iy(i)) l1 l2]
data(l1-1:end)
return
end
end
data(markfordelete)=[];
At this point, you should be able to apply @Voss method...it seems to work with at least a cursory checking. You'll want to output to a file and inspect carefully to make sure, but I think it works as intended.
ix=find(startsWith(data,'>'));
height(ix)
any(diff(ix)~=2)
The quick sanity check that the updated file doesn't have any segments of more than one line and the number of segments is the same is agoodsign™
ADDENDUM
Interesting thought -- what's the length of the lines now that are all strung together?
[Lmax,iMax]=max(dx)
L=strlength(data(2:2:end));
UL=unique(L);
numel(UL)
NL=histc(L,UL);
[UL NL]
[min(UL) max(UL)]
subplot(2,1,1)
bar(UL,NL)
set(gca,'XScale','log')
xlim([UL(1) UL(end)])
subplot(2,1,2)
bar(UL,NL)
set(gca,'XScale','log')
xlim([UL(1)-1 200])
xticks([UL(1) 20:10:50 100 200])
sum(L)
There are a couple of introns that have 400+ rows that adds up to string lengths of 20,000+. I dunno if this is correct or not, but opening the text file confirms that is the content of the uploaded file.
댓글 수: 2
George
2024년 8월 31일
댓글 수: 2
dpb
2024년 8월 31일
편집: dpb
2024년 8월 31일
data = readlines('grp1_introns.txt');
data(1:10,:)
The attached file seems to have a linefeed after a fixed-length record for the longer sequences -- is that real in the file or an artifact of how it was saved/uploaded?
That breaks the assumption @Voss made that there are two lines per section and in using findgroups it is found out when the number of rows don't match.
That assumption is also in @Star Strider's code, but his will still run; just those sequences in the second/wrapped line won't be counted.
I gather it must be true in the real files, too, if yours failed; in that case it will take some additional preprocessing to apply either solution.
Be sure you haven't inadvertently introduced that line break by having looked at the files in a word processing type editor such as Notepad that automatically word wrapped the longer lines (or, maybe even the MATLAB code editor might if the line length is set, I know it is sometimes quite annoying, too, with wanting to wrap comments at ends of somewhat longer code lines...
L=max(strlength(data)) % the maximum line length
ix=find(strlength(data)==L); % locate those long lines
iy=find(startsWith(data,'>')); % and the sequence header lines
iy=iy(iy>ix(1)); % keep only those after the first long line
for i=1:numel(ix)
if iy(i)-ix(i)>2 % wrapped line
data(ix,:)=join(data(ix:ix+1,:)); % put them back together
end
end
data(ix+1,:)=[]; % now remove the wrapped lines
참고 항목
카테고리
Help Center 및 File Exchange에서 Text Data Preparation에 대해 자세히 알아보기
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!