Count number of nucleotide bases (letters) in file with multiple entries

Question

George 2024년 8월 30일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries

편집: dpb 2024년 9월 1일

Below is an example of the type of sequences I'm dealing with. The sequences are truncated for readability. Some IDs correspond to one sequence only whereas others to more than one squence. The objective is to count the number of nucleotides (letters) per ID.

>AAEL001292::intron  | Intron sequence
GTGGG
>AAEL001312::intron  | Intron sequence
GTAAGAAC
>AAEL001320::intron  | Intron sequence
GTAAGACCTACTCA
>AAEL001807::intron  | Intron sequence
GTACT
>AAEL002085::intron  | Intron sequence
GTTAGTTGTA
>AAEL002085::intron  | Intron sequence
GTAAGT
>AAEL002085::intron  | Intron sequence
GTT
>AAEL002085::intron  | Intron sequence
GTACGA
>AAEL002633::intron  | Intron sequence
GTAAGAAAAAT

I've managed to generate a table like:

AAEL001292        5
AAEL001312        8
AAEL001320      14
AAEL001807        5
AAEL002085      10
AAEL002085        6
AAEL002085        3
AAEL002085        6
AAEL002633       11

The desired output is:

AAEL001292      5
AAEL001312      8
AAEL001320     14
AAEL001807        5
AAEL002085      25  # sum of number of nuceotides under to ID AAEL002085 
AAEL002633       11

Any suggestions will be most helpful.

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

dpb 2024년 8월 30일

Be much easier if you would attach a data file for folks to work with...as for hints, without a data file from which to see just what you're starting with, I'd venture using a table and @doc:rowfun or groupsummary even with grouping variables will give you the desired results directly.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Voss 2024년 8월 30일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#answer_1508074

MATLAB Online에서 열기

file.txt

% this is what the attached file contains:
type file.txt
>AAEL001292::intron  | Intron sequence
GTGGG
>AAEL001312::intron  | Intron sequence
GTAAGAAC
>AAEL001320::intron  | Intron sequence
GTAAGACCTACTCA
>AAEL001807::intron  | Intron sequence
GTACT
>AAEL002085::intron  | Intron sequence
GTTAGTTGTA
>AAEL002085::intron  | Intron sequence
GTAAGT
>AAEL002085::intron  | Intron sequence
GTT
>AAEL002085::intron  | Intron sequence
GTACGA
>AAEL002633::intron  | Intron sequence
GTAAGAAAAAT
data = readlines('file.txt');
ID = regexp(data(1:2:end),'>(.*?):','tokens','once');
ID = vertcat(ID{:});
introns = data(2:2:end);
[G,ID] = findgroups(ID);
F = @(x)sum(cellfun(@numel,x));
count = groupsummary(introns,G,F);
T = table(ID,count)
T = 6x2 table
         ID         count
    ____________    _____

    "AAEL001292"      5  
    "AAEL001312"      8  
    "AAEL001320"     14  
    "AAEL001807"      5  
    "AAEL002085"     25  
    "AAEL002633"     11  

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Answer 2

dpb 2024년 8월 31일

1
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#answer_1508314

편집: dpb 2024년 9월 1일

MATLAB Online에서 열기

grp1_introns.txt

I'll convert the comment to an Answer because I believe it is/will be part of the required solution and is probably easier to clean the data to match the calculation assumptions than to modify the calculations to match the varying number of rows per group ID.

ADDENDUM

Another way to approach the cleanup would be to compute the line number difference between the lines beginning with ">" -- if it is >2, then that group has one or more wrapped lines.

data = readlines('grp1_introns.txt');
data(strlength(data)==0)=[];            % remove blank lines, if any
data(1:12)
ans = 12x1 string array
    ">AAEL001292::intron  | Intron sequence"
    "GTGGGTGATTGTGTTTTTTTGATTTTGATACAAGTTAACAACATTGTATTTTTCAG"
    ">AAEL001312::intron  | Intron sequence"
    "GTAAGAACACATGCGTCGTCAATGCTTAACAAGATTTAATACAATTTGCGCATCTGTTTC"
    "AG"
    ">AAEL001320::intron  | Intron sequence"
    "GTAAGACCTACTCATTAGATTTATAGACAGTGATGTAATACTAATTATGCATTCATTTTA"
    "G"
    ">AAEL001807::intron  | Intron sequence"
    "GTACTTTTGGTTGACGGAAGTTTTAGAATGAGAATTCGTCTAATCGTTATATATATTTTA"
    "CAG"
    ">AAEL002085::intron  | Intron sequence"
ix=find(startsWith(data,'>'));
dx=[diff(ix);0];
iy=find(dx>2);
whos ix dx iy
  Name        Size            Bytes  Class     Attributes

  dx        458x1              3664  double              
  ix        458x1              3664  double              
  iy        290x1              2320  double              
[ix(1:10) dx(1:10) iy(1:10)]
ans = 10x3
     1     2     2
     3     3     3
     6     3     4
     9     3     5
    12     4     8
    16     2     9
    18     2    10
    20     3    11
    23     9    12
    32     9    13
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>

NOTA BENE: There are wrapped lines with more than two lines; the sequence at line 16 is three while that at line 32 is 8 since the differences are 4 and 9, respectfully.

One will need to iterate through and combine all those rows in the counting, either by catenating the lines into one long one and removing the wrapped lines so the file follows the assumption of only one line or by summing the lengths of the lines between groups.

Well, let's see if I can manage to keep the indexing straight without too much trouble...

markfordelete=[];                       % accumulator for marked rows
for i=1:numel(iy)                       % iterate over the elements that are multiple lines
  l1=ix(iy(i))+1;                       % the first row in the group past the header
  l2=l1+dx(iy(i))-2;                    % the last row in group
  try
    data(l1)=join(data(l1:l2),'');      % put those rows together, no spaces
    markfordelete=cat(1,markfordelete,[l1+1:l2].');   % keep list of rows to delete
  catch
    e=lasterror; e.message              % for debugging
    [i iy(i) ix(iy(i)) dx(iy(i)) l1 l2]
    data(l1-1:end)
    return
  end
end
data(markfordelete)=[];

At this point, you should be able to apply @Voss method...it seems to work with at least a cursory checking. You'll want to output to a file and inspect carefully to make sure, but I think it works as intended.

ix=find(startsWith(data,'>'));
height(ix)
ans = 458
any(diff(ix)~=2)
ans = logical
   0

The quick sanity check that the updated file doesn't have any segments of more than one line and the number of segments is the same is agoodsign™

ADDENDUM

Interesting thought -- what's the length of the lines now that are all strung together?

[Lmax,iMax]=max(dx)

Lmax = 401

iMax = 57

L=strlength(data(2:2:end));

UL=unique(L);

numel(UL)

ans = 157

NL=histc(L,UL);

[UL NL]

ans = 157x2

13 1 14 1 50 1 51 3 52 3 53 20 54 11 55 13 56 16 57 32

<mw-icon class=""></mw-icon>

[min(UL) max(UL)]

ans = 1x2

13 23955

subplot(2,1,1)

bar(UL,NL)

set(gca,'XScale','log')

xlim([UL(1) UL(end)])

subplot(2,1,2)

bar(UL,NL)

set(gca,'XScale','log')

xlim([UL(1)-1 200])

xticks([UL(1) 20:10:50 100 200])

sum(L)

ans = 444117

There are a couple of introns that have 400+ rows that adds up to string lengths of 20,000+. I dunno if this is correct or not, but opening the text file confirms that is the content of the uploaded file.

댓글 수: 2
없음 표시없음 숨기기

George 2024년 9월 1일

Grateful for your code. More than I expected.

dpb 2024년 9월 1일

편집: dpb 2024년 9월 1일

No problem; got interesting. Is the file really right that there are strings of 25,000+?

While the forum won't let accept more than one Answer, if the above actually is useful, you could cast a vote... :)

댓글을 달려면 로그인하십시오.

Answer 3

George 2024년 8월 31일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/2149169-count-number-of-nucleotide-bases-letters-in-file-with-multiple-entries#answer_1508259

grp1_introns.txt

Thank you very much for the prompt reply. I tried what you suggested. I get the following error message.

count = groupsummary(introns,G,F);

Error using matlab.internal.math.parseGroupVars (line 129)

Grouping variables and data matrix must have same number of rows.

Error in groupsummary (line 180)

[groupingData,groupVars] = matlab.internal.math.parseGroupVars(groupVars,tableFlag,"groupsummary:Group",T);

I'm attaching the data file in case it helps.

Once again many thanks

댓글 수: 2
없음 표시없음 숨기기

dpb 2024년 8월 31일

편집: dpb 2024년 8월 31일

MATLAB Online에서 열기

grp1_introns.txt

data = readlines('grp1_introns.txt');
data(1:10,:)
ans = 10x1 string array
    ">AAEL001292::intron  | Intron sequence"
    "GTGGGTGATTGTGTTTTTTTGATTTTGATACAAGTTAACAACATTGTATTTTTCAG"
    ">AAEL001312::intron  | Intron sequence"
    "GTAAGAACACATGCGTCGTCAATGCTTAACAAGATTTAATACAATTTGCGCATCTGTTTC"
    "AG"
    ">AAEL001320::intron  | Intron sequence"
    "GTAAGACCTACTCATTAGATTTATAGACAGTGATGTAATACTAATTATGCATTCATTTTA"
    "G"
    ">AAEL001807::intron  | Intron sequence"
    "GTACTTTTGGTTGACGGAAGTTTTAGAATGAGAATTCGTCTAATCGTTATATATATTTTA"

The attached file seems to have a linefeed after a fixed-length record for the longer sequences -- is that real in the file or an artifact of how it was saved/uploaded?

That breaks the assumption @Voss made that there are two lines per section and in using findgroups it is found out when the number of rows don't match.

That assumption is also in @Star Strider's code, but his will still run; just those sequences in the second/wrapped line won't be counted.

I gather it must be true in the real files, too, if yours failed; in that case it will take some additional preprocessing to apply either solution.

Be sure you haven't inadvertently introduced that line break by having looked at the files in a word processing type editor such as Notepad that automatically word wrapped the longer lines (or, maybe even the MATLAB code editor might if the line length is set, I know it is sometimes quite annoying, too, with wanting to wrap comments at ends of somewhat longer code lines...

L=max(strlength(data))      % the maximum line length
ans = 60
ix=find(strlength(data)==L);    % locate those long lines
iy=find(startsWith(data,'>'));  % and the sequence header lines
iy=iy(iy>ix(1));                % keep only those after the first long line
for i=1:numel(ix)
  if iy(i)-ix(i)>2              % wrapped line
    data(ix,:)=join(data(ix:ix+1,:));   % put them back together
  end
end
data(ix+1,:)=[];                % now remove the wrapped lines

George 2024년 9월 1일

Many thanks for your input. Works very well.

댓글을 달려면 로그인하십시오.

Count number of nucleotide bases (letters) in file with multiple entries

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

채택된 답변

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

추가 답변 (2개)

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 2
없음 표시없음 숨기기

참고 항목

카테고리

태그

Community Treasure Hunt

Count number of nucleotide bases (letters) in file with multiple entries

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

채택된 답변

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

추가 답변 (2개)

댓글 수: 2 없음 표시없음 숨기기

댓글 수: 2 없음 표시없음 숨기기

참고 항목

카테고리

태그

Community Treasure Hunt

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 2
없음 표시없음 숨기기