Finding the repeated substrings

Question

Reshma Ravi on 1 Jun 2017

Direct link to this question: https://kr.mathworks.com/matlabcentral/answers/342796-finding-the-repeated-substrings

Answered: Steven Lord on 14 Aug 2019

I have a DNA sequence, AAGTCAAGTCAATCG, which I split into overlapping substrings such as AAGT, AGTC, GTCA, TCAA, CAAG, AAGT and so on. I then have to find the repeated substrings and their frequency counts. Here AAGT is repeated twice, so I want to get AAGT - 2. How can this be done?

2 Comments

Stephen23 on 1 Jun 2017

See Andrei Bobrov's answer for an efficient solution.

Andrei Bobrov on 2 Jun 2017

Thank you Stephen!

Answer 1

KSSV on 1 Jun 2017

Direct link to this answer: https://kr.mathworks.com/matlabcentral/answers/342796-finding-the-repeated-substrings#answer_269149

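str = {'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'} ;
% For each unique substring, find the positions where it occurs in str
idx = cellfun(@(x) find(strcmp(str, x)), unique(str), 'UniformOutput', false) ;
L = cellfun(@length,idx) ;   % number of occurrences of each unique substring
Ridx = find(L>1) ;           % unique substrings that occur more than once
for i = 1:length(Ridx)
    st = str(idx{Ridx(i)}) ; % index with Ridx(i) so several repeated substrings are handled
    fprintf('%s string repeated %d times\n',st{1},length(idx{Ridx(i)}))
end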
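The cell array in this answer is typed in by hand. As a minimal sketch, assuming a window length of 4 (taken from the question's example), the overlapping substrings could be generated from the full sequence like this. The names seq, k and n are illustrative and do not come from the thread.

```matlab
% Sketch: build the overlapping substrings from the full sequence.
% Assumption: window length k = 4, as in the question's example.
seq = 'AAGTCAAGTCAATCG';
k = 4;
n = numel(seq) - k + 1;        % number of overlapping windows
str = cell(1, n);
for i = 1:n
    str{i} = seq(i:i+k-1);     % substring starting at position i
end
disp(str(1:6))                 % the six substrings listed in the question
```

The resulting str can then be passed to the counting code above in place of the hand-written list.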

Answer 2

Andrei Bobrov on 1 Jun 2017

Direct link to this answer: https://kr.mathworks.com/matlabcentral/answers/342796-finding-the-repeated-substrings#answer_269150

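A = 'AAGTCAAGTCAATCG';
% Each row of B is one overlapping length-4 substring of A
B = hankel(A(1:end-3),A(end-3:end));
% Unique rows in order of first appearance; c maps each row of B to its unique row
[a,~,c] = unique(B,'rows','stable');
% Count the occurrences of each unique substring
out = table(a,accumarray(c,1),'VariableNames',{'DNA','counts'});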
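The table in this answer lists every distinct substring, including those that occur only once. Since the question asks only for the repeated ones, a possible follow-up step is to filter on the counts column. This is a sketch that repeats the answer's code so it runs on its own; the variable name repeated is an assumption.

```matlab
% Sketch: keep only the substrings that occur more than once.
A = 'AAGTCAAGTCAATCG';
B = hankel(A(1:end-3), A(end-3:end));      % rows are the length-4 windows
[a, ~, c] = unique(B, 'rows', 'stable');
out = table(a, accumarray(c, 1), 'VariableNames', {'DNA','counts'});
repeated = out(out.counts > 1, :);         % rows whose count exceeds 1
disp(repeated)
```

For this sequence the filter should leave four rows (AAGT, AGTC, GTCA and TCAA), each with a count of 2.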

5 Comments (3 older comments not shown)

Stephen23 on 26 Aug 2018

tabulate requires the Statistics and Machine Learning Toolbox, which not everyone has.

Ivan Savelyev on 14 Aug 2019

Hi.

I have a question. Sometimes I get ladder-like results (nested sequences) such as AAAAAAAAA, which will be counted (with frame size 3) as 6 AAAA sequences. That is not correct in some cases (the same applies to ATATATA-type sequences). Is there a solution or an algorithm to filter nested repeats?

Thanks a lot.
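One possible reading of "nested repeats" is that overlapping matches should not be counted separately. Under that assumption, a minimal sketch is to compare strfind, which reports overlapping matches, with regexp, whose matches do not overlap. The motif is assumed to contain only letters, so it can be used as a regular expression without escaping; seq and motif are illustrative values.

```matlab
% Sketch: overlapping vs. non-overlapping counts of a motif.
% Assumption: the motif consists of plain letters (no regexp metacharacters).
seq = 'AAAAAAAAA';
motif = 'AAAA';
overlapping = numel(strfind(seq, motif));      % sliding-window style count
nonOverlapping = numel(regexp(seq, motif));    % regexp resumes after each match
fprintf('%s: %d overlapping, %d non-overlapping\n', motif, overlapping, nonOverlapping)
```

Whether the non-overlapping count is the right one depends on what the repeats are meant to represent, so this is only a starting point.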

Answer 3

Steven Lord on 14 Aug 2019

Direct link to this answer: https://kr.mathworks.com/matlabcentral/answers/342796-finding-the-repeated-substrings#answer_387601

For the original question you could convert the char data into a categorical array and call histcounts.

>> C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'})
C = 
  1×6 categorical array
     AAGT      AGTC      GTCA      TCAA      CAAG      AAGT 
>> [counts, uniquevalues] = histcounts(C)
counts =
     2     1     1     1     1
uniquevalues =
  1×5 cell array
    {'AAGT'}    {'AGTC'}    {'CAAG'}    {'GTCA'}    {'TCAA'}
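
To get from these counts to the "AAGT - 2" style of output the question asks for, the two outputs of histcounts could be filtered and printed. This is a sketch; the names isRepeated, repeatedNames and repeatedCounts are illustrative.

```matlab
% Sketch: report only the categories that occur more than once.
C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'});
[counts, uniquevalues] = histcounts(C);
isRepeated = counts > 1;                   % logical mask of repeated substrings
repeatedNames = uniquevalues(isRepeated);
repeatedCounts = counts(isRepeated);
for i = 1:numel(repeatedNames)
    fprintf('%s - %d\n', repeatedNames{i}, repeatedCounts(i));
end
```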
