I have the DNA sequence AAGTCAAGTCAATCG, and I split it into overlapping substrings such as AAGT, AGTC, GTCA, TCAA, CAAG, AAGT, and so on. I then need to find the repeated substrings and their frequency counts. Here, for example, AAGT is repeated twice, so I want to get AAGT - 2. How can I do this?

Comments: 2

Stephen23 on 1 Jun 2017
See Andrei Bobrov's answer for an efficient solution.
Andrei Bobrov on 2 Jun 2017
Thank you Stephen!


Accepted Answer

KSSV on 1 Jun 2017

0 votes

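str = {'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'};
% For each unique string, find the positions where it occurs in str
idx = cellfun(@(x) find(strcmp(str, x)), unique(str), 'UniformOutput', false);
L = cellfun(@numel, idx);   % number of occurrences of each unique string
Ridx = find(L > 1);         % unique strings that occur more than once
for i = 1:numel(Ridx)
    st = str(idx{Ridx(i)}); % index with Ridx(i), not the whole Ridx vector
    fprintf('%s string repeated %d times\n', st{1}, numel(idx{Ridx(i)}))
end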
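
The answer above starts from a hand-typed cell array. As a minimal sketch, assuming a window length of 4 as in the question's examples, the substrings could instead be generated from the sequence itself before counting. The variable names `seq`, `k`, `u`, and `cnt` are illustrative and do not come from the original post.

```matlab
% Sketch: build every overlapping k-letter substring of the sequence.
seq = 'AAGTCAAGTCAATCG';          % sequence from the question
k   = 4;                          % window length (assumed from the examples)
n   = numel(seq) - k + 1;         % number of windows
str = arrayfun(@(s) seq(s:s+k-1), 1:n, 'UniformOutput', false);

% Count each distinct substring and report those seen more than once.
[u, ~, j] = unique(str);          % u is sorted; j maps each window to an entry of u
cnt = accumarray(j(:), 1);        % occurrences of each distinct substring
for m = find(cnt > 1).'
    fprintf('%s - %d\n', u{m}, cnt(m));
end
```

Because `unique` is called without the `'stable'` option here, the repeated substrings are reported in alphabetical order rather than in order of first appearance.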

More Answers (2)

Andrei Bobrov on 1 Jun 2017

2 votes

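A = 'AAGTCAAGTCAATCG';
% Each row of B is one overlapping 4-letter substring of A
B = hankel(A(1:end-3),A(end-3:end));
% Distinct rows in order of first appearance; c maps each row of B to a row of a
[a,~,c] = unique(B,'rows','stable');
% Count the occurrences of each distinct row and collect the results in a table
out = table(a,accumarray(c,1),'VariableNames',{'DNA','counts'});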
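
The `hankel` call is compact but can be hard to read at first. The following sketch builds what should be the same sliding-window matrix with an explicit index matrix, assuming MATLAB R2016b or later for implicit expansion. The names `k`, `idx`, `counts`, and `repeated` are illustrative and are not part of the answer above.

```matlab
A = 'AAGTCAAGTCAATCG';
k = 4;                                    % window length
idx = (1:numel(A)-k+1).' + (0:k-1);       % row i holds the indices i, i+1, ..., i+k-1
B = A(idx);                               % char matrix, one substring per row
[a, ~, c] = unique(B, 'rows', 'stable');  % distinct rows in order of first appearance
counts = accumarray(c, 1);                % occurrences of each distinct row
repeated = cellstr(a(counts > 1, :));     % substrings that occur more than once
```

Changing `k` is then enough to count substrings of a different length, whereas the `hankel` version needs both of its index ranges adjusted.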

Comments: 5

Anthony Tracy on 24 Aug 2018
If it's alright, I had a question about the use of unique. Why not use tabulate? Just curious.
Thanks!
Maybe he didn't know about it - I didn't.
outT = tabulate(B)
out =
  8×2 table
    DNA     counts
    ____    ______
    AAGT      2
    AGTC      2
    GTCA      2
    TCAA      2
    CAAG      1
    CAAT      1
    AATC      1
    ATCG      1
outT =
  8×3 cell array
    {'AAGT'}    {[2]}    {[16.6666666666667]}
    {'AGTC'}    {[2]}    {[16.6666666666667]}
    {'GTCA'}    {[2]}    {[16.6666666666667]}
    {'TCAA'}    {[2]}    {[16.6666666666667]}
    {'CAAG'}    {[1]}    {[8.33333333333333]}
    {'CAAT'}    {[1]}    {[8.33333333333333]}
    {'AATC'}    {[1]}    {[8.33333333333333]}
    {'ATCG'}    {[1]}    {[8.33333333333333]}
Anthony Tracy on 24 Aug 2018
Yeah, that's fair. I was just curious, since I was looking at both and wondering why I might want to use one over the other. It seems to come down mainly to whether I want a table or a cell array.
Thanks!
Stephen23 on 26 Aug 2018
tabulate requires the Statistics and Machine Learning Toolbox, which not everyone has.
Ivan Savelyev on 14 Aug 2019
Hi.
I have a question. Sometimes I get ladder-like results (nested sequences) like this:
AAAAAAAAA, which is counted (with frame size 4) as 6 AAAA sequences, which is not correct in some cases (the same applies to ATATATA-type sequences). Is there a solution or algorithm to filter out nested repeats?
Thanks a lot.
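
The thread does not answer this question. One possible reading of "filter nested repeats" is to count only non-overlapping occurrences of a motif. The sketch below does that with a greedy left-to-right scan. The motif `'AAAA'` and the names `starts`, `keep`, and `nextFree` are hypothetical choices for illustration, and other definitions of a nested repeat would need a different filter.

```matlab
seq   = 'AAAAAAAAA';              % example sequence from the comment above
motif = 'AAAA';                   % hypothetical motif to count
starts = strfind(seq, motif);     % all start positions, overlaps included
keep = [];                        % start positions kept after filtering
nextFree = 1;                     % first position not covered by a kept match
for s = starts
    if s >= nextFree              % this match does not overlap the previous kept one
        keep(end+1) = s;          %#ok<AGROW>
        nextFree = s + numel(motif);
    end
end
fprintf('%s: %d overlapping, %d non-overlapping\n', motif, numel(starts), numel(keep));
```

For a literal motif like this one, `regexp(seq, motif, 'start')` should return the same non-overlapping start positions, since regular expression matches do not overlap.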


Steven Lord on 14 Aug 2019

0 votes

For the original question you could convert the char data into a categorical array and call histcounts.
>> C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'})
C =
1×6 categorical array
AAGT AGTC GTCA TCAA CAAG AAGT
>> [counts, uniquevalues] = histcounts(C)
counts =
2 1 1 1 1
uniquevalues =
1×5 cell array
{'AAGT'} {'AGTC'} {'CAAG'} {'GTCA'} {'TCAA'}
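
To get from these counts to the "AAGT - 2" style output asked for in the question, the results can be filtered for counts greater than one. This is a minimal sketch on the same six-element example. The variable `mask` and the print format are assumptions, not part of the answer above.

```matlab
C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'});
[counts, uniquevalues] = histcounts(C);   % one count per category
mask = counts > 1;                        % categories that occur more than once
for m = find(mask)
    fprintf('%s - %d\n', uniquevalues{m}, counts(m));
end
```

With only these six substrings, AAGT is the sole repeat; running the same steps on all twelve windows of the full sequence would also report AGTC, GTCA, and TCAA.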
