Find and Replace Overlapping Substrings
Hello,
I want to find a set of substrings (DNA sequences between 19 and 24 characters long, made up of 'ACGT') in a bigger string (template DNA) and replace each one with '*' repeated for the length of the substring. I have the following code.
%"template" is an 8x1 cell array with original DNA sequence data (around 1800 chars each). To minimize the example I just go through the first cell.
%"substring" is e.g. a 50x2 cell array, with column 1 = substring and column 2 = length of the substring.
%"substituted_seq" is a 8x1 cell array with the replaced sequence (substrings substituted by '*')
%
substituted_seq{1,1} = strrep(template{1,1},substring{1,1},'*');
for j=1:size(substring,1)
substituted_seq{1,1} = strrep(substituted_seq{1,1},substring{j,1},'*');
end
The first problem is that these substrings overlap each other. When I replace the first substring with '*' and then search for the next one (which overlaps the first), this code no longer finds it, so it is not replaced.
Second, I also couldn't figure out how to replace a substring such as 'ACGTCG' with the same number of '*' characters (in this example '******').
I would be very grateful for any help. Thanks!
Accepted Answer
Robert Cumming
30 August 2012
I would make a logical flag vector the same length as your string. Then run through all your substrings and set the flag to true for the characters to be replaced with '*'. Because every search runs against the original, unmodified string, this eliminates the problem of overlapping.
Once that is done, you replace all the characters marked true by the flag in your string.
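The flag idea can be sketched as follows. This is only a minimal illustration, not necessarily the exact code the answer had in mind: the short template and the two overlapping substrings are made-up example data, the variable names mask, seq, sub and starts are my own, and it relies on strfind returning the start index of every match, including overlapping ones.

```matlab
% Hypothetical example data; the variable names mirror the question.
template  = {'AACGTACGTTGCA'};
substring = {'ACGTA'; 'TACGT'};      % these two overlap inside the template

seq  = template{1};
mask = false(size(seq));             % true marks a character to be replaced

for j = 1:size(substring, 1)
    sub    = substring{j, 1};
    starts = strfind(seq, sub);      % always search the ORIGINAL sequence
    for s = starts                   % loop body is skipped when nothing matches
        mask(s : s + numel(sub) - 1) = true;
    end
end

% Replace all flagged characters in one step; the length is preserved.
substituted_seq = template;
substituted_seq{1}(mask) = '*';
disp(substituted_seq{1})             % A********TGCA
```

Since the mask is applied only once at the end, each substring is replaced by as many '*' as it has characters, so this sketch would cover the second issue as well.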
For your second issue, something like:
flag = 'CC';
key = regexprep ( 'CC', '.', '*' );
regexprep ( 'ABCCCCCDEFG', flag, key )
ans =
AB****CDEFG