Find and Replace Overlapping Substrings

Janett Göhring on 30 Aug 2012
Hello,
I want to find a set of substrings (DNA sequences, 19 to 24 characters long, made up of 'ACGT') in a larger string (the template DNA) and replace each one with as many '*' characters as the substring is long. I have the following code:
% "template" is an 8x1 cell array with the original DNA sequence data (around 1800 chars each). To keep the example small, I only go through the first cell.
% "substring" is e.g. a 50x2 cell array, with column 1 = substring and column 2 = length of the substring.
% "substituted_seq" is an 8x1 cell array with the replaced sequence (substrings substituted by '*')
%
substituted_seq{1,1} = strrep(template{1,1},substring{1,1},'*');
for j=1:size(substring,1)
    substituted_seq{1,1} = strrep(substituted_seq{1,1},substring{j,1},'*');
end
The first problem is that these substrings overlap with each other. Once I replace the first substring with '*' and then search for the next one (which overlaps the first), this code no longer finds it, so it is not replaced.
Second, I also couldn't figure out how to replace a substring such as 'ACGTCG' with the same number of '*' (in this example '******').
I would be very grateful for any help. Thanks!

Accepted Answer

Robert Cumming on 30 Aug 2012
I would make a logical flag array with the same length as your string. Then run through all your substrings and set the flag to true for the characters to be replaced with '*'. Because every search runs against the unmodified string, this eliminates the overlap problem.
Once that is done, replace all the characters marked true by the flag in your string.
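A minimal sketch of the flag idea described above, under some assumptions: the sample sequences below are made up for illustration (they are not the poster's data), and the variable name mask is hypothetical. It relies on strfind, which reports the start index of every occurrence, including overlapping ones.

```matlab
% Hypothetical example data; the real "template" and "substring" arrays
% come from the question and are much longer.
template  = {'AACGTCGTTACGTACC'};
substring = {'ACGTCG'; 'CGTCGTT'; 'GTACC'};   % the first two overlap in the template

substituted_seq = cell(size(template));
for i = 1:numel(template)
    seq  = template{i};
    mask = false(size(seq));              % true marks a character to be replaced
    for j = 1:size(substring, 1)
        sub    = substring{j, 1};
        starts = strfind(seq, sub);       % searched in the unmodified sequence
        for s = starts
            mask(s : s + numel(sub) - 1) = true;
        end
    end
    seq(mask) = '*';                      % apply all replacements in one step
    substituted_seq{i} = seq;
end
disp(substituted_seq{1})
```

With this sample data the displayed result is A********AC*****. The sequence keeps its original length, so the second issue (one '*' per replaced character) is handled by the same mask.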
For your second issue, something like:
flag = 'CC';                          % substring to mask
key = regexprep(flag, '.', '*');      % '**': one '*' per character of flag
regexprep('ABCCCCCDEFG', flag, key)   % replaces non-overlapping matches only
ans =
AB****CDEFG
