generate all variations on a 20-mer, that are 1 to 4 mismatches away

조회 수: 4 (최근 30일)
Shlomo Geva
Shlomo Geva 2023년 5월 23일
편집: Shlomo Geva 2023년 6월 22일
We consider a kmer. An arbitray k long DNA sequence, consisting of only {A,C,G,T}.
For instance, 'ACTGGTCATTTGGGCTGGTA'. Let's call it a kernel.
We need to generate from the kernel an array of all unique kmers, each of which differs from the kernel by 1 to n positions.
n is typically a small number - 5 at most.
I wrote a solution, but it is a bit slow - it takes about 1.7 sec to generate all variations on a 20-mer, that are at most 4 mismatches away.
A much faster Matlab solution will be very useful, without going into the rabbit hole of implementing a MEX file.
Thanks!
  댓글 수: 2
Walter Roberson
Walter Roberson 2023년 5월 23일
I wrote a solution, but it is a bit slow - it takes about 1.7 sec to generate all variations on a 20-mer, that are at most 4 mismatches away.
And how many seconds of your life are you prepared to dedicate to making the function faster? What is the estimated total number of times you expect your code will be executed before your program falls out of use?
Shlomo Geva
Shlomo Geva 2023년 5월 23일
편집: Shlomo Geva 2023년 5월 23일
Thanks for responding.
Tens of thousands of times per single application. Virtually limitless number of applications at the pace that genomic data is generated. I can't estimate how long before it falls out of use. Hpoefully it is used... I am trying to beat SOTA tools. BTW - Kodak and Nokia failed to estimate how long before their products fell out of use. Since when is this the only consideration, or even a consideration at all? Is one's curiosity not adequate motivation?

댓글을 달려면 로그인하십시오.

채택된 답변

Matt J
Matt J 2023년 5월 23일
편집: Matt J 2023년 5월 23일
Using blkColon from this FEX download,
kmer='ACTGGTCATTTGGGCTGGTA';
k=length(kmer);
n=4;
tic;
v=nchoosek(1:k,n);
clear c
[c{1:n}]=ndgrid('AGCT');
c=reshape( cat(n+1,c{:}),[],n);
p=height(c);
idx=repmat( any( (1:k)==permute(v,[2,3,1]) ,1) ,p,1,1);
Kmers=repmat( kmer ,p,1,height(v));
Kmers(idx)=repmat( c,1,1,size(idx,3));
Kmers=blkColon(Kmers,[1,k]);
Kmers(all(kmer==Kmers,2),:)=[]; %the result
toc;
Elapsed time is 0.164970 seconds.
  댓글 수: 3
Shlomo Geva
Shlomo Geva 2023년 5월 23일
PS - I had to add a call to unique() since the results include duplicates. But that is quick so not an issue. Incidentally - my implementation code also required a call to unique().
Shlomo Geva
Shlomo Geva 2023년 6월 12일
편집: Shlomo Geva 2023년 6월 22일
@Matt J, I ended up writing a MEX file in c++. Takes about 0.02sec.
I tried, but could not get a similar efficient recursive implementation in Matlab because of the parameter passing by value that appears to slow things down.
Anyhow this is the C++ function (ignore the pack20mer(), irrelevant):
// Function to generate variations with 0 to 4 replacements away from a given 20mer
void generateVariations(std::string& sequence, std::vector<uint64_t>& variations, int replacements = 0, int position = 0) {
variations.push_back(pack20mer(sequence));
if (replacements == 4) {
return;
}
char originalBase;
for (int i = position; i < 20; i++) {
for (char base : {'A', 'C', 'G', 'T'}) {
if (sequence[i] != base) {
originalBase = sequence[i]; // Store the original base value
sequence[i] = base; // Modify the base in 'modifiedSequence'
generateVariations(sequence, variations, replacements + 1, i + 1);
sequence[i] = originalBase; // Restore the original base value
}
}
}
}

댓글을 달려면 로그인하십시오.

추가 답변 (0개)

카테고리

Help CenterFile Exchange에서 Probability Distributions에 대해 자세히 알아보기

제품


릴리스

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by