Improving efficiency of a char array function

Question

0 개 추천

I have built a function to be run 100000 times in a loop. Its input, "motifs", is a full 20x15 char array, containing only four types of characters, 'A', 'C', 'G', 'T'. I am wondering if my implementation, below, can be improved of if it is pretty much as fast as it gets:

% Find consensus string and score of a motifs array
function [score, count, consensus] = scoremotifs_2(motifs)
count = [ sum(motifs == 'A',1) ;
          sum(motifs == 'C',1) ;
          sum(motifs == 'G',1) ;
          sum(motifs == 'T',1) ];    
[count_max, consensus_num] = max(count);
consensus(consensus_num == 1) = 'A';
consensus(consensus_num == 2) = 'C'; 
consensus(consensus_num == 3) = 'G';
consensus(consensus_num == 4) = 'T';
score = sum(sum(motifs ~= consensus));

댓글 수: 3
이전 댓글 1개 표시 이전 댓글 1개 숨기기

the cyclist 2017년 1월 8일

편집: the cyclist 2017년 1월 8일

One thing to be aware of ...

It looks to me like your algorithm will be biased to overcount 'A' relative to the other elements, because in the case of a tied number of counts, consensus_num_ is going to choose the first index. (Similarly, 'C' will be more likely than 'G' and 'T', etc.)

Paolo Binetti 2017년 1월 8일

The code is intended for a random motif search algorithm within multiple sequences of DNA. Good on you for noticing the over-bias issue. It is OK for the particular problem I am solving.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question

Answer 1

the cyclist 2017년 1월 8일

편집: the cyclist 2017년 1월 8일

MATLAB Online에서 열기

1 개 추천

This is starting to get a little obfuscated, but it's significantly faster:

list ='ACGT';
count = sum(motifs == reshape(list,[1 1 4]));
[count_max, consensus_num] = max(count,[],3);
consensus = list(consensus_num);
score = sum(sum(motifs ~= consensus));

The key algorithmic difference here is that I am able to compare against all four elements 'ACGT' in parallel, by permuting them into a third dimension.

You need R2016b or later in order for MATLAB to make the implicit dimension expansion to compare the 20x15x1 array against the 1x1x4 array. If you have an older version, you will explicitly need to use repmat.

댓글 수: 2
없음 표시 없음 숨기기

the cyclist 2017년 1월 8일

MATLAB Online에서 열기

Also, because you are calling this function lots of time, you might want to do something like defining

list = 'ACGT';
plist = reshape(list,[1 1 4]);

in the calling function, and pass them in as argument, rather than needing to do that reshape every time. But I think that will be only a tiny speedup, if any.

Paolo Binetti 2017년 1월 9일

편집: Paolo Binetti 2017년 1월 9일

MATLAB Online에서 열기

So, your first suggestion concerned a better way to compute "consensus":

list = 'ACGT';
consensus = list(consensus_num);

I find it more elegant and it is faster, but only slightly.

Then I realized that you do not strictly need "consensus" in order to compute "score". In fact it is faster to directly use "count" like this:

score = tk-sum(max(count)); % count is a 2D array

where "tk" is the number of lines times number of columns of "motifs", which can be fed as an input to the function. In addition, I also realized that I do not strictly need "consensus" as an output, so I removed its above computation altogether. The function then becomes:

function [score, count] = scoremotifs_3(motifs, tk)
count = [ sum(motifs == 'A',1) ;
          sum(motifs == 'C',1) ;
          sum(motifs == 'G',1) ;
          sum(motifs == 'T',1) ];    
score = tk-sum(max(count));

Now, about the "count" computation. Your suggestion to vectorize the comparison also worked, except that I have to reshape the resulting "count" from a 3D array to a 2D array, which is what I need as output. The function is therefore:

function [score, count] = scoremotifs_4(motifs, plist, k, tk)
count = sum(motifs == plist); count = reshape(count,k,4)';
score = tk-sum(max(count));

This already brings visible improvements. But then I wondered: since it's just two lines, why not write them directly in the main code rather than calling them as functions? I did that and the improvement was dramatic. In summary, I ended up replacing the whole function with just these two lines:

    count = sum(motifs == plist); count = reshape(count,k,4)';
    score = tk-sum(max(count));

Once again, for a beginner like me it is very instructive to learn that you can significantly improve even apparently simple pieces of code. Thank you.

댓글을 달려면 로그인하십시오.

Answer 2

the cyclist 2017년 1월 8일

MATLAB Online에서 열기

1 개 추천

It's somewhat faster to calculate consensus like this:

list = 'ACGT';
consensus = list(consensus_num);

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Improving efficiency of a char array function

댓글 수: 3
이전 댓글 1개 표시 이전 댓글 1개 숨기기

채택된 답변

댓글 수: 2
없음 표시 없음 숨기기

추가 답변 (1개)

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

카테고리

태그

Community Treasure Hunt

Improving efficiency of a char array function

댓글 수: 3 이전 댓글 1개 표시 이전 댓글 1개 숨기기

채택된 답변

댓글 수: 2 없음 표시 없음 숨기기

추가 답변 (1개)

댓글 수: 0 이전 댓글 -2개 표시 이전 댓글 -2개 숨기기

카테고리

태그

참고 항목

Community Treasure Hunt

댓글 수: 3
이전 댓글 1개 표시 이전 댓글 1개 숨기기

댓글 수: 2
없음 표시 없음 숨기기

댓글 수: 0
이전 댓글 -2개 표시 이전 댓글 -2개 숨기기