Given a very long string, replace chars with numbers and obtain cumulative sum vector

조회 수: 1 (최근 30일)
As input I have a vector of a few million char, which can be 'A', 'C', 'G', 'T'. The vector is called sequence.
As output I want a vector with the cumulative sums of the elements of the input vector, after I have converted its chars in numbers. This code works, but takes forever. What might be the problem?
n = numel(sequence);
skew = zeros(1,n+1,'int32');
for i = 1:n
switch sequence(i)
case 'C'
skew(i+1) = skew(i)-1;
case 'G'
skew(i+1) = skew(i)+1;
otherwise
skew(i+1) = skew(i);
end
end
By the way, I am not interested in the cumulative sums vectors as is. The only thing I care is the index of its minimum.
  댓글 수: 2
John D'Errico
John D'Errico 2016년 12월 17일
Cumsum crashing due to memory issues, on ONLY a few million integers? Please show us what you did, as that is virtually impossible. A loop on only a few million terms taking forever? Again, any reasonable loop of that size will NOT take forever. So you need to show what you are doing. A few million is simply not a very long vector.
v = int32(floor(rand(1,1e7)*3)) - 1;
V = cumsum(v);
whos
Name Size Bytes Class Attributes
V 1x10000000 40000000 int32
v 1x10000000 40000000 int32
So 10 million elements. 40 megabytes per vector? I tried to watch the memory required, but it happened too fast to see a blip.
Paolo Binetti
Paolo Binetti 2016년 12월 17일
You are right! I made a mistake, the problem was not cumsum, but another one which is not covered by this question. I will edit this question to deal only with the for-loop, and then create a new question on the memory issue.

댓글을 달려면 로그인하십시오.

채택된 답변

Image Analyst
Image Analyst 2016년 12월 17일
Then you're doing something wrong. Are you preallocating your cdf array? It should be only a fraction of a second. For example:
numPoints = 20000000; % Twenty million elements
m = randi(-1,1, 1, numPoints);
tic;
cdf = zeros(1, numPoints);
cdf(1) = m(1);
for k = 2 : length(m)
cdf(k) = cdf(k-1) + m(k);
end
toc;
The elapsed time on my computer for twenty million elements is 0.21 seconds. How long is it taking on your computer?
  댓글 수: 3
Image Analyst
Image Analyst 2016년 12월 17일
Well, two thumbs up for MATLAB and two thumbs down for Octave. The Mathworks spent a lot of time about 5 years ago or so to really speed up for loops quite a bit. That probably was not done with open source Octave. But there are still people who hold on to the believe that for loops are horribly slow and should be avoided at all costs. While they may still be slower than vectorized solutions, the belief that they're really slow is now a myth (except for in Octave I guess).
Paolo Binetti
Paolo Binetti 2016년 12월 17일
편집: Paolo Binetti 2016년 12월 17일
So Octave might be the problem ... I will check this out and try also on Matlab, of which I only have an old version however.

댓글을 달려면 로그인하십시오.

추가 답변 (2개)

Star Strider
Star Strider 2016년 12월 17일
If your memory is that limited, this may be a work-around:
V = int32(randi([-1 1], 1, 1000)); % Create Data
Vr = reshape(V, 100, []); % Create Matrix (Can Be Assigned As ‘V’, Separate Here To Check Code)
vctmin = Inf; % Initialise ‘vctmin’
endsum = 0; % Initialise ‘endsum’
for k1 = 1:size(Vr,2)
colsum = cumsum(Vr(:,k1))+endsum; % Calculate Column ‘cumsum’
endsum = colsum(end); % End Value
[colmin,idx] = min(colsum);
if (colmin < vctmin) % Test Minima & Replace
vctmin = colmin; % New Vector Minimum
minidx = idx + (k1 - 1)*size(Vr,1); % New Minimum Index
end
end
vctmin
minidx
[testmin,testidx] = min(cumsum(V))
It creates a matrix from your vector. (I gave them different names here so I could check the code, but assigning the reshaped vector to ‘V’ instead of ‘Vr’ would not use any additional memory.) The code then does the cumulative sum on each column of the matrix, stores the minimum and its index, and proceeds through the matrix. The advantage is that only one column of the matrix at a time is in your workspace, so memory should not be an issue.
When I checked the matrix approach with directly calculating the minimum and its index, the results were the same over several test runs.

Star Strider
Star Strider 2016년 12월 17일
With respect to your creating ‘skew’, this may be more efficient:
bases = {'A','C','T','G'}; % Cell Array
sequence = bases(randi(4, 1, 20));
skew = zeros(1, length(sequence)+1,'int32');
Cix = find(ismember(sequence, 'C'));
Gix = find(ismember(sequence, 'G'));
skew(Cix+1) = -1;
skew(Gix+1) = +1;
I don’t know what ‘sequence’ is, so I created it as a cell array here.
  댓글 수: 3
Star Strider
Star Strider 2016년 12월 17일
Yes! It works the same way, producing the same (correct) result:
bases = ['A','C','T','G']; % Character Array
sequence = bases(randi(4, 1, 20));
skew = zeros(1, length(sequence)+1,'int32');
Cix = find(ismember(sequence, 'C'));
Gix = find(ismember(sequence, 'G'));
skew(Cix+1) = -1;
skew(Gix+1) = +1;
Image Analyst
Image Analyst 2016년 12월 17일
If you're going to ask another question, then please also attach 'source.txt'.

댓글을 달려면 로그인하십시오.

카테고리

Help CenterFile Exchange에서 Logical에 대해 자세히 알아보기

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by