This is a fully vectorized operation and should be quite efficient, although it requires creating two extra matrices.
a = [1 2 0 0;3 4 0 0;5 6 0 0]
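aind = find(a);                       % linear indices of the nonzero elements
vertind = mod(aind-1,size(a,1))+1;    % row number of each nonzero element
b = aind+(vertind-1)*size(a,1);       % target indices: row r moves right by r-1 columns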
z = zeros(size(a));
z(b) = a(aind)
thesum = sum(z)                       % column sums of the shifted matrix (b only holds indices)
The operation on a matrix of your size (100x300000) takes about 0.1 seconds per iteration (excluding the creation of a), and the peak memory consumption (including a) is roughly twice the size of a. The catch is that this does not let values wrap around from the last column back to the first, but that can be fixed with some thought. The method relies on the way matrix elements are indexed linearly, that is, column by column, so in a 4x3 matrix element (1,2) has linear index 5.
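As a rough sketch of how the wrap-around could be added (this variant was not timed and is not part of the solution above), the column of each nonzero can be recovered with ind2sub, shifted modulo the number of columns, and converted back with sub2ind. The input matrix and the names idx, r, c and cnew are chosen here only for illustration.

```matlab
% Illustrative input: row 3 has nonzeros that must wrap around when shifted.
a = [1 2 0 0; 3 4 0 0; 0 0 5 6];

% Column-major linear indexing: in a 4x3 matrix, element (1,2) has index 5.
idx = sub2ind([4 3], 1, 2);           % same as 1 + (2-1)*4

ncols = size(a, 2);
aind = find(a);                       % linear indices of the nonzero elements
[r, c] = ind2sub(size(a), aind);      % row and column of each nonzero element
cnew = mod(c + r - 2, ncols) + 1;     % shift row r right by r-1 columns, wrapping
b = sub2ind(size(a), r, cnew);        % target linear indices

z = zeros(size(a));
z(b) = a(aind);                       % z is [1 2 0 0; 0 3 4 0; 5 6 0 0]
thesum = sum(z, 1);                   % column sums of the shifted matrix
```

Because each row is shifted by a single fixed amount, two nonzeros of the same row never land on the same target index, so the assignment z(b) = a(aind) does not overwrite anything.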
This solution is about 9 times slower than the one proposed by Dishant Arora, but that solution does not return the B matrix.