Loop through a list of vectors (i.e., the rows of a matrix), applying the same "simple" function to each one, on the GPU? Should I use arrayfun somehow?

I currently have an analysis which loops through a list of measurements, where each measurement itself is a vector (so, it loops through the rows of a matrix).
On each measurement input (a 1-D vector) I apply a custom function (not simple enough to share, but nonetheless straightforward), which gives an output of the same size as the input. The analysis function more or less just loops through each element in the vector and does something, but the result for each element in the vector also depends on the previous results (there is a clear dependency ordering).
I often have tens or hundreds of millions of measurements per experiment, and it currently takes about 6 hours to run about 10 million, but we'd really like to get this working in under 15 minutes... and so we hope there is a way to get it to run on a GPU. The key thing is that each measurement is only 15-50 elements long, and so I'd like to parallelise along the number of measurements, rather than within the function which operates on each measurement (which seems straightforward).
I found this question which seemed similar: https://ch.mathworks.com/matlabcentral/answers/83839-can-arrayfun-take-multi-dimensional-arrays-as-individual-arguments but the staff answer was essentially just to loop it, which doesn't seem fast enough for me. It seems like pagefun could work, but I'm struggling to come up with a way how.
For an example, an outline sketch of my code might be:
% Update - see attached files.
gpu_run_test
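Since the real code is in the attached files, here is a hedged sketch of the structure being described - the data sizes and the recurrence arithmetic are placeholders, not the actual analysis:

```matlab
% Illustrative sketch only - the real analysis is in the attached files.
% The arithmetic below is a placeholder stand-in, not the real physics.
number_of_measurements = 4;                          % real runs: 1e7-1e8
raw_measurements = rand(number_of_measurements, 20); % each row: one 15-50 element measurement
measurement_results = nan(size(raw_measurements));
for measurement_index = 1:number_of_measurements
    this_measurement = raw_measurements(measurement_index, :);
    result = nan(size(this_measurement));
    result(1) = this_measurement(1);                 % placeholder first step
    for element_index = 2:numel(this_measurement)
        % each element's result depends on the previous element's result
        result(element_index) = this_measurement(element_index) + ...
                                0.5*result(element_index - 1);  % placeholder recurrence
    end
    measurement_results(measurement_index, :) = result;
end
```

The outer loop over measurements is the part I'd like to parallelise; the inner loop carries the dependency ordering.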
Update (updated again after vectorising - see the most recent update): I've separated out the specific part of my code and created a minimum working example, which is now attached to this post. The code runs without issue, but running it with gpuArrays is currently much slower:
Outdated info
For larger N, this gap only grows. After more thorough research while separating this code example out from everything else, I realised that what I'm hoping for might not be possible in the end... If anyone is able to help, it would be incredibly amazing.
Update 2 - I've reworked my code again, to make it as clear as I can where the dependency lies. It is hopefully more readable now, as the function calls are far more concise and minimal.
In words: the wc at each step depends on the position at that step, but then the position result at each step depends (via a chain of intermediate quantities) on the wc at the previous step. In particular, this occurs where "interp1" is used with arg_position and arg_density to evaluate the wc at the newly-determined position.
Update 3 - Ok so it finally clicked what everyone was suggesting, and indeed it works extremely well now after vectorising across the measurements. The attached code (updated again) works a lot faster, but under one condition:
The arg_position and arg_wc used in the interp1 line must be ordinary 1-D vectors, the same for all measurements. This is not a dealbreaker, as in practice they only change about 2000 times rather than 100 million (indeed, every argument except the first, "invertible_group_delays", only changes 2000 times versus 100 million)...
Does anyone have a suggestion how I can keep all the measurements performed parallel, instead of splitting it into 2000 chunks? The interp1 line seems to be the only issue now (line 45 replacing 47).
Here are the results as they stand now:
gpu_run_test
real scenarios have N = ~1e8
with only ~2000 equilibriums ==> N = ~ 50000 per equilibrium
all arguments except the first will only change each equilibrium
N = 50000
Now doing normal inversion.
Total time for normal inversion:
Elapsed time is 0.567713 seconds.
Maximum difference between expected and achieved result: 5.5202e-06
Now doing GPU.
Total time for GPU inversions:
Elapsed time is 0.502490 seconds.
Maximum difference between expected and achieved result: 5.5202e-06
Done.
gpu_run_test
N = 10000000 ( = 1e7)
Now doing normal inversion.
Total time for normal inversion:
Elapsed time is 93.092429 seconds.
Maximum difference between expected and achieved normal result: 5.5202e-09 mm
Now doing GPU.
Total time for GPU inversions:
Elapsed time is 101.491855 seconds.
Maximum difference between expected and achieved gpu result: 5.5202e-09 mm
Done.
For a typical experiment, I'd like to get through N = 1e8 in under 5 minutes; currently it takes about 15 minutes. Does anyone have advice on the best way to get there from where things stand? Perhaps I just have to buy more CPUs?

Accepted Answer

Matt J on 25 March 2024
Edited: 25 March 2024
Well whether you use the GPU or not, you don't have to loop across measurements. The operations in your example are trivial to vectorize over the measurements so, assuming that's true of your actual code as well, you can gain speed-up just by doing that.
Of course, running the vectorized code on the GPU should speed things up still further, which you can do just by making the variables inside measurement_analysis() into gpuArray variables. All of this is illustrated for your minimized example below:
measurement_results = measurement_analysis(raw_measurements); %pass in the total measurement array

function output_result = measurement_analysis(raw_measurements)
    sum_of_previous = gpuArray(0);
    output_result = gpuArray.nan(size(raw_measurements));
    for k = 1:width(raw_measurements)
        output_result(:,k) = dependent_function(raw_measurements(:,k), sum_of_previous);
        sum_of_previous = sum_of_previous + output_result(:,k);
    end
end

function output = dependent_function(input_vector, sum_of_previous)
    output = sum_of_previous.*sqrt(1 - input_vector./measurement_length);
    %Not sure where "measurement_length" is created. Example doesn't
    %show it.
end

8 Comments

Thank you for the quick help, but the example dependent function is just meant to be an example. It's actually an inversion procedure, where I cannot simply perform element-by-element operations, as the answer at each step depends on the value of an independent function of the answer at the previous step. This solution only works due to the simplicity of my example... I will edit it to give a better impression.
Or perhaps it does work and I'm still just not understanding?
Your revised post does not clarify the structure of the computation, I'm afraid. Basically, everything of substance that dependent_function() was doing has now been hidden in external_function().
If you are simply asking if there is a way to GPU-optimize code without considering the details of the operation being performed, the answer is no.
CPU parallelization via parfor is more flexible though, so as @Catalytic mentions, you could replace the loop over measurements by a parfor loop, as in,
parfor measurement_index = 1:number_of_measurements
    measurement_results(measurement_index,:) = measurement_analysis(raw_measurements(measurement_index,:));
end
How much parfor speeds things up (if at all) is always a bit hard to predict, however. It depends on the operations done inside measurement_analysis() and how efficiently Matlab's built-in parallelization already handles those operations without parfor.
Thanks again for your help so far. I've updated my original question with a complete working example so you can see specifically what I'm trying to do.
In particular, this inversion procedure requires interpolating the "wc" variable at each newly-solved position in order to solve for the next position in that particular sweep. This dependency is why there is no simple invertible matrix we can use to directly get the positions, and why it must be done algorithmically (actually, there is another polarisation which is effectively the same as setting wc = 0 everywhere, and for that polarisation it is possible to directly solve for the answer by inverting a matrix...)
Nonetheless, all of the manipulations in this function are extremely straightforward (even the interp1 can be changed to something simpler if this function is not acceptable)... And so I'm still praying there is a chance I can get this to work (a lot) faster.
"Thanks again for your help so far. I've updated my original question with a complete working example so you can see specifically what I'm trying to do."
Yes, but I tend to agree with @Catalytic that your coding style is very hard to read because of the way you name your variables. I know there are programmers who claim that code that looks like full English words and text is more readable. It isn't....
Perhaps because of that, I cannot see why advice already given to you does not apply. In what way is your updated example fundamentally different from the example you originally posted?
In particular, why must each step of the loop over i_index be a scalar operation? Why can't things like this,
this_group_delay = invertible_group_delays(i_index)/2;
instead be vectorized to things like this,
this_group_delay = invertible_group_delays(:,i_index)/2;
Sorry about my coding style... But indeed I think you're right; a lot of these things I can probably vectorize as you mention. I cut a lot of this code from a more complex class elsewhere, so I bet a lot of these simplified functions could be vectorized somehow.
I'm not super familiar with MATLAB, and so I'm not great at making things the most efficient.
If it would help, I can replace any variable names with something simpler... I tend to use tab completion and string searches to do everything, but it's pretty trivial to upload it in a clearer fashion if you have a recommendation and would offer more specific help as a result.
Thanks again for your input
I think in addition to just presenting the full bulk of your code, you should point to the specific line of code where the advice I've given would not apply. Again, at a quick glance, I cannot see why all data sharing a common i_index cannot be processed together with vectorized commands.
Thanks again for your help - I've taken your advice as best I can, and both tried to vectorise the code as much as possible (reduced function calls) and made it as explicit as possible how the logic dependency works.
It's possible the "calculate_inversion_phi_integral" function could be vectorised yet further, but it wasn't very clear to me how that might be possible... I just know I'm still not very good at this, so there is perhaps still room for improvement.
Nonetheless, hopefully it's clear now how the dependency works. I'm happy for more advice if you have any to offer.
It finally clicked what you meant by vectorise, and I did so over the measurements - it works really well now, GPU or otherwise. Thanks so much for the help - sorry I can be a bit dense when it comes to understanding certain things. :-)
If you have any suggestion for the interp1 issue, to keep it vectorised over all the measurements instead of just all the measurements per equilibrium, that would be seriously incredible. But I think for the moment this will hopefully suffice (unless I come back here and edit this to cry otherwise).


More Answers (2)

Catalytic on 25 March 2024
Edited: 25 March 2024
It's hard to know exactly what's going on in your code, because every variable and function name has the string "measurement" in it, making them difficult to distinguish. Also, including the string "index" in your indices clutters the code and makes it hard to read. Readers can see when a variable is an index by the way it is used; it isn't necessary to label it so.
All that said, if you have a function of a very general form that you want to parallelize, you probably want to use parfor, rather than the GPU.

6 Comments

"...you probably want to use parfor..."
Unfortunately the OP also stated this in the question: "...the result for each element in the vector also depend on the previous results (there is a clear dependency ordering)."
Sorry, to clarify on Stephen23's response - for each measurement, where each measurement is a set of scalars taken at the same time, there is a clear dependency ordering in the scalars. Each measurement is independent of the others.
I guess people are struggling to understand without a legitimate minimal example. I will try to isolate the relevant code and edit my example further so that it's perfectly clear what I need.
Thank you for all the help in the interim.
"Sorry, to clarify on Stephen23's response - for each measurement, where each measurement is a set of scalars taken at the same time, there is a clear dependency ordering in the scalars."
I think that much is clear. All answers given so far appear to recognize that.
"...so I'd like to parallelise along the number of measurements..."
But here, you are also saying that the processing of any two different measurement vectors will be independent of each other. So, in theory, it should be possible to parallelize those with parfor. If this is not the case, it was never clear why you thought this could be parallelized.
Yes, I could parallelise with parfor, that is true - I am not contesting that (and, in fact, I will be doing that until I get a GPU solution)...
However, extrapolating from the single-threaded case, even then it seems to take over an hour per experiment... I often have over 100 million measurements (generated per experiment, which happens once every 15-20 minutes!), and it takes about 6 hours to run 10 million measurements on a single thread.
My lab already has the GPU, and so I am really hoping I can speed it up further with this extra effort.
(I'm still working on a functional minimal example so it is as clear as possible, but thank you for the effort in the meantime!)
If you do parallelize with parfor, it would be interesting to see your scaling on your local machine as you increase the number of workers. If you find that the problem does improve with additional workers, you may wish to see if you have an HPC cluster with MATLAB Parallel Server available to you so you can scale across many machines.
I am working to parfor-ise my code in a different issue, which I will raise soon, once I similarly make a minimum example and tidy up the way I sort it...
Perhaps it is easy to parfor-ise with the example I have written, but considering how simple the calculation is, I expect it will work better with larger chunks fed to each worker. There is a spot earlier in my code where it would be better to parfor-ise, as there is an earlier calculation taking ~5 minutes which would be nice to have parallel too.
At the moment I think I currently have too many loops-inside-of-loops for parfor to work happily. :-( But I will let you know here once I have results to compare (or another issue to address). Getting this working is my highest priority, and I think my boss' too...


If your calculation is truly sequential then by definition you cannot parallelize along the sequence. But it sounds like you can definitely parallelize across measurements: work sequentially down the columns, but compute each element of each row at the same time.

Often, seemingly sequential operations, such as a cumulative sum, are in fact parallelizable, because they can be computed using a reduction tree, where groups of values are reduced with their neighbours and this is repeated until the final result is found. If your algorithm has this structure and it cannot be implemented using the standard vectorized calculations available in MATLAB (like cumsum or accumarray), then you may be left with no alternative but to write your own CUDA code and integrate it into MATLAB with mexcuda.
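For instance, a running sum down each measurement looks sequential but is exactly cumsum, applied to all measurements (rows) at once - a toy illustration of the point above:

```matlab
% A seemingly sequential recurrence, y(k) = y(k-1) + x(k), looped explicitly:
x = rand(1000, 50);              % 1000 measurements, 50 elements each
y_loop = zeros(size(x));
y_loop(:, 1) = x(:, 1);
for k = 2:size(x, 2)
    y_loop(:, k) = y_loop(:, k-1) + x(:, k);
end
% ...is exactly a cumulative sum along dimension 2, for every row at once
% (and works unchanged on gpuArray inputs):
y_vec = cumsum(x, 2);
max(abs(y_loop - y_vec), [], 'all')   % zero up to floating-point round-off
```

Whether the real recurrence admits such a rewrite depends on the details of the operation.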

2 Comments

Hello,
Thank you for your reply. I've edited the original post with attachments of two files that give a concrete minimum working example. I will paste the comment I made elsewhere regarding the most difficult part of this algorithm:
In particular, this inversion procedure requires interpolating the "wc" variable at each newly-solved position in order to solve for the next position in that particular sweep. This dependency is why there is no simple invertible matrix we can use to directly get the positions, and why it must be done algorithmically (actually, there is another polarisation which is effectively the same as setting wc = 0 everywhere, and for that polarisation it is possible to directly solve for the answer by inverting a matrix...)
Nonetheless, all of the manipulations in this function are extremely straightforward (even the interp1 can be changed to something simpler if this function is not acceptable)... And so I'm still praying there is a chance I can get this to work (a lot) faster.
Hello, I implemented the vectorisation across measurements, and it works much faster for both CPU and GPU computation; it can do N = ~1e7 measurements in about 100 seconds. (See the working example attached to my original post.)
If I try 1e8 at once the memory runs out, for both CPU and GPU computations. Nonetheless, the GPU computation is just a shade slower (110 s instead of 100 s), so I'm wondering if there is a way to speed this up?
100 seconds x 10, as required for a full experiment of N = 1e8, is still about 15 minutes, and I was hoping to reduce this to under 5. I just want to check I'm not missing anything else obvious before I suggest more CPUs or something?
Does the interp1 line make a big difference? I could rewrite line 45 in gpuInvert with something that doesn't use interp1 (like a simple weighted average of the closest elements, or even just the closest element, as a small approximation here is not a concern).
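Following the closest-element idea suggested above: when arg_position is a uniformly spaced grid, the interp1 call can be replaced by direct indexing, which stays vectorised over every measurement at once. A sketch only - the names follow the thread (arg_position, arg_wc), the sample data is made up, and the uniform-grid assumption may not hold for the real equilibria:

```matlab
% Sketch: nearest-element lookup replacing interp1, vectorised over all
% measurements at once. Assumes arg_position is a uniformly spaced
% ascending grid shared by the measurements being processed together.
arg_position    = linspace(0, 1, 101);        % example grid (made up)
arg_wc          = sin(2*pi*arg_position);     % example wc profile (made up)
query_positions = rand(1, 1e5);               % one queried position per measurement

dx  = arg_position(2) - arg_position(1);
idx = round((query_positions - arg_position(1)) ./ dx) + 1; % nearest grid index
idx = min(max(idx, 1), numel(arg_position));                % clamp to the grid
wc_at_positions = arg_wc(idx);
% If each measurement carries its own wc table (one row per measurement),
% gather per-row with linear indexing instead:
%   row = 1:size(arg_wc, 1);
%   wc_at_positions = arg_wc(sub2ind(size(arg_wc), row, idx));
```

The commented sub2ind variant is one way to keep all 1e8 measurements in a single pass even when the wc table changes per equilibrium, at the cost of building a per-measurement table first.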


Release: R2022b
Asked: 25 March 2024
Edited: 2 April 2024
