Choose Between spmd, parfor, and parfeval

Communicating Parallel Code
To run computations in parallel, you can use parfor, parfeval, parfevalOnAll, or spmd. Each construct relies on different parallel programming concepts. If you require workers to communicate throughout a computation, use parfeval, parfevalOnAll, or spmd.
Use parfeval or parfevalOnAll if your code can be split into a set of tasks, where each task can depend on the output of other tasks.
Use spmd if you require communication between workers during a computation.
Computations with parfeval are best represented as a graph, similar to a Kanban board with blocking. Generally, results are collected from workers after a computation is complete. You can collect the results of a parfeval operation by using afterEach or afterAll. You typically use the results in further calculations.
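For example, a minimal sketch of this pattern, assuming a parallel pool is already running (the function handles, matrix sizes, and number of tasks here are illustrative only):

% Submit several independent tasks; each future completes on its own schedule.
f(1:4) = parallel.FevalFuture;
for idx = 1:4
    f(idx) = parfeval(@rand,1,500*idx);
end
% Run a callback on each result as it becomes available, keeping one output per task.
meanFuture = afterEach(f,@(X) mean(X,"all"),1);
% Collect the per-task results once all callbacks have run.
meanValues = fetchOutputs(meanFuture);

Here, fetchOutputs blocks only at the point where the combined results are needed.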
Computations with spmd are best represented by a flowchart, similar to a waterfall workflow. Results can be collected from workers during a computation. Sometimes, workers must communicate with other workers before they can finish their computation.
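As a sketch of this kind of workflow, assuming a pool with at least two workers and a release that supports spmdSend, spmdReceive, spmdIndex, and spmdNumWorkers (the data and chain structure are illustrative only), each worker below must wait for its neighbor before it can finish:

spmd
    data = rand(1000);
    if spmdIndex == 1
        % The first worker starts the chain by sending its data onward.
        spmdSend(data,2);
    elseif spmdIndex < spmdNumWorkers
        % Middle workers cannot finish until the previous worker sends its running total.
        total = spmdReceive(spmdIndex-1) + data;
        spmdSend(total,spmdIndex+1);
    else
        % The last worker receives and holds the final running total.
        total = spmdReceive(spmdIndex-1) + data;
    end
end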
If you are unsure, ask yourself the following: within my communicating parallel code, can each computation be completed without any communication between workers? If yes, use parfeval. Otherwise, use spmd.
Synchronous and Asynchronous Work
When choosing between parfor, parfeval, and spmd, consider whether your calculation requires synchronization with the client. parfor and spmd require synchronization, and therefore block you from running any new computations on the MATLAB® client. parfeval does not require synchronization, so the client is free to pursue other work.
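For instance, a minimal sketch of asynchronous work with parfeval, assuming a parallel pool is available (the svd call and matrix sizes are illustrative only):

% parfeval returns a future immediately; the client is not blocked.
f = parfeval(@svd,1,rand(2000));
% The client is free to do other work while the worker computes.
localResult = sum(rand(1000),"all");
% Retrieve the worker's result only when you need it; this call blocks.
singularValues = fetchOutputs(f);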
Compare Performance of Multithreading and ProcessPool
In this example, you compare how fast functions run on the client and on a ProcessPool. Some MATLAB functions make use of multithreading, and tasks that use these functions perform better on multiple threads than on a single thread. Therefore, if you use these functions on a machine with many cores, a local cluster can perform worse than multithreading on the client.
The supporting function clientFasterThanPool, listed at the end of this example, returns true if multiple executions are performed faster on the client than in a parfor-loop. The syntax is the same as for parfeval: use a function handle as the first argument, the number of outputs as the second argument, and then supply all of the required arguments for the function.
First, create a local ProcessPool.
p = parpool('Processes');
Starting parallel pool (parpool) using the 'Processes' profile ... Connected to the parallel pool (number of workers: 6).
Check how fast the eig function runs by using the clientFasterThanPool supporting function. Create an anonymous function with eig to represent your function call.
[~, t_client, t_pool] = clientFasterThanPool(@(N) eig(randn(N)), 0, 500)
t_client = 34.8639
t_pool = 6.9755
The parallel pool computes the answer faster than the client. Divide t_client by maxNumCompThreads to find the time taken per thread on the client.
t_client/maxNumCompThreads
ans = 5.8107
Workers are single-threaded by default. The result indicates that the time taken per thread is similar on both the client and the pool, as the value of t_pool is roughly 1.2 times the value of t_client/maxNumCompThreads. The eig function does not benefit from multithreading.
Next, check how fast the lu function runs by using the clientFasterThanPool supporting function.
[~, t_client, t_pool] = clientFasterThanPool(@(N) lu(randn(N)), 0, 500)
t_client = 1.0447
t_pool = 0.5785
The parallel pool typically computes the answer faster than the client if your local machine has four or more cores. Divide t_client by maxNumCompThreads to find the time taken per thread.
t_client/maxNumCompThreads
ans = 0.1741
This result indicates that the time taken per thread is much lower on the client than on the pool, as the value of t_pool is roughly 3 times the value of t_client/maxNumCompThreads. Each thread does less computational work, which indicates that lu uses multithreading.
Define Helper Function
The supporting function clientFasterThanPool checks whether a computation is faster on the client than on a parallel pool. It takes as input a function handle fcn, the number of outputs numout, and a variable number of input arguments (in1, in2, ...). clientFasterThanPool executes fcn(in1, in2, ...) on both the client and the active parallel pool. For example, if you want to test rand(500), your function handle must take the following form:
fcn = @(N) rand(N);
Then, use clientFasterThanPool(fcn,1,500).
function [result, t_multi, t_single] = clientFasterThanPool(fcn,numout,varargin)
    % Preallocate cell array for outputs
    outputs = cell(1,numout);
    % Client
    tic
    for i = 1:200
        if numout == 0
            fcn(varargin{:});
        else
            [outputs{1:numout}] = fcn(varargin{:});
        end
    end
    t_multi = toc;
    % Parallel pool
    vararginC = parallel.pool.Constant(varargin);
    tic
    parfor i = 1:200
        % Preallocate cell array for outputs
        outputs = cell(1,numout);
        if numout == 0
            fcn(vararginC.Value{:});
        else
            [outputs{1:numout}] = fcn(vararginC.Value{:});
        end
    end
    t_single = toc;
    % If multithreading on the client is quicker, return true
    result = t_single > t_multi;
end
Compare Performance of parfor, parfeval, and spmd
Using spmd can be slower or faster than using parfor-loops or parfeval, depending on the type of computation. Overhead affects the relative performance of parfor-loops, parfeval, and spmd.
For a set of tasks, parfor and parfeval typically perform better than spmd under these conditions:
The computational time taken per task is not deterministic.
The computational time taken per task is not uniform.
The data returned from each task is small.
Use parfeval when:
You want to run computations in the background.
Each task is dependent on other tasks (see the sketch after this list).
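The following minimal sketch shows one way to express a dependent task chain, assuming a parallel pool is available (the function handles and sizes are illustrative only): afterAll queues the second stage to run automatically once the first stage finishes.

% Stage 1: generate data on a worker in the background.
dataFuture = parfeval(@rand,1,1000);
% Stage 2 depends on the output of stage 1; afterAll queues it to run once dataFuture finishes.
sumFuture = afterAll(dataFuture,@(X) sum(X,"all"),1);
% Block only when the final result is needed.
total = fetchOutputs(sumFuture);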
In this example, you examine the speed at which matrix operations can be performed when using a parfor-loop, parfeval, and spmd.
First, create a parallel pool of process workers p.
p = parpool('Processes');
Starting parallel pool (parpool) using the 'Processes' profile ... Connected to the parallel pool (number of workers: 6).
Compute Random Matrices
Examine the speed at which random matrices can be generated by using a parfor-loop, parfeval, and spmd. Set the number of trials (n) and the matrix size (for an m-by-m matrix). Increasing the number of trials improves the statistics used in later analysis, but does not affect the calculation itself.
m = 1000;
n = 20;
Then, use a parfor-loop to execute rand(m) once for each worker. Time each of the n trials.
parforTime = zeros(n,1);
for i = 1:n
    tic;
    mats = cell(1,p.NumWorkers);
    parfor N = 1:p.NumWorkers
        mats{N} = rand(m);
    end
    parforTime(i) = toc;
end
Next, use parfeval to execute rand(m) once for each worker. Time each of the n trials.
parfevalTime = zeros(n,1);
for i = 1:n
    tic;
    f(1:p.NumWorkers) = parallel.FevalFuture;
    for N = 1:p.NumWorkers
        f(N) = parfeval(@rand,1,m);
    end
    mats = fetchOutputs(f,"UniformOutput",false)';
    parfevalTime(i) = toc;
    clear f
end
Finally, use spmd to execute rand(m) once for each worker. For details on workers and how to execute commands on them with spmd, see Run Single Programs on Multiple Data Sets. Time each of the n trials.
spmdTime = zeros(n,1);
for i = 1:n
    tic;
    spmd
        e = rand(m);
    end
    mats = {e{:}};
    spmdTime(i) = toc;
end
Use rmoutliers to remove the outliers from each of the trials. Then, use boxplot to compare the times.
% Hide outliers
boxData = rmoutliers([parforTime parfevalTime spmdTime]);
% Plot data
boxplot(boxData,'labels',{'parfor','parfeval','spmd'},'Symbol','')
ylabel('Time (seconds)')
title('Make n random matrices (m by m)')
Typically, spmd requires more overhead per evaluation than parfor or parfeval. Therefore, in this case, using a parfor-loop or parfeval is more efficient.
Compute Sum of Random Matrices
Next, compute the sum of random matrices. You can do this by using a reduction variable with a parfor-loop, a sum after the computations with parfeval, or spmdPlus with spmd. Again, set the number of trials (n) and the matrix size (for an m-by-m matrix).
m = 1000;
n = 20;
Then, use a parfor-loop to execute rand(m) once for each worker. Compute the sum with a reduction variable. Time each of the n trials.
parforTime = zeros(n,1);
for i = 1:n
    tic;
    result = 0;
    parfor N = 1:p.NumWorkers
        result = result + rand(m);
    end
    parforTime(i) = toc;
end
Next, use parfeval to execute rand(m) once for each worker. Use fetchOutputs on all of the matrices, then use sum. Time each of the n trials.
parfevalTime = zeros(n,1);
for i = 1:n
    tic;
    f(1:p.NumWorkers) = parallel.FevalFuture;
    for N = 1:p.NumWorkers
        f(N) = parfeval(@rand,1,m);
    end
    % Gather the matrices into a cell array, then sum them element-wise
    mats = fetchOutputs(f,"UniformOutput",false);
    result = sum(cat(3,mats{:}),3);
    parfevalTime(i) = toc;
    clear f
end
Finally, use spmd to execute rand(m) once for each worker. Use spmdPlus to sum all of the matrices. To send the result only to the first worker, set the optional target worker argument to 1. Time each of the n trials.
spmdTime = zeros(n,1);
for i = 1:n
    tic;
    spmd
        r = spmdPlus(rand(m),1);
    end
    result = r{1};
    spmdTime(i) = toc;
end
Use rmoutliers to remove the outliers from each of the trials. Then, use boxplot to compare the times.
% Hide outliers
boxData = rmoutliers([parforTime parfevalTime spmdTime]);
% Plot data
boxplot(boxData,'labels',{'parfor','parfeval','spmd'},'Symbol','')
ylabel('Time (seconds)')
title('Sum of n random matrices (m by m)')
For this calculation, spmd is significantly faster than a parfor-loop or parfeval. When you use a reduction variable in a parfor-loop, the result of each iteration of the parfor-loop is broadcast to all of the workers. By contrast, spmd calls spmdPlus only once to perform a global reduction operation, requiring less overhead. As such, the overhead for the reduction part of the calculation is much lower for spmd than for parfor.