Best Practice to Distribute Data to Workers?
Hi,
Is there a known best practice for distributing data from the client to workers efficiently, in terms of time (and space)? Suppose we have a large matrix A on the client and want to distribute it to the workers by columns. Suppose also that:
- A is the result of some complicated operations, so we can't generate its columns (or rows) in parallel on each worker
- A fits in memory (no datastore is needed)
What would be the best way to distribute A to the workers?
I made the following comparison:
n = 512;
n_workers = 25;
A = rand(n^2, n); % generate synthesized data A
% method 1: distributed
tic;
A_dist = distributed(A);
t1=toc;
fprintf("t1 = %7.4e\n", t1)
clear A_dist
% method 2: Composite -> distributed
tic;
A_dist = Composite();
chunk_size = ceil(n/n_workers);
for i = 1 : n_workers-1
A_dist{i} = A(:,chunk_size*(i-1)+1:chunk_size*i);
end
A_dist{n_workers} = A(:,chunk_size*(n_workers-1)+1:end);
A_dist = distributed(A_dist, 2);
t2=toc;
fprintf("t2 = %7.4e\n", t2)
clear A_dist
% method 3: spmd + codistributed
tic;
spmd
A_dist = codistributed(A, codistributor('1d', 2));
end
t3=toc;
fprintf("t3 = %7.4e\n", t3)
clear A_dist
I observe that method 2 is always faster than method 1, and both are significantly faster than method 3. Typical output (the ranking and the gaps are quite robust):
t1 = 3.0949e+00
t2 = 2.2290e+00
t3 = 1.7517e+01
Is there any better way than my method 2?
Besides, I am wondering about the mirror question: what would be the best practice for gathering data from the workers to the client? Basically, it should be the inverse of my code: recovering a (large) matrix A from the distributed array A_dist.
Comments: 4
Walter Roberson
8 Dec 2021
Edited: Walter Roberson
8 Dec 2021
tic
parfor i = 1 : n
A_dist = A(:,i);
end
t4 = toc;
fprintf("t4 = %7.4e\n", t4);
clear A_dist
Oh, I should make a remark about the absolute timings: I was using my 2013-era desktop that only has 4 cores, so I had reduced n_workers to 4.
Edric Ellis
8 Dec 2021
parfor is probably fastest since it can send slices of data to multiple workers simultaneously. Unfortunately, using parfor is not useful for creating a distributed array since you don't have control over where the data ends up. (Ideally the distributed constructor would do this too, but I think the current implementation doesn't).
Answers (1)
Shubham
17 Apr 2024
Hi Shumao,
Your approach to distributing and gathering data in a parallel computing environment, specifically with MATLAB's Parallel Computing Toolbox, is thoughtful and demonstrates a keen understanding of the challenges involved in optimizing for time and space efficiency. Here are some insights and suggestions based on your current methods and questions:
Analysis of Your Methods
- Method 1 (distributed): Converts the entire matrix A into a distributed array in one step. This method is straightforward but might not be the most efficient, especially if there is overhead in distributing the data across the network or if the distribution does not optimally utilize the workers' memory.
- Method 2 (Composite -> distributed): Manually splits the matrix into chunks and then distributes these chunks to the workers. This method offers more control over the distribution process, potentially reducing overhead and allowing for optimizations based on the network architecture or the workers' memory constraints.
- Method 3 (spmd + codistributed): Uses an spmd block and codistributed arrays to distribute the data. Because A is a client variable referenced inside the spmd block, the whole matrix is sent to every worker before each worker keeps only its own columns, which likely accounts for most of the extra time. The spmd setup and synchronization barriers add further overhead.
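Method 2 as posted hard-codes n_workers, which only works when the pool has exactly that many workers. The following is a minimal sketch of the same Composite approach that reads the worker count from the pool instead. The variable names (pool, nw, edges, C) are illustrative, n is reduced from the question's 512 so the sketch runs quickly, and it assumes the pool has no more workers than A has columns.

```matlab
% Sketch: split A by columns across however many workers the pool has.
n = 64;                              % smaller than the question's 512
A = rand(n^2, n);                    % synthetic data, as in the question

pool = gcp;                          % current pool (starts one if allowed)
nw = pool.NumWorkers;

% Column boundaries: worker w gets columns edges(w)+1 : edges(w+1).
% Assumes n >= nw so that no worker receives an empty chunk.
edges = round(linspace(0, n, nw + 1));

C = Composite();                     % one entry per worker in the pool
for w = 1:nw
    C{w} = A(:, edges(w)+1 : edges(w+1));
end

% Assemble the per-worker chunks into one array distributed along columns.
A_dist = distributed(C, 2);
```

Only the chunk meant for each worker is sent to it, which is the property that makes Method 2 cheaper than Method 3.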
Suggestions for Optimization
- When distributing data, the network's bandwidth and latency can significantly impact efficiency. Compressing the data before sending it and then decompressing it on the workers might save time if the compression and decompression times are less than the saved network transfer time.
- If possible, convert your data to more efficient data types. For example, if your data allows for it, using single precision instead of double precision can halve the amount of data that needs to be transferred.
- Ensure that your parallel pool is optimally configured for your specific computational environment. The overhead of starting and stopping the parallel pool can be significant, especially for short tasks.
- Continue to benchmark different methods as you have done. MATLAB's profiler can also help identify bottlenecks in the distribution and gathering processes.
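As a rough illustration of the data-type and benchmarking suggestions, the sketch below uses ticBytes and tocBytes to compare how much data goes to the workers for a double matrix and for its single-precision copy. It assumes a release that has ticBytes/tocBytes (R2016b or later) and that the transfer done by distributed is counted by them. The variable names are illustrative, and n is smaller than in the question.

```matlab
% Sketch: compare bytes sent to workers for double vs. single data.
n = 64;
A = rand(n^2, n);                    % double precision, 8 bytes per element
A_single = single(A);                % single precision, 4 bytes per element

pool = gcp;

ticBytes(pool);
D_double = distributed(A);
bytes_double = tocBytes(pool);       % one row per worker: [sent, received]

ticBytes(pool);
D_single = distributed(A_single);
bytes_single = tocBytes(pool);

sent_double = sum(bytes_double(:, 1));
sent_single = sum(bytes_single(:, 1));
fprintf("bytes sent to workers: double = %d, single = %d\n", ...
    sent_double, sent_single);

% Converting to single loses precision; check that the error is acceptable.
max_err = max(abs(A(:) - double(A_single(:))));
fprintf("max rounding error after conversion = %g\n", max_err);
```

Whether the rounding error is tolerable depends on the downstream computation, so the conversion is worth it only when the algorithm does not need full double precision.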
Gathering Data from Workers
For gathering data from workers back to the client, the inverse process of distribution can be applied. However, efficiency in gathering depends on reducing the communication overhead and properly managing memory. Here are some strategies:
- Use gather to collect distributed arrays back to the client. This is the direct inverse of distributing data and works well if the data was evenly distributed among the workers.
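- If you manually distributed chunks of data, you can also gather them manually by reading the pieces from a Composite object on the client. The labSend and labReceive functions (renamed spmdSend and spmdReceive in R2022b) move data between workers inside an spmd block, so they can be used to collect chunks onto one worker first, but they do not send data to the client. This method allows for custom optimizations, similar to your distribution strategy.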
- Just as with distribution, consider the data size and type when gathering. If possible, compress data on workers before sending it back to the client.
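The first two gathering strategies above can be sketched as follows. This is an illustration under assumptions rather than a tuned implementation: the names part, dim, and pieces are made up for the example, and the manual route assumes the array is distributed along a single dimension (the 1-D scheme), which it reads from the codistributor instead of guessing.

```matlab
% Sketch: two ways to bring a distributed array back to the client.
n = 64;
A = rand(n^2, n);
pool = gcp;
A_dist = distributed(A);

% Option 1: a single gather call.
A_back = gather(A_dist);

% Option 2: read each worker's local part through a Composite.
spmd
    part = getLocalPart(A_dist);     % inside spmd, A_dist is codistributed
    codist = getCodistributor(A_dist);
    dim = codist.Dimension;          % dimension the array is split along
end

pieces = cell(1, pool.NumWorkers);
for w = 1:pool.NumWorkers
    pieces{w} = part{w};             % transfers worker w's chunk to the client
end
A_manual = cat(dim{1}, pieces{:});   % reassemble in worker order
```

Timing both options with tic and toc, as in the question, would show whether the manual route is worth the extra code in a given environment.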
Conclusion
Your Method 2 shows a promising approach by manually controlling the data distribution, which can be more efficient in certain environments. For gathering, applying a similar level of control and optimization can yield the best results. Always consider the specific characteristics of your computational environment and the nature of your data when choosing or designing a distribution/gathering strategy.