Slicing matrices to fit in memory

조회 수: 2 (최근 30일)
Yi-xiao Liu
Yi-xiao Liu 2022년 6월 17일
답변: Avni Agrawal 2023년 10월 26일
I have a custom vectorized function whose space (and time) complexity is O(n^2). The problem is that n is very large, ~1e6, and the function cannot run in memory. Even storing the output is difficult. Fortunately, the output is very sparse, containing only ~5% non-zero elements. So I am thinking of slicing the matrix into smaller chunks that fits in memory, and store the output in a sparse format. This is what I got:
n=1e6+1e4-1;
v=rand(n,1);
slice=1e4;
n_slice=floor(n/slice);
slice=repmat(slice,n_slice,1);
slice(1)=n-sum(slice)+slice(1);
v=mat2cell(v,slice);
out=cell(n_slice);
for i=1:n_slice
for ii=1:n_slice
if ii<i
out{i,ii}=logical(sparse(size(v{i},1),size(v{ii},1)));
else
out{i,ii}=sparse(myfun(v{i},v{ii}));
end
end
end
out=cell2mat(out);
out=out|tril(out',-1);
function out=myfun(v1,v2)
%just an example
out=abs(v1-v2')<1e-6;
end
My question is:
  1. what is the ideal size of each slice? is it just the largest that can fit into RAM, or is there other considerations?
  2. The conversion of full matrix to sparse (line 15) seems to have a big overhead, any way to improve this?
  3. In the current code the size of the real slice can vary by as much as a factor of 4 (when the remainder of n/slice is slice-1). Is there any way to get more consistent slices given an arbitrary n?
I would also appericate if you points out any other direction/room for optimization.
Thanks!
  댓글 수: 2
Jonas
Jonas 2022년 6월 18일
if you work with large data sets you may want to have a look into Tall Arrays
Yi-xiao Liu
Yi-xiao Liu 2022년 6월 18일
@Jonas The output is used in subsequent operations so sparse has performance advantages over tall as well.

댓글을 달려면 로그인하십시오.

답변 (1개)

Avni Agrawal
Avni Agrawal 2023년 10월 26일
Your approach to handling large data with memory constraints by slicing the data into smaller chunks and storing the output in a sparse format is a good one. Here are some suggestions to your questions:
1. I understand that you are trying to figure out the ideal slice size which is indeed the largest size that can fit into your RAM. However, you also need to consider the overhead of other programs running on your system. So, you might want to leave some room for them. You can use the `memory` function in MATLAB to get an idea of how much memory is available.
2. You are right that converting a full matrix to a sparse one can be computationally expensive. One way to reduce this overhead is to generate sparse matrices directly in your function `myfun`, if possible. This way, you can avoid creating dense matrices and then converting them to sparse.
3. To get more consistent slices, you could adjust your slicing strategy. Instead of making the first slice larger to accommodate the remainder, you could distribute the remainder across all slices. Here's an example:
remainder = mod(n,slice);
slice = repmat(slice,n_slice,1) + [ones(remainder,1); zeros(n_slice-remainder,1)];
This will add 1 to the size of the first `remainder` slices, making all slices more consistent in size.
Few other optimization suggestions:
You can avoid the `if ii<i` condition by adjusting your loops. You can start the inner loop from `i` instead of 1.
- You are creating a lower triangular matrix and then using `tril` to mirror it to the upper triangle. This is unnecessary if your function `myfun` is symmetric, i.e., `myfun(a,b) = myfun(b,a)`. In that case, you can calculate both `out{i,ii}` and `out{ii,i}` in the same iteration.
Here's your code with these suggestions applied:
n=1e6+1e4-1;
v=rand(n,1);
slice=1e4;
n_slice=floor(n/slice);
remainder = mod(n,slice);
slice=repmat(slice,n_slice,1) + [ones(remainder,1); zeros(n_slice-remainder,1)];
v=mat2cell(v,slice);
out=cell(n_slice);
for i=1:n_slice
for ii=i:n_slice
out{i,ii}=sparse(myfun(v{i},v{ii}));
if ii>i
out{ii,i}=out{i,ii}';
end
end
end
out=cell2mat(out);
Lastly, you could consider using a machine with more memory or distributed computing techniques if your data is too large to fit into memory even with these optimizations.
Take a look at these documentations for better understanding:

카테고리

Help CenterFile Exchange에서 Data Distribution Plots에 대해 자세히 알아보기

태그

제품


릴리스

R2019b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by