multiple for loops split data
Hey guys,
My function is currently very slow, both because of the sheer amount of data and because it only uses one thread.
Since I have a multicore processor (Ryzen 5 3600, 6 cores / 12 threads), I want to make use of it by splitting my data, running the same function on each part, and putting the results back together.
I have found the spmd and parfor commands.
The rough steps I want to perform:
- split the data (tables) n times
- give each worker its share of the split data plus the raw data (which I need for the function)
- run a function that modifies the split data on each worker
- put all the split data back together
I am also limited to functions available in MATLAB R2015b.
How can I do that? Can you please help me?
This is what I tried:
workers = 12;
divider = ceil(specs.numberOfRows/workers);
split1 = data((data.ID <= divider),:);
split2 = data((data.ID > divider) & (data.ID <= divider*2),:);
split3 = data((data.ID > divider*2) & (data.ID <= divider*3),:);
split4 = data((data.ID > divider*3) & (data.ID <= divider*4),:);
split5 = data((data.ID > divider*4) & (data.ID <= divider*5),:);
split6 = data((data.ID > divider*5) & (data.ID <= divider*6),:);
split7 = data((data.ID > divider*6) & (data.ID <= divider*7),:);
split8 = data((data.ID > divider*7) & (data.ID <= divider*8),:);
split9 = data((data.ID > divider*8) & (data.ID <= divider*9),:);
split10 = data((data.ID > divider*9) & (data.ID <= divider*10),:);
split11 = data((data.ID > divider*10) & (data.ID <= divider*11),:);
split12 = data((data.ID > divider*11) & (data.ID <= specs.numberOfRows),:);
dataset_array={split1, split2,split3,split4,split5,split6,split7,split8,split9,split10,split11,split12};
parfor i=1:12
newDataset_array(i) = myFunction(dataset_array(i),data);
end
for i = 1:1:12
newData = [newData;newDataset_array(i)]
end
Thanks in advance
Comments: 11
Jakob B. Nielsen
15 Jan 2020
Edited: Jakob B. Nielsen
15 Jan 2020
I think parfor only runs on parallel cores/workers if you have the Parallel Computing Toolbox... I assume you have that? Can you give a little more info on what your issue is?
dpb
15 Jan 2020
" i dont quite understand how to use parfor the optimal way"
Read the introductory documentation and study the examples carefully, then.
Can't really comment on the parfor bit as I don't have the parallel toolbox. As far as I know, your parfor code probably works as you want, but it's not clear why you're passing both a portion of data (as dataset_array(i)) and the whole of data.
With regard to your code: numbered variables are always a bad idea, even temporary ones. For a start, they force you to needlessly repeat the same code several times (witness all your splitx = ... lines).
At the very least you should use a loop:
workers = 12;
divider = ceil(specs.numberOfRows/workers);
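% so much simpler than numbered variables
dataset_array = cell(1, workers);   % one cell per worker (numel(workers) would be 1)
for idx = 1:workers
    % parentheses matter: the lower bound is divider*(idx-1), not divider*idx-1
    dataset_array{idx} = data((data.ID > divider*(idx-1)) & (data.ID <= divider*idx), :);
end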
Probably better:
workers = 12;
destination = discretize(data.ID, workers) ; %split ID into workers bins
dataset_array = cell(1, workers);   % one cell per worker (numel(workers) would be 1)
for idx = 1:workers
dataset_array{idx} = data(destination == idx, :);
end
or:
workers = 12;
destination = discretize(data.ID, workers) ; %split ID into workers bins
dataset_array = splitapply(@(rows) {data(rows, :)}, (1:height(data))', destination);
15 lines of code down to 3! And if you want to change the number of workers, you just have one line to edit instead of lots of copy/paste or deletions required.
Most likely, your myFunction takes a table as input, not a 1x1 cell array of table, in which case your parfor should be:
newDataset_array = cell(size(dataset_array));
parfor i=1:numel(dataset_array) %don't hardcode values
newDataset_array{i} = myFunction(dataset_array{i}); %Use {} indexing to get the content of the cell
end
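For the last step of the question (putting the pieces back together), the results can be concatenated in a single call instead of growing newData inside a loop. The following is only a minimal sketch under stated assumptions: the toy table and the processPart handle are hypothetical stand-ins for the real data and for myFunction, and the function is assumed to return a table with the same variables for every part. Bin edges are used with discretize so that it does not depend on the number-of-bins option.

```matlab
% Toy stand-in for the real table (hypothetical data)
data = table((1:24)', mod((1:24)', 5), 'VariableNames', {'ID', 'Value'});
workers = 4;

% workers + 1 edges give exactly workers bins
edges = linspace(min(data.ID), max(data.ID), workers + 1);
destination = discretize(data.ID, edges);

% Split the table into one part per bin
dataset_array = cell(1, workers);
for idx = 1:workers
    dataset_array{idx} = data(destination == idx, :);
end

% Hypothetical stand-in for myFunction: keeps rows with Value >= 2
processPart = @(part) part(part.Value >= 2, :);

% Process each part; feval avoids the handle call being read as indexing
newDataset_array = cell(size(dataset_array));
parfor i = 1:numel(dataset_array)
    newDataset_array{i} = feval(processPart, dataset_array{i});
end

% Reassemble all parts in one call (tables must share variable names)
newData = vertcat(newDataset_array{:});
```

Without the Parallel Computing Toolbox the parfor loop simply runs serially, so the sketch can be checked on a small table before it is applied to the real data.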
Owner5566
15 Jan 2020
Owner5566
15 Jan 2020
Guillaume
15 Jan 2020
It's not in the release notes, but it appears that the number of bins option was added in R2016b.
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers))
should work for you.
Owner5566
15 Jan 2020
Guillaume
15 Jan 2020
Oh, of course, N edges == (N-1) bins. Use
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers + 1));
Owner5566
15 Jan 2020
Guillaume
15 Jan 2020
Now I just need a way to make the big data available to all workers.
The way I do it now, every worker receives it as a function argument, which leads to a lot of memory use.
Can't I make it available to all of them at once?
I need it for filtering inside the function.
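One possible approach, sketched here under assumptions rather than taken from the thread, is parallel.pool.Constant from the Parallel Computing Toolbox. As far as I know it was introduced in R2015b, but that is worth confirming against the release in use. It copies the value to each worker once and reuses it across later parfor loops. Workers are separate processes, so each one still holds its own copy; the saving is in repeated transfers, not in total memory. If memory is the real limit, passing only the columns or rows needed for filtering is likely to help more. The table and the filter below are hypothetical.

```matlab
% Hypothetical large table that every worker needs for filtering
data = table((1:1000)', mod((1:1000)', 7), 'VariableNames', {'ID', 'Value'});

% Wrap it once; assumes the Parallel Computing Toolbox and a parallel pool
sharedData = parallel.pool.Constant(data);

counts = zeros(1, 4);
parfor k = 1:4
    full = sharedData.Value;            % worker-local copy of the table
    counts(k) = sum(full.Value == k);   % example filter against the full data
end
```

In the real code, sharedData would be passed to myFunction in place of data, and the function would read sharedData.Value where it currently uses the full table.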