Processing Tall Arrays Taking Too Long

I have a tall array with about 1.7 billion rows of data and 14 columns. I want to process this data the same way several examples (with the airline data) do. I am just trying to extract one column and find its mean. My code is something like:
ds = datastore('some-file.csv');
tt = tall(ds); %Mx14 tall table (M should be about 1.7 billion)
a = tt.V; %Mx1 tall double %(M should be the same as above)
m = mean(a); %1x1 tall double
gather_m = gather(m);
The gather step is taking far too long; I haven't seen it complete at all. In the examples I have seen, this step completes in a few seconds. Eventually I want to make calculations and plots, but I want to get this simple step working first. Can anyone recognize the problem and recommend a solution? I have a parallel pool turned on with two workers.
Thank you very much.

2 Comments

Walter Roberson on 1 Nov 2017
I had not realized that tall arrays used parallel if available, but I see that they do: https://www.mathworks.com/help/distcomp/run-tall-arrays-on-a-parallel-pool.html
Avinash Rajendra on 1 Nov 2017
They do, but it still takes far too long to run. I feel like I'd be in good shape if the runtime were more manageable.

Sign in to comment.

Answers (1)

Kojiro Saito on 2 Nov 2017


It may speed up if you configure the read size of the datastore. You can check the default read size with
ds.ReadSize
This is the amount of data MATLAB reads from the file at one time. Setting it higher than the default will reduce file I/O. Please add a ds.ReadSize setting, for example:
ds = datastore('some-file.csv');
ds.ReadSize = 100000; % Or higher
tt = tall(ds); %Mx14 tall table (M should be about 1.7 billion)
a = tt.V; %Mx1 tall double (M should be the same as above)
m = mean(a); %1x1 tall double
gather_m = gather(m);
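If that alone is not enough: since you only need one of the 14 columns, it may also be worth restricting the datastore to that column with the SelectedVariableNames property, so MATLAB does not parse the other 13 columns on every pass. A sketch (assuming the column is really named V in the file's header):

```matlab
% Read only the one column we need, assuming it is named 'V' in the CSV header.
ds = datastore('some-file.csv');
ds.SelectedVariableNames = {'V'};  % parse 1 of 14 columns -> much less I/O and text parsing
ds.ReadSize = 100000;              % or higher
tt = tall(ds);                     % Mx1 tall table
a = tt.V;                          % Mx1 tall double
m = mean(a);
gather_m = gather(m);
```

Also, when you later compute several statistics, request them in a single gather call, e.g. [m, s] = gather(mean(a), std(a)), so the tall framework can evaluate them in one pass over the data instead of one pass per result.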
Hope this helps.

4 Comments

Avinash Rajendra on 6 Nov 2017 (edited 6 Nov 2017)
Thanks for the answer. ds.ReadSize was originally 20,000, and I set it to 1,000,000,000, but unfortunately the gather step was still taking too long. Is there another way to make this run faster? I feel like there must be a solution, because datastores and tall arrays are built for big data with millions or billions of points.
Update: I changed the data to a set of CSV files that together contain about 170 million rows, just to see whether the gather worked for a smaller amount of data. With a ds.ReadSize of 100,000,000, it completes in 12.633 minutes. Does anyone know a way to speed this up, or a workaround that bypasses the issue? This kind of runtime will not work for the program I am developing.
minomi on 1 Aug 2018
Hi Avinash, did you find a solution to your problem? I'd be interested in the answer if you did.
Avinash Rajendra on 1 Aug 2018
No, I didn't get a satisfactory answer to this. I ended up switching to Python and Spark to get what I wanted.
I came from Python and R... same struggle. So I'm now trying MATLAB... If you managed to fix this, please share some secrets. Thanks.

Sign in to comment.


Asked: 1 Nov 2017
Last comment: 7 Apr 2021
