datastore readsize and buffer chunk

Views: 14 (last 30 days)
Peng Li
Peng Li on 12 May 2020
Commented: Peng Li on 18 May 2020
I am working with a single CSV file over 10 GB in size. I'm trying to use mapreduce to perform some analyses. The program works as expected, but I'd like to speed it up a bit by increasing ReadSize.
Currently it only passes about 1,500 rows each time, although the default is 20,000 rows. I have over 500k rows in total, so this creates a lot of passes that significantly slow the process.
After looking into tabularTextDatastore, it seems the buffer chunk size is hardcoded to 32 MB, and that is what limits this.
So my questions are:
1) What is the rationale for MATLAB's 32 MB buffer size?
2) Is there a good way to adjust it? I'd like to decrease the number of passes while increasing the amount of data read per pass (I have about 400 GB of RAM).
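For reference, a minimal sketch of the pattern described above: requesting a larger ReadSize and observing how many rows each read actually returns (the file name here is a placeholder, not the actual data):

```matlab
% Sketch: ask for 20,000 rows per read; the datastore may return fewer
% because each read is also capped by the internal buffer chunk size.
ds = tabularTextDatastore('bigfile.csv', 'ReadSize', 20000);
while hasdata(ds)
    t = read(ds);
    fprintf('read %d rows\n', height(t));  % often well below ReadSize
end
```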
  Comments: 2
Peng Li
Peng Li on 12 May 2020
up :-)
Peng Li
Peng Li on 13 May 2020
Any suggestions/comments would be very appreciated!


Answers (1)

Rashed Mohammed
Rashed Mohammed on 18 May 2020
Hi Peng Li,
The ReadSize name-value pair of tabularTextDatastore specifies the maximum number of rows to read per call. However, it is bounded by a chunk size that depends on the data, so that the datastore can be managed efficiently. In your case, I would suggest looking into partitioning the datastore and reading the data in parallel. Here is a link to go through.
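A minimal sketch of the partition-and-read-in-parallel approach (the file path is a placeholder; this assumes a Parallel Computing Toolbox pool):

```matlab
% Partition the datastore into one piece per pool worker and read each
% partition independently inside a parfor loop.
ds = tabularTextDatastore('bigfile.csv');
n  = numpartitions(ds, gcp);          % one partition per worker
parfor ii = 1:n
    subds = partition(ds, n, ii);     % this worker's independent slice
    while hasdata(subds)
        t = read(subds);
        % ... analyze t here ...
    end
end
```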
Hope this helps!
  Comments: 1
Peng Li
Peng Li on 18 May 2020
Thanks for your answer, Rashed Mohammed!
I've tried datastore partitioning as well, using numlabs. Within each spmd block I used readall. A later processing step after loading the data requires even more memory, I think (using categorical or replace to convert old codings within each column into human-readable phrases). It made my Linux machine unresponsive, and many processes were killed... So I guess I should partition it into more chunks rather than into numlabs pieces? Or within the spmd block should I use read, to let it automatically load whatever size it can handle?
So far I've tried another solution: transforming the datastore. It worked well too. But I'm not sure whether this can be done in parallel, as it seems to me that the transform only happens when I do some processing on the transformed datastore. I basically created a transform function in which each column in the old datastore is "decoded", and after that I used writeall with the parameter 'UseParallel', true. Will this run the datastore transform in parallel as well?
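The transform-then-writeall pattern described above could be sketched roughly as follows. Note that decodeColumns, the status column, and the code values are hypothetical placeholders for the actual "decoding" step; transform is lazy, so the function runs only when data is read or written:

```matlab
% Sketch: lazily transform a datastore, then write it out in parallel.
ds  = tabularTextDatastore('bigfile.csv');
tds = transform(ds, @decodeColumns);    % runs per-chunk at read/write time
writeall(tds, 'outputFolder', 'UseParallel', true);

function t = decodeColumns(t)
    % Illustrative: replace numeric codes with readable categories.
    t.status = categorical(t.status, [0 1], {'inactive', 'active'});
end
```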


Categories

Find more on Big Data Processing in Help Center and File Exchange

Release

R2020a
