
efficient way to process millions of files individually

7 views (last 30 days)
Yuchun Ding on 6 Dec 2018
Commented: Walter Roberson on 6 Dec 2018
I'd like to process 10 million images on my hard drive, generate a measurement for each, and store the results in a matrix. I wonder what's the most efficient way to do this?
Doing it in a for loop is straightforward, but I guess it will take forever to finish...
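For reference, a minimal version of that straightforward loop; the folder, file pattern, and measurement (mean intensity) here are placeholders, not the actual task:

% The straightforward serial loop described above. The folder, file
% pattern, and measurement (mean intensity) are placeholders.
d = dir('C:\images\*.jpg');
m = zeros(numel(d), 1);               % preallocate the results matrix
for k = 1:numel(d)
    img  = imread(fullfile(d(k).folder, d(k).name));
    m(k) = mean(img(:));              % one measurement per image
end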
  2 Comments
John D'Errico on 6 Dec 2018
Edited: John D'Errico on 6 Dec 2018
Big problems take big time, but computers are neither infinitely fast nor infinitely powerful. So the sooner you start the task off, the sooner you will finish.
You can probably gain some throughput with parallel processing, depending on how many cores you have available. But remember that may then leave you with a bottleneck in your disk access speed. So you need to make that as fast as possible too, which suggests you should use SSD storage for those files.
You can tell the ultimate limits of what you can do easily enough.
  1. Measure the time needed to read each file.
  2. Measure the time needed to process that file.
Multiply each by 1e7, then add. That is the fundamental limit you are faced with. So unless you can speed up one or the other of those operations, you can do no better. (A minimal timing sketch is below.)
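A sketch of that timing experiment; the file name 'sample.jpg' and the measurement (mean intensity) are placeholders, so substitute one of your own images and your real processing step:

% Minimal timing sketch of the limit described above.
nFiles = 1e7;                 % total number of images

tic
img = imread('sample.jpg');   % time to read one representative file
tRead = toc;

tic
m = mean(img(:));             % time to process that file (placeholder step)
tProcess = toc;

% Fundamental serial limit: (read time + process time) * 1e7 files
estSeconds = nFiles * (tRead + tProcess);
fprintf('Serial lower bound: about %.1f hours\n', estSeconds/3600);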
Walter Roberson on 6 Dec 2018
Adding the two might not be optimal when the computation is inherently serial, or when it can be parallelized onto (number of cores) minus (number of disk controllers) or fewer cores, especially if it can be done with (cores divided by controllers) minus one or fewer. In such cases you might be able to gain substantially by overlapping reading with computation. I would have to think more about the formulae to work out the total time.
But time per file times number of files divided by number of controllers would be a lower bound on total time.
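To illustrate that lower bound with hypothetical numbers; the per-file time and controller count here are assumptions, not measurements:

% Hypothetical numbers illustrating the lower bound above.
tPerFile     = 0.005;   % seconds to handle one file (assumed)
nFiles       = 1e7;
nControllers = 2;       % files split across two independent controllers

lowerBound = tPerFile * nFiles / nControllers;              % in seconds
fprintf('Lower bound: about %.1f hours\n', lowerBound/3600); % ~6.9 hours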


Answers (1)

Walter Roberson on 6 Dec 2018
Typically the biggest barrier is the speed of reading the files.
For any one drive, it is typically more efficient to read large files than many short files (unless you have a lot of fragmentation and a mediocre controller).
Drive speed and interface speed can make significant differences. Even for SSD drives, quality matters. For non-SSD drives that are not targeted at enterprise use, you would prefer something designed for USB 3.1, with appropriate cables and controllers.
You would prefer to have the files split among multiple drives, preferably on multiple controllers.
You might think of using parallel processing. In the single-drive case that will not help, not unless the calculation to process each file takes longer than reading the file and the calculation is not already being automatically multithreaded.
If you have multiple drives on different controllers, then parallel processing can potentially increase performance, provided each worker is fetching from a different controller; see the sketch below.
Let me emphasise again: if you have only one drive and your processing is I/O bound, then parallel processing tends to make things worse, due to increased contention on the bottleneck resources.
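A minimal parfor sketch of the multi-drive case, assuming the images are already split across two drives on separate controllers; the root folders are hypothetical and the measurement (mean intensity) is a placeholder:

% Minimal parfor sketch for pre-split files. Note that parfor does not
% pin particular workers to particular drives, so this only spreads the
% I/O load approximately.
roots = {'D:\images', 'E:\images'};   % one folder per controller (assumed)
files = {};
for r = 1:numel(roots)
    d = dir(fullfile(roots{r}, '*.jpg'));
    files = [files; fullfile({d.folder}, {d.name})']; %#ok<AGROW>
end

results = zeros(numel(files), 1);     % one measurement per image
parfor k = 1:numel(files)
    img = imread(files{k});
    results(k) = mean(img(:));        % placeholder measurement
end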
