Big problem/bug with the new matfile command for partial MAT-file reads/writes: it creates massively bloated files.

Please look at this minimal example:
% create a 1 MB "incompressible" array
one_meg = uint8(rand(1,1000,1000)*256);
% choose a file, delete any old copy, and open it with write access
testfile = 'D:\Data\PGRtest\testfile.mat';
if exist(testfile,'file'), delete(testfile); end
matObj = matfile(testfile,'Writable',true);
% keep a copy of what we write to the file in memory for verification
memcpy = zeros(50,1000,1000,'uint8');
% write the array 50 times to this file
for i = 1:50
    tic
    % store in file and memory in the same format: pages of 1000x1000
    matObj.RawDat(i,1:1000,1:1000) = one_meg;
    memcpy(i,1:1000,1:1000) = one_meg;
    tm = toc;
    % time increases from 45 ms to 250 ms at the last iteration
    fprintf('Iteration %d, time taken: %.0f ms\n',i,tm*1000);
end
% check file size: should be 50 MB or smaller from compression
% the file size is 1200 MB....?
s = dir(testfile);
fprintf('file size: %.0f MB\n', s.bytes/1024/1024);
% load the MAT-file
load(testfile)
% the data inside is 50 MB as expected, nowhere near 1200 MB
whos('RawDat')
% verify (isequal avoids uint8 subtraction, which saturates at 0)
% the read data is equal to the memory copy.. where did all that extra space go?
isequal(memcpy,RawDat)
This is on Windows 7 64-bit with MATLAB R2011b 64-bit.
The problem is mostly described in the comments. Essentially: why does 50 MB of data produce a 1200 MB MAT-file when it is written through a matfile object?
I have tried storing the data with 2 dimensions instead of 3, and I have tried using doubles instead of uint8. I have also tried changing the default MAT-file format from 7.3, although that is the only version that supports partial reads and writes.
I can't understand why each write takes longer and longer. It is as if each write to the file rewrites all the existing data a second time, so the first write is 1 MB, then 2 MB, then 3 MB, and so on, instead of 1 MB each time.
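As a rough check on that guess, here is a small sketch of the arithmetic. It assumes (this is only a hypothesis, not confirmed behaviour of matfile) that write number k rewrites all k pages stored so far; the variable names are made up for illustration.

pageBytes  = 1000*1000;    % one uint8 page, as in the example above
nWrites    = 50;
% Hypothetical cost model: write k rewrites all k pages stored so far.
totalBytes = sum(1:nWrites) * pageBytes;
fprintf('cumulative bytes written: %.0f MB\n', totalBytes/1024/1024);

Under that assumption the total comes to about 1216 MB, which is close to the file size observed.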
I expect 'testfile' to be a MAT-file of less than 50 MB containing a 50x1000x1000 array. What I see is a 1.2 GB file containing that array, which is clearly incorrect.
If the array is saved directly from the workspace using 'save', the MAT-file is 2 MB and contains the same data.
Looks like this is a bug.
Any ideas? Do you get the same results? Thanks, Tom.

Comments: 3

Hi Tom,
That does look strange. I get the same results: the 1.2 GB file from the loop, and a reasonably sized file when the data is saved in one go (also in 7.3 format). I will pass your example to our development team to look into this.
Thanks,
Titus
Thanks. I already submitted it as a ticket to MathWorks with ID 1-FWBIMT, if that helps you.
Tom.
Ten years later and the same problem is still around... My data saved into a v7 MAT-file is around 1 MB, compared to almost 100 MB in a v7.3 file. Loading and saving times are unfortunately proportionally longer as well. Note, however, that the data contained in the file takes up only around 20 MB, so there is roughly a 5-fold increase in size from saving to a v7.3 file. Also note that I don't use the -nocompression flag when saving the file.
Any ideas or suggestions would be highly appreciated...


Accepted Answer

For the same reasons that growing an array in memory is a bad idea, growing an array in a MAT-file is not good programming practice. Your file has been horribly fragmented because of the matrix growth. The full 3-D matrix must occupy one linear segment of the file.
If you preallocate the file variable by adding the line:
matObj.RawDat=memcpy; %preallocate
after creating the memcpy variable then your file size will be reasonable.
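As a variation on the preallocation above, here is a minimal sketch that sizes the variable in the file by assigning its last element, so no full-size copy has to exist in memory first. The temporary file name and the reduced page count are assumptions made to keep the sketch quick; it keeps the question's 50x1000x1000-style layout.

testfile = [tempname '.mat'];               % hypothetical temporary file
matObj = matfile(testfile,'Writable',true);

nPages = 5;                                 % fewer pages than the question
one_meg = uint8(rand(1,1000,1000)*256);

% Assigning the last element creates the full-size variable in the file once.
matObj.RawDat(nPages,1000,1000) = uint8(0);

for i = 1:nPages
    % fills space that already exists, so the variable never grows
    matObj.RawDat(i,1:1000,1:1000) = one_meg;
end

s = dir(testfile);
fprintf('file size: %.1f MB\n', s.bytes/2^20);

If the final number of pages is not known in advance, nPages can be an upper bound, at the cost of unused space in the file.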
If your code is a model of what you want to do, I suggest storing your chunks of data in the cells of a RawData cell array inside your file.
You are also indexing into your array inefficiently, but that does not seem to be causing any performance issues. For MATLAB it would be best if RawDat was (1000,1000,50) in size.
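To illustrate the indexing remark: MATLAB stores arrays in column-major order, so the first subscript varies fastest. The sketch below (variable names are made up for illustration) uses sub2ind to compare where two neighbouring elements of the same page land in each layout.

% Two neighbouring elements of page 3 with the page index last (1000x1000x50)
idxLast  = sub2ind([1000 1000 50], [1 2], [1 1], [3 3]);
% The same two elements with the page index first (50x1000x1000)
idxFirst = sub2ind([50 1000 1000], [3 3], [1 2], [1 1]);

strideLast  = diff(idxLast)     % 1: the page is one contiguous block
strideFirst = diff(idxFirst)    % 50: the page is scattered through the array

With the page index last, each 1000x1000 page is a single contiguous run of elements; with it first, consecutive elements of a page are 50 elements apart.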

Comments: 4

Is there no garbage collection in matfile objects? Releasing the space from the last two iterations would provide enough room for the next iteration. Even if the file were written in a way that avoids corruption when a write ends early, that adds at most one iteration to the count.
Thank you for the suggestion. It does indeed help to define the size before I start, but this is just an example. What I want to do is use this method to save data from an incoming video stream, after processing that makes it unsuitable for video compression. The data would be very large if stored entirely uncompressed, and I cannot allocate the exact size before I start. I suppose the best workaround is to allocate a very large array and assume I will not fill it. It seems odd that you should need to preallocate a MAT-file on disk, though; that would not be the case with an ASCII or binary file. I assumed that adding indices would simply append data to the file. Even if I rearrange it as you suggested, with the sliding dimension at the end and the first two dimensions fixed, the existing data does not need to be overwritten, so why can the data not simply be appended? Creating a 1200 MB file from 50 MB is terribly inefficient, and MATLAB should at the very least warn you that this is occurring. If it needs to do so much rewriting of existing data, why not just start a new, recompacted file? It would take the same amount of time in this case, although it would still be very inefficient.
Did you see my suggestion to use a cell array? A cell array should not need to be preallocated, and each cell is stored separately in the file, so growth will not be an issue.
Yes, sorry. I will give that a try and report how it goes. Either way, it looks like you have come up with a solution!
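The cell-array suggestion could be sketched roughly as follows. This is an untested outline, not the answerer's code: the variable name Frames, the temporary file, and the frame count are all assumptions. It indexes the cell array with parentheses and gives both subscripts, since matfile does not accept linear indexing or indexing into the contents of a cell.

testfile = [tempname '.mat'];               % hypothetical temporary file
matObj = matfile(testfile,'Writable',true);

nFrames = 5;                                % small count to keep the sketch quick
for i = 1:nFrames
    frame = uint8(rand(1000,1000)*256);     % stand-in for a processed video frame
    matObj.Frames(1,i) = {frame};           % one cell per frame
end

[~, n] = size(matObj,'Frames');             % number of frames stored so far
c = matObj.Frames(1,n);                     % returns a 1x1 cell
lastFrame = c{1};

Reading back returns a cell, so the frame itself is extracted with braces after the read.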


More Answers (1)

Keep in mind that save defaults to -v7, which is compressed, but matfile uses -v7.3. Those are HDF5 files, which appear not to be compressed the way MATLAB uses them (though that may have changed since -v7.3 files were first introduced).
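One way to check this on a given release is to save the same variable in both formats and compare the file sizes. A minimal sketch, with made-up variable names and temporary files; the result will depend on the MATLAB version and on the data:

A = zeros(1000,1000,'uint8');               % highly compressible test data
f7  = [tempname '.mat'];                    % hypothetical temporary files
f73 = [tempname '.mat'];

save(f7,  'A', '-v7');
save(f73, 'A', '-v7.3');

d7  = dir(f7);
d73 = dir(f73);
fprintf('-v7: %d bytes, -v7.3: %d bytes\n', d7.bytes, d73.bytes);

delete(f7);
delete(f73);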

Comments: 3

Yes, I thought of this possibility too. I have tried changing the MAT-file format to all the available options, both by changing the default file format in the MATLAB preferences and by specifying the format directly when saving the MAT-file. Sadly, it makes no difference. Thanks.
Yes, which is why it makes no difference.



Asked: 27 Nov 2011

Commented: 20 Apr 2021
