Dear Matlab community,

I want to read specific parts from a large (> 20 GB) binary file. However, the command

tmpString=fread(fid,[1,16],'char=>char');

fails with "Out of memory." The command is applied very near the beginning of the file (offset is 20 bytes). Why do I get this error, and how can I successfully read in my file?

Thank you for your suggestions,
Ed

"However what I could think of is that Matlab tries to guess the encoding"

I've had discussions with Mathworks support about this. The whole process is unfortunately not properly documented, which I told them can be a problem.

Indeed, if you open a file without specifying a character encoding, matlab will try to guess the file encoding the first time you either:

- use any character reading function such as fgetl, fgets, fscanf, etc.
- use fread with a 'char' or '*char' precision
- ask for the encoding with the multi-output version of fopen.

I haven't been given the full process of character set detection, but it does read the whole file, which can indeed be an issue for large files. If any byte sequence in the file is not a valid UTF8 code point, then the algorithm uses some heuristics to see if it's a CJK encoding and, if it still doesn't match, it assumes the local encoding.

To prevent this autodetection from taking place, you have to specify an encoding when you fopen the file. If you don't know what the encoding is for your binary file, I'd suggest using 'US-ASCII'. As we mentioned in the comments, it's unlikely that a binary file uses UTF8 unless it prefixes the text by a length.

Unfortunately, it's not easy to go back to the pre-R2020a behaviour of automatically using the native encoding, whatever it is, as R2020a has lost the ability to easily get the local encoding. On the other hand, relying on native encoding when reading a binary file is asking for trouble.

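A minimal sketch of the advice to pass an encoding to fopen, assuming a 16-character ASCII field at offset 20. The temporary file, its contents and the variable names are invented for illustration; only the four-argument form of fopen (filename, permission, machine format, encoding) is the point here, and 'n' simply keeps the native byte order.

```matlab
% Build a small stand-in file: a 20-byte header followed by a 16-byte text field.
% (Hypothetical layout, used only so the example is self-contained.)
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, zeros(1, 20, 'uint8'), 'uint8');
fwrite(fid, [uint8('27119158048'), repmat(uint8(32), 1, 5)], 'uint8');
fclose(fid);

% Open with an explicit encoding so that no character set detection is needed.
fid = fopen(fname, 'r', 'n', 'US-ASCII');
fseek(fid, 20, 'bof');                          % jump to the text field
tmpString = fread(fid, [1, 16], 'char=>char');  % same call as in the question
fclose(fid);
delete(fname);

disp(['[' tmpString ']']);
```

For a real file, only the fopen line needs to change; the existing fread calls can stay as they are.
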
Why do I get "Out of memory." when reading only 16 chars?

Walter Roberson on 25 Mar 2020

That is odd.

I wonder if it is trying to buffer the entire file?

Ohhh... Are you sure you want 20 characters and not 20 bytes? From R2020a, if you fopen without specifying an encoding, and fread without specifying an encoding, then char for fread means utf8 decoded characters, which could be up to 4 bytes per character.

Ed Frank on 25 Mar 2020

I am quite sure my 20 characters are right as with a smaller file, everything works out just fine. But for the record: How did this work before R2020a?

Walter Roberson on 25 Mar 2020

Again, are you sure you want 20 utf8 characters and not 20 bytes such as uint8=>char?

Ed Frank on 25 Mar 2020

Yes. But does this affect my issue?

If it tries to buffer the entire file, how do I stop this?

Guillaume on 25 Mar 2020

Which OS is this on? and which version of matlab?

"How did this work before R2020a?"

If not specified, matlab used your native encoding as set by your OS. Now it uses utf8 by default, which is much better (IMO).

"But does this affect my issue?"

Unlikely. But then as Walter said, what you're seeing is unexpected so who knows.

Ed Frank on 25 Mar 2020

I experience this issue in Matlab R2020a on Windows 7.

If this is unexpected behaviour, maybe I just do this in an unfamiliar way? How would you read in parts of files like this?

Walter Roberson on 25 Mar 2020

I would experiment with fread uint8 for 80 bytes (the maximum valid to encode 20 utf8 characters) and use native2unicode and take the first 20 characters of the result.

It is really uncommon to specify utf8 for something identified as a binary file because binary files need to be particular about exact field widths. The exception would be for binary files in which fields are preceded by byte counts, in which case you would read the count and read that many uint8 and native2unicode that.
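
The length-prefixed case described above could be sketched as follows. The file layout (a uint16 byte count followed by UTF-8 text), the sample string and all variable names are assumptions made up for this example; a real format may use a different count type or encoding.

```matlab
% Write a hypothetical length-prefixed field: uint16 byte count, then UTF-8 bytes.
sampleText = char([71 114 252 223 101]);         % 5 characters, two of them non-ASCII
payload = unicode2native(sampleText, 'UTF-8');   % 7 bytes once encoded
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, numel(payload), 'uint16');
fwrite(fid, payload, 'uint8');
fclose(fid);

% Read the count, read exactly that many bytes, then decode them.
fid = fopen(fname, 'r', 'n', 'US-ASCII');
nBytes = fread(fid, 1, 'uint16');
rawBytes = fread(fid, [1, nBytes], '*uint8');
fclose(fid);
delete(fname);

textField = native2unicode(rawBytes, 'UTF-8');
fprintf('%d bytes decoded to %d characters\n', nBytes, numel(textField));
```

Because the read is sized in bytes, the file position after it is known exactly, whatever the number of decoded characters turns out to be.
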

Guillaume on 25 Mar 2020

Edited: Guillaume on 25 Mar 2020

Showing us the code before that line would help (from the point the file is open).

Walter may be onto something. Maybe the out of memory error comes from the unicode library parsing if it's trying to decode text that is not utf8 as utf8. Do you know what the character encoding of your binary file is?

Do you get an out of memory error if you read 16 bytes, 32 bytes or 80 bytes as suggested by Walter, i.e. if you replace the above line by:

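% Remember the current position so each test read starts at the same offset.
curloc = ftell(fid);
fread(fid, 16, '*uint8');
disp('16 bytes read successfully');
fseek(fid, curloc, 'bof');   % rewind to the saved position
fread(fid, 32, '*uint8');
disp('32 bytes read successfully');
fseek(fid, curloc, 'bof');
fread(fid, 80, '*uint8');
disp('80 bytes read successfully');
keyboard;   % pause here to inspect the workspace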

Ed Frank on 25 Mar 2020

Before, there are several fread calls with the data types short and long. They are spread over many lines and sometimes in loops, which is why I didn't include them.

Your test code really works! So it's possibly an encoding issue. I will investigate this but first, thank you both for pointing this out!

Guillaume on 25 Mar 2020

"Your test code really works"

Assuming that there's nothing confidential in there, can you post the full output of the 80 bytes read? If there's a bug with unicode decoding, I'd like to let mathworks know:

So, replace the offending line by

bytes = fread(fid, 80, '*uint8');
fprintf('%d ', bytes);
fprintf('\n');

and post/attach the content of bytes.

Ed Frank on 25 Mar 2020

Edited: Ed Frank on 25 Mar 2020

In the tested file, this section of the file which should be a "comment field" seems to be filled with rather nonsense ASCII numbers and spaces. Nothing spectacular:

>> fprintf('%d ', bytes);
50 55 49 49 57 49 53 56 48 52 56 0 32 0 32 0 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 0 32 1 >> 
>> fprintf('%c ', bytes);
2 7 1 1 9 1 5 8 0 4 8                                                                                                                     >> 
>> 

I've kept the >> so you can see the spaces.

It's not the 0 and 1 causing the error, I get the same behaviour trying with a char vector of length 2, which even in the worst case should only be 8 bytes.

Walter Roberson on 25 Mar 2020

I just tested with that sequence of bytes on R2020a on MacOS High Sierra and encountered no difficulty. Do you encounter the same problem if you write just those bytes to a file?

Guillaume on 26 Mar 2020

No issue for me either on Win10 with just that sequence. Can the whole file be made available somewhere?

Whatever the encoding, the fact that there are 0s intermixed with non-zeros in the first 16 bytes would indicate that it's not just text stored at that offset.

Ed Frank on 26 Mar 2020

@Walter: No, the problem does not occur in a file that contains only the mentioned bytes.

@Guillaume: No, unfortunately I cannot make the file available as this would violate corporate secrets.

I have solved this in a different way. However, what I could think of is that Matlab tries to guess the encoding (which I didn't declare at fopen) and for this tries to buffer a large part of the file (or even the entire file). Could this be the reason for this strange behaviour?

Guillaume on 26 Mar 2020

"However what I could think of is that Matlab tries to guess the encoding"

No, as Walter mentioned, as of R2020a, if you don't specify the encoding, it's UTF8. Prior to R2020a it was whatever native encoding your system uses. Matlab never tried to guess the encoding.

As Walter said, it's very unlikely that your file uses UTF8 encoding for text (unless the text is prefixed by length information). It possibly doesn't use your native encoding either. When reading text from a binary file, it's always safer to read it as bytes (uint8) and then convert to the correct encoding with native2unicode.
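
A rough sketch of the read-as-bytes approach, using the first 16 bytes that were posted earlier in this thread. Treating the field as US-ASCII and cutting it at the first zero byte are assumptions made for the example, not something known about the actual file format.

```matlab
% The first 16 bytes posted in this thread, written to a temporary stand-in file.
sampleBytes = uint8([50 55 49 49 57 49 53 56 48 52 56 0 32 0 32 0]);
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, sampleBytes, 'uint8');
fclose(fid);

% Read the fixed-width field as raw bytes; no character decoding happens here.
fid = fopen(fname, 'r');
rawField = fread(fid, [1, 16], '*uint8');
fclose(fid);
delete(fname);

% Assumption: the text ends at the first zero byte, if there is one.
firstNul = find(rawField == 0, 1);
if isempty(firstNul)
    firstNul = numel(rawField) + 1;
end

% Decode only the text part, with an encoding chosen explicitly.
commentText = native2unicode(rawField(1:firstNul-1), 'US-ASCII');
disp(commentText);
```

Reading bytes first keeps the field width exact, and the encoding decision is made in one visible place instead of inside fread.
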

Walter Roberson on 22 Apr 2020

It turns out that in R2020a, fopen now tries to do encoding detection: https://www.mathworks.com/help/matlab/ref/fopen.html#btrnibn-1-encodingIn

Historically, encoding detection for text being read by readtable() used to examine the first 10 kilobytes of the file; matters might be different for fopen()

Ed Frank: if you are still interested, could you try starting by reading (say) one character, and timing the fopen() and the first short fread(), and then the fread() of the next 79, to see whether the long time is at the fopen() or at the first fread() of character data, or if the position somehow triggers the delay?
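
The timing experiment suggested above might look roughly like this. The 80-byte temporary file is only a stand-in so the snippet runs on its own; to learn anything, fname would have to point at the real large file, and the variable names are invented for the example.

```matlab
% Hypothetical stand-in file with 80 ASCII characters.
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, repmat(uint8('A'), 1, 80), 'uint8');
fclose(fid);

% Time the open, the first one-character read, and the read of the next 79.
t0 = tic;
fid = fopen(fname, 'r');                         % no encoding given, as in the question
tOpen = toc(t0);

t1 = tic;
firstChar = fread(fid, [1, 1], 'char=>char');    % first read of character data
tFirstRead = toc(t1);

t2 = tic;
restChars = fread(fid, [1, 79], 'char=>char');
tRestRead = toc(t2);

fclose(fid);
delete(fname);

fprintf('fopen: %.3f s, first fread: %.3f s, next fread: %.3f s\n', ...
    tOpen, tFirstRead, tRestRead);
```
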

Guillaume on 29 Apr 2020

Sorry, I've been a bit too busy to follow answers recently but indeed I had conversations with Mathworks recently on text file parsing and indeed 2020a does automatic character set detection which is most likely the issue here. I'll post the details I've got from mathworks support in an answer.
