textscan or import of unicode encoded textfile

Question

Hyung-Sik Kim, 22 Sep 2011

Question 1: Are textscan and importdata supposed to work with Unicode-encoded text files?

Question 2: After a UTF-8 encoded file is opened with the correct encoding specified in the fopen arguments, the textscan output contains the three characters ï»¿ immediately before the first valid data in the file. Is this expected but undocumented behavior?

Answer 1

Anne, 5 Dec 2011

I have the same problem with my old MATLAB 7.3.0. textscan does not read Unicode files correctly when given a file identifier, but it can handle Unicode text passed to it as a string.

So a simple (but slow) workaround is to read the text first with fscanf and then run textscan on that text.

[f,msg]=fopen(nomfic,'r','n','UTF-8');
LIGNES=textscan(f,'%[^\n]','delimiter','\n');
fclose(f);

will not handle Unicode-encoded characters correctly, but

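[f,msg]=fopen(nomfic,'r','n','UTF-8');   % open with an explicit UTF-8 encoding
txt=fscanf(f,'%c');                       % read the whole file as decoded characters
fclose(f);
LIGNES=textscan(txt,'%[^\n]','delimiter','\n');   % parse the string, one line per cell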

will.

Answer 2

Walter Roberson, 22 Sep 2011

Answer 1: textscan() is supposed to; I do not know about importdata.

Answer 2: When you explicitly specify one of the UTF-* encodings, the MATLAB code will not look for a Byte Order Mark and will leave any Byte Order Mark in the file stream. If you do not explicitly specify the encoding, the byte stream is examined for a Byte Order Mark and, if one is found, the encoding is determined from it.
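To check whether a file really starts with a UTF-8 Byte Order Mark, one option is to read its first three bytes with no decoding at all. The following is only a minimal sketch: the file name bom_demo.txt and the variable names are made up for illustration, and the snippet writes its own demo file so it does not depend on any data from this thread.

```matlab
% Hypothetical demo file: a UTF-8 BOM followed by plain ASCII data.
fname = fullfile(tempdir, 'bom_demo.txt');   % assumed name, not from the thread
fid = fopen(fname, 'w');
fwrite(fid, [239 187 191, double(sprintf('1,2,3\n'))], 'uint8');
fclose(fid);

% Read the first three raw bytes; uint8 reads are not affected by encoding.
fid = fopen(fname, 'r');
head = fread(fid, 3, 'uint8=>double')';      % row vector of byte values
fclose(fid);

% 239 187 191 is 0xEF 0xBB 0xBF, the UTF-8 form of the Byte Order Mark.
hasBOM = isequal(head, [239 187 191]);
```

If hasBOM is true, the mark can be skipped (for example with fseek(fid, 3, 'bof') after reopening the file) before the data is read.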

Using a Byte Order Mark with UTF-8 is not recommended, but some Windows editors insert one anyway. The Byte Order Mark in UTF-8 is the byte sequence 0xEF, 0xBB, 0xBF; when each byte is displayed as a separate Latin-1 character, it shows up as exactly the three characters you noticed (ï»¿).
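If the text is read into a string first (as in the fscanf workaround in Answer 1), the mark can be stripped before parsing so that it does not end up in the first field. This is a sketch under assumptions: depending on the MATLAB version and how the file was opened, the mark may arrive either as the single decoded character U+FEFF, char(65279), or as the three raw-byte characters char([239 187 191]), so both cases are handled. The sample text is invented for the demonstration.

```matlab
% Sample text as it might come back from fscanf(f,'%c'); the first three
% characters are the UTF-8 BOM bytes, which display as the garbage prefix.
txt = [char([239 187 191]) sprintf('alpha\nbeta\n')];

% Remove a leading BOM, whether it arrived decoded (U+FEFF) or as raw bytes.
if strncmp(txt, char(65279), 1)
    txt = txt(2:end);
elseif strncmp(txt, char([239 187 191]), 3)
    txt = txt(4:end);
end

% Parse the cleaned string line by line, using the format from Answer 1.
LIGNES = textscan(txt, '%[^\n]', 'delimiter', '\n');
```

Stripping the mark from the string leaves the file on disk unchanged, which may be preferable when other tools expect the mark to be there.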

I have not checked whether it makes a difference if the file is opened with 'r' or 'rt'. I use 'rt' for text files, as it can make a difference in some cases.
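One case where 'r' and 'rt' are documented to differ is line-ending handling: on Windows, text mode removes the carriage return from each CR LF pair when reading, while on other systems the two modes behave the same. The sketch below counts carriage returns under both modes; the file name crlf_demo.txt is made up, and the result for text mode depends on the platform, so no particular value is claimed for it.

```matlab
% Hypothetical demo file containing Windows-style line endings (CR LF).
fname = fullfile(tempdir, 'crlf_demo.txt');  % assumed name, not from the thread
fid = fopen(fname, 'w');                     % binary mode: bytes written as given
fwrite(fid, double(sprintf('a\r\nb\r\n')), 'uint8');
fclose(fid);

fid = fopen(fname, 'r');                     % binary-mode read
rawTxt = fread(fid, [1 Inf], '*char');
fclose(fid);

fid = fopen(fname, 'rt');                    % text-mode read
textTxt = fread(fid, [1 Inf], '*char');
fclose(fid);

nCRraw  = sum(rawTxt  == char(13));          % carriage returns seen in binary mode
nCRtext = sum(textTxt == char(13));          % carriage returns seen in text mode
```

Comparing nCRraw and nCRtext on the machine in question shows whether the mode matters there.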