Chinese characters in html not recognized

Bohan Liu on 17 Apr 2019
Answered: Patrik Forssén on 6 Feb 2021
Hi there! I need to read and parse Chinese characters in some HTML files and later save them to an Excel worksheet.
Currently the main issue is that after reading the HTML file into a character vector using urlread/webread, all the Chinese characters are displayed as weird symbols. In the following steps I also want to use strfind to find the index of Chinese characters in the complete character vector, but the Chinese characters to be searched for are displayed as ? in the m-file.
So far I have tried two methods to set the character encoding of MATLAB:
  1. slCharacterEncoding('GBK') (since the source HTML uses GBK character encoding)
  2. editing the lcdata.xml file on the MATLAB path
Neither of these two methods worked, nor did changing the MATLAB font preference or the regional setting. I have basically exhausted all the possibilities that I can think of or have found on the web.
I would appreciate it if someone could help me out with a viable solution. Thanks in advance!
Bohan

Answers (2)

Walter Roberson on 17 Apr 2019
S = webread(url);
proper_S = native2unicode( uint8(S), 'GBK');
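
The byte-decoding idea in this answer can be tried offline, without a web request. The following is only a minimal sketch: the variable names and the tiny HTML string are made up for illustration, and the Chinese text is built from Unicode code points with char() so that the m-file itself contains no non-ASCII characters (which sidesteps the ? display problem described in the question).

```matlab
% Build the Chinese text from Unicode code points so this file stays pure ASCII.
target = char([20013 25991]);            % U+4E2D U+6587
html   = ['<p>' target '</p>'];          % hypothetical stand-in for a downloaded page

% Simulate the GBK bytes a server would send, then decode them.
bytes   = unicode2native(html, 'GBK');   % uint8 row vector of GBK-encoded bytes
decoded = native2unicode(bytes, 'GBK');  % back to a MATLAB char vector

% Search the decoded text for the Chinese substring.
idx = strfind(decoded, target);
fprintf('match at index %d\n', idx);
```

With a real page, one way to obtain the undecoded bytes would be to request binary content, for example webread(url, weboptions('ContentType', 'binary')), and pass the resulting uint8 data to native2unicode. Whether that is needed depends on how the server declares its encoding.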

Patrik Forssén on 6 Feb 2021
opt = weboptions('CharacterEncoding', 'GBK');
str = webread(url, opt);
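
Once the page text is decoded correctly, the remaining steps from the question (searching with strfind and saving to Excel) might look like the sketch below. It rests on several assumptions: the str value here is a hand-made stand-in for what webread would return, the output file name is hypothetical, and writecell requires R2019a or later (on older releases such as R2017b, xlswrite would be the alternative).

```matlab
% Chinese search text built from code points, so the m-file stays ASCII-only.
target = char([20013 25991]);                      % U+4E2D U+6587

% Hypothetical stand-in for the decoded page; in practice str would come
% from webread(url, opt) with the GBK weboptions.
str = ['<td>' target '</td><td>' target '</td>'];

% Start indices of every occurrence of the Chinese text.
idx = strfind(str, target);

% Save the search text and the number of hits to a worksheet
% (assumed file name; writecell needs R2019a or later).
outFile = fullfile(tempdir, 'chinese_hits.xlsx');
writecell({target, numel(idx)}, outFile);
```

Building the search string from code points is a design choice rather than a requirement. Typing the characters directly also works if the m-file is saved in an encoding that MATLAB reads correctly.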


Release

R2017b
