Matlab does not recognise hyphen

Philipp Braeuninger on 2 Dec 2018
Commented: Walter Roberson on 4 Dec 2018
Hi all,
I'm trying to remove a simple hyphen "-" from a string array, but MATLAB does not seem to recognise the hyphen. I'm sourcing the text from a website and storing it in a string array. Then I try to replace the hyphen using strrep(myStringArray,'-','_'). The weird thing is that MATLAB does not replace it, but when I stop the program in the debugger and execute the same command locally, it works. Any thoughts on this are highly appreciated!
9 Comments
Christopher Creutzig on 4 Dec 2018
MATLAB uses UTF-16, correct. (And I was assuming the string was read as a string. native2unicode is only useful if you read binary data or otherwise got raw numbers. MATLAB strings are in Unicode, in UTF-16 encoding.)
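To illustrate the distinction made above, here is a minimal sketch, assuming the page was read as raw UTF-8 bytes (for example with fread). The byte values and variable names are made up for the example. It shows native2unicode turning the bytes into a char vector, and a way to inspect the numeric values to see whether a "hyphen" is really character 45 or a look-alike such as the en dash (8211).

```matlab
% Hypothetical raw UTF-8 bytes for the text a, en dash (U+2013), b
rawBytes = uint8([97 226 128 147 98]);

% Decode the bytes into a MATLAB char vector
txt = native2unicode(rawBytes, 'UTF-8');

% Inspect the numeric value of every character
codeUnits = double(txt);

% The ASCII hyphen (45) does not occur, so this leaves txt unchanged
notFixed = strrep(txt, '-', '_');

% Replacing the en dash by its numeric value does work
fixed = strrep(txt, char(8211), '_');
```

If the text already arrives as a string (for instance from webread), only the inspection step with double(char(...)) is needed.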
Walter Roberson on 4 Dec 2018
For Unicode code points U+10000 and above, it would be nice to see the code point itself, perhaps as a uint32, but uint16(char(s)), char(s)+0 and s+0 cannot give you that.
It gets kinda confusing... if you see 55296 (hex D800), are you seeing an actual code point U+D800, or are you seeing the first high surrogate? According to the documentation for char(), numeric inputs are treated as Unicode code points, so char(55296) should have to be encoded into multiple positions in UTF-16. But if you are going to bother doing that, then why restrict inputs to 65535? The user-visible interface behaves as if UTF-16 is not used internally, and as if a "character" header is instead tossed onto uint16() of the numeric values.
>> foo = char(55296)
foo =
'?'
>> whos foo
Name Size Bytes Class Attributes
foo 1x1 2 char ?
(It is not a ? that shows up, it is an empty box)
Evidence that UTF16 was not used: look at bytes: UTF16 encoding of U+D800 is more than 2 bytes.
>> D800DC00 = uint8([216 0 220 0])
D800DC00 =
1×4 uint8 row vector
216 0 220 0
>> bar = native2unicode(D800DC00, 'UTF16')
bar =
'?'
>> bar+0
ans =
55296 56320
>> whos bar
Name Size Bytes Class Attributes
bar 1x2 4 char
Actual Unicode code point: U+10000.
This all tends to suggest that UTF16 is not the internal representation in MATLAB, and that uint16(char(s)) will not show the unicode code points.
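As a small sketch of how the two values shown above (55296 and 56320) relate to U+10000, the standard UTF-16 surrogate arithmetic can be applied by hand. This only assumes that the char vector holds the same two numbers as bar in the session above; the variable names are illustrative.

```matlab
% Same two values as bar+0 in the session above (hex D800 and DC00)
units = double(char([55296 56320]));
hi = units(1);
lo = units(2);

% High surrogates are D800..DBFF, low surrogates are DC00..DFFF
isPair = hi >= 55296 && hi <= 56319 && lo >= 56320 && lo <= 57343;

% Combine the pair into a single code point
codePoint = 65536 + (hi - 55296) * 1024 + (lo - 56320);
hexCodePoint = dec2hex(codePoint);
```

This recovers the code point as a plain number, which is the value that uint16(char(s)) alone does not show.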


Accepted Answer

Philipp Braeuninger on 4 Dec 2018
Edited: madhan ravi on 4 Dec 2018
Hi all,
thanks a lot for all your answers!!!
This is a really weird one. I'm pulling the text from websites. I tried native2unicode() and assigning the result to a new variable (e.g. myNewArray). Nothing worked.
I finally worked around it by using the function "isletter" and using a for loop since isletter doesn't take a string array:
for iLine = 1:numel(myStringArray)
    currentChangeLine = char(myStringArray(iLine));
    idxUnderscore = strfind(currentChangeLine, '_'); % I have some underscores which I want to keep
    idxWhiteSpace = find(isspace(currentChangeLine));
    idxIsDigit = find(isstrprop(currentChangeLine, 'digit')); % I also want to keep digits in the text
    idxNotALetter = find(~isletter(currentChangeLine));
    % Work out the indices where the character is not a letter, digit, white space or underscore,
    % i.e. this will filter out hyphens but also other symbols like &, @, etc.
    idxChange = setdiff(idxNotALetter, [idxUnderscore, idxWhiteSpace, idxIsDigit]);
    currentChangeLine(idxChange) = '_'; % replace the hyphen (and any other symbol) with an underscore
    myStringArray(iLine) = currentChangeLine;
end
It's not a neat solution but it worked!
Once again thanks a lot for all your help!
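For readers who want a shorter variant of the loop above, here is a sketch that builds one logical mask per element instead of combining index lists with setdiff. It is intended to keep the same characters (letters, digits, white space and underscores) and replace everything else with an underscore. The sample data and the name cleaned are assumptions for the example.

```matlab
% Hypothetical sample data: an ASCII hyphen, and an en dash (8211) plus "&"
myStringArray = ["foo-bar_1", "a" + char(8211) + "b & c"];

cleaned = myStringArray;
for k = 1:numel(myStringArray)
    c = char(myStringArray(k));
    % true where the character should be kept
    keep = isletter(c) | isstrprop(c, 'digit') | isspace(c) | (c == '_');
    c(~keep) = '_';            % everything else becomes an underscore
    cleaned(k) = string(c);
end
```

Because the mask only tests what to keep, it also catches dash look-alikes without having to know their exact code points.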

More Answers (1)

Jan on 3 Dec 2018
Edited: Jan on 3 Dec 2018
"if I stop in the debugger and execute the command it does work"
Then there must be another problem. The debugger can influence the result if you create variables dynamically with eval, e.g. in called scripts. Otherwise the code must do exactly the same in debug and non-debug mode. So if you observe that your code seems to ignore a command that executes successfully during debugging, maybe the result is overwritten somewhere in the following code. Perhaps you use myStringArray instead of myNewArray after this line:
myNewArray= replace(myStringArray, ...
["-" "–" char(8211) "-" char(8212) "—" "—" "–"],'_');
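A minimal sketch of the point above, with made-up sample data: replace returns a new array and leaves its input untouched, so any later code has to use the returned variable. The dash characters are written by code point (8211 for the en dash, 8212 for the em dash) to avoid ambiguity about which character is meant.

```matlab
% Hypothetical sample data with a hyphen, an en dash and an em dash
myStringArray = ["a-b", "c" + char(8211) + "d", "e" + char(8212) + "f"];

% Dash variants to replace, given by code point
dashes = ["-", string(char(8211)), string(char(8212))];

% The result is returned; myStringArray itself is not modified
myNewArray = replace(myStringArray, dashes, "_");

inputUnchanged = isequal(myStringArray(1), "a-b");
```

If later code keeps reading myStringArray, it still sees the original dashes, which would look as if the replacement had no effect.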

Release

R2018b
