UTF-8 strings in MEX-files

Cris Luengo on 26 Mar 2017
Commented: Bart Plovie on 24 Jul 2023
This question has been asked here before, but without any satisfying answers. Since those answers were posted, a new documented function, mxArrayToUTF8String, has appeared. I'm hoping to find the function that does the reverse: make an mxArray from a UTF-8 encoded C or C++ string. I'm OK with an undocumented function, or with using a bit of code from someone else. I'm not OK with linking some huge Unicode library that I otherwise have no use for. All I need is to convert UTF-8 to UTF-16 (which seems to be what MATLAB uses in its mxChar arrays).
Does anybody have any experience with UTF-8 encoded strings in MATLAB?
What does The MathWorks suggest we do if we want to work with UTF-8 encoded strings?

Accepted Answer

Cris Luengo on 26 Mar 2017
A solution when using C++11:
I found this answer on StackOverflow: http://stackoverflow.com/a/38383389 It turns out C++11 has built-in functions for converting between various Unicode encodings.
#include "mex.h"
#include <algorithm>
#include <string>
#include <locale>
#include <codecvt>
...
std::string u8str = u8"µm²";
std::u16string u16str = std::wstring_convert< std::codecvt_utf8_utf16< char16_t >, char16_t >{}.from_bytes( u8str );
mwSize sz[ 2 ] = { 1, u16str.size() }; // MATLAB char arrays store their length, so no terminating null is needed
mxArray* mxstr = mxCreateCharArray( 2, sz );
std::copy( u16str.begin(), u16str.end(), mxGetChars( mxstr )); // copy only the UTF-16 code units; end() + 1 would read past the end

More Answers (2)

Walter Roberson on 26 Mar 2017
I am not all that familiar with those APIs, but how about using mxCreateString to copy the char* into an mxArray, and then calling into MATLAB to execute native2unicode() with 'UTF8' as the encoding?
  1 Comment
Cris Luengo on 26 Mar 2017
Walter,
Unfortunately, mxCreateString destroys the non-ASCII code points. Any byte with a value above 127 is converted to 65535.
But maybe I can convert to a uint8 array and use that function? Interesting idea, I need to try this tonight!



Jan on 26 Mar 2017
Edited: Jan on 26 Mar 2017
Not a solution, but a contribution to the discussion:
See this discussion: http://www.mathworks.com/matlabcentral/newsreader/view_thread/301249 . The most reliable method I've found is http://site.icu-project.org/, which is exactly what you excluded with "*not* OK with linking some huge Unicode library".
MATLAB, most C libraries, and the OS APIs all have stable Unicode support. It is a pity that many conversion functions of the MEX API are not documented.
  2 Comments
Cris Luengo on 26 Mar 2017
Jan, indeed, that is the library I was thinking of when I said I didn't want to link in some huge library... :)
UTF-8 to 16 should be pretty straightforward, there is no reason to know about Unicode at all, just about the two encodings. I would be able to write that translation myself, but don't want to reinvent the wheel.
Indeed, it would be easy to document some of those functions. You can't pretend the whole world uses UTF-16, especially when it's the worst way to encode Unicode IMO.
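To illustrate how small such a translation can be, here is a minimal sketch of a hand-written UTF-8 to UTF-16 decoder that only knows about the two encodings. The function name utf8ToUtf16 is hypothetical (it is not part of the MEX API or of any library mentioned in this thread), and throwing on malformed input is an assumption; replacing bad sequences with U+FFFD would be an equally reasonable choice.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of the MEX API: decode UTF-8 bytes into
// UTF-16 code units. Throws std::invalid_argument on malformed input.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());  // never more code units than input bytes
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length and the first payload bits.
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, surrogate code points, and values past U+10FFFF.
        static const char32_t minForLen[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < minForLen[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one code unit
        } else {
            cp -= 0x10000;                             // surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

The resulting std::u16string can then be copied into an array created with mxCreateCharArray, sized to the number of code units, in the same way as in the accepted answer.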
Bart Plovie on 24 Jul 2023
Had to do this today as well, and was a bit surprised there was no built-in function. I suspect UTF-16 is used since that's what Windows uses. So given the percentage of MATLAB users on Windows, I think it's a fair assumption on their part.
Anyhow, if someone else ever runs into this issue: on Windows you can solve it using MultiByteToWideChar (just include windows.h) like this:
int bufferSize = MultiByteToWideChar(CP_UTF8, 0, utf8String, -1, NULL, 0);
wchar_t* utf16String = (wchar_t*) mxMalloc(bufferSize * sizeof(wchar_t));
int rc = MultiByteToWideChar(CP_UTF8, 0, utf8String, -1, utf16String, bufferSize);
On Linux you can use iconv if you need something similar done:
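iconv_t conversionDescriptor = iconv_open("UTF-16LE", "UTF-8"); // returns (iconv_t)-1 on failure
size_t inBytesLeft = strlen(utf8String); // iconv expects size_t counters, not int
size_t bufferSize = inBytesLeft * sizeof(mxChar); // each UTF-8 byte yields at most one UTF-16 code unit
size_t outBytesLeft = bufferSize;
mxChar* utf16String = (mxChar*) mxMalloc(bufferSize); // mxChar is 16 bits; wchar_t is typically 32 bits on Linux
char* inbuf = (char*)utf8String;
char* outbuf = (char*)utf16String;
size_t rc = iconv(conversionDescriptor, &inbuf, &inBytesLeft, &outbuf, &outBytesLeft); // (size_t)-1 on error
size_t numChars = (bufferSize - outBytesLeft) / sizeof(mxChar); // code units actually written
iconv_close(conversionDescriptor);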
You might need to modify these a bit, but I hope it gets the idea across. I'd imagine UTF-16 little endian is used on Linux, but if it isn't you can just drop the LE.
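As a rough illustration of the iconv route with error checks added, the sketch below wraps the calls in a function returning a std::u16string. The name utf8ToUtf16Iconv is hypothetical, and the code assumes a little-endian Linux host with glibc's iconv (where the input pointer argument is char**), so that the UTF-16LE bytes map directly onto char16_t.

```cpp
#include <iconv.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around iconv: UTF-8 bytes in, UTF-16 code units out.
// Assumption: little-endian host, so UTF-16LE output can be stored as char16_t.
std::u16string utf8ToUtf16Iconv(const std::string& utf8) {
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // One UTF-16 code unit per input byte is always enough room.
    std::u16string out(utf8.size(), u'\0');

    char* inbuf = const_cast<char*>(utf8.data());  // iconv does not modify the input
    std::size_t inLeft = utf8.size();
    char* outbuf = reinterpret_cast<char*>(&out[0]);
    std::size_t outLeft = out.size() * sizeof(char16_t);

    const std::size_t rc = iconv(cd, &inbuf, &inLeft, &outbuf, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed: invalid or incomplete UTF-8 input");

    // Trim the unused tail: outLeft is the number of bytes that were not written.
    out.resize(out.size() - outLeft / sizeof(char16_t));
    return out;
}
```

Naming the byte order explicitly ("UTF-16LE") matters here: with plain "UTF-16", glibc's iconv is likely to prepend a byte order mark, which would then show up as an extra character in the MATLAB string.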
Also, if you want to grab stuff from MATLAB with mxGetString(), you can convert it to pretty much any format you want by first converting it to UTF-16 (WideChar) using MultiByteToWideChar with CP_ACP as the source encoding, and then converting it back to MultiByte using the desired encoding. It's a bit of a hacky workaround, but Windows' encoding translation seems to be pretty solid; just be careful with the buffer size allocation.
