How to do one-hot encoding of unusual letters in MATLAB?
I have a table of German characters. I would like to one-hot encode them so that I can feed them into a neural network. How should I go about it?
The problem is that some of the characters are not accepted by MATLAB as characters, for example 'ä', 'ö', 'ü', 'ß'.
Regardless, I would like to know how to one-hot encode any character from a TABLE in MATLAB.
Thanks in advance!
Accepted Answer
Walter Roberson
19 July 2019
Edited: Walter Roberson
19 July 2019
https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/ describes One-Hot Encoding (a term I was not aware of)
In MATLAB, see https://www.mathworks.com/help/deeplearning/ref/vec2ind.html and its counterpart ind2vec, which converts indices into one-hot column vectors.
You might want to first construct a list of permitted characters, and map the input into an offset in that list. That will potentially save you from wasting bits on characters such as œ that you are not using.
Walter Roberson
30 July 2019
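% Alphabet of permitted characters; a character's position here is its class index.
permitted = ['A':'Z', 'a':'z', 'ä', 'ö', 'ü', 'ß', ' ', '.'];
[found, idx] = ismember(YourText, permitted);
assert(all(found), 'unpermitted character detected')
% One column per input character; passing the row count keeps unused trailing classes.
OneHot = ind2vec(idx, numel(permitted));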
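For the table case in the question, a minimal sketch that avoids the toolbox function ind2vec might look like the following. The table T, its variable name Letter, and the sample characters are hypothetical; here the alphabet is derived from the data with unique rather than from a fixed list of permitted characters.

```matlab
% Hypothetical table with one character per row (names and data are assumptions).
T = table({'ä'; 'b'; 'ß'; 'b'}, 'VariableNames', {'Letter'});

% Concatenate the cell contents into one char row vector.
chars = [T.Letter{:}];

% Build the alphabet from the data; idx(k) is the position of chars(k) in it.
[alphabet, ~, idx] = unique(chars);

% Row k of OneHot has a single 1, in column idx(k).
I = eye(numel(alphabet));
OneHot = I(idx, :);
```

Each row of OneHot corresponds to one table row, which is the transpose of the layout ind2vec produces (one column per character).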
More Answers (1)
Guillaume
30 July 2019
file containing the characters that I want to one hot encode. Do you know how I can go about now
% phoneme_set: a column cell array of phonemes to one-hot encode
assert(numel(phoneme_set) <= 64, 'Cannot one-hot encode more than 64 phonemes with a 64-bit integer')
phoneme_set(:, 2) = num2cell(2 .^ uint64(0:size(phoneme_set, 1)-1))
Guillaume
30 July 2019
number binary pattern (64 bits)
2^0 0000000000000000000000000000000000000000000000000000000000000001
2^1 0000000000000000000000000000000000000000000000000000000000000010
2^2 0000000000000000000000000000000000000000000000000000000000000100
...
2^63 1000000000000000000000000000000000000000000000000000000000000000
This is what one-hot encoding is. As I commented, this is useful for FPGAs and similar which operate at the bit level. On the generic processor of a computer, it's a complete waste of space but my answer does what you asked.
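As a rough illustration of the table above, the mapping between a bit position and its 64-bit one-hot code can be sketched with bitshift. The symbol count n = 4 and the variable names are assumptions chosen for the example.

```matlab
% Hypothetical example: one-hot codes for 4 symbols stored as uint64.
n = 4;
codes = bitshift(uint64(1), 0:n-1);   % exactly one bit set in each code

% Recover the zero-based bit position of each code.
% log2 is exact here because every code is a power of two.
positions = log2(double(codes));
```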
You could encode the one-hot encoded numbers that I generate as a vector of 0 and 1 (double) for even more waste of space:
phoneme_set(:, 3) = num2cell(fliplr(eye(size(phoneme_set, 1), 64)), 2)
Note that the binary pattern of the 0s in that encoding is 0000000000000000000000000000000000000000000000000000000000000000b and of the 1s is 0011111111110000000000000000000000000000000000000000000000000000b.
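One way to check those two bit patterns is num2hex, which returns the IEEE 754 double-precision representation as 16 hexadecimal digits. The variable names are only for illustration.

```matlab
% Each hex digit stands for 4 bits of the 64-bit double.
zeroBits = num2hex(0);   % '0000000000000000': all 64 bits clear
oneBits  = num2hex(1);   % '3ff0000000000000': sign 0, exponent 0x3FF, mantissa 0
```

Hex 3ff0 expands to 0011 1111 1111 0000, which matches the leading bits of the pattern quoted for 1.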
However, I suspect that what we have here is an XY problem. You have some unspecified problem doing something, and you think you can solve it with one-hot encoding (without really understanding what it means), so you ask about one-hot encoding instead of your actual problem.