Problem with text analysis
Hi, I am trying to clean a table that contains both Latin and non-Latin strings so that I can plot a word cloud. I used the regexprep function, but without success: I can't remove the Korean strings. Any ideas? Here is an example of the code and the output:
pathName = 'Keyword Aug. 2020 to Oct. 2021_MatlabSmall.xlsx';
T = readtable(pathName,'Range','A:B');
% Convert all Character Vector to Lowercase
T.Keyword = lower(T.Keyword);
% Remove not useful keywords
T(strcmp(T.Keyword, '(not provided)'), :)=[];
T(strcmp(T.Keyword, '(not set)'), :)=[];
% Set lower case
T.Keyword = lower(T.Keyword);
% Remove links
T(contains(T.Keyword, 'http'), :)=[];
T(contains(T.Keyword, '.'), :)=[];
T.Keyword = strrep(T.Keyword, ' ', '_');
display(head(T));
% Replace non alphanumerics
T.Keyword = regexprep(T.Keyword,'^a-z','');
8×2 table
Keyword                              Sessions
_________________________________    ________
'stuff'                              390
'forum'                              128
'student'                            76
'재료'                               59
'stuff'                              56
'uninstall_stuff_license_manager'    52
'stuff_resource_center'              43
'stuff_student_community'            34
Answers (1)
DGM
19 Oct 2021
I'm terrible with regex, but this might get you somewhere. It removes every character except lowercase letters and underscores.
A = {'9.banana' 'orange-123_juice' 'ン戦国時' 'apple_sauce' 'abcクルミ' 'peach' 'pear' 'ピラミッド' 'cherry'}.'
B = regexprep(A,'[^a-z_]','')
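For what it's worth, the pattern in the question, '^a-z', probably fails because of the missing square brackets: outside brackets, ^ anchors the match to the start of the string and a-z is taken as the literal text "a-z", so nothing in the keywords matches. Inside brackets, ^ negates the character set instead. Below is a rough sketch of the bracketed pattern applied to a table. The small table is a hypothetical stand-in for the one returned by readtable, and dropping the rows that end up empty is my assumption about what you want for the word cloud.

```matlab
% Hypothetical stand-in for the poster's table (the real one comes from readtable).
Keyword  = {'stuff'; 'forum'; '재료'; 'stuff_resource_center'};
Sessions = [390; 128; 59; 43];
T = table(Keyword, Sessions);

% Unbracketed pattern: ^ anchors to the start of the string and 'a-z' is
% literal text, so no keyword matches and nothing is removed.
unchanged = regexprep(T.Keyword, '^a-z', '');

% Bracketed pattern: [^a-z_] matches any single character that is not a
% lowercase letter a-z or an underscore, and replaces it with nothing.
T.Keyword = regexprep(T.Keyword, '[^a-z_]', '');

% Keywords that contained only non-Latin characters are now empty.
% Assumption: such rows are not wanted, so remove them.
T(cellfun(@isempty, T.Keyword), :) = [];
```

Note that this also strips digits, spaces, and punctuation, which is why the underscore is listed explicitly in the set. Add 0-9 inside the brackets if numbers should survive.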