Problem with text analysis

David MERCIER on 19 Oct 2021
Answered: DGM on 19 Oct 2021
Hi, I am trying to clean a table containing both Latin and non-Latin strings so that I can plot a word cloud. I used the regexprep function, but without success: I can't remove the Korean strings. Any ideas? Here is an example of the code and the output:
pathName = 'Keyword Aug. 2020 to Oct. 2021_MatlabSmall.xlsx';
T = readtable(pathName,'Range','A:B');
% Convert all Character Vector to Lowercase
T.Keyword = lower(T.Keyword);
% Remove not useful keywords
T(strcmp(T.Keyword, '(not provided)'), :)=[];
T(strcmp(T.Keyword, '(not set)'), :)=[];
% Set lower case
T.Keyword = lower(T.Keyword);
% Remove links
T(contains(T.Keyword, 'http'), :)=[];
T(contains(T.Keyword, '.'), :)=[];
T.Keyword = strrep(T.Keyword, ' ', '_');
display(head(T));
% Replace non alphanumerics
T.Keyword = regexprep(T.Keyword,'^a-z','');
  8×2 table
                 Keyword                 Sessions
    _________________________________    ________
    'stuff'                              390
    'forum'                              128
    'student'                            76
    '재료'                               59
    'stuff'                              56
    'uninstall_stuff_license_manager'    52
    'stuff_resource_center'              43
    'stuff_student_community'            34

Answers (1)

DGM on 19 Oct 2021
I'm terrible with regex, but this might get you somewhere. It removes every character except lowercase letters and underscores.
A = {'9.banana' 'orange-123_juice' 'ン戦国時' 'apple_sauce' 'abcクルミ' 'peach' 'pear' 'ピラミッド' 'cherry'}.'
A = 9×1 cell array
    {'9.banana'        }
    {'orange-123_juice'}
    {'ン戦国時'            }
    {'apple_sauce'     }
    {'abcクルミ'          }
    {'peach'           }
    {'pear'            }
    {'ピラミッド'           }
    {'cherry'          }
B = regexprep(A,'[^a-z_]','')
B = 9×1 cell array
    {'banana'      }
    {'orange_juice'}
    {0×0 char      }
    {'apple_sauce' }
    {'abc'         }
    {'peach'       }
    {'pear'        }
    {0×0 char      }
    {'cherry'      }
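
For context, here is a minimal sketch of why the pattern in the question leaves the Korean keyword in place, and how the bracketed pattern could be applied to the table itself. The table below is hypothetical sample data modelled on the question's output, not the original spreadsheet. It also assumes that rows whose keyword becomes empty should be dropped, which the question does not state.

```matlab
% Hypothetical sample data modelled on the table in the question.
T = table({'stuff'; 'forum'; '재료'; 'stuff_resource_center'}, ...
          [390; 128; 59; 43], ...
          'VariableNames', {'Keyword', 'Sessions'});

% Without brackets, ^ is a start-of-string anchor, so '^a-z' only matches
% the literal text "a-z" at the start of a keyword. Nothing is replaced.
unchanged = regexprep(T.Keyword, '^a-z', '');

% Inside brackets, a leading ^ negates the set: every character that is
% not a lowercase ASCII letter or an underscore is removed.
T.Keyword = regexprep(T.Keyword, '[^a-z_]', '');

% Keywords that held only non-Latin text are now empty. Dropping those
% rows is an assumption about the desired result.
T(cellfun(@isempty, T.Keyword), :) = [];

disp(T)
```

If the goal is the word cloud mentioned in the question, something like wordcloud(T, 'Keyword', 'Sessions') should then work on the cleaned table, assuming a release that includes the wordcloud function.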

Release: R2019a
