Removing commas between columns in text data

Kim Maria Damiani on 16 Oct 2021
I have a txt file which is the output of a lemmatizer, in the form
Sometimes, ,, I, use, commas, .
I, like, writing, ,, I, like, reading
How can I read it into a tokenizedDocument while deleting the unnecessary commas between tokens? A simple approach would be
test=readlines('/path/to/file.txt')
test=strrep(test,',','')
test=tokenizedDocument(test)
but it would also remove the commas already present in the original text, while I'd like to preserve punctuation.

Accepted Answer

Walter Roberson on 16 Oct 2021
test = {'Sometimes, ,, I, use, commas, .'
'I, like, writing, ,, I, like, reading'};
test = regexprep(test, {'(?<=[^,]),\s', '\s*,,', '\s+\.'}, {' ', ',', '.'})
test = 2×1 cell array
    {'Sometimes, I use commas.'      }
    {'I like writing, I like reading'}
Notice that we needed a special rule for periods. You have 'use, commas', which should almost certainly become 'use commas' (so a comma-space pair becomes a space), but applying that same rule to 'commas, .' gives 'commas .', with an unwanted space before the period.
To put it another way, we cannot simply delete every comma-space pair: that would work for the pair between the word 'commas' and the period, but not for the pair between 'use' and 'commas', because 'use, commas' would merge into 'usecommas'. Hence the separator becomes a space, and a third rule removes the whitespace in front of a period.
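To see what each rule contributes, the three patterns can be applied one at a time. This is only a sketch of the same call split into steps; the variable names step1, step2 and step3 are illustrative and not part of the original answer.

```matlab
% Same input as above.
test = {'Sometimes, ,, I, use, commas, .'
        'I, like, writing, ,, I, like, reading'};

% Rule 1: a comma preceded by a non-comma character and followed by
% whitespace is a separator; replace the pair with a single space.
step1 = regexprep(test, '(?<=[^,]),\s', ' ');
% step1{1} is 'Sometimes ,, I use commas .'

% Rule 2: a doubled comma (a literal comma token followed by a separator
% comma) becomes one comma attached to the preceding word.
step2 = regexprep(step1, '\s*,,', ',');
% step2{1} is 'Sometimes, I use commas .'

% Rule 3: remove the whitespace left in front of a period.
step3 = regexprep(step2, '\s+\.', '.');
% step3{1} is 'Sometimes, I use commas.'

disp(step3)
```

Passing the patterns as cell arrays in a single regexprep call applies them in the same order. Other punctuation tokens in the real file (for example '?' or ';') would presumably need a rule like the third one.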

More Answers (1)

Chunru on 16 Oct 2021
test = {'Sometimes, ,, I, use, commas, .'
'I, like, writing, ,, I, like, reading'};
test = regexprep(test, ',\s', ' ')
test = 2×1 cell array
    {'Sometimes , I use commas .'     }
    {'I like writing , I like reading'}
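Since the lemmatizer output already separates tokens with a comma followed by a space, another option is to split on that separator rather than rewrite the string. The following is a minimal sketch, assuming every pair of tokens is separated by exactly ', '; the names raw and tokens are illustrative.

```matlab
% Sample lines as a string array (readlines would return the same type).
raw = ["Sometimes, ,, I, use, commas, ."
       "I, like, writing, ,, I, like, reading"];

% Split each line on the separator ", ". A literal comma or period in the
% text survives as its own token, because only the comma-space pair is
% treated as a delimiter.
tokens = arrayfun(@(s) split(s, ", ")', raw, 'UniformOutput', false);

% Show the tokens of the first line as a row string vector.
disp(tokens{1})
```

If Text Analytics Toolbox is available, tokenizedDocument(tokens, 'TokenizeMethod', 'none') should accept this cell array of string vectors as documents that are already split into words; it is worth checking the tokenizedDocument documentation for your release before relying on this.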

Release

R2021b
