Fastest way to replace multiple substrings with a single new string?

Omar Salah on 6 Jun 2020
Commented: Omar Salah on 18 Jun 2020
Hello Everyone,
I'm trying to replace 7k different substrings with the same tag in a 50-million-word dataset (a cell array of 1 million strings, averaging 50 words each). Using replace or regexprep takes a long time. I tried using strrep the same way as replace, but it gives me this error:
Error using strrep
All nonscalar inputs must be the same size.
What is the fastest and least memory-consuming way to do this?
Here is the code:
% using replace
Tag = 'IMPORTANT';
substr = {'very','much'}; % a cell array of 7k+ words
tagcell = repmat({Tag}, size(substr)); % one copy of Tag per substring
maintext = replace(maintext, substr, tagcell);
% using regexprep
ev = '(';
for evi = 1:numel(substr)
    ev = [ev substr{evi} '|']; % append the evi-th word and a separator
end
ev = [ev(1:end-1) ')']; % drop the trailing '|' and close the group
maintext = regexprep(maintext, ev, Tag);
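
For comparison, here is a minimal sketch of three ways to write the same replacement, assuming `maintext` is a cell array of character vectors. The sample data is made up for illustration. It shows that `strrep` works when it is given one substring at a time, that `replace` accepts a single replacement for a whole list of substrings, and that the alternation pattern can be built with `strjoin` once regex metacharacters are escaped with `regexptranslate`. No claim is made here about which one is fastest on the real dataset.

```matlab
% Hypothetical sample data standing in for the real dataset
maintext = {'thank you very much'; 'not so much (really)'};
substr = {'very', 'much', '(really)'};
Tag = 'IMPORTANT';

% 1) strrep: pass one substring per call to avoid the size-mismatch error
out1 = maintext;
for k = 1:numel(substr)
    out1 = strrep(out1, substr{k}, Tag);
end

% 2) replace: a single replacement is applied to every substring
out2 = replace(maintext, substr, Tag);

% 3) regexprep: escape metacharacters such as ( and ), then join with '|'
escaped = cellfun(@(s) regexptranslate('escape', s), substr, ...
    'UniformOutput', false);
pattern = strjoin(escaped, '|');
out3 = regexprep(maintext, pattern, Tag);
```

All three match raw substrings, so a word such as 'every' would also be changed because it contains 'very'.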
  4 Comments
Omar Salah on 10 Jun 2020
@james I can actually work with both, either a cell array of character vectors or a cell array of strings. I move between them easily. Is one type faster than the other?
Omar Salah on 10 Jun 2020
@stephen I have never worked with C++, but I'm wondering why they would be faster. Is it because they are compiled, or because C++ functions are generally faster?


Answers (1)

Mohammad Sami on 11 Jun 2020
After some experimentation, I think that if you tokenize your sentences, you can use a hash map to look up the words to replace.
Example code follows. If you want case-insensitive matching, apply the function lower to both the words and the sentences.
substr = cellstr(substr);
w = containers.Map(substr,substr); % create a hash map of the substrings you want to replace
m2 = cellstr(sentences);
m5 = cell(length(m2),1);
for i = 1:length(m2)
    m3 = split(m2{i},' '); % tokenize the sentence
    m4 = w.isKey(m3); % look up which words to replace
    m3(m4) = {'IMPORTANT'}; % replace the words
    m5(i) = join(m3,' '); % store the updated sentence
end
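
The case-insensitive variant mentioned above could look roughly like this. It is only a sketch under assumptions: the sample `sentences` and `substr` values are invented, and the lookup is done on lower-cased copies so that the original casing of untouched words is preserved.

```matlab
% Hypothetical sample data
sentences = {'Thank you VERY much'; 'nothing to tag here'};
substr = {'very', 'much'};

keys = lower(cellstr(substr));       % store the words in lower case
w = containers.Map(keys, keys);      % values are unused; only keys matter

out = cell(numel(sentences), 1);
for i = 1:numel(sentences)
    tokens = split(sentences{i}, ' ');   % tokenize on single spaces
    hit = isKey(w, lower(tokens));       % compare lower-cased tokens
    tokens(hit) = {'IMPORTANT'};         % replace only the matching tokens
    out(i) = join(tokens, ' ');          % rebuild the sentence
end
```

Because the sentence is split on spaces, only whole tokens are matched: 'much,' with trailing punctuation would not be replaced, and neither would 'very' inside 'every'.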
  1 Comment
Omar Salah on 18 Jun 2020
Wow, thanks! That's definitely something to try. I will try it tonight and get back to you :)
