Find indices of multiple strings within another string

Nick Smith on 8 Apr 2022
Answered: Paul on 10 Apr 2022
I am trying to efficiently find which strings (character vectors) match between two cell arrays.
One cell array contains ~1000 equations written as strings that I'm trying to parse by matching to strings in another array (100,000 items). I need to know the indices from the 100,000 items that are found within the ~1000 equations. There may be multiple of the 100,000 items found within each of the 1000 equations.
I'm currently implementing this as such:
Equations.Equation % this is a list of ~1000 equations, a cell array of character vectors
OutputData.DataName % list of ~100,000 possible strings I'm looking for in the equations (my variable names)
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName);
    indices = find(matches);
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
This is fairly slow. Is there a way to more efficiently find within Equations(ii).Equation which items within OutputData.DataName are found and the index of those items?
4 Comments
Paul on 9 Apr 2022
Something's not working with this example data and the code in the question. Is there a typo somewhere?
Equations.Equation = { '(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'}
Equations = struct with fields:
Equation: {3×1 cell}
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName);
    indices = find(matches);
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
Error using cellfun
Non-scalar in Uniform output, at index 1, output 1.
Set 'UniformOutput' to false.
Voss on 9 Apr 2022
It seems like Equations is actually a struct array:
Equations = struct('Equation',{ ...
'(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'})
Equations = 3×1 struct array with fields:
Equation
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).'
    indices = find(matches)
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
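The difference between the two ways of building Equations can be seen in a small sketch. The names S1 and S2 below are illustrative only and are not from the question. Assigning a cell array to a field produces a scalar struct whose field holds the whole cell, whereas passing a cell array to struct() produces one struct element per cell entry, which is what the loop in the question appears to assume.

```matlab
% S1 and S2 are hypothetical names used only for this illustration.

% Field assignment: a 1x1 struct whose field contains the entire 3x1 cell.
S1.Equation = {'(123X + 123Y).^2'; '500 + 456X + 123Z'; '200 * abs(789Z * pi) + 123X'};

% struct() with a cell argument: a 3x1 struct array, one char vector per element.
S2 = struct('Equation', {'(123X + 123Y).^2'; '500 + 456X + 123Z'; '200 * abs(789Z * pi) + 123X'});

sizeS1 = size(S1);   % scalar struct
sizeS2 = size(S2);   % struct array

% With S1, contains() receives the whole cell and returns one logical per equation,
% which is the non-scalar output that cellfun rejects under 'UniformOutput', true.
r1 = contains(S1(1).Equation, '123X');

% With S2, contains() receives a single char vector and returns a scalar logical.
r2 = contains(S2(1).Equation, '123X');
```

With the scalar struct, length(Equations) is also 1, so the loop body would run only once.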


Accepted Answer

Paul on 10 Apr 2022
It looks like using string variables with an inner loop is much faster than a cell array with cellfun, at least here on Answers with the data provided.
Original code, modified by @_
Equations = struct('Equation',{ ...
'(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).'
    indices = find(matches)
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
Convert the cell arrays to strings, and implement an inner loop to compute matches. Verify the results are the same.
equations = string({Equations.Equation});
dataname = string(OutputData.DataName);
matches = false(1,numel(dataname)); % preallocate as a logical row vector
for ii = 1:numel(equations)
    for jj = 1:numel(dataname)
        matches(jj) = contains(equations(ii),dataname(jj));
    end
    matches
    indices = find(matches)
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
Wrap an outer loop around the original code to test timing.
ntrials = 1e5;
tic
for trials = 1:ntrials
    for ii = 1:length(Equations)
        matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).';
        indices = find(matches);
        % do some other stuff with the matches found, then move onto the next iteration of the loop
    end
end
toc
Elapsed time is 15.236180 seconds.
tic
for trials = 1:ntrials
    for ii = 1:numel(equations)
        for jj = 1:numel(dataname)
            matches(jj) = contains(equations(ii),dataname(jj));
        end
        matches;
        indices = find(matches);
    end
end
toc
Elapsed time is 2.448469 seconds.
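As a cross-check on tic/toc figures like these, timeit can be used; it calls the function handle repeatedly and returns a typical run time. The following is only a sketch for a single equation. The variable names eqChar, namesCell, eqStr, namesStr, tCell, and tStr are made up for illustration, and the string version uses arrayfun in place of the explicit inner loop, so the numbers will not be directly comparable to the timings above.

```matlab
% Hypothetical variable names; data copied from the example above.
eqChar    = '(123X + 123Y).^2';
namesCell = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
             '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};

% String versions of the same data.
eqStr    = string(eqChar);
namesStr = string(namesCell);

% timeit expects a function handle that takes no inputs.
tCell = timeit(@() cellfun(@(x) contains(eqChar, x), namesCell));
tStr  = timeit(@() arrayfun(@(x) contains(eqStr, x), namesStr));

fprintf('cellfun on char/cell: %.3g s, arrayfun on string: %.3g s\n', tCell, tStr);
```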
I was actually surprised that there isn't a string function that can replace that inner loop, but I couldn't find one. Maybe it can be done using a particular pattern, but I couldn't figure that out either.
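One possible way to avoid testing every name against every equation, offered as a sketch rather than a drop-in replacement, is to split each equation into word-like tokens with regexp and look the tokens up with ismember. This rests on an assumption not stated in the question: every variable name is a complete run of letters, digits, or underscores, so matching whole tokens gives the same answer as the substring test done by contains. If one name can be a substring of another, the two approaches will differ. The names DataName, tokens, tf, and loc are illustrative.

```matlab
% Example data from the thread.
Equations = struct('Equation', { ...
    '(123X + 123Y).^2'; ...
    '500 + 456X + 123Z'; ...
    '200 * abs(789Z * pi) + 123X'});
DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
            '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};

indices = cell(numel(Equations), 1);
for ii = 1:numel(Equations)
    % Assumption: names are whole "word" tokens ([A-Za-z0-9_]+) in the equation.
    tokens = regexp(Equations(ii).Equation, '\w+', 'match');

    % loc holds the position in DataName of each token (0 where not found).
    [tf, loc] = ismember(tokens, DataName);

    % Keep found positions only; unique also sorts and removes repeats.
    indices{ii} = unique(loc(tf));
end
```

For the three example equations this gives [4 5], [6 7], and [4 12], the same indices as the loops above. The work per equation then depends on the number of tokens in that equation rather than on 100,000 separate contains calls.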

Release

R2016b
