Finding duplicate strings in a cell array and their index

Jonathan Nastasi on 11 Apr 2015
Commented: Paul Wintz on 10 Sep 2021
I have to convert a cell array with more than 100,000 elements into a structure array with four fields. Right now, I have something like:
% cell array = nameData
n = 1;
for j = 2:102
    for i = 2:length(nameData)
        S(n).name = nameData{i,j};
        S(n).frequency = 1;
        n = n+1;
    end
end
However, I need to find duplicate strings in this array, and find information about them. Basically, I am collecting a database of strings and if I run across a duplicate, increase the frequency of that string rather than adding it to the structure.
I had been using loops within the previous two loops to achieve this:
for k = 1:n
    if strcmpi(S(k).name, nameData{i,j})
        S(k).frequency = S(k).frequency + 1;
    end
end
However, I always just end up with all 100,000 structure elements. Any other solution I have gotten to work was entirely too slow, and this conversion from cell to structure array must happen in less than 20 seconds.
Thanks!
2 Comments
Stephen23 on 12 Apr 2015
Edited: Stephen23 on 12 Apr 2015
You should avoid naming variables i and j as these are both names of the inbuilt imaginary unit.
Paul Wintz on 10 Sep 2021
The use of i and j as index variables is so ubiquitous in programming that I would say, instead, that you should avoid using i and j as the imaginary unit, and instead use 1i or 1j, which cannot be overwritten.
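Both comments come down to the same shadowing behaviour, which a minimal sketch can illustrate. The variable names z1 and z2 are only for illustration and do not appear in the thread:

```matlab
i = 3;             % i is now an ordinary variable that shadows the imaginary unit
z1 = 2 + 4*i;      % uses the variable, so the result is the real number 14
z2 = 2 + 4i;       % the literal 4i is always imaginary, whatever i holds
clear i            % removes the variable, so i means the imaginary unit again
```

Whichever convention you prefer, writing complex constants as literals such as 4i or 1j keeps them safe from loop indices with the same name.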


Accepted Answer

Stephen23 on 12 Apr 2015
Edited: Stephen23 on 13 Apr 2015
Learn to write vectorized code to make your code neater, faster and more robust: loops are not the first choice for solving problems in MATLAB, vectorization is!
This solution takes less than one second on my machine. First we generate an array of fake data, consisting of 100000 two-character strings of random characters:
N = 100000;
C = cellstr(char(32+randi(94,N,2)));
then we collect the unique ones into D and count their frequency in Y using hist:
tic
[D,~,X] = unique(C(:));
Y = hist(X,unique(X));
Z = struct('name',D,'freq',num2cell(Y(:)));
toc
The timer functions tic and toc print this to my command window:
Elapsed time is 0.379057 seconds.
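To see why the third output of unique is enough to count duplicates, it may help to look at a tiny case. This is only a sketch on made-up data; the names names, U, idx and rebuilt are hypothetical and not part of the answer above:

```matlab
% Hypothetical toy data, not taken from the original post
names = {'b'; 'a'; 'b'; 'c'; 'b'};

% U holds the sorted distinct strings; idx(k) is the row of U matching names{k}
[U, ~, idx] = unique(names);

% Indexing U with idx rebuilds the original list, duplicates included
rebuilt = U(idx);
```

Each repeated string shows up as a repeated value in idx, so counting how often each integer occurs in idx gives the frequency of the corresponding string.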
And we can have a look at a random example of the output Z:
>> Z(5).name
ans =
!%
>> Z(5).freq
ans =
12
In newer versions you can use histogram (or histcounts, if you only need the counts) instead of hist. Note that vectorized code scales to larger arrays much better than loops do: even for one million elements in array C this method took only 4.87 seconds on my machine.
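The same idea could be applied to the layout described in the question. The following is a rough sketch, not a tested solution for the real data: it assumes nameData has a header row and a label column that should be skipped, as the loop bounds in the question suggest, it invents a small stand-in array, and it swaps hist for accumarray to do the counting. Because the question compares with strcmpi, the strings are lower-cased first, which also means the stored names lose their original capitalization.

```matlab
% Hypothetical stand-in for nameData: first row and first column are labels
nameData = {'hdr', 'c1',  'c2';
            'r1',  'Ann', 'bob';
            'r2',  'ANN', 'Cy';
            'r3',  'Bob', 'ann'};

sub = nameData(2:end, 2:end);          % drop the label row and column
[D, ~, X] = unique(lower(sub(:)));     % fold case so 'Ann' and 'ANN' match
Y = accumarray(X, 1);                  % count occurrences of each index
S = struct('name', D, 'frequency', num2cell(Y));
```

For the question's real array the column range would presumably be 2:102 rather than 2:end, and the remaining fields of the structure would have to be filled in separately.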
