Vectorization approach to find matching elements in a matrix.
Hi experts, I am new to MATLAB and I need your help with a vectorization approach to improve the speed of my code.
I have a big 10,000,000 x 15 matrix A containing integers in the range 1-255; each row has 15 different integers. The integers in different rows can be the same, but each row is unique. I also have a smaller 240,000 x 15 matrix B containing integers in the same range (i.e., 1-255), with the same conditions as matrix A.
I need to find the rows in matrix A that have at least 2/3 (i.e., at least 10 of 15) matching elements with each row of matrix B. If a row in matrix B has matching rows in matrix A, then store those rows as below; otherwise just skip it.
- C = rows from matrix A
- D = rows from matrix B
Currently I am using a for-loop to do this, but it takes a lot of time and does not even finish.
C = [];
D = [];
for i = 1:size(B,1)
    C = [C; A(sum(ismember(A,B(i,:)),2)>=10,:)];
    D = [D; B(i,:)];
end
Is there a vectorization approach that I can use to solve this problem? Or should I redefine my problem and change the entire solution? I have also heard that Parallel Computing could be a big help for this problem, but I do not know how to use it.
Thank you.
Accepted Answer
John D'Errico
28 Mar 2023
Edited: John D'Errico, 28 Mar 2023
The part of your code that is a serious problem is where you build the arrays by growing them at each step. This is a really bad thing to do in MATLAB, and it is why your code is so abysmally slow. Your code forces MATLAB to reallocate the memory for those arrays millions of times, each time for a slightly larger amount of memory. Then it is forced to copy EVERYTHING from the old version of the array to the new one. UGH. Don't do things that way!
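As an illustrative sketch of that point (not the full solution below), the loop from the question can be restructured so nothing grows inside the loop: per-iteration results go into preallocated cell arrays and are concatenated once at the end. The variable names A and B and the threshold 10 follow the question.

```matlab
% Sketch: collect results per iteration, concatenate once at the end.
nB = size(B,1);
Ccell = cell(nB,1);    % preallocated once, before the loop
Dcell = cell(nB,1);
for i = 1:nB
    hits = sum(ismember(A, B(i,:)), 2) >= 10;   % 10 = 2/3 of 15 elements
    if any(hits)                                 % skip rows of B with no match
        Ccell{i} = A(hits,:);
        Dcell{i} = B(i,:);
    end
end
C = vertcat(Ccell{:});   % one concatenation instead of millions
D = vertcat(Dcell{:});
```

This still does the same per-row work as the original loop, so it only removes the reallocation cost; the approach below avoids the brute-force scan itself.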
How would I solve this? I would probably define the distance between two rows as the number of elements in row 1 that are NOT in row 2. You already do something like that, using ismember.
But now you should be able to use rangesearch to locate rows in one array that are "close" to rows in the other array, and rangesearch can use a user-provided distance function like this.
3 Comments
Thank you for your reply.
I understand that building a matrix by concatenation is bad, but how do I change that? I thought about using a cell array, but it does not seem to improve the speed.
Also, I am not sure what you are explaining here:
"How would I solve this? I would probably define the distance between two rows as the number of elements in row 1 that are NOT in row 2. You already do something like that, using ismember.
But now you should be able to use knnsearch, to locate rows in one array that are "close" to rows in the other array, and knnsearch is able to use a user provided function."
Can you please give an example with clearer explanation? Thank you.
John D'Errico
28 Mar 2023
Edited: John D'Errico, 28 Mar 2023
(Note, at first, I was thinking that knnsearch would be the right tool, but of course, it is rangesearch that you need.)
First, you need a function that will compute the "distance" between rows. The distance is just the number of elements in vector 1, that are NOT in vector 2. So with 15 elements, a distance of zero means a perfect match. You used ismember, which should work.
Then use the rangesearch function.
help rangesearch
RANGESEARCH Radius search.
IDX = RANGESEARCH(X,Y,RADIUS) finds all the points in X that are
within distance RADIUS for points in Y. Rows of X and Y correspond to
observations, and columns correspond to variables. Y must have the same
number of columns as X. RADIUS is a numeric non-negative number
specifying the radius threshold. IDX is NY-by-1 cell array, where NY is
the number of rows in Y. IDX{I} contains the indices of points in X
whose distance to Y(I,:) are not greater than RADIUS, and these indices
are sorted in the ascending order of the corresponding distance values.
[IDX, D] = RANGESEARCH(X,Y,RADIUS) returns a NY-by-1 cell array D. D{I}
contains the distance values between Y(I,:) and the corresponding
points returned in IDX{I}.
[IDX, D]= RANGESEARCH(X,Y,'NAME1',VALUE1,...,'NAMEN',VALUEN) specifies
optional argument name/value pairs:
Name Value
'NSMethod' Nearest neighbors search method. Value is either:
'kdtree' - Creates and uses a kd-tree to find
nearest neighbors. 'kdtree' is only valid
when the distance metric is one of the
following metrics:
- 'euclidean'
- 'cityblock'
- 'minkowski'
- 'chebychev'
'exhaustive' - Uses the exhaustive search algorithm.
The distance values from all the points
in X to each point in Y are computed to
find nearest neighbors.
Default is 'kdtree' when the number of columns of X is
not greater than 10, X is not sparse, and the distance
metric is one of the above 4 metrics; otherwise,
default is 'exhaustive'.
'Distance' A string or a function handle specifying the distance
metric. The value can be one of the following:
'euclidean' - Euclidean distance (default).
'seuclidean' - Standardized Euclidean distance. Each
coordinate difference between X and a
query point is scaled by dividing by a
scale value S. The default value of S
is the standard deviation computed from
X, S=NANSTD(X). To specify another
value for S, use the 'Scale' argument.
'cityblock' - City Block distance.
'chebychev' - Chebychev distance (maximum coordinate
difference).
'minkowski' - Minkowski distance. The default
exponent is 2. To specify a different
exponent, use the 'P' argument.
'mahalanobis' - Mahalanobis distance, computed using a
positive definite covariance matrix C.
The default value of C is the sample
covariance matrix of X, as computed by
NANCOV(X). To specify another value for
C, use the 'Cov' argument.
'cosine' - One minus the cosine of the included
angle between observations (treated as
vectors).
'correlation' - One minus the sample linear
correlation between observations
(treated as sequences of values).
'spearman' - One minus the sample Spearman's rank
correlation between observations
(treated as sequences of values).
'hamming' - Hamming distance, percentage of
coordinates that differ.
'jaccard' - One minus the Jaccard coefficient, the
percentage of nonzero coordinates that
differ.
function - A distance function specified using @
(for example @DISTFUN). A distance
function must be of the form
function D2 = DISTFUN(ZI, ZJ),
taking as arguments a 1-by-N vector ZI
containing a single row of X or Y, an
M2-by-N matrix ZJ containing multiple
rows of X or Y, and returning an
M2-by-1 vector of distances D2, whose
Jth element is the distance between the
observations ZI and ZJ(J,:).
'P' A positive scalar indicating the exponent of Minkowski
distance. This argument is only valid when 'Distance'
is 'minkowski'. Default is 2.
'Cov' A positive definite matrix indicating the covariance
matrix when computing the Mahalanobis distance. This
argument is only valid when 'Distance' is
'mahalanobis'. Default is NANCOV(X).
'Scale' A vector S containing non-negative values, with length
equal to the number of columns in X. Each coordinate
difference between X and a query point is scaled by the
corresponding element of S. This argument is only valid
when 'Distance' is 'seuclidean'. Default is NANSTD(X).
'BucketSize' The maximum number of data points in the leaf node of the
kd-tree (default is 50). This argument is only
meaningful when kd-tree is used for finding nearest
neighbors.
'SortIndices' A flag to indicate if output distances and the
corresponding indices should be sorted in the order of
distances ranging from the smallest to the largest
distance. Default is true.
Example:
% Find the points in X whose distance are not greater than 1.5 to
% the points in Y, using the default distance metric 'euclidean'.
X = randn(100,5);
Y = randn(10, 5);
[idx, dist] = rangesearch(X,Y,1.5);
See also CREATENS, ExhaustiveSearcher, KDTreeSearcher, KNNSEARCH,
PDIST2.
Documentation for rangesearch
doc rangesearch
Other uses of rangesearch
ExhaustiveSearcher/rangesearch tall/rangesearch
KDTreeSearcher/rangesearch textanalytics/rangesearch
So the most difficult problem is writing the distance function. I wrote one below; it is vectorized, in the sense that it still works when the Y array has multiple rows, and that is what rangesearch needs. As a test, I'll try it on a small set of rows. I'll sort them so it is easy to see that it works.
X = sort(randi(30,1,10),2)
X = 1×10
1 4 4 6 12 16 17 18 22 29
Y = sort(randi(30,3,10),2)
Y = 3×10
3 5 8 10 14 19 21 23 27 29
1 4 8 20 21 22 23 24 29 30
2 2 3 7 7 8 9 18 22 28
setInclusionDistance(X,Y)
ans = 3×1
9
6
8
As you can see, it does work.
Now just use rangesearch.
X = randi(45,[1000,15]);
Y = randi(45,[100,15]);
[IDX,D] = rangesearch(X,Y,5,'Distance',@setInclusionDistance);
So the test I ran here was relatively small, but it returned results. A small distance means there were many hits between rows. A distance of 5 means there were only 5 elements in the two rows that were not the same. That will be pretty rare. In order to get a few hits on such a small sample, I restricted the elements to lie between 1 and 45. I deliberately chose a small set like this so you can see that some rows of Y had no hits at all.
IDX{1:5}
We can see that row 1 of Y was "close" to many rows of X, but there were no hits at all for rows 2 and 3.
function D = setInclusionDistance(X,Y)
% Distance between the single row X and each row of Y:
% the number of elements in that row of Y that are NOT in X
D = numel(X) - sum(ismember(Y,X),2);
end
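To recover the C and D arrays described in the question from the rangesearch output, a sketch (assuming the IDX, X, and Y names from the call above) might look like this:

```matlab
% Sketch: gather matched rows from the rangesearch output.
% IDX{i} holds the row indices of X within the radius of Y(i,:).
hasHit = ~cellfun(@isempty, IDX);        % rows of Y with at least one match
C = X(unique([IDX{hasHit}]), :);         % matched rows of X, deduplicated
D = Y(hasHit, :);                        % rows of Y that had a match
```

If you need the per-row pairing (which rows of X matched which row of Y), keep the IDX cell array itself rather than deduplicating into a single C matrix.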
In a larger test on my computer of the size you mentioned, it took a few minutes to finish, but this is a large problem.
Thank you John for your detailed reply.
I tried your code and it works exactly how I wanted. Thank you a lot for that.
However, there is just one thing. When it comes to large matrices, it takes quite a long time to execute, not just a few minutes like you said. How long exactly did it take to finish on your computer with datasets of the size I mentioned in the question?
More Answers (0)