Efficient method for finding index of closest value in very large array for a very large amount of examples

Question

0 votes

I have two very large one-dimensional arrays: 'aRef', which has around 11,000,000 elements, and 'aTest', which has around 10,000,000 elements. I need to find the index of the closest element in 'aRef' for every element in 'aTest'. 'aRef' is sorted, and 'aTest' can be sorted if that will help performance.

- Method 1: Returns an out-of-memory error because the arrays are far too large

diff = abs(bsxfun(@minus,aRef,aTest'));
[~, I] = min(diff);

- Method 2: Takes around 0.03 seconds per iteration (but varies greatly) and therefore around 300,000 seconds in total

for k = 1:n
  diff = abs(aRef- aTest(k));
  [~, I(k)] = min(diff);
end

- Method 3: Takes around 0.013 seconds per iteration and therefore around 130,000 seconds in total

 for k = 1:n
   i_lower  = find(aRef <= aTest(k),1,'last');
   i_higher = find(aRef >= aTest(k),1,'first');
 end

Is there a more efficient method for this that won't exhaust the memory or take so long to run?

Thanks for your help.

Comments: 1

Stephen23 on 29 May 2015

Do not use diff as a variable name, as this is the name of the very useful inbuilt diff function.

One can check if a name is already defined using which.
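A quick sketch of that check, assuming a replacement name absDiff (a hypothetical name chosen only for illustration) and a tiny example array. exist returns 0 when nothing by that name is defined, and which prints where an existing definition lives:

```matlab
% Check candidate names before using them as variables.
isTaken = exist('diff') ~= 0;      % nonzero: the name already refers to something
isFree  = exist('absDiff') == 0;   % zero: nothing by that name is currently defined
which diff                         % prints the location of the existing definition

aRef    = [1 4 10];                % small illustrative data
absDiff = abs(aRef - 3);           % does not shadow the built-in diff function
```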


Answer 1 (accepted answer)

Guillaume on 22 May 2015

5 votes

Note: Using diff as a variable name is not a good idea as it shadows the very useful diff function. Also, for method 2, your code does not show the preallocation of I. If you don't preallocate I, it will seriously slow down the code.
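As a minimal sketch of both points applied to Method 2, assuming small illustrative arrays and using the name d (my choice) in place of diff:

```matlab
aRef  = [1 4 10];                % small illustrative data
aTest = [0 2 3 8 12];

n = numel(aTest);
I = zeros(1, n);                 % allocate the full result once, before the loop
for k = 1:n
    d = abs(aRef - aTest(k));    % 'd' rather than 'diff', so diff() stays usable
    [~, I(k)] = min(d);          % index of the closest reference value
end
```

This only removes the cost of growing I on every iteration; each pass still scans all of aRef, so it does not change the overall scaling of the loop.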

Anyway, for two vectors of around 10,000 elements, the following is around 200 times faster than your method 1 on my machine.

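% Assumes aRef is a sorted row vector (use aRef(:).' if it is a column)
edges = [-Inf, mean([aRef(2:end); aRef(1:end-1)]), +Inf]; % midpoints between neighbouring elements
I = discretize(aTest, edges); % bin number = index of the nearest aRef element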

Basically, it constructs an edge vector halfway between each pair of adjacent elements of your aRef, and uses the histogram functions of MATLAB to get the index of the bin each aTest value falls in. discretize is new in R2015a. On R2014b, you can use the third return value of histcounts. On even older versions, use the second return value of histc (although histc behaves slightly differently with regard to the last bin).

% R2014b
[~, ~, I] = histcounts(aTest, edges); % probably slower than discretize
% before R2014b
[~, I] = histc(aTest, edges); % I is the same size as aTest; only the count output gains an extra element (for the +Inf bin)
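To make the midpoint idea concrete, here is a small worked sketch. The data are made up for illustration, the midpoints are computed with a plain average of neighbours rather than mean, and discretize needs R2015a or later. The loop at the end is only a brute-force cross-check:

```matlab
aRef  = [1 4 10];                 % sorted reference values (row vector)
aTest = [0 2 3 8 12];             % values to look up

% Midpoints between neighbouring reference values: [2.5 7]
mids  = (aRef(1:end-1) + aRef(2:end)) / 2;
edges = [-Inf, mids, Inf];

% Bin k spans edges(k) to edges(k+1), i.e. the values closest to aRef(k)
I = discretize(aTest, edges);

% Brute-force cross-check on the same data
Icheck = zeros(size(aTest));
for k = 1:numel(aTest)
    [~, Icheck(k)] = min(abs(aRef - aTest(k)));
end
sameResult = isequal(I, Icheck);
```

Here I comes out as [1 1 2 3 3]. A test value that sits exactly on a midpoint lands in the higher bin, whereas min picks the lower index, so the two approaches can differ on exact ties.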

Comments: 4 (2 earlier comments not shown)

Richard Barrett-Jolley on 1 Feb 2018

thanks!

Richard Barrett-Jolley on 1 Feb 2018

..actually none of these seem to quite work for me yet!


Answer 2

Damon Landau on 8 Aug 2018

3 votes

I = interp1(aRef, 1:numel(aRef), aTest, 'nearest', 'extrap');

This should be faster and (arguably) more straightforward than discretize.
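A small sketch of how the interp1 call behaves, using made-up data. Note that interp1 requires distinct sample points, so this assumes aRef has no repeated values (unique could be used to remove them first):

```matlab
aRef  = [1 4 10];                 % sorted, no repeated values
aTest = [0 2 3 8 12];             % includes values outside the range of aRef

% Interpolate the index vector 1:numel(aRef) rather than data values:
% 'nearest' snaps each query to the closest sample point, and 'extrap'
% lets out-of-range queries take the index of the nearest end point.
I = interp1(aRef, 1:numel(aRef), aTest, 'nearest', 'extrap');

nearestValues = aRef(I);          % the matched reference values
```

For this data I is [1 1 2 3 3] and nearestValues is [1 1 4 10 10].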


Answer 3

James Tursa on 22 May 2015

Edited: James Tursa on 22 May 2015

2 votes

This is a fairly simple mex routine if you start with both arrays sorted, since you don't need to make comparisons between all of the elements to get the answer. A simple stepping process can get the job done. E.g., the timing on the sizes you have listed:

>> aRef  = sort(rand(11000000,1));
>> aTest = sort(rand(10000000,1));
>> tic; aClosest = closest(aRef,aTest); toc
Elapsed time is 0.130508 seconds.

A shorter example to see that the results are as expected:

>> aRef = 1:10
aRef =
     1     2     3     4     5     6     7     8     9    10
>> aTest = sort(rand(1,20)*20-5)
aTest =
  Columns 1 through 15
   -4.4987   -2.2252   -2.0507   -1.2286   -1.0066   -0.9144   -0.2292   -0.0079    4.0035    4.6724    5.1178    5.3893    5.6688    7.4530   10.6621
  Columns 16 through 20
   10.9377   12.8271   12.9305   13.5336   14.6298
>> aClosest = closest(aRef,aTest)
aClosest =
     1     1     1     1     1     1     1     1     4     5     5     5     6     7    10    10    10    10    10    10

The mex routine is below. (Note the caveat about not handling Inf's and NaN's consistently, although that code could be added if needed)

// closest.c
// C = closest(A,B)
// Inputs:  A = full real double vector (reference), sorted ascending
//          B = full real double vector (test),      sorted ascending
// Output:  C = index of closest value in A to value in B
//          i.e., A(C(i)) is the closest value in A to B(i)
//          Ties are resolved in favor of higher index
//          Is not currently coded to handle Inf's and NaN's consistently
//          C is the same size as B
// Programmer:  James Tursa
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    size_t i, j, k, m, n;
    double *A, *B, *C;
//    
// Check input arguments
    if( nrhs != 2 || 
        !mxIsDouble(prhs[0]) || !mxIsDouble(prhs[1]) ||
        mxIsComplex(prhs[0]) || mxIsComplex(prhs[1]) ||
        mxIsSparse(prhs[0])  || mxIsSparse(prhs[1]) ) {
        mexErrMsgTxt("Need exactly two full double vectors as input");
    }
// Check number of output arguments
    if( nlhs > 1 ) {
        mexErrMsgTxt("Too many outputs");
    }
// Get the number of elements involved
    m = mxGetNumberOfElements(prhs[0]);
    n = mxGetNumberOfElements(prhs[1]);
// Disallow empty reference vector
    if( m == 0 ) {
        mexErrMsgTxt("Reference vector (1st input) cannot be empty");
    }
// Create uninitialized output
    if( mxGetM(prhs[1]) == 1 ) {
        plhs[0] = mxCreateUninitNumericMatrix( 1, n, mxDOUBLE_CLASS, mxREAL );
    } else {
        plhs[0] = mxCreateUninitNumericMatrix( n, 1, mxDOUBLE_CLASS, mxREAL );
    }
// If B is empty, simply return empty result
    if( n == 0 ) {
        return;
    }
// Get the data pointers
    A = mxGetPr(prhs[0]);
    B = mxGetPr(prhs[1]);
    C = mxGetPr(plhs[0]);
    k = 0;
// Assign 1st index to all values B that are less than 1st A value
    while( k < n && B[k] < *A ) {
        C[k++] = 1.0;
    }
// Step through until B is between two A values, then test for result
    i = 0;
    for( j=k; j<n; j++ ) {
        while( i+1 < m && B[j] >= A[i+1] ) i++;
        if( i+1 == m ) break;
        if( B[j] - A[i] < A[i+1] - B[j] ) {
            C[j] = i + 1;
        } else {
            C[j] = i + 2;
        }
    }
// Assign last index to all values B that are more than last A value
    while( j < n ) {
        C[j++] = m;
    }
}
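For readers without a C compiler, the same stepping idea can be sketched in plain MATLAB. This is an illustrative translation rather than the routine above: the function name closestSorted is hypothetical, it assumes both inputs are sorted ascending, A is non-empty, and neither contains NaN or Inf, and the interpreted loop will likely be slower than the mex version.

```matlab
function C = closestSorted(A, B)
% closestSorted  Index of the closest value in A for each value in B.
%   Hypothetical plain-MATLAB sketch of the stepping process.
%   Assumes A and B are sorted ascending, A is non-empty, no NaN/Inf.
%   Ties are resolved in favor of the higher index.
m = numel(A);
C = zeros(size(B));               % preallocate; same size as B
i = 1;                            % current position in A, never moves backwards
for j = 1:numel(B)
    % Advance while the next reference value is not above B(j)
    while i < m && B(j) >= A(i+1)
        i = i + 1;
    end
    % B(j) is now below A(1), at or above A(m), or between A(i) and A(i+1)
    if i < m && (B(j) - A(i)) >= (A(i+1) - B(j))
        C(j) = i + 1;             % upper neighbour is at least as close
    else
        C(j) = i;                 % lower neighbour (or an end point) is closest
    end
end
end
```

Because i only ever increases, the whole pass costs on the order of numel(A) + numel(B) steps, which is why sorting both arrays first pays off. Saved as closestSorted.m, calling it with aRef = 1:10 and the 20-element aTest from the short example above gives the same indices as closest.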

Comments: 1

Ben M on 29 May 2015

Thanks for your response!


Answer 4

Jos (10584) on 29 May 2015

0 votes

I point you to my NEARESTPOINT function, available on the File Exchange:

http://www.mathworks.com/matlabcentral/fileexchange/8939

