Vectorized Levenshtein distances between arrays of text labels?
I have to compare "N" ID labels (several thousand) to each other in order to determine which are mistypings of each other. The labels have up to 20 characters. As a preliminary approach, I am considering calculating the N(N-1)/2 Levenshtein distances between them and using clustering to determine which labels correspond to the same ID. The work is being done in Python, but none of the Levenshtein distance implementations I have found are vectorized. That is, the N×N array of distances has to be iterated through on an element-by-element basis in order to calculate its values.
I thought that there might be a vectorized Matlab version of Levenshtein distance, which I could package for deployment and invocation from Python. I found a few, shown in the Annex below, as well as an "editDistance" function available in R2023b. None of these vectorize the calculation of the N(N-1)/2 distances. I'm surprised that a vectorized implementation doesn't exist. Am I missing something obvious?
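For concreteness, the brute-force plan described above can be sketched in Python as follows. This is a minimal sketch, not any particular library's implementation; `levenshtein` and `pairwise_distances` are illustrative names, and the inner double loop is exactly the element-by-element iteration the question hopes to avoid.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def pairwise_distances(labels):
    """Condensed list of the N*(N-1)/2 pairwise distances,
    laid out like scipy.spatial.distance.pdist output."""
    n = len(labels)
    return [levenshtein(labels[i], labels[j])
            for i in range(n) for j in range(i + 1, n)]
```

For N in the thousands this is O(N^2) distance computations, each itself O(len(a)*len(b)), which is what motivates looking for a vectorized or precomputed alternative.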
Annex: Matlab implementations of Levenshtein distance
Accepted Answer
Additional Answers (2)
Christopher Creutzig
14 Jun 2024
For this application, I recommend using editDistanceSearcher: you probably don't want to look at clusters larger than some size anyway, and knnsearch on an editDistanceSearcher will give you the neighbors of interest. It performs precomputation, trading memory for faster amortized query time, as long as you call it often enough on the same dataset, which sounds like exactly what you want to do.
7 Comments
FM
17 Jun 2024
Christopher Creutzig
17 Jun 2024
I was thinking of taking the 30,011 labels as the base set. If you then take the 5 closest elements within an edit distance of 3 for each of them, is that a start?
FM
17 Jun 2024
> To get the 5 closest elements to each of the 30,011 labels, I need to calculate 30,010 distances for each label
No, you don't. That's exactly what editDistanceSearcher helps you avoid. If it helps to think of it that way, it precomputes “bubble-shaped detectors” around the input values, avoiding most computations of distances that are larger than the original threshold. (Those bubbles aren't perfectly “distance 3” shaped, so some distances still need to be computed, but they will filter out the overwhelming majority of points further out.)
data = string(rand([1,10000]));              % 10,000 random numeric strings as test labels
searcher = editDistanceSearcher(data,3);     % precompute, maximum edit distance 3
timeit(@() editDistance(data(1),data))       % brute force: one label against all
timeit(@() knnsearch(searcher,data(1),K=5))  % 5 nearest neighbors via the searcher
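For readers working on the Python side of this problem, the same "precompute, then prune by the threshold" idea can be sketched with a BK-tree. This is an illustration and an assumption, not how editDistanceSearcher is implemented internally; the class and function names are made up for the example. The pruning relies on the triangle inequality: a child stored at distance k from a node can only contain matches if k is within radius of the query's distance to that node.

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over a metric (here: edit distance)."""
    def __init__(self, dist, items):
        self.dist = dist
        it = iter(items)
        self.root = (next(it), {})  # node = (word, {distance: child})
        for item in it:
            self._add(item)

    def _add(self, item):
        node = self.root
        while True:
            word, children = node
            d = self.dist(item, word)
            if d == 0:
                return  # duplicate
            if d in children:
                node = children[d]
            else:
                children[d] = (item, {})
                return

    def query(self, item, radius):
        """All stored words within `radius` of `item`, as (dist, word)."""
        out, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = self.dist(item, word)
            if d <= radius:
                out.append((d, word))
            # triangle inequality: only children keyed in
            # [d - radius, d + radius] can possibly match
            stack.extend(child for k, child in children.items()
                         if d - radius <= k <= d + radius)
        return sorted(out)
```

Building the tree once and querying each label with a small radius avoids computing most of the N(N-1)/2 distances, which is the same trade (precomputation and memory for amortized speed) described above.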
Christopher Creutzig
19 Jun 2024
There are many ways to define clusters. What exactly is the one you are looking for?
If your clusters are defined by “groups of words separated by edit distance,” i.e., you regard all those as a cluster where you have stepping stones to get from A to B without making any step with a Levenshtein distance of, say, 4 or more, then knowing all the neighbors of your words is all you need.
That is not data you would put into kmeans or kmedoids, of course. Such neighboring information transforms the clustering into a graph-theoretic problem. You'd set up an undirected graph with the neighborhood information you have and ask for its connected components (or block-cut components to avoid spurious connections from outliers).
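The connected-components step above can be sketched in Python with a union-find over the neighbor pairs. This is a minimal sketch under the assumption that the within-threshold neighbor pairs have already been found (e.g. by a nearest-neighbor search); the function names are illustrative.

```python
def connected_components(labels, neighbor_pairs):
    """Cluster labels by treating within-threshold neighbor
    pairs as edges of an undirected graph."""
    parent = {x: x for x in labels}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every neighbor pair.
    for a, b in neighbor_pairs:
        parent[find(a)] = find(b)

    # Group labels by their root representative.
    clusters = {}
    for x in labels:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

Each resulting set is one cluster of labels reachable from each other through stepping stones of small edit distance; singleton sets are labels with no close neighbor.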
FM
20 Jun 2024