Vectorized Levenshtein distances between arrays of text labels?
I have to compare "N" ID labels (several thousand) to each other in order to determine which are mistypings of each other. The labels have up to 20 characters. As a preliminary approach, I am considering calculating the N(N-1)/2 Levenshtein distances between them and using clustering to determine which labels correspond to the same ID. The work is being done in Python, but none of the Levenshtein distance implementations I have found are vectorized. That is, the N×N array of distances has to be iterated through on an element-by-element basis in order to calculate its values.
I thought that there might be a vectorized Matlab version of Levenshtein distance, which I could package for deployment and invocation from Python. I found a few, shown in the Annex below, as well as an "editDistance" function available in R2023b. None of these vectorize the calculation of the N(N-1)/2 distances. I'm surprised that a vectorized implementation doesn't exist. Am I missing something obvious?
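For concreteness, the brute-force plan described above can be sketched in Python as follows. This is a minimal sketch, not any particular library's implementation; `levenshtein` and `pairwise_distances` are illustrative names, and the inner double loop is exactly the element-by-element iteration the question hopes to avoid.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def pairwise_distances(labels):
    """Condensed list of the N*(N-1)/2 pairwise distances,
    laid out like scipy.spatial.distance.pdist output."""
    n = len(labels)
    return [levenshtein(labels[i], labels[j])
            for i in range(n) for j in range(i + 1, n)]
```

For N in the thousands this is O(N^2) distance computations, each itself O(len(a)*len(b)), which is what motivates looking for a vectorized or precomputed alternative.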
Annex: Matlab implementations of Levenshtein distance
Accepted Answer
Additional Answers (2)
Christopher Creutzig
14 Jun 2024
For this application, I recommend using editDistanceSearcher: you probably don't want to look at clusters larger than some size anyway, and knnsearch on an editDistanceSearcher will give you the neighbors of interest. It performs precomputation, trading memory for faster amortized query time, as long as you call it often enough on the same dataset, which sounds like exactly what you want to do.
7 Comments
FM
17 Jun 2024
Christopher Creutzig
17 Jun 2024
I was thinking of taking the 30,011 labels as the base set. If you then take the 5 closest elements within an edit distance of 3 for each of them, is that a start?
FM
17 Jun 2024
> To get the 5 closest elements to each of the 30,011 labels, I need to calculate 30,010 distances for each label
No, you don't. That's exactly what editDistanceSearcher helps you avoid. If it helps to think of it that way, it precomputes “bubble-shaped detectors” around the input values, avoiding most computations of distances that are larger than the original threshold. (Those bubbles aren't perfectly “distance 3” shaped, so some distances still need to be computed, but they will filter out the overwhelming majority of points further out.)
data = string(rand([1,10000]));              % 10,000 random numeric strings as test labels
searcher = editDistanceSearcher(data,3);     % precompute, maximum edit distance 3
timeit(@() editDistance(data(1),data))       % brute force: one label against all
timeit(@() knnsearch(searcher,data(1),K=5))  % 5 nearest neighbors via the searcher
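For readers working on the Python side of this problem, the same "precompute, then prune by the threshold" idea can be sketched with a BK-tree. This is an illustration and an assumption, not how editDistanceSearcher is implemented internally; the class and function names are made up for the example. The pruning relies on the triangle inequality: a child stored at distance k from a node can only contain matches if k is within radius of the query's distance to that node.

```python
def levenshtein(a: str, b: str) -> int:
    """Two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """Burkhard-Keller tree over a metric (here: edit distance)."""
    def __init__(self, dist, items):
        self.dist = dist
        it = iter(items)
        self.root = (next(it), {})  # node = (word, {distance: child})
        for item in it:
            self._add(item)

    def _add(self, item):
        node = self.root
        while True:
            word, children = node
            d = self.dist(item, word)
            if d == 0:
                return  # duplicate
            if d in children:
                node = children[d]
            else:
                children[d] = (item, {})
                return

    def query(self, item, radius):
        """All stored words within `radius` of `item`, as (dist, word)."""
        out, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = self.dist(item, word)
            if d <= radius:
                out.append((d, word))
            # triangle inequality: only children keyed in
            # [d - radius, d + radius] can possibly match
            stack.extend(child for k, child in children.items()
                         if d - radius <= k <= d + radius)
        return sorted(out)
```

Building the tree once and querying each label with a small radius avoids computing most of the N(N-1)/2 distances, which is the same trade (precomputation and memory for amortized speed) described above.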
Christopher Creutzig
19 Jun 2024
There are many ways to define clusters. What exactly is the one you are looking for?
If your clusters are defined by “groups of words separated by edit distance,” i.e., you regard all those as a cluster where you have stepping stones to get from A to B without making any step with a Levenshtein distance of, say, 4 or more, then knowing all the neighbors of your words is all you need.
That is not data you would put into kmeans or kmedoids, of course. Such neighboring information transforms the clustering into a graph-theoretic problem. You'd set up an undirected graph with the neighborhood information you have and ask for its connected components (or block-cut components to avoid spurious connections from outliers).
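The connected-components step above can be sketched in Python with a union-find over the neighbor pairs. This is a minimal sketch under the assumption that the within-threshold neighbor pairs have already been found (e.g. by a nearest-neighbor search); the function names are illustrative.

```python
def connected_components(labels, neighbor_pairs):
    """Cluster labels by treating within-threshold neighbor
    pairs as edges of an undirected graph."""
    parent = {x: x for x in labels}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every neighbor pair.
    for a, b in neighbor_pairs:
        parent[find(a)] = find(b)

    # Group labels by their root representative.
    clusters = {}
    for x in labels:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

Each resulting set is one cluster of labels reachable from each other through stepping stones of small edit distance; singleton sets are labels with no close neighbor.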
FM
20 Jun 2024