Levenshtein distance: String scanning

I'm in desperate need of help with creating a Levenshtein distance function with scanning behavior.
Basically there is a target string (s) which is supposed to scan a longer experimental string (s2). For example a target string (s) of length m = 3 should be compared to the first 3 entries of an experimental string (s2). Afterwards the target string (s) should be "moved" by 1 step to be compared to the entries 2-4 of the experimental string (s2) and so on.
Creating such a scanning behavior is not a problem, the real problems are insertions and deletions. After an insertion or deletion I still want to compare 2 strings of equal length. Therefore when inserting another entry, the last entry of the experimental string (s2) has to be removed, and when deleting an entry the next value of the experimental string should be added. Simply spoken the string lengths (m and n) should always be constant and have the same length.
Here a simple example:
For the strings s = [1 2 3] and s2 = [1 3 2 3 ...] I want to compare parts of the same length m = 3, therefore the first comparison should be between s = [1 2 3] and s2 = [1 3 2]. A deletion at position #2 should now change s2 to [entry1 entry3 entry4], in this case to s2 = [1 2 3], leading to an edit distance of 1 (with a deletion cost of 1). Here, an insertion of the value 2 at position #2 should lead to s2 = [entry1 insertion entry2] and therefore also s2 = [1 2 3].
Any ideas how I could achieve this?
I'm using an Levenshtein function similar to this:
% Variables
s = [1 2 3];
m=length(s);
s2 = [1 3 2 3 1 2 3];
t = s2(1:1:m);
n=length(t);
% Edit Matrix
mat=zeros(m+1,n+1);
for i=1:1:m
mat(i+1,1)=i;
end
for j=1:1:n
mat(1,j+1)=j;
end
for i=1:m
for j=1:n
if (s(i) == t(j))
mat(i+1,j+1)=mat(i,j);
else
mat(i+1,j+1)= min([mat(i+1,j)+1,mat(i,j+1)+1,mat(i,j)+1]);
end
end
end
%Edit distance
d = mat(m+1,n+1);

댓글 수: 5

Stephen23
Stephen23 2018년 12월 13일
@Marcel Dorer: your actual question is not very clear:
  1. Calculating the Levenshtein distance is not hard, so this is not the problem.
  2. Creating a sliding window is also not hard, a loop would do it easily.
  3. You mention making insertions and deletions so that the strings match, but do you not describe any restrictions on these... so without any restriction, you can insert/delete as you desire until the strings match. So far this seems to be equivalent to your current description:
S2 = repmat(S1,1,N)
but I doubt that that is actually what you want. What exactly do you want? Please show the complete expected output vector for your two example input vectors (or whatever output you expect to get - all of it please).
Marcel Dorer
Marcel Dorer 2018년 12월 14일
편집: Marcel Dorer 2018년 12월 14일
Sorry, I try to be more clear.
What I have
I have 2 strings, a short one (s) and a very long one (s2). I now want to know if the long string (s2) contains the short string (s). I use the Levenshtein distance to compare those strings and I am mainly interested in the edit distance outputs (d), more specifically if those values are below a predefined threshold.
To achieve this I "scan" the long string s2 with the shorter string s. This is done by comparing the string s with a string t, which is a part of s2 with length = length(s). Levenshtein distance is calculated and s is compared to the next string t, which again is a portion of s2, this time shifted by 1. This process is repeated until the end of s2 is reached. This way I get mutiple Levenshtein distance outputs and based on the predefined threshold I know how often and on what positions the 2 strings (s and s1) matched.
The problem
The Levenshtein distance uses 3 operations: Insertion of an entry, Deletion of an entry and Substitution of an entry. Since I am comparing 2 stings of the same length = length(s), the operations Insertion (length+1) and Deletion (length-1) would lead to unequal string lengths and force the usage of the opposit operation later on. This should not be the case, string lengths should stay equal.
To achieve this I want to perform those operations on the string t. Here, in case of a Deletion (which would lead to a string with length(t) = length(s)-1), I should be able to append the string t with the next value of s2 to keep the length the same. In case of an Insertion (length(t) = length(s)+1) I want to remove the last value of string t, to again keep the lengths the same at length(s).
As far as I understand I would need to create multiple Edit Matrices, since the operations Insertion and Deletion should lead to new strings t.
I hope this clarifies my problem. I don't have much time now, but I will also try to clarify the OP later on. Thanks for your feedback!
Stephen23
Stephen23 2018년 12월 14일
편집: Stephen23 2018년 12월 14일
"I now want to know if the long string (s2) contains the short string (s). "
Do you want the actual steps required to make this transformation from one string to the other? The code you gave (which seems to work, calculating the Levenshtein distance) does not actually need to make those changes to the input strings: it simply calculates the distance (which is usually what people are interested in). This means there are no strings with values being inserted or deleted, and nothing changes size. So it is not clear what your explanation relates to.
In any case, your goal "I now want to know if the long string (s2) contains the short string (s). " does not require Levenshtein distance at all: I can think of several simple ways to check this using MATLAB. What is the actual goal?
Marcel Dorer
Marcel Dorer 2018년 12월 14일
편집: Marcel Dorer 2018년 12월 14일
Ok sorry maybe I should've been even clearer. I meant:
I want to know if the long string (s2) contains the short string (s) or a variation of s which lies within my predefined Levenshtein distance threshold.
This is the reasoning behind me using the Levenshtein distance in the first place: To allow for some slight variation. As I wrote above: "I am mainly interested in the edit distance outputs (d), more specifically if those values are below a predefined threshold."
So my goal is to get those values d, which should be the result of comparisons of 2 strings of equal length, with Insertions and Deletions not modifying the length but instead removing the surplus or refilling to length(s) with the next values of s1.
My code above works for basic Levenshtein. The scanning of one string with the other is not the problem. The problem is the Operations Insertion and Deletion altering the length of one of the strings. String length should still be the same after every operation.
By "simply calculating the distance" the algotirhm changes those lengths because it adds and removes values as part of its process and this should not happen. Sure your variables aren't overwritten, but the edit matrix is based on adding and deleting values of your strings.
Hopefully this will make my problem clear:
The edit matrix is always a square, and Deletions and Insertions are vertical and horizontal movements. Therefore for every Deletion there has to be an Insertion later on and for every Insertion there needs to be a Deletion later on. This should not happen, instead, after those operations the algorithm should continue with a new edit matrix by either appending t with the next value of s2 (in case of Deletion) or removing the last value of t (in case of Insertion).
Basically the suplementary Deletion/Insertion (which is vertical or horizontal movement) following every Insertion/Deletion should be done automatically by continuing from a new edit distance matrix.
Edited with additional clarification

댓글을 달려면 로그인하십시오.

답변 (0개)

카테고리

도움말 센터File Exchange에서 Get Started with MATLAB에 대해 자세히 알아보기

제품

릴리스

R2018b

질문:

2018년 12월 13일

댓글:

2018년 12월 14일

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by