The "Did you mean..." routine

Question

Chad Greene 2013년 8월 15일

3
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/84830-the-did-you-mean-routine

댓글: Chad Greene 2014년 8월 16일

Is there a way to access the routine that looks for approximate string matches? I have an n x 1 cell array. I know how to search for exact matches, or case-insensitive matches, or matches of the first few characters, but my list has tens of thousands of entries and sometimes the user does not know exactly what phrase to search for. I'd like my script to be robust enough to allow the user to search the list below for a phrase like "algebra 2" and offer a list of close matches or, "Did you mean algebra II?"

math
science
reading
chemistry 
algebra I
algebra II
gym

Ideas?

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Chad Greene 2013년 8월 15일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/84830-the-did-you-mean-routine#answer_94375

MATLAB Online에서 열기

It's not exactly what I want, but this is close enough to work for me:

% given the input aclass: 
aclass = 'algebra 1'; 
x=strcmpi(strtrim(aclass),classnames); % logical array of matches 
if sum(x)==0
    fprintf('Class name not found. Did you mean one of these? \n')
    nearbynames = strncmpi(strtrim(aclass),classnames,4);
    disp(classnames(nearbynames))
    return
end

This assumes the user has at least the first 4 letters correct, and offers suggestions if they get it wrong.

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Answer 2

Cedric 2013년 8월 15일

2
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/84830-the-did-you-mean-routine#answer_94384

편집: Cedric 2013년 8월 17일

MATLAB Online에서 열기

See EDIT 2 and 3 below for a better version of the initial answer.

Here is another idea, just for the fun of it:

 classNames = {'math', 'science', 'reading', 'chemistry', 'algebra I', ...
               'algebra II', 'gym'} ;
 aClass = 'algebra 1' ;
 % - Define a "spectrum" function based on letter freq.
 spec = @(name) accumarray(upper(name.')-31, ones(size(name)), [60 1]) ;
 % - Compute spec of chosen class, compare (norm based) with other class
 %  names, and find order of names based on min norm.
 spec_aClass = spec(aClass) ;
 spec_dist   = arrayfun(@(k) norm(spec(classNames{k})-spec_aClass), ...
                        1:numel(classNames)) ;
 [~,idx] = sort(spec_dist) ;

Then propose as many ordered names as you want; for example, having run the code above..

 >> nearbyNames = classNames(idx(1:4))
 nearbyNames = 
    'algebra I'    'algebra II'    'reading'    'math'

Of course, the method for computing a "spectrum" is a bit too simplistic, and we should filter aClass and classNames first to ensure that no character is outside of ASCII range 32-90 when "uppered" (which could be done by eliminating them =[] or by replacing them with spaces).

Note that you could also try to see how LIKE is implemented in SQL servers.

EDIT 1: just had another idea .. if aClass is not found in classNames, you could just define buffer = sort([classNames, aClass]), then look for aClass in buffer and propose +/-1 or 2 elements around the matching position. This would be less robust than what I first proposed though, in the sense that if aClass were defined as 'lagebra II' (permuted first two letters), it would not be close to 'algebra II' at all.

EDIT 2: just made a function out of it, with a simple character filtering mechanism.

EDIT 3: corrected a few small mistakes which didn't alter the functioning.

 function nearbyNames = getNearbyNames(classNames, aClass, n)
    persistent P__clean ;
    if isempty(P__clean)
       P__clean = 32 + zeros(256, 1) ;
       P__clean(48:90) = 48:90 ;
    end
    spec = @(name) accumarray(P__clean(upper(name))-31, ones(size(name)), [59 1]) ;
    spec_aClass = spec(aClass) ;
    spec_dist   = cellfun(@(name) norm(spec(name)-spec_aClass), classNames) ;
    [~,idx]     = sort(spec_dist) ;
    nearbyNames = classNames(idx(1:min(n,length(idx)))) ;
 end

With that ..

 >> getNearbyNames(classNames, aClass, 4)
 ans = 
    'algebra I'    'algebra II'    'reading'    'math'

Note that instead of n for the size of nearbyNames to return, you could pass a float (threshold) which characterizes a maximal norm for being defined as nearby, and have something like:

    [sds,idx]   = sort(spec_dist) ;
    nearbyNames = classNames(idx(sds<=threshold)) ;

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

Joseph Cheng 2014년 8월 7일

편집: Joseph Cheng 2014년 8월 7일

wouldn't ismember() work out for this as well?

Y = cellfun(@(x) sum(ismember(lower(x),lower('algebra I')))/((length(x)+length('algebra I'))/2),X)

where the above x is the cell list of your classes/subject. then if you get a value of 1 there is an exact match. if there are no result of 1 then put a threshold say 80% match or highest match would be the suggested result.

here looking at the average of the input string and the comparison string you could put in a partial string for the input and look for the highest matches.

Additional conditions could be used when there are more than 1 match (since above method would allow for misspelled words). then look for which one has the similar order or direct string compare (one to one and pad if necessary.) And hope not too many anagrams are in your list.

Chad Greene 2014년 8월 16일

I've needed this on a few occasions, so I turned it into a function called strlookup. Thanks again for your help.

댓글을 달려면 로그인하십시오.

Answer 3

Nikolay S. 2013년 11월 26일

1
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/84830-the-did-you-mean-routine#answer_116435

Having asked the same question I find those functions promising: http://www.mathworks.com/matlabcentral/fileexchange/36981-find-nearest-matching-string-from-a-set http://www.mathworks.com/matlabcentral/fileexchange/17585-calculation-of-distance-between-strings

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

Answer 4

Image Analyst 2013년 8월 15일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/84830-the-did-you-mean-routine#answer_94368

Perhaps drilling down into the citations in this might lead somewhere: http://infolab.stanford.edu/~backrub/google.html

댓글 수: 2
없음 표시없음 숨기기

Chad Greene 2013년 8월 15일

Oh my, that looks like a bit of an undertaking.

Image Analyst 2013년 8월 15일

I didn't think it would be easy. I'm sure web sites have whole teams of people working on parsing inputs and trying to figure out what people really meant, or figure out similar terms, when they typed in something.

댓글을 달려면 로그인하십시오.

The "Did you mean..." routine

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

추가 답변 (3개)

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 2
없음 표시없음 숨기기

참고 항목

카테고리

태그

제품

Community Treasure Hunt

The "Did you mean..." routine

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

추가 답변 (3개)

댓글 수: 3 이전 댓글 1개 표시이전 댓글 1개 숨기기

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 2 없음 표시없음 숨기기

참고 항목

카테고리

태그

제품

Community Treasure Hunt

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 3
이전 댓글 1개 표시이전 댓글 1개 숨기기

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 2
없음 표시없음 숨기기