Match names from two different columns - Comparing strings of different lengths

Question

Maria 2014년 9월 6일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths

댓글: Maria 2014년 9월 6일

I have a cell variable A with about 4000 rows and 5 columns:

     C1    c2                   c3        c4          c5
  A={1997 'Michelle Applebaum' 'Salmon'  'BASIC'     'STEEL'
    1997  'Jambardella Arnold' 'Butter'  'BASIC'     'STEEL'
    1999  'Cai von Rumohr'     'Cow'     'CAPITAL'   'AEROPLANE'
    2011  'Pierre Smith'       'Milk'    'GOOD'      'AEROPLANE'
    2004  'Jinder Kauffman'    'Star'    'CAPITAL'   'PHONE'

And I have a second cell variable B with about 200 000 rows and 8 columns:

     c1         c2      c3       c4     c5       c6                     c7      c8
  B={2013  29  2225  'ELD1'  29  'SMITH         P'  4817  'HAYWOOD'
     2013  70  2628  'CCRN'  70  'FRANCE        J'  11688  'CANTORFZ'
     2013  02  952  'ABFS'  02  'KAUFFMAN      J'  356  'BUCK'
     2013  20297  157  'DUK'  20297  'ARNOLD        J'  1382  'LAWRENCE'
     2013  78362  260  'APA'  78362  'ARIF          A'  2213  'STIFEL'}

The focus is on c2 of A and on c6 of B.

C2 of A gives the complete first name and last name (and sometimes other names in between) of an individual.
C6 of B gives the last name (in capital letters) and only the initial of the first name of an individual.

I am trying to match both cells . So in case both last names are the same (or silmilar) I would like to add the columns of B to A.

       C1    c2                   c3       c4      c5
    A={1997 'Michelle Applebaum' 'Salmon' 'BASIC' 'STEEL'
       1997 'Jambardella Arnold ' 'Butter' 'BASIC' 'STEEL' 2013 20297 157 'DUK' 20297 'ARNOLD  J' 1382 'LAWRENCE'  
       1999 'Cai von Rumohr'     'Cow'    'CAPITAL' 'AEROPLANE'
      2011  'Pierre Smith'       'Milk'   'GOOD'    'AEROPLANE' 2013 29  2225 'ELD1' 29 'SMITH P'  4817  'HAYWOOD'
      2004  'Jinder Kauffman'     'Star'   'CAPITAL' 'PHONE'     2013 02  952  'ABFS' 02'KAUFFMAN J' 356  'BUCK'

I never matched cells comparing names, and the function has to be case insensitive, ignore points, commas or spaces, and also ignore if the same letter appears twice in a row. I say this because variable A was wrote by me by hand, so it's possible the last name is not exactly equal in both variables. `strrep & strtrim` functions can help solving the problem.

Can someone help me please? Thank you.

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Guillaume 2014년 9월 6일

1
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths#answer_150945

MATLAB Online에서 열기

You've not specified what happened to compound surnames (like 'von Ruhmor' in your example) or first names (like 'Jean-Pierre'). it also appears that the fields in B are fixed width so what happens to very long surname? So, I've assumed that only the last part of a compound surname is in B, only the initial of compound first name is in B and for very long surname B is larger. I'm also ignoring case where names don't match exactly:

%extract names from A and transform in uppercase 'SURNAME 1STLETTEROFNAME':
namesfromA = upper(regexprep(A(:,2), '([A-Z]).* ([A-Z][a-z]+)', '$2 $1'));
%x=extract names from B and remove extra spaces:
namesfromB = regexprep(B(:, 6), '([A-Z]+) +([A-Z])', '$1 $2');
%find the intersection and postion of matches. Assume there is only ever one match
[~, ia, ib] = intersect(namesfromA, namesfromB);
%add matches to A:
A(ia, 6:13) = B(ib, :);

If surnames don't match exactly, then it gets a lot more complicated. There are a numbers of algorithm ( Damerau-Levenshtein, Hamming) that allows you to find how two strings are similar, but I don't think any of them are built-in in matlab.

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

Maria 2014년 9월 6일

MATLAB Online에서 열기

Thank you very much for all the information. I know that getting all the matches will be a problem. I was first thinking of doing something like this:

A(:, end+1) = lower(strrep(strtrim(A(:,2)),' ',''));
B(:,end+1)= lower(strrep(strtrim(B(:,6)),' ',''));

But then I cannot use strcmpi or strncmpi because I am working with different sizes & content even if there is a match. So I was trying to build something that for instance would do a match if they find 5 letters in a row that are equal.

But anyway, your code works and is really helpful! I will read more about the numbers of algorithm. Thank you again.

댓글을 달려면 로그인하십시오.

Answer 2

Stephen23 2014년 9월 6일

2
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/153737-match-names-from-two-different-columns-comparing-strings-of-different-lengths#answer_150944

You are trying to match words in two cell arrays, including allowing for punctuation characters and possibly slightly different spellings... not an easy task to perform! You will also need to consider initials and name order.

It really depends on how different the strings might be: if the differences only include repeated characters, then you might be able to get away with creating some regexp pattern to help with that.

If the differences are more complicated, then you need to find a similarity measure. Some common similarity measures are the Rabin-Karp algorithm, the Levenshtein distance, the Needleman–Wunsch algorithm, or the Hamming distance.

You will also find submissions on MATLAB File Exchange that support several of these measures for analyzing string similarity, and plenty of examples online.

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

Maria 2014년 9월 6일

Thank you very much for all the information. I will do some research about the topic of similarity measure. Thank you.

댓글을 달려면 로그인하십시오.

Match names from two different columns - Comparing strings of different lengths

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

추가 답변 (1개)

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

참고 항목

카테고리

태그

제품

Community Treasure Hunt

Match names from two different columns - Comparing strings of different lengths

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

추가 답변 (1개)

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

참고 항목

카테고리

태그

제품

Community Treasure Hunt

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기