regexp: what am I missing from the documentation?

Question

Raymond MacNeil on 28 Dec 2019


https://kr.mathworks.com/matlabcentral/answers/498274-regexp-what-am-i-missing-from-the-documentation

Edited: Stephen23 on 28 Dec 2019

I have tried to read the regexp documentation carefully, and I am able to successfully implement regexp in the simplest cases. For example, given:

test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'

I can use the following code to retrieve each of the separate names, without the trailing numeral and/or whitespace:

exp = '\w*[^1-9\s]';
MyMatch = regexp(test, exp, 'match')

MyMatch = 1×8 cell array

Columns 1 through 6

{'John'} {'Ron'} {'James'} {'Dongo'} {'Chloe'} {'Billgo'}

Columns 7 through 8

{'Marie'} {'Aaron'}
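Why this pattern works is worth spelling out. `\w*` first grabs the whole word including the digit, then backtracks until the next character is something other than 1-9 or whitespace. The following is a minimal sketch of that behavior, not part of the original post; the names `pat` and `sample` and the sample string are made up for illustration. It also shows one edge case: `[^1-9\s]` does not exclude `0`.

```matlab
% Sketch only: 'pat' and 'sample' are illustrative names, not from the original post.
pat    = '\w*[^1-9\s]';   % called 'pat' because 'exp' would shadow the built-in exp()
sample = 'Ann0 Bob5';     % hypothetical input containing a trailing zero

% \w* is greedy and backtracks until the following character is allowed
% by [^1-9\s]. A trailing '0' is allowed, so it stays in the match.
viaBacktracking = regexp(sample, pat, 'match');          % {'Ann0', 'Bob'}

% If names consist only of ASCII letters (an assumption), this is more direct.
viaLetters = regexp(sample, '[A-Za-z]+', 'match');       % {'Ann', 'Bob'}
```

On the original `test` string both patterns return the same eight names, because none of them ends in `0`.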

However, despite much effort, I cannot achieve a more complex result (example provided below). I try to limit the number of questions I post to the community, but this is a case where I would like the experts to point out where I am going wrong in my use of regexp for a (slightly more complex) result. Note that this is not a specific problem I am trying to solve. I merely invented a 'random' problem in an effort to become more adept in my use of regexp.

For the following example, assume that all name instances in a character vector test have one of two possible problems.

1. A single digit immediately follows the name (e.g., James7).
2. The name has 'go' appended to its end.

NB: We know in advance there are no name instances in test that would require us to consider the possibility that 'go' is just the natural ending of a name instance (e.g., Hugogo).

Thus, given the character vector:

test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'

The desired output is:

MyMatch = 1×8 cell array

Columns 1 through 6

{'John'} {'Ron'} {'James'} {'Don'} {'Chloe'} {'Bill'}

Columns 7 through 8

{'Marie'} {'Aaron'}

Examples of attempted (and failed) solutions:

% Given the documentation's statement, "If you specify a lookahead assertion before an expression,
% the operation is equivalent to a logical AND."
MyMatch = regexp(test, '(?<=\w*[^*go\s)\w*[^1-9\s]', 'match') 
% Attempts to implement 'OR' logic: (exp|exp)
% (1)
[tok, mat] = regexp(test, '(\w+)([^*go\s]|[^1-9\s])', 'tokens', 'match');
vertcat(tok{:}) % then extract col1
% (2)
[tok, mat] = regexp(test, '((\w+)([^*go\s]))|((\w+)([^1-9\s]))', 'tokens', 'match')
vertcat(tok{:}) % then extract col1
% ...

And so on and so forth...
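A note on why the attempts above fail: a bracket expression such as `[^*go\s]` matches exactly one character that is not `*`, `g`, `o`, or whitespace. It does not mean "not followed by the string go". The sketch below contrasts that with a negative lookahead, which does express "not followed by". The sample strings and variable names are invented for illustration and are not from the original post.

```matlab
% Sketch only: sample strings and variable names are illustrative.

% [^*go\s] excludes the single characters '*', 'g', 'o' and whitespace,
% so only 'x' and 'y' match here, one character at a time.
singleChars = regexp('gox*y', '[^*go\s]', 'match');    % {'x', 'y'}

% "Not followed by 'go'" is written as a negative lookahead, (?!go).
% The first 'Don' (followed by 'go') is rejected; the second starts at index 7.
startIdx = regexp('Dongo Don', 'Don(?!go)', 'start');  % 7
```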

1. What is your approach/solution (using regexp) to the above? Is it better to take a multipronged approach, e.g., convert to a cell array first, use two regexp calls, etc.?
2. What is your approach/solution (using regexp) given:

test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo' % note Hugogo
% we want the 'MyMatch' or 'MyTokens' cell array to contain 'Hugo'

Thanks for your time, and Happy New Year!

Sincerely,

Ray

Comments

dpb on 28 Dec 2019

I feel your pain w/ regular expressions...I've struggled mightily and also have never been able to "get it" as far as figuring out the syntax. Some seem to have a knack...

As far as the specifics of the last, I think that's going to be a hard nut to crack unless you assume the trailing characters come from a known set and/or the last two characters can never be repeated, regardless of what those two might be (I didn't try, but I suspect one can find real cases in the wild). If it's just a learning exercise, it probably doesn't matter.

Personally, I'm now old enough and retired from active consulting, so I've taken becoming adept with REs off my bucket list; if it were 30-40 years ago again, knowing what I now know about them and my particular block, my approach would be to go find a text and use it. The doc and what I've been able to find online are OK for poking around, and occasionally one can find a way to solve a particular problem that way, but at least for me, trying to learn so as to be able to apply them in general going forward, they've been of very little help.

Stephen23 on 28 Dec 2019

Edited: Stephen23 on 28 Dec 2019

The regexp documentation focuses on the function rather than on the regular expression syntax. For more detailed explanations of the syntax see:

https://www.mathworks.com/help/matlab/matlab_prog/regular-expressions.html

https://www.mathworks.com/help/matlab/matlab_prog/lookahead-assertions-in-regular-expressions.html

https://www.mathworks.com/help/matlab/matlab_prog/tokens-in-regular-expressions.html

https://www.mathworks.com/help/matlab/matlab_prog/dynamic-regular-expressions.html

You might also like to download my FEX submission iregexp, which creates an interactive figure for trying different regular expressions and parse strings, and seeing regexp's outputs:

https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-tool

Raymond MacNeil on 28 Dec 2019

Thanks, Stephen. I have previously examined these additional pages, but I should probably dig into them more.


Answer 1

Stephen23 on 28 Dec 2019


https://kr.mathworks.com/matlabcentral/answers/498274-regexp-what-am-i-missing-from-the-documentation#answer_407971

Edited: Stephen23 on 28 Dec 2019

A direct interpretation of your description "assume that all name instances in a character vector test have one of two possible problems. 1. A single digit immediately follows the name (e.g., James7) 2. The name has 'go' appended to its end." is to use one lookahead assertion:

>> test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';
>> regexp(test,'\w+(?=(\d|go)\>)','match')
ans = 
    'John'    'Ron'    'James'    'Don'    'Chloe'    'Bill'    'Marie'    'Aaron'    'Hugo'
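To see why this returns 'Hugo' for 'Hugogo': `\w+` first consumes the whole word, then backtracks until the lookahead finds a digit or 'go' that is immediately followed by the end of the word (`\>`). The sketch below probes that behavior on single words. 'Margot' is a made-up input that does not appear in the question, and the variable names are illustrative.

```matlab
% Sketch only: 'Margot' is a hypothetical input, not from the original post.
pat = '\w+(?=(\d|go)\>)';

% Backtracking stops at 'Hugo', since the remaining 'go' ends the word.
a = regexp('Hugogo', pat, 'match');                % {'Hugo'}

% 'go' inside 'Margot' is followed by 't', so \> fails and nothing matches.
b = regexp('Margot', pat, 'match');                % empty cell array

% Without \>, the lookahead accepts 'go' anywhere and the name is cut short.
c = regexp('Margot', '\w+(?=(\d|go))', 'match');   % {'Mar'}
```

The `\>` anchor is what keeps the suffix test at the end of each word rather than anywhere inside it.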

Or similarly, using a token for the name and a non-capturing group for the suffix:

>> tkn = regexpi(test,'(\w+)(?:\d|go)\>','tokens');
>> [tkn{:}]
ans = 
    'John'    'Ron'    'James'    'Don'    'Chloe'    'Bill'    'Marie'    'Aaron'    'Hugo'
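The question also asked whether a multi-step approach might be preferable. One possible sketch, assuming the names are separated by whitespace, is to split the character vector into a cell array first and then strip the suffix from each element with `regexprep`. This is an alternative to the single-call solutions above, not the answer's own method, and the variable names are illustrative.

```matlab
% Sketch only: a two-step alternative, assuming whitespace-separated names.
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% Step 1: split on whitespace (the default delimiter of strsplit).
parts = strsplit(test);

% Step 2: remove one trailing digit or one trailing 'go' from each element.
% '$' anchors at the end of each element, so 'Hugogo' loses only the last 'go'.
names = regexprep(parts, '(\d|go)$', '');
```

With the `test` string above, `names` is a 1×9 cell array ending in 'Hugo'. The single-call lookahead version is more compact; the two-step version may be easier to read and debug while learning.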


Raymond MacNeil on 28 Dec 2019

Nice, thanks!
