regexp: what am I missing from the documentation?
I have tried to carefully read the regexp documentation, and I am able to successfully implement regexp in the simplest cases. For example, given:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'
I can use the following code to retrieve each of the separate names, without the trailing numeral or whitespace:
exp = '\w*[^1-9\s]';
MyMatch = regexp(test, exp, 'match')
MyMatch = 1×8 cell array
Columns 1 through 6
{'John'} {'Ron'} {'James'} {'Dongo'} {'Chloe'} {'Billgo'}
Columns 7 through 8
{'Marie'} {'Aaron'}
However, despite much effort, I cannot achieve a more complex result (example provided below). I try to limit the number of questions I post to the community, but here is a situation where I ask if the experts can point to where I am erring in my use of regexp to give a (slightly more complex) result. Note that this is not a specific problem I am trying to solve. I merely invented a 'random' problem in an effort to become more adept in my use of regexp.
For the following example, assume that all name instances in a character vector test have one of two possible problems.
- A single digit immediately follows the name (e.g., James7)
- The name has 'go' appended to its end.
NB: We know in advance there are no name instances in test that would require us to consider the possibility that 'go' is just the natural ending of a name instance (e.g., Hugogo).
Thus, given the character vector:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'
The desired output is:
MyMatch = 1×8 cell array
Columns 1 through 6
{'John'} {'Ron'} {'James'} {'Don'} {'Chloe'} {'Bill'}
Columns 7 through 8
{'Marie'} {'Aaron'}
Examples of attempted (and failed) solutions:
% Given the documentation's statement, "If you specify a lookahead assertion before an expression,
% the operation is equivalent to a logical AND."
MyMatch = regexp(test, '(?<=\w*[^*go\s)\w*[^1-9\s]', 'match')
% Attempts to implement 'OR' logic: (exp|exp)
% (1)
[tok, mat] = regexp(test, '(\w+)([^*go\s]|[^1-9\s])', 'tokens', 'match');
vertcat(tok{:}) % then extract col1
% (2)
[tok, mat] = regexp(test, '((\w+)([^*go\s]))|((\w+)([^1-9\s]))', 'tokens', 'match')
vertcat(tok{:}) % then extract col1
% ...
And so on and so forth...
- What is your approach/solution (using regexp) to the above? Is it better to take a multipronged approach, e.g., convert to a cell array first, use two regexp calls, etc.?
- What is your approach/solution (using regexp) given:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo' % note Hugogo
% we want the 'MyMatch' or 'MyTokens' cell array to contain 'Hugo'
Thanks for your time, and Happy New Year!
Sincerely,
Ray
Comments: 4
Stephen23
28 Dec 2019
Edited: Stephen23
28 Dec 2019
The regexp documentation focuses on the function rather than on the regular expression syntax. For more detailed explanations of the syntax see:
You might also like to download my FEX submission iregexp, which creates an interactive figure for trying different regular expressions and parse strings, and seeing regexp's outputs:
Accepted Answer
Stephen23
28 Dec 2019
Edited: Stephen23
28 Dec 2019
A direct interpretation of your description "assume that all name instances in a character vector test have one of two possible problems. 1. A single digit immediately follows the name (e.g., James7) 2. The name has 'go' appended to its end." is to use one lookahead assertion:
>> test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';
>> regexp(test,'\w+(?=(\d|go)\>)','match')
ans =
'John' 'Ron' 'James' 'Don' 'Chloe' 'Bill' 'Marie' 'Aaron' 'Hugo'
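A minimal sketch of why this pattern succeeds where the bracket-based attempts in the question did not. Square brackets define a character class that matches exactly one character, so [^*go\s] means "one character that is not *, g, o, or whitespace", not "anything other than the text go". Choosing between whole strings needs a group with |, and the lookahead (?=...) checks what follows without adding it to the match. The variable names oneChar, alt, withLook and noLook are only illustrative.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% A bracket expression matches ONE character: [^go] is "any single
% character other than g or o", so nothing in 'go' qualifies.
oneChar = regexp('go', '[^go]', 'match');

% Alternation between whole strings needs a group, not brackets.
alt = regexp('7 go', '(\d|go)', 'match');

% With the lookahead, the trailing digit or 'go' is tested but not returned.
withLook = regexp(test, '\w+(?=(\d|go)\>)', 'match');

% Without the lookahead, the same suffix becomes part of the match,
% so each match is the whole original word.
noLook = regexp(test, '\w+(\d|go)\>', 'match');
```

Because \w+ is greedy, it first takes the whole word and then backtracks just far enough for the suffix test to succeed, which is why 'Hugogo' yields 'Hugo' and not 'Hu'.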
Or similarly using a token followed by a non-capturing group:
>> tkn = regexpi(test,'(\w+)(?:\d|go)\>','tokens');
>> [tkn{:}]
ans =
'John' 'Ron' 'James' 'Don' 'Chloe' 'Bill' 'Marie' 'Aaron' 'Hugo'
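On the question of a multipronged approach: one possible alternative, sketched here under the assumption that the names are separated by whitespace, is to split the text into words first and then strip the unwanted suffix from each word with regexprep. This is only an illustration, not part of the answer above; the variable names words and names are made up for the example.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% Step 1: split on whitespace into a cell array of words.
words = strsplit(test);

% Step 2: remove one trailing digit or one trailing 'go' from each word.
% The $ anchor ties the alternation to the end of each word, so only
% the final 'go' of 'Hugogo' is removed.
names = regexprep(words, '(\d|go)$', '');
```

Splitting first keeps each regular expression short, at the cost of two passes over the data; the single-call lookahead version does the same job in one pass.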