regexprep $ and look-behind -- bug or expected ?

Walter Roberson on 1 Dec 2016
Edited: Stephen23 on 3 Dec 2016
This question deals with regexprep and look-around operators.
Suppose you have
YourCell = {'2016-11-22 00:00:00.8'; '2016-11-22 00:00:00.9'; '2016-11-22 00:00:01'};
and you want to automatically add something like '.0' to any entry that does not end in a period followed by a digit. One way to do that is
NewCell = regexprep(YourCell, '(:\d\d)$', '$1.0', 'lineanchors')
which takes the approach of matching a colon followed by two digits as a group, followed by end of line, and substituting for that group the group followed by .0. In the regexprep replacement, $1 means "first grouped object". So we know that the task can be done.
But when I was investigating, I took a different tactic, involving look-around operators. I decided I would look for an end-of-line that was not preceded by (period followed by a digit), and for that end of line I would substitute '.0'.
The look-behind-for-match operator in regexp / regexprep is (?<=EXPRESSION) and the look-behind-for-non-match operator is (?<!EXPRESSION) . These are documented at https://www.mathworks.com/help/matlab/ref/regexp.html#input_argument_expression in the "Lookaround Assertions" section. Accordingly, it seems to me that I should be able to use either
regexprep(YourCell, '(?<!\.\d)$', '.0', 'lineanchors')
or
regexprep(YourCell, '$(?<!\.\d)', '.0', 'lineanchors')
However, no replacement is made.
Is the look-behind incorrect? Well, we can test by changing the $ to :
regexprep(YourCell, '(?<!\.\d):', '.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00.000.000.8'
'2016-11-22 00.000.000.9'
'2016-11-22 00.000.001'
and observing that we do get replacement of colons (that do not happen to be preceded by a period and a digit) with the target string. We can check whether the look-around is being ignored with
regexprep(YourCell, '(?<!:\d\d):', '.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00.000:00.8'
'2016-11-22 00.000:00.9'
'2016-11-22 00.000:01'
and seeing that the pattern is in fact actively used, that the colon is only matched when not preceded with colon-digit-digit . So the look-around is working.
Is the end-of-line anchor the problem?
regexprep(YourCell, '(\d)$', '$1.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00:00:00.8.0'
'2016-11-22 00:00:00.9.0'
'2016-11-22 00:00:01.0'
No, the only matched digit was the one at the end of the line, so the line anchor is matching properly.
The difficulty only occurs when you have a look-around in conjunction with a line anchor. The problem happens for the ^ anchor as well, as can be explored with
regexprep(YourCell, '(^)(?=\d)', '$1.0', 'lineanchors') %nothing happens!
regexprep(YourCell, '(-)(?=2)', '$1.0', 'lineanchors') %works
regexprep(YourCell, '^(2)', '$1.0', 'lineanchors') %works
The question is then whether it is expected that look-arounds do not work in conjunction with line-anchors, or if this is a MATLAB bug ?
Though I do see the line anchor working if at least one real character is matched:
regexprep(YourCell, '^(?=2).*', 'BLOB','lineanchors') %works, substitutes
regexprep(YourCell, '^(?=3).*', 'BLOB','lineanchors') %no substitutions, which is correct
You can see that my lookbehind works by testing with
regexp(YourCell, '.(?<!\.\d)$', 'match','lineanchors')
ans =
3×1 cell array
{}
{}
{1×1 cell}
>> ans{3}
ans =
cell
'1'
So it looks like a successful match of a zero-width expression is not triggering a replacement when I think it should.
  5 Comments
Walter Roberson on 2 Dec 2016
Stephen, you might be amused by one I found yesterday:
a='I want THAAAAAT APPPPPLE ):):): totally unprepared';
regexp(a, '(.+){3,:}', 'match')
Do not do this on a session with unsaved work, as it will run away beyond the ability to control-C and you will have to kill the process.
The non-malformed regexp would have been
regexp(a, '(.+){3,}', 'match')
per isakson on 3 Dec 2016
Edited: per isakson on 3 Dec 2016
@Walter, without reading "Do not do this on a session with unsaved work" I tried your code with a couple of unsaved files. That was dumb! Neither Ctrl+C nor pausing had any effect.
Good news:
  • Switching between files in the MATLAB editor and copy-and-paste to Notepad++ still worked (R2016a).
  • Save and Save All in the toolstrip saved the files (R2016a).


Accepted Answer

per isakson on 3 Dec 2016
Edited: per isakson on 3 Dec 2016
Expected; it has something to do with "$" not matching "one or more" characters in the string. This works:
regexprep( YourCell, '(?<!\.\d)$', '.0', 'emptymatch' )
ans =
'2016-11-22 00:00:00.8'
'2016-11-22 00:00:00.9'
'2016-11-22 00:00:01.0'
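To make the difference easier to see, here is a minimal sketch that runs the same pattern with and without 'emptymatch'. It relies on the documented default of regexp / regexprep, 'noemptymatch', which skips zero-length matches. The variable names (pat, unchanged, fixed, startDefault, startEmpty) are only illustrative and are not from the thread.

```matlab
% Data from the question.
YourCell = {'2016-11-22 00:00:00.8'; '2016-11-22 00:00:00.9'; '2016-11-22 00:00:01'};

% Zero-width pattern: end of text that is not preceded by ".digit".
pat = '(?<!\.\d)$';

% Default ('noemptymatch'): zero-length matches are skipped,
% so every entry comes back unchanged.
unchanged = regexprep(YourCell, pat, '.0');

% 'emptymatch': the zero-length match at the end is replaced.
fixed = regexprep(YourCell, pat, '.0', 'emptymatch');

% The same option controls whether regexp reports the match at all.
startDefault = regexp(YourCell{3}, pat);                % start indices, default
startEmpty   = regexp(YourCell{3}, pat, 'emptymatch');  % start indices, empty matches kept

disp(isequal(unchanged, YourCell))  % were the entries left untouched by default?
disp(fixed)                         % entries after replacement with 'emptymatch'
disp(isempty(startDefault))         % is the match list empty by default?
disp(isempty(startEmpty))           % is it still empty with 'emptymatch'?
```

The '.(?<!\.\d)$' test in the question matched because the leading '.' consumes one character, so that match is not zero-length. The same option would presumably apply to the '^' examples in the question as well, although that is not shown in this thread.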
  2 Comments
Walter Roberson on 3 Dec 2016
Thanks, Per!
Stephen23 on 3 Dec 2016
Edited: Stephen23 on 3 Dec 2016
@per isakson: nicely caught. emptymatch is about the only option I have not used, so this gives me a good excuse to play some more... the learning never stops :)


More Answers (0)
