regexprep $ and look-behind -- bug or expected ?
This question deals with regexprep and look-around operators.
Suppose you have
YourCell = {'2016-11-22 00:00:00.8'; '2016-11-22 00:00:00.9'; '2016-11-22 00:00:01'};
and you want to automatically add something like '.0' to any entry that does not end in a period followed by a digit.
Over in http://www.mathworks.com/matlabcentral/answers/313845-converting-string-cell-array-to-dates I provided the answer
NewCell = regexprep(YourCell, '(:\d\d)$', '$1.0', 'lineanchors')
which takes the approach of matching a colon followed by two digits as a group, followed by end of line, and for that group substitutes the group followed by '.0'. In the regexprep replacement, $1 means "first grouped object". So we know that the task can be done.
But when I was investigating, I took a different tactic, involving look-around operators. I decided I would look for an end-of-line that was not preceded by (period followed by a digit), and for that end of line I would substitute '.0'.
The look-behind-for-match operator in regexp / regexprep is (?<=EXPRESSION) and the look-behind-for-non-match operator is (?<!EXPRESSION) . These are documented at https://www.mathworks.com/help/matlab/ref/regexp.html#input_argument_expression in the "Lookaround Assertions" section. Accordingly, it seems to me that I should be able to use either
regexprep(YourCell, '(?<!\.\d)$', '.0', 'lineanchors')
or
regexprep(YourCell, '$(?<!\.\d)', '.0', 'lineanchors')
However, no replacement is made.
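As a quick reference for the two look-behind operators described above, here is a minimal sketch on a made-up string. The string 'a1 b2' and the variable names are illustrative assumptions, not part of the original question. Both patterns consume a digit, so neither depends on a zero-width match.

```matlab
% Hypothetical sample text, chosen only to illustrate look-behind.
s = 'a1 b2';

% (?<=a)\d : a digit that IS preceded by 'a'
afterA = regexprep(s, '(?<=a)\d', 'X');      % 'aX b2'

% (?<!a)\d : a digit that is NOT preceded by 'a'
notAfterA = regexprep(s, '(?<!a)\d', 'X');   % 'a1 bX'

disp(afterA)
disp(notAfterA)
```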
Is the look-behind incorrect? Well, we can test by changing the $ to a colon:
regexprep(YourCell, '(?<!\.\d):', '.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00.000.000.8'
'2016-11-22 00.000.000.9'
'2016-11-22 00.000.001'
and observing that we do get replacement of colons (that do not happen to be preceded by a period and a digit) with the target string. We can check whether the look-around is being ignored with
regexprep(YourCell, '(?<!:\d\d):', '.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00.000:00.8'
'2016-11-22 00.000:00.9'
'2016-11-22 00.000:01'
and seeing that the pattern is in fact actively used: the colon is only matched when it is not preceded by colon-digit-digit. So the look-around is working.
Is the end-of-line anchor the problem?
regexprep(YourCell, '(\d)$', '$1.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00:00:00.8.0'
'2016-11-22 00:00:00.9.0'
'2016-11-22 00:00:01.0'
No, the only matched digit was the one at the end of the line, so the line anchor is matching properly.
The difficulty only occurs when you have a look-around in conjunction with a line anchor. The problem happens for the ^ anchor as well, as can be explored with
regexprep(YourCell, '(^)(?=\d)', '$1.0', 'lineanchors') %nothing happens!
regexprep(YourCell, '(-)(?=2)', '$1.0', 'lineanchors') %works
regexprep(YourCell, '^(2)', '$1.0', 'lineanchors') %works
The question is then whether it is expected that look-arounds do not work in conjunction with line anchors, or whether this is a MATLAB bug?
Though I do see the line anchor working if at least one real character is matched:
regexprep(YourCell, '^(?=2).*', 'BLOB','lineanchors') %works, substitutes
regexprep(YourCell, '^(?=3).*', 'BLOB','lineanchors') %no substitutions, which is correct
You can see that my lookbehind works by testing with
regexp(YourCell, '.(?<!\.\d)$', 'match','lineanchors')
ans =
3×1 cell array
{}
{}
{1×1 cell}
>> ans{3}
ans =
cell
'1'
So it looks like a successful match of a zero-width expression is not triggering a replacement when I think it should.
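One way to sidestep the zero-width issue, building on the `.(?<!\.\d)$` test above, is to capture the final character and put it back in the replacement, so that the match is never empty. This is only a sketch, not the approach from the linked answer, and the variable name `Padded` is an assumption made for illustration.

```matlab
YourCell = {'2016-11-22 00:00:00.8'; '2016-11-22 00:00:00.9'; '2016-11-22 00:00:01'};

% Capture the last character, then require (via look-behind at the end of
% the line) that the text does not end in period-digit. Because one real
% character is consumed, the match is not zero-width.
Padded = regexprep(YourCell, '(.)(?<!\.\d)$', '$1.0', 'lineanchors');

disp(Padded)
```

Only the third entry should change, since it is the only one the `regexp` test above matched.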
Comments: 5
per isakson, 3 Dec 2016 (edited 3 Dec 2016)
@Walter, without having read "Do not do this on a session with unsaved work", I tried your code with a couple of unsaved files. That was dumb! Neither Ctrl+C nor pausing had any effect.
Good news:
- switching between files in the MATLAB editor and copy-and-paste to Notepad++ still worked (R2016a).
- Save and Save All in the tool strip saved the files (R2016a).
Accepted Answer
per isakson, 3 Dec 2016 (edited 3 Dec 2016)
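This is expected. On its own, "$" (with or without the look-behind) does not consume any characters of the string, so the match is zero-length, and regexprep ignores zero-length matches unless the 'emptymatch' option is given. This works: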
regexprep( YourCell, '(?<!\.\d)$', '.0', 'emptymatch' )
ans =
'2016-11-22 00:00:00.8'
'2016-11-22 00:00:00.9'
'2016-11-22 00:00:01.0'
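To see why the option matters, the sketch below runs the same zero-width pattern with and without 'emptymatch', and checks the result against the capture-group version from the question. 'emptymatch' is a documented option of both regexp and regexprep; the variable names are assumptions made for illustration.

```matlab
YourCell = {'2016-11-22 00:00:00.8'; '2016-11-22 00:00:00.9'; '2016-11-22 00:00:01'};
expr = '(?<!\.\d)$';   % zero-width: end of text not preceded by period-digit

% Default behaviour ('noemptymatch'): zero-length matches are ignored.
hitsDefault = regexp(YourCell{3}, expr, 'match');

% With 'emptymatch' the zero-length match is reported as well.
hitsEmpty = regexp(YourCell{3}, expr, 'match', 'emptymatch');

% Same comparison for the replacement itself.
unchanged = regexprep(YourCell, expr, '.0');                 % default options
viaEmpty  = regexprep(YourCell, expr, '.0', 'emptymatch');   % accepted answer
viaGroup  = regexprep(YourCell, '(:\d\d)$', '$1.0');         % capture-group version

fprintf('matches by default: %d, with emptymatch: %d\n', ...
        numel(hitsDefault), numel(hitsEmpty));
disp(isequal(unchanged, YourCell))   % true if nothing was replaced by default
disp(isequal(viaEmpty, viaGroup))    % true if both approaches agree
```

By the same reasoning, adding 'emptymatch' to the `(^)(?=\d)` example from the question should let that replacement go through as well, since that pattern is also zero-width.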