regexprep, $ and look-behind -- bug or expected?

This question deals with regexprep and look-around operators.
Suppose you have
YourCell = {'2016-11-22 00:00:00.8'; '2016-11-22 00:00:00.9'; '2016-11-22 00:00:01'};
and you want to automatically add something like '.0' to any entry that does not end in a period followed by a digit. This can be done with
NewCell = regexprep(YourCell, '(:\d\d)$', '$1.0', 'lineanchors')
which matches a colon followed by two digits as a group, followed by end of line, and replaces that group with the group itself followed by '.0'. In the regexprep replacement, $1 means "the first captured group". So we know that the task can be done.
But when I was investigating, I took a different tack, involving look-around operators. I decided I would look for an end-of-line that was not preceded by (period followed by a digit), and for that end of line I would substitute '.0'.
The look-behind-for-match operator in regexp / regexprep is (?<=EXPRESSION) and the look-behind-for-non-match operator is (?<!EXPRESSION) . These are documented at https://www.mathworks.com/help/matlab/ref/regexp.html#input_argument_expression in the "Lookaround Assertions" section. Accordingly, it seems to me that I should be able to use either
regexprep(YourCell, '(?<!\.\d)$', '.0', 'lineanchors')
or
regexprep(YourCell, '$(?<!\.\d)', '.0', 'lineanchors')
However, no replacement is made.
Is the look-behind incorrect? Well, we can test by changing the $ to :
regexprep(YourCell, '(?<!\.\d):', '.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00.000.000.8'
'2016-11-22 00.000.000.9'
'2016-11-22 00.000.001'
and observing that we do get replacement of colons (that do not happen to be preceded by a period and a digit) with the target string. We can check whether the look-around is being ignored with
regexprep(YourCell, '(?<!:\d\d):', '.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00.000:00.8'
'2016-11-22 00.000:00.9'
'2016-11-22 00.000:01'
and seeing that the pattern is in fact actively used: the colon is only matched when it is not preceded by colon-digit-digit. So the look-around is working.
Is the end-of-line anchor the problem?
regexprep(YourCell, '(\d)$', '$1.0', 'lineanchors')
ans =
3×1 cell array
'2016-11-22 00:00:00.8.0'
'2016-11-22 00:00:00.9.0'
'2016-11-22 00:00:01.0'
No, the only matched digit was the one at the end of the line, so the line anchor is matching properly.
The difficulty only occurs when you have a look-around in conjunction with a line anchor. The problem happens for the ^ anchor as well, as can be explored with
regexprep(YourCell, '(^)(?=\d)', '$1.0', 'lineanchors') %nothing happens!
regexprep(YourCell, '(-)(?=2)', '$1.0', 'lineanchors') %works
regexprep(YourCell, '^(2)', '$1.0', 'lineanchors') %works
The question is then whether it is expected that look-arounds do not work in conjunction with line anchors, or whether this is a MATLAB bug?
Though I do see the line anchor working if at least one real character is matched:
regexprep(YourCell, '^(?=2).*', 'BLOB','lineanchors') %works, substitutes
regexprep(YourCell, '^(?=3).*', 'BLOB','lineanchors') %no substitutions, which is correct
You can see that my lookbehind works by testing with
regexp(YourCell, '.(?<!\.\d)$', 'match','lineanchors')
ans =
3×1 cell array
{}
{}
{1×1 cell}
>> ans{3}
ans =
cell
'1'
So it looks like a successful match of a zero-width expression is not triggering a replacement when I think it should.
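A minimal sketch of a workaround consistent with that observation: consume one real character in front of the look-behind and put it back with $1, so that the match is no longer zero-width. The variable name Padded is only illustrative and does not appear elsewhere in this thread.

```matlab
YourCell = {'2016-11-22 00:00:00.8'; '2016-11-22 00:00:00.9'; '2016-11-22 00:00:01'};

% Capture the final character of the line, then assert (looking back from the
% end of the line) that the last two characters are not "period, digit".
% The match has nonzero width, so regexprep performs the replacement.
Padded = regexprep(YourCell, '(.)(?<!\.\d)$', '$1.0', 'lineanchors');
disp(Padded)
```

Only the third entry should change here, since it is the only one for which the test with regexp and 'match' above reports a match.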
  5 Comments
Walter Roberson on 2 Dec 2016
Stephen, you might be amused by one I found yesterday:
a='I want THAAAAAT APPPPPLE ):):): totally unprepared';
regexp(a, '(.+){3,:}', 'match')
Do not do this on a session with unsaved work, as it will run away beyond the ability to control-C and you will have to kill the process.
The non-malformed regexp would have been
regexp(a, '(.+){3,}', 'match')
per isakson on 3 Dec 2016
Edited: per isakson on 3 Dec 2016
@Walter, without reading "Do not do this on a session with unsaved work" I tried your code with a couple of unsaved files. That was dumb! Neither Ctrl+C nor pausing had any effect.
Good news:
  • Switching between files in the MATLAB Editor and copy & paste to Notepad++ still worked (R2016a).
  • Save and Save All in the toolstrip saved the files (R2016a).


Accepted Answer

per isakson on 3 Dec 2016
Edited: per isakson on 3 Dec 2016
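Expected. It has to do with "$" not matching "one or more" characters in the string: the match is zero-length, and by default regexprep ignores zero-length matches unless the 'emptymatch' option is given. This works: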
regexprep( YourCell, '(?<!\.\d)$', '.0', 'emptymatch' )
ans =
'2016-11-22 00:00:00.8'
'2016-11-22 00:00:00.9'
'2016-11-22 00:00:01.0'
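To make the effect of the option visible side by side, here is a small sketch that runs the same pattern with and without 'emptymatch'. The variable names (pat, unchanged, fixed, noMatch, oneMatch) are illustrative only, and the comments describe what I would expect from the documented behaviour of the option rather than a captured session.

```matlab
YourCell = {'2016-11-22 00:00:00.8'; '2016-11-22 00:00:00.9'; '2016-11-22 00:00:01'};
pat = '(?<!\.\d)$';   % end of text not preceded by "period, digit"

% Default: zero-length matches are ignored, so no replacement is made.
unchanged = regexprep(YourCell, pat, '.0');

% With 'emptymatch', the zero-length match at the end of the third entry
% is replaced by '.0'.
fixed = regexprep(YourCell, pat, '.0', 'emptymatch');

% regexp honours the same option when reporting matches.
noMatch  = regexp(YourCell{3}, pat, 'match');                % zero-length match dropped
oneMatch = regexp(YourCell{3}, pat, 'match', 'emptymatch');  % zero-length match kept
```

Comparing unchanged with fixed shows that the look-behind and the anchor were matching all along. Only the handling of the empty match differs.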
  2 Comments
Stephen23 on 3 Dec 2016
Edited: Stephen23 on 3 Dec 2016
@per isakson: nicely caught. emptymatch is about the only option I have not used, so this gives me a good excuse to play some more... the learning never stops :)
