Wanted: Examples on how to use "Dynamic Regular Expressions" to debug regular expressions

Question

per isakson, 14 Oct 2017 (edited 19 Oct 2017)

https://kr.mathworks.com/matlabcentral/answers/361293-wanted-examples-on-how-to-use-dynamic-regular-expressions-to-debug-regular-expressions

I am trying to develop a function, is_string_constant, which takes a text string of MATLAB code and returns a logical vector that is true at the positions of string constants, e.g. '%s%f' (but not at ';c=d' in a=b';c=d'; and not inside comments).

Status

I have not found a similar function in the FEX or elsewhere
I found the MATLAB Comment Stripping Toolbox by Peter J. Acklam
At regex101 I have a working regular expression (in PCRE (PHP)). MATLAB's regular expressions are close to PCRE.
At regex101 my PCRE expression works under Python too. regex101 automatically generates a Python script based on my test case.
So far I have failed to port my PCRE expression to MATLAB
I'm trying to use "Dynamic Regular Expressions" to understand where it goes wrong. For now, I'm moving (?@disp($0)) around in the expression and watching the code fragments printed in the Command Window. However, it remains to make something useful out of it.

Doc says: Dynamic Regular Expressions

 (?@cmd) 
 Execute the MATLAB command represented by cmd, but discard any output 
 the command returns. (Helpful for diagnosing regular expressions.)
 Example: '\w*?(\w)(?@disp($1))\1\w*' matches words that include double
 letters (such as pp), and displays intermediate results.
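
As a minimal sketch of the documented example in use (the input word 'mississippi' is an assumption chosen here for illustration), the pattern can be run as follows. Each letter the engine tries as the first half of a double letter is displayed in the Command Window before the backreference \1 is tested:

```matlab
% Pattern taken from the documentation excerpt above; the test word is an
% assumption for illustration.
str = 'mississippi';
xpr = '\w*?(\w)(?@disp($1))\1\w*';

% (?@disp($1)) runs every time the engine has just captured a letter in
% group 1, i.e. before it checks whether the same letter follows.
match = regexp(str, xpr, 'match', 'once');
```

Moving the (?@disp(...)) fragment to a different place in the pattern shows how far the engine gets before it gives up or backtracks.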

Questions:

Is there already a is_string_constant to find somewhere
Where can I find tutorials and example on how to debug regular expressions in Matlab
Would it be crazy to try Java or Python to do the job? (I haven't used either.)
Other tips


Answer 1

Stephen23, 14 Oct 2017 (edited 17 Oct 2017)

As far as I can tell there is no simple "regexp-debug" tool or method, but here are a few tips based on my experience writing my FEX submission words2num. The function words2num is based around several large regular expressions in order to identify numbers written in English words. By default the main regular expression currently has a total of 2803 characters, around 100 groups, and 11 dynamic regular expressions. Obviously I spent a lot of time reading the MATLAB documentation, in particular:

https://www.mathworks.com/help/matlab/matlab_prog/regular-expressions.html

https://www.mathworks.com/help/matlab/matlab_prog/dynamic-regular-expressions.html

and I also wrote a simple interactive tool iregexp for quickly checking what I was working on.

Your questions:

"Is there already a is_string_constant to find somewhere?"

Not that I have seen.

"Where can I find tutorials and example on how to debug regular expressions in Matlab?"

I have not seen tutorials specifically about debugging regular expressions. The hints given in the dynamic regular expression help are the closest thing in the documentation. The most useful piece of information for me was this line: "The operators $& (or the equivalent $0), $`, and $' refer to that part of the input text that is currently a match, all characters that precede the current match, and all characters to follow the current match, respectively." Using these, together with $1, $2, etc., made identifying and debugging the matches easier.
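
A minimal sketch of how these operators might be used as a "print statement" (the input string and the pattern are assumptions made up for illustration, not taken from words2num):

```matlab
str = 'abc123def';

% After \d+ has matched, display the text before the match, the match so
% far in angle brackets, and the text after it. horzcat is used instead
% of [...] only to keep square brackets out of the embedded command.
xpr = '\d+(?@disp(horzcat($`,''<'',$0,''>'',$'')))[a-z]';

match = regexp(str, xpr, 'match', 'once');
```

The displayed line marks where in the input the engine currently is, which is usually the first thing one wants to know when a pattern unexpectedly fails.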

"Would it be crazy to try Java or Python to do the job? (I haven't used either.)"

Python regular expressions have their own quirks, but are generally easy to use and have some nice features (e.g. the flags re.DEBUG and re.VERBOSE).

"Other tips"

I broke the regular expression down into very small atomic parts, each of which was tested and matches exactly one particular feature of the words I needed to match. In the code the parts are joined systematically into one expression. This provided a way for me to ensure that the expression matched the required logic.
Identifying and clearly specifying the required logic was 90% of the task for me. I spent months with pieces of paper scribbling down ideas and crossing things out, breaking the problem into what could be called a high-level and mid-level description of the regular expression. Once the logic was correct the regular expression itself followed quite naturally.
I created a set of around two thousand test cases, some sourced from the internet and some created systematically by hand. These were hugely important to ensuring the correct output for all cases, and identifying cases that I had not considered.
The regexp option warnings is useful when debugging.
The most common challenge for me was that greediness and backtracking did not always give the matching I needed when there were many similar-but-different possible matches. The only solution was to go back to the paper and start again...
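
The tips above (small tested parts, systematic joining, a table of test cases, the warnings option) might be sketched like this for the string-constant problem in the question. All variable names and test cases are assumptions for illustration, and the expression deliberately ignores the transpose and comment cases:

```matlab
% Atomic parts (names are illustrative assumptions)
quote   = '\x27';          % a single quote character
body    = '[^\x27\n]*';    % any run of characters except quote or newline
escaped = '\x27\x27';      % doubled quote = literal quote inside a constant

% Join the parts systematically into one expression
strConst = [quote, body, '(?:', escaped, body, ')*', quote];

% Test cases: input text and the expected first match
cases = {
    'x = ''abc'';'          , '''abc'''
    'fprintf(''it''''s'')'  , '''it''''s'''
    'a = [''x'', ''y''];'   , '''x'''
    };

for k = 1:size(cases, 1)
    got = regexp(cases{k,1}, strConst, 'match', 'once', 'warnings');
    assert(strcmp(got, cases{k,2}), 'case %d failed', k);
end
```

Because every part has a name, a failing test case can be narrowed down by testing the parts one at a time instead of staring at the joined expression.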


Answer 2

Cedric, 14 Oct 2017 (edited 16 Oct 2017)

Regular expressions may not be that appropriate in this context; I have used them in the past for exactly this, but it was too complicated to be really satisfactory.

I took 10 minutes to build a basic loop (being a regexp evangelist, it was quite painful ;)), which seems to work on a few test strings:

strs = {
    '', ...
    'abc', ...
    '''abc''', ...
    '% ''abc''', ...
    's = ''hello'' ; b = c'' ; fprintf([''A''''s content '',''%d : %s''], i, str{:}.'') % ''abc''' ...
} ;
for sId = 1 : numel( strs )
    is_string_constant( strs{sId}, true ) ;
    fprintf( '\n' ) ;
end

Outputs (skipping the empty string):

>> test
abc
000
'abc'
11111
% 'abc'
0000000
s = 'hello' ; b = c' ; fprintf(['A''s content ','%d : %s'], i, str{:}.') % 'abc'
00001111111000000000000000000000111111111111111011111111100000000000000000000000
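
The loop itself is only available as an attachment, so the following is not Cedric's code. It is a minimal sketch of one way such a loop could look, assuming a single line of code as input, single-quoted constants only, and the rule that a quote directly after a letter, digit, underscore, closing bracket, dot or another quote is a transpose. The name is_string_constant_sketch is hypothetical:

```matlab
function tf = is_string_constant_sketch(code)
% Mark the characters of ONE line of MATLAB code that belong to a
% single-quoted string constant (delimiters included).
% Assumptions: no double-quoted strings, no block comments, no "..." lines.
tf = false(1, numel(code));
inStr = false;
k = 1;
while k <= numel(code)
    c = code(k);
    if inStr
        tf(k) = true;
        if c == ''''
            if k < numel(code) && code(k+1) == ''''
                tf(k+1) = true;     % doubled quote: literal quote, stay inside
                k = k + 1;
            else
                inStr = false;      % closing quote
            end
        end
    elseif c == '%'
        break                       % rest of the line is a comment
    elseif c == ''''
        % A quote directly after one of these characters is a transpose.
        isTranspose = k > 1 && (isletter(code(k-1)) || ...
            any(code(k-1) == '0123456789_)]}.'''));
        if ~isTranspose
            inStr = true;           % opening quote
            tf(k) = true;
        end
    end
    k = k + 1;
end
end
```

Calling char(is_string_constant_sketch(str) + '0') gives the same kind of 0/1 line as in the output above.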

EDIT 1: I spent another 20 minutes building a simple debug function (attached). It doesn't do much, but it avoids the hassle of updating patterns. It seems to handle nested levels of parentheses and escaped parentheses well.

PS: .. but of course, I don't see why it is useful unless we can't output the match and/or tokens for some reason, so I may just have wasted 20 minutes (lol) and we are back to "However, it remains to make something useful out of it.".

 >> match = regexp_debug( 'hello world', '(ll.).*?(o.l)', 'match', 'once' )
      match: 'lo worl'
    token_1: 'llo'
    token_2: 'orl'
 match =
    'llo worl'
 >> tokens = regexp_debug( 'hello world', '(ll.).*?(o.l)', 'tokens', 'once' )
      match: 'lo worl'
    token_1: 'llo'
    token_2: 'orl'
 tokens =
  1×2 cell array
    {'llo'}    {'orl'}
 >> [tokens, start] = regexp_debug( 'hello world', '((?<=(l|\(\)))l.).*?(o.l)', 'tokens', 'start', 'once' )
      match: 'lo worl'
    token_1: 'lo'
    token_2: 'orl'
 tokens =
  1×2 cell array
    {'lo'}    {'orl'}
 start =
     4

Further EDITs:

15/10 - Added the match in the output.
16/10 @ 02:34UTC - Corrected bug in tokens count.

Comments

Walter Roberson, 14 Oct 2017

Regular expressions do poorly on "balancing" problems, such as matching brackets or matching quote marks. In fact, some of the foundational theory on Regular Expressions shows that they cannot handle balancing problems: in order to handle balancing you need an indefinitely-large push-down stack or equivalent. Perl "extended regular expressions" implement that explicitly.

Quote marks have the additional complication that if a quote is followed by another quote, then that is a single literal quote, which is not considered to balance anything.

I would have to think more about how it could be done with dynamic matches. For bracket matching it would involve recursion and backtracking. You want to find the smallest match, so it is not enough to find that open and close bracket counts match, you can only accept at the point where the counts match and do not match on any substring... But for quote marks.. mumble mumble mumble.
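
To illustrate the point about balancing: the counting that a plain regular expression cannot do is easy to do next to it. This is a minimal sketch for round brackets only, with a made-up input, and is not something proposed in the thread:

```matlab
str = 'a(b(c)d)e';

% +1 at every opening bracket, -1 at every closing one; the running sum
% is the nesting depth after each character (the "stack height").
depth = cumsum((str == '(') - (str == ')'));

% Balanced means: the depth never goes negative and ends at zero.
isBalanced = all(depth >= 0) && depth(end) == 0;
```

Quote marks do not fit this scheme directly, because the opening and closing delimiters are the same character and a doubled quote counts as a literal.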

per isakson, 18 Oct 2017 (edited 19 Oct 2017)

"[...] make something useful out of it": regexp returns an empty cell array and I'm lost. That happens all too often.

I edited this comment 23 hours later to make it easier to read and understand.

Some "print statements" inside the regular expression might help in understanding what is going on. I've done a small test.

act = is_string_constant( str );

where

function    is_str = is_string_constant( str )
% comments excluded to limit the size
    xpr = [
    '(?m)                       '   ... % ^ and $ match at line breaks
    '(                          '   ... % capturing grouping parenthesis
    '   ^                       '   ... % beginning of line 
    '   (?:                     '   ... % check for blip as transpose 
    '       (?:                 '   ... %
    '       [\w\x5D\x29\x7D\x2E]'   ... %   a blip following one of "\w])}."  
    '       \x27+               '   ... %   could be the transpose operator
    '       |                   '   ... % or
    '       [^\x27\x25]         '   ... %   any character but "'%"
    '       )+                  '   ... % 
    '   )                       '   ... %
    '(?@dre(''A: '',$`,$0,$'')) '   ... % print statement
    '   |                       '   ... % or
    '   (                       '   ... % capture: string constant
    '       \x27                '   ... % blip followed by any character 
    '       [^\x27\n]*          '   ... %   but  blip or new line starts
    '       (                   '   ... %   a string constant   
    '           \x27\x27        '   ... %   double blip
    '           [^\x27\n]*      '   ... %   any character but blip or nl
    '       )*                  '   ... % zero or more characters in const
    '       \x27                '   ... % closing blip
    '   )                       '   ... %
    '(?@dre(''B: '',$`,$0,$'')) '   ... % print statement
    '   |                       '   ... %
    '   (                       '   ... %
    '       %[^\n]*             '   ... % What remains must be a comment.
    '   )                       '   ... %
    '(?@dre(''C: '',$`,$0,$'')) '   ... % print statement
    ')*                         '   ... %
    ];
    xpr( isspace( xpr ) ) = [];
    cac = regexp( str, xpr, 'tokenExtents', 'warnings' );
    is_str = cac;
  end

Writing regular expressions this way makes them a bit more readable, and it makes it easy to insert "print statements".

It might be a waste of time to analyze the regular expression itself. I've made too many wild changes and some comments might be out of sync.

The function dre() reads

function    dre( varargin )
    %
    cac = strrep( varargin, char(10), char(167) );
    cac = strrep( cac     , char(13), char(167) );
    %
    fprintf( '%s  ', cac{1} );
    cprintf( '_blue', '%s', cac{2} )
    cprintf( '_green', '%s', cac{3} )
    cprintf( '_magenta', '%s', cac{4} )
    fprintf('\n');
end

cprintf is "cprintf - display formatted colored text in the Command Window" by Yair Altman.
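
For readers without cprintf, a plain-text stand-in could look like the sketch below. The name dre_plain is hypothetical. It brackets the current match instead of coloring it, and it builds the output line by concatenation because fprintf skips empty arguments, which would shift the fields whenever the text before the match is empty:

```matlab
function dre_plain(label, before, current, after)
% Plain-text variant of dre(): no colors, the current match is shown in
% square brackets. Line breaks are replaced by the section sign.
args = strrep({before, current, after}, char(10), char(167));  % LF
args = strrep(args, char(13), char(167));                      % CR
fprintf('%s\n', [label, '  ', args{1}, '[', args{2}, ']', args{3}]);
end
```

Inside the expression it would be called in the same way as dre, e.g. '(?@dre_plain(''A:'',$`,$0,$''))'.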

The test string with line breaks is

str =
str = 'abc'
  %{
whatever!"#¤&/()
  %}
a = 17;

In the output I replaced the line breaks by § to make it readable.

This colored output conveys a picture of how the processing of the input string progresses and halts.

Cedric, 18 Oct 2017 (edited 18 Oct 2017)

Have you looked at the functions that I had attached with the initial answer? We may have written roughly the same loop! EDIT: ah no in fact, as you treat the whole stream in one shot I guess.

per isakson, 19 Oct 2017 (edited 19 Oct 2017)

Yes, I've studied the code that you attached above.

"the whole stream in one shot I guess": Yes, the seven colored lines are the debug printout from one call to regexp:

cac = regexp( str, xpr, 'tokenExtents', 'warnings' );

I've edited my description above on my "small test".
