regexprep: Nested ordinal token not captured

Question

Asked by FM on 5 January 2023 (edited by FM on 7 January 2023)

https://kr.mathworks.com/matlabcentral/answers/1888847-regexprep-nested-ordinal-token-not-captured

I am trying to modify file paths with consecutive repeated folder names, e.g., "archive" is repeated in "Clients/archive/archive/20220428.1349.zip". The modification I seek is to truncate the path after the 2nd occurrence of a repeated folder, leaving the trailing file path separator, e.g., "Clients/archive/archive/". I thought this would do it:

FolderInSelf = regexprep( FolderInSelf, ...
    "^(.*/(\w+)/\2/).*", "$1" );

"FolderInSelf" is a column vector of strings, each representing a file path that contains a consecutively repeated folder name.

The outer set of brackets captures the 1st token, which is the path up to and including the repeated folder and its trailing slash, excluding anything after that slash.

The inner set of brackets is the 2nd token, which is the first occurrence of the repeated folder name ("archive" in the example above).

The back reference "\2" describes the fact that the token is repeated, and separated by a slash.

I am puzzled by why the above "regexprep" does nothing to the strings in FolderInSelf. To troubleshoot, I chose a simpler command that worked as expected

>> regexprep( "Clients/archive/archive/20220428.1349.zip", ...
              "^(.*/(archive)/archive/).*", "$1" )
   ans = "Clients/archive/archive/"

If I replace "$1" with "$2", I expect to get "archive" (the 2nd token). Instead, I get:

ans = "$2"

This suggests that the 2nd token is not being captured. Can anyone point out what I am doing wrong?

Comment from FM, 5 January 2023 (edited 5 January 2023):

Thanks, Rik. You're absolutely right, at least as of 2018: https://www.mathworks.com/matlabcentral/answers/436217-regular-expression-are-nesting-of-group-operators-supported

If you don't mind posting this as the answer, I'll mark it as answered.

This is quite a severe limitation in regular expressions. :(


Answer 1

Rik, 5 January 2023 (edited 5 January 2023), accepted answer

https://kr.mathworks.com/matlabcentral/answers/1888847-regexprep-nested-ordinal-token-not-captured#answer_1142137

I'm not entirely sure tokens can be nested (at least in the implementation that Matlab uses).

You can also explore the output of your tokens first with regexp:

regexp( "Clients/archive/archive/20220428.1349.zip", ...
    "^(.*/(archive)/archive/).*", "tokens" )
ans = 1×1 cell array
    {["Clients/archive/archive/"]}

I suspect the inner parentheses are considered grouping, not token-capturing.

I just tested this on the oldest Matlab I can run (v6.5 from 2002, which requires a bit of trickery to extract the tokens), and there the result is the same as in the online editor. So the remarks from the thread you found hold for just about any release of Matlab you can still get to run.

It might interest you to know that the output on GNU Octave (a mostly-compatible software suite) is not the same:

x = regexp( 'Clients/archive/archive/20220428.1349.zip', '^(.*/(archive)/archive/).*', 'tokens' )
x =
{
  [1,1] =
  {
    [1,1] = Clients/archive/archive/
    [1,2] = archive
  }

}
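For anyone who needs both pieces as tokens in MATLAB: the regexp documentation notes that, with nested parentheses, only the outermost set produces a token. Below is a minimal sketch of one possible workaround, which places the two groups side by side so that each gets its own ordinal and "\2" becomes a valid back reference. The variable names are only illustrative, and the pattern assumes the repeated folder has at least one parent folder in front of it.

```matlab
p = "Clients/archive/archive/20220428.1349.zip";

% Side-by-side groups instead of nested ones:
% token 1 is the leading path, token 2 is the repeated folder name.
tok = regexp( p, "^(.*/)(\w+)/\2/", "tokens", "once" );

% Rebuild the truncated path from the two tokens.
truncated = regexprep( p, "^(.*/)(\w+)/\2/.*", "$1$2/$2/" );
```

Because "(.*/)" is greedy, this keeps the behavior of the pattern in the question: if a path holds more than one repeated pair, the last one is used.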

Comment from Rik, 5 January 2023:

I understand it may not be a solution for you, but I just wanted to put it out there in case it solves the issue for someone else.

Reading your comment, I don't believe I have a suggestion you have not thought of.

Comment from FM, 5 January 2023:

That's good. Hopefully it will help someone.


Answer 2

FM, 5 January 2023 (edited 7 January 2023)

https://kr.mathworks.com/matlabcentral/answers/1888847-regexprep-nested-ordinal-token-not-captured#answer_1142267

If table "tFolderInSelf" contains a column "Path" consisting of a column vector of strings, then the following code truncates the paths after the second consecutive repetition of a folder name:

% Extract the repeated folder pair, e.g. "archive/archive/"
tFolderInSelf.Folder_x2 = regexp( tFolderInSelf.Path, ...
   "\<([\w.-]+)/\1/", "match", "once" );
% Match the path up to and including the repeated folder pair
tFolderInSelf.PathTrunc = regexp( tFolderInSelf.Path, ...
   ".*\<"+tFolderInSelf.Folder_x2, "match", "once" );
% Move the match "PathTrunc" next to "Path" for comparison
tFolderInSelf = movevars( ...
   tFolderInSelf, "PathTrunc", After="Path" );
% Cleaned-up viewing
categorical( unique( tFolderInSelf.PathTrunc ) )

Clients/archive/archive/
IT/sync/sync/
Knowledge/Bayes/Bayes/
Knowledge/MathPrg/MIPsolveSpd/MIPsolveSpd/
PD/SLT/20220411-0627/20220411/20220411/
PD/SLT/20220411-0627/20220425/20220425/
<...etc...>

Each row of the column tFolderInSelf.PathTrunc is a scalar string. The "regexp" option "once" ensures that each row has only one element. This allows "regexp" to return a column vector of strings rather than a column vector of cells, as it does not have to accommodate variable-length row vectors of strings for each table row.

This code may break if one of the paths in tFolderInSelf.Path does not contain a repeated folder. In my case, the data set was built using only paths that contain repeated folders.
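A possible single-call variant, sketched here under a few assumptions: the sample paths are made up for illustration, and the pattern reuses the "\<([\w.-]+)/\1/" expression from the code above. It matches the first repeated pair plus everything after it and writes the pair back, so only one ordinal token is needed. Since "regexprep" returns unmatched text as it is, a path without a repeated folder passes through unchanged, and a separate flag shows which rows actually matched.

```matlab
% Hypothetical sample data; the last row has no repeated folder.
paths = [ "Clients/archive/archive/20220428.1349.zip"
          "PD/SLT/20220411-0627/20220411/20220411/notes.txt"
          "IT/sync/readme.md" ];

pat = "\<([\w.-]+)/\1/";

% Replace the repeated pair and the rest of the path with the pair alone.
% Text in front of the match is not part of the match, so it is kept.
truncated = regexprep( paths, pat + ".*", "$1/$1/" );

% Flag the rows that contain a repeated folder (start index not empty).
hasRepeat = ~cellfun( @isempty, regexp( cellstr(paths), pat, "once" ) );
```

Unlike the greedy "^(.*/..." pattern in the question, this truncates at the first repeated pair in a path rather than the last.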
