regexprep: Nested ordinal token not captured

FM on 5 Jan 2023
Edited: FM on 7 Jan 2023
I am trying to modify file paths with consecutive repeated folder names, e.g., "archive" is repeated in "Clients/archive/archive/20220428.1349.zip". The modification I seek is to truncate the path after the 2nd occurrence of a repeated folder, leaving the trailing file path separator, e.g., "Clients/archive/archive/". I thought this would do it:
FolderInSelf = regexprep( FolderInSelf, ...
"^(.*/(\w+)/\2/).*", "$1" );
"FolderInSelf" is a vertical vector of strings, each representing a file path that contains a consecutively repeated folder name.
The outer set of brackets captures the 1st token, which is for the path up to and including the repeated folder and its trailing slash, excluding anything after that slash.
The inner set of brackets is the 2nd token, which is for the first occurrence of the repeated folder name ("archive" in the example above).
The back reference "\2" describes the fact that the token is repeated, and separated by a slash.
I am puzzled by why the above "regexprep" does nothing to the strings in FolderInSelf. To troubleshoot, I chose a simpler command that worked as expected:
>> regexprep( "Clients/archive/archive/20220428.1349.zip", ...
"^(.*/(archive)/archive/).*", "$1" )
ans = "Clients/archive/archive/"
If I replace "$1" with "$2", I expect to get "archive" (the 2nd token). Instead, I get:
ans = "$2"
This suggests that the 2nd token is not being captured. Can anyone point out what I am doing wrong?
  1 Comment
FM on 5 Jan 2023
Edited: FM on 5 Jan 2023
If you don't mind posting this as the answer, I'll mark it as answered.
This is quite a severe limitation in regular expressions. :(


Accepted Answer

Rik on 5 Jan 2023
Edited: Rik on 5 Jan 2023
I'm not entirely sure tokens can be nested (at least in the implementation that Matlab uses).
You can also explore the output of your tokens first with regexp:
regexp( "Clients/archive/archive/20220428.1349.zip", ...
"^(.*/(archive)/archive/).*", "tokens" )
ans = 1×1 cell array
{["Clients/archive/archive/"]}
I suspect the inner parentheses are considered grouping, not token-capturing.
I just tested this on the oldest Matlab I can run (v6.5 from 2002, which requires a bit of trickery to extract the tokens), and there the result is the same as in the online editor. So the remarks from the thread you found hold for just about any release of Matlab you can still get to run.
It might interest you to know that the output on GNU Octave (a mostly-compatible software suite) is not the same:
x=regexp( 'Clients/archive/archive/20220428.1349.zip', '^(.*/(archive)/archive/).*', 'tokens' )
x =
{
[1,1] =
{
[1,1] = Clients/archive/archive/
[1,2] = archive
}
}
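For MATLAB itself, one possible workaround is to avoid nesting altogether and put the prefix and the folder name in sibling tokens, then rebuild the truncated path in the replacement. This is only a sketch, not something tested on the original data: the variable names are made up for illustration, and this particular pattern assumes at least one folder precedes the repeated pair.

```matlab
str = "Clients/archive/archive/20220428.1349.zip";

% Sibling tokens instead of nested ones:
%   token 1 = path before the repeated folder (ends in "/")
%   token 2 = the repeated folder name, referenced again by \2
pat = "^(.*/)(\w+)/\2/.*";

% Both tokens are now reported by regexp
tok = regexp( str, pat, "tokens", "once" );

% Rebuild prefix + folder + "/" + folder + "/" and drop the rest
out = regexprep( str, pat, "$1$2/$2/" );
```

Because neither token sits inside another, both "$1" and "$2" are available in the replacement string, and "out" should come back as "Clients/archive/archive/".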
  3 Comments
Rik on 5 Jan 2023
I understand it may not be a solution for you, but I just wanted to put it out there in case it solves the issue for someone else.
Reading your comment, I don't believe I have a suggestion you have not thought of.
FM on 5 Jan 2023
That's good. Hopefully it will help someone.


More Answers (1)

FM on 5 Jan 2023
Edited: FM on 7 Jan 2023
If table "tFolderInSelf" contains a column "Path" consisting of a vertical vector of strings, then the following code truncates the paths after the second consecutive repetition of a folder name:
% Extract the repeated folder names
tFolderInSelf.Folder_x2 = regexp( tFolderInSelf.Path, ...
"\<([\w.-]+)/\1/" , "match", "once" );
% Match the path up to the repeated folder name
tFolderInSelf.PathTrunc = regexp( tFolderInSelf.Path, ...
".*\<"+tFolderInSelf.Folder_x2, "match", "once" );
% Move the match "PathTrunc" next to "Path" for comparison
tFolderInSelf = movevars( ...
tFolderInSelf, "PathTrunc", After="Path" );
% Cleaned-up viewing
categorical( unique( tFolderInSelf.PathTrunc ) )
Clients/archive/archive/
IT/sync/sync/
Knowledge/Bayes/Bayes/
Knowledge/MathPrg/MIPsolveSpd/MIPsolveSpd/
PD/SLT/20220411-0627/20220411/20220411/
PD/SLT/20220411-0627/20220425/20220425/
<...etc...>
Each row of the column tFolderInSelf.PathTrunc is a scalar string. The "regexp" option "once" ensures that each row has only one element. This allows "regexp" to return a column vector of strings rather than a column vector of cells, as it does not have to accommodate variable-length row vectors of strings for each table row.
It is possible that this code can be broken if one of the paths in tFolderInSelf.Path does not contain a repeated folder. In my case, the data set was built using only paths that contain repeated folders.
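If paths without a repeated folder can occur, a single "regexprep" call with sibling (non-nested) tokens may be a more forgiving alternative, since "regexprep" returns strings with no match unchanged. The following is a hedged sketch on made-up sample data, not the code used for the results above. Note that it anchors the folder name at the start of the path or at a slash rather than at "\<", and that it truncates at the first repeated pair it finds.

```matlab
% Hypothetical sample data; the last path has no repeated folder
paths = [ "Clients/archive/archive/20220428.1349.zip"
          "IT/sync/sync/notes.txt"
          "Docs/readme.txt" ];

% Sibling tokens: token 1 is "" or "/", token 2 is the folder name
pat = "(^|/)([\w.-]+)/\2/.*$";

% Rows without a match pass through regexprep unchanged
truncated = regexprep( paths, pat, "$1$2/$2/" );

% Flag the rows that actually contain a repeated folder
hasRepeat = ~cellfun( @isempty, regexp( paths, pat, "once" ) );
```

The logical vector "hasRepeat" can then be used to filter or inspect the rows that were not modified.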

Release

R2022a
