Removing specific characters from string in nested cells

Question

Bob Thompson on 13 Jun 2018

Direct link to this question:

https://kr.mathworks.com/matlabcentral/answers/405540-removing-specific-characters-from-string-in-nested-cells

Commented: Stephen23 on 30 Dec 2022

I have a series of strings contained in a nested cell array (because regexp loves to nest cells), and I would like to remove every character that is neither numeric nor white space, namely the asterisks, so that I can convert the strings to doubles.

I'm looking for the least painful way of removing these special characters from all of the strings. I do not have a sample file to attach, sorry, but I have described the shape of a sample array below.

X == 1x1 cell
X{1} == 1x1 cell (because regexp can't help itself apparently)
X{1}{1} = {'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '};

Comments (12)

Bob Thompson on 13 Jun 2018

Stephen, it is related to the same file, but not the same part of the file. I believe I figured the other question out, but didn't think it was elegant enough to post as an answer to my own question.

I am unable to upload an actual sample document, but a sample of what I'm extracting from would be the following.

   1  ****TABLE1****
   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,
   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,
   1  ****TABLE2****

I am trying to capture the values of columns 1 and 3 from the table. I am specifically having trouble with column 3, which contains the asterisks; column 1 works fine with str2double.

col3{1} = regexp(input, '\<COLUMN3=\s*(.{1,400})1  \*\*\*\*','tokens'); % asterisks escaped so they match literally
col3{1} = regexp(col3{1}{1}, '\s+','split'); % split the captured text on white space

I index the first level of the cell because I will have multiple tables. I used (.{1,400}) because I don't know how many values are in the table, and I cannot simply use (.*) because '1  ****' occurs multiple times throughout the file. I don't think I can use \d or \w because of the ',' and '.' mixed in with the values. I used the second regexp to split the single string that the first one returned, as I found this worked more consistently with str2double than applying str2double to the entire string.
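A lazy quantifier is one possible way around both the fixed 400-character window and the greedy (.*) problem described above. The sketch below is only an illustration under stated assumptions, not code from this thread: it rebuilds the sample table with sprintf (in practice the text would come from the file), and the names txt, block, numstr and col3vals are made up for the example.

```matlab
% Rebuild the sample table as one char vector (stand-in for the real file text).
txt = sprintf([ ...
    '   1  ****TABLE1****\n', ...
    '   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,\n', ...
    '   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,\n', ...
    '   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,\n', ...
    '   1  ****TABLE2****\n']);

% (.*?) is lazy, so it stops at the first "1  ****" marker that follows.
% The asterisks are written as \*{4} so that they match literally.
block = regexp(txt, 'COLUMN3=\s*(.*?)\s*1\s+\*{4}', 'tokens', 'once');

% Keep only the numeric parts; this drops ',' and '*' in the same step.
numstr = regexp(block{1}, '\d+\.?\d*', 'match');
col3vals = str2double(numstr);
```

Because the capture ends at the next table marker rather than at a line break, the same pattern should also cope with values that wrap onto several lines.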

Bob Thompson on 14 Jun 2018

Edited: Bob Thompson on 14 Jun 2018

Hmmm. I'm currently using fileread and just importing the entire file as a single string. I've used fgetl in the past for other scripts, but due to the variability of this file I don't know if it's a good fit. Textscan might work, but I don't know that separating by each \n will work either, as it is possible that my various bits of data will be contained on multiple lines.

I've been working with it some more today, and I realized that my previous code works fine for the first column of values, as these never seem to have special characters. I can therefore get the number of values from this array and use that to create a repeating expression for the third column.

col1 = regexp(input, '\<COLUMN1=\s*(.{1,400})1  \*\*\*\*','tokens'); % asterisks escaped so they match literally
col1 = regexp(col1{1}, '\s+','split');
colvals(:,1) = str2double(col1{1});
nvals = length(colvals);
dups = repmat('(\d*\.\d*)[*,]{1,2}\s*',1,nvals); % Modified from Paolo's comment; dot escaped, only a trailing '*' and/or ',' is consumed
string = ['COLUMN3=\s+',dups];
col3 = regexp(input, string, 'tokens');

This seems to work, and removes the need to conduct the split a second time, which is nice.

I'm not really sure what the ':' from Paolo's comment is supposed to do, I don't see it anywhere in the regexp documentation, and it's not in any of my strings.

Also, OCDER and Paolo, I appreciate your help, so if one of you wants to write up an actual answer I would be happy to accept it.

Bob Thompson on 15 Jun 2018

Ah, I see. It doesn't appear in regexp.m comments, which is where I was looking.

Stephen23 on 15 Jun 2018

@Bob Nbob: you are right, it does not appear in the Mfile help. I notice that many other useful regular expression features also do not appear in the Mfile help: notably missing are dynamic expressions, lookaround operators, and named capture.

Both the inbuilt help and the page I linked to give a very useful introduction, and explain all features of regular expressions in MATLAB:

doc regexp
doc('Regular Expressions')


Answer 1

Paolo on 15 Jun 2018

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/405540-removing-specific-characters-from-string-in-nested-cells#answer_324855

Edited: Paolo on 15 Jun 2018

Perhaps this can easily be achieved in two steps. For your input:

    1  ****TABLE1****
   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,
   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,
   1  ****TABLE2****

Step 1. Find and remove the punctuation characters that trail the numbers (let's say ",", "." and "*").

   data = fileread('CORR.txt');
   expression_sub = '(?<=\d\.\d*\*?)([\*\.,])';
   data = regexprep(data,expression_sub,'');

data no longer contains those trailing characters. It is now:

     '   1  ****TABLE1****
        COLUMN1= 1.12 2.23 3.34 4.45 5.56 6.67
        COLUMN2= 0.00 0.00 0.00 0.00 0.00 0.00
        COLUMN3= 1.23 0.34 3.45 5.78 6.54 8.23
        1  ****TABLE2****
     '

Step 2. Match your data. The expression is greedy and will try to match as many digit, full stop, digits combinations as it can. Therefore you don't need to repmat your expression as you showed.

 expression_match = '(?<=COLUMN[1,3]=\s)(\d.?\d*\s)*';
 [tokens,match] = regexp(data,expression_match,'tokens','match'); % data is the output of step 1

Matlab manipulation.

 column1 = str2double(strsplit(cell2mat(tokens{1}),' '));
 column3 = str2double(strsplit(cell2mat(tokens{2}),' '));

column1 =

1.1200 2.2300 3.3400 4.4500 5.5600 6.6700

column3 =

1.2300 0.3400 3.4500 5.7800 6.5400 8.2300

Comments (2)

Bob Thompson on 18 Jun 2018

Ha, using (\d.?\d*\s)* is pretty slick. I'm a little sad I didn't think of that.

Stephen23 on 30 Dec 2022

@Bob Thompson: the dot needs to be escaped as well (otherwise it matches all characters), e.g.:

(\d+\.?\d*\s)*
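To see the escaped expression at work, here is a minimal sketch of the two-step approach on the sample table. Some details are assumptions made for this illustration and are not from the thread: the sample text is rebuilt with sprintf instead of being read from a file, step 1 uses a token replacement rather than the original lookbehind, the character class is written [13] because a comma inside the brackets would itself be matched, and the names raw, cleaned, expr and runs are invented.

```matlab
% Rebuild the sample table as one char vector (stand-in for the file contents).
raw = sprintf([ ...
    '   1  ****TABLE1****\n', ...
    '   COLUMN1= 1.12, 2.23, 3.34, 4.45, 5.56, 6.67,\n', ...
    '   COLUMN2= 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,\n', ...
    '   COLUMN3= 1.23, 0.34, 3.45, 5.78*, 6.54*, 8.23,\n', ...
    '   1  ****TABLE2****\n']);

% Step 1: drop the '*' and ',' that trail each decimal number.
cleaned = regexprep(raw, '(\d+\.\d+)[*,]+', '$1');

% Step 2: with \. only a literal full stop is accepted inside a number.
expr = '(?<=COLUMN[13]=\s)(\d+\.?\d*\s)*';
runs = regexp(cleaned, expr, 'match');

% Each match is one run of space-separated numbers; split and convert.
column1 = str2double(strsplit(strtrim(runs{1})));
column3 = str2double(strsplit(strtrim(runs{2})));
```

Using the 'match' output keeps the whole run of numbers for each column, so nothing depends on how a repeated group is reported in 'tokens'.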


Answer 2

George Abrahams on 30 Dec 2022

Direct link to this answer:

https://kr.mathworks.com/matlabcentral/answers/405540-removing-specific-characters-from-string-in-nested-cells#answer_1138717

The others are right to fix the root problem causing the tricky nested cell array. Having said that, for future reference, my deepreplace function on File Exchange / GitHub would have done exactly what you requested.

x = {{{'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '}}};
% Remove any character except for digits (0-9) and period (.)
match = regexpPattern('[^\d.]');
x = deepreplace(x,match,'');
% x = 1×1 cell array
%     {1×1 cell}
% x{1} = 1×1 cell array
%     {5×1 cell}
% x{1}{1} = 5×1 cell array
%     {'1234.'}
%     {'12.'  }
%     {'1234.'}
%     {'123.' }
%     {'321.' }
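For the specific shape in the question, a similar result can probably be had without an extra download. The following is a rough sketch that assumes every outer layer is a 1x1 cell, as in the sample; it is not how deepreplace works internally, and the variable names are chosen only for this example.

```matlab
x = {{{'1234.,  ';'12.,*  ';'1234.,  ';'123.,*   ';'  321.,*  '}}};

% Peel off the 1x1 wrapper cells until the cell that holds the text remains.
while iscell(x) && isscalar(x) && iscell(x{1})
    x = x{1};
end

% regexprep accepts a cell array of char vectors, so no loop is needed here.
% [^\d.] matches any character that is not a digit or a full stop.
cleaned = regexprep(x, '[^\d.]', '');

% str2double also accepts a cell array and returns a numeric array of the same size.
vals = str2double(cleaned);
```

The loop only unwraps scalar cells, so a wrapper that holds several tables would need to be indexed or looped over first.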
