Extracting string and number pairs from a mixed string

Question

LukasJ on 13 Feb 2020

https://kr.mathworks.com/matlabcentral/answers/505344-extracting-string-and-number-pairs-from-a-mixed-string

Edited: Stephen23 on 15 Jun 2020

Dear all,

I have read in material data and obtained an array of chemical compositions of type char, such as

'Fe61Zr8Co7Mo15B7Y2', or sometimes even '(Fe0.5Co0.5)58.5Cr14Mo6C15B6Er0.5', wherein Fe and Co each have a percentage of 29.25 (58.5*0.5).

How do I extract the chemical composition so that I receive an array (or two) like

['Fe', 'Zr', 'Co', 'Mo', 'B', 'Y'; 61, 8, 7, 15, 7, 2]?

I am somehow failing to use sscanf correctly, e.g. numbers = sscanf('Fe61Zr8Co7Mo15B7Y2', '%f') won't do anything since I need to omit varying strings :(

Thanks a lot in advance!

Greetings

Lukas


Answer 1 (accepted)

Stephen23 on 13 Feb 2020

https://kr.mathworks.com/matlabcentral/answers/505344-extracting-string-and-number-pairs-from-a-mixed-string#answer_415448

Edited: Stephen23 on 15 Jun 2020

The two lines marked with %%% are used to convert substrings like '(Fe0.5Co0.5)58.5' into 'Fe29.25Co29.25', after which the designator+number matching and extraction is easy:

>> str = '(Fe0.5Co0.5)58.5Cr14Mo6C15B6Er0.5';
>> baz = @(s,b)regexprep(s,'(\d+\.?\d*)','${num2str(str2double($1)*str2double(b))}'); %%%
>> tmp = regexprep(str,'\((([A-Z][a-z]*\d+\.?\d*)+)\)(\d+\.?\d*)','${baz($1,$2)}');   %%%
>> out = regexp(tmp,'([A-Z][a-z]*)(\d+\.?\d*)','tokens');
>> out = vertcat(out{:})
out = 
    'Fe'    '29.25'
    'Co'    '29.25'
    'Cr'    '14'   
    'Mo'    '6'    
    'C'     '15'   
    'B'     '6'    
    'Er'    '0.5' 
>> str2double(out(:,2)) % optional
ans =
        29.25
        29.25
           14
            6
           15
            6
          0.5
     

The code uses two layers of dynamic regular expressions: the first calls baz for each '(AxBy...)N' substring, then inside baz each of x, y, etc. is multiplied by N. The result is converted to char for reinsertion.
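For readers who prefer to avoid dynamic expressions, the same single-level expansion can be sketched with an explicit loop. This is only a sketch under assumptions, not the method used in the answer: the variable names (num, elm, grpRgx, tok, spl) are illustrative, and the inner repeated group is written as non-capturing, (?:...), so that exactly two tokens come back per group.

```matlab
str = '(Fe0.5Co0.5)58.5Cr14Mo6C15B6Er0.5';
num = '\d+\.?\d*';       % integer or decimal number
elm = '[A-Z][a-z]*';     % element symbol

% Find each '(...)N' group: token 1 is the inside, token 2 is the multiplier N.
% 'split' returns the text between the matches.
grpRgx = ['\(((?:' elm num ')+)\)(' num ')'];
[tok, spl] = regexp(str, grpRgx, 'tokens', 'split');

% Rebuild the string with every group expanded.
tmp = spl{1};
for k = 1:numel(tok)
    inner = regexp(tok{k}{1}, ['(' elm ')(' num ')'], 'tokens');
    inner = vertcat(inner{:});                        % N-by-2 cell: symbol, number
    vals  = str2double(inner(:,2)) * str2double(tok{k}{2});
    for j = 1:size(inner,1)
        tmp = [tmp, inner{j,1}, num2str(vals(j))];    %#ok<AGROW>
    end
    tmp = [tmp, spl{k+1}];                            %#ok<AGROW>
end

% Extract symbol/number pairs as two separate arrays.
out     = regexp(tmp, ['(' elm ')(' num ')'], 'tokens');
out     = vertcat(out{:});
symbols = out(:,1).';                  % 1-by-N cell array of char vectors
amounts = str2double(out(:,2)).';      % 1-by-N numeric vector
```

Returning symbols and amounts separately sidesteps the fact that a single ordinary array cannot hold both text and numbers; a cell array or a table would be the alternatives.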

Comments: 6

Stephen23 on 14 Feb 2020

Edited: Stephen23 on 15 Jun 2020

"2nd line creates a function, baz, with the variables (s,b), I guess"

Yes, it defines an anonymous function with two input arguments:

https://www.mathworks.com/help/matlab/matlab_prog/anonymous-functions.html

"Then baz is called in executing regexprep again?"

The function |baz| is called by |regexprep| on the third line. For your example string its two inputs will be 'Fe0.5Co0.5' and '58.5'.

"Is there any tutorial I could use, to be able to write code like this myself?"

Read the documentation, try the examples.

"Regexp and regexprep appear to be very powerful but the inputs look kind of chaotic"

Regular expressions are a language unto themselves. They require practice and reading the documentation again and again and again and again and again...

https://www.mathworks.com/help/matlab/matlab_prog/regular-expressions.html

In this case I also used dynamic regular expressions to call |baz| and to recalculate those numeric values:

https://www.mathworks.com/help/matlab/matlab_prog/dynamic-regular-expressions.html

Here is a breakdown of the regular expressions:

'(\d+\.?\d*)' % match integer or decimal number
 (            % start group 1
  \d+         % one or more digits
     \.?      % optional period character
        \d*   % zero or more digits
           )  % end group 1.
'\((([A-Z][a-z]*\d+\.?\d*)+)\)(\d+\.?\d*)' % match e.g. '(Fe0.5Co0.5)58.5' % FIXED
 \(                                        % literal left parenthesis
   (                                       % start group 1 (token $1)
    (                                      % start inner group "0" (nested, so MATLAB does not return it as a token)
     [A-Z]                                 % match one A-Z character
          [a-z]*                           % match zero or more a-z characters
                \d+\.?\d*                  % like 1st regex, match integer or decimal % FIXED
                         )+                % one or more repetitions of the inner group
                           )               % end group 1
                            \)             % literal right parenthesis
                              (            % start group 2 (token $2)
                               \d+\.?\d*   % like 1st regex, match integer or decimal
                                        )  % end group 2
'([A-Z][a-z]*)(\d+\.?\d*)' % match e.g. 'Cr14' or 'Fe29.25'
 (                         % start group 1
  [A-Z]                    % match one A-Z character
       [a-z]*              % match zero or more a-z characters
             )             % end group 1
              (            % start group 2
               \d+\.?\d*   % like 1st regex, match integer or decimal
                        )  % end group 2

You can see that the regular expressions always match the numbers and the element symbols in the same way; they just group them in slightly different ways. The groups are referred to using |$1| and |$2| in the replacement expressions.
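To see how tokens are reused, here is a small sketch (the variable names str, rgx, txt and tok are illustrative only). It applies the third expression to a short string twice: once with regexprep, where $1 and $2 insert the captured symbol and number into the replacement text, and once with regexp, which returns the tokens themselves.

```matlab
str = 'Cr14Mo6Er0.5';
rgx = '([A-Z][a-z]*)(\d+\.?\d*)';   % group 1: symbol, group 2: number

% $1 and $2 refer to the text captured by groups 1 and 2 of each match.
txt = regexprep(str, rgx, '$1=$2 ');

% The same groups returned as tokens: one 1-by-2 cell per match.
tok = regexp(str, rgx, 'tokens');
second = tok{2};                    % symbol and number of the second match
```

With this input, txt should read 'Cr=14 Mo=6 Er=0.5 ' and second should hold 'Mo' and '6'.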

You might also like to download my interactive tool IREGEXP:

https://www.mathworks.com/matlabcentral/fileexchange/48930-interactive-regular-expression-tool

LukasJ on 15 Jun 2020

Dear Stephen,

unfortunately I have run into trouble again and I am (still) not on a level to adjust the code you created. Multiple errors occur in my data, e.g.:

(Fe71.2B24Y4.8)92Nb8 is giving me

"Fe" "71.2"

"B" "24"

"Y" "4.8"

"Nb" "8"

wherein apparently Fe, B, Y haven't been multiplied by 92 or rather 0.92. I suppose that's because baz only takes two inputs (str2double is called twice in baz).

Another issue are compositions which contain more "subcompositions" and are separated by other brackets.

e.g. I read in

str = '{[(Fe0.6Co0.4)0.75B0.2Si0.05]0.96Nb0.04}96Cr4'

expecting it to not work since there are {, } and [,] I simply replaced those by ( and )... Well that didn't work either :D. For str I get

"Fe" "0.45"

"Co" "0.3"

"B" "0.2"

"Si" "0.05"

"Nb" "0.04"

"Cr" "4"

:(

Also when I try to modify the regexp like

regexp = '($|\[|\{)(([A-Z][a-z]*\d+\.\d*)+)(\}|\]|$)(\d+\.?\d*)'

that won't match anything more than before, since there are brackets within the outer curly or square brackets.

Thanks a lot for your help so far!

Best wishes,

Lukas

Off-Topic: Is that Cheburashka in your profile picture?

Stephen23 on 15 Jun 2020

Edited: Stephen23 on 15 Jun 2020

The reason is that I missed a question mark here:

tmp = regexprep(str,'\((([A-Z][a-z]*\d+\.?\d*)+)\)(\d+\.?\d*)','${baz($1,$2)}')
%                                        ^ missing

which meant that the regular expression did not match integer numbers, only numbers with a decimal point.

With that question mark in place (I have now corrected the question and comments), this is the output:

>> out = regexp(tmp,'([A-Z][a-z]*)(\d+\.?\d*)','tokens');
>> out = vertcat(out{:})
out = 
    'Fe'    '6550.4'
    'B'     '2208'  
    'Y'     '441.6' 
    'Nb'    '8'  

"Another issue are compositions which contain more "subcompositions" and are separated by other brackets."

Regular expressions alone are not really suitable for this. For parsing arbitrarily nested brackets like that you will probably have to write your own string parser, e.g. based on a recursive function which uses regular expressions or other string manipulation inside it. Have you worked with recursive functions before?
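A recursive parser of the kind described could look roughly like the sketch below. It is an illustration under stated assumptions rather than a tested solution for the whole dataset: the function names parseComposition and parseItems are hypothetical, the code is assumed to be saved as parseComposition.m, an element without a number is assumed to count as 1, the three bracket types are treated as interchangeable (a '(' closed by ']' is not reported), and a trailing number simply multiplies everything inside its group, as in the answer above.

```matlab
function [symbols, amounts] = parseComposition(str)
% Hypothetical helper: expand nested (), [] and {} groups recursively.
[symbols, amounts, pos] = parseItems(str, 1);
if pos <= numel(str)
    error('parseComposition:syntax', ...
          'Unexpected "%s" at position %d.', str(pos), pos);
end
end

function [symbols, amounts, pos] = parseItems(str, pos)
% Parse items until a closing bracket or the end of the string is reached.
symbols = {};
amounts = [];
while pos <= numel(str) && ~any(str(pos) == ')]}')
    if any(str(pos) == '([{')
        % Group: recurse on its contents, then step over the closing bracket.
        [sym, amt, pos] = parseItems(str, pos + 1);
        if pos > numel(str)
            error('parseComposition:syntax', 'Missing closing bracket.');
        end
        pos = pos + 1;
    else
        % Element symbol: one upper-case letter, then lower-case letters.
        name = regexp(str(pos:end), '^[A-Z][a-z]*', 'match', 'once');
        if isempty(name)
            error('parseComposition:syntax', ...
                  'Unexpected "%s" at position %d.', str(pos), pos);
        end
        sym = {name};
        amt = 1;                          % assumption: no number means 1
        pos = pos + numel(name);
    end
    % An optional trailing number multiplies the element or the whole group.
    num = regexp(str(pos:end), '^\d+\.?\d*', 'match', 'once');
    if ~isempty(num)
        amt = amt * str2double(num);
        pos = pos + numel(num);
    end
    symbols = [symbols, sym];             %#ok<AGROW>
    amounts = [amounts, amt];             %#ok<AGROW>
end
end
```

Called as [sym, amt] = parseComposition('{[(Fe0.6Co0.4)0.75B0.2Si0.05]0.96Nb0.04}96Cr4'), this should give Fe 41.472, Co 27.648, B 18.432, Si 4.608, Nb 3.84 and Cr 4, which sum to 100. Each recursion level handles one pair of brackets, so the nesting depth does not need to be known in advance. Whether plain multiplication is the right rule for a given group depends on the notation used in the data.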

Yes, Cheburashka!

LukasJ on 15 Jun 2020

Edited: LukasJ on 15 Jun 2020

But this output is also wrong! The numbers should sum to 1 or 100, and the proportion of Fe should then be 0.65504. The complete output should be:

out =

'Fe' '0.65504'

'B' '0.2208'

'Y' '0.04416'

'Nb' '0.08'

The issues come from the inconsistent notation in the dataset I guess :(.
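One way to reconcile the two notations, offered here only as an assumption about what the data means, is to rescale each group so that its inner amounts sum to the number after the bracket, instead of multiplying them directly. The sketch below uses hand-typed values for '(Fe71.2B24Y4.8)92Nb8'; the variable names inner, N, rest, scaled and frac are illustrative.

```matlab
inner = [71.2 24 4.8];   % amounts inside '(Fe71.2B24Y4.8)', here summing to 100
N     = 92;              % number after the closing parenthesis
rest  = 8;               % 'Nb8', which sits outside the group

% Rescale the group so that its members sum to N, whatever the inner scale was.
scaled = inner ./ sum(inner) .* N;

% Express everything as fractions of the whole composition (summing to 1).
total = [scaled, rest];
frac  = total ./ sum(total);
```

For this input frac comes out as approximately 0.65504, 0.2208, 0.04416 and 0.08. A group written with fractions, such as '(Fe0.5Co0.5)58.5', has an inner sum of 1, so the same rule reduces to the plain multiplication used earlier and gives 29.25 for each element.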

"Have you worked with recursive functions before?"

I vaguely remember writing the Fibonacci-series as a function during my bachelor studies... But I don't think I'm capable of applying any of that on this problem in a finite amount of time :D

I will probably just iterate over all the "unclear" cases where { and [ occur, extract what's inside and then apply your function.

e.g. match

regexp = '\{.+\}'

And then check what's inside.

At least that would be feasible for me to do :)
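That plan could be sketched as follows, assuming at most one pair of curly braces per string. The names tok, inside, factor and hasBrackets are illustrative. Note that .+ is greedy, so with two sibling {...} groups it would span from the first { to the last }.

```matlab
str = '{[(Fe0.6Co0.4)0.75B0.2Si0.05]0.96Nb0.04}96Cr4';

% Flag the "unclear" cases: strings containing [ or {.
hasBrackets = ~isempty(regexp(str, '[\[\{]', 'once'));

% Capture what is inside the curly braces and the number that follows them.
tok    = regexp(str, '\{(.+)\}(\d+\.?\d*)', 'tokens', 'once');
inside = tok{1};                 % text between { and }
factor = str2double(tok{2});     % number after the closing brace
```

Here inside still contains the [...] and (...) groups, so the same extraction would have to be repeated on it for each bracket level.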


Answer 2

LukasJ on 13 Feb 2020

https://kr.mathworks.com/matlabcentral/answers/505344-extracting-string-and-number-pairs-from-a-mixed-string#answer_415446

ok so

numbers = str2double(regexp(myString,'[\d.]+','match')); does give me the numbers

but

words = regexp(myString,'[\w*]+','match'); does not give me the chemical notations, probably because those aren't "words" :/
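A likely explanation is that \w matches letters, digits and the underscore, so the class [\w*]+ swallows the whole string as a single match. The short sketch below contrasts it with a pattern for element symbols; the variable names are illustrative, and the patterns for symbols and numbers are the ones used in the accepted answer.

```matlab
myString = 'Fe61Zr8Co7Mo15B7Y2';

% \w also matches digits, so the entire string is returned as one "word".
words = regexp(myString, '[\w*]+', 'match');

% One upper-case letter followed by any lower-case letters: element symbols.
symbols = regexp(myString, '[A-Z][a-z]*', 'match');

% Integer or decimal numbers, converted from text to double.
numbers = str2double(regexp(myString, '\d+\.?\d*', 'match'));
```

For this string, symbols should contain Fe, Zr, Co, Mo, B and Y, and numbers should be 61, 8, 7, 15, 7 and 2, in matching order.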
