Extracting string and number pairs from a mixed string

LukasJ on 13 Feb 2020
Edited: Stephen23 on 15 Jun 2020
Dear all,
I have read in material data and obtained an array of chemical compositions stored as char vectors, for example:
'Fe61Zr8Co7Mo15B7Y2' or sometimes even '(Fe0.5Co0.5)58.5Cr14Mo6C15B6Er0.5', where Fe and Co each have a percentage of 29.25 (58.5*0.5).
How do I extract the chemical composition so that I get an array (or two) like
['Fe', 'Zr', 'Co', 'Mo', 'B', 'Y'; 61, 8, 7, 15, 7, 2]?
I am somehow failing to use sscanf correctly: for example, numbers = sscanf('Fe61Zr8Co7Mo15B7Y2', '%f') does not return the numbers, because the format would have to skip the varying element symbols :(
Thanks a lot in advance!
Greetings
Lukas

Accepted Answer

Stephen23 on 13 Feb 2020
Edited: Stephen23 on 15 Jun 2020
The two lines marked with %%% are used to convert substrings like '(Fe0.5Co0.5)58.5' into 'Fe29.25Co29.25', after which the designator+number matching and extraction is easy:
>> str = '(Fe0.5Co0.5)58.5Cr14Mo6C15B6Er0.5';
>> baz = @(s,b)regexprep(s,'(\d+\.?\d*)','${num2str(str2double($1)*str2double(b))}'); %%%
>> tmp = regexprep(str,'\((([A-Z][a-z]*\d+\.?\d*)+)\)(\d+\.?\d*)','${baz($1,$2)}'); %%%
>> out = regexp(tmp,'([A-Z][a-z]*)(\d+\.?\d*)','tokens');
>> out = vertcat(out{:})
out =
    'Fe'    '29.25'
    'Co'    '29.25'
    'Cr'    '14'
    'Mo'    '6'
    'C'     '15'
    'B'     '6'
    'Er'    '0.5'
>> str2double(out(:,2)) % optional
ans =
    29.25
    29.25
    14
    6
    15
    6
    0.5
The code uses two layers of dynamic regular expressions: the first calls baz for each '(AxBy...)N' substring, then inside baz each of x, y, etc. is multiplied by N. The result is converted to char for reinsertion.
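For readers who find nested dynamic expressions hard to follow, the same expansion can be written as an explicit loop. This is only a sketch and not the code from the answer: the names num, grpPat, grp and rest are introduced here for illustration, and it assumes that brackets are not nested.

```matlab
str = '(Fe0.5Co0.5)58.5Cr14Mo6C15B6Er0.5';
num = '\d+\.?\d*';                                    % integer or decimal number
grpPat = ['\(((?:[A-Z][a-z]*' num ')+)\)(' num ')'];  % matches '(AxBy...)N'

% Tokens of each bracketed group, plus the text between the groups.
[grp, rest] = regexp(str, grpPat, 'tokens', 'split');

tmp = rest{1};
for k = 1:numel(grp)
    N = str2double(grp{k}{2});                        % multiplier after ')'
    inner = regexp(grp{k}{1}, ['([A-Z][a-z]*)(' num ')'], 'tokens');
    for j = 1:numel(inner)
        % Scale each coefficient by N and re-insert it as text.
        tmp = [tmp, inner{j}{1}, num2str(str2double(inner{j}{2}) * N)]; %#ok<AGROW>
    end
    tmp = [tmp, rest{k+1}]; %#ok<AGROW>
end
```

After the loop, tmp holds the flattened formula 'Fe29.25Co29.25Cr14Mo6C15B6Er0.5', which can be passed to the final regexp call with the 'tokens' option exactly as in the answer.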
Comments: 6
Stephen23 on 15 Jun 2020
Edited: Stephen23 on 15 Jun 2020
The reason is that I missed a question mark here:
tmp = regexprep(str,'\((([A-Z][a-z]*\d+\.?\d*)+)\)(\d+\.?\d*)','${baz($1,$2)}')
% ^ missing
which meant that the regular expression did not match integer numbers, only numbers with a decimal point.
With that question mark in place (I have now corrected the question and comments), this is the output:
>> out = regexp(tmp,'([A-Z][a-z]*)(\d+\.?\d*)','tokens');
>> out = vertcat(out{:})
out =
    'Fe'    '6550.4'
    'B'     '2208'
    'Y'     '441.6'
    'Nb'    '8'
"Another issue are compositions which contain more "subcompositions" and are separated by other brackets."
Regular expressions alone are not really suitable for this. For parsing arbitrarily nested brackets like that you will probably have to write your own string parser, e.g. based on a recursive function which uses regular expressions or other string manipulation inside it. Have you worked with recursive functions before?
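A recursive parser of the kind described here might look like the following sketch. It is not from the thread: the function names parseFormula, parseGroup and readNumber are hypothetical, and it assumes that a missing number means 1 and that a multiplier after a closing bracket applies to everything inside that bracket, whichever of (), [] or {} is used.

```matlab
function [sym, amt] = parseFormula(str)
% Returns element symbols (cell row) and their amounts (numeric row).
    [sym, amt] = parseGroup(str, 1, '');
end

function [sym, amt, pos] = parseGroup(str, pos, closer)
% Parses from position pos until the bracket "closer" (or the end of str).
    sym = {};
    amt = [];
    closers = ')]}';
    while pos <= numel(str)
        c = str(pos);
        if any(c == '([{')
            % Recurse into the bracket, then scale its content.
            [s, a, pos] = parseGroup(str, pos + 1, closers(c == '([{'));
            [f, pos] = readNumber(str, pos);
            sym = [sym, s];      %#ok<AGROW>
            amt = [amt, a * f];  %#ok<AGROW>
        elseif any(c == closers)
            assert(strcmp(c, closer), 'Unexpected "%s" at position %d.', c, pos);
            pos = pos + 1;
            return
        else
            tok = regexp(str(pos:end), '^[A-Z][a-z]*', 'match', 'once');
            assert(~isempty(tok), 'Expected an element symbol at position %d.', pos);
            pos = pos + numel(tok);
            [f, pos] = readNumber(str, pos);
            sym{end+1} = tok;    %#ok<AGROW>
            amt(end+1) = f;      %#ok<AGROW>
        end
    end
    assert(isempty(closer), 'Missing closing "%s".', closer);
end

function [val, pos] = readNumber(str, pos)
% Reads an optional number at pos; defaults to 1 if none is present.
    tok = regexp(str(pos:end), '^\d+\.?\d*', 'match', 'once');
    if isempty(tok)
        val = 1;
    else
        val = str2double(tok);
        pos = pos + numel(tok);
    end
end
```

Saved as parseFormula.m, a call such as [sym, amt] = parseFormula('{[(Fe0.5Co0.5)0.75B0.25]0.9Nb0.1}100') (a made-up input) should return the symbols Fe, Co, B, Nb with amounts of approximately 33.75, 33.75, 22.5 and 10. Regular expressions are still used, but only to read one symbol or one number at a time; the bracket nesting is handled by the recursion.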
Cheburashka, yes!
LukasJ on 15 Jun 2020
Edited: LukasJ on 15 Jun 2020
But this output is also wrong! The numbers should sum up to 1 or 100, so the proportion of Fe should be 0.65504. The complete output should be:
out =
    'Fe'    '0.65504'
    'B'     '0.2208'
    'Y'     '0.04416'
    'Nb'    '0.08'
The issues come from the inconsistent notation in the dataset, I guess :(.
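The expected values follow if the coefficients inside the brackets are read as relative shares of the bracketed group rather than as plain multipliers. The sketch below works under that assumption. The input string is not quoted in this thread; it is reconstructed from the numbers shown and is therefore an assumption, as are all variable names.

```matlab
str = '(Fe71.2B24Y4.8)92Nb8';   % assumed input, inferred from the outputs above
num = '\d+\.?\d*';
elm = ['([A-Z][a-z]*)(' num ')'];
grpPat = ['\(((?:[A-Z][a-z]*' num ')+)\)(' num ')'];

[grp, rest] = regexp(str, grpPat, 'tokens', 'split');

sym = {};
amt = [];
for k = 1:numel(grp)
    tok = regexp(grp{k}{1}, elm, 'tokens');
    tok = vertcat(tok{:});
    w = str2double(tok(:,2));
    % Normalise the shares inside the bracket, then scale by the multiplier.
    sym = [sym; tok(:,1)];                               %#ok<AGROW>
    amt = [amt; w ./ sum(w) .* str2double(grp{k}{2})];   %#ok<AGROW>
end

tok = regexp([rest{:}], elm, 'tokens');   % elements outside any brackets
if ~isempty(tok)
    tok = vertcat(tok{:});
    sym = [sym; tok(:,1)];
    amt = [amt; str2double(tok(:,2))];
end

amt = amt ./ sum(amt);   % fractions that sum to 1
```

For the assumed input this gives fractions of approximately 0.65504, 0.2208, 0.04416 and 0.08 for Fe, B, Y and Nb. Note that bracketed elements are listed before unbracketed ones, so the order can differ from the order in the formula.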
"Have you worked with recursive functions before?"
I vaguely remember writing the Fibonacci series as a function during my bachelor's studies... But I don't think I'm capable of applying any of that to this problem in a finite amount of time :D
I will probably just iterate over all the "unclear" cases where { and [ occur, extract what's inside and then apply your function,
e.g. match
expr = '\{.+\}'
and then check what's inside.
At least that would be feasible for me to do :)
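A minimal sketch of that plan, assuming the compositions are stored in a cell array of char vectors. The variable names and the second formula are made up for illustration.

```matlab
formulas = {'Fe61Zr8Co7Mo15B7Y2', ...
            '{[(Fe0.5Co0.5)0.75B0.25]0.9Nb0.1}100'};   % hypothetical data

% Flag every formula that contains '[' or '{'.
hasOther = ~cellfun(@isempty, regexp(formulas, '[\[\{]', 'once'));
unclear  = formulas(hasOther);

% Extract what is inside the curly braces of the flagged formulas.
% The greedy '.+' spans from the first '{' to the last '}'.
inside = regexp(unclear, '\{(.+)\}', 'tokens', 'once');
```

Here hasOther is [false true], and inside{1}{1} contains the text between the curly braces of the flagged formula, which can then be inspected or processed further.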


More Answers (1)

LukasJ on 13 Feb 2020
OK, so
numbers = str2double(regexp(myString,'[\d.]+','match')); does give me the numbers,
but
words = regexp(myString,'[\w*]+','match'); does not give me the chemical notations, probably because those aren't "words" :/
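One likely reason the second call does not help is that \w matches digits and underscores as well as letters, so it cannot separate symbols from numbers. A pattern that describes an element symbol directly (one uppercase letter followed by optional lowercase letters) avoids this. A small sketch, using the simple formula from the question:

```matlab
myString = 'Fe61Zr8Co7Mo15B7Y2';

% One uppercase letter plus optional lowercase letters = element symbol.
symbols = regexp(myString, '[A-Z][a-z]*', 'match');

% Integer or decimal numbers.
numbers = str2double(regexp(myString, '\d+\.?\d*', 'match'));
```

This returns symbols = {'Fe','Zr','Co','Mo','B','Y'} and numbers = [61 8 7 15 7 2]. It only pairs up correctly when every symbol is followed by a number and there are no brackets.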

Release

R2017b
