How to extract info from a chemical formula

Hi All, I want to break down a chemical formula into its constituents. For example, silicon dioxide is SiO2: I want to take the string 'SiO2' and parse it so that I know I have 1 silicon and 2 oxygens. I want to do this for more complex compounds as well, say polyimide, which is 'C22H10N2O5'. So I need to handle uppercase and lowercase letters, and 1- or 2-digit numbers. Any help would be much appreciated.
Pat

Answers (4)

Fangjun Jiang, 12 Aug 2011

Votes: 2

Using a combination of isstrprop() and regexp() might help. You need to provide more examples and explain what you want.
str='C22H10N2O5'
num=regexp(str,'\d+','match')
isstrprop(str,'alpha')
isstrprop(str,'digit')
isstrprop(str,'upper')
One solution:
str='C22H10PuCrN2O5';
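[EleList,EleEnd]=regexp(str,'[A-Z][a-z]?','match','end'); % element symbols and the index where each one ends
[Num,NumStart]=regexp(str,'\d+','match','start'); % counts and the index where each one starts
NumList=ones(size(EleList)); % default count is 1
Index=ismember(EleEnd+1,NumStart); % elements immediately followed by a number
NumList(Index)=str2double(Num);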

Comments: 3

Patrick Knapp, 12 Aug 2011
Thanks, those will be helpful. For the SiO2 example I would like a structure returned that says
Ans.element = {'Si','O'}
Ans.num = {1, 2}
For Polyimide we would then have
Ans.element = {'C','H','N','O'}
Ans.num = {22, 10, 2, 5}
I then have a program that grabs the names and quantities of the individual elements and does what I want with them. My problem is extracting them from a string, which, as was mentioned above, does not follow simple rules.
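For reference, the struct layout described above can be produced directly from regexp tokens. This is only a minimal sketch: it assumes a flat formula in which every symbol is one uppercase letter optionally followed by one lowercase letter, and the pattern and variable names other than Ans.element and Ans.num are illustrative.

```matlab
% Sketch: parse a flat formula into the element/num struct layout.
% Assumes symbols look like 'C' or 'Si' and counts are plain integers.
str = 'C22H10N2O5';
tok = regexp(str, '([A-Z][a-z]?)(\d*)', 'tokens');  % one {symbol, count} pair per element
tok = vertcat(tok{:});                              % N-by-2 cell array of char vectors
counts = str2double(tok(:,2));                      % an empty count converts to NaN
counts(isnan(counts)) = 1;                          % a missing count means 1
Ans.element = tok(:,1).';                           % 1-by-N cell of symbols
Ans.num = num2cell(counts.');                       % 1-by-N cell of counts
```

For 'SiO2' the same lines give Ans.element = {'Si','O'} and Ans.num = {1, 2}.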
Fangjun Jiang, 12 Aug 2011
Please refresh my memory. Is it true that an element symbol can only have one or two letters? If it has two letters, is it true that the first letter is always uppercase and the second is always lowercase?
Paulo Silva, 12 Aug 2011
Look at any periodic table you can find online; all the symbols should be there.


Paulo Silva, 12 Aug 2011

Votes: 0

That's not easy to do. For example, not all formulas have their constituents separated by a number. You also need a list of all possible constituents so that you can identify them in any formula, and then check whether a number follows each constituent.
Please search the File Exchange; you might get lucky and find that someone has already done it.
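The list-based check described above could look roughly like the following. It is a sketch only: the `known` list is a deliberately partial, illustrative subset rather than a full periodic table, and 'Xx' is a made-up symbol added to show a rejection.

```matlab
% Sketch: flag candidate symbols that are not in a known-element list.
% 'known' is a small illustrative subset, not a complete periodic table.
known = {'H','C','N','O','Si','Cr','Pu'};
str = 'C22H10PuCrN2O5Xx';                        % 'Xx' is not a real element
symbols = regexp(str, '[A-Z][a-z]?', 'match');   % candidate symbols, in order
isKnown = ismember(symbols, known);              % true where the symbol is recognised
unknown = symbols(~isKnown);                     % symbols that failed the check
```

Because the pattern grabs a lowercase letter whenever one follows, 'Co' is read as cobalt and 'CO' as carbon plus oxygen, so adjacent symbols without a number between them are still separated.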

Comments: 1

Kelly Kearney, 12 Aug 2011
Well, assuming he's not dealing with any of those U** elements at the upper end of the periodic table, then all elements consist of either one capital letter or a capital and a lowercase letter. So it should be pretty easy to pick those out. Will you always have the base formula, or will it be arranged structurally (e.g. Si(OH)4 vs. SiO4H4)?
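If structural formulas do turn up, one possible approach is to expand the parenthesised groups first and then reuse a flat-formula parser. The sketch below assumes well-formed input with balanced parentheses and integer multipliers; it does not merge repeated elements, and the variable names are illustrative.

```matlab
% Sketch: expand parenthesised groups, innermost first, into a flat formula.
% Assumes balanced parentheses and integer multipliers.
str = 'Si(OH)4';
pat = '\(([^()]*)\)(\d*)';                 % innermost group plus optional multiplier
while true
    [tok, s, e] = regexp(str, pat, 'tokens', 'start', 'end', 'once');
    if isempty(tok), break; end            % no groups left
    mult = str2double(tok{2});
    if isnan(mult), mult = 1; end          % a group without a multiplier counts once
    parts = regexp(tok{1}, '([A-Z][a-z]?)(\d*)', 'tokens');
    expanded = '';
    for k = 1:numel(parts)
        c = str2double(parts{k}{2});
        if isnan(c), c = 1; end            % a symbol without a count counts once
        expanded = [expanded sprintf('%s%d', parts{k}{1}, c*mult)]; %#ok<AGROW>
    end
    str = [str(1:s-1) expanded str(e+1:end)];  % splice the expansion back in
end
```

After the loop, str holds the flat form of the input, which for 'Si(OH)4' is the 'SiO4H4' arrangement mentioned above.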


Patrick Knapp, 12 Aug 2011

Votes: 0

I figured it out. Thanks for the help Fangjun!
str = 'SiO2';
num = regexp(str,'\d+','match');    % cell array containing the numbers
D = [isstrprop(str,'digit') false]; % location of digits, padded so D(n) is valid just past the end
U = isstrprop(str,'upper');         % location of uppercase letters
L = [isstrprop(str,'lower') false]; % location of lowercase letters, padded so L(n+1) is valid at the end
NumElem = sum(U);                   % number of uppercase letters == number of elements in formula
Formula = struct('element',{},'quantity',{}); % initialize output
% Loop through formula to extract quantities
n = 1;           % current position in str
num_counter = 1; % index of the next unused entry in num
for i = 1:NumElem
    % Symbol: one uppercase letter, optionally followed by one lowercase letter
    if L(n+1)
        Formula(i).element = str(n:n+1);
        n = n+2;
    else
        Formula(i).element = str(n);
        n = n+1;
    end
    % Quantity: the digits that follow the symbol, or 1 if there are none
    if D(n)
        Formula(i).quantity = str2double(num{num_counter});
        n = n+length(num{num_counter});
        num_counter = num_counter+1;
    else
        Formula(i).quantity = 1;
    end
end

Comments: 1

Fangjun Jiang, 12 Aug 2011
Nice! I couldn't resist coming up with a no-loop solution. See my updated answer.


phenan08, 26 Jan 2023

Votes: 0

If it helps, I wrote a formula string parser that determines the composition of a molecule, element by element.
It accepts semi-developed (condensed) formulas, and the script returns 4 outputs: the raw molecular formula, the composition table (the different elements with their counts), the average MW, and the monoisotopic mass.
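To illustrate the average-MW step only, here is a rough sketch. It is not the script mentioned above: the `mass` table is a small illustrative subset of standard atomic weights in g/mol, only flat formulas are handled, and monoisotopic mass is not covered.

```matlab
% Sketch: average molecular weight of a flat formula.
% 'mass' is an illustrative subset of standard atomic weights (g/mol).
mass = containers.Map({'H','C','N','O','Si'}, [1.008 12.011 14.007 15.999 28.085]);
str = 'SiO2';
tok = regexp(str, '([A-Z][a-z]?)(\d*)', 'tokens');  % {symbol, count} pairs
mw = 0;
for k = 1:numel(tok)
    n = str2double(tok{k}{2});
    if isnan(n), n = 1; end                % a missing count means 1
    mw = mw + n * mass(tok{k}{1});         % errors if the symbol is not in the table
end
```

For 'SiO2' this sums 28.085 + 2 × 15.999, giving roughly 60.08 g/mol.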


Asked: 12 Aug 2011
Answered: 26 Jan 2023
