How to extract data from an HTML table?

Alan Cesar Pilon Miro on 1 Feb 2023
Commented: Jonas on 6 Feb 2023
Hi,
I want to access an HTML page and extract some information. However, when I use webread and then htmlTree, part of the HTML data is missing and I don't know why.
Example:
Using this url
I would like to get the rows or columns that hold the SMILES and InChI fields. However, with the code below I can't see this information. I have tried different selectors, but I don't know whether the data is dynamically generated.
html = webread(url);
tree = htmlTree(html);
selector= "td";
subtrees= findElement(tree,selector);
str = extractHTMLText(subtrees);
table_data = str(1:end);
Thank you,
Alan

Accepted Answer

Jonas on 2 Feb 2023
Without digging deeper into the HTML, we can just use a text search:
d=webread('http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152',weboptions('Timeout',15));
SMILESfirstTry=extractBetween(d,'<th class="inf">SMILES</th>','</td>','Boundaries','exclusive');
SMILESsecondTry=extractAfter(SMILESfirstTry{1},'<td colspan="4">')
SMILESsecondTry = 'c1c(ccc(c1)/C=C/C(=O)O)O'
The same could be done for the other tags.
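As a rough sketch of how the text search could be repeated for several fields, the loop below applies the same extractBetween/extractAfter idea to a list of header names. The string d here is a hypothetical two-row sample standing in for the text returned by webread, and it is only an assumption that the other rows on the real page use the same <th class="inf"> markup as the SMILES row.

```matlab
% Hypothetical sample standing in for the text returned by webread.
% Assumption: every field of interest uses the same <th class="inf"> markup.
d = ['<tr><th class="inf">InChIKey</th><td colspan="4">NGSWKAQJJWESNS-ZZXKWVIFSA-N</td></tr>' ...
     '<tr><th class="inf">SMILES</th><td colspan="4">c1c(ccc(c1)/C=C/C(=O)O)O</td></tr>'];

fields = {'InChIKey', 'SMILES'};
values = cell(size(fields));
for k = 1:numel(fields)
    header = ['<th class="inf">' fields{k} '</th>'];
    % Text between the header cell and the end of the following data cell
    cellHtml = extractBetween(d, header, '</td>');
    if isempty(cellHtml)
        values{k} = '';                               % field not found in the text
    else
        values{k} = extractAfter(cellHtml{1}, '>');  % drop the opening <td ...> tag
    end
end
```

Splitting at the first '>' rather than at the literal '<td colspan="4">' avoids depending on the exact attributes of the data cell, at the cost of assuming the value itself follows the first closing bracket.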
Similarly, with a bit more HTML parsing:
tree = htmlTree(d);
selector= "tr";
subtrees= findElement(tree,selector);
str = extractHTMLText(subtrees);
searchTags={'InChIKey' 'InChICode' 'SMILES'};
location=contains(str,searchTags);
rawEntries=str(location)
rawEntries = 3×1 string array
    "InChIKey NGSWKAQJJWESNS-ZZXKWVIFSA-N"
    "InChICode InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"
    "SMILES c1c(ccc(c1)/C=C/C(=O)O)O"
extractAfter(rawEntries,' ')
ans = 3×1 string array
    "NGSWKAQJJWESNS-ZZXKWVIFSA-N"
    "InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"
    "c1c(ccc(c1)/C=C/C(=O)O)O"
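If the values need to be looked up by name afterwards, one possible follow-up is to split each entry at its first space and keep both halves in a table. This is only a sketch: rawEntries is retyped here from the output shown above so the snippet runs on its own, and the variable names result and smiles are illustrative.

```matlab
% rawEntries copied from the output above so the snippet is self-contained
rawEntries = ["InChIKey NGSWKAQJJWESNS-ZZXKWVIFSA-N"; ...
              "InChICode InChI=1S/C9H8O3/c10-8-4-1-7(2-5-8)3-6-9(11)12/h1-6,10H,(H,11,12)/b6-3+"; ...
              "SMILES c1c(ccc(c1)/C=C/C(=O)O)O"];

names  = extractBefore(rawEntries, ' ');   % text before the first space: field name
values = extractAfter(rawEntries, ' ');    % text after the first space: field value

% Keep name/value pairs together so a field can be selected by name
result = table(names, values, 'VariableNames', {'Field', 'Value'});
smiles = result.Value(result.Field == "SMILES");
```

This relies on the field name containing no spaces, which holds for the three tags searched here.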
2 Comments
Alan Cesar Pilon Miro on 3 Feb 2023
Hi Jonas,
Thank you! The first method worked very well.
Just to mention, I had some difficulties with the second approach: I could not find the objects.
Jonas on 6 Feb 2023
Thanks for your reply. Make sure that the data returned from webread is not empty. The website seems to be quite slow, and sometimes the returned data is empty. Increasing the timeout limit further may help here.
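One way to guard against an empty or timed-out response is to retry the request a few times before parsing. The sketch below is only an illustration of that idea: the 30-second timeout, the three attempts, and the 2-second pause are arbitrary assumptions, not values recommended in this thread.

```matlab
url  = 'http://www.knapsackfamily.com/knapsack_core/information.php?word=C00000152';
opts = weboptions('Timeout', 30);   % assumed value, larger than the 15 s used above
maxTries = 3;                       % assumed number of attempts

d = '';
for attempt = 1:maxTries
    try
        d = webread(url, opts);
    catch err
        % webread throws on timeouts and connection errors; report and retry
        warning('Attempt %d failed: %s', attempt, err.message);
    end
    if ~isempty(d)
        break                       % got some content, stop retrying
    end
    pause(2);                       % short wait before the next attempt
end

if isempty(d)
    warning('No data returned after %d attempts.', maxTries);
end
```

Only when d is non-empty does it make sense to pass it on to htmlTree or to the text search.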
