Iteratively search in a website (for dummies)

조회 수: 1 (최근 30일)
Luca D'Angelo
Luca D'Angelo 2024년 5월 3일
답변: Luca D'Angelo 2024년 5월 9일
Hi all,
I have a list of thousands of chemical formula (or potentially formula). What I'd like to do is to iteratively get one of this formula (for i=1:size(FormulaList,1)....end), insert the formula into the search bar of the website (that is: https://pubchem.ncbi.nlm.nih.gov/ ), and check if I have a possible matches or I get something like this ("0 results found"):
I've tried to apply the method described here ( https://it.mathworks.com/matlabcentral/answers/400522-retrieving-data-from-a-web-page ) but I was not able to understand how to get the "curl" (sorry: I'm completely ignorant in this!).
Cheers,
Luca
[SL: removed the parenthesis from the end of one of the hyperlinks]

채택된 답변

Luca D'Angelo
Luca D'Angelo 2024년 5월 9일
I've found the solution.
% MassList: column-vector with molecular formula
tic
for mass=1:size(MassList,1)
url=strcat('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/',MassList(mass,1),'/cids/JSON?list_return=cachekey');
try
jsonData = webread(url);
ResNum(mass,1)=jsonData.IdentifierList.Size;
catch
ResNum(mass,1)=0;
end
pause(0.205) % the website asks for max 5 requests / second
end
toc
The resulting column-array provides the number of compounds with the same molecular formula found in PubChem.

추가 답변 (1개)

Steven Lord
Steven Lord 2024년 5월 3일
Your best bet is probably to use one of the access methods that PubChem provides, as described on this page. Note the usage policy. If you have thousands of requests it's likely going to take minutes or longer, or the bulk data downloads functionality linked in the usage policy may be a better fit for your needs.
From the MATLAB side of things, the functions in this documentation category likely will be of use to you as may be the functions on this documentation page. [Before you ask no, I don't have any examples specific to using those functions to access that database.]
  댓글 수: 3
Steven Lord
Steven Lord 2024년 5월 6일
You haven't shown us what values you're using for the maxAttempts and waitTime variables in your code.
Luca D'Angelo
Luca D'Angelo 2024년 5월 6일
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
maxAttempts = 10; % Maximum number of attempts
waitTime = 5; % Time to wait between attempts (in seconds)
attempt = 1;
while attempt <= maxAttempts
jsonData = webread(apiUrl);
if ~isfield(jsonData, 'Waiting') || isempty(jsonData.Waiting) %|| ~strcmpi(jsonData.Waiting, 'true')
break; % Exit loop if request is not waiting anymore
end
attempt = attempt + 1;
pause(waitTime);
end
% Check if the request is still processing after the loop
if isfield(jsonData, 'Waiting') && ~isempty(jsonData.Waiting) && strcmpi(jsonData.Waiting, 'true')
disp('Your request is still processing. Please wait and try again later.');
return;
end
if isfield(jsonData, 'Fault')
disp(['Error: ', jsonData.Fault.Message]);
return;
end
numResults = 0; % Initialize number of results
if isfield(jsonData, 'IdentifierList') && isfield(jsonData.IdentifierList, 'CID')
numResults = numel(jsonData.IdentifierList.CID); % Number of search results
end
disp(['Number of results for molecular formula "', molecularFormula, '": ', num2str(numResults)]);
It doesn't really matter, actually. Most of the previous code was written by chatgpt but it's useless. The main lines are:
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
jsonData = webread(apiUrl);
if webread worked, maybe I would be able to find the information I am looking for. The problem is that I think the function launches the search but then doesn't wait for the website to ‘load’ the result, so it shows ‘Your request is still running’. Maybe I should find a way to launch the command, wait and then check if the webpage 'loaded' the results. What do you think?

댓글을 달려면 로그인하십시오.

카테고리

Help CenterFile Exchange에서 Genomics and Next Generation Sequencing에 대해 자세히 알아보기

제품


릴리스

R2023a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by