How to make sure a web page is entirely loaded before using webread

Hello,
While scraping data from a website, I have no issue with most of it, except for one data table. The table is available on the page, since its elements can be seen when inspecting the HTML.
My assumption is that the webpage needs some time to load fully before webread can be used properly.
Neither a timeout parameter nor a simple loop that retries until success fixes the issue. I also don't mind loading the data as text instead of as a table, as that is not the real problem here.
So all in all, I'm trying to find a way to open and fully load a web page, and then pass the result to webread in a second step, unless a webread or weboptions parameter exists that addresses the issue.
Thanks for your help!

11 Comments

I'm not sure how RESTful works. When you send the web server the request, do you think it sends back some signal that it's "done" and all the data that will be sent has been sent? Or do you think it's just waiting a while and doesn't hear anything more after a certain waiting period, and then "times out" and declares that it's done because no more data is being delivered?
I'm not an expert, hence my question, but I think it's the first option, even if the data couldn't be fully loaded. If the request can't get any result at all (which is not my case, as I can retrieve other data from the same web page), webread is supposed to keep trying to get the web contents until the timeout is reached.
As an experiment, I've inserted a pause(10) statement in the webread.m function, without success.
So there are two options:
  • this statement is not supposed to have any effect due to the structure of the functions openHTTPConnection and readContentFromWebService
  • or it should in principle fix the issue, which would mean my issue is not there and is more related to protected contents (I have doubts about this scenario as all the html elements I need can be 'inspected' while right clicking on them on the web page)
% Open the HTTP connection and obtain the connection and content type.
connection = openHTTPConnection(url, options, '');
pause(10)
% Send the request and read the content from the web service.
[varargout{1:nargout}] = readContentFromWebService(connection, options);
Does the web page load if you use a web browser? If so, you might have to call tech support.
Are you sure that the table isn't dynamic content that's simply unavailable without the ability to execute the necessary scripts? That's an utterly common thing these days.
That dynamic loading is also my working diagnosis.
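One way to test this diagnosis is to fetch the raw HTML as text and check whether the table data is in it at all. This is only a minimal sketch: the address and the cell text below are placeholders (assumptions, not taken from the question) and need to be replaced with the real page address and some text that is visible in the table in the browser.

```matlab
% Placeholder address: replace with the page being scraped.
url_name = 'https://example.com/page-with-table';

% Ask for the response body as plain text, with a generous timeout.
options = weboptions('ContentType', 'text', 'Timeout', 30);
rawHtml = webread(url_name, options);

% Hypothetical: some text that is visible in a table cell in the browser.
cellText = 'Some cell value';

if contains(rawHtml, cellText)
    disp('The table data is in the raw HTML; no script is needed to see it.');
else
    disp('The table data is not in the raw HTML; it is probably filled in by a script or requires a login.');
end
```

Note that the browser's element inspector shows the page after its scripts have run, so being able to inspect an element does not tell you that it was in the HTML the server originally sent.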
You can use the network tracking section of your browser debugging tools to try to find the source address of the table. The easiest way to open that window is to right click on the page and select inspect element.
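If the network view does reveal a separate address for the table data, that address can be read directly. A minimal sketch, assuming the data is served as JSON; the address below is hypothetical and stands in for whatever the network view shows.

```matlab
% Hypothetical address copied from the browser's network view.
dataUrl = 'https://example.com/api/table-data.json';

% Ask webread to decode the response as JSON.
options = weboptions('ContentType', 'json', 'Timeout', 30);
tableData = webread(dataUrl, options);

% JSON objects are decoded into structs. Treat each struct element as one
% table row; other shapes (cell arrays, nested structs) need their own handling.
if isstruct(tableData)
    T = struct2table(tableData, 'AsArray', true);
    disp(head(T));
end
```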
Thanks for your views.
I was also thinking about dynamic content, but each element of the table is available when right-clicking on it and inspecting the element. If the table were dynamic content, I assume a link to an external web page would be displayed instead (if I'm not mistaken).
I'm now working on another assumption. It seems the table in question can only be accessed after logging in with a password on the webpage, which would require using weboptions. However, the following statements don't fix the issue either, which means I might need to examine authentication/authorization aspects.
options=weboptions('Username','xxx','Password','yyy');
details_tmp=webread(url_name,options);
Therefore, if the experiment above (the pause(10) statement) is confirmed to be sufficient to make sure the webpage is entirely loaded before the data is read, I might publish another post on authentication/authorization issues with webread if needed, after reviewing the existing posts on this topic.
No, I meant that you can easily open the network activity by using the inspect element option. If a page loads an object dynamically (e.g. with a script that runs when loading the initial page), then it will show up as a different object/source in the network view when you reload the page. Often that will be some sort of JSON object that is parsed by the script on the page.
This is generally done to reduce loading time. On this website, the statistics shown when you click on a username are loaded from an external JSON. That means the page only has to load the script that tells your browser how to interpret the user stats, instead of having to contain the same hidden object every time a username appears on the page. It also allows a central update of stats for each user: just update the JSON file and all pages will update accordingly.
@Rik Yes, this is how I understood it. In your example (the statistics of a MATLAB user), three dynamic links are being used for the chart:
  • <script src="/matlabcentral/profile/assets/core-01ec908d852dd668e7ae3c25324b00002997ef998387f5dffd4d996b7fde242d.js"
  • <script src="/matlabcentral/profile/assets/charts-70abe752108b9b28083412ff1a9c89ef24c92337c7c011e8bf0018430a231b92.js" cache="true"></script>
  • <script src="/matlabcentral/profile/assets/animated-c33dad0fafaa02c86b4f422dbf8280866a6029167dd5c6211af74be83cc26ed6.js" cache="true"></script>
whereas the right hand side contents can be easily read, i.e. the Rank, Reputation, Contributions, etc...
In my case, when logged in to the web page, inspecting the elements doesn't show dynamic links for the table, but contents close to the format related to the Rank, Reputation, Contributions in your example. After having imposed a pause(10) in webread.m I could conclude that the issue was at another level, probably the login.
After several trials I could fix the login issue, even if it's not as clean as I wanted, based on the post:
So basically:
  • [a, h] = web(url_name);
  • log in manually in MATLAB's internal browser, which needs to stay open the whole time
  • details_tmp = get(h, 'HtmlText');
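The three steps above can be put into one script that waits for the manual login before reading the page. This is only a sketch: the address is a placeholder, and it relies on the same HtmlText property of the internal browser handle used above, which is not a documented interface and may not behave the same in releases other than the one used here.

```matlab
% Placeholder address: replace with the page that requires a login.
url_name = 'https://example.com/page-with-table';

% Open the page in MATLAB's internal browser and keep the browser handle.
[stat, h] = web(url_name);

% Block until the user confirms that the login is done in that window.
input('Log in in the browser window, then press Enter here: ', 's');

% Read the HTML currently shown in the browser, as in the steps above.
details_tmp = get(h, 'HtmlText');
```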
All in all, it seems webread has some strong limitations, since the following statements don't work.
options=weboptions('Username','xxx','Password','yyy')
details_tmp=webread(url_name,options);
It would hence be interesting to know under which conditions they are assumed to work.
Rik on 2 Jan 2023 (edited 2 Jan 2023)
I would venture the guess that it has to do with cookies, which are tricky to deal with in the context of a simple webread.
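If cookies are indeed the cause, the matlab.net.http interface gives more control than webread. As far as I can tell from the weboptions documentation, the Username and Password options are meant for HTTP authentication (the kind where the browser itself prompts for credentials), not for a login form on the page, which would explain why they have no effect here. The sketch below posts a login form, collects the cookies the server sets, and sends them back when requesting the protected page. The addresses and the form field names are hypothetical; the real ones have to be taken from the login form and the browser's network view, and sites that add hidden tokens to the form will need more than this.

```matlab
import matlab.net.http.RequestMessage
import matlab.net.http.CookieInfo
import matlab.net.http.field.CookieField
import matlab.net.http.field.ContentTypeField
import matlab.net.QueryParameter

% Hypothetical addresses and form field names.
loginUrl = 'https://example.com/login';
dataUrl  = 'https://example.com/page-with-table';
formData = QueryParameter('username', 'xxx', 'password', 'yyy');

% POST the login form as URL-encoded form data.
header = ContentTypeField('application/x-www-form-urlencoded');
loginRequest = RequestMessage('POST', header, formData);
[~, ~, history] = send(loginRequest, loginUrl);

% Collect the cookies set during the login exchange, including redirects.
cookieInfos = CookieInfo.collectFromLog(history);

% Send the cookies back with the request for the protected page.
pageRequest = RequestMessage('GET');
if ~isempty(cookieInfos)
    pageRequest = addFields(pageRequest, CookieField([cookieInfos.Cookie]));
end
response = send(pageRequest, dataUrl);

% The page content; check response.StatusCode before relying on it.
details_tmp = response.Body.Data;
```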
tom3w on 2 Jan 2023 (edited 2 Jan 2023)
@Rik Yes, that's very likely. Let's hope MATLAB's tech/dev team will review different use cases with a view to improving or better documenting webread/weboptions.
Thanks all for your questions and suggestions; I eventually managed to fix my main issue.


Answers (0)

Release: R2020b

Asked: 1 Jan 2023

Edited: 2 Jan 2023
