Image extraction from webpage

Question

There are serial-numbered webpages (some of these numbers don't exist), which have images of interest at one particular location in the html file:

<h4 id="COMPANY">COMPANY</h4>
<p><img class="image" border="0" src="/resources/companyName_company.jpg"/></p>

The companyName is different in each numbered webpage.

However, urlwrite saves only the HTML pages, without these images. When the saved pages are opened in a browser, the images are absent. Since only these images are of interest, and none of the other content of the webpage, this defeats the whole purpose. How can this be resolved? Is there a way to get only these images and nothing else from the webpage?

Rik on 27 Apr 2020

You mean the html doesn't contain these lines?

b on 27 Apr 2020

No, the html does contain these lines. But when opened in a browser, there is no image. The heading between the <h4></h4> tags appears correctly. The image, which should be just below it, does not appear.

Everything else on the webpage is unneeded information. I am unable to figure out how to filter that out and extract only this image part.

This structure is unique in the html pages. Every numbered html page has the structure of <heading> immediately followed by <image> .

Answer 1

Rik on 27 Apr 2020

The HTML file doesn't contain the image, only a relative path to it. Because you don't have the image file at the location the HTML file specifies, the image doesn't show up. You need the 3-step process below to get the image file.

1. download the HTML file
2. determine the relative image path ('/resources/companyName_company.jpg') from the HTML
3. download the image from website.com/resources/companyName_company.jpg
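
The three steps could be sketched roughly as below. This is only a minimal sketch, assuming the site root is https://website.com (a placeholder, as in the list above) and that the image tag follows the COMPANY heading as shown in the question. The variable names and the regular expression are illustrative and do not come from the thread. The HTML is hard-coded so the parsing step can be tried offline; the actual download calls are left as comments.

```matlab
% Step 1 (stand-in): in practice this would be html = webread(pageUrl);
% here the snippet from the question is used so the parsing runs offline.
html = ['<h4 id="COMPANY">COMPANY</h4>' newline ...
        '<p><img class="image" border="0" src="/resources/companyName_company.jpg"/></p>'];

% Step 2: capture the src attribute of the first <img> after the COMPANY heading.
% .*? is lazy, so the nearest <img> after the heading is used.
expr = '<h4[^>]*id="company"[^>]*>.*?<img[^>]*src="([^"]+)"';
tok = regexp(html, expr, 'tokens', 'once', 'ignorecase');

if isempty(tok)
    partialUrl = '';   % heading or image not found on this page
else
    partialUrl = tok{1};
end

% Step 3: build the absolute URL (siteRoot is a placeholder, not a real site).
siteRoot = 'https://website.com';
imgUrl = [siteRoot partialUrl];
disp(imgUrl)

% The download itself would then be something like:
% [~, name, ext] = fileparts(partialUrl);
% websave([name ext], imgUrl);
```

Using a regular expression instead of string splitting is just one option; it keeps the "heading followed by image" assumption in a single pattern that is easy to adjust.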

Rik on 29 Apr 2020

Despite what you said, the pattern you mentioned is not unique. Below is my guess for your pattern. Modify as needed.

for n=2%1:1000
    %read HTML
    %url=sprintf('https://companyNameWebsite.org/%i?outline=by_category',n);
    url='https://www.mathworks.com/matlabcentral/answers/uploaded_files/288498/company3a.txt';
    try
        data=webread(url);
    catch ME
        %check if the error is what you expect for a non-existent page
        %(WhatYouExpect is a placeholder for your own check)
        if ~WhatYouExpect(ME)
            rethrow(ME)
        else
            continue%go to next iteration
        end
    end
    
    t=strsplit(data,'<h4');
    pattern=' id="company"';numel_pattern=numel(pattern);
    partial_url='';%set a default in case of failure
    for k=1:numel(t)
        try
            %lower() makes the comparison case-insensitive (the question shows id="COMPANY")
            if strcmp(lower(t{k}(1:numel_pattern)),pattern)
                ind1=strfind(t{k},'src="')+4;ind1=ind1(1)+1;%first character of the url
                ind2=strfind(t{k},'"');
                ind2=ind2(ind2>ind1);ind2=ind2(1)-1;%last character before the closing quote
                partial_url=t{k}(ind1:ind2);
            end
        catch
            %line too short, or url reading failed
        end
    end
    if isempty(partial_url)
        %Should the code throw an error here? Warn? Simply continue?
        continue%skipping is one option; otherwise websave would be called on the bare domain
    end
    
    %store the image
    img_url=sprintf('%s%s','https://companyNameWebsite.org',partial_url);
    [~,name,ext]=fileparts(partial_url);%reusing the file name from the url is an assumption
    websave([name ext],img_url);
end
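
WhatYouExpect in the code above is a placeholder. One possible way to fill it in is sketched below, assuming that a missing page makes webread throw an error whose identifier contains 'HTTP404'. That identifier and the name isMissingPage are assumptions, so inspect ME.identifier on your own release before relying on this.

```matlab
% Hypothetical stand-in for WhatYouExpect: true when the error looks like a 404.
isMissingPage = @(ME) contains(ME.identifier, 'HTTP404');

% Offline check with hand-made error objects. The identifiers below are
% assumed for illustration, not captured from a real request.
fakeMissing = MException('MATLAB:webservices:HTTP404StatusCodeError', 'Not found');
fakeOther   = MException('MATLAB:webservices:Timeout', 'Timed out');

disp(isMissingPage(fakeMissing))
disp(isMissingPage(fakeOther))
```

Matching on the identifier rather than the message text avoids depending on the wording (or language) of the error message.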

Rik on 29 Apr 2020

Glad to be of help.

Since you indicated that an NDA prevents you from providing more details, I don't see what adding "(subject to testing)" is trying to accomplish. Obviously it works on a recent release of MATLAB for this example, otherwise I wouldn't have posted it. The only thing it currently accomplishes is sounding condescending.

b on 30 Apr 2020

This is working well. I can clearly see now how strsplit, strfind and strcmp can be used. After experimenting with this code on a few different configurations, the run-time is also reasonable: a few hours for 1000 cases. Another observation is that the file size of the retrieved image is exactly the same as that of the original file. This may not be surprising to an experienced coder, but a modification could give the user the option to vary the file size of the retrieved file, bringing down both run-time and disk space. If the original image file is 4 MB but 60 kB would suffice, that is a reduction by a factor of ~70. This would translate to an almost equivalent reduction in run-time and the same reduction in disk space: instead of 4 GB, only 60 MB would be used. The trick will be in the processing time taken by the dimension- or size-reducing algorithm.

But that goes beyond the purview of this question thread.
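
For what it's worth, here is a rough sketch of the size reduction described above, using only base MATLAB functions (imread/imwrite, no toolbox). The subsampling factor, the JPEG quality values, and the file names are made-up values for illustration, and a synthetic image stands in for a downloaded one. Note that this shrinks the file after it has been downloaded, so it saves disk space; it would only cut download time if the server itself offered smaller versions of the images.

```matlab
% Synthetic stand-in for a downloaded image (smooth gradient, 600x800 RGB).
[x, y] = meshgrid(linspace(0, 1, 800), linspace(0, 1, 600));
img = uint8(255 * cat(3, x, y, (x + y) / 2));

% Temporary file names (illustrative only).
origFile  = [tempname '.jpg'];
smallFile = [tempname '.jpg'];
imwrite(img, origFile, 'Quality', 95);

% Reload, keep every 2nd pixel in each direction, and save with stronger compression.
full  = imread(origFile);
small = full(1:2:end, 1:2:end, :);
imwrite(small, smallFile, 'Quality', 50);

% Compare the resulting file sizes on disk.
origInfo  = dir(origFile);
smallInfo = dir(smallFile);
fprintf('original: %d bytes, reduced: %d bytes\n', origInfo.bytes, smallInfo.bytes);
```

If the Image Processing Toolbox is available, imresize would be the more usual choice than plain subsampling, since it filters the image before reducing it.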
