How to parse html text having singleton tag?

Question

I have HTML data stored as a char array:

 <html>
  <head>
  </head>
  <body>
    <div class="header">
          HEADER1
    </div>
    <div class="content">
       <br>my data
    </div>
      </body>
</html>

I want to retrieve the data between the tags, for which I tried something like this:

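import javax.xml.parsers.DocumentBuilderFactory
% htmldata is the char array shown above
dbf = javax.xml.parsers.DocumentBuilderFactory.newInstance();
builder = dbf.newDocumentBuilder();
% Wrap the char array in a reader so the Java parser can consume it
is = org.xml.sax.InputSource(java.io.StringReader(htmldata));
dom = builder.parse(is);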

The above code works fine when there are no singleton tags, but it throws an error when I add one:

[Fatal Error] :36:7: The element type "br" must be terminated by the matching end-tag "</br>".

Even xmlread throws the same error:

>> xmlread(is)
[Fatal Error] :The element type "br" must be terminated by the matching end-tag "</br>".
Error using xmlread (line 106)
Java exception occurred:
org.xml.sax.SAXParseException; The element type "br" must be terminated by the matching end-tag "</br>".
  at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
  at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)

Is there any workaround for this?

Comments: 1

Purav Panchal, 31 Aug 2022

Hey, did you find any solution?

Answer 1

Arjun, 5 Nov 2024

Hi @SR,

I see that you want to parse some HTML text containing singleton tags using MATLAB.

The issue arises because you're using an XML parser to process HTML content. XML parsers require every opened element to be closed, either with a matching end tag or with self-closing syntax such as <br/>. HTML, however, allows void (singleton) elements such as <br> that have no closing tag, so an XML parser rejects them. To handle HTML properly, you can use “htmlTree”, which is part of Text Analytics Toolbox and is available in MATLAB R2021b and newer versions.

Kindly refer to the documentation of the “htmlTree” for more information: https://www.mathworks.com/help/releases/R2021b/textanalytics/ref/htmltree.html

Kindly refer to the code snippet below for illustration:

htmlData = [...
    '<html>', ...
    '  <head>', ...
    '  </head>', ...
    '  <body>', ...
    '    <div class="header">', ...
    '          HEADER1', ...
    '    </div>', ...
    '    <div class="content">', ...
    '       <br>my data', ...
    '    </div>', ...
    '      </body>', ...
    '</html>'];
htmlString = string(htmlData);
% Parse the HTML content using htmlTree
tree = htmlTree(htmlString);
% Extract content from the header and content divs
headerContent = extractHTMLText(findElement(tree, ".header"));
contentData = extractHTMLText(findElement(tree, ".content"));
disp("Header Content: " + headerContent);
% Output: Header Content: HEADER1
disp("Content Data: " + contentData);
% Output: Content Data: my data
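
If Text Analytics Toolbox is not available, one possible workaround is to rewrite the singleton tags as self-closing tags before handing the text to the XML parser from the question. The snippet below is only a rough sketch under several assumptions: the list in voidTags is illustrative and may need extending, the sample string is a shortened version of the data in the question, and the rest of the HTML must already be well-formed XML (unclosed <p> tags, a DOCTYPE, or entities such as &nbsp; would likely still make the parser fail).

```matlab
% Shortened sample input (assumption: same structure as the question's data)
htmldata = '<html><body><div class="content"><br>my data</div></body></html>';

% Void elements to rewrite; this list is illustrative, extend it as needed
voidTags = 'br|hr|img|input|meta|link';

% Match <br>, <br ...> or <br/> and emit the self-closing form <br .../>.
% The lookahead (?=[\s/>]) makes sure the tag name ends there.
pattern = ['<(' voidTags ')(?=[\s/>])([^>]*?)\s*/?>'];
xhtml = regexprep(htmldata, pattern, '<$1$2/>', 'ignorecase');

% Parse with the same Java DOM parser used in the question
dbf = javax.xml.parsers.DocumentBuilderFactory.newInstance();
builder = dbf.newDocumentBuilder();
src = org.xml.sax.InputSource(java.io.StringReader(xhtml));
dom = builder.parse(src);

% Read the text inside the first <div>
divs = dom.getElementsByTagName('div');
text = strtrim(char(divs.item(0).getTextContent()));
disp(text);
```

This keeps the original DOM-based code path, but it is fragile on real-world pages, so “htmlTree” remains the more robust option when it is available.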

I hope this will help!

Comments: 2

Sheeba Ransing, 7 Nov 2024

Thank you!

Arjun, 8 Nov 2024

Welcome!
