XML parsing vs regexp

Sebastian Holmqvist on 30 Jul 2012
I'm trying to extract values from two elements in a fairly large XML document. I'm stuck between doing it "the right way" and doing it "the fast way", i.e., parsing vs. regexp.
elem_num = 1e4;
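%% Create sample XML string
% A well-formed document needs a single root element; the name 'root' is an arbitrary choice.
xml_str = cell(1, elem_num+2);
xml_str{1} = '<root>';
for i = 1:elem_num
    xml_str{i+1} = '<elem><aa>abc</aa><ab>def</ab></elem>';
end
xml_str{elem_num+2} = '</root>';
xml_str = [xml_str{:}];   % concatenate the pieces into one char row vector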
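%% Wrap the string in an InputSource and parse
% java.io.StringBufferInputStream is deprecated; a StringReader avoids its character-to-byte conversion.
source = org.xml.sax.InputSource(java.io.StringReader(xml_str));
factory = javaMethod('newInstance', ...
    'javax.xml.parsers.DocumentBuilderFactory');
builder = factory.newDocumentBuilder;
document = builder.parse(source);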
%% Extract the values by walking the DOM
tic;
aa_list = document.getElementsByTagName('aa');
aa_num = aa_list.getLength;
aa = cell(1, aa_num);
for i = 1:aa_num
    % NodeList indices are zero-based; char() converts java.lang.String to a MATLAB string
    aa{i} = char(aa_list.item(i-1).getTextContent);
end
ab_list = document.getElementsByTagName('ab');
ab_num = ab_list.getLength;
ab = cell(1, ab_num);
for i = 1:ab_num
    ab{i} = char(ab_list.item(i-1).getTextContent);
end
toc;
%% Use regexp
tic;
% Note: these patterns match the literal sample values, not element content in general
aa_regexp = regexp(xml_str, '(abc)', 'tokens');
ab_regexp = regexp(xml_str, '(def)', 'tokens');
toc;
As the timings below show, parsing may be the correct way to handle XML, but it takes far longer than regexp:
% XML Parsing: Elapsed time is 3.222058 seconds.
% Regexp: Elapsed time is 0.050301 seconds.
Any tips on how to speed this up, e.g., another parser or a better way of using this one?

Answers (1)

Walter Roberson on 30 Jul 2012
Often, when HTML or XML are analyzed in terms of extended regular expressions, the implementations are vulnerable to alternative representations of the closing quote on strings, failing to detect a close quote that HTML or XML say is there. The earlier problem was with "double byte character sets", so people learned to deal with that. But then people were caught off-guard with Unicode Code Point representations of the double-quote, such as via a \u or \ux escape sequence.
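To make the concern about alternative representations concrete, here is a minimal sketch. It does not reproduce the quote-escaping case described above; it uses element text instead, because that is what the question extracts. The pattern `<aa>(.*?)</aa>` and the three sample strings are assumptions made for illustration and are not taken from the question.

```matlab
% Three spellings of the same element; an XML parser reads the text "a<b" from each.
samples = {'<aa>a&lt;b</aa>', ...            entity reference
           '<aa><![CDATA[a<b]]></aa>', ...   CDATA section
           '<aa >a&#60;b</aa>'};             % whitespace in the tag, numeric reference
pattern = '<aa>(.*?)</aa>';                  % hypothetical pattern, not from the question

factory = javax.xml.parsers.DocumentBuilderFactory.newInstance;
builder = factory.newDocumentBuilder;

re_vals  = cell(size(samples));
dom_vals = cell(size(samples));
for k = 1:numel(samples)
    % Regexp view: raw characters between the literal tags, or '' if nothing matches
    tok = regexp(samples{k}, pattern, 'tokens', 'once');
    if isempty(tok)
        re_vals{k} = '';
    else
        re_vals{k} = tok{1};
    end
    % Parser view: decoded text content of the root element
    source = org.xml.sax.InputSource(java.io.StringReader(samples{k}));
    doc = builder.parse(source);
    root = doc.getDocumentElement;
    dom_vals{k} = char(root.getTextContent);
    fprintf('regexp: %-18s DOM: %s\n', re_vals{k}, dom_vals{k});
end
```

The parser should report the same text for all three samples, while the regexp returns the undecoded markup for the first two and finds no match for the third. If the input is machine-generated and always uses one fixed spelling, the regexp route may still be acceptable; the sketch only shows what it silently assumes.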
  Comments: 2
Sebastian Holmqvist on 30 Jul 2012
Ah, very informative, thanks!
Any ideas on how to speed up the parsing? 3 seconds for 1e4 repeated elements is killing me at the moment.
Walter Roberson on 30 Jul 2012
Sorry, I have never used the parser.
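One direction that might be worth trying, offered as an untimed sketch rather than a confirmed fix: if the cost in the DOM loop comes mainly from making tens of thousands of individual MATLAB-to-Java calls (an assumption, not something measured in this thread), the extraction can be pushed into a single Java call. The sketch below swaps the DOM loop for an XSLT transform through the standard `javax.xml.transform` API. The stylesheet, the three-element sample, and the newline delimiter are all illustrative choices, and values that themselves contain newlines would be split incorrectly.

```matlab
% Small stand-in for the document from the question (three elements instead of 1e4)
xml_str = ['<root>' repmat('<elem><aa>abc</aa><ab>def</ab></elem>', 1, 3) '</root>'];

% Hypothetical stylesheet: write the text of every <aa> element on its own line
xsl_str = ['<xsl:stylesheet version="1.0" ' ...
           'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">' ...
           '<xsl:output method="text"/>' ...
           '<xsl:template match="/">' ...
           '<xsl:for-each select="//aa">' ...
           '<xsl:value-of select="."/><xsl:text>&#10;</xsl:text>' ...
           '</xsl:for-each>' ...
           '</xsl:template></xsl:stylesheet>'];

% Compile the stylesheet
tf = javax.xml.transform.TransformerFactory.newInstance;
xsl_src = javax.xml.transform.stream.StreamSource(java.io.StringReader(xsl_str));
transformer = tf.newTransformer(xsl_src);

% Run the transform once; all values come back in a single Java string
xml_src = javax.xml.transform.stream.StreamSource(java.io.StringReader(xml_str));
writer = java.io.StringWriter;
transformer.transform(xml_src, javax.xml.transform.stream.StreamResult(writer));

% Split on the delimiter in MATLAB and drop the empty piece after the last newline
aa = regexp(char(writer.toString), '\n', 'split');
if isempty(aa{end})
    aa(end) = [];
end
```

The same stylesheet with `//ab` in place of `//aa` would give the second list. Whether this is actually faster depends on where the time goes, so it is worth wrapping both versions in `tic`/`toc` on the real data before committing to either.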
