Transaction Data in HTML file

Question

v k, 15 May 2020


https://kr.mathworks.com/matlabcentral/answers/525598-transaction-data-in-html-file

Latest comment: v k, 16 May 2020

Hello, I have a directory full of HTML files named transactionData1.html, transactionData2.html and so on. Buried in these files is transaction information with the following parameters of interest:

<b>Customer Name: </b>Michael Henesi<br /> 
... (some other stuff)
<b>Transaction ID:</b> 21987335670

The transaction ID has varying length and is sometimes not available (no entry in that field). Sometimes there are multiple transactions. Sometimes, the transaction ID is specified as:

<b>Transaction ID: </b>21987335670

that is, the space moves from after the closing </b> tag to just after the colon, inside the bold tag.

In some HTML files, both the Customer Name and the Transaction ID are missing.

The objective is to collect all the Transaction IDs, along with the Customer Names, from all the files in the directory into one text file. How can this be done?

Comments: 2

per isakson, 15 May 2020

You increase your chances of getting a useful answer if you upload a few HTML files that represent the cases you describe:

In some HTML files, both the Customer Name and the Transaction ID are missing.
The transaction ID has varying length and is sometimes not available
Sometimes there are multiple transactions.
etc.

v k, 15 May 2020

transData1.txt

Sure. I have attached one such file (in text format).

Sometimes the Transaction ID field is just blank.


Answer 1

per isakson, 15 May 2020


https://kr.mathworks.com/matlabcentral/answers/525598-transaction-data-in-html-file#answer_432558

Edited: per isakson, 15 May 2020


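This is a start:

%%
% List the input files and preallocate one output row per file: {name, ID}
sad = dir( 'd:\m\cssm\transData*.txt' );
len = length( sad );
out = cell( len, 2 );
% Name: text up to the next '<'. ID: one or more digits after optional spaces.
xpr = '<b>Customer Name: <\/b>([^<]+).+<b>Transaction ID:<\/b>\x20*(\d+)';
for jj = 1 : len
    chr = fileread( fullfile( sad(jj).folder, sad(jj).name ) );
    cac = regexp( chr, xpr, 'tokens' );
    % regexp returns an empty cell when nothing matches, so test cac itself;
    % indexing cac{1} on an empty result would raise an error.
    if not( isempty( cac ) )
        out(jj,:) = cac{1};   % tokens of the first match: {name, ID}
    end
end
out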

It outputs

out =
  1×2 cell array
    {'Wee Lu'}    {'8299045'}
>> 
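
The version above fills a row only when the whole pattern matches. As a small sketch of what regexp returns in the two cases, here is the same pattern applied to two made-up strings (the variables good and bad are hypothetical samples, not the contents of the attached file):

```matlab
% Hypothetical sample strings, not taken from the attached file.
xpr  = '<b>Customer Name: <\/b>([^<]+).+<b>Transaction ID:<\/b>\x20*(\d+)';
good = '<b>Customer Name: </b>Wee Lu<br /> ... <b>Transaction ID:</b> 8299045';
bad  = '<b>Customer Name: </b>Lam Soon<br /> ... <b>Transaction ID:</b> ';

cacGood = regexp( good, xpr, 'tokens' );  % one match: a cell holding {name, ID}
cacBad  = regexp( bad,  xpr, 'tokens' );  % no match: (\d+) needs at least one digit

disp( cacGood{1} )          % the two tokens of the first match
disp( isempty( cacBad ) )   % true, so cacBad{1} would raise an indexing error
```

A blank Transaction ID therefore makes the whole match fail, and the customer name is lost along with it.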

In response to the comment:

%%
sad = dir( 'd:\m\cssm\transData*.txt' );
len = length( sad );
out = cell( len, 2 );
% Both tokens may now be empty: ([^<]*) and (\d*) accept zero characters.
xpr = '<b>Customer Name: <\/b>([^<]*).*<b>Transaction ID:<\/b>\x20*(\d*)';
for jj = 1 : len
    chr = fileread( fullfile( sad(jj).folder, sad(jj).name ) );
    cac = regexp( chr, xpr, 'tokens' );
    % Placeholders, used when a field is blank or the labels are missing
    name = '---';
    id   = '-99';
    if not( isempty( cac ) )   % guard: cac is empty when the labels are absent
        if not( isempty( cac{1}{1} ) ), name = cac{1}{1}; end
        if not( isempty( cac{1}{2} ) ), id   = cac{1}{2}; end
    end
    out(jj,:) = { name, id };
end
out

It outputs

out =
  2×2 cell array
    {'Wee Lu'  }    {'8299045'}
    {'Lam Soon'}    {'-99'    }
>> 
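
The question also mentions the variant with the space inside the bold tag, files with several transactions, and writing everything to one text file. The following is only a sketch of how that might look, under assumptions that do not come from the thread: a customer's IDs are taken to be those that appear before the next "Customer Name:" label, IDs that appear before any customer label are ignored, and the names texts, namePat, idPat, rows and transactions.txt are made up for illustration. The sample strings stand in for file contents; in practice each entry of texts would come from fileread, as in the loop above.

```matlab
% Hypothetical file contents; replace with fileread(...) of each file from dir.
texts = { ...
    '<b>Customer Name: </b>Wee Lu<br /> x <b>Transaction ID:</b> 8299045 y <b>Transaction ID: </b>123', ...
    '<b>Customer Name: </b>Lam Soon<br /> x <b>Transaction ID:</b> ', ...
    'no labels here' };

namePat = '<b>Customer Name:\x20*<\/b>\x20*([^<]*)';
idPat   = '<b>Transaction ID:\x20*<\/b>\x20*(\d+)';  % space before or after </b>
rows    = cell( 0, 2 );                              % {name, ID}, one row per ID

for jj = 1 : numel( texts )
    % Split the text at each customer label; pieces{kk+1} follows the kk-th name.
    [ names, pieces ] = regexp( texts{jj}, namePat, 'tokens', 'split' );
    for kk = 1 : numel( names )
        name = strtrim( names{kk}{1} );
        if isempty( name ), name = '---'; end
        ids = regexp( pieces{kk+1}, idPat, 'tokens' );
        if isempty( ids ), ids = { {'-99'} }; end    % same placeholder as above
        for mm = 1 : numel( ids )
            rows(end+1,:) = { name, ids{mm}{1} };    %#ok<AGROW>
        end
    end
end

% Write one tab-separated line per (name, ID) pair.
outFile = fullfile( tempdir, 'transactions.txt' );   % hypothetical output file
fid = fopen( outFile, 'w' );
assert( fid ~= -1, 'Could not open %s for writing.', outFile );
for rr = 1 : size( rows, 1 )
    fprintf( fid, '%s\t%s\n', rows{rr,:} );
end
fclose( fid );
```

Splitting at the customer label keeps each ID attached to the name that precedes it, which a single greedy pattern over the whole file cannot do when there are several transactions.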

Comments: 6 (the 4 earlier comments are not shown)

per isakson, 16 May 2020

Edited: per isakson, 16 May 2020

Regular expressions are not trivial. They require concentration, attention to detail and careful reading of the documentation. See the MATLAB documentation for regexp, "Match regular expression (case sensitive)". There is a lot of material on regular expressions on the Internet, e.g. RegEx101. There are many flavors of regular expressions, and MATLAB has its own, which is nevertheless fairly close to PCRE.

Answers

\x20 stands for a space: the character with hexadecimal value 20.
* matches the previous "item" zero or more times
() captures tokens; whatever is matched inside the parentheses is kept for the output
[^<] matches any character except <
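
As a quick illustration of these four elements, here is a sketch on made-up input (the strings and variable names are hypothetical, chosen only to exercise each piece of the pattern):

```matlab
% Made-up input; the pattern uses only the elements listed above.
str = 'ID:   42 units<br />';
tok = regexp( str, 'ID:\x20*(\d*)([^<]*)', 'tokens', 'once' );
% \x20*    skips the three spaces after the colon (it would also accept none)
% (\d*)    captures the digits, here '42'
% ([^<]*)  captures everything up to, but not including, the next '<'
disp( tok )

% With * a token may be empty; with + the whole match fails instead.
tokStar = regexp( 'ID: <br />', 'ID:\x20*(\d*)', 'tokens', 'once' );
tokPlus = regexp( 'ID: <br />', 'ID:\x20*(\d+)', 'tokens', 'once' );
disp( isempty( tokStar{1} ) )   % the match succeeds, but the token is empty
disp( isempty( tokPlus ) )      % no match at all
```

This difference between * and + is why the second version of the code can report a customer name even when the Transaction ID field is blank.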

The MATLAB documentation explains this much better than I do!

v k, 16 May 2020

Much appreciation.
