Help parsing predictably messy data

Question

newbie9 on 13 Aug 2019

https://kr.mathworks.com/matlabcentral/answers/475998-help-parsing-predictably-messy-data

Commented: newbie9 on 15 Aug 2019

I have some really messy data that I am trying to reorganize. The attached *.txt is an example subset. The first screenshot below explains how it is organized. Every file is different (i.e., nheaders will vary, number of rows/columns in each block will vary), but there are clues as to what the shape of each data block is. The second attachment is an *.xlsx that shows the general desired format (I also have a screenshot of it below). I color-coded some of the numbers in header1 in the Excel so that it is more obvious where they come from in the original file.

I am struggling with how to get started on going from the messy (but predictably messy!) format to the clean data block. Thank you for any tips to get started. (Using MATLAB R2018a.)

2 Comments

dpb on 13 Aug 2019

Looks pretty straightforward...you just read the first couple records and parse them first...

Attach the input spreadsheet itself...can't test without something to work from.

newbie9 on 13 Aug 2019

The input is attached; it's the TXT file.

Answer 1

dpb on 14 Aug 2019

https://kr.mathworks.com/matlabcentral/answers/475998-help-parsing-predictably-messy-data#answer_387649

Edited: dpb on 15 Aug 2019


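function getdata=a1(fname)
% read file sections, return as cell array with one row {z,hdr,xyz,data}
% per group for subsequent use
  fid=fopen(fname,'r');         % open the file
  if fid<0, error('a1:fileOpen','Cannot open file %s',fname); end
  closer=onCleanup(@() fclose(fid));  % close the file on exit or on error
  getdata=cell(0,4);            % output was never assigned in the first draft
  nHdr=sscanf(fgetl(fid),'%d %*s');
  for i=1:nHdr
    z=sscanf(fgetl(fid),'%f');
    nGrp=numel(z);
    for j=1:nGrp
      [hdr,xyz,data]=scangroup(fid);
      getdata(end+1,:)={z,hdr,xyz,data};  % append this group
    end
  end
end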
function [hdr, xyz, data]=scangroup(fid)
% read a specific file group of the form  
%     1  header1                                                              
%    15    3  x y
%   0.6  0.2  0.2
%    1   1   15.100  0.7734  0.6247  0.9668
%    1   1   15.300  0.4880  0.3941  0.6100
%    .....
%    1   1   17.900  0.0000  0.0000  0.0000
% read the header record and strip initial numeric -- keep all rest of 
% line in case the header is more than just one word as in sample file
  hdr=fgetl(fid);
  hdr=strtrim(hdr);       % remove leading/trailing white space
  ix=regexp(hdr,'\w*');   % find beginning words in header line
  hdr=hdr(1,ix(2):end);   % return complete line after leading numerics
  % get the data size information
  l=fgetl(fid);
  rc=sscanf(l,'%d');  % row, column dimensions
  xyz=fscanf(fid,'%f',[1,rc(2)]);
  fmt=repmat('%f',1,3+rc(2));
  data=cell2mat(textscan(fid,fmt,rc(1)));
  fgetl(fid);             % eat \n for next pass
end

The above reads the file and returns each block in the hdr, xyz, and data variables, with the first floating-point array as z. (I have no idea what any of these are, so creating a meaningful variable name was beyond my ken; rename to whatever makes sense for the application space.)
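To make the two header-record steps in scangroup concrete, here is a minimal sketch run on literal strings rather than on the file. The sample lines are made up to match the layout shown in the function comments (the extra words in the header are an assumption for illustration); only the fgetl results are replaced by string literals.

```matlab
% Header record: a leading number followed by free text (hypothetical sample).
hdr  = strtrim('    1  header1 with extra words   ');
ix   = regexp(hdr,'\w+');      % start index of every word in the line
rest = hdr(ix(2):end);         % everything from the second word onward

% Size record: two integers followed by text labels (hypothetical sample).
% sscanf stops at the first token that is not an integer, so only the
% row and column counts come back.
rc = sscanf('   15    3  x y','%d');

disp(rest)
disp(rc.')
```

Note that ix(2) assumes the header record has at least two words; a record holding only the leading number would need a guard before indexing.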

To build your rearranged file, I'd begin by opening a text file for output at the beginning and then just write sequentially to it as you go. You'll just need to repmat() the z and xyz vectors to the size of the data block before writing. The one possible issue I don't know the answer to is whether later in the file there can be more columns than have been seen previously; if so, the width will grow beyond what has already been written. That wouldn't cause any problem on writing, but it could create some difficulty in reading, although it's possible readtable with an import options object just might be able to handle it. I've never tested that case.
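A minimal sketch of that write-as-you-go step, under stated assumptions: the z, xyz, and data values below are hypothetical stand-ins for one parsed block, the output file name is a temporary placeholder, and the column order (z, then xyz, then data) is a guess at the desired layout rather than something taken from the attached spreadsheet.

```matlab
% Hypothetical values standing in for one parsed block.
z    = [0.5 1.0];                   % assumed: floating-point values read before the groups
xyz  = [0.6 0.2 0.2];               % row read after the size record
data = [1 1 15.1 0.7734 0.6247 0.9668
        1 1 15.3 0.4880 0.3941 0.6100];

% Repeat z and xyz once per data row so every output row is complete.
n   = size(data,1);
out = [repmat(z(:).',n,1) repmat(xyz,n,1) data];

% Append to the output file; in the real loop the file would be opened
% once before the first block and closed after the last one.
outname = [tempname '.txt'];        % hypothetical output file
fid = fopen(outname,'a');
fmt = [repmat('%g ',1,size(out,2)-1) '%g\n'];
fprintf(fid,fmt,out.');             % transpose: fprintf walks the array column by column
fclose(fid);

% Read the file back to confirm its shape, then clean up.
check = load(outname,'-ascii');
delete(outname);
disp(size(check))
```

Because the format string is rebuilt from size(out,2) for each block, a later block with more columns would simply produce wider rows, which is the reading problem mentioned above.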

That should get you going...

1 Comment

newbie9 on 15 Aug 2019

wow, thank you @dpb, this is great. Great idea on writing the output as I go. Thanks again
