How can I extract data from a large ASCII file?
이 질문을 팔로우합니다.
- 팔로우하는 게시물 피드에서 업데이트를 확인할 수 있습니다.
- 정보 수신 기본 설정에 따라 이메일을 받을 수 있습니다.
오류 발생
페이지가 변경되었기 때문에 동작을 완료할 수 없습니다. 업데이트된 상태를 보려면 페이지를 다시 불러오십시오.
이전 댓글 표시
0 개 추천
I have a ASCII data file with unknown number of columns and rows. In the file there are some unwanted text lines on top. I want to read only some of the columns in the data set with the header by removing the text lines on top. I appreciate if anyone can help.
채택된 답변
Cedric
2014년 5월 20일
>> doc textscan
and use the parameter HeaderLines to skip the header. Supposing that you have 7 header lines and that you need columns 1 and 3 (discarding the rest of each line), you should have something like:
content = fileread( 'myData.txt' ) ;
data = textscan( content, '%f%*f%f%*[^\n]', 'HeaderLines', 7, ...
'CollectOutput', true ) ;
data = data{1} ;
where you see a * in the formatSpec argument to skip column 2 and %*[^\n] at the end to skip the rest of the line.
댓글 수: 14
By T and Z you mean columns 20 and 26 (like in Excel), or T_fit and another column whose header contains Z?
anton fernando
2014년 5월 20일
편집: anton fernando
2014년 5월 20일
I need to extract the H2o and O3 columns. it is 4th and 6th column in the data set that I posted.
Do you need to identify that these are columns 4 and 6 programmatically (based on pattern matching in the header), or are H2O and O3 always columns 4 and 6 (in all files)?
If these are always columns 4 and 6, you can use the following formatSpec (2nd argument in the call to TEXTSCAN):
'%*f%*f%*f%f%*f%f%*[^\n]'
where the %*f sequences skip (float) columns 1,2,3,5, the %f sequences read (float) columns 4 and 6, and the %*[^\n] sequence skips the rest of lines. Note that I cannot determine based on your picture whether there are 6 or 7 header lines, so you'll have to adjust this number.
anton fernando
2014년 5월 20일
편집: anton fernando
2014년 5월 20일
Actually your code worked. Thank you very much. How can I read the data set when I don't know the column number and only when I know the column heading? Thank you again. You saved me lot of time from wasting. You are the best.
See my last comment for columns 4 and 6.
If you are able to build the formatSpec based on the a priori knowledge of which columns (number) you need, the approach above is by far the easiest way to process your file(s).
If you need to find programmatically columns number based on header names 'H2O' and 'O3' (and others), it gets significantly more complicated. If you really need this feature, I can extend my solution though; just let me know.
Hey, I can understand what you have done here now. But there is a problem. I don't get data as they are in the data file. Some data read as 0. I want to get data in (0.350E+13) format. Also what ('CollectOutput', true) part does in your code? Thank you. P.S: I think I can read data by the column number. All data are in same column in each file. Thank you again.
My guess is that the code reads data correctly, but you don't display them correctly afterwards. For example, if you just type data, MATLAB will factor 1.0e+12, and display a lot of 0.0000 which are in fact non-zero. Try to access one of these elements direct, e.g. data(10,1). If it isn't that, could you attach (icon staple) one of these files instead of a screenshot so I can try? If files are large, truncate one to e.g. 50 rows.
The CollectOutput parameter groups columns in data{1} (in this case). Without it, you get one cell per column, which is more difficult to use then (you have to merge cells content manually afterwards).
The file is attached.
Yep, that's it. You probably updated the example to
data = textscan( content, '%*f%*f%*f%f%*f%f%*[^\n]', 'HeaderLines', 12, ...
'CollectOutput', true ) ;
data = data{1} ;
Now you can see your data e.g. per row, and you'll see that they are non-zero:
>> data(50,:)
ans =
1.0e-03 *
0.5940 0.0073
Optionally, if you really wanted to display the whole array and see non-zero entries, you could type
>> format long
Then ..
>> data
data =
1.0e+02 *
0.009380000000000 -9.990000000000000
0.008250000000000 -9.990000000000000
0.007240000000000 -9.990000000000000
0.006350000000000 -9.990000000000000
0.005550000000000 -9.990000000000000
0.004840000000000 -9.990000000000000
0.004200000000000 -9.990000000000000
0.003630000000000 -9.990000000000000
0.003130000000000 -9.990000000000000
0.002670000000000 0.000000614000000
0.002280000000000 0.000000292000000
0.001950000000000 0.000000110000000
0.001680000000000 0.000000076300000
0.001440000000000 0.000000035600000
0.001230000000000 0.000000047500000
0.001060000000000 0.000000042100000
0.000906000000000 0.000000042300000
0.000775000000000 0.000000042200000
0.000664000000000 0.000000043400000
0.000567000000000 0.000000045600000
0.000484000000000 0.000000047000000
0.000413000000000 0.000000048500000
0.000353000000000 0.000000049900000
0.000301000000000 0.000000051300000
0.000256000000000 0.000000052600000
...
but again, the fact that you displayed a lot of zeros initially is just a display artifact (truncation), so there is no need to set the display format to long if all you need is to compute with these numbers afterwards.
Thank you very much. One last question. if I need the data of the 80th column do I have to write %f 80 times?
You're welcome. Interestingly, REPMAT works with strings as well (as with numeric/cell arrays), so no, if you needed columns 4, 6, and 80, you could do something like (not tested, you may have to check and adapt/correct):
formatSpec = ['%*f%*f%*f%f%*f%f', repmat('%*f', 1, 73), '%f%*[^\n]'] ;
data = textscan( content, formatSpec, 'HeaderLines', 12, ...
'CollectOutput', true ) ;
data = data{1} ;
Thank you. I really appreciate your help.
My pleasure!
추가 답변 (0개)
카테고리
도움말 센터 및 File Exchange에서 Large Files and Big Data에 대해 자세히 알아보기
태그
참고 항목
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!웹사이트 선택
번역된 콘텐츠를 보고 지역별 이벤트와 혜택을 살펴보려면 웹사이트를 선택하십시오. 현재 계신 지역에 따라 다음 웹사이트를 권장합니다:
또한 다음 목록에서 웹사이트를 선택하실 수도 있습니다.
사이트 성능 최적화 방법
최고의 사이트 성능을 위해 중국 사이트(중국어 또는 영어)를 선택하십시오. 현재 계신 지역에서는 다른 국가의 MathWorks 사이트 방문이 최적화되지 않았습니다.
미주
- América Latina (Español)
- Canada (English)
- United States (English)
유럽
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)
