read in only certain columns of big text file

Question

Charlotte 2013년 10월 24일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/91340-read-in-only-certain-columns-of-big-text-file

댓글: Ashim 2016년 1월 4일

I want to read in only certain columns of a big tab-separated values text document (.tsv) in the form of a table with 900 columns and 100 lines. Every line in the textfile has the format:

8 columns %s

112 repetitions of: %d%d%d%d%d%d%s

100 columns: %d (irrelevant to me)

How can I import the table as a cell array of dimension (number of columns)*(number of lines) (or vice versa) without having to specify all the 900 format specifier (& skipping the last 100 columns)?

In fact, I only need every 8+6*k column and every 8+6*(k+1) column for k=1:112, i.e. the last 2 elements (%d%s) of the sequence repeated 112 times.

Using textscan(fileid,'%s','Delimiter','\t'); gives me a cell array of size 1*(number of total elements) instead, which is not very practicle to deal with if I want to use certain columns. Also, I didn't know how to solve the format specifier issue and simply read everything as strings.

Using readtable('filename.tsv','Delimiter','\t'); gives me the error message: Undefined function 'readtable' for input arguments of type 'char'.

댓글 수: 2
없음 표시없음 숨기기

Simon 2013년 10월 24일

편집: Simon 2013년 10월 24일

Hi!

Can you give a short(!) example of the file, maybe two or three lines? Or try my solution below.

Image Analyst 2014년 3월 26일

readtable is only with R2013b and later - you probably have an earlier version.

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Cedric 2013년 10월 24일

3
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/91340-read-in-only-certain-columns-of-big-text-file#answer_110003

편집: Cedric 2013년 10월 24일

MATLAB Online에서 열기

I'd do it as follows, assuming that what you don't want to do is to have to build the formatSpec by yourself. We read the first line of the file to identify non-numeric columns (string) and we use this information to build an appropriate formatSpec for TEXTSCAN. Then we can read the file and optionally convert the cell array of columns (mix of cell arrays and numeric arrays) into a large cell array.

 filename = 'myFile.tsv' ;
 % - Get structure from first line.
 fid  = fopen( filename, 'r' ) ;
 line = fgetl( fid ) ;
 fclose( fid ) ;
 isStrCol = isnan( str2double( regexp( line, '[^\t]+', 'match' ))) ;
 % - Build formatSpec for TEXTSCAN.
 fmt = cell( 1, numel(isStrCol) ) ;
 fmt(isStrCol)  = {'%s'} ;
 fmt(~isStrCol) = {'%f'} ;
 fmt = [fmt{:}] ;
 % - Read full file.
 fid  = fopen( filename, 'r' ) ;
 data = textscan( fid, fmt, Inf, 'Delimiter', '\t' ) ;
 fclose( fid ) ;
 % - Optional: aggregate columns into large cell array.
 for colId = find( ~isStrCol )
    data{colId} = num2cell( data{colId} ) ;
 end
 data = [data{:}] ;

From there, it is easy to select relevant columns.

Note that this solution assumes that there are no white-spaces in columns. If it is not true, I can update the solution so it really works with tabs as separator (in fact, TEXTSCAN seems to use the white space as delimiter even when with specific only \t as delimiter).

댓글 수: 10
이전 댓글 8개 표시이전 댓글 8개 숨기기

Charlotte 2013년 10월 30일

Great, thank you very much!

Ashim 2016년 1월 4일

The answer from Cedric is applicable but in case when NaNs are present on the first line, here 'line' variable, the code returns the column as a string rather than a double. How do you overcome that?

댓글을 달려면 로그인하십시오.

Answer 2

Simon 2013년 10월 24일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/91340-read-in-only-certain-columns-of-big-text-file#answer_100693

MATLAB Online에서 열기

Hi!

I'll try an answer for my short example:

1  2  s  4  5  a  5
a  4  c  d  6  f  i
9  8  d  g  10  i  n

(all tab separated!)

I want to extract columns 2 and 5 as double and 3 and 6 as string (you need to adapt ind1 and ind2 to your needs)

% read in file with textscan
fid = fopen('tabdata.txt');
FC = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
FC = FC{1};
% split all lines (attention: the last column is not processed!!!)
FCsplit = regexp(FC, '(\S*)[\s]', 'match');
% columns to extract as double
ind1 = 2:3:5;
% columns to extract as string (right neighbor of column of doubles)
ind2 = ind1 + 1;
% prepare result arrays for double and string (as cell array)
D = zeros(length(FCsplit), length(ind1));
S = cell(length(FCsplit), length(ind1));
% loop over all columns to extract
for n = 1:length(ind1)
    D(:, n) = cellfun(@(x) str2double(x(ind1(n))), FCsplit);
    S(:, n) = cellfun(@(x) strtrim(x(ind2(n))), FCsplit);
end

I don't know about the performance for large files. Try it out!

댓글 수: 2
없음 표시없음 숨기기

Charlotte 2013년 10월 29일

MATLAB Online에서 열기

EtData_example.txt

Thanks a lot for your answer! It works well, except that I have (different amount of) white-spaces within my %s-columns that also split the data with

FCsplit = regexp(FC, '(\S*)[\s]', 'match');

So that I get different amounts of columns for every line. Do you have any idea how to solve this?

I attached 2 example lines of my file.

Simon 2013년 10월 30일

MATLAB Online에서 열기

Hi!

The "(\S*)" catches the column contents, it is everything but white spaces. The "[\s]" splits at white spaces.

If you have white spaces you may catch everything except the tab and split at tab:

FCsplit = regexp(FC, '([^\t])\t', 'match');

댓글을 달려면 로그인하십시오.

Answer 3

Ken Atwell 2013년 10월 24일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/91340-read-in-only-certain-columns-of-big-text-file#answer_107240

MATLAB Online에서 열기

You can use repmat avoid typing '%*d%*d%*d%*d%*d%d%s' 112 times (also note the use of '*' to skip importing certain columns. Try something like this:

 formatSpec = ['%*s%*s%*s%*s%*s%*s%*s%*s' repmat('%*d%*d%*d%*d%*d%d%s', [1,112]) '%*[^\n]'];
 textscan(fileId, formatSpec, 'Delimiter', '\t')

댓글 수: 2
없음 표시없음 숨기기

Ken Atwell 2013년 10월 24일

MATLAB Online에서 열기

Test code with a dummy file:

 of = fopen('test.tsv', 'w');
 for i = 1:100
    fprintf(of, 's1\ts2\ts3\ts4\ts5\ts6\ts7\ts8\t');
    for j=1:112
        fprintf(of, '%d\t%d\t%d\t%d\t%d\t%d\tstr\t', j+0, j+1, j+2, j+3, j+4, j+5);
    end
    for j=1:100
        fprintf(of, '-1\t');
    end
    fprintf(of, '\n');
 end
 fclose (of);
 fileId = fopen('test.tsv');
formatSpec = ['%*s%*s%*s%*s%*s%*s%*s%*s' repmat('%*d%*d%*d%*d%*d%d%s', [1,112]) '%*[^\n]'];
 T = textscan(fileId, formatSpec, 'Delimiter', '\t');
 fclose (fileId);

Charlotte 2013년 10월 29일

That works well for the format specifier, thank you!

But from there I still don't know how to read in my data as cell array of dimension (num of columns)x(num of lines).

댓글을 달려면 로그인하십시오.

read in only certain columns of big text file

댓글 수: 2
없음 표시없음 숨기기

채택된 답변

댓글 수: 10
이전 댓글 8개 표시이전 댓글 8개 숨기기

추가 답변 (2개)

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 2
없음 표시없음 숨기기

참고 항목

카테고리

태그

Community Treasure Hunt

read in only certain columns of big text file

댓글 수: 2 없음 표시없음 숨기기

채택된 답변

댓글 수: 10 이전 댓글 8개 표시이전 댓글 8개 숨기기

추가 답변 (2개)

댓글 수: 2 없음 표시없음 숨기기

댓글 수: 2 없음 표시없음 숨기기

참고 항목

카테고리

태그

Community Treasure Hunt

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 10
이전 댓글 8개 표시이전 댓글 8개 숨기기

댓글 수: 2
없음 표시없음 숨기기

댓글 수: 2
없음 표시없음 숨기기