Cannot Load CSV file
이 질문을 팔로우합니다.
- 팔로우하는 게시물 피드에서 업데이트를 확인할 수 있습니다.
- 정보 수신 기본 설정에 따라 이메일을 받을 수 있습니다.
오류 발생
페이지가 변경되었기 때문에 동작을 완료할 수 없습니다. 업데이트된 상태를 보려면 페이지를 다시 불러오십시오.
이전 댓글 표시
0 개 추천
I am trying to load a csv file using the import tool.
It takes forever (like a weekend was not enough...).
I've included the screenshot of what I am doing.
The file has numbers from H2 to AEQ639774. From A1 to AEQ1 I have headers. From A2 to G639774 I have identifiers.
I was trying to first load the numbers into a numeric matrix, and then repeat the process for headers and identifiers separately. But not even this works.
The file is 1.28 GB.. so big but not that big.
My machine has 16gb ram so that should be enough.
I am probably doing something wrong!
Thanks in advance!
채택된 답변
Adam Danz
2018년 7월 31일
0 개 추천
That sounds fishy. What version of matlab are you using? I assume the problem persists after exiting and rebooting Matlab.
You could use an alternative method of importing the data such as xlsread() which bypasses some of the processing done by the import tool.
댓글 수: 15
romulo alves
2018년 7월 31일
MATLAB R2017b
romulo alves
2018년 7월 31일
oh and I am able to import small parts of the data quickly if I choose only 20 to 30 cells. so I suppose the problem is that the file is too large? is there a more efficient way to do this?
xlsread and reshaping are not really helping.
I was also trying to follow this instructions https://nl.mathworks.com/matlabcentral/answers/231857-loading-large-csv-files
but it's not working either..
Adam Danz
2018년 7월 31일
What was the problem with xlsread()?
What was the problem with reading the file in chunks (as described in your link)?
If xlsread() causes problems when reading the entire numerical matrix within your csv file, first I'd like to know what the problem is, secondly, you could try extracting the data in chunks. You'd make a loop that pulls out 100 columns at a time, for example.
romulo alves
2018년 7월 31일
편집: romulo alves
2018년 7월 31일
So, if I do xlsread('DOT.csv','H7:T20'), trying to extract only a little bit of numeric part, I get the message
Unable to read XLS file "path" File is not in recognized format.
If I do:
chunk_nRows = 2e4 ;
% - Open file.
fId = fopen( 'DOT.csv' ) ;
% - Read first line, convert to double, determine #columns.
line = fgetl( fId ) ;
row = sscanf( line, '%f,' )' ;
nCols = numel( row ) ;
% - Prealloc data, copy first row, init loop counter.
data = zeros( chunk_nRows, nCols ) ;
data(1,:) = row ;
rowCnt = 1 ;
% - Loop over rest of the file.
while ~feof( fId )
rowCnt = rowCnt + 1 ;
% - Realloc + a chunk if rowCnt larger than data array.
if rowCnt > size( data, 1 )
fprintf( 'Realloc ..\n' ) ;
data(size(data, 1)+chunk_nRows, nCols) = 0 ;
end
% - Read line, convert and store.
line = fgetl( fId ) ;
data(rowCnt,:) = sscanf( line, '%f,' )' ;
end
% - Truncate data to last row (truncate last chunk).
data = data(1:rowCnt,:) ;
% - Close file.
fclose( fId ) ;
I get the message
Subscript indices must either be real positive integers or logicals.
I checked and the code stops when
rowCnt = 20001
romulo alves
2018년 7월 31일
One thing that might influence this is the fact that some cells, instead of blanks or numbers, have an "e" or an "r". Could this be it?
romulo alves
2018년 7월 31일
편집: Walter Roberson
2018년 7월 31일
the dataset is here in case that helps
Walter Roberson
2018년 7월 31일
That suggests that the file has inconsistent format. Perhaps it repeats headers after 20000 rows, for example.
I suggest reading it with readtable(). If you have a new enough MATLAB, use detectImportOptions first.
Unfortunately registration is required to download that dataset.
romulo alves
2018년 7월 31일
편집: romulo alves
2018년 7월 31일
Thanks. So with readtable() and detectImportOptions I am able to load specific columns. This takes over 1 min per column, though. Since I have over 800 columns, this takes over 13 hours. Furthermore, I have other csv files I will need to load, so I am looking at a long time of loading data. Is there a quicker alternative? Thank you!
Adam Danz
2018년 7월 31일
I'm not sure this would speed things up or not but have you tried importing the data as a cell array, then filtering out non-numerical elements, then filling all empties with NaN?
romulo alves
2018년 7월 31일
편집: Walter Roberson
2018년 7월 31일
Thanks. I'll try. Meanwhile, if it is of any usefulness, I put the data on dropbox. Can be accessed via:
ok. so the readtable thing worked. Apparently the time is not proportional to the number of columns as I was assuming. the problem now is that even numbers appear as
'3000'
instead of
3000
Is it possible to go around this?
Walter Roberson
2018년 7월 31일
Weird, the quotes around the numbers are in the csv file itself.
I will see if I can work around it.
By default, the variables created by readtable() are double when the entire column is numeric, or cell arrays of character vectors when any element in a column is not numeric. This is where detectImportOptions() comes in handy. Have you tried that recommendation already?
Off the top of my head, I see two options. The first is to import cleaner data using detectImportOption(). The second is import the dirty data you've got and then clean it. UPDATE: If the quotes are part of the actual data as Walter discovered, the 2nd option may be the better of the two.
If your data is all in a cell array with mixed strings and doubles, you could use str2double() as in this example.
dirty = {'3000'; 3000; 'r'; 2000; '1000'};
clean = str2double(dirty);
isDbl = cellfun(@isnumeric, dirty);
clean(isDbl) = [dirty{isDbl}];
clean =
3000
3000
NaN
2000
1000
Walter Roberson
2018년 7월 31일
The 'e' and 'r' are probably the reason that most numbers are coded as if they are strings.
What do you want done with the 'e' and 'r' ? Is it okay to treat both of them the same way as empty cells, by changing all three of them into NaN ?
Walter Roberson
2018년 8월 1일
The file turns out to be UTF8 encoded, because it contains accented characters at various points. That leads to some problems.
I started working with reading in the entire file at one time to process as a single string (there can be a lot of advantages to working that way), but I encountered a Mathworks bug with native2unicode at the point of 1 gigabyte of decoded characters.
추가 답변 (0개)
카테고리
도움말 센터 및 File Exchange에서 Text Data Preparation에 대해 자세히 알아보기
참고 항목
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!웹사이트 선택
번역된 콘텐츠를 보고 지역별 이벤트와 혜택을 살펴보려면 웹사이트를 선택하십시오. 현재 계신 지역에 따라 다음 웹사이트를 권장합니다:
또한 다음 목록에서 웹사이트를 선택하실 수도 있습니다.
사이트 성능 최적화 방법
최고의 사이트 성능을 위해 중국 사이트(중국어 또는 영어)를 선택하십시오. 현재 계신 지역에서는 다른 국가의 MathWorks 사이트 방문이 최적화되지 않았습니다.
미주
- América Latina (Español)
- Canada (English)
- United States (English)
유럽
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)
