Read CSV into table, but quoted text data contain new-lines

Question

FM 2021년 7월 12일

0
링크

이 질문에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/877103-read-csv-into-table-but-quoted-text-data-contain-new-lines

댓글: FM 2021년 7월 13일

I am using "readtable" to read a CSV file into a table. The first row consist of column headings, and there are no row names. Some quoted text data contain new-lines, such as in this single string consisting of multiple lines:

"...some text</p><p>    
</p><p>                 
more text...end text. "

The "readtable" function interprets the 2nd line above as a new record. None of the "readtable" parameters seem to be for recognizing newlines within a quote string as part of the sting itself. Is it possible to do this with "readtable"?

According to the following thread, it should be enough that the multi-line string be enclosed in quotes:

https://www.mathworks.com/matlabcentral/answers/350559-how-do-i-readtable-with-a-csv-file-when-table-elements-have-newlines-or-double-quotation-marks

However, I get the error:

Error using readtable (line 216)                                         
Reading failed at line 3. All lines of a text                            
file must have the same number of delimiters.                            
Line 3 has 0 delimiters, while preceding lines                           
have 20.                                                                 
                                                                         
Note: readtable detected the following parameters:                       
'Delimiter', ',', 'HeaderLines', 0, 'ReadVariableNames', true, 'Format', 
'%f%f%f%q%q%q%q%q%q%f%f%f%f%q%q%f%D%q%f%f%q'                             

Line 3 is the "</p><p>" in the sample above.

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Answer 1

Walter Roberson 2021년 7월 12일

0
링크

이 답변에 대한 바로 가기 링크

https://kr.mathworks.com/matlabcentral/answers/877103-read-csv-into-table-but-quoted-text-data-contain-new-lines#answer_745248

In order to have any possibility of success using readtable() you would need to use detectImportOptions and then set the LineEnding option to be empty. However, you would then have problems at end of line.

I suggest that you use textscan() instead.

tname = tempname()
tname = '/tmp/tp59a3c316_8110_4c14_bb7d_a80801e3c42b'
fid = fopen(tname, 'w');
cleanMe = onCleanup(@() delete(tname));
fprintf(fid, '"...some text</p><p>\n</p><p>\nmore text...end text. ", "hello"\n"line 2", "L2 field 2"\n');
fclose(fid)
ans = 0
dbtype(tname)
1     "...some text</p><p>
2     </p><p>
3     more text...end text. ", "hello"
4     "line 2", "L2 field 2"
fid = fopen(tname, 'r');
data_cell = textscan(fid, '%q%q', 'delimiter', ',')
data_cell = 1×2 cell array
    {2×1 cell}    {2×1 cell}
fclose(fid)
ans = 0
data_cell{1}
ans = 2×1 cell array
    {'...some text</p><p>↵</p><p>↵more text...end text. '}
    {'line 2'                                            }
data_cell{2}
ans = 2×1 cell array
    {'hello'     }
    {'L2 field 2'}

It turned out to be important to not use comma between the two %q and to use 'delimiter', ','

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

FM 2021년 7월 13일

Hi, Walter,

Your example of how to set the 'Format' led to to try something with "readtable" that succeeded:

cd C:\Path\to\Folder\With\File.csv                          
sd.project_info = readtable( 'File.csv', ...                
   'Delimiter',',' , ...                                    
   'Format', ...                                            
      '%f%f%f%q%q%q%q%q%q%f%f%f%f%q%q%f%D%q%f%f%q%q%q%f' ); 

The "Format" string differs from the one in my originally posted question in that the original "Format" ended in only one "%q" field, but should have ended in "%q%q%q%f". It was short 3 fields, which I added.

There are probably some heuristics that look at the longest lines, especially if they recur in the file, and base the "Format" off of those. So the header row didn't get used, since it is very short, though it would have been the right one to use -- to determine the number of fields, that is, not to determine the data type of the fields.

Alternatively, perhaps the heuristics use the lines containing the most occurrences of delimiters, but then when it tries to infer the "Format", it ignores the delimiters within quoted strings. Again, that would lead to using the very long data lines instead of the header row.

This above is all guesswork. In the end, if the specific and correct "Format" is given, "readtable" figures out that the record continues beyond a newline.

Thanks!

P.S. Microsoft Access properly imports the data, but the complexities of setting an ODBC data source name was daunting. I never got through it.

댓글을 달려면 로그인하십시오.

Read CSV into table, but quoted text data contain new-lines

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

제품

릴리스

Community Treasure Hunt

Read CSV into table, but quoted text data contain new-lines

댓글 수: 0 이전 댓글 -2개 표시이전 댓글 -2개 숨기기

채택된 답변

댓글 수: 1 이전 댓글 -1개 표시이전 댓글 -1개 숨기기

추가 답변 (0개)

참고 항목

카테고리

태그

제품

릴리스

Community Treasure Hunt

댓글 수: 0
이전 댓글 -2개 표시이전 댓글 -2개 숨기기

댓글 수: 1
이전 댓글 -1개 표시이전 댓글 -1개 숨기기