How do I determine the number of data in a 'messy' CSV and then import it?

조회 수: 4 (최근 30일)
I have a CSV file that looks like the following:
"","JOHN DOE","p0123TP01, Approved"
"","Rectum (4)(Volume: 57.77)","CTV (21)(Volume: 32.39)"
"0.0","100.0","100.0"
"0.1","78.01999","100.0"
"0.2","76.2245","100.0"
"0.3","75.21035","100.0"
[skipping to the end of the file]
"58.1","","1.88277"
"58.2","","0.21338"
"58.3","","0.0"
but the number of data sets can vary, so I must write a flexible script to accomodate however many data rows and columns exist:
"","DOE, JOHN","revised blurring for Case 16, ANON, original; calculating correlation between blurred dose and toxicity","","","","","","","","","","","","","",""
"","patient (1)(Volume: 18578.06)","Prostate (2)(Volume: 19.80)","Proximal-SV (3)(Volume: 6.40)","Rectum (4)(Volume: 70.44)","Bladder (5)(Volume: 143.70)","Intestine (6)(Volume: 8.40)","CutLine (7)(Volume: 415.77)","Air (8)(Volume: 5.06)","InsideRectum (9)(Volume: 36.17)","PTV1 (10)(Volume: 76.77)","PTVoriginal (11)(Volume: 80.89)","PTV2 (12)(Volume: 56.52)","Guide1 (13)(Volume: 81.79)","Guide2 (14)(Volume: 68.50)","guide (15)(Volume: 66.91)","CTV-operator (16)(Volume: 26.15)"
"0.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.1","12.88921","100.0","100.0","69.70608","73.36377","2.54878","82.87909","16.40558","73.91824","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.2","12.06455","100.0","100.0","67.50839","69.03589","0.88936","79.92729","16.40558","71.92567","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.3","11.56745","100.0","100.0","65.82671","67.72518","0.32793","77.7594","16.40558","70.44129","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.4","11.09599","100.0","100.0","64.40629","66.56731","0.15624","75.78776","16.40558","69.10236","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.5","10.1524","100.0","100.0","63.25727","65.87648","0.07211","73.65644","16.40558","68.12462","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.6","9.48733","100.0","100.0","61.64318","64.92351","0.0103","67.57963","16.40558","65.96868","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
"0.7","9.1257","100.0","100.0","60.35885","62.43443","0.0","59.25954","16.40558","65.12147","100.0","100.0","100.0","100.0","100.0","100.0","100.0"
[... skipping to end of file ...]
"58.1","0.00498","4.57181","0.23601","","","","","","","1.20431","1.143","1.63577","1.13037","1.34968","1.38174","3.51704"
"58.2","4.9E-4","0.46111","0.0","","","","","","","0.1189","0.11285","0.1615","0.1116","0.13326","0.13642","0.34903"
"58.3","0.0","0.0","","","","","","","","0.0","0.0","0.0","0.0","0.0","0.0","0.0"
What textscan formatSpec do I need to store this data? The following code results in a 1x1 cell array with element 0x2 empty double:
function [StudyIDcol,DistributionType,OrganSpecification,Doses,Volumes,DoseUnit,VolumeUnit] = CreateTableVariables(folderpath,type,StudyID)
% First we read the file. Code courtesy of Walter Roberson of the MathWorks community
fid = fopen([folderpath,type,'/',StudyID,'.csv'],'rt'); % rt = 'read in text mode'
if fid == -1 % added for debugging at the suggestion of Jan Hansen of the MathWorks community
error('Author:Function:OpenFile', 'Cannot open file');
end
num = 0;
% headers = cell % unfinished
while true %endless loop requiring 'break' command to exit!
H1 = fgetl(fid) ;
if feof(fid); break; end
H2 = fgetl(fid) ;
if feof(fid); break; end
datacell = textscan(fid, '%f%f', 'Delimiter', ',', 'CollectOutput', true) ; % originally '%f%f'
if isempty(datacell) || isempty(datacell{1}); break; end
% if any(isnan(datacell{1}(end,:))); datacell{1}(end,:) = []; end % commented out to verify the above command quits the loop
num = num + 1;
headers(num,:) = {H1, H2} ;
data(num) = datacell; % We store it to the data cell array only after checking that it contains data.
%fgetl(fid); %the empty line between organs
end
fclose(fid);
I am reading the textscan formatSpec documentation and more trying to solve this problem. Please help me understand how to use textscan.
I am now trying to write a script to flexibly import such files; I must learn how to scan it to count the number of rows and columns for the data beginning at "0.0","100.0","100.0" (not necessarily this number each time) ...
%function data = ReadMIMDVH(filepath)
filepath = '/home/sony/Documents/research/data/MIM Cumulative DVH/planned/1.csv';
% We write a script specifically to import MIM data.
% First we scan the file to see how many rows of data it has.
% We construct the format spec based on this number of rows and columns.
formatSpec = ['%*s %*s %*s %*s %q %q',repmat('%f',[rows,columns])];
% We read the file using textscan.
fileID = fopen(filepath);
DVH = textscan(fileID,formatSpec,'Delimiter',{',' '/n'});
%end

채택된 답변

Guillaume
Guillaume 2018년 1월 22일
I'd use readtable which should be able to work out the formatspec on its own. So just:
t = readtable(filepath); %all done file is read and parsed.
  댓글 수: 5
Guillaume
Guillaume 2018년 1월 23일
My next task is to learn how to parse the Volume variable names to strip the underscore and all characters after it.
A regexprep should do that very easily. Explain exactly what you want.
Daniel Bridges
Daniel Bridges 2018년 1월 23일
편집: Daniel Bridges 2018년 1월 23일
Please see the question, " How do I parse and erase from a string while importing CSV with tableread?" The goal is to most efficiently import the volume structure names, e.g.
"Rectum (4)(Volume: 70.44)","Bladder (5)(Volume: 143.70)"
keeping only the names 'Rectum', 'Bladder', etc. I was able to find one method, but it took four lines of code, and I suspect there is a way to do it with one. If you would post a better method, I would unaccept my answer and accept yours.

댓글을 달려면 로그인하십시오.

추가 답변 (1개)

Walter Roberson
Walter Roberson 2018년 1월 22일
opts = detectImportOptions(filepath);
nvar = length(opts.VariableNames);
opts.VariableNames = sprintfc('Var%d', 1:nvar);
test = readtable(filepath, opts);

카테고리

Help CenterFile Exchange에서 Large Files and Big Data에 대해 자세히 알아보기

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by