Conversion of .mat to CSV is time- and memory-intensive
I have a script that converts .mat data files to CSV for further processing in Python.
The .mat file contains a 'debug' structure with ~320 timeseries.
The problem is in line 36 with the cell2table function: it takes ages and fills my 16 GB of memory, while the original .mat file is only 270 MB max.
What would be a better way of doing this?
% Load the .mat file
data_struct = load('path\to\mat_file');
disp('data_struct loaded');
% Get the struct variable
data = data_struct.debug;
% Access the struct field names
field_names = fieldnames(data);
% Extract the timestamps from the first timeseries
timestamps = data.(field_names{1}).Time;
% Initialize a cell array to store data
csvData = cell(length(timestamps)+1, length(field_names)+1);
disp('csvData initialized');
% Set the header row with field names and 'time' in the first column
csvData(1, 1) = {'time'};
csvData(1, 2:end) = field_names;
% Fill in the timestamps in the first column
csvData(2:end, 1) = num2cell(timestamps);
disp('timestamps created');
% Fill in the timeseries data in the remaining columns
for i = 1:length(field_names)
current_field_name = field_names{i};
timeseriesData = data.(current_field_name).Data;
csvData(2:end, i+1) = num2cell(timeseriesData);
fprintf('Data %d/%d has been written to cell \n', i, length(field_names));
end
% Convert the cell array to a table
disp('Converting cell to table');
csvTable = cell2table(csvData);
disp('Converting done');
% Write the table to a CSV file
writetable(csvTable, 'output.csv');
disp('csv created');
5 Comments
"It takes for ages and fills my 16GB of memory, while the original .mat file is only 270MB max."
MAT files since v7.0 are compressed by default, so in general it is expected that the uncompressed data would be larger.
CSV files are also uncompressed, so will in general be larger than the MAT file. How much larger (or smaller) ... depends on how much the data could be compressed.
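To illustrate the point, here is a minimal Python sketch (assuming SciPy is installed; the array and file names are made up). Writing the same highly compressible array with and without compression shows how much smaller a v7-style compressed MAT file can be than the raw data:

```python
# Compare on-disk size of a compressed vs. uncompressed MAT file.
# 1 million float64 zeros = 8 MB of raw data, which compresses
# extremely well.
import os
import numpy as np
from scipy.io import savemat

data = {"debug": np.zeros((1_000_000,))}

savemat("uncompressed.mat", data, do_compression=False)
savemat("compressed.mat", data, do_compression=True)

print(os.path.getsize("uncompressed.mat"))  # roughly 8 MB
print(os.path.getsize("compressed.mat"))    # a few kB
```

Real measurement data will not compress this dramatically, but the direction is the same: the uncompressed (and CSV) representation is larger than the MAT file on disk.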
"What would be a better way of doing this?"
SCIPY can natively import v4, v6, v7 to v7.2 MAT files. Apparently an HDF5 library can import v7.3 MAT files.
Why do you need this conversion at all?
What version are the MAT files?
Star Strider
21 January 2024
‘Why do you need this conversion at all?’
My guess is that Python needs to be able to read and use them —
‘I have a script that converts .mat data files to CSV for further processing in Python.’
"My guess is that Python needs to be able to read and use them"
Python can already read and use them (if we accept the installation of modules). In my comment I wrote that SCIPY can already import most MAT file versions. You can think of SCIPY as an extension of NUMPY; it includes lots of useful numeric routines, e.g. FFT, linear algebra solvers, ODE solvers, optimization routines, etc.:
I know that SCIPY can import MAT files because I have used this functionality many times. It works.
The routines that are related to MAT files are listed here:
So my question remains unanswered: is this (apparently challenging) conversion really necessary?
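For reference, reading a struct of timeseries like the one in the question could look roughly like this in Python (a sketch, assuming SciPy is available and the file is v7.2 or earlier; the 'debug' struct mirrors the question, but the field names and toy file created here are made up so the snippet is self-contained):

```python
# Read a MAT struct of timeseries directly with SciPy, skipping the
# CSV conversion entirely. We first create a small example file so the
# sketch is runnable on its own.
import numpy as np
from scipy.io import savemat, loadmat

# Toy 'debug' struct with two timeseries (hypothetical field names)
debug = {
    "signal_a": {"Time": np.arange(5.0), "Data": np.arange(5.0) * 2},
    "signal_b": {"Time": np.arange(5.0), "Data": np.arange(5.0) ** 2},
}
savemat("example.mat", {"debug": debug})

# simplify_cells=True turns MATLAB structs into plain Python dicts
mat = loadmat("example.mat", simplify_cells=True)
debug_loaded = mat["debug"]

for name, ts in debug_loaded.items():
    print(name, ts["Time"], ts["Data"])
```

With `simplify_cells=True` (SciPy 1.4+), each timeseries comes back as a dict of 1-D NumPy arrays, which is usually more convenient to work with in Python than a CSV round-trip.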
Star Strider
21 January 2024
I haven’t used Python (yet). I’ll keep that in mind.
Philipp Kutschmann
21 January 2024
Answers (1)
Shubham
15 March 2024
Hi Philipp,
Converting a large .mat file with many timeseries into a CSV format can indeed be memory-intensive, especially when using cell arrays and tables due to the overhead associated with these data structures. A more efficient approach might be to write directly to the CSV file without converting the entire dataset into a table first. This can significantly reduce memory usage and potentially speed up the process.
Here's how you can modify your script to write directly to a CSV file:
% Load the .mat file
data_struct = load('path\to\mat_file');
disp('data_struct loaded');
% Get the struct variable
data = data_struct.debug;
% Access the struct field names
field_names = fieldnames(data);
% Extract the timestamps from the first timeseries
timestamps = data.(field_names{1}).Time;
% Open a file for writing
fid = fopen('output.csv', 'w');
% Write the header row with field names and 'time' in the first column
fprintf(fid, 'time,');
fprintf(fid, '%s,', field_names{1:end-1});
fprintf(fid, '%s\n', field_names{end});
% Write the data rows
for i = 1:length(timestamps)
% Write the timestamp
fprintf(fid, '%f,', timestamps(i));
% Write the timeseries data for each field
for j = 1:length(field_names)
current_field_name = field_names{j};
timeseriesData = data.(current_field_name).Data(i);
if j == length(field_names)
% For the last field, end the line
fprintf(fid, '%f\n', timeseriesData);
else
fprintf(fid, '%f,', timeseriesData);
end
end
% Optional: Display progress
if mod(i, 1000) == 0
fprintf('Row %d/%d has been written \n', i, length(timestamps));
end
end
% Close the file
fclose(fid);
disp('CSV created');
This script does the following:
- Instead of accumulating all data in memory, it directly writes to the file.
- It prints the column names (time and all field_names) as the first row of the CSV.
- For each timestamp, it writes the corresponding data from each timeseries. This is done row by row, significantly reducing memory usage since only one row of data is handled at a time.
- After writing all the data, it properly closes the file.
Advantages:
- This approach avoids creating a large cell array or table in memory, instead writing data directly to a file as it is processed.
- Directly writing to a file can be faster than creating a large data structure and then converting it to a table and writing it to a file.
Considerations:
- Ensure that the format (e.g., floating-point precision) is appropriate for your data. Adjust the fprintf format specifiers as needed.
- Consider adding error handling, for example, checking if the file was successfully opened before proceeding with writing.
- The optional progress reporting (every 1000 rows) can help monitor the script's progress, especially with large datasets. Adjust the frequency of these messages as needed based on your dataset size.
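Since the end goal is processing in Python anyway, the same streaming idea can also be sketched on the Python side: load the MAT file with SciPy and write the CSV row by row with the stdlib csv module, never materialising the full table in memory. This is only a sketch under the assumption that all timeseries share the first field's timestamps (as the MATLAB script above assumes); the file and field names here are made-up examples:

```python
# Stream MAT timeseries data to CSV one row at a time.
import csv
import numpy as np
from scipy.io import savemat, loadmat

# Toy input so the sketch is runnable end to end (hypothetical fields)
savemat("debug.mat", {"debug": {
    "sig1": {"Time": np.arange(4.0), "Data": np.array([1.0, 2.0, 3.0, 4.0])},
    "sig2": {"Time": np.arange(4.0), "Data": np.array([0.5, 1.5, 2.5, 3.5])},
}})

debug = loadmat("debug.mat", simplify_cells=True)["debug"]
field_names = list(debug)
timestamps = debug[field_names[0]]["Time"]
columns = [debug[name]["Data"] for name in field_names]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", *field_names])      # header row
    for i, t in enumerate(timestamps):           # one row at a time
        writer.writerow([t, *(col[i] for col in columns)])
```

Like the MATLAB version above, this keeps memory usage to a single row at a time instead of a full in-memory table.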