Statistics of datastore of tabular data

Omar Kamel on 25 Mar 2024
Commented: Omar Kamel on 28 Mar 2024
Hey all,
I have thousands of Parquet files. Each file has more than 50,000 rows of numerical data and more than 100 columns. The data doesn't fit in memory, so I use datastores to import and handle it for a downstream machine learning workflow. I would like to know whether it is possible to calculate some statistics (max, min, mean, std for each channel) for each file during the datastore creation process, so that I can use them afterwards to filter and select the relevant segments of data for my downstream analysis.
Thanks in advance

Accepted Answer

Abhas on 26 Mar 2024
Hi Omar,
To calculate statistics (max, min, mean, std for each channel) during the datastore creation process in MATLAB and use them for filtering and selecting relevant data segments for downstream analysis, you can follow these steps:
  1. Create a Datastore: Initialize a 'datastore' for your Parquet files.
  2. Define Custom Function: Create a function to compute the desired statistics for each chunk of data.
  3. Apply Transformation: Use the 'transform' function to apply your custom statistics calculation to the datastore.
  4. Read and Aggregate Statistics: Iterate over the datastore to read the statistics of each chunk and aggregate them globally.
  5. Use Statistics for Filtering: Leverage the aggregated statistics to filter and select relevant data segments.
Here's the MATLAB code to reflect the above steps:
% Step 1: Create your datastore. 'ReadSize','file' makes each read return one
% complete file, so the statistics below are computed per file.
ds = parquetDatastore('path/to/your/parquet/files/*.parquet', 'ReadSize', 'file');
% Step 2: Define your custom function. In R2023b, save it as calculateStats.m
% or move it to the end of the script: local functions must follow all other code.
function statsTable = calculateStats(tbl)
    % Returns a one-row table with <name>_min, <name>_max, <name>_mean and
    % <name>_std for every variable (channel) in tbl.
    names = tbl.Properties.VariableNames;
    statsTable = varfun(@min, tbl, 'OutputFormat', 'table');
    statsTable.Properties.VariableNames = strcat(names, '_min');
    maxTable = varfun(@max, tbl, 'OutputFormat', 'table');
    maxTable.Properties.VariableNames = strcat(names, '_max');
    meanTable = varfun(@mean, tbl, 'OutputFormat', 'table');
    meanTable.Properties.VariableNames = strcat(names, '_mean');
    stdTable = varfun(@std, tbl, 'OutputFormat', 'table');
    stdTable.Properties.VariableNames = strcat(names, '_std');
    statsTable = [statsTable, maxTable, meanTable, stdTable];
end
% Step 3: Apply the transformation (it is evaluated lazily, when data is read)
statsDs = transform(ds, @calculateStats);
% Step 4: Read and aggregate the statistics
globalMin = inf; % Running minimum per channel; start a running maximum at -inf
while hasdata(statsDs)
    statsChunk = read(statsDs); % Statistics of one file
    isMin = endsWith(statsChunk.Properties.VariableNames, '_min');
    chunkMin = min(statsChunk{:, isMin}, [], 1); % Row vector, one entry per channel
    globalMin = min(globalMin, chunkMin);
    % Update the global max the same way. A global mean or std cannot be
    % updated like this: it also needs the number of rows in each file.
end
At this point, you have the aggregated statistics (e.g., globalMin) which you can use to filter and select relevant segments of your data for further analysis.
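For filtering, it can be more convenient to keep one row of statistics per file instead of collapsing everything into global values. The following is only a minimal sketch under stated assumptions: the two demo files, the channel names ch1 and ch2, and the threshold are hypothetical, and it assumes the transformed datastore returns files in the same order as ds.Files.

```matlab
% Hypothetical demo data: two small Parquet files in a temporary folder.
demoDir = tempname;
mkdir(demoDir);
t1 = table([1; 2; 3], [10; 20; 30], 'VariableNames', {'ch1', 'ch2'});
t2 = table([4; 5; 6], [40; 50; 60], 'VariableNames', {'ch1', 'ch2'});
parquetwrite(fullfile(demoDir, 'a.parquet'), t1);
parquetwrite(fullfile(demoDir, 'b.parquet'), t2);

% 'ReadSize','file' makes every read return one complete file.
ds = parquetDatastore(fullfile(demoDir, '*.parquet'), 'ReadSize', 'file');

% One-row table per file holding the per-channel maximum.
% varfun names the outputs max_ch1 and max_ch2 by default.
perFileMax = @(tbl) varfun(@max, tbl);
statsDs = transform(ds, perFileMax);

% Collect one row per file and attach the file names (assumed same order).
fileStats = readall(statsDs);
fileStats.File = ds.Files;

% Hypothetical selection rule: keep files whose ch1 maximum exceeds 3.
keep = fileStats.max_ch1 > 3;

% Build a new datastore that contains only the selected files.
selectedDs = parquetDatastore(fileStats.File(keep), 'ReadSize', 'file');
```

Only the small statistics table is held in memory here; the selected files are still read lazily through selectedDs.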
You may refer to the following documentation links for a better understanding of working with datastore and transform in MATLAB:
  1. parquetDatastore: https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.parquetdatastore.html?s_tid=doc_ta
  2. transform: https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.transform.html?s_tid=doc_ta
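A note on Step 4: a global min or max can be updated chunk by chunk, but a global mean or std cannot be obtained by averaging the per-file values. One common way is to pool them using the number of rows in each file. The sketch below assumes the per-file row count is also available (for example, if calculateStats additionally returned height(tbl), which the code above does not do), and it uses small hypothetical vectors in place of real files.

```matlab
% Hypothetical data for one channel, split across two "files".
x1 = [1; 2; 3; 4];
x2 = [10; 20; 30];

% Per-file summaries: row count, mean, and sample std (MATLAB's default, N-1).
n = [numel(x1); numel(x2)];
m = [mean(x1); mean(x2)];
s = [std(x1); std(x2)];

% Pooled mean: per-file means weighted by row count.
N = sum(n);
globalMean = sum(n .* m) / N;

% Pooled std: within-file plus between-file sums of squared deviations.
ss = sum((n - 1) .* s.^2 + n .* (m - globalMean).^2);
globalStd = sqrt(ss / (N - 1));
```

The pooled values match mean and std computed on the concatenated data, up to floating-point rounding.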
  Comments: 1
Omar Kamel on 28 Mar 2024
Hi Abhas, Thanks a lot for the elaborate answer. This is exactly what I was looking for.

Release

R2023b
