Summing values with database "group by" functionality

조회 수: 2 (최근 30일)
David
David 2012년 5월 4일
Tables in databases can easily be transformed by use of the "Group by" function.
Groups ususally occur on a common lable (such as state) while the data columns can be summed, averaged, counted, etc.
In the past, I've take matrices in matlab and dumped them into SQL tables where I can easily use this functionality.
This can take a long time to write the data andif it can be done in Matlab, before writign teh data, it would save considerable time.
Is there a way to do this in MatLab directly?

답변 (4개)

Sean de Wolski
Sean de Wolski 2012년 5월 4일

per isakson
per isakson 2012년 5월 4일
In the Statistical toolbox there is a class named Dataset. It has a method grpstats.
grpstats
Class: dataset
Summary statistics by group for dataset arrays
I think that does exactly what you ask for and more.
With plain matlab I'm convinced that data in a cell array, a for loop and some logical indexing will do the job. If you provide a toy example of data someone here will give you a piece of code, which demonstrates the approach.
--- CONT. ---
Study the Statistical toolbox Exampel: "Using Dataset Arrays"
--- CONT. 2012-05-07 ---
@David: This "file" is a mess.
  1. There are seven column headers or is it five?
  2. The first and second data row each contains eight values.
  3. The third data row contains nine values.
  4. Both comma and space are used as list separators
Additional information is needed. How to identify "missing values", etc.. Or is space not a separator but part of text values? You tell me.
  댓글 수: 4
Walter Roberson
Walter Roberson 2012년 5월 7일
Looks to me as if comma is used as a decimal grouping in this file.
per isakson
per isakson 2012년 5월 8일
@Walter, yes indeed.

댓글을 달려면 로그인하십시오.


Peter Perkins
Peter Perkins 2012년 5월 7일
David, your example data has a few problems, notably the percent signs. Without having any details about what you are trying to do (and in particular how you might want to define groups in your data), here is an example of what you might do. This example uses a dataset array, but since you have nothing but numeric data, there's no reason why you could not use grpstats on a matrix.
>> % remove embedded spaces from last column header in file
>> % remove stray space from end of next to last line
>> data = dataset('File','tmp.dat','Delimiter',' ');
>> % remove the percent signs
>> data.ID = uint64(data.ID)/100;
>> data.Loss = str2double(strrep(data.Loss,'%',''))/100;
>> data.EL = str2double(strrep(data.EL,'%',''));
>> data
data =
Event ID Loss Rate EL
1949 44688717 0 14.2 9e-07
3216 34845443 1.6e-07 574.8 4.7e-05
6443 33870362 8.65e-06 29301 0.002467
3721 31438262 6.04e-06 18995 0.001723
[snip]
570 15437760 3e-08 50.7 9.4e-06
3860 15242217 2.9e-07 435.2 8.14e-05
>> data.LossGroup = ...
ordinal(data.Loss,{'High' 'Low'},[],[0,median(data.Loss),Inf])
data =
Event ID Loss Rate EL LossGroup
1949 44688717 0 14.2 9e-07 High
3216 34845443 1.6e-07 574.8 4.7e-05 High
6443 33870362 8.65e-06 29301 0.002467 Low
3721 31438262 6.04e-06 18995 0.001723 Low
[snip]
570 15437760 3e-08 50.7 9.4e-06 High
3860 15242217 2.9e-07 435.2 8.14e-05 High
>> lossGroupMeans = grpstats(data,'LossGroup','mean', ...
'DataVars',{'Rate' 'ELRelativeRate'})
lossGroupMeans =
LossGroup GroupCount mean_Rate mean_EL
High High 16 1837.7 0.00026079
Low Low 17 49862 0.0082853

per isakson
per isakson 2012년 5월 8일
Here is a solution that doesn't require the Statistical toolbox.
Separators:
  1. list separator: space
  2. thousand separator: comma
  3. traling: %
Approach:
  1. read the first line (header) to a separate variable.
  2. read the rest of the file to a string buffer
  3. remove "," and "%" from the string buffer
  4. read the string buffer with textscan
  5. convert the cell array of double vectors to a double array
[ hdr, M ] = Read_text_file();
The whole numbers in the file have been converted to "flints" (see Floating Points ). The mean of the "rows", for which Event==638, can be calculated with logical indexing.
mean( M( M(:,1)==638, : ), 1 )
With "flint" it is safe to use "==". With floating point numbers one need to allow for rounding errors
mean( M( abs(M(:,5)-0.0024558) < epsilon, : ), 1 )
where epsilon is some appropriate small number.
--- Attachment ---
function [ hdr, M ] = Read_text_file()
fid = fopen( 'Read_text_file.txt', 'r' );
hdr = fgetl( fid );
str = fread( fid, '*char' );
sts = fclose( fid ); %#ok<NASGU>
str( str == ',' ) = [];
str( str == '%' ) = [];
cac = textscan( str, '%f%f%f%f%f' );
M = [ cac{:} ];
end

카테고리

Help CenterFile Exchange에서 Large Files and Big Data에 대해 자세히 알아보기

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by