How to cluster data in a histogram?

Sim, 3 Nov 2022
How can I cluster the data in a histogram?
Desired output
Input
A = duration({'00:01:01'
'00:00:53'
'00:00:55'
'00:00:54'
'00:00:54'
'00:00:53'
'02:45:08'
'00:01:33'
'00:00:57'
'00:00:58'
'00:00:51'
'00:00:45'
'00:01:03'
'00:00:56'
'00:00:45'
'00:26:52'
'00:01:12'
'00:00:41'
'00:00:56'
'00:01:16'
'00:01:47'
'00:09:22'
'00:00:40'
'00:00:38'
'00:00:48'
'00:00:38'
'00:00:42'
'00:00:42'
'00:01:06'
'00:01:00'
'00:00:43'
'00:00:47'
'00:00:43'
'00:00:50'
'00:00:52'
'00:01:20'
'00:01:35'
'00:00:54'
'00:01:05'
'00:02:07'
'00:00:43'
'00:00:39'
'00:29:36'
'00:00:39'
'00:00:39'
'00:01:01'
'00:01:09'
'00:01:12'
'00:01:11'
'00:01:12'
'00:01:06'
'00:01:06'
'00:01:00'
'00:01:15'
'00:01:08'
'00:00:39'
'00:00:59'
'00:00:54'
'00:01:25'
'00:01:01'
'00:01:03'
'00:01:03'
'00:00:56'
'00:01:19'
'00:01:05'
'00:01:00'
'00:01:09'
'00:01:12'
'00:00:52'
'00:00:40'
'00:01:09'
'00:01:00'
'00:01:04'
'00:00:57'
'00:02:07'
'00:02:44'
'00:00:51'
'00:01:22'
'00:01:10'
'00:01:07'
'00:01:07'
'00:00:48'
'00:00:59'
'00:01:02'
'00:00:48'
'00:00:49'
'00:00:56'
'00:01:03'
'00:00:53'
'00:01:23'
'00:00:40'
'00:01:25'
'00:01:15'
'00:01:13'
'00:02:14'
'00:01:08'
'00:00:53'
'00:01:00'})
A = 98×1 duration array
00:01:01 00:00:53 00:00:55 00:00:54 00:00:54 00:00:53 02:45:08 00:01:33 00:00:57 00:00:58 00:00:51 00:00:45 00:01:03 00:00:56 00:00:45 00:26:52 00:01:12 00:00:41 00:00:56 00:01:16 00:01:47 00:09:22 00:00:40 00:00:38 00:00:48 00:00:38 00:00:42 00:00:42 00:01:06 00:01:00
binLimits = [min(A) max(A)];          % span the full range of the data
binWidth = duration('00:00:05');      % 5-second bins
[counts,binEdges] = histcounts(A,'BinLimits',binLimits,'BinWidth',binWidth);
histogram(A,'BinEdges',binEdges,'FaceColor','k','EdgeColor','k');

Accepted Answer

Cris LaPierre, 3 Nov 2022
You need to define the criteria for what makes a cluster and what makes an outlier. Once you have defined that, you can use rmoutliers to apply your criteria and remove the outliers.
To me, it looks like you want to use quartiles. I find this easiest to view with a boxchart. Note that MATLAB does not accept the duration data type for outlier detection. Use the minutes function to convert your durations into numeric values, and vice versa.
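As a small illustration of that conversion in both directions, here is a minimal sketch. The two sample values are made up for the example and are not taken from the data set.

```matlab
% Hypothetical sample values, only to illustrate the round trip
d = duration({'00:01:30'; '00:00:45'});

m = minutes(d);          % duration -> double, expressed in minutes
d2 = minutes(m);         % double -> duration again
d2.Format = 'hh:mm:ss';  % display as hh:mm:ss instead of "1.5 min"
```

Called with a duration, minutes returns plain numbers; called with numbers, it returns a duration, so the same function covers both steps.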
A = duration({'00:01:01';'00:00:53';'00:00:55';'00:00:54';'00:00:54';'00:00:53';'02:45:08';'00:01:33';
'00:00:57';'00:00:58';'00:00:51';'00:00:45';'00:01:03';'00:00:56';'00:00:45';'00:26:52';'00:01:12';
'00:00:41';'00:00:56';'00:01:16';'00:01:47';'00:09:22';'00:00:40';'00:00:38';'00:00:48';'00:00:38';
'00:00:42';'00:00:42';'00:01:06';'00:01:00';'00:00:43';'00:00:47';'00:00:43';'00:00:50';'00:00:52';
'00:01:20';'00:01:35';'00:00:54';'00:01:05';'00:02:07';'00:00:43';'00:00:39';'00:29:36';'00:00:39';
'00:00:39';'00:01:01';'00:01:09';'00:01:12';'00:01:11';'00:01:12';'00:01:06';'00:01:06';'00:01:00';
'00:01:15';'00:01:08';'00:00:39';'00:00:59';'00:00:54';'00:01:25';'00:01:01';'00:01:03';'00:01:03';
'00:00:56';'00:01:19';'00:01:05';'00:01:00';'00:01:09';'00:01:12';'00:00:52';'00:00:40';'00:01:09';
'00:01:00';'00:01:04';'00:00:57';'00:02:07';'00:02:44';'00:00:51';'00:01:22';'00:01:10';'00:01:07';
'00:01:07';'00:00:48';'00:00:59';'00:01:02';'00:00:48';'00:00:49';'00:00:56';'00:01:03';'00:00:53';
'00:01:23';'00:00:40';'00:01:25';'00:01:15';'00:01:13';'00:02:14';'00:01:08';'00:00:53';'00:01:00'});
B = minutes(A)
B = 98×1
1.0167 0.8833 0.9167 0.9000 0.9000 0.8833 165.1333 1.5500 0.9500 0.9667
boxchart(B)
Here, outliers are indicated with the 'o' marker. Outliers are values that are more than 1.5 · IQR away from the top or bottom of the box. You indicate there should be 4, but the default definition classifies more points as outliers, so you will need to use a custom definition.
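To see how the threshold factor changes what counts as an outlier, here is a minimal sketch using isoutlier, which applies the same quartile rule and can also return the lower and upper limits. The sample values and variable names are hypothetical, chosen only so that the two settings give different results.

```matlab
% Hypothetical sample in minutes: a tight group plus a few larger values
cluster = linspace(0.6, 1.6, 40)';
extremes = [2.7; 9.4; 26.9; 29.6; 165.1];
B = [cluster; extremes];

% Default quartile rule: more than 1.5*IQR outside the box is an outlier
[tfDefault, loLimit, hiLimit] = isoutlier(B, "quartiles");

% Looser rule: more than 20*IQR outside the box is an outlier
tfLoose = isoutlier(B, "quartiles", "ThresholdFactor", 20);

fprintf("Default limits: [%.2f, %.2f] min\n", loLimit, hiLimit);
fprintf("Flagged with factor 1.5: %d\n", nnz(tfDefault));
fprintf("Flagged with factor 20:  %d\n", nnz(tfLoose));
```

Raising the factor widens the accepted range, so fewer of the large values are flagged. Trying a few factors this way is one way to find a threshold that matches the number of outliers you expect.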
I find that the Clean Outlier Data task in a live script is a quick and interactive way to find the desired threshold. Since tasks don't work here, I'll just use rmoutliers with a threshold factor of 20.
% Remove outliers
B = rmoutliers(B,"quartiles","ThresholdFactor",20);
% View results
boxchart(B)
% convert back to duration
A = minutes(B);
A.Format = 'hh:mm:ss'
A = 94×1 duration array
00:01:01 00:00:53 00:00:55 00:00:54 00:00:54 00:00:53 00:01:33 00:00:57 00:00:58 00:00:51 00:00:45 00:01:03 00:00:56 00:00:45 00:01:12 00:00:41 00:00:56 00:01:16 00:01:47 00:00:40 00:00:38 00:00:48 00:00:38 00:00:42 00:00:42 00:01:06 00:01:00 00:00:43 00:00:47 00:00:43
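If you also want to know which durations were removed, one option is to use the second output of rmoutliers as a logical mask on the original duration array, which also avoids converting back from minutes. This is a minimal sketch with made-up values; the names Akept and Aout are hypothetical.

```matlab
% Hypothetical sample: 21 values between 40 s and 80 s plus two long durations
A = [seconds(40:2:80)'; duration(0,26,52); duration(2,45,8)];
A.Format = 'hh:mm:ss';

% Detect outliers on numeric minutes, but keep only the logical mask
B = minutes(A);
[~, removed] = rmoutliers(B, "quartiles", "ThresholdFactor", 20);

Akept = A(~removed);   % main group, still a duration array
Aout  = A(removed);    % the values classified as outliers
```

Akept can then be passed to the histogram code, and Aout lists the values that were treated as outliers.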
% Original code for creating a histogram
binLimits = [min(A) max(A)];          % span the range of the cleaned data
binWidth = duration('00:00:05');      % 5-second bins
[counts,binEdges] = histcounts(A,'BinLimits',binLimits,'BinWidth',binWidth);
histogram(A,'BinEdges',binEdges,'FaceColor','k','EdgeColor','k');
Comments: 1
Sim, 3 Nov 2022
Thanks a lot, very very appreciated!!! :-)
