Removing Error Data from Table
Hi,
I am working on a project that involves plotting Temperature vs. Sensor Data on a scatter plot. My goal is to calculate the coefficients of a trend line for future predictions. However, there are instances where my sensor outputs false results. Instead of manually removing this data, I am in search of an outlier function that can automatically eliminate these anomalies and provide coefficients. These coefficients would be useful for defining the boundaries of valid data in future datasets.
I am aware of the standard outlier removal functions. However, the scatter plot I've attached here illustrates my challenge. You can observe the general pattern of valid data, interspersed with some errors. Although there's no trend line in the plot, it's possible to visualize how it would look and identify the inaccurate data points.
My question is: Is there a way to configure an outlier function to take into account both temperature and sensor output in identifying false results? Additionally, can this function reference previous data to aid in this evaluation? I have posted the plots below and also attached data samples of some sensor output for your reference.


Accepted Answer
Star Strider
5 December 2023
Edited: Star Strider
5 December 2023
This is definitely an interesting problem, and one I have thought about at other times.
I am not certain that I have completely solved it, since it still may require a bit of interactivity, depending on the data. For these two data sets, it seems to work reasonably well with these parameters. The scatterhist calls are just there as a way to illustrate the data, and are not absolutely necessary.
Try this —
LD = load('scatterData.mat')
LD = struct with fields:
dataTable4: [8321×2 table]
dataTable5: [8498×2 table]
% dataTable4 ————————————————————————————————————————————————————————————————————————————————
T4 = LD.dataTable4;
T4 = rmmissing(T4)
T4 = 8135×2 table
Temperature NO2_WE_Voltage
___________ ______________
10.769 249.58
10.669 249.71
10.991 248.95
11.243 248.65
11.301 248.27
11.414 248.4
11.414 248.22
11.283 247.82
11.258 248.17
11.27 248.1
11.215 248.12
11.228 248.07
11.414 248
11.706 247.52
12.028 246.84
12.801 245.61
VN4 = T4.Properties.VariableNames;
figure
hsh4 = scatterhist(T4{:,1}, T4{:,2}, 'NBins',50, 'Marker','.')
hsh4 =
1×3 Axes array:
Axes Axes Axes
idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');
[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices
ClustCounts = accumarray(idx4, 1); % Count Cluster Members
T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table
T4_ClusterCounts = 15×2 table
Uidx ClustCounts
____ ___________
1 3
2 136
3 1
4 1
5 1
6 1
7 3526
8 1
9 4
10 2173
11 2273
12 1
13 11
14 2
15 1
[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index
MaxCounts = 3526
MaxCidx = 7
cm = colormap(turbo(numel(unique(idx4))));
hold on
Ax = hsh4(1);
% get(Ax)
Ax.UserData
ans = 1×4 cell array
{'gscatter'} {8135×1 double} {8135×1 double} {0×0 double}
scatter(Ax.UserData{2}, Ax.UserData{3}, 10, idx4)
colormap(turbo(numel(unique(idx4))))
hold off
colorbar

Lv4 = idx4 == MaxCidx;
figure
scatter(Ax.UserData{2}(Lv4), Ax.UserData{3}(Lv4), 10, '.')
grid

% dataTable5 ————————————————————————————————————————————————————————————————————————————————
T5 = LD.dataTable5;
T5 = rmmissing(T5)
T5 = 7487×2 table
Temperature NO2_WE_Voltage
___________ ______________
12.411 252.73
12.255 252.96
12.255 253.31
12.27 253.18
12.368 252.81
12.356 252.68
12.338 253.21
12.368 252.3
12.338 252.78
12.27 252.33
12.169 252.96
12.283 252.43
12.451 252.63
12.69 251.52
13.224 251.09
13.831 249.76
VN5 = T5.Properties.VariableNames;
figure
hsh5 = scatterhist(T5{:,1}, T5{:,2}, 'NBins',50, 'Marker','.')
hsh5 =
1×3 Axes array:
Axes Axes Axes
idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');
[Uidx,~,uix] = unique(idx5);
ClustCounts = accumarray(idx5, 1);
T5_ClusterCounts = table(Uidx, ClustCounts)
T5_ClusterCounts = 10×2 table
Uidx ClustCounts
____ ___________
1 2
2 7411
3 1
4 64
5 1
6 1
7 1
8 3
9 2
10 1
[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))
MaxCounts = 7411
MaxCidx = 2
cm = colormap(turbo(numel(unique(idx5))));
hold on
Ax = hsh5(1);
% get(Ax)
Ax.UserData
ans = 1×4 cell array
{'gscatter'} {7487×1 double} {7487×1 double} {0×0 double}
scatter(Ax.UserData{2}, Ax.UserData{3}, 10, idx5)
colormap(turbo(numel(unique(idx5))))
hold off
colorbar

Lv5 = idx5 == MaxCidx;
figure
scatter(Ax.UserData{2}(Lv5), Ax.UserData{3}(Lv5), 10, '.')
grid

The clusterdata function works here with 'MaxClust' (the maximum number of clusters) set at 15 and 10 for the two data sets; however, that may not generalise to all data sets, and more clusters may be necessary in those instances. That is where the ‘interactivity’ requirement may arise.
It would likely be possible to create a function from this code (with or without the plots, however I believe the plots are necessary to illustrate the data and how the code works).
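As a sketch of what such a function might look like (a hypothetical wrapper of my own, not part of the original answer; the name and signature are assumptions):

```matlab
% Hypothetical helper: cluster the rows of a two-column table and keep
% only the rows belonging to the most-populated cluster.
function [Tclean, keep] = keepLargestCluster(T, maxClust)
    idx = clusterdata(table2array(T), 'MaxClust',maxClust, 'Criterion','distance');
    counts = accumarray(idx, 1);        % members per cluster
    [~, biggest] = max(counts);         % number of the largest cluster
    keep = idx == biggest;              % logical row selector
    Tclean = T(keep, :);                % surviving observations
end
```

Called as, for example, `T4clean = keepLargestCluster(rmmissing(LD.dataTable4), 15)`.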
I have been working on this for a few hours and will ponder that in the morning.
EDIT — (5 Dec 2023 at 04:35)
Streamlined code, added more explanation text.
11 Comments
Thanks, the `clusterdata` function looks interesting. The issue with my data isn't necessarily a noise-related issue, but rather false data caused by sensor malfunctions due to faulty wiring or power supply issues. So, I think this function might be what I require instead of using something like `rmoutliers`.
From what I have understood, `clusterdata` will segment the data into small blocks, and we select the block that has the maximum samples. However, what determines how each block is created? How close does data need to be to one another to be classified in the same cluster?
I am learning MATLAB as I progress and would eventually like to create a function that uses `clusterdata` and outputs it in a coder for a .NET project. How can I determine if a function such as `clusterdata` will be compatible for such an export?
Star Strider
5 December 2023
My pleasure.
The clusterdata function has a number of options, including a number of name-value pair options that govern how it works. (I chose the most common options, because I am not certain exactly what you want.) I encourage you to read the options to decide what works best (I am not certain what criteria you want to use to define the clusters), although the options I chose seem to give quite good results. (I initially considered using histograms until I realised that clusterdata was more efficient and provided the result I wanted. There are other clustering algorithms available and my initial approach used kmeans, however that did not give good results.)
I used an accumarray call to calculate the total number of points in each cluster, and then chose the cluster with the greatest number of points as the ‘winner’. That definitely worked with these data sets.
The last scatter plot for each data set demonstrates how to get the data for the ‘winning’ cluster to plot or to do further analysis.
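To make the grouping rule concrete: clusterdata with these options is shorthand for the explicit pdist/linkage/cluster pipeline below (a sketch using clusterdata's documented defaults of Euclidean distance and single linkage):

```matlab
X = table2array(T4);          % numeric data, as in the answer above
D = pdist(X);                 % pairwise Euclidean distances
Z = linkage(D, 'single');     % repeatedly merge the two nearest groups
idx = cluster(Z, 'MaxClust',15);  % cut the tree at the height giving <= 15 clusters
```

Two points end up in the same cluster when a chain of nearest-neighbour links connects them below that cut height, which is why dense runs of valid data stay together while isolated faulty readings form tiny clusters of their own.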
I will delete my Answer if it does not work for you.
Thanks, I will try the different options.
Eventually I need to add a trend line and work out the coefficients, but I need to remove the false data first, so this function seems to work.
Is it possible to plot scatterhist in a subplot?
I have corrected my code to use scatter instead of scatterhist (since scatterhist is not actually necessary), and added linear regressions with fitlm and predict and associated plots. All that is necessary to plot these as subplots is to assign them to the various subplot axes.
Try this —
LD = load('scatterData.mat')
LD = struct with fields:
dataTable4: [8321×2 table]
dataTable5: [8498×2 table]
% dataTable4 ————————————————————————————————————————————————————————————————————————————————
T4 = LD.dataTable4;
T4 = rmmissing(T4)
T4 = 8135×2 table
Temperature NO2_WE_Voltage
___________ ______________
10.769 249.58
10.669 249.71
10.991 248.95
11.243 248.65
11.301 248.27
11.414 248.4
11.414 248.22
11.283 247.82
11.258 248.17
11.27 248.1
11.215 248.12
11.228 248.07
11.414 248
11.706 247.52
12.028 246.84
12.801 245.61
VN4 = T4.Properties.VariableNames;
idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');
[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices
ClustCounts = accumarray(idx4, 1); % Count Cluster Members
T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table
T4_ClusterCounts = 15×2 table
Uidx ClustCounts
____ ___________
1 3
2 136
3 1
4 1
5 1
6 1
7 3526
8 1
9 4
10 2173
11 2273
12 1
13 11
14 2
15 1
[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index
MaxCounts = 3526
MaxCidx = 7
cm = colormap(turbo(numel(unique(idx4))));
figure
scatter(T4{:,1}, T4{:,2}, 10, idx4, '.')
grid
colormap(turbo(numel(unique(idx4))))
colorbar
xlabel(VN4{1})
ylabel(strrep(VN4{2},'_','\_'))
title('dataTable4')

Lv4 = idx4 == MaxCidx;
mdl4 = fitlm(T4{Lv4,1}, T4{Lv4,2})
mdl4 =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
________ _________ _______ ______
(Intercept) 270.78 0.11505 2353.7 0
x1 -1.7755 0.0060972 -291.19 0
Number of observations: 3526, Error degrees of freedom: 3524
Root Mean Squared Error: 2.35
R-squared: 0.96, Adjusted R-Squared: 0.96
F-statistic vs. constant model: 8.48e+04, p-value = 0
xv4 = linspace(min(T4{Lv4,1}), max(T4{Lv4,1}), numel(T4{Lv4,1})).';
[y4,yci4] = predict(mdl4, xv4);
figure
hs = scatter(T4{Lv4,1}, T4{Lv4,2}, 10, '.', 'DisplayName','Data');
hold on
hp1 = plot(xv4, y4, '-r', 'DisplayName','Linear Regression');
hp2 = plot(xv4, yci4, '--r', 'DisplayName','95% Confidence Intervals');
hold off
grid
xlabel(VN4{1})
ylabel(strrep(VN4{2},'_','\_'))
title('dataTable4')
legend([hs,hp1,hp2(1)], 'Location','best')

% dataTable5 ————————————————————————————————————————————————————————————————————————————————
T5 = LD.dataTable5;
T5 = rmmissing(T5)
T5 = 7487×2 table
Temperature NO2_WE_Voltage
___________ ______________
12.411 252.73
12.255 252.96
12.255 253.31
12.27 253.18
12.368 252.81
12.356 252.68
12.338 253.21
12.368 252.3
12.338 252.78
12.27 252.33
12.169 252.96
12.283 252.43
12.451 252.63
12.69 251.52
13.224 251.09
13.831 249.76
VN5 = T5.Properties.VariableNames;
idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');
[Uidx,~,uix] = unique(idx5);
ClustCounts = accumarray(idx5, 1);
T5_ClusterCounts = table(Uidx, ClustCounts)
T5_ClusterCounts = 10×2 table
Uidx ClustCounts
____ ___________
1 2
2 7411
3 1
4 64
5 1
6 1
7 1
8 3
9 2
10 1
[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))
MaxCounts = 7411
MaxCidx = 2
figure
scatter(T5{:,1}, T5{:,2}, 10, idx5, '.')
grid
colormap(turbo(numel(unique(idx5))))
colorbar
xlabel(VN5{1})
ylabel(strrep(VN5{2},'_','\_'))
title('dataTable5')

Lv5 = idx5 == MaxCidx;
mdl5 = fitlm(T5{Lv5,1}, T5{Lv5,2})
mdl5 =
Linear regression model:
y ~ 1 + x1
Estimated Coefficients:
Estimate SE tStat pValue
________ ________ ______ ______
(Intercept) 277.68 0.26206 1059.6 0
x1 -1.7983 0.013717 -131.1 0
Number of observations: 7411, Error degrees of freedom: 7409
Root Mean Squared Error: 8.03
R-squared: 0.699, Adjusted R-Squared: 0.699
F-statistic vs. constant model: 1.72e+04, p-value = 0
xv5 = linspace(min(T5{Lv5,1}), max(T5{Lv5,1}), numel(T5{Lv5,1})).';
[y5,yci5] = predict(mdl5, xv5);
figure
hs = scatter(T5{Lv5,1}, T5{Lv5,2}, 10, '.', 'DisplayName','Data');
hold on
hp1 = plot(xv5, y5, '-r', 'DisplayName','Linear Regression');
hp2 = plot(xv5, yci5, '--r', 'DisplayName','95% Confidence Intervals');
hold off
grid
xlabel(VN5{1})
ylabel(strrep(VN5{2},'_','\_'))
title('dataTable5')
legend([hs,hp1,hp2(1)], 'Location','best')

Thanks, I will try this out tomorrow. I think the regression line will need to be a second- or third-order polynomial.
Star Strider
5 December 2023
Is there any special functional relationship between the temperature and sensor voltage? If there is, using fitnlm to fit it to that model would likely be more appropriate. The code is essentially the same except for replacing fitlm with fitnlm.
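A minimal sketch of that substitution, assuming a hypothetical exponential model form (the actual modelfun should come from the sensor's known behaviour); T4, Lv4 and xv4 are as defined in the code above:

```matlab
% Hypothetical model form and starting values -- substitute the real
% temperature/voltage relationship if one is known.
modelfun = @(b,x) b(1) + b(2).*exp(b(3).*x(:,1));
beta0    = [250; 20; -0.1];                 % rough initial guesses
mdl4n = fitnlm(T4{Lv4,1}, T4{Lv4,2}, modelfun, beta0);
[y4n, yci4n] = predict(mdl4n, xv4);         % same predict call as with fitlm
```

Nonlinear fitting needs reasonable starting values in beta0, so expect to tune those for a given model form.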
Yes, I am working on a correction factor which changes with temperature and over time. The plot below is a theoretical plot, which shows how the trend line will need to look; over time, as the temperature decreases, that region gets flatter.

That actually looks to be almost linear. That aside, since humidity is also a factor, perhaps doing a multiple linear regression (the regress function and others) with temperature and humidity as predictors could be worthwhile.
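A sketch of that multiple regression, using a hypothetical humidity vector RH (it is not in the posted data, so substitute your own measurements aligned with the kept rows):

```matlab
% 'RH' is a placeholder humidity predictor -- replace with real data.
RH = rand(nnz(Lv4),1)*100;                  % placeholder values, %RH
Xd = [ones(nnz(Lv4),1), T4{Lv4,1}, RH];     % design matrix with intercept column
[b, bint] = regress(T4{Lv4,2}, Xd);         % coefficients and 95% CIs
```

Note that regress requires the explicit column of ones for the intercept, unlike fitlm, which adds it automatically.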
Doing a second-order polynomial fit with fitlm is straightforward —
LD = load('scatterData.mat')
LD = struct with fields:
dataTable4: [8321×2 table]
dataTable5: [8498×2 table]
% dataTable4 ————————————————————————————————————————————————————————————————————————————————
T4 = LD.dataTable4;
T4 = rmmissing(T4)
T4 = 8135×2 table
Temperature NO2_WE_Voltage
___________ ______________
10.769 249.58
10.669 249.71
10.991 248.95
11.243 248.65
11.301 248.27
11.414 248.4
11.414 248.22
11.283 247.82
11.258 248.17
11.27 248.1
11.215 248.12
11.228 248.07
11.414 248
11.706 247.52
12.028 246.84
12.801 245.61
VN4 = T4.Properties.VariableNames;
idx4 = clusterdata(table2array(T4), 'MaxClust',15, 'Criterion','distance');
[Uidx,~,uix] = unique(idx4); % Unique Cluster Numbers & Indices
ClustCounts = accumarray(idx4, 1); % Count Cluster Members
T4_ClusterCounts = table(Uidx, ClustCounts) % Results Summary Table
T4_ClusterCounts = 15×2 table
Uidx ClustCounts
____ ___________
1 3
2 136
3 1
4 1
5 1
6 1
7 3526
8 1
9 4
10 2173
11 2273
12 1
13 11
14 2
15 1
[MaxCounts,MaxCidx] = max((T4_ClusterCounts{:,2})) % Largest Cluster & Index
MaxCounts = 3526
MaxCidx = 7
cm = colormap(turbo(numel(unique(idx4))));
% figure
% scatter(T4{:,1}, T4{:,2}, 10, idx4, '.')
% grid
% colormap(turbo(numel(unique(idx4))))
% colorbar
% xlabel(VN4{1})
% ylabel(strrep(VN4{2},'_','\_'))
% title('dataTable4')
Lv4 = idx4 == MaxCidx;
mdl4 = fitlm([T4{Lv4,1}.^2 T4{Lv4,1}], T4{Lv4,2}) % 'x1' = Temperature^2, 'x2' = Temperature
mdl4 =
Linear regression model:
y ~ 1 + x1 + x2
Estimated Coefficients:
Estimate SE tStat pValue
_________ __________ _______ ___________
(Intercept) 263.79 0.17135 1539.4 0
x1 -0.026384 0.00055086 -47.895 0
x2 -0.85033 0.019891 -42.75 4.6442e-322
Number of observations: 3526, Error degrees of freedom: 3523
Root Mean Squared Error: 1.83
R-squared: 0.976, Adjusted R-Squared: 0.976
F-statistic vs. constant model: 7.11e+04, p-value = 0
xv4 = linspace(min(T4{Lv4,1}), max(T4{Lv4,1}), numel(T4{Lv4,1})).';
[y4,yci4] = predict(mdl4, [xv4.^2, xv4]);
figure
hs = scatter(T4{Lv4,1}, T4{Lv4,2}, 10, '.', 'DisplayName','Data');
hold on
hp1 = plot(xv4, y4, '-r', 'DisplayName','Linear Regression');
hp2 = plot(xv4, yci4, '--r', 'DisplayName','95% Confidence Intervals');
hold off
grid
xlabel(VN4{1})
ylabel(strrep(VN4{2},'_','\_'))
title('dataTable4')
legend([hs,hp1,hp2(1)], 'Location','best')

% dataTable5 ————————————————————————————————————————————————————————————————————————————————
T5 = LD.dataTable5;
T5 = rmmissing(T5)
T5 = 7487×2 table
Temperature NO2_WE_Voltage
___________ ______________
12.411 252.73
12.255 252.96
12.255 253.31
12.27 253.18
12.368 252.81
12.356 252.68
12.338 253.21
12.368 252.3
12.338 252.78
12.27 252.33
12.169 252.96
12.283 252.43
12.451 252.63
12.69 251.52
13.224 251.09
13.831 249.76
VN5 = T5.Properties.VariableNames;
idx5 = clusterdata(table2array(T5), 'MaxClust',10, 'Criterion','distance');
[Uidx,~,uix] = unique(idx5);
ClustCounts = accumarray(idx5, 1);
T5_ClusterCounts = table(Uidx, ClustCounts)
T5_ClusterCounts = 10×2 table
Uidx ClustCounts
____ ___________
1 2
2 7411
3 1
4 64
5 1
6 1
7 1
8 3
9 2
10 1
[MaxCounts,MaxCidx] = max((T5_ClusterCounts{:,2}))
MaxCounts = 7411
MaxCidx = 2
% figure
% scatter(T5{:,1}, T5{:,2}, 10, idx5, '.')
% grid
% colormap(turbo(numel(unique(idx5))))
% colorbar
% xlabel(VN5{1})
% ylabel(strrep(VN5{2},'_','\_'))
% title('dataTable5')
Lv5 = idx5 == MaxCidx;
mdl5 = fitlm([T5{Lv5,1}.^2 T5{Lv5,1}], T5{Lv5,2}) % 'x1' = Temperature^2, 'x2' = Temperature
mdl5 =
Linear regression model:
y ~ 1 + x1 + x2
Estimated Coefficients:
Estimate SE tStat pValue
__________ _________ _______ ___________
(Intercept) 276.74 0.48267 573.35 0
x1 -0.0035276 0.0015207 -2.3197 0.020385
x2 -1.6735 0.055514 -30.145 2.0505e-188
Number of observations: 7411, Error degrees of freedom: 7408
Root Mean Squared Error: 8.03
R-squared: 0.699, Adjusted R-Squared: 0.699
F-statistic vs. constant model: 8.6e+03, p-value = 0
xv5 = linspace(min(T5{Lv5,1}), max(T5{Lv5,1}), numel(T5{Lv5,1})).';
[y5,yci5] = predict(mdl5, [xv5.^2 xv5]);
figure
hs = scatter(T5{Lv5,1}, T5{Lv5,2}, 10, '.', 'DisplayName','Data');
hold on
hp1 = plot(xv5, y5, '-r', 'DisplayName','Linear Regression');
hp2 = plot(xv5, yci5, '--r', 'DisplayName','95% Confidence Intervals');
hold off
grid
xlabel(VN5{1})
ylabel(strrep(VN5{2},'_','\_'))
title('dataTable5')
legend([hs,hp1,hp2(1)], 'Location','best')

It would be better to actually model the dynamics of the sensor if you have a model for it. I suspect that the reactions at the catalyst surface are simple first-order kinetics.
Star Strider
7 December 2023
Just done it
Star Strider
7 December 2023
Thank you!
Additional Answers (0)