이 질문을 팔로우합니다.
- 팔로우하는 게시물 피드에서 업데이트를 확인할 수 있습니다.
- 정보 수신 기본 설정에 따라 이메일을 받을 수 있습니다.
How to remove outliers from exponentially weighted moving mean in real-time?
조회 수: 2 (최근 30일)
이전 댓글 표시
Cai Chin
2020년 11월 9일
I am using MATLAB R2020a on a MacOS. I am calculating an exponentially weighted moving mean using the dsp.MovingAverage function and am trying to remove vector elements in real-time based on 2 conditions - if the new element causes the mean to exceed 1.5 times the 'overall' mean so far, or if it is below 0.5 times the 'overall' mean so far.
In other words, the weighted mean with the current element is compared to the previous weighted mean, and if the current element causes the weighted mean to increase above 1.5 times the previous mean or go below 0.5 times the previous mean, then it should be ignored and the recursive equation is instead applied to the next element, and so on. In the end, I'd like to have a vector containing the outliers removed.
This is the function I am using to calculate the exponentially weighted moving mean:
movavgExp = dsp.MovingAverage('Method', 'Exponential weighting', 'ForgettingFactor', 0.4);
mean_cycle_period_exp = movavgExp(cycle_period_step_change);
I would very much appreciate any suggestions on how to tackle this, thanks in advance!
댓글 수: 1
riccardo
2020년 11월 9일
I haven't had time to check the calcs, but the error is likely due to a zero-indexing conditions in the loop:
>> for i = 2:lenght....
.....
if x(i) > 1.5*(1 - 1/w(i - 1))* >>x(i - 2)<< + (1/w(i - 1))*x(i - 1)
i-2 = 0 at the start
채택된 답변
riccardo
2020년 11월 9일
I am not sure what you ar eaiming at, here:
>>
% Calculate moving mean with weights manually
x = zeros(length(cycle_period_step_change), 1);
x(1) = 2;
>>
"x" is an array of zeros, with a leading "2".
The plot you've shown is just the consequence of it, with a smooth transition from 2 -> 0, due to the averaging. Am I missing something ?
댓글 수: 24
Cai Chin
2020년 11월 9일
Hi riccardo, sorry for the confusion, I now realise my mistake, the algorithm I used was never being applied to my dataset ('cycle_period_step_change'). Essentially, I was trying to manipulate the algorithm used by the dsp.MovingAverage function to exclude values I deem outliers on an element-by-element basis. However, I am unsure as to how to do this. Any suggestions would be greatly appreciated! Thanks
riccardo
2020년 11월 10일
Removing data points shouldn't be an issue, while keeping a list of removed ones may slow down things.
E.G., in pseudocode, you could use a while loop - say:
<initialise stuff for i==1>
i=2;
while i<= max,
<calculate new candidate moving average and check limits>
if <limits OK>
<moving average> = <candidate>
else
<outlier> => <moving average unchanged>
<add outlier data and index to list of removed ones>
end
i = 1+1
end
If you have plenty of memory, you could pre-allocate (with some nonsensical data) an array of outliers of the size of your data and assign every outlier to the corresponding element, avoiding the need to build a dynamic array.
Cai Chin
2020년 11월 10일
Hi, I have tried to incorporate this into my code, but the vector 'x' now just returns zeros for all the elements except the first since this is prespecified. However, I only wanted to replace the outlier values with zeros so as not to change the size of the matrix in the for loop and then remove them afterwards. At the moment, it just seems to be assigning zero to all the elements in the vector after the first element. How can I solve this? Thanks!
% Calculate moving mean with weights manually
x = zeros(length(cycle_periods), 1);
x(1) = cycle_periods(1);
for i = 2:length(cycle_periods)
x(i) = (1 - 1/w(i))*x(i - 1) + (1/w(i))*cycle_periods(i);
if x(i) < 1.5*(x(i - 1)) && x(i) > 0.5*(x(i - 1)) % Identify high and low outliers
x(i) = x(i);
else
x(i) = 0;
end
end
riccardo
2020년 11월 10일
You should be careful with initialising your moving average array, particularly given the constraints you are imposing later.
I would do 3 things :
1) x = cycle_periods; %% why not ? unless you are sure that your moving average is ~ 0
2) the candidate "new" value not straight into x:
mavCnd = = (1 - 1/w(i))*x(i - 1) + (1/w(i))*cycle_periods(i);
3) the "if" condition when the check fails must keep the last computed "good" value, not 0 - otherwise you're again flattening everything
if <OK>
x(i) = mavCnd;
else
x(i) = x(i-1);
end
Cai Chin
2020년 11월 10일
Hi, thanks again for your suggestion, it makes total sense. However, when I try and implement it, the length of mavCand and x is still 89. I thought that it would remove the outliers and then just move on to the next input element and compare it to the previously accepted one. Also if there are 2 outliers in a row, will the 'x(i) = x(i - 1)' still work? Thanks again
x = cycle_periods;
for i = 2:length(cycle_periods)
mavCnd(i) = (1 - 1/w(i))*x(i - 1) + (1/w(i))*cycle_periods(i);
if mavCnd(i) < 1.5*(x(i - 1)) && mavCnd(i) > 0.5*(x(i - 1)) % Identify high and low outliers
x(i) = mavCnd(i);
else
x(i) = x(i - 1);
end
end
riccardo
2020년 11월 10일
편집: riccardo
2020년 11월 10일
Actually "mavCnd" was meant to be a scalar - no need for an array of temporary values (unless you wanted those as well).
As to the result, in case of outliers being removed this code would keep the moving average constant, even in the case of 2-3 successive values.
This would create plateaus of constant values, corresponding to the removed outliers.
From what you say, it seems you actually want to keep only the moving average points corresponding to acceptable data.
You could either post-process the moving average and "pluck out" the unwanted values (using the indexes of the removed outliers) or use a "while" cycle with a separate index "j" for the moving average array and discard the "tail" of leftovers once the cycle ends.
Just be careful that your time-axis must be sync'd - presumably.
i=2;
j=2;
while i<=Max
%%<code here> .....
if <OK>
%% <mavCnd is fine -> update x>
x(j) = mavCnd;
time(j) = <time of point "i">;
%% <next available moving average slot>
j = j+1;
else
%% <outlier>
%% <save outlier info out>
%% <ignore x and j>
end
i=i+1
end
Cai Chin
2020년 11월 10일
Hi thanks again for this suggestion. However, I don't quite understand the 'j' index. This is what I tried but I don't understand how to remove the outliers (which I do not require stored in a separate array). As for the time point, I'm only interested in the cycle period number, as in the index number of a particular cycle. Could you please explain how I would implement your suggestion? Thanks again,
x = cycle_periods;
i = 2;
j = 2;
while i <= length(cycle_periods)
mavCnd = (1 - 1/w(i))*x(i - 1) + (1/w(i))*cycle_periods(i);
if mavCnd < 1.5*(x(i - 1)) && mavCnd > 0.5*(x(i - 1)) % Identify high and low outliers
x(j) = mavCnd;
time(j) = i;
j = j + 1;
else
x(i) = x(i - 1);
end
i = i + 1;
end
riccardo
2020년 11월 10일
"j" is a running index on the x elements which are assigned a valid moving average value and corresponding time cycle.
This should be what you need, then, if you don't keep record of the outliers:
x = cycle_periods;
i = 2;
j = 2;
while i <= length(cycle_periods)
mavCnd = (1 - 1/w(i))*x(i - 1) + (1/w(i))*cycle_periods(i);
if mavCnd < 1.5*(x(i - 1)) && mavCnd > 0.5*(x(i - 1)) % Identify high and low outliers
x(j) = mavCnd;
time(j) = i;
j = j + 1;
end
i = i + 1;
end
% scrap the tail
x(j:end)=[];
% now plot( time, x ) should show yor moving average array
% also:
% "idxCycle = [1:length(cycle_periods)];"
% "cycle_periods( time )" is the list of valid points
% "cycle_periods( idxCycle( ~ismember( idxCycle, time ) ) )" is the list of outliers taken out
Cai Chin
2020년 11월 10일
Hi, thanks so much again for explaining. However, when I implemented the code, I encountered 2 issues:
- The number of elements in the 'time' variable is not equal to that in the 'x' vector so the graph is not produced. When I use a 'pseudo' time variable with the same number of elements as x that simply counts up in equal increments, there are visible outliers still remaining in the weighted mean.
- x still seems to have outliers in it even when the code to remove them is applied. For instance, this is the vector I get when I have implemented the code. I have underlined the values that seem erroneous
x = [0.0040, 1.8060, 0.5871, 0.8476, 0.9794, 1.8796, 2.5095, 0.6071, 0.7588, 0.7499, 0.6041, 0.6300, 0.7719, 0.7576, 0.7688, 0.6300, 0.6512, 0.8080, 0.7752, 0.7614, 0.6270, 0.7952, 0.7688, 0.7608, 0.6249, 0.7936, 0.7792, 0.7618, 0.616, 0.6496, 0.6882, 1.1490, 2.0880, 0.6568, 0.6522, 0.8050, 0.7860, 0.6198, 0.7468, 0.7466, 0.7742, 0.6496, 0.6688, 0.6672, 0.6632, 3.1392, 0.6720, 0.8366, 0.8098, 0.8052, 0.8094, 0.6592, 0.6616, 0.6672, 0.8186, 0.7906, 0.7816, 0.6282, 0.6402, 0.6514, 0.7918]
I just can't seem to work out what the problem is, thanks again!
riccardo
2020년 11월 11일
(1) apologies, I thought it was obvious that in the case time was initialised to the length of cycle_periods, then the same tail-cutter as with x should be used: the first j-1 elements are what you need.
(2) I just spotted that you use "i" as index on "x" - you should use "j" for "x" - always - as the last acceptable moving average is stored in "x(j-1)", not "x(i-1)", and "j" gets incremented at every new value stored
Cai Chin
2020년 11월 11일
Hi, thanks again for all you help. I have implemented the changes you suggested but now it gives a very strange distribution.
I'm also still a little confused over what the difference between what 'x' and 'j' is.
Also, by assigning 'cycle_periods' to x, has that affected the algorithm used to calculate the moving mean? The algorithm is this:
wN,λ = λwN−1,λ + 1
‾xN,λ =( 1−1wN,λ)‾xN−1,λ + (1wN,λ)xN
- ‾xN,λ — Moving average at the current sample
- xN — Current data input sample
- ‾xN−1,λ — Moving average at the previous sample
- λ — Forgetting factor
- wN,λ — Weighting factor applied to the current data sample
- (1−1wN,λ)‾xN−1,λ — Effect of the previous data on the average
The vector 'x' now seems to contain a lot of the first element of 'cycle_periods' as shown below:
x = [0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0050, 0.0040, 0.0040, 0.0050, 0.0040, 0.0050, 0.0030, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0050, 0.0050, 0.0050, 0.7830, 0.0040, 0.8010, 0.7720, 0.7560, 0.7800, 0.0050, 0.7960, 0.7840, 0.7600, 0.7690, 0.0040, 0.8110, 0.0040, 0.8590, 0.0050, 0.8350, 2.4050, 0.8200, 0.0040, 0.8140, 0.0050, 0.8090, 0.7890, 0.7740, 0.0030, 0.7480, 0.7420, 0.7650, 0.8110, 0.0040, 0.8350, 0.0040, 0.8330, 0.0040, 0.8280, 0.0040, 3.9230, 0.0040, 0.8390, 0.0040, 0.8430, 0.8110, 0.8050, 0.8060, 0.8230, 0.0040, 0.8260, 0.0040, 0.8330, 0.0040, 0.8250, 0.7930, 0.7810, 0.7840, 0.0050, 0.7990, 0.0050, 0.8130, 0.0050, 0.7910, 0.7950]
Thanks again
riccardo
2020년 11월 11일
"x" is your moving average array, "j" is the index of the element of x being updated, while "i" is the index running through the input data array.
the problem you see is due to the fact that your data sequence starts with a very small value (probabaly the "0.004") and initialising x(1) as this, your criterion to accept "new data" kills everything, essentially - because 1.5*x(1) = 0.006 and 0.5*x(1) = 0.002 -> hence you can accept new points is they fall within [0.002, 0.006].
Theis issue isn't going to go away: if the input data point is 0.9 it will be accepted if the current moving average is between 0.6 (0.9/1.5) and 1.8 (0.9/0.5), etc.
if your data "picks up after a bit", so to speak, you could try to leave a grace period of enough samples to "load" the initial part of the moving average array sensibly, without discarding anything, and then start discarding.
The criterion to choose an outlier to discard should be defined based on the data and the objective of the analysis.
Cai Chin
2020년 11월 11일
Hi riccardo, that definitely makes sense. However, why does "x" have so many repeated values of '0.0040' which is the first element of 'cycle_periods'? Shouldn't it contain all the accepted moving average values? Thanks
riccardo
2020년 11월 12일
if you use :
>>
..................................
else
x(i) = x(i - 1);
end
<<
that explains it
Cai Chin
2020년 11월 12일
Hi, thanks again. I have executed the code again using your suggestion but again, the
% Calculate exponentially weighted moving mean manually whilst removing
% outliers in real-time
x = zeros(length(cycle_periods),1); % array for exponentially weighted means
x (1) = cycle_periods(1);
i = 2; % index for counting through input signal
j = 2; % index for counting through accepted exponential mean values
while i <= length(cycle_periods)
mavCnd = (1 - 1/w(j))*x(j - 1) + (1/w(j))*cycle_periods(i);
if mavCnd < 1.5*(x(j - 1)) && mavCnd > 0.5*(x(j - 1)) % Identify high and low outliers
x(j) = mavCnd;
cycle_index(j) = i;
j = j + 1;
else
x(i) = x(i - 1);
end
i = i + 1;
end
% Scrap the tail
x(j - 1:end)=[];
x = [0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0042, 0.0042, 0.0041, 0.0043, 0.0043, 0.0044, 0.0041, 0.0041, 0.0041, 0.0041, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0040, 0.0042, 0.0044]
I just can't seem to understand why there are so many repeats of the first element..
Thanks again
riccardo
2020년 11월 12일
Apologies, I was a bit cryptic: REMOVE the following lines
>>
else
x(i) = x(i - 1);
>>
otherwise, everytime the criterion is not met, the x(i-1) element is "copied forward" to the x(i) element.
Cai Chin
2020년 11월 14일
Hi riccardo, thank you very much for all you help, the code now works as I had wanted it to
Cai Chin
2020년 11월 19일
Hi riccardo, I was wondering if you wouldn't mind helping me with another issue I am having? Instead of removing the outliers, I would like them to be replaced by the mean of the framing 2 accepted values (1 from the past and one from the future). I have tried to do this retrospectively by creating a new array whereby the unaccepted values are added as 'zero's which are then replaced by the next new accepted value comes in. If the previous value is zero, this is now replaced by the framing value mean. However, this does not account for the fact that there may be multiple unacceptable values in a row - it only accouunts for the zero of the previous value but not others before it. Any advice would be greatly appreciated, thanks!
% Calculate exponentially weighted moving mean and tau without outliers
accepted_means = zeros(length(cycle_periods_filtered),1); % array for accepted exponentially weighted means
accepted_means(1) = cycle_periods_filtered(1);
k = zeros(length(cycle_periods_filtered),1); % array for accepted raw cycle periods
m = zeros(length(cycle_periods_filtered), 1); % array for raw periods for all cycles with outliers replaced by mean of framing values
k(1) = cycle_periods_filtered(1);
m(1) = cycle_periods_filtered(1);
tau = m/3; % pre-allocation for efficiency
i = 2; % index for counting through input signal
j = 2; % index for counting through accepted exponential mean values
n = 2; % index for counting through raw periods of all cycles
while i <= length(cycle_periods_filtered)
mavCurrent = (1 - 1/w(j))*accepted_means(j - 1) + (1/w(j))*cycle_periods_filtered(i);
if cycle_periods_filtered(i) < 1.5*(accepted_means(j - 1)) && cycle_periods_filtered(i) > 0.5*(accepted_means(j - 1)) % Identify high and low outliers
accepted_means(j) = mavCurrent;
k(j) = cycle_periods_filtered(i);
m(n) = cycle_periods_filtered(i);
cycle_index3(j) = i;
tau(n) = m(n)/3;
if m(n - 1) == 0
m(n - 1) = (k(j) + k(j - 1))/2;
tau(n - 1) = m(n)/3;
end
j = j + 1;
n = n + 1;
else
m(n) = 0;
n = n + 1;
end
i = i + 1;
end
riccardo
2020년 11월 20일
I would advise against doing this "on the spot", i.e. as the raw input is processed.
If you wanted to replace all outliers with some values derived from accepted average values, you should do that as a post-process step.
(1) get the weighted average + arrays of indexes into the original raw data of - outliers - accepted average values;
(2) replace the outlier elements with corresponding "local averages" - note that a simple mean value between the values surrounding some outlier(s) (as per your example) may not be the best choice, in this case (say you have N consecutive outliers, replacing all of them with the mean value of the 2 averages around the N is pretty arbitrary and may risk introducing unwanted artifacts).
Cai Chin
2020년 11월 20일
Hi riccardo, thanks for your suggestion. Ideally, I would like it to be done "on the spot"; is there any way of counting through all N "zeros" and replacing them with the mean of the accepted values on either side of "N", instead of just replacing the zero immediately before the accepted value?
Otherwise, do you happen to have a suggestion as to how to do this without introducing unwanted artifacts as you say?
Thanks again!
riccardo
2020년 11월 23일
You should be clear with what your objective is.
Getting the raw data and obtaining the moving average array is data analysis/signal processing, because you do not "change" the data, although you may discard a subset of it. Here you look at the data to understand its behaviour.
Replacing any data point is not "analysis" any longer, it's "modeling": in this case you must be careful to avoid any unwanted artifacts/side-effects caused (potentially) when you introduce a "different" subset, in place of the outliers.
Unless you have a specific and well based reason (say a causal relationship, like in a first principle equation) to say "this data point should be corrected in this way", any other type of model can only be statistical in nature, that's why a post-processing approach would be safer (e.g. if you had reason to believe the data should be continuous/smooth/etc., you could apply a certain type of interpolator/spline to the moving average result array, in order to produce the "missing points" with the desired property).
Cai Chin
2020년 11월 23일
Hi riccardo, thanks again for your advice. Is there any way of doing this interpolation on a real-time basis such that, if the condition is not met, the abnormal sample is replaced by a value that would fit in based on the exponentially weighted moving mean trajectory of the previous samples? i.e. the unacceptable sample is replaced by a value that would uphold the 'smooth' trajectory as soon as it is discovered rather than post hoc. Thanks!
riccardo
2020년 11월 24일
In RT you cannot interpolate, but you could try extrapolating a "candidate outlier" point using the gradient based on the last 2 accepted average values.
The problem arise when you find several adjacent/consecutive outliers: in this case the latest 2 acceptable values may be "too far back".
추가 답변 (0개)
참고 항목
카테고리
Help Center 및 File Exchange에서 Descriptive Statistics에 대해 자세히 알아보기
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!오류 발생
페이지가 변경되었기 때문에 동작을 완료할 수 없습니다. 업데이트된 상태를 보려면 페이지를 다시 불러오십시오.
웹사이트 선택
번역된 콘텐츠를 보고 지역별 이벤트와 혜택을 살펴보려면 웹사이트를 선택하십시오. 현재 계신 지역에 따라 다음 웹사이트를 권장합니다:
또한 다음 목록에서 웹사이트를 선택하실 수도 있습니다.
사이트 성능 최적화 방법
최고의 사이트 성능을 위해 중국 사이트(중국어 또는 영어)를 선택하십시오. 현재 계신 지역에서는 다른 국가의 MathWorks 사이트 방문이 최적화되지 않았습니다.
미주
- América Latina (Español)
- Canada (English)
- United States (English)
유럽
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom(English)
아시아 태평양
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)