Removing outliers from the data creates gaps. Filling these gaps with missing values or the median of surrounding values does not address the issue. Why?

I am analyzing EMG data in windows. In each window, I apply z-score normalization to identify and remove outliers. To address the gaps created by removing these outliers, I attempt to fill the empty spaces with the median of the surrounding values. Additionally, I have experimented with MATLAB built-in functions such as 'movmedian' for this purpose.
Here is my function:
function data_clean = remove_outliers_and_fill(data)
    % Calculate z-scores for each column
    z_scores = zscore(data);
    % Define outlier threshold
    threshold = 3;
    % Identify outliers
    outliers = abs(z_scores) > threshold;
    % Copy data to preserve original shape
    data_clean = data;
    % Loop through each column
    [num_rows, num_cols] = size(data);
    for col = 1:num_cols
        for row = 1:num_rows
            if outliers(row, col)
                range_start = max(1, row-10);
                range_end = min(num_rows, row+10);
                neighbors = data(range_start:range_end, col);
                % Exclude the outlier from median calculation
                filtered_neighbors = neighbors(neighbors ~= data(row, col));
                median_value = median(filtered_neighbors);
                data_clean(row, col) = median_value;
            end
        end
    end
end
Here is the plot, which shows the gaps that appear after applying the function above.

Accepted Answer

Star Strider on 17 Jun 2024
Your release is not stated; however, the filloutliers function has been available since R2017a. With 'mean' as the ‘findmethod’, it flags as outliers anything more than 3 standard deviations from the mean (equivalent to your ‘zscore’ test). With 'median' (which I use here), it flags anything more than 3 scaled median absolute deviations from the median, a criterion that is less influenced by the outliers themselves. See the filloutliers documentation for details.
If you have R2017a or a later release, try this:
V1 = readmatrix('data.csv');
L = numel(V1);
X1 = linspace(0, L-1, L);
figure
plot(X1, V1)
grid
xlim([4300 5000])
title('Original')
% Outputs renamed so the thresholds do not overwrite the length variable L
[B, TF, Lthr, Uthr, Ctr] = filloutliers(V1, 'linear', 'median');
figure
plot(X1, V1, 'DisplayName','Original Data')
hold on
plot(X1, B, '-r', 'DisplayName','Outliers Filled (Linear Interpolation)')
hold off
grid
xlim([4300 5000])
legend('Location','best')
title('Filled Outliers')
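For anyone without the attached data.csv, the same idea can be tried on synthetic data. This is only a sketch: the noise level, the position of the artificial outlier run, and all variable names are assumptions made for illustration, not values taken from the dataset in this thread. It compares the 'median' and 'mean' detection rules and checks that linear filling leaves no NaN values behind.

```matlab
% Synthetic stand-in for one EMG window (an assumption, not the thread's data)
rng(0);                                  % reproducible noise
x = 1e-5 * randn(5000, 1);
x(4700:4720) = 5e-4;                     % run of identical, saturated-looking samples

% Detect with the 'median' rule (scaled MAD) and the 'mean' rule (standard deviations),
% filling every flagged sample by linear interpolation
[xMed,  tfMed]  = filloutliers(x, 'linear', 'median');
[xMean, tfMean] = filloutliers(x, 'linear', 'mean');

fprintf('Flagged by median rule: %d, by mean rule: %d\n', nnz(tfMed), nnz(tfMean));
fprintf('NaN values after filling: %d\n', nnz(isnan(xMed)));
```

The 'median' rule will usually flag a few ordinary noise samples in addition to the artificial run, because its threshold is not inflated by the outliers.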
  Comments: 2
Seemab on 18 Jun 2024
Moved: Star Strider on 18 Jun 2024
Thank you, @Star Strider. This is a good approach for removing outliers from the temporal data.
In my case it removes the outliers from the temporal data, but the frequency content remains affected at the outlier locations.
I am looking for something which can remove the outliers from both time and frequency.
Star Strider on 18 Jun 2024
As always, my pleasure!
I am not certain what you intend by ‘I am looking for something which can remove the outliers from both time and frequency.’ If you want to remove the outliers rather than fill them by interpolating them, you can use the rmoutliers function. I do not usually suggest that because it disrupts the integrity of the data.
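To illustrate the caution about rmoutliers, here is a small sketch on synthetic data (the signal and the outlier positions are assumptions chosen for the example). Removing samples shortens the vector, so the remaining samples no longer sit on the original time grid, while filling keeps the length unchanged.

```matlab
% Synthetic signal with a short run of artificial outliers (illustrative values)
rng(1);
x = 1e-5 * randn(1000, 1);
x(500:509) = 5e-4;

% Remove outliers: the output is shorter than the input
[xRemoved, tfRemoved] = rmoutliers(x, 'mean');

% Fill outliers instead: the output keeps the original length
xFilled = filloutliers(x, 'linear', 'mean');

fprintf('Original: %d, after rmoutliers: %d, after filloutliers: %d\n', ...
    numel(x), numel(xRemoved), numel(xFilled));
```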
If you want to remove specific frequencies from your data, use the Signal Processing Toolbox to create frequency-selective filters. There are several filtering options, and I can help you design and implement the filters.
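As a rough sketch of what such a filter could look like, the example below designs a band-pass filter with designfilt and applies it with filtfilt (both from the Signal Processing Toolbox). The sampling rate of 1000 Hz, the 20 to 450 Hz pass band, and the test signal are all assumptions for illustration; none of them comes from the data in this thread.

```matlab
fs = 1000;                               % assumed sampling rate in Hz
t = (0:4999)' / fs;                      % uniformly spaced time vector
target = sin(2*pi*100*t);                % in-band component to keep
x = target + 2*sin(2*pi*2*t);            % plus a slow 2 Hz drift to suppress

% Butterworth band-pass; the band edges are an illustrative choice
bpFilt = designfilt('bandpassiir', 'FilterOrder', 4, ...
    'HalfPowerFrequency1', 20, 'HalfPowerFrequency2', 450, ...
    'DesignMethod', 'butter', 'SampleRate', fs);

y = filtfilt(bpFilt, x);                 % zero-phase filtering, no time shift

% Compare the RMS deviation from the in-band component before and after filtering
errBefore = sqrt(mean((x - target).^2));
errAfter  = sqrt(mean((y - target).^2));
fprintf('RMS error before: %.3g, after: %.3g\n', errBefore, errAfter);
```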
One caution, however, is that it will be necessary to have a matching vector of sampling times for each dependent variable data element before you do any processing of the data. The reason is that the sampling times provide the frequency information and the regularity of the samples themselves. For optimal performance, the sampling frequency must be constant, and the sampling intervals consistent from sample to sample. If that is not the situation for your data, there is a function (resample) that can regularise the sampling frequency (and interpolate the dependent variable data) to provide that. At that point, you can use various filters. Again, I can help you design and implement them.
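A minimal sketch of that regularisation step, assuming a hypothetical signal whose sample times are jittered around a nominal 1000 Hz grid (the rate, the jitter size, and the variable names are made up for the example). It uses the resample syntax that accepts a time vector and a target rate.

```matlab
rng(2);
fsTarget = 1000;                                 % assumed target rate in Hz
n = 2000;
jitter = (rand(n, 1) - 0.5) * 4e-4;              % bounded jitter keeps times increasing
tIrregular = (0:n-1)' / fsTarget + jitter;       % non-uniform sample times in seconds
xIrregular = sin(2*pi*10*tIrregular);            % 10 Hz test signal at those times

% Interpolate onto a uniform grid at fsTarget
[xUniform, tUniform] = resample(xIrregular, tIrregular, fsTarget);

dt = diff(tUniform);
fprintf('Largest deviation from 1/fs: %g s\n', max(abs(dt - 1/fsTarget)));
```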


More Answers (1)

Nipun on 17 Jun 2024
Hi Seemab,
I understand that you want to remove outliers from your EMG data, fill the gaps with the median of the surrounding values, and avoid gaps in the resulting data. The gaps might be due to not considering edge cases correctly or the outlier removal leaving isolated data points.
Here's an improved version of your function to address the gaps:
  1. Use movmedian to smooth the data after outlier removal.
  2. Ensure the median replacement does not create new outliers.
function data_clean = remove_outliers_and_fill(data)
    % Calculate z-scores for each column
    z_scores = zscore(data);
    % Define outlier threshold
    threshold = 3;
    % Identify outliers
    outliers = abs(z_scores) > threshold;
    % Copy data to preserve original shape
    data_clean = data;
    % Loop through each column
    [num_rows, num_cols] = size(data);
    for col = 1:num_cols
        for row = 1:num_rows
            if outliers(row, col)
                range_start = max(1, row-10);
                range_end = min(num_rows, row+10);
                neighbors = data(range_start:range_end, col);
                % Exclude the outlier from median calculation
                filtered_neighbors = neighbors(neighbors ~= data(row, col));
                median_value = median(filtered_neighbors);
                data_clean(row, col) = median_value;
            end
        end
    end
    % Use movmedian to smooth the data after filling
    window_size = 5; % Adjust window size as needed
    for col = 1:num_cols
        data_clean(:, col) = movmedian(data_clean(:, col), window_size);
    end
end
Example Usage
% Sample data (replace with actual EMG data)
data = randn(5000, 1) * 1e-5;
% Add some artificial outliers for testing
data(4700:4720) = 3e-5;
% Clean the data
data_clean = remove_outliers_and_fill(data);
% Plot original and cleaned data
figure;
subplot(2,1,1);
plot(data);
title('Original Data');
xlabel('Time (windows)');
ylabel('Amplitude');
subplot(2,1,2);
plot(data_clean);
title('Cleaned Data');
xlabel('Time (windows)');
ylabel('Amplitude');
For more information on the movmedian function, refer to the MathWorks documentation: https://www.mathworks.com/help/matlab/ref/movmedian.html
Hope this helps.
Regards,
Nipun
  Comments: 1
Seemab on 17 Jun 2024
Edited: Seemab on 17 Jun 2024
The function you provided is still leaving gaps in my data.
I have attached the dataset to this comment.
I used the following function. It works, but I am not sure it is a good idea to take the median of the whole row.
function data_clean = remove_outliers_and_fill(data)
    % Ensure the data has only one row and 5000 columns
    [~, num_cols] = size(data);
    % if num_rows ~= 1 || num_cols ~= 5000
    %     error('The input data must be 1 row and 5000 columns.');
    % end
    %
    % Calculate z-scores for the row
    z_scores = zscore(data);
    % Define outlier threshold
    threshold = 3;
    % Identify outliers
    outliers = abs(z_scores) > threshold;
    % Copy data to preserve original shape
    data_clean = data;
    % Replace outliers with the median of the row excluding outliers
    if any(outliers)
        % Exclude the outliers from the median calculation
        filtered_data = data(~outliers);
        median_value = median(filtered_data);
        % Replace outliers with the median value
        data_clean(outliers) = median_value;
    end
    window_size = 100; % Adjust window size as needed
    for col = 1:num_cols
        data_clean(:, col) = movmedian(data_clean(:, col), window_size);
    end
end
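One possible explanation for the gaps in the earlier versions, offered as an assumption rather than a confirmed diagnosis of the attached data: when an outlier sits inside a run of identical values that is at least as wide as the neighborhood, the test neighbors ~= data(row, col) removes every neighbor, and median of an empty array returns NaN, which plot draws as a gap. The sketch below first reproduces that effect, then shows a local-median alternative to the whole-row median that masks by the outlier flags and uses the 'omitnan' option. The synthetic data and the window length of 101 samples are illustrative choices.

```matlab
% Part 1: the median of an empty selection is NaN
runOfEqualOutliers = 3e-5 * ones(21, 1);         % 21 identical neighbors
filtered = runOfEqualOutliers(runOfEqualOutliers ~= runOfEqualOutliers(11));
emptyMedian = median(filtered);                  % nothing is left, so this is NaN
disp(emptyMedian)

% Part 2: local median fill that excludes all flagged samples
rng(3);
data = 1e-5 * randn(5000, 1);
data(4700:4720) = 5e-4;                          % artificial run of outliers
isOut = isoutlier(data, 'mean');                 % 3 standard deviations, like the z-score test

clean = data;
clean(isOut) = NaN;                              % hide outliers from the median
localMed = movmedian(clean, 101, 'omitnan');     % window wider than the outlier run
clean(isOut) = localMed(isOut);                  % replace only the flagged samples

fprintf('NaN values remaining: %d\n', nnz(isnan(clean)));
```

The window has to be wider than the longest run of outliers; otherwise a window can contain only NaN values and the result at that position is again NaN.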
