minimum distances of data points from Fitted curve and Gaussian Mixture Model Clustering... not working.
이전 댓글 표시
Hi everybody I would expect minimum distances of points from the curve follow a gaussian trend (as my professor aims) but I can't get a good graph at the end of the Gaussian Mixture Model..
May you have a check to my code? Do you see something wrong?
If you would be available to run my script I leave you the data.txt file and the matrix I use in the script at the following link: https://mega.nz/folder/goRHHLgY#fwnNkiLuFj-oZdaSPZmJ2Q
flowdata = readtable("flodata.txt");
% column 1: pressure at pipe start [bars]
% column 2: acoustic noise at start [standard deviation]
% column 3: pressure at pipe end (10km from start) [bars]
% column 4: acoustic noise at pipe end[standard deviation]
% column 5: flow rate [m3/hour]
load('flowdata.mat');
% summary(flowdata)
% Vectors of Pressures
pressureIN = flowdata.pressureAtPipeStart_bars_;
pressureOUT= flowdata.pressureAtPipeEnd_10kmFromStart__bars_;
% Vectors of Acoustic Noises at IN and at OUT
aNoiseIN=flowdata.acousticNoiseAtStart_standardDeviation_;
aNoiseOUT=flowdata.acousticNoiseAtPipeEnd_standardDeviation_;
% Vector of flowrate
flowrate = flowdata.flowRate_m3_hour_;
% % Table of flowRate
% flowRate = table(flowrate);
% rawData = [pressureIN aNoiseIN pressureOUT aNoiseOUT flowrate];
sample_distance = 10;
%t = (linspace(0, 7948790, 794879))/86400;
[samples,channels] = size(flowdata);
PHeadLoss = pressureIN-pressureOUT;
t = linspace(0, samples*sample_distance, samples)/86400;
figure()
plot(flowrate, PHeadLoss, '.')
densityplot(flowrate, PHeadLoss, [50, 50])
% histogram on raw data
hist(flowrate, 5000)
hold on
%hist(flowrate + 0.1*randn(numel(flowrate), 1), 5000) % leggera perturbazione rispetto ai valori originali
hist(PHeadLoss, 5000)
%flowrate
% all main curves overlaid
figure()
plot(t, pressureIN, '-g',t, pressureOUT, '-r',t, flowrate, '-b');
axis tight;
title(' Input , Output Pressure , FR vs time BEFORE EVERY CLEANING');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
rawData = [pressureIN pressureOUT flowrate];
% Removing the NaN values
% Creo un array di indici logici che indica quali righe contengono almeno un NaN
idx_nan = any(isnan(rawData), 2);
cleanData = rawData;
% Elimino le righe selezionate
cleanData(idx_nan, :) = [];
[samples,channels] = size(cleanData);
t = linspace(0, samples*sample_distance, samples)/86400;
% Vectors of Pressures
pressureIN = cleanData(:,1);
pressureOUT= cleanData(:,2);
flowrate = cleanData(:,3);
PHeadLoss = pressureIN - pressureOUT;
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
hold on
% After having removed NaN
densityplot(flowrate, PHeadLoss, [50,50]);
% Now I remove the negative values of pressures and flowrate (No physical
% meaning)
% Creo un array di indici logici che indica quali righe hanno almeno un valore negativo
idx_neg = any(cleanData < 0, 2);
% Elimino le righe selezionate
cleanData(idx_neg, :) = [];
[samples,channels] = size(cleanData);
t = linspace(0, samples*sample_distance, samples)/86400;
pressureIN = cleanData(:,1);
pressureOUT= cleanData(:,2);
flowrate = cleanData(:,3);
PHeadLoss = pressureIN - pressureOUT;
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
%hold on
densityplot(flowrate, PHeadLoss, [50,50]);
sample_distance = 10;
%t = (linspace(0, 7948790, 794879))/86400;
[samples,channels] = size(cleanData);
t = linspace(0, samples*sample_distance, samples)/86400;
% all main curves overlaid
figure()
plot(t, pressureIN, '-g',t, pressureOUT, '-r',t, flowrate, '-b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
% First hypothesis
% large variations on the output pressure can be seen at days 1 to 15 ,
% not related to input pression
% we should not use that portion of the data
% It's a brutal cleaning but...
% I am sceptical about what happens during the first 15 days :
% lots of peaks are seen on the output pressure side that are
% not reflected / correlated to the input side ,
% so I prefer to discard the first 20 days (yes it's brutal !)
ind = t>13;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
PHeadLoss = pressureIN-pressureOUT;
figure()
plot(t, pressureIN, '-g',t, pressureOUT, '-r',t, flowrate, '-b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
densityplot(flowrate, PHeadLoss, [50,50]);
% Second Hypothesis
% keep data (valid) for gradient below a certain threshold (and non zero
% FR too btw)
% i also don't like those sudden drops to zero ,
% because to me its like dividing zero by zero ,
% when you want to make a relationship between two variables ,
% I avoid using background noise data , it will blurr the result.
% This function gradient calculates the gradient of a signal,
% which is the rate of change of the signal over time.
pressureIN_grad = abs(gradient(pressureIN));
pressureOUT_grad = abs(gradient(pressureOUT));
flowrate_grad = abs(gradient(flowrate));
threshold = 0.0073;
ind1 = pressureIN_grad<threshold*max(pressureIN_grad);
ind2 = pressureOUT_grad<threshold*max(pressureOUT_grad);
ind3 = flowrate_grad<threshold*max(flowrate_grad);
ind4 = flowrate> 0;
ind = ind1 & ind2 & ind3 & ind4;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
figure()
plot(t, pressureIN, '.g',t, pressureOUT, '.r',t, flowrate, '.b');
axis tight;
title(' Input , Output Pressure , FR vs time');
legend('Input Pressure','Output Pressure','Flow Rate');
xlabel('Tempo [Days]');
ylabel('all');
PHeadLoss = pressureIN - pressureOUT;
densityplot(flowrate, PHeadLoss, [50,50])
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
densityplot(flowrate, PHeadLoss, [50,50]);
% Third Hypothesis
% PRESSURE HEAD LOSS (APPROX _no fattore di attrito, flusso del fluido e
% densità del fluido)
PHeadLoss = pressureIN - pressureOUT;
% consider only positive values
ind = PHeadLoss> 0;
t = t(ind);
pressureIN = pressureIN(ind);
pressureOUT = pressureOUT(ind);
flowrate = flowrate(ind);
PHeadLoss = pressureIN - pressureOUT;
densityplot(flowrate, PHeadLoss, [50,50])
%DATA
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
figure('Name',' PRESSURE HEAD LOSS vs FLOWRATE')
plot(flowrate, PHeadLoss, '*k');
title('Pressure Head Loss vs FLOW RATE');
xlabel('flow rate');
ylabel('P Head Loss');
densityplot(flowrate, PHeadLoss, [50,50]);
%%%
%flowrate = flowrate + 0.1*randn(numel(flowrate), 1)
figure()
plot(flowrate, PHeadLoss, '*k')
hold on
% Now I create the FITTING CURVE
x2 = 0:0.01:max(flowrate);
%y2 = 0.002941548*x2.^2+45;
y2 = 0.00294150.*x2.^2+45;
plot(x2, y2)
% Now I compute the distances from each point to the fitted curve
[xy,distance_abs,t_a ]= distance2curve([x2' y2'], [flowrate PHeadLoss]);
hold on
plot(xy(:,1), xy(:,2), '.r') % projections of minimum distances points on the fitted curve
figure()
hist(distance_abs, 100) % istogramma su 100 bin
dx = xy(:,1)-flowrate;
dy = xy(:,2)-PHeadLoss;
dx = (dx-mean(dx));
dx = dx/prctile(abs(dx), 95);
dy = (dy-mean(dy));
dy = dy/prctile(abs(dy), 95);
figure
hist(dx, 100)
xlabel('dx')
figure()
hist(dy, 100)
ylabel('dy')
figure()
histogram2(dx, dy, 100)
xlabel('dx');
ylabel('dy');
figure()
plot(dx, dy, '.b')
xlabel('dx');
ylabel('dy');
figure()
scatter(dx, dy, 'filled')
xlabel('dx');
ylabel('dy');
densityplot(dx, dy, [50,50])
xlabel('dx');
ylabel('dy');
% GMM
%% Machine Learning - Clustering
bond.x = dx';
bond.y = dy';
%% Visualize Data
% Visualization of the slice data in the component "corp".
% Coupon Rate vs. YTM plot, differentiated by credit rating
gscatter(bond.x,bond.y)
% Label the plot
xlabel('bondx')
ylabel('bondy')
%% Select Features to use for Clustering
% Getting the data from the colums Coupon rate, YTM, currente Yeld and
% credit rating.
% Features
% Number of Clusters to create
%eva=evalclusters([bond.x' bond.y'],'kmeans','CalinskiHarabasz','KList',[1:6]);
eva=evalclusters([bond.x' bond.y'],'gmdistribution','CalinskiHarabasz','KList',[1:6]);
numClust = eva.OptimalK;
%% Gaussian Mixture Models (GMM)
gmobj = fitgmdist([bond.x' bond.y'],numClust);
gidx = cluster(gmobj,[bond.x' bond.y']);
gscatter(bond.x', bond.y', gidx)
% Visualize results
%figure()
%I = find(gidx == 1);
%plot(dx(I), dy(I), '*r');
%hold on
%I = find(gidx == 2);
%plot(dx(I), dy(I), '*b')
densityplot(dx, dy, [50,50])
%figure()
%I = find(gidx == 1);
%plot(bond.x(I), bond.y(I), '*r');
%hold on
%figure()
%I = find(gidx == 2);
%plot(bond.x(I), bond.y(I), '*b')
figure()
gscatter(bond.x',bond.y', gidx)
figure()
scatter3(bond.x',bond.y', gidx)
%
Thanks a lot in advance !
답변 (2개)
Abhishek Chakram
2024년 2월 12일
Hi Giuseppe Zumbo,
Based on my observations on the code provided, you probably want to go through these when trying to fit a Gaussian Mixture Model (GMM) to the minimum distances of points from a curve to observe a Gaussian trend:
- Experiment with different histogram bin sizes.
- Verify the appropriateness of the quadratic curve.
- Review the normalization process.
- Visualize the Gaussian components of the GMM.
Here is corrected code snippet for the same:
% Load Data
% Assuming that the data is stored in 'flowdata.mat' with variable 'data'
load('flowdata.mat'); % Replace with the actual variable name and path
% If 'data' is not the variable name, replace it with the correct one
% For example, if the variable is named 'distances', use 'distances' instead of 'data'
% Prepare the data
% Assuming 'data' contains two columns corresponding to x and y coordinates
bond.x = data(:, 1);
bond.y = data(:, 2);
% Visualize Data
figure;
scatter(bond.x, bond.y);
xlabel('bond x');
ylabel('bond y');
title('Initial Data Scatter Plot');
% Determine the number of clusters using the Calinski-Harabasz criterion
eva = evalclusters([bond.x bond.y],'gmdistribution','CalinskiHarabasz','KList',[1:6]);
numClust = eva.OptimalK;
% Fit Gaussian Mixture Models (GMM)
gmobj = fitgmdist([bond.x bond.y], numClust);
% Assign data points to clusters
gidx = cluster(gmobj, [bond.x bond.y]);
% Visualize results with GMM cluster index
figure;
gscatter(bond.x, bond.y, gidx);
xlabel('bond x');
ylabel('bond y');
title('GMM Clustering');
% Visualize Gaussian components
hold on;
gmPDF = @(x,y) reshape(pdf(gmobj, [x(:) y(:)]), size(x));
fcontour(gmPDF, [min(bond.x) max(bond.x) min(bond.y) max(bond.y)]);
title('Gaussian Mixture Contours');
hold off;
% ... (any additional code)
You can refer to the following documentation to know more about the functions used:
- pdf: https://www.mathworks.com/help/stats/prob.normaldistribution.pdf.html
- fcontour: https://www.mathworks.com/help/matlab/ref/fcontour.html
Best Regards,
Abhishek Chakram
카테고리
도움말 센터 및 File Exchange에서 Lengths and Angles에 대해 자세히 알아보기
제품
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!