Comparing two curve fits (using AIC?)
Hello all,
Let's say for a moment that I have some a priori ideas about a family of functions that might best describe a particular data set. After I fit the data with each candidate, I can simply look at the outputs (e.g., RMSE, r²) and choose the one with the best values. However, is there a way to gain some degree of confidence about that selection? For example, suppose I have reason to believe that the true distribution of a data set like the one below is described by a low-order integer exponent in the function y=b×xⁿ+c. I might capture some data and then try to fit it as follows:
% Generate some data
b = 1; % True coefficient
n = 3; % True exponent
c = 0; % True y-intercept
rx = randi(100, [100, 1])/10;
rn = n + rand([100, 1])/4 - 0.125; % Noisy exponent around the true value
rc = (rand([100, 1]) * 500) - 250 + c;
ry = b * rx.^(rn) + rc;
% Generate and fit the models
m = @(b, c, n, x) b * x.^n + c;
m1 = @(p, x) m(p(1), p(2), 1, x); % n=1
m2 = @(p, x) m(p(1), p(2), 2, x); % n=2
m3 = @(p, x) m(p(1), p(2), 3, x); % n=3
c1 = lsqcurvefit(m1, [1, 0], rx, ry)
c2 = lsqcurvefit(m2, [1, 0], rx, ry)
c3 = lsqcurvefit(m3, [1, 0], rx, ry)
% Plot the results
x = linspace(min(rx), max(rx));
plot(rx, ry, 'ok', x, m1(c1, x), '-g', x, m2(c2, x), '-r', x, m3(c3, x), '-b')
% Determine which model is most likely to be correct. Use AIC?
In the above example, m3 should generally fit the data best (unless the random number generation is very unlucky), because that's the model the data were generated from.
For those who are familiar with Prism: there, I might perform this test by starting an "analysis", choosing "compare", selecting "for each data set, which of two equations (models) fits best", and then choosing the "Akaike's Information Criterion" test. After running that on one set of data, I got
[Screenshot of Prism's AIC comparison results omitted.]
This is very helpful because it tells me how certain I can be about the fit choice (i.e., 86% sure n=3 is better than n=2). What's the best way to do something similar in MATLAB?
Thanks in advance.
Accepted Answer
Additional Answers (1)
lsqcurvefit returns the residual norm of the fit as its second output. The model with the smallest residual norm would be the "best" one by this criterion:
% Generate some data
b = 1; % True coefficient
n = 3; % True exponent
c = 0; % True y-intercept
rx = randi(100, [100, 1])/10;
rn = n + rand([100, 1])/4 - 0.125; % Noisy exponent around the true value
rc = (rand([100, 1]) * 500) - 250 + c;
ry = b * rx.^(rn) + rc;
% Generate and fit the models
m = @(b, c, n, x) b * x.^n + c;
m1 = @(p, x) m(p(1), p(2), 1, x); % n=1
m2 = @(p, x) m(p(1), p(2), 2, x); % n=2
m3 = @(p, x) m(p(1), p(2), 3, x); % n=3
[c1, resnorm1] = lsqcurvefit(m1, [1, 0], rx, ry);
[c2, resnorm2] = lsqcurvefit(m2, [1, 0], rx, ry);
[c3, resnorm3] = lsqcurvefit(m3, [1, 0], rx, ry);
% Plot the results
x = linspace(min(rx), max(rx));
plot(rx, ry, 'ok', x, m1(c1, x), '-g', x, m2(c2, x), '-r', x, m3(c3, x), '-b')
% Determine which model is most likely to be correct. Use AIC?
[~, idx] = min([resnorm1 resnorm2 resnorm3]) % Index of best model
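Those residual norms can also be converted into AIC values and Akaike weights, which give the same kind of "% sure" figure that Prism reports. This is only a sketch, continuing from the variables above and assuming normally distributed errors; the formula AIC = N·log(RSS/N) + 2k is the standard least-squares form.

```matlab
% Sketch: AIC from least-squares fits, assuming Gaussian residuals.
% RSS is lsqcurvefit's resnorm; k counts the fitted coefficients
% plus the error variance.
N = numel(ry);
k = 3;                              % b, c, and the noise variance
rss = [resnorm1, resnorm2, resnorm3];
aic = N*log(rss/N) + 2*k;
aicc = aic + 2*k*(k+1)/(N-k-1);     % small-sample correction (AICc)
% Akaike weights: relative probability that each candidate is the
% best model of the set.
d = aicc - min(aicc);
w = exp(-d/2) / sum(exp(-d/2))
```

The largest entry of w is the probability (among these three candidates only) that the corresponding model is the best one, analogous to Prism's "86% sure" output.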
8 Comments
Bjorn Gustavsson
13 Jan 2023
Sure, but the Akaike Information Criterion (and the Bayesian IC and all their siblings) adds a penalty cost to reduce the risk of overfitting; I think that's the feature the OP wonders about. (I'm by no means an expert on the model-selection problem, but I comment since you're MathWorks staff, just to say: MATLAB has a couple of aic functions, but in some specialised add-on toolboxes, not in the Statistics Toolbox where, it seems to me, it should have its "most basic home"...)
James Akula
13 Jan 2023
Bjorn Gustavsson
13 Jan 2023
(I'm getting out on dodgy-looking thin ice over scary deep waters here.) In your example case it should be as "simple" as looking at the parameter covariance matrix, since both models have the same number of parameters if you extend the model to also fit the exponent. If the best value for the exponent is 3 with a standard deviation much less than 1, then that ought to show it is better than an exponent of 2. When it comes to choosing between models with different numbers of parameters, my "statistics vision" is not good enough to bring any enlightenment; I just go with the best AICc model.
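For what it's worth, here is a hypothetical sketch of that idea, using the Jacobian that lsqcurvefit returns at the solution to approximate the parameter covariance matrix. The mfree handle and the starting point [1, 0, 2] are made up for illustration; rx and ry are the data from the question.

```matlab
% Sketch: fit the exponent as a free parameter and estimate its
% uncertainty from the Jacobian at the solution.
mfree = @(p, x) p(1)*x.^p(3) + p(2);          % b, c, n all free
[p, resnorm, ~, ~, ~, ~, J] = lsqcurvefit(mfree, [1, 0, 2], rx, ry);
dof = numel(ry) - numel(p);                   % residual degrees of freedom
sigma2 = resnorm / dof;                       % estimated error variance
C = sigma2 * inv(full(J'*J));                 % approximate covariance matrix
se = sqrt(diag(C));                           % parameter standard errors
fprintf('n = %.3f +/- %.3f\n', p(3), se(3))
```

If you have the Statistics and Machine Learning Toolbox, nlparci can compute confidence intervals from the same residuals and Jacobian instead of forming the covariance by hand.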
Bora Eryilmaz
13 Jan 2023
I am not an expert on the AIC, but since your models have the same number of parameters, it may not be very applicable in this scenario. In terms of model-ranking power, AIC (with the same number of parameters) would give similar results to looking at the residual norm: assuming a normal distribution of modeling errors, the log-likelihood needed for the AIC is basically determined by the residual norm.
In your scenario, you can instead rank the models by the residual norm on a held-out "test dataset", rather than on the "training data" as was done above.
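A minimal sketch of that hold-out idea, reusing the m1, m2, m3 handles and the rx, ry data from the answer above (the 70/30 split is an arbitrary choice for illustration):

```matlab
% Sketch: rank models by residual norm on held-out test data.
shuffled = randperm(numel(rx));
itr = shuffled(1:70);  ite = shuffled(71:end);   % 70/30 train/test split
models = {m1, m2, m3};
testErr = zeros(1, numel(models));
for i = 1:numel(models)
    p = lsqcurvefit(models{i}, [1, 0], rx(itr), ry(itr));
    testErr(i) = sum((ry(ite) - models{i}(p, rx(ite))).^2);
end
[~, best] = min(testErr)   % index of the model with the lowest test error
```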
Bjorn Gustavsson
14 Jan 2023
It really makes me queasy to say that one model is "true" with some probability here, when it is so simple to use the exponent as another free parameter and then calculate the parameter covariance matrix to get an estimate of the uncertainty of the exponent from its standard deviation. I know this is a simplified toy-model example, but it really makes me uneasy...
James Akula
17 Jan 2023
Bjorn Gustavsson
18 Jan 2023
@James Akula, sure, and my suggestion might work well enough for this (with reasonable certainty) simplified toy example, but it can run into severe problems in more complicated real-world cases. I mainly thought it was worth mentioning as a reminder for the case where someone has this specific choice to make. (And I don't need anyone to argue this quandary with; I come down on all three of the two sides in this discussion and have frequent and angry arguments with myself from last week...)