Maximum size for linear regression

What is the maximum size for linear regression problems? One dependent variable.

Answers (1)

John D'Errico on 12 Aug 2017
Edited: John D'Errico on 12 Aug 2017

1 vote

There is NO maximum size.
The only limit will be essentially a function of how much memory you have.
Are you asking about a model like y=a*x+b? Since a simple linear model like this is soooooo trivial to compute, even for a huge number of parameters, I doubt you should be worried.
Do you have hundreds of millions of points for such a model?
How many parameters are you trying to estimate?
For example,
x = randn(1e8,1);
y = randn(size(x));
M = polyfit(x,y,1);
This quick test took roughly 4 seconds to estimate the model for 100 million points. That may seem like it is significant, but it took roughly that long just to generate the random data!
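As a rough sketch of what that polyfit call computes: for degree 1 it should agree, up to rounding error, with a least-squares solve via backslash on a design matrix holding x and a column of ones. The seed, the smaller n, and the "true" line 3*x + 2 are assumptions made purely for illustration.

```matlab
% Hypothetical small-scale check: polyfit(x,y,1) versus backslash.
rng(1);                           % fixed seed (assumption, for repeatability)
n = 1e5;                          % far smaller than the 1e8 points used above
x = randn(n,1);
y = 3*x + 2 + 0.1*randn(n,1);     % made-up line plus noise

M = polyfit(x,y,1);               % M(1) is the slope, M(2) the intercept
coef = [x, ones(n,1)] \ y;        % the same least-squares fit via backslash

maxDiff = max(abs(M(:) - coef));  % expected to be near rounding error
```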
Each of the arrays x and y requires roughly 0.8 gigabytes of RAM just to store. The linear regression itself very temporarily required 3 more gigabytes of RAM.
So, if I were trying to do this with, say, 500 million data points, I would have been stretching the limits of the RAM I have installed, and MATLAB would have started doing a bit of disk swapping. Even at that, it would have been doable since I have a solid state drive.
Again though, it is really only the memory you have installed that limits such a computation.
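The storage figures above follow from 8 bytes per double. A minimal sketch of that arithmetic, counting only the x and y vectors and not the temporary working memory of the fit itself (the variable names are illustrative):

```matlab
% Storage needed for the data vectors alone, in decimal gigabytes.
bytesPerDouble = 8;                      % MATLAB's default numeric class is double
n = [1e8, 5e8];                          % the two problem sizes discussed above
gbPerVector = n * bytesPerDouble / 1e9;  % 0.8 GB and 4 GB per vector
gbForXandY  = 2 * gbPerVector;           % 1.6 GB and 8 GB for x and y together
fprintf('%g points: %.1f GB per vector, %.1f GB for x and y\n', ...
        [n; gbPerVector; gbForXandY]);
```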

5 Comments

Tom Graney on 12 Aug 2017
A linear model with problem sizes of up to 250 x 5000.
John D'Errico on 12 Aug 2017
Edited: John D'Errico on 12 Aug 2017
How fast can I say utterly TRIVIAL? :)
John D'Errico on 12 Aug 2017
Edited: John D'Errico on 12 Aug 2017
Assuming you mean 5000 observations with 250 unknowns...
A = rand(5000,250);b = rand(5000,1);
timeit(@() A\b)
ans =
0.10432
I did not even see more than a tiny blip on the monitor to see any additional memory consumed.
If you meant 250 observations with 5000 variables, then the problem is underdetermined as a linear regression. It is still trivial to solve, though backslash returns a basic solution, so most of the variables will be zero in the result.
A = rand(250,5000);b = rand(250,1);
timeit(@() A\b)
ans =
0.15686
More variables, so a bit more time. Again, no serious blip on the memory consumed. Though in both cases, it was enough to fire up multiple cores to solve the problem.
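To make the underdetermined case concrete, here is a hedged sketch comparing the basic solution from backslash with the minimum-norm solution from pinv. Both should reproduce b essentially exactly; they just pick different members of the infinite solution set. The seed and variable names are assumptions for illustration.

```matlab
% Hypothetical comparison on a 250-by-5000 underdetermined system.
rng(0);                              % fixed seed (assumption, for repeatability)
A = rand(250,5000);
b = rand(250,1);

xBasic = A\b;                        % basic solution: at most rank(A) nonzeros
numNonzero = nnz(xBasic);            % no more than 250 of the 5000 coefficients

xMinNorm = pinv(A)*b;                % minimum-norm solution, generally dense

resBasic   = norm(A*xBasic - b);     % both residuals should be near zero
resMinNorm = norm(A*xMinNorm - b);
normRatio  = norm(xMinNorm) / norm(xBasic);  % at most 1, up to rounding
```

Which of the two is preferable depends on the application; neither is "the" regression answer when there are more unknowns than observations.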
Tom Graney on 12 Aug 2017
Yes, 5000 observations and 250 unknowns. Excel does not seem to be able to do this.
John D'Errico on 13 Aug 2017
It is not even a large problem for MATLAB.


Asked: 12 Aug 2017
Commented: 13 Aug 2017
