Regression with tall array (Using datastore, CSV) - Error

Hi

Comments: 5

Ive J on 12 Jul 2021
Edited: Ive J on 12 Jul 2021
Do you mean this?
result = fitglm(x, y, 'Distribution', 'binomial', 'Link', 'logit');
I ask because your call has an extra ) there (though I'm sure the error is complaining about something else).
Can you confirm you have tall arrays (for x and y)?
istall(x)
ans =
logical
1
Also, are you trying to set the formula? The error says so, but your call to fitglm doesn't show this.
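A quick way to check both things at once is to look at the class of the data underneath the tall wrapper, since istall alone does not distinguish a tall table from a tall numeric matrix. The snippet below is only a sketch on a toy stand-in (the variable xDemo and its contents are made up for illustration, not taken from the poster's data):

```matlab
% Toy stand-in for the poster's x: a small table wrapped in a tall array.
xDemo = tall(array2table(magic(4)));

isTallFlag = istall(xDemo);                  % true for any tall array, table or not
underlying = gather(classUnderlying(xDemo)); % class of the data behind the tall wrapper

disp(isTallFlag)
disp(underlying)   % a table here means fitglm sees a table as its first argument
```

If the underlying class turns out to be table rather than double, the call fitglm(x, y, ...) is being read as fitglm(tbl, modelspec, ...).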
Yes, your fitglm-line is the one I have, the ) was a copy-paste error.
And yes, x and y are both tall arrays.
No, I am not calling a special formula.
Can you share the output of your dependent/independent variables?
x
y
x is a 1000x500 (tall) table. These are the first entries:
7 6 12 12 15 13 12 30 71 6
3 4 4 0 0 1 10 2 6 1
1 0 0 0 0 0 2 0 0 0
1 0 4 0 0 0 0 0 4 0
6 3 5 2 0 0 10 0 3 0
3 26 10 3 0 2 15 7 24 1
17 85 5 4 0 0 29 0 6 0
1 0 1 0 0 2 1 0 0 0
2 0 3 0 0 0 9 0 4 0
5 18 11 2 0 1 6 0 3 0
3 1 0 0 0 2 4 0 0 0
2 0 0 0 0 0 0 0 0 0
2 0 10 0 0 0 0 0 0 0
2 0 1 1 0 3 0 0 3 0
2 16 3 0 0 0 3 2 36 1
y is a 1000x1 (tall) table and the first entries are:
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
I just tried a quick test to see whether the problem was the combination of tall arrays and fitglm:
>> X=[1:1000].'; X=tall(X);
>> Y=randn(size(X)); % this is interesting sidelight on the way...
Error using randn
Size inputs must be numeric.
>> size(X)
ans =
1×2 tall double row vector
1000 1
>> Y=randn(1000,1); Y=tall(Y); % OK, have to brute-force it
>> fitglm(X,Y,'Distribution',"normal")
Iteration [1]: 0% completed
Iteration [1]: 50% completed
Iteration [1]: 100% completed
Iteration [2]: 0% completed
Iteration [2]: 50% completed
Iteration [2]: 100% completed
Iteration [3]: 0% completed
Iteration [3]: 100% completed
ans =
Compact generalized linear regression model:
y ~ 1 + x1
Distribution = Normal
Estimated Coefficients:
Estimate SE tStat pValue
__________ __________ ________ _______
(Intercept) 0.0015036 0.064429 0.023338 0.98139
x1 1.6177e-05 0.00011151 0.14507 0.88468
1000 observations, 998 error degrees of freedom
Estimated Dispersion: 1.04
F-statistic vs. constant model: 0.021, p-value = 0.885
>>
So fitglm will accept tall arrays; the problem must be elsewhere in the syntax, it would seem...
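On the randn sidelight in the session above: size of a tall array is itself returned as a tall (unevaluated) array, which is why randn rejects it. One possible workaround, sketched here under the assumption that the data is small enough for a pass through it to be cheap, is to gather the size first:

```matlab
% Same setup as in the session above.
X = tall((1:1000).');

% size(X) is a deferred tall 1x2 double; gather evaluates it into an ordinary vector.
sz = gather(size(X));

% randn now receives plain numeric size inputs.
Y = tall(randn(sz));
```

Note that gather triggers an evaluation of the tall array, so on a genuinely large datastore it is cheaper to take the row count from somewhere else if it is already known.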


Accepted Answer

Ive J on 13 Jul 2021
Edited: Ive J on 13 Jul 2021
Well, your data is a tall table, and that's what MATLAB complains about: since your first argument is a table, MATLAB treats the second argument, y, as modelspec. You have two options:
% Option 1: feed fitglm matrices by extracting the table contents with {:, :}
mdl = fitglm(x{:, :}, y{:, :}, 'Link', 'logit', 'Distribution', 'binomial');
% Option 2: merge x and y into a single table
data = [x, y]; % the last column is the dependent variable by default
mdl = fitglm(data, 'Link', 'logit', 'Distribution', 'binomial');
Btw, your data is fairly small and (I assume) fits in memory; tall arrays should be avoided for such small datasets.
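For anyone who wants to try the two options without the original CSV, here is a small in-memory sketch. The data is randomly generated and the names xTbl, yTbl, x1..x3 are made up for illustration; it uses ordinary (non-tall) tables, so it shows the calling patterns rather than reproducing the tall-array case exactly. The third call names the response explicitly through the 'ResponseVar' option instead of relying on the last-column default.

```matlab
rng(0);                                   % reproducible toy data
n = 200;
X = randi([0 20], n, 3);                  % count-like predictors
y = double(rand(n, 1) < 0.3);             % binary response

xTbl = array2table(X, 'VariableNames', {'x1', 'x2', 'x3'});
yTbl = table(y);                          % one-column table with variable name 'y'

% Option 1: pass numeric matrices extracted from the tables.
mdl1 = fitglm(xTbl{:, :}, yTbl{:, :}, 'Distribution', 'binomial', 'Link', 'logit');

% Option 2: one merged table; the last variable is taken as the response.
data = [xTbl, yTbl];
mdl2 = fitglm(data, 'Distribution', 'binomial', 'Link', 'logit');

% Variant of option 2: state the response variable explicitly.
mdl3 = fitglm(data, 'ResponseVar', 'y', 'Distribution', 'binomial', 'Link', 'logit');
```

All three calls fit the same model, so the coefficient estimates should agree up to numerical noise; only the variable names in the displayed formula differ.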

Comments: 2

Hi Ive,
I merged the x and y tables and converted the new table before building the tall array with:
ds = transform(ds,@table2array);
Now it works, Thanks for your help!
PS: the file here was only a smaller sample. The "real" one is 320000x30000.
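For later readers, the workflow described in this comment might look roughly like the sketch below. It is an assumption-laden reconstruction, not the poster's actual script: the CSV is a generated temporary file with no header row, the response is assumed to sit in the last column, and the column count p is assumed to be known in advance.

```matlab
rng(1);
n = 200; p = 3;                               % toy sizes, not the real 320000x30000
X = randi([0 20], n, p);
y = double(rand(n, 1) < 0.3);

csvFile = [tempname '.csv'];                  % hypothetical stand-in for the real CSV
writematrix([X y], csvFile);                  % predictors first, response last

ds = tabularTextDatastore(csvFile, 'ReadVariableNames', false);
ds = transform(ds, @table2array);             % each chunk is read as a numeric matrix
tA = tall(ds);                                % tall numeric array instead of a tall table

tX = tA(:, 1:p);                              % predictor columns
ty = tA(:, p + 1);                            % response column
mdl = fitglm(tX, ty, 'Distribution', 'binomial', 'Link', 'logit');

delete(csvFile);                              % clean up the temporary file
```

Because the datastore now yields matrices, fitglm receives two numeric tall arrays and no longer interprets the second argument as modelspec.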
If I were you, I would also test with arrays. Processing tables is almost always (in my experience) slower than processing arrays.
Good luck!


Asked: 12 Jul 2021

Edited: 1 Aug 2021
