Fixed Effects Design Matrix Must be of full column rank with multiple categorical predictors

I am probably doing something very dumb, however I cannot figure out my mistake.
I am trying to regress out some predictors from a data set -- I have two categorical predictors, A1 and A2 in a table, something like this:
It seems obvious to me that A1 and A2 are linearly independent. They are also linearly independent from the intercept, which I believe should be a categorical variable that looks like ones(1,11) ? But regardless, I want the global mean to not be removed from everything, so I don't include an intercept in the model.
Then, if I run something like this:
lme = fitlme(tbl, 'values ~ A1 + A2 - 1', 'DummyVarCoding', 'full') % tbl is the table described above
I always get the same error :
Error using classreg.regr.lmeutils.StandardLinearLikeMixedModel/validateInputs (line 229)
Fixed Effects design matrix X must be of full column rank.
I don't understand why this is happening -- and probably this shows that I have a pretty big misunderstanding of what the dummy variables actually are.
However, if I run two fitlme's -- one on the subset A1==1 and one on A1==0, they both work, which just super confuses me.

Answers (1)

Ive J on 29 Jan 2022
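The error message is accurate, and the cause is the full dummy variable scheme you're using (is there a reason you need it?). With 'DummyVarCoding' set to 'full', every level of A1 and every level of A2 gets its own indicator column, and those columns are not linearly independent of each other. See https://mathworks.com/help/stats/dummy-indicator-variables.html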
Note that the error has nothing to do with mixed-model design. Consider this example:
n = 100; % sample size
tab = table(randn(n,1), categorical(randi([0 1], n, 1)), ...
categorical(randi([0, 1], n, 1)),...
'VariableNames', {'value', 'A1', 'A2'});
mdl1 = fitlm(tab, 'value ~ A1 + A2 - 1', 'DummyVarCoding', 'full') % design matrix is rank deficient
Warning: Regression design matrix is rank deficient to within machine precision.
mdl1 =
Linear regression model:
    value ~ A1 + A2

Estimated Coefficients:
            Estimate       SE        tStat      pValue
            _________    _______    ________    _______
    A1_0     -0.20234    0.20399    -0.99191    0.32373
    A1_1            0          0         NaN        NaN
    A2_0    -0.045804    0.17202    -0.26627     0.7906
    A2_1     0.097693    0.18145     0.53839    0.59155

Number of observations: 100, Error degrees of freedom: 97
Root Mean Squared Error: 1.02
R-squared: 0.0145, Adjusted R-Squared: -0.00585
F-statistic vs. constant model: 0.712, p-value = 0.493
So, what happened? Let's construct the design matrix:
X = [dummyvar(tab.A1), dummyvar(tab.A2)]; % DummyVarCoding -> full
disp(rank(X)) % 3 < size(X, 2) --> 3 < 4 --> rank deficient
3
% what about when considering them alone?
disp(rank(X(:, 1:2))) % full rank
2
disp(rank(X(:, 3:4))) % full rank
2
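The rank of 3 can also be seen on paper. The symbols below are my own shorthand, not MATLAB output: a_0 and a_1 are the indicator columns for the two levels of A1, and b_0 and b_1 are those for A2. Every observation belongs to exactly one level of each factor, so each pair of indicators sums to a column of ones:

```latex
\[
X = \begin{bmatrix} a_0 & a_1 & b_0 & b_1 \end{bmatrix},
\qquad
a_0 + a_1 = \mathbf{1} = b_0 + b_1
\;\Longrightarrow\;
X \begin{bmatrix} 1 \\ 1 \\ -1 \\ -1 \end{bmatrix} = \mathbf{0}.
\]
```

A nonzero vector in the null space means the four columns can span at most three dimensions, which matches the rank reported above. Each factor on its own has no such relation, which is why the two-column blocks are full rank.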
We can approximately find the problematic variable:
[~, R] = qr(X, 0);
find(abs(diag(R)) < 1e-6)
ans = 4
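If you want to see which combination of columns is dependent, rather than only the index of one offending column, null(X) is one way to do it. The sketch below uses a small made-up data set (the values of A1 and A2 are hypothetical) and builds the full dummy coding by hand so it does not rely on random numbers:

```matlab
% Toy data (hypothetical): two binary factors, six observations
A1 = [0 0 1 1 0 1]';
A2 = [0 1 0 1 1 0]';

% Full dummy coding by hand: one indicator column per level
X = double([A1 == 0, A1 == 1, A2 == 0, A2 == 1]);

r = rank(X);      % number of linearly independent columns
v = null(X);      % basis of the null space of X (one vector here)
v = v / v(1);     % rescale so the first weight equals 1

disp(r)
disp(v')          % weights of the column combination that sums to zero
```

The weights come out proportional to [1 1 -1 -1], i.e. the two A1 columns add up to the same vector as the two A2 columns.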
Therefore, don't set 'DummyVarCoding' to 'full' in such cases; the default, 'reference', avoids the problem.
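If the goal is to keep one coefficient per level of A1 without an intercept, one option is to drop a single level of the second factor so that the design has full column rank. This is a hand-built sketch on made-up data (A1, A2 and y are hypothetical), and I am not claiming this is what fitlme constructs internally:

```matlab
% Toy data (hypothetical): two binary factors and a response
A1 = [0 0 1 1 0 1]';
A2 = [0 1 0 1 1 0]';
y  = [1.2 0.7 2.1 1.9 0.4 2.5]';

% Both levels of A1, but only the non-reference level of A2
Xr = double([A1 == 0, A1 == 1, A2 == 1]);

% The reduced design has as many independent columns as it has columns
assert(rank(Xr) == size(Xr, 2))

% Ordinary least-squares fit with the reduced design
b = Xr \ y;
disp(b')
```

Here b(1) and b(2) are the fitted values for the two levels of A1 when A2 is at its reference level, and b(3) is the shift associated with the other level of A2.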
