Dummy variable coding in mixed models (LME)

Question

5 votes

Hi all,

I've been a little perplexed by the different ways to code dummy variables when fitting a linear mixed model (using fitlme). Specifically I have a model with two categorical fixed factors. I'd like to do contrasts between the different levels of each factor. Now several online sources tell me I should use 'effects' coding for this, but it isn't clear to me why this is, nor is it clear to me how I should code my contrast matrix when using 'effects' coding rather than reference coding.

For example, if I have a model with an intercept and one categorical fixed factor with three levels, such that:

T       = table(y,var1); % y is my response variable, var1 a categorical variable with three levels
formula = 'y ~ var1';
m1      = fitlme(T,formula,'DummyVarCoding', 'reference');
me1     = fitlme(T,formula,'DummyVarCoding', 'effects');

Now I'd like to test whether there's a difference between the first and second categories. In the case of the m1 (reference coding) and me1 (effects coding) I would do this as follows:

[p,F,df1,df2]     = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);

This gives me very different results, even though in both cases multiplying the contrast matrix by the associated fixed-effects vector gives me the same value. (According to the documentation for coefTest: "It tests the null hypothesis that H0: Hβ = 0, where β is the fixed-effects vector.")

When I try both options with some simulated data, where the first two levels of the fixed factor differ significantly, I only find this significant difference when using reference coding. Any insight would be greatly appreciated.

Here's the code for some simulated data:

d1 = zeros(100,1)+randn(100,1);
d2 = ones(100,1)+randn(100,1);
d3 = ones(100,1).*2+randn(100,1);
d = [d1;d2;d3];
cov1 = [zeros(100,1);ones(100,1);ones(100,1).*2];
T = table(d,categorical(cov1),'VariableNames',{'d','cov1'});
f = 'd~cov1';
m1 = fitlme(T,f);
me1 = fitlme(T,f,'DummyVarCoding','effects');
[p,F,df1,df2]     = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);

Answer 1 (accepted answer)

Paul, 24 July 2020

Edited: Paul, 24 July 2020

3 votes

Hi! Yes I solved it.

The problem was that, in the example above, I calculated the H vector incorrectly. This is the vector [0 1 0] which is input to the coefTest() function and which tells the function which contrast to perform.

To compute correct H vectors we need to first look at the model coefficients. For the dummy coded model these are:

(intercept)

Cov_1

Cov_2

And for the effects coded model these are:

(intercept)

Cov_0

Cov_1

To compute the H vector for a contrast, you need to make a vector for each of the two conditions you want to contrast and subtract those vectors from each other. In this case we want to compare condition 0 with condition 1.

For the dummy coded model condition 0 is represented by only the intercept, because in dummy coding the intercept is set equal to the estimate of the first condition, in other words the intercept = Cov_0, so this vector = [1 0 0]. Condition 1 is represented by the intercept + the coefficient Cov_1, so the vector = [1 1 0]. The difference between these vectors: H= [0 -1 0]. You could flip the order of the subtraction around: H = [0 1 0] would be equivalent.

For the effects coded model the intercept does NOT represent condition 0. Instead, condition 0 is now represented by the intercept + the coefficient Cov_0, so this vector = [1 1 0]. The second condition is equal to the intercept + Cov_1, just like in the dummy coded model, but now this vector = [1 0 1]. The difference between these two is H = [0 1 -1] or [0 -1 1].
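The subtraction described above can be sketched in a few lines of base MATLAB. The variable names (cond0_ref and so on) are made up for illustration; the vectors themselves are the ones given in the text.

```matlab
% Each condition vector holds one weight per fixed-effects coefficient,
% in the order the coefficients appear in the fitted model.

% Reference (dummy) coding: coefficients are (intercept), Cov_1, Cov_2
cond0_ref = [1 0 0];             % condition 0 = intercept only
cond1_ref = [1 1 0];             % condition 1 = intercept + Cov_1
H_ref = cond0_ref - cond1_ref;   % contrast condition 0 vs condition 1

% Effects coding: coefficients are (intercept), Cov_0, Cov_1
cond0_eff = [1 1 0];             % condition 0 = intercept + Cov_0
cond1_eff = [1 0 1];             % condition 1 = intercept + Cov_1
H_eff = cond0_eff - cond1_eff;   % contrast condition 0 vs condition 1

disp(H_ref)   % 0 -1 0
disp(H_eff)   % 0 1 -1
```

The intercept weight cancels in both cases, which is why a between-condition contrast always has a zero in the first position.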

Using the new vectors both coefTest(m1, [0 1 0]) and coefTest(me1, [0 1 -1]) return the same output.
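A quick way to check this is to rerun the simulation from the question with the corrected vectors. This is only a sketch: it assumes the Statistics and Machine Learning Toolbox is installed, and the fixed seed is an arbitrary choice so that the run is reproducible.

```matlab
rng(1);  % arbitrary seed, only for reproducibility

% Three conditions with true means 0, 1 and 2 (as in the question)
d    = [randn(100,1); 1 + randn(100,1); 2 + randn(100,1)];
cov1 = categorical([zeros(100,1); ones(100,1); 2*ones(100,1)]);
T    = table(d, cov1);

m1  = fitlme(T, 'd ~ cov1');                               % reference coding (default)
me1 = fitlme(T, 'd ~ cov1', 'DummyVarCoding', 'effects');  % effects coding

% Condition 0 vs condition 1 in each coding
[p,  F,  df1,  df2 ] = coefTest(m1,  [0 1 0]);
[pe, Fe, df1e, df2e] = coefTest(me1, [0 1 -1]);

fprintf('reference coding: F = %.4f, p = %.3g\n', F,  p);
fprintf('effects coding:   F = %.4f, p = %.3g\n', Fe, pe);
```

If the vectors are right, the two printed lines should agree up to numerical precision.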

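I personally prefer using dummy coding for doing contrasts, because I find it a bit easier to intuit the H vector, but in principle the two codings are entirely equivalent for contrasts. When using the anova() function, though, you NEED to fit the model with effects coding to get meaningful tests for each term (the MATLAB documentation for anova states that 'effects' contrasts are required for Type III hypothesis tests).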

However, you might wonder how to compute contrasts with the third condition, represented by Cov_2, when using effects coding. Cov_2 is not explicitly mentioned in the effects coded model, so how does this work? Well condition 2 is represented by the vector [1 -1 -1], so a contrast between condition 1 and 2 would give H = [1 0 1] - [1 -1 -1] = [0 1 2] when using effects coding. Whereas using dummy coding it would give H = [1 1 0] - [1 0 1] = [0 1 -1].
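To avoid doing these subtractions by hand, one option is to write the condition vectors as rows of a matrix and loop over all pairs. The sketch below uses base MATLAB only; the matrix names X_ref and X_eff are made up for illustration, and the rows are the condition vectors from the text.

```matlab
% Rows = conditions 0, 1, 2; columns = model coefficients.
X_ref = [1 0 0; 1 1 0; 1 0 1];     % reference coding: (intercept), Cov_1, Cov_2
X_eff = [1 1 0; 1 0 1; 1 -1 -1];   % effects coding:   (intercept), Cov_0, Cov_1

pairs = nchoosek(1:3, 2);          % all pairs of rows: [1 2; 1 3; 2 3]
for k = 1:size(pairs, 1)
    a = pairs(k, 1);
    b = pairs(k, 2);
    H_ref = X_ref(a, :) - X_ref(b, :);   % contrast in reference coding
    H_eff = X_eff(a, :) - X_eff(b, :);   % same contrast in effects coding
    fprintf('condition %d vs %d: H_ref = [%s], H_eff = [%s]\n', ...
        a - 1, b - 1, num2str(H_ref), num2str(H_eff));
end
```

The last pair (condition 1 vs condition 2) reproduces the vectors above: [0 1 -1] in reference coding and [0 1 2] in effects coding.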

Another thing I found hard to figure out, because the Matlab documentation doesn't discuss it, is how to compare groups of conditions, for instance when each condition is a combination of two factors. I thought it might be helpful to mention here as well. For example, imagine you want to test the efficacy of two diets on weight loss and you have data from both men and women. You would have four categories: weight loss with diet 1 and weight loss with diet 2, for both men and women. You want to study the interaction between gender and type of diet. Your model would be: 'WeightLoss ~ DietType * Gender' (variable names in a formula cannot contain spaces).

Your dummy coded model would have the following coefficients:

(intercept), which represents the combination of Diet_0 and Gender_0

Diet_1

Gender_1

Diet_1*Gender_1

If you want to see whether there's a difference between the first and second diet, the procedure is to add the vectors for each individual category within a group, and then subtract the two group vectors, as follows:

(v1 + v2) - (v3 + v4) = H

Here v1 represents Gender_0 on Diet_0 (just the intercept), v2 represents Gender_1 on Diet_0, v3 represents Gender_0 on Diet_1 and v4 represents Gender_1 on Diet_1. The calculation would be:

([1 0 0 0] + [1 0 1 0]) - ([1 1 0 0]+[1 1 1 1]) = [0 -2 0 -1] = H

If you want to compare efficacy of dieting between men and women, you would say H =

([1 0 0 0] + [1 1 0 0]) - ([1 0 1 0] + [1 1 1 1]) = [0 0 -2 -1]
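The two group contrasts above can be reproduced with base MATLAB as follows. The names H_diet, H_gender and H_inter are made up for illustration. The last contrast is not in the text above: it is my own addition, built with the same subtraction logic, and it isolates the interaction term.

```matlab
% Coefficient order: (intercept), Diet_1, Gender_1, Diet_1*Gender_1 (interaction)
v1 = [1 0 0 0];   % Gender_0 on Diet_0 (just the intercept)
v2 = [1 0 1 0];   % Gender_1 on Diet_0
v3 = [1 1 0 0];   % Gender_0 on Diet_1
v4 = [1 1 1 1];   % Gender_1 on Diet_1

H_diet   = (v1 + v2) - (v3 + v4);   % Diet_0 vs Diet_1, summed over gender
H_gender = (v1 + v3) - (v2 + v4);   % Gender_0 vs Gender_1, summed over diet

% Assumed extension: difference of differences, i.e. whether the diet
% effect differs between the two genders
H_inter  = (v1 - v3) - (v2 - v4);

disp(H_diet)     % 0 -2 0 -1
disp(H_gender)   % 0 0 -2 -1
disp(H_inter)    % 0 0 0 1
```

Multiplying H by a nonzero constant (for example 0.5, to express an average rather than a sum) does not change the hypothesis Hβ = 0, so the test result is the same.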

Comments: 6 (4 earlier comments not shown)

Paul, 25 July 2020

Edited: Paul, 25 July 2020

Hi Xue,

Discovering LMEs with Matlab has helped me a lot in my work, but it was a frustrating process to figure out the details. I'm happy to help others avoid some of that frustration.

No, in the effects coded model Cov_2 is NOT the intercept. In the effects coded model the intercept is not equal to any of your groups/conditions, but rather lies somewhere in between them. In my example the estimate of Cov_2 in the effects coded model uses [1 -1 -1]. This should give the same value as using the vector [1 0 1] in the dummy coded model. This is why interpreting a dummy coded model is (for me at least) a bit more intuitive than interpreting an effects coded model. I also mention this in the paragraph above: "However, you might wonder..."
The function anova() only works properly with an effects coded model because it ASSUMES you are using an effects coded model as input.

I'm no expert on the exact mathematics, but this is related to one specific characteristic of effects coded models, namely that if you sum all the vectors related to all possible conditions they should produce all zeros, except at the intercept. So in the above example there are three conditions: Cov_0, Cov_1, Cov_2. In the effects coded model these are associated with the vectors [1 1 0], [1 0 1] and [1 -1 -1]. If you sum those all together, the result is [3 0 0]. All elements are 0 except for the position of the intercept (3). This is always the case for effects coded models no matter how large, and it is this specific characteristic of the effects coded model which anova() requires to work.

In the dummy coded model the three conditions are associated with [1 0 0], [1 1 0] and [1 0 1], which sum to [3 1 1]. So they do not sum to zero, obviously.
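The sum property is easy to check for any number of levels. The sketch below builds the condition vectors for k levels in base MATLAB. It assumes, as in the example above, that the last level is the one coded with -1s in effects coding and that the first level is the reference in dummy coding.

```matlab
k = 5;  % number of levels (arbitrary; any k >= 2 works)

% One row per condition, one column per coefficient (intercept first).
X_eff = [ones(k,1), [eye(k-1); -ones(1,k-1)]];   % effects coding
X_ref = [ones(k,1), [zeros(1,k-1); eye(k-1)]];   % reference (dummy) coding

disp(sum(X_eff, 1))   % k at the intercept, zeros everywhere else
disp(sum(X_ref, 1))   % k at the intercept, ones everywhere else
```

With k = 3 the two matrices reduce to the vectors listed above.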

So the important thing to understand is that the difference between a dummy coded model and an effects coded model is mostly in what the point of reference (intercept) is and how the other coefficients are related to the intercept. If you have a model with only two conditions, in a dummy coded model your intercept would be equal to condition 1 (which would be associated with vector [1 0]) and the only coefficient would be the difference between condition 1/intercept and condition 2 (which would be associated with vector [1 1]). In an effects coded model the intercept would be the halfway point between condition 1 and condition 2, and the only coefficient for the model would need to be added to the intercept to produce the estimate of condition 1 ([1 1]) and subtracted from the intercept to produce condition 2 ([1 -1]). Because that is the only difference, dummy coded and effects coded models produce identical fits, identical confidence intervals for the condition estimates, identical log-likelihoods, identical R-squared, and so on.
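A small numeric illustration of the two-condition case, in base MATLAB. The condition means (2 and 4) are made-up numbers, not estimates from any data in this thread.

```matlab
mu = [2; 4];            % hypothetical estimates for condition 1 and condition 2

X_ref = [1 0; 1 1];     % dummy coding:   rows are the vectors for condition 1 and 2
X_eff = [1 1; 1 -1];    % effects coding: rows are the vectors for condition 1 and 2

b_ref = X_ref \ mu;     % [intercept; coefficient] in dummy coding
b_eff = X_eff \ mu;     % [intercept; coefficient] in effects coding

disp(b_ref')            % 2 2  : intercept = condition 1, coefficient = difference
disp(b_eff')            % 3 -1 : intercept = halfway point, coefficient = offset

% Both parameterisations reproduce the same condition estimates
disp([X_ref * b_ref, X_eff * b_eff])
```

The coefficients differ, but the fitted condition values are the same, which is the sense in which the two codings are equivalent.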

Let me know if you have any more questions.

Paul, 16 February 2024

Hi LM_BU,

I don't think I said that "Cov_0 is [1 1 0]". I did say that the first condition is represented by "intercept + Cov_0 is [1 1 0]". In effects coding, every condition contains the intercept; that's part of the definition of the effects coding and as far as I can tell this is also in line with the contents of the documentation page you linked.

I do not see the example you mention (of testing the main effect of supplier) on the page you linked. Are we both looking at the docs for 2023b?

I do agree with you that when testing for the main effect of supplier, you should leave out the intercept, because you are interested in the effect of supplier, not of the intercept.

In my examples above, I am mostly talking about testing between conditions, which is different from testing main effects. When testing between conditions, you subtract the effects vectors of the two conditions. Since a condition in effects coding always includes the intercept, subtracting the effects vectors of any two conditions will always produce an H vector with the intercept element == 0. The same cancellation happens in reference coding. Example (with reference-coded vectors): if you contrast condition 1 [1 0 0] with condition 2 [1 1 0], your H vector will be [1 0 0] - [1 1 0] = [0 -1 0].

Ken Campbell, 18 December 2024

@Paul - this was super-helpful - thank you!

Have you worked out how to correct the p-values if you perform multiple comparisons?

Ken

Paul, 19 December 2024

Edited: Paul, 19 December 2024

Hi Ken, happy that it was useful :)

For 'normal' multiple comparisons I use the Holm-Bonferroni method. It controls the family-wise type 1 error rate at the same level as Bonferroni, but it is never less powerful, so it produces fewer type 2 errors.
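For reference, a minimal sketch of the standard Holm step-down adjustment in base MATLAB. The raw p-values are made-up numbers standing in for the outputs of several coefTest() calls, and the variable names are chosen for illustration only.

```matlab
p = [0.010 0.040 0.030 0.005];   % hypothetical raw p-values, one per contrast
m = numel(p);

[pSorted, order] = sort(p);            % smallest p-value first
adj = (m - (0:m-1)) .* pSorted;        % multiply the k-th smallest by (m - k + 1)
adj = min(cummax(adj), 1);             % keep adjusted values non-decreasing, cap at 1

pHolm = zeros(size(p));
pHolm(order) = adj;                    % put adjusted values back in the original order

disp(pHolm)                            % 0.03 0.06 0.06 0.02
disp(min(p * m, 1))                    % plain Bonferroni for comparison: 0.04 0.16 0.12 0.02
```

Each Holm-adjusted p-value is compared against the usual alpha (for example 0.05), and it is never larger than the corresponding Bonferroni-adjusted value.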

For correction of the degrees of freedom of ANOVAs I use the 'satterthwaite' option of the anova() function (set through the 'DFMethod' name-value argument).
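A sketch of how that option can be passed, assuming the Statistics and Machine Learning Toolbox and its bundled shift.mat example data (the same data set used further down in this thread). The model formula is taken from that later example, not from my own analyses.

```matlab
load('shift.mat')   % example data shipped with the toolbox

% Effects coding, so that anova() tests each term as intended
lme = fitlme(shift, 'QCDev ~ Shift + (1|Operator)', 'DummyVarCoding', 'effects');

% F-tests per fixed-effects term with Satterthwaite degrees of freedom
tbl = anova(lme, 'DFMethod', 'satterthwaite');
disp(tbl)

% The same degrees-of-freedom method can be requested for a single contrast
% (here evening vs morning in effects coding; the third argument is C in H*beta = C)
[p, F, df1, df2] = coefTest(lme, [0 1 -1], 0, 'DFMethod', 'satterthwaite');
fprintf('F(%g, %.2f) = %.4f, p = %.4f\n', df1, df2, F, p);
```

With Satterthwaite's approximation the denominator degrees of freedom are generally not integers, which is why df2 is printed with decimals.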

Let me know if you had something else / more specific in mind.

Answer 2

Xue Zhang, 24 July 2020

0 votes

Same question here, have you figured out why? Thanks!

Comments: 1

Paul, 24 July 2020

Yes, see my above comment.

Answer 3

osman, 19 August 2025

0 votes

Hi,

I understand what you said above. I might be missing something, but the example I see when I type doc coeftest in MATLAB is below. There they say the difference between the evening and morning shifts is tested with pVal = coefTest(lme,[0 1 -1]), but following your approach it should be pVal = coefTest(lme,[0 1 0]). Am I wrong?

load('shift.mat')
shift.Shift    = nominal(shift.Shift);
shift.Operator = nominal(shift.Operator);
lme = fitlme(shift,'QCDev ~ Shift + (1|Operator)') % not effects coding, reference coding

% The output:
% Fixed effects coefficients (95% CIs):
%     Name                 Estimate    SE         tStat       DF    pValue
%     {'(Intercept)'  }     3.1196     0.88681     3.5178     12    0.0042407
%     {'Shift_Morning'}    -0.3868     0.48344    -0.80009    12    0.43921
%     {'Shift_Night'  }     1.9856     0.48344     4.1072     12    0.0014535

% Test if there is any difference between the evening and morning shifts.
pVal = coefTest(lme,[0 1 -1])

Comments: 1

Paul, 21 August 2025

Edited: Paul, 21 August 2025

Let me walk through this for a moment:

You want to test evening vs morning shifts. The model is reference coded, with 'evening shift' being the reference group. It is represented by the vector [1 0 0]. The morning shift is represented by the vector [1 1 0]. Subtract the two and you indeed end up with [0 -1 0] or, equivalently, [0 1 0]. For an effects coded model, the vectors would be [1 1 0] for evening and [1 0 1] for morning, and the difference would then be the one cited in the docs: [0 1 -1].

So the documentation that you cite seems incorrect from my perspective. But I don't personally see this documentation in Matlab, nor in the online docs. What version are you using?

You can check whether you get identical output from coefTest(lme, [0 1 0]) from a model that is reference coded and coefTest(lme, [0 1 -1]) from a model that is effects coded. If so, that would at least support this idea of how the vectors are coded.

If you want to be sure you should use some data with known, significant effects and check with what kind of vector coefTest() produces the expected result.
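As a side note, the same subtraction logic shows which comparison [0 1 -1] would correspond to under reference coding. This is a base MATLAB sketch that assumes the coefficient order from the output quoted in the question ((Intercept), Shift_Morning, Shift_Night), with evening as the reference level.

```matlab
% Reference coding; coefficients: (Intercept), Shift_Morning, Shift_Night
evening = [1 0 0];   % reference level: intercept only
morning = [1 1 0];   % intercept + Shift_Morning
night   = [1 0 1];   % intercept + Shift_Night

H_morning_vs_evening = morning - evening;
H_morning_vs_night   = morning - night;

disp(H_morning_vs_evening)   % 0 1 0
disp(H_morning_vs_night)     % 0 1 -1
```

So, under these assumptions, [0 1 -1] applied to the reference coded model would compare the morning and night shifts rather than evening and morning.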

Answer 4

osman, 21 August 2025

0 votes

Thanks, yes. I ran the check you suggested and both calls gave the same p-value.

clear all
load('shift.mat')
lmeR = fitlme(shift,'QCDev ~ Shift + (1|Operator)')
lmeE = fitlme(shift,'QCDev ~ Shift + (1|Operator)','DummyVarCoding','effects')

% Test if there is any difference between the evening and morning shifts.
pVal_ref = coefTest(lmeR,[0 1 0])
pVal_E   = coefTest(lmeE,[0 1 -1])

% The output:
% pVal_ref =
%     0.4392
% pVal_E =
%     0.4392

Here is the link to the page where they test the reference coded model with the wrong coefTest H matrix, [0 1 -1] (the one for effects coding):

https://www.mathworks.com/help/stats/linearmixedmodel.coeftest.html

Basically, they used the H matrix of the effects coded model with the reference coded model. I guess the mistake arose because the page linked below fits the same model with effects coding, and the help section for coefTest was not written the same way.

https://www.mathworks.com/help/stats/fitlme.html

Do you agree?

Comments: 1

Paul, 28 August 2025

Yeah I do agree that's what it looks like.
