Need advice for coding dummyvar vectors - Regression

Douglas Leaffer on 14 Jun 2023
Commented: Douglas Leaffer on 15 Jun 2023
How should I properly code dummyvar vectors for use in regression analysis in MATLAB? I have attached a sample table of data (.xlsx file) that I want to import into MATLAB and then run regressions on (the outcome is the last column in the table). Some of the categorical variables, such as WindAirport and WindRail, are simply coded as 1 or 0 (as 'double' variables); others are logicals (true/false). In my model output I need to show both groups, such as Site_0 and Site_1, with their regression slope coefficients, as well as the model intercept term and its coefficient. Should all dummyvars be categorical, logical, or double to achieve the desired model output? I will use fitglm as the model function. Any advice is welcome. Thank you.
T = readtable('chels_sample.xlsx'); % readtable keeps the column names that modelspec refers to; xlsread would return a plain numeric matrix
modelspec = 'lnUFP~ 1 + Day_0 + Day_1 + WindAirport + WindRail'; % just a few binary terms, for example
mdl = fitglm(T,modelspec,'Distribution','normal')

Answers (1)

the cyclist on 14 Jun 2023
Edited: the cyclist on 14 Jun 2023
The documentation for fitglm states:
"If data is in a table or dataset array tbl, then, by default, fitglm treats all categorical values, logical values, character arrays, string arrays, and cell arrays of character vectors as categorical variables."
It looks like Day_0 and Day_1 were read in as logical:
T=readtable('chels_sample.xlsx');
class(T.Day_0)
ans = 'logical'
class(T.Day_1)
ans = 'logical'
but that WindAirport and WindRail were not:
class(T.WindAirport)
ans = 'double'
class(T.WindRail)
ans = 'double'
so I would explicitly convert those two variables to categorical
T.WindAirport = categorical(T.WindAirport);
T.WindRail = categorical(T.WindRail);
before calling the model
modelspec = 'lnUFP~ 1 + Day_0 + Day_1 + WindAirport + WindRail'; % just a few binary terms, for example
mdl = fitglm(T,modelspec,'Distribution','normal')
mdl =
Generalized linear regression model:
    lnUFP ~ 1 + Day_0 + Day_1 + WindAirport + WindRail
    Distribution = Normal

Estimated Coefficients:
                     Estimate       SE        tStat       pValue
                     ________    ________    _______    ___________
    (Intercept)        9.2949    0.072015     129.07    2.0029e-109
    WindAirport_1      0.7141     0.35646     2.0033       0.047963
    WindRail_1        -0.4161     0.10031    -4.1483     7.2379e-05

99 observations, 96 error degrees of freedom
Estimated Dispersion: 0.244
F-statistic vs. constant model: 12.1, p-value = 2.11e-05
The WindAirport_1 coefficient is the estimated change in the response when WindAirport is (categorical) 1, relative to the reference level WindAirport = 0.
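For reference, here is a minimal sketch of two ways to get this treatment. It uses a small synthetic table rather than the attached file, so the data, the response name y, and the seed are all assumptions made for illustration. It lists every variable's class in one call, then either converts the 0/1 doubles as done above or leaves them as doubles and names them in fitglm's 'CategoricalVars' option.

```matlab
% Synthetic stand-in for the spreadsheet (assumed data, for illustration only)
rng(0);                                   % reproducible random numbers
n = 100;
WindAirport = double(rand(n,1) > 0.7);    % 0/1 stored as double
WindRail    = double(rand(n,1) > 0.5);    % 0/1 stored as double
y = 9 + 0.7*WindAirport - 0.4*WindRail + 0.5*randn(n,1);
T = table(WindAirport, WindRail, y);

% List the class of every table variable in one call
classes = varfun(@class, T, 'OutputFormat', 'cell')

% Option 1: convert the columns to categorical before fitting
T1 = T;
T1.WindAirport = categorical(T1.WindAirport);
T1.WindRail    = categorical(T1.WindRail);
mdl1 = fitglm(T1, 'y ~ 1 + WindAirport + WindRail', 'Distribution', 'normal');

% Option 2: keep the doubles and declare them categorical at fit time
mdl2 = fitglm(T, 'y ~ 1 + WindAirport + WindRail', 'Distribution', 'normal', ...
    'CategoricalVars', {'WindAirport', 'WindRail'});

% Both fits should report the same coefficient names and estimates
disp(mdl1.CoefficientNames)
disp(max(abs(mdl1.Coefficients.Estimate - mdl2.Coefficients.Estimate)))
```

Converting the columns changes the table itself, which helps if several models will be fitted; the 'CategoricalVars' route leaves the table untouched and affects only that one fit.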
Comments: 3
the cyclist on 15 Jun 2023
The overall model intercept term is in the output: Intercept = 9.2949. The intercept is the value of the response when
  • all categorical explanatory variables are at their reference level, and
  • all continuous explanatory variables are zero.
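As a small worked illustration, the estimates reported in the model output can be combined by hand to get the predicted lnUFP for each combination of the two 0/1 variables. The variable names below are chosen for this sketch only; the numbers are copied from the output in the answer.

```matlab
% Estimates copied from the fitglm output in the answer
b0       = 9.2949;    % (Intercept)
bAirport = 0.7141;    % WindAirport_1
bRail    = -0.4161;   % WindRail_1

% Every combination of the two 0/1 variables
[airport, rail] = meshgrid([0 1], [0 1]);

% Predicted lnUFP: intercept plus the offset of each non-reference level
pred = b0 + bAirport*airport(:) + bRail*rail(:);
disp(table(airport(:), rail(:), pred, ...
    'VariableNames', {'WindAirport', 'WindRail', 'PredictedLnUFP'}))
```

The row with both variables at 0 reproduces the intercept, 9.2949, because no offset is added at the reference levels.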
I notice that Day_0 and Day_1 are constant in your data, which I expect is why there are no estimated coefficients for them. (Perhaps you only uploaded a subset of the data?) If they are constant, they should not be in the model. The same seems to be true for Site_0 and Site_1, and many of your other variables. So, I don't understand that.
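One way to spot such variables before fitting is to count the distinct values in each column. The sketch below uses a toy table with made-up values (not the attached file), so treat it as an assumption-laden illustration rather than a check of the actual data.

```matlab
% Toy table with one constant and two varying columns (assumed data)
T = table(true(5,1), [0; 1; 0; 1; 1], [9.1; 9.4; 8.8; 9.9; 9.3], ...
    'VariableNames', {'Day_1', 'WindRail', 'lnUFP'});

% Count the distinct values in each variable
% (note: unique counts every NaN separately, so handle missing values first)
nLevels = varfun(@(x) numel(unique(x)), T, 'OutputFormat', 'uniform');

% Names of variables that never change, so no coefficient can be estimated
constantVars = T.Properties.VariableNames(nLevels == 1)
```

Any name returned in constantVars can be dropped from modelspec, since a constant column carries no information beyond the intercept.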
For the categorical variables that do have different values (e.g. WindRail), the estimate reported is the change in response for the different levels (e.g. WindRail=1), relative to the reference level (WindRail=0). I would not call that a slope, which would only be calculated for a continuous variable.
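To make that interpretation concrete, here is a hedged sketch on synthetic data (the values, sample size, and seed are assumptions for illustration). With a single two-level categorical predictor and a normal distribution with the default identity link, the intercept should equal the mean response at the reference level, and the WindRail_1 estimate should equal the difference between the two group means.

```matlab
% Synthetic data (assumed, for illustration only)
rng(1);
n = 200;
WindRail = categorical(double(rand(n,1) > 0.5));   % categories '0' and '1'
lnUFP = 9.3 - 0.4*(WindRail == '1') + 0.5*randn(n,1);
T = table(WindRail, lnUFP);

mdl = fitglm(T, 'lnUFP ~ 1 + WindRail', 'Distribution', 'normal');

% Group means computed directly from the data
m0 = mean(T.lnUFP(T.WindRail == '0'));   % reference level
m1 = mean(T.lnUFP(T.WindRail == '1'));

% Column 1: fitted estimates; column 2: the matching quantity from the means
est = mdl.Coefficients.Estimate;
disp([est(1), m0; est(2), m1 - m0])
```

With more than one predictor in the model the estimates are adjusted for the other terms, so this exact equality with raw group means holds only in the single-predictor case.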
Douglas Leaffer on 15 Jun 2023
Thank you again @the cyclist

Release

R2022a
