F-statistic and t-statistic

F-statistic

Purpose

In linear regression, the F-statistic is the test statistic for the analysis of variance (ANOVA) approach to test the significance of the model or the components in the model.

Definition

The F-statistic in the linear model output display is the statistic for testing the statistical significance of the model. The model property ModelFitVsNullModel contains the same statistic.

The F-statistic values in the anova display allow you to assess the significance of the terms or components in the model.

How To

Fit a regression model (mdl) by using fitlm or stepwiselm. Then, you can:

Find the F-statistic vs. constant model in the output display or by using
```
disp(mdl)
```
Display the F-statistic of the model by entering
```
mdl.ModelFitVsNullModel
```
Display the ANOVA for the model using
```
anova(mdl,'summary')
```
Obtain the F-statistic values for the components, except for the constant term, by using
```
anova(mdl)
```
For details, see the anova method of the LinearModel class.

Assess Fit of Model Using F-statistic

This example shows how to assess the fit of the model and the significance of the regression coefficients using the F-statistic.

Load the sample data.

```
load hospital
tbl = table(hospital.Age,hospital.Weight,hospital.Smoker,hospital.BloodPressure(:,1), ...
      'VariableNames',{'Age','Weight','Smoker','BloodPressure'});
tbl.Smoker = categorical(tbl.Smoker);
```

Fit a linear regression model.

```
mdl = fitlm(tbl,'BloodPressure ~ Age*Weight + Smoker + Weight^2')
```

mdl = 
Linear regression model:
    BloodPressure ~ 1 + Smoker + Age*Weight + Weight^2

Estimated Coefficients:
                    Estimate        SE         tStat        pValue  
                   __________    _________    ________    __________

    (Intercept)        168.02       27.694       6.067    2.7149e-08
    Age              0.079569      0.39861     0.19962       0.84221
    Weight           -0.69041       0.3435     -2.0099      0.047305
    Smoker_true        9.8027       1.0256      9.5584    1.5969e-15
    Age:Weight     0.00021796    0.0025258    0.086294       0.93142
    Weight^2        0.0021877    0.0011037      1.9822      0.050375


Number of observations: 100, Error degrees of freedom: 94
Root Mean Squared Error: 4.73
R-squared: 0.528,  Adjusted R-Squared: 0.503
F-statistic vs. constant model: 21, p-value = 4.81e-14

The F-statistic of the linear fit versus the constant model is 21, with a p-value of 4.81e-14. The model is significant at the 5% significance level. The R-squared value of 0.528 means the model explains about 53% of the variability in the response. There might be other predictor (explanatory) variables that are not included in the current model.

You can also programmatically access the F-statistic of the model.

```
mdl.ModelFitVsNullModel
```

ans = struct with fields:
        Fstat: 21.0120
       Pvalue: 4.8099e-14
    NullModel: 'constant'

Display the ANOVA table for the fitted model.

```
anova(mdl,'summary')
```

ans=5×5 table
                   SumSq     DF    MeanSq      F         pValue  
                   ______    __    ______    ______    __________

    Total          4461.2    99    45.062                        
    Model          2354.5     5     470.9    21.012    4.8099e-14
    . Linear       2263.3     3    754.42    33.663    7.2417e-15
    . Nonlinear    91.248     2    45.624    2.0358        0.1363
    Residual       2106.6    94    22.411

This display separates the variability in the model into linear and nonlinear terms. Because there are two nonlinear terms (Weight^2 and the interaction between Weight and Age), the nonlinear degrees of freedom in the DF column is 2. There are three linear terms in the model (one Smoker indicator variable, Weight, and Age), so the linear degrees of freedom is 3. The corresponding F-statistics in the F column test the significance of the linear and nonlinear terms as separate groups. Here, the linear group is significant at the 5% significance level (p-value of 7.2417e-15), whereas the nonlinear group is not (p-value of 0.1363).
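As a quick arithmetic check, the following sketch recomputes the model F-statistic and R-squared from the sums of squares in the summary table. The numeric values are copied by hand from that display, so they carry its rounding. The variable names are illustrative, and fcdf is assumed to be available from Statistics and Machine Learning Toolbox.

```matlab
% Values copied from the anova(mdl,'summary') display (rounded as shown there).
ssTotal = 4461.2;
ssModel = 2354.5;  dfModel = 5;
ssResid = 2106.6;  dfResid = 94;

msModel = ssModel/dfModel;    % mean square for the model
msResid = ssResid/dfResid;    % mean squared error
F  = msModel/msResid;         % F-statistic vs. constant model
R2 = 1 - ssResid/ssTotal;     % R-squared

% The upper-tail probability of the F distribution with (dfModel, dfResid)
% degrees of freedom is the p-value of the test.
pValue = fcdf(F,dfModel,dfResid,'upper');

fprintf('F = %.3f, R-squared = %.3f, p-value = %.3g\n',F,R2,pValue);
```

Up to the rounding of the copied inputs, F and R2 should agree with the displayed values of 21.012 and 0.528.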

When there are replicated observations, the residual term is also separated into two parts: the error due to lack of fit, and the pure error, which is independent of the model and is obtained from the replicated observations. In that case, the F-statistic is for testing the lack of fit, that is, whether the fit is adequate. In this example, however, there are no replicated observations.

Display the ANOVA table for the model terms.

```
anova(mdl)
```

ans=6×5 table
                   SumSq      DF     MeanSq         F          pValue  
                  ________    __    ________    _________    __________

    Age             62.991     1      62.991       2.8107      0.096959
    Weight        0.064104     1    0.064104    0.0028604       0.95746
    Smoker          2047.5     1      2047.5       91.363    1.5969e-15
    Age:Weight     0.16689     1     0.16689    0.0074466       0.93142
    Weight^2        88.057     1      88.057       3.9292      0.050375
    Error           2106.6    94      22.411

This display decomposes the ANOVA table into the model terms. The corresponding F-statistics in the F column assess the statistical significance of each term. For example, the F-test for Smoker tests whether the coefficient of the indicator variable for Smoker is different from zero. That is, the F-test determines whether being a smoker has a significant effect on BloodPressure. The degrees of freedom for each model term is the numerator degrees of freedom for the corresponding F-test. All the terms have one degree of freedom. In the case of a categorical variable, the degrees of freedom is the number of indicator variables. Smoker has only one indicator variable, so it also has one degree of freedom.
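When an F-test with one numerator degree of freedom and a t-test examine the same single coefficient, the F-statistic is the square of the t-statistic. The sketch below checks this for Smoker, Age:Weight, and Weight^2, using values copied by hand from the coefficient table and the anova(mdl) display in this example. The variable names are illustrative.

```matlab
% Values copied from the displays in this example (rounded as shown there).
terms  = {'Smoker'; 'Age:Weight'; 'Weight^2'};
tStat  = [9.5584; 0.086294; 1.9822];      % tStat column of the coefficient table
Fanova = [91.363; 0.0074466; 3.9292];     % F column of anova(mdl)

% Relative difference between the squared t-statistic and the F-statistic.
relDiff = abs(tStat.^2 - Fanova)./Fanova;

for k = 1:numel(terms)
    fprintf('%-11s t^2 = %.5g, F = %.5g\n',terms{k},tStat(k)^2,Fanova(k));
end
```

The same relationship does not hold in this output for Age and Weight. For example, 0.19962^2 is about 0.04, not 2.8107. Both terms also appear in higher-order terms, so anova evidently tests them differently from the coefficient t-tests. See the anova documentation for the sum-of-squares type it uses.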

t-statistic

Purpose

In linear regression, the t-statistic is useful for making inferences about the regression coefficients. The hypothesis test on coefficient i tests the null hypothesis that the coefficient is equal to zero (meaning the corresponding term is not significant) versus the alternative hypothesis that the coefficient is different from zero.

Definition

For a hypothesis test on coefficient i, with

H₀ : β_i = 0

H₁ : β_i ≠ 0,

the t-statistic is:

$t = \frac{b_i}{SE(b_i)},$

where SE(b_i) is the standard error of the estimated coefficient b_i.

How To

After obtaining a fitted model, say, mdl, using fitlm or stepwiselm, you can:

Find the coefficient estimates, the standard errors of the estimates (SE), and the t-statistic values of hypothesis tests for the corresponding coefficients (tStat) in the output display.
Show the display by using
```
display(mdl)
```

Assess Significance of Regression Coefficients Using t-statistic

This example shows how to test for the significance of the regression coefficients using the t-statistic.

Load the sample data and fit the linear regression model.

```
load hald
mdl = fitlm(ingredients,heat)
```

mdl = 
Linear regression model:
    y ~ 1 + x1 + x2 + x3 + x4

Estimated Coefficients:
                   Estimate      SE        tStat       pValue 
                   ________    _______    ________    ________

    (Intercept)      62.405     70.071      0.8906     0.39913
    x1               1.5511    0.74477      2.0827    0.070822
    x2              0.51017    0.72379     0.70486      0.5009
    x3              0.10191    0.75471     0.13503     0.89592
    x4             -0.14406    0.70905    -0.20317     0.84407


Number of observations: 13, Error degrees of freedom: 8
Root Mean Squared Error: 2.45
R-squared: 0.982,  Adjusted R-Squared: 0.974
F-statistic vs. constant model: 111, p-value = 4.76e-07

You can see that for each coefficient, tStat = Estimate/SE. The $p$-values for the hypothesis tests are in the pValue column. Each $t$-statistic tests the significance of its term given the other terms in the model. According to these results, none of the coefficients is significant at the 5% significance level, although the R-squared value for the model is very high at 0.982. This combination often indicates multicollinearity among the predictor variables.
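The relationship tStat = Estimate/SE, and the two-sided p-values that follow from it, can be reproduced by hand. The sketch below uses values copied from the coefficient table of the full model, so they carry its rounding. It assumes that tcdf from Statistics and Machine Learning Toolbox is available. The variable names are illustrative.

```matlab
% Values copied from the coefficient table of the full hald model.
estimate = [62.405; 1.5511; 0.51017; 0.10191; -0.14406];
se       = [70.071; 0.74477; 0.72379; 0.75471; 0.70905];
dfe      = 8;                      % error degrees of freedom from the display

tStat = estimate./se;              % t-statistic for each coefficient

% Two-sided p-value: probability that a t-distributed variable with dfe
% degrees of freedom exceeds |tStat| in absolute value.
pValue = 2*tcdf(-abs(tStat),dfe);

disp(table(estimate,se,tStat,pValue, ...
    'RowNames',{'(Intercept)','x1','x2','x3','x4'}));
```

Up to rounding of the copied inputs, the tStat and pValue columns should match the fitlm display.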

Use stepwise regression to decide which variables to include in the model.

```
load hald
mdl = stepwiselm(ingredients,heat)
```

1. Adding x4, FStat = 22.7985, pValue = 0.000576232
2. Adding x1, FStat = 108.2239, pValue = 1.105281e-06

mdl = 
Linear regression model:
    y ~ 1 + x1 + x4

Estimated Coefficients:
                   Estimate       SE        tStat       pValue  
                   ________    ________    _______    __________

    (Intercept)       103.1       2.124      48.54    3.3243e-13
    x1                 1.44     0.13842     10.403    1.1053e-06
    x4             -0.61395    0.048645    -12.621    1.8149e-07


Number of observations: 13, Error degrees of freedom: 10
Root Mean Squared Error: 2.73
R-squared: 0.972,  Adjusted R-Squared: 0.967
F-statistic vs. constant model: 177, p-value = 1.58e-08

In this example, stepwiselm starts with the constant model (the default) and uses forward selection to add x4 and then x1. Each predictor variable in the final model is significant given that the other one is in the model. The algorithm stops when adding any of the remaining predictor variables does not significantly improve the model. For details on stepwise regression, see stepwiselm.
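The FStat value reported for the step that adds x1 is tied to the t-statistic for x1 in the final model. When a step adds a single coefficient, its F-statistic equals the squared t-statistic of that coefficient in the enlarged model, and the p-values coincide (1.105281e-06 in the step log and 1.1053e-06 in the coefficient table). A minimal check, with values copied by hand from the output above and illustrative variable names:

```matlab
% Values copied from the stepwiselm output (rounded as shown there).
tX1     = 10.403;      % tStat for x1 in the final model y ~ 1 + x1 + x4
FAddX1  = 108.2239;    % FStat reported for the step "Adding x1"

% Relative difference between the squared t-statistic and the step F-statistic.
relDiff = abs(tX1^2 - FAddX1)/FAddX1;
fprintf('t^2 = %.4f, FStat = %.4f\n',tX1^2,FAddX1);
```

The small remaining difference comes from the rounding of the displayed t-statistic.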