ANOVA, t tests, and linear regression
Robert W Platt
McGill University/Montreal Children's Hospital Research Institute, 2300 Tupper, Montreal, PQ H3H 1P3, Canada
Correspondence to: Dr Platt.

In the last issue, I discussed logistic regression and the structure of linear models when the response or outcome is binary. Binary outcomes can take on only two values, like dead/alive or boy/girl, whereas continuous outcomes can take on any value on a numeric scale, like blood pressure or weight. Now, let's take a step back and consider the various models and tests for continuous outcomes. The common theme in these methods is explaining variability in the response variable: the total variability of the response is divided into variation that can be explained and random variation that cannot.

The t test is probably the simplest commonly used statistical procedure. It compares the mean of a continuous variable in two different populations. When the two population means are truly equal, the difference between the two sample means divided by its standard error follows a special distribution, known in this case as the “t distribution”. This relationship also allows construction of confidence intervals for the difference in means, and these provide information about the mean difference and its variability. When the difference between the two means (the between groups variability) is large relative to its standard error (the random variability), the t test will be statistically significant.
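As a rough illustration of the ratio described above, the following sketch computes an equal-variance (pooled) two-sample t statistic and a confidence interval using only the Python standard library. The data, the function name `pooled_t_statistic`, and the choice of the pooled form of the test are assumptions made for this example. The critical value 2.306 is the usual two-sided 95% value for 8 degrees of freedom from t tables.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Return (mean difference, standard error, t statistic, degrees of freedom).

    Assumes both groups share a common variance (the classical pooled t test).
    """
    n_a, n_b = len(a), len(b)
    diff = mean(a) - mean(b)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    # Standard error of the difference in means.
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, se, diff / se, n_a + n_b - 2


# Hypothetical measurements, invented for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [3.0, 4.0, 5.0, 6.0, 7.0]

diff, se, t_stat, df = pooled_t_statistic(group_a, group_b)

# 95% confidence interval for the difference in means.
t_crit = 2.306  # tabulated two-sided 95% critical value for df = 8
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"difference = {diff:.3f}, SE = {se:.3f}, t = {t_stat:.3f}, df = {df}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval is often more informative than the test alone, since it shows both the size of the difference and how precisely it has been estimated.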

What happens when we want to test if there is a difference in means among three or more groups? Analysis of variance, or ANOVA, generalizes the t test to several groups. Since there are more than two groups being compared, we have to look at more than just mean differences. The method for testing whether the mean level in all of the groups is the same follows a general pattern similar to that for the t test. The variance between groups summarizes the part of the total variability in the measures that can be explained by the assumption that the measurements come from different populations. The ratio between this “between groups variance” and the variance within the groups is high when there is a significant difference. This will occur when the means of the groups are far apart and the variability within the groups is small. The appropriate test of statistical significance here is the F test, which compares the ratio of the two variances to values found in F distribution tables.
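The partition of variability described above can be sketched in a few lines. This is a minimal one-way ANOVA on invented data; the function name `one_way_anova` and the dictionary keys are assumptions for this example. It returns only the F ratio and its degrees of freedom, which would then be compared with F tables.

```python
from statistics import mean


def one_way_anova(groups):
    """Partition total variability into between and within group parts."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Variability explained by group membership.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Random variability left over inside each group.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Total variability around the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "f": ms_between / ms_within,
        "df": (k - 1, n - k),
    }


# Hypothetical data for three groups, invented for illustration only.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
result = one_way_anova(groups)
print(result)
```

In this sketch the between and within sums of squares add up to the total sum of squares, which is the sense in which the total variability is "divided" into explained and unexplained parts.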

The general test in the ANOVA model tests the null hypothesis that all of the group means are equal. Rejecting this hypothesis means that we believe that at least one difference of two means is not zero; often, we are interested in a specific difference, or in finding out which of the differences is significantly different from zero. This requires a second step that compares individual means using a modified version of the t test; a variety of common multiple comparison procedures are available for this purpose.
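The text does not name a particular second-step procedure, so the following is only one possible sketch. It computes a t statistic for every pair of groups, using the within groups mean square from the ANOVA as the common variance estimate. The data and the function name `pairwise_t_statistics` are assumptions for this example. Procedures such as Bonferroni or Tukey differ mainly in the critical value each statistic is compared against, which is not computed here.

```python
import math
from itertools import combinations
from statistics import mean


def pairwise_t_statistics(groups):
    """Return {(i, j): t} for every pair of groups.

    Each t uses the within groups mean square from the one-way ANOVA
    as the shared estimate of random variability.
    """
    n = sum(len(g) for g in groups)
    k = len(groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_within = ss_within / (n - k)

    stats = {}
    for i, j in combinations(range(k), 2):
        diff = mean(groups[i]) - mean(groups[j])
        se = math.sqrt(ms_within * (1 / len(groups[i]) + 1 / len(groups[j])))
        stats[(i, j)] = diff / se
    return stats


# Hypothetical data for three groups, invented for illustration only.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
pairwise = pairwise_t_statistics(groups)
for pair, t in pairwise.items():
    print(pair, round(t, 3))
```

Because several comparisons are made at once, each statistic should be judged against a stricter threshold than a single t test would use.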

Finally, consider the situation where, rather than dividing the population into groups, we wish to examine the association between a continuous outcome and a continuous variable (this can be thought of as an ANOVA where we have many different groups and these groups are ordered by the values of the continuous covariate). Here, we use linear regression, which associates the two variables through a β coefficient [1]. This can easily be generalized to multiple regression, where we consider several covariates at the same time to try to understand their joint relationship to the outcome.
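A minimal sketch of simple linear regression by least squares follows, assuming a single covariate and invented data. The function name `fit_line` and the variable names are hypothetical. The slope returned is the β coefficient: the estimated change in the outcome for a one unit increase in the covariate.

```python
from statistics import mean


def fit_line(x, y):
    """Least squares fit of y = alpha + beta * x; returns (alpha, beta)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of the covariate.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Sum of cross products of covariate and outcome deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Hypothetical covariate and outcome values, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

alpha, beta = fit_line(x, y)
print(f"intercept = {alpha:.2f}, slope (beta) = {beta:.2f}")
```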

The t test can be thought of as a simple regression model with the covariate taking on only two values, and the ANOVA can also be viewed as a regression model with multiple covariates, typically one indicator variable for each group other than a reference group. More complicated ANOVA models can also be thought of in regression frameworks. The regression approach requires more work but it allows us to consider all these models in one unified framework and thus allows complete control of the comparisons made. Further, the calculation of the β coefficients and standard errors for these coefficients allows us to use confidence intervals rather than relying on hypothesis tests as in the ANOVA.
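The equivalence between the t test and a regression on a two valued covariate can be checked numerically. In this sketch the covariate is coded 0 for one group and 1 for the other (the data and this coding are assumptions for the example). The fitted slope equals the difference in group means, and the slope divided by its standard error equals the pooled two-sample t statistic.

```python
import math
from statistics import mean, variance

# Hypothetical measurements, invented for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [3.0, 4.0, 5.0, 6.0, 7.0]

# Regression view: covariate is 0 for group A and 1 for group B.
x = [0.0] * len(group_a) + [1.0] * len(group_b)
y = group_a + group_b

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta = sxy / sxx                 # slope: change in mean from group A to group B
alpha = y_bar - beta * x_bar     # intercept: mean of group A

# Standard error of the slope from the residual variance.
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
resid_var = sum(r ** 2 for r in residuals) / (len(y) - 2)
se_beta = math.sqrt(resid_var / sxx)
t_regression = beta / se_beta

# t test view: pooled two-sample t statistic for B minus A.
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
t_direct = (mean(group_b) - mean(group_a)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

print(f"slope = {beta:.3f}, mean difference = {mean(group_b) - mean(group_a):.3f}")
print(f"t from regression = {t_regression:.3f}, t from pooled test = {t_direct:.3f}")
```

The same standard error that gives the t statistic also gives a confidence interval for β, which is the advantage of the regression view noted above.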

These three procedures are the main ways of dealing with the association of a continuous variable with continuous or categorical (grouping) covariates. The regression approach has many advantages, including the unified framework, the easy use of confidence intervals, and the option to manipulate the covariates, that usually make it the best choice.
