## Introduction

Elsewhere1 we have described the rationale for carrying out cluster randomised controlled trials (CRCTs) in injury prevention and discussed key issues relating to the design and ethical conduct of such studies. In this companion paper we focus on sample size calculations for cluster randomised trials and on the methods for statistical analyses of these studies.

## Design effect and the intracluster correlation coefficient

As previously reported in our companion paper,1 a major disadvantage of CRCTs is that they generally require a larger sample size than individually randomised trials. The increase can be substantial, depending on the size of the clusters being randomised and the degree of similarity of outcomes among members of the same cluster. The key measure of this similarity is the intracluster (or intraclass) correlation coefficient (ICC), often denoted *ρ*, which reflects the correlation between outcome values in members of the same cluster. If all members of the same cluster have identical values of the outcome measure, the ICC is equal to 1. An ICC of 0 indicates that there is no correlation in outcome values between members of the same cluster, so that a member of any particular cluster is no more similar to another member of the same cluster than to a member of a different cluster. Negative values of the ICC would occur where outcomes for members of the same cluster are less alike than outcomes for members of different clusters; this is unlikely to be the case in CRCTs, although estimates of the ICC may be negative because of sampling error. Assuming that the ICC takes only non-negative values, it can also be expressed as the ratio of the between-cluster variation to the total variation in the outcome of interest, that is, the proportion of the total variation in the outcome that occurs between clusters.i

A study of ICCs in CRCTs in primary care organisations for a range of outcomes found that the majority of values were less than 0.055.2 Typically, ICCs in injury prevention CRCTs are less than 0.2 (see table 1), but they can vary substantially depending on the type of cluster and outcome measure and also because of sampling error. For example, in a trial of safety advice at child health surveillance centres with general practice as the cluster and a mean of 55 children per cluster, the ICC for any medically attended injury in the study children was 0.017,8 whereas in a trial assessing the effect of giving out free smoke alarms where electoral wards containing a mean of 3686 households were randomised, the ICC for all injuries was 0.00017.7 Some studies carried out in schools have found larger ICCs—for example, 0.21 for use of any visibility aid at 8 weeks in a study assessing the effect of distributing free visibility aids to schoolchildren,11 where schools were randomised and one class of schoolchildren in each school was included, and 0.187 for children's knowledge of fire and burn prevention in a trial where teachers delivered a safety intervention to school classes.10 ICCs tend to be larger in studies with naturally occurring smaller clusters (eg, families, school classes) and smaller in larger clusters (eg, electoral wards), since people in small clusters are usually more alike than people in larger clusters. ICCs also tend to increase as the prevalence of a binary outcome approaches 0.5,13 and to be larger for process measures such as adherence to guidelines compared with patient-specific outcomes.14 ICCs can be adjusted for characteristics of study participants, such as age and sex, or of clusters, such as size, which often reduces the magnitude of the ICCs.2

### Reporting the intracluster correlation coefficient observed in the trial for each outcomeii

The ICC can be retrospectively estimated after trial data have been collected by estimation of the components of variance of the outcome measure, using analysis of variance, random effects models or other appropriate methods.16 The ICC for each outcome, preferably with a CI, should be given in reports of CRCTs: published values of ICCs can be invaluable for researchers planning CRCTs in similar areas, as it is essential that such trials adequately account for clustering in order to avoid being underpowered.
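One way to make the analysis of variance estimator concrete is the minimal sketch below. It assumes a continuous outcome recorded per participant and uses the standard one-way ANOVA estimator; the function name `anova_icc` and the toy data are hypothetical, and the sketch does not produce a CI.

```python
def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC.

    `clusters` is a list of lists, one inner list of outcome values per cluster.
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)
    grand_mean = sum(sum(c) for c in clusters) / n_total
    means = [sum(c) / len(c) for c in clusters]

    # Between-cluster and within-cluster sums of squares
    ss_between = sum(m * (mean - grand_mean) ** 2 for m, mean in zip(sizes, means))
    ss_within = sum((y - mean) ** 2 for c, mean in zip(clusters, means) for y in c)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Adjusted mean cluster size; equals the common size when all sizes are equal
    m0 = (n_total - sum(s * s for s in sizes) / n_total) / (k - 1)

    return (ms_between - ms_within) / (ms_between + (m0 - 1) * ms_within)


# Identical values within each cluster give an ICC of 1
print(anova_icc([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
# Clusters that differ only within, not between, give a negative estimate
print(anova_icc([[1, 2, 3], [1, 2, 3]]))
```

The second call illustrates that the estimate can fall below zero even though negative true ICCs are unlikely in CRCTs.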

## Sample size and power calculations

If the CRCT sample size calculations do not take clustering into account, the CRCT will be underpowered and have a high type 2 error rate, thereby wasting resources and potentially failing to detect interventions that may be beneficial in reducing injuries. Study power will also be reduced if the ICC used in calculating the sample size is lower than the ICC observed in the study. The effect of low power will be wide CIs and inconclusive results.

Determination of the sample size required for a CRCT must incorporate the cluster size and appropriate ICC values into the calculations. For the simplest case of a CRCT with two treatment arms, clusters of equal size and a continuous or binary outcome measure, the required sample size can be estimated in two stages. First, the sample size that would be required for an individually randomised trial is calculated according to the usual considerations of power, statistical significance and the minimal important difference to be detected, as well as the likely variability of the outcome measure (for continuous measures) or proportion with the outcome in the control arm (for binary measures). Then this estimate of sample size is multiplied by (1+(cluster size−1)×ICC), or equivalently (1+(m−1) ρ) where m is the cluster size and ρ is the ICC, to obtain the number of participants required for the CRCT. The total number of clusters required for the trial can then be obtained by dividing the total number of participants required (in both study arms) by the cluster size. The value of (1+(cluster size−1)×ICC) is called the design effect or variance inflation factor. This is the amount by which using a CRCT rather than an individually randomised design inflates the sample size.
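The two-stage calculation can be written compactly as follows. The symbols $n_{\text{ind}}$ (total participants for an individually randomised trial), $n_{\text{CRCT}}$ (total participants for the CRCT) and $k$ (total number of clusters) are notation introduced here for illustration; $m$ and $\rho$ are as defined above.

```latex
\begin{align}
  \text{design effect} &= 1 + (m - 1)\,\rho \\
  n_{\text{CRCT}} &= n_{\text{ind}} \times \bigl[\,1 + (m - 1)\,\rho\,\bigr] \\
  k &= \frac{n_{\text{CRCT}}}{m}
\end{align}
```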

As an example, the CRCT of an intervention delivered by midwives and health visitors to pregnant women to reduce baby walker use9 was designed to detect an absolute difference of 10% in baby walker possession when the babies were 9 months old. Detecting this difference in an individually randomised trial, with a two-sided 5% significance level and 80% power and assuming that 50% of mothers in the control arm possessed a baby walker, would require 388 mothers in each study arm (776 in total). To allow for clustering by general practice, it was estimated that the ICC would be 0.017 (based on a previous study) and the cluster size would be 23, giving a design effect of (1+(23−1)×0.017)=1.37. This increased the required sample size from a total of 776 to 1063 mothers (ie, 776×1.37), necessitating a total of 46 (ie, 1063/23) enrolled practices. In another CRCT, of warm-up exercises to prevent sports injuries in members of handball sports clubs, the investigators estimated the sample size assuming an ICC of 0.07 and an average sports club size of 15 members, based on pilot data.12 This gave a design effect of (1+(15−1)×0.07)=1.98, so the CRCT required about twice as many participants as an individually randomised design (1830 vs 915 participants).
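The baby walker calculation can be retraced with a short sketch. The trial report does not say which formula gave 388 per arm; the pooled-variance normal approximation for two proportions used below is an assumption that happens to reproduce that figure. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm_two_proportions(p_control, p_intervention, alpha=0.05, power=0.80):
    """Per-arm sample size for an individually randomised trial (assumed formula:
    normal approximation with a pooled variance under the null hypothesis)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_intervention) / 2
    null_sd = math.sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = math.sqrt(p_control * (1 - p_control)
                       + p_intervention * (1 - p_intervention))
    n = ((z_alpha * null_sd + z_beta * alt_sd) / (p_control - p_intervention)) ** 2
    return math.ceil(n)


def design_effect(cluster_size, icc):
    """Design effect 1 + (m - 1) * rho for equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


n_arm = n_per_arm_two_proportions(0.50, 0.40)   # 50% vs 40% possession
n_individual = 2 * n_arm
deff = design_effect(23, 0.017)

# The text rounds the design effect to two decimals before multiplying
n_crct = round(n_individual * round(deff, 2))
n_clusters = round(n_crct / 23)

print(n_arm, n_individual, round(deff, 3), n_crct, n_clusters)
```

Multiplying by the unrounded design effect (1.374) gives about 1066 rather than 1063, so in practice it is safer to round up at each step than to round to the nearest integer.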

The above description of sample size calculations assumes that the cluster size is constant, which may be appropriate when sampling a fixed number of participants within clusters, but typically the cluster size will vary. When the cluster size does not vary much, the average cluster size can be used in sample size calculations, although this will slightly underestimate the necessary sample size.16 Eldridge and colleagues have shown that if the coefficient of variation (SD/mean) of the cluster sizes is less than 0.23, this underestimation of sample size is negligible; however, for larger variation in cluster size, as typically occurs in studies randomising, for example, general clinical practices, a somewhat more complex formula should be used.17 An alternative approach to calculating sample sizes in CRCTs18 incorporates the coefficient of variation of the outcome between clusters in the sample size formulae. This method is particularly useful when the primary outcome measure is an incidence rate, for example, the rate of falls or medically attended injuries.
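The text does not reproduce the more complex formula. A commonly quoted form of the design effect for unequal cluster sizes is 1+((cv²+1)×mean cluster size−1)×ICC, where cv is the coefficient of variation of cluster size. The sketch below assumes that form, so it should be checked against reference 17 before use. The function names and the list of practice sizes are hypothetical.

```python
import statistics


def cluster_size_cv(sizes):
    """Coefficient of variation (SD/mean) of the cluster sizes."""
    return statistics.stdev(sizes) / statistics.mean(sizes)


def design_effect_unequal(mean_size, cv, icc):
    """Design effect allowing for variable cluster size (assumed formula).

    Reduces to 1 + (m - 1) * icc when cv is 0.
    """
    return 1 + ((cv ** 2 + 1) * mean_size - 1) * icc


# Hypothetical cluster sizes with a mean of 23
sizes = [8, 12, 15, 20, 23, 25, 30, 35, 39]
cv = cluster_size_cv(sizes)
print(round(cv, 2))
print(design_effect_unequal(statistics.mean(sizes), 0.0, 0.017))  # equal-size case
print(design_effect_unequal(statistics.mean(sizes), cv, 0.017))   # larger
```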

The sample size calculations can also be extended to accommodate stratified and matched-pair cluster randomised designs,16 although there may be insufficient information available at the planning stage of these trial designs to enable these formulae to be used. Power can be improved in these study designs if clusters are matched on or stratified by variables strongly related to the outcome. A conservative strategy if these designs are used is to use the same approach for sample size calculation as for the unmatched and unstratified designs.16 Computer simulation can be used in sample size calculations with more complex study designs to incorporate design features beyond those accommodated by standard formulae for sample size.
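As a rough illustration of the simulation approach, the sketch below estimates power for a simple two-arm CRCT with a binary outcome. Every modelling choice here is an assumption rather than a method from the cited references: cluster-level true proportions follow a beta distribution chosen to give the target ICC, the analysis is an unweighted unpaired t test on cluster proportions, and the caller supplies the critical value of t.

```python
import math
import random
import statistics


def simulate_cluster_proportions(rng, n_clusters, cluster_size, p, icc):
    """Observed proportion with the outcome in each simulated cluster."""
    if icc > 0:
        # Beta(a, b) with mean p and ICC = 1 / (a + b + 1)
        a = p * (1 - icc) / icc
        b = (1 - p) * (1 - icc) / icc
    proportions = []
    for _ in range(n_clusters):
        p_cluster = rng.betavariate(a, b) if icc > 0 else p
        events = sum(rng.random() < p_cluster for _ in range(cluster_size))
        proportions.append(events / cluster_size)
    return proportions


def simulated_power(n_clusters_per_arm, cluster_size, p_control, p_intervention,
                    icc, t_critical, n_sims=500, seed=1):
    """Proportion of simulated trials in which |t| exceeds t_critical."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        control = simulate_cluster_proportions(
            rng, n_clusters_per_arm, cluster_size, p_control, icc)
        treated = simulate_cluster_proportions(
            rng, n_clusters_per_arm, cluster_size, p_intervention, icc)
        pooled_var = (statistics.variance(control) + statistics.variance(treated)) / 2
        se = math.sqrt(2 * pooled_var / n_clusters_per_arm)
        diff = statistics.mean(treated) - statistics.mean(control)
        # A zero standard error is counted as a non-rejection in this sketch
        if se > 0 and abs(diff) / se > t_critical:
            rejections += 1
    return rejections / n_sims


# 23 practices per arm, 23 mothers each; 2.02 is roughly the two-sided 5%
# critical value of t on 44 degrees of freedom
print(simulated_power(23, 23, 0.50, 0.40, 0.017, t_critical=2.02))
```

The same skeleton can be extended with stratification, matching or variable cluster sizes by changing how the clusters are generated and analysed.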

The ICCs for calculating sample size can be estimated on the basis of the pattern of values from similar trials or studies that have reported these values, if they have used the same type of cluster (eg, school class) and the same outcome measure. If no published information is available, a pilot study could be carried out to ascertain this information.12 19 Published ICCs are likely to be imprecise, especially those from studies with a relatively small number of clusters, and it is worth carrying out sensitivity analyses to examine the effect of a range of ICC values on the required sample size.20 In the CRCT of an intervention to reduce baby walker use,9 the ICC observed in the trial was 0.053, which was higher than the value of 0.017 used in the sample size calculations. With this higher value, the study had a power of 60% rather than the planned 80% to detect a difference of 10% in baby walker possession between groups; underestimating the ICC for sample size calculations reduces study power. Use of ICCs adjusted for individual or cluster level characteristics can reduce the required sample size, but the analysis will need to incorporate this adjustment too.2
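A sensitivity analysis of this kind can be sketched as follows, using the baby walker trial figures quoted above (1063 mothers in total, cluster size 23, 50% vs 40% possession). The power formula is a simple normal approximation applied to the effective sample size, which is an assumption made here and not necessarily the calculation used by the trial authors. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def crct_power(n_total, cluster_size, icc, p_control, p_intervention, alpha=0.05):
    """Approximate power of a two-arm CRCT with equal arms and cluster sizes."""
    deff = 1 + (cluster_size - 1) * icc
    n_effective_per_arm = (n_total / 2) / deff   # deflate for clustering
    se = math.sqrt((p_control * (1 - p_control)
                    + p_intervention * (1 - p_intervention)) / n_effective_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_control - p_intervention) / se - z_alpha)


for icc in (0.0, 0.017, 0.03, 0.053, 0.08):
    power = crct_power(1063, 23, icc, 0.50, 0.40)
    print(f"ICC={icc:.3f}  design effect={1 + 22 * icc:.2f}  power={power:.2f}")
```

Under these assumptions the power is close to 80% at an ICC of 0.017 and close to 60% at 0.053, in line with the figures reported for the trial.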

### Reporting how the effects of clustering were incorporated into the sample size calculations

CRCTs should report how the sample size was calculated, including details of the method of calculation used, the number of clusters and the cluster size. The ICC (or coefficient of variation of the outcome) used in the calculations should also be stated, along with an indication of its uncertainty if available, as well as the effect size to be detected and the planned study power. The assumptions used in the calculations should be stated. As an example: ‘In youth team handball, the incidence of acute injuries to the knee and ankle is estimated to be 12 per 100 players per league season. From a pilot study conducted to determine the incidence of injury during the previous season (submitted for publication), we estimated that the cluster effects for club randomisation gave an inflation factor of 2.0 based on a cluster size of 15 and an intracluster correlation coefficient of 0.07. We then calculated that to achieve 90% power with α=5% to detect a RR reduction of 50%, we would need 915 players in each group. Therefore, when we initiated the trial, we were hoping to include 60 clubs in each group (a total of 120 clubs; with an average of 15 players in each club).’12

## Methods for data analysis

It is crucial that a CRCT is analysed using methods that account for clustering. If the analysis instead uses conventional methods, which assume independence of observations, the resulting p values are likely to be too small and the CIs too narrow, increasing the risk of a type 1 error, that is, falsely identifying an ineffective intervention as being effective.21 For example, a study of an intervention in schools to increase knowledge about wearing bicycle helmets did not appear to account for clustering of the 162 children within 12 school classes in the analysis.4 The finding of a higher knowledge score in the intervention group than in the control group would have been less statistically significant had the analysis accounted for clustering.

There are several alternative approaches to analysing CRCTs, described more fully elsewhere,16 22–24 with as yet no consensus as to which is most appropriate in a given situation, although all approaches that account for clustering in some way are preferable to an analysis that ignores the clustering altogether. The simplest approach is to aggregate the data from individual participants in each cluster; means, proportions or rates can be calculated for each cluster, depending on the outcome of interest. Then standard methods of analysis such as unpaired t tests or non-parametric tests can be used to compare these aggregated values between randomisation groups.16 25 These tests give equal weight to each cluster in the analysis but can be adapted to weight the analysis by cluster size26 or by minimal variance weights (values that are proportional to the inverse of the variances of the outcome for each cluster).27 Using minimal variance weights is more efficient than using cluster size weights, particularly with large cluster sizes. As an example of an analysis using aggregated data, Kendrick *et al*8 calculated the child injury rate in each general practice in their trial of safety advice and used a t test, weighted by the number of children in the practice, to compare intervention with control practices. In a CRCT evaluating a multifactorial programme to prevent falls in care wards for older people, Cumming *et al*6 used an unpaired t test to compare fall rates among hospital patients calculated for each participating hospital ward. 

A limitation of the cluster-level approach for analysis is that it cannot fully account for other variables that have been measured in individual participants and which may be imbalanced between study arms; accounting for these variables can remove the effect of imbalance and increase the precision of the estimate of intervention effect.2 One could aggregate these variables at cluster level, too, but at the risk of losing important information. For example, there may be little variation in the proportion of males at cluster level (eg, general practice), limiting the possibility of examining whether the effect of an intervention varies by sex, where power is sufficient to compare subgroups.
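The simplest cluster-level analysis can be sketched as below: each cluster is reduced to a single proportion, and the two arms are compared with an unweighted unpaired t statistic. The data are invented for illustration and the function name is hypothetical. The sketch returns the t statistic and its degrees of freedom, leaving the p value to be read from tables or a statistics package.

```python
import math
import statistics


def cluster_level_t(intervention, control):
    """Unpaired t statistic on cluster-level proportions.

    Each argument is a list of (events, participants) tuples, one per cluster.
    Every cluster gets equal weight, regardless of its size.
    """
    prop_i = [events / n for events, n in intervention]
    prop_c = [events / n for events, n in control]
    k_i, k_c = len(prop_i), len(prop_c)

    # Pooled variance of the cluster-level proportions
    pooled_var = ((k_i - 1) * statistics.variance(prop_i)
                  + (k_c - 1) * statistics.variance(prop_c)) / (k_i + k_c - 2)
    se = math.sqrt(pooled_var * (1 / k_i + 1 / k_c))

    diff = statistics.mean(prop_i) - statistics.mean(prop_c)
    return diff, diff / se, k_i + k_c - 2


# Hypothetical data: (injured children, children followed up) per practice
intervention = [(4, 52), (6, 60), (3, 48), (5, 55)]
control = [(7, 50), (9, 61), (6, 47), (8, 58)]
diff, t_stat, df = cluster_level_t(intervention, control)
print(round(diff, 3), round(t_stat, 2), df)
```

The degrees of freedom depend on the number of clusters, not the number of participants, which is why this approach has little power when few clusters are randomised.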

A more sophisticated approach to the analysis of clustered RCTs is to use multilevel modelling, also known as random effects or hierarchical modelling.16 22 24 This method incorporates the individual participant level outcomes into a statistical model that accounts for correlations between participants from the same cluster. In essence, the variation in responses is split into different levels—that is, variation between participants within clusters (level 1 variation) and variation between clusters (level 2 variation). The intervention group is included in the model as a level 2 variable as it is constant for participants within clusters (level 1). The estimates, significance levels and CIs derived for the intervention effect using these models will then account for the clustered nature of the data. The model can be extended to include cluster and participant level variables and also to include interactions between the intervention and cluster or participant variables, such as sex of participant. This method of analysis has been used to estimate the effect of the intervention in a number of CRCTs in injury prevention5 7 10 11 and also to examine whether the intervention effect varied by factors such as sex, age or deprivation.10 11
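For a continuous outcome, the simplest two-level model of this kind is a random intercept model, which can be written as below. The notation is introduced here for illustration: $y_{ij}$ is the outcome for participant $i$ in cluster $j$, $x_j$ indicates the study arm of cluster $j$, and $\sigma_b^2$ and $\sigma_w^2$ are the between-cluster and within-cluster variance components.

```latex
\begin{align}
  y_{ij} &= \beta_0 + \beta_1 x_j + u_j + e_{ij}, \\
  u_j &\sim N(0, \sigma_b^2) \quad \text{(level 2: between clusters)}, \\
  e_{ij} &\sim N(0, \sigma_w^2) \quad \text{(level 1: within clusters)}, \\
  \rho &= \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}.
\end{align}
```

Here $\beta_1$ is the intervention effect. Participant-level or cluster-level covariates, and their interactions with $x_j$, enter as additional fixed-effect terms.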

Another method of analysis that also uses the individual participant responses is called generalised estimating equations (GEEs).16 22 24 This method accounts for the correlation of responses within clusters by specifying a correlation structure for the analysis—for example, that the responses of cluster members are equally correlated. Unlike the random effects method, the GEE method does not explicitly model the variation between clusters. It makes fewer assumptions about the distribution of data than the multilevel approach, and is robust to incorrect specification of the correlation structure of the data. Cumming *et al*,6 in the study of a fall prevention programme, also carried out an analysis of fall rates at the individual level, using GEEs to allow for clustering by hospital ward. With this approach, they could adjust for both individual length of stay and the pre-intervention rate of falls in the ward.
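As an illustration, an exchangeable (equally correlated) working correlation structure for a cluster of three participants can be written as follows. The notation is assumed for this sketch: $g$ is a link function, $x_j$ is the study arm of cluster $j$, and $\alpha$ is the common within-cluster correlation.

```latex
\begin{align}
  g\bigl(E[y_{ij}]\bigr) &= \beta_0 + \beta_1 x_j, \\
  \operatorname{Corr}(y_{ij}, y_{i'j}) &= \alpha \quad (i \neq i'), \\
  R(\alpha) &=
  \begin{pmatrix}
    1 & \alpha & \alpha \\
    \alpha & 1 & \alpha \\
    \alpha & \alpha & 1
  \end{pmatrix}.
\end{align}
```

The model specifies only the mean and this working correlation; there is no cluster-level random term, which is the sense in which between-cluster variation is not explicitly modelled.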

The above methods of analysis can accommodate different types of outcome, including binary, continuous and ordinal outcomes as well as rates and time to an event. Although methodological work has tended to focus on continuous and binary outcomes, a recent paper28 compared the performance of different approaches in the analysis of counts and rates, which are often primary outcomes in injury research, and reported that multilevel models tended to have the best performance along with Bayesian hierarchical models.

Accounting for cluster and individual level covariates in the analysis can reduce the ICC2 and hence increase the precision of the statistical analysis; for example, in the trial by Cumming *et al*6 the unadjusted ICC was 0.014, but adjustment for length of stay and previous falls reduced it to 0.003. Multilevel models require an adequate number of clusters so that model assumptions can be checked, and GEE models require a reasonably large number of clusters (at least 20 per group) to be reliable29; with fewer clusters, cluster level analyses are more robust.16

Matched-pair CRCTs can be analysed using aggregate measures at cluster level16—for example by calculating means or proportions, depending on the outcome measure, in each cluster and then comparing these within matched pairs using a paired t test or non-parametric test such as the Wilcoxon signed rank sum test. Methods developed for meta-analysis can also be used to analyse paired cluster randomised trials30; this approach uses a random-effects model for a meta-analysis across the matched pairs. Differences in the outcome variable between intervention and control clusters are calculated for each matched pair, and these are then weighted appropriately to obtain the estimated effect of the intervention. This method of analysis has the advantage that results can be clearly displayed with a forest plot as used in standard meta-analyses. The analysis can also be extended to account for participant level covariates. Matched-pair studies can also be analysed using multilevel models, with an additional term for the matched pair. As with unmatched studies, this requires a large number of clusters to be reliable. In the trial assessing the effect of giving out free smoke alarms, the clusters (electoral wards) were pair-matched using a deprivation score. The analysis compared incidence rates of each outcome using a multilevel Poisson model with pair included as a level. The authors also carried out an analysis using the meta-analysis approach, which produced similar results.7
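The aggregate analysis of a matched-pair design can be sketched as a paired t statistic on the within-pair differences in cluster-level summaries. The rates below are invented for illustration and the function name is hypothetical. The meta-analysis and multilevel approaches described above are not shown.

```python
import math
import statistics


def paired_cluster_t(pairs):
    """Paired t statistic for matched-pair cluster summaries.

    `pairs` is a list of (intervention_value, control_value) tuples, one per
    matched pair, where each value is a cluster-level mean, proportion or rate.
    """
    diffs = [intervention - control for intervention, control in pairs]
    n_pairs = len(diffs)
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n_pairs)
    return mean_diff, mean_diff / se, n_pairs - 1


# Hypothetical injury rates per 1000 households in pair-matched wards
pairs = [(12.1, 14.0), (9.8, 10.5), (15.2, 15.0), (11.0, 13.4), (8.7, 9.9)]
mean_diff, t_stat, df = paired_cluster_t(pairs)
print(round(mean_diff, 2), round(t_stat, 2), df)
```

The degrees of freedom equal the number of pairs minus one, so power is limited when only a few pairs are randomised.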

A number of statistical packages can be used for the analysis of CRCTs, including SAS, Stata, R and MLwiN. Bayesian analyses can be carried out using WinBUGS. A review of these packages can be found at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/index.shtml. Further information on recent developments and alternative approaches in the analysis of CRCTs is given elsewhere.29 31–33

### Reporting how the effects of clustering were incorporated into the analysis

The description of statistical methods should state whether analysis was carried out at the cluster or individual level, and explain how the correlation of responses between individuals within clusters is taken into account. For example: ‘A two-level analysis was used, with the care home nested within the [Primary Care Organization]. The analysis was performed using the random effects Poisson model in STATA with cluster as the random effect. The unit of analysis was care home; all other variables were used as fixed effects. This took account of the hierarchical nature of the data, including both the variability at the cluster level and at the care home level.’5

## Conclusions

Cluster randomised trials pose a number of complex design and analytical issues. The required sample size can be considerably larger than for an individually randomised trial, so careful consideration and justification are needed before a clustered design is used. Clustering also needs to be incorporated into the statistical analyses. Simple analyses can be carried out using data aggregated at the cluster level, and these are preferred for studies with relatively few clusters. More complex methods using individual-level data can be used for larger studies, and these can be extended to adjust for baseline covariates and incorporate tests of interaction. Because of the added complexity of these trials, it is particularly important that they are clearly reported, in accordance with the CONSORT guidelines for cluster randomised trials.

## Footnotes

Funding CD was funded in part by a grant from the National Center for Injury Prevention and Control, Centers for Disease Control and Prevention, Atlanta, GA.

Competing interests None.

Provenance and peer review Commissioned; externally peer reviewed.

i $\mathrm{ICC} = \sigma_b^2/(\sigma_b^2 + \sigma_w^2)$, where $\sigma_b^2$ is the between-cluster component of variance and $\sigma_w^2$ is the within-cluster component of variance.

ii An extension of the CONSORT guidelines for the reporting of individually randomised controlled trials, which addresses the unique aspects of reporting CRCTs, has been published.15 At the end of each section, we provide guidelines and examples for reporting the issue discussed, based on the extended CONSORT guidelines.