Abstract
Background In sport injury epidemiology research, the injury incidence rate (IR) is defined as the number of injuries over a given length of participation time (exposure, eg, game hours). However, individual weekly exposure is commonly missing because recording it requires personnel to be present at every game. Ignoring this issue leads to an inflated injury rate because total exposure is the denominator of the IR, whereas the number of injury cases is captured accurately.
Purpose This paper used data collected from a large community cohort study in youth ice hockey as an example, and compared six methods to handle missing weekly exposure of individual players.
Methods The six methods to handle missing weekly exposures include available case analysis, last observation carried forward, mean imputation, multiple imputation, bootstrapping and best/worst case analysis. To estimate injury rate ratios (IRRs) between Alberta and Quebec, as in the original study, three statistical models were applied to the imputed datasets: Poisson, zero-inflated Poisson and negative binomial regression models.
Results The final sample for imputation included 2098 players for whom 12.5% of weekly game hours were missing. Estimated IRs and IRRs with CIs from different imputation methods were similar when the proportion of missing was small. Simulations showed that mean and multiple imputations provide the least biased estimates of IRR when the proportion of missing was large.
Conclusions Complicated methods, like multiple imputation or bootstrap, are not superior over the mean imputation, a much simpler method, in handling missing weekly exposure of injury data where exposures were missing at random.
Introduction
Sport and recreation are the leading cause of injury requiring medical attention in youth, and their consequences consume enormous resources.1–3 In Alberta, Canada, it is estimated that 30–40% of youth (ages 11–18 years) sustain a sport and recreational injury requiring medical attention each year.4,5 Sport and recreation injuries significantly lower the quality of life of Canadians and may lead to permanent disability and death. In sport injury research, the injury rate is often used to quantify the frequency of injuries occurring over a given period of time.
In epidemiology, injury incidence rate (IR) is defined as the ratio of the total number of injuries over exposure time (eg, game hours). Both the numerator and the denominator are necessary to calculate the IR. However, the collection of weekly exposure time for individual participants is challenging. It depends on a large group of personnel in the study, especially at the community level. In a previous study comparing IRs in youth ice hockey between two Canadian provinces, Alberta and Quebec,6 weekly exposure sheets were used to record game information of individual players and these data were primarily collected by a team designate (eg, safety coach or manager), while the injuries were assessed and reported by the team therapist. Thus, proper collection of weekly exposure data relied heavily on the availability of the team designate at each team session. When the team designate was not available in a particular session, and there was no replacement, the exposure time was missing for that session. The ideal arrangement was to have the coach as the team designate, although this was not common in large community studies. By contrast, injury cases were captured accurately because the team therapist was present when the injury occurred. This is a common issue in sport injury epidemiology studies: the missing exposure hours were often ignored and not reported in large community studies.7–9 If the exposure hours are not well captured, the resulting IR will be biased, and consequently, biased conclusions will be drawn. This challenge has not received adequate attention and solutions have not been available. In this paper, we use data from a previous study6 and compare different methods of handling missing individual weekly exposure data.
Missing data per se is a well-recognised problem in statistics. An intuitive approach is to analyse only the non-missing observations, which is called the available case analysis. This approach, in general, is considered unbiased, but subject to a loss of efficiency by excluding missing data.10–13 In our study, however, total exposure was the denominator of IR. Therefore, ignoring missing weekly exposure leads to a biased estimate of IR instead of a loss of efficiency. Another convenient solution is to substitute missing data with the last observed value for each individual. This approach is called last observation carried forward. Substituting missing data with the mean of observed values has also been used, although this approach is subject to underestimation of the variance.12 Additionally, scientifically more rigorous statistical approaches for missing data have been suggested. Among them, multiple imputation has been preferred and applied in the past 20 years in a wide range of areas including genetics, clinical trials and cancer research, among others.14–21 In multiple imputation, missing data are imputed by a predictive model several times and the estimation is performed on each imputed sample. The imputed estimate is the average of the estimates from the imputed samples. Bootstrapping is another popular method to handle missing data statistically. This method resamples the data, imputes the missing values, and performs inference on the resampled data.22 Algorithms for multiple imputation and bootstrap methods are readily available in statistical software to handle missing continuous or categorical outcomes.
The original cohort study examining injury risk in youth ice hockey concluded that players in Alberta where policy allowed body checking had a significantly higher IR than players in Quebec where policy did not allow body checking.6 If, for example, missing game hours in Alberta players were substituted with his/her largest game hour, and those in Quebec players were substituted with individual smallest hour during the season, the injury rate ratio (IRR) between Alberta and Quebec would be closest to the null value 1, which gives a conservative estimate of IRR. This approach is termed Best/Worst case analysis. It may be of interest to sport injury researchers and, hence, is included in the comparison of imputation methods in this paper.
In this paper, we investigated the six methods above for imputing missing weekly exposure hours for each player and compared estimated IRs in Alberta and Quebec along with IRRs between the two provinces. These methods have been implemented in statistical software for applied researchers. Additionally, three statistical models were applied in the analysis of imputed data: Poisson, zero-inflated Poisson (ZIP), and negative binomial.
Methods
Dataset
In the previous study of youth ice hockey, 2154 players were recruited from 150 teams. Details of data collection can be found in Emery et al.6 The missing exposure for imputation is individual weekly game hours, the total of which from all players is the denominator for estimated IR. In this paper, players with missing information on any of the covariates: province, previous injury, or level of play, were excluded because these covariates will be used in the multiple imputation model. If players had missing values in these covariates, their weekly exposure hours cannot be imputed. Therefore, they were excluded. Additionally, players with missing weekly exposure hours throughout the season were excluded. This reduced sample would be used consistently in all imputation methods. The start and end of season for each team in the original study were used as the start and end for imputation in this paper. In other words, the number of total weeks for each player remained the same as those in the original paper, and individual missing weekly hours during the season were imputed.
Methods to impute missing exposures
Different methods to handle missing weekly exposure for each player are described below. Exposure time refers to game hours in both the imputation and the analysis. Team was treated as a cluster variable in both the imputation model and the analysis of imputed data.
Available case analysis
The weeks with missing game hours for each player were removed from the calculation of his/her total game hours. In this paper, the missingness was in the weekly exposure, sum of which gave the denominator of the IR. When missing weekly exposures were excluded, the resulting total exposure hours were smaller than the actual total exposure hours. Based on the study design and injury definition, injury cases were captured accurately by the team therapist. Hence, when missing weekly exposure hours were excluded, the estimated IRs were inflated. Therefore, the impact of ignoring individual missing weekly exposure in this example was not on the loss of efficiency but in the bias of IR.
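The direction of this bias can be illustrated with a minimal sketch. The weekly hours and the injury count below are made-up numbers, not study data, and `incidence_rate` is a hypothetical helper introduced only for illustration.

```python
def incidence_rate(injuries, hours, per=1000.0):
    """Injuries per `per` exposure hours."""
    return injuries * per / hours

# Hypothetical player: None marks a week whose game hours were not recorded.
weekly_hours = [2.0, None, 2.0, 2.0, None, 2.0, 2.0, 2.0, 2.0, 2.0]
injuries = 1  # injuries are assumed to be fully captured

# Available case analysis: missing weeks simply drop out of the denominator.
observed_total = sum(h for h in weekly_hours if h is not None)

# Assumed truth for this toy example: the player really played 2 h every week.
true_total = 2.0 * len(weekly_hours)

ir_available = incidence_rate(injuries, observed_total)
ir_true = incidence_rate(injuries, true_total)
print(ir_available, ir_true)
```

Because the numerator is unchanged while the denominator shrinks, the available case rate can only be equal to or larger than the rate based on complete exposure.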
Last observation carried forward
For the weeks where the player had missing game hours, the value from the closest previous week for the same player was substituted for the missing value. This approach is considered compatible with the intention-to-treat principle in RCTs.23,24 The major disadvantage of this approach is that it relies heavily on the assumptions of missing completely at random and unchanged profile,12 which means that each player was expected to have the same number of game hours as in the previous observed week.
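As a rough sketch of the substitution rule for one player, assuming weekly hours are held in a list with `None` for missing weeks (a representation chosen here for illustration, not taken from the study):

```python
def locf(weekly_hours):
    """Replace each None with the most recent observed value.

    Weeks before the first observation stay None; how the study treated
    that case is not stated, so leaving them untouched is an assumption.
    """
    filled = []
    last = None
    for hours in weekly_hours:
        if hours is not None:
            last = hours
        filled.append(last)
    return filled

print(locf([2.0, None, 3.0, None, None]))
```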
Mean imputation
In the original cohort study, mean game hours of the team were used to substitute missing weekly game hours for individual players within the team.6 In this paper, however, missing weekly game hours were substituted with mean game hours from each player, which were calculated by dividing total observed game hours by total number of observed weeks. This is the unconditional mean imputation defined by Little and Rubin.12 The reason for this approach was that some teams had missing exposure in all players in a particular week. Furthermore, this approach shares a feature with last observation carried forward and multiple imputation in that player-specific exposure or characteristics were applied in the imputation. Therefore, the unconditional mean imputation from the individual player was applied for the purpose of comparison of methods in this paper.
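A minimal sketch of the player-level mean substitution described here, again assuming a list of weekly hours with `None` for missing weeks (the function name is hypothetical):

```python
def impute_player_mean(weekly_hours):
    """Fill missing weeks with the player's own mean observed weekly hours."""
    observed = [h for h in weekly_hours if h is not None]
    if not observed:
        # Players with no recorded hours at all cannot be imputed this way.
        raise ValueError("no observed weeks for this player")
    player_mean = sum(observed) / len(observed)
    return [player_mean if h is None else h for h in weekly_hours]

print(impute_player_mean([2.0, None, 4.0]))
```

Raising an error when no week was observed mirrors the exclusion of players whose weekly hours were missing throughout the season.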
Multiple predictive imputation
Missing exposure time at each player's respective week was imputed by the predicted exposure hours from a linear regression model adjusting for clustering by team. Rubin suggested that the predictive model for imputation should include all variables in the analysis and relevant variables related to the probability of missing.25 Significant predictors identified from the previous study,6 such as province, previous history of injury and level of play, were included to impute individual missing weekly game hours with adjustment for clustering by team. Imputations were performed in the statistical package R (V.2.10.1)26 using the ‘pan’ function under the ‘pan’ library for the adjustment of clustering by team. Negative imputed values were replaced by 0. Five imputations, as suggested in the literature,19,20 were performed on each missing weekly exposure. The final imputed estimates, including exposure hours, IRs and IRRs, were derived by averaging estimates from imputed samples. Associated variances were calculated by a function of the average of estimated variances of estimates from imputed data and the variance of estimates across imputed data. Statistical details of the multiple imputation procedure are provided in online supplementary appendix A. Total exposure hours in each province, IRs and IRRs from imputed datasets were estimated by the procedure described above. Analysis of the imputed datasets was performed in STATA27 for consistency with the parameter estimation in the original study.6
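The pooling step can be sketched with the standard form of Rubin's rules, in which the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. This is assumed to correspond to the procedure in the appendix, which is not reproduced here; the function name and the example numbers are illustrative only.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m imputed-data estimates with the standard Rubin's rules.

    estimates: point estimates (eg, log IRR) from each imputed dataset
    variances: the corresponding squared standard errors
    """
    m = len(estimates)
    pooled = mean(estimates)          # average of the m estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

# Five hypothetical imputations, matching the number used in the study.
pooled, total = pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5)
print(pooled, total)
```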
Bootstrap and imputation
Bootstrap is a powerful resampling method that has been extended to handle missing data.22,28 In particular, nonparametric bootstrapping was applied in this study.22 The procedure of nonparametric bootstrapping includes the following steps: draw k independent bootstrap samples with replacement from the original data, impute the missing values in each sample independently, estimate parameters in each imputed dataset and, finally, calculate the mean of all k estimates as the bootstrap estimate with the 2.5th and 97.5th percentiles of the k estimates to construct the bootstrap 95% CI. This approach is advantageous in that it does not depend on the mechanism of missing.22 In this study, we followed the same procedure of nonparametric bootstrapping described by Efron,22 and the key step was that the number of injuries and other covariates for each player were sampled along with his/her game hours within his/her team because the injury was part of the outcome for each player. In other words, vectors of player covariates, weekly exposure hours and injuries were sampled within the team. As recommended by Efron,22 2000 bootstrap replications were implemented. Imputations using the ‘pan’ function were applied to missing weekly game hours in each bootstrap sample. Negative imputed hours were replaced by 0. After imputations, mean total game hours, IR in each province, and IRR between Alberta and Quebec were estimated in each bootstrap sample. Following the approach by Efron,22 we used the means of 2000 estimated total hours and 2000 IRs as the bootstrap estimates of imputed hours and IR for each province, respectively. The 2.5th and 97.5th percentiles of these 2000 estimates gave the corresponding 95% CI. The mean of 2000 IRRs was the bootstrap estimate of IRR. The 2.5th and 97.5th percentiles of the 2000 IRRs provided the 95% CI.
Bootstrapping and imputations were performed in R (V.2.10.1).26 Analysis of bootstrap samples was performed in STATA27 for consistency with the parameter estimation in the original study.6
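The resampling and percentile steps can be sketched as follows. This is a simplified illustration, not the study's R code: the imputation with ‘pan’ is replaced by a user-supplied `statistic` callable that is expected to impute and then estimate, and the data layout (a dictionary of teams, each holding player records) is an assumption.

```python
import random
from statistics import mean, quantiles

def bootstrap_within_teams(teams, statistic, k=2000, seed=0):
    """Percentile bootstrap that resamples whole player records within teams.

    teams: dict mapping a team id to a list of player records
    statistic: callable taking a list of player records and returning a number;
               in the study this step would impute first, then estimate
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(k):
        sample = []
        for players in teams.values():
            # Draw as many players as the team has, with replacement.
            sample.extend(rng.choices(players, k=len(players)))
        estimates.append(statistic(sample))
    cuts = quantiles(estimates, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return mean(estimates), (cuts[0], cuts[-1])

# Hypothetical records: (injuries, total game hours) per player.
teams = {"A": [(1, 40.0), (0, 50.0), (0, 45.0)], "B": [(0, 42.0), (1, 38.0)]}

def rate_per_1000h(sample):
    return 1000.0 * sum(i for i, _ in sample) / sum(h for _, h in sample)

print(bootstrap_within_teams(teams, rate_per_1000h, k=200, seed=1))
```

Sampling whole records keeps each player's injuries, covariates and hours together, which is the key step the text emphasises.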
Best/worst case analysis
The primary objective of the original study was to estimate the IRR between Alberta and Quebec.6 Since the estimated IR is higher in Alberta than Quebec, for a conservative estimation of IRR between the two provinces, missing weekly game hours in Alberta players were substituted with his/her maximum game hours, whereas, missing hours in Quebec players were substituted with his/her minimum hours.
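A short sketch of this substitution rule, assuming the same list-with-`None` representation of a player's weekly hours (the function name and province labels are illustrative):

```python
def impute_best_worst(weekly_hours, province):
    """Fill missing weeks with the player's own maximum (Alberta) or
    minimum (Quebec) observed weekly hours, which pulls the IRR toward 1."""
    observed = [h for h in weekly_hours if h is not None]
    if not observed:
        raise ValueError("no observed weeks for this player")
    if province == "Alberta":
        fill = max(observed)   # more exposure lowers the Alberta rate
    elif province == "Quebec":
        fill = min(observed)   # less exposure raises the Quebec rate
    else:
        raise ValueError("province must be 'Alberta' or 'Quebec'")
    return [fill if h is None else h for h in weekly_hours]

print(impute_best_worst([1.0, None, 3.0], "Alberta"))
print(impute_best_worst([1.0, None, 3.0], "Quebec"))
```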
Simulations
To compare the performance of estimates among different imputation methods, 1000 simulations were performed, and all six methods were applied to impute missing exposures to derive the IRR between Alberta and Quebec. We assumed that the two provinces, Alberta and Quebec, had an equal number of teams (G=30), each with an equal number of players (n_g=10). All teams played 20 weeks during the season. The weekly exposure hours for each team were sampled from a normal distribution, where the mean and SD were the same as the provincial estimates from the original study.6 That is, the weekly exposure hours in Alberta were sampled from a normal distribution with mean=2.21 and SD=0.50 and those in Quebec were sampled from a normal distribution with mean=2.04 and SD=0.51. The sampled value was repeated for every player within the same team for that week. The number of injuries for each player in each week was then sampled from a Poisson distribution whose parameter equals the product of the injury rate and the weekly exposure hours. The injury rates were chosen as 4.72 and 1.44 per 1000 h for Alberta and Quebec, respectively.
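The data-generating step can be sketched with the standard library only. The simulations in the paper were run in R; the version below is an illustrative reconstruction from the description, with two assumptions flagged in the comments: a sampled weekly hour below 0 is clamped to 0, and Poisson counts are drawn with Knuth's multiplication method.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1

def simulate_province(rng, mean_h, sd_h, rate_per_1000h,
                      n_teams=30, n_players=10, n_weeks=20):
    """One province: team-level weekly hours shared by every player on the team."""
    records = []
    for team in range(n_teams):
        # Assumption: a (very rare) negative draw is clamped to 0 hours.
        team_hours = [max(0.0, rng.gauss(mean_h, sd_h)) for _ in range(n_weeks)]
        for _ in range(n_players):
            injuries = [poisson_draw(rng, rate_per_1000h / 1000.0 * h)
                        for h in team_hours]
            records.append({"team": team, "hours": list(team_hours),
                            "injuries": injuries})
    return records

rng = random.Random(2024)
alberta = simulate_province(rng, 2.21, 0.50, 4.72)
quebec = simulate_province(rng, 2.04, 0.51, 1.44)
print(len(alberta), len(quebec))
```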
After generating the number of injuries, we assigned missingness to the weekly exposure hours. To compare the impact of different proportions of missing weekly exposure, we ran simulations with two settings: (1) Alberta 20% and Quebec 10%, (2) Alberta 50% and Quebec 20% missing weekly exposure. In the analysis of the original data, the missing proportions from both provinces were similar and relatively small, around 10%, and the different methods provided similar results. We believe that the impact of the choice of method was small because the proportion of missing was small in both provinces. In the simulation, we therefore investigated higher and different proportions of missing in the two provinces. Since more players were recruited in Alberta than Quebec, it could be more difficult to follow up players and collect weekly exposures in Alberta. So we picked 20% in Alberta and 10% in Quebec in the first simulation. By the same argument, we set 50% in Alberta and 20% in Quebec to investigate the impact of a larger proportion of missing. We randomly assigned missing weekly exposure for each player using a Bernoulli random variable. For setting (1), the Bernoulli random variable has a probability of event equal to 0.2 for Alberta and 0.1 for Quebec, where the event indicates a missing exposure. A thousand datasets were generated with this method and each of the six methods was applied to impute missing exposure in each dataset. Since the number of injuries was generated by a Poisson distribution, we analysed the simulated data with the Poisson model. The log IRR and its CI were estimated from each imputed dataset, and the comparison was performed on the log scale because the distribution of log(IRR) can be approximated by the normal distribution. Bias and its SE, MSE, and coverage probability were obtained to compare these methods. All simulations and analysis were performed in R.26
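The missingness assignment and the performance measures can be sketched as below. The function names are hypothetical, and the inputs to `summarise` stand in for the per-dataset log IRR estimates and CIs that the Poisson model would produce; the true log IRR implied by the chosen rates is log(4.72/1.44).

```python
import math
import random
from statistics import mean

def assign_missing(rng, weekly_hours, p_missing):
    """Independent Bernoulli(p_missing) draw per week; None marks a missing week."""
    return [None if rng.random() < p_missing else h for h in weekly_hours]

def summarise(log_irr_estimates, intervals, true_log_irr):
    """Bias, mean squared error and CI coverage across simulated datasets."""
    errors = [est - true_log_irr for est in log_irr_estimates]
    bias = mean(errors)
    mse = mean(err ** 2 for err in errors)
    coverage = mean(1.0 if lo <= true_log_irr <= hi else 0.0
                    for lo, hi in intervals)
    return bias, mse, coverage

true_log_irr = math.log(4.72 / 1.44)
rng = random.Random(7)
print(assign_missing(rng, [2.1, 2.3, 1.9, 2.0], 0.2))
```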
Statistical models
Poisson regression was applied in the previous study6 to calculate IRs and IRRs with adjustment for clustering by team and total game hours as offset. Player level of clustering was not considered because the number of injuries was aggregated at player level. In this secondary data analysis, the same Poisson model was applied to each imputed dataset. Additionally, zero injury in a given game for a given player was common: 87% of the players had no injuries.6 To accommodate this feature, we applied the ZIP model and used imputed game hours to predict the excess zeros. That is, the number of game hours was used as a predictor of having 0 injuries. Furthermore, a large number of zeros in the dataset may increase the variance of the number of injuries, which is the overdispersion problem, and so the negative binomial model was also applied. To be consistent with the previous study,6 these three models were fitted to each imputed dataset, adjusting for clustering by team and using imputed total game hours as offset.
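In generic notation, the three models can be written as follows. This is a sketch under stated assumptions rather than the exact specification used in the analysis: Y_i is the number of injuries for player i, T_i the imputed total game hours, x_i an indicator for Alberta, any further covariates are omitted, the logit link for the zero-inflation part is assumed, and the negative binomial variance is shown in its common mean-dispersion form with dispersion parameter alpha.

```latex
\begin{align}
\text{Poisson:}\quad & \log \mu_i = \log T_i + \beta_0 + \beta_1 x_i,
  \qquad \mathrm{IRR} = e^{\beta_1}, \\
\text{ZIP:}\quad & \Pr(Y_i = 0) = \pi_i + (1 - \pi_i)\, e^{-\mu_i},
  \qquad \operatorname{logit}(\pi_i) = \gamma_0 + \gamma_1 T_i, \\
\text{Negative binomial:}\quad & \mathrm{E}[Y_i] = \mu_i,
  \qquad \operatorname{Var}(Y_i) = \mu_i + \alpha \mu_i^{2}.
\end{align}
```

With the offset log T_i, the exponentiated intercept is the rate per game hour in Quebec and the exponentiated province coefficient is the IRR between Alberta and Quebec.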
Procedure of analysis
Missing game hours of each player from the start of the season to the end of the season were imputed using each method described above. The number of imputed weeks for each team was kept consistent with that in the original study.6 For example, if the team started at the third week of the season, missing game hours in the first 2 weeks were not imputed. For the same reason, if the team ended its season in the third week from the end of the season year, imputations were not applied to the last 2 weeks. Poisson regression, zero-inflated Poisson and negative binomial models were fitted to the datasets obtained from each method, adjusting for clustering by team with imputed total game hours as offset. IRRs between Alberta and Quebec with 95% CIs were estimated. Analysis of imputed data was performed in STATA (V.10.0)27 to ensure consistency with model approximations in the previous study.6
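For intuition, a crude IRR and a Wald-type 95% CI on the log scale can be computed directly from province totals. This sketch is not the model fitted in the study: it ignores clustering by team, so its interval would be expected to be narrower than a cluster-adjusted one, and the counts in the example are made up.

```python
import math

def crude_irr(injuries_a, hours_a, injuries_b, hours_b, z=1.96):
    """Crude rate ratio (a over b) with a Wald CI built on the log scale."""
    irr = (injuries_a / hours_a) / (injuries_b / hours_b)
    # Standard error of log(IRR) for two independent Poisson counts.
    se_log = math.sqrt(1.0 / injuries_a + 1.0 / injuries_b)
    log_irr = math.log(irr)
    return irr, (math.exp(log_irr - z * se_log), math.exp(log_irr + z * se_log))

# Hypothetical totals: 40 injuries in 10 000 h versus 10 injuries in 10 000 h.
irr, (lower, upper) = crude_irr(40, 10000.0, 10, 10000.0)
print(irr, lower, upper)
```

A larger imputed denominator for one province lowers its rate and shifts the ratio accordingly, which is how the choice of imputation method feeds through to the IRR.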
Ethical considerations
In the original study, written consents to participate in the study were obtained from all players and their parents.6 This study is a secondary data analysis of the original studies, in which ethics approval of data collection from the Office of Medical Bioethics at the University of Calgary, University of Alberta, McGill University, Universite de Montreal and Laval University have been previously obtained.6
Results
Eighteen players who did not have game hours recorded throughout the season were excluded from the analysis. Thirty-eight players missing any level of play or previous injury were also excluded. The final sample included 2098 players from 150 teams. Among these players, 230 players had no missing weekly exposure, and 1898 had at least 1 week of missing exposure. Of the 230 players without missing weekly exposure, 16.96% had at least one injury. Of the 1898 players with at least 1 week of missing exposure, 10.76% had at least one injury. Among the total 2098 players, the average proportion of missing exposure weeks was 11.87% in Alberta and 12.81% in Quebec.
Descriptive statistics of imputed total game hours are summarised in table 1. As expected, the mean total game hours from the available case method was the smallest among all methods, 40.95 h for Alberta players and 46.09 h for Quebec players. The mean of total game hours from the best/worst approach was the largest for Alberta players at 55.26 h because missing hours were substituted with the maximum weekly game hours for each player. For multiple imputation, mean imputation, last observation carried forward (LOCF) and bootstrap, the estimated mean total game hours were close.
Estimated IRs in Alberta and Quebec and IRRs between the two provinces are also summarised in table 1. Available case analysis produced the highest IRs of each province among all methods, although 95% CIs did not suggest significant differences across different methods. This was expected because available case analysis did not impute any missing weekly exposure, sum of which was the denominator of injury rate. The proportion of missing exposures was small, which was the reason why the magnitude of difference among methods was small. All the IRRs and CIs in table 1 indicated that Alberta players had a significantly higher IR than Quebec players. As expected, the IRR in best/worst approach is the smallest because maximum hours were substituted for Alberta players and minimum hours were substituted for Quebec players. The CI for this IRR still excluded 1, indicating that Alberta players had significantly higher IR than Quebec players even in the conservative estimation of game hours.
IRRs from the ZIP model were the lowest among the three models except in the best/worst substitution method. This was not surprising because the total game hours were used to predict excess zeros. The fewer hours a player spent on the ice, the higher the chance that he had 0 injuries, that is, more zeros. The average game hours of Alberta players were smaller than those of Quebec players in almost all imputation methods except best/worst case. More zero injuries were therefore expected in Alberta than in Quebec. As a result, the IR in Alberta players was attenuated due to excess zeros and, hence, the IRR between Alberta and Quebec was slightly smaller in the ZIP model than in the other models, although the difference was not significant, as suggested by the 95% CIs. The negative binomial model accounted for the overdispersion problem in these data. Therefore, the CIs using the negative binomial model were slightly wider than those using the Poisson model. Under the same model, on the other hand, CIs of IRRs were similar for multiple imputation, mean imputation, LOCF, and available case analysis. Bootstrap and best/worst substitutions produced slightly narrower CIs than other methods.
Simulations were performed and results from the six methods were summarised in tables 2 and 3 with different proportions of missing. When the proportion of missing weekly exposure was small, biases from the six methods were small. As expected, the available case analysis and the best/worst case analysis produced the largest magnitude of biases. When the proportion of missing weekly exposure increases (table 3), the bias from available case analysis inflates substantially. Consequently, the coverage probability of 95% CI was considerably reduced. The large bias was expected because the available case analysis essentially replaced missing weekly exposures with 0. Therefore, a large proportion of missing weekly exposure severely inflates the bias in the estimate from the available case analysis. This is a counterexample of the unbiasedness of estimates from the available case analysis in handling missing data. On the other hand, mean imputation, multiple imputation, bootstrap and LOCF provided estimates with much smaller biases. Estimates based on mean and multiple imputations had the smallest biases, although no significant difference was detected among the four methods.
Discussion
To our knowledge, this is the first study to examine the effect of missing weekly exposure in sport injury epidemiological study. Exposure information is a crucial component of IRs in most sports. In this study, we assumed that the injury cases were captured accurately by the team therapist, as in other sport injury studies. Comparisons of different methods to impute individual weekly exposure demonstrated that they produced similar estimates of IRs and IRRs except for best/worst case analysis. All methods gave 95% CIs excluding 1, which indicated that Alberta players had significantly higher IRs than Quebec players. This conclusion was consistent with the original findings.6 When the proportion of missing weekly exposure was small, IRR estimates were close among different methods except best/worst case, indicating that simpler approaches, such as mean and last observation carried forward were sufficient in this study. Based on the simulation results, we expect that the conclusion holds for other proportions of missing. That is, mean imputation and multiple imputation still provide the least biased estimate for different proportions of missing weekly exposures.
The estimated IRRs in this paper, however, cannot be compared directly with those in the previous study.6 There were several reasons. First, the sample size was different in that we excluded players missing any of the covariates: province, level of play and previous injury in this study. Second, players with missing hours throughout the year were excluded from the imputation and analysis. The reason for exclusions was the difficulty of implementing last observation carry forward and best/worst case. In the previous study,6 mean imputation was applied for these players. Finally, the mean imputation procedure in this study was slightly different from the approach in the previous study.6 In previous study, missing exposures were replaced at the team level based on average team exposure. In this study, however, the mean imputations were focused on average hours from each player instead of team level for the purpose of comparing with other methods where player characteristics were considered.
The results provided in this study apply to the situation where weekly exposure data were missing. The imputation was performed on the weekly exposure hours instead of the year-end total exposure, which was used as an offset in the Poisson regression model. Future research can be developed to investigate the imputation methods for the situation where total exposure was missing, although this may be rare in large community studies.
Limitations
In multiple imputation, the linear regression model adjusting for clustering by team was used to impute missing exposures. To avoid negative imputed game hours, log transformation was attempted. A total of 1061 players had at least 1 week of 0 game hours. Hence, the log function cannot be applied to these players. Furthermore, adding 0.5 to the 0 observed hours and applying the log transformation did not solve the problem because the impact of exponentiation was huge when imputed values were back-transformed to the original scale. In the linear imputation model, the proportion of negative predicted hours was small (mean 2.42% and maximum 13.79%). The magnitude of negative predicted weekly hours was also small (largest in absolute value, −5.89). Therefore, we replaced these negative predictions with zeros. The most appropriate approach may be the Tobit model,29 although the prediction function for imputing missing data from the Tobit model and adjusting for cluster structure is not available in statistical software, which may be of interest for future research. Given that the proportion and magnitude of negative predictions in this study were ignorable, replacing them with 0 was considered a reasonable approach and convenient for researchers in sports.
Players with missing covariates (province, previous injury and level of play) were excluded in the imputation and analysis. The number of players excluded was 56, which was a very small proportion (2.60%) of the original data. Therefore, the exclusion would not have significant impact on the results and conclusions from this study. Imputation methods may also be applied to missing covariates, although this is outside the scope of this study. Additionally, missing in the injury was not investigated in this study because the number of players missing injuries was extremely small in the original study. Investigations of missing Poisson outcomes may be of interest for future research.
Different methods of imputation provided very similar estimates of IRR under the same model. This was not surprising in that total game hours served as the denominator of the IR, and only 12.5% of the weekly hours were missing. The number of injuries was assumed to be completely captured. Complicated statistical methods of imputation were applied in the literature to handle missing data, including EM algorithm and Bayesian methods. In this paper, we focused on simple methods for sport researchers. Simulations showed that mean imputation performed very well even when the proportion of missing was large. Therefore, complicated methods like EM algorithm and Bayesian methods were not necessary. In our opinion, these methods are rarely needed.
Lastly, the conclusions from comparisons in this paper may not be generalisable to studies with missing total exposure of the season because imputations were performed on weekly individual exposure hours instead of total exposure over the season. Missing total exposure hours is a severe problem to estimate injury IR, although this may not occur in large longitudinal community studies and is outside the scope of this paper. Furthermore, missing completely at random in the weekly game hours was assumed, which meant that the missingness was due to chance alone and not related to the injury or other predictive factors.12 While data collection depended on compliance of team designates, noncompliance was typically related to absence of the designate from a session and failure to identify a replacement. Thus, missing data for this reason is arguably independent of player injury or other predictive factors. Furthermore, based on the proportions of having at least one injury in players with and without missing weekly exposure, 10.76% and 16.96%, respectively, we can see that the proportion of injury does not relate to missing weekly exposure. Therefore, the assumption of missing completely at random may be satisfied. In the case where missing exposures from individual player were related to his injury in a previous week, or he was moved to the upper level of play, for example, the assumption of missing completely at random is not valid and may be of interest for future study.
Conclusions
Estimated IRRs from different imputation methods were similar when the proportion of missing weekly exposure was relatively small and data were missing at random. Simple methods of handling missing weekly individual exposure in injury data, such as mean imputation, were sufficient even when the proportion of missing was large. Complicated approaches, multiple imputation or bootstrapping, do not appear to be necessary in handling weekly individual missing exposure.
What is already known on this subject

Missing exposure is an important issue in injury epidemiology. It is the denominator of injury rates and, hence, affects the accuracy of the estimate; but handling of missing exposure in injury rate has not been investigated before.
What this study adds

Several methods convenient to impute missing weekly exposure were compared. Simple methods, such as mean imputation or last observation carried forward, provided estimated injury rates and ratios similar to results from complicated methods, such as multiple imputation or bootstrapping.
Acknowledgments
Dr Carolyn Emery is supported by a Population Health Investigator Award from the Alberta Heritage Foundation, a New Investigator Award from the Canadian Institutes of Health Research and a Professorship in Paediatric Rehabilitation funded by the Alberta Children's Hospital Foundation, Alberta Children's Hospital Research Institute for Child and Maternal Health, Faculty of Medicine, University of Calgary.
Footnotes

Contributors JK proposed the statistical models and methods, performed all the analysis to obtain all the results presented, interpreted the results and wrote the manuscript.

YY contributed to the statistical methods, critically reviewed and revised the manuscript for statistical components.

CE contributed to acquisition of data, interpretation of results and critically reviewed the manuscript for sport epidemiology contents.

Competing interests None.

Ethics approval Office of Medical Bioethics at the University of Calgary, University of Alberta, McGill University, Universite de Montreal and Laval University.

Provenance and peer review Not commissioned; externally peer reviewed.