Abstract
Objective To test the predictive ability of the multinomial regression method in obtaining the category of death distribution for cases with unknown/ill-defined mortality codes.
Methods The authors evaluated the performance of the multinomial regression model by fitting the model to trial datasets from 2004 Mexican vital registration data. To predict category of death, the regression method makes use of explanatory variables, such as gender, age, place of crash, place of residence, education and insurance type. The authors compared the results of a full model regression with those of a reduced model that only contained gender and age as explanatory variables. For this comparison, the authors constructed two forms of data, one using the dummy variable adjustment method and one using the case-wise deleted method. The comparison was made through the estimated area under the curve (AUC) for each outcome category.
Results The full model significantly outperformed the gender-age (reduced) model using both datasets. In the case-wise deleted method, the AUC was increased from 0.55 to 0.7 for the reduced model and from 0.64 to 0.84 for the full model. Improvement in AUC using the dummy variable adjustment method was less significant.
Conclusions To predict ill-defined categories of death, adding relevant explanatory variables to gender and age is recommended. Multiple imputations may perform even better than this model, especially when a significant portion of the data are missing.
Background
Many countries collect and codify individual level category of death data using the International Statistical Classification of Diseases and Related Health Problems (ICD).1 Although ICD provides global category of death coding standardisation, a considerable number of ICD codes are poorly specified or completely unknown. For example, ICD code X59, which is defined as ‘exposure to unknown factor’ comprises nearly 12% (6005 observations) of deaths in the 2004 Mexican vital registration dataset. Other researchers have also reported X59 codes as having a large number of observations with poor identification.2–4
When a significant portion of cases within a vital registration dataset are assigned to unknown/ill-defined categories, as in the case of the Mexican dataset, estimation of burden of disease and injury without redistribution of these unknown/ill-defined categories is of limited usefulness. To mitigate this problem, it is not uncommon to assign a known category to the unknown/ill-defined categories by gender–age proportional redistribution.5 ,6 We believe this practice primarily improves the quality of category of death data. However, it seems that age–gender proportional redistribution lacks enough sophistication to serve as an ultimate method to deal with the problem of assigning a true category to observations with the unknown/ill-defined categories.
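The gender–age proportional redistribution described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the procedure used in the cited studies: the record layout, the category labels and the function name `proportional_redistribution` are hypothetical, and the toy records are invented for the example.

```python
from collections import Counter, defaultdict

def proportional_redistribution(records, unknown="unknown"):
    """Spread unknown-category deaths over the known categories within each
    (gender, age group) stratum, in proportion to the known counts there.
    Strata that contain no known deaths are left out of the totals."""
    known = defaultdict(Counter)   # stratum -> counts of known categories
    n_unknown = Counter()          # stratum -> number of unknown deaths
    for gender, age_group, category in records:
        stratum = (gender, age_group)
        if category == unknown:
            n_unknown[stratum] += 1
        else:
            known[stratum][category] += 1

    totals = Counter()
    for stratum, counts in known.items():
        stratum_total = sum(counts.values())
        for category, n in counts.items():
            share = n / stratum_total
            totals[category] += n + share * n_unknown[stratum]
    return dict(totals)

# Hypothetical toy data: (gender, age group, category of death)
records = (
    [("M", "15-29", "road")] * 6
    + [("M", "15-29", "fall")] * 2
    + [("M", "15-29", "unknown")] * 4
    + [("F", "60+", "fall")] * 3
    + [("F", "60+", "road")] * 1
    + [("F", "60+", "unknown")] * 4
)
redistributed = proportional_redistribution(records)
print(redistributed)
```

Because the shares are computed only from gender and age, two deaths in the same stratum always receive the same split, whatever their other recorded characteristics.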
We hypothesise that the outcome variable (category of death) is missing at random and that factors other than gender and age can contribute to predicting the true category of death in injury category of death data. Therefore, in this study we fit a regression model to predict the category of death using a set of potential predictor variables. We were also interested in measuring the extent to which the additional variables improved the quality of prediction compared with a model that contained only gender and age as covariates. Finally, we applied the fitted model to the full vital registration data to predict unknown/ill-defined categories of injury death.
Method
Data
In this study, we used Mexican vital registration data constructed in 2004 and collected by the Ministry of Health in Mexico. We selected observations in the dataset coded as ‘external categories of morbidity and mortality’. These observations correspond to Chapter XX of ICD-10, codes V01 through Y89. The selected dataset had 50 044 observations, of which 26.2% (13 111) were coded to unknown/ill-defined categories of death. We received the data from the Harvard Initiative for Global Health in 2007 as part of a bilateral agreement between Harvard University and the Mexican Ministry of Health.
The choice of regression model and the variables of the model used
To accommodate a multi-category outcome variable in a multiple regression model, we used the multinomial logistic regression (MNLR) technique in Stata 10. The outcome variable of the regression model was the category of death. We selected the explanatory variables based on their availability and potential relevance. We theorised that place of death, location of residence, time of death, type of occupation, medical insurance status, education and marital status were the variables that could help with better identification of the death categories. These categorical variables, in addition to gender and age groups, served as our explanatory variables. The Mexican dataset contained the nature of injury (known also as injury diagnosis) variable. However, based on our experience with working on different injury mortality data, we assumed that nature of injury data is not routinely available in injury death datasets. Hence, we refrained from using the variable. ICD-10 coded external cause of transport death (V01–Y89) contains rather detailed categories of death. To reduce the number of categories, we used the global burden of disease (GBD) category grouping published in 2010 by the GBD injury expert group.7 The latest version of the GBD cause category list contains 18 external categories of death for road injuries. Running MNLR is computationally intensive when the categories of outcome and explanatory variables exceed a certain limit. As a consequence, we regrouped the 18 categories of death into nine blocks. The four categories of unknown/ill-defined ICD codes were: Unknown road injury (V87, V88, V89, Y32 and Y850), Unknown transport injury (V99 and Y859), Unknown accident (X594 and X599) and Unknown injury (Y344, Y349 and Y872). The detailed description of the GBD external cause coding is available at https://sites.google.com/site/gbdinjuryexpertgroup/Home/discussion-12-external-cause.
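For reference, a standard way to write the multinomial logistic model with nine outcome blocks is given below. The notation is ours and is an assumption rather than something stated in the text: $Y_i$ is the category of death for observation $i$, $x_i$ is the vector of dummy-coded explanatory variables including a constant, and the last category is taken as the reference with its coefficients fixed at zero.

```latex
\[
P(Y_i = j \mid x_i) \;=\;
\frac{\exp\!\left(x_i^{\top}\beta_j\right)}
     {\sum_{k=1}^{J}\exp\!\left(x_i^{\top}\beta_k\right)},
\qquad j = 1,\dots,J,\quad J = 9,\quad \beta_J = 0 .
\]
```

Each added level of an explanatory variable therefore contributes $J-1 = 8$ coefficients, which is one reason the computation grows quickly as categories are added.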
Table 1 describes the explanatory variables we used in the regression models. The regrouping was done arbitrarily, based on the number of observations in each group and our subjective assessment of group similarities in relation to road injuries. Regrouping the response categories of the explanatory variables also involved a trade-off between losing predictive power and achieving faster computation. After some trial and error, we fine-tuned the number of categories with this power–speed trade-off in mind.
The analysis had two phases. In phase one, the ‘validation phase’, we aimed to check the quality of the MNLR in returning the true category of death when the categories were ill defined/unknown. For the validation phase we only used observations with known category. The second phase, ‘application phase’, involved fitting the model to the entire dataset. The goal was to use the MNLR technique to illustrate the results of our practice.
Validation phase
Table 2 summarises the changes in the number of observations at each analysis step. From a total of 50 044 death observations at least one explanatory variable had missing values in 22 641 (41%) of cases. We followed two analysis strategies: deleting cases with any missing values on the explanatory variables and specifying a dummy variable as a flag for the missing values. We called the first model ‘case-wise deleted model’ and the second one ‘dummy variable adjustment model’. Because MNLR is a single imputation technique it is not able to impute missing values in both explanatory variables and outcome variables at the same time. In situations where there are few missing values on the explanatory variables, case-wise deletion shows the optimal performance of the MNLR. We tried both strategies to compare their prediction performance and made suggestions for future analyses.
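The two strategies for missing explanatory values can be illustrated with a short sketch. It assumes, as in this study, that all explanatory variables are categorical, so that dummy variable adjustment amounts to adding an extra "missing" level to each variable. The function names, the `None` marker and the toy rows are hypothetical.

```python
def casewise_delete(rows):
    """Keep only rows in which every explanatory variable is observed."""
    return [row for row in rows if all(v is not None for v in row.values())]

def dummy_adjust(rows, missing_label="missing"):
    """Keep all rows; recode missing values as an explicit extra category,
    which becomes its own dummy variable when the factor is expanded."""
    return [
        {name: (missing_label if value is None else value)
         for name, value in row.items()}
        for row in rows
    ]

# Hypothetical toy rows; None marks a missing value
rows = [
    {"gender": "M", "education": "primary", "insurance": "none"},
    {"gender": "F", "education": None, "insurance": "public"},
    {"gender": "M", "education": "secondary", "insurance": None},
    {"gender": "F", "education": "primary", "insurance": "public"},
]
complete_cases = casewise_delete(rows)   # 2 of 4 rows survive
flagged_cases = dummy_adjust(rows)       # all 4 rows kept
print(len(complete_cases), len(flagged_cases))
```

The sketch makes the trade-off visible: case-wise deletion shrinks the sample, while dummy adjustment keeps every row at the cost of treating "missing" as if it were a real level of the variable.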
We deleted observations with unknown/ill-defined categories. The new datasets contained only observations with known category. Then we randomly set the category to missing for 20% of the observations with known category. We called the subsample with the missing category ‘test data’ and the rest of the sample ‘trainer data’. The next step was to set up the model and predict the category of death for both missing cases and known cases. We tested the performance of the models by estimating the area under the receiver operating characteristic curve (AUC), previously described by researchers as a single index for measuring the performance of a test. The receiver operating characteristic curve is a function that shows the relationship between sensitivity (true positive rate) and 1-specificity (false positive rate) of a diagnostic test. When a test is 100% sensitive and 100% specific the AUC equals unity. As AUC approaches 0.5 the diagnostic test becomes useless.8 We reported the AUC of the full models and that of the gender–age models and compared the models in terms of the AUC index and its statistical significance. To estimate the AUC for multiple categories of death, we performed the test for each category separately. In each run, we tagged the target death category as the main group and all other categories as the alternative group.
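The hold-out step and the category-by-category AUC can be sketched as below. This is an illustration under our own assumptions rather than the Stata routine used in the study: the AUC is computed as the probability that a randomly chosen case from the target category receives a higher predicted probability than a randomly chosen case from the other categories (ties count one half), and all names and numbers are hypothetical.

```python
import random

def split_train_test(n_rows, test_fraction=0.2, seed=0):
    """Randomly pick row indices whose known category is hidden for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]   # (trainer, test)

def auc(scores, is_target):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, t in zip(scores, is_target) if t]
    neg = [s for s, t in zip(scores, is_target) if not t]
    if not pos or not neg:
        raise ValueError("need at least one case in each group")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(prob_rows, true_categories, categories):
    """One AUC per category: target category versus all others combined."""
    return {
        c: auc([row[c] for row in prob_rows],
               [t == c for t in true_categories])
        for c in categories
    }

trainer_idx, test_idx = split_train_test(10)

# Hypothetical predicted probabilities for four test cases
prob_rows = [
    {"road": 0.80, "fall": 0.20},
    {"road": 0.60, "fall": 0.40},
    {"road": 0.30, "fall": 0.70},
    {"road": 0.65, "fall": 0.35},
]
truth = ["road", "road", "fall", "fall"]
aucs = one_vs_rest_auc(prob_rows, truth, ["road", "fall"])
print(aucs)
```

The pairwise form is quadratic in the number of test cases, so for a dataset of this size a rank-based implementation would be preferable in practice.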
We also reported the goodness of fit of the models using the likelihood ratio (LR) test of the full models against the intercept-only models. The statistical significance of the contribution of each explanatory variable in the model was tested through Wald statistics.9
Application phase
To avoid losing 41% of our observations through case-wise deletion method, we used the dummy variable adjustment approach to predict the category of death in this phase. Using the MNLR, a predicted probability was assigned to each of the nine dependent categories. As a result, for each observation, the sum of the predicted probabilities of the nine categories was unity. The total predicted number of injury cases for each category was obtained by adding together the probabilities of each category across the individuals.
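Summing predicted probabilities to obtain category totals can be sketched as follows. The probabilities below are invented for illustration and use three categories instead of nine; the function name `expected_counts` is hypothetical.

```python
def expected_counts(prob_rows, tol=1e-9):
    """Add up each category's predicted probability across individuals.
    Every row must sum to one, so the totals sum to the number of rows."""
    totals = {}
    for row in prob_rows:
        if abs(sum(row.values()) - 1.0) > tol:
            raise ValueError("predicted probabilities must sum to one")
        for category, p in row.items():
            totals[category] = totals.get(category, 0.0) + p
    return totals

# Hypothetical predicted probabilities for three ill-defined deaths
predicted = [
    {"road": 0.50, "fall": 0.25, "drowning": 0.25},
    {"road": 0.25, "fall": 0.50, "drowning": 0.25},
    {"road": 0.75, "fall": 0.25, "drowning": 0.00},
]
counts = expected_counts(predicted)
print(counts)
```

Under this scheme no individual death is assigned to a single category; each contributes fractionally to every category, and only the totals are interpreted.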
Results
Validation phase
We successfully fit both dummy variable adjustment models and case-wise deleted models to the data. LR statistics for the dummy variable adjustment models were as follows: full model LR χ2 (288)=21 639.4, p<0.001, reduced model LR χ2 (72)=4745.5, p<0.001. Likewise, for the case-wise deleted models the statistics were: full model LR χ2 (224)=12 838.47, p<0.001, reduced model LR χ2 (56)=2707, p<0.001. Based on the results of the Wald test we preserved all the variables in the model.
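As a reading aid, the LR statistic and its degrees of freedom can be written as below. The notation is ours: $\ell$ denotes a maximised log-likelihood and $p$ the number of non-intercept coefficients per outcome equation. The values of $p$ are inferred from the reported degrees of freedom with $J = 9$ outcome categories and are not stated in the text.

```latex
\[
\mathrm{LR} \;=\; 2\left[\ell_{\text{model}} - \ell_{\text{intercept-only}}\right]
\;\sim\; \chi^2_{(J-1)\,p} \quad \text{under the null hypothesis},
\]
\[
288 = 8 \times 36,\qquad 72 = 8 \times 9,\qquad
224 = 8 \times 28,\qquad 56 = 8 \times 7 .
\]
```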
Table 3 summarises the results of the AUC estimates along with 95% CIs for the full model and the reduced model using the two missing data approaches mentioned earlier. Regardless of the missing data approach, the estimates showed a better performance of the full models over the reduced models. The difference in the model performance between the full models and the reduced models was statistically significant for all the categories of the outcome variables (p<0.001). Compared with the dummy variable adjustment approach, the case-wise deletion strategy notably improved the AUC estimates. Using the dummy variable adjustment technique, the MNLR predictions improved for most of the outcome categories from a poor AUC estimate in the reduced model (AUC=0.5–0.6) to a fair estimate in the full model (AUC=0.6–0.7). Following the case-wise deleted method, the AUC estimates stepped up from fair to good (AUC=0.6–0.8).
Application phase
We fit the data using the dummy variable adjustment technique (LR χ2 (208)=2302.5, p<0.001). According to the results of the Wald statistics we preserved all the explanatory variables in the model. Table 4 shows the percent distribution of the four unknown/ill-defined categories of injury death across the nine known categories. Figure 1 complements the results presented in table 4 by illustrating the distribution of categories of death before and after the redistribution.
Discussion
The quality of most vital registration datasets is reduced by the fact that a considerable proportion of deaths are coded to poorly defined or absolutely unknown categories. Even in presumably high-quality data from well-developed countries, ill-defined categories of death constitute a substantial proportion of coded deaths.10 We previously showed that employing a simple Bayesian approach can effectively help identify the external categories of injuries when such data are missing from hospital discharge records.11 We extended the practice of predicting the missing categories of injuries by employing regression methods. In theory, statistical models based on regression or imputation can make use of additional covariate information to better redistribute missing observations. In practice, however, there are few published studies explaining how efficiently regression models predict ill-defined or unknown category of disease and injury mortality.12 The results of our study showed that using a range of available and relevant explanatory variables noticeably improved the performance of imputations compared with when only gender and age variables were used.
Our methodology, however, had some restrictions: (1) MNLR takes a limited number of outcome categories. Likewise, as the number of explanatory variables and their response categories increases, the computation becomes more intensive and time consuming and sometimes the model does not converge. This limited capacity of the MNLR may cause inconsistency in the results of the analysis because the analyst has to arbitrarily regroup the categories in order to have the software handle the data matrix. One solution to this problem is to use multiple imputations. (2) A more serious problem with single imputation techniques, such as MNLR, arises when a large number of values on explanatory variables are missing. In this case, dummy variable adjustment is an intuitively wise strategy to deal with the missing values.13 However, the estimates of the dummy method are biased. The bias results from the fact that the dummy method assigns a fixed value to missing observations.14 Multiple imputations assign a varying value to the missing observations and avoid this bias. (3) We showed that the performance of the case-wise deleted approach was superior to the dummy adjustment method. If the number of missing values on explanatory variables is negligible, the case-wise deletion technique is intuitively the best performing strategy to handle missing data using MNLR. However, in reality, few datasets have nearly complete values on all explanatory variables. In this case, unless the missingness is completely at random, the estimates of the outcome variables can be biased. We produced different AUC estimates through the two approaches. This implies that the dummy variable approach is not an optimal solution. (4) As a single imputation technique, MNLR only imputes one value for the missing outcome categories and so ignores the uncertainty around a single predicted value. This can be overcome by using simulation techniques that draw random samples of the predicted values.
(5) Both single and multiple imputation techniques rely on the missing at random (MAR) assumption. If in reality the data are missing not at random, the use of an imputation strategy generates biased results.15 ,16 Finally, the AUC estimates rest on the strong assumption that the predicted estimates are compared with a perfect gold standard benchmark. In our case, the known categories of death served as gold standard points of reference. Any variability in the quality of assigning known categories of death by coders can potentially change the results of the AUC analysis from one setting to another.
Despite the limitations mentioned above, our method can be applied to any large dataset with missing values or unknown/ill-defined categories on an outcome that takes several response categories. We also encourage adding the nature of injury (injury diagnosis) variable(s) to the predictions when they are available. Further investigation is needed to explore the extent to which multiple imputations can improve the identification of missing or unknown/ill-defined categories of (road) injury death. Those who are interested in our analysis scripts (Stata do file) can contact the corresponding author (shahraz{at}brandeis.edu or sharaz{at}gmail.com).
What is already known on the subject
- Gender and age are the two predictors of missing or ill-defined categories of road injury death.
- We think data analysts commonly use proportional gender–age redistribution of missing or ill defined injury death categories on defined causes.
What this study adds
- There are a number of covariates other than gender and age that significantly contribute to predicting the cause category when the cause of death is missing or ill defined in mortality datasets.
- We showed the extent to which performance of prediction can be improved by using additional relevant covariates in the regression models.
Acknowledgments
The authors would like to thank the following people for their kind help in reading through the manuscript and providing very useful comments: William Stason, Casey Fryer Sweeney, Julie Johnson and Danielle Fuller.
Footnotes
- Funding The current project was entirely funded by the World Bank Global Road Safety Facility.
- Competing interests None.
- Provenance and peer review Not commissioned; externally peer reviewed.
- Data sharing statement The authors would like to share the Stata do files and the SMCL files so that the readers and the reviewers can follow the analysis steps and the detailed results of the regression models applied.