
A combined Fuzzy and Naïve Bayesian strategy can be used to assign event codes to injury narratives
  1. H Marucci-Wellman(1)
  2. M Lehto(2)
  3. H Corns(1)

  1. Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, 71 Frankland Road, Hopkinton, Massachusetts, USA
  2. School of Industrial Engineering, Purdue University, 1287 Grissom Hall, West Lafayette, Indiana, USA / School of Management, Center for Global Innovation & Entrepreneurship, Kyunghee University, Seoul 130-701, Korea

  Correspondence to Dr Helen Marucci-Wellman, Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, 71 Frankland Road, Hopkinton, MA 01748, USA; helen.wellman@libertymutual.com

Abstract

Background Bayesian methods show promise for classifying injury narratives from large administrative datasets into cause groups. This study examined a combined approach where two Bayesian models (Fuzzy and Naïve) were used to either classify a narrative or select it for manual review.

Methods Injury narratives were extracted from claims filed with a workers' compensation insurance provider between January 2002 and December 2004. Narratives were separated into a training set (n=11 000) and a prediction set (n=3000). Expert coders assigned two-digit Bureau of Labor Statistics Occupational Injury and Illness Classification event codes to each narrative. Fuzzy and Naïve Bayesian models were developed using manually classified cases in the training set. Two semi-automatic machine coding strategies were evaluated. The first strategy assigned cases for manual review if the Fuzzy and Naïve models disagreed on the classification. The second strategy selected additional cases for manual review from the agree dataset using prediction strength, to reach a level of 50% computer coding and 50% manual coding.

Results When agreement alone was used as the filtering strategy, the majority of narratives were coded by the computer (n=1928, 64%), leaving 36% for manual review. The overall combined (human plus computer) sensitivity was 0.90, and the positive predictive value (PPV) was >0.90 for 11 of 18 two-digit event categories. Implementing the second strategy improved results, with an overall sensitivity of 0.95 and PPV >0.90 for 17 of 18 categories.

Conclusions A combined Naïve-Fuzzy Bayesian approach can classify some narratives with high accuracy and identify others most beneficial for manual review, reducing the burden on human coders.

  • e-Code
  • e-coding
  • injury
  • narrative analyses
  • surveillance
  • text mining


Injury narratives contain useful information for the prevention of injuries.1–3 Using the computer to help classify narratives can reduce the burden of manually reviewing and classifying the large numbers of narratives in administrative injury databases. The accuracy and completeness of manual coding are another important issue.4 5 The Centers for Disease Control and Prevention have recently been actively promoting strategies for improving the quality and completeness of external cause of injury coding in the USA, including the use of automated systems that assist coders in assigning event codes.5

Machine learning algorithms offer a potential way to learn how to classify injury text systematically from the typically massive amounts of previously manually coded narratives in administrative databases.6–9 We recently showed that two different Bayesian models (naive vs fuzzy) classified injury narratives from a large administrative dataset into broad (one-digit) Bureau of Labor Statistics (BLS) Occupational Injury and Illness Classification System (OIICS) event categories (eg, fall, struck by, transportation, overexertion) with high sensitivity (0.78–0.80). Each model also did fairly well (sensitivity 0.64–0.70) at a more detailed (two-digit) level.10 The naive Bayesian model performed slightly better than the fuzzy Bayesian model. However, the predictions of the fuzzy Bayesian model were more intuitive because they were based on the single word or word combination most strongly related to the assigned category.

An important result of our previous study was that the prediction strengths assigned by both models were strongly related to the actual probability that the prediction was correct. This suggested that prediction strength could be used effectively to filter out narratives for manual review in a semicomputerised approach in which part of the narratives are computer coded. The objective of this follow-on study was to develop and test methodologies for combining the fuzzy and naive Bayesian approaches along with selected manual coding. We hypothesised that a combined naive–fuzzy semicomputerised approach could both improve computer classification accuracy, and guide strategic assignment of narratives for manual review to optimise the accuracy of the final combined (human plus computer) coded dataset while minimising the number manually coded.

Methods

Data collection

Over 17 000 records were randomly extracted from claims filed between January 2002 and December 2004 with a workers' compensation insurance provider.9 The OIICS scheme includes approximately 40 mutually exclusive event categories (September 2007 version). The two coders who classified these narratives went through an intensive training process and had over 7 years of coding experience. After eliminating the cases on which the two coders disagreed, the remaining 14 000 records were considered 'gold standard' classifications. The data were then divided into a training set of 11 000 cases, which was used for model development, and a prediction dataset of 3000 cases, which was used for evaluation of the models. Each record included a unique identifier, a narrative describing how the injury occurred, and a two-digit BLS OIICS event code. The distributions of the two-digit OIICS gold standard event classifications in the training and prediction datasets were similar (p=0.25).

Model development

Two Bayesian models, referred to as naive Bayes and fuzzy Bayes, were developed to generate two independent sets of predictions.i

The naive Bayes model calculates the probability of a particular event code category using the expression:

$$P(E_i \mid n) = \prod_j \frac{P(n_j \mid E_i)\,P(E_i)}{P(n_j)}$$

where $P(E_i \mid n)$ is the probability of event code category $E_i$ given the set of $n$ words in the narrative, $P(n_j \mid E_i)$ is the probability of word $n_j$ given category $E_i$, $P(E_i)$ is the probability of category $E_i$, and $P(n_j)$ is the probability of word $n_j$ in the entire keyword list. In application, $P(n_j \mid E_i)$, $P(E_i)$ and $P(n_j)$ are all normally estimated on the basis of their frequency in a training set. Essentially, the naive algorithm calculates the probability of an event category by multiplying the likelihood ratios and prior probabilities for each word in the narrative.
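To make the calculation concrete, the sketch below shows one plausible way to implement this scoring rule, assuming word and category frequencies have been tabulated from a manually coded training set. The function names, tokenisation and smoothing constant are illustrative assumptions and are not drawn from the authors' Textminer program.

```python
from collections import Counter, defaultdict

def train_counts(narratives, codes):
    """Tabulate the frequencies used to estimate P(word|category),
    P(category) and P(word) from manually coded training narratives."""
    word_given_cat = defaultdict(Counter)  # word frequencies within each event category
    cat_counts = Counter()                 # number of narratives per category
    word_counts = Counter()                # word frequencies over all narratives
    for text, code in zip(narratives, codes):
        words = set(text.lower().split())  # count each word once per narrative
        cat_counts[code] += 1
        for w in words:
            word_given_cat[code][w] += 1
            word_counts[w] += 1
    return word_given_cat, cat_counts, word_counts

def naive_bayes_score(words, cat, word_given_cat, cat_counts, word_counts, eps=0.5):
    """Score a category as the product over narrative words of
    P(w|cat) * P(cat) / P(w), following the printed expression.
    eps is a small smoothing constant for unseen word-category pairs."""
    n_total = sum(cat_counts.values())
    p_cat = cat_counts[cat] / n_total
    score = 1.0
    for w in words:
        p_w_given_cat = (word_given_cat[cat][w] + eps) / (cat_counts[cat] + 2 * eps)
        p_w = (word_counts[w] + eps) / (n_total + 2 * eps)
        score *= p_w_given_cat * p_cat / p_w
    return score
```

In application, this score would be computed for every candidate event category and the highest-scoring category taken as the prediction; normalising the winning score over all categories is one plausible reading of the 'prediction strength' used later in the paper.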

The fuzzy Bayes model calculates the probability of a particular event code using the expression:

$$P(E_i \mid n) = \max_j \frac{P(n_j \mid E_i)\,P(E_i)}{P(n_j)}$$

The primary difference from naive Bayes is that instead of multiplying the conditional probabilities, fuzzy Bayes estimates P(Ei|n) using the ‘index term’ most strongly predictive of the category.
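Under the same assumed count structures as the previous sketch (and with the same caveat that names and smoothing are illustrative), the fuzzy variant replaces the product with a maximum over words:

```python
def fuzzy_bayes_score(words, cat, word_given_cat, cat_counts, word_counts, eps=0.5):
    """Score a category by its single strongest piece of evidence:
    the maximum over narrative words of P(w|cat) * P(cat) / P(w)."""
    n_total = sum(cat_counts.values())
    p_cat = cat_counts[cat] / n_total
    best = 0.0
    for w in words:
        p_w_given_cat = (word_given_cat[cat][w] + eps) / (cat_counts[cat] + 2 * eps)
        p_w = (word_counts[w] + eps) / (n_total + 2 * eps)
        best = max(best, p_w_given_cat * p_cat / p_w)
    return best
```

Because the winning word (or word sequence, in a multiple-word variant) is known, the fuzzy prediction comes with a built-in explanation, which matches the intuitiveness advantage noted earlier.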

These two models were developed and evaluated using the Textminer program developed by one of the authors (ML). Both models used the statistical relationship between the terms in the 11 000 injury narratives in the training set and the manually assigned two-digit BLS OIICS event codes to estimate, for each of the 3000 narratives in the prediction dataset, the probability that a human coder would assign a particular code, given the words present in the narrative.

Model evaluation

Two semicomputerised strategies were evaluated. Our earlier results suggested that predictions would be more likely to be correct when the fuzzy and naive algorithms predicted the same classification, because both models showed good performance on their own.10 Therefore, our first strategy was to accept the computer-assigned codes if the fuzzy and naive algorithms agreed (agree dataset) and manually review the remaining narratives in which fuzzy and naive disagreed (disagree dataset). Our earlier results also suggested that narratives could be effectively filtered out for manual review using prediction strength. Therefore, our second strategy (which would be desirable if a higher positive predictive value (PPV) were desired and additional coding resources were available) was to filter the agree dataset further using the prediction strengths assigned by the naive Bayes model.
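As a rough illustration, the first strategy amounts to the routing logic sketched below, assuming each model exposes a predict function returning a (code, strength) pair; the function names are hypothetical, not part of Textminer.

```python
def route_by_agreement(narratives, predict_naive, predict_fuzzy):
    """Strategy 1: accept the computer-assigned code when the two models
    agree; otherwise queue the narrative for manual review."""
    computer_coded, manual_queue = [], []
    for text in narratives:
        naive_code, naive_strength = predict_naive(text)
        fuzzy_code, _ = predict_fuzzy(text)
        if naive_code == fuzzy_code:
            computer_coded.append((text, naive_code, naive_strength))
        else:
            manual_queue.append(text)
    return computer_coded, manual_queue
```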

To test the first hypothesis, working with the prediction dataset of 3000 cases, we separately analysed prediction accuracy, in terms of sensitivity and PPV, for the cases in which the models agreed (agree dataset) or disagreed (disagree dataset). Sensitivity (the true positive rate) was the percentage of gold standard (human-coded) narratives in each category that were also coded into that category by the algorithm, and PPV was the percentage of narratives correctly coded into a specific category out of all narratives coded by the algorithm into that category. We did not evaluate specificity and negative predictive value because, in our earlier results, they were high (nearing 1.0) with little differentiation across categories.10
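These two measures reduce to simple per-category counts; a minimal sketch, assuming the gold standard and predicted codes are parallel lists:

```python
def sensitivity_and_ppv(gold, predicted, category):
    """Sensitivity: share of gold-standard cases in the category that the
    algorithm also placed there. PPV: share of the algorithm's assignments
    to the category that match the gold standard."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == category and p == category)
    gold_pos = sum(1 for g in gold if g == category)
    pred_pos = sum(1 for p in predicted if p == category)
    sensitivity = true_pos / gold_pos if gold_pos else float("nan")
    ppv = true_pos / pred_pos if pred_pos else float("nan")
    return sensitivity, ppv
```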

To test the second hypothesis, we took the agree dataset and filtered out enough of the weakest predictions (ie, those with the lowest naive prediction strengths) that half of the 3000 prediction narratives would be manually coded, because we had enough human resources to classify half of the narratives manually. We then evaluated the prediction accuracy of the refined subset of computer-coded narratives in terms of sensitivity and PPV.
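The cut-off itself can be found by sorting the agree-set prediction strengths and counting off how many cases must move to the manual queue. A sketch under the study's numbers (3000 narratives in total, 1072 in the disagree set), with illustrative names:

```python
def strength_threshold_for_target(agree_strengths, n_total, n_disagree, manual_share=0.5):
    """Strategy 2: find the naive prediction strength below which agree-set
    cases are sent to manual review, so that manual cases (the disagree set
    plus the filtered agree cases) reach the target share of all narratives."""
    n_extra_manual = int(round(manual_share * n_total)) - n_disagree
    if n_extra_manual <= 0:
        return 0.0  # the disagree set alone already meets the target
    weakest_first = sorted(agree_strengths)
    return weakest_first[n_extra_manual - 1]

# With n_total=3000 and n_disagree=1072, 428 agree-set cases are filtered out,
# which in this study corresponded to a threshold of about 0.89.
```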

A follow-on set of analyses was also conducted to evaluate prediction accuracy for the combined computer–human-coded dataset (all 3000 prediction narratives) resulting from strategies 1 and 2. We measured sensitivity and PPV, assuming that the cases filtered out under each strategy for human coding (the disagree dataset for strategy 1, plus additional cases from the agree dataset for strategy 2) were coded correctly.

To illustrate the trade-off between the sensitivity of the computer-coded cases and the number of cases that would need to be manually classified, we repeated this process using different prediction thresholds and generated a plot showing how the prediction sensitivity of the computer codes improved as the number strategically filtered out for manual coding increased.
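A sketch of this sweep, assuming the agree set is held as (narrative, code, strength) triples and, as stated above, that manually reviewed cases are coded correctly; the names are illustrative:

```python
def tradeoff_curve(agree_cases, gold_by_text, n_disagree, thresholds):
    """For each prediction-strength threshold, record the proportion of all
    narratives left computer coded and the combined (human plus computer)
    sensitivity, counting every manually reviewed case as correct."""
    n_total = len(agree_cases) + n_disagree
    curve = []
    for t in thresholds:
        kept = [(text, code) for text, code, s in agree_cases if s >= t]
        correct = sum(1 for text, code in kept if gold_by_text[text] == code)
        n_manual = n_total - len(kept)  # disagree set plus filtered agree cases
        curve.append((len(kept) / n_total, (correct + n_manual) / n_total))
    return curve
```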

Results

We first briefly compare model performance for cases in which the fuzzy and naive algorithms agreed versus disagreed. We then present results showing the influence of selecting additional weakly predicted cases for manual review, based on the prediction strengths of the cases in which the fuzzy and naive algorithms agreed, and, finally, the effects of adding manual review of the filtered-out weakly predicted cases under each case selection scenario. Figure 1 is a flow diagram of the strategic filtering process for assignment of narratives for computer or manual coding.

Figure 1

Flow diagram of the strategic filtering process for assignment of narratives for computer or manual coding.

Classifications in which fuzzy and naive agreed

The fuzzy and naive algorithms agreed on 1928 (64%) of the 3000 classifications in the prediction set (table 1). The overall sensitivity for these coded narratives was 0.85, substantially higher than the sensitivity for the entire dataset using the naive algorithm (0.70). The lowest sensitivities were recorded for the struck against (0.42), bodily reaction (0.63) and explosion (0.60) categories, but these categories had high PPVs (0.88, 0.90 and 1.0, respectively). Very high sensitivities but lower PPVs were found for the overexertion (sensitivity 0.99, PPV 0.83) and highway (sensitivity 1.0, PPV 0.94) categories. The two categories with the lowest PPVs were non-highway (0.67) and exposure to stress (0.76).

Table 1

Evaluation of strategic selection of computer-assigned codes in which fuzzy and naive agree on classification alone and when the naive prediction strength is greater than 0.89

When naive prediction strength was used to obtain the 1500 most strongly predicted computer-coded narratives (half of the dataset), 428 narratives from the agree dataset with naive prediction strength values less than 0.89 were screened out for manual coding, in addition to the disagree dataset. Dropping these cases from the agree dataset, leaving only the most strongly predicted computer-coded narratives, improved the accuracy of the computer predictions (table 1). The PPV improved to above 0.80 across most categories, except for those with fewer than 10 observations and the non-highway accident category, which improved from 0.67 to 0.75 (table 1).

Classifications in which fuzzy and naive disagreed

A basic principle of probability theory indicates that if two independent predictions (fuzzy and naive) are the same, one should be more confident that the prediction is correct. This can be seen in the following example narrative: 'STRK BY NAIL HIT IN THE EYE WITH NAIL FROM NAIL GUN, LOSS OF SIGHT IN LEFT EYE'. For this narrative, the fuzzy Bayesian multiple-word model maximises the 'struck by' category given the word sequence 'in-eye' (P(Ei|n)=0.79). The naive Bayesian algorithm, using all the words in the narrative, also found the highest-strength category to be 'struck by' (P(Ei|n)=0.99). In this example, all of the words in the narrative, along with the maximum-evidence words, support the same classification: the 'struck by' category. (Note: P(Ei|n) is the probability of event category Ei given the set of n words in the narrative.)

On the other hand, when the fuzzy and naive algorithms disagreed, there was ambiguity in the narrative. An example follows:

‘LIFTING A PIECE OF DECK TO THROW AWAY, SLIPPED ON ICE & TWISTED BACK’

For this narrative, the sequence 'lifting-piece' classified the narrative into the 'overexertion' category using the fuzzy algorithm (P(Ei|n)=0.75). The naive algorithm, which uses all the words in the narrative, classified it into the 'bodily reaction' category (P(Ei|n)=0.69). Several words, in 'slipped on ice' and 'twisted back', were strong predictors of the bodily reaction category, while the multiple-word model maximised the 'overexertion' category given the words 'lifting-piece'.

Analysis of the 1072 (36%) cases in which the fuzzy and naive algorithms disagreed (table 2) showed, with a few exceptions, that the PPV was low for both algorithms for nearly every category, as was the overall sensitivity (fuzzy 0.29, naive 0.40, table 2). A quick comparison with table 1 reveals that the overall performance for the agree cases is dramatically better. This is illustrated for the ‘struck by’ category using a Venn diagram (figure 2). It can be seen there that few cases were predicted correctly when naive and fuzzy disagreed.

Table 2

Evaluation of computer-assigned codes in which naive and fuzzy algorithms disagree on classification (n=1072)

Figure 2

Comparison of gold standard records versus those coded by the computer algorithm (fuzzy and naive algorithms) for the ‘struck by’ category (Bureau of Labor Statistic event code group 02). This comparison is shown separately for the subset of cases in which fuzzy and naive agreed on the classification and in which fuzzy and naive disagreed on the classification.

Semi-computerised coding strategies

In many real-world applications, for research, legal, administrative or other reasons, it is necessary to code all the cases. If we include the effects of manual review of the filtered (most weakly predicted) subset of cases, we obtain a combined (or team) level of performance. The overall combined performance when cases are strategically filtered for manual review is shown in table 3 for the two semicomputerised coding strategies. The first strategy, in which the 1928 cases in the agree dataset are classified by the computer and the remaining 1072 cases in which the naive and fuzzy classifications disagree are manually reviewed, results in high overall sensitivity (0.90) and PPV above 0.85 for most categories. This combined performance is quite good, and requires only 36% of the cases to be manually coded.

Table 3

Performance of two semicomputerised coding strategies

The second strategy, in which another 428 narratives with naive prediction strength less than 0.89 are filtered from the agree dataset and added to the disagree cases, increases the proportion of manually coded cases to 50% and further improves the results. This strategy results in an observed sensitivity above 0.95 for the overall dataset and PPV above 0.92 for nearly all categories (table 3).

In actual practice, an organisation with limited coding resources can deploy them strategically, reaching a targeted level of accuracy by varying the number of manually reviewed cases. Figure 3 shows how targeted levels of sensitivity (on the x-axis) can be reached by manually reviewing different proportions of the cases. Each point on the curve shows the resulting sensitivity and proportion of computer-assigned cases for particular prediction strength thresholds used to filter the agree and disagree datasets. The two strategies discussed earlier correspond to two different points on the curve. Strategy 1 corresponds to using a prediction strength threshold of 1 for the disagree dataset and a threshold of 0 for the agree dataset, indicated by the point (1, 0) on the curve. Strategy 2 corresponds to the point (1, 0.89) and results in more manual coding, but a higher overall sensitivity.

Figure 3

Trade-off between sensitivity and proportion coded by computer using a semi-autonomous classification strategy.

Figure 3, therefore, shows the trade-off between accuracy and the number of manually reviewed cases. As stated earlier, the computer classifications of the agree dataset were predicted at an overall sensitivity of 0.85. To improve on that, we filtered out additional narratives from the agree dataset for manual review.

Discussion

The findings from this study suggest that, for classification of large administrative database injury narratives into discrete categories, a human–machine integrated approach may be preferable to either the human or the machine approach alone. A strategy of reducing manual coding by half or more, using Bayesian models and simple but strategic assignment of targeted narratives, achieved a sensitivity of at least 0.90 for the final combined computer–manual codes. While human coding will probably always be necessary for the classification of complex, ambiguous situations or for emerging issues, the ability of the computer to target those types of narratives for human investigation is what makes human–machine integration a highly valuable tool.

While both the naive and fuzzy models alone performed fairly well at the two-digit level in our earlier analyses,10 the high performance achieved when the fuzzy and naive algorithms agreed on a classification, and the quantity of narratives that could be classified correctly using this filter, is noteworthy. By following this approach, the computer was able to classify 64% of the narratives with an overall sensitivity of 0.85 for the computer-assigned codes, confirming that using fuzzy and naive agreement as a filtering strategy could be highly effective. In addition, over 50% of the original struck by, bodily reaction, non-highway, pedestrian, explosion and other classifiable cases were targeted for manual review, resulting in a large increase in capture (sensitivity ≥0.79 and PPV >0.85 after manual review) for categories with low sensitivities in the earlier analyses (eg, struck against, bodily reaction).10 In situations in which the computer classification is made with high confidence, the computer may assign a classification more systematically (with repeated evidence that words in the narrative are associated with a particular category) than a human coder. Similarly, the classifications made by the computer may be used to identify coding errors in human classifications.

While both algorithms use the same evidence from the training dataset, each may come up with a different classification. Disagreement between the fuzzy and naive algorithms may indicate that these narratives were more ambiguous than usual. This follows because one algorithm (fuzzy) focuses only on the strongest single piece of evidence while the other (naive) combines all the evidence, so both algorithms will agree when the evidence is consistent or unambiguous for a particular category.

Applying a second filter using the prediction strength, with the aim of 50% manual coding and 50% computer coding, was very simple and resulted in an improvement in the final coded dataset. The overall sensitivity at the two-digit level was 0.95 and the PPV ranged from 0.88 for the 'exposure to stress' category to 1.0 for the 'contact with electric current' category. Even more important, the greatest improvement in sensitivity and PPV occurred in the categories that had lower sensitivities and PPVs after the first filter. This demonstrates the power of using the prediction strengths as a filter: a combined fuzzy–naive Bayesian approach yields not only a prediction but also information about confidence in that prediction, enough to target certain narratives strategically for manual review. This method showed that with half the resources (only half of the narratives coded by humans) we could obtain a final coded dataset (human and computer codes strategically assigned) whose distribution was highly representative of the gold standard distribution, and the PPV of the classifications was also very high.

It should also be mentioned that more sophisticated filtering strategies could yield further improvements. For example, a plot of PPV against sensitivity achieved at different threshold levels for each category could be used to pick out optimum threshold levels for each category. Other information from both the fuzzy and naive algorithms could also be used to aid filtering. In particular, the naive prediction strength could be adjusted by simultaneously considering other information from the algorithms, such as the fuzzy strength, the difference between the highest and second highest fuzzy strengths, and whether the naive classification agreed with the second fuzzy classification, as in the sketch below.
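As a purely illustrative example of such an adjusted score, the features follow the list above, but the weights are placeholders rather than fitted values; any practical use would require tuning against a training set.

```python
def adjusted_confidence(naive_strength, fuzzy_strength, fuzzy_margin,
                        naive_matches_second_fuzzy):
    """Hypothetical composite confidence combining the cues listed above;
    the weights are arbitrary placeholders, not fitted values."""
    score = 0.6 * naive_strength + 0.2 * fuzzy_strength + 0.2 * fuzzy_margin
    if naive_matches_second_fuzzy:
        score += 0.05  # mild boost when the models partially corroborate
    return min(score, 1.0)
```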

An important next step is to determine how the algorithms and filtering strategies tested here will perform on other datasets with different underlying cause-of-injury distributions, different coding protocols, or longer or shorter narratives. Future studies might also focus on issues such as the feasibility of predicting more detailed codes, or improving performance on the few categories observed in this study to be difficult to predict. In particular, the naive and fuzzy Bayes algorithms both tended to under-predict cases manually assigned to the 'unspecified' categories. This result can partly be explained by both the small number of training narratives for the unspecified categories and the lack of unique predictors indicating that particular narratives are not classifiable. Also, with difficult narratives, human coders may be more likely to assign the narrative to an 'unspecified' category even though the computer may be able to identify detail in the narrative allowing a specific code to be assigned. This suggests that computer coding might also offer a strategy for reducing undercounts in applications in which human coders tend to over-assign cases to the 'unspecified' category.11

Conclusion

This study indicates that an integrated computer–human approach, with strategic assignment of narratives for manual coding, results in very high final accuracy at the two-digit classification level. While some specific categories have lower prediction strengths and are more difficult to code, such cases are likely to be filtered out for manual review, thus maintaining high accuracy even in these categories. To meet high accuracy requirements or to identify emerging issues, manual coding may never be totally replaced. However, strategically assigning manual coders to the more ambiguous narratives, while allowing the computer to classify common incident narratives, allows for a more efficient and cost-effective use of resources. The strategies used here reduced manual coding by half and improved accuracy beyond what would be expected with manual coding alone. Most importantly, there is reason to believe further improvements in performance are achievable, owing both to model refinements and to the learning that occurs when new manually coded narratives are fed back into the system to fine-tune the predictive models in real-world settings.

What is already known on the subject

  • The Centers for Disease Control and Prevention have recently been actively promoting strategies for improving the quality and completeness of external cause of injury coding in the USA, including the use of automated systems that assist coders in assigning event codes.

  • Computerised automated systems have been recognised as a potential solution to improve accuracy and reduce resource requirements needed for the manual classification of narratives in large administrative databases.

  • Two different Bayesian models (naive vs fuzzy) have been shown to be able to classify injury narratives from large administrative datasets into broad (one-digit) classifications with high accuracy and with fair accuracy at a more detailed level.

What this study adds

  • A semi-automatic approach that combines naive and fuzzy Bayesian models with strategic filtering to identify the narratives most beneficial for manual review can be implemented with minimal text processing and results in high accuracy for assigning two-digit event code classifications.

  • When the fuzzy and naive Bayesian models agreed on a classification, the sensitivity was very high (0.85) for classifying injury narratives into two-digit categories.

  • It follows that agreement can be used as an easily implementable and effective filtering strategy for a combined partial manual, partial computer approach to classifying injury narratives in large administrative databases. When using agreement alone as a filtering strategy, manual coding was reduced to only one-third of the dataset and resulted in a sensitivity of 0.90 for the final coded dataset.

  • Other possible applications of the combined fuzzy–naive approach can include the identification of coding errors.

Acknowledgments

The authors would like to thank Ms Barbara Webster, Dr Manuel Cifuentes and Mr Theodore Courtney for their thoughtful reviews and recommendations, and Ms Peg Rothwell for editorial input on the final manuscript.

References

Footnotes

  • Competing interests None.

  • Ethics approval This study was conducted with the approval of the Liberty Mutual Research Institute for Safety and Purdue University.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • i We refer the reader to our earlier article for a comprehensive and detailed explanation of the Bayesian models used in the study.10 The same training and prediction datasets, algorithms, gold standard classifications, machine-assigned codes and prediction strengths used in that study were used for this research.10