Count data distributions and their zero-modified equivalents as a framework for modelling microbial data with a relatively high occurrence of zero counts
Introduction
In the evaluation of the microbiological quality of foodstuffs, bacterial load is conventionally expressed in terms of log CFU cm−2 or g−1. Logarithmic transformation is believed to approximate or induce data normality, which is fundamental for the application of parametric statistical data analysis such as analysis of variance. While logarithmic transformation can be suitable for bacterial counts of high occurrence, such as mesophile or total viable counts, whose log CFU can approximate to a normal distribution, this approach may be unsuitable for bacterial counts of lower occurrence, such as the hygiene indicators coliforms and Escherichia coli, or pathogens (e.g., Salmonella Typhimurium, Listeria monocytogenes). For such counts, a widely held practice (e.g. Gill et al., 1996, Gill et al., 1998) is to insert, whenever bacterial colonies are not observed (zero counts), a low log value corresponding to the limit of enumeration of the microbiological test. This statistical practice for ‘censored’ observations is known as imputation, and, depending on the proportion of zero counts or censored points, the mean values are normally overestimated (Hirano et al., 1994, Hornung and Reed, 1990). A maximum likelihood procedure for censored values was introduced by Rouse et al. (1985), whose assumption is that the underlying frequency distribution approximates a lognormal. With this assumption, Pouillot et al. (2007) modelled the contamination of L. monocytogenes in cold-smoked salmon. However, it is unclear how the method will perform when the data are not lognormal, and, while it is possible to modify the maximum likelihood procedure for other data distributions, this still requires that the distribution be known.
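The upward bias introduced by imputation can be illustrated with a minimal simulation. The sketch below is not taken from the study: it assumes Poisson-distributed plate counts with a hypothetical mean of 0.5 CFU and a hypothetical limit of enumeration of 1 CFU, and it works on the count scale rather than the log scale for simplicity.

```python
import math
import random

random.seed(1)  # fixed seed so the illustration is reproducible

def poisson_draw(lam):
    """Draw one Poisson variate (Knuth's method; adequate for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

TRUE_MEAN = 0.5   # hypothetical mean count per sample (assumption)
LOD = 1           # hypothetical limit of enumeration, in CFU (assumption)

counts = [poisson_draw(TRUE_MEAN) for _ in range(1000)]

# Imputation: every zero count is replaced by the limit of enumeration.
imputed = [c if c > 0 else LOD for c in counts]

raw_mean = sum(counts) / len(counts)
imputed_mean = sum(imputed) / len(imputed)
zero_fraction = counts.count(0) / len(counts)

print(f"proportion of zero counts: {zero_fraction:.2f}")
print(f"mean of observed counts:   {raw_mean:.3f}")
print(f"mean after imputation:     {imputed_mean:.3f}")
```

Under these assumptions the imputed mean exceeds the observed mean by exactly the proportion of zero counts multiplied by the imputed value, so the bias grows with the proportion of censored points.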
In recent years, there have been considerable developments (Karlis and Ntzoufras, 2005) and interest in models for count data, particularly in econometrics (Ridout et al., 1998), clinical research (Cheung, 2002), epidemiology (Bulsara et al., 2004) and social science (Lord et al., 2005). Poisson models provide the standard framework for the analysis of count data; in practice, however, many real-life counting outcomes exhibit more variability than the nominal variance under the Poisson distribution (which is equal to the expected value), a condition called over-dispersion. One frequent manifestation of over-dispersion is that the incidence of zero counts is greater than expected for the Poisson distribution. It is therefore worthwhile to consider the mechanism by which the over-dispersion occurs and to use more flexible models such as the heterogeneous Poisson models and zero-modified models. The most commonly used heterogeneous Poisson distribution is the negative binomial (Masago et al., 2004, Gale et al., 1997, Hinde and Demetrio, 1998), which loosens the Poisson restrictions by allowing the expected number of events (λ) to be a function of some unobserved random variable that follows a gamma distribution (Ridout et al., 1998). Zero-inflated models (Lambert, 1992) are mixture models of two data generation processes: one always generating zero counts (point mass at zero) and the other generating both zero and non-zero counts (either a Poisson or a negative binomial process). Hurdle models (Mullahy, 1986), on the other hand, consist of a truncated count component employed for positive counts and a hurdle component modelling zero versus larger counts. More specifically, in both zero-modified models, a logit model with binomial assumption is used to determine which of the two processes generates an observation.
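The distributions described above can be written down compactly. The following is a minimal sketch rather than the parameterisation used in the study: the function names, the negative binomial parameterisation by mean `mu` and dispersion `k` (variance mu + mu²/k), and the convention that `pi` denotes the probability of the zero-generating component are all assumptions made for illustration.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability of count y with mean lam."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def negbin_pmf(y, mu, k):
    """Negative binomial (Poisson-gamma mixture) with mean mu and
    dispersion k, so that the variance is mu + mu**2 / k."""
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))
    return math.exp(log_p)

def zero_inflated_pmf(y, pi, base_pmf):
    """Mixture: a point mass at zero with probability pi, otherwise the
    base count process, which can itself produce zeros."""
    p = (1.0 - pi) * base_pmf(y)
    return pi + p if y == 0 else p

def hurdle_pmf(y, pi, base_pmf):
    """Hurdle: zero with probability pi; positive counts follow the base
    distribution truncated at zero."""
    if y == 0:
        return pi
    return (1.0 - pi) * base_pmf(y) / (1.0 - base_pmf(0))

def base(y):
    """Hypothetical Poisson count process with mean 1.5."""
    return poisson_pmf(y, 1.5)

print("P(0), Poisson:              ", round(base(0), 4))
print("P(0), zero-inflated Poisson:", round(zero_inflated_pmf(0, 0.4, base), 4))
print("P(0), hurdle Poisson:       ", round(hurdle_pmf(0, 0.4, base), 4))
```

The distinction between the two zero-modified forms is visible in the zero probability: under zero inflation, zeros arise from both components, whereas under the hurdle form all zeros come from the hurdle component alone.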
An alternative conceptual framework for bacterial data with a large number of zero counts is introduced in this work, whereby a distribution is fitted not to log-transformed data but to the plate count data themselves. Additionally, finding a suitable distribution for this type of bacterial count data can go in parallel with building appropriate count data regression models that would produce more accurate estimates of the experimental effects under study (covariates). For instance, a negative binomial regression or a zero-modified negative binomial regression would make it possible to better assess possible differences in bacterial contamination among abattoirs, or to assess the effects of an intervention during processing on the numbers of a pathogen. Therefore, with the potential use of the methods presented here in mind, the distribution fitting has been performed in this article within a regression modelling context, as a preamble to a follow-up work in which covariates will be included. In the following sections, regression concepts and notations are introduced for the specific case of null regression models (intercepts only, without covariates).
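For a null model, the regression estimates reduce to simple functions of the data. The sketch below uses a small hypothetical sample (not data from the study) and assumes a log link for the Poisson mean and a logit link for the probability of a zero count in the hurdle component; some software models the probability of a positive count instead, which only reverses the sign of the intercept.

```python
import math

# Hypothetical plate counts (CFU per sample); illustration only.
counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 5, 0, 3, 0, 0, 1, 0, 0, 4]

# Intercept-only Poisson regression with log link: log(lambda) = beta0.
# The maximum likelihood estimate of lambda is the sample mean.
mean_count = sum(counts) / len(counts)
beta0 = math.log(mean_count)

# Intercept-only logit model for the hurdle (zero) component:
# logit(pi) = gamma0, with pi estimated by the observed proportion of zeros.
zero_fraction = counts.count(0) / len(counts)
gamma0 = math.log(zero_fraction / (1.0 - zero_fraction))

def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

print(f"beta0  = {beta0:.4f}  (lambda = {math.exp(beta0):.2f})")
print(f"gamma0 = {gamma0:.4f}  (pi = {inv_logit(gamma0):.2f})")
```

Once covariates are added, these intercepts become the baseline around which covariate effects are expressed on the log and logit scales.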
The main objective of this study was to introduce count data frequency distributions for fitting bacterial load data that do not approximate to a normal distribution after logarithmic transformation because of the high proportion of zero counts. The fitting procedure shown in this article provides a protocol that can serve as a starting point for the statistical treatment of this kind of bacterial data. Two actual data sets with different levels of zero counts were used in this study; they corresponded to total coliforms and E. coli counts from pre-chill beef carcasses produced at nine Irish slaughterhouses over a two-year period. Poisson, negative binomial, and two zero-modified (zero-inflated and hurdle) parameterisations for the Poisson and the negative binomial distributions were fitted to both data sets, and the results were compared and analysed.
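A fitting and comparison protocol of this kind can be sketched for two of the candidate distributions. The example below is an illustration under stated assumptions, not the procedure used in the study: the counts are hypothetical, the zero-inflated Poisson is fitted with a simple EM algorithm, and the Akaike information criterion (AIC) is assumed as the comparison measure.

```python
import math

# Hypothetical counts with 60% zeros; not data from the study.
counts = [0] * 60 + [1] * 5 + [2] * 8 + [3] * 10 + [4] * 7 + [5] * 5 + [6] * 3 + [8] * 2

def poisson_logpmf(y, lam):
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def poisson_loglik(data, lam):
    return sum(poisson_logpmf(y, lam) for y in data)

def zip_loglik(data, pi, lam):
    """Log-likelihood of a zero-inflated Poisson with mixing weight pi."""
    total = 0.0
    for y in data:
        if y == 0:
            total += math.log(pi + (1.0 - pi) * math.exp(-lam))
        else:
            total += math.log(1.0 - pi) + poisson_logpmf(y, lam)
    return total

def fit_zip_em(data, iterations=500):
    """Fit a zero-inflated Poisson by EM (null model, no covariates)."""
    n = len(data)
    n_zero = sum(1 for y in data if y == 0)
    total = sum(data)
    pi, lam = 0.5, total / n  # starting values
    for _ in range(iterations):
        # E-step: expected number of zeros from the point mass at zero.
        weight = pi / (pi + (1.0 - pi) * math.exp(-lam))
        structural = n_zero * weight
        # M-step: update the mixing weight and the Poisson mean.
        pi = structural / n
        lam = total / (n - structural)
    return pi, lam

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

lam_pois = sum(counts) / len(counts)   # Poisson MLE is the sample mean
pi_zip, lam_zip = fit_zip_em(counts)

aic_pois = aic(poisson_loglik(counts, lam_pois), 1)
aic_zip = aic(zip_loglik(counts, pi_zip, lam_zip), 2)

print(f"Poisson: lambda = {lam_pois:.3f}, AIC = {aic_pois:.1f}")
print(f"ZIP:     pi = {pi_zip:.3f}, lambda = {lam_zip:.3f}, AIC = {aic_zip:.1f}")
```

For these hypothetical counts, the zero-inflated model reproduces the observed proportion of zeros, which the Poisson model with the same data cannot do; the same pattern of fitting and comparison extends to the negative binomial and hurdle forms.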
Section snippets
Sampling of beef carcasses and microbiological analyses
Nine beef export abattoirs, with a throughput of at least 30 000 cattle/annum each, located in the south, east and west of Ireland, were visited to obtain a representative sample of cattle being slaughtered throughout the country. Five of the abattoirs were each visited three times and the remaining four on two occasions. During each visit, approximately 30 animals were randomly sampled at the end of the slaughter line after washing by swabbing the two carcass sides. Polyurethane sponges (Sydney
Results
Defining commercially attainable acceptance criteria for the hygienic performance of beef carcass dressing processes, Gill et al. (1998) observed that the bacteria of interest must be counted in approximately 85% of samples for there to be an approximation to normal distribution of log CFU. In the present study, the logarithmic transformation of the total viable counts (log [CFU/10,000 cm2]) on pre-chill beef carcasses (n = 672) brought about the approximation of the data to a normal distribution (
Discussion
Although the Poisson distribution is normally the recommended approach for analysing count data, the extra variability of the bacterial data can be handled using the modifications to the Poisson shown in this paper. Bacterial data made up of a considerable number of zero counts can be appropriately represented by using such modified count distributions. These distributions have been demonstrated to depict the observed data with great accuracy since they are capable of dealing with the
Conclusions
An alternative conceptual framework that accurately represents the dispersion of microbial counts from bacteria of low occurrence has been introduced. The typical logarithmic transformation of CFU/cm2 (as a way to approximate data normality) was disregarded and analysis was conducted on the discrete variable of CFUs counted on Petri dishes. Distributions for counting outcome data – modified from the baseline Poisson so as to account for both the large variance of the count data and the excess
Acknowledgments
The authors wish to acknowledge safefood, The Food Safety Promotion Board and the Food Institutional Research Measure (FIRM) administered by the Irish Department of Agriculture, Fisheries and Food. The authors also wish to acknowledge the partial financial support of ProSafeBeef, an EU 6th Framework project. The reviewers are gratefully acknowledged for detailed useful comments.
References (30)
- et al. Setting control limits for Escherichia coli counts in samples collected routinely from pig or beef carcasses. Journal of Food Protection (2006)
- et al. Use of total Escherichia coli counts to assess the hygienic characteristics of a beef carcass dressing process. International Journal of Food Microbiology (1996)
- et al. Overdispersion: models and estimation. Computational Statistics and Data Analysis (1998)
- et al. Bacterial count from bovine carcasses as an indicator of hygiene at slaughtering places: a proposal for sampling. Journal of Food Protection (1992)
- et al. Poisson, Poisson-gamma and zero-inflated regression models for motor vehicle crashes balancing statistical fit and theory. Accident Analysis and Prevention (2005)
- Specification and testing of some modified count data models. Journal of Econometrics (1986)
- et al. Risk classification for claim counts: a comparative analysis of various zero-inflated mixed Poisson and hurdle models. North American Actuarial Journal (2008)
- et al. Evaluating risk factors associated with severe hypoglycemia in epidemiology studies – what method should we use? Diabetic Medicine (2004)
- et al. Count data models for financial data
- Zero-inflated models for regression analysis of count data: a study of growth and development. Statistics in Medicine (2002)
- Drinking water treatment increases micro-organism clustering: the implications for microbiological risk assessment. Journal of Water Supply Research and Technology – Aqua
- Regression analyses of counts and rates: Poisson, overdispersed Poisson and negative binomial models. Psychological Bulletin
- Evaluation of the hygienic performances of the processes for beef carcass dressing at 10 packing plants. Journal of Applied Microbiology
- Econometric Analysis
- Estimation of and temporal changes in means and variances of populations of Pseudomonas syringae on snap bean leaflets. Phytopathology
Cited by (57)
Calculating the limit of detection for a dilution series
2023, Journal of Microbiological Methods
Cross contamination of Escherichia coli O157:H7 in fresh-cut leafy vegetables: Derivation of a food safety objective and other risk management metrics
2023, Food Control
Citation Excerpt: At lower levels (4 log CFU/g), contamination in the final product showed a higher number of negative samples (~12%), which led to a worse fitting and higher AIC value for the log normal distribution (Table 4). However, this level of non-contaminated samples was not sufficient to consider a zero-inflated scenario for which the percentage of negative samples should be much higher (e.g., > 50%) (Gonzales-Barron et al., 2010). More than 90% samples were positive for the pathogen at the lowest contamination level (1 log CFU/g) even though contamination values were < LOQ, which did not allow a proper analysis or fitting of probability distributions.
Estimating the distribution of norovirus in individual oysters
2020, International Journal of Food Microbiology
Probabilistic model for estimating Listeria monocytogenes concentration in cooked meat products from presence/absence data
2020, Food Research International