Fixing the nonconvergence bug in logistic regression with SPLUS and SAS
Introduction
For analyzing clinical studies with binary outcomes, the logistic regression model [1], [2] is often used. The straightforward interpretation of the estimated parameters as log odds ratios has contributed to its popularity in medical research, and because the model accommodates more than one covariate, it allows estimation of odds ratios that are adjusted for other covariates [2]. Parameter estimation is usually based on maximization of the (log) likelihood function (maximum likelihood method) via an iteratively weighted least-squares algorithm [3]. However, in certain situations, occurring particularly in samples with a high number of parameters relative to sample size, finite maximum likelihood parameter estimates do not exist. In those cases the likelihood converges to a finite value while at least one parameter estimate diverges to ±∞ [4]. This phenomenon is caused by special conditions in a data set and is known as ‘separation’. The simplest example of separation arises in the analysis of a binary outcome and a single binary covariate if the resulting 2×2 table has one zero cell count. In general, the probability that separation occurs is too high to be neglected [5], [6].
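The 2×2 case can be illustrated with a minimal Python sketch. The counts below are hypothetical and chosen only so that one cell is empty; the function names are illustrative and are not part of logistf or FL. The sketch runs a fixed number of Newton-Raphson steps for the model logit(π) = b0 + b1·x and records the log likelihood, which settles at a finite value while the slope b1 keeps growing.

```python
import math

# Grouped 2x2 data (hypothetical counts): (covariate x, subjects n, events y).
# The cell x=1, y=0 is empty, so the covariate separates the outcome.
GROUPS = [(0, 10, 5), (1, 10, 10)]


def log_likelihood(b0, b1):
    """Binomial log likelihood of logit(pi) = b0 + b1*x for the grouped data."""
    total = 0.0
    for x, n, y in GROUPS:
        eta = b0 + b1 * x
        # log(pi) and log(1 - pi), written so that neither becomes log(0).
        log_pi = -math.log1p(math.exp(-eta))
        log_one_minus_pi = -math.log1p(math.exp(eta))
        total += y * log_pi + (n - y) * log_one_minus_pi
    return total


def newton_step(b0, b1):
    """One Newton-Raphson step: beta + I^{-1} U for the two-parameter model."""
    u0 = u1 = i00 = i01 = i11 = 0.0
    for x, n, y in GROUPS:
        eta = b0 + b1 * x
        pi = 1.0 / (1.0 + math.exp(-eta))
        one_minus_pi = 1.0 / (1.0 + math.exp(eta))
        resid = y - n * pi            # contribution to the score vector U
        w = n * pi * one_minus_pi     # contribution to the information I
        u0 += resid
        u1 += x * resid
        i00 += w
        i01 += x * w
        i11 += x * x * w
    det = i00 * i11 - i01 * i01
    # Solve the 2x2 system I * delta = U explicitly.
    d0 = (i11 * u0 - i01 * u1) / det
    d1 = (i00 * u1 - i01 * u0) / det
    return b0 + d0, b1 + d1


def fit_ml(iterations=15):
    """Run a fixed number of steps and return (b0, b1, logL) per iteration."""
    b0 = b1 = 0.0
    history = []
    for _ in range(iterations):
        b0, b1 = newton_step(b0, b1)
        history.append((b0, b1, log_likelihood(b0, b1)))
    return history


if __name__ == "__main__":
    for it, (b0, b1, ll) in enumerate(fit_ml(), start=1):
        print(f"iter {it:2d}  b0={b0:7.4f}  b1={b1:8.4f}  logL={ll:.6f}")
```

In this sketch the slope increases by roughly one unit per iteration, while the log likelihood approaches 10·log(0.5). A stopping rule based only on the change in log likelihood would therefore declare convergence at an arbitrary value of b1. The iteration count is capped deliberately, because the information matrix becomes numerically singular if the loop runs much longer.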
Some statistical software packages for logistic regression warn the user in case of separation that the parameter estimates have not converged [7]. Others simply base convergence of the model fitting algorithm on the deviance or the log likelihood and will not detect separation [8]. In both cases, the resulting odds ratio estimates are based on the last iteration carried out. For covariates causing separation, the resulting estimates are therefore arbitrary and thus extremely inaccurate [5] and misleading. Although exact logistic regression [9], which has been implemented in the new SAS version 8.1 [10], can provide finite and accurate estimates in some situations, it cannot generally be used as a tool to cope with separation [5].
In this paper, we present SPLUS [8] and SAS [11] programs to solve the separation problem. In our programs parameter estimation is based on the penalized maximum likelihood approach originally developed by Firth [12] and suggested for the logistic regression model by Heinze and Schemper [5]. This approach provides an ideal solution to the problem of separation. It has been shown that parameter estimates from this approach are always finite and have lower small sample bias than maximum likelihood estimates. Because of asymmetric shapes of the profile penalized likelihood in case of separation, Heinze and Schemper [5] recommend the construction of confidence intervals based on profile penalized likelihood instead of using the simpler Wald method. Our programs can be used to perform both ways of interval estimation and to compare them graphically by plotting the profile penalized log likelihood (PPL) function as has been suggested [5].
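As a rough illustration of why the penalized estimates stay finite, the following Python sketch applies a Firth-type iteration to a hypothetical 2×2 table with one empty cell. It is not the logistf or FL code. It assumes the commonly stated form of the modified score for logistic regression, U*(β_r) = Σ {y_i − π_i + h_i(1/2 − π_i)} x_ir, where h_i is the ith diagonal element of the hat matrix; that formula is not given in this excerpt, and the counts and function names are illustrative only.

```python
import math

# Hypothetical 2x2 data with an empty cell: (covariate x, subjects n, events y).
GROUPS = [(0, 10, 5), (1, 10, 10)]


def firth_step(b0, b1):
    """One step beta + I^{-1} U*, with U* the assumed Firth-modified score."""
    i00 = i01 = i11 = 0.0
    fitted = []
    for x, n, y in GROUPS:
        eta = b0 + b1 * x
        pi = 1.0 / (1.0 + math.exp(-eta))
        w = pi * (1.0 - pi)           # weight of a single observation
        fitted.append((x, n, y, pi, w))
        i00 += n * w
        i01 += n * x * w
        i11 += n * x * x * w
    det = i00 * i11 - i01 * i01
    # Elements of the inverse 2x2 Fisher information matrix.
    v00, v01, v11 = i11 / det, -i01 / det, i00 / det
    u0 = u1 = 0.0
    for x, n, y, pi, w in fitted:
        # Hat-matrix diagonal of one observation with covariate row (1, x).
        h = w * (v00 + 2.0 * x * v01 + x * x * v11)
        # Ordinary score contribution plus the penalty term for n observations.
        resid = (y - n * pi) + n * h * (0.5 - pi)
        u0 += resid
        u1 += x * resid
    d0 = v00 * u0 + v01 * u1
    d1 = v01 * u0 + v11 * u1
    return b0 + d0, b1 + d1


def fit_firth(tol=1e-10, max_iter=50):
    """Iterate until both parameter changes fall below tol."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        new_b0, new_b1 = firth_step(b0, b1)
        converged = abs(new_b0 - b0) < tol and abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


if __name__ == "__main__":
    b0, b1 = fit_firth()
    print(f"penalized estimates: b0={b0:.4f}, b1={b1:.4f}")
```

For this saturated two-group model the penalized estimate can be checked by hand: it coincides with the log odds ratio obtained after adding 0.5 to each cell of the table, here log(21), about 3.04. The sketch covers point estimation only; the profile penalized likelihood confidence intervals discussed above are not implemented.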
In Section 2 we describe the algorithms used by the SPLUS library logistf and by the SAS macro FL. Section 3 gives an overview of the application of our programs. Finally, in Section 4 we compare results obtained from the penalized maximum likelihood approach with those from standard analysis by means of a worked example and describe the availability of the program.
Computing estimates and confidence limits
A logistic regression model is given by $\Pr(y_i = 1 \mid x_i) = \pi_i = 1/\{1 + \exp(-x_i\beta)\}$, where $(y_i, x_i)$, $y_i \in \{0, 1\}$, $i = 1, \ldots, n$, denotes a sample of $n$ observations of the outcome variable $y$ and the $1 \times k$ covariate vector $x$. Usually, $x_{i1} = 1$ denotes the constant. Maximum likelihood estimates of the regression parameters $\beta_r$, $r = 1, \ldots, k$, are usually obtained by solving the $k$ score equations $\partial \log L / \partial \beta_r = 0$, $r = 1, \ldots, k$, where $L$ is the likelihood function. However, in small samples these estimates may be seriously
Program description
Both the SPLUS library logistf and the SAS macro FL apply the penalized maximum likelihood approach outlined above. Several simple options allow the user to specify the logistic regression model for which parameter estimates should be obtained. While the complete User's Guide to these two programs can be found in Technical Reports [15], [6], here we can only give a brief summary of the parameters that can be set by the user.
Example and availability
Use of logistf and FL is exemplified by means of the analysis of an epidemiological data set [16] that can be downloaded from the WWW location http://www.cytel.com/examples/sex.dat. The purpose of this case-control study was to evaluate the effects of condom use, lubricated condom use, spermicide use, oral contraceptive use, diaphragm use and age on the risk of acquiring a first urinary tract infection. The data set ‘sex’ contains data on these variables (CONDOM, LUBRI, SPERM, ORAL, DIAPHRAG, AGE) for
References (16)
- et al., SAS and SPLUS programs to perform Cox regression without convergence problems, Computer Methods and Programs in Biomedicine (2002)
- et al.
- et al.
- et al., On the existence of maximum likelihood estimates in logistic regression models, Biometrika (1984)
- et al., A solution for the problem of separation in logistic regression, to appear in Statistics in Medicine (2002)
- G. Heinze, Technical Report 10: the Application of Firth's Procedure to Cox and Logistic Regression, Department of...
- SAS/STAT User's Guide, Version 8, SAS Institute Inc., Cary, NC,...