Fixing the nonconvergence bug in logistic regression with SPLUS and SAS

https://doi.org/10.1016/S0169-2607(02)00088-3

Abstract

When analyzing clinical data with binary outcomes, the parameter estimates and consequently the odds ratio estimates of a logistic model sometimes do not converge to finite values. This phenomenon is due to special conditions in a data set and is known as ‘separation’. Statistical software packages for logistic regression using the maximum likelihood method cannot appropriately deal with this problem. A new procedure to solve the problem has been proposed by Heinze and Schemper (Stat. Med. 21 (2002) pp. 2409–2419). It has been shown that, unlike the standard maximum likelihood method, this method always leads to finite parameter estimates. We developed a SAS macro and an SPLUS library to make this method available from within either of these widely used statistical software packages. Our programs are also capable of performing interval estimation based on profile penalized log likelihood (PPL) and of plotting the PPL function, as was suggested by Heinze and Schemper (Stat. Med. 21 (2002) pp. 2409–2419).

Introduction

For analyzing clinical studies with binary outcomes, the logistic regression model [1], [2] is often used. The straightforward interpretation of the estimated parameters as log odds ratios has contributed to its popularity in medical research, and the capability of allowing models with more than one covariate enables estimation of odds ratios that are adjusted for other covariates [2]. Parameter estimation is usually based on maximization of the (log) likelihood function (maximum likelihood method) via an iteratively weighted least-squares algorithm [3]. However, it is also known that there are certain situations, occurring particularly in samples with a high number of parameters relative to sample size, where finite maximum likelihood parameter estimates do not exist. In those cases the likelihood converges to a finite value while at least one parameter estimate diverges to ±∞ [4]. This phenomenon is due to special conditions in a data set and is known as ‘separation’. The simplest example of separation arises in the analysis of a binary outcome and a single binary covariate if the resulting 2×2 table has one zero cell count. Generally, the probability of occurrence of separation is too high to be negligible [5], [6].
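The behaviour described above can be illustrated with a small numerical sketch. The toy data (a 2×2 table with one zero cell), the function name log_likelihood and the chosen grid of slope values are assumptions made for illustration only; they are not taken from the paper or from its programs. The sketch fixes the intercept at its maximum likelihood value and shows that the log likelihood keeps increasing as the slope grows, while approaching a finite supremum.

```python
import math

def log_likelihood(b0, b1, x, y):
    """Log likelihood of a logistic model with intercept b0 and slope b1."""
    ll = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        # log(pi) = -log(1 + exp(-eta)), log(1 - pi) = -log(1 + exp(eta))
        ll += -math.log1p(math.exp(-eta)) if yi == 1 else -math.log1p(math.exp(eta))
    return ll

# Hypothetical toy data: for x = 1 all 5 subjects have y = 1 (zero cell),
# for x = 0 there are 3 events among 10 subjects.
x = [1] * 5 + [0] * 10
y = [1] * 5 + [1] * 3 + [0] * 7

b0 = math.log(3.0 / 7.0)                          # ML intercept: logit of 3/10
supremum = 3 * math.log(0.3) + 7 * math.log(0.7)  # limit of the log likelihood

# The log likelihood increases with b1 but never reaches the supremum.
trace = [(b1, log_likelihood(b0, b1, x, y)) for b1 in (0.0, 2.0, 5.0, 10.0, 20.0)]
for b1, ll in trace:
    print(f"b1 = {b1:5.1f}   log L = {ll:.8f}   gap to supremum = {supremum - ll:.2e}")
```

Because the gap to the supremum shrinks at every step without reaching zero, no finite slope maximizes the likelihood in this example.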

Some statistical software packages for logistic regression may warn the user in case of separation that the parameter estimates have not converged [7]. Others simply base convergence of the model fitting algorithm on the deviance or the log likelihood and will not detect separation [8]. In both cases, the resulting odds ratio estimates are based on the last iteration carried out. For covariates causing separation, the resulting estimates are completely arbitrary and thus extremely inaccurate [5] and misleading. Although exact logistic regression [9], which has been implemented in the new SAS version 8.1 [10], can provide finite and accurate estimates in some situations, it cannot generally be used as a tool to cope with separation [5].

In this paper, we present SPLUS [8] and SAS [11] programs to solve the separation problem. In our programs parameter estimation is based on the penalized maximum likelihood approach originally developed by Firth [12] and suggested for the logistic regression model by Heinze and Schemper [5]. This approach provides an ideal solution to the problem of separation. It has been shown that parameter estimates from this approach are always finite and have lower small sample bias than maximum likelihood estimates. Because of asymmetric shapes of the profile penalized likelihood in case of separation, Heinze and Schemper [5] recommend the construction of confidence intervals based on profile penalized likelihood instead of using the simpler Wald method. Our programs can be used to perform both ways of interval estimation and to compare them graphically by plotting the profile penalized log likelihood (PPL) function as has been suggested [5].
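The asymmetry of the profile penalized log likelihood can be sketched numerically. The code below is an illustration under stated assumptions, not the algorithm used by logistf or FL: it handles only an intercept plus one covariate, takes the penalized log likelihood to be log L(β) + ½ log|I(β)| (the form usually given for Firth's approach), uses a plain ternary search and bisection on fixed search ranges, and runs on hypothetical toy data. All function names are invented for this sketch.

```python
import math

def penalized_loglik(b0, b1, x, y):
    """log L(beta) + 0.5 * log|I(beta)| for an intercept and one covariate."""
    ll = i00 = i01 = i11 = 0.0
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        log_p = -math.log1p(math.exp(-eta))  # log(pi_i)
        log_q = -math.log1p(math.exp(eta))   # log(1 - pi_i)
        ll += log_p if yi == 1 else log_q
        w = math.exp(log_p + log_q)          # pi_i * (1 - pi_i)
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    return ll + 0.5 * math.log(i00 * i11 - i01 * i01)

def argmax_1d(f, lo, hi, iters=60):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def profile_ppl(b1, x, y):
    """PPL at a fixed slope b1: maximize the penalized log likelihood over b0."""
    b0 = argmax_1d(lambda b: penalized_loglik(b, b1, x, y), -15.0, 15.0)
    return penalized_loglik(b0, b1, x, y)

def ppl_interval(x, y, chi2_crit=3.841459):
    """Estimate and 95% PPL limits for the slope (chi-square quantile, 1 df)."""
    b1_hat = argmax_1d(lambda b: profile_ppl(b, x, y), -15.0, 15.0)
    target = profile_ppl(b1_hat, x, y) - 0.5 * chi2_crit

    def bound(inside, outside):
        # Bisection for the point where the PPL crosses the target value.
        for _ in range(50):
            mid = 0.5 * (inside + outside)
            if profile_ppl(mid, x, y) >= target:
                inside = mid
            else:
                outside = mid
        return 0.5 * (inside + outside)

    return bound(b1_hat, b1_hat - 15.0), b1_hat, bound(b1_hat, b1_hat + 15.0)

# Hypothetical toy data with separation: all subjects with x = 1 have y = 1.
x = [1] * 5 + [0] * 10
y = [1] * 5 + [1] * 3 + [0] * 7

lower, est, upper = ppl_interval(x, y)
print(f"estimate {est:.3f}, 95% PPL interval ({lower:.3f}, {upper:.3f})")
```

In this toy example the upper limit lies further from the estimate than the lower limit, which a symmetric Wald interval cannot reflect. The fixed search ranges are a simplification that happens to suffice for these data.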

In Section 2 we describe the algorithms used by the SPLUS library logistf and by the SAS macro FL. Section 3 gives an overview of the application of our programs. Finally, in Section 4 we compare results obtained from the penalized maximum likelihood approach with those from standard analysis by means of a worked example and describe the availability of the program.

Computing estimates and confidence limits

A logistic regression model is given by $\Pr(y_i = 1 \mid x_i) = \pi_i = 1/\{1 + \exp(-x_i\beta)\}$, where $(y_i, x_i)$, $y_i \in \{0, 1\}$, $i = 1, \ldots, n$, denotes a sample of $n$ observations of the outcome variable $y$ and the $1 \times k$ covariate vector $x$. Usually, $x_{i1} = 1$ denotes the constant. Maximum likelihood estimates of the regression parameters $\beta_r$, $r = 1, \ldots, k$, are usually obtained by solving the $k$ score equations $\partial \log L/\partial \beta_r \equiv U(\beta_r) \equiv \sum_{i=1}^{n} (y_i - \pi_i) x_{ir} = 0$, $r = 1, \ldots, k$, where $L$ is the likelihood function. However, in small samples these estimates may be seriously
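As a minimal sketch of how the penalized approach modifies the score equations, the following code fits a model with an intercept and a single covariate. It assumes the modified score function as it is usually stated for Firth's method in logistic regression, U*(β_r) = Σ_i {y_i − π_i + h_i(1/2 − π_i)} x_ir, where h_i is the ith diagonal element of the hat matrix W^{1/2}X(X'WX)^{-1}X'W^{1/2} and W = diag{π_i(1 − π_i)}; this formula is not part of the truncated snippet above. The function name, the plain Newton-type iteration without step-size control, and the toy data are likewise assumptions, not the implementation in logistf or FL.

```python
import math

def firth_logistic_2param(x, y, max_iter=100, tol=1e-10):
    """Firth-type penalized fit for an intercept and one covariate (sketch)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        pi = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [p * (1.0 - p) for p in pi]
        # Fisher information I = X'WX for the design rows (1, x_i)
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det  # inverse of I
        # Hat diagonals h_i = w_i * (1, x_i) I^{-1} (1, x_i)'
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi) for wi, xi in zip(w, x)]
        # Modified score contributions: y_i - pi_i + h_i * (1/2 - pi_i)
        r = [yi - p + hi * (0.5 - p) for yi, p, hi in zip(y, pi, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        # Newton-type update: beta <- beta + I^{-1} U*
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1

# Hypothetical toy data with separation: all subjects with x = 1 have y = 1.
x = [1] * 5 + [0] * 10
y = [1] * 5 + [1] * 3 + [0] * 7

b0, b1 = firth_logistic_2param(x, y)
print(f"intercept = {b0:.4f}, log odds ratio = {b1:.4f}")
```

For a single binary covariate, the penalized estimates in this sketch coincide with the log odds computed after adding 0.5 to each cell of the 2×2 table, so the log odds ratio stays finite even though one cell count is zero.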

Program description

Both the SPLUS library logistf and the SAS macro FL apply the penalized maximum likelihood approach outlined above. Several simple options allow the user to specify the logistic regression model for which parameter estimates should be obtained. While the complete User's Guide to these two programs can be found in Technical Reports [15], [6], here we can only give a brief summary of the parameters that can be set by the user.

Example and availability

Use of logistf and FL is exemplified by means of the analysis of an epidemiological data set [16] that can be downloaded from the WWW location http://www.cytel.com/examples/sex.dat. The purpose of this case-control study was to evaluate the effects of condom use, lubricated condom use, spermicide use, oral contraceptive use, diaphragm use and age on the risk of acquiring a first urinary tract infection. The data set ‘sex’ contains data on these variables (CONDOM, LUBRI, SPERM, ORAL, DIAPHRAG, AGE) for

References (16)

  • G. Heinze et al., SAS and SPLUS programs to perform Cox regression without convergence problems, Computer Methods and Programs in Biomedicine (2002)
  • D.R. Cox
  • D.W. Hosmer et al.
  • P. McCullagh et al.
  • A. Albert et al., On the existence of maximum likelihood estimates in logistic regression models, Biometrika (1984)
  • G. Heinze et al., A solution for the problem of separation in logistic regression, to appear in Statistics in Medicine (2002)
  • G. Heinze, Technical Report 10: the Application of Firth's Procedure to Cox and Logistic Regression, Department of...
  • SAS/STAT User's Guide, Version 8, SAS Institute Inc., Cary, NC,...
There are more references available in the full text version of this article.
