International Journal of Forecasting

Volume 22, Issue 4, October–December 2006, Pages 679–688

Another look at measures of forecast accuracy

https://doi.org/10.1016/j.ijforecast.2006.03.001

Abstract

We discuss and compare measures of accuracy of univariate time series forecasts. The methods used in the M-competition as well as the M3-competition, and many of the measures recommended by previous authors on this topic, are found to be degenerate in commonly occurring situations. Instead, we propose that the mean absolute scaled error become the standard measure for comparing forecast accuracy across multiple time series.

Introduction

Many measures of forecast accuracy have been proposed in the past, and several authors have made recommendations about what should be used when comparing the accuracy of forecast methods applied to univariate time series data. It is our contention that many of these proposed measures of forecast accuracy are not generally applicable, can be infinite or undefined, and can produce misleading results. We provide our own recommendations of what should be used in empirical comparisons. In particular, we do not recommend the use of any of the measures that were used in the M-competition or the M3-competition.

To demonstrate the inadequacy of many measures of forecast accuracy, we provide three examples of real data in Fig. 1. These show series N0472 from the M3-competition, monthly log stock returns for the Walt Disney Corporation, and monthly sales of a lubricant product sold in large containers. Note that the Disney return series and the lubricant sales series both include exact zero observations, and the Disney series contains negative values. Suppose that we are interested in comparing the forecast accuracy of four simple methods: (1) the historical mean using data up to the most recent observation; (2) the “naïve” or random-walk method based on the most recent observation; (3) simple exponential smoothing; and (4) Holt's method. We do not suggest that these are the best methods for these data, but they are all simple methods that are widely applied. We compare the in-sample performance of the methods (based on one-step-ahead forecasts) and the out-of-sample performance (based on forecasting the data in the hold-out period using only information from the fitting period).
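To make the comparison concrete, the following sketch (our illustration in Python, not the code behind the tables) produces one-step-ahead in-sample forecasts for the four methods. The smoothing parameters and the initial level and trend are fixed arbitrarily here; in practice they would be estimated from the fitting period.

```python
import numpy as np

def one_step_forecasts(y, alpha=0.3, beta=0.1):
    """One-step-ahead in-sample forecasts F_2, ..., F_n for the four methods."""
    n = len(y)
    mean_f  = [np.mean(y[:t]) for t in range(1, n)]   # historical mean of Y_1, ..., Y_{t-1}
    naive_f = [y[t - 1] for t in range(1, n)]         # random walk: F_t = Y_{t-1}

    # Simple exponential smoothing (level only), initialised at Y_1
    level, ses_f = y[0], []
    for t in range(1, n):
        ses_f.append(level)                           # F_t is the level built from Y_1, ..., Y_{t-1}
        level = alpha * y[t] + (1 - alpha) * level

    # Holt's linear method (level + trend), with a crude initialisation
    l, b, holt_f = y[0], y[1] - y[0], []
    for t in range(1, n):
        holt_f.append(l + b)                          # F_t = level + trend
        l_new = alpha * y[t] + (1 - alpha) * (l + b)
        b = beta * (l_new - l) + (1 - beta) * b
        l = l_new

    return {"mean": mean_f, "naive": naive_f, "ses": ses_f, "holt": holt_f}

y = np.array([3.2, 0.0, 0.0, 4.1, 4.1, 5.0, 2.7])     # toy series with zeros and tied values
in_sample = one_step_forecasts(y)
```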

Tables 1–3 show some forecast error measures for these methods applied to the example data. The acronyms are defined below and we explicitly define the measures in Section 2 (A critical survey of accuracy measures) and Section 3 (Scaled errors). The relative measures are all computed relative to a naïve (random walk) method.

In these tables, we have included measures that have been previously recommended for use in comparing forecast accuracy across many series. Most textbooks recommend the use of the MAPE (e.g., Hanke & Reitsch, 1995, p. 120, and Bowerman, O'Connell, & Koehler, 2004, p. 18) and it was the primary measure in the M-competition (Makridakis et al., 1982). In contrast, Makridakis, Wheelwright, and Hyndman (1998, p. 45) warn against the use of the MAPE in some circumstances, including those encountered in these examples. Armstrong and Collopy (1992) recommended the use of GMRAE, MdRAE and MdAPE. Fildes (1992) also recommended the use of MdAPE and GMRAE (although he described the latter as the relative geometric root mean square error or GRMSE). The MdRAE, sMAPE and sMdAPE were used in the M3-competition (Makridakis & Hibon, 2000).

The M-competition and M3-competition also used rankings amongst competing methods. We do not include those here as they are dependent on the number of methods being considered. They also give no indication of the size of the forecast errors. Similarly, both competitions included measures based on the percentage of times one method was better than a benchmark method. Again, such measures are not included here as they do not indicate the size of the errors.

To our knowledge, the MASE has not been proposed before. We consider it the best available measure of forecast accuracy and we argue for it in Section 3.

Note that there are many infinite values occurring in Tables 1–3 due to division by zero. Division by numbers close to zero also results in very large numbers. The undefined values arise due to the division of zero by zero. Some of these are due to computations of the form Yt / (Yt − Yt−1) where Yt−1 = Yt = 0, and others are due to computations of the form (Yt − Yt−1) / (Yt − Yt−1) where Yt = Yt−1. In the latter case, it is algebraically possible to cancel the numerator and denominator, although numerical results will be undefined. Also note that the sMAPE can take negative values although it is meant to be an “absolute percentage error”.
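The toy computation below (our illustration, not taken from the paper) shows the failure mode numerically: an absolute percentage error at a zero observation evaluates to infinity, a 0/0 computation evaluates to NaN, and a relative error whose naïve-forecast denominator is zero is infinite or undefined.

```python
import numpy as np

y     = np.array([0.0, 0.0, 5.0, 5.0, 4.0])   # zeros and a repeated value
f     = np.array([1.0, 0.0, 4.0, 6.0, 3.0])   # some competing forecasts
naive = np.concatenate(([np.nan], y[:-1]))    # naive forecasts F_t = Y_{t-1}

with np.errstate(divide="ignore", invalid="ignore"):
    ape = 100 * np.abs((y - f) / y)           # inf where Y_t = 0, NaN where 0/0
    rae = np.abs((y - f) / (y - naive))       # inf or NaN where Y_t = Y_{t-1}

print(ape)   # inf, nan, 20.0, 20.0, 25.0
print(rae)   # nan, nan, 0.2, inf, 1.0
```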

Note that with random walk forecasts, the in-sample results for MASE and all results for MdRAE and GMRAE are 1 by definition, as they involve comparison with naïve forecasts. However, some of the values for MdRAE and GMRAE are undefined as explained above.

Of the measures in Tables 1–3, only the MASE can be used for these series due to the occurrence of infinite and undefined values. These three series are not degenerate or unusual—intermittent demand data often contain zeros and many time series of interest to forecasters contain negative observations. The cause of the problems with M3 series N0472 is the occurrence of consecutive observations which take the same value, something that very often occurs with real data.

A critical survey of accuracy measures

Let Yt denote the observation at time t and Ft denote the forecast of Yt. Then define the forecast error et = Yt − Ft. The forecasts may be computed from a common base time, and be of varying forecast horizons. Thus, we may compute out-of-sample forecasts Fn+1,…, Fn+m based on data from times t = 1,…, n. Alternatively, the forecasts may be from varying base times, and be of a consistent forecast horizon. That is, we may compute forecasts F1+h,…, Fm+h where each Fj+h is based on data from times t = 1,…, j.
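For concreteness, the two evaluation designs can be sketched as follows (our illustration, using the naïve method as the forecaster; the series and the values of n, m and h are arbitrary).

```python
import numpy as np

y = np.arange(1.0, 21.0)                 # toy series Y_1, ..., Y_20
n, m, h = 12, 4, 2                       # fitting length, number of horizons, fixed horizon

# (a) Common base time: F_{n+1}, ..., F_{n+m} all computed from data up to time n.
fixed_origin = np.repeat(y[n - 1], m)    # naive method: every forecast equals Y_n

# (b) Varying base times, fixed horizon h: each F_{j+h} computed from data up to time j.
rolling_origin = np.array([y[j - 1] for j in range(1, len(y) - h + 1)])   # naive: F_{j+h} = Y_j
```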

Scaled errors

Relative measures and measures based on relative errors both try to remove the scale of the data by comparing the forecasts with those obtained from some benchmark forecast method, usually the naïve method. However, they both have problems. Relative errors have a statistical distribution with undefined mean and infinite variance. Relative measures can only be computed when there are several forecasts on the same series, and so cannot be used to measure out-of-sample forecast accuracy at a single forecast horizon.
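The scaled error used here divides each forecast error by the in-sample mean absolute error of the one-step naïve forecasts, and the MASE is the mean of the absolute scaled errors. A minimal sketch of the out-of-sample version (our illustration, not the authors' code):

```python
import numpy as np

def mase(train, test, forecasts):
    """Mean absolute scaled error: hold-out errors scaled by the in-sample
    MAE of one-step naive (random walk) forecasts on the fitting data."""
    scale = np.mean(np.abs(np.diff(train)))     # mean |Y_t - Y_{t-1}| over the fitting period
    return np.mean(np.abs(test - forecasts)) / scale

train = np.array([2.0, 3.0, 5.0, 4.0, 6.0, 7.0])
test  = np.array([8.0, 7.5])
print(mase(train, test, np.repeat(train[-1], len(test))))   # naive hold-out forecasts, about 0.54
```

A value below one means the forecasts are, on average, more accurate than the in-sample one-step naïve forecasts; the scaling factor is well defined unless every observation in the fitting period is identical.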

Application to M3-competition data

We demonstrate the use of MASE using the M3-competition data (Makridakis & Hibon, 2000). Fig. 2 shows the MASE at each forecast horizon for four forecasting methods applied to the M3-competition data. The errors have been scaled by the one-step in-sample forecast errors from the naïve method, and then averaged across all series. So a value of 2 indicates that the out-of-sample forecast errors are, on average, about twice as large as the in-sample one-step forecast errors from the naïve method.
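A sketch of this horizon-by-horizon summary is given below (our illustration; it assumes, purely for simplicity, that every series uses the same fitting length and the same number of hold-out forecasts, which is not true of the full M3 collection).

```python
import numpy as np

def mase_by_horizon(series_list, forecast_list, n_train):
    """Average |scaled error| over many series, separately at each forecast horizon."""
    scaled = []
    for y, f in zip(series_list, forecast_list):
        train = y[:n_train]
        test  = y[n_train:n_train + len(f)]
        scale = np.mean(np.abs(np.diff(train)))     # in-sample naive MAE for this series
        scaled.append(np.abs(test - f) / scale)     # |q| at horizons 1, ..., m
    return np.mean(np.vstack(scaled), axis=0)       # mean over series, horizon by horizon
```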

Conclusion

Despite two decades of papers on measures of forecast error, we believe that some fundamental problems have been overlooked. In particular, the measures used in the M-competition and the M3-competition, and the measures recommended by other authors, all have problems—they can give infinite or undefined values in commonly occurring situations.

We propose that scaled errors become the standard measure for forecast accuracy, where the forecast error is scaled by the in-sample mean absolute error obtained using the naïve (random walk) forecasting method.

Acknowledgments

We thank Michelle Hibon for kindly providing the forecasts submitted to the M3-competition, and two anonymous referees for providing some thoughtful comments.

Rob J. Hyndman is a Professor of Statistics in the Department of Econometrics and Business Statistics at Monash University, Australia. He has published extensively in leading statistical and forecasting journals, and is Editor-in-Chief of the International Journal of Forecasting. He is also co-author of the well-known text on business forecasting, Forecasting: methods and applications (Makridakis, Wheelwright & Hyndman, 3rd ed., 1998, Wiley).

References

  • Makridakis, S., & Hibon, M. (2000). The M3-competition: Results, conclusions and implications. International Journal of Forecasting.
  • Thompson, P. A. (1990). An MSE statistic for comparing forecast accuracy across series. International Journal of Forecasting.

Anne B. Koehler is a professor of decision sciences and the Panuska Professor of Business Administration at Miami University, Ohio. Professor Koehler has numerous publications, many of which are on forecasting models for seasonal time series and exponential smoothing methods. She is co-author of the fourth edition of Forecasting, time series, and regression: an applied approach, published by Duxbury in 2005.
