Understanding spatial concentrations of road accidents using frequent item sets

https://doi.org/10.1016/j.aap.2005.03.023

Abstract

This paper aims to understand why road accidents tend to cluster in specific road segments. More particularly, it analyzes the characteristics of accidents occurring in “black” zones compared with those scattered all over the road. A technique of frequent item sets (data mining) is applied to automatically identify accident circumstances that frequently occur together, for accidents located inside and outside “black” zones. A Belgian periurban region is used as a case study. Results show that accidents occurring in “black” zones are characterized by left turns at signalized intersections, collisions with pedestrians, loss of control of the vehicle (run-off-roadway) and rainy weather conditions. Accidents occurring outside “black” zones (scattered in space) are characterized by left turns at intersections with traffic signs, head-on collisions and drunken road user(s). Furthermore, parallel collisions and accidents on highways or roads with separated lanes, occurring at night or during the weekend, are frequent accident patterns for all accident locations. These exploratory results show the potential of the frequent item set method as a complement to more classical statistical techniques, but also suggest that there is no unique countermeasure for reducing the number of accidents.

Introduction

Traffic collisions remain one of the leading causes of premature death and morbidity in most countries. In Belgium as in many European countries, traffic safety is currently one of the government's priorities. Identifying dangerous accident locations and profiling them in terms of accident-related data and location/environmental characteristics provide new insights into the complexity and causes of road accidents.

The spatial structure of road accidents was demonstrated long ago, but no official and universal agreement exists for defining significant spatial concentrations of road accidents. In general, methods developed for identifying accident concentrations apply to hot spots (also called “black” spots, hazardous locations, sites with promise, etc.), which are pinpoint concentrations of road accidents that often migrate over time (see e.g. Silcock and Smyth, 1985; Maher, 1990; Nguyen, 1991; Joly et al., 1992; Hauer, 1996; Thomas, 1996 or Vandersmissen et al., 1996). More recently, the identification of “black” zones or hazardous road segments has been reconsidered in the literature (see Flahaut et al., 2003 for a review); this interest arises from the awareness of the spatial interaction existing between contiguous accident pinpoint locations. The existence of such road sections on which the number of accidents is high reveals spatial concentrations and hence suggests spatial dependence between individual accident occurrences. In fact, these studies focus on a well-known exploratory spatial data analysis problem: the definition and the explanation of hot spots (see e.g. Levine, 2002 or Vistisen, 2002).

In this paper, the location and the length of the “black” zones are defined by means of local spatial autocorrelation indices (see Section 3.2), and they are considered as given in our problem. Therefore, the problem tackled in this paper is not the definition of the “black” zone, but its exploration. We argue that it is not possible to develop effective countermeasures to reduce the number of accidents at these locations without being able to properly and systematically relate accident frequency and severity to a number of variables such as roadway geometries, traffic control devices, roadside features, roadway conditions, driver behavior or vehicle type (Kononov and Janson, 2002). Hence, several attempts are found in the literature to explain the spatial variation of road unsafety at several levels of spatial aggregation (see Flahaut, 2004a; Flahaut, 2004b for a review). Our approach, however, is purely exploratory: it aims to understand how road accidents cluster in hazardous road segments. More specifically, we are interested in finding out which factors are associated with the accidents in “black” zones by generating frequent item sets. This data mining technique automatically identifies accident circumstances that frequently occur together. In this way, we put forward a number of hypotheses, which we then try to explain using other research studies and domain knowledge. Statistical models have been widely used on such accident data to analyze road crashes in order to explain the relationship between crash involvement and traffic on the one hand and geometric and environmental factors on the other hand (Lee et al., 2002). However, Chen and Jovanis (2002) indicate that not only should the main effects of driver, vehicle, roadway and environmental factors be analyzed; interactions between factors are also very likely to be significant. The authors demonstrate that the large number of potentially important factors, combined with the complex nature of crash etiology and injury outcome, presents certain challenges when using classic statistical analysis on datasets with large dimensions. These challenges include an exponential increase in the number of parameters as the number of variables increases, and the invalidity of statistical tests as a consequence of sparse data in large contingency tables. Furthermore, a large number of factors needs to be selected, and a comprehensive but feasible set of main factors and interactions needs to be specified for testing in statistical models.

This is where data mining comes into play. Data mining can be defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large amounts of data (Fayyad et al., 1996). From a statistical perspective it can be viewed as a computer automated exploratory data analysis of (usually) large complex data sets (Friedman, 1997). However, in contrast with statistical techniques, the problems and methods of data mining have some distinct features of their own. Not only can data sets be much larger than in statistics, with data analyses on a correspondingly larger scale, but there are also differences of emphasis in the approach to modeling: compared with statistics, data mining pays less attention to the large-scale asymptotic properties of its inferences and more to the general philosophy of “learning”, including consideration of the complexity of models and the computations they require (Hosking et al., 1997). Furthermore, data mining has tackled problems such as what to do in situations where the number of variables is so large that looking at all pairs of variables is computationally infeasible (Mannila, 2000). Additionally, in contrast with statistics, data mining is typically a form of secondary data analysis: the data have been collected for some purpose other than answering a specific data analytical question. For the purposes of this paper it is sufficient to point out that statistical models are particularly likely to be preferable when fairly simple models are adequate and the important variables can be identified before modeling. However, when dealing with a large and complex data set of road accidents, the use of data mining methods seems particularly useful.

Some examples of the use of data mining in road accident analysis can be found in the literature. For example, clustering techniques are used to discover frequent patterns in accident data (see e.g. Ljubic et al., 2002). Additionally, the data mining technique of rule induction can be used to identify rule sets representing interesting subgroups in accident data (see e.g. Kavsek et al., 2002). Furthermore, decision trees (see e.g. Strnad et al., 1998; Clarke et al., 1998) and neural networks (see e.g. Mussone et al., 1999) are used to model and analyze road accidents. Finally, spatial data mining (see e.g. Zeitouni and Chelghoum, 2001) can be applied.

In this research, data mining is applied to understand the characteristics of the accidents associated with “black” zones or hazardous road segments. In particular, an existing technique of frequent item sets is used as an explorative technique to generate accident patterns, which may reveal new and surprising patterns not yet found in other research. More specifically, accident circumstances that frequently occur together inside “black” zones will be identified. Furthermore, these patterns are compared with accident characteristics occurring outside those “black” zones. This makes it possible to investigate the differences between accident patterns inside and outside “black” zones, and hence to understand why spatial concentrations are observed.

The remainder of this paper is organized as follows. First a formal introduction to the association algorithm and the concept of frequent item sets is provided (Section 2). This will be followed by a description of the dataset and the studied area (Section 3). In Section 4, the empirical study is explained and in Section 5 the results of this study are presented. The paper will be completed with a summary of the conclusions and directions for future research.

Section snippets

KDD process

As explained in the introduction, data mining is used to discover patterns and relationships in data, with an emphasis on large, observational databases (Friedman, 1997). According to Fayyad et al. (1996) data mining can be considered as a separate step of the “knowledge discovery in databases” (KDD) process (see Fig. 1). This KDD process refers to the overall process of discovering useful knowledge from data. The additional steps in the KDD process, such as data preparation, data selection,

The studied area

In Belgium, each road accident occurring on a public road and involving casualties is reported officially (National Institute of Statistics). Its location is known accurately on numbered roads because there is a stone marker at every hectometer; numbered roads are motorways, national and provincial roads linking towns together. Hence, this analysis is limited to accidents with casualties on numbered roads. The period under study is 1997–1999: it is long enough to limit random fluctuations in

Empirical study

As explained in Section 2.1 of this paper, we can distinguish different steps in the mining process: a pre-processing step and a transformation step in which the available data are prepared for the use of the mining technique, a mining step for generating the frequent item sets and a post-processing step for evaluating and interpreting the most interesting patterns.
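The mining step described above can be illustrated with a minimal level-wise (Apriori-style) sketch. This is not the implementation used in the study: the accident attributes, the toy records and the minimum support of 0.4 are hypothetical and chosen only to make the idea concrete. Each accident is represented as a set of items, and an item set is kept when the fraction of accidents containing it reaches the minimum support.

```python
from itertools import combinations


def frequent_item_sets(transactions, min_support, max_size=4):
    """Return {item set: support} for every item set with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that appears in the data.
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    size = 1
    while current and size <= max_size:
        # Keep the candidates of this size that are frequent enough.
        level = {}
        for candidate in current:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                level[candidate] = support
        result.update(level)
        # Build candidates one item larger; every subset one item smaller
        # must itself be frequent (the Apriori pruning property).
        size += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(subset) in level
                for subset in combinations(union, size - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return result


# Hypothetical accident records, one set of circumstances per accident.
accidents = [
    {"left_turn", "signalized_intersection", "rain"},
    {"left_turn", "signalized_intersection"},
    {"pedestrian", "rain"},
    {"left_turn", "signalized_intersection", "pedestrian"},
    {"run_off_road", "rain"},
]

frequent = frequent_item_sets(accidents, min_support=0.4)
for item_set, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(item_set), support)
```

In this toy data only one pair, left turn together with signalized intersection, reaches the threshold; the post-processing step would then decide whether such a pattern is interesting.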

Accident patterns in “black” zones

Selecting the frequent item sets that are unique for accidents occurring inside a “black” zone and with very strong lift values results in 50 item sets of size 2 (lift < 0.5 or lift > 5), 108 item sets of size 3 (lift < 0.5 or lift > 5) and 240 item sets of size 4 (lift < 0.5 or lift > 15). Table 2 gives an overview of the most interesting of these frequent item sets. In the remainder of this paper, we will refer to the number of these item sets [N] when discussing the results.
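The selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: lift is taken here as the observed support of an item set divided by the product of the supports of its individual items (its expected support under independence), which is one common generalization to item sets of more than two items. The function names, the toy records and the item labels are hypothetical; the thresholds 0.5 and 5 are those quoted in the text for item sets of size 2.

```python
from math import prod


def support(item_set, transactions):
    """Fraction of transactions that contain every item of item_set."""
    item_set = frozenset(item_set)
    return sum(1 for t in transactions if item_set <= t) / len(transactions)


def lift(item_set, transactions):
    """Observed support divided by the support expected under independence."""
    expected = prod(support({item}, transactions) for item in item_set)
    if expected == 0:
        raise ValueError("every item must occur at least once")
    return support(item_set, transactions) / expected


def unique_strong_item_sets(frequent_inside, frequent_outside, inside, low=0.5, high=5.0):
    """Keep item sets frequent only inside "black" zones and with extreme lift."""
    selected = {}
    for item_set in frequent_inside:
        if item_set in frequent_outside:
            continue  # also frequent outside "black" zones, so not unique
        value = lift(item_set, inside)
        if value < low or value > high:
            selected[item_set] = value
    return selected


# Hypothetical accidents located inside "black" zones.
inside = [frozenset({"pedestrian", "night"})] * 2 + [frozenset({"dry_road", "daylight"})] * 18

# Hypothetical outputs of a frequent item set miner for both groups.
frequent_inside = [frozenset({"pedestrian", "night"}), frozenset({"dry_road", "daylight"})]
frequent_outside = {frozenset({"head_on", "night"})}

selected = unique_strong_item_sets(frequent_inside, frequent_outside, inside)
for item_set, value in selected.items():
    print(sorted(item_set), round(value, 2))
```

A lift far above 1 indicates circumstances that occur together much more often than expected, while a lift far below 1 indicates circumstances that tend to exclude each other; both are retained by the two-sided threshold.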

A first result shows that

Frequent item sets and accident analysis

In this paper, the association algorithm was used on a data set of road accidents to profile “black” zones in terms of accident-related data and location characteristics. More specifically, frequent item sets are generated to identify accident circumstances that frequently occur together in order to find out which factors explain the occurrence of the accidents in “black” zones. As explained in the introduction, the use of this technique coincides with the explorative character of this research

Acknowledgements

This research was supported by the OSTC and the Flemish Research Centre for Traffic Safety. The authors would also like to thank Dr. Tom Brijs for his encouragement and helpful suggestions.

References (57)

  • E. LaScala et al. (2000). Demographic and environmental correlates of pedestrian injury collisions: a spatial analysis. Accid. Anal. Prev.
  • J. Lee et al. (2002). Impact of roadside features on the frequency and severity of run-off-roadway accidents: an empirical analysis. Accid. Anal. Prev.
  • M. Maher (1990). A bivariate negative binomial model to explain traffic accident migration. Accid. Anal. Prev.
  • J.-L. Martin (2002). Relationship between crash rate and hourly traffic flow on interurban motorways. Accid. Anal. Prev.
  • L. Mussone et al. (1999). An analysis of urban collisions using an artificial intelligence model. Accid. Anal. Prev.
  • S. Rajalin (1994). The connection between risky driving and involvement in fatal accidents. Accid. Anal. Prev.
  • M. Strnad et al. (1998). Young children injury analysis by the classification entropy method. Accid. Anal. Prev.
  • I. Thomas (1996). Spatial data aggregation: exploratory analysis of road accidents. Accid. Anal. Prev.
  • Y. Wong et al. (1992). Driver behaviour at horizontal curves: risk compensation and the margin of safety. Accid. Anal. Prev.
  • Agent, K.R., Deen, R.C., 1975. Relationship between roadway geometrics and accidents. Transportation Research Record...
  • R. Agrawal et al. Mining association rules between sets of items in large databases.
  • R. Agrawal et al. (1996). Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining.
  • S.S. Anand et al. Tackling the cross sales problem using data mining.
  • L. Anselin (1995). Local indicators of spatial association-LISA. Geographical Anal.
  • M. Berry et al. (1997). Data Mining Techniques for Marketing, Sales and Customer Support.
  • M. Braddock et al. (1994). Using a geographic information system to understand child pedestrian injury. Am. J. Public Health.
  • S. Brin et al. Beyond market baskets: generalizing association rules to correlations.
  • Casaer, F., Eckhardt, N., Steenberghen T., Thomas, I., Wets, G., Quality assessment of the Belgian traffic accident...