Abstract
The recent COVID-19 pandemic stimulated unprecedented linkage of datasets worldwide. While injury is endemic rather than pandemic, the injury prevention community has much to learn from the data science approaches taken in response to the pandemic, which can support research into the primary, secondary and tertiary prevention of injuries. The use of routinely collected data to produce real-world evidence, as an alternative to clinical trials, has been gaining in popularity as the availability and quality of digital health platforms grow, and the linkage landscape, and the analytics required to make best use of linked and unstructured data, are rapidly evolving. Capitalising on existing data sources, innovative linkage and advanced analytic approaches provides the opportunity to undertake novel injury prevention research and generate new knowledge, while avoiding data waste and additional burden to participants. We provide a tangible, but not exhaustive, list of examples showing the breadth and value of data linkage, along with the emerging capabilities of natural language processing techniques to enhance injury research. To optimise data science approaches to injury prevention, injury researchers in this area need to share methods, code, models and tools to improve consistency and efficiency in this field. Increased collaboration between injury prevention researchers and data scientists working on population data linkage systems has much to offer this field of research.
- Injury Diagnosis
- Theory
- Mechanism
- Coding Systems
- Epidemiology
- Indicators
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Introduction
The recent COVID-19 pandemic stimulated unprecedented linkage of datasets in many jurisdictions to better understand the patterns of transmission of the SARS-CoV-2 virus, vulnerability to disease and the effectiveness of counter-measures.1 While injury is endemic rather than pandemic, there is much to be learnt by the injury prevention community from the approaches taken to respond to the pandemic. Using examples, we highlight how data science techniques, such as data linkage and natural language processing (NLP), can support research into the primary, secondary and tertiary prevention of injuries. This can, in turn, provide a more comprehensive understanding of risk factors and counter-measures to inform programme and policy development.
Data linkage is a technique in which records about the same individual or entity (such as a household or organisation), held in different data sources, are brought together. Pieces of information about individuals are collected by many different organisations for health, work, education, taxation and many other purposes and exist in large numbers of unlinked data silos. Bringing such data together poses many challenges, not least privacy protection. However, linkage is possible using a common unique identifier for the individual or entity held in the different databases or through probabilistic matching based on a variety of attributes such as name, sex, date of birth and address. The Administrative Data Research UK website provides a good example of how such identity matching is achieved.2 In almost all jurisdictions, this is legally possible and is increasingly successfully conducted through the creation and utilisation of trusted research environments (TREs), which hold deidentified data from many sources on the general population. More details on the technical aspects behind these developments and their widespread use can be found in a variety of online sources such as the International Population Data Linkage Network (IPDLN; https://ipdln.org/) and the video produced by the Australian Population Health Research Network.3
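The two matching routes described above can be illustrated with a minimal Python sketch. It is not the method used by any named linkage service: the records are invented, and the field names (`rec`, `uid`), the agreement weights and the acceptance threshold are all assumptions chosen for illustration. Real probabilistic linkage estimates its weights from the data and is run on protected identifiers inside a secure environment.

```python
from difflib import SequenceMatcher

# Illustrative agreement weights and cut-off (assumed, not estimated from data).
WEIGHTS = {"name": 4.0, "sex": 1.0, "dob": 5.0}
THRESHOLD = 8.0

# Invented records from two hypothetical, unlinked sources.
hospital = [
    {"rec": "H1", "uid": "111", "name": "Ann Jones", "sex": "F", "dob": "1980-02-01"},
    {"rec": "H2", "uid": None, "name": "Jon Smyth", "sex": "M", "dob": "1975-07-12"},
    {"rec": "H3", "uid": None, "name": "Mary Evans", "sex": "F", "dob": "1990-03-03"},
]
education = [
    {"rec": "E1", "uid": "111", "name": "Anne Jones", "sex": "F", "dob": "1980-02-01"},
    {"rec": "E2", "uid": None, "name": "John Smith", "sex": "M", "dob": "1975-07-12"},
    {"rec": "E3", "uid": None, "name": "Mark Evans", "sex": "M", "dob": "1991-03-03"},
]


def name_similarity(a, b):
    """Return a 0-1 string similarity so minor spelling variants still score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Sum the weights of the attributes that agree between two records."""
    score = 0.0
    if name_similarity(rec_a["name"], rec_b["name"]) >= 0.8:
        score += WEIGHTS["name"]
    if rec_a["sex"] == rec_b["sex"]:
        score += WEIGHTS["sex"]
    if rec_a["dob"] == rec_b["dob"]:
        score += WEIGHTS["dob"]
    return score


def link(records_a, records_b):
    """Link on a shared identifier where present, else on attribute agreement."""
    links = []
    for a in records_a:
        # Route 1: deterministic match on the common unique identifier.
        exact = [b for b in records_b if a["uid"] and a["uid"] == b["uid"]]
        if exact:
            links.append((a["rec"], exact[0]["rec"], "deterministic"))
            continue
        # Route 2: accept the best-scoring candidate only above the threshold.
        best = max(records_b, key=lambda b: match_score(a, b), default=None)
        if best is not None and match_score(a, best) >= THRESHOLD:
            links.append((a["rec"], best["rec"], "probabilistic"))
    return links


for pair in link(hospital, education):
    print(pair)
```

In this toy data, H1 links through the shared identifier, H2 links on name, sex and date of birth despite spelling differences, and H3 is left unlinked because only the name agrees. Where the threshold sits governs the trade-off between false links and missed links.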
Why use data linkage in injury research?
Given the well-known lack of funding for injury prevention research4 and the expense of conducting large scale and long-term randomised trials in the community, there is a dearth of trial-based evidence for many promising interventions. The use of routinely collected data to produce real-world evidence, as an alternative to clinical trials, has been gaining in popularity as the availability and quality of digital health platforms grow.5
Arguably, given the social drivers of injury risk, such as the underlying differential exposures to hazards and the ability to modify and respond to these as experienced by people from different socioeconomic, racial, ethnic, geographic and income groups,6 many interventions aimed at non-injury factors could prevent injuries or reduce their severity, but these data are often lacking or insufficiently characterised when single data sources are used. Embedding trials, cohorts and evaluations of natural experiments in population data linkage systems with a wide array of health and other data provides opportunities to conduct evaluations with respect to injury outcomes and a wide range of their consequences across individual, family and societal domains as highlighted in the Injury List of All Deficits (LOAD) framework,7 even when these were not originally planned. In addition, extensive data are captured in text-based form on the antecedents of injury in paramedic case descriptions, emergency department (ED) presenting complaint and triage fields, hospital admission records and discharge summaries, and in other routinely completed specialist forms and notes along a person’s treatment journey. Radiology reports and other clinical records contain rich detail on the nature and severity of injury, much of which is not incorporated in routine coding systems such as International Classification of Diseases 10th revision. Similarly, police crash records, insurance records and ED text narratives of the injury event contain valuable information about the circumstances of the event, built environment and road infrastructure and safety.8 Very little of this is coded and available in electronic sources accessible by researchers. NLP has the ability to improve the quality and depth of data capture, which when linked to other datasets, can substantially improve the utility and completeness of data. 
NLP draws on the fields of artificial intelligence and linguistics and allows computers to interpret words and phrases written by people. It includes techniques such as named entity recognition and text summarisation, which could help to automatically code diagnoses, mechanisms of injury and underlying aetiology, as we detail below.9 Recently, there has been considerable interest in the development of a component of NLP known as large language models (LLMs), which are computer models capable of generating text and predicting answers and are widely used in internet search engines and predictive text.
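As a rough illustration of what automated coding of free text involves, the Python sketch below assigns mechanism-of-injury categories to short narratives using keyword rules. It is a deliberately simple rule-based baseline, not named entity recognition or an LLM, and the categories, keyword lists and example narratives are hypothetical.

```python
import re

# Hypothetical keyword patterns for each mechanism category (illustrative only).
MECHANISM_PATTERNS = {
    "fall": re.compile(r"\b(fell|fall|fallen|slipped|tripped)\b"),
    "road traffic": re.compile(r"\b(rtc|mva|car|motorbike|cyclist|pedestrian)\b"),
    "burn": re.compile(r"\b(burn|burnt|scald|scalded)\b"),
}


def code_mechanism(narrative):
    """Return every mechanism category whose keywords appear in the text."""
    text = narrative.lower()
    return [name for name, pattern in MECHANISM_PATTERNS.items()
            if pattern.search(text)]


# Invented triage-style narratives.
examples = [
    "Pt tripped on rug at home, painful L wrist",
    "Cyclist hit by car, fell onto road",
    "Scald to hand from kettle",
    "Chest pain",
]
for note in examples:
    print(note, "->", code_mechanism(note))
```

Rules like these are transparent and cheap to run inside a secure environment, but they miss misspellings, negation ("no fall") and unanticipated wording, which is part of the motivation for the statistical and LLM-based approaches discussed later.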
The linkage landscape, and the analytics required to make best use of linked and unstructured data, are rapidly evolving. What might not have been possible previously can become feasible quite quickly. Capitalising on existing data sources, innovative linkage and advanced analytic approaches provides the opportunity to undertake novel injury prevention research and generate new knowledge, while avoiding data waste and additional burden to participants.
Key uses of data linkage in injury prevention research
Population-based data linkage provides the potential to evaluate:
Embedded individual or cluster randomised trials which link to injury outcomes.
Cohorts and surveys which link to injury outcomes.
Evaluations of natural experiments with links to injury outcomes.
Embedded randomised trials
Notably, injury prevention or injury treatment trials with long-term outcomes monitored by linkage to routine data are largely absent. This reflects a wider trend. In the UK, where record linkage is well established, only a minority (<3%) of trials are linked to routine data.10 Nevertheless, studies have shown the capability of data linkage to establish important end points in trials. For example, the West of Scotland Coronary Prevention Study highlighted the benefits of long-term data linkage.11 In this randomised, placebo-controlled, primary prevention trial of pravastatin (a medication designed to lower cholesterol in the blood), capturing long-term mortality through data linkage dramatically changed the cost-benefit analysis of statin prescribing for heart disease prevention. The 15-year linkage was not planned in the original 5-year study and cost a mere £15 000. It is arguably the best example of the cost-effectiveness of record linkage research.11
Cohort studies
Population-based cohort studies that use record linkage to evaluate injury risk and outcomes are more common. The Millennium Cohort Study (MCS) in the UK identified an association between physical activity and injury risk; boys, but not girls, who were overall more physically active experienced higher rates of injuries resulting in ED attendances and hospital admissions.12 13 The Avon Longitudinal Study of Parents and Children birth cohort demonstrated that much of the area-level variation in childhood injuries was due to variations in maternal and child health risk factors in individuals clustered into neighbourhoods.14 15 Several population-based cohort studies have identified the detrimental impact of childhood injury admission on educational attainment.16–18 In the USA, a separate study also named the Millennium Cohort Study was set up to evaluate the impact of military experience on the health of service members and veterans. It recruited a cohort of 260 228 military personnel across 5 panels between 2001 and 2021, with repeated surveys and data linkage to administrative and medical data sources. This has enabled many studies, including one examining the relationship between deployment injuries and mental and physical quality of life.19
A number of population cohort studies using linked data have focused on priority and at-risk populations, which have been historically challenging populations to study. Using data linkage in Ontario, Canada, O’Neill et al were able to explore mental health and assault care treatment in the year prior to death by homicide, providing important insights into potential pathways for homicide prevention.20 Another example was the use of linked data in New Zealand to show the high rate of self-harm in people released from prison, challenges of transitioning from prison to the community and the opportunities to improve the care of people in this situation.21
Natural experiments
In recent years, there has been considerable development of methodologies for the evaluation of natural experiments of interventions where the intervention has not been randomised.22 Natural experiments are an attractive design as they enable evaluations of interventions that are difficult to randomly allocate, such as policy and health system changes, as well as those where the providers find it difficult to randomise their services.23 Well-designed evaluations using data linkage provide an efficient approach to measuring outcomes in natural experiments.
Recent examples of linkage between service provision data and high-quality research registers provide exemplars for the evaluation of secondary and tertiary prevention initiatives. The Emergency Medicine Retrieval and Transport Service (EMERTS) in Wales provides physicians to emergency events by helicopter and fast cars. Its evaluation required severity-matched cases who were or were not transported to hospital by EMERTS, and this was achieved through individual record linkage to the Trauma Audit and Research Network database.24 A 37% reduction in risk-adjusted mortality in patients transported by EMERTS was observed, which resulted in 24-hour, nationwide expansion of the service. Similarly, linked routine Victorian State Trauma Registry and hospital clinical performance and administrative data were used to evaluate the impact of new infrastructure (a purpose-built ward) and a new model of allied healthcare. The evaluation found that the new model of care, but not the infrastructure change alone, was highly cost-effective.25
Notably, a key limitation of randomised controlled trials can be low external validity. Data linkage can be used to assess whether findings from trials are generalisable to the wider population. For example, the Nurse-Family Partnership in the USA was a randomised trial of nurse home visitations to families at risk of poor health and social outcomes. The trial found that nurse home visitations reduced ED presentations for injuries and poisoning and resulted in fewer cases of child abuse or neglect.26 Based on this, similar models were implemented in a number of jurisdictions. However, the results were not replicated in other areas when population data linkage was used in the evaluation.27 28 Data linkage helped to demonstrate that the intervention model did not generalise to other settings.
Household data linkage
The ability to link data at both the household and individual levels opens up opportunities to evaluate injury prevention interventions at the household or grouped household level.29 Housing quality has been shown to be related to a variety of injuries, but studies are few.30 The Carmarthenshire Housing and Health Study used data linkage to follow 32 009 residents across 8558 social homes to evaluate the health impacts of various housing improvements over 10 years.31 These authors reported a 39% reduction in emergency admissions in people over 60 overall, along with smaller but still significant reductions in injury admissions associated with some of the selected interventions; however, the study was not powered to test specific reductions in injuries.31
Studies from the UK and New Zealand have used household record linkage to explore the distance between homes and the nearest alcohol outlets as an exposure, with alcohol-related harms as the outcome (alcohol-related hospital attendances and admission and police reported crime).32 33 These studies found disparate results, which potentially relate to environmental and exposure-related factors across different populations.
Household level linkage would also help augment the quantification of the impact of injuries on cohabiting family members, as recommended in the LOAD framework.7
The benefits of NLP in improving data quality and depth for injury prevention research
NLP and machine learning techniques have been applied to unstructured text data for many years in the injury domain, most commonly in the fields of occupational surveillance, ED injury surveillance, product safety surveillance and social media surveillance, particularly in relation to mental health, self-harm and substance abuse.34 35
NLP techniques have been used for: (1) better understanding causal mechanisms involved in injury events by providing more contextual information about direct and underlying mechanisms beyond basic coded categories, (2) capturing pre-event risk factors, mechanisms and object interactions, (3) capturing rare causes/emerging hazards, which would otherwise not have been captured in coded form and (4) automating/semi-automating coding to enable more rapid reporting of data than coding resources allow.36–38 As LLMs continue to develop, there is significant potential for a flow-on effect in which NLP techniques become more sophisticated, accurate and closer to real time in supporting decision-making.39 However, such algorithms may make errors and require validation against human expert coding for clinical accuracy.40 There are also challenges to the use of LLMs due to privacy protection and large computational requirements. Some LLMs require the data to be moved from its original source and fed into the LLM. The difficulty with this is that the text may be highly disclosive of an individual, given that many injury events are reported in the media. An alternative approach is to import open source LLMs into the original data source environment or the TRE and conduct the research there. This hampers open science to some extent, but the effect can be ameliorated by sharing the underlying code and algorithms. This field of research is promising, but still in its infancy in terms of contributing to injury research.
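Validation against human expert coding is usually summarised with an agreement statistic. As one possible approach, the Python sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The choice of kappa is ours rather than something the cited studies are known to have used, and the two label lists are invented.

```python
from collections import Counter


def cohen_kappa(human, machine):
    """Chance-corrected agreement between two equal-length lists of codes."""
    if len(human) != len(machine) or not human:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(human)
    # Proportion of records on which the two coders agree.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Agreement expected by chance, from each coder's marginal frequencies.
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    expected = sum(human_counts[c] * machine_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single, identical category throughout
    return (observed - expected) / (1 - expected)


# Invented codes for six narratives: expert coder versus algorithm.
human_codes = ["fall", "fall", "burn", "rtc", "fall", "burn"]
machine_codes = ["fall", "fall", "burn", "fall", "fall", "rtc"]
print(round(cohen_kappa(human_codes, machine_codes), 3))
```

Here the coders agree on four of six records (67%), but because "fall" dominates both sets of codes, chance agreement is about 42%, which leaves a kappa of about 0.43. Reporting per-category sensitivity and precision alongside kappa would show which mechanisms the algorithm codes poorly.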
Closing comments and next steps
As data linkage capabilities evolve, and the analytic techniques that unlock the unstructured data inherent in datasets available for linkage mature, the potential to use these data sources to enhance evidence-informed injury prevention and policy development is evident. We have provided a tangible, but not exhaustive, list of examples showing the breadth and value of data linkage, and the emerging capabilities of NLP techniques to enhance injury research. Data linkage capacity continues to grow in high-income, low-income and middle-income countries.
Injury prevention researchers should review the data sources available to them and explore whether data linkage facilities already exist in their jurisdiction or setting. A good place to start is by searching the International Population Data Linkage Network (IPDLN) membership directory for potential collaborators (https://ipdln.org/membership/). Once researchers are informed of the potential for linkage in their area, they can consider what additional questions could be answered through this paradigm. In particular, embedding individual or household interventions into data linkage systems would facilitate evaluations that are difficult or impossible to achieve through standard randomised trials. This applies both to interventions primarily designed to prevent injuries (for example, installation of stair gates or handrails) and to those where injury prevention could be a secondary aim (for example, wealth transfers to reduce overall inequalities in health).
It is also important for injury researchers in this area to share their methods, code, models and tools to improve consistency and efficiency in this field. Increased collaboration between injury prevention researchers and data scientists working on population data linkage systems has much to offer this field of research.
Ethics statements
Patient consent for publication
Ethics approval
This work was a combination of expert opinion and literature review and commentary and did not require ethical approval.
References
Footnotes
X @EmergTrauma
Correction notice This article was updated to CC-BY-NC on 14/11/24.
Contributors RAL contributed to the original conception and design of the paper, prepared the initial draft and undertook revisions and final approval of the paper. BJG and KV contributed to the revised conception and design, contributed to the second draft and revisions and final approval of the paper. RAL is the guarantor.
Funding RAL is supported by the Administrative Data Research Wales grant (ES/W012227/1) funded by the Economic and Social Research Council (ESRC). BJG is supported by a National Health and Medical Research Council of Australia Investigator Grant (L2, ID2009998). KV is supported by funding from the Motor Accident Insurance Commission Queensland.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.