Not all evidence is created equally (an update)

The Australian and New Zealand Society of Evidence Based Policing (@ANZSEBP) recently retweeted a graphic from an old blog of mine, so this seems a good time to update and explain it a little.

The chart above is adapted from various sources and emphasizes quantitative studies and randomized trials. Some argue that randomized trials can be of limited value, or difficult to implement, and that observational studies and other sources of information can inform policing. This is all true. Moreover, qualitative research can be useful for interpreting evaluation results, gaining insight into why programs succeed or fail, and considering where they should go next. But if you have an opportunity to conduct an evaluation, try to design it to get the best possible assessment of the program.

Every research field has variable levels of what is called methodological quality. If you think all evaluations are equally useful for deciding how we spend our money, then boy, do I have a bridge to sell you!

Just look through Amazon. Reviewers rarely compare one product against another. You more frequently find five-star reviews alongside comments such as “Can’t wait to try this!” or “It arrived on time and works as advertised”. Your widget might work as advertised, but does it work better than other widgets?

One of the biggest challenges evaluators encounter is ruling out competing explanations for a crime drop. Here’s a recent example. San Francisco’s police department credited crisis intervention training with a reduction in police use-of-force incidents. Simply noting a change in the numbers does not, however, rule out a range of other possible explanations, such as officers conducting fewer proactive field investigations or making fewer arrests (activities that can sometimes spark an incident).

Not to mention, it is not uncommon for two or three different programs to claim credit for crime drops in the same area.

The center column of my updated figure now shows examples of each level. If terms like cross-sectional or longitudinal feel unnecessarily technical—welcome to academic jargon—then the examples may help. You might make the connection between my example of the license plate readers and San Francisco’s crisis intervention training.

San Fran’s CIT program scores at best a 2 (because of a simplistic pre-post claim), or, worse, one of the zeroes, because it is an internal, non-peer-reviewed assertion. The lowest zero level probably seems harsh on police chiefs, but many are unfamiliar with, and do not review, the research when the media calls or when they write their magnum opus. They trade on their “expertise” and hope, or believe, that their authority can substitute for a lack of knowledge (unfortunately, it frequently works).

Experience is valuable, but it is also vulnerable to many potential biases that make it less reliable.

And when academics are quoted in newspapers, their commentary passes through too many filters and is usually too brief to be a reliable source for decision-makers.

For the other zero: while I recognize that some police departments do exemplary research and may be impervious to political and internal pressure, regrettably this is rarely the case. Third-party evaluations often bring more rigor and impartiality.


Once we hit level 3 we cross an important threshold. Writing on evidence-based policing (EBP), Larry Sherman argued that “the bare minimum, rock-bottom standard for EBP is this: a comparison group is essential to every test to be included as ‘evidence’”. Above level 2 we clear this hurdle, which is why the chart background turns from red (Danger, Will Robinson!) and yellow (bees!) to green.

What’s suspect or just interesting becomes what’s promising or what works.

Up at level 5 we have experiments that randomize treatment and control groups or areas, because (in principle) they can rule out most of the problems associated with less rigorous studies. For example, by limiting our capacity to influence where or to whom a program is applied, randomization removes (or at least reduces) the risk of selection bias. I have encountered police commanders who all but demanded their pet area receive a patrol intervention, only to be thwarted by randomization. Would the program have worked in those areas anyway, or only because commanders were already paying attention to them?

Good randomized studies can rule out a large swathe of competing explanations, and this approach remains the strongest research design for testing many ideas (I don’t recommend it for parachutes).
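To make the selection bias point concrete, here is a minimal simulation sketch (Python with NumPy, and every number in it is invented). A do-nothing “program” is assigned either to the beats with the worst counts last year, the way a commander’s pet area gets picked, or to beats chosen at random. The cherry-picked beats show an apparent crime drop the following year purely through regression to the mean; the randomized beats do not.

```python
import numpy as np

# Hypothetical illustration only: 200 imaginary beats, no real data.
rng = np.random.default_rng(42)
n_beats = 200

# Each beat has a stable underlying crime rate; yearly counts fluctuate around it.
true_rate = rng.gamma(shape=5, scale=10, size=n_beats)
year_before = rng.poisson(true_rate)
year_after = rng.poisson(true_rate)   # the "program" has zero real effect

# Cherry-picked: the program goes to the 20 beats with the worst counts last year.
worst = np.argsort(year_before)[-20:]
drop_cherry = year_before[worst].mean() - year_after[worst].mean()

# Randomized: 20 beats drawn at random receive the program instead.
chosen = rng.choice(n_beats, size=20, replace=False)
drop_random = year_before[chosen].mean() - year_after[chosen].mean()

print(f"Apparent crime drop in cherry-picked beats: {drop_cherry:.1f}")
print(f"Apparent crime drop in randomized beats:    {drop_random:.1f}")
```

The first number comes out comfortably positive even though nothing changed; the second hovers around zero. That, in one toy example, is why the commander’s pet area is a poor test bed.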

It is sometimes, incorrectly, argued that randomization is unethical because it withholds benefits from the control units or areas.

We have evaluations precisely because we have not yet proven that certain programs work. If we do not know whether a program delivers any benefit, we are not withholding a proven benefit from the control group. Randomization is therefore a highly ethical approach to gauging the value of spending taxpayer dollars.

Finally, randomized experiments usually contribute to the 5* meta-analyses that examine the totality of evidence for a crime reduction program. The real world is messy, and systematic reviews conducted by trained analysts are vital tools to help us make sense of complicated areas. Within a systematic review, a single study finds its place in the wider body of research, making its contribution to policy knowledge.
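For the statistical half of a systematic review, here is a minimal sketch of the standard fixed-effect, inverse-variance pooling calculation that a meta-analysis uses to combine studies. The effect sizes and standard errors below are invented for illustration; real reviews also handle search strategy, quality screening, publication bias, and heterogeneity, none of which appears here.

```python
import numpy as np

# Invented effect sizes (e.g. log rate ratios; negative = crime reduction)
# and standard errors from five imaginary studies.
effects = np.array([-0.40, -0.10, -0.25, 0.05, -0.30])
std_errs = np.array([0.15, 0.20, 0.10, 0.25, 0.18])

# Fixed-effect inverse-variance pooling: more precise studies carry more weight.
weights = 1.0 / std_errs**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"Pooled effect: {pooled:.2f} "
      f"(95% CI {pooled - 1.96 * pooled_se:.2f} to {pooled + 1.96 * pooled_se:.2f})")
```

Each study contributes in proportion to its precision, which is how a single evaluation finds its place in the wider body of research.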

There is of course much more to understand about this area, and there are numerous verbose books about research design and evaluation methodology. Until you are brave enough for that, I hope this short, non-technical overview helps you understand the graphic and appreciate that not all research is created equal.

Not all evidence is created equally

In the policy world, not all evidence is created equally.

I’m not talking about forensic or criminal evidence (though those areas have hierarchies too). I’m referring to the evidence we need to make a good policy choice. Policy decisions in policing include questions such as: should my police department support a second responder program to prevent domestic abuse, or dedicate officers to teach D.A.R.E. in our local schools? (The answer to both is no.) The evidence I’m talking about here is not just ‘what works’ to reduce crime significantly, but also how it works (what mechanism is taking place), and when it works (in what contexts the tactic may be effective).

We can harvest ideas about what might work from a variety of sources. Telep and Lum found that a library database was the least accessed source for officers learning about which tactics might work[1]. That might be because, for most police officers, a deep dive into the academic literature is like being asked to do foot patrol in Hades while wearing a nylon ballistic vest. But it also means that the ‘what works’, ‘how it works’, and ‘when it works’ of crime reduction are never fully understood. Too many cops unfortunately rely on their intuition, opinion, or unreliable sources, as noted by Ken Pease and Jason Roach:

Police officers may be persuaded by the experience of other officers, but seldom by academic research, however extensive and sophisticated. Collegiality among police officers is an enduring feature of police culture. Most officers are not aware of, are not taught about and choose not to seek out relevant academic research. When launching local initiatives, their first action tends to be the arrangement of visits to similar initiatives in other forces, rather than taking to the journals.[2]

In fact, not only do officers favor information from other cops, but it also has to be from the right police officers. Just about everyone in policing at some point has been told to forget what they learned at the police academy. That was certainly my experience. And those courses were usually taught by experienced police officers! It’s too easy to end up with a very narrow and restricted field of experience on which to draw.

Fortunately, while a basic understanding of different research qualities is helpful, you do not need an advanced degree in research methodology to implement evidence-based policing from a range of evidence sources. It is sufficient to appreciate that there is a hierarchy of evidence in which to place your trust, and to have a rudimentary understanding of the differences between its levels. I’m not dismissing any form of research, but I am saying that some research is more reliable than others and more useful for operational decision-making. For example, internal department research may not be great for choosing between prevention strategy options, but it is hugely useful for identifying the range of problems.

There are a number of hierarchies of evidence available. Criminologists are familiar with the Maryland Scientific Methods Scale devised by Larry Sherman and colleagues[3]. In this scale, studies are ranked from 1 (weak) to 5 (strong) based on the study’s capacity to show causality and limit the effects of selection bias. Selection bias occurs when researchers choose participants in a study rather than allow random selection. It’s not necessarily an intent to harm the study, but it is a common problem. When we select participants or places to be part of a study, we can unconsciously choose places that will perform better than randomly selected sites. We cherry-pick the subjects, and so bias the study and get the result we want. It’s why so many supposedly great projects have been difficult to replicate.

But the Maryland scale only addresses academic research, and police officers get knowledge and information from a range of sources. Rob Briner’s work in evidence-based management also stresses this. Go into any police canteen or break room and you hear anecdote after anecdote (I wrote a few weeks ago about the challenges of ‘experience’). These examples of homespun wisdom—also known as carefully selected case studies—are often illustrative of an unusual case and not a story of the mundane and ordinary that we deal with every day.

This professional expertise can also be supplemented by the concerns of local stakeholders (including the public and the wider enforcement community, such as other police departments or prosecutors). Knowing what is important to stakeholders, and what innovative approaches colleagues have adopted, is useful in the hunt for a solution to a crime or disorder problem.

Police also get information from experts and internal reports. In larger forces, reports from statistics or crime analysis units can be important sources of information. Organizational data are therefore useful to officers trying to replicate a crime reduction that a colleague in a different district appears to have achieved. All of these sources matter to some police officers, and they deserve a place on the hierarchy. But because they are hand-selected, can be atypical, might be influenced internally, or have not been subjected to much scrutiny, they get a lower place on the chart.

In the figure on this page, I’ve pulled together a hierarchy of evidence from a variety of sources and tried to combine them so that you can appreciate each type of evidence along with its benefits and concerns. Hopefully this will also help you interpret the evidence from sources such as the National Institute of Justice’s CrimeSolutions.gov website and the UK College of Policing’s Crime Reduction Toolkit. In a later post I’ll try to expand on each of the levels (from 0 to 5*).

 

  1. Telep, C.W. and C. Lum, The receptivity of officers to empirical research and evidence-based policing: An examination of survey data from three agencies. Police Quarterly, 2014. 17(4): p. 359-385.
  2. Pease, K. and J. Roach, How to morph experience into evidence, in Advances in Evidence-Based Policing, J. Knutsson and L. Tompson, Editors. 2017, Routledge. p. 84-97.
  3. Sherman, L.W., et al., Preventing Crime: What works, what doesn’t, what’s promising. 1998, National Institute of Justice: Washington DC.