Not all evidence is created equally (an update)

The Australian and New Zealand Society of Evidence Based Policing (@ANZSEBP) recently retweeted a graphic from an old blog post of mine, so this seems a good time to update and explain it a little.

The chart above is adapted from various sources and emphasizes quantitative studies and randomized trials. Some argue that randomized trials can be of limited value, or difficult to implement, and that observational studies and other sources of information can inform policing. This is all true. Moreover, qualitative research can be useful when interpreting evaluation results, seeking insights into why programs succeed or fail, and considering where they can go moving forward. But if you have an opportunity to conduct an evaluation, try to design it to get the best possible assessment of the program.

Any research field has variable levels of what is called methodological quality. If you think all evaluations are useful for deciding how we spend our money, then boy, do I have a bridge to sell you!

Just look through Amazon. Reviewers rarely compare one product against another. You more frequently find five-star reviews alongside comments such as “Can’t wait to try this!” or “It arrived on time and works as advertised”. Your widget might work as advertised, but better than other widgets?

One of the biggest challenges evaluators encounter is rejecting competing explanations for a crime drop. Here’s a recent example. San Francisco’s police department credited crisis intervention training with a reduction in police use of force incidents. Simply noting a change in numbers does not, however, rule out a range of other possible explanations, such as officers conducting fewer proactive field investigations or making fewer arrests (activities that can sometimes spark an incident).

Not to mention, it is not uncommon for two or three different programs to claim credit for crime drops in the same area.

The center column of my updated figure now shows examples of each level. If terms like cross-sectional or longitudinal feel unnecessarily technical—welcome to academic jargon—then the examples may help. You might make the connection between my example of the license plate readers and San Francisco’s crisis intervention training.

San Fran’s CIT program scores at best a 2 (because of a simplistic pre-post claim) or, worse, one of the zeroes, because it is an internal, non-peer-reviewed assertion. The lowest zero level probably seems harsh on police chiefs, but many are unfamiliar with, and do not review, the research when the media calls or they write their magnum opus. They trade on their “expertise” and hope or believe their authority is a replacement for a lack of knowledge (unfortunately, it frequently works).

Experience is valuable, but it is also vulnerable to many potential biases that make it less reliable.

And when academics get quoted in newspapers, their commentary goes through too many filters, and is usually too brief, to be a reliable source for decision-makers.

For the other zero, while I recognize some police departments do exemplary research and may be impervious to political and internal pressure, regretfully, this is rarely the case. Third-party evaluations often bring more rigor and impartiality.


Once we hit level 3 we cross an important threshold. Writing on evidence-based policing (EBP), Larry Sherman argued “the bare minimum, rock-bottom standard for EBP is this: a comparison group is essential to every test to be included as ‘evidence’”. Above level 2 we cross this hurdle, hence the chart background turns from red (Danger Will Robinson!) or yellow (bees!) to green.

What’s suspect or just interesting becomes what’s promising or what works.
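
To see why the comparison group matters, here is a minimal sketch in Python. The incident counts are invented purely for illustration, and the function names are hypothetical. It contrasts a simple pre-post change with the same change measured against a comparison area, an approach often called difference-in-differences.

```python
# Hypothetical incident counts (before, after), invented for illustration only.
program_area = (100, 80)      # area that received the program
comparison_area = (100, 85)   # similar area that did not


def pre_post_change(before, after):
    """Raw change in the program area: the simplistic pre-post claim."""
    return after - before


def change_vs_comparison(treat_before, treat_after, comp_before, comp_after):
    """Change in the program area minus the change in the comparison area."""
    return (treat_after - treat_before) - (comp_after - comp_before)


naive = pre_post_change(*program_area)
adjusted = change_vs_comparison(*program_area, *comparison_area)

print(f"Pre-post change in program area: {naive}")
print(f"Change relative to comparison area: {adjusted}")
```

With these made-up numbers the program area drops by 20 incidents, but the comparison area drops by 15 without any program at all, so only 5 of the 20 could plausibly be credited to the program. Even that remaining difference is only as trustworthy as the match between the two areas.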

Up at level 5 we have experiments that randomize treatment and control groups/areas, because (in principle) they can rule out most of the problems associated with less rigorous studies. For example, by limiting our capacity to influence where or to whom a program is applied, we remove (or at least reduce) the risk of selection bias. I have encountered police commanders who all but demanded their pet area receive a patrol intervention, only to be thwarted by randomization. Had the program appeared to work in those areas, would that have been because of the program, or just because the commanders were already paying attention to them?
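
As a rough sketch of what random assignment looks like in practice, the Python snippet below shuffles a list of candidate areas and splits it in half. The area names, the `randomize_areas` function, and the even split are all assumptions made for illustration; real experiments often use more elaborate schemes.

```python
import random


def randomize_areas(areas, seed=None):
    """Randomly split areas into treatment and control groups.

    A fixed seed makes the assignment reproducible and auditable.
    """
    rng = random.Random(seed)
    shuffled = list(areas)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


# Hypothetical candidate areas.
areas = [f"Area {i}" for i in range(1, 11)]
treatment, control = randomize_areas(areas, seed=42)

print("Treatment:", treatment)
print("Control:", control)
```

The point is that the shuffle, not a commander, decides which areas receive the intervention, so nobody's pet area can be steered into the treatment group.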

Good randomization studies can rule out a large swathe of competing explanations, and this approach remains the strongest research design for testing many ideas (I don’t recommend it for parachutes).

It is sometimes, incorrectly, argued that randomization is unethical because it withholds benefits from the control units or areas.

We have evaluations precisely because we have not proven certain programs work. Randomization is therefore a highly ethical approach to gauging the value of spending taxpayer dollars.

Finally, randomized experiments usually contribute to the 5* meta-analyses that examine the totality of evidence for a crime reduction program. The real world is messy, and systematic reviews conducted by trained analysts are vital tools to help us make sense of complicated areas. Within a systematic review, a single study finds its place in the wider entirety of research, making its contribution to policy knowledge.

There is of course much more to understand about this area, and there are numerous verbose books about research design and evaluation methodology. Until you are brave enough for that, I hope this short, non-technical overview helps you understand the graphic and appreciate that not all research is created equal.