
Mechanical Turkers may have out-predicted the most popular crime-predicting algorithm

Our most sophisticated crime-predicting algorithms may not be as good as we thought. A study published today in Science Advances takes a look at the popular COMPAS algorithm — used to assess the likelihood that a given defendant will reoffend — and finds the algorithm is no more accurate than the average person’s guess. If the findings hold, they would be a black eye for sentencing algorithms in general, indicating we may simply not have the tools to accurately predict whether a defendant will commit further crimes.

Developed by Equivant (formerly Northpointe), the COMPAS algorithm examines a defendant’s criminal record alongside a series of other factors to assess how likely they are to be arrested again in the next two years. COMPAS’ risk assessment can then inform a judge’s decisions about bail or even sentencing. If the algorithm is inaccurate, the result could be a longer sentence for an otherwise low-risk defendant, a significant harm for anyone impacted.

Reached by The Verge, Equivant contested the accuracy of the paper in a lengthy statement, calling the work “highly misleading.”

“The ceiling in predictive power is lower than I had thought”

COMPAS has been criticized by ProPublica for racial bias (a claim some statisticians dispute), but the new paper, from Hany Farid and Julia Dressel of Dartmouth, tackles a more fundamental question: are COMPAS’ predictions any good? Drawing on ProPublica’s data, Farid and Dressel found the algorithm predicted reoffenses correctly roughly 65 percent of the time. That is a modest margin: since roughly 45 percent of defendants reoffend, simply predicting that no one will reoffend would already be right about 55 percent of the time.

In its statement, however, Equivant argues it has cleared the 70 percent AUC standard for risk assessment tools.
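Accuracy and AUC measure different things. AUC, the metric Equivant cites, is the probability that a randomly chosen reoffender receives a higher risk score than a randomly chosen non-reoffender. A minimal pure-Python sketch of that calculation (the scores and labels below are invented for illustration, not drawn from the study):

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison:
    the fraction of (reoffender, non-reoffender) pairs in which
    the reoffender is scored higher, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented risk scores and reoffense labels for seven defendants.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   1,   0]
result = auc(scores, labels)  # 9 of 12 pairs ranked correctly = 0.75
```

A tool can clear a 0.70 AUC threshold while still being only modestly accurate on individual defendants, which is why the two sides can cite different numbers without directly contradicting each other.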

The most surprising results came when researchers compared COMPAS to other kinds of prediction. Farid and Dressel recruited 462 random workers through Amazon’s Mechanical Turk platform, and asked the Turkers to “read a few sentences about an actual person and predict if they will commit a crime in the future.” They were paid one dollar for completing the task, with a five dollar bonus if their accuracy was over 65 percent. Surprisingly, the median Turker ended up two points better than COMPAS, clocking in at 67 percent accuracy. 

George Mason University law professor Megan Stevenson has done similarly pessimistic research on risk assessment programs in Kentucky, and says she was surprised by just how bad the finding was for COMPAS. The sample size is small, so it’s hard to be sure COMPAS’ disadvantage will hold up in further testing, but the fact that COMPAS lands in the same general range as such an ad hoc system is damning enough on its own.

“The paper definitely has me thinking that the ceiling in predictive power is lower than I had thought,” Stevenson told The Verge, “and I didn’t think it was that high to begin with.”

The researchers also edged out COMPAS with a far simpler linear algorithm, which looked only at a defendant’s age and criminal record, a result that surprised even the researchers, given the 137 factors involved in a COMPAS assessment. “We typically would expect that as we add more data to a classifier and / or increase the complexity of the classifier, that the classification accuracy would improve,” Farid told The Verge. “We found this not to be the case.”
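The paper’s exact model isn’t reproduced here, but the idea of a linear classifier over just two inputs can be sketched in Python. The synthetic records, the label-generating rule, and the training details below are all illustrative assumptions, not the study’s actual data or code:

```python
import math
import random

random.seed(0)

# Hypothetical synthetic records standing in for real court data:
# each row is ((age, prior_convictions), reoffended). The rule used to
# assign labels is invented purely so the sketch has something to learn.
def make_records(n):
    rows = []
    for _ in range(n):
        age = random.randint(18, 70)
        priors = random.randint(0, 15)
        risk = -0.08 * (age - 35) + 0.4 * priors
        rows.append(((age, priors), 1 if risk > 1.0 else 0))
    return rows

def scale(age, priors):
    # Crude feature scaling keeps gradient descent well behaved.
    return (age - 35) / 10, priors / 5

def train_logistic(rows, lr=0.05, epochs=100):
    # Plain stochastic gradient descent on the logistic loss,
    # fitting two weights and a bias.
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (age, priors), y in rows:
            x1, x2 = scale(age, priors)
            p = 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            w1 -= lr * g * x1
            w2 -= lr * g * x2
            b -= lr * g
    return w1, w2, b

def predict(model, age, priors):
    w1, w2, b = model
    x1, x2 = scale(age, priors)
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

model = train_logistic(make_records(400))
test = make_records(200)
accuracy = sum(predict(model, a, p) == y
               for (a, p), y in test) / len(test)
```

The point of the exercise mirrors the paper’s: a two-input linear model is about as simple as a classifier gets, so any commercial system it matches has little room to claim added predictive value from its extra inputs.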

Equivant challenged this finding as well, arguing the small data sample had led the researchers to over-fit their algorithm. Furthermore, the company downplayed the number of different factors that actually determine a given risk assessment. “In fact, the vast number of these 137 are needs factors and are not used as predictors in the COMPAS risk assessment,” the company said. “The COMPAS risk assessment has six inputs only.”

Notably, both the Turkers’ predictions and the simple two-factor algorithm showed roughly the same bias profile as COMPAS: predictive parity across races, but error distributed disproportionately, with false positives more likely to occur among black defendants.

Risk assessment scores have become an increasingly common feature of the US justice system, with similar products often used for decisions about pre-trial detainment. Controversially, the specific details of the algorithm are often treated as a trade secret, making it difficult for lawyers to contest the results. Last year, the Supreme Court declined to hear a case challenging the legality of the COMPAS system, which argued that keeping the algorithm secret violated the defendant’s constitutional rights.

The study’s biggest weakness is the data itself. Court records are notoriously messy, and the data is drawn from just two years in a specific county, which could limit its predictive power. Recidivism studies also face a long-standing problem in reliably measuring false positives, since a longer prison sentence can prevent a person from reoffending while they’re incarcerated.

Still, researchers expect plenty of confirmation studies are already underway. “My guess is that there will be a slew of papers coming out trying to replicate or refute this finding,” says Stevenson, “so we’ll know more in the next couple years.”