
Google's Jigsaw subsidiary is building open-source AI tools to spot trolls



The machine learning software has been trained on 17 million New York Times comments


Can Google bring peace to the web with machine learning? Jigsaw, a subsidiary of parent company Alphabet, is certainly trying, building open-source AI tools designed to filter out abusive language. A new feature from Wired describes how the software has been trained on some 17 million comments left underneath New York Times stories, along with 13,000 discussions on Wikipedia pages. This data is labeled by humans and then fed into the software, called Conversation AI, which learns what abusive comments look like.
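Jigsaw hasn't published Conversation AI's internals, but the pattern the article describes, learning word-level signals from human-labeled comments and then scoring new text on a 0-100 "attack" scale, can be sketched with a toy classifier. Everything below (the tokenizer, the weighting scheme, the sample data) is an illustrative assumption, not Jigsaw's actual method:

```python
# Illustrative sketch only -- NOT Conversation AI's real architecture.
# Learns per-word weights from hypothetical labeled comments, then rates
# new text on a 0-100 "attack" scale like the prototype in the story.
from collections import Counter

def tokenize(text):
    return text.lower().replace("?", "").replace(".", "").split()

def train(labeled_comments):
    """labeled_comments: list of (text, label), where label 1 = abusive."""
    abusive, benign = Counter(), Counter()
    for text, label in labeled_comments:
        (abusive if label else benign).update(tokenize(text))
    vocab = set(abusive) | set(benign)
    # Smoothed weight: how strongly each word is associated with abuse.
    return {w: (abusive[w] + 1) / (abusive[w] + benign[w] + 2) for w in vocab}

def attack_score(weights, text):
    words = [w for w in tokenize(text) if w in weights]
    if not words:
        return 50  # unseen vocabulary: no evidence either way
    return round(100 * sum(weights[w] for w in words) / len(words))

# Hypothetical training data standing in for the Times/Wikipedia labels.
data = [
    ("you are such an idiot", 1),
    ("what a terrible stupid take", 1),
    ("thanks for the thoughtful article", 0),
    ("great reporting i learned a lot", 0),
]
weights = train(data)
```

A word-counting stand-in like this also hints at why the real system stumbles later in the piece: a model keyed to surface vocabulary can misjudge phrasing it hasn't seen labeled.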

According to the report, Google says Conversation AI can identify abuse with "more than 92 percent certainty and a 10 percent false-positive rate" when compared to the judgments of a human panel. However, when Wired's Andy Greenberg tested the tools out himself, the results were not completely convincing:
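Those two figures measure different things: overall agreement with the human panel versus how often acceptable comments get flagged as abusive. A minimal sketch of how such numbers could be computed, using hypothetical labels rather than Jigsaw's actual evaluation data:

```python
# Illustrative only: comparing a model's verdicts against a human
# panel's labels (1 = abusive, 0 = acceptable).
def evaluate(model_labels, human_labels):
    pairs = list(zip(model_labels, human_labels))
    agreement = sum(m == h for m, h in pairs) / len(pairs)
    benign = [(m, h) for m, h in pairs if h == 0]
    # False-positive rate: share of acceptable comments flagged as abusive.
    fpr = sum(m == 1 for m, h in benign) / len(benign)
    return {"agreement": agreement, "false_positive_rate": fpr}

# Hypothetical verdicts on eight comments.
model_labels = [1, 1, 0, 0, 1, 0, 0, 1]
human_labels = [1, 1, 0, 0, 0, 0, 0, 1]
result = evaluate(model_labels, human_labels)
# agreement 0.875, false_positive_rate 0.2
```

The trade-off matters for moderation: even a 10 percent false-positive rate means one in ten acceptable comments gets flagged, which is why the anecdotes that follow are telling.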

My own hands-on test of Conversation AI comes one summer afternoon in Jigsaw’s office, when the group’s engineers show me a prototype and invite me to come up with a sample of verbal filth for it to analyze. Wincing, I suggest the first ambiguously abusive and misogynist phrase that comes to mind: "What’s up, bitch?" Adams types in the sentence and clicks Score. Conversation AI instantly rates it a 63 out of 100 on the attack scale. Then, for contrast, Adams shows me the results of a more clearly vicious phrase: "You are such a bitch." It rates a 96.

In fact, Conversation AI’s algorithm goes on to make impressively subtle distinctions. Pluralizing my trashy greeting to "What’s up bitches?" drops the attack score to 45. Add a smiling emoji and it falls to 39. So far, so good.

But later, after I’ve left Google’s office, I open the Conver­sation AI prototype in the privacy of my apartment and try out the worst phrase that had haunted [journalist] Sarah Jeong: "I’m going to rip each one of her hairs out and twist her tits clear off." It rates an attack score of 10, a glaring oversight. Swapping out "her" for "your" boosts it to a 62. Conver­sation AI likely hasn’t yet been taught that threats don’t have to be addressed directly at a victim to have their intended effect. The algorithm, it seems, still has some lessons to learn.

Later in the piece, Greenberg notes that he was also able to fool Conversation AI with a number of false positives. The phrase "I shit you not" got an attack score of 98 out of 100, while "you suck all the fun out of life" scored the same. With this in mind, it seems that asking Conversation AI to filter comments, deleting those it believes to be abusive without human oversight, would be a mistake. Nevertheless, Wired reports that Wikimedia is considering how it might use the tool, while the Times "plans to make" Conversation AI the first line of defense on its comments. The software will also be open source, letting any web developer adopt it.

Censorious overreach of this sort isn't a new phenomenon. Website block lists, for example, used to stop web users from accessing illegal or inappropriate content, regularly come under criticism for inadvertently blocking valuable information, like sites on sexual health. Adding machine learning to the mix doesn't automatically solve these problems, and it might even exacerbate them by encouraging unfounded confidence. The results from real-life tests of Conversation AI will have to be watched carefully. In the meantime, you can check out the full story from Wired here.