
How to teach a robot to write


Automated writing programs are coming to journalism. Is it good news?

First, the robots came for the switchboard operators. Then, the chauffeurs. Now, it looks like the ink-stained journalist is next on the list.

Last week, the Associated Press announced it would be automating its articles on quarterly earnings reports. Instead of 300 articles written by humans, the company's new software will write 4,400 of them, each formatted for AP style, in mere seconds. It’s not the first time a company has tried out automatic writing: last year, a reporter at The LA Times wrote an automated earthquake-reporting program that combined prewritten sentences with automatic seismograph reports to report quakes just seconds after they happen. The natural language-generation company Narrative Science has been churning out automated sports reporting for years.

But while these projects are usually seen as something akin to a server cluster at a writing desk, the reality is something more complex and less apocalyptic. It’s a dance between writing, coding, and data analysis. If you’ve got the right data (in the AP’s case, a stream of earnings reports), it’s not that hard to pull it into your system. If you’ve got the right code, it’s not that hard to figure out where the information should go. The only missing piece is an understanding of the English language itself. So how do you make a robot that writes sentences?

In the case of AP style, a lot of the work has already been done. Every Associated Press article already comes with a clear, direct opening and a structure that spirals out from there. All the algorithm needs to do is code in the same reasoning a reporter might employ. Algorithms detect the most volatile or newsworthy shift in a given earnings report and slot that in as the lede. Circling outward, the program might sense that a certain topic has already been covered recently and decide it's better to talk about something else. Automated Insights CEO Robbie Allen, the man responsible for the system, describes it as more complicated than it looks. "It can’t be Madlibs. If it starts to sound automated, it gets stilted or highly repetitive," he says. "It’s very complicated to avoid that."
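The lede-picking reasoning described above can be pictured with a small sketch. It is not Automated Insights' code: the `Metric` class, the `pick_lede` and `write_lede` functions, the sample figures, and the company name "Acme Corp." are all hypothetical, and "most newsworthy" is assumed here to mean the largest percentage change from the prior quarter.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """One figure from a (hypothetical) earnings report."""
    topic: str        # e.g. "revenue"
    previous: float   # prior-quarter value, assumed to be nonzero
    current: float    # latest-quarter value

    @property
    def pct_change(self) -> float:
        # Percentage change relative to the prior quarter.
        return (self.current - self.previous) / abs(self.previous) * 100.0


def pick_lede(metrics, recently_covered=frozenset()):
    """Pick the metric with the biggest swing, skipping recently covered topics.

    Falls back to the full list if every topic was covered recently.
    """
    fresh = [m for m in metrics if m.topic not in recently_covered]
    candidates = fresh or list(metrics)
    return max(candidates, key=lambda m: abs(m.pct_change))


def write_lede(company, metric):
    """Slot the chosen metric into a prewritten opening sentence."""
    direction = "rose" if metric.pct_change > 0 else "fell"
    return (
        f"{company} said its {metric.topic} {direction} "
        f"{abs(metric.pct_change):.0f} percent in the latest quarter."
    )


report = [
    Metric("revenue", previous=100.0, current=110.0),
    Metric("net income", previous=20.0, current=10.0),
    Metric("headcount", previous=1000.0, current=1010.0),
]
print(write_lede("Acme Corp.", pick_lede(report)))
print(write_lede("Acme Corp.", pick_lede(report, recently_covered={"net income"})))
```

Even in a toy like this, the opening sentence depends on several inputs at once (the size of each swing, its direction, and what was covered recently), which hints at why the real output is hard to predict in advance.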

The staffers who keep the copy fresh are scribes and coders in equal measure. (Allen says he looks for "stats majors who worked on the school paper.") They're not writers in the traditional sense — most of the language work is done beforehand, long before the data is available — but each job requires close attention. For sports articles, the Automated Insights team does all its work during the off-season and then watches the articles write themselves from the sidelines, as soon as each game’s results are available. "I’m often quite surprised by the result," says Joe Procopio, the company’s head of product engineering. "There might be four or five variables that determine what that lead sentence looks like." Even if you wrote the program yourself, it’s hard to know what’s coming.

Allen is quick to acknowledge the limitations of the approach. "How many stories are there where you need thousands of articles in a few seconds?" he says. It’s a specific niche, but it’s one that keeps expanding. The LA Times’ earthquake reports were simple Madlibs, filling numbers into prewritten sentences, but the quarterly reports go further, making judgments about which facts are important and which sentence structures make sense. According to Allen, the more important business comes next, as the systems become smart enough to enter the gray area where written reports blend into automated number-crunching. Automated Insights' next project is reworking its system, called Wordsmith, into something that publishers can license off the shelf. It takes more technical chops than most writers have, just like early web publishing, but that doesn’t mean they can’t learn.
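The gap between the two approaches is easier to see side by side. The sketch below is illustrative only: the sentence templates, the function names `madlibs_report` and `earnings_sentence`, and the sample inputs are hypothetical, not taken from the LA Times program or from Wordsmith. The first function just fills blanks, while the second assumes a simple rule (comparing earnings per share with an analyst forecast) to decide which sentence structure fits.

```python
def madlibs_report(magnitude, place):
    """Fill-in-the-blank approach: one fixed sentence, numbers dropped in."""
    return f"A magnitude {magnitude:.1f} earthquake was reported near {place}."


def earnings_sentence(company, eps, expected_eps):
    """Judgment-based approach: the data decides which sentence gets written."""
    gap = round(eps - expected_eps, 2)  # compare to the nearest cent
    if gap > 0:
        return (
            f"{company} beat analyst expectations, earning ${eps:.2f} per share "
            f"against a forecast of ${expected_eps:.2f}."
        )
    if gap < 0:
        return (
            f"{company} fell short of analyst expectations, earning ${eps:.2f} "
            f"per share against a forecast of ${expected_eps:.2f}."
        )
    return f"{company} matched analyst expectations of ${expected_eps:.2f} per share."


print(madlibs_report(4.4, "Exampleville"))
print(earnings_sentence("Acme Corp.", 1.25, 1.10))
print(earnings_sentence("Acme Corp.", 0.90, 1.10))
```

A production system would presumably weigh many more signals and rotate among phrasings to avoid repetition; the point here is only that the second function chooses what to say, not just which numbers to insert.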

Allen's favorite example is the company's work with Yahoo, where the same software was put to work writing up fantasy football reports. The product was simple: a weekly email telling you how your fantasy team did, written with as much verve as possible. But the scale of the job made it impossible to write the reports one at a time. There were more than 6 million subscribers, each one managing a slightly different team. It was easy to distinguish a given team's highs and lows, but pulling it all together in a punchy and digestible form was much harder. Without something like Wordsmith, it would have been nearly impossible.

Still, like any complex system, it’s often hard to tell whether it’s code or writing that’s taking the lead. Automated Insights gets a lot of attention for the "robotic writing" tag, but maybe it makes more sense to give the credit to AP style or the data feeds that are sending the team instant and well-groomed information on companies and ballgames. By now, the style is so well-defined that it makes basic language decisions easy to translate into code. Maybe the process is just humans using computers as a disguise, like the old story of the Mechanical Turk.

The real decisions are about which information you see first — and that’s something modern programs have gotten very good at deciding for you. It’s no different from the ranking you’d see in a Google search, the Facebook News Feed, or even the "related articles" box on a news site. The decisions may not always be right, but we’re not squeamish about leaving them to computers. Automated Insights is just coding them into text, putting a human face on the automated process. The surprise is that, in an industry full of human information filters, that turns out to be a pretty good trick.