First computers recognized our faces, now they know what we’re doing

We haven't designed fully sentient artificial intelligence just yet, but we're steadily teaching computers how to see, read, and understand our world. Last month, Google engineers showed off their "Deep Dream," software capable of taking an image and ascertaining what was in it by turning it into a nightmare fusion of flesh and tentacles. The release follows research by scientists from Stanford University, who developed a similar program called NeuralTalk, capable of analyzing images and describing them with eerily accurate sentences.

First published last year, the program and the accompanying study are the work of Fei-Fei Li, director of the Stanford Artificial Intelligence Laboratory, and Andrej Karpathy, a graduate student. Their software is capable of looking at pictures of complex scenes and identifying exactly what's happening. A picture of a man in a black shirt playing guitar, for example, is picked out as "man in black shirt is playing guitar," while pictures of a black-and-white dog jumping over a bar, a man in a blue wetsuit surfing a wave, and a little girl eating cake are also correctly described with a single sentence. In several cases, it's unnervingly accurate.

Like Google's Deep Dream, the software uses a neural network to work out what's going on in each picture, comparing parts of the image to those it's already seen and describing them as humans would. Neural networks are loosely modeled on the human brain, and they learn a little like children. Once they've been taught the basics of our world — that's what a window usually looks like, that's what a table usually looks like, that's what a cat who's trying to eat a cheeseburger looks like — they can apply that understanding to other pictures and video.
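To make that "learn once, apply to new pictures" idea concrete, here is a minimal sketch (not the Stanford or Google code) that runs a publicly available network, already trained on roughly 1,000 labeled categories, over a new photo and prints its best guess. The calls are from PyTorch's torchvision library; the filename is a placeholder.

```python
import torch
from torchvision import models
from PIL import Image

# A pretrained ImageNet classifier: it has already been "taught the basics"
# on roughly 1,000 labeled categories of everyday objects.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()  # the resizing/normalization the model expects

image = Image.open("photo.jpg")         # placeholder: any photo on disk
batch = preprocess(image).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    probs = model(batch).softmax(dim=1)  # probabilities over the known categories

top_prob, top_idx = probs[0].max(dim=0)
print(weights.meta["categories"][int(top_idx)], f"{float(top_prob):.2f}")
```

The network has never seen this particular photo; it simply applies what it learned from its training examples, which is the same principle the captioning systems build on.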

It's still not perfect. A fully grown woman gingerly holding a huge donut is tagged as "a little girl holding a blow dryer next to her head," while an inquisitive giraffe is mislabeled as a dog looking out of a window. A cheerful couple in a garden with a birthday cake appears under the heading "a man in a green shirt is standing next to an elephant," with a bush starring as the elephant and, weirdly, the cake standing in for the man. But in most cases, these descriptions are secondary guesses — alongside the elephant suggestion, the program also correctly identifies the cake couple as "a woman standing outside holding a coconut cake with a man looking on."

The software easily identifies a dog jumping over a bar

The incredible amount of visual information on the internet has, until recently, had to be manually labeled in order for it to be searchable. When Google first built Google Maps, it relied on a team of employees to dig through and check every single entry, with humans tasked with looking at every number captured around the world to make sure it denoted a real address. When they were done, and sick of the tiresome job, they built Google Brain. Where it had previously taken a team weeks of work to complete the task, Google Brain could transcribe all of the Street View data from France in under an hour.

"I consider the pixel data in images and video to be the dark matter of the Internet," Li told The New York Times last year. "We are now starting to illuminate it." Leading the charge for that illumination are web giants such as Facebook and Google, who are keen to categorize the millions of pictures and search results they need to sift through. Previous research focused on single object recognition — in a 2012 Google study, a computer taught itself to recognize a cat — but computer scientists have said this misses the bigger picture. "We've focused on objects, and we've ignored verbs," Ali Farhadi, computer scientist at the University of Washington, told The New York Times.

But more recent programs have focused on more complex strings of data in an attempt to teach computers what's happening in a picture rather than simply what's in shot. The Stanford scientists' study uses the kind of natural language we could eventually use to search through image repositories. Rather than scanning through tens of thousands of family photos, services such as Google Photos could quickly pull up "the one where the dog is jumping on the couch," or "the selfie I took in Times Square." Search results, too, would benefit from the technology, potentially allowing you to search YouTube or Google for the exact scenes you want, rather than only finding the pictures or videos their uploaders were mindful enough to label correctly.
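As a toy illustration of how caption-based search could work, the sketch below assumes a captioning model has already produced one sentence per photo (the filenames and captions here are made up) and simply ranks photos by how many words their captions share with the query; a real service would use far more sophisticated language matching.

```python
# Hypothetical captions, as a captioning model such as NeuralTalk might produce them.
captions = {
    "IMG_0142.jpg": "a dog is jumping on the couch",
    "IMG_0857.jpg": "a man in a black shirt is playing guitar",
    "IMG_1203.jpg": "a little girl eating a piece of cake",
}

def search(query: str) -> list[str]:
    """Return filenames ranked by how many words their captions share with the query."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(caption.lower().split())), name)
        for name, caption in captions.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

print(search("the one where the dog is jumping on the couch"))
# -> ['IMG_0142.jpg', 'IMG_0857.jpg']  (the dog photo ranks first by a wide margin)
```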

Neural networks have potential applications out in the real world, too. At CES this year, Nvidia's Jen-Hsun Huang announced his company's Drive PX, a "supercomputer" for your car that incorporates "deep neural network computer vision." Using the same learning techniques as other neural networks, Huang said the technology will be able to automatically spot hazards as you drive, warning you of pedestrians, signs, ambulances, and other objects it has learned about. The neural network means the Drive PX won't need reference images for every kind of car — if it's got four wheels like a car, a grille like a car, and a windscreen like a car, it's probably a car. Larger cars could be SUVs, while cars with lights on top could be police vehicles. Nvidia has been chasing this technology for a while, too, having provided the graphics processing units used by the Stanford team.
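The generalization Huang describes is, on a smaller scale, what pretrained object detectors already do: having learned what people, cars, and signs tend to look like, they can flag new examples they have never seen. The sketch below is not Nvidia's software; it simply runs a publicly available detector from torchvision over a hypothetical dashcam frame and prints the confident detections.

```python
import torch
from torchvision import models
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# A detector pretrained on the COCO dataset (people, cars, traffic lights, stop signs...).
weights = models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = models.detection.fasterrcnn_resnet50_fpn(weights=weights).eval()

# Placeholder frame; a real in-car system would process a live camera feed.
frame = convert_image_dtype(read_image("dashcam_frame.jpg"), torch.float)

with torch.no_grad():
    detections = detector([frame])[0]  # boxes, labels, and scores for one image

for label, score in zip(detections["labels"], detections["scores"]):
    if score > 0.8:  # keep only confident detections
        print(weights.meta["categories"][int(label)], f"{float(score):.2f}")
```

A production system like the one Huang describes would of course run on specialized hardware at video frame rates, but the underlying idea of recognizing categories rather than matching exact reference images is the same.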

As the technology to automatically work out what's happening in images progresses rapidly, its leaders are making their efforts available to all on code repositories such as GitHub. Google's Deep Dream, in particular, has captured the imagination of many with its trippy visual side effects, contorting images into the shapes of dogs and slugs as it attempts to find reference points it understands. But the proliferation of this machine learning has a creepy side, too — if your computer can work out exactly what's happening in your pictures, what happens when it works out exactly what you're doing?