We usually think of surveillance cameras as digital eyes, watching over us or watching out for us, depending on your view. But really, they’re more like portholes: useful only when someone is looking through them. Sometimes that means a human watching live footage, usually from multiple video feeds. Most surveillance cameras are passive, however. They’re there as a deterrent, or to provide evidence if something goes wrong. Your car got stolen? Check the CCTV.
But this is changing, and fast. Artificial intelligence is giving surveillance cameras digital brains to match their eyes, letting them analyze live video with no humans necessary. This could be good news for public safety, helping police and first responders more easily spot crimes and accidents, and it could have a range of scientific and industrial applications. But it also raises serious questions about the future of privacy and poses novel risks to social justice.
What happens when governments can track huge numbers of people using CCTV? When police can digitally tail you around a city just by uploading your mugshot into a database? Or when a biased algorithm is running on the cameras in your local mall, pinging the cops because it doesn’t like the look of a particular group of teens?
These scenarios are still a way off, but we’re already seeing the first fruits of combining artificial intelligence with surveillance. IC Realtime is one example. Its flagship product, unveiled last December, was billed as Google for CCTV. It’s an app and web platform named Ella that uses AI to analyze what’s happening in video feeds and make it instantly searchable. Ella can recognize hundreds of thousands of natural language queries, letting users search footage to find clips showing specific animals, people wearing clothes of a certain color, or even individual car makes and models.
In a web demo, IC Realtime CEO Matt Sailor showed The Verge a version of Ella hooked up to around 40 cameras surveilling an industrial park. He typed in various searches — “a man wearing red,” “UPS vans,” “police cars” — all of which brought up relevant footage in a few seconds. He then narrowed the results by time period and location and pointed out how users can give thumbs-up or thumbs-down to clips to improve the results — just like Netflix.
“Let’s say there’s a robbery and you don’t really know what happened,” says Sailor. “But there was a Jeep Wrangler speeding east afterward. So we go in, we search for ‘Jeep Wrangler,’ and there it is.” On-screen, clips begin to populate the feed, showing different Jeep Wranglers gliding past. This will be the first big advantage of combining AI and CCTV, explains Sailor: making it easy to find what you’re looking for. “Without this technology, you’d know nothing more than your camera, and you’d have to sift through hours and hours and hours of video,” he says.
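The search-and-narrow workflow from the demo can be sketched in a few lines of Python. This is an illustration only, not how Ella works internally: it assumes a vision model has already tagged each clip with text labels, and the `Clip` record, the `search` function, and the sample data are all hypothetical. It also matches tags exactly, a much cruder approach than the natural language queries Ella is said to support.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Clip:
    camera: str
    start: datetime
    labels: set      # tags assumed to come from an upstream vision model
    votes: int = 0   # thumbs-up minus thumbs-down from users (hypothetical)


def search(clips, query, camera=None, after=None, before=None):
    """Return clips tagged with `query`, optionally narrowed by camera
    and time window, with the best-rated clips first."""
    hits = [
        c for c in clips
        if query.lower() in c.labels
        and (camera is None or c.camera == camera)
        and (after is None or c.start >= after)
        and (before is None or c.start <= before)
    ]
    return sorted(hits, key=lambda c: c.votes, reverse=True)


# Made-up footage index for illustration.
clips = [
    Clip("gate", datetime(2018, 1, 5, 9, 0), {"jeep wrangler"}, votes=2),
    Clip("dock", datetime(2018, 1, 5, 9, 5), {"ups van"}),
    Clip("gate", datetime(2018, 1, 5, 14, 0), {"jeep wrangler"}, votes=5),
]

# Search for a Jeep Wrangler, narrowed to the afternoon.
results = search(clips, "Jeep Wrangler", after=datetime(2018, 1, 5, 12, 0))
```

Sorting by votes is one simple way user feedback could shape the results; a production system would more plausibly feed that feedback back into the model itself.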
Ella runs on Google Cloud and can search footage from pretty much any CCTV system. “[It] works well on a one-camera system — just [like] a nanny cam or dog cam — all the way up to enterprise, with a matrix of thousands of cameras,” says Sailor. Users will pay a monthly fee for access, starting at around $7, and scaling up with the number of cameras.
IC Realtime wants to target businesses of all sizes but thinks its tech will also appeal to individual consumers. These customers are already well-served by a booming market for “smart” home security cams made by companies like Amazon, Logitech, Netgear, and the Google-owned Nest. But Sailor says this tech is much more rudimentary than IC Realtime’s. These cameras connect to home Wi-Fi and offer live streams via an app, and they automatically record footage when they see something move. But, says Sailor, they can’t tell the difference between a break-in and a bird, leading to a lot of false positives. “They’re very basic technology that’s been around for years,” he says. “No AI, no deep learning.”
That won’t be the case for long. While IC Realtime offers cloud-based analytics that can upgrade existing, dumb cameras, other companies are building artificial intelligence directly into their hardware. Boulder AI is one such startup, selling “vision as a service” using its own standalone AI cameras. The big advantage of integrating AI into the device is that the cameras don’t require an internet connection to work. Boulder sells to a wide range of industries, tailoring the machine vision systems it builds to individual clients.
“The applications are really all over the board,” founder Darren Odom tells The Verge. “Our platform’s sold to companies in banking, energy. We’ve even got an application where we’re looking at pizzas, determining if they’re the right size and shape.”
Odom gives the example of a customer in Idaho who had built a dam. In order to meet environmental regulations, they were monitoring the numbers of fish making it over the top of the structure. “They used to have a person sitting with a window into this fish ladder, ticking off how many trout went by,” says Odom. (A fish ladder is exactly what it sounds like: a stepped waterway that fish use to travel uphill.) “Then they moved to video and someone [remotely] watching it.” Finally, they contacted Boulder, which built them a custom AI CCTV system to identify types of fish going up the fish ladder. “We really nailed fish species identification using computer vision,” Odom says proudly. “We are now 100 percent at identifying trout in Idaho.”
If IC Realtime represents the generic end of the market, Boulder shows what a boutique contractor can do. In both cases, though, what these firms are currently offering is just the tip of the iceberg. In the same way that machine learning has made swift gains in its ability to identify objects, the skill of analyzing scenes, activities, and movements is expected to rapidly improve. Everything’s in place, including the basic research, the computing power, and the training datasets — a key component in creating competent AI. Two of the biggest datasets for video analysis are made by YouTube and Facebook, companies that have said they want AI to help moderate content on their platforms (though both admit it’s not ready yet). YouTube’s dataset, for example, contains more than 450,000 hours of labeled video that it hopes will spur “innovation and advancement in video understanding.” The breadth of organizations involved in building such datasets gives some idea of the field’s importance. Google, MIT, IBM, and DeepMind are all involved in their own similar projects.
IC Realtime is already working on advanced tools like facial recognition. After that, it wants to be able to analyze what’s happening on-screen. Sailor says he’s already spoken to potential clients in education who want surveillance that can recognize when students are getting into trouble in schools. “They’re interested in preemptive notifications for a fight, for example,” he says. All the system would need to do would be to look out for pupils clumping together and then alert a human, who could check the video feed to see what’s happening or head over in person to investigate.
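The “clumping” heuristic described here is simple enough to sketch. What follows is an illustration under stated assumptions, not IC Realtime’s method: it supposes an upstream detector already reports each person’s position as (x, y) coordinates in meters, and the function name `find_clumps`, the 1.5-meter radius, and the minimum group size of four are all hypothetical choices.

```python
from math import dist


def find_clumps(positions, radius=1.5, min_size=4):
    """Group people standing within `radius` of one another and
    return the groups that have at least `min_size` members."""
    unvisited = set(range(len(positions)))
    clumps = []
    while unvisited:
        # Grow a group outward from an arbitrary unvisited person.
        start = unvisited.pop()
        group, frontier = [start], [start]
        while frontier:
            current = frontier.pop()
            near = [i for i in unvisited
                    if dist(positions[current], positions[i]) <= radius]
            for i in near:
                unvisited.remove(i)
            group.extend(near)
            frontier.extend(near)
        if len(group) >= min_size:
            clumps.append(sorted(group))
    return clumps


# Hypothetical detections: four pupils bunched together, two elsewhere.
people = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (0.6, 0.9),
          (10.0, 10.0), (20.0, 5.0)]
alerts = find_clumps(people)
```

A group flagged this way says nothing about why people have gathered, which is why the output would only be a prompt for a human to check the feed.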
Boulder, too, is exploring this sort of advanced analysis. One prototype system it’s working on is supposed to analyze the behavior of people in a bank. “We’re specifically looking for bad guys, and detecting the difference between a normal actor and someone acting out of bounds,” says Odom. To do this, they’re using old security cam footage to train their system to spot aberrant behavior. But a lot of this video is low-quality, so they’re also shooting their own training footage with actors. Odom wasn’t able to go into details, but said the system would be looking for specific facial expressions and actions. “Our actors are doing things like crouching, pushing, over the shoulder glances,” he said.
For experts in surveillance and AI, the introduction of these sorts of capabilities is fraught with potential difficulties, both technical and ethical. And, as is often the case with AI, these two categories are intertwined. It’s a technical problem that machines can’t understand the world as well as humans do, but it becomes an ethical one when we assume they can and let them make decisions for us.
Alex Hauptmann, a professor at Carnegie Mellon who specializes in this sort of computer analysis, says that although AI has propelled the field forward hugely in recent years, there are still fundamental challenges in getting computers to understand video. And the biggest of these is a challenge for cameras we don’t often think about anymore: resolution.
Take, for example, a neural network that’s been trained to analyze human actions in a video. Such networks work by breaking down the human body into segments (arms, legs, shoulders, heads, etc.), then watching how these stick figures change from one frame of video to the next. From this, the AI can tell you whether someone’s running, for example, or brushing their hair. “But this depends on the resolution of the video you have,” Hauptmann tells The Verge. “If I’m looking at the end of a parking lot with one camera, I’m lucky if I can tell if someone opened a car door. If you’re right in front of a [camera] and playing a guitar, it can track you down to the individual fingers.”
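The frame-to-frame idea can be shown with a toy example. This sketch assumes a pose estimator has already produced joint positions in pixels for each frame; real systems learn to recognize actions from such data rather than applying a fixed cutoff, and the function names and the 150 pixels-per-second threshold here are made up for illustration.

```python
from math import dist


def mean_joint_speed(frames, fps):
    """Average joint displacement per second across consecutive frames.

    `frames` is a list of dicts mapping joint name -> (x, y) in pixels.
    """
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        # Only compare joints that were detected in both frames.
        for joint in prev.keys() & curr.keys():
            total += dist(prev[joint], curr[joint])
            count += 1
    return (total / count) * fps if count else 0.0


def classify_motion(frames, fps=10, running_threshold=150.0):
    """Label a clip 'running' or 'not running' from joint speed alone.

    The threshold (pixels per second) is a hypothetical value.
    """
    speed = mean_joint_speed(frames, fps)
    return "running" if speed >= running_threshold else "not running"


# Hypothetical stick figures: one moves 20 px per frame, one barely moves.
fast = [{"hip": (20.0 * t, 100.0), "knee": (20.0 * t, 150.0)} for t in range(5)]
slow = [{"hip": (1.0 * t, 100.0), "knee": (1.0 * t, 150.0)} for t in range(5)]

fast_label = classify_motion(fast)
slow_label = classify_motion(slow)
```

The same movement covers fewer pixels when the subject is far from the camera, which is one concrete way the resolution problem Hauptmann describes shows up: a cutoff tuned for a close-up would miss a runner at the far end of a parking lot.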
This is a big problem for CCTV, where the cameras are often grainy and the angles are often weird. Hauptmann gives the example of a camera in a convenience store that’s aimed at the cash register, but it also overlooks the window facing the street. If a mugging happens outside, partially obscured from the camera, then AI would be stumped. “But we, as people, can imagine what’s going on and piece it all together. Computers can’t do that,” he says.
Similarly, while AI is great at identifying what’s going on in a video at a fairly high level (e.g., someone is brushing their teeth or looking at their phone or playing football), it can’t yet extract vital context. Take the neural network that can analyze human actions, for example. It might be able to look at the footage and say “this person is running,” but it can’t tell you whether they’re running because they’re late for a bus or because they’ve just stolen someone’s phone.
These accuracy problems should make us think twice about some of the claims of AI startups. We’re nowhere near the point where a computer can understand what it sees on video with the same insight as a human. (Researchers will tell you this is so difficult it’s basically synonymous with “solving” intelligence in general.) But things are progressing fast.
Hauptmann says using license plate tracking to follow vehicles is “a solved problem for practical purposes,” and facial recognition in controlled settings is the same. (Facial recognition using low-quality CCTV footage is another thing.) Identifying things like cars and items of clothing is also pretty solid and automatically tracking one person across multiple cameras can be done, but only if the conditions are right. “You’re pretty good at tracking an individual in a non-crowded scene — but in a crowded scene, forget it,” says Hauptmann. He says it’s especially tough if the individual is wearing nondescript clothing.
Even these pretty basic tools can have powerful effects at scale, however. China provides one example of what this can look like. Its western Xinjiang region, where dissent from the local Uighur ethnic group is being suppressed, has been described as “a laboratory for high-tech social controls,” in a recent Wall Street Journal report. In Xinjiang, traditional methods of surveillance and civil control are combined with facial recognition, license plate scanners, iris scanners, and ubiquitous CCTV to create a “total surveillance state” where individuals are tracked constantly in public spaces. In Moscow, a similar infrastructure is being assembled, with facial recognition software plugged into a centralized system of more than 100,000 high-resolution cameras which cover more than 90 percent of the city’s apartment entrances.
In these sorts of cases, there’s likely to be a virtuous cycle in play, with the systems collecting more data as the software gets better, which in turn helps the software get even better. “I think it’ll all improve quite a bit,” says Hauptmann. “It’s been coming.”
If these systems are in the works, then problems like algorithmic bias are already with us. This is not a hypothetical challenge. Studies have shown that machine learning systems soak up the racial and sexist prejudices of the society that programs them, from image recognition software that always puts women in kitchens to criminal justice systems that always say black people are more likely to re-offend. If we train AI surveillance systems using old footage, such as from CCTV or police body cameras, then biases that exist in society are likely to be perpetuated.
This process is already taking place in law enforcement, says Meredith Whittaker, co-director of the ethics-focused AI Now institute at NYU, and will spread into the private sector. Whittaker gives the example of Axon (formerly Taser), which bought several AI companies to help build video analytics into its products. “The data they have is from police body cams, which tells us a lot about who an individual police officer may profile, but doesn’t give us a full picture,” says Whittaker. “There’s a real danger with this that we are universalizing biased pictures of criminality and crime.”
Even if we manage to fix the biases in these automated systems, that doesn’t make them benign, says ACLU senior policy analyst Jay Stanley. He says that changing CCTV cameras from passive into active observers could have a huge chilling effect on civil society.
“We want people to not just be free, but to feel free. And that means that they don’t have to worry about how an unknown, unseen audience may be interpreting or misinterpreting their every movement and utterance,” says Stanley. “The concern is that people will begin to monitor themselves constantly, worrying that everything they do will be misinterpreted and bring down negative consequences on their life.”
Stanley also says that false alarms from inaccurate AI surveillance could lead to more dangerous confrontations between law enforcement and members of the public. Think of the shooting of Daniel Shaver, for example, in which police were called to a hotel in Mesa, Arizona, after Shaver was seen with a gun. Officer Philip Brailsford shot and killed Shaver while he was crawling on the floor toward officers, as Sergeant Charles Langley had ordered him to. The gun Shaver was seen with was revealed to be a pellet gun used in his pest-control job.
If a human can make such an error, what chance does a computer have? And if surveillance systems become even partially automated, will such errors become more or less common? “If the technology is out there, there will be some police forces out there looking at it,” says Stanley.
Whittaker says what we’re seeing in this field is only one part of a larger trend in AI, in which we use these relatively crude tools to try and classify people based on their image. She points to controversial research published last year that claimed to be able to identify sexuality using facial recognition as a similar example. The accuracy of the AI’s results was questionable, but critics pointed out that it didn’t matter whether or not it worked; it mattered whether people believed it worked and made judgments using this data all the same.
“It’s troubling to me that a lot of these systems are being pumped into our core infrastructure without the democratic process that would allow us to ask questions about their effectiveness, or to inform the populations they’ll be deployed on,” says Whittaker. “This is one more example in the drumbeat of algorithmic systems that are offering to classify and determine the typology of individuals based on pattern recognition drawn from data that embed cultural and historical biases.”
When we asked IC Realtime about how AI surveillance could be abused, the company gave an answer that’s common in the tech industry: these technologies are value neutral, and it’s only how they’re implemented and by whom that makes them either good or bad. “With any new technology there’s a danger it could fall into the wrong hands,” says Sailor. “That’s true of any technology … and I think the pros in this aspect greatly outweigh the cons.”