Exclusive: The Caavo streaming box is built on game-changing machine vision for TV


Something huge is happening here

The Caavo TV box was announced yesterday in a live demo with The Verge’s own Lauren Goode and Walt Mossberg at Code Media. It’s not exactly a streaming box, but rather a universal control system for every other streaming box you might have. You can plug in an Apple TV, Roku, Amazon Fire TV stick, and your cable box, and then simply ask for content. The Caavo will figure out which device has that content, and then play it on your TV.

There’s even an Alexa skill that lets you ask your Amazon Echo to control your Apple TV, which is fairly surprising given the balkanized state of TV services and devices. (Amazon and Apple don’t usually play nice when it comes to TV.) You can even switch back and forth between all of your various boxes’ native remotes and the Caavo remote without any hassle. It looks great.

Here’s the fun demo from Code Media yesterday:

But... how does it work? Sending a bunch of preprogrammed IR macros to various boxes is an ancient, inevitably doomed idea, and HDMI-CEC isn’t nearly universal or resilient enough to support search and discovery as powerful as Caavo’s demos.

Caavo execs played it pretty coy yesterday, but late last night they exclusively confirmed to me that they’re doing something no one has ever tried before: they’re processing the video your TV streaming boxes send over HDMI, using machine vision to figure out what’s on-screen, and then determining what command to send next based on that information.

Essentially, Caavo has built an AI that simulates a human user to control any device you might attach to a TV, through whatever method the system can use, whether it’s IR, HDMI-CEC, or direct control over an API on your home network. They’re calling this system “visual analytics,” or VA, and it is quite possibly the thing that will crack the entire living room convergence game wide open.

“We really wanted to keep it quiet because it’s a very volatile piece of the whole story,” Caavo CTO Ashish Aggarwal told me. “In some cases we have roundtrip confirmation via IP, so you send a command and it comes back, ‘yeah, I got it,’ and the box does what it does. But in other cases you have to have a closed-loop system where you have to know you selected the right profile and that you’re in the right app. Those kinds of things are [done] by understanding what’s coming through on video, analyzing it, and doing some interesting things with it.”
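To make the closed-loop idea concrete, here is a minimal sketch in Python. It is not Caavo's implementation: `FakeBox`, `classify`, and `send_until_confirmed` are hypothetical names, the "frame" is just a text label standing in for HDMI video, and the vision step is a placeholder. The point is only the shape of the loop: send a command, look at the screen, and retry if the screen does not show what was expected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeBox:
    """Hypothetical stand-in for a streaming box that can drop commands."""
    screen: str = "home"
    drop_first_n: int = 1  # how many commands are silently ignored
    log: List[str] = field(default_factory=list)

    def send(self, command: str) -> None:
        self.log.append(command)
        if self.drop_first_n > 0:  # simulate a missed IR command
            self.drop_first_n -= 1
            return
        if command == "open_netflix":
            self.screen = "netflix_profiles"

    def capture_frame(self) -> str:
        # A real system would return video pixels; here it is a label.
        return self.screen


def classify(frame: str) -> str:
    """Placeholder for machine vision: maps a frame to a screen name."""
    return frame


def send_until_confirmed(box: FakeBox, command: str,
                         expected_screen: str, max_attempts: int = 3) -> bool:
    """Send a command, then confirm the result by inspecting the screen."""
    for _ in range(max_attempts):
        box.send(command)
        if classify(box.capture_frame()) == expected_screen:
            return True  # the screen shows what we expected
    return False  # gave up: the command never visibly took effect


box = FakeBox()
confirmed = send_until_confirmed(box, "open_netflix", "netflix_profiles")
```

In this toy run the first command is dropped, the loop notices the screen has not changed, and the second attempt succeeds. An open-loop IR macro would have had no way to know the first command was lost.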

I’ve been writing about TV boxes forever, and the gap between sending a command to another device and knowing that it worked is what’s killed virtually every promising TV product for years. Most companies have tried to solve this problem by trying to make content deals or working with inconsistent control standards, and they’ve all hit the rocks and failed. The list is long: WebTV. Fanhattan. Boxee. Google TV. It’s a tech graveyard.

And if they don’t outright fail out of the market, most TV devices offer an incomplete content lineup that means you’re stuck switching to another device when you can’t find what you want: There’s no Amazon app on the Apple TV, and you can’t watch anything you’ve bought on iTunes on your Nvidia Shield. It’s a mess.

Perhaps the highest-profile failure of this approach is the Xbox One, which launched with a very ambitious TV control system that relied on IR commands sent through the Kinect. Microsoft assumed at launch that a middling TV control system would give the company leverage to negotiate actual content deals that let the Xbox navigate and display content natively, but it never worked, and over the past two years the Xbox One has been intensely refocused on gaming over general entertainment.

You know who worked at Microsoft when all that was happening? Caavo CEO Andrew Einaudi and late founder (and former Slingbox CEO) Blake Krikorian. But Caavo’s ability to understand what’s on your screen and truly, immediately control every possible device completely changes the dynamics of the industry.

“Andrew and Blake worked on this feature forever,” says Aggarwal. “The last words I heard from Blake were, ‘Ashish, if you can make this happen, we’ll be everywhere.’ We worked our ass off to get this feature right, and we want it to be everywhere.”

There are still some hacks involved in the Caavo system, and ways for other companies to potentially block the device. Caavo can’t control the Apple TV unless you install the Caavo app, for instance, but the way it works is wild: when it decides to control the Apple TV, it goes to the home screen, sends a bunch of scroll commands to the box, uses machine vision to locate the Caavo app icon, and then opens the app so it can pass a URL to the streaming app you actually want to use. “There is no way to launch an app using an API on the Apple TV,” says Aggarwal. “[What] we need to know is where our app is, launch that app, and then hand the URL off to Netflix.”
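The Apple TV workaround described above can be sketched as a small look-then-move loop. This is an illustration under loud assumptions, not Caavo's code: `FakeHomeScreen`, `locate_icon`, and `launch_app` are hypothetical, the home screen is modeled as a rectangular grid of text labels rather than pixels, and "machine vision" is reduced to scanning that grid for a label.

```python
from typing import List, Optional, Tuple

Grid = List[List[str]]


class FakeHomeScreen:
    """Hypothetical home screen: a rectangular grid of icons with a focus."""

    def __init__(self, grid: Grid) -> None:
        self.grid = grid
        self.focus = (0, 0)
        self.launched: Optional[str] = None

    def send(self, command: str) -> None:
        row, col = self.focus
        if command == "down":
            row = min(row + 1, len(self.grid) - 1)
        elif command == "up":
            row = max(row - 1, 0)
        elif command == "right":
            col = min(col + 1, len(self.grid[row]) - 1)
        elif command == "left":
            col = max(col - 1, 0)
        elif command == "select":
            self.launched = self.grid[row][col]
        self.focus = (row, col)

    def capture_frame(self) -> Tuple[Grid, Tuple[int, int]]:
        # Stand-in for a video frame: the icons and the highlighted cell.
        return self.grid, self.focus


def locate_icon(grid: Grid, name: str) -> Optional[Tuple[int, int]]:
    """Placeholder for vision: find the icon's position in the frame."""
    for r, row in enumerate(grid):
        for c, label in enumerate(row):
            if label == name:
                return (r, c)
    return None


def launch_app(screen: FakeHomeScreen, name: str, max_steps: int = 50) -> bool:
    """Re-read the screen after every command and step toward the icon."""
    for _ in range(max_steps):
        grid, (row, col) = screen.capture_frame()
        target = locate_icon(grid, name)
        if target is None:
            return False  # icon is not visible on this screen
        if (row, col) == target:
            screen.send("select")
            return True
        if row != target[0]:
            screen.send("down" if row < target[0] else "up")
        else:
            screen.send("right" if col < target[1] else "left")
    return False


home = FakeHomeScreen([["Settings", "Netflix", "Music"],
                       ["Photos", "Caavo", "Games"]])
found = launch_app(home, "Caavo")
```

Because the loop checks the frame after each command instead of replaying a fixed sequence, it would still reach the icon if the user had rearranged their home screen, which is the weakness of preprogrammed macros.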

Just think about the automated intelligence it takes to do that — Caavo is scanning the entire interface of the Apple TV, watching what its control signals do on that interface, and making intelligent decisions about what signals to send next, all to send a Netflix URL to the Netflix app. It’s doing similar things on the Roku with a custom service and on other boxes with similar apps, but there’s no reason it can’t just click around like a standard human user if those apps and integrations get blocked. This thing is an end-run around the deal problem that’s killed so much TV innovation.

Caavo CEO Andrew Einaudi says the demands on the system were intense. “It’s gotta work with my original remotes, and it’s gotta work with all these services,” he says. “Ashish has answered the call of duty.”

The fact that Caavo is watching your screen means it can do all kinds of other stuff no other universal remote or TV box can do. You can ignore Caavo and launch your Roku like usual using the Roku remote, and then ask the Caavo Alexa skill to take over, and it’ll work. “If you launch content from any other system, we know what you’re doing,” says Aggarwal. “We can still control it.” Because it knows what’s on your screen.

That’s brilliant, but there are a bunch of issues that come along for the ride: Vizio just got smacked by the FTC for detecting what was displayed on its TVs and collecting data without being transparent enough with customers. Apple can’t get Netflix in its new TV app because Netflix doesn’t want to hand over viewer data. This is not a simple ecosystem to launch new products in, especially when the core technology of your product is tracking what people are watching in the privacy of their own homes.

But Aggarwal is pretty confident about things — after all, the company isn’t launching until June, and their executive pedigrees mean they’ve taken meetings with all the big players. “It’s still very early — we just figured out how to do it,” he says. “We want to make sure that every single player is happy, and that we honor their value propositions to the consumer. We want to really encompass this whole space and bring all of these partners together.”

It’s a tall order, and Caavo has a long way to go — and a lot of conversations to have with obstinate, protective players like cable companies who have been very slow to integrate with streaming services and devices. But if the company can pull it off and bring true smart control to the living room, Caavo will be the first company to solve the convergence problem in the modern era. We’ll find out in June.