Check your cell phone contract, and you might come across the following turn of phrase: "We do not sell your personal information." Some version of that phrase is in nearly every carrier Terms of Service, and divides the world’s data into two camps: the kind that personally identifies you and the kind that doesn’t. Your phone, your address, and your social security number all fall into the first camp: if Verizon’s caught trading them, they’ve got a lawsuit on their hands. Your zip code and your birthday, on the other hand, are fair game.
That 'anonymized' data turns into a record of everywhere you've been
Here’s where it gets interesting: the second kind of data also includes your location, determined by the nearest cell tower whenever your phone checks in with the network, providing roughly a hundred data points each day. It’s the same data law enforcement uses if they’ve got a warrant out — but it belongs to the carriers, and as long as your "personal" data is stripped out, they’re allowed to sell the anonymized data to whoever they want. AT&T is already testing the data trade with its Adworks Lab, and they're not alone. Verizon's Precision Market Insights program claims to be able to build a detailed demographic breakdown of everyone in a stadium on a given night — a feat that would be impossible without anonymized data-mashing.
The problem is, the data may not be anonymous after all. Last week, a group of MIT data scientists found a way to work back to 95 percent of the people in a European carrier’s data set from just four new location data points. Those could be Foursquare posts, geolocated tweets, or items on a credit card slip. If someone's got four of those hits, along with a batch of anonymized data from the carriers, it's enough to single you out. Suddenly, that "anonymized" data turns into a detailed record of everywhere you've been.
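To see why a handful of points can be enough, here is a minimal sketch of the matching step. It is not the MIT group's method, just the basic idea under simple assumptions: each pseudonymous user's trace is a set of (tower, hour) pairs, and the attacker knows a few such pairs from outside sources. The names `traces` and `matching_users`, and all of the data, are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# A location point: (cell tower ID, hour of the day). Both are made-up labels.
Point = Tuple[str, int]

def matching_users(traces: Dict[str, Set[Point]], known: Set[Point]) -> List[str]:
    """Return the pseudonyms whose trace contains every known point."""
    return sorted(uid for uid, points in traces.items() if known <= points)

# Hypothetical "anonymized" data set: no names, only pseudonyms and locations.
traces = {
    "user_a": {("T1", 8), ("T2", 9), ("T3", 12), ("T4", 18), ("T1", 22)},
    "user_b": {("T1", 8), ("T2", 9), ("T5", 12), ("T4", 18), ("T6", 22)},
    "user_c": {("T1", 8), ("T7", 9), ("T3", 12), ("T8", 18), ("T1", 22)},
}

# One outside observation (say, a geolocated tweet) still fits everyone.
one_point = {("T1", 8)}
# Four outside observations fit only a single trace.
four_points = {("T1", 8), ("T2", 9), ("T3", 12), ("T4", 18)}

candidates_one = matching_users(traces, one_point)
candidates_four = matching_users(traces, four_points)
print(candidates_one, candidates_four)
```

In this toy data set, one point leaves three candidates and four points leave one. Real traces are far larger and noisier, but the narrowing works the same way.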
This kind of re-identification has happened before. In the mid-1990s, when a Massachusetts state group released a crop of anonymized medical records, a data scientist named Latanya Sweeney was able to re-identify them by comparing them to local voter rolls, and responded by mailing the governor a full copy of his private medical history. As detailed by Paul Ohm, she later proved that just a birthdate, zip code and gender are enough to identify 87 percent of the population, and knowing where someone is makes them even easier to ID. "Location pins you down a hell of a lot," said Lee Tien, a lawyer for the Electronic Frontier Foundation. "To know you're in a particular city, even if it's a big city like San Francisco, that ruled out most of the world right there."
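The voter-roll comparison can be sketched as a simple join on birthdate, zip code and gender. This is an illustration of the general linkage technique, not a reconstruction of Sweeney's actual work: every record, name and field below is invented, and `link_records` is a hypothetical helper.

```python
from collections import defaultdict

# All records below are invented for illustration.
medical_records = [
    {"birthdate": "1961-03-14", "zip": "02139", "gender": "F", "diagnosis": "asthma"},
    {"birthdate": "1975-11-02", "zip": "02139", "gender": "M", "diagnosis": "diabetes"},
    {"birthdate": "1980-06-30", "zip": "02140", "gender": "F", "diagnosis": "migraine"},
]
voter_roll = [
    {"name": "Pat Example", "birthdate": "1961-03-14", "zip": "02139", "gender": "F"},
    {"name": "Sam Sample", "birthdate": "1975-11-02", "zip": "02139", "gender": "M"},
    {"name": "Jo Placeholder", "birthdate": "1975-11-02", "zip": "02139", "gender": "M"},
]

def quasi_id(row):
    """The three 'harmless' fields that together act like a fingerprint."""
    return (row["birthdate"], row["zip"], row["gender"])

def link_records(medical, voters):
    """Attach a name to each medical record that matches exactly one voter."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[quasi_id(voter)].append(voter["name"])

    linked = {}
    for record in medical:
        names = names_by_key.get(quasi_id(record), [])
        if len(names) == 1:  # a unique match means the record is re-identified
            linked[names[0]] = record["diagnosis"]
    return linked

reidentified = link_records(medical_records, voter_roll)
print(reidentified)
```

Only the first record is re-identified here: the second matches two voters and stays ambiguous, and the third matches nobody. The 87 percent figure reflects how rarely that kind of ambiguity shows up in a real population.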
All that’s left is a little math, but this is the kind of math that gets you in trouble. To a lawyer, running this algorithm counts as a data breach, which states have harsh laws about. Once you cross from "anonymous" to "personal" data, you'll face a world of ugly consequences if anyone finds out. But to a data scientist, it's as simple as connecting the dots.
The tradeoff between privacy and utility is taking place behind closed doors
For the most part, the carriers have protected themselves by aggregating the data, never letting partners see all the data at the same time — but it’s an imperfect solution at best. If the algorithm is weak, data scientists can often work back to single users through subtraction, and in other cases, the result is only to chop the data into less manageable chunks. That makes re-identification harder, but it doesn't make it impossible.
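Working back "through subtraction" can be sketched with two aggregate queries that each look harmless on their own. This is a toy model, not any carrier's system: the minimum group size `K`, the `guarded_sum` function and all of the records are assumptions made up for the example.

```python
# Hypothetical rule: refuse to answer about groups smaller than K people.
K = 8

# Invented records: nine residents of one block, with a sensitive 0/1 flag.
records = [
    {"block": "B1", "age": 21, "visited_clinic": 0},
    {"block": "B1", "age": 25, "visited_clinic": 1},
    {"block": "B1", "age": 29, "visited_clinic": 0},
    {"block": "B1", "age": 34, "visited_clinic": 1},
    {"block": "B1", "age": 38, "visited_clinic": 0},
    {"block": "B1", "age": 41, "visited_clinic": 0},
    {"block": "B1", "age": 47, "visited_clinic": 1},
    {"block": "B1", "age": 52, "visited_clinic": 0},
    {"block": "B1", "age": 60, "visited_clinic": 0},
]

def guarded_sum(rows, in_group, attribute):
    """Sum an attribute over a group, but only if the group has at least K members."""
    group = [r for r in rows if in_group(r)]
    if len(group) < K:
        raise ValueError("group too small to report")
    return sum(r[attribute] for r in group)

# Query 1: everyone in the block (9 people, allowed).
q_all = guarded_sum(records, lambda r: r["block"] == "B1", "visited_clinic")
# Query 2: everyone in the block except the one 34-year-old (8 people, allowed).
q_rest = guarded_sum(
    records, lambda r: r["block"] == "B1" and r["age"] != 34, "visited_clinic"
)

# The difference is the 34-year-old's individual value.
leaked = q_all - q_rest
print(q_all, q_rest, leaked)
```

Both queries pass the size check, yet their difference exposes one person. Defending against this generally means limiting which combinations of queries a partner can run, not just how big each group is.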
A banner for Verizon's Precision Market Insight program
Most worrying of all, the tradeoff between privacy and utility is taking place behind closed doors. The strength of the de-identification tools is usually specified in the contract between the carrier and whoever they’re selling to (leaks are bad business, after all), but there are no third-party specifications to live up to. Unlike encryption, where public audits and white-hat attacks are accepted as a gold standard of security, nobody's ever put these aggregation algorithms to the test. And each step taken to preserve privacy makes the data less useful to the businesses who are footing the bill. If a business did want to work back to personal data, there would be little to stop them.
Already, some apps have gotten in on the data-trading game
For many companies, there is little to no regulation to abide by. GPS companies and software makers aren’t bound by the same FCC regulations that restrict carrier data, so they can sell whatever they like. GPS manufacturers like TomTom have been selling their more-accurate GPS data to traffic analysts for years. The maps on Windows Phone were built from that data, acquired through a partnership with Nokia's Navteq. (Apple and Google have been harvesting the same data to beef up their traffic-monitoring programs.) At the moment, most app makers are more interested in getting bought by larger companies than in potentially sketchy revenue sources like data trading, but it may not always be that way. Already, some apps have gotten in on the data-trading game, and as the mobile data market matures, it may become an increasingly common way of turning a profit.
Information doesn't come in discrete droplets anymore
One company we talked to, Airsage, has contracted for mobility data for a third of the country's phone users. That's billions of data points, about a terabyte of information each day, mapped for traffic info and market research. (Because of its contract, Airsage can't name the carriers that send it data, but there's reason to believe Verizon is one of them.) Of all the data traders involved in this story, only Airsage was willing to go into detail on its aggregation practices. It limits population queries to a minimum of eight people, broken out by census block. When we asked Verizon, they said they had "technical, procedural and administrative safeguards in place," and declined to elaborate. AT&T Adworks did not respond to requests for comment.
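A minimum-of-eight rule broken out by census block could look something like the sketch below. Airsage did not share its implementation, so this is only a guess at the general shape: the `block_counts` function, the device IDs and the block labels are all hypothetical.

```python
# Minimum group size taken from the figure quoted above; everything else is assumed.
MIN_GROUP = 8

def block_counts(sightings, min_group=MIN_GROUP):
    """Count distinct devices per census block, dropping blocks below the minimum."""
    devices_by_block = {}
    for device_id, block in sightings:
        devices_by_block.setdefault(block, set()).add(device_id)
    return {
        block: len(devices)
        for block, devices in devices_by_block.items()
        if len(devices) >= min_group
    }

# Invented sightings: ten devices in block "A", three in block "B".
sightings = [(f"dev{i}", "A") for i in range(10)]
sightings += [(f"dev{i}", "B") for i in range(10, 13)]
sightings.append(("dev0", "A"))  # a repeat sighting should not inflate the count

published = block_counts(sightings)
print(published)
```

Block "B" disappears from the output because only three devices were seen there. A rule like this hides small groups, but on its own it does not stop someone from combining several permitted answers.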
There’s a reason they don’t want to talk about it. It’s hard to design an airtight data system, and even harder to talk about it in a reassuring way. It’s unsettling to realize how much data we leave behind, and how eager companies are to scoop it up and sell it off. The usual reassurance is that it isn’t "personal" data, as if we’re protected by ignoring certain droplets of data — the identifiable kind, the SSN, the address. But information doesn’t come in discrete droplets anymore. Today, it's more like a flood: always flowing, changing, and combining in inconvenient ways. Location data, in particular, may be so informative that it can never be truly private. As Tien put it, "you have to be pretty clever if you’re going to stop it."