In 2006, Netflix showed the world that anonymity isn't the same thing as privacy. The company released millions of movie ratings from its users, offering a cash prize to anyone who could use them to build a better recommendation engine. But when researchers cross-referenced the now-public scores with IMDb reviews, they discovered they could effectively unmask the people behind them, even though customer names had been replaced by numbers. Netflix ended up in court with a woman who feared its contest would reveal her sexual orientation: that the films she was comfortable enough to talk about in public could inadvertently expose the ones she never meant to go beyond her queue. Today, researchers from MIT have published yet another reminder that no matter how many personal details are stripped away, our digital footprints can be uncomfortably revealing.
If what we watch is a touchy subject, what we buy is even more intimate and revealing. Programs like Facebook Beacon, designed to advertise users' recent purchases to their friends, have been widely reviled. But the study, published today in Science, isn't about personal sharing. It's based on testing what the researchers call unicity: the odds that if you know fragments of a person's shopping history, you can match them against a much larger amount of data, uncovering everything else they've bought. As it turns out, those odds are very high.
Unlike Netflix, banks aren't likely to be releasing millions of records to the public. And we already reveal our purchases in countless other ways — to social networks, department stores, and advertisers that can piece the threads together in eerily accurate ways. But as we record and quantify more of our lives, it's worth thinking about who might be watching, and what they could find. The NSA has reportedly mined credit card details the same way it's swept up emails and phone metadata; it already tracks suspects by matching calls against its vast trove of anonymous records. And taking previous, similar studies into account, the researchers speculate that most kinds of large-scale databases will be similarly revealing. "The research here is really on the limits of anonymization [on] big, high-dimensional data — mobile phone data, credit cards, browsing, and so forth," says lead author Yves-Alexandre de Montjoye. "How does your behavior compare with the behavior of others, and potentially make us unique?"
Three purchases can give you a match 94 percent of the time
The authors started with three months of credit card transactions from 1.1 million people, provided by an unnamed bank in an unnamed OECD country (de Montjoye wouldn't get more specific). They randomly pulled out a few single purchases for each person to serve as the "known" fragments, then put the entire set in an anonymized database, removing details like names or bank account numbers. The database only gave prices within a range, since knowing that someone spent exactly $3.21 at Starbucks could give them away almost immediately. And the researchers dropped very large payments (more than $22,000) for the same reason.
But these precautions proved only a minor stumbling block. When the authors mapped the locations, dates, and prices of someone's known purchases against the whole database, it was usually easy to find a single, unique pattern. With three points or more, it was virtually a certainty. "You bought a coffee at that coffee shop, and you bought jeans at that shop, and then you bought a pizza," says de Montjoye, by way of example. There's a 94 percent chance that you're the only person who did so. Taking away price altogether made these matches harder to find. But with four purchases, the odds were back up to 90 percent.
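To make the matching step concrete, here is a minimal sketch of how such a test could be run on toy data. It is not the researchers' code: the dataset, the `(shop, day, price_bucket)` tuple layout, and the function names `matching_users` and `estimate_unicity` are all assumptions made for illustration. The idea is simply to pick a person, pretend you know `k` of their purchases, and check whether anyone else in the database shares all of them.

```python
import random

# Hypothetical toy database: user -> set of (shop, day, price_bucket) purchases.
TOY_TRACES = {
    "u1": {("cafe", 1, "low"), ("jeans", 2, "mid"), ("pizza", 2, "low")},
    "u2": {("cafe", 1, "low"), ("jeans", 2, "mid"), ("books", 3, "low")},
    "u3": {("cafe", 1, "low"), ("pizza", 2, "low"), ("books", 3, "low")},
}


def matching_users(traces, known_points):
    """Return every user whose purchase history contains all known points."""
    known = set(known_points)
    return [user for user, trace in traces.items() if known <= trace]


def estimate_unicity(traces, k, trials=1000, seed=0):
    """Estimate the share of random k-point samples that match exactly one user."""
    rng = random.Random(seed)
    # Only users with at least k purchases can be sampled.
    eligible = [user for user, trace in traces.items() if len(trace) >= k]
    if not eligible:
        raise ValueError("no user has at least k purchases")
    unique_hits = 0
    for _ in range(trials):
        user = rng.choice(eligible)
        # sorted() turns the set into a sequence so it can be sampled.
        known = rng.sample(sorted(traces[user]), k)
        if len(matching_users(traces, known)) == 1:
            unique_hits += 1
    return unique_hits / trials


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, estimate_unicity(TOY_TRACES, k))
```

In this toy set, every single purchase is shared by at least two people, so one known point never singles anyone out, while any three known points always do. A real database behaves the same way in spirit: each extra known purchase shrinks the pool of candidates.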
De Montjoye admits that other countries or regions could see slightly different results; logically, the more people use credit cards in a given area, the more difficult finding a match will become. Unicity already varies by income and gender: women are more identifiable than men, and richer people more identifiable than poorer ones. These are the kinds of differences that are likely to draw both speculation and stereotyping, but the authors say that figuring out the real factors behind them is beyond this study's scope.
Even the fuzziest data creates unique patterns
Overall, though, it's not surprising that this kind of extrapolation is possible. De Montjoye authored a similar paper in 2012, using cellphone location tracking instead of purchases. If you knew where someone had been at four points in time, there was a 95 percent chance of finding the rest of their movements in a database of 1.5 million people. What's more interesting, perhaps, is that de Montjoye believes it's almost impossible to make this kind of information truly anonymous.
The researchers also looked at the same records at a much coarser level, trying to create points so vague they couldn't be matched. To some extent, they succeeded. At the far end of the scale, they made the anonymized purchases accurate only to within 15 days and a geographic area of 350 shops, and widened the price ranges as well. With these changes, there's less than a 15 percent chance that knowing four things someone has bought will help you find them. It takes 10 known points to get an 80 percent chance. But that's still not a guarantee of privacy.
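The coarsening described above can be pictured as rounding each purchase into a wider bucket. The sketch below is an illustration under stated assumptions, not the study's procedure: the 15-day window and 350-shop grouping come from the article, but the `coarsen` function, the price cut-offs in `price_edges`, and the shortcut of grouping shops by consecutive ID (rather than by real geography) are hypothetical.

```python
from bisect import bisect_right


def coarsen(point, day_window=15, shops_per_group=350,
            price_edges=(5, 20, 100, 500)):
    """Blur one (shop_id, day, price) purchase into (shop group, period, price band).

    Assumption: shops with nearby integer IDs are treated as neighbors, and
    price_edges are made-up cut-offs between price bands.
    """
    shop_id, day, price = point
    return (
        shop_id // shops_per_group,       # which block of 350 shops
        day // day_window,                # which 15-day period
        bisect_right(price_edges, price), # index of the price band
    )


if __name__ == "__main__":
    # Two different purchases that become indistinguishable once blurred.
    print(coarsen((120, 3, 3.21)))   # (0, 0, 0)
    print(coarsen((310, 11, 4.50)))  # (0, 0, 0)
    print(coarsen((700, 16, 42.0)))  # (2, 1, 2)
```

The first two purchases happen at different shops, on different days, for different amounts, yet they land in the same bucket. That is what lowers the odds of a match, and also why a long enough list of known purchases can still narrow the field.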
For researchers and companies who want to work with "big data," at least, de Montjoye hopes that owners can find a way to minimize what they give away. He suggests granting access to more abstract code instead of the raw information, comparing the system to Pandora's Music Genome Project — which captures "distinct musical characteristics" to show a profile of someone's tastes without actually revealing their playlists. And on a larger level, he wants to show people how much even vague details can reveal about a person. "We need to rethink what it means for anonymization," says de Montjoye. "And maybe at least be aware of what the risks are."