In 2010 Christian Rudder, one of the founders of OkCupid, started a blog to accompany his massively popular dating site. Called OkTrends, it was an under-the-hood look at the vast amounts of self-reported data he and his colleagues had access to as the administrators of a site where millions of people answered extensive questionnaires, filled out in-depth profiles, and messaged potential partners.
On OkTrends, Rudder made ample use of his Harvard math degree, pumping out pie charts and line graphs to bolster observations like "heavy Twitter users masturbate more often" than light Twitter users and "black people are more than twice as likely to mention their faith in their profiles" as people who identify as white, Asian, or Hispanic. But the much-loved blog went dormant after less than 12 months.
Four years later, OkTrends is back in book form with Rudder’s Dataclysm: Who We Are When We Think No One is Looking.
The original OkTrends blog was fascinating, less for its meticulous observations about which profile photos were more likely to attract messages than for its comments on larger issues of self-identification. In one widely shared post, Rudder created word clouds based on how users describe themselves, indexed by race and gender. For white men, the top hit was Tom Clancy. For black women, soul food.
Dataclysm is a 247-page expansion on posts like those. It's augmented by additional data culled from sites like Reddit and Craigslist in an effort to expose the patterns that fascinate Rudder and his data-collecting colleagues. "Looking at people like this is like looking at the Earth from space," he writes. "You lose the detail, but you get to see something familiar in a totally new way." Armed with that sense of wonder and a sharp enthusiasm for the data he’s collected, Rudder tackles a range of subjects in three sections, each containing dozens of lovely two-toned graphs: What Brings Us Together (dating and sexual attraction), What Pulls Us Apart (social and political fractures), and What Makes Us Who We Are (how we self-identify).
Dataclysm is full of weird bits of viral trivia of the sort to make you go "huh," or "wow": visuals that, if they hadn’t been printed in paper and ink, would see a million reblogs. Dataclysm calculates the relationship between "percentiles of attractiveness" and how many friends a Facebook user has. It tells us that Twitter users with more than 1,000 followers use a lot of corny marketing words like "marketing" and "tweetup." We learn that when you compare the words most commonly used on Twitter with those used in the English language elsewhere, Twitter users write "love" and "today" far more frequently. Rudder has found that white men on dating sites are far less likely to send messages to black women than to women of any other race. The least popular white male interest on OkCupid? Slow jams. For black men, it’s Borges. Rudder has also compiled state-by-state maps showing where Craigslist missed connections are most likely to occur. In New York it’s the subway; in Texas it’s Walmart; in Southern California, the gym.
So: People on Twitter are terrible. Women are desirable and have a rough go of it. We’re divided by class, which correlates with geography. Americans are racist as hell. Huh. Wow.
This is, Rudder writes, "a series of vignettes"; you’ll find very little analysis in Dataclysm. Rudder’s writing skirts politically charged topics, often connecting the data to his own personal experiences or paving the way for a block quote cribbed from a liberal arts syllabus. He cites Naomi Wolf’s The Beauty Myth on the double standards facing women; when discussing being black in America, he quotes Barack Obama.
Rudder is excited by the idea of being able to see, as the book’s cover says, how we behave "when we think no one is looking": a world where data collection doesn’t occur in a lab, but in the channels of the internet where participants are free from self-consciousness. This, says Rudder, is the authentic stuff. The subtext, though he never quite comes out and says it, is an idea that’s held by many in the tech industry: we’re living through the democratization of information, in this case of hard data.
But buried in the back of the book, in Rudder’s notes about the data collection itself, we learn that the author gathered most of his information through a combination of buddy-to-buddy and business-to-business interactions with the people behind the companies who collect it, an admission that doesn’t do much to dissolve the vision of Silicon Valley as an exclusive foosball-peppered frat lounge. Occasionally, Rudder does make reference to what amounts to this extra-civilian status, offhandedly commenting on secretive decisions social media companies make to perpetuate loops of endless likes and faves. Rudder might have written a more useful book about that design process, knowledgeable as he is about the inner workings of the industry. But Dataclysm, on the whole, bereft of serious analysis and unwilling to take advantage of its privileged insider perspective, reads like the data scientist’s equivalent of a ¯\_(ツ)_/¯.
Big, popular books about sociology have traditionally had big, popular ideas that carried them: Malcolm Gladwell’s "tipping point," Pierre Bourdieu’s invention of the term "cultural capital." The idea is that you do your research first, then craft a digestible thesis around it. If Dataclysm has a central idea embedded in it, it’s that it’s okay for the tech industry to scrape your data off every last surface you touch — and then to write sociology books about it. After all, look at all the pretty graphs it can produce!
In his conclusion, Rudder says he hopes this is just the beginning, that this science will be further refined and we’ll be able to extrapolate great things from it. I do too. But a lot has changed since he started the OkTrends blog. In 2011, we didn’t yet know the extent to which we were being surveilled by our government. Facebook hadn’t yet admitted to giving its users the lab rat treatment, a contentious bit of information that inspired Rudder’s first post to OkTrends since it went dormant three years ago, titled "We Experiment on Human Beings!"
The OkTrends blog, and this book, harken back to a more innocent time, when our anxiety about big data wasn’t so omnipresent and the challenges it presents were less ominous. These days, we have more information than any of us knows what to do with, all of it downloaded and archived with little to help us interpret it. Graphs that make you go "huh" don’t really help, and there’s something that doesn’t feel right about a startup president writing a light, punchy book about issues of race and class just because he has the keys to the data locker. Given the amount of information being gathered about us, we need something that takes the ethical questions of 2014 more seriously, or at least helps us better understand the industries from which these numbers come. We do not need a book filled with data about data collection heaped upon an existing mountain of data, all of it telling us what we sort of already knew.