In January 2012, the Federal Bureau of Investigation made an open request to "industry" for a "social media alert, mapping, and analysis application" with the "flexibility to change search parameters and geo-locate the search based on breaking events or emerging threats." They wanted an app, in other words, that would allow keyword searches — for gunfire, for meth, for protests, for killings — in specific areas to locate bad guys and potential bad guys. They wanted a social media scraper. They wanted covert access to your tweets, your status updates, your Instagrammed photos. They wanted your YouTube comments and comments on websites made through Facebook. They wanted all of them, and more. This FBI request, along with similar desires among police outside the federal government, has since given birth to a cluster of companies that hope to capitalize.
With names like BlueJay, SnapTrends, and even the multibillion-dollar public-records database company LexisNexis, these companies claim to offer an unprecedented tool to police officers — a definitive, easy-to-use product that can keep police informed when people publicly announce crimes or potential crimes on the web.
This may sound like a potential violation of civil rights and freedoms. But analytics experts, who have used big data to predict how we shop and how we vote and what we look for on the web, bring up another point: scouring social media may not be useful to predict crimes. It may, in fact, be the opposite: an expensive wild goose chase, an endless search for a needle. It may lead to fruitless searches and false arrests. And if they’re right, these companies, these police departments, and the FBI may be scouring the web for something that isn’t there: a simple way to locate criminals on social media.
'A social canvas'
BlueJay is the police social-media scraper that's received the most attention so far. It was developed by the Sioux Falls, South Dakota-based Bright Planet — a company that advertises itself as "the definitive answer to the challenge of harvesting Big Data from the Deep Web." Its board of directors includes a navy admiral who served as President Reagan's national security advisor and a former governor of South Dakota.
BlueJay claims to offer a "crime scanner" for Twitter that will help police "monitor large public events, social unrest, gang communications, and criminally predicated individuals." It will also help to "identify potential witnesses and indicators for evidence," "track department mentions," and capture "tweets from the entire Twitter firehose."
Twitter’s application programming interface (API) is free and offered to anyone who wants it. The API is essentially a sample, a subset of billions of tweets that anyone can produce and stream based on whatever parameters they want to set. "Firehose" access is different: it's a stream of literally every tweet, in real time. It's relatively expensive, and not as easy to access as Twitter’s API. BlueJay thus claims their product is superior because it goes extra steps to access the firehose, allowing police to monitor every single tweet produced by every single Twitter user in their jurisdiction, rather than a subset. (There’s been some academic discussion about whether or not anyone needs access to billions of tweets rather than millions. This recent study out of Arizona State University suggests the API introduces significant bias into the tweet stream, meaning that maybe it’s more accurate to access the firehose.)
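The gap between a sample and the full stream can be illustrated with a toy example. The Python sketch below is not Twitter's API or any vendor's code: the posts are invented, `keyword_filter` and `sample_every` are hypothetical helper names, and "every third post" is only a crude stand-in for however a real sampled stream is drawn. It simply runs the same keyword search over a full list and over a subset of it.

```python
from typing import Iterable, Iterator, List, Set


def keyword_filter(posts: Iterable[str], keywords: Set[str]) -> Iterator[str]:
    """Yield posts that contain at least one keyword as a whole word."""
    for text in posts:
        # Lowercase each word and strip common punctuation before comparing.
        words = {w.strip(".,!?#:\"'").lower() for w in text.split()}
        if words & keywords:
            yield text


def sample_every(posts: List[str], n: int) -> List[str]:
    """Keep every n-th post (hypothetical stand-in for a sampled stream)."""
    return [p for i, p in enumerate(posts) if i % n == 0]


# Invented posts standing in for the complete stream ("firehose").
firehose = [
    "Traffic is terrible downtown",
    "Heard gunfire near the park",
    "New blog post: ten tips for gardening",
    "Protest starting at city hall at noon",
    "Buy followers cheap!!!",
    "Was that gunfire or fireworks?",
]
keywords = {"gunfire", "protest"}

full_hits = list(keyword_filter(firehose, keywords))
sampled_hits = list(keyword_filter(sample_every(firehose, 3), keywords))

print(len(full_hits), "matches in the full stream")
print(len(sampled_hits), "matches in the sample")
```

In this toy data the full list contains three matching posts while the sample catches only one of them. Whether that kind of gap matters in practice is exactly what the academic discussion is about.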
Plus, "BlueJay is invisible and covert," so no one will know if they’re being tracked. BlueJay costs $150 per month, for a minimum of three months.
SnapTrends is another very similar option. It provides basically the same service as BlueJay, but it’s able to identify where keywords are trending on social media. It then shows those trends in red on a map as hotspots. The point is to identify where things are happening in real time and to pinpoint a search, in the words of the FBI, "based on breaking events or emerging threats." SnapTrends isn’t limiting itself to police use, however; its salespeople see it as a potential tool for news organizations and even the energy industry. SnapTrends representatives wouldn’t give an exact price, but indicated the service was slightly more expensive than BlueJay.
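One simple way such a hotspot map could be built is to drop each geotagged post that mentions a keyword into a grid cell and count the posts per cell. The Python sketch below shows that idea only; it is an assumption about the general technique, not SnapTrends' actual method. The `Post` class, the `hotspot_counts` function, the 0.01-degree cell size, and the sample coordinates are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Post:
    text: str
    lat: float
    lon: float


def hotspot_counts(posts: Iterable[Post], keyword: str,
                   cell_size: float = 0.01) -> Counter:
    """Count keyword mentions per grid cell of cell_size degrees on a side."""
    counts: Counter = Counter()
    for p in posts:
        # Simple case-insensitive substring match on the post text.
        if keyword.lower() in p.text.lower():
            cell: Tuple[int, int] = (
                math.floor(p.lat / cell_size),
                math.floor(p.lon / cell_size),
            )
            counts[cell] += 1
    return counts


# Invented geotagged posts for illustration.
posts = [
    Post("Heard gunfire near the park", 43.5446, -96.7311),
    Post("Gunfire again on our block", 43.5449, -96.7318),
    Post("Was that gunfire or fireworks?", 43.6012, -96.7005),
    Post("Great coffee this morning", 43.5447, -96.7312),
]

counts = hotspot_counts(posts, "gunfire")
# The busiest cell is the one a map would shade most heavily.
print(counts.most_common(1))
```

A real product would also have to deal with posts that carry no location at all, and with the sarcasm and spam that a bare keyword match cannot tell apart from a genuine report.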
LexisNexis jumped into the ring in mid-October, at the International Association of Chiefs of Police (IACP) conference in Philadelphia, with Social Media Monitor — its version of BlueJay and SnapTrends. The advantage, its salespeople claim, is that the service is tacked onto LexisNexis’ other services; it adds to an already-valuable trove of useful data. A LexisNexis representative would not disclose how much the service costs, but he said the service uses Twitter’s firehose. "By entering a few search terms," a press release announces, "law enforcement personnel are provided a social canvas within minutes, adding a virtual dimension to traditional public records data."
Providing that "social canvas" seems to be a common thread in these companies and others such as 3i:Mind, NICE, and FLIR, as reported by Rolling Stone last month. But a question remains: do these products serve a real purpose?
'Spam and garbage'
Tom H.C. Anderson is founder and managing partner of Anderson Analytics, a full-service market research company specializing in data and text mining for businesses. He has an MBA from the University of Connecticut, where he speaks in graduate-level marketing research and data-mining courses, and a master’s degree in economics from Lund University in Sweden.
Anderson doesn’t work with police agencies (his company compiles data for Kodak, Starwood Hotels & Resorts, Jiffy Lube, and MTV, among others) but that's why The Verge got in touch with him. While scraping websites and social media for crime analytics is a burgeoning field, marketers such as Anderson have been scraping the web for information about how people interact on the web for years. Anderson’s been data scraping since 2005.
Unlike BlueJay, SnapTrends, and Social Media Monitor, Anderson’s company shies away from social media data.
"As an analyst, I don’t really find blogs and Twitter all that interesting," he says.
First of all, he says, there’s a relatively small percentage of people who use Twitter. Pew puts that number at around 18 percent of all internet users. So Twitter doesn’t even represent a majority of those with web access. Second, BlueJay may use Twitter’s firehose as a selling point, but Anderson says perhaps it shouldn’t. "The information you get from Twitter is usually not all that great," Anderson says. "Most of it is spam and garbage, people trying to sell their product, or their religion, or their blog. And so you have to filter out a huge portion of that noise." In addition, he says, "there’s very little contextual information available; there’s usually only a time and date stamp, and 140 characters." He points out that it’s also sometimes difficult to identify whether someone’s male or female. Companies pay for this data sometimes, he says, because it helps show, to some degree, where and when products are being mentioned.
"The problem is, how many people are offering real, honest, and useful information about Coke, for example, in a tweet?" he asks. "Not often."
Since more people use Facebook than Twitter, Anderson acknowledges that Facebook might provide a more thorough portrait of how a community communicates, and about what. But even there, he says, "a huge percentage of Facebook pages are company pages, and some major percentage of users set their wall to private. So it’s a walled garden."
Rather than scrape Twitter data, he says, good marketers are inclined to focus on specific geographic areas with structured surveys, and to scour websites designed to discuss pertinent information. Online forums and review sites such as Yelp and TripAdvisor are far more useful, he says. This approach relies less on the firehose of Twitter data and more on using focused data to home in on a specific goal.
Experts in crime analytics make similar points.
George Mohler is an assistant professor in mathematics and computer science at Santa Clara University. He also co-founded PredPol, a company that uses historical crime-incident reports to identify geographic hotspots — areas where crime is likely to occur.
Mohler also argues for a more structured approach.
"If you’re trying to monitor crime risk over a whole city, it’d maybe be useful to have geo-located tweets," Mohler says. But if a police force is trying to deploy officers efficiently, he says, a better solution would be to rely less on a firehose of Twitter data and more on homing in on a specific goal.
"Tweets can create bias in data," he says. "We don’t typically use them."
Where both Anderson and Mohler agree, however, is that social media scraping might be useful to determine when a flash mob is going out of control, when someone openly admits to a crime, when a protest is about to happen, or if a family member of a homicide victim might be in danger. Products such as BlueJay, SnapTrends, and LexisNexis’ Social Media Monitor might alert law enforcement sooner when things go haywire away from the ears of police scanners.
"On social media, people have to assume some risk," says Hanni Fakhoury, a litigator and advocate who focuses on criminal law, privacy, and free speech at the Electronic Frontier Foundation. "On one hand, if you brag [on Twitter] about a crime you committed, it's on you, that's your fault. But on the other hand, the problem with these programs [such as BlueJay and SnapTrends and others] is that they encourage police to aggregate information about who random people are, who they associate with, and who they interact with. There’s room for abuse there."
Fakhoury points out that BlueJay and SnapTrends might allow police to know about a protest a few minutes earlier than they would’ve otherwise, but there’s real risk of "stereotypes and hyperbole" (and possible overreactions to Facebook posts and sarcastic hashtags) that could bring a SWAT team to an innocent family’s door, or could mistake a peaceful protest for violent radicalization.
"This is what the NSA is doing but on a smaller scale," Fakhoury says. "They’re sifting through this data to generate investigative leads, but they’re also sweeping up lots of innocent people, too."
Update: A previous version of this story incorrectly stated that SnapTrends does not have access to the Twitter "firehose." It does. This story was edited after publication to remove that inaccuracy.