Real-time search is hard. With old-fashioned, web-crawling, desktop search, you have fairly stable data sets, strong ranking and network relationships that help verify results, and lots and lots of data. You need smart algorithms to bring results up quickly, rank them well, and detect bad actors gaming the system, but mostly you're relying on the sheer mass of data. Real-time search isn't like that. You need to find data on the move, with lots of little pieces flying everywhere. It's like turning a Newtonian physics problem into a quantum one.

In a new post on its engineering blog, Twitter's Edwin Chen and Alpa Jain pose the two basic problems of real-time search:

  1. The queries people perform have probably never before been seen, so it's impossible to know without very specific context what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for "horses and bayonets" are interested in the Presidential debates?
  2. Since these spikes in search queries are so short-lived, there’s only a small window of opportunity to learn what they mean.

Just today, data scientist Hilary Mason described the effort behind Bitly's new real-time search engine and API, which uses click rates to track bursts of activity, then reassembles the links involved into coherent stories while still tracking location and other metadata. These are difficult problems.

Twitter treats real-time search like a CAPTCHA problem

Like Bitly, Twitter has a great real-time data set and very smart data scientists and engineers. But instead of relying on a primarily computational solution, Twitter treats real-time search more like a CAPTCHA problem. With this kind of messy data, lots of human brains can find meaning much faster and more accurately than lots of lines of code. So Twitter uses a real-time computation system called Storm to identify search spikes, then Mechanical Turk (Amazon's crowdsourcing online platform for small jobs) to farm out annotating that data to human beings all over the world. The annotations basically take the spiking search term and tag it for relevance and intent. A human annotator (Twitter calls them "judges") can tell Twitter's systems whether searches for "Stanford" refer to a university or to its football team, or that searches for "Big Bird" aren't primarily referencing a children's show, but a political debate. This helps Twitter make trending topics smarter and more coherent.
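To make that two-stage pipeline concrete, here is a minimal sketch in Python of the two ideas: flagging a query whose volume suddenly jumps, then collapsing several human judgments into one label. It is not Twitter's code, and it uses neither Storm nor the Mechanical Turk API. The names `SpikeDetector` and `aggregate_judgments`, the window counts, and every threshold are assumptions made up for illustration.

```python
from collections import Counter, deque


class SpikeDetector:
    """Flags a query when its count in the latest window far exceeds its recent average.

    All parameters are illustrative assumptions, not values from Twitter.
    """

    def __init__(self, history=6, ratio=5.0, min_count=20):
        self.history = history      # how many past windows form the baseline
        self.ratio = ratio          # how many times the baseline counts as a spike
        self.min_count = min_count  # ignore queries with too little volume
        self.past = {}              # query -> deque of counts from earlier windows

    def observe_window(self, counts):
        """Take {query: searches in the window that just ended}; return spiking queries."""
        spiking = []
        for query, count in counts.items():
            past = self.past.setdefault(query, deque(maxlen=self.history))
            baseline = sum(past) / len(past) if past else 0.0
            # A query with no history gets a floor baseline of 1, so a brand-new
            # high-volume query is treated as a spike.
            if count >= self.min_count and count > self.ratio * max(baseline, 1.0):
                spiking.append(query)
            past.append(count)
        # Queries that were not searched at all in this window record a zero.
        for query, past in self.past.items():
            if query not in counts:
                past.append(0)
        return sorted(spiking)


def aggregate_judgments(labels, min_agreement=0.6):
    """Return the most common label if enough judges agree, otherwise None."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= min_agreement else None


detector = SpikeDetector(history=3)
detector.observe_window({"weather": 40, "big bird": 2})
detector.observe_window({"weather": 42, "big bird": 3})
spikes = detector.observe_window({"weather": 41, "big bird": 500})
# Only "big bird" is flagged; its judges' labels are then reduced to one answer.
category = aggregate_judgments(["politics", "politics", "tv", "politics"])
```

Asking several judges and requiring agreement is one simple way to guard against a single careless answer; when the judges split too evenly, the sketch returns no label rather than guessing.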

But here's the dark stroke of genius behind using huge masses of people to help sort out the meaning of Twitter searches: part of the judges' task is also to match spiking search terms with pictures, events, and other categories that can help Twitter serve up relevant advertising. "For example, suppose our evaluators tell us that [Big Bird] is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer." The judges are like little focus groups that match intent with revenue.
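The step from judgment to ad can be pictured as a simple lookup. The sketch below, in Python, stores the category a judge assigned to a query and returns ads for that category the next time the query appears. `IntentStore`, `normalize`, the category names, and the Dora placeholder are hypothetical; only the two account handles come from Twitter's quoted example, and nothing here reflects how Twitter's ad system is actually built.

```python
def normalize(query):
    """Lowercase and collapse whitespace so 'Big  Bird' and 'big bird' match."""
    return " ".join(query.lower().split())


class IntentStore:
    """Remembers which category human judges assigned to each query (hypothetical)."""

    def __init__(self):
        self._category_by_query = {}

    def record(self, query, category):
        self._category_by_query[normalize(query)] = category

    def ads_for(self, query, ads_by_category):
        """Return ads for the query's judged category, or nothing if it is unjudged."""
        category = self._category_by_query.get(normalize(query))
        if category is None:
            return []
        return list(ads_by_category.get(category, []))


# Illustrative inventory: the handles are from the quoted example,
# the second entry is a made-up placeholder.
ADS_BY_CATEGORY = {
    "politics": ["@barackobama", "@mittromney"],
    "children's tv": ["dora-the-explorer-ad"],
}

store = IntentStore()
store.record("Big Bird", "politics")
ads = store.ads_for("big bird", ADS_BY_CATEGORY)
```

A query nobody has judged yet returns an empty list, which is the safe default: better no targeted ad than the wrong one.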

Step one: make real-time search work
Step two: make real-time search pay

This solves the second half of Twitter's Google problem. First, it has to make real-time search fast, relevant, and reliable. Second, it has to intelligently make search work as a driver of advertising without frustrating or bewildering either the advertiser or the consumer. (Excuse me, customer. Jack Dorsey likes it when you call them customers.) If a customer (that is, an advertiser) makes an ad buy, Twitter needs to assure the customer that the relevant ad will be served and served well, and that its search engine is an instrument for that purpose (without being ostentatiously or obnoxiously so).

Search-driven ads are a tried-and-true way to make money on the web. Twitter's approach solves the Facebook problem of social advertising ("I'm just here to hang out, not to buy anything") and leverages Twitter's unique role as a fast clearing-house for news and emerging trends. Twitter just told everyone how it makes money, and how it will make money from now until algorithms catch up with human brains or the service falls out of fashion, whichever comes first.