Google Search is still one of the most powerful, reliable products on the web, thanks to a smart algorithm and the sheer brute-force quantity of data that Google pulls off the web on a daily basis. The power of Google Search rests on its crawl being bigger and faster than anyone else's. But there's a problem, or at least there is if you're a researcher. Once Google has collected that data, it's not interested in sharing. It makes sense as a business play (you wouldn't want to do Bing any favors), but it cuts researchers off from one of the most powerful stockpiles of web data we have.
A project called Common Crawl is building its own version of Google's crawl that will be open to anyone who wants it. The result isn't quite as large or fresh as Mountain View's data heap, but it's large enough to power a growing flock of businesses and projects. And for researchers struggling to understand the shape of the web, it could be a lifeline.
Common Crawl estimates that it caches roughly five billion pages each time through. That is just a fraction of Google's massive crawls, but enough to give a usable picture of an often-ephemeral web. Other projects exist to capture this data, most notably The Internet Archive, but Common Crawl crawls more pages more often and shares the data more efficiently. If you want to track the rise of non-Twitter hashtags, say, or parenting blogs with Italian domain names, all you have to do is work through the data. It's also useful for startups that can't afford their own crawls. The data has been used to power TinEye, a reverse image-search tool, and social-data miner Lucky Oyster. The project only performs four crawls a year, so it's less useful for anything recent, but plenty of projects have been willing to settle for a few months of lag in exchange for a free map of the central web. "I think we're going to start seeing this data in startups more and more," Common Crawl Director Lisa Green told The Verge. "One of our goals is to lower the barrier for an interesting idea."
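To make "working through the data" concrete, here is a minimal Python sketch of the two examples mentioned: picking out pages on Italian domains and counting hashtags that appear outside Twitter. It assumes the crawl has already been reduced to simple (URL, page text) pairs; that record shape, the function names, and the sample data are illustrative assumptions, not Common Crawl's actual file format or tooling.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# A hashtag here is '#' followed by word characters, not preceded by a
# word character (so "page#section" is not counted).
HASHTAG = re.compile(r"(?<!\w)#(\w+)")


def is_italian_domain(url):
    """Return True if the URL's hostname ends in the .it top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith(".it")


def count_hashtags(records):
    """Count hashtags in (url, text) pairs, skipping pages hosted on twitter.com."""
    counts = Counter()
    for url, text in records:
        host = urlparse(url).hostname or ""
        if host == "twitter.com" or host.endswith(".twitter.com"):
            continue
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Hypothetical sample records standing in for crawled pages.
sample = [
    ("http://blog.example.it/post", "Primo giorno di scuola #mamma #Scuola"),
    ("https://twitter.com/someone", "ignored here #mamma"),
    ("http://example.com/page", "Launch day #startup #mamma"),
]

italian_pages = [url for url, _ in sample if is_italian_domain(url)]
hashtag_counts = count_hashtags(sample)
```

At real crawl scale the same filtering logic would have to run across billions of pages, which is where most of the work lies.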
Hosting all that data would usually be expensive (at one point, Common Crawl was paying more than $7,000 a month for data storage), but now the project gets its hosting for free thanks to Amazon Public Data Sets, a little-known program that offers free hosting as long as the information is useful and free to download. The result is a lifeline for a low-budget non-profit like Common Crawl. "I don't mean to sound like a cheerleader for Amazon, but honestly this project wouldn't work without them," said Green.
"I worry that data-licensing is facing the same thing that open-source software faced."
At four crawls a year, it won't be much use for Google competitors on the straight web-search front, but sites like TinEye are already starting to duplicate various functions of image searching. It's easy to imagine using Common Crawl data to build something like Google Trends, Google Ngrams, or a useful search feature Google hasn’t thought of yet. And if it happens, developers won’t have to worry about a license holding them down.