Google’s search engine for datasets, the cunningly named Dataset Search, is now out of beta, with new tools to better filter searches and access to almost 25 million datasets.
Dataset Search launched in September 2018, with Google hoping to slowly unify the fragmented world of online, open-access data. Although many institutions like universities, governments, and labs publish data online, it’s often difficult to find using traditional search. But by adding open-source metadata tags to their webpages, these groups can have their data indexed by Dataset Search, which now covers a huge range of information — everything from skiing injuries to volcano eruptions to penguin populations.
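To make the metadata step concrete, here is a minimal sketch of what such a tag might look like. The article does not name a vocabulary, so this assumes the publisher uses the schema.org `Dataset` type embedded as JSON-LD. The helper names, dataset title, and example.org URL are hypothetical, chosen only for illustration.

```python
import json


def build_dataset_jsonld(name, description, url,
                         license_url=None, spatial_coverage=None):
    """Return a schema.org Dataset description as a JSON-LD string.

    Assumption: the page is described with the schema.org "Dataset" type.
    The article itself only says "open-source metadata tags".
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    # Optional properties; a publisher would fill these in where known.
    if license_url:
        record["license"] = license_url
    if spatial_coverage:
        record["spatialCoverage"] = spatial_coverage
    return json.dumps(record, indent=2)


def wrap_in_script_tag(jsonld):
    """Wrap JSON-LD in the script element used to embed it in an HTML page."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


# Hypothetical dataset, for illustration only.
example = build_dataset_jsonld(
    name="Hypothetical penguin population counts",
    description="Illustrative yearly counts of penguin colonies.",
    url="https://example.org/datasets/penguins",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    spatial_coverage="Antarctica",
)
snippet = wrap_in_script_tag(example)
print(snippet)
```

A publisher would paste the printed `<script>` element into the dataset's landing page. Under this assumption, properties such as `license` and `spatialCoverage` are the kind of information a search engine could use to filter results by usage rights or geographic area.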
Google would not share any specific usage figures for the search engine, but it said “hundreds of thousands of users” have tried Dataset Search since its launch, and that the reaction from the scientific community has been positive overall.
Natasha Noy, a research scientist at Google AI who helped create the tool, tells The Verge that “most [data] repositories have been very responsive” and that the engine’s launch has meant older scientific institutions are now taking “publishing metadata more seriously.”
“For example, [the prestigious scientific journal] Nature is changing its policies to require data sharing with proper metadata,” Noy says, highlighting a change that will make the data underpinning top-flight scientific research more accessible in future.
New features added to Dataset Search include the ability to filter data by type (tables, images, text, etc.), by whether it’s free to use, and by the geographic areas it covers. The engine is also now available on mobile and has expanded dataset descriptions.
Google says the corpus covered by the search engine — almost 25 million datasets — is only a “fraction of datasets on the web,” but a “significant” one all the same. The largest topics indexed are geosciences, biology, and agriculture, and the most common queries include “education,” “weather,” “cancer,” “crime,” “soccer,” and “dogs.” The US is also the leader in open government datasets, publishing more than 2 million online.
Noy would not comment on future plans for Dataset Search, but she says the team is thinking about a number of functions they hope will be useful, including “understanding how datasets are cited and reused” and “helping users explore datasets in Dataset Search when they don’t necessarily know what they are looking for.”
“And, of course, continuing to expand the corpus,” says Noy. There’s always more data out there.