Google has released an extremely large dataset that could help developers build software that accurately interprets human language. Called the Wikilinks Corpus, the collection comprises more than 40 million individual links, or "mentions," from web pages to Wikipedia articles. Analysing the context of each mention alongside the content of the destination article should allow engineers to determine the meanings of ambiguous words more accurately.
Humans are "amazingly good" at disambiguation
As a post on Google's Research Blog points out, humans are "amazingly good" at distinguishing between meanings — for instance, "Dodge" the car brand and the verb "to dodge." This is at least partly attributable to the massive banks of experience that most of us have built up over many years of language use, the sorts of personal archives that brand new pieces of software are unable to draw on.
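To make the idea concrete, here is a minimal sketch of how mention contexts could help software choose between meanings. It is not Google's method or the actual format of the Wikilinks Corpus: the sample sentences, the target labels "Dodge" and "Dodgeball", the stopword list and the function names are all invented for illustration, and the scoring is a simple word-overlap count.

```python
import re
from collections import Counter

# Hypothetical toy data, NOT the real corpus format:
# (anchor text, surrounding context, Wikipedia article the link points to).
MENTIONS = [
    ("Dodge", "the new Dodge pickup has a bigger engine", "Dodge"),
    ("Dodge", "she bought a used Dodge car from a dealer", "Dodge"),
    ("dodge", "players dodge the ball thrown by the other team", "Dodgeball"),
    ("dodge", "duck and dodge each ball thrown at your team", "Dodgeball"),
]

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "and", "to", "by", "at", "from", "he", "she",
             "his", "has", "your", "each", "other", "in", "this", "you"}


def tokenize(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_profiles(mentions):
    """Count the context words seen around links to each target article."""
    profiles = {}
    for anchor, context, target in mentions:
        # Skip the ambiguous word itself; it appears with every meaning.
        words = [w for w in tokenize(context) if w != anchor.lower()]
        profiles.setdefault(target, Counter()).update(words)
    return profiles


def disambiguate(context, profiles):
    """Pick the target whose known contexts share the most words with this one."""
    words = set(tokenize(context))
    scores = {target: sum(counts[w] for w in words)
              for target, counts in profiles.items()}
    return max(scores, key=scores.get)


profiles = build_profiles(MENTIONS)
print(disambiguate("he drove his Dodge to the dealer", profiles))        # Dodge
print(disambiguate("you must dodge every ball in this game", profiles))  # Dodgeball
```

With only four example sentences the sketch is fragile, which is the point of a corpus with tens of millions of mentions: the more contexts are available for each meaning, the more reliable this kind of comparison can become.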
The Wikilinks Corpus, which is similar to data used by Google's search algorithms, was developed with help from researchers at the University of Massachusetts Amherst and is significantly larger than previous datasets. Most importantly, it's available for free. While Google is unable to distribute the actual content of individual web pages for copyright reasons, code for recreating the full set will soon be available on the university's website.