In 2010, the Library of Congress announced plans to collect every public Twitter post in a single searchable archive, as part of a bold attempt to create a new repository of digital information. Two years later, however, the project has yet to get off the ground, primarily because the Library hasn't come up with an efficient way to harness such a massive amount of data.
On Friday, the LOC published a white paper explaining the delay, which it attributes to a lack of available software and constrained budgets. The organization has already created a private archive, but it remains virtually unsearchable. According to the Library, a single query on its current system "could take 24 hours" to yield results. Fixing this problem, it says, "would require an extensive infrastructure of hundreds if not thousands of servers," which would be well beyond the Library's current budget.
"What we have here is a large and growing lake."
Deputy Librarian of Congress Robert Dizard Jr. tells the Washington Post that the LOC has thus far invested "tens of thousands" of dollars in the project, but recent budget cuts have tightened its purse strings, making it difficult to spend money on the kind of massive computing overhaul the project would demand. Colorado-based data company Gnip is in charge of creating the archive, and has so far collected more than 133 terabytes of Twitter data. The fundamental problem is that the Library hasn't found a way to make any sense of this information.
"You often hear a reference to Twitter as a fire hose, that constant stream of tweets going around the world," Dizard said. "What we have here is a large and growing lake. What we need is the technology that allows us to both understand and make useful that lake of information."
Complicating matters further, Twitter's terms of agreement may make it difficult for the Library to make its archive fully accessible. The agreement, which hadn't been made public until today, prohibits the Library from providing "a substantial portion of the collection on its web site in a form that can be easily downloaded." This suggests that the social network may have been wary of fully committing to the project from the very beginning, perhaps because it already had plans to launch a similar service of its own.