Hundreds of years from now, today’s DVDs, web servers, and flash drives will all be long dead. But one copy of a music video — for alternative rock band OK Go’s song "This Too Shall Pass" — could still be playing. The Rube Goldberg-inspired video is part of a 202-megabyte cache of data that Microsoft and the University of Washington say they’ve written to DNA storage — the largest known DNA storage trove created to date.
This DNA data storage project is a research partnership between Microsoft and researchers at the University of Washington’s computer science and engineering department, with help from startup Twist Bioscience. Its goal is to advance the technology that could one day make synthetic strands of DNA a viable alternative to conventional hard drives, optical disks, and other storage methods. DNA storage could offer a couple of major advantages over anything we have today. It can theoretically hold huge amounts of data at incredible density, and kept in cool, dry, and dark conditions, it could maintain its integrity for hundreds or even thousands of years.
Previous projects encoded a full book and Martin Luther King, Jr.'s "I Have a Dream"
The idea has been in the proof-of-concept stage for years, and so far, the information stored has been modest. In 2012, Harvard Medical School researchers stored a digital book in DNA. In 2013, the European Bioinformatics Institute copied 739 kilobytes of sound, images, and text, including a 26-second audio clip of Martin Luther King, Jr.’s "I Have a Dream" speech. More recently, Harvard Medical School and a Technicolor research group reported storing and retrieving 22 megabytes that included the French silent film A Trip to the Moon. This new work builds on those efforts and possibly marks another move toward a technology that could have real-world applications.
To write data to DNA, researchers translate the binary code of a file into the nucleotide molecules that form DNA’s building blocks, assigning different bases to represent patterns of ones and zeroes. In this case, Twist Bioscience created custom strings of DNA based on the resulting patterns, encoding the OK Go video, copies of the Universal Declaration of Human Rights in different languages, the top 100 books from Project Gutenberg, and the Crop Trust seed database. Microsoft principal researcher Karin Strauss says the team picked OK Go because "they’re very innovative and are bringing different things from different areas into their field, and we feel we are doing something very similar."
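To make the translation step concrete, here is a minimal sketch in Python. The two-bits-per-base table and the function name `encode_bytes_to_dna` are assumptions made for illustration. The article does not describe the project's actual encoding scheme, which is likely more elaborate (for example, to cope with synthesis and sequencing errors).

```python
# Illustrative mapping only: two bits per base. This table is an assumption,
# not the scheme used by Microsoft and the University of Washington.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def encode_bytes_to_dna(data: bytes) -> str:
    """Translate each byte into four bases, most significant bits first."""
    # Write every byte as an 8-character string of ones and zeroes.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Replace each pair of bits with its assigned base.
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


if __name__ == "__main__":
    print(encode_bytes_to_dna(b"OK"))  # CATTCAGT
```

Under this mapping every byte becomes four bases, so the two-byte string `OK` becomes an eight-base sequence.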
DNA storage remains expensive and slow compared to current methods
To read the files back, a team would use the same DNA sequencing process that scientists use to decode the genomes of plants or animals, then translate the results back into binary code. Besides simply storing more material, University of Washington principal researcher Luis Henrique Ceze says this particular experiment also showed that they could still sort through the much larger batch of DNA and find specific sequences, something that’s vital for retrieving files in a real-world storage system.
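The read path can be sketched the same way. The example below assumes a hypothetical two-bits-per-base mapping and a hypothetical four-base address tag at the start of each strand. Plain string matching stands in for whatever selection method the researchers actually use to find specific sequences, which the article does not detail.

```python
# Illustrative mapping only; the project's real scheme is not given in the article.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}


def decode_dna_to_bytes(strand: str) -> bytes:
    """Translate a base sequence back into bytes (assumes four bases per byte)."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def find_payloads(pool: list[str], tag: str) -> list[str]:
    """Return the payloads of strands that begin with the given address tag."""
    return [strand[len(tag):] for strand in pool if strand.startswith(tag)]


if __name__ == "__main__":
    # A toy "pool" of two strands, each a hypothetical tag plus a payload.
    pool = ["ACGT" + "CATTCAGT", "TGCA" + "CAGACAGC"]
    for payload in find_payloads(pool, "ACGT"):
        print(decode_dna_to_bytes(payload))  # b'OK'
```

The tag lookup is the part that matters for a real storage system: without some way to pick out one file's strands, every read would require sequencing the entire pool.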
Granted, 200MB is still a fraction of the storage space researchers are promising DNA could one day give us. A report last year said that an exabyte, or roughly a million terabytes, of data could fit in a DNA cluster the size of a grain of sand. But so far, nobody’s actually done it. And there are still major barriers to swapping DNA in for your trusty hard drive. Encoding data in DNA is prohibitively expensive, it can’t be easily rewritten, and the process of reading it back is slow. One unnamed scientist went so far as to call the 2012 book storage test a "vanity project," saying it was "like showing you could painstakingly use an abacus to solve a Hamiltonian path problem that would take the average computer a microsecond."
Strauss and Ceze acknowledge that the cost is high, but say it’s coming down rapidly. "We don't see any fundamental reason why it can't be much, much cheaper and much, much faster," says Ceze. "I think it's fair to say we could see something concrete within the decade that's going to affect people's lives, and the usage of computers." But he admits that this is speculative, and highly dependent on whether there’s enough incentive to actually use the new method.
Ideally, DNA promises a new way to preserve important material that’s rarely accessed or modified, forestalling a looming data storage crisis. That material includes medical records and important cultural artifacts in places like the Library of Congress — sometimes stored today on fragile formats like CDs or DVDs, which can decay within decades. Or it could be used to preserve the vast amount of data that’s in danger of being lost online — as Strauss puts it, "we could store the entire accessible internet in a shoebox." Ceze suggests that DNA could even be used in general computing, not just storage. "What's really exciting here is that all this progress being made in DNA storage, I think is showing a very concrete example of using nature to build better computer systems," he says. "And showing it's becoming more and more real is exciting to me."
Update July 7th, 4:15PM ET: Updated title for the project, previously known as Project Palix.