This April, police solved a decades-old mystery — the identity of the Golden State Killer — with a previously unused DNA technique. Searching for a sample match in existing databases turned up nothing, but a search through a public DNA database located 10 to 15 possible distant relatives, which let police narrow down a suspect list and ultimately gave them the lead they needed.
The technique was new at the time, and since that high-profile success it has proved to be one of the most powerful new tools in forensics. In the months since, groups like Parabon NanoLabs and the DNA Doe Project have identified at least 19 different cold case samples through this method, called familial DNA testing of public databases, providing crucial new leads for previously unsolvable cases.
Now, a pair of new discoveries could make that technique even more powerful. A paper published today in the journal Science finds that the same technique could reach much further than contemporary labs realize, covering nearly the entire population from a relatively small base of samples. At the same time, researchers publishing in Cell have devised a way to extrapolate from incomplete samples, building out a broader picture of the genome than was originally tested. Taken together, those techniques would allow researchers to identify nearly anyone using only existing samples, a frighteningly powerful new tool for DNA forensics.
Familial DNA testing is a break from conventional DNA testing, which looks for positive matches, like matching the DNA from a bloody glove to the DNA from a specific suspect. Crucially, a match can only be made if the suspect’s DNA can be collected, which makes it impractical for most cold cases. But familial DNA searches look for partial matches, which could indicate that the sample comes from a sibling or a parent rather than the same person. That’s not enough to conclusively identify a person on its own, but it can give police a crucial lead that points to further testing down the road.
To find those partial matches, labs have drawn heavily on public DNA databases like GEDMatch and DNALand. Those searches don’t require court approval because the data is already public, but they’re more limited in scope. The largest database, GEDMatch, contains just under a million genetic profiles, which restricts how many people a search can reach. The FBI’s National DNA Index, in contrast, contains more than 17 million profiles, but can only be accessed under specific legal circumstances. Consumer DNA services like 23andMe and MyHeritage also contain significantly more samples, but their policies typically rule out law enforcement searches of this kind.
The result is a new scramble for data, and new uncertainty about how far the public data could reach. “The big limitation is coverage,” says Yaniv Erlich, a computer science professor at Columbia University and chief science officer of MyHeritage. “And even if you find an individual, it requires complex analysis from that point.”
A database covering 2 percent of the population could match nearly anyone
Now, Erlich has joined with other researchers from Columbia and Hebrew University to examine exactly how far that coverage could reach. For the Science paper, the team looked at a data set of 1.28 million individuals (largely drawn from the MyHeritage database) and produced a statistical analysis of how likely it is that a given person can be matched to a relative whose DNA is in the database. The researchers found that more than 60 percent of searches would result in a third cousin or closer match (the same proximity used for the Golden State Killer suspect), giving a reasonable chance of identifying the target. As a result, researchers estimate a database would only need to cover 2 percent of a target population to provide a third-cousin-or-better match to nearly any person. “With the exponential growth of consumer genomics,” the researchers write, “we posit that such database scale is foreseeable for some third-party websites in the near future.”
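To see why such a small database can reach so far, consider a rough back-of-the-envelope sketch. This is not the model used in the Science paper. It assumes every relative is equally and independently likely to be in the database, and the relative counts below are hypothetical values chosen only for illustration. It also ignores the fact that distant cousins do not always share enough DNA to be detected, so it overstates the real odds.

```python
def match_probability(coverage: float, relatives: int) -> float:
    """Chance that at least one of `relatives` appears in a database
    holding a `coverage` fraction of the population.

    Simplifying assumption (not from the paper): each relative is in
    the database independently with probability `coverage`.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    if relatives < 0:
        raise ValueError("relatives must be non-negative")
    # Probability that none of the relatives is present, then the complement.
    return 1.0 - (1.0 - coverage) ** relatives


if __name__ == "__main__":
    # Hypothetical counts of third-cousin-or-closer relatives per person.
    for relatives in (50, 200, 800):
        p = match_probability(0.02, relatives)
        print(f"{relatives} relatives at 2% coverage: {p:.1%} chance of a match")
```

The point of the sketch is that the odds depend on the number of relatives a person has, not on whether that person ever submitted a sample. Because the pool of third cousins is large, even a small coverage fraction adds up quickly.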
Notably, that prediction is based on a homogeneous population, but most collections of genetic data show significant racial disparities. The most significant one is in law enforcement databases, which are drawn from arrestee or convict populations and skew toward black and Latino populations as a result. Consumer and public databases exhibit the opposite bias, skewing toward Caucasians, who are consequently more likely to be identified with a familial search, Erlich says.
At the same time, another group of scientists is expanding the reach of those techniques even further. Consumer genetic tests examine different portions of the genome than law enforcement tests, which has led to an ongoing comparison problem when a full sample cannot be obtained. But a group of researchers at Stanford University, University of California at Davis, and the University of Michigan has developed a method for comparing results even when portions of the genome don’t overlap, drawing on known correlations between different portions of genetic code. The method isn’t fully developed, but it could give forensic analysts much more flexibility in the type of data they can use.
According to UC Davis’ Michael Edge, who worked on the Cell paper, the new research “suggests a framework that law enforcement could use to start thinking about backward compatibility of existing STR databases with SNP data, but more work would be necessary to see how practical it would be.”