From the early days of the COVID-19 pandemic, epidemiologist Melissa Haendel knew that the United States was going to have a data problem. There didn’t seem to be a national strategy to control the virus, and cases were springing up in sporadic hotspots around the country. With such a patchwork response, nationwide information about the people who got sick would probably be hard to come by.
Other researchers around the country were pinpointing similar problems. In Seattle, Adam Wilcox, the chief analytics officer at UW Medicine, was reaching out to colleagues. The city was the first US COVID-19 hotspot. “We had 10 times the data, in terms of just raw testing, than other areas,” he says. He wanted to share that data with other hospitals, so they would have that information on hand before COVID-19 cases started to climb in their area. Everyone wanted to get as much data as possible in the hands of as many people as possible, so they could start to understand the virus.
Haendel was in a good position to help make that happen. She’s the chair of the National Center for Data to Health (CD2H), a National Institutes of Health program that works to improve collaboration and data sharing within the medical research community. So one week in March, just after she’d started working from home and pulled her 10th grader out of school, she started trying to figure out how to use existing data-sharing projects to help fight this new disease.
The solution Haendel and CD2H landed on sounds simple: a centralized, anonymized database of health records from people who tested positive for COVID-19. Researchers could use the data to figure out why some people get very sick and others don’t, how conditions like cancer and asthma interact with the disease, and which treatments end up being effective.
But in the United States, building that type of resource isn’t easy. “The US healthcare system is very fragmented,” Haendel says. “And because we have no centralized healthcare, that makes it also the case that we have no centralized healthcare data.” Hospitals, citing privacy concerns, don’t like to give out their patients’ health data. And even when hospitals agree to share, they all store information differently. At one institution, the classification “female” might go into a record as a 1 and “male” as a 2 — and at the next, the codes would be reversed.
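The mismatch Haendel describes can be sketched with a toy example. The site names, field, and codes below are invented for illustration; real harmonization spans thousands of fields, not one:

```python
# Hypothetical example: two sites encode the same "sex" field with
# opposite numeric codes. Harmonization translates each site's local
# codes into one shared vocabulary before records are combined.

SITE_CODE_MAPS = {
    "site_a": {1: "female", 2: "male"},
    "site_b": {1: "male", 2: "female"},  # reversed relative to site_a
}

def harmonize_sex(site: str, code: int) -> str:
    """Translate a site-specific numeric code into a standard label."""
    return SITE_CODE_MAPS[site][code]

# The same raw value means different things depending on its source.
print(harmonize_sex("site_a", 1))  # prints "female"
print(harmonize_sex("site_b", 1))  # prints "male"
```

Without a mapping like this, pooling raw records from the two sites would silently scramble the combined dataset.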
Emergencies, though, have a way of busting through norms. “Nothing like a pandemic to bring out the best in an institution,” Haendel says. And after only a few months of breakneck work from CD2H and collaborators around the country, the National COVID Cohort Collaborative Data Enclave, or N3C, opened to researchers at the start of September. Now that it’s in place, it could help bolster pandemic responses in the future. It’s unlike anything that’s come before it in size and scope, Haendel says. “No other resource has ever tried to do this before.”
Patient health records are fairly accessible to scientists — under health privacy laws, the records can be used for research as long as identifying information (like names and locations) is removed. The catch is that researchers are usually limited to records of patients at the places where they work. The dataset can only include as many patients as that institution treats, and it’s geographically restricted. Researchers can’t be sure that patient data in New York City would be equivalent to patient data in Alabama. Using information from multiple places would help make sure the results were as representative as possible.
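The removal step itself is conceptually simple. This sketch uses invented field names, and it is far from complete: real de-identification under HIPAA’s Safe Harbor rule covers eighteen categories of identifiers, not just the few shown here:

```python
# Hypothetical sketch: drop direct identifiers (names, locations) from a
# record before it leaves the institution. Field names are invented.

IDENTIFYING_FIELDS = {"name", "address", "city", "zip_code"}

def deidentify(record: dict) -> dict:
    """Return a copy of the record with identifying fields removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

patient = {
    "name": "Jane Doe",
    "city": "Rochester",
    "covid_test": "positive",
    "age": 34,
}
print(deidentify(patient))  # {'covid_test': 'positive', 'age': 34}
```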
But it can be risky for institutions to share and combine their data, Wilcox says. Moving data outside of the control of an organization risks a data breach, which could lead to patient mistrust, open the institution up to legal issues, or create other competitive disadvantages, he says. They need to balance all those concerns against the potential benefits. “The organization needs to approve it. Is this a good idea? Do we want to participate in it?” Wilcox says.
Institutions often answer those questions with a “no.” They want to maintain ownership and control over their own data, says Anita Walden, assistant director at CD2H. The pandemic changed that culture. People who may typically be reluctant to participate in programs like this one were suddenly all-in, she says. “Because of COVID-19, people just want to do what they can.”
Getting institutions to send in their data was only the first step. Next, experts had to transform that data into something useful. Medical institutions all collect and record health information in slightly different ways, and there haven’t been incentives for them to standardize their methods. Many institutions spent hundreds of millions of dollars to set up their electronic medical records — they don’t want to change things unless they absolutely have to.
“It’s like turning the Titanic at this point,” says Emily Pfaff, who leads the team at N3C merging different institutions’ data. The companies that make the software for electronic health records, like Epic, also don’t make their strategies for storing data available to outside researchers. “If you want to practice open science with clinical data, which I think many of us do, you’re not going to be able to do that with the data formatted in the way that the electronic health record does it,” she says. “You have to transform that data.”
Countries like the United Kingdom, which have centralized healthcare systems, don’t have to deal with the same problems: data from every patient in the country’s National Health Service is already in one place. In May, researchers published a study that analyzed records from over 17 million people to find risk factors for death from COVID-19.
But in the US, for N3C, it’s not as simple. Instead of a COVID-19 patient’s data heading directly into a national database, the new process is far more involved. Let’s say a pregnant woman goes to her doctor with symptoms of what she thinks could be COVID-19. She gets tested, and the test comes back positive. That result shows up in her health record. If her health care provider is participating in the N3C database, that record gets flagged. “Then her health record has a chance to get caught by our net, because what our net is looking for, among other things, is a positive COVID test,” Pfaff says.
Her data then travels into a database, where a program (which had to be created from scratch) transforms information about the patient’s treatments and preexisting conditions into a standardized format. Then, it’ll get pushed into the N3C data enclave, undergo a quality check, and then — without her name or the name of the institution the record came from — be available for researchers.
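The path Pfaff describes, from flagging through standardization, quality checks, and de-identification, can be sketched as a small pipeline. Everything here (field names, the checks, the normalization) is invented for illustration; the real transforms had to be built from scratch and are far more involved:

```python
# Hypothetical pipeline sketch. Field names, checks, and normalization
# are invented; the actual N3C transforms are far more complex.

IDENTIFYING_FIELDS = {"name", "institution"}

def caught_by_net(record):
    """The 'net': flag records with a positive COVID-19 test."""
    return record.get("covid_test") == "positive"

def standardize(record):
    """Normalize site-specific formatting into one shared shape."""
    out = dict(record)
    out["sex"] = out["sex"].strip().lower()
    return out

def passes_quality_check(record):
    """Reject records that are missing required fields."""
    return all(record.get(field) for field in ("sex", "covid_test"))

def to_enclave(record):
    """Flag, standardize, de-identify, and quality-check one record."""
    if not caught_by_net(record):
        return None  # the net only catches positive tests
    clean = standardize(record)
    clean = {k: v for k, v in clean.items() if k not in IDENTIFYING_FIELDS}
    return clean if passes_quality_check(clean) else None

record = {
    "name": "Jane Doe",
    "institution": "Hospital A",
    "sex": " Female ",
    "covid_test": "positive",
}
print(to_enclave(record))  # {'sex': 'female', 'covid_test': 'positive'}
```

Records with negative tests fall through the net and never enter the enclave, while the identifying fields are stripped before anything becomes visible to researchers.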
Nearly 70 institutions have started the process of contributing data to the enclave. Data from 20 sites has passed through the full process and is accessible to researchers. At the end of September, the database held around 65,000 COVID-19 cases, Pfaff says, and around 650,000 non-COVID-19 cases (which can be used as controls). There’s no specific numerical goal, she says. “We would take as many as possible.”
Using the data
As some experts were working to get medical institutions on board with the project and others were figuring out how to harmonize a crush of data, still others were organizing to figure out what, exactly, they wanted to do with the resulting information. They sorted into a handful of working groups, each focused on a different area: there’s one focused on the intersection of diabetes and COVID-19, for example, and another on kidney injuries.
Elaine Hill, a health economist at the University of Rochester, is heading up a group focused on pregnancy and COVID-19. The first thing they’re hoping to do, she says, is figure out just how many people had the virus when they gave birth — only a few hospitals have published that data so far. “Then, we’re interested in understanding how COVID-19 infection affects pregnancy-related outcomes for both mother and baby,” she says. Thanks to the database, they’ll be able to do that with nationwide information, not just data from patients in a handful of places.
That wide view of the problem is one key benefit of a large, national database. Different places across the US had different COVID-19 prevention policies, different regulations around lockdowns, and have different demographics. Combining them gives a more complete picture of how the virus hit the country. “It makes it possible to shed light on things we wouldn’t be able to with just my Rochester cohort,” Hill says.
Some symptoms or complications from COVID-19 are also rare, and one hospital might only see one or two total patients who have them. “When you’re gathering data across the nation, you have a bigger population, and can look at trends in those rarer conditions,” Walden says. Larger datasets can make it possible for analysts to use more complicated machine learning techniques, as well.
If all goes well with N3C, the project could offer a blueprint for better data sharing in the future. More than that, it can offer a concrete tool to future projects — the code needed to clean, transform, and merge data from multiple hospitals now exists. “I almost feel like it’s building pandemic-ready infrastructure for the future,” Pfaff says. And now that research institutions have shared data once — even though it’s under unique circumstances — they may be more willing to do it again in the future.
“Five years from now, the greatest value of this data set won’t be the data,” Wilcox says. “It’ll have been the methods that we learned trying to get it working.”