There are tens of thousands of genes in the human genome: minuscule twists of DNA and RNA that combine to express all of the traits and characteristics that make each of us unique. Each gene is given a name and alphanumeric code, known as a symbol, which scientists use to coordinate research. But over the past year or so, some 27 human genes have been renamed, all because Microsoft Excel kept misreading their symbols as dates.
The problem isn’t as unexpected as it first sounds. Excel is a behemoth in the spreadsheet world and is regularly used by scientists to track their work and even conduct clinical trials. But its default settings were designed with more mundane applications in mind, so when a user inputs a gene’s alphanumeric symbol into a spreadsheet, like MARCH1 — short for “Membrane Associated Ring-CH-Type Finger 1” — Excel converts that into a date: 1-Mar.
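For readers who want to check their own gene lists, the kind of symbol at risk can be approximated in a few lines of Python. This is a rough sketch, not Microsoft's actual parsing logic: the month list, the regular expression, and the function name looks_like_excel_date are all assumptions made for illustration.

```python
import re

# Assumption: Excel treats an English month name or abbreviation followed by
# a day number as a date. This approximates that behavior; it is not Excel's code.
_MONTHS = (
    "JAN|JANUARY|FEB|FEBRUARY|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|"
    "JUL|JULY|AUG|AUGUST|SEP|SEPT|SEPTEMBER|OCT|OCTOBER|NOV|NOVEMBER|"
    "DEC|DECEMBER"
)
_DATE_LIKE = re.compile(rf"^(?:{_MONTHS})-?(\d{{1,2}})$", re.IGNORECASE)


def looks_like_excel_date(symbol: str) -> bool:
    """Return True if a gene symbol resembles a month plus a day (e.g. MARCH1)."""
    match = _DATE_LIKE.match(symbol.strip())
    # Only day numbers 1 through 31 could plausibly be read as a date.
    return bool(match) and 1 <= int(match.group(1)) <= 31


at_risk = [s for s in ["MARCH1", "SEPT1", "MARCHF1", "SEPTIN1"]
           if looks_like_excel_date(s)]
```

Under these assumptions, the old symbols MARCH1 and SEPT1 are flagged, while their replacements MARCHF1 and SEPTIN1 are not.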
This is extremely frustrating, even dangerous, corrupting data that scientists have to sort through by hand to restore. It’s also surprisingly widespread and affects even peer-reviewed scientific work. One study from 2016 examined genetic data shared alongside 3,597 published papers and found that roughly one-fifth had been affected by Excel errors.
“It’s really, really annoying,” Dezső Módos, a systems biologist at the Quadram Institute in the UK, told The Verge. Módos, whose job involves analyzing freshly sequenced genetic data, says Excel errors happen all the time, simply because the software is often the first thing to hand when scientists process numerical data. “It’s a widespread tool and if you are a bit computationally illiterate you will use it,” he says. “During my PhD studies I did as well!”
There’s no easy fix, either. Excel doesn’t offer the option to turn off this auto-formatting, and the only way to avoid it is to change the data type for individual columns. Even then, a scientist might fix their data but export it as a CSV file without saving the formatting. Or, another scientist might load the data without the correct formatting, changing gene symbols back into dates. The end result is that while knowledgeable Excel users can avoid this problem, it’s easy for mistakes to be introduced.
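The CSV problem comes down to the fact that a CSV file is plain text with no record of column types, so whichever program opens it decides how to interpret each value. As a minimal sketch of the alternative, Python's standard csv module returns every field as a string by default and leaves the symbols alone. The sample symbols and the in-memory buffer here are illustrative stand-ins for a real file.

```python
import csv
import io

# Illustrative symbols; MARCH1 and SEPT1 are the date-like ones from the article.
symbols = ["MARCH1", "SEPT1", "TP53"]

# Write the symbols to an in-memory CSV "file" (a stand-in for a file on disk).
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["symbol"])
for symbol in symbols:
    writer.writerow([symbol])

# Read the data back. The csv module does no type guessing by default,
# so every field comes back as the exact text that was written.
buffer.seek(0)
restored = [row["symbol"] for row in csv.DictReader(buffer)]
```

Here restored holds the same strings that were written, which is the behavior a gene list needs from whatever tool loads it.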
Help has arrived, though, in the form of the scientific body in charge of standardizing the names of genes, the HUGO Gene Nomenclature Committee, or HGNC. This week, the HGNC published new guidelines for gene naming, including for “symbols that affect data handling and retrieval.” From now on, they say, human genes and the proteins they express will be named with one eye on Excel’s auto-formatting. That means the symbol MARCH1 has now become MARCHF1, while SEPT1 has become SEPTIN1, and so on. A record of old symbols and names will be stored by HGNC to avoid confusion in the future.
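For anyone maintaining an older dataset, the renaming amounts to a lookup from old symbol to new. The sketch below contains only the two pairs named in this article; the dictionary RENAMED and the function update_symbol are hypothetical names, and a real update would need to draw on HGNC's own record of previous symbols.

```python
# Only the two renames mentioned in the article; this is not the full list.
RENAMED = {
    "MARCH1": "MARCHF1",
    "SEPT1": "SEPTIN1",
}


def update_symbol(symbol: str) -> str:
    """Return the new symbol if the old one was renamed, else leave it as is."""
    return RENAMED.get(symbol, symbol)


updated = [update_symbol(s) for s in ["MARCH1", "SEPT1", "HECA"]]
```

Symbols that are not in the table, such as HECA, pass through unchanged.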
So far, the names of some 27 genes have been changed like this over the past year, Elspeth Bruford, the coordinator of HGNC, tells The Verge, but the guidelines themselves weren’t formally announced until this week. “We consulted the respective research communities to discuss the proposed updates, and we also notified researchers who had published on these genes specifically when the changes were being put into effect,” says Bruford.
As Bruford makes clear, the art of naming genes is very much driven by consensus. Like the lexicographers charged with updating dictionaries, the Gene Nomenclature Committee has to be sensitive to the needs of those individuals who will be most affected by their work.
This wasn’t always the case, mind. In the early, frontier days of genetics, gene naming was often a playground for creative scientists, leading to notorious genes like “sonic hedgehog” (yes, named for that Sonic) and “Indy” (short for “I’m not dead yet”; a reference to the gene’s function, which can double the life span of fruit flies when mutated).
Now, though, the HGNC has taken matters firmly in hand, and current guidelines don’t cede much ground to whimsy or ego. The focus is on practical concerns: how do we minimize confusion? For that reason, gene symbols should be unique, and gene names should be brief and specific, says the committee. Symbols cannot use subscript or superscript; can only contain Latin letters and Arabic numerals; and should not spell out names or words, particularly offensive ones (a rule that should hold true “ideally in any language”).
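Two of these rules, the restriction to Latin letters and Arabic numerals and the requirement that symbols be unique, are mechanical enough to check in code. The following is a sketch of just those two rules as described here, not the committee's full guidelines; the function names has_allowed_characters and find_duplicates are made up for illustration.

```python
import re
from collections import Counter

# Latin letters and Arabic numerals only, per the rule as described in the article.
# Superscript and subscript digits fall outside this set and are rejected.
_ALLOWED = re.compile(r"[A-Za-z0-9]+")


def has_allowed_characters(symbol: str) -> bool:
    """Check that a symbol contains only Latin letters and Arabic numerals."""
    return _ALLOWED.fullmatch(symbol) is not None


def find_duplicates(symbols: list[str]) -> list[str]:
    """Return the symbols that appear more than once, in first-seen order."""
    counts = Counter(symbols)
    return [symbol for symbol, count in counts.items() if count > 1]
```

The rule against spelling out words or names is a judgment call that a pattern like this cannot capture.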
And while the decision to rename genes is not taken lightly, it’s not unusual, says Bruford. Many gene symbols that can be read as nouns have been renamed to avoid false positives during searches, for example. In the past, CARS has become CARS1, WARS changed to WARS1, and MARS tweaked to MARS1. Other changes have been made to avoid insult.
“We always have to imagine a clinician having to explain to a parent that their child has a mutation in a particular gene,” says Bruford. “For example, HECA used to have the gene name ‘headcase homolog (Drosophila),’ named after the equivalent gene in fruit fly, but we changed it to ‘hdc homolog, cell cycle regulator’ to avoid potential offense.”
But Bruford says this is the first time that the guidelines have been rewritten specifically to counter the problems caused by software. So far, the reactions seem to be extremely positive — some would even say joyous.
After geneticist Janna Hutz shared the relevant section of HGNC’s new guidelines on Twitter, the response from the community was jubilant. “THRILLED by this announcement by the Human Gene Nomenclature Committee,” tweeted Hutz herself. “Finally!!!” responded Mudra Hegde, a computational biologist at the Broad Institute in Massachusetts. “Greatest news of the day!” said a pseudonymous Twitter user.
Bruford notes that there has been some dissent about the decision, but it mostly seems to be focused on a single question: why was it easier to rename human genes than it was to change how Excel works? Why, exactly, in a fight between Microsoft and the entire genetics community, was it the scientists who had to back down?
Microsoft did not respond to a request for comment, but Bruford’s theory is that it’s simply not worth the trouble to change. “This is quite a limited use case of the Excel software,” she says. “There is very little incentive for Microsoft to make a significant change to features that are used extremely widely by the rest of the massive community of Excel users.”
Bruford doesn’t seem bitter about the situation, though. After all, she says, it wouldn’t do to wait on a hypothetical Excel update to fix these problems when a long-term solution can be introduced by scientists themselves. Microsoft Excel may be fleeting, but human genes will be around for as long as we are. It’s best to give them names that work.
Correction: The story has been corrected to clarify that Excel users can save spreadsheets that retain their formatting, avoiding the mistake where gene symbols are changed into dates. We regret the error.