Scientists rename human genes to stop Microsoft Excel from misreading them as dates

Illustration by Alex Castro / The Verge

There are tens of thousands of genes in the human genome: minuscule twists of DNA and RNA that combine to express all of the traits and characteristics that make each of us unique. Each gene is given a name and alphanumeric code, known as a symbol, which scientists use to coordinate research. But over the past year or so, some 27 human genes have been renamed, all because Microsoft Excel kept misreading their symbols as dates.

The problem isn’t as unexpected as it first sounds. Excel is a behemoth in the spreadsheet world and is regularly used by scientists to track their work and even conduct clinical trials. But its default settings were designed with more mundane applications in mind, so when a user inputs a gene’s alphanumeric symbol into a spreadsheet, like MARCH1 — short for “Membrane Associated Ring-CH-Type Finger 1” — Excel converts that into a date: 1-Mar.

This is extremely frustrating, even dangerous, corrupting data that scientists have to sort through by hand to restore. It’s also surprisingly widespread and affects even peer-reviewed scientific work. One study from 2016 examined genetic data shared alongside 3,597 published papers and found that roughly one-fifth had been affected by Excel errors.

“It’s really, really annoying,” Dezső Módos, a systems biologist at the Quadram Institute in the UK, told The Verge. Módos, whose job involves analyzing freshly sequenced genetic data, says Excel errors happen all the time, simply because the software is often the first thing to hand when scientists process numerical data. “It’s a widespread tool and if you are a bit computationally illiterate you will use it,” he says. “During my PhD studies I did as well!”

Examples of gene symbols being rendered as dates in Microsoft Excel.
GIF: The Verge

There’s no easy fix, either. Excel doesn’t offer the option to turn off this auto-formatting, and the only way to avoid it is to change the data type for individual columns. Even then, a scientist might fix their data but export it as a CSV file without saving the formatting. Or, another scientist might load the data without the correct formatting, changing gene symbols back into dates. The end result is that while knowledgeable Excel users can avoid this problem, it’s easy for mistakes to be introduced.

Help has arrived, though, in the form of the scientific body in charge of standardizing the names of genes, the HUGO Gene Nomenclature Committee, or HGNC. This week, the HGNC published new guidelines for gene naming, including for “symbols that affect data handling and retrieval.” From now on, they say, human genes and the proteins they expressed will be named with one eye on Excel’s auto-formatting. That means the symbol MARCH1 has now become MARCHF1, while SEPT1 has become SEPTIN1, and so on. A record of old symbols and names will be stored by HGNC to avoid confusion in the future.

So far, the names of some 27 genes have been changed like this over the past year, Elspeth Bruford, the coordinator of HGNC, tells The Verge, but the guidelines themselves weren’t formally announced until this week. “We consulted the respective research communities to discuss the proposed updates, and we also notified researchers who had published on these genes specifically when the changes were being put into effect,” says Bruford.

As Bruford makes clear, the art of naming genes is very much driven by consensus. Like the lexicographers charged with updating dictionaries, the Gene Nomenclature Committee has to be sensitive to the needs of those individuals who will be most affected by their work.

This wasn’t always the case, mind. In the early, frontier days of genetics, gene naming was often a playground for creative scientists, leading to notorious genes like “sonic hedgehog” (yes, named for that Sonic) and “Indy” (short for “I’m not dead yet”; a reference to the gene’s function, which can double the life span of fruit flies when mutated).

Now, though, the HGNC has taken matters firmly in hand, and current guidelines don’t cede much ground to whimsy or ego. The focus is on practical concerns: how do we minimize confusion? For that reason, gene symbols should be unique, and gene names should be brief and specific, says the committee. They cannot use subscript or superscript; can only contain Latin letters and Arabic numerals; and should not spell out names or words, particularly offensive ones (a rule that should hold true “ideally in any language”).

And while the decision to rename genes is not taken lightly, it’s not unusual, says Bruford. Many gene symbols that can be read as nouns have been renamed to avoid false positives during searches, for example. In the past, CARS has become CARS1, WARS changed to WARS1, and MARS tweaked to MARS1. Other changes have been made to avoid insult.

“We always have to imagine a clinician having to explain to a parent that their child has a mutation in a particular gene,” says Bruford. “For example, HECA used to have the gene name ‘headcase homolog (Drosophila),’ named after the equivalent gene in fruit fly, but we changed it to ‘hdc homolog, cell cycle regulator’ to avoid potential offense.”

But Bruford says this is the first time that the guidelines have been rewritten specifically to counter the problems caused by software. So far, the reactions seem to be extremely positive — some would even say joyous.

After geneticist Janna Hutz shared the relevant section of HGNC’s new guidelines on Twitter, the response from the community was jubilant. “THRILLED by this announcement by the Human Gene Nomenclature Committee,” tweeted Hutz herself. “Finally!!!” responded Mudra Hegde, a computational biologist at the Broad Institute in Massachusetts. “Greatest news of the day!” said a pseudonymous Twitter user.

Bruford notes that there has been some dissent about the decision, but it mostly seems to be focused on a single question: why was it easier to rename human genes than it was to change how Excel works? Why, exactly, in a fight between Microsoft and the entire genetics community, was it the scientists who had to back down?

Microsoft did not respond to a request for comment, but Bruford’s theory is that it’s simply not worth the trouble to change. “This is quite a limited use case of the Excel software,” she says. “There is very little incentive for Microsoft to make a significant change to features that are used extremely widely by the rest of the massive community of Excel users.”

Bruford doesn’t seem bitter about the situation, though. After all, she says, it wouldn’t do to wait on a hypothetical Excel update to fix these problems when a long-term solution can be introduced by scientists themselves. Microsoft Excel may be fleeting, but human genes will be around for as long as we are. It’s best to give them names that work.

Correction: The story has been corrected to clarify that Excel users can save spreadsheets that retain their formatting, avoiding the mistake where gene symbols are changed into dates. We regret the error.

Comments

To be honest, things would be better if people learned Excel’s conventions rather than fight them. Being in a multilingual environment, people ignoring Excel features and inventing their own formatting is a nightmare. Dates and numbers and currency… Ugh. If you tell Excel what it is as opposed to setting everything to text and making your own, things work so much better when you share the spreadsheet to a colleague using Excel in a different language.

I will never forgive Microsoft on how Excel handles CSV files in other languages though. It’s called a COMMA separated values file Microsoft, not semi-colon separated values file!

Psh, we just need to take that advice to it’s logical conclusion.

English language update, ; is now a comma.

Office localization is the worst. Inconsistent and badly documented

Localization is not the same as international sufficiency. The latter is the translation of menu/UI text and the latter is the ability to handle international data – which BTW Office does so much better than similar products.

If using CSV files from other languages then the option to import UTF-8 CSVs, Most-likely the CSVs have not been formatted in a way that makes it apparent that it’s not using Western ANSI character.

To be honest, things would be better if Microsoft learned how to release Excel from its archaic limitations.

To be honest, Excel should be able to adapt to the way people want to use it instead of forcing people to change their use to what it wants them to do. This is a fundamental principle of interface design since Don Norman’s Psychology of Everyday Things. If it is not possible to turn this behavior off (and it isn’t), that is Microsoft’s mistake and Microsoft should fix it.

The only workarounds I know of are to format the cells as text first, or to trick Excel by entering an apostrophe or space first. Forcing the entire universe to change the way it does things in order to conform with the way the software wants you to do it is a very Apple thing to do.

Excel does let you do things the way you want. Doing whatever you want is what causes issues when you want to share the files with other people since everyone’s settings and formatting is not the same. If you are going to collaborate and share documents, using Excel’s features properly allows it to adapt to various machines, languages, and regions.
The article says you can’t turn off the auto-formatting, but I would argue changing the cell type or starting the cell with an apostrophe to stop Excel from guessing is doing exactly that. And Excel formulas ignore the starting apostrophe nor is the apostrophe shown anywhere but the formula bar.

Excel is an extremely powerful tool that is too often abused due to how easy it is to get started in.

The problem is always in how obscure, obtuse and un-intuitive anything in Excel is put together. Getting a correct answer from the Excel "Online Help" function is generally about a 1 in an 100 proposition. I can give you numerous examples where a simple question to the Excel Online Help function will result in: No answer found! No result found! As if the thing cannot speak and understand basic English.

Someone said here, in the comments, that there is a so-called "Manual" for Excel. Where, pray tell, is that manual hidden? Is it on the non-existent and mysterious Windows 10, Windows 7, Windows XP OPERATING SYSTEM CD, which is NOT EVER PROVIDED WITH ANY COMPUTER YOU BUY? Is that where it is?

Highlight row in excel, right-click/ format cells/ text. Problem solved.

That only works before you input the data. Once it’s converted, changing the cell format back to text doesn’t revert it back to what you first input.

I have this issue with csv with dates that are in a different format than my locale. When I open the csv directly, Excel parse the dates with the wrong format rendering this column unusable. The only way to have correct dates is to open a blank Excel sheet, change the format of the column the dates will be, open the csv with a text editor, copy the data from here and paste in Excel.

I guess the issue mentioned here is happening more when you share your document rather than when you input your data in it

changing the cell format back to text doesn’t revert it back to what you first input

Must be more nuanced than this. I do this all the time and it works fine.

I just tried to input MARCH1, enter, Excel changes it to a date, then changing the cell format back to text changes the value to "36951"

Right, so it’s specific to this data. When I do it, it is usually a number separated by dashes that gets converted to a date, and changing it to text or number reverts it to what I originally entered.

Perhaps that’s what you meant, but it is worth clarifying that the behavior is limited to specific types of conversion.

An easy solution would be to create a template .XLS with every cell set to text, after which the cells could be converted as needed.

In my limited testings this morning, every input that Excel automatically converts to a date (in a fresh, unmodified blank Excel file) is reverted to a number when converted back to text.
That’s why when I open my csv with dates directly in Excel, I loose this column due to bad format conversion (01/02/03 gets converted to 2003-02-01(instead of 2003-01-02) and 01/31/03 is not converted at all)

But I guess you’re right in your other comments, scientists may share xslx files rather than crude csv, so if they properly formatted first there shouldn’t be an issue

This didn’t work for me. I just entered 1-12-20 which Excel automatically formatted to 1/12/2020. When I converted to text it became 4382. This happens to me all the time.

I don’t know what to tell you. I’m on PTO and not in front of a spreadsheet right now. I’m not saying Excel is perfect but somehow I’ve been able to reverse formatting when this sort of thing happened. Maybe changing it to "general" or whatever the format is called, or maybe to number. I don’t remember exactly (clearly) but there are a lot of formatting options and I can usually find something to fix my fuck-ups.

We’re specifically talking about text auto formatted to a date. No matter how it’s displayed on your spreadsheet, Excel sees all dates as a number. This is so you can do math with dates (ie Aug 4 + 6 = Aug 10 or Aug10 – Aug 4 = 6, etc). This is why once Excel has converted any text to a date, if you try to convert it to something else it will be based on a number.

Yes, there is no UNDO with data import.

And yes, if you keep your data in a TXT or CSV file format, then it will repeat the same
mistaken data transformation over and over again.

And yes, where again did Excel hide the so-called "Data Import Wizard"?

If a scientist can’t remember the simple process of changing format before entering data then I’m seriously questioning what else they forgot to do in analysing their data.

It’s not "a scientist", it’s every scientist on the planet who works in the field of human genetics for as long as Excel is a tool of choice (and any other application with auto-date formatting), which could be decades. It’s not even that the problem will often lead to mistakes in the final analysis, it’s the extra time wasted by having to check your work for these trivial mistakes — especially if you’re working with multiple spreadsheets and forever copying data from one sheet to another.

I could not agree with you more. All databases require that you set data types before you go about analyzing your data. Excel has two grids: regular worksheets and the data model. The data model grid holds 1.5 billion rows and 2 billion columns on each table and you can have as many tables as you have memory for in your Mac or PC. Microsoft Excel can be programmed using Microsoft Visual Basic, JavaScript, Objective-C, and even AppleScript.

The way to do it is to use the actual CSV/TXT file import that Excel has provided for years (decades?) The Data panel on the ribbon has a whole section devoted to data import that allows you to set your column types. That is what you should be using. Otherwise Excel is using the heuristics of the 99.999% of users not entering genomes and interprets the data accordingly.

I don’t think you know what you’re talking about.
That doesn’t work with preexisting data and may not even work for certain number formats.
For example if excel opens a CSV file with 12-15 digit codes, it will automatically convert it to the scientific convention or round the ending digits basically corrupting your data.
Any formatting after the fact will not recover data that is already changed.

What would have been nice is if Excel had a global setting for "science" or "data analytics" to stop the annoying reformatting and keep everything as text unless told otherwise.

Actually I think it is you that is not sure you know what you are talking about.
The last screen that comes up when inputting a CSV asks you to select the format you want to use. Select Text. Problem solved.

Are you using Data Get From TXT/CSV ? in Office 365 Excel?

Because the last screen does not ask you to select a format. It’s either Import or Cancel!

Why?

PS: Office 365 is now Microsoft 365, because "re-naming" makes problems go away, as per the MSFT marketing and legal department…

View All Comments
Back to top ↑