By James Kwak
A friend brought to my attention another example of how Excel may actually be a precursor of Skynet, after the London Whale trade and the Reinhart-Rogoff controversy. This comes to us in a research note from several years ago by several bioinformatics researchers titled “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.” The problem is that various genes have names like “DEC1” or identifiers like “2310009E13.” When you important those text strings into Excel, by default, the former is converted into a date and the later is converted into scientific notation (2.310009 x 10^13). Since dates in Excel are really numbers behind the scenes (beginning with January 1, 1900), those text identifiers have been irretrievably converted into numbers.
This problem is related to what makes Excel so popular: it’s powerful, intuitive, and easy to use. In this case, it is guessing at what you really mean when you give it data in a certain format, and most of the time it’s right—which saves you the trouble of manually parsing text strings and converting them into dates (which you can do using various Excel functions, if you know how). But the price of that convenience is that it also makes it very easy to make mistakes, if you don’t know what you’re doing or you’re not extremely careful.
There are workarounds to this problem, but as of 2004, it had infected several public databases. As the authors write, “There is no way to know how many times and in how many laboratories the default date and floating point conversions to non-gene names have adversely affected an experiment or caused genes to ‘disappear’ from view.”