With Great Power . . .

By James Kwak

A friend brought to my attention another example of how Excel may actually be a precursor of Skynet, after the London Whale trade and the Reinhart-Rogoff controversy. This one comes from a 2004 research note by bioinformatics researchers titled “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.” The problem is that various genes have names like “DEC1” or identifiers like “2310009E13.” When you import those text strings into Excel, by default, the former is converted into a date and the latter is read as scientific notation and converted into a floating-point number (2.310009 × 10^19). Since dates in Excel are really just numbers behind the scenes (counting up from January 1, 1900), those text identifiers are irretrievably converted into numbers.
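
To make the failure mode concrete, here is a minimal Python sketch (not from the research note) of the same two coercions, with float() and strptime() standing in for Excel’s guesses:

    from datetime import datetime

    # A RIKEN-style identifier is valid scientific notation, so numeric
    # coercion silently destroys it:
    print(float("2310009E13"))                # 2.310009e+19

    # "DEC1" parses cleanly as a date; strptime even defaults the year
    # to 1900, the same epoch Excel counts from:
    print(datetime.strptime("DEC1", "%b%d"))  # 1900-12-01 00:00:00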

This problem is related to what makes Excel so popular: it’s powerful, intuitive, and easy to use. In this case, it is guessing at what you really mean when you give it data in a certain format, and most of the time it’s right—which saves you the trouble of manually parsing text strings and converting them into dates (which you can do using various Excel functions, if you know how). But the price of that convenience is that it also makes it very easy to make mistakes, if you don’t know what you’re doing or you’re not extremely careful.

There are workarounds to this problem, but as of 2004, it had infected several public databases. As the authors write, “There is no way to know how many times and in how many laboratories the default date and floating point conversions to non-gene names have adversely affected an experiment or caused genes to ‘disappear’ from view.”
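
The same discipline applies outside Excel. Here is a minimal pandas sketch of the defensive import (the file and column names are made up for illustration):

    import pandas as pd

    # Pin the identifier column to text, so the reader never guesses:
    genes = pd.read_csv("gene_list.csv", dtype={"gene_symbol": str})

    # Or, more bluntly, read everything as text and convert the numeric
    # columns yourself, explicitly:
    genes = pd.read_csv("gene_list.csv", dtype=str)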

15 thoughts on “With Great Power . . .”

  1. Replacing SAS with Excel in medical research was, obviously, motivated by ideology, not science. Clearest case of malpractice ever. It is long past due that clinicians start suing “informatics” for abuse of technology.

    Big data = big guvno (“guvno” is Russian slang for crap)

  2. There is even an R package designed specifically to combat this issue:

    http://cran.r-project.org/web/packages/HGNChelper/index.html

    It won’t help, though, with the identifiers that get changed to scientific notation. It is rote habit for me to inform my clients (I am a biostatistician) that they have to import text data into Excel using the import wizard, and specifically identify the gene-symbol column as text, to circumvent this issue. (A toy sketch of the repair idea is at the end of this comment.)

    I do find it hilarious that econometricians like R&R use Excel in their ‘research’. It has been well documented in statistics circles for years now that Excel isn’t reliable, does unexpected things with your data, and is the antithesis of reproducible research.

    Excel is to data analysis as PowerPoint is to presentations. Excel allows you to do really nonsensical things with the least amount of effort.
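
    Here is the repair idea in a toy Python sketch – not the HGNChelper package itself, and the two mappings below are just assumed examples (the real tables ship with the package):

        # Map known Excel date-mangles back to the intended gene symbols.
        MANGLED_TO_SYMBOL = {
            "1-Dec": "DEC1",
            "2-Sep": "SEPT2",
        }

        def repair(symbol):
            """Return the intended symbol if this looks like a date-mangle."""
            return MANGLED_TO_SYMBOL.get(symbol, symbol)

        print(repair("2-Sep"))  # SEPT2
        print(repair("TP53"))   # TP53, untouched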

  3. The need to distinguish an arithmetic value from its grammatical identity is precisely why bookkeeping is a double-entry process.

    This mistake does not just happen in spreadsheets; it happens daily in every relational database in use. We do not notice the error because spreadsheets and relational databases lack what a proper double-entry book-of-accounts has: an audit trail. (A toy sketch of both properties follows this comment.)

    I discovered the missing audit trail in the early 1980s. That corruption of financial data drew me into a deeper study of double-entry bookkeeping. I spent ten years, from 1994 to 2004, attending software conferences in an effort to call software developers’ attention to these faulty patterns. My efforts fell on deaf ears there, as well as in the “science” community of Woods Hole, the town where I live.

    The danger in working without an audit trail is felt first in our monetary system. Now, in our more fully computer-driven culture, the loss of the audit trail is affecting every aspect of our political and social life.

    The irony in all this is that the programming language that can handle the audit trail and a proper double-entry book-of-accounts was refined into practice as the second-oldest language in software development.

    It’s named ‘Lisp’. The Lisp language, developed at Simon Johnson’s own MIT, was for years the foundation of the introduction-to-computer-science course there. Several years ago, in the final dumbing-down of ‘computer science’, Lisp lost that distinguished position at MIT.
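
    A minimal Python sketch of the two properties invoked above (a sketch of the idea, not Dan’s Lisp system): double-entry, where every posting’s debits must equal its credits, and an append-only audit trail, where history is only ever appended to:

        journal = []  # the audit trail: append-only, never edited in place

        def post(description, entries):
            """Post a transaction; entries are (account, debit, credit) tuples."""
            debits = sum(d for _, d, _ in entries)
            credits = sum(c for _, _, c in entries)
            if debits != credits:
                raise ValueError("unbalanced transaction rejected")
            journal.append((description, tuple(entries)))

        post("sell turbine", [("cash", 1000, 0), ("revenue", 0, 1000)])  # accepted
        post("skim a little", [("cash", 0, 50)])  # raises ValueError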

  4. We’ve lost LISP, SAS has fled science in favor of MATLAB and other graphical statistics packages, and FORTRAN isn’t even offered as a programming option in most colleges anymore. Heck, it’s getting hard to find a basic HTML class these days.

  5. Shorter Philip H (and I speak as someone whose first two experiences with programming a computer were in 1970, learning APL and Fortran): Hey, you! Get off of my lawn.

    Even shorter Philip H: Kids these days!

  6. There’s a real technology for the rigorous management of data. It’s been around for years, and it isn’t Excel. There’s also a real science to developing a database schema and the business rules that provide constraints on the thing. This isn’t about Fortran or LISP, which carried their own data structures before formal database management existed. Those structures were just as prone to corruption, and worse, they were isolated inside the written programs. That all changed once data was disconnected from programs and managed in rigorously developed structures.

    This pertains to SAS only in that SAS libraries should be built right on top of formal databases, as should code written in Fortran, Lisp, C++, APL, and all the other programming languages – if the researchers know what they’re doing, that is.

    This stuff has been plaguing research since the advent of desktop computing power. Everyone ran off with their own machine and their favorite set of “tools”. Testing and debugging cycles, let alone formal database and lifecycle management for the associated code, are the furthest thing from the minds of those who are only worried about the next funding cycle. What other result would you expect?

    Edsger Dijkstra, the Dutch computer scientist, worried about this stuff for years, and it’s all come to pass. The only rational choice is to back up and do it right. Can we do that?

  7. @Norm Cimon: But where does the double-entry book-of-accounts fit into your data-management system?

    Double-entry bookkeeping, done properly, is charged with proving the validity of traded value simultaneously with proving the rights of ownership exchanged for that value. At present, that proof is the deeper issue: spreadsheets and databases supply no proof at all.

    Where value models a primarily arithmetic set of behaviors, rights control a primarily grammatical set of identifiers. Lisp ‘streams’ supply four essential solutions to the journal | ledger proof in the bookkeeper’s book-of-accounts: (1) the quote operator, (2) the macro language, (3) the built-in analogical number system, and (4) the ‘cons’, ‘car’, and ‘cdr’ method of controlling rational relationships between dissimilar data types (value and its identifier, for example).

    The bookkeeping example is the set of arithmetic functions that are grammatically identified to form the journal | ledger proof of the Statement of Profit [or Loss] against its tautologically equivalent Balance Sheet.

    Each and every one of the four Lisp patterns is essential to recording a valid history: an audit trail that can reconstruct each and every accounting within that history, as a judicial system of checks and balances. Lisp’s system | language rationale is the only fair and equitable way to resolve bookkeeping disputes. Without the audit trail, our corrupt monetary system is superseding our judicial system (as Eric Holder recently pointed out).

    This loss of justice results in an endless procession of violence that we are seeing more and more of in today’s computer-driven cultures.

  8. To Dan: You’re misunderstanding the fundamental nature of the design behind a formal database. You can include both value and ownership constraints. Those rules are built in from the get-go, and they can be used to constrain which manipulations of the data are possible. It can’t be emphasized too strongly that the ad-hoc incorporation of data into a program, no matter what the language, requires each program to build its own set of rules. That’s a prescription for what we have – a set of stovepipes that makes the system opaque unless you dig through the code or, in the case of this Excel snafu, the default rules for importing data in an ad-hoc fashion. That’s not the way to go.
    DB design is done at the lowest level, with the constraints built in before any work is done on top. Think about it. Why would you program and re-program the same set of constraints over and over again when you can do it once for good? It also means that if there are rule changes in the future, you’ve only got one place to go to make those changes, and you’ve propagated that integrity to all the programs that sit on top of that data. (A minimal sketch follows this comment.)
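
    A minimal sqlite3 sketch of that point – the rules are declared once, in the schema, and every program touching the table inherits them (the table and column names are made up):

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.execute("""
            CREATE TABLE trade (
                symbol TEXT NOT NULL,                  -- identifiers stay text
                value  REAL NOT NULL CHECK (value >= 0)
            )
        """)
        db.execute("INSERT INTO trade VALUES (?, ?)", ("DEC1", 100.0))
        try:
            db.execute("INSERT INTO trade VALUES (?, ?)", ("DEC1", -5.0))
        except sqlite3.IntegrityError as err:
            print("constraint rejected bad row:", err)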

  9. Hi Norm,
    You and I are not that far apart. I understand your formal database, where you have made the effort to create a complete system that can be adjusted once for all applications of the data. However, the double-entry pattern involves roles, set subject to rules, played by data that must not be commingled.

    The bookkeeping language is seven centuries old. Its reasons for being as it is have not changed in that time. What changed with industrial bookkeeping, in the late 19th century, was the automation of accounting reports. In Pacioli’s time (1494 A.D.) the reports were annual. In industrial applications the reports were automated to quarterly. The system that I experienced at GE reported weekly.

    It is this ability to report data on the fly that is important to cost accounting. In cost accounting, data management is organized by contracts, and by subcontracts within each contract. The solution calls for a separation of physical states, generated by behavior, as work. Work-states are set isomorphic to a control language, as composed story-reports. It is this relationship between work’s behavior and ownership’s identity that generates, in bookkeeping, the need for double entry.

    Double-entry never commingles the measure of value into the expression of rights.

    Norm wrote: “It can’t be emphasized too strongly that the ad-hoc incorporation of data into a program, no matter what the language, requires each program to build its own set of rules.”

    Norm, we are in complete agreement here as well. I believe that when the dust settles, what counts will be the fundamental way data is organized within the complete system you speak of. That organization, in a proper book-of-accounts, must balance debtor-value, set isomorphic to creditor-rights, and must generate a statement of profit [or loss] that is tautologically equivalent to the Balance Sheet. The words ‘isomorphic’ and ‘tautological’ are the key to bookkeeping’s proof of a legitimate balancing of accounts.

    The P&L reports the value in the entity’s care and control. The Balance Sheet tells who owns the rights to that value. It is this four-way balancing of accounts that I have seen neither hide nor hair of in my thirty-four years of bookkeeping research and development. (The core arithmetic is sketched at the end of this comment.) On the other hand, it is easy to see, in the London Whale betting, for example, that J.P. Morgan Chase is not using a proper separation and balancing of these essential four reports. If they are not doing it, it is a sure thing that most other banks are not doing it either. I will be impressed if even one is found that keeps a proper book-of-accounts.
    Dan
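
    The tie-out Dan describes, reduced to its arithmetic core with made-up numbers: net income from the P&L must equal the change in equity on the Balance Sheet.

        revenues, expenses = 500.0, 300.0
        net_income = revenues - expenses       # the P&L side

        assets, liabilities = 1200.0, 500.0
        opening_equity = 500.0
        closing_equity = assets - liabilities  # the Balance Sheet side

        # The two reports agree only if equity moved by exactly net income.
        assert closing_equity == opening_equity + net_income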

  10. Well, no surprise there. A lot of the services and tools we use these days have a “predictive” feature designed to make our lives easier. It is this whole “semantic web” thing, but unlike many people I am not too concerned about it, (1) because many of the problems can be easily identified and are well known for the most part, and (2) because I am certain there is some ambitious entrepreneur out there who will make a project out of solving these problems.

  11. To iv: the problem was solved 40 years ago. It’s just a lot more convenient to ignore the solution, and it makes a lot more quick money. But when it fails it brings down the house of cards. That slow-motion collapse is still happening, by the way. And it isn’t the “semantic web thing”. Those semantics depend on rigorously defined ontologies, something the financial community seems only dimly aware of. That’s exactly what Dan is talking about with his discussion of the “pragmatics” for a specific ontology – that of cost accounting. The knowledge is already there. The understanding about how to build a reliable system is already there. It just hasn’t been done.

    This isn’t about any new inventions; it’s about getting people in the financial industry wise enough to actually incorporate these ideas into their workflows. Until they do, the threat of additional damage remains.

  12. To Norm & Bruce: a while back Bruce sent a link to “Ambiguity, confidence and Fraud: A critical account of the materiality of money,” by Fiona Haines. Her abstract reads:

    “This paper extends the analysis of ambiguity in the context of white collar and corporate crime by subjecting money to critical account. Drawing from economic sociology and the sociology of emotions it articulates how the numerical values attached to a particular monetary unit requires confidence in the integrity of that number. Further, dependence on confidence is reproduced at the level of company accounts and is managed through rules governing accounting and auditing practice, practices that engender confidence, despite their inability to identify securely company value. Critically, confidence both in the value of money and in its reproduction in a set of accounts requires the presence of the state. Yet, the state can never establish when confidence is well placed, hence the opportunity for fraud is ever present.”

    Her writing is filled with feminine thoroughness and wisdom; what is missing is a link to the bookkeepers who understood the wisdom in bookkeeping, whereas today’s generation of scholars suffers from a difficult gap in the recorded history of commerce. Her keywords are ‘money’, ‘confidence’, ‘ambiguity’, ‘white-collar-crime’, and ‘fraud’.

    She goes on to tell the story of broken accounting that we all know so well. I like these two sentences, which form her thesis statement:

    “Financial fraud is premised on an understanding that there has been a criminal misrepresentation of a financial or business state of affairs by one or more individuals (or businesses) perpetrated against another. Yet, misrepresentation presupposes the possibility of an accurate representation.”

    Haines makes the clearest distinction I have ever encountered of the role played by the proper book-of-accounts. The book-of-accounts method is seven centuries old. Another female writer, Denise Schmandt-Besserat, at UT Austin, took the study of accounting back millennia further in her study of “How Writing Came About.” Job #1 in accounting asks: “Who owns what?” Haines correctly argues that “We will know that truth when we see it.”

    We all know fairness and honesty when we are lucky enough to encounter an example of them. Haines makes a clear distinction between ‘misrepresentation’ and ‘accurate representation’. Norm is also correct when he tells us:

    “The problem was solved 40 years ago. It’s just a lot more convenient to ignore the solution, and it makes a lot more quick money. But when it fails it brings down the house of cards.”

    I experienced a proper book-of-accounts fifty-seven years ago at the GE Turbine Division in Lynn, MA. At GE, mimeograph cards played the role that money plays: hermetically tight bookkeeping roles played subject to bookkeeping rules. Physical roles, versus intellectual rules, generate a tautological equivalence between (1) work-value, played subject to (2) ownership-rights. That tautological equivalence is coupled with an isomorphy that occurs naturally between Newtonian mechanical behavior and Planck’s quantum of action. The quantum is information communicating an isomorphic relationship between an arithmetically computed value-of-work and a grammatically computed rights-to-ownership of that work.

    One sees this isomorphy in Gödel’s proof of the incompleteness theorem. Recall that Gödel creates a language of prime numbers and their powers to differentially identify each of the characters that are essential to the role each character plays in his statement of proof. Simply put, his proof distinguishes each axiom’s arithmetic state from that state’s grammatical composition of proof. (A toy version of the encoding is sketched at the end of this comment.)

    Gödel clearly points out this isomorphy in Note #9 of his proof. What he fails to point out is that the isomorphic proof is itself a phenomenon that occurs in every aspect of life-systems and their support of life-as-a-language. It is that language of control that is so badly screwed up in today’s computer software.

    Bookkeeping, since Pacioli documented the discipline in 1494, has used the relationship between Newtonian arithmetic value, set isomorphic to Planckian grammatical rights. Newtonian arithmetic is a space-system that models debtor-value. Planckian grammar is a time-language that controls creditor-rights.

    Fiona Haines draws upon this intuition when she writes: “Financial fraud is premised on an understanding that there has been a criminal misrepresentation of a financial or business state of affairs by one or more individuals (or businesses) perpetrated against another. Yet, misrepresentation presupposes the possibility of an accurate representation.”

    Her faith, and ours, presupposes the accurate representation of facts that is with us in the very nature of our being, as life-systems enjoying our place in time. May we become clear-headed enough to fix this problem.
    Dan
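
    A toy version of the encoding described in this comment (a sketch of the idea, not Gödel’s exact scheme): a sequence of symbol codes becomes a single integer via prime powers, so arithmetic on the number can stand in for grammar on the sequence.

        def first_primes(k):
            """First k primes by trial division (fine at toy scale)."""
            out, n = [], 2
            while len(out) < k:
                if all(n % p for p in out):
                    out.append(n)
                n += 1
            return out

        def godel_number(codes):
            """Encode codes c1..ck as 2**c1 * 3**c2 * 5**c3 * ..."""
            n = 1
            for p, c in zip(first_primes(len(codes)), codes):
                n *= p ** c
            return n

        print(godel_number([1, 2, 3]))  # 2**1 * 3**2 * 5**3 = 2250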
