N-Grams and the History of Computing

Google N-Gram Search

As I'm sure most of you know, late last year Google announced a new research tool known as the Ngram Viewer. (An n-gram is any sequence of items, in this case words, of length n; so a 2-gram would be any word pair.) The tool was released in conjunction with the publication of a paper in Science that made use of it to explore the history of culture. That paper is, in turn, part of a broader movement in recent years to apply the tools of digital text search and analysis to humanities questions. My colleague, Ben Schmidt, for example, has been exploring the possibilities for some time, using rather more sophisticated tools than the Ngram Viewer. The researchers who produced the Science paper in fact see themselves as tilling a new field, culturomics, or the quantitative study of culture and its history. All of this has, of course, raised some doubts among those who have studied culture for a long time using more qualitative methods, and skeptics abound. The limits and dangers of the Ngram Viewer are indeed significant:

  1. Among the most obvious is its source base, Google Books. It is a collection of published works that universities have seen fit to collect, thus excluding the universe of unpublished texts, ephemera, comic books, the vast majority of software and hardware manuals, etc., etc., not to mention the entirety of non-written culture.
  2. N-grams may be used in unexpected and unforeseen contexts, thus deceiving the user. A search for "gold" and "silver," for example, might be taken as a proxy for the relative popularity of each metal, but both terms are used in a wide variety of metaphorical ways that confound such analysis, not to mention terms such as "Gold Bond Medicated Powder."
  3. The Google Books source base suffers from data defects, most notably optical character recognition (OCR) errors and incorrectly dated texts. In many cases such errors might disappear as background noise, but the rarer one's search terms and the further back in time one goes (due to both typeface changes and fewer available books), the more serious the problem grows.

This list is certainly not exhaustive. The paper authors have acknowledged some of these limitations on their website. With all of those caveats in mind, I thought it might be fun to take the Ngram Viewer for a spin and see if it can tell us anything interesting about the history of computing.

I began with an obvious thought: why not track the historical trajectory of the major paradigms (according to received wisdom) for the organization of computer hardware? So I searched for "mainframe computer," "minicomputer," "microcomputer," and "Internet" (see attached image).

One problem that came to mind immediately is the incommensurability of n-grams of different orders (i.e. 1- vs. 2- vs. 3-grams). Because there are massively more possible 2-grams than 1-grams, it's not clear that "mainframe computer" can be fairly compared to the other three terms. But if I search for just "mainframe," I run the risk of false positives if there is some hidden alternative meaning for the term that has nothing to do with computers (luckily, this seems not to be the case; see the 2nd attached image).

Then I considered another problem: how do I know that the terms I've chosen mark out equally broad swaths of conceptual space? Do I need to include quasi-synonymous terms like "personal computer" in order to match the scope of "mainframe"? If "Internet" is a subset of "networking" or "network," which one is the more apt comparison? Moreover, each term has particular connotations that shift over time; "PC," for example, might usually refer to the IBM PC in the years immediately after 1981, then later to PC-compatibles, then later to "political correctness." "Internet" has a relatively fixed meaning, but "network" could mean virtually anything (TV network, radio network, kinship network, etc.). Already my head was starting to hurt.

Still within this single search, an odd anomaly also showed up. There was apparently a lesser-known Internet bubble, circa 1900.
Luckily the Ngram Viewer links directly into Google Books, allowing one to see the original context for some of the results. In this case the early-twentieth-century Internet turns out to be the product of a combination of source-dating errors and OCR errors (due largely to the abbreviation "internat.").

Despite all of these issues, this whole quest did tell me something quite interesting (although perhaps obvious in retrospect): the use of the term "mainframe" or "mainframe computer" didn't take off until the late 1970s and didn't peak until the end of the 1980s. My interpretation: before microcomputers, mainframes rarely required special identification; most of the time they were just "computers." It wasn't until micros became commonplace that it became necessary to define the older, bigger machines as something distinct. (Of course this could be all wrong; perhaps the growth of "mainframe" is just tracking the growth of computing in general...)

It would be interesting to know if this naming project was carried out by the partisans of mainframes ("we don't make just any kind of computer, but powerful mainframes!") or the partisans of micros ("you need to get away from the big, slow 'mainframe' paradigm, and into the modern age of flexibility and individual initiative!"), or by some other mechanism. The Ngram Viewer by itself is entirely powerless to answer questions of this kind, but its capacity to raise them is certainly not to be scoffed at.

In conclusion, the Ngram Viewer is a tool to be used with caution, a tool with many intellectual and linguistic challenges to its effective use, a tool that can badly "lie" if used naively and without an understanding of its severe limits. All that said, it's pretty awesome.
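
P.S. For anyone who wants to poke at the n-gram-order problem themselves, here is a minimal Python sketch of the idea. Treat it as a toy under stated assumptions: the two-"book" corpus is invented, tokenization is nothing more than splitting on whitespace, and the normalization (dividing a phrase's count by the count of all n-grams of the same order in that year) is my own illustrative choice, not a description of how the Ngram Viewer itself is built. The function names are likewise hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Return every word n-gram in `text` as a tuple, in order of appearance."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def relative_frequency(phrase, texts):
    """Share of all same-order n-grams in `texts` that match `phrase`."""
    target = tuple(phrase.lower().split())
    n = len(target)  # a 2-word phrase is only compared against 2-grams
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    total = sum(counts.values())
    return counts[target] / total if total else 0.0


# Invented mini-corpus keyed by year, purely for illustration.
toy_corpus = {
    1975: ["the computer filled the room", "a computer is a machine"],
    1985: ["the mainframe computer filled the room",
           "a microcomputer sat on the desk"],
}

for year, texts in sorted(toy_corpus.items()):
    print(year,
          relative_frequency("mainframe computer", texts),
          relative_frequency("microcomputer", texts))
```

Note that the two numbers printed for each year have different denominators (all 2-grams vs. all 1-grams), which is exactly why comparing "mainframe computer" against "microcomputer" on a single chart deserves some suspicion.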