My last post was on the Indus script paper by Rao et al., just published online in Science. Several reactions have appeared. A significant fraction of those following the story would have seen the tantrum by Steve Farmer, Richard Sproat, and Michael Witzel, which I will discuss further below. I haven't spent much time surfing the blogosphere, but two more skeptical comments (thanks to JK, commenting on my previous post) are here (Mark Liberman) and here (Fernando Pereira). The most insightful comment that I have seen is Liberman's "This is a topic that traditionally generates more heat than light".
Before understanding the reactions it is important to understand the background and the work of Rao et al. The background is that, for over a century now, it has been assumed that the seals and tablets found at Indus valley archaeological sites contain writing in an unknown, pictographic script; and much effort has been devoted to deciphering the script. Then Farmer, Sproat and Witzel (authors of the above screed, and long-time researchers in the field) published a paper in 2004 arguing that the inscriptions do not encode language but are some form of non-linguistic symbol system. (Actually, they don't so much argue it as assert it. More on that below.) That paper is 39 pages long, containing 29 pages of text followed by many references, and is in fact a useful read in summarising the existing state of the art, even if one disagrees with its conclusions.
Rao et al. disagree with its conclusions, in their paper in Science that contains one page of text, a couple of figures, and about 15 pages of supporting data that mainly describes their methodology. The methodology is, to my astonishment, apparently new in this field, although certainly not in computational linguistics.
Basically, the method is to model language as a Markov chain. A Markov chain is a sequence of "things" (words, letters, events) with the following property: the probability of any individual event depends only on its predecessor, not on the entire previous sequence. For example, if a DNA sequence were generated by a Markov process and you saw "ACAGTGAC", the next nucleotide would be determined (probabilistically) only by the final nucleotide in this sequence, C, and not by the others. In a generalised (nth-order) Markov chain, each event depends on its n immediate predecessors, where n is 1 or more but usually not very large.
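To make this concrete, here is a minimal sketch in Python of a first-order Markov generator for the DNA example above. The transition probabilities are invented purely for illustration; the point is only that each new symbol is drawn from a distribution that depends on the immediately preceding symbol and nothing else.

```python
import random

# Hypothetical first-order transition probabilities, P(next | current).
# The numbers are made up for illustration, not estimated from real data.
transitions = {
    "A": {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.1, "G": 0.2, "T": 0.4},
    "G": {"A": 0.2, "C": 0.3, "G": 0.1, "T": 0.4},
    "T": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
}

def generate(start, length):
    """Generate a sequence in which each symbol depends only on its predecessor."""
    seq = [start]
    for _ in range(length - 1):
        probs = transitions[seq[-1]]
        symbols, weights = zip(*probs.items())
        seq.append(random.choices(symbols, weights=weights)[0])
    return "".join(seq)

print(generate("A", 20))  # e.g. something like "ACAGTGACTA..."
```

Run repeatedly, this produces different sequences, but their pairwise statistics converge to the table above; that table is all a first-order model knows.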
Imagine you had never seen writing in English before, and were confronted with it for the first time. (Imagine, also, that you had figured out that uppercase and lowercase letters represent the same thing.) You may quickly find that some letters (e) occur more frequently than others (z). But if you generate random text with frequencies that agree with what one sees in English, the result would look nothing like English.
Looking more closely, you may observe some anomalies: if you see a "q", the following letter is almost always a "u". If you see "i" and "e" together, the "i" comes before the "e" except when they follow a "c" (as in "receive"). You may even note some weird exceptions. Though "a" is a common letter (the third most common, after "e" and "t", by many counts), "aa" is a very rare combination in English, whereas "ee" and "tt" are quite common. Similarly, "ae" is much rarer than "ea". None of these observations can be accounted for by letter frequencies alone. But most of them can be fully captured by a first-order Markov model (the "i before e except after c" rule is more complicated).
In general, if two letters (say A and B) appear with frequencies P(A) and P(B), a random sequence would contain each of "AB" and "BA" with frequency P(A)P(B). If we do not observe this (and, in English and all other languages, we do not), we can conclude that the sequence is not random. The first-order Markov model is the next simplest assumption, and is often adequate.
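One can check this on any text by counting single letters and adjacent pairs (digrams) and comparing each observed pair frequency with the product of the single-letter frequencies. The sketch below is my own illustration, not anything from the paper; the file name is just a placeholder for any reasonably long English text.

```python
from collections import Counter

def digram_check(text, pairs=("th", "ht", "ea", "ae", "aa", "ee")):
    """Compare observed digram frequencies with what independent letters would predict."""
    text = "".join(c for c in text.lower() if c.isalpha())
    letters = Counter(text)
    digrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_letters, n_digrams = sum(letters.values()), sum(digrams.values())
    for pair in pairs:
        expected = (letters[pair[0]] / n_letters) * (letters[pair[1]] / n_letters)
        observed = digrams[pair] / n_digrams
        print(f"{pair}: observed {observed:.5f}, expected if random {expected:.5f}")

# "corpus.txt" is a placeholder: any sizeable English text will do.
digram_check(open("corpus.txt").read())
```

On English text, "th" comes out far more frequent than "ht", and "ea" far more frequent than "ae", even though independence would make each pair in the two orders equally likely.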
Shannon, in his classic 1948 paper on information theory, actually uses Markov models of English to construct pseudo-English sentences (using 26 letters and a space as his 27 symbols). A completely random string, with all symbols equally probable, looks like
"XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD."
A string maintaining actual frequencies of symbols, but with no correlations, looks like
"OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL."
A first-order Markov model yields
"ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE."
A second-order Markov model yields
"IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE."
The point is not whether these make sense but how much they look like English at first glance. A first-order Markov model is clearly a dramatic improvement on random models, and a second order Markov model is even better. Shannon goes on to Markov models that use words rather than letters as individual symbols, and in this case a first-order Markov model already gives something that looks grammatically correct over short distances:
"THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED."
My colleague Ronojoy Adhikari, who is a co-author of the Rao et al. paper, points out to me that Shannon was not the first to try such exercises: Markov himself preceded Shannon by about 30 years.
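For anyone who wants to reproduce the flavour of these exercises, here is a rough sketch of my own (not Shannon's procedure verbatim): it estimates k-th order letter statistics from whatever corpus you feed it and then samples from them. Order 0 corresponds to the frequency-only string above; orders 1 and 2 give increasingly English-looking output. The corpus file name is again a placeholder.

```python
import random
from collections import defaultdict, Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def train(text, order):
    """For each context of `order` symbols, count the frequencies of the following symbol."""
    text = "".join(c for c in text.lower() if c in ALPHABET)
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    return counts

def generate(counts, order, length):
    """Sample `length` symbols, each conditioned on the previous `order` symbols."""
    context = random.choice(list(counts))
    out = context
    for _ in range(length):
        options = counts.get(context)
        if not options:                       # unseen context: restart from a random one
            context = random.choice(list(counts))
            options = counts[context]
        nxt = random.choices(list(options), weights=options.values())[0]
        out += nxt
        context = out[len(out) - order:] if order else ""
    return out

# "corpus.txt" is a placeholder: any sizeable English text will do.
text = open("corpus.txt").read()
for order in (0, 1, 2):
    print(order, generate(train(text, order), order, 80).upper())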
What Rao et al. did was, essentially, to assume that the Indus texts are generated by a first-order Markov process. In the light of what we know about languages, this may seem a rather obvious thing to do. They use a measure called "conditional entropy" (which, again, stems from Shannon's work) to quantify the extent of first-order correlations, and a "relative conditional entropy" that compares the conditional entropy to that of a random sequence with the same frequencies of individual symbols. A more correlated sequence has lower conditional entropy, so the relative conditional entropy lies between 0 and 1.
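For the curious, here is a small sketch of how the two quantities can be computed from digram counts. This is my reading of the standard Shannon definitions, not the authors' code: the conditional entropy is H(Y|X) = -sum over (x,y) of p(x,y) log p(y|x), and dividing by the single-symbol entropy H(Y) (which is what the conditional entropy would be for an uncorrelated sequence with the same symbol frequencies) gives one natural version of the relative measure.

```python
import math
from collections import Counter

def entropies(seq):
    """Unigram entropy H(Y) and conditional entropy H(Y|X) from adjacent-pair counts."""
    unigrams = Counter(seq)
    bigrams = Counter(zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    context_totals = Counter()
    for (x, _), c in bigrams.items():
        context_totals[x] += c

    h_uni = -sum((c / n_uni) * math.log2(c / n_uni) for c in unigrams.values())
    h_cond = -sum((c / n_bi) * math.log2(c / context_totals[x])
                  for (x, _), c in bigrams.items())
    return h_uni, h_cond

# Placeholder input: in practice, a list of sign (or letter) tokens from a real corpus.
seq = list("abracadabra abracadabra")
h_uni, h_cond = entropies(seq)
print(f"H(Y) = {h_uni:.3f}, H(Y|X) = {h_cond:.3f}, relative = {h_cond / h_uni:.3f}")
```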
What they find is that the conditional entropy for an Indus script is very similar to that of known languages, and very different from non-linguistic symbol systems.
What are the criticisms? Let us look first at the rant from Farmer et al. In comments to my previous post, Ronojoy refuses to dignify it with a response; but lay readers may be interested in my point of view anyway.
Farmer and colleagues do have one genuine criticism (which, in my opinion, is not a serious problem with the paper). But they bury it under so much misleading and simply ad hominem rubbish that it is better to clear that refuse heap first.
First, they say:
"It is important to realize that all their demonstration shows is that the Indus sign system has some kind of rough structure, which has been known since the 1920s. Similar results could be expected if they compared their artificial sign sets to any man-made symbol system, linguistic or nonlinguistic. Our paper in fact made much of this point and also gave examples of striking statistical overlaps between real-world (not invented) nonlinguistic and linguistic systems and argued that it is not possible to distinguish the two using statistical measures alone."
In fact, the "rough structure" that they discuss in their paper, which has been "known since the 1920s", is only the fact that some symbols occur more often than others! Correlations among successive symbols are completely ignored. The last sentence quoted above refers, I think, to their "Figure 2", which deals only with frequencies of individual signs, not with any kind of correlated structure. Yes, some Indus symbols are more frequent, and some heraldic blazons are more frequent, than others. It is also true that some road signs are more common than others (credit). Of course that does not tell us anything about whether a sequence of road signs constitutes a language. This is not at all the claim that Rao et al. are making, and it staggers me that the point is being missed so widely.
Another charge that Farmer et al. level at Rao et al. is that of Dravidian nationalism. Given the surnames of the authors (only one seems to be of Dravidian origin), the accusation is comical.
There's more along the same lines. But one accusation is true: Rao et al.'s main text is misleading about exactly which non-linguistic systems they compare the Indus script with. They plot a "type 1 system" and a "type 2 system" in the first part of their figure; only on reading the supporting text does one learn that these are not actual corpuses of non-linguistic symbol systems but synthetically generated ones. (The second half of that figure does include some genuine non-linguistic symbol systems.)
Ronojoy responds in a comment:
The non-linguistic Type1 and Type2 are controls; the comparision in Fig.2 is with real world data sets which are like the controls - DNA with lot of variability and high entropy, Fortran code with fairly rigid order and low entropy. The controls are limiting cases of the real world data : Type1 has no correlation, while Type2 is maximally correlated. In Fig 2, they represent the two extremes. Our conclusion would still be valid if we deleted the controls. Comparing more datasets is part of ongoing work.
This could have been made clearer in the main text -- but it is made clear enough in the supporting data. I would have liked to see the real world data in figure 1 too (which is the more striking figure). And that's my only disappointment with the paper. But I expect that it will be rectified in future work by these authors.
Was this work trivial? The reaction of Liberman seems to be: "Huh, they only counted digrams?" Yes, to anyone familiar with Markov processes (that includes huge swaths of modern science) it is trivial. But apparently nobody had done it before! To me, that is the value of interdisciplinary research: what is obvious to one community may be new to another, and the results may be far-reaching.
[Update 28/04/09: small typo corrected. Also, see comments below from Mark Liberman and myself.]
[Update 01/05/09: comments closed.]