Monday, April 27, 2009

Indus: What did Rao et al. really do?

My last post was on the Indus script paper by Rao et al., just published online in Science. Several reactions have appeared. A significant fraction of those following the story would have seen the tantrum by Steve Farmer, Richard Sproat, and Michael Witzel, that I will discuss further below. I haven't spent much time surfing the blogosphere, but two more skeptical comments (thanks to JK, commenting on my previous post) are here (Mark Liberman) and here (Fernando Pereira). The most insightful comment that I have seen is Liberman's "This is a topic that traditionally generates more heat than light".



Before understanding the reactions, it is important to understand the background and the work of Rao et al. The background is that, for over a century now, it has been assumed that the seals and tablets found at Indus valley archaeological sites contain writing in an unknown, pictographic script; and much effort has been devoted to deciphering the script. Then Farmer, Sproat and Witzel (authors of the above screed, and long-time researchers in the field) published a paper in 2004 arguing that the scripts do not encode language but are some form of non-linguistic symbol system. (Actually, they don't so much argue it as assert it. More on that below.) That paper is 39 pages long, containing 29 pages of text followed by many references, and is in fact a useful read in summarising the existing state of the art, even if one disagrees with its conclusions.

Rao et al. disagree with those conclusions in their Science paper, which contains one page of text, a couple of figures, and about 15 pages of supporting material that mainly describes their methodology. The methodology is, to my astonishment, apparently new in this field, although it is certainly not new in computational linguistics.



Basically, the method is to model language as a Markov chain. A Markov chain is a sequence of "things" (words, letters, events) with the following property: the probability of any individual event depends on its predecessor only, not on the entire previous sequence. For example, if a DNA sequence were generated by a Markov process and you saw "ACAGTGAC", the next nucleotide would be determined (probabilistically) only by the final nucleotide in this sequence, C, and not by the others. In a generalised (nth-order) Markov chain, each event depends on its n immediate predecessors, where n is 1 or more but usually not very large.
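For the curious, this idea is easy to write down in code. Below is a minimal sketch of a first-order Markov model for DNA; the transition probabilities are entirely made up for illustration, not estimated from any real genome:

```python
import random

# Hypothetical transition probabilities: each row gives the probability
# of the next nucleotide given the current one.
transitions = {
    "A": {"A": 0.1, "C": 0.4, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.1, "G": 0.2, "T": 0.4},
    "G": {"A": 0.2, "C": 0.3, "G": 0.1, "T": 0.4},
    "T": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
}

def next_nucleotide(sequence):
    """Sample the next symbol; only sequence[-1] matters."""
    probs = transitions[sequence[-1]]
    symbols, weights = zip(*probs.items())
    return random.choices(symbols, weights=weights)[0]

# "ACAGTGAC" and the bare string "C" give identical predictive
# distributions: the chain has no memory beyond the last symbol.
```

The point of the sketch is the last comment: the prediction after "ACAGTGAC" is exactly the prediction after "C" alone.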

Imagine you had never seen writing in English before, and were confronted with it for the first time. (Imagine, also, that you had figured out that uppercase and lowercase letters represent the same thing.) You may quickly find that some letters (e) occur more frequently than others (z). But if you generate random text with frequencies that agree with what one sees in English, the result would look nothing like English.

Looking more closely, you may observe some anomalies: if you see a "q", the following letter is almost always a "u". If you see "i" and "e" together, the "i" comes before the "e" except when after a "c" (as in "receive"). You may even note some weird exceptions. Though "a" is a common letter (the third most common, after "e" and "t", by many counts), "aa" is a very rare combination in English, though "ee" and "tt" are quite common. Similarly, "ae" is much rarer than "ea". None of these observations can be accounted for by letter frequencies alone. But most of them can be fully encompassed in a first-order Markov model (the "i before e except after c" rule is more complicated.)

In general, if two letters (say A and B) appear with frequencies P(A) and P(B), a random sequence would contain each of "AB" and "BA" with frequency P(A)P(B). If you do not observe this (and, in English and all other languages, we do not), we can conclude that the sequence is not random. The first-order Markov model is the next simplest assumption, and is often adequate.
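A few lines of Python make the comparison concrete. The sample sentence below is arbitrary (any chunk of English text would do); the idea is simply to set the observed pair frequency against the product of the single-letter frequencies:

```python
from collections import Counter

def unigram_bigram_freqs(text):
    """Return per-letter and per-pair relative frequencies."""
    letters = [c for c in text.lower() if c.isalpha()]
    uni = Counter(letters)
    bi = Counter(zip(letters, letters[1:]))
    n, m = len(letters), len(letters) - 1
    return ({c: k / n for c, k in uni.items()},
            {p: k / m for p, k in bi.items()})

uni, bi = unigram_bigram_freqs(
    "peter piper picked a peck of pickled peppers")

# If successive letters were independent, the frequency of the pair
# "ep" would equal P(e) * P(p); in real text the two numbers differ,
# and that gap is exactly the signal a first-order Markov model captures.
independent = uni["e"] * uni["p"]
observed = bi.get(("e", "p"), 0.0)
```

On a real corpus the gaps are dramatic: "qu" vastly outstrips P(q)P(u), while "aa" falls far below P(a)P(a).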

Shannon, in his classic 1948 paper on information theory, actually uses Markov models of English to construct pseudo-English sentences (using 26 letters and a space as his 27 symbols). A completely random string, with all symbols equally probable, looks like
"XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD."
A string maintaining actual frequencies of symbols, but with no correlations, looks like
"OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL."
A first-order Markov model yields
"ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE."
A second-order Markov model yields
"IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE."
The point is not whether these make sense but how much they look like English at first glance. A first-order Markov model is clearly a dramatic improvement on random models, and a second order Markov model is even better. Shannon goes on to Markov models that use words rather than letters as individual symbols, and in this case a first-order Markov model already gives something that looks grammatically correct over short distances:
"THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED."
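Shannon's exercise is easy to reproduce in miniature. The sketch below trains a first-order letter-level chain on a toy sample and then walks it; a real experiment would of course train on a large corpus rather than one sentence:

```python
import random
from collections import defaultdict

def train_first_order(text):
    """Collect, for each symbol, the list of symbols that follow it."""
    followers = defaultdict(list)
    for a, b in zip(text, text[1:]):
        followers[a].append(b)
    return followers

def generate(followers, start, length):
    """Walk the chain: each symbol is drawn from its predecessor's followers."""
    out = [start]
    for _ in range(length - 1):
        nxt = followers.get(out[-1])
        if not nxt:  # dead end: this symbol never appears mid-text
            break
        out.append(random.choice(nxt))
    return "".join(out)

sample = "the quick brown fox jumps over the lazy dog and the cat"
model = train_first_order(sample)
pseudo = generate(model, "t", 40)
```

Sampling from the follower lists automatically reproduces the training text's transition frequencies, which is all a first-order model is.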

My colleague Ronojoy Adhikari, who is a co-author of the Rao et al. paper, points out to me that Shannon was not the first to try such exercises: Markov himself preceded Shannon by about 30 years.



What Rao et al. did was, essentially, to assume that the Indus scripts are generated by a first-order Markov process. In the light of what we know about languages, this may seem a rather obvious thing to do. They use a measure called "conditional entropy" (which, again, stems from Shannon's work) to measure the extent of first-order correlations; and a "relative conditional entropy" that compares the conditional entropy to that of a random sequence with the same frequencies of individual symbols. A more correlated sequence has lower conditional entropy, so the relative conditional entropy must lie between 0 and 1.
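For concreteness, here is one way to estimate these quantities from bigram counts. This is a simplified sketch, not the authors' actual code; the normalisation uses the fact that for an uncorrelated sequence the conditional entropy equals the plain symbol entropy:

```python
import math
from collections import Counter

def conditional_entropy(seq):
    """H(next | current) in bits, estimated from bigram counts."""
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(a for a, _ in pairs.elements())
    n = sum(pairs.values())
    h = 0.0
    for (a, b), k in pairs.items():
        p_ab = k / n               # joint probability of the pair
        p_b_given_a = k / firsts[a]  # conditional probability
        h -= p_ab * math.log2(p_b_given_a)
    return h

def relative_conditional_entropy(seq):
    """Ratio to the entropy of an uncorrelated sequence with the same
    symbol frequencies (for which H(next|current) = H(next))."""
    uni = Counter(seq)
    n = len(seq)
    h_uni = -sum((k / n) * math.log2(k / n) for k in uni.values())
    return conditional_entropy(seq) / h_uni if h_uni else 0.0
```

A maximally rigid sequence like "ababab" has conditional entropy zero (each symbol fully determines the next), while an uncorrelated one scores close to 1 on the relative measure; natural languages sit in between.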

What they find is that the conditional entropy for an Indus script is very similar to that of known languages, and very different from non-linguistic symbol systems.



What are the criticisms? Let us look first at the rant from Farmer et al. In comments to my previous post, Ronojoy refuses to dignify it with a response; but lay readers may be interested in my point of view anyway.

Farmer and colleagues do have one genuine criticism (which, in my opinion, is not a serious problem with the paper). But they bury it under so much misleading and simply ad hominem rubbish that it is better to clear that refuse heap first.

First, they say:

"It is important to realize that all their demonstration shows is that the Indus sign system has some kind of rough structure, which has been known since the 1920s. Similar results could be expected if they compared their artificial sign sets to any man-made symbol system, linguistic or nonlinguistic. Our paper in fact made much of this point and also gave examples of striking statistical overlaps between real-world (not invented) nonlinguistic and linguistic systems and argued that it is not possible to distinguish the two using statistical measures alone."

In fact, the "rough structure" that they discuss in their paper, that has been "known since the 1920s", is only the fact that some symbols occur more often than others! Correlations among successive symbols are completely ignored. The last sentence quoted above refers, I think, to their "Figure 2" which deals only with frequencies of individual signs, not with any kind of correlated structure. Yes, some Indus symbols are more frequent, and some heraldic blazons are more frequent, than others. It is also true that some road signs are more common than others (credit). Of course that does not tell us anything about whether a sequence of road signs constitutes a language. This is not at all the claim that Rao et al. are making, and it staggers me that the point is being missed so widely.

Another charge that Farmer et al. level at Rao et al. is that of Dravidian nationalism. Given the surnames of the authors (only one seems to be of Dravidian origin), this is a comical accusation.

There's more along the same lines. But one accusation is true: Rao et al. are misleading in their main text about which non-linguistic systems, exactly, they are comparing the Indus script with. They plot a "type 1 system" and a "type 2 system" in the first part of their figure; only on reading the supporting text does one learn that these are not actual corpuses of non-linguistic symbol systems, but synthetically generated ones. (The second half of that figure does contain some actual non-linguistic symbol systems.)

Ronojoy responds in a comment:

The non-linguistic Type1 and Type2 are controls; the comparision in Fig.2 is with real world data sets which are like the controls - DNA with lot of variability and high entropy, Fortran code with fairly rigid order and low entropy. The controls are limiting cases of the real world data : Type1 has no correlation, while Type2 is maximally correlated. In Fig 2, they represent the two extremes. Our conclusion would still be valid if we deleted the controls. Comparing more datasets is part of ongoing work.

This could have been made clearer in the main text -- but it is made clear enough in the supporting data. I would have liked to see the real world data in figure 1 too (which is the more striking figure). And that's my only disappointment with the paper. But I expect that it will be rectified in future work by these authors.



Was this work trivial? The reaction of Liberman seems to be: "Huh, they only counted digrams?" Yes, to anyone familiar with Markov processes (that includes huge swaths of modern science) it is trivial. But apparently nobody had done it before! To me, that is the value of interdisciplinary research: what is obvious to one community may be new to another, and the results may be far-reaching.

[Update 28/04/09: small typo corrected. Also, see comments below from Mark Liberman and myself.]




[Update 01/05/09: comments closed.]

Saturday, April 25, 2009

The Indus Valley script

A paper just published online in Science magazine (subscription required for full text) makes a significant contribution to the debate about whether the Indus Valley "script" was really a script for a language, or was a mere nonlinguistic symbol system. The authors consider the statistical properties of texts in known languages and non-languages (in particular, the conditional entropy) and compare these with a large corpus of Indus Valley text. The conclusion is that the Indus Valley symbols indeed encode a language. It looks extremely convincing. Now the task remains to decode the thing.

I'm a bit slow to blog on this because I wanted to read the paper first: it turns out it's very short (barely a page plus references and figures in the ScienceExpress format, and no doubt the final published version will be even more compact.) One of the authors is my colleague Ronojoy Adhikari and I had a brief lunch-room summary from him which pretty much covers the paper. (A longer summary from Ronojoy is here, courtesy Rahul Basu.) It has attracted a fair bit of media attention (km has a nice summary and a couple of links).

A couple of points are striking: first, of the languages compared, the conditional entropy for the Indus scripts seems closest to Old Tamil, which is suggestive given the belief that the Indus Valley residents may have been Dravidians and thus the ancestors of the ancient Tamil people. (Rig Vedic Sanskrit is plotted in only one of the two sub-figures, unfortunately -- the one on relative conditional entropy. Given its geographic, though probably not temporal, proximity to the Indus Valley civilisation, it would have been nice to see it in the other plot too.) Second, when comparing with non-linguistic systems, the difference is extremely stark. And convincing. One could imagine a primitive linguistic system falling somewhere between the "real languages" and Fortran, or between the "real languages" and Vinca. But in this case, the Indus script curve is bang on top of the other "real language" curves, and well separated from everything else.

I'm one of those who believes that "interdisciplinary research" will tell us more and more in the future, and this is an example of computer scientists, physicists and linguists succeeding in putting together a few simple ideas that tell us a great deal. (It is not Ronojoy's only recent interdisciplinary foray; primarily a "soft condensed matter" physicist, he also recently published an article [free arxiv preprint] on the harmonics produced by loading the membrane in Indian drums. The topic is of interest to me -- I did a project on it long ago as an undergraduate -- but I will perhaps write about it some other time.)

Thursday, April 23, 2009

The Binayak Sen case drags on

I last posted on Dr Binayak Sen nearly a year ago, when Tehelka featured him on their cover. Nothing much has happened in a year except that his health is suffering and the vindictive Chhattisgarh administration is denying him a doctor of his choice.

This is a few days late, but former Supreme Court judge V R Krishna Iyer has written an excellent letter to our Prime Minister on the case. The thing is, Manmohan Singh is an intelligent, well-informed and (reputedly) upright man, so I don't believe that he is unaware of the case or ignorant of its implications. So why hasn't he said anything yet? He has spoken repeatedly on the dangers of the Naxalite movement. But does he believe that imprisoning social workers and human rights activists, while giving tickets to the likes of Jagdish Tytler, is the way to improve democracy and quell Naxalite violence?

Wednesday, April 22, 2009

Tortured evidence

So why did the Bush administration authorise -- and try to find legal cover for -- the use of techniques on detainees that have been regarded as torture when practised anywhere else in the world? Here's one answer:

The Bush administration applied relentless pressure on interrogators to use harsh methods on detainees in part to find evidence of cooperation between al Qaida and the late Iraqi dictator Saddam Hussein's regime, according to a former senior U.S. intelligence official and a former Army psychiatrist.

Such information would've provided a foundation for one of former President George W. Bush's main arguments for invading Iraq in 2003. In fact, no evidence has ever been found of operational ties between Osama bin Laden's terrorist network and Saddam's regime...

As several people have remarked recently, historically torture has been used not to elicit information, but to elicit false confessions. It seems the Bushies were no different.

There's an old joke that I mentioned before about the Delhi Police going to capture a lion. But the punchline seems to apply perfectly to the CIA under the Bushies.

Tuesday, April 21, 2009

Meta-news

Being a regular reader of the online comic xkcd, I'm not sure which aspect of this news item interests me more: xkcd gets a printed book, or xkcd gets a New York Times article about it.

Are geeks taking over the world? Or merely the mainstream media?

Friday, April 17, 2009

Torture

Ajmal Kasab, the gunman in the Mumbai terror attacks who was photographed stalking CST station with a machine gun and was later caught alive, is appearing in court and has already retracted his confession, stating that it was "coerced". He was photographed and then caught, wasn't he? What is there to confess? Well, one assumes, all the details about the planners and the journey from Karachi and the local accomplices (two of whom are also being tried). This "confession" was sent to Pakistan, along with other evidence that included identical DNA reports for two different individuals, Kasab and Abu Ismail -- which, according to Chidambaram, was a "minor clerical error".

If by "coercion" Kasab means torture, can we believe his confession? Plenty of evidence says that we cannot (links 1, 2). In fact, eliciting false confessions is often the aim of torture, by totalitarian regimes the world over. I am not sure what "coercion" refers to in Kasab's case, but I really, really hope that it was not torture: if it was, all evidence obtained from him is useless. (And my hopes are not high: we know how Indian police treat common criminals and suspects.) Also, making a mess of this trial -- the only recent case of a terrorist being caught alive, anywhere in the world -- would be a disgrace to our police and investigatory system.

Elsewhere in the world, Barack Obama authorised the release of four memos detailing "justifications" for torture under the Bush administration. The memos make it clear that the authors knew they were endorsing the very methods that they routinely condemn when practised by foreign governments. Several responses are out on the internet; Glenn Greenwald and Andrew Sullivan are worth reading. Obama has said he is not in favour of prosecuting the CIA operatives who tortured (though that doesn't close the door on independent prosecutors); however, he did not say anything -- one way or the other -- on those who wrote the memos or those in the administration who authorised the torture. Recently, the Red Cross released its own report on the treatment of prisoners at Guantanamo Bay. Both the report and the memos make gruesome reading. Both have spurred international discussion. I wonder if anyone in power in India is talking about the use of torture by our agencies. The media routinely turns up individual cases, but nobody seems to be talking about the issue as a matter of policy.

Wednesday, April 08, 2009