Wednesday, September 29, 2010


Here. My "new home" post (explaining some of the whys) is here, and my first "real" new post is here. I'll try to keep blogging there; if you'd like to keep reading, please update your feedreader. This blog is closed.

Friday, September 17, 2010

The Indus argument continues

Last year much excitement and noise occurred, including on this blog [1, 2, 3], when a group of scientists (led by Rajesh Rao at the University of Washington, and including my colleague Ronojoy Adhikari) published a brief paper in Science supplying evidence, on statistical grounds, that the Indus symbols constituted a writing system. In their words, they "present evidence for the linguistic hypothesis by showing that the script’s conditional entropy is closer to those of natural languages than various types of nonlinguistic systems."

This rather modest claim outraged Steve Farmer, Richard Sproat and (presumably) Michael Witzel (FSW), who had previously "proved" that the Harappan civilization was not literate (the paper was subtitled "The myth of a literate Harappan civilization"). In a series of online screeds, they attacked the work of Rao et al: for reviews, see this previous post, and links and comments therein.

Now Richard Sproat has published his latest attack on Rao et al. in the journal Computational Linguistics. Rao et al. have a rejoinder, as do another set of researchers, and Sproat has a further response to both groups (but primarily to Rao et al.); all these rejoinders will appear in the December issue of Computational Linguistics.

To summarise quickly, the way I see it: Sproat claims (as he previously did on the internet) that Rao et al.'s use of "conditional entropy" is useless in distinguishing scripts from non-scripts, because one can construct non-scripts with the same conditional entropy, and because their extreme ("type 1" and "type 2") non-linguistic systems are artificial examples. Rao et al. respond that that is a mischaracterisation of what they did, observe that Sproat entirely fails to mention the second figure from the same paper or the more recent "block entropy" results, and repeat (in case it wasn't obvious) that they don't claim to prove anything, only offer evidence. They give inductive and Bayesian arguments for why the mass of evidence, including their own, should increase our belief that the Indus symbols were a script.

In connection with the Bayesian arguments, Rao et al. do me the honour of citing my blog post on the matter, thus giving this humble blog its first scholarly citation. My argument was as follows: Given prior degrees of belief, $P(S)$ for the script hypothesis and $P(NS)$ for the non-script hypothesis, and given "likelihoods" of the data under each hypothesis, $P(D|S)$ and $P(D|NS)$, Bayes' theorem tells us how to calculate our posterior degrees of belief in each hypothesis given the data:
$P(S|D) = \frac{P(D|S)P(S)}{P(D|S)P(S) + P(D|NS)P(NS)}$
We can crudely estimate $P(D|NS)$ by looking at the "spread" of the language band in figure 1A of their Science paper and asking how likely it is that a generic non-language sequence would fall in that band: assuming that it can fall anywhere between the two extreme limits that they plot, we can eyeball it as 0.1 (the band occupies 10% of the total spread). [Update 17/09/2010: See the plot below, which is identical to the one in Science, except for the addition of Fortran (blue squares, to be ignored here).] Let us say a new language is very likely (say 0.9) to fall in the same band. Then $P(D|NS) = 0.1$, $P(D|S) = 0.9$. If we were initially neutrally placed between the hypotheses ($P(NS) = P(S) = 0.5$), then we get $P(S|D) = 0.9$: that is, after seeing these data we should be 90% convinced of the script hypothesis. Even if we started out rather strongly skeptical of the script hypothesis ($P(S) = 0.2$, $P(NS) = 0.8$), the Bayesian formula tells us that, after seeing the data, we would be almost 70% convinced ($P(S|D) = 0.69$).
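For anyone who wants to check the arithmetic or try other priors, here is a minimal Python sketch of the same calculation. The function name `posterior_script` and its default arguments are illustrative choices made for this example, and the likelihoods 0.9 and 0.1 are the eyeballed values from the paragraph above, not measured quantities.

```python
def posterior_script(prior_s, like_s=0.9, like_ns=0.1):
    """Posterior P(S|D) from Bayes' theorem for two exhaustive hypotheses.

    prior_s : prior belief P(S) in the script hypothesis
    like_s  : P(D|S), assumed 0.9 (eyeballed, as in the text)
    like_ns : P(D|NS), assumed 0.1 (eyeballed, as in the text)
    """
    prior_ns = 1.0 - prior_s  # P(NS), since S and NS are the only hypotheses
    evidence = like_s * prior_s + like_ns * prior_ns  # P(D)
    return like_s * prior_s / evidence


# The neutral prior and the skeptical prior discussed above.
for prior in (0.5, 0.2):
    print(f"P(S) = {prior:.1f} -> P(S|D) = {posterior_script(prior):.2f}")
```

Changing `like_ns` shows how sensitive the conclusion is to the eyeballed width of the language band: the wider the band relative to the total spread, the weaker the evidence.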

We can quibble with these numbers, but the general point is that this is how science works: we adjust our degrees of belief in hypotheses based on the data we have and the extent to which the hypotheses explain those data.

Sproat apparently disagrees with this "inductive" approach, and accuses Rao et al. of lack of clarity in their goals. On the first page, he clarifies that he was talking only of the Science paper and has not carefully analysed [correction 17/09/10: previously "read"] the more recent papers by Rao and colleagues; he says those works do not affect questions on the previous paper, writing,

'To give a stark example, if someone should eventually demonstrate rigorously that cottontop tamarins are capable of learning “regular” grammars, that would have no bearing on the questions currently surrounding Marc Hauser’s 2002 publication in Cognition.'

In this way Sproat succeeds in insinuating, without saying it, that the work of Rao et al. may have been fraudulent. (Link to Hauser case coverage)

A little later, on the claim that the arguments of FSW "had been accepted by many archaeologists and linguists", he offers this belated evidence that such people do exist:

Perhaps they do not exist? But they do: Andrew Lawler, a science reporter who in 2004 interviewed a large number of people on both sides of the debate notes that “many others are convinced that Farmer, Witzel, and Sproat have found a way to move away from sterile discussions of decipherment, and they find few flaws in their arguments” (Lawler 2004, page 2029), and quotes the Sanskrit scholar George Thompson and University of Pennsylvania Professor Emeritus of Indian studies Frank Southworth.

Having thus convincingly cited a science reporter to prove that the academic community widely accepts FSW's thesis, he proceeds to the actual claims about the symbols; after a few pages of nitpicks not very different from the above, he addresses a point which he had previously raised in this comment: why does figure 1A in the Science paper not include Fortran? He suspects that Fortran's curve would have overlapped significantly with the languages, "compromising the visual aspect of the plot". I actually find that explanation credible(*), and I was not comfortable with the manner of presentation of the data in the Science paper: but I view this as a problem with the "system" rather than the authors. Enormous prestige is attached to publication in journals like Science. To allow more authors to publish, Science has a one-page "brevia" format (which Rao et al. used) that allows essential conclusions to be presented on that printed page, while the substance of the paper is in supplementary material online. Rao et al. can argue, correctly, that they hid nothing in their full paper (including the supplementary material); but obviously what was shown in the main "brevia" format was selected for maximum instantaneous visual impact. And they are not the only ones to do this. I'd argue that formats like "brevia" are designed to encourage this sort of thing, and the blame goes to journals like Science. It is annoying, but to compare it with the Hauser fraud is odious.

Sproat's response doesn't improve in the subsequent pages. He distinguishes between his preferred "deductive" way of interpreting data and the "inductive" approach preferred by Rao et al; he complains that they did not clarify this in their original paper (though I would have thought the language was clear enough, that they nowhere claimed to be "deducing" anything, only offering "evidence"); he nitpicks (as I would have expected) with the Bayesian arguments. Overall, for all his combativeness, he is notably vaguer in his assertions than previously. He ends on this petulant note:

I end by noting that Rao et al., in particular, might feel grateful that they were given an opportunity to respond in this forum. My colleagues and I were not so lucky: when we wrote a letter to Science outlining our objections to the original paper, the magazine refused to publish our letter, citing “space limitations”. Fortunately Computational Linguistics is still open for the exchange of critical discussion.

The openness of CL is to be applauded, but I can think of some additional explanations for why Computational Linguistics allowed the response while Science did not. One is that the Science paper by Rao et al. was not a vicious personal attack on another set of researchers, and as such, did not merit a "rejoinder" unless it could be shown that the paper was wrong. Another may have been the quality of Rao et al's response on this occasion (Sproat could, if he liked, offer us a basis for comparison by linking his rejected letter to Science) [update 17/09/10: here].

I don't expect this exchange in a scholarly journal to end the argument, but perhaps the participants can take a break now.

(*) UPDATE 17/09/2010: Rajesh Rao writes:

By the way, the reason that Fortran was included in Fig 1B rather than 1A is quite mundane: a reviewer asked us to compare DNA, proteins, and Fortran, and we included these in a separate plot in the revised version of the paper. Just to prove we didn't have any nefarious designs, I have attached a plot that Nisha Yadav created that includes Fortran in the Science Fig 1A plot. The result is what we expect.

The plot is below; the blue squares are the Fortran symbols.

Rajesh also remarks that the Bayesian posterior probability estimates -- that I derived from the bigram graph in the Science paper -- can probably be sharpened from the newer block entropy results. However, since Sproat makes it clear that he is only addressing the Science paper and is unwilling to let later work influence his perception, I think it's worth pointing out that the data in the Science paper are already rather convincing.