Saturday, April 25, 2009

The Indus Valley script

A paper just published online in Science magazine (subscription required for full text) makes a significant contribution to the debate about whether the Indus Valley "script" was really a script for a language, or was a mere nonlinguistic symbol system. The authors consider the statistical properties of texts in known languages and non-languages (in particular, the conditional entropy) and compare them with a large corpus of Indus Valley text. The conclusion is that the Indus Valley symbols indeed encode a language. It looks extremely convincing. Now the task remains to decode the thing.

I'm a bit slow to blog on this because I wanted to read the paper first: it turns out it's very short (barely a page plus references and figures in the ScienceExpress format, and no doubt the final published version will be even more compact.) One of the authors is my colleague Ronojoy Adhikari and I had a brief lunch-room summary from him which pretty much covers the paper. (A longer summary from Ronojoy is here, courtesy Rahul Basu.) It has attracted a fair bit of media attention (km has a nice summary and a couple of links).

A couple of points are striking: first, of the languages compared, the conditional entropy for the Indus scripts seems closest to Old Tamil, which is suggestive given the belief that the Indus Valley residents may have been Dravidians and thus the ancestors of the ancient Tamil people. (Rig Vedic Sanskrit is plotted in only one of the two sub-figures, unfortunately -- the one on relative conditional entropy. Given its geographic, though probably not temporal, proximity to the Indus Valley civilisation, it would have been nice to see it in the other plot too.) Second, when comparing with non-linguistic systems, the difference is extremely stark. And convincing. One could imagine a primitive linguistic system falling somewhere between the "real languages" and Fortran, or between the "real languages" and Vinca. But in this case, the Indus script curve is bang on top of the other "real language" curves, and well separated from everything else.

I'm one of those who believes that "interdisciplinary research" will tell us more and more in the future, and this is an example of computer scientists, physicists and linguists succeeding in putting together a few simple ideas that tell us a great deal. (It is not Ronojoy's only recent interdisciplinary foray; primarily a "soft condensed matter" physicist, he also recently published an article [free arxiv preprint] on the harmonics produced by loading the membrane in Indian drums. The topic is of interest to me -- I did a project on it long ago as an undergraduate -- but I will perhaps talk about it some other time.)


gaddeswarup said...

Off topic. I wonder whether you can comment on another paper in the same issue of Science and reported in Sciencedaily:
on the mystery of horse domestication. This may have some implication on the original home of Aryans.

Rahul Siddharthan said...

Thanks. The article only says horses are likely to have been first domesticated around 5000 BC in the Eurasian steppe region. They base this on the diversity of colours that developed there, and in particular, high selection coefficients for two colour-related genes that can best be explained by selective breeding. This time period is much older than the Indus Valley civilisation or the Aryan invasion. There is lots of other good evidence that the Aryans came to India from the Caspian Sea area, but I don't regard the horse domestication argument as very convincing (though maybe it can be improved?) Aryans could have domesticated horses independently, or acquired them from traders.

Anyway, there is a big and insistent "the Indus Valley civilisation people were the ancient Aryans" crowd and they tend not to be moved by any kind of scientific or other arguments. I predict that even if follow-up work on the Indus script shows a grammar extremely similar to ancient Tamil and extremely different from ancient Sanskrit, it will not change their minds in the slightest. It is a religious faith.

gaddeswarup said...

Thanks. I think that subject, object, verb occur in different orders in the sentences of Samskrit and Dravidian languages. Does this research give any hints about this?

Anonymous said...

About Aryan migration: Many recent papers (e.g. here and here) suggest that, genetically, there was no major migration to India for the past 10000 years or so. How do you explain the "Aryan migration" then ?

As far as I understand, there are many open questions and we should approach them with an open mind. Without any prejudice.

km said...

Thanks for blogging this, Rahul.

"Second, when comparing with non-linguistic systems, the difference is extremely stark. And convincing."

This is what makes me curious about this whole business of figuring out what is and what is not a language.

For example, if a pattern analyzer were to be "applied" on a set of icons that are closely related (like say, traffic signs on a highway), how does the expert know that the icons are related but do not constitute a language?

km said...

To the Anonymous 5:55 PM: No prejudice.

Sadly, most linguistic research (by Indians) available on the web seems to be the handiwork of, well, let's just say, not very unprejudiced people and the work itself is of very dubious quality.

Anonymous said...

Does Ronojoy have anything to say about the rebuttal from Witzel, Farmer and Sproat mentioned in Rahul Basu's post?

Anonymous said...

I am the previous anonymous again. I have gone through the brief two-page rebuttal of Witzel, Farmer and Sproat. I would like to note that they accuse Rao et al of intellectual dishonesty, no less. This is a very serious charge.

Interested authors can check the rebuttal for themselves by going to Steve Farmer's website at

Rahul Siddharthan said...

gaddeswarup - I think they are very far from that sort of analysis. Right now they are only looking at some very basic statistical properties. Grammar, syntax, figuring out which symbols are nouns and which are verbs, etc, will come much later and may not be amenable to such methods...

anonymous 555 - thanks, but as far as I can tell, those papers only say foreign invasions are small in number, not that they did not happen (we know that they did happen, repeatedly, over centuries up until recent history.) They say nothing about the possible social or cultural impact of such invasions (which again was very significant, both with the Muslim and the British invasions).

anonymous 1148: well, Farmer et al have the decency not to be anonymous. Rahul B calls their writeup "intemperate"; I would use a stronger word. Their main objection seems to be that these authors are using a metric (conditional entropy) that has not been used before in such studies. I think the metric makes a lot of sense, regardless of whether it has been used before. They accuse the authors of Dravidian nationalism when only one author is Dravidian (Mahadevan, whose role seems to have been in supplying the texts, not so much in the mathematical analysis).

I do somewhat agree with the objection that they rely on synthetic type 1 and 2 systems in the first subfigure, and also that the methodology is insufficiently clear in the main paper and one has to read the supporting data to understand it. I don't agree that the conclusions are useless and certainly not with your "intellectual dishonesty" charge.

I think the problem is that the known "type 1" (Vinca) and "type 2" (Sumerian deity symbol system) do not have sufficiently large corpuses to permit this sort of statistical analysis, so the authors generate larger synthetic corpuses with (as far as one knows) the same statistical properties. This is not unusual in many fields (including mine, computational biology) but perhaps it is unusual in linguistics. It is not dishonest as long as one makes it clear what is going on; one could argue that the authors failed to do so sufficiently in the main text, but it is all there in the supporting text.

Ronojoy said he would send me a response. After going through everything more carefully I may make another blog post.

JK said...


Here are two other responses

Falling for the magic formula and

Conditional entropy and the Indus Script

Anonymous said...


Here's what Farmer et al say in their rebuttal: "On pages 2-3 of the online Supplemental Information section of their paper, we find to our surprise -- in contradistinction to what they say in the paper itself -- that this claim is not based on a comparison of Indus signs with real-world nonlinguistic systems, but with two wholly artificial systems invented by the authors, one consisting of 200,000 randomly ordered signs and another of 200,000 fully ordered signs, that they spuriously claim represent the structures of all real-world nonlinguistic sign systems (which they refer to as Type 1 and Type 2)."

Now, I read this as a charge of intellectual dishonesty. If you disagree, fine. I may be anonymous but in these circumstances what does that have to do with anything? I am not personally making any such charge: I am not competent enough to do so. But the words used by Farmer et al are quite strong and disturbed me, which is why I posted here. If I offended you, apologies.

Incidentally, Farmer et al's main point, I think, is that one can always find statistical regularities in such comparisons which can be startling. Hence, relying on them can be very misleading. Their rebuttal mentions that their original paper on the Indus Valley script discusses this point.

In a very different context, in 1994, Witztum, Rips and Rosenberg published a paper in the peer-reviewed Statistical Science where they used statistical techniques to make the extraordinary claim that the Book of Genesis "encodes" events that did not occur until much later -- thus hinting, without saying so, at the "divine" origin of the Torah. In a subsequent paper in the same journal, Brendan McKay, Dror Bar-Natan, Maya Bar-Hillel and Gil Kalai debunked the study, pointing out that (i) the alleged results depended on particular choices made for the statistical experiments, and alternate but equally plausible choices do not give the results that WRR claim, and (ii) any sufficiently long text will contain startling statistical regularities. They actually take Tolstoy's War and Peace and show this.

For the 1994 study, you will have to look up the journal, but the 1999 debunking is available online in working paper form at

(Go to the link for the Discussion Papers; the paper is #196.) Having known about the "Bible codes" controversy, as the above incident was known, I will confess I am sympathetic to this point of Farmer et al and, at this stage, biased in their favour. But, of course, I'd be interested to know what Rao et al have to say.

Rahul Siddharthan said...

JK - thanks for the links. They are better written than the Farmer rant, but I think they are all missing the point of the Rao et al paper.

anonymous - will reply in a separate post. But, briefly, "other people have done bad statistics so this statistics could be bad" is not an argument. The paper's claim is, essentially, this: "the information content (or conditional entropy) in pairs of tokens is different, in languages, from what would be expected in random sequences or highly-ordered sequences." The authors argue that a group of symbols that are not of linguistic significance should not exhibit this sort of "bigram" structure. I don't notice Farmer et al, or the blogs that JK links to, providing counterexamples. (DNA sequence does have significant "dinucleotide" correlations, and protein sequences are highly nonrandom, but it appears that both of these are still much closer to random than human languages are. km's example above of road signs would be an interesting study: one would expect some correlations and predictability there, but I don't know whether it would compare with languages).

I also agree with Liberman's comment on his blog that "this is a topic that traditionally generates more heat than light."

Dr. Ronojoy Adhikari said...

Rahul and others: We are not going to respond to ad hominem attacks from Farmer and co.

The non-linguistic Type 1 and Type 2 are controls; the comparison in Fig. 2 is with real-world data sets which are like the controls - DNA with a lot of variability and high entropy, Fortran code with fairly rigid order and low entropy. The controls are limiting cases of the real-world data: Type 1 has no correlation, while Type 2 is maximally correlated. In Fig. 2, they represent the two extremes. Our conclusion would still be valid if we deleted the controls. Comparing more datasets is part of ongoing work.
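To make the two limiting cases concrete, here is a minimal sketch in Python. The generators are assumptions chosen for illustration (independent uniform draws for a Type 1-like control, a fixed repeating cycle for a Type 2-like control) and are not the construction used in the paper. The names `type1_control`, `type2_control` and `conditional_entropy` are hypothetical, and the 20-sign alphabet is arbitrary.

```python
import random
from collections import Counter
from math import log2

def conditional_entropy(signs):
    """Estimate H(B|A) = -sum over pairs of P(ab) * log2 P(b|a), in bits."""
    pairs = list(zip(signs, signs[1:]))          # adjacent sign pairs
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)  # how often each sign leads a pair
    total = len(pairs)
    h = 0.0
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / total
        p_b_given_a = n_ab / first_counts[a]
        h -= p_ab * log2(p_b_given_a)
    return h

def type1_control(n_signs, length, seed=0):
    """Assumed stand-in for Type 1: every sign drawn independently at random."""
    rng = random.Random(seed)
    return [rng.randrange(n_signs) for _ in range(length)]

def type2_control(n_signs, length):
    """Assumed stand-in for Type 2: each sign fully determines the next one."""
    return [i % n_signs for i in range(length)]

N_SIGNS, LENGTH = 20, 200_000  # illustrative sizes only

h_random = conditional_entropy(type1_control(N_SIGNS, LENGTH))
h_ordered = conditional_entropy(type2_control(N_SIGNS, LENGTH))
h_max = log2(N_SIGNS)  # upper bound for an alphabet of N_SIGNS signs

print(f"random control:  {h_random:.3f} bits (maximum {h_max:.3f})")
print(f"ordered control: {h_ordered:.3f} bits")
```

Under these assumptions the random control should come out close to the maximum of log2(N_SIGNS) bits and the cyclic control at zero, so any real data set would be expected to fall somewhere between the two.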

Anonymous said...

"I also agree with Liberman's comment on his blog that 'this is a topic that traditionally generates more heat than light.'"

Absolutely; which is why I'll stop here.

The crux of the matter - as I see it - relates to the following two statements. Rahul attributes to Rao et al the claim that "a group of symbols that are not of linguistic significance should not exhibit this sort of 'bigram' structure." On the other hand, Farmer et al say: "In our own paper, in fact, we showed that striking overlaps exist between Indus sign frequencies, frequencies in medieval heraldic signs, and in a variety of natural languages [2]. It can be demonstrated that many statistical overlaps exist in symbol systems in general, not just in those that encode speech."

If I understand things right, Rao et al seem to argue that while one might find some statistical regularity between different scripts (whether or not they encode speech), the particular type of statistical regularity uncovered by them is a sufficient condition for distinguishing writing that encodes speech from writing that does not. Farmer et al disagree.

As I said, I tend to side with Farmer et al at this stage. This is because I am not sure of the validity of the claim that Rahul attributes to Rao et al. Is it a theoretical claim based on some linguistic theory? Or is it a well-known and accepted empirical regularity? Part of the objection seems to be that the comparison is restricted to four written languages and four symbol systems which do not encode speech. Of the latter four, two are "artificial." Perhaps Ronojoy or someone could clarify these issues.

Finally, the WRR piece is a cautionary tale. Note that Eliyahu Rips is actually a Professor of Mathematics at the Hebrew University and so he's no fool. (That might explain why the editors of Stat. Sci. took it seriously in the first place.) The WRR paper, as McKay et al acknowledge, is in a league of its own in comparison to other, similar work which also claims to have uncovered proof of the divine origin of the Bible, Koran etc. Unlike the others which are easily refuted, this one's pretty sophisticated and illustrates that uncovering statistical sophistry is not straightforward. It's a reason - for me, at least and I guess you disagree - to be sceptical of Rao et al's claims at least for the time being.

I'll leave it there.

Dr. Ronojoy Adhikari said...

I will only comment on what is the most important difference between Farmer's analysis and ours: Farmer et al concentrate on one-point probability distributions, P(a), the frequency with which the a-th sign is used. This quantity can never capture correlations between signs. We look at the two-point probability distribution P(ab), the probability of the sign pair "ab". Divide this by P(a), and we get P(b|a), the conditional probability of b, given a. This can, and does, capture correlations.
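The distinction can be seen on a toy example. The sketch below is illustrative only and is not the authors' code: the function names `one_point` and `conditional` and the two 16-sign sequences are made up for the purpose. Both sequences use the signs A and B equally often, so P(a) cannot tell them apart, while P(b|a), estimated from adjacent pairs, can.

```python
from collections import Counter

def one_point(signs):
    """P(a): relative frequency of each sign."""
    counts = Counter(signs)
    total = len(signs)
    return {a: n / total for a, n in counts.items()}

def conditional(signs):
    """P(b|a) = P(ab) / P(a), estimated from adjacent sign pairs.

    With counts taken over pairs, the ratio reduces to
    count(ab) / count(pairs starting with a).
    """
    pairs = list(zip(signs, signs[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Two hypothetical sequences with identical sign frequencies (8 A, 8 B).
ordered = list("ABABABABABABABAB")  # A is always followed by B, and B by A
mixed = list("AABBABBAABABBBAA")    # same signs, no strict alternation

print(one_point(ordered) == one_point(mixed))  # frequencies are identical
print(conditional(ordered))                    # every transition is certain
print(conditional(mixed))                      # transitions are spread out
```

In the strictly alternating sequence every conditional probability is 1, whereas in the mixed sequence each sign can be followed by either sign; the one-point frequencies are the same in both cases.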

Rahul Siddharthan said...

Ronojoy -- you know, that thought occurred to me when reading their rant: are they talking about single-symbol frequencies, and are they really trying to compare that with what you did? I'm at a loss for words. (But as soon as I recover my vocabulary, I will write an updated post.)

Dr. Ronojoy Adhikari said...

They indeed are!

நா. கணேசன் said...

>7. You are dismissive of the Indus-Dravidian
>hypothesis because of the geographical distance
> between the Indus Valley sites and present-day
>Tamil Nadu, as well as the temporal distance
>between the Indus civilisation and the earliest
>recorded Tamil. But there is a Dravidian-origin language,
>Brahui, still spoken in parts of Pakistan and some
>neighbouring areas.

While some like Hans Hock take Brahui to be
recent immigrants (say, 1000 years ago),
David McAlpin has given the linguistic data
that argues for an ancient date for Brahui
in Pakistan.

David McAlpin has shown that Brahuis have been in their current location
for millennia, and were the first to branch off from Proto-Dravidian.
David W. McAlpin, Velars, Uvulars, and North Dravidian hypothesis
JAOS, vol. 123, pp. 521-546, 2003.
In any case, Gujarat is full of Indus civ. sites
and traditionally one of the five Dravida lands
(pancha-dravida countries, see the paper by
Madhav Deshpande, Univ. of Michigan about
Pancha-Gauda and Pancha-Dravida lands).
I can send the pdf if you give me your email.

The Dravidian substratum names in Sindh and Gujarat
are the main reasons for their presence in Indus
civilization. For the place names originating in
"paLLi" in Sindh etc., see Franklin Southworth,
also Parpola (1994) books. One substratum
effect (consonant assimilation) between old Tamil and Indo-Aryan,
which probably occurred when Aryans
met Dravidians in the Indus valley send me your email address.

N. Ganesan

>Some scholars believe that
>the Elam language, roughly contemporaneous with
>the Harappans, was also of the Dravidian family.
>Finally, the known continuous history of Tamil is much
>longer than the gap that you object to between the
>Indus valley and Old Tamil. It does not seem such a
> stretch to think the Indus language may have been
>a Dravidian language.

kalyan97 said...

Rahul Siddharthan said...
Kalyanaraman: I'd say a script must be a written representation of the spoken language, such that (a) any spoken text in that language may be written, and (b) that writing can then be read by anyone familiar with the script, to reproduce the spoken text. There may be some limitations and ambiguities (in today's scripts these mainly arise with foreign words), but mostly, I think this should be a requirement… 5/01/2009 12:23 PM

Kalyanaraman: Rahul, if this definition is agreed among the blog posters, Richard Sproat should revise his views and concede that Indus script is a writing system. I say it is a writing system conveying messages about metallurgical repertoire of early smiths and mine-workers.
I suggest that the mathematicians on this blog should also respond to Fernando Pereira

kalyan97 said...

Dr. Ronojoy Adhikari said...
I will only comment on what is the most important difference between Farmer's analysis and our's : Farmer et al concentrate on one-point probability distributions, P(a), the frequency with which the a-th sign is used. This quantity can never capture correlations between signs… We look at the two point probability distribution P(ab), the probability of the sign pair "ab". Divide this by the P(a), and we get P(b|a), the conditional probability of b, given a. This can, and does, capture correlations. 27/2009 7:07 PM
Kalyanaraman: Brilliant summary by Ronojoy. The next step is to analyse the nature of a writing system which can explain P(b|a). I say that b|a of the writing system is explained by the repertoire of only one semantic category: mine-work and smithy. Let me add one more area of correlation: invention of alloying in metallurgy and invention of a writing system necessitated by the metallurgical invention.

நா. கணேசன் said...


For the Indus connection with South India,
see the papers by Iravatham Mahadevan

Especially, the bowl from Sulur (Sulur, near Coimbatore)
now preserved at the British Museum, London. Ganesan

venkat said...

There is no doubt in my mind that the entire superstructure built on the theories of Indus language and script is going to be a colossal failure. One thing alone seems incontrovertibly clear - that not one of the seals carries a full grammatical sentence; not one of them carries, either, any account of social or cultural episodes. Unlike the Cuneiform clay tablets or the Egyptian papyri, which record daily life transactions, there is nothing, but nothing, to make use of in the decipherment of the Indus "writing". To an unbiased mind it appears as nothing but the symbols used in the modern warehouses of great trading companies of the world. To give all kinds of interpretations based on so-called computer models assumes a prior supposition; such are pointless endeavours leading nowhere.

Years ago a smart Bengali student of mine brought over to me a copy of Mortimer Wheeler's book on Indus Valley Civilisation, and pointed out, to my great surprise, that one particular seal featuring a bullock cart contained every part of the bullock cart; there was not one scratch in excess, not one short of need. So it seems challengingly necessary to adopt some other model to understand these seals.

Until then all talk of deciphering is just a flight of fancy gone too far and will be a waste of effort!!

Venkat Iyer

kalyan97 said...

I have a comment to add to Venkat Iyer's. My decoding is simple. The entire corpus consists of mleccha (copper) smith guild tokens. The small ebook is at

kalyan97 said...


From this corpus, can you identify the epigraph which your student showed you?

Or, are you referring to a copper model of a cart found at Chanhudaro -- Mackay called it Sheffield of Ancient India? See photo at


kalyan97 said...

This is the copper cart model found at Chanhu-daro, called the Sheffield of Ancient India by Ernest Mackay (1936, Illustrated News of London)