Monday, April 27, 2009

Indus: What did Rao et al. really do?

My last post was on the Indus script paper by Rao et al., just published online in Science. Several reactions have appeared. A significant fraction of those following the story will have seen the tantrum by Steve Farmer, Richard Sproat, and Michael Witzel, which I will discuss further below. I haven't spent much time surfing the blogosphere, but two more skeptical comments (thanks to JK, commenting on my previous post) are here (Mark Liberman) and here (Fernando Pereira). The most insightful comment that I have seen is Liberman's "This is a topic that traditionally generates more heat than light".



Before understanding the reactions it is important to understand the background and the work of Rao et al. The background is that, for over a century now, it had been assumed that the tablets containing seals found in the Indus valley archaeological sites contained writing in an unknown, pictographic script; and much effort has been devoted to deciphering the script. Then Farmer, Sproat and Witzel (authors of the above screed, and long-time researchers in the field) published a paper in 2004 arguing that the scripts do not encode language but are some form of non-linguistic symbolic writing. (Actually, they don't so much argue it as assert it. More on that below.) That paper is 39 pages long, containing 29 pages of text followed by many references, and is in fact a useful read in summarising the existing state of the art, even if one disagrees with its conclusions.

Rao et al. disagree with its conclusions, in their paper in Science that contains one page of text, a couple of figures, and about 15 pages of supporting data that mainly describes their methodology. The methodology is, to my astonishment, apparently new in this field, although certainly not in computational linguistics.



Basically, the method is to model language as a Markov chain. A Markov chain is a sequence of "things" (words, letters, events) with the following property: the probability of any individual event depends on its predecessor only, not on the entire previous sequence. For example, if a DNA sequence were generated by a Markov process and you saw "ACAGTGAC", the next nucleotide would be determined (probabilistically) only by the final nucleotide in this sequence, C, and not by the others. In a generalised (nth-order) Markov chain, each event depends on its n immediate predecessors, where n is 1 or more but usually not very large.
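The definition above can be made concrete with a minimal sketch in Python. The transition probabilities in `TRANSITIONS` and the function name `sample_markov_chain` are invented here purely for illustration; they are not estimated from real DNA and do not come from Rao et al.

```python
import random

# Hypothetical transition probabilities, invented for illustration only.
# TRANSITIONS[x][y] is the probability that y follows x.
TRANSITIONS = {
    "A": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "C": {"A": 0.3, "C": 0.2, "G": 0.1, "T": 0.4},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1},
}

def sample_markov_chain(transitions, start, length, rng):
    """Generate a first-order Markov sequence beginning with `start`."""
    sequence = [start]
    while len(sequence) < length:
        # Only the most recent symbol matters; earlier history is ignored.
        options = transitions[sequence[-1]]
        next_symbol = rng.choices(
            list(options), weights=list(options.values()), k=1
        )[0]
        sequence.append(next_symbol)
    return "".join(sequence)

print(sample_markov_chain(TRANSITIONS, "A", 20, random.Random(0)))
```

The key line is the lookup `transitions[sequence[-1]]`: the distribution of the next symbol is chosen by the last symbol alone, which is exactly the Markov property.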

Imagine you had never seen writing in English before, and were confronted with it for the first time. (Imagine, also, that you had figured out that uppercase and lowercase letters represent the same thing.) You may quickly find that some letters (e) occur more frequently than others (z). But if you generate random text with frequencies that agree with what one sees in English, the result would look nothing like English.

Looking more closely, you may observe some anomalies: if you see a "q", the following letter is almost always a "u". If you see "i" and "e" together, the "i" comes before the "e" except when after a "c" (as in "receive"). You may even note some weird exceptions. Though "a" is a common letter (the third most common, after "e" and "t", by many counts), "aa" is a very rare combination in English, though "ee" and "tt" are quite common. Similarly, "ae" is much rarer than "ea". None of these observations can be accounted for by letter frequencies alone. But most of them can be fully encompassed in a first-order Markov model. (The "i before e except after c" rule is more complicated, since it depends on the two preceding letters.)

In general, if two letters (say A and B) appear with frequencies P_A and P_B, a random sequence would contain each of "AB" and "BA" with frequency P_A P_B. If you do not observe this (and, in English and all other languages, we do not), we can assume that the sequence is not random. The first-order Markov model is the next simplest assumption, and is often adequate.
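A rough way to run this check on any text is sketched below in Python. The function name `digram_ratio` and the sample sentence are assumptions made for illustration; the ratio it returns is the observed digram frequency divided by the product of the two letter frequencies, so a value far from 1 signals correlation.

```python
from collections import Counter

def digram_ratio(text, a, b):
    """Observed frequency of the digram a+b divided by P_a * P_b."""
    # Keep letters only. Note this joins words together, so a few digrams
    # that straddle word boundaries are counted as well.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    unigrams = Counter(letters)
    if unigrams[a] == 0 or unigrams[b] == 0:
        raise ValueError("both letters must occur in the text")
    digrams = Counter(zip(letters, letters[1:]))
    p_a = unigrams[a] / len(letters)
    p_b = unigrams[b] / len(letters)
    p_ab = digrams[(a, b)] / (len(letters) - 1)
    return p_ab / (p_a * p_b)

# Hypothetical sample text; any reasonably long English text would do.
sample = "the queen quietly quit the quarrel and requested a quick answer"
print(digram_ratio(sample, "q", "u"))
print(digram_ratio(sample, "u", "q"))
```

On a long English text one would expect the "qu" ratio to be far above 1 and the "uq" ratio to be close to 0, which is the asymmetry that letter frequencies alone cannot produce.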

Shannon, in his classic 1948 paper on information theory, actually uses Markov models of English to construct pseudo-English sentences (using 26 letters and a space as his 27 symbols). A completely random string, with all symbols equally probable, looks like
"XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD."
A string maintaining actual frequencies of symbols, but with no correlations, looks like
"OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL."
A first-order Markov model yields
"ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE."
A second-order Markov model yields
"IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE."
The point is not whether these make sense but how much they look like English at first glance. A first-order Markov model is clearly a dramatic improvement on random models, and a second order Markov model is even better. Shannon goes on to Markov models that use words rather than letters as individual symbols, and in this case a first-order Markov model already gives something that looks grammatically correct over short distances:
"THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED."
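Shannon's letter-level experiments are easy to imitate. The following Python sketch is not Shannon's own procedure (he worked by hand from printed text and frequency tables); it is one simple way to get the same effect, with `build_model` and `generate` as illustrative names and a toy training string standing in for a real corpus.

```python
import random
from collections import defaultdict

def build_model(text, order):
    """Map each run of `order` characters to the characters seen after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, order, seed_text, length, rng):
    """Extend seed_text (at least `order` characters long, order >= 1)."""
    out = seed_text
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:
            break  # this context never occurred in the training text
        # Picking uniformly from the list reproduces observed frequencies,
        # because repeated followers appear repeatedly in the list.
        out += rng.choice(followers)
    return out

# Toy training text, an assumption for illustration; use a long text in practice.
corpus = "THE POINT IS NOT WHETHER THESE MAKE SENSE BUT HOW MUCH THEY LOOK LIKE ENGLISH "
rng = random.Random(0)
print(generate(build_model(corpus, 1), 1, "T", 60, rng))
print(generate(build_model(corpus, 2), 2, "TH", 60, rng))
```

With `order=1` this is a first-order model over letters and spaces; raising `order` to 2 gives the second-order case. Feeding it a list of words instead of a string of characters would need only small changes.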

My colleague Ronojoy Adhikari, who is a co-author of the Rao et al. paper, points out to me that Shannon was not the first to try such exercises: Markov himself preceded Shannon by about 30 years.



What Rao et al did was, essentially, to assume that the Indus scripts are generated by a first-order Markov process. In the light of what we know about languages, this may seem a rather obvious thing to do. They use a measure called "conditional entropy" (that, again, stems from Shannon's work) to measure the extent of first-order correlations; and a "relative conditional entropy" that compares the conditional entropy to that of a random sequence with the same frequencies of individual symbols. A more correlated sequence has lower conditional entropy, so the relative conditional entropy lies between 0 (each symbol fully determined by its predecessor) and 1 (no correlation at all).
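For readers who want to try these measures, here is a minimal Python sketch. It uses the simplest plug-in estimate from raw counts, which is an assumption of this sketch; the estimator described in the paper's supporting material may well differ. It also takes "relative" to mean division by the unigram entropy, which is the conditional entropy of an uncorrelated sequence with the same symbol frequencies. The function names are illustrative.

```python
import math
from collections import Counter

def unigram_entropy(seq):
    """Shannon entropy of single symbols, in nats."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def conditional_entropy(seq):
    """H(next | previous) = -sum P(x, y) log P(y | x), in nats."""
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])  # how often each symbol has a successor
    n = len(seq) - 1
    return -sum(
        c / n * math.log(c / firsts[x]) for (x, _), c in pairs.items()
    )

def relative_conditional_entropy(seq):
    """Ratio to the uncorrelated case; undefined for a one-symbol alphabet."""
    return conditional_entropy(seq) / unigram_entropy(seq)

# A perfectly alternating sequence is maximally correlated.
print(relative_conditional_entropy("ABABABABABAB"))
```

For the alternating sequence every symbol is fully determined by its predecessor, so the conditional entropy is zero even though the unigram entropy is log 2.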

What they find is that the conditional entropy for the Indus script is very similar to that of known languages, and very different from that of the non-linguistic symbol systems they consider.



What are the criticisms? Let us look first at the rant from Farmer et al. In comments to my previous post, Ronojoy refuses to dignify it with a response; but lay readers may be interested in my point of view anyway.

Farmer and colleagues do have one genuine criticism (which, in my opinion, is not a serious problem with the paper). But they bury it under so much misleading and simply ad hominem rubbish that it is better to clear that refuse heap first.

First, they say:

"It is important to realize that all their demonstration shows is that the Indus sign system has some kind of rough structure, which has been known since the 1920s. Similar results could be expected if they compared their artificial sign sets to any man-made symbol system, linguistic or nonlinguistic. Our paper in fact made much of this point and also gave examples of striking statistical overlaps between real-world (not invented) nonlinguistic and linguistic systems and argued that it is not possible to distinguish the two using statistical measures alone."

In fact, the "rough structure" that they discuss in their paper, that has been "known since the 1920s", is only the fact that some symbols occur more often than others! Correlations among successive symbols are completely ignored. The last sentence quoted above refers, I think, to their "Figure 2" which deals only with frequencies of individual signs, not with any kind of correlated structure. Yes, some Indus symbols are more frequent, and some heraldic blazons are more frequent, than others. It is also true that some road signs are more common than others (credit). Of course that does not tell us anything about whether a sequence of road signs constitutes a language. This is not at all the claim that Rao et al. are making, and it staggers me that the point is being missed so widely.

Another charge that Farmer et al level at Rao et al is that of Dravidian nationalism. Reading the surnames of the authors (only one seems to be of Dravidian origin), it is a comical accusation.

There's more along the same lines. But one accusation is true: Rao et al are misleading in their main text on what non-linguistic systems, exactly, they are comparing the Indus script with. They plot a "type 1 system" and a "type 2 system" in the first part of their figure; only on reading the supporting text does one learn that these are not actual corpora of non-linguistic symbol systems, but synthetically generated ones. (The second half of that figure does contain some genuine non-linguistic symbol systems.)

Ronojoy responds in a comment:

The non-linguistic Type1 and Type2 are controls; the comparison in Fig.2 is with real world data sets which are like the controls - DNA with a lot of variability and high entropy, Fortran code with fairly rigid order and low entropy. The controls are limiting cases of the real world data: Type1 has no correlation, while Type2 is maximally correlated. In Fig 2, they represent the two extremes. Our conclusion would still be valid if we deleted the controls. Comparing more datasets is part of ongoing work.

This could have been made clearer in the main text -- but it is made clear enough in the supporting data. I would have liked to see the real world data in figure 1 too (which is the more striking figure). And that's my only disappointment with the paper. But I expect that it will be rectified in future work by these authors.



Was this work trivial? The reaction of Liberman seems to be: "Huh, they only counted digrams?" Yes, to anyone familiar with Markov processes (that includes huge swaths of modern science) it is trivial. But apparently nobody had done it before! To me, that is the value of interdisciplinary research: what is obvious to one community may be new to another, and the results may be far-reaching.

[Update 28/04/09: small typo corrected. Also, see comments below from Mark Liberman and myself.]




[Update 01/05/09: comments closed.]

56 comments:

myl said...

Actually, my reaction was that their figure 1A -- which seems to be the main argument offered for the script-like nature of the inscriptions -- doesn't demonstrate any properties that actually depend on digram statistics at all. Exactly the same sort of plot would arise from any random process with approximately the right number of symbols and the right (unconditioned) entropy -- as is shown in the little simulation at the bottom of my post.

In such a case, the conditional entropy will behave in just the same way as the unconditional entropy -- and both will look exactly like the functions in the middle of their figure 1A, which they suggest will only arise from a script-like process.

As for the difference cited in the supplementary material between the unigram entropy and the conditional (bigram) entropy, it's also quite easy to imagine non-linguistic processes that would generate such a difference.

So I'm quite puzzled about what the logic of the article is supposed to be.

km said...

Fascinating!

I am still soaking it all in.

Sunil said...

Fascinating stuff Rahul, thanks for posting it. And thanks for the Markov refresher too.

Rahul Siddharthan said...

Mark - I suppose I need to do my own simulation to answer your question. But your graph, and Sproat's, don't actually look very similar to me to the published graphs of Rao et al -- and, puzzlingly, are noticeably lower. Your saturating value is about 2.5 whereas all their graphs go above 3. One would always expect conditional entropies to be below unconditional entropies, if the distribution is the same, so I can only guess that your Zipf distribution is not a very good model for languages. As for Cosma's graph with a geometric distribution, to me it seems to agree somewhat with Rao et al's type 1 graph (conditional entropy for uncorrelated data, which should be the same as unconditional entropy). It's hard to say because Cosma cuts off at N=60 which includes only about 3 points. But both graphs cross 3 at about N=30 and reach about 3.5 or 4 at N=60. Cosma's graph saturates (at a value higher than the language graphs) and Rao et al's doesn't, but this could be due to differences in statistics of the uncorrelated distributions.

Rao et al describe their construction of the type 1 data as "dataset of 10,000 lines of text, each containing 20 signs, based on the assumption that each sign has an equal probability of following any other." The alphabet size is the same as in the Indus texts, but if I read the above correctly, the frequencies of the signs will be equal. I'm not sure why they did that: perhaps if Ronojoy is reading this, he can clarify.

Also, take a look at supplementary figure S1. This shows the unconditional entropy in Indus and other scripts, which should be the best comparison to your graphs (you, Sproat, Cosma): the only difference is that you assume a distribution of symbols and Figure S1 uses the actual Indus text. For most languages, it saturates at around 4.5 nats, which is substantially higher than what you all see, and also substantially higher than what they report for the conditional entropy. Their "type 1" graph is higher still (and, expectedly, identical to the one in the main paper). My preferred "type 1" graph, using the Indus symbols in their actual frequencies without correlations, would probably look identical to the Indus graph in S1. I think they should have plotted that graph in the main text figure: it looks pretty convincing to me.

Rahul Siddharthan said...

Mark - p.s. there is the other question of "if the conditional entropy is lower, so what?" Of course this does not prove that it is a language, but I think the fact that the Indus graph lies on top of all the other languages looks convincing. For example, if one plotted DNA, I think one would get a conditional entropy lower than the unconditional entropy but higher than what one sees for languages. (I would really like to have seen this plotted in figure 1A too!)

myl said...

Rahul: "Your saturating value is about 2.5 whereas all their graphs go above 3."

That's just a matter of what parameters you use in the simulation -- see the graph for the simulation that Cosma Shalizi did.

You can get any asymptote you want.

Rahul Siddharthan said...

Mark: I assume you mean parameters for the Zipf or geometric distributions. Which is what I said.

The bottom line is: in my opinion, the unconditional entropy for the Indus script in figure S1 is the closest thing to what you guys are trying to model with your adjustable-parameter distributions. Of course you can fit what curve you like if you pick your parameters correctly. If you chose the actual distribution of Indus symbols, you should get the same curve as in S1. The conditional entropy is significantly lower. With a random model that only preserved individual symbol frequencies, you could not retrieve the conditional entropy curve.

Put the S1 Indus curve on top of figure 1A, and the case is made.

I really don't know why the authors are making us dig through the supplementary text to learn this. Many data packets on the internet could have been saved.

myl said...

Rahul: "Also, take a look at supplementary figure S1. This shows the unconditional entropy in Indus and other scripts, which should be the best comparison to your graphs (you, Sproat, Cosma): the only difference is that you assume a distribution of symbols and Figure S1 uses the actual Indus text. For most languages, it saturates at around 4.5 nats, which is substantially higher than what you all see, and also substantially higher than what they report for the conditional entropy"

Again, you can get any number you want, depending on how you set up the simulation.

In particular, in order to get a unigram perplexity that is around 30 times higher than the bigram perplexity (corresponding to an entropy difference of around 5 bits, or 3.5 nats), you could simply choose the signs in each inscription, at random, from one of N different "urns", which might correspond to different purposes for the inscriptions. Since each sequence comes from a single distribution, the within-sequence conditional entropy will depend on the distribution of signs in each urn; but the overall unconditional (unigram) entropy will depend on the distribution of signs across all the urns.

Again, depending on the number of urns, the distribution of signs in each one, and the distribution across urns, you can get any unigram and bigram statistics you want -- even though every inscription is a sequence of unconditioned choices, once the "urn" (= hidden state) is chosen.

I'm not arguing that this is how the inscriptions were created, just that the graphs and tables in Rao et al. seem to offer no useful evidence in favor of the script hypothesis.
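The urn construction in this comment can be simulated in a few lines. The Python sketch below is one deliberately extreme special case, chosen here for illustration and not taken from the comment or from any posted code: the urns are equally likely, hold disjoint sets of signs, and each sign within an urn is equally probable. The names `simulate_urns` and `entropies` are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_urns(n_urns, signs_per_urn, n_lines, line_len, rng):
    """Each line picks one urn, then draws its signs independently from it."""
    lines = []
    for _ in range(n_lines):
        base = rng.randrange(n_urns) * signs_per_urn  # disjoint sign sets
        lines.append(
            [base + rng.randrange(signs_per_urn) for _ in range(line_len)]
        )
    return lines

def entropies(lines):
    """Pooled unigram entropy and within-line conditional entropy, in nats."""
    unigrams = Counter(s for line in lines for s in line)
    pairs = Counter(p for line in lines for p in zip(line, line[1:]))
    firsts = Counter(s for line in lines for s in line[:-1])
    n_uni = sum(unigrams.values())
    n_pairs = sum(pairs.values())
    h_uni = -sum(c / n_uni * math.log(c / n_uni) for c in unigrams.values())
    h_cond = -sum(
        c / n_pairs * math.log(c / firsts[x]) for (x, _), c in pairs.items()
    )
    return h_uni, h_cond

lines = simulate_urns(4, 5, 2000, 20, random.Random(1))
h_uni, h_cond = entropies(lines)
print(h_uni, h_cond, h_uni - h_cond)
```

In this special case the unigram entropy should come out near log(20) and the conditional entropy near log(5), even though no sign depends on its predecessor once the urn is fixed. Overlapping urns or unequal probabilities would move both numbers, which is the point being made about adjustable asymptotes.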

myl said...

Rahul: "I think the fact that the Indus graph lies on top of all the other languages looks convincing."

Then I'm afraid that you're too easily convinced. Give me the numbers that you want to match -- rather than making me eyeball them from a figure -- and I'll give you a simple non-script-like simulation that matches them as exactly as you like.

Or just wait a day or two -- Cosma Shalizi has promised to post the code for a hidden-state variant of his simulation that will match not only Fig 1A, but also the difference between unconditional and conditional entropies, based on slowly varying hidden states (changing with a probability like .01). Then you can do it yourself.

Rahul Siddharthan said...

Mark - yes, this "urn model" would bias the conditional entropy. But it would be an extraordinary coincidence if it biased things just enough to agree so beautifully with the real languages. If your alphabet contained equally probable symbols, the number of texts were large, and there was no correlation between the size of the text and the urn it was drawn from, the increased entropy would be (log N) nats (by my quick calculation). (N = number of urns). It could be anything. Likewise with the real symbol distribution and a hypothetical number of urns: the reduction in entropy could be anything. Why should it agree with what one sees for other languages?

Of course no such hypothesis can be truly proved or disproved -- perhaps the creationists are correct and God dumped everything ready-made on us around 6000 years ago -- but one should consider the simplest hypothesis. That hypothesis (assuming the analysis of Rao et al is correct) is that the strong similarity of the Indus script with other languages indicates that the script did encode a language.

I agree that it is an overstatement to say the paper "proves" anything -- but not so much of an overstatement as to say that the 2004 paper by Farmer et al "disproves" anything. This paper just makes the language hypothesis substantially more convincing than it was.
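The (log N) figure quoted in this comment can be checked with a short argument, sketched here under the urn model's own assumption that signs are drawn independently once the urn is fixed. The symbols U (the urn), S_t (a sign) and S_{t-1} (its predecessor) are notation introduced only for this sketch.

```latex
% Conditioning on more information cannot increase entropy, and given the
% urn U the sign S_t is independent of its predecessor S_{t-1}:
H(S_t \mid S_{t-1}) \;\ge\; H(S_t \mid S_{t-1}, U) \;=\; H(S \mid U)

% Hence the gap between unigram and conditional entropy is bounded by the
% mutual information between sign and urn, which is at most the urn entropy:
H(S) - H(S_t \mid S_{t-1}) \;\le\; H(S) - H(S \mid U) \;=\; I(S; U)
\;\le\; H(U) \;\le\; \ln N
```

So (log N) nats is the largest possible gap for N urns, reached when the urns are equally likely and share no signs; with overlapping urns the gap can be anywhere between 0 and that bound.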

Rahul Siddharthan said...

Mark: "Give me the numbers that you want to match -- rather than making me eyeball them from a figure -- and I'll give you a simple non-script-like simulation that matches them as exactly as you like."

Of course you can do that, but Rao et al claim to do it without such fine-tuning. If you doubt that claim, that's a separate argument and you need to justify your doubts (it is a serious accusation). Alternatively, if you believe a very general linguistic model should match without fine-tuning, you don't need numbers from me.

Rahul Siddharthan said...

ps - I don't have the numbers either (though I could probably get them). I am not part of this project.

myl said...

Rahul: "Of course you can do that, but Rao et al claim to do it without such fine-tuning."

I'm really puzzled by this comment. They estimate unigram and bigram entropies of a corpus, so of course they get whatever numbers they get. There's no predictive model involved at all.

And what they found is that the perplexity associated with the choice of each sign, given knowledge of the preceding one, is about 30.

Their graph 1A compares this with two non-linguistic processes that have very different entropies from this.

My point is that this comparison is misleading and indeed meaningless, because a non-linguistic process could have whatever entropy you like.

Rahul Siddharthan said...

Once again, let's forget the non-linguistic processes in 1A and look ONLY at the unconditional entropy for Indus in S1 versus the conditional entropy for real languages in 1A.

(As I said earlier, I would have favoured using actual frequencies of Indus symbols in their "type 1" non-linguistic plot, rather than make all symbols -- or all transitions -- equally probable. If they had done that, I believe they would have obtained exactly what they show for Indus in figure S1.)

Now, all the above plots -- unconditional entropy for Indus, conditional entropy for Indus + all real languages -- are based on actual data with no parameters or finetuning. The conditional-entropy Indus plot fits very well with the other languages, and is substantially lower than the unconditional-entropy Indus plot.

Anonymous said...

I am (very likely) missing something here. In Figure S1, the "Nonling Type 1" seems to lie above the rest which are roughly clustered. When you move to Figure 1A, what I see is that "Nonling Type 1" is above everything else (as before) but "Nonling Type 2" lies below everything else. So, in moving from Figure S1 to Figure 1A, the position of "Nonling Type 2" has changed, but it seems to me that nothing else has changed. So, how exactly does the comparison throw light on the Indus script?

The only way I can make sense is that there is an unstated claim that any non-written script *must* be "close" to "Nonling Type 1" or "Nonling Type 2." Since in Figure 1A, the curve for the Indus script is well away from both but close to other spoken languages, it is therefore evidence that the Indus script represents a spoken language. Is that right?

The heart of the matter then is the basis of the claim that the curve for any non-spoken script must lie "close" to either "Nonling Type 1" or "Nonling Type 2". Is that something generally accepted? Perhaps Mark or someone else can clarify?

myl said...

Anonymous: "The heart of the matter then is the basis of the claim that the curve for any non-spoken script must lie "close" to either "Nonling Type 1" or "Nonling Type 2". Is that something generally accepted?"

That does seem to be the implication of the Science article -- but it certainly is not true, as you can see from the figures in my blog post:
http://languagelog.ldc.upenn.edu/nll/?p=1374

There are many simple sorts of non-linguistic processes whose measured conditional entropy would fall at any desired point between their "Type 1" and "Type 2" systems.

Ranjith said...

Rahul, Thanks for your effort to explain everything in detail. I am yet to read the papers of Farmer et al and the Rao et al. But from your discussions, would it be reasonable to say the following ?

1. Nobody has pointed out any sensible statistical measure to prove that Indus scripts do not encode a language.

2. Rao et al show that the measure they compute (the entropy) compares well with the natural languages and hence, statistically, there is no known reason to believe it is not a language.

Is that a reasonable thing to say ?

Rahul Siddharthan said...

myl, anonymous: as far as I can tell, that is most certainly not the implication of the article and I really do not see where you read such a thing. In fact, the paper shows data for DNA and Fortran that fall somewhere between the languages and the extremes of type 1 and type 2 respectively. I think, on the one hand, the authors could have vastly improved Figure 1A (not that the referees seemed to mind); on the other hand, all of you are wilfully ignoring Figure 1B.

Ranjith: yes, it would be fair to say that but I think a stronger statement is possible.

Here is my best and, likely, final attempt at a summary. Note: this is my summary, not that of the authors -- I have had minimal discussion with one of the authors and none with the others.

1. A priori, there are two hypotheses: the Indus script encodes a language, or it does not.

2. Very few sequences of any sort are truly uncorrelated. A first-order Markov model is often the first attempt to represent all sorts of data, from languages to DNA sequences to computer programs. Mystifyingly, this does not seem to have been done previously for the Indus scripts.

3. It must be emphasised that a first-order Markov model (or, indeed, a Markov model of any order) does not truly describe any of these sequences. It is only an approximation.

4. A good measure of the level of digram correlation is the conditional entropy, as defined in the supporting text of Rao et al. For truly uncorrelated sequences, the conditional entropy reduces to the usual Shannon entropy.

5. All human languages considered in the paper exhibit a conditional entropy in a very narrow range -- in fact, the graphs basically fall on top of one another. There may be some reason for this that is universal to human languages. (If it is true of the spoken language, it is likely to be true of the script.)

6. The extreme cases of type 1 and 2 exhibit conditional entropies equal to the unconditional entropy, and zero, respectively. Non-language sequences such as DNA and Fortran code (studied in the paper) show conditional entropies intermediate between natural languages and the extreme values.

7. As Mark repeatedly says, one can cook up a model to deliver any desired conditional entropy. One would expect randomly occurring models to exhibit a range of conditional entropies -- at least in the range encompassed by the DNA and Fortran examples.

8. When the conditional entropy of the Indus texts is plotted, it agrees very well with the languages -- in fact, lies on top of the curves (figure 1A) and intermediate among all of them (figure 1B). It -- with the other languages -- is well separated from Fortran, DNA, and the extreme values of types 1 and 2.

9. Therefore, either the Indus script represents a language, or it is generated by some other process that just happens to produce the same conditional entropy as human languages.

10. The parsimonious explanation is preferred: the Indus script encodes a language.

This is not a proof, and the authors do not claim a proof, only "evidence" (very convincing, in my opinion) for the language hypothesis.

Mark, you are welcome to keep commenting, but unless I learn something new from your comment, this will be my last. But I am hoping that someone of your eminence is not objecting merely on the grounds that "I can cook up a model with a zillion parameters that looks like a natural language, and therefore the authors have shown nothing", and I am missing something fundamental in your argument.

Anonymous said...

Rahul:

I have nothing to say beyond that I find the following step in your summary unconvincing:

"9. Therefore, either the Indus script represents a language, or it is generated by some other process that just happens to produce the same conditional entropy as human languages."

Why can't it be the case that the Indus script is non-linguistic and yet generates conditional entropy which is the same as the spoken languages? What rules out that possibility?

If there is a reason to believe that a non-linguistic script cannot -- or at the very least, is extremely unlikely to -- have conditional entropy which is in the range of the linguistic scripts, then of course, the fact that the conditional entropy of the Indus script is in that range becomes extremely significant. But what is that reason? That was what I was asking Mark and you in my previous comment.

I fully understand that DNA and Fortran (the two actual non-linguistic scripts analyzed in the paper) have conditional entropy intermediate between the extremes of "Type 1" and "Type 2". However, those values separate them from the conditional entropy values of the spoken languages. What I should have said in my previous comment is that there seems to be an implicit, unsubstantiated claim in the paper that a non-linguistic script will have conditional entropy significantly different from those of the spoken scripts. After all, if a non-linguistic script can have an entropy value in the range of the linguistic scripts, then Figures 1A and 1B mean nothing at all.

As with you, I'll stop here. If you can get Ronojoy and his colleagues to pen a response, that will be helpful.

Let me end by thanking you for your blog; it's been very helpful even if I don't agree with you. But what's the use of a blog where everyone agrees with the author? :-)

Rahul Siddharthan said...

Anonymous: nothing rules out that possibility. That's why the authors do not claim a proof, only "evidence".

The unlikeliness of the possibility stems from the large range of possible conditional entropies (bounded by the type 1 and type 2 graphs) and the narrow range of conditional entropies shown by actual languages and the Indus script. Crudely, if you assume a random process would generate a conditional entropy uniformly likely to fall anywhere between those bounding values, I'd estimate the probability of it agreeing with natural languages to be not more than 10%.

Of course, if you have sufficiently strong prior reasons for believing that it is not a language, then this graph may be insufficient to persuade you (Bayes' theorem, etc...)

Thanks for the comments!

Richard Sproat said...

"But your graph, and Sproat's, don't actually look very similar to me to the published graphs of Rao et al -- and, puzzlingly, are noticeably lower. Your saturating value is about 2.5 whereas all their graphs go above 3."

I'm not sure what you are referring to. Just eyeballing their Figure 1A, it seems that the Indus (circles in their plot) starts at around 1.6 and climbs to about 2.8.
In mine, it starts at 1.57 and ends at 2.78. That seems awfully close to me. If I knew their actual numbers, I could surely tweak it to be closer. But it already seems pretty good for something that is generated with a model with just tweakable parameter.

Richard Sproat said...

Sorry that should have been "just one tweakable parameter".

Rahul Siddharthan said...

Richard - good to see you here. I am referring to figure 1A: while the Indus circles start, as you say, about 1.6, my eyeballs tell me that they go up to around 3.2 or 3.3 (and don't quite saturate).

Be that as it may -- I don't think I have much to add to my summary of 9:30 pm last night. While models that do not include digram correlations may still produce a relative conditional entropy below 1, there is no a priori reason to expect a generic such model to resemble languages, and indeed the two examples the authors study -- DNA and Fortran -- do not. Yes, you can fit it with one tweakable parameter, but I think one tweakable parameter is already too high.

I argued with anonymous above that I wouldn't expect a better than 10% chance of a random, untweaked non-language model fitting into the language band. If you can convince me otherwise, I am all attention.

Richard Sproat said...

Oh ok now I see it. There's an optical illusion in that plot which caused me to skew the horizontal. Well okay so I can tweak my one parameter (alpha) to 1.4 and have it start at 1.53 and end at 3.10

I'm not sure what else to say. Pereira's comment to the effect that "first-order statistics are compatible with a very large class of first-order Markov processes" seems to summarize the issue nicely. My own model, which only assumes a Zipfian distribution for the unigrams, and complete conditional independence for the bigrams, obviously isn't a model of real language, but it is able (with tweaking of that one parameter) to produce curves that look rather similar to what Rao et al. show. Of course we know that the Indus system did have a structure: things were not put together randomly. But the operant question is, do you have to assume that structure comes about because it encodes a language? Or could it be that it encodes something else?
I just don't believe conditional entropy can answer that.

And it's especially unconvincing when they use two straw man "non-linguistic" systems as the points of comparison, whereas we know lots of non-linguistic systems that have structure.
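The kind of model described in this comment (Zipfian unigram frequencies, with each symbol drawn independently of its predecessor) can be sketched in a few lines. This is only an illustration, not Sproat's actual code: the alphabet size, the corpus length and the function names are assumptions chosen for the example, and the numbers it prints are not meant to reproduce any curve in Rao et al. The point it illustrates is that the plug-in estimate of the conditional entropy from a finite corpus comes out below the unigram entropy even though the generating process has no digram correlations at all.

```python
import math
import random
from collections import Counter


def zipf_weights(n_symbols, alpha):
    """Unnormalised Zipfian weights: the symbol of rank r gets 1 / r**alpha."""
    return [1.0 / (r ** alpha) for r in range(1, n_symbols + 1)]


def entropy(counts):
    """Shannon entropy (bits) of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(seq):
    """Plug-in estimate of H(next | previous) in bits, from adjacent pairs."""
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    # Chain rule: H(X2 | X1) = H(X1, X2) - H(X1)
    return entropy(pairs) - entropy(firsts)


def simulate(n_symbols=400, alpha=1.4, length=7000, seed=0):
    """Draw an i.i.d. Zipfian sequence; return (unigram, conditional) entropy.

    n_symbols and length are illustrative assumptions, not the real corpus.
    """
    rng = random.Random(seed)
    weights = zipf_weights(n_symbols, alpha)
    seq = rng.choices(range(n_symbols), weights=weights, k=length)
    return entropy(Counter(seq)), conditional_entropy(seq)


if __name__ == "__main__":
    h1, h2 = simulate()
    print(f"unigram entropy:     {h1:.2f} bits")
    print(f"conditional entropy: {h2:.2f} bits")
```

Changing `alpha` shifts both numbers, which is the sense in which such a model has one tweakable parameter.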

Rahul Siddharthan said...

Richard - pretty much any sequence not generated by rolling dice does have a structure. Humans are almost incapable of generating random numbers. The interesting observation is that human languages all have conditional entropies in a very narrow band. This indicates some sort of universality at work in diverse human languages, which wouldn't surprise me at all.

The point is not whether the Indus script has a structure (it certainly does) or whether the process that generates it is capable of having conditional entropies smaller than the unconditional entropy (almost any process other than rolling dice will do that). The point is, if it is not a language, how likely is it to fit into that narrow band occupied by languages? To me, the graphs suggest an estimate of 10%. On the one hand, if you have a higher estimate, I'd love to hear it. On the other hand, if your prior belief that it is not a language is greater than about 90%, then of course this data will not convince you.

I believe the authors have more ongoing and unpublished work (and one preprint on arxiv) so the last word has certainly not been said.
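The prior-versus-posterior argument above can be made concrete with Bayes' rule. The sketch below takes the 10% figure from the comment as the chance that a non-language lands in the language band, and assumes (purely for illustration) that a real language lands in the band with probability 1. The function and parameter names are made up for this example.

```python
def posterior_language(prior_language, p_band_given_language, p_band_given_other):
    """Bayes' rule: P(language | the curve falls in the language band)."""
    evidence = (p_band_given_language * prior_language
                + p_band_given_other * (1.0 - prior_language))
    return p_band_given_language * prior_language / evidence


if __name__ == "__main__":
    # Assumed likelihoods: 1.0 for a language, 0.1 for a non-language.
    for prior in (0.5, 0.1, 0.01):
        post = posterior_language(prior, p_band_given_language=1.0,
                                  p_band_given_other=0.1)
        print(f"prior {prior:.2f} -> posterior {post:.2f}")
```

Under these assumptions a prior of 50% for "language" rises to about 91%, a prior of 10% rises only to about 53%, and a prior of 1% stays below 10%. A different estimate of the non-language likelihood changes the numbers but not the shape of the argument.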

Richard Sproat said...

Well my silly model doesn't have much structure, except for the Zipfian unigram distribution, yet it still falls in the narrow band of "languages". I'm a linguist, and when I think of "structure" in symbol systems, I usually think in terms of more structure than that.

In any case I find it interesting that nearly all the discussion has focused on linguistic systems and what would characterize those, and whether the Indus symbol system was similar to those.

But few people seem to have thought to ask the question: what does the space of non-linguistic systems look like? I conjecture that is because most people have never thought about this issue and don't know anything about it. It's for sure that the reviewers for Science hadn't and didn't, since otherwise one of them would surely have said, "Now wait a minute..."

Rao et al.'s Types 1 and 2 certainly do not cover that space. Indeed, I doubt that their Types 1 and 2 even exist. But even if they did, can anyone tell me that they are absolutely sure that if you look at non-linguistic systems --- mathematical symbology, dance notation, Naxi pictographs, European heraldry ... --- that you will not see examples that also fall in that narrow band? In other words, systems that are very language-like (according to that measure) but clearly not language.

Unless you can tell me that no such system exists, I don't see how you can believe that Rao et al. have demonstrated anything.

Anonymous said...

I argued with anonymous above that I wouldn't expect a better than 10% chance of a random, untweaked non-language model fitting into the language band.

Rahul,

1. Your 10% guesstimate, as you note, comes from assuming that non-linguistic scripts are "uniformly distributed" over the range between Type 1 and Type 2. Not clear why we should assume that. You cannot say that since we don't have any information, that's as good a (Bayesian) "prior" as any. We do have actual non-linguistic scripts -- some contemporaneous with the Indus script -- that can be analysed to get a feel for the actual distribution. The other evidence in favour of the non-linguistic nature of the Indus script cited by Farmer, Sproat and Witzel may also be relevant in the choice of a prior.

2. Secondly, I wonder if it's reasonable to conclude on the basis of the analysis of four scripts that the conditional entropy of linguistic scripts falls in a "narrow band." It might be sufficient, I don't know. I am just asking for clarification because I am struck by the possibility that the actual range of conditional entropy of linguistic scripts might be larger and is simply not showing up here. If it's sufficiently large, then the fact that the Indus script falls within their range may not mean much.

3. Assuming that what you say is correct -- that the linguistic scripts fall in a narrow band, that the non-linguistic scripts have a larger, uniformly distributed band -- how does one "set off" the significance of the statistical evidence against the other evidence gathered by Farmer et al? Is the statistical evidence "superior" to other evidence? Why? In particular, how does one evaluate the significance of the evidence cited by Farmer et al that while most other literate societies have left behind texts of substantial lengths, the largest one in the Indus script is 17 characters long with a mean of 4.6 or so?

4. I am not convinced at this stage that the evidence of Rao et al shows anything. (Says something about *my* prior, doesn't it?) At best, I'd call it suggestive. Comparison with more databases -- an objective of the authors, as Ronojoy noted in an earlier comment -- if it confirms the current findings will surely help. But even if it does, it leaves open the tricky question of addressing point #3 above.

PS: Some of what I say is covered by Richard above which I saw when I was previewing my own comment. I am too tired to edit my comment, so I'll leave it -- with apologies to all for the duplication.

Richard Sproat said...

My comment and anonymous's comment passed like ships in the night.

Good, so the issue of what non-linguistic systems look like is definitely on the table, as it should have been from day one.

Richard Sproat said...

By the way, all excellent points by Anonymous.

Ranjith said...

@Richard Sproat:

I have two questions:

1) Do Indus symbols have a Zipfian unigram distribution, with the parameter value close to what you got after tweaking ? (in other words, does your tweaked distribution match with the Indus symbol unigram distribution ?)

2) Can you show a few examples of real world non-linguistic symbols, which has a conditional entropy that matches with that of languages/Indus symbols ?

Rahul Siddharthan said...

"Suggestive" is in fact the word I had in mind. (As Sherlock Holmes said: "It is, however, very suggestive.")

I'm sure most or all of Anonymous's questions will be answered in due course.

Ranjith said...

how does one evaluate the significance of the evidence cited by Farmer et al that while most other literate societies have left behind texts of substantial lengths, the largest one in the Indus script is 17 characters long with a mean of 4.6 or so.
In this context, I have a naive query.

Have people considered the possibility that Indus symbols could be some kind of a shorthand writing method ? What if each symbol represents a word and not a sound? Can we say anything about it at this point ?

If one takes some English prose written in shorthand notation and calculates its conditional entropy, i wonder how it will differ from the existing result for English!

ps: i must admit that i do not know anything about the shorthand notation!

Rahul Siddharthan said...

Ranjith - if the Indus script does encode a language, very likely it is pictographic/ideographic (each symbol represents a word or idea, not a sound). Rao et al calculate the conditional entropy for English using both word-units and letter-units. The argument that there are no surviving long texts from Indus times seems specious to me for three reasons: first, as many have pointed out, longer texts may have been on more perishable media that have not survived; second, the number of other civilisations of that antiquity who have left behind written corpuses is small, so there is nothing much to compare with; third, there are no surviving long written texts from India even from the Vedic and later periods: Ashoka's are among the earliest surviving inscriptions.

Richard Sproat said...

Some comments on recent questions and comments:

Do Indus symbols have a Zipfian unigram distribution, with the parameter value close to what you got after tweaking ? (in other words, does your tweaked distribution match with the Indus symbol unigram distribution ?)

I don't know since I don't have access to precisely the corpus that Rao et al used. I do have some earlier (unigram) counts from Mahadevan's work, but I am not sure this is the same corpus. But in any event I would assume the answer is "no". But so what? I am not saying my model is what is going on for the Indus script, just that it's one of many models that can produce the same results as what Rao et al show. The main point, which was also Pereira's main point, is that there are many models that might explain the data without assuming they are writing.

Can you show a few examples of real world non-linguistic symbols, which has a conditional entropy that matches with that of languages/Indus symbols ?

Not yet, but then there are very few electronic corpora of such symbol systems available. So that argues that I could not use statistical methods to demonstrate that the Indus corpus is non-linguistic. But for the same reason, I can't show the opposite. As I suggested to Rao, in the absence of any credible attempt to "decipher" the Indus symbols, it would be useful in any case to spend our efforts understanding non-linguistic symbol systems better.

The problem with assuming a word-based or "ideographic" system is that, as DeFrancis and others have convincingly argued, no real ideographic writing system. The best one could have would be a limited system that allows you to represent a few things, but it could not have been a full writing system. If anyone has an example of an ideographic system that is indeed a full writing system, I would love to see it.

I believe the issue of perishable materials was adequately addressed in Farmer et al, so I am not sure why it is being raised again here. The lack of any credible markers of literacy (pens, ink pots, styluses) or the fact that there were no civilizations that had long texts on perishable materials that did not also have long texts on non-perishable materials: those points seem hard to argue with. Yeah I suppose you could conceive of a civilization that had a full writing system, wrote long tracts of prose on sheepskin, confined their writing on more durable materials to short pithy phrases, and somehow arranged it so that not only did the perishable materials perish, but everything that was used to create them. I suppose anything is possible: but I think one tries to avoid conclusions that make something seem odder than it already does.

Also, I find the perishable materials hypothesis dubious on other grounds: even if something is perishable, it does not mean that every example must perish. We have Egyptian papyruses from the 3rd millennium BC. Is it really credible that nothing would have survived for the Indus, if it existed? It's certainly hot enough and dry enough there, as I can attest from having been at Harappa in July 2005.

I'm not sure why you think the number of ancient civilizations that left around written material is small. Indeed, even those that were clearly literate but did not leave much around, tended nonetheless to leave much longer inscriptions than the Indus people did, which was one of the points that we made in our original paper.

A plausible explanation for why the Ashokan inscriptions are the first is that that was when India first learned to write. There are no texts from the Vedic period because they didn't have literacy in the Vedic period. I realize I'm treading on dangerous ground here, not because I believe what I am saying is false, but because the statement is taken as inflammatory in some circles. So be it.

Richard Sproat said...

Oops again:
The problem with assuming a word-based or "ideographic" system is that, as DeFrancis and others have convincingly argued, no real ideographic writing system
should of course have read
The problem with assuming a word-based or "ideographic" system is that, as DeFrancis and others have convincingly argued, no real ideographic writing system's exist

Richard Sproat said...

or even better: system's exist -> systems exist.
yikes, careless editing.

Anonymous said...

The main point is that there are many models that might explain the data without assuming they are writing.

Really ? Without assuming that they are writing, are there many models that will explain all three of the following

(1) Simple entropy data of Indus symbols (Fig S1 of Rao et al)

(2) Conditional entropy of Indus symbols (Fig 1 of Rao et al)

(3) Various distributions of Indus symbols (unigram distribution, for example)

I am interested to know which model fits all of the above. My guess is that without assuming some correlations, similar to that of real-world languages, one may not be able to get all of the above.

Richard Sproat said...

Really ? Without assuming that they are writing, are there many models that will explain all three of the following

(1) Simple entropy data of Indus symbols (Fig S1 of Rao et al)
Well my simple model of course doesn't fit the unigram one very well --- entropies are too low. But of course as they themselves note, none of the linguistic systems fit the Indus well in S1. The best fit is to the non-linguistic of Type 2.

(2) Conditional entropy of Indus symbols (Fig 1 of Rao et al)
(3) Various distributions of Indus symbols (unigram distribution, for example)


I am interested to know which model fits all of the above. My guess is that without assuming some correlations, similar to that of real-world languages, one may not be able to get all of the above.
Well I think the question is already answered. If you want a match in both simple and conditional entropy, you do not find it with Rao et al.'s data either. For the simple, it looks like a "non-linguistic" (strawman) system. For the conditional it looks like maybe Tamil, maybe Sumerian.

So if you want a model that explains all of these things, you are not going to find it in their paper.

In any case, at some level the answer is obvious: the best model is the model that generated the corpus in the first place. Problem is, we don't know what that is. And we are not going to answer it by producing limited statistical tests on an even more limited subset of natural languages, and a highly imperfect and misleading understanding of non-linguistic symbol systems.

Look, go ahead and believe it if you want to. Of course there is the issue of weight of evidence raised by another "anonymous": we had a bunch of arguments in the original Farmer, Sproat & Witzel paper, and most of them had nothing to do with corpus statistics. Rao et al's paper addresses none of those. To my mind, accepting their conclusion, especially given how it was arrived at, as concluding anything is like convicting someone of a crime, when all the available evidence points to someone else, but the person you happen to want to convict is approximately the right height.

Rahul Siddharthan said...

Richard, I really do not understand what you are saying. The Rao et al data is not a match with anything. It is what it is.

The question is, having fine-tuned your one-parameter model to fit the Indus curve in figure 1A, will the same data also fit the unconditional entropy Indus curve in figure S1? My guess is it will not, but of course if you add more parameters, you can get it to fit.

If you then look at the individual symbol frequencies of your model, are they similar to the actual Indus frequencies? Probably not, but you can probably arrive at a model with the same frequencies and the same unconditional and conditional entropies, at the expense of throwing in some more parameters.

Ptolemy's epicycles did explain planetary motion pretty well, too.

Going back to your previous comment on ideographic scripts: what about Chinese? Or Japanese Kanji (which is much the same thing)? Or Egyptian hieroglyphics? These were/are perhaps not purely ideographic but are largely so; the Harappan system could have been the same.

Your other comments on perishable documents probably either deserve a post of their own in response, or deserve to be dismissed as a lot of unsubstantiated handwaving. I do not know what you mean by "adequately addressed" in your 2004 paper. In this space let me just make the following comments/questions:

1. I don't know why you need pens, inkpots, styluses -- why not quills? They were quite common in the west, until recently. If styluses, why not wooden or other perishable materials? Why not vegetable-based dyes, which are common in India even today? Maybe the inkpots were the pottery that has been found -- how would you specifically identify a Harappan inkpot?

2. Maybe they did write on leaves or animal skin, and didn't invent paper (Indian civilisations didn't come across paper for centuries, and used palm leaves). Palm leaves survive for centuries but probably not for 3.5 to 4.5 millennia. Why is that hard to imagine?

3. Perhaps they did not build enormous edifices with ornate inscriptions, or if they did, nothing has survived the centuries, except the tablets and a few other things. Why is that hard to imagine?

4. I don't see why the hotness and alleged dryness of Harappa should help your argument. Organic materials decompose faster in heat, not slower. And while Harappa may be dry today, it is hardly a desert; and it certainly wasn't one in earlier times -- it did sustain a flourishing civilisation.

5. The number of civilisations as old as the Indus Valley one that have left behind written materials is quite small: I know only of Sumer, Babylon, Egypt, Elam, and perhaps China. Of these, I don't think long texts survive from Elam or the contemporaneous period in China -- correct me if I am wrong. Even the Egyptian papyruses that you cite are younger than the Indus Valley civilisation.

6. The Ashoka pillars are an accident of Ashoka's personal history -- his father and grandfather did not erect pillars, or if they did, none have survived. Do you think they were illiterate and Ashoka invented writing? If Ashoka's life had transpired differently and he did not erect those pillars, perhaps you would think that Indians did not write for a few centuries longer. Several examples of writing appear in Hindu mythology (eg, Ganesha writing with his broken tusk). Many of the Buddhist Sutras, in particular the Sutta Nipata, are believed to have been written soon after the Buddha's death, by his direct disciples. The fact that original manuscripts do not survive means nothing. We don't know what sort of script was used, but the idea that these rather advanced communities did not know how to write defies belief. To apply your own criteria here, can you think of any other examples of communities with sophisticated governance, rich literature and advanced philosophies, who were illiterate?

7. You are dismissive of the Indus-Dravidian hypothesis because of the geographical distance between the Indus Valley sites and present-day Tamil Nadu, as well as the temporal distance between the Indus civilisation and the earliest recorded Tamil. But there is a Dravidian-origin language, Brahui, still spoken in parts of Pakistan and some neighbouring areas. Some scholars believe that the Elam language, roughly contemporaneous with the Harappans, was also of the Dravidian family. Finally, the known continuous history of Tamil is much longer than the gap that you object to between the Indus valley and Old Tamil. It does not seem such a stretch to think the Indus language may have been a Dravidian language.

Richard Sproat said...

I guess the following summarizes the issue rather well:

Richard, I really do not understand what you are saying. The Rao et al data is not a match with anything. It is what it is.

Ok good. So it's not a match with anything. So why are we even discussing it?

On Brahui. Yeah everyone knows that. Try convincing most Indologists that that is due to anything other than a recent migration. Hans Hock has a lot to say on this topic, for example.

I'm not an archaeologist, but I would have thought inkpots would be easy to identify: wouldn't they contain traces of ink?

China was later actually. But the earliest texts --- the Oracle Bones --- are much longer than anything Indus. So are the earliest Mesoamerican inscriptions --- much much later in their case, but then that was clearly developed independently anyway.

On your other points: yeah I guess one can imagine anything. It's just a question of the sum total of things that one has to imagine.

I think the real question for you is: why do you insist that the Indus must have been literate? I.e. why, fundamentally, do you care? Why is the conclusion that it might not have been a worry to you?

Okay this is my last post here. Obviously the discussions are getting us nowhere. It is pretty clear what would need to be done at a minimum to do the work that Rao et al did, in a more rigorous and (dare I say) scientific fashion.

First we would need a much wider range of languages and different types of writing systems.

Second we would need some serious corpora (not made up examples) of non-linguistic systems. Unfortunately creating those is work, and since not too many (if any) really exist, we are probably looking at a few years' wait before the experiments could be done. I will have no more to say about that here, but if anyone is seriously interested in collaborating on something like that you can email me (I am easily findable on the web), and we can discuss it.

Then, finally, we would need to come up with a serious battery of statistical tests, being very careful to come up with plausible models of the priors, per Fernando Pereira's points.

Then, and only then, might we be able to make a claim about what the Indus system looks like.

But even then this will probably fail to make much difference. People who want to believe this was a script for a language will continue to try to decipher it, making all sorts of theories about what the language (or languages) was, and what type of writing system it was. There is no shortage of would-be Ventris's out there, as I can attest from the other two controversies --- the Phaistos disk, and rongorongo --- that I have been involved in.

(With the Phaistos disk, the fundamental observation that many people have made that the text is too short to allow for a verifiable decipherment, completely fails to stop people from trying and being tickled pink when they "succeed". With rongorongo, many people seem to have been impressed by Pozdniakov's demonstration that a frequency plot of his decomposed glyphs matches, by some stretch of the imagination, the distribution of syllables in Rapanui. Of course, all he's done is rediscover that many symbol systems show a roughly Zipfian distribution, meaning that a Zipfian distribution tells us nothing at all.)

Archaeologists will continue to dig up new seals or other inscriptions. If the pro-script camp gets lucky, they'll eventually find a long text, or a bilingual. Occasionally computer scientists will rediscover the wheel --- it seems to have been largely forgotten in this discussion that the people at Helsinki did all kinds of Harrisian analysis on the corpus and discovered structure back in the 1960's ---; will be able to hoodwink Science into publishing (apparently quite easily done); and especially if they work at a university that likes to promote its researchers' work (despite the revolutionary nature of the work, the University of Illinois, where I was at the time, chose to ignore our 2004 publication), get their Warholian 15 minutes of fame. Since the press rarely publishes nostra culpa articles, the fame may even last 15 minutes before it turns to embarrassment (by which time people will have moved on to something else anyway). Hindu extremists will continue to try to find horses in an apparently horse-free society, and publish inflammatory remarks about people who would dare suggest that their ancestors were illiterate. (Their ancestors? But we don't even know who the Indus people were...)

In other words, this will change practically nothing.

In a way, I would be happy if someone did dig up a bilingual --- say a Sumerian-Indus dictionary nicely arranged on 100 clay tablets. Of course it would mean we were wrong. That's cool: I do not mind being wrong. What really irks me is not being wrong, but being told I'm wrong because of a largely bogus result that shows nothing at all, especially when a lot of the people (e.g. in the scientific press) have not thought at all about the issue, and are merely impressed by what looks like a convincing plot.

Rahul Siddharthan said...

Richard,

Ok good. So it's not a match with anything. So why are we even discussing it?

You are either trying to score cheap debating points here, or you have not actually read the Rao et al manuscript, despite its brevity.

Nobody is making categorical assertions except you. It is a question of evidence and how that evidence affects your posterior beliefs, given your prior beliefs. If your prior beliefs have a confidence of 100%, nothing will change that.

Your 2004 paper contains no proofs of any kind either, but you don't have the modesty to admit it. Just compare the two titles -- your "Collapse of the Indus Script Thesis" versus their "Entropic Evidence for Linguistic Structure in the Indus Script". You assert an alleged truth confidently, they claim some statistical findings. But your statistics are laughably inadequate and the rest of your paper is so much handwaving. Now, I wonder why your university ignored it.

Sorry if that sounds personal, but I didn't start it, and I have no personal axe to grind -- I do have some issues with the Rao et al paper and I agree with some of the things you say. But I cannot take anyone seriously who dismisses all critics as nationalists of one kind or another, while refusing to read their arguments and making absurd distortions of their statements.

And, by the way, your recent content-free rant, posted on Steve Farmer's page, is a disgrace to your profession.

P said...

So if you want a model that explains all of these things, you are not going to find it in their paper.

I am amazed reading this comment from Sproat! After all this, he seems to think that what Rao et al published is a model !!

Arun said...

Rahul:

Question - from myl's and Richard Sproat's simulations, it would appear that bigram conditional entropy is not sufficient to distinguish between a 0th order and a 1st order Markov process. Is this true in general?

Or to ask a related question, have Rao et al. established clearly that a first order Markov process models the Indus inscriptions better than any zeroth order Markov process?

Or is it that, assuming a first order Markov process for the Indus inscriptions, it falls right in the middle of the first order Markov processes (as measured by conditional entropy) that model linguistic systems?

Presumably there is a large but finite space of historical human-generated non-linguistic systems - if conditional entropy can be computed for each of them, and none of them falls into the linguistic class, then we have a strong indication (but never proof) that the Indus inscriptions are linguistic.

Thanks in advance for your answers.

Anonymous said...

More heat than light, as Mark Liberman anticipated! I have nothing to add to what I've already said but I was intrigued by the brief discussion pertaining to writing in India, so I did some search to find more. Please feel free to delete this comment, Rahul.

Other than the Indus valley script, the earliest evidence for writing in India is, as you said, exactly around Ashoka's time. Was there writing before that? One possibility, as you suggest, is that there was but nothing from pre-Ashokan (but post-Indus valley) times has survived. Alternatively, some have suggested that writing was invented just around Ashoka's time. I'll come back to this point but let me first take up the Sutta Nipata which you suggest was written immediately after the Buddha's death. That appears not to be true.

I quote from The Sutta Nipata, translated by H. Saddhatissa, 1985: "The canonical texts of the Theravadin school of Buddhism,...collectively known as the Pali Canon are divided into the three main sections...All these texts were transmitted orally and only committed to ola leaf manuscripts in modern Sri Lanka in the first century B.C." An earlier book by Lord Chalmers says "Indeed, it cannot be assumed that in its present form, any given `book' of the [Pali] Canon dates back to before Ashoka's Council held at Patna in (perhaps) 240 B.C." (Both books can be previewed via Google Books; I am quoting from the Introduction in both cases.)

Back to writing in India. I think someone (Farmer? Witzel?) has suggested that the Vedic priests opposed writing, fearing perhaps the loss of their pre-eminence. Note that even after writing emerged, the Vedas themselves were not written until very modern times! Now resistance from entrenched groups to the introduction of a new technology is not unknown: Jared Diamond discusses this point in his monumental Guns, Germs and Steel. A notable and strange case is Japan, which took to guns from the Portuguese in 1543, quickly improved the technology and then proceeded to ban it altogether. The ban ended only after Commodore Perry ended that country's self-imposed isolation in 1853! The reason for the ban was that the samurai (the traditional warrior class) feared that guns could be operated by anyone (which was true) and hence preferred the sword! See the account by Jared Diamond here:

http://www.edge.org/3rd_culture/diamond_rich/rich_p5.html


Anyway, if one believes in the possibility of the Vedic priests opposing writing, then it is not unbelievable [no proof, mind you] that it was not until Ashoka, the first significant Buddhist king, one who [after his conversion] was not beholden to Vedic priests and who had his own reasons for wanting writing, that India as such adopted writing with the Brahmi script. The Nagari script based on Brahmi in which Sanskrit and other North Indian languages are written came later.

(The following article of Witzel may have details on the resistance to writing in India: Brahmanical Reactions to Foreign Influences and to Social and Religious Change. In: Olivelle, P. (ed.) Between the Empires. Society in India between 300 BCE and 400 CE. Oxford: Oxford University Press 2006: 457-499.)


Lastly, let me note that the time of the Indus Valley civilization is around the time when writing first appeared anywhere in the world. (I am depending on Jared Diamond here.) There are only four or five places where writing emerged independently, -- so not surprisingly, there isn't much to compare the Indus Valley Civilization with. However -- you will not like this -- Farmer et al do discuss Elam and contrast the unavailability of long texts in the Indus valley from a large corpus with Elam where texts of substantial length are available for the (undeciphered) Linear Elamite from a much smaller corpus. (See page 22 of their 2004 manuscript.)

It's been nice participating but I think I should stop here. Thanks for the discussion.

Rahul Siddharthan said...

Arun: If the process is indeed a Markov process, then the conditional entropy will reveal whether it is zeroth order or finite order. But it is possible for a zero-order process with "hidden states" to yield a conditional entropy lower than the unconditional entropy. myl gave an example, above. And indeed none of the languages or other examples being discussed (DNA, Fortran) are actually generated as Markov processes: it is just a model. Sproat, Liberman et al claim that they can fit the observed language data (figure 1A) with a model that has one adjustable parameter. To complete the picture, obviously, they must also fit the data in figure S1 with the same model; as far as I can see, they don't, but I'm sure they could (with more parameters) if they tried. I just don't see what that sort of data-fitting proves. There is no fitting in the Rao et al data: it is what it is, and just happens to agree with natural languages.

Your use of the phrase "zeroth order Markov process" is misleading. Most certainly the Indus script is not a zero-order Markov process. That is ruled out by the Rao data. The issues are: 1. there are non-Markov processes that could generate the data, 2. A Markov process does not, by itself, imply language. The question is, how likely is it that you would get language-like conditional entropy by chance?
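To make the two quantities concrete, here is a minimal sketch of how one might estimate them from a corpus of sign sequences. It uses plain pair counts with no smoothing, which is an assumption of this sketch and not necessarily the estimator used by Rao et al.; the function names and the toy strings are invented for illustration.

```python
from collections import Counter
from math import log2


def unigram_entropy(sequences):
    """Entropy (bits) of single symbols, ignoring order."""
    counts = Counter(sym for seq in sequences for sym in seq)
    total = sum(counts.values())
    return sum(c / total * log2(total / c) for c in counts.values())


def conditional_entropy(sequences):
    """Entropy (bits) of a symbol given its immediate predecessor.

    Estimated from adjacent pairs within each sequence; pairs are
    never formed across sequence boundaries.
    """
    pairs = Counter()
    firsts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pairs[(prev, nxt)] += 1
            firsts[prev] += 1
    total = sum(pairs.values())
    return sum(
        c / total * log2(firsts[prev] / c) for (prev, _), c in pairs.items()
    )


# Hypothetical toy corpora: one rigidly ordered, one where the
# predecessor tells you nothing about the next symbol.
rigid = ["ABABABAB"]
loose = ["AABBAABBA"]

for name, corpus in [("rigid", rigid), ("loose", loose)]:
    print(name, unigram_entropy(corpus), conditional_entropy(corpus))
```

In the rigid corpus the predecessor fixes the next symbol, so the conditional entropy falls to zero while the single-symbol entropy stays at one bit; in the loose corpus the two are about equal. With corpora this small the estimates are badly biased, so this only illustrates the definitions and says nothing about the Indus data.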


anonymous: thanks for the long comment. I stand corrected on the Buddhist scriptures (or the common view of them, anyway). And it is of course possible that the bit about Ganesha being Vyasa's scribe was added to the Mahabharata in later centuries.

However, I find it odd that the Vedic priests opposed writing if writing was unknown at the time. If writing was not known at all, could they have conceived it? More likely, I think, it was known but they opposed it to maintain their power. The Buddhists, and others like the Charvakas, were not beholden to the Brahminical priests and did not have a good reason to oppose writing. While Ashoka was the first famous king to convert to Buddhism, and the first to rule over large stretches of today's India, Buddhism was alive and well in India for centuries before him and after. I find it a bit of a stretch that writing was invented just about the time that a king converted to Buddhism. And what is the point of placing these pillars far and wide if nobody could read them? Literacy must have been somewhat widespread, if restricted to certain elite classes (as it was the world over).

Also, the Vedas were transmitted very precisely because of the careful system of Vedic chanting (which even preserved linguistic differences between the earlier and later Vedas). Other Hindu texts like the Ramayana and Mahabharata were not so carefully transmitted and have probably evolved substantially over the centuries. I don't think the Buddhists are known to have had a comparable system of chanting -- so if it was a purely oral tradition that maintained the Buddha's teachings until Ashoka's times or the Sri Lankan council, we must assume that some distortions occurred in the intervening centuries.

Arun said...

I thought I was careful to say that the Indus inscriptions are better described by a 1st order Markov model than a 0th order; etc.; apparently not.

Obviously neither the Indus inscriptions, nor any language, nor DNA can be said to be generated by a Markov model.

I think the detractors of Rao et al. need an example, either historical, or **from a natural construction** that exhibits the same statistical features in order to shoot down the paper.

E.g., there are packaging symbols (fragile, inflammable, liquid, recyclable, this side up, consumer electronics) etc., perhaps with variations that are not obviously the same unless you already know the meaning; and present as short sequences on myriads of cartons. With some reasonable rules on how to place these symbols, if they yield entropies like the Indus inscriptions, then the Rao et al paper doesn't advance the cause of the Indus inscriptions being a language. The construction would be ad hoc (made to demonstrate a point) but also have to be natural.

Rahul Siddharthan said...

Arun - yes, it would be an interesting exercise. Road signs (suggested by km on my previous posting) would be another example -- strings in the corpus would be road signs as ordered on a given road in the driving direction. I would imagine that some correlations exist, eg between "caution" signs and posted speed limits.

Off-hand, other than your packaging information example, I can't think of non-linguistic signs that are routinely written on small surfaces like tablets.

Finally, in response to P above, I believe that Messrs Sproat, Liberman and co are deliberately and persistently missing the point: I can no longer attribute honest misunderstanding to their comments. Even if they did not read the paper and only skimmed the comments here, they cannot possibly believe that Rao et al are presenting a model. It is an example of a strawman argument: instead of debating the issue, debate something else that was never being discussed anyway.

Another example: when after all of the above discussion, someone who we know possesses basic reading comprehension asks "I think the real question for you is: why do you insist that the Indus must have been literate?", I must assume other motives and not a desire for honest discussion.

Of course, I should have foreseen all this from the recent Farmer screed where Sproat is a co-author, and not tried to engage with him when he showed up here. But I did initially think he was interested in honest discussion.

Anonymous said...

Rahul,

Great blog and good discussion.

Some thoughts on Indian/Indus writing:

(1) This article, which appeared in Science, says:

There is strong evidence for trade and cultural links between the Indus and cities in today’s Iran as well as Mesopotamia.


If there were cultural contacts and trade with others, why is it conceivable that everyone else (Sumerians, Elamites) knew writing but the Indus guys alone were unaware of it, even after having their own script-like symbols?!

(2) Isn't it more logical to imagine that people wrote on perishable materials (like palm leaves) than to imagine that people like Paanini composed all those complex grammar rules etc. just based on memory?! (Not to mention the vast literature that existed then.)

Rahul Siddharthan said...

anonymous: thanks for the link, which I had missed. Fascinating article, as are Lawler's other articles from that issue. It is indeed extremely implausible that such a society did not know writing, but I think it is no use arguing with the Farmer crowd, who apparently believe their 2004 paper is an irrefutable proof of their notions.

Ramaswamy, S (Chicago) said...

Sri. Rahul,

Have you seen?
(a)

http://www.sciencedaily.com/releases/2009/04/090423142316.htm

(b)http://www.harappa.com/script/indusscript.html

http://www.harappa.com/script/

anbudan,
Ram

S. Kalyanaraman said...

Great blog. Let me pick up the following comments and try to respond:

Ranjith said…What if each symbol represents a word and not a sound? Can we say anything about it at this point ? 4/29/2009 6:47 PM

Richard Sproat said... The problem with assuming a word-based or "ideographic" system is that, as DeFrancis and others have convincingly argued, no real ideographic writing systems exist. The best one could have would be a limited system that allows you to represent a few things, but it could not have been a full writing system. If anyone has an example of an ideographic system that is indeed a full writing system, I would love to see it. 4/29/2009 8:03 PM

Rahul Siddharthan said... Going back to your previous comment on ideographic scripts: what about Chinese? Or Japanese Kanji (which is much the same thing)? Or Egyptian hieroglyphics? These were/are perhaps not purely ideographic but are largely so; the Harappan system could have been the same. 4/30/2009 12:15 AM
Rahul Siddharthan said... And, by the way, your recent content-free rant, posted on Steve Farmer's page, is a disgrace to your profession. 4/30/2009 12:36 AM
Rahul Siddharthan said... Another example: when after all of the above discussion, someone who we know possesses basic reading comprehension asks "I think the real question for you is: why do you insist that the Indus must have been literate?", I must assume other motives and not a desire for honest discussion. Of course, I should have foreseen all this from the recent Farmer screed where Sproat is a co-author, and not tried to engage with him when he showed up here. But I did initially think he was interested in honest discussion. 4/30/2009 10:54 AM
Anonymous said... (1) This article appeared in science says: There is strong evidence for trade and cultural links between the Indus and cities in today’s Iran as well as Mesopotamia. 4/30/2009 9:53 PM (REF. http://www.sciencemag.org/cgi/content/short/320/5881/1276)
Rahul Siddharthan said... anonymous: thanks for the link, which I had missed. Fascinating article, as are Lawler's other articles from that issue. It is indeed extremely implausible that such a society did not know writing, but I think it is no use arguing with the Farmer crowd, who apparently believe their 2004 paper is an irrefutable proof of their notions. 4/30/2009 10:57 PM
My comments: The arguments have been going back and forth without agreeing on a definition of the key phrase, ‘writing system’. Anything which encodes speech is a writing system. Why is there a conflation of ideas when talking about ‘ideographic systems’? If a glyph represents a word, what is such a system called? Say, a tiger glyph. Can’t it be taken to represent a word, say, kola?
Secondly, the discussions on this blog should also focus a bit on the semantics of a language system. Many linguists have said that India of 4th to 3rd millennium BCE was a ‘linguistic area’ (that is, an area where many dialects absorbed features from one another and made them their own). If a decoding of the script provides evidence matching the glyphs (read rebus) with just one semantic category, ‘metallurgy,’ what would you call it? A writing system or not? See the arguments presented at http://sites.google.com/site/kalyan97
The report by scientists in Science magazine is an important contribution to language studies. It provides an analysis of structural patterns which are characteristic of languages.
A very important characteristic of languages is the semantic structure, that is, the underlying meanings of spoken words of languages. It is the 'meaning' which provides a structure even for short sequences of, say, an average of five symbols used in Indus script inscriptions.
A major omission in the script studies so far is the arbitrary distinction made between so-called 'pictorial motifs' and 'signs'. As in Egyptian hieroglyphs, it is possible that the entire corpus of Indus script is composed of glyphs -- such as a rim of a narrow-necked jar, rimless pot, fish, svastika, antelope, elephant, tiger looking back, crocodile, ligatured animal body with three heads of one-horned heifer, short-horned bull, antelope, person seated in penance.
Unless all the glyphs are decoded in a logical cluster, taking into account the media used for inscriptions (such as terracotta bangles, copper plates, metallic weapons, tablets, seals), the decoding will not be complete.
The error made by Sproat et al is in assuming that a script has to be syllabic or alphabetic and in not evaluating the possibility of the glyphs representing words, spoken rebus (use of similar sounding words to connote substantive messages).
This website presents two pure tin ingots with inscriptions and argues that they are rosetta stones representing tin metal. Who else but metallurgists could have had the competence to inscribe on metallic weapons and on copper plates? This website also underscores the fact that during historical periods, early punch-marked coins from mints used the same corpus of glyphs, pointing to a continuum in culture in ancient India. The conclusions drawn are that the glyphs get encoded within one semantic category -- the repertoire of mints and of mine workers -- pointing to the link between two great inventions: the invention of writing and the invention of metal alloying.
These conclusions have to be evaluated by any further scientific studies within the context of the continuum of language evolution as a cultural marker of an extensive civilization.
Vākyapadīya ("About Words and Sentences") is a work of Bhartṛhari on grammar, semantics and philosophy which looks at speech in three stages:
1. Conceptualization by the speaker (Paśyantī "idea")
2. Performance of speaking (Madhyamā "medium")
3. Comprehension by the interpreter (Vaikharī "complete utterance").
Surely the meaning of shabda (spoken word) is understood in the context of a sentence. A sentence does not have to have a string of words. When a child says, ‘cow’, the meaning is complete in the context of the child’s reference to milk.
Context is the key. So is sphoṭa, the meaning-unit of a sentence recognized by a hearer’s anticipation.
When I constructed the Indian lexicon, as a comparison of over 8000 semantic clusters in over 25 ancient languages of India, I was struck by the fact that many words had multiple meanings (many were clearly homonyms). As civilization progressed, recognizing new phenomena and the way people reacted to these phenomena, the repertoire of sound-strings available as words was drawn upon to agree, in a social contract, on new meanings.
Let me cite an example. Kol is a Tamil word meaning “pancaloha, alloy of five metals.” Kola is a Santali word meaning ‘tiger’. Kola is a Nahali word meaning ‘woman’. So, what do the inventors of a writing system in a linguistic area do? They show a tiger ligatured to a woman to connote the word kol ‘five metal alloy’. As a logical extension, the word kollan is invented to connote a ‘smith.’
This explains why many images are chosen in Indus script: tiger looking back, a person sitting on a branch of a tree, crocodile holding a fish in its jaws, a bovine with 3 heads of one-horned heifer, antelope, short-horned bull (each connoting a metal). A person sitting in penance (kamaḍha) is used to connote, by rebus, kampaṭṭam ‘mint’. Ligaturing glyphs is a unique method to conserve space, with as many glyphs as possible, to send out an unambiguous message related to metallurgical inventions.
Thus it is that a metallurgical invention of alloying gets matched with a writing system using ligatured glyphs.
My Indian lexicon has semantic cluster headers showing ‘image’ words and ‘thought’ words, many of which are cultural social contract words. The writing system results in much more than mere mint marks; the system is used to connote the product and process while indicating the professional competence of the creator of the epigraph and the related metal artifact.
The test you propose for a private language is absolutely brilliant. I am sure that there are many dialects which can be candidates for the test.
When a symbol system gains the status of a social contract with people in an extensive area using the system, private language ceases and becomes public. A new metaphor is born.

Cyn_David said...

Thanks for some very interesting posts, Rahul!

I must confess that I am not a linguist or archeologist, and my interest in these things is strictly non-professional.

I don't think this issue will be resolved until someone finds a much longer text, or until the inscriptions are deciphered. I don't mean to dismiss the statistical analysis - of course there is a chance that it can offer supporting evidence. Unfortunately, I am not qualified to judge whether Rao's paper offers anything substantive in this direction. I appreciate your elaborations, at least I am now aware of some of the concerns and limitations of such types of analysis.

While Sproat may have some valid points (again, I'm not in a position to judge), he also apparently has personal problems that color his judgement. His reaction to any disagreement seems to be along the lines "ah, you must have some hidden agenda if you disagree with me", which is quite unusual in scholarly discussions.

Please do post news of any additional work by Rao's team, or any other developments in the study of the Indus Valley script.

Rahul Siddharthan said...

Kalyanaraman: I'd say a script must be a written representation of the spoken language, such that (a) any spoken text in that language may be written, and (b) that writing can then be read by anyone familiar with the script, to reproduce the spoken text. There may be some limitations and ambiguities (in today's scripts these mainly arise with foreign words), but mostly, I think this should be a requirement.

Cyn: I'm not an expert in linguistics and archaeology either...

All: well, I think the discussion has gone on long enough now! There has been much heat as well as some light, but I expect returns to diminish sharply if this continues. I'm not closing comments but probably won't reply to any more.

Truthseeker said...

Farmer and I debated his theory on the illiteracy of the Harappans months before he and his colleagues published their paper. A list of our comments can be found here

http://groups.yahoo.com/group/akandabaratam/message/14321


You are right that Sproat is determined to ignore Rao’s work, given his intention to present as valid his theory that the Harappans were illiterate. This results from the fact that in the original Farmer paper, the researchers had attempted to use Zipf’s Law to prove that the Harappans were illiterate. I shot this theory down, so they removed this point from the paper they later published. I pointed out that the data they present in their paper does indeed fit Zipf’s Law.


http://groups.yahoo.com/group/akandabaratam/message/11097


I later published this and other evidence disputing their theory in 2005.

C. Winters, “The Indus Valley Writing is evidence of ancient Dravidian literacy”, International Journal of Dravidian Linguistics, 35(1), pp. 139-152 (2005).

A cursory examination of the article by Farmer et al. by any competent researcher betrays its groundlessness. The study has one independent variable, literacy, and three dependent variables: 1) the relative frequency of the signs, 2) the brevity of most inscriptions, and 3) the lack of evidence for an archaeological manuscript tradition. The first two are falsified by Farmer et al. themselves. They maintain that brief inscriptions and a limited number of signs on various media are an indication of illiteracy. They contradict themselves in their own paper when they mention that the average symbol lengths for Egyptian (6.94) and Indus (7.39) texts are almost identical. This, along with the fact that Dreyer has found that Egyptian clay tablets with only two signs have phonemic meaning, makes it clear that if these texts have phonemic and literate meaning, the same is probably true of the Indus Valley writing.

Secondly, Farmer et al. argue that there is no manuscript tradition for Indus writing. This is untrue: we see the continued use of signs similar to those in the Indus script from Harappan times, into the South Indian megalithic, and on into the Brahmi script. It was B.B. Lal's discovery of South Indian pottery with Harappan signs that allowed us to see the direction in which Harappan writing was written.

The fact that very brief texts of as few as two (2) signs exist, and that they have phonemic, literate meaning, shows that the hypothesis and variables of Farmer et al. lack internal and external validity.

Truthseeker said...

The argument about the presence of singletons in the Indus Valley writing says nothing about the literacy of the Indus Valley seals. This results from the fact that although the new sign (or singleton) found in a seal text may not occur in other seals, the singleton is usually made up of a combination of two or more of the seventy basic Indus Valley signs to form a “new” sign or singleton.

The Oracle Bone writing confirms this view. In the Oracle Bone writing we see a number of signs that are formed by two or more symbols.

From L. Wieger's Chinese Characters (New York: Dover Publications, 1965) we learn that the Chinese writing system is based on 224 Primitive Chinese signs called Gu wen; these graphemes are joined together to make new words.

The Gan zhi or cyclical graphs are among some of the most ancient Chinese symbols. There are 22 signs in the Gan zhi. David N. Keightley, in Sources of Shang History: The oracle-bone inscriptions of Bronze Age China (Berkeley: University of California Press, 1978), notes that these signs have been used from ancient times up to the modern period. The Gan zhi signs are joined together to form new words, e.g. the yi symbol and the symbol for ‘pit’ are joined together to make the word Xiung ‘accident, unlucky’.

David Keightley has made it clear that while we may find hundreds of Oracle Bone inscriptions, there were only 42 signs frequently used in the writing. These 42 signs, along with a number of pictographic signs, were combined with one another and used to make the corpus of Oracle Bone inscriptions. Because these inscriptions were written for divining purposes, the terms used in this genre were associated with divinations and prognostications. As a result, as the writing was used for other purposes, the Chinese had to invent new signs or singletons based on the Gan zhi and Gu wen signs to record new types of information, for a writing system used originally for divination. Over time, these singletons would become high frequency terms as they were used more frequently.

Truthseeker said...

Mr. Farmer claims that the Indus Valley seals are far too brief to record any meaningful information. This contention has no linguistic validity. A sentence is a combination of selected syntactic items arranged or modified in a particular pattern. The number of signs used to write a particular sentence cannot define the literacy of a sentence. A sentence has meaning solely on the basis of the content of the syntactic elements of that sentence. For example, in Arabic we have kitab kita-b ‘he wrote (a) book’. In English we have,
1. They came.
2. They saw her.
3. It is Jack.
All of these sentences have few words but they do have meaning.
The presence of a limited number of signs on a seal has nothing to do with the meaning of that seal. This results from the fact that the Indus Valley signs are homophones that have varying meanings.

It is no secret that the Indus Valley signs are almost identical to signs found on the Egyptian pottery, and signs in the Minoan, Proto-Sumerian and Proto-Elamite writing systems. This is very interesting because some researchers, such as I.J. Gelb in A Study of Writing (Chicago: University of Chicago Press, 1963), believe that some other group besides the Sumerians invented the cuneiform script used by the Sumerians.
The study of ancient writing systems makes it clear that in many scripts, e.g., cuneiform and Egyptian, two or more signs can have the same pronunciation. For example, in Sumerian there are 22 different signs that were used to represent the syllable du.

In the cuneiform system there is no distinction between voiceless, voiced and emphatic consonants. As a result, the sign ga can be read as ka and qa.

In Sumerian there are many homophones. As a result, in many ancient languages a term can be not only an adjective or demonstrative, it may also represent both a verb and noun. Each Sumerian cuneiform sign represents a monosyllabic CV (consonant vowel) or VC term. These Sumerian terms have multiple meanings:
U, cock; totality; to ride, to steer;
Ig, dike; embankment; to water; to say; this one;
Ul, joy, pleasure; to glitter, shine; remote, distance;
An, sky; the god An; to be high; high; in front;
En, dignitary, lord; to rule; noble; until;
Ur, to surround; dog; to tremble; humble; liver, spleen; it, these, thus; so.
The varying meanings of these homophonic terms, which can be represented by one or more cuneiform signs, make it clear that the combination of several cuneiform signs can be written to provide meaningful statements in a short text. Below are some Sumerian examples:
Ash ti en, ‘Wish for a noble life’.
Zi eš, ‘(This) righteous shrine’.
I po tu, ‘Capture the pure libation’.
Pa ge ki, ‘Girls take an oath (this) place’.
Mi lu du, ‘This (is) a favorable oracle of the people’.
The fact that meaningful statements can be made through the combination of a few cuneiform signs makes it clear that the Indus Valley seals, even though they contain a limited number of signs, can express literate, meaningful statements, contrary to the opinion of Steve Farmer.