[UPDATE 17 Sep 2010: The story continues here.]
I am an expert neither on archaeology and history, nor on computational linguistics (though some of my interests come close to the latter). My previous post attracted 56 comments, and I eventually closed comments because it seemed that nothing productive was going to occur, and meanwhile certain individuals seemed to be using the space to redo arguments that had already been hashed out elsewhere years ago.
I would like to view the problem as a Bayesian one: given two or more hypotheses, each with prior probabilities, and a set of data, calculate the posterior probabilities of the hypotheses. This is essentially what we all do in "learning from experience". By "prior probability" is meant how likely we consider the hypothesis in the absence of data. By "posterior probability" is meant how likely the hypothesis should seem after we have seen the data. These posteriors will become priors when we see a new set of data. When we try to answer questions such as "Given that it is cloudy and humid, will it rain?" we base our answers on years of accumulated experience.
Bayes' Theorem states, basically, that given a set of mutually exclusive (and exhaustive) hypotheses Hi, with prior probabilities P(Hi), and given some previously unknown data D that pertains to these hypotheses, the "posterior probabilities" of the hypotheses, P(Hi|D), are proportional to P(Hi)P(D|Hi), where P(D|Hi) is the "likelihood" of the data given the hypothesis. That is, the posterior probability is proportional to both the prior probability of the hypothesis, and the probability of seeing the data given the hypothesis. To ensure that the posteriors sum to 1, each product is divided by a "normalisation factor", which is the sum of the products P(Hi)P(D|Hi) over all the hypotheses.
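Written out as a formula, with the normalisation made explicit, the statement above looks like this. (The index j, running over all the hypotheses, is notation introduced here for illustration; the rest follows the symbols already used.)

```latex
P(H_i \mid D) \;=\; \frac{P(H_i)\, P(D \mid H_i)}{\sum_j P(H_j)\, P(D \mid H_j)}
```

The denominator is the same for every hypothesis, which is why it is often enough to compare the numerators and normalise only at the end.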
An example may make it clearer: Suppose a patient is being tested for a particular kind of cancer. In people of that age group, this particular cancer occurs in 0.1% of the population (one in a thousand people). The test correctly reports cancer 99% of the time in patients with cancer (it gives a "false negative" 1% of the time). However, in patients without cancer, the test incorrectly reports cancer 5% of the time. In this case, the test is positive. What is the probability that the patient has cancer?
There are two hypotheses: H1 = the patient has cancer, H2 = the patient does not have cancer. Their prior probabilities are, respectively, 0.001 and 0.999. If the patient has cancer, the probability of seeing the data (the positive test) is P(D|H1) = 0.99. If the patient does not have cancer, the probability of seeing the data is P(D|H2) = 0.05. Then Bayes' theorem tells us that the posterior probabilities of H1 and H2 are, respectively, proportional to 0.001 times 0.99 and 0.999 times 0.05, or respectively, 0.00099 and 0.04995. After normalising, the probabilities are roughly 0.02 for "the patient has cancer" and 0.98 for "the patient does not have cancer" -- the information given by the tests is insufficient to overcome our prior information about the likeliness of cancer.
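For readers who prefer to check the arithmetic, here is a minimal Python sketch of the calculation above. The function name `posteriors` and the variable names are my own choices for illustration; the numbers are the ones given in the example.

```python
def posteriors(priors, likelihoods):
    """Return normalised posteriors, given priors P(Hi) and likelihoods P(D|Hi)."""
    # Unnormalised posterior for each hypothesis: prior times likelihood.
    products = [p * l for p, l in zip(priors, likelihoods)]
    # Normalisation factor: the sum of those products.
    total = sum(products)
    return [x / total for x in products]


# H1 = patient has cancer, H2 = patient does not have cancer.
priors = [0.001, 0.999]
# Probability of a positive test under each hypothesis.
likelihoods = [0.99, 0.05]

p_cancer, p_no_cancer = posteriors(priors, likelihoods)
print(f"P(cancer | positive test)    = {p_cancer:.4f}")
print(f"P(no cancer | positive test) = {p_no_cancer:.4f}")
```

This prints 0.0194 and 0.9806, which are the "roughly 0.02" and "0.98" quoted above.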
Bayes' Theorem can be readily proved in the "frequentist" interpretation of probability theory, which until recently was the only widely accepted interpretation. In this interpretation, a probability of an outcome is the fraction of times, in a large number of "trials", that the outcome can be observed. If one has N identical situations, and P of them yield positive outcomes, the probability of a positive outcome is P/N. (Think of coin tosses: if you toss an unbiased coin a thousand times, you will get heads in roughly 500 of them.) In the cancer example, the rate of occurrence of the cancer in a general population and the success rate of the test can be quantified via frequentist methods.
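The frequentist reading of the cancer example can be made concrete by simply counting people. The sketch below assumes a hypothetical population of one million (my choice, purely to keep the counts whole numbers) and applies the rates from the example.

```python
# Hypothetical population size, chosen so that every count is a whole number.
population = 1_000_000

# One in a thousand people has the cancer.
with_cancer = population // 1000
without_cancer = population - with_cancer

# The test reports cancer for 99% of those who have it ...
true_positives = with_cancer * 99 // 100
# ... and, wrongly, for 5% of those who do not.
false_positives = without_cancer * 5 // 100

# Among everyone who tests positive, the fraction who really have cancer.
fraction_with_cancer = true_positives / (true_positives + false_positives)

print(true_positives, "true positives,", false_positives, "false positives")
print(f"fraction of positives with cancer = {fraction_with_cancer:.4f}")
```

The false positives outnumber the true positives by about fifty to one, and the fraction is the same 0.02 or so that Bayes' theorem gives.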
Bayesian methods were controversial in cases where the frequentist picture does not apply -- where one does not have access to a large number of trials. For example (to get ahead of our story): "What is the probability that the Indus script represents a written language?" A frequentist would call the question meaningless, unless there is a large number N of Indus-like civilisations, each with similar scripts, and it was known that for P of those civilisations the script represented a language: then P/N is the prior probability of the language hypothesis. But in our case, N=1 and P is not known. Similarly, in the above medical example, if it were known that the cancer is a genetic condition and the patient has a family history of it, that would affect the prior probabilities, but it would be hard to calculate the correct priors. A good doctor, however, would certainly take the information into account in some way, and not dismiss the question as meaningless. And a Bayesian would say that a "gut feeling" assignment of the prior probabilities is better than no assignment at all.
If any useful observation came out of my previous post and the comments therein, it is this: we need to calculate P(D|H), that is, the probability of seeing the data shown by Rao et al. given the language hypothesis (HL) and given the non-language hypothesis (HNL); and the prior probabilities for each of those hypotheses. If we could actually do these, then we could assign a fairly confident posterior probability for each hypothesis.
Given the data for other languages in the Rao et al. paper, I would estimate P(D|HL) to be close to 1. That is, if the Indus script is a language, I would think it very likely that conditional entropies would closely resemble the data that Rao et al. show in their figure 1A. In comments to my previous post, I estimated P(D|HNL) as about 0.1: that is, if the script is not a language, the chance that it looks so much like a language is about 0.1. This was based on a crude argument: a generic sequence-generating process could lie anywhere between the "type 1" (fully random) and "type 2" (fully correlated) lines in figure 1A. Languages occupy a very narrow band in this region, that accounts for about 10% of the area (or 10% of the height at any given number of tokens). The probability of hitting that narrow band by chance is then about 10%. Of course, one can quibble with this: perhaps there is a large class of non-linguistic sequence-generating algorithms that will give conditional entropies in this band, but I think the burden is on those who protest to demonstrate that such classes of algorithms (a) exist and (b) are likely to have been used.
Estimating the prior probabilities is a whole other problem. Someone who professes complete ignorance would assign a prior of 0.5 to each hypothesis. With my favoured likelihoods of the data, above, this yields a posterior probability of about 0.91 for the language hypothesis, and 0.09 for the non-language hypothesis.
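The arithmetic behind those two numbers is short enough to spell out. This is only a sketch: the likelihoods are my own rough estimates from above (close to 1 for language, taken here as exactly 1.0, and about 0.1 for non-language), and the variable names are illustrative.

```python
# Priors under "complete ignorance".
prior_language = 0.5
prior_non_language = 0.5

# My rough likelihood estimates: P(D|HL) taken as 1.0, P(D|HNL) as 0.1.
like_language = 1.0
like_non_language = 0.1

# Prior times likelihood for each hypothesis, then normalise.
product_language = prior_language * like_language
product_non_language = prior_non_language * like_non_language
total = product_language + product_non_language

post_language = product_language / total
post_non_language = product_non_language / total

print(f"P(language | D)     = {post_language:.2f}")
print(f"P(non-language | D) = {post_non_language:.2f}")
```

This prints 0.91 and 0.09.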
But we are not completely ignorant: we do know quite a lot about the Indus civilisation. So how do we assign a prior to the two hypotheses?
For Steve Farmer, Richard Sproat and Michael Witzel, the answer is: there is zero probability that the Indus civilisation was literate. Their arguments are in this 39-page paper, but Farmer summarises them here in one sentence:
"Not one ancient literate civilization is known — including those that wrote routinely on perishable materials — that didn't also leave long texts behind on durable materials."
To this we can add one more claim from their longer paper: the statistical distribution of Indus symbols, including the large number of "singletons", that is, signs that occur only once, is proof that it could not be a language. (The word "proof" actually occurs twice in their paper, and the title is "The Collapse of the Indus-Script Thesis: The Myth of a Literate Harappan Civilization".) In other words, there is not much room for doubt -- at least, according to these scholars.
But Farmer's one-sentence summary is easily refuted: only three equally ancient advanced civilisations are known (Babylon, Sumer, Indus), and the Indus was by far the largest and most advanced of these. Farmer's sentence loses its impact somewhat when one realises that "Not one ancient literate civilisation is known..." means "Not one of the three that are over 4000 years old".
Not one civilisation is known, at any time in history, that was mainly urban, lived in planned cities with water supply and sanitation, had extensive trade networks and accurate measurement systems, and occupied an area of over a million square kilometres, but was illiterate.
Here we are not comparing with just three ancient civilisations, but with hundreds more between that time and ours. Many would say that we don't need to compare: the absurdity of the hypothesis that such a civilisation would be illiterate is self-evident.
As for the statistical arguments and the singleton count, Ronojoy Adhikari points me to this page containing data from Bryan Wells showing sign distributions from Proto-Sumerian, Proto-Elamite, and Uruk. Perhaps Farmer et al. will now argue that these were not scripts either.
Here is my feeling on what has happened here: Before 2004, the Rao et al. paper would not have gathered any attention. (Of course the Indus system is a language script! Why are you discussing it?) But that year, Steve Farmer managed to persuade two others -- one of whom, Michael Witzel, is a well-respected authority in the field -- to add their names to his thesis that it is not a language. The resulting manuscript was absurdly and unprofessionally bombastic in its language, while containing essentially nothing convincing. Regardless of the work of Rao et al., the Farmer et al. hypothesis would have died a natural death -- but Rao et al. do have Farmer et al. to thank for enabling them to publish their work, with its obvious conclusions, in a prestigious journal like Science. Farmer et al. are so rattled that they promptly post an incoherent, shrill, content-free, ad hominem rant on Farmer's website. Sproat even shows up on my previous post, leaving a chain of comments that reveal that he has neither understood, nor cares to understand, the argument. Apparently, all those who dissent from their 2004 paper are Dravidian nationalists.
So that leaves the question: how do we assign prior probabilities for the two hypotheses? I think the opinions of Farmer, Sproat and Witzel can be discounted. If we instead asked: "Given that every other urban civilization with water supply and sanitation was literate, how likely is it that the Indus civilisation was illiterate?" I think the answer would be: "Extremely unlikely."
If we assign a prior of 0.9 for language (based on the above, I'd put it higher) and 0.1 for non-language, and retain my likelihoods as above, the posteriors are: 0.99 language, 0.01 non-language.
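To see how much the conclusion depends on the choice of prior, one can tabulate the posterior for a range of priors. The sketch below keeps my rough likelihoods fixed (1.0 for language, 0.1 for non-language); the function name `language_posterior` and the particular list of priors are illustrative choices, not anything from the papers discussed.

```python
def language_posterior(prior, like_language=1.0, like_non_language=0.1):
    """Posterior P(HL|D) for a given prior P(HL), with P(HNL) = 1 - P(HL)."""
    numerator = prior * like_language
    # Normalisation factor: sum of prior-times-likelihood over both hypotheses.
    return numerator / (numerator + (1.0 - prior) * like_non_language)


# A spread of priors, from "certainly not a language" to "almost certainly one".
priors = [0.0, 0.1, 0.5, 0.9, 0.99]
table = {prior: language_posterior(prior) for prior in priors}

for prior, post in table.items():
    print(f"prior {prior:.2f} -> posterior {post:.3f}")
```

A prior of 0.9 gives the posterior of about 0.99 quoted above. A prior of exactly zero stays at zero whatever the data say, which is what it means to hold that there is "zero probability" that the civilisation was literate.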
I expect it to get more convincing. Some actual (non-rhetorical) evidence to the contrary would, however, be very interesting.