Tuesday, May 05, 2009

More Indus thoughts and links

[UPDATE 17 Sep 2010: The story continues here.]

I am an expert neither on archaeology and history, nor on computational linguistics (though some of my interests come close to the latter). My previous post attracted 56 comments, and I eventually closed comments because it seemed that nothing productive was going to occur, and meanwhile certain individuals seemed to be using the space to redo arguments that had already been hashed out elsewhere years ago.

I would like to view the problem as a Bayesian one: given two or more hypotheses, each with prior probabilities, and a set of data, calculate the posterior probabilities of the hypotheses. This is essentially what we all do in "learning from experience". By "prior probability" is meant how likely we consider the hypothesis in the absence of data. By "posterior probability" is meant how likely the hypothesis should seem after we have seen the data. These posteriors will become priors when we see a new set of data. When we try to answer questions such as "Given that it is cloudy and humid, will it rain?" we base our answers on years of accumulated experience.

Bayes' Theorem states, basically, that given a set of mutually exclusive hypotheses Hi, with prior probabilities P(Hi), and given some previously unknown data D that pertains to these hypotheses, the "posterior probabilities" of the hypotheses, P(Hi|D), are proportional to P(Hi)P(D|Hi), where P(D|Hi) is the "likelihood" of the data given the hypothesis. That is, the posterior probability is proportional to both the prior probability of the hypothesis and the probability of seeing the data given the hypothesis. To ensure that the posteriors sum to 1, one divides by a "normalisation factor": the sum of P(Hi)P(D|Hi) over all the hypotheses.

An example may make it clearer: Suppose a patient is being tested for a particular kind of cancer. In people of that age group, this particular cancer occurs in 0.1% of the population (one in a thousand people). The test correctly reports cancer 99% of the time in patients with cancer (it gives a "false negative" 1% of the time). However, in patients without cancer, the test incorrectly reports cancer 5% of the time. In this case, the test is positive. What is the probability that the patient has cancer?

There are two hypotheses: H1 = the patient has cancer, H2 = the patient does not have cancer. Their prior probabilities are, respectively, 0.001 and 0.999. If the patient has cancer, the probability of seeing the data (the positive test) is P(D|H1) = 0.99. If the patient does not have cancer, the probability of seeing the data is P(D|H2) = 0.05. Then Bayes' theorem tells us that the posterior probabilities of H1 and H2 are, respectively, proportional to 0.001 times 0.99 and 0.999 times 0.05, or respectively, 0.00099 and 0.04995. After normalising, the probabilities are roughly 0.02 for "the patient has cancer" and 0.98 for "the patient does not have cancer" -- the information given by the tests is insufficient to overcome our prior information about the likeliness of cancer.
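The arithmetic above can be checked in a few lines of Python (a sketch using exactly the numbers from the example):

```python
# Posterior probability of cancer given a positive test, via Bayes' theorem.
prior_cancer = 0.001          # base rate of the cancer in this age group
prior_no_cancer = 0.999
p_pos_given_cancer = 0.99     # test sensitivity (1% false negatives)
p_pos_given_no_cancer = 0.05  # false-positive rate

# Unnormalised posteriors: prior times likelihood
u_cancer = prior_cancer * p_pos_given_cancer           # 0.00099
u_no_cancer = prior_no_cancer * p_pos_given_no_cancer  # 0.04995

# Normalise so the two posteriors sum to 1
total = u_cancer + u_no_cancer
posterior_cancer = u_cancer / total
posterior_no_cancer = u_no_cancer / total

print(round(posterior_cancer, 3), round(posterior_no_cancer, 3))  # 0.019 0.981
```

Despite the positive test, the posterior probability of cancer is only about 2%, because the prior (one in a thousand) is so low.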

Bayes' Theorem can be readily proved in the "frequentist" interpretation of probability theory, which until recently was the only widely accepted interpretation. In this interpretation, the probability of an outcome is the fraction of times, in a large number of "trials", that the outcome is observed. If one has N identical situations, and M of them yield positive outcomes, the probability of a positive outcome is M/N. (Think of coin tosses: if you toss an unbiased coin a thousand times, you will get heads in roughly 500 of them.) In the cancer example, the rate of occurrence of the cancer in a general population and the success rate of the test can be quantified via frequentist methods.

Bayesian methods become controversial when the frequentist picture does not apply -- when one does not have access to a large number of trials. For example (to get ahead of our story): "What is the probability that the Indus script represents a written language?" A frequentist would call the question meaningless, unless there were a large number N of Indus-like civilisations, each with similar scripts, and it were known that for M of those civilisations the script represented a language: then M/N would be the prior probability of the language hypothesis. But in our case, N=1 and M is not known. Similarly, in the above medical example, if it were known that the cancer is a genetic condition and the patient has a family history of it, that would affect the prior probabilities, but it would be hard to calculate the correct priors. A good doctor, however, would certainly take the information into account in some way, and not dismiss the question as meaningless. And a Bayesian would say that a "gut feeling" assignment of the prior probabilities is better than no assignment at all.

If any useful observation came out of my previous post and the comments therein, it is this: we need to calculate P(D|H), that is, the probability of seeing the data shown by Rao et al. given the language hypothesis (HL) and given the non-language hypothesis (HNL); and the prior probabilities for each of those hypotheses. If we could actually do these, then we could assign a fairly confident posterior probability for each hypothesis.

Given the data for other languages in the Rao et al. paper, I would estimate P(D|HL) to be close to 1. That is, if the Indus script is a language, I would think it very likely that conditional entropies would closely resemble the data that Rao et al. show in their figure 1A. In comments to my previous post, I estimated P(D|HNL) as about 0.1: that is, if the script is not a language, the chance that it looks so much like a language is about 0.1. This was based on a crude argument: a generic sequence-generating process could lie anywhere between the "type 1" (fully random) and "type 2" (fully correlated) lines in figure 1A. Languages occupy a very narrow band in this region, that accounts for about 10% of the area (or 10% of the height at any given number of tokens). The probability of hitting that narrow band by chance is then about 10%. Of course, one can quibble with this: perhaps there is a large class of non-linguistic sequence-generating algorithms that will give conditional entropies in this band, but I think the burden is on those who protest to demonstrate that such classes of algorithms (a) exist and (b) are likely to have been used.

Estimating the prior probabilities is a whole other problem. Someone who professes complete ignorance would assign a prior of 0.5 to each hypothesis. With my favoured likelihoods of the data, above, this yields a posterior probability of about 0.91 for the language hypothesis, and 0.09 for the non-language hypothesis.

But we are not completely ignorant: we do know quite a lot about the Indus civilisation. So how do we assign a prior to the two hypotheses?

For Steve Farmer, Richard Sproat and Michael Witzel, the answer is: there is zero probability that the Indus civilization was literate. Their arguments are in this 39-page paper, but Farmer summarises it here in one sentence:

"Not one ancient literate civilization is known — including those that wrote routinely on perishable materials — that didn't also leave long texts behind on durable materials."

To this we can add one more claim from their longer paper: the statistical distribution of Indus symbols, including the large number of "singletons", that is, signs that occur only once, is proof that it could not be a language. (The word "proof" actually occurs twice in their paper, and the title is "The Collapse of the Indus-Script Thesis: The Myth of a Literate Harappan Civilization".) In other words, there is not much room for doubt -- at least, according to these scholars.

But the one-sentence summary of Farmer is easily refuted: only three other equally ancient advanced civilisations are known (Babylon, Sumer, Indus), and the Indus was by far the largest and most advanced of these. Farmer's sentence loses its impact somewhat when one realises that "Not one ancient literate civilisation is known..." means "Not one of the three that are over 4000 years old".

Iravatham Mahadevan writes an excellent article in the Hindu demolishing the Farmer thesis on archaeological and historical grounds. I think his arguments can be summarised, Farmer-style, in the following sentence:
Not one civilisation is known, at any time in history, that was mainly urban, lived in planned cities with water supply and sanitation, had extensive trade networks and accurate measurement systems, and occupied an area of over a million square kilometres, but was illiterate.

Here we are not comparing with just three ancient civilisations, but with hundreds more between that time and ours. Many would say that we don't need to compare: the absurdity of the hypothesis that such a civilisation would be illiterate is self-evident.

As for the statistical arguments and the singleton count, Ronojoy Adhikari points me to this page containing data from Bryan Wells showing sign distributions from Proto-Sumerian, Proto-Elamite, and Uruk. Perhaps Farmer et al. will now argue that these were not scripts either.

Here is my feeling on what has happened here: Before 2004, the Rao et al. paper would not have gathered any attention. (Of course the Indus system is a language script! Why are you discussing it?) But that year, Steve Farmer managed to persuade two others -- one of whom, Michael Witzel, is a well-respected authority in the field -- to add their names to his thesis that it is not a language. The resulting manuscript was absurdly and unprofessionally bombastic in its language, while containing essentially nothing convincing. Even without the work of Rao et al, their hypothesis would have died a natural death -- but Rao et al do have Farmer et al to thank for enabling them to publish their work, with its obvious conclusions, in a prestigious journal like Science. Farmer et al are so rattled that they promptly post an incoherent, shrill, content-free, ad hominem rant on Farmer's website. Sproat even shows up on my previous post, leaving a chain of comments that reveal that he has neither understood, nor cares to understand, the argument. To them, apparently, all those who dissent from their 2004 paper are Dravidian nationalists.

So that leaves the question: how do we assign prior probabilities for the two hypotheses? I think the opinions of Farmer, Sproat and Witzel can be discounted. If we instead asked: "Given that every other urban civilization with water supply and sanitation was literate, how likely is it that the Indus civilisation was illiterate?" I think the answer would be: "Extremely unlikely."

If we assign a prior of 0.9 for language (based on the above, I'd put it higher) and 0.1 for non-language, and retained my likelihoods as above, the posteriors are: 0.99 language, 0.01 non-language.
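Both calculations -- the ignorance prior of 0.5 and the informed prior of 0.9, with my estimated likelihoods -- can be reproduced with a small Python sketch:

```python
def posterior_language(prior_L, lik_L=1.0, lik_NL=0.1):
    """Posterior P(language | data), given a prior for the language
    hypothesis and the likelihoods estimated in the text:
    P(D|HL) close to 1, P(D|HNL) about 0.1."""
    num = prior_L * lik_L
    den = num + (1.0 - prior_L) * lik_NL
    return num / den

print(round(posterior_language(0.5), 2))  # ignorance prior: 0.91
print(round(posterior_language(0.9), 2))  # informed prior: 0.99
```

The likelihood values are, of course, the crude estimates argued for above, not measured quantities.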

I expect the case to get more convincing. Some actual (non-rhetorical) evidence to the contrary would, however, be very interesting.


kalyan97 said...


Have the views of Fernando Pereira been discussed and countered?


Kapil H. Paranjape said...

As one who did badly in Statistics (perhaps because of confusions such
as the one below), it seems to me that the important point is the
hypothesis A to which one applies the statement "A until (statistically)
proven not A".

In one case A = "is a language" in the other case A = "is not a language".

In the first case, the onus is on the disprovers to provide _real_ (as
opposed to artificially constructed) non-linguistic systems that match
the statistical measure chosen.

In the second case, the onus is on the provers to provide a
statistical measure that is not matched by any (even artificially
constructed) non-linguistic system.

km said...

That was a great explanation of Bayes' theorem and how it applies to this particular problem!

Rahul Siddharthan said...

Kalyanaraman: what views in particular? I don't see anything there that has not already been discussed ad nauseam.

Kapil: I think the word you're looking for is "null hypothesis". We can of course choose either "language" or "non-language" as the null hypothesis. The difficulty is that there is no model that adequately describes either. So a frequentist-style calculation of a "p-value" (probability of seeing the data given the null hypothesis) will not work. But if we must, I would give the data a p-value of 0.1 if the null hypothesis is "non-language". And if the "null hypothesis" is "language" I would assign a very large p-value (close to 1). But these are not calculations that would please a frequentist.

I think the Bayesian approach is a bit more useful, in taking account of prior probabilities as well as permitting such estimates in the absence of frequentist calculations. E T Jaynes takes it further (see in particular his textbook Probability Theory) -- he basically argues that the laws of probability theory are generalisations of Boolean logic and arise uniquely from certain simple desiderata; and though the frequentist limit exists and agrees with "classical" probability/statistics, it is not required at all.

Kapil H. Paranjape said...

I wasn't sure about the term "null hypothesis" so I didn't use it! :-)

To clarify my comments further here is a different classification
problem: mail vs. junk mail.

There are two views about e-mail:

1. Mail is "ham" unless proved to be "spam".

2. Mail is "spam" unless proved to be "ham".

In principle, one might think that the whole system is symmetric
about the interchange of "ham" and "spam" --- but it is not!

Those who believe in the former will create a Bayesian classifier and
will then separate out their mail into (typically) three classes:
ham, probably spam and spam
They would be very unhappy if some genuine mail landed in the third
category ("false positives") so they would tune their classifier to
ensure that undecideds would land in the first or second class and
live with the possibility of "false negatives" as long as this was
kept under control.
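The three-bin sorting described above can be sketched in a few lines of Python (a toy illustration: the spam probability would come from an actual Bayesian classifier, and the threshold values here are invented):

```python
def triage(p_spam, spam_cut=0.99, ham_cut=0.5):
    """Sort a mail into three bins from an assumed spam probability.
    The spam cut is kept very high so that genuine mail ("false
    positives") almost never lands in the spam bin; undecided cases
    fall into "probably spam" instead."""
    if p_spam >= spam_cut:
        return "spam"
    if p_spam >= ham_cut:
        return "probably spam"
    return "ham"

print(triage(0.999), triage(0.7), triage(0.1))  # spam probably spam ham
```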

Those who believe in the latter often do not believe in using a
Bayesian classifier at all! Typically, they have a whitelist of
addresses that they accept mail from. Any other mail is bounced back
with a demand that the sender prove "genuine-ness" in some way.
Why use a statistical method when a fixed set of rules will ensure
that 99% of useful mail gets to me?! The 1% of good mail that fails
can be used to create some new rules (exceptions). Most importantly,
(for such users) their system ensures that "spam" (essentially) never
gets through!

So one side believes in Bayesian classifiers whereas the second side
believes in logical rules.

All this is just to say that explaining Bayesian approaches to
detecting languages may not have any (or the desired) effect on people
who want you to provide a linguistic model that fits a certain data
set before they will accept this data set as a language. If you fail
to provide such a model, they would accept a statistical measure only
if it does not allow *any* non-language to pass.

(On reading the blogs/comments of some of the supporters of "it is
not a language", one comes to the conclusion that they believe in
logical models for languages.)

kalyan97 said...

Rahul Siddharthan said...
Kalyanaraman: what views in particular? I don't see anything there that has not already been discussed ad nauseam.

My comment: Fernando is referring to the definition of 'language'. Let me explain my take on statistical analyses (I am a statistician myself). Language is too vague a term to be subject to mathematical formulae or probability analyses. See my Indian lexicon, which provides over 8000 semantic clusters for over 25 ancient Indian languages; many are 'image' words, and some are 'thought' words. The problem of decipherment is the matching of homonyms, that is, matching image words with thought words. What would you call the Indus script if both the pictorial motifs (yoga posture showing penance, tiger or antelope looking back, crocodile holding fish, etc.) and signs (rim of jar, rimless pot, fish, arrow, svastika, etc.) are simply representations of image words read rebus for metal/mine-artifacts of a smithy or a mine-worker working with many types of smelters, furnaces and creating many metal alloys (hence ligatured glyphs)? The same glyphs repeat on punch-marked coins, showing the writing system to be a creation of early mine-workers and metal smiths. Yes, it is a writing system of lists of mint repertoire using mleccha speech. Why look for grammar and all that paraphernalia of language?

Now, how does Fernando's comment relate to such a decoding and the identification of semantic clusters related to a mine or mint -- just one semantic category which seems capable of explaining all the 500 glyphs of the writing system invented by the inventors of alloying?


kalyan97 said...

I owe a further explanation of the term 'glyphs' used by me. In Mahadevan's concordance, field symbols (100) are treated as a category distinct from signs (over 400). This distinction is to facilitate the compendium compilation. See the entire corpus of epigraphs at


You will agree that the distinction between 'field symbols' and 'signs' is arbitrary and is unnecessary to explain the writing system. A decoding is complete only when both categories are treated together and explained in the context of the speech of the creators of the glyphs.

My Indian lexicon with over 8000 semantic clusters in over 2000 pages can be seen at http://www.scribd.com/doc/2232617/lexicon


Rahul Siddharthan said...

Kalyanaraman - I'm not sure I understand you at all, but I think if a script is to convey meaning, it should be more than a sequence of symbols -- at a minimum we need both nouns and verbs, even if other parts of speech are omitted. I'm not sure what you mean by rebus -- do you mean symbols stood for phonetic sounds? But in any case I think we are very far from answering the question of what sort of script the Indus script is.

I notice from your website that you are sympathetic to the idea that the Indus civilization was a proto-Vedic ("Sarasvati") civilization. I think there are far too many holes in that theory to take it seriously. Extraordinary claims require extraordinary evidence.

Kapil - I think you are taking an extreme position. Even if one takes "ham" as the "null hypothesis" and declares that "mail is ham unless proved spam", no mail can be proved spam. With a sufficiently good model for ham, one can calculate a p-value, and a sufficiently low p-value can prove it "beyond reasonable doubt" but not prove it beyond any doubt; and we are very far from having a "sufficiently good model". Similarly for the converse (taking spam as null hypothesis).

Hence the need for Bayesian classifiers, which only assign posterior probabilities using our best (and continually updated) prior information.

The whitelist example does not even attempt to assign a p-value or any such thing. Lots of mail may be ham, but not provably so, and will therefore be blocked (or challenged -- but in my belief, challenge-response systems that require human action don't work most of the time).

So whether a mail is ham according to the whitelist or "rules" has nothing to do with the contents of the mail itself, but only the prejudices of the creator of the whitelist and "rules". I don't think that can be viewed as "hypothesis testing" in any sense.

Anonymous said...

Cosma Shalizi has blogged about it here in case you are interested


Anant said...

Rahul: another overused phrase is `yeoman service' but I cannot find another term to describe your efforts at explaining the work through this and the previous related post. Keep up the great work. Anant

Arun said...


Great work; interesting and inspiring. Thank you.

Rahul Siddharthan said...

anonymous: thanks for the link.

Arun, Anant: thanks!

kalyan97 said...

Rahul commened: I'm not sure what you mean by rebus -- do you mean symbols stood for phonetic sounds?

I notice from your website that you are sympathetic to the idea that the Indus civilization was a proto-Vedic ("Sarasvati") civilization. I think there are far too many holes in that theory to take it seriously. Extraordinary claims require extraordinary evidence.

My answers: A writing system can consist of simply nouns without any verbs. Rebus means that the glyphs stood for phonetic sounds.

I never claimed that the Indus script is proto-Vedic. My website contains a note on proto-Vedic continuity theory for language evolution in the linguistic area of ancient India. My decoding of the script is simple: the language is mleccha; the glyphs are read rebus for metallurgical/mine-work artefacts of minerals, metals, alloys, furnaces/smelters and the professional competence of metalsmiths. All glyphs (pictorial motifs + signs) represent rebus words (that is, homonyms, 'similar sounding words', which can be depicted as glyphs but read as similar-sounding equivalents related to smithy/mine work).

kalyan97 said...

Again, I invite a reference to the entire corpus of inscriptions. Please see about 4000 inscriptions in 8 albums at:

It will be clear from the corpus that pictorial motifs dominate the inscription field. So, pictorial motifs should also be decoded. I do not understand why the focus is only on 'signs'.

Both pictorial motifs and signs are glyphs and both encode speech, mleccha speech.

The scientists will be missing out on information if the pictorial motifs are excluded from the Markov models or probability analyses.

Do the encoders understand that there was an attested language called mleccha in ancient India? The categorisation of Indo-Aryan, Munda and Dravidian is not the only classification; in fact, it could be a misleading classification and lead decoders up the wrong path.


Arun said...

There is the logical possibility that the Indus Valley Civilization was literate, but what is on the seals does not represent their language.

I did not see anything new in Cosma Shalizi's argument. We need human-generated sequences of symbols that are non-linguistic but have conditional entropies like a language; that would shoot down Rao et al straightaway.

kalyan97 said...

Arun: There is the logical possibility that the Indus Valley Civilization was literate, but what is on the seals does not represent their language.

Kalyanaraman: I do not understand this. If IVC was literate, by definition, they had known 'writing'. Why would they be writing some one else's language?

Rahul Siddharthan said...

Belated response to several recent comments.

Kalyanaraman: if the script contained only nouns and not verbs, I would regard the script as non-linguistic. As for why pictorial motifs are not being included -- I don't know, I am not an expert. I can believe that they are relevant.

From what I understand of "mleccha" it is not a documented language but merely an ancient Hindu term for a non-Hindu (non-Aryan). I don't know how one concludes that they were non-Dravidian as well.

Arun: Indeed it is possible that the seals don't represent their language, but something else. And it is possible that none of their writing has survived. It could also be that the seals use the same script as the language, but are too terse to qualify as text (they may be more akin to inventories etc -- sequences of nouns, not complete sentences).

kalyan97 said...

Rahul Siddharthan said: Kalyanaraman: if the script contained only nouns and not verbs, I would regard the script as non-linguistic. As for why pictorial motifs are not being included -- I don't know, I am not an expert. I can believe that they are relevant.

From what I understand of "mleccha" it is not a documented language but merely an ancient Hindu term for a non-Hindu (non-Aryan). I don't know how one concludes that they were non-Dravidian as well.

Kalyanaraman's answer: The key determinant is whether the script encoded underlying speech. If so, even a mere list of nouns becomes a writing system.

Mleccha is an attested language: Manu distinguishes between mleccha vaacas and arya vaacas (that is, between ungrammatical lingua franca and grammatical written language).

Of course, pictorial motifs are relevant; they occupy the major portion of the inscription field. Without decoding these as nouns based on speech, all attempts are a waste and likely to mislead.

kalyan97 said...

About nouns and verbs.

Take a pictorial motif: tiger or antelope looking back. 'Looking back' is krammarincu, krammara in Telugu. The rebus reading is karmara 'smith'. Tiger + looking back: kola + krammara = rebus pancaloha + smith; Antelope + looking back: ranku + krammara = rebus: tin + smith.

I suggest that the glyph showing an animal looking back encodes a verb in mleccha speech. Hence, the script is a writing system.

kalyan97 said...

Indus Script encodes language:

Response of Rajesh PN Rao et al to Internet Discussions about their Work which appeared in Science magazine-- Entropic Evidence for Linguistic Structure in the Indus Script (21 May 2009)


Arun said...

Rajesh Rao and friends have a response to Witzel et al.


Rahul Siddharthan said...

Dear Kalyanaraman: Please do not post other people's private mail here. If Richard Sproat, or anyone else, wants to contribute here directly, he is welcome to do so.

Uddalaka said...

It is trivial to construct examples of non-linguistic symbol systems that have conditional entropies identical to those of linguistic ones. Take, for example, the 48 phonemes of Classical Sanskrit, represented, for instance, by the appropriate symbols of the International Phonetic Alphabet. Replace 33 consonants by, say, symbols representing the 33 Vedic deities, and the rest by symbols representing deities from another pantheon, all shuffled up in random order. Thus, any text in Classical Sanskrit will become a "non-linguistic" inscription of "magico-religious" symbols. This would especially seem to be the case when one starts with a list of 2,000-3,000 names/titles, averaging ~5-6 syllables in length.

Naturally, it needn't be Sanskrit, one can do this in one's favorite language, expressed in one's favorite script.

The point is, of course, if the only significant symbolic strings out of a great civilization, formed out of systematic combination of 200-odd basic symbols, have conditional entropy comparable to that of linguistic scripts, there's no reason to believe they don't represent language.

Anonymous said...

@Rahul Siddharthan: Very nice blog post.

There were some dialectics with Professor Sproat himself in the comments section on my blog at: http://karatalaamalaka.wordpress.com/2009/05/04/indus-valley-symbols/#comments (scroll down to the 11th comment)

In particular, while he accepts that he is responsible for the silly language of "Collapse of the...", he says that he does not entirely agree with the language used by Dr. Farmer.

I would like to hear your views about Professor Sproat's latest rebuttal:

Unless the entire methodology adopted in generating this particular plot is published and the rigor is explained in detail (as in the addendum to Rao, et al's Science paper), I do not accept the rebuttal as being conclusive. However, this latest rebuttal by Professor Sproat is indeed interesting and I expect and hope that it generates some academic debate (sans the rhetoric).

Rahul Siddharthan said...

karatalaamalaka: very interesting discussion, thanks!

As Sproat says in his figure caption, one needs access to the corpus to reproduce the data. But all his data fall broadly in the same log-odds range as what Rao et al show, except the Chinese data and, unsurprisingly, the heraldic signs (and the Sumerian and Korean are borderline). Perhaps it is because of the ideographic nature of Chinese, or perhaps there is some other reason... hard to say from my perspective.

I wonder why Sproat focuses on Asian languages and didn't include English and other languages where there are large public-domain corpuses of text.

kalyan97 said...

Indus seal from Farmana: metal casting workshop
Decoding Sarasvati hieroglyphs on the seal: iron cast workshop
sal ‘bos gaurus’ bison; Rebus: sāla = workshop (B.)
kat.ra_, kat.r.a_ = piece of ground enclosed and inhabited, market town, market, suburb (H.); Rebus: kar.ru, kad.ru_ buffalo calf (male or female)(Kur.); kat.a_ male of sheep or goat, he-buffalo (Ta.); male of cattle, young and vigorous; child, young person (Ma.)(DEDR 1123). kat.a_ri = young, plump bull, heifer (Ta.); kat.r.a_ = young buffalo (Ku.); kat.iya_ = buffalo heifer (H.); kat.hr.a_ young buffalo bull (H.)
kaṇḍa ‘arrow’; rebus: ayaskāṇḍa; ayaskāṇḍa a quantity of iron, excellent iron (Pāṇ.gaṇ)
dul = pair (synonym: two strokes)(Mu.): rebus: dul (cast) beḍa ‘fish’; beḍa ‘hearth’
koḍ = place where artisans work (G.lex.) koḍiyum = a wooden circle put round the neck of an animal; koḍ = neck (G.lex.) kōḍu = horns (Ta.)
The seal thus denotes: iron cast workshop with a hearth for casting
Language: mleccha; script in hieroglyphs: mlecchita vikalpa

Report excerpts:
Environmental Change and the Indus Civilization:Indus Project, Research Institute for Humanity and Nature, Japan
Our project has conducted archaeological excavations at Kanmer in Gujarat and, Girawad, Farmana and Mitathal in Haryana so far. Excavations is also planned to be made at Ganweriwala in Punjab, Pakistan.
By using the GIS, various data are being integrated into a spatial platform.
Photo 1 Stone-built Perimeter Wall at Kanmer
Excavation at Kanmer revealed that the site was enclosed by massive stone-built perimeter walls. Photo 2 Mud-brick structure at Farmana
Mud-brick structures during the Indus Civilization were discovered at Farmana.

Photo 3 Indus Seal from Farmana
http://www.chikyu.ac.jp/rihn_e/project/images/4FR-3_photo3.jpg The Indus seals which depict various animals show a part of the relations between the human society and natural environment.
Figure 3 Bird’s eye view of the ancient mound at Kanmer
A topographical map made by the GPS and the total station provides the basis for the analysis of the site structure by GIS.
More at http://sites.google.com/site/kalyan97/mlecchitavikalpa

Anonymous said...

A new paper by Rao et al. has been announced.

Note on the Indo-Eurasian list and response by Farmer already in place.

venkat said...

one finds it extremely difficult to relate the Indus civilisation with the Dravidian languages prevailing in the south of India. As I delve deeper into the problem, I find that there is no evidence at all in, say, the Tamil literature for any connexion. There is no folk memory in Tamil classical writings, and there are no artifacts or anything resembling physical structures like brickwork or anything of that kind in Tamil lands. The odd discovery of a hand axe showing Indus marks does not help us to associate the Indus language with Dravidian linguistic development. It is therefore a pure flight of fancy to try to establish that the Indus people were Dravidians. And as far as the presence of the Brahuis is concerned, despite McAlpine's strong effort to create a proto-Dravidian linguistic structure, the presence of a small number of vocal expressions in that language does not help us to state with any degree of certainty that the Indus civilisation had anything to do with the Brahuis.
I do not know why my previous post in this matter was ignored when one of my students years ago produced a credible suggestion that the so-called language on the seals was all identifying representations and, perhaps, nothing else.

Rahul Siddharthan said...

venkat - do not know why my previous post in this matter was ignored

If you mean, the comment you left on another Indus post of mine: could it be because my post was more than three months old? I continue to get notified of comments but do you really think people have nothing better to do than follow that thread forever?

If you have something useful to say, say it in your own space or in a forum that people are likely to read (perhaps Steve Farmer's mailing list).

toto said...

The crux of the matter is simple. The cond. entropy measure is only useful if it somehow distinguishes between linguistic and non-linguistic (but man-made) sequences.

Rao et al. claim that it does, using a computer programming language and DNA. Farmer et al. say that it doesn't, using heraldic signs plus a flurry of different human languages (how they compute conditional entropy for heraldic signs is not described - how do you order the signs?).

The only way to settle the matter is to apply the measure to many other human, non-linguistic corpora (Vinca signs?). Nothing else will do.
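For concreteness, here is a minimal sketch of the bigram conditional entropy measure the debate turns on. This is an unsmoothed maximum-likelihood estimate; Rao et al. used smoothed estimates, which matter for small corpora, so treat this only as an illustration of the quantity being computed:

```python
from collections import Counter
from math import log2

def conditional_entropy(tokens):
    """H(next sign | current sign) in bits, estimated from bigram
    counts (unsmoothed maximum-likelihood estimate)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (a, b), n in bigrams.items():
        p_ab = n / total            # joint estimate P(a, b)
        p_b_given_a = n / firsts[a]  # conditional estimate P(b | a)
        h -= p_ab * log2(p_b_given_a)
    return h

# A rigid system, where each sign fully determines the next, gives H = 0;
# a random sequence approaches log2(alphabet size).
print(conditional_entropy(list("abababab")))      # → 0.0 (fully predictable)
print(conditional_entropy(list("abcdacbdbadc")))  # higher: less predictable
```

The argument in this thread is precisely about whether a value of H in the intermediate, "language-like" range can also be produced by non-linguistic sign systems.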

Anonymous said...

"how they compute conditional entropy for heraldic signs is not described - how do you order the signs?"

A point I'd wanted to make earlier: heraldic signs are typically arranged in three dimensions - overlapping layers with symbols distributed in 2-d on each. It is wholly non-comparable to the strict linear order of the Harappan seal-script.

Heraldic devices have ~purely linguistic~ descriptions called 'blazons', e.g. "Barry azure and argent, an orle of martlets gules". They obey various conventions for ordering the symbols/patterns on the shield. Sproat has probably *abstracted out* a sequence of tokens representing heraldic signs, i.e. [barry] [azure] [argent] [orle] [martlet] [gules] &c, and then calculated their conditional entropy.

It is crucial to note that some of the ordering above represents purely linguistic order : 'orle' simply means 'border' (of a particular type), and so 'orle of martlets' is just saying 'a border of birds'. Likewise, 'barry' means 'bars', and so 'barry azure and argent' = 'bars blue and silver'. 'gules' is a colour (a deep red). Colours (Or, sable, gules, argent &c) are seemingly all placed after the object that's coloured.
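The abstraction step conjectured above might look roughly like the following sketch. The stopword list and tokenization here are guesses at what such preprocessing could strip, not Sproat's actual procedure:

```python
import re

def tokenize_blazon(blazon):
    """Reduce a blazon to a flat token sequence: lowercase, keep only
    word tokens, and drop connectives ('of', 'and', articles) -- a
    hypothetical preprocessing step, not a documented one."""
    stopwords = {"of", "and", "an", "a"}
    words = re.findall(r"[a-z]+", blazon.lower())
    return [w for w in words if w not in stopwords]

print(tokenize_blazon("Barry azure and argent, an orle of martlets gules"))
# → ['barry', 'azure', 'argent', 'orle', 'martlets', 'gules']
```

Conditional entropy would then be computed over such token sequences, exactly as for any other corpus of discrete signs.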

If Harappan signs have even this degree of linguistic order, I would consider a linguistic decipherment possible.

Anonymous said...

"The cond. entropy measure is only useful if it somehow distinguishes between linguistic and non-linguistic (but man-made) sequences."

It must be added that Rao et al. explicitly state in their response to FSW's response that "Counter examples matter only if we claim that conditional entropy by itself is a sufficient criterion to distinguish between language and non-language. We do not make this claim in our paper" (italics mine).

See this.

Richard Sproat said...

In generating that plot I did exactly what Rao et al. did. For each corpus, I computed the conditional entropy for the 20 most frequent signs, the 40 most frequent, and so forth. By the way, it was not clear from the Rao et al. paper or the supplement that this method (20 most frequent, 40 most frequent ...) is what they were doing. I had to ask Rao. Maybe that was just me being stupid, but Liberman didn't understand that from the paper either. I think what they did just was not clearly described.
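One plausible reading of that procedure - restrict the corpus to its N most frequent signs, recompute, and repeat for growing N - can be sketched as follows. Whether dropped signs split the sequence or are simply skipped over, as here, is exactly the kind of detail that the paper left unclear:

```python
from collections import Counter
from math import log2

def cond_entropy(tokens):
    """Unsmoothed bigram conditional entropy H(next | current), in bits."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    total = sum(bigrams.values())
    return -sum(n / total * log2(n / firsts[a])
                for (a, b), n in bigrams.items())

def entropy_curve(tokens, step=20):
    """For N = step, 2*step, ...: keep only the N most frequent signs
    (skipping over dropped signs, one possible interpretation) and
    compute conditional entropy on the restricted sequence."""
    ranked = [s for s, _ in Counter(tokens).most_common()]
    curve = []
    for n in range(step, len(ranked) + step, step):
        top = set(ranked[:n])
        sub = [t for t in tokens if t in top]
        curve.append((min(n, len(ranked)), cond_entropy(sub)))
    return curve
```

Plotting such curves for several corpora is what produces the figures being argued over; the smoothing method (Good-Turing, Kneser-Ney) changes the values but, as noted below, not the overall appearance.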

I used Good-Turing rather than Kneser-Ney smoothing, but that will merely change the values, not the overall appearance. There is no reason to believe that Kneser-Ney will be much of a win for corpora of this size anyway. (The comparisons they allude to are for corpora with millions of tokens.)

As suggested, the heraldic corpus was based on blazon, which does look linguistic -- a kind of odd mixture of English and Old French. But it is important to understand that blazon is just a formal language for describing heraldic arms, and the two are interconvertible by someone who knows the system.

On why I picked mostly Asian languages (though neither Sumerian nor Amharic is Asian). Simple answer: I wanted languages that used scripts with basic symbol counts in the range of many tens or hundreds. English has only 26 letters, and as Rao et al.'s plot shows, that doesn't make for a very interesting curve.

Of course, and I have said this before, for this to be done really correctly, one would want a whole bunch of real non-linguistic systems and a whole bunch of real linguistic systems. Besides heraldry, I'd like to see mathematical symbols, dance notation, Naxi pictography ... More work to do and not enough hours in the day to do it.

But in any event, the tiny sample that was used in the original Rao et al. paper should not be taken to be convincing by anyone.

By the way, anyone want to hazard a guess why Fortran, which appears in their second plot, is not plotted in the first plot? I have a hunch why, but I'm wondering if anyone else has thought about this.

One final point. Chinese writing is not ideographic. It just isn't. That's not just my opinion. Go read DeFrancis' "The Chinese Language: Fact and Fantasy".

Rahul Siddharthan said...

By the way, anyone want to hazard a guess why Fortran, which appears in their second plot, is not plotted in the first plot?

I agree it is a weak point in their paper: I observed in my earlier post that I would have liked to see real-world data in figure 1.

But with that last comment from Richard Sproat, I think it is time to close this discussion. There is a new paper out but I haven't read it (very busy with other things).