Wednesday, September 06, 2006

Unicode: what's the problem?

One of my pet peeves is the lack of standardisation of Indian-language fonts on the internet. No, let me correct that. The standard exists: it's called Unicode. But hardly any Indian-language sites (at least, Hindi and Tamil---I haven't looked at others) use it. I can only think of the BBC news site and Wikipedia, and I doubt very many Hindi or Tamil readers read those regularly. The rest use their own arbitrary fonts, with glyphs mapped sometimes onto the Windows (CP-1252) encoding, sometimes not; no two fonts are substitutable. (Hm---I just checked and saw that Navbharat Times does use Unicode. But Dainik Jagran, Ananda Vikatan and many others do not.)

I'm not quite sure who is to blame for this. I had generally assumed it was the Indian government and academic bodies, who failed to set standards or pushed half-baked ones like ISCII. But according to the Unicode Consortium's own FAQ, Unicode and ISCII correspond almost directly (the FAQ explains the minor differences), and the ISCII standard goes back to 1988. So why did all these websites choose to reinvent the wheel, poorly?

Meanwhile, today's Hindu has an article on Tamil character encoding: apparently some Tamil users are unhappy with the current encoding, which uses only 8 bits. I would have thought 8 bits is plenty for a language with 18 consonants, 12 vowels and a few other letters. But the speakers are unhappy that syllables (like "kA", "கா") have to be written by combining consonants and vowel signs ("க" + "ா"), which allegedly can cause "delays during data processing". (I should note for readers unfamiliar with Indian scripts that the second character above, "ா", is not a raw vowel but a modifier of the preceding consonant; the vowel by itself would be "ஆ". "கா" is conceptually a single syllable, not two letters. But this hardly seems a practical problem.)
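
To make the combining concrete, here is a tiny Python 3 sketch (my own illustration, not from the article) that prints the code points behind "கா"; it assumes only the standard unicodedata module and a font that can display Tamil:

    import unicodedata

    # The syllable "kA" is stored as two code points:
    # the consonant KA followed by the combining vowel sign AA.
    for ch in "கா":
        print("U+%04X  %s" % (ord(ch), unicodedata.name(ch)))
    # U+0B95  TAMIL LETTER KA
    # U+0BBE  TAMIL VOWEL SIGN AA

    # The independent vowel "ஆ" is a separate code point altogether.
    print("U+%04X  %s" % (ord("ஆ"), unicodedata.name("ஆ")))
    # U+0B86  TAMIL LETTER AA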

The delays-in-processing claim makes little sense to me. Naively, I'd have thought an eight-bit encoding would be more efficient to deal with, not less. In fact, the Unicode site has a FAQ specifically debunking this and other claims.
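
And as for processing the combined forms: grouping a Tamil string back into syllables takes only a few lines on any Unicode-aware platform. Here is a rough Python sketch (again my own illustration, not from the Unicode FAQ) that attaches each vowel sign or pulli to the consonant before it:

    import unicodedata

    def aksharas(text):
        """Group each base letter with the combining signs that follow it."""
        groups = []
        for ch in text:
            # Vowel signs and the pulli are combining marks (category Mn/Mc).
            if groups and unicodedata.category(ch).startswith("M"):
                groups[-1] += ch
            else:
                groups.append(ch)
        return groups

    print(aksharas("காலை"))  # ['கா', 'லை'] -- four code points, two syllables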

I suspect that the problem is the usual mindless competition with the Chinese: they already have space for tens of thousands of characters in Unicode and still want more, so why shouldn't we? Quoth the article,

"The Chinese government succeeded in gaining space for more than 27,000 Chinese characters by threatening to develop its own 16-bit encoding, Ponnavaiko, director, SRM Deemed University said."


So does Dr Ponnavaiko, who believes that 8 bits don't suffice for 30-odd distinct characters, want us to threaten to develop our own, incompatible encoding? We tried incompatibility before, and it was a mess. Now we have compatibility, and it is better---to the extent that people use it. Unicode works out of the box on any software that supports it, provided the system has a suitable Unicode font installed (which nearly all newer Windows, Mac and Linux systems do), and it is used on an increasing number of websites. What's the problem?
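
As a quick sanity check on the "works out of the box" claim, the following Python snippet (once more my own illustration) shows a single Unicode string carrying both Tamil and Hindi round-tripping through plain UTF-8 bytes, which is exactly what a Unicode web page sends over the wire:

    text = "வணக்கம் नमस्ते"       # Tamil and Hindi greetings in one Unicode string
    data = text.encode("utf-8")      # the bytes a UTF-8 page would actually serve
    print(len(text), "code points ->", len(data), "bytes")
    print(data.decode("utf-8") == text)  # True: the text survives the round trip unchanged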

Of course, I'm well aware that newspapers are not the best at reporting technical matters, so if some expert out there does know of real problems with the existing Unicode setup, I'd be very interested to hear about them.

1 comment:

kennady said...

Tamil may be India's most ancient language. It is spoken by the people of Tamil Nadu and by large communities in Sri Lanka, Malaysia and the West. Much of India's rich cultural heritage is written in Tamil.