|
The of
Indic computing
On paper, the demand for local language
applications is significant, estimated at $64 million by 2005; in the market,
it still remains minuscule. Computing in a vernacular language has never
been easy: if the content was there then the fonts were missing, and if the fonts
and content were there, then the coding was missing. In its ten years of
existence the Indian language applications market has had more hits than misses
and is only now gearing up to come into its own, says Chris Ann Fichardo
Keying in this article is so easy. Hit "Y"
and there are no second guesses as to what will appear on the screen. But when
computing in any Indian language, this is not the case. Though the personal
computer has been around for nearly 20 years in India, its usage is still largely
limited to the English-speaking population of the country. And as these users
constitute just 5 percent of the country’s population, what this effectively
means is that the other 95 percent have not been able to benefit from the information
age.
However, there are a host of bodies across
the country working to rectify this oversight. Corporate and government research
labs like IBM Research Labs, Media Lab Asia, NCST (National Centre for Software
Technology), TDIL (Technology Development for Indian Languages), C-DAC (Centre
for Development of Advanced Computing), the IITs (Indian Institutes of Technology)
and the NIIT-sponsored Centre for Research in Cognitive Systems are doing some
great work in this area.
The scope of Indian language applications,
or indic languages as they are popularly known, is immense. According to a study
conducted by hardware industry association MAIT and research firm Frost &
Sullivan, revenues from local language applications are likely to touch $64
million by 2005. It also pegged the total market at $11 million in 2002 and
has estimated the growth rate to be around 79 percent for the forecast period
2002-2005.
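As a quick sanity check on the study's figures, the arithmetic below compounds the 2002 base at the stated growth rate and, separately, derives the annual rate implied by the two end points. It is only a sketch: it assumes the 79 percent figure is a compound annual rate over three years, which the article does not state explicitly, and the variable names are illustrative.

```python
# Figures quoted from the MAIT / Frost & Sullivan study (in $ million).
base_2002 = 11.0
target_2005 = 64.0
years = 3            # 2002 -> 2005
stated_rate = 0.79   # assumed here to be a compound annual growth rate

# Compound the base-year market at the stated rate.
projected_2005 = base_2002 * (1 + stated_rate) ** years

# Annual rate implied by going from the base to the target in three years.
implied_rate = (target_2005 / base_2002) ** (1 / years) - 1

print(f"Projected 2005 market: ${projected_2005:.1f} million")
print(f"Implied annual growth: {implied_rate:.1%}")
```

Under that assumption the two numbers agree closely: compounding $11 million at 79 percent gives roughly $63 million, and the end points imply a rate just under 80 percent.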
"The IT industry in India is realising
that the next spurt of growth will stem from IT services reaching the masses.
To make this happen, support for local Indian languages is imperative,"
says Dr Ponani Gopalakrishnan, director of IBM India Research Lab (IRL). Set
up in 1998, IRL has research initiatives in electronic commerce, e-governance,
bioinformatics, unstructured information management and technologies for human-computer
interaction.
 |
| According to Nagarjuna G, companies encoded the fonts
instead of the data, and they also kept changing the font encoding, mainly
to market their own solutions |
Code uncode
Lucrative as the market is, it’s not so
easy to cash in on this goldmine. The Indian Constitution officially recognises
18 languages and 10 different scripts used in various states and regions across
the country. Developing a font to support each of these languages will not only
be time-consuming, but also extremely expensive.
To date, the government has spent around
$300 million on more than 1,000 pilot projects aimed at spreading IT among the
masses and has achieved a success rate of just 40 percent.
Making the job even tougher is the lack
of uniform standards to develop the various fonts. (See box: ISCII v/s Unicode)
Not enforcing a standard in the industry gave rise to private, non-standard
solutions. As a result, there was repetition of work and a host of proprietary
codes. "Each company encoded the fonts instead of the data, and they also
kept changing the font encoding, mainly to market their own solutions. Technically,
there is no reason to do so," says Nagarjuna G, Mumbai director of the
Free Software Foundation of India.
As there were no standards to adhere to,
most indic codes remained unique to their owners and could not be
used to build applications for broader areas of computing
such as banking solutions and legal documents, which restricted the use of local
language applications. Realising the need for a standard, C-DAC and the
then Ministry of IT (now Ministry of Information and Communications) worked
on evolving the national standard which is now called ISCII (Indian Script Code
for Information Interchange), recalls C-DAC executive director, R K Arora.
Though this mistake has been rectified,
the damage has been done and the repercussions are still affecting the industry.
Dr Gopalakrishnan, who heads IRL, combats this problem on a daily basis: "When
building a local language application, lack of a popular coding standard produces
a big challenge. The existing applications have their own fonts and a corresponding
keyboard mapping. The statistical techniques used in building the recognition
systems require a large amount of training data. Since each such source of data
uses its own coding standard, we had to go through an additional step of converting
all the different coding standards to a single format. Also, since not all coding
standards are supported with complete documentation, the conversion takes greater
time and effort."
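The conversion step Dr Gopalakrishnan describes can be pictured as a table lookup per source encoding. The sketch below is illustrative only: the two "vendor" tables are hypothetical and tiny, whereas real font encodings cover hundreds of glyphs and often need reordering rules as well, which this example does not attempt.

```python
# Hypothetical font encodings: each vendor assigns its own byte values to the
# same three Devanagari letters (KA, KHA, GA). These tables are invented for
# illustration and do not describe any real product.
FONT_TABLES = {
    "vendor_a": {0x41: "\u0915", 0x42: "\u0916", 0x43: "\u0917"},
    "vendor_b": {0xC1: "\u0915", 0xC2: "\u0916", 0xC3: "\u0917"},
}


def to_unicode(data: bytes, font: str) -> str:
    """Convert font-encoded bytes to Unicode text using that font's table."""
    table = FONT_TABLES[font]
    chars = []
    for byte in data:
        if byte not in table:
            # Undocumented codes are exactly where conversion gets expensive.
            raise ValueError(f"{font}: no mapping for byte 0x{byte:02X}")
        chars.append(table[byte])
    return "".join(chars)


# The same two letters stored by two vendors become identical once converted.
text_a = to_unicode(b"\x41\x42", "vendor_a")
text_b = to_unicode(b"\xc1\xc2", "vendor_b")
print(text_a == text_b)
```

Once every source is normalised to one character encoding, the text can be pooled as training data regardless of which font it was originally typed in.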
Damage control
Some draft standards have been prepared and
presented, such as the eight-bit ISCII or 16-bit Unicode for script standardisation,
ISFOC for fonts and the INSCRIPT phonetic keyboard layout. However, the final standards
are yet to be recommended.
Another significant step is the formation
of Technology Development for Indian Languages (TDIL), which has the onus of
developing IT tools in local languages in India. Set up in 1991, TDIL has sponsored
research in developing Indian language computing resources, processing systems,
tools and translation support systems and localisation of software for Indian
languages. It operates through collaborations with 13 resource centres across
India. Some interesting applications have been the local language word processor,
LEAP; C-DAC’s GIST (graphics and intelligence-based script), desktop publishing
applications and television subtitles in various vernacular languages.
MAIT also supports a consortium of IT companies
engaged in the development of local language applications and solutions. Vinnie
Mehta, president of the association says that it has done significant work in
the area of modification of Unicode to suit the requirements of Indian language
computing and also in development of font standards for language computing.
Next on the agenda are standards for products, e.g. keyboards.
"On one hand there are issues such
as lack of standards in local language computing that have hampered mass adoption
of local language solutions, on the other, somehow the user perception has been
that local language applications are for free... This is also due to the fact that
there is rampant piracy in this area and the industry is highly fragmented.
It is our endeavour through the consortium that the industry will come to a
common platform and focus on collaborative research," elaborates Mehta.
The demand
As the Internet spreads its influence across
the length and breadth of India, it has become imperative to have a medium that
will provide information in a manner that the local user can understand. To
take the benefits of the IT revolution to the non-English speaking
population, India needs applications and solutions that support vernacular transactions.
"The front-end of the computer being English is a limitation in taking
IT to the masses that are conversant in a vernacular language. So from a marketer’s
perspective, even if we are able to target 5 percent of the missing 95 percent,
the target market will double overnight; this will open doors to huge opportunities,"
says Mehta.
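Mehta's "double overnight" estimate follows from simple arithmetic on the population shares quoted earlier in the article. The sketch below works it through; the variable names are illustrative, and it assumes the shares apply to the same population base.

```python
english_share = 0.05      # current users, as a share of the population
untapped_share = 1 - english_share
captured_fraction = 0.05  # "5 percent of the missing 95 percent"

# New users gained, expressed as a share of the whole population.
new_users = untapped_share * captured_fraction

# Combined market after the gain, and its size relative to today's market.
total_share = english_share + new_users
growth_factor = total_share / english_share

print(f"New users: {new_users:.2%} of the population")
print(f"Market grows to {growth_factor:.2f} times its current size")
```

The gain is 4.75 percent of the population, which takes the addressable market from 5 percent to 9.75 percent, a little short of doubling.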
Corroborating this view, C-DAC’s Arora
says, "This is an area that has large potential and there is a big need.
If we have to proliferate the usage of computers we have to be able to provide
both the hardware and software devices needed. Around 10 years ago people did
not even believe that we could create Indic fonts, but now we are even developing
right-to-left fonts like those used to read Urdu, Persian, Arabic, Kashmiri,
etc." C-DAC is currently developing an application called Urdu Nashir,
which is a word processor targeted at the publishing industry.
 |
| According to Dr Ponani Gopalakrishnan, the next spurt
of growth will stem from IT services reaching the masses. To make this happen,
support for local Indian languages is imperative |
Overcoming the hurdles
Explaining why more emphasis
was not put on indic language applications earlier, Prof Dr Pushpak Bhattacharyya,
who is currently heading the Universal Networking Language (UNL) project (See
box: UNL—A language for information exchange and transfer over the Internet)
at IIT, Bombay, says, "India is very English-savvy and we have evolved a
tri-lingual formula—Hindi, English and the local language [e.g. Marathi in Maharashtra]
and at the central level we have Hindi and English. So a local language font
is not yet a crucial necessity—like in Europe or Japan, which is why in spite
of spending so much we have not produced some striking products, which would
be applicable to Indian languages. But efforts are on. A lot of the younger
generation are actually machine-savvy and they would like to use computers,
get educated and access information on a more regular basis."
"Most of the computer-literate population
in India is more comfortable with the English language. As the industry matures,
and more global companies invest in local language solutions for the masses,
more choices will become available to the consumer and we will see a higher
rate of adoption of local languages in computing," adds Dr Ponani Gopalakrishnan.
The other reasons that slowed the progress
of indic language applications have been lack of development resources and proper
guidance. The MAIT-Frost & Sullivan study states that in some instances
projects initiated by the government failed due to lack of commercialisation
of technology and lax timelines for projects. The basic lexical resources
needed to build local language applications were also not in place. Researchers
say that India has not invested in its lexical resources and has yet to realise
the need for good dictionaries in electronic format. It is only now that institutions
like IIT, Bombay are, for the first time, building wordnets in Hindi, Marathi,
Tamil, Gujarati and Oriya.
Funding is also an area of concern. Most
of the players in this sector are mid-sized companies or educational institutes
with limited financial muscle, hence they often tend to be restrained in terms
of their research and development spending on new technologies. And while the
corporate sector is investing, the commitment of providing funds is still at
a token stage. "So far it’s mainly been the government that has invested
in indic projects. The focus from the industry is still not very prominent because
they still don’t see a market," says Arora.
"Funding is a big problem in language
resource creation and in the language processing area all the activities are
extremely manpower intensive, and this manpower is really quality manpower—they
have to be trained in computers, linguistics, and have to be paid very well—but
there are government restrictions as to how much we can pay our researchers.
All over the world language technologists are paid very well; we cannot afford
that kind of salary in India and that is a big deterrent," says Dr Pushpak
Bhattacharyya.
Demand analysis
The local language IT market consists
of about 12-14 vendors. Most of the domestic players are regional and have limited
access to the market. They offer both off-the-shelf products and custom-made
applications in all major languages. The other set of key players in this market
are international players. However, international vendors are yet to take off
in a big way in India in terms of application offerings across different languages.
IBM offers a Hindi version of Lotus Notes in India, and Microsoft has also taken
some steps in this regard. The MAIT-Frost & Sullivan study predicts that
the participation of international vendors will increase over the next
three years.
According to the study, while in the last
three years the market has been driven by off-the-shelf applications for end
users such as publications and the government sector, the future growth for
indic languages will come from e-governance applications developed for various
government departments. E-governance applications will also drive demand in
the consulting area. By 2005, sales from consulting will account for 67 percent
of total revenues in this segment.
The local language market per se is also
expected to come into its own. The law of supply and demand will spur growth.
As applications increase, the demand for products will also increase. IT for
the masses is the new mantra that everybody is chanting. And if the computer
is at the doorstep of every village in the country, then the software to run
it cannot be very far behind.
|
[Chart not reproduced] All figures in % and rounded; the base year is 2002
Source: Frost & Sullivan
|
|
[Chart not reproduced] All figures are rounded; the base year is 2002
Source: Frost & Sullivan
|
|
[Chart not reproduced] All figures are rounded ($ million); the base year is 2002
Source: Frost & Sullivan
|
|
Imagine a scenario where a tourist from Tamil Nadu downloads
information from the Goa tourism website in Tamil, while a German tourist
simultaneously downloads the same information from the same site in German.
The ease of accessing information in a language one is familiar with is a
luxury the Internet still does not provide all its users. And this is
one lacuna that the Universal Networking Language (UNL) hopes to rectify.
The UNL project is a large-scale international co-operation with the goal
of providing information on the Internet in all the national languages of the
members of the United Nations.
The rationale
The UNL project is an initiative of the United Nations and
began in 1996 under the umbrella of ‘Peace and harmony through communication.’
The rationale behind this initiative was that if the language barrier
on the Internet continued, i.e. if the Internet continued to be in English,
then there would be a huge information gap within humanity itself. This is why
the UNL endeavour was started. When the project began, there
were groups all over the world, and the UN University in Tokyo co-ordinated
the project.
How it works
Each language links to a common representation called the UNL representation
or the language server (see diagram). Each language group had the task
of developing translators from their language into UNL form and vice-versa.
The translator from the local language to UNL is called the en-converter,
and the one from UNL to the local language is called the de-converter. Dr Pushpak Bhattacharyya
and his team at IIT, Bombay were entrusted with building the Hindi en-converters
and de-converters.
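One commonly cited advantage of routing every language through a shared intermediate form, as the en-converter and de-converter design does, is that the number of components grows linearly with the number of languages. The sketch below compares that with direct translation between every pair. The article does not give this as the project's stated motivation, and the figure of 10 languages is purely illustrative.

```python
def direct_translators(n_languages: int) -> int:
    """One translator per ordered pair of distinct languages."""
    return n_languages * (n_languages - 1)


def pivot_converters(n_languages: int) -> int:
    """One en-converter and one de-converter per language."""
    return 2 * n_languages


n = 10  # illustrative number of participating languages
print(f"Direct pairwise translators: {direct_translators(n)}")
print(f"Converters via a shared form: {pivot_converters(n)}")
```

Adding an eleventh language under the shared-form design means building two new converters, while the pairwise approach would need twenty new translators.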
"The scenario would be where somebody sitting in Tokyo sends
an e-mail to a person in India in Japanese and transparent to the receiver/sender
this mail gets translated into UNL and from UNL into the target language.
The main usage of UNL is the ability to communicate over the Net in a
language-independent manner," says Dr Bhattacharyya.
The implications
The first implication is the ability to automatically translate documents from one
language to another. Earlier machine translation systems attempted this, but
accurate translation was a problem because a natural language has its
own little quirks and is often imprecise, ambiguous and ill-specified,
explains Dr Bhattacharyya. "The uniqueness of this effort is that it uses
an extremely rich word dictionary and the UNL coders," he adds.
The other application is intelligent information processing
on the Web. With UNL documents one can do case matching and word matching
at the disambiguated level, so the matching process becomes much richer
and unwanted documents are eliminated.
Project status
The UN seeded the efforts, but once the infancy stage was
over, all the language groups had to generate their own resources
in their own countries. The IIT, Bombay effort has received quite a lot
of support from TCS, the Indian government and the World Bank, which has
commissioned projects that would support these applications. "We are looking
at lots of UNL applications and not just the de-converter and en-converter.
We have projects where the summary of the documents is created in UNL
form, Q&A is in the UNL form, retrieving of the document is done in
a better way by filtering out irrelevant things. It’s at a small scale
right now but it is an important application," says Dr Bhattacharyya.
"We are hoping that in two years’ time we should have small
tools that are marketable and usable by other agencies. Already, the UNL’s
translatory system and the lexicon system which was developed are used
by other language processing groups. UNL is still IIT property and negotiating
is on with the funding agency to see if the application can be made free;
if not the source code, then at least the system," says Dr Bhattacharyya.
|
|
Languages are written using scripts. Many languages can be written using
a single script. For example, Hindi, Marathi, Nepali and Konkani are all
written in the Devanagari script. To be able to use scripts and therefore
languages with computers, the characters of a script must be represented
by an encoding.
Basic text processes such as display, editing, searching and sorting
need a consistent way of encoding characters that enables the exchange
of text data and creates the basis for software development.
Plain text is a pure sequence of character codes. It is public, standardised
and universally readable, and requires a rendering process to make it
visible. Glyphs are images that represent the shapes that characters can
have when they are displayed. The character to glyph rendering mechanism
is the domain of software processes. Although the encoding encapsulates
only the basic alphabetic characters, the number of glyphs and their combinations
required for the exhaustive rendering of Indic scripts can be quite large.
A set of glyphs for a script used to display text constitutes a font.
The number of glyphs can range from a couple of hundred to a couple of
thousand or more depending on the complexity of the script and the font
design.
Source: mithi.com
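The distinction between characters, scripts and glyphs can be seen with a few lines of Python. This is a sketch using the standard `unicodedata` module; taking the first word of each Unicode character name as the script is a convenient shortcut for this example, not a general-purpose script detector.

```python
import unicodedata

# The words "Hindi" and "Marathi", written as explicit Unicode code points.
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
marathi = "\u092e\u0930\u093e\u0920\u0940"


def scripts_used(text: str) -> set:
    """Return the first word of each character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text}


for word in (hindi, marathi):
    code_points = [f"U+{ord(ch):04X}" for ch in word]
    print(word, code_points, scripts_used(word))

# Plain text is just the sequence of codes; how many shapes appear on screen
# is decided later by the font and the rendering software.
print(len(hindi), "code points,", len(hindi.encode("utf-8")), "bytes in UTF-8")
```

Both words report a single script, Devanagari. The Hindi word is stored as six character codes, yet a renderer will typically draw fewer visual units, for example by merging the consonant, virama and following consonant into one conjunct glyph.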
|
|
The ISCII (Indian Script Code for Information Interchange) code standard
specifies a seven-bit code table which can be used in a seven or eight-bit
ISO compatible environment. It allows English and Indian script alphabets
to be used simultaneously. ISCII caters to 10 Indian scripts.
Unicode is a universal character encoding standard for multilingual
text, originally designed around 16-bit codes. It covers all the major scripts used for writing Indian languages.
The Unicode Standard for Indic scripts is based on the ISCII-1988 revision
and is a superset of the ISCII-1991 character encoding. Texts encoded
in ISCII-1991 may be automatically converted to Unicode values and back
to their original encoding without loss of information.
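One visible consequence of Unicode following ISCII's shared code table is that the major Indic blocks are laid out in parallel, so the same letter sits at the same offset in each block. The sketch below illustrates this on the Unicode side only, since Python's standard library does not appear to ship an ISCII codec. The constant and function names are illustrative.

```python
import unicodedata

BLOCK_STRIDE = 0x80      # spacing between consecutive Indic script blocks
DEVANAGARI_KA = 0x0915   # DEVANAGARI LETTER KA


def same_letter_across_scripts(code_point: int, blocks: int = 9) -> list:
    """Step through the parallel blocks from Devanagari to Malayalam."""
    return [chr(code_point + i * BLOCK_STRIDE) for i in range(blocks)]


ka_letters = same_letter_across_scripts(DEVANAGARI_KA)
for ch in ka_letters:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

This works for KA because every one of the nine scripts has that letter. Not every position is filled in every block, so for other letters some of the shifted code points may be unassigned, in which case `unicodedata.name` raises `ValueError`.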
|
|
The top drivers of the local language application market
- Introduction and promotion of new technology solutions and applications
to cater to the growing needs of end-users.
- Increasing content creation in Indian languages for the Web.
- Initiatives in local language projects being undertaken by vendors,
central and state governments.
- Initiatives revolving around the commercialisation of products and
applications being developed in the numerous research labs in India.
The top three restraints for the local language software market
- Lack of formal IT-based language training.
- Lack of awareness regarding e-governance computing applications at
the grassroots level and low PC penetration across the country.
- Insufficient or delayed implementation of the initiatives taken by
different government bodies.
|
|
| Restraint | 1-2 yrs | 3-4 yrs | 5-7 yrs |
| Lack of standards | High | Medium | Low |
| Limited availability of software, fonts | High | High | Medium |
| Lack of local language content | High | Medium | Low |
| Slow technology progress | Medium | Medium | Low |
Source: Frost & Sullivan
|
|