|
Application
Deciphering the Vedas
Perfect understanding of ancient works is a necessary key
to unlocking the secrets of the past, and this is where Artificial Intelligence
comes into play.
By Varun Aggarwal
The Vedas are among the oldest religious books in the world. They tell us a lot
about our ancient culture and civilisation. Unfortunately, only a few people
make the effort to know what they mean. There are courses available in Sanskrit
and Vedic studies in India; however, they have few takers. The subject is more
popular in the West, where an Indian researcher is tapping the power of computing
to unearth the knowledge in these ancient texts.
The Mystic Vedas
|
"My hope is to eventually complete the research and provide a complete substantiate
software system, necessary new algorithms and the requisite corpora, so
that the work can be self-sustaining and will not need to be done only
by hand after extensive instruction"
- Prabhu Ram Raghunathan,
research engineer from Carnegie Mellon University
|
Prabhu Ram Raghunathan, an Indian-born research engineer,
software developer, roboticist, technology developer/entrepreneur and
writer/historian from Carnegie Mellon University, Pennsylvania, is trying to use Artificial
Intelligence to answer real-world questions, such as the chronology of
the Vedic people. Different verses and chapters in the Rig Veda are supposed
to have been composed at different times. He feels, "The study of
the Vedas is quite interpretive, as words can have multiple meanings. Further,
Sanskrit has a number of compound words, which need to be split according to
Sandhi and Samasa rules, and according to the correct number and case. Sandhis
are rules for fusing words retaining the meanings of both the original words,
as in 'boatman'. Samasas are rules for forming compound words where a new word
is formed. So, the Vedas are presented in Samhitha or native form and a padapatha
or split-up form which can be interpreted. Due to all these variations, a concordance
study is valuable. We can profitably apply ontological and inferential techniques
to analyse the Vedas."
If you consider the major philosophies of Hinduism (Upanishadic and later),
they all discuss the concept of the Brahman, or the Supreme Being
or Godhead. But there is hardly a mention of Brahman in the Vedas. So questions
arise, such as when these concepts emerged, or what views emerged at what times.
Currently such questions are answered by comparative linguists,
philologists etc.
Raghunathan's work is focused on trying to answer such
questions on a statistical basis, using Machine Learning (AI), aided by computational
linguistics and natural linguistics techniques, knowledge engineering and information
retrieval.
Raghunathan has had to develop a number of tools
and techniques for his work:
- A regular expression based Sanskrit parser for certain common constructs.
- A transliteration software package that lets one convert between the different
encodings and conventions used to write Sanskrit in Latin characters, such as
Harvard-Kyoto and ITRANS. It also does parsing and conversion to Unicode.
- Transliteration correction software, which lets you fix punctuation and
errors introduced by people digitising Sanskrit text. (This needs extensive
parsing ability.)
- A number of databases, made partly by hand and partly automatically, to
provide a reliable corpus that can be analysed.
- A minimalist grammar parsing system, based on direct grammar and heuristics
as found in, say, a high school textbook.
- Several XML-based conversion, processing and analysis systems.
- A distributed Sanskrit dictionary lookup and concordance system, which can
compare words, concepts and concordances across Vedic texts.
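To give a feel for what a transliteration tool of this kind does, here is a minimal sketch that converts Harvard-Kyoto text to IAST, the diacritic-based romanisation that Unicode can represent. It is not Raghunathan's software: the function name hk_to_iast and the table layout are assumptions made for this illustration. It also ignores rare ambiguities, for instance a plain "l" followed by vocalic "R".

```python
import unicodedata

# Harvard-Kyoto tokens whose IAST form differs. Letters not listed here
# (a, i, u, e, o, k, g, t, d, h ...) are identical in both schemes.
HK_TO_IAST = {
    "lRR": "ḹ", "RR": "ṝ", "lR": "ḷ",
    "A": "ā", "I": "ī", "U": "ū", "R": "ṛ",
    "M": "ṃ", "H": "ḥ", "G": "ṅ", "J": "ñ",
    "T": "ṭ", "D": "ḍ", "N": "ṇ", "z": "ś", "S": "ṣ",
}

# Try longer tokens first so that "RR" is not read as two separate "R"s.
_TOKENS = sorted(HK_TO_IAST, key=len, reverse=True)


def hk_to_iast(text):
    """Convert a Harvard-Kyoto string to IAST using greedy longest match."""
    out = []
    i = 0
    while i < len(text):
        for token in _TOKENS:
            if text.startswith(token, i):
                out.append(HK_TO_IAST[token])
                i += len(token)
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    # Normalise so that accented letters have one canonical encoding.
    return unicodedata.normalize("NFC", "".join(out))


print(hk_to_iast("saMskRta"))
print(hk_to_iast("kRSNa"))
```

A real package would also need the reverse direction, other schemes such as ITRANS, and Devanagari output, all of which take larger tables and more careful parsing than this sketch attempts.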
|
A work in progress
Raghunathan started working on new techniques and systems, such as analytical and
ontological tools for Sanskrit analysis, in 2000. An ontology can be thought of
as something along the lines of the Google directory. It gives information about
information. This has a lot of value in machine processing, in speeding up and
compacting queries, helping comparative study etc.
He is currently trying to build a comprehensive corpus of
all available recensions and classical commentaries of the Rig Veda. "With
the parts of the Rig Veda that I have assembled so far, I have been developing
techniques and conducting analytical studies, part by part. My hope is to eventually
complete the research and provide a complete substantiate software system, necessary
new algorithms and the requisite corpora, so that the work can be self-sustaining
and will not need to be done only by hand after extensive instruction," he says.
Such a new repository, with analytic techniques and complete
statistics on a scientific basis (rather than a rule-of-thumb or purely
experiential basis), together with a software system, will help conduct deep comparative
analysis and close reading across languages and different eras of Sanskrit.
| The Vedas are the ancient Hindu Scriptures and are
among the oldest religious books of the world. They are four in number: the
Rig, Yajur, Sama and Atharva Vedas, listed in order of antiquity. The first
three are cognates and were chanted during rituals. The Rig Veda is the
primary Vedic text and consists of 10 books and over ten thousand verses. It
was meant to be recited by the Hotar priests during the fire sacrifice.
The Yajur was assigned to the Adhvaryu priests who conducted the sacrifices
and the Sama Veda to the Udgatri priests who sang the Sama
verses during the preparation of the Holy Soma juice. The Yajur and Sama
Vedas repeat a lot of Rig Vedic content, with special syllables being added
in some cases. Each Veda could have several recensions. Apart from Hindu
theology, study of the Vedas becomes necessary to understand ancient Indian
history, as the Vedas are the closest we have to historical texts of those
times. They are studied using myriad approaches: Philology, Etymology, Comparative
Religion, Sanskritology, Indology etc.
The commonest version of the Sama Veda consists
of 1,875 verses, arranged in four parts, which are further divided into
books and chapters. The chapters are then broken into decades. Each decade
could consist of 10 verses. To each decade is assigned a God or a collection
of Gods to whom the verses are addressed, either as a salutation or as
a prayer, such as Varuna, the Water God, or Visvedeva, or All of the
Gods. Most verses are addressed to Indra, the thunderbolt-wielding Rain
God and King of the Gods, and to Agni, the Fire God, who is considered
the conveyor of the sacrificial offerings to the Devas (the Gods). A number
of the verses are naturally in praise of Soma. Decades also identify the
metre in which they were composed and the Rishi or sage who composed them.
|
The trouble with ontology
While most common ontology creation systems such as Protégé and
WebOnto tend to support manual creation with a set of editor, language processor,
visualisation, maintenance and exchange tools, several new systems have attempted
to automate the creation process. The ontologies generated by them have to be
evaluated and corrected by a human expert, in an iterative process, before they
can be deployed. In this sense, they are semi-automatic. Ontology learning systems
attempt to parallel the workflow in a manual creation environment. Typically,
the first stage is to process the natural language corpus available with a text-processing
module to perform tasks such as stemming, part of speech tagging, parsing etc.
and an intermediate corpus is generated. Ontological information is then extracted
with the use of known lexical constructs and a domain lexicon. What is learned
first is the main concepts in the domain and subsequently their interrelationships.
The first-pass ontology so generated is then pruned, evaluated and refined
iteratively by a domain expert. Apart from extraction from scratch, ontologies
can also be automatically refined or reverse engineered. The additional stages
involve the import of existing ontologies as a starting point, the export of
the generated ontology into an existing higher ontology and merging of related
ontologies. Most such environments support common ontology exchange languages
such as RDF Schema, OIL, OWL etc. Starting with natural language text may not
always be the case. Ontology learning systems can also start from specialised data
repositories and schema.
The underlying learning algorithms and natural language processing facilities
can vary greatly, as can the role of syntactic and semantic patterns. Some systems rely on
term frequency metrics (TF, TF-IDF etc.) and domain-specific heuristics for identifying
syntactic patterns. ASIUM is another, earlier system that generates domain taxonomies
through hierarchical clustering.
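The term frequency metrics mentioned above can be illustrated with a short sketch. It assumes one common definition of TF-IDF (a term's relative frequency in a document, multiplied by the logarithm of the inverse document frequency). The function name tf_idf and the toy word lists are invented for illustration and are not drawn from any real corpus or system.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score every term in every document.

    documents: a list of token lists.
    Returns one {term: score} dict per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # Relative frequency here, weighted by rarity across documents.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy placeholder "documents" (not actual verses).
docs = [
    ["agni", "hotar", "agni"],
    ["indra", "vajra"],
    ["agni", "indra", "soma"],
]
for doc_scores in tf_idf(docs):
    print(sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```

Under this definition a term that occurs in every document scores zero, so very common words drop out and the terms that set one text apart rise to the top. This is the property that makes such metrics useful for picking out candidate domain concepts.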
Although ontology learning by itself is a nascent area, the
actual techniques used derive from computational linguistics, machine learning,
information retrieval etc. However, the ontology learning methods noted above have
been highly domain-specific, as they rely on domain heuristics.
|
The case for using ontology on the Vedas
The typical uses would be indexing, annotating, reasoning, improving parsing
models on other texts, semantic disambiguation etc. Given the number of
recensions and interpretations, it becomes useful to be able to ontologise them
prior to analysis. This facilitates both intelligent retrieval, for instance
in the form of concordances, and inferences and revisions to the interpretations
through disambiguation. We are also able to reconcile post-Rig Vedic works with
the Rig Veda and understand the role of each in the Indological domain. It
is commonly held that the hymns of the Rig Veda were extant for centuries and
were redacted as text over a long period of time, during which there were changes
to the beliefs, attributes and dramatis personae in the Vedic hymns. Developing
a formal structure such as a domain ontology could therefore give us some insights into
the chronology and concept evolution therein.
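Retrieval in the form of a concordance, as mentioned above, can be pictured with a small sketch: for every word, record where it occurs and what surrounds it. The function name build_concordance, the text labels and the English placeholder tokens are all assumptions made for this illustration. A real system would work on split (padapatha-style) Sanskrit text across several recensions.

```python
from collections import defaultdict


def build_concordance(texts, window=2):
    """Index every token with its location and surrounding context.

    texts: {text_id: list of tokens}
    window: number of tokens of context kept on each side.
    Returns {token: [(text_id, position, context), ...]}.
    """
    index = defaultdict(list)
    for text_id, tokens in texts.items():
        for pos, token in enumerate(tokens):
            start = max(0, pos - window)
            context = " ".join(tokens[start:pos + window + 1])
            index[token].append((text_id, pos, context))
    return dict(index)


# Placeholder token lists (not actual verses).
texts = {
    "hymn_a": ["agni", "is", "praised", "first"],
    "hymn_b": ["praise", "to", "agni"],
}
concordance = build_concordance(texts)
for text_id, pos, context in concordance["agni"]:
    print(text_id, pos, context)
```

Once every occurrence of a word is listed with its context, comparing how the same word or deity name is used in different books or recensions becomes a lookup rather than a manual search.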
"I am developing, or learning, or deducing an ontology
or structure of knowledge from a bunch of Sanskrit documents. Constructing any
good ontology by hand is itself difficult. Since the domain is vast, we need
to construct it automatically, and this is even tougher. To deduce or learn
an ontology, I am trying to deduce the relationships between the concepts, their
structures etc., by machine analysing the Rig Veda, both in text form and translation
form. Since the grammar is very difficult, I try to solve the grammar problem
in this paper by using raw statistical learning techniques, which has never
before been done for Sanskrit. It has been used on English mainly. But why do
it? If we can solve this problem for something like Sanskrit, we can then adapt
the same techniques to a number of current problems where learning languages
based on inconsistent or incomplete grammar is necessary, such as discovering
and providing Web services on the semantic Web," says Raghunathan.
"Further, this work in an ancient domain is highly relevant and extensible to
modern domains, including the provision of digital assistant services through
a broadband telephony network. If we wished to provide a Semantic Web application
to broadband telephony users in a network, and allow that the actual Web services
available to a user, through the digital assistant agent, such as scheduling
engagements, automatic email response, task allocation and prioritisation etc.,
needed to be discovered, added to a services list, managed and supported over
the Internet, by processing multilingual and multimodal descriptions and maintaining
them in a domain services ontology, we see that there is a close parallel to
the earlier domain. We are faced with similar challenges. Our multimodal natural
language descriptions could come from voice interfaces, multimedia repositories,
including pictures and oral advertisements, as also common Web service description
formats such as OWL-S and XML, and may not be reliably or homogeneously phrased.
Some of our current work is on such a project, where we find the need
to ontologise domain service information without the benefit of the implicit
restrictions to facilitate other tasks," adds Raghunathan.
|