Application
Matching protein structures
The AnMol graph database minimises the computational requirements
for protein structure matching, a vital component of drug designing, writes
Vinutha V.
Before
any drug hits the market, it undergoes years of testing. This is done in order
to understand the chemical and biological structure of a virus and the side-effects
of medicines. For instance, in activities such as drug design, docking
is often an issue; this involves the search for drug molecules that can successfully
dock into a viral protein molecule to impair its actions.
Bio-informatics plays a major role in reducing the time and processes involved
in drug designing. Protein structure matching, a small part of drug designing,
is a highly complex process. Protein structures are intricate. Their shape in
3-dimensional (3D) space is of prime importance, in addition to other features
of the molecular graph. Many existing applications perform specific activities
on molecular structures. For instance, the manually curated SCOP (Structural Classification of
Proteins), FSSP (Fold classification based
on Structure-Structure alignment of Proteins), and CATH (Class, Architecture, Topology, Homologous superfamily)
databases are used to maintain structure-based classifications of protein molecules.
A query on these databases takes about three days to return a result, and the IT
infrastructure required is a massively parallel supercomputer. While these databases
are useful for structural similarity searches, there is still a need for a general
platform that supports various analytical queries in a uniform manner.
RDBMS not enough
Biomolecules such as proteins are large molecular structures, each having hundreds
to thousands of atoms. The structure of a biomolecule is of enormous interest
since the functionality of a protein is known to be strongly correlated to its
molecular structure. Databases of biomolecular structures hold valuable information
for activities such as drug design and understanding of metabolic processes.
Traditional relational database management systems (RDBMS) are not suitable for managing such huge data sets.
Although it is theoretically possible to store molecular structures as relational
tables, performing a structure search over such tables is an extremely time-consuming
activity.
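To see why, consider a minimal sketch of the relational approach. The two-table layout, the toy molecules and the function name below are assumptions made for illustration; they are not the schema used by GRACE, AnMol or any of the databases named above. The point is only that each additional atom in the query pattern adds two more joins.

```python
import sqlite3

# Hypothetical two-table layout for molecular graphs (an assumption for
# illustration, not the schema of GRACE or AnMol).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atom (mol_id INTEGER, atom_id INTEGER, element TEXT);
    CREATE TABLE bond (mol_id INTEGER, src INTEGER, dst INTEGER);
""")

# Two made-up toy molecules: 1 is the chain C-N-O, 2 is the chain C-C-O.
atoms = [(1, 1, "C"), (1, 2, "N"), (1, 3, "O"),
         (2, 1, "C"), (2, 2, "C"), (2, 3, "O")]
bonds = [(1, 1, 2), (1, 2, 3), (2, 1, 2), (2, 2, 3)]

conn.executemany("INSERT INTO atom VALUES (?, ?, ?)", atoms)
for mol_id, a, b in bonds:
    # Bonds are undirected, so store each one in both directions.
    conn.execute("INSERT INTO bond VALUES (?, ?, ?)", (mol_id, a, b))
    conn.execute("INSERT INTO bond VALUES (?, ?, ?)", (mol_id, b, a))


def molecules_with_c_n_o_chain(conn):
    """Return ids of molecules containing a C-N-O chain.

    A pattern of only three atoms already needs a five-way join.
    """
    rows = conn.execute("""
        SELECT DISTINCT a1.mol_id
        FROM atom a1
        JOIN bond b1 ON b1.mol_id = a1.mol_id AND b1.src = a1.atom_id
        JOIN atom a2 ON a2.mol_id = a1.mol_id AND a2.atom_id = b1.dst
        JOIN bond b2 ON b2.mol_id = a1.mol_id AND b2.src = a2.atom_id
        JOIN atom a3 ON a3.mol_id = a1.mol_id AND a3.atom_id = b2.dst
        WHERE a1.element = 'C' AND a2.element = 'N' AND a3.element = 'O'
        ORDER BY a1.mol_id
    """).fetchall()
    return [row[0] for row in rows]


print(molecules_with_c_n_o_chain(conn))
```

A query pattern with dozens of atoms would need dozens of such joins over tables holding every atom and bond in the database, which is where the time goes.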
In 2002, IIIT-B developed GRACE (Graph Space) for managing graph data and handling
structural queries on biomolecular databases. GRACE was successfully applied
in storing and retrieving molecular structures for biomolecular databases. Later,
in 2003, CL Infotech, a company that specialises in software development, re-engineering
projects and maintenance of software applications (ranging from wireless communications
to generalised embedded systems) joined IIIT-B in extending GRACE to handle
protein structures. The result was a software suite called AnMol (Analysis of
Molecules) which uses the GRACE DBMS (Database Management System).
"Normally, an RDBMS, which is a tabular database, is efficient in answering
attribute queries. But we were intending to derive data models that help us
decrease the complexity of graph matching using lower computing capabilities
in less time," explains Srinath Srinivasa, Professor and In-charge of
the GRACE project at IIIT-B. An RDBMS is not suitable for matching graph structures.
Just matching one graph structure against another (the subgraph isomorphism problem) belongs to the NP (Nondeterministic
Polynomial)-hard class of algorithmic complexity. In intuitive terms, this means
that as the graph sizes increase, the cost of matching them grows
too rapidly for it to be practically viable to match large graphs. In a graph
database, this problem is compounded by the fact that a query graph should be
matched against not one but a database of graphs. GRACE provides heuristics
by which structurally similar graphs can be extracted from the database quickly.
This is achieved by storing some pre-processed data containing information about
structural features of graphs that are stored in the database.
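The general idea of filtering with pre-processed features can be sketched as follows. The article does not say which structural features GRACE stores, so the feature used here (counts of atom labels and of label pairs on bonds), the toy molecules and all function names are assumptions made for illustration. Cheap feature counts rule out most graphs; only the survivors go through the expensive exact match.

```python
from collections import Counter
from itertools import permutations


def features(labels, edges):
    """Count node labels and unordered label pairs on edges (assumed feature)."""
    feat = Counter(("node", lab) for lab in labels.values())
    for u, v in edges:
        feat[("edge",) + tuple(sorted((labels[u], labels[v])))] += 1
    return feat


def may_contain(db_feat, query_feat):
    """Cheap necessary condition: the graph has at least the query's counts."""
    return all(db_feat[key] >= n for key, n in query_feat.items())


def contains(g_labels, g_edges, q_labels, q_edges):
    """Exact but exponential check: try every placement of the query nodes."""
    g_adj = {frozenset(e) for e in g_edges}
    q_nodes = list(q_labels)
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if (all(q_labels[n] == g_labels[m[n]] for n in q_nodes)
                and all(frozenset((m[u], m[v])) in g_adj for u, v in q_edges)):
            return True
    return False


# Made-up toy database: name -> (node labels, edges).
database = {
    "m1": ({0: "C", 1: "N", 2: "O"}, [(0, 1), (1, 2)]),          # C-N-O chain
    "m2": ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),          # C-C-O chain
    "m3": ({0: "C", 1: "N", 2: "N", 3: "O"}, [(0, 1), (2, 3)]),  # C-N and N-O, apart
}

# Pre-processing step: compute the features once, when graphs are stored.
index = {name: features(labels, edges) for name, (labels, edges) in database.items()}


def search(q_labels, q_edges):
    """Return (candidates that pass the filter, graphs that really match)."""
    q_feat = features(q_labels, q_edges)
    candidates = [name for name, feat in index.items() if may_contain(feat, q_feat)]
    hits = [name for name in candidates if contains(*database[name], q_labels, q_edges)]
    return candidates, hits


candidates, hits = search({0: "C", 1: "N", 2: "O"}, [(0, 1), (1, 2)])
print(candidates, hits)
```

Here the filter discards m2 without any matching work, while m3 passes the filter but fails the exact check. The filter can let through false candidates but never drops a true match, which is what makes it safe to apply first.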
AnMol GRACE
The first version of GRACE was developed using the MySQL RDBMS. However, as
this did not prove effective, the next version has been developed from scratch
in C++ and has its own storage model, index structures and even a separate graph
query language called Safari.
Rather than treating protein structure data as a set of points in 3D space,
AnMol converts them to one or more graph structures and manages these structures
using GRACE. Converting 3D data into a set of graphs helps capture various aspects
of the protein molecule. A set of AnMol pre-processors takes 3D data about a
protein molecule and converts it into labelled graphs. These graphs capture
different aspects of the protein molecule. There are graphs depicting covalent
bonds, those depicting hydrogen bonds in the molecule, proximity graphs which
join each atom with every other atom that lies within a given radius, and so
on.
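As an illustration of one such conversion, the proximity graph described above can be sketched in a few lines. The function name, the input layout and the coordinates are assumptions made for this example; AnMol's actual pre-processors are not described in detail here.

```python
from itertools import combinations
from math import dist


def proximity_graph(atoms, radius):
    """Build a labelled proximity graph from 3D atom positions.

    atoms  -- list of (label, (x, y, z)) tuples
    radius -- two atoms are joined if they lie within this distance
    Returns (labels, edges): a dict of node labels and a set of index pairs.
    """
    labels = {i: label for i, (label, _) in enumerate(atoms)}
    edges = set()
    for i, j in combinations(range(len(atoms)), 2):
        if dist(atoms[i][1], atoms[j][1]) <= radius:
            edges.add((i, j))
    return labels, edges


# Made-up coordinates, purely for illustration.
atoms = [("N", (0.0, 0.0, 0.0)),
         ("C", (1.5, 0.0, 0.0)),
         ("O", (4.0, 0.0, 0.0))]

labels, edges = proximity_graph(atoms, radius=2.0)
print(labels, edges)
```

The same 3D data yields a different graph for each radius, and graphs for covalent or hydrogen bonds would be built the same way from their own joining rule.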
"At the outset, AnMol brought down the IT infrastructure required from
a supercomputer to a high-end server. It allowed work on parallel computing
and to have a private database where customisation could be achieved,"
notes Sujit Kumar, Director, CL Infotech. Query response time was drastically
reduced in AnMol vis-à-vis other databases. If the query graph was already
present in the database, query response time typically ranged from 200 milliseconds
for small queries to 2 seconds for large ones. When the query graph was not present
in the database, the search had to pass through several levels of structural
features in the database before giving up. This process needed anywhere between
300 milliseconds and 15 seconds. The complete response time for a user interaction,
which includes the query response time plus the time taken for ranking and visualisation
of results, ranged from 500 milliseconds to 17 seconds.
AnMol is targeted at biotechnology firms and drug research companies. The design
goals of AnMol are to provide a fast and reliable platform for structural analysis
of biomolecules which can be used on a relatively low-end computational infrastructure
like a server-class machine or a simple computing cluster. This is needed since
most analytical tools for structural analysis require substantial computational
infrastructure. Some of the tools based on AnMol can be used over the Internet,
eliminating the need for high-end computing power at the user's end; AnMol
is also meant to provide a platform for tools that end-users can install locally
and use without investing in large computational infrastructure.
GRACE can be applied to other domains in addition to molecular biology. Research
prototypes have been developed that use GRACE in areas such as mining Web logs
and in managing source code patterns for compiler optimisation. Other potential
areas include software engineering, cartography, knowledge management and CAD
(Computer Aided Design) databases.
vinutha@expresscomputeronline.com