www.expresscomputeronline.com WEEKLY INSIGHT FOR TECHNOLOGY PROFESSIONALS
19 September 2005  

Application

Matching protein structures

The AnMol graph database minimises the computational requirements for protein structure matching, a vital component of drug designing, writes Vinutha V.

Before any drug hits the market, it undergoes years of testing. This is done in order to understand the chemical and biological structure of a virus and the side-effects of medicines. For instance, in activities such as drug design, ‘docking’ is often an issue; this involves the search for drug molecules that can successfully ‘dock’ into a viral protein molecule to impair its actions.

Bio-informatics plays a major role in reducing the time and processes involved in drug designing. Protein structure matching, a small part of drug designing, is a highly complex process. Protein structures are intricate: their shape in 3-dimensional (3D) space is of prime importance, in addition to other features of the molecular graph. Many existing applications perform specific activities on molecular structures. For instance, the SCOP (Structural Classification of Proteins) database, which is built largely through manual expert curation, and the FSSP (Fold classification based on Structure-Structure alignment of Proteins) and CATH (Class, Architecture, Topology, Homologous superfamily) databases maintain structure-based classifications of protein molecules. A query on these databases takes about three days to return a result, and the IT infrastructure required is a massively parallel supercomputer. While these databases are useful for structural similarity searches, there is still a need for a general platform that supports various analytical queries in a uniform manner.

RDBMS not enough

Biomolecules such as proteins are large molecular structures, each having hundreds to thousands of atoms. The structure of a biomolecule is of enormous interest, since the functionality of a protein is known to be strongly correlated with its molecular structure. Databases of biomolecular structures hold valuable information for activities such as drug design and the understanding of metabolic processes. A traditional RDBMS (relational database management system) is not suitable for managing such huge data sets. Although it is theoretically possible to store molecular structures as relational tables, performing a structure search over such tables is extremely time-consuming.
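To make the point about relational storage concrete, here is a minimal sketch that stores one small molecule as two tables and searches for a chain of bonded atoms. The table names (`atom`, `bond`), the helper names and the toy molecule are assumptions made for illustration; they are not taken from GRACE or AnMol. The thing to notice is that the query needs one more pair of joins for every extra atom in the pattern.

```python
import sqlite3

def build_db(atoms, bonds):
    """Store one molecule as two relational tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE atom (id INTEGER PRIMARY KEY, element TEXT)")
    conn.execute("CREATE TABLE bond (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO atom VALUES (?, ?)", atoms)
    # Store each undirected bond in both directions so joins can follow it either way.
    both_ways = [(a, b) for a, b in bonds] + [(b, a) for a, b in bonds]
    conn.executemany("INSERT INTO bond VALUES (?, ?)", both_ways)
    return conn

def find_chain(conn, elements):
    """Return id tuples of bonded atom chains whose elements match `elements` in order."""
    n = len(elements)
    columns = ", ".join(f"a{i}.id" for i in range(n))
    sql = [f"SELECT {columns} FROM atom a0"]
    # One join on `bond` plus one on `atom` for every additional atom in the pattern.
    for i in range(1, n):
        sql.append(f"JOIN bond b{i} ON b{i}.a = a{i - 1}.id")
        sql.append(f"JOIN atom a{i} ON a{i}.id = b{i}.b")
    conditions = [f"a{i}.element = ?" for i in range(n)]
    # Require all matched atoms to be distinct so a chain cannot fold back on itself.
    conditions += [f"a{i}.id <> a{j}.id" for i in range(n) for j in range(i + 1, n)]
    sql.append("WHERE " + " AND ".join(conditions))
    return conn.execute(" ".join(sql), list(elements)).fetchall()

# Toy molecule: N-C-C-O
conn = build_db([(1, "N"), (2, "C"), (3, "C"), (4, "O")], [(1, 2), (2, 3), (3, 4)])
print(find_chain(conn, ["N", "C", "C"]))
```

A three-atom pattern already needs four joins; a pattern with dozens of atoms, run against thousands of stored molecules, grows accordingly.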

In 2002, IIIT-B (the International Institute of Information Technology, Bangalore) developed GRACE (Graph Space) for managing graph data and handling structural queries on biomolecular databases. GRACE was successfully applied in storing and retrieving molecular structures for biomolecular databases. Later, in 2003, CL Infotech joined IIIT-B in extending GRACE to handle protein structures. CL Infotech specialises in software development, re-engineering projects and maintenance of software applications, ranging from wireless communications to generalised embedded systems. The result was a software suite called AnMol (Analysis of Molecules), which uses the GRACE DBMS (Database Management System).

“Normally, RDBMS, which is a tabular database, is efficient in answering attribute queries. But we were intending to derive data models that help us decrease the complexity of graph matching using lower computing capabilities in lesser time,” explains Srinath Srinivasa, Professor and In-charge of the GRACE project at IIIT-B. An RDBMS is not suitable for matching graph structures. Even matching just two graph structures is, in the general case of finding one graph inside another, a problem that belongs to the NP (Nondeterministic Polynomial)-hard class of algorithmic complexity. In intuitive terms, this means that as graph sizes increase, the cost of matching them grows too rapidly for large graphs to be matched in practice. In a graph database, the problem is compounded by the fact that a query graph has to be matched against not one graph but a whole database of graphs. GRACE provides heuristics by which structurally similar graphs can be extracted from the database quickly. It does so by storing pre-processed information about the structural features of the graphs held in the database.
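The general idea of pre-processed structural features can be sketched as a filter-then-verify search. This is only an illustration under stated assumptions: the article does not describe GRACE's actual index, so the feature set used here (counts of node labels and of labelled edges) and all function names are hypothetical. A cheap comparison of pre-computed feature counts discards most graphs, and the expensive exact match runs only on the few that survive.

```python
from collections import Counter
from itertools import permutations

def features(labels, edges):
    """Pre-computed summary of a graph: counts of node labels and of labelled edges."""
    counts = Counter(("node", lab) for lab in labels.values())
    for u, v in edges:
        counts[("edge",) + tuple(sorted((labels[u], labels[v])))] += 1
    return counts

def may_contain(graph_feats, query_feats):
    """Cheap necessary condition: the graph has at least as many of every query feature."""
    return all(graph_feats[key] >= n for key, n in query_feats.items())

def contains(graph, query):
    """Exact check by brute force; the number of mappings tried grows factorially."""
    g_labels, g_edges = graph
    q_labels, q_edges = query
    g_edge_set = {frozenset(e) for e in g_edges}
    q_nodes = list(q_labels)
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if all(q_labels[q] == g_labels[m[q]] for q in q_nodes) and all(
            frozenset((m[u], m[v])) in g_edge_set for u, v in q_edges
        ):
            return True
    return False

def search(database, index, query):
    """Filter with the feature index first, then verify only the surviving candidates."""
    query_feats = features(*query)
    candidates = [name for name in database if may_contain(index[name], query_feats)]
    return [name for name in candidates if contains(database[name], query)]

# Toy database of labelled graphs: (node labels, edge list)
database = {
    "g1": ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "g2": ({1: "C", 2: "N"}, [(1, 2)]),
}
index = {name: features(*graph) for name, graph in database.items()}  # built once, offline
print(search(database, index, ({"x": "C", "y": "O"}, [("x", "y")])))
```

The filter can pass a graph that does not actually contain the query, but it never rejects one that does, so the exact check remains the final arbiter.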

AnMol GRACE

The first version of GRACE was developed using the MySQL RDBMS. However, as this did not prove effective, the next version has been developed from scratch in C++ and has its own storage model, index structures and even a separate graph query language called Safari.

Rather than treating protein structure data as a set of points in 3D space, AnMol converts it into one or more graph structures and manages these structures using GRACE. A set of AnMol pre-processors takes 3D data about a protein molecule and converts it into labelled graphs, each capturing a different aspect of the molecule. There are graphs depicting covalent bonds, graphs depicting hydrogen bonds in the molecule, proximity graphs which join each atom with every other atom that lies within a given radius, and so on.
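The proximity graph described above is the simplest of these conversions to sketch. The following is a minimal illustration, not AnMol's pre-processor: the input format (a dictionary from atom id to element and coordinates), the function name and the sample coordinates are all assumptions. Each atom becomes a node labelled with its element, and an edge joins every pair of atoms whose distance is no more than the chosen radius.

```python
from itertools import combinations
from math import dist

def proximity_graph(atoms, radius):
    """Turn 3D atom data into a labelled graph.

    atoms:  dict mapping atom id -> (element, (x, y, z))   (hypothetical input format)
    radius: join two atoms when their Euclidean distance is <= radius
    Returns (labels, edges): node labels by atom id, and a list of id pairs.
    """
    labels = {atom_id: element for atom_id, (element, _) in atoms.items()}
    edges = [
        (a, b)
        for a, b in combinations(atoms, 2)  # every unordered pair of atoms, once
        if dist(atoms[a][1], atoms[b][1]) <= radius
    ]
    return labels, edges

# Three atoms on a line; coordinates are made up for the example.
atoms = {
    1: ("N", (0.0, 0.0, 0.0)),
    2: ("C", (1.5, 0.0, 0.0)),
    3: ("O", (5.0, 0.0, 0.0)),
}
labels, edges = proximity_graph(atoms, radius=2.0)
print(labels, edges)
```

This version compares every pair of atoms, which is adequate for a sketch; for molecules with thousands of atoms a spatial grid or tree would avoid the quadratic number of distance checks.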

“At the outset, AnMol brought down the IT infrastructure required from a supercomputer to a high-end server. It allowed work on parallel computing and to have a private database where customisation could be achieved,” notes Sujit Kumar, Director, CL Infotech. Query response time was drastically lower in AnMol than in other databases. If the query graph was already present in the database, the query response typically took from 200 milliseconds for small queries to 2 seconds for large ones. When the query graph was not present, the search had to pass through several levels of structural features in the database before giving up, which took anywhere between 300 milliseconds and 15 seconds. The complete response time for a user interaction, which includes the query response time plus the time taken for ranking and visualisation of results, ranged from 500 milliseconds to 17 seconds.

AnMol is targeted at biotechnology firms and drug research companies. Its design goal is to provide a fast and reliable platform for structural analysis of biomolecules that can run on relatively low-end computational infrastructure such as a server-class machine or a simple computing cluster. The need arises because most analytical tools for structural analysis require substantial computational infrastructure. Some of the tools based on AnMol can be used over the Internet, eliminating the need for high-end computing power at the user’s end. AnMol is also meant to provide a platform for tools that end-users can install locally and use without investing in large computational infrastructure.

GRACE can be applied to other domains in addition to molecular biology. Research prototypes have been developed that use GRACE in areas such as mining Web logs and in managing source code patterns for compiler optimisation. Other potential areas include software engineering, cartography, knowledge management and CAD (Computer Aided Design) databases.

vinutha@expresscomputeronline.com

 


© Copyright 2001: Indian Express Newspapers (Mumbai) Limited (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in Mumbai by the Business Publications Division (BPD) of the Indian Express Newspapers (Mumbai) Limited. Site managed by BPD.