www.expresscomputeronline.com WEEKLY INSIGHT FOR TECHNOLOGY PROFESSIONALS
15 September 2008  

Grid computing to aid computational biology

Building and running a high performance, low latency, scalable data engine for the purposes of computational biology requires grid computing, says V Sudhakshina

Computational biology is an interdisciplinary field that applies the techniques of computer science, informatics, artificial intelligence, chemistry, biochemistry, applied mathematics, and statistics to problems in biology. It covers the management, analysis, and simulation of the complex structures and processes inherent in living systems, using databases and software to store biological data and to run algorithms against it.

In addition, most computational projects in biology generate enormous volumes of data. These include the genome projects, proteomics, protein structure determination, and the rapid expansion in the digitization of patient biomedical data.

Major fields in biology that require computation include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution. Through computational biology, a researcher can analyze DNA to locate genes, analyze RNA sequences to predict their structure, analyze protein sequences to predict where the proteins are located inside a cell, analyze gene expression images, identify new drug molecules, build computational models of biological systems, and analyze the genomes of a cell or organism for genome annotation.

Challenges faced

The challenges faced in the field of computational biology are as complex as the subject itself. According to Meghadri Ghosh, Director of Engineering, GemStone Systems, “There is a rapid increase in the need for system performance, a high level of scalability and availability, and a way to address and optimize latency. The existing data solutions are unable to meet those challenges.” The huge amount of data generated by multiple sequencing runs, the development of efficient algorithms to search for systems of genes or proteins and their interactions, and the location of DNA, RNA, genes, proteins, and molecules are some of the major concerns in the field.

Sequencing has moved into a new phase. Where earlier only a single cell was studied for sequencing, now a whole organism is studied. The DNA sequences of hundreds of organisms are decoded and stored in databases. These stored sequences are then analyzed to identify genes, which in turn indicate protein functions. Sequencing a single cell alone yields about 3 GB of data, so the data produced by hundreds of organisms, all of which must be processed quickly, is enormous. Various computer programs apply algorithms and statistical techniques to biological datasets that typically store large numbers of DNA, RNA, or protein sequences.
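To put the 3 GB figure in perspective, here is a back-of-the-envelope estimate in Python. It is only a rough sketch: it assumes every genome is about as large as the figure quoted above (real genome sizes vary widely), and the organism counts and the number of stored copies are illustrative values, not numbers from any real project.

```python
# Rough storage estimate. The 3 GB value is the figure quoted in the article;
# the organism counts and copy counts used below are illustrative assumptions.
GB_PER_GENOME = 3


def estimated_storage_gb(organisms: int, copies: int = 1) -> int:
    """Total raw storage in GB for `organisms` genomes, each kept `copies` times."""
    if organisms < 0 or copies < 1:
        raise ValueError("organisms must be >= 0 and copies must be >= 1")
    return organisms * GB_PER_GENOME * copies


for count in (1, 100, 500):
    single = estimated_storage_gb(count)
    replicated = estimated_storage_gb(count, copies=2)
    print(f"{count} organisms: {single} GB raw, {replicated} GB with one replica")
```

Even under these simplified assumptions, a few hundred organisms reach the terabyte range before any derived data, indexes, or intermediate results are counted.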

IBM’s Blue Gene therapy
IBM’s Blue Gene family of supercomputers is designed to deliver ultra-scale performance within a standard programming environment while remaining efficient in power, cooling, and floor-space consumption. It extends the performance first established with the Blue Gene Solution through greater density and frequency, enhanced 4-way SMP functionality, scalability to petaflop performance, and aggressive power management for low power consumption. It scales up to 1,048,576 cores (262,144 nodes with four cores each). Systems such as Blue Gene are designed to handle data-intensive applications such as content distribution, simulation and modeling (as found in computational biology research), Web serving, data mining, and business intelligence. Customers note that time-to-solution for many applications has been reduced by orders of magnitude. Scientists can make new runs more often, allowing them to explore alternative models and approaches to complex problems.

Computational genomics is a critical area of research in biology. It studies the genomes of cells and organisms through high-throughput genome sequencing, which requires extensive post-processing known as genome assembly, and through DNA microarray technologies, which support statistical analyses of the genes expressed in individual cell types. The problem arises when a genome is sequenced as thousands of small fragments: assembling, studying, and comparing genomes is very complicated. Assembling the fragments for a human genome project can take several days of CPU time.
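A toy example helps show why assembly is so compute-hungry. The Python sketch below uses a simple greedy overlap-merge strategy, chosen here purely for illustration; it is not the method of any particular genome project, and production assemblers use far more sophisticated techniques. The function names and the sample fragments are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for size in range(max_len, min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the largest overlap."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        # Every fragment is compared with every other one: quadratic work.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return reads  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)


# Hypothetical fragments that overlap by three bases each.
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))  # ['ATGCGTACCTGA']
```

Each merge step compares every fragment against every other one, so the work grows quickly with the number of fragments, which is one reason real assemblies are spread across many processors.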

In addition, a large amount of information is produced through biomedical images. With the rapid growth in biomedical research and diagnosis, the need for high-speed technology for accurate measurement, processing, and analysis of these images has increased. Analyzing gene and protein expression and cancer mutations, predicting protein structure, measuring biodiversity, and building computational models for molecular modeling and for biological systems are all areas that pose the challenges of faster distribution, availability, and the need for expensive high-speed computing tools.

GemStone's solution for computational biology
Grid data management requirements, each followed by the corresponding GemFire solution:
Ultra-high performance, low-latency data distribution and high data throughput
  • In-memory data management with no disk latency.
  • Highly parallel data distribution and optimized
Caching and persistence models for extremely large volumes of data across distributed environments
  • Multiple, flexible data caching topologies.
  • Distributed data partitioning schemes across servers for dynamic data scalability under load.
  • Support for distributed transactions to facilitate a grid approach in OLTP environments.
  • Optional overflow to disk to accommodate excess data on a single
Data virtualization and location transparency
  • Pool memory and disk across the network, or manage large quantities of data and transparently move data on demand.
  • Access and synchronize data to/from data sources.
Guaranteed QoS for data access and high availability for reliable operations
  • Advanced data placement strategies to co-locate data where it is most frequently requested and avoid expensive transformations and network hops.
  • Main memory replication or disk persistence to guarantee availability of data.
Multiple data format and language support with true interoperability
  • Native support for multiple data formats: Java, C/C++. Works on both x86 and UNIX platforms.
Seamless fit into existing enterprise architectures
  • Standards-based interface for easy insertion into distributed environments.
  • Pre-configured interceptors/drivers for zero-code integration with databases.
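
The idea of partitioning data across servers, listed among the requirements above, can be illustrated with a minimal Python sketch. It uses plain hash-modulo placement as an assumption for illustration; the article does not describe GemFire's actual partitioning scheme, and the function names, server names, and keys here are hypothetical.

```python
import hashlib


def owner(key: str, servers: list[str]) -> str:
    """Pick the server responsible for `key` using a stable hash."""
    if not servers:
        raise ValueError("at least one server is required")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[bucket]


def partition(keys: list[str], servers: list[str]) -> dict[str, list[str]]:
    """Group keys by the server that owns them."""
    placement: dict[str, list[str]] = {server: [] for server in servers}
    for key in keys:
        placement[owner(key, servers)].append(key)
    return placement


# Hypothetical sequence identifiers spread over three hypothetical servers.
sample_keys = [f"genome-{i}" for i in range(12)]
for server, held in partition(sample_keys, ["s1", "s2", "s3"]).items():
    print(server, held)
```

Because the hash is stable, any member can compute where a key lives without asking a central directory. A simple modulo scheme like this one moves many keys whenever a server is added or removed, which is why real data grids tend to use more elaborate placement strategies.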

Addressing the needs of computational biology

The answer lies in grid computing, which can solve the complex issues faced in computational biology. Ghosh added, “The solution (grid computing) enables the virtualization of key system resources such as CPU, memory, disk and storage spread across disparate systems as a single managed entity. The key areas where grid computing comes as a solution in computational biology are managing and virtualizing a set of resources, scheduling tasks across a set of compute nodes, and maintaining a desired level of quality of service.” Technologies such as GemStone’s GemFire, which acts as a data grid, and IBM’s Blue Gene cater to these requirements and address the need for performance, high scalability, and low latency in computational biology.

GemFire acts as a scalable, distributed platform that manages increasing volumes of enterprise data with low latency. Its advanced data virtualization and data distribution capabilities enable the delivery of actionable information to the right application at the right time.

Ghosh said, “The high volumes of data found in genome projects, proteomics and protein structure determination are always at risk. Scientists and researchers are always worried about misplacing such critically important data. Technology such as GemFire gives a unified view of the data located at different places. It writes an application that configures and locates the exact data needed. It provides a distributed event notification service in which updates to a data region are propagated through a ‘listener’ framework to any member subscribing to that data.”
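
The listener idea Ghosh describes can be pictured with a small Python sketch. This is not GemFire's API: the `Region` class, its methods, and the sample key are hypothetical, and the sketch runs in a single process, whereas the real product propagates events between members across a network.

```python
from typing import Any, Callable

# A listener receives (key, old_value, new_value) for every update.
Listener = Callable[[str, Any, Any], None]


class Region:
    """A toy in-memory data region that notifies subscribers on every update."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._data: dict[str, Any] = {}
        self._listeners: list[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        """Register a callback to be invoked after each put."""
        self._listeners.append(listener)

    def put(self, key: str, value: Any) -> None:
        """Store the value, then propagate the change to all subscribers."""
        old = self._data.get(key)
        self._data[key] = value
        for listener in self._listeners:
            listener(key, old, value)

    def get(self, key: str) -> Any:
        return self._data.get(key)


# A hypothetical subscriber that simply records what changed.
events = []
sequences = Region("sequences")
sequences.subscribe(lambda key, old, new: events.append((key, old, new)))
sequences.put("sample-1", "ATGC")
sequences.put("sample-1", "ATGCGT")
print(events)
```

Subscribers learn about changes as they happen instead of polling for them, which is the property that matters when many researchers work against the same shared data.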

Further, GemFire consists of member data nodes connected to one another in a peer-to-peer fashion, so that each member is aware of the availability of every other member at any time. The distributed cache API presents the entire distributed system as if it were one logical cache, abstracting the actual location of the data or the data source from the developer. It provides a variety of cache topologies, such as peer-to-peer, client-server, and WAN. This keeps data copies, context switching, and layering to a minimum and moves data efficiently across all topologies.

A grid scheduler connected to GemFire manages the initial distribution of data among the processing units and ensures that the required data is present or en route as jobs are dispatched. As tasks finish, it either directs new jobs to processing units that already hold the right data or initiates additional data distribution as needed. Because it takes detailed knowledge of data locality into account, it can minimize the total data distribution load on the network by sending only the required information, and maximize processor utilization by reducing the time processors spend waiting for data. Data routing and distributed caching provide high availability even for huge amounts of data, and help improve performance and scalability while lowering latency.
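
The scheduling behaviour described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the scheduler that ships with or connects to GemFire: the `dispatch` function, the node names, and the dataset names are hypothetical, and shipping data is simulated by updating a set.

```python
def dispatch(jobs, node_data):
    """Assign each job to a node, preferring nodes that already hold its dataset.

    jobs: list of (job_id, dataset) pairs.
    node_data: dict mapping node name -> set of datasets cached on that node.
               It is updated in place when data has to be shipped.
    Returns (assignments, transfers); transfers lists (dataset, node) copies made.
    """
    if not node_data:
        raise ValueError("at least one node is required")
    load = {node: 0 for node in node_data}
    assignments, transfers = {}, []
    for job_id, dataset in jobs:
        holders = [n for n in node_data if dataset in node_data[n]]
        candidates = holders or list(node_data)  # fall back to any node
        node = min(candidates, key=lambda n: load[n])  # least-loaded candidate
        if dataset not in node_data[node]:
            node_data[node].add(dataset)  # simulate shipping the data
            transfers.append((dataset, node))
        assignments[job_id] = node
        load[node] += 1
    return assignments, transfers


# Hypothetical cluster: n1 already caches genomeA, n2 is empty.
nodes = {"n1": {"genomeA"}, "n2": set()}
work = [("j1", "genomeA"), ("j2", "genomeB"), ("j3", "genomeB")]
print(dispatch(work, nodes))
# ({'j1': 'n1', 'j2': 'n2', 'j3': 'n2'}, [('genomeB', 'n2')])
```

In this run genomeB is shipped once and reused by the following job, so only one transfer crosses the network for three jobs. The sketch always prefers a node that holds the data, even a busy one; a real scheduler would weigh the cost of waiting against the cost of moving data.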

ecrep.bnb@expressindia.com

 


© Copyright 2001: Indian Express Newspapers (Mumbai) Limited (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in Mumbai by the Business Publications Division (BPD) of the Indian Express Newspapers (Mumbai) Limited. Site managed by BPD.