Lead
Grid computing to aid computational biology
Building and running a high-performance, low-latency, scalable
data engine for computational biology requires grid computing,
says V Sudhakshina
Computational biology is an interdisciplinary field that applies the techniques
of computer science, informatics, artificial intelligence, chemistry,
biochemistry, applied mathematics, and statistics to solve problems that arise
in biology. It covers the management, analysis, and simulation of the complex
structures and processes inherent in living systems, using databases and
software to store biological data and to apply algorithms to it.
Additionally, most computational projects related to biology generate very large
volumes of data. These include genome projects, proteomics, protein structure
determination, and the rapidly expanding digitization of patient biomedical
data.
Major fields in biology that require computation include sequence alignment,
gene finding, genome assembly, protein structure alignment, protein structure
prediction, prediction of gene expression and protein-protein interactions,
and the modeling of evolution. Through computational biology, a scientist
or researcher can analyze DNA to locate genes, analyze RNA sequences to predict
their structure, analyze protein sequences to predict their location inside a
particular cell, analyze gene expression images, identify new drug molecules,
build computational models of biological systems, and analyze the genomes of a
cell or organism for genome annotation.
Challenges faced
The challenges faced in the field of computational biology are as complex as
the subject itself. According to Meghadri Ghosh, Director of Engineering, GemStone
Systems, there is a rapid increase in the need for system performance and for
a high level of scalability and availability, along with a need to address and
optimize latency. Existing data solutions are unable to meet those challenges.
The major concerns in computational biology include the huge amount of data
generated by multiple sequencing projects, the development of efficient
algorithms to search for systems of genes or proteins and their interactions,
and the location of DNA, RNA, genes, proteins, and molecules.
Sequencing has moved into a different phase of innovation. Where earlier only
a single cell was studied for sequencing, now a whole organism is studied.
The DNA sequences of hundreds of organisms are decoded and stored in databases.
These stored sequences are then analyzed to determine the genes, which
in turn indicate protein functions. Sequencing a single cell alone yields
about 3 GB of data, so the data produced by hundreds of organisms is enormous,
and it needs to be processed at a fast pace.
Various computer programs apply algorithms and statistical techniques
to biological datasets that typically store large numbers of DNA, RNA, or protein
sequences.
| IBM's Blue Gene family of supercomputers is
designed to deliver ultra-scale performance within a standard programming
environment while delivering efficiencies in power, cooling, and floor-space
consumption. It extends performance
through density and frequency, 4-way SMP enhanced functionality, scalability
to petaflop performance, and the aggressive power management for low power
consumption first established with the Blue Gene Solution. It provides up
to 1,048,576 cores (262,144 nodes), resulting in high scalability. Systems
such as Blue Gene are designed to handle data-intensive applications such
as content distribution, simulation and modeling [as found in computational
biology research], web serving, data mining, and business intelligence. Customers
note that time-to-solution for many applications has been reduced by orders
of magnitude. Scientists can make new runs more often, allowing them to explore
alternative models and approaches to solving complex problems. |
Computational genomics is a critical area of research in biology. It studies
the genomes of cells and organisms through high-throughput genome sequencing,
which requires extensive post-processing known as genome assembly, and through
DNA microarray technologies, which are used to perform statistical analyses on
the genes expressed in individual cell types. The problem arises with the
sequencing of thousands of small genome fragments: assembling, studying, and
comparing genomes is very complicated. Assembling the fragments for a human
genome project can take several days of CPU time.
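To make the idea of assembling fragments concrete, here is a minimal Python sketch of a greedy overlap merge. It is an illustration only, not the method used by any real assembler or by the projects mentioned here: the function names, the toy reads, and the minimum overlap of 3 bases are all assumptions, and the quadratic pair search would not scale to real genome data.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap (toy version)."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical reads cut from the sequence ATGGCGTACGTTAGC.
    reads = ["ATGGCGT", "GCGTACG", "TACGTTA", "GTTAGC"]
    print(greedy_assemble(reads))  # prints ['ATGGCGTACGTTAGC']
```

Every merge step compares all pairs of fragments, which hints at why assembly over thousands of fragments consumes days of CPU time and benefits from being spread across many nodes.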
Additionally, a large amount of information is produced through biomedical images.
With the rapid increase in biomedical research and diagnosis, the need for
high-speed technology for accurate measurement, processing, and analysis of
these images has increased. Analyzing gene and protein expression and cancer
mutations, predicting protein structure, measuring biodiversity, and building
computational models for molecular modeling and for biological systems are
areas that pose the challenges of faster distribution, availability, and the
need for expensive high-speed computing tools.
| Grid Data Management Requirements | GemFire Solution |
| Ultra-high performance, low-latency
data distribution and high data throughput |
- In-memory data management with no disk latency.
- Highly parallel data distribution and optimized
| Caching and persistence models for
extremely large volumes of data across distributed environments |
- Multiple, flexible data caching topologies.
- Distributed data partitioning schemes across servers for dynamic data
scalability under load.
- Support for distributed transactions to facilitate a grid approach
in OLTP environments.
- Optional overflow to disk to accommodate excess data on a single
| Data virtualization and location transparency |
- Pool memory and disk across the network to manage large quantities of
data and transparently move data on demand.
- Access and synchronize data to/from data sources
| Guaranteed QoS for data access and
high availability for reliable operations |
- Advanced data placement strategies to co-locate data where it is
most frequently requested and avoid expensive transformations and network
hops.
- Main memory replication or disk persistence to guarantee availability
of data.
| Multiple data format and language
support with true interoperability |
- Native support for multiple data formats: Java, C/C++. Can work
on both x86 and UNIX platforms.
| Seamless fit into existing enterprise
architectures |
- Standards-based interface for easy insertion into distributed environments.
- Pre-configured interceptors/drivers for zero-code integration with
a database.
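Two of the ideas in the table, partitioning data across servers and overflowing excess data to disk, can be sketched in a few lines of Python. This is not GemFire's API or its actual partitioning scheme; `Node`, `PartitionedCache`, the SHA-256 based placement, and the plain dictionary standing in for a disk store are all hypothetical choices made for illustration.

```python
from __future__ import annotations

import hashlib


class Node:
    """Hypothetical cache member: bounded memory plus an overflow area."""

    def __init__(self, name: str, memory_capacity: int) -> None:
        self.name = name
        self.memory_capacity = memory_capacity
        self.memory: dict[str, str] = {}
        self.overflow: dict[str, str] = {}  # stands in for an on-disk store

    def put(self, key: str, value: str) -> None:
        # Keep the entry in memory if it is already there or there is room;
        # otherwise spill it to the overflow area.
        if key in self.memory or len(self.memory) < self.memory_capacity:
            self.overflow.pop(key, None)
            self.memory[key] = value
        else:
            self.overflow[key] = value

    def get(self, key: str) -> str | None:
        if key in self.memory:
            return self.memory[key]
        return self.overflow.get(key)


class PartitionedCache:
    """Routes each key to one owning node using a stable hash of the key."""

    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = nodes

    def owner(self, key: str) -> Node:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, key: str, value: str) -> None:
        self.owner(key).put(key, value)

    def get(self, key: str) -> str | None:
        return self.owner(key).get(key)


if __name__ == "__main__":
    cache = PartitionedCache([Node("n1", 2), Node("n2", 2)])
    for i in range(6):
        cache.put(f"seq{i}", "ACGT" * (i + 1))
    for node in cache.nodes:
        print(node.name, "memory:", sorted(node.memory), "overflow:", sorted(node.overflow))
```

The caller only ever talks to `PartitionedCache`, which is the location transparency the table describes: the owning node and whether an entry sits in memory or in overflow stay hidden. Simple modulo placement reshuffles most keys when a node is added, so a production system would likely use a more stable scheme.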
Addressing the needs of computational biology
The answer lies in grid computing, which can solve the complex issues faced
in computational biology. Ghosh added that the solution (grid computing) enables
the virtualization of key system resources such as CPU, memory, disk, and storage
spread across disparate systems as a single managed entity. The key areas where
grid computing serves as a solution in computational biology are managing
and virtualizing a set of resources, scheduling tasks across a set of compute
nodes, and maintaining a desired level of quality of service. Technologies
such as GemStone's GemFire and IBM's Blue Gene act as a data grid,
cater to these requirements, and address the need for performance, high scalability,
and low latency in the field of computational biology.
GemFire acts as a scalable, distributed platform that manages increasing volumes
of enterprise data with low latency. Its advanced data virtualization and
data distribution capabilities enable the delivery of actionable information
to the right application at the right time.
Ghosh said that the high volumes of data found in genome projects, proteomics,
and protein structure determination are always at risk. Scientists and researchers
are always worried about misplacing such critically important data. Technology
such as GemFire gives a unified view of data located in different places.
An application written against it can be configured to locate exactly the data
it needs. It provides a distributed event notification service in which updates
to a data region are propagated through a listener framework to any member
subscribing to that data.
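The listener idea can be illustrated with a small in-process Python sketch. It does not use GemFire's API: the `Region` class, its `subscribe` method, and the callback signature are hypothetical, and a real data grid would deliver these notifications to members over the network rather than through direct function calls.

```python
from __future__ import annotations

from typing import Callable

# A listener receives the region name, the key, and the new value.
Listener = Callable[[str, str, object], None]


class Region:
    """Hypothetical named data region that notifies subscribers on every update."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._data: dict[str, object] = {}
        self._listeners: list[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def put(self, key: str, value: object) -> None:
        self._data[key] = value
        # Propagate the update to every member that subscribed to this region.
        for listener in self._listeners:
            listener(self.name, key, value)

    def get(self, key: str) -> object | None:
        return self._data.get(key)


if __name__ == "__main__":
    genomes = Region("genomes")
    genomes.subscribe(lambda region, key, value: print(f"[analysis] {region}/{key} updated"))
    genomes.subscribe(lambda region, key, value: print(f"[archive] {region}/{key} updated"))
    genomes.put("chr1", "ACGTACGT")
```

Subscribers learn about a change as soon as it is written, so they do not need to poll the region to find out whether the data they depend on has moved or been updated.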
Further, it consists of member data nodes that are connected to one another in
a peer-to-peer fashion, such that each member is aware of the availability of
every other member at any time. The distributed cache API presents the entire
distributed system as if it were just one logical cache, completely abstracting
the actual location of the data or the data source from the developer. It provides
a variety of cache topologies, such as peer-to-peer, client-server, and WAN.
This keeps data copies, context switching, and layering to a minimum and moves
data efficiently across all topologies. A grid scheduler connected to GemFire
manages the initial distribution of data among the processing units and ensures
that the required data is present or en route as the jobs are dispatched. As
soon as tasks finish, it either directs new jobs to processing units
that already hold the right data or initiates additional data
distribution as needed. The scheduler can take detailed knowledge of data locality
into account, minimize the total data distribution load on the network by sending
only the required information, and maximize processor utilization by reducing
the time processors spend waiting for data. Data routing and distributed caching
provide high availability even when there is a huge amount of data. They also
improve performance and make it easier to scale data with less latency.
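The scheduling behaviour described above can be sketched as a locality-aware dispatcher in Python. This is a simplified illustration under stated assumptions, not the scheduler that ships with or connects to GemFire: `Worker`, `Job`, `dispatch`, and the tie-breaking rule (the least-loaded candidate wins) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical processing unit that remembers which datasets it holds."""
    name: str
    cached: set[str] = field(default_factory=set)
    assigned: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Job:
    name: str
    dataset: str


def dispatch(jobs: list[Job], workers: list[Worker]) -> list[tuple[str, str]]:
    """Assign jobs, preferring workers that already hold the job's dataset.

    Returns the (dataset, worker name) transfers that had to be started.
    """
    transfers: list[tuple[str, str]] = []
    for job in jobs:
        # Prefer workers with the data; fall back to all workers otherwise.
        local = [w for w in workers if job.dataset in w.cached]
        candidates = local or workers
        target = min(candidates, key=lambda w: len(w.assigned))
        if job.dataset not in target.cached:
            transfers.append((job.dataset, target.name))  # data must be shipped
            target.cached.add(job.dataset)
        target.assigned.append(job.name)
    return transfers


if __name__ == "__main__":
    workers = [Worker("A", cached={"chr1"}), Worker("B")]
    jobs = [Job("j1", "chr1"), Job("j2", "chr2"), Job("j3", "chr1"), Job("j4", "chr2")]
    print(dispatch(jobs, workers))  # prints [('chr2', 'B')]
    for w in workers:
        print(w.name, w.assigned)
```

In this run only one dataset crosses the network: later jobs on `chr1` and `chr2` go to the workers that already hold them. That is the effect the article attributes to data-locality-aware scheduling.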
ecrep.bnb@expressindia.com