Vendor Accent
The case for data de-duplication
Surajit Sen explains the need for data de-duplication
and discusses the various underlying benefits of this technology
The increased dependence of businesses on automation has resulted in burgeoning
volumes of data, which need to be stored, managed, protected and retrieved based
on business demands. In an age where data is doubling every 12 to 24 months
in most organizations, storage networking technology has solved the problem
of efficient data management to a large extent.
However, deliberate copies of data are today contributing substantially to the
growing volume of data in organizations. Across the globe, important data is
routinely copied to multiple locations to protect against all types of loss
and disaster. In the field of data mining, for example, huge databases are often
copied for the purpose of running business intelligence queries. Numerous temporary
copies of large databases are also created deliberately in the process of application
development and testing.
To further complicate matters, countless copies of data files are inadvertently
created when individuals share files. Consider the multiplier when an individual
sends a file to 15 colleagues: After the recipients save the file to their personal
systems, the file might also be copied once for backup, a second time for compliance
and a third time for disaster recovery. So, sending a single file to 15 people
could create up to 60 copies of the file: 15 deliberate copies and an additional
45 unintentional ones.
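The arithmetic behind that multiplier can be checked with a short sketch. The
function name and its parameters are illustrative only; the one assumption is
the scenario described above, in which every recipient saves the file once and
each saved file is then copied three more times (backup, compliance and disaster
recovery).

```python
def copies_created(recipients: int, secondary_copies_per_file: int = 3) -> dict:
    """Count the copies produced when one file is sent to `recipients` people.

    Assumes each recipient saves the file once, and that every saved file is
    then copied `secondary_copies_per_file` times by downstream processes.
    """
    deliberate = recipients                               # one saved copy per recipient
    unintentional = recipients * secondary_copies_per_file
    return {
        "deliberate": deliberate,
        "unintentional": unintentional,
        "total": deliberate + unintentional,
    }


result = copies_created(15)
print(result)  # {'deliberate': 15, 'unintentional': 45, 'total': 60}
```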
Data de-duplication is a technology that helps reduce the cost of data storage
in such environments.
The business case for data de-duplication
The disk storage market has always struggled to balance the increase in storage
requirements with pressures on IT organizations to reduce costs. This phenomenon
was instrumental in driving up the density of disk drives at a reduced cost
per GB. Not very long ago, the largest disk drive shipped in enterprise storage
systems held 9 GB. Today, most storage vendors are shipping 750 GB drives (at
almost the same price as the 9 GB drive).
Though this has helped to keep storage costs low, the proliferation of data
copies has nullified this benefit to a large extent.
Most organizations are forced to create multiple copies of the same data for
reasons such as backup, test/development, QA, compliance and disaster recovery.
It is fairly common for IT organizations to use 100 TB of storage for just 10
TB of primary data. This is a huge cost and, no matter how cheap disk drives
become, this additional burden will always keep CIOs awake at night.
Data de-duplication addresses the issue by offering Smart Copies, i.e. logical
copies instead of physical copies. Logical copies use intelligent manipulation
of pointers to represent the same block of data in multiple places instead of
physically copying the same block of information multiple times. This meets
the business requirement of having multiple copies without increasing storage
costs. In addition, by saving significantly on disk capacity requirements, it
also helps reduce physical infrastructure costs such as power, cooling and floor
space.
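To make the idea of pointer-based logical copies concrete, here is a minimal
sketch. It is a toy model, not any vendor's implementation: the `BlockStore`
class, its method names and the 4-byte block size are all assumptions made for
illustration. A volume is simply a list of pointers to blocks, so a logical copy
duplicates the pointers and leaves the physical blocks untouched.

```python
class BlockStore:
    """Toy store: each physical block is kept once; volumes are pointer lists."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes (the physical data)
        self.refcount = {}   # block id -> number of volumes pointing at it
        self.volumes = {}    # volume name -> list of block ids (the pointers)
        self._next_id = 0

    def write_volume(self, name, data, block_size=4):
        """Physically write `data` as fixed-size blocks (hypothetical size)."""
        ids = []
        for i in range(0, len(data), block_size):
            block_id = self._next_id
            self._next_id += 1
            self.blocks[block_id] = data[i:i + block_size]
            self.refcount[block_id] = 1
            ids.append(block_id)
        self.volumes[name] = ids

    def smart_copy(self, src, dst):
        """Logical copy: duplicate the pointers, not the blocks."""
        ids = list(self.volumes[src])
        for block_id in ids:
            self.refcount[block_id] += 1
        self.volumes[dst] = ids

    def read_volume(self, name):
        return b"".join(self.blocks[block_id] for block_id in self.volumes[name])

    def physical_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = BlockStore()
store.write_volume("primary", b"ABCDEFGHIJKL")
before = store.physical_bytes()

# Three additional copies for different purposes, none of which adds blocks.
for purpose in ("backup", "test", "dr"):
    store.smart_copy("primary", purpose)
after = store.physical_bytes()

print(before, after, len(store.volumes))
```

In this sketch the store ends up with four readable volumes while physical usage
stays at the size of the original data. The reference counts are there so that a
real system would know when a block is no longer needed by any copy.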
The second way in which data de-duplication technology can deliver big savings
is within the primary data volume itself. It is quite common for the same files
or objects to be stored in many locations within a primary data volume. An example
is the instance of an e-mail attachment being forwarded to multiple people within
the same organization. This results in the same data being stored by multiple
people within the same e-mail database.
Regulatory compliance and corporate governance norms make it mandatory for organizations
under their purview to store data in a tamperproof format for a certain period
of time. There are also stringent norms on the time taken to produce data should
it be asked for. Clearly, tapes are not an option since they are neither reliable
nor quick. This means that the volume of data that needs to be stored on disk
is huge, since data cannot be deleted for extended periods of time.
Single instance storage, which keeps only one copy of a file however many times
it is saved, is a technology that can significantly reduce disk space requirements.
Taken a step further, the potential disk space savings are multiple times higher
if we single-instance blocks of data instead of files. A block of data often
gets duplicated multiple times within the same data set, even if the file itself
has not been repeated. Single instancing of such blocks can result in disk space
savings of the order of 20 to 50 times.
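The difference between single instancing files and single instancing blocks can
be illustrated with a small sketch. Everything in it is an assumption made for
the example: the 8-byte block size, the use of SHA-256 fingerprints to recognize
identical content, and the three tiny sample "files". The savings it shows say
nothing about real-world ratios, which depend entirely on the data.

```python
import hashlib

BLOCK_SIZE = 8  # bytes; an arbitrary size chosen for this illustration


def file_level_stored(files):
    """Bytes stored when only whole identical files are single-instanced."""
    unique = {hashlib.sha256(f).hexdigest(): len(f) for f in files}
    return sum(unique.values())


def block_level_stored(files, block_size=BLOCK_SIZE):
    """Bytes stored when identical fixed-size blocks are single-instanced."""
    unique = {}
    for f in files:
        for i in range(0, len(f), block_size):
            block = f[i:i + block_size]
            unique[hashlib.sha256(block).hexdigest()] = len(block)
    return sum(unique.values())


# Two identical files plus an edited version that differs only in its last block.
base = b"AAAAAAAA" + b"BBBBBBBB" + b"CCCCCCCC"
edited = b"AAAAAAAA" + b"BBBBBBBB" + b"DDDDDDDD"
files = [base, base, edited]

raw = sum(len(f) for f in files)
file_level = file_level_stored(files)
block_level = block_level_stored(files)

print(raw, file_level, block_level)  # 72 48 32
```

File-level single instancing removes only the exact duplicate, whereas block-level
single instancing also removes the blocks that the edited file shares with the
original.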
De-duplication and single instance storage
Data de-duplication through smart copies has been attempted by industry players
for some time.
Single instance storage was pioneered by a few start-up vendors and was adopted
in disk-based backup environments. The first attempts at de-duplication technology
addressed secondary copies of data, eliminating redundant blocks of data to save
on disk capacity requirements for backup. This was used both on direct disk targets
such as NAS staging devices and on virtual tape libraries. Soon the potential
impact of this technology was seen by other storage vendors, who started offering
their own versions of de-duplication.
It is stating the obvious to say that the possibilities of de-duplication technology
are exciting. What if this technology could be extended to primary storage? What
if you managed to eliminate all traces of redundant data from your environment?
Today almost all storage vendors are either announcing this technology as part
of their roadmap or have already announced general availability on their primary
storage systems.
Different vendors have differing approaches to treating redundant blocks of
data. Some vendors de-duplicate in the data path, whereas others de-duplicate
once the data has been written to the storage system. Some vendors are using
hashing algorithms, whereas others are using file system intelligence to achieve
this. It remains to be seen which approach will be adopted by the majority. What
remains undisputed, however, is the attractive value proposition of data de-duplication
in IT organizations today.
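The two timing choices mentioned above, de-duplicating in the data path versus
de-duplicating after the data has been written, can be contrasted in a short
sketch. It is a simplified model under stated assumptions: blocks are identified
by a SHA-256 hash, the function names are hypothetical, and "peak" simply counts
how many blocks sit on disk at the busiest moment. It does not model the performance
overhead that each approach carries in practice.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Hash used to recognize identical blocks (an assumed choice)."""
    return hashlib.sha256(block).hexdigest()


def inline_dedup(blocks):
    """De-duplicate in the data path: look up each block before writing it."""
    store = {}
    peak = 0
    for block in blocks:
        key = fingerprint(block)
        if key not in store:          # only previously unseen blocks are written
            store[key] = block
        peak = max(peak, len(store))
    return store, peak


def post_process_dedup(blocks):
    """Write everything first, then scan and remove the duplicates."""
    landed = list(blocks)             # every incoming block reaches disk
    peak = len(landed)
    store = {}
    for block in landed:
        store.setdefault(fingerprint(block), block)
    return store, peak


incoming = [b"alpha", b"beta", b"alpha", b"alpha", b"gamma", b"beta"]
inline_store, inline_peak = inline_dedup(incoming)
post_store, post_peak = post_process_dedup(incoming)

print(len(inline_store), inline_peak)  # 3 3
print(len(post_store), post_peak)      # 3 6
```

Both approaches end with the same three unique blocks. In this model the difference
is that the post-process variant temporarily needs room for all six incoming blocks,
while the inline variant pays for a lookup on every write.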
Data de-duplication is an important new technology that is quickly being embraced
by users as they struggle with issues of data proliferation. By eliminating
redundant data blocks, an immediate benefit is obtained through space and cost
efficiencies. When choosing a de-duplication product, however, it is important
to consider all aspects of design, including space savings efficiency, performance
overhead, and resiliency against failure.
The author is National Sales Manager, Network Appliance
India. For more information, write to soumitra@netapp.com