www.expresscomputeronline.com WEEKLY INSIGHT FOR TECHNOLOGY PROFESSIONALS
10 March 2008  

Vendor Accent

The case for data de-duplication

Surajit Sen explains the need for data de-duplication and discusses the benefits of the technology.

Businesses' growing dependence on automation has produced burgeoning volumes of data, which need to be stored, managed, protected and retrieved based on business demands. In an age where data is doubling every 12 to 24 months in most organizations, storage networking technology has solved the problem of efficient data management to a large extent.

However, deliberate copies of data now contribute substantially to the growing volume of data in organizations. Across the globe, important data is routinely copied to multiple locations to protect against all types of disaster and loss. In the field of data mining, for example, huge databases are often copied for the purpose of running business intelligence queries. Numerous temporary copies of large databases are also created deliberately in the process of application development and testing.

To complicate matters further, countless copies of data files are created inadvertently when individuals share files. Consider the multiplier when an individual sends a file to 15 colleagues: after the recipients save the file to their personal systems, each saved copy might be copied again once for backup, a second time for compliance and a third time for disaster recovery. So, sending a single file to 15 people could create up to 60 copies of the file: 15 deliberate copies and an additional 45 unintentional ones.
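The arithmetic behind the 60 copies can be checked with a short sketch. The function name and its parameters are illustrative assumptions; the only inputs taken from the example are the 15 recipients and the three secondary copies (backup, compliance, disaster recovery) per saved file.

```python
def copies_created(recipients: int, secondary_copies_per_file: int = 3) -> dict:
    """Count the copies that result from sharing one file.

    Assumption: every recipient saves the file once, and each saved file is
    then copied `secondary_copies_per_file` times (backup, compliance and
    disaster recovery in the example).
    """
    deliberate = recipients  # one saved copy per recipient
    unintentional = recipients * secondary_copies_per_file
    return {
        "deliberate": deliberate,
        "unintentional": unintentional,
        "total": deliberate + unintentional,
    }


print(copies_created(15))
# {'deliberate': 15, 'unintentional': 45, 'total': 60}
```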

Data de-duplication is a technology that helps reduce the cost of data storage in such environments.

The business case for data de-duplication

The disk storage market has always struggled to balance the increase in storage requirements with pressures on IT organizations to reduce costs. This pressure was instrumental in driving up the density of disk drives at a reduced cost per GB. Not very long ago, the largest disk drives used in enterprise storage systems held 9 GB. Today, most storage vendors are shipping 750 GB drives (at almost the same price as the 9 GB drive).

Though this has helped to keep storage costs low, the proliferation of data copies has nullified this benefit to a large extent.

Most organizations are forced to create multiple copies of the same data for reasons such as backup, test/development, QA, compliance and disaster recovery. It is fairly common for IT organizations to use 100 TB of storage for just 10 TB of primary data. This is a huge cost and, no matter how cheap disk drives become, this additional burden will continue to keep CIOs awake at night.

Data de-duplication addresses the issue by offering smart copies, that is, logical copies instead of physical copies. Logical copies use intelligent manipulation of pointers to represent the same block of data in multiple places instead of physically copying the same block of information multiple times. This meets the business requirement of having multiple copies without increasing storage costs. In addition, by saving significantly on disk capacity requirements, it also helps reduce physical infrastructure costs such as power, cooling and floor space.
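The idea of a logical copy can be illustrated with a minimal sketch. The `BlockStore` class, its method names and the 4 KB block size are hypothetical and do not describe any vendor's implementation; the sketch only shows that a copy made of pointers adds no new data blocks.

```python
class BlockStore:
    """Toy block store in which a 'copy' is a new list of pointers, not new data."""

    def __init__(self):
        self.blocks = {}    # block id -> bytes actually held on disk
        self.refcount = {}  # block id -> number of volumes pointing at it
        self.volumes = {}   # volume name -> ordered list of block ids
        self._next_id = 0

    def write_volume(self, name, chunks):
        """Store new data physically, one block per chunk."""
        ids = []
        for chunk in chunks:
            block_id = self._next_id
            self._next_id += 1
            self.blocks[block_id] = chunk
            self.refcount[block_id] = 1
            ids.append(block_id)
        self.volumes[name] = ids

    def logical_copy(self, source, target):
        """Create a copy by duplicating pointers only."""
        ids = list(self.volumes[source])
        for block_id in ids:
            self.refcount[block_id] += 1
        self.volumes[target] = ids

    def read_volume(self, name):
        return b"".join(self.blocks[block_id] for block_id in self.volumes[name])

    def physical_bytes(self):
        return sum(len(data) for data in self.blocks.values())

    def logical_bytes(self):
        return sum(len(self.blocks[b]) for ids in self.volumes.values() for b in ids)


store = BlockStore()
store.write_volume("primary", [b"A" * 4096, b"B" * 4096])
store.logical_copy("primary", "test")  # copy for test/development
store.logical_copy("primary", "dr")    # copy for disaster recovery

print(store.logical_bytes())   # 24576: three volumes are visible to users
print(store.physical_bytes())  # 8192: only one set of blocks is stored
```

A real system would also have to handle writes to a shared block, typically by giving the writing volume its own block; that step is left out of this sketch.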

The second way in which data de-duplication technology can deliver big savings is within the primary data volume itself. It is quite common for the same files or objects to be stored in many locations within a primary data volume. An example is an e-mail attachment being forwarded to multiple people within the same organization. This results in the same data being stored many times within the same e-mail database.

Regulatory compliance and corporate governance norms make it mandatory for organizations under their purview to store data in a tamperproof format for a certain period of time. There are also stringent norms on the time taken to produce data should it be asked for. Clearly, tapes are not an option since they are neither reliable nor quick. This means that the volume of data that needs to be stored on disk is huge, since data cannot be deleted for extended periods of time.

Single instance storage is a technology that can significantly reduce disk space requirements. Taken a step further, the potential disk space savings are many times higher if we single-instance blocks of data instead of files. A block of data can be duplicated many times within the same data set even if the file that contains it is not repeated. Single instancing of such blocks can result in disk space savings of the order of 20 to 50 times.
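The difference between single instancing files and single instancing blocks can be sketched as follows. The fixed 4 KB block size, the use of SHA-256 fingerprints and the sample data are assumptions made for illustration; the savings shown apply only to this toy data set and are not the 20 to 50 times quoted above.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def file_level_size(files):
    """Bytes stored if each distinct file is kept only once."""
    unique = {}
    for data in files:
        unique[hashlib.sha256(data).digest()] = len(data)
    return sum(unique.values())


def block_level_size(files, block_size=BLOCK_SIZE):
    """Bytes stored if each distinct block is kept only once."""
    unique = {}
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


# Two versions of a report share a header and body and differ in the last block.
header = b"H" * BLOCK_SIZE
body = b"B" * BLOCK_SIZE
report_v1 = header + body + b"1" * BLOCK_SIZE
report_v2 = header + body + b"2" * BLOCK_SIZE
files = [report_v1, report_v2, report_v1]  # the third file is an exact duplicate

print(sum(len(f) for f in files))  # 36864 bytes with no de-duplication
print(file_level_size(files))      # 24576 bytes: the exact duplicate is removed
print(block_level_size(files))     # 16384 bytes: shared blocks are removed too
```

File-level single instancing removes only the exact duplicate, while block-level single instancing also removes the header and body shared by the two versions.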

De-duplication and single instance storage

Data de-duplication through smart copies has been attempted by industry players for some time.

Single instance storage was pioneered by a few start-up vendors and was adopted in disk-based backup environments. The first attempts at de-duplication technology addressed secondary copies of data by eliminating redundant blocks, which helped save on disk capacity requirements for backup. This was used both on direct disk targets like NAS staging devices and on virtual tape libraries.

Soon the potential impact of this technology was seen by other storage vendors who started offering their own versions of de-duplication.

It is stating the obvious to say that the possibilities in de-duplication technology are exciting. What if this technology could be extended to primary storage? What if you managed to eliminate all traces of redundant data from your environment? Today almost all storage vendors are either announcing this technology as part of their roadmap or have already announced general availability on their primary storage systems.

Different vendors take different approaches to treating redundant blocks of data. Some vendors de-duplicate in the data path, whereas others de-duplicate once the data has been written to the storage system. Some vendors use hashing algorithms, whereas others use file system intelligence to achieve this. It remains to be seen which approach will be adopted by the majority. What remains undisputed, however, is the attractive value proposition of data de-duplication in IT organizations today.
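The two timing choices mentioned above, de-duplicating in the data path or after the data has been written, can be contrasted in a small sketch that uses hash fingerprints. The class names, the SHA-256 fingerprint and the 4 KB blocks are assumptions for illustration and do not represent any particular vendor's design.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Identify a block by its SHA-256 hash (an assumed choice of algorithm)."""
    return hashlib.sha256(block).hexdigest()


class InlineStore:
    """De-duplicates in the data path: a duplicate block is never stored."""

    def __init__(self):
        self.unique = {}  # fingerprint -> block data
        self.layout = []  # fingerprints in write order

    def write(self, block):
        fp = fingerprint(block)
        self.unique.setdefault(fp, block)  # keep only the first instance
        self.layout.append(fp)

    def stored_bytes(self):
        return sum(len(b) for b in self.unique.values())


class PostProcessStore:
    """Writes every block first, then removes duplicates in a later pass."""

    def __init__(self):
        self.raw = []     # blocks landed as-is, not yet de-duplicated
        self.unique = {}  # fingerprint -> block data
        self.layout = []  # fingerprints of de-duplicated blocks, in order

    def write(self, block):
        self.raw.append(block)  # no hashing work on the write path

    def deduplicate(self):
        for block in self.raw:
            fp = fingerprint(block)
            self.unique.setdefault(fp, block)
            self.layout.append(fp)
        self.raw.clear()

    def stored_bytes(self):
        return sum(len(b) for b in self.raw) + sum(len(b) for b in self.unique.values())


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]

inline = InlineStore()
post = PostProcessStore()
for block in blocks:
    inline.write(block)
    post.write(block)

print(inline.stored_bytes())  # 8192: duplicates were dropped as they arrived
print(post.stored_bytes())    # 16384: everything is stored until the pass runs
post.deduplicate()
print(post.stored_bytes())    # 8192: same end state after the pass
```

Both stores end with the same two unique blocks. The trade-off the sketch exposes is that the in-path store does hashing work on every write, while the post-process store needs temporary capacity for the duplicates until its pass runs.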

Data de-duplication is an important new technology that is quickly being embraced by users as they struggle with issues of data proliferation. By eliminating redundant data blocks, an immediate benefit is obtained through space and cost efficiencies. When choosing a de-duplication product, however, it is important to consider all aspects of design, including space savings efficiency, performance overhead, and resiliency against failure.

The author is National Sales Manager, Network Appliance India. For more information, write to soumitra@netapp.com

 


© Copyright 2001: Indian Express Newspapers (Mumbai) Limited (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in Mumbai by the Business Publications Division (BPD) of the Indian Express Newspapers (Mumbai) Limited. Site managed by BPD.