Untitled Document
www.expresscomputeronline.com WEEKLY INSIGHT FOR TECHNOLOGY PROFESSIONALS
7 February 2011  
Untitled Document
Sections

Cover Story
Trend
Tech Views
News
Products

Express Intelligent Enterprise

Events

Technology Senate
Technology Sabha

Services
Subscribe/Renew
Archives
Search
Contact Us
Network Sites
Exp.Channel Business
Express Hospitality
Express TravelWorld
Express Pharma
Express Healthcare
Group Sites
ExpressIndia
Indian Express
Financial Express

Untitled Document
 
Home - Trend - Article

DW: the devil’s in the details

Should you go with an appliance or take a software-based approach? Who needs real-time performance? How do you go about optimizing performance? Prashant L. Rao talks to the experts and gets the answers

In the late 1990s there used to be custom data warehouses. In the early 2000s, you had the concept of an appliance with server, storage and database software rolled into one. These proprietary systems were promoted by several small organizations including Netezza. Organizations could buy these in their entirety or not at all. “Today, appliance vendors have gone back to selling only the software. Netezza, Greenplum etc. are all selling software that runs on commodity hardware,” said Vinod Shankar - Head, Business Intelligence, activecubes.

“For organizations facing a time crunch, it would make sense to go for an appliance. If an organization has time on its hands and its business requirements are going to be known as we move from one roll out to another and the data volumes are not that high and will be large only in five or six years, then the software-based approach would be suitable. Vendors are providing specialized database software suited for data warehousing,” said Shankar.

Atul Jalan, Managing Director & CEO, Manthan Systems had an interesting take on this. “Appliances enables the customers to deal with one vendor for support and offer a high performance data warehouse. Whereas, in the case of software, the customer has to deal with separate hardware, software and implementation vendors,” he said.

It is a fact that total cost of data warehousing is more than the cost of appliances and that one of the largest cost contributors to a data warehousing project is that of source system integration. “Any data warehouse appliance does not reduce this cost. Customers should not opt for appliances from a RoI perspective but rather they should opt for these from the point of view of high performance as most appliances are fine tuned to cater to high volumes of data with high performance,” he added.

“When it comes to appliances, Teradata and Netezza are popular. Oracle's database has always been popular but the company's appliance has yet to take off. For ETL, Datastage and Informatica are popular options. In reporting, Business Objects has been on the top of the list. In the last two to three years, Oracle's reporting applications have started catching up,” said Sandeep Mohile, Information Management Head, Torry Harris Business Solutions (P) Ltd.

Zubin Dowlaty, Head of Innovation & Development, Mu Sigma, said, “There seems to be a trend towards appliance installations due to the rapid deployment. I suspect that this is a growing trend to get data solutions operational as soon as possible. Software generally will take longer in isolation versus an appliance.”

Amit Gupta, Associate Architect, BI & Analytics Competency, Persistent Systems, said, “Enterprises are gradually moving towards appliance-based tools to minimize the overheads of maintaining and buying individual software and the integration hassles of software. However, many companies still prefer to purchase individual software licenses to have flexibility to fine tune BI and ETL features in order to avoid being completely dependent on an appliance vendor.”

Ajoy Kumar, Director and Head of Data Warehousing & Business Intelligence Practice, Virtusa Corporation, said, “We see both appliances and software being deployed in large organizations. The high cost of data warehousing appliances makes them more relevant for large data warehouses that require high performance and very stringent service levels. If you prefer to start small, a software-based data warehouse would be the better choice.”

Rajesh Ramaswamy, Global Head Business Intelligence Practice, Marlabs Inc., said, “Appliances are typically architected in an MPP format as opposed to the SMP architecture of traditional databases. They are typically used when the data warehouse sizes are enormous and when they are expected to grow rapidly. Cost per terabyte of appliances is significantly lower that that of traditional databases. A typical example of an organization that uses an appliance could be a utilities organization taking smart meter readings from a million customers every 15 minutes.”

Storage: SATA or SSD

Sudesh Prabhu, Director Presales and Services, Sybase, commented, “Most organizations that have implemented a data warehouse would have SAN storage with it. Unless it's where someone's just using BI over OLAP. Then it may be between SATA or SSD. Most of our customers use a SAN.”

Shankar said, “Data warehousing is about large volumes of data. The lack of capacity on SSDs is a constraint. As they start offering better price/performance, we will see usage grow. Netezza etc. are already offering some degree of SSD. In three to four years, the usage could grow.”

“Typically, data in a warehouse is stored on a SAN environment with FC or SAS (SATA will give more GB per $, but it is slow and should not be used for databases, operating systems or any other application that is constantly writing to the disks especially in a data warehouse),” said Dowlaty.

“SSD provides high IO throughput for scattered reads, which are generated to answer ad-hoc queries and OLAP cubes. Whereas, SATA is a cost-effective solution that performs almost equally well with parallel RAID configuration for report generation and analytics queries that perform sequential data scan. Since most of the warehouse use cases include periodic report generation and dashboards, SATA is a preferred option,” said Gupta.

Kumar said, “SSDs are becoming increasingly popular for data warehouses. Increasingly, we see customers who build a new data warehouse using SSD for power consumption efficiency and performance.”

Ramaswamy felt that automated storage tiering (AST) was the best option until SSD prices came down. “SSD has a lot of promise thanks to its high performance. Nevertheless, commercial adoption hasn’t picked up as capacities remain relatively low and costs high. Using SSD by itself or in large amounts is not a scalable model especially when data is growing exponentially. Hence, AST is more suitable for today’s environments. SSD remains relevant for areas where performance is critical and cost is not a consideration but in all other scenarios AST (encompassing SSD/FC/SAS) will prevail,” he said.

Dr. Jay B. Simha, CTO, ABIBA Systems, said, “Currently major vendors are providing the built-in storage for the data warehouse with SAS disks, although corporations favor SAN storage.”

Real-time data warehousing

Prabhu pointed out, “With social networks and third party analytical data coming in, it doesn't allow you to create a warehouse or have a structure around it and you have to consume the data as it comes and that opens up the opportunity for a real-time warehouse. The emphasis is on your ability to manage and make use of that data in as effective a manner as you employ prearranged data. In capital markets, you see a different kind of real-time warehouse—not only data from Reuters/Bloomberg—and analysis is conducted in milliseconds. Other markets are OK with response times of a few seconds.”

Shankar analyzed, “It is catching on with organizations that have the money to spend as well as the sources, manpower and skills to help them in complex implementations such as real-time data warehouses. As a BI consulting provider, in a BI/DW environment, data must flow at the speed of business. There is a need for data to be real-time and for it to be automatically pushed to users. A lot of telcos are moving towards real-time data warehousing. Change Data Capture (CDC) is the most common type. Organizations have to be cognizant of the fact that these implementations are very expensive. There is a lot of complexity in how these solutions are structured. Today vendors will price you as per each CPU that you have in the source system. We had a quote of $25 million just to implement CDC for one of our customers. It doesn't employ traditional ETL processes.”

In a traditional setup, you have data being extracted from source systems into a staging area from where it is funneled into a data warehouse. In between source and staging there's another layer or repository where the changed data from your source system is brought and published. ETL systems will read data directly from the staging area that's been fetched from tables in the source database. In CDC it won't touch the source tables. It will get data from the change logs and do a mini/micro batch ETL. “Every 15 minutes data is pulled from the change logs and brought to the staging area and the mini-ETL will be run on it,” he explained.

Dowlaty said, “This is a strong trend in the USA. With the rise of message queue systems, social networks, and mobile explosion, real-time processing is going to accelerate. The biggest hurdle is that the data being queried is a stream versus a repository like in SQL; therefore the need for tools like Complex Event Processing (CEP) systems.”

Gupta said, “A real-time data warehouse is required for many business functions such as fraud detection, monitoring quality of services and products, reducing supply chain latency etc. that demand taking decisions in real-time. With explosive economic growth in markets like India many companies are realizing the need to take accurate decisions at the right time. Many telecommunication and retail companies across India have implemented a real-time data warehouse in order to closely track customer needs and execute targeted sales campaigns.”

Kumar said, “Customers that are leveraging social media for marketing are analyzing real-time data.”

Ramaswamy said, “Many businesses, both large and small, are integrating real-time software into their overall business intelligence archiving and retrieval processes. It is important to realize two aspects:

  • No data warehouse is fully real-time. Real-time data warehouses are actually near real-time
  • A lot of the information in a data warehouse is not required in real-time. When architecting the system it is important to understand what data points need to be reported in real-time and what should not.”

Simha argued that, “Real-time data warehousing is not required by everyone. Mostly retail (integrated with complete supply chain) requires a real-time data warehouse for pricing. Others may not require such a system, which distracts from original premise of the data warehouse, which is information on demand for better decision making.”

MPP vs. SMP

Prabhu said, “MPP per se is not new. It is being applied in data warehousing in certain scenarios where there was no other alternative. If you have over a hundred users and the database is 50 TB plus, then you had no choice but to go in for MPP. Now, with the processing power of servers growing multi-fold, SMP is more viable as it allows you to use open systems. You can use something that you already have and get that same level of performance. With Sybase IQ we have benchmarks that are at par with MPP benchmarks on more cost-effective hardware. There are traditional users of MPP who continue to expand. That technology only allows you to expand horizontally.”

Shankar explained, “MPP is typically a technology used in appliances. Microsoft, SAP, Oracle and IBM are saying that MPP is the next generation of data warehouse. Software appliances (Sybase IQ, Vertica) provide data warehousing functionality by not being about MPP alone. They provide columnar structures. MPP offers convenience in implementing a data warehouse by reducing your reliance on how you design an application. MPP is an appliance approach. Netezza used to have multiple nodes, each of which had a hard drive with a central controller that's responsible for getting data from a hundred hard disks. Traditionally when a user fires a query you go to one hard drive. In MPP, it will send queries to different hard drives and collate the results.”

Dowlaty believed that share nothing designs such as Aster, Hadoop, and Teradata where one could scale horizontally would win out.

Jalan argued that MPP and databases purpose-built for data warehousing were already edging out SMP. MPP offers superior scalability of computing power and throughput; SMP offers the best price/performance.

Gupta said, “With the advent of cloud computing and clustered servers, SMP is becoming less popular. For the multi terabyte data warehouse and extensive decision support requirements the MPP architecture is a necessity. MPP architecture being loosely coupled with each processor having its own cache and storage is a superior alternative that supports almost linear scalability. Although SMP architecture is raising the price/performance ratio with time, there exists a transition area where SMP and MPP coexist to serve an up to a terabyte data warehouse.”

Kumar said, “MPP is used for large data warehouses. As data warehouses grow in size and scale, MPP is becoming the preferred option.”

Ramaswamy outlined the scenarios in which each approach would shine. For organizations that deal with huge volumes of data that needs to be collected and processed an MPP architecture is the only way out. However, only a small percentage of organizations worldwide today require this kind of architecture. Other than large utilities, telecom players and very large retail organizations there are few companies that realistically need an MPP architecture.

Simha felt that MPP coupled with column store databases was likely to dominate the data warehousing market.

The problem of data quality

Data quality is one of the biggest problems that every data warehousing implementation faces. Shankar said, “We advise our clients to address data quality issues at the root. Many organizations believe that a data quality solution fits at the data warehousing/BI layer. That's a reactive way of solving the problem. We believe that you need to have data quality solutions implemented at the enterprise-wide level on different systems including ERP/ BI etc. That's when you have a holistic approach to data quality as an issue. Data quality issues happen when a data entry operator enters data incorrectly or in the wrong manner.”

Dowlaty said, “Cleaning up systems in the OLTP system (downstream) helps the most. For e.g. addressing scrubbing at the point of entry. Then you need to use Master Data Management principles, solid ETL and data quality tools.”

Kumar commented, “To improve data quality, organizations need to painstakingly cleanse the bad data (name and address validation etc.). In addition, they need to put a data governance program in place in order to improve data quality awareness in the organization. There are many data quality tools such as Quality Stage, Trillium etc. that help in improving data quality.”

Ramaswamy argued that data quality had to be a continuous worry, rather than an issue addressed by a single, limited project. Establish a data quality firewall, he advised, a process or set of automated controls designed to recognize data quality issues in incoming data and to block low-quality date from entering your data warehouse. Implement a process at the back end for auditing and verifying the data. This includes reconciling the warehouse data back to source systems.

Simha said, “Data quality can be ensured with best practices in EDW like data quality team, disciplined data collection and validation etc.”

Optimizing performance

Sanjay Mehta, CEO, MAIA Intelligence, gave a couple of examples that reveal how performance optimization can mean very different things depending upon the situation. “A lot of manufacturing companies have month-end targets. On the last day of the month they dump stock at the stockist's end. The invoicing is continuous. At that time, the real-time report is getting punched in. They can't wait for an end-of-day report. At that point, you can't say go through a data warehouse. I can wait for a day, create a night job and massage the data.”

There’s always a challenge of bandwidth availability in rural areas. “We can do offline cubing. I need not be connected to a data warehouse but I can still do slicing and dicing. We had a customer doing business in rural areas so we scheduled data and report metadata into two XML files and they can use this to slice and dice. At periodic intervals they would e-mail that data and get two XML files relevant for the sales guy in that region. They can import that data and slice and dice as if they are connected to the data warehouse. The customer doesn't have to create a huge infrastructure for thousands of salespeople,” he added.

Prabhu said, “If you look at a typical query written with a report in mind, in BFSI there are queries that can run for minutes or even hours for that matter. If you look at a typical warehouse, query time is usually measured in seconds or minutes. That's for predefined queries. When a customer starts making ad-hoc queries, that's when they would like to have a huge speed advantage. You can't have an interactive experience unless the response is fast enough. There's a huge effort to bring down query time from minutes to seconds. There are other verticals like capital markets where the query times can't be more than a few milliseconds because the business won't allow that if you are looking at real-time analysis. Gone are the days of overnight reports.”

Shankar said that in the case of executive dashboards, query times would be a few seconds. Power users or analysts would accept results that took minutes to gather although the time taken would have to be less than five minutes. Canned predefined operational reports that could be granular or summarized or both take less than a minute to execute as you cache them and keep them readily available.

Mohile said, “In the old days we used to go back to how the query was structured etc. Today, however, performance optimization is about utilizing the processing power of the infrastructure and doing things efficiently using techniques such as parallel processing or in-memory.”

Dowlaty said, “We try and optimize queries to fit the business needs that they try to resolve. Depending on the size of the data and the complexity of the query, the average time could vary from a few seconds to as much as a few hours.”

Manthan has found that query time is a function of logical and physical data modeling and the underlying technology. “Our solution's average query time is 5 to 15 seconds,” said Jalan explaining that optimization starts from the point where the solution is conceptualized and it becomes a continuous and integral part of entire life of the solution. Data marts help in performance as they focus on a particular subject area and contain a summary of the information.

Gupta said, “Data warehouse schema is optimized for generating reports and additional performance can be achieved by having bit map indexes on the fact tables to optimize ad-hoc queries, partitioning the fact table and maintaining materialized views or summary tables to optimize report generation.”

Ramaswamy categorized the response time by the type of report—operational, informational and analytical. For the first two types of reports, it is important to have a good response time. Optimizing the report response times is usually done through improving the design by building summary, history or report tables or by building cubes. It could also be done by improving the type and scale of hardware, though this option has limitations. Users of analytical reports are more tolerant of the response times and flexibility is more important here.

Simha said, “Average query times range from five seconds for OLAP queries to around 30 minutes (for ad-hoc queries), excepting killer queries which may take hours.”

prashant.rao@expressindia.com

 


Untitled Document
Untitled Document

FEEDBACK: We would love to hear from you -- what you like about our content, what you dont, and even how you think we can improve. Please send your feedback to: prashant.rao@expressindia.com


© Copyright 2001: The Indian Express Limited. All rights reserved throughout the world. This entire site is compiled in Mumbai by the Business Publications Division (BPD) of The Indian Express Limited. Site managed by BPD.