|
DW: the devils in the details
Should you go with an appliance or take a software-based
approach? Who needs real-time performance? How do you go about optimizing performance?
Prashant L. Rao talks to the experts and gets the answers
In the late 1990s there used to be custom data warehouses. In the early 2000s,
you had the concept of an appliance with server, storage and database software
rolled into one. These proprietary systems were promoted by several small organizations
including Netezza. Organizations could buy these in their entirety or not at
all. Today, appliance vendors have gone back to selling only the software.
Netezza, Greenplum etc. are all selling software that runs on commodity hardware,
said Vinod Shankar - Head, Business Intelligence, activecubes.
For
organizations facing a time crunch, it would make sense to go for an appliance.
If an organization has time on its hands and its business requirements are going
to be known as we move from one roll out to another and the data volumes are
not that high and will be large only in five or six years, then the software-based
approach would be suitable. Vendors are providing specialized database software
suited for data warehousing, said Shankar.
Atul Jalan, Managing Director & CEO, Manthan Systems had an interesting
take on this. Appliances enables the customers to deal with one vendor
for support and offer a high performance data warehouse. Whereas, in the case
of software, the customer has to deal with separate hardware, software and implementation
vendors, he said.
It is a fact that total cost of data warehousing is more
than the cost of appliances and that one of the largest cost contributors to
a data warehousing project is that of source system integration. Any data
warehouse appliance does not reduce this cost. Customers should not opt for
appliances from a RoI perspective but rather they should opt for these from
the point of view of high performance as most appliances are fine tuned to cater
to high volumes of data with high performance, he added.
When it comes to appliances, Teradata and Netezza are popular. Oracle's
database has always been popular but the company's appliance has yet to take
off. For ETL, Datastage and Informatica are popular options. In reporting, Business
Objects has been on the top of the list. In the last two to three years, Oracle's
reporting applications have started catching up, said Sandeep Mohile,
Information Management Head, Torry Harris Business Solutions (P) Ltd.
Zubin Dowlaty, Head of Innovation & Development, Mu Sigma, said, There
seems to be a trend towards appliance installations due to the rapid deployment.
I suspect that this is a growing trend to get data solutions operational as
soon as possible. Software generally will take longer in isolation versus an
appliance.
Amit Gupta, Associate Architect, BI & Analytics Competency, Persistent Systems,
said, Enterprises are gradually moving towards appliance-based tools to
minimize the overheads of maintaining and buying individual software and the
integration hassles of software. However, many companies still prefer to purchase
individual software licenses to have flexibility to fine tune BI and ETL features
in order to avoid being completely dependent on an appliance vendor.
Ajoy Kumar, Director and Head of Data Warehousing & Business Intelligence
Practice, Virtusa Corporation, said, We see both appliances and software
being deployed in large organizations. The high cost of data warehousing appliances
makes them more relevant for large data warehouses that require high performance
and very stringent service levels. If you prefer to start small, a software-based
data warehouse would be the better choice.
Rajesh Ramaswamy, Global Head Business Intelligence Practice, Marlabs Inc.,
said, Appliances are typically architected in an MPP format as opposed
to the SMP architecture of traditional databases. They are typically used when
the data warehouse sizes are enormous and when they are expected to grow rapidly.
Cost per terabyte of appliances is significantly lower that that of traditional
databases. A typical example of an organization that uses an appliance could
be a utilities organization taking smart meter readings from a million customers
every 15 minutes.
Storage: SATA or SSD
Sudesh Prabhu, Director Presales and Services, Sybase, commented, Most
organizations that have implemented a data warehouse would have SAN storage
with it. Unless it's where someone's just using BI over OLAP. Then it may be
between SATA or SSD. Most of our customers use a SAN.
Shankar said, Data warehousing is about large volumes of data. The lack
of capacity on SSDs is a constraint. As they start offering better price/performance,
we will see usage grow. Netezza etc. are already offering some degree of SSD.
In three to four years, the usage could grow.
Typically, data in a warehouse is stored on a SAN environment with FC
or SAS (SATA will give more GB per $, but it is slow and should not be used
for databases, operating systems or any other application that is constantly
writing to the disks especially in a data warehouse), said Dowlaty.
SSD provides high IO throughput for scattered reads,
which are generated to answer ad-hoc queries and OLAP cubes. Whereas, SATA is
a cost-effective solution that performs almost equally well with parallel RAID
configuration for report generation and analytics queries that perform sequential
data scan. Since most of the warehouse use cases include periodic report generation
and dashboards, SATA is a preferred option, said Gupta.
Kumar said, SSDs are becoming increasingly popular for data warehouses.
Increasingly, we see customers who build a new data warehouse using SSD for
power consumption efficiency and performance.
Ramaswamy felt that automated storage tiering (AST) was the best option until
SSD prices came down. SSD has a lot of promise thanks to its high performance.
Nevertheless, commercial adoption hasnt picked up as capacities remain
relatively low and costs high. Using SSD by itself or in large amounts is not
a scalable model especially when data is growing exponentially. Hence, AST is
more suitable for todays environments. SSD remains relevant for areas
where performance is critical and cost is not a consideration but in all other
scenarios AST (encompassing SSD/FC/SAS) will prevail, he said.
Dr. Jay B. Simha, CTO, ABIBA Systems, said, Currently major vendors are
providing the built-in storage for the data warehouse with SAS disks, although
corporations favor SAN storage.
Real-time data warehousing
Prabhu pointed out, With social networks and third party analytical data
coming in, it doesn't allow you to create a warehouse or have a structure around
it and you have to consume the data as it comes and that opens up the opportunity
for a real-time warehouse. The emphasis is on your ability to manage and make
use of that data in as effective a manner as you employ prearranged data. In
capital markets, you see a different kind of real-time warehousenot only
data from Reuters/Bloombergand analysis is conducted in milliseconds.
Other markets are OK with response times of a few seconds.
Shankar analyzed, It is catching on with organizations that have the money
to spend as well as the sources, manpower and skills to help them in complex
implementations such as real-time data warehouses. As a BI consulting provider,
in a BI/DW environment, data must flow at the speed of business. There is a
need for data to be real-time and for it to be automatically pushed to users.
A lot of telcos are moving towards real-time data warehousing. Change Data Capture
(CDC) is the most common type. Organizations have to be cognizant of the fact
that these implementations are very expensive. There is a lot of complexity
in how these solutions are structured. Today vendors will price you as per each
CPU that you have in the source system. We had a quote of $25 million just to
implement CDC for one of our customers. It doesn't employ traditional ETL processes.
In a traditional setup, you have data being extracted from source systems into
a staging area from where it is funneled into a data warehouse. In between source
and staging there's another layer or repository where the changed data from
your source system is brought and published. ETL systems will read data directly
from the staging area that's been fetched from tables in the source database.
In CDC it won't touch the source tables. It will get data from the change logs
and do a mini/micro batch ETL. Every 15 minutes data is pulled from the
change logs and brought to the staging area and the mini-ETL will be run on
it, he explained.
Dowlaty said, This is a strong trend in the USA. With the rise of message
queue systems, social networks, and mobile explosion, real-time processing is
going to accelerate. The biggest hurdle is that the data being queried is a
stream versus a repository like in SQL; therefore the need for tools like Complex
Event Processing (CEP) systems.
Gupta said, A real-time data warehouse is required for many business functions
such as fraud detection, monitoring quality of services and products, reducing
supply chain latency etc. that demand taking decisions in real-time. With explosive
economic growth in markets like India many companies are realizing the need
to take accurate decisions at the right time. Many telecommunication and retail
companies across India have implemented a real-time data warehouse in order
to closely track customer needs and execute targeted sales campaigns.
Kumar said, Customers that are leveraging social media for marketing are
analyzing real-time data.
Ramaswamy said, Many businesses, both large and small, are integrating
real-time software into their overall business intelligence archiving and retrieval
processes. It is important to realize two aspects:
- No data warehouse is fully real-time. Real-time
data warehouses are actually near real-time
- A lot of the information in a data warehouse is
not required in real-time. When architecting the system it is important to
understand what data points need to be reported in real-time and what should
not.
Simha argued that, Real-time data warehousing is not required by everyone.
Mostly retail (integrated with complete supply chain) requires a real-time data
warehouse for pricing. Others may not require such a system, which distracts
from original premise of the data warehouse, which is information on demand
for better decision making.
MPP vs. SMP
Prabhu said, MPP per se is not new. It is being applied
in data warehousing in certain scenarios where there was no other alternative.
If you have over a hundred users and the database is 50 TB plus, then you had
no choice but to go in for MPP. Now, with the processing power of servers growing
multi-fold, SMP is more viable as it allows you to use open systems. You can
use something that you already have and get that same level of performance.
With Sybase IQ we have benchmarks that are at par with MPP benchmarks on more
cost-effective hardware. There are traditional users of MPP who continue to
expand. That technology only allows you to expand horizontally.
Shankar explained, MPP is typically a technology used in appliances. Microsoft,
SAP, Oracle and IBM are saying that MPP is the next generation of data warehouse.
Software appliances (Sybase IQ, Vertica) provide data warehousing functionality
by not being about MPP alone. They provide columnar structures. MPP offers convenience
in implementing a data warehouse by reducing your reliance on how you design
an application. MPP is an appliance approach. Netezza used to have multiple
nodes, each of which had a hard drive with a central controller that's responsible
for getting data from a hundred hard disks. Traditionally when a user fires
a query you go to one hard drive. In MPP, it will send queries to different
hard drives and collate the results.
Dowlaty believed that share nothing designs such as Aster, Hadoop, and Teradata
where one could scale horizontally would win out.
Jalan argued that MPP and databases purpose-built for data warehousing were
already edging out SMP. MPP offers superior scalability of computing power and
throughput; SMP offers the best price/performance.
Gupta said, With the advent of cloud computing and clustered servers,
SMP is becoming less popular. For the multi terabyte data warehouse and extensive
decision support requirements the MPP architecture is a necessity. MPP architecture
being loosely coupled with each processor having its own cache and storage is
a superior alternative that supports almost linear scalability. Although SMP
architecture is raising the price/performance ratio with time, there exists
a transition area where SMP and MPP coexist to serve an up to a terabyte data
warehouse.
Kumar said, MPP is used for large data warehouses. As data warehouses
grow in size and scale, MPP is becoming the preferred option.
Ramaswamy outlined the scenarios in which each approach would shine. For organizations
that deal with huge volumes of data that needs to be collected and processed
an MPP architecture is the only way out. However, only a small percentage of
organizations worldwide today require this kind of architecture. Other than
large utilities, telecom players and very large retail organizations there are
few companies that realistically need an MPP architecture.
Simha felt that MPP coupled with column store databases was likely to dominate
the data warehousing market.
The problem of data quality
Data quality is one of the biggest problems that every data warehousing implementation
faces. Shankar said, We advise our clients to address data quality issues
at the root. Many organizations believe that a data quality solution fits at
the data warehousing/BI layer. That's a reactive way of solving the problem.
We believe that you need to have data quality solutions implemented at the enterprise-wide
level on different systems including ERP/ BI etc. That's when you have a holistic
approach to data quality as an issue. Data quality issues happen when a data
entry operator enters data incorrectly or in the wrong manner.
Dowlaty said, Cleaning up systems in the OLTP system (downstream) helps
the most. For e.g. addressing scrubbing at the point of entry. Then you need
to use Master Data Management principles, solid ETL and data quality tools.
Kumar commented, To improve data quality, organizations need to painstakingly
cleanse the bad data (name and address validation etc.). In addition, they need
to put a data governance program in place in order to improve data quality awareness
in the organization. There are many data quality tools such as Quality Stage,
Trillium etc. that help in improving data quality.
Ramaswamy argued that data quality had to be a continuous worry, rather than
an issue addressed by a single, limited project. Establish a data quality firewall,
he advised, a process or set of automated controls designed to recognize data
quality issues in incoming data and to block low-quality date from entering
your data warehouse. Implement a process at the back end for auditing and verifying
the data. This includes reconciling the warehouse data back to source systems.
Simha said, Data quality can be ensured with best practices in EDW like
data quality team, disciplined data collection and validation etc.
Optimizing performance
Sanjay Mehta, CEO, MAIA Intelligence, gave a couple of examples that reveal
how performance optimization can mean very different things depending upon the
situation. A lot of manufacturing companies have month-end targets. On
the last day of the month they dump stock at the stockist's end. The invoicing
is continuous. At that time, the real-time report is getting punched in. They
can't wait for an end-of-day report. At that point, you can't say go through
a data warehouse. I can wait for a day, create a night job and massage the data.
Theres always a challenge of bandwidth availability in rural areas. We
can do offline cubing. I need not be connected to a data warehouse but I can
still do slicing and dicing. We had a customer doing business in rural areas
so we scheduled data and report metadata into two XML files and they can use
this to slice and dice. At periodic intervals they would e-mail that data and
get two XML files relevant for the sales guy in that region. They can import
that data and slice and dice as if they are connected to the data warehouse.
The customer doesn't have to create a huge infrastructure for thousands of salespeople,
he added.
Prabhu said, If you look at a typical query written with a report in mind,
in BFSI there are queries that can run for minutes or even hours for that matter.
If you look at a typical warehouse, query time is usually measured in seconds
or minutes. That's for predefined queries. When a customer starts making ad-hoc
queries, that's when they would like to have a huge speed advantage. You can't
have an interactive experience unless the response is fast enough. There's a
huge effort to bring down query time from minutes to seconds. There are other
verticals like capital markets where the query times can't be more than a few
milliseconds because the business won't allow that if you are looking at real-time
analysis. Gone are the days of overnight reports.
Shankar said that in the case of executive dashboards, query times would be
a few seconds. Power users or analysts would accept results that took minutes
to gather although the time taken would have to be less than five minutes. Canned
predefined operational reports that could be granular or summarized or both
take less than a minute to execute as you cache them and keep them readily available.
Mohile said, In the old days we used to go back to how the query was structured
etc. Today, however, performance optimization is about utilizing the processing
power of the infrastructure and doing things efficiently using techniques such
as parallel processing or in-memory.
Dowlaty said, We try and optimize queries to fit the business needs that
they try to resolve. Depending on the size of the data and the complexity of
the query, the average time could vary from a few seconds to as much as a few
hours.
Manthan has found that query time is a function of logical and physical data
modeling and the underlying technology. Our solution's average query time
is 5 to 15 seconds, said Jalan explaining that optimization starts from
the point where the solution is conceptualized and it becomes a continuous and
integral part of entire life of the solution. Data marts help in performance
as they focus on a particular subject area and contain a summary of the information.
Gupta said, Data warehouse schema is optimized for generating reports
and additional performance can be achieved by having bit map indexes on the
fact tables to optimize ad-hoc queries, partitioning the fact table and maintaining
materialized views or summary tables to optimize report generation.
Ramaswamy categorized the response time by the type of reportoperational,
informational and analytical. For the first two types of reports, it is important
to have a good response time. Optimizing the report response times is usually
done through improving the design by building summary, history or report tables
or by building cubes. It could also be done by improving the type and scale
of hardware, though this option has limitations. Users of analytical reports
are more tolerant of the response times and flexibility is more important here.
Simha said, Average query times range from five seconds for OLAP queries
to around 30 minutes (for ad-hoc queries), excepting killer queries which may
take hours.
prashant.rao@expressindia.com
|