Rate this page del.icio.us  Digg slashdot StumbleUpon

The problem with one-size-fits-all databases

by

An Interview with Database Technology Researcher, Mike Stonebraker

Many organizations are overwhelmed by their increasing amounts of data – and the growing need to give more people more access to it.

Researchers and entrepreneurs are combining a variety of technology advancements to create specialized database engines for applications such as business intelligence, data warehousing, and real-time data analysis. Many of these new engines run on Red Hat Enterprise Linux 5.

For an update, we spoke with Dr. Michael Stonebraker, database technology researcher. Stonebraker was the main architect of the INGRES relational database management system (RDBMS) and the object-relational database management system (DBMS) POSTGRES. Both technologies are still in broad use today. Dr. Stonebraker and his colleagues developed these prototypes at the University of California at Berkeley, where he was a professor of computer science for 25 years. More recently, Dr. Stonebraker was a co-architect of the Aurora stream processing engine as well as the C-Store high-performance read-oriented database engine at MIT.

RHM: You’ve been a vocal proponent of the need for new database architectures and a vocal critic of “one-size-fits-all” databases. What exactly do you mean by “one-size-fits-all?”

MS: By “one-size-fits-all,” I mean 30-year-old database engines that were designed for one purpose – business data processing – but that subsequently have been promoted to other purposes such as business intelligence, data warehousing, and real-time data analysis. Because of the lack of innovation in database technology and the incumbent database vendors’ obvious interest in maintaining the status quo, customers have had to adapt 30-year-old database engines to modern-day needs.
The database engines from the major RDBMS vendors are direct descendents of the early relational prototypes, namely System R and INGRES from the 1970s. These early systems were architected primarily to serve the needs of business data processing, now called on-line transaction processing (OLTP). They also were optimized for the expensive hardware and proprietary operating systems of the time.

But needs have evolved. Organizations want to do things like business intelligence or data warehousing, where lots of people are querying data in an ad hoc manner using SQL, complementing their more traditional update-intensive OLTP applications. And hardware and system software have also evolved, delivering much more capacity and processing power for a lot less money. Red Hat Enterprise Linux is a perfect example of this.

Over time, the RDBMS vendors have grafted features onto their engines to support the new applications. For example: materialized views, bitmap indexes, and data cubes for the data warehouse market. But these vendors are still trying to fit the proverbial square peg into a round hole. One size of database no longer fits all purposes. The fundamental architectures are just wrong.

RHM: The traditional RDBMS has been around for years, and many organizations already have made major commitments. Can they afford to change?

MS: The way I see it, they can’t afford not to change for business intelligence and other query-oriented applications. These are the applications that give their businesses competitive advantage.

RHM: So what’s the alternative?

MS: The alternative is specialized database engines for these new applications. By combining database innovations with advancements in hardware and operating systems, it’s possible to create new database engines that offer dramatic price/performance improvements over one-size-fits-all databases. By “dramatic,” I mean as much as 100X performance advantages over a one-size-fits-all database.

RHM: What kind of innovations?

MS: Things like massive parallelism using grid-based commodity hardware and modern operating systems like Red Hat Enterprise Linux. We are also talking about column-oriented architectures, aggressive use of compression, and new transaction architectures.

A little bit more about each of these four innovations–

With massive parallelism, it’s now possible to assemble massive collections of commodity PCs connected by standard networking technology, replacing expensive, proprietary hardware.

With new transaction architectures, it’s possible to simultaneously load and query data warehouses, without degrading the performance of either task. Warehouses can be loaded continually, not just the nightly load windows common today. This ability is particularly important for Web-based businesses, which run 24×7.

Column-oriented database architectures offer a big advantage. They store data on disk as columns rather than rows, which the RDBMS and more recent appliance vendors use, while still providing an SQL interface to users. In a column-oriented database architecture, each column contains the values from one attribute (for example, employee id or salary). Thus, a typical business-intelligence query that accesses only a few columns from each table need only read those specific columns from disk. As a result, they require much less disk I/O. Making column databases much faster than traditional row-oriented databases.

Aggressive use of data compression lets allows you to capitalize on Moore’s Law. CPU technology is increasing in performance at a faster rate than disk bandwidth. So, it makes sense to use aggressive compression and execute queries against compressed data, which overcomes the bandwidth limitations of current disks.

Another significant development is on-the-wire SQL for real-time data stream and event analysis. New versions of SQL with time windows and advanced messaging architectures can perform aggregate analysis on real-time information, such as feeds of financial trade and quote data, as efficiently as traditional SQL databases can query stored data. This is an active research area at several universities.

This approach is useful for financial institutions that want to analyze pricing “tick” data – as well as for multi-player game sites or advertising-driven Web sites that make money through transactions.

RHM: So, what role is Linux playing in this database renaissance?

MS: Linux plays an essential role in delivering great price/performance from these new engines. Let me give you an example from Vertica.

We had a customer, a telecommunications company, with a data warehousing problem. The customer had a 1-terabyte data warehouse running on a system from one of the proprietary data-warehousing appliance vendors. The customer was using the data warehouse to run queries on what’s known as call-detail records, and was unhappy with the performance. So the customer gave us his data, his schema, and a set of the most popular queries. We implemented his schema, loaded his data and ran the exact workload on the Vertica Database. His queries ran 25x faster.

But here’s what’s even better. The Vertica test required only a $2,500 server running Red Hat Enterprise Linux 4. Compare that to the cost of the 28-node proprietary data warehouse appliance we ran against and you’ve net huge cost savings.

RHM: What does all of this mean for the Red Hat community?

MS: It means that Red Hat Enterprise Linux users and developers have some powerful – and defensible – reasons for using Enterprise Linux as a database platform. In building the Vertica database engine, Linux was a strategic choice for us. Linux gave us both the modern operating-system platform and the price advantage we needed to deliver game-changing price/performance advantages.

RHM: Where can people learn more?

MS: You can hear more on this topic by joining a free webinar, The End of One-Size-Fits-All Databases, on Thursday, April 19, 2007, at 11:30 AM ET. Joel Berman, Product Management Director for Red Hat Enterprise Linux, will co-host.

4 responses to “The problem with one-size-fits-all databases”

  1. Rams says:

    That link to the “The End of One-Size-Fits-All-Databases” doesn’t work.

    This has been fixed. Thanks for the heads’ up. — ed.

  2. Kurt says:

    The following link is correct.

    http://www.bulldogsolutions.net/Vertica/VCA04192007/frmRegistration.aspx?bdls=9918

  3. Julie Bryce says:

    This link has been corrected: thanks.
    http://www.bulldogsolutions.net/Vertica/VCA04192007/frmRegistration.aspx?bdls=9918

  4. Ileana Somesan says:

    Column oriented DBMS seem indeed great.
    The only problem I have is the storage paradigm. Column stores or vertical partitioning – it’s quite the same thing!