Over the last several years, data warehousing has emerged as a distinct discipline in the information technology world. As a database designer, you may wonder what, if anything, is different about this approach to systems development. This article outlines some of the fundamentals of data warehousing and suggests, in broad terms, where this technology might be applicable.
For the purposes of this article, we’ll look at data warehousing’s logical and physical design criteria. The considerations for logical design will focus primarily on the database schema. The physical design discussion will zero in on the unique aspects of a data warehouse and why, for certain applications, some database products are better than others. Note that I’ll be discussing only the most common solutions and products in this arena. This article is not intended to be comprehensive and there are many business problems and products in the marketplace which do not fit into this article’s framework.
Relational database usage falls primarily into two main categories: transaction processing systems (OLTP) and analytical systems (OLAP). When E.F. Codd laid out his rules for OLTP relational database design in the 1980s, he also set forth a similar set of rules for OLAP. His primary point in creating this distinction is still valid: both approaches to database design are equally sound in theory. Thus, your selection of one design or the other should be determined by the functional requirements of the system.
Transaction processing systems are used by businesses to record and view information in real time. Such systems are typically optimized to support a specific set of business processes. Although historical data is available in such systems, the main design goal is to support the needs of the people who run the business operations. For example, though an order entry system can undoubtedly provide the marketing department with useful historical information about what customers are buying at various times of the year, the primary goal in designing the system is to meet the needs of users of current business information. The design is defined by sales' need to track the order, manufacturing's need to build it, logistics' need to ship it, and finance's requirements for invoicing the customer. Once the customer pays the bill, the need for business operations people to access those discrete records diminishes.
When building such a system, the database designer will almost certainly encounter performance constraints. How much data does the system need to retain, and for how long? How many transactions must it support in a given timeframe? How shall business rules and relational integrity be implemented and enforced? When all is said and done, the end result will probably look a lot like a textbook third-normal-form schema. Since writes to the database occur roughly as often as reads, care is taken to ensure that key fields are not duplicated, thus allowing efficient updates to data.
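As a concrete illustration, a minimal third-normal-form fragment of such an order-entry schema might look like the sketch below (the table and column names are hypothetical, not taken from any particular system). Each fact lives in exactly one table, and tables reference one another by key rather than duplicating fields:

```python
import sqlite3

# A tiny, hypothetical 3NF fragment of an order-entry schema.
# Customer attributes live only in customer; orders reference the
# customer by key instead of repeating the name and address.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL
);
CREATE TABLE order_detail (
    order_id    INTEGER NOT NULL REFERENCES "order"(order_id),
    line_no     INTEGER NOT NULL,
    product_id  INTEGER NOT NULL,
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_no)
);
""")
```

An update to a customer's name touches a single row, which is exactly the property the normalized design buys you.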
Query processing is typically done on a single transactional unit of data. For example, to look at a single order the database designer might create a query that pulls information from the following tables: customer, address, order, order detail, product and price. Although this joins a number of individual tables, the very nature of the query constrains the order table to a single row, which in turn constrains customer and address to a single row each. Order detail will contribute a manageable number of rows, and the remaining tables are limited to a number of rows equal to the number of order details. So, for an order of 6 items, our query needs to retrieve just 21 rows from 6 separate tables.
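The row-count arithmetic can be checked directly (assuming, as above, one customer and one address row per order, and one product and one price row per line item):

```python
line_items = 6  # items on the example order

rows_touched = (
    1             # order
    + 1           # customer
    + 1           # address
    + line_items  # order detail
    + line_items  # product
    + line_items  # price
)
print(rows_touched)  # 21 rows across 6 tables
```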
Now imagine a query against the same database to report sales by department by product by fiscal quarter over the last two years. Assume that we're a mail-order apparel firm that grosses $200M annually with an average order size of 6 items and an average price per item of $25.00. This query will need to traverse over 50 million rows to fetch the raw data needed for the report. A third-normal-form database schema may not perform very well in processing this query. To compound the problem, what if there is no algorithmic method of determining which dates belong in a particular fiscal year, or if your queries need to incorporate the concepts of this year vs. last year, or quarter-to-date? What if the company was reorganized recently and you need to be able to report last year's data as if it reflected the reorganization and this year's data as if the reorganization had not taken place? Chances are, these concepts are beyond the scope of the line-of-business requirements. Enter the data warehouse or datamart (think of a datamart as a data warehouse that covers a single subject or operational area, such as sales).
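One plausible way to arrive at a figure of this magnitude is sketched below. The exact accounting is an assumption on my part, but under the stated revenue and order-size numbers, counting every row the join must visit lands comfortably past 50 million:

```python
annual_gross    = 200_000_000  # $200M gross revenue per year
avg_item_price  = 25           # average price per item
items_per_order = 6            # average order size

items_per_year = annual_gross // avg_item_price   # 8,000,000 line items/year
detail_rows    = 2 * items_per_year               # 16,000,000 over two years
order_rows     = detail_rows // items_per_order   # ~2.7M orders over two years

# Each detail row drags in a matching product and price row; each order
# drags in its order, customer and address rows.
total_rows = detail_rows * 3 + order_rows * 3
print(f"{total_rows:,}")  # roughly 56 million rows
```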
The heart of the data warehouse is the dimensional (a.k.a. star or snowflake) schema: a central fact table surrounded by dimension and transform tables. Think of the fact table as the business process you are measuring, such as a sale or shipment. The dimension tables contain the values and their attributes used to constrain the records in the fact table, as well as to group and sort the results.
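A toy sketch of the idea in Python (the table contents are invented for illustration): the fact table holds the measure keyed by dimension IDs, and the dimension tables supply the attributes used to filter, group, and sort.

```python
from collections import defaultdict

# Hypothetical dimension tables: id -> descriptive attributes.
product_dim = {1: {"name": "shirt", "dept": "apparel"},
               2: {"name": "mug",   "dept": "housewares"}}
date_dim    = {101: {"fiscal_quarter": "FY24-Q1"},
               102: {"fiscal_quarter": "FY24-Q2"}}

# Fact table: one row per sale, a measure plus dimension keys.
sales_fact = [
    {"product_id": 1, "date_id": 101, "amount": 50.0},
    {"product_id": 1, "date_id": 102, "amount": 25.0},
    {"product_id": 2, "date_id": 101, "amount": 10.0},
]

# "Sales by department by fiscal quarter": join each fact row to its
# dimensions, then aggregate the measure over the grouping attributes.
totals = defaultdict(float)
for row in sales_fact:
    dept    = product_dim[row["product_id"]]["dept"]
    quarter = date_dim[row["date_id"]]["fiscal_quarter"]
    totals[(dept, quarter)] += row["amount"]

print(dict(totals))
```

Note that awkward business concepts from earlier, such as an irregular fiscal calendar or a pre- and post-reorganization view of departments, can be handled simply by adding or versioning attributes in the date and product dimensions, without touching the fact table.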
