A powerful tool is the technique of drill-down and drill-up. As an example, envision a query that displays total sales by region by month, this year vs. last year. A segment of our SQL query will look, in part, like this: SELECT SUM(UnitSales), REGION … GROUP BY REGION. In analyzing the results, the user notices that sales in the Northeast are off 20% from the same period last year. By re-querying the database and adding attributes from the store dimension (e.g. SELECT SUM(UnitSales), REGION, METROMARKET … GROUP BY REGION, METROMARKET), the user might discover that the drop is confined to a single metropolitan area. From a design standpoint, we can see that drill-down and drill-up are simply the process of adding or subtracting row headers in our query results.
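The drill-down above can be sketched end to end. The table, columns, and data below are invented for illustration, with SQLite standing in for the warehouse:

```python
import sqlite3

# Illustrative fact table; names and figures are invented for this sketch.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (region TEXT, metro_market TEXT, unit_sales INTEGER)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    ("Northeast", "Boston",   100),
    ("Northeast", "New York",  40),   # the underperforming market
    ("Southeast", "Atlanta",  120),
])

# One row header: totals by region.
by_region = con.execute(
    "SELECT region, SUM(unit_sales) FROM sales_fact "
    "GROUP BY region ORDER BY region"
).fetchall()

# Drill-down: the same query with one more row header (metro_market).
by_metro = con.execute(
    "SELECT region, metro_market, SUM(unit_sales) FROM sales_fact "
    "GROUP BY region, metro_market ORDER BY region, metro_market"
).fetchall()

print(by_region)  # [('Northeast', 140), ('Southeast', 120)]
print(by_metro)
```

The only change between the two queries is the extra column in the SELECT and GROUP BY lists, which is exactly the "adding a row header" operation described above.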
One way to achieve significant query performance gains is through aggregation. This involves examining the way facts are commonly reported and adding tables, updated on a periodic basis, that pre-summarize data along one or more dimensions. For example, if the grain of the fact table in the example above was daily sales totals and we know that reporting by month and quarter is done regularly, we might add fact tables with the grain reduced to monthly and quarterly totals, respectively. More complex aggregations are possible by adding additional dimensions, e.g. region by quarter or product by store by week. The better OLAP query tools automatically use an aggregate table for a query when an appropriate one is available.
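As a sketch of the idea, here is a daily-grain fact table rolled up into a monthly aggregate table by a periodic batch job. The names and data are hypothetical, with SQLite again standing in for the warehouse:

```python
import sqlite3

# Hypothetical daily-grain fact table; all names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_daily (sale_date TEXT, product TEXT, unit_sales INTEGER)")
con.executemany("INSERT INTO sales_daily VALUES (?, ?, ?)", [
    ("2024-01-05", "widget", 10),
    ("2024-01-20", "widget", 15),
    ("2024-02-02", "widget",  7),
])

# Periodic batch job: pre-summarize to a monthly-grain aggregate table.
con.execute("""
    CREATE TABLE sales_monthly AS
    SELECT substr(sale_date, 1, 7) AS sale_month, product,
           SUM(unit_sales) AS unit_sales
    FROM sales_daily
    GROUP BY sale_month, product
""")

# A monthly report now reads the small aggregate table instead of
# re-aggregating the (in real life, far larger) daily table every time.
rows = con.execute(
    "SELECT sale_month, unit_sales FROM sales_monthly ORDER BY sale_month"
).fetchall()
print(rows)  # [('2024-01', 25), ('2024-02', 7)]
```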
A final consideration (final only in the context of this article!) concerns changing dimensions. In the real world, which we expect our database to model, change is constant. Managers change jobs, stores open and close, sales territories shift, product specifications and suppliers change. During the business analysis phase of a project, careful attention must be paid to the requirements for "changing history". Perfectly valid outcomes are: a. we don't care to track old history; b. we segment history between the old and new dimension attribute values; or c. we add new dimension records while keeping the old ones. This is a fairly complex topic, and further elaboration is beyond the scope of this article.
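As a minimal sketch of option (c), here is the approach often called a "type 2" slowly changing dimension: when an attribute changes, the current record is closed out and a new version is appended. All names and dates below are invented for illustration:

```python
from datetime import date

# Illustrative store dimension with surrogate keys and effective-date ranges.
# "to" is None for the record that is currently in effect.
store_dim = [
    {"key": 1, "store_id": "S01", "manager": "Jones",
     "from": date(2023, 1, 1), "to": None},
]

def change_manager(dim, store_id, new_manager, as_of):
    """Expire the store's current record and append a new version of it."""
    current = next(r for r in dim if r["store_id"] == store_id and r["to"] is None)
    current["to"] = as_of  # close out the old history segment, but keep it
    dim.append({"key": max(r["key"] for r in dim) + 1,
                "store_id": store_id, "manager": new_manager,
                "from": as_of, "to": None})

change_manager(store_dim, "S01", "Smith", date(2024, 6, 1))

# Facts loaded before June 2024 still join to Jones (key 1); later facts join
# to Smith (key 2), so reports attribute history to the right manager.
```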
Once you have settled on a logical model for your data, you will need to consider a number of factors in choosing a database product or optimizing your current database manager. There are a number of somewhat esoteric products designed for use in extreme or unusual applications (very large databases, unusual data models or complex numerical analysis) which we won't discuss here. Most datamarts and many warehouses are constructed using conventional relational databases such as Microsoft SQL Server, Sybase, DB2 and Oracle. There are also specialized relational products, such as Red Brick, that are designed specifically for data warehousing. These specialized products offer some significant advantages over more general-purpose database managers.
The typical data warehouse is updated on a periodic basis, such as nightly or weekly. The data is usually extracted from legacy or operational systems, or supplied by outside data syndicators. It is cleaned up (transformed) and loaded into the warehouse. Depending on the application, the new data might simply be appended to the existing data, or the warehouse might be completely emptied and reloaded. In any event, changes to individual records are usually not allowed during normal business hours. This results in some unique advantages.
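The two load styles can be sketched as follows; the table names and figures are invented, with SQLite standing in for the warehouse:

```python
import sqlite3

# Illustrative fact table holding yesterday's data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (sale_date TEXT, units INTEGER)")
con.execute("INSERT INTO fact VALUES ('2024-06-01', 10)")

tonight = [("2024-06-02", 12), ("2024-06-03", 9)]  # tonight's cleaned extract

# Style 1: incremental load -- simply append the new records.
con.executemany("INSERT INTO fact VALUES (?, ?)", tonight)

# Style 2: full refresh -- empty the table and reload everything.
full_extract = [("2024-06-01", 10)] + tonight
con.execute("DELETE FROM fact")
con.executemany("INSERT INTO fact VALUES (?, ?)", full_extract)

count = con.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
print(count)  # 3
```

Either way, the load runs in a batch window, and no user-issued updates touch the table during business hours.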
First, no transaction log is needed or wanted. Since we are making sure that the data is clean before it is loaded into the warehouse, and since we don’t perform update operations in the conventional sense, we have no need to implement transactions with their associated commit and rollback overhead.
Second, data structures that promote fast retrieval, such as the bitmap index, become a viable indexing option in the data warehouse. As you know, indexes are data structures that allow random access to data records based on the values of one or more fields. In the transaction processing world, they have been implemented almost exclusively with the b-tree (a balanced multiway tree, not a binary tree) for as long as I can remember. The b-tree's main advantage is that it allows fast access to any random record while still performing well when a node must be inserted or deleted, as happens when inserting or deleting a database record. Structures such as the bitmap index are much faster at retrieval than the b-tree, but have the disadvantage that, once created, they are not updateable. Unfettered by the need to provide update capability, you can drop the database indexes before every load and rebuild them after the load is complete, expanding your index options.
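A toy bitmap index makes both properties concrete. This is an illustrative sketch, not how any particular product implements it:

```python
# Toy bitmap index: one integer bitmask per distinct column value,
# where bit i is set if row i holds that value.
def build_bitmap_index(column):
    index = {}
    for row_no, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_no)
    return index

regions = ["NE", "SE", "NE", "West", "NE"]
idx = build_bitmap_index(regions)

# Retrieval is a bitmask scan, and combining predicates is a single
# AND/OR of integers -- this is what makes bitmap retrieval so fast.
ne_rows = [i for i in range(len(regions)) if idx["NE"] >> i & 1]
print(ne_rows)  # [0, 2, 4]

# An insert or delete shifts every row number after it, invalidating all the
# bitmaps, so the practical pattern is: drop the index, load, rebuild.
regions.append("SE")
idx = build_bitmap_index(regions)  # rebuilt from scratch after the load
```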
Third, the special purpose databases, such as Red Brick, are extremely fast at loading tables and building indexes, thereby overcoming problems that may result from the strict operational time constraints of data warehouses. A typical application running a nightly update must complete within the time allotted for it, or the system will not be available the next day. In addition to the time it takes to physically load the new records and rebuild the indexes, you might also be constrained by limited access to the systems that feed data to the warehouse. Having 14 hours of warehouse downtime daily in which to load data may not be helpful if the order entry system feeding it doesn’t finish reconciliation until 2 A.M. The special purpose databases like Red Brick respond well to these situations.
Finally, products like Red Brick can also be used to address query optimization problems. From a practical standpoint, all of the current generation of conventional RDBMSs are somewhat eccentric at processing queries against a star schema database. The reasons are many. Often, the query processor will decide that the constraints on some of the dimension tables are too complex or too weak to be selective. It will then process the fact table first, testing the results against each of the remaining dimension tables. This defeats the whole indexing scheme, as the optimizer unwittingly forces a sequential scan of the biggest table in the database in order to select records in the smallest! Unfortunately, the tools the database administrator has to combat the problem are indirect. They are usually limited to fiddling with the database statistics, overwriting the system's actual numbers with values intended to "fool" the optimizer into thinking a particular query strategy is acceptable. The deeper problem is that the relationships between data elements in a data warehouse are far more deterministic than those in a conventional OLTP system. We don't need the query processor to "discover" those relationships for us; we already know what they are!
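The "dimensions first" strategy argued for here can be sketched in miniature, with plain Python dicts and lists standing in for tables (all names and figures invented):

```python
# Small dimension "tables", keyed by surrogate key.
store_dim = {1: {"region": "Northeast"}, 2: {"region": "Southeast"}}
time_dim = {10: {"month": "2024-01"}, 11: {"month": "2024-02"}}

# The fact "table" -- by far the largest -- as (store_key, time_key, units).
fact = [(1, 10, 100), (1, 11, 40), (2, 10, 120), (2, 11, 90)]

# Step 1: evaluate the query constraints against the small dimensions first,
# producing sets of qualifying surrogate keys.
store_keys = {k for k, r in store_dim.items() if r["region"] == "Northeast"}
time_keys = {k for k, r in time_dim.items() if r["month"] == "2024-01"}

# Step 2: touch the big fact table once, keeping only rows whose keys qualify.
# A fact-table-first plan would instead scan fact and probe the dimensions
# for every row -- exactly the behavior the article complains about.
total = sum(units for s, t, units in fact
            if s in store_keys and t in time_keys)
print(total)  # 100
```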
Source article: Data Warehouse Primer Page
