Wednesday, April 1, 2009
A Data Warehousing Primer for Database Designers - Part II
The simple example above models a retail business that operates stores in a number of different markets. Each row in the fact table (Sale Fact) contains information about the sale of an item (UnitsSold and UnitPrice). Notice that the fact table is inherently normalized. You should also notice that the facts are additive along all dimensions, i.e. no matter which dimension we choose to group the sales facts by, we can sum either UnitsSold or UnitPrice and the results make sense. Semiadditive facts can only be summed in certain dimensions. An example is a monthly account balance fact, which is not additive along the time dimension (try adding up the ending monthly balance of your checking account for a year and see if the resultant number is meaningful!).
Another consideration in the schema is the grain of the fact table. We might choose to have each row in the fact table represent an individual consumer transaction, daily, weekly or monthly sales summaries. Clearly, the finer the grain of the table, the larger the warehouse will be. The business analysis which (hopefully) preceded the database design will point out how fine the granularity must be in order to meet the functional requirements of the system.
We also need to decide on the metrics we wish to apply to our facts. These are the arithmetic operations we need to perform to produce the numeric results the users of the system need. Common metrics include SUM, COUNT, AVERAGE, and STD_DEV. Note that the some metrics, such as COUNT and AVERAGE, don’t necessarily correspond to their typical SQL database equivalents. Metrics which need a count of items usually require the number of unique values of a dimension attribute, not the count of rows returned. Depending on the tools used to implement the database queries, you may need to address this common shortcoming of relational databases.
The dimension tables store the descriptions or attributes of our business. To perform a query, we typically select the dimension rows which meet our desired selection criteria (e.g., all coats smaller than size 10 sold in the Northeast during 1996). The intersections (i.e. Cartesian product) of the rows selected from the dimension tables with the fact table form the result set. As a database designer, you should recognize that this typical query as the result of a many-to-many join of the dimension and fact tables.
The dimension tables in a pure star schema, as illustrated above, are not normalized. As a designer, you may have to fight the temptation to normalize these tables, turning the star into a snowflake schema. You will have to consider a number of things in making that decision. First of all, a number of very good OLAP tools on the market today (MicroStrategy’s DSSAgent is one) essentially force you to use the snowflake approach. Another consideration is how insulated the end users are from the schema design. Although a normalized database is a thing of beauty to a systems designer, it is intimidating to the average user. Denormalized dimension tables lend themselves very naturally to browsing using simple SQL SELECT DISTINCT queries. Since a data warehouse does not require individual records to be changed on a frequent basis, the update performance justification for normalization is not present. Finally, any rationalization based on table size is probably ill-considered. In the typical warehouse, we have fact tables with tens or hundreds of millions of rows and dimension tables with only hundreds or thousands. There are probably no space savings benefits that justify the added complexity.
Transform tables (not illustrated above) are used to join dimension tables back on themselves. They allow us to model useful concepts such as this year vs. last year (1:1) or year-to-date (1:M). Some OLAP tools allow you to create dimensional metrics which make it easy to form query templates that incorporate these concepts.
Subscribe to:
Post Comments (Atom)


No comments:
Post a Comment