Our definition is designed to be short and easy to remember — 12 rules or 18 features are far too many for most people to carry in their heads; we are pleased that we were able to summarize the OLAP definition in just five key words: Fast Analysis of Shared Multidimensional Information — or, FASMI for short.
FAST means that the system is targeted to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds. Independent research in the Netherlands has shown that end-users assume that a process has failed if results are not received within 30 seconds, and they are apt to hit Ctrl+Alt+Delete unless the system warns them that the report will take longer. Even if they have been warned that it will take significantly longer, users are likely to get distracted and lose their train of thought, so the quality of analysis suffers. This speed is not easy to achieve with large amounts of data, particularly if on-the-fly and ad hoc calculations are required. Vendors resort to a wide variety of techniques to achieve this goal, including specialized forms of data storage, extensive pre-calculations and specific hardware requirements, but we do not think any products are yet fully optimized, so we expect this to be an area of developing technology. In particular, the full pre-calculation approach fails with very large, sparse applications because the databases simply get too large (the database explosion problem), whereas doing everything on-the-fly is much too slow with large databases, even if exotic hardware is used. Even though it may seem miraculous at first if reports that previously took days now take only minutes, users soon tire of waiting, and the project will be much less successful than if it had delivered a near instantaneous response, even at the cost of less detailed analysis. The OLAP Survey has found that slow query response is consistently the most often-cited technical problem with OLAP products, so too many deployments are clearly still failing to pass this test.
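To see why full pre-calculation explodes for sparse applications, it helps to run the arithmetic. The sketch below is purely illustrative: the dimension sizes and input-cell count are invented, and it models only the simplest case where each dimension adds a single "All" aggregate member.

```python
# Illustrative arithmetic only (hypothetical figures): the database explosion
# problem. If each dimension gains one "All" aggregate member, the fully
# pre-calculated space is the product of (members + 1) per dimension, which
# dwarfs the sparse input data actually present.
from math import prod

dim_sizes = [1000, 200, 50, 12]   # e.g. products, customers, stores, months
input_cells = 2_000_000           # actual (sparse) input data points

base_space = prod(dim_sizes)                    # all possible base-level cells
precalc_space = prod(n + 1 for n in dim_sizes)  # base cells plus every aggregate

print(f"base cells possible:   {base_space:,}")      # 120,000,000
print(f"cells with aggregates: {precalc_space:,}")   # 133,396,263
print(f"sparsity of input:     {input_cells / base_space:.2%}")
```

With under 2% of the base cells populated, pre-calculating every aggregate means storing tens of millions of cells to serve two million real data points, and real hierarchies make the ratio far worse.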
ANALYSIS means that the system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keep it easy enough for the target user. Although some pre-programming may be needed, we do not think it acceptable if all application definitions have to be done using a professional 4GL. It is certainly necessary to allow the user to define new ad hoc calculations as part of the analysis and to report on the data in any desired way, without having to program, so we exclude products (like Oracle Discoverer) that do not allow adequate end-user oriented calculation flexibility. We do not mind whether this analysis is done in the vendor's own tools or in a linked external product such as a spreadsheet, simply that all the required analysis functionality be provided in an intuitive manner for the target users. This could include specific features like time series analysis, cost allocations, currency translation, goal seeking, ad hoc multidimensional structural changes, non-procedural modeling, exception alerting, data mining and other application dependent features. These capabilities differ widely between products, depending on their target markets.
SHARED means that the system implements all the security requirements for confidentiality (possibly down to cell level) and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the growing number that do, the system should be able to handle multiple updates in a timely, secure manner. This is a major area of weakness in many OLAP products, which tend to assume that all OLAP applications will be read-only, with simplistic security controls. Even products with multi-user read-write often have crude security models; an example is Microsoft OLAP Services.
MULTIDIMENSIONAL is our key requirement. If we had to pick a one-word definition of OLAP, this is it. The system must provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies, as this is certainly the most logical way to analyze businesses and organizations. We are not setting up a specific minimum number of dimensions that must be handled as it is too application dependent and most products seem to have enough for their target markets. Again, we do not specify what underlying database technology should be used providing that the user gets a truly multidimensional conceptual view.
INFORMATION is all of the data and derived information needed, wherever it is and however much is relevant for the application. We are measuring the capacity of various products in terms of how much input data they can handle, not how many Gigabytes they take to store it. The capacities of the products differ greatly — the largest OLAP products can hold at least a thousand times as much data as the smallest. There are many considerations here, including data duplication, RAM required, disk space utilization, performance, integration with data warehouses and the like.
Source article: The FASMI test
Friday, April 3, 2009
Thursday, April 2, 2009
What is OLAP?
An analysis of what the often misused OLAP term is supposed to mean
The term, of course, stands for "On-Line Analytical Processing". But that is not only not a definition, it’s not even a clear description of what OLAP means. It certainly gives no indication of why you would want to use an OLAP tool, or even what an OLAP tool actually does. And it gives you no help in deciding whether a product is an OLAP tool or not.
We hit this problem as soon as we started researching The OLAP Report in late 1994, as we needed to decide which products fell into the category. Deciding what is an OLAP product has not become any easier since then, as more and more vendors claim to have ‘OLAP compliant’ products, whatever that may mean (often they don’t even know). It is not possible to rely on vendors’ own descriptions, and membership of the long-defunct OLAP Council was not a reliable indicator of whether or not a company produced OLAP products: several significant OLAP vendors were never members or resigned, and several members were not OLAP vendors. Membership of the instantly moribund replacement, the Analytical Solutions Forum, was even less of a guide, as it was intended to include non-OLAP vendors.
The Codd rules also turned out to be an unsuitable way of detecting ‘OLAP compliance’, so we were forced to create our own definition. It had to be simple, memorable and product-independent, and the resulting definition is the ‘FASMI’ test. The key thing that all OLAP products have in common is multidimensionality, but that is not the only requirement for an OLAP product.
Source article: What is OLAP?
Wednesday, April 1, 2009
A Data Warehousing Primer for Database Designers - Part III
A powerful technique is drill-down and drill-up. As an example, envision a query that displays total sales by region by month, this year vs. last year. Part of our SQL query will look like this: SELECT SUM(UnitSales), REGION … GROUP BY REGION. In analyzing the results, the user notices that sales in the Northeast are off 20% from the same period last year. By re-querying the database and adding attributes from the store dimension (e.g. SELECT SUM(UnitSales), REGION, METROMARKET … GROUP BY REGION, METROMARKET), the user might discover that the reason for the change is that sales are off in a single metropolitan area. From a design standpoint, we can see that drill-down and drill-up are simply the process of adding or subtracting row headers in our query results.
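The drill-down above can be run end to end. The sketch below uses an in-memory SQLite database with made-up rows; the table and column names follow the article's example, and ORDER BY clauses are added so the output order is deterministic.

```python
# Runnable sketch of drill-down: re-query with an extra row header
# (METROMARKET) to localize a problem first seen at the region level.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sale_fact (region TEXT, metromarket TEXT, unitsales INTEGER);
    INSERT INTO sale_fact VALUES
        ('Northeast', 'Boston',   100),
        ('Northeast', 'New York',  40),
        ('Southeast', 'Atlanta',  120);
""")

# Summary level: total sales by region.
by_region = con.execute(
    "SELECT region, SUM(unitsales) FROM sale_fact "
    "GROUP BY region ORDER BY region").fetchall()

# Drill-down: add METROMARKET as an extra row header.
by_market = con.execute(
    "SELECT region, metromarket, SUM(unitsales) FROM sale_fact "
    "GROUP BY region, metromarket ORDER BY region, metromarket").fetchall()

print(by_region)   # [('Northeast', 140), ('Southeast', 120)]
print(by_market)   # the Northeast shortfall resolves to individual markets
```

Nothing about the schema changed between the two queries; drill-down is purely a matter of which attributes appear as row headers.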
One way to extract significant query performance is through aggregations. This involves examining the way in which facts are commonly reported and adding tables, which are updated on a periodic basis, that pre-summarize data along one or more dimensions. For example, if the grain of our fact table in the example above was daily sales totals and we know that reporting by month and quarter is done regularly, we might add additional fact tables with the grain reduced to monthly and quarterly totals, respectively. More complex aggregations are possible by adding additional dimensions, e.g. region by quarter or product by store by week. The better OLAP query tools understand automatically to use an aggregate table for a query if an appropriate one is available.
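A monthly aggregate built from a daily-grain fact table might look like the following hypothetical sketch (all names are invented, and a real warehouse would rebuild the aggregate on a schedule rather than inline):

```python
# Sketch of a pre-summarized fact table at monthly grain, derived from a
# daily-grain fact table as described above.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE daily_sales (saledate TEXT, product TEXT, unitsales INTEGER);
    INSERT INTO daily_sales VALUES
        ('1999-01-05', 'coat', 10),
        ('1999-01-20', 'coat', 15),
        ('1999-02-02', 'coat',  7);

    -- Aggregate fact table; a periodic batch job would refresh it.
    CREATE TABLE monthly_sales AS
        SELECT substr(saledate, 1, 7) AS salemonth,
               product,
               SUM(unitsales) AS unitsales
        FROM daily_sales
        GROUP BY salemonth, product;
""")

# Monthly reports now read the small aggregate instead of the daily fact
# table; the better OLAP tools make this substitution automatically.
rows = con.execute(
    "SELECT salemonth, unitsales FROM monthly_sales ORDER BY salemonth").fetchall()
print(rows)   # [('1999-01', 25), ('1999-02', 7)]
```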
A final consideration (final only in the context of this article!) concerns changing dimensions. In the real world, which we expect our database to model, change is constant. Managers change jobs, stores open and close, sales territories shift, product specifications and their suppliers change. During the business analysis phase of a project, careful attention must be paid to the requirements for "changing history". Perfectly valid outcomes are: a. we don’t care to track old history; b. segment history between changed dimension attributes; or c. add new dimension records while keeping the old. This is a fairly complex topic, and further elaboration is beyond the scope of this article.
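Option (c) deserves a quick illustration. The sketch below is one guessed-at implementation, not the article's: a surrogate key gives each version of a store its own dimension row, and the column names (`store_key`, `is_current`) are invented for the example.

```python
# Minimal sketch of "add new dimension records while keeping the old":
# a store's territory shifts, so it gets a second dimension row.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE store_dim (
        store_key  INTEGER PRIMARY KEY,  -- surrogate key, one per version
        store_id   TEXT,                 -- the business's own store number
        region     TEXT,
        is_current INTEGER               -- 1 = version in effect today
    );
    INSERT INTO store_dim VALUES (1, 'S-042', 'Northeast', 0);
    INSERT INTO store_dim VALUES (2, 'S-042', 'Mid-Atlantic', 1);

    CREATE TABLE sale_fact (store_key INTEGER, unitsales INTEGER);
    INSERT INTO sale_fact VALUES (1, 500);  -- old sales keep the old region
    INSERT INTO sale_fact VALUES (2, 300);  -- new sales get the new region
""")

# History segments naturally: each fact row joins to the version of the
# store that was in effect when the sale occurred.
rows = con.execute("""
    SELECT d.region, SUM(f.unitsales)
    FROM sale_fact f JOIN store_dim d ON f.store_key = d.store_key
    GROUP BY d.region ORDER BY d.region
""").fetchall()
print(rows)   # [('Mid-Atlantic', 300), ('Northeast', 500)]
```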
Once settled on a logical model for your data, you will need to consider a number of factors in choosing a database product or optimizing your current database manager. There are a number of somewhat esoteric products designed for use in extreme or unusual applications (very large databases, unusual data models or complex numerical analysis) which we won’t discuss here. Most datamarts and many warehouses are constructed using conventional relational databases such as Microsoft SQL Server, Sybase, DB2 and Oracle. There are also specialized relational products available that are designed for data warehousing, such as Red Brick. These specialized products offer some significant advantages over more general purpose database managers.
The typical data warehouse is updated on a periodic basis, such as nightly or weekly. This data is usually extracted from legacy or operational systems, or supplied by outside syndicates. The data is cleaned up (transformed) and loaded into the warehouse. Depending on the application, the data loaded might be simply appended to existing data, or the warehouse might be completely emptied and reloaded. In any event, changes to discrete data are usually not allowed during normal business hours. This results in some unique advantages.
First, no transaction log is needed or wanted. Since we are making sure that the data is clean before it is loaded into the warehouse, and since we don’t perform update operations in the conventional sense, we have no need to implement transactions with their associated commit and rollback overhead.
Second, data structures that promote fast retrieval, such as the bitmap index, become a viable indexing option with the data warehouse. As you know, indexes are data structures that allow random access to data records based on the values of one or more fields. In the transaction processing world, they have been implemented using the binary tree (b-tree) almost exclusively for as long as I can remember. The b-tree’s main advantage is that it allows pretty fast access to any random record, while performing pretty well when you need to insert or delete a node in the tree, as when inserting or deleting a database record. Structures such as the bitmap index are much faster at retrieval than the b-tree but have the disadvantage that, once created, they are not updateable. Unfettered by the need to provide update capability, you can drop the database indexes before every load and rebuild them after the load is complete, expanding your index options.
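The drop-before-load, rebuild-after-load pattern looks like the following sketch. SQLite is used purely for illustration (it offers b-trees, not the bitmap indexes discussed above), and the table and index names are invented.

```python
# Sketch of dropping indexes before a bulk load and rebuilding them after,
# which a read-mostly warehouse can afford and an OLTP system cannot.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale_fact (region TEXT, unitsales INTEGER)")
con.execute("CREATE INDEX idx_region ON sale_fact (region)")

def nightly_load(con, rows):
    """Bulk-load with the index dropped, then rebuild it once."""
    con.execute("DROP INDEX idx_region")      # no per-row index maintenance
    con.executemany("INSERT INTO sale_fact VALUES (?, ?)", rows)
    con.execute("CREATE INDEX idx_region ON sale_fact (region)")
    con.commit()

nightly_load(con, [('Northeast', 10), ('Southeast', 20), ('Northeast', 5)])
total = con.execute(
    "SELECT SUM(unitsales) FROM sale_fact WHERE region = 'Northeast'"
).fetchone()[0]
print(total)   # 15
```

The same shape works for non-updateable structures like bitmap indexes: since the index never has to survive an insert, it can be rebuilt from scratch after every load.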
Third, the special purpose databases, such as Red Brick, are extremely fast at loading tables and building indexes, thereby overcoming problems that may result from the strict operational time constraints of data warehouses. A typical application running a nightly update must complete within the time allotted for it, or the system will not be available the next day. In addition to the time it takes to physically load the new records and rebuild the indexes, you might also be constrained by limited access to the systems that feed data to the warehouse. Having 14 hours of warehouse downtime daily in which to load data may not be helpful if the order entry system feeding it doesn’t finish reconciliation until 2 A.M. The special purpose databases like Red Brick respond well to these situations.
Finally, products like Red Brick can also be used to address query optimization problems. From a practical standpoint, all of the current generation of conventional RDBMSs are somewhat eccentric at processing queries against a star schema database. The reasons are many. Often, the query processor will decide that the constraints on some of the dimension records are too complex, or too weak. It will then process the fact table first, testing the results against each of the remaining dimension records. This defeats the whole indexing scheme, as the optimizer unwittingly forces a sequential scan of the biggest table in the database in order to select records in the smallest! Unfortunately, the tools the database administrator has to combat the problem are indirect, usually limited to fiddling with the database statistics by overwriting the system's actual numbers with values intended to "fool" the optimizer into thinking that a particular query strategy is acceptable. A bigger problem is that the relationships between data elements in a data warehouse are far more deterministic than those in a conventional OLTP system. We don’t need the query processor to "discover" those relationships for us; we already know what they are!
Source article: Data Warehouse Primer Page
A Data Warehousing Primer for Database Designers - Part II
The simple example above models a retail business that operates stores in a number of different markets. Each row in the fact table (Sale Fact) contains information about the sale of an item (UnitsSold and UnitPrice). Notice that the fact table is inherently normalized. You should also notice that the facts are additive along all dimensions, i.e. no matter which dimension we choose to group the sales facts by, we can sum UnitsSold (or the extended amount, UnitsSold times UnitPrice) and the results make sense. Semi-additive facts can only be summed along certain dimensions. An example is a monthly account balance fact, which is not additive along the time dimension (try adding up the ending monthly balance of your checking account for a year and see if the resultant number is meaningful!).
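The additive vs. semi-additive distinction is easy to see with a few invented numbers:

```python
# Illustrative figures only: unit sales sum meaningfully along the time
# dimension; month-end balances do not.
monthly_unit_sales = [120, 95, 140]        # Jan, Feb, Mar
monthly_end_balance = [1000, 1100, 900]    # checking balance at month end

# Additive along time: a real quarterly total.
q1_sales = sum(monthly_unit_sales)         # 355 units sold in Q1

# Semi-additive: summing balances along time yields a meaningless number.
# An average or the closing value is what analysts actually want.
avg_balance = sum(monthly_end_balance) / len(monthly_end_balance)
closing_balance = monthly_end_balance[-1]

print(q1_sales, avg_balance, closing_balance)   # 355 1000.0 900
```

Balances are still additive along other dimensions (all accounts in a region at one month end sum meaningfully), which is precisely what "semi-additive" captures.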
Another consideration in the schema is the grain of the fact table. We might choose to have each row in the fact table represent an individual consumer transaction, daily, weekly or monthly sales summaries. Clearly, the finer the grain of the table, the larger the warehouse will be. The business analysis which (hopefully) preceded the database design will point out how fine the granularity must be in order to meet the functional requirements of the system.
We also need to decide on the metrics we wish to apply to our facts. These are the arithmetic operations we need to perform to produce the numeric results the users of the system need. Common metrics include SUM, COUNT, AVERAGE, and STD_DEV. Note that some metrics, such as COUNT and AVERAGE, don’t necessarily correspond to their typical SQL database equivalents. Metrics which need a count of items usually require the number of unique values of a dimension attribute, not the count of rows returned. Depending on the tools used to implement the database queries, you may need to address this common shortcoming of relational databases.
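The COUNT pitfall is worth seeing concretely. In the sketch below (illustrative data, invented names), a question like "how many stores had sales?" wants COUNT(DISTINCT), not a row count:

```python
# Counting returned rows is not the same as counting unique dimension values.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sale_fact (store TEXT, unitsales INTEGER);
    INSERT INTO sale_fact VALUES ('S-001', 3), ('S-001', 2), ('S-002', 4);
""")

row_count = con.execute("SELECT COUNT(*) FROM sale_fact").fetchone()[0]
store_count = con.execute(
    "SELECT COUNT(DISTINCT store) FROM sale_fact").fetchone()[0]

print(row_count, store_count)   # 3 2 -- "stores with sales" wants the 2
```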
The dimension tables store the descriptions or attributes of our business. To perform a query, we typically select the dimension rows which meet our desired selection criteria (e.g., all coats smaller than size 10 sold in the Northeast during 1996). The rows selected from the dimension tables, joined against the fact table, form the result set. As a database designer, you should recognize this typical query as the result of a many-to-many join of the dimension and fact tables.
The dimension tables in a pure star schema, as illustrated above, are not normalized. As a designer, you may have to fight the temptation to normalize these tables, turning the star into a snowflake schema. You will have to consider a number of things in making that decision. First of all, a number of very good OLAP tools on the market today (MicroStrategy’s DSSAgent is one) essentially force you to use the snowflake approach. Another consideration is how insulated the end users are from the schema design. Although a normalized database is a thing of beauty to a systems designer, it is intimidating to the average user. Denormalized dimension tables lend themselves very naturally to browsing using simple SQL SELECT DISTINCT queries. Since a data warehouse does not require individual records to be changed on a frequent basis, the update performance justification for normalization is not present. Finally, any rationalization based on table size is probably ill-considered. In the typical warehouse, we have fact tables with tens or hundreds of millions of rows and dimension tables with only hundreds or thousands. There are probably no space savings benefits that justify the added complexity.
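The browsing point is easy to demonstrate. In this hypothetical sketch, the store dimension is denormalized (region repeated on every row rather than split into its own snowflake table), so a user can explore attribute values with nothing more than SELECT DISTINCT:

```python
# Browsing a denormalized dimension table with SELECT DISTINCT: no joins
# required, which is exactly what makes the star friendlier to end users.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE store_dim (store_id TEXT, city TEXT, region TEXT);
    INSERT INTO store_dim VALUES
        ('S-001', 'Boston',   'Northeast'),
        ('S-002', 'New York', 'Northeast'),
        ('S-003', 'Atlanta',  'Southeast');
""")

regions = con.execute(
    "SELECT DISTINCT region FROM store_dim ORDER BY region").fetchall()
print(regions)   # [('Northeast',), ('Southeast',)]
```

A snowflaked design would answer the same question, but only via a join the casual user would have to know about.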
Transform tables (not illustrated above) are used to join dimension tables back on themselves. They allow us to model useful concepts such as this year vs. last year (1:1) or year-to-date (1:M). Some OLAP tools allow you to create dimensional metrics which make it easy to form query templates that incorporate these concepts.
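One possible shape for a this-year-vs-last-year transform table is sketched below. The article gives no concrete layout, so the table and column names here are invented; the point is that the comparison becomes one extra join back onto the same data.

```python
# Sketch of a transform table joining a period dimension back on itself
# (1:1 in this case) to express "this year vs. last year".
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (period TEXT, unitsales INTEGER);
    INSERT INTO sales VALUES ('1998-Q1', 900), ('1999-Q1', 1100);

    -- Transform table: maps each period to its prior-year counterpart.
    CREATE TABLE period_xform (period TEXT, prior_period TEXT);
    INSERT INTO period_xform VALUES ('1999-Q1', '1998-Q1');
""")

rows = con.execute("""
    SELECT t.period, cur.unitsales, prev.unitsales
    FROM period_xform t
    JOIN sales cur  ON cur.period  = t.period
    JOIN sales prev ON prev.period = t.prior_period
""").fetchall()
print(rows)   # [('1999-Q1', 1100, 900)]
```

A year-to-date transform would be the 1:M case: one row per (period, contributing period) pair instead of a single prior period.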
A Data Warehousing Primer for Database Designers - Part I
Over the last several years, data warehousing has emerged as a distinct discipline in the information technology world. As a database designer, you may wonder what, if anything, is different about this approach to systems development. This article outlines some of the fundamentals of data warehousing and suggests, in broad terms, where this technology might be applicable.
For the purposes of this article, we’ll look at data warehousing’s logical and physical design criteria. The considerations for logical design will focus primarily on the database schema. The physical design discussion will zero in on the unique aspects of a data warehouse and why, for certain applications, some database products are better than others. Note that I’ll be discussing only the most common solutions and products in this arena. This article is not intended to be comprehensive and there are many business problems and products in the marketplace which do not fit into this article’s framework.
Relational database usage falls primarily into two main categories: transaction processing systems (OLTP) and analytical systems (OLAP). When E.F. Codd laid out his rules for OLTP relational database design in the 1980s, he later set forth a similar set of rules for OLAP. His primary point in creating this distinction is still valid: namely, that both approaches to database design are equally sound from a theoretical point of view. Thus your selection of one or the other design should be determined by the functional requirements of the system.
Transaction processing systems are used by businesses both to record and view information in real-time. Such systems are typically optimized for supporting a specific set of business processes. Although historical data is available in such systems, the main design goal is to support the needs of the people who run the business operations. For example, though an order entry system can undoubtedly provide the marketing department with useful historical information about what customers are buying at various times of the year, the primary goal in designing the system is to meet the needs of users of current business information. The design is defined by sales' need to track the order, manufacturing’s need to build it, logistics’ need to ship it and finance’s requirements for invoicing the customer. Once the customer pays the bill, the need for access to the discrete records by the business operations people diminishes.
When building such a system, the database designer will almost certainly encounter performance constraints. How much data does the system need to retain, and for how long? How many transactions does it need to support in a given timeframe? How shall business rules and relational integrity be implemented and enforced? When all is said and done, the end result will probably look a lot like a textbook third-normal form schema. Since writes to the database happen roughly as often as reads, care is taken to ensure that key fields are not duplicated, thus allowing efficient updates to data.
Query processing is typically done on a single transactional unit of data. For example, to look at a single order the database designer might create a query that pulls information from the following tables: customer, address, order, order detail, product and price. Although this joins a number of individual tables, the very nature of the query will constrain the order table to a single row. This will axiomatically constrain customer and address to a single row. Order details is likely to be a manageable number of rows and the remainder of the tables will be limited to a number of rows equivalent to the number of order details. So, for an order of 6 items, our query will need to retrieve just 21 rows from 6 separate tables.
Now imagine a query to the same database to report sales by department by product by fiscal quarter over the last two years. Assume that we’re a mail-order apparel firm that grosses $200M annually with an average order size of 6 items and an average price per item of $25.00. This query will need to traverse over 50 million rows to fetch the raw data needed for the report. A third normal form database schema may not necessarily perform very well in processing this query. To compound the problem, what if there is no algorithmic method of determining which dates belong in a particular fiscal year, or if your queries need to incorporate the concepts of this year vs. last year, or quarter-to-date? What if the company was reorganized recently and you need to be able to report last year’s data as if it reflected the reorganization and this year’s data as if the reorganization had not taken place? Chances are, these concepts are beyond the scope of the line-of-business requirements. Enter the data warehouse or datamart (N.B. Think of a datamart as a data warehouse that covers a single subject or operational area, such as sales).
The heart of the data warehouse is the dimensional (a.k.a. star or snowflake) schema: a central fact table surrounded by dimension and transform tables. Think of the fact table as the business process you are measuring, such as a sale or shipment. The dimension tables will contain the selections and their attributes which will be used to constrain the records in the fact table, as well as provide the values used for grouping and sorting the results.
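A minimal star schema matching that description might look like the following sketch. All of the table and column names here are invented for illustration (the article's own figure is not reproduced): three dimension tables surrounding one fact table keyed by surrogate keys.

```python
# Skeleton of a star schema: dimension tables constrain and group; the
# central fact table supplies the measured numbers.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, caldate TEXT,
                              fiscal_quarter TEXT);
    CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT,
                              region TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT,
                              category TEXT);

    -- The fact table: one row per measured business event (a sale).
    CREATE TABLE sale_fact (
        date_key    INTEGER REFERENCES date_dim,
        store_key   INTEGER REFERENCES store_dim,
        product_key INTEGER REFERENCES product_dim,
        unitssold   INTEGER,
        unitprice   REAL
    );
""")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)   # ['date_dim', 'product_dim', 'sale_fact', 'store_dim']
```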
For the purposes of this article, we’ll look at data warehousing’s logical and physical design criteria. The considerations for logical design will focus primarily on the database schema. The physical design discussion will zero in on the unique aspects of a data warehouse and why, for certain applications, some database products are better than others. Note that I’ll be discussing only the most common solutions and products in this arena. This article is not intended to be comprehensive and there are many business problems and products in the marketplace which do not fit into this article’s framework.
Relational database usage falls primarily into two main categories: transaction processing systems (OLTP) and analytical systems (OLAP). When E.F. Codd laid out his rules for OLTP relational database design in the 1980’s, he also set forth a similar set of rules for OLAP. His primary point for creating this distinction is still valid: namely, that both approaches to database design are equally sound from a theoretical construct point of view. Thus your selection of one or the other design should be determined by the functional requirements of the system.
Transaction processing systems are used by businesses to record and view information in real time. Such systems are typically optimized for supporting a specific set of business processes. Although historical data is available in such systems, the main design goal is to support the needs of the people who run the business operations. For example, though an order entry system can undoubtedly provide the marketing department with useful historical information about what customers are buying at various times of the year, the primary goal in designing the system is to meet the needs of users of current business information. The design is defined by sales' need to track the order, manufacturing's need to build it, logistics' need to ship it and finance's requirements for invoicing the customer. Once the customer pays the bill, the need for access to the discrete records by the business operations people diminishes.
When building such a system, the database designer will almost certainly encounter performance constraints. How much data does the system need to retain, and for how long? How many transactions must it support in a given timeframe? How shall business rules and relational integrity be implemented and enforced? When all is said and done, the end result will probably look a lot like a textbook third normal form schema. Since writes to the database occur roughly as often as reads, care is taken to ensure that key fields are not duplicated, thus allowing efficient updates to data.
Query processing is typically done on a single transactional unit of data. For example, to look at a single order the database designer might create a query that pulls information from the following tables: customer, address, order, order detail, product and price. Although this joins a number of individual tables, the very nature of the query constrains the order table to a single row, which in turn constrains customer and address to a single row each. The order detail rows are likely to be a manageable number, and the remaining tables will be limited to a number of rows equivalent to the number of order details. So, for an order of 6 items, our query will need to retrieve just 21 rows from 6 separate tables.
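The single-order lookup described above can be sketched as follows. This is a minimal illustration using hypothetical table and column names (they do not come from the article) in an in-memory SQLite database:

```python
import sqlite3

# Hypothetical 3NF-style OLTP tables: customer, address, orders,
# product, price, order_detail.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE address  (address_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer);
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer);
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE price    (product_id INTEGER REFERENCES product, amount REAL);
CREATE TABLE order_detail (order_id INTEGER REFERENCES orders,
                           product_id INTEGER REFERENCES product,
                           qty INTEGER);
""")
cur.execute("INSERT INTO customer VALUES (1, 'Acme')")
cur.execute("INSERT INTO address VALUES (1, 1)")
cur.execute("INSERT INTO orders VALUES (100, 1)")
for p in range(1, 7):  # an order of 6 line items
    cur.execute("INSERT INTO product VALUES (?, ?)", (p, f"item-{p}"))
    cur.execute("INSERT INTO price VALUES (?, 25.0)", (p,))
    cur.execute("INSERT INTO order_detail VALUES (100, ?, 1)", (p,))

# The WHERE clause pins the order table to a single row, which in
# turn constrains every other table in the join.
rows = cur.execute("""
    SELECT c.name, p.name, pr.amount, d.qty
      FROM orders o
      JOIN customer c     ON c.customer_id = o.customer_id
      JOIN address  a     ON a.customer_id = c.customer_id
      JOIN order_detail d ON d.order_id    = o.order_id
      JOIN product  p     ON p.product_id  = d.product_id
      JOIN price    pr    ON pr.product_id = p.product_id
     WHERE o.order_id = 100
""").fetchall()
print(len(rows))  # one result row per line item
```

The query touches six tables, but the result set is bounded by the six order detail rows, which is why this access pattern stays fast on a normalized schema.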
Now imagine a query to the same database to report sales by department by product by fiscal quarter over the last two years. Assume that we’re a mail-order apparel firm that grosses $200M annually with an average order size of 6 items and an average price per item of $25.00. This query will need to traverse over 50 million rows to fetch the raw data needed for the report. A third normal form database schema may not necessarily perform very well in processing this query. To compound the problem, what if there is no algorithmic method of determining which dates belong in a particular fiscal year, or if your queries need to incorporate the concepts of this year vs. last year, or quarter-to-date? What if the company was reorganized recently and you need to be able to report last year’s data as if it reflected the reorganization and this year’s data as if the reorganization had not taken place? Chances are, these concepts are beyond the scope of the line-of-business requirements. Enter the data warehouse or datamart (N.B. Think of a datamart as a data warehouse that covers a single subject or operational area, such as sales).
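A rough back-of-envelope check of the figures quoted above (the 21 rows per order comes from the earlier single-order example; the exact breakdown is my reconstruction, not the author's):

```python
# Estimate the rows the 3NF schema must touch for the two-year report.
annual_revenue = 200_000_000
avg_price = 25.00
items_per_order = 6

items_per_year = annual_revenue / avg_price          # 8,000,000 line items/year
orders_per_year = items_per_year / items_per_order   # ~1,333,333 orders/year
# Per order: 6 detail + 6 product + 6 price rows, plus one row each
# from orders, customer and address = 21 rows (as in the OLTP example).
rows_per_order = 3 * items_per_order + 3
rows_two_years = 2 * orders_per_year * rows_per_order
print(f"{rows_two_years:,.0f}")  # on the order of 56 million rows
```

Under these assumptions the report sweeps roughly 56 million rows across the joined tables, consistent with the article's "over 50 million" figure.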
The heart of the data warehouse is the dimensional (a.k.a. star or snowflake) schema: a central fact table surrounded by dimension and transform tables. Think of the fact table as the business process you are measuring, such as a sale or shipment. The dimension tables contain the selections and their attributes that will be used to constrain the records in the fact table, as well as provide the values used for grouping and sorting the results.
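A minimal star-schema sketch, again with hypothetical names and toy data: one fact table of sales measures keyed to date and product dimensions, where an arbitrary fiscal calendar lives entirely in the date dimension rather than in application code.

```python
import sqlite3

# Toy dimensional model: fact_sales surrounded by dim_date and
# dim_product (department is carried as a product attribute here).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY,
                       fiscal_quarter TEXT, fiscal_year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          name TEXT, department TEXT);
CREATE TABLE fact_sales (date_key INTEGER REFERENCES dim_date,
                         product_key INTEGER REFERENCES dim_product,
                         units INTEGER, dollars REAL);
""")
cur.executemany("INSERT INTO dim_date VALUES (?,?,?)",
                [(1, "Q1", 2023), (2, "Q2", 2023)])
cur.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "parka", "outerwear"), (2, "boots", "footwear")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 1, 4, 100.0), (1, 2, 2, 50.0), (2, 1, 6, 150.0)])

# Dimensions both constrain the facts and supply the grouping values.
report = cur.execute("""
    SELECT p.department, d.fiscal_quarter, SUM(f.dollars)
      FROM fact_sales f
      JOIN dim_date d    ON d.date_key    = f.date_key
      JOIN dim_product p ON p.product_key = f.product_key
     GROUP BY p.department, d.fiscal_quarter
     ORDER BY p.department, d.fiscal_quarter
""").fetchall()
print(report)
```

Because fiscal quarters are plain attributes of `dim_date`, questions like "this year vs. last year" or "quarter-to-date" become simple filters on the dimension, and a reorganization can be modeled by swapping in an alternate set of dimension attributes.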