The highest-quality power [...] comes from the application of knowledge. (Alvin Toffler, Powershift)
Knowledge Base


The Knowledge Base is accessible from Client Support area. Please login:

Support Level:
Password:

Data Warehousing Environment - An Introduction

Author: George Jucan

Copyright (c)2001, Open Data Systems Inc.


1. Overview

Only couple of years ago, data warehouses were regarded mostly as a historical data repository, with a very small impact on the daily business operation. Moving nowadays deeper into the 'informational age', the businesses are realizing more and more the crucial importance of knowledge, derived from historical data analysis, for solving current issues and planing for future growth. The 'business intelligence' sector is now turning pieces of data from operational applications, historical repositories and external sources into meaningful information to support better business decisions.

At the 'black box' level, we can say that an On-Line Analytical Processing (OLAP) system is extracting detailed data from the operational systems and transform them in order to offer usable analytical information to the end-users. It usually maintains a longer history of detailed and summarized data then the operational systems, in order to allow trend analysis and historical comparisons.

Driven by the growth of the business requirements for information, the OLAP environments have expanded and specialized to offer better and faster response to more and more sophisticated requests. Nowadays, a full blown warehousing solution will include at least one Operational Data Store (ODS), Extract-Transport-Load (ETL) processes usually including Staging Areas (SAs), one Enterprise Data Warehouse (DW), an Enterprise Metadata Repository (MR), couple of specialized Data Marts (DMs) and specialized OLAP environments and tools. It looks like (and it usually is) a rather complex environment, with a lot of components, but it can be quickly ordered if we look at it as a combination of solutions and tools.

Even if sometimes the data warehouse is presented as a product, it is in fact a solution. That is because building a data warehouse implies using existing, non-specialized technology like relational databases in a specific configuration to satisfy a specific business need. The business usage, translated into a specific type of questions the system has to answer to, is the one that defines an OLAP system. Furthermore, the business scope dictates if the solution will be an ODS, a Data Warehouse, a Data Mart or a combination of them, as specialized solutions for specific business requirements inside the OLAP environment.

The diagram bellow describes a fully developed OLAP environment, built using the top-down approach. As we will see later described, some of these components may be missing from a real environment.

Fig.1. - Complete OLAP environment

Please note that there is an alternate way of building an OLAP environment, named bottom-up and reflected in Fig.2. Recognizing that it may be a viable alternative in specific cases, we are strongly supporting the first layout described. An extensive compare of these two designs can be found in ' OLAP environment: top-down versus bottom-up methodologies '.

Fig.2. - Alternate OLAP environment

2. Components

An OLAP solution is usually using a relational database (Oracle, Informix, DB2 or others) specially designed for optimal response to a certain type of business questions. The database design is maybe the most important aspect of building an OLAP environment. Classical Entity-Relationship design is usually used for ODS's and Dimensional Modeling for Data Marts and Data Warehouses, but this is not an unbreakable rule (see ' Designing OLAP databases ').

The Operational Data Store

The Operational Data Store is considered by some specialists as an operational system, and as an OLAP component by others. In fact, the ODS has characteristics from both environments. Its design follows closely the operational databases design (sometimes up to straight replication), it is designed using traditional ER methodology, provides real-time transactional support and it allows update/delete/insert operations. All of these are main characteristics of a transactional system. But in the other hand, data stored in an ODS is consolidated and organized by subject areas, therefore usable for decision support processing, so it is an OLAP system. A more detailed presentation can be found in ' Is ODS a data warehousing component? '. For the purpose of this paper, we will list bellow the generally accepted characteristics of an ODS:

-          integrated: data from disparate operational systems is consolidated into a consistent view of the enterprise;

-          subject oriented: data is stored grouped by business subject areas, rather then optimal transactional processing;

-          volatile: data is permanently added, updated and deleted, to provide a snapshot of the current business environment;

-          current valued: there is no long term history in an ODS; it usually stores one day/week/month worth of data;

-          detail oriented: data in an ODS is at the same level of granularity as the operational systems, with no additional aggregates or summaries.

The Data Warehouse

The Data Warehouse is the most important OLAP concept, specially designed to contain integrated enterprise-wide detailed and summarized data, including a long enough history to provide both strategic and executive management perspectives of the enterprise. The design paradigm is completely different then the traditional ER modeling used for operational systems. The Dimensional Modeling methodology is changing the design perspective toward large data sets manipulation, rather then providing fast single row access. By definition, a Data Warehouse model is de-normalized and optimized for drill-down and roll-up operations into the facts across the dimensions. For more information on specific warehousing design techniques, please refer to ' Design Models in an OLAP Environment '. For now, we will conclude with a presentation of the most important characteristics of a Data Warehouse:

-          integrated: data is stored in an enterprise consolidated view (universal naming conventions, measurements, classifications and so on), even if the source systems are not consistent;

-          subject-oriented: all relevant data regarding a business subject area is grouped together;

-          non-volatile: once the data was loaded it can be only read; the users are not allowed to perform any update/delete/insert operations, so it can provide a consistent history;

-          time-variant: data is stored for long-term periods, quantified in years; it is not unusual for detailed data to be stored for 5-10 years, and summary data for up to 25 years.

The Data Marts

The Data Marts are OLAP constructs containing a subset of usually summarized data concerning a business area, department or function. Its design is based on the same Dimensional Modeling technique, and from many other perspectives it can be called 'a departmental data warehouse'. Of course, there are many differences between a data warehouse and one or many data marts, especially regarding the type of analytical environment provided. A detailed discussion on the similarities and distinctions between the warehouse and the marts environment can be found in ' Data Warehouse or Data Marts? '. For the scope of this paper, we will list the Data Marts generally accepted characteristics as:

-          departmental: the data represents a particular view on a subset of enterprise data, suitable for the needs of a small group of specialized users  ;

-          subject-oriented: all meaningful data (for that users group) from a business subject area is grouped together;

-          non-volatile: once the data was loaded the users are allowed to perform only read and no update/delete/insert operations, so it provides a consistent history;

-          time-variant: data is stored for long-term periods, anywhere in the range of 5 to 25 years.

The Metadata Repository

The Metadata Repository is the logic and semantic layer of understanding and interpreting the information stored in the OLAP environment (sometimes including the operational systems as well). From physical perspective it is a specially designed database (see ' A Metadata Repository Prototype ') with an exploration and visualization tool. The complexity of information regarding the whole environment is usually structured as:

-          business metadata, such as subject area definition, definition of entities, attributes and relationships, technical implementations of business information, enterprise wide aliases and their departmental equivalents of business data elements, and so forth.

-          technical metadata, describing the physical implementation of the business metadata. It is, at its turn, organized in:

-          static metadata, describing the objects with very rare changes over time, such as tables descriptions and structure, attributes description and physical definition, unique identifiers of data elements, indexes defined for faster data access, entities relationships and the corresponding foreign keys, and so on.

-          dynamic metadata, known also as data metrics, concerning data load volume and quality quantifiers, overall data statistics, data flows, data usage patterns and other information about the usage of the statis structures.

3. Data Flow

One of the most important decisions during the build of a data warehousing environment is related to the data flow. Where the data is extracted from, where is it processed and stored, which systems act as data sources for the others, what is each component's confidence level are directly related with the overall architecture. As most of the very important decisions it has to be decided during the initial analysis phase, as far as it has a strong impact on the final architecture. 

The Top-Down Approach

The data flow into a top-down OLAP environment starts with the data extraction from the enterprise operational data sources (relational databases, mainframe systems, proprietary databases, other internal and external data sources usually converted into flat files). The data elements are temporarily loaded and hosted into a Staging Area, where some validation and consolidation processes are performed to ensure a certain level of accuracy of the data transferred into the Operational Data Store. This step is sometimes bypassed if the ODS is a straight replication of the operational databases. For details please refer to ' Is ODS a data warehousing component? '.

Once the data vas validated and consolidated it is extracted from the Staging Area and loaded into the Operational Data Store. It is not unusual that some detailed data to be loaded in the Data Warehouse as well, in a parallel process, rather then being extracted later from ODS. But regularly detailed data is extracted from the ODS and temporarily hosted into a Staging Area where is aggregated and summarized, and then extracted and loaded into the Data Warehouse. The decision on the data streams is based mostly on the data availability business need for the ODS and Data Warehouse, reflected on the refresh cycles for ODS and DW. For example, if both databases are refreshed daily, and there is a business need for detailed data in the DW, it makes sense to load the detailed data simultaneously in ODS and DW. But if we have a real-time ODS but the Data Warehouse is loaded in a daily batch, it is better to keep the detail data load inside the main batch for consistency and complexity management reasons.

After the Data Warehouse aggregation and summarization processes have been performed, each Data Mart refresh cycle will extract required data from the DW into a staging area where a new set of transformations is applied to organize the information in the particular structures required by the DM. Then the Data Marts are loaded and the OLAP environment becomes available for users.

The Bottom-Up Approach

The bottom-up approach is reversing the positions of the Data Warehouse and the Data Marts into the data stream. As well, even if the complete environment includes an ODS, the most implementation are loading the Data Marts directly from the operational systems, usually through a staging area. Another variation is to load the Data Marts and the ODS simultaneously from the operational sources, but this approach is increasing the processes coordination complexity factor. The standard architecture, where the Data Marts are refreshed from the ODS and not directly from the operational databases, insures a data consolidation across the enterprise that is usually missing if each DM is extracting its data independently from the operational databases.

The generic data flow starts with the extraction from the data sources into a staging area, where the data elements are consolidated across the enterprise. The integrated data is then extracted and loaded into the ODS, where it adds or replaces existing data.

Once the ODS was refreshed, the current interval data is extracted into one or more staging areas and processed to fit into each DM's structure. From the Data Marts, and sometimes from ODS too, data is then extracted into a staging area where it is consolidated, aggregated and so on. Then the processed information is extracted from the staging area and loaded into the Data Warehouse.

4. Tools

There are two major types of tools used in an OLAP environment, classified by their usage: data manipulation and data analysis.

The tools used to connect the OLAP databases with the operational systems (OLTP), between them or with analytical environments are generally known as ETL (Extract, Transport and Load) or ETT (Extract, Transfer and Transform) tools. Their purpose is clearly stated by their definition: extract data from the operational databases, transfer the data sets across the network into the OLAP system (usually performing some validations and transformations on-the-fly or in staging areas) and load the (pre)processed data into the OLAP database (where some additional aggregation and summarization may be performed).

The data exploration and analysis tools are using (directly or through staging areas) the information stored in the OLAP databases to deliver specialized reports like time series, trend analysis, pivotal reports, cross references and so on. Some of them are also building an additional layer by extracting 'data cubes' from the OLAP database into their proprietary environment to allow intuitive slicing and dicing into smaller sets of information then the entire data warehouse.

5. Summary

This paper's only scope was to introduce the reader to basic architectural concepts of the OLAP environment. There are many variations of the two basic structures, developed in order to suit the specific interests of a particular business environment. More details on the topic can be found in the papers quoted trough the presentation but the author also suggests other articles and white papers, especially by W.H. Inmon and R. Kimball, considered “parents' of data warehousing.

To conclude, there is no 'right' or 'wrong' approach of designing an OLAP environment, but advantages and disadvantages of each methodology. Considering the costs of building a data warehousing structure, a pragmatic approach will dictate the architecture and the components to better suit the business needs. At the end of the day, the only success measure is the users' satisfaction with the OLAP system.

©2005 Open Data Systems Inc.           All rights reserved!           For inquires contact info@opendatasys.com