1.
Overview
Only couple of years ago, data warehouses were regarded
mostly as a historical data repository, with a very small impact on the daily
business operation. Moving nowadays deeper into the 'informational age', the
businesses are realizing more and more the crucial importance of knowledge,
derived from historical data analysis, for solving current issues and planing
for future growth. The 'business intelligence' sector is now turning pieces of
data from operational applications, historical repositories and external
sources into meaningful information to support better business decisions.
At the 'black box' level, we can say that an On-Line
Analytical Processing (OLAP) system is extracting detailed data from the
operational systems and transform them in order to offer usable analytical
information to the end-users. It usually maintains a longer history of detailed
and summarized data then the operational systems, in order to allow trend
analysis and historical comparisons.
Driven by the growth of the business requirements for
information, the OLAP environments have expanded and specialized to offer
better and faster response to more and more sophisticated requests. Nowadays, a
full blown warehousing solution will include at least one Operational Data
Store (ODS), Extract-Transport-Load (ETL) processes usually including Staging
Areas (SAs), one Enterprise Data Warehouse (DW), an Enterprise Metadata
Repository (MR), couple of specialized Data Marts (DMs) and specialized OLAP
environments and tools. It looks like (and it usually is) a rather complex
environment, with a lot of components, but it can be quickly ordered if we look
at it as a combination of solutions and tools.
Even if sometimes the data warehouse is presented as a
product, it is in fact a solution.
That is because building a data warehouse implies using existing,
non-specialized technology like relational databases in a specific
configuration to satisfy a specific business need. The business usage,
translated into a specific type of questions the system has to answer to, is
the one that defines an OLAP system. Furthermore, the business scope dictates
if the solution will be an ODS, a Data Warehouse, a Data Mart or a combination
of them, as specialized solutions for specific business requirements inside the
OLAP environment.
The diagram bellow describes a fully developed OLAP
environment, built using the top-down approach. As we will see later described,
some of these components may be missing from a real environment.

Fig.1. - Complete OLAP environment
Please note that there is an alternate way of building an
OLAP environment, named bottom-up and reflected in Fig.2. Recognizing that it
may be a viable alternative in specific cases, we are strongly supporting the
first layout described. An extensive compare of these two designs can be found
in ' OLAP environment: top-down versus
bottom-up methodologies '.

Fig.2. - Alternate OLAP environment
2.
Components
An OLAP solution is usually using a relational database
(Oracle, Informix, DB2 or others) specially designed for optimal response to a
certain type of business questions. The database design is maybe the most
important aspect of building an OLAP environment. Classical Entity-Relationship
design is usually used for ODS's and Dimensional Modeling for Data Marts and
Data Warehouses, but this is not an unbreakable rule (see ' Designing OLAP databases ').
The Operational Data Store
The Operational Data Store is considered by some specialists
as an operational system, and as an OLAP component by others. In fact, the ODS
has characteristics from both environments. Its design follows closely the
operational databases design (sometimes up to straight replication), it is
designed using traditional ER methodology, provides real-time transactional
support and it allows update/delete/insert operations. All of these are main
characteristics of a transactional system. But in the other hand, data stored
in an ODS is consolidated and organized by subject areas, therefore usable for
decision support processing, so it is an OLAP system. A more detailed
presentation can be found in ' Is ODS a
data warehousing component? '. For the purpose of this paper, we will list
bellow the generally accepted characteristics of an ODS:
-
integrated:
data from disparate operational systems is consolidated into a consistent view
of the enterprise;
-
subject oriented:
data is stored grouped by business subject areas, rather then optimal
transactional processing;
-
volatile:
data is permanently added, updated and deleted, to provide a snapshot of the
current business environment;
-
current valued:
there is no long term history in an ODS; it usually stores one day/week/month
worth of data;
-
detail oriented:
data in an ODS is at the same level of granularity as the operational systems,
with no additional aggregates or summaries.
The Data Warehouse
The Data Warehouse is the most important OLAP concept,
specially designed to contain integrated enterprise-wide detailed and
summarized data, including a long enough history to provide both strategic and
executive management perspectives of the enterprise. The design paradigm is
completely different then the traditional ER modeling used for operational
systems. The Dimensional Modeling methodology is changing the design
perspective toward large data sets manipulation, rather then providing fast
single row access. By definition, a Data Warehouse model is de-normalized and
optimized for drill-down and roll-up operations into the facts across the
dimensions. For more information on specific warehousing design techniques,
please refer to ' Design Models in an
OLAP Environment '. For now, we will conclude with a presentation of the
most important characteristics of a Data Warehouse:
-
integrated:
data is stored in an enterprise consolidated view (universal naming
conventions, measurements, classifications and so on), even if the source
systems are not consistent;
-
subject-oriented:
all relevant data regarding a business subject area is grouped together;
-
non-volatile:
once the data was loaded it can be only read; the users are not allowed to
perform any update/delete/insert operations, so it can provide a consistent
history;
-
time-variant:
data is stored for long-term periods, quantified in years; it is not unusual
for detailed data to be stored for 5-10 years, and summary data for up to 25
years.
The Data Marts
The Data Marts are OLAP constructs containing a subset of
usually summarized data concerning a business area, department or function. Its
design is based on the same Dimensional Modeling technique, and from many other
perspectives it can be called 'a departmental data warehouse'. Of course, there
are many differences between a data warehouse and one or many data marts,
especially regarding the type of analytical environment provided. A detailed
discussion on the similarities and distinctions between the warehouse and the
marts environment can be found in ' Data
Warehouse or Data Marts? '. For the scope of this paper, we will list the
Data Marts generally accepted characteristics as:
-
departmental:
the data represents a particular view on a subset of enterprise data, suitable
for the needs of a small group of specialized users ;
-
subject-oriented:
all meaningful data (for that users group) from a business subject area is
grouped together;
-
non-volatile:
once the data was loaded the users are allowed to perform only read and no
update/delete/insert operations, so it provides a consistent history;
-
time-variant:
data is stored for long-term periods, anywhere in the range of 5 to 25 years.
The Metadata Repository
The Metadata Repository is the logic and semantic layer of
understanding and interpreting the information stored in the OLAP environment
(sometimes including the operational systems as well). From physical
perspective it is a specially designed database (see ' A Metadata Repository Prototype ') with an exploration and visualization tool.
The complexity of information regarding the whole environment is usually
structured as:
-
business metadata,
such as subject area definition, definition of entities, attributes and
relationships, technical implementations of business information, enterprise
wide aliases and their departmental equivalents of business data elements, and
so forth.
-
technical metadata,
describing the physical implementation of the business metadata. It is, at its
turn, organized in:
-
static metadata, describing the objects with very rare
changes over time, such as tables descriptions and structure, attributes
description and physical definition, unique identifiers of data elements,
indexes defined for faster data access, entities relationships and the
corresponding foreign keys, and so on.
-
dynamic metadata, known also as data metrics,
concerning data load volume and quality quantifiers, overall data statistics,
data flows, data usage patterns and other information about the usage of the
statis structures.
3.
Data Flow
One of the most important decisions during the build of a
data warehousing environment is related to the data flow. Where the data is
extracted from, where is it processed and stored, which systems act as data
sources for the others, what is each component's confidence level are directly
related with the overall architecture. As most of the very important decisions
it has to be decided during the initial analysis phase, as far as it has a
strong impact on the final architecture.
The Top-Down Approach
The data flow into a top-down OLAP environment starts with
the data extraction from the enterprise operational data sources (relational
databases, mainframe systems, proprietary databases, other internal and
external data sources usually converted into flat files). The data elements are
temporarily loaded and hosted into a Staging Area, where some validation and
consolidation processes are performed to ensure a certain level of accuracy of
the data transferred into the Operational Data Store. This step is sometimes
bypassed if the ODS is a straight replication of the operational databases. For
details please refer to ' Is ODS a data
warehousing component? '.
Once the data vas validated and consolidated it is extracted
from the Staging Area and loaded into the Operational Data Store. It is not
unusual that some detailed data to be loaded in the Data Warehouse as well, in
a parallel process, rather then being extracted later from ODS. But regularly
detailed data is extracted from the ODS and temporarily hosted into a Staging
Area where is aggregated and summarized, and then extracted and loaded into the
Data Warehouse. The decision on the data streams is based mostly on the data
availability business need for the ODS and Data Warehouse, reflected on the
refresh cycles for ODS and DW. For example, if both databases are refreshed
daily, and there is a business need for detailed data in the DW, it makes sense
to load the detailed data simultaneously in ODS and DW. But if we have a
real-time ODS but the Data Warehouse is loaded in a daily batch, it is better
to keep the detail data load inside the main batch for consistency and
complexity management reasons.
After the Data Warehouse aggregation and summarization
processes have been performed, each Data Mart refresh cycle will extract
required data from the DW into a staging area where a new set of
transformations is applied to organize the information in the particular structures
required by the DM. Then the Data Marts are loaded and the OLAP environment
becomes available for users.
The bottom-up approach is reversing the positions of the
Data Warehouse and the Data Marts into the data stream. As well, even if the
complete environment includes an ODS, the most implementation are loading the
Data Marts directly from the operational systems, usually through a staging
area. Another variation is to load the Data Marts and the ODS simultaneously
from the operational sources, but this approach is increasing the processes
coordination complexity factor. The standard architecture, where the Data Marts
are refreshed from the ODS and not directly from the operational databases,
insures a data consolidation across the enterprise that is usually missing if
each DM is extracting its data independently from the operational databases.
The generic data flow starts with the extraction
from the data sources into a staging area, where the data elements are
consolidated across the enterprise. The integrated data is then extracted and
loaded into the ODS, where it adds or replaces existing data.
Once the ODS was refreshed, the current interval data is
extracted into one or more staging areas and processed to fit into each DM's
structure. From the Data Marts, and sometimes from ODS too, data is then
extracted into a staging area where it is consolidated, aggregated and so on.
Then the processed information is extracted from the staging area and loaded
into the Data Warehouse.
There are two major types of tools used in an OLAP
environment, classified by their usage: data manipulation and data analysis.
The tools used to connect the OLAP databases with
the operational systems (OLTP), between them or with analytical environments
are generally known as ETL (Extract, Transport and Load) or ETT (Extract,
Transfer and Transform) tools. Their purpose is clearly stated by their
definition: extract data from the operational databases, transfer the data sets
across the network into the OLAP system (usually performing some validations
and transformations on-the-fly or in staging areas) and load the (pre)processed
data into the OLAP database (where some additional aggregation and
summarization may be performed).
The data exploration and analysis tools are using (directly
or through staging areas) the information stored in the OLAP databases to
deliver specialized reports like time series, trend analysis, pivotal reports,
cross references and so on. Some of them are also building an additional layer
by extracting 'data cubes' from the OLAP database into their proprietary
environment to allow intuitive slicing and dicing into smaller sets of
information then the entire data warehouse.
This paper's only scope was to introduce the reader to basic
architectural concepts of the OLAP environment. There are many variations of
the two basic structures, developed in order to suit the specific interests of
a particular business environment. More details on the topic can be found in
the papers quoted trough the presentation but the author also suggests other
articles and white papers, especially by W.H. Inmon and R. Kimball, considered
“parents' of data warehousing.
To conclude, there is no 'right'
or 'wrong' approach of designing an OLAP environment, but advantages and
disadvantages of each methodology. Considering the costs of building a data
warehousing structure, a pragmatic approach will dictate the architecture and
the components to better suit the business needs. At the end of the day, the
only success measure is the users' satisfaction with the OLAP system.