In the Information Age, more and more
business and socio-political organizations are realizing the advantages that
can be drawn by employing the mostly un-utilized resource represented by the
enterprise information pool. Customer orders, product shipments, invoices,
customers and employees behaviour, and many more information items can enable a
company to better integrate into its market, predict its trends and proactively
react to future market needs or directions. All these informational resources
already exist in the enterprise, so there is no acquisition cost, so the
organizations only have to invest in organizing and meaningfully using it to
adjust their business behaviour, directly and positively impacting their
bottom-line.
The particular class of systems targeted to
organize and use large sets of enterprise information to enhance the business
perspective and decision-making process are largely known as Business
Intelligence (BI) solutions, such as On-Line Analytical Processing (OLAP),
Multidimensional OLAP (MOLAP), Decision Support Systems (DSS) and Enterprise
Information / Reporting Systems (EIS / ERS). These can be deployed as
non-integrated stand-alone solutions, commonly named Data Marts. But it is a
truism that somebody that knows the past can predict the future, so the most
useful deployment of a BI solution is on top of an Enterprise Data Warehouse
that holds integrated long-term history of organization's information elements.
There is a large industry debate concerning
the level of detail to be stored in the Data Warehouse (its 'granularity'), and
how meaningful a ten-years old atomic data element is for the organization. The
only definite answer to this issue is that there is no universal answer, as
each organization has different data retention policies for electronic and
written information. However, the Data Warehouse should have a consistent
history depth to provide meaningful cross-references, so many organizations are
using an Operational Data Store to maintain atomic data elements relationally
organized, and the Data Warehouse to dimensionally store mostly aggregated data
and only business critical atomic data elements.
The Data Warehousing is an intricate
environment, and may include multiple components quite different as
functionality and complexity. There are two opposed methodologies in Data
Warehousing theory, introduced by Ralph Kimball and Bill Inmon, and many
combinations in between, such as the approach promoted by Open Data Systems and
introduced in Data Warehousing Environment - An Introduction (Open Data
Systems Inc., www.opendatasys.com, Publications
section, December 2001). This approach is based on the axiom that there is no
universal 'DW solution', but each organization will implement subsets of the
potential environment, shown in the figure below:

Figure 1 - Complete Business Intelligence Environment
The back-end layer is usually
enterprise-wide and will include one or more of the following enabling structures:
- Data Warehouse, storing long-term (25
years or more) historical data, dimensionally organized, aggregated and/or granular.
- Operational Data Store, maintaining
consolidated short-term (5-10 years) detailed history in a relational structure
similar with the source systems.
- Staging areas, temporarily storing
data in transition between permanent repositories.
Typically, very few users will get direct
access into the Data Warehouse, as they should have advanced knowledge about
the data warehouse structure and data (for example, to avoid sending an
all-tables all-data join that will seriously impact the DW performance). The
regular user layer may typically include the following classes of client
systems fed from the Data Warehouse:
- MOLAP (Multidimensional On Line
Analytical Processing) cubes, used for their fast response time within a set of
pre-calculated data set.
- Data Marts, usually a ROLAP
(Relational On Line Analytical Processing) system that subsets the DW information
for a particular business area that requires extensive ad-hoc explorations, at
the expense of data retrieval performance.
- DSS (Decision Support Systems) usually
in a proprietary format required by the rules engine functionality, but also as
dimensional structures used with universal Data Mining tools.
- ERS (Enterprise Reporting Systems,
also named EIS – Enterprise Information Systems), a de-normalized relational
structure specifically designed to support specific reporting requirements
using elements originating from multiple operational systems.
The following table briefly compares the
main classes of Data Warehouse clients and their main purpose (which defines
the type of data required and processing performed):
Each of these types of client systems has
different types of data requirements in order to satisfy their specific
end-user analytical needs. Even if they can receive data from the Operational
Data Store, the Data Warehouse is usually better suited as a source at least
for OLAP, Data Mining and MOLAP clients.
The basic data support in the Data
Warehouse Database for any BI client is achieved by storing the atomic data at
the highest granularity level (more detail) required to satisfy any potential
business demand for information. The client systems could then extract and
process that elementary level data to create derived and/or aggregated data,
pre-computed or on-demand.
Another option is to pre-aggregate some of
the required derived information in the Data Warehouse Database, and feed it as
a 'new' fact to the client tools. While this should definitely be done when the
atomic data is moved out of the on-line media (reaches its retention limit),
there are many voices arguing that this should be also done even if the detail
data still exists in the Data Warehouse.
There are many arguments in favour and
against each approach. The following table depicts a summary comparison of the
most important differences between the two options mentioned above:
It is quite obvious that the balance is
clearly tilted toward performing the aggregations in the Data Warehouse
Database. This is especially true for aggregates involving complex
transformations of the detailed data based on business-driven algorithms that,
in fact, are 'new' business measures derived from a set of underlying data
elements.
However, caution must be exercised to not
overdo and build every potential aggregate in the permanent storage. As a basic
rule, the processing that is not part of the standard BI tools functionality
(in other words, is custom built to satisfy specific business rules) should be
built as a permanent store in the Data Warehouse. The straight summary, average
etc. aggregations along one or more dimensions should be trusted to the client
layer, as both MOLAP and ROLAP class solutions handle very well these standard
BI operations.
Moreover, many RDBMS engines now offer
advanced OLAP and data mining functionality as integral part of the database
server software, enabling variations from the traditional aggregation approach.
For example, Oracle9i can store OLAP data in a relational star schema
(consisting of fact and dimension tables), but also multi-dimensionally in an
Analytic Workspace (AW), or in a combination of both.
Also, with the lowering cost of on-line
storage media (hard drives), in the recent years it became so much more
cost-effective to pre-process data and permanently store the ready for usage
results, versus on-demand processing that requires significant increased CPU
power to deliver acceptable performance levels.
In conclusion, creating business driven
database level aggregates enables a certain level of client / toolset
independence, data storage and protection within a specialized RDBMS and lower
overall Total Cost of Ownership. Without minimizing the role of the client
systems in a Data Warehousing environment, it is quite obvious that the first
candidate for aggregates deployment is the Data Warehouse Database, regardless
of the type of client system that will ultimately present the data to the
end-user.