Web-enabled OLAP Tutorial
 
TABLE OF CONTENTS


- DW Overview

--------Architecture
--------Back-end Tools

- Intro to OLAP

--------Codd's 12 Rules
--------FASMI

- MD Data Structures

- OLAP Server

- OLAP Operations

- OLAP Architectures

--------MOLAP: Part I
--------MOLAP: Part II
--------ROLAP: Part I
--------ROLAP: Part II
--------HOLAP
--------DOLAP

- Data Explosion

- OLAP Criteria

- Glossary

- References

DW Overview: Back-end Tools

Figure 1. Data Warehouse Architecture

As highlighted by the red circle in the diagram, data warehousing systems use various data extraction and cleaning tools, together with load and refresh utilities, to populate the data warehouse. Below we describe these back-end tools and utilities.

Data Cleaning

A data warehouse integrates large volumes of data from multiple sources, so the probability of errors and anomalies in the data is high. Typical examples include inconsistent field lengths, inconsistent descriptions, inconsistent value assignments, missing entries, and violations of integrity constraints. Three classes of data cleaning tools are commonly used to detect and correct such anomalies:

  • Data migration tools - allow simple transformation rules to be specified.
  • Data scrubbing tools - use domain-specific knowledge to scrub data from multiple sources.
  • Data auditing tools - discover rules and relationships by scanning the data.
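The migration-tool style of simple transformation rules can be sketched as a pipeline of rules applied to each source record. The rule set and record fields below are illustrative assumptions, not any particular tool's API:

```python
# A minimal sketch of migration-style transformation rules used in data
# cleaning. Field names and rules are hypothetical examples.

def clean_record(record, rules):
    """Apply each transformation rule in order to one source record."""
    for rule in rules:
        record = rule(record)
    return record

# Example rules targeting the anomalies described above.
rules = [
    lambda r: {**r, "state": r["state"].strip().upper()},   # inconsistent value assignments
    lambda r: {**r, "cust_id": r["cust_id"].zfill(8)},      # inconsistent field lengths
    lambda r: {**r, "phone": r.get("phone") or "UNKNOWN"},  # missing entries
]

raw = {"cust_id": "42", "state": " ca ", "phone": None}
print(clean_record(raw, rules))
# {'cust_id': '00000042', 'state': 'CA', 'phone': 'UNKNOWN'}
```

Scrubbing and auditing tools go further than fixed rules like these: scrubbers draw on domain knowledge (e.g. postal address formats), while auditors infer the rules themselves from the data.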

Load

After extraction, cleaning, and transformation, the data is loaded into the data warehouse. A load utility must allow the system administrator to monitor status; to cancel, suspend, and resume a load; and to restart after failure with no loss of data integrity. A sequential load can take a very long time to complete, especially with terabytes of data, so pipelined and partitioned parallelism are typically used. Most commercial utilities also favor incremental loading over a full load, since it reduces the volume of data that has to be incorporated into the data warehouse.
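The ideas of partitioned parallelism, incremental loading, and restart after failure can be combined in a small sketch. The partitioning scheme, checkpoint set, and loader function below are illustrative assumptions, not a real load utility:

```python
# A minimal sketch of a partitioned, restartable incremental load.
# The partition count, checkpoint set, and warehouse stand-in are
# hypothetical, chosen only to illustrate the technique.

from concurrent.futures import ThreadPoolExecutor

N_PARTITIONS = 4
warehouse = {i: [] for i in range(N_PARTITIONS)}  # stand-in for warehouse partitions
completed = set()                                 # checkpoints: partitions already loaded

def load_partition(pid, rows):
    """Load one partition; a restart skips partitions already checkpointed."""
    if pid in completed:
        return                       # restart after failure: no duplicate load
    warehouse[pid].extend(rows)
    completed.add(pid)               # checkpoint only after a successful load

def incremental_load(changed_rows):
    """Only changed rows (not a full reload) are partitioned and loaded in parallel."""
    parts = {i: [] for i in range(N_PARTITIONS)}
    for row in changed_rows:
        parts[hash(row["key"]) % N_PARTITIONS].append(row)
    with ThreadPoolExecutor(max_workers=N_PARTITIONS) as pool:
        for pid, rows in parts.items():
            pool.submit(load_partition, pid, rows)

incremental_load([{"key": k, "amount": k * 10} for k in range(8)])
```

If the load fails partway, rerunning `incremental_load` with the same batch finishes only the partitions whose checkpoints are missing, preserving data integrity without repeating completed work.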

Refresh

There are two issues to consider when refreshing a data warehouse: when to refresh and how to refresh. Typically the warehouse is refreshed periodically, under a refresh policy set by the data warehouse administrator according to business needs and traffic. The refresh technique depends on the characteristics of the source database and the capabilities of the database servers. Most contemporary database systems provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas. There are two basic replication techniques:

  • Data shipping - treats a table in the data warehouse as a remote snapshot of a table in the source database. A trigger is used to update a snapshot log table whenever the source table changes.
  • Transaction shipping - uses the regular transaction log instead of triggers and a special snapshot log table. At the source database site, the transaction log is used to detect updates on replicated tables, and those log records are transferred to a replication server.
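The data shipping technique can be sketched as a trigger-like hook that records each source-table change in a snapshot log, which a refresh step then replays into the warehouse replica. The table names and the in-process "trigger" are simplified assumptions standing in for real database triggers and replication servers:

```python
# A minimal sketch of data shipping. A trigger-like hook appends each
# source change to a snapshot log; refresh() replays the log into the
# warehouse replica. Names and structures are illustrative assumptions.

source_table = {}   # primary table at the source database
snapshot_log = []   # snapshot log table maintained by the "trigger"
replica = {}        # remote snapshot of the table in the data warehouse

def update_source(key, value):
    """Update the source table; the trigger records the change."""
    source_table[key] = value
    snapshot_log.append((key, value))   # trigger writes to the snapshot log

def refresh():
    """Incremental refresh: ship only logged changes, then clear the log."""
    for key, value in snapshot_log:
        replica[key] = value
    snapshot_log.clear()

update_source("cust_1", {"region": "EMEA"})
update_source("cust_2", {"region": "APAC"})
refresh()
# replica now matches source_table without copying the whole table
```

Transaction shipping would replace the trigger and snapshot log here with a scan of the regular transaction log, from which the replication server extracts updates to the replicated tables.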