Data Explosion
Data explosion is the phenomenon that occurs in multidimensional models where the derived or calculated values significantly exceed the base values. There are three main factors that contribute to data explosion as listed below:
- Sparsely populated base data
- Many dimensions in a model
- A high number of calculated levels in each dimension
Figure 1 |
Figure 2 |
As demonstrated in Figures 1 & 2, we begin with our original 100% dense table. Suppose we wanted to find combination of sales figures for products A+B, C+D, and A+B+C+D. Suppose also that we wanted to figure this information over 4 quarters of the year. Notice how these have drastically added to the amount of data being stored in comparison to the original.
When ignoring data explosion, one of the major consequences is massive databases that are hundreds or thousands of times larger than is necessary. Also it requires expensive hardware to process and accommodate exploded data, which leads to expensive cost. Load or calculation times take much longer, which results a poor performance. The hidden cost may incur due to the failing to provide timely and relevant enterprise system. Therefore, the data explosion is the single most limiting factor in many large OLAP implementations. So what are some issues and solutions to this data explosion potential in OLAP?
Common Misdiagnoses
- Poor suppression of sparsity. The sparsity of data will only increase the size of the database a fraction of what data explosion.
- Multidimensional database storage. Data explosion is independent of the type of database (MOLAP, ROLAP). They are equally susceptible to data explosion.
- Lack of data compression. Data compression can shrink the data somewhat, but it does not resolve the issue of data explosion.
- Software errors. Database explosion is not a result of software bugs. Database explosion has more to do with how the data is being stored, rather than accidental duplication of data from software bugs.
Sparsity
Figure 3. An example of
dense dimension |
Figure 4. An example of sparse dimension |
Sparse data can be understood as spread out data. Imagine a dimension from ACME company, which specifies the sales figures of products A, B, C, and D over an entire year. We have sales figures for all products over the year and therefore, there are no cells holding 0's. A dimension without empty information (in this case 0's) results in a density of 100%, meaning the data is not very sparse (shown in Figure 3).
If we were to subdivide 1 year into months, weeks, days, and hours we would add more 0's as shown by Figure 4. Notice how changing the year has suddenly added multiple 0's to the table. As additional dimensions are added, the data will become sparser creating an increase in the number of 0's stored.
Although sparsity is a problem which makes increase the amount of data by about 10 times, it is not data explosion. Remember that data explosion is on the order of hundreds to thousands of times larger. However, a sparsely populated database is a likely start for data explosion.
Problem
Whether using ROLAP, MOLAP, or any other forms of OLAP databases, the original operation data has been extracted, transformed, and loaded into an OLAP database. Simply converting the original data into a star/snowflake schema (for ROLAP) or a multidimensional cube (for MOLAP) will reduce the space taken by the original data. If we were to port the data directly it would actually be a more dense (less sparse) set of data.
Because OLAP seeks to provide information as quickly as possible, multiple additional pre-calculated are stored in order to processing time. Because the assumption that disk space is relatively cheap, we often err on the side of storing more pre-calculated data in order to have the quickest response times. However, this causes an explosion in the amount of data that is stored on the order of 10 0's to 1000's of times the original copied operational data.
Figure 5
|
Although it can be applied to ROLAP in similar fashion, it is easiest to visualize the following effect. As shown in Figure 5, the blue cylinder represents all the original operation data sources. They are copied (via ETL) into the slightly smaller red cube. Each pre-calculation for the data stored in the cube is represented by the surrounding grey area. This causes multiple unnecessary grey pre-calculated data to be stored because of the sparsity of the original red cube. Any run-time data (stored only temporarily) is represented by the wooden cube, further expanding the total volume of the cube. This visually represents the “explosion” from pre-calculated data.
How to Avoid Database Explosion
The solutions to this database explosion problem are based on the following two concepts:
- Avoid fully pre-calculating any multidimensional object with more than five sparse dimensions.
- Reduce the sparsity of individual data objects by good application design and by using a multicube rather than a hypercube approach. This allows each object to have the minimum number of necessary dimensions. This is the reason why many OLAP systems designed for large applications use a multicube database structure.
How Much and What are Recommended to Pre-calculate?
The optimum amount of pre-calculation depends on various factors, which include the hardware, software, and network characteristics, the number of run-time users, or the complexity of the calculations. Typically there is no choice but to adopt a trial and error approach to find the optimum setting.
What data should be pre-calculated? If data depends on many other cells or complex formulas, then it will slow down calculation at run-time. Also data that are frequently viewed or the basis of many other calculations should be pre-calculated.
|