
Chapter 4: Data Warehousing and On-line Analytical Processing
n Data Warehouse: Basic Concepts
n Data Warehouse Modeling: Data Cube and OLAP
n Data Warehouse Design and Usage
n Data Warehouse Implementation


Efficient Data Cube Computation
n At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions.
n In SQL terms, these aggregations are referred to as group-by’s.
n Each group-by can be represented by a cuboid, where the set of group-by’s form a lattice of cuboids defining a data cube.
n In this subsection, we explore issues related to the efficient computation of data cubes.

Efficient Data Cube Computation
n In SQL, the GROUP BY statement is often used to calculate aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.

Efficient Data Cube Computation
n For the following table, the SQL statement shown below lists the number of customers in each country, sorted by that count in descending order.
Part of Table “Customers”
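A query along the following lines, assuming the CustomerID and Country columns of the W3Schools Customers table linked below, produces such a listing:

SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country
ORDER BY COUNT(CustomerID) DESC;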

Efficient Data Cube Computation
n The previous Group By statement returns the following result:
Part of Query Result
Group By tutorial: https://www.w3schools.com/sql/sql_groupby.asp

The “Compute Cube” Operator
n One approach to cube computation extends SQL to Data Mining Query Language (DMQL) in order to include a compute cube operator.
n The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.
n This can require excessive storage space, especially for large numbers of dimensions.
n We start with an intuitive look at what is involved in the efficient computation of data cubes.

The “Compute Cube” Operator
n Suppose that you want to create a data cube for AllElectronics sales that contains the following: city, item, year, and sales in dollars. Formally, there are:
n Three dimensions: city, item, and year
n One measure: sales in dollars
n With DMQL, we can define and compute the data cube using the following statements:
define cube sales_cube [item, city, year]: sum(sales_in_dollars)
compute cube sales_cube
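For comparison, most modern relational systems offer similar functionality through the SQL GROUP BY CUBE extension. A rough equivalent of the DMQL statements above, assuming the data are held in a relational table sales(city, item, year, sales_in_dollars), would be:

SELECT city, item, year, SUM(sales_in_dollars) AS total_sales
FROM sales
GROUP BY CUBE (city, item, year);

A single statement of this form computes all of the group-by's (cuboids) over {city, item, year}.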

The “Compute Cube” Operator
n With the 3 dimensions and 1 measure, the total number of cuboids is 2^3 = 8.
n Namely, we need to compute the following group-by's: (city, item, year),
(city, item), (item, year), (city, year), (city), (item), (year), and the empty group-by ().
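In SQL dialects that support GROUPING SETS, this list of eight group-by's can also be written out explicitly; GROUP BY CUBE is shorthand for exactly this expansion (the sales table and its columns are assumed, as before):

SELECT city, item, year, SUM(sales_in_dollars) AS total_sales
FROM sales
GROUP BY GROUPING SETS (
    (city, item, year),
    (city, item), (item, year), (city, year),
    (city), (item), (year),
    ()   -- the apex cuboid: a single grand-total row
);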

The “Compute Cube” Operator
n These group-by’s form a lattice of cuboids for the data cube:

The “Compute Cube” Operator
n A statement “compute cube sales_cube” would explicitly instruct the system to compute the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty subset.
n OLAP often needs to access different cuboids for different queries.
n Therefore, it may seem like a good idea to compute in advance all or at least some of the cuboids in a data cube. Precomputation leads to fast response time and avoids some redundant computation.
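As a concrete sketch of precomputing one cuboid, systems that support materialized views (e.g., Oracle or PostgreSQL) let a cuboid be stored ahead of time; the table and column names below are assumptions carried over from the earlier example:

CREATE MATERIALIZED VIEW sales_city_item AS
SELECT city, item, SUM(sales_in_dollars) AS total_sales
FROM sales
GROUP BY city, item;

Queries that only need totals by city and item can then be answered from this much smaller cuboid instead of re-aggregating the base table.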

The “Compute Cube” Operator
n Curse of Dimensionality:
n A major challenge related to this precomputation, however, is that the required storage space may be too large if all the cuboids in a data cube are precomputed, especially when the cube has many dimensions.
n The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels.
n For example, time is usually explored not at only one conceptual level (e.g., year), but rather at multiple conceptual levels such as in the hierarchy “quarter < year”.
n This problem is referred to as the curse of dimensionality.

The “Compute Cube” Operator
n How many cuboids exist in an n-dimensional cube with each dimension i being associated with Li levels?
T = (L1 + 1) x (L2 + 1) x ... x (Ln + 1)
n Here Li is the number of levels associated with dimension i; the extra “+1” accounts for the virtual top level all, at which the dimension is rolled up completely.
n With the previous sales data cube, there are 3 dimensions and each dimension is associated with 1 level, which leads to (1+1) x (1+1) x (1+1) = 8 cuboids.

The “Compute Cube” Operator
n With the previous data cube, if the dimension “year” is replaced with “time” and time involves a two-level concept hierarchy (quarter and year), then there are (1+1) x (1+1) x (2+1) = 12 cuboids.
§ 3-D cuboids: (city, item, year), (city, item, year-quarter)
§ 2-D cuboids: (city, item), (city, year), (city, year-quarter), (item, year), (item, year-quarter)
§ 1-D cuboids: (city), (item), (year), (year-quarter)
§ 0-D cuboid: ()

The “Compute Cube” Operator
n If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial materialization; that is, to materialize only some of the possible cuboids that can be generated.
n There are three choices for data cube materialization given a base cuboid:
n Full materialization: materialize every cuboid
n No materialization: do not precompute any of the non-base cuboids
n Partial materialization: materialize only a selected subset of the cuboids
n Selecting which cuboids to materialize should be based on size, access frequency, etc.

Indexing OLAP Data: Bitmap Index
n The bitmap index method is popular in OLAP products because it allows quick search in data cubes.
n Index on a particular column (i.e., dimension)
n Each value in the column has its own bit vector; bitwise operations on these vectors are much faster than string comparisons
n The length of each bit vector: # of records in the base table
n The i-th bit is set to 1 if the i-th row of the base table has that value in the indexed column
n Bitmap index is not suitable for high-cardinality domains

Indexing OLAP Data: Bitmap Index
n In the AllElectronics data warehouse, suppose the dimension item at the top level has four values (representing item types): “home entertainment,” “computer,” “phone,” and “security.”
n Each value (e.g., “computer”) is represented by a bit vector in the item bitmap index table.
n Suppose that the cube is stored as a relation table (i.e., the base table in this example) with 100,000 rows.
n Because the domain of item consists of four values, the bitmap index table requires four bit vectors (or lists), each with 100,000 bits.

Indexing OLAP Data: Bitmap Index
(Figure: the item bitmap index table for the base table)

Indexing OLAP Data: Bitmap Index
n Bitmap indexing is especially useful for low-cardinality dimensions because comparison, join, and aggregation operations are then reduced to bit arithmetic, which substantially reduces the processing time.
n Note that the cardinality of a set is a measure of the “number of elements” of the set.
n For example, the set A = {2, 4, 6} contains 3 elements, and therefore A has a cardinality of 3.
n For higher-cardinality dimensions, the method can be adapted using compression techniques.
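As a sketch of how bitmap indexing is exposed in practice, Oracle-style systems allow bitmap indexes to be declared explicitly; the sales_fact table and its item and city columns are assumed here for illustration:

CREATE BITMAP INDEX sales_item_bix ON sales_fact (item);
CREATE BITMAP INDEX sales_city_bix ON sales_fact (city);

-- A selection on two low-cardinality columns can then be answered by
-- ANDing the bit vectors for item = 'computer' and city = 'Vancouver',
-- rather than by comparing strings row by row:
SELECT COUNT(*)
FROM sales_fact
WHERE item = 'computer'
  AND city = 'Vancouver';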
Indexing OLAP Data: Join Index
n In relational database query processing:
n Traditional index: maps the value in a given column to a list of rows having that value.
n Join index: join index records identify joinable tuples without performing costly join operations (because the joins are carried out in advance).
n In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
n A join index can span multiple dimensions to form a composite join index.

Indexing OLAP Data: Join Index
n We defined a star schema for AllElectronics before.
n An example of a join index relationship between the sales fact table and the location and item dimension tables is shown in Figure 4.16.
n For example:
n The “Main Street” value in the location dimension table joins with tuples T57, T238, and T884 of the sales fact table.
n Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of the sales fact table.

Indexing OLAP Data: Join Index
(Figure 4.16: join index relationship between the sales fact table and the location and item dimension tables)

Indexing OLAP Data: Join Index
n The corresponding join index tables are shown in Figure 4.17.

Efficient Processing of OLAP Queries
n The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes.
n Given materialized views, query processing should proceed as follows:
n Determine which operations should be performed on the available cuboids
n Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
n Determine which materialized cuboid(s) should be selected for the OLAP operations

Efficient Processing of OLAP Queries
n Suppose that we define a data cube for AllElectronics: “sales_cube [time, item, location]: sum(sales_in_dollars).”
n The dimension hierarchies used are:
n “day < month < quarter < year” for time;
n “item_name < brand < type” for item;
n “street < city < province_or_state < country” for location.
n Suppose that the query to be processed involves {brand, province_or_state}, with the selection constant “year = 2010.”

Efficient Processing of OLAP Queries
n Also, suppose that there are four materialized cuboids available, as follows:
n Cuboid 1: {year, item_name, city}
n Cuboid 2: {year, brand, country}
n Cuboid 3: {year, brand, province_or_state}
n Cuboid 4: {item_name, province_or_state}, where year = 2010
n Which of these four cuboids should be selected?
n Finer-granularity data cannot be generated from coarser-granularity data.
n Therefore, cuboid 2 cannot be used, because country is a more general concept than province_or_state.
n Cuboids 1, 3, and 4 can be used to process the query.

Efficient Processing of OLAP Queries
n In terms of cost, among cuboids 1, 3, and 4:
n It is likely that using cuboid 1 would cost the most, because both item_name and city are at a lower level than the brand and province_or_state concepts specified in the query.
n If there are not many year values associated with items in the cube, but there are several item_names for each brand, then cuboid 3 will be smaller than cuboid 4, and thus cuboid 3 should be chosen to process the query.
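To make the last point concrete: if cuboid 3 were materialized as a relational summary table, say a hypothetical cuboid3_year_brand_province table whose measure column is sales_in_dollars, the query could be rewritten against it rather than against the base data:

-- Answer the {brand, province_or_state} query for year = 2010 directly
-- from the precomputed cuboid, with no scan of the base fact table:
SELECT brand, province_or_state, SUM(sales_in_dollars) AS total_sales
FROM cuboid3_year_brand_province
WHERE year = 2010
GROUP BY brand, province_or_state;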
OLAP Server Architectures
n Logically, OLAP servers present business users with multidimensional data from data warehouses or data marts, without concerns regarding how or where the data are stored.
n Practically, implementations of a warehouse server for OLAP processing include the following architectures:
n Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational back-end server and client front-end tools.
n Use a relational or extended-relational DBMS to store and manage warehouse data
n Use OLAP middleware to support missing pieces
n Include optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
n Excellent scalability

OLAP Server Architectures
n Multidimensional OLAP (MOLAP) servers: These servers support multidimensional data views through array-based multidimensional storage engines.
n Map multidimensional views directly to data cube array structures
n Fast indexing to pre-computed summarized data
n Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
n For example, a HOLAP server may allow large volumes of detailed data to be stored in a relational database, while aggregations are kept in a separate MOLAP repository.
n Specialized SQL servers: Some database system vendors implement specialized support for SQL queries over star/snowflake schemas.

OLAP Server Architectures
n Specifically, how are data actually stored in ROLAP servers?
n ROLAP uses relational tables to store data for online analytical processing.
n The fact table associated with a base cuboid is referred to as a base fact table.
n The base fact table stores data at the abstraction level indicated by the join keys in the schema for the given data cube.
n Aggregated data can also be stored in fact tables, referred to as summary fact tables.
n Some summary fact tables store both base fact table data and aggregated data.
n Alternatively, separate summary fact tables can be used for each abstraction level, to store only aggregated data.

OLAP Server Architectures
n Table 4.4 shows a summary fact table that contains both base fact data and aggregated data. Note that this is a simplified example, where there are no separate dimension tables.
n The schema is “record_identifier (RID), item, ..., day, month, quarter, year, dollars_sold”, where day, month, quarter, and year define the sales date, and dollars_sold is the sales amount.

OLAP Server Architectures
n Consider the tuples with an RID of 1001 and 1002, respectively.
n The data of these tuples are at the base fact level, where the sales dates are October 15, 2010, and October 23, 2010, respectively.
n Consider the tuple with an RID of 5001.
n This tuple is at a more general level of abstraction than the tuples 1001 and 1002. The day value has been generalized to all, so that the corresponding time value is October 2010.
n That is, the dollars sold amount shown is an aggregation representing the entire month of October 2010, rather than just October 15 or 23, 2010. The special value “all” is used to represent subtotals in summarized data.
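A summary fact table of this kind can be generated with the standard SQL ROLLUP extension. In the sketch below, the base table name sales_base and its columns are assumptions matching the Table 4.4 schema, and the NULLs that ROLLUP produces play the role of the special value “all”:

SELECT item, year, quarter, month, day,
       SUM(dollars_sold) AS dollars_sold
FROM sales_base
GROUP BY item, ROLLUP (year, quarter, month, day);
-- Rows with day, month, quarter, and year all filled in correspond to base
-- fact rows (like RIDs 1001 and 1002); a row with day = NULL but the other
-- time columns filled in aggregates an entire month (like RID 5001).

Where supported, the GROUPING() function can be used to replace those NULLs with an explicit “all” marker when loading the summary fact table.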

Chapter 4: Data Warehousing and On-line Analytical Processing
n Data Warehouse: Basic Concepts
n Data Warehouse Modeling: Data Cube and OLAP
n Data Warehouse Design and Usage
n Data Warehouse Implementation

Summary
n Data warehousing: A multi-dimensional model of a data warehouse
n A data cube consists of dimensions & measures
n Star schema, snowflake schema, fact constellations
n OLAP operations: drilling, rolling, slicing, dicing and pivoting
n Data Warehouse Architecture, Design, and Usage
n Multi-tiered architecture
n Business analysis design framework
n Information processing, analytical processing, data mining, OLAM (Online Analytical Mining)
n Implementation: Efficient computation of data cubes
n Partial vs. full vs. no materialization
n Indexing OLAP data: Bitmap index and join index
n OLAP query processing
n OLAP servers: ROLAP, MOLAP, HOLAP
