CS计算机代考程序代写 data mining Excel SQL database python algorithm Haskell concurrency flex ER COMP9318: Data Warehousing and Data Mining

COMP9318: Data Warehousing and Data Mining
— L2: Data Warehousing and OLAP —
1

n Why and What are Data Warehouses?
2

Data Analysis Problems
n The same data found in many different systems n Example: customer data across different
departments
n The same concept is defined differently
n Heterogeneous sources
n Relational DBMS, OnLine Transaction
Processing (OLTP)
n Unstructured data in files (e.g., MS Excel) and documents (e.g., MS Word)
3

Data Analysis Problems (Cont’d)
n Data is suited for operational systems
n Accounting,billing,etc.
n Do not support analysis across business functions
n Data quality is bad
n Missing data, imprecise data, different use of
systems
n Data are “volatile”
n Data deleted in operational systems (6months)
n Data change over time – no historical information
4

Solution: Data Warehouse
n Definedinmanydifferentways,butnotrigorously.
n A decision support database that is maintained separately from
the organization’s operational database
n Support information processing by providing a solid platform of consolidated, historical data for analysis.
n “Adatawarehouseisasubject-oriented,integrated,time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
n Datawarehousing:
n The process of constructing and using data warehouses
5

Data Warehouse—Subject-Oriented
n Organized around major subjects, such as customer, product, sales.
n Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
n Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
6

Data Warehouse—Integrated
n Constructed by integrating multiple, heterogeneous data sources
n relational databases, flat files, on-line transaction records
n Data cleaning and data integration techniques are applied.
n Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
n E.g., Hotel price: currency, tax, breakfast covered, etc.
n When data is moved to the warehouse, it is converted.
7

Data Warehouse—Time Variant
n The time horizon for the data warehouse is significantly longer than that of operational systems.
n Operational database: current value data.
n Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years) n Every key structure in the data warehouse
n Contains an element of time, explicitly or implicitly
n But the key of operational data may or may not contain “time element”.
8

Data Warehouse—Non-Volatile
1. A physically separate store of data transformed from the operational environment.
2. Operational update of data does not occur in the data warehouse environment.
n Does not require transaction processing, recovery, and concurrency control mechanisms
n Requires only two operations in data accessing:
n initial loading of data and access of data.
9

Data Warehouse Architecture
n Extract data from operational data sources
n clean, transform n Bulk load/refresh
n warehouse is offline n OLAP-server provides
multidimensional view
n Multidimensional-olap (Essbase, oracle express)
n Relational-olap
(Redbrick, Informix, Sybase, SQL server)
10

Data Warehouse Architecture
Function-oriented systems
All subjects, integrated
Subject-oriented systems
Advanced analysis
11

Why Separate Data Warehouse?
n Highperformanceforbothsystems
n DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
n Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
n Differentfunctionsanddifferentdata:
n missing data: Decision support requires historical data which
operational DBs do not typically maintain
n data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
n data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
12

Why OLAP Servers?
n Differentworkload:
n OLTP (on-line transaction processing)
n MajortaskoftraditionalrelationalDBMS
n Day-to-dayoperations:purchasing,inventory,banking,manufacturing,payroll, registration, accounting, etc.
n OLAP (on-line analytical processing) n Majortaskofdatawarehousesystem
n Dataanalysisanddecisionmaking
n Querieshard/infeasibleforOLTP,e.g.,
n Which week we have the largest sales?
n Does the sales of dairy products increase over time?
n Generate a spread sheet of total sales by state and by year.
n DifficulttorepresentthesequeriesbyusingSQLçWhy?
13

OLTP vs. OLAP
OLTP OLAP
users
clerk, IT professional knowledge worker
function
day to day operations decision support
DB design
application-oriented subject-oriented
data
current, up-to-date detailed, flat relational isolated
historical,
summarized, multidimensional integrated, consolidated
usage
repetitive ad-hoc
access
read/write lots of scans index/hash on prim. key
unit of work
short, simple transaction complex query
# records accessed
tens millions
#users
thousands hundreds
DB size
100MB-GB 100GB-TB
metric
transaction throughput query throughput, response
14

Comparisons
Databases
Data Warehouses
Purpose
Many purposes; Flexible and general
One purpose: Data analysis
Conceptual Model
ER
Multidimensional
Logical Model
(Normalized) Relational Model
(Denormalized) Star schema / Data cube/cuboids
Physical Model
Relational Tables
ROLAP: Relational tables MOLAP: Multidimensional arrays
Query Language
SQL (hard for analytical queries)
MDX (easier for analytical queries)
Query Processing
Bitmap/Join indexes, Star join, Materialized data cube
15
B+-tree/hash indexes, Multiple join optimization, Materialized views

n The Multidimensional Model
16

The Multidimensional Model
n A data warehouse is based on a multidimensional data model which views data in the form of a data cube, which is a multidimensional generalization of 2D spread sheet.
n Key concepts:
n Facts: the subject it models
n Typically transactions in this course; other types includes snapshots, etc.
n Measures: numbers that can be aggregated n Dimensions: context of the measure
n Hierarchies:
n Provide contexts of different granularities (aka. grains)
n Goals for dimensional modeling:
n Surround facts with as much relevant context (dimensions) as possible ç Why?
17

Supermarket Example
n Subject: analyze total sales and profits n Fact: Each Sales Transaction
n Measure: Dollars_Sold, Amount_Sold, Cost
n Calculated Measure: Profit n Dimensions:
n Store
n Product n Time
18

Visualizing the Cubes
n A valid instance of the model is a data cube city
total Sales
product
p1
p2
p3
p4
NY
$454


$925
LA
$468
$800


SD
$296

$240

SF
$652

$540
$745
454
925
468
800
296
240
652
540
745
product
Concepts: cell, fact (=non-empty cell), measure,
dimensions
Q: How to generalize it to 3D?
19
city

3D Cube and Hierarchies
Concepts: hierarchy (a tree of dimension values), level Sales of book176 in NY in Jan can be found in this cell
NY Book176
DIMENSIONS
PRODUCT LOCATION TIME
ALL ALL ALL category region year
product
country quarter state month week city day
store
20
month
city
Jan
Feb
Mar
product

Hierarchies
Concepts: hierarchy (a tree of dimension values), level
ALL
Food Beverage Book
………
ALL
category product
ALL
category
subcategory brand
product
21
Which design is better? Why?
Book176

The (city, moth) Cuboid
Sales of ALL_PROD in NY in Jan
NY Book176
DIMENSIONS
PRODUCT LOCATION TIME
ALL ALL ALL category region year
product
country quarter state month week city day
store
22
month
city
Jan
Feb
Mar
product

All the Cuboids
Date
2Qtr
1Qtr VCR
3Qtr
4Qtr
TV PC
U.S.A
Canada Mexico
23
Assume: no other non-ALL levels on all dimensions.
Product
Country

All the Cuboids /2
TV PC
VCR
all
1Qtr
2Qtr Date 3Qtr
4Qtr
all
Total annual sales of TV in U.S.A.
U.S.A
Canada Mexico
all
24
Assume: no other non-ALL levels on all dimensions.
Product
Country

Lattice of the cuboids
Base cuboid
product, store country
3-dim cuboid 2-dim cuboid 1-dim cuboid
0-dim cuboid
product,quarter quarter
product, quarter, country quarter, country product
n n-dim cube can be represented as (D1, D2, …, Dd), where Di is the set of allowed values on the i-th dimension.
n if Di = Li (a particular level), then Di = all descendant dimension values of Li.
n ALL can be omitted and hence reduces the effective dimensionality Yd
n Acompletecubeofd-dimensionsconsistsof (ni+1)cuboids, i=1
where ni is the number of levels (excluding ALL) on i-th dimension. n They collectively form a lattice.
25

Properties of Operations
n All operations are closed under the multidimensional model
n i.e., both input and output of an operation is a cube
n So that they can be composed
Q: What’s the analogy in the Relational Model?
26

Common OLAP Operations
Q: what should be its value?
Sales of book176 in NY in Q1 here
PRODUCT LOCATION TIME
n Roll-up: move up the hierarchy
DIMENSIONS
NY Book176
ALL ALL ALL category region year product country quarter
state month week
city day
store
month
COMP9318: Data Warehousing and Data Mining
27
city
Q1
Q2
Q3
product

NY Book176
quarter
DIMENSIONS
PRODUCT LOCATION TIME
ALL ALL ALL category region year
NY Book176
product
country quarter state month week city day
store
28
month
city
city
Jan Feb
Mar
product product

Data Cube Measures: Three Categories
n Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
n E.g., count(), sum(), min(), max()
n Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
n E.g., avg(), min_N(), standard_deviation()
n Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
n E.g., median(), mode(), rank()
29

Common OLAP Operations
n Drill-down: move down the hierarchy
n more fine-grained aggregation
30

Slice and Dice Queries
n Slice and Dice: select and project on one or more dimension values
product
city
month
The output cube has smaller dimensionality than the input cube
Product = Book176
31

Pivoting
n Pivoting: aggregate on selected dimensions
n usually 2 dims (cross- tabulation)
NY Book176
Sales (of all products) in NY in Q1
=sum( ??? )
month








City
NY
Q1 Q2 Q3
month
32
pivot on (city, month)
city
Q1
Q2 Q3
product

A Reflective Pause
n Let’s review the definition of data cubes again.
n Key message:
n Disentangle the “object” from its “representation” or “implementation”
33

Modeling Exercise 1: Monthly Phone Service Billing
Theme: analyze the income/revenue of Telstra
34

Solution
n FACT
n MEASURE
n DIMENSIONS
35

36

n The Logical Model
37

Logical Models
n Two main approaches:
n Using relational DB technology:
n Star schema, Snowflake schema, Fact constellation
n Using multidimensional technology: n Just as multidimensional data cube
38

Universal Schema è Star Schema
n Many data warehouses adopt a star schema to represent
the multidimensional model
n Each dimension is represented by a dimension-table
n LOCATION (location_key, store, street_address, city, state, country, region)
n dimension tables are not normalized
n Transactions are described through a fact-table
n each tuple consists of a logical pointer to each of the dimension- tables (foreign-key) and a list of measures (e.g. sales $$$)
The universal schema for supermarket
Store City State
S136 Syd NSW
S173 Melb Vic
Prod Brand Category
76Ha Nestle Biscuit
76Ha Nestle Biscuit
$Sold #Sold Cost
40 10 18
20 5 11
39

The Star Schema
TIME
time_key day
day_of_the_week month
quarter
year
PRODUCT
product_key product_name category brand
color supplier_name
SALES
time_key product_key location_key
units_sold amount
LOCATION
location_key store street_address city
state country region
40
Think why:
(1) Denormalized once from the universal schema (2) Controlled redundancy

Typical Models for Data Warehouses
n Modeling data warehouses: dimensions & measures
n Star schema: A fact table in the middle connected to a
set of dimension tables
n Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
n Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
41

Example of Star Schema
time
time_key
day day_of_the_week month
quarter year
Sales Fact Table
item
time_key
time_key item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
location
location_key street
city state_or_province country
branch
item_key item_name brand
type supplier_type
branch_key branch_name branch_type
Measures
42

Example of Snowflake Schema
time
time_key
day day_of_the_week month
quarter year
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
branch
branch_key branch_name branch_type
Measures
item
item_key item_name brand
type supplier_key
location
location_key street city_key
city
supplier
supplier_key supplier_type
city_key
city state_or_province country
43

Example of Fact Constellation
time
time_key
day day_of_the_week month
quarter year
Shipping Fact Table
time_key
item_key
Shipper_key
from_location
to_location
dollars_cost
units_shipped
Sales Fact Table
time_key
item_key
item_key branch_key
location_key
units_sold
dollars_sold
avg_sales
branch
item_key item_name
brand
type supplier_type
location
location_key street
city province_or_state country
branch_key branch_name branch_type
Measures
shipper_key shipper_name location_key shipper_type 44
item
shipper

Advantages of Star Schema
n Facts and dimensions are clearly depicted
n dimension tables are relatively static, data is
loaded (append mostly) into fact table(s) n easy to comprehend (and write queries)
“Find total sales per product-category in our stores in Europe”
SELECT PRODUCT.category, SUM(SALES.amount) FROM SALES, PRODUCT,LOCATION
WHERE SALES.product_key = PRODUCT.product_key AND SALES.location_key = LOCATION.location_key AND LOCATION.region=“Europe”
GROUP BY PRODUCT.category
Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category)
45

n Query Language
46

Query Language
n Two approaches:
n Using relational DB technology: SQL (with extensions
such as CUBE/PIVOT/UNPIVOT)
n Using multidimensional technology: MDX
SELECT PRODUCT.category, SUM(SALES.amount)
FROM SALES, PRODUCT,LOCATION WHERE SALES.product_key =
PRODUCT.product_key
AND SALES.location_key = LOCATION.location_key
AND LOCATION.region=“Europe” GROUP BY PRODUCT.category
SELECT
{[PRODUCT].[category]} on ROWS,
{[MEASURES].[amount]} on COLUMNS FROM [SALES]
WHERE ([LOCATION].[region].[Europe])
Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category, Measures.amnt)47

n Physical Model + Query Processing Techniques
48

Physical Model + Query Processing Techniques
n Two main approaches:
n Using relational DB technology: ROLAP
n Using multidimensional technology: MOLAP
n Hybrid: HOLAP
n Base cuboid: ROLAP
n Other cuboids: MOLAP
49

Q1: Selection on low-cardinality attributes
Sregion=“Europe”
TIME
time_key day
day_of_the_week month
quarter
year
CUSTOMER
customer_key customer_name region
type
SALES
time_key customer_key location_key
units_sold amount
LOCATION
location_key store street_address city
state country region
measures {
50
• Ignoring the final GROUP BY for now
• Omitting the Product dimension
JOIN
JOIN

Indexing OLAP Data: Bitmap Index
(1) BI on dimension tables
n Index on an attribute (column) with low distinct values
n Each distinct values, v, is associated with a n-bit vector (n = #rows)
n The i-th bit is set if the i-th row of the table has the value v for the indexed column
n Multiple BIs can be efficiently combined to enable optimized scan of the table
Custom BI on Customer.Region
Chap 4.2 in [JPT10]
v
bitmap
Asia
10100
Europe
01001
America
00010
Cust
Region
Type
C1
Asia
Retail
C2
Europe
Dealer
C3
Asia
Dealer
C4
C5
America
Retail
Europe
Dealer
51

Indexing OLAP Data: Bitmap Index /2
(1) Bitmap join index (BI on Fact Table Joined with Dimension tables)
n Conceptually, perform a join, map each dimension value to the bitmap of corresponding fact table rows.
— ORACLE SYNTAX –
CREATE BITMAP INDEX sales_cust_region_bjix ON sales(customer.cust_region)
FROM sales, customer
WHERE sales.cust_id = customers.cust_id;
52
https://docs.oracle.com/cd/B28359_01/server.111/b28313/indexes.htm#CIHGAFFF

Indexing OLAP Data: Bitmap Index /3
Sales
Customer
Cust
Region
Type
C1
Asia
Retail
C2
Europe
Dealer
C3
Asia
Dealer
C4
America
Retail
C5
Europe
Dealer
time customer loc Sale
101 C1 100 1
173 C1 200 2
208 C2 100 3
863 C3 200 5
991 C1 100 8
1001 C2 200 13
1966 C4 100 21
2017 C5 200 34
BI on Sales(Customer.Region)
v
bitmap
Asia
11011000
Europe
00100101
America
00000010
53

Q2: Selection on high-cardinality attributes
TIME
time_key day
day_of_the_week month
quarter
year
CUSTOMER
customer_key customer_name region
type
SALES
time_key customer_key location_key
units_sold amount
LOCATION
location_key store street_address city
state country region
measures {
Scity=“Kingsford”
54
JOIN
JOIN

Indexing OLAP Data: Join Indices
Chap 4.3 in [JPT10]
n Join index relates the values of the dimensions of a star schema to rows in the fact table.
n a join index on city maintains for each distinct city a list of ROW-IDs of the tuples recording the sales in the city
n Join indices can span multiple dimensions OR
n can be implemented as bitmap- indexes (per dimension)
n use bit-op for multiple-joins
Location
Sales
R102
R117 R118
R124
1
1 1
1
city = Sydney city = Randwick city = Kingsford city = Coogee
Kingsford
R102,R117,R118,R124
55

Q3: Arbitrary selections on Dimensions
Sregion = “Asia”
TIME
time_key day
day_of_the_week month
quarter
year
Customer
customer_key customer_name region
type
SALES
time_key product_key location_key
units_sold amount
SisSchoolHolidy(day)
measures {
LOCATION
location_key store street_address city
state country region
Scity =~ /\S+ford/
56
JOIN
JOIN

Star Query and Star Join (Cont.)
Usually only part of the dim tables because of the selection predicates
Time Customer Loc
XX
Sales
time
1
2
loc
100
200
Cust
10 20
time
cust
loc
sold
1
10
100
7
1 1
10 20
150 150
13 2
2
20
200
16




millions of tuples
time
cust
loc
1
10
100
1 1 1 2 2 2
10
20
20
10
10
20
200
100
200
100
200
100
2
20
200
thousands of tuples
1000 2000 500 86
Sales !” σ1(Time) !” σ2(Cust) !” σ3(Loc) è
Sales !” (σ1(Time) X σ2(Cust) X
σ3(Loc))
57
Chap 4.4 in [JPT10]

Q4: Coarse-grain Aggregations
n “Find total sales per customer type in our stores in Europe”
n Join-index will prune 3⁄4 of the data (uniform sales), but the remaining 1⁄4 is still large (several millions transactions)
n Index is unclustered
n High-level aggregations are expensive!!!!!
ÞLong Query Response Times ÞPre-computation is necessary
ÞPre-computation is most beneficial
LOCATON
region
country state
city store
58

Cuboids = GROUP BYs
n Multidimensional aggregation = selection on corresponding cuboid
σ1 selects some Years, σ2 selects some Brands, σ3 selects some Cities,
n Materialize some/all of the cuboids
n A complex decision involving cuboid sizes, query workload, and physical organization
GB(type, city)( Sales !” σ1(Time) !” σ2(Cust) !” σ3(Loc)) )
GB(type, city)( σ1’2’3’(Cuboid(Year, Type, City)) )
59

Two Issues
n How to store the materialized cuboids? n How to compute the cuboids efficiently?
60

CUBE BY in ROLAP
Sales
Product
1
454
2

3

4
925
ALL
1379
1
2
468
800


1268
3
296

240

536
4 ALL
652 1870
– 800
540 780
745 1670
1937 5120
Store Product_key sum(amout)
1 1454
1 4925
2 1468
2 2800
3 1296
3 3240
4 1625
4 3240 4 4745
1 ALL 1379
2 ALL 1268
3 ALL 536
4 ALL 1937
ALL 1 1870 ALL 2 800 ALL 3 780 ALL 4 1670 ALL ALL 5120
4 Group-bys here:
(store,product) (store) (product)
()
• Need to write 4 queries!!!
• Compute them independently
SELECT LOCATION.store, SALES.product_key, SUM (amount) FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store
61
Store

Top-down Approach
n Model dependencies among the aggregates: most detailed “view”
can be computed from view (product,store,quarter) by summing-up all quarterly sales
product,quarter quarter
product,store,quarter
store,quarter
product none
product, store store
Cuboid A
A
1
2
B
(all)
(all)
M
???
???
Cuboid AB
A
1
1
1
1
2
B
1
3
2
1
4
M
10
20
30
40
50
Cuboid B
A
(all)
(all)
(all)
(all)
B
1
2
3
4
M
???
???
???
???
62

Bottom-Up Approach (BUC)
n BUC (Beyer & Ramakrishnan, SIGMOD’99)
n Ideas
n Compute the cube from
bottom up
n Divide-and-conquer
n A simpler recursive version: n BUC-SR

A
B

1
1

1
3

1
2

1
1

2


63

Understanding Recursion /1
n Powerful computing/problem-solving techniques
n Examples
n Factorial:
n f(n)=1,ifn=1
n f(n)=f(n-1)*n,ifn≥1
n Quick sort:
n Sort([x]) = [x]
n Sort([x1, …, pivot, … xn]) = sort[ys] ++ sort[zs]), where
f(0) = 0! = ???
ys = [ x | x in xi, x ≤ pivot ] zs = [ x | x <- xi, x > pivot ]
List comprehension in Haskell or python
64

Understanding Recursion /2
n LetC(n,m)bethenumberofwaystoselectmballsfromn numbered balls
n ShowthatC(n,m)=C(n-1,m-1)+C(n-1,m)
entire solution space
n Example:m=3,n=5
n Consideranyballinthe5balls,e.g.,‘d’
?1
?2
?3
d
?1 (no d)
?2(no d)
?1 (no d)
?2 (no d)
?3 (no d)
?i in{ a b c d e }
65

Key Points
n Sub-problems need to be “smaller”, so that a simple/trivial boundary case can be reached
n Divide-and-conquer
n There may be multiple ways the entire solution space can be divided into disjoint sub-spaces, each of which can be conquered recursively.
66

Geometric Intuition /1
n Reduce Cube(in 2D) to Cube(in 1D)
b1 b2 b3 a2
a1
M11
M12
M13
[Step 1]
M21
M22
M23
[Step 1]
[Step 2]
[Step 2]
[Step 2]
[Step 3]
M11
M12
M13
[Step 1]
[a1] ⨉
b1 b2 b3 *
M21
M22
M23
[Step 1]
[a2] ⨉
[Step 2]
[Step 2]
[Step 2]
[Step 3]
[*] ⨉
67

Geometric Intuition /2
n Reduce Cube(in 3D) to Cube(in 2D)
68
A
C B

Geometric Intuition /3
n Reduce Cube(in 3D) to Cube(in 2D)
69
A
A
C B
C B

Algebraic Derivation
n How to compute n-dim cube on (n+1)-dim base cuboid (array)?
n What do output tuples look like?
n How to compute (n+1)-dim cube on (n+1)-dim
base cuboid (array)?
n What else do we need?
[{r1
A
1
1
1
1
2
-]
r5}, ABC
[{r1
-]
r5}, BC
r1 r2
r3 r4 r5
B
1
1
2
3
1
C
1
2
1
1
1
M
10
20
30
40
50
70

BUC-SR (Simple Recursion)*
n BUC-SR(data, dims) n If (dims is empty)
n Output (sum(data)) n Else
n Dims = [dim1, rest_of_dims]
n For each distinct value v of dim1 n slice_v = slice of data on “dim1 = v” n BUC-SR(slice_v, rest_of_dims)
n data’ = Project(data, rest_of_dims) n BUC-SR(data’, rest_of_dims)
71
You need to fix the incomplete output part of the algorithm!
Boundary case:
data is essentially a list of measure values
General case:
1)Slice on dim1. Call BUC-SR recursively for each slice
2)Project out dim1, and call BUC-SR on it recursively

Input
Internal Output
Output
Example
[{r1

r4}, B]
M
BM
A
BM
B
1 30
1
1 30
1
10
2 30
1 2 30
123*
1 30 30 40
100
2 50
50
*
80 30 40 150
1
20
3 40
1
3 40
2
30
* 100
1 * 100
3
40
[{r5}, B]
BM
1 50
* 50
A
B
M
B
M
1
50
2
1
50
[{r1

r5}, AB]
2 * 50
A
B
M
1
1
10
1
1
20
1
2
30
1
3
40
2
1
50
r1 r2
r3 r4 r5
B
M
1
80
2 30
3
40
* 150
[{r1’

r5’}, B]
A
B
M
* 1 80 * 2 30 * 3 40 * * 150
B
M
1
80
2
30
3
40
72

Try a 3D-Cube by Yourself
[{r1

r5}, ABC]
A
1
1
1
1
B
1
1
2
3
1
C
1
2
1
1
1
M
10
20
30
40
50
r1 r2
r3
r4
r5
2
4/3/21
73

MOLAP
n (Sparse) array-based multidimensional storage engine
n Pros:
n small size (esp. for dense cubes)
n fast in indexing and query processing
n Cons:
n scalability
n conversion from relational data
74

Multidimensional Array
Step 2: An injective map from cell to offset
f(time, item) = 4*time + item
time
item
dollars_sold
Q1
home entertainment
605
Q2
home entertainment
680
Q3
home entertainment
812
Q4
home entertainment
927
Q1
computer
825
Q2
computer
952
Q3
computer
1023
Q4
computer
1038
Q1
phone
14
Q2
phone
31
Q3
phone
30
Q4
phone
38
Q1
security
400
Q2
security
512
Q3
security
501
Q4
security
580
time
item
dollars_s old
0
0
605
1
0
680
2
0
812
3
0
927
0
1
825
1
1
952
2
1
1023
3
1
1038
0
2
14
1
2
31
2
2
30
3
2
38
0
3
400
1
3
512
2
3
501
3
3
580
Step 1
Mappings
offset
0 4 8
12
1
5
9
13
2
6
10
14
3
7
11
15
time
value
Q1
0
Q2
1
Q3
2
Q4
3
item
value
home entertain ment
0
computer
1
phone
2
security
3
75

offset
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Multidimensional Array
Dense MD array
605
825
14
400
680
952
31
512
812
1023
30
501
927
1038
38
580
dollars_sold
605
825
14
400
680
952
31
512
812
1023
30
501
927
1038
38
580
Step 3: If dense, only need to store sorted slots
n Think:howtodecodeaslot?
76

The Sparse Case
Step 2: An injective map from cell to offset
time
item
dollars_sold
Q1
home entertainment
605
/* same table but with
home
Q2 entertainment 680
many ro
w
s deleted to
ho
Q3 entertainment 812
me
make it sparse */
home
Q4 entertainment 927
Q1 computer 825
Q2 computer 952
Q3 computer 1023
Q4 computer 1038
Q1 phone 14
Q2 phone 31
Q3 phone 30
Q4 phone 38
Q1 security 400
Q2 security 512
Q3 security 501
Q4
security
580
time
0
item
0
dollars_s old
605
1 0 680 4 /* same table but with
2 0 812 8 many rows deleted to
3 0 927 12 make it sparse */
0 1 825 1 1 1 952 5 211023 9 3 1 1038 13 0 2 14 2 1 2 31 6 2 2 30 10 3 2 38 14 0 3 400 3 1 3 512 7 23501 11
3
3
580
Step 1
Mappings
offset
0
f(time, item) = 4*time + item
time
value
Q1
0
Q2
1
Q3
2
Q4
3
item
value
home entertain ment
0
computer
1
phone
2
security
3
15
77

Multidimensional Array
Choice 1
offset
0
15
dollars_sold
605
580
Choice 2
Dense MD array
605














580
n Think:howtodecodeaslot? n Multidimensionalarrayis
typically sparse
n Use sparse array (i.e., offset + value)
n Could use chunk to further reduce the space
n Spaceusage:
n (d+1)*n*4 vs 2*n*4
n HOLAP:
n Store all non-base cuboid
in MD array
n Assign a value for ALL
78