程序代写代做代考 data mining concurrency python algorithm flex Excel database ER Haskell SQL 2dw

2dw

COMP9318: Data Warehousing
and Data Mining

— L2: Data Warehousing and OLAP —

n Why and What are Data Warehouses?

Data Analysis Problems

n The same data found in many different systems
n Example: customer data across different

departments
n The same concept is defined differently

n Heterogeneous sources
n Relational DBMS, OnLine Transaction

Processing (OLTP)
n Unstructured data in files (e.g., MS Excel) and

documents (e.g., MS Word)

Data Analysis Problems (Cont’d)

n Data is suited for operational systems
n Accounting,billing,etc.
n Do not support analysis across business

functions
n Data quality is bad

n Missing data, imprecise data, different use of
systems

n Data are “volatile”
n Data deleted in operational systems (6months)
n Data change over time – no historical

information 4

Solution: Data Warehouse

n Defined in many different ways, but not rigorously.

n A decision support database that is maintained separately from
the organization’s operational database

n Support information processing by providing a solid platform of
consolidated, historical data for analysis.

n “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon

n Data warehousing:

n The process of constructing and using data warehouses

Data Warehouse—Subject-Oriented

n Organized around major subjects, such as customer,
product, sales.

n Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.

n Provide a simple and concise view around particular subject
issues by excluding data that are not useful in the decision
support process.

Data Warehouse—Integrated

n Constructed by integrating multiple, heterogeneous data
sources
n relational databases, flat files, on-line transaction

records
n Data cleaning and data integration techniques are

applied.
n Ensure consistency in naming conventions, encoding

structures, attribute measures, etc. among different
data sources

n E.g., Hotel price: currency, tax, breakfast covered, etc.
n When data is moved to the warehouse, it is converted.

Data Warehouse—Time Variant

n The time horizon for the data warehouse is significantly
longer than that of operational systems.
n Operational database: current value data.
n Data warehouse data: provide information from a

historical perspective (e.g., past 5-10 years)
n Every key structure in the data warehouse

n Contains an element of time, explicitly or implicitly
n But the key of operational data may or may not contain

“time element”.

Data Warehouse—Non-Volatile

1. A physically separate store of data transformed from
the operational environment.

2. Operational update of data does not occur in the data
warehouse environment.
n Does not require transaction processing, recovery,

and concurrency control mechanisms
n Requires only two operations in data accessing:

n initial loading of data and access of data.

Data Warehouse Architecture
n Extract data from

operational data sources
n clean, transform

n Bulk load/refresh
n warehouse is offline

n OLAP-server provides
multidimensional view

n Multidimensional-olap
(Essbase, oracle
express)

n Relational-olap
(Redbrick, Informix,

Sybase, SQL server)

Data Warehouse Architecture

Function-oriented
systems

Subject-oriented
systems

All subjects,
integrated

Advanced
analysis

Why Separate Data Warehouse?

n High performance for both systems

n DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery

n Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.

n Different functions and different data:

n missing data: Decision support requires historical data which
operational DBs do not typically maintain

n data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources

n data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled

Why OLAP Servers?

n Different workload:

n OLTP (on-line transaction processing)

n Major task of traditional relational DBMS

n Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll,

registration, accounting, etc.

n OLAP (on-line analytical processing)

n Major task of data warehouse system

n Data analysis and decision making

n Queries hard/infeasible for OLTP, e.g.,

n Which week we have the largest sales?
n Does the sales of dairy products increase over time?
n Generate a spread sheet of total sales by state and by year.

n Difficult to represent these queries by using SQL ç Why?

OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date

detailed, flat relational
isolated

historical,
summarized, multidimensional
integrated, consolidated

usage repetitive ad-hoc
access read/write

index/hash on prim. key
lots of scans

unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

Comparisons

Databases Data Warehouses

Purpose Many purposes; Flexible and
general

One purpose: Data analysis

Conceptual Model ER Multidimensional

Logical Model (Normalized) Relational Model (Denormalized) Star schema /
Data cube/cuboids

Physical Model Relational Tables ROLAP: Relational tables
MOLAP: Multidimensional arrays

Query Language SQL (hard for analytical
queries)

MDX (easier for analytical
queries)

Query Processing B+-tree/hash indexes, Multiple
join optimization, Materialized
views

Bitmap/Join indexes, Star join,
Materialized data cube

n The Multidimensional Model

The Multidimensional Model
n A data warehouse is based on a multidimensional data

model which views data in the form of a data cube, which
is a multidimensional generalization of 2D spread sheet.

n Key concepts:
n Facts: the subject it models

n Typically transactions in this course; other types includes
snapshots, etc.

n Measures: numbers that can be aggregated
n Dimensions: context of the measure

n Hierarchies:
n Provide contexts of different granularities (aka. grains)

n Goals for dimensional modeling:
n Surround facts with as much relevant context

(dimensions) as possible ç Why?
17

Supermarket Example

n Subject: analyze total sales and profits
n Fact: Each Sales Transaction

n Measure: Dollars_Sold, Amount_Sold, Cost
n Calculated Measure: Profit

n Dimensions:
n Store
n Product
n Time

Visualizing the Cubes

n A valid instance of the model is a data cube

total
Sales

product
p1 p2 p3 p4

ci
ty

NY $454 – – $925
LA $468 $800 – –
SD $296 – $240 –
SF $652 – $540 $745

product

city

652 540 745

296 240

468 800

454 925

Concepts: cell, fact (=non-empty cell), measure,
dimensions

Q: How to generalize it to 3D?

3D Cube and Hierarchies
pr

od
uc

t
cit

month

ALL ALL ALL

category region year

product country quarter

state month week

city day

store

PRODUCT LOCATION TIME
NY

Book176

Fe
b

DIMENSIONS
Sales of book176 in NY in Jan can be found in this cell

Ja
n

M
ar

Concepts: hierarchy (a tree of dimension values), level

Hierarchies

category

subcategory

ALL

brand

B
oo
k1
76

Which design is better? Why?

The (city, moth) Cuboid
pr

od
uc

t
cit

month

ALL ALL ALL

category region year

product country quarter

state month week

city day

store

PRODUCT LOCATION TIME
NY

Book176

Fe
b

DIMENSIONS

Sales of ALL_PROD in NY in Jan
Ja

M
ar

All the Cuboids

Date

Pr
od
uc
t

C
ou
nt
ry

VCR
PC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

Assume: no other non-ALL
levels on all dimensions.

All the Cuboids /2

Total annual sales
of TV in U.S.A.Date

Pr
od

uc
t

C
ou

nt
ryall

all
TV

VCR
PC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

all

Assume: no other non-ALL
levels on all dimensions.

Lattice of the cuboids
product, quarter, country

product countryquarter

quarter, countryproduct,quarter product, store

Base cuboid

3-dim cuboid

2-dim cuboid

1-dim cuboid

0-dim cuboid

n n-dim cube can be represented as (D1, D2, …, Dd), where Di is the set
of allowed values on the i-th dimension.
n if Di = Li (a particular level), then Di = all descendant dimension

values of Li.
n ALL can be omitted and hence reduces the effective

dimensionality
n A complete cube of d-dimensions consists of cuboids,

where ni is the number of levels (excluding ALL) on i-th dimension.
n They collectively form a lattice.

i=1

(ni + 1)

Properties of Operations

n All operations are closed under the
multidimensional model
n i.e., both input and output of an operation is a

cube
n So that they can be composed

Q: What’s the analogy in the Relational Model?

Common OLAP Operations

n Roll-up: move up the
hierarchy

COMP9318: Data Warehousing and Data Mining

pr
od

uc
t

cit
y

month

ALL ALL ALL

category region year

product country quarter

state month week

city day

store

PRODUCT LOCATION TIME
NY

Book176

Q
2

DIMENSIONSSales of book176 in NY in Q1 here

Q
1

Q
3

Q: what should be its value?

pr
od

uc
t

cit
y

month

NY
Book176

Fe
b

Ja
n

M
ar

ALL ALL ALL

category region year

product country quarter

state month week

city day

store

PRODUCT LOCATION TIME
DIMENSIONS

cit
y

pr
od

uc
tNY

Book176

quarter

Common OLAP Operations

n Drill-down: move down
the hierarchy
n more fine-grained

aggregation

Slice and Dice Queries

n Slice and Dice: select and project on one or more
dimension values

cit
y

product

month

Product = Book176

The output cube has smaller
dimensionality than the input cube

Pivoting

n Pivoting: aggregate on
selected dimensions
n usually 2 dims (cross-

tabulation)

pivot on (city, month)
Sales (of all products) in
NY in Q1
=sum( ??? )

pr
od
uc
t

Book176

month

Q
2

Q
1

Q
3

cit
y

month

City

… …

Q2Q1 Q3

A Reflective Pause

n Let’s review the definition of data cubes again.

n Key message:
n Disentangle the “object” from its

“representation” or “implementation”
32

Modeling Exercise 1: Monthly Phone
Service Billing

Theme: analyze the income/revenue of Telstra

Solution

n FACT

n MEASURE

n DIMENSIONS

n The Logical Model

Logical Models

n Two main approaches:
n Using relational DB technology:

n Star schema, Snowflake schema, Fact constellation
n Using multidimensional technology:

n Just as multidimensional data cube

Universal Schema è Star Schema
n Many data warehouses adopt a star schema to represent

the multidimensional model
n Each dimension is represented by a dimension-table

n LOCATION (location_key, store, street_address, city, state,
country, region)

n dimension tables are not normalized
n Transactions are described through a fact-table

n each tuple consists of a pointer to each of the dimension-tables
(foreign-key) and a list of measures (e.g. sales $$$)

Store City State Prod Brand Category $Sold #Sold Cost

S136 Syd NSW 76Ha Nestle Biscuit 40 10 18
S173 Melb Vic 76Ha Nestle Biscuit 20 5 11

The universal schema for supermarket

The Star Schema

time_key
day
day_of_the_week
month
quarter
year

TIME

location_key
store
street_address
city
state
country
region

LOCATION

SALES

product_key
product_name
category
brand
color
supplier_name

PRODUCT

time_key

product_key

location_key

units_sold

amount

Think why:
(1) Denormalized once from the universal schema
(2) Controlled redundancy 39

Typical Models for Data Warehouses

n Modeling data warehouses: dimensions & measures
n Star schema: A fact table in the middle connected to a

set of dimension tables
n Snowflake schema: A refinement of star schema

where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake

n Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation

Example of Star Schema

time_key
day
day_of_the_week
month
quarter
year

time

location_key
street
city
state_or_province
country

location

Sales Fact Table

time_key
time_key
item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key
item_name
brand
type
supplier_type

item

branch_key
branch_name
branch_type

branch

Example of Snowflake Schema

time_key
day
day_of_the_week
month
quarter
year

time

location_key
street
city_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key
item_name
brand
type
supplier_key

item

branch_key
branch_name
branch_type

branch

supplier_key
supplier_type

supplier

city_key
city
state_or_province
country

city

item_key
item_keybranch_key

Example of Fact Constellation

time_key
day
day_of_the_week
month
quarter
year

time

location_key
street
city
province_or_state
country

location

Sales Fact Table

time_key

location_key

units_sold

dollars_sold

avg_sales
Measures

item_key
item_name
brand
type
supplier_type

item

branch_key
branch_name
branch_type

branch

Shipping Fact Table

time_key

item_key

Shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_key
shipper_name
location_key
shipper_type

shipper

Advantages of Star Schema

n Facts and dimensions are clearly depicted
n dimension tables are relatively static, data is

loaded (append mostly) into fact table(s)
n easy to comprehend (and write queries)
�Find total sales per product-category in our stores in Europe”

SELECT PRODUCT.category, SUM(SALES.amount)
FROM SALES, PRODUCT,LOCATION
WHERE SALES.product_key = PRODUCT.product_key
AND SALES.location_key = LOCATION.location_key
AND LOCATION.region=“Europe”
GROUP BY PRODUCT.category

Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category)

n Query Language

Query Language

n Two approaches:

n Using relational DB technology: SQL (with extensions

such as CUBE/PIVOT/UNPIVOT)

n Using multidimensional technology: MDX

SELECT PRODUCT.category,
SUM(SALES.amount)
FROM SALES, PRODUCT,LOCATION
WHERE SALES.product_key =
PRODUCT.product_key
AND SALES.location_key =
LOCATION.location_key
AND LOCATION.region=“Europe”
GROUP BY PRODUCT.category

SELECT
{[PRODUCT].[category]} on ROWS,
{[MEASURES].[amount]} on COLUMNS
FROM [SALES]
WHERE ([LOCATION].[region].[Europe])

Operations: Slice (Loc.Region.Europe) + Pivot (Prod.category, Measures.amnt)

n Physical Model + Query Processing Techniques

Physical Model + Query Processing
Techniques

n Two main approaches:
n Using relational DB technology: ROLAP
n Using multidimensional technology: MOLAP

n Hybrid: HOLAP
n Base cuboid: ROLAP
n Other cuboids: MOLAP

Q1: Selection on low-cardinality attributes

time_key
day
day_of_the_week
month
quarter
year

TIME

location_key
store
street_address
city
state
country
region

LOCATION

SALES

customer_key
customer_name
region
type

CUSTOMER

time_key

customer_key

location_key

units_sold

amount
{measures

JO
IN

JOIN

Sregion=“Europe”

• Ignoring the final GROUP BY for now
• Omitting the Product dimension

Indexing OLAP Data: Bitmap Index
(1) BI on dimension tables

n Index on an attribute (column) with low distinct values
n Each distinct values, v, is associated with a n-bit vector (n =

#rows)
n The i-th bit is set if the i-th row of the table has the value v

for the indexed column
n Multiple BIs can be efficiently combined to enable optimized scan

of the table

Cust Region Type
C1 Asia Retail
C2 Europe Dealer
C3 Asia Dealer
C4 America Retail
C5 Europe Dealer

Custom BI on Customer.Region

Chap 4.2 in [JPT10]

v bitmap
Asia 1 0 1 0 0
Europe 0 1 0 0 1
America 0 0 0 1 0

Indexing OLAP Data: Bitmap Index /2

(1) Bitmap join index (BI on Fact Table Joined with
Dimension tables)
n Conceptually, perform a join, map each dimension

value to the bitmap of corresponding fact table rows.

https://docs.oracle.com/cd/B28359_01/server.111/b28313/indexes.htm#CIHGAFFF

— ORACLE SYNTAX –

CREATE BITMAP INDEX sales_cust_region_bjix
ON sales(customer.cust_region)
FROM sales, customer
WHERE sales.cust_id = customers.cust_id;

Indexing OLAP Data: Bitmap Index /3
Sales

Cust Region Type
C1 Asia Retail
C2 Europe Dealer
C3 Asia Dealer
C4 America Retail
C5 Europe Dealer

Customer
time customer loc Sale
101 C1 100 1
173 C1 200 2
208 C2 100 3
863 C3 200 5
991 C1 100 8
1001 C2 200 13
1966 C4 100 21
2017 C5 200 34

BI on Sales(Customer.Region)
v bitmap
Asia 11011000
Europe 00100101
America 00000010

Q2: Selection on high-cardinality attributes

time_key
day
day_of_the_week
month
quarter
year

TIME

location_key
store
street_address
city
state
country
region

LOCATION

SALES

customer_key
customer_name
region
type

CUSTOMER

time_key

customer_key

location_key

units_sold

amount
{measures

JO
IN

JOIN

Scity=“Kingsford”

R102

R117

R118

R124

Indexing OLAP Data: Join Indices
n Join index relates the values of

the dimensions of a star schema
to rows in the fact table.
n a join index on city

maintains for each distinct
city a list of ROW-IDs of
the tuples recording the
sales in the city

n Join indices can span multiple
dimensions OR
n can be implemented as bitmap-

indexes (per dimension)
n use bit-op for multiple-joins

city = Sydney
city = Randwick
city = Kingsford
city = Coogee

Chap 4.3 in [JPT10]

SalesLocation

Kingsford R102,R117,R118,R124

Q3: Arbitrary selections on Dimensions

time_key
day
day_of_the_week
month
quarter
year

TIME

location_key
store
street_address
city
state
country
region

LOCATION

SALES

customer_key
customer_name
region
type

Customer

time_key

product_key

location_key

units_sold

amount
{measures

JO
IN

JOIN

Scity =~ /\S+ford/

SisSchoolHolidy(day)

Sregion = “Asia”

Star Query and Star Join (Cont.)

time
1
2

Cust
10
20

X
loc
100
200

time cust loc
1 10 100
1 10 200
1 20 100
1 20 200
2 10 100
2 10 200
2 20 100
2 20 200

time cust loc sold
1 10 100 7
1 10 150 13
1 20 150 2
2 20 200 16
… … … …
1000 2000 500 86

Sales

Time Customer Loc

thousands
of tuples

millions
of tuples

Sales !” σ1(Time) !” σ2(Cust) !”
σ3(Loc) è

Sales !” (σ1(Time) X σ2(Cust) X
σ3(Loc))

Usually only part of the dim tables
because of the selection predicates

Chap 4.4 in [JPT10]

Q4: Coarse-grain Aggregations
n �Find total sales per customer type in our stores in Europe”

n Join-index will prune ¾ of the data (uniform sales),
but the remaining ¼ is still large (several millions
transactions)

n Index is unclustered
n High-level aggregations are expensive!!!!! region

country

state

city

store

LOCATON

ÞLong Query Response Times

ÞPre-computation is necessary

ÞPre-computation is most beneficial

Cuboids = GROUP BYs

n Multidimensional aggregation = selection on
corresponding cuboid

n Materialize some/all of the cuboids
n A complex decision involving cuboid sizes,

query workload, and physical organization
58

GB(type, city)( Sales !” σ1(Time) !” σ2(Cust) !” σ3(Loc)) )

GB(type, city)( σ1’2’3’(Cuboid(Year, Type, City)) )

σ1 selects some Years, σ2 selects some Brands,
σ3 selects some Cities,

Two Issues

n How to store the materialized cuboids?
n How to compute the cuboids efficiently?

Store Product_key sum(amout)
1 1 454
1 4 925
2 1 468
2 2 800
3 1 296
3 3 240
4 1 625
4 3 240
4 4 745
1 ALL 1379
2 ALL 1268
3 ALL 536
4 ALL 1937
ALL 1 1870
ALL 2 800
ALL 3 780
ALL 4 1670
ALL ALL 5120

CUBE BY in ROLAP
Product

Sales
1 2 3 4 ALL

1 454 – – 925 1379
2 468 800 – – 1268
3 296 – 240 – 536
4 652 – 540 745 1937

St
or

ALL 1870 800 780 1670 5120

SELECT LOCATION.store, SALES.product_key, SUM (amount)
FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store

4 Group-bys here:
(store,product)
(store)
(product)
()

• Need to write
4 queries!!!

• Compute them
independently

Top-down Approach

n Model dependencies among the aggregates:
most detailed “view”

can be computed from view
(product,store,quarter) by
summing-up all quarterly sales

product,store,quarter

product storequarter

none

store,quarterproduct,quarter product, store

A B M
1 1 10
1 3 20
1 2 30
1 1 40
2 4 50

Cuboid AB

A B M
1 (all) ???
2 (all) ???

Cuboid A

A B M
(all) 1 ???
(all) 2 ???
(all) 3 ???
(all) 4 ???

Cuboid B

Bottom-Up Approach (BUC)

n BUC (Beyer & Ramakrishnan,
SIGMOD’99)

n Ideas
n Compute the cube from

bottom up
n Divide-and-conquer

n A simpler recursive version:
n BUC-SR
A B …

1 1 …

1 3 …

1 2 …

1 1 …

2 … …

…

Understanding Recursion /1

n Powerful computing/problem-solving
techniques

n Examples
n Factorial:

n f(n) = 1, if n = 1
n f(n) = f(n-1) * n, if n ≥ 1

n Quick sort:
n Sort([x]) = [x]
n Sort([x1, …, pivot, … xn]) = sort[ys] ++ sort[zs]), where

ys = [ x | x in xi, x ≤ pivot ]
zs = [ x | x <- xi, x > pivot ]

f(0) = 0! =
???

List comprehension
in Haskell or

python

Understanding Recursion /2

n Let C(n, m) be the number of ways to select m balls from n
numbered balls

n Show that C(n, m) = C(n-1, m-1) + C(n-1, m)

n Example: m = 3, n = 5
n Consider any ball in the 5 balls, e.g., ‘d’

entire solution space

?1 ?2 ?3

d ?1 ?2 ?1 (no d) ?2 (no d) ?3 (no d)

a b c d e?i in { }

Key Points

n Sub-problems need to be “smaller”, so that a
simple/trivial boundary case can be reached

n Divide-and-conquer
n There may be multiple ways the entire solution

space can be divided into disjoint sub-spaces,
each of which can be conquered recursively.

n Reduce Cube(in 2D) to Cube(in 1D)

Geometric Intuition /1

M11 M12 M13 [Step 1]
M21 M22 M23 [Step 1]
[Step 2] [Step 2] [Step 2] [Step 3]

a1
a2

b1 b2 b3

M11 M12 M13 [Step 1]
b1 b2 b3

M21 M22 M23 [Step 1]

[Step 2] [Step 2] [Step 2] [Step 3]

[a1] ⨉

[a2] ⨉

[*] ⨉

n Reduce Cube(in
3D) to Cube(in 2D)

Geometric Intuition /2

n Reduce Cube(in
3D) to Cube(in 2D)

Geometric
Intuition /3

BUC-SR (Simple Recursion)*

n BUC-SR(data, dims)
n If (dims is empty)

n Output (sum(data))
n Else

n Dims = [dim1, rest_of_dims]
n For each distinct value v of dim1

n slice_v = slice of data on “dim1 = v”
n BUC-SR(slice_v, rest_of_dims)

n data’ = Project(data, rest_of_dims)
n BUC-SR(data’, rest_of_dims)

Boundary case:
data is essentially a list
of measure values

General case:
1)Slice on dim1. Call
BUC-SR recursively for
each slice

2)Project out dim1, and
call BUC-SR on it
recursively

You need to fix the incomplete output part of the algorithm!

Example

[{r1-r5}, AB]

[{r1-r4}, B]

r1
r2
r3
r4
r5

B M
1 10
1 20
2 30
3 40

B M
1 30
2 30
3 40
* 100

A B M
1 1 30
1 2 30
1 3 40
1 * 100

[{r5}, B]

B M
1 50

B M
1 50
* 50

A B M
2 1 50
2 * 50

[{r1’-r5’}, B]

B M
1 80
2 30
3 40

B M
1 80
2 30
3 40
* 150

A B M
* 1 80
* 2 30
* 3 40
* * 150

A B M
1 1 10
1 1 20
1 2 30
1 3 40
2 1 50

1 2 3 *
1 30 30 40 100
2 50 50
* 80 30 40 150

OutputInternal OutputInput

Try a 3D-Cube by Yourself

716/3/18

[{r1-r5}, ABC]

r1
r2
r3
r4
r5

A B C M
1 1 1 10
1 1 2 20
1 2 1 30
1 3 1 40
2 1 1 50

MOLAP

n (Sparse) array-based multidimensional storage
engine

n Pros:
n small size (esp. for dense cubes)
n fast in indexing and query processing

n Cons:
n scalability
n conversion from relational data

Multidimensional Array

time item dollars_sold

Q1
home
entertainment 605

Q2
home
entertainment 680

Q3
home
entertainment 812

Q4
home
entertainment 927

Q1 computer 825

Q2 computer 952

Q3 computer 1023

Q4 computer 1038

Q1 phone 14

Q2 phone 31

Q3 phone 30

Q4 phone 38

Q1 security 400

Q2 security 512

Q3 security 501

Q4 security 580

Mappings

time value

Q1 0

Q2 1

Q3 2

Q4 3

item value

home
entertain
ment 0

computer 1

phone 2

security 3

time item
dollars_s
old offset

0 0 605 0

1 0 680 4

2 0 812 8

3 0 927 12

0 1 825 1

1 1 952 5

2 1 1023 9

3 1 1038 13

0 2 14 2

1 2 31 6

2 2 30 10

3 2 38 14

0 3 400 3

1 3 512 7

2 3 501 11

3 3 580 15

f(time, item) = 4*time + item

Step 1

Step 2: An injective map from cell to offset

Multidimensional Array
offset dollars_sold

0 605

1 825

2 14

3 400

4 680

5 952

6 31

7 512

8 812

9 1023

10 30

11 501

12 927

13 1038

14 38

15 580

Dense MD array

605

825

400

680

952

512

812

1023

501

927

1038

580

n Think: how to decode a slot?
n Multidimensional array is

typically sparse
n Use sparse array (i.e.,

offset + value)
n Could use chunk to

further reduce the space
n Space usage:

n (d+1)*n*4 vs 2*n*4

n HOLAP:
n Store all non-base cuboid

in MD array
n Assign a value for ALL

Step 3: If dense, only need to store sorted slotsStep 3’: If sparse

Related Posts