Data Warehousing
and Data Mining
— L2: Data Warehousing and OLAP —
1
Part I
n Why and What are Data Warehouses?
n Transaction Processing vs. Analytical Processing
n Databases vs. Data Warehouses
Data is meaningless without analysis!
2
Example in a finance department
n Daily transaction tasks
n E.g., account receivable, account payable, payroll, etc.
Columns:
Description
G/L Account Branch
cost center
G/L account name Tax code
Total …
3
Example/2
n Weekly…monthly…yearly analytical tasks n E.g., Finance reports
4
Why OLAP Servers?
n Differentworkload:
n OLTP (on-line transaction processing)
n MajortaskoftraditionalrelationalDBMS
n Day-to-dayoperations:purchasing,inventory,banking,manufacturing,payroll, registration, accounting, etc.
n OLAP (on-line analytical processing) n Majortaskofdatawarehousesystem
n Dataanalysisanddecisionmaking
n Querieshard/infeasibleforOLTP,e.g.,
n Which week we have the largest sales?
n Does the sales of dairy products increase over time?
n Generate a spread sheet of total sales by state and by year.
n DifficulttorepresentthesequeriesbyusingSQLçWhy?
5
OLTP vs. OLAP
OLTP OLAP
users
clerk, IT professional knowledge worker
function
day to day operations decision support
DB design
application-oriented subject-oriented
data
current, up-to-date detailed, flat relational isolated
historical,
summarized, multidimensional integrated, consolidated
usage
repetitive ad-hoc
access
read/write lots of scans index/hash on prim. key
unit of work
short, simple transaction complex query
# records accessed
tens millions
#users
thousands hundreds
DB size
100MB-GB 100GB-TB
metric
transaction throughput query throughput, response
6
Data Analysis Problems
n The same data found in many different systems n Example: customer data across different
departments
n The same concept is defined differently
n Heterogeneous sources
n Relational DBMS, OnLine Transaction
Processing (OLTP)
n Unstructured data in files (e.g., MS Excel) and documents (e.g., MS Word)
7
Data Analysis Problems (Cont’d)
n Data is suited for operational systems
n Accounting,billing,etc.
n Do not support analysis across business functions
n Data quality is bad
n Missing data, imprecise data, different use of
systems
n Data are “volatile”
n Data deleted in operational systems (6months)
n Data change over time – no historical information
8
Solution: Data Warehouse
n Definedinmanydifferentways,butnotrigorously.
n A decision support database that is maintained separately from
the organization’s operational database
n Support information processing by providing a solid platform of consolidated, historical data for analysis.
n “Adatawarehouseisasubject-oriented,integrated,time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
n Datawarehousing:
n The process of constructing and using data warehouses
9
Data Warehouse—Subject-Oriented
n Organized around major subjects, such as customer, product, sales.
n Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
n Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
10
Data Warehouse—Integrated
n Constructed by integrating multiple, heterogeneous data sources
n relational databases, flat files, on-line transaction records
n Data cleaning and data integration techniques are applied.
n Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
n E.g., Hotel price: currency, tax, breakfast covered, etc.
n When data is moved to the warehouse, it is converted.
11
Data Warehouse—Time Variant
n The time horizon for the data warehouse is significantly longer than that of operational systems.
n Operational database: current value data.
n Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years) n Every key structure in the data warehouse
n Contains an element of time, explicitly or implicitly
n But the key of operational data may or may not contain “time element”.
12
Data Warehouse—Non-Volatile
1. A physically separate store of data transformed from the operational environment.
2. Operational update of data does not occur in the data warehouse environment.
n Does not require transaction processing, recovery, and concurrency control mechanisms
n Requires only two operations in data accessing:
n initial loading of data and access of data.
13
Data Warehouse Architecture
n Extract data from operational data sources
n clean, transform n Bulk load/refresh
n warehouse is offline n OLAP-server provides
multidimensional view
n Multidimensional-olap (Essbase, oracle express)
n Relational-olap
(Redbrick, Informix, Sybase, SQL server)
14
Data Warehouse Architecture
Function-oriented systems
All subjects, integrated
Subject-oriented systems
Advanced analysis
15
Why Separate Data Warehouse?
n Highperformanceforbothsystems
n DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
n Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
n Differentfunctionsanddifferentdata:
n missing data: Decision support requires historical data which
operational DBs do not typically maintain
n data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
n data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
16
Comparisons
Databases
Data Warehouses
Purpose
Many purposes; Flexible and general
One purpose: Data analysis
Conceptual Model
ER
Multidimensional
Logical Model
(Normalized) Relational Model
(Denormalized) Star schema / Data cube/cuboids
Physical Model
Relational Tables
ROLAP: Relational tables MOLAP: Multidimensional arrays
Query Language
SQL (hard for analytical queries)
MDX (easier for analytical queries)
Query Processing
Bitmap/Join indexes, Star join, Materialized data cube
17
B+-tree/hash indexes, Multiple join optimization, Materialized views
Comparisons/2
Source: https://www.zhihu.com/question/20623931
18
Comparisons/2
Source: https://www.zhihu.com/question/20623931
19
n The Multidimensional Model
20
The Multidimensional Model
n A data warehouse is based on a multidimensional data model which views data in the form of a data cube, which is a multidimensional generalization of 2D spread sheet.
n Key concepts:
n Facts: the subject it models
n Typically transactions in this course; other types includes snapshots, etc.
n Measures: numbers that can be aggregated n Dimensions: context of the measure
n Hierarchies:
n Provide contexts of different granularities (aka. grains)
n Goals for dimensional modeling:
n Surround facts with as much relevant context (dimensions) as possible ç Why?
21
Supermarket Example
n Subject: analyze total sales and profits n Fact: Each Sales Transaction
n Measure: Dollars_Sold, Amount_Sold, Cost
n Calculated Measure: Profit n Dimensions:
n Store
n Product n Time
22
Visualizing the Cubes
n A valid instance of the model is a data cube city
total Sales
product
p1
p2
p3
p4
NY
$454
–
–
$925
LA
$468
$800
–
–
SD
$296
–
$240
–
SF
$652
–
$540
$745
454
925
468
800
296
240
652
540
745
product
Concepts: cell, fact (=non-empty cell), measure,
dimensions
Q: How to generalize it to 3D?
23
city
3D Cube and Hierarchies
Concepts: hierarchy (a tree of dimension values), level Sales of book176 in NY in Jan can be found in this cell
NY Book176
DIMENSIONS
PRODUCT LOCATION TIME
ALL ALL ALL category region year
product
country quarter state month week city day
store
24
month
city
Jan
Feb
Mar
product
Hierarchies
Concepts: hierarchy (a tree of dimension values), level
ALL
Food Beverage Book
………
ALL
category product
ALL
category
subcategory brand
product
25
Which design is better? Why?
Book176
The (city, moth) Cuboid
Sales of ALL_PROD in NY in Jan
NY Book176
DIMENSIONS
PRODUCT LOCATION TIME
ALL ALL ALL category region year
product
country quarter state month week city day
store
26
month
city
Jan
Feb
Mar
product
All the Cuboids
Date
2Qtr
1Qtr VCR
3Qtr
4Qtr
TV PC
U.S.A
Canada Mexico
27
Assume: no other non-ALL levels on all dimensions.
Product
Country
All the Cuboids /2
TV PC
VCR
all
1Qtr
2Qtr Date 3Qtr
4Qtr
all
Total annual sales of TV in U.S.A.
U.S.A
Canada Mexico
all
28
Assume: no other non-ALL levels on all dimensions.
Product
Country
Lattice of the cuboids
Base cuboid
product, country country
3-dim cuboid 2-dim cuboid 1-dim cuboid
0-dim cuboid
product,quarter quarter
product, quarter, country quarter, country product
n n-dim cube can be represented as (D1, D2, …, Dd), where Di is the set of allowed values on the i-th dimension.
n if Di = Li (a particular level), then Di = all descendant dimension values of Li.
n ALL can be omitted and hence reduces the effective dimensionality Yd
n Acompletecubeofd-dimensionsconsistsof (ni+1)cuboids, i=1
where ni is the number of levels (excluding ALL) on i-th dimension. n They collectively form a lattice.
29
Properties of Operations
n All operations are closed under the multidimensional model
n i.e., both input and output of an operation is a cube
n So that they can be composed
Q: What’s the analogy in the Relational Model?
30