INFO20003 Database Systems
Dr Renata Borovica-Gajic
Lecture 19 Data Warehousing
Week 10
Coverage
By the end of this class you should be able to:
• Articulate the differences between transactional (operational)
and informational (dimensional) databases
• Explain the characteristics of a DW
• Understand and explain the overall architecture of a DW
• Design Star Schemas
INFO20003 Database Systems © University of Melbourne
3
Part 1: Introduction to Data Warehousing
What is the most effective distribution
channel?
What product prom- -otions have the biggest
impact on revenue?
Who are my customers and what products
are they buying?
Which customers are most likely to go to the competition ?
BUT HOW?
INFO20003 Database Systems
© University of Melbourne
5
Motivations: A manager wants to know….
Which are our lowest/highest margin customers ?
What impact will new products/services have on revenue
and margins?
Relational Databases for Operational Processing
• Usedtorundaytodaybusinessoperations
• Automationofroutinebusinessprocesses
– Accounting – Inventory
– Purchasing – Sales
• Created huge efficiencies
INFO20003 Database Systems
© University of Melbourne
6
OLTP Database
OLTP Database
Databases are great, BUT for business…
• Toomanyofthem
– Everybody wanted one, or two, or more
– Production, Marketing, Sales, Accounting …
• Everybodygotwhatwasbestforthem
– IBM, Oracle, Access, Microsoft
• Eventuallythisre-createdtheproblemdatabases were meant to solve
– Duplicated data – Inaccessible data – Inconsistent data
But data is useful for analysis and decision making
INFO20003 Database Systems © University of Melbourne
7
What can we do about it?
• NeedanintegratedwayofgettingtheENTIRE organisational data
• ItsreallyanInformationalDatabase,ratherthana Transactional Database
– A single database that allows all of the organisations data to be stored in a form that can be used to support organisational decision processes
INFO20003 Database Systems © University of Melbourne
8
Warehouse: An Informational Database
• Data Warehouse:
– A single repository of organisational data
– Integrates data from multiple sources
• Extracts data from source systems, transforms, loads into the warehouse
– Makes data available to managers/users – Supports analysis and decision-making
• Involvealargedatastore(oftenseveralTerabytes, Petabytes of data)
INFO20003 Database Systems © University of Melbourne
9
Transactional (Operational) Questions
Customer Service: Help! I forgot my membership card!
Select membership_nbr from MEMBER_INDEX where phone_num = ‘555-1212’
INFO20003 Database Systems © University of Melbourne
11
Analytical Questions
150
1,007,961
48,204,709
Campaign Management:
How many customers purchased more than $500 worth of alcohol in our Melbourne stores this year?
5,668,375
432,233
9,894
INFO20003 Database Systems © University of Melbourne
12
DW Supports Analytical Queries
• Oneisinterestedinnumericalaggregations – How many?
– What is the average?
– What is the total cost?
• Oneisinterestedinunderstandingdimensions – Sales by state by customer type
– Sales by product by store by quarter
DW will help answer these questions
INFO20003 Database Systems © University of Melbourne
13
Characteristics of a DW
• Subjectoriented
– Data warehouses are organised around particular subjects (sales, customers, products)
• Validated,Integrateddata
– Data from different systems converted to a common format: allows comparison and consolidation of data from different sources
– Data from various sources validated before storing it in a data warehouse
INFO20003 Database Systems © University of Melbourne
14
Characteristics of a DW
• Time variant
– Historical data
– Trend analysis crucial for decision support: requires historical data
– Data consists of a series of “snapshots” which are time stamped
• Non-volatile
– Users have read access only – all updating done automatically by ETL process and periodically by a DBA
INFO20003 Database Systems © University of Melbourne
15
A DW Architecture
Source Data Systems
Data Staging Area
Data & Meta-Data Storage Area
Analytics and Reporting
CRM Sales Production HR Finance
Internal
Stock Market Currency ABS
News
External
Processing
Clean
Derive
Match Combine De-duplication Standardise Transform Export
Meta-data
Enterprise Data Warehouse
Data Mart
Data Mart
Data Mart
Data Mart
Data Mart
Extract
Load
Feed
Extract
Feed
Ad hoc querying
Dashboards
Scorecards
Report writers
End-user applications
Modelling tools Data mining tools Visualisation tools
INFO20003 Database Systems © University of Melbourne
16
Load
Business Intelligence Dashboard
INFO20003 Database Systems © University of Melbourne
17
DW Supports Advanced Analytics
http://us.hudson.com/legal/blog/postid/513/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth http://pypr.sourceforge.net/kmeans.html
INFO20003 Database Systems © University of Melbourne
18
Part 2- Dimensional Modelling
Business Analyst World
Fact
• How much revenue did the product G generate in the last three months, broken down by month for
Dimension
the south eastern sales region, by individual stores, broken down by promotions, compared to estimates
and to the previous version of the product
– Analysis starts usually with a single indication of something strange, then goes deep into the data, left to a new dimension, right to another, up to the summary, back down and left and right again, until the problem is identified…
– Dimensional Analysis: To support business analysts view
• Revenue per product per customer per location?
Fact Dimension Dimension Dimension
Dimension
INFO20003 Database Systems © University of Melbourne
20
Introduction to Dimensional Modelling
• PopularisedbyRalphKimballinthe1990s
• Basedonthemulti-dimensionalmodelofdataand designed for retrieval-only databases
• Verysimple,intuitive,andeasily-understoodstructure
• Alsoknownasstarschemadesign
INFO20003 Database Systems © University of Melbourne
22
Dimensional Modelling
• A dimensional model consists of:
– Fact table
– Several dimensional tables
– (Sometimes) hierarchies in the dimensions
• Essentially a simple and restricted type of ER model
INFO20003 Database Systems © University of Melbourne
23
Fact Table
• A fact table contains the actual business measures (additive, aggregates), called facts
• The fact table also contains foreign keys pointing to dimensions
keys{
} facts
Sale
Time key Store key Customer key Product key Dollar sales Unit sales
INFO20003 Database Systems © University of Melbourne
24
Fact Table – example
• Actual data might look like this
• Granularity, or level of detail, is a key issue
– Finest level of detail for a fact table, determined by the finest level of each dimension
Time-id
Store-id
Cust-id
Prod-id
Dollar sales
Unit Sales
T100
S303
C101
P98
$120,000
5,000
T101
S303
C256
P98
$240000
10,000
T102
S387
C101
P10
$456,000
27,899
T100
S234
C400
P56
$100,200
5,600
INFO20003 Database Systems © University of Melbourne
25
Star schema – dimensional model
Customer
Customer key Name Customer type
Sale
Time key Store key Customer key Product key Dollar sales Unit sales
Dimension 1 Fact
Product
Product key Product type weight
Dimension 4
Dimension 3
Dimension 2
Store
Store key Address Region
Time
Time key Day Month
INFO20003 Database Systems © University of Melbourne
26
Dimension Hierarchies
Product
Product key Product name Product type Product group Product sub-group weight
Sale
Time key Store key Customer key Product key Dollar sales Unit sales
Product name e.g. Hammer – Product type e.g. Tool
– Product group e.g. Hardware
INFO20003 Database Systems © University of Melbourne
27
Dimension Table – example
• Capturesafactorbywhichafactcanbedescribedor classified
• Actual data might look like this • Hierarchy evident in data
Prod-id
Prod-Name
Prod-Group
Prod- Subgroup
Weight
P10
Hammer
Hardware
Tool
5kg
P56
10cm Nails
Hardware
Nails
1kg
P98
Plastic Pipe
Plumbing
Pipe
1kg
INFO20003 Database Systems © University of Melbourne
28
Dimensional model as an ER model
Star schema
Customer Customerkey PK
Name Customer type
Product Product key PK
Product name Product type Product group Product sub-group weight
Sale Time key
Store key Customer key Product key Dollar sales Unit sales
PFK
PFK
PFK
PFK
Store
Store key PK
Address Region
Time
Time key PK
Day Month
Fact table is an intersection table
INFO20003 Database Systems © University of Melbourne
29
Designing a Dimensional Model
Steps:
1. Choose a Business Process
2. Choose the measured facts (usually numeric, additive quantities)
3. Choose the granularity of the fact table
4. Choose the dimensions
5. Complete the dimension tables
(Kimball, 1996)
INFO20003 Database Systems © University of Melbourne
30
Embedded Hierarchies in Dimensional Tables
Customer
Customer-id
Customer name Market segment Market sector Industry class Industry sector Industry group City
State Country
INFO20003 Database Systems © University of Melbourne
31
Embedded Hierarchies in Dimensional Tables
Market segment hierarchy
Industry hierarchy
Customer-id
Customer name Market segment Market sector Industry class Industry sector Industry group City
State Country
Market Sector
Market Segment
Industry Group
Industry Sector
Industry Class
Customer
Country
State
City
Location hierarchy
INFO20003 Database Systems © University of Melbourne
32
Snowflake Schema: hierarchy in dimensions
Product Category
Mnfg Group
Product Source
Brand
Product
Process Type
Market Sector
Market Segment
Store Layout
Store Type
Industry Group
Industry Sector
Industry Class
Customer
Sales (Fact Table)
Retail Outlet
City
State
Country
Country
State
City
Date
Marketing Region
Marketing Division
Holiday
Financial Quarter
Season
Financial Year
INFO20003 Database Systems © University of Melbourne
33
Design Outcomes: Normalised or Denormalised?
• Normalisation
– Eliminates redundancy – Storage efficiency
– Referential Integrity
• Denormalisation
– Fewer tables (fewer joins)
– Fast querying
– Design is tuned for end-user analysis
INFO20003 Database Systems © University of Melbourne
34
Let’s try it …
• We are making a data warehouse for a real estate agency. The company wants to track information about the selling of their properties. This warehouse keeps information about the agents (license#, first name, last name, phone #), buyers that come in (buyer id, first name, last name, phone #), and property (property#, property address, price). The information managers want to be able to find is the number of times a property is viewed, sales price. The information needs to be accessible by rental agent, by buyer, by property and for different time (day, week, month, quarter and year).
• Draw a star schema to support the design of this data warehouse.
INFO20003 Database Systems © University of Melbourne
35
What is Examinable?
• Differencesbetweentransactionalandinformational databases
• Designingastarschema
• Defining facts and dimension tables
INFO20003 Database Systems © University of Melbourne
36
If you want to know more
More technical details (I won’t ask you these things):
https://www.youtube.com/watch?v=w- S0fj0fmqg&list=PLdQddgMBv5zHcEN9RrhADq3CBColhY2hl&index=17
INFO20003 Database Systems © University of Melbourne
37
Next Lecture
• Distributed Databases
INFO20003 Database Systems © University of Melbourne
38