MET CS 689 B1 Designing and Implementing a Data Warehouse Mary E Letourneau
MET CS 689 B1
Designing and Implementing a Data Warehouse
Mary E Letourneau
Dimensional Data Modeling
March 25, 2020
1
Module 2
Dimensional Data Modeling
Modeling Fundamentals
Dimensions and Facts
Schemas
Slowly Changing Dimensions
Time and Bitemporality
Including Big Data
2
MET CS 689 B1
Designing and Implementing a Data Warehouse
Mary E Letourneau
Modeling Fundamentals
March 25, 2020
3
Modeling Fundamentals
Seven Goals
Adapting from OLTP
The Fact/Dimension Abstraction
Dimension
Many Dimensions are Hierarchical
Measures
Fact Tables
Seven goals
The DW/BI system must make information easily accessible
The DW/BI system must present information consistently
The DW/BI system must adapt to change
The DW/BI system must present information in a timely way
The DW/BI system must be a secure bastion that protects the information assets
The DW/BI system must serve as the authoritative and trustworthy foundation for improved decision making
The business community must accept the DW/BI system to deem it successful
Adapting from OLTP to Warehousing
Standard OLTP database designed to drive and structure semi-static business processes (Static in the sense that most people are doing the same task(s) over and over again.)
Data warehouse must abstract various processes and sources of data to correlate them
What are the key pieces of information?
How should these be measured and reported?
MET CS 689 B1
Designing and Implementing a Data Warehouse
Mary E Letourneau
Dimensions and Facts
March 25, 2020
7
The Fact/Dimension Abstraction
Key information are facts
Sales
Airline flights
Medical treatments
Reporting attributes are dimensions
Dates on facts
Organization
Type of Financial instrument
Market segment
Dimension
Dimensions describe who, what, when, where
Consider what data business users will want to use for filtering and/or grouping
Usually non-numeric
Any coordinate in multidimensional space
Time, Location, Person, Stock, Company, Market, Units of Measure
Contains a single (non-composite) primary key
Typically has keys for business attributes
Many columns, (relatively) few rows
Most Dimensions are Hierarchical
Hierarchies in Dimensions follow the drill-down/roll-up path for the related facts
Dates, again, exhibit an intrinsic hierarchy – year-quarter-day
Markets – country, region, city is a hierarchy
Products – Product category, brand name, product line, specific product is a hierarchy
Recognizing Dimensions with Multiple Hierarchies
Dates for calendar years vs dates for fiscal years
Location intensive dimensions may have multiple geographic hierarchies.
Store geographic location may differ from corporate region location
These can work together without interference in the same table
All dimension hierarchies must have same “grain” – level of detail – at the smallest level
Sample report
Dimensions
Facts
12
Note the hierarchy Category-Product (Product is a human-readable description of the SKU. These are at the same level in the hierarchy.)
Note the granularity = SKU
Dimension Values collated
13
There are really only four dimensions here
Measures
Measures tell us how much
Examples:
Interest rate described by Time, Country
Effective Annual Percentage, Period, size of loan, and value of interest rate
Refers to dimensions
Fact tables summarize measures to the level of dimensional granularity
Three types
Additive – aggregations (SUM, AVERAGE, etc.) across any/all dimensions
Semi-additive – some aggregates may not “make sense” across all dimensions
Non-additive – such as percentages/ratios, or “facts” that can only be COUNTed (eg Gender)
Fact tables
Many rows, (relatively) few columns
Types of Fact Tables
Transaction
Strong correlation to original OLTP records
Periodic
Used when certain aggregates commonly needed, such as monthly accounting statements
Dimensional hierarchies are useful here
Accumulating
Commonly used when comparing aggregates over time, such as budget vs actual
Factless
Facilitates analysis of “missing” fact records
Factless fact tables
MET CS 689 B1
Designing and Implementing a Data Warehouse
Mary E Letourneau
Schemas
March 25, 2020
17
Schemas
Putting Dimensions and Facts Together
Star Schema
Snowflake
Dimension-to-Dimension Relationships
Multi-valued Dimensions
Constellation/Galaxy
Conforming Dimensions
Putting dimensions and facts together
Star
Snowflake
Constellation
https://www.tutorialspoint.com/dwh/dwh_schemas.htm
Star Schema
Every key in a fact table refers to dimension primary key
Snowflake
“Normalizes” one or more dimensions
Dimension-to-dimension relationships
In the example below Customer and Account have M:M relationship. Each customer can own multiple accounts and each account can be shared by multiple customers.
Outrigger Dimension
Multi-valued dimensions
Each dimension attached to a fact table has a single value consistent with the fact table’s grain. But there are a number of situations in which a dimension is legitimately multivalued. For example, a patient receiving a healthcare treatment may have multiple simultaneous diagnoses. In these cases, the multivalued dimension must be attached to the fact table through a group dimension key to a bridge table with one row for each simultaneous diagnosis in a group.
A multivalued bridge table may need to be based on a type 2 slowly changing dimension. For example, the bridge table that implements the many-to-many relationship between bank accounts and individual customers usually must be based on type 2 account and customer dimensions. In this case, to prevent incorrect linkages between accounts and customers, the bridge table must include effective and expiration date/time stamps, and the requesting application must constrain the bridge table to a specific moment in time to produce a consistent snapshot.
Constellation / Galaxy
Multiple fact tables
Conform dimensions shared by more than one fact table
Conforming dimensions
26
Role-Playing Dimensions
It is not uncommon for a fact table to need to refer to the same kind of dimension more than once, especially date dimensions
Instead of creating multiple copies of the same dimension, consider putting multiple foreign keys to the dimension in the fact table.
Junk Dimensions
“Code” dimension tables with only one or two columns
Consider combining these into one “junk” dimension
Not quite a Cartesian product
Only a complete list of possible combinations
Junk Dimensions
Dimensional Design – Degenerate dimensions
Degenerate dimension is a dimension key in the fact table which does not have its own dimension table. Examples include: Credit Card Transaction #, Check Number, Order #, Ticket #
Intended usage: Grouping together related fact table rows. Example:
Retail point-of-sale transaction numbers tie all the individual items purchased together in the single market basket.
In health care claims related to a single hospital stay or episode of care can be grouped by using stay#
MET CS 674
Data Warehousing
Mary E Letourneau
Slowly-Changing Dimensions
March 25, 2020
31
Introduction to Slowly-Changing Dimensions (SCD)
Slowly-Changing Dimensions
SCD Type 0 – Don’t Update
SCD Type 1 – Overwrite
SCD Type 2 – Superseded
SCD Type 3 – Extend
SCD Type 4 – Mini-dimension
SCD Type 5 – Dimension outrigger
SCD Type 6 – Type 1 + 2 + 3
SCD Type 7 – Independent Durable Dimension
Slowly-Changing Dimensions
One might think the values in dimension tables never change, but:
List of sovereign countries may need to add Catalonia
People’s names
Genders
Changes may only be attributes of a dimension value, or new values entirely
How do we handle these “SCD” dimensions?
Eight “Types” (0 through 7)
Example of Person dimension changing
34
SCD Type 0 – Don’t update
Nope, don’t change anything
May be an attribute marked as “original”
SCD Type 1 – Overwrite
Overwrite changed attributes
Dimensional Queries now reflect changed data
Answers change (not bitemporal)
SCD Type 1 Illustration
SCD Type 2 – Superseded
Add new dimension table entry
New dimension key
Preserve existing facts with old key
Add new facts with new key
Almost always sequence with date effectivity, that is ‘effective from’ and ‘effective until’
SCD Type 2 Illustration
39
Mixing SCD1 and SCD2
SCD Type 3 – Extend
Add columns for changed attributes – “prior”
Denormalize
Not useful for unpredictable changes
SCD Type 3 Illustration
SCD Type 4 – mini dimension
Remove changing attributes to another table
Generate a new key for each
Split dimension references in fact table to the new ‘mini dimension’
Type 4 – add mini dimension
Type 4 Illustration
SCD Type 5 – Dimension Outrigger
Instead of creating mini-dimension, add dimension to primary dimension table
The attribute shows the current mini-dimension value
Fact table keeps one dimension column
Type 5 Illustration
SCD Type 6 – 1 + 2 + 3
Supersede dimension rows (Type 2)
Add history column (Type 3)
Overwrite columns (type 1), esp ‘current’
Type 6 Illustration
SCD Type 7 – Independent Durable Dimension
Use dimension’s natural permanent key as key to a ‘permanent’ dimension
Add ‘natural’ or ‘durable’ dimension as new dimension to fact table
New durable dimension is type 1 – overwritten to current every time
Previous dimension is retained as a type 2 with full history
Type 7
MET CS 689 B1
Designing and Implementing a Data Warehouse
Mary E Letourneau
Time and Bitemporality
March 25, 2020
52
Introduction to Time as Dimension
Time as Dimension
Temporal Data Types
Forms of Time Record
Assertions referring to time
Reasoning about Time
Durable Bitemporal Design
Time as dimension
Time has direction – past, current, future and time is ordered
Time is hierarchical – timestamp, day, week, month, quarter, year
Time as calendar – business calendar vs daily calendar
Same time point can be expressed differently depending on time zone and calendar
Temporal data types
Instant: something happened at an instant of time (e.g., “now”, which happens to be Oct 16, 2013, 6:xx:xx P.M. EST.
Interval: a length of time (e.g., three months). I also call it relative time, Other examples – Last Month, Current Month, 6 months ago, Last year
Period: an anchored duration of time (e.g., the fall semester, Sep 5, 2013 through December 24, 2013). Periods are typically closed with left date being open and right date being closed(not included in the interval).
Forms of time record
Valid time: when a fact was true in the real world (typically instant or period)
Transaction time: when a fact was stored in the database (typically instant or period used to represent)
User-defined time:any time defined by user that not valid or transaction time. This is basically time as standard attribute.
Assertions referring to Time
Current: Basically, show me the data the way it looks from stand-point of today.
Historical: data as recorded at a specific time
Non-Temporal: ignoring time and changes over time
Reasoning about Time
A “point in time” is rare – only intervals
Much data captured as “days”
Even ‘fine-grained’ timestamps are intervals
Sequences of “events” involve overlap as part of “Before” and “After”
Equality means equal beginning and end
Comparing Time
Before/Simultaneous/After // Start/End
58
Durable Bitemporal Design
True bitemporality requires strong rules
only current view of historical data can change
Historical view of historical data cannot change.
Rerunning calculations on yesterday’s data using transaction time should have same results as yesterday’s “current” calculations
59
More…
https://en.wikipedia.org/wiki/Temporal_database
Kimball/Ross pp 252-254
MET CS 674
Data Warehousing
Mary E Letourneau
Including Big Data
March 25, 2020
61
Big Data
Master Data Management
Metadata
Standardize business entities and data across multiple systems
Processing
Gather
Analyze
Process
Distribute
Master Data Management
Metadata
Assists with creating consistency
Assists with documenting the business needs and uses for the data
Many types of metadata
Technical, Business, Contextual, Process design-level, Program-level, Infrastructure, Core Business, Operational, Business Intelligence
Master Data Management
Krishnan – “… the process of standardization of key business entities and its associated data processing rules across multiple operational appliances and data consumption applications …”
With the variety of sources available (both big data and transactional), it is important to develop a standardized view of the data, with common grains, hierarchies, and measurements.
Krishnan – “MDM is not about technology. The critical success factor of this initiative is the subject matter experts in data within the business teams, who can understand and define the processing rules and complex decision-making process …”
Processing Big Data
Process
Context
Metadata, master data, semantic linkage
Standardize
Distribute
Key Points for Module Two
Vocabulary
Dimension vs Fact
Star schema vs Snowflake vs Constellation/Galaxy
Constellation – Drill-across
Measures
Additive
Semi-additive
Non-additive
Fact tables
Transaction
Periodic
Accumulating Snapshot
Factless Fact tables
Key Points for Module Two (con’t)
Dimension
Grain
Hierarchy
Drill-down & Roll-up
Describes facts
Used for filtering & grouping
Slowly changing dimension
Types 0 through 3
Bitemporality
Outriggers
Degenerate dimension / junk dimension
Role-playing dimension
Key Points for Module Two (con’t)
Krishnam – Chapters 6 & 7
Definition of data warehouse
Data architecture
Infrastructure
Pitfalls
Two schools of thought
Key Points for Module Two (con’t)
Kimball/Ross – Chapter 2 & 18
Fundamental Concepts
Gather business requirements and data realities
Four-step dimensional design process
Fact Table Techniques
Dimension Table Techniques
Slowly-Changing Dimensions
Dimensional Modeling Process
Key Points for Module Two (con’t)
Kimball/Ross – Chapter 18
Figure 18-1
Identify the business processes
Identify & Involve business representatives (Including CxO’s and administrative staff)
Review Business Requirements
Naming Conventions
Declare the grain
Identify dimensions
Identify facts
References
Kimball, Ralph and Ross, Margy. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition. Indianapolis, IN: John Wiley & Sons, 2013.
Krish Krishnan. Data Warehousing in the Age of Big Data, 1st ed., Waltham, MA: Morgan Kaufmann, 2013.
Temporal database. (2019, March 14). Retrieved from https://en.wikipedia.org/wiki/Temporal_database
Tutorialspoint.com. (n.d.). Data Warehousing Schemas. Retrieved from https://www.tutorialspoint.com/dwh/dwh_schemas.htm
Have a Good Evening and a Great Week!
End of Presentation
72