MET CS 689 B1 Designing and Implementing a Data Warehouse Mary E. Letourneau
Extract, Transform, Load
April 1 & 4, 2020
ETL gets data to warehouse
ETL is the process by which the data warehouse acquires its data
Data may come from numerous sources
Per Kimball/Ross, “consumes a disproportionate share of the time and effort required to build a DW/BI environment.”
Phases Overview
Extract – acquire data from a remote source
Transform – restructure data for the data warehouse, applying conversions and dereferencing
Load – populate (or refresh) the data warehouse with new data
34 Subsystems?
Key points
Kimball suggests four main phases: Extract, Transform, Load, Management
Each phase has several considerations to keep in mind
“Management”, while added by Kimball, is not included in the ETL/ELT acronyms
The Purchased ETL Alternative
Don’t re-invent the wheel
Many strong options such as Informatica, GoldenGate, DataStage, …
Open-Source Tools
Wide connectivity to data sources and destinations
However, the data warehouse designer and implementer must still perform extensive analysis of the existing data and its intended use
Lather, Rinse, Repeat
ETL is a repeating, ongoing process
Often there is a large initial load of historical information
Then a regular re-capture of data and update or refresh of the data warehouse
Consistency is very important
The process cannot tie down staff, so automation must be reliable
Extraction Phase
Extraction Mechanisms
Many different sources and associated ways to get data
(Remote) Operational Data Stores
Data feeds
Document Services
Web Services
Data Serialization
Controlling Extraction
(Remote) Operational Data Stores
Relational Structured OLTP system
Extraction is typically via a series of queries or ‘dumps’ (see the sketch after this slide)
Several tables may require updates/inserts from one data store
Reference tables
Transactional tables
Performance concerns formerly required heavy summarization; summarization is now just an option
Extraction can be impaired by concurrent transactional data changes; mitigations include:
data model changes to support the extract
scheduled system downtime
database log mining
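A minimal sketch of a query-based extract from an operational store into a landing file, assuming a hypothetical orders table with an updated_at column and a PostgreSQL source reached through psycopg2; a real extract would also coordinate timing with the source system:

    import csv
    import psycopg2  # PostgreSQL driver; the source ODS could use any DBMS and driver

    # Hypothetical connection details for the operational data store
    conn = psycopg2.connect(host="ods.example.com", port=5432, dbname="orders_ods",
                            user="etl_reader", password="change-me")

    with conn.cursor() as cur, open("orders_extract.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["order_id", "customer_id", "order_date", "total"])
        # Pull only rows changed since the last extract (assumes an updated_at column)
        cur.execute("SELECT order_id, customer_id, order_date, total "
                    "FROM orders WHERE updated_at >= %s", ("2020-03-31",))
        for row in cur:
            writer.writerow(row)
    conn.close()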
Data feeds
Many varieties of non-transactional data sources:
Who knew simple file transfer would never die?
“Flat” files – tab- or comma-separated
XML and JSON files also used
Excel
Web Services
Extraction program “reaches out” to known locations/URLs and issues requests (see the sketch after this slide)
Sift/scrape the important data and metadata from a predefined document structure
Non-relational source can mean many null values
Very common to have highly duplicated, non-1NF data
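A minimal sketch of a web-service pull, assuming a hypothetical REST endpoint that returns a JSON list of invoices; the URL, token, and field names are illustrative:

    import requests  # third-party HTTP client

    # Hypothetical endpoint and credentials for a source web service
    resp = requests.get("https://erp.example.com/api/invoices",
                        params={"since": "2020-03-31"},
                        headers={"Authorization": "Bearer <token>"},
                        timeout=60)
    resp.raise_for_status()

    for invoice in resp.json():
        # Non-relational sources often omit fields, so default missing values to None
        print(invoice.get("invoice"), invoice.get("date"), invoice.get("total"))
        # in practice, write each record to a landing file or staging table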
Considerations
Bandwidth – Is the “landing zone” directory fast enough?
local SSD or fast disk?
high-performance network mount?
Must not read incompletely “fed” data – how is incomplete data handled?
marker file?
file rename?
Change data capture (see the sketch after this slide)
How do we make sure that data that has already been processed successfully is not processed again?
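A minimal sketch of safe feed pickup, assuming the sender drops a companion .done marker file when each feed file is complete and that already-processed file names are remembered in a simple ledger; paths and names are illustrative:

    import os

    LANDING_DIR = "/data/landing"             # hypothetical landing-zone directory
    PROCESSED_LEDGER = "/data/processed.txt"  # names of files already loaded

    def process_feed(path):
        print("loading", path)    # placeholder for the real parse-and-stage step

    def already_processed():
        if not os.path.exists(PROCESSED_LEDGER):
            return set()
        with open(PROCESSED_LEDGER) as f:
            return {line.strip() for line in f}

    done = already_processed()
    for name in sorted(os.listdir(LANDING_DIR)):
        marker = os.path.join(LANDING_DIR, name + ".done")
        # Skip files still being written (no marker yet) and files already loaded
        if not name.endswith(".csv") or not os.path.exists(marker) or name in done:
            continue
        process_feed(os.path.join(LANDING_DIR, name))
        with open(PROCESSED_LEDGER, "a") as f:
            f.write(name + "\n")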
Where’s the metadata?
“Flat” files – First row often (but not always) has column names
XML files use tags to mark data
JSON and YAML use key-value pairs
Flat File Example
invoice, date, bill-to, given, family, address1, address2, city, state, postal, ship-to, sku, quantity, description, price, tax, total, comments
34843, 2001-01-23, id001, Chris, Dumars, 458 Walkman Dr., Suite #292, Royal Oak, MI, 48046, id001, BL394D, 4, Basketball, 450.00, 107.9571, 1907.9571, Late afternoon is best. Backup contact is Nancy Billsmer @ 338-4338.
34843, 2001-01-23, id001, Chris, Dumars, 458 Walkman Dr., Suite #292, Royal Oak, MI, 48046, id001, BL4438H, 1, Super Hoop, 2392.00, 143.4629, 2535.4629,
XML Example
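As a sketch, the first record of the flat-file example might look like this in XML; element and attribute names are illustrative:

    <invoice id="34843" date="2001-01-23">
      <bill-to id="id001">
        <given>Chris</given><family>Dumars</family>
        <address1>458 Walkman Dr.</address1><address2>Suite #292</address2>
        <city>Royal Oak</city><state>MI</state><postal>48046</postal>
      </bill-to>
      <line sku="BL394D" quantity="4" description="Basketball"
            price="450.00" tax="107.9571" total="1907.9571"/>
      <comments>Late afternoon is best. Backup contact is Nancy Billsmer @ 338-4338.</comments>
    </invoice>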
JSON Example
http://json.org/example.html
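A sketch of the same record in JSON, where nested objects carry the structure:

    {
      "invoice": 34843,
      "date": "2001-01-23",
      "bill-to": {"id": "id001", "given": "Chris", "family": "Dumars",
                  "address1": "458 Walkman Dr.", "address2": "Suite #292",
                  "city": "Royal Oak", "state": "MI", "postal": "48046"},
      "line": {"sku": "BL394D", "quantity": 4, "description": "Basketball",
               "price": 450.00, "tax": 107.9571, "total": 1907.9571},
      "comments": "Late afternoon is best. Backup contact is Nancy Billsmer @ 338-4338."
    }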
YAML Example
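And a sketch of the same record in YAML, where indentation and key-value pairs carry the structure; the & anchor and * alias show one way of expressing that bill-to and ship-to are the same id001 contact:

    invoice: 34843
    date: 2001-01-23
    bill-to: &id001
      given: Chris
      family: Dumars
      address1: 458 Walkman Dr.
      address2: "Suite #292"
      city: Royal Oak
      state: MI
      postal: 48046
    ship-to: *id001
    line:
      sku: BL394D
      quantity: 4
      description: Basketball
      price: 450.00
      tax: 107.9571
      total: 1907.9571
    comments: >
      Late afternoon is best.
      Backup contact is Nancy Billsmer @ 338-4338.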
Controlling the Extraction Process
(after the initial load)
Scheduler
How often? Balance file size against frequency
Polling
Sequencing
Tables that are referenced should generally get loaded earlier in the process
Credentials
What account credentials will be used to authorize access to the data?
Scheduling
Does ETL poll certain data sources? How often?
Does ETL get started and retrieve data itself?
How current does the data warehouse data need to be?
Must we schedule tasks on multiple hosts?
Are there timeouts for certain operations?
What do we do with scheduling or timing errors?
What if the last ETL is still running when the next one is scheduled?
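One common answer to the overlapping-run question is a simple lock that the scheduler checks before starting; a minimal sketch, assuming a single ETL host and an illustrative lock-file path:

    import os
    import sys

    LOCK_FILE = "etl.lock"    # hypothetical lock path on the (single) ETL host

    def run_etl():
        print("running extract/transform/load ...")   # placeholder for the real job

    try:
        # O_EXCL makes creation fail if a previous run still holds the lock
        fd = os.open(LOCK_FILE, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        print("Previous ETL run still in progress; skipping this scheduled run")
        sys.exit(1)

    try:
        os.write(fd, str(os.getpid()).encode())   # record which process holds the lock
        run_etl()
    finally:
        os.close(fd)
        os.remove(LOCK_FILE)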
Transformation Phase
ETL or ELT
Traditional ETL approach
Extract from source into staging tables
Transform
Load into warehouse
Another approach
Extract from source
Load into a staging area (tables, data lake)
Transform
Some transformed data may be loaded into a formal data warehouse, or analytical database
Transformation Activities
Cleansing
Deduplication
Simple transforms
Restructuring
The final result is that the extracted data is in a form matching, or strongly correlated to, the data warehouse structure
“Cleansing”
Checking and correcting data
Often discarding bad data
Looking for gaps and inconsistencies; three quality-screen levels:
Column
Structure
Business Rules
Applying “audit dimension” – tracks each ETL activity
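A minimal sketch of a column-level quality screen, assuming a hypothetical list of per-column rules; structure and business-rule screens would follow the same pattern of recording each failure for the audit and error tables:

    # Hypothetical column-level screens: (column, test, error message)
    SCREENS = [
        ("quantity", lambda v: v is not None and int(v) > 0, "quantity must be positive"),
        ("state",    lambda v: v in {"MI", "OH", "IN"},       "unknown state code"),
        ("date",     lambda v: bool(v),                       "date is required"),
    ]

    def screen_row(row):
        """Return the list of quality errors found in one extracted row (a dict)."""
        errors = []
        for column, test, message in SCREENS:
            try:
                if not test(row.get(column)):
                    errors.append((column, message))
            except (TypeError, ValueError):
                errors.append((column, message))
        return errors

    # Depending on policy, rows with errors are suspended, tagged, or halt the load
    print(screen_row({"quantity": "0", "state": "MI", "date": "2001-01-23"}))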
How do we respond to bad data?
What do we do with the data?
Halt the process (& at what threshold)?
Suspend bad records?
Tag and pass the data along?
Report to a log / notify the DBA?
Do we need to cancel the next ETL run?
Deduplication
Very common for data extracts to have lots of redundant data; multiple sources often have common records
Sometimes duplicate records are acceptable, e.g. click streams
For each input stream, establish which “key” fields make a record unique, if not the entire record
If records share a key but their non-key fields disagree, what do you do? (see the sketch after this slide)
throw all the records away
set up a ranking function to choose the “best” record and discard the rest
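A minimal sketch of key-based deduplication with a ranking function, assuming records are dicts, the business key is (invoice, sku), and the most recently updated record is considered “best”; all names and values are illustrative:

    # Hypothetical extracted records; the first two share the business key
    records = [
        {"invoice": 34843, "sku": "BL394D",  "quantity": 4, "updated": "2001-01-23"},
        {"invoice": 34843, "sku": "BL394D",  "quantity": 5, "updated": "2001-01-25"},
        {"invoice": 34843, "sku": "BL4438H", "quantity": 1, "updated": "2001-01-23"},
    ]

    def business_key(rec):
        return (rec["invoice"], rec["sku"])

    def rank(rec):
        return rec["updated"]      # "latest update wins"; real ranking rules vary

    best = {}
    for rec in records:
        key = business_key(rec)
        if key not in best or rank(rec) > rank(best[key]):
            best[key] = rec

    print(list(best.values()))     # one record per (invoice, sku)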
Simple Transforms
Date or date-time strings
Unit conversions, e.g. ounces to grams
Applying labels, e.g. $, €, £
Converting via reference tables, converting codes to human-readable descriptions
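A minimal sketch of these simple transforms, assuming an illustrative source date format, a weight column in ounces, and a small in-memory reference table of product codes:

    from datetime import datetime

    OUNCES_TO_GRAMS = 28.3495
    # Hypothetical reference table mapping source codes to descriptions
    PRODUCT_CODES = {"BL394D": "Basketball", "BL4438H": "Super Hoop"}

    def transform(row):
        return {
            # Date string in the source's format -> ISO date
            "order_date": datetime.strptime(row["date"], "%m/%d/%Y").date().isoformat(),
            # Unit conversion: ounces to grams
            "weight_g": round(float(row["weight_oz"]) * OUNCES_TO_GRAMS, 1),
            # Apply a currency label
            "price_display": "$" + format(float(row["price"]), ",.2f"),
            # Dereference a code through the reference table
            "description": PRODUCT_CODES.get(row["sku"], "UNKNOWN"),
        }

    print(transform({"date": "01/23/2001", "weight_oz": "21", "sku": "BL394D", "price": "450"}))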
Restructuring
Multiple data sources will have multiple structures
Data Warehouse will have still another structure
Transformation phase
may normalize non-1NF data into more tables
may summarize data
may re-map data to alternative structures
in particular, mapping to dimensional models
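A minimal sketch of one restructuring step, normalizing the repeating header columns of a denormalized line-item feed (modeled on the flat-file example) into separate invoice and line structures:

    import csv
    from io import StringIO

    # Denormalized feed: header columns repeat on every line item
    feed = StringIO(
        "invoice,date,bill_to,sku,quantity,price\n"
        "34843,2001-01-23,id001,BL394D,4,450.00\n"
        "34843,2001-01-23,id001,BL4438H,1,2392.00\n"
    )

    invoices = {}   # one entry per invoice (header-level data)
    lines = []      # one entry per invoice line (detail-level data)

    for row in csv.DictReader(feed):
        invoices[row["invoice"]] = {"invoice": row["invoice"], "date": row["date"],
                                    "bill_to": row["bill_to"]}
        lines.append({"invoice": row["invoice"], "sku": row["sku"],
                      "quantity": int(row["quantity"]), "price": float(row["price"])})

    print(list(invoices.values()))
    print(lines)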
Creating and Managing Surrogate Keys
Issues with Data Source Keys
The data source controls the creation and assignment of keys
One data source may have keys inconsistent with others, or even contradictory
Data sources can be out of sync
The net result is that we cannot use source-system keys in our data warehouse
Resolve Keys through Surrogates
Surrogate keys have no intrinsic meaning – their only attribute is uniqueness
Cannot be derived from source keys in any way – just use a counter
Map keys from source systems to generated surrogate keys
NOTE: a source key may itself be a surrogate key in the source system; that does not affect our use. We map source keys whether they are natural or surrogate keys; we just do not care
Correlate loaded data using surrogate keys
Foreign keys (if used) refer to surrogate keys
Surrogate keys cannot repeat for different data sources
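A minimal sketch of surrogate-key assignment, keeping a key map from (source system, source key) to the generated surrogate; a production ETL would persist this map in a table or use a database sequence:

    key_map = {}   # (source_system, source_key) -> surrogate key
    next_key = 1   # plain counter; surrogate keys carry no meaning beyond uniqueness

    def surrogate_for(source_system, source_key):
        """Return the existing surrogate for this source key, or assign a new one."""
        global next_key
        map_key = (source_system, source_key)
        if map_key not in key_map:
            key_map[map_key] = next_key
            next_key += 1
        return key_map[map_key]

    print(surrogate_for("crm", "CUST-0042"))       # same source key -> same surrogate
    print(surrogate_for("crm", "CUST-0042"))
    print(surrogate_for("billing", "CUST-0042"))   # different source -> different surrogate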
DW Foreign Keys and Load Order
Can’t load a ‘child’ record without ‘parent’
Must perform dependency analysis to insert in order
Any recursive references will require setting order for inserts within a table
Alternatives
Use large transactions, and configure foreign-key constraints so that they are not enforced until ‘commit’ (see the sketch after this list)
Kick users off and disable or drop foreign keys until the load is complete
Use a dummy FK value for initial load, and add another pass to finalize
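A sketch of the first alternative in PostgreSQL-flavored SQL, assuming the warehouse's foreign keys are declared DEFERRABLE so the checks can be postponed until COMMIT; table and constraint names are illustrative:

    -- Declared once, when the DW tables are created
    ALTER TABLE sales_fact
      ADD CONSTRAINT fk_sales_customer
      FOREIGN KEY (customer_key) REFERENCES customer_dim (customer_key)
      DEFERRABLE INITIALLY IMMEDIATE;

    -- During the load
    BEGIN;
    SET CONSTRAINTS fk_sales_customer DEFERRED;   -- do not check until COMMIT
    -- ... insert parent and child rows in whatever order the feed provides ...
    COMMIT;                                       -- foreign keys are verified here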
Data Loading
Load and Verification
Loading – simplicities and complications
Verifying the load
Managing refresh
(Try to ignore ’facts’ and ‘dimensions’ for now and focus on process)
DB connection parameters
DBMS name and version
What language are you using (e.g., Python)?
Driver (language-specific interface) for the DBMS
Host name
Database port
Database (“instance”) name
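A minimal sketch of these connection parameters in Python, using the psycopg2 driver as one example; the host, database, and credentials are illustrative:

    import psycopg2  # driver (language-specific interface) for PostgreSQL

    conn = psycopg2.connect(
        host="dw.example.com",   # host name
        port=5432,               # database port
        dbname="warehouse",      # database ("instance") name
        user="etl_loader",       # dedicated load account (next slide)
        password="change-me",
    )
    print(conn.server_version)   # DBMS version as reported by the driver
    conn.close()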
Database User for Load
Owner of DW tables has downsides
more power than you want for ETL tool
can make audit of activities less clear
DW end user not appropriate
end users should not make updates
also muddles ETL activity
Create a separate DW load user for clarity and security
Managing Load Around DW activity
Kick end users off during load
Transaction controls
Fits most neatly with regular DB operations
Extra-large transactions
Need GB-sized rollback segments
Mark data with Load ID
essentially a versioning approach
notify clients when load is complete to use new load ID
may imply substantial duplication of data
Kimball calls load ID “audit key”
Refresh and Expiration of Prior Loads
So what about the second load?
Data Warehouse ETL is a repeating process
Must deal with three cases:
source key for load has not been seen before – this is (usually) a simple insert
previously-loaded source key is not in a new load
source key in load was previously loaded
Is Load Incremental or Replacement?
What’s your extract?
Is it a full dump of the entire data source?
Is it only the changes since the last load?
Does incremental load mark deleted data?
Do we delete prior data?
A full load omits a previously loaded source key
An incremental load marks a source key as “deleted”
Deleting prior loads in data warehouse may make sense
Corrections to incorrect data
But can cause orphaned records
Many ETL implementations just mark data as ‘obsolete’ instead
Preserves keys
Facilitates audit
Reports must exclude the ‘obsoletes’
Refreshing Prior Loads
Set match criteria, almost always based on the source key
Just skip loading the data? After all, it’s already there
Update the data:
update all the non-key columns
update some columns, leave other columns unchanged
How do we know what has changed?
Timestamp (not reliable)
Log shipping (does not work for non-transactional data sources)
Checksum (requires checking every record; see the sketch after this list)
Replication/mirroring
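A minimal sketch of the checksum option, hashing the non-key columns of each incoming row and comparing against the hash stored at the previous load; the stored-hash lookup here is a stand-in:

    import hashlib

    def row_checksum(row):
        """Hash all non-key column values of a row (a dict) in a fixed column order."""
        payload = "|".join(str(row[col]) for col in sorted(row) if col != "source_key")
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    incoming = {"source_key": "CUST-0042", "name": "Chris Dumars", "city": "Royal Oak"}
    previous_hashes = {"CUST-0042": "<hash saved at the last load>"}   # stand-in lookup

    if row_checksum(incoming) != previous_hashes.get(incoming["source_key"]):
        print("row changed since the last load; refresh it")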
Refresh with “UPSERT”
MERGE (Oracle and MS)
INSERT ON CONFLICT (PostgreSQL)
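A sketch of a PostgreSQL-style upsert keyed on the source key, assuming a unique constraint on source_key; table and column names are illustrative:

    -- Insert the row if the source key is new, otherwise update it in place
    INSERT INTO customer_dim (source_key, name, city, load_id)
    VALUES ('CUST-0042', 'Chris Dumars', 'Royal Oak', 17)
    ON CONFLICT (source_key)
    DO UPDATE SET name    = EXCLUDED.name,
                  city    = EXCLUDED.city,
                  load_id = EXCLUDED.load_id;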
Tracking and Verifying Data Loads
Bad data loads aren’t worth it
Must observe ETL over time: verifying ETL and tracking errors
Must be able to trace old reports back to the data loads from which they were generated
Often need tables that track ETL processes (audit tables) and failures (error tables) for compliance; also useful for debugging.
Often need tables to track lineage (what the original data source was) and dependencies (what tables are related) in order to track down bad data and update related tables.
Need an audit history table in DW
Some kind of load ID
Data source for load
Date/time of load start and finish
Completion status
How started (manual or scheduled)
Lineage may be tracked by referencing the audit table with a foreign key
Load Errors table
Record individual errors during load
Identify source file(s) or data stream for error
Type of error
Any possible description
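A sketch of what the audit history and load-error tables described above might look like as SQL DDL; names and types are illustrative:

    -- Audit history: one row per ETL load
    CREATE TABLE etl_audit (
        load_id      BIGINT PRIMARY KEY,
        data_source  VARCHAR(100) NOT NULL,
        started_at   TIMESTAMP    NOT NULL,
        finished_at  TIMESTAMP,
        status       VARCHAR(20)  NOT NULL,  -- e.g. RUNNING, SUCCEEDED, FAILED
        started_by   VARCHAR(20)  NOT NULL   -- MANUAL or SCHEDULED
    );

    -- Load errors: individual problems recorded during a load
    CREATE TABLE etl_load_error (
        error_id     BIGINT PRIMARY KEY,
        load_id      BIGINT REFERENCES etl_audit (load_id),
        source_file  VARCHAR(255),
        error_type   VARCHAR(50),
        description  TEXT
    );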
Verification
Did all the records in the source get loaded?
Can you account for all the records that did not load?
Did failures in one load stream propagate into other streams?
Do you need to clear and repeat this load with ETL fixes or data source corrections?
Controlling and Parameterizing
What are the control knobs?
Structure and usage of each data source
Extraction/Transmission mechanisms
Structure of the destination data warehouse
Rules for cleansing and deduping
Rules or stipulations for scheduling
Credentials for data sources
(Meta-) Data-driven Operation
The ETL tool not only uses but also stores all these controls
Startup Configuration files
Location of incoming feed directories
Programs to fetch data
User names for systems, with passwords or encryption keys
Host names and service ports
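A minimal sketch of a startup configuration carrying these items, read with Python's configparser; every value is illustrative, and real passwords would sit in an encrypted secrets store rather than in the file:

    import configparser
    import textwrap

    # Hypothetical startup configuration for the ETL process
    CONFIG_TEXT = textwrap.dedent("""\
        [feeds]
        landing_dir = /data/landing
        fetch_program = /opt/etl/bin/fetch_feeds.sh

        [source_ods]
        host = ods.example.com
        port = 5432
        user = etl_reader

        [warehouse]
        host = dw.example.com
        port = 5432
        database = warehouse
        user = etl_loader
        """)

    config = configparser.ConfigParser()
    config.read_string(CONFIG_TEXT)       # in practice, config.read("etl.ini")
    print(config["warehouse"]["host"], config.getint("warehouse", "port"))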
“Usual” metadata
Descriptions for both the extracts and the target data warehouse
tables
files
field layout
format
data types
cross-references
Additional Metadata Needed
What fields are used for deduping?
When do we update or delete previous data?
How do we generate surrogate keys for each case in which one is needed?
How do we sequence extracts and loads?
What is the history of ETL runs?
Module 3 Key Points
Why does ETL take up so much time? What role does automation play in the ETL process?
What does ETL stand for? ETL vs ELT. Know the tasks associated with each. (Module 3)
Kimball’s checklist before launching the ETL design and development effort (see the beginning of chapter 19)
Balancing data latency with ETL frequency; Job scheduling
Data quality screening: Column, Structure, Business rules
Responding to quality events, the three options: halt the process, send records to a suspense file, tag the data and pass it on to the next step
Error and Audit dimensions & facts
Primary, foreign, surrogate keys
Lineage and Dependency
Metadata – source metadata and ETL process metadata
Have a Good Evening and a Great Week!
End of Presentation