MET CS 689 B1 Designing and Implementing a Data Warehouse Andrew D Wolfe, Jr.
MET CS 689 B1
Designing and Implementing a Data Warehouse
Mary E Letourneau
Reporting, Analysis and Visualization
April 8 and 11, 2020
1
MET CS 689
Data Warehousing
Mary E Letourneau
Reporting
April 8 and 11, 2020
2
Reports
Who Needs ‘Em?
What reports of what data?
Key Performance Indicators (KPI)
Trouble Tracking
Technical mechanisms for reporting
Who needs reports?
Reports are those explorations of data warehouse data that are prepared on a regular basis
Reports provide a means of monitoring operations
Typically reports are prepared for middle or executive management
Occasionally prepared for consultants or auditors
Primarily sectioned and sorted on Date dimension
What reports of what data?
Report
Production Monitoring
Sales tracking
Expenses
Supply Chain
Promotion response
What reports of what data?
Report Data
Production Monitoring Facts: production records
Dimensions: products, locations
Sales tracking Facts: sales
Dimensions: sales org, products
Expenses Facts: expense reports/payments, payables
Dimensions: expense type, organization
Supply Chain Facts: supply shipments, projections
Dimensions: supplier, product
Promotion response Facts: Sales campaign
Dimensions: types of response, sales from response, customer
all of these use Date dimension
Key Performance Indicators (KPI)
Fact attributes directly indicating strategic success for organization
Sometimes fact quantities, e.g., numbers of sales
Primary topic of regular reports
Examples:
Sales pipeline
Employee productivity
Customer satisfaction
Appreciation of assets and inventory
Lead and Lag in KPI
Lag
How you’ve performed
Revenue Recognition, Profitability, Customer satisfaction surveys
Typically from Operational Data Systems – traditional data warehousing
Lead
Projecting upcoming results
Production pipeline, advertising response, consumer engagement
Big data sources like clickstreams and tweets can often provide valuable early projection of performance
Trouble Tracking Reports
Some regular reports are on a provisional basis for applying corrections
Productivity issues
Gaps in supply pipeline
Compensating for market changes
Technical mechanisms for reporting
Paper who?
Web interfaces
Summaries distributed through email or workflow
Dashboards
All require drill-down and roll-up
MET CS 689
Data Warehousing
Mary E Letourneau
Analytics
April 8 and 11, 2020
11
Is this the opposite of Reporting?
More ad-hoc
Works largely on Big Data semi-structured datasets
Works into data discovery and data science
We encounter analytics every time we go to Amazon
Semi-structured data?
Data Analytics begin with questions
Typically focused on clickstream because of sales focus
Tweets and hashtags for popular sentiment
Political data mining
Is this really unstructured?
What raw data?
We can’t really run analytics on raw raw data
We can use facts, whether in fact tables or not
We will want to use big data after conditioning and structuring it
This is distinct from entity resolution, that is, when it is creating dimensions
So what is “Analytics”?
The primary tool for analytics is statistical analysis
We are looking at volumes of data that are too large for ”eyeballing”
Statistical functions can help identify the nuggets of real information within a large dataset
Where do we start?
Generic exploration of data
Derive hypotheses
the analysis won’t create itself (yet!)
Look for ‘wrinkles’ or anomalies, outliers, local minima and maxima
Frame the measures to check the hypothesis
Determine the dimensions applicable to the hypothesis
Variables for Analytics
Analysis can take place using two, three, or more different variables
Often the time dimension is used as the reference variable
Typically one or more measures (fact attributes) are analyzed using statistics
Characterize the data distribution among the measures
Are logarithmic representations needed?
Create a pseudo-hierarchy of value ranges
MET CS 689
Data Warehousing
Mary E Letourneau
Visualization
April 8 and 11, 2020
18
Different forms of visualization
Bookmark this!
https://datavizcatalogue.com
Choosing your Visualization
What are the measures you want to explore?
What are the references (dimensions) that are pertinent to your exploration?
Are hierarchies available, or can they be created, to support drill-down?
Please don’t include too many variables!
Which graph type best supports the hypothesis you are exploring?
Approach for choosing graph type
How do you want to project variables against each other?
How many variables?
What is the volume of data?
Is time a variable of interest?
Are variables categorized (hierarchical) or not?
A Abela © 2009 http://extremepresentation.com
Chart Chooser
Dimension on horizontal, especially Time
Measure(s) up vertical
Each measure is a 2-d graph
Compare changes in measures with dimension
Variants: bar/column chart
Line Chart
How do you read this?
Scatter Chart
Wide assortment of facts
No linear dimension for line chart
Looking for correlations across axes
Note additional info via colors and data point shapes – low-cardinality dimensions
http://datadrivenjournalism.net/resources/DDJSchool_Tutorial_Analysing_Datasets_with_Tableau_Public
How do you read this?
Aha!
No low youth unemployment with low secondary education
(Almost) No high youth unemployment with low secondary education
Bubble Chart
Another variable to scatter chart
Bubbles will be measures
Take correlation of axes and determine if the bubble size relates to either
https://upload.wikimedia.org/wikipedia/commons/a/a0/Bubble_Chart_of_Crime_versus_Poverty_in_50_states.jpg
How do you read this?
How do you read this?
Radar/Circle Chart
http://www.practicaldb.com/data-visualization-consulting/outlier-detection/
Dimensions with repeating cycles
TIME
Helps to discover patterns in data
How do you read this?
Heat Maps
Dimensions on both axes
Render measures with color gradient
Size (a la Bubble Charts) can also present measures
How do you read this?
Map Graphs
Another “natural” rendering
present one or more variables against geography with
labels/values
colors – excellent “heat maps”
Map Graph example
Just say no! – Pie Charts
only useful for finding biggest and smallest
Just say no! – Tree Maps
https://techandmate.com/blog/different-chart-experiences-in-tableau/
What does this actually tell us?
Just say no! – Stacked Charts
Relative proportions not visually distinguishable
https://www.tutorialspoint.com/tableau/tableau_bar_chart.htm
Well, Maybe
http://www.storytellingwithdata.com/blog/2012/12/and-winner-is
Start by selecting your data
What are your sources for information?
You need data in one or more tables
Common variables must join the tables in some way
However not all visualizations present as joins per se
Tableau Visualization
Use each workbook for a particular hypothesis
Create and attach data sources to form and join the information you want
Use a sheet in the workbook to create a visualization testing your hypothesis
Be honest!
Especially with yourself!
Don’t let confirmation bias draw your graph or prejudge your conclusion
What does this mean?
Widgets
Jackelopes
Problems with this graph
Sample size too small to be statistically useful
Scope of data uncertain – what other dimensions or measures apply?
Outliers probably unusable
What would it mean to be above or below the line? What is “widgets per jackelope”?
Is there any kind of line, or dip, or other way you can draw a conclusion from this set of dots?
So what does this mean?
Widgets
Jackelopes
Add some sensationalistic context…
What about 3-D?
Rendering in 2-D is subject to distortion
Must be able to rotate and zoom to explore
Don’t add spurious info to make something splashy
What can we do with this?
https://www.livestories.com/blog/five-ways-to-fail-data-visualization
However…
Useful References
https://eazybi.com/blog/data_visualization_and_chart_types/
Choose Right Chart Type for Data Visualization. Part 1: Data Comparison
https://www.livestories.com/blog/five-ways-to-fail-data-visualization
MET CS 689
Data Warehousing
Mary E Letourneau
Key Points
April 8 and 11, 2020
52
Module 4 Key Points (ten questions)
Information life-cycle management (Krishnan ch 12)
Components – be able to describe the various parts of each component
ILM – Information management policies
E.g. Data acquisition, quality, retention, security
Governance
E.g. Governance council members and responsibilities
Technology Solution
E.g. Data quality, enrichment, transformation, archival and retention, master data management, metadata
Benefits
Big Data special considerations
Module 4 Key Points – con’t
Big Data Analytics, Visualization, and Data Scientists (Krishnan ch 13)
Big Data Analytics
Data discovery
Data acquisition, tagging, classification, modeling
Visualization
Reporting (Module notes)
Types of reports
KPI Reports
Ad Hoc Reports
BI Reports
Delivery of reports
Analyses
Drill-down and Roll-Up
Drill-across
Module 4 Key Points – con’t
Reporting (Module notes) – con’t
Visualization
Choosing the data
Choosing the visualization type
Understand the types of visualizations (line chart, scatter plot, bubble chart, etc) and when they are appropriate (relationship, comparison, distribution, composition)
Common mistakes
Forwarding Data to Further Stores and Uses
OLAP and Cubes
Visualized Cubes – Slicing, Dicing, Pivoting
Warehouse Data
Data Marts
Data Feeds
Data Lakes
Data Vaults
MET CS 689
Data Warehousing
Mary E Letourneau
Final Project
April 8 and 11, 2020
56
Final Project
Design
Star Schema – Must have a REALLY good reason for a snowflake
One fact table – Min 100,000 rows
Three or four (preferably) dimensions
One time dimension
One dimension with hierarchies
One or two slowly changing dimension attributes, at least one of which is implemented with Type 2.
Submit a Visio or Lucidchart diagram
ETL
Can use any combination of methods
Submit your code and/or package
Final Project (con’t)
Visualization
Using Tableau or Power BI – at least two visualizations of the data
One visualization needs to have drill-down/roll-up (i.e. use the hierarchical dimension)
One visualization needs to demonstrate filtering
One visualization needs to use one or more calculated fields
Between all visualizations, all dimensions are used.
Submit at least four images … one for each of the above criteria. Expectations are that it could require 8 to 10 images to demonstrate everything.
Word document
Assumptions
Design process
ETL process and why that process was chosen
Professional document, with references as needed.
PowerPoint – 5 – 10 minute presentation
Zip all files and submit to the Blackboard.
Have a Good Evening and a Great Week!
End of Presentation
59
/docProps/thumbnail.jpeg