程序代写代做代考 database ER flex SQL Hive data structure Seminar 1

Seminar 1

Lecture 3
Dimensional Modelling
Marten Risius
E-mail: m.risius@business.uq.edu.au
Room: 516 Joyce Ackroyd (37) Building

‹#›

Recap: Business Analytics Framework

‹#›

2

Recap: Invoice example (Normalized Relations and ER Diagram)
We can name the relations now
Customer (CustomerNumber, CustomerName, CustomerAddress)
Clerk (ClerkNumber, ClerkName)
Product (ProductNumber, ProductDescription)
Invoice (InvoiceNumber, Date, CustomerNumber, ClerkNumber)
InvoiceLineItem (InvoiceNumber, ProductNumber, UniPrice, Quantity)

‹#›

‹#›
Example ER model
4

Recap: Querying a Database (Four Primary Operations of SQL)
Operation SQL Command
Create INSERT
Read SELECT
Update UPDATE
Delete DELETE

‹#›
Recap: SELECT statement

SELECT [ALL | DISTINCT] select_expr [, select_expr …]
List the columns (and expressions) that are returned from the query

[FROM table_references
Indicate the table(s) or view(s) from where the data is obtained

[WHERE where_condition]
Indicate the conditions on whether a particular row will be in the result

GROUP BY {col_name | expr } [ASC | DESC], …]
Indicate categorisation of results

HAVING where_condition]
Indicate the conditions under which a particular category (group) is included in the result

ORDER BY {col_name | expr | position} [ASC | DESC], …]
Sort the result based on the criteria

[LIMIT {[offset,] row_count | row_count OFFSET offset}]
Limit which rows are returned by their return order (ie 5 rows, 5 rows from row 2)]

‹#›
Recap: SELECT Example 1

SQL

RESULT

The names of the attributes that we want data from in the table
The TABLE (name) we want to query

‹#›
Recap: SELECT Example 2
SQL

RESULT

‹#›
Recap: An SQL Primer: GROUP BY

SQL

RESULT
Logic: Count (Customer ID) will return the number of customers,
Group BY CustType will group the result based on CustType
Aggregating data by particular attribute

‹#›
Recap: SQL Joins – Inner JOIN

Inner Join the tables with foreign keys!

SQL

RESULT

‹#›

10

ER Modelling Task
The Brisbane Movie Library purchases movies on various formats and loans them to its members for a charge in order to make a profit. The business is designing a new information system.
The proposed new system will include an accurate catalogue to inform members of movies held in each store by a number of different categories (eg. action, comedy, etc.) or which movies are held featuring their favourite actors. The catalogue will also show if a particular movie is available that day at a particular store.
Accurate information about which members have borrowed which movies, and when movies are due to be returned will also be available. This should encourage borrowers to return their movies promptly. Keeping track of loans using the current membership system has proven to be slow and prone to error. Improved turnaround of movies should increase profit.

In order to keep track of the costs involved in purchasing movies, details of purchase orders will be stored for all movies. This information will help to select suppliers, negotiate cheaper prices for future purchases, and help with auditing.

Each movie is allocated a rental charge and all loans are for one day (24 hour period). Occasionally, a special member may be given a longer loan period. All overdue movies incur an excess charge of $2 per day for each day they are late. While members will be encouraged to return movies to the store from which they borrowed them, the new system should also make it easier to keep track of movies returned to other stores.

‹#›
Brisbane Movie Library – ER Model

‹#›
Brisbane Movie Library – ER Model

‹#›

Recap: Business Analytics Framework

‹#›

14

Transactional vs Informational Databases

‹#›
Agenda and Learning Objectives for today
By the end of this class you should be able to:
Identify difference between informational and transactional questions
Explain the differences between transactional and informational databases
Define and develop dimensional data models for data-driven decision

‹#›
Transactional Databases
Support operations of an organization (running transactions)
Selling a products, shipping, hiring, supplying
Store data from every-day transactions
Highly normalised to avoid redundancy of data
Optimised to write new data in as transactions happen (because of normalised structure)

‹#›
Is normalization good for analytical decision-making purposes?
Let’s look at the two types of databases:
Transactional databases
used to answer operational questions

Informational (Analytical) databases
used to answer strategic questions

‹#›
Sales Transactions

Marten Risius
Leesburg
Sam’s Club
September 8, 2012
Kendall Jackson
Chardonnay
2 for total of $19.88
Liquor

‹#›

19

Transactional (Operational) Questions
Customer Service:
Help! I forgot my membership card!
Select membership_nbr from MEMBER_INDEX where phone_num = ‘555-1212’

‹#›
20
How many tables affected? 1
How many rows have to be accessed? 1 (with an index in place)

Transactional (Operational)

Select item_location from ITEM_DESCRIP where item_name = ‘Kendall Jackson chardonnay’
Inventory:
Where do you carry Kendall Jackson chardonnay?

‹#›
21
How many tables affected? 1
How many rows have to be accessed? 1 (with an index in place)

Transactional (Operational)
Customer Service:
What stores are open on Sunday in Queensland?
Select store_nbr from STORE_INFORMATION where open_Sun_flag = ‘yes’ and state = ‘Queensland’

‹#›
22
How many tables affected? 1
How many rows have to be accessed? As stores open on Sunday, or a table scan if no index in place

Analytical Questions
Campaign Management:
How many customers purchased more than $500 worth of alcohol in our Brisbane stores this year?

1,007,961
48,204,709
150
5,668,375
9,894
432,233

‹#›
23
How many tables affected? 5
How many rows? millions

With business analytics, we are interested in analytical queries
One is interested in numerical aggregations
How many?
What is the average?
What is the total cost?

One is interested in understanding dimensions
Sales by state by customer type
Sales by product by store by quarter

‹#›
Informational Databases
Have a different scope & different purpose
Show me the top products
Show me problem regions
Tell me why (drill down)
View other data (drill across)
Show the highest margins
Alert me if calls are high
Focus is on getting information at a higher level suitable for decision-making

Transactional Databases
Focus is on supporting day to day operations
Recording orders
Processing claims
Making shipments
Generating invoices
Receiving cash
Reserving airline seats

Transactional vs Informational Databases

‹#›
Transactional vs Informational Databases
Transactional Informational
Data Content Current Values Archives, derived, summarised
Data Structure Optimised for transactions (lots of writes) Optimised for complex queries
Access Frequency Very High Medium
Access Type Read, update, delete Read
Usage Predictable, repetitive Ad hoc, random, heuristic
Response Time Sub-seconds Seconds to Minutes
Users Many Relatively few

‹#›
So for decision-making purposes, we need an Informational Database
Designed for analytic tasks
Gets data from multiple locations
Internal / external
Intuitive and easy to use
Allows direct access by users without IT support
Conducive to long analysis sessions
Read intensive
Updated at known intervals and is stable
Storing historical data also
Able to allow users to run queries and get results online
Able to allow users to initiate reports

‹#›
Our Solution
The Data Warehouse!
So, what’s a data warehouse?
A single repository of organisational data
Current and historical
Integrates data from multiple sources
Internal and external
Extracts data from source systems, transforms, loads into the warehouse
“Single version of truth” – a holistic integrated view of organization data
Makes data available to managers/users
Without hindering day to day transactional work
It’s a database! But it is denormalised…

‹#›
Data Warehouse Features

‹#›
Defining Features
Subject Oriented Data
Data warehouses are organised around particular subjects
Data is integrated across functions
sales, customers, products
Data in a DW cuts across Application requirements

Operational Applications
Order Processing
Savings Accounts
Customer Billing
Loans Processing
Accounts Receivable
Supplier Orders
Data Warehouse Subjects
Sales
Product
Account
Customer
Claims
Supplier

‹#›
Defining Features
Integrated Data
Data from different systems
Can be from different applications, operating systems, etc
File layouts, field naming conventions could be different
Locale information could be different
Need to convert to a common format
allows comparison and consolidation of data from different sources
Data from various sources are validated before storing them in a data warehouse.
Data quality is crucial to the credibility of the warehouse

‹#›
Defining Features
Time-Variant Data
In application systems the data is current
i.e. The current true (or correct) value
In a Data Warehouse
Data used for analysis and decision making
Need current and past data = Historical data
Otherwise can’t answer many analytical questions
Data is stored as snapshots of the current values
Snapshots are time stamped
Data changes stored over time
Allows
Analysis of the past
Relation of data to the present
Forecasting for the future

‹#›

Defining Features
Non-Volatile Data
Unlike transaction systems the DW doesn’t get updated every time the data changes
Store extracted data snapshots over time
Data is periodically updated
That could be every second, hour, day, week or even month
Different data items updated with different frequencies
Users have read access only
all updating done automatically by ETL process and
periodically by DB Administrator

‹#›
Defining Features
Data Granularity
Operational systems
Data kept at lowest level of detail
Summary data created by adding up the numbers
Its not stored
Informational systems
Queries usually start with summary data
Then as analysis occurs more detailed levels of data are needed
Data usually stored at various levels for efficiency
Data granularity is the level of detail
The finer the granularity the lower the level of detail
The lowest level of granularity is called “the grain”

‹#›
Defining Features
Supports management needs
Used by end users
Data warehouses require a simple and easy to navigate structure
Responses to queries should be “timely”

‹#›
35

Data Warehouse Design:
Dimensional Modelling (Kimball)

‹#›
36

Business Analyst World
How much revenue did the product G generate in the last three months, broken down by month for the south eastern sales region, by individual stores, broken down by promotions, compared to estimates and to the previous version of the product
Analysis starts usually with a single indication of something strange, then goes deep into the data, left to a new dimension, right to another, up to the summary, back down and left and right again, until the problem is identified…

‹#›

37

Introduction to Dimensional Modelling
Popularised by Ralph Kimball in the 1990s
Based on the multi-dimensional model of data and designed for retrieval-only databases
Very simple, intuitive, and easily-understood structure
Also known as star schema design

‹#›
Dimensional Modelling- Objectives
Produce database structures that are easy for end users to understand and write queries against
Optimise query performance (as opposed to update performance)

‹#›
Dimensional Modelling
A dimensional model consists of
a fact table
several dimensional tables
hierarchies in the dimensions

Essentially a simple and restricted type of ER model

‹#›
Fact Table
A fact table contains the actual business measures (additive), called facts
Also contain foreign keys for dimensions
Sale
Time key
Store key
Customer key
Product key
Dollar sales
Unit sales
facts
keys

}
{

‹#›
Fact Table – example
Actual data might look like this
Granularity, or level of detail, is a key issue
Finest level of detail for a fact table
Determined by the finest level of each dimension

Time-id Store-id Cust-id Prod-id Dollar sales Unit Sales
T100 S303 C101 P98 $120,000 5,000
T101 S303 C256 P98 $240000 10,000
T102 S387 C101 P10 $456,000 27,899
T100 S234 C400 P56 $100,200 5,600

‹#›
Grain Example
Rough: customer postal codes (5,000), product type (200), store (300), week (52)

Detailed: individual customer (200,000), individual product (2,000), store (200), day (365)

Impact
Higher storage requirements for fine grain
More reporting flexibility for fine grain

‹#›
Grain size
Product of dimension cardinalities
Product of sparsity rate (1 – sparsity)
Sparsity: number of cell with values / total number of cells
43

Dimension Tables
Sale
Time key
Store key
Customer key
Product key
Dollar sales
Unit sales
Customer
Customer key
Name
Customer type

Product
Product key
Product type
weight
Store
Store key
Address
Region
Time
Time key
Day
Month

‹#›
Dimension Hierarchies
Sale
Time key
Store key
Customer key
Product key
Dollar sales
Unit sales

Product
Product key
Product type
Product group
Product sub-group
weight
Product group e.g. Hardware
– Product type e.g. Tool
– Product e.g. Hammer

‹#›
Dimension Table – example
Actual data might look like this
Hierarchy evident in data
Prod-id Prod-Name Prod-Group Prod-Subgroup Weight
P10 Hammer Hardware Tool 5kg
P56 10cm Nails Hardware Nails 1kg
P98 Plastic Pipe Plumbing Pipe 1kg

‹#›
Dimensional model as an ER model
Sale
Time key
Store key
Customer key
Product key
Dollar sales
Unit sales
Customer
Customer key
Name
Customer type

Store
Store key
Address
Region
Time
Time key
Day
Month

Product
Product key
Product type
Product group
Product sub-group
weight
Fact table is an
intersection table

‹#›
Star Schema
Sales Summary
(Fact Table)
Product

Customer
Retail
Outlet
Time
“WHEN” dimension
“WHO” dimension
“WHAT” dimension
“WHERE” dimension

‹#›
Designing a Dimensional Model
Choose a Business Process
Choose the grain of the fact table
Choose the dimensions
Choose the measured facts (usually numeric, additive quantities)
Complete the dimension tables

(Kimball, 1996)

‹#›
Dimensional Modelling Task
Design a dimensional model for LOANS
The Brisbane Movie Library purchases movies and loans them to its members for a charge in order to make a profit. The business is designing a data mart and decision support system.
Management wants to analyse the borrowing patterns of members in order to better identify the key members (most revenue per quarter). They can then focus on providing service to these members.
Management needs to analyse the value of their movies. They want to know which movies generate the most revenue per quarter. They don’t want to keep movies which are never (or rarely) borrowed. If a movie has not been borrowed for 3 months, it will be sold. Movies which generate a total rental return over 6 months which is less than their purchase price should also be sold to help keep inventory levels down.
Management wants to analyse the performance of each store to understand which are the most successful (in terms of profit = revenue – cost).

‹#›
Design Outcomes: Normalised or Denormalised?
Normalisation
Eliminates redundancy
Storage efficiency
Referential Integrity

Denormalisation
Fewer tables (fewer joins)
Fast querying
Design is tuned for end-user analysis (tools & cognition)

‹#›
51
Most DBAs assume that normalization is always the best way to design data. But, this is not always true.

In data warehousing, speed is more important than efficient data storage and updates. This is because the data in the warehouse is historical and does not change. Data warehouse DBAs are much more concerned with designing the database tables and indexes so that queries can be performed on thousands – even millions – of records very quickly.

Let’s Summarise!!
Transactional databases suitable for running transactions
Store data in normalized structure
Informational databases suitable for decision-making
It is not highly normalized

‹#›
What is Examinable:
Differenced between informational and transactional databases/questions
DW Features
Developing dimensional models

‹#›
Next Seminar

‹#›
Next Seminar
More Dimensional Modelling

‹#›

Address
Delivery
Supplier
Product
Sales Order
Customer
Employee
Manage
Accept
Place
Actual Delivery
Request Delivery
Distribute
Belong
Produce
Employee Number
Employee Name
Employee Telephone Extension
Employee Position Title
Employee Start Date
Customer Number
Customer Name
Customer Registered Date
Customer Credit Limit
Customer Credit Terms Code
Customer Segment Code
Address Code
Address Occupancy Number
Address Street Name
Postal Area Code
Delivery Number
Delivery Date
Delivery Instructions
Delivery Quantity Delivered
Sales Order Number
Address Code
Sales Order Number
Customer Number
Sales Order Date Accepted
Sales Order Confirmation *
Employee Number
Address Code
Product Number
Product Name
Product Description
Product Category Code
Product Current Cost Price
Product Quantity on Hand
Product Re Order Level
Supplier Number
Supplier Name
Customer /
Employee
Assignment
Be
Sales Order Item
Include
Product Source
Have
Sales Order Number
Product Number
Sales Order Item Quantity Requested
Sales Order Item Sale Selling Price
Sales Order Item Sale Cost Price
Product Number
Supplier Number
Product Source Start Date
Product Source End Date
Customer Number
Customer/Employee Assignment Start Date
Customer/Employee Assignment End Date *
Employee Number
Customer Contact
Person
Customer Contact
Type
Classify
Contact
Product Category
Group
Address Type
Postal Area
Classify
Situate
Address Type Code
Address Type Name
Postal Area Code
Postal Area Suburb
Postal Area City
Postal Area State
Postal Area Country
Region Code
Product Category Code
Product Category Name
Product Category Description
Customer Contact Name
Customer Contact Number
Customer Number
Customer Contact Type Code
Customer Contact Type Code
Customer Contact Type Description
Level 1
Customer
Location
Has
Belong To
Address Type Code
Address Code
Customer Number
Customer
Segment
Customer Credit
Terms
Customer Industry
Customer Industry
Type
Supplier Contact
Possess
Product Price
History
Contain
Product Number
Product Price History Termination Date
Product Price History Minimum Quantity
Product Price History Selling Price
Supplier Number
Supplier Contact Person
Supplier Contact Number
Region
Belong To
Employee
Qualification
Has
Categorise
Categorise
Classify
Customer Segment Code
Customer Segment Description
Customer Credit Terms Code
Customer Credit Terms Description
Customer Industry Type Code
Customer Number
Region Code
Region Name
Region Description
Employee Qualification Code
Employee Number
Employee Qualification Year
Customer Industry Type Code
Customer I
ndus
try Type Description
Offer
Employee
Qualification Type
Has
Employee Qualification Code
Employee Qualification Description

Print

Store Visit
Item Scan
SubCategory
Store Information
Member Index
Item Description

/docProps/thumbnail.jpeg