PowerPoint Presentation
Database Systems Infrastructure
BI
Copyright © 2012, SAS Institute Inc. All rights reserved.
Coming back to this slide…
So far, we have looked at Normalisation and ERD, and today we will be looking at Business Intelligence (BI) and Data Warehouse (DW).
In the 1990s, Cognos and Business Objects and a few other vendors are moving in the direction of Business Intelligence (BI), Data Warehouse (DW) is the key of consolidated the data before populated to the BI applications.
The purpose of DW is the data is cleaned and I do not have to worry about cleaning the data again. Every day or every week I only have to clean and add the current data to the database.
The data were used in BI. Reports are printed or on screen for managers to review and make decisions.
INFS5710 Week 1
Business Intelligence (BI)
Comprehensive, cohesive, integrated set of tools and processes
Captures, collects, integrates, stores, and analyzes data
Generates and presents information to support business decision making
Allows a business to transform:
Data into information
Information into knowledge
Knowledge into wisdom
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
From the previous diagram, we have seen it has to do with the users.
We have to gather data and turn the data into information, and usually on a report on paper or screen.
The decision maker will review and make a decision on the information on a report or they will come back and ask for more information.
As stated in the slide…
Business Intelligence (BI)
Concepts, practices, tools and techniques to help business
Understand its core capabilities
Provide snapshots of the company situation
Identify key opportunities to create a competitive advantage
Provides a framework for
Collecting and storing operational data and aggregating it into decision support data
Analysing decision support data and presenting generated information to end users to support business decisions
Making business decision which generates more data
Monitoring results to evaluate outcomes and predicting future outcomes with a high degree of accuracy
How would the LA Bike-Sharing company in your project do business analysis?
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Data are collected and stored, and then the data will then be used to populate (i.e. added to) in DW.
As stated on the slide…
Your Project: LA Bike-Sharing
BI
Employee,
Products,
Customers,
Assets
:
Bike transactions
(real time)
Weather
Your analysis
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
INFS5710 Week 1
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
This is a typical BI framework, i.e. an IT infrastructure for BI. You have people, process, management and governance.
1. People are the technical people or users.
2. Processes are what you need to do get the report produced.
3. In this case, the data, external and operational, have to undergo ETL (Extract, Transform and Loading) which is to clean the data. Management refers to management of data.
Extract is to get the data.
Transform is to clean the data to make sure they are readable, blanks are filled and data are corrected to the correct format. For instance, you might want to consolidate “University of New South Wales”, “U. New South Wales”, “University of NSW”, or even “Unsw” etc.. To become “UNSW”.
Loading is to load into DW, and there is usually a process of checking and evaluating the data uploaded into DW (Data store in the diagram), if they are not good or enough information, you do it again.
Data cleaning and data cleansing are two different terms used but they are the same thing.
You can then run a query on the data and create reports, or do data analysis etc.
As for governance, this is to deal with how you data is to be managed. Who have access to the data, and what data is to be stored and where?
Read Table 13-2 for the BI architectural component.
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time.
Use a rubric cube as an example, each side of the rubric cube is a dimension, so you can say you have a six dimensions. One dimension could be sales, another dimension could be states lot like NSW, Victoria, and so on, another dimension could be products, another dimension could be customers, and so on
Practices to Manage Data
Data visualization: Abstracting data to provide information in a visual format
Enhances the user’s ability to efficiently comprehend the meaning of the data
Techniques:
Pie charts and bar charts
Line graphs
Scatter plots
Gantt charts
Heat maps
Project schedule
a representation of data in the form of a map or diagram in which data values are represented as colours.
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
In SAS EG and Studio, both have these tools. The graphs are not pretty, colours and fonts are not the best but you can see the results.
Reporting Styles of a Modern BI System
insightful information about the organization and key actionable information used to support decisions.
monitor decision outcome
explanatory and predictive
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Different types and purpose of different Reporting Styles of a modern BI System.
Monitoring could be to see how things are going, so you might have different colours such as red, amber, and green. Green means everything it’s OK. Amber means something needs attention. Red means something needs attention.
Advanced reporting
Monitoring and alerting
Advanced data analytics
Business Intelligence Evolution
A transaction is any action that reads from or writes to a database
(SELECT, UPDATE, CREATE)
ATM
Relational DB
Answer simple
Questions only
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
A summary of BI Evolution.
Multidimensional
DB
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Decision Support Data
Decision support system (DSS) is an arrangement of computerised tools used to assist managerial decision making
Typically has a much narrower focus and reach than a BI solution
BI information technology has evolved from centralized reporting styles to the current, mobile BI and Big Data analytics style in the span of just a few years
The rate of technological change is not slowing down; technology advancements are accelerating the adoption of BI to new levels
Effectiveness of BI depends on quality of data gathered at operational level
Operational data
Seldom well-suited for decision support tasks
Stored in relational database with highly normalised structures
Optimized to support transactions representing daily operations
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Decision Support Data
Differ from operational data in:
Time span
Granularity
Drill down: Decomposing a data to a lower level
Roll up: Aggregating a data into a higher level
Dimensionality
Operational data short time frame;
Decision support data longer
for decision support data
High for decision support data (e.g., data requirement for multi-periods, multi-location, multi-companies)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
“multi-dimensional”
representation
①
②
③
①
②
④
③
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
You can transform operational data into decision support data by using Pivot Table in Excel. Pivot Table is a cheap and free data tools to do BI.
Column A Year becomes Time
Column B Region remains as Region
Column C Agent remains as Agent
Column D Product remains as Product
Column E Value becomes Sales
You then can run a report, say, Sales by product, region over the last two years (2014-2015).
Decision Support Database Requirements
Database schema
Must support complex, non-normalized data representations
Data must be aggregated and summarized
Queries must be able to extract multidimensional time slices
Slice-and-dice
organization of data
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
We will talk more about data warehouse in a moment and that will fit in with this.
Decision Support Database Requirements
Data extraction and loading
Allow batch and scheduled data extraction
Support different data sources and check for inconsistent data or data validation rules
Support advanced integration, aggregation, and classification
Database size should support
Very large databases (VLDBs)
Advanced storage technologies
Multiple-processor technologies
ETL
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Data Warehouse
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Remember, we talked about ETL a few slides back.
The step is associated with the Collection of data. You have plenty of transaction data like in a supermarket, they have several millions of transactions, probably most on toilet paper but you need to extract from the database to be cleaned and processed before the data is stored in the Data Warehouse (DW).
Your data might not be cleaned, so you will have to clean it. Once you cleaned the data, you might not want to do it again, so you want to save it somewhere – this is the data warehouse comes in. Thus, transformation is the process of doing this!
The transformation is the critical part of the whole process. As part of the cleaning process, you can transform the data in various way by ensuring all the columns are filled with data. New columns could be created, or new columns are created for aggregation or summarisation.
Once, everything is done, then you load the data into DW.
Next, we will talk about integrated, subject-oriented, time-variant and non-volatile in DW.
The feedback loop shows that the design of DW may be evolving. From analysis, analysers may feel like to get more data (different types) to analysis, which is represented by the feedback loop.
Table 13.8 – Characteristics of Data Warehouse Data and Operational Database Data
Read-only
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stared on the slides…
The four characteristics of DW of integrated, subject-oriented, time-variant and non-volatile are on the slides.
Star Schema
Data-modeling technique
Maps multidimensional decision support data into a relational database
Creates the near equivalent of multidimensional database schema from existing relational database
Yields an easily implemented model for multidimensional data analysis
is the approach most widely used to develop data warehouse and dimensional data marts
organization of data
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
ERD is used in the development of a database.
To develop a data model for DW, we called it Star Schema!. It is different from ERD – the model is de-normalised.
It is called a star schema because it looks like a star! You will see later…
Data Marts
Small, single-subject data warehouse subset
Provide decision support to a small group of people
Benefits over data warehouses
Lower cost and shorter implementation time
Technologically advanced
Inevitable people issues
(minor resistance to changes)
(a test system for a full data warehouse)
e.g., finance, sales, human resource
e.g., By State – only people in that state can see the numbers.
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Data mart is trimmed to a small subset in DW contains data for a specific area and for a specific group of people. The size of the data mart is much smaller than data warehouse, so the access speed is much faster.
For instance, Finance data mart for finance department or sales data mart for sales department.
It has benefits over larger DW – as stated on the slide…
Components of Star Schemas
Each dimension (above) has its own table with attributes.
(e.g., region – state – city – store)
Drill-down and roll-up
Sales data for example (measures)
(Sales by) location, time, customer, products
people, products, place and time
tables
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
For example for the sales star schema – you can have sales data in the Facts, and location, time, customer and product as dimensions.
For example the attributes for product could include Product_id, Product_descrtiption, Weight, Bar_code and so on.
There are attributes of the facts and dimension you have to add.
Facts
Dimensions
Attributes
Numeric values that represent a specific business aspect
Qualifying characteristics that provide additional perspectives to a given fact
Used to search, filter, and classify facts
Slice and dice: Ability to focus on slices of the data cube for more detailed analysis
Attribute hierarchies
Provides a top-down data organization
Slice and Dice
x
y
z
x
z
x
y
y
z
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Sorry, the picture is not very clear!
Slice a cube, like cut a slide of cake. You will left with 2-dimensional data.
Dice means you can cut into smaller cubes! You still have 3-dimensional data.
To Roll-up and Drill Down
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
To roll-up, you can think of adding the data together as you move up, whereas
To drill-down, you want to look the data in more details.
Star Schema Representation
Facts and dimensions represented by physical tables in data warehouse database
Many-to-one (M:1) relationship between fact table and each dimension table
Fact and dimension tables
Related by foreign keys
Subject to primary and foreign key constraints
Primary key of a fact table
Is a composite primary key because the fact table is related to many dimension tables
Always formed by combining the foreign keys pointing to the related dimension tables
(see next slide)
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
The key facts about Star Schema is shown on the above slide
It is probably better to show as an example on the next few slides.
each dimension table
fact table
Foreign keys of a fact table to the related dimension tables
PK
PK
PK
PK
Denormalized dimension tables
Dimensions relate directly to the fact table only
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
The fact table has four dimension tables are they are location (loc_id PK), time (time_id PK), customer (cust_id PK), and product (prod_id PK).
The fact table is sales, and the PK is made up of a composite key of four Ids, namely loc_id , time_id, cust_id and prod_id.
Look at the schema and try to work it out!
Performance-Improving Techniques for the Star Schema
“Normalising” dimensional tables (might not be in 3NF)
Snowflake schema: Dimension tables can have their own dimension tables
Maintaining multiple fact tables to represent different aggregation levels
Denormalizing fact tables
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
Note: Star Schema is not normalised! It probably is in 2NF or 1NF! The reason it you want fast access to the data.
Sometimes, you want to do some normalisation in order to ensure the integrity of the data. Snowflake schema, evolved from Star Schema, allows some normalisation on one of the dimension table as shown in the snowflake.
e.g., store location
A dimension table can have its own dimension tables.
fact table
Compare with the LOCATION table in Fig 13.11
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
For example, a snowflake schema with some normalisations.
Simply to speed up
query operations
redundancy!
The data warehouse designer must identify which levels of aggregation to precompute and store in the database.
fact table
fact table
fact table
fact table
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
You can have multiple fact tables and this will speed up accessing the data.
Performance-Improving Techniques for the Star Schema
Partitioning and replicating tables
Partitioning: Splits tables into subsets of rows or columns and places them close to customer location
Replication: Makes copy of table and places it in a different location
Periodicity: Provides information about the time span of the data stored in the table
Distributed database
or called fragmentation
Why distributed database? (store data near users)
Internet as the platform for data access and distribution
Mobile wireless revolution
data of current year, previous years, or all years stored in different tables
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
We cover some of these terms next week.
Online Analytical Processing (OLAP)
Advanced data analysis environment that supports decision making, business modeling, and operations research
Characteristics:
Multidimensional data analysis techniques
Advanced database support
Easy-to-use end-user interfaces
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Multidimensional Data Analysis Techniques
Data are processed and viewed as part of a multidimensional structure
Augmented by the following functions:
Advanced data presentation functions
Advanced data aggregation, consolidation, and classification functions
Advanced computational functions
Advanced data-modeling functions
3D graphic, pivot table
Slice, dice, drill down, roll up
Market share, return,
profit margin
What if scenarios, predictive modeling
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
As stated on the slide…
Figure 13.17 – OLAP Architecture
©2017 Cengage Learning®. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected website or school-approved learning management system for classroom use.
.
The OLAP is virtually looks like DW except the data can come from either external data directly or from DW.
The data go directly to OLAP usually are the current or confidential data.
Once the data are gathered together, you can access directly to create reports or view on your Mobile.
Oracle
Oracle
Database
Flat Files
Machine Learning
Prepared by Vincent Pang, Feb. 2021 Note: In-Memory Database (e.g. SAP Hana) is an alternative data model not shown here
ETL (Data
Cleansing)
Data (DW to BD, or
vice versa or both)
RelationalDatabaseBig DataDataWarehouseNormalisationReporting (Business Intelligence and Visualisation) and Business Analysis (End Users)(De-Normalised)External Data (e.g. Excel)Hadoop Distributed File System (HDFS) and MapReduceEntity Relationship Model (ERM)Unstructured Data(Social Media)Structured Data (Internet of Things (IOT))HadoopSQLSQLSpark and NoSQL (and other tools)Data StreamingData Streaming(“Not Normalised”)External Data (e.g. Excel)Star Schema