Information Management
Data Warehousing (thanks to Wolf- )
Decision support systems
• The amount of data generated and collected every day is increasing
• Decisions are taken based on information extracted from data
• Bad decisions can lead to disaster
• A decision support system is a set of IT solutions and tools used for extracting information from electronically stored data
• Past: describe the past, describe problems, reduce costs
• Future: anticipate the future, suggest corrections, increase profits
• Issues in realizing a DSS: large datasets, heterogeneous information sources, history of data, …
Data warehouse
• Data warehousing is the foundation of decision support systems
• Questions we need to answer [Kimball 1996]
• We have many data but we cannot access them
• How is it possible that different people in the same role have substantially different results?
• We need to select, group, and manipulate data in any possible manner
• Show me only what is important
• Everybody knows that some data are not correct
Data warehouse: requirements
• Accessibility also by users with limited technological background
• Integration of different data sources
• Flexibility of query operations
• Synthesis for easy analysis
• Multidimensional representation to provide an intuitive view of the data
• Correctness and completeness of represented data
Data warehouse: definition (1)
• Intuitively, a data warehouse is a large database
• Not all large DBs are DWs, but DWs are large DBs (from a TB to several PB)
• Distributed over several servers and requires high computational power
• A data warehouse is a collective data repository
• Contains snapshots of the operational data (history)
• Populated through the Extract-Transform-Load (ETL) process
• Useful for data analytics
Data warehouse: definition (2)
• Compared to other solutions it…
• Is suitable for tactical/strategic focus
• Implies a small number of transactions
• Implies large transactions spanning over a long period of time
Data warehouse: Inmon definition
A data warehouse is a:
• Subject oriented
• Integrated
• Non-volatile
• Time variant
collection of data in support of management’s decisions (by W. H. Inmon)
Subject oriented
• Data are organized and centered around the concepts of interest for the company
• Data elements relating to the same real-world event or object are linked together
• Traditional operational DBs are organized according to applications
• Examples of subjects: customers, products, orders, …
Integrated
• The DW contains data from most or all the organization’s operational systems and these data are made consistent
• Example: gender encodings, date formats, units of measure
• The realization of a DW does not imply the insertion of new data, but the reorganization of existing ones
Non-volatile
• Data are never over-written nor deleted
• Data are read-only
Time variant
• A DW represents a sequence of snapshots of operational data, repeated periodically
• Changes over the data are never lost
• Time horizon
• Operational system: 60-90 days
• DW: 5-10 years
OLTP: On Line Transaction Processing
Traditional processing of transactions, which implements operational systems
• Predefined simple operations
• Each operation involves a limited amount of data
• Detailed and always up-to-date information
• ACID properties are crucial
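For instance, a typical OLTP operation is a short, predefined read-write transaction touching a few rows. The sketch below assumes hypothetical order_item and stock tables (transaction syntax may differ slightly across DBMSs):

  -- Record a sale of 10 units of product Y and update the stock accordingly
  BEGIN;
  INSERT INTO order_item (order_id, product_id, quantity)
  VALUES (42, 'Y', 10);
  UPDATE stock SET on_hand = on_hand - 10 WHERE product_id = 'Y';
  COMMIT;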
OLAP: On Line Analytical Processing
Processing of operations for decision support
• Complex and not predefined operations
• Each operation involves a huge amount of data
• Aggregated and historical data, possibly outdated
• ACID properties are not relevant (typically read-only)
OLTP and OLAP: separate environments
[Figure: the OLTP database serves the OLTP application used by final users (transactions); the data warehouse serves the OLAP application used by analysts (complex queries)]
OLTP vs. OLAP (1)
Operational Data Stores (OLTP)     Data Warehouse (OLAP)
Mostly updates                     Mostly reads
Many small transactions            Few, but complex queries
MB-TB of data                      TB-PB of data
Detailed data                      Summarized data
Final users / employees            Decision makers and analysts
Up-to-date data                    May be slightly outdated data
OLTP vs. OLAP (2)
• Typical questions to be answered by OLAP
• How much did sales unit A earn in January?
• How much did sales unit B earn in February?
• What was their combined sales amount for the first quarter?
• Answering these questions with SQL-queries is difficult
• Complex query formulation necessary
• Processing will be slow due to complex joins and multiple scans
• Why can such questions be answered faster with a DW?
• DW data is a rearranged and pre-aggregated data collection
• The table arrangement is subject oriented
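As an illustrative sketch (the sale fact table and its column names are hypothetical), on a DW the combined first-quarter amount of units A and B becomes a single aggregation instead of several joins over normalized OLTP tables:

  -- Combined sales amount of units A and B for the first quarter
  SELECT SUM(amount) AS q1_total
  FROM sale
  WHERE sales_unit IN ('A', 'B')
    AND sale_month IN ('2019-01', '2019-02', '2019-03');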
OLAP: FASMI requirements
• Fast: response times should be low, even if dealing with huge amounts of data
• Analytical: should provide support for the final user to analyze data in a simple manner
• Shared: data should be shared among multiple final users
• Multidimensional: data should be organized according to a
multidimensional model
• Informational: the system should store all the information of interest
DW architectures
• A DW architecture should be
• Independent: separated from OLTP systems
• Scalable: support a growing size of data and number of final users/analysts
• Extensible: able to integrate novel applications whenever needed
• Secure: enforce access control
• Easy to manage
1-level architecture
• The DW is virtual
• Implemented as a multidimensional view over the data in the OLTP systems
• Generated through a middleware
• Does not guarantee separation between OLTP and OLAP systems
• Not used in practice
[Figure: operational DBs accessed through a middleware that provides a multidimensional view for multidimensional analysis and data mining]
2-level architecture (1)
[Figure: operational DBs and external sources feed the data warehouse through the Extract-Clean-Transform-Load process; the DW supports multidimensional analysis and data mining]
2-level architecture (2)
• Data sources: heterogeneous information sources both internal and external
• Usually operational data stores that rely on relational database technology
• ETL: data from different and heterogeneous sources should be
• extracted
• cleaned, to remove inconsistencies and complete missing parts
• integrated according to a common schema
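A minimal sketch of a cleaning step, assuming a hypothetical source table src_customer and staging table stg_customer; inconsistent gender codes and date representations are unified while copying:

  INSERT INTO stg_customer (customer_id, gender, birth_date)
  SELECT customer_id,
         CASE WHEN UPPER(gender) IN ('M', 'MALE')   THEN 'M'
              WHEN UPPER(gender) IN ('F', 'FEMALE') THEN 'F'
              ELSE 'U' END,               -- unify gender encodings
         CAST(birth_date AS DATE)         -- unify date representations
  FROM src_customer
  WHERE customer_id IS NOT NULL;          -- discard rows missing the key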
2-level architecture (3)
• Staging area: contains data that should be loaded in the DW from sources
• Copied data are prepared through ETL
• These data are not used for analysis but only for populating the DW
• DW server: the DW is the logical container of collected data, which can be directly queried or used to build data marts
• Cleaned raw data
• Derived (aggregated) data
• Metadata
• Analysis: analyzes data and builds reports
2-level architecture (4)
• Data mart: subset or result of aggregations of the data in the DW
• Includes the subset of data of interest for a specific business area
• Provides better performance
• Advantages of 2-level architectures
• DW is always available, even when the data sources are temporarily unavailable
• Queries operating on the DW do not interfere with OLTP transactions
• Data in DW and OLTP are kept independent (different data model, different data granularity, different time window)
• Specific optimization techniques to support analysis
3-level architecture (1)
[Figure: operational DBs and external sources pass through the Extract-Clean-Transform-Load process into the reconciled data level, which is then loaded into the data warehouse used for multidimensional analysis and data mining]
3-level architecture (2)
• The third (additional) level is represented by reconciled data
• Reconciled data (operational data store) is the materialization of the result of the ETL process
• In 2-level architectures the result of ETL is not materialized
Applications for DWs
• A DW is the base repository for front-end analytics (or business cockpits)
• Knowledge discovery in databases (KDD) and data mining
• Results are used for
• Data visualization
• Reporting
• Decision support system (DSS) analysts
• usually have a business background
• aim to define and discover information used in corporate decision-making
• do not know exactly what they need
• operate in an explorative manner (by subsequent refinement steps)
DW design and implementation
• Top-down: analyze the whole needs of the company and design the DW in its entirety
• Long design and implementation times
• Complex task
• Bottom-up: the DW is built incrementally by iteratively integrating different data marts
• Reduces times
• Reduces costs
• Most widely used solution
Bottom-up DW development
Define objectives and plan
• preliminary phase including feasibility analysis
Infrastructure design
• analyze and compare architectural solutions and available technologies
Data mart design and development
• create data mart and applications
• integration with the existing system
Data mart design
Data mart modeling is based on the schemas of the underlying operational DBs
1. Analysis and reconciliation of data sources
2. Requirements analysis
3. Conceptual modeling
4. Refinement of workload and conceptual model validation
5. Logical modeling
6. Loading modeling
7. Physical modeling
Analysis and reconciliation of data sources
INPUT: operational DBs schemas
OUTPUT: reconciled schema
• Analyzes operational data to determine and analyze:
• the schemas of data sources
• useful correlations among data
• portions of data of interest for the decision process
• data quality
• Makes different data sources homogeneous and enables their integration
Requirements analysis
INPUT: reconciled schema
OUTPUT: facts, preliminary workload
• Facts are concepts of primary interest for the decisional process
• For the workload, it is important to determine the granularity of facts
• Fine-grained: lower performance, more precision
• Coarse-grained: higher performance, lower precision
Conceptual modeling
INPUT: reconciled schema, facts, preliminary workload
OUTPUT: fact schema
• The fact schema represents all the relevant information about facts that need to be considered
• Measures: atomic properties of the fact that can be analyzed
• Dimension: perspective for the analysis
• Fact: sales
• Measures: sold quantities, gain
• Dimensions: product, time, place
Workload and validation
INPUT: fact schema, preliminary workload
OUTPUT: validated fact schema
• Formulate queries representing the workload to determine whether the fact schema supports all of them
Logical modeling (1)
INPUT: fact schema, logical model, workload
OUTPUT: logical model of the data mart
• Identify views that should be materialized
• Design vertical and/or horizontal fragmentation to maximize
performance
• Choose the logical model between ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP)
Logical modeling (2)
• ROLAP (Relational OLAP)
• Relational database model, extended and adapted
• Data organized in tables and analysis operations expressed as SQL instructions
• Access structures specifically designed to optimize analysis operations
• MOLAP (Multidimensional OLAP)
• Stores data directly in multidimensional form
• Proprietary data structures
Loading modeling
INPUT: operational DBs schemas, reconciled schema, logical model of the data mart
OUTPUT: loading procedures
Physical modeling
INPUT: logical model of the data mart, target DBMS, workload
OUTPUT: physical schema of the data mart
• Identify indexes, based also on the workload and data volumes
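As a hedged sketch of the outcome (assuming a SALE fact table with the hypothetical foreign keys below), a workload-driven choice could simply be indexes on the dimension keys used by the star joins:

  CREATE INDEX idx_sale_product ON sale (product_id);
  CREATE INDEX idx_sale_time    ON sale (time_id);
  CREATE INDEX idx_sale_store   ON sale (store_id);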
Conceptual Modeling
Dimensional Fact Model (DFM)
• The Dimensional Fact Model (DFM) is a graphical conceptual model for data mart design aimed at
• supporting conceptual modeling
• creating an environment where queries can be formulated in an intuitive manner
• enabling requirement specification
• representing a reference point for logical modeling
• providing project documentation
• DFM conceptual schema is a set of fact schemas that model facts, measures, dimensions, and hierarchies
• A fact is a concept of interest for the decisional process
• Usually models a set of events that happen within a company
• Is dynamic and evolves over time
• A measure is a numeric property of a fact and describes a quantitative aspect of interest for the analysis
• A dimension is a property of a fact that has a finite domain and
describes an aspect of the fact for the analysis
• The relationship between measures and dimensions is expressed, at the extensional level (i.e., instance level) through (primary) events
• A primary event is a specific occurrence of a fact
• It is identified through an n-tuple with a value for each dimension
• It has a value for each measure
• A fact is a many-to-many relationship among dimensions
Event for fact SALE:
on 3/4/2019, store X sold 10 units of product Y with a receipt of 25 USD
• Dimensional attributes are the dimensions and the discrete-valued attributes that describe them
• Hierarchy is a directed tree whose nodes are dimensional attributes and whose edges model many-to-one relationships between pairs of dimensional attributes
• The dimension is the root of the tree
• The other nodes are its attributes
product → type: expresses the type of a product (beer is a drink)
product → brand: expresses the brand of a product
type → category: each type belongs to a category (drink is food)
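Aggregation proceeds along these edges. As an illustrative sketch (hypothetical sale and product tables; GROUP BY ROLLUP requires a DBMS supporting SQL:1999 grouping sets), sold quantities can be rolled up from product to type to category:

  SELECT p.category, p.type, p.product, SUM(s.quantity) AS quantity
  FROM sale s
    JOIN product p ON s.product_id = p.product_id
  GROUP BY ROLLUP (p.category, p.type, p.product);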
Hierarchy: example
[Figure: example of a hierarchy of dimensional attributes]
• Each DFM fact schema can be modelled as an ER schema
• The fact is a relationship
• Each measure is an attribute
• Each dimension is an entity
Secondary event
• Given a set of dimensional attributes, each n-tuple of values identifies a secondary event
• A secondary event aggregates the corresponding primary events
• A secondary event has a value for each measure, summarizing the values of the same measure in the corresponding primary events
Example: the secondary event aggregating all the sales in Milan of product X in January 2021; its quantity is the sum of the quantities of all the primary events with these dimension values
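In ROLAP terms this secondary event is an aggregation query over the fact table; a sketch with hypothetical table and column names:

  -- All sales of product X in Milan in January 2021
  SELECT SUM(f.quantity) AS quantity
  FROM sale f
    JOIN store st   ON f.store_id = st.store_id
    JOIN time_dim t ON f.time_id  = t.time_id
  WHERE f.product_id = 'X'
    AND st.storeCity = 'Milan'
    AND t.month = '2021-01';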
Descriptive and optional attributes
• Descriptive attribute: property of a dimensional attribute or hierarchy
• Not used for aggregation
• Leaf node
• Optional attributes:
• For some instances the attribute may not have a value
• Telephone number and address are descriptive of a store
• Diet is optional for a product
Convergence
• Two or more attributes are connected by two or more distinct paths, all representing functional dependencies
store → storeCity → state → country
store → saleDistrict → country
Conceptual design
• Starts from documentation including ER schema, relational schema, …
Identify facts
For each fact
1. Build the attribute tree
2. Identify dimensions
3. Identify measures
4. Draw the fact schema
Identify facts
• A fact is a concept of primary interest for the decisional process
• Typically events that dynamically happen in the company
• Considering an ER schema, entities that represent frequently updated archives are good candidate facts
Sales is a good candidate fact
Shop is not a good candidate fact
Build the attribute tree
• Considering an ER schema and an entity F modeling a fact, the attribute tree is such that
• Each vertex is an attribute of the ER schema
• The root corresponds to the identifier of F
• For each vertex v, the corresponding attribute functionally determines all the attributes represented by descendants of v
Build the attribute tree: example (1)
Build the attribute tree: example (2)
Prune the attribute tree
• Remove dimension attributes (single nodes or entire subtrees) that are not relevant for the DW
Identify dimensions
• Dimensions determine how events can be aggregated for the decisional process
• Dimensions are chosen among the children of the root
• The choice of dimensions is crucial as it defines the granularity of
primary events
• Time is usually a dimension
Identify measures
• Measures are chosen among the numeric attributes that are children of the root, or obtained by applying aggregate functions to other attributes in the tree
• A fact can be empty (i.e., with no measure)
Draw the fact schema
• The attribute tree is translated into a DFM schema
Logical Modeling
Multidimensional model: Cube (1)
• A cube represents a fact
• Each axis of the cube represents a dimension of the fact
• Each cell in the cube represents a primary event
• A slice of the cube represents a secondary event
Multidimensional model: Cube (2)
• A cube can have an arbitrary number of dimensions
• Cells in a cube can be empty
• Sparse cube: many empty cells
• Dense cube: few empty cells
• ROLAP logical modeling does not suffer from the sparsity issue
Star schema
It includes:
• A dimension table DTi for each dimension, with a primary key di and different attributes for each aggregation level
• A fact table FT with a foreign key di for each dimension and an additional attribute for each measure
Star schema: example
[Figure: example star schema with dimensional attributes category, department, marketingGroup, week, month, day, holiday, quarter, year, storeCity, state, country, salesManager, salesDistrict and measures unitPrice and numberOfCustomers]
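A minimal DDL sketch of such a star schema; the grouping of attributes into the three dimension tables is assumed, since the figure only lists the attributes:

  CREATE TABLE product_dt (
    product_id INTEGER PRIMARY KEY,
    product VARCHAR(50), type VARCHAR(50), category VARCHAR(50),
    department VARCHAR(50), marketingGroup VARCHAR(50));

  CREATE TABLE time_dt (
    time_id INTEGER PRIMARY KEY,
    day DATE, week VARCHAR(10), month VARCHAR(10),
    quarter VARCHAR(10), year INTEGER, holiday CHAR(1));

  CREATE TABLE store_dt (
    store_id INTEGER PRIMARY KEY,
    storeCity VARCHAR(50), state VARCHAR(50), country VARCHAR(50),
    salesManager VARCHAR(50), salesDistrict VARCHAR(50));

  CREATE TABLE sale_ft (
    product_id INTEGER REFERENCES product_dt,
    time_id    INTEGER REFERENCES time_dt,
    store_id   INTEGER REFERENCES store_dt,
    quantity INTEGER, unitPrice DECIMAL(10,2), numberOfCustomers INTEGER,
    PRIMARY KEY (product_id, time_id, store_id));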
Snowflake schema
• Results from a (partial) normalization of a star schema
• Prevents or reduces redundancies in dimension tables
• Given a star schema, decompose one or more dimension tables DTi into multiple tables DTi,1 … DTi,n to remove functional dependencies
• Each dimension table DTi,j includes
• A primary key di,j
• A subset of the attributes that functionally depend on di,j
• Possibly foreign keys to other dimension tables DTi,k to enable the reconstruction of DTi
Snowflake schema: example
[Figure: example snowflake schema: the product dimension is split into product (brand, brandCity, TypeId), type (TypeId, CategoryId), and category (CategoryId, category, department, marketingGroup) tables, and the store dimension into store (salesManager, salesDistrict, cityId) and city (storeCity, state, country) tables, while the time attributes (week, month, day, holiday, quarter, year) and the measures (unitPrice, numberOfCustomers) are unchanged]
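A hedged DDL sketch of the decomposition for the product dimension only (hypothetical names, replacing the flat product_dt of the star schema): the dependencies type → category and product → type become separate tables linked by foreign keys:

  CREATE TABLE category_dt (
    categoryId INTEGER PRIMARY KEY,
    category VARCHAR(50), department VARCHAR(50), marketingGroup VARCHAR(50));

  CREATE TABLE type_dt (
    typeId INTEGER PRIMARY KEY,
    type VARCHAR(50),
    categoryId INTEGER REFERENCES category_dt);

  CREATE TABLE product_dt (
    product_id INTEGER PRIMARY KEY,
    product VARCHAR(50), brand VARCHAR(50), brandCity VARCHAR(50),
    typeId INTEGER REFERENCES type_dt);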
Snowflake schema vs star schema
• Snowflake schema
• reduces the storage space
• requires inserting additional foreign keys
• facilitates joins including only attributes in primary dimension tables (dimension tables are smaller)
• complicates joins involving attributes in secondary dimension tables (higher number of joins)
Dimensions (1)
• Dimensions are the core of the multidimensional model
• Dimensions are used for
• Selecting data
• Grouping data
• Dimensions consist of dimension values
• Product: milk, cream, …
• Time: 1/1/2019, 2/1/2019…
• Values are possibly organized in order
• Used for comparing values (e.g., income of different months)
Dimensions (2)
• Dimensions encode hierarchies with different levels
• Dimension values are organized in a tree or in a lattice, whose top element is ALL
• Product > Type > Category
• Day > Month > Quarter > Year
• Granularity of the bottom level is important and has an impact on scalability
Measures (1)
• A measure represents a property of a fact that users want to study and analyze
• A measure has
• Numerical value
• Aggregation formula
• A measure is additive along a dimension if the SUM operator can be used to aggregate along that dimension
Measures (2)
• Additive measures
• Additive along all dimensions
• Refer to a timeframe, at the end of which they can be evaluated cumulatively
Example: number of products sold in a day
• Semi-additive measures
• Additive only along some dimensions
• Evaluated at a particular time
Example: number of products in inventory (not additive on time)
• Non-additive measures
• Additive along none of the dimensions
• Are evaluated at particular times and are relative
Example: unit price of a product
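A sketch of the difference, reusing the hypothetical sale_ft and time_dt tables defined earlier plus a hypothetical inventory_ft snapshot table: the additive measure can be summed along time, while the semi-additive inventory level needs a different aggregation (e.g., an average) along the time dimension:

  -- Additive: total units sold in January 2021 (SUM along time is meaningful)
  SELECT SUM(f.quantity) AS units_sold
  FROM sale_ft f JOIN time_dt t ON f.time_id = t.time_id
  WHERE t.month = '2021-01';

  -- Semi-additive: inventory level; summing daily snapshots along time is wrong
  SELECT i.product_id, AVG(i.on_hand) AS avg_on_hand
  FROM inventory_ft i JOIN time_dt t ON i.time_id = t.time_id
  WHERE t.month = '2021-01'
  GROUP BY i.product_id;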
Logical design
1. Translate DFM schema into a logical (star or snowflake) schema
2. Materialize views
3. Horizontal or vertical fragmentation of fact table
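For step 2, a minimal sketch of a materialized view pre-aggregating the fact table (CREATE MATERIALIZED VIEW syntax as in PostgreSQL or Oracle; table names are the hypothetical ones used above):

  -- Pre-aggregated secondary events: monthly sales per product category
  CREATE MATERIALIZED VIEW sales_by_category_month AS
  SELECT p.category, t.month,
         SUM(f.quantity) AS quantity,
         SUM(f.quantity * f.unitPrice) AS revenue
  FROM sale_ft f
    JOIN product_dt p ON f.product_id = p.product_id
    JOIN time_dt    t ON f.time_id    = t.time_id
  GROUP BY p.category, t.month;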