Information Management
Data Warehousing (thanks to Wolf- )
Università degli Studi di support systems
Copyright By PowCoder代写 加微信 powcoder
• The amount of data generated and collected every day is increasing
• Decisions are taken based on information extracted from data
• Bad decisions can lead to disaster
• A decision support system is a set of IT solutions and tools used for extracting information from data electronically stored
• Past: describe the past, describe problems, reduce costs
• Future: anticipate the future, suggest corrections, increase profits
• Issues in realizing a DSS: large datasets, heterogeneous information sources, history of data, …
Data warehouse
• Data warehousing is at the base of decision support systems
• Questions we need to answer [Kimball 1996]
• We have many data but we cannot access them
• How is it possible that different people in the same role have substantially different results?
• We need to select, group, and manipulate data in any possible manner
• Show me only what is important
• Everybody knows that some data are not correct
Data warehouse: requirements
• Accessibility also by users with limited technological background
• Integration of different data sources
• Flexibility of query operations
• Synthesis for easy analysis
• Multidimensional representation to provide an intuitive view of the data
• Correctness and completeness of represented data
Data warehouse: definition (1)
• Intuitively, a data warehouse is a large database
• Not all large DBs are DWs, but DWs are large DBs (from a TB to several PB) • Distributed over several servers and requires high computational power
• A data warehouse is a collective data repository
• Contains snapshots of the operational data (history)
• Populated through the Extract-Transform-Load (ETL) process
• Useful for data analytics
Data warehouse: definition (2)
• Compared to other solutions it…
• Is suitable for tactical/strategic focus
• Implies a small number of transactions
• Implies large transactions spanning over a long period of time
Data warehouse: Inmon definition
A data warehouse is a:
• Subject oriented
• Integrated
• Non-volatile
• Time variant
collection of data in support of management’s decisions (by )
Subject oriented
• Data are organized and centered around the concepts of interest for the company
• Data elements relating to the same real-world event or object are linked together
• Traditional operational DBs are organized according to applications • Examples of subjects: customers, products, orders,…
Integrated
• The DW contains data from most or all the organization’s operational systems and these data are made consistent
• Example: gender, dates, measures
• The realization of a DW does not imply the insertion of new data, but the reorganization of existing ones
Non-volatile
• Data are never over-written nor deleted • Data are read-only
Time varying
• A DW represents a sequence of snapshots of operational data, repeated periodically
• Changes over the data are never lost
• Time horizon
• Operational system: 60-90 days • DW: 5-10 years
OLTP: On Line Transaction Processing
Traditional elaboration of transactions, which realize operative systems
• Predefined simple operations
• Each operation involves a limited amount of data • Detailed and always up-to-date information
• ACID properties are crucial
OLAP: On Line Analytical Processing
Elaboration of operations for decision support
• Complex and non pre-defined operations
• Each operation involves a huge amount of data
• Aggregated and historical data, possibly outdated
• ACID properties are not relevant (typically read-only)
OLTP and OLAP: separate environments
OLTP Database
Data Warehouse
OLTP APPLICATION
OLAP APPLICATION
FINAL USERS (transactions)
ANALYSTS (complex queries)
OLTP vs. OLAP (1)
Operational Data Stores (OLTP)
Mostly updates
Mostly reads
Many small transactions
Few, but complex queries
MB-TB of data
TB-PB of data
Summarized data
Final users / employees
Decision makers and analysts
Up-to-date data
May be slightly outdated data
OLTP vs. OLAP (2)
• Typical questions to be answered by OLAP
• How much did sales unit A earn in January?
• How much did sales unit B earn in February?
• What was their combined sales amount for the first quarter?
• Answering these questions with SQL-queries is difficult
• Complex query formulation necessary
• Processing will be slow due to complex joins and multiple scans
• Why can such questions be answered faster with a DW? • DW data is rearranged and pre-aggregated data collection
• The table arrangement is subject oriented
OLAP: FASMI requirements
• Fast: response times should be low, even if dealing with huge amounts of data
• Analytical: should provide support for the final user to analyze data in a simple manner
• Shared: data should be shared among multiple final users
• Multidimensional: data should be organized according to a
multidimensional model
• Informational: the system should store all the information of interest
DW architectures
• A DW architecture should be
• Independent: separated from from OLTP systems
• Scalable: support a growing size of data and number of final users/analysts • Extensible: able to integrate novel applications whenever needed
• Secure: enforce access control
• Easy to manage
1-level architecture
• The DW is virtual
• Implemented as a multidimensional view over the data in the OLTP • Generated through a middleware
• Does not guarantee separation between OLTP and OLAP systems • Not used in practice
Multidimensional alanysys
Middleware
Operational DBs
Data mining
2-level architecture (1)
Extract Clean Transform Load
Data Warehouse
External sources
Multidimensional analysis
Operational DBs
Data mining
2-level architecture (2)
• Data sources: heterogeneous information sources both internal and external
• Usually operational data stores that rely on relational database technology
• ETL: data from different and heterogeneous sources should be • extracted
• cleaned to remove inconsistencies and completed of missing parts • integrated according to a common schema
2-level architecture (3)
• Staging area: contains data that should be loaded in the DW from sources
• Copied data are prepared through ETL
• These data are not used for analysis but only for populating the DW
• DW server: the DW is the logical container of collected data, which can be directly queried or used to build data marts
• Cleaned raw data
• Derived (aggregated) data • Metadata
• Analysis: analyzes data and builds reports
2-level architecture (4)
• Data mart: subset or result of aggregations of the data in the DW • Includes the subset of data of interest for a specific business area
• Provides better performance
• Advantages of 2-level architectures
• DW is always available, even when the data sources are temporarily
unavailable
• Queries operating on the DW do not interfere with OLTP transactions
• Data in DW and OLTP are kept independent (different data model, different data granularity, different time window)
• Specific optimization techniques to support analysis
3-level architecture (1)
External sources
Operational DBs
Reconciled Load Data
Data Warehouse
Extract Clean Transform Load
Multidimensional analysis
Data mining
3-level architecture (2)
• The third (additional) level is represented by reconciled data
• Reconciled data (operational data store) is the materialization of the
result of ETL process
• In 2-level architectures the result of ETL is not materialized
Applications for DWs
• A DW is the base repository for front-end analytics (or business cockpits)
• Knowledge discovery in databases (KDD) and data mining
• Results are used for • Data visualization
• Reporting
• Decision support system (DSS) analysts
• usually have a business background
• aim to define and discover information used in corporate decision-making • do not know exactly what they need
• operate in an explorative manner (by subsequent refinement steps)
DW design and implementation
• Top-down: analyze the whole needs of the company and design the DW in its entirety
• Long design and implementation times • Complex task
• Bottom-up: the DW is built incrementally by iteratively integrating different data marts
• Reduces times
• Reduces costs
• Most widely used solution
Bottom-up DW development
Define objectives and plan
• preliminary phase including feasibility analysis
Infrastructure design
• analyze and compare architectural solutions and available technologies
Data mart design and development • create data mart and applications
• integration with the existing system
Data mart design
Data mart modeling is based on the schemas of the underlying operational DBs
1. Analysis and reconciliation of data sources
2. Requirements analysis
3. Conceptual modeling
4. Refinement of workload and conceptual model validation
5. Logical modeling
6. Loading modeling
7. Physical modeling
Analysis and reconciliation of data sources
INPUT: operational DBs schemas OUTPUT: reconciled schema
• Analyzes operational data to determine and analyze: • the schemas of data sources
• useful correlations among data
• portions of data of interest for the decision process
• data quality
• Makes different data sources homogeneous and enables their integration
Requirements analysis
INPUT: reconciled schema
OUTPUT: facts, preliminary workload
• Facts are concepts of primary interest for the decisional process
• For workload, it is important to determine the granularity of facts • Fine-grained: less performance, more precision
• Coarse-grained: higher performance, lower precision
Conceptual modeling
INPUT: reconciled schema, facts, preliminary workload OUTPUT: fact schema
• The fact schema represents all the relevant information about facts that need to be considered
• Measures: atomic properties of the fact that can be analyzed • Dimension: perspective for the analysis
• Fact: sales
• Measures: sold quantities, gain
• Dimensions: product, time, place
Workload and validation
INPUT: fact schema, preliminary workload OUTPUT: validated fact schema
• Formulate queries representing the workload to determine whether the fact schema supports all of them
Logical modeling (1)
INPUT: fact schema, logical model, workload OUTPUT: logical model of the data mart
• Identify views that should be materialized
• Design vertical and/or horizontal fragmentation to maximize
performance
• Choose the logical model between ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP)
Logical modeling (2)
• ROLAP (Relational OLAP)
• Relational database model, extended and adapted
• Data organized in tables and analysis operations expressed as SQL instructions • Access structures specifically designed to optimize analysis operations
• MOLAP (Multidimensional OLAP)
• Stores data directly in multidimensional shape • Proprietary data structures
Loading modeling
INPUT: operational DBs schemas, reconciled schema, logical model of the data mart
OUTPUT: loading procedures
Physical modeling
INPUT : logical model of the data mart, target DBMS, workload OUTPUT: physical schema of the data mart
• Identify indexes, based also on the workload and data volumes
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com