Data Mining: Concepts and Techniques
Chapter 4
Qiang (Chan) Ye, Faculty of Computer Science, Dalhousie University
Chapter 4: Data Warehousing and On-line Analytical Processing
n Data Warehouse: Basic Concepts
n Data Warehouse Modeling: Data Cube and OLAP
n Data Warehouse Design and Usage
n Data Warehouse Implementation
What is a Data Warehouse?
n Defined in many different ways. Loosely speaking:
n A data warehouse is a decision support data repository that is maintained separately from the organization's operational database.
n It supports information processing by providing a solid platform of consolidated, historical data for analysis.
n "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
n Data warehousing: The process of constructing and using data warehouses.
Data Warehouse: Subject-Oriented
n Organized around major subjects, such as customer, product, sales
n Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
n Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: Integrated
n Constructed by integrating multiple, heterogeneous data sources
n relational databases, flat files, on-line transaction records
n Data cleaning/integration techniques are applied.
n Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
n When data is moved to the warehouse, it is converted.
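The conversion step can be sketched as follows. This is a minimal illustration, assuming two hypothetical sources (`source_a`, `source_b`) that name and encode the same attributes differently. The field names, code table, and sample values are invented for the example and do not come from the slides.

```python
# Hypothetical records from two sources with inconsistent conventions.
source_a = [{"cust_id": 1, "gender": "M", "amount_usd": 120.0}]
source_b = [{"customer": 2, "sex": "female", "amount_cents": 4550}]

# One agreed-upon encoding for the warehouse (an assumption for this sketch).
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_source_a(rec):
    """Map a source A record to the warehouse naming and encoding."""
    return {
        "customer_id": rec["cust_id"],
        "gender": GENDER_CODES[rec["gender"].lower()],
        "amount_usd": rec["amount_usd"],
    }


def from_source_b(rec):
    """Map a source B record; amounts are converted from cents to dollars."""
    return {
        "customer_id": rec["customer"],
        "gender": GENDER_CODES[rec["sex"].lower()],
        "amount_usd": rec["amount_cents"] / 100,
    }


# Every record is converted as it moves into the warehouse.
warehouse_rows = [from_source_a(r) for r in source_a] + [
    from_source_b(r) for r in source_b
]
```

After conversion, both records share one attribute name, one gender encoding, and one unit of measure, which is the consistency the Integrated property calls for.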
Data Warehouse: Time Variant
n The time horizon for the data warehouse is significantly longer than that of operational systems
n Operational database: current value
n Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
n Every key structure in the data warehouse contains an element of time, explicitly or implicitly
Data Warehouse: Nonvolatile
n A physically-separate repository of data transformed from the operational environment
n Operational update of data does not occur in the data warehouse environment
n Does not require transaction processing, recovery, and concurrency control mechanisms
n Requires only two operations in data accessing:
n initial loading of data and access of data
Differences between Operational Database and Data Warehouse
n Because most people are familiar with commercial relational database systems, it is easy to understand what a data warehouse is by comparing these two kinds of systems.
n The major task of operational database systems is to perform online transaction and query processing.
n These systems are called Online Transaction Processing (OLTP) systems.
n They cover most of the day-to-day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
n Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making.
n These systems are known as Online Analytical Processing (OLAP) systems.
n Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users.
Differences between Operational Database and Data Warehouse
Feature        OLTP                                                       OLAP
Users          clerk, IT professional                                     manager, executive
Function       day-to-day operations                                      decision support
DB design      application-oriented                                       subject-oriented
Data content   current, up-to-date, detailed, flat relational, isolated   historical, summarized, multidimensional, integrated, consolidated
Usage          repetitive                                                 (not given)
Access         read/write                                                 mostly read-only
Unit of work   short, simple transaction                                  complex query
Metric         transaction throughput                                     query throughput
Heterogeneous Database Integration
n Traditional databases use the query-driven approach:
n Wrappers and integrators (or mediators) are constructed on top of multiple, heterogeneous databases.
n When a query is submitted, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved.
n These queries are then mapped and sent to local query processors.
n The results returned from the different sites are integrated into a global answer set.
n Data warehousing employs an update-driven approach:
n Information from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis.
n Unlike online transaction processing databases, data warehouses do not contain the most current information (data is updated in a periodical manner).
n However, a data warehouse brings high performance to the integrated heterogeneous database system because data are copied and restructured into one data repository.
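The contrast between the two approaches can be sketched in a few lines. This is an illustrative toy, not a real mediator or warehouse: `sites`, `query_driven_total`, and the `Warehouse` class are hypothetical names, and the sample records are invented.

```python
# Hypothetical per-site sales records: (item, quantity).
sites = {
    "site_a": [("laptop", 2), ("phone", 5)],
    "site_b": [("laptop", 1)],
}


def query_driven_total(item):
    """Query-driven: visit every source at query time and merge the answers."""
    return sum(
        qty for rows in sites.values() for name, qty in rows if name == item
    )


class Warehouse:
    """Update-driven: integrate in advance, then answer from the local copy."""

    def __init__(self):
        self.totals = {}

    def refresh(self):
        # Periodic refresh: copy and consolidate data from all sources.
        totals = {}
        for rows in sites.values():
            for name, qty in rows:
                totals[name] = totals.get(name, 0) + qty
        self.totals = totals

    def total(self, item):
        # No source is contacted here, so the query is cheap.
        return self.totals.get(item, 0)


warehouse = Warehouse()
warehouse.refresh()
sites["site_b"].append(("laptop", 4))  # new operational data arrives
stale = warehouse.total("laptop")      # value as of the last refresh
live = query_driven_total("laptop")    # sees the new record immediately
warehouse.refresh()
fresh = warehouse.total("laptop")      # caught up after the next refresh
```

The warehouse answer lags behind the sources until the next refresh, which is the trade-off described above: faster queries against one repository in exchange for data that is not the most current.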
Why a Separate Data Warehouse?
n High performance for both systems
n DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
n Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
n Different functions/data:
n missing data: Decision support requires historical data, which operational DBs do not typically maintain
n data consolidation: Decision support requires consolidation (e.g., aggregation and summarization) of data from heterogeneous sources
n data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Data Warehouse: A Multi-Tiered Architecture
[Figure: three-tier data warehouse architecture. Data sources (operational DBs and other sources) are extracted, transformed, loaded, and refreshed, with a monitor and integrator, into the bottom tier, the data warehouse server (data warehouse and data marts). The middle tier is the OLAP server. The top tier consists of front-end tools for analysis, query, reports, and data mining.]
Three Data Warehouse Models
n Enterprise warehouse
n It collects all of the information about subjects spanning the entire organization.
n It is typically implemented with high-performance servers or parallel-architecture platforms.
n It requires extensive business modeling and may take years to design and build.
n Data Mart
n It contains a subset of corporate-wide data that is of value to a specific group of users.
n Its scope is confined to specific, selected groups, as in a marketing data mart.
n Dependent data marts obtain the data directly from enterprise data warehouses.
n Independent data marts obtain the data from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.
n Virtual warehouse
n It is a set of views over operational databases
n For efficient query processing, only some of the possible summary views may be materialized.
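A virtual warehouse can be sketched with Python's built-in `sqlite3` module. The `orders` table, the view name `sales_by_city`, and the sample rows are hypothetical. SQLite has no native materialized views, so the materialized summary is simulated here with a snapshot table, which is an assumption of this sketch rather than how a production warehouse would do it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for an operational table.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'Vancouver', 100.0),
        (2, 'Toronto', 40.0),
        (3, 'Vancouver', 60.0);

    -- Virtual warehouse: a summary view computed on demand.
    CREATE VIEW sales_by_city AS
        SELECT city, SUM(amount) AS total FROM orders GROUP BY city;

    -- Simulated materialized view: a stored snapshot of the summary.
    CREATE TABLE sales_by_city_mat AS SELECT * FROM sales_by_city;
    """
)

# A new operational record arrives after the snapshot was taken.
conn.execute("INSERT INTO orders VALUES (4, 'Toronto', 10.0)")

view_rows = conn.execute(
    "SELECT city, total FROM sales_by_city ORDER BY city"
).fetchall()
mat_rows = conn.execute(
    "SELECT city, total FROM sales_by_city_mat ORDER BY city"
).fetchall()
```

The plain view always reflects the operational table but recomputes the aggregate on every query. The snapshot answers without recomputation but goes stale until it is rebuilt.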
ETL: Extraction, Transformation, and Loading
n ETL: The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading.
n Note that ETL refers to a broad process, and not three well-defined steps.
n Nevertheless, the entire process is known as ETL.
n Generally, the entire ETL process involves the following steps:
n Data extraction: get data from multiple, heterogeneous, and external sources
n Data cleaning: detect errors in the data and rectify them when possible
n Data transformation: convert data from legacy or host format to warehouse format
n Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
n Refresh: propagate the updates from the data sources to the warehouse
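The steps above can be sketched as a small pipeline of generator functions. This is a simplified illustration under stated assumptions: the source names (`store_db`, `web_log`), the record fields, and the choice to drop rows that cannot be repaired are all hypothetical. A refresh is modeled simply as rerunning the pipeline.

```python
# Hypothetical raw extracts; all values arrive as strings.
raw_sources = {
    "store_db": [
        {"item": "TV", "qty": "2", "price": "499.99"},
        {"item": "tv ", "qty": "x", "price": "499.99"},  # unparseable quantity
    ],
    "web_log": [{"item": "Phone", "qty": "1", "price": "300"}],
}


def extract(sources):
    """Data extraction: pull rows from every source, tagging their origin."""
    for name, rows in sources.items():
        for row in rows:
            yield {**row, "source": name}


def clean(rows):
    """Data cleaning: detect malformed numbers; drop rows that cannot be fixed."""
    for row in rows:
        try:
            qty, price = int(row["qty"]), float(row["price"])
        except ValueError:
            continue
        yield {**row, "qty": qty, "price": price}


def transform(rows):
    """Data transformation: convert to the warehouse format."""
    for row in rows:
        yield {
            "item": row["item"].strip().lower(),
            "revenue": row["qty"] * row["price"],
        }


def load(rows, warehouse):
    """Load: consolidate (summarize) revenue per item into the warehouse."""
    for row in rows:
        warehouse[row["item"]] = warehouse.get(row["item"], 0.0) + row["revenue"]
    return warehouse


warehouse = load(transform(clean(extract(raw_sources))), {})
```

Because each stage is a generator, rows stream through the pipeline one at a time. In practice the stages overlap and interleave, which matches the remark that ETL is a broad process rather than three well-defined steps.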
Metadata Repository
Metadata is the data defining warehouse objects. It stores:
n Description of the structure of the data warehouse
n schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
n Operational metadata
n data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
n The algorithms used for summarization
n The mapping from operational environment to the data warehouse
n which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).
n Data related to system performance
n which includes indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
n Business data
n business terms and definitions, ownership of data, charging policies
Chapter 4: Data Warehousing and On-line Analytical Processing
n Data Warehouse: Basic Concepts
n Data Warehouse Modeling: Data Cube and OLAP
n Data Warehouse Design and Usage
n Data Warehouse Implementation
Data Cube: A Multidimensional Data Model
n A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
n A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
n A data cube is defined by dimensions and facts.
n Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
n For example, AllElectronics may create a sales data warehouse in order to keep records of the store's sales with respect to the following dimensions: time, item, location, and supplier.
n Dimension Table: Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. For example:
n For the dimension item: item (item_name, brand, type)
n For the dimension time: time (day, week, month, quarter, year)
n Facts are numeric measures. We can think of them as the quantities with which we want to analyze relationships between dimensions.
n Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and amount budgeted.
n Fact Table: It contains the facts (i.e. measures), as well as keys to each of the related dimension tables. An example fact table will be included later.
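The relationship between a fact table and its dimension tables can be sketched with `sqlite3`. The table names (`item_dim`, `time_dim`, `sales_fact`), the surrogate key columns, and all sample rows are assumptions made for this example. Only the attribute lists for item and time and the measures dollars sold and units sold come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe each dimension in more detail.
    CREATE TABLE item_dim (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
    );
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY,
        day INTEGER, week INTEGER, month INTEGER, quarter TEXT, year INTEGER
    );

    -- The fact table holds the measures plus a key into each dimension table.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        dollars_sold REAL,
        units_sold INTEGER
    );

    -- Hypothetical sample rows.
    INSERT INTO item_dim VALUES
        (1, 'Model A TV', 'BrandX', 'home entertainment'),
        (2, 'Model B Laptop', 'BrandY', 'computer');
    INSERT INTO time_dim VALUES
        (1, 15, 3, 1, 'Q1', 2024),
        (2, 10, 15, 4, 'Q2', 2024);
    INSERT INTO sales_fact VALUES
        (1, 1, 1000.0, 2),
        (1, 2, 1500.0, 1),
        (2, 1, 500.0, 1),
        (2, 1, 250.0, 1);
    """
)

# Analyze the facts with respect to two dimensions: quarter and item type.
rows = conn.execute(
    """
    SELECT t.quarter, i.type, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
    """
).fetchall()
```

The fact table stays narrow (keys and numeric measures only), while descriptive attributes such as brand or quarter live in the dimension tables and are reached through joins.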
n Although we usually think of a cube as a 3-D geometric structure, in data warehousing the data cube is n-dimensional.
n To gain a better understanding of data cubes and the multidimensional data model, let's start by looking at a simple 2-D data cube (included in the next slide) that is, in fact, a table or spreadsheet for sales data from AllElectronics.
n In particular, we will look at the AllElectronics sales data for items sold per quarter in the city of Vancouver.
n In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold).
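A 2-D view of this kind is just a cross-tabulation. The sketch below builds one from flat records. The figures in `vancouver_sales` are invented placeholders, not the values from the AllElectronics table, and `pivot_2d` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (quarter, item_type, dollars_sold) records for Vancouver.
vancouver_sales = [
    ("Q1", "computer", 800),
    ("Q1", "phone", 15),
    ("Q2", "computer", 950),
    ("Q2", "phone", 30),
    ("Q1", "computer", 25),
]


def pivot_2d(records):
    """Aggregate records into a table keyed by time (rows) and item (columns)."""
    table = defaultdict(lambda: defaultdict(int))
    for quarter, item_type, dollars in records:
        table[quarter][item_type] += dollars
    # Convert to plain dicts so the result is easy to print and compare.
    return {quarter: dict(cols) for quarter, cols in table.items()}


table = pivot_2d(vancouver_sales)
```

Each cell of `table` holds the dollars sold for one (quarter, item type) pair, with the location fixed to Vancouver.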
n Now, suppose that we would like to view the sales data with a third dimension.
n For instance, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, , Toronto, and Vancouver.
n These 3-D data are shown in the table in the next slide.
n Note that the 3-D data in the table are represented as a series of 2-D tables.
Conceptually, we may also represent the same data in the form of a 3-D data cube.
n Suppose that we would like to view our sales data with an additional fourth dimension such as supplier.
n Viewing things in 4-D becomes tricky. However, we can think of a 4-D cube as being a series of 3-D cubes, as shown in the next slide.
n If we continue in this way, we may display any n-dimensional data as a series of (n-1)-dimensional "cubes".
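The idea of viewing n-dimensional data as a series of (n-1)-dimensional cubes maps naturally onto recursion. The following is a minimal sketch: `nest` is a hypothetical helper, the dimension order (supplier, location, time, item) is chosen for the example, and the records are invented.

```python
def nest(records, n_dims):
    """Turn (d1, ..., dn, measure) tuples into n levels of nested dicts.

    Each key of the outer dict selects one (n-1)-dimensional "cube".
    """
    if n_dims == 0:
        # No dimensions left: aggregate the remaining measures.
        return sum(r[-1] for r in records)
    groups = {}
    for r in records:
        # Group by the first dimension and strip it from the tuple.
        groups.setdefault(r[0], []).append(r[1:])
    return {key: nest(rows, n_dims - 1) for key, rows in groups.items()}


# Hypothetical 4-D records: (supplier, location, time, item, dollars_sold).
records_4d = [
    ("SUP1", "Vancouver", "Q1", "computer", 825),
    ("SUP1", "Toronto", "Q1", "computer", 968),
    ("SUP2", "Vancouver", "Q1", "computer", 100),
    ("SUP1", "Vancouver", "Q1", "computer", 5),
]

cube_4d = nest(records_4d, 4)
cube_3d_for_sup1 = cube_4d["SUP1"]  # one 3-D cube out of the series
```

Fixing a supplier yields a 3-D cube, fixing a location within it yields a 2-D table, and so on down to a single cell.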
n Tables 4.2 and 4.3 show the data at different degrees of summarization.
n In the data warehousing research literature, a data cube like those shown in Figures 4.3 and 4.4 is often referred to as a cuboid.
n Formally, given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions.
n This leads to a lattice of cuboids
n Each of the cuboids shows the data at a different level of summarization.
n The lattice of cuboids is then referred to as a data cube.
n Figure 4.5 shows a lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier.
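The lattice can be enumerated directly, since each cuboid corresponds to one subset of the dimensions. The sketch below assumes no concept hierarchies within a dimension, so n dimensions give 2^n cuboids. `cuboid_lattice` is a hypothetical helper name.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")


def cuboid_lattice(dims):
    """Map each level k to the list of k-dimensional cuboids (dimension subsets)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(dimensions)
total_cuboids = sum(len(level) for level in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
```

For the four dimensions here the levels contain 1, 4, 6, 4, and 1 cuboids, for 16 in total. Level 0 is the empty subset and level 4 is the full set of dimensions.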
n The cuboid that holds the lowest level of summarization is called the base cuboid.
n For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the given time, item, location, and supplier dimensions.
n Figure 4.3 is a 3-D (nonbase) cuboid for time, item, and location, summarized for all suppliers.
n The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
n In our example, this is the total sales, or dollars sold, summarized over all four dimensions.
n The apex cuboid is typically denoted by "all".
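The base, nonbase, and apex cuboids can all be computed by the same group-by operation with different sets of retained dimensions. This is a sketch with invented records. `cuboid` is a hypothetical helper, and dollars sold is assumed to be the only measure.

```python
from collections import defaultdict

DIMS = ("time", "item", "location", "supplier")

# Hypothetical base-level records: one value per dimension, then dollars_sold.
base_rows = [
    ("Q1", "computer", "Vancouver", "SUP1", 825),
    ("Q1", "computer", "Toronto", "SUP1", 968),
    ("Q2", "phone", "Vancouver", "SUP2", 30),
    ("Q1", "computer", "Vancouver", "SUP2", 100),
]


def cuboid(rows, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    cells = defaultdict(int)
    for row in rows:
        cells[tuple(row[i] for i in idx)] += row[-1]
    return dict(cells)


base = cuboid(base_rows, DIMS)                                # 4-D base cuboid
by_time_item_loc = cuboid(base_rows, ("time", "item", "location"))  # 3-D cuboid
apex = cuboid(base_rows, ())                                  # 0-D apex cuboid
```

The apex cuboid has a single cell, keyed by the empty tuple, holding total sales over all four dimensions. The 3-D cuboid merges the two Vancouver Q1 computer records because supplier has been summarized away.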