Star, Snowflake, Fact Constellation
n The entity-relationship model is commonly used in the design of relational databases.
n A data warehouse, however, requires a concise, subject- oriented schema that facilitates online data analysis.
n The most popular data model for a data warehouse is a multidimensional model, which can exist in the form:
Copyright By PowCoder代写 加微信 powcoder
n Star schema
n Snowflake schema
n Fact constellation schema
Star Schema
n Star schema is the most common modeling paradigm for data warehouse.
n Star schema consists of:
n A large central table (fact table) containing the bulk of the
n A set of smaller attendant tables (dimension tables), one for each dimension
n The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
n For more details: https://en.wikipedia.org/wiki/Star_schema
Example of Star Schema
day day_of_the_week month
quarter year
Sales Fact Table
branch_key
location_key
units_sold
dollars_sold
location_key street
city state_or_province country
item_key item_name brand
type supplier_type
branch_key branch_name branch_type
Facts/Measures
Star Schema
n Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes.
n For example, the location dimension table contains the attribute set {location key, street, city, state_or_province, country}.
n This constraint may introduce some redundancy.
n For example, ¡°Urbana¡± and ¡°Chicago¡± are both cities in the state
of Illinois, USA.
n Entries for such cities in the location dimension table will create redundancy related to the attributes state_or_province and country: ¡°…, Urbana, IL, USA¡± and ¡°…, Chicago, IL, USA¡±.
n Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
Snowflake Schema
n The snowflake schema is a variant of the star schema model, where some dimension tables are further split into additional tables.
n The resulting schema graph forms a shape similar to a snowflake.
n The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be split to reduce redundancies.
n Such a table is easy to maintain and saves storage space. n More details: https://en.wikipedia.org/wiki/Snowflake_schema
Example of Snowflake Schema
day day_of_the_week month
quarter year
Sales Fact Table
branch_key
location_key
units_sold
dollars_sold
branch_key branch_name branch_type
Facts/Measures
item_key item_name brand
type supplier_key
location_key street city_key
supplier_key supplier_type
city state_or_province country
Snowflake Schema
n However, this type of space saving is not important in comparison to the typical magnitude of the fact table.
n Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query.
n Consequently, the system performance may be adversely impacted.
n Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design.
Fact Constellation
n Complex applications may require multiple fact tables to share dimension tables.
n This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
Example of Fact Constellation
day day_of_the_week month
quarter year
Shipping Fact Table
item_key item_name brand
type supplier_type
shipper_key
from_location
to_location
dollars_cost
units_shipped
Sales Fact Table
branch_key
location_key
units_sold
dollars_sold
branch_key branch_name branch_type
location_key street
city province_or_state country
Facts/Measures
shipper_key shipper_name location_key shipper_type
Fact Constellation
n In data warehousing, there is a distinction between a data warehouse and a data mart.
n A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.
n For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects.
Fact Constellation
n A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide.
n For data marts, the star or snowflake schema is commonly used, since both are geared toward modeling a single subject.
n Of course, as mentioned previously, star schema is more popular and efficient.
all region
Germany …
North_America Canada … Mexico
Dimensions: Schema Hierarchy vs. Set- grouping Hierarchy
all Europe …
A concept hierarchy that is a total (or partial) ordering of the attributes in a database schema is called a schema hierarchy.
city Frankfurt … office
Vancouver … Toronto L. Chan … M. Wind
Dimensions: Schema Hierarchy vs. Set- grouping Hierarchy
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy.
Facts/Measures: Three Categories
n Facts/Measures can be organized into three categories: distributive, algebraic, and holistic. The categorization is based on the kind of aggregate functions used.
n Distributive Measure:
n A distributive measure corresponds to a distributive aggregate
n An aggregate function is distributive if it can be computed in a distributed manner as follows.
n Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values.
n If the result derived by applying the function to these n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner
n E.g., count(), sum(), min(), max()
Facts/Measures: Three Categories
n Algebraic Measure:
n An algebraic measure corresponds to an algebraic aggregate
n An aggregate function is algebraic if it can be computed using an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function.
n E.g., avg()=sum()/count()
n Note that both sum() and count() are distributive aggregate
functions.
n Holistic Measure: There does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation.
n E.g., median()
Facts/Measures: Three Categories
n Most large data cube applications require efficient computation of distributive and algebraic measures.
n Many efficient techniques for computation of distributive and algebraic measures exist.
n In contrast, it is difficult to compute holistic measures efficiently.
n Efficient techniques to approximate the computation of some holistic measures, however, do exist.
Typical OLAP Operations
n Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
n Drill down (roll down): reverse of roll-up
n from higher level summary to lower level summary or
detailed data, or introducing new dimensions
n Slice: perform a selection on one dimension of the given cube, resulting in a subcube.
n Dice: define a subcube by performing a selection on two or more dimensions
n Pivot (rotate): a visualization operation that rotates the data axes in view to provide an alternative data presentation
Typical OLAP Operations
n Other operations
n drill across: executes queries involving (i.e., across)
more than one fact table
n drill through: uses relational SQL facilities to drill through the bottom level of a data cube down to its back-end relational tables
A Star-net Query Model
n The querying of multidimensional databases can be based on a star-net model.
n The star-net model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension.
n Each abstraction level in the hierarchy is called a footprint.
n The footprints represent the granularities that can be used by OLAP operations such as drill-down and roll-up.
A Star-net Query Model
Chapter 4: Data Warehousing and On-line Analytical Processing
n Data Warehouse: Basic Concepts
n Data Warehouse Modeling: Data Cube and OLAP n Data Warehouse Design and Usage
n Data Warehouse Implementation
Design of Data Warehouse: A Business Analysis Framework
n To design an effective data warehouse, we need to understand/analyze business needs and construct a business analysis framework.
n The construction of a data warehouse, a large and complex information system, can be viewed as the construction of a large and complex building, for which the owner, architect, and builder have different views.
n These views are combined to form a complex framework that represents the top-down, business-driven, or owner¡¯s perspective, as well as the bottom-up, builder- driven, or implementer¡¯s view of the information system.
Design of Data Warehouse: A Business Analysis Framework
n Four views related to the design of a data warehouse
n Top-down view: allows selection of the relevant information necessary for the data warehouse. This view matches current and future business needs.
n Data source view: exposes the information being captured,
stored, and managed by operational systems. Data sources are often modeled by traditional data modeling techniques, such as the entity-relationship model
n Data warehouse view: consists of fact tables and dimension tables
n Business query view: the data perspective in the data warehouse from the end-user¡¯s viewpoint
Data Warehouse Design Process
n Three different approaches
n Top-down: Starts with overall design and planning (mature) n Bottom-up: Starts with experiments and prototypes (rapid) n Combination of both
n From software engineering point of view
n Waterfall: structured and systematic analysis at each step
before proceeding to the next
n https://en.wikipedia.org/wiki/Waterfall_model
n Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around
n https://en.wikipedia.org/wiki/Spiral_model
Data Warehouse Design Process
n Typically, data warehouse design process includes the following steps:
n Choose a business process to model, e.g., orders, invoices n Choose the grain (atomic level of data) of the business
n E.g., individual transactions, individual daily snapshots
n Choose the dimensions that will apply to each fact table record
n Choose facts/measures that will populate each fact table record
Data Warehouse Development: A Recommended Approach
Multi-Tier Data Warehouse
Distributed Data Marts
Enterprise Data Warehouse
Model refinement Model refinement
Define a high-level corporate data model
Data Warehouse Usage
n Three kinds of data warehouse applications n Information processing
n supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
n Analytical processing
n multidimensional analysis of data warehouse data
n supports basic OLAP operations, slice-dice, drilling, pivoting
n Data mining
n knowledge discovery
n supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)
n Online Analytical Mining (OLAM) = OLAP + Data Mining n OLAM is becoming popular because:
n High quality of data in data warehouses
n DW contains integrated, consistent, cleaned data
n Available information processing structure surrounding data warehouses
n Web accessing, service facilities, reporting and OLAP tools n OLAP-based exploratory data analysis
n Mining with drilling, dicing, pivoting, etc. n On-line selection of data mining functions
n Integration and swapping of multiple mining functions, algorithms, and tasks
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com