Conceptual Modeling
Dimensional Fact Model (DMF)
• The Dimensional Fact Model (DFM) is a graphical conceptual model for data mart design aimed at
Copyright By PowCoder代写 加微信 powcoder
• support conceptual modeling
• create an environment where queries can be formulated in an intuitive
• enable requirement specification
• represent a reference point for logical modeling
• provide project documentation
• DFM conceptual schema is a set of fact schemas that model facts, measures, dimensions, and hierarchies
• A fact is a concept of interest for the decisional process • Usually models a set of events that happen within a company • Is dynamic and evolves over time
• A measure is a numeric property of a fact and describes a quantitative aspect of interest for the analysis
• A dimension is a property of a fact that has a finite domain and
describes an aspect of the fact for the analysis
dimensions
• The relationship between measures and dimensions is expressed, at the extensional level (i.e., instance level) through (primary) events
• A primary event is a specific occurrence of a fact
• It is identified through an ennuple with a value for each dimension • It has a value for each measure
• A fact is a many-to-many relationship among dimensions
Event for fact SALE:
on 3/4/2019, store X sold 10 units of product Y with a receipt of 25 USD
• Dimensional attributes are dimensions and the attributes, with discrete values, describing them
• Hierarchy is a directed tree whose nodes are dimensional attributes and whose edges model many-to-one relationships between pairs of dimensional attributes
• The dimension is the root of the tree • The other nodes are its attributes
productàtype productàbrand typeàcategory
expresses the type of a product (beer is a drink) expresses the brand of a product
each type belongs to a category (drink is food)
Hierarchy: example
dimensional attributes
• Each DMF can be modelled as an ER
• The fact is a relation
• Each measure is an attribute • Each dimension is an entity
Secondary event
• Given a set of dimensional attributes, each ennuple of values identifies a secondary event
• A secondary event aggregates the corresponding primary events
• A secondary event has a value for each measure, summarizing the values of
the same measure in the corresponding primary events
Aggregates all the sales in Milan for product X in January 2021 The quantity will be the sum of all the tuples with these values
Descriptive and optional attributes
• Descriptive attribute: property of a dimensional attribute or hierarchy • Not used for aggregation
• Leaf node
• Optional attributes:
• For some instances the attribute may
not have a value
• Telephone number and address are descriptive of a store
• Diet is optional for a product
Convergence
• Two or more attributes are connected by two or more distinct paths, all representing functional dependencies
storeà storeCityà state à country store à saleDistrict à country
Conceptual design
• Starts from documentation including ER schema, relational schema, …
Identify facts
For each fact
1. Build the attribute tree
2. Identify dimensions
3. Identify measures
4. Draw the fact schema
Identify facts
• A fact is a concept of primary interest for the decisional process • Typically events that dynamically happen in the company
• Considering an ER schema, entities that represent frequently updated archives are good candidate facts
Sales is a good candidate fact Shop is not a good candidate fact
Build the attribute tree
• Considering an ER schema, and entity F modeling a fact, the attribute tree is such that
• Each vertex is an attribute of the ER schema
• The root corresponds to the identifier of F
• For each vertex v, the corresponding attribute functionally determines all the attributes represented by descendants of v
Build the attribute tree: example (1)
Build the attribute tree: example (2)
Prune the attribute tree
• Remove dimension attributes (single nodes or entire subtrees) that are not relevant for the DW
Identify dimensions
• Dimensions determine how events can be aggregated for the decisional process
• Dimensions are chosen among the children of the root
• The choice of dimensions is crucial as it defines the granularity of
primary events
• Time is usually a dimension
Identify measures
• Measures are chosen among numeric attributes the children of the root or applying aggregates to other attributes in the tree
• A fact can be empty (i.e., with no measure)
Draw the fact schema
• The attribute tree is translated into a DFM schema
Logical Modeling
Multidimensional model: Cube (1)
• A cube represents a fact
• Each of the axis of the cube represents a dimension of the fact • Each cell in the cube represents a primary event
• A slice of the cube represents a secondary event
storeCity quantity
Multidimensional model: Cube (1)
• A cube can have an arbitrary number of dimensions
• Cells in a cube can be empty • Sparse cube: many empty cells • Dense cube: few empty cells
• ROLAP logical modeling does not suffer from sparsity issue
Star schema
It includes:
• A dimension table DTi for each dimension, with a primary key di and different attributes for each aggregation level
• A fact table FT with an external key di for each dimension and an additional attribute for each measure
Star schema: example
category department marketingGroup
unitPrice numberOfCustomers
week month day holiday quarter Year
storeCity state
country salesManager salesDistrict
Snowflake schema
• Results from a (partial) normalization of a star schema
• Prevents or reduces redundancies in dimension tables
• Given a star schema, decompose one or more dimension table DTi in multiple tables DTi,1… DTi,n to remove functional dependencies
• Each dimension table DTi,j includes
• A primary key di,j
• A subset of the attributes that functionally depend from di,j
• Possibly external keys for other dimension tables DTi,k to enable the reconstruction of DTi
Snowflake schema: example
week month day holiday quarter Year
CategoryId
category department
marketingGroup
CategoryId
unitPrice numberOfCustomers
brand brandCity TypeId
salesManager salesDistrict cityId
storeCity state country
Snowflake schema vs star schema
• Snowflake schema
• reduces the storage space
• requires to insert additional external keys
• facilitates joins including only attributes in primary dimension tables (dimension tables are smaller)
• complicates joins involving attributes in secondary dimension tables (higher number of joins)
Dimensions (1)
• Dimensions are the core of the multidimensional model
• Dimensions are used for • Selecting data
• Grouping data
• Dimensions consist of dimension values • Product: milk, cream…
• Time: 1/1/2019, 2/1/2019…
• Values are possibly organized in order
• Used for comparing values (e.g., income of different months)
Dimensions (2)
• Dimensions encode hierarchies with different levels
• Dimension values are organized in a tree or in a lattice, whose top element is ALL
• Product > Type > Category
• Day > Month > Quarter > Year
• Granularity of the bottom level is important and has an impact on scalability
Measures (1)
• A measure represents a property of a fact that users want to study and analyze
• A measure has
• Numerical value
• Aggregation formula
• A measure is additive along a dimension if the SUM operator can be used to aggregate along that dimension
Measures (2)
• Additive measures
• Additive along all dimensions
• Refer to a timeframe, at the end of which they can be evaluated cumulatively Example: number of products sold in a day
• Semi-additive measures
• Additive only along some dimensions
• Evaluated at a particular time
Example: number of products in inventory (not additive on time)
• Non-additive measures
• Additive along none of the dimensions
• Are evaluated at particular times and are relative Example: unit price of a product
Logical design
1. Translate DFM schema into a logical (star or snowflake) schema
2. Materialize views
3. Horizontal or vertical fragmentation of fact table
• A view is a fact table storing aggregated data
• Useful for the evaluation of query results based on pre-aggregated
• Aggregation operators can be
• Distributive: can compute aggregates based on pre-aggregates (e.g., sum, min, max)
• Algebraic: require a finite number of additional information to compute aggregates from pre-aggregates (e.g., average requires the number of items involved in each pre-aggregate)
• Olistic: do not permit to compute aggregates based on pre-aggregates and a finite number of additional information (e.g., median)
Materialization of views (1)
• Select a subset of views to be materialized to minimize a pre-defined cost function
• Minimize the evaluation of query workload
• Minimize the cost of view maintenance, due to updates
• Constraints that need to be taken into account
• Storage space
• Time necessary for updates
• Maximum query response time (fixed by final users)
• Maximum outdate of data in query response (fixed by final users)
Materialization of views (2)
• These minimization objectives and constraints are in contrast
• Traditionally, view materialization aims at minimize query response
time under system constraints (i.e., storage space and update time)
• The size of the problem grows with the number of dimensional
attributes (each combination of attributes is a possible view)
• Solution in two steps
• Identify views useful for the query workload
• Select the sub-optimal set of views using heuristic algorithms
Materialization of views (3)
• It is useful to materialize a view when • Directly answers to a frequent query
• Permits to answer to several queries
• It is not advisable to materialize a view when
• Its aggregation pattern is similar to an existing view
• Its aggregation pattern is very fine-grained
• Materialization does not reduce of one order of magnitude the query evaluation workload
Materialization of views (4)
Materialized views need to be kept up to date
• Rebuild the view at each updateànot efficient
• Incremental update of the viewsàmore efficient
• Immediate: views are always consistent, but causes a higher cost
• Deferred: lower cost, but views may be inconsistent for a time window
Operations on Multidimensional Data
Basic operations
• Slice-and-dice: select and project
• Selects a subsets of the cells of the cube
• Roll-up: aggregates data
• Applies an aggregation function (e.g., sum) over the data in a cube
• It can be applied to additive measures, for which the aggregation makes sense along a given dimension
• Drill-down: disaggregates data
• It is the inverse of roll-up
• Adds detail to a cube by disaggregating a measure along one or more dimensions
Slice-and-dice: example
• Selects the quantity of pasta sold for trimester for each city
1 trim. 2019
2 trim. 2019
3 trim. 2019
4 trim. 2019
Drill-down: example
1 trim. 2019
2 trim. 2019
3 trim. 2019
4 trim. 2019
1 trim. 2019
2 trim. 2019
3 trim. 2019
4 trim. 2019
Milano-1 Milano-2
70000 60000
65000 60000
40000 70000
75000 70000
Bologna-1 Bologna-2 Bologna-3
Roma-1 …
Drill-down on storeCity (from city to store)
Roll-up: example
1 trim. 2019
2 trim. 2019
3 trim. 2019
4 trim. 2019
Roll-up on time (from trimester to year)
SQL extensions for OLAP
SQL:2003 extensions for OLAP
• Crosstab
• GROUP BY extensions • ROLLUP
• GROUPING SETS
Crosstab (1)
• Table reporting the result of different aggregations • Efficient for dense datasets
• Can be computed using wizards or PIVOT clause
Toothpaste
Crosstab (2)
SELECT City, [Toothpaste] AS Toothpaste, [Soap] AS Soap FROM ( SELECT City, Product, Amount
FROM Sales) ps PIVOT ( SUM (Amount)
FOR Product IN ( [Toothpaste], [Soap]) ) AS pvt
Toothpaste
Toothpaste
Toothpaste
Crosstab (3)
• Tabular representation of the results of aggregations
• ALL is a dummy value and is the top of the dimensions hierarchy
• Compatible with relational model
• Use PIVOT to obtain a nicer view
Toothpaste
Toothpaste
Toothpaste
• SQL offers three clauses to compute aggregates as extensions to the GROUP BY clause
• GROUP BY ROLLUP(a1, …, an)
computes aggregates (e.g., SUM, COUNT, AVG, MAX, MIN) at increasing levels of aggregation
• GROUP BY [ai1,…,ain] CUBE(aj1, …, ajm)
computes aggregates at all the possible combinations of aggregations
• GROUP BY GROUPING SETS (A1, …, An)
computes aggregates according to aggregations defined by the sets of attributes in the clause
sum(Units)
fiat fiat ford
red red red
ROLLUP (1)
select Model, Year, Color, sum(Unit) from T
where Model in {‘fiat’,’ford’}
and Color = ‘red’
and Year between 1994 and 1995 group by rollup(Model, Year, Color)
• Roll up from right to left, producing the following aggregations: • Model, Year, Color
• Model, Year
ROLLUP (2)
fiat fiat ford fiat fiat ford fiat ford ALL
1994 1995 1994 1994 1995 1994 ALL ALL ALL
red red red ALL ALL ALL ALL ALL ALL
sum(Units)
fiat fiat ford
red red red
Partial ROLLUP (1)
select Model, Year, Color, sum(Unit) from T
where Model in {‘fiat’,’ford’}
and Color = ‘red’
and Year between 1994 and 1995 group by Model, rollup(Year, Color)
• Model is not part of the roll-up, produces the following aggregations: • Model, Year, Color
• Model, Year
Partial ROLLUP (2)
fiat fiat ford fiat fiat ford fiat ford
1994 1995 1994 1994 1995 1994 ALL ALL
red red red ALL ALL ALL ALL ALL
• Applied to n attributes, generates n+1 aggregations in the order • Original groups generated by GROUP BY clause
• Sub-totals of attribute an considering the different values of the first n-1
attributes
• Sub-totals of an and an-1 considering the different values of the first n-2
attributes •…
• The overall total
• Helpful on columns representing the same dimension such as time or
• ROLLUP(year, month, day)
• ROLLUP(country, region, city)
sum(Units)
fiat fiat ford
red red red
select Model, Year, Color, sum(Unit)
where Model in {‘fiat’,’ford’}
and Color = ‘red’
and Year between 1994 and 1995
group by cube(Model, Year, Color)
• Produces all possible roll-up aggregations: • Model, Year, Color
• Model, Year
• Model, Color
• Year, Color • Model
fiat fiat ford fiat fiat ford fiat ford ALL ALL fiat ford ALL ALL ALL ALL
1994 1995 1994 1994 1995 1994 ALL ALL 1994 1995 ALL ALL ALL 1994 1995 ALL
red red red ALL ALL ALL red red red red ALL ALL red ALL ALL ALL
• Applied to n attributes, generates 2n aggregations
• Includes all the rows produced by ROLLUP
• CUBE is suitable on columns that represent different dimensions • For instance, subtotals for month, state, product
• Partial CUBE works in a similar way as partial ROLLUP
sum(Units)
fiat fiat ford
red red red
GROUPING SETS (1)
select Model, Year, Color, sum(Unit) from T
where Model in {‘fiat’,’ford’}
and Color = ‘red’
and Year between 1994 and 1995
group by grouping sets((Model,Year,Color),
(Model,Color), (Year,Color))
• Produces only the aggregations specified in the clause
GROUPING SETS (2)
Model Year Color sum(Unit)
fiat 1994 red 50 fiat 1995 red 85 ford 1994 red 80 fiat ALL red 135 ford ALL red 80 ALL 1994 red 130 ALL 1995 red 85
Equivalences
• CUBE(a,b)àGROUPING SETS((a,b),(a),(b),())
• ROLLUP(a,b,c)àGROUPING SETS((a,b,c),(a,b),(a),())
• GROUP BY GROUPING SETS ((a,b,c))àGROUP BY a,b,c
• GROUP BY GROUPING SETS (a,b,c)à GROUP BY a
UNION ALL GROUP BY b UNION ALL GROUP BY c
GROUPING (1)
select Model, Year, Color, sum(Unit)
grouping(Model) as M,
grouping(Year) as Y,
from T grouping(Color) as C
where Model in {‘fiat’,’ford’}
and Color = ‘red’
and Year between 1994 and 1995
group by rollup(Model, Year, Color)
• Returns 1 if the result is a NULL value created by a grouping operation, 0 otherwise
• Helps distinguishing between original and aggregated data, especially if NULL value appears in the table
GROUPING (2)
fiat fiat ford fiat fiat ford fiat ford ALL
1994 1995 1994 1994 1995 1994 ALL ALL ALL
red red red ALL ALL ALL ALL ALL ALL
0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1
0 0 0 1 1 1 1 1 1
GROUPING_ID (1)
• Returns the decimal number corresponding to the bit vector generated by a GROUPING on all the columns in the aggregation
fiat fiat ford fiat fiat ford fiat ford ALL
1994 1995 1994 1994 1995 1994 ALL ALL ALL
red red red ALL ALL ALL ALL ALL ALL
0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 1 1
0 0 0 1 1 1 1 1 1
0 0 0 1 1 1 3 3 7
• select Item, Year, Lang, sum(Unit) from T group by cube(Item, Year, Lang)
• select Year, Lang, sum(Unit) from T where Item=‘crime’
group by cube(Year, Lang)
• select Item, Year, Lang, sum(Unit) from T group by rollup(Item, Year, Lang)
• select Item, Year, Lang, sum(Unit) from T
group by grouping sets((Item, Year, Lang), (Item, Lang), (Year, Lang))
Physical Modeling
Physical modeling: phases
• Translation of fact schema
• Materialization of views
• Vertical/horizontal fragmentation
• Choice of indexes
• Data allocation on physical supports
The first three phases are part of the logical modeling, but strongly interplay with physical modeling
Goal of physical modeling
• Optimization of query execution, depending on • Available indexes
• The specific DBMS
• The specific instance of the dataset
• Optimizers can operate based on
• Rules: considers the data organization, the structure of the query, and indexes • Costs: considers an estimation of query evaluation costs based on catalogs
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com