Tree-based indexes (3)
• B-trees and B+-trees support searches along one dimension only
• Not suited for multi-dimensional data like DWs
• Multidimensional trees are an evolution of B-trees for multidimensional data
• Describe sets of points through geometric regions containing clusters of points
• Clusters are used for search operations
• Clusters are organized in a hierarchical structure
R-tree (1)
• Each data item is seen as a point in an n-dimensional space
• A dimension for each dimensional attribute
• The coordinates are the values of the data item for each dimension
• Data are organized in Minimum Bounding Rectangles (MBRs) depending on their position in the n-dimensional space
• Each node corresponds to an MBR bounding its children
• Leaf nodes point to database objects
• Note: MBRs may overlap
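To make this concrete, the following is a minimal Python sketch of an MBR and an R-tree node; the class and field names are illustrative, not taken from any specific library.

```python
class MBR:
    """Axis-aligned minimum bounding rectangle in n dimensions."""
    def __init__(self, lows, highs):
        self.lows = list(lows)    # lower corner, one coordinate per dimension
        self.highs = list(highs)  # upper corner, one coordinate per dimension

    def intersects(self, other):
        # Two MBRs overlap iff their extents overlap along every dimension
        return all(lo <= o_hi and o_lo <= hi
                   for lo, hi, o_lo, o_hi
                   in zip(self.lows, self.highs, other.lows, other.highs))

    def volume(self):
        v = 1.0
        for lo, hi in zip(self.lows, self.highs):
            v *= hi - lo
        return v


class Node:
    """Internal nodes hold (MBR, child node) pairs; leaves hold (MBR, data item) pairs."""
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.entries = []  # between m and M (MBR, child-or-data) pairs
```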
R-tree (2)
R-tree of order (m, M), with m ≤ M/2
• Each internal node has between m and M children
• Each leaf node includes between m and M (pointers to) data items
• The root has at least two children (unless it is a leaf node)
• All the leaf nodes are at the same level
R-tree: example
R-tree – Search
• The search operates similarly to B+-trees
• Given a query rectangle Q, starting from the root node:
• Find the child (children, resp.) of the current node intersecting with Q
• Recursively visit the selected child (children, resp.)
• When reaching a leaf node, return the data items of interest
• Could visit the whole tree
• No performance guarantee
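A recursive range search over the node structure sketched earlier might look as follows; since several children can intersect Q, all of them must be visited, which is exactly why there is no performance guarantee.

```python
def search(node, query_mbr, results):
    """Collect all data items whose MBR intersects the query rectangle Q."""
    for mbr, child in node.entries:
        if mbr.intersects(query_mbr):
            if node.is_leaf:
                results.append(child)               # child is a data item here
            else:
                search(child, query_mbr, results)   # child is a subtree: recurse
```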
R-tree – Search: example (1)
R-tree – Search: example (2)
R-tree – Insert (1)
When inserting a new data item d, it is necessary to choose an MBR in a leaf where d can be inserted
• If d falls in an MBR in a leaf
  • insert d into the MBR
• If d does not fall in an MBR in a leaf
  • insert d into the MBR that causes the smallest volume growth
    • reduces dead space
    • minimizes the probability of unnecessary intersections in query evaluation
  • a heuristic is used to choose such an MBR (e.g., choose the MBR with the smallest overall volume; see the sketch below)
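A minimal sketch of this choice, assuming the MBR and Node classes above and treating the new data item d as a point; if d already falls inside an MBR its enlargement is 0, so that MBR wins automatically.

```python
def enlargement(mbr, point):
    """Volume growth of mbr if it were extended to cover the point."""
    lows = [min(lo, p) for lo, p in zip(mbr.lows, point)]
    highs = [max(hi, p) for hi, p in zip(mbr.highs, point)]
    return MBR(lows, highs).volume() - mbr.volume()

def choose_leaf(node, point):
    """Descend to the leaf whose MBR needs the smallest volume growth,
    breaking ties by smallest current volume (one common heuristic)."""
    while not node.is_leaf:
        _, node = min(node.entries,
                      key=lambda e: (enlargement(e[0], point), e[0].volume()))
    return node
```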
R-tree – Insert (2)
• If the MBR is full (i.e., already includes M data items)
  • the rectangle must be split
  • the parent node must be updated
  • the split can propagate recursively up to the root, possibly causing a growth in the height of the tree
R-tree – Insert: example
Splitting a node
• The goal of splitting an MBR is to limit overlaps, so as to minimize the need to traverse both resulting nodes
• Not a simple task!
[Figure: example of a good split vs. a bad split]
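The slides do not fix a split algorithm. One classical option (not the only one) is Guttman's quadratic split, whose seed-picking step is sketched below: the two entries that would waste the most space if kept together become the seeds of the two resulting nodes.

```python
def pick_seeds(entries):
    """Return the indexes of the pair of entries wasting the most space together."""
    best_pair, best_waste = None, float("-inf")
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            a, b = entries[i][0], entries[j][0]           # the two MBRs
            combined = MBR([min(x, y) for x, y in zip(a.lows, b.lows)],
                           [max(x, y) for x, y in zip(a.highs, b.highs)])
            waste = combined.volume() - a.volume() - b.volume()  # dead space
            if waste > best_waste:
                best_pair, best_waste = (i, j), waste
    return best_pair
```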
R-tree – Delete
• Search the leaf storing the data item d to be deleted
• Delete d from the MBR
• If the MBR includes less than m data items, condense the MBR (i.e., erase the node and re-insert its data items)
R-tree – Delete: example
R-tree – Update
• If a data item d is updated, the boundaries of the MBR where it is located might change
• The data item should be removed and re-inserted
Notes on R-tree
• The R-tree that maximizes search efficiency is the one minimizing overlaps and dead spaces
• Improvements over R-trees
• R+-trees: no overlap between MBRs
• R*-trees: improve the split algorithm
• X-trees: scalability up to 20 dimensions
Bitmap indexes
• A bitmap is an array data structure where each cell stores a bit (value 0 or 1)
• Bitmap indexes are mainly used in data warehouses because data are not frequently updated
• Considering an attribute:
• Different binary configurations of the bitmap represent different domain values
• Smaller domains imply shorter bitmaps
• Space occupied by the index:
  (number of distinct values) × (number of rows) × 1 bit
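A minimal sketch of how such an index could be built in Python; the Shop column below is illustrative and chosen to match the query example shown later.

```python
def build_bitmap_index(rows):
    """One bitmap per distinct attribute value, one bit per fact-table row."""
    return {value: [1 if v == value else 0 for v in rows]
            for value in set(rows)}

# Illustrative Shop column of the fact table (6 rows)
shops = ["Saturn", "Boss", "P&C", "Boss", "Saturn", "Boss"]
idx = build_bitmap_index(shops)
# idx["P&C"]    -> [0, 0, 1, 0, 0, 0]   i.e., 001000
# idx["Saturn"] -> [1, 0, 0, 0, 1, 0]   i.e., 100010
# Space: 3 distinct values * 6 rows * 1 bit
```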
Bitmap indexes: example
[Figure: Geography (dimension) and Sales (fact) tables with the bitmap index on Sales]
Bitmap on the fact table for dimension Geography over attribute Shop
• 6 bits, one per row of table Sales
• A bit is 1 if the row's Shop matches the bitmap's value, 0 otherwise
Bitmap indexes: removal
• If a record is removed from the fact table, the number of bits in the bitmap does not change
• Its value is set to 0 in all the bitmaps
[Figure: Sales (fact) table and bitmap index on Sales]
Bitmap indexes: insertion
• If a record is inserted into the fact table, the number of bits in the bitmap increases by one
• The new bit is set to 1 in the bitmap for the record's value and to 0 in all the other bitmaps already in the table
[Figure: Geography (dimension) and Sales (fact) tables with the bitmap index on Sales]
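Both maintenance operations are straightforward on this representation; a sketch under the same assumptions as the build example above:

```python
def bitmap_delete(index, row):
    """Removal: the number of bits does not change; the row's bit
    becomes 0 in every bitmap."""
    for bits in index.values():
        bits[row] = 0

def bitmap_insert(index, value):
    """Insertion: every bitmap grows by one bit, which is 1 only in the
    bitmap of the new record's value (assumes a non-empty index)."""
    n_rows = len(next(iter(index.values())))
    index.setdefault(value, [0] * n_rows)   # new bitmap if the value is new
    for v, bits in index.items():
        bits.append(1 if v == value else 0)
```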
Bitmap indexes: query evaluation
• Bitmap indexes are efficient for the evaluation of AND, OR, XOR operations (and the corresponding set intersection/union/difference)
• Especially useful when selectivity is high
Select the sum spent at P&C or Saturn
Bitmap = 001000 ∨ 100010 = 101010
[Figure: bitmap index on Sales]
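Continuing the earlier sketch, the disjunction is a bitwise OR of the two bitmaps; only the selected rows are then aggregated. The amounts column is illustrative.

```python
# Illustrative amount column of table Sales, aligned with `shops` above
amounts = [10, 20, 30, 40, 50, 60]

# "P&C or Saturn": 001000 OR 100010 = 101010
selected = [a | b for a, b in zip(idx["P&C"], idx["Saturn"])]
total = sum(amt for bit, amt in zip(selected, amounts) if bit)  # 10 + 30 + 50
```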
Bitmap indexes: pros and cons
Advantages
• Operations are efficient and easy to implement
Disadvantages
• Bitmaps are sparse (many 0s)
  • Solution: adopt compression techniques
• For each new attribute value, a new bitmap vector is needed
  • Solution: multi-component bitmaps
• Not useful for range queries
  • Solution: range-encoded bitmaps
Join optimization
• Due to the structure of data warehouses, the evaluation of queries often requires the execution of join operations
• Mainly join between the fact table and dimension tables
• Joins are expensive operations, and choosing a different order or algorithm for their evaluation can make the difference
• In OLTP, joins between tables with no common attributes (i.e., cross products) are avoided
• In DWs, cross products are quite frequent
• Use adequate index structures to reduce their cost
Use of materialized views
• Materialized views can reduce query evaluation costs (e.g., pre-computing aggregates or join results)
• To this aim, it is necessary to integrate a materialized view M in a query Q, obtaining Q’, in such a way that Q’ is equivalent to Q (i.e., is guaranteed to produce the same result)
• Selection conditions on M cannot be more restrictive than in Q
• The projection in Q should be a subset of the projection in M
• Selection conditions in Q should be possible on M
• Aggregates in Q should be computable over M (see the illustration below)
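A hypothetical illustration of these conditions (table, view, and column names are invented for the example): Q' is equivalent to Q because M's selection is not more restrictive, Q's projection is a subset of M's, the selection on month is possible on M, and SUM can be recomputed from M's partial sums.

```python
# Materialized view M: partial sums per shop and month
M = """CREATE MATERIALIZED VIEW sales_by_shop_month AS
       SELECT shop, month, SUM(amount) AS tot
       FROM Sales
       GROUP BY shop, month"""

# Query Q over the fact table
Q = """SELECT shop, SUM(amount)
       FROM Sales
       WHERE month >= '2024-01'
       GROUP BY shop"""

# Q': same result, computed over M instead of Sales
Q_rewritten = """SELECT shop, SUM(tot)
                 FROM sales_by_shop_month
                 WHERE month >= '2024-01'
                 GROUP BY shop"""
```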
Exercise on bitmap indexes
• Build the bitmap index for attribute PRODUCT and the bitmap index for attribute CITY.
• Write the condition operating on bitmap indexes to filter sales in Milan for products P1 and P3.
[Table: Sales (fact)]
ETL: Extract Transform Load (1)
• Three database functions that are combined into one tool to pull data out of operational DBs and place it into the DW
• Migrate data from one database to another, to form data marts and data warehouses
• Convert databases from one format or type to another
• When should we ETL?
• Periodically (e.g., every night, every week) or after significant events
• Refresh policy set by administrator based on user needs and traffic
• Possibly different policies for different sources
• Rarely, on every update (real-time DW)
ETL: Extract Transform Load (2)
• Integrate heterogeneous systems (different DBMS, operating system, hardware, communication protocols, etc.)
• ETL challenges
• Getting the data from the source to target as fast as possible
• Allow recovery from failure without restarting the whole process
• Data managed by ETL tools are in the staging area
  • Cannot be analyzed and used for reporting
  • Can only be used to feed the DW
• ETL is often underestimated and time-consuming
Data staging area
• Area where data are stored during the ETL process
• Data in the staging area cannot be accessed/queried/analyzed
• Supports sequential operations on large data collections
• Cleaned and elaborated data are then inserted into data marts
Structures for the staging area
• Flat files
• No overhead due to the presence of a DBMS infrastructure
• No functionalities and indexes offered by DBMS
• Used for persistent storage of data
• Script-based ETL enables efficient searches, filtering, and replacing
• XML datasets
• Useful standard for input/output of the ETL system
• Can rely on XPath, XQuery
• Relational tables
• Can be easily used also in absence of dedicated ETL tools
• Advantages of RDBMS: integrity, SQL interface, clear metadata
• May be slower
ETL process
• Create a high-level schema of the flow of information
• Choose the most adequate ETL tool
• Identify complex data transformations and identify keys for tables
• Construct dimension tables
  • Construct one static dimension
  • Construct update mechanisms for one dimension
  • Construct the remaining dimensions
• Construct fact tables
  • Construct the fact table
  • Define incremental updates
• Construct aggregates
• Design and construct ETL automation
High-level diagram
• Highlight tables including data that will be aggregated to feed the data warehouse
[Diagram: ETL flow with steps Raw-product, Add product type, Check ref. integrity, Aggregate sales per product per day, Extract time]
Building dimensions
• Static dimension table
• Identify and assign keys to dimension tables
• Maintain a mapping between production keys and data warehouse keys using a table (see the sketch after this list)
• Handle changes in dimensions
• Maintain updates in the mapping between production and data warehouse keys
• Load dimension tables
  • Replace -> small tables
  • Load only changes -> large tables
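A minimal sketch of the key mapping mentioned above, kept in memory for illustration; real ETL tools persist this mapping as a table in the staging area.

```python
class KeyMapper:
    """Maps production keys to surrogate data-warehouse keys."""
    def __init__(self):
        self.mapping = {}   # production key -> DW surrogate key

    def surrogate_key(self, prod_key):
        # Assign the next surrogate key the first time a production key is seen
        if prod_key not in self.mapping:
            self.mapping[prod_key] = len(self.mapping) + 1
        return self.mapping[prod_key]
```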
Building fact tables
• Initial load
  • Done when the DW is first created
  • ETL for all data in production files
  • Very heavy
• Incremental updates
  • Move only changes from the last load
  • Performed periodically
  • Less heavy
• Dimensions must be updated before facts
  • Relevant dimension rows for new facts must already be in the dimension tables
Extract (1)
• Data are read from a data source
  • Internal scripts/tools at the data source, which export the data to be used
  • External programs, which extract the data from the source
• Extracted data are stored in the staging area
• Data sources can be of two types
  • Non-cooperative (e.g., snapshot, specific, logged, queryable sources)
  • Cooperative (e.g., replicated, call-back, internal action)
Extract (2)
• Different kinds of extraction methods, including
  • SQL-based applications
  • DB unload tools
  • Specific applications
• Extraction is a time-consuming operation
  • Usually only changes are extracted periodically
Extract: computing deltas
• Delta: changes in the data source since the last load
• Store extracted totals in the staging area
  • Always possible
  • Handles deletions
  • High extraction times
• Include an update timestamp in all rows at the data sources (see the sketch below)
  • Requires revision of the source systems
  • Reduces extract times
  • Cannot (alone) handle deletions
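A minimal sketch of the timestamp-based approach, assuming an illustrative source table source_sales with an updated_at column:

```python
import sqlite3

def extract_delta(conn, last_load_ts):
    """Return only the rows changed since the last load; the caller writes
    them to the staging area. Deletions are not captured this way."""
    cur = conn.execute(
        "SELECT * FROM source_sales WHERE updated_at > ?",
        (last_load_ts,))
    return cur.fetchall()
```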
Transform
• Translates the original format of the data into a desired format/state for the DW to be populated
• Transform implies: data type conversion, normalization/de-normalization, building keys
• Includes two sub-phases
  • Data cleaning
  • Data integration
Transform: data quality
• Data quality is important for data warehouses
• Data should be
  • Precise
  • Complete
  • Consistent
  • Unique
Data cleaning (1)
• Often data extracted from different sources are not of adequate quality
• Data cleaning aims at increasing data quality before inserting data in the DW
• Data cleaning can precede, follow, or work in parallel to data integration
• Cleaning data manually takes too long
• Data cleaning uses semi-automatic tools and anomaly detection techniques
  • Thesauri, regular expressions, geographical information, …
Data cleaning (2)
The errors that data cleaning usually aims to correct include
• Incomplete data: define a coding for missing information, try to reconstruct or predict the missing data, or ask the operator for input
• Incorrect data (e.g., impossible values): syntactic rules and dictionaries to identify typos or different representation conventions
• Inconsistent data: identify rules that establish a correlation, and use them to identify errors (see the sketch below)
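A small sketch combining these ideas; the missing-value code, the dictionary of known variants, and the normalization rule are all illustrative:

```python
import re

MISSING_CODE = "UNKNOWN"                     # coding for missing information
CITY_DICTIONARY = {"milan": "Milan",         # known variants and typos
                   "milano": "Milan",
                   "nyc": "New York"}

def clean_city(raw):
    if raw is None or not raw.strip():
        return MISSING_CODE                  # incomplete data
    key = re.sub(r"[^a-z]", "", raw.lower()) # normalize representation conventions
    return CITY_DICTIONARY.get(key, raw.strip())
```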
Integration
• Several conceptual schemas need to be combined into a unified global schema
• Differences in terminology must be resolved
• Redundancy must be removed
Load
• Fast loading is not an easy task
  • SQL-based loading is not efficient
  • DB load tools are faster
• Rebuilding indexes as a consequence of load operations is an expensive task
• Parallelization is important for improving efficiency
  • Dimensions can be loaded concurrently
  • Fact tables can be loaded concurrently
  • Partitions can be loaded concurrently
• Initial load
  • Deliver dimension tables
  • Deliver fact tables
• Continuous load
  • Must be scheduled and proceed in a specific order to guarantee that integrity and completeness constraints are not violated
  • Should be carefully planned