Predicate Migration:
Optimizing Queries with Expensive Predicates
Joseph M. Hellerstein*    Michael Stonebraker
Computer Science Division, EECS Department, University of California, Berkeley, CA 94720

* Current address: Computer Sciences Department, University of Wisconsin, Madison, WI 53706. This material is based upon work supported under a National Science Foundation Graduate Fellowship.
Abstract

The traditional focus of relational query optimization schemes has been on the choice of join methods and join orders. Restrictions have typically been handled in query optimizers by “predicate pushdown” rules, which apply restrictions in some random order before as many joins as possible. These rules work under the assumption that restriction is essentially a zero-time operation. However, today’s extensible and object-oriented database systems allow users to define time-consuming functions, which may be used in a query’s restriction and join predicates. Furthermore, SQL has long supported subquery predicates, which may be arbitrarily time-consuming to check. Thus restrictions should not be considered zero-time operations, and the model of query optimization must be enhanced.

In this paper we develop a theory for moving expensive predicates in a query plan so that the total cost of the plan — including the costs of both joins and restrictions — is minimal. We present an algorithm to implement the theory, as well as results of our implementation in POSTGRES. Our experience with the newly enhanced POSTGRES query optimizer demonstrates that correctly optimizing queries with expensive predicates often produces plans that are orders of magnitude faster than plans generated by a traditional query optimizer. The additional complexity of considering expensive predicates during optimization is found to be manageably small.
1 Introduction
Traditional relational database (RDBMS) literature on query optimization stresses the significance of choosing an efficient order of joins in a query plan. The placement of the other standard relational operators (selection and projection) in the plan has typically been handled by “pushdown” rules (see e.g., [Ull89]), which state that restrictions and projections should be pushed down the query plan tree as far as possible. These rules place no importance on the ordering of projections and restrictions once they have been pushed below joins.
The rationale behind these pushdown rules is that the relational restriction and projection operators take essentially no time to carry out, and reduce subsequent join costs. In today’s systems, however, restriction can no longer be considered to be a zero-time operation. Extensible database systems such as POSTGRES [SR86] and Starburst [HCL+90], as well as various Object-Oriented DBMSs (e.g., [MS87], [WLH90], [D+90], [ONT92]) allow users to implement predicate functions in a general-purpose programming language such as C or C++. These functions can be arbitrarily complex, potentially requiring access to large amounts of data and extremely complex processing. Thus it is unwise to choose a random order of application for restrictions on such predicates, and it may not even be optimal to push them down a query plan tree. Therefore the traditional model of query optimization does not produce optimal plans for today’s queries, and as we shall see, the plans that traditional optimizers generate can be many orders of magnitude slower than a truly optimal plan.
To illustrate the significance of ordering restriction predicates, consider the following example:
Example 1.
/* Find all maps from week 17 showing more than 1% snow cover. Channel 4 contains images from the frequency range that interests us. */
retrieve (maps.name)
where maps.week = 17 and maps.channel = 4
and coverage(maps.picture) > 1
In this example, the function coverage is a complex image analysis function that may take many thousands of instructions to compute. It should be quite clear that the query will run faster if the restrictions maps.week = 17 and maps.channel = 4 are applied before the restriction coverage(maps.picture) > 1, since doing so minimizes the number of calls to coverage.
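To make the ordering arithmetic concrete, the following sketch (in Python; the costs and selectivities are invented for illustration, not taken from this paper) estimates the expected per-tuple cost of applying a sequence of restrictions. Each predicate is evaluated only on the tuples that survive the predicates before it, so a cheap, selective predicate placed first shields the expensive coverage call from most tuples.

    # Hypothetical (cost per tuple, selectivity) for the predicates of
    # Example 1; the figures are illustrative only.
    predicates = {
        "week = 17":         (0.001, 0.02),
        "channel = 4":       (0.001, 0.20),
        "coverage(...) > 1": (1000.0, 0.50),
    }

    def expected_cost_per_tuple(order):
        # A predicate is evaluated only on the fraction of tuples that
        # passed all earlier predicates in the order.
        total, surviving = 0.0, 1.0
        for name in order:
            cost, selectivity = predicates[name]
            total += surviving * cost
            surviving *= selectivity
        return total

    cheap_first = ["week = 17", "channel = 4", "coverage(...) > 1"]
    expensive_first = ["coverage(...) > 1", "week = 17", "channel = 4"]
    print(expected_cost_per_tuple(cheap_first))      # ~4.0 per maps tuple
    print(expected_cost_per_tuple(expensive_first))  # ~1000.0 per maps tuple

For independent predicates, sorting restrictions by the rank metric (selectivity - 1)/cost, which this paper builds on below, produces the cheapest such ordering.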
While restriction ordering such as this is important, correctly ordering restrictions within a table-access is not sufficient to solve the general problem of where to place predicates in a query execution plan. Consider the following example:
Example 2.
/* Find all channel 4 maps from weeks starting in June that show more than 1% snow cover. Information about each week is kept in the weeks table, requiring a join. */
retrieve (maps.name)
where maps.week = weeks.number
and weeks.month = “June”
and maps.channel = 4
and coverage(maps.picture) > 1
Traditionally, a DBMS would execute this query by applying all the single-table restrictions in the where clause before performing the join of maps and weeks, since early restriction can lower the complexity of join processing. However, in this example the cost of evaluating the expensive restriction predicate may outweigh the benefit gained by doing restriction before join. In other words, this may be a case where “predicate pushdown” is precisely the wrong technique. What is needed here is “predicate pullup”, namely postponing the restriction coverage(maps.picture) > 1 until after computing the join of maps and weeks.
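A back-of-the-envelope comparison (Python; all cardinalities, selectivities, and costs below are hypothetical, chosen only to illustrate the trade-off) shows when pullup beats pushdown:

    # Illustrative cost comparison of pushdown vs. pullup for Example 2.
    m = 1000        # maps tuples that reach this point in the plan
    j = 0.01        # join cost per maps tuple
    s = 0.1         # fraction of maps tuples that join with a June week
    c = 1000.0      # per-call cost of coverage()
    sel_cov = 0.5   # selectivity of coverage(maps.picture) > 1

    # Pushdown: evaluate coverage on every maps tuple, join the survivors.
    pushdown = m * c + (m * sel_cov) * j      # 1000005.0

    # Pullup: join first; evaluate coverage only on tuples that joined.
    pullup = m * j + (m * s) * c              # 100010.0

    print(pushdown, pullup)

With these numbers the pullup plan is an order of magnitude cheaper; with a cheap predicate, or a highly selective one, pushdown would win instead.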
In general it is not clear how joins and restrictions should be interleaved in an optimal execution plan, nor is it clear whether the migration of restrictions should have an effect on the join orders and methods used in the plan. This paper describes and proves the correctness of the Predicate Migration Algorithm, which produces a minimal-cost query plan for queries with expensive predicates. Predicate Migration modestly increases query optimization time: the additional cost factor is polynomial in the number of operators in a query plan. This compares favorably to the exponential join enumeration schemes used by most query optimizers, and is easily circumvented when optimizing queries without expensive predicates — if no expensive predicates are found while parsing the query, the techniques of this paper need not be invoked. For queries with expensive predicates, the gains in execution speed should offset the extra optimization time. We have implemented Predicate Migration in POSTGRES, and have found that with modest overhead in optimization time, the execution time of many practical queries can be reduced by orders of magnitude. This will be illustrated below.
1.1 Application to Existing Systems: SQL and Subqueries
It is important to note that expensive predicate functions do not exist only in next-generation research prototypes. Current relational languages, such as the industry standard, SQL [ISO91], have long supported expensive predicate functions in the guise of subquery predicates. A subquery predicate is one of the form “expression operator query”. Evaluating such a predicate requires executing an arbitrary query and scanning its result for matches — an operation that is arbitrarily expensive, depending on the complexity and size of the subquery. While some subquery predicates can be converted into joins (thereby becoming subject to traditional join-based optimization strategies), even sophisticated SQL rewrite systems, such as that of Starburst [PHH92], cannot convert all subqueries to joins. When one is forced to compute a subquery in order to evaluate a predicate, the predicate should be treated as an expensive function. Thus the work presented in this paper is applicable to the majority of today’s production RDBMSs, which support SQL.
1.2 Related Work
Stonebraker first raised the issue of expensive predicate optimization in the context of the POSTGRES multi-level store [Sto91]. The questions posed by Stonebraker are directly addressed in this paper, although we vary slightly in the definition of cost metrics for expensive functions.
One of the main applications of the system described in [Sto91] is Project Sequoia 2000 [SD92], a University of California project that will manage terabytes of Geographic Information System (GIS) data to support global change researchers. It is expected that these researchers will be writing queries with expensive functions to analyze this data. A benchmark of such queries is presented in [SFGM93].
Ibaraki and Kameda [IK84], Krishnamurthy, Boral and Zaniolo [KBZ86], and Swami and Iyer [SI92] have developed and refined a query optimization scheme that is built on the notion of rank that we will use below. However, their scheme uses rank to reorder joins rather than restrictions. Their techniques do not consider the possibility of expensive restriction predicates, and only reorder nodes of a single path in a left-deep query plan tree, while the technique presented below optimizes all paths in an arbitrary tree. Furthermore, their schemes are a proposal for a completely new method for query optimization, while ours is an extension that can be applied to the plans of any query optimizer. It is possible to fuse the technique we develop in this paper with those of [IK84, KBZ86, SI92], but we do not focus on that issue here since their schemes are not widely in use.
The notion of expensive restrictions was considered in the context of the LDL logic programming system [CGK89]. Their solution was to model a restriction on a relation R as a join between R and a virtual relation of infinite cardinality containing the entire logical predicate of the restriction. By modeling restrictions as joins, they were able to use a join-based query optimizer to order all predicates appropriately. Unfortunately, most traditional DBMS query optimizers have complexity that is exponential in the number of joins. Thus modelling restrictions as joins can make query optimization prohibitively expensive for a large set of queries, including queries on a single relation. The scheme presented here does not cause traditional optimizers to exhibit this exponential growth in optimization time.
Caching the return values of function calls will prove to be vital to the techniques presented in this paper. Jhingran [Jhi88] has explored a number of the issues involved in caching procedures for query optimization. Our model is slightly different, since our caching scheme is value-based, simply storing the results of a function on a set of argument values. Jhingran’s focus is on caching complex object attributes, and is therefore instance-based.
1.3 Structure of the Paper
The following section develops a model for measuring the cost and selectivity of a predicate, and describes the advantages of caching for expensive functions. Section 3 presents the Predicate Migration Algorithm, a scheme for optimally locating predicates in a given join plan. Section 4 details the results of our implementation experience in POSTGRES. Section 5 summarizes and provides directions for future research.
2 Background: Expenses and Caching
Query optimization schemes typically attempt to find a query plan of minimal estimated cost. To develop our optimizations, we must enhance the traditional model for analyzing query plan cost. This will involve some modifications of the usual metrics for the expense of relational operators, and will also require the introduction of function caching techniques. This preliminary discussion of our model will prove critical to the analysis below.
A relational query in a language such as SQL or Postquel [RS87] may have a where clause, which contains an
arbitrary Boolean expression over constants and the range variables of the query. We break such clauses into a maximal set of conjuncts, or “Boolean factors” [SAC+79], and refer to each Boolean factor as a distinct “predicate” to be satisfied by each result tuple of the query. When we use the term “predicate” below, we refer to a Boolean factor of the query’s where clause. A join predicate is one that refers to multiple tables, while a restriction predicate refers only to a single table.
Traditional query optimizers compute selectivities for both joins and restrictions. That is, for any predicate (join or restriction) they estimate the value

    selectivity = card(output) / card(input)

and make the assumption that selectivities of different predicates are independent. Typically these estimations are based on default values and system statistics [SAC+79], although recent work suggests that accurate and inexpensive sampling techniques can be used [LNSS93, HOT88].
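In code, the estimate and the independence assumption amount to the following (a minimal sketch; the helper names are ours):

    # selectivity = card(output) / card(input), estimated per predicate.
    def selectivity(card_output, card_input):
        return card_output / card_input

    # Under the independence assumption, a conjunction's selectivity is
    # the product of the individual selectivities.
    def conjunction_selectivity(selectivities):
        product = 1.0
        for s in selectivities:
            product *= s
        return product

    print(conjunction_selectivity([0.02, 0.20, 0.50]))  # 0.002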
2.1 Cost of User-Defined Functions in POSTGRES

In an extensible system such as POSTGRES, arbitrary user-defined functions may be introduced into both restriction and join predicates. These functions may be written in a general programming language such as C, or in the database query language, e.g. SQL or Postquel. In this section we discuss programming language functions; we handle query language functions below.

Given that user-defined functions may be written in a general-purpose language such as C, there is little hope for the database to correctly estimate the cost and selectivity of predicates containing these functions, at least not initially.* In this section we extend the POSTGRES function definition syntax to capture a function’s expense. Selectivity modeling for user-defined operators in POSTGRES has been described in [Mos90].

* After repeated applications of a function, one could collect performance statistics and use curve-fitting techniques to make estimates about the function’s behavior. Such techniques are beyond the scope of this paper.

To introduce a function to POSTGRES, a user first writes the function in C and compiles it, and then issues Postquel’s define function command. To capture expense information, the define function command accepts a number of special flags, which are summarized in Table 1.

    flag          description
    percall_cpu   execution time per invocation, regardless of the size of the arguments
    perbyte_cpu   execution time per byte of arguments
    byte_pct      percentage of argument bytes that the function will need to access

    Table 1: Function Expense Parameters in POSTGRES

The cost of a predicate in POSTGRES is computed by adding up the costs for each expensive function in the expression. Given a POSTGRES predicate p, recursively defined as either a constant, a tuple variable, or a function f(p_1, ..., p_n), the expense per tuple e(p) is

    e(p) = 0                                   if p is a constant or tuple variable
    e(p) = sum_i e(p_i) + percall_cpu
           + perbyte_cpu * sum_i bytes(p_i)
           + access_cost                       if p is a function f(p_1, ..., p_n)

where e(p_i) is the recursively computed expense of argument p_i, bytes(p_i) is the expected (return) size of the argument in bytes, and access_cost is the cost of retrieving any data necessary to compute the function. This data may be stored anywhere in the various levels of the POSTGRES multi-level store, but unlike [Sto91] we do not require the user to define constants specific to the different levels of the multi-level store. Instead, this can be computed by POSTGRES itself via system statistics, thus providing more accurate information about the distribution and caching of data across the storage levels.
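The recursion can be sketched as follows (Python; the Expr representation and all constants are our invention, not the POSTGRES implementation):

    # Per-tuple expense of a predicate expression, mirroring the recursive
    # definition above: constants and tuple variables are free, and a
    # function call pays its CPU and access costs plus the expenses of
    # its arguments.
    class Expr:
        def __init__(self, kind, children=(), percall_cpu=0.0,
                     perbyte_cpu=0.0, access_cost=0.0, size_bytes=0):
            self.kind = kind                # "const", "var", or "func"
            self.children = list(children)  # argument expressions
            self.percall_cpu = percall_cpu
            self.perbyte_cpu = perbyte_cpu
            self.access_cost = access_cost
            self.size_bytes = size_bytes    # expected (return) size in bytes

    def expense(p):
        if p.kind in ("const", "var"):
            return 0.0
        arg_expense = sum(expense(a) for a in p.children)
        arg_bytes = sum(a.size_bytes for a in p.children)
        return (arg_expense + p.percall_cpu
                + p.perbyte_cpu * arg_bytes + p.access_cost)

    # coverage(maps.picture): one call over a roughly 1 MB argument.
    picture = Expr("var", size_bytes=1000000)
    pred = Expr("func", [picture], percall_cpu=50000.0,
                perbyte_cpu=0.01, access_cost=20000.0)
    print(expense(pred))  # 50000 + 0.01 * 1000000 + 20000 = 80000.0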
2.2 Cost of SQL Subqueries and Other Query Language Functions
SQL allows a variety of subquery predicates of the form “expression operator query”. Such predicates require computation of an arbitrary SQL query for evaluation. Simple uncorrelated subqueries have no references to query blocks at higher nesting levels, while correlated subqueries refer to tuple variables in higher nesting levels.
In principle, the cost to check an uncorrelated subquery restriction is the cost of materializing the subquery once, plus the cost of scanning the subquery once per tuple. However, we will need these cost estimates only to help us reorder operators in a query plan. Since the cost of initially materializing an uncorrelated subquery must be paid regardless of the subquery’s location in the plan, we ignore the overhead of the materialization cost, and consider an uncorrelated subquery’s cost per tuple to be the cost of one scan of the subquery’s result.
Correlated subqueries must be materialized for each value that is checked against the subquery predicate, and hence the per-tuple expense for correlated subqueries is the cost of one materialization of the subquery. We ignore the scan cost here since scanning can be done during each materialization, and does not represent a separate cost. Postquel functions in POSTGRES have costs that are equivalent to those of correlated subqueries in SQL: an arbitrary access plan is executed once per tuple of the relation being restricted by the Postquel function.
The cost estimates presented here for query language functions form a simple model and raise some issues in setting costs for subqueries. The cost of a subquery predicate may be lowered by transforming it to another subquery predicate [LDH+87], and by “early stop” techniques, which stop materializing or scanning a subquery as soon as the predicate can be resolved [Day87]. Incorporating such schemes is beyond the scope of this paper, but including them into the framework of the later sections merely requires more careful estimates of the subquery costs.
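Under this simple model, the per-tuple estimates reduce to a case analysis (a sketch with placeholder costs):

    # Per-tuple cost of a subquery predicate under the model above.
    # Uncorrelated: the one-time materialization is ignored, since it is
    # paid wherever the predicate is placed; only the per-tuple scan counts.
    # Correlated: the subquery is re-materialized per tuple, and scanning
    # happens during materialization at no separate cost.
    def subquery_cost_per_tuple(correlated, scan_cost, materialize_cost):
        return materialize_cost if correlated else scan_cost

    print(subquery_cost_per_tuple(False, scan_cost=5.0, materialize_cost=200.0))  # 5.0
    print(subquery_cost_per_tuple(True,  scan_cost=5.0, materialize_cost=200.0))  # 200.0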
2.3 Join Expenses
In our subsequent analysis, we will be treating joins and restrictions uniformly in order to optimally balance their costs and benefits. In order to do this, we will need to measure the expense of a join per tuple of the join’s input, i.e. per tuple of the cartesian product of the relations being joined. This can be done for any join method whose costs are linear in the cardinalities of the input relations, including the most common algorithms: nested-loop join, hash join, and merge join. Note that sort-merge join is not linear in the cardinalities of the input relations. However, most systems, including POSTGRES, do not use sort-merge join, since in situations where merge join requires sorting of an input, either hash join or nested-loop join is almost always preferable to sort-merge.
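For such linear join methods, the per-tuple figure is simply the total join cost divided by the cardinality of the cartesian product (a sketch; the 3-units-per-input-tuple constant is invented, and the cardinalities are those of the example relations in Table 2 below):

    # Per-tuple join expense: cost per tuple of the cartesian product of
    # the inputs. Valid for join methods whose total cost is linear in
    # the input cardinalities (nested-loop, hash, merge).
    def join_cost_per_tuple(total_join_cost, card_outer, card_inner):
        return total_join_cost / (card_outer * card_inner)

    card_maps, card_weeks = 932, 19
    total = 3.0 * (card_maps + card_weeks)   # hypothetical hash-join cost
    print(join_cost_per_tuple(total, card_maps, card_weeks))  # ~0.16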
A query may contain many join predicates over the same set of relations. In an execution plan for a query, some of these predicates are used in processing a join, and we call these primary join predicates. If a join has expensive primary join predicates, then the cost per tuple of the join should reflect the expensive function costs. That is, we add the expensive functions’ costs, as described in Section 2.1, to the join costs per tuple.
    relation   tuple size (bytes)   cardinality
    maps       1,040,424            932
    weeks      24                   19
    emp        32                   10,000
    dept       44                   20

    Table 2: Tuple sizes and cardinalities of the relations used in the examples
Join predicates that are not applicable while processing the join are merely used to restrict its output, and we refer to these as secondary join predicates. Secondary join predicates are essentially no different from restriction predicates, and we treat them as such. These predicates may be reordered and even pulled up above higher join nodes, just like restriction predicates. Note, however, that a secondary join predicate must remain above its corresponding primary join. Otherwise the secondary join predicate would be impossible to evaluate.
2.4 Function Caching
The existence of expensive predicates not only motivates research into richer optimization schemes, it also suggests the need for DBMSs to cache the results of expensive predicate functions. In this paper, we assume that the system caches the return values of all functions for at least the duration of a query. This lowers the cost of a function, since with some probability the function can be evaluated simply by checking the cache. Given the distribution of the data in a function’s cache, and the distribution of the inputs to a function, one can derive a ratio of cache misses to cache lookups for the function. This ratio serves as the probability of a cache miss for a given tuple, and should be factored into the function’s cost per tuple.
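A minimal value-based cache can be sketched as a memoizing wrapper keyed on argument values (our sketch, not the POSTGRES implementation); with miss ratio r, an expensive function’s effective per-call cost drops from c to roughly r * c plus the lookup overhead:

    # Value-based function cache: results are keyed on argument values
    # and retained for (at least) the duration of a query.
    def cached(fn):
        cache = {}
        def wrapper(*args):
            if args not in cache:       # cache miss: pay the full cost
                cache[args] = fn(*args)
            return cache[args]          # cache hit: near-zero cost
        return wrapper

    @cached
    def coverage(picture):
        # stand-in for an expensive image-analysis function
        return sum(picture) / len(picture)

    print(coverage((1, 2, 3)))  # computed on a miss
    print(coverage((1, 2, 3)))  # served from the cache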