
Microsoft PowerPoint – Spatial Data Management – Week 8 – Advanced Topics 2 – Improving Performance


Week 8 – Advanced Topics 2
Improving Database Performance

Dr Claire Ellul

c.ellul@ucl.ac.uk

Spatial Data Management

• Overview

– Indexing

– Normalisation and De-Normalisation

– Query Parsing

Indexes

• Indexes

– Provide accelerated access to data by reducing the set of objects to be examined when processing a query

– In simple terms, they improve query
performance

Indexes

• Indexes

– Take the following PERSONAL_DETAILS table

Name     Surname  Date of Birth
James    Jones    25th January 1987
James    Smith    13th March 1956
Robert   Jones    15th April 1945
Janet    Jones    15th April 1967
Richard  Jones    20th September 1972
Alex     Smith    19th November 1966
John     Ward     1st August 1979

Indexes

• Run the following query

SELECT *
FROM PERSONAL_DETAILS
WHERE SURNAME = 'SMITH';

Indexes

• Without an index

– The system has to check each row to see if the
surname is Smith

– In a large table this can take a while!


Indexes

– The system has to run the following code

FOR EACH ROW IN THE TABLE
  IF ROW.SURNAME = 'SMITH' THEN
    ADD ROW TO RESULTS LIST
  END IF
NEXT ROW

– For the small table shown above this requires a total of 3 x 7 = 21 operations (3 operations – FOR, IF, ADD – per row, and 7 rows of data)

Indexes

• Without an index

– This process must be repeated for each of the rows in the table – consider the case of a table with 2 million rows

– This results in 2 million x 3 = 6 million operations, which is not very efficient!
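The full-table scan above can be sketched in Python. This is an illustrative model only, not DBMS internals – the PERSONAL_DETAILS table is just a list of tuples:

```python
# Hypothetical PERSONAL_DETAILS table modelled as (name, surname, dob) tuples.
personal_details = [
    ("James", "Jones", "25th January 1987"),
    ("James", "Smith", "13th March 1956"),
    ("Robert", "Jones", "15th April 1945"),
    ("Janet", "Jones", "15th April 1967"),
    ("Richard", "Jones", "20th September 1972"),
    ("Alex", "Smith", "19th November 1966"),
    ("John", "Ward", "1st August 1979"),
]

def seq_scan(rows, surname):
    """Check every row -- cost grows linearly with table size."""
    results = []
    for row in rows:               # FOR EACH ROW IN THE TABLE
        if row[1] == surname:      # IF ROW.SURNAME = 'SMITH'
            results.append(row)    # ADD ROW TO RESULTS LIST
    return results

print(seq_scan(personal_details, "Smith"))
```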

Indexes

• Adding an Index

– This will speed up the process by creating a
‘shortcut’ to each unique surname in the table

– Conceptually, the index will look something
like this:

Surname | Rows
Jones   | 1, 3, 4, 5
Ward    | 7
Smith   | 2, 6

Indexes

• With an Index
– The system now has to run the following code

FOR EACH ROW IN THE INDEX TABLE
  IF INDEX.SURNAME = 'SMITH' THEN
    FOR EACH ITEM IN THE LIST
      SELECT ROW FROM THE TABLE
      ADD THE DATA INTO THE RESULTS LIST
    NEXT ITEM IN THE LIST
  END IF
NEXT ROW IN THE INDEX TABLE

– For the table above, that is a total of 12 operations!
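A minimal sketch of the index lookup in Python, modelling the conceptual index as a dictionary from surname to row numbers (the row numbers follow the index table above):

```python
# Conceptual surname index: surname -> list of row numbers (1-based, as on the slide).
index = {"Jones": [1, 3, 4, 5], "Ward": [7], "Smith": [2, 6]}

# The base table, keyed by row number.
table = {
    1: ("James", "Jones"), 2: ("James", "Smith"), 3: ("Robert", "Jones"),
    4: ("Janet", "Jones"), 5: ("Richard", "Jones"), 6: ("Alex", "Smith"),
    7: ("John", "Ward"),
}

def index_scan(surname):
    # One index lookup, then fetch only the matching rows.
    return [table[row_id] for row_id in index.get(surname, [])]

print(index_scan("Smith"))
```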

Indexes

• Creating an Index in SQL

– The SQL command to create an index is:

CREATE INDEX <index_name>
ON <table_name>(<column_name>);

– For example:

CREATE INDEX SURNAME_IDX
ON PERSONAL_DETAILS(SURNAME);
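As a hedged illustration, using SQLite (from Python's standard library) as a stand-in for a full DBMS, we can create the index and then ask for the query plan to confirm it is used:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE personal_details (name TEXT, surname TEXT, dob TEXT)")
conn.executemany(
    "INSERT INTO personal_details VALUES (?, ?, ?)",
    [("James", "Smith", "1956-03-13"),
     ("Alex", "Smith", "1966-11-19"),
     ("James", "Jones", "1987-01-25")],
)

# The index from the slide, in SQLite syntax (identical to the standard form).
conn.execute("CREATE INDEX surname_idx ON personal_details(surname)")

# The last column of each plan row describes the access method.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM personal_details WHERE surname = 'Smith'"
).fetchall()
print(plan)  # the plan mentions surname_idx rather than a full scan
```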

Indexes

• Indexes

– In reality, indexes are stored using a system
called a B-Tree

– This takes advantage of the structure of the
hard disk of the computer to allow fast
retrieval of the index data
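Conceptually, a B-Tree keeps its keys in sorted order so a lookup needs only O(log n) comparisons. A rough analogy in Python is a sorted list searched with `bisect` – this is an analogy only, not a real B-Tree, which is organised around disk pages:

```python
import bisect

# A sorted (surname, row_id) list standing in for the index entries.
entries = sorted([("Jones", 1), ("Smith", 2), ("Jones", 3), ("Jones", 4),
                  ("Jones", 5), ("Smith", 6), ("Ward", 7)])

def lookup(surname):
    # Binary search for the range of entries with this surname.
    lo = bisect.bisect_left(entries, (surname,))
    hi = bisect.bisect_right(entries, (surname, float("inf")))
    return [row_id for _, row_id in entries[lo:hi]]

print(lookup("Smith"))
```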


Indexes

• When to use indexes

– Non-spatial indexes are best used when some
of the data in the column is the same

– Indexes are not very efficient:

• When each item in the column is unique

• When the datasets are small (indexes will not make
much difference in performance)

Indexes

• When to use Indexes

– When deciding whether to use an index, you
also need to look at how the data will be used

– Will it be added to or updated very frequently?

– Or will it be used for decision making – i.e. to answer queries?

Indexes

• When to use Indexes

– Each time you insert, update or delete a
record, the index has to be modified

– This takes time, and the index may not be
worth using if the data is not being used for
decision support

Spatial Indexing

• Spatial indexes work on a similar principle to
normal indexes

• Instead of looking for rows with the same
surname, they look for rows with data that is in
the same area

• The idea is that if you are interested in data for
London, you will not be interested in data for
Scotland, so the system should just get the
London data you need without searching the
Scotland data too

Indexing and Spatial Indexing

• An example – Simple Grid Index

[Diagram: a regular grid with rows and columns labelled A–D, containing eight numbered features (1–8) and a line L]

Indexing and Spatial Indexing

• Grid Structure

– This involves partitioning the space into
regular rectangular grid squares called cells

– A point is assigned to one of the grid cells if it
is within the grid cell

– Points, lines and polygons may be assigned to
multiple grid cells if they overlap
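A minimal sketch of assigning points to grid cells (the cell size of 10 units and the point coordinates are arbitrary assumptions):

```python
from collections import defaultdict

def grid_cell(x, y, cell_size=10.0):
    """Map a point to its (column, row) grid cell by integer division."""
    return (int(x // cell_size), int(y // cell_size))

# Build the grid index: cell -> list of features in that cell.
grid = defaultdict(list)
points = {"P1": (3, 35), "P2": (5, 12), "P3": (18, 9)}
for name, (x, y) in points.items():
    grid[grid_cell(x, y)].append(name)

print(dict(grid))
```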


Indexing and Spatial Indexing

• Grid Structure

– This is most efficient for managing point
datasets

– It works by storing links to items in one grid
cell next to each other on the computer hard
disk (in the sub-area of the disk called a page)

– For convenience, we represent each page as a
row in a table in the following diagram

Indexing and Spatial Indexing

• Grid Structure

[Diagram: a 4×4 grid (rows A–D, columns A–D) containing polygons P1–P4 and line L1]

Grid Cell | Contents      Grid Cell | Contents
AA        | P1            CA        |
AB        |               CB        | L1
AC        | P2            CC        | L1, P4
AD        |               CD        | P4
BA        | P3            DA        | L1
BB        | P3            DB        | L1
BC        | P3            DC        | P4
BD        | P3            DD        | P4

Note: Each 'Grid Cell' row in the above table actually corresponds to a page on the hard disk of the computer, rather than to a table in a database

Indexing and Spatial Indexing

• Problems with the Grid Index (1)

– Taking only ONE grid cell into account does not
allow Polygon 3 to be included in the search
for the polygon closest to line L

– Therefore, expand the search to the grid cell
containing L and all the surrounding cells

– But this has performance implications!
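The expanded search can be sketched as returning the query cell plus its eight neighbours – nine cells to inspect instead of one, which is where the performance cost comes from:

```python
def search_cells(cell):
    """The cell containing the query geometry plus its 8 surrounding cells."""
    col, row = cell
    return [(col + dc, row + dr) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

print(len(search_cells((2, 2))))  # 9 cells must now be examined
```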

Indexing and Spatial Indexing

• Expanded Grid Cell Search

[Diagram: expanded grid cell search – the search is widened from the single cell containing L to that cell plus all of its surrounding cells on a finer grid]

Indexing and Spatial Indexing

• Problems with the Grid Index (2)

– What happens if all the data is in one corner of
the given area?

Indexing and Spatial Indexing

• Grid Indexes

[Diagram: a 4×4 grid in which all the data (P1–P4, L1) is clustered in the top-left corner]

Grid Cell | Contents        Grid Cell | Contents
AA        | P1, L1, P4      CA–CD     | (empty)
AB        | P2, L1, P4      DA–DD     | (empty)
AC        |
AD        |
BA        | P3, L1
BB        | P4
BC        |
BD        |

– The grid has many empty cells, and a few cells that are densely packed


Indexing and Spatial Indexing

• Problems with the Grid Index (3):

– Index size can grow quite quickly, leading to an increase in search time and reduction in performance

• Because lines and polygons overlap multiple grid squares, the index can grow to quite a large size – this is why it is most suited to point data

• If a point falls on the intersection of four grid cells, it is assigned to all four, also increasing the size of the index

• If the geometry is only distributed in a few cells, the other
cells are created but remain empty, increasing index size

– It may also be difficult to calculate the most
appropriate size of the grid. In general, the rule of
thumb is that the grid size should be approximately
equal to the most common query window size

Indexing and Spatial Indexing

• The Quadtree

– This index presents one solution to the distribution
problem when using the Grid Structure

– In this index type, the search space is decomposed
into quadrants rather than equal-sized cells

– The index is represented as a tree, with the root node
having four leaf-nodes, and each leaf-node in turn
having four further leaf-nodes as required by the data
(hence Quadtree)

Indexing and Spatial Indexing

• The Quadtree

– A line or a rectangle can appear in more than
one leaf node.

– Again, each tree node is mapped onto a sub-area of the hard disk called a page

– We will represent this as a tree structure

Indexing and Spatial Indexing

• The Quadtree

– Note that as this is a space-based index, the
depth of the tree varies depending on how
densely populated the area of the map is

– This may affect performance in densely
populated areas
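A minimal point-quadtree sketch (the capacity of 2 points per node and the coordinates are arbitrary assumptions), showing how a densely populated corner produces a deeper subtree than sparse areas:

```python
class QuadNode:
    """A node splits into four quadrants once it holds more than
    `capacity` points, so dense areas get a deeper tree."""
    def __init__(self, x, y, size, capacity=2):
        self.x, self.y, self.size = x, y, size   # top-left corner + side length
        self.capacity = capacity
        self.points = []
        self.children = None  # NW, NE, SW, SE once split

    def insert(self, px, py):
        if self.children is None:
            self.points.append((px, py))
            if len(self.points) > self.capacity:
                self._split()
            return
        self._child_for(px, py).insert(px, py)

    def _split(self):
        h = self.size / 2
        self.children = [QuadNode(self.x,     self.y,     h, self.capacity),  # NW
                         QuadNode(self.x + h, self.y,     h, self.capacity),  # NE
                         QuadNode(self.x,     self.y + h, h, self.capacity),  # SW
                         QuadNode(self.x + h, self.y + h, h, self.capacity)]  # SE
        old, self.points = self.points, []
        for px, py in old:               # push existing points down a level
            self._child_for(px, py).insert(px, py)

    def _child_for(self, px, py):
        h = self.size / 2
        col = 1 if px >= self.x + h else 0
        row = 1 if py >= self.y + h else 0
        return self.children[row * 2 + col]

root = QuadNode(0, 0, 100)
for p in [(10, 10), (12, 14), (11, 12), (80, 80)]:
    root.insert(*p)
print(root.children is not None)  # the dense corner forced a split
```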

Indexing and Spatial Indexing

• Quadtree Example

[Diagram: a 4×4 map (rows A–D, columns A–D) with polygons P1–P4 and line L1, and the corresponding quadtree – the root's four children hold {P3, L1}, {P1, L1}, {P2, L1} and {P4}]

It is important to realise that each node in the tree represents an AREA (Quadrant) on the map – therefore you can 'zoom in' to the area of interest by following the node structure

Assume that node order is:

• Top Left, Top Right, Bottom Left, Bottom Right (this will depend on the software being used)

Indexing and Spatial Indexing

• R-Tree

– This relies on a balanced hierarchical structure, in
which each tree node is mapped onto a disk page

– R-Trees organise rectangles according to a
containment relationship

– A rectangle (called the directory rectangle) is
associated with each node.

– This directory rectangle corresponds to the MBR of all
the child rectangles or elements of this node


Indexing and Spatial Indexing

• R-Tree Directory Rectangles

[Diagram taken from RSV: rectangles 1–12 grouped into directory rectangles a, b, c and d; the leaf nodes a–d contain rectangles {4,5,6,7}, {1,2,3,12}, {12,8,9,10,11} and {9,10,11} respectively]

Essentially, the R-Tree is based on the principle of looking at the data and taking into account what data is most likely to be queried at the same time

This then forms the basis of the index

Indexing and Spatial Indexing
• R-Tree

– Can handle data in multiple dimensions

– For all nodes in the tree except the root, the number of entries varies between half the maximum and the maximum number of entries allowed on the node (this depends on disk page size)

– For each entry in a non-leaf node, the entry is the
directory rectangle of the child node

– Each leaf entry contains the MBR of the object it
links to

– The root has at least two entries (unless it is a leaf)

– All leaves are at the same level
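The directory rectangle of a node can be sketched as the MBR (minimum bounding rectangle) of its children's rectangles (the rectangle coordinates below are invented for the example):

```python
def mbr(rects):
    """Directory rectangle of a node = MBR of the child rectangles.
    Each rectangle is (xmin, ymin, xmax, ymax)."""
    xmins, ymins, xmaxs, ymaxs = zip(*rects)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))

children = [(0, 0, 2, 3), (1, 1, 4, 2), (3, 0, 5, 5)]
print(mbr(children))  # (0, 0, 5, 5)
```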

Indexing and Spatial Indexing

• R-Tree
– The R-Tree adapts its gridding to the structure of the DATA rather than simply dividing up the search space

– A region of search space populated with a
large number of objects generates a large
number of tree leaves

– This allows the tree to have the same depth all
through, giving equal performance for densely
and non-densely populated areas

Indexing in PostGIS

• Creating indexes in PostgreSQL/PostGIS

– Non-Spatial Index

CREATE INDEX spatial_table_points_name_idx
ON spatial_table_points (name, surname);

– Spatial Index

CREATE INDEX spatial_table_points_gidx
ON spatial_table_points
USING GIST(the_geom);

– GIST stands for "Generalised Search Tree", which is a basic generic index that can be used for spatial and other data types. PostGIS then uses an R-Tree approach when implementing GIST on spatial datasets


Spatial Data Management

• Overview

– Indexing

– Normalisation and De-Normalisation

– Query Parsing

Normalisation

• Normalisation

– First rule of a database

• “One fact, one place”


Normalisation

• Normalisation

– Normalisation removes any redundant data
from the database and avoids data duplication

– It is a way of validating the quality of the
database design

– It is usually applied after the logical model has been developed

Normalisation

• Normalisation

– A properly normalised set of tables simplifies
the retrieval and maintenance processes

– In an ideal world, we would not need to
normalise the data, as the logical model would
be perfect

– But we need to go through the normalisation
process to ensure that this is the case!

Normalisation

• Redundancies and Anomalies

– A redundancy occurs when the same piece of data is duplicated in the database. This leads to excessive storage being used for the database.

– Anomalies in the database relate to problems with the following operations:

• Update

• Insert

• Delete

Normalisation

• Update Anomaly

– Data inconsistency or loss of data integrity can arise due to partial update (if data exists in two places, you could update only one instance)

• Insertion Anomaly

– Data cannot be added because some other data is
absent

• Deletion Anomaly

– Data may be unintentionally lost through deletion of
other data

Normalisation

• Normalisation

– Reduces tables into simpler structures

– Defined as a step-by-step process of transforming a non-normalised table into progressively simpler structures

– Since the process is reversible, no information is lost during the transformation

Decomposition

• Decomposition

– This is the process of splitting up tables into smaller tables, which happens as part of the normalisation process

– Should be:

• Lossless – no information should be lost or added through the normalisation process, and the process should be reversible.

• Preserve dependencies – the relationships between the different attributes and tables should not be lost.


Normal Forms

• First Normal Form

– For a table to be in First Normal Form (1NF)
the underlying domains must contain simple
atomic values

– This means that there should only be one value
in each field in the table

First Normal Form

• Moving to First Normal Form – An Example

Book ID | Book Title       | Alternative Title                     | Authors                                       | Series Title | Format | Font | Purchase Price
1       | Database Systems | Concepts, Languages and Architectures | P. Atzeni, S. Ceri, S. Paraboschi, R. Torlone |              |        |      | 39.50

(this table is not Normalised)

First Normal Form

• Anomalies

– Update Anomaly – you cannot modify an Author's name without having to re-insert values for the whole Authors field

– Insert Anomaly – you cannot add a new Author without having to re-insert values for the whole Authors field

– Delete Anomaly – you cannot delete one Author without deleting the others as well

First Normal Form

• Moving to First Normal Form

Book Title       | Alternative Title                     | Authors       | Publisher   | Format   | Font            | Series Title   | Cost Price
Database Systems | Concepts, Languages and Architectures | P. Atzeni     | McGraw Hill | Hardback | Times New Roman | Database Tutor | 20.00
Database Systems | Concepts, Languages and Architectures | S. Ceri       | McGraw Hill | Hardback | Times New Roman | Database Tutor | 20.00
Database Systems | Concepts, Languages and Architectures | S. Paraboschi | McGraw Hill | Hardback | Times New Roman | Database Tutor | 20.00
Database Systems | Concepts, Languages and Architectures | R. Torlone    | McGraw Hill | Hardback | Times New Roman | Database Tutor | 20.00

(the original, non-normalised table, for comparison:)

Book Title       | Alternative Title                     | Authors                                       | Publisher   | Series Title   | Format   | Font            | Cost Price
Database Systems | Concepts, Languages and Architectures | P. Atzeni, S. Ceri, S. Paraboschi, R. Torlone | McGraw Hill | Database Tutor | Hardback | Times New Roman | 20.00

First Normal Form

• First Normal Form

– The above table has been normalised so that
each field contains an atomic (single) value

– This has created issues with duplicates, which
are resolved by the next steps in the
Normalisation process.

Second Normal Form

• Moving to Second Normal Form – An Example
Book Title       | Alternative Title                     | Author        | Publisher   | Format    | Font            | Series Title       | Cost Price
Database Systems | Concepts, Languages and Architectures | P. Atzeni     | McGraw Hill | Hardback  | Times New Roman | Database Tutor     | 20.00
Database Systems | Concepts, Languages and Architectures | S. Ceri       | McGraw Hill | Hardback  | Times New Roman | Database Tutor     | 20.00
Database Systems | Concepts, Languages and Architectures | S. Paraboschi | McGraw Hill | Hardback  | Times New Roman | Database Tutor     | 20.00
Database Systems | Concepts, Languages and Architectures | R. Torlone    | McGraw Hill | Hardback  | Times New Roman | Database Tutor     | 20.00
Database Design  |                                       | M Jones       | Bachmann    | Paperback | Arial           | Design for Dummies | 15.00

(this table is in First Normal Form)


Second Normal Form

• Anomalies

– Update Anomaly – you cannot modify the Series Title without making sure that it is modified in four places

– Insert Anomaly – you cannot add a new Author without having to re-insert values for the Book Title, Alternative Title and other fields as well

– Delete Anomaly – you cannot delete the 'Database Design' book without losing information regarding the publisher of the 'Design for Dummies' series

Second Normal Form

• Second Normal Form

– So, 1NF still shows anomalies for the update,
insert and delete process

– Therefore further normalisation is required to
eliminate these

– Second Normal Form (2NF) goes some way to
addressing the problems shown by 1NF

• Second Normal Form – An Example

– Decompose the tables to eliminate these
problems

Second Normal Form

• Second Normal Form – An Example

Book Title       | Author
Database Systems | P. Atzeni
Database Systems | S. Ceri
Database Systems | S. Paraboschi
Database Systems | R. Torlone
Database Design  | M Jones

Book Title       | Alternative Title                     | Cost Price | Series Title       | Format    | Font            | Publisher
Database Systems | Concepts, Languages and Architectures | 20.00      | Database Tutor     | Hardback  | Times New Roman | McGraw Hill
Database Design  |                                       | 15.00      | Design for Dummies | Paperback | Arial           | Bachmann

Normalisation

• The Problem With Normalisation

– It introduces “joins” into the database

– Joins make a query run slower, especially
when there is a lot of data in the tables

– Indexes help, but may not be the whole
solution on very large databases

Normalisation

• De-Normalisation

– Is the reverse process of normalisation

– Tables are MERGED into one

– This results in duplicate data

– But … queries perform much faster as in a
sense the joins between different parts of the
data are “pre-calculated” so we don’t have to
use JOIN statements
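An illustrative SQLite sketch of de-normalisation (the table and column names are invented for the example): the join between books and authors is computed once into a wide table, duplicating the book title for every author so that later queries need no JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors (book_id INTEGER, author TEXT);
    INSERT INTO books VALUES (1, 'Database Systems');
    INSERT INTO authors VALUES (1, 'P. Atzeni'), (1, 'S. Ceri');

    -- De-normalised copy: the join is 'pre-calculated' into one wide
    -- table, at the cost of duplicating the title for every author.
    CREATE TABLE books_denorm AS
        SELECT b.title, a.author FROM books b JOIN authors a USING (book_id);
""")

rows = conn.execute("SELECT title, author FROM books_denorm").fetchall()
print(rows)
```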


Spatial Data Management

• Overview

– Indexing

– Normalisation and De-Normalisation

– Query Parsing

Query Parsing

• Examining Query Execution

Source: http://www.cubrid.org/blog/dev-platform/postgresql-at-a-glance/

Query Parsing

• Query Parsing – Step 1 – Syntax Parsing
– Validates that the SQL syntax is correct

• i.e. do the words ‘from’ and ‘select’ and so on appear as required.

– Returns an error to the user if the syntax is incorrect

Source: http://files.meetup.com/1990051/PostgreSQL%20Internals%20-%20Overview.pdf

Query Parsing

• Query Parsing – Step 2 – Semantic Parsing

– Validates that the SQL makes sense in the context of the database

• i.e. do the named tables, views and columns exist

– Returns an error to the user if the semantics are incorrect

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2588723819082

Query Parsing

• Query Parsing – Step 3 – Query Rewriter
– Modifies queries to take rules ("constraints") into consideration, i.e. transforms the SQL into a form more appropriate for downstream optimisation

• e.g. translate name = 'Jack' or name = 'John' or name = 'Joe' to name in ('Jack', 'John', 'Joe'), which is more efficient

• e.g. if there is a sub-query, check whether a join may be more efficient

• e.g. re-order the query so that the where clause (filter) that eliminates most records is run first, leaving fewer records to be joined

– Once finished, the modified query is passed to the query planner for planning and execution – as a 'parse tree'

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2588723819082

http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.perf.doc%2Fdoc%2Fc0005293.html
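The OR-to-IN rewrite can be checked to be result-preserving with a small SQLite example (the table name and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT)")
conn.executemany("INSERT INTO staff VALUES (?)",
                 [("Jack",), ("John",), ("Joe",), ("Jane",)])

# Original form: a chain of OR conditions.
ors = conn.execute(
    "SELECT name FROM staff WHERE name = 'Jack' OR name = 'John' OR name = 'Joe'"
).fetchall()

# Rewritten form: a single IN list.
ins = conn.execute(
    "SELECT name FROM staff WHERE name IN ('Jack', 'John', 'Joe')"
).fetchall()

print(ors == ins)  # the two forms return the same rows
```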

Query Parsing

• Query Parsing – Parse Tree

SELECT customer_name, balance
FROM customers
WHERE balance > 0
ORDER BY balance


Query Parsing

• Query Parsing – Step 4 – Query Optimization

– The planner is responsible for traversing the
parse tree and finding all possible plans for
executing the query.

• The plan might include a sequential scan through the
entire table and index scans if useful indexes have
been defined.

• If the query involves two or more tables, the planner
can suggest a number of different methods for joining
the tables.

Query Parsing

• Query Parsing – Step 5 – Query Execution
– The execution plans are developed in terms of query operators.

– Each query operator transforms one or more input sets into an
intermediate result set.

– Complex queries are broken down into simple steps.

– When all possible execution plans have been generated, the
optimizer searches for the least-expensive plan.

Query Parsing

• Query Parsing – Step 5 – Operators (1)

– Seq Scan (sequential scan).

• Starts at the beginning of the table and scans to the end of the table.

• Checks each row against any ‘where’ clause and
adds the row to the result if it passes

Query Parsing

• Query Parsing – Step 5 – Operators (2)

– Index Scan

• Traverses an index structure if there is a ‘where’
clause and an appropriate index exists

• Allows the query to quickly skip any rows not
meeting the criteria

• Unlike the Seq Scan, returns the data pre-ordered
(as the index is ordered)

Query Parsing

• Query Parsing – Step 5 – Operators (3)

– Sort

• Orders the result set returned by either the Seq
Scan or Index operators

– Data can be sorted in memory or on-disk (where
temporary results are stored on disk if the system memory
is not large enough for the sort) – the latter is slower.

Query Parsing

• Query Parsing – Step 5 – Operators (4)

– The Unique operator eliminates duplicate
values from the input set.

– The LIMIT operator is used to limit the size of
a result set.


Query Parsing

• Query Parsing – Step 5 – Query Execution
– Taking the required operators and data into account, each plan is assigned an estimated execution cost.

– Cost estimates are measured in units of disk I/O.

• An operator that reads a single block of 8,192 bytes (8K) from the disk has a cost of one unit.

• CPU time is also measured in disk I/O units, but usually as a fraction.

• For example, the amount of CPU time required to process a single row is assumed to be 1/100th of a single disk I/O.
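A toy version of this cost model in Python (the page and row counts are invented figures, and real planners use richer statistics), showing why the index plan wins:

```python
# Toy cost model in the spirit of the slide: 1 unit per 8K page read,
# plus an assumed CPU cost of 1/100 of a unit per row processed.
def plan_cost(pages_read, rows_processed, cpu_per_row=0.01):
    return pages_read + rows_processed * cpu_per_row

# Sequential scan: read the whole (hypothetical) table.
seq_scan_cost = plan_cost(pages_read=1000, rows_processed=100_000)

# Index scan: read a few index pages plus only the matching rows.
index_scan_cost = plan_cost(pages_read=15, rows_processed=200)

print(seq_scan_cost, index_scan_cost)  # the optimizer picks the cheaper plan
```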

Query Parsing

• Query Parsing – Step 5 – Query Execution

– After choosing the least-expensive execution
plan, the query executor starts at the beginning
of the plan and asks the topmost operator to
produce a result set.

• This in turn calls the next operator, which calls the next, until the bottommost operator generates results which are passed back up the tree.

Query Parsing

• Query Parsing In Practice

• The EXPLAIN statement gives you some
insight into how the PostgreSQL query
planner/optimizer decides to execute a
query.

– The EXPLAIN statement can be used to analyze
SELECT, INSERT, DELETE, UPDATE commands

Spatial Data Management

• Overview

– Indexing

– Normalisation and De-Normalisation

– Query Parsing