Design Theory and Normalization (part 2)
CSC343
Mark Kazakevich
(with slides from Diane Horton, Jeff Ullman)
Last time
● We learned about a set of constraints that we can put on a relation called functional dependencies
○ We learned about how to test if an FD follows from a set of FDs using the closure test
○ Learned how to project FDs onto a subset of attributes
○ Learned how to find a minimal basis, or a set of equivalent non-redundant FDs
Going forward..
● How do we use these tools to make better schemas?
Anomalies
● When we try to represent too much information in a relation, we run into issues known as anomalies
● Three main types: ○ Redundancy
■ Unnecessary repetition of information ○ Update anomalies
■ Updating a tuple creates inconsistent data ○ Deletion anomalies
■ Removing a tuple results in unwanted loss of information
G-M, Ullman, Widom
BadTables ● What is bad about this table?
● It has all the anomalies
G-M, Ullman, Widom
● Redundancy
○ Lots of duplicate information for the Star Wars tuples
● Update
○ Changing the genre for Star Wars in one tuple requires
updating all Star Wars tuples ● Deletion
○ Removing Vivien Leigh as a star can remove ‘Gone With the Wind’ entirely
Recall: Database ‘Design’ Theory
● General idea:
○ Express constraints on the relationships between
attributes (did this last time)
○ Use these constraints to decompose the relations
■ Create smaller,‘better’ relations
Decomposing a relation
R1 = π B (R) Here we decompose into two relations,
R2 = π C (R)
but often we decompose into more.
But which decomposition?
● Decomposition can definitely improve a schema.
● But which decomposition?
○ There are many possibilities.
● And how can we be sure a new schema doesn’t exhibit other anomalies?
Normalization
● We will decompose our relations into a normal form that guarantees some good properties.
● The first one we’ll look at is called BCNF.
Boyce-Codd Normal Form (BCNF)
We say a relation R is in BCNF if:
● For every nontrivial FD X->Y that
holds in R, X is a superkey.
Remember:
○ nontrivial means Y is not contained in X. ○ a superkey doesn’t have to be minimal.
■ i.e., the left side of every nontrivial FD must contain a key.
Intuition
We say a relation R is in BCNF if:
● For every nontrivial FD X->Y that
holds in R, X is a superkey.
● In other words, BCNF requires that:
○ Only things that functionally determine
everything
can functionally determine anything.
BCNF Worksheet Q 1-2
We say a relation R is in BCNF if:
● For every nontrivial FD X->Y that
holds in R, X is a superkey.
Redundant values from non-superkey FDs
● Why is the BCNF property valuable?
● If we have an FD X -> Y where X is not a superkey, there can be redundant data.
● If every attribute depends on the key, there will be no redundancy
BCNF Worksheet Q3 part a
BCNF Decomposition
● R is a relation;F is a set of FDs.
● Return the BCNF decomposition of R, given these FDs.
BCNF_decomp(R, F):
If an FD X -> Y in F violates BCNF: Compute X+.
Replace R by two relations with schemas: R1 = X+
R2 = R – (X+ – X)
Project the FD’s F onto R1 and R2. Recursively decompose R1 and R2 into BCNF.
BCNF Worksheet Q3 parts b to e
Notes on decomposition
● If more than one FD violates BCNF, you may decompose based on any one of them.
○ So there may be multiple decompositions possible.
● The new relations we create may not be in BCNF – we must recurse.
○ We only keep the relations at the “leaves”.
Speed-up
● When projecting FDs onto a new relation, check each new FD:
○ Does the new relation violate BCNF because of this FD?
● If so, stop the projection.
○ You are about to discard this relation
anyway (and decompose further).
Decomposition Properties
What we want from a decomposition:
1. No anomalies.
2. Lossless Join: It should be possible to…
○ project the original relations onto the decomposed schema
○ then reconstruct the original by joining.
We should get back exactly the original tuples.
3. Dependency Preservation:
All the original FD’s should be satisfied.
What BCNF decomposition guarantees
1. No anomalies : ✓ (Due to no redundancy)
2. Lossless Join : ✓
(textbook section 3.4.1 argues this)
3. Dependency Preservation : ✗
What is a ‘lossy’ join?
● For any decomposition, it is the case that:
○R⊆r1 ⋈r2… ⋈rn ○ i.e., we will get back every tuple.
● But it may not be the case that:
○R⊇r1 ⋈r2… ⋈rn ○ i.e., we can get spurious tuples.
Properties Worksheet Q1
The BCNF property does not guarantee lossless join
● If you use the BCNF decomposition algorithm, a lossless join is guaranteed.
● If you generate a decomposition some other way ○ you have to check to make sure you have a
lossless join…
○ even if your schema satisfies BCNF! ● We’ll learn an algorithm for this check later.
Preservation of dependencies
● BCNF decomposition does not guarantee preservation of dependencies.
● i.e., in the schema that results, it may be possible to create an instance that:
○ satisfies all the FDs in the final schema,
○ but violates one of the original FDs.
● Why? Because the algorithm goes too far — breaks relations down too much.
Properties Worksheet Q2 part a
3NF is less strict than BCNF
● 3rd Normal Form (3NF) modifies the BCNF condition to be less strict.
● An attribute is prime if it is a member of any key.
X -> A satisfies 3NF iff
X is a superkey or A is prime.
● i.e., it’s ok if X is not a superkey as long as A is prime.
3NF ‘synthesis’
● BCNF decomposition
○ starts from a large relation and gets schema
by decomposing into smaller ones ● 3NF synthesis
○ Builds up schema by constructing relations from FDs
3NF Synthesis
● F is a set of FDs;L is a set of attributes.
● Synthesize and return a schema in 3rd Normal Form
3NF_synthesis(F, L):
Construct a minimal basis M for F. For each FD X->Y in M:
Define a new relation with
schema X ∪ Y.
If no relation is a superkey for L:
Add a relation whose schema is
some key.
3NF synthesis doesn’t “go too far”
● BCNF decomposition doesn’t stop decomposing until in all relations:
○ ifX -> AthenXis a superkey.
● 3NF generates relations where:
○ X -> Aand yetXis not a superkey,
but A is at least prime.
How can we get anomalies with 3NF?
● 3NF synthesis guarantees that the resulting schema will be in 3rd normal form.
● This allows FDs with a non-superkey on the LHS.
● This allows redundancy, and thus anomalies.
How do we know…?
… that the 3NF synthesis algorithm guarantees: ● 3NF:A property of minimal basis [see the
textbook for more]
● Preservation of dependencies: Each FD from a
minimal basis is contained in a relation, thus
preserved.
● Lossless join:We’ll return to this once we know
how to test for lossless join.
3NF synthesis doesn’t “go too far”
1. No anomalies : ✗
2. Lossless Join : ✓
3. Dependency Preservation : ✓
● 3NF doesn’t decompose too far; preserves original FDs.
● Neither BCNF nor 3NF can guarantee all three requirements! We must be satisfied with 2 of 3.
BCNF vs 3NF
● BCNF: Decompose too far ⇒ can’t always enforce original FDs (but sometimes it works out!)
● 3NF: Not far enough ⇒ can have redundancy
● We consider a schema “good” if it is in either BCNF or 3NF
● In practise, if we can get a BCNF decomposition that happens to preserve dependencies, we will use it.
○ Otherwise failing that, use 3NF and deal with some redundancy.
Properties Worksheet Q2 parts b, c
Shifting gears for a moment:
We’ve talked a lot about keys and superkeys… Let’s look at ways to find all keys given a set of FDs
Keys Worksheet