Detecting Functional Dependencies
21.5.2013 Felix Naumann
2
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013
3
Definition – Functional Dependency
■ „X → A“ is a statement about a relation R: When two tuples have same value in attribute set X, the must have same values in attribute A.
■ Formally:X→AisanFDoverR (R⊨X→A)⇔ for all tuples t1,t2∈R: t1[X]=t2[X] t1[A] = t2[A]
■ Cangeneralizetosets:X→Y⇔X→AforeachA∈Y yy
Felix Naumann | Profiling & Cleansing | Summer 2013
f(x)
?
xx
4
Trivial FDs
■ Trivial: Attributes on RHS are subset of attributes on LHS □ Street, City → City
□ Any trivial FD holds
■ Non-trivial: At least one attribute on RHS does not appear on LHS □ Street, City → Zip, City
■ Completely non-trivial: Attributes on LHS and RHS are disjoint. □ Street, City → Zip
■ Minimal FD: RHS does not depend on any subset of LHS.
■ Typical goal: Given a relation R, find all minimal completely non-
trivial functional dependencies.
Felix Naumann | Profiling & Cleansing | Summer 2013
5
FD Inference Rules
■ R1 Reflexivity X ⊇ Y X→Y (also X→X) □ Trivial FDs
■ R2 Accumulation {X→Y} XZ→YZ □ Aka: Augmentation
■ R3 Transitivity {X→Y, Y →Z} X→Z ■ R1-R3 known as Armstrong-Axioms
□ Sound and complete
■ R4 Decomposition {X→YZ} X→Y
■ R5 Union {X→Y, X→Z} X→YZ
■ R6 Pseudotransitivity {X→Y,WY →Z} WX→Z
Felix Naumann | Profiling & Cleansing | Summer 2013
6
FD Discussion
■ Schema vs. instance
■ Keys as special case for FDs
□ XiskeyofRifX→R\X
■ Uses for FDs
□ Schema design and normalization
□ Key discovery
□ Data cleansing (especially conditional FDs)
Felix Naumann | Profiling & Cleansing | Summer 2013
7
Naive Discovery Approach
■ Given relation R, detect all minimal, non-trivial FDs X → A.
■ For each column combination X
□ For each pair of tuples (t1,t2)
◊ If t1[X\A] = t2[X\A] and t1[A] ≠ t2[A]: Break
■ Complexity
□ Exponential in number of attributes □ times number of rows squared
Felix Naumann | Profiling & Cleansing | Summer 2013
8
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013
9
Tane – General Idea
■ Two elements of approach
1. Reduce column combinations through pruning
◊ Reasoning over FDs
2. Reduce tuple sets through partitioning
◊ Partition data according to attribute values ◊ Level-wise increase of size of attribute set
● Consider sets of tuples whose values agree on that set
■ Huhtala, Y.; Kärkkäinen, J.; Porkka, P. & Toivonen, H.
TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies
Computer Journal, 1999, 42, 100-111
Felix Naumann | Profiling & Cleansing | Summer 2013
10
Discovery strategy
■ Bottom up traversal through lattice
□ only minimal dependencies
□ Pruning
□ Re-use results from previous level
■ ForasetX,testallX\A→A,A∈X
□ only non-trivial dependencies □ Test on efficient data structure
ABCD
ABC ABD ACD BCD
Felix Naumann | Profiling & Cleansing | Summer 2013
AB AD AC BC BD CD
ABCD
11
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013
12
Candidate Sets
■ RHS candidate set C(X)
■ Stores only those attributes that might depend on all other
attributes of X.
□ I.e., those that still need to be checked
□ If A∈C(X) then A does not depend on any proper subset of X.
■ C(X)=R\{A∈X|X\A→Aholds}
■ Example:R={ABCD},andA→CandCD→B □ C(A) = {ABCD}\{} = C(B) = C(C) = C(D) □ C(AB) = {ABCD}\{}
□ C(AC) = {ABCD}\{C} = {ABD}
□ C(CD) = {ABCD}\{}
□ C(BCD) = {ABCD}\{B} = {ACD} Felix Naumann | Profiling & Cleansing | Summer 2013
13
RHS candidate pruning
■ For minimality it suffices to test X\A → A where □ A∈X and A∈C(X \{B}) for all B∈X.
□ I.e., A is in all candidate sets of the subsets.
■ Example
□ X = {ABC}. Assume we know C → A from previous step.
□ Need to test three dependencies: AB→C, AC→B, and BC→A
◊ We should not be testing BC→A, because we know C→A □ Candidate sets:
◊ C(AB) = {ABC}, C(AC)={BC}, C(BC)={ABC}
□ E.g. BC→A does not need to be tested for minimality, because
A is not in all three candidate sets: A∉C(AB)∩C(AC)∩C(BC)
□ AB→C, AC→B need to be tested, because B and C appear in all
candidate sets.
Felix Naumann | Profiling & Cleansing | Summer 2013
15
Improved RHS candidate pruning
■ Basis:LetB∈XandletX\B→Bhold.IfX→A,thenX\B→A.
□ Example:A→Bholds.IfAB→Cholds,thenalsoA→C.
□ Use this to reduce candidate set: If X\B → B for some B, then any dependency with X on LHS cannot be minimal.
◊ Just remove B.
■ C+(X) = {A∈R | ∀B∈X: X\{A,B} → B does not hold} □ Special case: A = B corresponds to C(X)
□ C(X)=R\{A∈X|X\A→Aholds}
■ This definition removes three types of candidates.
□ C1 = {A∈X | X\A → A holds} (as before)
□ C2={R\X} if ∃B∈X:X\B→B
□ C3 = {A∈X | ∃B∈X\A : X\{A,B} → B holds} Felix Naumann | Profiling & Cleansing | Summer 2013
16
Example for C2
■ C+(X) = {A∈R | ∀B∈X: X\{A,B} → B does not hold} ■ C2={R\X}if ∃B∈X:X\B→B
■ R=ABCD,X=ABC ■ C(X) = ABCD initially ■ Discovery of C→B
□ Remove B from C(X)
□ Additionally remove R\X = D
□ Ok, because remaining combination of LHS contains B and C.
◊ ABC→D is not minimal because C→B ■ Together: C+(X) = {AC}
Felix Naumann | Profiling & Cleansing | Summer 2013
17
C3
■ C3 = {A∈X | ∃B∈X\A : X\{A,B} → B holds} ■ Same idea as before, but for subsets
■ Assume X has proper subset Y (X⊃Y) such that Y\B → B holds for some B∈Y.
■ Then we can remove from C(X) all A∈X\Y.
■ Example X = ABCD and let C→B
■ X⊃Y=BCandX\Y=AD
■ Thus can remove all AD.
□ Any remaining combination of LHS contains B and C. ◊ ABC →D and BCD→A
□ Again, since C→B any such FD is not minimal.
■ Together: C+(X) = {C}
Felix Naumann | Profiling & Cleansing | Summer 2013
18
More pruning of lattice: Key pruning
■ Insight: If X is superkey and X\B →B, then X\B is also a superkey.
■ Case1:IfXissuperkey,noneedtotestanyX→A. ■ Case 2:
□ If X is superkey and not key, any X → A is not minimal (for any A∉X).
□ If A∈X and X\A→A then X\A is superkey, and no need to test. ■ Summary: Can prune all keys and their supersets
■ Later: Test for superkey-property based on “key-error” of partition
Felix Naumann | Profiling & Cleansing | Summer 2013
19
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013
20
TANE Base Algorithm
■ Each level L contains the corresponding nodes of the lattice
Felix Naumann | Profiling & Cleansing | Summer 2013
21
■ Ll+1={X | |X| = l+1 and all subsets Y⊂X of size l are in Ll} □ General apriori idea
□ Can use Ll to generate Ll+1
Generating Lattice Levels
■ Prefix blocks: disjoint sets from Ll with common prefix of size l-1 □ Allpairsforl=1
■ Line 5: All subsets of new set must appear in lower level Felix Naumann | Profiling & Cleansing | Summer 2013
Trivial for L1: Nothing happens
22
Dependency Computation
■ Line 2: Create candidate sets; each attribute must appear in all candidate sets of smaller size
■ Line 4: Only test attributes from candidate set
■ Line 5: Actual test on data
■ Line 7: Reduce candidates by newly found dependent
■ Line 8: Reduce candidates by all other attributes: cannot depend on all others, because any combination involving A and LHS is not minimal
Felix Naumann | Profiling & Cleansing | Summer 2013
23
Pruning
■ Line 3: Basic pruning
□ Deletion from Ll ensures that supersets cannot be created during level generation (loops not executed on empty candidate sets)
■ Lines 4-8: Key pruning
Felix Naumann | Profiling & Cleansing | Summer 2013
24
Example run
■ R = ABCD, C→B and AB→D are to be discovered □ Also: AC→D through pseudo-transitivity
■ L0 = {},
□ C+({}) = ABCD □ nothing to do
■ L1 = {A}, {B}, {C}, {D}
□ C+(X) = ABCD for all X∈ L1
□ Still nothing to do: No FDs can be generated from singletons □ Thus, no pruning
■ L2 =AB,AC,AD,BC,BD,CD
□ E.g. C +(AB) = C+(AB\A) ∩ C+(AB\B) = ABCD ∩ ABCD □ C+(X) = ABCD for all X∈ L2
□ Dep. checks for AB: A→B and B→A Nothing happens Felix Naumann | Profiling & Cleansing | Summer 2013
25
Example run
■ L2 =AB,AC,AD,BC,BD,CD
□ C+(X) = ABCD for all X∈ L2
□ Dep. checks for BC: B→C (no!) and C→B (yes!) □ Output C→B
□ Delete B from C+(BC): ACD
□ Delete R\BC from C+(BC): C
◊ Note BC→A and BC→D are not minimal ■ L3 = ABC, ABD, ACD, BCD
□ C+(ABC) = C+(AB) ∩ C+(AC) ∩ C+(BC) = C
□ C+(BCD) = C+(BC) ∩ C+(BD) ∩ C+(CD) = C
□ C+(ABD) = C+(ACD) = ABCD unchanged
□ Dep. check for ABC: ABC ∩ C+(ABC) are candidates
◊ AB→C no! Did not check BC→A and AC→B Felix Naumann | Profiling & Cleansing | Summer 2013
26
Example run
■ L3 = ABC, ABD, ACD, BCD
□ C+(ABC) = C+(BCD) = C
□ C+(ABD) = C+(ACD) = ABCD
□ Dep. check for ABD: ABD ∩ C+(ABD) are candidates
◊ AD→B and BD→A: no!
◊ AB→D: yes! Output AB→D
◊ Delete D from C+(ABD): ABC
◊ Delete R\ABD from C+(ABD): AB
□ Dep. check for BCD: BCD ∩ C+(BCD) are candidates ◊ Only need to check BD→C: no!
□ Dep. check for ACD: ACD ∩ C+(ACD) are candidates ◊ CD→A and AD→C: no!
◊ AC→D: yes! Output AC→D
◊ Delete D from C+(ABD): ABC
◊ Delete R\ACD from C+(ABD): AC Felix Naumann | Profiling & Cleansing | Summer 2013
27
Example run
■ L4 = ABCD
■ C+(ABCD) = C+(ABC) ∩ C+(ABD) ∩ C+(ACD) ∩ C+(BCD) = {} □ Nothing to check
□ Did not need to check
◊ BCD→A: Not minimal because C → B ◊ ACD→B: Not minimal because C → B ◊ ABD→C: Not minimal because AB → D ◊ ABC→D: Not minimal because AB → D
Felix Naumann | Profiling & Cleansing | Summer 2013
28
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013
29
Notation and Partitions
■ Tuples t and u are equivalent wrt. attribute set X if t[A] = u[A] for all A∈X.
■ Attribute set X partitions R into equivalence classes □ Equivalence class of tuple t:
[t]X = {u∈R | ∀A∈X : t[A] = u[A]}
□ Partition is set of disjoint sets: πX = {[t]X | t∈R}
◊ Each set has unique values for X-values. □ |π| is number of equivalence classes in π.
Felix Naumann | Profiling & Cleansing | Summer 2013
30
TupleID A B C
D
Partitioning – Example
11a 21A
3 2 A 42A 52b Lily
63b 73C 8 3 C
$ Orchid Flower
■ [1]A= [2]A= {1,2}
■ πA = {{1,2}, {3,4,5}, {6,7,8}}
■ πBC = {{1}, {2}, {3,4}, {5}, {6}, {7}, {8}} ■ πD = {{147}, {2}, {3}, {5}, {6}, {8}}
Felix Naumann | Profiling & Cleansing | Summer 2013
$ Flower Tulip
$ Daffodil
$ Flower
# Rose
31
Partition refinement
■ Partition π refines partition π‘ if every equivalence class in π is a subset of some equivalence class in π‘.
□ π has a finer partitioning than π‘.
■X→A⇔πXrefinesπA 11A
□ If πX refines πA then πX⋃A = πA.
□ πX⋃A always refines πA.
□ if πX⋃A ≠ πA then |πX| ≠ |πX⋃A|. □ if|πX|=|πX⋃A|thenπX⋃A=πA.
21A 32A 42A 52A 63B 73B 83B
■ Together: X → A ⇔ πX refines πA ⇔ |πX| = |πX⋃A| □ This implies a simple check for an FD.
■ πA = {12, 345, 678}
■ πB = {12345, 678}
■ πAB = {12, 345, 678}
Felix Naumann | Profiling & Cleansing | Summer 2013
AB
32
■ Idea: Remove equivalence classes of size 1 from partitions. □ SingletonequivalenceclasscannotviolateanyFD.
□ Sameideaasforpositionlists
Stripped partitions
■ Problem
□ X→A⇔|πX|=|πX⋃A|nottrueforstrippedpartitionsπ’ 11Aa
□ :|πC|=|πC⋃A|=6and|π’C|=|π’C⋃A|=2
21Ab 32Ac 42Ac
□ ⇐:|π’B|=|π’B⋃C|=2butB↛C ■ Solution: Key error
□ e(X)isminimumfractionoftuplestoremoveforXtobekey
□ e(X)=1–|πX|/r 52Ad
◊ e(B)=1–|πB|/r=1–3/8=5/8 □ e(X)=(||π’X||–|π’X|)/r
63Be 73Be 83Df
◊ ||π’X|| = sum of sizes of equivalence classes in π’
◊ e(B) = ((5+2) – 2)/r = 5/8 □ X→A ⇔ e(X)=e(X∪A)
Felix Naumann | Profiling & Cleansing | Summer 2013
ABC
33
Computing partitions
■ Compute partition πA for each A∈R □ Directly from database
□ Only store tuples ID (Integers)
■ Product π1 · π2: Least refined partition that refines both π1 and π2 □ πX · πY= πX∪Y
□ Partitions πX computed as product of two partitions of size |X|–1.
□ Algorithm on next slide
■ Dependency checking: X\A → A
□ Calculate e(X) = (||π’X|| – |π’X|)/r and e(X\A) = … □ Check e(X) = e(X\A)
■ Also key pruning: X is key if e(X) = 0. Felix Naumann | Profiling & Cleansing | Summer 2013
34
Partition Product
Felix Naumann | Profiling & Cleansing | Summer 2013
ABC
1047 2157 3158 4269 5269 6347
π’A⋃B={23, 45} π’B⋃C={16, 45}
35
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013
36
Making TANE approximate
■ Definition based on minimum number of tuples to be removed from R for X→A to hold in R.
■ Discovery problem:
□ Given relation R and threshold ε, find all minimal non-trivial
FDsX→A suchthate(X→A)≤ ε.
1. Define error: Fraction of tuples causing FD violation
□ Errore(X→A)=min{|S||S⊆R, R\S⊨X→A}/|R|
2. Specify error threshold ε
3. Modify dependency checking algorithm
□ Efficient algorithm to compute error □ Bounds to avoid error calculation
Felix Naumann | Profiling & Cleansing | Summer 2013
Exact version Modifications
37
Approximate Dependency Checking
Felix Naumann | Profiling & Cleansing | Summer 2013
38
■ Errore(X→A)=min{|S||S⊆R, R\S⊨X→A}/|R|
■ Any equivalence class c ∈ πX is the union of one or more
Computing error
equivalence classes c1‘,c2‘, …∈πX∪A
■ Foreachc∈ πXthetuplesinallbutoneoftheci‘smustbe
removed for X→A to hold.
■ Minimum number to remove: Size of c minus size of largest ci’.
■ → = 1−∑∈ ∈∪∧⊆
AB 11a
■ Example: B→A
□ πA = {12, 345, 678}
□ πB = {1, 234, 56, 78} 32A
□ πAB ={1,2,34,5,6,78} 42A
□ |πB| ≠ |πBA|
52b 63b 73C 83C
□ e(B→A) = 8/8 – (1+2+1+2)/8 = 2/8
■ Also possible on stripped partitions – not here.
Felix Naumann | Profiling & Cleansing | Summer 2013
21A
40
Bounding on error
■ Computing error is in O(|R|)
■ e(X)–e(X∪A) ≤ e(X→A) ≤ e(X) ■ I.e., do not calculate FD error if
□ e(X)–e(X∪A)>ε □ e(X)<ε
AB 11a 21A 32A ■ e(B→A) = 8/8 – (1+2+1+2)/8 = 2/8 42A 52b 63b 73C 83C
■ e(B) = 4/8
■ e(B∪A)=2/8
Felix Naumann | Profiling & Cleansing | Summer 2013
41
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013
42
FD_Mine: Refinement of TANE
■ If X → Y and Y → X hold, then X and Y are equivalent candidates,
denoted as X↔Y.
■ Make use of additional FD properties
ABCDE
□ X↔Y and XW → Z YW → Z
100010 201010 302012 403110 541124 643122 700110
□ X↔Y and WZ → X WZ → Y ■Example
□ A→D and D→A A↔D
□ Examination: AB → C and BC → A □ Inferred:
◊ BD → C (property 1)
◊ BC → D (property 2)
□ D can be removed from table
Pairs of UCCs
■ Hong Yao, Howard J. Hamilton, Cory J. Butz: FD_Mine: Discovering Functional
Dependencies in a Database Using Equivalences. ICDM 2002: 729-732 Felix Naumann | Profiling & Cleansing | Summer 2013
43
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013
44
Conditional Functional Dependencies
■ Idea similar to CINDs
■ Embedded FD plus pattern tableau ■ Definition CFD
□ Pair (X→A, Tp)
◊ X→A is embedded FD
◊ Tp is pattern tableau, made up of pattern tuples tp
□ Pattern tuple with attributes B∈X∪A where tp[B] ◊ Constant in dom(B)
◊ Unnamed variable “_” for values in dom(B)
□ Special case: Tp[B]=“_” for all B is equivalent to normal FD
Felix Naumann | Profiling & Cleansing | Summer 2013
45
Semantics of CFDs
■ a≈b(amatchesb)if
□ eitheraorbis“_”
□ bothaandbareconstantsanda=b
■ DB satisfies (R: X → Y, Tp) iff
□ For any tuple tp in the pattern tableau Tp and for any tuples t1,
t2 in DB:
◊ If t1[X] = t2[X] ≈ tp[X], then t1[Y] = t2[Y] ≈ tp[Y]
□ tp[X]: identifying the set of tuples on which the constraint tp applies: { t | t[X] ≈ tp[X]}
□ t1[Y] = t2[Y] ≈ tp[Y]: enforcing the embedded FD, and the pattern of tp
Felix Naumann | Profiling & Cleansing | Summer 2013
Slide from Wenfei Fan
46
id country
area-code phone
street
city zip
Example: Violation of CFDs
t1 44
t2 44
131 1234567 131 3456789
Mayfield Crichton
NYC EH4 8LE NYC EH8 8LE
t3 01
t4 01
908 3456789 908 9876543
Mountain Ave Mainstreet
NYC 07974 NYC 07974
■ cust([country, zip] → [street], Tp) ■ Tuples t1 and t2 violate the CFD
country
zip street
44 _ _ □ t1[country, zip] = t2[country, zip] ≈ tp[country, zip]
□ But t1[street] ≠ t2[street]
■ The CFD applies to t1 and t2 since they match tp[country, zip] ■ Tuples t3 and t4 do not violate the CFD
□ CFD does not apply to t3 and t4 Slide from Wenfei Fan Felix Naumann | Profiling & Cleansing | Summer 2013
47
id country area-code
phone
street city
zip
Example: Violation of CFD by single tuple
t1 44 131
t2 44 131
1234567
3456789
Mayfield NYC Crichton NYC
EH4 8LE EH8 8LE
t3 01 908
t4 01 908
3456789
9876543
Mountain Ave NYC
■ cust([country, zip] → [street], Tp)
■ Tuple t1 does not satisfy the CFD.
country zip street
■ t1[country, area-code] = t1[country, area-code] ≈ tp1[country, area-code]
■ t1[city] = t1[city]; however, t1[city] does not match tp1[city]
■ In contrast to traditional FDs, a single tuple may violate a CFD.
Felix Naumann | Profiling & Cleansing | Summer 2013
Slide from Wenfei Fan
Mainstreet
07974 NYC 07974
tp1 44 131 Edi tp2 01 908 MH tp3 _ _ _
48
Further literature
■ DepMiner
□ Efficient Discovery of Functional Dependencies and Armstrong Relations. Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. EDBT 2000
■ CORDS
□ Ihab F. Ilyas, Volker Markl, Peter J. Haas, Paul Brown, Ashraf Aboulnaga: CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. SIGMOD Conference 2004: 647-658
■ FastFDs
□ Catharine M. Wyss, Chris Giannella, Edward L. Robertson: FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. DaWaK 2001: 101-110
■ CFDs
□ Loreto Bravo, Wenfei Fan, Shuai Ma: Extending Dependencies
with Conditions. VLDB 2007: 243-254 Felix Naumann | Profiling & Cleansing | Summer 2013