CS计算机代考程序代写 data structure Functional Dependencies algorithm flex database Detecting Functional Dependencies

Detecting Functional Dependencies
21.5.2013 Felix Naumann

2
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013

3
Definition – Functional Dependency
■ „X → A“ is a statement about a relation R: When two tuples have same value in attribute set X, the must have same values in attribute A.
■ Formally:X→AisanFDoverR (R⊨X→A)⇔ for all tuples t1,t2∈R: t1[X]=t2[X]  t1[A] = t2[A]
■ Cangeneralizetosets:X→Y⇔X→AforeachA∈Y yy
Felix Naumann | Profiling & Cleansing | Summer 2013
f(x)
?
xx

4
Trivial FDs
■ Trivial: Attributes on RHS are subset of attributes on LHS □ Street, City → City
□ Any trivial FD holds
■ Non-trivial: At least one attribute on RHS does not appear on LHS □ Street, City → Zip, City
■ Completely non-trivial: Attributes on LHS and RHS are disjoint. □ Street, City → Zip
■ Minimal FD: RHS does not depend on any subset of LHS.
■ Typical goal: Given a relation R, find all minimal completely non-
trivial functional dependencies.
Felix Naumann | Profiling & Cleansing | Summer 2013

5
FD Inference Rules
■ R1 Reflexivity X ⊇ Y  X→Y (also X→X) □ Trivial FDs
■ R2 Accumulation {X→Y}  XZ→YZ □ Aka: Augmentation
■ R3 Transitivity {X→Y, Y →Z}  X→Z ■ R1-R3 known as Armstrong-Axioms
□ Sound and complete
■ R4 Decomposition {X→YZ}  X→Y
■ R5 Union {X→Y, X→Z}  X→YZ
■ R6 Pseudotransitivity {X→Y,WY →Z}  WX→Z
Felix Naumann | Profiling & Cleansing | Summer 2013

6
FD Discussion
■ Schema vs. instance
■ Keys as special case for FDs
□ XiskeyofRifX→R\X
■ Uses for FDs
□ Schema design and normalization
□ Key discovery
□ Data cleansing (especially conditional FDs)
Felix Naumann | Profiling & Cleansing | Summer 2013

7
Naive Discovery Approach
■ Given relation R, detect all minimal, non-trivial FDs X → A.
■ For each column combination X
□ For each pair of tuples (t1,t2)
◊ If t1[X\A] = t2[X\A] and t1[A] ≠ t2[A]: Break
■ Complexity
□ Exponential in number of attributes □ times number of rows squared
Felix Naumann | Profiling & Cleansing | Summer 2013

8
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013

9
Tane – General Idea
■ Two elements of approach
1. Reduce column combinations through pruning
◊ Reasoning over FDs
2. Reduce tuple sets through partitioning
◊ Partition data according to attribute values ◊ Level-wise increase of size of attribute set
● Consider sets of tuples whose values agree on that set
■ Huhtala, Y.; Kärkkäinen, J.; Porkka, P. & Toivonen, H.
TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies
Computer Journal, 1999, 42, 100-111
Felix Naumann | Profiling & Cleansing | Summer 2013

10
Discovery strategy
■ Bottom up traversal through lattice
□  only minimal dependencies
□ Pruning
□ Re-use results from previous level
■ ForasetX,testallX\A→A,A∈X
□  only non-trivial dependencies □ Test on efficient data structure
ABCD
ABC ABD ACD BCD
Felix Naumann | Profiling & Cleansing | Summer 2013
AB AD AC BC BD CD
ABCD

11
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013

12
Candidate Sets
■ RHS candidate set C(X)
■ Stores only those attributes that might depend on all other
attributes of X.
□ I.e., those that still need to be checked
□ If A∈C(X) then A does not depend on any proper subset of X.
■ C(X)=R\{A∈X|X\A→Aholds}
■ Example:R={ABCD},andA→CandCD→B □ C(A) = {ABCD}\{} = C(B) = C(C) = C(D) □ C(AB) = {ABCD}\{}
□ C(AC) = {ABCD}\{C} = {ABD}
□ C(CD) = {ABCD}\{}
□ C(BCD) = {ABCD}\{B} = {ACD} Felix Naumann | Profiling & Cleansing | Summer 2013

13
RHS candidate pruning
■ For minimality it suffices to test X\A → A where □ A∈X and A∈C(X \{B}) for all B∈X.
□ I.e., A is in all candidate sets of the subsets.
■ Example
□ X = {ABC}. Assume we know C → A from previous step.
□ Need to test three dependencies: AB→C, AC→B, and BC→A
◊ We should not be testing BC→A, because we know C→A □ Candidate sets:
◊ C(AB) = {ABC}, C(AC)={BC}, C(BC)={ABC}
□ E.g. BC→A does not need to be tested for minimality, because
A is not in all three candidate sets: A∉C(AB)∩C(AC)∩C(BC)
□ AB→C, AC→B need to be tested, because B and C appear in all
candidate sets.
Felix Naumann | Profiling & Cleansing | Summer 2013

15
Improved RHS candidate pruning
■ Basis:LetB∈XandletX\B→Bhold.IfX→A,thenX\B→A.
□ Example:A→Bholds.IfAB→Cholds,thenalsoA→C.
□ Use this to reduce candidate set: If X\B → B for some B, then any dependency with X on LHS cannot be minimal.
◊ Just remove B.
■ C+(X) = {A∈R | ∀B∈X: X\{A,B} → B does not hold} □ Special case: A = B corresponds to C(X)
□ C(X)=R\{A∈X|X\A→Aholds}
■ This definition removes three types of candidates.
□ C1 = {A∈X | X\A → A holds} (as before)
□ C2={R\X} if ∃B∈X:X\B→B
□ C3 = {A∈X | ∃B∈X\A : X\{A,B} → B holds} Felix Naumann | Profiling & Cleansing | Summer 2013

16
Example for C2
■ C+(X) = {A∈R | ∀B∈X: X\{A,B} → B does not hold} ■ C2={R\X}if ∃B∈X:X\B→B
■ R=ABCD,X=ABC ■ C(X) = ABCD initially ■ Discovery of C→B
□ Remove B from C(X)
□ Additionally remove R\X = D
□ Ok, because remaining combination of LHS contains B and C.
◊ ABC→D is not minimal because C→B ■ Together: C+(X) = {AC}
Felix Naumann | Profiling & Cleansing | Summer 2013

17
C3
■ C3 = {A∈X | ∃B∈X\A : X\{A,B} → B holds} ■ Same idea as before, but for subsets
■ Assume X has proper subset Y (X⊃Y) such that Y\B → B holds for some B∈Y.
■ Then we can remove from C(X) all A∈X\Y.
■ Example X = ABCD and let C→B
■ X⊃Y=BCandX\Y=AD
■ Thus can remove all AD.
□ Any remaining combination of LHS contains B and C. ◊ ABC →D and BCD→A
□ Again, since C→B any such FD is not minimal.
■ Together: C+(X) = {C}
Felix Naumann | Profiling & Cleansing | Summer 2013

18
More pruning of lattice: Key pruning
■ Insight: If X is superkey and X\B →B, then X\B is also a superkey.
■ Case1:IfXissuperkey,noneedtotestanyX→A. ■ Case 2:
□ If X is superkey and not key, any X → A is not minimal (for any A∉X).
□ If A∈X and X\A→A then X\A is superkey, and no need to test. ■ Summary: Can prune all keys and their supersets
■ Later: Test for superkey-property based on “key-error” of partition
Felix Naumann | Profiling & Cleansing | Summer 2013

19
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013

20
TANE Base Algorithm
■ Each level L contains the corresponding nodes of the lattice
Felix Naumann | Profiling & Cleansing | Summer 2013

21
■ Ll+1={X | |X| = l+1 and all subsets Y⊂X of size l are in Ll} □ General apriori idea
□ Can use Ll to generate Ll+1
Generating Lattice Levels
■ Prefix blocks: disjoint sets from Ll with common prefix of size l-1 □ Allpairsforl=1
■ Line 5: All subsets of new set must appear in lower level Felix Naumann | Profiling & Cleansing | Summer 2013

Trivial for L1: Nothing happens
22
Dependency Computation
■ Line 2: Create candidate sets; each attribute must appear in all candidate sets of smaller size
■ Line 4: Only test attributes from candidate set
■ Line 5: Actual test on data
■ Line 7: Reduce candidates by newly found dependent
■ Line 8: Reduce candidates by all other attributes: cannot depend on all others, because any combination involving A and LHS is not minimal
Felix Naumann | Profiling & Cleansing | Summer 2013

23
Pruning
■ Line 3: Basic pruning
□ Deletion from Ll ensures that supersets cannot be created during level generation (loops not executed on empty candidate sets)
■ Lines 4-8: Key pruning
Felix Naumann | Profiling & Cleansing | Summer 2013

24
Example run
■ R = ABCD, C→B and AB→D are to be discovered □ Also: AC→D through pseudo-transitivity
■ L0 = {},
□ C+({}) = ABCD □ nothing to do
■ L1 = {A}, {B}, {C}, {D}
□ C+(X) = ABCD for all X∈ L1
□ Still nothing to do: No FDs can be generated from singletons □ Thus, no pruning
■ L2 =AB,AC,AD,BC,BD,CD
□ E.g. C +(AB) = C+(AB\A) ∩ C+(AB\B) = ABCD ∩ ABCD □ C+(X) = ABCD for all X∈ L2
□ Dep. checks for AB: A→B and B→A Nothing happens Felix Naumann | Profiling & Cleansing | Summer 2013

25
Example run
■ L2 =AB,AC,AD,BC,BD,CD
□ C+(X) = ABCD for all X∈ L2
□ Dep. checks for BC: B→C (no!) and C→B (yes!) □ Output C→B
□ Delete B from C+(BC): ACD
□ Delete R\BC from C+(BC): C
◊ Note BC→A and BC→D are not minimal ■ L3 = ABC, ABD, ACD, BCD
□ C+(ABC) = C+(AB) ∩ C+(AC) ∩ C+(BC) = C
□ C+(BCD) = C+(BC) ∩ C+(BD) ∩ C+(CD) = C
□ C+(ABD) = C+(ACD) = ABCD unchanged
□ Dep. check for ABC: ABC ∩ C+(ABC) are candidates
◊ AB→C no! Did not check BC→A and AC→B Felix Naumann | Profiling & Cleansing | Summer 2013

26
Example run
■ L3 = ABC, ABD, ACD, BCD
□ C+(ABC) = C+(BCD) = C
□ C+(ABD) = C+(ACD) = ABCD
□ Dep. check for ABD: ABD ∩ C+(ABD) are candidates
◊ AD→B and BD→A: no!
◊ AB→D: yes! Output AB→D
◊ Delete D from C+(ABD): ABC
◊ Delete R\ABD from C+(ABD): AB
□ Dep. check for BCD: BCD ∩ C+(BCD) are candidates ◊ Only need to check BD→C: no!
□ Dep. check for ACD: ACD ∩ C+(ACD) are candidates ◊ CD→A and AD→C: no!
◊ AC→D: yes! Output AC→D
◊ Delete D from C+(ABD): ABC
◊ Delete R\ACD from C+(ABD): AC Felix Naumann | Profiling & Cleansing | Summer 2013

27
Example run
■ L4 = ABCD
■ C+(ABCD) = C+(ABC) ∩ C+(ABD) ∩ C+(ACD) ∩ C+(BCD) = {} □ Nothing to check
□ Did not need to check
◊ BCD→A: Not minimal because C → B ◊ ACD→B: Not minimal because C → B ◊ ABD→C: Not minimal because AB → D ◊ ABC→D: Not minimal because AB → D
Felix Naumann | Profiling & Cleansing | Summer 2013

28
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013

29
Notation and Partitions
■ Tuples t and u are equivalent wrt. attribute set X if t[A] = u[A] for all A∈X.
■ Attribute set X partitions R into equivalence classes □ Equivalence class of tuple t:
[t]X = {u∈R | ∀A∈X : t[A] = u[A]}
□ Partition is set of disjoint sets: πX = {[t]X | t∈R}
◊ Each set has unique values for X-values. □ |π| is number of equivalence classes in π.
Felix Naumann | Profiling & Cleansing | Summer 2013

30
TupleID A B C
D
Partitioning – Example
11a 21A
3 2 A 42A 52b Lily
63b 73C 8 3 C
$ Orchid Flower
■ [1]A= [2]A= {1,2}
■ πA = {{1,2}, {3,4,5}, {6,7,8}}
■ πBC = {{1}, {2}, {3,4}, {5}, {6}, {7}, {8}} ■ πD = {{147}, {2}, {3}, {5}, {6}, {8}}
Felix Naumann | Profiling & Cleansing | Summer 2013
$ Flower Tulip
$ Daffodil
$ Flower
# Rose

31
Partition refinement
■ Partition π refines partition π‘ if every equivalence class in π is a subset of some equivalence class in π‘.
□ π has a finer partitioning than π‘.
■X→A⇔πXrefinesπA 11A
□ If πX refines πA then πX⋃A = πA.
□ πX⋃A always refines πA.
□  if πX⋃A ≠ πA then |πX| ≠ |πX⋃A|. □ if|πX|=|πX⋃A|thenπX⋃A=πA.
21A 32A 42A 52A 63B 73B 83B
■ Together: X → A ⇔ πX refines πA ⇔ |πX| = |πX⋃A| □ This implies a simple check for an FD.
■ πA = {12, 345, 678}
■ πB = {12345, 678}
■ πAB = {12, 345, 678}
Felix Naumann | Profiling & Cleansing | Summer 2013
AB

32
■ Idea: Remove equivalence classes of size 1 from partitions. □ SingletonequivalenceclasscannotviolateanyFD.
□ Sameideaasforpositionlists
Stripped partitions
■ Problem
□ X→A⇔|πX|=|πX⋃A|nottrueforstrippedpartitionsπ’ 11Aa
□ :|πC|=|πC⋃A|=6and|π’C|=|π’C⋃A|=2
21Ab 32Ac 42Ac
□ ⇐:|π’B|=|π’B⋃C|=2butB↛C ■ Solution: Key error
□ e(X)isminimumfractionoftuplestoremoveforXtobekey
□ e(X)=1–|πX|/r 52Ad
◊ e(B)=1–|πB|/r=1–3/8=5/8 □ e(X)=(||π’X||–|π’X|)/r
63Be 73Be 83Df
◊ ||π’X|| = sum of sizes of equivalence classes in π’
◊ e(B) = ((5+2) – 2)/r = 5/8 □ X→A ⇔ e(X)=e(X∪A)
Felix Naumann | Profiling & Cleansing | Summer 2013
ABC

33
Computing partitions
■ Compute partition πA for each A∈R □ Directly from database
□ Only store tuples ID (Integers)
■ Product π1 · π2: Least refined partition that refines both π1 and π2 □ πX · πY= πX∪Y
□ Partitions πX computed as product of two partitions of size |X|–1.
□ Algorithm on next slide
■ Dependency checking: X\A → A
□ Calculate e(X) = (||π’X|| – |π’X|)/r and e(X\A) = … □ Check e(X) = e(X\A)
■ Also key pruning: X is key if e(X) = 0. Felix Naumann | Profiling & Cleansing | Summer 2013

34
Partition Product
Felix Naumann | Profiling & Cleansing | Summer 2013
ABC
1047 2157 3158 4269 5269 6347
π’A⋃B={23, 45} π’B⋃C={16, 45}

35
Overview
■ Functional Dependencies ■ TANE
□ Candidate sets
□ Pruning Algorithm
□ Dependency checking □ Approximate FDs
■ FD_Mine
■ Conditional FDs
Felix Naumann | Profiling & Cleansing | Summer 2013

36
Making TANE approximate
■ Definition based on minimum number of tuples to be removed from R for X→A to hold in R.
■ Discovery problem:
□ Given relation R and threshold ε, find all minimal non-trivial
FDsX→A suchthate(X→A)≤ ε.
1. Define error: Fraction of tuples causing FD violation
□ Errore(X→A)=min{|S||S⊆R, R\S⊨X→A}/|R|
2. Specify error threshold ε
3. Modify dependency checking algorithm
□ Efficient algorithm to compute error □ Bounds to avoid error calculation
Felix Naumann | Profiling & Cleansing | Summer 2013

Exact version Modifications
37
Approximate Dependency Checking
Felix Naumann | Profiling & Cleansing | Summer 2013

38
■ Errore(X→A)=min{|S||S⊆R, R\S⊨X→A}/|R|
■ Any equivalence class c ∈ πX is the union of one or more
Computing error
equivalence classes c1‘,c2‘, …∈πX∪A
■ Foreachc∈ πXthetuplesinallbutoneoftheci‘smustbe
removed for X→A to hold.
■ Minimum number to remove: Size of c minus size of largest ci’.
■ 􏱉 􏱊 → 􏱋 = 1−∑􏱔∈􏱕􏱒 􏱌􏱍􏱎 􏱏􏱐 􏱏􏱐∈􏱑􏱒∪􏱓∧􏱏􏱐⊆􏱏 􏱖
AB 11a
■ Example: B→A
□ πA = {12, 345, 678}
□ πB = {1, 234, 56, 78} 32A
□ πAB ={1,2,34,5,6,78} 42A
□ |πB| ≠ |πBA|
52b 63b 73C 83C
□ e(B→A) = 8/8 – (1+2+1+2)/8 = 2/8
■ Also possible on stripped partitions – not here.
Felix Naumann | Profiling & Cleansing | Summer 2013
21A

40
Bounding on error
■ Computing error is in O(|R|)
■ e(X)–e(X∪A) ≤ e(X→A) ≤ e(X) ■ I.e., do not calculate FD error if
□ e(X)–e(X∪A)>ε □ e(X)<ε AB 11a 21A 32A ■ e(B→A) = 8/8 – (1+2+1+2)/8 = 2/8 42A 52b 63b 73C 83C ■ e(B) = 4/8 ■ e(B∪A)=2/8 Felix Naumann | Profiling & Cleansing | Summer 2013 41 Overview ■ Functional Dependencies ■ TANE □ Candidate sets □ Pruning Algorithm □ Dependency checking □ Approximate FDs ■ FD_Mine ■ Conditional FDs Felix Naumann | Profiling & Cleansing | Summer 2013 42 FD_Mine: Refinement of TANE ■ If X → Y and Y → X hold, then X and Y are equivalent candidates, denoted as X↔Y. ■ Make use of additional FD properties ABCDE □ X↔Y and XW → Z  YW → Z 100010 201010 302012 403110 541124 643122 700110 □ X↔Y and WZ → X  WZ → Y ■Example □ A→D and D→A  A↔D □ Examination: AB → C and BC → A □ Inferred: ◊ BD → C (property 1) ◊ BC → D (property 2) □ D can be removed from table Pairs of UCCs ■ Hong Yao, Howard J. Hamilton, Cory J. Butz: FD_Mine: Discovering Functional Dependencies in a Database Using Equivalences. ICDM 2002: 729-732 Felix Naumann | Profiling & Cleansing | Summer 2013 43 Overview ■ Functional Dependencies ■ TANE □ Candidate sets □ Pruning Algorithm □ Dependency checking □ Approximate FDs ■ FD_Mine ■ Conditional FDs Felix Naumann | Profiling & Cleansing | Summer 2013 44 Conditional Functional Dependencies ■ Idea similar to CINDs ■ Embedded FD plus pattern tableau ■ Definition CFD □ Pair (X→A, Tp) ◊ X→A is embedded FD ◊ Tp is pattern tableau, made up of pattern tuples tp □ Pattern tuple with attributes B∈X∪A where tp[B] ◊ Constant in dom(B) ◊ Unnamed variable “_” for values in dom(B) □ Special case: Tp[B]=“_” for all B is equivalent to normal FD Felix Naumann | Profiling & Cleansing | Summer 2013 45 Semantics of CFDs ■ a≈b(amatchesb)if □ eitheraorbis“_” □ bothaandbareconstantsanda=b ■ DB satisfies (R: X → Y, Tp) iff □ For any tuple tp in the pattern tableau Tp and for any tuples t1, t2 in DB: ◊ If t1[X] = t2[X] ≈ tp[X], then t1[Y] = t2[Y] ≈ tp[Y] □ tp[X]: identifying the set of tuples on which the constraint tp applies: { t | t[X] ≈ tp[X]} □ t1[Y] = t2[Y] ≈ tp[Y]: enforcing the embedded FD, and the pattern of tp Felix Naumann | Profiling & Cleansing | Summer 2013 Slide from Wenfei Fan 46 id country area-code phone street city zip Example: Violation of CFDs t1 44 t2 44 131 1234567 131 3456789 Mayfield Crichton NYC EH4 8LE NYC EH8 8LE t3 01 t4 01 908 3456789 908 9876543 Mountain Ave Mainstreet NYC 07974 NYC 07974 ■ cust([country, zip] → [street], Tp) ■ Tuples t1 and t2 violate the CFD country zip street 44 _ _ □ t1[country, zip] = t2[country, zip] ≈ tp[country, zip] □ But t1[street] ≠ t2[street] ■ The CFD applies to t1 and t2 since they match tp[country, zip] ■ Tuples t3 and t4 do not violate the CFD □ CFD does not apply to t3 and t4 Slide from Wenfei Fan Felix Naumann | Profiling & Cleansing | Summer 2013 47 id country area-code phone street city zip Example: Violation of CFD by single tuple t1 44 131 t2 44 131 1234567 3456789 Mayfield NYC Crichton NYC EH4 8LE EH8 8LE t3 01 908 t4 01 908 3456789 9876543 Mountain Ave NYC ■ cust([country, zip] → [street], Tp) ■ Tuple t1 does not satisfy the CFD. country zip street ■ t1[country, area-code] = t1[country, area-code] ≈ tp1[country, area-code] ■ t1[city] = t1[city]; however, t1[city] does not match tp1[city] ■ In contrast to traditional FDs, a single tuple may violate a CFD. Felix Naumann | Profiling & Cleansing | Summer 2013 Slide from Wenfei Fan Mainstreet 07974 NYC 07974 tp1 44 131 Edi tp2 01 908 MH tp3 _ _ _ 48 Further literature ■ DepMiner □ Efficient Discovery of Functional Dependencies and Armstrong Relations. Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. EDBT 2000 ■ CORDS □ Ihab F. Ilyas, Volker Markl, Peter J. Haas, Paul Brown, Ashraf Aboulnaga: CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. SIGMOD Conference 2004: 647-658 ■ FastFDs □ Catharine M. Wyss, Chris Giannella, Edward L. Robertson: FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. DaWaK 2001: 101-110 ■ CFDs □ Loreto Bravo, Wenfei Fan, Shuai Ma: Extending Dependencies with Conditions. VLDB 2007: 243-254 Felix Naumann | Profiling & Cleansing | Summer 2013