CS699 Lecture 2 Data Exploration
Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents: term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential data: transaction sequences
– Genetic sequence data
• Spatial, image and multimedia
– Spatial data: maps
– Image data
– Video data
Transaction data example:
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
– sales database: customers, store items, sales
– medical database: patients, treatments
– university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (or field, dimension, feature, variable): a data field representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Types:
– Nominal (or categorical), Binary, Ordinal
– Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Attribute Types
• Nominal: categories, states, or “names of things”
– Hair_color = {auburn, black, blond, brown, grey, red, white}
– marital status, occupation, ID numbers, zip codes
• Binary
– Nominal attribute with only 2 states (0 and 1)
– Symmetric binary: both outcomes equally important
• e.g., gender
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
– Values have a meaningful order (ranking) but magnitude between successive values is not known.
– Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval-scaled
– Measured on a scale of equal-sized units
– Values have order
• E.g., temperature in °C or °F, calendar dates
– No true zero-point
– Can't say 10 °C is twice as warm as 5 °C.
– Some arithmetic operations do not make sense: sum of years 1900 and 2000.
Numeric Attribute Types
• Ratio‐scaled
– Inherent zero‐point
• e.g., temperature in Kelvin, length, counts, monetary quantities
– We can speak of a value as being a multiple (or ratio) of another value (e.g., 10 inches is twice as long as 5 inches).
– Any arithmetic operations are allowed.
Discrete vs. Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, height, or weight
– Practically, real values can only be measured and represented using a finite number of digits
– Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
• Motivation
– To better understand the data: central tendency, variation and spread
• Central Tendency
– Location of center of a data distribution
– mean, median, mode, etc.
• Data dispersion
– How the data is spread out
– quartiles, interquartile range, boxplot, standard deviation, variance, etc.
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \ \text{(sample)}, \qquad \mu = \frac{\sum x}{N} \ \text{(population)}$$
Note: n is sample size and N is population size.
– Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
– Trimmed mean: chopping extreme values
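The three kinds of mean can be sketched in a few lines of Python. This is only an illustration: the helper names `weighted_mean` and `trimmed_mean`, the sample values, and the choice to trim a fixed count `k` from each end are assumptions, not part of the lecture.

```python
from statistics import mean

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values (assumed trimming rule)."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k would trim away every value")
    return mean(ordered[k:len(ordered) - k])

data = [2, 5, 6, 8, 11, 20, 40]                      # illustrative values
plain = mean(data)                                   # 92 / 7
weighted = weighted_mean([10, 20, 30], [1, 1, 2])    # (10 + 20 + 60) / 4
trimmed = trimmed_mean(data, 1)                      # drops 2 and 40 before averaging
```

With these values the plain mean is pulled up by 40, while the trimmed mean is not, which is the usual reason for trimming.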
Measuring the Central Tendency
• Median:
– Middle value if odd number of values, or average of the middle two values otherwise
– median of <2, 5, 6, 8, 11, 20, 40> is 8
– median of <2, 5, 6, 8, 20, 40> is 7 (= (6 + 8) / 2)
Measuring the Central Tendency
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– mode of <1, 1, 3, 3, 3, 5, 8, 9, 10, 10> is 3 (unimodal)
– modes of <1, 1, 3, 3, 3, 5, 8, 9, 10, 10, 10> are 3 and 10 (bimodal)
– Empirical formula to estimate mode for unimodal, moderately skewed data (given mean and median):
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$
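The median and mode examples can be checked with Python's standard `statistics` module. This is a minimal sketch: `estimate_mode` is a hypothetical helper that rearranges the empirical formula, and the inputs passed to it are made-up numbers.

```python
from statistics import median, multimode

odd = [2, 5, 6, 8, 11, 20, 40]
even = [2, 5, 6, 8, 20, 40]
unimodal = [1, 1, 3, 3, 3, 5, 8, 9, 10, 10]
bimodal = [1, 1, 3, 3, 3, 5, 8, 9, 10, 10, 10]

median_odd = median(odd)          # middle value
median_even = median(even)        # average of the two middle values
modes_uni = multimode(unimodal)   # list of most frequent values
modes_bi = multimode(bimodal)

def estimate_mode(mean_value, median_value):
    """Empirical rule rearranged: mode ≈ mean - 3 * (mean - median)."""
    return mean_value - 3 * (mean_value - median_value)

approx_mode = estimate_mode(10, 9)   # illustrative inputs only
```

`multimode` returns every value tied for the highest count, so it covers the unimodal and bimodal cases with one call.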
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: distribution curves for symmetric, positively skewed, and negatively skewed data, with mean, median, and mode marked]
January 16, 2020 Data Mining: Concepts and Techniques
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
– Outlier:
• Less than Q1 – 1.5 * IQR
• Greater than Q3 + 1.5 * IQR
– Note: There are different ways of determining quartiles.
Measuring the Dispersion of Data
• Example
D = <2, 10, 12, 15, 17, 20, 53>
Median = 15
Q1 = median of lower half <2, 10, 12> = 10
Q3 = median of upper half <17, 20, 53> = 20
IQR = 20 – 10 = 10
Q1 – 1.5*IQR = 10 – 15 = -5
Q3 + 1.5*IQR = 20 + 15 = 35
So, 53 is an outlier
• Note: There are other ways of determining Q1 and Q3
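The worked example can be reproduced with a short Python sketch. It follows the "median of each half" rule used in the example, excluding the middle value when the count is odd; the function names are illustrative, and other quartile conventions (including those used by library routines) may give different Q1 and Q3.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

D = [2, 10, 12, 15, 17, 20, 53]
q1, q2, q3 = quartiles_by_halves(D)
outliers = iqr_outliers(D)
```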
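Measuring the Dispersion of Data
• Variance and standard deviation (sample: s, population: σ)
– Variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
– Denominator in the formula
• N – 1 (or n – 1) is used for sample
• N (or n) is used for population
– Standard deviation s (or σ) is the square root of variance s^2 (or σ^2)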
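As a rough check of the two variance formulas, the sketch below implements the "sum of squares" forms directly and compares them with the standard library's `variance` (sample, n – 1) and `pvariance` (population, N). The data set is borrowed from the quartile example purely for illustration.

```python
from math import isclose, sqrt
from statistics import pvariance, variance

def sample_variance(xs):
    """s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def population_variance(xs):
    """sigma^2 = sum(x^2) / N - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2

D = [2, 10, 12, 15, 17, 20, 53]
s2 = sample_variance(D)
sigma2 = population_variance(D)
s = sqrt(s2)            # sample standard deviation
sigma = sqrt(sigma2)    # population standard deviation

# Both forms should agree with the library implementations.
matches_library = isclose(s2, variance(D)) and isclose(sigma2, pvariance(D))
```

The sample variance is always the larger of the two for the same data, because it divides the same sum of squared deviations by n – 1 instead of N.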
Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
• Boxplot
– Data is represented with a box and lines
– Can be drawn vertically or horizontally
– The ends of the box are at the first and third quartiles. So, the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extended to Minimum and Maximum
– Outliers: points beyond a specified outlier threshold, plotted individually
[Figure: vertical boxplot labeled, from top to bottom, maximum, Q3, median (Q2), Q1, minimum]
Boxplot Analysis
• Example
Visualization of Data Dispersion: 3‐D Boxplots
Properties of Normal Distribution Curve
• The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of the measurements
– From μ–3σ to μ+3σ: contains about 99.7% of the measurements
[Figure: three normal curves with the central 68%, 95%, and 99.7% regions marked; x-axis from −3 to +3]
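The 68% / 95% / 99.7% figures can be verified numerically. For a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k / √2); the function name below is an assumption made for this sketch.

```python
from math import erf, sqrt

def normal_mass_within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

within_1 = normal_mass_within(1)   # roughly 0.68
within_2 = normal_mass_within(2)   # roughly 0.95
within_3 = normal_mass_within(3)   # roughly 0.997
```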
Graphic Displays of Basic Statistical Descriptions
• Boxplot: graphic display of five‐number summary
• Histogram: x-axis shows values, y-axis shows frequencies
• Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Histogram Analysis
• Histogram: Graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: example histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]
Histograms Often Tell More than Boxplots
• The two histograms shown on the left may have the same boxplot representation
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions
Scatter plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Example with iris Dataset
• Four predictor attributes: sepallength, sepalwidth, petallength, petalwidth
• Explore sepalwidth
– mean = 3.057, Std Dev = 0.435
– Q1 = 2.8, Q2 (median) = 3.0, Q3 = 3.325
– IQR = 0.525, 1.5 * IQR = 0.7875
– Lower threshold = 2.0125
– Upper threshold = 4.1125
– Outliers: 2.0, 4.2, 4.4
– Max. excluding outliers: 4.1
– Min. excluding outliers: 2.2
[Figure: histogram and boxplot of sepalwidth, with the values 4.4, 4.2, 4.1, 2.2, and 2.0 marked on the boxplot]
Example with iris Dataset
• Scatterplot: petalwidth vs. petallength
– Shows positive correlation
Example with iris Dataset
• Scatterplot: sepalwidth vs. sepallength
– There does not seem to be any correlation
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
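Data Matrix and Dissimilarity Matrix
• Data matrix
– n data points with p dimensions
– Two modes
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
• Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix
– Single mode
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$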
Data Matrix and Dissimilarity Matrix
Example:
Data Matrix
Object   Attribute1 (x)   Attribute2 (y)
O1       1                2
O2       3                5
O3       2                0
O4       4                5
Dissimilarity Matrix (with Euclidean Distance)
      O1     O2     O3     O4
O1    0
O2    3.61   0
O3    2.24   5.1    0
O4    4.24   1      5.39   0
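The Euclidean dissimilarity matrix for the four objects can be reproduced with a small Python sketch. The dictionary-of-pairs layout and the function name are choices made here for brevity; only the lower triangle is stored, mirroring the triangular matrix.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

points = {"O1": (1, 2), "O2": (3, 5), "O3": (2, 0), "O4": (4, 5)}

def dissimilarity_matrix(objects):
    """Lower-triangular matrix as {(row, col): distance}, rounded to 2 decimals."""
    names = list(objects)
    return {
        (a, b): round(dist(objects[a], objects[b]), 2)
        for i, a in enumerate(names)
        for b in names[: i + 1]
    }

matrix = dissimilarity_matrix(points)
d_o2_o1 = matrix[("O2", "O1")]
d_o4_o3 = matrix[("O4", "O3")]
```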
Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute)
• Distance:
$$d(i,j) = \frac{p - m}{p}$$
where m: # of matches, p: total # of variables
(or distance = #mismatches / all)
• Example
Object   Income   Housing   Zip     Marital_status
Molly    high     own       02215   yes
Greg     medium   own       02215   yes
distance(Molly, Greg) = 1/4 or 0.25
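The mismatch ratio for nominal attributes is straightforward to sketch in Python. The function name and the tuple encoding of each object are assumptions made for illustration.

```python
def nominal_distance(a, b):
    """d(i, j) = (p - m) / p, with m matches among p nominal attributes."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)
    m = sum(x == y for x, y in zip(a, b))
    return (p - m) / p

# Attribute order: Income, Housing, Zip, Marital_status
molly = ("high", "own", "02215", "yes")
greg = ("medium", "own", "02215", "yes")
d_molly_greg = nominal_distance(molly, greg)   # only Income differs
```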
Dissimilarity between Binary Variables
• For symmetric binary variables, use the same method that is used for nominal attributes: distance = #mismatches / all
• Example
Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   Y       N       P        N        N        N
Mary   Y       N       P        N        P        N
Jim    Y       P       N        N        N        N
d(jack, mary) = 1/6 = 0.17
d(jack, jim) = 2/6 = 0.33
d(jim, mary) = 3/6 = 0.5
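The three distances in the example can be recomputed as follows. This sketch treats every attribute as symmetric binary, as the slide does, and encodes Y and P as 1 and N as 0; that encoding and the function names are assumptions.

```python
def to_bits(row):
    """Encode Y/P as 1 and N as 0 (assumed encoding)."""
    return [1 if value in ("Y", "P") else 0 for value in row]

def symmetric_binary_distance(a, b):
    """Number of mismatching attributes divided by the number of attributes."""
    bits_a, bits_b = to_bits(a), to_bits(b)
    mismatches = sum(x != y for x, y in zip(bits_a, bits_b))
    return mismatches / len(bits_a)

# Attribute order: Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = ("Y", "N", "P", "N", "N", "N")
mary = ("Y", "N", "P", "N", "P", "N")
jim = ("Y", "P", "N", "N", "N", "N")

d_jack_mary = symmetric_binary_distance(jack, mary)
d_jack_jim = symmetric_binary_distance(jack, jim)
d_jim_mary = symmetric_binary_distance(jim, mary)
```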
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: A popular distance measure
$$d(i,j) = \left(|x_{i1}-x_{j1}|^h + |x_{i2}-x_{j2}|^h + \cdots + |x_{ip}-x_{jp}|^h\right)^{1/h}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
– d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
– d(i, j) = d(j, i) (Symmetry)
– d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)
• A distance that satisfies these properties is called a metric
Special Cases of Minkowski Distance
• h = 1: Manhattan distance (city block, L1 norm)
– E.g., the Hamming distance: the number of bits that are different between two binary vectors
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
• h = 2: Euclidean distance (L2 norm)
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$$
• h → ∞: "supremum" distance (Lmax norm, L∞ norm)
– This is the maximum difference between any component (attribute) of the vectors
$$d(i,j) = \lim_{h\to\infty}\left(\sum_{f=1}^{p}|x_{if}-x_{jf}|^h\right)^{1/h} = \max_{f}|x_{if}-x_{jf}|$$
Example: Minkowski Distance
object   attribute1   attribute2
O1       1            2
O2       3            5
O3       2            0
O4       4            5
Manhattan (L1)
L1    O1     O2     O3     O4
O1    0
O2    5      0
O3    3      6      0
O4    6      1      7      0
Euclidean (L2)
L2    O1     O2     O3     O4
O1    0
O2    3.61   0
O3    2.24   5.1    0
O4    4.24   1      5.39   0
Supremum (L∞)
L∞    O1     O2     O3     O4
O1    0
O2    3      0
O3    2      5      0
O4    3      1      5      0
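A single function covers all three tables if the order h is a parameter. The sketch below is one possible way to write it; passing `math.inf` to select the supremum distance is a convention chosen here, not something the lecture prescribes.

```python
from math import inf

def minkowski(a, b, h):
    """Minkowski distance of order h; h = inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

objects = {"O1": (1, 2), "O2": (3, 5), "O3": (2, 0), "O4": (4, 5)}

manhattan_o1_o2 = minkowski(objects["O1"], objects["O2"], 1)
euclidean_o1_o2 = minkowski(objects["O1"], objects["O2"], 2)
supremum_o1_o2 = minkowski(objects["O1"], objects["O2"], inf)
manhattan_o3_o4 = minkowski(objects["O3"], objects["O4"], 1)
```

As h grows, the largest coordinate difference dominates the sum, which is why the supremum distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance.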
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
• Other vector objects: gene features in micro‐arrays, …
• Applications: information retrieval, biologic taxonomy, gene feature mapping, …
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates vector dot product, ||d||: the length of vector d
• For vectors with non-negative components (such as term-frequency vectors), the value is between 0 and 1, inclusive; closer to 0: less similar; closer to 1: more similar
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates vector dot product, ||d|| is the length of vector d
• Ex: Find the similarity between documents 1 and 2.
d1 = (5,0,3,0,2,0,0,2,0,0)
d2 = (3,0,2,0,1,1,0,1,0,1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
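The same calculation in Python, as a minimal sketch using only the standard library. The function name and the decision to raise an error for a zero vector are assumptions made here.

```python
from math import sqrt

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
```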
Attributes of Mixed Types
• Distance between object 1 and object 2.
• A1 and A2: interval-scaled; A3, A4, and A5: asymmetric binary (P is more important than N); A6 and A7: nominal; A8 is ordinal (ranks are gold = 3, silver = 2, bronze = 1); "?" indicates a missing value.
OID   A1   A2   A3   A4   A5   A6     A7    A8
1     8    17   N    N    N    two    4wd   gold
2     21   6    P    ?    N    two    fwd   silver
3     10   10   P    P    N    two    fwd   bronze
4     16   12   P    N    Y    four   4wd   gold
5     12   14   P    N    Y    four   fwd   gold
6     13   11   N    P    N    two    fwd   silver
7     10   8    P    N    N    four   4wd   bronze
8     6    28   N    P    Y    four   fwd   gold
• A1: |8 – 21| / (21 – 6) = 0.867
• A2: |17 – 6| / (28 – 6) = 0.5
• A3: 1, A6: 0, A7: 1
• A8: |1 – 0.5| / (1 – 0) = 0.5
• A4 is skipped (missing value) and A5 is skipped (both values are N on an asymmetric binary attribute), so 6 attributes contribute
• d(O1, O2) = (0.87 + 0.5 + 1 + 0 + 1 + 0.5) / 6 = 0.64
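The mixed-type calculation can be sketched in Python as below. The row tuples are one reading of the slide's table, and the rules encoded here (range-normalized differences for numeric attributes, rank scaled to [0, 1] for ordinal ones, skipping missing values and negative-negative matches on asymmetric binary attributes) are inferred from the worked numbers; treat both as assumptions. Averaging the six contributing terms for objects 1 and 2 gives about 0.64.

```python
MISSING = "?"
RANKS = {"bronze": 1, "silver": 2, "gold": 3}
# Attribute kinds for A1..A8, as described on the slide.
KINDS = ("numeric", "numeric", "asym", "asym", "asym",
         "nominal", "nominal", "ordinal")

ROWS = {
    1: (8, 17, "N", "N", "N", "two", "4wd", "gold"),
    2: (21, 6, "P", "?", "N", "two", "fwd", "silver"),
    3: (10, 10, "P", "P", "N", "two", "fwd", "bronze"),
    4: (16, 12, "P", "N", "Y", "four", "4wd", "gold"),
    5: (12, 14, "P", "N", "Y", "four", "fwd", "gold"),
    6: (13, 11, "N", "P", "N", "two", "fwd", "silver"),
    7: (10, 8, "P", "N", "N", "four", "4wd", "bronze"),
    8: (6, 28, "N", "P", "Y", "four", "fwd", "gold"),
}

def mixed_distance(i, j, rows):
    """Average per-attribute dissimilarity over the attributes that contribute."""
    a, b = rows[i], rows[j]
    total, used = 0.0, 0
    for f, kind in enumerate(KINDS):
        x, y = a[f], b[f]
        if x == MISSING or y == MISSING:
            continue                      # missing value: attribute is skipped
        if kind == "numeric":
            column = [r[f] for r in rows.values() if r[f] != MISSING]
            d = abs(x - y) / (max(column) - min(column))
        elif kind == "ordinal":
            top = max(RANKS.values())
            zx = (RANKS[x] - 1) / (top - 1)   # rank scaled to [0, 1]
            zy = (RANKS[y] - 1) / (top - 1)
            d = abs(zx - zy)
        elif kind == "asym":
            if x == "N" and y == "N":
                continue                  # negative match carries no information
            d = 0.0 if x == y else 1.0
        else:                             # nominal
            d = 0.0 if x == y else 1.0
        total += d
        used += 1
    if used == 0:
        raise ValueError("no attribute contributes to the distance")
    return total / used

d12 = mixed_distance(1, 2, ROWS)
```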
References
• Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012
• http://www.cs.illinois.edu/~hanj/bk3/