Foundation Concepts for Data Mining
Table of contents
1. Introduction
2. Data Types and Representations (Text 2.1)
3. Basic Statistical Descriptions of a Single Data Variable (Text 2.2)
3.1. Computation of Measures (Text: 4.2.4)
3.2. Measuring the Central Tendency: Mean, Median and Mode (Text 2.2.1)
3.3. Measuring the Dispersion of Data: Range, Quartile, Variance, Standard Deviation, and Interquartile Range (Text 2.2.2)
3.4. Exercises
4. Measuring Correlation amongst Two Variables (not in Text)
5. Measuring Similarity and Dissimilarity of Multivariate Data (Text 2.4)
5.1. Proximity Measures for Nominal Attributes (Text 2.4.2)
5.2. Proximity Measures for Binary Attributes (Text 2.4.3)
5.3. Dissimilarity of Numeric Data: Normalisation (Text 3.5.2)
5.4. Dissimilarity of Numeric Data: Minkowski Distance (Text 2.4.4)
5.5. Proximity Measures for Ordinal Attributes (Text 2.4.5)
5.6. Dissimilarity for Attributes of Mixed Types (Text 2.4.6)
5.7. Cosine Similarity for Sparse Vectors (Text 2.4.7)
5.8. Exercises
1. Introduction
Most of this material is derived from the text, Han, Kamber and Pei, Chapter 2, or the corresponding powerpoint slides made available by the publisher. Where a source other than the text or its slides was used for the material, attribution is given. Unless otherwise stated, images are copyright of the publisher, Elsevier.
Here, we introduce some important concepts that are used right across data mining methods and the KDD process.
ACTION: If you would like a pre-recorded lecture for this topic, then watch the recording (44 minutes). Or you can dive straight into the next sections; it is all there.
2. Data Types and Representations (Text 2.1)
Data is generally structured as objects, where each object describes some real-world or abstract entity through a set of properties, also called attributes or features. Some of those attributes may express relationships to other objects.
Example: a medical database may include information about patient objects and treatment objects, with relations connecting patients to the treatments they are given.
Example: Attributes of a patient may include name, gender and birth date. Attributes of a treatment may include treatment name, responsible doctor, and date.
The objects may alternatively be called samples, examples, instances, data points, observations, rows, tuples, records, or vectors. The attributes may be called fields, variables, dimensions, properties, or features. This will depend on the data model or data mining discipline you are working with, and you need to be comfortable with all of those names.
Commonly, a data model (also called a meta-model) dictates how these concepts are materialised in a data structure, such as the relational data model, for example. Alternatively, a data model could refer to a model that instantiates a particular meta-model for some domain of application, for example, the particular relational schema that is used in your hospital’s medical records system.
In this course, we will be employing the relational data model, multidimensional data models, a transactional data model, simple tables, graph and ontology data models, and others.
Attribute Types
The type of an attribute defines the way its values for some object are represented, the ways it can be manipulated, and appropriate methods for mining.
Nominal
categories, states, classes, or “names of things”
also called categorical
e.g. Hair_color = {auburn, black, blond, brown, grey, red, white}
e.g. marital status, occupation, ID numbers, postcodes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative). e.g. attended class (no, yes)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between successive values is not known.
e.g. Size = {small, medium, large}, grades, army rankings
Commonly used in surveys e.g. {very poor, poor, moderate, good, excellent}
Can be derived by discretising numeric quantities into ordinal categories (e.g. ANU marks -> grades or age -> child/adult).
Numeric Interval-Scaled
Could be integer or real
Measured on a scale of equal-sized units
Values have order
No true zero-point
e.g., temperature in ˚C, calendar dates
We can reason about the difference amongst values, but not about the multiples or ratios of values.
e.g., 20˚C is 10˚C warmer than 10˚C but it is not twice as warm as 10˚C.
Interval-scaled attributes are not very common in practice
Numeric Ratio-Scaled
Could be integer or real
Inherent zero-point
A ratio scale has all the properties of interval-scaled attributes and the additional property of permitting reasoning over multiples of values
We can speak of values as being a multiple of another in the same unit of measurement (10 K is twice as high as 5 K).
We can also reason about the difference between values, as for interval-scaled.
e.g., temperature in Kelvin, length, counts, monetary quantities
Discrete versus Continuous Attributes
Another classification of attributes, more often used in machine learning:
Discrete Attribute
Has only a finite or countably infinite set of values
e.g., postcodes, profession, or the set of words in a collection of documents
Sometimes, represented as integer variables
Binary, nominal and ordinal attributes are special cases of discrete attributes
Continuous Attribute
real numbers as values
e.g., temperature, height, weight, circumference
In practice, real values can only be measured and represented using a finite number of bits (so in that sense are approximated by discrete values)
Typically represented as floating-point variables
Commonly, an object with continuous attributes is called a feature vector and is identified with a point or vector in multidimensional space. For example, v = (2.0, 4.0, 1.0) is a vector in 3-dimensional Euclidean space which can be sketched as follows.
[Figure: object v shown as a red vector in 3-dimensional space. Source: https://math.libretexts.org]
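As a small illustrative sketch in Python, the vector v above can be represented directly and treated geometrically; for instance, its Euclidean norm is its distance from the origin:

import math

v = (2.0, 4.0, 1.0)  # an object with three continuous attributes

# Treating the feature vector as a point in 3-dimensional
# Euclidean space, its norm is its distance from the origin.
norm = math.sqrt(sum(x * x for x in v))
print(norm)  # ~4.58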
3. Basic Statistical Descriptions of a Single Data Variable (Text 2.2)
Simple statistics can help us to understand the data we are working with, and thereby to select appropriate cleaning and data mining methods and to interpret data mining results. We will look at central tendency (what is the middle value?) and dispersion (how uniform are the values, and how are they spread?).
These should mostly be concepts you have seen before.
But before that, we look briefly at the cost of computing these and other functions in data mining.
3.1. Computation of Measures (Text: 4.2.4)
When analysis or data mining needs to operate over big data, we need to think about the computational cost of all stages of the processing.
When assessing the character of data, when summarising data, and when developing measures of similarity and interestingness of data and data patterns, some measures are computationally more difficult to compute than others.
The classification of measure functions as distributive, algebraic or holistic is useful to understand their computational complexity.
Distributive
Suppose the data is partitioned into n sets, and the function is applied to each of those sets. Then, suppose the function is applied to those n results. If the result is the same as applying the function to all of the data without partitioning, then the function is distributive. e.g., count(), sum(), min(), max()
Such computations can be made efficient by distributing the computation.
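A minimal sketch in Python (the data values and their two-way partition are arbitrary) of the distributive property for sum(), min() and count():

data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:4], data[4:]]  # e.g. split across two workers

# Re-applying the function to per-partition results gives the
# same answer as applying it to all of the data at once.
assert sum(sum(p) for p in partitions) == sum(data)
assert min(min(p) for p in partitions) == min(data)
assert sum(len(p) for p in partitions) == len(data)  # count()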
Algebraic
A function is algebraic if it can be computed by a function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive function.
e.g. average() can be computed from sum() and count().
e.g. min_N() and max_N() (the N minimum/maximum values, for a bounded integer N)
e.g. standard_deviation()
Such computations can take advantage of the distributive sub-functions to be made efficient.
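For instance, a minimal Python sketch (reusing an arbitrary two-way partition) that assembles average() and standard_deviation() from three distributive aggregates, the count, sum and sum of squares:

import math

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
partitions = [data[:4], data[4:]]

# Each partition contributes a bounded number of distributive values.
n = sum(len(p) for p in partitions)                  # count()
s = sum(sum(p) for p in partitions)                  # sum()
ss = sum(sum(x * x for x in p) for p in partitions)  # sum of squares

mean = s / n
# Population standard deviation, from just the three aggregates.
std = math.sqrt(ss / n - mean ** 2)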
Holistic
A function is holistic if there is no constant bound on the storage size needed to describe sub-functions. That is, there is no algebraic function with M arguments (where M is a bounded integer) that characterises the computation.
e.g. median(), mode(), rank()
Such computations are difficult to compute efficiently and often efficient approximations are used instead.
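A small Python illustration of why holistic measures resist the partitioned approach: the median of per-partition medians is, in general, not the median of all the data:

from statistics import median

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
partitions = [data[:4], data[4:]]

approx = median(median(p) for p in partitions)  # median(2.5, 7) = 4.75
exact = median(data)                            # 5
assert approx != exact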
3.2. Measuring the Central Tendency: Mean, Median and Mode (Text 2.2.1)
Let X be an attribute (e.g. salary).
Let $x_1, x_2, \ldots, x_n$ be a set of n observations or observed values for X.
Arithmetic Mean (or, simply Mean)
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Weighted (Arithmetic) Mean
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$, where $w_i$ is the weight attached to observation $x_i$
Used when some values are more important or more significant than others (e.g. occurrence frequency, or weightings for the most important people represented in the data). Also called weighted average.
Can also use a trimmed mean, which removes a portion of the highest and lowest values and computes the mean of the remainder, to avoid being pushed off centre by extreme values.
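A minimal Python sketch of these three variants of the mean (the salary values and the 10% trim proportion are made up for illustration):

def mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, trim=0.1):
    # Drop the lowest and highest `trim` fraction of values.
    k = int(len(xs) * trim)
    xs = sorted(xs)
    return mean(xs[k:len(xs) - k] if k else xs)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 1000]
print(mean(salaries))          # 144.6 -- dragged up by the extreme value
print(trimmed_mean(salaries))  # 52.0  -- closer to the centre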
Median
The middle value of an ordered set of values (or the average of the middle two values if there is an even number of values).
Represents the centre better than the mean when data is skewed (asymmetrical) towards one end or has extreme values.
Works for ordinal data as well as numeric data.
Strictly, if there is an even number of values the median is not unique: any value between (and including) the two middlemost values satisfies the definition; taking their average is the usual convention.
Can be expensive to compute but can be approximated by interpolation when data values are grouped into ranges with frequencies.
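A minimal Python sketch of the median, including the interpolation approximation mentioned above for values grouped into ranges with frequencies (the salary buckets and counts are made up):

from statistics import median

print(median([1, 3, 3, 6, 7, 8, 9]))  # 6: odd count, middle value
print(median([1, 3, 3, 6, 7, 8]))     # 4.5: even count, average of 3 and 6

def grouped_median(buckets):
    # buckets: ordered list of (lower_bound, width, frequency).
    n = sum(freq for _, _, freq in buckets)
    below = 0
    for lower, width, freq in buckets:
        if below + freq >= n / 2:
            # Linear interpolation within the bucket holding the median.
            return lower + (n / 2 - below) / freq * width
        below += freq

# e.g. salaries (in $K) grouped as 0-20 (3 people), 20-40 (5), 40-60 (2)
print(grouped_median([(0, 20, 3), (20, 20, 5), (40, 20, 2)]))  # 28.0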
Mode
Defined as the value that occurs most frequently in the data
Works for numeric and nominal or ordinal data
May not be unique; data sets with one, two, or three modes are called unimodal, bimodal, and trimodal, respectively
For unimodal data that is at most moderately skewed, an empirical formula allows the mode to be approximated from the mean and median:
$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
Midrange
Defined as the average of the largest and smallest values
Cheap to compute as an indication of the central tendency of the data
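A minimal Python sketch of the mode and midrange, also checking the empirical approximation above on a small made-up sample:

from statistics import mean, median, mode

data = [1, 2, 2, 2, 3, 3, 4, 5, 6]

print(mode(data))                   # 2: the most frequent value
print((min(data) + max(data)) / 2)  # 3.5: the midrange

# Empirical approximation for moderately skewed unimodal data:
# mode ~= mean - 3 * (mean - median)
approx_mode = mean(data) - 3 * (mean(data) - median(data))
print(approx_mode)  # ~2.78, roughly the true mode of 2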
Skewness
The figure illustrates the main central tendency measures of symmetric and skewed data.
Skewness is a measure of asymmetry in data. As seen here, the sign and magnitude of the difference between mean and mode is a simple indicator of skewness in the data. There are many other ways to measure skewness.
3.3. Measuring the Dispersion of Data: Range, Quartile, Variance, Standard Deviation, and Interquartile Range (Text 2.2.2)
Let X be a numeric attribute (e.g. salary).
Let $x_1, x_2, \ldots, x_n$ be a set of n observations or observed values for X.
Range
The range is the difference between the largest and smallest values, i.e. max(X) – min(X)
Quartiles
Quantiles are data values that split the data into (nearly) equal-sized increasing-order subsets.
Given an integer k such that 0 < k < q, the kth q-quantile is the value x such that at most k/q of the data values are less than x and at most (q − k)/q of the data values are more than x. Quartiles are the 4-quantiles: the three values (Q1, Q2, Q3) that split the data into four (nearly) equal-sized subsets.
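A hedged sketch in Python using the standard library's statistics.quantiles (the salary values are made up), which also yields the interquartile range named in this section's title:

from statistics import quantiles

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70]

# The three cut points Q1, Q2 (the median), Q3 split the sorted
# data into four (nearly) equal-sized subsets.
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)

# The interquartile range IQR = Q3 - Q1, a robust measure of spread.
print(q3 - q1)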