
Introduction to Data Mining

Course: COMP3425/COMP8410 – Data Mining – Sem 1 2021

Description

Foundational and Introductory topics

Table of contents
1. Introduction (Text:1)
1.1. Why Data Mining?
1.2. What is Data Mining?
1.3. What makes a pattern useful?
1.4. What Kind of Data Can be Mined?
1.5. Data Mining is a skilled craft
1.6. What kind of patterns can be mined?
1.7. Challenges in Data Mining
1.8. Privacy Implications of Data Mining (Text 13.4.2)

2. Popular Data Mining Tools
3. Reading and Exercises
4. Quiz

1. Introduction (Text:1)
Most of this material is derived from the text, Han, Kamber and Pei, Chapter 1, or the corresponding PowerPoint slides made available by the publisher. Where a source other than the text or its slides was used for the material, attribution is given. Unless otherwise stated, images are copyright of the publisher, Elsevier.

In this e-book we motivate and give a
broad outline of the course content and go into a bit more detail on
topics relevant to your assignment.

 

Copyright xkcd, https://xkcd.com/1429/

1.1. Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability
Automated data collection tools, database systems, Web, computerised society

Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, Web shopping…
Science: Remote sensing, bioinformatics, instrumentation and simulation, astronomy
Australia’s ASKAP now generating 5.2 terabytes of data per second [Conversation]

Society and everyone: news, digital cameras, YouTube, Twitter
Government: Medicare claims, Tax returns, cash transaction reports,  arrival cards, business surveys, census

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention” - hence data mining: the automated analysis of massive data sets

ACTION: Read this: Companies ignore data at their peril (see the full text here if the live link to AFR does not work for you): Get on Board

1.2. What is Data Mining?
Data mining (knowledge discovery from data)

Definition: Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

This can also be described as building models of data and evaluating the models for usefulness.

Alternative names: Knowledge
discovery in databases (KDD), knowledge extraction, data/pattern
analysis, data archaeology, data dredging, information harvesting,
business intelligence, machine learning, pattern recognition etc.

ACTION: Watch this (business-oriented) video on what is data mining?

Data Mining vs Machine Learning vs Statistics?

No strict distinction, but indicative of a research or business community. Machine Learning as induction, abduction and generalisation is older and arose as a branch of AI in the very early days of computing. Data mining 
(and KDD) arose from Database research in the early 90s.  Machine
learning may aim at more agent-oriented methods such as incremental,
active and bio-inspired learning. Data mining is more likely to be
batch- and large-data- oriented.  Both schools draw heavily on Statistics for
data characterisation. Some machine-learning/data-mining techniques
also arise in Statistics e.g. Naive Bayes, Decision trees.

Due
to the very mixed heritage of data mining techniques, you will see data
mining language arising from each of these three discipline areas, with
very often different names used for the same or a very similar idea.
For example, the very fundamental idea of an atomic chunk of data that
contributes to the description of some object of interest is variously
called an attribute, a feature, or a variable.
This course primarily uses the terminology of the text, which comes from a database background, but as a practising data miner you must be able to recognise and use the various terms interchangeably.

The professional data miner often needs to be multi-skilled.

The
following diagram illustrates that data mining is usually regarded as
only one element of KDD, although it is common enough to describe the
whole process as data mining.

Data Cleaning: Removes noise and inconsistent data

Data integration: Multiple data sources are combined

Data Selection: Relevant data is retrieved from a data warehouse

Data Transformation (not shown above): Data is transformed and consolidated by summary or aggregation operations

Data mining: Intelligent methods are applied to extract data patterns

Pattern evaluation: Interesting patterns are identified

Knowledge Presentation: Present mined knowledge to users via visualisation and knowledge representation.
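
As a concrete (and deliberately simplified) illustration, here is a minimal Python/pandas sketch of these steps on an invented table; the column names, values and the "mining" step itself are hypothetical and only meant to show the shape of the process.

import pandas as pd

# Hypothetical raw records (in practice these would be selected from a warehouse).
raw = pd.DataFrame({
    "postcode":     ["2600", "2600", "2600", "2601", "2601", "2602", "2602", None],
    "claim_amount": [110.0, 90.0, -20.0, 95.0, 85.0, 2900.0, 3100.0, 50.0],
})

# Data cleaning: drop missing values and remove obviously inconsistent records.
clean = raw.dropna()
clean = clean[clean["claim_amount"] >= 0]

# Data transformation: consolidate by aggregation (count and mean claim per postcode).
summary = clean.groupby("postcode")["claim_amount"].agg(["count", "mean"])

# "Data mining" placeholder: flag postcodes whose mean claim is unusually high.
flagged = summary[summary["mean"] > summary["mean"].mean() + summary["mean"].std()]

# Pattern evaluation / knowledge presentation: report candidates for human review.
print(flagged)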

Example: Web mining usually involves

Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored into knowledge-base

In Business Intelligence, the
warehouse, data cube and reporting are emphasised, but targeted at
(human) data exploration and presentation, not much on automated
mining. Note the attribution of tasks to occupational roles.

In Statistics and Machine Learning, the process might look more like this.

1.3. What makes a pattern useful?
One
of the challenges in data mining is that automated tools can typically
generate thousands or millions of patterns. Ultimately,  not all
can be used, so data miners and their tools need to be able to select interesting patterns for further study and possible application.

Typically a pattern is interesting if

it is easily understood by humans working in the domain of application
it is valid (true), with some required degree of certainty, on sample data and on real data in practice
it is potentially useful or actionable
it is novel, that is, not already well known, or else it validates a hypothesis that needed confirmation

An interesting pattern can be characterised as representing knowledge.

However, not all of these conditions are achieved by all mining methods, and not all these conditions can be met in automated ways.

Objective and automatable measures of pattern interestingness do exist, based on pattern structure and statistical measures. Commonly such a measure must exceed some user-selected threshold for a pattern to be considered interesting; alternatively, only some user-selected number of the best-ranked patterns by that measure are considered interesting.

For example, support and confidence are used to assess association rules. These indicate, respectively, how much of the data the rule has any relevance to, and how much of that relevant data it is valid for. Coverage and accuracy play similar roles for classification rules. The simplicity of the pattern is commonly assessed, too. The formulation of these objective measures may be specific to a mining method and is not directly comparable across methods.
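
To make the objective measures concrete, the following small Python sketch computes support and confidence for one association rule over a toy set of transactions (the items are invented):

# Toy transactions: each set holds the items present in one record.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]
antecedent, consequent = {"bread"}, {"butter"}

# Support: fraction of all transactions that contain both sides of the rule.
both = sum(1 for t in transactions if (antecedent | consequent) <= t)
support = both / len(transactions)

# Confidence: of the transactions containing the antecedent, the fraction that
# also contain the consequent, i.e. how often the rule holds when it applies.
has_antecedent = sum(1 for t in transactions if antecedent <= t)
confidence = both / has_antecedent

print(f"bread -> butter: support {support:.2f}, confidence {confidence:.2f}")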

Subjective measures
of interestingness, based on user beliefs about the data,  are
also generally required. These may be domain-dependent and it is
typically the task of a data miner (you!) to know enough about the
domain of the data and the proposed application of the mining results to
be able to apply judgement here.  Results will generally need to
be communicated and shared with other domain experts prior to accepting
and acting on them.

Subjectively, interesting patterns may be unexpected (I did not know this!), actionable (I can use that result to change what we do), or confirming (I had a hunch that might be true, but now I have evidence for it).

It is highly desirable for data mining methods to generate only interesting patterns for users. It is also efficient for data mining methods to use concepts of interestingness internally during the search for patterns, so that computations that would only produce uninteresting patterns need not be explored. This means that for a method to be a viable candidate for a data mining problem, the notion of interestingness built into the method must be aligned with the user’s concept of interestingness for that problem.

Interestingness, and how it
can be used for efficient mining, is an important aspect of every data
mining technique, and it affects the nature of problems to which the
technique can be applied.

 

ACTION: Throughout the course you will see how to use a wide range of well-known methods for data mining. At all times you must be aware of which methods are suitable for which kinds of problems and which kinds of data. You must be able to interpret the results of automated mining methods in the context of both objective and subjective evaluation criteria that are appropriate to the problem and the method. Pause and reflect for a moment on this, and plan to ask yourself throughout the course:

(1) For what kind of problem and data is this method suitable?

(2) How can I evaluate and interpret the results of this method?

1.4. What Kind of Data Can be Mined?
Well… any kind that can be represented in a machine.

This course focuses on database
data, data warehouse data and transactional data – the typical kind of
data likely to be held in government repositories.

We also lightly cover some of the more advanced and more complex data types:

Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Spatial data and spatiotemporal data
Text data
The World-Wide Web, particularly linked data

Other complex data areas that we will not cover are:

Multimedia data

Heterogeneous databases and legacy databases

Object-relational databases

1.5. Data Mining is a skilled craft
Here,
by a short video parable about gold mining, the convenor attempts to
explain the reality of the iterative, trial and error discovery process
that characterises data mining, while also demonstrating which parts of
the process are addressed in  the course.

IMDB: 9.5/10 Rotten Tomatoes: 99%

Credits: 

Producer, Director, Scriptwriter, Props, Costumes, Actor and Personal Assistant to each: Kerry Taylor
Sound, Lighting, Camera, Post-production: Evelyn Zhao
6.3 minutes

1.6. What kind of patterns can be mined?
The
following summarises patterns and corresponding  methods, most of
which  will be covered in detail through the course. 

Data Mining for Generalisation

Data warehousing
multidimensional data model
Data cube technology
OLAP (online analytical processing)

Multidimensional  class or concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
Data characterised by a user specified class description can be retrieved from a warehouse
Data discrimination can compare features of selected classes
OLAP operations can be used to summarise data along user-specified dimensions
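
As an illustration of summarising along user-specified dimensions, here is a minimal Python/pandas sketch that imitates a simple roll-up over a hypothetical fact table (the course itself uses dedicated data warehouse and OLAP tools for this):

import pandas as pd

# Hypothetical fact table: one row per sale, two dimensions (region, quarter), one measure.
sales = pd.DataFrame({
    "region":  ["ACT", "ACT", "NSW", "NSW", "NSW"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "amount":  [100, 150, 200, 120, 300],
})

# Roll-up along the two dimensions: one cell per (region, quarter), plus totals.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="amount", aggfunc="sum", margins=True)
print(cube)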

Frequent Patterns, Association and Correlation Analysis

Frequent patterns (or frequent itemsets)
What prescribed drugs are frequently taken together? What welfare payments are frequently received together?

Association, correlation vs. causality
A typical association rule: Tertiary Education -> Atheist  [10%, 20%]  (support, confidence)
Are strongly associated items also strongly correlated?

How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other applications?
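
A deliberately naive Python sketch of frequent-itemset mining over invented transactions is shown below; the efficient, Apriori-style pruning needed for large datasets is covered later in the course.

from itertools import combinations

# Invented transactions: the prescribed drugs recorded for each patient.
transactions = [
    {"aspirin", "statin"},
    {"aspirin", "statin", "insulin"},
    {"statin", "insulin"},
    {"aspirin", "statin"},
]
min_support = 0.5  # "frequent" = appears in at least half of the transactions
items = sorted(set().union(*transactions))

frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count / len(transactions) >= min_support:
            frequent[candidate] = count / len(transactions)

for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, round(support, 2))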

Classification

Classification and prediction 
Construct models (functions) based on some training examples (supervised learning)
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict unknown class labels

Typical methods – Decision
trees, naïve Bayesian classification, support vector machines, neural
networks, rule-based classification, pattern-based classification,
logistic regression, …
Typical applications – Credit
card or taxation fraud detection, direct marketing, classifying stars,
diseases,  web-pages, …
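
For illustration only, here is a minimal supervised-learning sketch in Python/scikit-learn showing the construct/predict/evaluate pattern on a standard dataset (the course's practical tool for this kind of task is Rattle/R):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled training examples: feature matrix X and known class labels y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Construct the model (function) from the training examples (supervised learning).
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Predict class labels for unseen records and evaluate on the held-out data.
predictions = model.predict(X_test)
print("accuracy on held-out data:", accuracy_score(y_test, predictions))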

Clustering/ Cluster Analysis

Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns, cluster youth in detention to
develop  targeted interventions.
Principle: Maximizing intra-cluster similarity & minimizing inter-cluster similarity
Many methods and applications
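
A matching unsupervised sketch, again in Python/scikit-learn, grouping unlabelled synthetic points with k-means:

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled synthetic data: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# Group the data into k clusters; no class labels are supplied.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centres:\n", kmeans.cluster_centers_)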

Outlier Analysis

Outlier: A data object that does not comply with the general behavior of the data
Noise or exception? ― One person’s garbage could be another person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis, who is slipping through the safety net?
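
A very simple outlier-detection sketch in plain Python, flagging values far from the mean of an invented sample (the threshold is a judgement call, not a universal rule):

import statistics

# Invented sensor readings; one value clearly departs from the general behaviour.
values = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 58.7, 12.3]

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag anything more than two standard deviations from the mean as a candidate outlier.
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print("candidate outliers:", outliers)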

Others

Sequence, trend, and evolution analysis (for temporal  or other sequence data)
Structure, network and text analysis, e.g. Web mining
Inductive logic programming e.g. linked data mining

1.7. Challenges in Data Mining
Basic Competency

It does not take a lot of skill to pick up a data mining tool, push its buttons to run some algorithms over some data, and get some output back again. But this approach will not provide the required insight into application problems and can be very harmful in some circumstances.

The competent data miner must be capable of 

Selecting the right tool for the
job. This will be informed by both the question being asked and the
nature of the data that is used;

Evaluating the output of the tool. What is it telling you? How good are the results? Is there a way of obtaining better results? What does “better” mean for the particular question being asked?; and
Interpreting the results in the
context of the question. What can be learnt from the results and 
for what purpose can that information be ethically and robustly
used?  

The following challenges are
indicative of the issues which professional data miners encounter and
for which data mining researchers attempt to create new methods.

Mining Methodology

Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Interdisciplinary skills
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining

Leveraging Human Knowledge

Goal-directed mining
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results

Efficiency and Scalability

Data reduction
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods

Diversity of Data Types

Handling complex types of data
Mining dynamic, networked, and global data repositories

Data Mining and Society

Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
ACTION: Watch this video on invisible data mining
ACTION (recommended): This book is a very readable and thoughtful presentation on how data mining can go wrong and what to look out for. Consider reading it! It is available in the University library. Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, 2016.

1.8. Privacy Implications of Data Mining (Text 13.4.2)
This material is derived and extended from Section 13.4.2 of the text.

Data mining often uses information about individual people.

People often feel that this
information is private to themselves, and only the person in question
has the right to decide who to share it with, and for what purpose.

People often share such private information willingly, or unconsciously but with consent.

The information may be shared for reasons quite independent of data mining, but it may then be mined as a secondary use.

e.g. credit card transactions,
health records, personal financial records, individual biological
traits, criminal or justice investigations, ethnicity, educational
performance.

Not all data mining uses personal data, although the data may nevertheless be confidential.

e.g. natural resources, climate and weather, water storage and floods, astronomy, geography, geology, biology

other scientific and engineering data

Access to individual records directly attributable to individual people is the most common source of concern. This data may leak in ways that have no connection to data mining (e.g. Web shopping). Nevertheless, when it is collected initially for the purposes of mining, or when mining is a secondary use, the data miner should be aware of their legal and ethical obligations to protect privacy.

Who has access to this data?

ACTION: Watch this video

Data Mining – How Companies Now Know Everything About You – Joel Stein – Time Magazine

Who controls and owns the data?

ACTION: Read  this short editorial    

Trust me, I’m an algorithm

Privacy Principles

The OECD (Organisation for Economic Co-operation and Development), of which Australia is a member, publishes the OECD Privacy Framework 2013. It provides some useful principles for personal data, summarised here:

Collection Limitation: Data should be collected by lawful and fair means, with the knowledge and consent of the individual where appropriate.

Data Quality: For the purpose that data is used, it should be relevant, accurate, complete and up-to-date.

Purpose Specification:
Data should be collected only for a specified purpose, and used for
that purpose (although alternative uses are conditionally possible).

Use Limitation: Data should not be disclosed or used for other purposes unless authorised by the person or by the law.

Security Safeguards: Data should be protected by reasonable security safeguards.

Openness Principle: 
Developments, practices and policies with respect to data should be
open, including the existence and nature of data, the main purposes of
use, and the identity and contact method for a data controller.

Individual Participation:
Individuals have the right to obtain confirmation of existence or
absence of data relating to them, to obtain such data, to challenge such
data, and if successful,  to have the data erased, rectified,
completed or amended.

Accountability Principle: A data controller is accountable for complying with the principles.

How to protect privacy when data mining?

Careless disclosure control by
data miners can be the cause of privacy breaches. Private data should be
managed securely, and privacy-preserving data mining can be used to
ensure that the results of data mining can be shared without sharing
personal data.

Data-security enhancing techniques include

Multilevel security model in databases – users are authorised to access only data that is classified at their security level, although this can still permit inference of hidden data.

Encryption – personal data may be encrypted and protected by signatures, biometrics, locational distribution.

Intrusion detection – actively monitor for breaches

Physical access control – keep the data off-network and secured in a physical space.

Privacy-preserving data mining techniques include

Randomization methods (a.k.a. perturbation) – noise large enough to hide the genuine values is added to the data (i.e. errors are introduced), but in a way that preserves the final results of data mining, e.g. aggregate distributions are preserved.
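
A small Python sketch of this idea on synthetic values: each released value is perturbed by noise, yet the aggregate mean is approximately preserved.

import numpy as np

rng = np.random.default_rng(42)
true_incomes = rng.normal(60_000, 15_000, size=10_000)  # synthetic sensitive values

# Add zero-mean noise large enough to mask any individual value.
noise = rng.normal(0, 20_000, size=true_incomes.shape)
released = true_incomes + noise

print("true mean:    ", round(true_incomes.mean()))
print("released mean:", round(released.mean()))      # close to the true mean
print("one record: true", round(true_incomes[0]), "-> released", round(released[0]))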

K-anonymity methods – where the granularity of data is reduced so that at least k individuals share the same data and no individual can be uniquely recognised (e.g. generalising street address to postcode).

L-diversity methods  – where granularity is reduced as for K-anonymity
but additionally values for particularly sensitive information within a
group are required to be diverse  (e.g. all salary brackets should
be represented in the group with the same postcode).
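
A minimal Python sketch of checking k-anonymity and l-diversity on an invented, already-generalised table; note that the second group below is k-anonymous but not l-diverse, which is exactly the weakness l-diversity addresses.

from collections import defaultdict

# Invented records, already generalised: street address reduced to postcode
# (the quasi-identifier), with salary bracket as the sensitive attribute.
records = [
    {"postcode": "2600", "salary_bracket": "low"},
    {"postcode": "2600", "salary_bracket": "high"},
    {"postcode": "2600", "salary_bracket": "medium"},
    {"postcode": "2912", "salary_bracket": "high"},
    {"postcode": "2912", "salary_bracket": "high"},
    {"postcode": "2912", "salary_bracket": "high"},
]

groups = defaultdict(list)
for r in records:
    groups[r["postcode"]].append(r["salary_bracket"])

k_required, l_required = 3, 2
for postcode, sensitive in groups.items():
    k_ok = len(sensitive) >= k_required        # k-anonymity: at least k people share the group
    l_ok = len(set(sensitive)) >= l_required   # l-diversity: sensitive values within the group vary
    print(postcode, "k-anonymous:", k_ok, "l-diverse:", l_ok)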

Distributed Privacy Preservation methods – large data sets are partitioned into subsets that are mined independently; limited sharing of the data mining results can then assure individual privacy while still enabling aggregate results.

Downgrading results – hide some of the results, such as association rules or clusters that apply to small numbers of individuals.

Differential Privacy – use special algorithms that are formally guaranteed to give approximately the same results on data sets that differ in only a small number of records, so that the presence or absence of an individual’s data does not affect the result significantly.
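
A minimal Python sketch of the Laplace mechanism, a standard way of answering a count query with differential privacy (epsilon and the data are illustrative):

import numpy as np

rng = np.random.default_rng()
ages = np.array([23, 31, 45, 52, 29, 67, 41, 38])  # synthetic individual-level data

def private_count(condition, epsilon):
    # Sensitivity of a count is 1: adding or removing one person changes it by at most 1,
    # so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    true_count = int(np.sum(condition))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon means more noise and a stronger privacy guarantee.
print("noisy count of people over 40:", round(private_count(ages > 40, epsilon=0.5), 1))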

Statistical Disclosure Control – implement governance procedures that ensure sensitive information is not released.

ACTION: Read this recommended paper about techniques to protect privacy and well-publicised failures in privacy protection: Ohm, Broken Promises of Privacy

ACTION: For a recent
Australian perspective, read this story on identification of medical
data released by the Australian Government: https://pursuit.unimelb.edu.au/articles/the-simple-process-of-re-identifying-patients-in-public-health-records  (December 2017)  and this response by the Office of the Australian Information Commissioner https://www.oaic.gov.au/privacy/privacy-decisions/investigation-reports/mbspbs-data-publication/ (March 2018). Who do you think is right here?


2. Popular Data Mining Tools
Please have a look over these popular data mining tools. This information was primarily derived from https://togaware.com/datamining/survivor/index.html and is copyright TogaWare.

Some Data Mining Tools

ACTION (blended only): Let us know if you are using one of these or something else in your workplace.

What data mining tools do you use in your workplace?

3. Reading and Exercises

For this week’s practical you are asked to learn a little about the main data mining tool we will be using often in this course.

ACTION: Firstly, please read this paper:

Williams, Rattle: A Data Mining GUI for R, The R Journal December 2009

ACTION (online and hybrid): Go back and complete the Week Zero course preparation if you have not done so already.

ACTION (blended only): Then, please have a look at Rattle via the following exercise.

Exercise: Install Rattle and get to know it

ACTION (on-campus only): Here are some instructions to download and install software on your own computer. It is not essential to do this, as all software is also provided in the CSIT computer laboratories. However, you may find this useful so that you can work on your course work and assignments locally from elsewhere. If you get stuck, seek assistance in your next lab.

Exercise: Install Rattle and get to know it

We will also be doing some data warehouse and OLAP exercises in the course.

ACTION (blended only): In order to use Computer Science lab facilities you will need to log in to Streams, once only. Please do so by following the link if you have never done it before. Then, please follow the instructions below to set up your environment to attempt the data warehouse lab exercises. Note that this is entirely repeated from COMP7240 Introduction to Databases, so it may be familiar to you already.

Exercise: Set Up for Data Warehouse Activities

ACTION (on-campus only): In order to use Computer Science lab facilities you will need to log in to Streams, once only. Please do so by following the link if you have never done it before.

Here are some instructions to download and install software on your own computer. It is not essential to do this, as all software is also provided in the CSIT computer laboratories. However, you may find this useful so that you can work on your course work and assignments locally from elsewhere. If you get stuck, seek assistance in your next lab.

Exercise: Set Up for Data Warehouse Activities

4. Quiz
ACTION: Take the on-line quiz. This
week’s quiz is primarily intended to get you accustomed to the quiz
format, although you may find it inspires you to go back and re-read
this week’s material. Your
quiz will be automatically marked.

It is intended to help you learn and to help you check how you are going. Every quiz contributes to marks for the course. You must attempt the quiz before the closing time to have it available for review at a later time.

Week 1 quiz: basic concepts