Site: Course: Book:
COMP3425/COMP8410 – Data Mining – Sem 1 2022 Introduction to the Course hybrid S1 2022
Printed by: Date:
Thursday, 9 June 2022, 5:58 PM
Description
This is the first topic for the course, inviting you to understand how the course works and what to expect.
1. Welcome
Course materials are generally organised in e-books, and you will see one or more e-books grouped into a section for each topic of the online course.
How to read the course materials: COMP3425/COMP8410 is being delivered in hybrid mode for all students in S1 2022. Students have options to participate remotely (also called online) or physically on campus. The same online materials are used for the undergraduate COMP3425 in on-campus mode, postgraduate COMP8410 in on-campus mode (and also online), and postgraduate COMP8410 in blended mode. In some places, advice and content are customised for one or two of these alternatives. Differentiated material will be identified in purple as u/g only, p/g only, blended only, on-campus only, or online only. Please be conscious of which of these categories apply to you.
ACTION: Read and follow this instruction:
Throughout the course materials, you will see remarks like this, in green. These are instructions for you to follow.
ACTION 1. Read the course outline — it is very important that you have this course outline in mind throughout the course. You are assumed to know this and a great deal of it is actually helpful. You can come back to this document at any time as it is easy to find right at the top of the course home page.
Course Outline
ACTION 2. Read the course schedule and keep it on hand throughout the course. You will need it to plan your days for attendance and assessment.
Course Schedule
ACTION 3. Check your knowledge of relational databases. You will need some basic understanding of relational databases and SQL in this Data Mining course.
Use this if you need a refresher for SQL
ACTION 4. Set up your computer for practical exercises.
You will first need to use these platforms in Week 3. So if you have problems, then do not panic, and seek help in your first lab class in Week 2.
This course has a strong practical component and you will need to use, at least, PostgreSQL, Rattle, R, Protege, and OWL-Miner (which relies on the Java runtime). You may also want to use additional data mining software of your choice, such as Python tools.
Firstly, if you have not done so already, you will need to register with Streams. Log in with your ANU user id and ANU password. If you change your ANU password during the course, you will need to come back and re-register with Streams to use lab facilities including the VDI.
Now, you may choose to (a) use the ANU-supplied Horizon VDI, through which all the required software is pre-installed; (b) use your own self-installed software in versions for Windows, Mac or Linux; (c) visit the physical laboratories in the CSIT and HN buildings on campus, where all software is installed; (d) use remote access to the CSIT and HN laboratories on campus; or (e) use any combination of the above. There is also an option (f) to use an Oracle VirtualBox VM (https://cs.anu.edu.au/docs/student-computing-environment/linuxlabs/softwareaccess/), but this may be slow on low-performance personal computers and ANU staff will not offer support for it. Be aware that the VDI (option (a)) offers an Ubuntu Linux-based desktop environment and may be difficult to use on low-throughput network connections, although it has been found to be trouble-free for most students and is likely to be the least-fuss remote option. Alternatively, self-installation has the advantage that you will be interacting with your own familiar O/S and, once you have invested the extra effort to install, you can keep the software for future work. Rattle has proven to be easy to install for most students, but there are always a few Mac users who never solve all their installation problems.
If you choose the VDI option, then follow the instructions here: https://cs.anu.edu.au/docs/student-computing-environment/linuxlabs/VDI/. You will need to reboot your machine part-way through, so remember to come back here afterwards. You will need to run GlobalProtect every time you subsequently use the Horizon VDI.
For the PostgreSQL database work, you can self-install PostgreSQL if you prefer, without assistance from the course staff. You will be able to complete all the PostgreSQL work by suitably adapting the instructions given here. For all other options, to set up for database work you will need to follow the instructions here to use the PostgreSQL server on Partch.
Practical Exercise: Set Up for Data Warehouse Activities
If you choose to self-install Rattle, then follow the instructions here. If you choose another access method instead, then you will also need to follow these instructions, beginning with “Start Rattle”, as you do not need to install first.
Practical Exercise: Install Rattle and get to know it
The remaining course software is available in the labs and through the VDI or self-installation, but we need not be concerned with that until later in the course.
ACTION 5. Enrol in a lab class. Enrolment opens 11am on the first day of the course.
Tutorial/Laboratory Groups
Foundational and Introductory topics
Table of contents
1. Introduction (Text:1)
1.1. Why Data Mining?
1.2. What is Data Mining?
1.3. What makes a pattern useful?
1.4. What Kind of Data Can be Mined?
1.5. Data Mining is a skilled craft
1.6. What kind of patterns can be mined?
1.7. Challenges in Data Mining
1.8. Privacy Implications of Data Mining (Text 13.4.2)
2. Popular Data Mining Tools
3. Reading and Exercises
1. Introduction (Text:1)
Most of this material is derived from the text, Han, Kamber and Pei, Chapter 1, or the corresponding PowerPoint slides made available by the publisher. Where a source other than the text or its slides was used for the material, attribution is given. Unless otherwise stated, images are copyright of the publisher, Elsevier.
In this e-book we motivate and give a broad outline of the course content and go into a bit more detail on topics relevant to your assignment.
Copyright xkcd, https://xkcd.com/1429/
Introduction to Data Mining
1.1. Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerised society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, Web shopping…
Science: Remote sensing, bioinformatics, instrumentation and simulation, astronomy
Australia’s ASKAP now generating 5.2 terabytes/sec (ie 1 024 Gb/s) [Conversation]
Society and everyone: news, digital cameras, YouTube, Twitter
Government: Medicare claims, Tax returns, cash transaction reports, arrival cards, business surveys, census
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: hence data mining, the automated analysis of massive data sets
ACTION: Read this: Companies ignore data at their peril (see the full text here if the live link to AFR does not work for you): Get on Board
1.2. What is Data Mining?
Data mining (knowledge discovery from data)
Definition: Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
This can also be described as building models of data and evaluating the models for usefulness.
Alternative names: Knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archaeology, data dredging, information harvesting, business intelligence, machine learning, pattern recognition etc.
ACTION: Watch this (business-oriented) video on what is data mining?
Data Mining vs Machine Learning vs Statistics?
There is no strict distinction; the choice of term is mostly indicative of a research or business community. Machine Learning as induction, abduction and generalisation is older and arose as a branch of AI in the very early days of computing. Data mining (and KDD) arose from Database research in the early 1990s. Machine learning may aim at more agent-oriented methods such as incremental, active and bio-inspired learning. Data mining is more likely to be batch- and large-data-oriented. Both schools draw heavily on Statistics for data characterisation. Some machine-learning/data-mining techniques also arise in Statistics, e.g. Naive Bayes and decision trees.
Due to the very mixed heritage of data mining techniques, you will see data mining language arising from each of these three discipline areas, very often with different names used for the same or a very similar idea. For example, the very fundamental idea of an atomic chunk of data that contributes to the description of some object of interest is variously called an attribute, a feature, or a variable. This course mainly uses the language of the text, which is primarily derived from a database background, but as a practising data miner, you must be able to recognise and interchangeably use the various terms.
The professional data miner often needs to be multi-skilled:
The following diagram illustrates that data mining is usually regarded as only one element of KDD, although it is common enough to describe the whole process as data mining.
Data Cleaning: Removes noise and inconsistent data
Data integration: Multiple data sources are combined
Data Selection: Relevant data is retrieved from a data warehouse
Data Transformation (not shown above): Data is transformed and consolidated by summary or aggregation operations
Data mining: Intelligent methods are applied to extract data patterns
Pattern evaluation: Interesting patterns are identified
Knowledge Presentation: Present mined knowledge to users via visualisation and knowledge representation.
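The steps listed above can be sketched as a chain of small functions. This is only an illustrative toy, not the course's tooling: the records, the function names and the "mining" step (a simple frequency count) are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [
    {"id": 1, "age": 34, "region": "ACT"},
    {"id": 2, "age": None, "region": "NSW"},
    {"id": 3, "age": 51, "region": "ACT"},
]
source_b = [
    {"id": 4, "age": 29, "region": "act"},  # inconsistent letter case
    {"id": 5, "age": 47, "region": "NSW"},
]


def clean(records):
    """Data cleaning: drop incomplete records and normalise region codes."""
    return [
        {**r, "region": r["region"].upper()}
        for r in records
        if all(value is not None for value in r.values())
    ]


def integrate(*sources):
    """Data integration: combine multiple sources into one collection."""
    return [r for source in sources for r in source]


def select(records, region):
    """Data selection: retrieve only the records relevant to the task."""
    return [r for r in records if r["region"] == region]


def transform(records):
    """Data transformation: consolidate exact ages into coarse bands."""
    return ["under 40" if r["age"] < 40 else "40 and over" for r in records]


def mine(values):
    """Data mining (a stand-in): count how often each value occurs."""
    return Counter(values)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a user-chosen threshold."""
    return {p: c for p, c in patterns.items() if c >= min_count}


records = select(integrate(clean(source_a), clean(source_b)), region="ACT")
patterns = evaluate(mine(transform(records)), min_count=2)

# Knowledge presentation: report the surviving patterns to the user.
for pattern, count in patterns.items():
    print(f"{pattern}: {count} records")
```

In practice the process is iterative rather than a single pass: results from evaluation commonly send the data miner back to earlier steps.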
Example: Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored into knowledge-base
In Business Intelligence, the warehouse, data cube and reporting are emphasised, but targeted at (human) data exploration and presentation, not much on automated mining. Note the attribution of tasks to occupational roles.
In Statistics and Machine Learning, the process might look more like this.
1.3. What makes a pattern useful?
One of the challenges in data mining is that automated tools can typically generate thousands or millions of patterns. Ultimately, not all can be used, so data miners and their tools need to be able to select interesting patterns for further study and possible application. Typically a pattern is interesting if
it is easily understood by humans working in the domain of application
it is valid (true), with some required degree of certainty, on sample data and real data in practice
it is potentially useful or actionable
it is novel, that is not already well known or else it validates a hypothesis that needed confirmation
An interesting pattern can be characterised as representing knowledge.
However, not all of these conditions are achieved by all mining methods, and not all these conditions can be met in automated ways.
Objective and automatable measures of pattern interestingness do exist, based on pattern structure and statistical measures. Commonly such a measure would need to exceed some user-selected threshold to be considered interesting, or alternatively, just some user-selected number of the best-ranked patterns by that measure are considered interesting.
For example, support and confidence are used to assess association rules. These indicate respectively how much of the data the rule has any relevance to, and how much of the relevant data it is valid for. Coverage and accuracy have respectively similar roles for classification rules. The simplicity of the pattern is commonly assessed, too. The formulation of these objective measures may be specific to a mining method and is not directly comparable across methods.
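As a minimal sketch of how support and confidence might be computed and compared with user-selected thresholds, consider the following. It uses the common textbook formulation (support is the fraction of all transactions containing both sides of the rule; confidence is the fraction of transactions containing the antecedent that also contain the consequent). The transactions, thresholds and function name are made up for illustration.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions to which the rule applies (they contain the antecedent).
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions for which the rule holds (both sides are present).
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

support, confidence = support_and_confidence(transactions, {"bread"}, {"milk"})

# User-selected thresholds (assumed values) decide what counts as interesting.
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.7
is_interesting = support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE
print(f"support={support:.2f} confidence={confidence:.2f} "
      f"interesting={is_interesting}")
```

Here three of the five transactions contain both bread and milk (support 0.60), and three of the four bread transactions also contain milk (confidence 0.75), so the rule passes both thresholds.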
Subjective measures of interestingness, based on user beliefs about the data, are also generally required. These may be domain-dependent and it is typically the task of a data miner (you!) to know enough about the domain of the data and the proposed application of the mining results to be able to apply judgement here. Results will generally need to be communicated and shared with other domain experts prior to accepting and acting on them.
Subjectively, interesting patterns may be unexpected (I did not know this!), actionable (I can use that result to change what we do), or confirming (I had a hunch that might be true, but now I have evidence for it).
It is highly desirable for data mining methods to generate only interesting patterns for users. It is also efficient for data mining methods to use concepts of interestingness internally during the search for patterns, so that computations that will result in uninteresting patterns need not be explored. This means that in order for a method to be a viable candidate for a data mining problem, the notion of interestingness built in to the method must be aligned to the user’s concept of interestingness associated with the problem.
Interestingness, and how it can be used for efficient mining, is an important aspect of every data mining technique, and it affects the nature of problems to which the technique can be applied.
ACTION: Throughout the course you will see how to use a wide range of well-known methods for data mining. At all times you must be aware of which methods are suitable for which kinds of problems and which kinds of data. You must be able to interpret the results of automated mining methods in the context of both objective and subjective evaluation criteria that are appropriate to the problem and the method. Pause and reflect for a moment on this, and plan to ask yourself throughout the course:
(1) For what kind of problem and data is this method suitable?
(2) How can I evaluate and interpret the results of this method?
1.4. What Kind of Data Can be Mined?
Well… any kind that can be represented in a machine.
This course focuses on database data, data warehouse data and transactional data – the typical kind of data likely to be held in government repositories. We also lightly cover some of the more advanced and more complex data types:
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Spatial data and spatiotemporal data
The World-Wide Web, particularly linked data
Other complex data areas that we will not cover are:
Multimedia data
Heterogeneous databases and legacy databases
Object-relational databases
1.5. Data Mining is a skilled craft
Here, by a short video parable about gold mining, the convenor attempts to explain the reality of the iterative, trial and error discovery process that characterises data mining, while also demonstrating which parts of the process are addressed in the course.
IMDB: 9.5/10 Rotten Tomatoes: 99%
Producer, Director, Scriptwriter, Props, Costumes, Actor and Personal Assistant to each : Sound, Lighting, Camera, Post-production:
6.3 minutes
1.6. What kind of patterns can be mined?
The following summarises patterns and corresponding methods, most of which will be covered in detail through the course.
Data Mining for Generalisation
Data warehousing
Multidimensional data model
Data cube technology
OLAP (online analytical processing)
Multidimensional class or concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
Data characterised by a user-specified class description can be retrieved from a warehouse
Data discrimination can compare features of selected classes
OLAP operations can be used to summarise data along user-specified dimensions
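To make summarising along user-specified dimensions concrete, here is a small sketch of a roll-up over a toy fact table. The table contents, the dimension names and the `roll_up` helper are hypothetical; a real warehouse would do this with SQL or a data cube engine rather than plain Python.

```python
from collections import defaultdict

# Hypothetical fact table: (region, year, product, sales).
DIMENSIONS = ("region", "year", "product")
facts = [
    ("dry", 2021, "pumps", 120),
    ("dry", 2021, "tanks", 80),
    ("dry", 2022, "pumps", 150),
    ("wet", 2021, "pumps", 30),
    ("wet", 2022, "tanks", 20),
]


def roll_up(facts, keep):
    """Sum the sales measure, grouped by the dimensions named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


# Summarise along one dimension, then drill down to two.
by_region = roll_up(facts, ["region"])
by_region_year = roll_up(facts, ["region", "year"])

print(by_region)
print(by_region_year)
```

Comparing the `dry` and `wet` totals is a very simple instance of contrasting the characteristics of two classes.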
Frequent Patterns, Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What prescribed drugs are frequently taken together? What welfare payments are frequently received together?
Association, correlation vs. causality
A typical association rule: Tertiary Education -> Atheist [10%, 20%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large datasets?
How to use such patterns for classification, clustering, and other applications?
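One way to probe whether a strongly associated rule is also a correlated one is lift, a commonly used correlation measure: the rule's confidence divided by the overall frequency of the consequent. The sketch below is illustrative only; the counts are invented and `rule_measures` is a hypothetical helper.

```python
def rule_measures(n_total, n_a, n_b, n_both):
    """Return (support, confidence, lift) for the rule A -> B from raw counts.

    n_total: number of transactions
    n_a, n_b: transactions containing A and B respectively
    n_both: transactions containing both A and B
    """
    support = n_both / n_total
    confidence = n_both / n_a
    # Lift compares the rule's confidence with how common B is anyway.
    lift = confidence / (n_b / n_total)
    return support, confidence, lift


# Made-up counts: A and B are each popular, and often occur together.
support, confidence, lift = rule_measures(
    n_total=10_000, n_a=6_000, n_b=7_500, n_both=4_000
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

With these counts the rule looks strong (support 0.40, confidence about 0.67), yet its lift is below 1: B occurs in 75% of all transactions but in only about 67% of those containing A, so A and B are in fact negatively correlated. Neither association nor correlation establishes causality.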
Classification
Classification and prediction
Construct models (functions) based on some training examples (supervised learning)
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict unknown class labels
Typical methods – Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
Typical applications – Credit card or taxation fraud detection, direct marketing, classifying stars, diseases, web-pages, …
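The idea of constructing a model from labelled training examples and then predicting unknown class labels can be sketched with about the simplest classifier possible: a single threshold on one attribute (a one-level decision tree, sometimes called a decision stump). The training data, class names and `train_stump` function are assumptions for illustration, not a method prescribed by the course.

```python
def train_stump(examples):
    """Learn a threshold rule from (value, label) pairs with two classes.

    Tries every midpoint between adjacent distinct values, in both label
    orientations, and keeps the rule with the best training accuracy.
    """
    values = sorted({value for value, _ in examples})
    labels = sorted({label for _, label in examples})
    if len(values) < 2 or len(labels) != 2:
        raise ValueError("need two classes and at least two distinct values")

    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        for below, above in (labels, labels[::-1]):
            correct = sum(
                (below if value < threshold else above) == label
                for value, label in examples
            )
            if best is None or correct > best[0]:
                best = (correct, threshold, below, above)

    _, threshold, below, above = best

    def classify(value):
        return below if value < threshold else above

    return classify, threshold


# Hypothetical training set: cars described by gas mileage (miles per gallon).
training = [
    (12, "guzzler"), (15, "guzzler"), (18, "guzzler"),
    (28, "economy"), (33, "economy"), (40, "economy"),
]

classify, threshold = train_stump(training)
print(threshold, classify(20), classify(35))
```

Accuracy on the training data alone says little about how the model will perform on unseen data; evaluating on separate test examples is the usual remedy.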
Clustering/ Cluster Analysis
http://homepage.smc.edu/grippo_alessand
Unsupervised learning (i.e., Class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns, cluster youth in detention to develop targeted interventions.
Principle: Maximizing intra-cluster similarity & minimizing inter-cluster similarity
Many methods and applications
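As one example among those many methods, the following is a minimal sketch of k-means on one-dimensional data. It groups points so that each is closest to its own cluster's mean, which is one way of pursuing high intra-cluster similarity. The house prices and the starting centroids are invented, and fixing the initial centroids by hand is a simplification to keep the example deterministic.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster numbers around the given starting centroids (basic k-means)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(point - centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical house prices in thousands of dollars; no class labels given.
prices = [300, 320, 310, 900, 950, 920]
centroids, clusters = k_means_1d(prices, centroids=[300, 900])
print(centroids)
print(clusters)
```

No labels are supplied: the two groups emerge from the data alone, which is what makes this unsupervised.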
Outlier Analysis
Outlier: A data object that does not comply with the general behavior of the data
Noise or exception? One person’s garbage could be another person’s treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis, who is slipping through the safety net?
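A very simple way to flag candidate outliers, shown here only as a sketch, is to measure how far each value lies from the median in units of the median absolute deviation (MAD). Note that this is a basic robust-statistics rule and not one of the clustering or regression by-product methods mentioned above; the data, the cut-off `k` and the function name are assumptions.

```python
from statistics import median


def mad_outliers(data, k=5.0):
    """Return values more than k median-absolute-deviations from the median."""
    centre = median(data)
    mad = median(abs(x - centre) for x in data)
    if mad == 0:
        # Degenerate case: at least half the values equal the median.
        return [x for x in data if x != centre]
    return [x for x in data if abs(x - centre) > k * mad]


# Hypothetical claim amounts; one of them is far from the rest.
claims = [52, 48, 50, 51, 49, 47, 53, 250]
outliers = mad_outliers(claims)
print(outliers)
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) is a judgement that depends on the application.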
Sequence, trend, and evolution analysis (for temporal or other sequence data)
Structure, network and text analysis, e.g. Web mining
Inductive logic programming e.g. linked data mining
1.7. Challenges in Data Mining
Basic Competency
It does not take a lot of skill to pick up a data mining tool and to push its buttons to run some algorithms over some data and to get some output back again. But this approach will not provide the required insight into application problems and can be very harmful in some circumstances.
The competent data miner must be capable of
Selecting the right tool for the job. This will be informed by both the question being asked and the nature of the data that is used;
Evaluating the output of the tool. What is it telling you? How good are the results? Is there a way of obtaining better results? What does “better” mean for the particular question being asked?; and
Interpreting the results in the context of the question. What can be learnt from the results and for what purpose can that information be ethically and robustly used?
The following challenges are indicative of the issues which professional data miners encounter and for which data mining researchers attempt to create new methods.
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Interdisciplinary skills
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
Leveraging Human Knowledge
Goal-directed mining
Interactive mining