COMP9318 Review
Wei Wang @ UNSW
June 4, 2018
Course Logistics
- THE formula:
  mark = 0.55 · exam + 0.15 · (ass1 + proj1 + lab)
  mark = FL, if exam < 40
  lab = avg(best 3 of (lab1, lab2, lab3, lab4, lab5))
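As a worked illustration of the formula, here is a minimal Python sketch; it assumes every component is marked out of 100, and the function and argument names are made up, not official.

```python
# A minimal sketch of the mark formula; assumes all components are on a
# 0-100 scale. Names are illustrative only.
def course_mark(exam, ass1, proj1, labs):
    # lab = average of the best 3 of the 5 lab marks
    lab = sum(sorted(labs, reverse=True)[:3]) / 3
    if exam < 40:
        return "FL"  # fail regardless of the other components
    return 0.55 * exam + 0.15 * (ass1 + proj1 + lab)

print(course_mark(exam=60, ass1=70, proj1=80, labs=[50, 60, 70, 80, 90]))  # 67.5
```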
- proj1 and ass1 will be marked ASAP; we aim to deliver the results before the exam.
- Pre-exam consultations:
  - 15 Jun: 1500–1700, K17-508
  - 18 Jun: 1200–1400, K17-508
- Course feedback: via comments in the course survey or private messages to me on the forum. We are particularly interested in aspects such as coverage, difficulty level, use of Python/Jupyter, the project, and the background required.
Note
(1) The final exam mark is important and you must achieve at least 40!
(2) The supplementary exam is only for those who cannot attend the final exam.
About the Final Exam
- Time: 1345–1600, 19 Jun 2018 (Tue); 10 minutes reading time + a 2 hr closed-book exam.
- Accessories: a UNSW-approved calculator. Note: watches are prohibited.
- Designed to test your understanding of, and familiarity with, the core content of the course.
- Answer 1 + 6 questions out of 9:
  - Q1 is short-answer (you can use your own words) and compulsory.
  - Choose 6 from Q2 to Q9; these will require some “calculation” (i.e., similar to tute/ass questions).
About the Final Exam /2
- Read the instructions carefully.
- Use your time wisely: don’t spend too much time if you get stuck on one question, and don’t write excessively long answers to Q1.
Tips
(1) Write down intermediate steps. (2) Know how to do log2(x) on your calculator. (3) Work on “easy” questions first (but start each answer on a new page in the booklet).
Disclaimer
We will go through the main contents of each lecture. Note, however, that this review is by no means exhaustive.
Introduction
- DM vs. KDD
- Steps of KDD; iterative in nature; results need to be validated.
- Database (efficiency) vs. machine learning (effectiveness) vs. statistics (validity)
- Being able to cast a real problem as a data mining problem.
Data Warehousing and OLAP
- Understand the four characteristics of a DW (DW vs. Data Mart)
- Differences between OLTP and OLAP
- Multidimensional data model; data cube:
  - fact, dimension, measure, hierarchies
  - cuboid, cube lattice
  - three types of schemas
  - four typical OLAP operations
- ROLAP/MOLAP/HOLAP
- Query processing methods for OLAP servers, including the BUC cubing algorithm (see the sketch below).
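To make the BUC-style bottom-up cubing idea concrete, here is a much-simplified sketch, assuming a COUNT measure and an absolute minimum-support threshold; it omits the dimension-ordering heuristics covered in the lecture, and all names are illustrative.

```python
# Simplified bottom-up cubing with Apriori-style pruning (BUC idea):
# recursively partition the input on each remaining dimension, and only
# descend into partitions that meet the minimum support.
from collections import defaultdict

def buc(tuples, dims, minsup, prefix=(), out=None):
    if out is None:
        out = []
    out.append((prefix, len(tuples)))  # emit the current (partial) cell
    for i, d in enumerate(dims):
        parts = defaultdict(list)      # partition the input on dimension d
        for t in tuples:
            parts[t[d]].append(t)
        for val, part in parts.items():
            if len(part) >= minsup:    # prune infrequent partitions early
                buc(part, dims[i + 1:], minsup, prefix + ((d, val),), out)
    return out

data = [('a', 'x', 1), ('a', 'y', 1), ('b', 'x', 2)]
for cell, count in buc(data, dims=[0, 1, 2], minsup=2):
    print(cell, count)
```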
NOT needed:
- Designing good DW schemas and performing ETL from operational data sources to the DW tables.
Linear Algebra
- Column vectors; linear combinations; basis vectors; span
- Matrix-vector multiplication
- Eigenvalues and eigenvectors
- SVD: the general idea (see the sketch below).
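A quick numpy illustration of the eigenvector and SVD facts above; the matrix is made up.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# eigenvalues/eigenvectors: A v = lambda v
vals, vecs = np.linalg.eig(A)
v = vecs[:, 0]
print(A @ v, vals[0] * v)  # the two sides should match

# SVD: A = U diag(s) V^T
U, s, Vt = np.linalg.svd(A)
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True
```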
Data Preprocessing
- Understand that real data is “dirty” (incomplete, noisy, inconsistent)
- How to handle missing data?
- How to normalize data?
- How to handle noisy data? the different binning/histogram methods (including V-optimal and MaxDiff); see the sketch after this list
- How to discretize data?
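As a concrete illustration of normalization and the two simplest binning schemes (V-optimal and MaxDiff are not shown), here is a minimal sketch on made-up data.

```python
# Min-max normalization plus equal-width and equal-depth binning
# (smoothing by bin means); the data values are illustrative only.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

# min-max normalization to [0, 1]
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

# equal-width binning: each bin spans the same value range
k = 3
width = (hi - lo) / k
eq_width = [min(int((x - lo) / width), k - 1) for x in data]

# equal-depth binning: each bin holds the same number of values,
# then smooth by replacing each value with its bin mean
depth = len(data) // k
bins = [data[i * depth:(i + 1) * depth] for i in range(k)]
smoothed = [sum(b) / len(b) for b in bins]
print(eq_width)
print(smoothed)  # bin means: [9.0, 22.75, 29.25]
```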
NOT needed:
- Feature selection and reduction (e.g., PCA, Random Projection, t-SNE)
Classification and Prediction
- Classification basics:
  - overfitting/underfitting; cross-validation
  - classification vs. prediction; vs. clustering (unsupervised learning); eager learning vs. lazy learning (instance-based learning)
- Decision trees:
  - the ID3 algorithm
  - decision tree pruning
  - deriving rules from the decision tree
  - the CART algorithm (with the gini index); see the impurity sketch below
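A small sketch of the impurity measures behind ID3 (entropy / information gain) and CART (the gini index); the 9-positive / 5-negative counts are the classic weather data, used purely for illustration.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def info_gain(parent, children):
    # gain = parent entropy - weighted average of child entropies
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(entropy([9, 5]))                     # ~0.940
print(gini([9, 5]))                        # ~0.459
print(info_gain([9, 5], [[6, 1], [3, 4]])) # ~0.152
```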
- Naive Bayes classifier:
  - smoothing
  - two ways to apply NB to text data
- Logistic regression/MaxEnt classifier: maximum likelihood estimation of the model parameters + regularization; gradient ascent (see the sketch after this list).
- SVM: the main idea; the optimization problem in the primal form; the decision function in the dual form; kernels
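For gradient ascent on logistic regression, here is a minimal numpy sketch that maximizes the L2-regularized log-likelihood; the toy data, learning rate, and regularization strength are all made up.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, lam=0.01, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        # gradient of the log-likelihood is X^T (y - p); subtract the
        # gradient of the L2 penalty (lam * w)
        grad = X.T @ (y - p) - lam * w
        w += lr * grad  # ascend: step in the gradient direction
    return w

# toy 1-D data with a bias column
X = np.array([[1, 0.5], [1, 1.0], [1, 3.0], [1, 4.0]])
y = np.array([0, 0, 1, 1])
w = fit_logreg(X, y)
print(sigmoid(X @ w))  # probabilities close to y
```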
Cluster Analysis
- Clustering criteria: minimize intra-cluster distance + maximize inter-cluster distance
- Distance/similarity:
  - how to deal with different types of variables
  - distance functions: the Lp family (see the sketch below)
  - metric distance functions
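A tiny sketch of the Lp (Minkowski) family on made-up points: p = 1 gives Manhattan distance, p = 2 Euclidean, and p → ∞ the max (Chebyshev) distance.

```python
def lp_dist(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0, 0), (3, 4)
print(lp_dist(x, y, 1))                       # 7.0 (Manhattan)
print(lp_dist(x, y, 2))                       # 5.0 (Euclidean)
print(max(abs(a - b) for a, b in zip(x, y)))  # 4   (L-infinity)
```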
Cluster Analysis /2
- Partition-based clustering: k-means (algorithm, advantages, disadvantages, . . . ); see the sketch after this list
- Hierarchical clustering: agglomerative; single-link / complete-link / group-average hierarchical clustering
- Graph-based clustering: the unnormalized graph Laplacian and its semantics; overview of the spectral clustering algorithm; embedding.
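A minimal numpy sketch of Lloyd's k-means algorithm; the initialization (the first k points) is deliberately naive and just for illustration.

```python
import numpy as np

def kmeans(X, k, iters=100):
    centers = X[:k].copy()  # naive initialization
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):  # converged
            break
        centers = new
    return centers, labels

X = np.array([[1.0, 1], [1.5, 2], [8, 8], [9, 9]])
centers, labels = kmeans(X, k=2)
print(centers, labels)  # two clusters: {first two points}, {last two}
```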
Association Rule Mining
- Concepts:
  - input: a transaction DB
  - output: (1) frequent itemsets (via minsup); (2) association rules (via minconf)
- The Apriori algorithm:
  - the Apriori property (2 versions)
  - the Apriori algorithm itself (see the sketch below)
  - how to find the frequent itemsets?
  - how to derive the association rules?
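A compact sketch of Apriori on a toy transaction DB; minsup is an absolute count here, and the join and prune steps are folded into one candidate-generation loop for brevity.

```python
from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    freq, k_sets = {}, [frozenset([i]) for i in sorted(items)]
    while k_sets:
        # count candidates and keep the frequent ones
        counts = {c: sum(c <= t for t in transactions) for c in k_sets}
        level = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(level)
        # generate (k+1)-candidates whose k-subsets are all frequent
        # (the Apriori property)
        prev = list(level)
        k = len(prev[0]) + 1 if prev else 0
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        k_sets = [c for c in cands
                  if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return freq

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b', 'c'}]
for itemset, sup in sorted(apriori(db, minsup=3).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
```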
Association Rule Mining /2
- FP-growth algorithm:
  - how to mine the frequent itemsets using FP-trees?
  - deriving association rules from the frequent itemsets (see the sketch below).
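And a minimal sketch of deriving rules from frequent itemsets via confidence, conf(A → B) = sup(A ∪ B) / sup(A); the `freq` map mirrors the output of the Apriori sketch above.

```python
from itertools import combinations

def derive_rules(freq, minconf):
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        # try every non-empty proper subset as the rule's left-hand side
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = sup / freq[lhs]
                if conf >= minconf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

freq = {frozenset('a'): 4, frozenset('b'): 4, frozenset('ab'): 3}
for lhs, rhs, conf in derive_rules(freq, minconf=0.7):
    print(lhs, '->', rhs, round(conf, 2))  # a -> b and b -> a, conf 0.75
```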
Thank You and Good Luck!