
COMP9318 Review

Wei Wang @ UNSW

June 4, 2018

Course Logistics

- The formula (a worked sketch follows this section):

    mark = 0.55 · exam + 0.15 · (ass1 + proj1 + lab)
    mark = FL, if exam < 40
    lab = avg(best 3 of {lab1, lab2, lab3, lab4, lab5})

- proj1 and ass1 will be marked ASAP; we aim to deliver the results before the exam.
- Pre-exam consultations:
  - 15 Jun: 1500–1700, K17-508
  - 18 Jun: 1200–1400, K17-508
- Course feedback: via comments in the course survey or private messages to me on the forum. We are particularly interested in aspects such as coverage, difficulty level, use of python/Jupyter, the project, and the background required.

Note
(1) The final exam mark is important: you must achieve at least 40!
(2) The supplementary exam is only for those who cannot attend the final exam.
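A minimal Python sketch of the formula above; every component mark below is hypothetical, and only the weights, the best-3 lab rule, and the FL condition come from the slide.

    # Hypothetical component marks, all out of 100 (illustrative only).
    exam, ass1, proj1 = 60.0, 70.0, 80.0
    labs = [75.0, 80.0, 85.0, 65.0, 90.0]              # lab1 .. lab5

    lab = sum(sorted(labs, reverse=True)[:3]) / 3      # average of the best 3 labs
    mark = "FL" if exam < 40 else 0.55 * exam + 0.15 * (ass1 + proj1 + lab)
    print(mark)  # 0.55*60 + 0.15*(70 + 80 + 85) = 33 + 35.25 = 68.25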
About the Final Exam

- Time: 1345–1600, 19 Jun 2018 (Tue); 10 minutes reading time + 2 hr closed-book exam.
- Accessories: UNSW Approved Calculator. Note: watches are prohibited.
- Designed to test your understanding of, and familiarity with, the core content of the course.
- Answer 1 + 6 questions out of 9 questions:
  - Q1: short answers (you can use your own words); compulsory.
  - Choose 6 from Q2 to Q9; these will require some “calculation” (i.e., similar to tute/ass questions).

About the Final Exam /2

- Read the instructions carefully.
- Use your time wisely: don't spend too much time stuck on one question, or writing an excessively long answer to Q1.

Tips
(1) Write down intermediate steps.
(2) Know how to compute log2(x) on your calculator.
(3) Work on “easy” questions first (but start each answer on a new page in the booklet).

Disclaimer
We will go through the main contents of each lecture. However, note that this review is by no means exhaustive.

Introduction

- DM vs. KDD
- Steps of KDD; iterative in nature; results need to be validated.
- Database (efficiency) vs. machine learning (effectiveness) vs. statistics (validity).
- Be able to cast a real problem as a data mining problem.

Data Warehousing and OLAP

- Understand the four characteristics of a DW (DW vs. Data Mart).
- Differences between OLTP and OLAP.
- Multidimensional data model; data cube:
  - fact, dimension, measure, hierarchies
  - cuboid, cube lattice
  - three types of schemas
  - four typical OLAP operations
  - ROLAP/MOLAP/HOLAP
- Query processing methods for OLAP servers, including the BUC cubing algorithm.

NOT needed:
- Designing good DW schemas and performing ETL from operational data sources to the DW tables.

Linear Algebra

- Column vectors; linear combinations; basis vectors; span.
- Matrix–vector multiplication.
- Eigenvalues and eigenvectors.
- SVD: general idea.

Data Preprocessing

- Understand that real data is “dirty” (incomplete, noisy, inconsistent).
- How to handle missing data?
- How to normalize the data?
- How to handle noisy data? Different binning/histogram methods (including V-optimal and MaxDiff).
- How to discretize data?

NOT needed:
- Feature selection and reduction (e.g., PCA, Random Projection, t-SNE).

Classification and Prediction

- Classification basics:
  - overfitting/underfitting; cross-validation
  - classification vs. prediction; vs. clustering (unsupervised learning); eager learning vs. lazy learning (instance-based learning)
- Decision trees (a worked entropy/information-gain sketch follows this section):
  - the ID3 algorithm
  - decision tree pruning
  - deriving rules from the decision tree
  - the CART algorithm (with the Gini index)
- Naive Bayes classifier:
  - smoothing
  - two ways to apply NB to text data
- Logistic regression/MaxEnt classifier; maximum likelihood estimation of the model parameters + regularization; gradient ascent.
- SVM: main idea; the optimization problem in the primal form; the decision function in the dual form; kernels.
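Since the “calculation” questions often hinge on entropy and information gain (and on doing log2 on your calculator), here is a minimal Python sketch of the ID3 split criterion. It is not from the slides: the toy counts (9 “yes” vs. 5 “no”, mirroring the classic play-tennis data) are for illustration only.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """H(Y) = -sum over classes c of p_c * log2(p_c)."""
        n = len(labels)
        return -sum((m / n) * log2(m / n) for m in Counter(labels).values())

    def info_gain(labels, branches):
        """Gain(split) = H(parent) - weighted average of the branch entropies."""
        n = len(labels)
        return entropy(labels) - sum(len(b) / n * entropy(b) for b in branches)

    parent = ["yes"] * 9 + ["no"] * 5                  # class labels at the node
    left = ["yes"] * 6 + ["no"] * 2                    # branch 1 of a binary split
    right = ["yes"] * 3 + ["no"] * 3                   # branch 2 of the split
    print(f"{entropy(parent):.3f}")                    # 0.940
    print(f"{info_gain(parent, [left, right]):.3f}")   # 0.048

ID3 picks, at each node, the attribute whose split maximizes this gain; CART does the same with the Gini index in place of entropy.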
Cluster Analysis

- Clustering criteria: minimize intra-cluster distance + maximize inter-cluster distance.
- Distance/similarity:
  - how to deal with different types of variables
  - distance functions: Lp
  - metric distance functions

Cluster Analysis /2

- Partition-based clustering: k-means (algorithm, advantages, disadvantages, ...).
- Hierarchical clustering: agglomerative; single-link / complete-link / group-average hierarchical clustering.
- Graph-based clustering: the unnormalized graph Laplacian and its semantics; overview of the spectral clustering algorithm; embedding.

Association Rule Mining

- Concepts:
  - input: a transaction DB
  - output: (1) frequent itemsets (via minsup); (2) association rules (via minconf)
- Apriori algorithm (a toy sketch follows the closing slide):
  - the Apriori property (2 versions)
  - the Apriori algorithm
  - how to find the frequent itemsets?
  - how to derive the association rules?

Association Rule Mining /2

- FP-growth algorithm:
  - how to mine the frequent itemsets using FP-trees?
  - deriving association rules from the frequent itemsets.

Thank You and Good Luck!
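For revision (referenced in the Association Rule Mining section above), here is a minimal Python sketch of Apriori-style frequent-itemset mining. It is not from the slides: the toy transaction database and the absolute minsup of 2 are made up for illustration.

    from itertools import combinations

    def apriori(transactions, minsup):
        """Return every itemset whose support count is >= minsup."""
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        # L1: the frequent 1-itemsets.
        freq = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) >= minsup}
        result, k = set(freq), 2
        while freq:
            # Join step: union pairs of frequent (k-1)-itemsets into k-item candidates.
            cands = {a | b for a in freq for b in freq if len(a | b) == k}
            # Prune step (Apriori property): every (k-1)-subset must itself be frequent.
            cands = {c for c in cands
                     if all(frozenset(s) in freq for s in combinations(c, k - 1))}
            # Count support and keep only the frequent candidates.
            freq = {c for c in cands if sum(c <= t for t in transactions) >= minsup}
            result |= freq
            k += 1
        return result

    # Toy transaction DB; minsup = 2 is an absolute count here, not a fraction.
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    print(sorted(sorted(s) for s in apriori(db, 2)))
    # [['a'], ['a', 'b'], ['a', 'c'], ['b'], ['b', 'c'], ['c']]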