COMP20008 Elements of Data Processing
2021SM1 High-level study guide

COMP20008 2021SM1 – examinable content
Everything covered in the lectures and workshops is examinable, except for the guest lectures and the Easter Bonus lecture in week 5.

Introduction
• What is data wrangling
• Data wrangling pipeline

Visualisation
• Line plots, Boxplots, Histograms, Bar charts, Scatter plots, Heatmaps, Parallel Coordinate plots
• Understanding these plots
• Understanding patterns & interpreting these plots
• Visual outlier detection

Data Formats
• Categories of data formats: structured, semi-structured, unstructured.
• Differences among different data formats.
• High-level format of relational databases, as well as the processing and the operations upon such data.
• Syntax of JSON, XML, HTML
• PDF format and its challenges.

Text processing & document representation
• Text processing (and pre-processing):
• Text search & approximate text search. Applications;
• Distance and similarity methods for text/strings.
• Pattern matching – regular expressions: syntax, interpretations, applications.
• Pre-processing: tokenisation, case-folding, stemming, lemmatization, stop-word removal, text normalisation, noise removal; what operations are done in each step?
• Why apply each preprocessing step to text? Benefits and effects.
• Representations of unstructured text documents
• Bag of words
• TF-IDF and basic ranking
• Web crawling: algorithm and challenges & scraping
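
As a revision aid, the pre-processing, bag-of-words and TF-IDF bullets above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the course's reference implementation: the stop-word list is a tiny illustrative one, the names `preprocess` and `tf_idf` are hypothetical, and the weighting uses raw term frequency with idf = log(N / df), which is only one of several common TF-IDF variants.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not an official list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "on"}

def preprocess(text):
    """Case-fold, tokenise on alphabetic runs, and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes weight = raw term count * log(N / document frequency).
    """
    bags = [Counter(preprocess(d)) for d in docs]       # bag-of-words counts
    n_docs = len(bags)
    df = Counter(term for bag in bags for term in bag)  # documents containing each term
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in bag.items()}
        for bag in bags
    ]

docs = ["The cat sat on the mat", "The dog sat on the log", "Cats and dogs"]
weights = tf_idf(docs)
```

Here "sat" appears in two of the three documents, so it receives a lower weight than "cat", which appears in only one. Note that without stemming or lemmatization, "cat" and "cats" are treated as different terms.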

Data pre-processing and cleaning
• Motivation
• Different types of data quality issues, examples of data issues.
• Missing data & different types of missing data. Why is data missing?
• Simple imputation strategies.

Correlation
• What a correlation is, why it is useful, how it differs from causation
• Understand how correlation can be identified using visualisation
• Understand linear vs non-linear correlations
• Understand Euclidean distance and Pearson correlation, the differences between them, and how to calculate and interpret each
• Understand data discretization, entropy, mutual information
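
The measures listed above can be compared side by side with a small sketch. It assumes plain Python lists as input, already-discretized values for the entropy and mutual information functions, base-2 logarithms, and the identity MI(X, Y) = H(X) + H(Y) - H(X, Y); the function names are illustrative only.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient, in the range [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for discrete (binned) values."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
r = pearson(x, y)      # perfectly linearly related
d = euclidean(x, y)    # yet far apart in absolute terms
```

In this example the two series are perfectly correlated even though their Euclidean distance is large, which is the key difference between the two measures: Pearson correlation ignores scale and offset, while Euclidean distance does not.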

Clustering
• Understand what clustering is
• Know the K-Means algorithm, when it works well and when it works poorly
• Understand the VAT algorithm, how it works, why it’s useful, how to use it to estimate clusters
• Be able to interpret a heat map visualisation of a dissimilarity matrix
• Understand hierarchical clustering, how it works, when it works well/poorly, how to apply it
• Create and interpret a dendrogram
• Know how to use clustering to detect outliers
• Compare hierarchical clustering to K-Means
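
For the K-Means bullet above, the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched as follows. This is a minimal sketch, assuming Euclidean distance and caller-supplied initial centroids so that the run is deterministic; in practice the initial centroids are usually chosen at random, and the name `k_means` is illustrative.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Basic K-Means: returns (final centroids, clusters of points)."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```

On these two well-separated, roughly spherical groups the algorithm converges after one update. Different initial centroids can lead to different results, which is one reason K-Means can perform poorly.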

Regression
• How to use regression analysis to predict the value of a dependent variable based on independent variables
• Make inferences about the slope and correlation coefficient
• Evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
• Residual plot analysis.
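
To make the prediction and residual bullets above concrete, here is a minimal sketch of simple linear regression with one independent variable, fitted by ordinary least squares. The function names and the toy data are assumptions for illustration.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

def residuals(x, y, slope, intercept):
    """Observed value minus predicted value for each data point."""
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]          # toy data lying exactly on y = 1 + 2x
slope, intercept = fit_line(x, y)
res = residuals(x, y, slope, intercept)
```

Plotting `res` against `x` gives the residual plot. A visible pattern such as a curve or a funnel shape suggests that the linearity or constant-variance assumption is violated.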

Classification
• Understand the difference between classification and regression, and why it is useful to build models for these tasks
• Understand how a decision tree may be used to make predictions about the class of a test instance
• Understand the key steps in building a decision tree
• Understand the operation and rationale of the k nearest neighbour algorithm for classification
• Understand the advantages and disadvantages of using the k nearest neighbour or decision tree for classification
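
The k nearest neighbour bullet above can be illustrated with a short sketch. It assumes Euclidean distance and a simple majority vote among the k closest training instances; the name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the class of `query` by majority vote of its k nearest neighbours.

    `train` is a list of (feature_tuple, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((9, 8), "B")]
label_near_a = knn_predict(train, (2, 2), k=3)
label_near_b = knn_predict(train, (8, 7), k=3)
```

There is no training step: all the work happens at prediction time, when the distance to every training instance is computed. This is one of the trade-offs to weigh against a decision tree.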

Experimental Design
• Understand the difference between supervised and unsupervised algorithms
• Understand the principles of experimental design: Feature Filtering, Dimensionality Reduction, Performance Evaluation
• Go from average to great data analysis

Record/data linkage
• What is it? Applications of data linkage
• Challenges of data linkage
• Steps in a data linkage process
• Blocking, purpose (role in data linkage), evaluation of blocking method.
• Evaluation metrics for data linkage results.
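
The blocking bullets above can be sketched for a single-dataset (deduplication) setting. This sketch assumes two commonly used blocking measures, pair completeness (the fraction of true matches that survive blocking) and reduction ratio (the fraction of all possible pairs that blocking avoids comparing). The records, the blocking key (first letter of the name) and the function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def block_pairs(records, key):
    """Group record ids by blocking key; return candidate pairs within blocks."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(frozenset(p) for p in combinations(ids, 2))
    return pairs

def pair_completeness(candidates, true_matches):
    """Fraction of the true matching pairs that are kept as candidates."""
    return len(candidates & true_matches) / len(true_matches)

def reduction_ratio(candidates, n_records):
    """1 - (candidate pairs / all possible pairs)."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - len(candidates) / total_pairs

records = {1: "smith john", 2: "smyth john", 3: "jones mary",
           4: "kones mary", 5: "brown bob"}
true_matches = {frozenset({1, 2}), frozenset({3, 4})}

candidates = block_pairs(records, key=lambda name: name[0])
pc = pair_completeness(candidates, true_matches)
rr = reduction_ratio(candidates, len(records))
```

In this toy example blocking removes 9 of the 10 possible comparisons, but the typo in record 4 puts it in a different block from record 3, so one of the two true matches is lost. This is the usual trade-off between efficiency and completeness.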

Recommender Systems
• Understand the following: recommender systems, collaborative filtering, user-based, item-based
• How to perform collaborative filtering & predictions
• Understand the challenges recommender systems face
• Advantages/disadvantages of user-based and item-based methods
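
A user-based collaborative filtering prediction, as listed above, can be sketched as follows. The sketch assumes cosine similarity computed over co-rated items and a similarity-weighted average of the k most similar users' ratings; other similarity measures and normalisations are possible, and the ratings and names are made up for illustration.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item, k=2):
    """Similarity-weighted average rating of `item` by the k most similar users."""
    candidates = [
        (cosine_sim(ratings[user], r), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    top = sorted(candidates, reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None  # no similar user has rated the item (cold start)
    return sum(sim * rating for sim, rating in top) / total

ratings = {
    "ann": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "cat": {"a": 1, "b": 5, "c": 2},
}
predicted = predict(ratings, "ann", "c")
```

Ann's ratings match Bob's more closely than Cat's, so the prediction for item "c" lies between their two ratings but closer to Bob's. An item-based method would instead compare the rating columns of items rather than the rows of users.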

Privacy
• What a sensitive attribute, non-sensitive attribute and quasi-identifier are
• Understand the notions of k-anonymity and l-diversity and how they protect privacy, advantages & disadvantages of each
• Understand the benefits & risks of using and sharing people’s location data
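
The k-anonymity and l-diversity notions above can be checked on a small table. This is a minimal sketch assuming the "distinct" form of l-diversity (the number of distinct sensitive values in each group); the table, column names and function names are invented for illustration.

```python
from collections import Counter, defaultdict

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values within any such group."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return min(len(values) for values in groups.values())

rows = [
    {"age": "20-29", "postcode": "30**", "disease": "flu"},
    {"age": "20-29", "postcode": "30**", "disease": "cold"},
    {"age": "30-39", "postcode": "31**", "disease": "flu"},
    {"age": "30-39", "postcode": "31**", "disease": "flu"},
]
k = k_anonymity(rows, ["age", "postcode"])
l = l_diversity(rows, ["age", "postcode"], "disease")
```

The table is 2-anonymous but only 1-diverse: anyone known to be in the second group must have the flu, which is the kind of attribute disclosure that k-anonymity alone does not prevent.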

Differential privacy
• What information is being protected & not protected
• How it works & how / why noise is added
• What the global sensitivity and privacy loss budget are and how they affect the noise that’s added to the dataset
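
The relationship between global sensitivity, privacy loss budget and noise can be sketched with the Laplace mechanism, one common way of achieving differential privacy. The sketch assumes a counting query (global sensitivity 1) and a noise scale of sensitivity / epsilon; the function names are illustrative, and the noise is sampled as the difference of two exponential variables, which follows a Laplace distribution.

```python
import random

def noise_scale(global_sensitivity, epsilon):
    """Laplace scale b = global sensitivity / privacy loss budget (epsilon)."""
    return global_sensitivity / epsilon

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, global_sensitivity=1.0):
    """Release a count with Laplace noise added.

    A counting query has global sensitivity 1: adding or removing one
    person changes the true answer by at most 1.
    """
    return true_count + laplace_noise(noise_scale(global_sensitivity, epsilon), rng)

rng = random.Random(0)  # fixed seed only so the example is repeatable
noisy = private_count(100, epsilon=0.5, rng=rng)
```

A smaller epsilon (stricter privacy) or a larger global sensitivity both increase the scale, and therefore the typical size of the noise added to the released answer.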

Big Data
• Appreciate ethical considerations in the context of a data wrangling/data science/data analytics project
• Explain the stakeholders in big data analytics, their perspectives
• Understand the 10 simple rules for responsible big data research.