Elements of Data Processing
Semester 1, 2022
Dimensionality reduction – Principal Component Analysis
© University of Melbourne 2022

Dimensionality Reduction
Principal Component Analysis
Motivation: High dimensional data
• The true dimensionality of the data is often much lower than its observed (high) dimensionality
• The curse of dimensionality: "Data analysis techniques which work well at lower dimensions (fewer features), often perform poorly as the dimensionality of the analysed data increases (lots of features)"
• As dimensionality increases, data becomes increasingly sparse and the distances between all pairs of points begin to look the same. This affects any algorithm that is based on distances between objects.

Dimensionality reduction
• To reduce dimensionality, perform feature selection
  • E.g., 2 features selected from the original 4 based on correlation
  • Domain knowledge
• Or perform feature extraction, transforming a dataset from N to n (n ≪ N) features
  • The output n features do not need to be a subset of the input N features. Rather, they can be new features whose values are constructed using some function applied to the input N features

Transforming from N to n ≪ N dimensions
• The transformation should preserve the characteristics of the data, i.e. distances between pairs of objects
  • If a pair of objects is close before the transformation, they should still be close after the transformation
  • If a pair of objects is far apart before the transformation, they should still be far apart after the transformation
  • The set of nearest neighbors of an object before the transformation should ideally be the same after the transformation

Principal components analysis (PCA)
• Find a new set of features that better captures the variability of the data
• The first dimension is chosen to capture as much of the variability as possible
• The second dimension is orthogonal to the first and, subject to that constraint, captures as much of the remaining variability as possible
• The third dimension is orthogonal to the first and second and, subject to that constraint, captures as much of the remaining variability as possible
• Etc.
• Each new dimension is a linear weighted combination of the original features (feature engineering)
• A good visualisation: http://setosa.io/ev/principal-component-analysis

• Avoid curse of dimensionality
• Reduce amount of time and memory required by algorithms
• Allow data to be more easily visualised
• May help to eliminate irrelevant features or reduce noise
• Fewer features → smaller machine learning models → faster answer
• Better feature set by transformation → more accurate answer
• Principal components are uncorrelated with one another (they lie along orthogonal directions)

PCA – limitations
• It may be non-intuitive to interpret the principal components
• Some information is lost
• Assumes linear combinations of the features
• The PCA-reduced dimensions may not help with classification
• PCA is sensitive to scale (standardize features!)
• PCA is sensitive to outliers
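The construction described for PCA (successive orthogonal directions, each capturing as much of the remaining variability as possible) can be sketched with an eigendecomposition of the covariance matrix. This is a minimal illustration, not the method used in the subject's workshop: the function name `pca_sketch` and the toy data are assumptions made for this example, and NumPy is assumed to be available.

```python
import numpy as np


def pca_sketch(X, n_components):
    """Hypothetical helper: project X (rows = objects, columns = features)
    onto its first n_components principal components."""
    # Centre each feature so the covariance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns are the variables).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is for symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the direction with the largest variance comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each column of `components` holds the weights of one linear combination.
    components = eigvecs[:, :n_components]
    scores = X_centered @ components
    return scores, components, eigvals


# Toy data (an assumption for illustration): 200 objects, 3 features,
# where the second feature is almost a multiple of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.5, size=200),
])

scores, components, eigvals = pca_sketch(X, n_components=2)
explained = eigvals / eigvals.sum()  # share of total variance per direction

print("shape after reduction:", scores.shape)
print("share of variance per component:", explained)
```

The columns of `components` are orthonormal, and the covariance matrix of `scores` is diagonal, which is the sense in which the new features are uncorrelated. Because the covariance depends on the units of each feature, rescaling one column changes the result, which is why the slides recommend standardizing first.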
Principal components analysis in Python
• sklearn.decomposition
• Will practise in workshop
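As a rough preview of what using `sklearn.decomposition` can look like, the sketch below reduces a 4-feature dataset to 2 principal components. The workshop exercise itself is not given in these slides, so the choice of the bundled iris dataset, the use of `StandardScaler`, and `n_components=2` are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed example data: the iris dataset shipped with scikit-learn
# (150 objects, 4 numeric features).
X = load_iris().data

# PCA is sensitive to scale, so standardize each feature first
# (zero mean, unit variance).
X_std = StandardScaler().fit_transform(X)

# Keep the first 2 principal components (N = 4 features -> n = 2).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("reduced shape:", X_reduced.shape)
# Share of the total variance captured by each component.
print("explained variance ratio:", pca.explained_variance_ratio_)
# Each row gives the weights of one linear combination of the 4 features.
print("component weights:", pca.components_)
```

`explained_variance_ratio_` is a quick way to judge how much information is lost by keeping only the first few components, and the rows of `components_` show how each new feature is built from the original ones.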