Elements of Data Processing
Semester 1, 2022
Dimensionality reduction – Principal Component Analysis
© University of Melbourne 2022
Dimensionality Reduction
Principal Component Analysis
Motivation: High dimensional data
• The true (intrinsic) dimensionality of data is often much lower than the number of observed features
• The curse of dimensionality: “Data analysis techniques which work well at lower dimensions (fewer features), often perform poorly as the dimensionality of the analysed data increases (lots of features)”
• As dimensionality increases, data becomes increasingly sparse and the distances between pairs of points begin to look the same. This impacts any algorithm based on distances between objects (illustrated in the sketch below)
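The sketch below is illustrative only (random data and arbitrary sizes, not from the slides): it shows how the relative spread of pairwise distances shrinks as the number of dimensions grows.

```python
# Illustrative sketch only (random data, arbitrary sizes): as dimensionality
# grows, the pairwise distances between points become increasingly similar.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((200, d))              # 200 random points in d dimensions
    dists = pdist(X)                      # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:4d}  relative spread of pairwise distances = {spread:.2f}")
```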
Dimensionality reduction
• To reduce dimensionality, perform feature selection
• E.g., 2 features selected from the original 4 based on correlation
• Or based on domain knowledge
• Alternatively, perform feature extraction: transform a dataset from N to k (k ≪ N) features
• The output k features do not need to be a subset of the input N features. Rather, they can be new features whose values are constructed by applying some function to the input N features
Transforming from N to k (k ≪ N) dimensions
• The transformation should preserve the characteristics of the data, i.e. distances between pairs of objects
• If a pair of objects is close before the transformation, they should still be close after the transformation
• If a pair of objects is far apart before the transformation, they should still be far apart after the transformation
• The set of nearest neighbors of an object before the transformation should ideally be the same after the transformation
Principal components analysis (PCA)
• Find a new set of features that better captures the variability of the data
• The first dimension is chosen to capture as much of the variability as possible
• The second dimension is orthogonal to the first and, subject to that constraint, captures as much of the remaining variability as possible
• The third dimension is orthogonal to the first and second and, subject to that constraint, captures as much of the remaining variability as possible
• Etc. Each principal component is a linear weighted combination of the original features (feature engineering)
• A good visualisation: http://setosa.io/ev/principal-component-analysis (see also the sketch below)
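A minimal from-scratch sketch of the idea (illustrative only; the pca helper and the random toy data are assumptions, not the lecture's code): the principal components can be obtained from the singular value decomposition of the centred data, giving orthonormal directions ordered by the variance they capture.

```python
# Illustrative from-scratch sketch (not the lecture's code): principal
# components via the SVD of the centred data matrix.
import numpy as np

def pca(X, k):
    """Project the n x d matrix X onto its top-k principal components."""
    X_centred = X - X.mean(axis=0)                 # centre each feature
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]                            # k x d orthonormal directions
    explained_variance = (S ** 2) / (len(X) - 1)   # variance captured per direction
    return X_centred @ components.T, components, explained_variance[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                      # assumed toy data: 100 points, 4 features
Z, comps, var = pca(X, k=2)
print(Z.shape)                                     # (100, 2): data in the new 2-D space
print(var)                                         # variances in decreasing order
```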
PCA – benefits
• Avoid the curse of dimensionality
• Reduce the amount of time and memory required by algorithms
• Allow data to be more easily visualised
• May help to eliminate irrelevant features or reduce noise
• Fewer features → smaller machine learning models → faster answers
• Better feature set by transformation → more accurate answers
• Principal components are uncorrelated (orthogonal to one another)
PCA – limitations
• It may be non-intuitive to interpret the principal components
• Some information loss
• Assumes linear combinations of the features
• The PCA-reduced dimensions may not help with classification
• PCA is sensitive to scale (standardise features!)
• PCA is sensitive to outliers.
Principal components analysis in Python
sklearn.decomposition
Will practise in workshop
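A minimal usage sketch with sklearn.decomposition.PCA; the iris dataset and the choice of two components are assumptions for illustration, not necessarily the workshop exercise.

```python
# Minimal scikit-learn sketch; the iris data and n_components=2 are
# assumptions for illustration, not the workshop exercise.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Standardise first: PCA is sensitive to the scale of the features
X_scaled = StandardScaler().fit_transform(X)

# Keep the first two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # fraction of variance captured per component
```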