Machine Learning and Data Mining in Business
Lecture 3: Preparing Your Data For Machine Learning
Discipline of Business Analytics

A framework for machine learning projects
Think iteratively!
1. Business understanding
2. Data collection and preparation
3. Exploratory data analysis
4. Feature engineering
5. Machine learning
6. Evaluation
7. Deployment and monitoring

Tutorial application: predicting house prices
• Output: the sale price of a residential property.
• Inputs: square footage, number of bedrooms, number of bathrooms, number of garage spots, neighbourhood, house quality, among several others.
• Available data: past house sales.

Lecture 3: Feature Engineering
Learning objectives
• Using exploratory data analysis to understand your data.
• Feature engineering for tabular data.
• Handling data problems.

Lecture 3: Feature Engineering
1. Exploratory data analysis
2. Feature engineering
3. Numerical predictors
4. Categorical predictors
5. Interaction features
6. Data problems

Exploratory data analysis

Exploratory data analysis
Exploratory data analysis (EDA) is the process of describing and visualising a dataset to understand it, obtain insights, detect problems, and check assumptions.
We use EDA to inform feature engineering and modelling.

The importance of EDA
Kaggle State of Data Science and Machine Learning (2020)

The importance of EDA (example): Anscombe Quartet

The importance of EDA (example): Anscombe Quartet
In this classic illustrative example by Anscombe (1973):
• The response and predictor have the same sample average and variance.
• The regression lines are the same.
• The R² is the same. Therefore the sample correlations are the same.
The figures are the ones telling the story!

Data types
Credit: https://legac.com.au/blogs/further-mathematics-exam-revision/further-mathematics-unit-3-data-analysis-types-of-data

Example: house price predictions
• Continuous: size.
• Discrete: number of bedrooms.
• Nominal: neighbourhood.
• Ordinal: quality.
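As a concrete illustration, a minimal pandas sketch of how these data types might be declared is shown below. The column names and values are hypothetical stand-ins for the tutorial dataset, not part of the slides.

```python
import pandas as pd

# Hypothetical house sales data with one column per data type
df = pd.DataFrame({
    "size": [120.5, 95.0, 210.3],                  # continuous
    "bedrooms": [3, 2, 4],                         # discrete
    "neighbourhood": ["North", "West", "North"],   # nominal
    "quality": ["fair", "good", "excellent"],      # ordinal
})

# Declare categorical types explicitly so later steps treat them correctly
df["neighbourhood"] = df["neighbourhood"].astype("category")
df["quality"] = pd.Categorical(
    df["quality"], categories=["fair", "good", "excellent"], ordered=True
)

print(df.dtypes)
```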

1. Type inference
2. Missing value analysis
3. Univariate distribution plots
4. Univariate descriptive statistics
5. Bivariate relationship plots
6. Dependence measures (such as correlations)
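A minimal pandas sketch of the six steps listed above, assuming the data has been loaded into a DataFrame with hypothetical column names (price, size, neighbourhood):

```python
import pandas as pd

df = pd.read_csv("house_sales.csv")   # hypothetical file name

# 1. Type inference: check that each column has a sensible dtype
print(df.dtypes)

# 2. Missing value analysis: count and share of missing values per column
print(df.isna().sum())
print(df.isna().mean().sort_values(ascending=False))

# 3. Univariate distribution plots
df.hist(figsize=(12, 8))                        # numerical columns
df["neighbourhood"].value_counts().plot.bar()   # a nominal column

# 4. Univariate descriptive statistics
print(df.describe(include="all"))

# 5. Bivariate relationship plots, e.g. the response against a predictor
df.plot.scatter(x="size", y="price")

# 6. Dependence measures, e.g. pairwise correlations of numerical columns
print(df.corr(numeric_only=True))
```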

What to look for in univariate EDA
• Data errors
• Missing values
• Outliers
• High skewness
• High kurtosis
• Multi-modality
• High proportion of zeros or other values (numerical variables)
• High cardinality (nominal variables)
• Sparse labels (nominal variables)
• Low variation

What to look for in bivariate EDA
• Strong relationships
• Weak relationships
• Outliers
• Nonlinearity
• Non-constant error variance
• Collinearity

Measures of dependence
• Pearson correlation (continuous)
• Spearman rank correlation (ordered variables)
• Kendall’s τ rank correlation coefficient (ordered variables)
• Correlation ratio (nominal-continuous)
• Cramer’s φ (nominal variables)
• Theil’s U (nominal variables)
• φk (all types)
• Mutual information
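Several of these measures are available in standard Python libraries. A sketch with hypothetical column names follows; the remaining measures (correlation ratio, Cramer’s φ, Theil’s U, φk) are not in these core libraries but are available in add-on packages such as phik or can be computed by hand.

```python
import pandas as pd
from scipy import stats
from sklearn.feature_selection import mutual_info_regression

df = pd.read_csv("house_sales.csv")   # hypothetical file, no missing values assumed

# Pearson, Spearman and Kendall correlations between two ordered variables
print(df["size"].corr(df["price"], method="pearson"))
print(df["size"].corr(df["price"], method="spearman"))
print(df["size"].corr(df["price"], method="kendall"))

# Kendall's tau with a p-value from scipy
tau, p_value = stats.kendalltau(df["size"], df["price"])

# Mutual information between numerical predictors and a continuous response
X = df[["size", "bedrooms"]]
mi = mutual_info_regression(X, df["price"])
print(dict(zip(X.columns, mi)))
```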

Measures of dependence
Figure from Baak, Koopman, Snoek and Klous (2020)

What to look for in multivariate EDA
• Multicollinearity (e.g., global correlation coefficients).
• Irrelevant variables.

Advanced EDA
• Bivariate distributions
• Interaction effects
• Outlier detection
• Clustering
• Time series, spatial data, network data
• Unstructured data
• Advanced data visualisation

Feature engineering

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. – Pedro Domingos

Feature engineering
Feature engineering is the process of preparing the data for the learning algorithms.
Feature is another term for an input variable.

Feature engineering
Feature engineering involves tasks such as:
• Extracting features from raw data.
• Constructing informative features based on domain knowledge.
• Processing the data into the format required by different learning algorithms.
• Processing the predictors in ways that help the learning algorithm to build better models.
• Imputing missing values.
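For example, in the house price application we might construct features from domain knowledge. A sketch, with hypothetical file and column names:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("house_sales.csv")   # hypothetical file and column names

# Domain-knowledge features: living space per bedroom and total room count
df["size_per_bedroom"] = df["size"] / df["bedrooms"].replace(0, np.nan)
df["total_rooms"] = df["bedrooms"] + df["bathrooms"]

# A binary flag extracted from a raw attribute: does the house have a garage?
df["has_garage"] = (df["garage_spots"] > 0).astype(int)
```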

Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering. – Andrew Ng

Feature engineering
• Different learning algorithms have different feature engineering requirements. A technique that is suitable for a certain learning algorithm and context can be unnecessary or counter-productive for others.
• Usually, there are no fixed recipes or a single correct way to do feature engineering.
• Experiment and let the data guide your decisions.

Feature engineering vs. representation learning
• Most machine learning algorithms learn the parameters of a model from a fixed set of input features, so their performance relies on feature engineering.
• Representation learning methods learn features from data and can be supervised or unsupervised. Later in the unit we will discuss deep learning, which is a type of representation learning.

Next steps
• The following sections describe several feature engineering techniques for tabular data.
• However, keep in mind that a key aspect of feature engineering is to come up with informative features. That depends on the context, rather than on specific techniques.

There is ultimately no replacement for the smarts you put into feature engineering. –

Numerical predictors

Feature scaling
Many learning algorithms require the inputs to be on the same scale and, in some cases, centred around zero.

Standardisation
We standardise the inputs by computing
\[
x'_{ij} = \frac{x_{ij} - \bar{x}_j}{\mathrm{sd}_j},
\]
where $\bar{x}_j$ and $\mathrm{sd}_j$ are the sample average and standard deviation of predictor $j$, calculated on the training data.
The constructed features have sample mean zero and variance one.
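A minimal sketch of standardisation with pandas, where the centring and scaling constants are estimated on the training data only and then applied to both sets (file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical split into training and test predictors
X_train = pd.read_csv("train.csv")[["size", "bedrooms"]]
X_test = pd.read_csv("test.csv")[["size", "bedrooms"]]

# Estimate the centring and scaling constants on the training data only
means = X_train.mean()
sds = X_train.std()

X_train_std = (X_train - means) / sds
X_test_std = (X_test - means) / sds   # reuse training statistics, never refit on test
```

The same result can be obtained with sklearn.preprocessing.StandardScaler, which uses the population rather than the sample standard deviation.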

Scaling features to a range
Min-max scaling (normalisation) restricts the values of a variable to the [0, 1] interval by computing the transformation
\[
x'_{ij} = \frac{x_{ij} - \min_j}{\max_j - \min_j},
\]
where minj and maxj are the minimum and maximum sample values of predictor j.
Max-abs scaling works similarly but divides by the maximum absolute value, scaling the variable to the [−1, 1] range.
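Both scalers are available in scikit-learn. A sketch with toy data, again fitting on the training data only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X_train = np.array([[50.0, 1], [120.0, 3], [300.0, 5]])   # toy training data
X_test = np.array([[80.0, 2]])

# Min-max scaling to [0, 1]
minmax = MinMaxScaler().fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)

# Max-abs scaling to [-1, 1]
maxabs = MaxAbsScaler().fit(X_train)
X_train_ma = maxabs.transform(X_train)
```

scikit-learn also provides RobustScaler, which uses the median and interquartile range and is less sensitive to outliers.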

Feature scaling
• Either standardisation or scaling the features to a range can lead to better performance depending on the problem.
• Robust scaling may work better when there are outliers.
• Scaling the variables (including the response) can have the additional benefit of improving numerical performance.

Log transformation
Many learning algorithms perform better when the predictors are approximately normally distributed.
The log transformation
\[
x'_{ij} = \log(x_{ij})
\]
for $x_{ij} > 0$ is often effective at reducing right skewness.
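A sketch with hypothetical column names. The log1p variant, log(1 + x), is a common choice when the variable contains zeros; that is an assumption about the data rather than part of the slide.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("house_sales.csv")   # hypothetical file

print(df["price"].skew())             # check right skewness before transforming

df["log_price"] = np.log(df["price"])     # requires strictly positive values
df["log1p_size"] = np.log1p(df["size"])   # log(1 + x), handles zeros

print(df["log_price"].skew())         # skewness should be much closer to zero
```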

Box-Cox transformation
The Box-Cox transformation extends this idea as
if λ ̸= 0, log(xij+α) ifλ=0,
 (x +α)λ −1
for xij > −α, a transformation parameter λ and shift parameter α.
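In practice $\lambda$ is usually estimated by maximum likelihood. A sketch using scipy on a toy sample, assuming a strictly positive variable so that the shift $\alpha$ can be taken as zero:

```python
import numpy as np
from scipy import stats

x = np.array([12.0, 45.0, 7.5, 150.0, 33.0, 8.2])   # toy positive sample

# scipy estimates lambda by maximum likelihood when lmbda is not supplied
x_transformed, lam = stats.boxcox(x)
print(lam)

# Equivalent transformation with a fixed lambda
x_fixed = stats.boxcox(x, lmbda=0.5)
```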

Yeo-Johnson transformation
The Yeo-Johnson transformation is
\[
x_{ij}^{(\lambda)} =
\begin{cases}
\left[(x_{ij}+1)^{\lambda} - 1\right]/\lambda & \text{if } \lambda \neq 0,\; x_{ij} \geq 0, \\
\log(x_{ij}+1) & \text{if } \lambda = 0,\; x_{ij} \geq 0, \\
-\left[(-x_{ij}+1)^{2-\lambda} - 1\right]/(2-\lambda) & \text{if } \lambda \neq 2,\; x_{ij} < 0, \\
-\log(-x_{ij}+1) & \text{if } \lambda = 2,\; x_{ij} < 0.
\end{cases}
\]

Yeo-Johnson transformation
• No restriction on the range of the predictor.
• Selects λ by directly making the transformed variable as close to normally distributed as possible.

Many zeros
When a variable has many zeros, you may want to treat these cases separately by creating a dummy variable.

Discrete predictors
• When a discrete predictor has many possible values, we generally treat it as a continuous variable.
• When there are only a few possible values (e.g. number of children), we may handle a discrete predictor in a similar way to a categorical variable.

Categorical predictors

Nominal predictors
We nearly always need to encode nominal variables as numerical features.

One-hot encoding
In one-hot encoding, we construct a binary indicator for each category. For example, if the predictor is color ∈ {blue, yellow, red}, we create the three-dimensional feature vector
\[
x =
\begin{cases}
(1, 0, 0) & \text{if blue}, \\
(0, 1, 0) & \text{if yellow}, \\
(0, 0, 1) & \text{if red}.
\end{cases}
\]

Dummy encoding
In dummy encoding, we start with one-hot encoding but delete one arbitrary feature to avoid perfect multicollinearity. If we have a predictor gender ∈ {male, female}, we encode it as
\[
x = \begin{cases} 1 & \text{if male}, \\ 0 & \text{if female}, \end{cases}
\qquad \text{or} \qquad
x = \begin{cases} 1 & \text{if female}, \\ 0 & \text{if male}. \end{cases}
\]

Sparse categories
Some categories may have low counts. In this case, we may want to merge all categories with a count below a minimum into an “other” category.

High cardinality
High cardinality means that there is a large number of categories. Options include:
• Merge sparse categories.
• Merge categories that are judged to be similar.
• Hash encoding.
• Target encoding.
• Learning algorithms such as CatBoost that are specifically designed to deal with categorical features.

Target encoding
• Target encoding builds one feature that encodes each category with an estimate of the expected value of the response given the category.
• In the simplest version of target encoding, we replace each category with the average response value for that category.
• It’s common to smooth the estimates by blending them with the average response value for the entire training set.

Target encoding
Image credit: https://laptrinhx.com/stop-one-hot-encoding-your-categorical-variables-2553223393/

Target encoding
• A major disadvantage of target encoding is target leakage.
• Related methods such as K-fold target encoding and Bayesian target encoding mitigate this problem.
• CatBoost encoding attempts to minimise target leakage.

Ordinal predictors
• Ordinal encoding builds a single feature with numerical values that reflect the ordering of the categories. This is suitable for specific learning algorithms such as tree-based methods.
• In other cases, the default choice is to treat ordinal predictors as nominal variables.

Mixed data types
Learning algorithms vary in terms of how well they handle mixed data types.
• Tree-based methods handle mixed data types in a natural way, while methods such as kNN make more sense when all the predictors are directly comparable.
• Many methods require the features to be on the same scale, but there’s no natural way to put continuous variables and dummy variables on the same scale.
• For practical reasons, we often ignore this issue and treat numerical features constructed from categorical predictors like any other numerical predictors.
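To make the encoding schemes above concrete, here is a sketch of one-hot, dummy and smoothed target encoding with pandas. The file name, column names and smoothing weight are hypothetical choices.

```python
import pandas as pd

train = pd.read_csv("train.csv")   # hypothetical file with 'neighbourhood' and 'price'
test = pd.read_csv("test.csv")

# One-hot encoding: one binary indicator per category
onehot = pd.get_dummies(train["neighbourhood"], prefix="nbhd")

# Dummy encoding: drop one indicator to avoid perfect multicollinearity
dummies = pd.get_dummies(train["neighbourhood"], prefix="nbhd", drop_first=True)

# Smoothed target encoding, estimated on the training data only
m = 20.0                              # smoothing weight (a tuning choice)
global_mean = train["price"].mean()
stats = train.groupby("neighbourhood")["price"].agg(["mean", "count"])
encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train["nbhd_te"] = train["neighbourhood"].map(encoding)
test["nbhd_te"] = test["neighbourhood"].map(encoding).fillna(global_mean)
```

Applying this naive target encoding to the training rows themselves leaks the target, which is why the K-fold and CatBoost variants mentioned above are preferred in practice.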
Interaction features

Interaction effects
An interaction effect occurs when the relationship between the response and a predictor depends on other predictors.

Interaction effects
• The relationship between the value and the size of a residential property depends on the location.
• The relationship between the risk of a loan and its instalments depends on the income of the borrower.

Illustration
[Figure: two panels, student and non-student, each plotting the response against Income.]

Illustration
The model for the illustration on the previous page is
\[
f(x) = \beta_0 + \beta_1 \times \text{income} + \beta_2 \times \text{student} + \beta_3 \times \text{income} \times \text{student}.
\]
We can interpret it as
\[
f(x) =
\begin{cases}
\beta_0 + \beta_1 \times \text{income} & \text{if student} = 0, \\
(\beta_0 + \beta_2) + (\beta_1 + \beta_3) \times \text{income} & \text{if student} = 1.
\end{cases}
\]

Interaction effects
Formally, a model has interaction effects when the predictive function is not additive in the predictors. The predictive function
\[
f(x) = f_1(x_1) + f_2(x_2) + f_3(x_3)
\]
is additive, while
\[
f(x) = f_1(x_1) + f_2(x_2, x_3)
\]
has an interaction.

Interactions between continuous and categorical variables
Suppose that we have a numerical predictor x and a categorical predictor encoded through C dummy variables. We construct the features
\[
(x\,d_1, x\,d_2, \ldots, x\,d_C).
\]
We construct features for interactions between categorical variables in a similar way.

Example: human resources

Interactions between continuous variables
• You can model interactions between continuous variables using low-dimensional model components.
• For example, consider a non-additive regression function μ(x₁, x₂). You can use a bivariate polynomial as an approximation,
\[
f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2,
\]
which includes an interaction term.

Curse of dimensionality
• When p = 2, there is one possible interaction: f₁(x₁, x₂).
• When p = 3, there are four: f₁(x₁, x₂), f₂(x₁, x₃), f₃(x₂, x₃), and f₄(x₁, x₂, x₃).
• When p = 4, there are 11.
• When p = 10, there are 1,013.
• When p = 20, there are 1,048,555.
• When p = 50, there are 1,125,899,906,842,573.

Data problems

Missing values
• Delete rows that have missing values (avoid).
• Delete columns that have too many missing values (avoid).
• Use simple imputation, such as replacing the missing value by the average, median or mode.
• Use model-based imputation, such as building a model to predict the missing value.
• Create a dummy variable that indicates a missing value (“missingness” is often informative).
• Use a learning algorithm that can handle missing values directly.

Outliers
• Delete outliers (avoid).
• If an outlier is due to an error, fix the error if there is information to do so. Otherwise, treat it as a missing value.
• Censor the variable.
• Transform the variable.
• Create a dummy variable to indicate outliers.
• Use a learning algorithm that is robust to outliers.

Data leakage
• Leakage occurs when you train a model that uses information that would not be available for prediction in a production environment.
• Leakage causes over-optimistic estimation of generalisation performance.
• You should carefully inspect your variables to avoid leakage.
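One practical safeguard, not from the slides but consistent with them, is to wrap preprocessing in a scikit-learn Pipeline so that imputation, scaling and encoding are refit inside each training fold and never see validation or test data. The column names, file name and choice of Ridge as the model are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("house_sales.csv")            # hypothetical file
numeric_cols = ["size", "bedrooms", "bathrooms"]
nominal_cols = ["neighbourhood"]
X, y = df[numeric_cols + nominal_cols], df["price"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # simple imputation
        ("scale", StandardScaler()),                    # standardisation
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
])

model = Pipeline([("prep", preprocess), ("reg", Ridge())])

# Every preprocessing step is fitted inside each fold, preventing leakage
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```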