Live Coding Wk3 – Lecture 5 – Introduction to Machine Learning¶
Machine learning is a powerful tool, and so far we have only seen brief descriptions of a few algorithms. Let's go through some discussion and housekeeping in preparation for introducing machine learning algorithms next lecture.
### Imports and data you will need
import numpy as np
import pandas as pd
data = pd.read_csv('data/weight_loss.csv', index_col=0)
data.head()
Problem: I Need Data!!!¶
A key aspect of machine learning is using data to learn and solve problems. In fact, preparing the data is just as important as the algorithm used for learning. Without good ingredients we can't make a good dish!
Let's go through a couple of standard procedures used in preparing data for a machine learning algorithm.
Part 1 – Let's Train¶
In machine learning, we have two major steps involving data in the learning pipeline:
Training the algorithm;
Testing the performance.
We want to split our full dataset into two subsets, corresponding to the training and testing steps. Let's see how good our Python is now. How do we split the data into 80% training and 20% testing?
### Your code
import math
import random
training_set1 = None # Fill!
testing_set1 = None # Fill!
# Make sure the lengths work out
assert(len(data) == len(training_set1) + len(testing_set1))
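One possible solution is to shuffle the row labels and cut them at the 80% mark. This is a sketch, not the official solution; the small DataFrame below is a made-up stand-in for the `weight_loss.csv` data loaded earlier.

```python
import math
import random

import pandas as pd

# Stand-in for the weight_loss.csv DataFrame (hypothetical values)
data = pd.DataFrame({'Weight': range(100)})

# Shuffle the row labels so the split is random, then cut at the 80% mark
indices = list(data.index)
random.shuffle(indices)
split_point = math.floor(0.8 * len(indices))

training_set1 = data.loc[indices[:split_point]]
testing_set1 = data.loc[indices[split_point:]]

# Make sure the lengths work out
assert len(data) == len(training_set1) + len(testing_set1)
```

Shuffling before cutting matters: if the rows are ordered (say, by date or weight), taking the first 80% directly would give a biased training set.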
Splitting data into a training subset and a test subset is such a common task that it is already implemented in one of the Python libraries.
Nothing wrong with reinventing the wheel when learning about wheels.
Let's repeat what we did above, this time with scikit-learn.
from sklearn.model_selection import train_test_split
# Let's see what the function does
train_test_split?
# Your code
training_set2 = None # Fill!
testing_set2 = None # Fill!
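A sketch of the same 80:20 split using `train_test_split` (again with a made-up stand-in DataFrame; `random_state` just makes the shuffle repeatable):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the weight_loss.csv DataFrame (hypothetical values)
data = pd.DataFrame({'Weight': range(100)})

# test_size=0.2 reproduces the 80:20 split from before
training_set2, testing_set2 = train_test_split(data, test_size=0.2, random_state=0)

assert len(data) == len(training_set2) + len(testing_set2)
```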
Sometimes we want to split our original data into three subsets: in addition to the training and test subsets, we also keep a validation subset.
Your job: split the dataset into three partitions in the ratio 60:20:20, corresponding to the training, test, and validation subsets.
# Your code
training_set3 = None # Fill!
testing_set3 = None # Fill!
validation_set3 = None # Fill!
# Make sure the lengths work out
assert(len(data) == len(training_set3) + len(testing_set3) + len(validation_set3))
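One way to get a 60:20:20 split is to call `train_test_split` twice: first carve off the test set, then split what remains. Note the second ratio: 20% of the original data is 25% of the remaining 80%. (As above, the DataFrame is a hypothetical stand-in for the real data.)

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the weight_loss.csv DataFrame (hypothetical values)
data = pd.DataFrame({'Weight': range(100)})

# Step 1: carve off 20% of the original data for testing
train_val, testing_set3 = train_test_split(data, test_size=0.2, random_state=0)

# Step 2: 20% of the original = 0.25 of the remaining 80% -> validation
training_set3, validation_set3 = train_test_split(train_val, test_size=0.25, random_state=0)

# Make sure the lengths work out
assert len(data) == len(training_set3) + len(testing_set3) + len(validation_set3)
```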
Part 2 – Let's Be Centered¶
Many machine learning algorithms assume that the input data is normalised. For continuous features, this means each feature should have zero mean and unit standard deviation, which we can achieve with the following transformation
\begin{equation}
\textbf{x}_{\text{norm}} = \frac{\textbf{x}_{\text{raw}} - \mu}{\sigma},
\end{equation}
where $ \textbf{x}_{\text{raw}} $ is a vector of values for one feature, and $ \mu $ and $ \sigma $ are the mean and standard deviation of that specific feature.
Let's normalise the training subset of our dataset using that transformation.
# Your code
def normalise_column(x):
pass # Define this!
norm_training_weights = normalise_column(training_set1['Weight'])
assert(abs(np.mean(norm_training_weights)) < 1e-10)
assert(abs(np.std(norm_training_weights) - 1) < 1e-10)
Discussion: if we also want to normalise our test dataset, which mean $ \mu $ and standard deviation $ \sigma $ values should we use? Discuss here!
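A sketch of one way to define the normalisation function, applied to a made-up series of weights (the real exercise uses `training_set1['Weight']`):

```python
import numpy as np
import pandas as pd

def normalise_column(x):
    # Zero-mean, unit-standard-deviation transform: (x - mu) / sigma
    return (x - np.mean(x)) / np.std(x)

# Hypothetical weight values standing in for training_set1['Weight']
weights = pd.Series([60.0, 72.5, 81.0, 68.3, 90.1])
norm = normalise_column(weights)

assert abs(np.mean(norm)) < 1e-10
assert abs(np.std(norm) - 1) < 1e-10
```

Note that `np.std` uses the population standard deviation (`ddof=0`), which is what the asserts above check; `pd.Series.std` defaults to `ddof=1` and would fail them.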