Lab3_Classification2_Qs
Classification (2) – implementation and application of Naive Bayes classification, and Logistic Regression¶
Introduction¶
In this notebook we continue with some methods of classification, starting with an implementation of Naive Bayes, followed by an application of Naive Bayes to a benchmark dataset. The notebook also looks at the related method of Logistic Regression for comparison.
Classification¶
Classification models learn a mapping $h(\boldsymbol{X})$ from a feature space $\boldsymbol{X}$ to a finite set of labels $\boldsymbol{Y}$
In this lab we will focus for simplicity on binary classification, where the labels are assumed to be in $\{-1,1\}$ or alternatively $\{0,1\}$.
We will use simple generated datasets and a real data set on the sinking of the Titanic to explore some different classification algorithms. For a description of the variables and more information on the data see: https://www.kaggle.com/c/titanic-gettingStarted/data
%pylab inline
%load_ext autoreload
%autoreload
import numpy as np
import matplotlib.pyplot as pl
import util
from scipy.stats import itemfreq
from scipy.stats import bernoulli
from scipy.stats import multivariate_normal as mvnorm
import sklearn
import scipy
import matplotlib
print("SKLEARN",sklearn.__version__)
print("SCIPY",scipy.version.full_version)
print("NUMPY",np.__version__)
print("MATPLOTLIB",matplotlib.__version__)
X, Y = util.load_data() # passenger_class, is_female, sibsp, parch, fare, embarked (categorical 0-3)
X_demean = X - np.mean(X, axis=0)
X_unitsd = X_demean/(np.std(X_demean,axis=0))
X_whiten = np.dot(X_demean, util.whitening_matrix(X_demean))
Populating the interactive namespace from numpy and matplotlib
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
SKLEARN 0.20.1
SCIPY 1.1.0
NUMPY 1.15.4
MATPLOTLIB 3.0.2
One approach to learning a classification function $h(\boldsymbol{X})$ is to model $P(y|\boldsymbol{x})$ and convert that to a classification by setting:
\begin{equation}h(\boldsymbol{X}) = \begin{cases} 1 & \text{if }P(y|\boldsymbol{x}) > \frac{1}{2}\\ 0 & \text{otherwise}\end{cases}
\end{equation}
Example: Suppose we want to build a model to predict the probability of survival on the Titanic based on just two categorical features, a person's class (1, 2 or 3) and their sex (1=female, 0=male). An obvious approach would be to create a category for each combination of our features (female 1st, female 2nd … male 3rd) and calculate the proportion who survived in each as an estimate for the survival probability $P(y|\boldsymbol{x})$. For each observation in our test data, we simply look up the survival rate in the corresponding category.
This corresponds to maximum likelihood estimation: $\hat{\theta} = argmax_{\theta'}P(data|\theta')$, where the parameters, $\theta$, we want to estimate are the true probabilities $P(y|\boldsymbol{x})$ for each combination of $\boldsymbol{x}$, and $data$ is the set of features and labels we have observed.
combinations = [(i,j) for i in [1,2,3] for j in [0,1]]
for c in combinations:
    match = np.where((X[:,0] == c[0]) * (X[:,1] == c[1]))[0]
    print(c,sum(Y[match])/float(len(match)))
(1, 0) 0.38961038961038963
(1, 1) 0.9846153846153847
(2, 0) 0.1780821917808219
(2, 1) 0.9122807017543859
(3, 0) 0.14218009478672985
(3, 1) 0.4583333333333333
Question: Why will this approach not work in general? What happens as we increase the number of features or the number of values each feature can take? What about if features are continuous?
Naive Bayes¶
Following Bayes Rule, we can write:
$P(y|\boldsymbol{x}) \propto P(y)P(\boldsymbol{x}|y) = P(y)P(x_1,x_2,\dots,x_D|y)$
It is easy to estimate $P(y)$ as just the proportion of each class in the training data. We could also directly estimate $P(x_1,x_2,\dots,x_D|y)$ for each $y$ (for example with kernel density estimation), but as the number of features $D$ gets large this estimation suffers from the curse of dimensionality.
Naive Bayes assumes that the data was generated by a model in which all the features are independent of one another given the class label, so that we can estimate $P(x_j|y)$ separately for each feature.
\begin{equation}
P(y|\boldsymbol{x}) \propto P(y)\prod_{j=1}^D P(x_j|y)
\end{equation}The normalisation constant can be obtained as follows:
\begin{equation}
P(y|\boldsymbol{x}) = \frac{P(y)\prod_{j=1}^D P(x_j|y)}{P(\boldsymbol{x})},
\end{equation}where,
\begin{equation}
P(\boldsymbol{x}) = P(y=0)\prod_{j=1}^D P(x_j|y=0) + P(y=1)\prod_{j=1}^D P(x_j|y=1),
\end{equation}
this operation is called marginalisation, since we marginalise (or sum/integrate out) $y$ from the joint distribution (top line) $P(y, \mathbf{x}) = P(y)P(\mathbf{x}|y)$ to obtain a distribution over $\mathbf{x}$.
Exercise: Implement a Naive Bayes model for the Titanic data set using passenger_class and is_female as features
# a function that may be useful
def proportions(array):
    """ returns a map from each unique value in the input array to the proportion of times that value occurs """
    prop = itemfreq(array)
    prop[:,1] = prop[:,1]/sum(prop,axis=0)[1]
    return dict(prop)
class Naive_Bayes:
    def train(self,X,Y):
        """ trains the model with features X and labels Y """
        # 1) Estimate P(Y=1)
        self.py = sum(Y)/float(len(Y))
        # 2) For each feature, x, estimate P(x|y=1) and P(x|y=0)
        survived = X[np.where(Y==1)[0],:] # the features of those who survived
        died = X[np.where(Y==0)[0],:] # the features of those who died
        # estimate P(gender|survived)
        self.gs = YOUR CODE HERE # return a map from gender value to probability. Hint: the proportions function above may help
        # estimate P(class|survived)
        self.cs = YOUR CODE HERE # return a map from class to probability for those who survived
        # estimate P(gender|died)
        self.gd = YOUR CODE HERE # return a map from gender value to probability for those who died
        # estimate P(class|died)
        self.cd = YOUR CODE HERE # return a map from class to probability for those who died

    def predict(self,sex,p_class):
        """ outputs the probability of survival for a given class and gender """
        # calculate unnormalised P(y=1|sex,p_class) as P(y=1)P(sex|y=1)P(p_class|y=1)
        ps = YOUR CODE HERE
        # calculate unnormalised P(y=0|sex,p_class) as P(y=0)P(sex|y=0)P(p_class|y=0)
        pd = YOUR CODE HERE
        # calculate the survival ratio r = ps/pd and the normalised probability from the ratio
        r = ps/pd
        psn = r/(1+r)
        return psn
# run the model
model = Naive_Bayes()
model.train(X,Y)
for p_class,sex in combinations:
    print((p_class,sex),model.predict(sex,p_class))
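For reference, one possible way of filling in the placeholders above is sketched below. It is a sketch of one solution (reusing the proportions helper and the variable names from the skeleton), not the only valid one.
# A possible solution sketch for the exercise above.
class Naive_Bayes_Sketch:
    def train(self,X,Y):
        """ trains the model with features X and labels Y """
        self.py = sum(Y)/float(len(Y))                   # P(y=1)
        survived = X[np.where(Y==1)[0],:]
        died = X[np.where(Y==0)[0],:]
        self.gs = proportions(survived[:,1])             # P(gender|survived)
        self.cs = proportions(survived[:,0])             # P(class|survived)
        self.gd = proportions(died[:,1])                 # P(gender|died)
        self.cd = proportions(died[:,0])                 # P(class|died)
    def predict(self,sex,p_class):
        """ outputs the probability of survival for a given class and gender """
        ps = self.py*self.gs[sex]*self.cs[p_class]       # unnormalised P(y=1|sex,p_class)
        pd = (1-self.py)*self.gd[sex]*self.cd[p_class]   # unnormalised P(y=0|sex,p_class)
        r = ps/pd
        return r/(1+r)

model = Naive_Bayes_Sketch()
model.train(X,Y)
for p_class,sex in combinations:
    print((p_class,sex),model.predict(sex,p_class))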
Exercise: Compare these predictions with those based just on the survival proportions in each category. How well does the Naive Bayes assumption hold in this case?
Question: How does the number of parameters to be learnt scale with the number of features for Naive Bayes?
Exercise: Run Naive Bayes from scikit-learn using the same features.
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB
# Sklearn doesn't have a Naive Bayes model that works directly on categorical features.
# We need to first encode our (p_class, sex) to (is_first, is_second, is_third, is_male, is_female)
# use preprocessing.OneHotEncoder to create a new dataset X2 that is the transformation of the first 2 columns of X
nb_enc = OneHotEncoder()
X2 = YOUR CODE HERE # fit the encoder on the first two columns of X and transform them
# fit a Multinomial Naive Bayes model
nb = YOUR CODE HERE # create a MultinomialNB model and fit it to X2 and Y
# transforms our combinations to the one-hot encoding
c = nb_enc.transform(np.asarray(combinations)).toarray()
# gets predictions for each combination
predictions = nb.predict_proba(c)
# prints your predictions in the same format as the previous models
for i in range(len(c)):
    print(combinations[i],predictions[i][1])
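A possible way to complete the cell above is sketched below (assuming the sklearn 0.20 API printed at the top of the notebook; categories='auto' simply opts in to the newer OneHotEncoder behaviour and silences the legacy-behaviour warning).
# A possible solution sketch for the encoding and fitting steps above.
nb_enc = OneHotEncoder(categories='auto')
X2 = nb_enc.fit_transform(X[:,0:2])                 # one-hot encode (p_class, is_female)
nb = MultinomialNB()
nb.fit(X2, Y)
c = nb_enc.transform(np.asarray(combinations)).toarray()
predictions = nb.predict_proba(c)
for i in range(len(c)):
    print(combinations[i],predictions[i][1])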
Naive Bayes can also handle continuous features. The data below is generated by a Gaussian mixture model. For each class there is a separate 2-dimensional Gaussian distribution over the features x1, x2.
# Generates some data from a Gaussian Mixture Model.
mean0 = [-1,-1] # the mean of the gaussian for class 0
mean1 = [1,1] # the mean of the gaussian for class 1
cov0 = [[.5, .28], [.28, .5]] # the covariance matrix for class 0
cov1 = [[1, -.8], [-.8, 1]] # the covariance matrix for class 1
mixture = util.GaussianMixture(mean0,cov0,mean1,cov1)
mX,mY = mixture.sample(500,0.5,plot=True)
Exercise: Fit a Gaussian Naive Bayes model using Sklearn
from sklearn.naive_bayes import GaussianNB
# fit a GaussianNB model
gnb = YOUR CODE HERE # create and fit a Gaussian Naive Bayes model to the Gaussian mixture data mX,mY
# plots the probability that a point (x,y) belongs to the class Y=1 according to your model, together with the decision boundary p=.5
x = np.linspace(-4,4,100)
y = np.linspace(-4,4,100)
test_points = np.dstack(np.meshgrid(x, y)).reshape(-1,2)
z = gnb.predict_proba(test_points)[:,1].reshape(len(x),len(y)) # probability Y = 1
f,ax = subplots(1,1,figsize=(5,5))
cn = ax.contourf(x,y,z)
ct = ax.contour(cn,levels=[0.5])
ax.scatter(mX[:,0],mX[:,1],s=5, c = [“black” if t < 1 else "white" for t in mY],alpha=1)
ax.clabel(ct)
# Try changing the covariance matrices and refitting your model.
# When does the probability distribution returned by Naive Bayes resemble the true one
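One way to fill in the placeholder above is sketched below; plug it in and re-run the cell so the plotting code can use gnb.
# A possible solution sketch for the placeholder above.
gnb = GaussianNB()
gnb.fit(mX, mY)
print("training accuracy:", gnb.score(mX, mY))      # quick sanity check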
Logistic Regression¶
Logistic regression models $P(y|\boldsymbol{x})$ directly by assuming it is a (logistic) function of a linear combination of the features. The logistic function $\theta(s) = \frac{e^s}{e^s+1}$ maps the weighted sum of the features to $(0,1)$, allowing it to model a probability. Training logistic regression corresponds to learning the weights $\boldsymbol{w}$ that maximise the likelihood function:
\begin{equation}
P(y_1...y_n|\boldsymbol{x}_1...\boldsymbol{x}_n,\boldsymbol{w}) = \prod_{i=1}^n \theta(y_i\boldsymbol{w}^T\boldsymbol{x}_i)
\end{equation}Maximising the likelihood $P(y_1...y_n|\boldsymbol{x}_1...\boldsymbol{x}_n,\boldsymbol{w})$ is equivalent to minimising the negative log-likelihood:
\begin{equation}
\boldsymbol{w}^* = argmin_{\boldsymbol{w}}\left( -\log\left(\prod_{i=1}^n \theta(y_i\boldsymbol{w}^T\boldsymbol{x}_i)\right)\right)
= argmin_{\boldsymbol{w}}\left( \sum_{i=1}^n \ln(1+e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i})\right)
\end{equation}
Once we have the weights $\boldsymbol{w}^*$, we can predict the probability that a new observation belongs to each class.
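As a quick illustration of the pieces above, the sketch below evaluates the logistic function on a made-up weight vector and feature vector; w_example and x_example are hypothetical numbers chosen for illustration, not fitted values.
# Illustration of the logistic function applied to a weighted combination of features.
def theta(s):
    return 1.0/(1.0 + np.exp(-s))                    # same as e^s/(e^s + 1)

w_example = np.array([0.5, -1.0, 2.0])               # hypothetical weights (intercept, w1, w2)
x_example = np.array([1.0, 0.3, -0.7])               # a feature vector with a leading 1 for the intercept
print("P(y=1|x) =", theta(np.dot(w_example, x_example)))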
Question: Suppose that we have a data set that is linearly separable. What happens to the weights $\boldsymbol{w}$ when we run logistic regression?
Exercise: Use Sklearn to fit a logistic regression model on the Gaussian mixture data.
#Run Logistic regression on the gaussian mixture data
from sklearn.linear_model import LogisticRegression
logistic = YOUR CODE HERE # create and fit a logistic regression model on mX,mY
# plot the probability y = 1 as over the feature space as for Naive Bayes
logistz = logistic.predict_proba(test_points)[:,1].reshape(len(x),len(y)) # probability Y = 1
f,ax = subplots(1,1,figsize=(5,5))
cn = ax.contourf(x,y,logistz)
ct = ax.contour(cn,levels=[0.5])
ax.scatter(mX[:,0],mX[:,1],s=5, c = ["black" if t < 1 else "white" for t in mY],alpha=1)
ax.clabel(ct)
show()
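One way to fill in the placeholder above; solver='lbfgs' is an explicit choice here to avoid the default-solver warning in sklearn 0.20.
# A possible solution sketch for the placeholder above.
logistic = LogisticRegression(solver='lbfgs')
logistic.fit(mX, mY)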
# we can model more complex decision boundaries by expanding the feature space to include combinations of features
# re-fit logistic regression adding in all quadratic combinations of the features, i.e. x1, x2, x1*x2, x1^2, x2^2
from sklearn.preprocessing import PolynomialFeatures
poly_expand = YOUR CODE HERE # create a polynomial feature transformer that produces quadratic combinations
m2X = YOUR CODE HERE # use poly_expand to transform the original features (mX)
logistic.YOUR CODE HERE # fit the logistic model with the new features
# transform the test points, then predict and plot
testpoints2 = poly_expand.transform(test_points)
logistic2z = logistic.predict_proba(testpoints2)[:,1].reshape(len(x),len(y)) # probability Y = 1
f,ax = subplots(1,1,figsize=(5,5))
cn = ax.contourf(x,y,logistic2z)
ct = ax.contour(cn,levels=[0.5])
ax.scatter(mX[:,0],mX[:,1],s=5, c = ["black" if t < 1 else "white" for t in mY],alpha=1)
ax.clabel(ct)
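A possible way to complete the quadratic-expansion cell above; include_bias=False is a choice made here because LogisticRegression already fits an intercept.
# A possible solution sketch for the placeholders above.
poly_expand = PolynomialFeatures(degree=2, include_bias=False)
m2X = poly_expand.fit_transform(mX)                  # columns: x1, x2, x1^2, x1*x2, x2^2
logistic.fit(m2X, mY)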
With large numbers of features there is a risk of overfitting to the training data. We can tune a logistic regression model to reduce the risk of overfitting by penalising large weights $\boldsymbol{w}$.
Exercise: Experiment with the regularisation parameters sklearn provides:
penalty = "l1" or "l2" and C = inverse of weight of regularisation term.
lreg = YOUR CODE HERE # create and fit a regularised logistic regression model to the quadratically expanded features
# plots the probability as before
logistic2z_reg = lreg.predict_proba(testpoints2)[:,1].reshape(len(x),len(y)) # probability Y = 1
f,ax = subplots(1,1,figsize=(5,5))
cn = ax.contourf(x,y,logistic2z_reg)
ct = ax.contour(cn,levels=[0.5])
ax.scatter(mX[:,0],mX[:,1],s=5, c = ["black" if t < 1 else "white" for t in mY],alpha=1)
ax.clabel(ct)
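One possible configuration to start experimenting from is sketched below. Note that in sklearn 0.20 penalty="l1" requires the 'liblinear' or 'saga' solver, while 'lbfgs' only supports "l2"; the C value here is an arbitrary starting point.
# A possible solution sketch: a strongly regularised L2 model on the expanded features.
lreg = LogisticRegression(penalty='l2', C=0.1, solver='lbfgs')
lreg.fit(m2X, mY)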
# Run logistic regression on the titanic data
titanic_logist = YOUR CODE HERE # create and fit a logistic regression model on the titanic data
# Look at the coefficients (weights) in the model. Are they meaningful?
# Do you need to change the way any of the features were encoded?
print(titanic_logist.coef_)
print(titanic_logist.intercept_)
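One possible way to fill in the placeholder above (then re-run the cell). Fitting on the standardised features X_unitsd from the top of the notebook is a choice that makes the coefficient magnitudes easier to compare; the categorical embarked column would ideally be one-hot encoded first, which is exactly what the question above is getting at.
# A possible solution sketch for the placeholder above.
titanic_logist = LogisticRegression(solver='lbfgs')
titanic_logist.fit(X_unitsd, Y)                      # X_unitsd: standardised features defined near the top of the notebook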
Implement Logistic Regression (Optional)¶
Recall for logistic regression we are trying to find (assuming we have encoded $Y$ as $\{-1,1\}$)
\begin{equation}
\boldsymbol{w}^* = argmin_{\boldsymbol{w}}\left( \sum_{i=1}^n \ln(1+e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i})\right)
\end{equation}
This is a convex optimisation problem in $\boldsymbol{w}$.
We can solve it using gradient descent (or an optimisation library).
Exercise: Implement logistic regression using scipy's optimisation library and run it on the Gaussian mixture data mX,mY.
# implement logistic regression using scipy's optimisation library and run it on the Gaussian mixture data
from scipy.optimize import minimize
# copy the inputs so our modifications don't affect the original data
dataX = mX.copy()
dataY = mY.copy()
# encode the labels as -1,1
dataY[dataY==0] = -1
# add a column of all ones to the features to allow us to fit an intercept
dataX = np.hstack((np.ones((dataX.shape[0],1)),dataX))
# implement the loss function
def loss(w,X,Y):
    YOUR CODE HERE
# start the optimization with randomly guessed weights
w0 = np.random.random((dataX.shape[1],1))
# runs the optimisation
optimal = minimize(loss,w0,args=(dataX,dataY),method="BFGS")
w = optimal.x
# how does this compare with the coefficients you saw using Sklearn?
# try refitting the sklearn logistic model with a very high value for C (like 10000).
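A possible implementation of the loss placeholder above. np.logaddexp(0, -m) computes ln(1 + e^{-m}) in a numerically stable way, which is a choice made here rather than something required by the exercise.
# A possible solution sketch for the loss placeholder above.
def loss(w,X,Y):
    """ negative log-likelihood: sum_i ln(1 + exp(-y_i w^T x_i)) """
    margins = Y * X.dot(w)                           # y_i * w^T x_i for each observation
    return np.sum(np.logaddexp(0, -margins))         # stable version of sum(log(1 + exp(-margins)))

w0 = np.random.random(dataX.shape[1])
optimal = minimize(loss, w0, args=(dataX,dataY), method="BFGS")
print(optimal.x)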
The optimisation method we are using (BFGS) needs to know the Jacobian (the gradient of the loss function with respect to $\boldsymbol{w}$). Since we didn't supply it, scipy approximates it numerically. We can speed things up by supplying it.
\begin{equation}
L(\boldsymbol{w}) = \sum_{i=1}^n \ln(1+e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i}) \longleftarrow \text{ loss function}\\
\nabla{L} = [\frac{\partial L}{\partial w_1},\frac{\partial L}{\partial w_2}, ..., \frac{\partial L}{\partial w_D}] \longleftarrow \text{ definition of gradient}\\
\frac{\partial L}{\partial w_j} = -\sum_{i=1}^n x_{ij} \frac{ y_i e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i}}{1+e^{-y_i\boldsymbol{w}^T\boldsymbol{x}_i}} \longleftarrow \text{ result of taking partial derivative of loss function with respect to weight $j$}\\
\end{equation}Exercise: Repeat the previous exercise, but supply the Jacobian to the minimizer.
# implement jacobian
def grad_loss(w,X,Y):
    YOUR CODE HERE
# start the optimization with randomly guessed weights
w0 = np.random.random((dataX.shape[1],1))
optimal = minimize(loss,w0,args=(dataX,dataY),jac = grad_loss, method="BFGS")
print(optimal)
w = optimal.x
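A possible implementation of the gradient placeholder above; it uses the identity e^{-m}/(1 + e^{-m}) = 1/(1 + e^{m}) to simplify the partial derivative given earlier.
# A possible solution sketch for the gradient placeholder above.
def grad_loss(w,X,Y):
    """ dL/dw_j = -sum_i x_ij * y_i * exp(-y_i w^T x_i) / (1 + exp(-y_i w^T x_i)) """
    margins = Y * X.dot(w)
    coef = -Y / (1.0 + np.exp(margins))              # equals -y_i * e^{-m_i}/(1 + e^{-m_i})
    return X.T.dot(coef)

w0 = np.random.random(dataX.shape[1])
optimal = minimize(loss, w0, args=(dataX,dataY), jac=grad_loss, method="BFGS")
print(optimal.x)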