
QBUS6850: Tutorial 4 – Linear Regression 2, Feature

Extraction/Preparation, Logistic Regression

Objectives

• To know how to do model selection;
• To learn how to extract features;
• To learn how to do logistic regression;

1.1 Ordinary Least Squares Drawbacks

OLS (ordinary least squares) is an unbiased estimator by definition; however, this does not mean that it always produces a good model estimate.
OLS can completely fail or produce poor results under relatively common conditions.

The shared point of failure among all these conditions is that X^T X is singular, and the OLS estimate depends on (X^T X)^(-1). When X^T X is singular we cannot compute its inverse.
This happens frequently due to:
– collinearity, i.e. predictors (features) are not independent
– n > m, i.e. the number of features n exceeds the number of observations m

Let G = X^T X; then G_ij = x_i^T x_j, where x_i is the i-th feature (column) vector.
In other words, each element of G encodes the similarity (inner product) between one feature vector and every other feature vector. When we have collinear features this can result in det(G) = 0, meaning that G = X^T X is singular and we cannot compute its inverse.
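To see this concretely, here is a minimal sketch (with a made-up toy matrix, not the tutorial dataset) in which one feature column duplicates another, so X^T X is rank deficient and cannot be inverted:

import numpy as np

# Toy design matrix: the third column duplicates the first (perfect collinearity)
X_demo = np.array([[1.0, 2.0, 1.0],
                   [2.0, 0.5, 2.0],
                   [3.0, 1.5, 3.0],
                   [4.0, 2.5, 4.0]])

G = X_demo.T @ X_demo
print(np.linalg.matrix_rank(G))   # 2 < 3, so G is singular

try:
    np.linalg.inv(G)
except np.linalg.LinAlgError as err:
    print("Cannot invert X^T X:", err)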

Moreover, OLS is sensitive to outliers: a single large outlier value can greatly affect our regression line. OLS is a low (zero) bias but potentially high variance estimator, so we might wish to sacrifice some bias to achieve lower variance.

1.2 Ridge Regression

Ridge regression addresses both issues: high variance and non-invertibility.

By reducing the magnitude of the coefficients using the ℓ2 norm we can reduce the coefficient variance and thus the variance of the estimator. The ridge regression objective function (up to the scaling convention used in the lecture slides) is

L(β) = 1/2 ||y − Xβ||_2^2 + λ/2 ||β||_2^2

where λ is a tuneable parameter that controls the strength of the penalty. The solution is given by

β̂ = (X^T X + λI)^(-1) X^T y

By reducing the variance of the estimator, ridge regression tends to be more “robust” against outliers than OLS, since we have a solution with greater bias and less variance.
Conveniently this means that X^T X is modified to X^T X + λI. In other words, we add a diagonal component to the matrix that we want to invert. This diagonal component is called the “ridge”. It makes the matrix full rank (in fact positive definite), so it is no longer singular.
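As a minimal sketch (again with made-up toy data, and a helper name chosen here for illustration), the closed-form ridge solution simply adds λ to the diagonal of X^T X before solving:

import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge solution: (X^T X + lambda * I)^(-1) X^T y."""
    G = X.T @ X + lam * np.eye(X.shape[1])   # add the "ridge" to the diagonal
    return np.linalg.solve(G, X.T @ y)       # solve is more stable than an explicit inverse

# Toy data with a duplicated (collinear) column: OLS fails here,
# but the ridge term makes the system invertible.
X_toy = np.array([[1.0, 2.0, 1.0],
                  [2.0, 0.5, 2.0],
                  [3.0, 1.5, 3.0],
                  [4.0, 2.5, 4.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
print(ridge_closed_form(X_toy, y_toy, lam=0.1))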

We can use cross-validation to find the shrinkage parameter λ that we believe will generalise best to unseen data.
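The sketch below shows what such a search looks like by hand on synthetic data, using Ridge and cross_val_score over an arbitrary grid of candidate values; RidgeCV, used later in this tutorial, wraps the same idea.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data purely for illustration
X_syn, y_syn = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

# Candidate shrinkage parameters (an arbitrary grid)
lambdas = 10.0 ** np.arange(-2, 3)

# Average 5-fold CV MSE for each candidate, then keep the best one
cv_mse = [-cross_val_score(Ridge(alpha=lam), X_syn, y_syn,
                           scoring='neg_mean_squared_error', cv=5).mean()
          for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_mse))]
print(best_lambda)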

About the Data

The dataset is a collection of community and crime statistics from
http://archive.ics.uci.edu/ml/datasets/communities+and+crime


from sklearn.linear_model import LinearRegression, RidgeCV
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# http://archive.ics.uci.edu/ml/datasets/communities+and+crime
crime = pd.read_csv('communities.csv', header=None, na_values=['?'])

Pre-processing steps

#Delete the first 5 columns
crime = crime.iloc[:, 5:]
# Remove rows with missing entries
crime.dropna(inplace=True)

Extract all the features available as our predictors and the response variable
(number of violent crimes per capita)

# Get the features X and target/response y
X = crime.iloc[:, :-1]
y = crime.iloc[:, -1]

To evaluate model performance we will split the data into train and test sets.


Our procedure is as follows:
– Fit the model on the training set
– Test performance via loss function values on the test set

# Split data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Fit OLS model and calculate loss values over test data

lm = LinearRegression()
lm.fit(X_train, y_train)
preds_ols = lm.predict(X_test)
print("OLS: {0}".format(mean_squared_error(y_test, preds_ols)/2))

Generate a set of candidate values for λ, fit the ridge model using cross-validation on the training set to find the optimal λ, and calculate the MSE over the test set.

alpha_range = 10.**np.arange(-2, 3)

# Note: the 'normalize' argument has been removed in recent scikit-learn versions;
# if it is unavailable, standardise the features beforehand instead.
rregcv = RidgeCV(normalize=True, scoring='neg_mean_squared_error',
                 alphas=alpha_range)

rregcv.fit(X_train, y_train)

preds_ridge = rregcv.predict(X_test)
print("RIDGE: {0}".format(mean_squared_error(y_test, preds_ridge)/2))

However, RidgeCV can also be used with its default parameters, e.g.

rregcv = RidgeCV()

rregcv.fit(X_train, y_train)

preds_ridge = rregcv.predict(X_test)
print("RIDGE: {0}".format(mean_squared_error(y_test, preds_ridge)/2))

1.3 Text Feature Extraction

• Bag of words

• Embedding

1.3.1 Bag of Words

Bag of words (1-gram) counts the occurrences of each word and uses these counts as the features.

Three steps need to be performed to get Bag of Words (BoW) features:

1. Tokenizing: segment the corpus into “words”
2. Counting: count the appearance frequency of different words
3. Normalizing: reweight the counts (see tf-idf below)

CountVectorizer from sklearn combines the tokenizing and counting steps.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

X_txt = vectorizer.fit_transform(corpus)

# In recent scikit-learn versions use list(vectorizer.get_feature_names_out()) instead
vectorizer.get_feature_names() == (
    ['and', 'document', 'first', 'is', 'one',
     'second', 'the', 'third', 'this'])

# X_txt is the BoW feature matrix of the corpus
print(X_txt.toarray())

# the value stored for each word is its vocabulary id
print(vectorizer.vocabulary_)

# Unseen words are ignored
vectorizer.transform(['Something completely new.']).toarray()

Bag of Words features cannot capture local information. E.g. “believe or not” has the same features as “not or believe”. Bi-grams preserve more local information by treating every 2 contiguous words as one token in the vocabulary.
For example, “believe or”, “or not”, “not or” and “or believe” are all counted. The feature extraction is shown in the code below.

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2),
                                    token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()

print(analyze('believe or not a b c d e'))

#Extract the features

X_txt_2 = bigram_vectorizer.fit_transform(corpus).toarray()

print(X_txt_2)

1.3.2 tf-idf

Some words have very high frequency (e.g. “the”, “a”, “which”) and therefore carry little meaningful information about the actual contents of the document. We need to down-weight them to prevent the high-frequency words shadowing the other words. The tf-idf weight of a term t in a document d is

tf-idf(t, d) = tf(t, d) × idf(t),   with   idf(t) = ln( n / df(t) ) + 1   (smooth_idf=False)

where n is the number of documents and df(t) is the number of documents containing the term t.

Each row is then normalized to have unit Euclidean norm:

v_norm = v / ||v||_2

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(smooth_idf=False)

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

tfidf = transformer.fit_transform(counts)

print(tfidf)

print(tfidf.toarray())
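To make the formula above concrete, the following sketch (plain numpy, reusing the counts matrix defined above) reproduces the smooth_idf=False computation by hand; its output should match tfidf.toarray() up to floating-point precision.

import numpy as np

counts_arr = np.array(counts, dtype=float)
n_docs = counts_arr.shape[0]

# df(t): number of documents containing each term
df = (counts_arr > 0).sum(axis=0)

# idf(t) = ln(n / df(t)) + 1  (the smooth_idf=False convention)
idf = np.log(n_docs / df) + 1.0

# tf-idf, then normalise each row to unit Euclidean norm
tfidf_manual = counts_arr * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)
print(tfidf_manual)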

An even more concise way to compute the tf-idf features is TfidfVectorizer, which combines the counting and the tf-idf weighting in a single step.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

print(corpus)

X_txt_3 = vectorizer.fit_transform(corpus)

print(X_txt_3.toarray())

1.4 Feature Selection

• LASSO
• Elastic-Net
• Refer to lecture slides for their loss functions

We can take two approaches when using LASSO or Elastic-net for feature

selection. We can softly regularise the coefficients until a sufficient number
are set to 0. Fortunately using cross validation to optimise the regularisation
parameter lambda (called alpha in sklearn) usually results in many of the
features being ignored since their coefficient values are shrunk to 0.

Alternatively, you can set a threshold value for the coefficient and find a
suitable regularisation parameter that meets this requirement.
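As a minimal sketch of this threshold approach (not the path we take below), sklearn's SelectFromModel can keep only the features whose fitted coefficients exceed a chosen threshold; the Lasso alpha and the threshold value used here are arbitrary illustrations, and X_train/y_train are the crime training data from Section 1.2.

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Keep only features whose |coefficient| exceeds the (arbitrary) threshold
selector = SelectFromModel(Lasso(alpha=0.01), threshold=0.01)
selector.fit(X_train, y_train)

print("Number of selected features: {0}".format(selector.get_support().sum()))
X_train_selected = selector.transform(X_train)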

We will take the path of cross validation.

Note that due to the shared L1/L2 regularisation of Elastic-Net it does not
aggressively prune features like Lasso. However, in practice it often performs
better when used for regression prediction.

NOTE: You can also use LASSO/Elastic-Net for regular regression
tasks.

Fit the LASSO model. We perform a train/test split and use CV on the training set to determine the optimal parameters.

from sklearn.linear_model import LassoCV, ElasticNetCV

X = crime.iloc[:, :-1]
y = crime.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

lascv = LassoCV()
lascv.fit(X_train, y_train)

preds_lassocv = lascv.predict(X_test)
print("LASSO: {0}".format(mean_squared_error(y_test, preds_lassocv)/2))
print("LASSO Lambda: {0}".format(lascv.alpha_))

Fit the Elastic-Net model

elascv = ElasticNetCV(max_iter=10000)
elascv.fit(X_train, y_train)

preds_elascv = elascv.predict(X_test)
print("Elastic Net: {0}".format(mean_squared_error(y_test, preds_elascv)/2))
print("Elastic Net Lambda: {0}".format(elascv.alpha_))

Determine which columns were retained by each model

columns = X.columns.values
lasso_cols = columns[np.nonzero(lascv.coef_)]
print("LASSO Features: {0}".format(lasso_cols))

elas_cols = columns[np.nonzero(elascv.coef_)]
print("Elastic Features: {0}".format(elas_cols))

1.5 Regression vs Classification

• Logistic Regression

WARNING: Do not use regression for classification tasks.

In general it is ill-advised to use linear regression for classification tasks.

Regression learns a continuous output variable from a predefined linear (or
higher order) model. It learns the parameters of this model to predict an output.

Classification on the other hand is not explicitly interested in the underlying
generative process. Rather it is a higher abstraction. We are not interested in
the specific value of something. Instead we want to assign each data vector to
the most likely class.

Logistic regression provides us with two desirable properties:

– the output of the logistic function is the direct probability of the data vector belonging to the success case;
– the logistic function is non-linear and more flexible than linear regression, which can improve classification accuracy and is often more robust to outliers.
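As a minimal sketch of the first property (with a single feature x and illustrative coefficients beta0 and beta1 chosen here, not fitted values), the logistic function maps the linear score to a probability in (0, 1):

import numpy as np

def logistic_probability(x, beta0, beta1):
    """P(y = 1 | x) = 1 / (1 + exp(-(beta0 + beta1 * x)))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

# Illustrative coefficient values only
print(logistic_probability(np.array([0.0, 1.0, 2.0]), beta0=-1.0, beta1=0.8))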

About the dataset

The data shows credit card loan status for many accounts with three
features: student, balance remaining and income.

from sklearn.linear_model import LogisticRegression

df = pd.read_csv('Default.csv')

Convert the student category column to Boolean values

df.student = np.where(df.student == 'Yes', 1, 0)

Use the balance feature and set the default status as the target value to
predict. You could also use all available features if you believe that they are
informative.
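For example, a one-line sketch using all three features mentioned above (assuming the income column in Default.csv is named 'income'):

# 'student' and 'balance' are used in the code below; 'income' is an assumed column name
X_all = df[['student', 'balance', 'income']]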

X = df[['balance']]

# Use a 1-D Series for the target so sklearn does not warn about the shape of y
y = df['default']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=1)

Fit the Logistic Regression model

log_res = LogisticRegression()

log_res.fit(X_train, y_train)

Predict probabilities of default.

predict_proba() returns the probabilities of an observation belonging to
each class. This is computed from the logistic regression function (see above).

predict() depends on predict_proba(). predict() returns the
class assignment based on the probability and the decision boundary. In other
words, predict returns the most likely class, i.e. the class with the greatest
probability (probability > 50% in the binary case).
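A minimal sketch of that relationship, using the fitted log_res and the validation data from this section: thresholding the second column of predict_proba() at 0.5 reproduces the output of predict().

import numpy as np

# Column index 1 corresponds to log_res.classes_[1] (the second class)
proba_val = log_res.predict_proba(X_val)[:, 1]

# Apply the 0.5 decision boundary by hand
manual_pred = np.where(proba_val > 0.5, log_res.classes_[1], log_res.classes_[0])

# Should agree with the built-in predict() for every observation
print(np.array_equal(manual_pred, log_res.predict(X_val)))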

prob = log_res.predict_proba(pd.DataFrame({'balance': [1200, 2500]}))
print(prob)

print("Probability of default with Balance of 1200: {0:.2f}%".format(prob[0, 1] * 100))
print("Probability of default with Balance of 2500: {0:.2f}%".format(prob[1, 1] * 100))

outcome = log_res.predict(pd.DataFrame({'balance': [1200, 2500]}))

print("Assigned class with Balance of 1200: {0}".format(outcome[0]))
print("Assigned class with Balance of 2500: {0}".format(outcome[1]))

We can evaluate classification accuracy using a confusion matrix (to be explained in the week 4 lecture) and the classification report.

from sklearn.metrics import confusion_matrix

pred_log = log_res.predict(X_val)

print(confusion_matrix(y_val, pred_log))

from sklearn.metrics import classification_report
print(classification_report(y_val, pred_log, digits=3))
