Theory
1. Problem: estimate why cacti grow in certain parts of the desert. We do not really know what distinguishes high-growth from low-growth clusters. Which type of ML system is most suitable for this problem?
2. How can the lack of a validation set bias the measurement of model fit?
3. How can we reduce overfitting?
Practice
In [4]:
# To plot pretty figures directly within Jupyter
%matplotlib inline
import matplotlib as mpl
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

import os

# Code example
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Download the data
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
for filename in ("oecd_bli_2015.csv", "gdp_per_capita.csv"):
    print("Downloading", filename)
    url = DOWNLOAD_ROOT + "datasets/lifesat/" + filename
    urllib.request.urlretrieve(url, filename)
Downloading oecd_bli_2015.csv
Downloading gdp_per_capita.csv
In [ ]:
# Code example
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")
In [ ]:
#1. Load dataset oecd_bli
import os
import pandas as pd
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
oecd_bli.head(2)
Out[ ]:
Indicator                                    Australia    Austria
Air pollution                                     13.0       27.0
Assault rate                                       2.1        3.4
Consultation on rule-making                       10.5        7.1
Dwellings without basic facilities                 1.1        1.0
Educational attainment                            76.0       83.0
Employees working very long hours                14.02       7.61
Employment rate                                   72.0       72.0
Homicide rate                                      0.8        0.4
Household net adjusted disposable income       31588.0    31173.0
Household net financial wealth                 47657.0    49887.0
Housing expenditure                               20.0       21.0
Job security                                       4.8        3.9
Life expectancy                                   82.1       81.0
Life satisfaction                                  7.3        6.9
Long-term unemployment rate                       1.08       1.19
Personal earnings                              50449.0    45199.0
Quality of support network                        92.0       89.0
Rooms per person                                   2.3        1.6
Self-reported health                              85.0       69.0
Student skills                                   512.0      500.0
Time devoted to leisure and personal care        14.41      14.46
Voter turnout                                     93.0       75.0
Water quality                                     91.0       94.0
Years in education                               19.4       17.0
1 Plot the relationship between 'Educational attainment' and 'Life satisfaction' and relate it to the proposition: "…in much wisdom is much grief, and he who increases knowledge increases sorrow." Use a scatter plot: y='Life satisfaction', x='Educational attainment'.
In [ ]:
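# One possible sketch for this exercise (not a prescribed solution): a simple
# scatter plot of the two indicators from the oecd_bli frame prepared above.
oecd_bli.plot(kind="scatter", x="Educational attainment", y="Life satisfaction")
plt.show()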
2 Estimate and report a linear model predicting 'Life satisfaction' from 'Educational attainment'.
In [ ]:
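# A minimal sketch, assuming oecd_bli from above; the variable names are illustrative.
data = oecd_bli[["Educational attainment", "Life satisfaction"]].dropna()
X = data[["Educational attainment"]].values
y = data["Life satisfaction"].values
lifesat_model = sklearn.linear_model.LinearRegression()
lifesat_model.fit(X, y)
print("intercept:", lifesat_model.intercept_, "slope:", lifesat_model.coef_[0])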
3 Plot the regression line and the scatter plot together. Is greater educational attainment correlated with greater life satisfaction?
In [1]:
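# A sketch overlaying the fitted line on the scatter plot; assumes X, y and
# lifesat_model from the previous cell.
plt.scatter(X, y)
x_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
plt.plot(x_line, lifesat_model.predict(x_line), "r-")
plt.xlabel("Educational attainment")
plt.ylabel("Life satisfaction")
plt.show()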
4 Drop all Scandinavian countries: ['Denmark', 'Estonia', 'Finland', 'Norway', 'Iceland'] and re-estimate the relationship between life satisfaction and educational attainment. How did the results change?
In [1]:
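# A possible sketch for this step; the name NoScanCountries matches the
# variable inspected in the next cell, the other names are illustrative.
drop_list = ['Denmark', 'Estonia', 'Finland', 'Norway', 'Iceland']
NoScanCountries = oecd_bli.drop(index=drop_list, errors='ignore')
data_noscan = NoScanCountries[["Educational attainment", "Life satisfaction"]].dropna()
model_noscan = sklearn.linear_model.LinearRegression()
model_noscan.fit(data_noscan[["Educational attainment"]].values,
                 data_noscan["Life satisfaction"].values)
print("slope without Scandinavia:", model_noscan.coef_[0])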
In [ ]:
NoScanCountries.index
Out[ ]:
Index(['Australia', 'Austria', 'Belgium', 'Brazil', 'Canada', 'Chile',
       'Czech Republic', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland',
       'Israel', 'Italy', 'Japan', 'Korea', 'Luxembourg', 'Mexico',
       'Netherlands', 'New Zealand', 'OECD - Total', 'Poland', 'Portugal',
       'Russia', 'Slovak Republic', 'Slovenia', 'Spain', 'Sweden',
       'Switzerland', 'Turkey', 'United Kingdom', 'United States'],
      dtype='object', name='Country')
In [1]:
5 Plot the regression lines for all countries and for the countries without Scandinavia.
In [2]:
# Plot regression lines for all countries and countries without Scandinavia
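# A possible sketch (assumes oecd_bli, lifesat_model and model_noscan from the
# cells above; these names are illustrative, not required by the exercise).
ax = oecd_bli.plot(kind="scatter", x="Educational attainment", y="Life satisfaction")
xs = np.linspace(oecd_bli["Educational attainment"].min(),
                 oecd_bli["Educational attainment"].max(), 100).reshape(-1, 1)
ax.plot(xs, lifesat_model.predict(xs), "b-", label="all countries")
ax.plot(xs, model_noscan.predict(xs), "r--", label="without Scandinavia")
ax.legend()
plt.show()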
In [ ]:
1. Load Boston. Dataset description is here: https://scikit-learn.org/stable/datasets/index.html#boston-dataset
A. CRIM per capita crime rate by town
B. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
C. INDUS proportion of non-retail business acres per town
D. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
E. NOX nitric oxides concentration (parts per 10 million)
F. RM average number of rooms per dwelling
G. AGE proportion of owner-occupied units built prior to 1940
H. DIS weighted distances to five Boston employment centres
I. RAD index of accessibility to radial highways
J. TAX full-value property-tax rate per $10,000
K. PTRATIO pupil-teacher ratio by town
L. B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
M. LSTAT % lower status of the population
N. MEDV Median value of owner-occupied homes in $1000’s
In [ ]:
# load libraries
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import random

# set random seeds to get consistent results
random.seed(123)
np.random.seed(123)

# Load Boston city house-price data
X, y = load_boston(return_X_y=True)
df = pd.DataFrame(data=X, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                                   'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])

# create a random categorical indicator for the pipeline exercise
group_of_items = ['Good', 'Bad', 'Terrible', 'Excellient']
df['QLTY'] = np.random.choice(group_of_items, len(df), p=[0.5, 0.1, 0.1, 0.3])
df_full = df.copy()

# Randomly delete 10% of the observations in each numeric column (for imputation)
for i in range(0, 13):
    c = df.columns[i]
    df.loc[df.sample(frac=0.1).index, c] = None
df_miss = df
Create a pipeline that will:
• create binary indicators for all categories of QLTY
• impute all missing numerical values with column means
• scale all numerical data with StandardScaler
Be careful not to scale the categorical data: cat = ['CHAS', 'RAD', 'QLTY']. To impute the categorical variables, use SimpleImputer(strategy='most_frequent'). A possible layout is sketched after the imports cell further below.
In [ ]:
Out[ ]:
       CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT        QLTY
0    0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98    Terrible
1    0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14        Good
2    0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03        Good
3    0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94         Bad
4    0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33  Excellient
..       ...   ...    ...   ...    ...    ...   ...     ...  ...    ...      ...     ...    ...         ...
501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0  273.0     21.0  391.99   9.67        Good
502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0  273.0     21.0  396.90   9.08  Excellient
503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0  273.0     21.0  396.90   5.64        Good
504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0  273.0     21.0  393.45   6.48  Excellient
505  0.04741   0.0  11.93   0.0  0.573  6.030  80.8  2.5050  1.0  273.0     21.0  396.90   7.88        Good

506 rows × 14 columns
In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
# We can also impute the missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
from sklearn.preprocessing import OneHotEncoder
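One possible way to assemble such a pipeline, shown only as a sketch under the assumptions stated in the exercise (the names num_pipeline, cat_pipeline, full_pipeline, df_prepared and df_full_prepared are illustrative, not prescribed):
# numerical vs. categorical columns, as specified in the exercise
cat = ['CHAS', 'RAD', 'QLTY']
num = [c for c in df_miss.columns if c not in cat]

# numerical branch: mean imputation, then standard scaling
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# categorical branch: most-frequent imputation, then one-hot (binary) indicators
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num),
    ('cat', cat_pipeline, cat),
])

df_prepared = full_pipeline.fit_transform(df_miss)
df_full_prepared = full_pipeline.fit_transform(df_full)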
1. Estimate a linear regression predicting MEDV using the full data and the imputed data (df_full_prepared, df_prepared). Report the resulting RMSEs.
In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lin_reg = LinearRegression()
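A sketch of how the comparison could be reported, assuming df_full_prepared and df_prepared come from the pipeline sketch above and y is the MEDV target returned by load_boston (the helper name rmse_for is illustrative):
def rmse_for(X_prepared, y_true):
    # fit a linear regression on the prepared matrix and return the in-sample RMSE
    model = LinearRegression()
    model.fit(X_prepared, y_true)
    preds = model.predict(X_prepared)
    return np.sqrt(mean_squared_error(y_true, preds))

print("RMSE, full data:   ", rmse_for(df_full_prepared, y))
print("RMSE, imputed data:", rmse_for(df_prepared, y))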
In [ ]: