Microsoft Learning Experiences
CMP3036M/CMP9063M Data Science
2016 – 2017 Semester B Week 03 Workshop
Energy Efficiency Prediction
This lab is based on the Microsoft course material for MS Azure, more specifically Lab3A, Lab3B and
Lab3C. In this exercise, you will use a dataset of metrics for buildings. Specifically, you will try to identify
data fields that influence the heating load of a building, which is a measure of its energy efficiency.
Note: Do not copy code out of the PDF, as doing so breaks the spacing that Python needs. Always
copy from the .py files!
Visualizing Data with Python
Python supports the matplotlib library, which provides extensive graphical capabilities. This makes
Python a useful language for creating visualizations of your data in order to explore relationships
between the data fields and identify features that may be useful for predicting labels in machine learning
projects. We will first test our code in Spyder and then import it into MS Azure; this makes debugging
much easier.
Load the Dataset
1. Start Spyder, and open the PrepEE.py file in the folder where you extracted the lab files for this
course.
2. In the PrepEE.py pane, in the following code, change C://DAT203xLabfiles to the path to the
folder where you extracted the lab files for this course.
## Load the data
import pandas as pd
import os
from sklearn import preprocessing

pathName = "C://DAT203xLabfiles"
fileName = "EnergyEfficiencyRegressiondata.csv"
filePath = os.path.join(pathName, fileName)
eeframe = pd.read_csv(filePath)

## Remove columns we're not going to use
eeframe = eeframe.drop('Cooling Load', axis=1)

## Scale numeric features using scikit-learn
scaleList = ["Relative Compactness", "Surface Area",
             "Wall Area", "Roof Area", "Glazing Area",
             "Glazing Area Distribution"]
arry = eeframe[scaleList].as_matrix()
eeframe[scaleList] = preprocessing.scale(arry)
Please do not copy code from the PDF but use the .py files!
3. Select the code listed above (with the modified path) and on the toolbar, click Run current cell.
The code performs the following actions:
a) Loads a pandas data frame named eeframe with data from a text file named
EnergyEfficiencyRegressiondata.csv.
b) Cleans the data. Specifically:
• The Cooling Load column is removed, because it contains similar data to the
Heating Load column and is not required in this exercise.
• The Relative Compactness, Surface Area, Wall Area, Roof Area, Glazing Area,
and Glazing Area Distribution columns are scaled so they can be easily compared
(a quick numeric check of the scaling is sketched at the end of this procedure).
4. In the IPython console pane, enter the following command to output the data frame:
eeframe
5. View the results, which show the first few rows of the data frame. Note that the data includes
the following columns, which describe the physical attributes of a building and its Heating Load
measurement:
• Relative Compactness
• Surface Area
• Wall Area
• Roof Area
• Overall Height
• Orientation
• Glazing Area
• Glazing Area Distribution
• Heating Load
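If you want to confirm numerically what preprocessing.scale did to the columns in scaleList, a
minimal check (a sketch, assuming the PrepEE.py code above has been run) is:

import numpy as np

## After preprocessing.scale, each scaled column should have
## (approximately) zero mean and unit standard deviation.
for col in scaleList:
    vals = eeframe[col].values
    print(col, round(np.mean(vals), 6), round(np.std(vals), 6))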
Import the matplotlib Library
1. In Spyder, open the VisualizeEE.py file in the folder where you extracted the lab files for
this course.
2. In the VisualizeEE.py pane, under the comment ## Import Libraries, select the following code.
import matplotlib
matplotlib.use('agg')  # Set backend
from pandas.tools.plotting import scatter_matrix
import pandas.tools.rplot as rplot
import matplotlib.pyplot as plt
import numpy as np
3. With the code above selected, on the toolbar, click Run current cell. The code imports the
matplotlib library, and some other libraries used in the script. Please ignore the warning about
rplot being deprecated.
Note: When running in an IPython console on your desktop machine, the above code may
generate several warnings. The warning about setting a backend can be ignored. Further, the
rplot library will generate warnings that it has been deprecated. Please ignore these warnings.
This code will run without warnings in the Azure ML Execute Python Script module.
Create a Pair-Wise Scatter Plot
A pair-wise scatter plot visualizes all the pairwise relations in the data. The plot contains a matrix
in which, for each pair of data entries, the data points are plotted on the 2D plane defined by the two
entries. On the diagonal, a histogram of each single entry is shown. This plot is very useful for getting an
overview of the data, as it shows the relationship between all pairs of data entries in one view. We can also
use it to spot outliers or degenerate data entries and to visualize the relationships between the individual
data entries.
1. In the VisualizeEE.py pane, under the comment ## Create a pair-wise scatter plot, select the
following code.
Azure = False

## If in Azure, frame1 is passed to the function
if(Azure == False):
    frame1 = eeframe

fig1 = plt.figure(1, figsize=(10, 10))
ax = fig1.gca()

## We show a smoothed histogram on the diagonal (kde)
scatter_matrix(frame1, alpha=0.3, diagonal='kde', ax=ax)
plt.show()

## If we are in Azure (we need to set the flag), the output needs
## to be stored in a file so that we can view it with the Azure
## interface.
if(Azure == True): fig1.savefig('scatter1.png')
2. With the code above selected, on the toolbar, click Run current cell. When running in Azure, a
data frame parameter named frame1 is passed to the first input port of the Execute Python
Script module (see function header); but when running locally, this code loads frame1 with the
eeframe data frame you loaded in the PrepEE.py script. The code then creates a scatter plot
matrix visualization that compares all columns in the dataset, and if running in Azure, saves the
resulting image so that it can be included in the output.
3. In the IPython console pane, view the scatter plot matrix that has been generated, as shown
below.
4. Some particular features you can notice include:
• You can see plots of Relative Compactness vs. Surface Area: second from left in the top
row, and second from top in the left-most column. These two variables appear to be highly
correlated, so we may not want to use both features in a given model (a quick way to
quantify this is sketched below).
• Many of the other variables seem to cluster into two groups based on the value of
Overall Height.
• Heating Load clusters into two groups for certain variables. For example, look at the
cross plots between Heating Load and Relative Compactness and Surface Area.
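To quantify the correlation you can see between Relative Compactness and Surface Area, a quick
sketch (assuming frame1 is the data frame used above) is:

## Pearson correlation between the two features that look collinear
## in the scatter plot matrix
print(frame1['Relative Compactness'].corr(frame1['Surface Area']))

## Or inspect the full pairwise correlation matrix for all columns
print(frame1.corr())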
Create Conditioned Scatter Plots (Trellis)
With a trellis plot, we can visualize the relationship between two data entries conditioned on the values
of two categorical data entries. For each combination of the categorical values, an individual plot is created
that shows the relationship between the two selected data entries. Such plots are useful for visualizing how
the relationship between two data entries changes for different values of the specified categorical data
entries.
We will generate trellis plots to visualize the dependency of the heating load on the "Relative
Compactness", "Surface Area", "Wall Area", "Roof Area", "Glazing Area" and "Glazing Area
Distribution" data entries. As conditional categorical values, we choose "Overall Height" and
"Orientation". In addition, we can also visualize the fit of an n-degree polynomial to the data in each of
the trellis plots. This gives a good impression of whether the relationship between the data entries is
linear or non-linear.
1. In the VisualizeEE.py pane, under the comment ## Create conditioned scatter plots, select the
following code.
## Create conditioned scatter plots.
col_list = ["Relative Compactness",
            "Surface Area",
            "Wall Area",
            "Roof Area",
            "Glazing Area",
            "Glazing Area Distribution"]

indx = 0
for col in col_list:
    if(frame1[col].dtype in [np.int64, np.int32, np.float64]):
        indx += 1
        fig = plt.figure(figsize = (12, 6))
        fig.clf()
        ax = fig.gca()
        plot = rplot.RPlot(frame1, x = col, y = 'Heating Load')
        plot.add(rplot.TrellisGrid(['Overall Height', 'Orientation']))
        plot.add(rplot.GeomScatter())
        plot.add(rplot.GeomPolyFit(degree = 2))
        ax.set_xlabel(col)
        ax.set_ylabel('Heating Load')
        plot.render(plt.gcf())
        if(Azure == True): fig.savefig('scatter' + col + '.png')
Note: If you get an error that rplot is not found, the pandas version you installed is too new.
However, you can resort to the 'VisualizeEE_seaborn.py' script to use the newer plotting
functionality.
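For reference, here is a minimal sketch of one equivalent conditioned plot using seaborn. This is an
approximation only; the actual contents of VisualizeEE_seaborn.py may differ:

import seaborn as sns
import matplotlib.pyplot as plt

## One conditioned scatter plot: Heating Load vs a feature, faceted by
## the two categorical variables, with a degree-2 polynomial fit per panel
g = sns.lmplot(x = 'Relative Compactness', y = 'Heating Load',
               data = frame1, row = 'Overall Height', col = 'Orientation',
               order = 2, scatter_kws = {'alpha': 0.3})
plt.show()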
2. With the code above selected, on the toolbar click Run current cell. The code creates scatter
plot charts that compare Heating Load with:
• Relative Compactness
• Surface Area
• Wall Area
• Roof Area
• Glazing Area
• Glazing Area Distribution
The code adds conditions for Overall Height and Orientation to each of these scatter plots.
3. In the IPython console pane, view the scatter plots, as shown below.
4. Note that each chart includes eight scatter plots. There are two rows of shaded tiles across
the top: one tile for each level (unique value) of Orientation, and one for each of the two
levels (unique values) of Overall Height. Each of these scatter plots has Relative Compactness
on the horizontal (x) axis and Heating Load on the vertical (y) axis. The data are grouped by, or
conditioned on, the Overall Height value (3.5 and 7) and the Orientation value (2, 3, 4, and 5).
5. Examine this chart and note the following interesting features in these scatter plots:
• The ranges of values of Heating Load are quite different between the upper and
lower rows (Overall Height of 3.5 and 7, respectively). In fact, there is very little
overlap in these values, indicating that Overall Height is an important feature in
these data.
• The distribution of these data does not change significantly with the levels of
Orientation, indicating it is not a significant feature.
• There is a notable trend of Heating Load with respect to Relative Compactness,
indicating Relative Compactness is a significant feature.
Create Histograms
A histogram plots the density of a distribution on the vertical axis against bins of values on the horizontal
axis. The values of a continuous variable are placed into one of several equal-width bins, and the density
for each bin is the count of values in that bin. As the number or width of the bins changes, the details of
the histogram will change (a tiny illustration of this is sketched below). Histograms provide an empirical
view of the distribution of the data being plotted. We will plot histograms of our relevant data entries,
splitting the data into two datasets with 'Overall Height' = 3.5 and 'Overall Height' = 7. This lets us
visualize how the distribution of the entries changes depending on the height of the building.
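To see how the choice of bin count changes the picture, here is a tiny self-contained illustration
(synthetic values, purely for intuition):

import numpy as np

np.random.seed(0)
values = np.random.normal(0, 1, 500)

## The same data summarized with different numbers of equal-width bins:
## few bins hide detail, many bins fragment it
for bins in [5, 20, 80]:
    counts, edges = np.histogram(values, bins = bins)
    print(bins, 'bins -> largest count in any bin:', counts.max())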
1. In the VisualizeEE.py pane, under the comment ## Histograms of features by Overall Height,
select the following code.
## Create histograms
col_list = ["Relative Compactness",
            "Surface Area",
            "Wall Area",
            "Roof Area",
            "Glazing Area",
            "Glazing Area Distribution",
            "Heating Load"]

for col in col_list:
    temp7 = frame1.ix[frame1['Overall Height'] == 7, col].as_matrix()
    temp35 = frame1.ix[frame1['Overall Height'] == 3.5, col].as_matrix()
    fig = plt.figure(figsize = (12, 6))
    fig.clf()
    ax7 = fig.add_subplot(1, 2, 1)
    ax35 = fig.add_subplot(1, 2, 2)
    ax7.hist(temp7, bins = 20)
    ax7.set_title('Histogram of ' + col + '\n for Overall Height of 7')
    ax35.hist(temp35, bins = 20)
    ax35.set_title('Histogram of ' + col + '\n for Overall Height of 3.5')
2. With the code above selected, on the toolbar click Run current cell. The code creates histogram
pairs for each of the following columns:
• Relative Compactness
• Surface Area
• Wall Area
• Roof Area
• Glazing Area
• Glazing Area Distribution
• Heating Load
3. In the IPython console pane, view each of the histograms, as shown below.
4. Examine the pairs of histograms created. In most cases, the ranges of values on the horizontal axis are
quite different for the two values of Overall Height, 7 and 3.5. A few of the histogram pairs show little
difference between the two values of Overall Height.
Direct your attention to the histograms of Heating Load. Examine this chart and note how different the
distribution of Heating Load is for the two values of Overall Height. In fact, there is very little overlap
in the range of values for the two levels of Overall Height. Additionally, note the outliers in both
distributions shown.
Create Visualizations in an Azure Machine Learning Experiment
1. If you have not already done so, open a browser and browse to https://studio.azureml.net. Then
sign in using the Microsoft account associated with your Azure ML account.
2. Create a new blank experiment, and give it the title Visualize Data (Python).
3. Search for the Energy Efficiency Regression data dataset, and drag it to the experiment canvas.
4. Search for the Select Columns in Dataset module and drag it to the canvas below the Energy
Efficiency Regression data dataset. Then connect the output port from the Energy Efficiency
Regression data dataset to the first input port of the Select Columns in Dataset module.
5. Select the Select Columns in Dataset module, and in the Properties pane, launch the column
selector. Then configure the column selector to begin with all columns and exclude the Cooling Load
column, as shown in the following image.
6. Search for the Normalize Data module, and drag it to the canvas below the Select Columns in
Dataset module. Then connect the output port from the Select Columns in Dataset module
to the input port of the Normalize Data module.
7. Select the Normalize Data module, and in the Properties pane, in the Transformation Method
list, select MinMax. Then launch the column selector and configure the module to begin with no
columns, include all Numeric columns, exclude the Heating Load column (which is the label we
hope to predict), and exclude the Overall Height, and Orientation columns (which are
categorical), as shown in the following image.
8. Search for the Execute Python Script module, and drag it to the experiment canvas under the
Normalize Data module. Then connect the Transformed dataset output port from the
Normalize Data module to the first input port of the Execute Python Script module. At this
point your experiment should look like the following figure:
9. Select the Execute Python Script module, and in the Properties pane, replace the default
Python script with the code from the VisualizeEE.py script you ran in Spyder (ensure you copy
and paste the entire script, including the function definition and the return statement; a minimal
skeleton of the expected structure is sketched at the end of this procedure).
10. Edit the Python script in the Properties pane to change the statement Azure = False to
Azure = True. This is required to use the data from the dataset instead of loading it from a
local file.
11. Save and run the experiment.
12. When the experiment has finished running, visualize the output from the Python Device dataset
output port of the Execute Python Script module (the output on the right), and in the Graphics
area, view the data visualizations generated by the Python script, as shown in the following
image.
13. Close the Python device output.
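For reference, the Execute Python Script module expects the script to define an entry-point function
named azureml_main. The following is a minimal skeleton only; the body shown is illustrative, and
yours should contain the full VisualizeEE.py code:

import pandas as pd

def azureml_main(frame1):
    # frame1 is the data frame passed in on the module's first input port
    Azure = True   # use the data frame from the port, not a local file
    # ... your visualization code goes here ...
    # The function must end with a return statement; the returned data
    # frame appears on the module's Results dataset output port
    return frame1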
Feature Engineering with Python
Having visualized data to determine any apparent relationships between columns in a dataset, you can
prepare for modeling by selecting only the data columns that you believe will be pertinent features for
the label you hope to predict. Additionally, you may decide to generate new feature columns based on
calculations that combine or extrapolate existing columns.
Use Python to Generate New Columns
We will now add new columns to our data set that represent polynomial expansions of certain data entries.
You will start by rearranging your experiment:
1. Add another Execute Python Script module to your experiment.
2. Delete the connection between the Select Columns in Dataset and Normalize Data modules.
Connect the output of the Select Columns in Dataset module to the Dataset1 input port of
the new Execute Python Script module. Then connect the Results dataset output of the new
Execute Python Script module to the input of the Normalize Data module. Your experiment
should now resemble the figure below.
3. Select the new Execute Python Script module, and in the Properties pane, replace the existing
code in the Execute Python Script module with the following code. You can copy this code from
the NewFeatures.py file in the folder where you extracted the lab files for this lab.
def azureml_main(frame1):
    sqrList = ["Relative Compactness", "Surface Area", "Wall Area"]
    sqredList = ["Relative Compactness Sqred", "Surface Area Sqred",
                 "Wall Area Sqred"]
    frame1[sqredList] = frame1[sqrList]**2
    cubeList = ["Relative Compactness 3", "Surface Area 3",
                "Wall Area 3"]
    frame1[cubeList] = frame1[sqrList]**3
    return frame1
Review the code, and note that it creates polynomial columns named Relative Compactness
Sqred, Surface Area Sqred, and Wall Area Sqred by squaring the values of the existing Relative
Compactness, Surface Area, and Wall Area columns, and polynomial columns named Relative
Compactness 3, Surface Area 3, and Wall Area 3 by cubing the values of the existing Relative
Compactness, Surface Area, and Wall Area columns.
4. Save and run the experiment. When the experiment has finished, visualize the Transformed
Dataset output port of the Normalize Data module, and verify that the dataset now includes
columns named Relative Compactness Sqred, Surface Area Sqred, Wall Area Sqred, Relative
Compactness 3, Surface Area 3, and Wall Area 3, as shown here:
Visualize the New Columns
1. Select the last Execute Python Script module in the experiment (after the Normalize Data
module), and in the Properties pane, add the variables Relative Compactness Sqred,
Surface Area Sqred, Wall Area Sqred, Relative Compactness 3, Surface Area 3, and Wall Area
3 to both col_list variables in the script.
2. Review the code, and note that it creates visualizations of the columns in the datasets, including
the new feature columns you added in the previous Execute Python Script module.
3. Save and run the experiment.
4. When the experiment has finished, visualize the Python Device Dataset output port of the
second Execute Python Script module, and view the charts it has created.
In particular, note the conditioned scatter plots for Heating Load vs Surface Area, Heating
Load vs Surface Area Sqred, Heating Load vs Surface Area 3, Heating Load vs Wall Area,
Heating Load vs Wall Area Sqred, and Heating Load vs Wall Area 3. The shape of the curves
on these plots is fairly similar for these similar features. There is some flattening of the curves
with the higher-order polynomials, which is the effect we are looking for. However, the effect is not
dramatic in any event. Only by testing these features when you build machine learning models
will you know which of them are effective.
Select Initial Features
1. Search for the Metadata Editor module, and drag it to the canvas. Then connect the Results
dataset output port from the Normalize Data module (which is already connected to the second
Execute Python Script module) to the input port of the Metadata Editor module so that your
experiment looks like this:
2. Select the Metadata Editor module, and in the Properties pane, launch the column selector. Then
configure the column selector to begin with no columns and include the Overall Height and
Orientation column names as shown in the following image.
3. With the Metadata Editor module selected, in the Properties pane, in the Categorical list, select
Make Categorical.
4. Search for the Select Columns in Dataset module and drag it to the canvas below the Metadata
Editor module. Then connect the Results dataset output port from the Metadata Editor module
to the input port of the Select Columns in Dataset module. You will use the Select Columns in
Dataset module to select the initial features for the model based on the data exploration and
feature engineering you have performed so far.
5. Select the Select Columns in Dataset module you just added, and in the Properties pane launch
the column selector. Then configure the module to begin with all columns, and then exclude the
following columns:
i. Orientation
ii. Glazing Area Distribution
6. Verify that your experiment looks like this, and then save the experiment.
Creating a Model
Having prepared for modeling by selecting or creating columns that you believe will be pertinent
features for the label you hope to predict, you can start creating, training, and testing machine learning
models for your particular predictive requirements.
Create and Train a Model
1. Search for the Split module, and drag it to the canvas below the Select Columns in Dataset
module you added at the end of the previous exercise. Then connect the Results dataset output
port from the Select Columns in Dataset module to the input port of the Split module. You will
use the Split module to split the data into two sets; one to train the model, and another to
test it.
2. Select the Split module, and in the Properties pane, set the following properties:
Splitting mode: Split Rows
Fraction of rows in the first output dataset: 0.6
Randomized: Selected
Random seed: 5416
Stratified split: False
3. Search for the Linear Regression module, and drag it to the canvas beside the Split module.
Then select the Linear Regression module and in the Properties pane, set the following
properties:
Solution method: Ordinary Least Squares
L2 regularization weight: 0.0001
Include intercept term: Unselected
Random number seed: 345689
Allow unknown category levels: Selected
4. Search for the Train Model module, and drag it to the canvas beneath the Split and Linear
Regression modules. Then connect the output from the Linear Regression module to the left input of
the Train Model module, and connect the left output of the Split module (which represents the
training data set) to the right input of the Train Model module.
5. Select the Train Model module, and in the Properties pane launch the column selector. Then
configure the module to include only the Heating Load column. This trains the model to predict
a value for the Heating Load label based on the other feature columns in the dataset.
6. Verify that your experiment from the second Select Columns in Dataset module onwards
resembles the following image, and then save and run the experiment.
7. When the experiment has finished, visualize the output for the Train Model module, and note
that it describes the relative weight of each feature column. The feature weights represent the
relative importance that the trained model applies to each feature column when predicting
the Heating Load label, based on the training data set. In the next exercise, you will use the set of
test data to evaluate these features. (A rough local sklearn analogue of these weights is sketched
after this procedure.)
8. Close the output.
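If you want to reproduce something like these weights locally, a rough sklearn analogue is sketched
below. The feature list here is illustrative only; use whichever columns your experiment kept:

from sklearn.linear_model import LinearRegression

features = ['Relative Compactness', 'Surface Area', 'Wall Area',
            'Roof Area', 'Overall Height', 'Glazing Area']

## Ordinary least squares without an intercept, matching the
## Linear Regression module settings used above
model = LinearRegression(fit_intercept = False)
model.fit(eeframe[features], eeframe['Heating Load'])
for name, weight in zip(features, model.coef_):
    print(name, weight)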
Testing and Scoring Models
Having trained a model, you can examine it to evaluate its effectiveness at predicting label values.
Predictive modeling is an iterative process that often involves creating and comparing multiple models,
and refining these models until you find one that suits your requirements.
In this exercise, you will evaluate the effectiveness of the features used by your trained model against the
test data set. Then you will create a second model, and score the models to compare them.
Score a Model
1. Search for the Score Model module, and drag it to the canvas below the Train Model
module. Then connect the output from the Train Model module to the left input of the Score Model
module, and connect the right output from the Split module (which represents the test data set)
to the right input of the Score Model module, as shown in the following image:
2. Save and run the experiment.
3. When the experiment has finished, visualize the output of the Score Model module.
4. Note that the output includes a Scored Labels column, which contains the predicted value for
the Heating Load label (which is also included in the output).
5. Select the Scored Labels column and in the Visualizations area, in the compare to list, select
Heating Load. Note that the scatter plot shows an approximately linear correlation between the label
predicted by the model and the actual label value in the test data, as shown here:
Create a Second Model
In the previous exercise, you computed some new features and then determined which features were
less important to the model. In this exercise you will compute a model with the original feature set for
comparison. Models with unnecessary features are said to be 'over-parameterized', and the additional
features can actually make performance worse: over-parameterized models will generally not generalize
well to the range of input values expected in production. However, one must proceed with caution.
Removing too many features can lead to the loss of important information and therefore reduced model
performance. A small self-contained illustration of over-parameterization is sketched below.
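The following sketch uses synthetic data, not the energy efficiency dataset: a needlessly high-degree
polynomial fits the training points better but does worse on held-out data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
x = np.random.uniform(-1, 1, 60).reshape(-1, 1)
y = 2 * x.ravel() + np.random.normal(0, 0.3, 60)   # truly linear signal
x_train, x_test = x[:40], x[40:]
y_train, y_test = y[:40], y[40:]

## Compare a well-specified model (degree 1) with an
## over-parameterized one (degree 9)
for degree in [1, 9]:
    feats = PolynomialFeatures(degree)
    model = LinearRegression().fit(feats.fit_transform(x_train), y_train)
    print('degree %d: train R2 = %.3f, test R2 = %.3f' % (degree,
          model.score(feats.transform(x_train), y_train),
          model.score(feats.transform(x_test), y_test)))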
1. Search for the Select Columns in Dataset module, and drag it to the canvas below the second
existing Select Columns in Dataset module. Then connect the output port from the existing
Select Columns in Dataset module to the input port of the Select Columns in Dataset module
you just added (in addition to the Split module to which it is already connected).
2. Select the Select Columns in Dataset module you just added, and in the Properties pane
launch the column selector. Then configure the module to begin with all columns and exclude
all the features we added previously (Surface Area Sqred, Surface Area 3, Wall Area Sqred,
Wall Area 3, Relative Compactness Sqred, Relative Compactness 3).
3. Click and drag around the Linear Regression, Split, Train Model and Score Model modules to
select them, and then copy and paste them. Move the new copies of the modules so that they
are under the Select Columns in Dataset module that you added at the beginning of this
procedure.
4. Connect the output of the Select Columns in Dataset module to the input of the pasted Split
module so that your experiment after the Normalize Data module resembles the following
image. Then save and run the experiment.
5. When the experiment has finished, visualize the output for the second Train Model module
(which uses the reduced set of columns from the Select Columns in Dataset module), and note
that the relative weighted feature columns do not include the polynomial features, which you
removed from the model.
Evaluating Model Performance
Having created a machine learning model you are ready to evaluate the performance of the model. You
will evaluate your model using summary statistics produced by the Azure ML Evaluate Model module. In
subsequent exercises, you will perform an in-depth evaluation of model errors using custom Python code.
Having trained a model, you can examine the results to evaluate its effectiveness at predicting label
values. Predictive modeling is an iterative process that often involves creating and comparing different
sets of features, multiple models, and refining these choices until you find one that suits your
requirements.
Evaluate a Machine Learning Model
1. Start with the model with the reduced feature set (filtered with a second Select Columns in
Dataset module). Search for the Evaluate Model module and drag it onto your canvas. Connect
the Scored Data Set output port of the Score Model module to the left hand input of the
Evaluate Model module. Your experiment should now look like this:
Tip. When evaluating a single model, always use the left input to the Evaluate Model module.
The right input allows you to compare performance of another model to the performance of
the first model.
2. Save and run the experiment. When the experiment is finished, visualize the Evaluation Result
port of the Evaluate Model module and review the performance statistics for the model.
Overall, these statistics should be promising, if not ideal. The Coefficient of Determination,
often referred to as R², measures the fraction of the variance in the raw label that is explained
by the model; a perfect model would have a Coefficient of Determination of 1.0, meaning all
the variance in the label is explained by the model. Relative Squared Error is the ratio of the
squared error of the model to the variance of the data; a perfect model would have a Relative
Squared Error of 0.0, meaning all model errors are zero. You should observe that the statistics
for your model are some distance from ideal. (Both statistics are sketched in code after this
procedure.)
3. Close the evaluation results.
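Both statistics can also be computed directly from scored results. Here is a sketch, assuming a data
frame named scored that holds the Heating Load and Scored Labels columns produced by the Score
Model module:

import numpy as np

def r_squared(y, y_hat):
    ## Coefficient of Determination: the fraction of the label's
    ## variance explained by the model (1.0 for a perfect model)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def relative_squared_error(y, y_hat):
    ## Squared model error divided by the variance of the label
    ## (0.0 for a perfect model); note this equals 1 - R squared
    return np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

y = scored['Heating Load'].values
y_hat = scored['Scored Labels'].values
print(r_squared(y, y_hat), relative_squared_error(y, y_hat))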
Compare Model Performance
Now that you have computed performance statistics for one model you can compare these figures
to those from another model.
1. Connect the output from the Score Model module which was not evaluated so far to the right
input port of the Evaluate Model module that is already connected to the Score Model
module for the other model.
2. Save and run the experiment. When the experiment is finished, visualize the Evaluation
Result port of the Evaluate Model module and review the performance statistics for both
models. Your Azure graph should look like this:
3. Compare the performance statistics for the models. For example, look at the Relative Squared
Error and the Coefficient of Determination. You see that the model on the left (with non-linear
features) performs slightly better than the model on the right.
Our feature generation was successful, and the polynomial features contained useful
information.
Optional: Understanding Model Errors with Python
In the previous exercise you evaluated the performance of two models using the summary statistics
provided by the Evaluate Model module. In this exercise you will evaluate the performance of a
machine learning model in greater depth using custom Python code.
1. In your experiment, search for and locate the Metadata Editor module. Drag this
module onto your canvas. Connect the Scored Dataset output of the Score Model module for the
first model you created (and which you evaluated to be performing best) to the input of the
Metadata Editor module.
2. Select the Metadata Editor module and in the Properties pane, launch the column selector and select
the Overall Height column as shown below.
3. With the Metadata Editor module selected, in the Properties pane, in the Categorical box
select Make non-categorical. The output from this Metadata Editor module will show the
Overall Height column as a string type, which we can work with in Python.
4. Search for and locate the Execute Python Script module. Drag this module onto your canvas.
Connect the Results Dataset output of the Metadata Editor module to the Dataset1 (left
hand) input of the Execute Python Script module.
5. With the Execute Python Script module selected, in the properties pane, replace the existing
Python script with the following code. You can copy this code from VisResiduals.py in the
folder where you extracted the lab files:
def rmse(Resid):
    import numpy as np
    resid = Resid.as_matrix()
    length = Resid.shape[0]
    return np.sqrt(np.sum(np.square(resid)) / length)

def azureml_main(frame1):
    # Set graphics backend
    import matplotlib
    matplotlib.use('agg')

    import pandas as pd
    import pandas.tools.rplot as rplot
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    ## Compute the residuals
    frame1['Resids'] = frame1['Heating Load'] - frame1['Scored Labels']

    ## Create data frames by Overall Height
    ## (temp1 holds the Overall Height = 7 data, temp2 the 3.5 data)
    temp1 = frame1.ix[frame1['Overall Height'] == 7]
    temp2 = frame1.ix[frame1['Overall Height'] == 3.5]

    ## Create a scatter plot of residuals vs Heating Load.
    fig1 = plt.figure(1, figsize=(9, 9))
    ax = fig1.gca()
    temp1.plot(kind = 'scatter', x = 'Heating Load', \
               y = 'Resids', c = 'DarkBlue', alpha = 0.3, ax = ax)
    temp2.plot(kind = 'scatter', x = 'Heating Load', \
               y = 'Resids', c = 'Red', alpha = 0.3, ax = ax)
    ax.set_title('Heating load vs. model residuals')
    plt.show()
    fig1.savefig('plot1.png')

    ## Scatter plots of the residuals conditioned by
    ## several features
    col_list = ["Relative Compactness",
                "Surface Area",
                "Wall Area",
                "Roof Area",
                "Glazing Area"]

    for col in col_list:
        ## First plot one value of Overall Height.
        fig = plt.figure(figsize=(10, 4.5))
        fig.clf()
        ax = fig.gca()
        plot = rplot.RPlot(temp1, x = 'Heating Load', y = 'Resids')
        plot.add(rplot.GeomScatter(alpha = 0.3, colour = 'DarkBlue'))
        plot.add(rplot.TrellisGrid(['.', col]))
        ax.set_title('Residuals by Heating Load and height = 7 conditioned on ' + col + '\n')
        plot.render(plt.gcf())
        fig.savefig('scater_' + col + '7' + '.png')

        ## Now plot the other value of Overall Height.
        fig = plt.figure(figsize=(10, 4.5))
        fig.clf()
        ax = fig.gca()
        plot = rplot.RPlot(temp2, x = 'Heating Load', y = 'Resids')
        plot.add(rplot.GeomScatter(alpha = 0.3, colour = 'Red'))
        plot.add(rplot.TrellisGrid(['.', col]))
        ax.set_title('Residuals by Heating Load and height = 3.5 conditioned on ' + col + '\n')
        plot.render(plt.gcf())
        fig.savefig('scater_' + col + '3.5' + '.png')

    ## QQ Normal plot of residuals for each Overall Height
    fig3 = plt.figure(figsize = (12, 6))
    fig3.clf()
    ax1 = fig3.add_subplot(1, 2, 1)
    ax2 = fig3.add_subplot(1, 2, 2)
    sm.qqplot(temp1['Resids'], ax = ax1)
    ax1.set_title('QQ Normal residual plot \n with Overall Height = 7')
    sm.qqplot(temp2['Resids'], ax = ax2)
    ax2.set_title('QQ Normal residual plot \n with Overall Height = 3.5')
    fig3.savefig('plot3.png')

    ## Histograms of the residuals for each Overall Height
    fig4 = plt.figure(figsize = (12, 6))
    fig4.clf()
    ax1 = fig4.add_subplot(1, 2, 1)
    ax2 = fig4.add_subplot(1, 2, 2)
    ax1.hist(temp1['Resids'].as_matrix(), bins = 40)
    ax1.set_xlabel('Residuals for Overall Height = 7')
    ax1.set_ylabel('Density')
    ax1.set_title('Histogram of residuals')
    ax2.hist(temp2['Resids'].as_matrix(), bins = 40)
    ax2.set_xlabel('Residuals for Overall Height = 3.5')
    ax2.set_ylabel('Density')
    ax2.set_title('Histogram of residuals')
    fig4.savefig('plot4.png')

    ## Create a new data frame and fill in the root mean squared
    ## residuals overall and for both height classes
    out_frame = pd.DataFrame({ \
        'rmse_Overall' : [rmse(frame1['Resids'])], \
        'rmse_35Height' : [rmse(temp2['Resids'])], \
        'rmse_70Height' : [rmse(temp1['Resids'])] })
    return out_frame
Ensure you have a Python return statement at the end of your azureml_main function; for
example, return frame1. Failure to include a return statement will prevent your code from
running and may produce an inconsistent error message.
This code performs the following tasks:
a. Creates a function called rmse that returns the root mean squared error.
b. Creates a column named Resids containing the residuals (the differences between the
original Heating Load label and the predicted Scored Labels value).
c. Creates a data frame containing the data with an Overall Height value of 7, and a data
frame containing the data with an Overall Height value of 3.5.
d. Creates a scatter plot of the heating load (x axis) and the residuals (y axis). These data are
grouped by overall height, and the points are given different colors depending on the overall
height.
e. Creates scatter plots that show heating load against residuals, conditioned by overall height
and other feature columns.
f. Creates a Q-Q normal plot of the residuals for each overall height.
g. Creates histograms of the residuals for each overall height.
h. Calls the rmse function three times: once for all of the residuals, once for the residuals with
an Overall Height of 3.5, and once for the residuals with an Overall Height of 7. The results
of these calls are returned as the output data frame of the function.
6. Save and run the experiment. Then, when the experiment is finished, visualize the Python device
port of the Execute Python Script module.
7. Examine the scatter plot that shows heating load against residuals conditioned by overall height,
which should look similar to this:
Examine the structure of these residuals with respect to the label, Heating Load. In an ideal case,
the residuals should appear random with respect to the label (Heating Load). These results are not
ideal. First, notice that the residuals fall into two groups of clusters, based on Overall Height.
Second, notice that the residuals in each group have a linear structure.
8. Review the conditioned scatter plots that have been created. For example, look in detail at the
scatter plots conditioned on Roof Area and Overall Height, as shown below.
Note the shaded conditioning-level tiles across the top of these charts. The first plot has only one
value of Roof Area. The second chart has three tiles across the top (horizontal), showing the
three levels (unique values) of Roof Area. Each scatter plot shows the data falling into a
group defined by Roof Area and Overall Height.
Examine this plot and notice that the residuals are not random across these plots. First, as
previously noted, a linear structure is visible in several of these subplots. Second, the
placement of the residuals is notably different from plot to plot across the bottom row (Overall
Height = 7). These observations confirm that there is nonlinear behavior not captured by this
model.
Examine the other conditioned scatter plots. You will see similar structure, further evidence that a
linear model does not fit this data particularly well.
9. Examine the histograms, as shown below:
Examine these results, and note the differences in the histograms by Overall Height. Further,
there are apparent outliers for the Overall Height of 7.
10. Close the Python device output.
11. Visualize the Result Dataset output of the Execute Python Script module, and review the root
mean squared error results returned by the rmse function, as shown below.
These results show significant variation between the root mean squared error calculations for the
overall residuals and the residuals for Overall Height of 3.5 and Overall Height of 7. These results
indicate that the residuals are not independent of the Overall Height feature.
Optional Things to Try:
• Test the performance if you only use the quadratic features.
• You can also add polynomials of higher order: add polynomials of 4th, 5th and 6th order to
your feature base. Which degree works best? Generate a plot with the degree of the polynomial
on the x axis and the relative squared error on the y axis (a local sketch of such a sweep is
shown below).
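If you try the higher-order polynomial suggestion locally rather than in Azure, a sketch of the degree
sweep could look like the following. The feature list and the 60/40 split here are illustrative only:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

feature_cols = ['Relative Compactness', 'Surface Area', 'Wall Area']
degrees = range(1, 7)
errors = []

## Hold out 40% of the rows for testing
train = eeframe.sample(frac = 0.6, random_state = 5416)
test = eeframe.drop(train.index)
y_train = train['Heating Load'].values
y_test = test['Heating Load'].values

for degree in degrees:
    ## Stack powers 1..degree of each base feature as extra columns
    X_train = np.hstack([train[feature_cols].values ** d
                         for d in range(1, degree + 1)])
    X_test = np.hstack([test[feature_cols].values ** d
                        for d in range(1, degree + 1)])
    model = LinearRegression().fit(X_train, y_train)
    resid = y_test - model.predict(X_test)
    ## Relative squared error: squared model error over label variance
    errors.append(np.sum(resid ** 2) /
                  np.sum((y_test - y_test.mean()) ** 2))

plt.plot(list(degrees), errors, marker = 'o')
plt.xlabel('Polynomial degree')
plt.ylabel('Relative squared error')
plt.show()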