Part 1: Machine learning / data mining, inference vs prediction
GEOG5917 Big Data and Consumer Analytics
Lex Comber
Professor of Spatial Data Analytics
School of Geography
University of Leeds
a.comber@leeds.ac.uk
Pre-amble
Last week
Introduced models and modelling
Regression models (OLS, GLM)
Geographically Weighted Regression as an example of a SVC model
Aims in creating models
Prediction
Understanding
Communication
Hypothesis development, etc
BUT touched on these very lightly
Pre-amble
We introduced some of the assumptions of OLS
Independence of variables, linear relationships etc
We considered Tobler’s 1st Law and concepts of
spatial auto-correlation (variables)
spatial heterogeneity (processes)
We looked at GWR as way of accommodating these
Pre-amble
Part 1 of the practical
constructed OLS and GWR models of unemployment
Used OLS model to predict
With cautionary note to only use GWR models for understanding
This week we will examine tools for prediction and inference in a bit more detail
AND in a MACHINE LEARNING context
Part 1. Machine learning / data mining, inference vs prediction
Part 2: Assignment Introduction
Part 3: Module Summary (normally next week)
Outline
Machine Learning: What is ML?
ML vs Classic Statistics
Prediction vs Inference
Mechanics of ML
data rescaling
training and validation data
measures fit, error, accuracy
model tuning
some ML models
Machine Learning
There are 100s of Machine Learning algorithms or models
You may have heard of some of them
Neural Networks
Random Forest
Support Vector Machine
Gradient Boosting Machine
k-Nearest Neighbour (kNN)
We will work with 6 of them in the practical
But what are they?
Machine Learning
Machine Learning (ML)
has been around for 30-40 years
is a discipline in computer science and artificial intelligence:
‘the study and construction of algorithms that can learn from and make predictions on data’
Arose from pattern recognition and computational learning theory
ML linked to
data mining: exploratory data analysis
aka unsupervised learning
Recent growth has been because of
Increased (personal) computer power
Data, Data, Data!
Suites of ML algorithms provided in R / Python etc packages & libraries
Machine Learning
It is difficult to define ML
Often framed comparatively to Statistics
For ML it is not important if the model is true
Classic statistics seeks to determine the true model
The model that gives insight into the processes that produced the data
ML focuses on accurate prediction
Yet this prediction (ML) vs insight / understanding (Stats) is overly simplistic (see Breiman reference)
But no clear distinction
Great discussion here: https://stats.stackexchange.com/questions/268755/when-should-linear-regression-be-called-machine-learning
Machine Learning
ML characteristics
Problems involve much more data (big data)
need to be cleaned, manipulated, summarised etc
Use more variables than classic statistics
more variable-to-variable interactions need to be modelled
numerical approaches to model selection
Less concerned about covariate (variable) significance
significance (p-values) are associated with insight / understanding
More concerned with predictive power
using whatever combination of variables to maximise that
BUT Risk of overfitting (because of more variables)
requires some measure to test this and validate the model
(Figure: Classic Statistics vs Machine Learning)
Overfitting
Model is too specific to training data
Low performance on new data
Machine Learning
Characteristics more evident when compared with Classic Statistics
Statistical models:
used either to infer something about relationships or predict
BUT model created based on some domain understanding (theory)
statistical models can predict, but accuracy is not a strength
ML models:
Don’t care about theory (except to make a better model)
Provide various degrees of interpretability (depends on algorithm)
Generally sacrifice interpretability for predictive power
NOTE: an ML linear regression model trained by minimising the squared error gives the same result as the equivalent statistical (OLS) regression model
Outline
Machine Learning: What is ML?
ML vs Classic Statistics
Prediction vs Inference
Mechanics of ML
data rescaling
training and validation data
measures fit, error, accuracy
model tuning
some ML models
Prediction vs. Inference
Models have the generic form $y = f(x) + \varepsilon$
y is the target variable (the thing we are interested in)
x are the independent variables (the factors we use to model y)
When we make models we are usually trying to
understand how changes in x are related to changes in y
OR
predict an unknown y given some values of x
Prediction vs. Inference
Terminology can be confusing
Prediction is straightforward
Inference is used to mean both understanding and prediction in different sources
Statistics
models for both inference / understanding and prediction
characterising the relationship between changes in x and changes in y is referred to as “statistical inference”
Machine Learning
models are generally for prediction
confusingly prediction is sometimes referred to as “inference” in the ML literature!
ML trains a model “to make generalizable inferences about some type of data based on previous data” (this is prediction!)
Prediction vs. Inference
…in a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth.
In this case one might be interested in how the individual input variables affect the prices—that is, how much extra will a house be worth if it has a view of the river? This is an inference problem.
Alternatively, one may simply be interested in predicting the value of a home given its characteristics: is this house under- or over-valued? This is a prediction problem.
James, G., Witten, D., Hastie, T. and Tibshirani, R., 2013. An introduction to statistical learning (Vol. 112, pp. 3-7). New York: Springer, p20.
Prediction vs. Inference
Prediction:
uses the estimated function f to forecast over unsampled areas
when data for those areas becomes available, or into the future, given different future states
Inference
uses the estimated function f to understand the impact of the inputs on the outcome
often associated with process understanding
Both aim to identify the function f that best approximates the relationships or structure in the data
Although Prediction and Inference are both in a sense inferential …
one inferring predictions (forecasting) and the other inferring process
… and both may use the same estimation procedure to determine f …
… the major difference lies in the purpose the estimated function is put to
Summary so far…
ML vs Statistics
Classic statistical models usually based on some domain understanding (theory)
ML models don’t care about theory they just want to predict
Both do Prediction and Inference
Prediction vs. Inference
Inference: understanding how changes in x and changes in y are related
Need for significance, p-values, coefficient estimates etc
Prediction: aims simply predict values accurately
Generalizability of the model to new data is important
Outline
Machine Learning: What is ML?
ML vs Classic Statistics
Prediction vs Inference
Mechanics of ML
training and validation data
data rescaling
measures fit, error, accuracy
model tuning
some ML models
Training & Validation Data
In ML it is important to train and validate (test) the model on different data
Aim is prediction: need to be certain that the model will generalize – i.e. is not specific to the data it was created with
Thus input data needs to be split into 2 subsets:
one to create (train) the model
the other validate (test) or evaluate its performance
The performance of the model can be evaluated on how well it predicts the known target variable
Training & Validation Data
Split the data into training / validation subsets (eg 70:30 / 80:20)
Train (create) model using training subset
Internal measures of model performance
Test (evaluate) model on the testing subset
External measures of performance
Training & Validation Data
Split the data to training / validation subsets (eg 70:30 / 80:20)
Train (create) model using training subset
Internal measures of model performance
Test (evaluate) model on the testing subset
External measures of performance
Training & Validation Data
Recall that the aim in ML prediction is to use predictor variables x and a response variable y to determine some function f, such that the predicted value ŷ (y-hat) is as close to y as possible
We only know how good the model is when ŷ for the test data is evaluated
Training & Validation Data
Key point
The target variable in training/ testing splits needs to have the same properties (spread, central tendency etc)
This is so that the model is trained and tested on representative samples
The createDataPartition function
splits data and ensures that the distributions are similar
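A minimal R sketch of such a split, assuming a data frame called data with a numeric target price (hypothetical names):
library(caret)
set.seed(123)
train_index <- createDataPartition(data$price, p = 0.7, list = FALSE)  # 70:30 split that preserves the distribution of the target
train_data <- data[train_index, ]
test_data  <- data[-train_index, ]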
Data Rescaling
Most ML models use some form of multi-variate distance
to determine how far observations are from each other
A multi-variate feature space is defined by the predictor variables passed to the model
Example: simple Linear Regression
Finds the hyper-plane with the minimum summed distance between each observation and the plane
Data Rescaling
Most ML models use some form of multi-variate distance
to determine how far observations are from each other
Problem: variables are in different units
Price in £1000s, Lat in Degrees, Beds is a count
Danger: models may be dominated by variables with large numerical values
price in £s, Lat in metres (Easting), etc
Concept of closeness changes
Data Rescaling
To address this, data are rescaled or standardized
Two basic (common) approaches
Z-scores (mean of 0, standard deviation of 1)
Straight rescale (0-1, 0-100, 0-255, etc)
Data have same distributions but now in similar relative scales
Data Rescaling
Key point
Data rescaling needs to be done AFTER training / validation splits
Data rescaling before the split introduces information about the future into the training explanatory variables
The test data are the unknown future
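A minimal sketch of leakage-free rescaling, assuming train_data and test_data data frames with a hypothetical numeric predictor x1: the scaling parameters come from the training data only and are then applied to the test data.
# z-score standardisation using training-set parameters only
m <- mean(train_data$x1)
s <- sd(train_data$x1)
train_data$x1_z <- (train_data$x1 - m) / s
test_data$x1_z  <- (test_data$x1 - m) / s
# straight 0-1 rescale, again using the training-set range
rng <- range(train_data$x1)
train_data$x1_01 <- (train_data$x1 - rng[1]) / diff(rng)
test_data$x1_01  <- (test_data$x1 - rng[1]) / diff(rng)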
Summary so far…
We have some data
We have created training and testing subsets
We have rescaled them
We have a ML model to use
We will fit a model
We will evaluate the model
What model parameter values?
How do we decide?
How do we evaluate the final specification of the model
before applying it to the test data?
or when using it for inference (ie with no training / testing split)?
This is where measures of fit, error and accuracy are used
They are needed to evaluate the model internally (with the training data) and externally (with the test data)
Internally, error measures are generated using cross-validation
Fit/Error/Accuracy Measures
How do we know how good our model is
During model construction? And…
When used to predict using test / holdout data?
A number of different accuracy measures
RMSE (root mean square error): the square root of the mean squared difference between y and ŷ (smaller is better)
MAE (mean absolute error): the mean of the absolute differences between y and ŷ (smaller is better)
R2 (R-squared): one minus the residual sum of squares over the total sum of squares (bigger is better)
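These measures can be computed directly; a minimal sketch, assuming vectors of observed values y and predictions y_hat:
rmse <- sqrt(mean((y - y_hat)^2))                       # root mean square error (smaller is better)
mae  <- mean(abs(y - y_hat))                            # mean absolute error (smaller is better)
r2   <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)   # R-squared (bigger is better)
# caret::postResample(y_hat, y) returns RMSE, R-squared and MAE in one call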
Fit/Error/Accuracy Measures
How do we know how good our model is
During model construction? And…
When applied to predict over test / holdout data?
Cross-validation
Randomise / shuffle the dataset
Split the dataset into k groups
For each group:
Use it as the hold out / test data
Use the rest as training data & fit a model
Evaluate the model on the test set
Summarize the evaluations
This generates internal measures of model fit, error and accuracy
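In the practical, cross-validation is handled inside caret; a minimal sketch using 10-fold cross-validation on a hypothetical training data frame train_data with target price:
library(caret)
ctrl <- trainControl(method = "cv", number = 10)              # 10-fold cross-validation
fit  <- train(price ~ ., data = train_data, method = "lm", trControl = ctrl)
fit$results                                                   # cross-validated RMSE, R-squared and MAE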
Fit/Error/Accuracy Measures
Measures of fit / error / accuracy are needed to evaluate the model
during training through cross validation
after training for its generalizability by applying to test data (hold out data)
Outline
Machine Learning: What is ML?
ML vs Classic Statistics
Prediction vs Inference
Mechanics of ML
training and validation data
data rescaling
measures fit, error, accuracy
model tuning
some ML models
Model Tuning
We have some data
We have created training and testing subsets
We have rescaled them
We have a ML model to use
We have fit a model
We have evaluated the model
Now we need to make sure that the model is tuned!
Model Tuning
Most ML models have a number of parameters that need to be set
Example for Gradient Boosting Machine
Example for Random Forest (ranger implementation)
How do we know what parameter values to specify?
Model Tuning
A ranger Random Forest example
This is where model tuning comes in
Set up a grid of combinations of parameter values
Create and evaluate a model for each combination
Fit / Error / Accuracy (RMSE, MAE, R2)
Use best parameter combination (best fit)
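A minimal sketch of a tuning grid for caret's ranger Random Forest, assuming a training data frame train_data with target price (the grid values are illustrative):
library(caret)
rf_grid <- expand.grid(mtry = c(2, 4, 6),           # number of variables tried at each split
                       splitrule = "variance",      # split rule for regression
                       min.node.size = c(5, 10))    # minimum terminal node size
ctrl <- trainControl(method = "cv", number = 5)
rf_fit <- train(price ~ ., data = train_data, method = "ranger",
                trControl = ctrl, tuneGrid = rf_grid)
rf_fit$bestTune                                     # the combination with the lowest cross-validated RMSE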
ML Models
There are hundreds
We will work with the caret package
This is a wrapper for 100s of ML models
It provides a common interface to them
The wrapper makes it slower than the original R packages
gbm comes from the gbm package
rf from the randomForest package
etc
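The common interface means only the method argument changes between models; a minimal sketch (same hypothetical train_data and ctrl as above):
fit_gbm <- train(price ~ ., data = train_data, method = "gbm", trControl = ctrl, verbose = FALSE)
fit_rf  <- train(price ~ ., data = train_data, method = "rf",  trControl = ctrl)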
ML Models
In the practical you will use the following ML models
Linear Regression (you know about this already!)
K-Nearest Neighbour
The k-Nearest Neighbour algorithm operates under the assumption that records with similar values have similar attributes – they are expected to be close in the multidimensional feature space described earlier
Bagged Regression Trees
Bootstrap aggregated decision trees (bagging). Bagging generates multiple models with the same parameters and averages the results from multiple trees.
Random Forests
Bagged regression trees may suffer from high variance. Random Forests build large collections of decorrelated trees by adding randomness to the tree construction process.
Gradient Boosted Machine
Boosting seeks to convert weak learning trees into strong learning ones, with each tree fit on a slightly modified version of the original data. Gradient boosting machines pass a loss function indicating model fit back into subsequent models
Support Vector Machine
Support vector machines use kernel functions that undertake complex data transformations and then determine the hyperplanes that best separate the transformed data.
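As an example of one of these in caret, a minimal k-Nearest Neighbour sketch (same hypothetical train_data; the k values are illustrative):
knn_fit <- train(price ~ ., data = train_data, method = "knn",
                 preProcess = c("center", "scale"),             # kNN is distance-based, so rescale the predictors
                 tuneGrid = data.frame(k = seq(3, 25, by = 2)),
                 trControl = trainControl(method = "cv", number = 5))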
Summary
Machine Learning (ML) has been around for 30-40 years
Definitions of ML are often framed comparatively to Statistics
ML focuses on accurate prediction
ML Problems involve much more data (big data)
Use more variables than classic statistics
Less concerned about covariate (variable) significance
significance (p-values) are associated with insight / understanding
More concerned with predictive power
using whatever combination of variables to maximise that
BUT Prediction (ML) vs insight / understanding (Stats) is overly simplistic
BUT Risk of overfitting (because of more variables)
Need measure to test this and to validate the model
Summary
Mechanics of ML
training and validation or test data
Train and test the model on separate data
data rescaling
So that multivariate distances reflect interactions between variables not their dimensions
measures fit, error, accuracy
RMSE: the square root of the mean squared difference between y and ŷ (smaller is better)
MAE: the mean of the absolute differences between y and ŷ (smaller is better)
R2: one minus the residual sum of squares over the total sum of squares (bigger is better)
model tuning: ML algorithms have parameters
Set up a grid of combinations of parameter values
Create and evaluate a model for each combination
Fit / Error / Accuracy (RMSE, MAE, R2)
use best parameter combination (best fit)
Summary
Cross-validation
Resamples the training data to evaluate the model during training
Validation
Evaluates the predictions from hold-out / test data
Provides a second, objective measure of how well the model is expected to perform
The generalizability of the ML model
Reading
Read all of these!
Statistics vs ML – a nice paper with many comments (ignore the equations!) by the father of Random Forests
Breiman, L., 2001. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical science, 16(3), pp.199-231 – https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
Nice overviews of different approaches (again ignore the equations!)
Mullainathan, S. and Spiess, J., 2017. Machine learning: an applied econometric approach. Journal of Economic Perspectives, 31(2), pp.87-106. https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.31.2.87
Dzyabura, D. and Yoganarasimhan, H., 2018. Machine learning and marketing. In Handbook of Marketing Analytics. Edward Elgar Publishing. https://faculty.washington.edu/hemay/ml-marketing.pdf
A geographical perspective
Miller, H.J. and Goodchild, M.F., 2015. Data-driven geography. GeoJournal, 80(4), pp.449-461.
Accessible summaries
Neural Networks and Random Forests
https://citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics/
Random Forests
Bootstrapping, Bagging, Boosting and Random Forest
http://blog.echen.me/2011/03/14/laymans-introduction-to-random-forests/
https://www.r-bloggers.com/a-brief-tour-of-the-trees-and-forests/
Part 1. Machine learning / data mining, inference vs prediction
Part 2: Assignment Introduction
Part 3: Module Summary (normally next week)
Part 2: The Assignment
GEOG5917 Big Data and Consumer Analytics
Lex Comber
Professor of Spatial Data Analytics
School of Geography
University of Leeds
a.comber@leeds.ac.uk
Overview
Practical
Assignment
Overview
Assignment Preparation
Details
Practical
The practical works through the mechanics of ML described earlier
Training and Testing data splits
Data normalization
Measure of error
ML Model tuning
Validation
Practical
Some of the steps in the practical take time
Please please read the practical before running code
Some of the code is for you to run outside of the practical – it takes hours to run!
It is there for you to experiment with later
Practical
In Part 5 you will examine 6 ML models
Standard linear regression
𝑘-Nearest Neighbour
Bagged Regression Trees
Random Forests
Gradient Boosting Machines
Support Vector Machine
You will apply these to the same problem (house prices in Liverpool)
And examine their different results
Overview
Practical
Assignment
Overview
Assignment Preparation
Details
Assignment Overview
Your sequence of work is
Week 17 practical
Self directed Assignment Preparation practical
Assignment
Assignment Overview
Week 17 practical introduces Machine Learning
Data prep, tuning, accuracy etc,
Different ML approaches
For the Assignment you will undertake a Random Forest analysis of AirBnb data
To prepare you there is a self directed Assignment Preparation practical that introduces RF
Assignment Prep
Self-directed practical
Introduces Random Forests
Provides methodological background to RF
Regression Trees
Bagged Trees
Includes worked examples
Assignment
Module handbook has overview
Make sure you have the latest version
It indicates the structure to be used
Every year…
Make it easy for me to award you marks
It gives an indication of where your effort should be
Assignment
Details and specifics are on the VLE
Instructions
Data
Marking Criteria
Overview
You will tune and create a Random Forest model of AirBnb rental prices using the data given to you
You will use this to predict / suggest AirBnb rental prices for some potential new listings
Overview
Instructions detail a sequence
Part 1
Data, packages etc
Making an Initial OLS model
Refining the OLS model
Predicting AirBnb price
Part 2
Demonstrates tuning a RF model using georgia data
What you have to do is tune and apply an RF to the AirBnb data
Detail
So for the assignment task you will:
split the data into training and validation subsets;
create and tune a Random Forest model with the training subset;
evaluate the model using the evaluation subset;
apply the model to the new properties data to predict their price;
compare the results with the OLS prediction;
write up the assignment in the way suggested in the Part 1 Overview.
There are no tricks
I am not trying to catch you out
I have given you clear guidance about what is to be done and how
Assignment Overview
Your sequence of work is
Week 17 practical
Self directed Assignment Preparation practical
Assignment
Questions?
Part 1. Machine learning / data mining, inference vs prediction
Part 2: Assignment Introduction
Part 3: Module Summary (normally next week)
Part 3: Module Summary and Wrap up
GEOG5917 Big Data and Consumer Analytics
Lex Comber
Professor of Spatial Data Analytics
School of Geography
University of Leeds
a.comber@leeds.ac.uk
Module Journey
Introduction to Big Data
A critical view of Data Analytics
Understanding data and data structure through EDA
Linking and joining data
Making models (Linear Regression)
Making more models (Spatial Regression with GWR)
Machine learning
Big data and Databases
Assignment: data wrangling and ML tuning
DATA
BIG Data
Module Summary
This module has sought to develop the tools and understandings needed for Big Data and Consumer Analytics
Data has been small (until this week!)
BUT the techniques for
a) EDA, examining data properties, multivariate data structure etc
b) linking, summarising, filtering, selecting data etc
c) creating inferential / predictive statistical / ML models
are the same for Big Data or small data
The translation to Big Data and Consumer Analytics is direct
Module Summary
This module has sought to develop the tools and understandings needed for Big Data and Consumer Analytics
Data has been socio-economic (mostly census data)
BUT the models for
a) inference / prediction / classification in regression
b) explaining how the response variable relates to the inputs
c) examining variations over different geographic areas
are the same for Consumer Analytics
The translation to Big Data and Consumer Analytics is direct
Big Data
Big Data refers to the data we create as part of our everyday lives and activities
We generate a lot of data
From CCTV, Satellite images and traffic monitoring
From Cell Phones to Paying for products on-line or in-store
From Travel using smartcards to Uber
… these are just the digital records of living
We generate even more data through our choices
Social media, Netflix, Spotify
All supported by core technologies: GPS-enabled devices & Connectivity – wired, wifi, mobile
Big Data
There has been an explosion of data
Sensing technologies, mobile, web and GPS
Issues around trust & data quality
sampling, representativeness, data not collected under a designed experiment
But opportunities for insight: volume, velocity
Many datasets available, both private & public
Benefits recognised across all areas of commerce & research
Remember that all Data are collected somewhere
Geography has something to say !
BUT
A critical view of Data Analytics
Big Data is always secondary data: never collected for your purpose or under a designed experiment
Fishing / Data Mining can easily result in fallacious inference
But how would you know?
Big Data analysis requires processing, reshaping and integration, a Reflexive phase and Detective work
Visualisation can help to develop hypotheses and understanding, and to reveal the flaws in the data
A critical view of Data Analytics
Exploration vs hypothesis tests
Should be more like a cycle
Explore for patterns
Experiment repeatedly: eg sample, identify patterns then apply to whole dataset / other repeated samples
Engage with domain experts to set up theoretical framework
Classic Research
1. Formulate a research question
2. Identify what data to collect and how to collect it
3. Perform some statistical tests to determine whether any effects/associations are unlikely to have occurred by chance
4. Get an answer to the question
Big Data Mining (intentional minus signs)
-1. Collect lots of data about anything
-2. Perform some kind of data mining
-3. Get some kind of answer
-4. Decide what question it was an answer to
A critical view of Data Analytics
What we suggest is a process of
View Identify Refine Zoom
Research needs to have an inferential dimension…
otherwise just generating results to arbitrary questions
… in order to identify important questions
representatively robust visualisations can help
If we don’t do this we don’t know
IF the Big Questions are deep in the Big Data
OR playing with Big Data will help us to answer Big Questions we currently have
Module Journey
Introduction to Big Data
A critical view of Data Analytics
Understanding data and data structure through EDA
Linking and joining data
Making models (Linear Regression)
Making more models (Spatial Regression with GWR)
Machine learning
Big data and Databases
Assignment: data wrangling and ML tuning
DATA
BIG Data
EDA
We don’t just plug and play
We need to know the properties of our data
We need to think about our objectives / hypotheses
We need to consider whether we can answer the questions we are interested in BEFORE we pile into data analysis
EDA is central to this
Contrary to the Big Data / Machine Learning / Data Science narrative
“let the data speak” “inference and theory free”
This is a Critical Data Science approach
EDA
So the aim is to look at the data, evaluate, perform some tests
Look at numeric data individually (univariate EDA)
Numeric: summaries of distributions, central tendency (means & medians), spreads (st dev and IQR), Visualisations of distributions (histogram, boxplot)
Non-numeric: counts, barcharts
Look at combinations of data (multivariate EDA)
Correlations, scatterplot and trendlines (numeric to numeric)
Boxplots, comparing group means with t-tests (numeric to non-numeric)
Correspondence tables and chi-squared tests (non-numeric to non-numeric)
Some statistical tests can help confirm, differences, relationships etc
But so can folding – see practical tasks!
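A minimal sketch of these checks in base R, assuming a hypothetical data frame df with numeric variables x and y and categorical variables g (two groups) and h:
summary(df$x); sd(df$x)                  # univariate: central tendency and spread
hist(df$x)                               # distribution of a numeric variable
boxplot(x ~ g, data = df)                # numeric by group
cor(df$x, df$y)                          # numeric to numeric
t.test(x ~ g, data = df)                 # numeric to non-numeric (two groups)
chisq.test(table(df$g, df$h))            # non-numeric to non-numeric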
EDA
EDA is an iterative cycle that involves
Generating questions about your data
Searching for answers by visualising, transforming, and modelling your data
Using what you learn to refine your questions and/or generate new questions
EDA is not a formal process with a strict set of rules
EDA is a state of mind
During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will zoom in on a few particularly productive areas that you’ll eventually write up and communicate to others.
EDA
EDA is fundamentally a creative process
The key to asking quality questions is to generate a large quantity of questions
It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset
There is no rule about which questions you should ask to guide your research.
Two types of questions are useful for making discoveries within your data
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
EDA
Variation is the tendency of the values of a variable to change from measurement to measurement.
You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is because measurements will include a small amount of error that varies from measurement to measurement.
Categorical variables can also vary if you measure across different subjects (e.g. the eye colors of different people), or different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting information.
EDA helps to understand patterns by visualising distributions
EDA
Typical values: in both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values.
Think about:
Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?
Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
How are the observations within each cluster similar to each other?
How are the observations in separate clusters different from each other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?
EDA
Unusual values / outliers are observations that are unusual – data points that don’t seem to fit the pattern.
When you have a lot of data, outliers are sometimes difficult to see in a histogram.
There are so many observations in the common bins that the rare bins are so short that you can’t see them
Covariation
Variation describes the behaviour within a variable
Covariation describes the behaviour between variables
the tendency for the values of two or more variables to vary together in a related way.
EDA helps to understand patterns by visualising relationships
EDA
What we are trying to do here is to gain an understanding of the data to help develop our analysis ideas
Which variables?
Distributions and correlations? Do data need to be transformed (as in the assignment)?
Which folds (medians, classes etc)?
We are trying to proceed in an informed way
Generate and share visualisations
Refine and develop ideas with experts
Linking Data
Data from different sources can be linked in two main ways
Using some kind of attribute (field, variable) that both datasets have in common
Performing some kind of spatial overlay (ie locations in common) if the data are spatial
Linking by fields
Attributes in common, a database approach, table joins
Example from paper I am writing now with 957 million records
Linking by fields
The key consideration is not just the direction of the join
left, right, inner etc
Also Cardinality
Relationships can be
1 to Many (1:M)
Many to 1 (M:1)
1 to 1 (1:1)
Many to Many (M:M)
So in database theory the cardinality of a table refers to the number of rows contained in the table
Linking by fields
Linking different types of information stored in different data tables
Datasets can be joined through common fields
BUT You need to think about the cardinality of these relations between elements in data tables
Example from Airbnb listings data
Reviews to Listings (as in Practical) 865 data points
OR listings to reviews 14880 data points
Answer different types of questions
AND different tools exist in R for indexing and performing joins
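A minimal dplyr sketch, assuming hypothetical listings and reviews tables that share a listing_id key:
library(dplyr)
# M:1 - each review gains the attributes of its listing
reviews_joined <- reviews %>% left_join(listings, by = "listing_id")
# summarise first to get a 1:1 join of review counts onto listings
review_counts <- reviews %>% group_by(listing_id) %>% summarise(n_reviews = n())
listings_joined <- listings %>% left_join(review_counts, by = "listing_id")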
Linking geographically
By Location / Geography
Probably 2 most common spatial joins you will want to undertake …
(Information moving From to To layer)
Areas to Points – straightforward
Overlay points to areas
Generates a point dataset with both sets of attributes
You will do this in the practical
# spatial overlay with sf: attach the attributes of the areas in oa_sf to the points in oa2_pts
oa2_pts %>%
  st_intersection(oa_sf) -> oa_joined_pts
Linking geographically
By Location / Geography
Probably 2 most common spatial joins you will want to undertake …
(Information moving From to To layer)
Points to Areas – NOT straightforward
Rarely 1:1 cardinality
Need to summarise the points in each area in some way
Counts, means, medians, max
See Brunsdon and Comber (2018) Chapter 5 for full worked example
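A minimal sf sketch of summarising points over areas, assuming a polygon layer oa_sf with an id field oa_code and a point layer pts_sf with a numeric price (hypothetical names):
library(sf)
library(dplyr)
oa_summary <- oa_sf %>%
  st_join(pts_sf) %>%                          # one row per polygon-point pair
  group_by(oa_code) %>%
  summarise(n_points = sum(!is.na(price)),     # count of points in each area
            median_price = median(price, na.rm = TRUE))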
Linking geographically
By Location / Geography
Data increasingly are spatial (have location attached to them)
May want to
attach the properties of an area to coincident points
summarise the properties of a load of points over an area
There are different tools to do this
Datasets can be joined geographically through spatial overlay
You need to think what you are trying to do:
Attach polygon attributes to points?
Generate point summaries for each polygon?
Area to area joins are more complex
Module Journey
Introduction to Big Data
A critical view of Data Analytics
Understanding data and data structure through EDA
Linking and joining data
Making models (Linear Regression)
Making more models (Spatial Regression with GWR)
Machine learning
Big data and Databases
Assignment: data wrangling and ML tuning
DATA
BIG Data
Models
Why construct models?
To Predict the effects of changing system components
To Test (verify) a system
To Determine the relative importance of different components of the system
To Aid understanding, rather than the model being an end in itself
Models aid in hypothesis formulation and testing
The modelling process identifies areas of the system where knowledge is lacking
Helps to direct subsequent efforts
To Aid communication
Can help to explain complex system behaviours to ‘non-experts’
BUT graphical outputs can be persuasive and model limitations may not be understood
Output can appear deceptively ‘correct’
Linear Regression
Linear Regression is the fundamental model
Construct a causal model of factors influencing an outcome: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m + \varepsilon$
y is the outcome (response, target variable)
the x’s are the factors (predictor, independent variables)
the β’s are the coefficient estimates (how changes in x are related to changes in y)
Assumes relationships are linear, that predictors are normally distributed and independent (ie not collinear or “saying the same thing”)
Linear Regression
Linear Regression is affected by outliers
these pull the hyperplane around
Could remove these but…(see practicals this week and week 5)
Assumed that linear relationships are present, that predictors are normally distributed and independent (ie not collinear or “saying the same thing”)
Other models available for different data distributions through GLM
Binomial, Poisson etc
But these are fundamental to what is incorrectly but commonly called “machine learning” and “artificial intelligence” (really just fancy stats!)
Many ML regression approaches (1000s!!!!)
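A minimal base-R sketch of the fundamental model, with hypothetical variable names echoing the earlier unemployment example:
ols <- lm(unemployed ~ degree + owner_occupied, data = oa_data)  # fit by ordinary least squares
summary(ols)                        # coefficients, p-values, R-squared: inference / understanding
predict(ols, newdata = new_oas)     # prediction for new, unseen observations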
Linear Regression
We described modelling and the core tool of regression
assumptions of linear relationships, normally distributed and independent data
Inherently assumes global relationships between covariates
ie relationship between y and x is the same everywhere
stationary relationships
In reality this assumption of process spatial invariance is violated in many instances when location is considered
non-stationary relationships
crime and unemployment story
GWR: Tobler’s Law
As a geographer…
I do not expect things to be the same everywhere
Expect to find clusters, hotspots etc… spatial auto-correlation
Interested in how and where processes, relationships, etc vary spatially
In statistical terminology… I am interested in relationship spatial non-stationarity and process spatial heterogeneity
This is Tobler’s First Law of Geography
“Everything is related to everything else, but near things are more related to each other”
Tobler, W.R., 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(sup1), pp.234-240.
GWR: Tobler’s Law
What does Tobler’s Law mean?
1) It describes spatial autocorrelation
often find that observations that are nearby are similar
Birds of a feather flock together
It is what we intuitively know
BUT this violates the fundamental principles of data independence in statistics
Clusters, hotspots, etc
It makes spatial analysis different from statistical analysis
GWR: Tobler’s Law
What does Tobler’s Law mean?
2) It describes process heterogeneity
Relationships between factors (ie processes) vary in different places
Crime and unemployment example
Nearby relationships are similar
AND this is different to the assumptions of most statistical models
Global one-size-fits-all models vs local ones
It makes spatial analysis different from statistical analysis
GWR
GWR
Uses a moving window or kernel
Constructs a series of local models using data under the kernel
GWR creates many local regression models
each local model uses nearby data
data are weighted by their distance to the location being modelled
GWR
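A minimal GWR sketch with the GWmodel package, assuming a spatial dataset oa_sp with the same hypothetical variables (bandwidth selection and kernel choices are illustrative):
library(GWmodel)
bw <- bw.gwr(unemployed ~ degree + owner_occupied, data = oa_sp,
             adaptive = TRUE, kernel = "bisquare")        # select an adaptive bandwidth
gwr_m <- gwr.basic(unemployed ~ degree + owner_occupied, data = oa_sp,
                   bw = bw, adaptive = TRUE, kernel = "bisquare")
gwr_m$SDF                                                 # locally varying coefficient estimates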
Module Journey
Introduction to Big Data
A critical view of Data Analytics
Understanding data and data structure through EDA
Linking and joining data
Making models (Linear Regression)
Making more models (Spatial Regression with GWR)
Machine learning
Big data and Databases
Assignment: data wrangling and ML tuning
DATA
BIG Data
Machine Learning
ML characteristics
Problems involve much more data (big data)
need to be cleaned, manipulated, summarised etc
Use more variables than classic statistics
more variable-to-variable interactions need to be modelled
numerical approaches to model selection
Less concerned about covariate (variable) significance
significance (p-values) are associated with insight / understanding
More concerned with predictive power
using whatever combination of variables to maximise that
BUT Risk of overfitting (because of more variables)
requires some measure to test this and validate the model
(Figure: Classic Statistics vs Machine Learning)
Overfitting
Model is too specific to training data
Low performance on new data
Machine Learning
Characteristics more evident when compared with Classic Statistics
Statistical models:
used either to infer something about relationships or predict
BUT model created based on some domain understanding (theory)
statistical models can predict, but accuracy is not a strength
ML models:
Don’t care about theory (except to make a better model)
Provide various degrees of interpretability (depends on algorithm)
Generally sacrifice interpretability for predictive power
NOTE: an ML linear regression model trained by minimising the squared error gives the same result as the equivalent statistical (OLS) regression model
Prediction vs. Inference
Prediction:
uses the estimated function f to forecast over unsampled areas
when data for those areas becomes available, or into the future, given different future states
Inference
uses the estimated function f to understand the impact of the inputs on the outcome
often associated with process understanding
Both aim to identify the function f that best approximates the relationships or structure in the data
Although Prediction and Inference are both in a sense inferential …
one inferring predictions (forecasting) and the other inferring process
… and both may use the same estimation procedure to determine f …
… the major difference lies in the purpose the estimated function is put to
Fit/Error/Accuracy Measures
Measures of fit / error / accuracy are needed to evaluate the model
during training through cross validation
after training for its generalizability by applying to test data (hold out data)
Summary so far…
We have some data
We have created training and testing subsets
We have rescaled them
We have a ML model to use
We will fit a model
We will evaluate the model
What model parameter values?
How do we decide?
How do we evaluate the final specification of the model
before applying it to the test data?
or when using it for inference (ie with no training / testing split)?
This is where measures of fit, error and accuracy are used
They are needed to evaluate the model internally (with the training data) and externally (with the test data)
Internally, error measures are generated using cross-validation
Summary
Cross-validation
Resamples the training data to evaluate the model during training
Validation
Evaluates the predictions from hold-out / test data
Provides a second, objective measure of how well the model is expected to perform
The generalizability of the ML model
Summary
Machine Learning (ML) has been around for 30-40 years
Definitions of ML are often framed comparatively to Statistics
ML focuses on accurate prediction
ML Problems involve much more data (big data)
Use more variables than classic statistics
Less concerned about covariate (variable) significance
significance (p-values) are associated with insight / understanding
More concerned with predictive power
using whatever combination of variables to maximise that
BUT Prediction (ML) vs insight / understanding (Stats) is overly simplistic
BUT Risk of overfitting (because of more variables)
Need measure to test this and to validate the model
Summary
Mechanics of ML
training and validation or test data
Train and test the model on separate data
data rescaling
So that multivariate distances reflect interactions between variables not their dimensions
measures fit, error, accuracy
RMSE: the square root of the mean squared difference between y and ŷ (smaller is better)
MAE: the mean of the absolute differences between y and ŷ (smaller is better)
R2: one minus the residual sum of squares over the total sum of squares (bigger is better)
model tuning: ML algorithms have parameters
Set up a grid of combinations of parameter values
Create and evaluate a model for each combination
Fit / Error / Accuracy (RMSE, MAE, R2)
use best parameter combination (best fit)
Module Journey
Introduction to Big Data
A critical view of Data Analytics
Understanding data and data structure through EDA
Linking and joining data
Making models (Linear Regression)
Making more models (Spatial Regression with GWR)
Machine learning
Big data and Databases
Assignment: data wrangling and ML tuning
DATA
BIG Data
Databases
So what do you do if your data is very BIG? When it is TOO big?
too big for your Computer / PC
too big for Excel
too big to fit on a hard drive, etc
This is where databases come in
Databases are collections of linked data tables
Usually stored somewhere else
Databases
What are Databases?
Collections of data tables
Data tables that have some field (attribute) in common
Records (observations) in different data tables are related to each other using a field they have in common
Queries
dplyr syntax
verbs
joins
familiar and powerful
the same as / similar to working with in-memory data tables
use all of the dplyr verbs and two-table manipulations
Databases
Databases are collections of linked data tables
They allow BIG data to be accessed and queried without having to load data into working memory
The interface with the database is managed through a DBMS
This supports database queries
The dplyr syntax can be used to construct queries
It translates them to SQL before passing to the DBMS
Only the results of the query are returned to the local computer
Allows big data analyses within a single environment
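A minimal sketch of this workflow with an in-memory SQLite database (hypothetical listings data frame and fields):
library(DBI)
library(dplyr)
library(dbplyr)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "listings", listings)       # copy a local data frame into the database
listings_db <- tbl(con, "listings")           # a reference to the remote table, not the data itself
qry <- listings_db %>%
  filter(price > 100) %>%
  group_by(room_type) %>%
  summarise(n = n(), mean_price = mean(price))
show_query(qry)                               # the dplyr pipeline translated to SQL
collect(qry)                                  # only the query result is returned to R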
Module Journey
Introduction to Big Data
A critical view of Data Analytics
Understanding data and data structure through EDA
Linking and joining data
Making models (Linear Regression)
Making more models (Spatial Regression with GWR)
Machine learning
Big data and Databases
Assignment: data wrangling and ML tuning
DATA
BIG Data
Finally…
Thank You!
What is a model?
A model seeks to represent (some of) the real world
A collection of components which produce some kind of output (a system)
Models represent interactions between factors and their effects on outputs
Factors can be the component parts of a system
Simple systems: outcomes can be predicted without a model, e.g. vaccination will protect from a given disease
Complex systems: difficult to predict outcomes due to interacting factors / circumstances
Models seek to answer What if and Why / How questions
what might happen if…?
why or how does the effect come about?
(Diagram: Inputs → Do Stuff → Outputs)
Linear Regression
Regression finds the hyper-plane in multivariate space with the minimum distance of each point to the plane
Easy to see / understand in 2D (ie with one factor)
And in 3D (with rotatable plot)
See demo in R
Similar in a multi-dimensional hyperspace (i.e. lots of factors)
Good illustration of workings here: http://www.learnbymarketing.com/tutorials/linear-regression-by-hand-in-excel/
GWR
• Geographically Weighted
Regression
– Sensitive to Tobler’s 1st Law
– Generates model coefficient
estimates at different locations
– Shows spatial variation in relationship
Background
• Geographically Weighted
Regression
• Explicitly accommodates
Tobler’s 1st Law
• An example of a spatially
varying coefficient model
� Probably the most well known
Brunsdon C, Fotheringham AS, Charlton ME (1996). Geographically Weighted Regression: A
Method for Exploring Spatial Non-stationarity. Geographical Analysis 28(4):281-298, many
papers thereafter and a book in 2002…
[Maps of GWR coefficient estimates: PctBach (100 to 700) and PctEld (−2,600 to −1,200)]
GWR
• GWR: spatially varying coefficient model
• Let’s break this down
2. spatially varying
– GWR creates many local regression models
– at lots of locations across the study area
– each local model uses nearby data
– data are weighted by their distance to the
location being modelled
y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\,x_{i1} + \beta_2(u_i, v_i)\,x_{i2} + \dots + \beta_m(u_i, v_i)\,x_{im} + \varepsilon_i
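As a hedged sketch of how such a model might be calibrated in R with the GWmodel package: the data object (georgia_sp) and the variable names are placeholders echoing the maps above, not the practical's own code.

# Sketch only: calibrating a GWR model with the GWmodel package.
# 'georgia_sp' is assumed to be an sp (Spatial*DataFrame) object containing
# MedInc, PctBach and PctEld - placeholder names, not the practical's data.
library(GWmodel)

# select an adaptive bandwidth (a number of nearest neighbours) by cross-validation
bw <- bw.gwr(MedInc ~ PctBach + PctEld, data = georgia_sp,
             approach = "CV", kernel = "bisquare", adaptive = TRUE)

# fit a local regression at each location, weighting nearby data by distance
gwr.m <- gwr.basic(MedInc ~ PctBach + PctEld, data = georgia_sp,
                   bw = bw, kernel = "bisquare", adaptive = TRUE)
gwr.m   # prints the global model and summaries of the local coefficient estimates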
Real world → Measurement (collect data) → Analysis → inference about process
Measurement (collect data) → Analysis → prediction
[Figures: scatter plots of predicted vs observed values; bar charts of prescription items (counts) and costs by year, 2011–2018]
Model Tuning
• A ranger Random Forest example
• This is where model tuning comes in
– Set up a grid of combinations of parameter values
– Create and evaluate a model for each combination
• Fit / Error / Accuracy (RMSE, MAE, R²)
• Use the best parameter combination (best fit); a sketch of this grid search follows below
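A hedged sketch of this tuning loop with the ranger package: the training data frame (train_df), the response (unmplyd) and the grid values are placeholders, not the practical's own settings.

# Sketch: grid search over ranger Random Forest parameters.
# 'train_df' and the response 'unmplyd' are placeholder names.
library(ranger)

params <- expand.grid(mtry = c(2, 4, 6),
                      min.node.size = c(1, 5, 10),
                      num.trees = c(250, 500))

params$oob_rmse <- NA
for (i in seq_len(nrow(params))) {
  m <- ranger(unmplyd ~ ., data = train_df,
              mtry = params$mtry[i],
              min.node.size = params$min.node.size[i],
              num.trees = params$num.trees[i],
              seed = 123)
  # prediction.error is the out-of-bag MSE for a regression forest
  params$oob_rmse[i] <- sqrt(m$prediction.error)
}
params[which.min(params$oob_rmse), ]   # the best parameter combination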
Training & Validation Data
• Split the data into training / validation subsets (e.g. 70:30 or 80:20); see the sketch below
• Train (create) model using
training subset
– Internal measures of model
performance
• Test (evaluate) model on the
testing subset
– External measures of performance
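A minimal sketch of a 70:30 split in base R, assuming a modelling data frame called df (a placeholder name):

# Sketch: a 70:30 random split into training and validation subsets
set.seed(123)                                      # reproducible split
train_idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
train_df  <- df[train_idx, ]                       # used to train (create) the model
test_df   <- df[-train_idx, ]                      # held out to evaluate the model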
Fit/Error/Accuracy Measures
• Measures of fit / error / accuracy
are needed to evaluate the model
– during training through cross
validation
– after training, for its generalisability, by applying it to test data (hold-out data); see the sketch below
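A sketch of the external measures, assuming a fitted model m with a standard predict() method (e.g. an lm) and the hold-out data test_df from the split above; the response name unmplyd is a placeholder.

# Sketch: external accuracy measures on the hold-out (test) data
obs  <- test_df$unmplyd
pred <- predict(m, newdata = test_df)

rmse <- sqrt(mean((obs - pred)^2))    # Root Mean Square Error
mae  <- mean(abs(obs - pred))         # Mean Absolute Error
r2   <- cor(obs, pred)^2              # squared correlation (R-squared)
c(RMSE = rmse, MAE = mae, R2 = r2)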
Variable Importance
[Figure: variable importance scores (0 to 100) from six models (RF, GBM, SVM, LM, kNN, TB) for predictors including Beds, gs_area, Northing, Detached, Semi.DetachedTRUE, u45, unmplyd, u65, DetachedTRUE, Easting, u16, o65, TerracedTRUE, En.suiteTRUE, Garage, KitchenTRUE and u25]
databate
verb data·ur·ba·te \ˌdat-tər-ˈbāt \
Popularity: Bottom 2% of words
Definition of databate
: meaningless manipulation of database, especially one’s own,
commonly resulting in ‘something’ and achieved by mining or other
analysis exclusive of context or theory, by instrumental manipulation,
occasionally by inferential fantasy, or by various combinations of these
agencies
aggregate(by = list(df$OAC_class), FUN = mean) %>%
  melt(id.vars = "Group.1",
       variable.name = "Var",
       value.name = "Value") -> agg_melted
You may wish to examine the outputs. The melted data can be passed to ggplot and used to construct polar or radar plots showing the typical (mean) values of the different OAC classes for each of the numeric variables, as in Figure 10.
ggplot(data = agg_melted, aes(x = factor(Var), y = Value,
                              group = Group.1, colour = Group.1, fill = Group.1)) +
  geom_point(size = 2) +
  geom_polygon(size = 1, alpha = 0.4) +
  scale_x_discrete() +
  theme_light() +
  facet_wrap(~Group.1, ncol = 4) +
  scale_color_manual(values = brewer.pal(8, "Set1")) +
  scale_fill_manual(values = brewer.pal(8, "Set1")) +
  coord_polar() +
  theme(legend.position = "none",
        axis.title = element_blank(),
        axis.text = element_text(size = 6))
[Figure 10 panels: one radar plot per OAC class (Constrained City Dwellers, Cosmopolitans, Ethnicity Central, Hard-Pressed Living, Multicultural Metropolitans, Suburbanites, Urbanites), each showing the scaled means (−2 to 2) of gs_area, u16, u25, u45, u65, o65, unmplyd, OnePH, OneFamH, Degree, White, Mixed, Asian, Black and Other]
Figure 10: Radar plots of the variable mean values for each OAC class.
In the above code the data were rescaled to show the variables on the same polar axis. Rescaling can be done in a number of ways and there are many functions to do it. Here the scale function applies a classic standardised approach around a mean of 0 with a standard deviation of 1 – i.e. a z-score. Others use the variable minimum and maximum to linearly scale between 0 and 1 (such as the rescale function in the scales package). The polar plots in the figure show that there are some large differences between OAC classes in most of the variables, although how evident this is will depend on the method of rescaling (try changing …)
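As a hedged illustration of the two rescaling options mentioned above (x below is just an arbitrary numeric vector, not a variable from the practical data):

# Two common rescaling options applied to an arbitrary numeric vector
x <- c(2, 5, 9, 14, 20)

scale(x)            # z-score: centred on mean 0, standard deviation 1
scales::rescale(x)  # linear rescaling to the range 0-1 (scales package)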
[Plotted pairwise correlation values (shaded from −1.0 to 1.0) between u16, u25, u45, u65, o65, unmplyd, OnePH, Degree, White, Mixed, Asian, Black and Other]
Figure 8: A correlation plot of numeric variables.
data. This is especially useful with very large data. Plot character size and transparency can be used as above to aid visualisation. Binning provides another route. There are a number of options: geom_bin2d, geom_hex and geom_density2d – try using each of these in turn in the 2nd line of the code below to generate alternative versions of Figure 9 (note you may have to install the hexbin package: install.packages("hexbin", dep = T)).
The bin shadings provide a convenient way of representing the density of data points with a similar value. Of course different elements of faceting, ordering, binning and grouping can be combined as in Figure 9:
df %>% ggplot(mapping = aes(x = Degree, y = unmplyd)) +
  geom_hex(bins = 15) +
  facet_wrap(~OAC_class, ncol = 3) +
  labs(fill = "Count") +
  coord_fixed() +
  scale_fill_gradient(low = "lemonchiffon", high = "darkblue")
A final set of considerations for examining multiple variables simultaneously is the use of radar plots. These seek to comparatively show the properties (attributes) of different groups or classes of variables in the data alongside each other, in order to determine visually whether specific groups have different general properties.
Radar plots with an aggregate function can be used to generate and display summaries (means in this case) of each variable for different groups, such as the OAC classes. The data need to be rescaled (so the different measures are comparable) and melted before being passed to ggplot:
df %>% select(-code, -OAC_class) %>% # drop the non-numeric attributes
  sapply(scale) %>%
One-to-many: one parcel has many owners
Many-to-many: many parcels have many owners
Many-to-one: many parcels have one owner
One-to-one: one parcel has one owner
Linear Regression
• Interpreting the outputs
– Fit
– Coefficient estimates (rates)
– Significance
• Detail in the practicals
• Coefficient Estimates
– How changes in x are associated
with changes in y
– e.g. a 1% increase in PctBach is associated with an increase of $562 in median income (see the sketch below)
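A hedged sketch of reading these outputs from an lm fit: the data frame (georgia_df) and variable names echo the example above but are placeholders, not the practical's objects.

# Sketch: interpreting an OLS fit ('georgia_df', 'MedInc', 'PctBach' and
# 'PctEld' are placeholder names echoing the example above)
m <- lm(MedInc ~ PctBach + PctEld, data = georgia_df)
summary(m)          # R-squared (fit), coefficient estimates, p-values (significance)
coef(m)["PctBach"]  # e.g. ~562 would mean a 1% rise in PctBach is associated with
                    # ~$562 higher median income, holding PctEld constant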
Tobler’s 1st Law of Geography
Waldo Tobler in front of the Newberry Library. Chicago,
November 2007
(http://en.wikipedia.org/wiki/Waldo_R._Tobler)
“Everything is related to everything else, but
near things are more related than distant
things”
http://en.wikipedia.org/wiki/University_of_California,_Santa_Barbara
Tobler W., (1970) A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(sup1): 234-240
y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\,x_{i1} + \beta_2(u_i, v_i)\,x_{i2} + \dots + \beta_m(u_i, v_i)\,x_{im} + \varepsilon_i
• Uses data under a moving window, or kernel
– the ‘geographically’ bit
• Data under the kernel are weighted by their distance to the kernel centre
– the ‘weighted’ bit
• A local model is constructed at that kernel location (a kernel sketch follows below)
– Regression, PCA, correspondence analysis, etc
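A hedged sketch of the ‘weighted’ bit: a bisquare distance-decay kernel of the kind commonly used in GW models. The bandwidth value and distances below are arbitrary placeholders.

# Sketch: a bisquare distance-decay kernel, as commonly used in GW models
bisquare <- function(d, bw) {
  # weight of 1 at the kernel centre, decaying to 0 at the bandwidth bw
  ifelse(d < bw, (1 - (d / bw)^2)^2, 0)
}
d <- seq(0, 15, by = 1)   # distances from the kernel centre (arbitrary units)
bisquare(d, bw = 10)      # the weights applied to data in the local model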
Linking by fields
• Attributes in common, a database approach, table joins (a join sketch follows after the table diagram below)
• Example from a paper I am writing now, with 957 million records
prescriptions: SHA, PCT, PRACTICE, BNF.CODE, BNF.NAME, ITEMS, NIC, ACT.COST, QUANTITY, PERIOD
ccg_patients: CCGcode, CCGnm_s, ccg.reg.pa
ccg: CCGcode, geometry
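A hedged sketch of a field-based join between these tables with dplyr, assuming ccg has been read in as an sf object; the join uses only the CCGcode field shown in the diagram and is illustrative rather than the paper's actual workflow.

# Sketch: linking tables by a common field (CCGcode) with dplyr.
# The table and field names follow the diagram above; the join itself
# is illustrative, not the paper's actual workflow.
library(dplyr)
library(sf)

ccg_joined <- ccg %>%                      # spatial layer with CCGcode + geometry
  left_join(ccg_patients, by = "CCGcode")  # attach registered-patient attributes

# prescriptions would need a lookup from PRACTICE / PCT to CCGcode
# before a similar join could attach them to the CCG geography.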