Introduction to Data Science
Bowei Chen
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M Data Science
Hello, I’m a
Data Scientist!
My research interests lie mostly in developing
intelligent algorithms and data solutions for the
following fields:
• Computational advertising:
programmatic guarantee
• Internet economics and digital products:
inventory pricing, information systems
• Mathematical finance:
derivatives pricing, algorithmic trading
http://staff.lincoln.ac.uk/bchen
Module Motivation
“Torture the data, and it will confess to anything.”
Ronald Coase, Nobel Prize Laureate in Economics
“I keep saying that the sexy job in the next 10 years will be statisticians.”
Hal Varian, Google Chief Economist
“Data is a precious thing and will last longer than the systems themselves.”
Tim Berners-Lee, Inventor of the World Wide Web
Top 10 Hot Job Titles That
Barely Existed 5 Years Ago | 2014
The Alan Turing Institute
It is the UK’s national centre for data science, headquartered at the British Library.
Following a public competition with international peer review, the Institute was
founded in 2015 as a joint venture by the universities of Cambridge, Edinburgh,
Oxford, University College London, Warwick and the UK EPSRC. https://turing.ac.uk
Module Information
Title:          Data Science
Code:           CMP3036M/CMP9063M
Semester:       2016-2017, Semesters A & B
Coordinator:    Bowei Chen
Instructors:    Bowei Chen (Semester A), TBC (Semester B)
Demonstrators:  Deema Abdal Hafeth, Liyun Gong, JingMin Huang
Assessment:     CMP3036M: Assignment (50%) + Assignment (50%)
                CMP9063M: Assignment (40%) + Assignment (40%) + Report (20%)
Topics in Semester A
Week A01 Introduction (Lecture)
Weeks A02-06 Theory: Fundamentals of Probability and Statistics (Lecture)
• Probability Concept
• Popular Distributions
• Point and Interval Estimation
• Sampling and Hypothesis Testing
Practice: R (Workshop)
Week A07 Direct Study
Weeks A08-13 Theory: Supervised Learning (Lecture)
• Data Preparation and Model Evaluation
• Linear and Logistic Regressions
• Naïve Bayes and Decision Tree
Practice: R and Microsoft Azure (Workshop)
Timetable in Semester A
Lecture:           Thursday 15:00 – 16:00 @ MB0312
Workshop Group A:  Thursday 9:00 – 11:00 @ MC3203
Workshop Group B:  Friday 15:00 – 17:00 @ MC3204
Workshop Group C:  Thursday 16:00 – 18:00 @ MC3204
Contact Information
Name                 Role                Contact
Bowei Chen*          Module Coordinator  bchen@lincoln.ac.uk
Deema Abdal Hafeth   Demonstrator/TA     dabdalhafeth@lincoln.ac.uk
Liyun Gong           Demonstrator/TA     lgong@lincoln.ac.uk
Jingmin Huang        Demonstrator/TA     jhua8590@gmail.com
* Office Hours: Monday 14:00 – 16:00 @ MC3220B, MHT
Module GitHub Page
https://github.com/boweichen/CMP3036MDataScience
• Detailed Course Topics (by Week for Semester A)
• Reading List
Note: please check the slides, example code, and
assessment documents on Blackboard. The reading
list on the GitHub page is a general guide for your
direct study. I suggest you select the materials
according to your background and interests.
What is Data Science?
There is much debate about what
data science is, and what it isn’t.
Data Science
It is an interdisciplinary field about processes and systems to extract knowledge or
insights from data in various forms. It includes:
• Dealing with data storage and retrieval
• Summarising and analysing data
• Parallel data processing
• Pattern recognition and statistical testing
• Building predictive models
• Data visualisation
• Management information system (MIS) reporting
Statistics
Statistics is the study of the collection, analysis, interpretation, presentation, and
organisation of data.
Some people regard statistics as a branch of mathematics, but not all
mathematicians and statisticians agree with this view.
https://www.quora.com/What-do-pure-mathematicians-and-statisticians-think-of-each-other/answer/Michael-Hochster
Machine Learning
A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by
P, improves with experience E.
Tom Mitchell
Example (a handwriting recognition learning problem)
Task T: recognizing and classifying handwritten words within images
Performance measure P: percent of words correctly classified
Training experience E: a database of handwritten words with given classifications
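The T, P, E definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the handwriting example itself: the two synthetic 1-D classes, the nearest-centroid rule, and all function names are hypothetical choices made for this sketch. The task T is classifying a point, the performance measure P is accuracy on a held-out test set, and the experience E is the number of labelled training examples.

```python
import random
import statistics

def make_examples(n_per_class, rng):
    """Experience E: labelled 1-D points from two synthetic classes (assumed data)."""
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(4.0, 1.0), 1) for _ in range(n_per_class)]
    return data

def fit_centroids(examples):
    """Learn one mean value per class from the training examples."""
    return {
        label: statistics.fmean([x for x, y in examples if y == label])
        for label in (0, 1)
    }

def classify(centroids, x):
    """Task T: assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def accuracy(centroids, examples):
    """Performance measure P: fraction of examples classified correctly."""
    correct = sum(classify(centroids, x) == y for x, y in examples)
    return correct / len(examples)

rng = random.Random(0)               # fixed seed so the run is repeatable
test_set = make_examples(500, rng)   # held-out data used only to measure P
results = {}
for n in (1, 5, 50, 500):
    model = fit_centroids(make_examples(n, rng))
    results[n] = accuracy(model, test_set)
    print(f"E = {2 * n:4d} training examples -> P = {results[n]:.3f}")
```

With very few training examples the estimated centroids are noisy, so P can vary a lot from run to run. With more examples the estimates settle down and P typically approaches the best accuracy this simple rule can reach on the assumed data.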
Five Tribes of Machine Learning
Tribe           Origins               Master algorithm         Representative scientists
Symbolists      Logic, philosophy     Inverse deduction        Tom Mitchell, Steve Muggleton, Ross Quinlan
Connectionists  Neuroscience          Backpropagation          Geoff Hinton, Yann LeCun, Yoshua Bengio
Evolutionaries  Evolutionary biology  Genetic programming      John Koza, John Holland, Hod Lipson
Bayesians       Statistics            Probabilistic inference  David Heckerman, Judea Pearl, Michael Jordan
Analogisers     Psychology            Kernel machines          Peter Hart, Vladimir Vapnik, Douglas Hofstadter
Pedro Domingos. The Five Tribes of Machine Learning and What You Can Take from Each. University of Washington.
Machine Learning vs Statistics
Glossary
Machine learning                Statistics
Network, graphs                 Model
Weights                         Parameters
Learning                        Fitting
Generalization                  Test set performance
Supervised learning             Regression/classification
Unsupervised learning           Density estimation, clustering
Large grant = $1,000,000        Large grant = $50,000
Nice place to have a meeting:   Nice place to have a meeting:
Snowbird, Utah, French Alps     Las Vegas in August
To paraphrase provocatively, ‘machine learning is statistics minus any
checking of models and assumptions’.
Brian D. Ripley
Rob Tibshirani
http://blog.revolutionanalytics.com/2009/09/the-difference-between-statistics-and-machine-learning.html
Data Mining
In data mining, the major task is to find interesting, previously unknown rules and
relations in data, such as predictive rules, clusters, or associations. The algorithms
tend to be more deterministic and procedural, although statistics is often used to make
decisions in the process.
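As a small illustration of what finding an association can look like, the sketch below scores one candidate rule on a toy set of shopping baskets. The baskets, the rule, and the function names are all hypothetical. Support and confidence are standard measures for association rules, used here as an assumption because the text above does not name a specific measure.

```python
# Toy transaction data (hypothetical), one set of items per basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)

# Candidate rule: bread -> butter
rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Here bread and butter appear together in 3 of the 5 baskets, and 3 of the 4 baskets with bread also contain butter. A data mining algorithm would search over many candidate rules and keep those whose scores pass chosen thresholds.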
Schematic View
By Brendan Tierney, 2012
Note: I have a slightly different view:
• Statistics, data mining and ML
should overlap.
• Several of the fields shown here are growing.
Their definitions, and their relationships
to each other, may change in the future.
What Tools Do
Data Scientists Use?
According to the 2014 Data
Science Salary Survey,
the most frequently used
tools (with corresponding
salary distribution) are:
https://www.quora.com/What-tools-do-data-scientists-use#!n=18
Cluster 1
• Windows
• Oracle
• SAS
• Excel
• SQL
• C#
• SPSS
• MS SQL Server
• VBA
• Microstrategy
Cluster 2
• Linux
• Java
• Redis
• Hive
• Amazon
• MapReduce
• Scala
• Spark
• Pig
• Hbase
• Storm
• MapR
• MongoDB
Cluster 3
• R
• Python
• Matlab
• Network Graph
• Weka
• Libsvm
• Continuum Analytics
Cluster 4
• Mac OS X
• MySQL
• JavaScript
• D3
• Ruby
• SQLite
• Google Chart Tools
• PostgreSQL
Cluster 5
• Unix
• C++
• Perl
• C
“Tool Correlation” in the Same Survey Book Maps
Different Kinds of Data Scientists
I do feel these clusters correspond well to the roles each data scientist plays in general:
• Cluster 1 — Business Intelligence
• Cluster 2 — Hadoop and Data Engineering
• Cluster 3 — Machine Learning and Data Analytics
• Cluster 4 — Data Visualization
https://www.quora.com/What-tools-do-data-scientists-use#!n=18
Top 10 Skills That Data Scientists
Have Listed on Their LinkedIn Profiles
• Data Mining
• Machine Learning
• R
• Python
• Data Analysis
• Statistics
• SQL
• Java
• Matlab
• Algorithms
https://www.linkedin.com/pulse/20140903194459-57656293-the-data-science-skills-network?trk=mp-reader-card
The module is skills-based.
You will study the popular
data science skills through
both semesters.
The mathematical content
will be kept to the minimum
necessary. However, this
minimum level is nonzero.
Be prepared to get your
hands dirty with data
programming and with the
details of the data in order
to really understand it!
I only teach the fundamentals in
class and many topics can be
explored in depth. I hope you will
have fun in studying this module!
“It is a short course, not a hurried course!”
Y. Abu-Mostafa et al.
Learning from Data
Leisure Reading
Thank You
bchen@lincoln.ac.uk