
Introduction to Data Science

Bowei Chen

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M Data Science

Hello, I’m a

Data Scientist!

My research interest lies mostly in developing

intelligent algorithms and data solutions to the

following fields:

• Computational advertising:

programmatic guarantee

• Internet economics and digital products:

inventory pricing, information systems

• Mathematical finance:

derivatives pricing, algorithmic trading

http://staff.lincoln.ac.uk/bchen


Module Motivation

“Torture the data, and it will confess to anything.”

Ronald Coase, Nobel Prize Laureate in Economics

“I keep saying that the sexy job in the next 10 years will be statisticians.”

Hal Varian, Google Chief Economist

“Data is a precious thing and will last longer than the systems themselves.”

Tim Berners-Lee, Inventor of the World Wide Web

Top 10 Hot Job Titles That

Barely Existed 5 Years Ago | 2014

The Alan Turing Institute

It is the UK’s national centre for data science, headquartered at the British Library.

Following a public competition with international peer review, the Institute was

founded in 2015 as a joint venture by the universities of Cambridge, Edinburgh,

Oxford, University College London, Warwick and the UK EPSRC. https://turing.ac.uk


Module Information

Title Data Science

Code CMP3036M/CMP9063M

Semester 2016-2017 Semesters A & B

Coordinator Bowei Chen

Instructors Bowei Chen (Semester A)

TBC (Semester B)

Demonstrators Deema Abdal Hafeth

Liyun Gong

Jingmin Huang

Assessment CMP3036M: Assignment (50%) + Assignment (50%)

CMP9063M: Assignment (40%) + Assignment (40%) + Report (20%)

Topics in Semester A

Week A01 Introduction (Lecture)

Weeks A02-06 Theory: Fundamentals of Probability and Statistics (Lecture)

• Probability Concept

• Popular Distributions

• Point and Interval Estimation

• Sampling and Hypothesis Testing

Practice: R (Workshop)

Week A07 Directed Study

Weeks A08-13 Theory: Supervised Learning (Lecture)

• Data Preparation and Model Evaluation

• Linear and Logistic Regressions

• Naïve Bayes and Decision Tree

Practice: R and Microsoft Azure (Workshop)

Timetable in Semester A

Lecture

Thursday 15:00 – 16:00 @ MB0312

Workshop

Group A:

Thursday 9:00 – 11:00 @ MC3203

Group C:

Thursday 16:00 – 18:00 @ MC3204

Group B:

Friday 15:00 – 17:00 @ MC3204

Contact Information

Name Role Contact

Bowei Chen* Module Coordinator bchen@lincoln.ac.uk

Deema Abdal Hafeth Demonstrator/TA dabdalhafeth@lincoln.ac.uk

Liyun Gong Demonstrator/TA lgong@lincoln.ac.uk

Jingmin Huang Demonstrator/TA jhua8590@gmail.com

* Office Hours: Monday 14:00 – 16:00 @ MC3220B, MHT


Module Github page

https://github.com/boweichen/CMP3036MDataScience

• Detailed Course Topics (by Week for Semester A)

• Reading List

Note: please check Blackboard for your slides, example code, and assessment documents. The reading list on the GitHub page is a general guide for your directed study. I suggest that you select materials according to your background and interests.


What is Data Science?
There is much debate about what

data science is, and what it isn’t.

Data Science

It is an interdisciplinary field about processes and systems to extract knowledge or

insights from data in various forms. It includes:

• Dealing with data storage and retrieval

• Summarising and analysing data

• Parallel data processing

• Pattern recognition and statistical testing

• Building predictive models

• Data visualisation

• Management information system (MIS) reporting

Statistics

Statistics is the study of the collection, analysis, interpretation, presentation, and

organisation of data.

Some people regard statistics as a branch of mathematics, but not all mathematicians and statisticians agree with this view.

https://www.quora.com/What-do-pure-mathematicians-and-statisticians-think-of-each-other/answer/Michael-Hochster

Machine Learning

A computer program is said to learn from experience E with respect to some class of

tasks T and performance measure P, if its performance at tasks in T, as measured by

P, improves with experience E.

Tom Mitchell

Example (a handwriting recognition learning problem)

Task T: recognizing and classifying handwritten words within images

Performance measure P: percent of words correctly classified

Training experience E: a database of handwritten words with given classifications
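The T, P, E vocabulary can be tried out on a much smaller problem than handwriting recognition. The sketch below is a made-up toy, not part of the module materials: the hidden rule (`THRESHOLD`), the evenly spaced training examples, and the choice of a 1-nearest-neighbour learner are all assumptions for illustration. It shows the performance measure P rising as the experience E grows.

```python
# Toy illustration of Mitchell's T / P / E definition (not handwriting recognition).
THRESHOLD = 0.37  # hypothetical "true" rule, hidden from the learner


def true_label(x):
    """Ground truth used to label examples and to score predictions."""
    return 1 if x >= THRESHOLD else 0


def make_experience(n):
    """Experience E: n evenly spaced labelled examples on [0, 1)."""
    return [((i + 0.5) / n, true_label((i + 0.5) / n)) for i in range(n)]


def predict(training, x):
    """Task T: classify x by copying the label of the nearest training example."""
    _, nearest_label = min(training, key=lambda ex: abs(ex[0] - x))
    return nearest_label


def accuracy(training, test_points):
    """Performance P: fraction of test points classified correctly."""
    correct = sum(predict(training, x) == true_label(x) for x in test_points)
    return correct / len(test_points)


test_points = [j / 1000 for j in range(1000)]
accuracies = {n: accuracy(make_experience(n), test_points) for n in (2, 10, 100)}
for n, acc in accuracies.items():
    print(f"E = {n:3d} examples -> P = {acc:.3f}")
```

With more labelled examples, the learner's implied decision boundary moves closer to the hidden threshold, so accuracy on the fixed test grid increases.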

Five Tribes of Machine Learning

Tribe | Origins | Master algorithm | Representative scientists
Symbolists | Logic, philosophy | Inverse deduction | Tom Mitchell, Steve Muggleton, Ross Quinlan
Connectionists | Neuroscience | Backpropagation | Geoff Hinton, Yann LeCun, Yoshua Bengio
Evolutionaries | Evolutionary biology | Genetic programming | John Koza, John Holland, Hod Lipson
Bayesians | Statistics | Probabilistic inference | David Heckerman, Judea Pearl, Michael Jordan
Analogisers | Psychology | Kernel machines | Peter Hart, Vladimir Vapnik, Douglas Hofstadter

Pedro Domingos. The Five Tribes of Machine Learning and What You Can Take from Each. University of Washington.

Machine Learning vs Statistics

Glossary

Machine learning | Statistics
Network, graphs | Model
Weights | Parameters
Learning | Fitting
Generalization | Test set performance
Supervised learning | Regression/classification
Unsupervised learning | Density estimation, clustering
Large grant = $1,000,000 | Large grant = $50,000
Nice place to have a meeting: Snowbird, Utah, French Alps | Nice place to have a meeting: Las Vegas in August
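The "weights = parameters" and "learning = fitting" rows of the glossary can be made concrete with one straight-line model estimated in two ways. This is a minimal sketch on made-up data (points on the line y = 2x + 1); the learning rate and iteration count are assumptions chosen so that gradient descent converges on this particular data set.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # made-up data on the line y = 2x + 1
n = len(xs)

# Statistics vocabulary: "fit" the "parameters" by ordinary least squares.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Machine learning vocabulary: "learn" the "weights" by gradient descent
# on the mean squared error of the same linear model.
w, b = 0.0, 0.0
learning_rate = 0.05  # assumed value, small enough to converge here
for _ in range(5000):
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"fitted parameters: slope={slope:.4f}, intercept={intercept:.4f}")
print(f"learned weights:   w={w:.4f}, b={b:.4f}")
```

Both routes arrive at essentially the same two numbers; only the names and the estimation procedure differ.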

To paraphrase provocatively, ‘machine learning is statistics minus any

checking of models and assumptions’.

Brian D. Ripley

Glossary by Rob Tibshirani

http://blog.revolutionanalytics.com/2009/09/the-difference-between-statistics-and-machine-learning.html


Data Mining

In data mining, the major task is to find interesting, previously unknown rules and relations in data, such as predictive rules, clusters, or associations. The algorithms tend to be more deterministic and procedural, although statistics is often used to make decisions along the way.
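As a small illustration of the "associations" mentioned above, the sketch below scores one candidate rule, bread → milk, by its support and confidence. The shopping baskets are invented for the example, and the helper names `support` and `confidence` are hypothetical, not taken from any particular library.

```python
# Made-up shopping baskets; each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent): support of both over support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

The procedure itself is deterministic counting; a statistical judgement only enters when deciding which support and confidence levels are high enough to call a rule interesting.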

Schematic View

By Brendan Tierney, 2012

Note:

• I have a slightly different view: statistics, data mining and ML should overlap.

• Several of the fields shown here are still growing. Their definitions, and their relationships to each other, may change in the future.

What Tools Do

Data Scientists Use?

According to the 2014 Data Science Salary Survey, the most frequently used tools (with corresponding salary distribution) are:

https://www.quora.com/What-tools-do-data-scientists-use#!n=18


Cluster 1

• Windows

• Oracle

• SAS

• Excel

• SQL

• C#

• SPSS

• MS SQL Server

• VBA

• Microstrategy

Cluster 2

• Linux

• Java

• Redis

• Hive

• Amazon

• MapReduce

• Scala

• Spark

• Pig

• Hbase

• Storm

• MapR

• MongoDB

Cluster 3

• R

• Python

• Matlab

• Network Graph

• Weka

• Libsvm

• Continuum Analytics

Cluster 4

• Mac OS X

• MySQL

• JavaScript

• D3

• Ruby

• SQLite

• Google Chart Tools

• PostgreSQL

Cluster 5

• Unix

• C++

• Perl

• C

“Tool Correlation” in the Same Survey Book Maps

Different Kinds of Data Scientists

I do feel these clusters correspond well to the roles each data scientist plays in general:

• Cluster 1 — Business Intelligence

• Cluster 2 — Hadoop and Data Engineering

• Cluster 3 — Machine Learning and Data Analytics

• Cluster 4 — Data Visualization

https://www.quora.com/What-tools-do-data-scientists-use#!n=18


Top 10 Skills That Data Scientists

Have Listed on Their LinkedIn Profiles

• Data Mining

• Machine Learning

• R

• Python

• Data Analysis

• Statistics

• SQL

• Java

• Matlab

• Algorithms

https://www.linkedin.com/pulse/20140903194459-57656293-the-data-science-skills-network?trk=mp-reader-card

The module is skills-based.

You will study the popular

data science skills through

both semesters.


The mathematical content

will be kept to the minimum

necessary. However, this

minimum level is nonzero.

Be prepared to get your

hands dirty with data

programming and on the

detail of the data in order

to really understand it!

I only teach the fundamentals in

class and many topics can be

explored in depth. I hope you will

have fun in studying this module!

“It is a short course, not a hurried course!”

Y. Abu-Mostafa et al., Learning from Data

Leisure Reading

Thank You

bchen@lincoln.ac.uk