
Lead Research Scientist, Financial Risk Quantitative Research, SS&C Algorithmics
Adjunct Professor, University of Toronto
MIE1624H – Introduction to Data Science and Analytics
Lecture 1 – Introduction
University of Toronto, January 11, 2022


◼ Lead Research Scientist, Financial Risk Quantitative Research at SS&C Algorithmics, formerly with Watson Financial Services, IBM
◼ Ph.D. in Computer Science from McMaster University
◼ Author of over 20 papers and reports
◼ Adjunct professor at University of Toronto and lecturer at McMaster University
◼ Research areas:
❑ business analytics, operational research, optimization, finance
❑ portfolio optimization, multi-objective optimization
❑ market and credit risk modeling and optimization
❑ numerical methods for risk management
❑ design of numerical algorithms and their software implementation

Profession
«Choose a job you love, and you will never have to work a day in your life.»
«The only way to do great work is to love what you do. If you haven’t found it yet, keep looking. Don’t settle.»

Data Science
[Figure: diagram relating Data Science to Machine Learning and Business & Domain Expertise]

 Analytics

What is analytics?
Analytics is the scientific process of deriving insights from data in order to make decisions
◼ Descriptive Analytics: What has happened?
◼ Predictive Analytics: What will happen?
◼ Prescriptive Analytics and Artificial Intelligence: What should we do?
[Figure: the three types of analytics plotted against Business Value]

Operations research
◼ Operations Research (O.R.) is the discipline of applying advanced analytical methods to help make better decisions
◼ Analytical techniques:
❑ Simulation – giving you the ability to try out approaches and test ideas for improvement
❑ Optimization – narrowing your choices to the very best when there are virtually innumerable feasible options and comparing them is difficult
❑ Probability and Statistics – helping you measure risk, mine data to find valuable connections and insights, test conclusions, and make reliable forecasts
❑ Mathematical Modeling – algorithms and software

Our planet is a complex, dynamic, highly interconnected $54 Trillion system-of-systems (OECD-based analysis)
This chart shows ‘systems‘ (not ‘industries‘)
[Figure: bubble chart of the global system-of-systems, $54 Trillion (100% of WW 2008 GDP). Systems shown include Communication, Transportation, Leisure / Recreation / Clothing, Electricity, Infrastructure, Healthcare, and Govt. & Safety.]
1. Size of bubbles represents systems’ economic values
2. Arrows represent the strength of systems’ interaction
Legend for system inputs: Same Industry, Business Support, IT Systems, Energy Resources, Machinery, Materials
Source: IBV analysis based on OECD

Economists estimate that all systems carry inefficiencies of up to $15 Tn, of which $4 Tn could be eliminated
This chart shows ‘systems‘ (not ‘industries‘)
Analysis of inefficiencies in the planet‘s system-of-systems
◼ Global economic value of system-of-systems: $54 Trillion (100% of WW 2008 GDP)
◼ Inefficiencies: $15 Trillion (28% of WW 2008 GDP)
◼ Improvement potential: $4 Trillion (7% of WW 2008 GDP)
[Figure: bubble chart of systems (Healthcare, Building & Transport Infrastructure, Education, Electricity, Food & Water, Communication, Government & Safety, Financial, Transportation (Goods & Passenger), Leisure / Recreation / Clothing). Axes: system inefficiency as % of total economic value; improvement potential as % of system inefficiency.]
How to read the chart:
For example, the Healthcare system‘s value is $4,270B. It carries an estimated inefficiency of 42%. Of that inefficiency, economists estimate that ~34% can be eliminated, so the eliminable share of the system‘s value is 34% x 42%.
Note: The size of each bubble indicates the absolute value of the system in USD billions
Source: IBM economists survey 2009; n = 480
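The Healthcare reading of the chart can be checked with a few lines of arithmetic. This is only a sketch using the three figures quoted on the slide; the variable names are illustrative and not part of the course materials.

```python
# Worked arithmetic for the Healthcare example on the slide.
system_value_bn = 4270       # system value in USD billions (from the slide)
inefficiency_share = 0.42    # inefficiency as a share of system value
improvement_share = 0.34     # share of the inefficiency that could be eliminated

# Inefficiency in USD billions.
inefficiency_bn = system_value_bn * inefficiency_share

# Eliminable share of total system value: 34% x 42%.
eliminable_share_of_value = improvement_share * inefficiency_share

# Eliminable amount in USD billions.
eliminable_bn = system_value_bn * eliminable_share_of_value

print(f"Inefficiency: ${inefficiency_bn:,.1f}B")
print(f"Eliminable share of system value: {eliminable_share_of_value:.2%}")
print(f"Eliminable amount: ${eliminable_bn:,.1f}B")
```

Under these figures, roughly 14% of the Healthcare system's value (about $610B) is estimated to be recoverable.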

History of analytics

Course Outline

Course summary
▪ Course title: Introduction to Data Science and Analytics
▪ Course summary: The objective of the course is to learn analytical models and survey quantitative algorithms for solving engineering and business problems. Data science, or analytics, is the process of deriving insights from data in order to make optimal decisions. It allows hundreds of companies and governments to save lives, increase profits and minimize resource usage. Considerable attention in the course is devoted to applications of computational and modeling algorithms to finance, risk management, marketing, health care, smart city projects, crime prevention, predictive maintenance, web and social media analytics, personal analytics, etc. We will show how various data science and analytics techniques such as basic statistics, regressions, uncertainty modeling, simulation and optimization modeling, data mining and machine learning, text analytics, artificial intelligence and visualizations can be implemented and applied using Python. Python, Tableau and Power BI are the modeling and visualization software used in this course. Practical aspects of computational models and case studies in Interactive Python (IPython) are emphasized.

Course outline
Introduction to data science and analytics
▪ Data science concepts
▪ Application areas of quantitative modeling
Python programming, data science software
▪ Introduction to Python
▪ Comparison of Python, R and Matlab usage in data science
Basic statistics
▪ Random variables, sampling
▪ Distributions and statistical measures
▪ Hypothesis testing
▪ Statistics case studies in IPython
Overview of linear algebra
▪ Linear algebra and matrix computations
▪ Functions, derivatives, convexity
Modeling techniques, regression
▪ Mathematical modeling process
▪ Linear regression
▪ Logistic regression
▪ Regression case studies in IPython
Data visualization and visual analytics
▪ Visual analytics
▪ Visualizations in Python
Simulation modeling
▪ Random number generation
▪ Monte Carlo simulations
▪ Simulation case studies in IPython
Optimization
▪ Unconstrained non-linear optimization algorithms
▪ Overview of constrained optimization algorithms
▪ Optimization case studies in IPython
Advanced machine learning
▪ Decision trees
▪ Advanced supervised machine learning algorithms (Naive Bayes, k-NN, SVM)
▪ Intro to ensemble learning algorithms (Random Forests, Gradient Boosting)
▪ Intro to neural networks
▪ Text analytics and natural language processing
▪ Clustering (K-means, Fuzzy C-means, Hierarchical Clustering, DBSCAN)
▪ Dimensionality reduction
▪ Association rules
▪ Overview of reinforcement learning
▪ Machine learning case studies in IPython
Introduction to Deep Learning
▪ Mathematics of neural networks
▪ Introduction to Deep Learning
▪ Convolutional Neural Networks (CNN)

Assignments, projects and grading (tentative)
Assignment #1 – Solving an analytics problem in Python (12%)
▪ Individual assignment.
Assignment #2 – Solving an analytics problem in Python (16%)
▪ Individual assignment.
Assignment #3 – Solving an analytics problem in Python (16%)
▪ Individual assignment.
Final Exam Project (24%)
▪ Individual project.
▪ For the final exam project you may be responsible for analyzing, computing and writing up a solution to a practical data science problem in Python. Each project must be completed individually.
Course Project – Personalized learning and course curriculum design via machine learning and data analytics in Python (20%)
▪ Group project (groups of 7 students), the same groups as for In-Class Presentations.
In-Class Group Presentation (12%)
▪ Group presentations of up to 10-12 minutes are required to cover topics related to additional course materials and the course project.
▪ Presentations need to be recorded and uploaded to Quercus. Presentations will be played during lectures and followed up by online Q&A.
▪ All assignments, projects and presentations need to be completed remotely. Presence on the UofT campus is not required for this course. You are encouraged to use online collaboration tools for group project and group presentation preparation.

Course materials and readings
❑ Course slides by O. Romanko and D. Rosu, 2022, Quercus
❑ Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by W. McKinney, 2017 https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/
❑ Getting Started with Data Science: Making Sense of Data with Analytics by M. Haider, 2015

❑ Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Instagram, GitHub, and More by M. Russell and M. Klassen, 2019 https://www.amazon.ca/Mining-Social-Web-Facebook-Instagram/dp/1491985046/

 Recommended Literature

Literature


Sources of Data

Data sources
▪ Data files – demo in Python
✓ CSV (comma separated value) files
✓ Spreadsheet files, e.g., Excel or Google Spreadsheet
▪ Databases
✓ SQL
▪ Internet – demo in Python
✓ Web scraping
✓ APIs
▪ Big Data platforms and Cloud
✓ Hadoop
✓ Cloud (AWS, Google Cloud, Microsoft Azure, IBM Cloud)
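As a rough illustration of the first two source types (CSV files and SQL databases), the sketch below parses CSV text and queries it through an in-memory SQLite database using only the Python standard library. The sample data, table name and function names are hypothetical and are not taken from the in-class demo.

```python
import csv
import io
import sqlite3

# Hypothetical toy data; in practice this would be read from a file on disk.
CSV_TEXT = """item,quantity
widget,3
gadget,5
"""

def read_csv_rows(text):
    """Parse CSV text into a list of dicts, converting quantity to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({"item": row["item"], "quantity": int(row["quantity"])})
    return rows

def load_into_sqlite(rows):
    """Load the rows into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stock (item TEXT, quantity INTEGER)")
    conn.executemany("INSERT INTO stock VALUES (:item, :quantity)", rows)
    return conn

rows = read_csv_rows(CSV_TEXT)
conn = load_into_sqlite(rows)

# A simple SQL query: which item has the largest quantity?
largest = conn.execute(
    "SELECT item FROM stock ORDER BY quantity DESC LIMIT 1"
).fetchone()[0]
print(largest)
```

For larger files, a library such as pandas is the more common choice; the standard library is used here only to keep the sketch dependency-free.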

Use of data globally and in the financial sector
Multiple responses accepted

Use of camera phones at the Papal inauguration in 2005 and 2013

We can collect information from almost everything to make better decisions
◼ 30 billion RFID tags embedded into our world and across entire ecosystems
◼ 1 billion camera phones in existence, able to document accidents, damage, and crimes
◼ [percentage missing in source] of new automobiles will contain event data recorders collecting travel information
Instrumented, Interconnected, Intelligent

What is big data?
Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools.
Difficulties include capture, storage, search, sharing, analytics, and visualizing.
Source: Wikipedia

Big social data

Analytics Examples

Data reveals hidden city dynamics

Applications of big data analytics
Smarter Healthcare
Homeland Security
Manufacturing
Multi-channel sales
Log Analysis
Search Quality
Retail: Churn, NBO
Traffic Control
Trading Analytics
Fraud and Risk

Marketing analytics

Fitting room analytics
Source: Adme

IBM Debater

 Modeling

Questions that we can try to answer with models
▪ Statistics – exploratory analysis and hypothesis testing
✓ Decisions made from samples
✓ Hypothesis testing
▪ Machine learning – learning from examples
✓ Supervised learning (prediction, classification)
✓ Unsupervised learning (clustering, dimensionality reduction, associations)
▪ Artificial intelligence – advanced analytics
✓ Text analytics, social media analytics, NLP
✓ Spatio-temporal analytics
✓ Image and visual recognition (deep learning)
✓ Reinforcement learning and autonomous systems
▪ Modeling uncertainty – what would happen in the future?
✓ Monte Carlo simulations
▪ Optimizing decisions – what’s best?
✓ Optimization
▪ Finding connections – is FB related to Cambridge Analytica?
✓ Graph/network models
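To make the "modeling uncertainty" item concrete, here is a minimal Monte Carlo sketch. It estimates the probability that the sum of two fair dice is at least 10 by repeated random sampling and compares the estimate with the exact value 6/36. The dice problem and the function name are illustrative choices, not an example from the course.

```python
import random

def estimate_probability(n_trials, seed=0):
    """Estimate P(sum of two fair dice >= 10) by simulation."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    hits = 0
    for _ in range(n_trials):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total >= 10:
            hits += 1
    return hits / n_trials

estimate = estimate_probability(100_000)
exact = 6 / 36  # six favourable outcomes out of 36
print(f"Monte Carlo estimate: {estimate:.4f}, exact value: {exact:.4f}")
```

The same pattern (sample random inputs, evaluate the outcome, average over many trials) carries over to problems where no closed-form answer is available.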

Models and reality
[Figure: the “Real” World and the Analyst's World, linked by three steps: simplified abstraction of reality (capture essence of problem), calculations, and interpretation]
From Monahan, G., “Management Decision Making”, Cambridge University Press, 2000

Artificial Intelligence

Text analytics and sentiment analysis
Sentiment analysis of tweets

Natural Language Processing: features and target variable in sentiment analysis
features (words): bear, tea, love, bad, drink
target: sentiment
examples (news articles)      sentiment
All bears are lovely            56%
Our tea was bad                -35%
That bear drinks with bear      -5%
The bear drinks tea              4%
We love bears                   63%
Stop words that were removed:
❑ are, was

Natural Language Processing: ‘bag of words’ based on Word Frequency (WF) and sentiment analysis
features (WF): bear, tea, love, bad, drink; target: sentiment
examples (news articles)      bear  tea  love  bad  drink  sentiment
All bears are lovely             1    0     1    0      0        56%
Our tea was bad                  0    1     0    1      0       -35%
That bear drinks with bear       2    0     0    0      1        -5%
The bear drinks tea              1    1     0    0      1         4%
We love bears                    1    0     1    0      0        63%
Supervised machine learning algorithm:
❑ Linear regression
❑ Decision trees
❑ SVM regression
❑ k-NN regression
❑ Ensembles (random forests, XGBoost)
❑ Artificial neural nets (deep learning)
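The word-frequency features for the five example sentences can be reproduced with a short sketch. The vocabulary is the five feature words from the slide. The `NORMALIZE` mapping is a hypothetical hand-written stand-in for a real stemmer, and words outside the vocabulary (including the stop words) are simply ignored.

```python
import re
from collections import Counter

# Feature words (columns), as listed on the slide.
VOCABULARY = ["bear", "tea", "love", "bad", "drink"]

# Hypothetical hand-written normalization standing in for real stemming.
NORMALIZE = {"bears": "bear", "lovely": "love", "drinks": "drink"}

SENTENCES = [
    "All bears are lovely",
    "Our tea was bad",
    "That bear drinks with bear",
    "The bear drinks tea",
    "We love bears",
]

def bag_of_words(sentence):
    """Return word-frequency counts for the sentence over VOCABULARY."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    counts = Counter(NORMALIZE.get(token, token) for token in tokens)
    # Words not in VOCABULARY (e.g. stop words) do not appear in the output.
    return [counts[word] for word in VOCABULARY]

matrix = [bag_of_words(sentence) for sentence in SENTENCES]
for sentence, row in zip(SENTENCES, matrix):
    print(row, sentence)
```

Each row of `matrix` is the feature vector for one example; paired with the sentiment column, it forms the training set for any of the supervised algorithms listed on the slide.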

Natural Language Processing: word frequency (Word Cloud)
Word Cloud about Toyota

Neural networks and deep learning
▪ Based loosely on computer models of how brains work
▪ Model is an assembly of inter-connected neurons (nodes) and weighted links
▪ Each neuron applies a nonlinear function to its inputs to produce an output
▪ The output node sums its input values, each weighted by the weight of its link
▪ Used for classification, pattern recognition, speech recognition
▪ “Black box” model – no explanatory power, very hard to interpret the results
Training an ANN means learning the weights of the neurons
[Figure: feed-forward network with an input layer, a hidden layer and an output layer producing y]
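The bullets above can be sketched as a tiny forward pass: two inputs, two hidden neurons with a sigmoid nonlinearity, and one output node that takes a weighted sum. The weights, the network size and the choice of sigmoid are made up for illustration; in a trained network the weights would be learned from data.

```python
import math

def sigmoid(z):
    """A common nonlinear activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def forward(x, hidden_layer, output_neuron):
    """Forward pass: inputs -> hidden layer (sigmoid) -> linear output node."""
    hidden = [neuron(x, w, b, sigmoid) for w, b in hidden_layer]
    w_out, b_out = output_neuron
    return neuron(hidden, w_out, b_out, lambda z: z)

# Hypothetical weights, given as (weights, bias) per neuron.
hidden_layer = [([0.5, -0.5], 0.0), ([1.0, 1.0], -1.0)]
output_neuron = ([1.0, -1.0], 0.0)

y = forward([1.0, 1.0], hidden_layer, output_neuron)
print(y)
```

Training would adjust the numbers in `hidden_layer` and `output_neuron` so that `forward` reproduces the target values on the training examples.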

Spatio-temporal analytics – car theft hotspots
Source: Booz Allen Hamilton, Field Guide to Data Science

IBM Cloud – all services
http://cloud.ibm.com
Click “Catalog” at the top of the dashboard

IBM Cloud – Watson AI / ML services
http://cloud.ibm.com
Click “Catalog” at the top of the dashboard

IBM Cloud – Speech to Text service
Copy “API Key” to your Python code

Case study – Watson Speech-to-Text, Natural Language Understanding and Text-to-Speech prototype on IBM Cloud

 Crowd-Sourced Analytics

Bellingcat – open source investigations
The number of medals “For Distinction in Combat” awarded between 07.11.2014 and 18.02.2016 is 4300, which strongly suggests larger combat operations with active Russian military involvement in this period. In sum, the data suggest that more than 10000 medals of all four considered types were awarded in the considered period.
Source: Bellingcat, Russia’s War in Ukraine: The Medals and Treacherous Numbers

Online and In-Person Education

Coursera (coursera.org)

EdX (edx.org)

CognitiveClass MOOC (http://CognitiveClass.ai)
▪ Free courses, free study materials
▪ Cloud-based sandbox for exercises
▪ 2,000,000+ registered students
▪ 60+ courses

TED and TEDx

Register for free 6-month access with your email @mail.utoronto.ca
https://www.datacamp.com/groups/shared_links/9213435974cadaa4336fc2d5728cb94d5a0958d02a5b935ec2e56fa3544838a7

To Do before Lecture 2

Run IPython examples provided in class
◼ Install Python on your laptop
❑ Recommended to use Python version 3.X
❑ You may use your own Python distribution; the Anaconda distribution is recommended: https://www.anaconda.com/products/individual
◼ Use Python on cloud via Google Colab
❑ You can use Python on Google cloud via https://colab.research.google.com
◼ Use Python on cloud via IBM CognitiveClass.ai Virtual Lab (optional)
❑ Register for CognitiveClass.ai MOOC portal https://cognitiveclass.ai to access 60+ free data science courses and to use Python on the CC cloud
❑ You can use Python on CC cloud via https://labs.cognitiveclass.ai
◼ Get access to IBM Cloud (optional)
❑ Sign in to the IBM Academic Initiative and register for access to IBM Cloud, or get free lite access to IBM Cloud directly at https://cloud.ibm.com/registration
◼ Check class web-page on Quercus
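After installing Python locally or opening a cloud notebook, a quick way to confirm that the interpreter is Python 3.X is a check along these lines. The helper name `is_python3` is illustrative only.

```python
import platform
import sys

def is_python3(version_info=sys.version_info):
    """Return True if the major version of the interpreter is 3."""
    return version_info[0] == 3

print("Python version:", platform.python_version())
print("Python 3.X:", is_python3())
```

The same cell can be run in Anaconda, Google Colab or the CognitiveClass.ai Virtual Lab to confirm which interpreter each environment provides.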
