7CCSMBDT – Big Data Technologies Week 1
Grigorios Loukides, PhD (grigorios.loukides@kcl.ac.uk)
Spring 2017/2018
1
Today
Logistics
Brief overview of the module
Introduction to the topic: Big Data Technologies
Data analytics & Big data
2
Logistics
Practicals (from Week 2 onwards):
Monday, 9-11, Bush House (S) 7.01/2/3
Monday, 11-1, Bush House (S) 6.02
Lecture:
Tuesday, 9-11, Waterloo FWB 1.70
Tutorial (from Week 2 onwards): Tuesday, 11-12, Waterloo FWB 1.70
3
Logistics
TAs
Practicals
Retha, Ahmad ahmad.retha@kcl.ac.uk
Amiri Souri, Elmira elmira.amiri@kcl.ac.uk
Tutorials: TBC
Attendance: You are expected to attend every lecture &
practical; attending tutorials is optional but strongly recommended
Resources: on KEATS
4
Logistics
Assessment
Courseworks: 20% (programming and open-ended questions)
(1st out on Friday, Week 4, due in 2 weeks, feedback 4 weeks after submission)
(2nd out on Friday, Week 8, due in 2 weeks, feedback 4 weeks after submission)
Exam: 80%
Textbooks
Big Data Fundamentals: Concepts, Drivers, and Techniques
T. Erl, W. Khattak, P. Buhler, Prentice Hall, ISBN-10: 0134291077
Big Data Science & Analytics: A Hands-On Approach
A. Bahga, V. Madisetti, VPT, ISBN-10: 0996025537
Mining of Massive Datasets (2nd edition)
J. Leskovec, A. Rajaraman, J. Ullman, Cambridge University Press, ISBN-10: 1107077230, http://www.mmds.org/
5
Brief overview of the module
Week | Topic | Main techniques/technologies
1 | Intro to Big Data Technologies | Overview of Big Data analytics
2 | Data collection | Sqoop, Flume
3 | Big Data software stack | HDFS, MapReduce
4 | Big Data processing | MapReduce (patterns)
5 | NoSQL databases | MongoDB
6 | Big Data processing | Spark
7 | Big Data warehousing | Hive, SparkSQL
8 | Data streams | Querying, sampling
9 | Data streams | Filtering, counting, aggregate statistic estimation
10 | Data streams | Spark streaming
11 | Revision |
6
Brief overview of the module
Goal: Learn breadth of knowledge
theoretical (architecture, operation, analysis)
practical (use of technologies)
Programming
Python [basic knowledge assumed]
R (only in 1st lab – increasingly popular, easy-to-use tool) [no knowledge assumed]
All the other technologies [no knowledge assumed, except basic SQL]
Computation environment
Cloudera QuickStart VM https://www.cloudera.com/downloads/quickstart_vms/5-13.html
[will be accessible from labs, can be installed on your personal computer]
You are expected to learn quickly and use many new things
7
Suggested reading for today’s lecture
Textbooks
Big Data Fundamentals: Concepts, Drivers, and Techniques
T. Erl, W. Khattak, P. Buhler, Prentice Hall, ISBN-10: 0134291077, CHAPTER 1
Big Data Science & Analytics: A Hands-On Approach
A. Bahga, V. Madisetti, VPT, ISBN-10: 0996025537, CHAPTER 1, sections 1.1-1.6
Report
https://www.nap.edu/read/18374/chapter/12
8
Introduction: Big Data Technologies
http://www.mmds.org
9
Introduction: Big Data Technologies
More statistics
Per day
2.5 quintillion (10^18) bytes of data
Per minute
Facebook users share ~4.1M pieces of content
Instagram users like ~1.7M photos
YouTube users upload 300h of new video content
Amazon receives 4.3K new visitors
Netflix subscribers stream 77K hours of video
10
Introduction: Big Data Technologies
Data based on who creates them
Human-generated
human interaction with systems
e.g., social media content, emails, messages, documents
Machine-generated
software systems (e.g., DBMS), hardware devices (e.g., sensors)
e.g., DBMS logs, sensor readings, network traces
11
Introduction: Big Data Technologies
12
Introduction: Big Data Technologies
Data based on their format
Structured
Conforms to a data model or schema
Relational dataset/table
Can be managed using a DBMS
e.g., Banking transactions, electronic health records
Unstructured
Does not conform to a model or schema
Textual or binary data
Stored as binary large objects (BLOBs) in a DBMS or in NoSQL databases
e.g., tweets, video files
13
Introduction: Big Data Technologies
Data based on their format
Semi-structured
Non-relational data with a defined level of structure or consistency
Hierarchical or graph-based
e.g., spreadsheets, XML data, sensor data, JSON data
JSON: open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. Used in MongoDB
Android sensor framework AndroSensor
https://play.google.com/store/apps/details?id=com.fivasim.androsensor&hl=en_GB
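A minimal Python sketch of how JSON encodes data objects as attribute–value pairs (the sensor reading below is hypothetical, not taken from AndroSensor):

```python
import json

# A hypothetical sensor reading as JSON text (attribute-value pairs)
reading = '{"sensor": "temperature", "value": 21.5, "unit": "C"}'

record = json.loads(reading)                 # parse the text into a Python dict
record["timestamp"] = "2018-01-15T09:00:00"  # add another attribute
out = json.dumps(record, sort_keys=True)     # serialize back to JSON text

print(record["value"])   # 21.5
```

The same human-readable attribute–value structure is what MongoDB stores (in a binary variant, BSON).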
14
Introduction: Big Data Technologies
Data based on their format
Metadata
Provides information about a dataset’s characteristics and structure
Data provenance
e.g.,
XML tags for the author and creation date of a document
Attributes providing file size and resolution of a digital document
File metadata in Linux
size, permissions, creation date, access date, inode number, uid/gid, file type
"ls -la" and "stat" commands
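The same metadata can be read programmatically; a sketch in Python using os.stat (field names come from the POSIX stat structure; the temporary file is created only for illustration):

```python
import os
import stat
import tempfile

# Create a small file and inspect its metadata
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

info = os.stat(path)
print(info.st_size)                     # size in bytes (5)
print(info.st_ino)                      # inode number
print(info.st_uid, info.st_gid)         # owner's uid/gid
print(oct(stat.S_IMODE(info.st_mode)))  # permission bits
print(stat.S_ISREG(info.st_mode))       # file type: regular file -> True

os.unlink(path)
```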
15
Introduction: Big Data Technologies
Big Data characteristics
Volume
• Data quantity is substantial and ever-growing
• Different storage and processing requirements
Velocity
• Data speed may be high
• Elastic and available time for processing
Variety
• Multiple formats and types
• Different integration, transformation, processing and storage requirements
Veracity
• Bias, noise, abnormalities
• Different requirements for removing noise and resolving invalid data
Value
• Data usefulness (utility)
• Depends on veracity, time of processing, storage & analysis decisions (how?)
16
Introduction: Big Data Technologies
Big Data characteristics in the online gaming industry
Volume
• Data describing players’ interactions and transactions
Velocity
• Fast streams of data activities (game moves)
Variety
• Demographics, player behavior, relationships between players
Veracity
• Noisy user messages (texts), errors, fake accounts
Value
• Action-based analytics (e.g., less bias than user surveys)
• Analysis to increase the number of players, improve players’ experience, maximize revenue
17
Introduction: Big Data Technologies
Big Data characteristics in [Your own example] from healthcare, transportation, finance, marketing, web, …
Volume, Velocity, Variety, Veracity, Value
18
Introduction: Big Data Technologies
https://sranka.files.wordpress.com/2014/01/bigdata.jpg
19
Introduction: Big Data Technologies
http://elearningroadtrip.typepad.com/.a/6a010535fe1bfd970b01901b935600970b-pi
20
Data analytics & Big Data
What is Analytics?
Processes, technologies, frameworks
What is the goal of analytics?
To infer knowledge that makes systems smarter & more efficient
What do analytics involve?
Data filtering
Processing
Categorization
Condensation
Contextualization
https://www.nap.edu/read/18374/chapter/12
21
Data analytics & Big Data
Analytic types (based on their goal)
Descriptive
Diagnostic
Predictive
Prescriptive
Computational analytic tasks (the 7 giants)
Basic statistics
Generalized N-body problems
Linear algebraic computations
Graph-theoretic computations
Optimization
Integration
Alignment problems
• help design technologies for Big Data
22
Data analytics & Big Data
The goal of each type of analytics
Descriptive: aims to answer what has happened, based on past data that are presented in a summarized form
Diagnostic: aims to answer why it has happened, based on past data
Predictive: aims to answer what is likely to happen, based on existing data
Prescriptive: aims to answer what we can do to make something happen, based on existing data
23
Data analytics & Big Data
Comp. analytic tasks for descriptive analytics
1. Basic statistics: the summarized form of the data is the statistic
Mean, Median, Variance, Count, Top-N, Distinct
Example
Problem: Given a stream D = x_1, …, x_N, find the number F_0 of distinct items of D.
Challenge: We cannot compute F_0 accurately by examining only a part of the stream
Specifics of the problem (not part of the module)
[Char] http://dl.acm.org/citation.cfm?doid=335168.335230
24
e.g., for N = 10000, r = 10%·N, γ = 0.5: max(F̂_0/F_0, F_0/F̂_0) ≥ 1.76
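The difficulty can be illustrated with a small Python simulation (synthetic data; the naive estimator below is simply the sample's distinct count, not the estimator analyzed in [Char]):

```python
import random

random.seed(7)
N = 10_000
# Stream drawn from 3,000 possible item values
stream = [random.randrange(3_000) for _ in range(N)]

F0 = len(set(stream))            # true number of distinct items

# Examine only r = 10% of the stream
r = N // 10
F0_hat = len(set(stream[:r]))    # naive estimate from the sample

ratio = max(F0 / F0_hat, F0_hat / F0)
print(F0, F0_hat, round(ratio, 2))   # the ratio error is far from 1
```

The sample misses every item that appears only in the unexamined 90% of the stream, so the estimate is badly off.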
Data analytics & Big Data
Comp. analytic tasks for descriptive analytics
2. Linear algebraic computations: the summarized form is a model describing the data or a smaller dataset built from the data
Linear Regression: builds a model between a dependent variable and independent variables
Principal Component Analysis (PCA)*: builds a summary of the data that has fewer dimensions (e.g., a table with 100 columns is summarized into a new table with 10 “special” columns)
Singular Value Decomposition (SVD)**: builds a summary of data that is in the form of a matrix (low-rank approximation)
(details not part of the course: see Mining of Massive Datasets book * 11.2, ** 11.3)
25
Data analytics & Big Data
Comp. analytic tasks for descriptive analytics
Example
Problem: Approximate matrix M with a matrix M̂ that has r linearly independent columns, such that the error of approximation is minimum.
Challenge: Storing matrix M may be impossible due to its large size
26
Data analytics & Big Data
Example using R (not part of the module): approximate matrix M with r = 1 (a one-column, rank-1 matrix M̂) such that the error ‖M − M̂‖_F = sqrt(Σ_{i,j} (m_ij − m̂_ij)^2) is minimum

#Generate a 9x6 matrix
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }
M <- hilbert(9)[, 1:6]
#Run svd() to decompose M
hatM <- svd(M)
#Keep only the first singular value of M
d1 <- c(hatM$d[1], 0, 0, 0, 0, 0)
#Plot the elements of M and of the rank-1 approximation
Mv <- as.vector(M)
hatMv <- as.vector(hatM$u %*% diag(d1) %*% t(hatM$v))
matplot(cbind(Mv, hatMv), pch=c(1,2), col=c(2,4), ylab="")
legend("topright", legend=c("M","hatM"), pch=c(1,2), col=c(2,4))
27
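The same rank-1 approximation can be sketched in Python with NumPy (assuming NumPy is available; the matrix is the same 9×6 Hilbert-style matrix as in the R example):

```python
import numpy as np

# 9x6 Hilbert-style matrix: M[a, b] = 1 / (a + b - 1), as in the R example
i = np.arange(1, 10)[:, None]
j = np.arange(1, 7)[None, :]
M = 1.0 / (i + j - 1)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the first (largest) singular value -> best rank-1 approximation
hatM = s[0] * np.outer(U[:, 0], Vt[0, :])

err = np.linalg.norm(M - hatM, "fro")   # Frobenius-norm error
print(round(err, 4))
```

By the Eckart–Young theorem, no other rank-1 matrix achieves a smaller Frobenius-norm error.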
Data analytics & Big Data
Comp. analytic tasks for diagnostic analytics
Linear algebraic computations
Example: Investigate why a failure happened by PCA followed by clustering (i.e., make a data summary, group the data in it and see if they explain the failure).
Challenge: Large matrices with “difficult” properties
Generalized N-body problems: These are problems involving similarities.
The similarity may be expressed as a distance or kernel function, which is computed between a set of points.
Examples: clustering (constructs groups of data), classification (to be discussed later)
Challenge: high-dimensionality
Age | PostCode
18 | E1
19 | E2
20 | E3
50 | NW1
51 | NW2
52 | NW3
[Figure: clustering groups the rows into two clusters, one around (19, E) and one around (51, NW)]
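The grouping of the ages above can be sketched with a tiny hand-rolled k-means in Python (a sketch of the assignment/update iteration, not a library call):

```python
# Ages from the table; k-means with k=2 groups them into two clusters
ages = [18, 19, 20, 50, 51, 52]

def kmeans_1d(points, centroids, iters=10):
    """Plain k-means on 1-D points with exactly two centroids."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[], []]
        for p in points:
            nearest = min(range(2), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) for c in clusters]
    return centroids, clusters

centroids, clusters = kmeans_1d(ages, [18, 52])
print(centroids)   # [19.0, 51.0]
print(clusters)    # [[18, 19, 20], [50, 51, 52]]
```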
28
Data analytics & Big Data
Comp. analytic tasks for diagnostic analytics
Graph-theoretic computations: These are problems involving a graph*.
Examples: Search, centrality, shortest paths, minimum spanning tree
Challenge: high interconnectivity (complex relationships)
* A set of objects where pairs of objects are related (a network)
Specifics of the problem (not part of the module)
http://matteo.rionda.to/centrtutorial/
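For instance, shortest paths in an unweighted graph can be sketched with breadth-first search in Python (the five-node graph below is a toy example):

```python
from collections import deque

# A small undirected network as an adjacency list
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path_len(g, src, dst):
    """Breadth-first search: number of edges on a shortest src-dst path."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # unreachable

print(shortest_path_len(graph, "A", "E"))   # 3
```

The challenge on big graphs is exactly the high interconnectivity noted above: the frontier explored by BFS can grow very quickly.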
29
Data analytics & Big Data
Comp. analytic tasks for predictive analytics
Linear algebraic computations
Generalized N-body problems
Graph-theoretic computations
Integration
Compute high-dimensional integrals of functions
Alignment problems
Matching (entity resolution, synonyms in text, images referring to the same entity)
30
Data analytics & Big Data
Algorithms for predictive analytics
They build a predictive model from existing data
Based on the model, they predict:
The occurrence of an event
e.g., should car insurance be offered to a customer?
The value of a variable
e.g., what insurance premium to offer to a customer?
Examples of algorithms:
• Classification algorithm: predicts the occurrence of an event
• Regression algorithm: predicts the value of a variable
31
Data analytics & Big Data
The process of classification
Training data:
Name | Age | Income | Car | Insure
Anne | 22 | 11K | BMW | N
Bob | 33 | 20K | Nissan | Y
Chris | 44 | 30K | Nissan | Y
Dan | 55 | 50K | BMW | Y
... | ... | ... | ... | ...
A classification algorithm learns classification rules from the training data:
IF Age<30, Income<15K, Car=BMW THEN Insure=N
IF Age>30, Income>15K, Car=Nissan THEN Insure=Y
IF Age>30, Income>40K THEN Insure=Y
Test data:
Emily | 22 | 14K | BMW | ?
The rules predict Insure=N
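The rules can be applied directly in Python (a hand-coded rule list for illustration, not the output of a real learner; income is in K):

```python
# The classification rules from the slide, applied to a record
def classify(age, income, car):
    if age < 30 and income < 15 and car == "BMW":
        return "N"
    if age > 30 and income > 15 and car == "Nissan":
        return "Y"
    if age > 30 and income > 40:
        return "Y"
    return None   # no rule fires

# Test data: Emily, 22, 14K, BMW
print(classify(22, 14, "BMW"))   # N
```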
32
Data analytics & Big Data
The process of entity resolution
Do they refer to the same person?
How likely is it that they refer to the same person?
Is Amazon reviewer X and Facebook user Y the same person?
https://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
33
Data analytics & Big Data
The process of entity resolution
Training: entity pairs are annotated w.r.t. whether they represent a match or a non-match, and a classifier is trained
Application: the classifier decides whether two entities match
http://dbs.uni-leipzig.de/file/learning_based_er_with_mr.pdf
34
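A minimal similarity feature for such a classifier can be sketched in Python (Jaccard similarity on word sets; a toy feature and toy records, not the pipeline from the paper above):

```python
def jaccard(a, b):
    """Jaccard similarity of two strings, viewed as sets of lowercase words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Two records that may refer to the same person, and one that does not
x = "John A. Smith"
y = "john smith"
z = "Mary Jones"

print(round(jaccard(x, y), 2))   # 0.67 -> candidate match
print(round(jaccard(x, z), 2))   # 0.0  -> non-match
```

In a learning-based pipeline, several such similarity scores become the features on which the match/non-match classifier is trained.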
Data analytics & Big Data
Comp. analytic tasks for prescriptive analytics
Generalized N-body problems
Graph-theoretic computations
Alignment problems
Optimization
Maximization/minimization problems solved by linear programming, integer programming
Prescriptive analytics algorithms use multiple predictive models, built on different data, to predict various outcomes and the best course of action for each outcome
Example: What medication to prescribe to a patient for treating a disease, based on the prescriptions of medicines given to others and their outcomes?
What are the differences between predictive and prescriptive analytics?
35
Data analytics & Big Data
The use of optimization in prescriptive analytics
• We want to select a good model and its parameters, among several possibilities
• Optimization in model training: a core optimization problem is solved, optimizing the variables or parameters of the model w.r.t. a selected objective function (e.g., minimize loss).
• Optimization in model selection and validation: the core optimization problem may be solved many times.
36
Data analytics & Big Data
Example of optimization: linear programming (details not part of the course)
maximize c1 x1 + … + cn xn
subject to
a11 x1 + … + a1n xn ≤ b1
…
am1 x1 + … + amn xn ≤ bm
and x1 ≥ 0, …, xn ≥ 0
Design a webpage that brings maximum profit
• An image occupies 5% of the webpage and brings profit $0.1
• A link occupies 1% of the webpage and brings profit $0.01
• A video occupies 15% of the webpage and brings profit $0.5
• Use at most 10 images, 25 links, and 2 videos
37
Data analytics & Big Data
Prescriptive analytics: linear programming in R
Design a webpage that brings maximum profit
• An image occupies 5% of the webpage and brings profit $0.1
• A link occupies 1% of the webpage and brings profit $0.01
• A video occupies 15% of the webpage and brings profit $0.5
• Use at most 10 images, 25 links, and 2 videos
Optimization function:
maximize 0.1x1 + 0.01x2 + 0.5x3
Optimization constraints:
subject to 0.05x1 + 0.01x2 + 0.15x3 ≤ 1
x1 ≤ 10, x2 ≤ 25, x3 ≤ 2
and x1 ≥ 0, x2 ≥ 0, x3 ≥ 0
Solution: x1 = 10, x2 = 20, x3 = 2
38
Data analytics & Big Data
Prescriptive analytics: linear programming in R
maximize 0.1x1 + 0.01x2 + 0.5x3
subject to 0.05x1 + 0.01x2 + 0.15x3 ≤ 1
x1 ≤ 10
x2 ≤ 25
x3 ≤ 2
and x1 ≥ 0, x2 ≥ 0, x3 ≥ 0
#install the package for linear programming
install.packages("lpSolveAPI")
library("lpSolveAPI")
#set up the problem: 4 constraints, 3 variables
lprec <- make.lp(4, 3)
lp.control(lprec, sense="max")
set.type(lprec, c(1,2,3), "integer")
#objective function and constraints
set.objfn(lprec, c(0.1, 0.01, 0.5))
add.constraint(lprec, c(0.05,0.01,0.15), "<=", 1)
add.constraint(lprec, c(1,0,0), "<=", 10)
add.constraint(lprec, c(0,1,0), "<=", 25)
add.constraint(lprec, c(0,0,1), "<=", 2)
#solve the problem
solve(lprec)
#get value of objective function
get.objective(lprec)
#get values of variables (solution)
get.variables(lprec)
Solution: x1 = 10, x2 = 20, x3 = 2
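The same solution can be cross-checked in Python with a brute-force search over the (small) integer space (a sketch; the page-space constraint is scaled to integer percent units to avoid floating-point round-off):

```python
from itertools import product

# Page-design problem: maximize 0.1*x1 + 0.01*x2 + 0.5*x3
# subject to 5*x1 + 1*x2 + 15*x3 <= 100 (percent of the page),
# with at most 10 images (x1), 25 links (x2), and 2 videos (x3)
def profit(x1, x2, x3):
    return 0.1 * x1 + 0.01 * x2 + 0.5 * x3

best = max(
    (x for x in product(range(11), range(26), range(3))
     if 5 * x[0] + 1 * x[1] + 15 * x[2] <= 100),
    key=lambda x: profit(*x),
)
print(best, round(profit(*best), 2))   # (10, 20, 2) 2.2
```

Brute force only works here because the feasible region has a few hundred points; the LP/ILP solver scales to problems where enumeration is impossible.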
39
Data analytics & Big Data
Analytics overview (again)
Analytic types: Descriptive, Diagnostic, Predictive, Prescriptive
Computational tasks (the 7 giants): Basic statistics, Generalized N-body problems, Linear algebraic computations, Graph-theoretic computations, Optimization, Integration, Alignment problems
40
Data analytics & Big Data
Settings in which analytics are applied:
Default: the dataset is stored in RAM
Streaming: data arrive as a stream, a part (window) is stored
Distributed: data are distributed over multiple machines (in RAM and/or disk)
Multi-threaded: data are in one machine and multiple processors share the RAM of the machine
41
Data analytics & Big Data
Analytics flow for big data (details in next lectures)
Data collection
The data are collected and ingested into a big data stack
Data Preparation
Issues that prevent meaningful processing are resolved
Analysis types
The type of analysis is determined
Analysis modes
The mode of analysis is determined
Visualizations
The analysis results are presented to the user
42
Data analytics & Big Data
Analytics flow for big data - Example
Data collection
Collect and store data from temperature sensors in a big data store
Data Preparation
Remove erroneous values (due to sensor faults)
Analysis types
Pattern mining to find: “increase by 10 degrees in 5 minutes”
Analysis modes
Pattern mining for data that arrived in the last hour
Visualizations
Plot the pattern to see the extent of the increase
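The pattern-mining step above can be sketched in Python (synthetic readings, one per minute; the pattern is the slide's "increase by 10 degrees in 5 minutes"):

```python
# Synthetic temperature readings, one per minute
temps = [20, 20, 21, 21, 22, 24, 27, 31, 33, 33]

def find_spike(readings, degrees=10, window=5):
    """Return the first index where temperature rose by >= `degrees`
    within `window` minutes, or None if the pattern never occurs."""
    for t in range(len(readings) - window):
        if readings[t + window] - readings[t] >= degrees:
            return t
    return None

print(find_spike(temps))   # 2  (21 -> 31 between minutes 2 and 7)
```

In the streaming mode of analysis, the same check would run continuously over a sliding window of the last hour's readings.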
43
Summary
Introduction to the topic: Big Data Technologies
Data analytics & Big data
Analytic tasks – the 7 giants
Analytics flow
44