
7CCSMBDT – Big Data Technologies Week 1
Grigorios Loukides, PhD (grigorios.loukides@kcl.ac.uk)
Spring 2017/2018
1

Today
 Logistics
 Brief overview of the module
 Introduction to the topic: Big Data Technologies
 Data analytics & Big data
2

Logistics
 Practicals (from Week 2 onwards):
 Monday, 9-11, Bush House (S) 7.01/2/3
 Monday, 11-1, Bush House (S) 6.02
 Lecture:
 Tuesday, 9-11, Waterloo FWB 1.70
 Tutorial (from Week 2 onwards):
 Tuesday, 11-12, Waterloo FWB 1.70
3

Logistics
 TAs
 Practicals
 Retha, Ahmad ahmad.retha@kcl.ac.uk
 Amiri Souri, Elmira elmira.amiri@kcl.ac.uk
 Tutorials: TBC
 Attendance: You are expected to attend every lecture &
practical; attending tutorials is optional but strongly recommended
 Resources: on KEATS
4

Logistics
 Assessment
 Courseworks: 20% (programming and open-ended questions)
 (1st out on Friday, Week 4, due in 2 weeks, feedback 4 weeks after submission)
 (2nd out on Friday, Week 8, due in 2 weeks, feedback 4 weeks after submission)
 Exam: 80%
 Textbooks
 Big Data Fundamentals: Concepts, Drivers, and Techniques
 T. Erl, W. Khattak, P. Buhler, Prentice Hall, ISBN-10: 0134291077
 Big Data Science & Analytics: a hands-on approach
 A. Bahga, V. Madisetti, VPT, ISBN-10: 0996025537
 Mining of massive datasets (2nd edition)
 J. Leskovec, A. Rajaraman, J. Ullman, Cambridge University Press, ISBN-10: 1107077230 http://www.mmds.org/
5

Brief overview of the module
Week | Topic                          | Main techniques/technologies
1    | Intro to Big Data Technologies | Overview of Big Data analytics
2    | Data collection                | Sqoop, Flume
3    | Big Data software stack        | HDFS, MapReduce
4    | Big Data processing            | MapReduce (patterns)
5    | NoSQL databases                | MongoDB
6    | Big Data processing            | Spark
7    | Big Data warehousing           | Hive, SparkSQL
8    | Data streams                   | Querying, sampling
9    | Data streams                   | Filtering, counting, aggregate statistic estimation
10   | Data streams                   | Spark streaming
11   | Revision                       |
6

Brief overview of the module
 Goal: Learn breadth of knowledge
 theoretical (architecture, operation, analysis)
 practical (use of technologies)
 Programming
 Python [basic knowledge assumed]
 R (only in 1st lab – increasingly popular, easy-to-use tool) [no knowledge assumed]
 All the other technologies [no knowledge assumed, except basic SQL]
 Computation environment
 Cloudera QuickStart VM https://www.cloudera.com/downloads/quickstart_vms/5-13.html
[will be accessible from the labs, can be installed on your personal computer]
You are expected to learn quickly and use many new things
7

Suggested reading for today’s lecture
 Textbooks
 Big Data Fundamentals: Concepts, Drivers, and Techniques
 T. Erl, W. Khattak, P. Buhler, Prentice Hall, ISBN-10: 0134291077
 CHAPTER 1
 Big Data Science & Analytics: a hands-on approach
 A. Bahga, V. Madisetti, VPT, ISBN-10: 0996025537
 CHAPTER 1, sections 1.1,…,1.6
 Report
 https://www.nap.edu/read/18374/chapter/12
8

Introduction: Big Data Technologies
http://www.mmds.org
9

Introduction: Big Data Technologies
 More statistics
 Per day
 2.5 quintillion (10^18) bytes of data are generated
 Per minute
 Facebook users share ~4.1M pieces of content
 Instagram users like ~1.7M photos
 Youtube users upload 300h of new video content
 Amazon receives 4.3K new visitors
 Netflix subscribers stream 77K hours of video
10

Introduction: Big Data Technologies
 Data based on who creates them
 Human-generated
 Human interaction with systems
 e.g., social media content, emails, messages, documents
 Machine-generated
 Software systems (e.g., DBMS), hardware devices (e.g., sensors)
 e.g., DBMS logs, sensor readings, network traces
11

Introduction: Big Data Technologies
12

Introduction: Big Data Technologies
 Data based on their format
 Structured
 Conforms to a data model or schema
 Relational dataset/table
 Can be managed using a DBMS
 e.g., Banking transactions, electronic health records
 Unstructured
 Does not conform to a model or schema
 Textual or binary data
 Stored as binary large objects (BLOBs) in a DBMS or in NoSQL databases
 e.g., tweets, video files
13

Introduction: Big Data Technologies
 Data based on their format
 Semi-structured
 Non-relational data with a defined level of structure or consistency
 Hierarchical or graph-based
 e.g., spreadsheets, XML data, sensor data, JSON data
 JSON: open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. Used in MongoDB (see the short sketch below)
 Android sensor framework: AndroSensor
https://play.google.com/store/apps/details?id=com.fivasim.androsensor&hl=en_GB
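A minimal sketch of what such attribute–value data looks like and how it can be read programmatically. The jsonlite package, the field names and the sensor reading below are illustrative assumptions, not part of the module:

# install.packages("jsonlite") if needed
library(jsonlite)

# One made-up reading from a phone's sensors (field names are assumptions)
reading <- '{"device": "phone-01", "timestamp": "2018-01-15T09:00:00Z",
             "accelerometer": {"x": 0.02, "y": 9.81, "z": 0.11},
             "battery_pct": 87}'

r <- fromJSON(reading)     # parse the attribute-value pairs into an R list
r$accelerometer$y          # nested (hierarchical) structure, unlike a flat relational row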
14

Introduction: Big Data Technologies
 Data based on their format
 Metadata
 Provides information about a dataset’s characteristics and structure
 Data provenance
 e.g.,
 XML tags for the author and creation date of a document
 Attributes providing the file size and resolution of a digital image
 File metadata in Linux
 size, permissions, creation date, access date, inode number, uid/gid, file type
 “ls -la” and “stat” commands (a small R sketch follows below)
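A quick way to look at comparable file metadata from R; file.info() is base R, and the path below is only a placeholder:

# Inspect metadata of a file without opening its contents
# (replace the path with any file on your machine)
info <- file.info("/etc/hosts")
info[, c("size", "mode", "mtime", "atime")]
# On Linux, file.info() also returns ownership columns (uid, gid, uname, grname)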
15

Introduction: Big Data Technologies
 Big Data characteristics
Volume
• Data quantity is substantial and ever-growing
• Different storage and processing requirements
Velocity
• Data speed may be high
• Elastic and available time for processing
Variety
• Multiple formats and types
• Different integration, transformation, processing and storage requirements
Veracity
• Bias, noise, abnormalities
• Different requirements for removing noise and resolving invalid data
Value
• Data usefulness (utility)
• Depends on veracity, time of processing, storage & analysis decisions (how?)
16

Introduction: Big Data Technologies
 Big Data characteristics in the online gaming industry
Volume
• Data describing players’ interactions and transactions
Velocity
• Fast streams of data activities (game moves)
Variety
• Demographics, player behavior, relationships between players
Veracity
• Noisy user messages (texts), errors, fake accounts
• Action-based analytics (e.g., less bias than user surveys)
Value
• Analysis to increase the number of players, improve players’ experience, maximize revenue
17

Introduction: Big Data Technologies
 Big Data characteristics in [Your own example] from healthcare, transportation, finance, marketing, web, …
Volume · Velocity · Variety · Veracity · Value
18

Introduction: Big Data Technologies
https://sranka.files.wordpress.com/2014/01/bigdata.jpg 19

Introduction: Big Data Technologies
http://elearningroadtrip.typepad.com/.a/6a010535fe1bfd970b01901b935600970b-pi 20

Data analytics & Big Data
 What is Analytics?
 Processes, technologies, frameworks
 What is the goal of analytics?
 To infer knowledge that makes systems smarter & more efficient
 What do analytics involve?
 Data filtering
 Processing
 Categorization
 Condensation
 Contextualization
https://www.nap.edu/read/18374/chapter/12
21

Data analytics & Big Data
Analytic types (based on their goal):
• Descriptive
• Diagnostic
• Predictive
• Prescriptive
Computational analytic tasks (“the 7 giants”), which help design technologies for Big Data:
• Basic statistics
• Generalized N-body problems
• Linear algebraic computations
• Graph-theoretic computations
• Optimization
• Integration
• Alignment problems
22

Data analytics & Big Data
 The goal of each type of analytics
 Descriptive: aims to answer what has happened, based on past data that are presented in a summarized form
 Diagnostic: aims to answer why it has happened, based on past data
 Predictive: aims to answer what is likely to happen, based on existing data
 Prescriptive: aims to answer what we can do to make something happen, based on existing data
23

Data analytics & Big Data
 Comp. analytic tasks for descriptive analytics
1. Basic statistics: the summarized form of the data is the statistic
 Mean, Median, Variance, Count, Top-N, Distinct
Example
Problem: Given a stream D = x_1, …, x_N, find the number F_0 of distinct items of D.
Challenge: We cannot compute F_0 accurately by examining only a part of the stream;
e.g., for N = 10000, r = 10%·N, γ = 0.5: max(F̂_0/F_0, F_0/F̂_0) = 1.76 [Char]
(a hash-based sketch of estimating F_0 with limited memory follows below)
Specifics of the problem (not part of the module)
[Char] http://dl.acm.org/citation.cfm?doid=335168.335230
24
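A minimal, illustrative sketch (not the estimator from [Char]) of how F_0 can be approximated without storing every distinct item, using the Flajolet-Martin idea: hash each item, track the largest number of trailing zero bits seen, and estimate F_0 ≈ 2^r. The stream and the hash function below are assumptions for the example:

set.seed(1)
stream <- sample(1:5000, 10000, replace = TRUE)   # a stream whose F0 we want to estimate

# Simple integer hash, chosen only for illustration
hash_item <- function(x) (1103515245 * x + 12345) %% 2^31

# Number of trailing zero bits of an integer (capped at 31)
trailing_zeros <- function(h) {
  r <- 0
  while (h %% 2 == 0 && r < 31) { h <- h %/% 2; r <- r + 1 }
  r
}

max_r <- 0
for (x in stream) max_r <- max(max_r, trailing_zeros(hash_item(x)))

# A single hash gives a crude, high-variance estimate; practical systems
# (e.g., HyperLogLog) combine many hash functions to reduce the variance.
cat("Flajolet-Martin estimate of F0:", 2^max_r, "\n")
cat("Exact F0 (needs all items):    ", length(unique(stream)), "\n")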

Data analytics & Big Data
 Comp. analytic tasks for descriptive analytics
2. Linear algebraic computations: the summarized form is a model describing the data or a smaller dataset built from the data (short sketches follow below)
 Linear Regression: builds a model between a dependent variable and independent variables
 Principal Component Analysis (PCA)*: builds a summary of the data that has fewer dimensions (e.g., a table with 100 columns is summarized into a new table with 10 “special” columns)
 Singular Value Decomposition (SVD)**: builds a summary of data that is in the form of a matrix (low-rank approximation)
(details not part of the course: see the Mining of Massive Datasets book, * 11.2, ** 11.3) 25
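Two minimal sketches of the first two computations named above, on an assumed toy data frame (base R functions, shown only to make the kind of computation concrete):

set.seed(3)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 * d$x1 - d$x2 + rnorm(100, sd = 0.1)   # synthetic data (assumption)

lm(y ~ x1 + x2, data = d)       # linear regression: model of a dependent variable
prcomp(d, scale. = TRUE)        # PCA: re-expresses the table in fewer "special" columns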

Data analytics & Big Data
 Comp. analytic tasks for descriptive analytics
Example
Problem: Approximate a matrix M with a matrix M̂ that has r linearly independent columns, such that the error of the approximation is minimum.
Challenge: Storing the matrix M may be impossible due to its large size
26

Data analytics & Big Data
 Example using R (not part of the module)
Approximate the matrix M with r = 1 (a one-column approximation M̂) such that the error
‖M − M̂‖_F (the Frobenius norm of the difference, i.e., sqrt(Σ_{i,j} (m_ij − m̂_ij)²)) is minimum

#Generate a 9x6 matrix
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }
M <- hilbert(9)[, 1:6]
#Run svd() to obtain the factors used to build M̂
hatM <- svd(M)
#Keep only the first singular value (set the rest to 0) for a rank-1 approximation
d1 <- c(hatM$d[1], 0, 0, 0, 0, 0)
#Plot the elements of M and M̂
Mv <- as.vector(M)
hatMv <- as.vector(hatM$u %*% diag(d1) %*% t(hatM$v))
matplot(cbind(Mv, hatMv), pch = c(1, 2), col = c(2, 4), ylab = "")
legend("topright", legend = c("M", "hatM"), pch = c(1, 2), col = c(2, 4))
27

Data analytics & Big Data
 Comp. analytic tasks for diagnostic analytics
 Linear algebraic computations
 Example: Investigate why a failure happened by PCA followed by clustering (i.e., make a data summary, group the data in it and see if the groups explain the failure).
 Challenge: Large matrices with “difficult” properties
 Generalized N-body problems: these are problems involving similarities.
 The similarity may be expressed as a distance or kernel function, which is computed between a set of points.
 Examples: clustering (constructs groups of data), classification (to be discussed later)
 Challenge: high dimensionality
[Illustration: a table with columns Age and PostCode (rows 18/E1, 19/E2, 20/E3, 50/NW1, 51/NW2, 52/NW3) grouped into two clusters: ages around 19 with E postcodes, and ages around 51 with NW postcodes]
28

Data analytics & Big Data
 Comp. analytic tasks for diagnostic analytics
 Graph-theoretic computations: these are problems involving a graph*.
 Examples: search, centrality, shortest paths, minimum spanning tree
 Challenge: high interconnectivity (complex relationships)
* A set of objects where pairs of objects are related (a network)
Specifics of the problem (not part of the module) http://matteo.rionda.to/centrtutorial/
29

Data analytics & Big Data
 Comp. analytic tasks for predictive analytics
 Linear algebraic computations
 Generalized N-body problems
 Graph-theoretic computations
 Integration
 Compute high-dimensional integrals of functions
 Alignment problems
 Matching (entity resolution, synonyms in text, images referring to the same entity)
30

Data analytics & Big Data
 Algorithms for predictive analytics
 They build a predictive model from existing data
 Based on the model, they predict
 The occurrence of an event
 e.g., should car insurance be offered to a customer?
 The value of a variable
 e.g., what insurance premium to offer to a customer?
Examples of algorithms:
• Classification algorithm: predicts the occurrence of an event
• Regression algorithm: predicts the value of a variable
31

Data analytics & Big Data
 The process of classification (a small runnable sketch follows this slide)
Training data:
Name  | Age | Income | Car    | Insure
Anne  | 22  | 11K    | BMW    | N
Bob   | 33  | 20K    | Nissan | Y
Chris | 44  | 30K    | Nissan | Y
Dan   | 55  | 50K    | BMW    | Y
...   | ... | ...    | ...    | ...
Classification algorithm produces classification rules:
IF Age<30, Income<15K, Car=BMW THEN Insure=N
IF Age>30, Income>15K, Car=Nissan THEN Insure=Y
IF Age>30, Income>40K THEN Insure=Y
Test data: Emily | 22 | 14K | BMW | ?  →  predicted Insure=N
32
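A minimal sketch of training a classifier on the toy table above and predicting for Emily. The rpart decision-tree package and the tiny-tree settings are assumptions chosen for illustration, not the module's prescribed method:

library(rpart)   # decision-tree classifier, shipped with R

# Training data from the slide (Income in thousands)
train <- data.frame(
  Age    = c(22, 33, 44, 55),
  Income = c(11, 20, 30, 50),
  Car    = factor(c("BMW", "Nissan", "Nissan", "BMW")),
  Insure = factor(c("N", "Y", "Y", "Y"))
)

# Tiny-tree control settings are needed because the toy set has only 4 rows
fit <- rpart(Insure ~ Age + Income + Car, data = train,
             method = "class", control = rpart.control(minsplit = 2, cp = 0))

# Test record: Emily, 22, 14K, BMW
emily <- data.frame(Age = 22, Income = 14, Car = factor("BMW", levels = levels(train$Car)))
predict(fit, emily, type = "class")   # expected to mirror the slide: Insure = N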

Data analytics & Big Data
 The process of entity resolution
Do they refer to the same person?
How likely is it that they refer to the same person?
Is Amazon reviewer X and Facebook user Y the same person?
https://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
33

Data analytics & Big Data
 The process of entity resolution
Training: entity pairs are annotated w.r.t. whether they represent a match or a non-match, and a classifier is trained
Application: the classifier decides whether two entities match (a small similarity-based sketch follows below)
http://dbs.uni-leipzig.de/file/learning_based_er_with_mr.pdf 34
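A minimal, rule-based sketch of the matching step. It uses no trained classifier; it simply thresholds a normalized edit-distance similarity computed with base R's adist(). The name pairs and the threshold are assumptions for illustration:

# Candidate entity pairs, e.g., a reviewer name vs. a profile name (made-up examples)
pairs <- data.frame(a = c("Jon Smith", "Anne Lee", "A. Rajaraman"),
                    b = c("John Smith", "Dan Brown", "Anand Rajaraman"),
                    stringsAsFactors = FALSE)

# Similarity in [0,1]: 1 - edit distance / length of the longer string
sim <- function(a, b) 1 - adist(a, b) / pmax(nchar(a), nchar(b))

pairs$similarity <- mapply(sim, pairs$a, pairs$b)
pairs$match <- pairs$similarity > 0.7   # threshold chosen arbitrarily here; a
                                        # learning-based ER system would instead train
                                        # a classifier over such similarity features
pairs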

Data analytics & Big Data
 Comp. analytic tasks for Prescriptive analytics
 Generalized N-body problems
 Graph-theoretic computations
 Alignment problems
 Optimization
 Maximization/minimization problems solved by linear programming, integer programming
Prescriptive analytics algorithms use multiple predictive models, built on different data, to predict various outcomes and the best course of action for each outcome
Example: What medication to prescribe to a patient for treating a disease, based on the prescriptions of medicines given to others and their outcomes?
What are the differences between predictive and prescriptive analytics?
35

Data analytics & Big Data
 The use of optimization in prescriptive analytics
• We want to select a good model and its parameters, among several possibilities
• Optimization in model training: a core optimization problem is solved, which asks for the values of the variables or parameters of the model that optimize a selected objective function (e.g., minimize a loss); see the sketch below
• Optimization in model selection and validation: the core optimization problem may be solved many times (e.g., once per candidate model or per validation fold)
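A minimal sketch of "optimization in model training": fit a straight line by choosing the parameters that minimize a squared loss. The data and the model are assumptions; optim() is base R:

set.seed(2)
x <- runif(50)
y <- 3 * x + 1 + rnorm(50, sd = 0.1)          # synthetic data (assumption)

loss <- function(theta) sum((y - (theta[1] + theta[2] * x))^2)   # objective function

fit <- optim(c(0, 0), loss)    # the core optimization problem: minimize the loss
fit$par                        # ~ c(1, 3): the trained intercept and slope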
36

Data analytics & Big Data
 Example of optimization: linear programming (details not part of the course)
maximize    c_1 x_1 + … + c_n x_n
subject to  a_11 x_1 + … + a_1n x_n ≤ b_1
            …
            a_m1 x_1 + … + a_mn x_n ≤ b_m
and         x_1 ≥ 0, …, x_n ≥ 0
Design a webpage that brings maximum profit
• An image occupies 5% of the webpage and brings profit $0.1
• A link occupies 1% of the webpage and brings profit $0.01
• A video occupies 15% of the webpage and brings profit $0.5
• Use at most 10 images, 25 links, and 2 videos
37

Data analytics & Big Data
 Prescriptive analytics: linear programming in R
Design a webpage that brings maximum profit
• An image occupies 5% of the webpage and brings profit $0.1
• A link occupies 1% of the webpage and brings profit $0.01
• A video occupies 15% of the webpage and brings profit $0.5
• Use at most 10 images, 25 links, and 2 videos
Optimization function:
maximize    0.1 x_1 + 0.01 x_2 + 0.5 x_3
Optimization constraints:
subject to  0.05 x_1 + 0.01 x_2 + 0.15 x_3 ≤ 1
            x_1 ≤ 10, x_2 ≤ 25, x_3 ≤ 2
and         x_1 ≥ 0, x_2 ≥ 0, x_3 ≥ 0
Solution:   x_1 = 10, x_2 = 20, x_3 = 2
38

Data analytics & Big Data
 Prescriptive analytics: linear programming in R
maximize    0.1 x_1 + 0.01 x_2 + 0.5 x_3
subject to  0.05 x_1 + 0.01 x_2 + 0.15 x_3 ≤ 1
            x_1 ≤ 10, x_2 ≤ 25, x_3 ≤ 2
and         x_1 ≥ 0, x_2 ≥ 0, x_3 ≥ 0
#install and load the package for linear programming
install.packages("lpSolveAPI")
library("lpSolveAPI")

#set up the problem: start with 0 constraints and 3 decision variables
lprec <- make.lp(0, 3)
lp.control(lprec, sense = "max")
set.type(lprec, c(1, 2, 3), "integer")
#objective: profit 0.1 per image, 0.01 per link, 0.5 per video
set.objfn(lprec, c(0.1, 0.01, 0.5))
#constraints: total page space, and the limits on images, links, videos
add.constraint(lprec, c(0.05, 0.01, 0.15), "<=", 1)
add.constraint(lprec, c(1, 0, 0), "<=", 10)
add.constraint(lprec, c(0, 1, 0), "<=", 25)
add.constraint(lprec, c(0, 0, 1), "<=", 2)
#solve the problem
solve(lprec)
#get the value of the objective function
get.objective(lprec)
#get the values of the variables (solution)
get.variables(lprec)
Solution: x_1 = 10, x_2 = 20, x_3 = 2
39

Data analytics & Big Data
 Analytics overview (again)
Analytic types: Descriptive, Diagnostic, Predictive, Prescriptive
Computational analytic tasks (the 7 giants): Basic statistics, Generalized N-body problems, Linear algebraic computations, Graph-theoretic computations, Optimization, Integration, Alignment problems
40

Data analytics & Big Data
 Settings in which analytics are applied:
 Default: the dataset is stored in RAM
 Streaming: data arrive as a stream, and only a part (window) is stored
 Distributed: data are distributed over multiple machines (in RAM and/or on disk)
 Multi-threaded: data are in one machine and multiple processors share the RAM of the machine
41

Data analytics & Big Data
 Analytics flow for big data (details in next lectures)
Data collection: the data is collected and ingested into a big data stack
Data preparation: issues that would prevent meaningful processing are resolved
Analysis types: the type of analysis is determined
Analysis modes: the mode of analysis is determined
Visualizations: the analysis results are presented to the user
42

Data analytics & Big Data
 Analytics flow for big data - Example
Data collection: collect and store data from temperature sensors in a Big Data store
Data preparation: remove erroneous values (due to sensor faults)
Analysis types: pattern mining to find “increase by 10 degrees in 5 minutes”
Analysis modes: pattern mining over the data that arrived in the last hour
Visualizations: plot the pattern to see the extent of the increase
43

Summary
 Introduction to the topic: Big Data Technologies
 Data analytics & Big data
 Analytic tasks – the 7 giants
 Analytics flow
44