Data Management in the Cloud Module 1: Course Overview
Some slides from (U. Waterloo), , Harvard CS 109
What is Data? Data is real
Etchings on a disk
Electrons stored in a capacitor,
Empty/filled oxide gates in solid-state drives
Manipulation of the physical world that permits storing and retrieving bits via electronics
What is Data? Data describes important information
Transactions/sales
Web logs, e.g., web page visits
Content classification: “image of a dog”
Meta-data: bits 0-1024 capture null-terminated strings representing names
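The meta-data bullet above can be made concrete with a small sketch. It assumes "bits 0-1024" means the first 1024 bits (128 bytes) of a record and that names are ASCII; `HEADER_BYTES`, `parse_names`, and the sample record are hypothetical names made up for illustration.

```python
from typing import List

# Assumption: "bits 0-1024" is read as the first 1024 bits = 128 bytes.
HEADER_BYTES = 128


def parse_names(record: bytes) -> List[str]:
    """Return the null-terminated names stored in the record's header."""
    header = record[:HEADER_BYTES]
    names = []
    for chunk in header.split(b"\x00"):
        if chunk:  # empty chunks are padding after a terminator
            names.append(chunk.decode("ascii"))
    return names


# Hypothetical record: two names, zero padding up to 128 bytes, then a payload.
header = b"Ada\x00Grace\x00".ljust(HEADER_BYTES, b"\x00")
record = header + b"payload"
print(parse_names(record))
```

Without the meta-data, the same bytes are only bits; the description of the layout is what turns them into names.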
Information enables decision making
Which regional outlet should be notified for poor sales?
Which web page on our site is most popular?
Financial transactions, movie ratings, etc.
Cull valuable data from large amounts of data
Use stats, machine learning, domain expertise
What is Data Science?
“Data scientist = statistician + programmer + coach + story teller + artist”
[Figure: the Data Scientist in relation to computing, statistics, machine learning, system builders, and domain expertise]
http://go.osu.edu/CYHJ
How Does Data Science Happen?
1. Ask an interesting question
● What is the scientific goal? What do you want to predict?
2. Get the data
● How were data harvested and stored? Which data are relevant?
3. Explore the data
● Plot the data. Look for anomalies?
4. Model the data
● Build a model. Fit a model. Validate a model.
5. Communicate and visualize the results
● What did we learn? Do the results make sense?
How Does Data Science Happen?
1. Ask an interesting question
● What is the scientific goal? What do you want to predict?
2. Get the data
● How were data harvested and stored? Which data are relevant? Can we automate this process?
3. Explore the data
● Plot the data. Look for anomalies?
Answering What If, Should I and Other Expectation Exploration Queries Using Causal Inference over Longitudinal Data
4. Model the data
● Build a model. Fit a model. Validate a model.
5. Communicate and visualize the results
● What did we learn? Do the results make sense?
How Does Data Science Happen?
1. Ask an interesting question
● What is the scientific goal? What do you want to predict?
2. Get the data
● How were data harvested and stored? Which data are relevant?
3. Explore the data
● Plot the data. Look for anomalies?
4. Model the data
● Build a model. Fit a model. Validate a model.
5. Communicate and visualize the results
● What did we learn? Do the results make sense?
This class!
Source: Wikipedia (Noctilucent cloud)
Why big data? What big data?
No data like (a lot) more data! – Notice the X axis is log-scale
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)
No data like (a lot) more data!
[Figure: performance against normalized model size (labels include DistilBERT); data shown as rows (bytes) by columns (millions of dimensions)]
No data like (a lot) more data
With a little data: Answer factoid questions
Pattern matching on the Web
Works amazingly well
Who shot ? X shot Y where Y =
(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
With a lot of data: Learn relations
1 Start with seed instances
2 Search for patterns on the Web
3 Use patterns to find more instances (virtuous cycle)
Seed instances: Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
Text on the Web: "Mozart (1756 – 1791)", "Wolfgang was born in 1756", "Einstein was born in 1879"
Patterns: "PERSON (DATE –", "PERSON was born in DATE"
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; … )
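The three-step cycle (seed instances, patterns, more instances) can be sketched in a few lines. This is a toy illustration, not the method of the cited papers: the corpus sentences, the helper names, and the rule "a pattern is the text between PERSON and DATE" are all assumptions made for the example.

```python
import re

# Hypothetical mini "Web": a handful of sentences made up for this sketch.
corpus = [
    "Mozart (1756 - 1791) was a composer.",
    "Wolfgang was born in 1756.",
    "Einstein was born in 1879.",
    "Curie was born in 1867.",
    "Turing (1912 - 1954) was a mathematician.",
]


def learn_patterns(instances, sentences):
    """Step 2: a pattern is the text found between a known PERSON and DATE."""
    patterns = set()
    for person, date in instances:
        for sentence in sentences:
            start = sentence.find(person)
            end = sentence.find(date)
            if start != -1 and end > start:
                patterns.add(sentence[start + len(person):end])
    return patterns


def apply_patterns(patterns, sentences):
    """Step 3: match 'PERSON <pattern> DATE' to collect (person, date) pairs."""
    found = set()
    for between in patterns:
        regex = re.compile(r"([A-Z][a-z]+)" + re.escape(between) + r"(\d{4})")
        for sentence in sentences:
            found.update(regex.findall(sentence))
    return found


def bootstrap(seeds, sentences):
    """Repeat steps 2 and 3 until no new instances appear."""
    instances = set(seeds)
    while True:
        patterns = learn_patterns(instances, sentences)
        new = apply_patterns(patterns, sentences) - instances
        if not new:
            return instances, patterns
        instances |= new


seeds = {("Mozart", "1756"), ("Einstein", "1879")}  # step 1
instances, patterns = bootstrap(seeds, corpus)
print(sorted(instances))
print(sorted(patterns))
```

Two seeds yield two patterns, and those patterns pull in instances that were never supplied by hand. On a real corpus the patterns would also need scoring, because a loose pattern produces wrong instances that then feed the next round.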
Data begets data!
The amount of data worldwide is growing exponentially — 90% of the world’s data was created in the last two years
Data begets data
How is this possible? "Data begets data" is one explanation.
A simple growth model (the logistic equation without a resource limit) makes this precise:
– a = rate at which data produces new data
  Using our example, the rate at which a seed instance leads to a new instance
– h = units of time for data processing
– n(t) = the total amount of data at time t
– $a\,h\,n(t)$ = new data found between time t and time t+h
How quickly is data produced given data produces data at rate a?
– $n(t+h) = n(t) + a\,h\,n(t)$
  This formula captures the total data at time t+h, after one more data processing interval
– $n(t+h) - n(t) = a\,h\,n(t)$
Data begets data cont.
Dividing by h and letting h → 0, basic calculus yields the following derivative:
$\frac{d}{dt} n(t) = a\,n(t)$
What equation produces this derivative?
Virtuous cycles, where data processing yields new, informative data, have the potential for exponential growth.
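The function whose derivative is a n(t) is the exponential n(t) = n(0) e^{at}. A quick numerical check, as a sketch: apply the discrete rule n(t+h) = n(t) + a h n(t) with smaller and smaller h and compare against the exponential. The values of `n0`, `a`, and `total_time` are illustrative and not taken from the slides.

```python
import math


def simulate(n0, a, h, total_time):
    """Apply n(t+h) = n(t) + a*h*n(t) repeatedly from t = 0 to total_time."""
    steps = round(total_time / h)
    n = n0
    for _ in range(steps):
        n += a * h * n  # new data found during one processing interval
    return n


n0, a, total_time = 1.0, 0.5, 10.0  # illustrative values only
exact = n0 * math.exp(a * total_time)  # closed form n(0) * e^(a*t)

for h in (1.0, 0.1, 0.001):
    approx = simulate(n0, a, h, total_time)
    print(f"h={h}: discrete={approx:.2f}  exponential={exact:.2f}")
```

As h shrinks, the discrete total approaches the exponential from below, which is the h → 0 limit taken in the derivation.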
How Does Data Science Happen?
1. Ask an interesting question
● What is the scientific goal? What do you want to predict?
2. Get the data
● How were data harvested and stored? Which data are relevant?
3. Explore the data
● Plot the data. Look for anomalies?
In different contexts, data arrives with different velocity and volume
Data analysis begets more data
Storage demands are growing
Why big data? Who big data?
[Figure: system builders, statistics, machine learning, domain expertise]
There are multiple career paths in data science http://go.osu.edu/dataengvssci-er3ws
Data Scientist Profiles
Title: data engineer
Description: data arrives with very loose structure (e.g., streams of webpage clicks); related data is stored on multiple machines; write software to organize data and find correlations
Nemesis: software bugs (because a lot of code is written for each result); agility (because it takes time to write code)
Favorite Tools:
Data Scientist Profiles
Title: data analyst
Description: data is organized in tables or, even better, relational databases; generate massive reports to find anomalies
Nemesis: low dimensionality (if data is not already being collected, it is hard to create); small server clusters and/or big problems (my reports compute too slowly)
Favorite Tools:
Data Scientist Profiles
Title: data wranglers and deep data experts
Description: define your own data structures (normally as large vectors or graphs); rank competing structures (in terms of revealing anomalies or correlations)
Nemesis: time to “stage” data (i.e., reorder data according to the proposed structure); compute time (for humans to think of good structures and for computers to evaluate them)
Favorite Tools:
Data management includes placing, structuring and organizing data for computational tasks
Data management facilitates fast & effective data science
http://go.osu.edu/stonebraker-xlr33
Why big data? How big data?
Cloud Computing
Before clouds…
Connection machine
Vector supercomputers …
Cloud computing means many different things:
Big data
Rebranding of web 2.0
Auto scaling hardware
Rebranding of web 2.0
Rich, interactive web applications
Clouds refer to the servers that run them
AJAX as the de facto standard (for better or worse)
Examples: Facebook, YouTube, Gmail, …
“The network is the computer”: take two
User data is stored “in the clouds”
Rise of the netbook, smartphones, etc.
Browser is the OS
Dr. . Wang, Microsoft
BrowserShield: Vulnerability-driven filtering of dynamic HTML
Dr. , Prof. Harvard
Mugshot: Deterministic Capture and Replay for JavaScript Applications.
https://go.osu.edu/cse3244-jamesmickens-browsers
Auto Scaling Workloads
Ready-made big data problems in data centers
Social media, user-generated content = big data
Examples: Facebook friend suggestions, Google ad placement
Business intelligence: gather everything in a data warehouse and
run analytics to generate insight
Cloud computing as auto scaling provides:
Ability to provision Hadoop clusters on-demand in the cloud
Lower barrier to entry for tackling big data problems
Democratization of big data capabilities
Enabling Technology: Virtualization
[Figure: Traditional Stack, three apps running on one operating system; a hypervisor is added in the virtualized stack]
Operating systems do two things: (1) allow users/programs to request access to hardware and (2) manage access to hardware
Operating systems support complicated requests for hardware that depend on the execution of co-located programs.
Hypervisors multiplex multiple operating systems but enforce independence between the execution of co-located programs
Virtualized Stack
Enabling Technology: Virtualization
[Figure: workers 1, 2, 3 running on hypervisors; data on multiple disks (too big for 1)]
Divide and Conquer
Parallelization Challenges
How do we assign work units to workers?
What if we have more work units than workers?
What if workers need to share partial results?
How do we aggregate partial results?
How do we know all the workers have finished?
What if workers die?
What’s the common theme of all of these problems?
Common Theme?
Parallelization problems arise from:
Communication between workers (e.g., to exchange state)
Access to shared resources (e.g., data)
Thus, we need a synchronization mechanism
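A minimal sketch of these ideas on one machine, using Python threads as stand-in workers. A shared queue assigns work units (more units than workers), a lock guards the shared partial result, and `join` tells us when every worker has finished. The function name `parallel_sum` and the chunking are assumptions for illustration; the sketch does not handle workers that die.

```python
import queue
import threading


def parallel_sum(chunks, num_workers=3):
    """Sum all chunks using a fixed pool of worker threads."""
    work = queue.Queue()
    for chunk in chunks:  # work units wait in a shared queue
        work.put(chunk)

    total = 0
    lock = threading.Lock()  # synchronization for the shared result

    def worker():
        nonlocal total
        while True:
            try:
                chunk = work.get_nowait()  # take the next unassigned unit
            except queue.Empty:
                return  # no work left, this worker is done
            partial = sum(chunk)  # compute without holding the lock
            with lock:  # only one worker updates the total at a time
                total += partial

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # returns once the worker has finished
    return total


data = list(range(1, 101))
chunks = [data[i:i + 10] for i in range(0, len(data), 10)]  # 10 units, 3 workers
print(parallel_sum(chunks))  # prints 5050
```

Both sources of trouble appear here: the queue is the communication channel between the coordinator and the workers, and `total` is the shared resource. Across many servers the same questions remain, but the queue, the lock, and failure detection have to be provided by the platform.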
Outcomes
● You will learn principles of data management
– Programming models limit how developers are allowed to access data, e.g., which server to use, what computations are allowed
– Data models explain how data is stored, e.g., what meta-data is available, what consistency is enforced, what relations/schemas are defined
– Optimizations allow us to place, organize and manage data to improve our data science tasks (here, improve means enable and/or speed up)
● You will see data management in Hadoop, Spark & Salesforce platforms
● You will be exposed to computer architecture for large scale data management, esp. geo-distributed data centers, data warehouses
● You will be exposed to widely used declarative programming languages, e.g., SQL and TensorFlow, and learn how they alleviate problems that arise in imperative languages, e.g., Java.
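To illustrate the declarative versus imperative contrast, here is a small sketch that answers "which regional outlet has the poorest sales?" twice. Python stands in for the imperative language (the outcome above names Java), SQLite stands in for a SQL engine, and the `sales` rows are made up for the example.

```python
import sqlite3

# Hypothetical sales records: (region, amount).
sales = [
    ("north", 120.0),
    ("south", 40.0),
    ("north", 80.0),
    ("east", 15.0),
    ("south", 25.0),
]

# Imperative: spell out how to group, accumulate, and pick the minimum.
totals = {}
for region, amount in sales:
    totals[region] = totals.get(region, 0.0) + amount
worst_imperative = min(totals, key=totals.get)

# Declarative: state what is wanted and let the engine decide how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
worst_declarative = conn.execute(
    "SELECT region FROM sales GROUP BY region ORDER BY SUM(amount) ASC LIMIT 1"
).fetchone()[0]
conn.close()

print(worst_imperative, worst_declarative)  # prints: east east
```

The query says nothing about loops, memory layout, or where the rows live, which is what leaves a data management system free to place, organize, and parallelize the work.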