Big Data hype!


So what is Big Data?
Data whose capturing, storage, curation, management, and processing are not possible with traditional tools within a tolerable amount of time

So what is Big Data?
Very large amounts of data. Right? :-\

So what is Big Data?
• Volume (thank you captain obvious…)
• Complexity: entities, data, hierarchies, links, relationships, …
• Variety: Structured, semi-structured, unstructured
– Text, video, audio, photos
– Wikipedia pages: text + infoboxes
– Microformat, microdata, (schema.org), …
• Velocity: near-real time
– Storage, querying, analytics, …
• Veracity: Is it really the “true” data? errors? alterations?
• Variability: data flows, peaks & valleys, …
• Vendetta: any of these can break a traditional approach

Something old…

Old news to say the least
• We have always struggled with very large data
• The first International Conference on Very Large Data Bases (VLDB) was held in…
a. ’70s b. ’80s c. ’90s d. ’00s
• Then: mini-computers, mainframes, …
• Afterwards: servers, clusters, …
• Nowadays: The Cloud

Old news at least…
[Figure: two curves over time (t1…t4): "DATA SIZE / AVAILABILITY / DATA USE" vs. "SIZE OF REQUIRED COMPUTING RESOURCES", with the Cloud chasing the data]

Recurring nightmares…
• Scalability
– To scale up or to scale out?
• Availability and Fault Tolerance
– More on these shortly
• Consistency
– Big headache…
• Performance
– Latency / throughput
– Selectivity
– Access patterns
– Transaction processing vs analytics

Physical Storage for a Database
• Fundamental limitation: the database is too large to fit in RAM
– By default, data lives on secondary storage
• But the CPU cannot directly access data not in RAM
– Data must first be loaded into main memory from secondary storage, then fetched into registers to be processed
• Secondary storage access time becomes the bottleneck
– Data access takes about 10–60 ms, instead of 30–120 ns
• Alleviating this bottleneck is the subject of research into data storage and database systems
What if the data is too large to fit on a single computer?
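A quick back-of-the-envelope calculation makes the gap concrete (a sketch using 10 ms and 100 ns as representative points within the ranges quoted above):

```python
# Back-of-the-envelope: how much slower is secondary storage than RAM?
disk_access_s = 10e-3   # ~10 ms, the low end of the 10-60 ms range above
ram_access_s = 100e-9   # ~100 ns, within the 30-120 ns range above

slowdown = disk_access_s / ram_access_s
print(f"Disk is roughly {slowdown:,.0f}x slower than RAM per access")
# roughly 100,000x slower
```

Five orders of magnitude per access is why minimizing secondary-storage I/O dominates database system design.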

System Latencies
© B. Gregg, "Systems Performance: Enterprise and the Cloud", Prentice-Hall, 2013. Source: A. Silberschatz, "Operating System Concepts", 9th Ed., 2012.

Storage challenges
Storage layer properties:
– Efficient
– Scalable
– Highly available
– Fault tolerant
– Consistent
[Slide annotations: "TOO COSTLY", "TOO DESIRABLE", "TOO FREQUENT"]

Processing challenges
"Google has spoiled us"…
– 47% of users expect web pages to load in under 2''
– 40% of users abandon a website that takes more than 3'' to load
– A 1'' delay (or 3'' of waiting) decreases customer satisfaction by 16%

Processing challenges
Big Data processing is a complex task…
– Large/fast/incomplete/… amounts of data…
– Complex algorithms…
– Possibly requiring multiple passes over the data…
– Across large or geo-distributed datacentres…
… and the clock is ticking …

Availability
• Example:
– Assume a cluster with 10 PCs
– Assume that they all need to be up for a service to work
– Assume that the probability of any one being down at any point in time is 1%
– What is the probability that the service is up (i.e., available) at a random time point?
– P(PC is up) = 1 − P(PC is down) = 1 − 0.01 = 0.99
– P(service up) = P(all PCs up) = P(1st PC up AND 2nd PC up AND …)
– … = P(1st PC up) × P(2nd PC up) × … = (P(PC is up))^10
– … = (1 − P(PC is down))^10 = 0.99^10 ≈ 0.90 (= 90%)
• With 100 PCs?
– … = (1 − P(PC is down))^100 = 0.99^100 ≈ 0.37 (= 37%)
• With 10,000 PCs?
– … = (1 − P(PC is down))^10,000 = 0.99^10,000 ≈ 2.2e-44 (= 2.2e-42 %)
– … = 0.00000000…00000000022 ☹
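The calculation above can be checked in a few lines (a minimal sketch assuming independent failures, as the slide does):

```python
# Availability of a service that requires ALL n machines to be up,
# when each machine is independently down with probability p_down.
def service_availability(n: int, p_down: float = 0.01) -> float:
    return (1.0 - p_down) ** n

for n in (10, 100, 10_000):
    print(f"{n:>6} PCs -> P(service up) = {service_availability(n):.3g}")
# 10 PCs -> ~0.90, 100 PCs -> ~0.37, 10,000 PCs -> ~2.2e-44
```

The lesson: at scale, availability cannot rest on "everything stays up"; it must come from replication and fault tolerance.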

Something new…

Why me? Why this? Why now?
Data exhaust
Data hoarding

Needle in a haystack
One man’s “trash” is another man’s gold
“Mind the Trash Gap”
[Figure: total data vs. relevant data over time (t1…t4)]

Something borrowed…

Essential Big Data tools
A framework for massively-parallel access to huge volumes of data, including
① A programmatic interface: MapReduce/Spark
② A baseline (distributed) file system (HDFS)
A back-end storage system
• Scalable! Scalable! Scalable!
• Random and sequential accesses
• More than just a file system … A “modern” DB
Added value software
• Machine learning libraries
• Graph processing libraries
• Document processing libraries
• …

Why MapReduce?
• “Million Lyrics Dataset”
– One file per song
– Verses one per line, accessed as key-value pairs (key: line no; value: line text)
Hear the rime of the Ancient Mariner, see his eye as he stops one of three.
Mesmerises one of the wedding guests. Stay here and listen to the nightmares of the Sea.
And the music plays on, as the bride passes by. Caught by his spell and the Mariner tells his tale. Driven south to the land of the snow and ice, to a place where nobody’s been.
Through the snow fog flies on the albatross. Hailed in God’s name, hoping good luck it brings . And the ship sails on, back to the North. Through the fog and ice and the albatross follows on. The mariner kills the bird of good omen. His shipmates cry against what he’s done,
but when the fog clears, they justify him and make themselves a part of the crime.
Q: Compute the sorted set of (distinct) words and their frequency of appearance (aka Word Count)
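The Word Count task above can be sketched in MapReduce style on a single machine. The `map_fn` and `reduce_fn` below mirror the two functions a framework user would write; the grouping ("shuffle") in between is what Hadoop/Spark do for you. This is an illustration of the programming model, not a framework API:

```python
# Minimal single-machine sketch of Word Count in MapReduce style.
from collections import defaultdict
import re

def map_fn(line_no, line):
    # Emit (word, 1) for every word in the line (key: line no; value: line text).
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1

def reduce_fn(word, counts):
    # Sum all the 1s emitted for this word.
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)
    for line_no, line in enumerate(lines):              # "map" phase
        for word, one in map_fn(line_no, line):
            groups[word].append(one)                    # "shuffle"/grouping
    return sorted(reduce_fn(w, c) for w, c in groups.items())  # "reduce" phase

print(word_count(["Hear the rime of the Ancient Mariner,"])[:3])
# [('ancient', 1), ('hear', 1), ('mariner', 1)]
```

In a real framework the map and reduce phases run on many machines in parallel, and the grouping step moves data across the network.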

Why MapReduce?
• A traditional approach:
– Write a simple program (or shell script) to process these data files, line by line – assuming we know the lines’ format
– E.g., what does the following do?
#!/bin/bash
cat lyrics-* |\
sed "s/[,.;]/ /g; s/'s//g; s/\([A-Z]\)/\L\1/g" |\
awk '{ for (i = 1; i <= NF; i++) freq[$i]++; }
     END { for (word in freq) print word, freq[word]; }'
What’s wrong with that?!?!

Why MapReduce?
• Execution time…
– The above query/shell/awk can take virtually forever!
– But our task is embarrassingly parallelizable!
• Step 1: Use more than 1 process/thread
– Each processing a different subset of lines
– But how to evenly divide the workload across processes?
• Note: total time will depend on the worst performer!
• Clever load balancing will lead to post-processing…
• Step 2: Use more computers (and their cores)
– Scale out!

Why MapReduce?
Step 3: Enjoy! Now you have to deal with:
– Communication costs
• Network I/O is a major bummer
– Reliability
• What happens if processes/machines/networks/… fail and do not process their part?
– Coordination
• Who tells whom what to do?
• Who does pre- and post-processing?
• What if that guy fails?

Why MapReduce/Spark/…?
In MapReduce (and Spark, Flink, Samza, …) you only need to code your map and reduce functions; then, ready-made client libraries and the Hadoop/HDFS(/…) infrastructure work their magic!
But how do they work?

Essential Big Data tools
A framework for massively-parallel access to huge volumes of data, including
① A programmatic interface: MapReduce/Spark
② A baseline (distributed) file system (HDFS)
A back-end storage system
• Scalable! Scalable! Scalable!
• Random and sequential accesses
• More than just a file system … A “modern” DB
Added value software
• Machine learning libraries
• Graph processing libraries
• Document processing libraries
• …

Back-end storage system
• Alternatives:
– (Parallel) RDBMSs
– Distributed file systems
– NoSQL systems:
• Key-value stores
• Graph data stores
• Document stores
• Semi-relational stores
• Fundamental differences not always clear…

Why NoSQL? DFS culprits
A DFS like HDFS is designed to be:
• A swift data sink
• Expandable – scale out
• Optimized for batch data processing
Great for many analytic tasks (based on data SCANS)
But, what about “random accesses”?
And, what about “real-time analytics”?
Why not then use RDBMSs?

Why NoSQL? RDBMS culprits
Front-end / Back-end
SCALABILITY IS THE GOTCHA!
One has to design for scalability from the start!
Cannot simply take RDBMS boxes and put them together, BECAUSE EACH SUCH BOX COMES BUNDLED WITH DECISIONS THAT HURT SCALABILITY!

Why NoSQL? RDBMS culprits
1. NORMALIZATION
Non-normalized Table … repeating groups
1NF Table … redundancy
2NF Students-Advisor Table
1NF Table … redundancy
2NF Students-Classes Table

Why NoSQL? RDBMS culprits
1. NORMALIZATION
3NF Students-Advisor Table
3NF Advisor-Room Table
3NF Students-Classes Table
If a query needs information from >1 table, then it must JOIN them
[Table residue: Advisor “Jones”, AdvRoom “412”; “1022”, “143-01”, “159-02”]
JOIN queries are very resource-hungry – scalability + performance issues!
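A toy illustration of the point, using hypothetical data modelled on the slide's tables ("Jones", room "412"): with the 3NF layout, answering "which room is this student's advisor in?" requires a join, i.e., multiple lookups; a denormalized, NoSQL-style record answers it in one read, at the cost of redundancy and harder updates.

```python
# 3NF layout: two separate tables (hypothetical toy data).
students_advisor = {"s1": "Jones", "s2": "Jones"}   # student -> advisor
advisor_room = {"Jones": "412"}                     # advisor -> room

def room_of(student):
    # The JOIN: two dependent lookups. On a distributed store this can
    # mean multiple network round-trips (or a costly shuffle in bulk).
    return advisor_room[students_advisor[student]]

# Denormalized layout: one record holds everything -> a single read,
# but "Jones moved rooms" now means updating every student record.
students_denorm = {"s1": {"advisor": "Jones", "adv_room": "412"}}

print(room_of("s1"))                         # via JOIN: '412'
print(students_denorm["s1"]["adv_room"])     # via one lookup: '412'
```

Many NoSQL stores embrace exactly this trade: duplicate data so that the common query is one key lookup.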

Why NoSQL? RDBMS culprits
2. CONSISTENCY
A widely and wildly abused term
The “DBMS” view: ACID = Transactions (Atomicity, Consistency, Isolation, Durability)
– “CONSISTENCY” = A + C + I + D
The “Scalable Systems” view: Basically Available, Eventually consistent
– “CONSISTENCY” = read freshness …

Something blue?

With “big” data comes “big” responsibility…

Big data ethics

Big data ethics
Source: XKCD (https://xkcd.com/1390/ , https://xkcd.com/1838/)

Big data ethics
When it comes to privacy and data protection, the law is The Law
– Data Protection Act 1998: https://en.wikipedia.org/wiki/Data_Protection_Act_1998
– EU General Data Protection Regulation (GDPR): https://en.wikipedia.org/wiki/General_Data_Protection_Regulation
But is that enough?
“Ethical practice has to develop, in part because the law nearly always lags what is possible”
“There will always be acts that are legal, not ethical, so law not enough”
Source: (2017). Big data and data sharing: Ethical issues. UK Data Service, UK Data Archive [link]

Big data ethics
