Unstructured Data Example
Opinion pieces, box scores, summaries, ads, user comments, etc.
1 website, many types of data
Unstructured Data Example cont.
Business goal: Identify popular players
Download each web page (or frame) and store as a file
file1.txt → “……” file2.txt → “……..” …
Find out which players' names appear most frequently
How would you solve this problem?
Hire a data scientist to:
Real Map Reduce: Player name mentions
Solve problem with MapReduce, assume function:
bool isPlayerName (String s);
→ true if s is the name of a current WNBA player
→ false otherwise
(on the board)
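One way the board solution might look, as a minimal single-process sketch rather than a real cluster job. Everything here is an assumption for illustration: `KNOWN_PLAYERS` is a hypothetical roster with placeholder names standing in for whatever backs isPlayerName, names are treated as single words, and `run_job` only imitates the grouping that a MapReduce framework would do across machines.

```python
from collections import defaultdict

# Hypothetical roster; the slides only assume that isPlayerName exists.
KNOWN_PLAYERS = {"Alice", "Bea"}


def is_player_name(s):
    """Stand-in for the assumed isPlayerName(String s)."""
    return s in KNOWN_PLAYERS


def map_page(filename, text):
    """map: (filename, page text) -> list of (player name, 1)."""
    pairs = []
    for word in text.split():
        # Assumption: a name is one word; strip surrounding punctuation.
        candidate = word.strip(".,;:!?\"'()")
        if is_player_name(candidate):
            pairs.append((candidate, 1))
    return pairs


def reduce_mentions(name, counts):
    """reduce: (player name, [1, 1, ...]) -> list of (player name, total)."""
    return [(name, sum(counts))]


def run_job(pages):
    """Imitate the framework: run mappers, group by key, run reducers."""
    grouped = defaultdict(list)
    for filename, text in pages.items():
        for name, one in map_page(filename, text):
            grouped[name].append(one)
    results = []
    for name in sorted(grouped):
        results.extend(reduce_mentions(name, grouped[name]))
    return results


pages = {
    "file1.txt": "Alice scored 30. Bea assisted Alice twice.",
    "file2.txt": "Fans cheered for Bea, Alice, and the coach.",
}
mention_counts = run_job(pages)
print(mention_counts)  # [('Alice', 3), ('Bea', 2)]
```

Changing the business goal (NBA, all professional sports, corporate entities) would only change `is_player_name`; the map and reduce functions keep the same shape.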
Unstructured Data Example cont.
Oh, excuse me. Did I say WNBA? I meant NBA. No, I meant all professional sports? No, I meant all corporate entities?
The same problem is relevant across many domains
At some point, data doesn’t fit on your laptop
How do mappers & reducers find the files they need?
A distributed file system is the answer, e.g., Hadoop Distributed File System (HDFS)
Don’t move data to workers… move workers to the data!
Store data on the local disks of nodes in the cluster
Start up the workers on the node that has the data local
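The idea of moving workers to the data can be sketched as a small placement rule. This is an illustrative assumption, not how Hadoop's scheduler is actually written: `pick_worker`, the block IDs, and the node names are all hypothetical, and the fallback to a non-local node is a simplification.

```python
def pick_worker(block_id, block_locations, idle_nodes):
    """Return (node, is_local) for the task that reads block_id.

    Prefer an idle node that already stores the block on its local disk;
    otherwise fall back to any idle node, which must read over the network.
    """
    local = [n for n in block_locations.get(block_id, []) if n in idle_nodes]
    if local:
        return local[0], True
    if idle_nodes:
        return sorted(idle_nodes)[0], False
    return None, False


# Hypothetical cluster state: which nodes hold which blocks.
block_locations = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
idle = {"node-b", "node-d"}

choice_1 = pick_worker("blk_1", block_locations, idle)
choice_2 = pick_worker("blk_2", block_locations, idle)
print(choice_1)  # ('node-b', True): node-b stores blk_1 locally
print(choice_2)  # ('node-b', False): no idle node stores blk_2
```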
Map Reduce
Programmers specify two functions:
map (k1, v1) → [<k2, v2>]
reduce (k2, [v2]) → [<k3, v3>]
All values with the same key are sent to the same reducer
The execution framework handles everything else…
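The guarantee that all values with the same key reach the same reducer is commonly obtained by hashing the key. The sketch below assumes a simple hash partitioner. `partition` and `shuffle` are hypothetical names, and `zlib.crc32` is used only because it gives the same result on every machine, unlike Python's built-in string `hash`.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Map a key to a reducer index; same key always gives the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to its reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducers[partition(key, num_reducers)][key].append(value)
    return reducers


# Two mappers emit the key "a" independently of each other.
mapper_outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=3)

# Exactly one reducer owns "a", and it sees both values.
owners_of_a = [r for r in reducer_inputs if "a" in r]
print(len(owners_of_a), owners_of_a[0]["a"])  # 1 [1, 1]
```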
“Everything Else”
The execution framework handles everything else…
Scheduling: assigns workers to map and reduce tasks
“Data distribution”: moves processes to data
Synchronization: gathers, sorts, and shuffles intermediate data
Errors and faults: detects worker failures and restarts their tasks
You don’t know:
Where mappers and reducers run
When a mapper or reducer begins or finishes
Which input a particular mapper is processing
Which intermediate key a particular reducer is processing
What can you do?
Cleverly structure intermediate data to reduce network traffic
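One common way to structure intermediate data is local aggregation inside the mapper, so that each mapper emits one pair per distinct key instead of one pair per occurrence. The sketch below is an assumption about what "cleverly" could mean here. The function names and the sample words are hypothetical, and the trick is only valid because the reducer's sum is associative and commutative.

```python
from collections import Counter


def map_without_combining(words):
    """Emit one (word, 1) pair per occurrence."""
    return [(w, 1) for w in words]


def map_with_combining(words):
    """Pre-sum counts in the mapper and emit one pair per distinct word."""
    return sorted(Counter(words).items())


words = ["Alice", "Bea", "Alice", "Alice"]
naive = map_without_combining(words)
combined = map_with_combining(words)

print(len(naive))  # 4 pairs would cross the network
print(combined)    # [('Alice', 3), ('Bea', 1)]: only 2 pairs
```

The reducer receives the same totals either way; only the number of pairs shuffled between machines changes.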
Distributed File Systems
Companies like Google, Apple, and Facebook run MapReduce jobs hourly over 10,000+ machines at multiple locations (called datacenters)!
Challenge: Improve throughput for distributed file systems (and hence MapReduce)
Namenode Responsibilities
Managing the file system namespace:
Holds file/directory structure, metadata, file-to-block mapping, access permissions, etc.
Coordinating file operations:
Directs clients to datanodes for reads and writes
No data is moved through the namenode
Maintaining overall health:
Periodic communication with the datanodes
Block re-replication and rebalancing
Garbage collection
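The responsibilities above can be pictured with a toy metadata server. This is a hedged sketch and not the HDFS implementation: `ToyNamenode`, its method names, the replication factor of 2, and the 30-second heartbeat timeout are all assumptions chosen for illustration. Note that it stores only metadata; block contents never pass through it.

```python
class ToyNamenode:
    """Holds metadata only: file -> blocks and block -> datanodes."""

    def __init__(self, replication=2, timeout=30.0):
        self.file_to_blocks = {}   # path -> ordered list of block ids
        self.block_to_nodes = {}   # block id -> set of datanode names
        self.last_heartbeat = {}   # datanode name -> time last heard from
        self.replication = replication
        self.timeout = timeout

    def add_file(self, path, blocks):
        """Record a file; blocks maps block id -> datanodes holding it."""
        self.file_to_blocks[path] = list(blocks)
        for block_id, nodes in blocks.items():
            self.block_to_nodes[block_id] = set(nodes)

    def heartbeat(self, node, now):
        """Periodic communication: a datanode reports that it is alive."""
        self.last_heartbeat[node] = now

    def locate(self, path):
        """Tell a client where each block lives; the client then reads
        from those datanodes directly."""
        return [(b, sorted(self.block_to_nodes[b]))
                for b in self.file_to_blocks[path]]

    def under_replicated(self, now):
        """Blocks with too few live replicas, i.e. re-replication candidates."""
        live = {n for n, t in self.last_heartbeat.items()
                if now - t <= self.timeout}
        return sorted(b for b, nodes in self.block_to_nodes.items()
                      if len(nodes & live) < self.replication)


nn = ToyNamenode()
nn.add_file("/pages/file1.txt",
            {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]})
for dn in ("dn1", "dn2", "dn3"):
    nn.heartbeat(dn, now=0.0)
nn.heartbeat("dn1", now=40.0)
nn.heartbeat("dn2", now=40.0)      # dn3 has gone silent

locations = nn.locate("/pages/file1.txt")
needs_copy = nn.under_replicated(now=50.0)
print(needs_copy)  # ['blk_2']: only dn2 still holds a live replica
```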