程序代写代做代考 Java database Fortran javascript Microsoft PowerPoint - 23- MongoDB-MapReduce

Microsoft PowerPoint – 23- MongoDB-MapReduce

MongoDB and
Map-Reduce

Abdu Alawini
University of Illinois at Urbana-Champaign

CS411: Database Systems

November 28, 2018

Announcements

•HW 5 was posted last Monday: due on 12/5

•PT1 Stage 5 final demos 11/28 to 12/4

•PT1 report and video: due on 12/3

•PT2 Stage 2 report and slides: due on 12/3
2

Last lecture!

•NoSQL Introduction
•Relational-NoSQL Trade-offs
•MongoDB
•Model and simple queries

Cursor methods

• .count, .pretty, .sort etc are all examples of cursor methods
• .toArray returns an array that contains all documents from a cursor

• .forEach applies a JavaScript function to each document from the
cursor (similar to .map)

nsf= db.awards.find({“by”: “National Science Foundation”).toArray()
if (nsf.length >0) {printjson (nsf[0])}

db.awards.find(). forEach
( function(myDoc) { printjson (“Award: ” + myDoc.name); });

Today’s lecture

•Continue with MongoDB
•Join
•Aggregation

•Updates, replication and sharding
•Map-reduce

Joins in MongoDB

•“Do joins while write, not on reads.”
•Use embedded relationships

•Otherwise, you need to use semi-joins to get an array
of keys from the first collection on which to search the
second collection for matches using cursor methods

Relationships: Embedded

{ _id: 1,
name: { first: “John”, last: “Backus” },
birthyear: 1924,
contribs: [ “Fortran”, “ALGOL”,

“Backus Naur Form”, “FP” ],
awards: [ {title: “National Medal of Science” ,

by: “National Science Foundation”,
year: 1975 },
{title: “Turing Award”,

by: “ACM” ,
year: 1977} ] }

Relationships: Referenced

{ _id: 1,
name: { first: “John”, last: “Backus” },
birthyear: 1924,
contribs: [ “Fortran”, “ALGOL”,

“Backus Naur Form”, “FP” ],
awards: [ { award_id: “NMS001”, year: 1975 },

{ award_id: “TA99”, year: 1977} ] }

People:

{_id: “NMS001”,
title: “National Medal of Science” ,
by: “National Science Foundation”},

{_id: “TA99”,
title: “Turing Award”,
by: “ACM” }

Awards:

“SemiJoins”

•Suppose you want to print people who have won Turing
Awards using referenced relationship
• Problem: object id of Turing Award is in collection “awards”,

collection “people” references it.

• But this only works for one award with title “Turing Award”, suppose
there were more.

turing= db.awards.findOne({title: “Turing Award”})
db.people.find({“awards.award_id”: turing._id})

•Now suppose there is more than one award named
“Turing Award”

Iterating using cursors

var turing= db.awards.find ({title: “Turing Award”})

while (turing.hasNext()) {
db.people.find({“awards.award_id”: turing.next()._id})
}

Today’s lecture

oMongoDB
Join
•Aggregation

•Updates, replication and sharding
•Map-reduce

Aggregation

•A framework to provide “group-by” and aggregate
functionality without the overhead of map-reduce.

•Conceptually, documents from a collection pass through
an aggregation pipeline, which transforms the objects as
they pass through (similar to UNIX pipe “|”)

•Operators include: $project, $match, $group, $sort, $skip,
$limit, $unwind

Aggregation Example*

Aggregation: $group

•Every group expression must specify an _id field.
•Suppose we wanted to find how many people were born

each year

•Contrast with aggregate operation over entire result

> db.people.aggregate( { $group :
{ _id : “$birthyear”, birthsPerYear : { $sum : 1}}})

> db.people.count( )
> db.people.find({“name.first”: “John”}).count( )
> db.people.count({“name.first”: “John”})

{ “result” : [ { “_id” : 1924, “birthsPerYear” : 1 } ], “ok” : 1 }

Aggregation: $unwind

•Deconstructs an array field to output a document for
each element.

>db.posts.aggregate( { $project : { author : 1, tags : 1 }}, { $unwind : “$tags” } )

Posts: {
_id : ObjectId(“4c4ba5c0672c685e5e8aabf3”),
author : “Kevin”,
date : new Date(“February 2, 2012”),
text : “About MongoDB…”,
birthyear: 1980,
tags : [ “tech”, “databases” ]
}

Result of unwind

{ “result” : [ { “_id” : ObjectId(“4c4ba5c0672c685e5e8aabf3”),
“author” : “Kevin”,
“tags” : ”tech” },

{ “_id” : ObjectId(“4c4ba5c0672c685e5e8aabf3”),
“author” : “Kevin”,
“tags” : ”databases” } ],

“OK” : 1}

Today’s lecture

MongoDB
Join
Aggregation

•Updates, Replication and sharding
•Map-reduce

•MongoDB does not support multi-document atomic
transactions. However, it does provide atomic
operations on a single document: if a hundred fields are
being updated in a document they will either all be
updated or none will be.

•To ensure atomicity: if related information is frequently
being updated together, they should be placed in a
single document using embedded documents.

Updates

Replication

Sharding

Today’s lecture

MongoDB
Join
Aggregation
Updates, Replication and sharding
•Map-reduce

Map-Reduce: Motivation

•SQL is a great way of doing batch (bulk) operations to
collections (tables)
•But it’s somewhat limited in the kinds of operations it

supports (by default) –
SELECT … FROM … WHERE … GROUP BY … HAVING
…
•What if we want to do arbitrary selection predicates

(in Java etc.), and arbitrary aggregations (in Java etc.)
•User defined functions, or alternative frameworks

•What if we wanted to scale this up to run on a 10000
node compute cluster??

Map-Reduce Analogy: National census

•Suppose we have
10,000 employees,
whose job is to collect
census forms and
to determine how
many people live in
each city

•How would you
organize this task?

23
http://www.census.gov/2010census/pdf/2010_Questionnaire_Info.pdf

Making things more complicated

•Suppose workers take vacations, get sick, work at
different rates

•Suppose some forms are incorrectly filled out and
require corrections or need to be thrown away

•What if the supervisor gets sick?
•How big should the stacks be?
•How do we monitor progress?
• …

I don’t want to deal with all this!!!

•Wouldn’t it be nice if there were some system that
took care of all these details for you?

•Ideally, you’d just tell the system what needs to be
done

•That’s the MapReduce framework.

Abstracting into a digital data flow

Filter+Stack
Worker

CountStack
Worker

blue: 4k

green: 4k

cyan: 3k

gray: 1k

orange: 4k

Abstracting once more
•There are two kinds of workers:
• Those who take input data items and produce output items for the

“stacks”

• Those who take the stacks and aggregate the results to produce
outputs on a per-stack basis

•We’ll call these:
• map: takes (item_key, value), produces one or more (stack_key,

value’) pairs

• reduce: takes (stack_key, {set of value’}), produces one or more
output results – typically (stack_key, agg_value)

We will refer to this key
as the reduce key; it is like the GROUP BY key

map does SELECT + PROJECT

Why MapReduce?
•Scenario:
• You have a huge amount of data, e.g., all the Google

searches of the last three years

• You would like to perform a computation on the data, e.g.,
find out which search terms were the most popular

•Analogy to the census example:
• The computation isn’t necessarily difficult, but

parallelizing and distributing it, as well as handling faults,
is challenging

•Idea: A programming abstraction / template!
•Write a simple program to express the (simple)

computation, and let the language runtime do all the hard
work 28

Simple example: Word count

•Goal: Given a set of documents, count how
often each word occurs
• Input: Key-value pairs (document:lineNumber, text)
•Output: Key-value pairs (word, #occurrences)
•What should be the intermediate key-value pairs?

map(String key, String value) {
// key: document name, line no
// value: contents of line

}

reduce(String key, Iterator values)
{

}

for each word w in value:
emit(w, “1”)

// key: a word
// values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);

emit(key, result)

Simple example: Word count

Mapper
(1-2)

Mapper
(3-4)

Mapper
(5-6)

Mapper
(7-8)

(1, the apple)

(2, is an apple)

(3, not an orange)

(4, because the)

(5, orange)

(6, unlike the apple)

(7, is orange)

(8, not green)

(the, 1)

(apple, 1)

(is, 1)

(apple, 1)
(an, 1)

(not, 1)

(orange, 1)

(an, 1)

(because, 1)
(the, 1)

(orange, 1)

(unlike, 1)

(apple, 1)

(the, 1)

(is, 1)

(orange, 1)
(not, 1)
(green, 1)

(apple, 3)
(an, 2)
(because, 1)
(green, 1)

(is, 2)
(not, 2)

(orange, 3)
(the, 3)

(unlike, 1)

(apple, {1, 1, 1})
(an, {1, 1})
(because, {1})
(green, {1})

(is, {1, 1})
(not, {1, 1})

(orange, {1, 1, 1})
(the, {1, 1, 1})

(unlike, {1})

Each mapper
receives some
of the KV-pairs

as input

The mappers
process the

KV-pairs
one by one

Each KV-pair output
by the mapper is

sent to the reducer
that is responsible

for it

The reducers
sort their input

by key
and group it

The reducers
process their

input one group
at a time

Key range the node
is responsible for

1 2 3 4 5

Reducer
(A-G)

Reducer
(H-N)

Reducer
(O-U)

Reducer
(V-Z)

MapReduce dataflow

Mapper

Reducer

In
p
u
t

d
at

O
u
tp

u
t

d
at

“The Shuffle”

Intermediate
(key,value) pairs

Related Posts