Microsoft PowerPoint – 23- MongoDB-MapReduce
© 2018 A. Alawini
MongoDB and
Map-Reduce
Abdu Alawini
University of Illinois at Urbana-Champaign
CS411: Database Systems
November 28, 2018
1
© 2018 A. Alawini, @D. Maier
Announcements
•HW 5 was posted last Monday: due on 12/5
•PT1 Stage 5 final demos 11/28 to 12/4
•PT1 report and video: due on 12/3
•PT2 Stage 2 report and slides: due on 12/3
2
© 2018 A. Alawini, @D. Maier
Last lecture!
•NoSQL Introduction
•Relational-NoSQL Trade-offs
•MongoDB
•Model and simple queries
3
© 2018 A. Alawini, @D. Maier
Cursor methods
• .count, .pretty, .sort etc are all examples of cursor methods
• .toArray returns an array that contains all documents from a cursor
• .forEach applies a JavaScript function to each document from the
cursor (similar to .map)
4
nsf= db.awards.find({“by”: “National Science Foundation”).toArray()
if (nsf.length >0) {printjson (nsf[0])}
db.awards.find(). forEach
( function(myDoc) { printjson (“Award: ” + myDoc.name); });
© 2018 A. Alawini, @D. Maier
Today’s lecture
•Continue with MongoDB
•Join
•Aggregation
•Updates, replication and sharding
•Map-reduce
5
© 2018 A. Alawini, @D. Maier
Joins in MongoDB
•“Do joins while write, not on reads.”
•Use embedded relationships
•Otherwise, you need to use semi-joins to get an array
of keys from the first collection on which to search the
second collection for matches using cursor methods
6
© 2018 A. Alawini, @D. Maier
Relationships: Embedded
7
{ _id: 1,
name: { first: “John”, last: “Backus” },
birthyear: 1924,
contribs: [ “Fortran”, “ALGOL”,
“Backus Naur Form”, “FP” ],
awards: [ {title: “National Medal of Science” ,
by: “National Science Foundation”,
year: 1975 },
{title: “Turing Award”,
by: “ACM” ,
year: 1977} ] }
© 2018 A. Alawini, @D. Maier
Relationships: Referenced
8
{ _id: 1,
name: { first: “John”, last: “Backus” },
birthyear: 1924,
contribs: [ “Fortran”, “ALGOL”,
“Backus Naur Form”, “FP” ],
awards: [ { award_id: “NMS001”, year: 1975 },
{ award_id: “TA99”, year: 1977} ] }
People:
{_id: “NMS001”,
title: “National Medal of Science” ,
by: “National Science Foundation”},
{_id: “TA99”,
title: “Turing Award”,
by: “ACM” }
Awards:
© 2018 A. Alawini, @D. Maier
“SemiJoins”
•Suppose you want to print people who have won Turing
Awards using referenced relationship
• Problem: object id of Turing Award is in collection “awards”,
collection “people” references it.
• But this only works for one award with title “Turing Award”, suppose
there were more.
9
turing= db.awards.findOne({title: “Turing Award”})
db.people.find({“awards.award_id”: turing._id})
© 2018 A. Alawini, @D. Maier
•Now suppose there is more than one award named
“Turing Award”
Iterating using cursors
10
var turing= db.awards.find ({title: “Turing Award”})
while (turing.hasNext()) {
db.people.find({“awards.award_id”: turing.next()._id})
}
© 2018 A. Alawini, @D. Maier
Today’s lecture
oMongoDB
Join
•Aggregation
•Updates, replication and sharding
•Map-reduce
11
© 2018 A. Alawini, @D. Maier
Aggregation
•A framework to provide “group-by” and aggregate
functionality without the overhead of map-reduce.
•Conceptually, documents from a collection pass through
an aggregation pipeline, which transforms the objects as
they pass through (similar to UNIX pipe “|”)
•Operators include: $project, $match, $group, $sort, $skip,
$limit, $unwind
12
© 2018 A. Alawini, @D. Maier 13*https://docs.mongodb.com/manual/aggregation/
Aggregation Example*
© 2018 A. Alawini, @D. Maier
Aggregation: $group
•Every group expression must specify an _id field.
•Suppose we wanted to find how many people were born
each year
•Contrast with aggregate operation over entire result
14
> db.people.aggregate( { $group :
{ _id : “$birthyear”, birthsPerYear : { $sum : 1}}})
> db.people.count( )
> db.people.find({“name.first”: “John”}).count( )
> db.people.count({“name.first”: “John”})
{ “result” : [ { “_id” : 1924, “birthsPerYear” : 1 } ], “ok” : 1 }
© 2018 A. Alawini, @D. Maier
Aggregation: $unwind
•Deconstructs an array field to output a document for
each element.
15
>db.posts.aggregate( { $project : { author : 1, tags : 1 }}, { $unwind : “$tags” } )
Posts: {
_id : ObjectId(“4c4ba5c0672c685e5e8aabf3”),
author : “Kevin”,
date : new Date(“February 2, 2012”),
text : “About MongoDB…”,
birthyear: 1980,
tags : [ “tech”, “databases” ]
}
© 2018 A. Alawini, @D. Maier
Result of unwind
16
{ “result” : [ { “_id” : ObjectId(“4c4ba5c0672c685e5e8aabf3”),
“author” : “Kevin”,
“tags” : ”tech” },
{ “_id” : ObjectId(“4c4ba5c0672c685e5e8aabf3”),
“author” : “Kevin”,
“tags” : ”databases” } ],
“OK” : 1}
© 2018 A. Alawini, @D. Maier
Today’s lecture
MongoDB
Join
Aggregation
•Updates, Replication and sharding
•Map-reduce
17
© 2018 A. Alawini, @D. Maier
•MongoDB does not support multi-document atomic
transactions. However, it does provide atomic
operations on a single document: if a hundred fields are
being updated in a document they will either all be
updated or none will be.
•To ensure atomicity: if related information is frequently
being updated together, they should be placed in a
single document using embedded documents.
Updates
18
© 2018 A. Alawini, @D. Maier
Replication
19
© 2018 A. Alawini, @D. Maier
Sharding
20
© 2018 A. Alawini, @D. Maier
Today’s lecture
MongoDB
Join
Aggregation
Updates, Replication and sharding
•Map-reduce
21
© 2018 A. Alawini, @D. Maier
Map-Reduce: Motivation
•SQL is a great way of doing batch (bulk) operations to
collections (tables)
•But it’s somewhat limited in the kinds of operations it
supports (by default) –
SELECT … FROM … WHERE … GROUP BY … HAVING
…
•What if we want to do arbitrary selection predicates
(in Java etc.), and arbitrary aggregations (in Java etc.)
•User defined functions, or alternative frameworks
•What if we wanted to scale this up to run on a 10000
node compute cluster??
22
© 2018 A. Alawini, @D. Maier
Map-Reduce Analogy: National census
•Suppose we have
10,000 employees,
whose job is to collect
census forms and
to determine how
many people live in
each city
•How would you
organize this task?
23
http://www.census.gov/2010census/pdf/2010_Questionnaire_Info.pdf
© 2018 A. Alawini, @D. Maier
Making things more complicated
•Suppose workers take vacations, get sick, work at
different rates
•Suppose some forms are incorrectly filled out and
require corrections or need to be thrown away
•What if the supervisor gets sick?
•How big should the stacks be?
•How do we monitor progress?
• …
24
© 2018 A. Alawini, @D. Maier
I don’t want to deal with all this!!!
•Wouldn’t it be nice if there were some system that
took care of all these details for you?
•Ideally, you’d just tell the system what needs to be
done
•That’s the MapReduce framework.
25
© 2018 A. Alawini, @D. Maier
Abstracting into a digital data flow
26
Filter+Stack
Worker
Filter+Stack
Worker
Filter+Stack
Worker
Filter+Stack
Worker
CountStack
Worker
CountStack
Worker
CountStack
Worker
CountStack
Worker
CountStack
Worker
blue: 4k
green: 4k
cyan: 3k
gray: 1k
orange: 4k
© 2018 A. Alawini, @D. Maier
Abstracting once more
•There are two kinds of workers:
• Those who take input data items and produce output items for the
“stacks”
• Those who take the stacks and aggregate the results to produce
outputs on a per-stack basis
•We’ll call these:
• map: takes (item_key, value), produces one or more (stack_key,
value’) pairs
• reduce: takes (stack_key, {set of value’}), produces one or more
output results – typically (stack_key, agg_value)
27
We will refer to this key
as the reduce key; it is like the GROUP BY key
map does SELECT + PROJECT
© 2018 A. Alawini, @D. Maier
Why MapReduce?
•Scenario:
• You have a huge amount of data, e.g., all the Google
searches of the last three years
• You would like to perform a computation on the data, e.g.,
find out which search terms were the most popular
•Analogy to the census example:
• The computation isn’t necessarily difficult, but
parallelizing and distributing it, as well as handling faults,
is challenging
•Idea: A programming abstraction / template!
•Write a simple program to express the (simple)
computation, and let the language runtime do all the hard
work 28
© 2018 A. Alawini, @D. Maier
Simple example: Word count
•Goal: Given a set of documents, count how
often each word occurs
• Input: Key-value pairs (document:lineNumber, text)
•Output: Key-value pairs (word, #occurrences)
•What should be the intermediate key-value pairs?
map(String key, String value) {
// key: document name, line no
// value: contents of line
}
reduce(String key, Iterator values)
{
}
for each word w in value:
emit(w, “1”)
// key: a word
// values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
emit(key, result)
29
© 2018 A. Alawini, @D. Maier
Simple example: Word count
Mapper
(1-2)
Mapper
(3-4)
Mapper
(5-6)
Mapper
(7-8)
(1, the apple)
(2, is an apple)
(3, not an orange)
(4, because the)
(5, orange)
(6, unlike the apple)
(7, is orange)
(8, not green)
(the, 1)
(apple, 1)
(is, 1)
(apple, 1)
(an, 1)
(not, 1)
(orange, 1)
(an, 1)
(because, 1)
(the, 1)
(orange, 1)
(unlike, 1)
(apple, 1)
(the, 1)
(is, 1)
(orange, 1)
(not, 1)
(green, 1)
(apple, 3)
(an, 2)
(because, 1)
(green, 1)
(is, 2)
(not, 2)
(orange, 3)
(the, 3)
(unlike, 1)
(apple, {1, 1, 1})
(an, {1, 1})
(because, {1})
(green, {1})
(is, {1, 1})
(not, {1, 1})
(orange, {1, 1, 1})
(the, {1, 1, 1})
(unlike, {1})
Each mapper
receives some
of the KV-pairs
as input
The mappers
process the
KV-pairs
one by one
Each KV-pair output
by the mapper is
sent to the reducer
that is responsible
for it
The reducers
sort their input
by key
and group it
The reducers
process their
input one group
at a time
Key range the node
is responsible for
1 2 3 4 5
30
Reducer
(A-G)
Reducer
(H-N)
Reducer
(O-U)
Reducer
(V-Z)
© 2018 A. Alawini, @D. Maier
MapReduce dataflow
Mapper
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Reducer
In
p
u
t
d
at
a
O
u
tp
u
t
d
at
a
“The Shuffle”
Intermediate
(key,value) pairs
31