程序代写代做代考 Java database javascript COMP5338 – Advanced Data Models

COMP5338 – Advanced Data Models

Dr. Ying Zhou
School of Information Technologies

COMP5338 – Advanced Data Models
Week 3: MongoDB – Aggregation Framework

The picture can’t be displayed.

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Outline
 Review

 Aggregation
 Pipeline stages
 Operators

03-2

The picture can’t be displayed.

Review
 Document Storage Systems store data as semi-structured document:

XML or JSON as two dominant semi-structured formats
 Semi-structured data is self-describing

 MongoDB is a popular document storage system that stores data as
Binary representation of JSON document (BSON)

 Documents with similar structure that representing a particular type of
entity are stored in the same collection

 A database is used to hold multiple collections representing related
entities

 All CRUD operations (find, update, insert and delete) target single
collection

 Query criteria, projection and modifier are expressed as JSON
document

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-3

The picture can’t be displayed.

Null, empty string and related
operators

 Null (or null) is a special data type
 Similar to None, Null or Nil in any programming language
 It has a singleton value expressed as null
 Indicating no value is given

 The interpretation of null is different depending on where it
appears

 It might represents
 The field exists, but has no value
 The field does not exits

 This is different to given a field an empty string “” as value

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-4

The picture can’t be displayed.

Null query example
 Collection revisions document sample

{ “_id” : ObjectId(“5799843ee2cbe65d76ed919b”),
“title” : “Hillary_Clinton”,
“timestamp” : “2016-07-23T02:02:06Z”,
“revid” : 731113635,
“user” : “BD2412”,
“parentid” : 731113573,
“size” : 251742,
“minor” : “”}

 We need a field to indicate if a revision is minor or not. The
original schema uses a field with empty string value to
indicate a minor revision; a document without this field
would be a non-minor revision.

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-5

https://docs.mongodb.com/manual/tutorial/query-for-null-fields/

The picture can’t be displayed.

Querying for null or field existance
 Queries

 db.revisions.find({minor:{$exists:true}})
 Find all documents that a field called minor exists

 db.revisions.find(where the {minor:””})
 Find all documents whose minor field has a value of “”, empty string

 db.revisions.find({minor:null})
 Find all documents that does not have a minor field or the value of
minor field is null

 db.revisions.find({minor:{$exists:false}})
 Find all documents that does not have a field called minor

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-6

The picture can’t be displayed.

It is possible to set the value to null
db.revisions.insertOne({title:”nulltest”,

“timestamp” : “2018-08-14T02:02:06Z”,
“revid” : NumberLong(7201808141159),
“user” : “BD2412”,
“parentid” : 731113573,
“size” : NumberInt(251900),
“minor”:null})

db.revisions.insertOne({title:”nulltest”,
“timestamp” : “2018-08-14T02:02:06Z”,
“revid” : NumberLong(201808141157),
“user” : “BD2412”,
“parentid” : NumberLong(731113573),
“size” : NumberInt(251800)})

db.revisions.find({minor:null}) would return both documents
db.revisions.find({minor:{$exists:true}}) can differentiate the two

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-7

The picture can’t be displayed.

Aggregation
 Simple and relatively standard data analytics can be

achieved through aggregation
 Grouping, summing up value, counting, sorting, etc
 Running on the DB engine instead of application layer

 Several options
 Aggregation Pipeline
MapReduce

 Through JavaScript Functions
 Is able to do customized aggregations

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 02 -8

The picture can’t be displayed.

Aggregation Pipeline
 Aggregation pipeline consists of multiple stages

 Stages are specified using pipeline operators such as $match,
$group,$project, $sort and so on
 This is similar to SQL’s WHERE, GROUP BY, SORT BY etc
 Each stage is expressed as an object enclosed by curly bracket

 Various expressions can be specified in each stage
 To filter documents or to perform simple calculation on an document

• $substr, $size, etc, …

 $group stage can specify accumulators to perform calculation on
documents with the same group key

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 02 -9

db.collection.aggregate( [
{ pipeline operator: {expression/accumulator,…, expression/accumulator} },
{ pipeline operator: {expression/accumulator,…, expression/accumulator} },

] )

The picture can’t be displayed.

Aggregation Example

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 02 -10

select cust_id as _id, SUM(amount) as total
from orders
where status = “A”
group by cust_id

The picture can’t be displayed.

Typical aggregation stages
 $match
 $group
 $project
 $sort
 $skip
 $limit
 $count
 $sample
 $out
 $unwind
 $lookup

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-11

The picture can’t be displayed.

$match stage
 $match

 Filter the incoming documents based on given conditions
 Format:
{$match: {}}

 The query document is the same as those in the find query
 Example:
db.revisions.aggregate([{$match:{size :{$lt: 250000 }}}])

Has the same effect as

db.revisions.find({size :{$lt: 250000 }})

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-12

The picture can’t be displayed.

$group stage
 $group

 Groups incoming documents by some specified expression and
outputs to the next stage a document for each distinct group
 The _id field of the output document has the value of the group key for

each group
 The stage can specify many other fields

{ $group: {_id:,
:{accumulator: },
…},

 To specify the whole collection as a group, give _id field null value
 Use field path to access fields in the document and set one or many

as the value of the _id field
 “$title”, or “$address.street”

 There are predefined accumulators: $sum, $avg, $first, $last, etc

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-13

The picture can’t be displayed.

$group stage example
 Find the earliest revision time in the whole collection
db.revisions.find({},{timestamp:1, _id:0})

.sort({timestamp:1})

.limit(1)
db.revisions.aggregate([
{$group: {_id:null, earliest: {$min: “$timestamp”}}}

])

 Find the earliest revision time of each page in the collection
db.revisions.aggregate([
{$group: {_id:”$title”, earliest: {$min: “$timestamp”}}}

])

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-14

Accumulator: field path

field path

The picture can’t be displayed.

$group stage example (cont’d)
 Find the number of revisions made on each page by each

individual user
 This would require grouping based on two fields: title and user
We need to specify these two as the _id field of the output document

db.revisions.aggregate([
{$group: {_id:{title:”$title”,user:”$user”},

rev_count: {$sum: 1}}}
])

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-15

Composite type as _id

The picture can’t be displayed.

$group by more than one field

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-16

{_id:ObjectId(“…”), title: “DT”, user:“A”, size:123, timestamp:…, … }

{_id:ObjectId(“…”), title: “HC”, user:“B”, size:113, timestamp:…, … }

{_id:ObjectId(“…”), title: “DT”, user:“B”, size:125, timestamp:…, … }

{_id:ObjectId(“…”), title: “HC”, user:“A”, size:113, timestamp:…, … }

{_id:ObjectId(“…”), title: “DT”, user:“A”, size:125, timestamp:…, … }

{_id: {title: “DT”, user:“A”}, rev_count: 2}

{_id: {title: “HC”, user:“B”}, rev_count: 1}

{_id: {title: “DT”, user:“B”}, rev_count: 1}

{_id: {title: “HC”, user:“A”}, rev_count: 1}

{$group: {_id:{title:”$title”,user:”$user”},
rev_count: {$sum: 1}}}

The picture can’t be displayed.

$group examples (cont’d)
 Accumulators do not just return a single value, we can use

accumulators to create an array to hold data from incoming
documents

 What do the following two commands do:
db.revisions.aggregate([
{$group: {_id:”$title”,

revs: {$push:{user:”$user”,timestamp:”$timestamp”}}}
}])

db.revisions.aggregate([
{$group: {_id:”$title”,rev_users: {$addToSet:”$user”}}}
])

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-17

The picture can’t be displayed.

$push accumulator

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-18

{_id:ObjectId(“…”), title: “DT”, user:“A”, size:123, timestamp:…, … }

{_id:ObjectId(“…”), title: “HC”, user:“B”, size:113, timestamp:…, … }

{_id:ObjectId(“…”), title: “DT”, user:“B”, size:125, timestamp:…, … }

{_id:ObjectId(“…”), title: “HC”, user:“A”, size:113, timestamp:…, … }

{_id:ObjectId(“…”), title: “DT”, user:“A”, size:125, timestamp:…, … }

db.revisions.aggregate([
{$group:

{_id:”$title”,
revs:{$push:{user:”$user”,timestamp:”$timestamp”}}}

}])

{ _id: “DT”,
revs:[

{user:“A”,timestamp:…},
{user:“B”,timestamp:…},
{user:“A”,timestamp:..}

]}

{ _id:“HC”,
revs:[

{user:“A”, timestamp:…},
{user:“B”, timestamp:…}

]}

The picture can’t be displayed.

$addToSet accumulator

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-19

{_id:ObjectId(“…”), title: “DT”, user:“A”, size:123, timestamp:…, … }

{_id:ObjectId(“…”), title: “HC”, user:“B”, size:113, timestamp:…, … }

{_id:ObjectId(“…”), title: “DT”, user:“B”, size:125, timestamp:…, … }

{_id:ObjectId(“…”), title: “HC”, user:“A”, size:113, timestamp:…, … }

{_id:ObjectId(“…”), title: “DT”, user:“A”, size:125, timestamp:…, … }

db.revisions.aggregate([
{$group: {_id:”$title”,

rev_users:{$addToSet:”$user”}}}
])

{ _id: “DT”,
rev_users:[“A”, “B”]

}

{ _id:“HC”,
rev_users:[“A”, “B”]

}

The picture can’t be displayed.

$project stage
 $project

 Reshape the document by including/excluding field, adding new
fields, resetting the value of existing field

More powerful than the project argument in find query
 Format
{$project: {:

 In the expression, existing field from incoming document can be
accessed using field path: “$fieldname”

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-20

The picture can’t be displayed.

$project examples
 Find the age of each title in the collection, where the age is

defined as the duration between the last and the first
revision of that title, assuming the timestamp is of ISODate
type

db.revisions.aggregate([
{$group: {_id:”$title”,

first: {$min:”$timestamp”},
last: {$max:”$timestamp”} }},

{$project: {_id: 0,
title: “$_id”,
age: {$subtract:[“$last”,”$first”]}}}

])

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-21

The picture can’t be displayed.

$group then $project

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-22

{_id:ObjectId(“…”), title: “DT”, timestamp:“2016-07-01 00:03:46.000Z”, … }

{_id:ObjectId(“…”), title: “HC”, timestamp:“2016-07-01 00:55:44.000Z”, … }

{_id:ObjectId(“…”), title: “DT”, timestamp:“2016-07-15 12:22:35.000Z”, … }

{_id:ObjectId(“…”), title: “HC”, timestamp:“2016-07-28 00:03:58.000Z”,… }

{_id:ObjectId(“…”), title: “DT”, timestamp:“2016-07-28 00:20:19.000Z”, … }

{_id:“DT”, first:“2016-07-01 00:03:46.000Z”, last:“2016-07-28 00:20:19.000Z”}

{_id:“HC”, first:“2016-07-01 00:55:44.000Z”, last:“2016-07-28 00:03:58.000Z”}

{title: “DT”, age:2333793000}

{title: “HC”, age:2329694000}

{$group: {_id:”$title”,
first: {$min:”$timestamp”},
last: {$max:”$timestamp”} }},

{$project: {_id: 0,
title: “$_id”,
age: $subtract:[“$last”,”$first”]}}}

The picture can’t be displayed.

We can combine multiple operators
db.revisions.aggregate([
{$group: {_id:”$title”,

first: {$min:”$timestamp”},
last: {$max:”$timestamp”} }},

{$project: {_id: 0,
title: “$_id”,
age:{$divide:

[{$subtract:[“$last”,”$first”]},
86400000]}}}

age_unit: {$literal:”day”}}}
])

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-23

The picture can’t be displayed.

$sort, $skip, $limit and $count stages
 $sort stage sorts the incoming documents based on specified field(s)

in ascending or descending order
 The function and format is similar to the sort modifier in find query
 { $sort: { : , :

} }

 $skip stage skips over given number of documents
 The function and format is similar to the skip modifier in find query
 { $skip: }

 $limit stage limits the number of documents passed to the next stage
 The function and format is similar to the limit modifier in find query
 { $limit: }

 $count stage counts the number of documents passing to this stage
 The function and format is similar to the count modifier in find query
 { $count: }
 String is the name of the field representing the count

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-24

The picture can’t be displayed.

$sample and $out stages
 The $sample stage randomly selects given number of

documents from the previous stage
 { $sample: { size: } }
 Different sampling approaches depending on the location of the

stage and the size of the sample and the collection
May fail due to memory constraints

 The $out stage writes the documents in a given collection
 should be the last one in the pipeline
 { $out: “” }

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-25

The picture can’t be displayed.

$lookup stage
 New aggregation stages are added with major releases.
 $lookup stage is added since 3.2 to perform left outer join between two

collections
 The collection already in the pipeline (maybe after a few stages)
 Another collection (could be the same one)

 For each incoming document from the pipeline, the $lookup stage adds
a new array field whose elements are the matching documents from the
other collection.

{$lookup:
{ from: ,

localField: ,
foreignField: ,
as:

}
}

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-26

The picture can’t be displayed.

$lookup stage (cont’d)
 The output of $lookup stage has the same number of

documents as the previous stage
 Each document is augmented with an array field storing

matching document(s) from the other collection
 The array could contain any number of documents

depending on the match, including zero
 Missing local or foreign field is treated as having null value

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-27

The picture can’t be displayed.

$lookup stage example
db.orders.aggregate([

{
$lookup:

{
from: “inventory”,
localField: “item”,
foreignField: “sku”,
as: “inventory_docs”

}
}

])

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-28

{“_id”:1, “item”:”abc”, “price”:12,”quantity”:2 }

{“_id”:2, “item”:”nosku”, “price”:20,”quantity”:1 }

{“_id”:3 }

{“_id”:1, “sku”:”abc”, description:”product 1″, “instock”:120}

{“_id”:2, “sku”:”def”, description:”product 2″, “instock”:80 }

{“_id”:3, “sku”:”ijk”, description:”product 3″, “instock”:60}

{“_id”:4, “sku”:”jkl”, description:”product 4″, “instock”:70 }

{“_id”:5, “sku”:null, description:”Incomplete” }

{“_id”:6}

orders

inventory

https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#pipe._S_lookup

A document with no item field

A document with sku field
equals null

A document with no sku field

The picture can’t be displayed.

$lookup stage example (cont’d)

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-29

{“_id”:1, “item”:”abc”, “price”:12,”quantity”:2 }

{“_id”:2, “item”:”nosku”, “price”:20,”quantity”:1 }

{“_id”:3 }

{“_id”:1, “sku”:”abc”, description:”product 1″, “instock”:120}

{“_id”:2, “sku”:”def”, description:”product 2″, “instock”:80 }

{“_id”:3, “sku”:”ijk”, description:”product 3″, “instock”:60}

{“_id”:4, “sku”:”jkl”, description:”product 4″, “instock”:70 }

{“_id”:5, “sku”:null, description:”Incomplete” }

{“_id”:6}

{“_id”:1, “item”:”abc”, “price”:12,”quantity”:2,
“inventory_docs”: [

{ “_id”:1, “sku”:”abc”, description:”product 1″, “instock”:120 }] }

{“_id”:2, “item”:“nosku”, “price”:20,”quantity”:1,
“inventory_docs” : [] }

{“_id”:3, “inventory_docs” : [
{ “_id” : 5, “sku” : null, “description” : “Incomplete” },
{ “_id” : 6 }]}

An empty array for no matching from other collection

Non exists field matches null and non exists field

The picture can’t be displayed.

Dealing with data of array type
 To aggregate (e.g. grouping) values in an array field, it is possible to

flatten the array to access individual value
 $unwind stage flattens an array field from the input documents to output

a document for each element. Each output document is the input
document with the value of the array field replaced by the element.
 { $unwind: }

 Behaviour
 Input document:

{ “_id” : 1, “item” : “ABC1”, sizes: [ “S”, “M”, “L”] }
 After $unwind:”$sizes”
 Becomes 3 output documents:

{ “_id” : 1, “item” : “ABC1”, “sizes” : “S” }
{ “_id” : 1, “item” : “ABC1”, “sizes” : “M” }
{ “_id” : 1, “item” : “ABC1”, “sizes” : “L” }

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-30

The picture can’t be displayed.

$unwind example
 Find the number of items that are available in each size
db.inventory.aggregate( [
{ $unwind : “$sizes” },
{ $group:{_id: “$sizes”, item_count: {$sum:1}} }
] )

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-31

The picture can’t be displayed.

$unwind then $group
{ “_id” : 1, “item” : “ABC”, “sizes”: [ “S”, “M”, “L”] }

{ “_id” : 2, “item” : “EFG”, “sizes” : [ ] }

{ “_id” : 3, “item” : “IJK”, “sizes”: “M” }

{ “_id” : 4, “item” : “LMN” }

{ “_id” : 5, “item” : “XYZ”, “sizes” : null }

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-32

{ “_id” : 1, “item” : “ABC”, “sizes”: “S”}

{ “_id” : 1, “item” : “ABC”, “sizes”: “M”}

{ “_id” : 1, “item” : “ABC”, “sizes”: “L”}

{ “_id” : 3, “item” : “IJK”, “sizes”: “M” }

{ $group:{_id: “$sizes”,
item_count: {$sum:1}}

{ “_id” : “S”, “item_count”: 1}

{ “_id” : “M”, “item_count”: 2}

{ “_id” : “L”, “item_count”: 1}

{ $unwind : “$sizes” },

The picture can’t be displayed.

Aggregation Operators
 A few aggregation stages allow us to add new fields or to

given existing fields new values based on expression
 In $group stage we can use various operators or accumulators to

compute values for new fields
 In $project stage we can use operators to compute values for new

or exiting fields
 There are many predefined operators for various data types

to carry out common operations in that data type
 Arithmetic operators: $mod, $log, $sqrt, $subtract, …
 String operators: $concat, $split, $indexofBytes, …
 Comparison operators: $gt, $gte, $lt, $lte,…
 Set operators: $setEquals, $setIntersection, …
 Boolean operators: $and, $or, $not, …
 Array operators: $in, $size, ..

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-33

The picture can’t be displayed.

Aggregation vs. Query operators
 There is another set of operators that can be used in

find/update/delete queries or the $match stage of an
aggregation
 E.g. $gt, $lt, $in, $all….

 The set is smaller and are different to the operators used in
$group or $project stage

 Some operators look the same but have different syntax and
slightly different interpretation in query and in aggregation.
 E.g. $gt in query looks like

{age: {$gt:18}}
 $gt in $project stage looks like:

{over18: {$gt:[“$age”, 18]}}

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-34

The picture can’t be displayed.

Aggregation Behaviour
 It operates on a single collection (before 3.2)

 Join can be performed using a particular operator $lookup
 It logically passes the entire collection into the pipeline
 Early filtering can improve the performance
 $match and $sort operator are able to use index if placed at

the beginning of the pipeline

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 02 -35

The picture can’t be displayed.

Summary
 MongoDB stores data as BSON document
 Retrieving data from MongoDB are usually achieved

through
 find query
 aggregate pipeline

 find query targets a single collection, it supports condition
on any field

 aggregate pipeline is able to access other collection(s)
 Both provides rich set of operators
 update/insert/delete operation guarantees document

level atomicity
 None standard query API, set of operators are growing

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-36

The picture can’t be displayed.

References
 BSON types

 https://docs.mongodb.com/manual/reference/bson-types/
 Aggregation Pipelines

 https://docs.mongodb.com/manual/core/aggregation-pipeline/
 Aggregation operators

 https://docs.mongodb.com/manual/reference/operator/aggregation/

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 03-37

https://docs.mongodb.com/manual/reference/bson-types/
https://docs.mongodb.com/manual/core/aggregation-pipeline/
https://docs.mongodb.com/manual/reference/operator/aggregation/

COMP5338 – Advanced Data Models
Outline
Review
Null, empty string and related operators
Null query example
Querying for null or field existance
It is possible to set the value to null
Aggregation
Aggregation Pipeline
Aggregation Example
Typical aggregation stages
$match stage
$group stage
$group stage example
$group stage example (cont’d)
$group by more than one field
$group examples (cont’d)
$push accumulator
$addToSet accumulator
$project stage
$project examples
$group then $project
We can combine multiple operators
$sort, $skip, $limit and $count stages
$sample and $out stages
$lookup stage
$lookup stage (cont’d)
$lookup stage example
$lookup stage example (cont’d)
Dealing with data of array type
$unwind example
$unwind then $group
Aggregation Operators
Aggregation vs. Query operators
Aggregation Behaviour
Summary
References