COMP5338 – Advanced Data Models
Dr. Ying Zhou
School of Information Technologies
COMP5338 – Advanced Data Models
Week 2: Document Store: Data Model and Simple Query
Administrative
Most labs are not full at the moment
If you wish to move lab but cannot do it online, please go to the lab you want
to attend and let the tutor know
Students allocated in SIT118
If you wish to attend Wednesday labs, please attend SIT116 if your sid ends
with even number and SIT117 if your sid ends with odd number
You may attend one of the Tuesday evening labs as well.
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -2
Time Room Capacity Tutor
Tue 8-9pm SIT114 30 Dai
Tue 8-9pm SIT115 30 Andrian
Tue 8-9pm SIT117 20 Heming (Taurus)
Tue 8-9pm SIT118 20 Chenhao
Wed 5-6pm SIT116 20 Givanna
Wed 5-6pm SIT117 20 Heming(Taurus)
Wed 5-6pm SIT118 20 Will be closed
Lab arrangement
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou)
Outline
Overview of Document Databases
MongoDB Data Model
MongoDB CRUD Operations
02 -3
Structured and Unstructured Data
Relational Database System is designed to store
structured data in tabular format, e.g. each pieces of data
is stored in a predefined field (attribute)
Unstructured data does not follow any predefined “model”
or “format” that is aware to the underlying system .
Examples include data stored in various files, e.g word
document
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -4
8703 Heinz 0293514287
8731 Edgell 0378301294
8927 Kraft 0299412020
9031 CSR 0720977632
Supplier Table:
SuppID Name Phone
Semi-structured Data
Many data have some structure but should not be
constrained by a predefined and rigid schema
E.g. if some suppliers have multiple phone numbers, it is hard to
capture such information in a relational model effectively
Self-describing capability is the key characteristics of semi-
structured data
schema/structure is an integral part of the data, instead of a separate
declaration
in database system, the structure is “declared” when you create a
table. All rows need to follow the structure
in CSV and Excel, the structure is “declared” in the header row. All
subsequent rows are supposed to follow that
XML and JSON are two types of semi-structured data
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -5
A Self-describing XML document
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -6
123
metadata/structure information data
Another invoice with slightly different structure
123
456
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -7
JSON Data Format
JSON (JavaScript Object Notation) is a simple way to
represent JavaScript objects as strings.
There are many tools to serialize objects in other programming
language as JSON
JSON was introduced in 1999 as an alternative to XML for
data exchange.
Each JSON object is represented as a list of property
names and values contained in curly braces, in the following
format:
{ propertyName1 : value1, propertyName2 : value2 }
Arrays are represented in JSON with square brackets in the
following format:
[ value1, value2, value3 ]
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -8
JSON format example
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -9
Invoice _1= {
order-id: 1,
customer: {name: “John”, address: “Sydney”},
products:[ { code: “123”, quantity: 1}]
}
Invoice _3= {
order_id: 3,
customer: {name: “Smith”,
address: “Melbourne”,
contact: “12345”},
products: [{ code: “123”, quantity: 20},
{ code: “456”, quantity:2}]
delivery: “express”
}
Document Databases
Document database stores data in semi-structured
documents
Document structure is flexible
Provide own query syntax (different to standard SQL)
Usually has powerful index support
Examples:
XML based database
JSON based database: MongoDB, CouchDB
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -10
Outline
Overview of Document Databases
MongoDB Data Model
MongoDB CRUD operations
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -11
Matching Terms in SQL and MongoDB
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou)
SQL MongoDB
Database Database
Table Collection
Index Index
Row BSON document
Column BSON field
Primary key _id field
Join Embedding and referencing
$lookup in aggregation (since
3.2)
02 -12
MongoDB Document Model
TFN Name Email age
12345 Joe Smith joe@gmail.com 30
54321 Mary Sharp mary@gmail.com 27
{ _id: 12345,
name: “Joe Smith”,
email: “joe@gmail.com”,
age: 30
}
{ _id: 54321,
name: “Mary Sharp”,
email: “mary@gmail.com”,
age: 27
}
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou)
users table in RDBMS
users collection in MongoDB
02-13
two rows
two documents
Column name is part of schema
Field name is part
of data
Repeated in every
document
Native Support for Array
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou)
{ _id: 12345,
name: “Joe Smith”,
emails: [“joe@gmail.com”, “joe@ibm.com”],
age: 30
}
{ _id: 54321,
name: “Mary Sharp”,
email: “mary@gmail.com”,
age: 27
}
02 -14
TFN Name Email age
12345 Joe Smith joe@gmail.com 30
54321 Mary Sharp mary@gmail.com 27
, joe@ibm.com ??
Native Support for Embedded
Document
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou)
{ _id: 12345,
name: “Joe Smith”,
email: [“joe@gmail.com”, “joe@ibm.com”],
age: 30
}
{ _id: 54321,
name: “Mary Sharp”,
email: “mary@gmail.com”,
age: 27,
address: { number: 1,
name: “cleveland street”,
suburb: “chippendale”,
zip: 2008
}
}
02 -15
TFN Name Email age address
12345 Joe Smith joe@gmail.com 30
54321 Mary Sharp mary@gmail.com 27 1 cleveland street,
chippendale, NSW
2008
MongoDB data types
Primitive types
String, integer, boolean (true/false), double, null
Predefined special types
Date, object id, binary data, regular expression, timestamp, and a
few more
DB Drivers implement them in language-specific way
The interactive shell provides constructors for all
ISODate(“2012-09-11 18:00:00”)
Array and object
Field name is of string type with certain restrictions
“_id” is reserved for primary key
cannot start with “$”, cannot contain “.” or null
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -16
http://docs.mongodb.org/manual/reference/bson-types/
http://docs.mongodb.org/manual/reference/bson-types/
Data Modelling
Key design decision in MongoDB data modelling involves
how to represents relationship between data
How many collections should we use
What is the rough document structure in each collection
Embedding or Referencing
Which object should have its own Collection
And reference the id in other collection
Which object can be embedded in other object
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou)
http://www.mongodb.org/display/DOCS/Schema+Design
02 -17
http://www.mongodb.org/display/DOCS/Schema+Design
Referencing
References store the relationships between data by
including links or references from one document to another.
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -18
Embedding
Embedded documents capture relationships between data
by storing related data in a single document structure.
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -19
_id is not required
“Schema” Design Example
A fully normalized relational
model would have the
following tables:
User
Post
Comment
PostLink
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou)
http://docs.mongodb.org/manual/applications/data-models/
02-20
http://docs.mongodb.org/manual/applications/data-models/
MongoDB schema design
Using three collections
User collection
Post collection (with links to User, Comment, and Post itself)
Comment Collection(with links to User)
Using two collections
User collection
Post collection (with embedded Comment object and links to User and
Post itself
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02-21
Two Collections Schema
Two collections
User collection
Post collection (with embedded Comment object and links to User and
Post itself )
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -22
{ _id: “p1”,
author: “u1” ,
title: “A nice day”,
date: 2012-09-10,
comments: [
{ author: “u2”,
content: “nice here too”,
date: 2012-09-11,
}
]
backlinks: [“p2”]
}
{ _id: “u1”,
name: “user1”,
password: “bq7e0dx…”,
email: “user1@gmail.com”
}
{ _id: “u2”,
name: “user2”,
password: “mb8xfv…”,
email: “user2@gmail.com”
}
{ _id: “p2”,
author: “u2”
title: “NoSQL is dead”,
date: 2012-09-11,
tags: [“MongoDB”, “HBase”],
comments: [
{ author: “u1”,
content: “nonsense”
date: 2012-09-11
}
]
}
User collection:
Post collection:
This post does not have tags, so no “tags” field
This post does not have links pointing to
it, so no “backlink” field
Each user profile is saved as a JSON document
An array of Comment objects
Tags and backlinks are stored as
array
Three Collections Schema
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -23
Three collections
User collection
Post collection (with links to User, Comment, and Post itself)
Comment Collection(with links to User)
{ _id: “u1”,
name: “user1”,
password: “bq7e0dx…”,
email: “user1@gmail.com”
}
{ _id: “u2”,
name: “user2”,
password: “mb8xfv…”,
email: “user2@gmail.com”
}
User collection: Post collection:
{ _id: “p1”,
author: “u1” ,
title: “A nice day”,
date: 2012-09-10,
comments: [“c2”],
backlinks: [“p2”]
}
{ _id: “p2”,
author: “u2”
title: “NoSQL is dead”,
date: 2012-09-11,
tags: [“MongoDB”, “HBase”],
comments: [ “c1” ]
}
Comment collection:
{ _id: “c1”,
author: “u1” ,
content: “nonsense”,
date: 2012-09-11,
}
{ _id: “c2”,
author: “u2” ,
content: “nice here too”,
date: 2012-09-11,
}
Two Collections vs. Three Collections
Which one is better?
Hard to tell by schema itself, we need to look at the actual application to
understand
Typical data feature
• What would happen if a post attracts lots of comments?
Typical queries
• Do we want to show all comments when showing a post, or only the latest few, or not at
all?
• Do we need to produce statistics based on comment itself?
Atomicity consideration
• Is there “all or nothing” update requirement with respect to post and comment
Other design variation?
In three collection schema, store post-comment link information in Comment collection
instead of Post collection?
Embed the recent comments in Post?
One User collection with embedded Post and Comment objects?
One User collection with user, post and comment documents?
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -24
General Schema Design Guideline
Depends on data and intended use cases
“independent” object should have its own collection
composition relationship are generally modelled as embedded relation
Eg. ShoppingOrder and LineItems, Polygon and Points belonging to it
aggregation relationship are generally modelled as links (references)
Eg. Department and Employee
Many-to-Many relationship are generally modelled as links (references)
Eg. Course and Students enrolled in a course
If part-objects are always required when whole-object is queried, embed the
part-object
We always want to display line items when displaying shopping order
We always want to display Comments along with the blog Post;
We always want to get Credit Card billing address when querying credit card
information;
But we might not always want to get all students enrolled when querying about a
course.
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -25
Course information page
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -26
Outline
Overview of Document Databases
MongoDB Data Model
MongoDB CRUD Operations
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -27
MongoDB Queries
In MongoDB, a read query targets a specific collection. It
specifies criteria, and may include a projection to specify
fields from the matching documents; it may include modifier
to limit, skip, or sort the results.
A write query may create, update or delete data. One query
modifies the data of a single collection. Update and delete
query can specify query criteria
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -28
http://docs.mongodb.org/manual/core/crud-introduction/
Read Operation Interface
db.collection.find()
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -29
Find at most 5 documents in the users collection with age field
greater than 18, return only the name and address field of
each document.
Read Query Example
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou) 02 -30
Find documents in the users collection with age field greater
than 18, sort the results in ascending order by age
Read Query Features
Users can find data using any criteria in MongoDB
Does not require indexing
Indexing can improve performance (week 4)
Query criteria are expressed as BSON document (query object)
Individual condition is expressed using predefined selection operator, eg. $gt is the
operator for “greater than”
Query projection are expressed as BSON document as well
COMP5338 “Advanced Data Models” – 2018 ( Y. Zhou)
SQL MongoDB Query in Shell
select * from user db.user.find() or db.user.find({})
select name, age from user db.user.find({},{name:1,age:1,_id:0})
select * from user
where name = “Joe Smith”
db.user.find({name: “Joe Smith”})
select * from user
where age < 30
db.user.find({age: {$lt:30}})
02 -31
Querying Array field
MongoDB provide various features for querying array field
https://docs.mongodb.com/manual/tutorial/query-arrays/
The syntax are similar to querying simple type field
db.users.find({emails: “joe@gmail.com”})
Find user(s) whose email include “joe@gmail.com”.
db.users.find({“emails.0”: “joe@gmail.com”})
Find user(s) whose first email is “joe@gmail.com”.
db.users.find({emails: {$size:2}})
Find user(s) with 2 emails
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou)
{ _id: 12345,
name: “Joe Smith”,
emails: [“joe@gmail.com”, “joe@ibm.com”],
age: 30}
{ _id: 54321,
name: “Mary Sharp”,
email: “mary@gmail.com”,
age: 27}
02 -32
Querying Embedded Document
Embedded Document can be queried as a whole, or by
individual field, or by combination of individual fields
db.user.find({address: {number: 1, name: “pine street”, suburb:
“chippendale”, zip: 2008}})
db.user.find({“address.suburb”: “chippendale”})
db.user.find({“address.name”: “pine street”, “address.suburb”:
“chippendale”})
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou)
{ _id: 12345,
name: “Joe Smith”, email: [“joe@gmail.com”, “joe@ibm.com”], age: 30,
address: {number: 1, name: “pine street”, suburb: “chippendale”, zip: 2008 }
}
{ _id: 54321,
name: “Mary Sharp”, email: “mary@gmail.com”,age: 27,
address: { number: 1, name: “cleveland street”,suburb: “chippendale”,zip: 2008 }
}
http://docs.mongodb.org/manual/tutorial/query-documents/#embedded-documents
02 -33
http://docs.mongodb.org/manual/tutorial/query-documents/
Write Query- Insert Operation
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -34
Insert a new document in users collection.
Insert Example
db.user.insertOne({_id: 12345, name: “Joe Smith”, emails:
[“joe@gmail.com”, “joe@ibm.com”],age: 30})
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou)
user collection { _id: 12345, name: “Joe Smith”, emails: [“joe@gmail.com”, “joe@ibm.com”],
age: 30
}
{ _id: 54321,
name: “Mary Sharp”, email: “mary@gmail.com”, age: 27,
address: { number: 1,
name: “cleveland street”,
suburb: “chippendale”,
zip: 2008
}
}
db.user.insertOne({ _id: 54321, name: “Mary Sharp”, email:
“mary@gmail.com”, age: 27,
address: { number: 1, name: “cleveland street”, suburb:
“chippendale”, zip: 2008}})
02 -35
Insert Behavior
If the collection does not exist, the operation will create one
If the new document does not contain an “_id” field, the
system will adds an “_id” field and assign a unique value to
it.
If the new document does contain an “_id” field, it should
have a unique value
Two other operations:
insertMany
Insert many documents
Insert
Major language APIs only support insertOne and insertMany
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -36
Write Query – Update Operation
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -37
Has the same effect as the following SQL:
Two other operations: updateOne, replaceOne
Updates operators
Modifying simple field: $set, $unset
db.user.updateOne({_id: 12345}, {$set: {age: 29}})
db.user.updateOne({_id:54321}, {$unset: {email:1}}) // remove the field
Modifying array elements: $push, $pull, $pullAll
db.user.updateOne({_id: 12345}, {$push: {emails: “joe@hotmail.com”}})
db.user.updateOne({_id: 54321},
{$push: {emails: {$each: [“mary@gmail.com”, “mary@microsoft.com”]}}})
db.user.updateOne({_id: 12345}, {$pull: {emails: “joe@ibm.com”}})
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou)
{ _id: 12345,
name: “Joe Smith”,
emails: [“joe@gmail.com”, “joe@ibm.com”],
age: 30}
{ _id: 54321,
name: “Mary Sharp”,
email: “mary@gmail.com”,
age: 27}
{ _id: 12345,
name: “Joe Smith”,
emails: [“joe@gmail.com”, “joe@hotmail.com”],
age: 29}
{ _id: 54321,
name: “Mary Sharp”,
emails: [“mary@gmail.com”, “mary@microsoft.com”]
age: 27}
http://www.mongodb.org/display/DOCS/Updating
02 -38
http://www.mongodb.org/display/DOCS/Updating
Write Operation - Delete
db.user.deleteMany();
Remove all documents in user collection
db.user.deleteMany({age: {$gt:18}})
Remove all documents matching a certain condition
db.user.deleteOne({_id: 12345})
Remove one document matching a certain condition
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -39
Isolation of write operation
The modification of a single document is always atomic
It does not leave a document as partially updated.
A concurrent read will not see a partially updated document
This is true even if the operation modifies multiple embedded
documents within a single document
Read Uncommitted
Concurrent read operation may see document that has been
updated but not yet committed, or not durable
If a write operation is subsequently rolled back, a concurrent read
may return the updated value before it is rolled back
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -40
Single Document Atomicity
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -41
db.inventory.insertMany( [
{ item: "canvas", qty: 100, size: { h: 28, w: 35.5, uom: "cm" }, status: "A" },
{ item: "journal", qty: 25, size: { h: 14, w: 21, uom: "cm" }, status: "A" },
{ item: "paper", qty: 100, size: { h: 8.5, w: 11, uom: "in" }, status: "D" }]);
db.inventory.find({item: "paper"})
db.inventory.updateOne(
{ item: "paper" },
{ $set: { "size.uom": "cm", status: "P" }
}
) { item: "paper", qty: 100,
size: { h: 8.5, w: 11, uom: "in" },
status: "D" }]);
{ item: "paper", qty: 100,
size: { h: 8.5, w: 11, uom: “cm" },
status: “P" }]);
{ item: "paper", qty: 100,
size: { h: 8.5, w: 11, uom: “cm" },
status: “D" }]);
Isolation of write operation
If a write operation modifies multiple documents
(insertMany, updateMany, deleteMany), the operation as a
whole is not atomic, and other operations may interleave.
Multi-Document Transactions is supported in version 4.0
Other mechanisms were used in earlier versions
The $isolated operator can prevents a write operation that affects
multiple documents from yielding to other reads or writes once the
first document is written
All those mechanisms have great performance impact and
are recommended to avoid if possible, document
embedding is recommended as an alternative
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -42
Write Operation – interleaving Scenario
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -43
{age: 21, status: “U”}
{age: 23, status: “S”}
{age: 17, status: “E”}
{age: 25, status: “R”}
{age: 15, status: “S”}
{age: 16, status: “C”}
{age: 19, status: “O”}
{age: 22, status: “L”}
db.users.updateMany(
{ age: { $gt: 18 } },
{ $set: { status: “A” } }
)
users collection
{age: 21, status: “U”}
{age: 23, status: “S”}
{age: 25, status: “R”}
{age: 19, status: “O”}
{age: 22, status: “L”}
{age: 21, status: “A”}
{age: 23, status: “A”}
{age: 25, status: “A”}
{age: 19, status: “O”}
{age: 22, status: “L”}
db.users.find(
{ age: { $gt: 20 } }
)
{age: 21, status: “A”}
{age: 23, status: “A”}
{age: 25, status: “A”}
{age: 22, status: “L”}
Read returned documents
write is on going, a
read query comes
{age: 21, status: “A”}
{age: 23, status: “A”}
{age: 25, status: “A”}
{age: 19, status: “A”}
{age: 22, status: “A”}
Write finishes
A write query comes
Write Operation – Isolation Scenario
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -44
{age: 21, status: “U”}
{age: 23, status: “S”}
{age: 17, status: “E”}
{age: 25, status: “R”}
{age: 15, status: “S”}
{age: 16, status: “C”}
{age: 19, status: “O”}
{age: 22, status: “L”}
users collection
{age: 21, status: “U”}
{age: 23, status: “S”}
{age: 25, status: “R”}
{age: 19, status: “O”}
{age: 22, status: “L”}
{age: 21, status: “A”}
{age: 23, status: “A”}
{age: 25, status: “A”}
{age: 19, status: “O”}
{age: 22, status: “L”}
db.users.find(
{ age: { $gt: 20 } }
)
{age: 21, status: “A”}
{age: 23, status: “A”}
{age: 25, status: “A”}
{age: 22, status: “A”}
Read has to wait
write is on going, a
read query comes
{age: 21, status: “A”}
{age: 23, status: “A”}
{age: 25, status: “A”}
{age: 19, status: “A”}
{age: 22, status: “A”}
Write finishes
db.users.updateMany(
{ age: { $gt: 18 } },
{ $set: { status: “A”, $isolated: 1 } }
)
A write query comes
Read returns the results
References
MongoDB online documents:
Mongo DB Data Models
http://docs.mongodb.org/manual/core/data-modeling-introduction/
MongoDB CRUD Operations
http://docs.mongodb.org/manual/core/crud-introduction/
Pramod J. Sadalage, Martin Fowler NoSQL distilled, Addison-
Wesley Professional; 1 edition (August 18, 2012)
https://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-
Persistence/dp/0321826620
COMP5338 "Advanced Data Models" - 2018 ( Y. Zhou) 02 -45
http://docs.mongodb.org/manual/core/data-modeling-introduction/
http://docs.mongodb.org/manual/core/crud-introduction/
https://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620
COMP5338 – Advanced Data Models
Administrative
Outline
Structured and Unstructured Data
Semi-structured Data
A Self-describing XML document
Another invoice with slightly different structure
JSON Data Format
JSON format example
Document Databases
Outline
Matching Terms in SQL and MongoDB
MongoDB Document Model
Native Support for Array
Native Support for Embedded Document
MongoDB data types
Data Modelling
Referencing
Embedding
“Schema” Design Example
MongoDB schema design
Two Collections Schema
Three Collections Schema
Two Collections vs. Three Collections
General Schema Design Guideline
Course information page
Outline
MongoDB Queries
Read Operation Interface
Read Query Example
Read Query Features
Querying Array field
Querying Embedded Document
Write Query- Insert Operation
Insert Example
Insert Behavior
Write Query – Update Operation
Updates operators
Write Operation - Delete
Isolation of write operation
Single Document Atomicity
Isolation of write operation
Write Operation – interleaving Scenario
Write Operation – Isolation Scenario
References