COMP2100/COMP6442
Persistent Data
Sid Chi-Kin Chau [Lecture 9]
Goals of This Lecture
• What is Persistent Data? And How?
• Bespoke
• Serialization
• XML
• JSON
• Compare Pros and Cons
What is Persistent Data?
• A critical task for applications is to save and retrieve data
• Permanent data (storage of data from working memory)
• It can be updated, but not as frequently as transient/volatile data
• It is stored in a database or on an SSD, hard disk, or magnetic tape
• Why do we want permanent data?
• Holding data only in volatile memory has disadvantages (it is lost when the application stops)
• To be used and reused (save and load), and to be fault tolerant
• To be checked and validated for authentication
• How can we store data persistently?
• The choice of the persistence method is part of the design of an application
• Files (JSON, XML, images, …)
• Databases
Uses of Data and Storage
Types                          | Use cases                                        | Formats
Text files (unstructured data) | Word processing                                  | Raw text (ASCII, UTF-8); proprietary word processing formats such as .doc (generally unstructured)
Structured text files          | Spreadsheet, sensor data, simple structured data | csv, tsv, bespoke, XML, JSON
Graphics                       | Images                                           | png, jpeg (lossy), gif, bmp
Audio/Movie                    | Lecture recordings, music                        | mp3, mp4 (lossy)
Data compression               | Large file storage                               | zip, tar, rar, …
Which is the Best Data Format?
• Use case
• What does your application do?
• What kind of data do you have?
• Are there any restrictions to meet?
• Software licenses
• Storage limitations
• Rapid access to data
• Rapid development
Aspects to Consider
• Programming Agility
• Easy to develop (no overhead) and code
• Extensibility
• Can data be easily extended? (e.g., add new fields, attributes, …)
• Is it easy to add new fields in a CSV file?
• Is it easy to add new attributes in a database?
• Portability
• Will other applications access the data?
• Will it run on other hardware?
Aspects to Consider
• Robustness
• Bespoke vs XML vs JSON
• Well-designed and structured format
• Use of a schema (how do you verify that your data is correctly formatted?)
• Lack of a schema can cause interoperability problems
• Size vs Completeness
• Lossy vs Lossless
• Audio/Image vs financial data/scientific data
• Internationalization
• ASCII vs UTF-8
• Who will use the data (audience)?
Bespoke and Serialization
• Bespoke data files
• Define your own persistent data format
• Write your own data formatting and checking methods
• Not often used in industry
• Not robust and may incur extra bugs
• Serialization
• Directly storing binary class data (and even whole executable class)
• Serialization presents technical issues
• Programming language dependent and platform dependent (big- or little-endian)
• Loss of object references
• Security issues
• Deserialization: convert persistent data back into a copy of the class object
Bespoke and Serialization
• Bespoke
• Implement a simple logging application
• Save/load log errors to/from a text file
• Java Serialization
• Implement a simple application
• Terminal command: od -c data.ser
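The bespoke logging idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `BespokeLog` and the one-entry-per-line `LEVEL|message` format are invented here, not taken from the lecture demo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical bespoke format: one log entry per line, written as "LEVEL|message".
class BespokeLog {
    // Save each entry (a {level, message} pair) as one line of text.
    static void save(Path file, List<String[]> entries) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String[] entry : entries) {
            lines.add(entry[0] + "|" + entry[1]);
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Load entries back; we must write our own format checking.
    static List<String[]> load(Path file) throws IOException {
        List<String[]> entries = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            // Split at the first '|' only, so the message itself may contain '|'.
            String[] parts = line.split("\\|", 2);
            if (parts.length != 2) {
                throw new IOException("Malformed log line: " + line);
            }
            entries.add(parts);
        }
        return entries;
    }
}
```

Note how fragile the format is: a message containing a line break, or a level containing `|`, would not survive a save/load round trip. This is the kind of extra bug a bespoke format can introduce.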
Serialization in Java
• Java Serialization
• Class must implement Serializable
• public class MyClass implements Serializable
• Load serialized data by creating an ObjectInputStream, reading the object from the stream, and casting it to the appropriate class type
• Save serialized data by creating an ObjectOutputStream and writing the object to the stream
• ArrayLists are serializable by default and are commonly used for serializing a data collection (many other classes, such as HashMap, are also serializable; check the documentation)
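The save and load steps above might look like the following minimal sketch. The class names `Student` and `SerializationSketch` and the fields are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;

// Hypothetical data class: it must implement Serializable to be written to the stream.
class Student implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final int age;

    Student(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

class SerializationSketch {
    // Save: wrap the file's output stream in an ObjectOutputStream and write the object.
    static void save(Path file, ArrayList<Student> students) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(students);
        }
    }

    // Load: read an Object back from the ObjectInputStream and cast it to the expected type.
    // Only do this with files you wrote yourself, never with untrusted input.
    @SuppressWarnings("unchecked")
    static ArrayList<Student> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (ArrayList<Student>) in.readObject();
        }
    }
}
```

The cast applies to the object returned by `readObject()`, not to the stream itself. Inspecting the saved file with `od -c` shows that the content is binary rather than readable text.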
Serialization in Java
• Deserialization of untrusted data is inherently dangerous and not recommended
• https://www.oracle.com/java/technologies/javase/seccodeguide.html
XML
• XML (eXtensible Markup Language)
• Open standard for general data formatting specifications
• Cross-platform and cross-language
• Wide industry support (W3C)
• Plenty of tools and programming libraries
• Long history of deployment
• Examples
• XHTML (the XML-based form of HTML)
• .docx (Word document) is represented using XML
XML
• XML Structure / Tree
• XML is case sensitive!
…
XML Example
• XML example:
XML
• XML parser error!
• Use the entity &lt; instead of a literal “<” in text content
• https://www.w3schools.com/xml/xml_syntax.asp
…
Two Options for XML in Java
• Two approaches:
• SAX
• Simple API for XML
• SAX treats XML as a stream and allows extraction of data as the stream is read; preferable for very large documents (gigabytes)
• DOM
• Document Object Model (structured around XML standard)
• Java DOM reads in the entire XML tree and generates the node objects
• SAX is generally faster and more memory-efficient than DOM
• DOM provides more structure than SAX (the whole tree is available in memory)
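A minimal sketch of the SAX approach, assuming a hypothetical document whose `<firstname>` elements hold the text we want (the element name and the class `SaxNames` are illustrative, not from the lecture). The handler receives callbacks as the stream is read and never holds the whole tree in memory.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxNames {
    // Collect the text of every <firstname> element while the XML is being streamed.
    static List<String> firstNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <firstname>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals("firstname")) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("firstname")) {
                    names.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return names;
    }
}
```

For a large file, the `ByteArrayInputStream` would be replaced by a file input stream; the handler code stays the same.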
XML DOM
[Figure: DOM tree. A root element contains nested child elements; the leaf elements hold the text nodes "Homer", "Simpson", "Johnny", and "Goodman".]
XML DOM
• DOM requires a number of steps to save data to file:
• Create a DocumentBuilder (uses DocumentBuilderFactory)
• Document created from a DocumentBuilder object
• Create and append elements
• Transform the XML to a Result (output file)
• Similar series of steps for loading XML/DOM:
• DocumentBuilderFactory
• DocumentBuilder
• Document
• Class data structures
import javax.xml.parsers.*;
import javax.xml.transform.*;
import org.w3c.dom.*;
…
…
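The save and load steps listed on this slide could be sketched as follows. This is not the elided lecture code: the class `DomDemo` and the element names `person`, `firstname`, and `lastname` are assumptions made for illustration, and the output goes to a string rather than a file to keep the sketch self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomDemo {
    // Save: DocumentBuilderFactory -> DocumentBuilder -> Document -> elements -> Transformer.
    static String save(String first, String last) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();

        Element person = doc.createElement("person"); // hypothetical element names
        doc.appendChild(person);
        Element firstname = doc.createElement("firstname");
        firstname.setTextContent(first);
        person.appendChild(firstname);
        Element lastname = doc.createElement("lastname");
        lastname.setTextContent(last);
        person.appendChild(lastname);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        // A StreamResult built from a File would write to disk instead.
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Load: parse the whole tree into memory, then copy values out of the nodes.
    static String[] load(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        String first = doc.getElementsByTagName("firstname").item(0).getTextContent();
        String last = doc.getElementsByTagName("lastname").item(0).getTextContent();
        return new String[] {first, last};
    }
}
```

Because the text is set through the DOM API, special characters such as `<` are escaped as `&lt;` when the document is written out, so no manual escaping is needed.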
Pros and Cons of XML
Pros
• Robust, extensible
• More human readable
• Platform independent
• Programming language independent
• XML supports Unicode (international encoding)
• Easy format verification
• Can represent many data structures (trees, lists…)
• Native support in Java
Cons
• XML syntax is verbose and redundant
• XML file sizes are usually large as a result
• Does not support arrays
JSON
• JavaScript Object Notation (JSON)
• Like XML, JSON is a widely used open standard for data formatting
• Originally designed for sending data between web client and server, but also very useful for data storage
• Built around attribute-value pairs
• Produces smaller and more readable documents than XML
• JSON example:
[{"age":11,"name":"Bart"},
{"age":40,"name":"Homer"}]
• An attribute value can be an object, string, array, number, boolean, or null:
{"attribute-name":{JSON object}}
{"attribute-name":"string"}
{"attribute-name":[array]}
{"attribute-name":1} (number)
{"attribute-name":true} (boolean)
{"attribute-name":null}
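To show the shape of the example document, here is a hand-written sketch that produces it from Java values. In practice a JSON library (for example Gson or Jackson) would normally be used; the class `JsonSketch` and its methods are hypothetical, and the escaping covers only a few common characters.

```java
import java.util.ArrayList;
import java.util.List;

class JsonSketch {
    // Wrap a Java string in JSON quotes, escaping quotes, backslashes,
    // newlines and tabs (other control characters are not handled in this sketch).
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\t': sb.append("\\t"); break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // One JSON object with a number attribute and a string attribute.
    static String person(String name, int age) {
        return "{\"age\":" + age + ",\"name\":" + quote(name) + "}";
    }

    // A JSON array is a comma-separated list of values in square brackets.
    static String array(List<String> values) {
        return "[" + String.join(",", values) + "]";
    }

    public static void main(String[] args) {
        List<String> people = new ArrayList<>();
        people.add(person("Bart", 11));
        people.add(person("Homer", 40));
        System.out.println(array(people));
    }
}
```

Numbers are written without quotes and strings with straight double quotes, which is how JSON distinguishes the two types.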
Pros and Cons of JSON
Pros
• More lightweight
• Human readable
• Straightforward to implement
• Supports arrays and null
• Can easily distinguish boolean, number, and string types
• Data is available as JSON objects
Cons
• Lacks XML-specific language features (e.g., XML attributes)
• No native support in Java
• No display capabilities (not a markup language)
Database
• Database management systems (DBMS) are commonly used for storage of large volumes of data
• Fast and efficient large data retrieval and processing
• Parallel and distributed data retrieval and processing
• Relational databases
• Linking tables through unique identifiers to avoid problems of duplicating data entries
• Standardized data retrieval and processing commands (e.g., SQL)
Database Example
• Represent a person in a bespoke/csv file:
id, FullName, HomePhone, MobilePhone, WorkPhone
1, Alice, 555-555, 123-321, ?
2, Bob, ?, 123-222, ?
• Relational Database (RDB)
• SQL (Structured Query Language) is designed for data query and manipulation
Person
id | FullName
1  | Alice
2  | Bob
…  | …
ContactPhone
id | PhoneNumber
1  | 555-555
2  | 123-222
1  | 123-321
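The two-table layout can be mimicked in plain Java to show how the shared id links the tables. This is only an in-memory illustration, not a DBMS and not SQL; the class `PhoneBookSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the two tables; a real application would use a DBMS and SQL.
class PhoneBookSketch {
    // One row of the ContactPhone table: (id of the person, PhoneNumber).
    private static final class PhoneRow {
        final int personId;
        final String phoneNumber;

        PhoneRow(int personId, String phoneNumber) {
            this.personId = personId;
            this.phoneNumber = phoneNumber;
        }
    }

    private final Map<Integer, String> person = new LinkedHashMap<>(); // Person: id -> FullName
    private final List<PhoneRow> contactPhone = new ArrayList<>();     // ContactPhone rows

    void addPerson(int id, String fullName) {
        person.put(id, fullName);
    }

    // Reject phone rows whose id does not refer to an existing person.
    void addPhone(int personId, String phoneNumber) {
        if (!person.containsKey(personId)) {
            throw new IllegalArgumentException("Unknown person id: " + personId);
        }
        contactPhone.add(new PhoneRow(personId, phoneNumber));
    }

    // Find every phone number linked to a person's name through the shared id.
    List<String> phonesOf(String fullName) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Integer, String> p : person.entrySet()) {
            if (!p.getValue().equals(fullName)) {
                continue;
            }
            for (PhoneRow row : contactPhone) {
                if (row.personId == p.getKey()) {
                    result.add(row.phoneNumber);
                }
            }
        }
        return result;
    }
}
```

Unlike the csv layout, a person can have any number of phone numbers, and no placeholder such as `?` is needed for missing ones.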
Reference
• IBM developerWorks: 5 things you need to know about serialization
• https://developer.ibm.com/technologies/java/articles/j-5things1/
• Oracle serialization FAQ
• https://www.oracle.com/technetwork/java/javase/tech/serializationfaq-jsp-136699.html
• W3C XML standards pages
• https://www.w3.org/standards/
• JSON
• https://www.json.org/
• https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf