2021S2-workshop-week4-lab-solution
COMP20008 2021 Semester 2 Workshop Week 4
Using lxml to read XML data
We will use the lxml Python package. lxml provides several APIs (Application Programming Interfaces) for working with XML data. The first is the ElementTree API, which gives easy access to XML data as a tree-like structure. The full API reference is available online, but is probably less useful than the pandas API reference you encountered last week. The official lxml site does, however, have a tutorial which is quite thorough and makes a great reference.
As with any other Python package, you need an import statement to load it:
In [1]:
from lxml import etree
For this section we will work with the royal.xml file, which contains the names of some members of the British royal family. The code below simply displays the contents of that file; you can also open the file in a web browser or text editor. Look through the file and make sure you understand its content.
In [2]:
# Read the whole file as text; the with block closes the file automatically
with open("royal.xml", "r") as f:
    text = f.read()
print(text)
In order to load an XML file and to represent it as a tree in computer memory, you need to parse the XML file. The etree.parse() function parses the XML file that is passed in as a parameter.
In [3]:
xmltree = etree.parse("royal.xml")
The parse() function returns an ElementTree object, which represents the whole XML tree. Each node in the tree is represented by an Element object.
Use the getroot() method of an ElementTree object to get the root element of the XML tree. You can print the tag of an element using its tag property.
In [4]:
root = xmltree.getroot()
print(root.tag)
queen
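If royal.xml is not at hand, the same steps can be tried on XML held in a string. The snippet below is a minimal sketch: the XML literal is a made-up stand-in for royal.xml (an assumption, not the real file), and it uses etree.fromstring(), which returns the root Element directly instead of an ElementTree.

```python
from lxml import etree

# Hypothetical stand-in for royal.xml; the real file contains more elements.
sample = """<queen title="Queen Elizabeth II">
  <prince title="Charles, Prince of Wales"/>
  <princess title="Anne, Princess Royal"/>
</queen>"""

# fromstring() parses a string and returns the root Element,
# so no getroot() call is needed.
root = etree.fromstring(sample)

print(root.tag)   # queen
print(len(root))  # 2, the number of child elements
```

Parsing from a string is convenient for quick experiments; for files on disk, etree.parse() as used above remains the natural choice.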
Traversing the XML Tree
The following sections describe various methods for traversing the XML tree.
To visit all of the children of an element, you can iterate over the Element itself:
In [5]:
for e in root:
    print(e.tag)
prince
princess
prince
prince
You can use indexing to access the children of an element:
In [6]:
oldest_prince = root[0]
# print(type(oldest_prince))
print(oldest_prince.get("title"))
Charles, Prince of Wales
The find() method returns only the first matching child.
In [7]:
the_first_child_with_prince_tag = root.find("prince")
print(the_first_child_with_prince_tag.get("title"))
Charles, Prince of Wales
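The iterchildren() method allows you to iterate over children with a particular tag. The cell below loops over all children of the root; swapping in the call shown in the comment restricts the loop to the prince elements:
In [8]:
for child in root:  # or: for child in root.iterchildren(tag="prince"):
    print(child.get("title"))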
Charles, Prince of Wales
Anne, Princess Royal
Andrew, Duke of York
Edward, Earl of Wessex
There is also an iterdescendants() method, which iterates over all descendants of a particular node rather than only its direct children.
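The difference between the two methods is easiest to see side by side. This is a small sketch on a made-up XML string (an assumed, shortened stand-in for royal.xml) with two generations below the root.

```python
from lxml import etree

# Hypothetical, shortened stand-in for royal.xml.
root = etree.fromstring(
    '<queen title="Queen Elizabeth II">'
    '<prince title="Charles, Prince of Wales">'
    '<prince title="Prince William of Wales"/>'
    '<prince title="Prince Henry of Wales"/>'
    '</prince>'
    '<princess title="Anne, Princess Royal"/>'
    '</queen>'
)

# iterchildren() only looks one level below the element.
children = [e.get("title") for e in root.iterchildren(tag="prince")]

# iterdescendants() walks the whole subtree in document order.
descendants = [e.get("title") for e in root.iterdescendants(tag="prince")]

print(children)     # ['Charles, Prince of Wales']
print(descendants)  # Charles first, then William and Henry
```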
Exercise 1
Using royal.xml:
i) Write Python code to get the title attribute of the queen’s grandsons.
ii) Write Python code to get the full title of the only princess in the family tree.
In [10]:
# insert answer to 1 here
from lxml import etree  # import the library
xmltree = etree.parse("royal.xml")
root = xmltree.getroot()
# i) Get the title attribute of the queen's grandsons.
for child in root:  # iterate over the princes and princess directly under queen
    for grandson in child.iterchildren(tag="prince"):
        print(grandson.get("title"))
# ii) Get the full title of the only princess in the family tree.
the_only_princess = root.find("princess")
print(the_only_princess.get("title"))
Prince William of Wales
Prince Henry of Wales
Anne, Princess Royal
Accessing XML attributes
You can access the XML attributes of an element using its get() method or its attrib property.
In [11]:
print(root.attrib)
print(root.get("title"))
{'title': 'Queen Elizabeth II', 'marriedTo': 'Philip, Duke of Edinburgh'}
Queen Elizabeth II
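The two access styles behave differently when an attribute is missing. The sketch below illustrates this on a one-element XML string (an assumed example, not the contents of royal.xml), and also shows set() for adding an attribute.

```python
from lxml import etree

# Hypothetical single-element document used only for illustration.
root = etree.fromstring('<queen title="Queen Elizabeth II"/>')

# get() returns None, or a supplied default, when the attribute is absent.
print(root.get("marriedTo"))             # None
print(root.get("marriedTo", "unknown"))  # unknown

# attrib behaves like a dictionary, so membership tests work as usual.
print("title" in root.attrib)            # True

# set() adds a new attribute or overwrites an existing one.
root.set("marriedTo", "Philip, Duke of Edinburgh")
attributes = dict(root.attrib)
print(attributes)
```

Because attrib is dictionary-like, indexing it with a missing key raises KeyError, so get() is usually the safer choice for optional attributes.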
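Accessing XML text
Let’s now use another sample of XML data. Consider the file book.xml:
<book id="book001">
  <author>Salinger, J. D.</author>
  <title>The Catcher in the Rye</title>
  <language>English</language>
  <publish_date>1951-07-16</publish_date>
  <publisher>Little, Brown and Company</publisher>
  <isbn>0-316-76953-3</isbn>
  <description>A story about a few important days in the life of Holden Caulfield</description>
</book>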
This XML looks different from royal.xml in that each element has some text content. To access the text content of an element (the text between its start and end tags), use the text property of that element:
In [12]:
from lxml import etree
xmltree = etree.parse("book.xml")
root = xmltree.getroot()
for child in root:
    print(child.tag + ": " + child.text)
author: Salinger, J. D.
title: The Catcher in the Rye
language: English
publish_date: 1951-07-16
publisher: Little, Brown and Company
isbn: 0-316-76953-3
description: A story about a few important days in the life of Holden Caulfield
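One thing to watch for with the loop above: an element with no text content has text equal to None, and concatenating None with a string raises a TypeError. The sketch below uses a made-up fragment (an assumption, not the real book.xml) containing an empty isbn element to show one way of guarding against that, plus the findtext() shortcut.

```python
from lxml import etree

# Hypothetical fragment: <isbn/> is empty, so its text is None.
root = etree.fromstring(
    "<book><title>The Catcher in the Rye</title><isbn/></book>"
)

lines = []
for child in root:
    # Fall back to an empty string so the concatenation cannot fail.
    text = child.text or ""
    lines.append(child.tag + ": " + text)
print("\n".join(lines))

# findtext() looks up a child and returns its text in one call,
# with an optional default when the child does not exist.
title = root.findtext("title")
publisher = root.findtext("publisher", default="n/a")
print(title)      # The Catcher in the Rye
print(publisher)  # n/a
```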
Building XML data
Let’s go back to the book.xml example above. As usual, use the lxml library to parse the XML and get the root of the tree:
In [13]:
from lxml import etree
xmltree = etree.parse("book.xml")
root = xmltree.getroot()
To create a new XML element, use the etree.Element() function:
In [14]:
new_element = etree.Element("genre")
new_element.text = "Novel"
root.append(new_element)
# root[-1] is the last child, i.e. the newly appended element
print(etree.tostring(root[-1], pretty_print=True, encoding="unicode"))
<genre>Novel</genre>
Tip: You can create an entirely new XML tree by constructing the root element:
In [15]:
root = etree.Element("book")
You can also create a new element using the SubElement() function, which creates the element and appends it to the given parent in one step:
In [16]:
new_element = etree.SubElement(root, "price")
new_element.text = "23.95"
for e in root:  # check whether the new element has been added
    print(e.tag)
price
Use insert() to insert a new element at a specific location:
In [17]:
root.insert(1, etree.Element("country"))
root[1].text = "United States"
print(etree.tostring(root[1], pretty_print=True, encoding="unicode"))
<country>United States</country>
Serialising XML data (printing as web content or writing into a file)
You can serialise the whole XML tree by calling etree.tostring() with the root of the tree as the first parameter. With encoding="UTF-8" the result is a bytes object rather than a str, which is why the file is later opened in binary mode ("wb"):
In [18]:
output = etree.tostring(root, pretty_print=True, encoding="UTF-8")
for e in root:
    print(e.tag)
price
country
In [19]:
open("output.xml", "wb").write(output)
Out[19]:
73
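As an alternative to calling tostring() and writing the bytes yourself, an ElementTree object can write itself to a file. The following is a sketch under a few assumptions: the output path is a hypothetical file in a temporary directory, and xml_declaration=True is passed explicitly so that the `<?xml ...?>` header is included.

```python
import os
import tempfile

from lxml import etree

# Build a small tree, mirroring the cells above.
root = etree.Element("book")
etree.SubElement(root, "price").text = "23.95"

# Hypothetical output location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "output.xml")

# Wrap the root in an ElementTree and let lxml open, write and close the file.
etree.ElementTree(root).write(
    path, pretty_print=True, encoding="UTF-8", xml_declaration=True
)

# Read the file back to check what was written.
with open(path, "rb") as f:
    written = f.read()
print(written.decode("utf-8"))
```

This avoids leaving a file handle open, which the one-line open(...).write(...) idiom relies on the interpreter to clean up.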
Exercise 2
Write Python code to load the file “book.xml”, change the ISBN to “Unknown” and then write the result to “book-new.xml”.
In [20]:
# insert answer to 2 here
xmltree = etree.parse("book.xml")
root = xmltree.getroot()
root.find("isbn").text = "Unknown"
output = etree.tostring(root, pretty_print=True, encoding="UTF-8")
open("book-new.xml", "wb").write(output)
Out[20]:
346
JSON
Python has a built-in json module that allows you to process JSON files. You can find out more about it by reading its page at python.org. W3Schools also provides a good introductory tutorial, while Real Python has a more comprehensive one.
Below you can see a sample JSON file consisting of some information about a book.
In [21]:
str_json = '''
{
  "id": "book001",
  "author": "Salinger, J. D.",
  "title": "The Catcher in the Rye",
  "price": "44.95",
  "language": "English",
  "publish_date": "1951-07-16",
  "publisher": "Little, Brown and Company",
  "isbn": "0-316-76953-3",
  "description": "A story about a few important days in the life of Holden Caulfield"
}
'''
Using the json library, we can manipulate this JSON data as follows.
In [22]:
import json
Data = json.loads(str_json)  # parse the JSON string into a Python dict
print(type(Data))
print(Data["price"])
# modify any attribute
Data["isbn"] = "Unknown"
# save JSON file
with open("book_test.json", "w") as f:
    json.dump(Data, f, indent=2)
# load JSON file
with open("book_test.json") as f:
    Data = json.load(f)
<class 'dict'>
44.95
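The json module has two pairs of functions: load()/dump() work with file objects, while loads()/dumps() work with strings. The sketch below round-trips a small record through a string; the in_print and editions fields are hypothetical additions used only to show how Python values map to JSON.

```python
import json

# Hypothetical record; "in_print" and "editions" are not in the workshop data.
record = {
    "title": "The Catcher in the Rye",
    "price": "44.95",
    "in_print": True,
    "editions": None,
}

# dumps() serialises to a string: True becomes true and None becomes null.
text = json.dumps(record, indent=2)
print(text)

# loads() parses the string back into equivalent Python objects.
restored = json.loads(text)

# The price is stored as a JSON string, so convert it before doing arithmetic.
price = float(restored["price"])
print(price + 5)
```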
Exercise 3
Add Spanish and German to the JSON file above as two extra languages represented as an array. Save this file as book2.json. Validate it on JSONLint.
In [23]:
# insert answer to 3 here and save as book2.json
Data["language"] = ["English", "Spanish", "German"]
with open("book2.json", "w") as f:
    json.dump(Data, f, indent=2)
In [24]:
# load and check the answer
with open("book2.json") as f:
    Data = json.load(f)
Data
Out[24]:
{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': ['English', 'Spanish', 'German'],
 'publish_date': '1951-07-16',
 'publisher': 'Little, Brown and Company',
 'isbn': 'Unknown',
 'description': 'A story about a few important days in the life of Holden Caulfield'}
Exercise 4 (If you have time)
Now modify the publish_date field. Make it an array of two objects, each with an edition property (first, second) and a date property (1951-07-16, 1979-01-01) respectively. Save this file as book3.json.
In [25]:
# insert answer to 4 here and save as book3.json
obj1 = {"edition": "first", "date": "1951-07-16"}
obj2 = {"edition": "second", "date": "1979-01-01"}
Data["publish_date"] = [obj1, obj2]
with open("book3.json", "w") as f:
    json.dump(Data, f, indent=2)
In [26]:
# load and check the answer
with open("book3.json") as f:
    Data = json.load(f)
Data
Out[26]:
{'id': 'book001',
 'author': 'Salinger, J. D.',
 'title': 'The Catcher in the Rye',
 'price': '44.95',
 'language': ['English', 'Spanish', 'German'],
 'publish_date': [{'edition': 'first', 'date': '1951-07-16'},
  {'edition': 'second', 'date': '1979-01-01'}],
 'publisher': 'Little, Brown and Company',
 'isbn': 'Unknown',
 'description': 'A story about a few important days in the life of Holden Caulfield'}
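Once loaded, nested JSON is just nested Python lists and dictionaries, so it can be navigated by chaining indexes and keys. This short sketch assumes a trimmed copy of the book3.json structure embedded as a string, so that it runs without the file.

```python
import json

# Trimmed copy of the structure saved in book3.json (embedded for convenience).
book = json.loads('''
{
  "title": "The Catcher in the Rye",
  "language": ["English", "Spanish", "German"],
  "publish_date": [
    {"edition": "first", "date": "1951-07-16"},
    {"edition": "second", "date": "1979-01-01"}
  ]
}
''')

# A JSON array becomes a list, so it can be looped over directly.
for edition in book["publish_date"]:
    print(edition["edition"], edition["date"])

# Chain a key, a list index and another key to reach a nested value.
second_date = book["publish_date"][1]["date"]
print(second_date)  # 1979-01-01

# Collect one field from every object in the array.
editions = [entry["edition"] for entry in book["publish_date"]]
print(editions)     # ['first', 'second']
```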