SENG 265 – SUMMER 2016 SOFTWARE DEVELOPMENT METHODS ASSIGNMENT 4 UNIVERSITY OF VICTORIA
Due: Tuesday, July 26th, 2016 by 11:00pm.
1 Assignment Overview
The HTML language, which is used to describe the layout and content of web pages, has a famously verbose syntax: relatively simple formatting and layout instructions can often require several layers of bulky HTML tags. Modern HTML is very flexible for specifying visual aspects of the displayed content. However, extracting information from the HTML representation can be difficult. The goal of this assignment is to write a Python 3 program which converts HTML tables to a CSV representation. The CSV output will follow the specification given in Assignment 3.
2 HTML Tables
Tables in HTML are specified with the <table> tag. A brief tutorial on the <table> tag, along with interactive examples, can be found at http://www.w3schools.com/html/html_tables.asp. Note that whitespace in HTML is generally ignored: spaces, newlines and tabs are collapsed into a single space when the page is rendered. Line breaks are specified by the <br> tag.
Consider the HTML table below (which appears as the first example in the tutorial linked above).
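<table>
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table>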
The table described by the HTML code above would be rendered by a web browser in a similar format to the table below.
The <tr> and </tr> tags enclose the data for each row of the table, and the <td> and </td> tags enclose the contents of each cell. The <th> and </th> tags enclose the contents of header cells, but their use is optional (some authors use regular <td> tags for header cells). Cells of a table may contain HTML code, including other HTML tables.
Tag names are not case sensitive, so ‘<TD>’ and ‘<Td>’ are both valid forms of the ‘<td>’ tag. Any HTML tag may contain attributes that change its appearance. The most common attribute in modern HTML is ‘style’. For example, to make the contents of a particular cell boldfaced, the attribute style="font-weight: bold;" can be added to the <td> tag:
<td style="font-weight: bold;">Jill</td>
Additionally, any HTML tag may contain whitespace after the tag name, between attributes or before the closing ‘>’ character. Whitespace is not permitted between the opening ‘<’ character and the tag name. For example, ‘<td >’ and ‘</td >’ are valid tags, but ‘< /td>’ and ‘< td >’ are not.
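The whitespace and case rules above can be expressed as a regular expression. The sketch below is one possible pattern, not a required approach. The names OPEN_TD and is_open_td are hypothetical, and the pattern relies on the constraint listed later in this section that no ‘>’ character appears inside a tag.

```python
import re

# Hypothetical pattern for an opening td tag: the tag name must follow '<'
# immediately, and may be followed by whitespace and attributes before '>'.
OPEN_TD = re.compile(r'<td(\s[^>]*)?>', re.IGNORECASE)

def is_open_td(text):
    """Return True if text is, in its entirety, a valid opening td tag."""
    return OPEN_TD.fullmatch(text) is not None

print(is_open_td('<td>'))                             # True
print(is_open_td('<TD >'))                            # True
print(is_open_td('<td style="font-weight: bold;">'))  # True
print(is_open_td('< td >'))                           # False
```

Requiring whitespace before any attributes keeps the pattern from accepting a different tag whose name merely begins with ‘td’.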
Since whitespace in HTML is generally ignored, there is no requirement that the table be laid out in a readable way in the HTML code. The table in the example above could also be represented by the code below.
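<table><tr><th>Firstname</th><th>Lastname</th>
<th>Age</th></tr><tr><td>Jill</td><td>Smith</td><td>50</td></tr>
<tr><td>Eve</td><td>Jackson</td><td>94</td></tr></table>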
This assignment will add the following extra constraints to the basic HTML table specification.
• To be considered valid, a test input must be valid HTML. For example, tags like <td> can only occur inside of a <tr> tag, which in turn must be inside a <table> tag. All opening tags must have a matching closing tag (note that some HTML tags, like <br>, are singular and do not need a closing tag), and vice versa.
• Between the opening angle bracket (<) and closing angle bracket (>) of a tag, no other instances of closing angle brackets are permitted (including inside of attributes).
• Commas may not appear inside the data for a cell. However, other aspects of the HTML which are not cell content (such as the style attributes of <td> tags) may contain commas.
• Cells may contain any data, including other HTML tags, but may not contain nested <table> tags. The prohibition on comma use applies to all contents of each cell, including HTML tags. In other words, if the comma character appears between the opening <td> tag and its matching </td> tag, the input will be considered invalid.
• There is no requirement that each row of the table contain the same number of columns.
• Every HTML table must have at least one row.
• Every row of an HTML table must contain at least one cell.
• The rowspan and colspan attributes, which are used to make cells span multiple rows or columns, are not permitted.
Firstname   Lastname   Age
Jill        Smith      50
Eve         Jackson    94
3 HTML-to-CSV Converter
Your task is to write a Python 3 program called table_to_csv.py which reads HTML from standard input and outputs a CSV representation of each table in the input, including any header cells specified with <th> (if present). See the Assignment 3 specification for documentation on the CSV format.
The resulting CSV data will be printed to standard output in the following format:
TABLE 1:
TABLE 2:
TABLE 3:
…
Your implementation may assume that the input table complies with the constraints given in the previous section, and must also meet the following requirements.
• All runs of one or more spaces, newlines, tabs, or other whitespace should be collapsed into a single space.
• Within a table cell, all HTML tags are to be left intact.
• The contents of each table cell should be stripped of all leading and trailing whitespace before being output. For example, the cell ‘<td>   Lemon     Meringue   </td>’ should be output as ‘Lemon Meringue’ (note that the multiple spaces between the two words are also collapsed into one space).
• Every row of the output CSV spreadsheet must contain the same number of columns (recall that the number of columns in a row of a CSV spreadsheet is the number of commas in the row plus one). If the rows of the HTML table contain differing numbers of columns, then the number of columns in the output spreadsheet should equal the number of columns in the widest row of the input table. Other rows should be padded with blank cells to meet the column requirement.
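The whitespace and padding requirements above can be sketched as two small helpers. This is only an illustration under stated assumptions: the function names normalize_cell and pad_rows are hypothetical, and the sample rows are made-up data, not part of any test case.

```python
import re

def normalize_cell(text):
    """Collapse each run of whitespace to one space, then strip both ends."""
    return re.sub(r'\s+', ' ', text).strip()

def pad_rows(rows):
    """Pad every row with blank cells up to the width of the widest row.

    Assumes at least one row, as the constraints in Section 2 guarantee.
    """
    width = max(len(row) for row in rows)
    return [row + [''] * (width - len(row)) for row in rows]

# Made-up sample data: the second row is two cells shorter than the first.
rows = [['  Lemon    Meringue \n', 'Pie', '3'], ['Tart']]
cleaned = [[normalize_cell(cell) for cell in row] for row in rows]
for row in pad_rows(cleaned):
    print(','.join(row))
```

The second row prints as ‘Tart,,’, which contains two commas and therefore three columns, matching the first row.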
Section 2 contains two different representations of the same HTML table. Both should produce the output below when provided as input to a correct implementation.
TABLE 1:
Firstname,Lastname,Age
Jill,Smith,50
Eve,Jackson,94
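As a rough illustration of the output format, the sketch below turns tables that have already been parsed into output lines. The function name render_tables and the nested-list representation are assumptions made for this example, and the sketch does not address whether anything separates consecutive tables.

```python
def render_tables(tables):
    """Return the output lines for a list of tables.

    Each table is a list of rows, and each row is a list of cell strings
    that have already been cleaned and padded to a common width.
    """
    lines = []
    for number, rows in enumerate(tables, start=1):
        lines.append('TABLE {}:'.format(number))
        for row in rows:
            lines.append(','.join(row))
    return lines

# The example table from Section 2, as it might look after parsing.
example = [[
    ['Firstname', 'Lastname', 'Age'],
    ['Jill', 'Smith', '50'],
    ['Eve', 'Jackson', '94'],
]]
print('\n'.join(render_tables(example)))
```

Returning a list of lines rather than printing directly makes the formatting step easy to check separately from the parsing step.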
Consider the HTML table below (which has been posted to the git repository as ‘a4_example_table.html’).
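<table>
<tr><th>Student Number</th><th>Student Name</th><th>Major</th><th>A1 mark</th><th>A2 mark</th></tr>
<tr><td>V00000001</td><td></td><td></td><td>10</td><td>11</td></tr>
<tr><td>V00123456</td><td>Alastair Avocado</td><td>Psychology</td><td>12</td></tr>
<tr><td>V00123457</td><td>Rebecca Raspberry</td><td>Computer Science</td><td>17</td><td>14</td></tr>
<tr><td>V00314159</td><td>Fiona Framboise</td><td>Computer Science</td><td></td><td>17</td></tr>
<tr><td>V00654321</td><td>Meredith Malina</td><td>Software Engineering</td><td>18</td><td>12</td></tr>
<tr><td>V00654322</td><td>Hannah Hindbaer</td><td>Physics</td><td>15</td><td>18</td></tr>
<tr><td>V00951413</td><td>Neal Naranja</td><td>Anthropology</td><td>15</td><td>15</td></tr>
</table>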
When provided as input to a correct implementation, the HTML table above would be converted to the following CSV spreadsheet.
TABLE 1:
Student Number,Student Name,Major,A1 mark,A2 mark
V00000001,,,10,11
V00123456,Alastair Avocado,Psychology,12,
V00123457,Rebecca Raspberry,Computer Science,17,14
V00314159,Fiona Framboise,Computer Science,,17
V00654321,Meredith Malina,Software Engineering,18,12
V00654322,Hannah Hindbaer,Physics,15,18
V00951413,Neal Naranja,Anthropology,15,15
4 Implementation Advice
Since HTML allows such a wide variation in the structure and formatting of tags, the use of regular expressions to match each tag pair is encouraged. However, you are not required to use regular expressions (or any other particular implementation technique, as long as your code is valid Python 3). If you use regular expressions, be aware of the following points.
• By default, the ‘.’ specifier does not match the newline character (‘\n’), so if you are searching for something which crosses a line boundary, it will not match. For example, the pattern ‘A.*B’ would match ‘Axy z B’ but not ‘Axy\n z B’ by default. Since whitespace in HTML can be collapsed to a single space, you can remedy this problem by replacing all newline characters with spaces. You can also use the ‘re.DOTALL’ flag when performing regular expression matching, which will cause newlines to be matched by the ‘.’ specifier. Consider the interactive Python 3 session below, which contains examples of both methods.
>>> import re
>>> s1 = 'Axy  z B'
>>> s2 = 'Axy\n z B'
>>> re.match('A.*B', s1)
<_sre.SRE_Match object; span=(0, 8), match='Axy  z B'>
>>> re.match('A.*B', s2)
>>> re.findall('A.*B', s1)
['Axy  z B']
>>> re.findall('A(.*)B', s1)
['xy  z ']
>>> re.findall('A(.*)B', s2)
[]
>>> re.findall('A(.*)B', s2, re.DOTALL)
['xy\n z ']
>>> s3 = s2.replace('\n', ' ')
>>> s3
'Axy  z B'
>>> re.findall('A(.*)B', s3)
['xy  z ']
• Since HTML tag names are not case sensitive, you may want to use the ‘re.IGNORECASE’ flag to enable case-insensitive matching. Consider the interactive session below.
>>> x = 'abc'
>>> y = 'Abc'
>>> z = 'A------C'
>>> re.findall('a.*c', x)
['abc']
>>> re.findall('a.*c', y)
[]
>>> re.findall('a.*c', z)
[]
>>> re.findall('a.*c', y, re.IGNORECASE)
['Abc']
>>> re.findall('a.*c', z, re.IGNORECASE)
['A------C']
Note that if you want to use multiple flags (such as both ‘re.DOTALL’ and ‘re.IGNORECASE’), you can combine them with the bitwise-OR operator (for example, ‘re.findall('a.*c', z, re.IGNORECASE | re.DOTALL)’).
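Both flags can be combined to pull cell contents out of HTML that spans several lines. The sketch below is one possible approach under stated assumptions, not a required technique: the name CELL is hypothetical, and the non-greedy ‘.*?’ specifier is an addition not covered in the advice above. It stops at the first closing tag rather than the last, which is sufficient because cells may not contain nested tables.

```python
import re

# Hypothetical pattern: an opening td or th tag (with optional attributes),
# the shortest possible contents, then the matching closing tag.
CELL = re.compile(r'<(td|th)(?:\s[^>]*)?>(.*?)</\1\s*>',
                  re.IGNORECASE | re.DOTALL)

html = ('<tr><TH>Firstname</TH>\n'
        '<td style="font-weight: bold;">Jill\nSmith</td ></tr>')

# group(2) holds the raw cell contents, including any embedded newline.
cells = [match.group(2) for match in CELL.finditer(html)]
print(cells)  # ['Firstname', 'Jill\nSmith']
```

The extracted contents still contain the original newline, so whitespace collapsing would be applied to each cell afterwards.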
5 Peer Testing
As in previous assignments, part of the mark for this assignment will come from creating test cases that will be used to test your peers’ implementations. For this assignment, you are required to submit one valid and non-trivial test case in a file called table_testcase.html. Test cases that do not meet the requirements in Section 2 will be considered invalid and will therefore receive a mark of zero.
Your test case file must contain at least one HTML table. It may contain other HTML tags, but, to be consistent with the specification in Section 2, nested tables are not permitted. Additionally, images and dynamic content (including any use of JavaScript) are not allowed.
Your table_to_csv.py implementation and test cases must be submitted by Tuesday, July 26th, 2016 at 11:00pm. After that deadline, the results of testing each implementation using each test case will be posted to the Testing Database on conneX.
Since your test cases will be published, please ensure that they contain no identifying information. Your implementation of table_to_csv.py will not be published.
After the results of peer testing are posted, you will have the opportunity to revise and resubmit your table_to_csv.py implementation until Sunday, July 31st, 2016. You will not be permitted to resubmit if you did not submit an implementation before the July 26th deadline (and you will therefore receive a mark of zero). Since all test cases will be published, you will also not be permitted to resubmit your test cases.
6 Evaluation
Submit your table_to_csv.py and table_testcase.html files electronically through the Assignments tab on conneX. Do not submit any other files.
The assignment will be marked out of 16 marks and is worth 8% of your final grade.
Your implementation will be marked based on its performance on test cases developed by your instructor (not on its performance on your peers’ test cases).
The marks are distributed among the components of the assignment as follows.
Ensure that all code files needed to run your code in ECS 242 are submitted. Only the files that you submit through conneX will be marked. The best way to make sure your submission is correct is to download it from conneX after submitting and test it. You are not permitted to revise your submission after the due date, and late submissions will not be accepted, so you should ensure that you have submitted the correct version of your code before the due date. conneX will allow you to change your submission before the due date if you notice a mistake. After submitting your assignment, conneX will automatically send you a confirmation email. If you do not receive such an email, your submission was not received. If you have problems with the submission process, send an email to the instructor before the due date.
Marks   Component
14      The table_to_csv.py implementation functions correctly on a variety of valid inputs.
2       The submitted test case file contains a valid and non-trivial test case. You will receive zero marks for this component unless you also submit table_to_csv.py.