Monash University
Faculty of Information Technology
2nd Semester 2021
FIT2014
Assignment 1
Linux tools, regular expressions, induction
DUE: 11:55pm, Friday 20 August 2021
How to manage this assignment
• Do as much as possible of it before your week 4 prac class. There will not be time during the
class itself to do the assignment from scratch; there will only be time to get some help and
clarification.
Instructions
Please read these instructions carefully before you attempt the assessment:
• To begin working on the assignment, download the asgn1.tar.gz workbench from Moodle
and unpack it. Be sure to configure and test the submission-builder before you begin working.
Refer to Lab 0 for a reminder on how to do all of these tasks.
• The workbench provides locations and names for all solution files. These will be empty, needing
replacement. Do not add or remove files from the workbench. All solutions must be submitted
to Moodle by the deadline above, and must be packaged using the workbench’s makefile system.
• Solutions to written questions must be submitted as PDF documents. You can create a PDF
file by scanning your legible (use a pen, write carefully, etc.) hand-written solutions, or by
directly typing up your solutions on a computer.
• Start work on this assignment early, and try to make substantial progress before your sched-
uled lab class: your tutor won’t have time to help you start from scratch. Before you attempt
any problem—or seek help on how to do it—be sure to read and understand the question, as
well as any accompanying code.
• To aid the marking process, you must adhere to all naming conventions that appear in the
assignment materials, including files, directories, code, and mathematics. Not doing so will
cause your submission to incur a one-day late-penalty (in addition to any other late-penalties
you might have). Be sure to check your work carefully.
Your submission must include:
• a sed script, decomposeSyllablesIntoParts, for Problem 1(a);
• a one-line text file, MandarinWadeGiles-NameStructure2, for Problem 1(a);
• a sed script, decomposePartsIntoLetters, for Problem 1(b);
• a one-line text file, MandarinWadeGiles-NameStructure3, for Problem 1(b);
• an awk script, matchMandarinWadeGiles, for Problem 1(c);
• the output file outputFile you produced by running your awk script on the provided input
file, inputFileOfNames, in Problem 1(c);
• a file prob2.pdf with your solution to Problem 2.
To submit your work, simply enter the command ‘make’ from within the asgn1 directory, and
then submit the resulting .tar.gz file to Moodle. Make sure that you have tested the submission
mechanism and that you understand the effect of make on your directory tree. (You first practised
this in Lab 0, §2.3.)
1
Introduction to the Assignment
In Lab 0, you met the stream editor sed, which detects and replaces certain types of patterns in
text, processing one line at a time. These patterns are actually specified by regular expressions. You
will use sed again in Problem 1 of this Assignment, to help construct regular expressions.
You will also learn about awk, which is a simple programming language that is widely used in
Unix/Linux systems and which also uses regular expressions. In Problem 1, you will construct an
awk program to identify a class of Chinese names.
Finally, Problem 2 is about applying induction to a problem of monster anatomy.
Introduction to awk
In an awk program, each line has the form
/pattern / { action }
where the pattern is a regular expression (or certain other special patterns) and the action is an
instruction that specifies what to do with any line that contains a match for the pattern. The action
(and the {. . . } around it) can be omitted, in which case any line that matches the pattern is printed.
Once you have written your program, it does not need to be compiled. It can be executed di-
rectly, by using the awk command in Linux:
$ awk -f programName inputFileName
Your program is then executed on an input file in the following way.
// Initially, we’re at the start of the input file, and haven’t read any of it yet.
If the program has a line with the special pattern BEGIN, then
do the action specified for this pattern.
Main loop, going through the input file:
{
inputLine := next line of input file
Go to the start of the program.
Inner loop, going through the program:
{
programLine := next line of program (but ignore any BEGIN and END lines)
if inputLine contains a string that matches the pattern in programLine, then
if there is an action specified in the programLine, then
{
do this action
}
else
just print inputLine // it goes to standard output
}
}
If the program has a line with the special pattern END, then
do the action specified for this pattern.
Any output is sent to standard output.
You should read about the basics of awk, including the way it represents regular expressions and
the main instruction types used in its actions. Any of the following sources should be a reasonable
place to start:
• A. V. Aho, B. W. Kernighan and P. J. Weinberger, The AWK Programming Language,
Addison-Wesley, New York, 1988.
(The first few sections of Chapter 1 should have most of what you need, but be aware also of
the regular expression specification on p28.)
2
• http://www.grymoire.com/Unix/Awk.html
• http://www.hcs.harvard.edu/~dholland/computers/awk.html (a good, short introduc-
tion)
• the Wikipedia article looks ok
• the awk manpage.
Introduction to Problem 1
The Master said, ‘What is necessary is to rectify names . . . . If names are not recti-
fied, then words are not appropriate. If words are not appropriate, then deeds are not
accomplished.’
– Confucius (孔夫子), The Analects (transl. R. Dawson), Oxford University Press,
1993, §13.3.
Most organisations today deal with people from many different cultures and language groups,
and they must often record and process people’s names in systems that work mainly with English
language text. In such contexts, it is helpful to be able to recognise names from different language
groups. Example applications include: determining how to pronounce students’ names when reading
them out from a list at graduation ceremonies; determining how to greet a person with whom you
have an appointment; determining how to enter the various parts of a person’s name into a database;
determining how automatically-generated emails, sent to many different people listed in some file,
should address each recipient; determining the most likely native language of a person in situations
where their name is known but they cannot be spoken to directly at the time (e.g., in some emergency
situations). Recognising the language group that a name belongs to is an important first step in all
these situations.
In this problem you will write some code in sed and awk to try to recognise one type of Chinese
language names in a long file of Asian names. More specifically, suppose you are given an input
file in which each line starts with a person’s name in some language, with each name transcribed
somehow into English text. Your task is to detect which of these names come from Mandarin Chinese
transcribed using the Wade-Giles romanisation system.1
In the input file, all text from the start of each line until the first colon (:) on the line (but not
including the colon itself) is taken to be a person’s name. In most cases, each line ends with a string
of non-blank letters specifying which language the name is believed to come from. An example input
file is provided, as inputFileOfNames. If you browse through the file, you will notice that it contains
names from a variety of Asian languages: Mandarin, Cantonese, Hokkien, Teochew, Hakka, Korean,
Japanese, Thai, Vietnamese, Malay and Indonesian. They have been represented in English text
using a variety of transcription schemes, and with all extra marks on letters (accents, tone marks,
other diacritical marks, etc.) removed. 2 In many cases, the line about a person also contains some
other information about them, but our name recognition task will ignore that information.3
Further information about working with names from different cultures can be found in:
• Fiona Swee-Lin Price, Success with Asian Names, Allen & Unwin, Crows Nest, NSW, 2007.
• J. Greenberg Motamedi, Z. Jaffery, A. Hagen, and S. Y. Yoon, Getting it right: Reference guides
for registering students with non-English names, 2nd edition. (REL 2016-158 v2), U.S. Depart-
ment of Education, Institute of Education Sciences, National Center for Education Evaluation
1This is no longer the most common system, although it is still used for the names of many millions of Chinese
people throughout the world. The Hànyǔ P̄ınȳın system is now more widely used, and we have used it previously in
this unit.
2These marks carry important information about meaning and pronunciation. But they are often removed when
names are represented using other alphabets.
3The file was compiled from a number of sources, mainly Wikipedia lists of names of type
https://en.wikipedia.org/wiki/List of CultureName people, where CultureName is one of the cultures listed
above; also https://www.goratings.org/en/. The lists obtained from Wikipedia are rather imperfect, with peo-
ple’s names often not written in a form that clearly shows the claimed cultural background.
3
http://www.grymoire.com/Unix/Awk.html
http://www.hcs.harvard.edu/~dholland/computers/awk.html
https://www.goratings.org/en/
and Regional Assistance, Regional Educational Laboratory Northwest, Washington, DC, 2017.
https://ies.ed.gov/ncee/edlabs/regions/northwest/pdf/REL_2016158.pdf
• SBS, The Cultural Atlas, https://culturalatlas.sbs.com.au/. You can look up a country
or culture (e.g., by clicking on a map) and then click on a link to “Naming” for that culture.
Names in Mandarin
Mandarin Chinese names consist of a family name followed by a given name (the reverse of the
usual order in English). The family name usually consists of just a single syllable, and the given
name usually consists of one or two syllables. We restrict ourselves to names with these numbers of
syllables from now on.4
We treat a Mandarin name in Wade-Giles representation as consisting of a syllable, followed by
a space, followed by either a single syllable or two consecutive syllables with a hyphen or a space
between them.
Each syllable consists of either a prefix followed by a suffix, or a standaloneSuffix (which is a
suffix that can be a whole syllable in its own right; other suffixes must follow a prefix).5
• The possibilities for the prefix are:
p, p’, m, f, t, t’, n, l, k, k’, h, ch, ch’, hs, sh, j, ts, ts’, s, w, y
These are also listed, one per line, in the file prefix-file.
• The possibilities for the suffix are:
a, i, o, u, ai, an, ao, eh, ei, en, ia, ih, in, iu, ou, ua, ui, un, uo, ang, eng, iao,
ieh, ien, ing, uai, uan, ueh, uei, ung, iang, iung, uang
These are also listed, one per line, in the file suffix-file.
• The possibilities for the standaloneSuffix are:
ai, ei, ao, ou, an, ang, en, eng, erh
These are also listed, one per line, in the file standaloneSuffix-file.
For example, the syllable chung is formed from the prefix ch followed by the suffix ung. Note that
ung does not appear in the list of standaloneSuffixes, so it cannot be a syllable in its own right; it
always needs a prefix before it. The syllable ang consists just of the standaloneSuffix ang.6
Constructing a regular expression for Mandarin Wade-Giles names
We will use this information, together with sed, to generate a regular expression to try to match
Mandarin Wade-Giles names. (The match will not be perfect, and later you will do some simple
computations to get some idea of how well, or poorly, it does.7) Throughout, we will only do case-
insensitive matching, so there is no need to capitalise initial letters of syllables in order to match
names.
We describe the steps in detail now, and then list the assignment submission requirements for
Problem 1.
4There are exceptions, though they are uncommon: some family names have two or even more syllables, and some
given names have more than two syllables. But in this assignment we consider only one-syllable family names and
one- or two-syllable given names.
5This terminology is different to the more usual linguistic terminology of initials and finals. This is deliberate,
since they represent slightly different sets of strings here to the standard initials and finals. This change was just a
small simplification for the purposes of the assignment.
6The only standaloneSuffix that is not also a suffix is erh.
7Some reasons for the imperfect matching include: not all the combinations of prefixes and suffixes listed above
give valid syllables in Mandarin; some people write Mandarin Wade-Giles names without any hyphen or a space
between the two syllables of the given name; and some family or given names may have more syllables than we have
allowed for.
4
https://ies.ed.gov/ncee/edlabs/regions/northwest/pdf/REL_2016158.pdf
https://culturalatlas.sbs.com.au/
We start with the file MandarinWadeGiles-NameStructure0, which just contains one line con-
sisting of the text
regular expression for Wade-Giles names.
We first transform
Giles names. From our description of the syllable structure above, we can represent this by the regu-
lar expression
to represent any Wade-Giles syllable, [- ] is regular expression syntax for matching a single char-
acter that is either a hyphen or a space, and putting “?” after a symbol means it can occur just
once or not at all at that position.8 Each occurrence of
to a regular expression to represent such syllables. The first step of this transformation can be done
using the following sed command:
$ sed “s/
> MandarinWadeGiles-NameStructure1
Alternatively, you can put the sed instruction “s/
into the file decomposeNameIntoSyllables (provided with this assignment) and do the following:
$ sed -f decomposeNameIntoSyllables MandarinWadeGiles-NameStructure0
> MandarinWadeGiles-NameStructure1
Try these, using the given files, and check that you get the required result in the file
MandarinWadeGiles-NameStructure1. This use of sed is a template for what you will do next.
Now you need to replace each
structures, using
Write the sed instruction to do this, in the file decomposeSyllablesIntoParts, and do the
transformation from your Linux command line using
$ sed -f decomposeSyllablesIntoParts MandarinWadeGiles-NameStructure1
> MandarinWadeGiles-NameStructure2
Now you need to replace each occurrence of each part,
by regular expressions representing possible letter strings, according to the sets of strings given on
the previous page.
Write the sed instructions to do this, in the file decomposePartsIntoLetters. You will need
three lines: one line giving the sed instruction for each part type,
Then do the transformation from your Linux command line using
$ sed -f decomposePartsIntoLetters MandarinWadeGiles-NameStructure2
> MandarinWadeGiles-NameStructure3
This should put, into the file MandarinWadeGiles-NameStructure3, a regular expression that is
intended to match Wade-Giles names.
Matching names in the input file
We will now use this regular expression to construct an awk program to apply it to names in the
input file, and to do some simple computations to determine how well the regular expression matches
Wade-Giles names.
For each name in the input file, checking it against your regular expression gives one of four
possible outcomes:
• Correct Match: a name that belongs to Mandarin Wade-Giles (according to the last word
on its line in inputFileOfNames) is also matched by your regular expression.
8The operator “?” is a standard feature of regular expression syntax in sed and awk. Question to consider (but not
part of the assignment): does using it enlarge the class of languages that can be represented by regular expressions?
5
• False Positive: a name that does not belong to Mandarin Wade-Giles (according to the last
word on its line in inputFileOfNames), but is matched by your regular expression.
• False Negative: a name that belongs to Mandarin Wade-Giles (according to the last word
on its line in inputFileOfNames), but is not matched by your regular expression.
• Correct Non-match: a name that does not belong to Mandarin Wade-Giles (according to
the last word on its line in inputFileOfNames) and is not matched by your regular expression.
Your awk program should compute the number of names, in the input file, of each of these four
types, and report those numbers at the end.
You can write the awk program manually, cutting-and-pasting the regular expression as needed
(and making any necessary modifications to it) from MandarinWadeGiles-NameStructure3.
The heart of your awk program will be a statement that does the following.
• The pattern matches a Mandarin Wade-Giles name whenever it occurs in the specified position
at the start of a line.
– Recall the general structure of an awk statement. By default, matches in awk are case-
sensitive. To do a case-insensitive match, you need to match the pattern against a version
of the current line in which all alphabetic characters have been converted to lower case.
This can be done using a statement of this form:
tolower($0) ~ /pattern / { action }
Here, $0 is a built-in awk variable that always contains the current line of the input file,
and tolower() is a function that converts all alphabetic characters to lower case. You
can read “ ~ ” as “matches”.
• The action increments appropriate variables in order to update the number of correct matches,
false positives, false negatives, or correct non-matches, as appropriate.
You may need another line or few too, including an END line, at the end, in order to print the total
numbers of matches of each type.
The files and commands required to complete this problem are shown in Figure 1.
Problem 1. [12 marks]
First, write the one-line sed script, decomposeNameIntoSyllables, and run sed, as described, to
produce MandarinWadeGiles-NameStructure1.
(a) Write the one-line sed script, decomposeSyllablesIntoParts, and run sed, as described, to
produce MandarinWadeGiles-NameStructure2.
(b) Write the three-line sed script, decomposePartsIntoLetters, and run sed, as described, to
produce MandarinWadeGiles-NameStructure3.
(c) Write the awk script, matchMandarinWadeGiles, and run it, as described, on input file
inputFileOfNames, to produce outputFile.
For bonus marks:
NOTE: This is a more advanced and challenging exercise, requiring exploration, experimentation,
and research. To attempt it, you must have mastered the rest of this question first. Bonus marks
are only available when the total mark for the rest of Problem 1 is at least 10 out of 12.
Can you write a much better regular expression for Mandarin Wade-Giles names? If you attempt
this, then, in addition to the files required above, you should also provide:
• your new regular expression, in a one-line file called MandarinWadeGi les-NameStructure4;
6
MandarinWadeGiles-NameStructure0
sed -f decomposeNameIntoSyllables MandarinWadeGiles-NameStructure0
> MandarinWadeGiles-NameStructure1
MandarinWadeGiles-NameStructure1
sed -f decomposeSyllablesIntoParts MandarinWadeGiles-NameStructure1
> MandarinWadeGiles-NameStructure2
MandarinWadeGiles-NameStructure2
sed -f decomposePartsIntoLetters MandarinWadeGiles-NameStructure2
> MandarinWadeGiles-NameStructure3
MandarinWadeGiles-NameStructure3
write awk code, using regular expression in MandarinWadeGiles-NameStructure3
matchMandarinWadeGiles
awk -f matchMandarinWadeGiles inputFileOfNames > outputFile
inputFileOfNames
decomposeNameIntoSyllables
decomposeSyllablesIntoParts
decomposePartsIntoLetters
outputFile
Figure 1: The plan.
7
• your new awk program, in a file called matchMandarinWadeGilesBonus;
• the output from running your awk program on the input file, as outputFileBonus.
To qualify for bonus marks, your approach must use a significantly different approach to the one
used above, and offer a significant advantage. If your approach is more complex, then it should give
better results (i.e., lower error rate) in general. If your approach is significantly simpler but does
almost as well, then that also may qualify for bonus marks.
Your solution will not just be evaluated on how well it does on the provided input file. It will
be run on other files of names too. Also, do not base your regular expression solely on the specific
details of the Wade-Giles names provided in the input file. They may not be totally representative
of the full range of names from that culture.
As usual, be sure to cite any sources used.
Problem 2. [8 marks]
In the game Monster Factory (Rio Grande Games, https://www.riograndegames.com/, 2012)9,
players construct monsters from square tiles which they place on a flat surface. Each tile depicts
part of a monster. Two examples of monsters are shown in Figure 2. In this problem, monsters are
allowed to be arbitrarily large, and we assume there is an unlimited supply of tiles of all types.
Each of the four sides of each tile is either a linking side or a blank side. We may call them
links or blanks, as appropriate, for brevity. Linking sides correspond to monster parts that can be
joined together. A tile is said to be an `-link tile if it has ` links (and therefore 4− ` blanks).
Two tiles are proximate if they sit next to each other so that one side of one tile is exactly
aligned with a side of the other tile, as if the two sides are stuck together. Two tiles are only allowed
to be proximate if their shared sides are both of the same type, i.e., both linking or both blank.
Two tiles are adjacent if they are proximate via linking sides (rather than blank sides). A pair of
aligned linking sides that link two adjacent tiles is called a linkage.
Every linking side of a tile must be matched with a linking side of some other tile. This means
that any tile side that does not abut another tile side must be blank. In particular the sides around
the boundary of the arrangement of tiles are all blank. The construction must give just one monster,
so it must be possible to move between any two tiles in the monster by moving from tile to tile,
never crossing blank sides, only crossing linking sides.
Figure 2: Two monsters.
A cycle in a monster is a sequence T0, T1, . . . , Tk−1, Tk = T0 of k tiles such that:
∀i ∈ {0, 1, . . . , k − 1} : tiles Ti and T(i+1) mod k are adjacent.
The cycle divides the flat playing surface into interior and exterior regions. Imagine walking
around such a cycle, stepping from each tile Ti to its successor T(i+1) mod k, for all i, and finishing
back where you start. If you walk around the cycle in a clockwise direction, then its interior is on
your right and its exterior is on your left. A cycle is basic if its interior contains the interior of
9by Donald X. Vaccarino and Nina Paley; illustrated by Nina Paley and Marco Morte
8
no other cycle. Informally, this means that there is no route using other links that goes across the
cycle.
For example, the left monster of Figure 2 has three cycles but just two basic cycles, while the
right monster has no cycles of any kind.
For every monster, let c be the number of basic cycles. For all k ∈ {1, 2, 3, 4}, let tk be the
number of its tiles that have k linking sides.
It is known that every monster satisfies the Three Laws of Monstrosity:
First Law: t1 + t3 is even.
Second Law: If t1 = 0 then there is a two-link tile that belongs to a basic cycle.
Third Law: t1 − t3 − 2t4 + 2c = 2.
Prove the Third Law of Monstrosity, by induction on the number of linkages.
For this Problem, you may assume the First and Second Laws of Monstrosity, and don’t need to
prove them. Most likely, you won’t need to use the First Law, but will need to use the Second Law.
9