VL ‘Programming and Data Analysis’ Prof. Dr. Gerhard Jäger | WS 20–21
Assignment 07: Analyzing the Spanish Copulas
handed out: January 26th 20:30
to be submitted by: February 2nd 20:30
Spanish (like other Iberian Romance languages) features two different verbs corresponding to the English copula “to be”: ser and estar. There are systematic differences in the verbs’ usage. Introductory textbooks of Spanish typically explain the difference roughly like this: ser is used for properties that are considered immutable (like gender, nationality, colors, traits of character), whereas estar is used for circumstances which can change (like place, temperature, health condition, and mood). This assignment gives you an impression of how a corpus linguist could approach the question whether this difference becomes visible through the lens of the copulas’ adjectival complements.
For this purpose, we have prepared a tagged corpus of more than 70,000 Spanish sentences which contain instances of either copula. The sentences were extracted from Tatoeba, a crowd-sourced database of parallel sentences (https://www.tatoeba.org). Tagging was performed using spaCy (https://spacy.io/), and the result was stored, one sentence per line, in the file spanish-tagged-spacy.txt, which you receive along with this assignment.
Your tasks will be:
• to read in the tagged sentence file,
• to implement a naive lemmatizer for Spanish adjectives which allows you to conflate the different forms of each adjective,
• to extract all adjective forms which follow forms of ser and estar in the corpus, maintaining counts of which adjectives occurred how often with which copula,
• and finally, to aggregate the frequency information into three sets of adjectives, depending on whether they predominantly occur with ser, with estar, or whether they can occur with both.
The adjectives of the last set will often take on different meanings depending on the verb, as in the following example pair:
• El profesor es aburrido. “The professor is boring.”
• El profesor está aburrido. “The professor is (feeling) bored.”
Note: This assignment requires some comparatively heavy computations: going through a corpus of 70,000 sentences can take some time. Don’t worry if your main program and tests take longer than you’re used to. In order to make testing more efficient, each of the data structures (the loaded sentences in Task 1, the frequency dictionaries in Task 3, and the occurrence sets in Task 4) is stored in a pickle file upon generation, and the object is loaded from that file in later tests. This makes it possible, for example, to test minor changes in the occurrence set function using the data structures generated earlier, without having to call the load_sentences function and crawl through the entire corpus file again every time. You don’t need to mind these pickle files at all, and you don’t need to submit them either; just work in your ex_07.py and test_ex_07.py as usual.
Task 1: Reading the tagged sentences from a file [2 points]
Write a function load_sentences(filename) which reads in the contents of the sentence file.
The result should be a list of sentences, where each sentence is represented by a list of pairs of strings. The first string in each pair is the word form, and the second string is the part-of-speech tag.
In the input format, tokens are separated by spaces, and tags are appended after underscores. The sentence file is encoded in UTF-8. Example usage:

>>> sentences = load_sentences("spanish-tagged-spacy.txt")
>>> print(sentences[0][2:5])
[('no', 'ADV'), ('está', 'AUX'), ('bien', 'ADV')]
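For orientation, here is a minimal sketch of how such a reader might look, assuming every token in the file has exactly the form_TAG shape described above (a sketch, not the required implementation):

def load_sentences(filename):
    """Read the tagged corpus into a list of sentences, each a list of
    (word form, POS tag) pairs."""
    sentences = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            # each token looks like "form_TAG"; splitting at the last
            # underscore keeps word forms that contain underscores intact
            pairs = [tuple(token.rsplit("_", 1)) for token in line.split()]
            sentences.append(pairs)
    return sentences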
Task 2: Implementing a lemmatizer for Spanish adjectives [2 points]
Write a function lemmatize(adj) which converts Spanish adjective forms into the respective lemma. You will not be able to build a perfect solution that works for all adjectives, but converting the following description of Spanish adjective inflection into rules will provide coverage of about 95% of all adjective forms, which is good enough for this exercise:
• Class I is formed by adjectives that have a paradigm consisting of four forms: masculine singular, feminine singular, masculine plural, and feminine plural. There are only two important types of such adjectives:
1. Adjective lemmas which end in -o replace that ending with -a, -os, and -as.
2. Adjective lemmas in -és have forms in -esa, -eses, and -esas.
• You can assume that all other adjectives do not show a distinction in gender, and only have a singular and a plural form. This is how the plural is formed for this second class (Class II) of adjectives:
1. Adjectives in -nte, -nse, -ble, -bre simply add -s.
2. For all other plural forms in -les or -res or -nes, remove the -es.
3. If the plural form ends in -ces, replace that by -z.
4. All other plurals in -es are derived from lemmas ending in -e.
The following table shows the full paradigm for one adjective from each (sub)class:
Class  Lemma      Meaning     mSg        fSg        mPl         fPl
I.1    bueno      “good”      bueno      buena      buenos      buenas
I.2    inglés     “English”   inglés     inglesa    ingleses    inglesas
II.1   agradable  “pleasant”  agradable  agradable  agradables  agradables
II.2   igual      “equal”     igual      igual      iguales     iguales
II.3   capaz      “able”      capaz      capaz      capaces     capaces
II.4   triste     “sad”       triste     triste     tristes     tristes

(Class II adjectives have identical masculine and feminine forms.)
Example usage:
>>> list(map(lemmatize, ["bueno", "buena", "buenos", "buenas"]))
['bueno', 'bueno', 'bueno', 'bueno']
>>> list(map(lemmatize, ["capaz", "capaces"]))
['capaz', 'capaz']
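One way to turn these rules into code is sketched below; note that the order of the checks matters (the Class I.2 endings must be tried before the generic -a/-os/-as rule). This is only a rough illustration, not the only correct solution:

def lemmatize(adj):
    # Class I.2: inglesa, ingleses, inglesas -> inglés
    for ending in ("esas", "eses", "esa"):
        if adj.endswith(ending):
            return adj[:-len(ending)] + "és"
    # Class I.1: buena, buenos, buenas -> bueno
    for ending in ("as", "os", "a"):
        if adj.endswith(ending):
            return adj[:-len(ending)] + "o"
    # Class II plural rules
    if adj.endswith(("ntes", "nses", "bles", "bres")):
        return adj[:-1]                # II.1: the plural just adds -s
    if adj.endswith(("les", "res", "nes")):
        return adj[:-2]                # II.2: remove -es
    if adj.endswith("ces"):
        return adj[:-3] + "z"          # II.3: replace -ces by -z
    if adj.endswith("es"):
        return adj[:-1]                # II.4: lemma ends in -e
    return adj                         # already a lemma form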
Task 3: Counting adjective occurrences [4 points]
Write a function count_occurrences(sentences) which goes through every sentence in the sentences object as generated in Task 1, and finds every instance of a copula followed by an adjective. The tag for copulas is "AUX", and the one for adjectives is "ADJ". For every such instance, normalize the adjective to its lemma form using the lemmatize(form) function from Task 2, and look up to which of the two copulas the verb form belongs by means of the two sets conj_ser and conj_estar we are providing as part of the template.
Your function should build two dictionaries freq_ser and freq_estar, each with all adjectives occurring with either of the two copulas as keys, and the frequency count for the respective copula-adjective combination as values. NOTE: This means that both dictionaries are of equal size, and contain a large number of zeroes as values. The reason for this inefficiency is easier processing in Task 4. Finally, return both dictionaries as a pair: first the dictionary with the counts for ser, then the one for estar.
Example usage:

>>> freq_ser, freq_estar = count_occurrences(sentences)
>>> sorted(freq_ser.items())[2:6]
[('abandonado', 0), ('abarrotado', 0), ('abducido', 1), ('abejo', 1)]
>>> sorted(freq_estar.items())[2:6]
[('abandonado', 2), ('abarrotado', 12), ('abducido', 0), ('abejo', 0)]
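A possible shape for this function, assuming the sets conj_ser and conj_estar from the template contain fully inflected verb forms (a sketch under these assumptions, not the definitive implementation):

def count_occurrences(sentences):
    freq_ser, freq_estar = {}, {}
    for sentence in sentences:
        # look at all adjacent token pairs within the sentence
        for (form, tag), (next_form, next_tag) in zip(sentence, sentence[1:]):
            if tag != "AUX" or next_tag != "ADJ":
                continue
            if form not in conj_ser and form not in conj_estar:
                continue  # some other auxiliary, e.g. a form of haber
            lemma = lemmatize(next_form)
            # register the lemma in both dictionaries, so that missing
            # combinations show up as explicit zero counts
            freq_ser.setdefault(lemma, 0)
            freq_estar.setdefault(lemma, 0)
            if form in conj_ser:
                freq_ser[lemma] += 1
            else:
                freq_estar[lemma] += 1
    return freq_ser, freq_estar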
Task 4: Aggregating occurrence sets [4 points]
Write a function get_occurrence_sets(freq_ser, freq_estar) which processes the counts from Task 3. For each of the three relevant cases, build one set: ser containing the adjective lemmas which predominantly occur with ser, estar for the ones with estar, and both for the lemmas occurring with both copulas. Only adjectives which you found 10 or more times in total should be classified, and each combination of a copula and an adjective is considered safely attested if it was found at least twice. Return the resulting sets as a tuple in the following order: (ser, estar, both)
Example usage:

>>> ser, estar, both = get_occurrence_sets(freq_ser, freq_estar)
>>> sorted(ser)[:5]
['absurdo', 'adecuado', 'alemán', 'alérgico', 'amable']
>>> sorted(estar)[:5]
['abarrotado', 'acostado', 'acostumbrado', 'agotado', 'ansioso']
>>> sorted(both)[:5]
['abierto', 'aburrido', 'agradable', 'agradecido', 'alto']
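The classification criteria leave some room for interpretation. One plausible reading, sketched below, puts a lemma into both whenever both combinations are safely attested, and into ser or estar when only one of them is:

def get_occurrence_sets(freq_ser, freq_estar):
    ser, estar, both = set(), set(), set()
    for lemma, n_ser in freq_ser.items():
        n_estar = freq_estar[lemma]
        if n_ser + n_estar < 10:
            continue  # too rare to classify reliably
        if n_ser >= 2 and n_estar >= 2:
            both.add(lemma)    # safely attested with both copulas
        elif n_ser >= 2:
            ser.add(lemma)     # safely attested with ser only
        elif n_estar >= 2:
            estar.add(lemma)   # safely attested with estar only
    return ser, estar, both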
Bonus 5: Extracting bibliography data using regular expressions [+4 points]
In this bonus task, which is completely independent of the remainder of the assignment, you will be extracting and manipulating bibliography data.
The LaTeX file bibliography.tex contains some bibliographic entries of publications in linguistics and logic. We will use regular expression search to extract some data from these entries, for example the authors of a paper or the pages referenced. In addition, we will use regular expression find-and-replace to manipulate some of the data into a different format.
Each entry in the bibliography looks something like this:
\bibitem{ptq} Montague, R. (1973). The proper treatment of quantification in ordinary English. In {\it Approaches to natural language}, 221–224. Springer, Dordrecht.

We want to extract the following pieces of information:
• title
  Example: The proper treatment of quantification in ordinary English
  Specification: a concatenation of word characters, whitespace and some punctuation symbols which appears between the year information (which ends with “).”) and a period

• authors
  Example: Montague, R.
  Specification: a concatenation of arbitrary characters which appears between \bibitem{key}, where the key is a concatenation of word characters, and the year information starting with “(”

• year
  Example: 1973
  Specification: a concatenation of 4 digits appearing between opening and closing brackets ( )

• pages
  Example: 221–224
  Specification: a concatenation of at least 1 digit + “–” + at least 1 digit

• title of collection*
  Example: Approaches to natural language
  Specification: a concatenation of any characters which appears between In {\it and }

* when the paper appears in a collection
In addition, we want to perform the following transformations:
• headings
  Example: \bibitem{ptq}Montague… =⇒ \n\bibitem{ptq}\nMontague…
  Specification: headings of the form \bibitem{key} (key as specified above) should be put between two newline symbols, to add space between the entries and between the heading and the contents

• authors
  Example: Montague, R. =⇒ R. Montague
  Specification: author names, which are written as Lastname, Firstname, should be transformed into Firstname Lastname. Here, we will assume that an individual author name is a concatenation of word characters + whitespace + capital letter + period
To make testing easier, the regular expressions are all defined inside a function search_bibliography(target, contents), where the parameter target specifies which information string we want to retrieve, and contents gives the string we want to perform our search in (for example, the contents of a bibliography file). (In real-life applications, you would just work with the patterns and findall calls directly, without the additional clutter of a function around them.)
The patterns to match the paper titles, the authors and the heading transformation are already defined. Your task: Fill in the other patterns to match the targets as described in the specifications.
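To give an idea of the expected shape, here is a small standalone sketch of patterns that would satisfy the specifications, tried out on a shortened sample entry of our own; the actual template may organize the patterns differently, and we assume the page ranges in bibliography.tex use a literal en dash, as the example output below suggests:

import re

# shortened sample entry in the style of bibliography.tex (our own)
entry = (r"\bibitem{ptq} Montague, R. (1973). The proper treatment of "
         r"quantification in ordinary English. In {\it Approaches to "
         r"natural language}, 221–224. Springer, Dordrecht.")

print(re.findall(r"\((\d{4})\)", entry))        # years: ['1973']
print(re.findall(r"\d+–\d+", entry))            # pages: ['221–224']
print(re.findall(r"In \{\\it (.*?)\}", entry))  # ['Approaches to natural language']
# name transformation: "Lastname, F." -> "F. Lastname"
print(re.sub(r"(\w+), ([A-Z]\.)", r"\2 \1", "Montague, R."))  # 'R. Montague'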
After successful implementation, you should be able to extract the relevant information like this:
>>> search_bibliography("titles", file_contents)
['An introduction to first-order logic', 'Universal grammar', 'The proper treatment of quantification in ordinary English']
>>> search_bibliography("authors", file_contents)
['Barwise, J.', 'Chomsky, N. \\& Lightfoot, D.', 'Chomsky, N.', 'Montague, R.', 'Montague, R.', 'OGrady, W., Aronoff, M. \\& Dobrovolsky, M.']
>>> search_bibliography("years", file_contents)
['1977', '2002', '2014', '1970', '1973', '1989']
>>> search_bibliography("pages", file_contents)
['5–46', '373–398', '221–224']
>>> search_bibliography("colltitles", file_contents)
['Studies in Logic and the Foundations of Mathematics', 'Approaches to natural language']
and perform the desired transformations like this:
>>> search_bibliography("transform_headings", file_contents)
\begin{thebibliography}
\bibitem{barwise}
Barwise, J. (1977). An introduction to first-order logic. In {\it Studies in Logic and the Foundations of Mathematics}, 90, 5–46. Elsevier.
\bibitem{syntstr}
Chomsky, N. \& Lightfoot, D. (2002). {\it Syntactic structures}. Walter de Gruyter.
\bibitem{minimalist} …
>>> search_bibliography("transform_names", file_contents)
\begin{thebibliography}
\bibitem{barwise} J. Barwise (1977). An introduction to first-order logic. In {\it Studies in Logic and the Foundations of Mathematics}, 90, 5–46. Elsevier.
\bibitem{syntstr} N. Chomsky \& D. Lightfoot (2002). {\it Syntactic structures}. Walter de Gruyter.
\bibitem{minimalist} N. Chomsky (2014). {\it The minimalist program}. MIT press.

That’s it! Before submitting, don’t forget to test your functions one last time using test_ex_07.py!
Total points: 12+4