School of Computer Science Dr. Ying Zhou
COMP5349: Cloud Computing Sem. 1/2020
Assignment: Text Corpus Analysis
Group Work: 20% 30.04.2020
1 Introduction
This assignment tests your ability to design and implement a Spark application to handle a relatively large data set. It also tests your ability to analyse the execution performance of your own application on a cluster.
This is a group assignment; each group can have up to 2 students. You are encouraged to form groups within the same lab. A number of self-enrol groups have been created in Canvas for each lab, prefixed with the lab room and day. If you prefer to form a group with a member from a different lab, you need to enrol in a group belonging to the lab that one of the members regularly attends. You may be requested to move to a different group if the chosen lab is too full.
2 Data Set
The data set used in this assignment is a public text corpus called Multi-Genre Natural Language Inference (MultiNLI). The corpus consists of 433k sentence pairs in 10 genres of written and spoken English. Each pair contains a premise sentence and a hypothesis sentence. The corpus was developed for a particular NLP task called Natural Language Inference, which involves building a model to automatically determine whether a hypothesis sentence is true given that its premise sentence is true.
The corpus is divided into the following sets:
• a training set with sentence pairs from five genres: FICTION, GOVERNMENT, SLATE, TELEPHONE and TRAVEL
• a matched development set with sentence pairs from the same five genres as those in the training set
• a matched test set with sentence pairs from the same five genres as those in the training set
• a mismatched development set with sentence pairs from another five genres: 9/11, FACE-TO-FACE, LETTERS, OUP and VERBATIM
• a mismatched test set with sentence pairs from the same five genres as those in the mismatched development set
The training set contains around 390K sentence pairs. The development and test sets contain around 10K sentence pairs each.
3 Analysis Workloads
You are asked to perform two types of analysis:
• Vocabulary Exploration
• Sentence Vector Exploration
3.1 Vocabulary Exploration
You are asked to produce a few statistics of the vocabulary used in the corpus. Here, the vocabulary refers to the set of all the words appearing in the corpus or in a subset of the corpus. No stemming or spelling check is needed to remove near duplicates or typos.
The first set of workloads involves the relatively small development and test data sets. There are four sets in total, each with around 10K sentences. The matched development set and the matched test set contain sentence pairs in five genres. The mismatched development set and mismatched test set contain sentence pairs in five different genres. You are asked to find out the following; a sketch in code appears after the list:
1. the number of common words between the matched and mismatched sets
2. the number of words unique to the matched sets
3. the number of words unique to the mismatched sets
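As a rough illustration, here is a minimal PySpark sketch of these three statistics. It assumes the tsv files have a header row and uses a plain whitespace split; the file paths and the application name are placeholders, not prescribed values.

    # Minimal sketch of the matched/mismatched vocabulary comparison.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("vocab-exploration").getOrCreate()

    def load_words(paths):
        """Return an RDD of distinct lower-cased words from both sentence columns."""
        df = spark.read.csv(paths, sep="\t", header=True)
        return (df.select("sentence1", "sentence2").rdd
                  .flatMap(lambda r: (r.sentence1 or "").split() +
                                     (r.sentence2 or "").split())
                  .map(lambda w: w.lower())
                  .distinct())

    matched = load_words(["dev_matched.tsv", "test_matched.tsv"])
    mismatched = load_words(["dev_mismatched.tsv", "test_mismatched.tsv"])

    common = matched.intersection(mismatched).count()       # statistic 1
    matched_only = matched.subtract(mismatched).count()     # statistic 2
    mismatched_only = mismatched.subtract(matched).count()  # statistic 3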
The next set of workloads deals with the training data set. It contains sentence pairs from five genres. Among the vocabulary of the training set, there are common words appearing in all genres, and there could be words that only appear in one genre. You are asked to find out the distribution of common and unique words. To be specific, you are asked to find out the following (sketched in code after the list):
1. The percentages of words appearing in five genres, in four genres, in three genres, in two genres and in one genre
2. The same percentages after removing a given list of stop words
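The genre distribution could be computed along the following lines; the stop-word variant would simply filter the (word, genre) pairs against the stop word list first. This is an illustration under the same whitespace-split assumption as above, not a prescribed design.

    # Sketch: for each training-set word, count the number of genres it
    # appears in, then turn the counts into percentages.
    train = spark.read.csv("train.tsv", sep="\t", header=True)

    word_genre = (train.select("genre", "sentence1", "sentence2").rdd
                    .flatMap(lambda r: [(w.lower(), r.genre)
                                        for w in (r.sentence1 or "").split() +
                                                 (r.sentence2 or "").split()])
                    .distinct())                     # one (word, genre) pair each

    genres_per_word = (word_genre.map(lambda wg: (wg[0], 1))
                                 .reduceByKey(lambda a, b: a + b))

    total = genres_per_word.count()                  # vocabulary size
    distribution = (genres_per_word.map(lambda wc: (wc[1], 1))
                                   .reduceByKey(lambda a, b: a + b)
                                   .mapValues(lambda c: c / total * 100)
                                   .collect())       # [(#genres, % of words)]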
3.2 Sentence Vector Exploration
Most text mining and NLP tasks start by converting the input document or sentence(s) into a vector. There are many ways to compute the vector representation of a sentence, from the traditional vector-space-based TFIDF representation, to simple averaging of word vectors, to complex neural models trained on very large corpora. A sentence vector is expected to embed much of the semantic and syntactic information of the sentence.
In this workload, you are asked to compare two sentence vector representation methods based on their ability to capture the genre of sentences. The sentence vector representation methods are:
• TFIDF-based vector representation. You should use the implementation provided by the Spark ML library, and you can decide on the dimension of the vector (see the sketch after this list).
• Pre-trained sentence encoder. You are recommended to use the Universal Sentence Encoder released by Google. The result would be a 512-dimensional vector.
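As a rough sketch, the two representations could be produced as follows. The numFeatures value and the TF Hub module handle are illustrative choices, and `sentences` stands for a DataFrame with a string column named sentence.

    # (1) TFIDF with Spark ML; the vector dimension (numFeatures) is a
    # free design choice.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="sentence", outputCol="words"),
        HashingTF(inputCol="words", outputCol="tf", numFeatures=4096),
        IDF(inputCol="tf", outputCol="features"),
    ])
    tfidf = pipeline.fit(sentences).transform(sentences)

    # (2) Universal Sentence Encoder via TensorFlow Hub; each sentence
    # maps to a 512-dimensional vector.
    import tensorflow_hub as hub

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    vectors = embed(["The quick brown fox jumps over the lazy dog."])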
For each vector representation method, you are asked to encode every sentence in the training data set as a vector, then cluster all sentences into five clusters. Each cluster may contain sentences from different genres; ideally, most sentences in a cluster come from a single genre. Each cluster is labelled with the genre that most of its sentences come from. After labelling the clusters, you are asked to compute a confusion matrix to show the accuracy of clustering with this particular vector representation. A confusion matrix shows, for each label, the percentage of correctly and incorrectly labelled data. Below is a sample table showing the values and their meaning in a confusion matrix.
                                True Label
                        a      b      c      d      e
    Cluster    a       60%    10%    10%    10%    10%
    Label      b        5%    70%    10%    10%     5%
               c       15%     5%    70%     5%     5%
               d        5%    10%     5%    70%    10%
               e       15%     5%     5%     5%    70%

If we look at the first value column, it says that of all sentences in genre a, 60% are clustered as a, 5% as b, 15% as c, 5% as d and 15% as e. You only need to compute the values of the confusion matrix in your application; the actual table should be included in the report.
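One possible way to derive the cluster labels and the matrix values, assuming a DataFrame `tfidf` with a features column and a genre column (an illustration, not the required design):

    # Sketch: cluster sentence vectors with KMeans, label each cluster by
    # its majority genre, then compute the confusion matrix percentages.
    import pyspark.sql.functions as F
    from pyspark.sql import Window
    from pyspark.ml.clustering import KMeans

    kmeans = KMeans(k=5, featuresCol="features", predictionCol="cluster")
    clustered = kmeans.fit(tfidf).transform(tfidf)

    counts = clustered.groupBy("cluster", "genre").count()

    # The majority genre of each cluster becomes the cluster label.
    w = Window.partitionBy("cluster").orderBy(F.desc("count"))
    labels = (counts.withColumn("rn", F.row_number().over(w))
                    .where("rn = 1")
                    .select("cluster", F.col("genre").alias("cluster_label")))

    # Percentage of each true genre that falls under each cluster label.
    totals = clustered.groupBy("genre").agg(F.count("*").alias("total"))
    confusion = (clustered.join(labels, "cluster")
                          .groupBy("genre", "cluster_label").count()
                          .join(totals, "genre")
                          .withColumn("pct", F.col("count") / F.col("total") * 100))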
4 Performance Analysis
You are asked to find out empirically, with the same cluster capacity, which one of the following configurations delivers better performance for your application:
• Having a small number of executors, each with larger capacity.
• Having a large number of executors, each with limited capacity.
You need to run one of your applications in a few resource allocation configurations to observe the performance difference.
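For example, the two ends of the spectrum could be set via standard Spark properties; the exact values below are illustrative and depend on your cluster's total capacity.

    # Configuration (a): a few large executors.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("perf-analysis")
             .config("spark.executor.instances", "2")
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "8g")
             .getOrCreate())

    # Configuration (b): many small executors, same total capacity.
    #   spark.executor.instances = 8
    #   spark.executor.cores     = 1
    #   spark.executor.memory    = 2g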
5 Implementation Requirement
All analysis should be implemented using the Spark API. Use your own judgement to decide which parts of the application will use Spark RDD, Spark SQL or Spark ML.
While developing your application, make sure that you also pay attention to the quality of your application. In particular,
• You are asked to design the content and sequence of RDDs/DataFrames carefully to minimize repetitive operations.
• You are asked to choose the operations and design their sequence carefully to minimize shuffling size.
6 Deliverable
There are two deliverables: source code and project report (up to 10 pages). Both are due on Wednesday 20th of May (week 12) at 23:59.
There will be separate submission links for the source code and the report to facilitate plagiarism detection. The marker may need to run your code in their own environment, so make sure you prepare one or more README.md files with details on how to configure the environment and run your code. The source code and associated README.md file(s) should be submitted as a single zip file. Remember, only the source code and README.md file(s) should be submitted; no data file or compiled version should be included. ANY group that includes data or any large file in their code submission will receive a penalty deduction.
All members need to sign the group assignment cover page digitally and submit the signed cover page together with the report. They can be submitted as a single pdf file or as a zip file.
7 Demo
There will be a Zoom demo session in week 12 during normal lab time. Each group will have a 10-minute session with the tutor. ALL members of the group must attend the demo. The tutor will communicate with each group on their respective session time and Zoom meeting room information. Demo details and what you need to prepare before the demo will be released in week 12.
8 Report Structure
The report should include the following sections:
• Introduction
• Vocabulary Exploration
• Sentence Vector Exploration
• Performance Evaluation
• Conclusion
You may add other sections or appendices at the end.
The Introduction section should briefly cover the programming language you used to implement the project and any additional software packages used. It should also cover the environment for debugging and running the application.
The Vocabulary Exploration section should cover the application design, implementation details and results related to the vocabulary exploration workloads. To explain the design, you should include one or more annotated data flow DAGs to show the sequence of operations you used to obtain each statistic. The annotation should show the structure/content of the RDDs/DataFrames in the DAG. You should also include design decisions and implementation details made to achieve the quality requirements. The results should be put together in an easy-to-read format.
The Sentence Vector Exploration section should cover the application design, implementation details and results related to the sentence vector exploration workloads. You may include one or more annotated data flow DAGs to show the sequence of operations used to prepare the input for clustering and the data flow of computing the confusion matrix. The results should contain two tables showing the confusion matrix data obtained from the two sentence vector methods.
The Performance Evaluation section should give an overview of the execution environment, including the cluster size and the capacity of the nodes. You should describe each resource configuration you have experimented with. You should experiment with at least TWO resource configurations on ONE relatively complex application with multiple jobs and stages. For each resource configuration, provide a description of the properties used to control the resource allocation, together with high-level execution statistics such as execution time, number of executors, number of threads, total shuffle size and so on. This section should include a few screenshots or diagrams to highlight performance variations under different configurations. For any diagram or screenshot, there should be enough explanation of the performance statistics and the differences observed.
9 Data Set Download Instructions and Other Materials
9.1 Data Set Download
You are suggested to download the MNLI data set from the GLUE site (GLUE Tasks). Download MultiNLI Matched as a zip file. After extracting the zip file, you will see five tsv files in the top-level folder. These are the data sets you will work with.
• train.tsv: the training data set
• dev_matched.tsv: the development set with matched genre
• dev_mismatched.tsv: the development set with mismatched genre
• test_matched.tsv: the test set with matched genre
• test_mismatched.tsv: the test set with mismatched genre
All tsv files have many columns. The data we work with are in the following three columns: genre, sentence1, sentence2.
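For instance, a minimal load could look like this, assuming a header row and a tab separator:

    # Keep only the three columns used in the analysis.
    df = (spark.read.csv("train.tsv", sep="\t", header=True)
               .select("genre", "sentence1", "sentence2"))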
There are many stop word lists available online. You can use any that you are familiar with. For students with no preference, we recommend the Stanford Stop Word List.
9.2 Malformed Rows
There could be malformed rows, and depending on the file reader you use to load the data, they may present different issues. You are suggested to get rid of any malformed row that standard tsv file readers cannot parse properly.
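One possible approach is to use the Spark csv reader's DROPMALFORMED mode together with a null filter; this is an illustration, not the only option.

    # Let the csv reader drop rows it cannot parse, then remove rows
    # whose sentence columns are missing.
    df = (spark.read.csv("train.tsv", sep="\t", header=True,
                         mode="DROPMALFORMED")
               .where("sentence1 is not null and sentence2 is not null"))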
9.3 Tokenizer Options
The majority of the tasks in this assignment involve extracting words from the input sentences. There are many ways of doing this. The simplest option is to use the string split feature to get words from a sentence; alternatively, you may use the tokenizer provided by Spark ML, which appears to just use the same split feature. Neither option ends up with a clean set of words; for instance, punctuation marks are usually attached to the last word of a sentence. A better option is to use the NLTK package. The package itself is installed on EMR, but the data sets required to run most of its functions are not; installing them needs to be done at cluster start-up time. Instructions on how to configure cluster-wide software will be given in the week 10 lab.
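A sketch of NLTK tokenization inside a Spark job follows. It assumes the punkt tokenizer data has already been installed on every node, for example by a bootstrap action running python -m nltk.downloader punkt; the exact setup will be covered in the week 10 lab.

    import nltk

    def tokenize(sentence):
        # word_tokenize splits punctuation into separate tokens, which can
        # then be filtered out; it requires the 'punkt' data on each worker.
        return [t.lower() for t in nltk.word_tokenize(sentence or "")]

    words = df.rdd.flatMap(lambda r: tokenize(r.sentence1) + tokenize(r.sentence2))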