PAPER CODE NO.
COMP527/337
EXAMINER: Dr. Angrosh Mandya Tel. No. 0151 7954293 DEPARTMENT: Computer Science
SECOND SEMESTER EXAMINATIONS 2019/20 Data Mining and Visualisation
TIME ALLOWED : Open-Book Exam
INSTRUCTIONS TO CANDIDATES
• Answer all SIX questions.
• The final exam for the course Data Mining and Visualisation is conducted as an open-book exam. The question paper will be provided on 22nd May, 2020 at 10:00 a.m. You are expected to complete the exam and return scanned copies of your answer scripts before 29th May 10:00 a.m. Answer scripts received after 29th May 10:00 a.m. will be invalid.
• Completed answer scripts should be submitted using the link: https://sam.csc.liv.ac.uk/COMP/Submissions.pl
• Answer scripts can be prepared in the following ways:
1. You can write your answers by hand on plain paper, then scan or photograph the sheets, convert them to a single PDF and submit it at the link provided above.
2. If you do not have access to a scanner or are unable to photograph a handwritten solution, you can use a word processor (MS Word or LaTeX), convert the file into a single PDF and submit it as above.
• Please note the recommended time for completing the exam when conducted as a regular exam during previous years was 2.5 hours. However, the open-book exam is available to you from 22nd May, 2020 10:00 a.m. – 29th May, 2020, 10:00 a.m. Answer scripts can be submitted at any time during this period. The latest version received before 29th May, 2020, 10:00 a.m. will be considered for evaluation. All submissions received after the deadline will be considered as invalid.
• Since the exam is conducted as an open-book exam, students are expected to work on their own to prepare their answers. Any sort of plagiarism will be strictly penalised.
• Calculators can be used for performing calculations.
PAPER CODE COMP527/337 page 1 of 7 Continued
Question 1 Consider the two sentences
S1 = great movie my family loved it
S2 = worst movie my family hated it
Let us consider the words my and it to be stop words and apply lemmatisation to the other words. We represent S1 and S2 respectively by vectors $x_1, x_2 \in \mathbb{R}^6$ in the vector space whose six dimensions are defined respectively by great, movie, family, loved, worst and hated. Answer the following questions.
A. Write the two vectors $x_1$ and $x_2$ in this vector space. (2 marks)
B. Calculate the $\ell_1$ and $\ell_2$ norms of $x_1$ and $x_2$. (2 marks)
C. Calculate the Euclidean distance and the Manhattan distance between $x_1$ and $x_2$. (2 marks)
D. Compute the cosine similarity between $x_1$ and $x_2$. (1 mark)
E. Write the set of unigrams, bigrams and trigrams that can be extracted from S1. (3 marks)
F. Write the set of bigrams that can be extracted from S2. (3 marks)
G. State one advantage of removing stop words in text mining tasks. (1 mark)
H. State one disadvantage of removing stop words in text mining tasks. (1 mark)
I. What is meant by part-of-speech in text mining? (2 marks)
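The vector-space quantities in Question 1 can be checked numerically. The following is a minimal Python sketch, assuming a count-based bag-of-words representation over the six dimensions named in the question. Lemmatisation is omitted (an assumption) because each remaining word already maps to exactly one dimension. The helper names `vectorise`, `difference`, `cosine` and `ngrams` are illustrative and not part of the paper.

```python
import math

# Dimension order as listed in the question.
VOCAB = ["great", "movie", "family", "loved", "worst", "hated"]
STOP_WORDS = {"my", "it"}


def vectorise(sentence):
    """Count-based bag-of-words vector over VOCAB, ignoring stop words."""
    tokens = [t for t in sentence.split() if t not in STOP_WORDS]
    return [tokens.count(word) for word in VOCAB]


def l1_norm(v):
    return sum(abs(a) for a in v)


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def difference(u, v):
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


x1 = vectorise("great movie my family loved it")
x2 = vectorise("worst movie my family hated it")
diff = difference(x1, x2)

print("x1 =", x1, "x2 =", x2)
print("l1 and l2 norms of x1:", l1_norm(x1), l2_norm(x1))
print("Euclidean and Manhattan distance:", l2_norm(diff), l1_norm(diff))
print("cosine similarity:", cosine(x1, x2))
print("bigrams of S1:", ngrams("great movie my family loved it".split(), 2))
```

The Euclidean and Manhattan distances are computed as the $\ell_2$ and $\ell_1$ norms of the difference vector. Whether n-grams are extracted before or after stop-word removal is left open by the question; the sketch applies `ngrams` to the full token list.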
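Question 2 Consider the following two training datasets $D_1 = \{(x_n, t_n)\}_{n=1}^{4}$ and $D_2 = \{(x_n, t_n)\}_{n=1}^{4}$, where $x_n \in \mathbb{R}^2$.
In $D_1$, $x_1 = (0,0)^\top$, $x_2 = (0,1)^\top$, $x_3 = (1,0)^\top$ and $x_4 = (1,1)^\top$, and the labels are $t_1 = t_2 = t_3 = -1$ and $t_4 = 1$.
In $D_2$, $x_1 = (0,0)^\top$, $x_2 = (0,1)^\top$, $x_3 = (1,0)^\top$ and $x_4 = (1,1)^\top$, and the labels are $t_1 = t_4 = -1$ and $t_2 = t_3 = 1$.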
A. What is linear separability? Given $D_1$ and $D_2$, which of the two datasets is not linearly separable and why? (3 marks)
B. If the current weight vector and bias are respectively $w^{(k)}$ and $b^{(k)}$, write their updated values $w^{(k+1)}$ and $b^{(k+1)}$ after misclassifying an instance $x$ with label $t$. (2 marks)
C. Consider the training set $D_1$. Initialising $w^{(0)} = (0,0)^\top$ and $b^{(0)} = 0$, compute the final values of the weight vector and bias when the Perceptron algorithm is applied to $D_1$ in the order $x_1, x_2, x_3$ and $x_4$, applying an update whenever $ta \le 0$, where $a = w^\top x_n + b$ is the activation for instance $x_n$, $n = 1, \dots, 4$. (4 marks)
D. Show that the Perceptron with weight vector $w' = (3,2)^\top$ and bias $b' = -4$ also correctly classifies the data points in the training dataset $D_1$. (4 marks)
E. Show that the Perceptron with weight vector $w' = (3,2)^\top$ and bias $b' = -4$, which correctly classifies all four instances in $D_1$ in part D, fails to correctly classify all data points in the training dataset $D_2$. (4 marks)
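The Perceptron procedure in Question 2 can be traced mechanically. The following is a minimal Python sketch under two assumptions that the paper does not state explicitly: the update on a mistake is the standard rule $w \leftarrow w + t\,x$, $b \leftarrow b + t$, and training cycles through the data in the given order until a full pass produces no update. The names `train_perceptron` and `misclassified` are illustrative.

```python
def activation(w, b, x):
    """a = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(data, max_epochs=100):
    """Cycle through data in order and update whenever t * a <= 0.

    Stops after the first pass with no updates (assumed stopping rule),
    or after max_epochs passes if the data are not linearly separable.
    """
    w, b = [0, 0], 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in data:
            if t * activation(w, b, x) <= 0:
                # Standard perceptron update (assumption): w += t*x, b += t.
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def misclassified(w, b, data):
    """Instances for which t * a <= 0 under the given parameters."""
    return [x for x, t in data if t * activation(w, b, x) <= 0]


D1 = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
D2 = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

w, b = train_perceptron(D1)
print("final w, b on D1:", w, b)
print("misclassified in D1:", misclassified(w, b, D1))
print("misclassified in D2 with w'=(3,2), b'=-4:", misclassified([3, 2], -4, D2))
```

The `max_epochs` cap is there because the loop would never terminate on a dataset that is not linearly separable, such as an XOR-style labelling.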
Question 3 Consider five data points in $\mathbb{R}^2$ given by $x_1 = (1,1)^\top$, $x_2 = (2,1)^\top$, $x_3 = (2,2)^\top$, $x_4 = (1,2)^\top$ and $x_5 = (0,1)^\top$. Answer the following questions about this dataset.
A. Let us assume that we clustered this dataset into two clusters $S_1 = \{x_1, x_2\}$ and $S_2 = \{x_3, x_4, x_5\}$. Moreover, let us represent $S_1$ and $S_2$ by two arbitrary 2-dimensional vectors $\mu_1$ and $\mu_2$, respectively. Write the Within-Cluster Sum of Squares (WCSS) objective $f(S_1, S_2)$ for this clustering. (2 marks)
B. Calculate the partial derivatives of $f(S_1, S_2)$ with respect to $\mu_1$ and $\mu_2$ and find the critical points of $f(S_1, S_2)$ by setting the derivatives to zero. (2 marks)
C. We would like to apply the k-means clustering algorithm taught in the lectures to the above-mentioned dataset to create two clusters. If the initial centroids for the two clusters were set to $\mu_1 = (2,1)^\top$ and $\mu_2 = (2,2)^\top$, what would be the final set of clusters after the first iteration, using Euclidean distance as the similarity measure between the data points? (3 marks)
D. Compute the values of the cluster centroids $\mu_1$ and $\mu_2$ following the assignment resulting in part (C). (3 marks)
E. In a different setting, consider two clusters $S_1 = \{B, B, R\}$ and $S_2 = \{R, B, R\}$ consisting of blue (B) and red (R) coloured balls. Compute the purity of this clustering. (2 marks)
F. Compute the Rand index for the clustering described in part (E). (3 marks)
G. Compute the precision, recall and F-score for the clustering described in part (E). (2 marks)
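One k-means iteration and the pair-counting evaluation measures in Question 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: ties in the assignment step go to the lower-indexed centroid, no cluster becomes empty, pairs are counted over all items with "same cluster and same colour" as a true positive, and the F-score is taken to be F1. The helper names are illustrative and not from the paper.

```python
from collections import Counter
from itertools import combinations


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for each point (ties go to the first)."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def update(points, labels, k):
    """Mean of the points assigned to each cluster (assumes none is empty)."""
    centroids = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centroids


def purity(clusters):
    """Fraction of items that belong to the majority class of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(max(Counter(c).values()) for c in clusters) / total


def pair_counts(clusters):
    """Return (TP, FP, FN, TN) over all unordered pairs of items."""
    items = [(ci, lab) for ci, c in enumerate(clusters) for lab in c]
    tp = fp = fn = tn = 0
    for (c1, l1), (c2, l2) in combinations(items, 2):
        if c1 == c2 and l1 == l2:
            tp += 1
        elif c1 == c2:
            fp += 1
        elif l1 == l2:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


points = [(1, 1), (2, 1), (2, 2), (1, 2), (0, 1)]
labels = assign(points, [(2, 1), (2, 2)])
centroids = update(points, labels, 2)
print("assignments:", labels, "new centroids:", centroids)

balls = [["B", "B", "R"], ["R", "B", "R"]]
tp, fp, fn, tn = pair_counts(balls)
rand_index = (tp + tn) / (tp + fp + fn + tn)
precision, recall = tp / (tp + fp), tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
print("purity:", purity(balls), "Rand index:", rand_index)
print("precision, recall, F1:", precision, recall, f_score)
```

Squared distances are used in `assign` because they give the same nearest centroid as Euclidean distances while avoiding the square root.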
Question 4 Consider the logistic regression model shown in Figure 1. Here a two-dimensional input (represented by two features $x_1$ and $x_2$) is multiplied by a two-dimensional weight vector $w$. A bias term $b$ is also used in the network. Let $z$ be the net score and $y_n$ be the activation computed in the network, resulting in an output $o$.
Figure 1: Logistic regression model
Answer the following questions:
A. Write the formula for computing the net score $z$ in the network. (1 mark)
B. Assuming that the logistic model uses the sigmoid function to compute the activation $y_n$, write down the formula for computing $y_n$. (1 mark)
C. Provide the reasons for using the sigmoid function in the logistic model. (1 mark)
D. Given a training dataset $\{(x_n, t_n)\}_{n=1}^{N}$, where $t_n \in \{0, 1\}$, with $n = 1, \dots, N$, write down the steps involved in deriving the likelihood as a function of $w$ for this dataset. (3 marks)
E. Define the negative log of the likelihood function obtained in part D to derive the cross-entropy error function $E(w)$. (2 marks)
F. Write down the gradient $\nabla E(w)$ by differentiating $E(w)$, obtained in part E, with respect to $w$. (3 marks)
G. Using a fixed learning rate $\eta$, write down the stochastic gradient descent weight update rule using the gradient obtained in part F. (2 marks)
H. Modify the update rule in part G to add an $\ell_2$ regularisation on $w$ with a regularisation coefficient $\lambda$. (2 marks)
I. Discuss some of the possible solutions to avoid overfitting in a logistic regression model. (2 marks)
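The chain from net score to regularised update in Question 4 can be illustrated with a short Python sketch. It assumes the usual forms $z = w^\top x + b$, $y = 1/(1 + e^{-z})$, the cross-entropy error $E(w) = -\sum_n [t_n \ln y_n + (1 - t_n)\ln(1 - y_n)]$ and the per-instance gradient $(y_n - t_n)\,x_n$. The regulariser is assumed to be $(\lambda/2)\lVert w \rVert^2$, so that its gradient is $\lambda w$, and the bias is left unregularised. The toy dataset and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """y = sigmoid(w^T x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def cross_entropy(w, b, data):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    total = 0.0
    for x, t in data:
        y = predict(w, b, x)
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total


def sgd_step(w, b, x, t, eta, lam=0.0):
    """One stochastic gradient step on a single instance (x, t).

    lam adds the gradient of (lam / 2) * ||w||^2 (assumed form of the
    l2 penalty); the bias is not regularised (assumption).
    """
    error = predict(w, b, x) - t  # (y_n - t_n)
    w = [wi - eta * (error * xi + lam * wi) for wi, xi in zip(w, x)]
    b = b - eta * error
    return w, b


# Hypothetical toy data with labels in {0, 1}.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w, b = [0.0, 0.0], 0.0
initial_error = cross_entropy(w, b, data)
for _ in range(200):
    for x, t in data:
        w, b = sgd_step(w, b, x, t, eta=0.1, lam=0.01)

print("error before training:", initial_error)
print("error after training:", cross_entropy(w, b, data))
```

Setting `lam=0.0` recovers the unregularised update, which makes it easy to compare the two rules on the same data.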
Question 5 Consider the sentence S1 given below. Suppose that you are required to train a Continuous Bag of Words (CBOW) Word2Vec model using S1.
S1 = i like data science
A. Illustrate with a diagram the CBOW model with a hidden layer consisting of 2 neurons. Indicate the input, hidden and output layers of the network with $x$, $h$ and $u$, respectively. Use $W$ and $W'$ to indicate the parameters of the model from the input layer to the hidden layer and from the hidden layer to the output layer, respectively. Write down the dimensions of $x$, $h$ and $u$. (3 marks)
B. Let $W = \begin{pmatrix} 1.0 & 0.2 \\ 0.5 & 2.4 \\ 1.2 & 0.8 \\ 2.0 & 0.6 \end{pmatrix}$ be a randomly initialised weight matrix between the input and the hidden layer. Following the vectorised form, compute the output $h$ at the hidden layer obtained in the CBOW model for the input context word “science”, using the one-hot vector representation of “science”. (4 marks)
C. Justify the use of softmax activation at the output layer in the CBOW model. (5 marks)
D. Let $W' = \begin{pmatrix} 1 & 2 & 3.0 & 2.5 \\ 1.5 & 0.5 & 1.0 & 0.5 \end{pmatrix}$ be a randomly initialised weight matrix defined between the hidden layer and the output layer. Write down the softmax function to compute the output $u_i$ at the output layer in the CBOW model. Also compute the output probabilities for each word in $V$ using the hidden output $h$ obtained in part B. (5 marks)
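The CBOW forward pass in Question 5 can be sketched in Python as follows. This is a minimal illustration under assumptions the paper does not fix: the vocabulary is indexed in order of appearance in S1 (i, like, data, science), the hidden layer is $h = W^\top x$ for a single context word, and the output scores are $W'^\top h$ followed by a softmax. The entries of `W_PRIME` are taken as printed in part D.

```python
import math

# Index order of the vocabulary is an assumption (order of appearance in S1).
VOCAB = ["i", "like", "data", "science"]

# |V| x 2 matrix between the input and hidden layers (part B).
W = [[1.0, 0.2],
     [0.5, 2.4],
     [1.2, 0.8],
     [2.0, 0.6]]

# 2 x |V| matrix between the hidden and output layers (part D, as printed).
W_PRIME = [[1.0, 2.0, 3.0, 2.5],
           [1.5, 0.5, 1.0, 0.5]]


def one_hot(word):
    return [1.0 if w == word else 0.0 for w in VOCAB]


def hidden(x):
    """h = W^T x; with one context word no averaging is needed."""
    return [sum(W[i][j] * x[i] for i in range(len(x)))
            for j in range(len(W[0]))]


def scores(h):
    """Net input to each output unit: W'^T h."""
    return [sum(W_PRIME[j][i] * h[j] for j in range(len(h)))
            for i in range(len(W_PRIME[0]))]


def softmax(u):
    """exp(u_i) / sum_j exp(u_j), shifted by max(u) for numerical stability."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


h = hidden(one_hot("science"))
u = scores(h)
probs = softmax(u)

print("h =", h)
print("scores =", u)
for word, p in zip(VOCAB, probs):
    print(word, round(p, 4))
```

Because the input is one-hot, computing $h$ amounts to selecting the row of $W$ that corresponds to the context word. A different vocabulary ordering would select a different row.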
Question 6 Consider that you are hired for a project that studies which adverse side effects of a particular drug X are reported on Twitter by patients taking it. You are given a large collection of tweets that mention the drug name X for this purpose. A subset of these tweets, annotated with regard to the effect of the medicine as positive or negative by the Research and Development (R&D) wing of the company, is also provided to you. Given this scenario, answer the following questions.
A. Describe three challenges that you are likely to encounter in this project. (5 marks)
B. Assuming that you want to use an Artificial Neural Network (ANN) for training a supervised model for sentiment analysis using the annotated corpus, answer the following:
B1. Describe the pre-processing steps that you would use to clean the data. (5 marks)
B2. Provide the details as to how you would proceed in implementing the ANN model. (Please note that we don’t want you to write or explain implementation code. We are more interested in seeing your approach as to how you would proceed. For example, your approach to pre-processing the Twitter data, converting the pre-processed text into feature vectors (using pre-trained word embeddings), the details of the different layers of the network, and the activation and loss functions used in the model.) (5 marks)
PAPER CODE COMP527/337 page 7 of 7 End