程序代写代做代考 algorithm deep learning Assignment4

Assignment4

Document Analysis Assignment 4: Node Label Prediction¶

Your Information¶
Please fill in the following information:

Name: [Your name]

Uni id: [Your uid]

Overview¶
The task in Assignment 4 is to train a document classifier which predicts the label of documents, while taking into account the network structure between documents.

You are given a dataset of 2708 textual documents (i.e. academic papers), as well as the citation relation between them.
We can therefore organize the documents as a network in which each node represents a document, and the directed edge between node $u_i$ and node $u_j$ occurs when the paper $u_i$ cites $u_j$.
Each document is labeled with one of the seven possible classes (i.e. publication venues), denoted as 1, 2, …, 7.

You are provided with the textual content of the 2707 documents in the dataset (the abstracts of the papers), and the citation relations between documents.
You are also provided with the labels of 2167 documents in the train set.
Your task is to predict the labels of the 543 documents in the test set.
You need to take into account both the textual content of documents, as well as the structure of their network.

Searching for solutions for how to embed network structure is part of the assignment.
We outline hereafter several possible solutions, ordered from easier to more difficult, and from lower achieved score to higher score.
The easiest solution is outlined in more detail.

1. Matrix factorization-based solution¶
This solution is based on constructing an embedding of each document (node) in the network, based on the outputs of Singular Value Decomposition (SVD).
In broad strokes, SVD decomposes a matrix $A$ of size $m \times n$ into a product of three matrices $U$, $\Sigma$ and $V$ so that $A = U \Sigma V^*$, where:

$U$ is an $m \times m$ unitary matrix ,
$\Sigma$ is a diagonal $m \times n$ matrix with non-negative real numbers on the diagonal,
$V$ is an $n \times n$ unitary matrix, and $V^∗$ is the conjugate transpose of $V$.

To solve the assignment using this method, you will need to follow the following steps:

Read and preprocess the training and testing dataset in order to extract text feature (note the dataset structure here after);
Construct the adjacency matrix $A$ representing the relations between documents in the network;
Apply SVD on matrix $A$ to obtain matrices $U$, $\Sigma$ and $V$;
Construct network embeddings: use the rows of the matrix $U \Sigma$ or the columns of $\Sigma V$ as network features for each document;
Obtain a document vectorial description by concatenating the network representation and the textual representation;
Train and test classifier.

Notes:

this is a simplistic approach and unlikely to be competitive compared to the assignment’s upper bound;
for a more powerful algorithm based on SVD, consult the paper “1. Matrix factorization” in the papers folder;
you can reuse code given in the tutorial or in your own solutions for the previous assignments, especially for treating the text of the documents.

For more information on SVD please check this MIT tutorial and the Wikipedia page for SVD.

2. Other solutions¶
There are many other solutions to this problem, based on several paradigms, out of which we enumerate the following:

Label propagation uses a statistical machine learning to give a label probability for each vertex, then propagates part of the label probability to all of its neighbours. The propagation steps are repeated until all label probabilities converge. Check the paper “2. Label propagation” in the papers folder;
node2vec constructs for each node a multi-dimensional embedding by performing random walks starting with that node; it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. More details in the paper “3. Node2vec embedding” and on the project’s website;
deep neural network generate hidden representations from both attributes of vertices and representations of neighbouring vertices via recursive neural networks. This is the most difficult solution to implement, but has the potential of delivering the best results. More details in the paper “4. Deep Neural Networks”.

Dataset and provided files¶
links.txt: The structure of the network of documents¶
Every line is in the format of “X Y”, meaning that document X cites Y. X and Y are document ids.

content.txt: The textual content of each document¶
Every line is in the format “id f1 f2 f3 …”, where id is the document id, and the rest of the line contains the tokens (words) of the document.
A word appears only once in a document.
You can apply any textual technique that you know to transform the text of each document in a vectorial format.

label.train.csv: The labels of the training set.¶
Each line provides the pair “id, label”, where id is a document id, and the label is a value between 1 and 7 giving the class corresponding to the document.

label.test.csv: Document ids to be predicted by you¶
Each line contains a document id. For each document in the test set, you need to predict its class (a number from 1 to 7), based on its textual content and its network structure.

Evaluation¶
The performance of your prediction will be evaluated automatically on Kaggle using the Categorization Accuracy (the percentage of correct predictions).

Your score will be computed using a lower bound and an upper bound, which will be shown on the Kaggle leader board.
Achieving a Categorization Accuracy score equal to the lower bound amount to a grade of zero, while achieving the upper bound amounts to the full points (here 7 points, see score distribution here below).
Consequently, your score for this competition task will be calculated based on:

$$
\operatorname{Your\_Score} = \frac{Your\_Performance – Lower\_Bound}{Upper\_Bound – Lower\_Bound} * 7
$$Notes about the lower bound and upper bounds predictors:

The lower bound was obtained by predicting document class using the textual features only (i.e. as in the ML assignment). Any successful embedding of network structure should earn you marks;
The upper bound was obtained using the deep learning approach presented here above.

If you obtain a better performance than the upper bound, you will obtain a grade higher than 7 points for the coding part. This can be useful to compensate for any lost points for the written questions.
Note however, that the total grade of this assignment is capped at 10 marks.

For the Kaggle competition¶
Join the competition here;
Before submitting the result, first go to team menu and change your team name as your university id. Note that submissions not associated with a university id will not be taken into account;
You need to upload the generated result file to Kaggle. The result file should be in the following format:
id,label
324123,1
312312,2
43431,3

Note: there should NOT be any blank space after comma, otherwise the label will not be recognised by kaggle correctly (e.g. 324123,1 instead of 324123, 1 )
Note that you are only allowed to upload 20 copies of your results to Kaggle per day. Make every upload count, and don’t waste your opportunities!
You should use cross-validation instead of relying on the public set – this is what the daily limit is for!
For detailed submission instructions, check the end of this notebook.

Score distribution (total 10 points):

Kaggle competition: 7 points
Written part Q1: 2 points
Written part Q2: 1 point

After completion, please rename this notebook to your_uid.ipynb (e.g. u6000001.ipynb) and submit this file to Wattle. Do not upload any other files to Wattle except this notebook file.

Note: you need to fill in the cell below with your code. Failure to provide your code nullifies your Kaggle grade (meaning you get zero for the coding part).

1. Code for solving the assignment¶
Please paste here below the code which is the solution to your assignment, including any package imports.
Note that, upon request, you may be required to show (on your own machine) that the code included in the submitted notebook actually generates the results submitted to Kaggle.

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

2. Written part¶
Answer briefly and concisely the following questions, based on your implementation.
Provide answers using bullet list with 2~3 items.
Check this if you are not familiar with markdown syntax.

Question 1 (2pts)¶
How do you use the network structure in your classification? (1 mark) Why does your method work on this task? (1 mark)

YOUR ANSWER HERE

Question 2 (1pt)¶
How do you incorporate text features?

YOUR ANSWER HERE

Upload output file to Kaggle competition site¶
Once you generate output.csv file, you can upload your result on Kaggle competition site. To upload and evaluate your result

Go to Kaggle competition site: Click here.
Sign up for Kaggle if you do not have an account. Go back to the original kaggle page.
Before submitting the result, first go to team menu and change your team name as your university id.

Time to submit your own result. Click submit predictions in the menu, you may need to agree the competition rules before submitting your result.
Upload your output csv file, you can write additional description of your submission in the description box.
Note that you are only allowed to submit 3 results per day. Do not upload an arbitrary result and think which algorithm or parser will perform the best.
If your output format is correct, the system will generate your score automatically.
Go to Leaderboard menu. The leaderboard will show the current score of the other students.

Note that you can check all of your submission from my submission menu. Please select one best performing submission before the assignment due. The selected submission will be used to measure the performance of hidden test case (see below for details).