COMP90042 Project 2021: Rumour Detection and Analysis on Twitter
Copyright the University of Melbourne, 2021
Project type: Individual (non-group)
Report and code submission due date: 9pm Thu, 13th May 2021
Codalab submission due date: 1pm Thu, 13th May 2021 (no extensions possible for this component)
The concept of rumour has a long history, and it is typically defined as an unverified statement or news circulating from person to person. Rumours have the potential to spread quickly through social media, and bring about significant economic and social impact. The figure below illustrates an example of a rumour propagating on Twitter. The source message (green box) started a claim about the cause of Michael Brown’s shooting, and it was published shortly after the shooting happened. It claimed that he was shot ten times by the police for stealing candy. The message was retweeted by multiple users on Twitter, and within 24 hours there were about 900K users involved, either by reposting, commenting, or questioning the original source message. From the replies we can see some users (e.g. User 7; red box) questioned the truthfulness of the original message.
[Figure: a rumour propagating on Twitter, drawn on a timeline from 0 to 24 hours in 4-hour steps. The source tweet (green box) and its replies are listed below; the reply from User 7 is in the red box.]
• User 0 (Follower Count: 873021), source tweet: 17 year old unarmed kid shot ten times by police for stealing candy. I didn’t know that was punishable by death.
• User 4 (Follower Count: 205): apparently it is now.
• User 5 (Follower Count: 6632): These days anything, especially with Stand Your Ground and even a sneeze is punishable by death.
• User 1 (Follower Count: 1222): No excuse.
• User 3 (Follower Count: 6): It applies to Black people.
• User 2 (Follower Count: 3144): This is unbelievable, or should be.
• User 6 (Follower Count: 1141): there has not been any proof that he stole candy. I guess skittles has become a reason to kill black teens.
• User 7 (Follower Count: 122): He was 18. Nothing to do with stealing candy. He was walking in the street. Horrible situation. But stop spreading false facts.
• User 0 (Follower Count: 873021): I was just going off what I read in the #ferguson tag early last night. Wasn’t any real news out at that point.
• User 8 (Follower Count: 11197): Anything is punishable by death if the youth is black.
The challenge of the project is to develop a rumour detection system and analyse the nature of rumours that are being propagated on Twitter. We will frame this using two tasks: rumour detection and rumour analysis.
Task 1: Rumour Detection
In this task, you will be provided with a set of source tweets and their replies (i.e. the comments we saw in the figure above), and each source tweet is labelled as either a rumour or non-rumour. The task here is to build a binary classifier using this dataset. For each tweet (source tweet or reply tweet), the dataset provides a range of information, including the text of the tweet, information about the user who made the tweet, the unique ID of the tweet, etc. (more information is provided in the “Datasets” section below).
You’re free to explore any methods or machine learning models for building the binary classifier. To give some ideas, we could model the source tweet and replies as a sequence of tweets using recurrent networks. Alternatively we could also model them based on their propagation structure (like the tree structure of comments we saw earlier) using recursive networks or graph networks. We might want to consider incorporating some user information, as it could provide hints to the trustworthiness of a user. While you are permitted to use pretrained models or embeddings, you should only use the provided dataset for training the model, i.e. you should not crawl or search for more training data. Whatever methods or features you use, you must at least incorporate the tweet text in your model (we are doing an NLP project, after all).
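As one possible starting point for the ideas above, the sketch below flattens an event (source tweet plus replies) into a single text input together with two simple numeric features. It is only an illustration: the helper name event_to_example, the " [SEP] " separator and the choice of features are hypothetical, and followers_count is assumed to be the field name used by the Twitter API user object (it is not shown in this handout).

```python
import math


def event_to_example(event, separator=" [SEP] "):
    """Flatten one event into a text string plus simple numeric features.

    `event` is a list of tweet dicts whose first element is the source tweet.
    """
    source, replies = event[0], event[1:]
    # The tweet text must be part of the model input, so join source and replies.
    text = separator.join(tweet["text"] for tweet in [source] + replies)
    # Assumption: the user object carries a `followers_count` field.
    followers = source.get("user", {}).get("followers_count", 0)
    return {
        "text": text,
        "num_replies": len(replies),
        "log_followers": math.log1p(followers),  # dampen very large counts
    }


# Toy event with made-up content, for illustration only.
toy_event = [
    {"text": "claim", "user": {"followers_count": 100}},
    {"text": "really?", "user": {"followers_count": 5}},
    {"text": "source?", "user": {"followers_count": 7}},
]
example = event_to_example(toy_event)
print(example["text"])
```

The resulting text could then be fed to whatever text encoder you choose; the numeric features are optional extras.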
Task 2: Rumour Analysis
In this task, you will use your trained rumour classifier from the first task and apply it to a set of provided COVID-19 tweets to detect rumours. Given the predicted rumours and non-rumours, the aim here is to perform some analyses to understand the nature of COVID-19 rumours and how they differ from their non-rumour counterparts. Below are some questions to get you started:
• What are the topics of COVID-19 rumours? How do they differ from the non-rumours? How do these topics evolve over time?
• What are the popular hashtags of COVID-19 rumours and non-rumours? How much overlap or difference do they share?
• Do rumour source tweets convey a different sentiment/emotion to the non-rumour source tweets? What about their replies?
• What are the characteristics of rumour-creating users, and how do they differ compared to normal users?
Note that these are just some suggestions, and you are free to explore any questions to understand COVID-19 rumours on Twitter. As before, while you are free to do any analyses that may not be text-related (e.g. propagation analysis), most of your analyses should involve the text of the tweets. You should do your analyses using only the provided data.
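To make the hashtag question above concrete, here is a minimal sketch of counting hashtags in source tweets and measuring the overlap between the two predicted classes with Jaccard similarity. The regular expression, the helper names and the toy tweets are all assumptions made for illustration; Jaccard is just one of several reasonable overlap measures.

```python
import re
from collections import Counter

# Simple pattern: "#" followed by word characters (an approximation of
# how Twitter delimits hashtags).
HASHTAG_RE = re.compile(r"#\w+")


def hashtag_counts(events):
    """Count lower-cased hashtags in the source tweet of each event."""
    counts = Counter()
    for event in events:
        source = event[0]  # the first tweet of an event is the source tweet
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(source["text"]))
    return counts


def jaccard(tags_a, tags_b):
    """Jaccard similarity between two collections of hashtags."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy events with made-up text, standing in for predicted rumours/non-rumours.
rumour_events = [
    [{"text": "Garlic cures it! #COVID19 #cure"}],
    [{"text": "5G did this #covid19 #5G"}],
]
non_rumour_events = [[{"text": "Wash your hands #COVID19 #StaySafe"}]]

rumour_tags = hashtag_counts(rumour_events)
non_rumour_tags = hashtag_counts(non_rumour_events)
print(rumour_tags.most_common(3))
print(jaccard(rumour_tags, non_rumour_tags))
```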
You will complete both tasks, and write a report that details: (1) your detection system (first task), e.g. the reasons behind the choices you have made and its performance; and (2) findings from your analyses (second task). Both tasks are equally important, so you should put similar effort into each when you’re working on the project.
We hope that you will enjoy the project. To make it more engaging we will run the first task as a Codalab competition. You will be competing with other students in the class. The following sections give more details on data format, the grading scheme and the use of Codalab. Your assessment will be based on your report, your performance in the competition, and your code.
Submission materials: Please submit the following:
• Report (.pdf): https://canvas.lms.unimelb.edu.au/courses/121115/assignments/213183
• A zip file (.zip) containing your Python code (.py or .ipynb) and scripting code (.sh or similar) if using Unix tools: https://canvas.lms.unimelb.edu.au/courses/121115/assignments/213184
Note that there are two different submission links/shells for the project report and code. The reason for separating these two submissions is that we will be running peer reviewing for the report one week after the project has completed. Please note that you should be uploading a single pdf document for your report and a single zip file for your code; all other formats are not allowed, e.g. docx, 7z, rar, etc. Your submission will not be marked and will be given a score of 0 if you use these other formats. You should not have any external data files, as you should use only the provided data for model training and rumour analysis.
If multiple code files are included, please make it clear in the header of each file what it does. If pre-trained models or embeddings are used, you do not need to include them as part of the submission, but make sure your code or script downloads them if necessary. We should be able to run your code if needed. Note, however, that code is secondary: the primary focus of marking will be your report and your system performance on Codalab.
You must submit at least one entry to the Codalab competition.
Late submissions: -10% per day
Marks: 35% of mark for class
Materials: See Using Jupyter Notebook and Python page on Canvas (under Modules>Resources) for information on the basic setup required for COMP90042, including an iPython notebook viewer and the Python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. For this project, you are encouraged to use the NLP tools accessible from NLTK, such as the Stanford parser, NER tagger, etc., or you may elect to use the Spacy or AllenNLP toolkit, which bundle a lot of excellent NLP tools. You may also use Python based deep learning libraries: TensorFlow/Keras or PyTorch. You should use Python 3.
You are being provided with various files including a training, a development and a test set. See the instructions below (section “Datasets”) for information on their format and usage. If there’s something you want to use and you are not sure if it’s allowed, please ask in the discussion forum (without giving away too much about your ideas to the class).
Grading: You will be graded based on several criteria: clarity of expression in your report, soundness and novelty of your methods, substance of your work, interpretation of your results and performance of your system (section “Grading” below provides more details).
Updates: Any major changes to the project will be announced via Canvas. Minor changes and clarifications will be announced in the discussion forum on Canvas; we recommend you check it regularly.
Academic Misconduct: This is an individual project, and while you’re free to discuss the project with other students, this is ultimately an individual task, and so reuse of code between students, copying large chunks of code from online sources, or other instances of clear influence will be considered cheating. Do remember to cite your sources properly, both for research ideas and algorithmic solutions and code snippets. We will be checking submissions for originality and will invoke the University’s Academic Misconduct policy where inappropriate levels of collusion or plagiarism are deemed to have taken place.
Datasets
You are provided with several data files for use in the project:
• [train,dev,test].data.jsonl: tweet data for the first task (rumour detection);
• [train,dev].label.json: rumour labels for the first task (rumour detection);
• covid.data.jsonl: COVID-19 tweet data for the second task (rumour analysis).
All data files ([train,dev,test,covid].data.jsonl) are JSONL files, where each line is a JSON string. For these files, each line is an event: a list of tweets where the first tweet is a source tweet and the rest are reply tweets. When classifying whether a source tweet is spreading a rumour, we recommend that you use both source and reply tweets from the event [1]. Note that the reply tweets are sorted arbitrarily: they are not ordered chronologically by their timestamp, so you need to be careful if you are modelling them as a sequence of tweets.
[1] While you can technically build a classifier that uses only the source tweets to detect rumours, its performance is likely to be poor, as the reply tweets provide helpful information about whether the source tweet is spreading a rumour, as we saw in the example figure earlier.
An event (list of tweets) is a JSON string, and looks like the following:
[
  {
    "created_at": "Wed Jan 07 12:01:03 +0000 2015",
    "id_str": "552797058990870528",
    "in_reply_to_user_id_str": null,
    "text": "Spread this cover in solidarity with the victims at Charlie Hebdo. Don’t let the sword conquer the pen. http:\/\/t.co\/XVkPPbkLhn",
    "user": {
      "id": 11345012,
      "id_str": "11345012",
      "name": "Ivo Vegter",
      …
    },
    …
  },
  {
    "created_at": "Wed Jan 07 17:39:24 +0000 2015",
    "id_str": "552882207426355201",
    "in_reply_to_status_id_str": "552797058990870528",
    "text": "@IvoVegter @pankajchandak #IMwithCharlieHebdo",
    "user": {
      "id": 1083066139,
      "id_str": "1083066139",
      "name": "Solid Item",
      …
    },
    …
  }
]
Here the event consists of just 2 tweets, one source tweet and a reply tweet. Notice that the source tweet is not a reply to anything (its in_reply_to_user_id_str has a null value), while the reply tweet’s in_reply_to_status_id_str has a value of 552797058990870528, which is the id_str of the source tweet. If you are modelling the propagation structure of the comments (i.e. to recover the tree structure that we saw in the example figure earlier), you will need to use the in_reply_to_status_id_str key. There are many more key-value pairs for each tweet object; you should consult the Twitter API documentation for their definitions. You are free to use any of these key-value pairs in the tweet object as features for your rumour detection system.
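The two points above (replies arrive in arbitrary order, and the reply structure is recoverable from in_reply_to_status_id_str) can be sketched as follows. This is a minimal illustration, not a prescribed method: the function names are hypothetical, the time format is inferred from the created_at values in the example event, and replies whose parent is missing from the event are simply attached to whatever parent ID they name.

```python
from collections import defaultdict
from datetime import datetime

# Format inferred from the example, e.g. "Wed Jan 07 12:01:03 +0000 2015".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_time(tweet):
    """Parse a tweet's created_at string into a timezone-aware datetime."""
    return datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)


def sort_replies(event):
    """Return the source tweet followed by its replies in chronological order."""
    source, replies = event[0], event[1:]
    return [source] + sorted(replies, key=parse_time)


def build_reply_tree(event):
    """Map each tweet's id_str to the id_strs of its direct replies."""
    children = defaultdict(list)
    for tweet in event[1:]:
        parent = tweet.get("in_reply_to_status_id_str")
        if parent is not None:
            children[parent].append(tweet["id_str"])
    return dict(children)


# Toy event with made-up IDs: tweet 2 replies to the source, tweet 3 replies to 2.
toy_event = [
    {"id_str": "1", "created_at": "Wed Jan 07 12:01:03 +0000 2015",
     "in_reply_to_status_id_str": None},
    {"id_str": "3", "created_at": "Wed Jan 07 17:39:24 +0000 2015",
     "in_reply_to_status_id_str": "2"},
    {"id_str": "2", "created_at": "Wed Jan 07 12:30:00 +0000 2015",
     "in_reply_to_status_id_str": "1"},
]
ordered_ids = [tweet["id_str"] for tweet in sort_replies(toy_event)]
tree = build_reply_tree(toy_event)
print(ordered_ids, tree)
```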
The label files ([train,dev].label.json), on the other hand, are standard JSON files:
{
  "552800070199148544": "non-rumour",
  "544388259359387648": "non-rumour",
  "552805970536333314": "non-rumour",
  "525071376084791297": "rumour",
  "498355319979143168": "non-rumour",
  "553591259672379392": "rumour",
  "580342401392312321": "non-rumour",
  "498609222020780033": "non-rumour",
  …
  "552797058990870528": "non-rumour",
  …
}
Each entry gives the ground truth label for a source tweet, e.g. here we can see that the source tweet earlier (552797058990870528) is a non-rumour. Note that we are using the string version of the ID (id_str rather than id) for matching source tweets between the data (e.g. train.data.jsonl) and label file (e.g. train.label.json).
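A minimal sketch of reading both file types and matching labels to events by the source tweet's id_str is given below. The helper names read_events and attach_labels are hypothetical, and in-memory strings stand in for the real files so that the snippet runs on its own.

```python
import io
import json


def read_events(lines):
    """Parse JSONL input: one event (a list of tweet objects) per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def attach_labels(events, labels):
    """Pair each event with the label of its source tweet, matched on id_str."""
    return [(event, labels[event[0]["id_str"]]) for event in events]


# In-memory stand-ins for a data file and a label file (contents abbreviated).
data_file = io.StringIO(
    '[{"id_str": "552797058990870528", "text": "source"}, '
    '{"id_str": "552882207426355201", "text": "reply"}]\n'
)
label_file = io.StringIO('{"552797058990870528": "non-rumour"}')

events = read_events(data_file)
labels = json.load(label_file)
pairs = attach_labels(events, labels)
print(pairs[0][1])
```

With the real data, the StringIO objects would be replaced by opened files, e.g. `with open("train.data.jsonl", encoding="utf-8") as f: events = read_events(f)`.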
For all files (JSONL data files or JSON label files), you should use the Python json library to load them.
Training, Development and Test
For the first task (rumour detection), there are 3 data partitions: training, development and test. Each of these partitions has a different purpose. The training set (train.data.jsonl and train.label.json) should be used for building your models, e.g., for use in development of features, rules and heuristics, and for supervised/unsupervised learning. You are encouraged to inspect this data closely to fully understand the task.
The development set (dev.data.jsonl and dev.label.json) is formatted like the training set. This will help you make major implementation decisions (e.g. choosing optimal hyper-parameter configurations), and should also be used for detailed analysis of your system — both for measuring performance and for error analysis — in the report.
You will use the test set (test.data.jsonl) to participate in the Codalab competition. For this reason no labels are provided for this partition. You should not at any time manually inspect the test dataset; any sign that you have done so will result in loss of marks.
For the second task (rumour analysis), there is only 1 file: covid.data.jsonl. You will use your trained rumour detection system from the first task to classify these COVID-19 tweets and do some analyses to understand the nature of COVID-19 rumours.
Evaluation Script
For the first task, we provide a script (eval.py) for evaluating your rumour detection results. This script takes two input files: the ground truth and your predictions, and computes precision, recall and F1 score on the rumour class. Shown below is the output from running against random predictions on the development set:
$ python eval.py --predictions dev.baseline.json --groundtruth dev.label.json
Performance on the rumour class:
Precision = 0.3120567375886525
Recall = 0.47058823529411764
F1 = 0.3752665245202559
Your rumour detection system will hopefully score a good deal higher! We will be focussing on the F1 score on the positive (rumour) class; the precision and recall figures are there for your information, and may prove useful in developing your system.
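For reference, the standard definitions of precision, recall and F1 on the rumour class can be sketched as below. This is not the code of eval.py, whose internals are not shown here; it only illustrates the usual formulas, with F1 as the harmonic mean of precision and recall (the numbers printed above are consistent with that definition). The function name and the toy labels are made up.

```python
def rumour_prf(gold, pred, positive="rumour"):
    """Precision, recall and F1 for the positive class.

    `gold` and `pred` are dicts mapping source tweet IDs to labels.
    IDs missing from `pred` are treated as not predicted positive.
    """
    tp = sum(1 for k, g in gold.items() if g == positive and pred.get(k) == positive)
    fp = sum(1 for k, g in gold.items() if g != positive and pred.get(k) == positive)
    fn = sum(1 for k, g in gold.items() if g == positive and pred.get(k) != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels with made-up IDs: one true positive, one false positive,
# one false negative and one true negative.
gold = {"a": "rumour", "b": "rumour", "c": "non-rumour", "d": "non-rumour"}
pred = {"a": "rumour", "b": "non-rumour", "c": "rumour", "d": "non-rumour"}
scores = rumour_prf(gold, pred)
print(scores)
```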
The example prediction file, dev.baseline.json, is the output of a baseline system where the labels are randomly populated. This file will help you understand the required file format for creating your development output (for tuning your system using eval.py) and your test output (for submission to the Codalab competition).
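A random baseline of this kind could be produced roughly as follows. Note the assumption: the sketch supposes that a prediction file has the same shape as the label files (a JSON object mapping each source tweet's id_str to "rumour" or "non-rumour"); check the provided example prediction file for the authoritative format. The function name, the seed and the output handling are hypothetical.

```python
import json
import random

LABELS = ["rumour", "non-rumour"]


def random_predictions(source_ids, seed=0):
    """Assign a uniformly random label to every source tweet ID."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    return {source_id: rng.choice(LABELS) for source_id in source_ids}


# Source tweet IDs taken from the label file excerpt, for illustration.
source_ids = ["552797058990870528", "525071376084791297"]
predictions = random_predictions(source_ids)

# Assumed format: the same ID-to-label mapping as the label files.
serialised = json.dumps(predictions, indent=2)
print(serialised)
```

To save the result, the serialised string could be written to a file of your choosing and passed to eval.py as the predictions argument.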
Grading
Your submissions will be graded as follows:
Component    Criteria         Description                                                            Marks
Writing      Clarity          Is the report well-written and well-structured?                        5
Writing      Tables/Figures   Are tables and figures interpretable and used effectively?             3
Content      Soundness        Are the experiments sound? Are methods justified and used correctly?   7
Content      Substance        How much work is done? Is there enough substance?                      5
Content      Novelty          How novel or ambitious are the techniques or methods?                  5
Content      Results          Are the results and findings convincing? Are they well articulated?    5
Performance  Rumour class F1  Graded based on Codalab leaderboard ranking                            5
A report should be submitted with the description, analysis, and comparative assessment of methods used. You should describe your methods in enough detail that we could replicate them without looking at your code. For the first task, you should mention any choices you made in implementing your system along with empirical justification for those choices using the development set. You should also detail both your development performance and the “Final Evaluation” performance on the Codalab leaderboard (details in the section below). For the second task, you should articulate the intention of your analyses (i.e. the questions they are addressing), and explain your findings. For both tasks, you should use tables and the appropriate charts to report your results/findings.