Show Us Your Skills: Twitter Data Introduction
As the culmination of the four projects in this class, we introduce this final dataset for you to explore. Your task is to walk us through an end-to-end ML pipeline that accomplishes a goal of your choosing: regression, classification, clustering, or anything else. This is a design question, and it is worth about 30% of your grade.
Below is a description of the provided dataset, along with a few small questions to get you started and familiarized with it:
3.4 About the Data
Download the training tweet data (https://ucla.box.com/s/24oxnhsoj6kpxhl6gyvuck25i3s4426d). The data consists of 6 text files, each containing tweet data from one hashtag, as indicated in the filenames.
Report the following statistics for each hashtag, i.e. each file (Question 26):
• Average number of tweets per hour
• Average number of followers of the users posting the tweets, per tweet (to keep it simple, we average over the number of tweets; if a user posted twice, we count the user and the user’s followers twice as well)
• Average number of retweets per tweet
Plot the “number of tweets per hour” over time for #SuperBowl and #NFL (a bar plot with 1-hour bins). The tweets are stored in separate files for different hashtags, and the files are named tweet [#hashtag].txt. (Question 27)
Note: The tweet file contains one tweet in each line, and tweets are sorted with respect to their posting time. Each tweet is a JSON string that you can load in Python as a dictionary. For example, if you parse it into an object json_object = json.loads(json_string), you can look up the time a tweet was posted by:
json_object['citation_date']
You may also access the number of retweets of a tweet through the following command:
json_object['metrics']['citations']['total']
In addition, the number of followers of the person tweeting can be retrieved via:
json_object['author']['followers']
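To get started on Question 26, here is a minimal sketch of how the three statistics could be computed for one file, using only the fields documented above. The filename is a hypothetical placeholder, and measuring the hourly rate over the span from the first to the last tweet is just one reasonable choice; whatever definition you use should be stated in your report.

import json

def hashtag_statistics(path):
    # Accumulate per-file totals; the file has one JSON-encoded tweet per line.
    n_tweets = 0
    total_followers = 0
    total_retweets = 0
    first_time = None
    last_time = None
    with open(path, encoding='utf-8') as f:
        for line in f:
            tweet = json.loads(line)
            n_tweets += 1
            total_followers += tweet['author']['followers']
            total_retweets += tweet['metrics']['citations']['total']
            t = tweet['citation_date']
            first_time = t if first_time is None else min(first_time, t)
            last_time = t if last_time is None else max(last_time, t)
    span_hours = (last_time - first_time) / 3600.0
    return {'avg_tweets_per_hour': n_tweets / span_hours,
            'avg_followers_per_tweet': total_followers / n_tweets,
            'avg_retweets_per_tweet': total_retweets / n_tweets}

print(hashtag_statistics('tweets_#superbowl.txt'))  # hypothetical filename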
The time information in the data file is in the form of UNIX time, which “encodes a point in time as a scalar real number which represents the number of seconds that have passed since the beginning of 00:00:00 UTC Thursday, 1 January 1970” (see Wikipedia for details). In Python, you can convert it to a human-readable date by
import datetime
datetime_object = datetime.datetime.fromtimestamp(unix_time)
The conversion above gives a datetime object storing the date and time, in your local time zone, corresponding to that UNIX time.
In later parts of the project, you may need to use the PST time zone to interpret the UNIX timestamps. To specify the time zone you would like to use, refer to the example below:
import pytz
pst_tz = pytz.timezone('America/Los_Angeles')
datetime_object_in_pst_timezone = datetime.datetime.fromtimestamp(unix_time, pst_tz)
For more details about datetime operations and time zones, see the Python datetime and pytz documentation.
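For the bar plot in Question 27, a minimal sketch along the following lines could work, assuming matplotlib is installed. The filename is again a placeholder, and the 1-hour bins are measured from the earliest tweet in the file (you may instead prefer calendar hours in PST, using the conversion shown above).

import json
import matplotlib.pyplot as plt

def tweets_per_hour(path):
    # Posting times (UNIX seconds) of every tweet in one hashtag file.
    with open(path, encoding='utf-8') as f:
        times = [json.loads(line)['citation_date'] for line in f]
    start = min(times)
    counts = [0] * (int((max(times) - start) // 3600) + 1)
    for t in times:
        counts[int((t - start) // 3600)] += 1   # 1-hour bins from the earliest tweet
    return counts

counts = tweets_per_hour('tweets_#superbowl.txt')   # hypothetical filename
plt.bar(range(len(counts)), counts, width=1.0)
plt.xlabel('Hours since first tweet')
plt.ylabel('Number of tweets')
plt.title('#SuperBowl: tweets per hour')
plt.show()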
Follow the steps outlined below (Question 28):
• Describe your task.
• Explore the data and any metadata (you can even incorporate additional datasets if you choose).
• Describe the feature engineering process and implement it with justification: why are you extracting features this way, and why not in any other way? (A small illustrative sketch of per-tweet feature extraction follows this list.)
• Generate baselines for your final ML model.
• Perform a thorough evaluation.
• Be creative in your task design – use things you have learned in other classes too if you are excited about them!
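As a purely illustrative example of the kind of artifact the feature-engineering step should produce, the sketch below turns each tweet into a small numeric feature vector. The chosen features and the filename are assumptions for illustration, not a recommended design; your own features should be driven by the task you define.

import json
import datetime

def extract_features(tweet):
    # A few simple per-tweet features built only from the fields documented above;
    # replace or extend them with features you can actually justify.
    posted = datetime.datetime.fromtimestamp(tweet['citation_date'])
    return [tweet['author']['followers'],            # author follower count
            tweet['metrics']['citations']['total'],  # retweet count
            posted.hour,                             # hour of day (local time)
            posted.weekday()]                        # day of the week

with open('tweets_#gopatriots.txt', encoding='utf-8') as f:   # hypothetical filename
    X = [extract_features(json.loads(line)) for line in f]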
We value creativity in this part of the project, and your score is partially based on how unique your task is. Here are a few pitfalls you should avoid (there are more than this list suggests):
• DO NOT perform shoddy sentiment analysis on tweets: running a pre-trained sentiment analysis model on each tweet and correlating that sentiment with the score over time would give you an obvious result.
• DO NOT only include trivial baselines: in sentiment analysis, for example, if you are going to train a neural network or use a pre-trained model, your baselines need to be competitive. Try to include alternate network architectures in addition to simple baselines such as random or naive Bayes baselines (a small sketch of such simple baselines follows this list).
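To make the "simple baselines" above concrete, a sketch such as the following compares a majority-class baseline with a naive Bayes baseline. It assumes scikit-learn is installed and that a feature matrix X with labels y (e.g., which hashtag file each tweet came from) has already been built, for instance with a sketch like the one earlier; any model you actually propose should clearly beat both, with stronger baselines added on top.

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# X (feature matrix) and y (labels) are assumed to have been built already.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [('majority class', DummyClassifier(strategy='most_frequent')),
                    ('Gaussian naive Bayes', GaussianNB())]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))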
Here we list a few project directions that you can consider and modify. These are not complete specifications; you are free, and encouraged, to create your own project or project parts (which may earn some points for creativity). The project you come up with should match or exceed the complexity of the following 3 suggested options:
• Time-Series Correlation between Scores and Tweets: Since this tweet dataset contains tweets that were posted before, during, and after the Super Bowl, you can find time-series data with the real-time score of the football game as the tweets are being generated. This score can be used as a dynamic label for your raw tweet dataset: there is an alignment between the tweets and the score (see the sketch after this list for one way to compute such an alignment). You can then train a model to predict, given a tweet, the team that is winning. Given the score change, can you generate a tweet using an ensemble of sentences from the original data (or using a more sophisticated generative model)?
Figure 1: A sample of the significant events in the game, which you can easily find on the internet; one such link lists the time-indexed events.
• Character-centric time-series tracking and prediction: In the #gopatriots dataset, there are several thousand tweets mentioning Tom Brady and his immediate success/failure during the game. He threw 4 touchdowns and 2 interceptions, so fan emotions about Brady throughout the game are fickle. Can we track the average perceived emotion across tweets about each player in the game, across time, in each fan base? Note that this option would require you to explore ways to find the sentiment associated with each player over time, not the sentiment of an entire tweet. Can we correlate these emotions with the score and significant events (such as interceptions or fumbles)? Using these features, can you predict the MVP of the game? Who was the most successful receiver? The MVP was Brady.
• Library of Prediction Tasks given a tweet: Predict the hashtags, or how likely it is that a tweet belongs to a fan of a specific team. Predict the number of retweets/likes/quotes. Predict the relative time at which a tweet was posted.
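For the time-series correlation direction above, one possible way to align tweets with a real-time score is sketched below. It assumes pandas is available and that you assemble the score table yourself from a time-indexed play-by-play source; the rows shown are placeholders, not actual game data, and the team columns follow Super Bowl XLIX (Patriots vs. Seahawks). pandas.merge_asof labels every tweet with the most recent score at or before its posting time.

import json
import pandas as pd

# Tweet posting times from one hashtag file (the files are already sorted by time).
with open('tweets_#superbowl.txt', encoding='utf-8') as f:   # hypothetical filename
    tweets = pd.DataFrame({'time': [json.loads(line)['citation_date'] for line in f]})

# Score timeline built from a time-indexed play-by-play source. The two rows below are
# placeholders only, NOT real game data; fill in actual (timestamp, patriots, seahawks) rows.
scores = pd.DataFrame([(1422835000, 0, 0), (1422838600, 7, 0)],
                      columns=['time', 'patriots', 'seahawks']).sort_values('time')

# Label each tweet with the most recent score at or before its posting time.
aligned = pd.merge_asof(tweets.sort_values('time'), scores, on='time')
aligned['leader'] = (aligned['patriots'] - aligned['seahawks']).map(
    lambda d: 'patriots' if d > 0 else ('seahawks' if d < 0 else 'tied'))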
Submission
Your submission should be made in both places: BruinLearn and Gradescope (accessed within BruinLearn).
BruinLearn Please submit a zip file containing your report and your code, along with a README file explaining how to run your code, to BruinLearn. The zip file should be named
“Project1 UID1 UID2 … UIDn.zip”
where UIDx’s are student ID numbers of the team members. Only one submission per team is required. If you have any questions, please ask on Piazza or through email. Piazza is preferred.
Gradescope Please submit your report to Gradescope as well. Please specify your group members in Gradescope. It is very important that you assign each part of your report to the question number provided in the Gradescope template.