title: MY472 Final Exam Part A: Data Cleaning
The overall question you will be trying to answer in this final exam is: Why is there so much negativity on Facebook comments about politics?
To answer this question, I will share with you a dataset of public Facebook data covering all the posts by Members of the U.S. Congress between January 1st, 2015 and December 31st, 2016, as well as all the comments and reactions to these posts. In addition, you will have a dataset with sentiment predictions for each comment (negative, neutral, positive).
As a first step, you will have to clean the data and convert it to a format that can facilitate the subsequent analysis. I recommend you use a SQLite database, but you can also work with regular data frames if you prefer.
You have access to five data files. Read the text below for important information regarding their content, as well as links to download the files:
1. congresslist.csv contains information about each Member of Congress, including gender, type (House representative or Senator), party (Democrat, Republican, Independent), nominatedim1 (an estimate of political ideology, from -1, very liberal, to 1, very conservative), state and district.
IMPORTANT: this file also contains two important variables to merge all the different datasets. bioguideid is the main key used to merge with external sources. facebook is the Facebook ID for each Member of Congress, and you should use this key to merge with the rest of the internal data sources. All files in the remaining datasets here contain this ID in the file name.
2. facebook114posts.zip (https://www.dropbox.com/s/trznn23wtotnkon/facebook114posts.zip?dl=0) contains multiple .csv files with information about each post on the legislators' pages. All variables should be self-explanatory. Remember that you shouldn't use fromid or fromname to merge across different data sources. id is the unique numeric ID for each post.
3. facebook114comments.zip (https://www.dropbox.com/s/vu2po7a35tqs3fg/facebook114comments.zip?dl=0) contains multiple .csv files with information about each comment on the legislators' pages. Each file corresponds to a different page. fromid and fromname here correspond to the person who wrote the comment. likescount is the number of likes on each comment. commentscount is the number of replies to each comment. id is the unique numeric ID for each comment. postid is the ID of the post to which this comment is replying (i.e. id in the posts .csv files). isreply indicates whether the comment is a top-level comment (FALSE) or a reply to an existing comment (TRUE); if it is a reply, inreplytoid indicates the ID of the comment to which this comment is replying.
Some additional information: remember that Facebook comments have a threaded structure. Whenever you write a comment, you can comment directly on the post (top-level comment) or reply to an existing comment (reply).
4. facebook114reactionstotals.zip (https://www.dropbox.com/s/yy3ams7szs3fa73/facebook114reactionstotals.zip?dl=0) offers statistics on the total number of reactions (love, haha, angry…) to each post. id here corresponds to id in the facebook114posts datasets.
5. facebook114commentssentiment.zip (https://www.dropbox.com/s/iovfv0l2wj2j5dp/facebook114commentssentiment.zip?dl=0) contains datasets that predict the sentiment of each comment in the facebook114comments.zip files. There are three variables measuring the probability that each comment is negative, neutral or positive. They add up to one. You can either use the probabilities or, for each comment, predict a category based on which probability is highest.
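The last option in item 5, assigning each comment the category with the highest probability, could be sketched as follows. This is only an illustration on toy data: the column names `negative`, `neutral` and `positive` are assumptions, so check the actual variable names in the sentiment files before adapting it.

```r
# Toy sentiment table; the column names are assumed, not taken from the real files
sentiment <- data.frame(
  id       = c("c1", "c2", "c3"),
  negative = c(0.7, 0.2, 0.1),
  neutral  = c(0.2, 0.5, 0.3),
  positive = c(0.1, 0.3, 0.6),
  stringsAsFactors = FALSE
)

prob_cols <- c("negative", "neutral", "positive")

# max.col() returns, for each row, the index of the column holding the largest value;
# ties.method = "first" makes ties deterministic instead of random
best <- max.col(as.matrix(sentiment[, prob_cols]), ties.method = "first")
sentiment$category <- prob_cols[best]

print(sentiment)
```

Keeping the three probabilities alongside the predicted category leaves both options open for the later analysis.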
NOTE: as you work on cleaning the dataset, if anything is not clear, you can ask in the forum for clarification.
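Because every file in the zipped datasets carries the Facebook ID in its file name, one possible way to keep that key while stacking the files is sketched below. The file naming scheme (`<ID>_posts.csv`), the toy columns, and the helper `read_one` are all hypothetical; adjust the regular expression to the real file names.

```r
# Create two toy files in a temporary folder (hypothetical naming scheme)
tmp <- file.path(tempdir(), "posts_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(id = c("p1", "p2"), likes = c(10, 3)),
          file.path(tmp, "1111111111_posts.csv"), row.names = FALSE)
write.csv(data.frame(id = "p3", likes = 7),
          file.path(tmp, "2222222222_posts.csv"), row.names = FALSE)

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read one file and attach the page ID taken from its file name
read_one <- function(path) {
  d <- read.csv(path, stringsAsFactors = FALSE)
  # The first run of digits in the file name is assumed to be the Facebook ID.
  # It is stored as character because long numeric IDs can lose precision as doubles.
  d$facebook <- regmatches(basename(path), regexpr("[0-9]+", basename(path)))
  d
}

# Stack all files into a single data frame
posts <- do.call(rbind, lapply(files, read_one))
print(posts)
```

The resulting `facebook` column can then serve as the key for merging with congresslist.csv.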
1. Before you start cleaning the data, first consider how to design the database. Read the rest of the final exam to help you think through the options. How many tables should you have, and why? Clue: the answer is not five!
2. Carry out the steps necessary to clean and merge the data, and then enter the datasets into a SQLite database, or into data frames that you can save to disk.
Make sure you do this in an efficient way. Pay special attention to variables that you will not need, and drop them from the tables/data frames to save memory and be more efficient.
r
Write your code here
3. Compute relevant summary statistics for your tables. You should at least answer the following questions: How many rows do you have in each table? What are the average values of all numeric variables? What is the distribution of the categorical variables?
r
title: MY472 Final Exam Part B: Descriptive Analysis
The goal of this second part of the assignment is to analyze the datasets you just created in order to answer a set of descriptive questions. Your answers to these questions will offer important context for the overall research question: Why is there so much negativity on Facebook comments about politics?
For each item below, you should write code with any statistical or graphical analysis that you consider appropriate, and then answer the question.
1. First of all, how much negativity is there in the comments on the pages of U.S. legislators? In other words, what proportion of comments are negative?
r
2. How much variation is there in the level of negativity that legislators see on their Facebook pages? Which are the legislators with the highest and lowest proportion of negative comments?
r
3. How did negativity evolve over time during the period of analysis? Do you identify any particular days or periods during which negativity spiked? Can you explain why?
r
4. Are there any other variables in the dataset that could help you measure negativity? If so, do you find similar results to questions 2 and 3 when you use that other signal?
r
title: MY472 Final Exam Part C: Scraping additional data
Now you will collect additional data to continue exploring the broader research question in the exam.
1. The website EveryPolitician (https://everypolitician.org) contains information on legislators around the world. Using the web scraping tools you learned in the course, create a dataset with two variables (bioguideid and age) by scraping the data available in these two pages: https://everypolitician.org/united-states-of-america/house/term-table/114.html and https://everypolitician.org/united-states-of-america/senate/term-table/114.html
If you are having trouble scraping it, you can also just click on "Download data", but you will not get full marks if you do that!
r
2. Are there more negative comments on the pages of younger politicians? Use any statistical or graphical methods that you consider appropriate to answer this question.
r
3. The file congresslist.csv contains five other legislator-level variables (chamber, gender, party, ideology, state). Choose TWO of these variables and explore whether they are related to the extent to which Members of Congress receive negative comments on their Facebook pages. Write a summary of your findings.
r
title: MY472 Final Exam Part D: Testing a Substantive Hypothesis
To conclude this assignment, you will offer preliminary evidence regarding one potential explanation for why there is so much negativity on Facebook comments: negative comments are widespread because they receive more engagement. In other words, maybe negative comments generate the type of reaction in people that makes them more likely to like or reply to those comments.
1. Do negative comments receive more likes than neutral or positive comments? Use any statistical or graphical methods that you consider appropriate to answer this question.
r
2. Replicate the analysis above, but this time separately for Republicans and Democrats. Do you find any differences?
r