Assignment 2 – R Text Analysis Due 15/04 – Before Class
How to submit your assignment
1. Download the R Workspace “newspapers” and open it with RStudio
2. Read the instructions and finish Questions 1 to 4
3. Save your R-script to “yourlastname studentID assignment2.R”
4. Send the file to my email (colonct@gmail.com)
The Ideological Bias in Newspapers
Text analysis gives researchers a powerful set of tools for extracting general information from a large body of documents.
This exercise is based on Gentzkow, M. and Shapiro, J. M. 2010. “What Drives Media Slant? Evidence From U.S. Daily Newspapers.” Econometrica, 78(1): 35-71.
We will analyze data from newspapers across the country to see what topics they cover and how those topics are related to their ideological bias. The authors computed a measure of a newspaper’s “slant” by comparing its language to speeches made by Democrats and Republicans in the U.S. Congress.
You will use three data sources for this analysis. The first, ‘dtm’, is a document-term matrix with one row per newspaper, containing the 1000 phrases (stemmed and processed) that do the best job of identifying the speaker as a Republican or a Democrat. For instance, “living in poverty” is a phrase most frequently spoken by Democrats, while “global war on terror” is a phrase most frequently spoken by Republicans; a phrase like “exchange rate” would not be included in this dataset, as it is used often by members of both parties and is thus a poor indicator of ideology. The second object, ‘papers’, contains some data on the newspapers on which ‘dtm’ is based. The row names in ‘dtm’ correspond to the ‘newsid’ variable in ‘papers’. The variables are:
Name        Description
‘newsid’    The newspaper ID
‘paper’     The newspaper name
‘city’      The city in which the newspaper is based
‘state’     The state in which the newspaper is based
‘district’  Congressional district where the newspaper is based (data for Texas only)
‘nslant’    The “ideological slant” (lower numbers mean more Democratic)
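Because the row names of ‘dtm’ correspond to ‘newsid’, the two objects can be lined up row by row. The sketch below shows one possible way to do this with match(). It uses small made-up stand-ins rather than the real workspace objects: the plain matrix `counts` plays the role of ‘dtm’, and all values are hypothetical.

```r
# Toy stand-ins for the real objects (all values are made up for illustration).
papers <- data.frame(
  newsid = c("n1", "n2", "n3"),
  paper  = c("Paper A", "Paper B", "Paper C"),
  state  = c("NJ", "TX", "NJ"),
  nslant = c(0.40, 0.48, 0.43),
  stringsAsFactors = FALSE
)

# A small count matrix whose rows are deliberately in a different order
# from the rows of 'papers'.
counts <- matrix(
  c(2, 0, 1,
    0, 3, 1),
  nrow = 3,
  dimnames = list(c("n3", "n1", "n2"), c("phrase_a", "phrase_b"))
)

# For each row name of the matrix, find the matching row in 'papers'.
idx <- match(rownames(counts), papers$newsid)

# Reorder 'papers' so that row i describes row i of the matrix.
papers_aligned <- papers[idx, ]
print(papers_aligned)
```

After this step, anything computed per row of the matrix can be attached to the newspaper's name, state, or slant without worrying about row order.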
The third object, ‘cong’, contains data on members of Congress based on their political speech, which we will compare to the ideological slant of newspapers from the areas that these legislators represent. The variables are:
Name        Description
‘legname’   Legislator’s name
‘state’     Legislator’s state
‘district’  Legislator’s Congressional district
‘chamber’   Chamber in which legislator serves (House or Senate)
‘party’     Legislator’s party
‘cslant’    Ideological slant based on legislator’s speech (lower numbers mean more Democratic)
Question 1
We will first focus on the slant of newspapers, which the authors define as the tendency to use language that would sway readers to the political left or right. Load the data and plot the distribution of ‘nslant’ in the ‘papers’ data frame, with a vertical line at the median. Which newspaper in the country has the largest left-wing slant? What about right?
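As a rough illustration of the mechanics only, the sketch below draws a histogram with a vertical line at the median and looks up the rows with the smallest and largest slant. The data frame here is a made-up stand-in for ‘papers’, so the names and numbers are hypothetical.

```r
# Toy stand-in for 'papers' (values are made up for illustration).
papers <- data.frame(
  paper  = c("Paper A", "Paper B", "Paper C", "Paper D", "Paper E"),
  nslant = c(0.40, 0.48, 0.43, 0.52, 0.45),
  stringsAsFactors = FALSE
)

# Distribution of the slant measure, with a dashed line at the median.
hist(papers$nslant,
     main = "Distribution of newspaper slant",
     xlab = "nslant")
abline(v = median(papers$nslant), lty = 2)

# Lower values mean more Democratic, so the minimum is the most left-leaning
# paper and the maximum is the most right-leaning one.
most_left  <- papers$paper[which.min(papers$nslant)]
most_right <- papers$paper[which.max(papers$nslant)]
print(c(left = most_left, right = most_right))
```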
Question 2
We will now explore the relationship between the political slant of newspapers and the language used by members of Congress. Using the dataset ‘cong’, compute average slant by state separately for the House and Senate. Now use ‘papers’ to compute the average newspaper slant by state. Make two plots with Congressional slant on the x-axis and newspaper slant on the y-axis: one for the House, one for the Senate. Include a best-fit line in each plot: a red one for the Senate and a green one for the House. Label your axes, title your plots, and make sure the axes are the same for comparability. Can you conclude that newspapers are influenced by the political language of elected officials? How else can you interpret the results?
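One possible workflow for the state averages and the two plots is sketched below with aggregate(), merge(), and lm(). The data frames are made-up stand-ins. In particular, the chamber labels "House" and "Senate" are an assumption, so check how ‘chamber’ is actually coded in ‘cong’ before reusing any of this.

```r
# Toy stand-ins (values are made up; the chamber coding is an assumption).
cong <- data.frame(
  state   = c("NJ", "NJ", "TX", "TX", "CA", "CA", "NJ", "TX", "CA"),
  chamber = c(rep("House", 6), rep("Senate", 3)),
  cslant  = c(0.40, 0.44, 0.50, 0.54, 0.38, 0.42, 0.41, 0.53, 0.39),
  stringsAsFactors = FALSE
)
papers <- data.frame(
  state  = c("NJ", "NJ", "TX", "CA"),
  nslant = c(0.42, 0.46, 0.50, 0.41),
  stringsAsFactors = FALSE
)

# Average slant by state, separately for each chamber and for newspapers.
house  <- aggregate(cslant ~ state, data = cong[cong$chamber == "House", ],  FUN = mean)
senate <- aggregate(cslant ~ state, data = cong[cong$chamber == "Senate", ], FUN = mean)
news   <- aggregate(nslant ~ state, data = papers, FUN = mean)

# Keep only states that appear in both sources.
house_news  <- merge(house,  news, by = "state")
senate_news <- merge(senate, news, by = "state")

# Shared axis limits so the two panels are directly comparable.
lims <- range(c(house$cslant, senate$cslant, news$nslant))

par(mfrow = c(1, 2))
plot(house_news$cslant, house_news$nslant, xlim = lims, ylim = lims,
     xlab = "Congressional slant", ylab = "Newspaper slant", main = "House")
abline(lm(nslant ~ cslant, data = house_news), col = "green")

plot(senate_news$cslant, senate_news$nslant, xlim = lims, ylim = lims,
     xlab = "Congressional slant", ylab = "Newspaper slant", main = "Senate")
abline(lm(nslant ~ cslant, data = senate_news), col = "red")
```

Using one common `lims` vector for both panels is a simple way to satisfy the "same axes" requirement; other choices, such as fixed limits picked by hand, would work as well.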
Question 3
Identify the most important terms for capturing regional variation in what is considered newsworthy: the terms that appear frequently in some documents, but not across all documents. To do so, compute the *term frequency-inverse document frequency (tf-idf)* for each phrase and newspaper combination in the dataset (for this, use the ‘tm’ package and the ‘dtm’ object originally provided).
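To see what the tf-idf weighting does, it can help to compute it by hand on a tiny matrix first. The sketch below uses base R only and a made-up count matrix. It uses one common definition (term frequency as the share of a document's counts, inverse document frequency as log2 of documents over documents containing the term). The ‘tm’ package offers weightTfIdf() for this kind of weighting; consult its help page for the exact normalization it applies to ‘dtm’.

```r
# Toy count matrix: rows are newspapers, columns are phrases (made-up values).
counts <- matrix(
  c(2, 0, 1,
    1, 1, 1,
    1, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("n1", "n2", "n3"),
                  c("phrase_a", "phrase_b", "phrase_c"))
)

# Term frequency: each count divided by the total count in its row.
tf <- counts / rowSums(counts)

# Document frequency: number of newspapers in which each phrase appears.
doc_freq <- colSums(counts > 0)

# Inverse document frequency: phrases used everywhere get weight zero.
idf <- log2(nrow(counts) / doc_freq)

# Multiply every column of tf by the idf of that phrase.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

In this toy example phrase_a appears in every newspaper, so its tf-idf is zero everywhere, while phrase_b gets a high weight in the one newspaper that uses it heavily.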
Question 4
Cluster all the newspapers from New Jersey on their tf-idf measure. Apply the k-means algorithm with 3 clusters. Summarize the results by printing out the ten most important terms at the centroid of each of the resulting clusters, and show which newspapers belong to each cluster. What topics does NJ care about?
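The clustering step could look roughly like the sketch below, which runs kmeans() on a small random matrix standing in for the tf-idf values. Everything here is hypothetical: the matrix, the newspaper names, and the state code "NJ" (check how New Jersey is coded in ‘papers’). With the real data, the tf-idf object from the ‘tm’ package would probably need to be converted with as.matrix() first.

```r
set.seed(1)  # k-means starts from random centers; fix the seed for repeatability

# Toy stand-ins: 8 newspapers, 5 phrases, random "tf-idf" values.
tfidf <- matrix(runif(8 * 5), nrow = 8,
                dimnames = list(paste0("n", 1:8), paste0("phrase_", 1:5)))
papers <- data.frame(
  newsid = paste0("n", 1:8),
  paper  = paste("Paper", LETTERS[1:8]),
  state  = c("NJ", "NJ", "TX", "NJ", "NJ", "CA", "NJ", "NJ"),
  stringsAsFactors = FALSE
)

# Keep only the rows that belong to New Jersey newspapers.
nj_ids <- papers$newsid[papers$state == "NJ"]
nj     <- tfidf[rownames(tfidf) %in% nj_ids, ]

# k-means with 3 clusters; several random starts make the result more stable.
km <- kmeans(nj, centers = 3, nstart = 10)

for (k in 1:3) {
  # Terms with the largest centroid values (at most ten; the toy has only five).
  top_terms <- names(head(sort(km$centers[k, ], decreasing = TRUE), 10))
  # Newspapers assigned to this cluster, looked up by their ID.
  members <- names(km$cluster)[km$cluster == k]
  cat("Cluster", k, "\n")
  cat("  top terms: ", paste(top_terms, collapse = ", "), "\n")
  cat("  newspapers:", paste(papers$paper[match(members, papers$newsid)],
                             collapse = ", "), "\n")
}
```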