Data Analytics Final, version B (40 pts)
In [ ]:
import time
time.ctime()
In [ ]:
name = "ABC" # Enter your name
user = "ABC" # Enter your BC username
print("Final submission for {} ({})".format(name, user))
Please rename the file so that it includes your BC username. For example, I would change Final2020-B to Final2020-B-xuaeh.
In [ ]:
# You might find these packages helpful
import math
import statistics
import pandas
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats # Basic package for basic univariate regressions
import statsmodels.api as statsmod # More sophisticated package for univariate and multivariate regressions
1. FOMC Minutes (30 pts)
Background: The Federal Open Market Committee (FOMC), a committee within the Federal Reserve System, is charged under United States law with overseeing the nation’s open market operations. This Federal Reserve committee makes key decisions about interest rates and the growth of the United States money supply. Committee members meet regularly (typically once a month or once every other month) to make these decisions. Their meeting minutes are made public shortly after the meeting.
Goal: Your research team aims to (a) measure and quantify central bankers’ $\textit{positive}$ and $\textit{negative}$ sentiment expressed during these high-level meetings, and then (b) examine how the two sentiment measures evolve over time on a $\textit{yearly}$ basis.
Suggested steps:
1.1 Download and unzip folder “ALL_FOMC_MINUTES_1990_2020”. This folder provides all 242 FOMC meeting minutes from February 7, 1990 (see file ‘19900207.txt’) to the latest meeting on March 15, 2020 (see file ‘20200315.txt’).
In [ ]:
# list all files in the current directory
%ls
1.2 [+10pts] Create a “negative” sentiment score of the FOMC meeting on February 7, 1990, using the following equation: $\text{NegativeScore}_{19900207} = \frac{\text{Count of Negative Words}_{19900207}}{\text{Total Word Count}_{19900207}}\times 100$, where file “1.6_LM_negative.txt” provides a collection of negative words suggested by Loughran and McDonald (https://sraf.nd.edu/textual-analysis/resources/).
Similarly, we can create a “positive” sentiment score of the FOMC meeting on February 7, 1990, using the following equation: $\text{PositiveScore}_{19900207} = \frac{\text{Count of Positive Words}_{19900207}}{\text{Total Word Count}_{19900207}}\times 100$, where file “1.6_LM_positive.txt” provides a collection of positive words from the same academic source.
You can pick any other text file that you find more interesting. In other words, you don’t have to use the 19900207 file.
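The two score formulas above can be sketched on a toy sentence as follows. This is only an illustration under stated assumptions: the helper name `sentiment_score`, the toy sentence, and the toy word lists are all made up here, and the tokenization rule (runs of letters, compared in upper case) is one reasonable choice rather than the required one.

```python
import re

def sentiment_score(text, word_list):
    """Return the percentage of words in `text` that appear in `word_list`."""
    # Assumption: a "word" is a run of letters; both sides are upper-cased
    # so that the comparison does not depend on how the dictionary is stored.
    tokens = re.findall(r"[A-Za-z]+", text.upper())
    vocabulary = {w.upper() for w in word_list}
    hits = sum(1 for t in tokens if t in vocabulary)
    return 100.0 * hits / len(tokens) if tokens else 0.0

# Toy inputs (hypothetical, not taken from the FOMC files or the LM lists)
toy_text = "Risks of a decline remain, but growth was strong and stable."
toy_neg = ["DECLINE", "RISKS"]
toy_pos = ["STRONG"]

neg = sentiment_score(toy_text, toy_neg)  # 2 of 11 words
pos = sentiment_score(toy_text, toy_pos)  # 1 of 11 words
print("{:2.2f}% negative, {:2.2f}% positive".format(neg, pos))
```

Converting the word list to a set keeps each membership check fast, which matters once the same function is applied to long documents.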
I will help you get started with some (but not all) of the input code.
In [ ]:
# Input function
def Input(filename):
    # "mbcs" is the Windows ANSI codec; on macOS/Linux, try encoding="utf8" instead
    with open(filename, 'r', encoding="mbcs") as f:
        lines = f.readlines()
    lines = [l.strip() for l in lines] # output will be a list of strings, one string per line
    return lines
In [ ]:
# Import the negative and positive words
list_neg = Input('1.6_LM_negative.txt')
list_pos = Input('1.6_LM_positive.txt')
In [ ]:
# Import the FOMC meeting file
# Your code here
In [ ]:
# Calculate the negative score
# Your code here
In [ ]:
print("During this FOMC meeting, central bankers and officials expressed slightly more {} sentiment. To be specific, {} of {} words (or {:2.2f}%) are related to positive sentiment, while {:2.2f}% are related to negative sentiment.".format(YOUR ANSWERS HERE))
1.3 [+10pts] Repeat the steps in 1.2 for all 242 FOMC meeting minutes files, and obtain a list of negative scores (call this list “NEGSCORE”) and a list of positive scores (call this list “POSSCORE”). For instance, in list “NEGSCORE”, the first element corresponds to the negative score calculated using the 19900207 FOMC minutes, and the last element corresponds to the negative score calculated using the 20200315 FOMC minutes.
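The loop described above might look like the following sketch. It does not touch the real minutes: it writes two tiny made-up files into a temporary folder and scores them. The names `read_text`, `percent_in`, `NEGSCORE_toy`, and `POSSCORE_toy`, the toy word sets, and the `utf8` encoding are assumptions for illustration only.

```python
import os
import re
import tempfile

def read_text(path):
    """Read a whole text file into one string (utf8 is an assumption)."""
    with open(path, "r", encoding="utf8") as f:
        return f.read()

def percent_in(text, vocabulary):
    """Percentage of words in `text` that belong to the set `vocabulary`."""
    tokens = re.findall(r"[A-Za-z]+", text.upper())
    if not tokens:
        return 0.0
    return 100.0 * sum(t in vocabulary for t in tokens) / len(tokens)

# Hypothetical word sets and file contents, for illustration only
toy_neg = {"WEAK"}
toy_pos = {"STRONG"}
toy_files = {
    "19900207.txt": "Demand was weak.",
    "19900327.txt": "Growth was strong and strong again.",
}

NEGSCORE_toy, POSSCORE_toy = [], []
with tempfile.TemporaryDirectory() as folder:
    for name, body in toy_files.items():
        with open(os.path.join(folder, name), "w", encoding="utf8") as f:
            f.write(body)
    # Loop over file names in chronological order, one score pair per file
    for name in sorted(toy_files):
        text = read_text(os.path.join(folder, name))
        NEGSCORE_toy.append(percent_in(text, toy_neg))
        POSSCORE_toy.append(percent_in(text, toy_pos))

print(NEGSCORE_toy, POSSCORE_toy)
```

Appending in the same order as the file list keeps each score aligned with its file name, which the yearly averaging in 1.4 relies on.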
In [ ]:
# Here is a complete list of file names, which might be useful in a for loop
TITLE = [
    '19900207.txt', '19900327.txt', '19900515.txt', '19900703.txt', '19900821.txt', '19901002.txt', '19901113.txt', '19901218.txt',
    '19910206.txt', '19910326.txt', '19910514.txt', '19910703.txt', '19910820.txt', '19911001.txt', '19911105.txt', '19911217.txt',
    '19920205.txt', '19920331.txt', '19920519.txt', '19920701.txt', '19920818.txt', '19921006.txt', '19921117.txt', '19921222.txt',
    '19930203.txt', '19930323.txt', '19930518.txt', '19930707.txt', '19930817.txt', '19930921.txt', '19931116.txt', '19931221.txt',
    '19940204.txt', '19940322.txt', '19940517.txt', '19940706.txt', '19940816.txt', '19940927.txt', '19941115.txt', '19941220.txt',
    '19950201.txt', '19950328.txt', '19950523.txt', '19950706.txt', '19950822.txt', '19950926.txt', '19951115.txt', '19951219.txt',
    '19960130.txt', '19960326.txt', '19960521.txt', '19960702.txt', '19960820.txt', '19960924.txt', '19961113.txt', '19961217.txt',
    '19970204.txt', '19970325.txt', '19970520.txt', '19970701.txt', '19970819.txt', '19970930.txt', '19971112.txt', '19971216.txt',
    '19980203.txt', '19980331.txt', '19980519.txt', '19980630.txt', '19980818.txt', '19980929.txt', '19981117.txt', '19981222.txt',
    '19990202.txt', '19990330.txt', '19990518.txt', '19990629.txt', '19990824.txt', '19991005.txt', '19991116.txt', '19991221.txt',
    '20000202.txt', '20000321.txt', '20000516.txt', '20000628.txt', '20000822.txt', '20001003.txt', '20001115.txt', '20001219.txt',
    '20010131.txt', '20010320.txt', '20010515.txt', '20010627.txt', '20010821.txt', '20011002.txt', '20011106.txt', '20011211.txt',
    '20020130.txt', '20020319.txt', '20020507.txt', '20020626.txt', '20020813.txt', '20020924.txt', '20021106.txt', '20021210.txt',
    '20030129.txt', '20030318.txt', '20030506.txt', '20030625.txt', '20030812.txt', '20030916.txt', '20031028.txt', '20031209.txt',
    '20040128.txt', '20040316.txt', '20040504.txt', '20040630.txt', '20040810.txt', '20040921.txt', '20041110.txt', '20041214.txt',
    '20050202.txt', '20050322.txt', '20050503.txt', '20050630.txt', '20050809.txt', '20050920.txt', '20051101.txt', '20051213.txt',
    '20060131.txt', '20060328.txt', '20060510.txt', '20060629.txt', '20060808.txt', '20060920.txt', '20061025.txt', '20061212.txt',
    '20070131.txt', '20070321.txt', '20070509.txt', '20070628.txt', '20070807.txt', '20070918.txt', '20071031.txt', '20071211.txt',
    '20080130.txt', '20080318.txt', '20080430.txt', '20080625.txt', '20080805.txt', '20080916.txt', '20081029.txt', '20081216.txt',
    '20090128.txt', '20090318.txt', '20090429.txt', '20090624.txt', '20090812.txt', '20090923.txt', '20091104.txt', '20091216.txt',
    '20100127.txt', '20100316.txt', '20100428.txt', '20100623.txt', '20100810.txt', '20100921.txt', '20101103.txt', '20101214.txt',
    '20110126.txt', '20110315.txt', '20110427.txt', '20110622.txt', '20110809.txt', '20110921.txt', '20111102.txt', '20111213.txt',
    '20120125.txt', '20120313.txt', '20120425.txt', '20120620.txt', '20120801.txt', '20120913.txt', '20121024.txt', '20121212.txt',
    '20130130.txt', '20130320.txt', '20130501.txt', '20130619.txt', '20130731.txt', '20130918.txt', '20131030.txt', '20131218.txt',
    '20140129.txt', '20140319.txt', '20140430.txt', '20140618.txt', '20140730.txt', '20140917.txt', '20141029.txt', '20141217.txt',
    '20150128.txt', '20150318.txt', '20150429.txt', '20150617.txt', '20150729.txt', '20150917.txt', '20151028.txt', '20151216.txt',
    '20160127.txt', '20160316.txt', '20160427.txt', '20160615.txt', '20160727.txt', '20160921.txt', '20161102.txt', '20161214.txt',
    '20170201.txt', '20170315.txt', '20170503.txt', '20170614.txt', '20170726.txt', '20170920.txt', '20171101.txt', '20171213.txt',
    '20180131.txt', '20180321.txt', '20180502.txt', '20180613.txt', '20180801.txt', '20180926.txt', '20181108.txt', '20181219.txt',
    '20190130.txt', '20190320.txt', '20190501.txt', '20190619.txt', '20190731.txt', '20190918.txt', '20191030.txt', '20191211.txt',
    '20200129.txt', '20200315.txt',
]
print(TITLE[0]) # FIRST FILE NAME
print(TITLE[241]) # LAST FILE NAME
print(type(TITLE),type(TITLE[0]))
In [ ]:
# Your code here
In [ ]:
# print the following
print(NEGSCORE[0], NEGSCORE[-1])
print(POSSCORE[25], POSSCORE[-25])
1.4 [+5pts] Obtain $\textit{yearly}$ positive and negative scores. That is, take an average of positive scores from all FOMC meetings within the same year; similarly, take an average of negative scores from all FOMC meetings within the same year. You should obtain 2 lists (or arrays), each with 31 numbers corresponding to the 31 years in the data (i.e., from 1990 to 2020). Finally, print them out.
Hint: You have probably realized by now that the number of FOMC meetings differs across years. Therefore, you cannot simply average a fixed number of scores (say, 12) for each year. Instead, notice that the first 4 digits of each element in “TITLE” (a list variable defined in 1.3) already indicate the year. You can use this information; for example, to obtain the yearly negative score for 1990, keep all negative scores from meetings whose filenames begin with “1990”, and then take the average.
This question can be answered using list or numpy.
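The grouping idea in the hint can be sketched with plain lists as follows. The helper name `yearly_average` and the three toy scores are hypothetical; the only thing taken from the hint is that the first four characters of each file name give the year.

```python
from collections import defaultdict

def yearly_average(titles, scores):
    """Average `scores` within each year, where the year is titles[i][:4]."""
    by_year = defaultdict(list)
    for title, score in zip(titles, scores):
        by_year[title[:4]].append(score)
    years = sorted(by_year)
    averages = [sum(by_year[y]) / len(by_year[y]) for y in years]
    return years, averages

# Toy example: two meetings in 1990, one in 1991 (made-up scores)
toy_titles = ["19900207.txt", "19900327.txt", "19910206.txt"]
toy_scores = [1.0, 3.0, 5.0]
years, averages = yearly_average(toy_titles, toy_scores)
print(years, averages)  # ['1990', '1991'] [2.0, 5.0]
```

Because each year keeps its own list, the average uses however many meetings that year actually had.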
In [ ]:
# Your code here
In [ ]:
# print them out
1.5 [+5pts] Plot the time series of positive and negative scores in one plot. Use legend labels, line colors, and line styles that make it easy to tell the two lines apart. Save your plot to “sentiment_1990_2020_YOURBCID.png” (e.g., in my case, “sentiment_1990_2020_xuaeh.png”).
Hint: plt.plot()
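As a rough sketch of the kind of figure being asked for, the snippet below draws two made-up series with distinct colors, line styles, markers, and a legend, then saves the figure. The numbers and the output file name `sentiment_toy_example.png` are placeholders, not real scores or the required file name.

```python
import matplotlib.pyplot as plt

# Placeholder values only; substitute your own yearly scores
toy_years = [1990, 1991, 1992, 1993]
toy_pos = [0.9, 1.1, 1.0, 1.2]
toy_neg = [1.8, 1.6, 1.7, 1.5]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(toy_years, toy_pos, color="tab:green", linestyle="-", marker="o",
        label="Positive score")
ax.plot(toy_years, toy_neg, color="tab:red", linestyle="--", marker="s",
        label="Negative score")
ax.set_xlabel("Year")
ax.set_ylabel("Share of words (%)")
ax.set_title("Toy sentiment scores")
ax.legend()

# Save before showing or closing the figure so the file is not blank
fig.savefig("sentiment_toy_example.png", dpi=150, bbox_inches="tight")
```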
In [ ]:
# Your code here
2. Regression Analysis (10 pts)
Motivation and question: In Question 1, we obtained yearly negative sentiment scores. By construction, the higher the negative score is, the more negative sentiment there was during FOMC meetings of that year. However, it is still unclear: what is negative sentiment? Because the purpose of these meetings is indeed to discuss the recent past market and economic conditions and future interest rate strategies, it would be interesting to understand what fundamental events or indicators explain negative sentiment. This is what we are after in this question.
Suggested steps: Please download “QUESTION2_DATA.csv”, which contains the following columns:
• Column 1: year, 1990-2020
• Column 2: yearly negative sentiment score, constructed from Question 1 [You may use my answers to check your solutions in Q1 :)]
• Column 3: yearly ambiguity score, constructed using FOMC meeting minutes and an “ambiguity” dictionary. It captures how uncertain central bankers are about their own comments, opinions, and speeches made during the meeting. This score is measured by the percentage of ambiguity words. The higher the score is, the more ambiguity there was at the meeting. In short, the methodology is exactly the same as in Question 1.
• Column 4: yearly litigation score, constructed using FOMC meeting minutes and a “litigious” dictionary. The higher the score is, the more discussions or concerns about business lawsuits there were at the meeting.
• Column 5: yearly VIX index (source: CBOE). This measure indicates stock market fear and anxiety from investors. It is usually dubbed as the “fear index” by mainstream newspapers. The higher the index is, the higher the anxiety there is in the market.
In regression terms, our dependent variable is the negative sentiment score (Column 2), and the three potential explanatory variables capture three different perspectives of the economy:
• Ambiguity and belief (Column 3)
• Business litigation concerns (Column 4)
• Stock market anxiety (Column 5)
Below, you can choose your own regression framework; for instance, you can run 1 regression with all three explanatory variables; or perhaps, you want to run univariate regressions first and then include more variables.
To receive full points, you need to write at least 5 full sentences (1) describing your regression results (i.e., from the regression table) and (2) making useful inferences (i.e., what can we learn from your results?).
In [ ]:
def reg_m(y, x):
    X = np.hstack((np.ones((len(x), 1)), x)) # adds column of ones to X (the intercept)
    results = statsmod.OLS(y, X).fit() # creates object containing regression results
    return results
2.1 Import and understand the csv dataset using numpy
In [ ]:
# Import data
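Q2DATA = np.genfromtxt("./QUESTION2_DATA.csv", delimiter=',') # import data
# Column labels; confirm this order against the CSV header, since the description above lists ambiguity (Column 3) before litigation (Column 4)
Q2DATA_n = np.array(["YEAR", "NEGSCORE", "LITSCORE", "UNCSCORE", "VIX"])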
In [ ]:
print(Q2DATA[0:2])
In [ ]:
Q2DATA = Q2DATA[1:, :] # drop the first row (the CSV header, which genfromtxt reads as nan)
In [ ]:
Q2DATA.shape
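The import steps above can be reproduced on a tiny in-memory CSV to see why the first row is dropped. The column names `COL3` and `COL4` and all numbers below are made up; the sketch only illustrates that `np.genfromtxt` turns a text header into a row of `nan`, and that `skip_header=1` is an alternative to slicing.

```python
import io
import numpy as np

# Hypothetical two-row CSV with a header line
toy_csv = (
    "YEAR,NEGSCORE,COL3,COL4,VIX\n"
    "1990,1.5,0.8,0.3,23.1\n"
    "1991,1.4,0.7,0.2,18.4\n"
)

raw = np.genfromtxt(io.StringIO(toy_csv), delimiter=",")
print(raw[0])        # the header row cannot be parsed as numbers, so it is all nan

data = raw[1:, :]    # same idea as Q2DATA[1:, :]
data_alt = np.genfromtxt(io.StringIO(toy_csv), delimiter=",", skip_header=1)

y = data[:, 1]       # dependent variable as a 1-D array
x = data[:, 2:]      # explanatory variables as a 2-D array
print(data.shape, y.shape, x.shape)
```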
2.2 [+5pts] Regressions
In [ ]:
# Your code here
2.3 [+5pts] Please enter your discussion below.
The box below is what we call a Markdown cell; double-click it, and you will see that “Type $\textit{Markdown}$ and LaTeX: $\alpha^2$” becomes a blank box where you can type your discussion. Once you finish writing, press Shift+Enter to render it as usual. If you want to change your text after rendering it, double-click the same area and it will switch back to editing mode.
In [ ]:
import time
time.ctime()
Great job! You have reached the end. Please upload your compiled notebook along with the png file generated from Question 1 (if you have successfully generated one) to Canvas. Make sure that both file names contain your BC ID for grading convenience.
Stay well and safe!
Warmly, Professor Nancy R. Xu (徐冉)