Project 1
Part 1:
1.
Question:
Under the current circumstances of high inflation, what will the job market in the
United States look like in the near future?
Description: A series of stimulus packages and measures taken by the Federal Reserve has pushed
inflation sharply higher: since April 2021, the inflation rate has been above 4% and has stayed at
that level until now, as the chart in Figure 1 shows. While people worry about how inflation
will affect the purchasing power of their salaries, they tend to overlook the impact of
inflation on the unemployment rate, even though whether one is employed determines
whether there is a salary with any purchasing power in the first place.
Figure 1: Inflation rate of the US for the past 12 months.
Source: YCharts.com (https://ycharts.com/indicators/us_inflation_rate)
Fortunately, according to A. W. Phillips, the inflation rate and the unemployment rate are
negatively related, as shown in Figure 2 (University of Toronto, n.d.).
My question of interest therefore focuses on this relationship.
Figure 2: The Phillips Curve.
Source: University of Toronto (https://www.economics.utoronto.ca/jfloyd/modules/phlc.html)
2.
To study this effect, two datasets are needed. The first is the national unemployment rate of the US,
which can be retrieved from the website of the US Bureau of Labor Statistics at:
https://www.bls.gov/charts/employment-situation/civilian-unemployment-rate.htm.
The second is the consumer price index of the US, which can also be retrieved from the Bureau of
Labor Statistics at: https://www.bls.gov/cpi/.
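The text does not state how the inflation rate is derived from the consumer price index. A common convention, assumed here, is the 12-month percent change of the monthly index. The short sketch below illustrates that calculation; the function name yoy_inflation and the index values are placeholders for illustration, not actual CPI data.

```python
import pandas as pd

def yoy_inflation(cpi: pd.Series) -> pd.Series:
    """12-month percent change of a monthly CPI series.

    Assumption: the inflation rate is defined as the year-over-year
    change of the index; the first 12 months have no value (NaN).
    """
    return (cpi / cpi.shift(12) - 1) * 100

# Placeholder index values (100, 101, ..., 112), not real CPI data.
cpi = pd.Series([100.0 + i for i in range(13)])
inflation = yoy_inflation(cpi)
print(inflation.iloc[-1])  # (112 / 100 - 1) * 100
```

With real data, the series would be indexed by month so that it can be aligned with the unemployment rate for the same month.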
a.
Table 1 shows the monthly national unemployment rate of the US from September 2001 to
September 2021. Besides the total unemployment rate, the data also shows the unemployment
rate by age, gender, and ethnicity.
Table 1: National unemployment rate of the US.
Data source: the US Bureau of Labor Statistics (https://www.bls.gov/charts/employment-situation/civilian-unemployment-rate.htm)
b.
The data is readily accessible on the website of the US Bureau of Labor Statistics:
https://www.bls.gov/charts/employment-situation/civilian-unemployment-rate.htm. In
addition, the disclaimer in Figure 3 indicates that the website has an open license
and aims to give the public access to information. Furthermore, the data can be
downloaded to Excel and imported into Python for calculation, so it is machine processable.
Figure 3: General disclaimer by the US Bureau of Labor Statistics.
Source: US Bureau of Labor Statistics (https://www.bls.gov/bls/disclaimer.htm)
c.
To access the data:
• Open the website of the US Bureau of Labor Statistics at:
https://www.bls.gov/charts/employment-situation/civilian-unemployment-rate.htm
• Select/click the ‘show table’ button to preview the unemployment rate data in table format
• Select the data needed, for example from year 2010 to 2021, and copy and paste the data
into Excel
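Once the table has been pasted into Excel, it can be brought into Python. The sketch below is a minimal example under stated assumptions: the sheet has been saved as a CSV file, and it contains columns named Month and Total (these names, and the function load_unemployment, are hypothetical). The rows in the sample are placeholders that only show the expected layout, not actual BLS values.

```python
import io
import pandas as pd

def load_unemployment(source):
    """Read a CSV export of the pasted table.

    Assumption: the file has a `Month` column such as "Jan 2010"
    and a `Total` column with the total unemployment rate.
    """
    df = pd.read_csv(source)
    df["Month"] = pd.to_datetime(df["Month"], format="%b %Y")
    return df.set_index("Month").sort_index()

# Placeholder rows only (not real BLS values); one cell is left blank on purpose.
sample = io.StringIO("Month,Total\nJan 2010,9.0\nFeb 2010,\nMar 2010,8.0\n")
unemployment = load_unemployment(sample)

# Count missing values per column before any analysis.
print(unemployment.isna().sum())
```

In practice, load_unemployment would be called with the path of the saved file instead of the in-memory sample.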
Part 2:
1.
The study I want to conduct is observational, as it is not based on any experiment that I
designed: I cannot adjust the inflation rate or the unemployment rate, as I am not the
government. Because the data does not come from a designed experiment in a controlled
environment, and I cannot manipulate it, I believe it is relatively objective.
Nonetheless, the choice of study period and analytical methods may introduce subjectivity.
2.
My dependent variable is the unemployment rate and my independent variable is the inflation
rate. The null hypothesis is that there is no linear relationship between the inflation rate
and the unemployment rate; the alternative hypothesis is that the relationship is negative.
That is, for the relationship:
𝑦 = 𝛼 + 𝛽𝑥
where x is the inflation rate, and y is the unemployment rate, there is:
𝐻0: 𝛽 = 0
and
𝐻1: 𝛽 < 0
3.
As mentioned above, my dependent variable is the unemployment rate and my independent variable is the inflation rate.
Part 3:
1.
First, I would check the data for NaNs and empty values. Second, I would draw a scatter plot with the inflation rate on the x axis and the unemployment rate on the y axis to explore whether a linear model is appropriate for describing their relationship. Third, I would plot a histogram of each variable. Combining the histograms and the scatter plot, I could identify potential outliers.
2.
The first step helps me remove NaNs and fill in blanks, if there are any, so that I can make better use of my data. If the points on the scatter plot concentrate around a straight line, the second step helps me confirm that my linear model is reasonable. If the third step reveals any outliers, I could remove those that might bias my regression results.
Part 4:
1.
• Outlier removal. In the scatter plot, data points that lie far away from the potential regression line should be given specific attention. We could run a preliminary regression, check the standardized residuals, and decide whether any outliers need to be removed. In addition, the standardized values (z-scores) of the variables themselves should be checked so that extreme values can be removed.
• Time series decomposition. Apart from studying the effect of inflation on the unemployment rate, I should also explore the autocorrelation of the unemployment rate across consecutive periods. That is, the previous period’s unemployment rate might also be a factor that affects the current unemployment rate. Alternatively, could a time series model predict the unemployment rate on its own, so that we would not need the inflation rate?
• Feature selection. I aim to explore the negative relationship between the two variables and predict the unemployment rate using the expected inflation rate in the near future. However, the dataset covers more than 20 years.
Is it reasonable to use all 20 years of data to estimate the model, or is it better to use only the most recent economic cycle? For example, from 2001 to 2021 we experienced three major global economic crises. Perhaps it would be better to estimate the model on the data of the most recent economic cycle. I would carry out two regressions: 1) using all 20 years of data together and 2) using only the data of the most recent economic cycle. I would hold out the data from April 2021 to September 2021 for testing to verify my assumption.
2.
I expect that the first step would help me remove most of the outliers that would bias my model. (In linear regression, this should help increase the R-square of my regression model.) It may help me better explore the negative relationship between the two variables and predict the future unemployment rate based on expected inflation.
The second step is expected to help me confirm that a linear relationship between the inflation rate and the unemployment rate is justifiable and that time series models might not be a good choice for modelling changes in the unemployment rate. If so, I could better explore the negative relationship between the two variables and predict the unemployment rate based on expected inflation, rather than using the unemployment rate data alone.
The third step is expected to help me choose a more appropriate period for estimating my regression model. This model would then help me explore the negative relationship between the two variables and predict the unemployment rate from expected inflation more accurately.
Part 5:
1.
As mentioned above, I am going to fit a linear regression model, using the inflation rate as the independent variable and the unemployment rate as the dependent variable. After cleaning and pre-processing the data, I expect the relationship to be negative and the R-square of the regression model to be reasonably high.
The estimated regression equation should be in the form of:
𝑦 = 𝛼 + 𝛽𝑥
where x is the inflation rate, and y is the unemployment rate.
2.
The algorithm I am going to choose is linear regression. My test has two aims: 1) confirm the negative relationship between the inflation rate and the unemployment rate and 2) try to forecast the unemployment rate using the expected inflation rate. As both variables are continuous, I am going to use linear regression to model the relationship, as mentioned above:
𝑦 = 𝛼 + 𝛽𝑥
To verify the negative linear relationship, I am also going to carry out the hypothesis test:
𝐻0: 𝛽 = 0
and
𝐻1: 𝛽 < 0
For this, I could use the t-statistic to test the significance of the coefficient of the independent variable at a given significance level, say 5%. Furthermore, the F-statistic could be used to determine the significance of the regression model as a whole. If it is significant at, say, the 5% level, we could adopt the model and use it to make predictions for the near future. Finally, I also need to make sure that the R-square is reasonably high.
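The regression and the one-sided test described above could be sketched in Python as follows. This is only an illustration under stated assumptions: it uses scipy.stats.linregress, whose alternative argument requires SciPy 1.7 or later; the function name fit_phillips_line is hypothetical; and the data is synthetic, generated with a known negative slope rather than taken from the BLS. With a single regressor, the F-statistic of the model equals the square of the slope's t-statistic, which is how it is computed here.

```python
import numpy as np
from scipy import stats

def fit_phillips_line(x, y, alpha=0.05):
    """Fit y = a + b*x by least squares and test H0: b = 0 against H1: b < 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # alternative="less" gives the one-sided p-value for a negative slope.
    fit = stats.linregress(x, y, alternative="less")
    t_stat = fit.slope / fit.stderr
    return {
        "alpha_hat": fit.intercept,
        "beta_hat": fit.slope,
        "t_stat": t_stat,
        "p_value": fit.pvalue,
        "f_stat": t_stat ** 2,        # simple regression: F equals t squared
        "r_squared": fit.rvalue ** 2,
        "reject_h0": fit.pvalue < alpha,
    }

# Synthetic data for illustration only: x plays the role of the inflation
# rate and y the unemployment rate, built with a true slope of -0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=60)
y = 8.0 - 0.5 * x + rng.normal(0.0, 0.3, size=60)

result = fit_phillips_line(x, y)
print(result)
```

To compare the two estimation periods, the same function could be called once on the full sample and once on the most recent cycle, with the held-out months used only to check the predictions.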