Empirical Project Guidelines
The empirical project is an original research paper that involves using one of the datasets
provided for the course. The written portion is due on Canvas on December 9, 2021.
● You choose your own topic for this project based on the dataset you choose.
● You can merge in other data if needed. Other sources of data include the ICPSR
website, Census Bureau, Bureau of Labor Statistics, and Bureau of Economic
Analysis.
● The project should involve answering a question, testing a hypothesis, or addressing a
policy issue.
● This should be an original piece of work.
● Plagiarism rules will be strictly enforced. Make sure to cite any published paper or
web site from which you report information in your paper. Any excerpt taken
word-for-word from another source must be put in quotes and cited. Facts and ideas
taken from other sources (including the web) must also be cited in the paper.
Paper Format
● Single spaced, 12-point Times New Roman font, 1” margins
● Make sure to number the pages (don’t start a new page for each section). The length
of the paper will be approximately 5 – 7 pages without the Appendix (see below).
● All tables and figures go in the Appendix (a separate page for each one)
Paper Structure
The outline below lists the sections and items that should be included in the paper. You
are, of course, free to include any other information or procedures that you think are
important to your analysis.
A. Title page
This should be a separate page.
● Title, name, date, and a short abstract (≤200 words), all on the same page.
B. Section 1: Introduction
Here you will include background information about your topic.
● State the hypothesis you are testing or question you are answering in your paper.
● Explain why this is interesting and any policy implications that follow from your
results.
● You can state the result in the introduction.
● You can also give a brief description of the different sections of the paper.
C. Section 2: Literature Review
● Make sure to summarize at least two articles on your topic.
● A useful place to search for articles on your topic is Google Scholar.
● Articles can be working papers or come from online journals, but other material from
the web generally does NOT count.
● All referenced papers must be available in English.
● Check with me if you are unsure if your article meets these criteria.
● Make sure to use the proper methods for citing these articles in the text of your paper
(full citations will be included in the Reference section).
D. Section 3: Data Description and Visualization
Provide:
● The source of each dataset used in your analysis
● The time-period of analysis and the frequency of the data
● The unit of observation
● Any restrictions on the full sample that you imposed to obtain your final dataset
● The sample size of the final dataset
Include definitions of any variables that you’ll be referencing in the paper and include a table
of summary statistics for these variables in the Appendix. Also in the Appendix, include at
least two plots of the relevant data.
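As a minimal sketch of how a summary-statistics table and two plots could be produced in R, the example below uses R's built-in mtcars data purely as a stand-in. The variable names, figure titles, and choice of statistics are assumptions for illustration and would be replaced by those from your own dataset.

```r
# Stand-in data: R's built-in mtcars, used only so the sketch runs anywhere.
# Replace with the variables from your own final dataset.
vars <- mtcars[, c("mpg", "hp", "wt")]

# One row per variable: non-missing count, mean, standard deviation, min, max.
summary_table <- data.frame(
  variable = names(vars),
  n        = sapply(vars, function(x) sum(!is.na(x))),
  mean     = sapply(vars, mean, na.rm = TRUE),
  sd       = sapply(vars, sd, na.rm = TRUE),
  min      = sapply(vars, min, na.rm = TRUE),
  max      = sapply(vars, max, na.rm = TRUE)
)
print(summary_table, row.names = FALSE, digits = 3)

# Two plots with numbered titles and labeled axes (hypothetical figure titles).
hist(vars$mpg,
     main = "Figure 1: Distribution of miles per gallon",
     xlab = "Miles per gallon")
plot(vars$wt, vars$mpg,
     main = "Figure 2: Miles per gallon against weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```

Reporting the non-missing count for each variable makes it easy to check the table against the sample size stated in the text.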
E. Section 4: Model
Write out the mathematical model(s) that you will estimate (you can use the equation editor
in Word or any other editor).
● Include the predictor and response variables with the appropriate subscripts. That is, if
you have a model
Y = β0 + β1X1 + β2X2 + β3X1X2
explain what X1, X2 and Y refer to.
● If you are using any mathematical notation (like σ or β0), make sure that you define it.
● Include a discussion of model selection (e.g. if polynomial, why did you choose this
degree?)
● Include a discussion of variable selection (e.g. if you chose a subset of all possible
predictors, why this subset?)
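As an illustration of the notation requested above, the example model could be typeset with an observation subscript. The subscript i (indexing the unit of observation) and the error term ε_i are assumptions added for this sketch; they do not appear in the example equation as given.

```latex
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i} + \varepsilon_i
```

In this form, β3 is the coefficient on the interaction term, so the marginal effect of X1 on Y is β1 + β3·X2 rather than β1 alone. A paper using such a model would state this when defining the coefficients.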
F. Section 5: Empirical Analysis
Present the results of the empirical analysis in this section. Results should be presented in
numbered, titled tables or figures.
● Discuss the results including their statistical significance (if relevant). Do NOT
include formal hypothesis tests in this section of the paper; only discuss the outcomes
of the test (i.e. give the p-value and whether or not the null hypothesis is rejected).
● Are the results consistent with your hypothesis? If not, explain why this is the case.
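A minimal R sketch of estimating a model with an interaction term and pulling out the p-values to discuss is shown below. It uses the built-in mtcars data as a stand-in, and the variables and the 0.05 significance level are assumptions chosen for illustration only.

```r
# Stand-in regression: mpg plays the role of Y, wt and hp the roles of X1 and X2.
# The formula wt * hp expands to wt + hp + wt:hp (both main effects plus interaction).
fit <- lm(mpg ~ wt * hp, data = mtcars)

# Coefficient table: estimate, standard error, t statistic, p-value.
coef_table <- coef(summary(fit))
print(round(coef_table, 4))

# Compare each p-value with an assumed significance level.
alpha <- 0.05
p_values <- coef_table[, "Pr(>|t|)"]
reject_null <- p_values < alpha
print(reject_null)
```

The text of the paper would then report each p-value and whether the corresponding null hypothesis is rejected, with the full coefficient table placed in the Appendix.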
G. Conclusion
Summarize the results of the paper and raise any possible policy implications of the results (if
relevant).
H. References
Include full citations for articles/books that you reference in the text of the paper.
I. Appendix
● Include your R script file, Matlab code, etc., annotated so that the purpose of each
piece of output is clear.
● Tables and graphs should be numbered and have a title.
● Make sure that every table and graph in the Appendix is referenced somewhere in the
text of the paper.
● Tables should be in “stand alone” form. The reader should not have to search
elsewhere in the paper to get relevant information for the table results.
Data set:
Health/Medical Data
Overview
The National Health and Nutrition Examination Survey (NHANES) is a program of studies
designed to assess the health and nutritional status of people in the U.S. This survey contains
critical information for the National Center for Health Statistics, which is responsible for
producing vital and health statistics for the nation.
This survey examines a nationally representative sample of around 5,000 people each year,
located in counties throughout the country. In 1999, the survey became a continuous program
to meet emerging health and nutritional needs. The NHANES survey combines physical
examinations with interview answers. The interview collects demographic,
socioeconomic, dietary and health-related information. The physical examination consists of
medical, dental and physiological measurements.
General Description of Variables
Variables in the NHANES database are broken into several categories. They come from both
the physical examinations and the interview answers. Note that the NHANES 2017-2018 and
2019-2020 variables and data may not be used for this project. The general
categories are:
Demographic data
Dietary data
Examination data
Laboratory data
Questionnaire data
Limited access data – not released to the public, on-site access only granted through
NCHS’s Research Data Center to guarantee confidentiality
For a detailed description of each specific variable in each category, go to the
NHANES database, select the year of interest, and click on the group of interest.
Accessing the Data
Use the following link https://wwwn.cdc.gov/nchs/nhanes/default.aspx.
1. Choose a particular survey cycle (NHANES 2015-16 or earlier).
2. Select the dataset you plan on using and download the data as an XPT file.
3. To load this data into R, install the foreign package by running
install.packages("foreign") and load it with library(foreign).
Then you can read the XPT file with:
data <- read.xport("Downloads/your filename.xpt")
Make sure to access the documentation file (codebook) as well as the data file. There are
plenty of tutorials to help you get acquainted with the data.
Potential Uses of the Data
● Analyzing risk factors for diseases
– Try to find which demographics, dietary habits, and lab results are most highly
correlated with certain diseases
– Are these relationships causal, or are they due to other underlying factors (omitted
variable bias, OVB)?
● Determining how nutrition impacts various aspects of one’s life
– Blood pressure, cardiovascular health, kidney conditions, etc.
– You can also see which types of food/vitamins/minerals are important to one’s health,
measured by a number of different metrics
● This is a great database because it contains many variables that we generally assume are
correlated with each other. We can test many of these assumptions:
– How bad for you are alcohol usage, drug usage, and junk food?
– Importance of a well-balanced diet
– Different types of nutritional intake for different socioeconomic groups
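Several of these uses combine variables from different NHANES categories, for example demographics and examination results. A minimal R sketch of merging two files on the respondent sequence number SEQN follows. The toy data frames stand in for files that would be read with read.xport, and the variable names RIDAGEYR and BPXSY1 are used only as illustrative examples; check the codebook for the actual names in your chosen cycle.

```r
# Toy stand-ins for two NHANES files. In practice each would come from
# read.xport() applied to a downloaded XPT file.
demo <- data.frame(SEQN = c(1, 2, 3, 4, 5),
                   RIDAGEYR = c(34, 51, 27, 63, 45))       # illustrative age variable
bpx  <- data.frame(SEQN = c(2, 3, 4, 5, 6),
                   BPXSY1 = c(128, 110, 141, 122, 117))    # illustrative blood pressure variable

# Keep only respondents who appear in both files (inner join on SEQN).
merged <- merge(demo, bpx, by = "SEQN")
print(merged)

# Record how many respondents were dropped by the merge; this belongs in the
# description of sample restrictions in the Data section.
cat("Rows in demo:", nrow(demo), " rows after merge:", nrow(merged), "\n")

# A first look at the association between the two variables.
print(cor(merged$RIDAGEYR, merged$BPXSY1))
```

An inner join is used here so that every remaining row has both variables. Passing all.x = TRUE to merge would instead keep every respondent from the first file and leave missing values where the second file has no match.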