DA 101 Final Project
Part 1: Data tool glossary and exploratory analysis
(Individual*)
Here, the objective is to demonstrate mastery of the skills and tools you’ve developed. Doing so lets you review your skills in one place and apply them in a context of your own choosing. On the last page of this document is a list of key wrangling, graphing, and statistical techniques we have used this semester. Produce a series of code chunks that, as a whole, use the required number of items from each section of that list.** Ideally, each chunk you produce in this part fits within the exploratory process (wrangling and preliminary analysis)*** that builds toward what you produce in Part 2. Some commands may not fit cleanly into your analytic strategy, but at least try to make the tangent relevant.
For each code chunk, include:
- A header that states the command(s) you are demonstrating in the code chunk.
- A statement (in 2-3 sentences) of your broad question for exploration, i.e., what the code chunk aims to accomplish and why.
- For each command used, 1-2 sentences explaining what the command does, when the type of graph is appropriate to use, and/or what the statistical technique accomplishes.
- Your work/code demonstrating this command/graph within this dataset (and, ideally, relevant to your overall question of interest), with evidence that your code worked.
- For all graphs, annotation and final polish (presentation-ready).
- A conclusion of 1-2 sentences of appropriate interpretation.
I recommend that you start on your individual part first and use it as a way to explore the data on your own. It is unlikely that you and your teammates will do exactly the same things, so the individual work can jumpstart discussion and your team analysis later. While aspects of the individual work will be discussed in the course of doing your team project, avoid the temptation to share code directly with each other or to workshop your individual text and answers together.
*You will use your team’s dataset for the individual part, but you must work on your own, and explanations must be in your own words. Do not copy and paste text from an online (or any) reference. Commands and/or interpretation that are identical among team members, or that are copied verbatim from another source, will be reported as a breach of Denison’s academic integrity policy.
**Note: you can combine multiple commands into one example as long as all of the parts above are covered. For example, you might use group_by() with summarise() to make a line plot. This is fine as long as each command is introduced, the code serves to demonstrate how the commands are used, the use of the code fits within the broad scope of your investigation, and the output is polished. You would still need to clearly describe and define each step, as outlined above.
***If you choose to work with a dataset that we have already been introduced to in a lab, command lines that were given in previous labs may not be directly copied from your previous lab or project—you must apply the command to a different variable in the dataset or use the variable with a new command.
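As a rough illustration of the combination mentioned in the ** note (group_by() with summarise() feeding a line plot), a chunk might look something like the sketch below. It assumes the dplyr and ggplot2 packages and uses R's built-in ChickWeight data purely as a stand-in; the object name growth and the labels are illustrative choices, and your own chunk should use your team's dataset, your own variables, and your own explanations.

```r
library(dplyr)
library(ggplot2)

# ChickWeight is a built-in R dataset used here only as a stand-in;
# substitute your team's data and variables.
growth <- as.data.frame(ChickWeight) %>%
  group_by(Diet, Time) %>%                  # one group per diet-by-day combination
  summarise(mean_weight = mean(weight),     # collapse each group to a single mean
            .groups = "drop")               # return an ungrouped data frame

# Line plot of the summarised data, one line per diet
p <- ggplot(growth, aes(x = Time, y = mean_weight, color = Diet)) +
  geom_line() +
  labs(title = "Mean chick weight over time by diet",
       x = "Days since birth", y = "Mean weight (g)", color = "Diet")
p
```

Printing head(growth) alongside the plot is one simple way to provide evidence that the code worked.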
Part 1: List of required commands/skills to include
Commands/Wrangling
Choose 10 of the commands below to demonstrate, following the instructions in Part 1.
- table
- summary
- glimpse
- str
- rename
- group_by
- summarise: numeric or character
- mutate: mean, or sum, or length (here or in summarise) or making new variables as functions of existing ones (like we did in the congress or basketball data).
- if_else
- filter: numeric or character
- Remove NAs
- select
- arrange
- Inner/left join, merge
- sub
- rbind
- use a regular expression and/or stringr command
- write your own function(s) to automate a task
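Several of the wrangling items above can share one chunk. The sketch below is one possible arrangement, assuming the dplyr and stringr packages and the built-in mtcars data as a stand-in. The helper clean_label() and the columns model and efficiency are hypothetical names invented for this illustration, not anything from the course materials.

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not from the course materials): turns free-text labels
# into lowercase, underscore-separated names.
clean_label <- function(x) {
  x %>%
    str_trim() %>%                        # drop leading/trailing whitespace
    str_to_lower() %>%                    # normalize case
    str_replace_all("[^a-z0-9]+", "_")    # regex: runs of other characters become "_"
}

# mtcars is a built-in dataset used only as a stand-in for your own data.
car_table <- mtcars %>%
  mutate(model = clean_label(rownames(mtcars)),           # apply the helper
         efficiency = if_else(mpg > mean(mpg),            # two-way recode
                              "above average",
                              "at or below average")) %>%
  select(model, mpg, efficiency) %>%                      # keep three columns
  arrange(desc(mpg))                                      # highest mpg first

head(car_table)
```

Wrapping the cleaning steps in a function means the same rule can be reused on any other text column without retyping it.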
Visualization
Choose 5 of the plots below to demonstrate, following the instructions in Part 1. At least one of the plots should use color, size, shape, and/or facet_wrap or facet_grid to add a third or fourth variable, and at least one plot should add confidence intervals.
- Histogram
- Frequency polygon (across multiple subgroups)
- Bar plot (with at least 2 grouping variables; e.g. multiple categories for bars and use faceting; or use color to highlight a variable and stack or group together bars into categories – these are just examples)
- Boxplot
- Scatter (e.g., with geom_smooth, with or without method = "lm"); consider noting the value of jitter
- Line plot
- Map showing the data (ggmap or leaflet)
- Demonstrate a new kind of ggplot, something beyond what we covered in class (e.g. heat map, violin plot, area chart, etc.)
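The two visualization requirements above (a third variable and confidence intervals) can be met in a single plot. The following is a minimal sketch, assuming ggplot2 and the built-in mtcars data as a stand-in; the variable choices and labels are illustrative only.

```r
library(ggplot2)

# mtcars is a built-in dataset used only as a stand-in for your own data.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  # se = TRUE draws a shaded confidence band around each fitted line
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = 0.95) +
  labs(title = "Fuel efficiency versus weight, by number of cylinders",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders")
p
```

Here color carries the third variable (number of cylinders), and the shaded bands are the 95% confidence intervals for each group's fitted line.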
Statistical/Analytical tools
Choose 10 of the statistical commands below to demonstrate and interpret, following the instructions in Part 1. The commands you choose must include a t-test, a correlation test, and a regression.
- Two-sample t-test
- One-sample t-test
- Correlation test
- Bivariate Regression
- Multivariate regression
- Confint
- Predict
- Pairs plots
- Residual plot(s)
- Residual histogram
- Shapiro-Wilk test
- NCV Test
- Stepwise model selection using stepAIC and the modelaic$anova output
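For orientation, several of the statistical tools above might be called as in the sketch below. It assumes the built-in mtcars data as a stand-in and the MASS package for stepAIC; the model formula and the values passed to predict() are arbitrary choices made for illustration, and each call would still need its own explanation and interpretation in your write-up.

```r
# mtcars is a built-in dataset used only as a stand-in for your own data.

# Two-sample t-test: does mean mpg differ between transmission types?
t.test(mpg ~ am, data = mtcars)

# Correlation test between weight and mpg
cor.test(mtcars$wt, mtcars$mpg)

# Multivariate regression (predictors chosen arbitrarily for illustration)
full_model <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(full_model)

# 95% confidence intervals for the coefficients
confint(full_model, level = 0.95)

# Prediction for one hypothetical car, with a confidence interval
predict(full_model,
        newdata = data.frame(wt = 3, hp = 110, disp = 160),
        interval = "confidence")

# Shapiro-Wilk test of normality on the residuals
shapiro.test(residuals(full_model))

# Stepwise selection; MASS:: is used so that MASS::select does not mask dplyr::select
modelaic <- MASS::stepAIC(full_model, direction = "both", trace = FALSE)
modelaic$anova   # table of the steps taken and the AIC at each step
```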
Part 2: Communicating the answer to a data question
(Group)
Three written pages maximum. Consider it a “final takeaway” of all the exploration you did in Part 1. Below is a rough structure for Part 2. Use code folding so that this section reads like a ready-to-deliver report, with clear section headers and interpretations of any statistical or graphical output (like several of our previous projects), while still making it easy for your instructor to see the code if needed (we will demonstrate in class how this works).
Introduction:
- Provide a one or two paragraph introduction, professionally written, that gives an overview of the essentials someone needs to know to make sense of the data you show.
Ethical Consideration:
- Provide one paragraph discussing the stakeholders in your data analysis and any ethical concerns or responsibilities.
Data Explanation and exploration:
- Provide some details describing the data you are working with. What are the observations? The key variables you will be looking at? Are there any particular challenges in the data you will need to work through or be aware of during analysis?
- Provide two polished visuals that describe the data in a way relevant to your question (descriptive, not tied to your statistical model specifically, and not a scatterplot). Write text that describes the data and what the visuals tell you about your data or about decisions you will need to make for the analysis.
Statistical Analysis and Interpretation:
- Provide at least two statistical models (e.g., multivariate regression and/or t-test) that you interpret correctly and fully in the text.
- Provide at least three polished visuals that specifically support and validate the model(s) you have developed (e.g., residual and regression line/scatter, histogram showing normality of data or residuals, etc.), or help to communicate your main result. Visuals should have captions and be referred to clearly in your text.
- Text should fully explain what you show and your findings, to someone who is unfamiliar with your data, code, and models.
Conclusions:
- Provide one or two paragraphs of conclusions about the data: what it tells us, what the limitations of this data/model are, and one direction you could envision for future data analysts or data collectors.
Question-asking
We can ask questions that are closed (yes or no answer; are the two things different or not?) or open (require more thought and explanation; how much does something change? how is it related to something else?). Likewise, our data analyses can serve one or more purposes as we move through the data analysis cycle: descriptive (describing different measures of the data); exploratory (looking for patterns or unknown relationships in the data); inferential (using a sample to tell us something about a larger population); predictive (using relationships in the current/past data to predict the future); or causal (asking what happens to one variable when one or more other variables change).
Above all, choose something that you are curious about and that will allow you to demonstrate the listed concepts for the project.