R Toolbox Assignment
The purpose of this project is to tie together all the concepts that you have learned throughout this course in order to create a portfolio piece that you can talk about during interviews and career fairs.
Your submission must be in Shiny or R Markdown and must include all the resources we need to run and test your code (e.g. data, code, documentation, links). It must contain your entire workflow, from data import, manipulation, and analysis through visualization and communication of results. We expect to be able to run your code without any obvious errors. Please submit your files on GitHub by April 29 at 11:59pm. The link to GitHub will be shared later in the semester.
Requirements
The question or problem that you want to solve is open ended. You are welcome to discuss it with your instructional team, but it is up to you to decide what you want to pursue. You will select your data, analyze it, and create the visualizations you think will best convey the message that you want your audience to see. The first and most important step is to identify and describe the problem statement. Everything you do afterward, from analysis to visualization, should support the answer to your problem statement. Below is an outline with guidelines for developing the toolbox.
Guidelines
1. Identify and state the problem or question that you are seeking to solve or answer
a. Make sure that your question/problem is quantifiable and can be answered with the data that you select.
b. Outline the goals and criteria for your success when completing your analysis. Define key metrics and views of data and how these tie to your response.
2. Choose one or more unique datasets that interest you and can answer your question
a. This dataset must have breadth and depth. In this context depth refers to the level of detail in the data and breadth refers to the number of dimensions that you can explore with your data.
b. Here are some good data sources:
i. Kevin Chai’s Dataset List
ii. Kaggle
iii. Federal Government Data
iv. Smart Data Collective list
v. KDNuggets
vi. Healthcare Data
vii. Census Data
viii. Socrata
3. Import, prepare and clean your dataset for analysis.
a. Document and mitigate any outliers, missing data, or incorrectly recorded data. Assess their impact on your data and explain how you mitigated it (e.g. what values you chose to replace the missing values with, and why).
b. Thoroughly comment your code so that we can easily follow every step of what you are trying to do and the logic/reason behind it
4. Analyze your dataset
a. Create segments of data that enable you to focus on portions of your answer (these segments will eventually lead to a visualization that will answer your question)
b. Summarize each segment into a simple tabulation of data that brings out the key insights
5. Visualize your data
a. Visualize these tabulations in a way that brings out the message that you are trying to get across. Make sure that this message ties closely to, and supports, your problem statement and its answer.
b. Format all your visualizations appropriately: use color, shape, and size aesthetics to differentiate groupings in your plots, include a legend where one is needed to understand the visual, label your axes, and title your plots.
6. Put everything together in an R Markdown PDF/HTML document or a Shiny application
a. Create 3-5 visualizations, either as separate pages in a Shiny dashboard (https://rstudio.github.io/shinydashboard/) or as part of an R Markdown document.
b. If you decide to go for Shiny: give each visual some form of Shiny control, such as choosing axis variables, filtering or selecting values, selecting the graph geometry, or any interaction of similar or greater complexity.
c. Bonus: Create a Shiny application in R Markdown
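For guideline 6.b, a Shiny control can be as simple as a drop-down that picks which variable goes on an axis. The sketch below is a minimal, hypothetical illustration built on R's built-in mtcars data. The input and output IDs (x_var, scatter) and the list of axis choices are made up for this example; a real submission would use your own data and visualizations.

```r
library(shiny)

# Columns the user may put on the x-axis (all exist in the built-in mtcars data).
# This list is an assumption made for the example.
axis_choices <- c("wt", "hp", "disp")

ui <- fluidPage(
  titlePanel("Fuel economy explorer (illustrative)"),
  # The Shiny control: a drop-down that chooses the x-axis variable.
  selectInput("x_var", "X-axis variable:", choices = axis_choices),
  plotOutput("scatter")
)

server <- function(input, output, session) {
  output$scatter <- renderPlot({
    # Redrawn automatically whenever the drop-down selection changes.
    plot(mtcars[[input$x_var]], mtcars$mpg,
         xlab = input$x_var, ylab = "mpg",
         main = paste("mpg vs", input$x_var))
  })
}

# Build the app object; in an interactive session, launch it with runApp(app).
app <- shinyApp(ui, server)
```

The same pattern extends to the other interactions listed in 6.b: swap selectInput() for sliderInput() to filter a range of values, or add a second selectInput() to switch between plot types.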
Example
1.a. You are trying to understand the determinants of wages in the state of New York.
1.b. Some key drivers of wages can be location (zip code), age and educational attainment.
2.a. Census data will provide great breadth (hundreds or thousands of columns) and depth (all of these columns can be drilled down from the state level all the way down to the county and sometimes zip code).
3.a. Carefully and thoroughly review the Census data documentation so that you understand how the data were gathered, how missing values are treated, and what data are available. Download the necessary data from the Census website and use a package like readr to import it as a tibble. You then use tidyr to turn your tibble into tidy data.
3.b. Some missing values are coded as ‘99999998’, so you write code to convert these into NA and comment that code so that we can follow your logic and process.
4.a/b. Use dplyr to create buckets of wages broken out by age brackets to better understand this variation across ages.
5.a. You visualize the data in 4.a using a bar chart.
6.a. The output of 5.a is in your final R Markdown or Shiny application.
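As a rough illustration of steps 3.b through 5.a, the sketch below runs on a small made-up data frame rather than real Census data. The column names (age, wage), the age brackets, and the choice of the median as the summary statistic are all assumptions made for this example; only the 99999998 missing-value code comes from the example above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for a Census extract (not real figures).
wages <- tibble(
  age  = c(22, 27, 34, 41, 48, 55, 63, 38),
  wage = c(28000, 35000, 99999998, 61000, 58000, 99999998, 47000, 52000)
)

# 3.b: 99999998 is a placeholder for "missing", not a real wage,
# so recode it to NA before computing any statistics.
wages_clean <- wages %>%
  mutate(wage = na_if(wage, 99999998))

# 4.a/b: bucket ages into brackets (assumed cut points), then summarise
# each bracket into a simple tabulation.
wage_by_age <- wages_clean %>%
  mutate(age_bracket = cut(age,
                           breaks = c(18, 30, 45, 65),
                           labels = c("18-29", "30-44", "45-64"),
                           right  = FALSE)) %>%
  group_by(age_bracket) %>%
  summarise(
    n_reported  = sum(!is.na(wage)),          # rows with a usable wage
    median_wage = median(wage, na.rm = TRUE), # ignore the recoded NAs
    .groups = "drop"
  )

# 5.a: bar chart of the tabulation, with a title and labelled axes.
wage_plot <- ggplot(wage_by_age, aes(x = age_bracket, y = median_wage)) +
  geom_col() +
  labs(title = "Median wage by age bracket (toy data)",
       x = "Age bracket",
       y = "Median wage (USD)")
```

Printing wage_plot in an R Markdown chunk, or returning it from renderPlot() in Shiny, would cover step 6.a. In a real analysis you would also document why NA was chosen over imputing a replacement value, as guideline 3.a asks.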
Grading
Your submission will be graded on a scale from 1 to 3 points per section.
1. Code workflow (50% of final grade)
a. 3 points – your code runs without error (warnings and other messages are okay as long as they are not fatal to runtime)
b. 3 points – your code exactly replicates the output in your Shiny application or R Markdown submission.
c. 3 points – organization and workflow. The way your code is organized makes sense. For example, we would not be able to manipulate data before importing it or to visualize it before cleaning out missing values or outliers.
d. 3 points – your code is thoroughly commented so that we can follow everything that you are trying to do in your script
e. 5 points – efficient use of R programming skills (e.g. leveraging functional programming to work with repetitive tasks, using dplyr for creating segments of data or tidyr to tidy data)
2. Analysis (20% of final grade)
a. 3 points – analysis questions and research plan are defined precisely enough to be answerable using data (e.g. asking "How much did revenue grow this year?" rather than "How is the company doing?")
b. 3 points – the data you selected can appropriately answer the questions you set out to explore
c. 3 points – appropriate use of existing metrics and creation of new metrics to explore your data (e.g. year-over-year growth, composite metrics such as weighted averages, dummy variables)
d. 3 points – appropriate and efficient segmentation of data to address the questions in 2.a
e. BONUS: 5 points – create a slide deck that goes through your work as if you were to present to a professional audience.
3. Visualization and communication (30% of final grade)
a. 3 points – creation of an appropriate visualization for your objectives in 2.a
b. 3 points – appropriate formatting of visualizations as outlined in Requirement #5
c. 3 points – adequate Shiny interactivity as outlined in Requirement #6
d. BONUS: 5 points – create a Shiny application within an R Markdown document
Please note that although you are encouraged to use data, work and code that you have done before, this work MUST BE original and not previously submitted to another class.