10/22/2020 7. Project Deliverable 1
A key feature of this class is applying what you learn to a real-world dataset. The project requires you to work with a partner (or in a team of 3 or 4 members for larger classes) to perform all the steps of a typical data analysis project over the course of the term. There are two deliverables for this assignment. Deliverables will include a write-up (prepared in a word processor such as Microsoft Word), relevant R code (as an .R script, an .Rmd file, or a knitted R Markdown file), dataset(s), and presentation slides.
DISCLAIMER
STUDENTS TAKING PART IN THIS PROJECT MUST REVIEW AND UNDERSTAND THE GUIDELINES AND LIMITATIONS ASSOCIATED WITH THE TERMS AND CONDITIONS OF THE COMPANY.
Helpful Tip: Working in Your Group
Before beginning this assignment, you may wish to meet informally in your group homepage to divide up the responsibilities for this assignment. For more information on how to do this, please visit the page called “Working in Your Group”.
Details
In this part, you will find a suitable dataset and put together a description of the dataset and its variables. You will examine the data to identify any issues with it. You will then clean and process the dataset, using the issues you identified as your starting point, and document each data cleaning step. The end product is a dataset ready for analysis.
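As a starting point for examining the data, the sketch below counts rows, columns, variable types, missing values, and duplicate rows using base R only. The data frame `raw` and the helper `inspect_data` are hypothetical names invented for illustration, and the toy values stand in for your own dataset.

```r
# Hypothetical toy data standing in for a raw dataset; replace with your own.
raw <- data.frame(
  id     = c(1, 2, 3, 3, 5),
  state  = c("NY", "CA", NA, NA, "TX"),
  income = c("52000", "61000", "n/a", "n/a", "48000"),
  age    = c(34, NA, 45, 45, 29),
  stringsAsFactors = FALSE
)

# Summarize size, variable types, missing values, and duplicate rows.
# (inspect_data is an illustrative helper, not a function from any package.)
inspect_data <- function(df) {
  is_num <- vapply(df, is.numeric, logical(1))
  list(
    n_rows        = nrow(df),
    n_cols        = ncol(df),
    n_numeric     = sum(is_num),
    n_categorical = sum(!is_num),
    missing       = colSums(is.na(df)),
    n_duplicates  = sum(duplicated(df))
  )
}

report <- inspect_data(raw)
str(report)
```

In this toy example `income` is stored as text, so its "n/a" entries are not counted as missing. Mismatches of this kind between the stored type and the intended type are worth listing among the issues to resolve during cleaning.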
Detailed Description
1. Raw Dataset: Find a dataset to analyze. Generally speaking, the dataset must contain at least 1000 observations on at least 8 variables. The variables must include a mix of categorical and numeric variables. It would be best if the dataset has a US context, so that the rest of the class can relate to it at the time of the presentation. If use of the dataset requires permission of the author, it is your responsibility to obtain permission. Generally, use of publicly available datasets is deemed okay under the Fair Use doctrine.
2. Source of Data: There are many sources of data, including government websites (e.g., data.gov), aid organizations (e.g., WHO), websites sharing their own data (e.g., Google Trends), data gathered from APIs (e.g., Twitter API, Yahoo Finance API), and data platforms (e.g., UCI Machine Learning Repository, Kaggle). You can find a comprehensive list of sources of data on this blog post.
3. Once you have identified the dataset, construct a description of the dataset. This should include a short narrative (3-4 lines) on the dataset and a brief description of each variable (a few words).
4. Articulate the question(s) you would like to answer using the data.
5. Data seldom comes in an analysis-ready form. Use the data preparation skills learned in the Framework and Methods – 1 class to clean the data and prepare it for analysis. This may include dealing with missing data, fixing data types, rescaling variables, combining variables, and reshaping the data, among other steps.
6. Once you have prepared the data, summarize all data processing steps and attach organized R code that will generate the cleaned dataset from the raw dataset.
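The cleaning steps named in item 5 can be sketched as a single function, so that the code attached for item 6 regenerates the cleaned dataset from the raw one in one call. This is a minimal illustration under stated assumptions: the toy data, the column names, the "n/a" placeholder, the median imputation, and the helper name `clean_data` are all hypothetical choices, not requirements of the assignment.

```r
# Hypothetical raw data; column names and values are for illustration only.
raw <- data.frame(
  state  = c("ny", "CA", NA, "tx"),
  income = c("52000", "61000", "n/a", "48000"),
  age    = c(34, NA, 45, 29),
  stringsAsFactors = FALSE
)

# Illustrative cleaning function: one documented step per comment.
clean_data <- function(df) {
  # Fix data types: treat the "n/a" placeholder as missing, then convert to numeric.
  df$income[df$income %in% "n/a"] <- NA
  df$income <- as.numeric(df$income)

  # Deal with missing data: impute missing ages with the median (an assumed choice),
  # then drop rows that are still incomplete.
  df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)
  df <- df[complete.cases(df), ]

  # Standardize the categorical variable and store it as a factor.
  df$state <- factor(toupper(df$state))

  # Rescale: express income in thousands of dollars.
  df$income_k <- df$income / 1000

  rownames(df) <- NULL
  df
}

cleaned <- clean_data(raw)
print(cleaned)
```

Keeping every step inside one function makes the summary of processing steps easy to write, since each comment maps to one line of the summary.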
Alignment
This assignment aligns with the following objectives:
Design an impactful presentation
Deliver and explain analytical outputs to a general audience
Assessment
Rubric
Criteria                                                                  Points
Raw dataset meets requirements for Project                                10
Properly described data and variables                                     10
Identified and resolved key issues with raw data                          10
Effectively demonstrated the use of functions in R to clean the data      10
Clearly articulated question(s) to be answered using the data             10
Submission
What to Submit:
Submit the following as part of deliverable 1:
1. Raw dataset: Submit the raw dataset. If the dataset is too large, post it to a cloud drive and share a persistent link.
2. R Code: Share R code that can be used to obtain the cleaned dataset from the raw dataset. The R code may be submitted as an R script (.R), an R Markdown file (.Rmd), or a knitted R Markdown (.html) file.
3. Cleaned dataset: Submit the cleaned dataset obtained after processing the raw dataset. If the dataset is too large, share a persistent link to a cloud source.
4. Write up: Use a text editor (like Microsoft Word) for the write-up. Expected length is 2-3 pages. The write-up should include:
1. a brief description of the data,
2. a short narrative on the dataset,
3. a brief description of each variable,
4. a description of the question(s) to be answered using the data,
5. a summary of the data processing steps undertaken.
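To show how the raw dataset, R code, and cleaned dataset fit together, here is a minimal end-to-end sketch: it reads a raw CSV, applies one cleaning step, writes the cleaned CSV, and reads it back as a check. The file names, the temporary folder, the toy contents, and the "n/a" placeholder are all assumptions made so the sketch runs on its own. In your submission the paths would point to your actual files.

```r
# Hypothetical file locations; a temporary folder keeps the sketch self-contained.
raw_path     <- file.path(tempdir(), "raw_data.csv")
cleaned_path <- file.path(tempdir(), "cleaned_data.csv")

# Stand-in for the raw file you would submit (toy contents for illustration).
writeLines(c("city,sales", "Austin,10", "Boston,n/a", ",30"), raw_path)

# 1. Read the raw data, treating common placeholders as missing values.
raw <- read.csv(raw_path, na.strings = c("", "NA", "n/a"),
                stringsAsFactors = FALSE)

# 2. Clean: here, simply keep the complete rows (your steps will differ).
cleaned <- raw[complete.cases(raw), , drop = FALSE]

# 3. Write the cleaned dataset without row names so it re-reads cleanly.
write.csv(cleaned, cleaned_path, row.names = FALSE)

# 4. Read the cleaned file back to confirm the script reproduces it.
check <- read.csv(cleaned_path, stringsAsFactors = FALSE)
str(check)
```

Running the script from a fresh R session is a simple way to confirm that the cleaned dataset can be regenerated from the raw dataset using only the code you submit.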
How to Submit:
Only one member of your group need submit this assignment on behalf of the group. To complete your submission,
1. Click the blue Submit Assignment button at the top of this page.
2. Click the Choose File button, and locate your submission. Repeat for each file you submit.
3. Feel free to include a comment with your submission.
4. Finally, click the blue Submit Assignment button.
https://courseworks2.columbia.edu/courses/107618/assignments/488986?module_item_id=1005734 4/4