Final Project – Data Management Competition
STAT 440 – Statistical Data Management Summer 2018
The final exam is a final project consisting of a written report and a SAS program. Its focus is applying the data management approaches we have covered during the semester to prepare a thoroughly cleaned and validated dataset for competition. These approaches include reading raw data files, checking data for errors, validating and cleaning data, creating formats and labels, deriving new variables, creating arrays, iterative (DO-loop) processing, subsetting data, joining/merging/appending, and conditional output. Each week of the semester (excluding Week 8), the Instructor will provide a dataset that is related to the IRI drug store sales data. It is up to you (the student) to work with those datasets, become familiar with them, and investigate how they may be used together for the competition. Check the Data Competition folder for more information about the IRI drug store sales data and for the datasets added each week. Each student will manage the IRI drug store sales dataset to compete in one of the following categories:
- Most Relevant Variables: the prepared dataset that has the largest number of relevant variables. Relevant variables are those that are deemed appropriate, related to the subject of the observations, in correspondence with the other variables/measurements, and sensible given the nature of the original data.
- Least Missing Values: the prepared dataset that has the smallest number of missing values among all variables in the dataset. False values, student-defined recodings, and imputed values are not allowed.
- Largest Number of Consistent Stores: the prepared dataset that has the most stores that appear in every week of the year 2001, along with the most relevant variables for those observations.
- Overall Best Prepared Dataset: the prepared dataset that utilizes all of the Instructor’s provided datasets as well as other datasets, and that is the most data-rich. This dataset will look clean, be easy to query, and contain the most useful information related to college students’ spending.
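As a rough illustration of how the Least Missing Values and Largest Number of Consistent Stores criteria might be checked, the following SAS sketch counts missing values and lists the stores that appear in every week. The dataset name work.prepared and the variables store_id and week are hypothetical placeholders, not names taken from the Instructor's datasets. The sketch also assumes the data has already been subset to 2001 and that every week of 2001 appears at least once in the data.

```sas
/* Hypothetical names throughout: work.prepared, store_id, week. */

/* Count non-missing (N) and missing (NMISS) values for each numeric variable. */
proc means data=work.prepared n nmiss;
  var _numeric_;
run;

/* Group character values into Missing / Not missing so they can be counted. */
proc format;
  value $missfmt ' ' = 'Missing' other = 'Not missing';
run;

proc freq data=work.prepared;
  format _character_ $missfmt.;
  tables _character_ / missing nocum nopercent;
run;

/* Keep only stores whose number of distinct weeks equals the number of
   distinct weeks in the whole dataset (assumed here to cover all of 2001). */
proc sql;
  create table work.consistent_stores as
  select store_id, count(distinct week) as n_weeks
  from work.prepared
  group by store_id
  having count(distinct week) =
         (select count(distinct week) from work.prepared);
quit;
```

The resulting table work.consistent_stores holds one row per consistent store. Counting its rows gives one possible figure to report for the consistent-stores category.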
For each category, there will be two winners (one undergraduate and one graduate student), for a total of 8 winners. You may select only one of these categories to compete in. The final project has a maximum total of 50 points, and all students in the class have the ability to earn all 50 points. This means that each student could earn (50/50) · 20 = 20 points toward the final grade. The 8 students who win the competition’s categories will receive 5 bonus points, earning them (55/50) · 20 = 22 points toward the final grade.
Each student’s final report should be 3 to 5 pages long, double-spaced. You must write in complete sentences and pay attention to grammar, spelling, and readability. If you include a table or chart, make sure you say something about it. Do not place charts or figures in your report without discussing them. You need to provide a thorough description and narration of the steps you took, the decisions you made during the data management process, an explanation of your results, and a concise conclusion.
Each student will submit the final report (.pdf) and SAS program (.sas) in Compass by 11:50 PM on August 4, 2018. The final project will count for 20% of the final grade. Late submissions will not be accepted. The final report should include at least the following:
1. Title of the project
2. Category of the data competition you want to be considered for
3. Introduction section
- Background information and description of the Instructor-provided datasets and any other datasets you used
- Explanation of why you are considering the category for the competition
4. Methods section
- Description/explanation of the guidelines used to validate the data
- Description/explanation of the issues that needed cleaning and how you cleaned them
- Description/explanation of additional data preparation that you performed (e.g. merging, joining, subsetting)
5. Results section
- Charts and tables pertaining to validation and cleaning
- Summary of the results, noting important information from the charts and tables
6. Conclusion section
- Persuade the audience that your prepared dataset is going to win the category of the competition
Final Project Grading Rubric
The grading criteria for the final project are listed below. The maximum total for the final project is 50 points.
• SAS program (.sas)
– 10 points: the efficacy of the whole SAS program
– 10 points: the data management aspect of the code
– 5 points: the organization and logical flow of the code
• Summary Report (.pdf)
– 10 points: the data preparation techniques
– 5 points: the output and results in the report
– 5 points: the value of the information in the report and how that information aligns with the data management process
– 5 points: the organization and readability of the report