STA 402/502 PROJECT DESCRIPTION
The goal of the project is to draw data from complicated sources to prepare a new data set that is ready for a statistical analysis:
- Project activities consist of obtaining, cleaning, extracting, transforming, and combining data sets.
- Projects are programming intensive and use computational tools discussed in class, found in the text, or included in SAS.
- Projects include only light summarization or exploratory analysis to show that project code correctly prepares the data.
All projects involve continuing guidance from the instructor:
- Students are expected to identify data sources for their projects and have those sources approved by the instructor before beginning.
- Students are expected to consult with the instructor on refocusing the project scope as needed.
Projects are intentionally open-ended because time in the semester is limited and it’s not always possible to predict how easy or difficult a data management project will turn out to be. As project work develops during the semester, the project focus may need to shift (as approved or suggested by the instructor) to keep the project scope appropriate for the course level.
The end product is working code and a brief written report, to be evaluated using the criteria in this document. Examples of previous STA 402/502 project topics and some widely used data websites are given at the end of this document.
Project work is graded at three stages:
- progress update,
- preliminary review,
- final code and report.
Restrictions
Projects cannot be based on exercises from any textbook.
Projects cannot be data analyses consisting mainly of calling statistical procedures, with minimal programming to prepare and manage the data.
Data must be in a form that can be input from the original source, which might include downloadable files, by a sequence of programming instructions. In particular, you may not input data by hand, manipulate data with a text editor or spreadsheet to prepare it for input, or copy and paste data from web pages, word-processor documents, or PDFs.
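As an illustration of input by programming instructions, the following minimal SAS sketch reads a comma-delimited file straight from its source. The URL, the data set name, and the three variables (region, sale_date, amount) are hypothetical placeholders chosen for illustration; substitute the details of your approved data source.

```sas
/* Sketch only: the URL and the variable layout below are hypothetical. */
filename raw url "http://www.example.com/data/sales.csv";

data work.sales;
    /* DSD treats commas as delimiters and handles quoted values.    */
    /* FIRSTOBS=2 skips the header row; TRUNCOVER guards short lines. */
    infile raw dsd firstobs=2 truncover;
    length region $ 20;
    input region $ sale_date :yymmdd10. amount;
    format sale_date date9.;
run;

/* Light check that the input step worked as intended. */
proc print data=work.sales (obs=5);
run;
```

Because every step is in code, the instructor can rerun the program from the original source and obtain the same data set.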
Data must be available for approval at the beginning of the project. Projects cannot depend on data to be collected by means of experiments or surveys conducted during the semester.
Main Requirements
Four project phases
See the course syllabus for the timeline and due dates. See the grading rubric later in this document.
To start your project, please email the instructor a one-page Word or RTF document describing the data source you wish to use, the nature of the data, and how the data are made available. Your description of how to find the data (e.g., web addresses) should be clear enough that the instructor can judge whether the data are suitable for the project.
Progress update (25 points). Please prepare 2-4 written pages indicating what you have accomplished so far and your questions on how to proceed. Attach graphs, tables and code as appendices – some of these will be preliminary and may be the basis for your questions to the instructor. Please include enough detail for the instructor to understand what you’ve done and provide comments on what’s needed to complete the project.
SCORING: Code, 15 points. Progress report, 10 points.
Preliminary review (50 points). Please turn in your code and a rough draft of the final written report for instructor comment and guidance. By this stage you are expected to have completed: (a) the main aspects of your project coding effort, (b) the introduction and data description sections of the report, (c) an exploratory data analysis [tables, graphs] with your impressions on the findings so far. It’s okay to have some computational work (e.g., implementation of macros) left to do at this stage.
SCORING: Code, 30 points. Report rough draft, 20 points.
Final code and report (50 points). Please turn in a polished document with the following structure.
Main body of text. Four to six single-spaced pages of written exposition in paragraph form, plus up to four pages of tables and graphs. Write the report as if the target audience consists of STA 402/502 students. Organize the report body into sections in this order:
Section title | Purpose of section
Introduction | What research questions might motivate the use of the chosen data and where are the data from?
Description of Data | What structural aspects of the data present preparation challenges to be overcome?
Strategy Employed | How are the challenges addressed by the coding solution employed in the project?
Results | What features in the prepared data are illustrated by the graphs or tables?
Discussion | What answers to the research questions are suggested by the graphs and tables?
Graphs and tables should be carefully selected and constructed to provide context or illumination of the exposition in the main body.
Code. All code needed for data input, data preparation, tables and graphs. The code must be fully documented and structured for readability, with macros used where appropriate. Indicate in comments which code segments produce which graphs and tables, as numbered in the main text.
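One possible way to meet these documentation requirements is sketched below. The macro name %plot_hist and its parameters are hypothetical, and SASHELP.CLASS is used only as a stand-in for a prepared project data set. The comments tie each code segment to a numbered figure in the report.

```sas
/*----------------------------------------------------------------*/
/* Macro PLOT_HIST (hypothetical name): histogram of one variable  */
/*   ds=     input data set                                        */
/*   var=    numeric variable to plot                              */
/*   fignum= figure number used in the report text                 */
/*----------------------------------------------------------------*/
%macro plot_hist(ds=, var=, fignum=);
    title "Figure &fignum.. Distribution of &var in &ds";
    proc sgplot data=&ds;
        histogram &var;
    run;
    title;  /* clear the title so it does not carry over */
%mend plot_hist;

/* Produces Figure 1 in the main text. */
%plot_hist(ds=sashelp.class, var=height, fignum=1)

/* Produces Figure 2 in the main text. */
%plot_hist(ds=sashelp.class, var=weight, fignum=2)
```

Writing the plot step once as a macro avoids repeating nearly identical code for each figure, which is the kind of flexibility the grading rubric rewards.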
SCORING: Report with graphs and tables, 25 points. Code presentation and flexibility, 15 points. Project aptness and execution, 5 points. Responsiveness to instructor guidance, 5 points.
Grading Rubric for Quality of Code, Written Report, and Effort (STA 402/502 Projects)
Report Attributes and Criteria | Excellent | Good | In progress | A Start
Computer code: Is the code complete? How easy is the code to understand? Is documentation appropriate? Is repetitive and unnecessary code avoided? | Elegant and efficient code. Thorough and accurate comments. Macros thoughtfully implemented. No unnecessary code. | Code easy to read. Good commenting, formatting, and object names. Macros provide flexibility and avoid redundant calculation. | Code sometimes difficult to read. Comments sparse or formatting awkward. Missed opportunities for macros. | Code is haphazard. Few or no comments. Difficult to understand what is being attempted.
Introduction: How clearly are the data and its possible questions introduced? | Concise and clear; generates reader interest. | Easy to follow; main problem or question clearly delineated. | Too verbose, sketchy or wandering. | Unclear or difficult to follow.
Description of data: Are all relevant features of the data and complexity clearly described? | Well-organized and concise; provides good intuition. | Understandable and complete. | Somewhat confusing, or sketchy on key aspects. | Unclear, difficult to follow, or not specific.
Strategy Employed: Is the approach clearly presented? Is it suited to the structure and questions? | Others can implement a solution based on this description. | Clear and easy to follow. | Presentation sketchy; or approach unsound as described. | Unclear how method or strategy addresses question or problem.
Results: Are the displays reasonably addressed? | Clearly addressed; strong presentation. | Mostly addressed; good presentation. | Partially addressed; awkward presentation. | Not clearly related to question or problem.
Discussion: Is the interpretation of outcomes clear and relevant? | Outcomes tied to valid insights. | Summarizes outcomes with clear ties to main research questions. | Summarizes outcomes, but no connection to research question. | Unclear or irrelevant.
Graphs and tables: Are displays well-chosen and annotated? | Creative displays that yield strong insights. | Well-chosen and annotated displays. | Some displays or annotations awkward. | Unsuitable displays and annotations.
Project Aptness and Execution: Are the problem and solution appropriate in complexity and scope? | Challenging problem, addressed robustly and with professionalism. | Appropriate problem, approach and effort. | Decent problem, but some important gaps in approach or effort. | Lightweight problem, unsuitable approach, or weak effort.
Responsiveness to instructor feedback: Were the instructor’s comments and suggestions on coding and draft report suitably addressed by the end? | All comments and suggestions fully addressed or resolved. | Major comments and suggestions addressed or resolved. | Some major comments and suggestions not adequately addressed. | Minor, but no major, suggestions and comments addressed.
Some Projects from Previous STA 402/502 Students
Data from these areas have been used for previous projects:
- Education level by nation and wealth
- Daily sales v. advertising
- Student retention v. course performance
- Employment earnings
- Study of visual working memory capacity
- Traffic fatality v. region
- Play-by-play defense against specific MLB pitchers
- Gold prices v. other markets
- Mononucleotide repeat sequences
- Youth risky behavior
- Consumer lending
- NBA player data (many sources; 600 variables)
- Course GPA by subject
- Insect populations, plant phenology, agriculture
- Stock trading and prices
- National park funding and visitation
- State employment rates
- Int’l coffee production, prices, & consumption
- Employment by sectors
- Inventory of a pet food supplier
- Tour de France
- Polymerase chain reaction
- Adolescent obesity
- College tuition, interest rates, and stock prices
- NFL team performance v. player performance
- UFO sightings
- NBA regular season v. playoff performance
- Psychometric study
- Altitude sickness
Projects can use easily downloadable data sets. Alternatively, projects might incorporate web scraping in SAS, using FILENAME URL or PROC HTTP to access web data directly and then extracting the data from the HTML source with character functions or with PRXPARSE and PRXMATCH.
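A minimal sketch of the web scraping approach follows. The URL, the data set name, and the regular expression (which captures the text of the first table cell on each line) are hypothetical; real pages usually need patterns tailored to their HTML. A FILENAME statement with the URL access method could be used in place of the PROC HTTP step.

```sas
/* Sketch only: the URL and the pattern below are hypothetical. */
filename resp temp;

/* Save the page's HTML source to a temporary file. */
proc http
    url="http://www.example.com/stats.html"
    method="GET"
    out=resp;
run;

data work.cells;
    /* Lines longer than 32,767 characters would be truncated. */
    infile resp lrecl=32767 truncover;
    input line $char32767.;
    length cell $ 200;
    retain re;
    /* Compile the pattern once, on the first iteration. */
    if _n_ = 1 then re = prxparse('/<td[^>]*>([^<]*)<\/td>/i');
    /* Keep only lines that match; capture the first match per line. */
    if prxmatch(re, line) then do;
        cell = prxposn(re, 1, line);
        output;
    end;
    keep cell;
run;
```

To capture every match on a line rather than only the first, the CALL PRXNEXT routine can be used in a loop.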
Where to find data sets?
Google a topic of interest and add “data sets”. Or here is a list of popular websites that provide a wide variety of data:
Data.gov | http://data.gov
National Climatic Data Center | http://www.ncdc.noaa.gov/data-access/quick-links
DBPedia | http://wiki.dbpedia.org
Enigma Public | http://www.enigma.com/blog/the-new-enigma-public
The SUNY Geneseo library website lists many more: http://libguides.geneseo.edu/c.php?g=67454&p=434785