MET CS521, Boston University, Summer 2020 Prof. Alan Burstein
Outcomes
Final Project: Data Analysis with Python
Due: August 11, 2020, 11:59 EST
This project intends to bring together many of the skills we have talked (and will talk) about in the course. You will get a taste of modern Data Science using Python. Although the analysis I expect from you is not meant to be novel, we will use a modern technology stack and state-of-the-art methodology, and practice creating and presenting analysis.
Methods and Technology Stack
We will be using Jupyter and Git/GitHub for working on the project.
We use Jupyter because it allows us to continue to develop code and visualize results without needing to rerun lengthy file processing, visualizations, and data analysis. We will be making scripts, not programs, because the results are initially meant to be human-readable. There is no auto-grading or testing of your code. It will be evaluated based on the results you generate and the code itself.
We will use GitHub to “collaborate” with me. Periodically, I will look at your GitHub to make sure you are on track with the project and provide some feedback.
Assignment
For the project, you will solve a problem using data. You will:
• Programmatically download a data-set for analysis. The data is available here: GitHub
• Load the data-set into Python using a data structure (Data Frame).
• Understand and interpret the data-set.
• Visualize aspects of the data-set.
• Use Machine Learning techniques to solve the problem and perform the analysis.
• Customize your project through extension tasks.
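As a rough illustration of the first step, a download script could look something like the sketch below. It uses only the Python standard library. The function name `download_file` and the example URL in the comment are placeholders chosen for this sketch, not the actual location of the course data-set.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: str) -> Path:
    """Download `url` to `dest`, creating parent folders as needed."""
    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so a large file is never held fully in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


# Example call (the URL is a placeholder, not the course data-set):
# download_file("https://example.com/data.csv", "data/data.csv")
```

Keeping the download in its own function makes it easy to rerun only when the local file is missing, which fits the Jupyter workflow described above.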
Options
For the final project, you must choose 1 of 3 options. The first two are more structured and meant for the majority of students. The last option is an open project meant for those of you with more programming experience or a passion for a certain data-oriented problem (it will be a bit more work).
The options are ordered by the expected difficulty:
• Predicting Housing Prices (Linear Regression)
• Detecting Fake News (Support Vector Machine)
• Open
I will spend roughly 2 hours (split up) on various parts of the project. This is meant to be a challenging project with no provided starter code. I expect you to use Google, Pandas, and SKLearn documentation to solidify your understanding of the tools we are using in this project. I will introduce many of them, but will not spend much time explaining why they work and their inner workings.
What I Will Show In Class:
• How to programmatically download a file
• Reading a csv with Pandas
• Looking at a dataframe
• Filtering columns
• Making a scatterplot
• Splitting data into Train and Test Data
• SKLearn Linear Regression with dummy data
• SKLearn SVM with dummy data
• Suggested Columns
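To give a feel for the Pandas items in this list, here is a minimal sketch of reading a csv, looking at the dataframe, filtering, plotting, and splitting. The rows and column names (`sqft`, `bedrooms`, `city`, `price`) are made up for illustration and are not the project data-set or the suggested columns.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend for plain scripts; not needed in Jupyter
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for a downloaded csv file.
csv_text = """sqft,bedrooms,city,price
850,2,Boston,300000
1200,3,Boston,420000
1500,3,Quincy,380000
1700,3,Quincy,450000
2000,4,Quincy,510000
2400,4,Boston,690000
2700,4,Newton,790000
3000,5,Newton,880000
"""

# For a real file, pass its path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows
print(df.columns)     # column names
print(df.describe())  # summary statistics for the numeric columns

# Keep only some columns, then only the rows matching a condition.
numeric = df[["sqft", "bedrooms", "price"]]
large = numeric[numeric["sqft"] > 1000]

# Scatter plot of one feature against the value we want to predict.
ax = df.plot.scatter(x="sqft", y="price")

# Hold out a quarter of the rows for testing.
train_df, test_df = train_test_split(numeric, test_size=0.25, random_state=0)
print(len(train_df), "training rows,", len(test_df), "testing rows")
```

Fixing `random_state` keeps the split the same on every run, which makes results easier to reproduce and discuss in the write-up.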
Rubric
Points  Description
5       A separate script to download a data file
5       Loading the data-set into Python using Pandas
15      3 good visualizations of the data-set
10      Splitting data into training and testing sets
25      A working Machine Learning model that trains on the training data and is tested on the testing data
25      At least 2 extension tasks completed (or your own extensions of comparable difficulty)
5       Documentation of how to run your scripts
20      A (≤5 pages) write-up discussion
Up to 10 extra credit points may be awarded to students who submit remarkable work, judged along the following 3 parameters:
• Program quality (Professionally written code)
• Program functionality (Additional or creative extensions)
• Written work quality (Detailed report)
Discussion Writeup
You are expected to include a discussion of no more than 5 pages that explains the purpose of your work.
The discussion should include but is not limited to:
• A description of the problem you tried to solve
• The files you included and what they do
• Methods and Techniques you used along with “brief” explanations of how they should work
• What extensions you implemented and how they make your project unique and valuable
• Conclusions of the project, including what you’ve learnt
Submission Deliverables
2 points are deducted for failing to meet a deliverable.
July 15th: GitHub Project Exists – must include a script to download the data-set.
• If you are choosing the open-ended project, you must include a brief explanation and justification of your data-set and the problem you intend to solve/explore.
July 29th: Code that loads data into a dataframe and some exploration
August 11th: Final completed project with:
• Script to download the data-set
• Script to generate analysis
• README on how to reproduce results
• PDF write-up
Option 1: House Price Predictor
Overarching goal: Create a service to predict housing prices
House Price Prediction
• Get and read a large data set
• Get an understanding of what is going on
• Process and wrangle the data
• Build a simple regression model using SKLearn
• Improve the model and do data analysis
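The modelling step above might be sketched as follows, using synthetic data rather than the housing data-set. The "true" relationship (150 dollars per square foot plus 50,000) and the noise level are arbitrary assumptions made only so the example has something to fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic dummy data: price grows linearly with size, plus random noise.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3500, size=200)
price = 150 * sqft + 50_000 + rng.normal(0, 20_000, size=200)

X = sqft.reshape(-1, 1)  # SKLearn expects a 2-D array of features
y = price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the training data only, then evaluate on the held-out test data.
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2 on test data:", r2_score(y_test, predictions))
print("mean absolute error:", mean_absolute_error(y_test, predictions))
```

With real housing data the fit will be much less clean, so comparing the test-set error against a simple baseline (such as always predicting the mean price) is a reasonable sanity check.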
Things to Research
• Making a network request for data
• Pandas, using a dataframe, plotting
• SKLearn and linear regression model
Extension Suggestions
• Implement a linear regression class yourself
• Visualize and evaluate the data in new ways (correlation)
• Use more variables (intelligently)
• Compare different regression models
• Create a simple application that can let people enter information and get an estimate of their house price (like Zillow)
• Visually and descriptively compare your results to Zillow or other sites
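For the first extension suggestion, one possible starting point is an ordinary least squares class built on NumPy. This is only a sketch of one approach (solving the least-squares problem directly), and the class name `SimpleLinearRegression` is made up here. The attribute names `coef_` and `intercept_` mirror SKLearn's convention so results are easy to compare.

```python
import numpy as np


class SimpleLinearRegression:
    """Ordinary least squares, fitted with NumPy's least-squares solver."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so the first weight acts as the intercept.
        design = np.column_stack([np.ones(len(X)), X])
        weights, *_ = np.linalg.lstsq(design, y, rcond=None)
        self.intercept_ = weights[0]
        self.coef_ = weights[1:]
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return X @ self.coef_ + self.intercept_


# Tiny check on points that lie exactly on the line y = 2x + 3.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([5.0, 7.0, 9.0, 11.0])
model = SimpleLinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
print(model.predict([[5.0]]))
```

A gradient descent version would be another valid way to do this extension; comparing either against `sklearn.linear_model.LinearRegression` on the same data is a simple way to confirm the implementation is correct.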
Things I’ll show you in class
• Reading a csv with Pandas
• Looking at data (Seeing columns, filtering, making a scatter plot, etc.)
• A couple of suggested columns
• Splitting data into Train and Test Data
• SKLearn simple model with dummy data
Option 2: Fake News Detector
Overarching goal: Create a fake news detection algorithm to help fight the spread of misinformation before the 2020 election
Detect Fake News
• Get and read a large data set
• Process and wrangle the data
• Gain an understanding of what is going on
• Vectorize content
• Use a Support Vector Machine to detect fake news
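The last two steps above (vectorizing content, then training a Support Vector Machine) could look roughly like this. The headlines and labels are invented dummy data, and TF-IDF is just one common way to vectorize text, assumed here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Invented dummy headlines; the real project would use the article text column.
train_texts = [
    "shocking miracle cure doctors hate",
    "secret miracle trick shocking results",
    "aliens secret shocking cover up exposed",
    "shocking secret cure exposed by insiders",
    "city council approves annual budget report",
    "council publishes budget report for schools",
    "state senate passes transportation budget",
    "annual report shows transportation funding",
]
train_labels = ["fake", "fake", "fake", "fake", "real", "real", "real", "real"]

# Turn each text into a vector of TF-IDF word weights.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Fit a Support Vector Machine with a linear kernel on those vectors.
classifier = SVC(kernel="linear")
classifier.fit(X_train, train_labels)

# New text must go through the same fitted vectorizer (transform, not fit).
new_texts = ["shocking secret budget exposed"]
X_new = vectorizer.transform(new_texts)
print(classifier.predict(X_new))
```

The key detail is calling `fit_transform` on the training text only and `transform` on anything else, so the test data never influences the vocabulary.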
Things to Research
• Making a network request for data
• Pandas, using a dataframe, plotting
• SKLearn and classifier model
Extension Suggestions
• Compare other classifier models
• Use the article title
• Create a simple application to upload and test articles
• Using your classifier, analyze and compare several news sites
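For the model comparison extension, a sketch along these lines may help. It cross-validates three SKLearn classifiers on the same TF-IDF features. The choice of these three models and the tiny invented corpus are assumptions for illustration, and scores on data this small mean very little.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented dummy corpus; replace with the real articles and labels.
texts = [
    "shocking miracle cure doctors hate",
    "secret miracle trick shocking results",
    "aliens secret shocking cover up exposed",
    "shocking secret cure exposed by insiders",
    "city council approves annual budget report",
    "council publishes budget report for schools",
    "state senate passes transportation budget",
    "annual report shows transportation funding",
]
labels = ["fake", "fake", "fake", "fake", "real", "real", "real", "real"]

candidates = {
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Logistic regression": LogisticRegression(),
    "Naive Bayes": MultinomialNB(),
}

results = {}
for name, classifier in candidates.items():
    # The pipeline refits the vectorizer inside each fold, avoiding leakage.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    scores = cross_val_score(pipeline, texts, labels, cv=2)
    results[name] = scores.mean()

for name, score in results.items():
    print(f"{name}: mean accuracy {score:.2f}")
```

On the full data-set, a larger `cv` value (5 is a common choice) and additional metrics such as precision and recall would give a more trustworthy comparison.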
Things I’ll show you in class:
• Reading a csv with Pandas
• Looking at data (Seeing columns, filtering, making a scatter plot, etc.)
• A couple of suggested columns
• Splitting data into Train and Test Data
• SKLearn simple model with dummy data
Option 3: Unguided Final Project
Things You Must Do
• Find a large (>1000 rows) data-set online
• Programmatically download the data-set
• Load the data into a Pandas Dataframe
• If needed, filter the data to remove unwanted rows and columns
• Use Pandas or Matplotlib to create at least 3 interesting visualizations of meaningful data
• Use Regression or Machine Learning through SKLearn to perform meaningful analysis on the data
• Extend your program to do something advanced
– Allow user input to interact with your model (predict results)
– Compare multiple (at least 3) different methods for doing your data analysis
– Answer real-world questions that pertain to your data (extend the analysis)