✅ Put your name here.
¶
Homework Assignment #1 (Individual)¶
Git Practice, Debugging Practice, Modeling World Population, and Writing Python Scripts¶
Goals for this homework assignment¶
By the end of this assignment, you should be able to:
• Use Git to create a repository, track changes to the files within the repository, and push those changes to a remote repository.
• Debug Python code.
• Get more experience with Pandas using excel spread sheets.
• Make plots of your model using matplotlib.
• Practice writing and executing a Python script.
Assignment instructions¶
Work through the following assignment, making sure to follow all of the directions and answer all of the questions.
There are 25 points possible on this assignment. Point values for each part are included in the section headers.
This assignment is due roughly two weeks from now at 11:59 pm on Friday, January 24. It should be uploaded into the “Homework Assignments” submission folder for Homework #1 in your D2L webpage. Submission instructions can be found at the end of the notebook.
Part 1: Setting up a git repository to track your progress on your assignment (5 points)¶
For this assignment, you’re going to add it to the cmse202-s20-turnin repository you created in class so that you can track your progress on the assignment and preserve the final version that you turn in. In order to do this you need to:
1. Navigate to your cmse202-s20-turnin repository and create a new directory called hw-01.
2. Move this notebook into that new directory in your repository, then add it and commit it to your repository.
3. Finally, to test that everything is working, “git push” the file so that it ends up in your GitHub repository.
Important: Make sure you’ve added your PTA as a collaborators\ to your “turnin” respository with “Read” access so that we can see your assignment.
• Section 001: nchamusah14
• Section 002: nchamusah14
• Section 003: genster6
Also important: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the noteobok, none of your changes will be tracked.
If everything went as intended, the file should now show up on your GitHub account in the “cmse202-s20-turnin” repository inside the hw-01 directory that you just created. Periodically, you’ll be asked to commit your changes to the repository and push them to the remote GitHub location. Of course, you can always commit your changes more often than that, if you wish. It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.
Part 2: Debugging code: understanding error messages (6 points)¶
For the next part of this assignment, you’re going to presented with a variety of code bugs and your job is to fix them.
Fixing errors¶
✅ To Do
For each piece of code, rewrite it in the cell below and fix the error so that the code runs. If it is a runtime error, fix it with a message to avoid running the incorrect code. In the code you write, provide a comment as to what was wrong.
In [ ]:
for i in range(10)
print(“The value of i is {}”.format(i))
In [ ]:
# your answer here
In [ ]:
cnt = 10
the_sum = 0
for i in range(cnt):
if i%2 == 0:
the_sum += i
print(“The sum of the even number from {} to {} is {}”.format(0,cnt,the_sum))
In [ ]:
# your answer here
In [ ]:
def compute_fraction(numerator, denominator):
fraction = numerator/denominator
print(“The value of the fraction is {}”.format(fraction))
compute_fraction(25,5)
compute_fraction(5,25)
compute_fraction(5, 0)
In [ ]:
# your answer here
In [ ]:
import numpy as np
import random
cnt = 10
n = np.arange(cnt)
for i in range(cnt):
n[i] = random.randint(100,1000)
print(i)
print(“The value of the 10th element is {}”.format(n[cnt]))
In [ ]:
# your answer here
In [ ]:
cnt = 100
even = [ i for i in range(cnt) if i%2 ]
print(“The largest even number in the range {} to {} is {}”.format(0,cnt,max(even)))
In [ ]:
# your answer here
In [ ]:
spanish = {
‘hello’:’hola’,
‘yes’:’si’
‘one’:’uno’
}
print(“‘Hello’ in spanish is {}”.format(spanish[‘Hello’]))
In [ ]:
# your answer here
STOP¶
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Finished Part 2”, and push the changes to GitHub.
Part 3: Reading in the Migrant Flow Data (7 points)¶
Before we go too far, let’s pull in all the imports we need. If we have to, we can come back here and do it again in one cell.
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
OK, let’s get started. One of the things you have to do as a Data Scientist is work with what you are given. The data below is not perfect but we are going to work our way through to make it work for us.
Take a look a the web page below:
“Visualizing Migrant Flows”
It is a very well thought out way to visualize the movement of migrants to and from the various countries of the world from 1990 to 2010. Some of the movement may surprise you. The data was collected originally for a paper in 2014 titled “Quantifying Global International Migration Flows”. The database S2 in the link is the source of the data, an excel spread sheet. That spread sheet is in the D2L location for this homework and called migrant_flows.xlsx.
You read the sheet in the following way (in Excel):
• you select a sheet at the bottom representing a 5 year period
• the rows indicate the origin/source country
• the columns represent the destination country
Thus the totals at the end of a row represent how many people left the country during that 5 year span and the column totals represent how many people entered the country during that 5 year span. Any particular cell indicates: row: leaving , col: entering. For example, on the sheet tab 1990-95, Australia (row) had 67 people leave for Aruba (col).
We are going to work on gathering and manipulating some of that data using pandas. We are going to stretch our understanding a bit about what we can do with pandas to deal with data that doesn’t quite fit the norm.
We are going to go over some issues this particular spreadsheet has that you may not have seen before. Having gotten through those issues, we can do some analysis and potentially some model building.
Important Note:¶
You cannot modify the original spreadsheet provided. We will test your code using the original spread sheet!
3.1 Pandas and excel spread sheets (1 point)¶
Let’s start easy with some simple example spread sheets. Pandas has a function read_excel (“manual page” ) that has a lot of details. Let’s try to break it down a bit.
There is a spreadsheet in D2L called sheet1.xlsx. You can read in this simple sheet using read_excel as you would read_csv. There is only 1 sheet (tab at the bottom when you open the sheet in Excel) named ClassGrades. If there is only one sheet, read_excel reads in that sheet
✅ To Do
Read in that data sheet and print out the names of every person whose Grade is strictly less than a 3.0. Write that code in the cell below. Do as you wish but two possible approaches are:
• iterate through the rows of the data frame, see iterrrows
• print the Names of people whose Grades are less than 3
In [ ]:
# your code here
STOP¶
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Finished Part 3.1”, and push the changes to GitHub.
3.2 Multiple Sheets (1 point)¶
Take a look at sheet2.xlsx in the D2L directory. It has two sheets (tabs at the bottom when you look at it with Excel) labeled J_names and M_names. If you just load the sheet as before, you only get the first sheet (in this case the J_names). However, there is an optional argument called sheet_names (note the underline).
• if you do not provide this argument, you get only the first sheet
• if you provide the actual sheet name as an argument, you get that sheet as a data frame
• if you provide an integer, you can the sheet the integer represents (first sheet is 0, left to right numbering)
• if you provide as an argument to sheet_name the value None, you get a dictionary back where the key is the sheet name and the value is the data frame associated with the sheet name
✅ To Do
Write the code below that looks at all sheets and reports all Names whose Grade is less than 3.
In [ ]:
# your code here
Again, there are choices here and you are free to do as you like but some possibilities include:
• iterate through the sheets, then the rows in the data frame of each sheet
• again, iterate through the sheets, subset each data frame and collect the results
It is useful to note that the result of an expression that produces a panda data series (play with this, use the python command type to examine return values) can be coerced to a list data type with the list command.
STOP¶
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Finished Part 3.2”, and push the changes to GitHub.
3.3 Dealing with weird headers (1 point)¶
Pandas really likes it if it has headers for the rows and columns of the sheet you read in. Unfortunately, sometimes people decorate a spreadsheet with all kinds of helpful information for people which basically prevents you from getting the right headers setup in pandas. Take a look at sheet3.xlsx. It is a single sheet but I put in some header stuff which gets in pandas way. Nonethless, let’s try reading it in directly and see what we get.
In [ ]:
df3 = pd.read_excel(“sheet3.xlsx”)
df3
That’s pretty wrong I would say. Weird headers, lot of NaN, etc. But we can change what we read in with two other parameters of read_excel
• skiprows takes an integer list of rows (starting with the first row as 0) to exclude from the data frame
• usecols takes a bit of the opposited approach. It has a few more options, but one of them is a list of integers indicating the columns to include
By using these two arguments we can get back to making a sane data frame.
In [ ]:
df = pd.read_excel(‘sheet3.xlsx’, skiprows=[0], usecols=[1,2,3,4])
df
We still have a bit of a problem. Because of our manipulation (or because someone just forgot) a label might be missing. Here, the label of the 4th column is missing so the system names it Unnamed: 4. Pandas will name anything it thinks should have a label in this way. You can rename anything you like, and that might things better for running the system. Note als0 that the last value in Names was missing and, since pandas assumes that should be a value, it sets it to NaN, meaning “Not a Number”. Accurate but not particularly informative.
You can rename a column with the .rename method. It takes a dictionary of “old_name”:”new_name” pairs for the columns. The call looks like
df.rename(columns={“old1″:”new1”, “old2″:”new2”}, inplace=True)
Works for as many columns as you wish to rename. Normally .rename returns a new data frame but inplace=True changes the called data frame.
For a data frame with an index, you can also use index in place of columns above and change a value in the index list, as in
df.rename(index={“old1″:”new1”, “old2″:”new2”}, inplace=True).
If the value is listed as NaN, then that is actually numpy.nan as a value.
✅ To Do
Read in the same sheet3.xlsx using the skiprows and usecols as shown above and rename the last column to be “Pass/NonPass”. Show the sheet.
In [ ]:
# your code here
STOP¶
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Finished Part 3.3”, and push the changes to GitHub.
3.4 The real work (4 points)¶
We are ready to read in our spread sheet. Here are some things to remind you how pandas works:
• the .set_index method allows you to establish a particular column as an index for the sheet such that the row values can be used as indicies. That is very helpful.
▪ note that .set_index typically does not update the data frame itself but returns a new one. However, you can provide an argument called inplace=True which will in fact update the called frame. That is important to remember
• for a particular cell, you can index a cell with the .loc method, using a list of two labels: df.loc[“row_label”,”col_label”]. You can access or set a value for a cell this way.
▪ for this to work you have to have set the index for a particular set of row labels and pandas has to recognize there is a label for each column (you have to have set it up right)
✅ To Do
Write a function called read_migrant_flow that reads in the file migrant_flow.xlsx and returns a dictionary of sheets to data frames. In detail the function should:
• Read in the spreadsheet migrant_flow.xlsx
▪ Read in all the sheets as a map called sheet_map
▪ As numbered in Excel for the spreadsheet (not necessarily how you number it in pandas), skip rows 1 and 3, skip columns A and C.
▪ for each sheet:
◦ rename the column Unnamed: 1 to Origin
◦ set the index of the sheet to Origin
◦ rename the column Unnamed: 199 to Outof
◦ rename the last cell of Origin (presently NaN) to Into
In [ ]:
# your code here
In [ ]:
# test yourself, should come back all `True`
sheet_map = read_migrant_flow()
first_df = sheet_map[“1990-95”]
print(first_df.loc[“Greece”,”France”] == 166)
STOP¶
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Finished Part 3.4”, and push the changes to GitHub.
4 Doing some work with the Migrant Flow Data (7 points)¶
OK, we have the data, now let’s do some work with what we have gathered. You have some experience with pandas but here are some things you might want to check out that might be of some help:
• .iterrows() as a way to iterate through a row
• .max() and .idxmax() (as well as the associated mins) to fine the largest value in a pandas Series (a row) and the id associated with that max
• .sort_values() allows you to sort the elements of a Series
• .bar.plot() for plotting a Series
We’ll add more as we think of them.
✅ To Do
4.1 Top destinations (2 points)¶
Write a function called top_dests that takes a single argument, a data frame. It should plot as a bar chart the top 10 destination countries and the number of immigrants that came.
In [ ]:
# your answer here
In [ ]:
# should show a nice bar chart
sheet_map = read_migrant_flow()
df = sheet_map[“2000-05”]
top_dests(df)
STOP¶
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Finished Part 4.1”, and push the changes to GitHub.
4.2 Top Source Countries (2 points)¶
✅ To Do
Write a function called top_leaving that takes a single argument, a data frame. It should plot as a bar chart the top 10 source countries (countries with the largest number of people leaving) and the immigrant count.
In [ ]:
# should show a nice bar chart
sheet_map = read_migrant_flow()
df = sheet_map[“2000-05”]
top_leaving(df)
STOP¶
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Finished Part 4.2”, and push the changes to GitHub.
4.3 Biggest Flow (3 points)¶
✅ To Do
Write a function called biggest_flow that takes a single argument, a data frame. It should return a string that indicates what the largest flow was between two countries. Include the source country, the destination country and the number of people.
In [ ]:
# your answer here
In [ ]:
df = sheet_map[“1990-95”]
print(biggest_flow(df))
STOP¶
Take a moment to save your notebook, commit the changes to your Git repository using the commit message “Finished Part 4.3”, and push the changes to GitHub.
Assignment wrap-up¶
Please fill out the form that appears when you run the code below. You must completely fill this out in order to receive credit for the assignment!
In [ ]:
from IPython.display import HTML
HTML(
“””
“””
)
Congratulations, you’re done!¶
Submit this assignment by uploading it to the course Desire2Learn web page. Go to the “Homework Assignments” folder, find the dropbox link for Homework #1, and upload it there.
© Copyright 2018, Michigan State University Board of Trustees