CSCI 3022: Intro to Data Science – Spring 2019 Practicum 1
This practicum is due on Canvas by 11:59 PM on Monday, March 4. Your solutions to theoretical questions should be done in Markdown/MathJax directly below the associated question. Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.
Here are the rules:
1. All work, code and analysis, must be your own.
2. You may use your course notes, posted lecture slides, textbooks, in-class notebooks, and homework solutions as resources. You may also search online for answers to general knowledge questions like the form of a probability distribution function or how to perform a particular operation in Python/Pandas.
3. This is meant to be like a coding portion of your midterm exam. So, the instructional team will be much less helpful than we typically are with homework. For example, we will not check answers, help debug your code, and so on.
4. If something is left open-ended, it is because we want to see how you approach the kinds of problems you will encounter in the wild, where it will not always be clear what sort of tests/methods should be applied. Feel free to ask clarifying questions though.
5. You may NOT post to message boards or other online resources asking for help.
6. You may NOT copy-paste solutions from anywhere.
7. You may NOT collaborate with classmates or anyone else.
8. In short, your work must be your own. It really is that simple.
Violation of the above rules will result in an immediate academic sanction (at the very least, you will receive a 0 on this practicum or an F in the course, depending on severity), and a trip to the Honor Code Council.
By submitting this assignment, you agree to abide by the rules given above.
Name:
NOTES:
• You may not use late days on the practicums, nor can you drop your practicum grades.
• If you have a question for us, post it as a PRIVATE message on Piazza. If we decide that the question is appropriate for the entire class, then we will add it to a Practicum clarifications thread.
• Do NOT load or use any Python packages that are not available in Anaconda 3.6.
• Some problems with code may be autograded. If we provide a function API, do not change it. If we do not provide a function API, then you’re free to structure your code however you like.
• Submit only this Jupyter notebook to Canvas. Do not compress it using tar, rar, zip, etc.
• This should go without saying, but… For any question that asks you to calculate something, you must show all work to receive credit. Sparse or nonexistent work will receive sparse or nonexistent credit.
In [ ]:
from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
[30 points] Problem 1: The Game of Strife
Below, and at the link here, you will find the board for the Game of Strife, a simplified and slightly more depressing version of the Game of Life. Here are some rules:
• Players begin at START and may choose to begin the game by either going to college (moving to the right from START) or starting a career (moving downward from START). Players then move along the game board in order of increasing tile number.
▪ If a player begins by going to college, then they start the game with -\$20,000. That is indeed negative money, to account for student loan debt.
▪ If a player begins by starting a career, then they start the game with \$5,000.
• At the beginning of a player’s turn, they roll a fair 6-sided die, the outcome of which determines how many tiles they move forward.
• When a player reaches a red square (tiles 9, 17 or 30), they must stop at that square for the rest of their turn, even if their roll would otherwise have carried them past it.
▪ When a player stops on the CAREER tile after college (square 9), they are randomly assigned a career and salary from the possibilities: \$50,000, \$70,000, \$90,000, \$110,000, or \$130,000 (all with equal probability). The player’s actual career is irrelevant to the game, but please make something up so you are emotionally invested in the game.
▪ If a player starts a career at the beginning of the game, they are assigned a salary randomly from possibilities \$40,000, \$50,000, \$60,000, \$70,000, or \$80,000.
▪ When a player stops on the HOUSE tile, they put a down payment on a house. This down payment is drawn randomly from the set \$25,000, \$40,000, \$55,000 or \$70,000.
▪ When a player stops on the RETIRE tile, the player collects a pension equal to half their salary and then the game ends immediately.
• When a player lands on or passes a PAYDAY square, they earn money equal to their salary.
• When a player lands on a STRIFE square (1, 4, 7, 13, 18, 23 or 29 if they go to college, or 2, 7, 13, 18, 23 or 29 if they start a career immediately at the beginning of the game), they draw a STRIFE card. The STRIFE cards have the player earn \$5,000 or \$10,000, or lose \$1,000, \$2,000 or \$5,000.
• Players can have negative money, which corresponds to being in debt.
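The movement and forced-stop rules above can be sketched as a small helper. This is only an illustration, not a full simulation: the names `advance`, `roll_die` and `RED_TILES` are hypothetical, and the sketch assumes tile numbers increase by one per step along the player's path, which should be checked against the actual board.

```python
import random

# Red tiles from the rules above. Which of these lie on a given path
# depends on the board, so the set is passed in as a parameter.
RED_TILES = (9, 17, 30)

def roll_die(rng=random):
    """Roll a fair 6-sided die."""
    return rng.randint(1, 6)

def advance(position, roll, red_tiles=RED_TILES):
    """Return the tile a player ends on after moving `roll` steps.

    Assumes tiles are numbered consecutively along the path
    (an assumption of this sketch, not stated by the assignment).
    """
    target = position + roll
    for tile in sorted(red_tiles):
        # The first red tile ahead of the player, and no further
        # than the target, forces the player to stop there.
        if position < tile <= target:
            return tile
    return target
```

For example, `advance(7, 5)` returns 9 rather than 12, because the player must stop on the red tile they would otherwise pass.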

Part A: Write code to simulate an entire game of the Game of Strife (with only one player). You may not have two separate routines for simulating the game, or a turn, depending on whether a player goes to college or starts a career at the beginning; both possibilities should be accounted for within a single body of code.
Then run two ensembles of at least 10,000 games, one where the player starts by going to college, the other where the player starts a career immediately. Plot density histograms of the players’ ending distributions of money on the same set of axes. Be sure to label your axes, include a legend and make your histogram box faces slightly transparent, so both sets of data are visible.
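The plotting requirements (density scaling, shared axes, transparency, labels, legend) might look like the sketch below. The function name `plot_ending_money` and its arguments are hypothetical; the inputs are assumed to be two array-like collections of ending money, one per ensemble, produced by your own simulation.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ending_money(college_money, career_money, bins=40):
    """Overlay two density histograms of ending money on one set of axes."""
    college_money = np.asarray(college_money, dtype=float)
    career_money = np.asarray(career_money, dtype=float)
    # Use the same bin edges for both ensembles so the bars line up.
    edges = np.histogram_bin_edges(
        np.concatenate([college_money, career_money]), bins=bins
    )
    fig, ax = plt.subplots(figsize=(8, 5))
    # density=True normalizes each histogram to unit area;
    # alpha < 1 keeps both sets of bars visible where they overlap.
    ax.hist(college_money, bins=edges, density=True, alpha=0.5,
            edgecolor="white", label="College")
    ax.hist(career_money, bins=edges, density=True, alpha=0.5,
            edgecolor="white", label="Career")
    ax.set_xlabel("Money at retirement ($)")
    ax.set_ylabel("Density")
    ax.legend()
    return ax
```

Sharing bin edges is a design choice of this sketch; it makes the two histograms directly comparable bar by bar.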
In [ ]:
Part B: Use concepts from class to describe the two distributions of player cash at retirement, depending on whether they went to college or immediately started a career. How are the two distributions similar? How do they differ? Address characteristics like skew, modality, central tendency and spread. How could the rules of the Game of Strife account for these differences?
In [ ]:
Part C: Use your results from Part A to estimate the probability that a person would retire with at least \$300,000, if they went to college.
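One common way to estimate such a probability from a simulated ensemble is the fraction of games that meet the condition. The sketch below assumes the ending money for each game is stored in an array-like object; the function name `estimate_probability` is hypothetical.

```python
import numpy as np

def estimate_probability(ending_money, threshold=300_000):
    """Fraction of simulated games ending with at least `threshold` dollars."""
    ending_money = np.asarray(ending_money, dtype=float)
    # The mean of a boolean array is the proportion of True entries.
    return float(np.mean(ending_money >= threshold))
```

The estimate becomes more stable as the number of simulated games grows, which is one reason the ensembles are required to be large.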
In [ ]:
Part D: The United States Bureau of Labor Statistics has found that approximately 66.7% of students go to college. Suppose players of the Game of Strife choose to go to college at the beginning of the game with this probability of $P(\text{college}) = 0.667$.
Use your two ensembles of games from Part A to estimate the probability that an individual, whose college education status is unknown, will retire in the Game of Strife with at least \$300,000. State any relevant probability laws, theorems or rules that you use, and show all calculations.
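If the two conditional probabilities have been estimated separately, one way to combine them is the law of total probability, $P(A) = P(A \mid \text{college})\,P(\text{college}) + P(A \mid \text{career})\,P(\text{career})$. The sketch below assumes college and career are the only two starting choices; the function name and argument names are hypothetical.

```python
def total_probability(p_event_given_college, p_event_given_career, p_college=0.667):
    """Combine two conditional probabilities with the law of total probability.

    Assumes every player either goes to college or starts a career,
    so P(career) = 1 - P(college).
    """
    p_career = 1.0 - p_college
    return p_event_given_college * p_college + p_event_given_career * p_career
```

The two conditional probabilities passed in would come from your own ensembles; no values are assumed here.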
In [ ]:
Part E: Let’s see how important the Strife tiles are in affecting a player’s final money. What is the probability that a player ends the game with at least \$300,000 in cash if they landed on at least one Strife square? You may want to modify your previous code to run additional simulations for this part. Use the same method as Part D to address the proportion of players who begin by going to college versus starting a career.
In [ ]:
[30 points] Problem 2: Sonic or Tails?

In the file flipadelphia.csv you will find the results of an experiment that was conducted by Amy, the famous hedgehog data scientist, as she was flipping a coin one sunny day in a meadow. This is no ordinary coin, however: this coin has on one side Sonic, and on the other side Tails! The two sides of this coin are above, and at this link.
In Amy’s experiment she repeatedly flipped the coin until it came up Sonic. After each trial, she recorded her observed value for $X=$ the number of flips required to see the first Sonic. The results are stored in flipadelphia.csv.
Amy has a lot of coins for performing cool data science experiments, and these coins have different biases (not all unique). Amy is a forgetful hedgehog, so she isn’t sure which coin she was flipping. Her coins have biases of $p_S=.2, .3, .4, .5, .6, .7$ and $.8$, where $p_S$ is the probability of any given flip coming up Sonic.
Part A: Read in the data set and make a frequency histogram of the data. Be sure to label your axes appropriately, and center your bins above the integer numbers of flips (0, 1, 2, etc…). What is the name of the distribution for the random variable that Amy observed and recorded in her data table?
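Centering bins on integers usually comes down to choosing bin edges at half-integers. The helper below is a minimal sketch; the name `integer_centered_edges` is hypothetical, and it assumes the observed values are integers.

```python
import numpy as np

def integer_centered_edges(values):
    """Bin edges at half-integers, so each bin is centered on an integer."""
    values = np.asarray(values)
    # Edges run from (min - 0.5) to (max + 0.5) in steps of 1.
    return np.arange(values.min() - 0.5, values.max() + 1.5, 1.0)
```

The returned array can be passed as the `bins` argument of a histogram call such as `plt.hist(values, bins=integer_centered_edges(values))`.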
In [ ]:
Part B: Use the distribution that you identified in Part A to determine $P(X=n \mid p_S=0.5)$, the probability that Amy would observe the first Sonic flip on the $n$-th flip, assuming that the coin is fair ($p_S=0.5$), for each of the $n$ from her 10 trials in her data set. Then, combine these to find the overall likelihood that she would observe her entire data set, assuming that the coin was fair. That is, estimate $P(\text{data} \mid p_S=0.5)$. Be sure to note any assumptions you make about how the outcome of one trial relates to the outcomes of the others.
If it helps to have some mathematical notation, consider that Amy’s data set consists of the results of all 10 of her trials: $$\text{data} = (X_1 = n_1) \cap (X_2 = n_2) \cap \ldots \cap (X_{10} = n_{10})$$
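The likelihood calculation can be sketched as follows, under two assumptions that you should state and justify yourself: that $X$ counts flips up to and including the first Sonic (so that $P(X=n) = (1-p_S)^{n-1} p_S$), and that the trials are independent, so the joint probability of the intersection is a product. The function names are hypothetical, and no values from flipadelphia.csv are used.

```python
def geometric_pmf(n, p):
    """P(first success occurs on flip n) for success probability p."""
    if n < 1:
        return 0.0
    # n - 1 failures followed by one success.
    return (1.0 - p) ** (n - 1) * p

def likelihood(observations, p):
    """P(data | p), assuming the trials are independent."""
    result = 1.0
    for n in observations:
        result *= geometric_pmf(n, p)
    return result
```

For a long data set the product can become very small; summing log-probabilities is a common alternative, though for 10 trials the direct product is fine.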
In [ ]:
Part C: Suppose that, before we observed Amy’s data set, we believed that each of the seven possible coin biases occurs with equal probability, $P(p_S)$. This is called the prior distribution for the coin bias, $p_S$, because we have not yet taken into account Amy’s data set.
• Now, estimate the probability of each possible bias, given the data: $P(p_S \mid \text{data})$. This is called the posterior distribution for the coin bias, because it is our assessment of the coin’s bias after we have accounted for Amy’s data.
• Make a line plot of the bias along the x-axis versus the posterior probability of that bias along the y-axis, and be sure to label your axes.
• Comment on your plot. What appears to be the most probable value for the bias, $p_S$? This is called the maximum a posteriori estimate, because it maximizes the posterior distribution and sounds very, very fancy.
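The posterior over a discrete set of candidate biases follows from Bayes' rule: each posterior value is proportional to likelihood times prior, normalized so the values sum to one. The sketch below assumes you already have one likelihood value per candidate bias; the function names `posterior` and `map_estimate` are hypothetical.

```python
def posterior(likelihoods, priors):
    """Posterior probabilities from per-hypothesis likelihoods and priors."""
    # Bayes' rule numerator for each hypothesis.
    unnormalized = [lik * prior for lik, prior in zip(likelihoods, priors)]
    # The denominator P(data) is the sum of the numerators.
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

def map_estimate(hypotheses, posterior_probs):
    """Hypothesis with the largest posterior probability."""
    best_index = max(range(len(hypotheses)), key=lambda i: posterior_probs[i])
    return hypotheses[best_index]
```

Because the result is normalized, the prior only needs to be specified up to a constant factor for this function to work, although a properly normalized prior is easier to interpret.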
In [ ]:
Part D: Now suppose the prior probability distribution of the coins is not uniform. Namely, suppose these probabilities follow a triangular distribution, centered at $p_S=0.5$: $$P(p_S = p) = \begin{cases} mp & p \leq 0.5 \\ m(1-p) & p > 0.5 \end{cases}$$
Determine what value the constant $m$ should have in order to make $P(p_S = p)$ a valid probability mass function. Remember, $p_S \in \{.2, .3, \ldots , .7, .8\}$ and is discrete.
In [ ]:
Part E: Compare, using words, the triangular prior distribution (from Part D) and the uniform prior distribution (from Part C). What does each represent in terms of our prior knowledge of the coin bias?
In [ ]:
Part F: Modify your calculation of the posterior distribution from Part C to use the new triangular prior distribution from Part D. Make a plot of the results that includes both the posterior distribution using the uniform prior (from Part C) and the posterior distribution using the triangular prior (from Part D) in the same figure panel. Be sure to label your axes and include a legend.
In [ ]:
Part G: Comment on the effect of choosing a different prior distribution on your posterior inference for the most probable coin bias.
In [ ]: