In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')
import scipy.stats as ss
Group Project Instructions¶
• Make groups of 4 or 5 students (maximum of 2 students of the same nationality in each group)
• Use a publicly available real-world data set from the web:
▪ kaggle.com/datasets
▪ Other public data repositories.
• Identify and analyze interesting and relevant questions
• Use Python to analyze those questions
• Turn in a report (Jupyter Notebook) and the database used no later than Dec 18th at 11:00. Your notebook needs to include a description of how the variables you analyze are measured.
• Structure of the report:
▪ Abstract: summary of your most interesting discoveries. Maximum one page.
▪ Try to find the goal of the dataset. Explain which questions you want to answer.
▪ Answer your questions in the Notebook.
▪ Show all the code you have used to manipulate and analyze the data.
• There are two important technical aspects to consider:
1. Save the database you manipulated in the notebook to your PC
2. Handle missing values in Python
• Deadlines:
▪ Tuesday Nov 5th: database and names of the group
▪ Friday Nov 29th: First analyses
▪ Wednesday Dec 18th, 11 am: Final draft
1. Manipulate and save the database¶
Typically, you will use a Jupyter notebook to download and read a publicly available data set. At the beginning of your project, you will manipulate the database (e.g., eliminate rows or columns, create new variables, mark missing values as NaN, etc.). To keep these changes and use the manipulated database the next time you work in the notebook, you need to save the manipulated database to your hard disk. Then, when you work on your dataset again, load the manipulated dataset rather than the original one. Here I explain how to do this using an example.
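In compact form, the round trip looks like this (a minimal sketch; the file names and the dropped column are placeholders, not part of any project data):
import pandas as pd

df = pd.read_excel('original_data.xls')          # read the original file
df = df.drop(columns=['unused_column'])          # ...manipulate the database...
df.to_excel('data_modified.xlsx', index=False)   # save the manipulated version

# In a later session, start from the saved file instead of the original:
df = pd.read_excel('data_modified.xlsx')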
The US Department of Agriculture (USDA) developed and administered the HEI-2005 to measure how well the population follows the recommendations of the 2005 Dietary Guidelines for Americans. They randomly selected individuals who participated in two extended interviews and medical examinations. This database was published in Guenther et al. (2007) (Guenther, P. M., J. Reedy, S. M. Krebs-Smith, B. B. Reeve, and P. P. Basiotis. 2007. Development and Evaluation of the Healthy Eating Index–2005: Technical Report. Center for Nutrition Policy and Promotion, U.S. Department of Agriculture. Available at http://www.cnpp.usda.gov/HealthyEatingIndex.htm, accessed March 9, 2011). A description of some of its variables is shown below:
• HEI2005 (Healthy Eating Index-2005): measures the adequacy of consumption of vegetables, fruits, grains, milk, meat and beans, and liquid oils. It is measured on a 100-point scale.
• daycode: indicates the first (daycode=1) or second (daycode=2) interview.
• sr_overweight: 1 if the subject reported being overweight, 0 otherwise.
The description of all the study variables can be found in Newbold et al. (2008), Appendix to Chapter 10.
In [0]:
# Read the original database
df = pd.read_excel('/content/HEI_cost data variable subset.xls')
df.head()
Out[0]:
[Output truncated: 5 rows × 54 columns. Columns: SEQN, Suppl, doc_bp, daycode, sr_overweight, try_wl, try_mw, sr_did_lm_wt, daily_cost, HEI2005, daily_cost2, Friday, weekend_ss, week_mth, keeper, WIC, fsp, fsec, PIR_p, PIR_grp, nhw, hisp, nhb, single, female, waist_cir, waistper, age, hh_size, WTINT2YR, WTMEC2YR, immigrant, citizen, native_born, hh_income_est, English, Spanish, smoker, doc_chol, BMI, doc_dib, no_days_ph_ng, no_days_mh_ng, doc_ow, screen_hours, activity_level, total_active_min, waist_large, pff, prest, p_ate_at_home, hs, col_grad, pstore.]
We decide to study only individuals who are overweight.
In [0]:
df.groupby('sr_overweight').size()
Out[0]:
sr_overweight
0.0 3911
1.0 4663
dtype: int64
4,663 observations correspond to individuals who reported being overweight.
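The same counts can also be obtained with value_counts; a minimal alternative sketch, where dropna=False additionally shows how many observations are missing:
# Count observations per category, including NaN
df['sr_overweight'].value_counts(dropna=False)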
In [0]:
# Keep only observations of individuals who are overweight
df = df[df['sr_overweight']==1]
df.head()
Out[0]:
[Output truncated: 5 rows × 54 columns, same columns as above. The remaining observations (index 0, 1, 2, 3, 12, ...) all have sr_overweight = 1.]
In [0]:
df.shape
Out[0]:
(4663, 54)
We drop some variables we are not interested in, e.g., WIC, fsp, and fsec, to make the dataset more manageable.
In [0]:
# Drop some variables to make the dataset more manageable
# We drop WIC, fsp and fsec
df = df.drop(['WIC', 'fsp', 'fsec'], axis=1)
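drop also accepts a columns keyword, which makes the intent explicit without the axis argument; an equivalent sketch:
# Equivalent: name the columns to drop explicitly
df = df.drop(columns=['WIC', 'fsp', 'fsec'])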
In [0]:
df.head()
Out[0]:
[Output truncated: 5 rows × 51 columns; the WIC, fsp, and fsec columns are no longer present.]
In [0]:
from google.colab import drive
drive.mount('mydrive')
Drive already mounted at mydrive; to attempt to forcibly remount, call drive.mount("mydrive", force_remount=True).
In [0]:
# Save the manipulated database in My Drive
df.to_excel('/content/mydrive/My Drive/Stats/Notes/Group Project/HEI_2005_modified_Oct_2019.xlsx', index=False)
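The next time you work on the project, load this saved file instead of the original one; a minimal sketch (the path must match wherever you saved the file in your Drive):
# Continue working from the manipulated database
df = pd.read_excel('/content/mydrive/My Drive/Stats/Notes/Group Project/HEI_2005_modified_Oct_2019.xlsx')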
2. Missing values¶
Most databases you download will have missing values in some of their variables. However, not all databases mark missing values the same way: some mark them as blank cells, others use a dot (.), others use numbers like 999 or 888, etc. pandas understands that a value is missing when the observation is marked as NaN (Not a Number). Thus, before you analyze the data, you need to convert the missing-value codes in the dataset into NaN. We explain this using an example.
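If you already know the missing-value codes, they can also be converted at read time: the pandas read functions accept an na_values argument. A minimal sketch (the file name and codes are illustrative); be aware that codes passed this way are applied to every column, which is risky if, say, 999 is a valid value elsewhere:
# Treat '.', 888 and 999 as missing while reading the file
df = pd.read_excel('somefile.xls', na_values=['.', 888, 999])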
We will use a reduced version of the data collected in the European Working Conditions Survey (2015) to work with missing data.
In [0]:
# Read the dataset
df = pd.read_excel('/content/ewcs6_2015_ukda_reduced.xls')
df.head()
Out[0]:
[Output truncated: 5 rows × 33 columns. Columns: Country, Q2a, Q2b, Q2c, Q2d, Q7, Q11, Q12_years, Q12_months, Q12, Q14, Q16b, Q17, Q23, Q24, Q25, Q26, Q34, Q36, Q42, Q44, Q46, Q47, Q62, Q82, Q88, Q100, Q104, Q104_euro, Q105, ISCED, nace_rev1_1, isco_08_1. Besides genuine NaN entries, the data contain missing-value codes such as 777, 888, 88888888, and 99999999.]
In [0]:
# There are 9 missing values in Q2a (gender) marked with a 9
miss = df[df['Q2a']==9]
miss
Out[0]:
[Output truncated: 9 rows × 33 columns: the observations with Q2a = 9, at index 7803, 9362, 11872, 16544, 17692, 19705, 19706, 22248, 23531.]
In [0]:
# Replace the 9 cells marked with a 9 (missing values) with NaN
df['Q2a'] = np.where(df['Q2a']==9, np.nan, df['Q2a'])
In [0]:
df['Q2a'].unique()
Out[0]:
array([ 1., 2., nan])
In [0]:
# Check that the observation at row 22248, column 1 (Q2a) is now NaN
df.iloc[22248,1]
Out[0]:
nan
In [0]:
# Change the missing values in Q2b (age), coded 888 and 999, to np.nan
df['Q2b'] = np.where(df['Q2b']==888, np.nan, np.where(df['Q2b']==999, np.nan, df['Q2b']))
In [0]:
# We can also use the replace function
df['Q2a'] = df['Q2a'].replace(9, np.nan)
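replace also accepts a list of codes, so the two Q2b codes above can be handled in a single call; an equivalent sketch:
# Same effect as the nested np.where used for Q2b above
df['Q2b'] = df['Q2b'].replace([888, 999], np.nan)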
In [0]:
df['Q2a'].unique()
Out[0]:
array([ 1., 2., nan])
In [0]:
df.iloc[23531,1]
Out[0]:
nan
Finally, note that many functions (e.g., mean, standard deviation) exclude missing values from their calculations by default.
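For instance, with the default skipna=True, the mean of Q2b is computed over the non-missing ages only:
df['Q2b'].mean()               # NaN entries are ignored (skipna=True by default)
df['Q2b'].mean(skipna=False)   # returns NaN because some values are missing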