US_Baby_Names-2010
Introductory Example
US Baby Names 2010
‘C:\\Users\\roman\\OneDrive – University of Toronto\\University of Toronto\\MIE1624 – Winter 2022\\Lecture 1 – Introduction\\Python’
http://www.ssa.gov/oact/babynames/limits.html
Load file into a DataFrame
import pandas as pd
names2010 = pd.read_csv('yob2010.txt', names=['name', 'sex', 'births'])
name sex births
1 Sophia F 20477
4 Ava F 15300
… … … …
33833 Zymaire M 5
33834 Zyonne M 5
33835 Zyquarius M 5
33836 Zyran M 5
33837 Zzyzx M 5
33838 rows × 3 columns
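The loading step can be tried without the data file. The sketch below is only an illustration: it assumes yob2010.txt is a headerless comma-separated file, and it builds a four-row in-memory stand-in (the variable names `sample_csv` and `sample` are hypothetical) from rows that appear in the outputs of this notebook.

```python
import io

import pandas as pd

# A few rows copied from outputs shown in this notebook, laid out in the
# headerless "name,sex,births" format that yob2010.txt is assumed to use.
sample_csv = io.StringIO(
    "Sophia,F,20477\n"
    "Ava,F,15300\n"
    "Ethan,M,17866\n"
    "Jayden,M,17030\n"
)

# names= supplies the column labels, so the first data row is not
# mistaken for a header row.
sample = pd.read_csv(sample_csv, names=["name", "sex", "births"])

print(sample)
print(sample.dtypes)
```

Without `names=`, pandas would use the first data row as the header, and that row would be lost from the data.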
Total number of births in 2010 by sex
names2010.groupby('sex').births.sum()
F 1759010
M 1898382
Name: births, dtype: int64
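Add a prop column: each name's share of all births within its sex
def add_prop(group):
    # The float cast guards against integer division under Python 2;
    # it is harmless under Python 3, where / is always true division.
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

# group_keys=False keeps the original flat row index in the result
names2010 = names2010.groupby('sex', group_keys=False).apply(add_prop)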
name sex births prop
0 22731 0.012923
1 Sophia F 20477 0.011641
2 17179 0.009766
3 16860 0.009585
4 Ava F 15300 0.008698
… … … … …
33833 Zymaire M 5 0.000003
33834 Zyonne M 5 0.000003
33835 Zyquarius M 5 0.000003
33836 Zyran M 5 0.000003
33837 Zzyzx M 5 0.000003
33838 rows × 4 columns
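The same per-group proportion can also be computed without a helper function. This is a sketch of an alternative, not the method used in this notebook: `groupby(...).transform("sum")` returns each row's group total aligned with the original rows. The four-row `sample` frame is a hypothetical stand-in built from rows shown above, so its proportions are relative to this subset only, not to the full data set.

```python
import pandas as pd

# Hypothetical stand-in for names2010, using rows shown in the outputs above.
sample = pd.DataFrame({
    "name": ["Sophia", "Ava", "Ethan", "Jayden"],
    "sex": ["F", "F", "M", "M"],
    "births": [20477, 15300, 17866, 17030],
})

# transform("sum") yields a Series with the same index as `sample`,
# holding the total births of the group (sex) each row belongs to.
group_totals = sample.groupby("sex")["births"].transform("sum")

# Element-wise true division gives each name's share within its sex.
sample["prop"] = sample["births"] / group_totals

print(sample)
```

Because the result of `transform` is already aligned with the original rows, the index stays flat and no columns are dropped or reordered.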
names2010.describe()
births prop
count 33838.000000 33838.000000
mean 108.085348 0.000059
std 693.442991 0.000376
min 5.000000 0.000003
25% 7.000000 0.000004
50% 11.000000 0.000006
75% 29.000000 0.000016
max 22731.000000 0.012923
Verify that the prop column sums to 1 within each group
import numpy as np
np.allclose(names2010.groupby(['sex']).prop.sum(), 1)
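The check uses np.allclose rather than == because sums of floating-point values are rarely exact. A minimal illustration, with ten copies of 0.1 as a hypothetical stand-in for the prop values:

```python
import numpy as np

# Ten copies of 0.1 should sum to 1 mathematically, but 0.1 has no exact
# binary representation, so rounding error accumulates in the sum.
total = sum([0.1] * 10)

print(total == 1.0)             # exact comparison is unreliable here
print(np.allclose(total, 1.0))  # comparison within a small tolerance
```

np.allclose compares within relative and absolute tolerances (the `rtol` and `atol` parameters), which is the appropriate test for a column of computed proportions.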
Extract a subset of the data with the top 10 names for each sex
def get_top10(group):
    return group.sort_values(by='births', ascending=False)[:10]

grouped = names2010.groupby(['sex'])
top10 = grouped.apply(get_top10)
top10.index = np.arange(len(top10))
name sex births prop
0 22731 0.012923
1 Sophia F 20477 0.011641
2 17179 0.009766
3 16860 0.009585
4 Ava F 15300 0.008698
5 14172 0.008057
6 Abigail F 14124 0.008030
7 13070 0.007430
8 Chloe F 11656 0.006626
9 Mia F 10541 0.005993
10 21875 0.011523
11 Ethan M 17866 0.009411
12 17133 0.009025
13 Jayden M 17030 0.008971
14 16870 0.008887
15 16634 0.008762
16 16281 0.008576
17 15679 0.008259
18 Aiden M 15403 0.008114
19 15364 0.008093
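A top-N-per-group selection like the one above can also be written without apply. The sketch below is an alternative formulation, not the method used in this notebook: it sorts once and then takes the first rows of each group with `groupby(...).head(n)`. The six-row `sample` frame is a hypothetical stand-in built from rows shown above, and N is 2 to keep the output short.

```python
import pandas as pd

# Hypothetical stand-in for names2010, using rows shown in the outputs above.
sample = pd.DataFrame({
    "name": ["Sophia", "Ava", "Abigail", "Ethan", "Jayden", "Aiden"],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "births": [20477, 15300, 14124, 17866, 17030, 15403],
})

top2 = (
    sample.sort_values("births", ascending=False)  # largest counts first
    .groupby("sex")
    .head(2)                                       # first 2 rows of each sex
    .sort_values(["sex", "births"], ascending=[True, False])
    .reset_index(drop=True)                        # flat 0..n-1 index
)

print(top2)
```

The final sort_values and reset_index reproduce the layout used above: females first, then males, each ordered by births, under a fresh integer index.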
top10.describe()
births prop
count 20.000000 20.000000
mean 16312.250000 0.008918
std 3013.830748 0.001641
min 10541.000000 0.005993
25% 15018.000000 0.008084
50% 16457.500000 0.008730
75% 17144.500000 0.009455
max 22731.000000 0.012923
Aggregate births by the first letter of the name column
# extract first letter from name column
get_first_letter = lambda x: x[0]
first_letters = names2010.name.map(get_first_letter)
first_letters.name = 'first_letter'
# aggfunc='sum' (a string) avoids the warning newer pandas emits for the builtin sum
table = names2010.pivot_table('births', index=first_letters,
                              columns=['sex'], aggfunc='sum')
table.head()
sex F M
first_letter
A 309608 198870
B 64191 108460
C 96780 168356
D 47211 123298
E 118824 102513
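The first-letter aggregation can be reproduced on a small frame. This is a sketch under stated assumptions: the six-row `sample` frame is a hypothetical stand-in built from rows shown earlier, and the first letter is taken with the vectorized `.str[0]` accessor instead of the lambda used above.

```python
import pandas as pd

# Hypothetical stand-in for names2010, using rows shown in earlier outputs.
sample = pd.DataFrame({
    "name": ["Sophia", "Ava", "Abigail", "Ethan", "Jayden", "Aiden"],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "births": [20477, 15300, 14124, 17866, 17030, 15403],
})

# .str[0] takes the first character of every name; the Series name
# becomes the label of the resulting row index.
first_letter = sample["name"].str[0].rename("first_letter")

# One row per first letter, one column per sex, births summed per cell.
table = sample.pivot_table(
    "births", index=first_letter, columns="sex", aggfunc="sum"
)

print(table)
```

Letter and sex combinations with no rows in the data appear as NaN in the result; passing `fill_value=0` to pivot_table replaces them with zeros if that is preferred.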
Normalize the table by the total births for each sex
table.sum()
F 1759010
M 1898382
dtype: int64
letter_prop = table / table.sum().astype(float)
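The division works because pandas aligns the index of the Series returned by table.sum() (F, M) with the columns of the table, so each column is divided by its own total. A small sketch, using only the A and B rows from the output of table.head() (so the proportions are relative to those two letters, not to the full table):

```python
import numpy as np
import pandas as pd

# Rows A and B copied from the table.head() output; the remaining
# letters are left out to keep the example short.
table = pd.DataFrame(
    {"F": [309608, 64191], "M": [198870, 108460]},
    index=pd.Index(["A", "B"], name="first_letter"),
)

col_totals = table.sum()          # Series indexed by "F" and "M"
letter_prop = table / col_totals  # Series index aligns with table columns

print(letter_prop)
print(np.allclose(letter_prop.sum(), 1))
```

After normalization each column sums to 1, so the values in a column can be read as the share of that sex's births accounted for by each first letter.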
Plot the proportion of boys' and girls' names starting with each letter
%matplotlib inline
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
fig.tight_layout()