Pandas Basics
CSCI-UA.0479-001
numpy + ┬──┬ = pandas?
Um… sort of. pandas is… →
pandas is a Python module that has data structures and tools for working with data
You’ll find a lot of numpy-like functionality in it:
especially for array-based computing and functions
and, of course, a style that favors vectorized array operations over for loops
However, unlike numpy, pandas specializes in dealing with tabular data composed of mixed data types
Some Types!
pandas offers a few types for manipulation of tabular data: →
Series – a one-dimensional, labeled array
DataFrame – a two-dimensional data structure (think of a table with columns and rows)
Bonus Type! Index
the type that holds the labels for a Series and DataFrame
Let’s check out a Series first!
You can think of a Series as: →
a numpy ndarray with labels for each value
… or a dict with ordered key/value pairs and potentially duplicate keys
… but officially (from the docs):
a one-dimensional labeled array
where the associated labels are collectively referred to as the index
index and values
A Series has two properties that show the labels and data it holds →
values – the actual data in the Series
index – the labels for the data in Series
Creating a Series
There are several ways to create a Series, each resulting in different labels for the index →
using a single positional argument, data (an ndarray or sequence type like list), to specify values in Series
two positional (data and index) arguments with the second specifying the index labels
passing keyword arguments for data and index
passing in a dict with dictionary keys as labels and values as values
(can also be called with a specific index value)
Remember, the index provides a label for each element in a Series.
Implicit Index
Without an index specified, the labels are simply 0 to the length of values – 1. Check the examples →
# an ndarray
pd.Series(np.array([7, 8, 9]))
0    7
1    8
2    9
dtype: int64
pd.Series(['ant', 'bat', 'cat'])
0    ant
1    bat
2    cat
dtype: object
That Looks Like numpy! But!
Hey… this actually looks just like an ndarray. Note the dtype property!
However, you’ll see that there are a couple of major differences:
the obvious difference is that it has index labels (that can be repeated)
additionally, it supports different types in its values:
pd.Series(['ant', 'bat', 123])
dtype: object
Specifying Labels
So, um… if index labels are just gonna be 0 through length, then that’s just the same as an ndarray, right? Let’s specify labels by adding a second positional argument →
pd.Series(['Hoboken', 'Ithaca'], ['NJ', 'NY'])
NJ Hoboken
NY Ithaca
Oh yes. Duplicate. Labels. R. Allowed. 👯
pd.Series(['Syracuse', 'Hoboken', 'Ithaca'],
          ['NY', 'NJ', 'NY']) # (line continuation)
NY Syracuse
NJ Hoboken
NY Ithaca
With 🔑 Arguments
You can also pass these arguments in as keyword arguments data and index (for labels) →
pd.Series(data=[7, 8])
pd.Series(data=[7, 8], index=['A', 'B'])
pd.Series([7, 8, 9], index=['A', 'B', 'C'])
len(data) == len(index)
The lengths of the data and index passed in must be the same.
If these lengths are different, you’ll get a ValueError:
pd.Series([7, 8, 9], index=['A', 'B'])
ValueError: Length of passed values is 3, index implies 2
Creating Series with dict 📖
Earlier, we described a Series as a dictionary that allows duplicate labels.
In fact you can pass a dict in to a Series constructor:
pd.Series({'B': 'bat', 'A': 'ant'})
dict keys become labels
dict values are the values in the Series
dict with index
pd.Series({'A': 'ant', 'B': 'bat'}, ['A', 'B']) # OK
If a key from data doesn’t match an element in index, its value is not included.
pd.Series({'A': 'ant', 'B': 'bat'}, ['A'])
If an index label does not have a corresponding key in data, then the missing data value will be NaN
pd.Series({'A': 'ant', 'B': 'bat'}, ['A', 'C'])
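The two mismatch cases above can be checked with a short sketch. The variable names (data, only_a, a_and_c) are made up for illustration and are not part of the original slides.

```python
import pandas as pd

data = {'A': 'ant', 'B': 'bat'}

# Key 'B' is not listed in the index, so its value is left out.
only_a = pd.Series(data, ['A'])
print(only_a)

# Label 'C' has no matching key in data, so its value is missing (NaN).
a_and_c = pd.Series(data, ['A', 'C'])
print(a_and_c)
print(a_and_c.isnull())
```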
NaN Means Missing Data
In pandas, NaN implies that a value is missing or “N/A” →
The pandas functions / instance methods, isnull and notnull can be used to check for missing values:
s = pd.Series({'x': 100}, ['x', 'y'])
pd.isnull(s) # or s.isnull()
x False
y True
pd.notnull(s) # or s.notnull()
x True
y False
index and values Revisited
The index and values properties of a Series object can be used to retrieve the labels and data from a Series
(note that this is slightly confusing as the keyword arg is called data, while the property is called values)
s = pd.Series([7, 8, 9], ['x', 'y', 'z'])
s.values # array([7, 8, 9])
s.index # Index(['x', 'y', 'z'], dtype='object')
Indexing a Series is similar to indexing a 1-dimensional ndarray
s = pd.Series([7, 8, 9, 10], list('xyxz'))
s['y'] # 8 … (as expected)
Using a list to specify multiple labels:
s[['y', 'z']] # Series! y 8
              #         z 10
Repeating a label repeats the value:
s[['y', 'y', 'z']] # Series y 8
                   #        y 8
                   #        z 10
Duplicate Labels 🔖👯
If a specified label maps to more than one value, all of the matching values are returned →
s = pd.Series([7, 8, 9, 10], list('xyxz'))
s['x'] # Series! x 7
       #         x 9
Indexing by Position
Just like a numpy ndarray, you can still use position for indexing →
s = pd.Series([2, 3, 4, 5], list('abcd'))
Both of the following… →
…gives us 2
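As a hedged sketch of label versus position lookup: the example below uses .iloc for the positional case, because depending on the pandas version a plain integer key such as s[0] on a Series with string labels may emit a deprecation warning or be treated as a label. The names by_label and by_position are only illustrative.

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5], list('abcd'))

by_label = s['a']        # look up by label
by_position = s.iloc[0]  # look up by position, stated explicitly

print(by_label, by_position)
```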
Slicing with Labels
Although indexing by labels and position is similar, there’s a pretty big gotcha when slicing ⚠️ →
slicing by position works as expected
slicing with labels is inclusive at the end
s = pd.Series([2, 3, 4, 5, 6], list('abcde'))
c 4 #👌
s['b':'d']
d 5 # WAT!?😮
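The difference between the two kinds of slice can be seen side by side in this small sketch. It uses the explicit .iloc and .loc accessors (an assumption about style, not necessarily what the original slide showed) so that the intent of each slice is unambiguous.

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 6], list('abcde'))

by_position = s.iloc[1:3]  # positions 1 and 2; the end position is excluded
by_label = s.loc['b':'d']  # labels b, c and d; the end label is included

print(by_position)
print(by_label)
```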
Vectorized Arithmetic
Yup ✅ … works as you’d expect:
s = pd.Series([1, 2], ['x', 'y'])
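A minimal sketch of what vectorized arithmetic with a scalar looks like; the specific operations (multiplying by 2, adding 10) are chosen here for illustration and are not taken from the slides.

```python
import pandas as pd

s = pd.Series([1, 2], ['x', 'y'])

doubled = s * 2   # each value is multiplied by 2, labels are kept
shifted = s + 10  # each value has 10 added, labels are kept

print(doubled)
print(shifted)
```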
Label Alignment
If the other operand is a Series, operations will be done based on label alignment →
values for matching labels will be operated on
non-matching labels result in NaN (in pandas, this means NA or missing)
the union of labels will be the result of the operation
Let’s start off with a straightforward one; what’s the result of this operation? →
s = pd.Series([1, 2], ['x', 'y'])
s + pd.Series([9, 8], ['x', 'y']) # let’s add!
Tricky Label Alignment
Now for something a little trickier. What is the result of this operation? →
s = pd.Series([1, 2], ['x', 'y'])
s + pd.Series([9, 100], ['x', 'z']) # tricky!
Alignment Summary
Series operations align by label rather than position: →
if a label is present in one Series but missing from the other, the resulting index will contain the labels from both (their union)!
missing values (NaN) are inserted where labels do not match
s1 = pd.Series([1, 2], ['x', 'y'])
s2 = pd.Series([9, 100], ['x', 'z'])
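Adding the two Series defined above shows the alignment rules in action. This is a small sketch; the name total is illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2], ['x', 'y'])
s2 = pd.Series([9, 100], ['x', 'z'])

total = s1 + s2
print(total)
# x    10.0
# y     NaN
# z     NaN
# dtype: float64
```

Only x appears in both operands, so it is the only label with a real sum; the result is float64 because NaN is a floating point value.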
Comparison Operations
Comparison operators work similarly to arithmetic operators, except, of course, they return boolean values… what are the results of the following comparisons? →
s = pd.Series([1, 2], ['x', 'y'])
x True # such vectorized!
y False
s == pd.Series([1, 2], ['x', 'y'])
x True # compared by value
Nothing Compares 2 U
One big difference ⚠️ though… if the labels don’t align, you get an exception (ValueError)!
s = pd.Series([1, 2], ['x', 'y'])
# try this…
s == pd.Series([1, 99], ['x', 'z'])
# or this…
s == pd.Series([1], ['x'])
ValueError: Can only compare identically-labeled Series objects
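A minimal sketch of the mismatched-label case, catching the error so the script keeps running. The names other and raised are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2], ['x', 'y'])
other = pd.Series([1, 99], ['x', 'z'])  # same length, different labels

try:
    s == other
    raised = False
except ValueError as err:
    raised = True
    print(err)
```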
Filtering with Booleans
Just like a numpy ndarray, you can filter a Series with a list of booleans: →
s = pd.Series([2, 3, 4, 5]) # 0 2
What does the following expression give us? →
s[[True, False, True, False]]
0 2 # keeps 1st and 3rd (index 0 and 2)
2 4 # discards 2nd and 4th (index 1 and 3)
Using Results of Comparison to Filter
A common pattern is to use a Series of booleans returned from a comparison to filter out values:
What is the result of the following? →
s = pd.Series([5, 6, 7, 8], index=['A', 'B', 'C', 'D'])
s[s % 2 == 1]
A 5 # only odds, s % 2 == 1
C 7 # gives us booleans [T, F, T, F]
DataFrames 🖼
You can think of a DataFrame as: →
a rectangular table of data
or an ordered collection of columns
(where perhaps each column is a Series!)
(think a dict of Series objects!)
index and columns
In a DataFrame, both rows and columns have an index. The nomenclature is:
index – for row labels
columns – for column labels
data – again, the actual values are called data
data, index and columns can be specified when creating a new DataFrame
Creating DataFrames
Like Series, there are multiple ways to create DataFrames →
positional arguments
with an ndarray or other sequence types
with a dict of dict objects
using keyword arguments
mixing positional and keyword arguments
Each method allows different ways to specify data, index and columns
Implicit index and columns
Without the second or third arguments specified, index and columns are generated as 0 to length of rows or cols – 1
# only data (index and columns generated)
pd.DataFrame([[1, 2, 3], [4, 5, 6]])
    0  1  2
0   1  2  3
1   4  5  6
# only data and index (columns generated)
pd.DataFrame([[1, 2, 3], [4, 5, 6]], ['r1', 'r2'])
    0  1  2
r1  1  2  3
r2  4  5  6
Creating DataFrames Continued
Of course, with all three, you can explicitly set data, index, and columns →
pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]], # data
    ['r1', 'r2', 'r3'], # index
    ['A', 'B', 'C']) # columns
    A  B  C
r1  1  2  3
r2  4  5  6
r3  7  8  9
Nested Dictionaries
A nested dict can be used to explicitly define row labels and column names as well →
outer keys are column names
inner keys are row names
d = pd.DataFrame({
    "colA": {'r1': 6, 'r2': 7},
    "colB": {'r1': 8, 'r2': 9}
})
    colA  colB
r1     6     8
r2     7     9
Keyword Arguments
Like Series, you can mix and match with keyword arguments: →
In the following code, notice that:
data is passed in as a positional argument,
index is left out (to be generated automatically)
columns is defined as a keyword argument
pd.DataFrame([[1, 2, 3], [4, 5, 6]],
             columns=['A', 'B', 'C'])
   A  B  C
0  1  2  3
1  4  5  6
values index columns
The data, row labels and columns can all be retrieved by accessing attributes / properties on a DataFrame instance →
values – the data for the table
index – the row labels
columns – the column names
there’s also dtype (on a Series) and dtypes (on a DataFrame, one per column)…
since a DataFrame and Series can hold different types…
the dtype of values will be set to the type that can accommodate all the values in the DataFrame
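The attributes above can be inspected on a small DataFrame. This is a sketch with made-up data (one integer column and one float column) chosen to show how the type of values is widened to fit every column.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3], 'B': [2.5, 4.5]}, index=['r1', 'r2'])

print(df.values)        # two-dimensional ndarray holding the data
print(df.index)         # row labels
print(df.columns)       # column labels
print(df.dtypes)        # one dtype per column
print(df.values.dtype)  # float64, which can hold both columns
```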
Retrieving Columns
Columns can be retrieved by:
indexing with a single column name
(which may return a Series or DataFrame)
indexing with a list of column names to return a DataFrame
Using the following DataFrame, let’s check out some indexing possibilities →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
foo bar baz
0 4 5 6
1 7 8 9
Retrieving one Column
With a single column name, a column is returned as a Series →
df['foo']
0 4
1 7
# note that the type and name of the column are usually
# given too:
Name: foo, dtype: int64
type(df['foo']) # we get a series back
pandas.core.series.Series
df['foo'].name # note the name attribute!
'foo'
Retrieving Multiple Columns pt 1!
If a column label occurs more than once, then a DataFrame of multiple columns is returned rather than a single Series →
d = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                 columns=['a', 'b', 'a'])
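A short sketch of the duplicate-label case: indexing with the repeated label gives back a DataFrame, while the unique label still gives a Series. The names both_a and only_b are illustrative.

```python
import pandas as pd

d = pd.DataFrame([[4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'a'])

both_a = d['a']  # two columns share the label 'a'
only_b = d['b']  # the label 'b' is unique

print(type(both_a).__name__)  # DataFrame
print(type(only_b).__name__)  # Series
```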
Retrieving Multiple Columns pt 2!
When indexing with a list of column names (even if there’s only one name in the list), a DataFrame containing only the listed columns is returned →
df[['foo', 'bar']]
   foo  bar
0    4    5
1    7    8
type(df[['foo', 'bar']])
pandas.core.frame.DataFrame
type(df[['foo']]) # list w/ 1 element
pandas.core.frame.DataFrame # (still!)
Rearrange / Repeat
Indexing can also be used to retrieve a new DataFrame with reordered columns and/or repeated columns →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
What DataFrame will we get back from: →
df[['bar', 'bar', 'foo']]
   bar  bar  foo # bar is repeated
0    5    5    4 # and placed before
1    8    8    7 # foo
If Key Doesn’t Exist…
Regardless of whether or not a list or a single column is used for indexing into a DataFrame, a KeyError is raised if a key doesn’t exist →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df['dne']
df[['foo', 'dne']] # both of these are 🚫
KeyError: "['dne'] not in index"
DataFrame Slices Give Rows!
If a DataFrame is sliced BY POSITION, it yields rows rather than columns →
d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
d[:2] # slicing refers to rows here!
cA cB cC # only first two rows!
r1 1 4 7
r2 2 5 8
Indexing with List of Booleans / Arrays
Much like Series and ndarray, we can use a list or array of booleans to select parts of a DataFrame
a list/array of booleans filters DataFrame rows
the length of the booleans must match the number of rows
d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
d[[False, True, True]]
cA cB cC # gimme last two rows!
r2 2 5 8
r3 3 6 9
Constructing Boolean Selection
Again, we don’t have to manually create a Boolean array; it can be the result of a vectorized boolean comparison →
Let’s take this example…
d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
Boolean Selection Continued
Retrieve the rows where column cA is more than 1
# d is cA cB cC
# r1 1 4 7
# r2 2 5 8
# r3 3 6 9
d['cA'] > 1 # r1 False
# r2 True
# r3 True
d[d['cA'] > 1] # 😘
cA cB cC
r2 2 5 8
r3 3 6 9
Setting Values
Now that we know how to retrieve values with indexing… let’s see how we can set values with Series →
s = pd.Series([2, 3, 4, 5, 6], list('abcde'))
s['a'] = 100 # assigning with an index
s['b':'d'] = 200 # assigning with a slice
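Running the two assignments and printing the result makes the inclusive label slice visible. This sketch simply repeats the code above with a print added.

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 6], list('abcde'))
s['a'] = 100      # a single label
s['b':'d'] = 200  # label slice: b, c and d (the end label is included)

print(s)
# a    100
# b    200
# c    200
# d    200
# e      6
# dtype: int64
```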
Setting Values, DataFrame
Using our usual example:…
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
Assigning a scalar sets all values of a column:
df['foo'] = 77
foo bar baz
0 77 5 6
1 77 8 9
Assignment Continued
Of course, you can set each value in a column to a specific value using a list or even a Series →
df['foo'] = [99, 100]
foo bar baz
0 99 5 6
1 100 8 9
df['foo'] = pd.Series([-8, -9])
foo bar baz
0 -8 5 6
1 -9 8 9
About That Series
As you might expect, if you assign a Series to a DataFrame column: →
the labels will be aligned to perform assignment
with… DataFrame labels missing from the Series filled with NaN
and extra labels in the Series (not matching any of the DataFrame’s labels) ignored
More DataFrame / Series Assignment
Pay attention to the mismatched labels… →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  index=['r1', 'r2'],
                  columns=['foo', 'bar', 'baz'])
# foo bar baz
# r1 4 5 6
# r2 7 8 9
df['foo'] = pd.Series([100, 200], ['r1', 'r3'])
foo bar baz
r1 100.0 5 6
r2 NaN 8 9 # r2 is added as NaN
# r3 ignored
Assignment Errors
When assigning a list or ndarray to a column, the length of the data must match the number of rows in the DataFrame (a Series is aligned by label instead). →
df['foo'] = [100]
ValueError: Length of values does not match length of index
Assignment + Boolean Selection
Note that indexing with an array, Series or list of booleans can be used in assignment as well →
d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
d[d['cA'] > 1] = 0
cA cB cC
r1 1 4 7
r2 0 0 0
r3 0 0 0
Adding Columns
If the column name used in assignment does not exist, a new column will be created →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df['qux'] = [20, 30] # qux is new!
foo bar baz qux
0 4 5 6 20
1 7 8 9 30
Removing Columns: a Mystery
The .drop method on a DataFrame can be used to remove a column. Let’s try it: →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df.drop('baz') # 🤔
KeyError: "['baz'] not found in axis"
# 😮 what happened!?
Axis Flashback
Remember numpy… specifically the significance of .shape and axis? →
.shape describes the length of the dimensions of an ndarray
a two-dimensional ndarray has a .shape that’s a two-element tuple
what does the first element of that tuple represent? and the second?
the first, axis 0, represents rows
the second, axis 1, represents columns
Really Removing Columns
.drop takes axis as a keyword argument →
buuuut… its default value is 0 (rows! 😮)
to remove a column, use axis=1
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df.drop('baz', axis=1)
foo bar # baz column
0 4 5 # was removed!
1 7 8
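One detail worth checking when using .drop is that, by default, it returns a new DataFrame rather than changing the original. The sketch below shows that, along with the columns= keyword as an alternative spelling of axis=1. The names without_baz and same_result are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]], columns=['foo', 'bar', 'baz'])

without_baz = df.drop('baz', axis=1)    # drop by axis number
same_result = df.drop(columns=['baz'])  # equivalent keyword form

print(list(without_baz.columns))  # ['foo', 'bar']
print(list(df.columns))           # ['foo', 'bar', 'baz'], df is unchanged
```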
Dictionary Life (del)
Similar to deleting keys/values in dictionaries, the del keyword can also be used to drop columns →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
del df['baz']
foo bar # again, baz
0 4 5 # is removed!
1 7 8
Ok, How About Rows?
Indexing into rows can be done by indexing into the loc attribute / property of a DataFrame object. →
again, a Series is returned
the labels are the column names, though!
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df.loc[1]
foo 7 # last row returned
bar 8 # (2nd row is index 1)
baz 9
Retrieving a Single Value
Remember that once you have a row, you can index into that as well. →
we can use .loc to get a row…
and then get a specific element from that row
What value would this retrieve? →
# df is foo bar baz
# 0 4 5 6
# 1 7 8 9
df.loc[1]['bar']
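The lookup can be sketched in two ways: chaining, as above, or passing the row label and column label to .loc together. The second form is shown as an alternative, not as something the slides prescribe; the names chained and direct are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]], columns=['foo', 'bar', 'baz'])

chained = df.loc[1]['bar']  # get row 1 as a Series, then index into it
direct = df.loc[1, 'bar']   # row label and column label in one step

print(chained, direct)
```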
Rows can be made into columns (and columns to rows) using transpose: →
Imagine that df looks like this DataFrame: