
Pandas Basics

CSCI-UA.0479-001

numpy + ┬──┬ = 🐼

Um… sort of. pandas is… →

pandas is a Python module that has data structures and tools for working with data

You’ll find a lot of numpy-like functionality in it:

especially for array based computing and functions
and, of course, a style that favors vectorized array operations over for loops

However, unlike numpy, pandas specializes in dealing with tabular data composed of mixed data types

Some Types!

pandas offers a few types for manipulation of tabular data: →

Series – a one-dimensional, labeled array

DataFrame – a two-dimensional data structure (think of a table with columns and rows)

Bonus Type! Index
the type that holds the labels for a Series and DataFrame

Let’s check out a Series first!

You can think of a Series as: →

a numpy ndarray with labels for each value
… or a dict with ordered key/value pairs and potentially duplicate keys
… but officially (from the docs):
a one-dimensional labeled array
where the associated labels are collectively referred to as the index

index and value

A Series has two properties that show the labels and data it holds →

values – the actual data in the Series
index – the labels for the data in Series

Creating a Series

There are several ways to create a Series, each resulting in different labels for the index →

using a single positional argument, data (an ndarray or sequence type like list), to specify values in Series
two positional (data and index) arguments with the second specifying the index labels
passing keyword arguments for data and index
passing in a dict with dictionary keys as labels and values as values
(can also be called with a specific index value)

Remember, the index provides a label for each element in a Series.
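
As a rough sketch of the creation styles listed above (the variable names are only illustrative), the same kind of Series can be built with data alone, with positional data and index, with keyword arguments, or from a dict:

```python
import numpy as np
import pandas as pd

# 1. data only: labels default to 0 .. len(data) - 1
implicit = pd.Series(np.array([7, 8, 9]))

# 2. data and index as positional arguments
positional = pd.Series([7, 8, 9], ["A", "B", "C"])

# 3. data and index as keyword arguments
keyword = pd.Series(data=[7, 8, 9], index=["A", "B", "C"])

# 4. a dict: keys become labels, values become the data
from_dict = pd.Series({"A": 7, "B": 8, "C": 9})

print(list(implicit.index))   # [0, 1, 2]
print(list(from_dict.index))  # ['A', 'B', 'C']
```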

Implicit Index

Without an index specified, the labels are simply 0 through len(values) – 1. Check the examples →

# an ndarray
pd.Series(np.array([7, 8, 9]))
0    7
1    8
2    9
dtype: int64

# a list
pd.Series(['ant', 'bat', 'cat'])
0    ant
1    bat
2    cat
dtype: object

That Looks Like numpy! But!

Hey… this actually looks just like an ndarray. Note the dtype property!

However, you’ll see that there are a couple of major differences:

the obvious difference is that it has index labels (that can be repeated)
additionally, it supports different types in its values:

pd.Series(['ant', 'bat', 123])
0    ant
1    bat
2    123
dtype: object

Specifying Labels 🏷

So, um… if index labels are just gonna be 0 through length, then that’s just the same as an ndarray, right? Let’s specify labels by adding a second positional argument →

pd.Series(['Hoboken', 'Ithaca'], ['NJ', 'NY'])
NJ    Hoboken
NY    Ithaca

Oh yes. Duplicate. Labels. R. Allowed. 👯

pd.Series(['Syracuse', 'Hoboken', 'Ithaca'],
          ['NY', 'NJ', 'NY']) # (line continuation)
NY    Syracuse
NJ    Hoboken
NY    Ithaca

With 🔑 Arguments

You can also pass these arguments in as keyword arguments data and index (for labels) →

pd.Series(data=[7, 8])

pd.Series(data=[7, 8], index=['A', 'B'])

pd.Series([7, 8, 9], index=['A', 'B', 'C'])

len(data) == len(index)

The lengths of the data and index passed in must be the same.

If these lengths are different, you’ll get a ValueError:

pd.Series([7, 8, 9], index=['A', 'B'])

ValueError: Length of passed values is 3, index implies 2

Creating Series with dict 📖

Earlier, we described a Series as a dictionary that allows duplicate labels.

In fact, you can pass a dict to the Series constructor:

pd.Series({'B': 'bat', 'A': 'ant'})

dict keys become labels
dict values are the values in the Series

dict with index

pd.Series({'A': 'ant', 'B': 'bat'}, ['A', 'B']) # OK

If a key from data doesn’t match an element in index, its value is not included.

pd.Series({'A': 'ant', 'B': 'bat'}, ['A'])

If an index label does not have a corresponding key in data, then the missing data value will be NaN.

pd.Series({'A': 'ant', 'B': 'bat'}, ['A', 'C'])
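
To make the two rules above concrete, here is a small sketch (the names `animals` and `s` are just for illustration) that passes a dict together with an index that only partly overlaps its keys:

```python
import pandas as pd

animals = {"A": "ant", "B": "bat"}

# 'B' is not in the index, so its value is left out;
# 'C' has no key in the dict, so its value is NaN
s = pd.Series(animals, index=["A", "C"])

missing = s.isnull()       # boolean Series marking the NaN entries
present = s.dropna()       # keeps only the labels that have data

print(list(s.index))       # ['A', 'C']
print(missing.tolist())    # [False, True]
print(present.tolist())    # ['ant']
```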

NaN Means Missing Data

In pandas, NaN implies that a value is missing or “N/A” →

The pandas functions / instance methods, isnull and notnull can be used to check for missing values:

s = pd.Series({'x': 100}, ['x', 'y'])

pd.isnull(s) # or s.isnull()
x False
y True

pd.notnull(s) # or s.notnull()
x True
y False

index and values Revisited

The index and values properties of a Series object can be used to retrieve the labels and data from a Series

(note that this is slightly confusing, as the keyword arg is called data while the property is called values)

s = pd.Series([7, 8, 9], ['x', 'y', 'z'])

s.values
array([7, 8, 9])

s.index
Index(['x', 'y', 'z'], dtype='object')

Indexing a Series is similar to indexing a 1-dimensional ndarray

s = pd.Series([7, 8, 9, 10], list('xyxz'))
s['y'] # 8 … (as expected)

Using a list to specify multiple labels:

s[['y', 'z']] # Series! y  8
              #         z 10

Repeating a label repeats the value:

s[['y', 'y', 'z']] # Series y  8
                   #        y  8
                   #        z 10

Duplicate Labels 🔖👯

If a specified label maps to more than one value, all of those values are returned →

s = pd.Series([7, 8, 9, 10], list('xyxz'))

s['x'] # Series! x 7
       #         x 9

Indexing by Position

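Just like a numpy ndarray, you can still use position for indexing →

s = pd.Series([2, 3, 4, 5], list('abcd'))

Both of the following… →

s['a'] # by label
s.iloc[0] # by position (a bare s[0] is deprecated for non-integer labels in recent pandas)

…give us 2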

Slicing with Labels

Although indexing by labels and position is similar, there’s a pretty big gotcha when slicing ⚠️ →

slicing by position works as expected
slicing with labels is inclusive at the end

s = pd.Series([2, 3, 4, 5, 6], list('abcde'))

s[1:3]
b 3
c 4 # 👌

s['b':'d']
b 3
c 4
d 5 # WAT!? 😮
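
One way to keep the two slicing behaviors apart is to be explicit about which one is meant. The sketch below uses `.iloc` for positions and `.loc` for labels; choosing these accessors is a suggestion here, not something the slides prescribe:

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 6], list("abcde"))

# positional slice: the end position is excluded (labels b, c)
by_position = s.iloc[1:3]

# label slice: the end label is included (labels b, c, d)
by_label = s.loc["b":"d"]

print(by_position.tolist())  # [3, 4]
print(by_label.tolist())     # [3, 4, 5]
```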

Vectorized Arithmetic

Yup ✅ … works as you’d expect:

s = pd.Series([1, 2], ['x', 'y'])
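
A few illustrative operations on that Series (the result names are made up for this sketch); each one applies to every value and keeps the labels:

```python
import pandas as pd

s = pd.Series([1, 2], ["x", "y"])

doubled = s * 2    # x -> 2,  y -> 4
shifted = s + 10   # x -> 11, y -> 12
squared = s ** 2   # x -> 1,  y -> 4

print(doubled.tolist(), shifted.tolist(), squared.tolist())
```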

Label Alignment

If the other operand is a Series, operations will be done based on label alignment →

values for matching labels will be operated on
non-matching labels result in NaN (in pandas, this means NA or missing)
the union of labels will be the result of the operation

Let’s start off with a straightforward one; what’s the result of this operation? →

s = pd.Series([1, 2], ['x', 'y'])
s + pd.Series([9, 8], ['x', 'y']) # let’s add!

Tricky Label Alignment

Now for something a little trickier. What is the result of this operation? →

s = pd.Series([1, 2], ['x', 'y'])
s + pd.Series([9, 100], ['x', 'z']) # tricky!

Alignment Summary

Series operations align by label rather than position: →

if a label is present in one Series but missing from the other, the resulting index still includes it (the result uses the union of both indexes)
missing values (NaN) are inserted where labels do not match

s1 = pd.Series([1, 2], ['x', 'y'])
s2 = pd.Series([9, 100], ['x', 'z'])
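
Adding these two Series shows the summary in action. The sketch also uses `Series.add` with `fill_value=0`, which is one possible way (not covered in the slides) to treat a missing label as 0 instead of producing NaN:

```python
import pandas as pd

s1 = pd.Series([1, 2], ["x", "y"])
s2 = pd.Series([9, 100], ["x", "z"])

# only 'x' appears in both; 'y' and 'z' become NaN
total = s1 + s2

# treat a label missing from one side as 0 before adding
filled = s1.add(s2, fill_value=0)

print(list(total.index))   # ['x', 'y', 'z']
print(filled.tolist())     # [10.0, 2.0, 100.0]
```

Note that the results are floats: NaN is a floating-point value, so the integer data is upcast.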

Comparison Operations

Comparison operators work similarly to arithmetic operators, except, of course, they return boolean values… what are the results of the following comparisons? →

s = pd.Series([1, 2], ['x', 'y'])

x True # such vectorized!
y False

s == pd.Series([1, 2], ['x', 'y'])

x True # compared by value
y True

Nothing Compares 2 U

One big difference ⚠️ though… if the labels don’t align, you get an exception (ValueError)!

s = pd.Series([1, 2], ['x', 'y'])

# try this…
s == pd.Series([1, 99], ['x', 'z'])

# or this…
s == pd.Series([1], ['x'])

ValueError: Can only compare identically-labeled Series objects
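
A minimal sketch of the failure, plus one possible workaround: reindexing the other Series to the same labels before comparing. The workaround is an assumption about what you might want, since it silently treats missing labels as NaN (which never compares equal):

```python
import pandas as pd

s = pd.Series([1, 2], ["x", "y"])
other = pd.Series([1, 99], ["x", "z"])

try:
    s == other                      # labels differ: x, y vs. x, z
    raised = False
except ValueError as err:
    raised = True
    print(err)

# give `other` the same labels as `s`; 'y' has no data, so it becomes NaN
aligned = other.reindex(s.index)
result = s == aligned               # x: 1 == 1, y: 2 == NaN

print(result.tolist())              # [True, False]
```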

Filtering with Booleans

Just like a numpy ndarray, you can filter a Series with a list of booleans: →

s = pd.Series([2, 3, 4, 5]) # 0 2
                            # 1 3
                            # 2 4
                            # 3 5

What does the following expression give us? →

s[[True, False, True, False]]

0 2 # keeps 1st and 3rd (index 0 and 2)
2 4 # discards 2nd and 4th (index 1 and 3)

Using Results of Comparison to Filter

A common pattern is to use a Series of booleans returned from a comparison to filter out values:

What is the result of the following? →

s = pd.Series([5, 6, 7, 8], index=['A', 'B', 'C', 'D'])

s[s % 2 == 1]

A 5 # only odds, s % 2 == 1
C 7 # gives us booleans [T, F, T, F]
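
Boolean Series can also be combined before filtering. This sketch goes slightly beyond the slide and assumes the element-wise operators `&`, `|` and `~` (Python's `and`, `or` and `not` do not work on a Series):

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8], index=["A", "B", "C", "D"])

odd = s % 2 == 1   # [True, False, True, False]
big = s > 6        # [False, False, True, True]

odd_and_big = s[odd & big]   # only C
odd_or_big = s[odd | big]    # A, C, D
even = s[~odd]               # B, D

print(odd_and_big.tolist(), odd_or_big.tolist(), even.tolist())
```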

DataFrames 🖼

You can think of a DataFrame as: →

a rectangular table of data
or an ordered collection of columns
(where perhaps each column is a Series!)
(think a dict of Series objects!)

index and columns

In a DataFrame, both rows and columns have an index. The nomenclature is:

index – for row labels
columns – for column labels
data – again, the actual values are called data

data, index and columns can be specified when creating a new DataFrame

Creating DataFrames

Like Series, there are multiple ways to create DataFrames →

positional arguments
with an ndarray or other sequence types
with a dict of dict objects

using keyword arguments
mixing positional and keyword arguments

Each method allows different ways to specify data, index and columns

Implicit index and columns

Without the second or third arguments specified, index and columns are generated as 0 through the number of rows (or columns) – 1

# only data (index and columns generated)
pd.DataFrame([[1, 2, 3], [4, 5, 6]])

   0  1  2
0  1  2  3
1  4  5  6

# only data and index (columns generated)
pd.DataFrame([[1, 2, 3], [4, 5, 6]], ['r1', 'r2'])

    0  1  2
r1  1  2  3
r2  4  5  6

Creating DataFrames Continued

Of course, with all three arguments, you can explicitly set data, index, and columns →

pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]], # data
    ['r1', 'r2', 'r3'], # index
    ['A', 'B', 'C']) # columns

    A  B  C
r1  1  2  3
r2  4  5  6
r3  7  8  9

Nested Dictionaries

A nested dict can be used to explicitly define row labels and column names as well →

outer keys are column names
inner keys are row names

d = pd.DataFrame({
    "colA": {'r1': 6, 'r2': 7},
    "colB": {'r1': 8, 'r2': 9}
})

    colA  colB
r1     6     8
r2     7     9

Keyword Arguments

Like Series, you can mix and match with keyword arguments: →

In the following code, notice that:

data is passed in as a positional argument,
index is left out (to be generated automatically)
columns is defined as a keyword argument

pd.DataFrame([[1, 2, 3], [4, 5, 6]],
             columns=['A', 'B', 'C'])

   A  B  C
0  1  2  3
1  4  5  6

values index columns

The data, row labels and columns can all be retrieved by accessing attributes / properties on a DataFrame instance →

values – the data for the table
index – the row labels
columns – the column names
there’s also dtypes (one type per column) and the dtype of values…
since a DataFrame can hold a different type in each column…
*the dtype of values will be set to the type that can accommodate all the values in the DataFrame
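
The attributes above can be inspected as in the following sketch. The frames `numbers` and `mixed` are hypothetical examples; the point is that each column keeps its own type while `values` falls back to one common type:

```python
import numpy as np
import pandas as pd

numbers = pd.DataFrame({"i": [1, 2], "f": [1.5, 2.5]})

print(list(numbers.index))    # row labels: [0, 1]
print(list(numbers.columns))  # column names: ['i', 'f']
print(numbers.dtypes)         # one dtype per column (integer and float)

# ints and floats together are stored as floats in the combined array
numeric_dtype = numbers.values.dtype

# adding a string column forces the combined array to the generic object type
mixed = numbers.assign(s=["a", "b"])
mixed_dtype = mixed.values.dtype

print(numeric_dtype, mixed_dtype)
```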

Retrieving Columns

Columns can be retrieved by:

indexing with a single column name
(which may return a Series or DataFrame)

indexing with a list of column names to return a DataFrame

Using the following DataFrame, let’s check out some indexing possibilities →

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

   foo  bar  baz
0    4    5    6
1    7    8    9

Retrieving one Column

With a single column name, a column is returned as a Series →

df['foo']
0    4
1    7
# note that the name and type of the column are usually
# given too:
Name: foo, dtype: int64

type(df['foo']) # we get a Series back
pandas.core.series.Series

df['foo'].name # note the name attribute!
'foo'

Retrieving Multiple Columns pt 1!

If a column label occurs more than once, then indexing with that label returns a DataFrame of all the matching columns rather than a single Series →

d = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                 columns=['a', 'b', 'a'])
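
Indexing that frame with the repeated and the unique label shows the difference (the names `both` and `single` are only illustrative):

```python
import pandas as pd

d = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                 columns=["a", "b", "a"])

both = d["a"]     # 'a' occurs twice: a DataFrame with two columns
single = d["b"]   # 'b' occurs once: a Series

print(type(both).__name__, both.shape)   # DataFrame (2, 2)
print(type(single).__name__)             # Series
```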

Retrieving Multiple Columns pt 2!

When indexing with a list of column names (even if there’s only one name in the list), a DataFrame containing only the named columns is returned →

df[['foo', 'bar']]
   foo  bar
0    4    5
1    7    8

type(df[['foo', 'bar']])
pandas.core.frame.DataFrame

type(df[['foo']]) # list w/ 1 element
pandas.core.frame.DataFrame # (still!)

Rearrange / Repeat

Indexing can also be used to retrieve a new DataFrame with reordered columns and/or repeated columns →

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

What DataFrame will we get back from: →

df[['bar', 'bar', 'foo']]

   bar  bar  foo # bar is repeated
0    5    5    4 # and placed before
1    8    8    7 # foo

If Key Doesn’t Exist…

Regardless of whether or not a list or a single column is used for indexing into a DataFrame, a KeyError is raised if a key doesn’t exist →

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

df['dne']
df[['foo', 'dne']] # both of these are 🚫

KeyError: "['dne'] not in index"

DataFrame Slices Give Rows!

If a DataFrame is sliced BY POSITION, it yields rows rather than columns →

d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})

d[:2] # slicing refers to rows here!

    cA  cB  cC # only first two rows!
r1   1   4   7
r2   2   5   8

Indexing with List of Booleans / Arrays

Much like Series and ndarray, we can use a list or array of booleans to select parts of a DataFrame

a list/array of booleans filters DataFrame rows
the length of the booleans must match the number of rows

d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})

d[[False, True, True]]

    cA  cB  cC # gimme last two rows!
r2   2   5   8
r3   3   6   9

Constructing Boolean Selection

Again, we don’t have to manually create a Boolean array; it can be the result of a vectorized boolean comparison →

Let’s take this example…

d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})

Boolean Selection Continued

Retrieve the rows where column cA is more than 1

# d is cA cB cC
# r1 1 4 7
# r2 2 5 8
# r3 3 6 9

d['cA'] > 1 # r1 False
            # r2 True
            # r3 True

d[d['cA'] > 1] # 😘

    cA  cB  cC
r2   2   5   8
r3   3   6   9

Setting Values

Now that we know how to retrieve values with indexing… let’s see how we can set values with Series →

s = pd.Series([2, 3, 4, 5, 6], list('abcde'))

s['a'] = 100 # assigning with an index
s['b':'d'] = 200 # assigning with a slice
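
Running those two assignments and looking at the result may help; remember that the label slice includes its end label, so 'd' is overwritten too:

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 6], list("abcde"))

s["a"] = 100       # one label
s["b":"d"] = 200   # label slice: b, c and d (end label included)

print(s.tolist())  # [100, 200, 200, 200, 6]
```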

Setting Values, DataFrame

Using our usual example:…

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

Assigning a scalar sets all values of a column:

df['foo'] = 77

foo bar baz
0 77 5 6
1 77 8 9

Assignment Continued

Of course, you can set each value in a column to a specific value using a list or even a Series →

df['foo'] = [99, 100]

   foo  bar  baz
0   99    5    6
1  100    8    9

df['foo'] = pd.Series([-8, -9])

   foo  bar  baz
0   -8    5    6
1   -9    8    9

About That Series

As you might expect, if you assign a Series to a DataFrame column: →

the labels will be aligned to perform assignment
with… DataFrame labels missing from the Series filled with NaN
and extra labels in the Series (not matching any of the DataFrame’s labels) ignored

More DataFrame / Series Assignment

Pay attention to the mismatched labels… →

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  index=['r1', 'r2'],
                  columns=['foo', 'bar', 'baz'])
#     foo  bar  baz
# r1    4    5    6
# r2    7    8    9

df['foo'] = pd.Series([100, 200], ['r1', 'r3'])

      foo  bar  baz
r1  100.0    5    6
r2    NaN    8    9 # r2 has no match, so it is set to NaN
                    # r3 ignored

Assignment Errors

When assigning a list or ndarray to a column, the length of the data must match the number of rows in the DataFrame (a Series is aligned by label instead, so its length may differ). →

df['foo'] = [100]

ValueError: Length of values does not match length of index

Assignment + Boolean Selection

Note that indexing with an array, Series or list of booleans can be used in assignment as well →

d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
d[d['cA'] > 1] = 0

cA cB cC
r1 1 4 7
r2 0 0 0
r3 0 0 0
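
The assignment above zeroes every column of the selected rows. If only one column should change, one option (an addition to the slides, using the `.loc` accessor with a boolean row selector and a column name) looks like this:

```python
import pandas as pd

d = pd.DataFrame({"cA": {"r1": 1, "r2": 2, "r3": 3},
                  "cB": {"r1": 4, "r2": 5, "r3": 6},
                  "cC": {"r1": 7, "r2": 8, "r3": 9}})

# rows where cA > 1, but only column cB is overwritten
d.loc[d["cA"] > 1, "cB"] = 0

print(d["cB"].tolist())  # [4, 0, 0]
print(d["cC"].tolist())  # [7, 8, 9] (untouched)
```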

Adding Columns

If the column name used in assignment does not exist, a new column will be created →

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

df['qux'] = [20, 30] # qux is new!

foo bar baz qux
0 4 5 6 20
1 7 8 9 30

Removing Columns: a Mystery ⁉️

The .drop method on a DataFrame can be used to remove a column. Let’s try it: →

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

df.drop('baz') # 🤔

KeyError: "['baz'] not found in axis"
# 😮 what happened!?

Axis Flashback

Remember numpy… specifically the significance of .shape and axis? →

.shape describes the length of the dimensions of an ndarray
a two-dimensional ndarray has a .shape that’s a two-element tuple
what does the first element of that tuple represent? and the second?
the first, axis 0, represents rows
the second, axis 1, represents columns

Really Removing Columns

.drop takes axis as a keyword argument →

buuuut… its default value is 0 (rows! 😮)
to remove a column, use axis=1

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

df.drop('baz', axis=1)

foo bar # baz column
0 4 5 # was removed!
1 7 8
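
Two details are worth checking in a quick sketch: `.drop` returns a new DataFrame and leaves the original alone, and the `columns=` keyword is an alternative spelling of `axis=1`. The variable names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=["foo", "bar", "baz"])

without_baz = df.drop("baz", axis=1)   # drop a column by label
same_thing = df.drop(columns="baz")    # equivalent keyword form
without_row = df.drop(0)               # default axis=0: drops the row labeled 0

print(list(df.columns))           # ['foo', 'bar', 'baz'] (df is unchanged)
print(list(without_baz.columns))  # ['foo', 'bar']
print(list(without_row.index))    # [1]
```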

Dictionary Life (del)

Similar to deleting keys/values in dictionaries, the del statement can also be used to drop columns (this modifies the DataFrame in place) →

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

del df['baz']

foo bar # again, baz
0 4 5 # is removed!
1 7 8

Ok, How About Rows?

Indexing into rows can be done by indexing into the loc attribute / property of a DataFrame object. →

again, a Series is returned
the labels are the column names, though!

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

df.loc[1]

foo 7 # last row returned
bar 8 # (2nd row is index 1)
baz 9

Retrieving a Single Value

Remember that once you have a row, you can index into that as well. →

we can use .loc to get a row…
and then get a specific element from that row

What value would this retrieve? →

# df is foo bar baz
# 0 4 5 6
# 1 7 8 9

df.loc[1]['bar']
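
For reference, the chained lookup can be compared with two other spellings that pandas offers, `.loc` with a row and column label together and `.at` for a single scalar. The slides only show the chained form, so treat the alternatives as an aside:

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=["foo", "bar", "baz"])

chained = df.loc[1]["bar"]   # get the row as a Series, then index into it
direct = df.loc[1, "bar"]    # row label and column label in one step
scalar = df.at[1, "bar"]     # accessor meant for single values

print(chained, direct, scalar)  # 8 8 8
```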

Rows can be made into columns (and columns to rows) using transpose: →

Imagine that df looks like this DataTable:
