
3/8/22, 7:31 PM Pandas Basics
Pandas Basics CSCI-UA.0479-001
numpy + ┬──┬ =
Um… sort of. pandas is… →

pandas is a Python module that has data structures and tools for working with data. You'll find a lot of numpy-like functionality in it:
especially for array based computing and functions
and, of course, a style that favors vectorized array operations over for loops
However, unlike numpy, pandas specializes in dealing with tabular data composed of mixed data types
Some Types!
pandas offers a few types for manipulation of tabular data: →
Series – a one-dimensional, labeled array
DataFrame – a two-dimensional data structure (think of a table with columns and rows)
Bonus Type! Index
Index – the type that holds the labels for a Series and DataFrame
Let's check out a Series first!
You can think of a Series as: →
a numpy ndarray with labels for each value
… or a dict with ordered key/value pairs and potentially duplicate keys
… but officially (from the docs):
a one-dimensional labeled array
https://cs.nyu.edu/courses/spring22/CSCI-UA.0479-001/_site/slides/python/pandas-basics.html?print 1/28

index and value
A Series has two properties that show the labels and data it holds →
values – the actual data in the Series
index – the labels for the data in the Series
Creating a Series
There are several ways to create a Series, each resulting in different labels for the index →
1. using a single positional argument, data (an ndarray or sequence type like list), to specify values in Series
2. two positional (data and index) arguments with the second specifying the index labels
3. passing keyword arguments for data and index
4. passing in a dict with dictionary keys as labels and values as values
(can also be called with a specific index value)
Remember, the index provides a label for each element in a Series.
where the associated labels are collectively referred to as the index
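The four creation styles listed above can be sketched side by side. This is only an illustration: the variable names s1 through s4 are made up here, and the data is arbitrary.

```python
import numpy as np
import pandas as pd

# 1. one positional argument: labels default to 0..n-1
s1 = pd.Series(np.array([7, 8, 9]))
# 2. two positional arguments: data, then index labels
s2 = pd.Series([7, 8, 9], ['A', 'B', 'C'])
# 3. the same thing with keyword arguments
s3 = pd.Series(data=[7, 8, 9], index=['A', 'B', 'C'])
# 4. a dict: keys become labels, values become values
s4 = pd.Series({'A': 7, 'B': 8, 'C': 9})

print(list(s1.index))                 # the generated labels
print(s2.equals(s3), s3.equals(s4))   # styles 2-4 build the same Series
```

Styles 2 through 4 are interchangeable here; which one reads best mostly depends on whether the labels already live in a dict.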
Implicit Index
Without an index specified, the labels are simply 0 to length of values – 1. Check the
examples →
# an ndarray
pd.Series(np.array([7, 8, 9]))
0    7
1    8
2    9
dtype: int64
# a list
pd.Series(['ant', 'bat', 'cat'])
0    ant
1    bat
2    cat
dtype: object
That Looks Like numpy! But!
Hey… this actually looks just like an ndarray. Note the dtype property! However, you’ll see that there are a couple of major differences:
1. the obvious difference is that it has index labels (that can be repeated)
2. additionally, it supports different types in its values:
pd.Series(['ant', 'bat', 123])
0    ant
1    bat
2    123
dtype: object
Specifying Labels
So, um… if index labels are just gonna be 0 through length, then that’s just the same as an ndarray, right? Let’s specify labels by adding a second positional argument →
pd.Series(['Hoboken', 'Ithaca'], ['NJ', 'NY'])
NJ    Hoboken
NY     Ithaca
Oh yes. Duplicate. Labels. R. Allowed.
pd.Series(['Syracuse', 'Hoboken', 'Ithaca'],
          ['NY', 'NJ', 'NY']) # (line continuation)
NY    Syracuse
NJ     Hoboken
NY      Ithaca
With Arguments
You can also pass these arguments in as keyword arguments data and index (for labels) →
pd.Series(data=[7, 8])
0    7
1    8
len(data) == len(index)
The lengths of the data and index passed in must be the same.
If these lengths are different, you'll get a ValueError:
pd.Series([7, 8, 9], index=['A', 'B'])
ValueError: Length of passed values is 3, index implies 2
Creating Series with dict
Earlier, we described a Series as a dictionary that allows duplicate labels.
In fact, you can pass a dict into the Series constructor:
pd.Series({'B': 'bat', 'A': 'ant'})
dict keys become labels
dict values are the values in the Series
pd.Series(data=[7, 8], index=['A', 'B'])
A    7
B    8
pd.Series([7, 8, 9], index=['A', 'B', 'C'])
A    7
B    8
C    9
dict with index
pd.Series({'A': 'ant', 'B': 'bat'}, ['A', 'B']) # OK
NaN Means Missing Data
In pandas, NaN implies that a value is missing or “N/A” →
The pandas functions / instance methods, isnull and notnull can be used to check for missing values:
s = pd.Series({'x': 100}, ['x', 'y'])
pd.isnull(s) # or s.isnull()
x    False
y     True
pd.notnull(s) # or s.notnull()
x     True
y    False
If a key from data doesn't match an element in index, its value is not included.
pd.Series({'A': 'ant', 'B': 'bat'}, ['A'])
If an index label does not have a corresponding key in data, then missing data values will be NaN
pd.Series({'A': 'ant', 'B': 'bat'}, ['A', 'C'])
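Here's a small sketch of the dict-plus-index behavior together with the missing-value checks. The variable name animals is just for illustration, and dropna is an extra method not covered in the slides.

```python
import pandas as pd

animals = {'A': 'ant', 'B': 'bat'}

# 'B' is not in the index, so it is left out;
# 'C' has no key in the dict, so its value is NaN
s = pd.Series(animals, index=['A', 'C'])

print(list(s.index))        # labels come from the index argument
print(s.isnull().tolist())  # True marks the missing value
print(s.dropna().tolist())  # dropna() discards the missing entries
```

In other words, when both are given, the index argument decides which labels exist and the dict only supplies values for them.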
index and value Revisited
The index and values properties of a Series object can be used to retrieve the labels and
data from a Series
(note that this is slightly confusing, as the keyword arg is called data while the property is called values)
s = pd.Series([7, 8, 9], ['x', 'y', 'z'])
Indexing a Series is similar to indexing a 1-dimensional ndarray
s = pd.Series([7, 8, 9, 10], list('xyxz'))
s['y'] # 8 … (as expected)
Using a list to specify multiple labels:
s[['y', 'z']]      # Series!
                   # y    8
                   # z   10
Repeating a label repeats the value:
s[['y', 'y', 'z']] # Series
                   # y    8
                   # y    8
                   # z   10
Duplicate Labels
If a specified label maps to more than one value, all of the matching values are returned →
s = pd.Series([7, 8, 9, 10], list('xyxz'))
s['x'] # Series!
       # x    7
       # x    9
For s = pd.Series([7, 8, 9], ['x', 'y', 'z']):
s.values # array([7, 8, 9])
s.index  # Index(['x', 'y', 'z'], dtype='object')
Indexing by Position
Just like a numpy ndarray, you can still use position for indexing →
Slicing with Labels
Although indexing by labels and position is similar, there’s a pretty big gotcha when slicing
s = pd.Series([2, 3, 4, 5, 6], list('abcde'))
s['b':'d']
b    3
c    4
d    5 # WAT!?
slicing by position works as expected
slicing with labels is inclusive at the end
s = pd.Series([2, 3, 4, 5], list('abcd'))
Both of the following… →
s['a']
s[0]
…give us 2
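Recent pandas releases deprecate integer-position lookups like s[0] on a Series whose labels are not integers, so it can be safer to say which kind of lookup you mean. A minimal sketch using the .loc (label) and .iloc (position) accessors, which go beyond what the slides cover; variable names are illustrative:

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 6], list('abcde'))

by_label = s.loc['a']          # lookup by label
by_position = s.iloc[0]        # lookup by position

label_slice = s.loc['b':'d']   # label slice: end label is included
position_slice = s.iloc[1:3]   # position slice: end position is excluded

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

The two slices start at the same element but have different lengths, which is the same inclusive-end gotcha described in Slicing with Labels.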
Vectorized Arithmetic
Yup … works as you’d expect:
s = pd.Series([1, 2], ['x', 'y'])
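A few scalar operations on that Series, as a hedged sketch (the particular operations and the names doubled, shifted and squared are illustrative choices, not taken from the slides):

```python
import pandas as pd

s = pd.Series([1, 2], ['x', 'y'])

doubled = s * 2     # applied to every value, no loop needed
shifted = s + 10
squared = s ** 2

print(doubled.tolist(), shifted.tolist(), squared.tolist())
print(list(doubled.index))  # labels are carried over unchanged
```

Each result is a new Series with the same labels as s.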
Label Alignment
If the other operand is a Series, operations will be done based on label alignment →
values for matching labels will be operated on
non-matching labels result in NaN (in pandas, this means NA or missing)
the union of labels will be the index of the result
Let's start off with a straightforward one; what's the result of this operation? →
s = pd.Series([1, 2], ['x', 'y'])
s + pd.Series([9, 8], ['x', 'y']) # let's add!
Tricky Label Alignment
Now for something a little trickier. What is the result of this operation? →
s = pd.Series([1, 2], ['x', 'y'])
s + pd.Series([9, 100], ['x', 'z']) # tricky!
Alignment Summary
Series operations align by label rather than position: →
if labels aren't the same (present in one, missing from the other), then the resulting index will be the union of both sets of labels!
missing values are inserted where labels do not match
s1 = pd.Series([1, 2], ['x', 'y'])
s2 = pd.Series([9, 100], ['x', 'z'])
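To see the alignment rules in action, here is a sketch with s1 and s2. The second half uses Series.add with fill_value, which is not part of the slides; it is one way to treat a missing label as 0 instead of producing NaN.

```python
import pandas as pd

s1 = pd.Series([1, 2], ['x', 'y'])
s2 = pd.Series([9, 100], ['x', 'z'])

# only 'x' exists in both, so 'y' and 'z' come out as NaN
total = s1 + s2

# method form of +; a label missing from one side counts as 0
filled = s1.add(s2, fill_value=0)

print(list(total.index))
print(total.isnull().tolist())
print(filled.tolist())
```

Note that the results hold floats rather than ints, since NaN is a floating point value.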
Comparison Operations
Comparison operators work similarly to arithmetic operators, except, of course, they return
boolean values… what are the results of the following comparisons? →
s = pd.Series([1, 2], ['x', 'y'])
x True # such vectorized!
y False
s == pd.Series([1, 2], ['x', 'y'])
x True # compared by value
y True
Nothing Compares 2 U
One big difference though… if the labels don’t align, you get an exception (ValueError)!
s = pd.Series([1, 2], ['x', 'y'])
# try this…
s == pd.Series([1, 99], ['x', 'z'])
# or this…
s == pd.Series([1], ['x'])
ValueError: Can only compare identically-labeled Series objects
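If you do want a comparison that lines labels up the way arithmetic does, the method form Series.eq is one option. It is not covered in the slides, so treat this as a sketch; the names other, raised and aligned are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2], ['x', 'y'])
other = pd.Series([1, 99], ['x', 'z'])

# the == operator refuses to compare differently-labeled Series
try:
    s == other
    raised = False
except ValueError:
    raised = True

# .eq() aligns by label first; labels missing on either side compare as False
aligned = s.eq(other)

print(raised)
print(aligned.to_dict())
```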
Filtering with Booleans
Just like a numpy ndarray, you can filter a Series with a list of booleans: →
s = pd.Series([2, 3, 4, 5]) # 0 2
                            # 1 3
                            # 2 4
                            # 3 5
What does the following expression give us? →
s[[True, False, True, False]]
0 2 # keeps 1st and 3rd (index 0 and 2)
2 4 # discards 2nd and 4th (index 1 and 3)
Using Results of Comparison to Filter
A common pattern is to use a Series of booleans returned from a comparison to filter out values:
What is the result of the following? →
s = pd.Series([5, 6, 7, 8], index=['A', 'B', 'C', 'D'])
s[s % 2 == 1]
A 5 # only odds, s % 2 == 1
C 7 # gives us booleans [T, F, T, F]
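The same pattern extends to combined conditions. A sketch, assuming we want odd values greater than 5; the & and ~ operators on boolean Series are standard pandas, but the condition itself and the names mask, picked and not_picked are made up for illustration.

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8], index=['A', 'B', 'C', 'D'])

# each comparison yields a boolean Series; & combines them element by element
# (the parentheses are needed because & binds tighter than ==)
mask = (s % 2 == 1) & (s > 5)

picked = s[mask]        # values where the mask is True
not_picked = s[~mask]   # ~ flips every boolean

print(picked.to_dict())
print(list(not_picked.index))
```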
DataFrames
You can think of a DataFrame as: →
a rectangular table of data
or an ordered collection of columns
(where perhaps each column is a Series!)
(think a dict of Series objects!)
index and columns
In a DataFrame, both rows and columns have an index. The nomenclature is:
index – for row labels
columns – for column labels
data – again, the actual values are called data
data, index and columns can be specified when creating a new DataFrame
Creating DataFrames
Like Series, there are multiple ways to create DataFrames →
positional arguments
with an ndarray or other sequence types
with a dict of dict objects
using keyword arguments
mixing positional and keyword arguments
Each method allows different ways to specify data, index and columns
Implicit index and columns
Without the second or third arguments specified, index and columns are generated as 0 to
length of rows or cols – 1
# only data (index and columns generated)
pd.DataFrame([[1, 2, 3], [4, 5, 6]])
   0  1  2
0  1  2  3
1  4  5  6
# only data and index (columns generated)
pd.DataFrame([[1, 2, 3], [4, 5, 6]], ['r1', 'r2'])
    0  1  2
r1  1  2  3
r2  4  5  6
Creating DataFrames Continued
Of course, with all three, you can explicitly set data, index, and columns →
pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]], # data
    ['r1', 'r2', 'r3'],                # index
    ['A', 'B', 'C'])                   # columns
    A  B  C
r1  1  2  3
r2  4  5  6
r3  7  8  9
Nested Dictionaries
A nested dict can be used to explicitly define row labels and column names as well →
outer keys are column names
inner keys are row names
d = pd.DataFrame({
    "colA": {'r1': 6, 'r2': 7},
    "colB": {'r1': 8, 'r2': 9}
})
    colA  colB
r1     6     8
r2     7     9
Keyword Arguments
Like Series, you can mix and match with keyword arguments: →
In the following code, notice that:
data is passed in as a positional argument,
values index columns
The data, row labels and columns can all be retrieved by accessing attributes / properties on a DataFrame instance →
values – the data for the table
index – the row labels
columns – the column names
there's also dtype… since a DataFrame and Series can hold different types, the dtype of values will be set to the type that can accommodate all the values in the DataFrame
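A quick sketch of those attributes on a small DataFrame with one integer column and one float column (the column names n and x are invented for the example). It also shows dtypes, the per-column counterpart that a DataFrame provides.

```python
import pandas as pd

df = pd.DataFrame({'n': [1, 2], 'x': [1.5, 2.5]}, index=['r1', 'r2'])

print(list(df.index))     # row labels
print(list(df.columns))   # column names
print(df.dtypes)          # one dtype per column
print(df.values)          # all of the data as a single ndarray
print(df.values.dtype)    # a single dtype that fits every column
```

Here the integers are converted to floats in values, since float64 is the type that can hold both columns.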
Retrieving Columns
Columns can be retrieved by:
indexing with a single column name
(which may return a Series or DataFrame)
indexing with a list of column names to return a DataFrame
Using the following DataFrame, let's check out some indexing possibilities →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
   foo  bar  baz
0    4    5    6
1    7    8    9
index is left out (to be generated automatically)
columns is defined as a keyword argument
pd.DataFrame([[1, 2, 3], [4, 5, 6]],
             columns=['A', 'B', 'C'])
   A  B  C
0  1  2  3
1  4  5  6
Retrieving Multiple Columns pt 1!
If a column label occurs more than once, then a DataFrame of multiple columns is
returned rather than a single Series →
d = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                 columns=['a', 'b', 'a'])
d['a']
   a  a
0  4  6
1  7  9
Retrieving one Column
With a single column name, a column is returned as a Series →
df['foo']
0    4
1    7
# note that the type and name of the column are usually
# given too:
Name: foo, dtype: int64
type(df['foo']) # we get a series back
pandas.core.series.Series
df['foo'].name # note the name attribute!
'foo'
Retrieving Multiple Columns pt 2!
When indexing with a list of column names (even if there's only one name in the list), a DataFrame is returned containing only the columns named in the list →
Rearrange / Repeat
Indexing can also be used to retrieve a new DataFrame with reordered columns and/or
repeated columns →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
What DataFrame will we get back from: →
df[['bar', 'bar', 'foo']]
   bar  bar  foo # bar is repeated
0    5    5    4 # and placed before
1    8    8    7 # foo
df[['foo', 'bar']]
type(df[['foo', 'bar']])
pandas.core.frame.DataFrame
type(df[['foo']]) # list w/ 1 element
pandas.core.frame.DataFrame # (still!)
If Key Doesn’t Exist…
Regardless of whether a list or a single column name is used for indexing into a DataFrame,
a KeyError is raised if a key doesn't exist →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df[['foo', 'dne']] # both of these are
KeyError: "['dne'] not in index"
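If a missing column is a possibility rather than a bug, you can check before indexing. A sketch of two dict-like options, the in operator and the get method, neither of which appears in the slides; the names has_dne, maybe and raised are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

has_dne = 'dne' in df     # membership test looks at column labels
maybe = df.get('dne')     # returns None instead of raising

# plain indexing still raises for a missing column
try:
    df['dne']
    raised = False
except KeyError:
    raised = True

print(has_dne, maybe, raised)
```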
DataFrame Slices Give Rows!
If a DataFrame is sliced BY POSITION, it yields rows rather than columns →
d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
d[:2] # slicing refers to rows here!
    cA  cB  cC # only first two rows!
r1   1   4   7
r2   2   5   8
Indexing with List of Booleans / Arrays
Much like Series and ndarray, we can use a list or array of booleans to select parts of a
DataFrame:
a list/array of booleans filters DataFrame rows
the length of the booleans must match the number of rows
d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
d[[False, True, True]]
    cA  cB  cC # gimme last two rows!
r2   2   5   8
r3   3   6   9
Constructing Boolean Selection
Again, we don’t have to manually create a Boolean array; it can be the result of a vectorized boolean comparison →
Boolean Selection Continued
Retrieve the rows where column cA is more than 1
# d is   cA cB cC
#     r1  1  4  7
#     r2  2  5  8
#     r3  3  6  9
d['cA'] > 1 # r1 False
            # r2 True
            # r3 True
d[d['cA'] > 1]
    cA  cB  cC
r2   2   5   8
r3   3   6   9
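Two small extensions of that pattern, sketched with the same d: combining conditions with &, and using .loc to filter rows and pick columns in one step. The .loc accessor goes beyond these slides, and the names rows and block are illustrative.

```python
import pandas as pd

d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})

# rows where both conditions hold (parentheses are required around each one)
rows = d[(d['cA'] > 1) & (d['cC'] < 9)]

# .loc takes a row selector and a column selector together
block = d.loc[d['cA'] > 1, ['cA', 'cC']]

print(rows)
print(block)
```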
Let’s take this example…
d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
Setting Values
Now that we know how to retrieve values with indexing… let’s see how we can set values with
s = pd.Series([2, 3, 4, 5, 6], list('abcde'))
s['a'] = 100 # assigning with an index
s['b':'d'] = 200 # assigning with a slice
a    100
b    200
c    200
d    200
e      6
Setting Values, DataFrame
Using our usual example:…
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
Assigning a scalar sets all values of a column:
df['foo'] = 77
foo bar baz
0 77 5 6
1 77 8 9
Assignment Continued
Of course, you can set each value in a column to a specific value using a list or even a
Series:
df['foo'] = [99, 100]
   foo  bar  baz
0   99    5    6
1  100    8    9
df['foo'] = pd.Series([-8, -9])
   foo  bar  baz
0   -8    5    6
1   -9    8    9
About That Series
As you might expect, if you assign a Series to a DataFrame column: →
More DataFrame / Series Assignment
Pay attention to the mismatched labels… →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  index=['r1', 'r2'],
                  columns=['foo', 'bar', 'baz'])
#     foo  bar  baz
# r1    4    5    6
# r2    7    8    9
df['foo'] = pd.Series([100, 200], ['r1', 'r3'])
      foo  bar  baz
r1  100.0    5    6
r2    NaN    8    9 # r2 is set to NaN
                    # r3 ignored
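If you actually want the Series values placed by position, one option is to strip the labels before assigning. This is a sketch, not something from the slides: to_numpy() converts the Series to a plain ndarray, and the names new_values, aligned and positional are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  index=['r1', 'r2'],
                  columns=['foo', 'bar', 'baz'])
new_values = pd.Series([100, 200], ['r1', 'r3'])

aligned = df.copy()
aligned['foo'] = new_values                # matched by label: r2 gets NaN

positional = df.copy()
positional['foo'] = new_values.to_numpy()  # labels dropped: matched by position

print(aligned)
print(positional)
```

Once the labels are gone, the usual length rule for lists and arrays applies again.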
Assignment Errors
When assigning a list/ndarray to a column, the length of the data must match
the length of the DataFrame column. →
df['foo'] = [100]
ValueError: Length of values does not match length of index
When a Series is assigned to a DataFrame column, the labels will be aligned to perform assignment
with… DataFrame labels missing from the Series filled with NaN
and extra labels in the Series (not matching any of the DataFrame's labels) ignored
Assignment + Boolean Selection
Note that indexing with an array, Series or list of booleans can be used in assignment
d = pd.DataFrame({"cA": {'r1': 1, 'r2': 2, 'r3': 3},
                  "cB": {'r1': 4, 'r2': 5, 'r3': 6},
                  "cC": {'r1': 7, 'r2': 8, 'r3': 9}})
Adding Columns
If the column name used in assignment does not exist, a new column will be created →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df['qux'] = [20, 30] # qux is new!
foo bar baz qux
0 4 5 6 20
1 7 8 9 30
Removing Columns: a Mystery
The .drop method on a DataFrame can be used to remove a column. Let's try it: →
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df.drop('baz')
KeyError: "['baz'] not found in axis" # what happened!?
# boolean selection used in assignment (d as defined earlier)
d[d['cA'] > 1] = 0
    cA  cB  cC
r1   1   4   7
r2   0   0   0
r3   0   0   0
Axis Flashback
Really Removing Columns
.drop takes axis as a keyword argument →
buuuut… its default value is 0 (rows!)
to remove a column, use axis=1
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
df.drop('baz', axis=1)
   foo  bar # baz column
0    4    5 # was removed!
1    7    8
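Two details worth a quick sketch: .drop also accepts a columns keyword, which avoids having to remember the axis number, and it returns a new DataFrame rather than changing df. The name without_baz is illustrative.

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])

# same effect as df.drop('baz', axis=1)
without_baz = df.drop(columns='baz')

print(list(without_baz.columns))  # baz is gone in the result
print(list(df.columns))           # the original df is untouched
```

To keep the change, assign the result back, for example df = df.drop(columns='baz').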
Dictionary Life (del)
Similar to deleting keys/values in dictionaries, the del keyword can also be used to drop
columns:
df = pd.DataFrame([[4, 5, 6], [7, 8, 9]],
                  columns=['foo', 'bar', 'baz'])
del df['baz']
   foo  bar # again, baz
0    4    5 # is removed!
1    7    8
Remember numpy… specifically the significance of .shape and
