The Pandas Library
Introduction to Pandas
Pandas is an open-source Python library providing high-performance data structures and data analysis tools.
>>> import pandas as pd
Often used in conjunction with numpy and matplotlib
>>> import numpy as np
>>> import matplotlib.pyplot as plt
Reading Files in Pandas
Use the read_csv(file_path) function
import pandas as pd
url = "http://analytics.romanko.ca/data/customer_dbase_sel.csv"
df = pd.read_csv(url)
Display a number of rows from the dataset
>>> df.head(n = 5)
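The same read-and-preview steps can be tried without network access. This is a minimal sketch that assumes a few made-up rows: only the column names come from the dataset above, and the values and the names csv_text, sample_df and first_two are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset above.
csv_text = """custid,gender,age,annual_income
1,Female,34,52000
2,Male,51,28000
3,Female,27,31000
"""

# read_csv accepts a path, a URL, or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows as a new DataFrame.
first_two = sample_df.head(n=2)
print(first_two)
```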
Data Exploration in Pandas -1
Get summary stats for the data
>>> df.describe()
Data Exploration in Pandas -2
Data Exploration in Pandas -3
Get further information about the data.
> Get the number of rows and columns in the data
>>> df.shape
(5000, 30)
> Display column names
>>> df.columns
Index(['custid', 'gender', 'age', 'age_cat', 'debtinc', 'card', 'carditems',
       'cardspent', 'cardtype', 'creddebt', 'commute', 'commutetime', 'card2',
       'card2items', 'card2spent', 'card2type', 'marital', 'homeown',
       'hometype', 'cars', 'carown', 'region', 'ed_cat', 'ed_years', 'job_cat',
       'employ_years', 'emp_cat', 'retire', 'annual_income', 'inc_cat'],
      dtype='object')
Example: label customers in the data into high-income and low-income
df['income_category'] = df['annual_income'].map(lambda x: 1 if x > 30000 else 0)
df[['annual_income', 'income_category']].head(n = 5)
Pandas – “Labeling” Data
Lambda functions:
lambda [arguments] : expression
lambda x ,y : x * y
lambda x : 1 if x>30000 else 0
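The lambda forms above can be sketched on a small Series. The incomes here are made up for illustration, and the vectorized comparison at the end is an alternative not shown in the slides that should give the same labels.

```python
import pandas as pd

# Hypothetical incomes, used only to illustrate the labeling step.
incomes = pd.Series([12000, 30000, 45000, 80000], name="annual_income")

# A lambda is an anonymous function; this one takes two arguments.
multiply = lambda x, y: x * y
product = multiply(3, 4)

# map applies the function to each element in turn.
labels_map = incomes.map(lambda x: 1 if x > 30000 else 0)

# Vectorized comparison that yields the same labels without a Python-level loop.
labels_vec = (incomes > 30000).astype(int)
```

Note that an income of exactly 30000 falls in the low-income group, because the test is a strict greater-than.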
Drop NaN (Not-a-Number) observations:
>>> df[['commutetime']].dropna().count()
commutetime 4998
Print observations with NaN commutetime:
>>> print(df[np.isnan(df["commutetime"])])
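A minimal sketch of the same cleaning steps on a made-up frame. The values are hypothetical, and isna() is used here in place of np.isnan because it also works on non-numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical commute times with two missing entries.
frame = pd.DataFrame({"custid": [1, 2, 3, 4],
                      "commutetime": [22.0, np.nan, 35.0, np.nan]})

# isna() flags missing values; the boolean mask selects those rows.
missing_rows = frame[frame["commutetime"].isna()]

# dropna(subset=...) removes rows that are missing in the named columns only.
cleaned = frame.dropna(subset=["commutetime"])

# Count of missing values in the column.
n_missing = int(frame["commutetime"].isna().sum())
```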
Pandas – Data Cleaning
Pandas – Visualizing Data
Selecting/reducing the number of columns
viz = df[['cardspent', 'debtinc', 'carditems', 'commutetime']]
viz.head()
Get a histogram of the data and plot it
df[['cardspent']].hist()
plt.show()
Pandas – Plotting Data -1
Get a histogram of the data and plot it
>>> viz.hist()
>>> plt.show()
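The plotting calls can be sketched in a script that runs without a display. The data is randomly generated as a stand-in for two of the numeric columns, and the Agg backend and in-memory buffer are choices made for this example, not part of the slides.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data standing in for two of the numeric columns.
rng = np.random.default_rng(0)
viz_demo = pd.DataFrame({"cardspent": rng.gamma(2.0, 150.0, size=200),
                         "debtinc": rng.normal(10.0, 3.0, size=200)})

# DataFrame.hist draws one histogram per numeric column and returns the axes.
axes = viz_demo.hist(bins=20, figsize=(8, 3))
plt.tight_layout()

# Render to an in-memory PNG instead of calling plt.show().
buf = io.BytesIO()
plt.savefig(buf, format="png")
plt.close("all")
```

In an interactive session, plt.show() would replace the savefig call.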
Pandas – Plotting Data -2
Pandas Data Structures - Series
One-dimensional labeled array
Holds any data type (integers, strings, floating point numbers, Python objects, etc.)
The axis labels are collectively referred to as the index.
>>> s = pd.Series(data, index=index)
data: a dictionary, an ndarray, or a scalar value (e.g., 11)
index: a list of axis labels
Series from an ndarray
>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> s
a 0.2735
b 0.6052
c -0.1692
d 1.8298
e 0.5432
dtype: float64
>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> pd.Series(np.random.randn(5))
0 0.3674
1 -0.8230
2 -1.0295
3 -1.0523
4 -0.8502
dtype: float64
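Because np.random.randn draws new values on every call, the numbers above will differ from run to run. A small sketch, assuming NumPy's seeded Generator API is acceptable in place of randn, makes the example repeatable:

```python
import numpy as np
import pandas as pd

# A seeded generator makes the random values repeatable between runs.
rng = np.random.default_rng(42)
values = rng.standard_normal(5)

# Explicit labels become the index.
labeled = pd.Series(values, index=["a", "b", "c", "d", "e"])

# Without an index argument, pandas falls back to the positions 0..4.
unlabeled = pd.Series(values)
```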
Series from a dictionary
If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
If no index is passed, an index will be constructed from the keys of the dict: in insertion order on recent versions (pandas 0.23 or later with Python 3.6 or later), in sorted order on older ones.
>>> d = {'a' : 0., 'b' : 1., 'c' : 2.}
>>> pd.Series(d)
a    0.0
b    1.0
c    2.0
dtype: float64
>>> pd.Series(d, index=['b', 'c', 'd', 'a'])
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
NOTE: NaN is the standard missing data marker used in pandas
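A short sketch of how the missing-data marker arises when an index label has no matching dict key. The names picked and missing_labels are introduced here for illustration.

```python
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}

# Labels found in the dict keep their values; "d" has no entry and becomes NaN.
picked = pd.Series(d, index=["b", "c", "d", "a"])

# isna() reports which labels ended up with the missing-data marker.
missing_labels = picked[picked.isna()].index.tolist()
```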
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index
>>> pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
Series from a scalar value
Series acts very similarly to an ndarray, and is a valid argument to most NumPy functions.
>>> s.iloc[0]
0.27348116325673794
>>> s.iloc[:3]
a 0.2735
b 0.6052
c -0.1692
dtype: float64
A Series is like a fixed-size dict in that you can get and set values
by index label:
>>> s['a']
0.27348116325673794
>>> s['e'] = 12.
>>> s.get('a')
0.27348116325673794
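Both behaviours can be seen side by side in a small sketch. The Series s_demo and its values are made up, and .iloc is used for position-based access as the explicit form.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([0.25, 0.5, -0.75], index=["a", "b", "c"])

# ndarray-like: NumPy functions apply element-wise and keep the index.
absolute = np.abs(s_demo)

# Position-based access goes through .iloc.
first_two = s_demo.iloc[:2]

# dict-like: read and assign by label.
s_demo["c"] = 12.0

# get() returns a default instead of raising KeyError for an unknown label.
fallback = s_demo.get("z", default=-1.0)
```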
Series Behaviour
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled rows and columns.
Dictionary-like container for Series objects.
Arithmetic operations align on both row and column labels.
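Label alignment can be illustrated with two small frames whose rows and columns only partly overlap. The frames left and right are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0], "y": [10.0, 20.0]}, index=["a", "b"])
right = pd.DataFrame({"y": [1.0, 2.0], "z": [5.0, 6.0]}, index=["b", "c"])

# Addition matches cells by row label and column label, not by position.
# Cells present in only one operand come out as NaN.
total = left + right
```

Only the cell at row 'b', column 'y' exists in both frames, so it is the only non-missing value in the result.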
DataFrame Objects
data : numpy ndarray, dictionary (of Series, arrays, constants, or list-like objects), or DataFrame
index : Index or array-like to use for resulting frame.
columns : Index or array-like, labels to use for resulting frame.
dtype : data type (default None), to force; otherwise inferred.
copy : boolean (default False), to copy data from inputs.
numpy.random.randn returns a sample (or samples) from the “standard normal” distribution.
DataFrame – Parameters
>>> d = {'col1': ts1, 'col2': ts2}
>>> df1 = pd.DataFrame(data = d, index = index)
>>> df2 = pd.DataFrame(np.random.randn(10, 5))
>>> df3 = pd.DataFrame(np.random.randn(10, 5),
...                    columns=['a', 'b', 'c', 'd', 'e'])
The result index will be the union of the indexes of the various Series.
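The slides do not define ts1 and ts2, so the sketch below assumes they are two Series with partly overlapping labels. The values are hypothetical, and a seeded generator stands in for randn.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for ts1 and ts2: two Series with overlapping labels.
ts1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
ts2 = pd.Series([10.0, 20.0], index=["b", "d"])

# With no index argument, the row labels are the union of both Series' labels.
df1 = pd.DataFrame({"col1": ts1, "col2": ts2})

# From a 2-D ndarray, with explicit column labels.
rng = np.random.default_rng(0)
df3 = pd.DataFrame(rng.standard_normal((10, 5)),
                   columns=["a", "b", "c", "d", "e"])
```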
DataFrames from Series or dictionaries
>>> d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
...      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
>>> df = pd.DataFrame(d)
>>> pd.DataFrame(d, index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
The row and column labels can be accessed, respectively, by accessing the index and columns attributes:
>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns
Index(['one', 'two'], dtype='object')
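Beyond the index and columns attributes, individual rows and columns are usually selected with [], .loc and .iloc. A brief sketch using the same dictionary of Series (df_demo is a name introduced here):

```python
import pandas as pd

d = {"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
     "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])}
df_demo = pd.DataFrame(d)

col = df_demo["one"]             # a column, returned as a Series
row = df_demo.loc["b"]           # a row selected by label
first = df_demo.iloc[0]          # a row selected by integer position
cell = df_demo.loc["d", "two"]   # a single value by row and column label
```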
Accessing Rows and Columns
Index: an immutable ndarray implementing an ordered, sliceable set. It is the basic object storing axis labels for all pandas objects.
Parameters:
data : array-like (1-dimensional)
dtype : NumPy dtype (default: object)
copy : bool. Make a copy of input ndarray.
name : object. Name to be stored in the index.
tupleize_cols : bool (default: True). When True, attempt to create a MultiIndex if possible.
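A short sketch of the set-like and immutable behaviour described above. The labels and the name "label" are made up for the example.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"], name="label")
other = pd.Index(["b", "c", "d"])

# Set-style operations return new Index objects.
both = idx.intersection(other)
either = idx.union(other)

# Item assignment is rejected because an Index is immutable.
try:
    idx[0] = "z"
    mutated = True
except TypeError:
    mutated = False
```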
Index Attributes
Index Methods
Reshaping DataFrame Objects by pivoting -1
Reshaping by pivoting DataFrame objects
Data is often stored in CSV files or databases in so-called “stacked” or “record” format:
          date variable     value
0   2000-01-03        A  0.469112
1   2000-01-04        A -0.282863
2   2000-01-05        A -1.509059
3   2000-01-03        B -1.135632
4   2000-01-04        B  1.212112
5   2000-01-05        B -0.173215
6   2000-01-03        C  0.119209
7   2000-01-04        C -1.044236
8   2000-01-05        C -0.861849
9   2000-01-03        D -2.104569
10  2000-01-04        D -0.494929
11  2000-01-05        D  1.071804
Reshaping DataFrame Objects by pivoting -2
To select out everything for variable A we could do:
>>> df[df['variable'] == 'A']
         date variable     value
0  2000-01-03        A  0.469112
1  2000-01-04        A -0.282863
2  2000-01-05        A -1.509059
For time series operations a better representation would have the columns as unique variables and an index of dates identifying individual observations.
To reshape the data use the pivot function:
>>> df.pivot(index = 'date', columns = 'variable', values = 'value')
variable           A         B         C         D
date
2000-01-03  0.469112 -1.135632  0.119209 -2.104569
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804
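The reshaping steps can be reproduced end to end. This sketch rebuilds the stacked frame from the values shown in the slides; the construction via a list of records and the names stacked, only_a and wide are choices made here for illustration.

```python
import pandas as pd

dates = ["2000-01-03", "2000-01-04", "2000-01-05"]
values = {"A": [0.469112, -0.282863, -1.509059],
          "B": [-1.135632, 1.212112, -0.173215],
          "C": [0.119209, -1.044236, -0.861849],
          "D": [-2.104569, -0.494929, 1.071804]}

# Build the stacked (long) frame: one row per (date, variable) pair.
records = [(date, var, val)
           for var, column in values.items()
           for date, val in zip(dates, column)]
stacked = pd.DataFrame(records, columns=["date", "variable", "value"])
stacked["date"] = pd.to_datetime(stacked["date"])

# Boolean filter: every observation for variable A.
only_a = stacked[stacked["variable"] == "A"]

# pivot needs each (index, columns) pair to be unique; duplicates raise ValueError.
wide = stacked.pivot(index="date", columns="variable", values="value")
```

When the stacked data may contain duplicate (date, variable) pairs, pivot_table with an aggregation function is the usual alternative.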
Computing Descriptive Statistics
DataFrame.describe(percentiles=None, include=None, exclude=None)
Generate various summary statistics, excluding NaN values.
Parameters:
percentiles : array-like, optional. The percentiles to include in the output. Should all be in the interval [0, 1].
include, exclude : list-like, 'all', or None (default). Specify the form of the returned result. Either:
– None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns.
– A list of dtypes or strings to be included/excluded.
– If include='all', the output column-set will match the input one.
Returns: summary statistics
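The effect of the percentiles and include parameters can be sketched on a small mixed-type frame. The rows are hypothetical; the column names echo the customer dataset used earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame; the column names echo the dataset above.
people = pd.DataFrame({"age": [25, 40, 31, 58],
                       "annual_income": [28000.0, 52000.0, np.nan, 91000.0],
                       "gender": ["Female", "Male", "Female", "Male"]})

# Default: numeric columns only; NaN values are skipped.
numeric_summary = people.describe()

# Custom percentiles must lie in the interval [0, 1].
tails = people.describe(percentiles=[0.1, 0.9])

# include="all" also summarises the non-numeric column.
everything = people.describe(include="all")
```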