COMP3115 Exploratory Data Analysis and Visualization
Lecture 3: Python libraries for data analytics: NumPy, Matplotlib and Pandas
Introduction to Python Programming (Cont’d) – refer to week 2’s slides
Exploratory Data Analysis by Simple Summary Statistics (Cont’d) – refer to week 1’s slides
A Brief Introduction to Python Libraries for Data Analytics: NumPy, Matplotlib and Pandas
An Example of using Pandas to Explore Data by Summary Statistics
NumPy stands for Numerical Python and it is the fundamental package for scientific computing with Python. NumPy is a Python library for handling multi-dimensional arrays.
It contains both the data structures needed for storing and accessing arrays, and the operations and functions for computation using these arrays.
Unlike lists, an array must have the same data type for all of its elements.
The homogeneity of arrays allows highly optimized functions that use arrays as their inputs and outputs.
The major features of NumPy
Easily generate and store data in memory in the form of multidimensional array
Easily load and store data on disk in binary, text, or CSV format
Support efficient operations on data arrays, including basic arithmetic and logical operations, shape manipulation, data sorting, data slicing, linear algebra, statistical operation, discrete Fourier transform, etc.
Vectorised computation: simple syntax for element-wise operations without using loops (e.g., a = b + c where a, b, and c are three multidimensional arrays with the same shape).
How to use NumPy?
In order to use NumPy, we need to import the numpy module first. A widely used convention is to use np as a short name for numpy.
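As a minimal sketch, the import convention and the vectorised (element-wise) computation mentioned above look like this:

    import numpy as np  # standard abbreviation

    b = np.array([1, 2, 3])
    c = np.array([10, 20, 30])
    a = b + c            # element-wise addition, no explicit loop needed
    print(a)             # [11 22 33]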
Usages of high-dimensional arrays in data analysis
Store matrices, solve systems of linear equations, compute eigenvalues/eigenvectors, matrix decompositions, …
Images and videos can be represented as NumPy arrays
A 2-dimensional table might store an input data matrix in data analysis, where each row represents a sample and each column represents a feature (this convention is commonly used in Scikit-learn).
Representation of 2-dimensional table
We obtain information about cases (records) in a dataset, and generally record the information for each case in a row of a data table.
A variable is any characteristic that is recorded for each case. The variables generally correspond to the columns in a data table.
List vs. Array
Arrays need an extra import/declaration while lists don’t.
Lists are used more often of the two, and they work fine most of the time.
However, if you’re going to perform arithmetic operations on your data, you should really be using arrays instead.
Arrays store your data more compactly and efficiently.
Creation of arrays
Four different approaches to create ndarray objects
1. Use the numpy.array() function to generate an ndarray object from any sequence-like object (e.g., list and tuple)
2. Use the built-in functions (e.g., np.zeros(), np.ones()) to generate some special ndarray objects. Use help() to find out the details of each function.
3. Generate ndarray with random numbers (random sampling). The numpy.random module provides functions to generate arrays of sample values from popular probability distributions.
4. Save an ndarray to a disk file, and read an ndarray from a disk file (e.g., np.save() and np.load())
1. Creation of arrays (numpy.array())
Import the NumPy library
– Suggested to use the standard abbreviation np
Give a (nested) list as a parameter to the array constructor
– One dimensional array: list
– Two dimensional array: list of lists
– Three dimensional array: list of lists of list
One dimensional array, Two dimensional array, Three dimensional array
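A small sketch of creating one-, two- and three-dimensional arrays from (nested) lists:

    import numpy as np

    a1 = np.array([1, 2, 3])                      # one-dimensional: list
    a2 = np.array([[1, 2, 3], [4, 5, 6]])         # two-dimensional: list of lists
    a3 = np.array([[[1, 2], [3, 4]],
                   [[5, 6], [7, 8]]])             # three-dimensional: list of lists of lists
    print(a1.ndim, a2.ndim, a3.ndim)              # 1 2 3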
In a two-dimensional array, you have rows and columns. The rows correspond to “axis 0” while the columns correspond to “axis 1”.
The number of axes goes up accordingly with the number of dimensions.
2. Creation of arrays (built-in functions)
Useful function to create common types of arrays
– np.zeros(): all elements are 0s
– np.ones(): all elements are 1s
– np.full(): all elements set to a specified value
– np.empty(): all elements are uninitialized
– np.eye(): identity matrix: a matrix with elements on the diagonal are 1s, others are 0s
– np.arange(): generate evenly spaced values within a given interval.
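For illustration, a few of these creation functions in use:

    import numpy as np

    print(np.zeros((2, 3)))      # 2x3 array of 0.0
    print(np.ones((2, 3)))       # 2x3 array of 1.0
    print(np.full((2, 2), 7))    # 2x2 array filled with 7
    print(np.eye(3))             # 3x3 identity matrix
    print(np.arange(0, 10, 2))   # [0 2 4 6 8], evenly spaced values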
3. Creation of arrays (random sampling)
The numpy.random module provides functions to generate arrays of sample values from popular probability distributions.
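A brief sketch of sampling from some common distributions:

    import numpy as np

    u = np.random.uniform(0, 1, size=5)        # 5 samples from Uniform(0, 1)
    n = np.random.normal(0, 1, size=(2, 3))    # 2x3 samples from Normal(0, 1)
    k = np.random.randint(0, 10, size=4)       # 4 random integers in [0, 10)
    print(u, n, k, sep="\n")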
4. Creation of arrays (from disk file)
Binary format (which is not suitable for humans to read)
Text format (which is suitable for humans to read)
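A minimal sketch of saving and loading arrays in both formats (the file names here are just for illustration):

    import numpy as np

    a = np.arange(6).reshape(2, 3)

    np.save("my_array.npy", a)          # binary format (not human readable)
    b = np.load("my_array.npy")

    np.savetxt("my_array.txt", a)       # text format (human readable)
    c = np.loadtxt("my_array.txt")

    print(b, c, sep="\n")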
Array types and attributes
An array has several attributes:
– ndim: the number of dimensions
– shape: the size in each dimension
– size: the number of elements
– dtype: the type of the elements
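For example, inspecting these attributes on a small two-dimensional array:

    import numpy as np

    a = np.array([[1, 2, 3], [4, 5, 6]])
    print(a.ndim)    # 2       (number of dimensions)
    print(a.shape)   # (2, 3)  (size in each dimension)
    print(a.size)    # 6       (total number of elements)
    print(a.dtype)   # int64   (may differ by platform)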
Indexing a one-dimensional array works like indexing a list.
For a multi-dimensional array, the index is a comma-separated tuple instead of a single integer.
Note that if you give only a single index to a multi-dimensional array, it indexes the first dimension (axis 0) of the array.
Slicing works similarly to lists, but now we can have slices in different dimensions.
We can even assign to a slice
Extract rows or columns from an array
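A short sketch of indexing, slicing, assigning to a slice, and extracting rows or columns:

    import numpy as np

    a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print(a[1, 2])      # 6        -- comma-separated tuple index
    print(a[0])         # [1 2 3]  -- a single index selects along axis 0
    print(a[:2, 1:])    # rows 0-1, columns 1-2
    a[:, 0] = 0         # assign to a slice (zero out the first column)
    print(a[1, :])      # extract a row
    print(a[:, 2])      # extract a column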
Arithmetic Operations in NumPy
The basic arithmetic operations in NumPy are defined in vector form. The name “vector operation” comes from linear algebra:
– addition of two vectors 𝐚 = [𝑎1, 𝑎2], 𝐛 = [𝑏1, 𝑏2] is element-wise addition 𝐚 + 𝐛 = [𝑎1 + 𝑏1, 𝑎2 + 𝑏2]
Arithmetic Operations
– +: addition
– -: subtraction
– *: multiplication
– /: division
– //: floor division
– **: power
– %: remainder
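These operators apply element-wise, for example:

    import numpy as np

    a = np.array([1, 2, 3, 4])
    b = np.array([10, 20, 30, 40])
    print(a + b)    # [11 22 33 44]
    print(b / a)    # [10. 10. 10. 10.]
    print(b // a)   # [10 10 10 10]  (floor division)
    print(a ** 2)   # [ 1  4  9 16]
    print(b % 3)    # [1 2 0 1]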
Aggregations: max, min, sum, mean, standard deviations…
Aggregations allow us to describe the information in an array by using a few numbers.
Aggregation over certain axes
Instead of aggregating over the whole array, we can aggregate over certain axes only as well.
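A small example of aggregating over the whole array and over a single axis:

    import numpy as np

    a = np.array([[1, 2, 3], [4, 5, 6]])
    print(a.sum())          # 21            -- over the whole array
    print(a.mean(axis=0))   # [2.5 3.5 4.5] -- aggregate along axis 0 (per column)
    print(a.max(axis=1))    # [3 6]         -- aggregate along axis 1 (per row)
    print(a.std())          # standard deviation of all elements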
Comparisons
Just like NumPy allows element-wise arithmetic operations between arrays, it is also possible to compare two arrays element-wise.
We can also count the number of comparisons that were True. This solution relies on the interpretation that True corresponds to 1 and False corresponds to 0.
Another use of boolean arrays is that they can be used to select a subset of elements. It is called masking.
It can also be used to assign a new value. For example the following zeroes out the negative numbers.
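A minimal sketch of element-wise comparison, counting, masking, and assignment through a mask:

    import numpy as np

    a = np.array([-2, -1, 0, 1, 2])
    mask = a > 0
    print(mask)            # [False False False  True  True]
    print(np.sum(a > 0))   # 2  -- True counts as 1, False as 0
    print(a[mask])         # [1 2]  -- masking selects a subset of elements
    a[a < 0] = 0           # zero out the negative numbers
    print(a)               # [0 0 0 1 2]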
Fancy Indexing
Using indexing we can get a single element from an array. If we wanted multiple (not necessarily contiguous) elements, we would have to index several times.
That’s quite verbose. Fancy indexing provides a concise syntax for accessing multiple elements.
We can also assign to multiple elements through fancy indexing.
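For example:

    import numpy as np

    a = np.array([10, 20, 30, 40, 50])
    idx = [0, 2, 4]          # not necessarily contiguous positions
    print(a[idx])            # [10 30 50]  -- fancy indexing
    a[[1, 3]] = -1           # assign to multiple elements at once
    print(a)                 # [10 -1 30 -1 50]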
Matrix operations
NumPy supports a wide variety of matrix operations, such as matrix multiplication, solving systems of linear equations, computing eigenvalues/eigenvectors, matrix decompositions, and other linear-algebra-related operations.
The matrix operations will be discussed in detail when we talk about machine learning algorithms.
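As a brief preview (these operations are covered in detail later), a sketch of matrix multiplication, solving a linear system, and an eigendecomposition:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    B = np.eye(2)

    print(A @ B)                                      # matrix multiplication
    print(np.linalg.solve(A, np.array([3.0, 5.0])))   # solve A x = b
    w, v = np.linalg.eig(A)                           # eigenvalues and eigenvectors
    print(w)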
Matplotlib (brief introduction)
Matplotlib
Visualization is an important technique to help understand the data.
Matplotlib is the most common low-level visualization library for Python.
It can create line graphs, scatter plots, density plots, histograms, heatmaps, and so on.
Simple Figure
Simple line plot
Two line plots in the same figure
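A minimal sketch of a simple line plot and two line plots in the same figure:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 100)
    plt.plot(x, np.sin(x), label="sin(x)")   # first line plot
    plt.plot(x, np.cos(x), label="cos(x)")   # second line plot, same figure
    plt.legend()
    plt.show()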
SubFigures
One can create a figure with several subfigures using the command plt.subplots.
It creates a grid of subfigures, where the number of rows and columns in the grid are given as parameters.
It returns a pair of a figure object and an array containing the subfigures. In matplotlib the subfigures are called axes.
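For example:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 2 * np.pi, 100)
    fig, axes = plt.subplots(1, 2)       # grid with 1 row and 2 columns
    axes[0].plot(x, np.sin(x))           # first subfigure (axes)
    axes[1].plot(x, np.cos(x))           # second subfigure (axes)
    plt.show()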
Scatter plots
The scatterplot is a visualization technique that enjoys widespread use in data analysis and is a powerful way to convey information about the relationship between two variables.
More examples will be shown in lab session.
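A small sketch (the data here is randomly generated, just for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.random.normal(size=100)
    y = 2 * x + np.random.normal(size=100)   # y is correlated with x
    plt.scatter(x, y)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()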
A 2D array in NumPy is not easy to interpret without external information. One solution to this is to give a descriptive name to each column. These column names stay fixed and attached to their corresponding columns, even if we remove some of the columns.
In addition, the rows can be given names as well; in Pandas these are called indices.
Pandas is a high-performance, open-source Python library for data manipulation and analysis
– We usually use an alias pd for pandas, like np for numpy
Pandas Data Structures
1-dimensional: Series
– 1D labeled, homogeneously-typed array
2-dimensional: DataFrame
– General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
Series is a 1-dimensional labeled array, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
All data items in a Series must be the same data type
Each item in the array has an associated label, also called an index. Why do we need labels?
– Like a dict, a data item can be quickly located by its label.
– Remark 1: Unlike a dict, labels in a Series do not need to be unique
– Remark 2: Unlike a dict, the size of a Series is fixed after its creation
Creation and indexing of series
A Series is the one-dimensional version of a DataFrame. Each column in a DataFrame is a Series.
One can turn any one-dimensional iterable into a Series, which is a one-dimensional data structure.
We can attach a name to this series
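For instance:

    import pandas as pd

    s = pd.Series([1.0, 2.0, 3.0])   # any one-dimensional iterable works
    s.name = "measurements"          # attach a name to the series (name is just for illustration)
    print(s)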
Row indices of Series
Series can be created by pd.Series()
In addition to the values of the series, the row indices are also printed. All the accessing methods of NumPy arrays also work for Series: indexing, slicing, masking and fancy indexing.
Note that the indices stick to their corresponding values; they are not renumbered!
The values of Series as a NumPy array are accessible via the values attribute.
The indices are available through the index attribute.
The index is not simply a NumPy array, but a data structure that allows fast access to the elements.
It is still possible to access the series using NumPy style implicit integer indices.
This can be confusing though.
Pandas offers the attributes loc and iloc. The attribute loc always uses the explicit index, while the attribute iloc always uses the implicit integer index.
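A short sketch contrasting the two, and showing the values and index attributes:

    import pandas as pd

    s = pd.Series([10, 20, 30], index=["a", "b", "c"])
    print(s.loc["b"])    # 20 -- explicit (label) index
    print(s.iloc[1])     # 20 -- implicit integer position
    print(s.values)      # the underlying NumPy array
    print(s.index)       # Index(['a', 'b', 'c'], dtype='object')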
The Pandas library is built on top of the NumPy library, and it provides a special kind of two dimensional data structure called DataFrame.
The DataFrame allows us to give names to the columns, so that one can access a column using its name in place of the index of the column.
Creation of Dataframes
The DataFrame is essentially a two-dimensional object, and it can be created in four different ways:
– from a two-dimensional NumPy array
– from given columns
– from given rows
– from a local file
1. Creating DataFrames from a NumPy array
In the following example a DataFrame with 2 rows and 3 columns is created. The row and column indices are given explicitly.
If either columns or index argument is left out, then an implicit integer index will be used.
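For example (the row and column labels here are just for illustration):

    import numpy as np
    import pandas as pd

    data = np.array([[1, 2, 3], [4, 5, 6]])
    df = pd.DataFrame(data, index=["r1", "r2"], columns=["a", "b", "c"])
    print(df)
    print(pd.DataFrame(data))   # leaving out index/columns falls back to implicit integer indices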
2. Creating DataFrames from columns
A column can be specified as a list, a NumPy array, or a Pandas Series.
The input is a dictionary, where the keys give the column names and the values are the actual column contents.
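A sketch with made-up column names and values:

    import pandas as pd

    df = pd.DataFrame({
        "name": ["Alice", "Bob"],    # keys become the column names
        "score": [85, 92],           # values become the column contents
    })
    print(df)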
3. Creating DataFrames from rows
We can give a list of rows as a parameter to the DataFrame constructor.
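For example (again with made-up values):

    import pandas as pd

    rows = [[1, "setosa"], [2, "versicolor"]]
    df = pd.DataFrame(rows, columns=["id", "species"])
    print(df)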
4. Creating DataFrames from a local file
Import ‘iris.csv’
We see that the DataFrame contains five columns, four of which are numerical variables.
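Assuming 'iris.csv' is in the working directory, a minimal sketch:

    import pandas as pd

    iris = pd.read_csv("iris.csv")
    print(iris.head())     # first few rows
    print(iris.dtypes)     # four numerical columns plus the species column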
Accessing columns and rows of a Dataframe
We can refer to a column by its name
It is recommended to use the attributes loc and iloc for accessing columns and rows in a DataFrame.
loc uses the explicit indices and iloc uses the implicit integer indices.
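For example, continuing with the iris DataFrame loaded above:

    import pandas as pd

    iris = pd.read_csv("iris.csv")
    print(iris["sepal_length"].head())   # refer to a column by its name
    print(iris.loc[0, "species"])        # loc: explicit row label and column name
    print(iris.iloc[0, 0])               # iloc: first row, first column by position
    print(iris.iloc[:3, :2])             # first three rows, first two columns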
Drop a column
We can drop some columns from the DataFrame with the drop method.
We can use the inplace parameter of the drop method to modify the original DataFrame.
Many of the modifying methods of the DataFrame have the inplace parameter.
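For example, dropping the 'sepal_length' column from the iris DataFrame:

    import pandas as pd

    iris = pd.read_csv("iris.csv")

    smaller = iris.drop(columns=["sepal_length"])   # returns a new DataFrame; the original is unchanged

    iris_copy = iris.copy()
    iris_copy.drop(columns=["sepal_length"], inplace=True)   # modifies iris_copy itself
    print(iris_copy.columns)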
Add a new column in DataFrame
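For example, a derived column (the column name 'petal_ratio' is just for illustration):

    import pandas as pd

    iris = pd.read_csv("iris.csv")
    # A new column is created by assigning to a name that does not yet exist
    iris["petal_ratio"] = iris["petal_length"] / iris["petal_width"]
    print(iris.head())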
Summary statistic methods on Pandas columns
There are several summary statistic methods that operate on a column or on all the columns.
Summary statistics of Pandas
The summary statistic methods work in a similar way as their counterparts in NumPy. By default, the aggregation is done over columns.
The describe method of the DataFrame object gives different summary statistics for each (numeric) column. The result is a DataFrame. This method gives a good overview of the data, and is typically used in the exploratory data analysis phase.
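For example:

    import pandas as pd

    iris = pd.read_csv("iris.csv")
    print(iris.mean(numeric_only=True))   # column-wise means of the numeric columns
    print(iris.describe())                # count, mean, std, min, quartiles, max per numeric column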
Use Summary Statistics to Explore Data
Use Summary Statistics to Explore Data (in Python)
Load and see the data
Feature Type
Show the feature types
Summary statistics in Python
(Summary statistics table: mean, standard deviation, min, Q2 (median), …)
Summary statistics in Python: mean, standard deviation, median
Compute the mean, standard deviation and median for the feature ‘sepal_length’
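A minimal sketch, using the iris DataFrame loaded above:

    import pandas as pd

    iris = pd.read_csv("iris.csv")
    print(iris["sepal_length"].mean())     # mean
    print(iris["sepal_length"].std())      # standard deviation
    print(iris["sepal_length"].median())   # median (Q2)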
Summary statistics in Python: mode
Get counts of unique values for ‘species’
Compute the mode for ‘species’
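For example:

    import pandas as pd

    iris = pd.read_csv("iris.csv")
    print(iris["species"].value_counts())   # counts of unique values
    print(iris["species"].mode())           # the most frequent value(s)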
Summary statistics for different groups
Explore more about the data
The petal_length of ‘setosa’ is always smaller than the petal_length of ‘versicolor’
Summary statistics for different groups
Explore more about the data
The petal_width of ‘setosa’ is always smaller than the petal_width of ‘versicolor’
Summary statistics for different groups
Explore more about the data
• We have our own way to distinguish ‘setosa’ and ‘versicolor’ (i.e., using petal_length or petal_width)
• The knowledge is gained by performing simple summary statistics techniques on data.
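A sketch of the group-wise summary statistics that lead to these observations:

    import pandas as pd

    iris = pd.read_csv("iris.csv")
    # Group the rows by species and summarize petal_length and petal_width within each group
    print(iris.groupby("species")["petal_length"].describe())
    print(iris.groupby("species")["petal_width"].describe())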
The pattern is more obvious when we visualize it
The petal_length of ‘setosa’ is always smaller than the petal_length of ‘versicolor’
The petal_width of ‘setosa’ is always smaller than the petal_width of ‘versicolor’
It is very easy to classify ‘setosa’ and ‘versicolor’ just based on petal_length and petal_width.
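A minimal sketch of such a visualization (one scatter plot per species, colored automatically):

    import pandas as pd
    import matplotlib.pyplot as plt

    iris = pd.read_csv("iris.csv")
    subset = iris[iris["species"].isin(["setosa", "versicolor"])]
    for name, group in subset.groupby("species"):
        plt.scatter(group["petal_length"], group["petal_width"], label=name)
    plt.xlabel("petal_length")
    plt.ylabel("petal_width")
    plt.legend()
    plt.show()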