
COMP3115 Exploratory Data Analysis and Visualization
Lecture 3: Python libraries for data analytics: NumPy, Matplotlib and Pandas
 Introduction to Python Programming (Cont’d) – refer to week2’s slides
 Exploratory Data Analysis by Simple Summary Statistics (Cont’d) – refer to week1’s slides


 A Brief Introduction to Python Libraries for Data Analytics: NumPy, Matplotlib and Pandas
 An Example of using Pandas to Explore Data by Summary Statistics

 NumPy stands for Numerical Python and it is the fundamental package for scientific computing with Python. NumPy is a Python library for handling multi-dimensional arrays.
 It contains both the data structures needed for storing and accessing arrays, and the operations and functions for computation using these arrays.
 Unlike lists, arrays must have the same data type for all their elements.
 The homogeneity of arrays allows highly optimized functions that use arrays as their inputs and outputs.

The major features of NumPy
 Easily generate and store data in memory in the form of multidimensional arrays
 Easily load and store data on disk in binary, text, or CSV format
 Support efficient operations on data arrays, including basic arithmetic and logical operations, shape manipulation, data sorting, data slicing, linear algebra, statistical operation, discrete Fourier transform, etc.
 Vectorised computation: simple syntax for elementwise operations without using loops (e.g., a = b + c where a, b, and c are three multidimensional arrays with same shape).

How to use NumPy?
 In order to use NumPy, we need to import the module numpy first.
 A widely used convention is to use np as a short name for numpy.
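A minimal sketch of the import convention and of the vectorized computation mentioned above (the array values are made up for illustration):

import numpy as np

b = np.array([1.0, 2.0, 3.0])
c = np.array([10.0, 20.0, 30.0])
a = b + c          # element-wise addition without an explicit loop
print(a)           # [11. 22. 33.]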

Usages of high-dimensional arrays in data analysis
 Store matrices, solve systems of linear equations, compute eigenvalues/eigenvectors, matrix decompositions, …
 Images and videos can be represented as NumPy arrays
 A 2-dimensional table might store an input data matrix in data analysis, where each row represents a sample and each column represents a feature (commonly used in Scikit-learn).

Representation of 2-dimensional table
 We obtain information about cases (records) in a dataset, and generally record the information for each case in a row of a data table.
 A variable is any characteristic that is recorded for each case. The variables generally correspond to the columns in a data table.

List vs. Array
 Arrays need extra declaration while lists don’t.
 Lists are generally used more often between the two, and they work fine most of the time.
 If you're going to perform arithmetic operations on your lists, you should really be using arrays instead.
 Arrays will store your data more compactly and efficiently.

Creation of arrays
 Four different approaches to create ndarray objects
 1. Use the numpy.array() function to generate an ndarray object from any sequence-like object (e.g., list and tuple)
 2. Use the built-in functions (e.g. np.zeros(), np.ones()) to generate some special ndarray objects. Use help() to find out the details of each function.
 3. Generate ndarray with random numbers (random sampling). The numpy.random module provides functions to generate arrays of sample values from popular probability distributions.
 4. Save an ndarray to a disk file, and read an ndarray from a disk file (e.g. np.load())

1. Creation of arrays (numpy.array())
 Import the NumPy library
– Suggested to use the standard abbreviation np
 Give a (nested) list as a parameter to the array constructor
– One dimensional array: list
– Two dimensional array: list of lists
– Three dimensional array: list of lists of lists
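A small sketch of creating 1-, 2- and 3-dimensional arrays from (nested) lists:

import numpy as np

a1 = np.array([1, 2, 3])                      # one dimensional: list
a2 = np.array([[1, 2, 3], [4, 5, 6]])         # two dimensional: list of lists
a3 = np.array([[[1, 2], [3, 4]],
               [[5, 6], [7, 8]]])             # three dimensional: list of lists of lists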

One dimensional array, Two dimensional array, Three dimensional array
 In a two dimensional array, you have rows and columns. The rows are indicated as "axis 0" while the columns are "axis 1".
 The number of axes goes up accordingly with the number of dimensions.

2. Creation of arrays (built-in functions)
 Useful function to create common types of arrays
– np.zeros(): all elements are 0s
– np.ones(): all elements are 1s
– np.full(): all elements are set to a specific value
– np.empty(): all elements are uninitialized
– np.eye(): identity matrix: a matrix with elements on the diagonal are 1s, others are 0s
– np.arange(): generate evenly spaced values within a given interval.
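A brief sketch of these creation functions (shapes and values are chosen only for illustration):

import numpy as np

np.zeros((2, 3))      # 2x3 array of 0s
np.ones((2, 3))       # 2x3 array of 1s
np.full((2, 3), 7)    # 2x3 array where every element is 7
np.empty((2, 3))      # 2x3 array with uninitialized elements
np.eye(3)             # 3x3 identity matrix
np.arange(0, 10, 2)   # evenly spaced values: [0 2 4 6 8]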

3. Creation of arrays (random sampling)
 The numpy.random module provides functions to generate arrays of sample values from popular probability distributions.
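A minimal sketch of sampling from some common distributions (the sizes and parameters are arbitrary):

import numpy as np

np.random.rand(2, 3)               # uniform samples on [0, 1)
np.random.randn(2, 3)              # samples from the standard normal distribution
np.random.randint(0, 10, size=5)   # random integers in [0, 10)
np.random.normal(loc=0.0, scale=1.0, size=(2, 3))   # normal with given mean/std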

4. Creation of arrays (from disk file)
 Binary format (which is not suitable for human to read)
 Txt format (which is suitable for human to read)
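A short sketch of saving and loading arrays in binary and text formats (the file names are made up):

import numpy as np

a = np.arange(6).reshape(2, 3)
np.save('data.npy', a)        # binary format (not human readable)
b = np.load('data.npy')
np.savetxt('data.txt', a)     # text format (human readable)
c = np.loadtxt('data.txt')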

Array types and attributes
 An array has several attributes:
– ndim: the number of dimensions
– shape: size in each dimension
– size: the number of elements
– dtype: the type of element
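A small example of inspecting these attributes:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.ndim)    # 2: number of dimensions
print(a.shape)   # (2, 3): size in each dimension
print(a.size)    # 6: number of elements
print(a.dtype)   # e.g. int64 (platform dependent)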

Indexing
 A one dimensional array works like a list.
 For a multi-dimensional array, the index is a comma-separated tuple instead of a single integer.
 Note that if you give only a single index to a multi-dimensional array, it indexes the first dimension of the array.
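A quick sketch of indexing a two dimensional array:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a[1, 2])   # 6: row 1, column 2
print(a[0])      # [1 2 3]: a single index selects along the first dimension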

Slicing
 Slicing works similarly to lists, but now we can have slices in different dimensions.
 We can even assign to a slice.
 Extract rows or columns from an array.
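A brief sketch of slicing, assigning to a slice, and extracting rows and columns:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a[0:2, 1:3])   # sub-array of rows 0-1 and columns 1-2
a[0, :] = 0          # assign to a slice: zero out the first row
row = a[1, :]        # extract a row
col = a[:, 2]        # extract a column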

Arithmetic Operations in NumPy
 The basic arithmetic operations in NumPy are defined in vector form. The name "vector operation" comes from linear algebra:
– addition of two vectors 𝐚 = [𝑎1, 𝑎2], 𝐛 = [𝑏1, 𝑏2] is element-wise addition 𝐚 + 𝐛 = [𝑎1 + 𝑏1, 𝑎2 + 𝑏2]
 Arithmetic operations
– +: addition
– -: subtraction
– *: multiplication
– /: division
– //: floor division
– **: power
– %: remainder
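A compact sketch of these element-wise operators (values made up for illustration):

import numpy as np

a = np.array([5, 7, 9])
b = np.array([2, 2, 4])
print(a + b)    # [ 7  9 13]
print(a - b)    # [3 5 5]
print(a * b)    # [10 14 36]
print(a / b)    # [2.5  3.5  2.25]
print(a // b)   # [2 3 2]
print(a ** b)   # [  25   49 6561]
print(a % b)    # [1 1 1]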

Aggregations: max, min, sum, mean, standard deviation, …
 Aggregations allow us to describe the information in an array by using a few numbers
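A small example of aggregating over a whole array:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.max(), a.min(), a.sum())   # 6 1 21
print(a.mean(), a.std())           # 3.5 and ≈ 1.71 (population standard deviation)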

Aggregation over certain axes
 Instead of aggregating over the whole array, we can aggregate over certain axes only as well.
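A small sketch of aggregation along an axis (axis=0 aggregates over the rows, i.e. per column; axis=1 per row):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.sum(axis=0))   # [5 7 9]: column sums
print(a.sum(axis=1))   # [ 6 15]: row sums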

Comparisons
 Just like NumPy allows element-wise arithmetic operations between arrays, it is also possible to compare two arrays element-wise.
 We can also count the number of comparisons that were True. This solution relies on the interpretation that True corresponds to 1 and False corresponds to 0.
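A brief sketch of element-wise comparison and of counting how many comparisons were True:

import numpy as np

a = np.array([1, 4, 2, 5])
b = np.array([2, 3, 2, 1])
print(a > b)            # [False  True False  True]
print(np.sum(a > b))    # 2: True counts as 1 and False as 0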

 Another use of boolean arrays is that they can be used to select a subset of elements. This is called masking.
 It can also be used to assign a new value. For example, the following zeroes out the negative numbers.
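A small sketch of masking, including the "zero out the negative numbers" example mentioned above:

import numpy as np

a = np.array([-1, 5, -3, 7])
mask = a > 0
print(a[mask])     # [5 7]: select a subset of elements
a[a < 0] = 0       # assign a new value: zero out the negative numbers
print(a)           # [0 5 0 7]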

Fancy Indexing
 Using indexing we can get a single element from an array. If we wanted multiple (not necessarily contiguous) elements, we would have to index several times.
 That’s quite verbose. Fancy indexing provides a concise syntax for accessing multiple elements.
 We can also assign to multiple elements through fancy indexing.
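A short sketch of fancy indexing and of assigning through it:

import numpy as np

a = np.array([10, 20, 30, 40, 50])
idx = [0, 2, 4]
print(a[idx])      # [10 30 50]: multiple, not necessarily contiguous, elements
a[idx] = -1        # assign to multiple elements at once
print(a)           # [-1 20 -1 40 -1]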

Matrix operations
 NumPy supports a wide variety of matrix operations, such as matrix multiplication, solving systems of linear equations, computing eigenvalues/eigenvectors, matrix decompositions, and other linear algebra related operations.
 The matrix operations will be discussed in detail when we talk about machine learning algorithms.
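As a small taste of what comes later, a sketch using np.linalg (the matrix values are made up):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)         # solve the linear system A x = b
vals, vecs = np.linalg.eig(A)     # eigenvalues and eigenvectors
C = A @ A                         # matrix multiplication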

Matplotlib (brief introduction)

Matplotlib
 Visualization is an important technique to help understand the data.
 Matplotlib is the most common low-level visualization library for Python.
 It can create line graphs, scatter plots, density plots, histograms, heatmaps, and so on.

Simple Figure
 Simple line plot
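A minimal sketch of a line plot (the data is made up):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()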

Two line plots in the same figure
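A sketch of drawing two lines in the same figure:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin')
plt.plot(x, np.cos(x), label='cos')
plt.legend()
plt.show()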

SubFigures
 One can create a figure with several subfigures using the command plt.subplots.
 It creates a grid of subfigures, where the number of rows and columns in the grid are given as parameters.
 It returns a pair of a figure object and an array containing the subfigures. In matplotlib the subfigures are called axes.
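A brief sketch of plt.subplots with a 1x2 grid of subfigures (axes):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, axes = plt.subplots(1, 2)     # a figure object and an array of axes
axes[0].plot(x, np.sin(x))
axes[1].plot(x, np.cos(x))
plt.show()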

Scatter plots
 The scatterplot is a visualization technique that enjoys widespread use in data analysis and is a powerful way to convey information about the relationship between two variables.
 More examples will be shown in lab session.
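A small sketch of a scatter plot showing the relationship between two variables (randomly generated here):

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(100)
y = 2 * x + np.random.randn(100)   # made-up data with a linear relationship
plt.scatter(x, y)
plt.show()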

Pandas (brief introduction)
 A 2D array in NumPy is not easy to interpret without external information. One solution to this is to give a descriptive name to each column. These column names stay fixed and attached to their corresponding columns, even if we remove some of the columns.
 In addition, the rows can be given names as well; these are called indices in Pandas.
 Pandas is a high-performance open-source Python library for data manipulation and analysis
– We usually use the alias pd for pandas, like np for numpy

Pandas Data Structures
 1-dimensional: Series
– 1D labeled homogeneously-typed array
 2-dimensional: DataFrame
– General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

Series
 Series is a 1-dimensional labeled array, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
 All data items in a Series must be the same data type.
 Each item in the array has an associated label, also called an index. Why do we need labels?
– Like a dict, a data item can be quickly located by its label.
– Remark 1: Unlike a dict, labels in a Series don't need to be unique.
– Remark 2: Unlike a dict, the size of a Series is fixed after its creation.

Creation and indexing of series
 Series is the one-dimensional version of a DataFrame. Each column in a DataFrame is a Series.
 One can turn any one-dimensional iterable into a Series, which is a one-dimensional data structure.
 We can attach a name to this series
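A minimal sketch of turning a list into a Series and attaching a name to it:

import pandas as pd

s = pd.Series([1, 4, 9, 16])
s.name = 'squares'     # attach a name to this series
print(s)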

Row indices of Series
 Series can be created by pd.Series()
 In addition to the values of the series, the row indices are also printed. All the accessing methods from NumPy arrays also work for Series: indexing, slicing, masking and fancy indexing.
 Note that the indices stick to the corresponding values, they are not renumbered!
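A short sketch of explicit row indices and NumPy-style access on a Series (labels made up for illustration):

import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['b'])         # access by label
print(s[s > 15])      # masking: the indices stick to their values
print(s[['a', 'd']])  # fancy indexing with a list of labels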

 The values of a Series are accessible as a NumPy array via the values attribute.
 The indices are available through the index attribute.
 The index is not simply a NumPy array, but a data structure that allows fast access to the elements.
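A small example of the values and index attributes:

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.values)   # the underlying NumPy array: [10 20 30]
print(s.index)    # Index(['a', 'b', 'c'], dtype='object')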

 It is still possible to access the series using NumPy style implicit integer indices.
 This can be confusing though.
 Pandas offers the attributes loc and iloc. The attribute loc always uses the explicit index, while the attribute iloc always uses the implicit integer index.
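A brief sketch contrasting loc (explicit index) and iloc (implicit integer position):

import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 3, 4])
print(s.loc[2])    # 10: uses the explicit index label 2
print(s.iloc[2])   # 30: uses the implicit integer position 2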

DataFrame
 The Pandas library is built on top of the NumPy library, and it provides a special kind of two dimensional data structure called DataFrame.
 A DataFrame allows giving names to the columns, so that one can access a column using its name in place of the index of the column.

Creation of Dataframes
 The DataFrame is essentially a two dimensional object, and it can be created in four different ways:
– from a two dimensional NumPy array
– from given columns
– from given rows
– from a local file

1. Creating DataFrames from a NumPy array
 In the following example a DataFrame with 2 rows and 3 columns is created. The row and column indices are given explicitly.
 If either columns or index argument is left out, then an implicit integer index will be used.
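A small sketch of creating a 2x3 DataFrame from a NumPy array with explicit row and column indices (the names are made up):

import numpy as np
import pandas as pd

a = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(a, index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])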

2. Creating DataFrames from columns
 A column can be specified as a list, a NumPy array, or a Pandas Series.
 The input is a dictionary, where the keys give the column names and the values are the actual column contents.
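A minimal sketch of building a DataFrame from a dictionary of columns (the data is made up):

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],   # keys give the column names
    'score': [85, 92],          # values are the column contents
})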

3. Creating DataFrames from rows
 We can give a list of rows as a parameter to the DataFrame constructor.
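A short sketch of passing a list of rows to the DataFrame constructor (the column names are made up):

import pandas as pd

rows = [['Alice', 85], ['Bob', 92]]
df = pd.DataFrame(rows, columns=['name', 'score'])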

4. Creating DataFrames from a local file
 Import ‘iris.csv’
 We see that the DataFrame contains five columns, four of which are numerical variables.
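A minimal sketch of loading the iris data from a CSV file (the exact file path may differ):

import pandas as pd

df = pd.read_csv('iris.csv')
print(df.head())   # first rows: four numeric columns plus 'species'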

Accessing columns and rows of a Dataframe
 We can refer to a column by its name
 It is recommended to use the attributes loc and iloc for accessing columns and rows in a DataFrame.
 loc uses explicit indices and iloc uses the implicit integer indices.
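A small sketch of accessing columns and rows, assuming the iris DataFrame loaded above:

import pandas as pd

df = pd.read_csv('iris.csv')
col = df['sepal_length']            # refer to a column by its name
row = df.iloc[0]                    # first row via the implicit integer index
cell = df.loc[0, 'sepal_length']    # explicit row index and column name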

Drop a column
 We can drop some columns from the DataFrame with the drop method.
 We can use the inplace parameter of the drop method to modify the original DataFrame.
 Many of the modifying methods of the DataFrame have the inplace parameter.
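A brief sketch of drop, with and without inplace (column names follow the iris example):

import pandas as pd

df = pd.read_csv('iris.csv')
df2 = df.drop(columns=['species'])           # returns a new DataFrame
df.drop(columns=['species'], inplace=True)   # modifies df itself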

Add a new column in DataFrame
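A small sketch of adding a new column, here a made-up ratio of two iris columns ('sepal_width' is assumed to exist in the file):

import pandas as pd

df = pd.read_csv('iris.csv')
df['sepal_ratio'] = df['sepal_length'] / df['sepal_width']   # new column from existing ones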

Summary statistic methods on Pandas columns
 There are several summary statistic methods that operate on a column or on all the columns.

Summary statistics of Pandas
 The summary statistic methods work in a similar way as their counterparts in NumPy. By default, the aggregation is done over columns.
 The describe method of the DataFrame object gives different summary statistics for each (numeric) column. The result is a DataFrame. This method gives a good overview of the data, and is typically used in the exploratory data analysis phase.
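A short sketch of column-wise statistics and describe():

import pandas as pd

df = pd.read_csv('iris.csv')
print(df['sepal_length'].mean())    # statistic of a single column
print(df.mean(numeric_only=True))   # aggregation over all numeric columns
print(df.describe())                # count, mean, std, min, quartiles, max per numeric column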

Use Summary Statistics to Explore Data

Use Summary Statistics to Explore Data (in Python)
 Load and see the data

Feature Type
 Show the feature types
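A small sketch of showing the feature types:

import pandas as pd

df = pd.read_csv('iris.csv')
print(df.dtypes)   # numeric columns are float64, 'species' is object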

Summary statistics in Python
 Mean
 Standard deviation
 Min
 Q2 (Median)
 Compute the mean, standard deviation and median for the feature 'sepal_length'
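A brief sketch of these statistics for 'sepal_length':

import pandas as pd

df = pd.read_csv('iris.csv')
print(df['sepal_length'].mean())     # mean
print(df['sepal_length'].std())      # standard deviation
print(df['sepal_length'].median())   # Q2 (median)
print(df['sepal_length'].min())      # min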

Summary statistic in python: Mode
 Get counts of unique values for ‘species’
 Compute the mode for ‘species’
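A short sketch of counting unique values and computing the mode for 'species':

import pandas as pd

df = pd.read_csv('iris.csv')
print(df['species'].value_counts())   # counts of unique values
print(df['species'].mode())           # most frequent value(s)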

Summary statistics for different groups
 Explore more about the data
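A sketch of one way to compute these group-wise statistics with groupby (assuming the iris column names used earlier); the per-species summaries support the observations below:

import pandas as pd

df = pd.read_csv('iris.csv')
print(df.groupby('species')[['petal_length', 'petal_width']].describe())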

The petal_length of ‘setosa’ is always smaller than the petal_length of ‘versicolor’

Summary statistics for different groups
 Explore more about the data

The petal_width of ‘setosa’ is always smaller than the petal_width of ‘versicolor’

Summary statistics for different groups
 Explore more about the data

• We have our own way to distinguish ‘setosa’ and ‘versicolor’ (i.e., using petal_length or petal_width)
• The knowledge is gained by performing simple summary statistics techniques on data.

The pattern is more obvious when we visualize it
 The petal_length of ‘setosa’ is always smaller than the petal_length of ‘versicolor’
 The petal_width of ‘setosa’ is always smaller than the petal_width of ‘versicolor’
 It is very easy to classify ‘setosa’ and ‘versicolor’ just based on petal_length and petal_width.
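A small sketch of the visualization described here, assuming the iris column names used above:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('iris.csv')
for species, group in df.groupby('species'):
    plt.scatter(group['petal_length'], group['petal_width'], label=species)
plt.xlabel('petal_length')
plt.ylabel('petal_width')
plt.legend()
plt.show()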
