QBUS6850: Tutorial 2 – Linear Algebra, Data Handling
and Plotting

Objectives

• To handle data using the Numpy and Pandas libraries;
• To plot data using the matplotlib library.

1. Libraries

For this tutorial you will need to use external libraries. Libraries are groups of
useful functions for a particular domain.

The import statement is used to import code libraries.

import numpy as np

Numpy contains useful linear algebra functions. Numpy borrows many
function names and concepts from MATLAB.

2. Vectors

Creating a vector is straightforward with numpy. To create a vector you can
use either

a = np.array([1, 2, 3])
A = np.matrix([1, 2, 3])

where the contents of the [ ] (square brackets) are the contents of the vector.

You can check the shape of a vector by

print(a.shape)
print(A.shape)

Note that the shape of vector a is different from that of vector A. A numpy
array created from a flat list is one-dimensional, with shape (3,), so it is
neither a row nor a column vector. The matrix function always creates a
two-dimensional object, here a row vector of shape (1, 3). If you want to
specify the orientation of a vector exactly, use the matrix function (or
reshape the array).
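
As a minimal sketch of the shape difference described above, a 1-D array can also be given an explicit orientation with reshape. The names row and col are illustrative and do not appear elsewhere in this tutorial.

```python
import numpy as np

a = np.array([1, 2, 3])

# A 1-D array has a single axis, so it is neither a row nor a column
print(a.shape)            # (3,)

# reshape gives an explicit 2-D row or column vector
row = a.reshape(1, -1)    # 1 row, 3 columns
col = a.reshape(-1, 1)    # 3 rows, 1 column
print(row.shape)          # (1, 3)
print(col.shape)          # (3, 1)
```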

Be careful with arrays! Products of arrays are elementwise, not vector
multiplication!

print(a * a)

Instead you should use

print(np.dot(a,a))

or

print(A * A.transpose())
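
The difference between the elementwise product and the inner product can be sketched side by side as follows. The @ operator and np.outer are not used elsewhere in this tutorial and are shown here only for comparison.

```python
import numpy as np

a = np.array([1, 2, 3])

elementwise = a * a        # multiplies matching entries
inner = np.dot(a, a)       # sum of the elementwise products
inner_at = a @ a           # the @ operator gives the same inner product
outer = np.outer(a, a)     # 3 x 3 matrix of all pairwise products

print(elementwise)         # [1 4 9]
print(inner, inner_at)     # 14 14
print(outer.shape)         # (3, 3)
```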

The type of data that a vector (and matrix) can store is fixed to a single type.
You cannot mix integers, floats etc in the same vector. Numpy infers the data
type from the data used to create the vector.

print(a.dtype)
c = np.array([1.0, 2.0, 3.0])
print(c.dtype)
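
A small sketch of how the data type is inferred: if any entry is a float, Numpy stores every entry as a float. The variable names are illustrative, and the exact integer width depends on the platform, so only the kind of the type is printed for the integer array.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])      # one float promotes every entry to float

print(ints.dtype.kind)             # 'i' means a signed integer type
print(mixed.dtype)                 # float64

# astype returns a converted copy; the original array is unchanged
as_float = ints.astype(np.float64)
print(as_float.dtype)              # float64
```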

Here are some useful shortcuts for creating common special vectors

# Length 10 zero vector
zeros = np.zeros((10,1))

# Length 5 ones vector
ones = np.ones((5,1))

# You can create a single valued vector as the product
# of a vector of ones and your desired value
twos = ones * 2

# 11 numbers from 0 to 100, evenly spaced
lin_spaced = np.linspace(0, 100, 11)

# All numbers from 0 up to (not including) 100 in steps of 2
aranged = np.arange(0, 100, 2)

Vector transposition is straightforward

A_transposed = A.transpose()

Checking for vector/array equality is a little trickier than checking for equality
of other types. Here is how I suggest you check if two vectors are equal

print(np.array_equal(A, A))
print(np.array_equal(A, A_transposed))

# You may wish to check equality up to a tolerance
# This is useful since floating point arithmetic is not exact
print(np.allclose(A, A, rtol=1e-05))
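
To see why a tolerance matters, here is a minimal sketch using the classic 0.1 + 0.2 example. The vectors u and v are illustrative names.

```python
import numpy as np

u = np.array([0.1 + 0.2, 1.0])
v = np.array([0.3, 1.0])

# Exact comparison fails because 0.1 + 0.2 is not exactly 0.3
# in binary floating point
print(np.array_equal(u, v))            # False

# Comparison up to a relative tolerance succeeds
print(np.allclose(u, v, rtol=1e-05))   # True
```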

Vectors can be summed together as expected, or linear combinations can be
formed

c = a + a
d = 3*a + 1.5*c

Taking the norm of the vector is taken care of by numpy

norm2 = np.linalg.norm(a, ord=2)
print(norm2)

We can test if two vectors are orthogonal by the dot product. If vectors are
orthogonal then their inner product (dot product) is 0.

# Generate an orthogonal vector
x = [1, 2, 3]
y = [4, 5, 6]
orthog = np.cross(x, y)
print("Orthogonal" if np.dot(x, orthog) == 0 else "Not Orthogonal")
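
With floating point vectors the inner product is rarely exactly 0, so a tolerance-based check is usually safer. The sketch below assumes an absolute tolerance of 1e-9, which is an arbitrary choice for illustration and should be adapted to the scale of your data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# The cross product is orthogonal to both inputs
orthog = np.cross(x, y)

# Compare the inner product to zero up to a small absolute tolerance
is_orthogonal = np.isclose(np.dot(x, orthog), 0.0, atol=1e-9)
print("Orthogonal" if is_orthogonal else "Not Orthogonal")
```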

3. Matrices

Numpy Matrices behave in the same way as vectors

b = np.array([[1, 4, 7],
              [2, 5, 8],
              [3, 6, 9]])
B = np.matrix([[1, 4, 7],
               [2, 5, 8],
               [3, 6, 9]])
print(B)

# Multiplication of matrices is as normal
print(B * B)

# Scalar Product
print(5 * B)

# Summing
print(B + B)

# Transpose
print(B.transpose())

Some useful shortcuts for special matrices

# 10 x 10 zero matrix
m_zeros = np.zeros((10,10))
print(m_zeros)

# 5 x 5 ones matrix
m_ones = np.ones((5,5))
print(m_ones)

# Matrix of all twos
m_twos = m_ones * 2
print(m_twos)

Often we are interested in the diagonal entries of a matrix (e.g. covariance
matrices) or creating a diagonal matrix (e.g. identity matrix)

# Identity matrix
eye = np.identity(3)
print(eye)

# Get the diagonal entries
diag_elements = np.diag(B)
print(diag_elements)

# Create a matrix with values along the diagonal
m_diag = np.diag([1,2,3])
print(m_diag)

Other useful functions of matrices include the rank, trace and determinant.
The rank tells us how many linearly independent columns or rows there are,
the trace is used in calculating norms and the determinant tells us if the
matrix is invertible.

# Matrix Rank i.e. n linearly independent cols or rows
print(np.linalg.matrix_rank(B))

# Sum of diagonal entries
print(np.trace(B))

# Determinant
e = np.array([[1, 2], [3, 4]])
print(np.linalg.det(e))
print("Matrix is invertible" if np.linalg.det(e) != 0
      else "Matrix not invertible")
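
Because determinants are computed in floating point, a singular matrix can produce a tiny non-zero determinant. The sketch below uses a hypothetical helper, is_invertible, with an assumed tolerance of 1e-9. A suitable threshold depends on the scale of the matrix, so treat this as an illustration and not a general-purpose test.

```python
import numpy as np

e = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[1.0, 4.0, 7.0],
              [2.0, 5.0, 8.0],
              [3.0, 6.0, 9.0]])

def is_invertible(m, atol=1e-9):
    """Hypothetical helper: a determinant within atol of zero counts as singular."""
    return not np.isclose(np.linalg.det(m), 0.0, atol=atol)

print(is_invertible(e))            # True, det(e) = -2
print(np.linalg.matrix_rank(B))    # 2, so B is rank deficient
print(is_invertible(B))            # False

# For an invertible matrix, the product with its inverse is the identity
e_inv = np.linalg.inv(e)
print(np.allclose(e @ e_inv, np.identity(2)))   # True
```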

4. Pandas

Pandas is a library for data manipulation. The key feature of Pandas is that
the data structures it uses can hold multiple different data types. For example
a single DataFrame can hold columns of integers, strings and floating point
numbers all at once. You could consider this capability as similar to an Excel
spreadsheet. By contrast, a Numpy array is fixed to a single data type.
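
The contrast between a DataFrame and a Numpy array can be sketched as follows. The values and column names are invented for illustration and are not taken from the tutorial data.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration
frame = pd.DataFrame({
    "country": ["A", "B", "C"],          # strings
    "servings": [10, 20, 30],            # integers
    "litres": [0.5, 1.5, 2.5],           # floats
})
print(frame.dtypes)                      # one dtype per column

# A Numpy array built from mixed values is coerced to a single dtype
mixed = np.array([1, "a", 2.5])
print(mixed.dtype.kind)                  # 'U': every entry became a string
```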

Pandas is already installed and available in Anaconda/Spyder. To begin using
it we first import it as follows

import pandas as pd

Download the drinks.csv file from Blackboard and place it in the same folder
as your Python file.

Then load the CSV by using

drinks = pd.read_csv('drinks.csv')

Check that the DataFrame was loaded correctly by viewing some basic
information by using

drinks.dtypes
drinks.info()

View summary statistics of the DataFrame drinks by using the describe
function

drinks.describe()

Now you can begin to manipulate the data. Let's begin by extracting the
beer_servings column. Extracting a column will return a variable with the
Series type, not DataFrame. Series is for 1D data and DataFrame is for 2D
data.

Here I have used the head function to show the first 5 rows.

drinks['beer_servings'].head()

Just like a DataFrame you can get statistics on a Series. Try it

drinks['beer_servings'].describe()

You can get single statistics from a Series

drinks['beer_servings'].mean()

You can search or query the DataFrame. For example you can get the rows
where the continent is Europe. Try the following

euro_frame = drinks[drinks['continent'] == 'EU']

Again you can query this DataFrame for its own statistics.

euro_frame['beer_servings'].mean()

Queries can be compounded. In the following example you will get the
countries in Europe where the number of wine servings is greater than 300.
Try it.

euro_heavywine = drinks[(drinks['continent'] == 'EU') &
                        (drinks['wine_servings'] > 300)]
euro_heavywine
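
A compound query is a combination of boolean masks. The sketch below shows the idea on a small DataFrame whose values are invented for illustration, so it runs without drinks.csv.

```python
import pandas as pd

# Toy values invented for illustration, not taken from drinks.csv
toy = pd.DataFrame({
    "country": ["P", "Q", "R", "S"],
    "continent": ["EU", "EU", "AS", "EU"],
    "wine_servings": [350, 120, 400, 310],
})

in_europe = toy["continent"] == "EU"        # boolean Series, one value per row
heavy_wine = toy["wine_servings"] > 300

# & is the elementwise "and". When conditions are written inline, each
# needs its own parentheses because & binds more tightly than == and >
selected = toy[in_europe & heavy_wine]
print(selected["country"].tolist())         # ['P', 'S']
```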

DataFrames can be sorted. Try the following code to sort by litres of alcohol
and return the last 10 entries using the tail function

top_drinkers = drinks.sort_values(by='total_litres_of_pure_alcohol').tail(10)
top_drinkers
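
Sorting in ascending order and taking the tail selects the same rows as sorting in descending order and taking the head, only in the opposite order. A minimal sketch with invented values:

```python
import pandas as pd

# Toy values invented for illustration
toy = pd.DataFrame({
    "country": ["P", "Q", "R", "S"],
    "litres": [4.2, 11.8, 0.3, 9.1],
})

# Ascending sort followed by tail keeps the largest values, smallest first
by_tail = toy.sort_values(by="litres").tail(2)
print(by_tail["country"].tolist())       # ['S', 'Q']

# Descending sort followed by head keeps the same rows, largest first
by_head = toy.sort_values(by="litres", ascending=False).head(2)
print(by_head["country"].tolist())       # ['Q', 'S']
```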

4.1. Handling Missing Data

The drinks_corrupt.csv file contains lots of missing entries. Pandas marks
missing entries as NaN when it loads the file, and most summary functions skip
them automatically. However sometimes you may want to remove them manually or
specify which codes should be treated as missing.

In this case we want to do two things:

• Keep continent code “NA” since this is valid and means “North America”

• Remove all other rows that contain genuinely missing values

By default Pandas will read the string “NA” as a missing value (NaN). We
should prevent this.

default_na_values = ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND',
                     '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN',
                     'N/A', 'NA', 'NULL', 'NaN', 'nan']

# Copy the list so that removing 'NA' does not alter default_na_values
our_na_values = list(default_na_values)
our_na_values.remove('NA')

drinks_dirty = pd.read_csv('drinks_corrupt.csv',
                           keep_default_na=False, na_values=our_na_values)
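
The effect of keep_default_na can be checked on a small in-memory CSV. The CSV text below is invented for illustration, and only the empty string is listed as a missing-value marker to keep the sketch short.

```python
import io
import pandas as pd

# Toy CSV invented for illustration; the last row has two empty fields
csv_text = "country,continent,beer_servings\nX,NA,240\nY,EU,127\nZ,,\n"

# Default parsing treats the string "NA" as missing
default_frame = pd.read_csv(io.StringIO(csv_text))
print(default_frame["continent"].isnull().sum())    # 2

# With the defaults switched off, only the listed markers count as missing
kept_frame = pd.read_csv(io.StringIO(csv_text),
                         keep_default_na=False, na_values=[""])
print(kept_frame["continent"].isnull().sum())       # 1
print(kept_frame["continent"].tolist()[0])          # NA
```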

Check the number of NaN values per column

drinks_dirty.isnull().sum()

To remove rows that contain any missing entries try the dropna function

drinks_clean = drinks_dirty.dropna()

If you want to remove rows where every column has a missing entry use the
how parameter

drinks_clean_byrows = drinks_dirty.dropna(how='all')

You can avoid the creation of a new variable by doing an in place
replacement. Try the following

drinks_dirty.dropna(inplace = True)
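
The difference between the default behaviour and how='all' is easiest to see on a tiny example. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration
toy = pd.DataFrame({
    "beer": [1.0, np.nan, np.nan],
    "wine": [2.0, 5.0, np.nan],
})

any_missing = toy.dropna()             # default how='any': drop rows with at least one NaN
all_missing = toy.dropna(how="all")    # drop only rows where every column is NaN

print(len(any_missing))                # 1
print(len(all_missing))                # 2
print(len(toy))                        # 3, the original is unchanged without inplace=True
```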

Alternative

Instead of deleting corrupt or missing entries you can replace them with a
useful value. If we use the drinks.csv file we know there are no corrupt or
missing entries. In this case we can simply use replacement to fix the “NA”
(North America) issue.

drinks = pd.read_csv('drinks.csv')

To replace missing values use the fillna function and assign the result back
to the column. Try the following to replace missing continent values with the
string 'NA'

drinks['continent'] = drinks['continent'].fillna('NA')
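
A minimal sketch of the replacement on invented data. Assigning the result of fillna back to the column avoids calling an in-place method on a selected column, which recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration
toy = pd.DataFrame({"continent": ["EU", np.nan, "AS"]})

# Assigning the result back to the column replaces the missing values
toy["continent"] = toy["continent"].fillna("NA")
print(toy["continent"].tolist())       # ['EU', 'NA', 'AS']
```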

4.2. Creating Columns

You can create new columns as a function of (i.e. based on) the existing
columns. For example try creating a total_servings column

drinks['total_servings'] = (drinks.beer_servings +
                            drinks.spirit_servings +
                            drinks.wine_servings)

and create a column of pure alcohol in millilitres

drinks['alcohol_mL'] = drinks.total_litres_of_pure_alcohol * 1000

Then check your changes using the head function

# Check changes
drinks.head()
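
Column arithmetic is applied row by row. The sketch below reproduces the total_servings calculation on invented values and compares it with sum(axis=1), an alternative that is not used elsewhere in this tutorial.

```python
import pandas as pd

# Toy values invented for illustration
toy = pd.DataFrame({
    "beer_servings": [100, 50],
    "spirit_servings": [20, 30],
    "wine_servings": [5, 70],
})

# Arithmetic on columns is applied row by row
toy["total_servings"] = (toy["beer_servings"]
                         + toy["spirit_servings"]
                         + toy["wine_servings"])
print(toy["total_servings"].tolist())   # [125, 150]

# sum(axis=1) adds across the listed columns and gives the same totals
cols = ["beer_servings", "spirit_servings", "wine_servings"]
row_sums = toy[cols].sum(axis=1)
print(row_sums.tolist())                # [125, 150]
```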

4.3. Renaming Data

Renaming columns is straightforward

drinks.rename(columns={'total_litres_of_pure_alcohol': 'alcohol_litres'},
              inplace=True)

4.4. Deleting Data

Deleting data is easy. Pandas provides two options. The first is with the drop
function. Axis 1 refers to columns and axis 0 refers to rows.

drinks_wout_ml = drinks.drop(['alcohol_litres'], axis=1)
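
The meaning of the axis parameter can be sketched on a small DataFrame with invented values. Note that drop returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Toy values invented for illustration
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

no_b = toy.drop(["b"], axis=1)        # axis=1: drop the column labelled "b"
no_row0 = toy.drop([0], axis=0)       # axis=0: drop the row with index label 0

print(no_b.columns.tolist())          # ['a']
print(no_row0.index.tolist())         # [1, 2]
print(toy.shape)                      # (3, 2), drop returns a new DataFrame
```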

5. Plotting: Pyplot

Pyplot is a library for plotting data. Pyplot follows many of the
conventions of MATLAB’s plotting functions. It provides a very simple
interface to plotting for common tasks.

Pyplot is already installed and available in Anaconda/Spyder. To begin using
it we first import it as follows

import matplotlib.pyplot as plt

Pyplot requires a figure to draw each plot upon. A Figure can contain a single
plot or you can subdivide it into many plots. You can think of a figure as a
“blank canvas” to draw on.

By default Pyplot will draw on the last figure you created.

Creating a new figure is simple

my_figure = plt.figure()
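
As a sketch of subdividing a figure into several plots, plt.subplots creates a figure together with a grid of axes. This function is not used elsewhere in this tutorial. The Agg backend is selected here as an assumption so that the example runs without a display; inside Spyder you can omit those two lines.

```python
import matplotlib
matplotlib.use("Agg")                  # assumption: non-interactive backend
import matplotlib.pyplot as plt

# One figure subdivided into a 1 x 2 grid of plots
fig, axes = plt.subplots(1, 2)

axes[0].plot([0, 1, 2], [0, 1, 4])     # draw on the left plot
axes[1].bar([0, 1, 2], [3, 2, 1])      # draw on the right plot

print(len(fig.axes))                   # 2
```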

5.1. Bar Chart

Let's create a simple bar chart comparing the number of beer servings among
EU countries that drink lots of wine.

The first parameter of the bar(x, y) function is the x position to draw at and
the second is the vertical height of each bar. Here we use x = ind and y =
euro_heavywine['beer_servings']

ind = np.arange(len(euro_heavywine))
plt.bar(ind, euro_heavywine['beer_servings'])
# This shows the figure
my_figure

We need to label the individual bars. We can use the xticks function to do this

plt.xticks(ind, euro_heavywine['country'])
my_figure

Next label each of the axes

plt.xlabel("Country")
plt.ylabel("Beer Servings")
my_figure

Finally give your plot a title

plt.title("Beer Servings of Heavy Wine Drinking EU Countries")
my_figure
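
The bar chart steps above can be put together into one self-contained sketch. The country names and values are invented for illustration, and the Agg backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")                  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Toy values invented for illustration
countries = ["P", "Q", "R"]
beer = [120, 250, 180]

bar_fig = plt.figure()
ind = np.arange(len(countries))        # x positions 0, 1, 2
plt.bar(ind, beer)                     # one bar per country
plt.xticks(ind, countries)             # replace the numeric ticks with names
plt.xlabel("Country")
plt.ylabel("Beer Servings")
plt.title("Beer Servings (toy data)")

bar_ax = bar_fig.axes[0]
print(len(bar_ax.patches))             # 3
```

When running as a plain script instead of in Spyder, plt.show() or bar_fig.savefig with a file name of your choice is needed to see the result.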

5.2. Line Plot

Let's create a line plot showing the total servings of top drinking countries in
descending order.

Create a new figure to draw on

# Line Plot
my_figure2 = plt.figure()

Then sort the dataframe by total servings and pick the top 10 using the head
function.

# Get the 10 countries with highest total servings
top_drinkers = drinks.sort_values(by='total_servings',
                                  ascending=False).head(10)

Then plot the total servings column of top_drinkers. You should also label
your line plots so that you can show a legend later. Pyplot will automatically
assign a colour to each of your lines. You can choose a specific colour like so

# Line plot with a label and custom colour
# Other optional parameters include linestyle and marker
plt.plot(np.arange(0, 10, 1), top_drinkers['total_servings'],
         label="Total Servings", color="red")
my_figure2

Label each of the axes and add a title

plt.xlabel("Drinking Rank")
plt.ylabel("Total Servings")
plt.title("Drinking Rank vs Total Servings")
my_figure2

Show the legend on your Figure

# Activate the legend using label information
plt.legend()
my_figure2
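
The legend becomes more useful once a figure contains more than one line. The sketch below plots two labelled lines with different colours, line styles and markers. All values are invented for illustration, and the Agg backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")                  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Toy values invented for illustration
ranks = np.arange(0, 5, 1)
total = [600, 550, 500, 480, 450]
beer = [300, 200, 250, 100, 150]

line_fig = plt.figure()
plt.plot(ranks, total, label="Total Servings",
         color="red", linestyle="-", marker="o")
plt.plot(ranks, beer, label="Beer Servings",
         color="blue", linestyle="--", marker="s")
plt.xlabel("Drinking Rank")
plt.ylabel("Servings")
plt.legend()                           # builds the legend from the label arguments

legend = line_fig.axes[0].get_legend()
legend_labels = [t.get_text() for t in legend.get_texts()]
print(legend_labels)                   # ['Total Servings', 'Beer Servings']
```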