F21_APS1070_Tutorial_1

APS1070¶
Basic Principles and Models – Tutorial 1¶

In this tutorial, we will be using the popular machine learning library scikit-learn in tandem with a popular scientific computing library in Python, NumPy, to investigate basic machine learning principles and models. The topics that will be covered in this lab include:

Introduction to scikit-learn and NumPy
Data preparation and cleaning with Pandas
Exploratory data analysis (EDA)
Nearest neighbors classification algorithm
Nested cross-validation

Note: Some other useful Python libraries include matplotlib (for plotting/graphing) and Pandas (for data analysis), though we won’t be going into detail on these in this bootcamp.

Jupyter Notebooks¶
This lab will be using Jupyter Notebooks as a Python development environment. Hopefully you’re somewhat familiar with them. Write your code in cells (this is a cell!) and execute your code by pressing the play button (up top) or by pressing Ctrl+Enter. To format a cell for text, you can select “Markdown” from the dropdown – the default formatting is “Code”, which will usually be what you want.

Getting started¶
Let’s get started. First, we’re going to test that we’re able to import the required libraries.

>> Run the code in the next cell to import scikit-learn, NumPy, and Pandas.

In [ ]:

import numpy as np
import pandas as pd
import sklearn

NumPy¶
Great. Let’s move on to our next topic: getting a handle on NumPy basics. You can think of NumPy as sort of like a MATLAB for Python (if that helps). The main object is multidimensional arrays, and these come in particularly handy when working with data and machine learning algorithms.

Let’s create a 2×4 array containing the numbers 0 through 7 and conduct some basic operations on it.

>> Run the code in the next cell to create and print the array.

In [ ]:

# see all functions and attributes in numpy
print(dir(np))

In [ ]:

# help() prints the documentation itself, so no print() is needed
help(np.arange)

In [ ]:

array = np.arange(8).reshape(2,4)
array
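Note that np.arange(8) starts counting at 0. As a small sketch (the variable names below are illustrative and not part of the original notebook), passing an explicit start and stop yields the values 1 through 8 instead, and passing -1 to reshape lets NumPy infer that dimension:

```python
import numpy as np

# np.arange(stop) counts from 0 up to, but not including, stop
zero_based = np.arange(8).reshape(2, 4)

# np.arange(start, stop) lets us choose the first value explicitly
one_based = np.arange(1, 9).reshape(2, 4)

# -1 asks NumPy to infer that dimension from the number of elements
inferred = np.arange(1, 9).reshape(2, -1)

print(zero_based)
print(one_based)
print(inferred.shape)
```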

We can access the shape, number of dimensions, data type, and number of elements in our array as follows:

(Tip: use “print()” when you want a cell to output more than one thing, or you want to append text to your output, otherwise the cell will output the last object you call, as in the cell above)

In [ ]:

print("Shape:", array.shape)
print("Dimensions:", array.ndim)
print("Data type:", array.dtype.name)
print("Number of elements:", array.size)

If we have a Python list containing a set of numbers, we can use it to create an array:

(Tip: if you click on a function call, such as array(), and press “shift+tab” the Notebook will provide you all the details of the function)

In [ ]:

mylist = [0, 1, 1, 2, 3, 5, 8, 13, 21]
myarray = np.array(mylist)
myarray

And we can do it for nested lists as well, creating multidimensional NumPy arrays:

In [ ]:

my2dlist = [[1,2,3],[4,5,6]]
my2darray = np.array(my2dlist)
my2darray

We can also index and slice NumPy arrays like we would do with a Python list or another container object as follows:

In [ ]:

array = np.arange(10)
print("Originally: ", array)
print("First four elements: ", array[:4])
print("After the first four elements: ", array[4:])
print("Elements 3 to 7: ", array[3:8])
print("The last element: ", array[-1])

And we can index/slice multidimensional arrays, too.

In [ ]:

array = np.array([[1,2,3],[4,5,6]])
print("Originally: ", array)
print("First row only: ", array[0])
print("First column only: ", array[:,0])
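Row and column slices can also be combined. The following sketch uses an illustrative array named grid (not from the original notebook) and also shows one thing worth knowing early: basic slices are views onto the original array, so writing through a slice changes the original unless you call .copy().

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

# Row and column slices can be combined: all rows, last two columns
last_two_cols = grid[:, 1:]

# A single element is addressed as [row, column]
element = grid[1, 2]

# Basic slices are views: writing through the slice changes the original
view = grid[0, :2]
view[0] = 100

# Use .copy() when an independent array is needed
independent = grid[1].copy()
independent[0] = -1

print(last_two_cols)
print(element)
print(grid)
```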

Sneak preview¶
Often, when designing a machine learning classifier, it can be useful to compare an array of predictions (0 or 1 values) to another array of true values. We can do this pretty easily in NumPy to compute the accuracy (i.e., the fraction of values that are the same), for example, as follows:

In [ ]:

true_values = [0, 0, 1, 1, 1, 1, 1, 0, 1, 0]
predictions = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]

true_values_array = np.array(true_values)
predictions_array = np.array(predictions)

accuracy = np.sum(true_values_array == predictions_array) / true_values_array.size
print("Accuracy: ", accuracy * 100, "%")

In the previous cell, we took two Python lists, converted them to NumPy arrays, and then used a combination of np.sum() and .size to compute the accuracy (the proportion of elements that are pairwise equal). This is a tiny bit more advanced, but it demonstrates the power of NumPy arrays.

You’ll notice we didn’t write an explicit loop to conduct the comparison, but instead applied == and np.sum() to whole arrays. This is an example of a vectorized operation within NumPy, which is much more efficient when dealing with large datasets.
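To make the contrast concrete, here is a minimal sketch that computes the same accuracy twice: once with an explicit Python loop and once with whole-array operations. The names loop_accuracy and vectorized_accuracy are illustrative only.

```python
import numpy as np

true_values = np.array([0, 0, 1, 1, 1, 1, 1, 0, 1, 0])
predictions = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 0])

# Explicit Python loop: count matching pairs one at a time
matches = 0
for t, p in zip(true_values, predictions):
    if t == p:
        matches += 1
loop_accuracy = matches / len(true_values)

# Vectorized: == builds a boolean array, and the mean of booleans
# is the fraction of True entries
vectorized_accuracy = np.mean(true_values == predictions)

print(loop_accuracy, vectorized_accuracy)
```

Both approaches give the same number; the vectorized form pushes the loop into NumPy's compiled code, which is where the speed-up on large arrays comes from.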

NumPy contains a wide range of mathematical functions. You can use the following link to see a list of the mathematical functions supported by NumPy: https://numpy.org/doc/stable/reference/routines.math.html

In [ ]:

# mathematical functions
array1 = np.arange(9).reshape(3,3)
array2 = np.arange(5,14).reshape(3,3)
print('array1:', array1)

print('array2:', array2)

In [ ]:

# summation along axis 1 (one sum per row)
np.sum(array1, axis=1)

In [ ]:

# adding arrays
np.add(array1,array2)

In [ ]:

# dot product (for 2-D arrays this is matrix multiplication)
np.dot(array1,array2)
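It is easy to confuse the element-wise product with the matrix product, so here is a small sketch on two illustrative 2×2 arrays (a and b are not from the original notebook). It also shows how the axis argument of a sum picks the direction of the reduction.

```python
import numpy as np

a = np.arange(4).reshape(2, 2)       # [[0, 1], [2, 3]]
b = np.arange(4, 8).reshape(2, 2)    # [[4, 5], [6, 7]]

# Element-wise product: multiplies entries at matching positions
elementwise = a * b

# Matrix product: rows of a combined with columns of b
matrix_product = a @ b

# For 2-D inputs, np.dot gives the same result as @
same_as_dot = np.array_equal(matrix_product, np.dot(a, b))

# axis=0 sums down each column, axis=1 sums across each row
column_sums = a.sum(axis=0)
row_sums = a.sum(axis=1)

print(elementwise)
print(matrix_product)
print(column_sums, row_sums)
```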

Pandas basics¶
Pandas is an incredibly useful library that allows us to work with large datasets in Python. It contains myriad useful tools, and is highly compatible with other libraries like Scikit-learn, so you don’t have to spend any time getting the two to play nicely together.

First we are going to load a dataset with Pandas:

In [ ]:

!pip install wget

In [ ]:

import wget

wget.download(
    'https://raw.githubusercontent.com/aps1070-2019/datasets/master/arabica_data.csv',
    'arabica_data.csv'
)

In [ ]:

df = pd.read_csv('arabica_data.csv')

With Pandas, the main object we work with is referred to as a DataFrame (hence calling our object here df). A DataFrame stores our dataset in a way that immediately gives us a lot of power to interact with it. If you just put the DataFrame in a cell on its own, you instantly get a clear, easy to read preview of the data you have:

In [ ]:

df

In [ ]:

# see first 5 rows
df.head()

In [ ]:

df.columns

In [ ]:

# see the data types in each column
df.info()

In [ ]:

# getting the summary of numerical columns
df.describe()

Let’s say we want to zero in on a single column. This is done the same way that you access a dictionary entry:

In [ ]:

df['Species']

Using this method of column access on its own returns a Series object; think of this as a DataFrame with only one column. If you want the raw values, however, you can add .values after your entry. Using this, and by putting the result in a Python set (which does not allow duplicate entries), we can quickly see all of the possible values for any column:

In [ ]:

set(df['Variety'].values)

You may notice that one entry in this set isn’t like the others: it’s nan, which in Pandas denotes a missing entry. When working with real-world datasets it’s very common for entries to be missing, and there are a variety of ways of approaching a problem like this. For now, though, we are simply going to tell Pandas to drop any row that has a missing value in any column, using the dropna() method.

In [ ]:

df_clean = df.dropna()

In [ ]:

df.shape,df_clean.shape

What percentage of entries are left in df_clean?
What column had the highest number of nan entries?

In [ ]:

# What percentage of entries are left in `df_clean`?
print("%.2f%% entries from full dataset are left in df_clean" % ((len(df_clean)/len(df))*100))

# What column had the highest number of `nan` entries?
Nan_Entries = df.isna()
Sum_of_Nan= Nan_Entries.sum()
Sorted_Sum = Sum_of_Nan.sort_values(ascending=False)
print(Sorted_Sum)

### Now write it in one line!
df.isna().sum().sort_values(ascending=False)
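Dropping every row with any missing value is the bluntest option. As a hedged sketch on a small made-up DataFrame (the toy data and column names are hypothetical and not taken from the coffee dataset), dropna() can be restricted to the columns you actually need, and isna().mean() reports the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the coffee dataset
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "altitude": [1200.0, np.nan, 1500.0, np.nan],
    "score": [82.5, 84.0, np.nan, 80.0],
})

# Default behaviour: drop a row if any column is missing
drop_any = toy.dropna()

# Only drop rows where a specific column is missing
drop_by_score = toy.dropna(subset=["score"])

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

print(len(drop_any), len(drop_by_score))
print(missing_fraction)
```

Restricting the check to the columns used in the analysis often keeps many more rows than a blanket dropna().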

As you perform this analysis, you will probably notice that we’ve lost quite a bit of our original data by simply dropping the nan values. There is another approach that we can examine, however. Instead of dropping the missing entries entirely, we can impute their value using the data we do have. For a single column we can do this like so:

In [ ]:

from sklearn.impute import SimpleImputer

imp = SimpleImputer(
    missing_values=np.nan,
    strategy='mean'
)

# SimpleImputer expects 2D input, so we reshape our single feature into one column
altitude = df['altitude_mean_meters'].values.reshape((-1,1))
imp.fit(altitude)

# transform returns a 2D array with one column; ravel() flattens it back to 1D
df['altitude_mean_meters_imputed'] = imp.transform(altitude).ravel()

In [ ]:

df[['altitude_mean_meters','altitude_mean_meters_imputed']].head(10)

OK, great! Now we have replaced the missing NaN values with the mean altitude. While this obviously isn’t as good as original data, in a lot of situations this can be a step up from losing rows entirely.
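To see exactly what the imputer fills in, here is a minimal sketch on a handful of made-up altitude readings (the numbers are hypothetical). It also compares the mean strategy with the median strategy, which is less sensitive to extreme values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical altitudes with two missing readings and one large value
altitudes = np.array([[1000.0], [np.nan], [2000.0], [6000.0], [np.nan]])

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# fit_transform learns the statistic from the observed values
# and fills the missing entries in one step
filled_mean = mean_imputer.fit_transform(altitudes)
filled_median = median_imputer.fit_transform(altitudes)

print(filled_mean.ravel())
print(filled_median.ravel())
print(mean_imputer.statistics_)
```

The learned fill value is stored in the statistics_ attribute after fitting, which is handy for checking what was imputed.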

Sophisticated analysis can be done in only a few lines using Pandas. Let’s say that we want to get the average coffee rating by country. First, we can use the groupby method to automatically collect the results by country. Then, we can select the column we want – quality_score – and calculate its mean the same way we would using NumPy:

In [ ]:

df_clean.groupby('Country of Origin')['quality_score'].mean()

This is certainly interesting, but it could be presented better. First, all of the ratings are pretty high (what’s the highest and lowest rating?). Let’s standardize to zero mean and unit variance so that we can tell the difference more easily. We’ll just do that on our subset here for now, but you can apply it to the entire dataset too!

In [ ]:

country_means = df_clean.groupby('Country of Origin')['quality_score'].mean()
mu, si = country_means.mean(), country_means.std() #Calculate the mean and standard deviation of the per-country mean scores
country_means -= mu #Subtract the mean from every entry
country_means /= si #Divide every entry by the standard deviation
country_means

This is a lot clearer! Finally, let’s sort this list so that it’s easier to compare entries.

In [ ]:

country_means.sort_values()
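As a quick sanity check on the standardization step, the sketch below applies the same subtract-the-mean, divide-by-the-standard-deviation recipe to a small made-up Series (the values are hypothetical). It also highlights one detail worth knowing: pandas' std() uses the sample standard deviation (ddof=1) by default, while NumPy's np.std uses the population version (ddof=0).

```python
import numpy as np
import pandas as pd

# Hypothetical per-country mean scores
scores = pd.Series([80.0, 82.0, 84.0, 86.0], index=["A", "B", "C", "D"])

# z-score: subtract the mean, divide by the standard deviation
standardized = (scores - scores.mean()) / scores.std()

# pandas defaults to ddof=1, NumPy defaults to ddof=0
pandas_std = scores.std()
numpy_std = np.std(scores.to_numpy())

print(standardized)
print(standardized.mean(), standardized.std())
print(pandas_std, numpy_std)
```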

Finally, we’ll look at indexing using Pandas. Let’s say that we want to only look at the coffee entries from Taiwan. We can use the following syntax to identify those rows:

In [ ]:

df_clean[df_clean['Country of Origin'] == 'Taiwan']

Say that out of the Taiwanese coffees, we only want to look at those which are the Bourbon variety. We can combine both conditions into a single boolean mask with the & operator (each condition needs its own parentheses):

In [ ]:

df_clean[(df_clean['Country of Origin'] == 'Taiwan') & (df_clean['Variety'] == 'Bourbon')]

Scikit-learn Basics¶
Scikit-learn is a great library to use for doing machine learning in Python. Data preparation, exploratory data analysis (EDA), classification, regression, clustering; it has it all.

Scikit-learn usually expects the features to be in the form of a 2D matrix with dimensions n_samples x n_features, with the target supplied as a separate array of length n_samples. To get acquainted with scikit-learn, we are going to use the iris dataset, one of the most famous datasets in pattern recognition.

Each entry in the dataset represents an iris plant, and is categorized as:

Setosa (class 0)
Versicolor (class 1)
Virginica (class 2)

These represent the target classes to predict. Each entry also includes a set of features, namely:

Sepal length (cm)
Sepal width (cm)
Petal length (cm)
Petal width (cm)

In the context of machine learning classification, the remainder of the lab is going to investigate the following question:

Can we design a model that, based on the iris sample features, can accurately predict the iris sample class?

Scikit-learn has a copy of the iris dataset readily importable for us. Let’s grab it now and conduct some EDA.

In [ ]:

from sklearn.datasets import load_iris
iris_data = load_iris()
feature_data = iris_data.data

What is the shape of this feature data?
The data type?
How many samples are there?
How many features are there?

In [ ]:

print("Feature data shape:", feature_data.shape)
print("Feature data type:", type(feature_data[0,0]))
print("Number of samples:", feature_data.shape[0])
print("Number of features:", feature_data.shape[1])

Next, we will save the target classification data in a similar fashion.

In [ ]:

target_data = iris_data.target
target_names = iris_data.target_names

What values are in “target_data”?
What is the data type?
What values are in “target_names”?
What is the data type?
How many samples are of type “setosa”?

In [ ]:

print("Target data content:", np.unique(target_data))
print("Target data type:", type(target_data[0]))

print("Target names content:", target_names[0:10])
print("Target names type:", type(target_names[0]))

setosa_samples = len([t for t in target_data if t == target_names.tolist().index('setosa')])
print("%d samples of type setosa" % setosa_samples)

print(np.sum(target_data==0))
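Counting one class at a time works, but all class counts can be obtained in a single call. The following is a minimal sketch using np.unique with return_counts=True; the names labels and counts_by_name are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
labels = iris.target

# np.unique with return_counts=True gives each class and its frequency
classes, counts = np.unique(labels, return_counts=True)

# Pair each class name with its count
counts_by_name = {
    str(iris.target_names[c]): int(n) for c, n in zip(classes, counts)
}
print(counts_by_name)
```

Checking the class balance like this is a useful habit before training a classifier, since accuracy is harder to interpret when one class dominates.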

We can also do some more visual EDA by plotting the samples according to a subset of the features and coloring the data points to coincide with the sample classification. We will use matplotlib, a powerful plotting library within Python, to accomplish this.

For example, let’s plot sepal width vs. sepal length.

In [ ]:

import matplotlib.pyplot as plt

In [ ]:

setosa = feature_data[target_data==0]
versicolor = feature_data[target_data==1]
virginica = feature_data[target_data==2]

plt.scatter(setosa[:,0], setosa[:,1], label="setosa")
plt.scatter(versicolor[:,0], versicolor[:,1], label="versicolor")
plt.scatter(virginica[:,0], virginica[:,1], label="virginica")

plt.legend()
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.title("Visual EDA");

In the above step, we used boolean indexing to filter the feature data based on the target data class. This allowed us to create a scatter plot for each of the iris classes and distinguish them by color.

Observations: We can see that the “setosa” class typically consists of medium-to-high sepal width with low-to-medium sepal length, while the other two classes have lower width and higher length. The “virginica” class appears to have the largest combination of the two.

YOUR TURN:

Which of the iris classes is separable based on sepal characteristics?
Which of the iris classes is not?
Can we (easily) visualize each of the samples w.r.t. all features on the same plot? Why/why not?

Creating a Nearest Neighbors Classifier¶
Now that we’ve explored the data a little bit, we’re going to use scikit-learn to create a nearest neighbors classifier for the data. Effectively we’ll be developing a model whose job it is to build a relationship over input feature data (sepal and petal characteristics) that predicts the iris sample class (e.g. “setosa”). This is an example of a supervised learning task; we have all the features and all the target classes.

Model creation in scikit-learn follows a data prep -> fit -> predict process. The “fit” function is where the actual model is trained and parameter values are selected, while the “predict” function actually takes the trained model and applies it to the new samples.

First, we load the nearest neighbor library from scikit-learn:

In [ ]:

from sklearn import neighbors

Now, we’re going to save our feature data into an array called ‘X’ and our target data into an array called ‘y’. We don’t need to do this, but it is traditional to think of the problem using this notation.

In [ ]:

X = feature_data
y = target_data

Next, we create our nearest neighbor classifier object:

In [ ]:

knn = neighbors.KNeighborsClassifier(n_neighbors=1)

And then we fit it to the data (i.e., train the classifier).

In [ ]:

knn.fit(X,y)

Now we have a model! If you’re new to this, you’ve officially built your first machine learning model. If you use “knn.predict([[feature array here]])”, you can use your trained model to predict the class of a new iris sample.

What is the predicted class of a new iris sample with feature vector [3,4,5,2]? What is its name?
Do you think this model is overfit or underfit to the iris dataset? Why?
How many neighbors does our model consider when classifying a new sample?

In [ ]:

t = knn.predict(np.array([[3,4,5,2]]))[0]
print("New prediction is class %d, aka. %s" % (t, target_names[t]))
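To build some intuition for what a 1-nearest-neighbor prediction does, here is a from-scratch sketch in plain NumPy. It is not scikit-learn's implementation: predict_1nn is a hypothetical helper, the toy data is made up, and Euclidean distance is assumed (which matches the default metric of KNeighborsClassifier).

```python
import numpy as np

def predict_1nn(X_train, y_train, x_new):
    """Return the label of the training sample closest to x_new."""
    # Euclidean distance from x_new to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Index of the smallest distance
    nearest = np.argmin(distances)
    return y_train[nearest]

# Hypothetical 2-feature training set with two classes
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]])
y_toy = np.array([0, 0, 1, 1])

print(predict_1nn(X_toy, y_toy, np.array([1.1, 1.0])))
print(predict_1nn(X_toy, y_toy, np.array([4.8, 5.2])))
```

Note that "fitting" here amounts to storing the training data; all of the work happens at prediction time.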

As you may have noted in the previous cell, we’ve trained this classifier on our entire dataset. This typically isn’t done in practice and results in overfitting to the data. Here’s a bit of a tricky question:

If we use our classifier to predict the classes of the iris samples that were used to train the model itself, what will our overall accuracy be?

We can validate our hypothesis fairly easily using either: i) the NumPy technique for calculating accuracy we used earlier in the lab, or ii) scikit-learn’s in-house “accuracy_score()” function.

Let’s use our technique first:

In [ ]:

accuracy = np.sum(target_data == knn.predict(feature_data)) / target_data.size
print("Accuracy: ", accuracy * 100, "%")

and then using scikit-learn’s customized function:

In [ ]:

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(target_data, knn.predict(feature_data))
print("Accuracy: ", accuracy * 100, "%")

We see that our classifier has achieved 100% accuracy (and both calculation methods agree)!

DISCUSSION:

Why do you think the model was able to achieve such a “great” result?
What does this really tell us?
Do you expect the model to perform this well on new data?

Cross Validation¶
A popular way to mitigate this overfitting issue is to train your model on some of the data (the training set) and validate your model on the remaining data (the validation set). You will then select the model/configuration that performs best on the validation data. The train/validate division of the data is usually done with a 70%/30% split. Often, practitioners will use a third data set, the test set (or hold-out set), to get a sense for how their best model performs on unseen, real-world data. In this scenario, you will tune your models to perform best on the validation set and then test their “real-world” performance on the unseen test set.

Sometimes applications don’t have enough data to do these splits meaningfully (e.g., the test data is only a few samples). In these cases, cross-validation is a useful technique (and, indeed, has become standard in machine learning practice).

The general premise of “k-folds” cross validation is to first divide the entire dataset (grey) into a training set (green) and a test set (unseen data, blue). Then, we divide the training set into different folds and use these folds to form new sub-training and sub-test sets. We select the model configuration that performs the best on all of these. The below figure provides a nice visualization for what’s going on here:

Accomplishing k-folds cross validation in scikit-learn is a manageable task. First, we divide our data into a train and test set, then we conduct the cross validation and look at the mean scores across the splits, then we conduct our final evaluation.

In [ ]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feature_data, target_data, test_size=0.3, random_state=0)

We have divided our data into two sections: training data (70% of the data) and testing data (30% of the data). Now, we will fit our nearest neighbors classifier to the training data using 5-fold cross-validation and see how it performs.

We will be applying cross_validate in sklearn to perform cross-validation. We can get both train and validation accuracies using cross_validate. Please note that you should set return_train_score=True if you want cross_validate to return train scores in addition to test scores.

You can use the following link to learn more about cross_validate:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

In [ ]:

from sklearn.model_selection import cross_validate
scores = cross_validate(knn, X_train, y_train, cv=5, return_train_score=True)

print('Mean Train Accuracy:', scores['train_score'].mean()) # returns the mean cross-validation train score
print('Mean Validation Accuracy:', scores['test_score'].mean()) # returns the mean cross-validation validation score

Our cross-validated model has an accuracy of 94% across all the splits on the training data. If we think that is a reasonable value, we can train our final model on the training data and then see how it performs on the held-out test data.
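It can help to see the loop that cross-validation runs for us. The sketch below writes the five folds out by hand, assuming stratified folds without shuffling, which is what scikit-learn uses by default when an integer cv is passed for a classifier. It reloads the data so that it is self-contained, and the name fold_scores is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

fold_scores = []
splitter = StratifiedKFold(n_splits=5)
for train_idx, val_idx in splitter.split(X_train, y_train):
    # Fit a fresh model on four of the folds
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X_train[train_idx], y_train[train_idx])
    # Score it on the one fold that was left out
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

print(fold_scores)
print(np.mean(fold_scores))
```

Each sample in the training set is used for validation exactly once, and the held-out test set is never touched inside the loop.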

Comparing classifiers¶
However, to get a true sense for the utility of cross-validation, let’s create a second nearest neighbors classifier that uses five neighbors instead of one.

In [ ]:

knn_5 = neighbors.KNeighborsClassifier(n_neighbors=5)
scores = cross_validate(knn_5, X_train, y_train, cv=5, return_train_score=True)

print('Mean Train Accuracy:', scores['train_score'].mean()) # returns the mean cross-validation train score
print('Mean Validation Accuracy:', scores['test_score'].mean()) # returns the mean cross-validation validation score

Let’s train it on the training data and use it to predict the final held-out test data.

In [ ]:

knn_5.fit(X_train, y_train)
accuracy = accuracy_score(y_test, knn_5.predict(X_test))
print("Test set accuracy: ", accuracy * 100, "%")

And we see that our model has about 97.8% accuracy on the held-out test data (30% of the original dataset).
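The comparison of two classifiers generalizes naturally to a sweep over several values of n_neighbors. The following is a sketch under stated assumptions rather than a prescribed procedure: the candidate list is an arbitrary illustrative choice, and the data is reloaded so that the block is self-contained. The key idea is that model selection uses only the cross-validated training data, and the test set is used once at the very end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Candidate values are an arbitrary illustrative choice
candidate_k = [1, 3, 5, 7, 9]
mean_val_scores = {}
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k)
    cv_results = cross_validate(model, X_train, y_train, cv=5)
    mean_val_scores[k] = cv_results["test_score"].mean()

# Select the k with the highest mean validation accuracy
best_k = max(mean_val_scores, key=mean_val_scores.get)

# Refit on all training data; use the test set only once, at the end
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(mean_val_scores)
print(best_k, test_accuracy)
```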