Laboratory #2 Tensorflow and CNN
Table of Contents
Step1. GPU
Step2. Implement handwritten recognition in Tensorflow using CNN
Step3. Text mining using CNN
3.1. Pre-processing
3.2. Word embeddings
3.3. Model training
One of the main reasons for the recent breakthroughs of DNNs is the power of modern supercomputers, especially with the introduction of GPUs. In this lab we will first review a tool that gives us access to a GPU, and then continue the discussion of TensorFlow that we started in Lab 1.
Step1. GPU
Google Colab is an environment that gives developers access to an interactive IDE. Colab has several advantages. For example,
You have access to both Python 2 and 3.
You have access to CPU, GPU, and TPU.
You can run Linux commands in the IDE environment.
Most of the required libraries are pre-installed.
You have access to the cloud for storing and retrieving your data.
Figure 1- Google colab and connection to other tools
You can either start coding in a new notebook or upload an existing IPython notebook from GitHub. Let's start by writing code from scratch. Here are the steps:
1- Go to https://colab.research.google.com. This is the page that you will see:
Figure 2- Main page of colab
2- You may go to the "GOOGLE DRIVE" tab to store your code on Google Drive. Click on the arrow next to "NEW PYTHON 3 NOTEBOOK" to choose the version of the language that you want to use. This is the environment that you will see, which is very similar to a Python notebook:
Figure 3- Colab coding environment
Before you start actual coding, click on "Runtime > Change runtime type" to choose between the available resources.
Figure 4- Available resources
Here again you can choose between the available Python versions. As you can see, you can choose to run your code on a GPU rather than a CPU to speed up your computations. You can also click on the name at the top of the page and change it to your desired name.
If you would like to see the available GPU resources, you may type the following commands in a cell:
from tensorflow.python.client import device_lib

print("Show System RAM Memory:\n\n")
!cat /proc/meminfo | egrep "MemTotal*"
print("\n\nShow Devices:\n\n" + str(device_lib.list_local_devices()))

To run the commands, click on the arrow (play) button on the left-hand side of the cell. This will show you output like this:
It looks like we have a Tesla P4 GPU, with 16 GB of RAM, which is good for our experiments.
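If you want more detail about the GPU itself, a quick check (assuming an NVIDIA GPU runtime is attached, as in Colab's GPU mode) is the standard nvidia-smi tool:

!nvidia-smi

This prints the GPU model, driver version, memory usage, and any processes currently using the GPU.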
Note that in case your TensorFlow is an old version, you should upgrade it. You can do this by typing the following command:

!pip install -q tensorflow-gpu==2.0.0

Then restart the runtime and continue the code from here:
Now you can check the version of TensorFlow using:

import tensorflow as tf
tf.__version__

We can mount a Google Drive into the Colab environment using:

from google.colab import drive
drive.mount('/content/gdrive')

This will show you a link; click on it and log in to your Google account. Then copy the provided authorization code back into the prompt in Colab.
This will mount your drive under "/content/gdrive". You can access these files from the file browser on the left-hand side of the environment.
Now we can load code or datasets from the Google Drive.
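As a minimal sanity check (assuming the mount point used above), you can list the contents of the drive directly from a cell:

!ls "/content/gdrive/My Drive"

If the mount succeeded, this prints the top-level folders and files of your Google Drive.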
There are other resources such as Amazon (AWS), Microsoft (Azure), or FloydHub, but unfortunately they are not free.
To start, we want to compare running code on your own machine and on the cloud system.
Q1- A Keras code is provided for running handwritten recognition on both GPU and CPU. Run the code on Colab and on your own machine and compare the results.
You may use the following code to measure the run time.

For running the code on your own machine:

import time

start = time.time()
%run Address/to/file/mnist_cnn
end = time.time()
print(end - start)

Run time on my computer: 2469 seconds

For Colab:

import time

start = time.time()
!python3 "Address/to/drive/mnist_cnn.py"
end = time.time()
print(end - start)

Run time on GPU: 55 seconds
Step2. Implement handwritten recognition in Tensorflow using CNN
In this section, we want to implement the code which is provided in the attached file in Colab and analyze each section of the code.
We start with loading the dataset:
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import backend as K
batch_size = 128
num_classes = 10
epochs = 12
# input image dimensions
img_rows, img_cols = 28, 28
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
This general code adds the channel dimension to the dataset. Remember that it is a key factor that must be attached to the dataset in order to analyze the data using a CNN.
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
Continue the pre-processing with:
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
Now is the time to design the model.
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
Q2- Explain the way that this model is designed. Talk about all the layers and their functionality.

Now, to compile, train, and evaluate the model, we can write:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_split=0.2)

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Q3- Plot the learning curves and talk about what you see.
Step3. Text mining using CNN
The majority of this part of the lab comes from the Real Python website.
3.1. Pre-processing:
For this part of the lab we want to see another application of CNNs, which is text mining. The dataset is downloaded from the Sentiment Labelled Sentences Data Set at the UCI Machine Learning Repository. It is also uploaded on Canvas. This data includes labeled reviews from Amazon, Yelp, and IMDB. We will work with only part of the dataset, the Amazon reviews. Each review is marked with 0 for negative sentiment or 1 for positive sentiment. Run the following code to load the data:
import pandas as pd
df = pd.read_csv('gdrive/My Drive/data/amazon_cells_labelled.txt',
                 names=['sentence', 'label'], sep='\t')
We can print one of the dataset values to see inside the dataframe.
print(df.iloc[0])
The result will be:
The way that the dataset is labeled is taught in the Sentiment Analysis course. The collection of texts (corpus) is analyzed and the frequency of each particular word is counted. Then it is compared to a dictionary to see if it is positive or negative. A feature vector is a vector that contains all the vocabulary words plus their counts. Let's see how these vectors are generated. Think of the sentences that we have as the following list named sentences:
sentences = ['John likes ice cream', 'John hates chocolate.']
CountVectorizer from the scikit-learn library can take these sentences and build the feature vectors. This is how it works:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=0, lowercase=False)
vectorizer.fit(sentences)
vectorizer.vocabulary_
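Fitting on the two example sentences produces a small dictionary mapping each word to its index; with the case-sensitive setting above it should look something like this (indices are assigned alphabetically, so the capitalized "John" sorts first):

{'John': 0, 'chocolate': 1, 'cream': 2, 'hates': 3, 'ice': 4, 'likes': 5}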
This vocabulary also serves as an index for each word. Now, you can take each sentence and get the word occurrences based on this vocabulary. The vocabulary consists of all six words in our sentences, each entry representing one word in the vocabulary. When you take the previous two sentences and transform them with the CountVectorizer, you will get a vector representing the count of each word of the sentence:
vectorizer.transform(sentences).toarray()
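With that vocabulary, the resulting arrays should look something like this (one row per sentence, one column per vocabulary word):

array([[1, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 0]])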
Now, you can see the resulting feature vectors for each sentence based on the previous vocabulary. For example, if you take a look at the first item, you can see that both vectors have a 1 there. This means that both sentences have one occurrence of John, which is in the first place in the vocabulary. This is called the Bag of Words (BOW) model.
Let's get back to our own problem, the "Amazon" dataset, for more analysis.
We can again use the BOW strategy to create vectorized sentences.
from sklearn.model_selection import train_test_split

sentences = df['sentence'].values
y = df['label'].values
sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.25, random_state=1000)

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test = vectorizer.transform(sentences_test)
X_train
It shows 750 samples, which is the number of training samples. Each sample has 1546 dimensions, which is the size of the vocabulary.
Just as a side note, we don't always need to use fancy algorithms. For example, even a logistic regression model gives us a reasonable result here:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)
print("Accuracy:", score)
Now, we can implement a normal DNN:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
input_dim = X_train.shape[1] # Number of features
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

hist = model.fit(X_train, y_train, epochs=100, validation_split=0.2, batch_size=10)
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Test Accuracy: ", accuracy*100)
Let's draw the learning curves. We can use a plotting function to draw them, as sketched below; after running it you will see the accuracy and loss curves for training and validation.
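A minimal sketch of such a plotting function, assuming the History object returned by model.fit() above (stored in hist) and the TensorFlow 2 metric names ('accuracy'/'val_accuracy'; older Keras versions use 'acc'/'val_acc' instead):

import matplotlib.pyplot as plt

def plot_history(history):
    # Metric key names assume TF 2; adjust to 'acc'/'val_acc' for older Keras
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs_range = range(1, len(acc) + 1)

    plt.figure(figsize=(12, 5))
    # Left panel: training vs. validation accuracy per epoch
    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, acc, 'b', label='Training accuracy')
    plt.plot(epochs_range, val_acc, 'r', label='Validation accuracy')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epoch')
    plt.legend()
    # Right panel: training vs. validation loss per epoch
    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, loss, 'b', label='Training loss')
    plt.plot(epochs_range, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epoch')
    plt.legend()
    plt.show()

plot_history(hist)

Calling plot_history(hist) draws the training and validation accuracy in one panel and the training and validation loss in the other, one point per epoch.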
Q4- Explain these graphs. If you see any issue, suggest a solution to resolve it. Rebuild the model with 3 hidden layers (the first with 200 nodes, the second with 100 nodes, and the last with 50 nodes), add a dropout of 0.2 after each of them, and report the accuracy.
If you don't see a huge improvement, don't worry; we are not done with the model yet.
3.2. Word embeddings:
Text is considered a form of sequence data, similar to the time series data that you would have in weather or financial data. In the previous BOW model, you have seen how to represent a whole sequence of words as a single feature vector. Now you will see how to represent each word as a vector. There are various ways to vectorize text, such as:
Words, represented by one vector per word
Characters, represented by one vector per character
N-grams of words/characters, represented as a vector (N-grams are overlapping groups of multiple succeeding words/characters in the text)
If you want to learn more about the algorithm, you may read this website:
https://medium.com/@krishnakalyan3/a-gentle-introduction-to-embedding-567d8738372b
Data pre-processing steps:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)
X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

print(sentences_train[3])
print(X_train[3])
The indexing is ordered by the most common words in the text, which you can see from the word "this" having index 1. It is important to note that the index 0 is reserved and is not assigned to any word. This zero index is used for padding, which we'll introduce in a moment.
Unknown words (words that are not in the vocabulary) are denoted in Keras with word_count + 1 since they can also hold some information. You can see the index of each word by taking a look at the word_index dictionary of the Tokenizer object:
for word in ['the', 'all', 'happy']:
    print('{}: {}'.format(word, tokenizer.word_index[word]))
With CountVectorizer, we had stacked vectors of word counts, and each vector was the same length (the size of the total corpus vocabulary). With Tokenizer, the resulting vectors equal the length of each text, and the numbers don’t denote counts, but rather correspond to the word values from the dictionary tokenizer.word_index.
We can add a parameter to specify how long each sequence should be; shorter sequences are padded with zeros and longer ones are truncated.
from keras.preprocessing.sequence import pad_sequences
maxlen = 100
# Pad variables with zeros
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
print(X_train[0, :])
3.3. Model training:
We can now start training the model:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
embedding_dim = 50
model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.GlobalMaxPool1D())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()
This shows a summary of the model.
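Most of the parameters reported by the summary sit in the Embedding layer, which stores vocab_size × embedding_dim weights (one 50-dimensional vector per vocabulary word). The Dense layers are comparatively tiny: 50 × 10 + 10 = 510 parameters for the hidden layer and 10 × 1 + 1 = 11 for the output layer. The exact embedding count varies because vocab_size depends on the training split.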
And train and evaluate the model with:

hist = model.fit(X_train, y_train,
                 epochs=50,
                 validation_split=0.2,
                 batch_size=10)

loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Accuracy: ", accuracy)

Accuracy = 81%

Q5- How do you interpret these results?
Q6- What is your recommendation to improve the accuracy? Implement your idea.