Reading and processing a data set.
This code is deliberately written to be easy to understand, minimizing the use of libraries, syntactic sugar etc.
If you are comfortable with Python programming, and/or once you’ve understood the basic logic below, you are welcome to use libraries such as ‘csv’ or ‘pandas’, or any other shortcuts Python has to offer.
In [289]:
# let's read the data into a list of lines
data = open('data.csv', 'r').readlines()
# we know that the first line holds the column labels; the remaining lines contain the data
header = data[0]
instances = data[1:]
In [290]:
# let’s look at our labels
print(header)
outlook,temperature,humidity,windy,play
In [291]:
# let’s look at our data
print(instances)
['sunny,hot,high,FALSE,no\n', 'sunny,hot,high,TRUE,no\n', 'overc,hot,high,FALSE,yes\n', 'rainy,mild,high,FALSE,yes\n', 'rainy,cool,normal,FALSE,yes\n', 'rainy,cool,normal,TRUE,no\n', 'overc,cool,normal,TRUE,yes\n', 'sunny,mild,high,FALSE,no\n', 'sunny,cool,normal,FALSE,yes\n', 'rainy,mild,normal,FALSE,yes\n', 'sunny,mild,normal,TRUE,yes\n', 'overc,mild,high,TRUE,yes\n', 'overc,hot,normal,FALSE,yes\n', 'rainy,mild,high,TRUE,no\n']
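For comparison, the same read can be sketched with Python’s built-in csv module, which removes the newline and splits each line at the commas in one step. This is a minimal sketch rather than part of the notebook: so that it runs without the file, it reads from an in-memory string (`sample_text`, a hypothetical stand-in holding the header and the first three rows of data.csv).

```python
import csv
import io

# Hypothetical stand-in for data.csv: the header plus the first three rows shown above.
sample_text = (
    "outlook,temperature,humidity,windy,play\n"
    "sunny,hot,high,FALSE,no\n"
    "sunny,hot,high,TRUE,no\n"
    "overc,hot,high,FALSE,yes\n"
)

# csv.reader yields each line as a list of fields, with the newline already removed.
# For the real file, replace io.StringIO(sample_text) with open('data.csv', newline='').
with io.StringIO(sample_text) as f:
    rows = list(csv.reader(f))

csv_header = rows[0]      # column names
csv_instances = rows[1:]  # one list of field values per instance

print(csv_header)
print(csv_instances[0])
```

Each instance arrives already split into fields, so the strip and split steps used later in this tutorial would not be needed on this route.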
What do we want to do with the data? Recall from the lecture that our goal is to predict the class (whether or not to play outside) from a set of attributes (outlook, temperature, humidity, windy).
So, let’s take our list of instances and create from it a list of features (x) and a list of labels (y).
In [292]:
# first, initialize the empty lists
features = []
labels = []
# iterate over our instances:
for instance in instances:
    instance = instance.strip()  # remove all leading and trailing whitespace (i.e., the newline symbol '\n')
    instance = instance.split(",")  # split each instance at each comma, into separate values
    inst_features = instance[:4]  # store the first 4 fields as the instance's features
    # store the label as the last field
    # (Python supports indexing starting from the final element, using negative indices)
    inst_label = instance[-1]
    # append this instance's features / label to our global lists of features / labels
    features.append(inst_features)
    labels.append(inst_label)
Let’s look at what we got
In [293]:
print("all features: {}\n".format(features))
print("all labels : {}\n".format(labels))
# print features and label of 1st instance
print("features of first instance: {}\nlabel of first instance: {} ".format(features[0], labels[0]))
all features: [['sunny', 'hot', 'high', 'FALSE'], ['sunny', 'hot', 'high', 'TRUE'], ['overc', 'hot', 'high', 'FALSE'], ['rainy', 'mild', 'high', 'FALSE'], ['rainy', 'cool', 'normal', 'FALSE'], ['rainy', 'cool', 'normal', 'TRUE'], ['overc', 'cool', 'normal', 'TRUE'], ['sunny', 'mild', 'high', 'FALSE'], ['sunny', 'cool', 'normal', 'FALSE'], ['rainy', 'mild', 'normal', 'FALSE'], ['sunny', 'mild', 'normal', 'TRUE'], ['overc', 'mild', 'high', 'TRUE'], ['overc', 'hot', 'normal', 'FALSE'], ['rainy', 'mild', 'high', 'TRUE']]
all labels : ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
features of first instance: ['sunny', 'hot', 'high', 'FALSE']
label of first instance: no
Now, computers are much better at working with numbers than with strings. Let’s write a function that maps each distinct value to a unique number. We can do this by
1. creating a set of all occurring values (a set by definition contains each value exactly once)
2. mapping each value to its position in a list built from this set
For example
• our observed values are v=[a,b,c,a,a,b,d]
• turning this into a set: set(v)={a,b,c,d}
• and turning each value into a number based on its position: a=0, b=1, c=2, d=3
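The worked example can be reproduced in a few lines. This is only a sketch of the idea: it sorts the distinct values (an assumption made here so that the numbering is the same on every run, since a plain set has no guaranteed order), and the names `unique_values` and `numeric_v` are chosen for illustration.

```python
# Observed values from the example above.
v = ['a', 'b', 'c', 'a', 'a', 'b', 'd']

# Assumption: sort the distinct values so the numbering is reproducible
# (the iteration order of a plain set is not guaranteed).
unique_values = sorted(set(v))

# Replace each value by its position in the sorted list of distinct values.
numeric_v = [unique_values.index(value) for value in v]

print(unique_values)  # ['a', 'b', 'c', 'd']
print(numeric_v)      # [0, 1, 2, 0, 0, 1, 3]
```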
In [294]:
def string_feature_to_numeric_feature(str_values):
    # create a list of all distinct values in str_values (note: the order of a set is arbitrary)
    str_value_set = list(set(str_values))
    numeric_values = []  # initialize our new value list
    for str_value in str_values:
        # Python way of saying: 'give me the position of str_value in the list str_value_set'
        num_value = str_value_set.index(str_value)
        numeric_values.append(num_value)  # append the numeric value to the new value list
    return numeric_values  # return the new numeric values as an output of the function
Let’s see if it works
In [295]:
numeric_labels = string_feature_to_numeric_feature(labels)
print("string labels : {}".format(labels))
print("numeric labels: {}".format(numeric_labels))
string labels : ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
numeric labels: [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]
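Note that the order of the elements in a Python set is not guaranteed, so which label receives 0 and which receives 1 (here: no=1, yes=0) may differ from run to run. One possible variation, sketched below, numbers the values in order of first appearance and keeps the mapping in a dictionary so that it can be inverted later. The function name `encode_by_first_appearance` is hypothetical and not part of the notebook.

```python
def encode_by_first_appearance(str_values):
    """Map each distinct string to an integer, numbered in order of first appearance."""
    value_to_number = {}
    numeric_values = []
    for str_value in str_values:
        # a value seen for the first time gets the next free number
        if str_value not in value_to_number:
            value_to_number[str_value] = len(value_to_number)
        numeric_values.append(value_to_number[str_value])
    return numeric_values, value_to_number


example_labels = ['no', 'no', 'yes', 'yes', 'yes', 'no']
codes, mapping = encode_by_first_appearance(example_labels)
print(codes)    # [0, 0, 1, 1, 1, 0]
print(mapping)  # {'no': 0, 'yes': 1}

# Invert the mapping to get the original strings back.
number_to_value = {number: value for value, number in mapping.items()}
decoded = [number_to_value[code] for code in codes]
print(decoded == example_labels)  # True
```

A dictionary lookup also avoids searching the list of distinct values once per instance, which matters little for 14 rows but can for larger data sets.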
Similarly, we can iterate over all features (columns in our data matrix) and transform them to numeric values, just as we did for the labels
In [296]:
# print the original features
print("This is our original feature matrix:")
for i in features:
    print('\t'.join(i))
# initialize our new structure to hold the numeric features
numeric_features = [[] for i in features]
# iterate over each feature (i.e., over the columns of our data set)
for feature_idx in range(len(features[0])):
    # extract all values for that feature (i.e., write all values in the nth column into a list)
    str_feat_values = [values[feature_idx] for values in features]
    # apply our function
    num_feat_values = string_feature_to_numeric_feature(str_feat_values)
    # write the new, numeric feature values into the numeric feature structure
    for idx, instance in enumerate(features):
        numeric_features[idx].append(num_feat_values[idx])
# print the new, numeric features
print("\n\nThis is our new, numeric feature matrix:")
for i in numeric_features:
    print('{}\t{}\t{}\t{}'.format(i[0], i[1], i[2], i[3]))
This is our original feature matrix:
sunny hot high FALSE
sunny hot high TRUE
overc hot high FALSE
rainy mild high FALSE
rainy cool normal FALSE
rainy cool normal TRUE
overc cool normal TRUE
sunny mild high FALSE
sunny cool normal FALSE
rainy mild normal FALSE
sunny mild normal TRUE
overc mild high TRUE
overc hot normal FALSE
rainy mild high TRUE
This is our new, numeric feature matrix:
1 1 0 0
1 1 0 1
2 1 0 0
0 0 0 0
0 2 1 0
0 2 1 1
2 2 1 1
1 0 0 0
1 2 1 0
0 0 1 0
1 0 1 1
2 0 0 1
2 1 1 0
0 0 0 1
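The column-by-column loop above amounts to transposing the feature matrix, encoding each column, and transposing back. A compact sketch of that idea uses `zip(*rows)` for both transposes. The helper `encode_column` is a hypothetical stand-in for the function defined earlier; it sorts the distinct values (an assumption) so that the output is reproducible, which means its numbering can differ from the matrix printed above.

```python
def encode_column(column):
    # Assumption: sort the distinct values so the numbering is the same on every run.
    distinct = sorted(set(column))
    return [distinct.index(value) for value in column]


# A small excerpt of the feature matrix (three instances, four features).
small_features = [
    ['sunny', 'hot', 'high', 'FALSE'],
    ['overc', 'hot', 'high', 'FALSE'],
    ['rainy', 'mild', 'normal', 'TRUE'],
]

# zip(*rows) turns a list of rows into an iterator over columns.
columns = zip(*small_features)
encoded_columns = [encode_column(column) for column in columns]

# Transpose back so that each inner list is again one instance.
encoded_rows = [list(row) for row in zip(*encoded_columns)]

for row in encoded_rows:
    print(row)
```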
Opening and reading csv files with Python’s Pandas library
There are various useful libraries which allow you to handle data sets much more efficiently (even though you could implement everything they do yourself fairly easily, similarly to the code you see above). The most important one is called Pandas. Below is some Pandas example code.
In [297]:
import pandas as pd
data_p = pd.read_csv('data.csv', sep=',')
print(data_p.head())
outlook temperature humidity windy play
0 sunny hot high False no
1 sunny hot high True no
2 overc hot high False yes
3 rainy mild high False yes
4 rainy cool normal False yes
In [298]:
label_p = data_p['play']
print(label_p)
0 no
1 no
2 yes
3 yes
4 yes
5 no
6 yes
7 no
8 yes
9 yes
10 yes
11 yes
12 yes
13 no
Name: play, dtype: object
In [299]:
features_p = data_p[['outlook', 'temperature', 'humidity', 'windy']]
print(features_p)
outlook temperature humidity windy
0 sunny hot high False
1 sunny hot high True
2 overc hot high False
3 rainy mild high False
4 rainy cool normal False
5 rainy cool normal True
6 overc cool normal True
7 sunny mild high False
8 sunny cool normal False
9 rainy mild normal False
10 sunny mild normal True
11 overc mild high True
12 overc hot normal False
13 rainy mild high True
In [300]:
# Turning string features into numeric features with minimal code. We use three handy tools:
# 1 Pandas' 'apply' function, which applies an operation to each column of the input dataframe
# 2 Pandas' 'factorize', which automatically maps each categorical value to a unique integer;
#   it returns both the converted values and the mapping it used. We are only interested in the converted
#   values (hence the index [0])
# 3 Python's lambda functionality 'lambda i: expression', which defines a small anonymous function
#   that evaluates 'expression' for its input argument (here: a column)
numeric_features_p = features_p.apply(lambda feature: pd.factorize(feature)[0])
print(numeric_features_p)
numeric_labels_p = pd.factorize(label_p)[0]
print(numeric_labels_p)
outlook temperature humidity windy
0 0 0 0 0
1 0 0 0 1
2 1 0 0 0
3 2 1 0 0
4 2 2 1 0
5 2 2 1 1
6 1 2 1 1
7 0 1 0 0
8 0 2 1 0
9 2 1 1 0
10 0 1 1 1
11 1 1 0 1
12 1 0 1 0
13 2 1 0 1
[0 0 1 1 1 0 1 0 1 1 1 1 1 0]
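The second return value of `factorize`, which the cell above discards, holds the distinct values in order of first appearance and can be used to translate the integer codes back into strings. The sketch below illustrates this on a short, made-up Series (`example`, not taken from data.csv); it assumes Pandas is installed.

```python
import pandas as pd

# A short, made-up label column for illustration.
example = pd.Series(['no', 'no', 'yes', 'yes', 'no'], name='play')

# factorize returns the integer codes and the distinct values,
# numbered in order of first appearance.
codes, uniques = pd.factorize(example)
print(list(codes))    # [0, 0, 1, 1, 0]
print(list(uniques))  # ['no', 'yes']

# Indexing the distinct values with the codes recovers the original strings.
restored = uniques[codes]
print(list(restored) == list(example))  # True
```

Because the numbering follows first appearance, the codes do not depend on set ordering and stay the same from run to run.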
After working through this tutorial you should know
• how to open and read in a data set from a csv file
• how to split the data set into features (i.e., input to your ML algorithm) and labels (i.e., desired output of your ML algorithm)
• how to map string values to numeric values
• how to import and use a library (here: Pandas) in your Python program