Notebook 5: Bits and Pieces¶
At this stage I guess you have enough knowledge, and hopefully enough skills, to ‘do some damage’. There are a couple of data structures I would still like to address; structures that I use almost every day.
Dictionaries¶
The first one is the ‘dictionary’. A dictionary is like a list, but instead of integer indexes it uses user-created labels, or ‘keys’. Every value in a dictionary can be accessed by its key, and you decide what the keys are. Since Python 3.7 these data structures are insertion-ordered, meaning that the order in which key-value pairs are added determines the order of the dictionary. Let’s create one:
In [1]:
# create an empty dictionary, use curly brackets
my_dict = {}
# add a key-value pair, adding and accessing is done with square brackets! Yay!
my_dict['key_1'] = 'value'
# add a second
my_dict[0] = 12
# and a third
my_dict['99 problems'] = False
# show the dictionary
my_dict
Out[1]:
{'key_1': 'value', 0: 12, '99 problems': False}
You can get a list of all keys with:
In [2]:
my_dict.keys()
Out[2]:
dict_keys(['key_1', 0, '99 problems'])
As you can see it’s not a real list. But you can use it as a normal list as we will see shortly. You can explicitly convert it to a list:
In [3]:
list(my_dict.keys())
Out[3]:
['key_1', 0, '99 problems']
For all values we have a similar function:
In [4]:
my_dict.values()
Out[4]:
dict_values(['value', 12, False])
Did you notice we can feed the dictionary key-value pairs without using a special function? There is no append; you just assign a value to a key. You can access a value like you have done with lists:
In [5]:
my_dict['Sherlock'] = 'Holmes'
my_dict['Sherlock']
Out[5]:
'Holmes'
In real life these dictionaries can be big. Especially when you generate them automatically. Let me give you a small example. Let’s split a piece of text into separate words and do a word count. We generate a dictionary with the found words as keys, and their frequency as values:
In [6]:
# this is our text, run the cell
text = 'Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.[1][2]:2 Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.'
In [7]:
# generate a word-count dictionary
word_count = {}
# iterate over all words
for word in text.split():
    # if this word is not a part of our dictionary
    if word not in word_count.keys():
        # create the key, with value 0
        word_count[word] = 0
    # add 1 for this particular word
    word_count[word] += 1
# I use print here to avoid a lengthy output
print(word_count)
{'Machine': 3, 'learning': 3, '(ML)': 1, 'is': 3, 'the': 3, 'scientific': 1, 'study': 1, 'of': 3, 'algorithms': 3, 'and': 3, 'statistical': 1, 'models': 1, 'that': 1, 'computer': 2, 'systems': 1, 'use': 1, 'to': 4, 'perform': 2, 'a': 5, 'specific': 1, 'task': 1, 'without': 2, 'using': 1, 'explicit': 1, 'instructions,': 1, 'relying': 1, 'on': 2, 'patterns': 1, 'inference': 1, 'instead.': 1, 'It': 1, 'seen': 1, 'as': 3, 'subset': 1, 'artificial': 1, 'intelligence.': 1, 'build': 1, 'mathematical': 1, 'model': 1, 'based': 1, 'sample': 1, 'data,': 1, 'known': 1, '"training': 1, 'data",': 1, 'in': 2, 'order': 1, 'make': 1, 'predictions': 1, 'or': 2, 'decisions': 1, 'being': 1, 'explicitly': 1, 'programmed': 1, 'task.[1][2]:2': 1, 'are': 1, 'used': 1, 'wide': 1, 'variety': 1, 'applications,': 1, 'such': 1, 'email': 1, 'filtering': 1, 'vision,': 1, 'where': 1, 'it': 1, 'difficult': 1, 'infeasible': 1, 'develop': 1, 'conventional': 1, 'algorithm': 1, 'for': 1, 'effectively': 1, 'performing': 1, 'task.': 1}
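By the way, the ‘create the key if it’s missing’ dance is so common that Python’s standard library has a shortcut for it. A minimal sketch of the same counter with collections.defaultdict (I use the name wc here so we don’t overwrite our word_count):
In [ ]:
from collections import defaultdict
# a defaultdict hands every missing key a default value, here int() which is 0
wc = defaultdict(int)
for word in text.split():
    wc[word] += 1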
Let’s assume the text was much bigger. And you are not sure if a certain key exists, but if it does, you would like to know the frequency. When you do it like this:
In [8]:
word_count['Trump']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
----> 1 word_count['Trump']
KeyError: 'Trump'
you will receive an error. A safer way to access a key-value pair is to use the get function. Its first argument is the key you are looking for, the second argument is a value that is going to be returned if the key can’t be found:
In [9]:
word_count.get('Trump', 'Can\'t find Trump')
Out[9]:
"Can't find Trump"
In [10]:
word_count.get('learning', False)
Out[10]:
3
In [11]:
word_count.get('flower', False)
Out[11]:
False
Of course you can loop over all key-value pairs, but not directly. You need the items function to produce an iterable structure:
In [12]:
word_count.items()
Out[12]:
dict_items([('Machine', 3), ('learning', 3), ('(ML)', 1), ('is', 3), ('the', 3), ('scientific', 1), ('study', 1), ('of', 3), ('algorithms', 3), ('and', 3), ('statistical', 1), ('models', 1), ('that', 1), ('computer', 2), ('systems', 1), ('use', 1), ('to', 4), ('perform', 2), ('a', 5), ('specific', 1), ('task', 1), ('without', 2), ('using', 1), ('explicit', 1), ('instructions,', 1), ('relying', 1), ('on', 2), ('patterns', 1), ('inference', 1), ('instead.', 1), ('It', 1), ('seen', 1), ('as', 3), ('subset', 1), ('artificial', 1), ('intelligence.', 1), ('build', 1), ('mathematical', 1), ('model', 1), ('based', 1), ('sample', 1), ('data,', 1), ('known', 1), ('"training', 1), ('data",', 1), ('in', 2), ('order', 1), ('make', 1), ('predictions', 1), ('or', 2), ('decisions', 1), ('being', 1), ('explicitly', 1), ('programmed', 1), ('task.[1][2]:2', 1), ('are', 1), ('used', 1), ('wide', 1), ('variety', 1), ('applications,', 1), ('such', 1), ('email', 1), ('filtering', 1), ('vision,', 1), ('where', 1), ('it', 1), ('difficult', 1), ('infeasible', 1), ('develop', 1), ('conventional', 1), ('algorithm', 1), ('for', 1), ('effectively', 1), ('performing', 1), ('task.', 1)])
items produces a list containing tuples with key-value pairs. We can loop over this like so:
In [13]:
for key, value in my_dict.items():
    print(key, ' - ', value)
key_1  -  value
0  -  12
99 problems  -  False
Sherlock  -  Holmes
Can you follow the code in the next cell?
In [14]:
for key, value in list(word_count.items())[0:5]:
    print('key {} has value {}'.format(key, value))
key Machine has value 3
key learning has value 3
key (ML) has value 1
key is has value 3
key the has value 3
So how would we sort a dictionary? It is sorted by insertion right now, but maybe we would like to sort it by key, or by value:
In [15]:
# I use print again to keep the output of the cell small
print(sorted(word_count.items(), key=lambda x: x[0]))
[('"training', 1), ('(ML)', 1), ('It', 1), ('Machine', 3), ('a', 5), ('algorithm', 1), ('algorithms', 3), ('and', 3), ('applications,', 1), ('are', 1), ('artificial', 1), ('as', 3), ('based', 1), ('being', 1), ('build', 1), ('computer', 2), ('conventional', 1), ('data",', 1), ('data,', 1), ('decisions', 1), ('develop', 1), ('difficult', 1), ('effectively', 1), ('email', 1), ('explicit', 1), ('explicitly', 1), ('filtering', 1), ('for', 1), ('in', 2), ('infeasible', 1), ('inference', 1), ('instead.', 1), ('instructions,', 1), ('intelligence.', 1), ('is', 3), ('it', 1), ('known', 1), ('learning', 3), ('make', 1), ('mathematical', 1), ('model', 1), ('models', 1), ('of', 3), ('on', 2), ('or', 2), ('order', 1), ('patterns', 1), ('perform', 2), ('performing', 1), ('predictions', 1), ('programmed', 1), ('relying', 1), ('sample', 1), ('scientific', 1), ('seen', 1), ('specific', 1), ('statistical', 1), ('study', 1), ('subset', 1), ('such', 1), ('systems', 1), ('task', 1), ('task.', 1), ('task.[1][2]:2', 1), ('that', 1), ('the', 3), ('to', 4), ('use', 1), ('used', 1), ('using', 1), ('variety', 1), ('vision,', 1), ('where', 1), ('wide', 1), ('without', 2)]
We have converted the dictionary into a list of tuples again with the items function. No! Let me be a bit more precise: we have created a list-like structure with all key-value pairs by using the items function. We did not convert; we created something new. The result can be sorted on a certain key (not to be confused with our dictionary keys). If we would like to sort by our dictionary keys, we have to tell the function to sort on the first element of every tuple. You can use a lambda to direct sorted to the correct element of the tuple.
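If the lambda feels clunky: the standard library’s operator module offers itemgetter, which does the same job. A small sketch, equivalent to the lambda above:
In [ ]:
from operator import itemgetter
# itemgetter(0) picks the first element of every tuple, just like lambda x: x[0]
print(sorted(word_count.items(), key=itemgetter(0))[0:5])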
Let’s sort on the values as well:
In [16]:
print(sorted(word_count.items(), key=lambda x: x[1], reverse=True))
[('a', 5), ('to', 4), ('Machine', 3), ('learning', 3), ('is', 3), ('the', 3), ('of', 3), ('algorithms', 3), ('and', 3), ('as', 3), ('computer', 2), ('perform', 2), ('without', 2), ('on', 2), ('in', 2), ('or', 2), ('(ML)', 1), ('scientific', 1), ('study', 1), ('statistical', 1), ('models', 1), ('that', 1), ('systems', 1), ('use', 1), ('specific', 1), ('task', 1), ('using', 1), ('explicit', 1), ('instructions,', 1), ('relying', 1), ('patterns', 1), ('inference', 1), ('instead.', 1), ('It', 1), ('seen', 1), ('subset', 1), ('artificial', 1), ('intelligence.', 1), ('build', 1), ('mathematical', 1), ('model', 1), ('based', 1), ('sample', 1), ('data,', 1), ('known', 1), ('"training', 1), ('data",', 1), ('order', 1), ('make', 1), ('predictions', 1), ('decisions', 1), ('being', 1), ('explicitly', 1), ('programmed', 1), ('task.[1][2]:2', 1), ('are', 1), ('used', 1), ('wide', 1), ('variety', 1), ('applications,', 1), ('such', 1), ('email', 1), ('filtering', 1), ('vision,', 1), ('where', 1), ('it', 1), ('difficult', 1), ('infeasible', 1), ('develop', 1), ('conventional', 1), ('algorithm', 1), ('for', 1), ('effectively', 1), ('performing', 1), ('task.', 1)]
It’s easy to convert a list containing tuples with key-value pairs into a dictionary. Check this out:
In [17]:
x = [('key_{}'.format(i), i ** 2) for i in range(8)]
another_dict = dict(x)
another_dict
Out[17]:
{'key_0': 0,
 'key_1': 1,
 'key_2': 4,
 'key_3': 9,
 'key_4': 16,
 'key_5': 25,
 'key_6': 36,
 'key_7': 49}
And list-comprehensions are also possible, or maybe I should say dictionary-comprehensions:
In [18]:
{ k:v*2 for (k,v) in another_dict.items() if v < 40 and v > 5 }
Out[18]:
{'key_3': 18, 'key_4': 32, 'key_5': 50, 'key_6': 72}
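Another classic comprehension trick, just as an illustration: swapping keys and values. This assumes all values are unique and hashable, otherwise pairs get silently lost:
In [ ]:
{v: k for k, v in another_dict.items()}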
I’d like to finish this section with a couple of examples and tricks. Note how easily we can convert one type into another.
In [19]:
# 2 equal sized lists
l1 = ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta']
l2 = [7, 6, 5, 4, 3, 2, 1]
In [20]:
# merge into list of tuples
merged = list(zip(l1, l2))
merged
Out[20]:
[('alpha', 7),
 ('beta', 6),
 ('gamma', 5),
 ('delta', 4),
 ('epsilon', 3),
 ('zeta', 2),
 ('eta', 1)]
In [21]:
# convert into dict
new_dict = dict(merged)
new_dict
Out[21]:
{'alpha': 7,
 'beta': 6,
 'gamma': 5,
 'delta': 4,
 'epsilon': 3,
 'zeta': 2,
 'eta': 1}
In [22]:
keys = list(new_dict.keys())
keys
Out[22]:
['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta']
In [23]:
# using a list comprehension
values = [v for k, v in new_dict.items()]
values
Out[23]:
[7, 6, 5, 4, 3, 2, 1]
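By the way, you can skip the intermediate list entirely: dict happily accepts the zip object. A one-liner doing the same as the two cells above:
In [ ]:
dict(zip(l1, l2))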
By the by: it is a bit silly to write your own word-counter when you already have one at your disposal:
In [24]:
from collections import Counter
Counter(text.split()).most_common(5)
Out[24]:
[('a', 5), ('to', 4), ('Machine', 3), ('learning', 3), ('is', 3)]
Sets¶
Sets are another list-like structure, but this time there is no indexing. They are very efficient when you have to do set-operations like intersections, unions, etc. Let’s start with creating two small sets.
In [25]:
# two ways to create a set
s1 = set([7, 1, 8, 'casper', 'uu'])
s2 = {2, 4, 0, 'uu', 7}
First set-operation: membership
In [26]:
print(7 in s1)
print('LinkedIn' in s2)
print('LinkedIn' not in s2)
True
False
True
Do you think there is a difference in membership-operation-efficiency between sets and lists? Let’s find out:
In [27]:
def member_of_list(needle, haystack_list):
    return needle in haystack_list

haystack = list(range(1000))
%timeit member_of_list(500, haystack)
7.06 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [28]:
def member_of_set(needle, haystack_set):
    return needle in haystack_set

haystack = set(range(1000))
%timeit member_of_set(500, haystack)
142 ns ± 7.65 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
YES THERE IS. Do not underestimate sets! They may look like stripped-down lists, but under the hood they are hash-based, which gives them special properties that make them blazingly fast in certain situations. Like membership queries.
Moving on. Sets are also ideal if you are interested in intersections:
In [29]:
# by operator
s1 & s2
Out[29]:
{7, 'uu'}
In [30]:
# by function
s1.intersection(s2)
Out[30]:
{7, 'uu'}
and unions:
In [31]:
# by operator
s1 | s2
Out[31]:
{0, 1, 2, 4, 7, 8, 'casper', 'uu'}
In [32]:
# by function
union = s1.union(s2)
union
Out[32]:
{0, 1, 2, 4, 7, 8, 'casper', 'uu'}
or maybe you would like to know whether one set is a superset of another:
In [33]:
# by operator
union >= s1
Out[33]:
True
In [34]:
# by function
union.issuperset(s1)
Out[34]:
True
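Two more operators worth knowing, plain standard set operations that we didn’t use above: difference and symmetric difference:
In [ ]:
# elements in s1 but not in s2
print(s1 - s2)
# elements in exactly one of the two sets
print(s1 ^ s2)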
This is getting boring. Read all about sets here. There is one trick that I would like to share: sets can’t have duplicate elements. If you ever have to remove duplicate elements from a list, you do it like this:
In [35]:
dupes = [1, 2, 3, 1, 1, 1, 3, 5, 4, 5, 2, 2, 2, 4]
no_dupes = list(set(dupes))
no_dupes
Out[35]:
[1, 2, 3, 4, 5]
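One caveat: the set round-trip makes no promises about preserving the original order (it happens to look sorted here because these are small integers). If order matters, a common trick is to abuse the insertion-ordered dictionaries we met earlier:
In [ ]:
# dict keys are unique and insertion-ordered, so this deduplicates while keeping order
list(dict.fromkeys(dupes))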
Tuples¶
You’ve seen quite a lot of tuples in the previous Notebook. But I have never explained them properly. Tuples are … drum roll … lists which can’t be changed after creation. As soon as you define one, you can’t manipulate its data anymore. You create a tuple with ordinary brackets:
In [36]:
t = (1, 2, 3, 4)
Access values like you have seen with lists:
In [37]:
t[-1]
Out[37]:
4
In [38]:
t[0:2] # delivers a tuple, not a list
Out[38]:
(1, 2)
Let’s try to change the first value:
In [39]:
t[0] = 10
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 t[0] = 10
TypeError: 'tuple' object does not support item assignment
See? Can’t change anything! Let’s try to remove an element:
In [ ]:
del t[2]
Can’t do that either.
Again, tuples are just special lists. I can loop over all values:
In [ ]:
for v in ('it\'s', 'raining', 'men'):
    print(v)
They can be nested:
In [ ]:
nested = ((1, 2, 3), (4, 5, 6))
nested[-1][-1]
So why bother!? If they are just lists, what’s their use?
If you are writing a function that returns more than a single value, tuples are your friend. Stuff all values you would like to send back to your function call in a tuple and return it:
In [ ]:
def foo(l=1):
    return (l, l**2, l**3, l**4)

foo(2)
I could have done the same thing by returning a list. But a list can be altered, and I think, programming-wise, it’s better to deliver something that can’t be changed by accident. And that’s why enumerate produces tuples:
In [ ]:
# cast to list to see its contents
list(enumerate([1, 2, 3, 4, 5]))
The tuples make sure both indexes and values cannot be changed by incompetent programmers.
Let me finish with this fine piece of syntactic sugar:
In [ ]:
a, b, c, d = foo(2)
print(d, b)
Lovely.
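While we’re at it, two more unpacking conveniences you will certainly run into: an underscore for values you don’t care about, and a starred name that swallows the rest:
In [ ]:
# underscore is the conventional name for 'don't care'
a, _, _, d = foo(2)
# a starred target collects the remaining values in a list
first, *rest = foo(2)
print(a, d, first, rest)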
JSON¶
You know things about lists, and you know things about dictionaries. When you combine both structures you have a very flexible container to store complex data:
In [ ]:
data = {
    'trial_1': [1, 4, 2, 7, 6, 5, 9, 0, 1],
    'trial_2': [8, 7, 5, 9, 8, 1, 3, 4, 5],
    'parameters': {
        'alpha': [1, 4],
        'offset': 5,
        'max': 40
    },
    'subjects': ['john', 'jane', 'bert', 'the donald']
}
Remember accessing values in nested lists? Same here:
In [ ]:
data['trial_1'][3]
Is this better?
In [ ]:
data.get('trial_1', False)[4]
Yes and no, it is safer to access the key like this, but if the key can’t be found you’ll be confronted with a confusing error:
In [ ]:
data.get('alpha', False)[3]
The get function returned a boolean (False) and that’s not a ‘container’ data structure (a list in our case). Therefore it’s not ‘subscriptable’: you can’t get an element out of it by an index.
Let’s do another one:
In [ ]:
data['parameters']['alpha'][1]
We can serialize this object in many ways. One way is to convert it into a string. We can do this with the JSON library. JSON stands for JavaScript Object Notation, which is a lightweight format for data exchange. It was invented by the JavaScript people, but other communities started using the idea as well, and now it is a standard way to serialize and deserialize objects like our data dictionary. Let me show you:
In [ ]:
import json
text = json.dumps(data)
text
It almost looks like nothing happened, but our dictionary is converted into a single string, it is just a piece of text now:
In [ ]:
type(text)
We can get our dictionary back with the loads function:
In [ ]:
extracted = json.loads(text)
extracted
In [ ]:
type(extracted)
You’ll encounter JSON if you start calling APIs. It is easier to send one string over the internet than a complex data structure. After your request for data, the API server returns a string containing JSON. The only thing you have to do is load it upon receipt.
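A small extra that helps while debugging: dumps accepts an indent argument for human-readable output:
In [ ]:
print(json.dumps(data, indent=2))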
Exercises¶
1. Group into a dictionary (5 points)¶
Write a function group that takes a plain list, divides it in pairs and converts these pairs in key-value items in a dictionary. To illustrate:
your function receives:
['name', 'Casper', 'address', 'Marie Jungiusstraat 8', 'zipcode', '1963BT', 'country', 'Netherlands']
step 1, divide:
[['name', 'Casper'], ['address', 'Marie Jungiusstraat 8'], ['zipcode', '1963BT'], ['country', 'Netherlands']]
step 2, convert:
{
    'name': 'Casper',
    'address': 'Marie Jungiusstraat 8',
    'zipcode': '1963BT',
    'country': 'Netherlands'
}
You have to do this ‘manually’, you’re not allowed to use special libraries. I will check if your code only contains list comprehensions and/or for-loops.
But what if your list has an odd number of items? Then you create a last pair with an ‘unknown’ string added. See the example output below:
In [86]:
def group(my_list):
    result = {}
    for i in range(0, len(my_list), 2):
        pair = my_list[i:i+2]
        # an odd number of items leaves a pair of length 1: pad it with 'unknown'
        if len(pair) == 1:
            pair.append('unknown')
        result[pair[0]] = pair[1]
    return result

example = ['donald', 7, 'baxter', 3, 'david', 'deceased', 'linda']
group(example)  # should return -> {'donald': 7, 'baxter': 3, 'david': 'deceased', 'linda': 'unknown'}
Out[86]:
{'donald': 7, 'baxter': 3, 'david': 'deceased', 'linda': 'unknown'}
2. Queues (5 points)¶
We’re going to build a simple queue. A queue is an important data structure which is basically a list that is ‘open’ at both ends. One end is always used to insert data (enqueue) and the other end is used to remove data (dequeue). Queues follow the First-In-First-Out methodology, i.e., the data item stored first will be accessed first. Let me give you an example:
queue = []
enqueue(queue, 3) -> state of queue: [3]
enqueue(queue, 4) -> state of queue: [4, 3]
enqueue(queue, 1) -> state of queue: [1, 4, 3]
dequeue(queue) -> state of queue: [1, 4]
enqueue(queue, 5) -> state of queue: [5, 1, 4]
dequeue(queue) -> state of queue: [5, 1]
You will implement the enqueue and dequeue functions:
In [113]:
# enqueue will add an element at the front of the list:
# enqueue([], 3) will return [3]
# enqueue([3], 1) will return [1, 3]
# ===================================
def enqueue(q, item):
    q.insert(0, item)
    return q

# dequeue will remove an element at the end of the list. You have to return 2 things:
# the removed element AND the manipulated list. Use a tuple for this. The first element
# of the tuple is the removed element, the second element is our list without the removed element.
# IMPORTANT: when you try to remove an element from an empty queue, your function should
# return False!!!
#
# dequeue([]) -> False
# dequeue([4]) -> (4, [])
# dequeue([3, 1, 5, 6]) -> (6, [3, 1, 5])
# ===================================
def dequeue(q):
    if len(q) == 0:
        return False
    removed = q.pop()
    return (removed, q)

queue = []
queue = enqueue(queue, 3)
queue = enqueue(queue, 4)
queue = enqueue(queue, 1)
(removed, queue) = dequeue(queue)
queue = enqueue(queue, 5)
(removed, queue) = dequeue(queue)
print("state of queue", queue)  # should return [5, 1]
state of queue [5, 1]
3. Up and down (5 points)¶
You receive a list L with numbers. Your job is to create a function comparison which generates a new list that shows if item $n$ in L is larger, equal, or smaller than item $n – 1$.
When L[n] > L[n-1] you add a 1 to your new list,
a 0 when L[n] = L[n-1]
and a -1 if L[n] < L[n-1].
Assume you will never receive an empty list. You can't compare the first item with a previous item, add a 0 in this situation.
Example:
L = [1, 2, 2, 3, 5, 1, 0, 10]
comparison(L) -> [0, 1, 0, 1, 1, -1, -1, 1]
In [ ]:
# put your code in here
4. Getting busy with stock data (5 + 5 points)¶
Anaconda comes with a great Pandas extension to retrieve all kinds of data: pandas-datareader. If you continue this Notebook after the exercises, you will use it again. We will gather stock data from a website called Tiingo with the datareader. It’s not completely public, you need an API-key to get in. I got one, and you can use it.
If you can’t run the following cell, send me (Casper) an email straight away. It works like this:
In [ ]:
# import the datareader
import pandas_datareader as pdr
# retrieve Apple data between the first of January and the 30th of June 2019
apple_data = pdr.get_data_tiingo('AAPL', api_key='83b56978f4990f78e6c167d8886f4510ead7676c',
                                 start='01-01-2019', end='06-30-2019')
# the result is a plain Pandas dataframe
apple_data.head(n=10)
Job #1, collect adjClose (5 pt)¶
Write a function collect that takes a list with stock-symbols, collects for every symbol data from Tiingo between 1/1/2019 and 6/30/2019, selects the ‘adjClose’ column and converts it into a PLAIN Python list (google how to do this if necessary). After that, you have to add the result into a dictionary, use the symbol as key, and the plain list as value.
Example:
res = collect(['AAPL', 'COKE'])
res would look like this (not necessarily in this order, and it looks worse when you print it):
{
    'AAPL': [155.575185, 140.078745, 146.058617, 145.733517, etc…],
    'COKE': [179.447342, 177.235510, 181.220793, 184.787622, etc…]
}
In [ ]:
def collect(symbols_list):
    # your code here
    pass
# test example
tiingo = collect(['AAPL', 'COKE'])
tiingo
Job #2, getting the averages (5 pt)¶
Write a function get_averages that receives a dictionary in which every value is a plain list. These lists might be empty. Your job is to produce a new dictionary, containing the same keys, but with the averages of the lists as values. If I feed it the result of the previous exercise, the result of get_averages should look like this:
# collect Tiingo data
tiingo_data = collect(['AAPL', 'COKE'])
# run get_averages on the result
res = get_averages(tiingo_data)
res -> {'AAPL': 180.60922945689273, 'COKE': 270.9550740857307}
When an empty list is encountered, a False should be associated with that key. Like this:
res -> {'AAPL': 180.60922945689273, 'COKE': 270.9550740857307, 'BOGU': False}
In [ ]:
def get_averages(a_dict):
    # your code here
    pass
averages = get_averages(tiingo)
averages
5. Constituent lists and stocks revisited. (10 points)¶
Implement the search_all_stocks_smarter function: find all list indexes of multiple stocks in a constituents list. Check the previous Notebook’s exercises if you need a more thorough description. This time you:
1. Check which stocks are worth looking for by using sets. Do not search for list indexes of stocks that can’t be found.
2. Collect all list indexes in a dictionary. Your stocks are the keys. If you have found multiple list indexes, put them in a list. A single list index is stored as a regular value. Stocks that can’t be found in the constituents list get a False value.
3. Return the result as a JSON string, not a dictionary.
Run the next cell to get some test-data.
In [ ]:
DJ = ['3M', 'AMERICAN EXPRESS', 'APPLE', 'BOEING', 'CATERPILLAR', 'CHEVRON', 'CISCO SYSTEMS',
      'COCA COLA', 'COCA COLA', 'DOW ORD SHS', 'EXXON MOBIL', 'GOLDMAN SACHS GP.', 'HOME DEPOT', 'INTEL', 'INTEL',
      'INTERNATIONAL BUS.MCHS.', 'JP MORGAN CHASE & CO.', 'JOHNSON & JOHNSON',
      'MCDONALDS', 'MERCK & COMPANY', 'MICROSOFT', "NIKE 'B'", 'PFIZER', 'PROCTER & GAMBLE', 'INTEL',
      'TRAVELERS COS.', 'UNITED TECHNOLOGIES', 'UNITEDHEALTH GROUP', 'VERIZON COMMUNICATIONS',
      "VISA 'A'", 'WALGREENS BOOTS ALLIANCE', 'WALMART', 'WALT DISNEY']
In [ ]:
def search_all_stocks_smarter(constituents, stocks):
    # Here is your game-plan:
    #
    # setup your resulting dictionary, it will be empty before you start the show
    #
    # convert both constituents and stocks to sets.
    # By using these sets, find out which stocks can and can't be found in
    # constituents by using fast set operations. Store both sets in variables
    # so you can use them in the next steps.
    #
    # The stocks that CAN'T be found can be stored in the dictionary immediately,
    # use the stock-names as keys.
    #
    # As for the stocks that CAN be found: start looping over them.
    # hint:
    # use a list comprehension to find all indexes of the stock you're trying to find.
    # If, in the end, you could find only 1 index, make sure you store this index as
    # a single value (integer) in the dictionary; if you have found multiple indexes
    # they should be stored (in the dict) as a list.
    #
    # At this point you have a dictionary containing all the information you need.
    # Convert it to JSON by dumping it and return the JSON.
    pass
found = search_all_stocks_smarter(DJ, ['CATERPILLAR', 'COCA COLA', 'SHELL'])
found
# This example call should return:
# '{"SHELL": false, "CATERPILLAR": 4, "COCA COLA": [7, 8]}'
# WHICH IS A STRING!!!! AND NOT A DICTIONARY.
# PLEASE NOTE: The order in which you see the stocks may be different! No worries about that
Deep Learning¶
Deep learning is hot! Deep Learning is Machine Learning (ML) with big artificial neural networks (ANN). Since 1960, the ANN has been making a comeback every 30 years. I guess this time it will stick around much longer. Due to the computational power of computers today, and the vast amount of data we’re gathering, you can do interesting stuff with them! Like automatically recognizing faces, reading license plates, driving a car, or predicting if a piece of text is interesting for you.
It will take too much time to explain how a neural network works in detail. If you are interested, you could read this article. If you’re not: an ANN mimics a tiny part of your brain. We create several layers of fully interconnected neurons, or nodes. The connections carry a certain weight. An input layer is fed with numerical data. For every non-input layer we calculate the incoming signal of every node (that is, the input values multiplied by the weights) and determine what the output value should be. At the end of the structure there is an output layer. You can use the output values to classify, or get probabilities, etc.
Supervised neural networks need training. To train we need a dataset with input values and desired output values. For instance: 10,000 MRI images that may, or may not, show a small tumor. For every picture, we know beforehand whether there is a tumor displayed or not. We feed the pixel values of the images to the input layer and compare the output of the ANN with the desired output. In this example we would like to see an output value of 1 if the picture displayed a tumor, 0 otherwise.
During the training phase, we repeatedly feed our training dataset to the network. An algorithm tries to adjust the connection weights to minimize the difference between desired and real output values. After that, our model is ready to be tested: we check how our model is doing on testing data which we left out during training. If the model does well, we’re done, and we can apply it to new, unknown data. If not, we have to adjust parameters and start the whole process of training and testing again.
Warning: this is not a Machine Learning course! We’re just going to play a bit with a Neural Network, nothing more than that. The end-goal is to train a regression model using a neural network on your favorite stock-data. We will get the stock-data from an API.
Prerequisites¶
For this demo you need scikit-learn. It should have been installed with your Anaconda distribution. If not, open a terminal and type conda install scikit-learn. Scikit-learn is a big Machine Learning library with all kinds of ML and preprocessing goodies. It is built on top of SciPy.
The second library is pandas-datareader. This module enables you to gather financial API-data from certain sources and import it straight into a Pandas DataFrame. Install with pip install pandas-datareader.
Run the next cell to set things up:
In [ ]:
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import scale
In [ ]:
# set figure size for images in our Notebook
plt.rcParams['figure.figsize'] = [9, 7]
As said, we would like to use a neural network to fit a regression model on financial data. But before we get into that, let’s start simple: let’s see if we can train a neural network which approximates a simple function:
$$ f(x) = \frac{1}{x} $$
Our NN will be supervised: first we have to create a dataset containing input values and their desired output values. Create a Pandas Series x with values 1 (not 0!) through 100. Apply our function $f$ to create another Series y.
In [ ]:
# first value is 1 because a 0 would produce a division-by-0 error
x = #
y = #
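In case you get stuck, here is one possible completion; a sketch, any construction that yields a Series with 1 through 100 and its reciprocals works just as well:
In [ ]:
# a Series with the values 1..100, and the function f applied to it
x = pd.Series(range(1, 101))
y = 1 / x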
You might wonder why we’re not creating a DataFrame with an x and y-column. It’s because we can’t feed our neural network a DataFrame column. Every ANN has a number of input nodes. It expects all input-values in a separate list: we have to create a list of lists. We can do this with reshape:
In [ ]:
# by the way: the result is a numpy matrix which contains 1 column and 100 rows
x = x.values.reshape(-1, 1)
# and we can check the first 10 rows like this
x[0:10, :]
Our network will have only 1 node in its input layer, and that’s why we have sublists containing only 1 element.
The second step is to split our data into a train-set and a test-set. The test-set is used to see how well our neural network is trained. Scikit-learn is filled to the brim with helper functions to assist you with these Machine Learning related chores. Instead of splitting our set manually (nice exercise by the way), we can use the train_test_split function. We’ve imported it from the model_selection submodule as a stand-alone function. Please take a look at the import cell to see how it was done.
Its first argument is our input set, the second our desired outputs, and the test_size argument tells the function to make an 80/20 split. That’s 80% random training data.
In [ ]:
# randomly split our dataset into 80% training data,
# random_state makes sure we can repeat this particular random split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
# plot with pyplot, use the scatter function, remember that our input set is a
# list of lists (in fact a matrix with 1 column), we need the first column
plt.scatter(X_train[:, 0], y_train)
plt.title('Training data')
plt.show()
Time to create our Neural Network with the imported neural networks submodule from scikit-learn.
First argument is a tuple, every item in this tuple represents a layer between the input and output layer (a so called ‘hidden’-layer). The number tells how many nodes, or neurons, the hidden-layer contains. So (10, 30) represents 2 hidden layers between the input and output layer. The first contains 10 nodes, the second 30 nodes.
Second argument: activation function. The activation function decides what a neuron will output based on its input signal.
Third and fourth argument: not getting into that
The rest: random_state makes sure that the initial random assignment of connection-weights is the same for you and me, max_iter tells the network how many iterations we would like to train, the learning rate tells the network how fast it should train (fast may overshoot optimum vs. small but slow steps), and alpha is a penalty term to avoid overfitting.
In [ ]:
# create the neural network
nn = MLPRegressor(hidden_layer_sizes=(2), activation='tanh', solver='lbfgs', learning_rate='adaptive',
                  random_state=1, max_iter=10, alpha=0.01)
Where can we define how many nodes we need in the input and output layers? We don’t have to! Scikit-learn will infer that information from the training data. If our input-data contains sublists with 3 items, it will create an input layer with three nodes. Same thing for the output data.
With our network ready, we can start to train. After training we use our model to predict our test-data and compare our predictions with the desired test-output:
In [ ]:
# train it, try to 'fit' the output data as best as possible
# fit returns a trained neural network
trained_network = nn.fit(X_train, y_train)
# use our trained model to get predictions for our training data
y_pred_trained = trained_network.predict(X_train)
# use our trained model to get predictions for our test data
y_pred_test = trained_network.predict(X_test)
# compare visually, don’t mind the matplotlib stuff
fig, ax = plt.subplots()
# original data
ax.scatter(X_train[:, 0], y_train, label='Actual training data')
ax.scatter(X_train[:, 0], y_pred_trained, label='Predicted training data')
ax.scatter(X_test[:, 0], y_pred_test, label='Predicted test data')
ax.legend()
plt.show()
We made a plot to see how our model approximates function $f$. But that’s impossible when we have high dimensional data. Besides that, we should have a more formal way to check how our model performs. We can call score on the model to see how we’re doing on the test data:
In [ ]:
score = trained_network.score(X_test, y_test)
score
If you want to know exactly what score measures, you could read this. If not: score returns 1 if the model predicts the test data perfectly. Everything below 1 is less than perfect; negative numbers are possible.
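For the record: score is the coefficient of determination, R², so you can reproduce it with scikit-learn’s metrics module. A quick sanity check:
In [ ]:
from sklearn.metrics import r2_score
# should print the same number as trained_network.score(X_test, y_test)
r2_score(y_test, trained_network.predict(X_test))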
I have used the worst possible parameters. The single hidden layer with 2 nodes is a joke. Let’s add nodes to our hidden layer, and train more extensively.
In [ ]:
# create the neural network, let's go for 20 nodes in our hidden layer
nn = MLPRegressor(hidden_layer_sizes=(20), activation='tanh', solver='lbfgs', learning_rate='adaptive',
                  random_state=1, max_iter=100, alpha=0.01)
# train
trained_network = nn.fit(X_train, y_train)
# use our trained model to get predictions for our training data
y_pred_trained = trained_network.predict(X_train)
# use our trained model to get predictions for our test data
y_pred_test = trained_network.predict(X_test)
# compare visually, don’t mind the matplotlib stuff
fig, ax = plt.subplots()
# original data
ax.scatter(X_train[:, 0], y_train, label='Actual training data')
ax.scatter(X_train[:, 0], y_pred_trained, label='Predicted training data')
ax.scatter(X_test[:, 0], y_pred_test, label='Predicted test data')
ax.legend()
plt.show()
# what is our score
score = trained_network.score(X_test, y_test)
print('score: ', score)
OK! Looking pretty good there, the network almost scored a perfect 1. Let’s see how our trained network performs on values that are outside the range of our training data:
In [ ]:
ext_data = np.concatenate([np.arange(0.3, 4, 0.15), np.arange(100, 200, 5)])
ext_data = pd.Series(ext_data).values.reshape(-1, 1)
# predict our unknown data
y_ext = trained_network.predict(ext_data)
# compare visually, don’t mind the matplotlib stuff
fig, ax = plt.subplots()
# original data
ax.scatter(X_train[:, 0], y_train, label='Actual training data')
ax.scatter(ext_data[:, 0], y_ext, label='Extrapolated prediction')
ax.legend()
plt.show()
That looks pretty good as well. I think it’s fair to say that the neural network ‘got’ the idea of the function.
Our function $f$ was easy. How about a periodic pattern? Let’s try to approximate a sine:
$$ g(x) = \sin(2x) $$
In [ ]:
# np.arange is more flexible than range, the third argument is the step-size
x = pd.Series(np.arange(-6, 6, 0.05))
# THIS IS YOURS: apply function g to x
y = #
# split our data
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.3)
# prepare our input-values for a neural network
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
# plot our training data
plt.scatter(X_train[:, 0], y_train)
plt.title('Training data')
plt.show()
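If you want to check your work: one possible completion of the ‘THIS IS YOURS’ line, a sketch using numpy’s element-wise sin:
In [ ]:
# apply g to every value of the Series x
y = pd.Series(np.sin(2 * x))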
Create a new neural network. The function we would like to approximate is more complex, so let’s go for 2 hidden layers containing 20 neurons each:
In [ ]:
nn = MLPRegressor(hidden_layer_sizes=(20, 20), activation='tanh', solver='lbfgs', learning_rate='adaptive',
                  random_state=1, max_iter=1000, alpha=0.01)
trained_network = nn.fit(X_train, y_train)
y_pred_training = trained_network.predict(X_train)
y_pred_test = trained_network.predict(X_test)
# compare visually, don’t mind the matplotlib stuff
fig, ax = plt.subplots()
# original data
ax.scatter(X_train[:, 0], y_train, label='Actual training data')
ax.scatter(X_train[:, 0], y_pred_training, label='Predicted training data')
ax.scatter(X_test[:, 0], y_pred_test, label='Predicted testing data')
ax.legend()
plt.show()
# what is our score
score = trained_network.score(X_test, y_test)
print('score: ', score)
Alright, almost a score of 1 on the test-set! So our model performs really well. But what happens when we extrapolate:
In [ ]:
extrapolate = np.concatenate([np.arange(-12, -6, 0.1), np.arange(6, 12, 0.1)])
ext_data = pd.Series(extrapolate).values.reshape(-1, 1)
y_ext = trained_network.predict(ext_data)
fig, ax = plt.subplots()
ax.scatter(X_train[:, 0], y_train, label='Actual training data')
ax.scatter(X_test[:, 0], y_pred_test, label='Predicted testing data')
ax.scatter(ext_data[:, 0], y_ext, label='Extrapolated')
ax.legend()
plt.show()
We have trained and tested on a certain range of input values. Our trained model did very well on that data. But it didn’t learn periodicity.
But being naive and optimistic, we decide that short-term predictions can be made here. Time to make some money with this technique. Let’s try to predict stock prices!
With the Pandas datareader we can request stock data from the Tiingo website. We don’t have to scrape, we can access their data through an API and immediately import it into a Pandas DataFrame. You need an API key:
83b56978f4990f78e6c167d8886f4510ead7676c
The documentation of the get_data_tiingo function can be found here.
An example: let’s get the Google daily data from the first half year of 2019:
In [ ]:
goog = pdr.get_data_tiingo('GOOG', api_key='83b56978f4990f78e6c167d8886f4510ead7676c',
                           start='01-01-2019', end='06-30-2019')
goog.head(n=20)
Note that goog is a DataFrame. We would like to use a neural network to fit a regression model on the adjusted closing price (‘adjClose’ column) as a function of the date.
But we can’t do math with dates, we need numbers. Adding the index as a separate column won’t work since the indexes are the timestamps. Add a column ‘date_value’ in which you put sequential numbers. I don’t care what numbers, we are going to rescale anyway. The easiest way is to use a range.
In [ ]:
goog['date_value'] = #
goog.head()
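One possible way to fill in that blank; a sketch, since any increasing sequence of numbers will do (we rescale later anyway):
In [ ]:
# sequential numbers, one per row
goog['date_value'] = range(len(goog))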
Create a new DataFrame goog_data in which you put the ‘date_value’ and ‘adjClose’ column (in this particular order).
In [ ]:
goog_data = #
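And a possible completion for this blank; note the double brackets for selecting multiple columns, and the column order:
In [ ]:
goog_data = goog[['date_value', 'adjClose']]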
It would be better to feed our network scaled data. Scikit-learn handles this with the scale function (imported from the preprocessing submodule):
In [ ]:
goog_scaled = scale(goog_data)
goog_scaled
The result is a numpy matrix, the first column should be the scaled date_values, the second the scaled stock prices.
Let’s plot this data. Although we haven’t used numpy, you will undoubtedly recognize the syntax to select columns:
In [ ]:
plt.plot(goog_scaled[:, 0], goog_scaled[:, 1])
plt.show()
And a last tip: you can reshape the scaled date_values in numpy like this:
In [ ]:
goog_scaled[:, 0].reshape(-1, 1)
Your hacking assignment: create a neural network that performs a regression function on your favorite stock data. Retrieve the data from Tiingo. Go nuts. Can you incorporate a second independent variable and create a neural network with 2 inputs? Experiment with the hidden layers, the activation function, the penalty argument alpha, etc. Check how your network is doing on the test-data with the score function, make plots, avoid overfitting. Show me your results. As soon as you start making money: cut me in, I deserve it. And last but not least: have fun.