CSCA48 – Unit 4 – Solving Problems Efficiently (Winter 2021)

Learning Outcomes
This unit will introduce you to the idea of analyzing the efficiency of algorithms and data structures. By the end of this unit, you will be able to:
• Analyze simple algorithm run times by their best, worst and average cases
• Understand how to use Big O analysis to compare algorithms
• Implement, analyze and compare various sorting algorithms
• Write code to build and manipulate various types of tree data structures
1 Are linked-lists efficient?
Up to this point, we have learned how to represent, store, organize, and search through a collection of possibly complex data items. We know what Abstract Data Types are and why they are useful for coming up with solutions to problems in a way that makes the solution independent from a specific implementation. We studied lists and their implementation as linked lists, and we used these to build a simple database able to answer a few queries on items in our collection.
You will no doubt use lists in the future for a variety of applications. So before we close our discussion on lists, it’s worth taking time to think about how efficient they are as a solution to problems that require frequent look-up of items in the collection (whether it is to look at their content, update the data stored there, or carry out aggregate computations).
Suppose we wanted to use linked lists to implement a fully functional database engine. The database should be able to store information for a large number of data items (think big – tens of millions, hundreds of millions, or even billions of nodes in the linked list). The question is, what can we expect in terms of the time it takes for us to perform a search on the database?
ASIDE
Having such a large number of data items in a collection is very common.
• IBM’s Watson used millions of documents to answer game questions for Jeopardy
• Google Image Search was indexing well over 10 billion images by 2010
• Facebook had about 1.5 billion daily active users in September 2018
• Amazon sells over 3 billion different products worldwide!
So, thinking about the most efficient way to maintain a very large collection is of great importance.
To fully understand this question, we have to think a bit about how data was stored in the list. A common assumption, and one that is reasonable for many database applications, is that the data was added in no particular order, or, equivalently, that the entries in the list are randomly ordered with respect to any data fields we may want to use for answering database queries.
We also typically assume that queries arrive in random order. This is a reasonable assumption – think about Google searches arriving over a span of 1 minute from everywhere in the world – chances are, there will be no order or pattern to the searches carried out by people from Mexico, Australia, Singapore, and Finland, all of whom happen to be online at that particular time.
Finally, we have to provide some definition of efficiency so we can evaluate how well our list is doing. For a database engine, one reasonable measure is how many data items have to be inspected before the one we are looking for is found.
Now, recall from the last unit that to perform a search in a linked list, we have to do a list traversal. This means starting at the head of the list, and traveling along the list’s nodes until we find the one that contains the information we are looking for.
Question: How many nodes do we need to look at before we find the one we want?
• If we are super lucky, the data we want will be in the head node, and thus we only look at 1 node to find what we want. But with randomly ordered data, the chance of this happening is 1/N, where N is the number of nodes in our list. For a list with 1,000,000,000 entries, the odds are heavily against this ever happening!
• If we are super unlucky, the data we want is in the tail of the list (or it is not in the list), and we have to look at all N nodes before we find it or can be sure it’s not there.
• Most of the time, we are neither super lucky nor super unlucky. The data is somewhere in the list, sometimes closer to the head, sometimes closer to the tail. Because the data is randomly ordered, the number of nodes we need to examine averages out to N/2 after we run many, many queries.
This makes sense: Sometimes we find the data closer to the head, sometimes closer to the tail, but these two balance each other out so on average we have to look at about half of the linked list to answer any given query.
This is really not too bad if our list is a few thousand entries long. But once our linked list grows into the millions of items, and beyond, the cost of answering a query becomes too large. Simply put, our linked list has to look at too many data items in order to find a specific one. This translates into a longer wait time for a program requesting information from the database. At some point, the list is so long that the wait time becomes impractically long.
2 Why we need to look through all that data to find something
We may be inclined to think that the root cause of our problem is the structure of our linked list. And indeed, our linked list has a major limitation in that we can not access an element inside the list without doing a list traversal. However, the root cause of our problem is not limited to the structure of the linked list.
Consider for example an array that contains the names of all of the hit songs from 2017 and 2018 (it’s a small list so we can show an example here, but you should be thinking this applies to an array of any size).
The question we want to answer is:
Is searching for a specific item in an array with randomly ordered entries more efficient than searching in a linked list with the same (unordered) entries? Or, what amounts to the same thing, we are asking if using an array allows us to find what we want with less than N/2 items examined on average.
Looking at the array above, it should be fairly clear that there is no pattern to how the data is ordered. If we need to find a specific song name, we have to start at the top and look through the array until we find what we want. This is called linear search. Just like with a linked list, there is a chance we get super lucky and our query is the very first entry in the array; we can be super un-lucky and the query item is at the end, or not in the array at all; and on average we have to look through half the array before we find our information. If the array has N entries, we're back to looking through N/2 entries on average.
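Linear search is only a few lines of code. Below is a sketch of one possible C implementation; the function name and parameters are ours, made up for illustration:

#include <string.h>

/* Returns the index of 'query' in the array of song names, or -1 if
   it is not there. Examines entries one by one, front to back. */
int linear_search(char *songs[], int n, char *query)
{
    for (int i = 0; i < n; i++)
        if (strcmp(songs[i], query) == 0)
            return i;        // found it after looking at i+1 entries
    return -1;               // looked at all n entries, it's not there
}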
This means that from the point of view of the number of items we need to look at to answer a query, a linked list and an array are equally efficient. While there may be a slight performance difference in terms of actual run-time (because there is more overhead to a linked list than an array), the small difference is not relevant as the collection size grows: both of these ways of storing information become inadequate for quickly answering queries.
3 Organizing our data for efficient search
If we are to find a faster way to answer queries, we need to deal with the random order of our data. We need to sort it. Once our data is in sorted order, we can do a much more efficient job of searching for specific items.
For a person, looking up a specific song in the sorted array above is much easier. We understand the entries are in order, so we can easily find the spot where a particular song should be. This is why all content indexes in the backs of books, library shelves, and the phone directory (back in the days when it was actually a printed book!) are sorted.
In programming terms, sorting data has the immediate benefit of making that data much easier to search, with the result that on a sorted array such as the one above, we only have to examine a few items in order to find what we’re looking for.
Binary Search
The process of finding an item in a sorted array is called binary search. Suppose we want to find the song “I’m the One” in the array above. The binary search process is illustrated below.
1) Consider the entire array, and find the item at the middle of the array (divide the length of the array by 2, and round down). Examine the item in the middle.
• If it’s the item we’re looking for, we’re done.
• Otherwise:
– If the item we’re looking for is less than the item we just examined, it must be in the first half of the array.
– If the item we’re looking for is greater than the item we just examined, it must be in the second half of the array.
In the case of our array, “I’m the One” comes before “In My Feelings” (remember, all entries are sorted alphabetically!), so we take the first half of the array.
2) We perform exactly the same process on this chunk of the original array: Find the middle item and compare it to our query. Once more
• If it’s the item we’re looking for, we’re done.
• Otherwise:
– If the item we’re looking for is less than the item we just examined, it must be in the first half of the array.
– If the item we’re looking for is greater than the item we just examined, it must be in the second half of the array.
The song we want, “I’m the One”, comes after the middle item “God’s Plan”, so we need to take the second half of this array.
3) Repeat the same process until we find the item we want! “I’m the One” comes after “I Like It”, so we take the second half of the array above.
We found the song we were looking for! To do this, we had to examine 4 entries. In the un-sorted array we would have expected to look at N/2 = 16/2 = 8 entries to find a song (of course, allowing for being more or less lucky).
So it seems that sorting the array is doing something useful for us. Let’s see if we can figure out what’s happening:
• The binary search process uses the fact that the array is sorted to predict in which half of the array the data we want has to be.
• This means that for every item we check, we can discard half of the remaining entries. When we choose which half of the array to search next, we can be sure we will never have to look at any items in the other half!
• So, for every step of binary search we complete, the number of items that remain to be checked is cut in half.
For example:
If our initial array has a length of 1024 entries:
• After the first step of binary search, we are left with 512 entries to check
• After the second step of binary search, we have 256 entries left
• After step 3, we have 128 entries left
• After step 4, we have 64 entries left
• After step 5, we have 32 entries left
• After step 6, we have 16 entries left
• After step 7, we have 8 entries left
• After step 8, we have 4 entries left
• After step 9, we have 2 entries left
• After step 10, we have 1 entry left
The number of items left to be checked is reduced very quickly! In the case above, by checking 10 items we are able to find any item in an array of size 1024. And this is if we are very un-lucky and the item we want is the very last one we check!
Compare that with the expected N/2 = 1024/2 = 512 items we have to check for linear search on an un-sorted array or a linked list, and it should be clear that binary search is a much more efficient method for finding information.
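Here is what the process looks like in code. This is a sketch of one possible C implementation, assuming an array of song names sorted alphabetically; the function name and parameters are ours, for illustration only:

#include <string.h>

/* Returns the index of 'query' in the sorted array of song names,
   or -1 if it is not there. */
int binary_search(char *songs[], int n, char *query)
{
    int low = 0, high = n - 1;
    while (low <= high)
    {
        int mid = (low + high) / 2;            // middle item, rounded down
        int cmp = strcmp(query, songs[mid]);
        if (cmp == 0)
            return mid;                        // found it, we're done
        else if (cmp < 0)
            high = mid - 1;                    // must be in the first half
        else
            low = mid + 1;                     // must be in the second half
    }
    return -1;                                 // nothing left to check, not there
}

Each pass through the loop discards half of the remaining entries, which is exactly the halving process described above.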
How many items does binary search have to examine in general?
From the above, we can see that at each step the number of items left to examine is cut in half. The question is: given an initial value of N, the number of entries in the array, how many steps are needed to reach the point where only one element is left? This is equivalent to finding the maximum number of entries we need to examine to find what we want.
The answer turns out to be k = log2(N)
For the array with 1024 entries, k = log2(N) = 10, which is exactly the number of steps we had to do above to get to a single item. Let’s see what this means in terms of how many items we need to examine as the number of entries in the array grows for three cases:
a) The maximum (worst case) number of items we need to look at for binary search
b) The average number of items we need to look at for linear search
c) The maximum (worst case) number of items we need to look at for linear search
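Here are those three quantities for a few array sizes, computed directly from k = Log2(N) and N/2:

N            | binary search (worst) | linear search (average) | linear search (worst)
1,024        | 10                    | 512                     | 1,024
1,048,576    | 20                    | 524,288                 | 1,048,576
33,554,432   | 25                    | 16,777,216              | 33,554,432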
From the table above, it should be very clear to you that binary search is incredibly efficient. Even in the worst possible case, with only 25 items examined we can find any one item in an array with over 33 million entries! By contrast, linear search on an un-sorted array or linked list would be expected to go through over 16 million items on average, and all 33 million if we’re unlucky, before we find what we want.
4 Computational Complexity
The ideas above can be made into a concrete, general framework for comparing different algorithms in terms of the computational cost of carrying out a task.
One key idea in understanding how we can compare different algorithms is that we want to find a measure of the amount of work a particular algorithm has to do as a function of N, the number of data items the algorithm is working on. We call this measure the computational complexity of the algorithm.
In the examples above, we saw that linear search has to examine N items in the worst case, while binary search only has to look at log2(N). The reason for wanting to look at a function of N is that we can predict how an algorithm will perform for increasingly larger data collections. In our search example, we have two different functions:
f_binary_search(N) = Log2(N)
f_linear_search(N) = N
So we say that binary search has a complexity of log2(N) and that linear search has a complexity of N. Because the value of log2(N) grows much more slowly than N, we can conclude without having to test this on every possible array that binary search is much more efficient than linear search for large collections. We can visualize this by plotting both functions for increasing values of N and comparing the amount of work they have to do.
Drawing 1: Visualization of the complexity of two different search algorithms: Linear search (red) and binary search (blue). Binary search is incredibly efficient, to the point of looking like it does no work at all compared to linear search on large arrays! The shape of each function’s curve is easier to compare on a graph for smaller values of N (right side)
It should be obvious from the plots above which of the two algorithms is more efficient. The observation we can make from this is:
Given two algorithms and the functions of N that describe their computational complexity, the function that grows more slowly will always win out as the number of data items grows. So the algorithm with the slower-growing function is said to have a lower complexity, and is therefore more efficient for large data collections.
Question: What can we say at this point about the run-time of two programs that do search on the same data, but where one uses binary search, and the other uses linear search?
What to take from the discussion above:
Being able to compare different possible ways of carrying out some task is essential for implementing good solutions to any problem. In computer science, the way to compare algorithms is by understanding their computational complexity. Computational complexity is expressed as a function of N, the number of data items the program has to work on.
The Big O Notation
In order to be able to better understand and compare the complexity of different algorithms, we use a special notation called the Big O notation. The importance of the Big O notation is that it allows us to think about the efficiency of algorithms in terms of general classes of functions. Here’s what we mean by that:
Suppose that we have different implementations of linear search on arrays (perhaps written by different developers, in different programming languages), and the same for binary search. We then set out to carefully measure their run-time for arrays of different sizes, and we find that they behave as follows:
a) .75 × N (e.g. for N = 10, this implementation runs in 7.5 seconds)
b) 1.25 × N
c) 2.43 × N
d) 1.15 × Log2(N)
e) 1.75 × Log2(N)
f) 15.245 × Log2(N)
The results above require some thought – we know how much work linear search has to do to find an item in the array, and that in the worst case that would be looking at all N entries in the array. But we hadn’t considered what could happen when variations between implementations are introduced.
Perhaps an implementation in C will be slightly faster than one in Java, and perhaps the one in Java will be somewhat faster than one written in Matlab. But the important thing to note is that all of them are linear functions of N; that is a property of the linear search algorithm itself. The only thing that changes among (correct) implementations is the constant factor. We can say the same thing in a different way: any correct implementation of the linear search algorithm will have a complexity that is characterized by c × N, where c is some constant, for large values of N.
The same holds true for different implementations of binary search. They all are logarithmic functions of N multiplied by some constant that is implementation dependent.
We want a way to compare algorithms in an implementation-independent way. We need to know which is the better algorithm: the one requiring less work to solve a given task. And if we can analyze algorithms in this way, we can always pick the most efficient one. The Big O notation makes explicit the fastest-growing function of N that characterizes the complexity of some algorithm. The actual definition states that
f(x) = O(g(x))

if and only if there exists some positive constant c such that, for sufficiently large values of x,

|f(x)| ≤ c · g(x) for all x ≥ x0
Put in a different way, to say that the run-time f(x) of some algorithm is of order O(g(x)) means that the run-time is less than, or equal to c · g(x) for some constant c assuming x is large enough. In the case of our search algorithms above, we can state that:
• Linear search has a complexity of O(N) – read as “the complexity of linear search is of order N”
• Binary search has a complexity of O(Log2(N)) – read as “the complexity of binary search is of order Log2(N)”
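For example, suppose careful measurement shows that some search implementation runs in f(N) = 2.43 × N + 10 steps (the 2.43 is borrowed from implementation c) above; the + 10 is a made-up fixed start-up cost). Then f(N) = O(N): choosing c = 3, we have 2.43 × N + 10 ≤ 3 × N for all N ≥ 18, exactly as the definition requires.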
In fact, we can go a bit further and say that the complexity of binary search is O(Log(N)), without needing to worry about the fact that the logarithm is base 2. Why is this reasonable? Consider the graphs below:
Plots of run-time for the search algorithms described above. For small values of N, linear search would appear to be more efficient. However, by the time N=150 all linear search implementations have overtaken the slowest binary search implementation, and by the time N=1000 it’s all over for linear search. The conclusion is: Binary search will win out in the end, regardless of how it’s implemented, if N is large enough.
The point not to be missed in the plots above is that the algorithm whose Big O complexity is slower-growing will win in the end, for large enough N. The Log(N) function (with any base) is much slower growing than any linear function, so we will always expect binary search to be faster on a large enough array.
ASIDE
Why don’t we need to worry about log bases in Big O?
You’ll remember from high school math that:

loga(x) = logb(x) / logb(a)

(you do remember that, don’t you?)

So instead of base 2, what if we wanted to work in base n? Let’s show that log2(x) = O(logn(x)) for any constant base n:

log2(x) = logn(x) / logn(2) = (1 / logn(2)) · logn(x)

but since n is a constant, so is 1 / logn(2), and thus we simply choose a c larger than it.

So what this means is that logn(x) = O(logm(x)) for any n, m pair… or, in short, we can just ignore the base when dealing with logs in big-oh (nice property to have, huh?)
From algorithm complexity to problem complexity
We started with the goal of understanding how efficient two different algorithms for finding a particular item in an array can be. Complexity analysis is a wonderful tool for doing this. However, there is an even more powerful idea behind the use of complexity analysis in computer science:
⃝c F.Estrada, M.Ahmadzadeh, B.Harrington 2020 11
CSCA48 – Unit 4 – Solving Problems Efficiently Winter 2021
We can study, quantify, and prove results regarding the complexity of a given problem. What is the difference? Let’s take an example: Sorting an array (and we will come back to this example soon in more detail!)
• Algorithm complexity means we can look at different algorithms to sort an array, and figure out which one is going to be more efficient as the array size grows.
• Problem complexity means we study the actual problem of sorting, and we try to figure out what is the theoretical lower-limit on how much work the best possible algorithm has to do to sort an array.
For example, proving that the problem of searching an unordered collection has a complexity of O(N) is equivalent to showing that it is not possible to find an algorithm that solves this problem with complexity less than O(N).
This is a powerful and important idea because it allows us to classify problems by how hard they are to solve computationally. Harder problems are characterized by a Big O complexity given by fast-growing, or very-fast growing functions. As a developer, it’s important to become familiar with the different classes of problems you may have to work with one day, and to consider carefully what is reasonable to expect of an algorithm, given the complexity of the problem it is working on.
The plot above shows examples of algorithms (on top) and problems (at the bottom) with different Big O complexity. Note that for a given problem (e.g. sorting), you can easily find algorithms whose complexity is greater (bubblesort) than the best known complexity bound for the problem. Knowing the best known complexity estimate for a given problem allows you to evaluate how good any algorithm that solves that problem is.
ASIDE
Next year, in B63 you will learn to analyze algorithms to figure out what their complexity is, and you will learn about different ways to measure, quantify, and express what we know about the complexity of different algorithms and operations on different ADTs.
For now, don’t forget that we can measure complexity in terms of different quantities: the number of times an item in a collection is accessed, the run-time of an algorithm, the number of mathematical operations a certain function has to perform. Which measure to use depends on what problem you’re solving, but the analysis, and what it tells you about an algorithm or problem in terms of how much work is being done to solve it, is understood in the same way.
Also, don’t forget that computational complexity analysis applies to every type of problem, not only searching for information in a collection. It is incredibly important in all areas of computer science that deal with numerical computation, including machine learning and artificial intelligence. Make sure you build a sound understanding of the ideas above, and next year expand and strengthen it in B63.
So this means the implementation doesn’t matter, right?
Not quite – solving a problem efficiently requires you to first think of the complexity of your algorithm for solving the problem. But once you have satisfied yourself that you have an algorithm with the best possible complexity, you still need to write a good implementation of it! Once we have selected an algorithm with the right complexity for a problem, we are at the point where we care about those pesky constant factors!
Going back to our original example of search algorithms, the difference between the implementations becomes important once N is large enough, e.g. N=1000000:
1.15 × Log2(N) = 22.9
15.245 × Log2(N) = 303.6
So the implementation makes all the difference between you having to wait around 20 sec. for your result, or having to wait over 5 minutes!
EXERCISE
Consider a function that takes 2 input arrays:

void multiply_accumulate(float input[N], float output[N])

The function computes each value in the ’output’ array as

output[i] = Σ_j input[i] · input[j]
That is, the ith entry in the output array is the result of multiplying the ith entry in the input array with every other entry in the input array (including itself), and adding all of these up.
One way to implement this function is shown below:
void multiply_accumulate(float input[N], float output[N])
{
    int i,j;
    for (i=0; i<N; i++)
    {
        output[i]=0;                                        // start the sum at zero
        for (j=0; j<N; j++) output[i]+=input[i]*input[j];   // add input[i]*input[j] for every j
    }
}
    // Repeat until the array is sorted (at most, N iterations)

    int t;

    for (int i=0; i<N; i++)
    {
        for (int j=0; j<N-1; j++)
        {
            if (array[j] > array[j+1])
            {
                t=array[j];
                array[j]=array[j+1];
                array[j+1]=t;
            }
        }
    }
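A quick count of the work done by this nested loop (a version of bubblesort): the inner loop performs N-1 comparisons, and the outer loop repeats it up to N times, so the total is at most N × (N-1) comparisons. In Big O terms that is O(N^2), which is why bubblesort served earlier as an example of an algorithm whose complexity is greater than the best known complexity bound for the sorting problem.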
    strcpy(new_review->rev.restaurant_name,"");
    strcpy(new_review->rev.restaurant_address,"");
    new_review->left=NULL;
    new_review->right=NULL;
    return new_review;
}

Other than the names, the only difference between the function above and the one we used with linked list nodes is highlighted by the red box: we now have two pointers that need to be set to NULL.

Inserting nodes into a BST

For linked lists, we had to keep around a pointer to the head of the list. In the case of a BST, we will need to keep around a pointer to the root of the tree (the node at the top). The insert process must ensure that the BST property is enforced when a new key is inserted in the tree.
This means that keys are inserted so that, for any node in the tree, keys in the left subtree are smaller than or equal to the key in the node, and keys in the right subtree are greater than the key in the node. Since the data in our BST is a composite type ’Review’, we need to define which field we will use as the key. Let’s make our BST keep our reviews ordered by the name of the restaurant. So in the implementation of the BST shown below, whenever we talk about the key for a given node, we are in effect talking about the ’restaurant name’ field of the ’Review’ component of the BST node.
Let’s look at how the insert function is defined, and see examples of how it works. It takes as input two parameters: a pointer to the root of a subtree where we are inserting a new node, and a pointer to the new review being inserted in the tree.
BST_Node *BST_insert(BST_Node *root, BST_Node *new_review)
{
    if (root==NULL)           // Tree is empty, new node becomes
        return new_review;    // the root

    // Determine which subtree the new key should be in
    if (strcmp(new_review->rev.restaurant_name,
               root->rev.restaurant_name)<=0)
    {
        root->left=BST_insert(root->left,new_review);
    }
    else
    {
        root->right=BST_insert(root->right,new_review);
    }
    return root;
}

What’s happening in the function above?

It receives a pointer to a ’BST_Node’ which is the root of a BST. Keep in mind that any subtree of a BST is also a BST! So, many different nodes can be thought of as being a root, depending on what chunk of the BST you’re looking at! The figure below illustrates this.

Drawing 3: Every subtree in a BST is also a BST. The figure above shows the different subtrees (rooted at different nodes) in a sample BST. Note that each leaf node has two empty subtrees as children.
Given the root of some BST (it can be the whole tree or a subtree thereof), the insert function checks if the root is NULL. If so, the tree is empty, and the node we are inserting becomes the root.
If the root is not NULL, the (sub)tree is not empty, so the insert function determines whether the new review should be in the left subtree or the right subtree, and proceeds to insert the review in the corresponding subtree (using the corresponding node as root). Eventually, the insert function will find an empty branch where the new node can be inserted, and because at each step it tests which subtree is the correct one for the new review, based on how its ’restaurant name’ compares to other reviews in the tree, the new review ends up at the correct location to preserve the ordering of keys in the BST.
The figure below shows a few examples of the insertion process for a BST that contains only integer values. The process is identical for our BST with restaurant reviews, except that for the reviews we compare restaurant names rather than ints.
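To make the process concrete, here is a small, self-contained sketch that does the same thing with plain int keys. The type and function names (IntNode, int_insert) are ours, invented for this example; only the comparison changes with respect to BST_insert above:

#include <stdio.h>
#include <stdlib.h>

typedef struct IntNode
{
    int key;
    struct IntNode *left;
    struct IntNode *right;
} IntNode;

IntNode *int_insert(IntNode *root, int key)
{
    if (root == NULL)                       // found an empty spot, make the node here
    {
        IntNode *n = (IntNode *)calloc(1, sizeof(IntNode));  // left/right start NULL
        n->key = key;
        return n;
    }
    if (key <= root->key)
        root->left = int_insert(root->left, key);
    else
        root->right = int_insert(root->right, key);
    return root;
}

int main(void)
{
    int keys[] = {17, 11, 27, 5, 15, 19, 34};
    IntNode *root = NULL;
    for (int i = 0; i < 7; i++)
        root = int_insert(root, keys[i]);   // root only changes on the first insert
    printf("root holds %d\n", root->key);   // prints 17
    return 0;
}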
EXERCISE
Starting with the last BST in the diagram above (after 22 is inserted), list the steps carried out, and draw the resulting tree, after inserting:
19, 4, 1, 6, and 21 (in that order)
Revisiting the complexity of search in a BST
We know that the complexity of searching for a key in a BST depends on the height of the tree, and we also know that in a full BST the height of the tree is ⌈Log2(N)⌉. However, given a randomly ordered set of keys inserted in sequence into a BST, it is rarely the case that we will end up with a full BST. The shape of the BST, and hence its height, depends on the order in which keys are inserted.
If the order of the keys is more-or-less random, we can expect that insertions will be more-or-less equally distributed between the left and right subtrees at any level in the tree. However, if the ordering of the keys is not random, we will quickly get in trouble.
EXERCISE
Draw the BST that results from inserting the following keys in sequence: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Question: What is the height of the resulting BST? Generalizing to a set of N initially sorted keys inserted in sequence into the tree, what will be the height of the BST?
Question: What can you deduce from the above regarding the worst-case search complexity on a BST?
Question: What does the shape of a BST resulting from inserting keys in sorted order remind you of?
The full analysis of search complexity on a BST is beyond our scope here; you’ll likely have a closer look next year in B63. But for now, you should keep in mind that the O(Log(N)) search complexity we discussed for BSTs is only the average case complexity, under the assumption that keys are randomly ordered when inserted into the tree. The worst case complexity is another matter altogether, and we should keep that in mind when deciding whether or not to use a BST to store and efficiently access a large collection of items.
ASIDE
Before moving on to other operations on BSTs, it’s worth pointing out that there are several types of trees such as AVL-trees, B-trees, and 2-3-4 trees that have the property that they remain balanced regardless of the order in which keys are inserted. Such balanced trees guarantee that the complexity of search, insert, and delete operations remains O(Log(N)) under all conditions. You will learn about these trees, how they work, and how they maintain a balanced shape in B63.
Search in a BST
The search operation in a BST looks for a specific query key. Once the query key has been found, a pointer to the node containing it is returned. For BSTs that store compound data types, the search operation must know which field of the compound data type to use as the key. Let’s see how this works:

BST_Node *BST_search(BST_Node *root, char name[1024])
{
    // Look up a restaurant review by restaurant name

    if (root==NULL) return NULL;   // Tree or sub-tree is empty

    // Check if this node contains the review we want, if so, return
    // a pointer to it
    if (strcmp(root->rev.restaurant_name, name)==0)
        return root;

    // Not in this node, search the corresponding subtree
    if (strcmp(name, root->rev.restaurant_name)<=0)
    {
        return BST_search(root->left,name);
    }
    else
    {
        return BST_search(root->right,name);
    }
}

The implementation is nearly identical to that of the insert operation. It looks for a key matching the query in the current node, and if it is not found, searches the corresponding subtree depending on how the query compares to the key in the current node. The search ends when we find the key we want, or when we reach an empty branch.

At this point it becomes important to talk about how to handle duplicate keys. The definition of the BST does not forbid the existence of duplicate values in the tree. However, the behaviour of search becomes badly defined if duplicates are present – that is because there is no principled way of deciding which of the duplicate nodes the user actually wants (this was also an issue with our linked-lists in the previous Unit!).

Question: Suppose we have multiple reviews for the same restaurant. Which of them would be returned by the search function as defined above?
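Setting duplicates aside for a moment, here is a small usage sketch. The restaurant name is made up, and ’root’ is assumed to be the pointer to the root of the review BST kept by the program (with #include <stdio.h> available):

BST_Node *match = BST_search(root, "Joe's Diner");
if (match == NULL)
    printf("No review for that restaurant.\n");
else
    printf("Found: %s\n", match->rev.restaurant_name);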
ASIDE
Because of the ambiguity that is introduced by the existence of duplicate keys, databases rely on the use of unique identifiers for data items. Examples you use regularly include student number, social insurance number, product bar codes, etc. These are generated for, and associated with each item in a collection to ensure that we always have a way of identifying any particular entry with no ambiguity.
If no unique identifier has been generated, it is often possible to create one from a combination of data fields from the item’s information. For example, with a collection of music records, we could uniquely identify any given one by checking that all of the following fields match a query: Title, Artist, Label, Year.
Database Primary Keys have the constraint of being unique, and in our BSTs (or linked-lists, or any other data structure intended to support search on a collection), we must enforce the use of unique keys to identify items. You’ll learn all about how primary keys are designed and used in the database course, C43.
Very importantly: Once we have determined what we will use as a unique key to locate items in our collection, the insert operation must enforce that no duplicate records are added to the BST. That means that if we attempt to insert an item whose unique key matches an already existing one, the insertion operation will not be carried out.
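One simple way to enforce this, sketched under the assumption that we keep BST_insert and BST_search exactly as defined above (the wrapper name is ours):

BST_Node *BST_insert_unique(BST_Node *root, BST_Node *new_review)
{
    // Refuse the insert if a review with this restaurant name already exists
    if (BST_search(root, new_review->rev.restaurant_name) != NULL)
        return root;
    return BST_insert(root, new_review);
}

This costs one extra search per insertion, proportional to the height of the tree, so it does not change the overall complexity of insert.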
Updating an item in the collection
The update operation is simply a search followed by an update to the desired data fields in the item. However, it should be noted that no updates are allowed on the data value(s) used as the key for the item. Updating the key would likely mean that the BST property of having keys ordered inside the tree no longer holds, and would therefore break the search, insert, and delete operations.
Deleting items from the BST
The deletion operation is a bit more interesting than both search and insert. In a linked-list, deletion involved finding the item to be deleted along with its predecessor, and then linking the predecessor and successor nodes to preserve the linked structure of the list – then deleting the desired node.
For BSTs it’s more complicated because we must ensure that after deletion the BST property still holds everywhere in the tree.
There are three cases we must consider for deletion:
a) The item to be deleted is a leaf node (no children). In the example below, suppose we want to delete the node that contains ’40’.
All that we need to do is set the corresponding parent node’s link to NULL. In the case above, ’40’ is the right child of ’34’, so we set the corresponding link to NULL and delete (release the memory of) the node containing ’40’.
b) The item to be deleted has only one child, in which case we simply link the node’s child to the node’s parent and delete the node. In the example below, deleting ’19’ means that we link ’22’ to ’27’ and then release the memory used by the node with ’19’.
The first two cases can be handled by the following code:
BST_Node *BST_delete(BST_Node *root, char name[1024])
{
    // Remove a review for a restaurant whose name matches the query
    // Assumes unique restaurant names!

    BST_Node *tmp;

    if (root==NULL) return NULL;   // Tree or sub-tree is empty

    // Check if this node contains the review we want to delete
    if (strcmp(name,root->rev.restaurant_name)==0)
    {
        if (root->left==NULL && root->right==NULL)
        {
            // Case a), no children. The parent will
            // be updated to have NULL instead of this
            // node's address, and we delete this node
            free(root);
            return NULL;
        }
        else if (root->right==NULL)
        {
            // Case b), only one child, left subtree.
            // The parent has to be linked to the left
            // child of this node, and we free this node
            tmp=root->left;
            free(root);
            return tmp;
        }
        else if (root->left==NULL)
        {
            // Case b), only one child, right subtree.
            // The parent has to be linked to the right
            // child of this node, and we free this node
            tmp=root->right;
            free(root);
            return tmp;
        }
        else
        {
            // Case c), two children.
            // You will implement this for your exercise!
            return root;
        }
    }

    // Not in this node, delete on the corresponding subtree and
    // update the corresponding link
    if (strcmp(name, root->rev.restaurant_name)<=0)
    {
        root->left=BST_delete(root->left,name);
    }
    else
    {
        root->right=BST_delete(root->right,name);
    }
    return root;
}
Now, for the last case:

c) The node to be deleted has two children. This is a bit more challenging. The way deletion works when the node we want to remove has two children is as follows: we find the successor to the node we want to delete – this is the node whose key follows, in order, the one in the node we are deleting. Another way to think about it is: the successor is the smallest key in the right subtree.
Once the successor has been identified:
• Copy the data from the successor node to the node we are deleting.
– If the node contains a compound data type, copy the whole compound data type
• Then delete the successor node from the right subtree.
– This can again be either case a) or b) above. We cannot have a repeat of case c) because, by definition, the successor cannot have a left child.
As you can see, the actual node where ’27’ was stored didn’t go anywhere, and we did not need to re-link anything! What we did is replace the data in the node with the successor’s data and then delete the successor node.
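Locating the successor is a short walk down the tree. Here is a minimal sketch (the helper name is ours); the rest of case c) – copying the successor’s data and deleting the successor from the right subtree – is left for the exercise below:

BST_Node *BST_smallest(BST_Node *root)
{
    // The smallest key is found by following left links until we
    // reach a node with no left child. Assumes root is not NULL,
    // which holds when called on the right child of a node with
    // two children.
    while (root->left != NULL)
        root = root->left;
    return root;
}

For a node with two children, the successor is then BST_smallest(root->right).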
EXERCISE
1. Figure out what the tree would look like after deleting the following keys in sequence (draw the tree after each deletion):
9, 22, 29, 11
2. Complete the implementation of the BST_delete() function above to implement case c). You will need to write code to find the successor so you can copy its data over, and so you know what to delete from the subtree.
Tree Traversals
There is one more set of operations on BSTs that we need to discuss – these are called tree traversals. A tree traversal consists of visiting each node in the BST in a pre-specified order. We use tree traversals for different purposes, including:
• Printing data for nodes in the BST
• Computing aggregate statistics for data in our collection
• Updating information when the update applies to all nodes (e.g. suppose our BST contains item prices for a store, and there is a store-wide sale with discounts applied to everything).
• Deleting the BST once our program is done
There are three main types of traversals, distinguished by the order in which nodes are visited:
• In-order traversal, which is possibly the most common and specifies that tree nodes are visited in this order for any given subtree:
1) Traverse left sub-tree in-order
2) Root – apply desired operation on the node
3) Traverse right sub-tree in-order
The example below shows the order in which nodes in a tree would be visited during an in-order traversal.
If at each node, the operation we carry out is simply to print the key at the node, the in-order traversal would produce:
5, 11, 15, 17, 19, 27, 34
Which gives you an ordered list of the keys in the BST. Hence the name of the traversal.
EXERCISE
Write a function to print the restaurant reviews in alphabetical order of restaurant name.
• Pre-order traversal, for this type of traversal the order in which nodes are visited is given by:
1) Root – apply the desired operation on the node
2) Traverse in pre-order the left subtree
3) Traverse in pre-order the right subtree
The tree below shows the order in which nodes would be visited in pre-order.
This time, if the operation at each node was to print the key, we would obtain the following list:
17, 11, 5, 15, 27, 19, 34
This may not seem like a very interesting list; however, pre-order traversal is very useful in applications such as compilers and natural language understanding, where the structure of a chunk of code, a sequence of mathematical operations, or the parts of speech in a sentence is represented by a tree structure. It is also the traversal order we need to use if we want to create a copy of a BST.
• Post-order traversal, specifies that the nodes in the tree must be visited in the following order:
1) Traverse in post-order the left subtree
2) Traverse in post-order the right subtree
3) Root – apply the desired operation on the node
The tree below shows the order in which nodes would be visited by a post-order traversal.
With post-order traversal, we would obtain the following list of keys:
5, 15, 11, 19, 34, 27, 17
Once more, the list doesn’t look particularly interesting to the eye, but post-order traversal has applications in parsing, and is also the order in which we need to visit nodes when deleting a BST.
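All three traversals have nearly identical code; only the position of the “visit” step changes. Below is a sketch using the IntNode type from the integer insertion example earlier (again, the names are ours for illustration, #include <stdio.h> is assumed, and printing stands in for whatever operation we want to apply at each node):

void in_order(IntNode *root)
{
    if (root == NULL) return;
    in_order(root->left);          // 1) traverse the left subtree
    printf("%d ", root->key);      // 2) visit the root
    in_order(root->right);         // 3) traverse the right subtree
}

void pre_order(IntNode *root)
{
    if (root == NULL) return;
    printf("%d ", root->key);      // visit the root first
    pre_order(root->left);
    pre_order(root->right);
}

void post_order(IntNode *root)
{
    if (root == NULL) return;
    post_order(root->left);
    post_order(root->right);
    printf("%d ", root->key);      // visit the root last
}

Note how post-order visits a node only after both its subtrees have been fully processed, which is exactly why it is the right order for releasing the memory of a BST.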
EXERCISE
1. Draw the tree that results from inserting the following keys in sequence: 57, 25, 69, 16, 35, 59, 71, 72, 73, 9, 3, 14, 99
Then list the entries in the order that would be produced by in-order, pre-order, and post-order traversals.
2. Write a function to delete the BST for restaurant reviews, ensuring all memory allocated to nodes of the BST is released.
3. Complete your implementation of a little restaurant review database by adding an interactive loop in main() that allows you to input new reviews, search for specific reviews, or delete existing ones.
8 Have we solved our problem?
At this point we have a good understanding of what BSTs can do, and how to implement them, but we have yet to determine whether all this work has helped us solve the problem of finding a more efficient way to store, organize, and access a large collection of information. We may suspect that the BST is likely to give us faster search because its average search complexity is better than the linked-list’s, but we still have to tie up a few loose ends:
a) Time it takes to build the BST vs. time needed to build a linked list
• For N items, the linked list can be built in O(N) time – each item is added at the head so no traversal is needed. That’s pretty good!
• For the same N items, the BST can be built in O(NLog(N)) time on average, since insertions always happen at the bottom of the tree, and the tree has height O(Log(N)).
Advantage: Linked list
b) Time it takes to build a BST vs. time needed to sort an array (so we can use binary search)
• Using quicksort(), an array can be sorted in O(NLog(N)) time on average.
• As we saw above, the BST can be built in O(NLog(N)) time on average.
Advantage: None – it’s a tie in terms of complexity
c) Search complexity
• Sorted array: O(Log(N)) worst case using binary search.
• Linked list: O(N) both average and worst case.
• BST: O(Log(N)) average, O(N) worst case.
Advantage: BST and sorted array on average, sorted array in worst case.
d) Storage use
• Sorted array: Fixed size, not suitable for growing/shrinking collections
• Linked list: Space usage is O(N) – one node per item
• BST: Space usage is O(N) – one node per item
Advantage: BST and linked-list, since they only use space for items actually added to the collection.
So what do we choose?
The point of going through all of this work is to help you see that there is a fair number of factors you have to consider when choosing how to store and organize a collection of data. Up to this point we have 3 major ways of storing items: arrays, linked-lists, and BSTs, and as shown above, each one of them has advantages and disadvantages. So how are we to choose?
a) Consider how large your collection is going to be:
• For very large collections, you want to choose the data structure that gives you the best Big O complexity for common operations like insert, delete, and search.
– Consider the space requirements: If items have small memory requirements (e.g. just numeric data, like ints or floats, or small compound data types) it’s not unreasonable to pre-allocate a large array even though entries in it may go un-used. If, on the other hand, your items have a large memory footprint (e.g. compound data types with lots of data, or multi-media content like images, sound clips, etc.) you definitely want a data structure that allocates memory only for as many items as you have in the collection at a given time.
• For smaller collections, it will depend on whether you favour ease of implementation vs. the performance of your data structure for common operations.
b) Consider what kind of operations will be performed on the collection’s items:
• If you will mostly perform operations over the entire set, choose a linked list or an array (it doesn’t have to be sorted). An example of this is the 3D point meshes used in computer graphics, which may contain millions of points, but we don’t perform individual point look-ups, and instead render the whole mesh. These are often kept in arrays.
• If search, insert, or delete will be frequent, then you should favour a data structure that gives you the best Big O complexity for these operations.
ASIDE
You do not have to settle for one data structure. Databases organize large amounts of information in the form of tables – these are basically huge arrays, where each row contains all the data fields for one item in the collection. They are stored on disk and are unsorted. Separately from the table that contains the actual data, the database keeps an index in the form of a tree (typically a B-tree or a variation thereof) which contains the search keys the user will be running queries on. Each node in the index contains the location in the original table where the information for the item matching the query key can be found.
An additional advantage of this scheme is that the database can have multiple indexes for the same table: for example, for student records in ACORN, there could be an index organized by student number, and one organized by last-name/first-name. Each index allows the database to efficiently search for records based on different common query terms.
Always remember: Whenever you are designing a solution for storing, organizing, and accessing information, you must consider the size of the collection, how it will be used, what operations will be run on it, and then think about all the ADTs and data structures you know, their complexity class for common operations, and how you could use them to build the most efficient data storage and manipulation solution. Look forward to B63 for learning a whole lot more about ways to store, organize, and search for information, including balanced trees, hash tables, heaps, graphs, and combinations of these.
Does it all work in practice?
It’s worth taking a look at whether all we have learned here actually works out in practice. The graphs you will find below compare the performance of our three storage solutions for the task of organizing and searching through a large collection of integers. Our three methods are:
• A sorted array + binary search
• A linked list
• A BST
We will test them on collections of increasing size, and on each collection, we will run 10,000 queries for randomly chosen keys (to check how the search performance compares across these data structures).
You will recall that one worry with BSTs is that the worst case complexity is O(N) for search, insert, and delete. It’s important to figure out whether the performance of the data structures over multiple tests, on different sets of items, may indeed change so much that we should not rely on average case complexity when deciding which of them to use. In order to study this, we will run 50 tests for each collection size, each with different randomly-generated keys. It’s a relatively small sample, but it should at least give us an indication of whether our data structures perform as we would expect on different input data sets.
Finally, we show the run-time in three separate ways:
• Set-up time: Time it takes to insert the data into the collection (for the array this includes the time it takes to run quicksort()).
• Search time: The time it takes to run 10,000 queries once the collection is stored
• Total time: The sum of both, giving a complete picture of how each data structure performs.
Important note on the graphs below: Please be very attentive – the x-axis is not to scale! Each tick in the x direction corresponds to 10 times more items than the previous one. If we showed the graph to scale, it would spill off the page and run into the next room! Keep this in mind as you think about what the graphs are showing.
Set-up time: Time taken to insert N items into the data structure – for the array this includes time taken by quicksort().
The graph shows the set-up time for all three data structures. As we expected from our analysis above, the linked-list has the advantage, with the set-up time being only O(N). Both the sorted array and the BST take significantly longer.
Conclusion: Set-up time in practice agrees with our analysis of the behaviour of each data structure in the sections above.
Question: There is a weird spike on the set-up time for the sorted array at the smallest number of items (it’s not at zero, that is just an effect of the scale of the x-axis). What could possibly explain this strange behaviour?
Search Time
Below, on the left, you can see the time it took to run 10,000 searches on each of the data structures, averaged over the 50 different runs with different orderings of random keys. As you can easily see, the performance of the linked-list is much worse than the array and BST. This is also as expected: the average O(N) search complexity of the linked-list can not compete with the average O(Log(N)) search complexity of the BST, and the guaranteed O(Log(N)) search complexity of binary search. The plot on the right shows only the BST and sorted array search times, for comparison.

The graphs above show behaviour that is consistent with what we expected. The BST and sorted array perform similarly well (the shape of the curves for run-time looks fairly close, except for some constant factor related to the overhead of managing the BST vs. the array). We can also see that the BST’s performance does not appear to show any slow-down indicative of hitting the worst-case O(N) search complexity (in which case we would expect it to move toward the green curve of the linked-list in the plots above).
Conclusion: Search time in practice agrees with our analysis of complexity in the sections above.

Total time
The graphs below show the total time taken by each of the data structures to carry out the 10,000 queries. It’s the sum of the set-up time and the search time.
Once again, the graph on the left shows the total time for all three data structures, and clearly indicates that if we expect to be doing a lot of searching for items in our collection, the linked-lists are not the data structure of choice. Both the BST and sorted array perform similarly well – similar shaped curves with the difference likely very close to a constant factor explained by the overhead of maintaining and traversing the BST, compared to almost immediate access to any entry in the array.
One more observation: The shape of the total-time curves indicates that the total time for the BST and sorted array is dominated by the set-up time, at least for the number of queries that we performed on the database. This should not be surprising. The set-up time for both the sorted array and the BST is O(NLog(N)) (for the BST, remember, this is only the average). The search time, on the other hand, is only O(mLog(N)), where m is the number of queries we run on the data structure. Since m = 10,000, the search time is expected to be smaller than the set-up time for the largest collections.
Conclusion: Total search time agrees in practice with our analysis of the behaviour of the data structures from the previous sections.
Final Thoughts
We have spent some time carefully studying the problem of how to make search more efficient for large collections. Along the way we have studied the problem of measuring, characterizing, and reasoning about the complexity of algorithms, problems, and data structures. We have learned that the complexity results we can derive from studying a given solution are clearly and directly translated to run-time when we implement these solutions, and we have thought about how and when we should use the different ADTs we have studied up to this point.
Keep in mind that the theoretical analysis of complexity is only one layer of the complex problem of figuring out the most efficient way to carry out some task. You need to learn about computer architecture, how modern CPUs work, and how to optimize programs for maximum efficiency if you really want to write the best (most efficient, fastest) software to solve any given problem. Don’t forget to keep what you learned here in mind when you take Computer Organization (B58), Algorithms and Data Structures (B63), and Embedded Systems (C85), as these three courses will provide you with deeper understanding of this important problem.
As for the worst-case complexity: the analysis above may lead you to believe you don’t have to worry about it if your data structure has a good average-case complexity. Indeed, for most applications you will find you can get excellent results from using data structures such as BSTs, and sorting algorithms such as quicksort(). However, for safety-critical or real-time applications – examples of which include control software for medical equipment, electrical power generation, transportation (aircraft flight control), robotics, manufacturing, and so on – you can not afford to use a data structure whose worst-case performance is bad. The point being that even if you need to be very unlucky to hit the input that triggers worst-case or close to worst-case performance, you can not afford to take that risk. Applications of this kind require guaranteed performance bounds from all the data structures and algorithms involved in their software. So, keep this in mind as you move to B63 and C85 where these issues will be discussed at length.
Additional Exercises
Ex0 – Draw the BST that results from inserting the following keys in sequence into an initially empty BST. Keys are sorted alphabetically.
“Iron Man 3”, “Black Panther”, “The Avengers”, “Captain America”, “Iron Man 2”, “Aquaman”, “Batman Returns”, “Thor”, “Spider Man: Into the Spiderverse”
Ex1 – Draw the BST after we delete “Aquaman”
Ex2 – Draw the BST after we delete “Iron Man 3”
Ex3 – What is the list of movies generated by a post-order traversal of the BST from Ex2?
Ex4 – It is highly likely that a BST whose keys are movie titles will eventually run into the prob- lem that there are multiple entries with the same key. As we discussed above this is really not a great situation. Give at least 3 different suggestions regarding how we can build a BST where entries are organized by movie title, yet, there are no duplicate keys.
Ex5 – Which of the following has the lowest complexity for large values of N?
a) 250000*Log(N)
b) .001*N
c) .000001*N^2
d) 5000*Log(N^2)
Ex6 – Which of the following has the lowest complexity for small values of N?
a) 5.25*N
b) 1.11*N
c) 5*Log(N)
d) 2.1*Log(N^2)
Ex7 – Recall that in the discussion above we specified that updates to the key values in the nodes of a BST are not allowed, because they can break the BST property. However, in practice we may come upon a situation in which we have to update a data field used as a key. List in the space below the steps we could take to accomplish this without breaking the BST property in the tree (pseudocode is fine, no need to write C code for this). Hint: Consider all available BST operations.
Ex8 – Implement quicksort() from the description in P.16. You can consult on-line resources as well.