Searching, Sorting and Merging
Data Structures and Abstractions
Searching, Sorting and Merging
Lecture 25
*
When testing searching and merging algorithms, it is important to check boundary and unusual conditions:
In other words, test with containers holding [1]:
0 elements;
1 element;
2 elements;
3 elements;
a large odd number of elements;
a large even number of elements.
The Cyclomatic Complexity of your algorithm can be used to guide your test cases. [2]
Testing
*
*
[1] Rule of thumb – to cover your bases.
Other tests may be needed depending on the algorithm; see below.
[2]
For a quick overview of Cyclomatic Complexity, see the Wikipedia entry, section “Implications for Software Testing”: https://en.wikipedia.org/wiki/Cyclomatic_complexity.
There is a Cyclomatic Complexity worked example on page 66 (Appendix A, AV Rule 3) of the included document “Joint Strike Fighter Air Vehicle C++ Coding Standards”.
You should also go through the General Design rules 3.1 “Coupling and Cohesion” and 3.2 “Code Size and Complexity” (pages 10-12 of the JSF C++ Coding Standards). You are not building mission-critical software in this unit, but the design principles are the same and need to be followed. We covered Coupling and Cohesion at the start of semester/trimester, and now is a good time to review it.
For a discussion of safety-critical systems, see the included short report “C++ for Safety-Critical Systems”. It has a single-paragraph discussion on classes. Compare what is there with the class designs/implementations that you are doing in this unit, and view it in the context of what is called the “Law of Demeter”: https://en.wikipedia.org/wiki/Law_of_Demeter.
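To make these boundary cases concrete, here is a minimal test-driver sketch. The routine under test, linearFind, is a stand-in written purely for illustration; substitute your own Find():

#include <cassert>
#include <cstddef>
#include <vector>

// A simple linear search used here only as the routine under test.
bool linearFind(const std::vector<int>& data, int target, std::size_t& position)
{
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        if (data[i] == target)
        {
            position = i;
            return true;
        }
    }
    return false;
}

int main()
{
    std::size_t pos = 0;
    // The boundary sizes from the slide: 0, 1, 2, 3, then a large
    // even and a large odd number of elements.
    for (std::size_t n : {0, 1, 2, 3, 1000, 1001})
    {
        std::vector<int> data;
        for (std::size_t i = 0; i < n; ++i)
            data.push_back(static_cast<int>(i));

        assert(!linearFind(data, -1, pos));   // absent target: every size

        if (n > 0)
        {
            // First and last elements: the common off-by-one spots.
            assert(linearFind(data, data.front(), pos) && pos == 0);
            assert(linearFind(data, data.back(), pos) && pos == n - 1);
        }
    }
    return 0;
}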
The Journal of Research of the National Institute of Standards and Technology (NIST) publishes, amongst other things, papers on programming-language issues. One of the 2013 papers is titled “A Case Study of Performance Degradation Attributable to Run-Time Bounds Checks on C++ Vector Access”. See volume 118 at http://www.nist.gov/nvl/jrespastpapers.cfm.
This paper is relevant to this unit because the primary data structure for assignment 1 was an encapsulated array called Vector, and bounds checking was an important responsibility of Vector.
The following is part of the abstract:
“Programmers routinely omit run-time safety checks from applications
because they assume that these safety checks would degrade performance.
The simplest example is the use of arrays or array-like data structures
that do not enforce the constraint that indices must be within bounds.
This report documents an attempt to measure the performance penalty
incurred by two different implementations of bounds-checking in C and
C++ using a simple benchmark and a desktop PC with a modern superscalar
CPU. The benchmark consisted of a loop that wrote to array elements in
sequential order. With this configuration, relative to the best
performance observed for any access method in C or C++, mean degradation
of only (0.881 ± 0.009) % was measured for a standard bounds-checking
access method in C++. ..”
Test cases must check for issues associated with bounds.
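As a concrete illustration of what bounds checking means for an encapsulated array, here is a minimal sketch using the STL vector: operator[] performs no check, while at() is the standard bounds-checked access method and throws on a bad index:

#include <iostream>
#include <stdexcept>
#include <vector>

int main()
{
    std::vector<int> v(10, 0);

    v[3] = 42;         // operator[]: no bounds check; v[10] would be
                       // undefined behaviour, not a reported error

    try
    {
        v.at(10) = 42; // at(): bounds-checked; throws on a bad index
    }
    catch (const std::out_of_range& e)
    {
        std::cout << "out of range: " << e.what() << '\n';
    }
}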
Linear searches involve starting at the beginning, then checking each element in the container until we find the right one.
In other words, a brute force approach.
This is reliable but slow: its average complexity is O(n). [1]
This code is usually put inside a Find() or Search() routine.
It is the only search available for linked lists and unsorted arrays, and is therefore the search used for the STL vector and list (std::find is a linear search).
Linear Search
*
*
[1] Go through the worked examples found in the section on Asymptotic Notation: Big-O Notation (in the textbook chapter Searching and Sorting Algorithms).
Boolean Find (DataClass target, Address targetPosition) [code in textbook]
Boolean found
Set found to false
Start at the beginning of the container
WHILE not at the end of container AND found is false
IF the current element is the target
targetPosition = Address (index of the current element)
found = true
ENDIF
set current to next element
ENDWHILE
Return found
END Find
Linear Search Algorithm
*
*
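A minimal C++ rendering of this pseudocode, as a sketch (the template, vector parameter and reference out-parameter are illustrative choices, not the textbook's exact interface):

#include <cstddef>
#include <vector>

// Linear search: scan from the front until the target is found
// or the container is exhausted; O(n) on average.
template <typename T>
bool linearFind(const std::vector<T>& data, const T& target, std::size_t& targetPosition)
{
    bool found = false;
    std::size_t current = 0;                 // start at the beginning

    while (current < data.size() && !found)  // not at end AND found is false
    {
        if (data[current] == target)         // current element is the target
        {
            targetPosition = current;
            found = true;
        }
        ++current;                           // set current to next element
    }
    return found;
}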
Binary Search
A faster search than linear search exists for sorted, direct-access containers such as a sorted array or vector (ordered sets and maps achieve the same O(log n) lookup through their tree structure).
This is binary search, where the search space is halved after each guess.
The “Guess a number between 1 and 100” game played by children is a binary search.
It is a divide and conquer strategy.
The number of iterations is at most log2(n) + 1, so the order of complexity is O(log n).
See diagrams explaining this in the textbook – section on binary search, Chapter on Searching and Sorting Algorithms.
Go through the worked examples in this chapter found in the section on Asymptotic Notation: Big-O Notation. [1]
*
*
[1] Algorithms are written before you write code. Once you have written an algorithm, you need to both desk check and analyse the algorithm. If the algorithm has a high level of complexity, you may need to come up with a much more efficient algorithm. All of these activities are done before you write code. You would have learnt all this in the previous two programming units. These activities continue in this unit and beyond.
When examining your algorithm or code, there is another metric called Cyclomatic Complexity. For a quick overview, see the Wikipedia entry: https://en.wikipedia.org/wiki/Cyclomatic_complexity
Iterative Binary Search Algorithm
Find (DataClass target, integer targetIndex) [code in textbook, data must be sorted]
integer bottomIndex, middleIndex, topIndex [1]
boolean targetFound
targetFound = false
bottomIndex = 0
topIndex = arraySize-1
WHILE topIndex >= bottomIndex AND targetFound = false
middleIndex = (topIndex - bottomIndex + 1)/2 + bottomIndex
IF target = value at middleIndex
targetIndex = middleIndex
targetFound = true
ELSEIF target < value at middleIndex
topIndex = middleIndex-1 // no point searching above
ELSEIF target > value at middleIndex
bottomIndex = middleIndex+1 // no point searching below
ENDIF
ENDWHILE
Return targetFound
END Find
*
*
The word top refers to the high index and bottom to the low index.
The code in the textbook is similar, with slight variations on what the search routine returns.
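A direct C++ transcription of the iterative algorithm, as a sketch (signed indices are used so that topIndex can legitimately pass below bottomIndex; the interface is illustrative):

#include <cstddef>
#include <vector>

// Iterative binary search over sorted data: returns true and sets
// targetIndex when found. At most log2(n) + 1 iterations.
template <typename T>
bool binaryFind(const std::vector<T>& data, const T& target, std::size_t& targetIndex)
{
    bool targetFound = false;
    std::ptrdiff_t bottomIndex = 0;
    std::ptrdiff_t topIndex = static_cast<std::ptrdiff_t>(data.size()) - 1;

    while (topIndex >= bottomIndex && !targetFound)
    {
        // Midpoint formula from the slide.
        const std::ptrdiff_t middleIndex = (topIndex - bottomIndex + 1) / 2 + bottomIndex;

        if (target == data[middleIndex])
        {
            targetIndex = static_cast<std::size_t>(middleIndex);
            targetFound = true;
        }
        else if (target < data[middleIndex])
            topIndex = middleIndex - 1;     // no point searching above
        else
            bottomIndex = middleIndex + 1;  // no point searching below
    }
    return targetFound;
}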
Iteration versus Recursion
Anything that can be done with recursion can be done with iteration.
Anything that can be done with iteration can be done with recursion.
Advantages of Iteration:
Often easier to understand
Uses less memory
Advantages of Recursion: [1]
Sometimes much easier to understand
Often simpler to code
Reduces code complexity
*
*
[1]
Advantages are more pronounced when the data structures are also recursive (e.g. trees).
Implementations of recursive algorithms rely on the run-time stack. Deep or runaway recursion can lead to running out of stack space.
Each subroutine call has space and running-time costs. Run the recursive Fibonacci routine found at http://en.cppreference.com/w/cpp/chrono. You will see that the implementation matches the definition of the Fibonacci sequence, and consequently both the algorithm and the corresponding code are small. Compare the recursive version with the iterative versions found in the textbook; check the book's index if you have forgotten that you already encountered Fibonacci in the readings early in the semester/trimester. A side-by-side sketch follows.
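A minimal sketch of the two styles, for comparison (this is not the textbook's code):

#include <cstdint>

// Recursive version: mirrors the mathematical definition, but
// recomputes subproblems, so it takes exponential time in n.
std::uint64_t fibRecursive(unsigned n)
{
    return n < 2 ? n : fibRecursive(n - 1) + fibRecursive(n - 2);
}

// Iterative version: one loop, two variables; O(n) time, O(1) space.
std::uint64_t fibIterative(unsigned n)
{
    std::uint64_t previous = 0, current = 1;
    for (unsigned i = 0; i < n; ++i)
    {
        const std::uint64_t next = previous + current;
        previous = current;
        current = next;
    }
    return previous;
}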
Recursive Binary Search Algorithm [1]
Find (DataClass target, integer targetIndex) : boolean
Set targetIndex to -1
return Find (target, 0, arraySize-1, targetIndex)
End Find
Find (DataClass target, integer bottomIndex, integer topIndex,
integer targetIndex) : boolean // the overloaded version
Boolean found
Set found to false
Integer middleIndex
middleIndex = (topIndex - bottomIndex + 1)/2 + bottomIndex
IF target = array[middleIndex]
targetIndex = middleIndex
found = true // line A
ELSEIF topIndex <= bottomIndex
found = false
ELSEIF target < array[middleIndex]
Find (target, bottomIndex, middleIndex-1, targetIndex) // Line B
ELSEIF target > array[middleIndex]
Find (target, middleIndex+1, topIndex, targetIndex)
ENDIF
Return found [2] // Line C
End Find
*
*
[1] It is not recommended that you use this approach. Why?
What would happen if the search space is large?
[2] Is this going to work? Think about which found is being returned.
Is the value of found being tracked through successive calls to the overloaded version of Find?
Consider this situation: the overloaded version is first called from the other (first) version of Find.
In the first call to the overloaded version, the following condition is true:
target < array[middleIndex]
Line B is executed, resulting in a recursive call.
In this first recursive call the target is found, and Line A is executed, followed by Line C.
The value of found is returned to the original caller, at Line B in the first call to the overloaded version, and then Line C of that first call is executed, returning found to the first version of Find. Which value of found are we referring to?
How would you fix this? One possible fix is sketched below.
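One possible fix, sketched as a change to the two recursive calls only: assign each call's result to found, so that Line C returns the value propagated up from the recursion rather than the local found, which is still false.

ELSEIF target < array[middleIndex]
found = Find (target, bottomIndex, middleIndex-1, targetIndex) // Line B: keep the result
ELSEIF target > array[middleIndex]
found = Find (target, middleIndex+1, topIndex, targetIndex)
ENDIF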
Merging Sorted Containers
When we looked at Sets in a previous lecture, we looked at algorithms for subset, difference, union and intersection.
They all operate in O(n) time.
They were all very similar.
This is because they were all variations of the standard merge algorithm for sorted containers.
The merge algorithm is also important for merge sort which is the best (only) sort to use for very large amounts of data stored on disk.
Note that the STL provides this algorithm as std::merge.
*
*
Merge(container1, container2, newContainer) [1]
datum1 = first element in container1
datum2 = first element in container2
WHILE there are elements in both container1 and container2
IF datum1 < datum2
Put datum1 in newContainer
datum1 = next element in container1
ELSEIF datum2 < datum1
Put datum2 in newContainer
datum2 = next element in container2
ELSE
Put datum1 in newContainer
Put datum2 in newContainer // duplicates are being kept
datum1 = next element in container1
datum2 = next element in container2
ENDIF
ENDWHILE
WHILE there are elements in container1
Put datum1 in newContainer
datum1 = next element in container1
ENDWHILE
WHILE there are elements in container2
Put datum2 in newContainer
datum2 = next element in container2
ENDWHILE
End Merge
*
*
[1] This is not sorting but merging two sorted containers into one.
Do this merge with pen and paper to get a feel for how it works.
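A sketch of the same algorithm over two sorted vectors, keeping duplicates as the pseudocode does (the container choice and names are illustrative):

#include <cstddef>
#include <vector>

// Merge two sorted vectors into one sorted vector, following the
// pseudocode on the slide; duplicates are kept.
template <typename T>
std::vector<T> merge(const std::vector<T>& a, const std::vector<T>& b)
{
    std::vector<T> result;
    result.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;

    while (i < a.size() && j < b.size())    // elements left in both
    {
        if (a[i] < b[j])
            result.push_back(a[i++]);
        else if (b[j] < a[i])
            result.push_back(b[j++]);
        else                                // equal: keep both
        {
            result.push_back(a[i++]);
            result.push_back(b[j++]);
        }
    }
    while (i < a.size()) result.push_back(a[i++]);   // drain container1
    while (j < b.size()) result.push_back(b[j++]);   // drain container2
    return result;
}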
Categorisation of Sorting Algorithms
Categorising sorting algorithms allows decisions to be made on the best sort to use in a particular situation.
Algorithms are categorised based on:
what is actually moved (direct or indirect);
where the data is stored during the process (internal or external);
whether prior order is maintained (stable vs unstable);
how the sort progresses;
how many comparisons are made on average and in the worst case;
how many moves are made on average and in the worst case.
*
*
Direct vs Indirect
Direct sorting involves moving the elements themselves.
For example, when sorting the array
10  20  50  60  10
directly, it becomes
10  10  20  50  60
Indirect sorting involves moving objects that designate the elements (also called address table sorting). This is particularly common where the actual data is stored on disk or in a database. For example, if sorting the array:
50  20  10  60  10
we do not sort the data, but instead set up an array of the addresses (shown here as indices):
0  1  2  3  4
and sort them based on the data to which they refer:
2  4  1  0  3
(indices 2 and 4 refer to the two 10s, index 1 to 20, index 0 to 50, and index 3 to 60)
*
*
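A sketch of indirect (address table) sorting in C++: the index array is sorted with a comparator that looks at the data, while the data itself never moves. Note that std::sort is not stable, so the two indices referring to 10 may come out in either order:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::vector<int> data { 50, 20, 10, 60, 10 };

    // The address table: one index per element, initially 0..n-1.
    std::vector<std::size_t> index { 0, 1, 2, 3, 4 };

    // Sort the indices by the data they designate; the data never moves.
    std::sort(index.begin(), index.end(),
              [&data](std::size_t a, std::size_t b) { return data[a] < data[b]; });

    for (std::size_t i : index)   // prints 2 4 1 0 3 (the two 10s may swap)
        std::cout << i << ' ';
    std::cout << '\n';
}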
Internal vs External
Internal: the data is stored in RAM.
External: the data is stored on secondary storage (hard drive, tape, floppy disk, etc.).
There are two classic external sorts: natural merge and polyphase. The latter is very complicated and is usually used for large files. [1]
We won't be looking at polyphase sort in this unit; read about it out of interest.
*
*
[1] You may want to see a description of it on Wikipedia, or see part 2 of this file: http://www.fq.math.ca/Scanned/8-1/lynch.pdf
When mention is made of tape, think of it as sequential data storage that you can read in the forward direction.
Stable vs Unstable
Stable sorts preserve the prior order of elements where the new order has equal keys.
For example, if you sort on name and then sort on address, people with the same address will still be ordered by name.
On the whole, stable sorts are slower.
*
*
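The name/address example can be reproduced with std::stable_sort, which guarantees that equal keys keep their prior order; a minimal sketch:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Person
{
    std::string name;
    std::string address;
};

int main()
{
    std::vector<Person> people {
        { "Chen",  "12 High St" },
        { "Adams", "3 Low Rd"   },
        { "Baker", "12 High St" },
    };

    // First sort on name...
    std::stable_sort(people.begin(), people.end(),
                     [](const Person& a, const Person& b) { return a.name < b.name; });

    // ...then stably on address: people sharing an address stay in name order,
    // so Baker is listed before Chen at 12 High St.
    std::stable_sort(people.begin(), people.end(),
                     [](const Person& a, const Person& b) { return a.address < b.address; });

    for (const Person& p : people)
        std::cout << p.address << "  " << p.name << '\n';
}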
Type of Progression
Insertion: examine one element at a time and insert it into the structure in the proper order relative to all previously processed elements.
Exchange: as long as there are still elements out of order, select two elements and exchange them if they are in the wrong order.
Selection: as long as there are elements to be processed, find the next largest (or smallest) element and set it aside.
Enumeration: each element is compared to all others and placed accordingly. [1]
Special Purpose: a sort implemented for a particular one-off situation.
*
*
[1]
Find the count of elements that are smaller than value a. If the count is c, then a belongs at position c+1 (Knuth, 1973). A sketch of this follows.
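A minimal sketch of an enumeration sort built on that idea; positions are 0-based here, so an element goes to position c rather than c+1, and equal keys are tie-broken by index so each element gets a distinct position:

#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::vector<int> data { 50, 20, 10, 60, 10 };
    std::vector<int> sorted(data.size());

    // For each element, count how many elements must precede it;
    // that count is its final (0-based) position.
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        std::size_t count = 0;
        for (std::size_t j = 0; j < data.size(); ++j)
            if (data[j] < data[i] || (data[j] == data[i] && j < i))
                ++count;                      // tie-break equal keys by index
        sorted[count] = data[i];
    }

    for (int v : sorted) std::cout << v << ' ';   // 10 10 20 50 60
    std::cout << '\n';
}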
Number of Comparisons [1]
*
* Based on empirical evidence.
Type         Name                  Average O    Worst Case O
Insertion    Straight Insertion    n²           n²
Insertion    Binary Insertion      n log n      n log n
Insertion    Shell*                n^1.25       -
Exchange     Bubble                n²           n²
Exchange     Shaker                n²           n²
Exchange     Quicksort             n log n      n²
Merge        Merge                 n log n      n log n
Selection    Straight Selection    n²           n²
Selection    Heap                  n log n      n log n
*
[1] A number of these algorithms are described in the textbook chapter on Searching and Sorting Algorithms.
Number of Moves
*
* Based on empirical evidence only, as there does not appear to be a published analysis of moves for this algorithm.
Type         Name                  Average O    Worst Case O
Insertion    Straight Insertion    n²           n²
Insertion    Binary Insertion      n²           n²
Insertion    Shell*                n^1.25       -
Exchange     Bubble                n²           n²
Exchange     Shaker                n²           n²
Exchange     Quicksort             n log n      n²
Merge        Merge                 n log n      n²
Selection    Straight Selection    n log n      n²
Selection    Heap                  n log n      n log n
*
Algorithm Choice
Looking at the tables, ‘clearly’ heap sort is the fastest, followed by mergesort and quicksort.
So why is quicksort the algorithm used by spreadsheets, the STL, C's qsort, etc.?
There can be several reasons:
The first is that quicksort sorts in place, while merge sort needs auxiliary storage (and, when the data is external, extra I/O); there are versions of merge sort which try to cut down on this.
Obtaining and releasing memory is time-consuming.
The next reason is hidden in the use of big-O notation: the constant factors differ. When running quicksort, merge sort and heap sort on my PC, I found that although they are all O(n log n) for random data, quicksort ran twice as fast as heap sort and almost 5 times faster than merge sort!
There are lots of very complicated ways to optimise quicksort.
On the flip side, merge sort is very suited to parallel programming.
*
*
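A minimal timing sketch in the spirit of that comparison (absolute numbers will vary with machine, compiler and data; std::sort is typically an introsort built around quicksort, std::stable_sort is typically merge-based, and make_heap followed by sort_heap is a heap sort):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Time one sorting routine over its own copy of the same random data.
template <typename Sorter>
void timeSort(const char* name, std::vector<int> data, Sorter sorter)
{
    const auto start = std::chrono::steady_clock::now();
    sorter(data);
    const auto stop = std::chrono::steady_clock::now();
    std::cout << name << ": "
              << std::chrono::duration<double, std::milli>(stop - start).count()
              << " ms\n";
}

int main()
{
    std::vector<int> data(1000000);
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 1000000);
    for (int& v : data) v = dist(gen);

    timeSort("quicksort-based std::sort", data,
             [](std::vector<int>& d) { std::sort(d.begin(), d.end()); });

    timeSort("merge-based std::stable_sort", data,
             [](std::vector<int>& d) { std::stable_sort(d.begin(), d.end()); });

    timeSort("heap sort (make_heap + sort_heap)", data,
             [](std::vector<int>& d) {
                 std::make_heap(d.begin(), d.end());
                 std::sort_heap(d.begin(), d.end());
             });
}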
Readings
Textbook Chapter Searching and Sorting Algorithms.
Reference book: Introduction to Algorithms. For further study, see the part of the book called Sorting and Order Statistics; it contains a number of chapters on sorting.
*