Algorithms and Data, Summer 1 Problem Set 3

Algorithms and Data, Summer 1 Problem Set 3
Due Wed, June 1 9PM
As usual, write up your answers separately from your partner, and name your partner on your PDF and in your code. Submit the zipped file 4800_XX_PS3.zip containing PS3.pdf, Quicksort.java, Bucketsort.java, and any supplemental Java files. You and your partner should both submit. Your code can be identical.
Part I – Recursive running times
Use the Master Theorem (or a different method, if you prefer) to give the big-Θ running time of a recursive algorithm whose running time is described by the given recurrence. You do not need to show work or have different answers from your partner for this part.
1.) T(n)=2T(n/2)+n2 +5 2.) T(n)=4T(n/4)+n +12 3.) T(n) = T(n/2) + log(n) + 2 4.) T(n)=4T(n/2)+n + 3 5.) T(n) = T(n/10) + 5
Part II – Designing and Analyzing Algorithms
1.) Suppose you have N objects of a data type that you didn’t implement and don’t have the source code for, and all you can do to determine their identities is call a method that returns true iff two objects are equivalent to each other. Provide the pseudocode for a divide-and-conquer algorithm that uses no more than O(N log N) calls to the equality check to determine whether a majority of the objects (> N/2) are all equivalent to each other. Briefly explain why your algorithm works and how you know the running time is O(N log N).
2.) You work for a financial news website, and the site is looking for “sleeper” stories where a stock overall slowly made great gains over the course of some time period, even if the day-to- day gains were not necessarily impressive. To help find these stories, you want to write some code that will look at a time series of N gains and losses expressed as percent changes, such as +0.5% or -2.1%, and find the time indices i,j such that the percent gain A[j]/A[i] * 100 is maximized. Find an O(N log N) divide-and-conquer algorithm that does this. (Be careful: recall that the overall gain when gaining 10% and then another 10% is not 20%, but 110% * 110% = 1.1 * 1.1 = 1.21 = 21% gain. Losing 10% of that would not get us back to a 10% gain, but would result in a net gain of 1.21 * 0.9 = 1.089 = +8.9%.)
3.) Suppose you are working with a bad random number generator that can’t actually generate fair bits; instead, 0 is generated with probability p and 1 is generated with probability (1-p). Luckily, you realize a solution: looking at two calls to the random bit generator, the sequence “01” and the sequence “10” have the same probability, p(1-p) = (1-p)p. You could ask for two not-so-random bits, and as long as the sequence isn’t 00 or 11, you can use the “01” or “10” result to get one fair bit.
a.) Determine the expected number of calls to the unfair random number generator that you need to extract N fair bits in expectation.
b.) You suddenly realize you don’t need to throw out the 00 and 11 sequences after all, which is good because that bad random number generator is also slow. Describe a divide-and-conquer algorithm that can extract more fair bits from a sequence of unfair bits.
4.) Suppose we are sorting data that was once sorted, but which has been “jostled” k times such that with probability p, two adjacent elements were chosen at random and swapped. Demonstrate a reasonable big-O bound on the expected running time of Insertion Sort under these conditions, in terms of k, n, and p. Be sure to justify your assumptions.
Part III – Zip files
Create text files consisting of the string “foo” repeated N=1, 250000, 500000, and 1000000 (1 million) times, and compress each using your preferred operating system’s utility for making “zip” files. (You may include a newline after each “foo” or not, as you like.) Record in your answers the zipped file sizes in bytes, then answer the following.
1.) Answer the question that applies to your zipped file size for N=1.
If your result for N=1 is greater than the number of bytes for the uncompressed file: assuming the algorithm used is DEFLATE and some kind of Huffman coding was used, what information is probably contained in those extra bytes?
If your result for N=1 is the same or less than the uncompressed size: how do you think this compression might have been achieved?
2.) Calculate the number of bytes that would be used in a reasonable “run-length encoding” of the million-foo file, where the run-length encoding encodes repetitions of strings instead of repetitions of individual bits, and compare this to the performance of your zip utility. Assume 8 bits per character.
3.) Is this compression algorithm using a version of LZW where the entries in the dictionary can be arbitrarily long — so, for example, 1,000,000 repetitions of “foo” can be one codeword? Use your measurements and some math to argue for or against this state of affairs. You can assume the binary codes for the different codewords are all roughly the same length.
Part IV – Quickersorts?
Because everybody sorts things, and Quicksort is a popular way to do it, various people have come up with Quicksort optimizations that don’t change the big-O, but can hypothetically improve the constants involved. Let’s empirically try out some of these optimizations.
Implement the following variations on Quicksort in a file Quicksort.java. The command
java [optional argument] [N]
should run the correct kind of Quicksort on an N-entry random array of doubles in the range [0,1) (which Random.nextDouble() does for you). For example,
java Quicksort 100
should run a normal Quicksort on a 100-element array, while
java Quicksort -m 1000
should run a median-of-three Quicksort on a thousand random numbers.
The variations:
[no command-line argument] – A normal Quicksort that is not in-place – you can copy the numbers into additional memory when sorting recursively. One pivot is selected at random.
[command-line “-l”] – “Lazy” Quicksort. No need to select the pivot at random – just grab the first element of the array, saving a random number generator call, which is expensive.
[command-line “-m”] – Median-of-three quicksort. Instead of choosing one pivot value, take the middle value of the first, last, and middle elements of the array, and use that as the pivot. Reduces the chance of picking a bad pivot that isn’t close to the median.
[command-line “-i”] – In-place Quicksort, as described in lecture. Supposed to have a better constant in the running time because allocating memory is expensive.
You do not need to implement the cases where multiple command-line arguments are called; assume they’re mutually exclusive. So, -l and -m never need to be in-place, and you don’t need to implement a lazy median-of-three.
Those all have the same big-Θ expected time, so let’s compare to an algorithm that actually has better big-Θ expected time for this particular problem, but potentially may have large constants in that running time as well. When you have uniformly distributed data in some range, Bucketsort is an expected linear time algorithm. The algorithm:
Create N “buckets” (empty arrays or ArrayLists) representing ranges of numbers that are 1/N of the possible values (e.g., for numbers between 0 and 1 going into 100 buckets, the first bucket will contain values in [0,0.01); the last bucket will contain [0.99,1))
Make a linear pass through the values, adding them to the correct buckets1
For each non-empty bucket,
If the bucket has > 1 values, sort the bucket – you can use one of your Quicksorts here Add the sorted values to the end of the final sorted ArrayList
Because the values are uniformly distributed — an even chance of hitting any bucket — each bucket will have only 1 value in expectation, so the sort in the inner loop has extremely low probability of doing more than a few operations.
Code Bucketsort in Bucketsort.java, with a command-line argument for the size of random array, just like Quicksort.
Run all your Quicksort variations and Bucketsort for N = 10, 1000, 100000, 1000000 (1 million), and time them using System.nanoTime(). Report your sorting times (not counting the array setup) in reasonable units in your PDF (if the time is on the order of seconds, don’t report the number of nanoseconds!). Also, in your PDF, discuss each sort’s performance compared to the others – what kinds of optimizations produce the best savings in time? In particular,
1 Note that you should calculate the correct bucket to add a value to in O(1) time to achieve expected linear time overall.

If Bucketsort did worse than any Quicksorts, note which Quicksorts outperformed it and venture a guess as to why that might be.
Your programs should print their times to sort, again with labeled time units, and also print the final sorted array if N < = 100. You may not use any Java-provided implementations of these sorts, and as usual, referencing pseudocode on the Web for these sorts is okay while referencing Java code for these sorts is not.