Mod 3 Homework – Running time analysis
We have two goals in this assignment:
1) Use python to fit data with user-defined functions
2) Plot raw data and a best-fit curve on the same figure
We will use the scipy module (link) for the first and matplotlib (see lab) for the second.
You should be able to install both in terminal using pip; e.g.:
pip install scipy
If you are struggling, you may consider Anaconda (link) – an all-in-one python install designed for data
science. Anaconda includes many common packages, including the ones used here.
Fitting Data
The curve fitting functionality we are interested in is in scipy’s optimize module. We can fit data to a
function of our choice as follows:
from scipy import optimize
params, params_cov = optimize.curve_fit(func, xdata, ydata)
optimize.curve_fit returns two collections – the parameters that make func best fit our data, and the
covariance of those parameters (which can be ignored for this assignment).
An example:
from scipy import optimize
from matplotlib import pyplot as plt
# linear fitting function
# the first parameter of a fitting funciton must be x
# subsequent parameters are automatically adjusted by curve_fit
def lin(x, m, b):
return m*x + b
# generate data on y = 1*x + 0, then add some “noise”
xdata = [i for i in range(10)]
ydata = [i for i in range(10)]
y[1] = 2
y[8] = 7
# Find parameters for lin that best fit xdata and ydata
params, params_cov = optimize.curve_fit(lin, xdata, ydata)
# unpack the calculated parameters
m = params[0]
b = params[1]
# create an empty list for the line of best fit
y_fit = []
for x in xdata:
y_fit.append(lin(x, m, b))
# plot the raw data and the line of best fit
plt.figure()
1
https://www.scipy.org/install.html
https://docs.anaconda.com/anaconda/install/index.html
plt.scatter(xdata, ydata)
plt.plot(xdata, yfit)
plt.show()
See the included figure example_data_fitting.png for the expected result.
TODO1: Fitting.py
Create three fitting functions:
• const – tries to fit with f(x) = c0
• lin – tries to fit with f(x) = c1 · x + c0
• quad – tries to fit with f(x) = c2 · x2 + c1 · x + c0
Quantifying a Fit
We may ask ourselves “how good is that fit?” To answer that, we need to define the error of our fit relative to
the provided data. There are many ways to do this, but we will use the “root mean square” method – we will
calculate the square root of the mean of the squares of the error at all points.
The sum of the squares of the errors (se) is:
se =
∑N−1
i=0 (ydata[i] − yfit[i])
2
The mean (average) square error (mse) is:
mse = 1
N
· sq_err
And the square root of the mean square error (RMS) is:
RMS =
√
mse
Putting it all together:
RMS =
√∑
N−1
i=0
(ydata[i]−yfit[i])2
N
For the data above, if our fit line was y = 1 · x + 0, then the RMS can be calculated as follows:
sq_err = (0 − 0)2 + (1 − 2)2 + (2 − 2)2 + (3 − 3)2 + … + (8 − 7)2 + (9 − 9)2) = 2
mean_sq_err = 110 · 2 = 0.2
RMS =
√
0.2 ≈ 0.4472
The actual fit line calculated by curve_fit is y = 0.915 · x + 0.382. Does this give a lower RMS?
TODO2: Fitting.py
Write a function named fit_data that takes three inputs:
• fitting function
• some xdata
• some y data
and returns 3 outputs (by returning a 3-tuple):
• parameters of best fit
• the root mean square error of that fit
• a list containing the y-data of that fit
2
The numbers you get back will likely be very long floats. If we truncated them to integers, your function’s
return might look something like:
>>> from Fitting import *
>>> x = [i for i in range(5)]
>>> y = [i for i in range(5)]
>>> fit_data(lin, x, y)
(array([1, 0]), 0, [0, 1, 2, 3, 4])
Notes:
• Do not actually truncate your return values. We did it here for demonstration purposes.
• array denotes a special class used in scipy – you do not need to “get rid of it”, and it should not impact
the rest of this assignment if you treat it like a typical list.
Algorithm Timing
We will look at sorting algorithms more in-depth in the coming weeks. For now, it is sufficient to know that
these algorithms should take a (possibly) unsorted list as input, and modify that list until it is sorted. These
algorithms do not need to return anything in Python – when a collection is passed as an argument to a
function, any modificaitons to that collection are persistent outside of the function.
def bubble_sort(L):
n = len(L) # no. of items in list
for i in range(n): # for every item
for j in range(n): # compare to every other item
if L[i] < L[j]: # if out of order:
L[i], L[j] = L[j], L[i] # swap items
L = [1, 5, 4]
print(L) # expect: [1, 5, 4]
bubble_sort(L)
print(L) # expect: [1, 4, 5]
TODO3 - TimingPlot.py
• Generate a series of x (number of items to sort) and y (time to sort) data using the above algorithm.
Use randomly sorted lists. For instance, to generate a list of 100 random integers, you might write:
import random
n = 100
L = [random.randint(0, n) for i in range(n)]
• Fit that data (n vs time) with the linear and quadratic functions. Choose n such that you can clearly
see the expected shape of the curve.
• Generate a figure with three curves:
– a scatter plot of the raw data
– a line of best fit with lin
– a line of best fit with quad
• Follow best practices when presenting data:
– Scale your data so the axes numbers are between 1 and 1000
– Indlude axis labels with units
3
– Label each curve clearly, either with a textbox on the figure or a legend
Save your figure as “bestfit.png”.
We are purposfully giving you flexibility here - how should you generate the lists as n grows? How should you
time the functions? You solved a very similar problem in lab; try to apply those techniques here.
Note that we will manually grade the final plot, and that you produced the data appropriately. Bad-faith
attempts (like arbitrarily generating numbers for your times) will not recieve any credit on this assignment.
Include all code you use to generate data and the figures.
Submitting
At a minimum, submit the following files:
• Fitting.py
– const()
– lin()
– quad()
– fit_data()
• TimingPlot.py
– bubble_sort()
– whatever functions you use to generate timing data
• bestfit.png
Also submit any other files you mayu use to generate data or the final figure.
Students must submit to Mimir individually by the due date (typically, the Wednesday after this module at
11:59 pm EST) to receive credit.
Grading
Auto Graded
• 40 - data fitting functionality
Manually Graded
• 30 - TimingPlot.py generates time vs n data appropriately and times functions correctly
• 30 - “bestfit.png” figure
Feedback
If you have any feedback on this assignment, please leave it here.
We check this feedback regularly, and it has resulted in many improvements.
4
https://s.uconn.edu/cse2050_feedback
Mod 3 Homework - Running time analysis
Fitting Data
TODO1: Fitting.py
Quantifying a Fit
TODO2: Fitting.py
Algorithm Timing
TODO3 - TimingPlot.py
Submitting
Grading
Auto Graded
Manually Graded
Feedback