Lab 2: Working with data¶
Computational modelling often works hand in hand with data. Data is used to guide the choice of a model, set parameters, and validate the model. Comparing predictions of a model with observations also helps to identify where assumptions underpinning a model break down, and so guides refinements of a model, and helps to better understand the original problem.
Most of the computational models we will look at in this course are concerned with “dynamics”, i.e. how something changes with time. However, we are not yet ready to look at such models in detail. Computational models usually have no exact solutions, but the simplest cases do. In the first part of this lab we will look at a simple analytical solution to a model and compare this to data. We will use a simple approach to reading in the data, one which is commonly used when our data comes as rows of numbers. In the second part of the lab we will look at a more sophisticated data set, where the power of the pandas Python package becomes particularly useful.
Problem 1: Epidemic modelling¶
We are all acutely aware of the pandemic. The actions taken in the face of this crisis are an example of the close collaboration between data and modelling. Epidemic modelling has a long history, and with increasingly detailed data, more and more sophisticated approaches to modelling the spread of disease have become possible. Modelling allows ‘what if’ scenarios to be examined, to identify the best course of action. An example of such modelling in Python can be found here:
https://towardsdatascience.com/modelling-the-coronavirus-epidemic-spreading-in-a-city-with-python-babd14d82fa2
We can’t yet ask ‘what if’ questions, but we can look at data to identify the best fit between the data and models of disease spread. We can then ask questions such as: ‘are the current interventions having an impact?’
Model 1: An exponential growth model for an epidemic¶
The simplest reasonable model for the early stages of the growth of an epidemic is one where the rate of change of the number of infected individuals is proportional to the number of people who already have the disease. This assumes that the people with the disease are mixed in amongst the people without it, so that the number of people a sick person meets is constant. It also assumes that people do not recover from the disease, at least over the time period of interest. This is characteristic of the earliest phase of the spread of a disease, when no controls have been put in place to restrict the movement of people, the healthy greatly outnumber the sick, and the sick have not yet recovered.
The model can be written as a differential equation:
$$\frac{d P}{dt} = rP$$where $P$ is the number of people with the disease at time $t$, and $r$ is a measure of how many people each sick person infects per unit time. The solution to this model is
$$P(t) = A \exp(rt)$$where $A$ is the number of infected people at time $t = 0$.
Model 2: A sub-exponential growth model for an epidemic¶
There are numerous more sophisticated models for epidemic growth. We shall consider the following modification of the previous exponential growth model:
$$\frac{d P}{dt} = rP^q$$where $0 < q < 1$ gives a measure of how the growth rate of new cases depends on the number of infected people. The solution to this model is
$$P(t) = \left(\frac{r}{m}t+A \right)^m$$where $m = 1/(1-q)$ and now $A = (P(0))^{1/m}$.
In this lab we will compare model 1 and model 2 with data available for the coronavirus outbreak.
a) Download the data
Below the lab on Canvas you will find a file COVID19_allcases.txt. For each day in the early phase of the pandemic, this file contains the total number of cases recorded up to and including that day. The first record corresponds to January 22nd, the second to January 23rd, and so on. Download this file and save it in the same directory as your Jupyter notebook file.
b) Read the data
Read the data into a numpy array. You do not need to use pandas to do this. One approach is to set up a list, open the file, read each line in the file and append the result to the list. Once the list is complete you can convert it to a numpy array.
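A minimal sketch of this approach, assuming the file holds one number per line (check the file in a text editor first) and calling the resulting array cases:

import numpy as np

cases = []
with open('COVID19_allcases.txt') as f:
    for line in f:
        line = line.strip()
        if line:                          # skip any blank lines
            cases.append(float(line))
cases = np.array(cases)
print(len(cases), 'days of data')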
c) Plot the data
Using matplotlib, plot the data versus time. Can we identify by eye any phases in the disease spread?
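One way to produce the plot, assuming the array cases from part (b):

import matplotlib.pyplot as plt

days = np.arange(1, len(cases) + 1)       # day 1 corresponds to 22 January
plt.plot(days, cases, 'o-')
plt.xlabel('day')
plt.ylabel('total recorded cases')
plt.show()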
There appears to be a period of rapid growth (concave up), followed by steadily slowing growth (concave down). There is also a significant jump in the number of cases which, on checking how the data were collected, turns out to be due to a change in the way cases were recorded.
d) Looking only at the first five data points (let's call this data day 1 to day 5), find a value of $r$ which fits the exponential model to this data. Plot the data and the exponential model together over the first 10 days. What can we conclude about the exponential model?
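One possible approach (a sketch, not the only valid method): since $P(t) = A \exp(rt)$ implies $\ln P$ is linear in $t$, a straight-line fit to the logarithm of the first five points gives estimates of $r$ and $A$:

t5 = np.arange(1, 6)                                 # days 1 to 5
r, logA = np.polyfit(t5, np.log(cases[:5]), 1)       # slope = r, intercept = ln(A)
A = np.exp(logA)

t10 = np.arange(1, 11)
plt.plot(t10, cases[:10], 'o', label='data')
plt.plot(t10, A * np.exp(r * t10), label='exponential model')
plt.xlabel('day')
plt.ylabel('total cases P')
plt.legend()
plt.show()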
e) Now, using your exponential model, extrapolate backwards in time to find the predicted day when 'patient zero' appeared (i.e. when the number of infected people first goes below one). The disease is thought to have appeared on December 1st 2019. Identify a plausible reason for a discrepancy between your model and this date.
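Setting $P(t) = 1$ in the model gives $t = -\ln(A)/r$. A minimal sketch, assuming the fitted A and r from part (d):

t_zero = -np.log(A) / r                              # solve A*exp(r*t) = 1 for t
print('predicted patient zero around day', t_zero)   # negative: days before 22 January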
f) Fixing $r = 2$ for Model 2, determine a value for $q$ by comparing the first 14 days of data with the model. Plot the data and model out to day 20. Can we conclude anything about the spread of the disease in light of the model dynamics?
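A simple brute-force scan over $q$ is one option. This is a sketch, not the only valid method; it assumes the array cases and treats the first record as $t = 0$, so that $A = (P(0))^{1/m}$:

r2 = 2.0
t14 = np.arange(14)                          # first 14 days, first record at t = 0
best_q, best_err = None, np.inf
for q in np.linspace(0.05, 0.95, 91):
    m = 1.0 / (1.0 - q)
    A2 = cases[0] ** (1.0 / m)               # A = (P(0))^(1/m)
    model = ((r2 / m) * t14 + A2) ** m
    err = np.sum((model - cases[:14]) ** 2)  # sum of squared differences
    if err < best_err:
        best_q, best_err = q, err

m = 1.0 / (1.0 - best_q)
A2 = cases[0] ** (1.0 / m)
t20 = np.arange(20)
plt.plot(t20, cases[:20], 'o', label='data')
plt.plot(t20, ((r2 / m) * t20 + A2) ** m, label='sub-exponential model')
plt.xlabel('day')
plt.ylabel('total cases P')
plt.legend()
plt.show()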
Checkpoint 1
For the first Checkpoint create a png plot showing the data, and models 1 and 2, out to day 31 (the end of the data). Constrain the y axis to a maximum value of $P = 200000$. Include a legend in your graph and label your axes. Submit this figure to Lab 2 Checkpoint 1 on Canvas.
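A sketch of the plotting commands, assuming arrays model1 and model2 holding the two models evaluated out to day 31 (the file name is only an example):

plt.plot(days, cases, 'o', label='data')
plt.plot(days, model1, label='Model 1 (exponential)')
plt.plot(days, model2, label='Model 2 (sub-exponential)')
plt.ylim(0, 200000)
plt.xlabel('day')
plt.ylabel('total cases P')
plt.legend()
plt.savefig('Lab2Checkpoint1_your_SID.png')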
Problem 2: Using pandas with more complex data sets¶
You might like to use the Python notebook on Canvas pandasIntroduction.ipynb, although the required steps are suggested in this exercise.
Our goal in this problem is to determine the average temperature for each hour of the day, averaging over temperature data from the last 72 hours. Rather than individually downloading data from the Bureau of Meteorology, use the data set provided on Canvas. We will thus all have the same results.
To get the average temperature we will have to use a few "power moves", which may seem like big jumps as we do them. We will need to clean the data, introduce a column of datetime objects so we can ask pandas to group data by time, and then use the groupby method to get average temperatures for each hour.
You will be given quite a few of the steps along the way. See this exercise more as an opportunity to engage with how we might use pandas, rather than becoming fluent in the steps.
a) Download the data
Download the file 'IDN60901.94768.axf' from Canvas. This contains 72 hours of recent weather data recorded at the Bureau of Meteorology weather station on Observatory Hill in Sydney.
b) Import the pandas library
c) Read the data into a pandas dataframe, using the command read_csv. Note, there is some additional information in the data file which we aren't interested in. You can skip over these lines in the data file by using the header parameter. Set header=17 to skip to the data we are interested in.
#Write code to read the file here. Call your dataframe data
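A minimal sketch, using the file name from part (a) and the header value given above:

import pandas as pd

data = pd.read_csv('IDN60901.94768.axf', header=17)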
d) Check that you have read in the data correctly. Print out the first few lines of the dataframe using the head() method.
# Write code to print first four lines in the dataframe
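For example, to display the first four rows:

data.head(4)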
e) Cleaning the data: removing NaNs.
Unfortunately, a dataset is very often incomplete. When values are missing they frequently appear in our data set as NaN (Not a Number) entries. These entries can influence our analysis and lead to wrong results.
Data cleaning is the process of detecting and correcting (or removing) missing or inaccurate records from a record set, table, or database.
pandas offers a very useful method to check if your dataframe contains NaN entries: .isnull(). This returns a dataframe of booleans, with True at the location of each NaN. You can then check whether any entry is True by appending .values.any().
# Code for checking for presence of NaNs
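Following the steps above:

data.isnull().values.any()    # True if any NaN is present anywhere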
We can look across the columns of each row (axis = 1) to determine which rows contain the NaNs.
# Code for identifying the location of NaNs
# data[
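Completing the hint, one way to display the rows containing NaNs:

data[data.isnull().any(axis=1)]    # rows where any column is NaN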
So, there is an entire row (row 144 to be precise) that contains NaN entries. We can remove this row using the dropna dataframe method: .dropna(axis=0, inplace=True) drops a row (axis = 0) if any of its values are NaN.
# Code to remove NaN values from the dataframe
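Applying the method described above:

data.dropna(axis=0, inplace=True)    # drop any row containing a NaN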
f) Inspect the end of the dataframe using the method .tail()
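For example:

data.tail()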
g) Working with timeseries.
pandas contains an extensive set of tools for working with dates, times, and time-indexed data. However, looking at our dataframe, none of its columns are yet in a form that pandas recognises as dates and times.
The column local_date_time_full[80] contains this information but, to use it in pandas, we need to convert it to datetime objects. In the cell below, write code that adds a new column ‘time’ in your dataframe which contains the same information as local_date_time_full[80], but correctly transformed into datetime objects.
Currently pandas has interpreted local_date_time_full[80] as floats, and to use the function to_datetime we need to convert the values to strings first. To do this use .apply(str). You can then use the pandas function to_datetime(). When calling to_datetime you will also have to specify the format parameter, telling Python the format of the string it is being asked to parse. Looking at the data, we can see the format parameter should be format="%Y%m%d%H%M%S.%f".
# Write code which adds a new column ‘time’, which holds a datetime object.
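Putting the two steps together, a minimal sketch:

data['time'] = pd.to_datetime(data['local_date_time_full[80]'].apply(str),
                              format="%Y%m%d%H%M%S.%f")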
h) Print out the dtypes of all your dataframe columns, confirming that you do indeed have a column containing datetime objects
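For example:

print(data.dtypes)    # the 'time' column should show as datetime64[ns]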
i) Print out the hour values for the entries of the time column in your dataframe. Note, pandas series objects have specific methods for accessing datetime objects, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html
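Using the .dt accessor described on that page:

print(data['time'].dt.hour)    # the hour (0-23) of each observation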
j) Now we move to our last power move: we want to average the temperatures by hour across the data set. So, we would like to group the data by hour, and then take the average of this data. Fortunately there is a perfect method for this: .groupby(). Produce a new dataframe group which contains the means for each column, averaged over data with the same hour value. The command should look something like group = data.groupby(...); a sketch is given after the placeholder below.
# Average temperature per hour
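A minimal sketch, grouping on the hour extracted from the time column (numeric_only=True restricts the average to numeric columns):

group = data.groupby(data['time'].dt.hour).mean(numeric_only=True)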
k) Finally, produce a bar graph of the average air temperature per hour, averaged across the 72 hours. pandas has a built-in graphing method .plot.bar(). Which is the hottest hour of the day? Which is the coldest?
# Bar graph plot of average temperature per hour
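A sketch, assuming the air temperature column is called air_temp; this name is not given in the text, so check data.columns for the exact name in your file:

group['air_temp'].plot.bar()     # 'air_temp' is an assumed column name
plt.xlabel('hour of day')
plt.ylabel('mean air temperature (°C)')
plt.show()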
COSC2002 Checkpoint 2
You are now ready to submit Checkpoint 2 to Canvas. Produce a png figure of the bar graph produced in part k) with the x and y axes appropriately labelled. Submit your figure to Checkpoint 2 of Lab 2.
# Code for saving a figure
#plt.savefig('BarPlotAirTemp_your_SID.png')
COSC2902 Checkpoint 2
Submit the figure as detailed below to Checkpoint 2, Lab 2 on Canvas.
COSC2902 Checkpoint: Dewpoint calculation¶
The dew point is the temperature to which air must be cooled to produce condensation or dew. Relative humidity and dew point both measure moisture in the air; however, they are different.
The relative humidity describes how close the air is to saturation. It expresses as a percentage the amount of water in the air, with respect to the maximum amount the air can hold. Higher temperature air can hold more water. This means that if the amount of moisture in the air stays the same but the temperature rises, the relative humidity falls.
The dew point is related to the relative humidity. It is the temperature at which the relative humidity is 100%. Typically it is somewhat below the current temperature. This makes sense: if the amount of water in the air is fixed and we lower the temperature, then the maximum amount of water the air can hold goes down, and so the relative humidity goes up.
This is why we often see dew on grass in the morning. The air temperature overnight has dropped below the dew point forcing water to condense out of the air.
For this checkpoint, you will calculate the dew point based on your data. As the BOM provides this estimate (column dewpt), you can check your calculation by comparing your results with the given values.
The relationship between the dew point temperature ($\,\rm{T_{dp}}$), relative humidity ($\,\rm{H}$ ) and the current air temperature ($\,\rm{T_a}$ ) is given by:
$$\rm{T_{dp}}=\left(\frac{\rm{H}}{100}\right)^{1/8}(112+0.9\rm{T_a})+0.1\rm{T_a}-112$$For Checkpoint 2 you are required to calculate the expected dew point for each hour, based on the mean humidity and mean air temperature, and to plot it on the same graph as the mean dew point given in the BOM data. Also plot the ambient temperature on the same axes. Include a legend identifying your three lines, label your axes, and submit your results to Checkpoint 2 Lab 2.
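A sketch of the calculation, assuming the hourly-means dataframe group from Problem 2. The column names rel_hum and air_temp are assumptions about the BOM file (check data.columns); the dewpt column is named in the text:

H = group['rel_hum']        # assumed column name for relative humidity
Ta = group['air_temp']      # assumed column name for air temperature
Tdp = (H / 100) ** (1 / 8) * (112 + 0.9 * Ta) + 0.1 * Ta - 112

plt.plot(group.index, Tdp, label='calculated dew point')
plt.plot(group.index, group['dewpt'], label='BOM mean dew point')
plt.plot(group.index, Ta, label='mean air temperature')
plt.xlabel('hour of day')
plt.ylabel('temperature (°C)')
plt.legend()
plt.savefig('Dewpoint_your_SID.png')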
Do you notice any discrepancies between your calculation and the BOM averages? What might be the origin of these discrepancies? Do we expect any dew? When is the best time for dew to form?