QBUS2820 – Predictive Analytics
Tutorial 9 Tasks – Working with Time Stamped Data

Your task is to study and work through the material below on your own, to gain a better understanding of how to work with time series data in Python and, potentially, to help with Assignment 2.

This guide explains the basics of working with dates and times in Python and pandas.

Dates and Time in Python

Conversions between strings and datetime

Date functionality in pandas

Reading time stamped data

Time series subsetting

Time series plot

This notebook assumes the following imports and settings.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')
sns.set_style('ticks')

Dates and Time in Python
The datetime module from the standard Python library provides the basic variable types and tools for date and time data. To get started, we retrieve the current date and time.

from datetime import datetime
now = datetime.now()
now

datetime.datetime(2022, 4, 26, 20, 10, 58, 446160)

print(now)

2022-04-26 20:10:58.446160

The now variable that we created has a special data type, which stores the date and time down to the microsecond.

type(now)

datetime.datetime

now.year, now.month, now.day

(2022, 4, 26)

An interesting feature of datetime objects is that we can perform operations with them.

delta = datetime(2020, 9, 15) - datetime(2019, 9, 12, 6, 10)
delta

datetime.timedelta(days=368, seconds=64200)

delta.days, delta.seconds

(368, 64200)
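As a small sketch of how these two attributes fit together (the helper variable names are illustrative and not part of the original notebook): seconds only holds what is left over after the whole days, while total_seconds() returns the full duration as a single number.

```python
from datetime import datetime

delta = datetime(2020, 9, 15) - datetime(2019, 9, 12, 6, 10)

# delta.seconds is the remainder after whole days, so split it into hours and minutes
hours, remainder = divmod(delta.seconds, 3600)
minutes = remainder // 60

# total_seconds() combines the days and seconds parts into one number
total = delta.total_seconds()

print(delta.days, hours, minutes)  # 368 17 50
print(total)                       # 31859400.0
```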

For example, if we want to shift a date 5 days ahead, we can use:

from datetime import timedelta
start = datetime(2020, 10, 1)
start + timedelta(days=5)

datetime.datetime(2020, 10, 6, 0, 0)

As a note, the datetime module also has separate date and time objects.
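A minimal sketch of these two types, assuming we want to keep a calendar date and a time of day separately and then join them (the values are made up for illustration):

```python
from datetime import date, time, datetime

d = date(2020, 10, 1)  # calendar date only
t = time(9, 30)        # time of day only

# datetime.combine joins a date and a time into a single datetime
combined = datetime.combine(d, t)

print(combined)         # 2020-10-01 09:30:00
print(combined.date())  # 2020-10-01
print(combined.time())  # 09:30:00
```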

Conversions between strings and datetime
Datetime objects (and their pandas counterparts) have the strftime method, which allows us to convert them to a string according to our desired format. Refer to the "strftime() and strptime() Format Codes" section of the Python datetime documentation for the available formatting options.

stamp = datetime(2020, 10, 1)
stamp.strftime('%Y-%m-%d')

'2020-10-01'

stamp.strftime('%d/%b/%y')

'01/Oct/20'

To convert strings to datetime, we can use the parse function from the dateutil package, which infers almost any intelligible date format. Here is an example.

from dateutil.parser import parse
parse('Sept 15th 2020')

datetime.datetime(2020, 9, 15, 0, 0)

We just need to be careful, because parse assumes the US date format (month first) unless we specify otherwise.

print(parse('1/10/2020'))
print(parse('1/10/2020', dayfirst=True))

2020-01-10 00:00:00
2020-10-01 00:00:00
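When the format of the strings is known in advance, one way to avoid the ambiguity altogether is to state the format explicitly with datetime.strptime from the standard library. This is a sketch of an alternative to parse, not the approach used in the rest of this guide.

```python
from datetime import datetime

text = '1/10/2020'

# The format string decides which number is the day and which is the month
day_first = datetime.strptime(text, '%d/%m/%Y')    # read as 1 October 2020
month_first = datetime.strptime(text, '%m/%d/%Y')  # read as 10 January 2020

print(day_first)    # 2020-10-01 00:00:00
print(month_first)  # 2020-01-10 00:00:00
```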

Date functionality in pandas
When dealing with multiple dates, we turn to pandas.

dts = ['10/10/2020', '11/10/2020']
dates = pd.to_datetime(dts, dayfirst=True)
dates

DatetimeIndex(['2020-10-10', '2020-10-11'], dtype='datetime64[ns]', freq=None)

In pandas, a set of dates has the DatetimeIndex type. Each element of a DatetimeIndex is a Timestamp, which for practical purposes is equivalent to datetime; we can use the two interchangeably.

dates[0]

Timestamp('2020-10-10 00:00:00')
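The short check below illustrates this equivalence: a pandas Timestamp is a subclass of the standard datetime, so the two compare equal and can be converted back and forth. The variable names are illustrative only.

```python
from datetime import datetime
import pandas as pd

stamp_pandas = pd.Timestamp('2020-10-10')
stamp_python = datetime(2020, 10, 10)

print(isinstance(stamp_pandas, datetime))  # True
print(stamp_pandas == stamp_python)        # True

# Convert explicitly when a plain datetime object is required
print(stamp_pandas.to_pydatetime())        # 2020-10-10 00:00:00
```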

The pandas [date_range](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html) function allows us to generate a range of dates according to a specified frequency.

dates = pd.date_range('1/Oct/2020', '5/Oct/2020')
print(dates)

DatetimeIndex(['2020-10-01', '2020-10-02', '2020-10-03', '2020-10-04',
               '2020-10-05'],
              dtype='datetime64[ns]', freq='D')

To set the frequency, we use the freq option. For example, for a business daily frequency:

dates = pd.date_range(start='1/Oct/2020', periods=5, freq='B')
print(dates)

DatetimeIndex(['2020-10-01', '2020-10-02', '2020-10-05', '2020-10-06',
               '2020-10-07'],
              dtype='datetime64[ns]', freq='B')

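Monthly (month-end) frequency. Note that pandas 2.2 and later prefer the alias 'ME' here and issue a deprecation warning for 'M':

dates = pd.date_range(start='1/Oct/2020', periods=5, freq='M')
print(dates)

DatetimeIndex(['2020-10-31', '2020-11-30', '2020-12-31', '2021-01-31',
               '2021-02-28'],
              dtype='datetime64[ns]', freq='M')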

The frequency option also accepts multiples:

dates = pd.date_range(start='1/Oct/2020', periods=5, freq='3M')
print(dates)

DatetimeIndex(['2020-10-31', '2021-01-31', '2021-04-30', '2021-07-31',
               '2021-10-31'],
              dtype='datetime64[ns]', freq='3M')

Refer to the pandas documentation for a list of available frequencies.
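As a further sketch, two other frequency aliases are shown below: '2D' (every two days) and 'W' (weekly, which anchors on Sundays by default). The start date is simply reused from the examples above.

```python
import pandas as pd

every_two_days = pd.date_range(start='2020-10-01', periods=3, freq='2D')

# 1 October 2020 is a Thursday, so the weekly range starts on the following Sunday
weekly = pd.date_range(start='2020-10-01', periods=3, freq='W')

print(list(every_two_days.strftime('%Y-%m-%d')))  # ['2020-10-01', '2020-10-03', '2020-10-05']
print(list(weekly.strftime('%Y-%m-%d')))          # ['2020-10-04', '2020-10-11', '2020-10-18']
```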

A related object is the Period, which represents a timespan such as a month or a quarter, rather than a point in time as in Timestamp. For example, the variable below represents the period between 1/10/2020 and 31/10/2020.

month = pd.Period('Oct-2020', freq='M')
month

Period('2020-10', 'M')

As with datetime and Timestamp objects, we can perform operations with a Period object.

month + 1 # from above

Period('2020-11', 'M')
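To see the timespan that a Period covers, we can inspect its start_time and end_time attributes. The sketch below recreates the October 2020 period so that it runs on its own.

```python
import pandas as pd

month = pd.Period('2020-10', freq='M')

# A Period has a start and an end, unlike a single point in time
print(month.start_time)       # 2020-10-01 00:00:00
print(month.end_time.date())  # 2020-10-31

# Adding an integer moves forward by that many periods
print(month + 1)              # 2020-11
```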

The counterpart of the DatetimeIndex is the PeriodIndex.

values = ['2020Q1', '2020Q2', '2020Q3', '2020Q4']
index = pd.PeriodIndex(values, freq='Q')
index

PeriodIndex(['2020Q1', '2020Q2', '2020Q3', '2020Q4'], dtype='period[Q-DEC]', freq='Q-DEC')

To generate a period range:

pd.period_range('2020Q1', '2020Q4', freq='Q')

PeriodIndex(['2020Q1', '2020Q2', '2020Q3', '2020Q4'], dtype='period[Q-DEC]', freq='Q-DEC')

Finally, we can convert timestamps to periods as follows.

dates = pd.date_range(start='1/Oct/2020', periods=5, freq='M')
print(dates)
dates.to_period()

DatetimeIndex(['2020-10-31', '2020-11-30', '2020-12-31', '2021-01-31',
               '2021-02-28'],
              dtype='datetime64[ns]', freq='M')

PeriodIndex(['2020-10', '2020-11', '2020-12', '2021-01', '2021-02'], dtype='period[M]', freq='M')
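The conversion also works in the other direction. As a sketch, to_timestamp turns periods back into timestamps, using the start of each period by default:

```python
import pandas as pd

periods = pd.period_range('2020-10', periods=3, freq='M')

# how='start' is the default, so each month maps to its first day
starts = periods.to_timestamp()
print(list(starts.strftime('%Y-%m-%d')))  # ['2020-10-01', '2020-11-01', '2020-12-01']

# Converting back to monthly periods recovers the original index
print(starts.to_period('M').equals(periods))  # True
```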

Reading time stamped data
Let us now work with data. For simplicity, our data will have only one column apart from the date. The same principles apply when working with data frames instead of a single series.

The nswretail.csv file contains monthly retail turnover figures for the state of NSW. I downloaded the data from the Australian Bureau of Statistics website. The ABS explanatory notes define retail turnover as:

Retail sales; wholesale sales; takings from repairs, meals and hiring of goods (except for rent, leasing and hiring of land and buildings);
commissions from agency activity (e.g. commissions received from collecting dry cleaning, selling lottery tickets, etc.); and
the goods and services tax.

To read the data, we follow the usual procedure. If you open the data file in a text editor, you will see that it has two columns: Month and Turnover. In a time series context, we want to make the date the index of the DataFrame by specifying Month as the index via the index_col option. We set the parse_dates option to True so that pandas can automatically recognise the dates in the index column and convert them to Timestamp objects.

ts = pd.read_csv('nswretail.csv', index_col='Month', parse_dates=True)
ts.tail() # tail gives the last 5 observations in the data

2017-02-01 7298.9
2017-03-01 8085.8
2017-04-01 7883.7
2017-05-01 8132.0
2017-06-01 8130.1

ts.index

DatetimeIndex(['2006-01-01', '2006-02-01', '2006-03-01', '2006-04-01',
               '2006-05-01', '2006-06-01', '2006-07-01', '2006-08-01',
               '2006-09-01', '2006-10-01',
               ...
               '2016-09-01', '2016-10-01', '2016-11-01', '2016-12-01',
               '2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01',
               '2017-05-01', '2017-06-01'],
              dtype='datetime64[ns]', name='Month', length=138, freq=None)

We can see that pandas converted (say) “Jun-2017” in the text file to a full date, which by default is the first day of the month. Since we know these figures refer to the whole month, we want to convert the indexes from timestamps to periods:

ts.index = ts.index.to_period()
ts.tail()

2017-02 7298.9
2017-03 8085.8
2017-04 7883.7
2017-05 8132.0
2017-06 8130.1
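The reading and conversion steps can be tried without the data file. The sketch below is an assumption-laden stand-in: it builds a five-row CSV in memory from the values printed above, parses the "Jun-2017" style labels with an explicit format (an alternative to parse_dates=True that does not rely on format inference), and then converts the index to monthly periods.

```python
import io
import pandas as pd

# In-memory stand-in for nswretail.csv, using the last five rows shown above
csv_text = """Month,Turnover
Feb-2017,7298.9
Mar-2017,8085.8
Apr-2017,7883.7
May-2017,8132.0
Jun-2017,8130.1
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col='Month')

# Parse the month labels explicitly, then switch from timestamps to monthly periods
sample.index = pd.to_datetime(sample.index, format='%b-%Y')
sample.index = sample.index.to_period('M')

print(sample.index[0], sample.index[-1])  # 2017-02 2017-06
```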

Subsetting a time series
Selecting part of the time series works in an intuitive way.

ts['Feb-2017':'Jun-2017']

2017-02 7298.9
2017-03 8085.8
2017-04 7883.7
2017-05 8132.0
2017-06 8130.1

ts.loc['2017Q2'] # .loc selects rows; recent pandas versions treat ts['2017Q2'] as a column lookup

2017-04 7883.7
2017-05 8132.0
2017-06 8130.1

ts['Feb2017':]

2017-02 7298.9
2017-03 8085.8
2017-04 7883.7
2017-05 8132.0
2017-06 8130.1
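The same kinds of selection can be written with .loc, which always refers to rows. The sketch below uses a small synthetic DataFrame (the numbers are placeholders, not the retail data) so that it runs without the data file.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data from 2016-10 to 2017-06; the values are placeholders
index = pd.period_range('2016-10', periods=9, freq='M')
demo = pd.DataFrame({'Turnover': np.arange(9.0)}, index=index)

year_2017 = demo.loc['2017']                # every month in 2017
feb_to_apr = demo.loc['2017-02':'2017-04']  # both endpoints are included
from_may = demo.loc['2017-05':]             # open-ended slice

print(len(year_2017), len(feb_to_apr), len(from_may))  # 6 3 2
```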

Time series plot
Once we load the time series, the first step of our analysis will always be to visualise the data. The simplest way to plot a time series is to use the pandas plot method as follows.

fig, ax = plt.subplots()
ts['Turnover'].plot(color='red', ax=ax) # draw on the axes created above
ax.set_xlabel('Year')
ax.set_ylabel('Turnover')
ax.set_title('Retail Turnover for NSW (2006-2017)')
sns.despine()
plt.show()

Using standard Matplotlib functions for plotting runs into problems with the horizontal axis labels, because Matplotlib does not recognise the period index. If we want a customised plot, we need to set the labels manually as below.

fig, ax = plt.subplots()
ax.plot(ts['Turnover'].values, color='red')

ax.set_xlim(0, len(ts))

indexes = np.arange(0, len(ts), 18) # we will place ticks every 18 months, starting with the first observation
ax.set_xticks(indexes)
ax.set_xticklabels([ts.index[i].strftime('%b-%y') for i in indexes], rotation=-90) # rotation is given as a number

indexes = np.arange(9, len(ts), 18) # minor ticks
ax.set_xticks(indexes, minor=True)

ax.set_xlabel('Year')
ax.set_ylabel('Turnover')
ax.set_title('Retail Turnover for NSW (2006-2017)')

plt.tight_layout()

sns.despine()
plt.show()
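Another option, sketched here as an alternative rather than the method used in this guide, is to convert the period index back to timestamps and let Matplotlib's date locators and formatters place the labels. The series below is synthetic placeholder data with the same length as the retail series, and the two-year tick spacing is an arbitrary choice.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd

# Synthetic monthly series with a PeriodIndex (placeholder values, not the real data)
index = pd.period_range('2006-01', periods=138, freq='M')
demo = pd.Series(np.linspace(5000, 8000, 138), index=index)

# Matplotlib understands timestamps, so convert the periods before plotting
x = demo.index.to_timestamp()

fig, ax = plt.subplots()
ax.plot(x, demo.to_numpy(), color='red')

# Place a tick every two years and format it like the manual labels above
ax.xaxis.set_major_locator(mdates.YearLocator(2))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b-%y'))

ax.set_xlabel('Year')
ax.set_ylabel('Turnover')
fig.tight_layout()
```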
