Elements Of Data Processing – Week 1¶
Getting Started with Jupyter Notebook¶
Jupyter notebook is an extremely useful tool for developing and presenting projects (particularly in Python). You can include code segments and view their output directly in your browser. You can also add rich text, visualisations, equations and more.
Cells¶
Jupyter notebook contains two main types of cells:
• Markdown cells: These can be used to contain text, equations and other non-code items. The cell that you’re reading right now is a markdown cell. You can use Markdown to format your text. If you prefer, you can also format your text using HTML. Clicking the Run button will format and display your text.
• Code cells: These contain code segments that can be executed individually. When executed, the output of the code will be displayed below the code cell. Click the Run button to execute a code segment. You can also run a code segment by pressing Ctrl + Enter.

Running Code¶
Try running the code segments below and verify that the output is correct.
In [1]:
message = "hello world"
print(message)

hello world
In [2]:
for i in range(5):
    print(str(i) + " squared is " + str(i*i))

0 squared is 0
1 squared is 1
2 squared is 4
3 squared is 9
4 squared is 16

Variables are retained between code segments. You can, for example, refer to the message variable created in the code segment above:
In [3]:
print("The COMP20008 team wishes to say: " + message)

The COMP20008 team wishes to say: hello world

Try adding your own code cell below and use it to print a different message.

Errors¶
If your code contains any errors, the error message will be displayed underneath the code segment once it’s run. This helps you identify the problem and debug the code. Try fixing the code below:
In [4]:
# Broken: the text is not quoted
#print(Welcome to COMP20008)
print('Welcome to COMP20008')

# Broken: Print is not print, and the apostrophes end the string early
#Print('We're glad you've chosen this subject')
print("We're glad you've chosen this subject")
print('We\'re glad you\'ve chosen this subject')

# Broken: the colon after the condition is missing
#students=30
#if students>25
#    print('This is a big class!')
students = 30
if students > 25:
    print('This is a big class!')

Welcome to COMP20008
We’re glad you’ve chosen this subject
We’re glad you’ve chosen this subject
This is a big class!
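The broken lines above fail for different reasons, and the error name Python reports is usually the quickest clue. The sketch below is only an illustration: the three snippets are simplified stand-ins for the commented-out lines (not the exact originals), and compile() and exec() are used here just so that all three errors can be caught and listed in one cell.

```python
# Simplified versions of the broken lines above (assumed for illustration)
broken_snippets = [
    "print(Welcome to COMP20008)",          # text is not quoted
    "Print('Welcome to COMP20008')",        # names are case-sensitive
    "if 30 > 25\n    print('big class')",   # colon missing after the condition
]

error_names = []
for source in broken_snippets:
    try:
        # compile() reports syntax problems; exec() then runs the code
        exec(compile(source, "<cell>", "exec"))
    except Exception as err:
        error_names.append(type(err).__name__)

print(error_names)
```

A SyntaxError means Python could not even parse the line, while a NameError means the line parsed but referred to a name that does not exist.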

Exercise 1¶
Create a new code cell below this one. Write a Python program that will print the first $n$ numbers of the Fibonacci sequence in reverse order. Verify that it works for $n=10$.
In [5]:
n = 10
fib = [0, 1]
# each new term is the sum of the previous two
for i in range(n - 2):
    fib.append(fib[i] + fib[i + 1])
fib.reverse()
print(fib)

[34, 21, 13, 8, 5, 3, 2, 1, 1, 0]
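The solution above starts from a two-element list, so it assumes $n \geq 2$. One possible variation, sketched below, wraps the same idea in a function that also handles $n=0$ and $n=1$. The function name reversed_fibonacci is made up for this example.

```python
def reversed_fibonacci(n):
    """Return the first n Fibonacci numbers (starting 0, 1) in reverse order."""
    fib = []
    a, b = 0, 1
    for _ in range(n):
        fib.append(a)
        a, b = b, a + b   # advance to the next pair of terms
    return fib[::-1]      # reversed copy of the list

print(reversed_fibonacci(10))
print(reversed_fibonacci(1))
```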

Pandas¶
Libraries contain useful resources, such as classes and subroutines, that you can use in your programs.
Pandas is a library that contains high-level data structures and manipulation tools for faster analysis. As with most libraries, an API reference is available which details all of the functionality provided by pandas. This lab will focus on the two most important data structures provided by pandas, the Series and DataFrame.
It’s worth reading through the Intro to Data Structures article on the pandas website to familiarise yourself with these two data structures. There are also a number of step-by-step tutorials available online, such as this one by DataCamp, that are worth following.
In [6]:
import pandas as pd

Series¶
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

The basic method to create a Series:
– s = pd.Series(data, index=index)

Here, data can be different things, including:
– a list
– an array
– a dictionary
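As a rough sketch of the three options listed above, the cell below builds a small Series from each kind of input. The labels and values are invented for illustration. Note that a dictionary supplies its own index through its keys.

```python
import numpy as np
import pandas as pd

labels = ["x", "y", "z"]  # hypothetical index labels

from_list = pd.Series([10, 20, 30], index=labels)
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=labels)
from_dict = pd.Series({"x": 10, "y": 20, "z": 30})  # keys become the index

print(from_list)
print(from_array.dtype)
print(list(from_dict.index))
```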

Example 1 : Create a Basic Series Object¶
In [7]:
# series constructor with data as a list of integers

l = [4,3,-5,9,1,7]
s = pd.Series(l)
In [8]:
# the default indexing starts from zero
s.index
Out[8]:
RangeIndex(start=0, stop=6, step=1)
In [9]:
# retrieve the values of the series
s.values
Out[9]:
array([ 4, 3, -5, 9, 1, 7], dtype=int64)
In [10]:
# create your own index using lists
newIndex = ['a','b','c','d','e','f']
s.index = newIndex
In [11]:
# verify the index
s
Out[11]:
a 4
b 3
c -5
d 9
e 1
f 7
dtype: int64
In [12]:
# Creating a series from a python dict

Aus_Emission = {'1990':15.45288167, '2000':17.20060983, '2007':17.86526004,
                '2008':18.16087566, '2009':18.20018196, '2010':16.92095367,
                '2011':16.86260095, '2012':16.51938578, '2013':16.34730205}

co2_Emission = pd.Series(Aus_Emission)
In [13]:
# retrieve the values of the series
co2_Emission.values
Out[13]:
array([15.45288167, 17.20060983, 17.86526004, 18.16087566, 18.20018196,
16.92095367, 16.86260095, 16.51938578, 16.34730205])
In [14]:
# verify the series object
co2_Emission
Out[14]:
1990 15.452882
2000 17.200610
2007 17.865260
2008 18.160876
2009 18.200182
2010 16.920954
2011 16.862601
2012 16.519386
2013 16.347302
dtype: float64

Slicing¶
Slicing allows you to take part of a Series or DataFrame, in order to visualise it separately or perform more detailed analysis. You can select sections of list-like types (lists, tuples, NumPy arrays) by using various slice notations:
In [15]:
# slicing the series using a boolean array operation
co2_Emission[co2_Emission>16.0]
Out[15]:
2000 17.200610
2007 17.865260
2008 18.160876
2009 18.200182
2010 16.920954
2011 16.862601
2012 16.519386
2013 16.347302
dtype: float64
In [16]:
# slicing the series by index label (the end label '2000' is included)
co2_Emission[:'2000']
Out[16]:
1990 15.452882
2000 17.200610
dtype: float64
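The two cells above use a boolean mask and a label-based slice. The sketch below sets these beside a positional slice, using the explicit .loc and .iloc accessors, on a shortened copy of the series (the variable name co2 is only for this example). A label slice includes its end label, whereas a positional slice stops before its end position.

```python
import pandas as pd

# A few values from the series above, keyed by year strings
co2 = pd.Series({"1990": 15.45288167, "2000": 17.20060983,
                 "2007": 17.86526004, "2008": 18.16087566})

by_label = co2.loc[:"2000"]   # label slice: the end label "2000" is included
by_position = co2.iloc[:2]    # positional slice: stops before position 2
mask = co2 > 17.5             # boolean Series, one True/False per entry
filtered = co2[mask]          # keeps only the entries where mask is True

print(list(by_label.index))
print(list(by_position.index))
print(list(filtered.index))
```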
In [17]:
# double the values of the series object
doubled = co2_Emission*2
doubled
Out[17]:
1990 30.905763
2000 34.401220
2007 35.730520
2008 36.321751
2009 36.400364
2010 33.841907
2011 33.725202
2012 33.038772
2013 32.694604
dtype: float64
In [18]:
# finding the average value of the series
co2_Emission.mean()
Out[18]:
17.05889462333333
In [19]:
# defining the column name
co2_Emission.name = 'CO2 Emission'
In [20]:
# defining the name of the index
co2_Emission.index.name = 'Year'
In [21]:
# verify the series object
co2_Emission
Out[21]:
Year
1990 15.452882
2000 17.200610
2007 17.865260
2008 18.160876
2009 18.200182
2010 16.920954
2011 16.862601
2012 16.519386
2013 16.347302
Name: CO2 Emission, dtype: float64
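The Series name and index name matter most when the Series is converted to a DataFrame, where they become column and index labels. The cell below is a minimal sketch of that on a shortened copy of the series. The variable names co2, as_frame and as_columns are assumptions made for this example.

```python
import pandas as pd

co2 = pd.Series({"2011": 16.86260095, "2012": 16.51938578, "2013": 16.34730205})
co2.name = "CO2 Emission"
co2.index.name = "Year"

as_frame = co2.to_frame()        # index stays as the row labels
as_columns = co2.reset_index()   # index becomes an ordinary column

print(list(as_frame.columns))
print(as_frame.index.name)
print(list(as_columns.columns))
```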

Exercise 2¶
Pandas Series objects have both ndarray-like and dict-like properties. Given the co2_Emission series object do the following:
• As with the average above, retrieve the maximum, median and cumulative sum of the CO2 emissions between 1990 and 2013 (using the max(), median() and cumsum() methods).
• Retrieve the CO2 emissions in Australia between 2000 and 2010.
• Given that the population of Australia in 2013 was 23117353, retrieve the CO2 emission per capita for that year.
In [22]:
# max
co2_Emission.max()
Out[22]:
18.20018196
In [23]:
#median
co2_Emission.median()
Out[23]:
16.92095367
In [24]:
# cumulative sum
print("Hello: \n" + str(co2_Emission.cumsum()))

Hello:
Year
1990 15.452882
2000 32.653492
2007 50.518752
2008 68.679627
2009 86.879809
2010 103.800763
2011 120.663364
2012 137.182750
2013 153.530052
Name: CO2 Emission, dtype: float64
In [25]:
# co2 emission between 2000 and 2010
co2_Emission['2000':'2010']
Out[25]:
Year
2000 17.200610
2007 17.865260
2008 18.160876
2009 18.200182
2010 16.920954
Name: CO2 Emission, dtype: float64
In [26]:
# computing the co2 emission per capita
p = 23117353
co2_Emission['2013']/p
Out[26]:
7.071441981268357e-07
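Exercise 2 mentions that a Series has both ndarray-like and dict-like properties. The sketch below illustrates each on a shortened copy of the series. The variable names are assumptions made for this example, and the key "1800" is chosen only because it is not in the index.

```python
import numpy as np
import pandas as pd

co2 = pd.Series({"2011": 16.86260095, "2012": 16.51938578, "2013": 16.34730205})

# ndarray-like: NumPy functions and element-wise arithmetic apply to every value
rounded = np.round(co2, 1)
change = co2 - co2.iloc[0]   # difference from the first entry

# dict-like: test membership by key and fall back safely with get()
has_2012 = "2012" in co2
missing = co2.get("1800", default=None)

print(rounded.tolist())
print(change.tolist())
print(has_2012, missing)
```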

Recommended Reading:¶
This article on Dataquest is an excellent introduction to Jupyter notebook. If you haven’t used Jupyter notebook before, I recommend familiarising yourself with it.

Discussion questions¶
• What is data science to you?
• What makes it interesting?
• What is meant by “Big Data”? What are its characteristics?
• It has been claimed that data wrangling takes 80% of the time and everything else the remaining 20%. How can this be true? What specific activities make wrangling so time-consuming?