2021S2-workshop-week3-lab-Solution
Elements Of Data Processing (2021S2) – Week 3
Visualization with Python
In these exercises you will:
learn how to visualize a set of data using a Python library called matplotlib.
explore different forms of visualization, such as bar charts, histograms, scatter plots, and boxplots.
You will be able to transform a set of data into an appropriate visualization form.
matplotlib is a Python 2D plotting library that enables you to produce figures and charts, either on screen or in an image file.
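As a minimal sketch of the "image file" route, the following draws a small line plot and writes it to a PNG with savefig. The data values, the file name example_plot.png, and the use of the temporary directory are assumptions made for illustration. The Agg backend is selected only so that the sketch runs without a display; inside a notebook that line can be dropped.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

# Hypothetical values, used only to have something to draw
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Write the figure to an image file; in a notebook, plt.show() would display it instead
out_path = os.path.join(tempfile.gettempdir(), "example_plot.png")
fig.savefig(out_path, dpi=100)
plt.close(fig)
```

The same figure object can be both shown and saved, so the choice between screen and file does not change how the plot is built.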
The following example demonstrates a simple plot of the top 100 emissions in 2010, using the emissions dataset seen in previous labs.
In [16]:
# create a new DataFrame for the CO2 emission from a csv file
import pandas as pd
emission = pd.read_csv('data/emission.csv',encoding = 'ISO-8859-1')
yr2010 = emission['2010']
names = emission['Country']
yr2010.index = names
yr2010_sorted = yr2010.sort_values(ascending = False)
top100_yr2010 = yr2010_sorted[0:100]
In [17]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.boxplot(top100_yr2010) # a boxplot of the top 100 emissions in year 2010
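A box plot condenses a sample into a handful of summary values. As a rough sketch of what the plot above encodes, the following computes the same kind of summary with NumPy. The sample values are made up for illustration and are not taken from the emissions data; the 1.5 × IQR rule used here is matplotlib's default whisker setting.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
values = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 20.0])

# The box spans the first to third quartile; the line inside it is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

# Observations outside the fences are drawn individually ("fliers" in matplotlib's terms)
fliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, fliers)
```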
Scatter plot
A scatter plot is often used to display the relationship between two variables (plotted as x-y pairs). In this scatter plot example, we use the famous Iris data set. The data is available here. This data set provides measurements of various parts of three types of Iris flower (Iris setosa, Iris versicolor, and Iris virginica). For each type, there are 50 measurements, or samples. Each data row in the CSV file contains (1) petal width, (2) sepal width, (3) petal length, (4) sepal length, and (5) the type of Iris flower.
The following code generates the scatter plot between petal length and petal width of the three Iris types.
In [18]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
iris=pd.read_csv('data/iris.csv',encoding = 'ISO-8859-1')
setosa=iris.loc[iris['Name']=='Iris-setosa']
versicolor=iris.loc[iris['Name']=='Iris-versicolor']
virginica=iris.loc[iris['Name']=='Iris-virginica']
plt.scatter(setosa.iloc[:,2],setosa.iloc[:,0],color='green')
plt.scatter(versicolor.iloc[:,2],versicolor.iloc[:,0],color='red')
plt.scatter(virginica.iloc[:,2],virginica.iloc[:,0],color='blue')
plt.xlim(0.5,7.5)
plt.ylim(0,3)
plt.ylabel("petal width")
plt.xlabel("petal length")
plt.grid(True)
From the scatter plot, we may be able to suggest a particular type of relationship or the formation of clusters. In the example above you may notice that, for Iris versicolor, the samples with longer petals tend to have wider petals. You can also see clearly that the three Iris types form distinct clusters. As such, the measurements of petal and sepal can help identify the type of Iris flower. This example demonstrates how botanists may identify a certain species from phenotype characteristics.
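The two visual impressions described above (a positive relationship within a type, and separation between types) can also be checked numerically. The following is only a sketch: the rows, the column names petal_length and petal_width, and the group labels are made up for illustration and are not read from data/iris.csv.

```python
import pandas as pd

# Made-up measurements for two hypothetical groups
toy = pd.DataFrame({
    "Name": ["species A"] * 4 + ["species B"] * 4,
    "petal_length": [1.3, 1.4, 1.5, 1.7, 4.0, 4.5, 4.7, 5.1],
    "petal_width": [0.2, 0.2, 0.3, 0.4, 1.2, 1.4, 1.5, 1.8],
})

# Pearson correlation within each group: a value near +1 means
# longer petals go together with wider petals
per_group_corr = {
    name: group["petal_length"].corr(group["petal_width"])
    for name, group in toy.groupby("Name")
}

# Group means: well-separated means are what shows up as clusters in the scatter plot
group_means = toy.groupby("Name")[["petal_length", "petal_width"]].mean()

print(per_group_corr)
print(group_means)
```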
Exercise 1
Modify the example above to generate the scatter plot of sepal length and petal width.
In [ ]:
##answer here
#(1) petal width, (2) sepal width, (3) petal length, (4) sepal length, and (5) the type of Iris flower
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
iris=pd.read_csv('data/iris.csv',encoding = 'ISO-8859-1')
setosa=iris.loc[iris['Name']=='Iris-setosa']
versicolor=iris.loc[iris['Name']=='Iris-versicolor']
virginica=iris.loc[iris['Name']=='Iris-virginica']
# index 2 is petal length, replace with index 3 for sepal length
plt.scatter(setosa.iloc[:,3],setosa.iloc[:,0],color='green')
plt.scatter(versicolor.iloc[:,3],versicolor.iloc[:,0],color='red')
plt.scatter(virginica.iloc[:,3],virginica.iloc[:,0],color='blue')
# sepal lengths are larger than petal lengths (roughly 4.3 to 7.9 in the standard Iris data), so shift the x-range
plt.xlim(4,8.5)
plt.ylim(0,3)
plt.ylabel("petal width")
# change x label to sepal length
plt.xlabel("sepal length")
plt.grid(True)
Bar chart
The bar chart is probably the most common type of chart. It displays one or more properties of a set of different entities. Bar charts are typically used to compare, or to show contrast between, different entities. For example, the bar chart below displays the GNP per capita of the three poorest and the three richest countries in the world (based on 2004 GNP per capita):
In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
from numpy import arange
countries = ['Burundi','Ethiopia','Rep of Congo','Switzerland','Norway','Luxembourg']
gnp = [90,110,110,49600,51810,56380] # GNP per capita (2004)
plt.bar(arange(len(gnp)),gnp)
plt.xticks( arange(len(countries)),countries, rotation=30)
plt.show()
Exercise 2
Modify the bar chart example to plot the average maximum temperature in all major Australian cities. The data is available here.
In [7]:
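tmp = pd.read_csv('data/max_temp.csv',encoding = 'ISO-8859-1')
# the first column holds the city names
city = tmp.iloc[:,0]
# average across the remaining columns to get one value per city
city_avg_tmp = tmp.iloc[:,1:].mean(axis=1)
city_avg_tmp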
In [8]:
%matplotlib inline
import matplotlib.pyplot as plt
from numpy import arange
plt.bar(arange(len(city)),city_avg_tmp)
plt.xticks( arange(len(city)),city, rotation=30)
plt.show()
In a clustered bar chart, you can display a few measurements from the entities of interest. For example, the clustered bar chart below simultaneously shows the number of births and deaths in four countries of interest. The number of births is displayed as the blue-colored bar and the number of deaths as the red-colored bar:
In [9]:
%matplotlib inline
import matplotlib.pyplot as plt
from numpy import arange
countries = ['Afghanistan', 'Albania', 'Algeria', 'Angola']
births = [1143717, 53367, 598519, 498887]
deaths = [529623, 16474, 144694, 285380]
plt.bar(arange(len(births))-0.3, births, width=0.3)
plt.bar(arange(len(deaths)),deaths, width=0.3,color='r')
plt.xticks(arange(len(countries)),countries, rotation=30)
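One possible variation on the clustered bar chart above is sketched here: each pair of bars is centred on its tick by shifting the two series by half a bar width in opposite directions, and a legend records which colour is which. The figure/axes style and the variable names x and width are choices made for this sketch; the data are the same as in the example above. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

countries = ['Afghanistan', 'Albania', 'Algeria', 'Angola']
births = [1143717, 53367, 598519, 498887]
deaths = [529623, 16474, 144694, 285380]

x = np.arange(len(countries))  # one tick position per country
width = 0.3                    # width of a single bar

fig, ax = plt.subplots()
# Shift each series by half a bar width so the pair straddles the tick
birth_bars = ax.bar(x - width / 2, births, width=width, color='b', label='births')
death_bars = ax.bar(x + width / 2, deaths, width=width, color='r', label='deaths')

ax.set_xticks(x)
ax.set_xticklabels(countries, rotation=30)
ax.legend()
```

With more than two series, the same idea extends by spreading the offsets evenly around zero.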
Histogram
A histogram displays the distribution of a set of samples (typically a large set of data, such as the pixel values of digital images or the ages of a population). The following example creates a histogram of ages from a small number of samples (assume these are the ages of your classmates).
In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
ages = [17,18,18,19,21,19,19,21,20,23,19,22,20,21,19,19,14,23,16,17]
plt.hist(ages, bins=10)
plt.grid(which='major', axis='y')
plt.show()
Exercise 3
Change the number of bins in the previous example to 20.
Plot the histogram.
In [11]:
#Answer 3
%matplotlib inline
import matplotlib.pyplot as plt
ages = [17,18,18,19,21,19,19,21,20,23,19,22,20,21,19,19,14,23,16,17]
plt.hist(ages, bins=20)
plt.grid(which='major', axis='y')
plt.show()
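To see what changing bins does to the underlying counts, the binning can be reproduced without drawing anything. This sketch uses numpy.histogram, which applies the same equal-width binning rule as plt.hist when bins is an integer; the variable names are chosen for this sketch only.

```python
import numpy as np

ages = [17,18,18,19,21,19,19,21,20,23,19,22,20,21,19,19,14,23,16,17]

# Equal-width bins between min(ages) and max(ages)
counts_10, edges_10 = np.histogram(ages, bins=10)
counts_20, edges_20 = np.histogram(ages, bins=20)

# Width of a single bin in each case
width_10 = edges_10[1] - edges_10[0]
width_20 = edges_20[1] - edges_20[0]

print(width_10, counts_10)
print(width_20, counts_20)
```

With 20 bins each bin is narrower than one year, so many bins are necessarily empty for integer ages; this is why the second histogram looks sparser.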
Parallel co-ordinates
Parallel co-ordinates is another method for data visualisation. Each data instance is represented by a line and each feature by a vertical axis. Similar objects can be identified by the similarity of their lines. Correlations between (adjacent) features can also be identified.
The following dataset “Auto MPG” (this file) is a classic dataset providing details about different models of cars from the 1970s and 1980s. It uses features such as number of cylinders, horsepower, weight, …, miles per gallon.
Explain the logic of the code below, in particular why the features are normalised and why the lines are not coloured.
In [12]:
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
data=pd.read_csv('data/mpg.csv',encoding = 'ISO-8859-1')
## normalise each feature to the range [0, 1] so that the axes are comparable
data['mpg'] = (data['mpg']-data['mpg'].min())/(data['mpg'].max()-data['mpg'].min())
data['weight'] = (data['weight']-data['weight'].min())/(data['weight'].max()-data['weight'].min())
data['cylinders'] = (data['cylinders']-data['cylinders'].min())/(data['cylinders'].max()-data['cylinders'].min())
data['horsepower'] = (data['horsepower']-data['horsepower'].min())/(data['horsepower'].max()-data['horsepower'].min())
data['model_year'] = (data['model_year']-data['model_year'].min())/(data['model_year'].max()-data['model_year'].min())
### Set 'name' to be empty since it is a string. 'name' is the class feature used to color the objects, but for this
## case we just want all objects to be the same colour, hence we make it empty. More generally, one can use a categorical
## feature to determine the line colors.
data['name']=''
### plot in parallel co-ordinates
# a document showing the parallel-coordinates API is at
# https://groups.google.com/forum/#!topic/glue-viz/5-ljzYj4Qnc
parallel_coordinates(data[['mpg','cylinders','horsepower','weight','model_year','name']],'name')
plt.show()
No handles with labels found to put in legend.
Consider the parallel co-ordinates plot above. What insights can you obtain from this plot? To make it easier to visualise, you may like to display fewer car models (objects) by altering the code above and using pandas.DataFrame.sample(…)
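The normalisation and sub-sampling steps can be sketched as follows. The helper name min_max_normalise and the small cars frame are hypothetical and used only for illustration (the rows are made up, not read from data/mpg.csv); random_state is fixed so that the same rows are drawn on every run.

```python
import pandas as pd

def min_max_normalise(frame, columns):
    """Return a copy of frame with each listed column rescaled to [0, 1].

    Hypothetical helper; assumes every listed column is numeric and not constant
    (a constant column would cause a division by zero).
    """
    result = frame.copy()
    for col in columns:
        col_min = result[col].min()
        col_max = result[col].max()
        result[col] = (result[col] - col_min) / (col_max - col_min)
    return result

# Made-up rows for illustration only
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 26.0, 32.0, 24.0, 36.0],
    "weight": [3504, 3693, 2130, 1985, 2600, 1800],
    "cylinders": [8, 8, 4, 4, 6, 4],
})
features = ["mpg", "weight", "cylinders"]

# Normalise first (using the full data range), then keep a random subset of rows
normalised = min_max_normalise(cars, features)
subset = normalised.sample(n=3, random_state=0)

print(subset)
```

Normalising before sampling keeps the axis scale tied to the whole dataset, so different samples remain comparable with each other.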
Exercise 4
Select car models with years in the range 1980-1982 and make them green in the parallel co-ordinates plot. Colour all other car models red. This technique is called “brushing”, since it is used to make a particular subset of the objects stand out. What do you notice?
In [14]:
###Exercise 4 answer
import pandas as pd
import matplotlib.pyplot as plt
#from pandas.tools.plotting import parallel_coordinates
from pandas.plotting import parallel_coordinates
data=pd.read_csv(‘data/mpg.csv’,encoding = ‘ISO-8859-1’)
data['selected']=['1980-1982' if ((x>=80) and (x<=82)) else 'Other Years' for x in data['model_year']]
#for x in data['model_year']:
# if ((x>=80) and (x<=82)):
# data['selected'] = '1980-1982'
# else:
# data['selected'] = 'Other Years'
###Normalise features between 0 and 1, to ensure comparability of axes
#data['name']=''
data['mpg'] = (data['mpg']-data['mpg'].min())/(data['mpg'].max()-data['mpg'].min())
data['weight'] = (data['weight']-data['weight'].min())/(data['weight'].max()-data['weight'].min())
data['cylinders'] = (data['cylinders']-data['cylinders'].min())/(data['cylinders'].max()-data['cylinders'].min())
data['horsepower'] = (data['horsepower']-data['horsepower'].min())/(data['horsepower'].max()-data['horsepower'].min())
data['model_year'] = (data['model_year']-data['model_year'].min())/(data['model_year'].max()-data['model_year'].min())
###plot in parallel co-ordinates
parallel_coordinates(data[['mpg','cylinders','horsepower','weight','model_year','selected']],'selected',color=["r","g"])
plt.show()
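The answer above relies on the colour list lining up with the class labels. As a hedged sketch, assuming pandas matches the colours to the class labels in the order in which the labels first appear in the frame, sorting the rows so that 'Other Years' comes first makes the red/green assignment independent of the row order in the file. It also means the brushed lines are drawn last, on top of the others. The cars frame below is made up for illustration and already scaled to [0, 1]; the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Made-up, already normalised rows for illustration only
cars = pd.DataFrame({
    "mpg": [0.9, 0.2, 0.8, 0.1, 0.5],
    "weight": [0.1, 0.9, 0.2, 1.0, 0.5],
    "model_year": [81, 70, 82, 73, 76],
})

# Label each row; between() includes both end points
in_range = cars["model_year"].between(80, 82)
cars["selected"] = np.where(in_range, "1980-1982", "Other Years")

# Descending sort on the label puts 'Other Years' before '1980-1982'
ordered = cars.sort_values("selected", ascending=False, kind="mergesort")

fig, ax = plt.subplots()
# Assumption: the first colour goes to the first label encountered ('Other Years')
parallel_coordinates(ordered[["mpg", "weight", "selected"]], "selected",
                     color=["r", "g"], ax=ax)
```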
Heat Map Example
In [15]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
iris= pd.read_csv('data/iris.csv',dtype=None) ###read in data
iris=iris.set_index('Name')
sns.heatmap(iris,cmap='viridis',xticklabels=True)
plt.show()