COMP2420/COMP6420 - Introduction to Data Management, Analysis and Security

Advanced Visualisation

COMP2420/COMP6420 – Introduction to Data Management, Analysis and Security
Lecture – Advanced Visualisation

# Author – Fazil T (https://www.kaggle.com/fazilbtopal)

Table of Contents¶

1. [Exploring Datasets with *pandas*](#1)

2. [Matplotlib: Standard Python Visualization Library](#2)

3. [Seaborn](#3)

4. [Line Plots](#4)

5. [Histograms](#5)

6. [Bar Charts](#6)

7. [Pie Charts](#7)

8. [Box Plots](#8)

9. [Scatter Plots](#9)

10. [Bubble Plots](#10)

Exploring Dataset ¶
The Dataset: Immigration to Canada from 1980 to 2013¶
The dataset contains annual data on the flows of international immigrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals. The current version presents data pertaining to 45 countries.

import numpy as np
import pandas as pd

df = pd.read_excel('data/Canada.xlsx', sheet_name='Canada by Citizenship', skiprows=range(20), skipfooter=2)

When analyzing a dataset, it’s always a good idea to start by getting basic information about your dataframe. We can do this by using the info() method.

Let’s clean the data set to remove a few unnecessary columns. Then we rename the columns so that they make sense.

# in pandas axis=0 represents rows (default) and axis=1 represents columns.
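As a minimal sketch of the cleaning step described above, the snippet below drops a column and renames two others on a small stand-in dataframe. The column names `Type`, `OdName` and `AreaName` and all values are assumptions chosen for illustration; they are not taken from these notes.

```python
import pandas as pd

# Toy frame standing in for the raw sheet; names and values are made up.
raw = pd.DataFrame({
    "Type": ["Immigrants", "Immigrants"],
    "OdName": ["Haiti", "Iceland"],
    "AreaName": ["Latin America and the Caribbean", "Europe"],
    1980: [10, 1],
    1981: [20, 2],
})

# axis=1 tells drop() to look for the label among the columns, not the rows.
cleaned = raw.drop(["Type"], axis=1)

# rename() maps old column names to clearer ones.
cleaned = cleaned.rename(columns={"OdName": "Country", "AreaName": "Continent"})

print(cleaned.columns.tolist())
```

Both calls return a new dataframe here, which keeps the original `raw` frame available for comparison.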

We will also add a ‘Total’ column that sums up the total immigrants by country over the entire period 1980 – 2013. We can then check how many null objects we have in the dataset.

Finally, let’s view a quick summary of each column in our dataframe using the describe() method.

Column names that are integers (such as the years) might introduce some confusion. For example, when we reference the year 2013, one might confuse it with the 2013th positional index.

To avoid this ambiguity, let’s convert the column names into strings: ‘1980’ to ‘2013’.
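The steps above (adding a total, counting nulls, and converting the column labels to strings) could look roughly like this. The dataframe `toy` and its numbers are placeholders, not real immigration figures, and the `years` helper list is an assumed convenience.

```python
import pandas as pd

# Toy data; the numbers are placeholders, not real immigration figures.
toy = pd.DataFrame(
    {1980: [10, 1], 1981: [20, 2], 1982: [30, 3]},
    index=["Haiti", "Iceland"],
)

# Sum across the year columns (axis=1) to get one total per country.
toy["Total"] = toy.sum(axis=1)

# Count missing values per column.
null_counts = toy.isnull().sum()

# Convert every column label to a string so that 1980 becomes '1980'.
toy.columns = list(map(str, toy.columns))

# A list of year labels is handy for selecting only the year columns later.
years = list(map(str, range(1980, 1983)))

print(toy.loc["Haiti", years].tolist())
print(toy.loc["Haiti", "Total"])
```

Selecting with `years` leaves out the ‘Total’ column, which is useful whenever only the yearly values should be plotted.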

Matplotlib: Standard Python Visualization Library¶
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

If you aspire to create impactful visualizations with Python, Matplotlib is an essential tool to have at your disposal.

Matplotlib.Pyplot¶
One of the core aspects of Matplotlib is matplotlib.pyplot. It is Matplotlib’s scripting layer. It is a collection of command style functions that make Matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. We will work both with the scripting and artist layer.

Two types of plotting¶
There are two styles of plotting with Matplotlib: plotting using the artist layer and plotting using the scripting layer.

Option 1: Scripting layer (procedural method) – using matplotlib.pyplot as ‘plt’

You can use plt, i.e. matplotlib.pyplot, and add more elements by calling different functions procedurally; for example, plt.title(…) to add a title or plt.xlabel(…) to add a label to the x-axis.

Option 2: Artist layer (Object oriented method) – using an Axes instance from Matplotlib (preferred)

You can take the Axes instance of your current plot and store it in a variable (e.g. ax). You can add more elements by calling methods with a small change in syntax (adding “set_” to the previous function names). For example, use ax.set_title() instead of plt.title() to add a title, or ax.set_xlabel() instead of plt.xlabel() to add a label to the x-axis.

This option sometimes is more transparent and flexible to use for advanced plots (in particular when having multiple plots).
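To make the difference between the two options concrete, here is a small sketch that draws the same line twice, once through the scripting layer and once through the artist layer. The data values are arbitrary, and the `Agg` backend is an assumption made so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [2, 4, 8]  # arbitrary values

# Option 1: scripting layer, every call acts on the "current" figure and axes.
plt.figure()
plt.plot(x, y)
plt.title("Scripting layer")
plt.xlabel("x")
plt.ylabel("y")
scripted_ax = plt.gca()  # grab the current axes only to inspect it afterwards

# Option 2: artist layer, keep an explicit Axes object and call set_* methods on it.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Artist layer")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

With the artist layer there is no hidden "current axes" state, which is why it tends to be easier to follow once a figure holds several plots.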

Often we might want to plot multiple plots within the same figure. For example, we might want to perform a side-by-side comparison of the box plot with the line plot of China and India’s immigration.

To visualize multiple plots together, we can create a figure (overall canvas) and divide it into subplots, each containing a plot. With subplots, we usually work with the artist layer instead of the scripting layer.

Typical syntax is :

fig = plt.figure() # create figure
ax = fig.add_subplot(nrows, ncols, plot_number) # create subplots

nrows and ncols are used to notionally split the figure into (nrows * ncols) sub-axes, and plot_number identifies the particular subplot that this function is to create within the notional grid. plot_number starts at 1, increments across rows first, and has a maximum of nrows * ncols.

When nrows, ncols, and plot_number are all less than 10, a convenience exists such that a 3-digit number can be given instead, where the hundreds represent nrows, the tens represent ncols and the units represent plot_number. For instance,

subplot(211) == subplot(2, 1, 1)

produces a subaxes in a figure which represents the top plot (i.e. the first) in a 2 rows by 1 column notional grid (no grid actually exists, but conceptually this is how the returned subplot has been positioned).
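The following sketch places two plots side by side using both forms of the call, the three-argument form and the 3-digit shorthand. The plotted values are arbitrary, and the `Agg` backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 4))  # overall canvas

ax_left = fig.add_subplot(1, 2, 1)  # 1 row, 2 columns, first plot
ax_right = fig.add_subplot(122)     # shorthand for (1, 2, 2)

# Each subplot is then drawn through its own Axes object (artist layer).
ax_left.plot([1, 2, 3], [1, 4, 9])
ax_left.set_title("Left")
ax_right.bar(["a", "b"], [3, 5])
ax_right.set_title("Right")

print(len(fig.axes))
```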

Plotting in pandas¶
Fortunately, pandas has a built-in implementation of Matplotlib that we can use. Plotting in pandas is as simple as appending a .plot() method to a series or dataframe.

Documentation:

Plotting with Series

Plotting with Dataframes

import matplotlib.pyplot as plt

# we are using the inline backend
%matplotlib inline

print(plt.style.available)
plt.style.use(['ggplot']) # for ggplot-like style

['seaborn-notebook', 'seaborn-bright', 'fast', 'seaborn-talk', 'seaborn-ticks', 'Solarize_Light2', 'seaborn-dark', 'fivethirtyeight', 'seaborn-pastel', '_classic_test', 'seaborn-white', 'seaborn-whitegrid', 'seaborn-deep', 'seaborn', 'seaborn-paper', 'tableau-colorblind10', 'seaborn-muted', 'seaborn-dark-palette', 'bmh', 'seaborn-poster', 'ggplot', 'seaborn-colorblind', 'dark_background', 'classic', 'seaborn-darkgrid', 'grayscale']

Seaborn¶
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

We will use both libraries as we go along and plot their output side by side.

import seaborn as sns

Line Plots (Series/Dataframe) ¶
What is a line plot and why use it?

A line chart or line plot is a type of plot which displays information as a series of data points called ‘markers’ connected by straight line segments. It is a basic type of chart common in many fields.
Use a line plot when you have a continuous data set. Line plots are best suited for trend-based visualizations of data over a period of time.

Let’s start with a case study:

In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life, and about three million people were affected by this natural disaster. As part of Canada’s humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. We can quickly visualize this effort using a line plot:

First, we will extract the data series for Haiti.

# passing in years 1980 – 2013 to exclude the ‘total’ column

Next, we will draw a line plot by appending .plot() to the haiti series.

pandas automatically populated the x-axis with the index values (years), and the y-axis with the column values (population). However, notice how the years were not displayed because they are of type string. Therefore, let’s change the type of the index values to integer for plotting.

Also, let’s add a title and label the x- and y-axes using plt.title(), plt.ylabel(), and plt.xlabel().

We can clearly see how the number of immigrants from Haiti spiked from 2010 onwards as Canada stepped up its efforts to accept refugees from Haiti. Let’s annotate this spike in the plot by using the plt.text() or ax.text() method.
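A minimal sketch of these two steps is given below. It rebuilds only the first five values of the haiti series that these notes print with haiti.head(), so the plot covers 1980 to 1984 rather than the full period. The title and label strings, and the `Agg` backend, are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# First five values of the haiti series; the index starts out as strings.
haiti = pd.Series(
    [1666, 3692, 3498, 2860, 1418],
    index=["1980", "1981", "1982", "1983", "1984"],
    name="Haiti",
)

# Convert the index from strings to integers so the years show up as tick labels.
haiti.index = haiti.index.map(int)

haiti.plot(kind="line")
plt.title("Immigration from Haiti")
plt.ylabel("Number of immigrants")
plt.xlabel("Years")
```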

Quick note on x and y values in plt.text(x, y, label):

Since the x-axis (years) is type ‘integer’, we specified x as a year. The y axis (number of immigrants) is type ‘integer’, so we can just specify the value y = 6000.

plt.text(2005, 6000, '2010 Earthquake') # years stored as type int

If the years were stored as type ‘string’, we would need to specify x as the index position of the year. E.g., index position 25 corresponds to the year 2005, since 2005 is 25 years after the base year 1980.

plt.text(25, 6000, '2010 Earthquake') # years stored as type str
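The two cases can be compared side by side in a short sketch. The series `trend` holds made-up values, one per year from 1980 to 2013, and the label text and y value are arbitrary. The `Agg` backend is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder series: one made-up value per year from 1980 to 2013.
trend = pd.Series(range(0, 3400, 100), index=range(1980, 2014))

# Integer index: x is given as a year.
ax_int = trend.plot(kind="line")
ax_int.text(2005, 3000, "label at year 2005")

# String index: x is given as a position along the index.
trend_str = trend.copy()
trend_str.index = trend_str.index.map(str)
position = trend_str.index.get_loc("2005")  # position of the label '2005'

plt.figure()  # start a second figure for the string-indexed plot
ax_str = trend_str.plot(kind="line")
ax_str.text(position, 3000, "label at year 2005")

print(position)  # 25
```

Looking the position up with `get_loc` avoids counting years by hand.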

We can easily add more countries to the line plot to make meaningful comparisons of immigration from different countries.

Let’s compare the number of immigrants from India and China from 1980 to 2013.

Step 1: Get the data set for China and India, and display dataframe.

Step 2: Plot graph. We will explicitly specify line plot by passing in kind parameter to plot().

data.plot(kind='line')

That won’t look right…

Recall that pandas plots the indices on the x-axis and the columns as individual lines on the y-axis. Since data is a dataframe with the country as the index and years as the columns, we must first transpose the dataframe.

pandas will automatically graph the two countries on the same graph. Go ahead and plot the new transposed dataframe. Make sure to add a title to the plot and label the axes.

From the above plot, we can observe that China and India have very similar immigration trends through the years.
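One possible way to do the transposition and the plot is sketched here. The dataframe mimics the layout of `data` (countries as rows, years as columns), but the numbers are made up and only three years are included. The title, labels and `Agg` backend are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import pandas as pd

# Made-up numbers; rows are countries and columns are years.
data = pd.DataFrame(
    {"1980": [5, 8], "1981": [6, 9], "1982": [7, 12]},
    index=["China", "India"],
)

# After transposing, the years form the index (x-axis) and each country is a column (one line).
transposed = data.transpose()
transposed.index = transposed.index.map(int)  # integer years for the tick labels

ax = transposed.plot(kind="line")
ax.set_title("Immigrants from China and India")
ax.set_xlabel("Years")
ax.set_ylabel("Number of immigrants")

print(transposed.shape)  # (3, 2)
```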

So how come we didn’t need to transpose Haiti’s dataframe before plotting?

That’s because haiti is a series as opposed to a dataframe, and has the years as its indices as shown below.

print(type(haiti))
print(haiti.head())

<class 'pandas.core.series.Series'>
1980    1666
1981    3692
1982    3498
1983    2860
1984    1418
Name: Haiti, dtype: int64

A line plot is a handy tool to display several dependent variables against one independent variable. However, it is recommended to draw no more than 5-10 lines on a single graph; any more than that and it becomes difficult to interpret.

Let’s compare the trends of the top 5 countries that contributed the most to immigration to Canada.

# Step 1: Get the dataset. We will sort on this column to get our top 5 countries
# using pandas sort_values() method
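A sketch of this step on placeholder data follows. The country labels A to F and their values are invented for illustration; only the use of sort_values() on the ‘Total’ column comes from the notes.

```python
import pandas as pd

# Made-up values for six placeholder countries.
toy = pd.DataFrame(
    {"1980": [1, 2, 3, 4, 5, 6], "1981": [2, 3, 4, 5, 6, 7]},
    index=["A", "B", "C", "D", "E", "F"],
)
toy["Total"] = toy[["1980", "1981"]].sum(axis=1)

# Sort by Total, largest first, and keep the first five rows.
top5 = toy.sort_values(by="Total", ascending=False).head(5)

# Drop Total and transpose so that each country becomes one line when plotted.
top5_by_year = top5[["1980", "1981"]].transpose()

print(top5.index.tolist())  # ['F', 'E', 'D', 'C', 'B']
```

Calling `top5_by_year.plot(kind='line')` would then draw one line per country, as in the China and India comparison.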

Other Plots¶
There are many other plotting styles available besides the default line plot, all of which can be accessed by passing the kind keyword to plot(). The full list of available plots is as follows:

bar for vertical bar plots
barh for horizontal bar plots
hist for histogram
box for boxplot
kde or density for density plots
area for area plots
pie for pie plots
scatter for scatter plots
hexbin for hexbin plot

Histograms¶
A histogram is a way of representing the frequency distribution of a numeric dataset. It partitions the x-axis into bins, assigns each data point in our dataset to a bin, and then counts the number of data points that have been assigned to each bin. So the y-axis is the frequency, or the number of data points in each bin. Note that we can change the bin size, and usually one needs to tweak it so that the distribution is displayed nicely.

Let’s find out the frequency distribution of the number (population) of new immigrants from the various countries to Canada in 2013.

Before we proceed with creating the histogram plot, let’s first examine the data split into intervals. To do this, we will use NumPy’s histogram function to get the bin ranges and frequency counts as follows:

# np.histogram returns 2 values

By default, the histogram function breaks up the dataset into 10 bins. The bin ranges and frequency counts for immigration in 2013 show that:

178 countries contributed between 0 and 3412.9 immigrants
11 countries contributed between 3412.9 and 6825.8 immigrants
1 country contributed between 6825.8 and 10238.7 immigrants, and so on.
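How the counts and bin ranges are obtained can be sketched on a small made-up array. The values below are not the 2013 figures; the point is only the shape of what np.histogram returns.

```python
import numpy as np

# Made-up values standing in for one year's column.
values = np.array([0, 5, 10, 15, 20, 40, 60, 80, 95, 100])

# np.histogram returns two arrays: the per-bin counts and the bin edges.
# With the default of 10 bins there are 11 edges.
counts, bin_edges = np.histogram(values)

# Pair each count with the range of its bin.
for left, right, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{n} values between {left} and {right}")
```

The width of each bin is the data range divided by the number of bins, which is how an interval such as 3412.9 arises in the notes.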

In the above plot, the x-axis represents the population range of immigrants in intervals of 3412.9. The y-axis represents the number of countries that contributed to the aforementioned population.

Notice that the x-axis labels do not match the bin edges. This can be fixed by passing in an xticks keyword that contains the list of bin edges.

Side Note: We could use df['2013'].plot.hist() instead. In fact, some_data.plot(kind='type_plot', …) is equivalent to some_data.plot.type_plot(…). That is, passing the type of the plot as an argument or calling it as a method behaves the same.
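A rough sketch of this fix: compute the bin edges with np.histogram and hand them to plot() as xticks. The series values are made up, and the axis labels and `Agg` backend are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import numpy as np
import pandas as pd

values = pd.Series([0, 5, 10, 15, 20, 40, 60, 80, 95, 100])  # made-up values
counts, bin_edges = np.histogram(values)

# Passing the bin edges as xticks puts one tick on every bin boundary.
ax = values.plot(kind="hist", xticks=bin_edges)
ax.set_xlabel("Number of immigrants")
ax.set_ylabel("Number of countries")
```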

See the pandas documentation for more info.

We can also plot multiple histograms on the same plot. For example, let’s examine the immigration distribution for Denmark, Norway, and Sweden for the years 1980 – 2013.

df.loc[['Denmark', 'Norway', 'Sweden'], years].plot.hist()

That will not work! We’ll often come across situations like this when creating plots. The solution often lies in how the underlying dataset is structured.

Instead of plotting the frequency distribution of the population for each of the 3 countries, pandas plots the frequency distribution for each of the years.

This can be easily fixed by first transposing the dataset, and then plotting as shown below.

Let’s make a few modifications to improve the impact and aesthetics of the previous plot:

increase the number of bins to 15 by passing in the bins parameter
set transparency to 60% by passing in the alpha parameter
label the x-axis by passing in the x-label parameter
change the colors of the plots by passing in the color parameter
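Putting the transposition and these modifications together might look like the sketch below. The dataframe holds made-up numbers for three years only, the three color names are an arbitrary choice, and the x-axis label is set through the Axes object here rather than through a plot() keyword. The `Agg` backend is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import pandas as pd

# Made-up numbers; rows are countries and columns are years.
nordic = pd.DataFrame(
    {"1980": [80, 110, 280], "1981": [85, 75, 260], "1982": [70, 105, 220]},
    index=["Denmark", "Norway", "Sweden"],
)

# Transpose so each country is a column, which pandas treats as one histogram.
nordic_t = nordic.transpose()

ax = nordic_t.plot(
    kind="hist",
    bins=15,     # number of bins
    alpha=0.6,   # alpha value corresponding to the 60% setting in the list above
    color=["coral", "darkslateblue", "mediumseagreen"],
)
ax.set_xlabel("Number of immigrants")

print(nordic_t.columns.tolist())
```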

For a full listing of the colors available in Matplotlib, run the following code:

import matplotlib.colors
for name, hex_code in matplotlib.colors.cnames.items():
    print(name, hex_code)

Bar Charts (Dataframe) ¶
A bar plot is a way of representing data where the length of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals.

To create a bar plot, we can pass one of two arguments via kind parameter in plot():

kind='bar' creates a vertical bar plot
kind='barh' creates a horizontal bar plot

Vertical bar plot

In vertical bar graphs, the x-axis is used for labelling, and the length of bars on the y-axis corresponds to the magnitude of the variable being measured. Vertical bar graphs are particularly useful in analyzing time series data. One disadvantage is that they lack space for text labelling at the foot of each bar.

Let’s start off by analyzing the effect of Iceland’s Financial Crisis:

The 2008 – 2011 Icelandic Financial Crisis was a major economic and political event in Iceland. Relative to the size of its economy, Iceland’s systemic banking collapse was the largest experienced by any country in economic history. The crisis led to a severe economic depression in 2008 – 2011 and significant political unrest.

Let’s compare the number of Icelandic immigrants (country = ‘Iceland’) to Canada from year 1980 to 2013.

The bar plot above shows the total number of immigrants broken down by each year. We can clearly see the impact of the financial crisis; the number of immigrants to Canada started increasing rapidly after 2008.

Let’s annotate this on the plot using the annotate method. We will pass in the following parameters:

s: str, the text of the annotation (named text in Matplotlib 3.3 and later).
xy: Tuple specifying the (x, y) point to annotate (in this case, the end point of the arrow).
xytext: Tuple specifying the (x, y) point at which to place the text (in this case, the start point of the arrow).
xycoords: The coordinate system that xy is given in; 'data' uses the coordinate system of the object being annotated (default).
arrowprops: Takes a dictionary of properties used to draw the arrow:
    arrowstyle: Specifies the arrow style; '->' is a standard arrow.
    connectionstyle: Specifies the connection type; arc3 is a straight line.
    color: Specifies the color of the arrow.
    lw: Specifies the line width.
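The parameters above can be tried out on a small bar plot, as in this sketch. The yearly counts are invented, and the arrow coordinates are arbitrary choices that suit this toy data. The annotation text is passed positionally so the call does not depend on the keyword's name. The `Agg` backend is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import pandas as pd

# Made-up yearly counts for illustration only.
iceland = pd.Series([10, 12, 11, 30, 45], index=[2006, 2007, 2008, 2009, 2010])

ax = iceland.plot(kind="bar")
ax.set_xlabel("Year")
ax.set_ylabel("Number of immigrants")

# On a bar plot the x positions are 0, 1, 2, ... rather than the year values.
ax.annotate(
    "",                # empty text: draw only the arrow
    xy=(4, 45),        # arrow head (end point)
    xytext=(2, 15),    # arrow tail (start point)
    xycoords="data",
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3", color="blue", lw=2),
)
```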

Let’s also annotate a text to go over the arrow. We will pass in the following additional parameters:

Horizontal Bar Plot

Sometimes it is more practical to represent the data horizontally, especially if you need more room for labelling the bars. In horizontal bar graphs, the y-axis is used for labelling, and the length of bars on the x-axis corresponds to the magnitude of the variable being measured. As you will see, there is more room on the y-axis to label categorical variables.

Using the scripting layer and the dataset, let’s create a horizontal bar plot showing the total number of immigrants to Canada from the top 15 countries, for the period 1980 – 2013.
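A scaled-down sketch of this exercise is shown below, using the scripting layer. It keeps the top 3 of four placeholder countries instead of the top 15; the labels A to D and their totals are invented. The `Agg` backend is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals for placeholder countries.
totals = pd.Series({"A": 50, "B": 20, "C": 90, "D": 10}, name="Total")

# Sort ascending and take the tail, so the largest bar ends up at the top of the chart.
top3 = totals.sort_values(ascending=True).tail(3)

top3.plot(kind="barh")
plt.xlabel("Number of immigrants")
plt.title("Top 3 placeholder countries by total")

print(top3.index.tolist())  # ['B', 'A', 'C']
```

Horizontal bars are drawn from the bottom up, which is why the ascending sort puts the largest value at the top.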

Pie Charts ¶
A pie chart is a circular graphic that displays numeric proportions by dividing a circle (or pie) into proportional slices. You are most likely already familiar with pie charts, as they are widely used in business and media. We can create pie charts in Matplotlib by passing in the kind='pie' keyword. Seaborn doesn’t support pie charts, so for this part we will only use matplotlib.

Let’s use a pie chart to explore the proportion (percentage) of new immigrants grouped by continents for the entire time period from 1980 to 2013.

Step 1: Gather data.

We will use pandas groupby method to summarize the immigration data by Continent. The general process of groupby involves the following steps:

Split: Splitting the data into groups based on some criteria.
Apply: Applying a function to each group independently:
.aggregate()

Combine: Combining the results into a data structure.

# group countries by continents and apply sum() function
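The split, apply and combine steps can be seen on a tiny example. The dataframe `toy` and its totals are made up; the grouping column is named Continent as in the notes.

```python
import pandas as pd

# Made-up totals for three countries on two continents.
toy = pd.DataFrame(
    {"Continent": ["Asia", "Asia", "Europe"], "Total": [100, 50, 30]},
    index=["China", "India", "Iceland"],
)

# Split the rows by Continent, apply sum() to each group, and combine the
# results into a new dataframe indexed by continent.
continent_totals = toy.groupby("Continent").sum()

print(continent_totals)
```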

Step 2: Plot the data. We will pass in kind = ‘pie’ keyword, along with the following additional parameters:

autopct – is a string or function used to label the wedges with their numeric value. The label will be placed inside the wedge. If it is a format string, the label will be fmt%pct.
startangle – rotates the start of the pie chart by angle degrees counterclockwise from the x-axis.
shadow – Draws a shadow beneath the pie (to give a 3D feel).

Let’s also make a few modifications to improve the visuals:

Remove the text labels on the pie chart by passing in legend, and add them as a separate legend using ax.legend().
Push out the percentages to sit just outside the pie chart by passing in the pctdistance parameter.
Pass in a custom set of colors for continents by passing in the colors parameter.
Explode the pie chart to emphasize the lowest three continents (Africa, North America, and Latin America and Caribbean) by passing in the explode parameter.
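The parameters from both lists can be combined in one call, as in this sketch. The group names and totals are placeholders, the colors and offsets are arbitrary, and the wedge labels are hidden here with labels=None, which is one possible way to do it. The legend is built from the wedge patches explicitly so the shadow patches are not picked up. The `Agg` backend is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumed non-interactive backend
import pandas as pd
from matplotlib.patches import Wedge

# Made-up totals per placeholder group.
totals = pd.Series({"Group A": 60, "Group B": 30, "Group C": 10}, name="Total")

ax = totals.plot(
    kind="pie",
    autopct="%1.1f%%",    # label each wedge with its percentage
    startangle=90,        # rotate the start of the pie by 90 degrees
    shadow=True,          # draw a shadow beneath the pie
    labels=None,          # hide the wedge labels; a legend is added below
    pctdistance=1.12,     # push the percentages just outside the pie
    colors=["gold", "lightgreen", "lightcoral"],
    explode=[0, 0, 0.1],  # pull out the smallest wedge
)
ax.set_ylabel("")
ax.axis("equal")  # keep the pie circular

# Collect only the wedges (not their shadows) and use them as legend handles.
wedges = [p for p in ax.patches if isinstance(p, Wedge)]
ax.legend(wedges, list(totals.index), loc="upper left")
```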

Using a pie chart, let’s explore the proportion (percentage) of new immigrants grouped by continents in the year 2013.

Box Plots ¶
A box plot is a way of statistically representing the distribution of the data through five main dimensions:

Minimum: Smallest number in the dataset.
First quartile: Middle number between the minimum and the median.
