Advanced Visualisation (Sols)

Introduction to Data Visualization

Table of Contents

1. [Exploring Datasets with *pandas*](#1)

2. [Matplotlib: Standard Python Visualization Library](#2)

3. [Seaborn](#3)

4. [Line Plots](#4)

5. [Histograms](#5)

6. [Bar Charts](#6)

7. [Pie Charts](#7)

8. [Box Plots](#8)

9. [Scatter Plots](#9)

10. [Bubble Plots](#10)

Exploring the Dataset
The Dataset: Immigration to Canada from 1980 to 2013
The dataset contains annual data on the flows of international immigrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals. The current version presents data pertaining to 45 countries.

import numpy as np
import pandas as pd

df = pd.read_excel('Canada.xlsx', sheet_name='Canada by Citizenship', skiprows=range(20), skipfooter=2)

Type Coverage OdName AREA AreaName REG RegName DEV DevName 1980 … 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
0 Immigrants Foreigners Afghanistan 935 Asia 5501 Southern Asia 902 Developing regions 16 … 2978 3436 3009 2652 2111 1746 1758 2203 2635 2004
1 Immigrants Foreigners Albania 908 Europe 925 Southern Europe 901 Developed regions 1 … 1450 1223 856 702 560 716 561 539 620 603
2 Immigrants Foreigners Algeria 903 Africa 912 Northern Africa 902 Developing regions 80 … 3616 3626 4807 3623 4005 5393 4752 4325 3774 4331
3 Immigrants Foreigners American Samoa 909 Oceania 957 Polynesia 902 Developing regions 0 … 0 0 1 0 0 0 0 0 0 0
4 Immigrants Foreigners Andorra 908 Europe 925 Southern Europe 901 Developed regions 0 … 0 0 1 1 0 0 0 0 1 1

5 rows × 43 columns

When analyzing a dataset, it’s always a good idea to start by getting basic information about your dataframe. We can do this by using the info() method.

RangeIndex: 195 entries, 0 to 194
Data columns (total 43 columns):
Type 195 non-null object
Coverage 195 non-null object
OdName 195 non-null object
AREA 195 non-null int64
AreaName 195 non-null object
REG 195 non-null int64
RegName 195 non-null object
DEV 195 non-null int64
DevName 195 non-null object
1980 195 non-null int64
1981 195 non-null int64
1982 195 non-null int64
1983 195 non-null int64
1984 195 non-null int64
1985 195 non-null int64
1986 195 non-null int64
1987 195 non-null int64
1988 195 non-null int64
1989 195 non-null int64
1990 195 non-null int64
1991 195 non-null int64
1992 195 non-null int64
1993 195 non-null int64
1994 195 non-null int64
1995 195 non-null int64
1996 195 non-null int64
1997 195 non-null int64
1998 195 non-null int64
1999 195 non-null int64
2000 195 non-null int64
2001 195 non-null int64
2002 195 non-null int64
2003 195 non-null int64
2004 195 non-null int64
2005 195 non-null int64
2006 195 non-null int64
2007 195 non-null int64
2008 195 non-null int64
2009 195 non-null int64
2010 195 non-null int64
2011 195 non-null int64
2012 195 non-null int64
2013 195 non-null int64
dtypes: int64(37), object(6)
memory usage: 65.6+ KB

Let’s clean the data set to remove a few unnecessary columns. Then we rename the columns so that they make sense.

# in pandas axis=0 represents rows (default) and axis=1 represents columns.
df.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)
df.rename(columns={'OdName': 'Country', 'AreaName': 'Continent', 'RegName': 'Region'}, inplace=True)

Country Continent Region DevName 1980 1981 1982 1983 1984 1985 … 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
0 Afghanistan Asia Southern Asia Developing regions 16 39 39 47 71 340 … 2978 3436 3009 2652 2111 1746 1758 2203 2635 2004
1 Albania Europe Southern Europe Developed regions 1 0 0 0 0 0 … 1450 1223 856 702 560 716 561 539 620 603
2 Algeria Africa Northern Africa Developing regions 80 67 71 69 63 44 … 3616 3626 4807 3623 4005 5393 4752 4325 3774 4331
3 American Samoa Oceania Polynesia Developing regions 0 1 0 0 0 0 … 0 0 1 0 0 0 0 0 0 0
4 Andorra Europe Southern Europe Developed regions 0 0 0 0 0 0 … 0 0 1 1 0 0 0 0 1 1

5 rows × 38 columns

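We will also add a ‘Total’ column that sums the number of immigrants from each country over the entire period 1980 – 2013. Then we check whether the dataset contains any null values.

df['Total'] = df.sum(axis=1, numeric_only=True) # sum only the numeric (year) columns
df.isnull().sum().any() # True if any column has at least one null value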
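
To make the two steps above concrete, here is a minimal sketch on a toy frame. The `toy` frame and its values are made up for illustration, and `numeric_only=True` is included on the assumption that a recent pandas release is in use, where summing across mixed text and numeric columns may otherwise raise an error.

```python
import pandas as pd

# Toy frame (made-up values): text columns next to integer year columns.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Continent": ["Asia", "Europe", "Africa"],
    1980: [16, 1, 80],
    1981: [39, 0, 67],
})

# Sum across each row, restricted to the numeric columns.
toy["Total"] = toy.sum(axis=1, numeric_only=True)

# isnull() flags missing cells, sum() counts them per column,
# and any() collapses the counts into a single True/False answer.
nulls_per_column = toy.isnull().sum()
has_nulls = bool(nulls_per_column.any())

print(toy["Total"].tolist())
print(has_nulls)
```

If an exact count of missing cells is needed rather than a yes/no answer, `toy.isnull().sum().sum()` gives the total.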

Finally, let’s view a quick summary of each column in our dataframe using the describe() method.

df.describe()

1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 … 2005 2006 2007 2008 2009 2010 2011 2012 2013 Total
count 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 … 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000 195.000000
mean 508.394872 566.989744 534.723077 387.435897 376.497436 358.861538 441.271795 691.133333 714.389744 843.241026 … 1320.292308 1266.958974 1191.820513 1246.394872 1275.733333 1420.287179 1262.533333 1313.958974 1320.702564 32867.451282
std 1949.588546 2152.643752 1866.997511 1204.333597 1198.246371 1079.309600 1225.576630 2109.205607 2443.606788 2555.048874 … 4425.957828 3926.717747 3443.542409 3694.573544 3829.630424 4462.946328 4030.084313 4247.555161 4237.951988 91785.498686
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 … 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 0.500000 1.000000 1.000000 … 28.500000 25.000000 31.000000 31.000000 36.000000 40.500000 37.500000 42.500000 45.000000 952.000000
50% 13.000000 10.000000 11.000000 12.000000 13.000000 17.000000 18.000000 26.000000 34.000000 44.000000 … 210.000000 218.000000 198.000000 205.000000 214.000000 211.000000 179.000000 233.000000 213.000000 5018.000000
75% 251.500000 295.500000 275.000000 173.000000 181.000000 197.000000 254.000000 434.000000 409.000000 508.500000 … 832.000000 842.000000 899.000000 934.500000 888.000000 932.000000 772.000000 783.000000 796.000000 22239.500000
max 22045.000000 24796.000000 20620.000000 10015.000000 10170.000000 9564.000000 9470.000000 21337.000000 27359.000000 23795.000000 … 42584.000000 33848.000000 28742.000000 30037.000000 29622.000000 38617.000000 36765.000000 34315.000000 34129.000000 691904.000000

8 rows × 35 columns

df.set_index('Country', inplace=True)

# optional: to remove the name of the index
df.index.name = None

Continent Region DevName 1980 1981 1982 1983 1984 1985 1986 … 2005 2006 2007 2008 2009 2010 2011 2012 2013 Total
Afghanistan Asia Southern Asia Developing regions 16 39 39 47 71 340 496 … 3436 3009 2652 2111 1746 1758 2203 2635 2004 58639
Albania Europe Southern Europe Developed regions 1 0 0 0 0 0 1 … 1223 856 702 560 716 561 539 620 603 15699
Algeria Africa Northern Africa Developing regions 80 67 71 69 63 44 69 … 3626 4807 3623 4005 5393 4752 4325 3774 4331 69439
American Samoa Oceania Polynesia Developing regions 0 1 0 0 0 0 0 … 0 1 0 0 0 0 0 0 0 6
Andorra Europe Southern Europe Developed regions 0 0 0 0 0 0 2 … 0 1 1 0 0 0 0 1 1 15

5 rows × 38 columns

Column names that are integers (such as the years) might introduce some confusion. For example, when we reference the year 2013, one might confuse it with the 2013th positional index.

To avoid this ambiguity, let’s convert the column names into strings: ‘1980’ to ‘2013’.

df.columns = list(map(str, df.columns))

# useful for plotting later on
years = list(map(str, range(1980, 2014)))

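
The effect of the conversion can be checked on a small example. This is only a sketch: the `toy` frame, its values, and the `toy_years` list are hypothetical stand-ins for the real dataframe and the `years` list.

```python
import pandas as pd

# Toy frame (made-up values) whose year columns are integers.
toy = pd.DataFrame({"Continent": ["Asia"], 1980: [16], 1981: [39]})

# Convert every column label to a string, as in the step above.
toy.columns = list(map(str, toy.columns))

# A list of year labels as strings, handy for selecting the year columns.
toy_years = list(map(str, range(1980, 1982)))

print(toy.columns.tolist())             # every label is now a string
print(toy[toy_years].iloc[0].tolist())  # select the year columns by string label
print(1980 in toy.columns)              # the integer label no longer exists
```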

Matplotlib: Standard Python Visualization Library
Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

If you aspire to create impactful visualizations with Python, Matplotlib is an essential tool to have at your disposal.

Matplotlib.Pyplot
One of the core aspects of Matplotlib is matplotlib.pyplot. It is Matplotlib’s scripting layer. It is a collection of command style functions that make Matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. We will work both with the scripting and artist layer.

Two types of plotting
There are two styles of plotting with Matplotlib: plotting using the scripting layer and plotting using the artist layer.

Option 1: Scripting layer (procedural method) – using matplotlib.pyplot as ‘plt’

You can use plt (i.e. matplotlib.pyplot) and add more elements by calling different methods procedurally; for example, plt.title(…) to add a title or plt.xlabel(…) to add a label to the x-axis.

Option 2: Artist layer (object-oriented method) – using an Axes instance from Matplotlib (preferred)

You can store the Axes instance of your current plot in a variable (e.g. ax) and add more elements by calling its methods, whose names differ slightly (usually the pyplot name prefixed with “set_”). For example, use ax.set_title() instead of plt.title() to add a title, or ax.set_xlabel() instead of plt.xlabel() to add a label to the x-axis.

This option is often more transparent and flexible for advanced plots, in particular when a figure contains multiple plots.
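
The two options can be compared side by side in a short sketch. The data values are made up, and the `matplotlib.use("Agg")` line is an assumption that lets the sketch run without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

x = [1980, 1981, 1982]
y = [16, 39, 39]  # made-up values

# Option 1: scripting layer. pyplot acts on the "current" figure and axes.
plt.figure()
plt.plot(x, y)
plt.title("Scripting layer")
plt.xlabel("Years")
scripted_ax = plt.gca()  # grab the current axes only to inspect the result

# Option 2: artist layer. We hold the Axes object and call its set_* methods.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Artist layer")
ax.set_xlabel("Years")

print(scripted_ax.get_title(), "|", ax.get_title())
```

Both routes end up setting the same properties on an Axes object; the artist layer simply makes explicit which Axes is being modified.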

Often we want to draw multiple plots within the same figure. For example, we might want a side-by-side comparison of the box plot and the line plot of immigration from China and India.

To visualize multiple plots together, we can create a figure (overall canvas) and divide it into subplots, each containing a plot. With subplots, we usually work with the artist layer instead of the scripting layer.

The typical syntax is:

fig = plt.figure() # create figure
ax = fig.add_subplot(nrows, ncols, plot_number) # create subplots

nrows and ncols are used to notionally split the figure into (nrows * ncols) sub-axes, and plot_number identifies the particular subplot that this call creates within the notional grid. plot_number starts at 1, increments across rows first, and has a maximum of nrows * ncols.

When nrows, ncols, and plot_number are all less than 10, a convenience exists such that a 3-digit number can be given instead, where the hundreds digit represents nrows, the tens digit represents ncols, and the units digit represents plot_number. For instance,

subplot(211) == subplot(2, 1, 1)

produces a subplot that is the top plot (i.e. the first) in a notional grid of 2 rows by 1 column (no grid actually exists, but conceptually this is how the returned subplot is positioned).
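
The numbering rule can be verified with a small sketch. It inspects each subplot's grid cell through `get_subplotspec()`; the variable names are illustrative, and the `Agg` backend line is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

# The three-digit shorthand and the three-argument form describe the same cell.
fig_a = plt.figure()
ax_short = fig_a.add_subplot(211)
fig_b = plt.figure()
ax_long = fig_b.add_subplot(2, 1, 1)
same_cell = (ax_short.get_subplotspec().get_geometry()
             == ax_long.get_subplotspec().get_geometry())

# In a 2 x 3 grid, plot_number counts across the first row (1, 2, 3)
# before moving on to the second row (4, 5, 6).
fig_c = plt.figure()
positions = {}
for plot_number in range(1, 7):
    ax = fig_c.add_subplot(2, 3, plot_number)
    spec = ax.get_subplotspec()
    # Store the zero-based (row, column) of the cell this subplot occupies.
    positions[plot_number] = (spec.rowspan.start, spec.colspan.start)

print(same_cell)
print(positions)
```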

Plotting in pandas
Fortunately, pandas has built-in plotting support based on Matplotlib. Plotting in pandas is as simple as calling the .plot() method on a series or dataframe.

Documentation:

Plotting with Series

Plotting with Dataframes

import matplotlib.pyplot as plt

# we are using the inline backend
%matplotlib inline

print(plt.style.available)
plt.style.use(['ggplot']) # for ggplot-like style

[‘seaborn-notebook’, ‘seaborn-bright’, ‘fast’, ‘seaborn-talk’, ‘seaborn-ticks’, ‘Solarize_Light2’, ‘seaborn-dark’, ‘fivethirtyeight’, ‘seaborn-pastel’, ‘_classic_test’, ‘seaborn-white’, ‘seaborn-whitegrid’, ‘seaborn-deep’, ‘seaborn’, ‘seaborn-paper’, ‘tableau-colorblind10’, ‘seaborn-muted’, ‘seaborn-dark-palette’, ‘bmh’, ‘seaborn-poster’, ‘ggplot’, ‘seaborn-colorblind’, ‘dark_background’, ‘classic’, ‘seaborn-darkgrid’, ‘grayscale’]

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

We will use both libraries as we go along and plot their output side by side.

import seaborn as sns

Line Plots (Series/Dataframe)
What is a line plot and why use it?

A line chart or line plot is a type of plot which displays information as a series of data points called ‘markers’ connected by straight line segments. It is a basic type of chart common in many fields.
Use a line plot when you have a continuous data set. Line plots are best suited to trend-based visualizations of data over a period of time.

Let’s start with a case study:

In 2010, Haiti suffered a catastrophic magnitude 7.0 earthquake. The quake caused widespread devastation and loss of life, and about three million people were affected by this natural disaster. As part of Canada’s humanitarian effort, the Government of Canada stepped up its effort in accepting refugees from Haiti. We can quickly visualize this effort using a line plot:

First, we will extract the data series for Haiti.

haiti = df.loc['Haiti', years] # passing in years 1980 – 2013 to exclude the 'Total' column
haiti.head()

1980 1666
1981 3692
1982 3498
1983 2860
1984 1418
Name: Haiti, dtype: object

Next, we will draw a line plot by appending .plot() to the haiti series.

haiti.plot()

pandas automatically populated the x-axis with the index values (years) and the y-axis with the series values (number of immigrants). However, notice that the years were not displayed because they are of type string. Therefore, let’s change the type of the index values to integer for plotting.

Also, let’s add a title and label the x- and y-axes. Because we plot on Axes objects, we use ax.set_title(), ax.set_ylabel(), and ax.set_xlabel() (the artist-layer counterparts of plt.title(), plt.ylabel(), and plt.xlabel()) as follows:

haiti.index = haiti.index.map(int) # convert the index values of haiti to integers for plotting
haiti = haiti.astype(int)

fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(121)
haiti.plot(kind='line', ax=ax)

ax.set_title('Immigration from Haiti Matplotlib')
ax.set_ylabel('Number of immigrants')
ax.set_xlabel('Years')

ax1 = fig.add_subplot(122)
sns.lineplot(x=haiti.index, y=haiti.values, ax=ax1)

ax1.set_title('Immigration from Haiti Seaborn')
ax1.set_ylabel('Number of immigrants')
ax1.set_xlabel('Years')

plt.tight_layout()
plt.show() # need this line to show the updates made to the figure

We can clearly see how the number of immigrants from Haiti spiked from 2010 as Canada stepped up its efforts to accept refugees from Haiti. Let’s annotate this spike in the plot by using the plt.text() or ax.text() method.

fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(121)
haiti.plot(kind='line', ax=ax)

ax.set_title('Immigration from Haiti Matplotlib')
ax.set_ylabel('Number of immigrants')
ax.set_xlabel('Years')
# annotate the 2010 Earthquake.
# syntax: text(x, y, label)
ax.text(2005, 6000, '2010 Earthquake') # see note below

ax1 = fig.add_subplot(122)
sns.lineplot(x=haiti.index, y=haiti.values, ax=ax1)

ax1.set_title('Immigration from Haiti Seaborn')
ax1.set_ylabel('Number of immigrants')
ax1.set_xlabel('Years')

ax1.text(2005, 6000, '2010 Earthquake')

plt.tight_layout()
plt.show() # need this line to show the updates made to the figure

Quick note on the x and y values in plt.text(x, y, label):

Since the x-axis (years) is of type integer, we specify x as a year. The y-axis (number of immigrants) is also of type integer, so we can simply specify y = 6000.

plt.text(2005, 6000, '2010 Earthquake') # years stored as type int

If the years were stored as type string, we would need to specify x as the index position of the year. For example, index position 25 corresponds to 2005, since 2005 is 25 years after the base year 1980.

plt.text(25, 6000, '2010 Earthquake') # years stored as type str
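
Rather than counting by hand, the index position of a year label can be looked up. The sketch below plots with Matplotlib directly instead of through pandas, uses placeholder values rather than the real Haiti figures, and sets the `Agg` backend only so that it runs without a display; all three are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

years = list(map(str, range(1980, 2014)))  # '1980' ... '2013'
values = list(range(len(years)))           # placeholder values, not real data

# With string labels, the points sit at positions 0, 1, 2, ...
# so the x coordinate of a year is its position in the list.
x_position = years.index("2005")

fig, ax = plt.subplots()
ax.plot(years, values)
label = ax.text(x_position, 20, "2010 Earthquake")

print(x_position)
```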

We can easily add more countries to the line plot to make meaningful comparisons of immigration from different countries.

Let’s compare the number of immigrants from India and China from 1980 to 2013.

Step 1: Get the data set for China and India, and display dataframe.

data = df.loc[['China', 'India'], years]
data.head()

1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 … 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
China 5123 6682 3308 1863 1527 1816 1960 2643 2758 4323 … 36619 42584 33518 27642 30037 29622 30391 28502 33024 34129
India 8880 8670 8147 7338 5704 4211 7150 10189 11522 10343 … 28235 36210 33848 28742 28261 29456 34235 27509 30933 33087

2 rows × 34 columns

Step 2: Plot graph. We will explicitly specify line plot by passing in kind parameter to plot().

data.plot(kind='line')

That won’t look right…

Recall that pandas plots the indices on the x-axis and the columns as individual lines on the y-axis. Since data is a dataframe with the country as the index and years as the columns, we must first transpose the dataframe.

data = data.T
data.head()

China India
1980 5123 8880
1981 6682 8670
1982 3308 8147
1983 1863 7338
1984 1527 5704
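
As a small illustration of what the transpose does, the sketch below rebuilds the first three years from the table above in a frame named `wide` (a hypothetical name) and flips it.

```python
import pandas as pd

# First three years for China and India, copied from the table above.
wide = pd.DataFrame(
    [[5123, 6682, 3308], [8880, 8670, 8147]],
    index=["China", "India"],
    columns=["1980", "1981", "1982"],
)

# .T swaps rows and columns: years become the index, countries the columns.
tall = wide.T

print(wide.shape, tall.shape)
print(tall.columns.tolist())
print(tall.loc["1981", "India"])
```

After the transpose, each column is one country, which is the layout pandas needs in order to draw one line per country.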

pandas will automatically graph the two countries on the same plot. Go ahead and plot the new transposed dataframe. Make sure to add a title to the plot and label the axes.

data.index = data.index.map(int) # convert the index values of data to integers for plotting
data = data.astype(int)

fig = plt.figure(figsize=(20, 8))
ax = fig.add_subplot(121)
data.plot(kind='line', ax=ax)

ax.set_title('Immigrants from China and India Matplotlib')
ax.set_ylabel('Number of immigrants')
ax.set_xlabel('Years')

ax = fig.add_subplot(122)
sns.lineplot(data=data, ax=ax)

ax.set_title('Immigrants from China and India Seaborn')
ax.set_ylabel('Number of immigrants')
ax.set_xlabel('Years')

plt.tight_layout()
plt.show() # need this line to show the updates made to the figure

From the above plot, we can observe that China and India have very similar immigration trends through the years.

So how come we didn’t need to transpose Haiti’s dataframe before plotting?

That’s because haiti is a series as opposed to a dataframe, and has the years as its indices as shown below.

print(type(haiti))
print(haiti.head())

<class 'pandas.core.series.Series'>
1980    1666
1981    3692
1982    3498
1983    2860
1984    1418
Name: Haiti, dtype: int64
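
The difference comes from how the rows were selected. A minimal sketch, using a made-up `toy` frame in place of the real data, shows that a single row label yields a series while a list of row labels yields a dataframe:

```python
import pandas as pd

# Toy frame (made-up values) indexed by country, with year columns.
toy = pd.DataFrame(
    {"1980": [10, 20, 30], "1981": [11, 21, 31]},
    index=["Haiti", "China", "India"],
)
toy_years = ["1980", "1981"]

one = toy.loc["Haiti", toy_years]              # single label -> Series
many = toy.loc[["China", "India"], toy_years]  # list of labels -> DataFrame

# The Series is indexed by year; the DataFrame is still indexed by country.
print(type(one).__name__, one.index.tolist())
print(type(many).__name__, many.index.tolist())
```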

A line plot is a handy tool for displaying several dependent variables against one independent variable.
