Tutorial_10_B-checkpoint
Tutorial 10B. Data Integration with Household heating data¶
1. Case study: visualizing Household heating data on the map¶
Our aim in this tutorial is to visualize the household heating data for each state in USA on a map. We will discuss how to integrate between three different data sources:
a basic map of USA (S1),
shapefiles of States (S2),
and hourse heating data for those regions (S3).
S3 presents household heating by congressional district. Now, our task is to merge the three data set together and generate a map.
We start first with setting up the environment.
If you use your own laptop in the tutorial, you will need to install the following Python libraries
basemap: conda install basemap
shapefile: pip install pyshp
In general, there are a couple of simple options for plotting data on maps, i.e., using latitude and longitude. However, if you want to plot on a country you may have to go for the middle or find an actual location
(e.g. Austin, Texas, USA rather than just USA). There’s also a nice example here:
http://introtopython.org/visualization_earthquakes.html. Instead of using the coordinate values, we can use
shapefiles that often come as a set, the minimum is .dbf, shp & .shx
2. Examples on the map plotting¶
2.1 The use of Basemap¶
The materials in this section are partially based on those from http://vizclass.csc.ncsu.edu/2011/09/tutorial-geographic-data-on-map-with.html.
In [ ]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
Let’s start with a world map by drawing the countries and the coasts.
In [ ]:
# lon_0 is central longitude of robinson projection.
# resolution = ‘c’ means use crude resolution coastlines.
m = Basemap(projection = ‘robin’, lon_0 = 0, resolution = ‘c’)
# draw coastlines, country boundaries, fill continents.
m.drawcoastlines(color = ‘#6D5F47’, linewidth = .4)
m.drawcountries(color = ‘#6D5F47′, linewidth = .4)
Now, we can further add some color to the map. For example, we can change the background color to light blue so that it indicates the ocean.
In [ ]:
# set a background colour = water
# m.drawmapboundary(fill_color=’#85A6D9′) # using codes or just names:
m.drawmapboundary(fill_color=’lightblue’)
# draw coastlines, country boundaries, fill continents.
m.fillcontinents(color = ‘white’, lake_color = ‘#85A6D9’)
m.drawcoastlines(color = ‘#6D5F47’, linewidth = .4)
m.drawcountries(color = ‘#6D5F47′, linewidth = .4)
We can also draws the lines of longitude and latitude.
In [ ]:
# draw lat/lon grid lines every 30 degrees.
m.drawmeridians(np.arange(-180, 180, 30), color=’#bbbbbb’)
m.drawparallels(np.arange(-90, 90, 30), color=’#bbbbbb’)
m.fillcontinents(color = ‘white’, lake_color = ‘#85A6D9’)
m.drawcoastlines(color = ‘#6D5F47’, linewidth = .4)
m.drawcountries(color = ‘#6D5F47’, linewidth = .4)
Now, assume that we are going to draw a map showing the population information of the cities that are given by their coordinates as follows.
lats = [35.69,37.569,19.433,40.809,18.975,-6.175,-23.55,28.61,34.694,31.2]
lngs = [139.692,126.977,-99.133,-74.02,72.825,106.828,-46.633,77.23,135.502,121.5]
populations = [32.45,20.55,20.45,19.75,19.2,18.9,18.85,18.6,17.375,16.65] #millions
The size of the circle is proportional to the population. If the population is less than 19 million, we will use color blue. There are two ways to do this task. One is as follows:
In [ ]:
def get_marker_color(population):
if population < 19:
return ('bo')
else:
return ('ro')
plt.figure(figsize = (12,6)) # make it bigger first
m = Basemap(projection = 'robin', lon_0 = 0, resolution = 'c')
#set a background colour
m.drawmapboundary(fill_color = '#85A6D9')
m.fillcontinents(color = 'white', lake_color = '#85A6D9')
m.drawcoastlines(color = '#6D5F47', linewidth = .4)
m.drawcountries(color = '#6D5F47', linewidth = .4)
m.drawmeridians(np.arange(-180, 180, 30), color = '#bbbbbb')
m.drawparallels(np.arange(-90, 90, 30), color = '#bbbbbb')
lats = [35.69,37.569,19.433,40.809,18.975,-6.175,-23.55,28.61,34.694,31.2]
lngs = [139.692,126.977,-99.133,-74.02,72.825,106.828,-46.633,77.23,135.502,121.5]
populations = [32.45,20.55,20.45,19.75,19.2,18.9,18.85,18.6,17.375,16.65] #millions
for lng, lat, s in zip(lngs, lats, populations):
x,y = m(lng, lat)
colour = get_marker_color(s)
m.plot(x, y, colour, markersize= s, alpha = 0.25 )
plt.show()
The alternative is to use the scatter() function in Basemap:
In [ ]:
plt.figure(figsize = (12,6)) # make it bigger first
import pandas as pd
##write your code bellow:
####
plt.title('Big Cities (Population)')
plt.show()
2.2 The use of Shapefiles¶
Based on http://stackoverflow.com/questions/15968762/shapefile-and-matplotlib-plot-polygon-collection-of-shapefile-coordinates
In [ ]:
import numpy
import matplotlib
%matplotlib inline
import shapefile
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
The Shapefile format is a popular Geographic Information System vector data format created by Esri. For more information about this format please read the well-written “ESRI Shapefile Technical Description - July 1998” located at http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf.Essentially, Shapefiles encode points, lines, curves, and polygons. This is useful for mapping applications where you need to represent county lines, major highways, lakes, etc.
In [ ]:
sf = shapefile.Reader("world_countries_boundary_file_world_2002") # note, no suffix, all 3 files are used
recs = sf.records()
shapes = sf.shapes()
recs
This is a list of the country name & country code (ISO) in two different formats, in alphabetical order. Then, how many records (countries) are there?
In [ ]:
len(recs), len(shapes)
Now, we are going to draw the world map, where each country is highlighted with a different color, and we would like to have USA marked as black and AUS as pink.
In [ ]:
cm = matplotlib.cm.get_cmap('Dark2')
Nshp = len(shapes)
cccol = cm(1.*numpy.arange(Nshp)/Nshp) # one colour for every contry...
# plot
fig = plt.figure(figsize = (12,6))
ax = fig.add_subplot(111)
#Plot each country
for nshp in xrange(Nshp):
ptchs = []
pts = numpy.array(shapes[nshp].points)
prt = shapes[nshp].parts
par = list(prt) + [pts.shape[0]]
for pij in xrange(len(prt)):
ptchs.append(Polygon(pts[par[pij]:par[pij+1]]))
ax.add_collection(PatchCollection(ptchs,facecolor=cccol[nshp,:],edgecolor='k', linewidths=.1))
# hack the colour for specific countries, actually alphabetical, need to know ISO code
if recs[nshp][1] == "USA":
ax.add_collection(PatchCollection(ptchs,facecolor=[0,0,0,1],edgecolor='k', linewidths=.1)) # black
if recs[nshp][1] == "AUS":
ax.add_collection(PatchCollection(ptchs,facecolor = "pink",edgecolor='k', linewidths=.1)) # pink
ax.set_xlim(-180,+180)
ax.set_ylim(-90,90)
plt.show()
3. Visualizing the house heating data¶
There are three shape files provided. Let's have a look on one of them!
In [ ]:
# You can look at the shapefile data, just display the contents as is, at the moment
# !cat 'gz_2010_us_040_00_500k.dbf'
inFile = open('gz_2010_us_040_00_500k_1.dbf' , 'r')
#Use the following line if you use Python 2
#inFile = open('./gz_2010_us_040_00_500k.dbf' , 'r')
data = inFile.readlines()
inFile.close()
data
3.1 Explore Household heating data as follows:¶
In [ ]:
import pandas as pd
import numpy as np
df = pd.read_csv('Household heating by Congressional District - 2008.csv')
df.head()
Schema Conflict: Different level of aggregation:
The shappefiles (S2) are categorised by state, whereas S3 is represented by district. Therefore, we will need schema matching to integrate between both.
Also that the Congressional District attribute contains the following values: Alabama 1
Alabama 2
etc.
However, we want 'state' names instead of 'district' to be able to visualise data for each state on map. Thus, we can do the following:
Change the shape files so the regions match the data.
Find files that already match (e.g. Google Fusion Tables)
Change the data to match the shape regions
3.2 Take the original Household heating data ('Household heating by Congressional District - 2008.csv') and wrangle the data into a state format.¶
Create your own CSV file contains state information. You should have a dataset that looks something like this:
Now, load your new CSV file and generate the dataframe as shown above
In [ ]:
# Write your code here
3.3 Plot the avergae"Housing Units That Are Mobile Homes" for each state.¶
List the average value for each state. We will use the average values to generate the color codes¶
In [ ]:
state = df.groupby('States')
statedf = state.aggregate(np.average)
avergeValue = dict(statedf['% Housing Units That Are Mobile Homes'])
avergeValue
Now load the map¶
In [ ]:
from mpl_toolkits.basemap import Basemap
import warnings
warnings.filterwarnings('ignore')
from matplotlib.collections import LineCollection
from matplotlib.colors import rgb2hex
# Lambert Conformal map of lower 48 states.
m = Basemap(llcrnrlon=-119,llcrnrlat=22,
urcrnrlon=-64, urcrnrlat=49,
projection='lcc', lat_1=33,lat_2=45,lon_0=-95)
m.drawcoastlines(linewidth=0.25)
m.drawcountries(linewidth=0.25)
m.fillcontinents(color='green',lake_color='aqua')
# draw the edge of the map projection region (the projection limb)
m.drawmapboundary(fill_color='aqua')
# draw lat/lon grid lines every 30 degrees.
m.drawmeridians(np.arange(0,360,30))
m.drawparallels(np.arange(-90,90,30))
plt.show()
How many states in your data (S3)? Does the number of states match with shapefiles (S2)?¶
Is there something we need to fix before we plot information? Here, we need find a list of state names that appear both in S3 and S2.
In [ ]:
shp_info = m.readshapefile('gz_2010_us_040_00_500k','states',drawbounds=False)
statenames=[]
### Wrtie you code here
####
np.unique(statenames)
However, number of states does not match (should be 50 not 52)?
'District of Columbia' is listed as a state while it is not.
and 'Puerto Rico'?? (some sort of dependency)
(not on the map anyway, neither is Alaska but Python can handle it)
Assign color for each range: Using color map¶
We can now cycle through states, color each one.
Use the cmap() function from "plt.cm.BrBG" to generate color for each state with the following function
$\frac{avergeValue[s]-vmin}{vmax-vmin}$
In [ ]:
import matplotlib.pyplot as plt
colors={}
vmin = 0; vmax = 20 # set range.
cmap = plt.cm.BrBG
for s in statenames:
## Write you code below
######
colors
Draw color map and set figure options:¶
In [ ]:
plt.figure(figsize=(20,10))
for nshape,seg in enumerate(m.states):
xx,yy = zip(*seg)
if statenames[nshape] != 'District of Columbia' and \
statenames[nshape] != "Puerto Rico":
color = rgb2hex(colors[statenames[nshape]])
plt.fill(xx,yy,color,edgecolor=color)
# draw meridians and parallels.
m.drawparallels(np.arange(25,65,20), labels=[0,0,0,0], zorder=-1,color="w")
m.drawmeridians(np.arange(-120,-40,20), labels=[0,0,0,0], zorder=-1,color="w")
# set up colorbar:
mm = plt.cm.ScalarMappable(cmap=cmap)
mm.set_array([0,20]) # Replace this with Min and Max values in your data
plt.colorbar(mm, label="Mobile homes",
ticks=[0,5,10,15,20],# Replace this with your ranges
orientation="horizontal", fraction=0.05,
)
plt.title('Filling State Polygons by Population Density')
plt.gca().axis("off")
plt.show()
We can see the lighter colours indicate density of mobile homes, looks like they're popular in the South & South East,
not so much in the North East (too cold?)