Jupyter Notebooks¶
Jupyter is a nod to three languages: Julia, Python, and R. Source: @jakevdp.
This document that you’re currently reading is a “Jupyter Notebook”. It’s like a text document, but you can run code in it!
It can also display inline graphs:
In [1]:
from utils import plot_sine
%matplotlib inline
plot_sine()

Pull data from databases or display Excel spreadsheets live!
In [2]:
import pandas as pd
df = pd.read_csv("./data/US_Accidents_May19_truncated.csv")
df.head()
Out[2]:

[Output: the first five rows of the DataFrame, spanning columns ID, Source, TMC, Severity, Start_Time, End_Time, Start_Lat, Start_Lng, End_Lat, End_Lng, …, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight]
5 rows × 49 columns

It even renders equations (through LaTeX), such as

$$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

It can also present interactive visualizations and other rich output.

Even gifs!

Isn’t it amazing? 😄

Part 1: everything is a cell¶
Jupyter Notebooks are structured as a collection of “cells”. Each cell can contain a different type of content: Python code (or R, etc.), images, or even human-readable text (Markdown), like the cell you’re currently reading.
I’ve left a couple of empty cells below for you to play around with:
In [ ]:

In [ ]:

In [ ]:

You can create new cells by clicking the plus sign above

Try it now!

Human Readable (Markdown) Cell¶
This is a cell containing Markdown (human-readable) code. Markdown is a plain-text formatting syntax aimed at making writing easier.
It lets you incorporate rich text formatting, for example bold, text in italics, lines of code, or ~~Scratch this~~ (as well as headings and lists, shown below), through very simple syntax.
Headings like:
h1¶
h2¶
or lists:
• list 1
• list 2
• list 3
A more comprehensive list of Markdown syntax is available here
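For instance, the formatting shown above comes from syntax like this:

```markdown
# h1
## h2

**bold**  *text in italics*  `lines of code`  ~~Scratch this~~

- list 1
- list 2
- list 3
```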

Double clicking on any cell will allow you to edit its content. Double click here to see

It will look something like this:

If it does then you’ve correctly entered “Edit Mode” for a given cell. Once you’ve made the changes, you have to “execute”, or “run” the cell to reflect the changes. To do that just click on the little play button on the top menu bar:

Jupyter notebooks are optimized for an efficient workflow. There are many keyboard shortcuts that will let you interact with your documents, run code and make other changes; mastering these shortcuts will speed up your work. For example, there are two shortcuts to execute a cell:
1. shift + return: Run cell and advance to the next one.
2. ctrl + return: Run the cell but don’t change focus.
Try them with the following cell:
In [3]:
2 + 2
Out[3]:
4

You can execute these cells as many times as you want; it won’t break anything.

Part 2: Working with code¶
Jupyter notebooks are great at combining text and images into beautiful, human-readable documents, as you’ve just seen. But their main strength is working with code.
Now we’re going to import a few libraries and start experimenting with Python code. We’ve already done the simple 2 + 2 before, so let’s do something a little more interesting. First, we need to import numpy and matplotlib:

You probably won’t have these libraries installed on your computer, so let’s install them before moving forward.
For numpy, open a terminal and run pip install numpy, or, if you are using an Anaconda environment, conda install numpy.
For matplotlib, same thing: open a terminal and run pip install matplotlib or conda install matplotlib.
We’ve just installed and imported these two libraries:
• numpy, the most popular Python library for array manipulation and numeric computing
• matplotlib, the most popular visualization library in the Python ecosystem
Let’s now execute a few lines of code just to make sure your environment is ready:
In [1]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
print(np.version.version)
print(matplotlib.__version__)

1.16.2
3.0.3

If the lines above work, then we are good to go.
Let’s try to do some data science!
In [2]:
# generate evenly spaced numbers over a given interval
# here x = [0, 0.204, 0.408, …, 9.796, 10]
# a total of 50 datapoints (the step is 10/49, since 50 points have 49 gaps)
x = np.linspace(0, 10, num=50)

# generate random y datapoints that correspond to x for 6 tracks
y = np.cumsum(np.random.randn(50, 6), 0)

# visualize the plot
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left')
Out[2]:

But what is that 😱? Just randomly generated datapoints (see the comments for details), but you can clearly see how simple it is to do numeric processing and plotting in a Jupyter Notebook.
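A quick way to convince yourself of what those two calls produce (a minimal check, no plotting needed):

```python
import numpy as np

# 50 evenly spaced points from 0 to 10: the step is 10/49 ≈ 0.204, not 0.2
x = np.linspace(0, 10, num=50)
print(x[0], x[-1], round(x[1] - x[0], 3))  # 0.0 10.0 0.204

# cumsum turns 6 columns of white noise into 6 random walks
y = np.cumsum(np.random.randn(50, 6), 0)
print(y.shape)  # (50, 6)
```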
Now we’re cookin!

Part 3: Interacting with data¶
Jupyter Lab makes it really simple to interact with files in your local storage.
Let’s first take a look at the dataset we have in ./data. It’s a truncated version of the us-accidents dataset from Kaggle, a countrywide traffic accident dataset that covers 49 US states.
Just for demonstration purposes, I have randomly sampled 500 records; the original dataset consists of 2.25 million records. If you are interested in the full dataset, you can find it here: us-accidents.
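For reference, such a truncation is a one-liner in pandas. A sketch on a toy frame (the column values and random_state here are illustrative, not the ones used to build the bundled file):

```python
import numpy as np
import pandas as pd

# toy stand-in for the full 2.25M-record dataset
full = pd.DataFrame({
    "ID": np.arange(10_000),
    "Severity": np.random.randint(1, 5, size=10_000),
})

# draw 500 records uniformly at random, as was done for the truncated CSV
truncated = full.sample(n=500, random_state=0)
print(len(truncated))  # 500
```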

In order to access the dataset and process it, we only need a single library: pandas. Let’s import it here.
In [2]:
import pandas as pd

Let’s have a look at the dataset. Use pandas to read the .csv file and display the top 5 rows.
In [3]:
df = pd.read_csv("./data/US_Accidents_May19_truncated.csv")
# display top 5 rows
df.head()
Out[3]:

[Output: the first five rows of the DataFrame, spanning columns ID, Source, TMC, Severity, Start_Time, End_Time, Start_Lat, Start_Lng, End_Lat, End_Lng, …, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight]
5 rows × 49 columns

pandas provides a fairly simple yet informative function call: describe. It generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
Take a look at the describe documentation for more details.
Let’s try to understand the data in terms of some common descriptive statistics by running the command below.
In [4]:
df.describe()
Out[4]:

[Output: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for the 15 numeric columns: TMC, Severity, Start_Lat, Start_Lng, End_Lat, End_Lng, Distance(mi), Number, Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Speed(mph), Precipitation(in). For example, Severity has count 501, mean ≈ 2.39, min 1 and max 4; Temperature(F) ranges from -14.1 to 111.0.]
In [5]:
import matplotlib.pyplot as plt

# Pie chart, where the slices will be ordered and plotted counter-clockwise:
# explode = [0.2, 0.1, 0.1, 0.1]
# create split
sizes = df.groupby('Severity').size()
print(sizes)
fig1, ax1 = plt.subplots(figsize=(18, 8))  # <-- increase the size of the plot
ax1.pie(sizes, autopct='%1.1f%%', shadow=False, startangle=90)
ax1.set_title('Share of Severity')
ax1.legend(sizes.index)
ax1.axis('equal')  # equal aspect ratio ensures that the pie is drawn as a circle
plt.show()

Severity
1      1
2    321
3    163
4     16
dtype: int64

Looks like the slice for Severity = 1 is far too small. Let’s remove it: sort in descending order and pick the top 3.
In [7]:
sizes = df.groupby('Severity') \
          .size() \
          .sort_values(ascending=False) \
          .iloc[:3]

fig1, ax1 = plt.subplots(figsize=(18, 8))  # <-- increase the size of the plot
ax1.pie(sizes, autopct='%1.1f%%', shadow=False, startangle=90)
ax1.set_title('Top 3 Shares of Severity')
ax1.legend(sizes.index)
ax1.axis('equal')  # equal aspect ratio ensures that the pie is drawn as a circle
plt.show()

Let’s take a look at the distribution of wind chill, but this time we will use some of pandas’ DataFrame plotting functions to draw the histogram.
How would you change this plot to show Wind_Chill(F) for each type of severity?
In [8]:
import numpy as np

# this figure/axes configuration is used by the pandas .hist call below
fig, ax = plt.subplots(figsize=(8, 9))

ax.set_title("Histogram of Wind Chill(F)")
ax.set_ylabel("Freq.")
ax.set_xlabel("Wind Chill(F)")
# pandas has some builtin matplotlib plots that are useful for quick analysis
df['Wind_Chill(F)'].hist(bins=20, rwidth=0.95, grid=False, ax=ax)
Out[8]:
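As a hint for the question above: the per-severity binning can be computed with a plain groupby before any plotting. Shown here on a tiny synthetic frame (the values are made up; in the notebook you would group df itself):

```python
import numpy as np
import pandas as pd

# tiny synthetic stand-in for the accidents DataFrame
toy = pd.DataFrame({
    "Severity":      [2,    2,    3,    3,   4,    2],
    "Wind_Chill(F)": [10.0, 12.0, -5.0, 0.0, 20.0, 11.0],
})

# one histogram per severity level, over a shared set of bins
bins = np.linspace(-10, 25, 8)
per_severity = {
    sev: np.histogram(grp["Wind_Chill(F)"].dropna(), bins=bins)[0]
    for sev, grp in toy.groupby("Severity")
}
for sev, counts in per_severity.items():
    print(sev, counts.sum())
```

On the real data, pandas can also do this in one call: df['Wind_Chill(F)'].hist(by=df['Severity']).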

A density plot is a smoothed, continuous version of a histogram estimated from the data. The most common form of estimation is known as kernel density estimation. In this method, a continuous curve (the kernel) is drawn at every individual data point and all of these curves are then added together to make a single smooth density estimation. The kernel most often used is a Gaussian (which produces a Gaussian bell curve at each data point).
In [9]:
import numpy as np

# this figure/axes configuration is used by the pandas plot calls below
fig, ax = plt.subplots(figsize=(8, 9))

df['Wind_Chill(F)'].plot.kde(ax=ax, legend=False, title="Wind Chill(F): histogram and density estimate")
df['Wind_Chill(F)'].plot.hist(density=True, bins=10, rwidth=0.95, ax=ax)
Out[9]:
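The “curve at every data point, then add them up” idea can be written out by hand in a few lines of NumPy. This is only a sketch with a hand-picked bandwidth h; pandas’ .plot.kde delegates the real work to scipy, which chooses the bandwidth automatically:

```python
import numpy as np

data = np.array([-2.0, -1.0, 0.0, 0.5, 3.0])  # illustrative sample
h = 1.0                                       # bandwidth: width of each Gaussian bump

grid = np.linspace(-6, 7, 400)
# one Gaussian centred at each data point...
bumps = np.exp(-0.5 * ((grid[:, None] - data[None, :]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
# ...averaged into a single smooth density estimate
density = bumps.mean(axis=1)

# a density should integrate to ~1 (Riemann sum over the uniform grid)
print(round(float(density.sum() * (grid[1] - grid[0])), 2))  # 1.0
```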


In [10]:
viz_4 = df.plot(kind='scatter', x='Start_Lng', y='Start_Lat',
                label='Severity', c='Severity', cmap=plt.get_cmap('jet'),
                colorbar=True, alpha=0.8, figsize=(15, 10))
# viz_4.legend()
plt.ioff()

This seems to be the only way to install basemap without running into multiple issues. Go ahead and run these, then run the next cell.
Linux:
sudo apt-get install libgeos-3.X.X
sudo apt-get install libgeos-dev
pip install --user https://github.com/matplotlib/basemap/archive/master.zip
Mac:
brew install geos
pip install --user https://github.com/matplotlib/basemap/archive/master.zip
In [11]:
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(15, 10))
m = Basemap(llcrnrlon=-119, llcrnrlat=22, urcrnrlon=-64, urcrnrlat=49,
            projection='lcc', lat_1=32, lat_2=45, lon_0=-95)

long = df['Start_Lng'].tolist()
lat = df['Start_Lat'].tolist()

m.bluemarble(scale=1)  # satellite view of the map
m.drawcoastlines(color="white", linewidth=0.3)

x, y = m(long, lat)  # project lon/lat into map coordinates

plt.scatter(x, y, c=df['Severity'].tolist(), cmap=plt.get_cmap('jet'), alpha=0.8)
plt.colorbar()
Out[11]:

Install bokeh, if you haven’t already, before running the cell below.
conda install bokeh
or
pip install bokeh
In [35]:
import bokeh.sampledata
bokeh.sampledata.download()

Using data directory: /home/chaitanya/.bokeh/data
Downloading: CGM.csv (1589982 bytes)
[… bokeh downloads and unpacks the rest of its sample data: US_Counties, us_cities, unemployment09, the AAPL/FB/GOOG/IBM/MSFT stock CSVs, the WPP2012 population file, the gapminder files, world_cities, airports, routes, movies.db, and haarcascade_frontalface_default.xml …]
In [36]:
def scale(n, r, t):
    # scale a value n in the source range (rmin, rmax) = (r[0], r[1])
    # to a value in the target range (tmin, tmax) = (t[0], t[1])
    ret = (n - r[0]) / (r[1] - r[0])
    ret = (ret * (t[1] - t[0])) + t[0]
    return ret
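For instance, with county counts ranging from 0 to 100, a county with 50 accidents lands in the middle of the 0–5 colour-index range (the definition is repeated here so the snippet runs on its own):

```python
def scale(n, r, t):
    # map n from the source range (r[0], r[1]) to the target range (t[0], t[1])
    ret = (n - r[0]) / (r[1] - r[0])
    return ret * (t[1] - t[0]) + t[0]

print(scale(50, (0, 100), (0, 5)))   # 2.5
print(scale(0, (0, 100), (0, 5)))    # 0.0
print(scale(100, (0, 100), (0, 5)))  # 5.0
```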

In [38]:
from bokeh.plotting import figure, show, output_notebook

from bokeh.sampledata.us_counties import data as counties
from bokeh.sampledata.us_states import data as states

if "HI" in states:
    del states["HI"]
if "AK" in states:
    del states["AK"]

EXCLUDED = ("ak", "hi", "pr", "gu", "vi", "mp", "as")

The data is in DataFrame objects that pandas could render directly, but since we want interactive visualization we will use the bokeh library. We first exclude Alaska, Hawaii and some of the islands that are part of the US, for a more focused inland map. Then we compute the frequency of accidents per county and assign each county a color based on where its count falls in the overall range. The result is the interactive map you see below.
In [39]:
state_xs = [states[code]["lons"] for code in states]
state_ys = [states[code]["lats"] for code in states]

county_xs = [counties[code]["lons"] for code in counties if counties[code]["state"] not in EXCLUDED]
county_ys = [counties[code]["lats"] for code in counties if counties[code]["state"] not in EXCLUDED]

colors = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043"]

# per-county accident counts, skipping the excluded states
# so that the colors line up with county_xs/county_ys
county_colors = []
county_freq = []
for k, county in counties.items():
    if county["state"] in EXCLUDED:
        continue
    rec = df.loc[df['County'] == county['name']]
    county_freq.append(len(rec))

rmin = np.amin(county_freq)
rmax = np.amax(county_freq)
for count in county_freq:
    s = scale(count, (rmin, rmax), (0, 5))
    c_id = int(np.ceil(s))
    color = colors[c_id]
    county_colors.append(color)

p = figure(title="Accident Frequency per county", toolbar_location="left",
           plot_width=1100, plot_height=700)

p.patches(county_xs, county_ys,
          fill_color=county_colors, fill_alpha=0.7,
          line_color="white", line_width=0.5)

p.patches(state_xs, state_ys, fill_alpha=0.0,
          line_color="#884444", line_width=2, line_alpha=0.3)

output_notebook()

show(p)


Loading BokehJS …

☝️ As you can see, the plot is interactive. Try zooming in and out, and panning around the plot.

If you run into trouble running the above cell, and the error says
Javascript output is disabled in JupyterLab
then run this to fix it:
jupyter labextension install jupyterlab_bokeh

Final words and how to get help¶
That’s it! It’s your turn now to start working and playing around with Jupyter Lab. There are a ton of resources available online for all the things we explored in this notebook.
Below are some key ones.
• https://github.com/jupyter-resources – a collection of videos, talks, and tutorials just for Jupyter.

• https://github.com/stephenh67/python-resources-2019 – a large collection of Python textbooks, notes, video series, etc.

• https://github.com/r0f1/datascience – data science Python libraries such as pandas and matplotlib, and other resources.