Tutorial 10A. Web Scraping¶
In this part we cover scraping data from the web. Data can be presented as HTML, XML, or through an API. Web scraping is the practice of using libraries to sift through a web page and gather the data you need in a format most useful to you, while preserving the structure of the data.
There are several ways to extract information from the web, and using an API is probably the best of them. Almost all large websites, such as Twitter, Facebook, Google, and StackOverflow, provide APIs to access their data in a structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping. However, not all websites provide an API, so we sometimes need to scrape the HTML pages to fetch the information.
Python libraries needed in this tutorial include
urllib (part of the standard library)
beautifulsoup4 (imported as bs4)
requests
In [ ]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
Task 1 Extract a list of links on a Wikipedia page.¶
Instead of retrieving all the links existing in a Wikipedia article, we are interested in extracting links that point to other article pages. If you look at the source code of the following page
https://en.wikipedia.org/wiki/Kevin_Bacon
in your browser, you will find that all these links have three things in common:
They are in the div with id set to bodyContent
The URLs do not contain colons
The URLs begin with /wiki/
We can use these rules to construct our search through the HTML page.
First, use the urlopen() function to open the Wikipedia page for "Kevin Bacon":
In [ ]:
html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
Then, find and print all the links. In order to finish this task, you need to
find the div whose id = “bodyContent”
find all the link tags whose href starts with "/wiki/" and does not contain ":", for example the links "see Kevin Bacon (disambiguation)" and "Philadelphia"
Hint: a regular expression is needed.
In [ ]:
bsobj = BeautifulSoup(html, "lxml")
# write your code below
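One possible sketch, assuming the three rules listed above (the regular expression simply rejects any href that contains a colon):

# locate the div with id="bodyContent", then keep only the <a> tags whose
# href starts with "/wiki/" and contains no colon
body = bsobj.find("div", {"id": "bodyContent"})
for link in body.find_all("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    print(link.attrs["href"])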
Task 2 Perform a random walk through a given webpage.¶
Assume that we want to find a random Wikipedia article that is linked to "Kevin Bacon" through the so-called "Six Degrees of Wikipedia". In other words, the task is to find two subjects linked by a chain containing no more than six subjects (including the two original subjects).
In [ ]:
import datetime
import random
random.seed(datetime.datetime.now().timestamp())
In [ ]:
def getLinks(articleUrl):
    # write your code here
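One way to fill in the body, mirroring the Task 1 rules (the "lxml" parser is just one choice; "html.parser" works as well):

def getLinks(articleUrl):
    # open the article, parse it, and return its in-article "/wiki/" links
    html = urlopen("https://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    return bsObj.find("div", {"id": "bodyContent"}).find_all(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))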
In [ ]:
links = getLinks("/wiki/Kevin_Bacon")
The details of the random walk along the links are
Randomly choosing a link from the list of retrieved links
Printing the article represented by the link
Retrieving a list of links
Repeat the above steps until the number of retrieved articles reaches 5.
In [ ]:
count = 0
while len(links) > 0 and count < 5:
    # Write your code here
    ######
    count = count + 1
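One way the completed loop could look, assuming getLinks() returns a list of <a> tags as sketched above:

count = 0
while len(links) > 0 and count < 5:
    # pick a random link, print the article it points to,
    # then fetch that article's own links for the next step of the walk
    newArticle = random.choice(links).attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
    count = count + 1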
Task 3 Crawl the entire Wikipedia website¶
The general approach to an exhaustive site crawl is to start with the root, i.e., the home page of a website. Here, we will start with
https://en.wikipedia.org/
by retrieving all the links that appear on the home page, and then traversing each link recursively. However, the number of links is going to be very large, and a link can appear in many Wikipedia articles. Thus, we need to consider how to avoid repeatedly crawling the same article or page. To do so, we can keep a running set of visited pages for easy lookups and slightly update the getLinks() function.
In [ ]:
pages = set()
Note: add a terminating condition in your code, for example,
len(pages) < 10
Otherwise, the script will run through the entire Wikipedia website, which will take a long time to finish. So please avoid that in the tutorial class.
In [ ]:
def getLinks(pageUrl):
    global pages
    # Write your code here
In [ ]:
getLinks("")
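A recursive sketch with the suggested terminating condition (len(pages) < 10); the "html.parser" backend and the https prefix are implementation choices, not requirements:

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("https://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.find_all("a", href=re.compile("^(/wiki/)")):
        newPage = link.attrs["href"]
        if newPage not in pages and len(pages) < 10:
            # an unseen page: record it, print it, and recurse into it
            pages.add(newPage)
            print(newPage)
            getLinks(newPage)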
Task 4 Collect data across the Wikipedia site¶
One purpose of traversing all the links is to extract data. The best practice is to look at a few pages from the site and determine the patterns. By looking at a handful of Wikipedia pages, both article and non-article pages, the following patterns can be identified:
All titles are under h1 -> span tags, and these are the only h1 tags on the page. For example, "Kevin Bacon" and "Main Page".
All body text lives under the div#bodyContent tag. However, if we want to get more specific and access just the first paragraph of text, we might be better off using div#mw-content-text -> p.
Edit links occur only on article pages. If they occur, they will be found in the li#ca-edit tag, under li#ca-edit -> span -> a
Now, the task is to further modify the getLinks() function to print the title, the first paragraph, and the edit link. The content from each page should be separated by
print("----------------\n" + newPage)
In [ ]:
pages = set()
Please also add a terminating condition in your code, for example,
len(pages) < 5
Otherwise, the script will run through the entire Wikipedia website, which will take a long time to finish. So please avoid that in the tutorial class.
In [ ]:
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Write your code here
In [ ]:
getLinks("")
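One possible completion, assuming the three patterns listed above and the suggested cap of len(pages) < 5; the try/except guards non-article pages that lack a first paragraph or an edit link, and the separator follows the print() statement shown earlier:

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    try:
        # title, first paragraph and edit link, following the patterns above
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").find("p").get_text())
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs["href"])
    except AttributeError:
        print("This page is missing something! Continuing.")
    for link in bsObj.find_all("a", href=re.compile("^(/wiki/)")):
        newPage = link.attrs["href"]
        if newPage not in pages and len(pages) < 5:
            # an unseen page: print the separator, record it and recurse
            print("----------------\n" + newPage)
            pages.add(newPage)
            getLinks(newPage)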
Task 5 API access¶
In addition to HTML format, data is commonly found on the web through public APIs. We use the 'requests' package (http://docs.python-requests.org) to call APIs using Python. In the following example, we call a public API for collecting weather data.
You need to sign up for a free account to get your unique API key to use in the following code. Register at http://api.openweathermap.org.
In [ ]:
#Now we use requests to retrieve the web page with our data
import requests
url = 'http://api.openweathermap.org/data/2.5/forecast?id=524901&cnt=16&APPID=YOUR_API_KEY'
# replace YOUR_API_KEY in the url above with your own APPID
response = requests.get(url)
response
The response object contains the GET query response. A successful one has a status code of 200. We then need to parse the response as JSON to extract the information.
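If you prefer the script to fail immediately when the key or URL is wrong, requests can raise an exception for any non-2xx response:

# raises requests.exceptions.HTTPError if the status code indicates an error
response.raise_for_status()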
In [ ]:
#Check the HTTP status code https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
print (response.status_code)
In [ ]:
# response.content is text
print (type(response.content))
In [ ]:
#response.json() converts the content to json
data = response.json()
print (type(data))
In [ ]:
data.keys()
In [ ]:
data
The keys explain the structure of the fetched data. Try displaying the values for each key. In this example, the weather information is stored under 'list'.
In [ ]:
data['list'][15]
The next step is to create a DataFrame with the weather information, as demonstrated below. You can select a subset of columns to display, or display the entire table.
In [ ]:
from pandas import DataFrame
# data with the default column headers
weather_table_all= DataFrame(data['list'])
weather_table_all
Discussion:¶
Further parsing is still required to get the table (DataFrame) into a flat shape. Now it's your turn: parse the weather data to generate such a table.
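One possible starting point (assuming pandas >= 1.0 is available): pandas.json_normalize() expands the nested dictionaries in data['list'] into flat, dot-separated columns.

import pandas as pd

# flatten nested fields such as 'main' and 'wind' into columns
# like 'main.temp' and 'wind.speed'
weather_flat = pd.json_normalize(data['list'])
weather_flat.head()

Columns such as main.temp or wind.speed (if present in your response) can then be selected directly, e.g. weather_flat[['dt_txt', 'main.temp', 'wind.speed']].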
Please note that some materials used in this tutorial are partially adapted from the book "Web Scraping with Python".
In [ ]: