5AAVC210 Introduction to Programming WEEK 5
File handling
You will be using Python to process data, which will involve reading, writing, or manipulating that data.
The data might come from a web page, be stored in a text file, or be numbers in a spreadsheet.
Python can handle a number of different file formats, e.g. text, CSV, HTML and JSON.
Opening files
To open a file in Python, we first need to associate the file on disk with a variable in Python.
We begin by telling Python where the file is. The location of your file is often referred to as the file path.
We then use Python’s open() function to open that file.
The open() function requires the file path as its argument.
The file can then be read using f.read() and the contents printed.
f is now the variable associated with the file.
We create this variable (known as a file object) by calling the open() method with:
the file path (to a file that already exists)
the mode argument, i.e. what kind of access to the file do you want?
The command print(f.read()) then tells Python to read the file that has been opened and print the contents, e.g.:
This is the content of demofile.txt
Here, the file object f opens a txt file in read mode, and the contents are returned and printed to the screen.
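Putting this together, a minimal sketch, assuming a file named demofile.txt exists in the working directory:

# Open the file in read mode, print its contents, then close it
f = open("demofile.txt", "r")
print(f.read())
f.close()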
The mode arguments
Mode is an optional string that specifies the mode in which the file is opened. The mode you choose will depend on what you wish to do with the file. Here are some of the mode options:
"r" – use for reading
"w" – use for writing
"x" – use for creating and writing to a new file
"a" – use for appending to a file
"r+" – use for reading and writing to the same file
In this example, we only want to read from the file, so we will use the "r" mode.
Reading a file
Python provides three related operations for reading information from a file:
The first operation, f.read(), reads the entire contents of the file and returns them as a single string.
The second operation, f.readline(), reads a single line from the file each time it is called.
The last operation, f.readlines(), reads all the lines of the file and returns them as a list of strings.
NB: once a file has been read using one of the read operations, it cannot be read again without reopening the file (or moving the file cursor back to the start with f.seek(0)), because the cursor is left at the end of the file.
E.g., if you were to first run f.read() and then call f.readlines() on the same file object, the second call would return an empty list.
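A minimal sketch of the three operations, again assuming demofile.txt exists:

# Read the whole file as one string
f = open("demofile.txt", "r")
print(f.read())
f.close()

# Read line by line instead
f = open("demofile.txt", "r")
print(f.readline())   # the first line only
print(f.readlines())  # the remaining lines, as a list of strings
f.close()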
Writing to a file
To write to an existing file, you must add a parameter to the open() function:
"a" – Append – will append to the end of the file
"w" – Write – will overwrite any existing content
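For example, a minimal sketch (the file name and text are illustrative):

# Append a line to the end of the existing file
f = open("demofile.txt", "a")
f.write("Now the file has more content!\n")
f.close()

# Open in write mode: this replaces any existing content
f = open("demofile.txt", "w")
f.write("The old content has been overwritten.")
f.close()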
Create a file
To create a new file in Python, use the open() method, with one of the following parameters:
"x" – Create – will create a file, returns an error if the file exists
"a" – Append – will create a file if the specified file does not exist
"w" – Write – will create a file if the specified file does not exist
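A minimal sketch (the file name is illustrative):

# "x" creates a new empty file; raises FileExistsError if it already exists
f = open("myfile.txt", "x")
f.close()

# "a" and "w" will also create the file if it does not already exist
f = open("myfile.txt", "a")
f.close()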
Python Libraries
In Python, a library is a collection of pre-written code (modules and functions) that allows you to perform many actions without writing your own code.
Each library in Python contains a huge number of useful modules that you can import for your everyday programming.
The main Python website (https://www.python.org/) links to PyPI (the Python Package Index, https://pypi.org/), where you can explore the available Python libraries and find instructions for downloading and installing them.
We are going to use the library Beautiful Soup (https://pypi.org/project/beautifulsoup4/).
HTML (HYPERTEXT MARKUP LANGUAGE)
HTML is a markup language used to structure content for web pages.
HTML is the syntax: the structure rather than the appearance
CSS (Cascading Style Sheets) is the major way to add style to structured content
HTML documents are highly structured hierarchies with nested tags that wrap up content.
Using Python, we can download an HTML document and open it for further manipulation.
Viewing the HTML of a web page
Safari
Click on Safari menu > Preferences > Advanced
Check “Show Develop menu in menu bar”
Close the Preferences window
Go to the Develop menu > Show Page Source
Chrome
Click on View
Click on Developer > View Source
Or Ctrl-click (right-click) anywhere on the page > View Page Source
Use the inspect tool
Right-click anywhere on the webpage, and at the very bottom of the menu that pops up, you will see “Inspect.” …
BASIC HTML5
The <head> element holds all the metadata. It is mainly meant to communicate with machines; we don't see it on the web page.
The <body> element holds all content that is displayed to viewers by browsers. Other nested tags are used inside it, such as <p> for paragraphs.
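A minimal HTML5 skeleton illustrating this structure (the title and paragraph text are illustrative):

<!DOCTYPE html>
<html>
  <head>
    <title>Page title (metadata)</title>
  </head>
  <body>
    <p>A paragraph displayed to viewers.</p>
  </body>
</html>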
Parent and child relationships in HTML
A tag nested inside another tag is its child; in the skeleton above, <head> and <body> are children of <html>, and the <p> paragraph is a child of <body>.
What is web scraping?
Web scraping is used to extract or “scrape” data from any web page on the Internet.
We could copy and paste by hand but that would get very boring, very quickly, and isn’t feasible for large amounts of data.
A web scraper is a software program that is used to download the contents (usually text-based and formatted as HTML) of multiple web pages and then extract data from them.
Web scraping makes data accessible to all kinds of applications and uses.
Web scraper vs. web crawler
A web scraper is built specifically to handle the structure of a particular website. The scraper then uses this site-specific structure to extract individual data elements from the website. The data elements could be names, addresses, prices, images, etc.
Web crawling mostly refers to downloading and storing the contents of a large number of websites by following links in web pages. Search engines depend heavily on web crawlers.
Web scraping steps
Go to target website
Parse the website for the data needed
Extract the data
‘Clean-up’ and store the data
The data can then be used for analysis
When web scraping…
Before scraping, check if there is a public API available. Public APIs provide easier, faster (and legal) data retrieval than web scraping. For example, Twitter provides APIs for different purposes.
Many sites prohibit you from using the data for commercial purposes.
Be polite. Do not overload the website by sending hundreds of requests per second.
The scraping rules of a website can be found in its robots.txt file. You can find it by writing robots.txt after the main domain, e.g. www.website_to_scrape.com/robots.txt. These rules identify which parts of the website are not allowed to be automatically extracted, or how frequently a bot is allowed to request a page.
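For example, a robots.txt file might contain rules like these (an illustrative sketch, not taken from a real site):

User-agent: *
Disallow: /private/
Crawl-delay: 10

Here User-agent: * means the rules apply to all bots, Disallow blocks automatic extraction of the /private/ section, and Crawl-delay asks bots to wait 10 seconds between requests.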
Requests
Requests is a Python library that allows you to make requests to a web page.
Libraries first have to be installed (you’ll get help to do this in the tutorial this week – you’ll need the requests, BeautifulSoup, and statistics libraries).
Once it is installed, to use a library, you have to import it. You can do this simply by adding the following code at the beginning of your script:
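import requests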
Making a request
When you ping a website or portal for information this is called making a request. That is exactly what the Requests library has been designed to do.
To get a webpage, we can use a GET request. GET is a request to a web server to get data.
We create a variable my_page which will store the response. We use the requests.get() method since we are sending a GET request.
my_page = requests.get('https://www.xxxx/index.asp')
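A minimal sketch of inspecting the response (the URL is a stand-in; status_code and text are attributes of the response object):

import requests

# Send a GET request and inspect the response
my_page = requests.get('https://example.com/index.html')
print(my_page.status_code)  # 200 means the request succeeded
print(my_page.text[:200])   # first 200 characters of the HTML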
Beautiful soup
BeautifulSoup is a library that allows you to parse the HTML source code in a beautiful way.
It is a Python library for pulling data out of HTML and XML files.
Parsing means breaking text up into its component parts so that data can be extracted and analysed.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import requests
from bs4 import BeautifulSoup
# Collect first page of DDH staff list
page = requests.get('https://www.xxxxx/about/people.aspx')
We will collect the contents of a web page with Requests. We'll assign the response for the first page to the variable page by using the method requests.get().
We’ll now create a BeautifulSoup object, or a parse tree.
This object takes as its arguments the page.text document from Requests (the content of the server's response) and then parses it with Python's built-in html.parser.
import requests
from bs4 import BeautifulSoup
# Collect first page of DDH staff list
page = requests.get('https://www.xxxxx/about/people.aspx')
# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
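Once the soup object exists, its methods can pull individual elements out of the parse tree. A minimal sketch (the choice of the <p> tag is illustrative, not based on the actual page):

# Find every <p> tag in the parsed page and print its text
for p in soup.find_all('p'):
    print(p.get_text())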
https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3
Questions?