[320] Web 1: Selenium
(Diagram: a graph of web pages A, B, C, and D)
how to scrape a webpage graph?
how to scrape a complicated page?
requests module (220)
selenium module (320)
Review Document Object Model
url: http://domain/rsrc.html
HTTP Response
HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 74
Welcome
Browser
What does a web browser do when it gets some HTML in an HTTP response?
Before displaying a page, the browser uses the HTML to generate a Document Object Model (DOM tree).
DOM tree: html contains body; body contains h1, a, and a
vocab: elements
Elements may contain
• attributes (each a has an href attribute)
• text (Welcome in the h1; About and Contact in the a elements)
• other elements (body is the parent, a is its child)
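The vocabulary above (elements, attributes, text, parent, child) can be seen by walking a small document with Python's standard-library html.parser. This is only a sketch: the HTML string and its href values are made up to resemble the slide's page, the Node and TreeBuilder names are hypothetical, and a real browser builds a much richer DOM.

```python
from html.parser import HTMLParser

# Hypothetical HTML resembling the slide's page (not the actual file).
HTML = ('<html><body><h1>Welcome</h1>'
        '<a href="about.html">About</a>'
        '<a href="contact.html">Contact</a></body></html>')

class Node:
    """One element: a tag name, attributes, text, and child elements."""
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.text = ""
        self.parent = parent
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a simplified element tree (assumes properly nested, closed tags)."""
    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        if self.current is None:
            self.root = node
        else:
            self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Closing a tag moves back up to the parent element.
        if self.current is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if self.current is not None:
            self.current.text += data

def show(node, depth=0):
    """Print the tree, one element per line, indented by depth."""
    print("  " * depth + node.tag, node.attrs, repr(node.text))
    for child in node.children:
        show(child, depth + 1)

builder = TreeBuilder()
builder.feed(HTML)
builder.close()
show(builder.root)
```

Each printed line is one element; the indentation shows which element is the parent of which.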
JavaScript (if there’s an engine to execute it) may directly edit the DOM, for example by adding a table element.
The original .html file doesn’t change, but the result is equivalent to a file that contained the new element.
challenge: the requests module only grabs the original HTML, so it misses elements added by JavaScript
import requests
resp = requests.get(…)
print(resp.text)
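The challenge can be reproduced without leaving the laptop. The sketch below serves a made-up page from a local web server and fetches it as plain HTML. It uses urllib.request from the standard library in place of requests, so nothing needs to be installed; requests.get(url).text should return the same static text. The page contents, the "late-table" id, and the fetch_static_html name are all assumptions for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page: the table only exists after a browser runs the script.
PAGE = b"""<html><body>
<h1>Welcome</h1>
<script>
  setTimeout(function () {
    var t = document.createElement("table");
    t.id = "late-table";
    document.body.appendChild(t);
  }, 1000);
</script>
</body></html>"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the output quiet

def fetch_static_html():
    """Serve PAGE on a free local port and download it without running any JavaScript."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        # An empty ProxyHandler makes this local request ignore proxy settings.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as resp:
            return resp.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()

html = fetch_static_html()
print("<script>" in html)  # True: the script's source code was downloaded
print("<table" in html)    # False: nothing ran the script, so there is no table
```

The downloaded text contains the script but not the table, because only a JavaScript engine would run the script and edit the DOM.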
The browser renders (displays) the DOM tree, based on the original file and any JavaScript changes.
(Rendered page: Welcome, About, Contact)
We need a JavaScript engine so we can scrape the generated table.
Web Scraping: Simple and Complicated
requests vs. Selenium
computer 1 (laptop)
Jupyter: import requests
r = requests.get(…)
computer 2 (virtual machine), IP address: 18.216.110.65
Web Server
index.html, please [GET]
Hello
requests module
– can fetch .html, .js, etc. files
Selenium
– can fetch .html, .js, etc. files
– can run a .js file in a browser
– can grab the HTML version of the DOM after JavaScript has modified it
from selenium import webdriver
driver = webdriver.Chrome()
chromedriver
Web Server
note: Selenium is most commonly used for testing websites, but it works great for tricky scraping too
A.png, please [GET]
…
Tricky Pages
https://tyler.caraza-harter.com/cs320/tricky/scrape.html
Installing: Selenium, Chrome, Driver
Selenium Install (Ubuntu 20.04)
computer 1
(laptop)
from selenium import webdriver
driver = webdriver.Chrome()
chromedriver
https://chromedriver.chromium.org/downloads
sudo apt -y install chromium-browser
pip3 install selenium
trh@instance-1:~$ chromium-browser --version
/usr/bin/chromium-browser: 12: xdg-settings: not found
Chromium 94.0.4606.81 snap
trh@instance-1:~$ chromium.chromedriver --version
ChromeDriver 94.0.4606.81 (5a03c5f1033171d5ee1671d219a…
Check…
https://github.com/cs320-wisc/f21/tree/main/p3#part-2-web-crawling
on some systems, chromedriver is installed separately: https://chromedriver.chromium.org/downloads
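One way to act on the two --version outputs is to compare them in code. The sketch below assumes that what needs checking is that the browser and the driver report the same major version (94 in the session shown); the function names are hypothetical, and the input strings are copied from the terminal output, including the truncated driver line.

```python
import re

def major_version(version_line):
    """Return the major part of the first dotted version number, e.g. 94."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_line)
    if match is None:
        raise ValueError(f"no version number in {version_line!r}")
    return int(match.group(1))

def versions_match(browser_line, driver_line):
    """True if the browser and driver report the same major version."""
    return major_version(browser_line) == major_version(driver_line)

browser = "Chromium 94.0.4606.81 snap"
driver = "ChromeDriver 94.0.4606.81 (5a03c5f1033171d5ee1671d219a…"
print(versions_match(browser, driver))  # → True
```

If the two numbers differ, installing a chromedriver build that matches the browser is the usual next step.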
Why Drivers?
Languages: Python, Java, Ruby, JavaScript
Selenium modules: Python module for Selenium, Java module for Selenium, Ruby module for Selenium, JavaScript module for Selenium
Drivers: Chrome Driver, Firefox Driver, Edge Driver
Examples
Starter Code
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

options = Options()
# options.add_argument("--headless")  # uncomment to run without a visible window
b = webdriver.Chrome(options=options)  # open browser window
b.get(url)  # go to a URL (url is left unspecified on the slide)
print(b.page_source)  # get HTML for current page (including JavaScript changes)
try:
    elem = b.find_element(By.ID, element_id)  # search for an element with this id attribute
    print("found it")
except NoSuchElementException:  # no such element
    print("couldn't find it")
b.close()
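Once page_source holds the generated HTML, the table still has to be pulled out of it. The sketch below does that with the standard-library html.parser. The page_source string is a made-up stand-in for what the browser might return, the TableRows and table_rows names are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data

def table_rows(html):
    """Return the table cells in html as a list of rows (lists of strings)."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]

# Hypothetical stand-in for b.page_source after JavaScript has added a table.
page_source = """<html><body><h1>Welcome</h1>
<table id="coords">
  <tr><th>x</th><th>y</th></tr>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
</table></body></html>"""

print(table_rows(page_source))  # → [['x', 'y'], ['1', '2'], ['3', '4']]
```

With a live browser, table_rows(b.page_source) would be the corresponding call.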
Example 1a: Late Loading Table (page1.html)
https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page1.html
The table is added after 1 second.
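Because the table arrives late, a single lookup right after loading the page may fail. One common approach is to retry the lookup until it succeeds or a timeout passes. The sketch below shows that polling pattern with a fake page standing in for the browser, so it runs without Selenium; wait_for and FakePage are hypothetical names.

```python
import time

def wait_for(find, timeout=5.0, interval=0.1, retry_on=(Exception,)):
    """Call find() until it returns without raising, or until timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except retry_on:
            if time.monotonic() >= deadline:
                raise  # give up and surface the last error
            time.sleep(interval)

class FakePage:
    """Stand-in for a page whose table appears some time after loading."""
    def __init__(self, delay):
        self.ready_at = time.monotonic() + delay

    def find_table(self):
        if time.monotonic() < self.ready_at:
            raise LookupError("table not in the DOM yet")
        return "<table>"

page = FakePage(delay=0.3)
start = time.monotonic()
table = wait_for(page.find_table, timeout=2.0, retry_on=(LookupError,))
elapsed = time.monotonic() - start
print(table, f"found after about {elapsed:.1f} seconds")
```

With Selenium, find could be lambda: b.find_element(By.ID, element_id) with retry_on=(NoSuchElementException,), where b and element_id are whatever the scraping code uses. Selenium also ships its own WebDriverWait helper in selenium.webdriver.support.ui for the same purpose.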
Example 1b: Headless Mode and Screenshots
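from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from IPython.display import Image

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
b = webdriver.Chrome(options=options)
b.get(url)  # url is left unspecified on the slide
b.save_screenshot("out.png")  # save an image of the rendered page
b.close()
Image("out.png")  # as the last expression in a notebook cell, this displays the screenshot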
Example 2: Auto-Clicking Buttons
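from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
b = webdriver.Chrome(options=options)
b.get(url)  # url is left unspecified on the slide
btn = b.find_element(By.ID, "BTN_ID")  # "BTN_ID" stands for the button's id attribute
btn.click()
b.close()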
auto click
https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page2.html
Example 3: Entering Passwords
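from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
b = webdriver.Chrome(options=options)
b.get(url)  # url is left unspecified on the slide
pw = b.find_element(By.ID, "pw")  # the password input box
pw.send_keys("fido")  # type the password into the box
b.close()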
https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page3.html
Example 4: Many Queries
https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page4.html