
[320] Web 1: Selenium

[Figure: a graph of web pages: Page A, Page B, Page C, Page D]

how to scrape a webpage graph?

how to scrape a complicated page?

requests module (220)
selenium module (320)

Review: Document Object Model

url: http://domain/rsrc.html

HTTP Response:

HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 74

[HTML body: an h1 heading "Welcome" and two links, "About" and "Contact"]

What does a web browser do when it gets some HTML in an HTTP response?


before displaying a page, the browser uses the HTML to generate a Document Object Model (DOM tree)


html
    body
        h1
        a
        a

vocab: elements (each node in the tree, such as html, body, h1, or a)

Elements may contain
• attributes (each a element has an href attribute)
• text (Welcome, About, Contact)
• other elements (the containing element is the parent and the contained element is the child; for example, body is the parent of h1)
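The parent/child structure described here can be reproduced in a few lines of Python. The sketch below is only an illustration: it uses the standard library's html.parser rather than a real browser, the Node and TreeBuilder names are made up, and the HTML string (including the about.html and contact.html link targets) is a hypothetical stand-in for rsrc.html.

```python
from html.parser import HTMLParser

# Hypothetical page: the link targets are assumptions for illustration.
HTML = ('<html><body><h1>Welcome</h1>'
        '<a href="about.html">About</a>'
        '<a href="contact.html">Contact</a></body></html>')


class Node:
    """One element of the tree: a tag with attributes, text, and children."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.text = ""
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every start tag has a matching end tag."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        if self.current is None:
            self.root = node
        else:
            self.current.children.append(node)
        self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if self.current is not None:
            self.current = self.current.parent  # climb back to the parent

    def handle_data(self, data):
        if self.current is not None:
            self.current.text += data


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def describe(node, depth=0):
    """Return one indented line per element, children below their parent."""
    line = "  " * depth + node.tag
    if node.attrs:
        line += " attrs=" + str(node.attrs)
    if node.text:
        line += " text=" + repr(node.text)
    lines = [line]
    for child in node.children:
        lines.extend(describe(child, depth + 1))
    return lines


tree = build_tree(HTML)
print("\n".join(describe(tree)))
```

Running it prints html at the top, body indented beneath it, and h1, a, a indented beneath body, with each a carrying its href attribute and each leaf carrying its text.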


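JavaScript (if there's an engine to execute it) may directly edit the DOM! For example, a script might add a table element to the tree.

the original .html file doesn't change, but the resulting DOM is the same as if the file had contained the table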


challenge: the requests module only grabs the original HTML, so it misses elements that JavaScript adds later
import requests
resp = requests.get(…)
print(resp.text)

the browser renders (displays) the DOM tree, based on the original file and any JavaScript changes

we need a JavaScript engine so we can scrape the generated table
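To make the problem concrete, here is a small sketch using only the standard library. Both HTML strings are invented for illustration: STATIC_HTML plays the role of resp.text (what the server sent), and RENDERED_HTML plays the role of the DOM written back out as HTML after a JavaScript engine has run the script. The count_tag helper is hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for resp.text: the file has no table element;
# the script would add one, but only when a JavaScript engine runs it.
STATIC_HTML = """<html><body><h1>Welcome</h1>
<script>
  document.body.appendChild(document.createElement("table"));
</script>
</body></html>"""

# Hypothetical stand-in for the page after the script has run.
RENDERED_HTML = STATIC_HTML.replace("</body>", "<table></table>\n</body>")


class TagCounter(HTMLParser):
    """Counts start tags; text inside <script> is treated as data, not tags."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts.get(tag, 0)


print(count_tag(STATIC_HTML, "table"))    # 0: nothing to scrape yet
print(count_tag(RENDERED_HTML, "table"))  # 1: the table exists after JavaScript runs
```

No amount of parsing recovers the table from the static string, because the table is only described by code that has not executed.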

Web Scraping: Simple and Complicated

requests vs. Selenium

computer 1 (laptop)
Jupyter: import requests
         r = requests.get(…)

request: index.html, please [GET]
response: Hello

computer 2 (Virtual Machine), IP address: 18.216.110.65
Web Server

requests module
– can fetch .html, .js, etc. files

Selenium
– can fetch .html, .js, etc. files
– can run a .js file in a browser
– can grab the HTML version of the DOM after JavaScript has modified it

With Selenium, computer 1 (laptop) runs this instead:

from selenium import webdriver
driver = webdriver.Chrome()

Python talks to chromedriver, which controls a Chrome browser. The browser sends the GET request for index.html to the Web Server, then requests whatever else the page needs (A.png, please [GET], …).

note: Selenium is most commonly used for testing websites, but it works great for tricky scraping too

Tricky Pages

https://tyler.caraza-harter.com/cs320/tricky/scrape.html

Installing: Selenium, Chrome, Driver

Selenium Install (Ubuntu 20.04)


sudo apt -y install chromium-browser

pip3 install selenium

trh@instance-1:~$ chromium-browser --version
/usr/bin/chromium-browser: 12: xdg-settings: not found
Chromium 94.0.4606.81 snap
trh@instance-1:~$ chromium.chromedriver --version
ChromeDriver 94.0.4606.81 (5a03c5f1033171d5ee1671d219a…

Check…

https://github.com/cs320-wisc/f21/tree/main/p3#part-2-web-crawling
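One plausible reading of the check is to compare the two version numbers printed by the commands above. As a sketch, that comparison could be automated as follows; the function names are hypothetical, and treating a matching major version as good enough is an assumption rather than something these slides state.

```python
import re


def parse_version(output):
    """Return the first dotted version number in the text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", output)
    if match is None:
        raise ValueError("no version number found in: " + repr(output))
    return tuple(int(part) for part in match.group(0).split("."))


def same_major(browser_output, driver_output):
    """True if the browser and driver report the same major version."""
    return parse_version(browser_output)[0] == parse_version(driver_output)[0]


# Output copied from the terminal session above.
browser_out = "Chromium 94.0.4606.81 snap"
driver_out = "ChromeDriver 94.0.4606.81 (5a03c5f1033171d5ee1671d219a…"

print(parse_version(browser_out))           # (94, 0, 4606, 81)
print(same_major(browser_out, driver_out))  # True
```

The pattern requires at least one dot, so a stray number such as the 12 in the xdg-settings warning is skipped. The strings could also be captured with subprocess.run([...], capture_output=True, text=True).stdout instead of being pasted in.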

on some systems, chromedriver is installed separately

https://chromedriver.chromium.org/downloads

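Why Drivers?

Each language has its own Selenium module:
• Python module for Selenium
• Java module for Selenium
• Ruby module for Selenium
• JavaScript module for Selenium

Each browser has its own driver:
• Chrome Driver
• Firefox Driver
• Edge Driver

Any language module can talk to any driver, so there is no need for a separate module for every language/browser pair.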

Examples

Starter Code
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

options = Options()
#options.add_argument("--headless")  # uncomment to run without a visible window
b = webdriver.Chrome(options=options)  # open browser window

b.get(????)  # go to a URL (replace ???? with a URL string)

print(b.page_source)  # get HTML for current page (including JavaScript changes)

try:
    # search for an element with id=???? (set element_id to the id string)
    elem = b.find_element(By.ID, element_id)
    print("found it")
except NoSuchElementException:  # no such element
    print("couldn't find it")

b.close()
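Once page_source returns the post-JavaScript HTML, the table still has to be pulled out of that string. Below is one possible sketch using only the standard library. The TableRows and extract_rows names and the SAMPLE string are made up for illustration, and a real page may call for a more robust parser.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every td/th cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])         # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")     # start a new cell in the current row
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data    # text outside cells is ignored


def extract_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


# Hypothetical stand-in for what page_source might return.
SAMPLE = ("<html><body><h1>Welcome</h1><table>"
          "<tr><th>x</th><th>y</th></tr>"
          "<tr><td>1</td><td>2</td></tr>"
          "</table></body></html>")

print(extract_rows(SAMPLE))  # [['x', 'y'], ['1', '2']]
```

With a live browser object b, the call would be extract_rows(b.page_source).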

Example 1a: Late Loading Table (page1.html)

https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page1.html

the table is added after 1 second, so it is not in the page when it first loads
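Because the table shows up late, reading the page source immediately may miss it. One way to cope is to poll until the table appears or a timeout expires. The helper below is a minimal sketch, not the course's method: wait_for_table is a hypothetical name, the substring test for "<table" is a deliberately simple check, and the demo feeds it canned strings so it runs without a browser.

```python
import time


def wait_for_table(get_source, timeout=5.0, poll=0.1):
    """Call get_source() repeatedly until the HTML it returns contains a table.

    Returns the HTML that contained the table, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = get_source()
        if "<table" in html.lower():
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("no table appeared within %.1f seconds" % timeout)
        time.sleep(poll)


# Demo without a browser: a hypothetical page whose source gains a table
# on the third look.
snapshots = iter(["<html></html>",
                  "<html></html>",
                  "<html><table></table></html>"])
html = wait_for_table(lambda: next(snapshots), timeout=2.0, poll=0.01)
print("<table" in html)  # True
```

With a live browser object b, the call would be wait_for_table(lambda: b.page_source). Selenium also ships its own waiting helper, WebDriverWait in selenium.webdriver.support.ui, which may be a better fit in real code.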

Example 1b: Headless Mode and Screenshots
from IPython.display import Image
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
b = webdriver.Chrome(options=options)

b.get(????)  # replace ???? with a URL string

b.save_screenshot("out.png")  # save an image of the rendered page
b.close()

Image("out.png")  # as the last line of a Jupyter cell, this displays the screenshot

Example 2: Auto-Clicking Buttons
https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page2.html

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
b = webdriver.Chrome(options=options)

b.get(????)  # replace ???? with a URL string

btn = b.find_element(By.ID, "BTN_ID")  # replace BTN_ID with the button's id
btn.click()  # auto click

b.close()

Example 3: Entering Passwords
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
b = webdriver.Chrome(options=options)

b.get(????)  # replace ???? with a URL string

pw = b.find_element(By.ID, "pw")  # find the element with id="pw"
pw.send_keys("fido")  # type the password into it

b.close()

https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page3.html

Example 4: Many Queries

https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page4.html
