This article looks at how to speed up a Python web scraping and crawling script with multithreading via the concurrent.futures
module. We'll also break down the script itself and show how to test the parsing functionality with pytest.
After completing this article, you will be able to:
- Scrape and crawl websites with Selenium and parse HTML with Beautiful Soup
- Set up pytest to test the scraping and parsing functionalities
- Execute a web scraper concurrently with the concurrent.futures module
- Configure headless mode for ChromeDriver with Selenium
Project Setup
Clone down the repo if you'd like to follow along. From the command line run the following commands:
$ git clone [email protected]:testdrivenio/concurrent-web-scraping.git
$ cd concurrent-web-scraping
$ python -m venv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt
The above commands may differ depending on your environment.
Install ChromeDriver globally. (We're using version 96.0.4664.45).
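If you're unsure which ChromeDriver version is on your PATH, you can check from the command line:
$ chromedriver --version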
Script Overview
The script makes 20 requests to Wikipedia:Random -- https://en.wikipedia.org/wiki/Special:Random -- using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML and extract information about each article.
script.py:
import datetime
import sys
from time import sleep, time

from scrapers.scraper import connect_to_base, get_driver, parse_html, write_to_file


def run_process(filename, browser):
    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")


if __name__ == "__main__":
    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    current_attempt = 1
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"

    # init browser
    browser = get_driver(headless=headless)

    # scrape and crawl
    while current_attempt <= 20:
        print(f"Scraping Wikipedia #{current_attempt} time(s)...")
        run_process(output_filename, browser)
        current_attempt = current_attempt + 1

    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")
Let's start with the main block. After determining whether Chrome should run in headless mode and defining a few variables, the browser is initialized via get_driver()
from scrapers/scraper.py:
if __name__ == "__main__":
    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    current_attempt = 1
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"

    ########
    # here #
    ########

    # init browser
    browser = get_driver(headless=headless)

    # scrape and crawl
    while current_attempt <= 20:
        print(f"Scraping Wikipedia #{current_attempt} time(s)...")
        run_process(output_filename, browser)
        current_attempt = current_attempt + 1

    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")
A while
loop is then configured to control the flow of the overall scraper.
if __name__ == "__main__":
    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    current_attempt = 1
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"

    # init browser
    browser = get_driver(headless=headless)

    ########
    # here #
    ########

    # scrape and crawl
    while current_attempt <= 20:
        print(f"Scraping Wikipedia #{current_attempt} time(s)...")
        run_process(output_filename, browser)
        current_attempt = current_attempt + 1

    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")
Within the loop, run_process()
is called, which manages the WebDriver connection and scraping functions.
def run_process(filename, browser):
    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")
In run_process(), the browser instance is passed to connect_to_base().
def run_process(filename, browser):

    ########
    # here #
    ########

    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")
This function attempts to connect to Wikipedia and then uses Selenium's explicit wait functionality to ensure the element with id='content' has loaded before continuing.
def connect_to_base(browser):
    base_url = "https://en.wikipedia.org/wiki/Special:Random"
    connection_attempts = 0
    while connection_attempts < 3:
        try:
            browser.get(base_url)
            # wait for the element with id = 'content' to load
            # before returning True
            WebDriverWait(browser, 5).until(
                EC.presence_of_element_located((By.ID, "content"))
            )
            return True
        except Exception as e:
            print(e)
            connection_attempts += 1
            print(f"Error connecting to {base_url}.")
            print(f"Attempt #{connection_attempts}.")
    return False
Review the Selenium docs for more information on explicit wait.
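For the snippet above to work, scrapers/scraper.py needs Selenium's wait helpers imported -- something along these lines (the exact import block in the repo may differ):
# Selenium helpers used by connect_to_base()
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait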
To emulate a human user, sleep(2)
is called after the browser has connected to Wikipedia.
def run_process(filename, browser):
    if connect_to_base(browser):

        ########
        # here #
        ########

        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")
Once the page has loaded and sleep(2)
has executed, the browser grabs the HTML source, which is then passed to parse_html()
.
def run_process(filename, browser):
    if connect_to_base(browser):
        sleep(2)

        ########
        # here #
        ########

        html = browser.page_source

        ########
        # here #
        ########

        output_list = parse_html(html)
        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")
parse_html()
uses Beautiful Soup to parse the HTML, generating a list of dicts with the appropriate data.
def parse_html(html):
    # create soup object
    soup = BeautifulSoup(html, "html.parser")
    output_list = []
    # parse soup object to get wikipedia article url, title, and last modified date
    article_url = soup.find("link", {"rel": "canonical"})["href"]
    article_title = soup.find("h1", {"id": "firstHeading"}).text
    article_last_modified = soup.find("li", {"id": "footer-info-lastmod"}).text
    article_info = {
        "url": article_url,
        "title": article_title,
        "last_modified": article_last_modified,
    }
    output_list.append(article_info)
    return output_list
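To get a feel for the output, here's a quick sanity check against a minimal, made-up HTML snippet (not a real Wikipedia page) that contains just the three elements parse_html() looks for:
from scrapers.scraper import parse_html

# minimal, fabricated HTML with only the elements parse_html() cares about
sample_html = """
<html>
  <head>
    <link rel="canonical" href="https://en.wikipedia.org/wiki/Python_(programming_language)"/>
  </head>
  <body>
    <h1 id="firstHeading">Python (programming language)</h1>
    <ul>
      <li id="footer-info-lastmod">This page was last edited on 1 January 2024.</li>
    </ul>
  </body>
</html>
"""

print(parse_html(sample_html))
# [{'url': 'https://en.wikipedia.org/wiki/Python_(programming_language)',
#   'title': 'Python (programming language)',
#   'last_modified': 'This page was last edited on 1 January 2024.'}]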
The scraper module, scrapers/scraper.py, also includes get_load_time(), which takes an article URL, loads it, and records the subsequent load time.
def get_load_time(article_url):
    try:
        # set headers
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
        }
        # make get request to article_url
        response = requests.get(
            article_url, headers=headers, stream=True, timeout=3.000
        )
        # get page load time
        load_time = response.elapsed.total_seconds()
    except Exception as e:
        print(e)
        load_time = "Loading Error"
    return load_time
The output is added to a CSV file.
def run_process(filename, browser):
    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)

        ########
        # here #
        ########

        write_to_file(output_list, filename)
    else:
        print("Error connecting to Wikipedia")
write_to_file():
def write_to_file(output_list, filename):
    for row in output_list:
        with open(Path(BASE_DIR).joinpath(filename), "a") as csvfile:
            fieldnames = ["url", "title", "last_modified"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writerow(row)
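Note that write_to_file() only appends data rows -- it never writes a header row -- so if you want to read the CSV back later, supply the field names yourself. A minimal sketch (the filename below is hypothetical; yours will carry the actual timestamp):
import csv

# the output file has no header row, so pass the fieldnames explicitly
with open("output_20211204120000.csv", newline="") as csvfile:  # hypothetical filename
    reader = csv.DictReader(csvfile, fieldnames=["url", "title", "last_modified"])
    for row in reader:
        print(row["title"], "->", row["url"])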
Finally, back in the while
loop, the current_attempt
is incremented and the process starts over again.
if __name__ == "__main__":
    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    current_attempt = 1
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"

    # init browser
    browser = get_driver(headless=headless)

    # scrape and crawl
    while current_attempt <= 20:
        print(f"Scraping Wikipedia #{current_attempt} time(s)...")
        run_process(output_filename, browser)

        ########
        # here #
        ########

        current_attempt = current_attempt + 1

    # exit
    browser.quit()
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")
Want to test this out? Grab the full script here.
It took about 57 seconds to run:
(env)$ python script.py
Scraping Wikipedia #1 time(s)...
Scraping Wikipedia #2 time(s)...
Scraping Wikipedia #3 time(s)...
Scraping Wikipedia #4 time(s)...
Scraping Wikipedia #5 time(s)...
Scraping Wikipedia #6 time(s)...
Scraping Wikipedia #7 time(s)...
Scraping Wikipedia #8 time(s)...
Scraping Wikipedia #9 time(s)...
Scraping Wikipedia #10 time(s)...
Scraping Wikipedia #11 time(s)...
Scraping Wikipedia #12 time(s)...
Scraping Wikipedia #13 time(s)...
Scraping Wikipedia #14 time(s)...
Scraping Wikipedia #15 time(s)...
Scraping Wikipedia #16 time(s)...
Scraping Wikipedia #17 time(s)...
Scraping Wikipedia #18 time(s)...
Scraping Wikipedia #19 time(s)...
Scraping Wikipedia #20 time(s)...
Elapsed run time: 57.36561393737793 seconds
Got it? Great! Let's add some basic testing.
Testing
To test the parsing functionality without launching a browser and, thus, making repeated GET requests to Wikipedia, you can download the page's HTML (test/test.html) and parse it locally. This helps you avoid getting your IP blocked for making too many requests too quickly while writing and testing your parsing functions, and it also saves time since you don't need to fire up a browser every time you run the script.
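If you need to capture test/test.html yourself, a small throwaway script (not part of the repo) can grab a page with the project's own helpers:
# save_fixture.py -- hypothetical one-off script for saving a page to parse in tests
from scrapers.scraper import connect_to_base, get_driver

browser = get_driver(headless=True)
if connect_to_base(browser):
    with open("test/test.html", "w", encoding="utf-8") as f:
        f.write(browser.page_source)
browser.quit()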
test/test_scraper.py:
from pathlib import Path

import pytest

from scrapers import scraper

BASE_DIR = Path(__file__).resolve(strict=True).parent


@pytest.fixture(scope="module")
def html_output():
    with open(Path(BASE_DIR).joinpath("test.html"), encoding="utf-8") as f:
        html = f.read()
        yield scraper.parse_html(html)


def test_output_is_not_none(html_output):
    assert html_output


def test_output_is_a_list(html_output):
    assert isinstance(html_output, list)


def test_output_is_a_list_of_dicts(html_output):
    assert all(isinstance(elem, dict) for elem in html_output)
Ensure all is well:
(env)$ python -m pytest test/test_scraper.py
================================ test session starts =================================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /Users/michael/repos/testdriven/async-web-scraping
collected 3 items
test/test_scraper.py ... [100%]
================================= 3 passed in 0.19s ==================================
Want to mock get_load_time()
to bypass the GET request?
test/test_scraper_mock.py:
from pathlib import Path

import pytest

from scrapers import scraper

BASE_DIR = Path(__file__).resolve(strict=True).parent


@pytest.fixture(scope="function")
def html_output(monkeypatch):
    def mock_get_load_time(url):
        return "mocked!"

    monkeypatch.setattr(scraper, "get_load_time", mock_get_load_time)
    with open(Path(BASE_DIR).joinpath("test.html"), encoding="utf-8") as f:
        html = f.read()
        yield scraper.parse_html(html)


def test_output_is_not_none(html_output):
    assert html_output


def test_output_is_a_list(html_output):
    assert isinstance(html_output, list)


def test_output_is_a_list_of_dicts(html_output):
    assert all(isinstance(elem, dict) for elem in html_output)
Test:
(env)$ python -m pytest test/test_scraper_mock.py
================================ test session starts =================================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /Users/michael/repos/testdriven/async-web-scraping
collected 3 items
test/test_scraper_mock.py ... [100%]
================================= 3 passed in 0.27s =================================
Configure Multithreading
Now comes the fun part! By making just a few changes to the script, we can speed things up:
import datetime
import sys
from concurrent.futures import ThreadPoolExecutor, wait
from time import sleep, time

from scrapers.scraper import connect_to_base, get_driver, parse_html, write_to_file


def run_process(filename, headless):
    # init browser
    browser = get_driver(headless)

    if connect_to_base(browser):
        sleep(2)
        html = browser.page_source
        output_list = parse_html(html)
        write_to_file(output_list, filename)

        # exit
        browser.quit()
    else:
        print("Error connecting to Wikipedia")
        browser.quit()


if __name__ == "__main__":
    # headless mode?
    headless = False
    if len(sys.argv) > 1:
        if sys.argv[1] == "headless":
            print("Running in headless mode")
            headless = True

    # set variables
    start_time = time()
    output_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    output_filename = f"output_{output_timestamp}.csv"
    futures = []

    # scrape and crawl
    with ThreadPoolExecutor() as executor:
        for number in range(1, 21):
            futures.append(
                executor.submit(run_process, output_filename, headless)
            )

    wait(futures)
    end_time = time()
    elapsed_time = end_time - start_time
    print(f"Elapsed run time: {elapsed_time} seconds")
With the concurrent.futures module, ThreadPoolExecutor is used to spawn a pool of threads for executing run_process calls concurrently. The submit() method takes the callable along with its arguments and returns a Future object. wait() is then used to block execution until all of the submitted tasks are complete.
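By default, ThreadPoolExecutor picks a worker count based on the number of CPUs on your machine. If that spins up more Chrome instances than your machine can comfortably handle, you can cap the pool explicitly -- the value below is arbitrary:
# cap the pool at five concurrent browser instances
with ThreadPoolExecutor(max_workers=5) as executor:
    for number in range(1, 21):
        futures.append(
            executor.submit(run_process, output_filename, headless)
        )

wait(futures)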
It's worth noting that you can easily switch to multiprocessing via ProcessPoolExecutor, since ProcessPoolExecutor and ThreadPoolExecutor implement the same interface -- just swap the import:
# scrape and crawl
with ProcessPoolExecutor() as executor:
    for number in range(1, 21):
        futures.append(
            executor.submit(run_process, output_filename, headless)
        )
Why multithreading instead of multiprocessing?
Web scraping is I/O bound since retrieving the HTML (I/O) is slower than parsing it (CPU). For more on this, along with the difference between parallelism (multiprocessing) and concurrency (multithreading), review the Speeding Up Python with Concurrency, Parallelism, and asyncio article.
Run:
(env)$ python script_concurrent.py
Elapsed run time: 11.831077098846436 seconds
Check out the completed script here.
To speed things up even further, we can run Chrome in headless mode by passing in the headless command-line argument:
(env)$ python script_concurrent.py headless
Running in headless mode
Elapsed run time: 6.222846269607544 seconds
Conclusion
With a small amount of variation from the original code, we were able to execute the web scraper concurrently, taking the script's run time from around 57 seconds to just over 6 seconds. In this specific scenario, that's about 90% faster -- a huge improvement.
I hope this helps with your own scripts. You can find the code in the repo. Cheers!