Concurrent Web Scraping with Selenium Grid and Docker Swarm

Last updated March 31st, 2022

In this tutorial, we'll look at how to run a Python and Selenium-based web scraper in parallel with Selenium Grid and Docker. We'll also look at how to quickly scale Selenium Grid on DigitalOcean using Docker Swarm to increase efficiency of the scraper. Finally, we'll create a bash script that automates the spinning up and tearing down of resources on DigitalOcean.

Dependencies:

  1. Docker v20.10.13
  2. Python v3.10.4
  3. Selenium v4.1.3

Learning Objectives

By the end of this tutorial, you will be able to:

  1. Configure Selenium Grid to work with Docker
  2. Deploy Selenium Grid to DigitalOcean via Docker Machine
  3. Create a Docker Swarm Cluster
  4. Scale Selenium Grid across a Docker Swarm Cluster
  5. Automate the deployment of Selenium Grid and Docker Swarm

Getting Started

Start by cloning down the base project with the web scraping script, then create and activate a virtual environment and install the dependencies:

$ git clone https://github.com/testdrivenio/selenium-grid-docker-swarm.git --branch base --single-branch
$ cd selenium-grid-docker-swarm
$ python3.10 -m venv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt

The above commands may differ depending on your environment.

Test out the scraper:

(env)$ python project/script.py

You should see something similar to:

Scraping random Wikipedia page...
[
  {
    'url': 'https://en.wikipedia.org/wiki/Andreas_Reinke',
    'title': 'Andreas Reinke',
    'last_modified': ' This page was last edited on 10 January 2022, at 23:11\xa0(UTC).'
  }
]
Finished!

Essentially, the script requests a random article from Wikipedia:Random -- https://en.wikipedia.org/wiki/Special:Random -- using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML.

It's a modified version of the scraper built in the Building A Concurrent Web Scraper With Python and Selenium tutorial. Please review the tutorial along with the code from the script for more info.
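The last_modified field comes back as the raw footer text. If you later want it as a real timestamp, a small helper like the following can parse it. This isn't part of the tutorial's script, just an illustrative sketch:

```python
import re
from datetime import datetime, timezone

def parse_last_modified(text):
    """Parse Wikipedia's footer text, e.g.
    ' This page was last edited on 10 January 2022, at 23:11\xa0(UTC).'
    Returns an aware UTC datetime, or None if the pattern doesn't match."""
    match = re.search(r"on (\d{1,2} \w+ \d{4}), at (\d{2}:\d{2})", text)
    if not match:
        return None
    dt = datetime.strptime(f"{match.group(1)} {match.group(2)}", "%d %B %Y %H:%M")
    return dt.replace(tzinfo=timezone.utc)
```

You could call this on each scraped dict before storing or comparing results.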

Configuring Selenium Grid

Next, let's spin up Selenium Grid to simplify the running of the script in parallel on multiple machines. We'll also use Docker and Docker Compose to manage those machines with minimal installation and configuration.

Add a docker-compose.yml file to the root directory:

version: '3.8'

services:

  hub:
    image: selenium/hub:4.1.3
    ports:
      - 4442:4442
      - 4443:4443
      - 4444:4444

  chrome:
    image: selenium/node-chrome:4.1.3
    depends_on:
      - hub
    environment:
      - SE_EVENT_BUS_HOST=hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443

Here, we used the official Selenium Docker images to set up a basic Selenium Grid that consists of a hub and a single Chrome node. We used the 4.1.3 tag, which is associated with the following versions of Selenium, WebDriver, Chrome, and Firefox:

  • Selenium: 4.1.3
  • Google Chrome: 99.0.4844.84
  • ChromeDriver: 99.0.4844.51
  • Mozilla Firefox: 98.0.2
  • Geckodriver: 0.30.0

Want to use different versions? Find the appropriate tag from the releases page.

Pull and run the images:

$ docker-compose up -d

Navigate to http://localhost:4444 in your browser to ensure that the hub is up and running with one Chrome node:

selenium grid

Since Selenium Hub is now running inside a Docker container rather than locally, we need to configure the remote driver in project/scrapers/scraper.py:

def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    # initialize the remote driver, passing the Chrome options along
    driver = webdriver.Remote(
        command_executor='http://localhost:4444/wd/hub',
        desired_capabilities=DesiredCapabilities.CHROME,
        options=options,
    )
    return driver

Add the import:

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

Run the scraper again:

(env)$ python project/script.py

While the scraper is running, you should see "Sessions" change to one, indicating that it's in use:

selenium grid running
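Rather than eyeballing the dashboard, you can also hit the hub's /status endpoint, which returns a JSON payload with a readiness flag and the registered nodes. A minimal sketch, where the grid_ready helper and its default URL are assumptions for illustration:

```python
import json
from urllib.request import urlopen

def parse_grid_status(payload):
    """Pull the readiness flag and node count out of a Grid 4 /status payload."""
    value = payload.get("value", {})
    return value.get("ready", False), len(value.get("nodes", []))

def grid_ready(url="http://localhost:4444"):
    """Query a running hub; returns (ready, node_count)."""
    with urlopen(f"{url}/status") as resp:
        return parse_grid_status(json.load(resp))
```

With the Compose stack from above running, grid_ready() should report one node.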

Deploying to DigitalOcean

Sign up for DigitalOcean if you don't already have an account. To use the DigitalOcean API, you'll also need to generate an access token.

Add the token as an environment variable:

(env)$ export DIGITAL_OCEAN_ACCESS_TOKEN=[your_token]

Provision a new droplet with Docker Machine:

(env)$ docker-machine create \
        --driver digitalocean \
        --digitalocean-access-token $DIGITAL_OCEAN_ACCESS_TOKEN \
        --digitalocean-region "nyc1" \
        --digitalocean-image "debian-10-x64" \
        --digitalocean-size "s-4vcpu-8gb" \
        --engine-install-url "https://releases.rancher.com/install-docker/19.03.9.sh" \
        selenium-hub;

--engine-install-url is required since, as of writing, Docker v20.10.13 doesn't work with Docker Machine.

Next, point the Docker daemon at the newly created machine and set it as the active machine:

(env)$ docker-machine env selenium-hub
(env)$ eval $(docker-machine env selenium-hub)

Spin up the two containers on the droplet:

(env)$ docker-compose up -d

Once up, grab the IP of the droplet:

(env)$ docker-machine ip selenium-hub

Ensure Selenium Grid is up at http://YOUR_IP:4444, and then update the IP address in project/scrapers/scraper.py:

command_executor='http://YOUR_IP:4444/wd/hub',

Run the scraper:

(env)$ python project/script.py

Again, navigate to the Grid dashboard and ensure a session is active. You should see the following output in the terminal:

Scraping random Wikipedia page...
[
  {
    'url': 'https://en.wikipedia.org/wiki/David_Hidalgo',
    'title': 'David Hidalgo',
    'last_modified': ' This page was last edited on 11 November 2021, at 01:24\xa0(UTC).'
  }
]
Finished!

Thus far we're only scraping a single article on Wikipedia. What if we wanted to scrape multiple articles?

(env)$ for i in {1..21}; do {
          python project/script.py &
        };
        done
        wait

Navigate to the Grid dashboard again. You should see one of the requests running along with 20 queued requests:

selenium grid queue

Since we only have one node running, it will take a while to finish (just over 1.5 minutes on my end). We could spin up a few more instances of the node, but each of them would have to fight for resources on the droplet. It's best to deploy the hub and a number of nodes across a few droplets. This is where Docker Swarm comes into play.
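A quick back-of-envelope calculation shows why more nodes help. The per-scrape time below is a rough figure from my runs, not a measured constant:

```python
import math

# Rough numbers from this run: 21 scrapes at roughly 4.5 seconds each
# (your timings will differ).
tasks, seconds_per_scrape = 21, 4.5

one_node = tasks * seconds_per_scrape                   # everything runs serially
five_nodes = math.ceil(tasks / 5) * seconds_per_scrape  # five scrapes per wave

print(one_node, five_nodes)  # 94.5 22.5
```

That lines up with the ~1.5 minutes observed here, and with the ~25 seconds we'll see later with five nodes.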

Running Docker Swarm

With Docker Swarm (or "Swarm mode", to be more accurate), we can deploy a single Selenium Grid across a number of machines.

Start by initializing Docker Swarm on the current machine:

(env)$ docker swarm init --advertise-addr [YOUR_IP]

You should see something like:

Swarm initialized: current node (mky1a6z8rjaeaeiucvzyo355l) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-2136awhbig93jh8xunp8yp2wn0pw9i946dvmfrpi05tnpbxula-633h28mn97sxhbfn8479mmpx5 134.122.20.39:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

Take note of the join command as it contains a token that we need in order to add workers to the Swarm.

Review the official docs for more info on adding nodes to a Swarm.

Next, spin up three new droplets on DigitalOcean:

(env)$ for i in 1 2 3; do
            docker-machine create \
              --driver digitalocean \
              --digitalocean-access-token $DIGITAL_OCEAN_ACCESS_TOKEN \
              --digitalocean-region "nyc1" \
              --digitalocean-image "debian-10-x64" \
              --digitalocean-size "s-4vcpu-8gb" \
              --engine-install-url "https://releases.rancher.com/install-docker/19.03.9.sh" \
              node-$i;
        done

And then add each to the Swarm as a worker:

(env)$ for i in 1 2 3; do
            docker-machine ssh node-$i \
              -- docker swarm join --token YOUR_JOIN_TOKEN;
        done

You should see something like:

(env)$ for i in 1 2 3; do
            docker-machine ssh node-$i \
              -- docker swarm join --token SWMTKN-1-2136awhbig93jh8xunp8yp2wn0pw9i946dvmfrpi05tnpbxula-633h28mn97sxhbfn8479mmpx5 134.122.20.39:2377
        done
This node joined a swarm as a worker.
This node joined a swarm as a worker.
This node joined a swarm as a worker.

Update the docker-compose.yml file to deploy Selenium Grid in Swarm mode:

version: '3.8'

services:

  hub:
    image: selenium/hub:4.1.3
    ports:
      - 4442:4442
      - 4443:4443
      - 4444:4444
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role == worker

  chrome:
    image: selenium/node-chrome:4.1.3
    depends_on:
      - hub
    environment:
      - SE_EVENT_BUS_HOST=hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
      - NODE_MAX_SESSION=1
    entrypoint: bash -c 'SE_OPTS="--host $$HOSTNAME" /opt/bin/entry_point.sh'
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker

Major changes:

  1. Placement constraints: We set up a placement constraint of node.role == worker so that all tasks will be run on the worker nodes. It's generally best to keep manager nodes free from CPU and/or memory-intensive tasks.
  2. Entrypoint: Here, we updated the host set in SE_OPTS within the entry_point.sh script so nodes running on different hosts will be able to successfully link back to the hub.

With that, we are ready to deploy the stack:

(env)$ docker stack deploy --compose-file=docker-compose.yml selenium

Let's also add a few more nodes:

(env)$ docker service scale selenium_chrome=5

selenium_chrome scaled to 5
overall progress: 5 out of 5 tasks
1/5: running   [==================================================>]
2/5: running   [==================================================>]
3/5: running   [==================================================>]
4/5: running   [==================================================>]
5/5: running   [==================================================>]
verify: Service converged

You can check the status of the stack like so:

(env)$ docker stack ps selenium

You'll also want to get the IP address of the machine running the hub:

(env)$ docker-machine ip $(docker service ps --format "{{.Node}}" selenium_hub)

Update the IP address again in project/scrapers/scraper.py:

command_executor='http://YOUR_IP:4444/wd/hub',

Test it out:

(env)$ for i in {1..21}; do {
          python project/script.py &
        };
        done
        wait

Back on the Grid dashboard at http://YOUR_IP:4444/, you should see the five nodes, each running a single session. There should also be 16 queued requests:

selenium grid cluster

This should run much faster now. On my end, it took 25 seconds to run.
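If you'd rather stay in Python than juggle a bash loop, the same fan-out can be sketched with concurrent.futures. The run_parallel helper and the stand-in task are illustrative, not part of the project:

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(task, runs, workers):
    """Run `task` `runs` times across a pool of `workers` threads
    and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(task) for _ in range(runs)]
        return [future.result() for future in futures]

# With the real scraper, the task would be something like
# `lambda: run_process(get_driver())`; a stand-in keeps this runnable:
results = run_parallel(lambda: "scraped", runs=21, workers=5)
print(len(results))  # 21
```

Matching the worker count to the number of Chrome nodes keeps every session slot busy without queueing extra requests locally.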

Commands

Want to view the services?

(env)$ docker service ls

To get more info about the Chrome nodes, along with where each one is running, run:

(env)$ docker service ps selenium_chrome

Remove the services:

(env)$ docker service rm selenium_chrome
(env)$ docker service rm selenium_hub

Spin down the droplets:

(env)$ docker-machine rm node-1 node-2 node-3
(env)$ docker-machine rm selenium-hub

Automating the Workflow

Right now we have to spin resources up and tear them down manually. Let's automate the process so that, when you want to run a scraping job, the resources are spun up and then torn down automatically.

project/create.sh:

#!/bin/bash


echo "Spinning up four droplets..."

for i in 1 2 3 4; do
    docker-machine create \
        --driver digitalocean \
        --digitalocean-access-token $DIGITAL_OCEAN_ACCESS_TOKEN \
        --digitalocean-region "nyc1" \
        --digitalocean-image "debian-10-x64" \
        --digitalocean-size "s-4vcpu-8gb" \
        --engine-install-url "https://releases.rancher.com/install-docker/19.03.9.sh" \
        node-$i;
done


echo "Initializing Swarm mode..."

docker-machine ssh node-1 -- docker swarm init --advertise-addr $(docker-machine ip node-1)


echo "Adding the nodes to the Swarm..."

TOKEN=`docker-machine ssh node-1 docker swarm join-token worker | grep token | awk '{ print $5 }'`

docker-machine ssh node-2 "docker swarm join --token ${TOKEN} $(docker-machine ip node-1):2377"
docker-machine ssh node-3 "docker swarm join --token ${TOKEN} $(docker-machine ip node-1):2377"
docker-machine ssh node-4 "docker swarm join --token ${TOKEN} $(docker-machine ip node-1):2377"


echo "Deploying Selenium Grid to http://$(docker-machine ip node-1):4444"

eval $(docker-machine env node-1)
docker stack deploy --compose-file=docker-compose.yml selenium
docker service scale selenium_chrome=5

project/destroy.sh:

#!/bin/bash


echo "Bringing down the services"

docker service rm selenium_chrome
docker service rm selenium_hub


echo "Bringing down the droplets"

docker-machine rm node-1 node-2 node-3 node-4 -y

Update get_driver() in project/scrapers/scraper.py to take an address:

def get_driver(address):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    # initialize the remote driver, passing the Chrome options along
    driver = webdriver.Remote(
        command_executor=f'http://{address}:4444/wd/hub',
        desired_capabilities=DesiredCapabilities.CHROME,
        options=options,
    )
    return driver

Update the main block in project/script.py so the address can be passed in on the command line (make sure sys is imported at the top of the file):

if __name__ == '__main__':
    browser = get_driver(sys.argv[1])
    data = run_process(browser)
    print(data)
    browser.quit()
    print('Finished!')

Time to test!

(env)$ sh project/create.sh

Run the scraper:

(env)$ docker-machine env node-1
(env)$ eval $(docker-machine env node-1)
(env)$ NODE=$(docker service ps --format "{{.Node}}" selenium_hub)
(env)$ for i in {1..21}; do {
          python project/script.py $(docker-machine ip $NODE) &
        };
        done
        wait

Bring down the resources once done:

(env)$ sh project/destroy.sh

Next Steps

Try out these challenges:

  1. Right now we're not doing anything with the scraped data. Try spinning up a database and adding a function to the scraping script to write the data to the database.
  2. Selenium is also used for browser-based, end-to-end testing. With Selenium Grid you can run the tests against different versions of Chrome and Firefox on different operating systems. In other words, you can spin up a number of nodes, each with different versions of Chrome and Firefox that you can run the tests against. Try this out on your own. Check out the Distributed Testing with Selenium Grid and Docker tutorial to see this in action!
  3. Rip out Docker Swarm from the mix and add Kubernetes in its place.
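For the first challenge, a minimal sketch of persisting the scraped dicts to SQLite might look like this; save_articles and the schema are assumptions, not part of the project:

```python
import sqlite3

def save_articles(conn, articles):
    """Insert scraped article dicts (url, title, last_modified) into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(url TEXT PRIMARY KEY, title TEXT, last_modified TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO articles VALUES (:url, :title, :last_modified)",
        articles,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # swap for a file path in the real script
save_articles(conn, [{
    "url": "https://en.wikipedia.org/wiki/Andreas_Reinke",
    "title": "Andreas Reinke",
    "last_modified": "10 January 2022",
}])
print(conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0])  # 1
```

Using the URL as the primary key with INSERT OR REPLACE also deduplicates repeat scrapes of the same article.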

As always, you can find the code in the repo.

Featured Course

Creating an HTTP Load Balancer in Python

In this course, you'll learn how to implement a load balancer in Python using Test-Driven Development.
