Selenium spider¶
Foreword¶
Nowadays many websites heavily use AJAX techniques to load their pages. With the advent of modern frontend (JavaScript) frameworks like React, Vue or Svelte, which make it easy to create single-page applications (SPAs), static spiders unfortunately become useless.
Let's look at an example with httpbin to understand what the implications are. This website is a SPA created with React. For information, it provides a simple interface to play with HTTP requests (testing your HTTP client). If you click on the HTTP Methods menu, you will see five requests you can make. Now look at the HTML source with ctrl + u and you will see... almost nothing, apart from some cryptic information.
The reason is simple: the page is rendered via JavaScript; in other words, almost all content is loaded dynamically after the page is fetched. Now right-click on the HTTP Methods menu and click Inspect Element (or Inspect, depending on your browser), and you will see the HTML related to this menu :)
I will not give a course on browser dev tools here since there are many resources on the web, but this is basically how you can inspect the HTML source of SPA applications.
As I said earlier, static spiders are useless in the case of SPA applications because the inner httpx client they rely on does not know how to deal with JavaScript rendering. So how can we scrape data from SPA applications? There are not many choices here: we need a browser! And we don't only need a browser, but also a tool able to control a browser without human interaction, commonly called a driver.
Note
Every browser has a specific driver so you need to install one for the browser you use.
This is where selenium comes in. It is a Python library able to control a browser via the use of drivers. The installation section is not very detailed about how to install browser drivers. Although you can find resources on the web about this, here are some tips:
- For Windows users, the simplest thing to do is to install your driver via chocolatey, an awesome package manager for Windows. Just search for selenium on the packages page and you will see all the available drivers and instructions to install them.
- For Mac OS users, the installation is also really simple via homebrew, an awesome package manager for Mac OS. You can for example install chromedriver with `brew cask install chromedriver`. To search for a specific driver, refer to the different package pages.
- For Linux users, well... this is probably the most complicated case. There are many ways to do it depending on the distribution you are using, so unfortunately I can't give you an all-in-one solution. Nevertheless, for Ubuntu users here is a recipe to install chrome and its driver, and here is a forum question with a way to install geckodriver (take care to update the version used in the example).
Configuration¶
There are some Configuration attributes related to a selenium spider that you should be aware of:

- `selenium_find_timeout`: the number of seconds selenium should wait when searching for a DOM element.
- `selenium_browser`: a `Browser` enum indicating the underlying browser used by selenium. Only two values are possible for now: `FIREFOX` for the firefox browser and `CHROME` for the chrome browser. Defaults to `FIREFOX`, so if you are using chrome make sure to change this attribute.
- `selenium_driver_log_file`: a path to a file where browser drivers will write debug information. Defaults to `driver.log`. You can set it to `None` if you don't want this log file.
- `selenium_driver_executable_path`: the path to the driver executable used by the spider. Defaults to `geckodriver` for firefox and `chromedriver` for chrome. If for some reason you don't add the driver executable to your `PATH` environment variable, you must set this attribute to the right path.
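Putting these attributes together, a configuration for a chrome-based spider might look like the following sketch. Note that the exact import path for `Configuration` and `Browser` is an assumption here; adjust it to your pyscalpel version:

```python
# Hypothetical sketch: check where Configuration and Browser live in
# your version of pyscalpel before copying this.
from scalpel import Browser, Configuration

config = Configuration(
    selenium_browser=Browser.CHROME,  # default is Browser.FIREFOX
    selenium_find_timeout=10,  # seconds to wait when searching for a DOM element
    selenium_driver_log_file=None,  # disable the driver.log file
    # only needed if chromedriver is not on your PATH
    selenium_driver_executable_path='/usr/local/bin/chromedriver',
)
```

You would then pass this object to the spider via its `config` attribute, as shown in the commented lines of the examples further down.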
Our first selenium spider¶
First of all, I will not teach how to use the selenium library; if you don't know it, you can read this introduction. Many concepts of the static spider are reused for the selenium spider, so if you haven't read the static spider guide, it is a good idea to do so before continuing here.
So if we come back to httpbin, let's say we want to scrape the menu title plus a description of all the methods related to this menu. If we look at the HTML source of a menu with the browser dev tools, we can see a structure like the following:
<div class="opblock-tag-section">
<h4 class="opblock-tag" id="operations-tag-HTTP_Methods">
<a class="nostyle" href="#/HTTP Methods"><span>HTTP Methods</span></a>
<small>
<div class="markdown">Testing different HTTP verbs</div>
</small>
<div style="height: auto; border: none; margin: 0px; padding: 0px;">
<!-- react-text: 451 --> <!-- /react-text -->
<div class="opblock opblock-delete" id="operations-HTTP Methods-delete_delete">
<div class="opblock-summary opblock-summary-delete">
<span class="opblock-summary-method">DELETE</span>
<span class="opblock-summary-path">
<a class="nostyle" href="#/operations/HTTP Methods/delete_delete">
<span>/delete</span>
</a><!-- react-empty: 458 --><!-- react-text: 459 --> <!-- /react-text -->
</span>
<div class="opblock-summary-description">The request's DELETE parameters.</div>
</div>
<noscript></noscript>
</div>
...
<button class="expand-operation" title="Expand operation">
<svg class="arrow" width="20" height="20">
<use href="#large-arrow" xlink:href="#large-arrow"></use>
</svg>
</button>
</div>
</h4>
<noscript></noscript>
</div>
We see that the menu title is in a `h4` tag. The description of the methods is inside a `<div class="opblock XXX">..</div>` and inside it we have:

- the method name in a `<span class="opblock-summary-method"..>..</span>`
- the route path in a `span[class="opblock-summary-path"]/a/span`
- the method description in a `<div class="opblock-summary-description">..</div>`
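If you want to check these XPath expressions before wiring them into a spider, here is a standalone sketch (plain Python, no pyscalpel or selenium involved) that runs them against a simplified, well-formed copy of the HTML above using the standard library's ElementTree, which supports exactly this subset of XPath:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the structure shown above
html = """
<div class="opblock-tag-section">
  <h4 class="opblock-tag"><a class="nostyle" href="#"><span>HTTP Methods</span></a></h4>
  <div class="opblock opblock-delete">
    <div class="opblock-summary opblock-summary-delete">
      <span class="opblock-summary-method">DELETE</span>
      <span class="opblock-summary-path">
        <a class="nostyle" href="#"><span>/delete</span></a>
      </span>
      <div class="opblock-summary-description">The request's DELETE parameters.</div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(html)
method = root.find(".//span[@class='opblock-summary-method']").text
path = root.find(".//span[@class='opblock-summary-path']/a/span").text
description = root.find(".//div[@class='opblock-summary-description']").text
print(method, path, description)  # DELETE /delete The request's DELETE parameters.
```

One caveat of this quick check: ElementTree's `[@class='...']` predicate matches the attribute value exactly, so it would not find an element carrying several classes like `class="opblock opblock-delete"`. The expressions we use below only target single-class elements, so this limitation does not bite here.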
So this is what we can write to scrape all menu information on the website.
With gevent:
```python
from scalpel.green import SeleniumResponse, SeleniumSpider


def parse(_, response: SeleniumResponse) -> None:
    for block in response.driver.find_elements_by_xpath('//div[@class="opblock-tag-section"]'):
        block.click()
        h4_text = block.find_element_by_xpath('./h4').text
        title, description = h4_text.split('\n')
        print('****', title, '****')
        print(description)
        print('=== operations ===')
        methods = (method.text for method in block.find_elements_by_xpath('.//span[@class="opblock-summary-method"]'))
        paths = (path.text for path in block.find_elements_by_xpath('.//span[@class="opblock-summary-path"]/a/span'))
        descriptions = (description.text for description in
                        block.find_elements_by_xpath('.//div[@class="opblock-summary-description"]'))
        for method, path, description in zip(methods, paths, descriptions):
            print('\tmethod:', method)
            print('\tpath:', path)
            print('\tdescription:', description, end='\n\n')


# the default browser used is firefox; if you are using chrome, don't forget to set the configuration like this
# config = Configuration(selenium_browser=Browser.CHROME)
# and then instantiate the spider using the config attribute
# spider = SeleniumSpider(urls=['http://httpbin.org/'], parse=parse, config=config)
spider = SeleniumSpider(urls=['http://httpbin.org/'], parse=parse)
spider.run()
```
With anyio:
```python
import anyio

from scalpel.any_io import SeleniumSpider, SeleniumResponse


async def parse(_, response: SeleniumResponse) -> None:
    for block in response.driver.find_elements_by_xpath('//div[@class="opblock-tag-section"]'):
        block.click()
        h4_text = block.find_element_by_xpath('./h4').text
        title, description = h4_text.split('\n')
        print('****', title, '****')
        print(description)
        print('=== operations ===')
        methods = (method.text for method in block.find_elements_by_xpath('.//span[@class="opblock-summary-method"]'))
        paths = (path.text for path in block.find_elements_by_xpath('.//span[@class="opblock-summary-path"]/a/span'))
        descriptions = (description.text for description in
                        block.find_elements_by_xpath('.//div[@class="opblock-summary-description"]'))
        for method, path, description in zip(methods, paths, descriptions):
            print('\tmethod:', method)
            print('\tpath:', path)
            print('\tdescription:', description, end='\n\n')


async def main() -> None:
    # the default browser used is firefox; if you are using chrome, don't forget to set the configuration like this
    # config = Configuration(selenium_browser=Browser.CHROME)
    # and then instantiate the spider using the config attribute
    # spider = SeleniumSpider(urls=['http://httpbin.org/'], parse=parse, config=config)
    spider = SeleniumSpider(urls=['http://httpbin.org/'], parse=parse)
    await spider.run()


anyio.run(main)  # with trio: anyio.run(main, backend='trio')
```
You will have an output similar to the following:
**** HTTP Methods ****
Testing different HTTP verbs
=== operations ===
method: DELETE
path: /delete
description: The request's DELETE parameters.
method: GET
path: /get
description: The request's query parameters.
method: PATCH
path: /patch
description: The request's PATCH parameters.
method: POST
path: /post
description: The request's POST parameters.
method: PUT
path: /put
description: The request's PUT parameters.
**** Auth ****
...
Some notes:

- The first obvious changes are the uses of `SeleniumSpider` and `SeleniumResponse` instead of `StaticSpider` and `StaticResponse`. The usage of `SeleniumSpider` does not differ from `StaticSpider`, but the `SeleniumResponse` class no longer has the `css` and `xpath` methods. Instead we have a `driver` attribute which represents a selenium `WebDriver` object and is used to interact with the browser.
- You may have noticed the use of `_` in the definition of the parse function, `parse(_, response: SeleniumResponse)`. This is because I don't use the spider object in the function body, so I mark it explicitly. It is a good practice in my opinion.
- The `driver` attribute is a selenium webdriver object on which we can call a lot of methods like `find_elements_by_xpath`, which looks familiar to the `StaticResponse.xpath` method. You can look at the entire webdriver api here.
- Do not hesitate to look at the examples for more code snippets.
Danger
Please do not call the `close` or `quit` methods of the driver object because this will certainly crash your
application. The resource cleanup is done by pyscalpel.
Caveats¶
The integration of selenium in pyscalpel is not optimal because of the synchronous nature of selenium and the fact that selenium does not have a notion of a tab object: you only handle one window tab at a time. All of this makes it hard to combine selenium with asynchronous frameworks such as gevent or trio. The direct consequence is that asynchronous operations such as `follow`, `save_item` or whatever asynchronous api you use must be performed at the end of your parse callable. You can't mix them in anywhere in the code. For example:
With gevent:
```python
from scalpel.green import StaticResponse


def parse(_, response: StaticResponse) -> None:
    response.xpath('//p')
    response.follow('/page/2')  # not good
    # the reason is that for now scalpel cannot guarantee that the next line
    # will be executed on the page you expected
    response.xpath('//div')
```
With anyio:
```python
from scalpel.any_io import StaticResponse


async def parse(_, response: StaticResponse) -> None:
    response.xpath('//p')
    await response.follow('/page/2')  # not good
    # the reason is that for now scalpel cannot guarantee that the next line
    # will be executed on the page you expected
    response.xpath('//div')
```
Instead, you should perform all your scraping and parsing code before running asynchronous operations.
With gevent:
```python
from scalpel.green import StaticResponse


def parse(_, response: StaticResponse) -> None:
    response.xpath('//p')
    response.xpath('//div')
    response.follow('/page/2')  # good
```
With anyio:
```python
from scalpel.any_io import StaticResponse


async def parse(_, response: StaticResponse) -> None:
    response.xpath('//p')
    response.xpath('//div')
    await response.follow('/page/2')  # good
```
Another direct consequence of this imperfect asynchronous integration is that if you follow urls, you will notice that the behaviour is more sequential than concurrent with regard to url handling. Nevertheless, the selenium spider is still useful (if you respect the previous advice), just slow for now.