API

Configuration

class scalpel.Configuration(min_request_delay=0, max_request_delay=0, fetch_timeout=5.0, selenium_find_timeout=10.0, selenium_driver_log_file='driver.log', selenium_browser=Browser.FIREFOX, selenium_driver_executable_path=NOTHING, user_agent=NOTHING, follow_robots_txt=False, robots_cache_folder=NOTHING, backup_filename=NOTHING, response_middlewares=NOTHING, item_processors=NOTHING, msgpack_encoder=datetime_encoder, msgpack_decoder=datetime_decoder)

Configure variables for your spider.

Parameters:

  • min_request_delay: The minimum delay to wait between two http requests. Defaults to 0s.

  • max_request_delay: The maximum delay to wait between two http requests. Defaults to 0s.

  • fetch_timeout: The timeout to fetch http resources using the inner httpx client. Defaults to 5s.

  • selenium_find_timeout: The timeout for selenium driver to find an element in a page. Defaults to 10s.

  • selenium_driver_log_file: The file where the browser logs debug messages. Defaults to driver.log. If you don't want to create one, just pass None.

  • selenium_browser: The browser to use with the selenium spider. You can use the Browser enum to specify the value. Possible values are Browser.FIREFOX and Browser.CHROME. Defaults to Browser.FIREFOX.

  • selenium_driver_executable_path: The path to the browser driver. Defaults to geckodriver if Browser.FIREFOX is selected as selenium_browser, otherwise defaults to chromedriver.

  • user_agent: The user agent to fake. Mainly useful for the static spider. Defaults to a random value provided by fake-useragent; if that fails, it falls back to Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36.

  • follow_robots_txt: Whether or not the spider should follow the robots.txt rules of the website you are scraping. Defaults to False.

  • robots_cache_folder: A folder used to cache the content of the robots.txt files of the different websites, so they don't have to be retrieved each time an html page is analyzed. Defaults to the system temporary directory.

  • backup_filename: The filename where scraped items will be written. If you don't want one, simply pass None. Defaults to backup-{uuid}.mp where uuid is a uuid.uuid4 string value. Note that values inserted in this file are streamed using msgpack. Look at the documentation to see how to use it.

  • response_middlewares: A list of callables that will be called with the callable that fetches the http resource. This parameter is only useful for the static spider. Defaults to an empty list.

  • item_processors: A list of callables that will be called with each scraped item. Defaults to an empty list (see the sketch after the Usage example below).

  • msgpack_encoder: A callable that will be called when msgpack serializes an item. Defaults to scalpel.datetime_encoder.

  • msgpack_decoder: A callable that will be called when msgpack deserializes an item. Defaults to scalpel.datetime_decoder.

Usage:

from scalpel import Configuration, Browser
config = Configuration(
    min_request_delay=1, max_request_delay=3, follow_robots_txt=True, selenium_browser=Browser.CHROME
)
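
A minimal sketch of an item processor. The only assumption taken from this section is that each callable in item_processors receives a scraped item; the lowercase_fruit function below is purely illustrative:

from scalpel import Configuration

def lowercase_fruit(item):
    # illustrative processor: receives each scraped item and mutates it
    item['fruit'] = item.get('fruit', '').lower()
    # returning the (possibly modified) item; how the return value is handled
    # is not detailed in this section
    return item

config = Configuration(item_processors=[lowercase_fruit])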

load_from_dotenv(env_file)

Loads configuration from a .env file.

Returns: Configuration

Usage:

# .env
SCALPEL_USER_AGENT = Mozilla/5.0
SCALPEL_FETCH_TIMEOUT = 4.0
SCALPEL_FOLLOW_ROBOTS_TXT = yes
from scalpel import Configuration
conf = Configuration.load_from_dotenv('.env')
conf.follow_robots_txt  # equals True

load_from_toml(toml_file)

Loads configuration from a toml file.

Returns: Configuration

Usage:

# conf.toml
[scalpel]
user_agent = "Mozilla/5.0"
fetch_timeout = 4.0
follow_robots_txt = true
from scalpel import Configuration
conf = Configuration.load_from_toml('conf.toml')
conf.fetch_timeout  # equals 4.0

load_from_yaml(yaml_file)

Loads configuration from a yaml file.

Returns: Configuration

Usage:

# conf.yaml
scalpel:
  fetch_timeout: 4.0
  user_agent: Mozilla/5.0
  follow_robots_txt: true
from scalpel import Configuration
conf = Configuration.load_from_yaml('conf.yaml')
conf.fetch_timeout  # equals 4.0

request_delay

A read-only property giving a random value between min_request_delay and max_request_delay (both ends included), used as the wait time between two http requests.
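
A quick illustration; the value is random, so only the bounds can be checked:

from scalpel import Configuration

config = Configuration(min_request_delay=1, max_request_delay=3)
# request_delay is a random value in the inclusive range [1, 3]
assert 1 <= config.request_delay <= 3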

State

class scalpel.State()

An empty class used to store arbitrary data.
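
A small sketch; since the class is just a container, any attribute can be set on it, for example on a spider's state property:

from scalpel import State

state = State()
state.counter = 0          # arbitrary attributes can be attached
state.seen_titles = []
state.counter += 1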

Browser

class scalpel.Browser(value, names=None, *, module=None, qualname=None, type=None, start=1)

An enum with different browser values.

CHROME

The enum member representing the Chrome browser.

FIREFOX

The enum member representing the Firefox browser.

msgpack

scalpel.datetime_encoder(data)

A datetime encoder for msgpack.

Usage:

from datetime import datetime
from scalpel import datetime_encoder
import msgpack

data = {'fruit': 'apple', 'date': datetime.utcnow()}
packed_data = msgpack.packb(data, default=datetime_encoder)

scalpel.datetime_decoder(data)

A datetime decoder for msgpack.

Usage:

from datetime import datetime
from scalpel import datetime_encoder, datetime_decoder
import msgpack

data = {'fruit': 'apple', 'date': datetime.utcnow()}
packed_data = msgpack.packb(data, default=datetime_encoder)
assert msgpack.unpackb(packed_data, object_hook=datetime_decoder) == data

scalpel.green.write_mp(filename, data, mode='a', encoder=None)

Writes a msgpack file.

Parameters:

  • filename: The name of the file where data will be written. It can be a string or a pathlib.Path.
  • data: Arbitrary data to serialize. Note that if you want to serialize data types not handled by msgpack by default (datetime for example), you will need to provide a custom encoder function.
  • mode: The mode in which the file is opened. Valid values are "a" (append) and "w" (write). Defaults to "a".
  • encoder: An optional function used to encode data types not handled by default by msgpack.

Returns: The number of written bytes.

Usage:

from datetime import datetime
from scalpel import datetime_encoder
from scalpel.green import write_mp

data = {'fruit': 'apple', 'date': datetime.utcnow()}
length = write_mp('file.mp', data, 'w', datetime_encoder)
print(length)  # 65

scalpel.green.read_mp(filename, decoder=None)

Reads a msgpack file generated by the spider when calling the save_item method.

Parameters:

  • filename: The name of the file to read. It can be a string or a pathlib.Path.
  • decoder: An optional function used to decode data types not handled by default by msgpack.

Usage:

from scalpel import datetime_decoder
from scalpel.green import read_mp

for item in read_mp('file.mp', datetime_decoder):
    print(item)

SpiderStatistics

class scalpel.SpiderStatistics(reachable_urls, unreachable_urls, robot_excluded_urls, followed_urls, request_counter, average_fetch_time, total_time)

Provides some statistics about a spider run (see the example after the parameter list).

Parameters:

  • reachable_urls: set of urls that were fetched (or read in case of file urls) and parsed.
  • unreachable_urls: set of urls that were impossible to fetch (or read in case of file urls).
  • robot_excluded_urls: set of urls that were excluded from fetching because of robots.txt rules.
  • followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
  • request_counter: The number of urls fetched or read (in case of file urls).
  • average_fetch_time: The average time to fetch a url (or read a file in case of file urls).
  • total_time: The total execution time of the spider.
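
A short example of reading the statistics after a run (the parse callable is elided; the attributes accessed are those listed above):

from scalpel.green import StaticSpider, StaticResponse

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    ...

spider = StaticSpider(urls=['http://example.com'], parse=parse)
spider.run()
stats = spider.statistics()
print(stats.reachable_urls)
print(stats.average_fetch_time)
print(stats.total_time)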

green.StaticSpider

class scalpel.green.StaticSpider(urls, parse, name=NOTHING, config=NOTHING, ignore_errors=False)

A spider suitable for parsing local files or static html pages.

Parameters:

  • urls: Urls to parse. Allowed schemes are http, https and file. It can be a list, a tuple or a set.
  • parse: A callable used to parse url content. It takes two arguments: the current spider and a StaticResponse object.
  • reachable_urls: set of urls that are already fetched or read.
  • unreachable_urls: set of urls that were impossible to fetch or read.
  • robot_excluded_urls: set of urls that were excluded from fetching because of robots.txt rules.
  • followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
  • request_counter: The number of urls already fetched or read.

Usage:

from scalpel.green import StaticSpider, StaticResponse

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    ...

spider = StaticSpider(urls=['http://example.com'], parse=parse)
spider.run()

config

Returns the Configuration related to the spider.

followed_urls
name

Returns the name given to the spider.

parse
reachable_urls
request_counter
robots_excluded_urls
run(self)

Runs the spider.

save_item(self, item)

Saves a scraped item in the backup file specified by the Configuration.backup_filename attribute.
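
A minimal sketch of save_item inside a parse callable; the selector and item keys are illustrative only:

from scalpel.green import StaticSpider, StaticResponse

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    # illustrative item; it will be streamed to Configuration.backup_filename using msgpack
    item = {'url': response.url, 'title': response.xpath('//title/text()').get()}
    spider.save_item(item)

spider = StaticSpider(urls=['http://example.com'], parse=parse)
spider.run()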

state

Returns the State related to the spider. You can add custom information on this object.

statistics(self)

Provides some statistics related to the spider run.

Returns: SpiderStatistics

unreachable_urls
urls

green.SeleniumSpider

class scalpel.green.SeleniumSpider(urls, parse, name=NOTHING, config=NOTHING, ignore_errors=False)

A spider suitable for parsing dynamic websites, i.e. where JavaScript is heavily used. You will sometimes encounter the term Single-Page Application (SPA) for this type of website. It relies on the selenium package and a browser.

Parameters:

  • urls: Urls to parse. Allowed schemes are http, https and file. It can be a list, a tuple or a set.
  • parse: A callable used to parse url content. It takes two arguments: the current spider and a SeleniumResponse object.
  • reachable_urls: set of urls that are already fetched or read.
  • unreachable_urls: set of urls that were impossible to fetch or read.
  • robot_excluded_urls: set of urls that were excluded from fetching because of robots.txt rules.
  • followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
  • request_counter: The number of urls already fetched or read.

Usage:

from scalpel.green import SeleniumSpider, SeleniumResponse

def parse(spider: SeleniumSpider, response: SeleniumResponse) -> None:
    ...

spider = SeleniumSpider(urls=['http://example.com'], parse=parse)
spider.run()

config

Returns the Configuration related to the spider.

followed_urls
name

Returns the name given to the spider.

parse
reachable_urls
request_counter
robots_excluded_urls
run(self)

Runs the spider.

save_item(self, item)

Saves a scraped item in the backup file specified by the Configuration.backup_filename attribute.

state

Returns the State related to the spider. You can add custom information on this object.

statistics(self)

Provides some statistics related to the spider run.

Returns: SpiderStatistics

unreachable_urls
urls

green.io

class scalpel.green.AsyncFile(wrapper)

A wrapper around builtin io objects like io.StringIO or io.BufferedReader that runs blocking operations like read or write in a threadpool to make them gevent cooperative.

scalpel.green.wrap_file(file)

This function wraps any file object to provide an asynchronous (or gevent cooperative) file object interface.

Parameters:

  • file: A file-like object.

Returns: An AsyncFile object.

Usage:

from io import StringIO
from scalpel.green import wrap_file

s = StringIO()
async_s = wrap_file(s)
assert 5 == async_s.write('hello')
assert 'hello' == async_s.getvalue()

scalpel.green.open_file(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

An asynchronous version of the builtin open function, running blocking operations in a threadpool.

Parameters: The parameters are exactly the same as those passed to the builtin open function. You can check the official documentation to understand their meaning.

Returns: An AsyncFile object.

Usage:

from scalpel.green import open_file

with open_file('hello.txt', 'w') as f:
    f.write('hello world')

with open_file('hello.txt') as f:
    print(f.read())  # 'hello world'

green.StaticResponse

class scalpel.green.StaticResponse(reachable_urls, followed_urls, queue, *, url='', text='', httpx_response=None)

A response class used in combination with a StaticSpider object in the parse callable of a spider.

N.B: You probably don't need to instantiate this class directly except for some kind of testing. It is mainly exposed for annotation purposes.

Parameters:

  • reachable_urls: A set of urls already fetched.
  • followed_urls: A set of urls already followed by other StaticResponse objects.
  • queue: The gevent.queue.JoinableQueue used by the spider to handle incoming urls.
  • url: An optional keyword parameter representing the current url where content was fetched.
  • text: An optional keyword parameter representing the content of the resource fetched. Note that if you set the url parameter, you must set this one.
  • httpx_response: An optional keyword parameter representing an httpx.Response of the resource fetched. For HTTP urls, this is the one used in favour of url and text parameters.

Usage:

from scalpel.green import StaticResponse

response = StaticResponse(..., url='http://foo.com', text='<p>Hello world!</p>')
print(response.css('p::text').get())  # 'Hello world!'
print(response.xpath('//p/text()').get())  # 'Hello world!'

content

The bytes content associated with the url.

cookies

A dict of cookies associated with the response in case of an HTTP url. Empty dict otherwise.

css(self, query)

Applies CSS rules to select DOM elements.

Parameters:

  • query: The CSS rule used to select DOM elements.

Returns: parsel.SelectorList

follow(self, url)

Follows the given url if it hasn't been fetched yet (see the sketch after the parameter list).

Parameters:

  • url: The url to follow.
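
A sketch of a parse callable that follows every link found on the page (the CSS selector is illustrative):

from scalpel.green import StaticSpider, StaticResponse

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    # urls already fetched are not followed again
    for link in response.css('a::attr(href)').getall():
        response.follow(link)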

headers

A dict of http headers in case of an HTTP url. Empty dict otherwise.

text

The string content associated with the url.

url

The url associated with the response object.

xpath(self, query)

Applies XPath rules to select DOM elements.

Parameters:

  • query: The XPath rule used to select DOM elements.

Returns: parsel.SelectorList

green.SeleniumResponse

class scalpel.green.SeleniumResponse(reachable_urls, followed_urls, queue, *, driver, handle)

A response class used in combination with a SeleniumSpider object in the parse callable of a spider.

N.B: You probably don't need to instantiate this class directly except for some kind of testing. It is mainly exposed for annotation purposes.

Parameters:

  • reachable_urls: A set of urls already fetched.
  • followed_urls: A set of urls already followed by other SeleniumResponse objects.
  • queue: The gevent.queue.JoinableQueue used by the spider to handle incoming urls.
  • driver: The selenium.WebDriver object that will be used to control the running browser.
  • handle: A string that identifies the current window handled by selenium.

Usage:

from scalpel.green import SeleniumResponse

response = SeleniumResponse(...)
# We assume we have a page source like '<p>Hello world!</p>'
print(response.driver.find_element_by_xpath('//p').text)  # Hello world!

follow(self, url)

Follows the given url if it hasn't been fetched yet.

Parameters:

  • url: The url to follow.