API¶
Configuration¶
scalpel.Configuration(min_request_delay=0, max_request_delay=0, fetch_timeout=5.0, selenium_find_timeout=10.0, selenium_driver_log_file='driver.log', selenium_browser=Browser.FIREFOX, ...)
Configure variables for your spider.
Parameters:
- min_request_delay: The minimum delay to wait between two HTTP requests. Defaults to 0s.
- max_request_delay: The maximum delay to wait between two HTTP requests. Defaults to 0s.
- fetch_timeout: The timeout to fetch HTTP resources using the inner httpx client. Defaults to 5s.
- selenium_find_timeout: The timeout for the selenium driver to find an element in a page. Defaults to 10s.
- selenium_driver_log_file: The file where the browser logs debug messages. Defaults to driver.log. If you don't want to create one, just pass None.
- selenium_browser: The browser to use with the selenium spider. You can use the Browser enum to specify the value. Possible values are Browser.FIREFOX and Browser.CHROME. Defaults to Browser.FIREFOX.
- selenium_driver_executable_path: The path to the browser driver. Defaults to geckodriver if Browser.FIREFOX is selected as selenium_browser, otherwise defaults to chromedriver.
- user_agent: The user agent to fake. Mainly useful for the static spider. Defaults to a random value provided by fake-useragent; if that does not work, falls back to Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36.
- follow_robots_txt: Decide whether or not the spider should follow robots.txt rules on the website you are scraping. Defaults to False.
- robots_cache_folder: A folder used to cache the content of each website's robots.txt file, to avoid retrieving it each time you want to analyze an HTML page. Defaults to the system temporary directory.
- backup_filename: The filename where scraped items will be written. If you don't want one, simply pass None. Defaults to backup-{uuid}.mp where uuid is a uuid.uuid4 string value. Note that values inserted in this file are streamed using msgpack. Look at the documentation to see how to use it.
- response_middlewares: A list of callables that will be called with the callable that fetches the HTTP resource. This parameter is only useful for the static spider. Defaults to an empty list. See the item processor sketch after the usage example below.
- item_processors: A list of callables that will be called with a scraped item. Defaults to an empty list.
- msgpack_encoder: A callable that will be called when msgpack serializes an item. Defaults to scalpel.datetime_encoder.
- msgpack_decoder: A callable that will be called when msgpack deserializes an item. Defaults to scalpel.datetime_decoder.
Usage:
from scalpel import Configuration, Browser

config = Configuration(
    min_request_delay=1, max_request_delay=3, follow_robots_txt=True, selenium_browser=Browser.CHROME
)
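The processor and middleware hooks are plain callables. As a hedged sketch, here is what an item processor could look like, assuming each scraped item is passed to the callable; the processor name and the item shape are illustrative:
from scalpel import Configuration

def strip_whitespace(item):
    # hypothetical processor: called with every scraped item
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in item.items()}

config = Configuration(item_processors=[strip_whitespace])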
load_from_dotenv(env_file)
Loads configuration from a .env file.
Returns: Configuration
Usage:
# .env
SCALPEL_USER_AGENT = Mozilla/5.0
SCALPEL_FETCH_TIMEOUT = 4.0
SCALPEL_FOLLOW_ROBOTS_TXT = yes

from scalpel import Configuration

conf = Configuration.load_from_dotenv('.env')
conf.follow_robots_txt  # equals True
load_from_toml(toml_file)
Loads configuration from a toml file.
Returns: Configuration
Usage:
# conf.toml
[scalpel]
user_agent = "Mozilla/5.0"
fetch_timeout = 4.0
follow_robots_txt = true

from scalpel import Configuration

conf = Configuration.load_from_toml('conf.toml')
conf.fetch_timeout  # equals 4.0
load_from_yaml(yaml_file)
Loads configuration from a yaml file.
Returns: Configuration
Usage:
# conf.yaml
scalpel:
  fetch_timeout: 4.0
  user_agent: Mozilla/5.0
  follow_robots_txt: true

from scalpel import Configuration

conf = Configuration.load_from_yaml('conf.yaml')
conf.fetch_timeout  # equals 4.0
request_delay
A read-only property whose value is a random number between min_request_delay and max_request_delay (both ends included), used to wait between two HTTP requests.
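For instance, with the configuration below, each access to the property draws a new value:
from scalpel import Configuration

config = Configuration(min_request_delay=1, max_request_delay=3)
print(config.request_delay)  # a random value in [1, 3], re-drawn on each access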
State¶
scalpel.State()
An empty class used to store arbitrary data.
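A minimal sketch of the intent; the attribute names are arbitrary examples:
from scalpel import State

state = State()
state.counter = 0                        # attach any attribute you need
state.seen_urls = {'http://example.com'}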
Browser¶
scalpel.Browser(value, names=None, *, module=None, qualname=None, type=None, start=1)
An enum with different browser values.
CHROME
FIREFOX
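Usage mirrors the Configuration example above:
from scalpel import Browser, Configuration

config = Configuration(selenium_browser=Browser.CHROME)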
msgpack¶
scalpel.datetime_encoder(data)
A datetime encoder for msgpack.
Usage:
from datetime import datetime
from scalpel import datetime_encoder
import msgpack
data = {'fruit': 'apple', 'date': datetime.utcnow()}
packed_data = msgpack.packb(data, default=datetime_encoder)
scalpel.datetime_decoder(data)
A datetime decoder for msgpack.
Usage:
from datetime import datetime
from scalpel import datetime_encoder, datetime_decoder
import msgpack
data = {'fruit': 'apple', 'date': datetime.utcnow()}
packed_data = msgpack.packb(data, default=datetime_encoder)
assert msgpack.unpackb(packed_data, object_hook=datetime_decoder) == data
scalpel.green.write_mp(filename, data, mode='a', encoder=None)
Writes a msgpack file.
Parameters:
- filename: The name of the file where data will be written. It can be a string or a pathlib.Path.
- data: Arbitrary data to serialize. Note that if you want to serialize data types not supported by default by msgpack, you will need to provide a custom encoder function.
- mode: The mode in which the file is opened. Valid values are "a" (append) and "w" (write). Defaults to "a".
- encoder: An optional function used to encode data types not handled by default by msgpack.
Returns: The number of written bytes.
Usage:
from datetime import datetime
from scalpel import datetime_encoder
from scalpel.green import write_mp
data = {'fruit': 'apple', 'date': datetime.utcnow()}
length = write_mp('file.mp', data, 'w', datetime_encoder)
print(length) # 65
scalpel.green.read_mp(filename, decoder=None)
Reads a msgpack file generated by the spider when calling the save_item method.
Parameters:
- filename: The name of the file to read. It can be a string or a pathlib.Path.
- decoder: An optional function used to decode data types not handled by default by msgpack.
Usage:
from scalpel import datetime_decoder
from scalpel.green import read_mp
for item in read_mp('file.mp', datetime_decoder):
    print(item)
SpiderStatistics¶
scalpel.SpiderStatistics(reachable_urls, unreachable_urls, robot_excluded_urls, followed_urls, request_counter, average_fetch_time, total_time)
Provides some statistics about a spider run.
Parameters:
- reachable_urls: set of urls that were fetched (or read in case of file urls) and parsed.
- unreachable_urls: set of urls that were impossible to fetch (or read in case of file urls).
- robot_excluded_urls: set of urls that were excluded from fetching because of robots.txt rules.
- followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
- request_counter: The number of urls fetched or read (in case of file urls).
- average_fetch_time: The average time to fetch a url (or read a file in case of file urls).
- total_time: The total execution time of the spider.
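You normally obtain this object from a spider's statistics() method rather than building it yourself; a short sketch, where the url is only an example:
from scalpel.green import StaticSpider, StaticResponse

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    ...

spider = StaticSpider(urls=['http://example.com'], parse=parse)
spider.run()
stats = spider.statistics()
print(stats.request_counter, stats.average_fetch_time, stats.total_time)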
green.StaticSpider¶
scalpel.green.StaticSpider(urls, parse, name=NOTHING, config=NOTHING, ignore_errors=False)
A spider suitable for parsing local files or static HTML pages.
Parameters:
- urls: Urls to parse. Allowed schemes are http, https and file. It can be a list, a tuple or a set.
- parse: A callable used to parse url content. It takes two arguments: the current spider and a StaticResponse object.
- reachable_urls: set of urls that are already fetched or read.
- unreachable_urls: set of urls that were impossible to fetch or read.
- robot_excluded_urls: set of urls that were excluded from fetching because of robots.txt rules.
- followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
- request_counter: The number of urls already fetched or read.
Usage:
from scalpel.green import StaticSpider, StaticResponse

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    ...

spider = StaticSpider(urls=['http://example.com'], parse=parse)
spider.run()
config
Returns the Configuration related to the spider.
followed_urls
name
Returns the name given to the spider.
parse
reachable_urls
request_counter
robots_excluded_urls
run(self)
Runs the spider.
save_item(self, item)
Saves a scraped item in the backup file specified by the Configuration.backup_filename attribute.
state
Returns the State related to the spider. You can add custom information on this object.
statistics(self)
Provides some statistics related to the spider run.
Returns: SpiderStatistics
unreachable_urls
urls
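Putting the pieces together, a sketch of a complete run; the url, the XPath expression and the item shape are illustrative assumptions, adapt them to the page you scrape:
from scalpel import Configuration
from scalpel.green import StaticSpider, StaticResponse

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    # save every paragraph of the page as one item (illustrative selector)
    for text in response.xpath('//p/text()').getall():
        spider.save_item({'url': response.url, 'text': text})

config = Configuration(backup_filename='items.mp', follow_robots_txt=True)
spider = StaticSpider(urls=['http://example.com'], parse=parse, config=config)
spider.run()
print(spider.statistics())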
green.SeleniumSpider¶
scalpel.green.SeleniumSpider(urls, parse, name=NOTHING, config=NOTHING, ignore_errors=False)
A spider suitable for parsing dynamic websites, i.e. websites where Javascript is heavily used. You will sometimes encounter the term Single-Page Application (SPA) for this type of website. It relies on the selenium package and a browser.
Parameters:
- urls: Urls to parse. Allowed schemes are http, https and file. It can be a list, a tuple or a set.
- parse: A callable used to parse url content. It takes two arguments: the current spider and a SeleniumResponse object.
- reachable_urls: set of urls that are already fetched or read.
- unreachable_urls: set of urls that were impossible to fetch or read.
- robot_excluded_urls: set of urls that were excluded from fetching because of robots.txt rules.
- followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
- request_counter: The number of urls already fetched or read.
Usage:
from scalpel.green import SeleniumSpider, SeleniumResponse

def parse(spider: SeleniumSpider, response: SeleniumResponse) -> None:
    ...

spider = SeleniumSpider(urls=['http://example.com'], parse=parse)
spider.run()
config
Returns the Configuration related to the spider.
followed_urls
name
Returns the name given to the spider.
parse
reachable_urls
request_counter
robots_excluded_urls
run(self)
Runs the spider.
save_item(self, item)
Saves a scraped item in the backup file specified by the Configuration.backup_filename attribute.
state
Returns the State related to the spider. You can add custom information on this object.
statistics(self)
Provides some statistics related to the spider run.
Returns: SpiderStatistics
unreachable_urls
urls
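As with the static spider, items can be saved from the parse callable; a minimal sketch, where response.driver is the underlying selenium WebDriver and the url is only an example:
from scalpel.green import SeleniumSpider, SeleniumResponse

def parse(spider: SeleniumSpider, response: SeleniumResponse) -> None:
    # the title is taken from the browser-rendered page, Javascript included
    spider.save_item({'url': response.driver.current_url, 'title': response.driver.title})

spider = SeleniumSpider(urls=['http://example.com'], parse=parse)
spider.run()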
green.io¶
scalpel.green.AsyncFile(wrapper)
A wrapper around built-in io objects like io.StringIO or io.BufferedReader that runs blocking operations like read or write in a threadpool to make them gevent cooperative.
scalpel.green.wrap_file(file)
This function wraps any file object in a wrapper that provides an asynchronous (or gevent cooperative) file object interface.
Parameters:
- file: A file-like object.
Returns: An AsyncFile object.
Usage:
from io import StringIO
from scalpel.green import wrap_file
s = StringIO()
async_s = wrap_file(s)
assert 5 == async_s.write('hello')
assert 'hello' == async_s.getvalue()
scalpel.green.open_file(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
An asynchronous version of the builtin open function, running blocking operations in a threadpool.
Parameters:
The parameters are exactly the same as those passed to the builtin open function. You can check the official documentation to understand their meaning.
Returns: An AsyncFile object.
Usage:
from scalpel.green import open_file

with open_file('hello.txt', 'w') as f:
    f.write('hello world')

with open_file('hello.txt') as f:
    print(f.read())  # 'hello world'
green.StaticResponse¶
scalpel.green.StaticResponse(reachable_urls, followed_urls, queue, *, url='', text='', httpx_response=None)
A response class used in combination with a StaticSpider object in the parse callable of a spider.
N.B: You probably don't need to instantiate this class directly unless for some kind of testing. It is mainly exposed for annotation purposes.
Parameters:
- reachable_urls: A set of urls already fetched.
- followed_urls: A set of urls already followed by other StaticResponse objects.
- queue: The gevent.queue.JoinableQueue used by the spider to handle incoming urls.
- url: An optional keyword parameter representing the current url where content was fetched.
- text: An optional keyword parameter representing the content of the resource fetched. Note that if you set the url parameter, you must set this one.
- httpx_response: An optional keyword parameter representing an httpx.Response of the resource fetched. For HTTP urls, this is the one used in favour of the url and text parameters.
Usage:
from scalpel.green import StaticResponse

response = StaticResponse(..., url='http://foo.com', text='<p>Hello world!</p>')
print(response.css('p::text').get())  # 'Hello world!'
print(response.xpath('//p/text()').get())  # 'Hello world!'
content
The bytes content associated to the url.
cookies
A dict of cookies associated to the response in case of an HTTP url. Empty dict otherwise.
css(self, query)
Applies CSS rules to select DOM elements.
Parameters:
- query: The CSS rule used to select DOM elements.
Returns: parsel.SelectorList
follow(self, url)
Follows the given url if it hasn't been fetched yet.
Parameters:
- url: The url to follow.
headers
A dict of http headers in case of an HTTP url. Empty dict otherwise.
text
The string content associated to the url.
url
The url associated to the response object.
xpath(self, query)
Applies XPath rules to select DOM elements.
Parameters:
- query: The XPath rule used to select DOM elements.
Returns: parsel.SelectorList
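A common pattern in a parse callable is to select links and follow them; a sketch assuming the page contains plain <a> links:
from scalpel.green import StaticSpider, StaticResponse

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    # follow skips urls that have already been fetched
    for href in response.xpath('//a/@href').getall():
        response.follow(href)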
green.SeleniumResponse¶
scalpel.green.SeleniumResponse(reachable_urls, followed_urls, queue, *, driver, handle)
A response class used in combination with a SeleniumSpider object in the parse callable of a spider.
N.B: You probably don't need to instantiate this class directly unless for some kind of testing. It is mainly exposed for annotation purposes.
Parameters:
- reachable_urls: A set of urls already fetched.
- followed_urls: A set of urls already followed by other SeleniumResponse objects.
- queue: The gevent.queue.JoinableQueue used by the spider to handle incoming urls.
- driver: The selenium.WebDriver object that will be used to control the running browser.
- handle: A string that identifies the current window handled by selenium.
Usage:
from scalpel.green import SeleniumResponse

response = SeleniumResponse(...)
# We assume we have a page source like '<p>Hello world!</p>'
print(response.driver.find_element_by_xpath('//p').text)  # Hello world!
follow(self, url)
Follows the given url if it hasn't been fetched yet.
Parameters:
- url: The url to follow.
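As with StaticResponse, links can be followed from the parse callable; a sketch using the driver exposed by the response (the XPath is an illustrative assumption, and find_elements_by_xpath follows the same selenium API style as the usage example above):
from scalpel.green import SeleniumSpider, SeleniumResponse

def parse(spider: SeleniumSpider, response: SeleniumResponse) -> None:
    # follow every link rendered by the browser
    for link in response.driver.find_elements_by_xpath('//a'):
        response.follow(link.get_attribute('href'))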