API¶
Configuration¶
scalpel.Configuration(min_request_delay=0, max_request_delay=0, fetch_timeout=5.0, selenium_find_timeout=10.0, selenium_driver_log_file='driver.log', selenium_browser=…)
Configure variables for your spider.
Parameters:
- min_request_delay: The minimum delay to wait between two HTTP requests. Defaults to 0s.
- max_request_delay: The maximum delay to wait between two HTTP requests. Defaults to 0s.
- fetch_timeout: The timeout for fetching HTTP resources with the inner httpx client. Defaults to 5s.
- selenium_find_timeout: The timeout for the selenium driver to find an element in a page. Defaults to 10s.
- selenium_driver_log_file: The file where the browser logs debug messages. Defaults to driver.log. If you do not want one, pass None.
- selenium_browser: The browser to use with the selenium spider. Use the Browser enum to specify the value. Possible values are Browser.FIREFOX and Browser.CHROME. Defaults to Browser.FIREFOX.
- selenium_driver_executable_path: The path to the browser driver. Defaults to geckodriver if Browser.FIREFOX is selected as selenium_browser, otherwise defaults to chromedriver.
- user_agent: The user agent to fake. Mainly useful for the static spider. Defaults to a random value provided by fake-useragent; if that does not work, it falls back to Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36
- follow_robots_txt: Whether the spider should follow the robots.txt rules of the website you are scraping. Defaults to False.
- robots_cache_folder: A folder used to cache the robots.txt files of the websites you scrape, so that they are not retrieved each time you analyze an HTML page. Defaults to the system temporary directory.
- backup_filename: The file where scraped items will be written. If you don't want one, simply pass None. Defaults to backup-{uuid}.mp where uuid is a uuid.uuid4 string value. Note that values inserted in this file are streamed using msgpack. Look at the documentation to see how to use it.
- response_middlewares: A list of callables that will be called with the callable that fetches the HTTP resource. This parameter is only useful for the static spider. Defaults to an empty list.
- item_processors: A list of callables that will be called with a scraped item. Defaults to an empty list.
- msgpack_encoder: A callable that will be called when msgpack serializes an item. Defaults to scalpel.datetime_encoder.
- msgpack_decoder: A callable that will be called when msgpack deserializes an item. Defaults to scalpel.datetime_decoder.
Usage:
from scalpel import Configuration, Browser
config = Configuration(
    min_request_delay=1, max_request_delay=3, follow_robots_txt=True, selenium_browser=Browser.CHROME
)
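To give a feel for what following robots.txt rules means, here is a minimal sketch that uses only the standard library's urllib.robotparser. It is not scalpel's implementation; the robots.txt content and the is_allowed helper are hypothetical and exist only for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `url` may be fetched by `user_agent` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, 'Mozilla/5.0', 'http://example.com/index.html'))    # True
print(is_allowed(ROBOTS_TXT, 'Mozilla/5.0', 'http://example.com/private/page'))  # False
```

With follow_robots_txt=True, urls rejected by a check of this kind are the ones you would expect to see reported as excluded by robots.txt rules.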
load_from_dotenv(env_file)
Loads configuration from a .env file.
Returns: Configuration
Usage:
# .env
SCALPEL_USER_AGENT = Mozilla/5.0
SCALPEL_FETCH_TIMEOUT = 4.0
SCALPEL_FOLLOW_ROBOTS_TXT = yes
from scalpel import Configuration
conf = Configuration.load_from_dotenv('.env')
conf.follow_robots_txt  # equals True
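The example above suggests that SCALPEL_-prefixed keys map to lowercase option names and that a value such as yes is read as a boolean. The sketch below shows one way such a mapping could work using only the standard library. It is an assumption-laden illustration, not the library's parser: parse_dotenv, the TRUTHY and FALSY sets, and the float conversion are all hypothetical.

```python
# Assumed spellings for booleans; only "yes" is confirmed by the example above.
TRUTHY = {'yes', 'true', 'on'}
FALSY = {'no', 'false', 'off'}


def parse_dotenv(text: str, prefix: str = 'SCALPEL_') -> dict:
    """Turn `KEY = value` lines into a dict keyed by lowercase option names."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and anything that is not an assignment.
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, value = (part.strip() for part in line.split('=', 1))
        if not key.startswith(prefix):
            continue
        name = key[len(prefix):].lower()
        lowered = value.lower()
        if lowered in TRUTHY:
            options[name] = True
        elif lowered in FALSY:
            options[name] = False
        else:
            # Numbers become floats, everything else stays a string.
            try:
                options[name] = float(value)
            except ValueError:
                options[name] = value
    return options


env_text = """\
# .env
SCALPEL_USER_AGENT = Mozilla/5.0
SCALPEL_FETCH_TIMEOUT = 4.0
SCALPEL_FOLLOW_ROBOTS_TXT = yes
"""
options = parse_dotenv(env_text)
print(options)
# {'user_agent': 'Mozilla/5.0', 'fetch_timeout': 4.0, 'follow_robots_txt': True}
```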
load_from_toml(toml_file)
Loads configuration from a toml file.
Returns: Configuration
Usage:
# conf.toml
[scalpel]
user_agent = "Mozilla/5.0"
fetch_timeout = 4.0
follow_robots_txt = true
from scalpel import Configuration
conf = Configuration.load_from_toml('conf.toml')
conf.fetch_timeout  # equals 4.0
load_from_yaml(yaml_file)
Loads configuration from a yaml file.
Returns: Configuration
Usage:
# conf.yaml
scalpel:
fetch_timeout: 4.0
user_agent: Mozilla/5.0
follow_robots_txt: true
from scalpel import Configuration
conf = Configuration.load_from_yaml('conf.yaml')
conf.fetch_timeout  # equals 4.0
request_delay
A read-only property that returns a random value between min_request_delay and max_request_delay (both sides included). It is used as the wait time between two HTTP requests.
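As a rough illustration of the request_delay behaviour, the sketch below draws a random delay from a closed range with the standard library. The pick_request_delay function is hypothetical, and whether scalpel draws an integer or a float is not stated here, so the use of random.uniform is an assumption.

```python
import random


def pick_request_delay(min_delay: float, max_delay: float) -> float:
    """Return a random delay in seconds within [min_delay, max_delay]."""
    if min_delay > max_delay:
        raise ValueError('min_delay must not exceed max_delay')
    # Assumption: a uniformly distributed float is acceptable as a delay.
    return random.uniform(min_delay, max_delay)


delay = pick_request_delay(1, 3)
print(1 <= delay <= 3)  # True
```

With both bounds left at their default of 0, a delay picked this way is always 0, meaning no wait between requests.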
State¶
scalpel.State()
An empty class used to store arbitrary data.
Browser¶
scalpel.Browser(value, names=None, *, module=None, qualname=None, type=None, start=1)
An enum with different browser values.
CHROME
Selects the Chrome browser.
FIREFOX
Selects the Firefox browser.
msgpack¶
scalpel.datetime_encoder(data)
A datetime encoder for msgpack.
Usage:
from datetime import datetime
from scalpel import datetime_encoder
import msgpack
data = {'fruit': 'apple', 'date': datetime.utcnow()}
packed_data = msgpack.packb(data, default=datetime_encoder)
scalpel.datetime_decoder(data)
A datetime decoder for msgpack.
Usage:
from datetime import datetime
from scalpel import datetime_encoder, datetime_decoder
import msgpack
data = {'fruit': 'apple', 'date': datetime.utcnow()}
packed_data = msgpack.packb(data, default=datetime_encoder)
assert msgpack.unpackb(packed_data, object_hook=datetime_decoder) == data
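The encoder and decoder above plug into msgpack through its default and object_hook hooks. The standard library's json module exposes hooks with the same names and calling convention, so the sketch below uses json as a stand-in to show how such a pair cooperates. The encode_datetime and decode_datetime functions and the __datetime__ marker are hypothetical; they do not describe how scalpel.datetime_encoder represents dates.

```python
import json
from datetime import datetime


def encode_datetime(value):
    """`default` hook: called only for objects the serializer cannot handle."""
    if isinstance(value, datetime):
        # Hypothetical tagged representation of a datetime.
        return {'__datetime__': True, 'iso': value.isoformat()}
    raise TypeError(f'Cannot serialize {type(value).__name__}')


def decode_datetime(mapping):
    """`object_hook`: called for every decoded mapping, innermost first."""
    if mapping.get('__datetime__'):
        return datetime.fromisoformat(mapping['iso'])
    return mapping


data = {'fruit': 'apple', 'date': datetime(2020, 1, 2, 3, 4, 5)}
packed = json.dumps(data, default=encode_datetime)
restored = json.loads(packed, object_hook=decode_datetime)
print(restored == data)  # True
```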
scalpel.green.write_mp(filename, data, mode='a', encoder=None)
Writes a msgpack file.
Parameters:
- filename: The name of the file where data will be written. It can be a string or a pathlib.Path.
- data: Arbitrary data to serialize. Note that if you want to serialize data types not supported by the json module, you will need to provide a custom encoder function.
- mode: The mode in which the file is opened. Valid values are "a" (append) and "w" (write). Defaults to "a".
- encoder: An optional function used to encode data types not handled by default by msgpack.
Returns: The number of written bytes.
Usage:
from datetime import datetime
from scalpel import datetime_encoder
from scalpel.green import write_mp
data = {'fruit': 'apple', 'date': datetime.utcnow()}
length = write_mp('file.mp', data, 'w', datetime_encoder)
print(length) # 65
scalpel.green.read_mp(filename, decoder=None)
Reads a msgpack file generated by the spider when calling the save_item method.
Parameters:
- filename: The name of the file to read. It can be a string or a pathlib.Path.
- decoder: An optional function used to decode data types not handled by default by msgpack.
Usage:
from scalpel import datetime_decoder
from scalpel.green import read_mp
for item in read_mp('file.mp', datetime_decoder):
    print(item)
SpiderStatistics¶
scalpel.SpiderStatistics(reachable_urls, unreachable_urls, robot_excluded_urls, followed_urls, request_counter, average_fetch_time, total_time)
Provides some statistics about a spider that has run.
Parameters:
- reachable_urls: set of urls that were fetched (or read in case of file urls) and parsed.
- unreachable_urls: set of urls that were impossible to fetch (or read in case of file urls).
- robot_excluded_urls: set of urls that were not fetched because of robots.txt file rules.
- followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
- request_counter: The number of urls fetched or read (in case of file urls).
- average_fetch_time: The average time to fetch a url (or read a file in case of file urls).
- total_time: The total execution time of the spider.
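The note that followed urls end up scattered across the first three sets can be checked with plain set operations. The sketch below uses a hypothetical Stats dataclass that only mirrors the field names listed above, with made-up values; it is not the library's class.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Hypothetical stand-in with the same field names as SpiderStatistics."""
    reachable_urls: set
    unreachable_urls: set
    robot_excluded_urls: set
    followed_urls: set
    request_counter: int
    average_fetch_time: float
    total_time: float


def split_followed(stats: Stats) -> dict:
    """Group followed urls by the set they ended up in."""
    return {
        'reachable': stats.followed_urls & stats.reachable_urls,
        'unreachable': stats.followed_urls & stats.unreachable_urls,
        'robot_excluded': stats.followed_urls & stats.robot_excluded_urls,
    }


# Made-up values for illustration only.
stats = Stats(
    reachable_urls={'http://example.com', 'http://example.com/a'},
    unreachable_urls={'http://example.com/b'},
    robot_excluded_urls={'http://example.com/private'},
    followed_urls={'http://example.com/a', 'http://example.com/b', 'http://example.com/private'},
    request_counter=3,
    average_fetch_time=0.2,
    total_time=1.5,
)
groups = split_followed(stats)
print(groups['reachable'])  # {'http://example.com/a'}
```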
green.StaticSpider¶
scalpel.green.StaticSpider(urls, parse, name=NOTHING, config=NOTHING, ignore_errors=False)
A spider suitable for parsing files or static HTML content.
Parameters:
- urls: Urls to parse. Allowed schemes are http, https and file. It can be a list, a tuple or a set.
- parse: A callable used to parse url content. It takes two arguments: the current spider and a StaticResponse object.
- reachable_urls: set of urls that are already fetched or read.
- unreachable_urls: set of urls that were impossible to fetch or read.
- robot_excluded_urls: set of urls that were not fetched because of robots.txt file rules.
- followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
- request_counter: The number of urls already fetched or read.
Usage:
from scalpel.green import StaticSpider, StaticResponse
def parse(spider: StaticSpider, response: StaticResponse) -> None:
    ...
spider = StaticSpider(urls=['http://example.com'], parse=parse)
spider.run()
config
Returns the Configuration related to the spider.
followed_urls
name
Returns the name given to the spider.
parse
reachable_urls
request_counter
robots_excluded_urls
run(self)
Runs the spider.
save_item(self, item)
Saves a scraped item in the backup file specified in the Configuration.backup_filename attribute.
state
Returns the State related to the spider. You can add custom information on this object.
statistics(self)
Provides some statistics about the spider once it has run.
Returns: SpiderStatistics
unreachable_urls
urls
green.SeleniumSpider¶
scalpel.green.SeleniumSpider(urls, parse, name=NOTHING, config=NOTHING, ignore_errors=False)
A spider suitable for parsing dynamic websites, i.e. websites where JavaScript is heavily used. You will sometimes encounter the term Single-Page Application (SPA) for this type of website. It relies on the selenium package and a browser.
Parameters:
- urls: Urls to parse. Allowed schemes are http, https and file. It can be a list, a tuple or a set.
- parse: A callable used to parse url content. It takes two arguments: the current spider and a SeleniumResponse object.
- reachable_urls: set of urls that are already fetched or read.
- unreachable_urls: set of urls that were impossible to fetch or read.
- robot_excluded_urls: set of urls that were not fetched because of robots.txt file rules.
- followed_urls: set of urls that were followed during the process of parsing url content. You will find these urls scattered in the first three sets.
- request_counter: The number of urls already fetched or read.
Usage:
from scalpel.green import SeleniumSpider, SeleniumResponse
def parse(spider: SeleniumSpider, response: SeleniumResponse) -> None:
    ...
spider = SeleniumSpider(urls=['http://example.com'], parse=parse)
spider.run()
config
Returns the Configuration related to the spider.
followed_urls
name
Returns the name given to the spider.
parse
reachable_urls
request_counter
robots_excluded_urls
run(self)
Runs the spider.
save_item(self, item)
Saves a scraped item in the backup file specified in the Configuration.backup_filename attribute.
state
Returns the State related to the spider. You can add custom information on this object.
statistics(self)
Provides some statistics about the spider once it has run.
Returns: SpiderStatistics
unreachable_urls
urls
green.io¶
scalpel.green.AsyncFile(wrapper)
A wrapper around builtin io objects like io.StringIO or io.BufferedReader that runs blocking operations like read or write in a threadpool to make them gevent cooperative.
scalpel.green.wrap_file(file)
This function wraps any file object in a wrapper that provides an asynchronous (or gevent cooperative) file object interface.
Parameters:
- file: A file-like object.
Returns: An AsyncFile object.
Usage:
from io import StringIO
from scalpel.green import wrap_file
s = StringIO()
async_s = wrap_file(s)
assert 5 == async_s.write('hello')
assert 'hello' == async_s.getvalue()
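The idea of delegating blocking file calls to a pool of worker threads can be sketched with the standard library alone. The example below uses concurrent.futures instead of gevent's threadpool, so it only illustrates the delegation pattern; ThreadedFile is a hypothetical class and not how AsyncFile is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from io import StringIO


class ThreadedFile:
    """Hypothetical wrapper that runs file methods in a worker thread."""

    def __init__(self, wrapped, executor):
        self._wrapped = wrapped
        self._executor = executor

    def __getattr__(self, name):
        # Only reached for names not defined on the wrapper itself.
        attribute = getattr(self._wrapped, name)
        if not callable(attribute):
            return attribute

        def run_in_pool(*args, **kwargs):
            # Submit the blocking call to the pool and wait for its result.
            return self._executor.submit(attribute, *args, **kwargs).result()

        return run_in_pool


with ThreadPoolExecutor(max_workers=1) as pool:
    wrapped = ThreadedFile(StringIO(), pool)
    written = wrapped.write('hello')
    value = wrapped.getvalue()

print(written, value)  # 5 hello
```

In this stand-in, waiting on the result blocks the calling thread. With gevent, the equivalent wait lets other greenlets run in the meantime, which is the point of making file access cooperative.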
scalpel.green.open_file(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
An asynchronous version of the builtin open function that runs blocking operations in a threadpool.
Parameters:
The parameters are exactly the same as those passed to the builtin open function. You can check the official documentation to understand their meaning.
Returns: An AsyncFile object.
Usage:
from scalpel.green import open_file
with open_file('hello.txt', 'w') as f:
    f.write('hello world')
with open_file('hello.txt') as f:
    print(f.read())  # 'hello world'
green.StaticResponse¶
scalpel.green.StaticResponse(reachable_urls, followed_urls, queue, *, url='', text='', httpx_response=None)
A response class used in combination with a StaticSpider object in the parse callable of a spider.
N.B: You probably don't need to instantiate this class directly, except for testing. It is mainly exposed for annotation purposes.
Parameters:
- reachable_urls: A set of urls already fetched.
- followed_urls: A set of urls already followed by other StaticResponse objects.
- queue: The gevent.queue.JoinableQueue used by the spider to handle incoming urls.
- url: An optional keyword parameter representing the current url where content was fetched.
- text: An optional keyword parameter representing the content of the resource fetched. Note that if you set the url parameter, you must also set this one.
- httpx_response: An optional keyword parameter representing an httpx.Response of the resource fetched. For HTTP urls, it is used in favour of the url and text parameters.
Usage:
from scalpel.green import StaticResponse
response = StaticResponse(..., url='http://foo.com', text='<p>Hello world!</p>')
print(response.css('p::text').get()) # 'Hello world!'
print(response.xpath('//p/text()').get()) # 'Hello world!'
content
The bytes content associated with the url.
cookies
A dict of cookies associated with the response in case of an HTTP url. Empty dict otherwise.
css(self, query)
Applies CSS rules to select DOM elements.
Parameters:
- query: The CSS rule used to select DOM elements.
Returns: parsel.SelectorList
follow(self, url)
Follows the given url if it hasn't been fetched yet.
Parameters:
- url: The url to follow.
headers
A dict of HTTP headers in case of an HTTP url. Empty dict otherwise.
text
The string content associated with the url.
url
The url associated with the response object.
xpath(self, query)
Applies XPath rules to select DOM elements.
Parameters:
- query: The XPath rule used to select DOM elements.
Returns: parsel.SelectorList
green.SeleniumResponse¶
scalpel.green.SeleniumResponse(reachable_urls, followed_urls, queue, *, driver, handle)
A response class used in combination with a SeleniumSpider object in the parse callable of a spider.
N.B: You probably don't need to instantiate this class directly, except for testing. It is mainly exposed for annotation purposes.
Parameters:
- reachable_urls: A set of urls already fetched.
- followed_urls: A set of urls already followed by other StaticResponse objects.
- queue: The gevent.queue.JoinableQueue used by the spider to handle incoming urls.
- driver: The selenium.WebDriver object that will be used to control the running browser.
- handle: A string that identifies the current window handled by selenium.
Usage:
from scalpel.green import SeleniumResponse
response = SeleniumResponse(...)
# We assume we have a page source like '<p>Hello world!</p>'
print(response.driver.find_element_by_xpath('//p').text) # Hello world!
follow(self, url)
Follows the given url if it hasn't been fetched yet.
Parameters:
- url: The url to follow.
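The follow behaviour described for both response classes, together with their reachable_urls, followed_urls and queue parameters, suggests a simple deduplication pattern. The sketch below shows that pattern with the standard library's queue.Queue in place of gevent.queue.JoinableQueue. FollowTracker and the boolean return value are hypothetical; the return value of the library's follow method is not documented here.

```python
from queue import Queue


class FollowTracker:
    """Hypothetical helper that enqueues a url at most once."""

    def __init__(self):
        self.reachable_urls = set()  # urls already fetched
        self.followed_urls = set()   # urls already scheduled by a follow call
        self.queue = Queue()         # stand-in for the spider's url queue

    def follow(self, url: str) -> bool:
        """Enqueue `url` unless it was already fetched or followed."""
        if url in self.reachable_urls or url in self.followed_urls:
            return False
        self.followed_urls.add(url)
        self.queue.put(url)
        return True


tracker = FollowTracker()
tracker.reachable_urls.add('http://example.com')
first = tracker.follow('http://example.com/page2')
second = tracker.follow('http://example.com/page2')  # duplicate, ignored
third = tracker.follow('http://example.com')         # already fetched, ignored
print(first, second, third, tracker.queue.qsize())  # True False False 1
```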