Configuration¶
Pyscalpel lets you configure your spider's parameters in several ways.
Instantiating the Configuration class directly¶
The most obvious way to configure your spider is to use the Configuration class directly in your code.
from scalpel import Configuration
from scalpel.green import StaticSpider, StaticResponse
def parse(spider: StaticSpider, response: StaticResponse) -> None:
    pass
config = Configuration(follow_robots_txt=False, min_request_delay=2)
spider = StaticSpider(urls=['http://foo.com'], parse=parse, config=config)
For the full list of available settings, see the relevant API section.
Keep in mind that you can access your configuration during spider execution through the spider.config attribute, and you can even
change some values on the fly, but it is best to set your configuration only before running your spider.
Note
All the settings available have a default value.
Setting configuration through the use of a .env file¶
To follow the 12 factor app principles, you need to store your configuration in your
environment. One easy way to accomplish this is to use .env files.
To find the name to use in your .env file for a configuration variable, write the Configuration
attribute in capital letters and prefix it with SCALPEL_. For example, if you want to set the
Configuration.min_request_delay property, you need to set the environment variable SCALPEL_MIN_REQUEST_DELAY.
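The naming rule above can be sketched in a few lines. The helper env_var_name is hypothetical and is not part of pyscalpel; it only illustrates how an attribute name maps to its environment variable.

```python
def env_var_name(attribute: str, prefix: str = "SCALPEL_") -> str:
    """Return the environment variable name for a Configuration attribute.

    Hypothetical helper for illustration only; pyscalpel does not expose it.
    """
    return f"{prefix}{attribute.upper()}"


# Attribute names taken from the examples on this page.
for attribute in ("min_request_delay", "follow_robots_txt", "user_agent"):
    print(f"{attribute} -> {env_var_name(attribute)}")
```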
Values are straightforward for strings and numbers, but less obvious for other types. Here is what you can expect in the following cases:
- For boolean values like Configuration.follow_robots_txt, the values true, yes, y and 1 are evaluated to True and the values false, no, n and 0 are evaluated to False.
- For an Enum like Browser, just write the enum value, such as FIREFOX or CHROME.
- For callables like Configuration.msgpack_encoder, just write the path to the callable using dot notation. For example, if you have a module named my_module and a callable my_callable in that module, you can refer to it with my_module.my_callable. The same logic applies to modules nested in packages. Keep in mind that the module you specify must be on the Python path for this to work.
- For a list of callables like Configuration.response_middlewares, the logic is the same as before, except that the callables are separated by a semicolon (;), a colon (:), a comma (,) or whitespace. For example: my_module.callable_1:my_module.callable_2. With the first three separators, you can also put a space after the separator for readability: my_module.callable_1, my_module.callable_2.
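To make these conversion rules concrete, here is a minimal sketch of how such string values could be interpreted. It is not pyscalpel's actual implementation: the function names are hypothetical, the Browser enum is a local stand-in, and both case-insensitive boolean matching and enum lookup by member name are assumptions.

```python
import importlib
import re
from enum import Enum, auto
from typing import Any, Callable, List


class Browser(Enum):
    """Local stand-in for pyscalpel's Browser enum, so the sketch runs alone."""
    FIREFOX = auto()
    CHROME = auto()


TRUE_VALUES = {"true", "yes", "y", "1"}
FALSE_VALUES = {"false", "no", "n", "0"}


def parse_bool(raw: str) -> bool:
    value = raw.strip().lower()  # assumption: matching ignores case
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean value: {raw!r}")


def parse_enum(raw: str) -> Browser:
    # Assumption: the value in the file is the enum member's name.
    return Browser[raw.strip()]


def parse_callable(path: str) -> Callable[..., Any]:
    # Split "package.module.name" into the module path and the attribute name.
    module_path, _, name = path.strip().rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {path!r}")
    module = importlib.import_module(module_path)  # module must be importable
    return getattr(module, name)


def parse_callable_list(raw: str) -> List[Callable[..., Any]]:
    # Accept ';', ':', ',' or whitespace, optionally followed by spaces.
    paths = [part for part in re.split(r"[;:,\s]+", raw.strip()) if part]
    return [parse_callable(part) for part in paths]


print(parse_bool("yes"))
print(parse_enum("FIREFOX"))
print(parse_callable("json.dumps"))
print(parse_callable_list("json.dumps, json.loads"))
```

The standard library's json.dumps and json.loads are used here only because they are always importable; in a real .env file you would point to your own callables.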
Assuming we have a .env file like the following in our project:
SCALPEL_USER_AGENT = Mozilla/5.0
SCALPEL_FETCH_TIMEOUT = 4.0
SCALPEL_FOLLOW_ROBOTS_TXT = yes
We can set configuration for our spider like this:
from scalpel import Configuration
config = Configuration.load_from_dotenv('.env')
print(config.follow_robots_txt) # True
Setting configuration through the use of a yaml file¶
YAML is a popular file format used for project configuration. You can use it to
configure your spider through the convenient class method Configuration.load_from_yaml. All your settings must be under
a top-level key named scalpel.
Assuming we have a yaml file like the following:
# conf.yaml
scalpel:
  fetch_timeout: 4.0
  user_agent: Mozilla/5.0
  follow_robots_txt: true
We can set configuration for our spider like this:
from scalpel import Configuration
config = Configuration.load_from_yaml('conf.yaml')
print(config.fetch_timeout) # 4.0
Setting configuration through the use of a toml file¶
TOML is another popular file format for project configuration. You can use it to
configure your spider through the convenient class method Configuration.load_from_toml. As with YAML,
your configuration must be under the scalpel namespace, that is, in a [scalpel] table.
Assuming we have a toml file like the following:
# conf.toml
[scalpel]
user_agent = "Mozilla/5.0"
fetch_timeout = 4.0
follow_robots_txt = true
We can set configuration for our spider like this:
from scalpel import Configuration
config = Configuration.load_from_toml('conf.toml')
print(config.fetch_timeout) # 4.0