Item processing¶
A cool feature about pyscalpel is that it lets you decouple item scraping from its analyzing through the item processors.
Note that this require that you save your item using the save_item
method of a response object. It can help
you to reduce the amount of logic in your parse callable.
Configuration¶
Item processors are just functions that are run one after the other on a scraped item. The idea is that you can modify an item to add or update some information or you can simply discard the item if it does not meet certain criteria. To register item processors, this is what you can do.
from scalpel import Configuration
def processor_1(item):
...
async def processor_2(item):
...
config = Configuration(item_processors=[processor_1, processor_2])
As you will have noticed in the example, item processors can be synchronous or asynchronous. Obviously the asynchronous
version is only valid if you are dealing with an asyncio
or trio
spider. If you use an asynchronous function inside
a green spider, you can be sure your application will crash.
Note
The processors are run in the order there are listed when instantiating configuration. So put the most important ones at the beginning.
Example¶
If we come back to our quotes example in the static spider we have done something like that in the parse function:
data = {
'message': quote.xpath('./span[@class="text"]/text()').get(),
'author': quote.xpath('./span/small/text()').get(),
'tags': quote.xpath('./div/a/text()').getall()
}
spider.save_item(data)
So we save an item with three properties message
,author
and tags
. Now let's add a date and let's say we don't like
Marylin Monroe, so we want to remove her quotes from the result 😆
I will show an example with gevent, but you should now know how to use anyio at this point of the documentation.
from datetime import datetime
from scalpel import Configuration
from scalpel.green import StaticSpider
def datetime_processor(item: dict) -> dict:
item['date'] = datetime.utcnow()
return item
def marylin_processor(item: dict) -> None:
if item['author'] == 'Marilyn Monroe':
return
def parse(static_spider, response):
...
config = Configuration(item_processors=[marylin_processor, datetime_processor])
spider = StaticSpider(urls=['https://quotes.toscrape.com/'], parse=parse, config=config)
So that is it! For the case you want to drop an item you just need to return None
from the processor function. Worth
to mention that if a processor returns None
the following processors are not called.
Note on custom object serialization / deserialization¶
If you know msgpack, you know that it cannot serialize datetime
objects by
default. So if you called the read_mp
function like we did in the static spider guide, it will raise an error.
So how can we read datetime
objects? Well, if you look at the pyscalpel msgpack api you will notice
a datetime_decoder
which is a helper to deserialize datetime
objects. Also, the Configuration
object has a msgpack_decoder
attribute which default value is datetime_decoder
. So here is how you can read
datetime
objects.
With gevent:
from scalpel import Configuration
from scalpel.green import read_mp, StaticSpider
def parse(spider, response):
...
config = Configuration(backup_filename='toto.mp')
spider = StaticSpider(urls=['http//foo.com'], parse=parse, config=config)
spider.run()
# read_mp accepts a callback where we can specify how to deserialize custom objects in msgpack
for data in read_mp('toto.mp', decoder=config.msgpack_decoder):
print(data)
With anyio:
import anyio
from scalpel import Configuration
from scalpel.any_io import StaticSpider, read_mp
async def parse(spider, response):
...
async def main():
config = Configuration(backup_filename='toto.mp')
spider = StaticSpider(urls=['http//foo.com'], parse=parse, config=config)
await spider.run()
# read_mp accepts a callback where we can specify how to deserialize custom objects in msgpack
async for data in read_mp('toto.mp', decoder=config.msgpack_decoder):
print(data)
anyio.run(main) # with trio: anyio.run(main, backend='trio')
If you want to serialize / deserialize other custom objects, you need to implement your logic and set attributes
msgpack_encoder
and msgpack_decoder
of the Configuration
object with the appropriate functions.