Response middlewares
pyscalpel can intercept the responses that httpx produces when fetching URLs. This feature is only available for static spiders, because with selenium there is no reliable way to determine exactly when the request for a URL has completed.
Usage
The middleware system is simple to use: if you know how to write decorators in Python, you already know how to write middlewares in pyscalpel. One important rule: a middleware must return the httpx response object, since pyscalpel relies on it. You may run code before the response is created and inspect the response once it exists, but you must not discard it or return a different object. Here is an example.
With gevent:
from scalpel import Configuration
from scalpel.green import StaticSpider


def middleware(fetch):
    # here you can do some initialization
    print('initialization')

    def wrapper(url):
        # code to be executed before each request, here I just print information
        # but you can do whatever you want
        print('url processed in function middleware:', url)
        print('before processing')
        response = fetch(url)
        # code to be executed after each request
        print('after processing')
        # important to return the response
        return response

    return wrapper


# we can do the same thing with a class middleware
class SimpleMiddleware:
    def __init__(self, fetch):
        self.fetch = fetch
        # you can do other initialization here
        print('class initialization')

    def __call__(self, url):
        print('url processed in class middleware:', url)
        print('before class processing')
        response = self.fetch(url)
        print('after class processing')
        return response


def parse(*_):
    pass


config = Configuration(response_middlewares=[middleware, SimpleMiddleware])
spider = StaticSpider(urls=['http://foo.com'], parse=parse, config=config)
spider.run()
With anyio:
import anyio
from scalpel import Configuration
from scalpel.any_io import StaticSpider


def middleware(fetch):
    # here you can do some initialization
    print('initialization')

    async def wrapper(url):
        # code to be executed before each request, here I just print information
        # but you can do whatever you want
        print('url processed in function middleware:', url)
        print('before processing')
        response = await fetch(url)
        # code to be executed after each request
        print('after processing')
        # important to return the response
        return response

    return wrapper


# we can do the same thing with a class middleware
class SimpleMiddleware:
    def __init__(self, fetch):
        self.fetch = fetch
        # you can do other initialization here
        print('class initialization')

    async def __call__(self, url):
        print('url processed in class middleware:', url)
        print('before class processing')
        response = await self.fetch(url)
        print('after class processing')
        return response


async def parse(*_):
    pass


async def main():
    config = Configuration(response_middlewares=[middleware, SimpleMiddleware])
    spider = StaticSpider(urls=['http://foo.com'], parse=parse, config=config)
    await spider.run()


anyio.run(main)  # with trio: anyio.run(main, backend='trio')
Output:
initialization
class initialization
url processed in class middleware: http://foo.com
before class processing
url processed in function middleware: http://foo.com
before processing
after processing
after class processing
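The output shows the order in which middleware code runs. Initialization happens once per middleware, in list order. For each request, the last middleware in the list (here SimpleMiddleware) runs first and wraps the ones before it, so the "before" code runs from last to first and the "after" code runs from first to last.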
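To make this ordering easier to reason about, here is a minimal, pyscalpel-free sketch of how decorator-style middlewares can be stacked. It is only a model that reproduces the order seen in the output above, not pyscalpel's actual implementation. The names fake_fetch, first, Second, compose and calls are hypothetical and exist only for this illustration.

```python
calls = []  # records the order in which each piece of code runs


def fake_fetch(url):
    # hypothetical stand-in for the real fetch function, no network access
    calls.append('fetch')
    return {'url': url, 'status': 200}


def first(fetch):
    calls.append('init first')

    def wrapper(url):
        calls.append('before first')
        response = fetch(url)
        calls.append('after first')
        return response

    return wrapper


class Second:
    def __init__(self, fetch):
        self.fetch = fetch
        calls.append('init second')

    def __call__(self, url):
        calls.append('before second')
        response = self.fetch(url)
        calls.append('after second')
        return response


def compose(fetch, middlewares):
    # assumption: each middleware wraps the result of the previous one,
    # so the last middleware in the list ends up outermost
    for middleware in middlewares:
        fetch = middleware(fetch)
    return fetch


wrapped = compose(fake_fetch, [first, Second])
response = wrapped('http://foo.com')
print('\n'.join(calls))
```

Under this model, the calls list contains the two initialization entries in list order, then the "before" entries from last to first, the fetch itself, and the "after" entries from first to last.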
Note
Keep in mind that the more middlewares you add, the slower your spider will be.
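One way to see how much time each request takes is a small timing middleware. The following is a sketch in the synchronous (gevent-style) form, run against a hypothetical fake_fetch so that it works without pyscalpel or network access. The timings dictionary and the timing_middleware name are assumptions made for this example.

```python
import time

timings = {}  # hypothetical store mapping each url to its elapsed seconds


def timing_middleware(fetch):
    def wrapper(url):
        start = time.perf_counter()
        response = fetch(url)
        # record how long the wrapped fetch (and any inner middlewares) took
        timings[url] = time.perf_counter() - start
        # return the response untouched, as the middleware contract requires
        return response

    return wrapper


def fake_fetch(url):
    # hypothetical stand-in for the real fetch function
    time.sleep(0.01)
    return 'response for ' + url


fetch = timing_middleware(fake_fetch)
result = fetch('http://foo.com')
print(timings)
```

With a real spider, the same function could presumably be passed in response_middlewares like the middlewares shown earlier. For the anyio variant, wrapper would need to be an async function that awaits fetch(url).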