Static spider
Pyscalpel aims to provide a simple interface for writing spiders. To demonstrate it, we will scrape quotes from quotes.toscrape. I will assume you have already installed Pyscalpel by following the installation guide.
Learning HTML, CSS and XPATH
This guide is not a course on HTML or CSS / XPath selectors. If you are not familiar with these technologies, there are many resources on the web; here are a few that I know.
For HTML:
For CSS:
For XPATH:
Our first spider
Ok, let's create a file spider.py with the following content:
from scalpel.green import StaticResponse, StaticSpider

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    print('hello spider')

static_spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
static_spider.run()
If you prefer anyio over gevent, this is the equivalent:
import anyio
from scalpel.any_io import StaticResponse, StaticSpider

async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    print('hello spider')

async def main() -> None:
    spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
    await spider.run()

anyio.run(main)
Note
By default, anyio runs the asyncio backend. If you want to use the trio backend, you must first install trio and slightly change the last statement to anyio.run(main, backend='trio').
Note
To improve execution speed on the asyncio backend, you can leverage the uvloop package. Note that it does not work on Windows and performs poorly on PyPy because it is implemented in Cython.
If you run this program, you will just see hello spider printed in your console. Nothing exciting yet, so let's change that.
We can inspect the HTML source of the page to get a clear idea of what we can scrape. In most browsers, the ctrl + u shortcut opens it. Let's say we want to print the title of the page.
With gevent:
from scalpel.green import StaticResponse, StaticSpider

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    print(response.css('title').get())
    print(response.xpath('//title').get())

static_spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
static_spider.run()
With anyio:
import anyio
from scalpel.any_io import StaticResponse, StaticSpider

async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    print(response.css('title').get())
    print(response.xpath('//title').get())

async def main() -> None:
    spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
    await spider.run()

anyio.run(main)
If you run the program, <title>Quotes to Scrape</title> will be printed twice. I deliberately selected the title twice to demonstrate how to use both css and xpath selectors on the StaticResponse object.
Note
The css and xpath methods are shortcuts to the parsel Selector methods. Moreover these methods return a SelectorList where you can apply further filters. To know more about parsel, you can read the documentation but we will cover some features in this guide.
So far we have the tag plus its content printed. What if we only want the content? Pretty easy: we just need to add a little more information to our selectors.
With gevent:
from scalpel.green import StaticResponse, StaticSpider

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    print(response.css('title::text').get())
    print(response.xpath('//title/text()').get())

static_spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
static_spider.run()
With anyio:
import anyio
from scalpel.any_io import StaticResponse, StaticSpider

async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    print(response.css('title::text').get())
    print(response.xpath('//title/text()').get())

async def main() -> None:
    spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
    await spider.run()

anyio.run(main)
Now we have Quotes to Scrape printed, yeah! You will notice the pseudo-selector ::text for the css method and the /text() step for the xpath method, both of which select the text inside the element rather than the element itself.
Now, if we look carefully at the HTML source of the website, we notice that every quote follows this skeleton:
<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking.
    It cannot be changed without changing our thinking.”</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / >
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
Inside the <div class="quote"..> we have a <span class="text"..> holding the quote, so if we want to print all the quotes of the page we can do as follows.
With gevent:
from scalpel.green import StaticResponse, StaticSpider

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        print(quote.xpath('./span[@class="text"]/text()').get())

static_spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
static_spider.run()
With anyio:
import anyio
from scalpel.any_io import StaticResponse, StaticSpider

async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        print(quote.xpath('./span[@class="text"]/text()').get())

async def main() -> None:
    spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
    await spider.run()

anyio.run(main)
We will see an output like the following (I truncated it here):
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
...
Victory! We have all the quotes printed!
If you are wondering why we need the for loop, remember that the xpath and css response methods return SelectorList objects, so we can iterate over them. Also, if you look at the second selector, ./span[@class="text"]/text(), you will notice that it starts with ./ because the new search is relative to the first one: we search inside each <div class="quote"..> element.
If this seems unclear to you, don't hesitate to look at the parsel tutorial before continuing this guide.
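To make the relative search concrete without touching the network, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for parsel. This is an assumption made purely for illustration: ElementTree only supports a small XPath subset (there is no text() step, so the text is read from the .text attribute), and the markup is a simplified, made-up fragment modelled on the quote skeleton.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment modelled on the quote skeleton.
PAGE = """
<body>
  <span class="text">I am outside any quote block</span>
  <div class="quote"><span class="text">First quote</span></div>
  <div class="quote"><span class="text">Second quote</span></div>
</body>
""".strip()

root = ET.fromstring(PAGE)

quotes = []
# First search: every quote block anywhere under the root.
for quote in root.findall(".//div[@class='quote']"):
    # Second search: "./" restricts the lookup to the current quote block,
    # so the span that sits outside the quote blocks is never returned.
    span = quote.find("./span[@class='text']")
    quotes.append(span.text)

print(quotes)
```

The span placed directly under the body is ignored because each inner lookup starts from a quote block, which is the same idea as the ./ prefix in the parsel selectors.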
Ok, now let's scrape the author of the quote and the related tags. The author is inside a <small class="author"..> element, and the tags are inside <a> elements, which are themselves inside a <div class="tags"..>. With this knowledge, we can write the following code.
With gevent:
from scalpel.green import StaticResponse, StaticSpider

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        print('===', quote.xpath('./span/small/text()').get(), '===')
        print('quote:', quote.xpath('./span[@class="text"]/text()').get())
        print('tags', quote.xpath('./div/a/text()').getall())
        print()

static_spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
static_spider.run()
With anyio:
import anyio
from scalpel.any_io import StaticResponse, StaticSpider

async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        print('===', quote.xpath('./span/small/text()').get(), '===')
        print('quote:', quote.xpath('./span[@class="text"]/text()').get())
        print('tags', quote.xpath('./div/a/text()').getall())
        print()

async def main() -> None:
    spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse)
    await spider.run()

anyio.run(main)
We have the following output (I truncated it here):
=== Albert Einstein ===
quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
tags ['change', 'deep-thoughts', 'thinking', 'world']
=== J.K. Rowling ===
quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
tags ['abilities', 'choices']
...
There we go! Maybe you noticed the use of Selector.getall for the tags. This is because many elements match the selector and we want all the values. This is different from the get method, which returns only the first element of the list.
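The difference between the two methods can be pictured with plain Python lists. The two helpers below are hypothetical stand-ins written for this illustration, not parsel's implementation; they only mirror the behaviour described here, plus the fact that get yields None when nothing matches.

```python
from typing import List, Optional


def get(values: List[str]) -> Optional[str]:
    """Mimic get: the first match, or None when nothing matched."""
    return values[0] if values else None


def getall(values: List[str]) -> List[str]:
    """Mimic getall: every match, possibly an empty list."""
    return list(values)


# Pretend these are the texts matched by the tag selector.
tags = ['change', 'deep-thoughts', 'thinking', 'world']

print(get(tags))
print(getall(tags))
print(get([]))
```

Using get on the tags would silently drop every tag but the first, which is why getall is the right choice whenever several elements are expected.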
Now that we have all these items, you probably want to store them somewhere for further processing. You can store the data wherever you want: a relational database, a NoSQL database, cloud services, etc. pyscalpel does not restrict what you do with your data, but for our example we will store the scraped items in a file using some pyscalpel utilities.
First of all, pyscalpel comes with a handy Configuration object to store many settings related to our spider. A particularly interesting one is backup_filename, which lets you declare the file where the scraped items will be written.
Another important feature is the save_item method of the StaticSpider, which appends a new item to the Configuration.backup_filename file. Items are serialized using msgpack, which serializes complex data types like list or dict quickly. So here is what we can do to save all quote information.
With gevent:
from scalpel import Configuration
from scalpel.green import StaticResponse, StaticSpider

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall()
        }
        spider.save_item(data)

config = Configuration(backup_filename='/path/to/file.mp')  # replace with a real path
static_spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse, config=config)
static_spider.run()
With anyio:
import anyio
from scalpel import Configuration
from scalpel.any_io import StaticResponse, StaticSpider

async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall()
        }
        await spider.save_item(data)

async def main() -> None:
    config = Configuration(backup_filename='/path/to/file.mp')  # replace with a real path
    spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse, config=config)
    await spider.run()

anyio.run(main)
Note
You don't necessarily have to specify a backup file. A default one is created for you in the form backup-<uuid>.mp where <uuid> represents a random UUID value.
Now we are done, but... wait a minute! How will we read the file we just created? Since we use msgpack to serialize objects, reading the file with the builtin open function alone will not give us usable data. This is where pyscalpel msgpack utilities come in handy. Here is how you can read a file created by your spider.
With gevent:
from scalpel.green import read_mp

for quote in read_mp('/path/to/file.mp'):
    print(quote)
With anyio:
import anyio
from scalpel.any_io import read_mp

async def main() -> None:
    async for quote in read_mp('/path/to/file.mp'):
        print(quote)

anyio.run(main)
Yeah! Now we have useful data that we can exploit.
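As a small illustration of what exploiting the data can look like, the sketch below counts how often each tag appears. It assumes the items come back as dictionaries with the same 'tags' key used when saving. The sample list is made up so that the snippet runs without a backup file, and count_tags is a hypothetical helper, not part of pyscalpel.

```python
from collections import Counter
from typing import Dict, Iterable


def count_tags(items: Iterable[Dict]) -> Counter:
    """Count tag occurrences across scraped items (hypothetical helper)."""
    counter: Counter = Counter()
    for item in items:
        # Items without a 'tags' key simply contribute nothing.
        counter.update(item.get('tags', []))
    return counter


# Made-up sample shaped like the dictionaries passed to save_item.
sample_items = [
    {'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']},
    {'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']},
    {'author': 'Hypothetical Author', 'tags': ['change']},
]

tag_counts = count_tags(sample_items)
print(tag_counts.most_common(1))
```

With the gevent flavour, the iterable returned by read_mp could presumably be passed to such a helper in place of the sample list.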
Going further
In the previous part we wrote a spider to scrape all quotes on the first page of quotes.toscrape. This is already a good step, but we might want to go further and retrieve all quote information on the website. We need a way to follow the link to the next page and process it. For that, the follow method of the response object (StaticResponse.follow, as used in the code below) helps us.
So if we look closely at the HTML structure of the website, we notice that the link to the next page is referenced like this:
<nav>
    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>
</nav>
So this is what we can do to get all website quote data.
With gevent:
from scalpel import Configuration
from scalpel.green import StaticResponse, StaticSpider

def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall()
        }
        spider.save_item(data)

    next_link = response.xpath('//nav/ul/li[@class="next"]/a').xpath('@href').get()
    if next_link is not None:
        response.follow(next_link)

config = Configuration(backup_filename='data.mp')
static_spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse, config=config)
static_spider.run()
# we print some statistics about the crawl operation
print(static_spider.statistics())
With anyio:
import anyio
from scalpel import Configuration
from scalpel.any_io import StaticResponse, StaticSpider

async def parse(spider: StaticSpider, response: StaticResponse) -> None:
    for quote in response.xpath('//div[@class="quote"]'):
        data = {
            'message': quote.xpath('./span[@class="text"]/text()').get(),
            'author': quote.xpath('./span/small/text()').get(),
            'tags': quote.xpath('./div/a/text()').getall()
        }
        await spider.save_item(data)

    next_link = response.xpath('//nav/ul/li[@class="next"]/a').xpath('@href').get()
    if next_link is not None:
        await response.follow(next_link)

async def main() -> None:
    config = Configuration(backup_filename='data.mp')
    spider = StaticSpider(urls=['https://quotes.toscrape.com'], parse=parse, config=config)
    await spider.run()
    # we print some statistics about the crawl operation
    print(spider.statistics())

anyio.run(main)
There we go! Now, if you read the file created (data.mp in this example), you will get all quote information from the first page to the last. Don't hesitate to look at the examples folder for more code snippets.
Some important notes:
- In the previous code, we check that the link we want to follow exists (if next_link is not None). This is important because on the last page there is no next link 😛
- On the last line I printed the spider statistics, which contain a lot of information such as the total time taken by the spider and the urls scraped, followed or rejected due to robots.txt rules. You will probably need this information at some point.
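Note that the href in the pager is relative (/page/2/). The spider code passes it to follow unchanged, so it is presumably resolved against the URL of the current page. The sketch below shows that kind of resolution with the standard library's urljoin; it illustrates the general rule and is not pyscalpel's internal code.

```python
from urllib.parse import urljoin

current_page = 'https://quotes.toscrape.com/page/1/'

# A link starting with "/" is resolved from the root of the site.
absolute = urljoin(current_page, '/page/2/')
print(absolute)  # https://quotes.toscrape.com/page/2/

# Without the leading "/", the link is resolved from the current directory.
nested = urljoin(current_page, 'page/2/')
print(nested)  # https://quotes.toscrape.com/page/1/page/2/
```

If you ever need the absolute URL yourself, for logging or deduplication for instance, this is a cheap way to compute it.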
Good to know
pyscalpel can also deal with file urls instead of http ones. The url needs to start with file:/// followed by the file path. You can use the Path.as_uri method to help you create these urls.
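A minimal sketch of building such a url with pathlib follows. The file name is hypothetical, and the file does not need to exist for the url to be built.

```python
from pathlib import Path

# Hypothetical local file saved from a previous crawl.
html_file = Path('pages/quotes.html')

# as_uri() raises ValueError on relative paths, so make the path absolute first.
file_url = html_file.resolve().as_uri()
print(file_url)
```

The resulting string starts with file:/// and can be placed in the urls list of the spider like any http url.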
You will notice that the following spider attributes are public and therefore can be set in the parse function:
- reachable_urls
- unreachable_urls
- robot_excluded_urls
- followed_urls
- request_counter
The reason is that when running (very) long crawlers, it can be useful to empty these sets to avoid running out of memory, and to set the counter to 0 to stay in sync with the sets. Please do not abuse this possibility and only use it when appropriate.
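A possible shape for such a reset is sketched below. Everything in it is an assumption made for illustration: reset_tracking and the threshold are hypothetical, the attribute names follow the list given here and should be checked against your installed version, and a SimpleNamespace stands in for a real spider so the snippet runs on its own.

```python
from types import SimpleNamespace

# Hypothetical threshold: how many tracked urls we tolerate before resetting.
MAX_TRACKED_URLS = 10_000

# Names of the public url sets, as listed in this guide.
URL_SETS = ('reachable_urls', 'unreachable_urls', 'robot_excluded_urls', 'followed_urls')


def reset_tracking(spider, limit: int = MAX_TRACKED_URLS) -> bool:
    """Empty the url sets and zero the counter once too many urls are tracked."""
    tracked = sum(len(getattr(spider, name)) for name in URL_SETS)
    if tracked < limit:
        return False
    for name in URL_SETS:
        setattr(spider, name, set())
    # Keep the counter in sync with the now empty sets.
    spider.request_counter = 0
    return True


# Stand-in object so the sketch runs without a real spider.
fake_spider = SimpleNamespace(
    reachable_urls={'https://example.com/1', 'https://example.com/2'},
    unreachable_urls={'https://example.com/404'},
    robot_excluded_urls=set(),
    followed_urls={'https://example.com/2'},
    request_counter=3,
)

was_reset = reset_tracking(fake_spider, limit=4)
print(was_reset, fake_spider.request_counter)
```

In a real crawler, a helper like this would be called from the parse function. Keep in mind that once the sets are emptied, the final statistics no longer describe the whole crawl.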
Also, if you choose to follow robots.txt rules, keep in mind that for gevent spiders, the delay specified in this file will not be taken into account due to a technical limitation in gevent that I will not explain here. It is always the Configuration.request_delay property that is used for the delay between requests. The anyio spiders do not suffer from this limitation.
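For reference, the delay in question is the Crawl-delay directive of robots.txt. The sketch below parses a made-up robots.txt with the standard library's urllib.robotparser to show where that value comes from; it says nothing about how pyscalpel reads the file internally.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay (in seconds) requested by the site for any user agent.
delay = parser.crawl_delay('*')
# Whether a given url may be fetched according to the rules.
allowed = parser.can_fetch('*', 'https://example.com/page/1/')
blocked = parser.can_fetch('*', 'https://example.com/private/data')

print(delay, allowed, blocked)
```

With a gevent spider, a value like this one would have to be copied by hand into Configuration.request_delay if you want to honour it.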
Furthermore, if you want a more object-oriented approach for your spider than a function, you can always use a class. Just remember that the parse attribute of the StaticSpider expects a callable. An example:
With gevent:
from scalpel.green import StaticSpider, StaticResponse

class Parser:
    def __call__(self, spider: StaticSpider, response: StaticResponse) -> None:
        print(response.text)

spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=Parser())
spider.run()
With anyio:
import anyio
from scalpel.any_io import StaticResponse, StaticSpider

class Parser:
    async def __call__(self, spider: StaticSpider, response: StaticResponse) -> None:
        print(response.text)

async def main() -> None:
    spider = StaticSpider(urls=['http://quotes.toscrape.com'], parse=Parser())
    await spider.run()

anyio.run(main)
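One practical reason to prefer a class is that the instance can carry state between calls. The sketch below is a hypothetical variation on the Parser class: CountingParser and the fake response are made up for illustration, and the object is called by hand instead of by a spider so that the snippet runs without pyscalpel or network access.

```python
from types import SimpleNamespace


class CountingParser:
    """Callable object that remembers how many pages it has handled."""

    def __init__(self) -> None:
        self.pages_seen = 0

    def __call__(self, spider, response) -> None:
        self.pages_seen += 1
        print(f'page {self.pages_seen}: {len(response.text)} characters')


parser = CountingParser()

# Stand-in for a response: only the text attribute is needed here.
fake_response = SimpleNamespace(text='<html></html>')
parser(None, fake_response)
parser(None, fake_response)

print(callable(parser), parser.pages_seen)
```

Because the instance defines __call__, it satisfies the "callable" requirement in the same way a plain function does.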