Scrapy: settings, multiple concurrent spiders, and middlewares

I'm used to running spiders one at a time, because we mostly work with scrapy crawl and on scrapinghub, but I know that one can run multiple spiders concurrently, and I have seen that middlewares often have a spider parameter in their callbacks.
What I'd like to understand is:
the relationship between Crawler and Spider. If I run one spider at a time, I'm assuming there's one of each. But if you run more spiders together, like in the example linked above, do you have one crawler for multiple spiders, or are they still 1:1?
is there in any case only one instance of a middleware of a certain class, or do we get one per-spider or per-crawler?
Assuming there's one, what are the crawler.settings in the middleware creation (for example, here)? In the documentation it says that those take into account the settings overridden in the spider, but if there are multiple spiders with conflicting settings, what happens?
I'm asking because I'd like to know how to handle spider-specific settings. Take again the DeltaFetch middleware as an example:
enabling it seems to be a global matter, because DELTAFETCH_ENABLED is read from the crawler.settings
however, the sqlite db is opened in spider_opened and is stored in a single instance variable (i.e., not tied to a particular spider); so if you have more than one spider and the instance is shared, when the second spider is opened, the old db is lost. And if there is one instance of the middleware per spider, why bother passing the spider as a parameter?
Is that a correct way of handling it, or should you rather have a dict spider_dbs indexed by spider name?
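For illustration, here is a minimal sketch of the dict-based approach, assuming one middleware instance shared by all spiders in the same process; the class name and the DB_DIR setting are made up for the example:

import os
import sqlite3

from scrapy import signals

class PerSpiderDbMiddleware(object):
    # hypothetical middleware: keep one sqlite connection per spider,
    # keyed by spider.name, instead of a single shared instance variable

    def __init__(self, db_dir):
        self.db_dir = db_dir
        self.spider_dbs = {}

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls(crawler.settings.get('DB_DIR', '.deltas'))  # DB_DIR is a made-up setting
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_opened(self, spider):
        os.makedirs(self.db_dir, exist_ok=True)
        path = os.path.join(self.db_dir, '%s.db' % spider.name)
        self.spider_dbs[spider.name] = sqlite3.connect(path)

    def spider_closed(self, spider):
        self.spider_dbs.pop(spider.name).close()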

Related

Passing items (or other variables) to middleware (or other modules) of scrapy

I'm improving the spider I wrote a few months ago. I'm trying to make it smarter and download only the new information from the website. For that purpose I am adding code to the downloader middleware to check whether a URL ID has already been visited or not. Apart from the URL, which I can get fairly easily with request.url, I need to pass an Item from the Spider - that Item is the Date Last Updated.
The idea is to compare both values (URL and Date Last Updated) with the ones from the database (a regular csv file): if both are the same, drop the request; if both are missing or the Last Updated date doesn't match, proceed with the request.
The problem is that I don't know how to pass the Item from the Spider to the Middleware. I can see that in the Pipelines module the item object is passed to the class; I tried to add it to the Middleware class but it doesn't work.
Any ideas how to pass an Item or any other variable from the Spider to the Middleware module?
Usually you pass any additional info in the request meta, either as request.meta['my_thing'] = ... or as an argument, yield Request(url, meta={'my_thing': ...}), which all middlewares up the chain will be able to access. For your case, however, I'd recommend either using Scrapy's built-in cache middleware with the dummy policy, or one of these two modules, which do exactly the thing you have in mind:
https://github.com/TeamHG-Memex/scrapy-crawl-once
https://github.com/scrapy-plugins/scrapy-deltafetch
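For completeness, a minimal sketch of the meta-passing approach; the spider, the selectors, and the 'last_updated' key are placeholders, and the csv lookup is left as a stub:

import scrapy
from scrapy.exceptions import IgnoreRequest

class ListingSpider(scrapy.Spider):
    # hypothetical spider: name, URL and selectors are placeholders
    name = 'listing'
    start_urls = ['https://example.com/listing']

    def parse(self, response):
        for row in response.css('div.item'):
            url = row.css('a::attr(href)').get()
            last_updated = row.css('span.date::text').get()
            # attach the value so downloader middlewares can read it later
            yield response.follow(url, meta={'last_updated': last_updated},
                                  callback=self.parse_item)

    def parse_item(self, response):
        yield {'url': response.url,
               'last_updated': response.meta['last_updated']}

class SkipUnchangedMiddleware(object):
    # hypothetical downloader middleware reading the value back from request.meta
    def process_request(self, request, spider):
        last_updated = request.meta.get('last_updated')
        if last_updated is not None and self._unchanged(request.url, last_updated):
            raise IgnoreRequest('already up to date: %s' % request.url)
        return None  # let the request continue down the chain

    def _unchanged(self, url, last_updated):
        # stub for the csv lookup described in the question
        return False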

Stop Scrapy spider when date from page is older than yesterday

This code is part of my Scrapy spider:
from datetime import datetime, timedelta   # needed at the top of the spider module
from scrapy.exceptions import CloseSpider

# scraping data from page has been done before this line (inside a loop in the callback)
publish_date_datetime_object = datetime.strptime(publish_date, '%d.%m.%Y.').date()
yesterday = (datetime.now() - timedelta(days=1)).date()
if publish_date_datetime_object > yesterday:
    continue  # skip entries newer than yesterday
if publish_date_datetime_object < yesterday:
    raise CloseSpider('---STOP---DATE IS OLDER THAN YESTERDAY')
# after this is ItemLoader and yield
This is working fine.
My question is: is the Scrapy spider the best place to have this code/logic?
I do not know how to implement it in another place.
Maybe it can be implemented in a pipeline, but AFAIK the pipeline is evaluated after the scraping has been done, which means I would need to scrape all ads, even those that I do not need.
The scale is 5 ads from yesterday versus 500 ads on the whole page.
I do not see any benefit in moving the code to a pipeline if that means processing (downloading and scraping) 500 ads when I only need 5 of them.
It is the right place if you need your spider to stop crawling after something indicates there's no more useful data to collect.
It is also the right way to do it: raising a CloseSpider exception with a verbose closing reason message.
A pipeline would be more suitable only if there were items to be collected after the threshold is detected, but if they are ALL disposable this would be a waste of resources.

Scrapy: use both cores in the system

I am running Scrapy using its internal API and everything is well and good so far. But I noticed that it's not fully using the concurrency of 16 set in the settings. I have changed the delay to 0 and everything else I can. Looking at the HTTP requests being sent, it's clear that Scrapy is not downloading 16 sites at all points in time. At some points it's downloading only 3 to 4 links, and the queue is not empty at that time.
When I checked the core usage, I found that out of 2 cores, one is at 100% and the other is mostly idle.
That is when I learned that the Twisted library, on top of which Scrapy is built, is single-threaded, and that is why it's only using a single core.
Is there any workaround to convince Scrapy to use all the cores?
Scrapy is based on the Twisted framework. Twisted is an event-loop-based framework, so it does scheduled processing, not multiprocessing. That is why your Scrapy crawl runs in just one process. Now, you can technically start two spiders using the code below:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
And there is nothing that stops you from having the same class for both the spiders.
The process.crawl method takes *args and **kwargs to pass to your spider, so you can parametrize your spiders using this approach. Let's say your spider is supposed to crawl 100 pages; you can add a start and end parameter to your crawler class and do something like below:
process.crawl(YourSpider, start=0, end=50)
process.crawl(YourSpider, start=51, end=100)
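On the spider side, the arguments passed to process.crawl arrive as constructor keyword arguments; a rough sketch (the URL pattern is just an example):

import scrapy

class YourSpider(scrapy.Spider):
    name = 'your_spider'

    def __init__(self, start=0, end=100, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        # start/end come from process.crawl(YourSpider, start=..., end=...)
        self.start_urls = ['https://example.com/page/%d' % i
                           for i in range(int(start), int(end) + 1)]

    def parse(self, response):
        yield {'url': response.url}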
Note that both crawlers will have their own settings, so if you have 16 concurrent requests set for your spider, then both combined will effectively have 32.
In most cases scraping is less about CPU and more about network access, which is actually non-blocking in the case of Twisted, so I am not sure this would give you a very huge advantage over setting CONCURRENT_REQUESTS to 32 in a single spider.
PS: Consider reading this page to understand more https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Another option is to run your spiders using Scrapyd, which lets you run multiple processes concurrently. See max_proc and max_proc_per_cpu options in the documentation. If you don't want to solve your problem programmatically, this could be the way to go.
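For reference, the relevant Scrapyd settings live in scrapyd.conf and look roughly like this (the values shown are just examples; check the Scrapyd documentation for the actual defaults):
scrapyd.conf
[scrapyd]
# 0 means no absolute cap; the limit becomes max_proc_per_cpu * number of CPUs
max_proc = 0
# how many Scrapy processes may run per CPU at the same time
max_proc_per_cpu = 4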

Scrapy DupeFilter on a per spider basis?

I currently have a project with quite a few spiders and around half of them need some custom rule to filter duplicating requests. That's why I have extended the RFPDupeFilter class with custom rules for each spider that needs it.
My custom dupe filter checks if the request url is from a site that needs custom filtering and cleans the url (removes query parameters, shortens paths, extracts unique parts, etc.), so that the fingerprint is the same for all identical pages. So far so good, however at the moment I have a function with around 60 if/elif statements, that each request goes through. This is not only suboptimal, but it's also hard to maintain.
So here comes the question. Is there a way to create the filtering rule, that 'cleans' the urls inside the spider? The ideal approach for me would be to extend the Spider class and define a clean_url method, which will by default just return the request url, and override it in the spiders that need something custom. I looked into it, however I can't seem to find a way to access the current spider's methods from the dupe filter class.
Any help would be highly appreciated!
You could implement a downloader middleware.
middleware.py
from scrapy.exceptions import IgnoreRequest

class CleanUrl(object):
    seen_urls = set()

    def process_request(self, request, spider):
        url = spider.clean_url(request.url)
        if url != request.url:
            # reschedule the request with the cleaned url; it will pass through
            # this middleware again in its canonical form
            return request.replace(url=url)
        if url in self.seen_urls:
            raise IgnoreRequest()
        self.seen_urls.add(url)
        return None  # continue with the download
settings.py
DOWNLOADER_MIDDLEWARES = {'PROJECT_NAME_HERE.middleware.CleanUrl': 500}
# if you want to make sure this is the last middleware to execute, increase the 500 to 1000
You probably would want to disable the dupefilter altogether if you did it this way.
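To tie this back to the clean_url idea in the question, the spider side could provide a default implementation that only the spiders needing custom behaviour override; a sketch (class names are just suggestions):

import scrapy

class UrlCleaningSpider(scrapy.Spider):
    # default: the url is already canonical
    def clean_url(self, url):
        return url

class CustomSiteSpider(UrlCleaningSpider):
    name = 'custom_site'
    start_urls = ['https://example.com/']

    def clean_url(self, url):
        # site-specific rule, e.g. strip the query string
        return url.split('?')[0]

The CleanUrl middleware above then simply calls spider.clean_url(request.url), so the 60 if/elif branches collapse into per-spider overrides.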

Use Scrapy to combine data from multiple AJAX requests into a single item

What is the best way to crawl pages with content coming from multiple AJAX requests? It looks like I have the following options (given that AJAX URLs are already known):
1. Crawl AJAX URLs sequentially, passing the same item between requests
2. Crawl AJAX URLs concurrently and output each part as a separate item with a shared key (e.g. source URL)
What is the most common practice? Is there a way to get a single item at the end, but allow some AJAX requests to fail w/o compromising the rest of the data?
Scrapy is built for concurrency and statelessness, so if point 2 is possible, it is always preferred, from both a speed and a memory consumption perspective.
In case requests must be serialized, consider accumulating items in the request meta field, as sketched below.
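A rough sketch of the sequential approach, carrying the partial item in the request meta and yielding whatever was collected if an AJAX call fails; the endpoint URLs and field names are placeholders:

import scrapy

class AjaxCombineSpider(scrapy.Spider):
    # hypothetical spider: the two AJAX endpoints are placeholders
    name = 'ajax_combine'
    start_urls = ['https://example.com/page']

    def parse(self, response):
        item = {'source_url': response.url}
        yield scrapy.Request('https://example.com/ajax/part1',
                             meta={'item': item},
                             callback=self.parse_part1,
                             errback=self.handle_failure)

    def parse_part1(self, response):
        item = response.meta['item']
        item['part1'] = response.text  # or parse the JSON body, depending on the endpoint
        yield scrapy.Request('https://example.com/ajax/part2',
                             meta={'item': item},
                             callback=self.parse_part2,
                             errback=self.handle_failure)

    def parse_part2(self, response):
        item = response.meta['item']
        item['part2'] = response.text
        yield item

    def handle_failure(self, failure):
        # a failed AJAX call still yields whatever has been accumulated so far
        yield failure.request.meta['item']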
Check scrapy-inline-requests. It allows you to smoothly process multiple nested requests in a response handler.