Is there a command to make Scrapy stop collecting links?

I have a spider currently crawling, and I want it to stop collecting new links now and just crawl everything it has already collected. Is there a way to do this? I cannot find anything so far.

Scrapy offers different ways to stop the spider (apart from pressing Ctrl+C), which you can find in the CloseSpider extension documentation.
You can put the relevant setting in your settings.py file, for example:
CLOSESPIDER_TIMEOUT = 20 # to stop crawling when reaching 20 seconds

Related

Best practice to stop a Scrapy crawl when the number of bad requests reaches a limit I set

I have 400 pages to crawl, for example, and some pages may return 3xx or 4xx responses. I want the Scrapy task to stop automatically when the number of bad requests reaches a limit, say 100. Thanks!
You can use different systems:
A global variable in the class (which is not recommended, but is probably the simplest solution)
Storing it in the DB using pipelines
Once you have reached the number that you have configured, you can stop the crawler using:
from scrapy.exceptions import CloseSpider

if errors > maxNumberErrors:
    raise CloseSpider('message error')
or (from this answer)
from scrapy.project import crawler  # note: scrapy.project is deprecated and removed in modern Scrapy versions
crawler._signal_shutdown(9, 0)
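For illustration, here is a minimal sketch of the counter approach, assuming a hypothetical spider that counts non-2xx responses in an attribute and raises CloseSpider once a configured limit is reached (the spider name, URLs and threshold are placeholders):

import scrapy
from scrapy.exceptions import CloseSpider

class ErrorLimitSpider(scrapy.Spider):
    name = 'error_limit'  # hypothetical spider
    start_urls = ['https://example.com/page/%d' % n for n in range(1, 401)]
    max_errors = 100  # stop once this many bad responses have been seen

    custom_settings = {
        'HTTPERROR_ALLOW_ALL': True,  # let 4xx/5xx responses reach the callback
        'REDIRECT_ENABLED': False,    # let 3xx responses through instead of following them
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.error_count = 0

    def parse(self, response):
        if response.status >= 300:
            self.error_count += 1
            if self.error_count >= self.max_errors:
                raise CloseSpider('too many bad responses')
            return
        # normal extraction would go here
        yield {'url': response.url}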

Is there a pipeline concept from within ScrapyD?

Looking at the documentation for Scrapy and Scrapyd, it appears the only way to write out the result of a scrape is to put that code in the pipeline of the spider itself. My colleagues tell me there is an additional way, whereby I can intercept the result of the scrape from within Scrapyd.
Has anyone heard of this, and if so, can someone shed some light on it for me please?
Thanks
See the documentation on item exporters, feed exports, and the scrapyd config.
Scrapyd is indeed a service that can be used to schedule crawling processes of your Scrapy application through a JSON API. It also permits the integration of Scrapy with other frameworks such as Django; see this guide if you are interested.
Here is the documentation of Scrapyd.
However, if your question is about saving the result of your scraping, the standard way is to do so in the pipelines.py file of your Scrapy application.
An example:
class Pipeline(object):
    def __init__(self):
        # initialization of your pipeline, e.g. connecting to a database or creating a file
        pass

    def process_item(self, item, spider):
        # specify here what needs to be done with the scraped item from a single page
        return item
Remember to define which pipeline you are using in your Scrapy application's settings.py:
ITEM_PIPELINES = {
    'scrapy_application.pipelines.Pipeline': 100,
}
Source: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
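Relatedly, as an alternative to a pipeline, results can also be written out declaratively via the feed exports mentioned above. A minimal sketch in settings.py, assuming a recent Scrapy version (2.4+) where the FEEDS setting replaces the older FEED_URI / FEED_FORMAT pair; the output path is a placeholder:

# settings.py
FEEDS = {
    'output/items.json': {
        'format': 'json',    # built-in formats include json, jsonlines, csv and xml
        'encoding': 'utf8',
        'overwrite': True,   # replace the file on each run instead of appending
    },
}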

Stop Scrapy spider when date from page is older than yesterday

This code is part of my Scrapy spider:
# scraping data from page has been done before this line
publish_date_datetime_object = (datetime.strptime(publish_date, '%d.%m.%Y.')).date()
yesterday = (datetime.now() - timedelta(days=1)).date()
if publish_date_datetime_object > yesterday:
    continue
if publish_date_datetime_object < yesterday:
    raise scrapy.exceptions.CloseSpider('---STOP---DATE IS OLDER THAN YESTERDAY')
# after this is ItemLoader and yield
This is working fine.
My question is: is the Scrapy spider the best place to have this code/logic?
I do not know how to implement it anywhere else.
Maybe it could be implemented in a pipeline, but AFAIK the pipeline is evaluated after the scraping has been done, which means I would need to scrape all ads, even those I do not need.
To give a sense of scale: it is 5 ads from yesterday versus 500 ads on the whole page.
I do not see any benefit in moving the code to a pipeline if that means processing (downloading and scraping) 500 ads when I only need 5 of them.
The spider is the right place if you need it to stop crawling once something indicates there is no more useful data to collect.
It is also the right way to do it: raising a CloseSpider exception with a verbose closing-reason message.
A pipeline would be more suitable only if there were items worth collecting after the threshold is detected, but if they are ALL disposable, that would be a waste of resources.
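For context, here is a minimal sketch of how such a date check might sit inside a parse callback, assuming a listing page where each ad exposes its publish date; the spider name, URL and selectors are hypothetical:

from datetime import datetime, timedelta

import scrapy
from scrapy.exceptions import CloseSpider

class AdsSpider(scrapy.Spider):
    name = 'ads'  # hypothetical spider
    start_urls = ['https://example.com/ads']  # hypothetical listing page

    def parse(self, response):
        yesterday = (datetime.now() - timedelta(days=1)).date()
        for ad in response.css('div.ad'):  # hypothetical selector
            publish_date = ad.css('span.date::text').get()  # hypothetical selector
            publish_date_datetime_object = datetime.strptime(publish_date, '%d.%m.%Y.').date()
            if publish_date_datetime_object > yesterday:
                continue  # skip ads newer than yesterday
            if publish_date_datetime_object < yesterday:
                # the listing is date-ordered, so everything after this is too old
                raise CloseSpider('---STOP---DATE IS OLDER THAN YESTERDAY')
            yield {'url': ad.css('a::attr(href)').get(), 'date': publish_date}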

Scrapy: use both CPU cores in the system

I am running Scrapy using its internal API and everything is well and good so far. But I noticed that it is not fully using the concurrency of 16 set in the settings. I have set the delay to 0 and changed everything else I can. Looking at the HTTP requests being sent, it is clear that Scrapy is not downloading 16 sites at all points in time. At some points it is downloading only 3 to 4 links, and the queue is not empty at that time.
When I checked the core usage, what I found was that out of 2 cores, one is at 100% and the other is mostly idle.
That is when I learned that Twisted, the library on top of which Scrapy is built, is single-threaded, and that is why Scrapy is only using a single core.
Is there any workaround to convince Scrapy to use all the cores?
Scrapy is based on the Twisted framework. Twisted is an event-loop-based framework, so it does scheduled processing, not multiprocessing. That is why your Scrapy crawl runs in just one process. You can, however, technically start two spiders using the code below:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
And there is nothing stopping you from using the same class for both spiders.
The process.crawl method takes *args and **kwargs to pass to your spider, so you can parametrize your spiders using this approach. Say your spider is supposed to crawl 100 pages; you can add a start and end parameter to your crawler class and do something like below (a sketch of a matching spider follows the snippet):
process.crawl(YourSpider, start=0, end=50)
process.crawl(YourSpider, start=51, end=100)
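For illustration, a minimal sketch of a spider that consumes those start/end keyword arguments; Scrapy forwards the kwargs given to process.crawl() to the spider's constructor. The spider name, URL pattern and extraction are hypothetical:

import scrapy

class YourSpider(scrapy.Spider):
    name = 'your_spider'  # hypothetical spider

    def __init__(self, start=0, end=100, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # kwargs passed to process.crawl() arrive here
        self.start_urls = [
            f'https://example.com/page/{n}'  # hypothetical URL pattern
            for n in range(int(start), int(end) + 1)
        ]

    def parse(self, response):
        # hypothetical extraction; adapt to the real page
        yield {'url': response.url, 'title': response.css('title::text').get()}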
Note that both crawlers will have their own settings, so if you have 16 concurrent requests configured for your spider, then both combined will effectively have 32.
In most cases scraping is less about CPU and more about network access, which is non-blocking in Twisted anyway, so I am not sure this would give you a huge advantage over setting CONCURRENT_REQUESTS to 32 in a single spider.
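For reference, a minimal sketch of the relevant concurrency settings in settings.py; the values are illustrative, not recommendations:

# settings.py
CONCURRENT_REQUESTS = 32             # global cap on concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain cap, often the real bottleneck
DOWNLOAD_DELAY = 0                   # no artificial delay between requests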
PS: Consider reading this page to learn more: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Another option is to run your spiders using Scrapyd, which lets you run multiple processes concurrently. See max_proc and max_proc_per_cpu options in the documentation. If you don't want to solve your problem programmatically, this could be the way to go.

How do I stop crawling and close the spider based on a condition?

I have a spider which fetches the latest URLs, for a particular date range, from a paginated webpage. When it has collected all the latest URLs, the spider has to be closed. How do I close the spider?
I referred to this question: Force stop the spider
But raising an exception to close the spider does not appeal to me.
Is there any other way I could achieve the same thing?
You should use the Close Spider extension.
The conditions for closing a spider can be configured through the following settings:
CLOSESPIDER_TIMEOUT
CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_ERRORCOUNT
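Since raising an exception yourself is not desired here, these can simply be set in settings.py and the extension does the rest. A minimal sketch with illustrative values; set only the conditions you need and leave the others unset:

# settings.py
CLOSESPIDER_TIMEOUT = 3600    # close the spider after one hour of crawling
CLOSESPIDER_ITEMCOUNT = 500   # ... after 500 items have been scraped
CLOSESPIDER_PAGECOUNT = 1000  # ... after 1000 responses have been downloaded
CLOSESPIDER_ERRORCOUNT = 10   # ... after 10 errors have occurred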