How do I tell the spider to stop requesting after n failed requests? - scrapy

import scrapy

class MySpider(scrapy.Spider):
    start_urls = []

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for i in range(1, 1000):
            # str() is needed: "some url" + i would raise TypeError
            self.start_urls.append("some url" + str(i))

    def parse(self, response):
        print(response)
Here we queue 1000 URLs in the __init__ method, but I want to stop making the remaining requests if they fail or return something undesirable. How do I tell the spider to stop making requests, say, after 10 failed requests?

You might want to set CLOSESPIDER_ERRORCOUNT to 10 in that case. It probably doesn't account for failed requests only, though. Alternatively, you might set HTTPERROR_ALLOWED_CODES to handle even the error responses (failed requests) and implement your own failed request counter inside the spider. Then, when the counter is above threshold, you raise CloseSpider exception yourself.

Related

How can I add a new spider arg to my own template in Scrapy/Zyte

I am working on a paid proxy spider template and would like the ability to pass in a new argument on the command line for a Scrapy crawler. How can I do that?
This is achievable by using kwargs in your spider's __init__ method:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"  # name must be a string

    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        self.your_arg = kwargs.get("your_cmd_arg", 42)
Now it would be possible to call the spider as follows:
scrapy crawl your_spider -a your_cmd_arg=foo
For more information on the topic, feel free to check this page in the Scrapy documentation.

How to consume all messages from RabbitMQ queue using pika

I would like to write a daemon in Python that wakes up periodically to process some data queued up in a RabbitMQ queue.
When the daemon wakes up, it should consume all messages in the queue (or min(len(queue), N), where N is some arbitrary number) because it's better for the data to be processed in batches. Is there a way of doing this in pika, as opposed to passing in a callback that gets called on every message arrival?
Thanks.
Here is the code written using pika; a similar function can be written using basic.get.
The code below uses channel.consume to start consuming messages and breaks out when the desired number of messages has been reached.
I have set a batch_size to prevent pulling a huge number of messages at once. You can always change the batch_size to fit your needs.
import json
import logging

from pika import URLParameters
from pika.adapters.blocking_connection import BlockingConnection
from pika.exceptions import ChannelWrongStateError, StreamLostError, AMQPConnectionError

logger = logging.getLogger(__name__)

# Adjust the broker URL for your setup
connection = BlockingConnection(URLParameters("amqp://guest:guest@localhost:5672/%2F"))
channel = connection.channel()

def consume_messages(queue_name: str):
    msgs = []
    batch_size = 500
    q = channel.queue_declare(queue_name, durable=True, exclusive=False, auto_delete=False)
    q_length = q.method.message_count
    if not q_length:
        return msgs
    msgs_limit = batch_size if q_length > batch_size else q_length
    try:
        # Get messages and break out
        for method_frame, properties, body in channel.consume(queue_name):
            # Append the message
            try:
                msgs.append(json.loads(body))
            except (json.JSONDecodeError, UnicodeDecodeError):
                logger.info(f"Rabbit Consumer : Received message in wrong format {str(body)}")
            # Acknowledge the message
            channel.basic_ack(method_frame.delivery_tag)
            # Escape out of the loop when the desired number of messages is fetched
            # (delivery tags start at 1 on a fresh channel, so the tag doubles as a counter)
            if method_frame.delivery_tag == msgs_limit:
                # Cancel the consumer and return any pending messages
                requeued_messages = channel.cancel()
                print('Requeued %i messages' % requeued_messages)
                break
    except (ChannelWrongStateError, StreamLostError, AMQPConnectionError) as e:
        logger.info(f'Connection Interrupted: {str(e)}')
    finally:
        # Close the channel and the connection
        channel.stop_consuming()
        channel.close()
    return msgs
You can use the basic.get API, which pulls messages from the broker, instead of subscribing to have messages pushed to you.
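Separately from the pika specifics, the batch-draining pattern itself (consume up to min(len(queue), N) and stop) can be sketched with the standard library's queue module; the names here are illustrative:

```python
from queue import Queue, Empty

def drain(q, max_items=500):
    """Pull up to max_items messages from the queue without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except Empty:
            break  # queue exhausted before the batch filled up
    return batch

q = Queue()
for i in range(7):
    q.put(i)

print(drain(q, max_items=5))  # → [0, 1, 2, 3, 4]
print(drain(q, max_items=5))  # → [5, 6]
```

The same shape applies to basic.get: keep pulling until either the batch is full or the broker reports the queue is empty.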

Scrapy. How to yield item after spider_close call?

I want to yield an item only when the crawling is finished.
I am trying to do it via
def spider_closed(self, spider):
    item = EtsyItem()
    item['total_sales'] = 1111111
    yield item
But it does not yield anything, though the function is called.
How do I yield an item after the scraping is over?
Depending on what you want to do, there might be a veeeery hacky solution for this.
Instead of spider_closed you may want to consider using spider_idle signal which is fired before spider_closed. One difference between idle and close is that spider_idle allows execution of requests which then may contain a callback or errback to yield the desired item.
Inside spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

# ...

def yield_item(self, response):
    yield MyItem(name='myname')

def spider_idle(self, spider):
    # the callback receives the response, so the no-op lambda must accept it
    req = Request('https://fakewebsite123.xyz',
                  callback=lambda response: None, errback=self.yield_item)
    self.crawler.engine.crawl(req, spider)
However, this comes with several side effects, so I discourage anyone from using it in production; for example, the final request will raise a DNSLookupError. I just want to show what is possible.
Oof, I'm afraid spider_closed is used for tearing down. I suppose you can do it by attaching some custom logic to an item pipeline to post-process your items.
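To sketch the pipeline idea: an item pipeline is a plain class that can accumulate state in process_item and act on it in close_spider (the class and field names below are illustrative; note that close_spider cannot yield new items into the pipeline chain, only persist or report the aggregate):

```python
class TotalSalesPipeline:
    """Aggregates a running total and reports it when the spider closes."""

    def open_spider(self, spider):
        self.total_sales = 0

    def process_item(self, item, spider):
        # items flow through unchanged; we only accumulate
        self.total_sales += item.get('total_sales', 0)
        return item

    def close_spider(self, spider):
        # cannot yield an item here, but we can persist/report the aggregate
        print(f"total_sales={self.total_sales}")
```

Enable it via ITEM_PIPELINES in settings.py, e.g. `{'myproject.pipelines.TotalSalesPipeline': 300}` (path illustrative).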

Scrapy high CPU usage

I have a very simple test spider which does no parsing. However I'm passing a large number of urls (500k) to the spider in the start_requests method and seeing very high (99/100%) cpu usage. Is this the expected behaviour? if so how can I optimize this (perhaps batching and using spider_idle?)
class TestSpider(Spider):
    name = 'test_spider'
    allowed_domains = ['mydomain.com']  # must be a list, not a bare string

    def __init__(self, **kw):
        super(TestSpider, self).__init__(**kw)
        urls_list = kw.get('urls')
        if urls_list:
            self.urls_list = urls_list

    def parse(self, response):
        pass

    def start_requests(self):
        # open in text mode and strip the newline, or Request() receives "url\n"
        with open(self.urls_list) as urls:
            for url in urls:
                yield Request(url.strip(), self.parse)
I think the main problem here is that you are scheduling too many links at once; try adding a Rule to avoid scraping links that don't contain what you want.
Scrapy provides really useful docs, check them out:
http://doc.scrapy.org/en/latest/topics/spiders.html

Scrapy request+response+download time

UPD: Not closing the question, because I think my approach is not as clear as it should be
Is it possible to get current request + response + download time for saving it to Item?
In "plain" python I do
start_time = time()
urllib2.urlopen('http://example.com').read()
time() - start_time
But how i can do this with Scrapy?
UPD: This solution is enough for me, but I'm not sure of the quality of the results. If you have many connections with timeout errors, the download time may be wrong (even DOWNLOAD_TIMEOUT * 3).
For
settings.py
DOWNLOADER_MIDDLEWARES = {
    'myscraper.middlewares.DownloadTimer': 0,
}
middlewares.py
from time import time

from scrapy.http import Response

class DownloadTimer(object):
    def process_request(self, request, spider):
        request.meta['__start_time'] = time()
        # returning None does not block middlewares with a greater number than this one
        return None

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response  # a response middleware must return the response

    def process_exception(self, request, exception, spider):
        request.meta['__end_time'] = time()
        return Response(
            url=request.url,
            status=110,
            request=request)
Inside spider.py, in def parse(...):
log.msg('Download time: %.2f - %.2f = %.2f' % (
    response.meta['__end_time'], response.meta['__start_time'],
    response.meta['__end_time'] - response.meta['__start_time']
), level=log.DEBUG)
You could write a Downloader Middleware which would time each request. It would add a start time to the request before it's made and then a finish time when it's finished. Typically, arbitrary data such as this is stored in the Request.meta attribute. This timing information could later be read by your spider and added to your item.
This downloader middleware sounds like it could be useful on many projects.
Not sure if you need a middleware here. Scrapy already exposes this in request.meta, which you can query and yield. For download latency, simply yield:
download_latency=response.meta.get('download_latency'),
The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
I think the best solution is to use Scrapy signals. Whenever a request reaches the downloader it emits the request_reached_downloader signal; after download it emits the response_downloaded signal. You can catch them from the spider and assign the timestamps and their difference to meta from there.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
    return spider
A more elaborate answer is available here.