Scrapy: how to yield an item after spider_closed is called?

I want to yield an item only when the crawling is finished.
I am trying to do it via
def spider_closed(self, spider):
    item = EtsyItem()
    item['total_sales'] = 1111111
    yield item
But it does not yield anything, though the function is called.
How do I yield an item after the scraping is over?

Depending on what you want to do, there is a very hacky solution for this.
Instead of spider_closed, consider using the spider_idle signal, which is fired before spider_closed. One difference between idle and closed is that spider_idle still allows requests to be scheduled, and those requests can carry a callback or errback that yields the desired item.
Inside spider class:
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
    return spider

# ...

def yield_item(self, failure):
    # the errback receives a Failure, but we only need it to emit the item
    yield MyItem(name='myname')

def spider_idle(self, spider):
    req = Request('https://fakewebsite123.xyz',
                  callback=lambda response: None,
                  errback=self.yield_item)
    self.crawler.engine.crawl(req, spider)
However, this comes with side effects, for example the final request will raise a DNSLookupError, so I discourage anyone from using it in production. I just want to show what is possible.

Oof, I'm afraid spider_closed is meant for tearing down. I suppose you can do it by attaching some custom logic to an item pipeline to post-process your items.
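For illustration, a minimal sketch of that pipeline idea, assuming a made-up pipeline name, item field and output file: aggregate values in process_item and write the summary in close_spider (a pipeline cannot yield new items either, but it can persist the final result).
# pipelines.py (hypothetical sketch)
import json

class SummaryPipeline:
    def __init__(self):
        self.total_sales = 0

    def process_item(self, item, spider):
        # aggregate whatever you need from each scraped item
        self.total_sales += item.get('sales', 0)
        return item

    def close_spider(self, spider):
        # called once the spider finishes; write the summary here
        with open('summary.json', 'w') as f:
            json.dump({'total_sales': self.total_sales}, f)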

Related

How can I add a new spider arg to my own template in Scrapy/Zyte

I am working on a paid proxy spider template and would like the ability to pass in a new argument on the command line for a Scrapy crawler. How can I do that?
This is achievable by using kwargs in your spider's __init__ method:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        self.your_arg = kwargs.get("your_cmd_arg", 42)
Now it would be possible to call the spider as follows:
scrapy crawl your_spider -a your_cmd_arg=foo
For more information on the topic, feel free to check this page in the Scrapy documentation.

How do I tell the spider to stop requesting after n failed requests?

import scrapy

class MySpider(scrapy.Spider):
    start_urls = []

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for i in range(1, 1000):
            self.start_urls.append("some url" + str(i))

    def parse(self, response):
        print(response)
Here we queue the URLs in the __init__ method, but I want to stop making requests if they fail or return something undesirable. How do I tell the spider to stop making requests, say, after 10 failed requests?
You might want to set CLOSESPIDER_ERRORCOUNT to 10 in that case. It probably does not count failed requests only, though. Alternatively, you could set HTTPERROR_ALLOWED_CODES so that error responses (failed requests) reach your callbacks, and implement your own failed-request counter inside the spider. Then, when the counter goes above the threshold, you raise the CloseSpider exception yourself.
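A minimal sketch of that counter approach, assuming a made-up threshold and placeholder URLs, and using HTTPERROR_ALLOW_ALL so that error responses reach the callback:
import scrapy
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = 'my_spider'
    # let non-200 responses reach parse() instead of being filtered out
    custom_settings = {'HTTPERROR_ALLOW_ALL': True}
    max_failures = 10  # assumed threshold
    failed_count = 0

    def start_requests(self):
        for i in range(1, 1000):
            yield scrapy.Request('https://example.com/page/%d' % i, callback=self.parse)

    def parse(self, response):
        if response.status >= 400:
            self.failed_count += 1
            if self.failed_count >= self.max_failures:
                raise CloseSpider('too many failed requests')
            return
        print(response)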

QThread - changing data in global list - different values

I am using QThread to do some time intensive calculations to keep GUI from freezing. In the QThread I am accessing and changing global lists many times during the thread life span, however I am unable to get the same result as if it were just on the main thread.
I would assume you have to perform some kind of locking, but I'm new to QThread and don't know how to implement it.
# Main thread
self.runGasAnalysisThread = GasAnalysisThread()
self.runGasAnalysisThread.start()

# QThread subclass
class GasAnalysisThread(QtCore.QThread):
    """Performs gas analysis function"""

    def __init__(self, parent=None):
        super().__init__(parent)

    def run(self):
        try:
            boolValue = True
            while True:
                # Change lists here
                # Data is another module I am using to store variables
                float(Data.TestList[0]) + 1
Again, moving the code to the main thread works correctly, but as soon as I do it with QThread I get different results.
How would I implement a locking mechanism to keep this from happening?
It is common to be confused by Qt's threading, as one would think that subclassing QThread is the right path.
The truth is that a QThread is the Qt thread object in which your processing actually runs, which means you need a separate worker class and you move its instance to a QThread. Subclassing QThread is usually unnecessary.
If you need any kind of interaction between the "worker" (the object that does the processing) and the main thread (as in the GUI), it's good practice to use Qt's signals.
In this example I'm using a button to start the processing, once the processor is started it disables the button, and re-enables it once it signals that the process has finished.
class Worker(QtCore.QObject):
    stateChanged = QtCore.pyqtSignal(bool)

    def __init__(self):
        super().__init__()

    def run(self):
        self.stateChanged.emit(True)
        try:
            boolValue = True
            while True:
                # this doesn't make much sense, as it doesn't seem to do anything;
                # I'll leave it just for the sake of the argument
                float(Data.TestList[0]) + 1
        except:
            pass
        self.stateChanged.emit(False)


class SomeWidget(QtWidgets.QWidget):
    def __init__(self, parent=None):
        super().__init__(parent)
        layout = QtWidgets.QHBoxLayout()
        self.setLayout(layout)
        self.startButton = QtWidgets.QPushButton('Start')
        layout.addWidget(self.startButton)

        self.worker = Worker()
        self.workerThread = QtCore.QThread()
        self.worker.moveToThread(self.workerThread)
        self.workerThread.started.connect(self.worker.run)
        self.startButton.clicked.connect(self.workerThread.start)
        self.worker.stateChanged.connect(self.startButton.setDisabled)
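A minimal, assumed usage snippet to run the widget above; the imports and application boilerplate are not part of the original answer and presume PyQt5:
import sys
from PyQt5 import QtCore, QtWidgets

# Worker and SomeWidget defined as above

if __name__ == '__main__':
    app = QtWidgets.QApplication(sys.argv)
    widget = SomeWidget()
    widget.show()
    sys.exit(app.exec_())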

Scrapy high CPU usage

I have a very simple test spider which does no parsing. However, I'm passing a large number of URLs (500k) to the spider in the start_requests method and seeing very high (99/100%) CPU usage. Is this the expected behaviour? If so, how can I optimize it (perhaps by batching and using spider_idle)?
class TestSpider(Spider):
    name = 'test_spider'
    allowed_domains = ['mydomain.com']

    def __init__(self, **kw):
        super(TestSpider, self).__init__(**kw)
        urls_list = kw.get('urls')
        if urls_list:
            self.urls_list = urls_list

    def parse(self, response):
        pass

    def start_requests(self):
        with open(self.urls_list, 'rb') as urls:
            for url in urls:
                yield Request(url, self.parse)
I think the main problem here is that you are following too many links. Try adding a Rule to avoid scraping links that don't contain what you want, as in the sketch below.
Scrapy provides really useful docs, check them out:
http://doc.scrapy.org/en/latest/topics/spiders.html
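For illustration, a minimal CrawlSpider sketch using a Rule with a LinkExtractor to restrict which links get followed; the allow pattern, domain and callback name are made-up placeholders:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FilteredSpider(CrawlSpider):
    name = 'filtered_spider'
    allowed_domains = ['mydomain.com']
    start_urls = ['http://mydomain.com/']

    rules = (
        # only follow and parse links whose URL matches the pattern we care about
        Rule(LinkExtractor(allow=r'/interesting/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        pass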

Scrapy request+response+download time

UPD: Not closing the question, because I think my approach is not as clean as it should be.
Is it possible to get the current request + response + download time and save it to the Item?
In "plain" Python I do:
start_time = time()
urllib2.urlopen('http://example.com').read()
time() - start_time
But how can I do this with Scrapy?
UPD:
The solution below is enough for me, but I'm not sure about the quality of the results. If you have many connections with timeout errors, the download time may be wrong (even DOWNLOAD_TIMEOUT * 3).
In settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myscraper.middlewares.DownloadTimer': 0,
}
In middlewares.py:
from time import time
from scrapy.http import Response

class DownloadTimer(object):
    def process_request(self, request, spider):
        request.meta['__start_time'] = time()
        # returning None lets middlewares with a greater priority number keep processing the request
        return None

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response  # we must return the response

    def process_exception(self, request, exception, spider):
        request.meta['__end_time'] = time()
        return Response(
            url=request.url,
            status=110,
            request=request)
Inside spider.py, in def parse(...):
log.msg('Download time: %.2f - %.2f = %.2f' % (
    response.meta['__end_time'], response.meta['__start_time'],
    response.meta['__end_time'] - response.meta['__start_time']
), level=log.DEBUG)
You could write a Downloader Middleware which would time each request. It would add a start time to the request before it's made and then a finish time when it's finished. Typically, arbitrary data such as this is stored in the Request.meta attribute. This timing information could later be read by your spider and added to your item.
This downloader middleware sounds like it could be useful on many projects.
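For illustration, a hedged sketch of reading those meta keys in the callback and storing the duration on the item; the item class and field name are assumptions, while the '__start_time'/'__end_time' keys are the ones set by the DownloadTimer middleware above:
def parse(self, response):
    item = MyItem()  # hypothetical item class
    item['download_time'] = (
        response.meta['__end_time'] - response.meta['__start_time']
    )
    yield item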
Not sure if you need a middleware here. Scrapy already records this in response.meta, which you can read and yield. For the download latency, simply yield
download_latency=response.meta.get('download_latency')
as a field of your item. The documentation describes download_latency as follows:
The amount of time spent to fetch the response, since the request has
been started, i.e. HTTP message sent over the network. This meta key
only becomes available when the response has been downloaded. While
most other meta keys are used to control Scrapy behavior, this one is
supposed to be read-only.
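A minimal sketch of that approach inside a spider callback; the field names in the yielded dict are assumptions:
def parse(self, response):
    yield {
        'url': response.url,
        # set by Scrapy once the response has been downloaded
        'download_latency': response.meta.get('download_latency'),
    }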
I think the best solution is to use Scrapy signals. Whenever a request reaches the downloader, the request_reached_downloader signal is emitted; after the download, the response_downloaded signal is emitted. You can catch them in the spider and record the times and their difference in meta from there.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
return spider
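For illustration, a hedged sketch of what those signal handlers could look like; the handler names and meta keys are assumptions, not part of the original answer:
from time import time
from scrapy import signals

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super().from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.on_reached_downloader,
                            signal=signals.request_reached_downloader)
    crawler.signals.connect(spider.on_response_downloaded,
                            signal=signals.response_downloaded)
    return spider

def on_reached_downloader(self, request, spider):
    # record when the request actually reaches the downloader
    request.meta['__sent_at'] = time()

def on_response_downloaded(self, response, request, spider):
    # store the elapsed time so the parse callback can copy it onto the item
    request.meta['__download_time'] = time() - request.meta['__sent_at']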
A more elaborate answer can be found here.