Scrapy: Batch processing before the spider closes down

I use Scrapy to collect new contacts into my Hubspot account, and I have now started to use an item pipeline. However, that leads to a lot of API calls, since each item is processed individually. Since Hubspot also offers a batch API call, I wonder if there is a way to access all items at the end, once my crawler is done.
pipelines.py
class InsertDataToHubspot:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(hubspot_api=crawler.settings.get("HUBSPOT_API"))

    def __init__(self, hubspot_api):
        self.hubspot_api = hubspot_api

    def process_item(self, item, spider):
        # INSERT INTO HUBSPOT VIA API CALL
        return item

You could implement batching in your pipeline: buffer incoming items on the pipeline object, flush them to the batch endpoint every N items, and use close_spider to flush the final partial batch.
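A minimal sketch of that idea, assuming a hypothetical _flush() helper that wraps Hubspot's batch endpoint (the batch size is an arbitrary example):

class InsertDataToHubspotBatched:
    BATCH_SIZE = 100  # illustrative threshold

    @classmethod
    def from_crawler(cls, crawler):
        return cls(hubspot_api=crawler.settings.get("HUBSPOT_API"))

    def __init__(self, hubspot_api):
        self.hubspot_api = hubspot_api
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(item)
        if len(self.buffer) >= self.BATCH_SIZE:
            self._flush()
        return item

    def close_spider(self, spider):
        # called once when the crawl finishes; send whatever is left
        if self.buffer:
            self._flush()

    def _flush(self):
        # hypothetical helper: send self.buffer to Hubspot's batch endpoint
        # using self.hubspot_api, then clear the buffer
        self.buffer = []

Because close_spider runs when the spider is closed, the last partial batch is not lost.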

Related

Cloning a Request with the already downloaded response

From within a scraper's parse callback, I wish to clone a request along with its response object and change its callback.
The behavior I'm expecting is that this will generate a request whose callback is executed while skipping the download step, since it already has the original response object.
Is it possible to put new requests into the queue without ending the current iteration in the callback?
Furthermore, is it possible to generate a new request object for other spiders within the crawler?
Just call the other callback directly with a copy of the already-downloaded response; there is no need to schedule a new request:

def parse(self, response):
    # hand the existing response straight to the other callback
    yield from self.another_function(response.copy())

def another_function(self, response):
    # here comes your logic
    ...

Request-related data is available in response.request.

Is there a pipeline concept from within ScrapyD?

Looking at the documentation for Scrapy and Scrapyd, it appears the only way to write the result of a scrape is to put that code in the pipeline of the spider itself. My colleagues tell me there is an additional way whereby I can intercept the result of the scrape from within Scrapyd.
Has anyone heard of this, and if so, can someone shed some light on it for me?
Thanks
See also: item exporters, feed exports, the Scrapyd config.
Scrapyd is indeed a service that can be used to schedule crawling processes of your Scrapy application through a JSON API. It also permits integrating Scrapy with other frameworks such as Django; see this guide if you are interested.
Here is the documentation of Scrapyd.
However, if your doubt is about saving the results of your scraping, the standard way is to do so in the pipelines.py file of your Scrapy application.
An example:
class Pipeline(object):
    def __init__(self):
        # initialize your pipeline here, e.g. connect to a database or open a file
        pass

    def process_item(self, item, spider):
        # specify here what needs to be done with the scraping result of a single page
        return item
Remember to declare which pipeline you are using in your Scrapy application's settings.py:

ITEM_PIPELINES = {
    'scrapy_application.pipelines.Pipeline': 100,
}
Source: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
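The feed exports mentioned above are likely what your colleagues were referring to: they write the scraped items to a file or storage backend purely through settings, with no pipeline code at all. A minimal sketch using the FEEDS setting (Scrapy 2.1+; the file name is an arbitrary example):

# settings.py
FEEDS = {
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
    },
}

Since Scrapyd's schedule.json endpoint accepts per-job setting overrides, the feed location and format can also be decided at schedule time.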

Scrapy - download images from image url list

Scrapy has an ImagesPipeline that helps download images. The process is:
Spider: start from a link, parse all image URLs in the response, and save the image URLs to items.
ImagesPipeline: items['image_urls'] are processed by ImagesPipeline.
But what if I don't need the spider part and have 100k image URLs ready to be downloaded, for example read from Redis? How do I call ImagesPipeline directly to download the images?
I know I could simply make a Request in a spider and save the response, but I'd like to see if there is a way to use the default ImagesPipeline to save the images directly.
I don't think the use case you describe is the best fit for Scrapy; Wget would work fine for such a constrained problem.
If you really need to use Scrapy for this, make a dummy request to some URL:

def start_requests(self):
    request = Request('http://example.com')
    # load the image URLs from redis
    redis_img_urls = ...
    request.meta['redis_img_urls'] = redis_img_urls
    yield request

Then in the parse() method return them via response.meta:

def parse(self, response):
    return {'image_urls': response.meta['redis_img_urls']}

This is ugly, but it should work fine...
P.S. I'm not aware of any easy way to bypass the dummy request and inject an Item directly. I'm sure there's one, but it's such an unusual thing to do.
The idea behind a Scrapy pipeline is to process the items the spider generates, as explained here.
Scrapy isn't really about "downloading" stuff; it's a framework for creating crawlers (spiders), so if you just have a list of URLs to download, a plain for loop over them would do.
If you still want to use a Scrapy pipeline, you'll have to return an item with that list inside the image_urls field:

def start_requests(self):
    yield Request('http://httpbin.org/ip', callback=self.parse)

def parse(self, response):
    ...
    yield {'image_urls': [your list]}

Then enable the pipeline in the settings.
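For reference, enabling the built-in ImagesPipeline only takes two settings (the storage path is an arbitrary example; the pipeline also requires Pillow to be installed):

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/path/to/image/directory'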

In scrapy do calls to crawler.engine.crawl() bypass the throttling mechanism?

To give some background, I'm writing a spider that listens on a RabbitMQ topic for new URLs to spider. When it pulls a URL from the queue, it adds it to the crawl queue by calling crawler.engine.crawl(request). I've noticed that if I drop 200 URLs onto the queue (all for the same domain) I sometimes get timeouts; however, this doesn't happen if I add the 200 URLs via the start_urls property.
So I'm wondering whether the normal throttling mechanisms (concurrent requests per domain, delay, etc.) apply when adding URLs via crawler.engine.crawl()?
Here is a small code sample:
@defer.inlineCallbacks
def read(self, queue_object):
    # pull a url from the RabbitMQ topic
    ch, method, properties, body = yield queue_object.get()
    if body:
        req = Request(url=body)
        log.msg('Scheduling ' + body + ' for crawl')
        self.crawler.engine.crawl(req, spider=self)
    yield ch.basic_ack(delivery_tag=method.delivery_tag)
It doesn't bypass the DownloaderMiddlewares or the Downloader. The requests go to the Scheduler directly, bypassing only the SpiderMiddlewares.
Source
IMO you should use a SpiderMiddleware with process_start_requests to override your spider's start_requests.
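A rough sketch of that suggestion, assuming a hypothetical get_urls_from_rabbitmq() helper and placeholder module paths:

# middlewares.py
from scrapy import Request

class RabbitStartRequestsMiddleware:
    def process_start_requests(self, start_requests, spider):
        # pass through whatever the spider already yields
        for request in start_requests:
            yield request
        # hypothetical helper that pulls pending URLs from RabbitMQ
        for url in get_urls_from_rabbitmq():
            yield Request(url, callback=spider.parse)

# settings.py
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.RabbitStartRequestsMiddleware': 50,
}

Requests produced by process_start_requests enter the scheduler like any other start request, so the per-domain concurrency and delay settings apply to them.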

Scrapy DupeFilter on a per spider basis?

I currently have a project with quite a few spiders and around half of them need some custom rule to filter duplicating requests. That's why I have extended the RFPDupeFilter class with custom rules for each spider that needs it.
My custom dupe filter checks if the request URL is from a site that needs custom filtering and cleans the URL (removes query parameters, shortens paths, extracts unique parts, etc.), so that the fingerprint is the same for all identical pages. So far so good; however, at the moment I have a function with around 60 if/elif statements that each request goes through. This is not only suboptimal, but also hard to maintain.
So here comes the question. Is there a way to create a filtering rule that "cleans" the URLs inside the spider? The ideal approach for me would be to extend the Spider class and define a clean_url method, which by default just returns the request URL, and override it in the spiders that need something custom. I looked into it, but I can't seem to find a way to access the current spider's methods from the dupe filter class.
Any help would be highly appreciated!
You could implement a downloader middleware.
middleware.py
from scrapy.exceptions import IgnoreRequest

class CleanUrl(object):
    seen_urls = set()

    def process_request(self, request, spider):
        url = spider.clean_url(request.url)
        if url != request.url:
            # the rewritten request re-enters the middleware chain,
            # so it is checked against seen_urls on its next pass
            return request.replace(url=url)
        if url in self.seen_urls:
            raise IgnoreRequest()
        self.seen_urls.add(url)
settings.py
DOWNLOADER_MIDDLEWARES = {
    'PROJECT_NAME_HERE.middleware.CleanUrl': 500,
    # if you want to make sure this is the last middleware to execute, increase the 500 to 1000
}
You would probably want to disable the dupefilter altogether if you did it this way.
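On the spider side, the clean_url that the middleware above calls could come from a small base class, roughly like this (class and spider names are illustrative):

import scrapy

class CleanUrlSpider(scrapy.Spider):
    def clean_url(self, url):
        # default: no cleaning, use the URL as-is
        return url

class SomeSiteSpider(CleanUrlSpider):
    name = 'somesite'

    def clean_url(self, url):
        # site-specific normalization, e.g. drop query parameters
        return url.split('?', 1)[0]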