Scrapy - download images from image URL list

Scrapy has an ImagesPipeline that helps download images. The process is:
Spider: start from a link, parse all image URLs in the response, and save the image URLs to items.
ImagesPipeline: items['image_urls'] are processed by the ImagesPipeline.
But what if I don't need the spider part and already have 100k image URLs ready to be downloaded, for example read from redis? How do I call the ImagesPipeline directly to download the images?
I know I could simply make a Request in the spider and save the response, but I'd like to see if there is a way to use the default ImagesPipeline to save the images directly.

I don't think that the use case you describe is the best fit for Scrapy. Wget would work fine for such a constrained problem.
If you really need to use Scrapy for this, make a dummy request to some URL:
def start_requests(self):
    request = Request('http://example.com')
    # load from redis
    redis_img_urls = ...
    request.meta['redis_img_urls'] = redis_img_urls
    yield request
Then in the parse() method return:
def parse(self, response):
    return {'image_urls': response.meta['redis_img_urls']}
This is ugly but it should work fine...
P.S. I'm not aware of any easy way to bypass the dummy request and inject an Item directly. I'm sure there's one, but it's such an unusual thing to do.

The idea behind a Scrapy pipeline is to process the items the spider generates, as explained here.
Now Scrapy isn't about "downloading" stuff; it's a way to create crawlers and spiders, so if you just have a list of URLs to download, a plain for loop over them would do.
If you still want to use a Scrapy pipeline, then you'll have to return an item with that list inside the image_urls field:
def start_requests(self):
    yield Request('http://httpbin.org/ip', callback=self.parse)

def parse(self, response):
    ...
    yield {'image_urls': [your list]}
Then enable the pipeline in your settings, as sketched below.
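A minimal sketch of the settings side, assuming the stock ImagesPipeline and a local folder for the downloaded files (the path is a placeholder; the stock pipeline also needs Pillow installed):
# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/path/to/images'   # placeholder download folder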

Related

How to collect real-time data from dynamic JS websites using Python Scrapy+Splash?

I am using Scrapy-Splash to scrape real-time data from JavaScript websites, with Docker running Splash. The spider works completely fine and I'm getting the required data from the website. However, the spider crawls once and finishes, so I only get the data for a particular moment. I want to collect the data continuously and store it in a database (i.e. MySQL), since the data is updated every second. The crawl needs to keep going and show the data in real time using plotting libraries (i.e. Matplotlib, Plotly). Is there any way to keep the spider running as Splash renders the updated data (I'm not sure if Splash updates the data like a normal browser)? Here's my code:
import scrapy
import pandas as pd
from scrapy_splash import SplashRequest
from scrapy.http import Request

class MandalaSpider(scrapy.Spider):
    name = 'mandala'

    def start_requests(self):
        link = "website link"
        yield SplashRequest(url=link, callback=self.parse)

    def parse(self, response):
        products = response.css('a.abhead::text').getall()
        nub = []
        for prod in products:
            prod = prod.strip()
            prod = prod.split()
            nub.extend(prod)
        data = pd.DataFrame({
            'Name': nub[0:len(nub):4],
            'Price': nub[1:len(nub):4],
            'Change': nub[2:len(nub):4],
            'Change_percent': nub[3:len(nub)+1:4],
        })
        # For single data monitoring
        sub = []
        sub = sub.extend(data.iloc[3, 1])
        yield sub
        yield Request(response.url, callback=self.parse, dont_filter=True)
I am a complete newbie in web scraping, so any additional information is greatly appreciated. I have searched other posts on this website but unfortunately couldn't find the solid info I needed. This type of problem is usually solved with Selenium and BeautifulSoup, but I wanted to use Scrapy.
Thanks in advance.
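One thing worth sketching (this is an assumption about intent, not the original poster's code): the final yield uses a plain Request, which bypasses Splash, so the re-crawled page is never re-rendered. Re-yielding a SplashRequest keeps the loop going through Splash; the wait value below is a hypothetical settling delay:
import scrapy
from scrapy_splash import SplashRequest

class MandalaSpider(scrapy.Spider):
    # ... start_requests and the extraction code stay as above ...

    def parse(self, response):
        # ... yield the extracted data ...
        # re-render the same page through Splash so the values are refreshed;
        # a plain Request would fetch the unrendered HTML instead
        yield SplashRequest(
            response.url,
            callback=self.parse,
            args={'wait': 1},   # hypothetical one-second settling delay
            dont_filter=True,   # allow re-crawling the same URL
        )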

Scrapy: Batch processing before spider closes down

I use Scrapy to collect new contacts into my Hubspot account. I have now started to use a pipeline; however, that leads to a lot of API calls, as each item is handled on its own. Since Hubspot also has an API call for batch processing, I wonder if there is a way to access all items at the end, once my crawler is done.
pipelines.py
class InsertDataToHubspot:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(hubspot_api=crawler.settings.get("HUBSPOT_API"))

    def process_item(self, item, spider):
        # INSERT INTO HUBSPOT VIA API CALL
        return item
You could implement batching in your pipeline: have the pipeline store input items in a local variable, flush them every N items, and use close_spider to flush the last batch.
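A minimal sketch of that idea; the send_batch_to_hubspot helper and the batch size of 100 are placeholders, not part of the original code:
class InsertDataToHubspot:
    def __init__(self, hubspot_api, batch_size=100):
        self.hubspot_api = hubspot_api
        self.batch_size = batch_size
        self.batch = []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(hubspot_api=crawler.settings.get("HUBSPOT_API"))

    def process_item(self, item, spider):
        self.batch.append(item)
        if len(self.batch) >= self.batch_size:
            self._flush()
        return item

    def close_spider(self, spider):
        # send whatever is left when the crawl finishes
        if self.batch:
            self._flush()

    def _flush(self):
        # hypothetical helper: one batch API call instead of one call per item
        send_batch_to_hubspot(self.hubspot_api, self.batch)
        self.batch = []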

Is there a pipeline concept from within ScrapyD?

Looking at the documentation for Scrapy and Scrapyd, it appears the only way to write out the result of a scrape is to write the code in the pipeline of the spider itself. My colleagues tell me there is an additional way, whereby I can intercept the result of the scrape from within Scrapyd.
Has anyone heard of this, and if so, can someone shed some light on it for me please?
Thanks
Have a look at item exporters, feed exports, and the scrapyd config.
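Of those, feed exports are likely what the colleagues meant: they let Scrapy write the scraped items out through settings alone, without extra pipeline code. A minimal sketch, assuming a local JSON file as the target (the file name is a placeholder):
# settings.py: export every scraped item to a JSON file
FEEDS = {
    'items.json': {'format': 'json'},
}
Since Scrapyd simply runs your project's spiders with the project settings, the same setting applies when the crawl is scheduled through Scrapyd.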
Scrapyd is indeed a service that can be used to schedule crawling processes of your Scrapy application through a JSON API. It also permits the integration of Scrapy with different frameworks such as Django; see this guide in case you are interested.
Here is the documentation of Scrapyd.
However, if your question is about saving the result of your scraping, the standard way is to do so in the pipelines.py file of your Scrapy application.
An example:
class Pipeline(object):
    def __init__(self):
        # initialization of your pipeline, e.g. connecting to a database or creating a file
        pass

    def process_item(self, item, spider):
        # specify here what needs to be done with the scraping result of a single page
        return item
Remember to declare which pipeline you are using in your Scrapy application's settings.py:
ITEM_PIPELINES = {
    'scrapy_application.pipelines.Pipeline': 100,
}
Source: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

Prevent Scrapy response from being added to cache

I am crawling a website that returns pages with a captcha and a status code 200, suggesting everything is OK. This causes the page to be put into Scrapy's cache.
I want to recrawl these pages later, but if they are in the cache, they won't get recrawled.
Is it feasible to overload the process_response function of the httpcache middleware, or to look for a specific string in the response HTML and override the 200 status with an error code?
What would be the easiest way to keep Scrapy from putting certain responses into the cache?
Scrapy uses scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache HTTP responses. To skip caching for a request you can just set the request meta key dont_cache to True, like:
yield Request(url, meta={'dont_cache': True})
The middleware docs also mention how to disable the cache project-wide with a setting, if you are interested in that too.
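If the decision has to be based on the response body (the captcha case in the question), another option is a custom cache policy. A minimal sketch, assuming the captcha page can be recognised by a marker string (the string and module path are placeholders; HTTPCACHE_POLICY and should_cache_response are standard Scrapy hooks):
# myproject/httpcache.py (placeholder module path)
from scrapy.extensions.httpcache import DummyPolicy

class SkipCaptchaCachePolicy(DummyPolicy):
    def should_cache_response(self, response, request):
        # do not store pages that contain the captcha marker (placeholder string)
        if b'captcha' in response.body:
            return False
        return super().should_cache_response(response, request)

# settings.py
# HTTPCACHE_POLICY = 'myproject.httpcache.SkipCaptchaCachePolicy'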

Scrapy DupeFilter on a per spider basis?

I currently have a project with quite a few spiders and around half of them need some custom rule to filter duplicating requests. That's why I have extended the RFPDupeFilter class with custom rules for each spider that needs it.
My custom dupe filter checks whether the request URL is from a site that needs custom filtering and cleans the URL (removes query parameters, shortens paths, extracts unique parts, etc.), so that the fingerprint is the same for all identical pages. So far so good; however, at the moment I have a function with around 60 if/elif statements that each request goes through. This is not only suboptimal, it's also hard to maintain.
So here comes the question: is there a way to create the filtering rule that 'cleans' the URLs inside the spider? The ideal approach for me would be to extend the Spider class and define a clean_url method, which by default just returns the request URL, and override it in the spiders that need something custom. I looked into it, but I can't seem to find a way to access the current spider's methods from the dupe filter class.
Any help would be highly appreciated!
You could implement a downloader middleware.
middleware.py
from scrapy.exceptions import IgnoreRequest

class CleanUrl(object):
    seen_urls = set()

    def process_request(self, request, spider):
        if request.meta.get('url_cleaned'):
            # this is the cleaned request coming back through the middleware chain
            return None
        url = spider.clean_url(request.url)
        if url in self.seen_urls:
            raise IgnoreRequest()
        self.seen_urls.add(url)
        # reschedule the request with the cleaned URL; the meta flag keeps it
        # from being treated as a duplicate on its second pass through here
        return request.replace(url=url, meta={**request.meta, 'url_cleaned': True})
settings.py
DOWNLOADER_MIDDLEWARES = {'PROJECT_NAME_HERE.middleware.CleanUrl': 500}
# if you want to make sure this is the last middleware to execute, increase the 500 to 1000
You would probably want to disable the dupefilter altogether if you did it this way.
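For the middleware above to work, every spider has to expose a clean_url method. A minimal sketch of the default-plus-override pattern described in the question (both spider classes are hypothetical):
import scrapy

class CleanUrlSpider(scrapy.Spider):
    # hypothetical base class: by default the URL is returned unchanged
    def clean_url(self, url):
        return url

class SomeSiteSpider(CleanUrlSpider):
    name = 'somesite'

    def clean_url(self, url):
        # site-specific rule: drop query parameters before deduplication
        return url.split('?', 1)[0]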