Is it good to access spider attributes in a scrapy pipeline?

The scrapy pipeline docs say that one of the parameters of process_item is the spider:
process_item(self, item, spider)
Parameters:
item (Item object or a dict) – the item scraped
spider (Spider object) – the spider which scraped the item
I want to send a list of one type of item to the pipeline, but after much digging through the internet, everyone seems to either yield or return items to the pipeline one at a time.
SamplerSpider.py
class SamplerSpider(scrapy.Spider):
    name = 'SamplerSpider'
    allowed_domains = ['xxx.com']
    start_urls = (CONSTANTS.URL)
    result = []
pipeline.py
class SamplerSpiderPipeline(object):
    def __init__(self):
        # do something here
        pass

    def process_item(self, item, spider):
        # do something with spider.result
        return item
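Concretely, this is the pattern I have in mind, as a rough sketch (the selector in parse and the use of the close_spider hook are my own assumptions):

import scrapy

class SamplerSpider(scrapy.Spider):
    name = 'SamplerSpider'
    result = []

    def parse(self, response):
        # accumulate scraped data on the spider instead of yielding items one at a time
        self.result.append(response.css('title::text').get())

class SamplerSpiderPipeline(object):
    def process_item(self, item, spider):
        # individual items still pass through here one at a time
        return item

    def close_spider(self, spider):
        # the complete list is reachable through the spider object once the crawl ends
        print(spider.result)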
Is this a good way to do it? If not, why not?
Scraping information from a document will always result in more than one item. Why is the scrapy pipeline designed to process items one at a time?
Thanks in advance.

Related

How to call one spider from another spider in scrapy

I have two spiders, and I want one to call the other with the information it scraped, which is not a set of links I could simply follow. Is there any way of doing this, calling one spider from another?
To illustrate the problem better: the URL of the "one" page has the form /one/{item_name}, where {item_name} is the information I can get from the page /other/:
...
<li class="item">item1</li>
<li class="item">someItem</li>
<li class="item">anotherItem</li>
...
Then I have the spider OneSpider that scrapes /one/{item_name}, and OtherSpider that scrapes /other/ and retrieves the item names, as shown below:
class OneSpider(Spider):
    name = 'one'

    def __init__(self, item_name, *args, **kargs):
        super(OneSpider, self).__init__(*args, **kargs)
        self.start_urls = [ f'/one/{item_name}' ]

    def parse(self, response):
        ...

class OtherSpider(Spider):
    name = 'other'
    start_urls = [ '/other/' ]

    def parse(self, response):
        itemNames = response.css('li.item::text').getall()
        # TODO:
        # for each item name
        # scrape /one/{item_name}
        # with the OneSpider
I already checked these two questions: How to call particular Scrapy spiders from another Python script, and scrapy python call spider from spider, and several other questions where the main solution is creating another method inside the class and passing it as a callback to new requests, but I don't think that is applicable when these new requests would have customized URLs.
Scrapy doesn't have the ability to call one spider from another spider.
Related issue in the scrapy GitHub repo.
However, you can merge the logic from your two spiders into a single spider class:
import scrapy

class OtherSpider(scrapy.Spider):
    name = 'other'
    start_urls = [ '/other/' ]

    def parse(self, response):
        itemNames = response.css('li.item::text').getall()
        for item_name in itemNames:
            yield scrapy.Request(
                url = f'/one/{item_name}',
                callback = self.parse_item
            )

    def parse_item(self, response):
        # parse method from your OneSpider class
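One caveat worth adding (an assumption on my part, since the snippets use abbreviated paths): scrapy.Request needs an absolute URL, so relative paths such as /one/{item_name} would normally be built with response.urljoin, for example:

    def parse(self, response):
        for item_name in response.css('li.item::text').getall():
            # turn the relative path into an absolute URL before requesting it
            yield scrapy.Request(
                url = response.urljoin(f'/one/{item_name}'),
                callback = self.parse_item
            )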

Spider not recursively calling itself after setting callback

The goal of my project is to search a website for a company phone number.
I'm attempting to parse a webpage, regex for a phone number (I have that part working), and then look for links on the page. These links are what I want to recurse on: I would call the function on those links and repeat. However, it is only running the function once. See the code below:
def parse(self, response):
    # The main method of the spider. It scrapes the URL(s) specified in the
    # 'start_url' argument above. The content of the scraped URL is passed on
    # as the 'response' object.
    hxs = HtmlXPathSelector(response)
    #print(phone_detail)
    print('here')
    for phone_num in response.xpath('//body').re(r'\d{3}.\d{3}.\d{4}'):
        item = PhoneNumItem()
        item['label'] = "a"
        item['phone_num'] = phone_num
        yield item
    for url in hxs.xpath('//a/@href').extract():
        # This loops through all the URLs found
        # Constructs an absolute URL by combining the response's URL with a possible relative URL:
        next_page = response.urljoin(url)
        print("Found URL: " + next_page)
        #yield response.follow(next_page, self.parse_page)
        yield scrapy.Request(next_page, callback=self.parse)
Please let me know what you think; to me it seems like this code should work, but it is not working.
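For reference, the standard recursive link-following pattern in recent Scrapy versions looks roughly like this (a sketch only, not a diagnosis of why the crawl above stops; it yields plain dicts instead of PhoneNumItem for brevity, and the start URL is a placeholder):

import scrapy

class PhoneSpider(scrapy.Spider):
    name = 'phone'
    start_urls = ['https://example.com']

    def parse(self, response):
        # emit any phone numbers found on the current page
        for phone_num in response.xpath('//body').re(r'\d{3}.\d{3}.\d{4}'):
            yield {'label': 'a', 'phone_num': phone_num}
        # follow every link and run this same callback on it;
        # response.follow resolves relative URLs automatically
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)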

Automate page scroll-down in Splash and Scrapy

I am crawling a site which uses lazy loading for product images.
For this reason I included scrapy-splash, so that the JavaScript can be rendered and I can also provide a wait argument through Splash. Previously I thought it was because of timing that the raw scrapy.Request was returning a placeholder image instead of the originals.
I've tried setting the wait argument to 29.0 seconds as well, but my crawler still hardly gets 10 items (it should bring back 280 items based on my calculations). I have an item pipeline which checks whether the image field of the item is empty, in which case I raise DropItem.
I am not sure, but I have also noticed that it is not only a wait problem. It looks like the images get loaded as I scroll down.
What I am looking for is a way to automate scroll-to-bottom behaviour within my requests.
Here is my code
Spider
def parse(self, response):
    categories = response.css('div.navigation-top-links a.uppercase::attr(href)').extract()
    for category in categories:
        link = urlparse.urljoin(self.start_urls[0], category)
        yield SplashRequest(link, callback=self.parse_products_listing, endpoint='render.html',
                            args={'wait': 0.5})
Pipeline
class ScraperPipeline(object):
    def process_item(self, item, spider):
        if not item['images']:
            raise DropItem
        return item
Settings
IMAGES_STORE = '/scraper/images'
SPLASH_URL = 'http://172.22.0.2:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
ITEM_PIPELINES = {
    'scraper.pipelines.ScraperPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # 'custom_middlewares.middleware.ProxyMiddleware': 210,
}
If you are set on using Splash, this answer should give you some guidance: https://stackoverflow.com/a/40366442/7926936
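A rough sketch of that Splash-based approach, using the execute endpoint with a small Lua script that scrolls the page a few times before returning the HTML (the scroll count and wait values are placeholders, not tuned for this site):

from scrapy_splash import SplashRequest

SCROLL_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(1)
    -- scroll to the bottom a handful of times so lazy-loaded images appear
    for i = 1, 10 do
        splash:runjs("window.scrollTo(0, document.body.scrollHeight);")
        splash:wait(1)
    end
    return splash:html()
end
"""

def parse(self, response):
    categories = response.css('div.navigation-top-links a.uppercase::attr(href)').extract()
    for category in categories:
        yield SplashRequest(response.urljoin(category),
                            callback=self.parse_products_listing,
                            endpoint='execute',
                            args={'lua_source': SCROLL_SCRIPT})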
You could also use Selenium in a DownloaderMiddleware. This is an example I have for a Twitter scraper that gets the first 200 tweets of a page:
from selenium import webdriver
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

class SeleniumMiddleware(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        tweets = self.driver.find_elements_by_xpath("//li[@data-item-type='tweet']")
        while len(tweets) < 200:
            try:
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                WebDriverWait(self.driver, 10).until(
                    lambda driver: new_posts(driver, len(tweets)))
                tweets = self.driver.find_elements_by_xpath("//li[@data-item-type='tweet']")
            except TimeoutException:
                break
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

def new_posts(driver, min_len):
    return len(driver.find_elements_by_xpath("//li[@data-item-type='tweet']")) > min_len
In the while loop I wait on each iteration for new tweets, until 200 tweets have loaded on the page, with a maximum wait of 10 seconds per iteration.
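To wire a middleware like this into a project it would be enabled in settings.py; the module path and priority here are placeholders:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}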

How to download image using Scrapy?

I am a newbie to scrapy. I am trying to download an image from here. I was following the official docs and this article.
My settings.py looks like:
BOT_NAME = 'shopclues'
SPIDER_MODULES = ['shopclues.spiders']
NEWSPIDER_MODULE = 'shopclues.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 1
}
IMAGES_STORE = "home/pr.singh/Projects"
and items.py looks like:
import scrapy
from scrapy.item import Item

class ShopcluesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class ImgData(Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
I think both these files are fine, but I am unable to write the correct spider for getting the image. I am able to grab the image URL but don't know how to store the image using the images pipeline.
My spider looks like:
from shopclues.items import ImgData
import scrapy
import datetime

class DownloadFirstImg(scrapy.Spider):
    name = "DownloadfirstImg"
    start_urls = [
        'http://www.shopclues.com/canon-powershot-sx410-is-2.html',
    ]

    def parse(self, response):
        url = response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870")
        yield scrapy.Request(url.xpath('@href').extract(), self.parse_page)

    def parse_page(self, response):
        imgURl = response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870::attr(href)").extract()
        yield {
            ImgData(image_urls=[imgURl])
        }
I have written the spider following this article, but I am not getting anything. I run my spider as scrapy crawl DownloadfirstImg -o img5.json,
but I get neither any JSON nor any image. Any help on how to grab an image when its URL is known? I have never worked with Python before either, so things seem much more complicated to me. Links to any good tutorial may help. TIA
I don't understand why you yield a request for the image. You just need to save the URL on the item and the images pipeline will do the rest. This is all you need:
def parse(self, response):
    url = response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870")
    yield ImgData(image_urls=[url.xpath('@href').extract_first()])
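One side note (this depends on the Scrapy version in use, so treat it as an assumption about the setup): in recent Scrapy releases the images pipeline path is scrapy.pipelines.images.ImagesPipeline rather than scrapy.contrib.pipeline.images.ImagesPipeline, and IMAGES_STORE is normally an absolute path, for example:

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1
}
IMAGES_STORE = '/home/pr.singh/Projects/images'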

Exporting unique items from CrawlSpider

I am using scrapy's CrawlSpider class to iterate over a list of start_urls and crawl each site's internal pages to fetch e-mail addresses. I would like to export a file with a single (unique) item for each start_url, containing the list of matched e-mails. For that purpose I needed to override the make_requests_from_url and parse methods, so I can pass each start_url's item in the response's meta dict (see code) down to the internal pages. The output from running this code is:
www.a.com,['webmaster@a.com']
www.a.com,['webmaster@a.com','info@a.com']
www.a.com,['webmaster@a.com','info@a.com','admin@a.com']
However, I only want the export file to contain the last entry from the above output
(www.a.com,['admin@a.com','webmaster@a.com','info@a.com'])
Is that possible?
Code:
class MySpider(CrawlSpider):
    start_urls = [... urls list ...]

    def parse(self, response):
        for request_or_item in CrawlSpider.parse(self, response):
            if isinstance(request_or_item, Request):
                request_or_item.meta.update(dict(url_item=response.meta['url_item']))
            yield request_or_item

    def make_requests_from_url(self, url):
        # Create a unique item for each url. Append emails to this item from internal pages
        url_item = MyItem()
        url_item["url"] = url
        url_item["emails"] = []
        return Request(url, dont_filter=True, meta={'url_item': url_item})

    def parse_page(self, response):
        url_item = response.meta["url_item"]
        url_item["emails"].append(** some regex of emails from the response object **)
        return url_item
You could use a pipeline to process the items.
See the Duplicates filter example in the Scrapy documentation.
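A rough sketch of how such a pipeline could look; it keeps only the most recent item per URL and writes everything out when the spider closes (the class name, output path, and export format are my assumptions):

import json

class LatestItemPerUrlPipeline(object):
    def open_spider(self, spider):
        self.items_by_url = {}

    def process_item(self, item, spider):
        # later, more complete items for the same URL overwrite earlier ones
        self.items_by_url[item['url']] = dict(item)
        return item

    def close_spider(self, spider):
        # write one JSON line per unique start_url
        with open('unique_items.jl', 'w') as f:
            for entry in self.items_by_url.values():
                f.write(json.dumps(entry) + '\n')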