Automate page scroll to the bottom in Splash and Scrapy - scrapy

I am crawling a site which uses lazy loading for product images.
For this reason I included scrapy-splash so that the JavaScript can be rendered, and with Splash I can also provide a wait argument. Previously I thought it was a timing issue that made the raw scrapy.Request return a placeholder image instead of the originals.
I've tried a wait argument of up to 29.0 seconds, but my crawler still hardly gets 10 items (it should bring back 280 items based on my calculations). I have an item pipeline which checks whether the image field in the item is empty and, if so, raises DropItem.
I am not sure, but I have also noticed that it is not only a wait problem. It looks like the images only get loaded when I scroll down.
What I am looking for is a way to automate scroll-to-bottom behaviour within my requests.
Here is my code
Spider
def parse(self, response):
    categories = response.css('div.navigation-top-links a.uppercase::attr(href)').extract()
    for category in categories:
        link = urlparse.urljoin(self.start_urls[0], category)
        yield SplashRequest(link, callback=self.parse_products_listing, endpoint='render.html',
                            args={'wait': 0.5})
Pipeline
class ScraperPipeline(object):

    def process_item(self, item, spider):
        if not item['images']:
            raise DropItem
        return item
Settings
IMAGES_STORE = '/scraper/images'
SPLASH_URL = 'http://172.22.0.2:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
ITEM_PIPELINES = {
    'scraper.pipelines.ScraperPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # 'custom_middlewares.middleware.ProxyMiddleware': 210,
}

If you are set on using Splash, this answer should give you some guidance: https://stackoverflow.com/a/40366442/7926936
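For reference, the core of that approach is to run a small Lua script through Splash's execute endpoint that scrolls the page a few times before returning the rendered HTML. A rough sketch adapted to the spider above; the number of scrolls and the wait times are assumptions you would have to tune for the site:
# Sketch only: a Lua script for Splash's 'execute' endpoint that scrolls the
# page repeatedly before returning the HTML. 10 scrolls and 1-second waits
# are assumptions to adjust.
scroll_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(1)
    for i = 1, 10 do
        splash:runjs("window.scrollTo(0, document.body.scrollHeight);")
        splash:wait(1)
    end
    return splash:html()
end
"""

# ...then, inside parse(), instead of endpoint='render.html':
yield SplashRequest(link, callback=self.parse_products_listing,
                    endpoint='execute', args={'lua_source': scroll_script})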
You could also use Selenium in a DownloaderMiddleware. This is an example I have from a Twitter scraper that gets the first 200 tweets of a page:
from selenium import webdriver
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumMiddleware(object):

    def __init__(self):
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        tweets = self.driver.find_elements_by_xpath("//li[@data-item-type='tweet']")
        while len(tweets) < 200:
            try:
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                WebDriverWait(self.driver, 10).until(
                    lambda driver: new_posts(driver, len(tweets)))
                tweets = self.driver.find_elements_by_xpath("//li[@data-item-type='tweet']")
            except TimeoutException:
                break
        body = self.driver.page_source
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)


def new_posts(driver, min_len):
    return len(driver.find_elements_by_xpath("//li[@data-item-type='tweet']")) > min_len
In the while loop I wait on each iteration for new tweets, until there are 200 tweets loaded on the page, with a maximum wait of 10 seconds per iteration.
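For this middleware to take effect, it also has to be enabled in the project settings. A minimal sketch, assuming the class lives in a module such as scraper.middlewares; the module path and the priority value 543 are placeholders to adapt:
# settings.py (sketch): route requests through the Selenium middleware.
# 'scraper.middlewares.SeleniumMiddleware' and the priority 543 are assumptions.
DOWNLOADER_MIDDLEWARES = {
    'scraper.middlewares.SeleniumMiddleware': 543,
}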

Related

How to simulate pressing buttons to keep on scraping more elements with Scrapy

On this page (https://www.realestate.com.kh/buy/), I managed to grab a list of ads, and individually parse their content with this code:
import scrapy


class scrapingThings(scrapy.Spider):
    name = 'scrapingThings'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        c = 0
        for url in ads:
            c += 1
            absolute_url = response.urljoin(url.extract())
            self.item = {}
            self.item['url'] = absolute_url
            yield scrapy.Request(absolute_url, callback=self.parse_ad, meta={'item': self.item})

    def parse_ad(self, response):
        # Extract things
        yield {
            # Yield things
        }
However, I'd like to automate switching from one page to another so I can grab all of the available ads (not only those on the first page, but on all pages), by, I guess, simulating presses of the 1, 2, 3, 4, ..., 50 pagination buttons.
Is this even possible with Scrapy? If so, how can one achieve it?
Yes, it's possible. Let me show you two ways of doing it:
You can have your spider select the buttons, get their @href value, build a full URL, and yield it as a new request.
Here is an example:
def parse(self, response):
    ...
    href = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]/following-sibling::a[1]/@href').get()
    req_url = response.urljoin(href)
    yield Request(url=req_url, callback=self.parse_ad)
The selector in the example will always return the @href of the next page's button (it returns only one value: the @href of the button that follows the currently selected one).
On this page the href is a relative URL, so we need to use the response.urljoin() method to build a full URL, using the response as the base.
We yield a new request, and the response will be parsed in the callback function you set.
This requires your callback function to always yield the request for the next page, so it's a recursive solution.
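For concreteness, a sketch of that recursive pattern might look like the following; the selectors are copied from the snippets above and remain assumptions about the page's markup:
def parse_ad(self, response):
    # ... extract and yield the data for this page ...

    # Queue the next page so each callback keeps the crawl going (the recursive step).
    href = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]'
                          '/following-sibling::a[1]/@href').get()
    if href:
        yield Request(url=response.urljoin(href), callback=self.parse_ad)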
A simpler approach would be to observe the pattern of the hrefs and yield all the requests yourself. Each button has an href of "/buy/?page={nr}" where {nr} is the page number, so we can change this nr value and yield all the requests at once.
def parse(self, response):
    ...
    nr_pages = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-1en2dru"]//text()').getall()
    last_page_nr = int(nr_pages[-1])
    for nr in range(2, last_page_nr + 1):
        req_url = f'/buy/?page={nr}'
        yield Request(url=response.urljoin(req_url), callback=self.parse_ad)
nr_pages returns the text (the numbers) of all the page buttons.
last_page_nr selects the last number (which is the last available page).
We loop over the range from 2 to the value of last_page_nr (50 in this case) and in each iteration request the page corresponding to that number.
This way you can make all the requests in your parse method and parse the responses in parse_ad, without recursive calling.
Finally, I suggest you take a look at the Scrapy tutorial; it covers several common scraping scenarios.

Is it good to access spider attributes in scrapy pipeline?

The Scrapy pipeline docs say that one of the parameters of the process_item function is the spider:
process_item(self, item, spider)
Parameters:
item (Item object or a dict) – the item scraped
spider (Spider object) – the spider which scraped the item
I want to send a list of one type of item to the pipeline, but after much digging through the internet, everyone is either yielding or returning items to the pipeline one at a time.
SamplerSpider.py
class SamplerSpider(scrapy.Spider):
    name = 'SamplerSpider'
    allowed_domains = ['xxx.com']
    start_urls = (CONSTANTS.URL)
    result = []
pipeline.py
class SamplerSpiderPipeline(object):

    def __init__(self):
        pass  # do something here

    def process_item(self, item, spider):
        pass  # do something with spider.result
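For concreteness, a minimal sketch of the access pattern being described, i.e. process_item reading the result list kept on the spider (the result attribute comes from the spider above; the dict() copy is only an illustration):
class SamplerSpiderPipeline(object):

    def process_item(self, item, spider):
        # The spider instance is passed in, so its attributes are reachable here.
        spider.result.append(dict(item))  # e.g. accumulate items on the spider
        return item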
Is this a good way to do it? If not, why not?
Scraping information from a document will always result in more than one item. Why is the Scrapy pipeline designed to process items one at a time?
Thanks in advance.

How to use Python Scrapy for multiple URLs

My question is similar to this post:
How to use scrapy for Amazon.com links after "Next" Button?
I want my crawler to traverse all the "Next" links. I've searched a lot, but most people either focus on how to parse the URL or simply put all URLs in the initial URL list.
So far, I am able to visit the first page and parse the next page's link. But I don't know how to visit that page using the same crawler (spider). I tried to append the new URL to my URL list; it does get appended (I checked the length), but later the spider doesn't visit the link. I have no idea why...
Note that in my case, I only know the first page's URL. The second page's URL can only be obtained after visiting the first page. Likewise, the (i+1)-th page's URL is hidden in the i-th page.
In the parse function, I can parse and print the correct next-page link URL. I just don't know how to visit it.
Please help me. Thank you!
import scrapy
from bs4 import BeautifulSoup


class RedditSpider(scrapy.Spider):
    name = "test2"
    allowed_domains = ["http://www.reddit.com"]
    urls = ["https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch"]

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })
    def parse(self, response):
        page = response.url[-10:]
        print(page)
        filename = 'reddit-%s.html' % page

        # parse html for next link
        soup = BeautifulSoup(response.body, 'html.parser')
        mydivs = soup.findAll("a", {"rel": "nofollow next"})
        link = mydivs[0]['href']
        print(link)
        self.urls.append(link)

        with open(filename, 'wb') as f:
            f.write(response.body)
Update
Thanks to Kaushik's answer, I figured out how to make it work, though I still don't know why my original idea of appending new URLs doesn't work...
The updated code is as follows:
import scrapy
from bs4 import BeautifulSoup


class RedditSpider(scrapy.Spider):
    name = "test2"
    urls = ["https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch"]

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        page = response.url[-10:]
        print(page)
        filename = 'reddit-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

        # parse html for next link
        soup = BeautifulSoup(response.body, 'html.parser')
        mydivs = soup.findAll("a", {"rel": "nofollow next"})
        if len(mydivs) != 0:
            link = mydivs[0]['href']
            print(link)
            # yield response.follow(link, callback=self.parse)
            yield scrapy.Request(link, callback=self.parse)
What you require is explained very well in the Scrapy docs. I don't think you need any explanation other than that; I suggest going through it once for a better understanding.
A brief explanation first, though:
To follow a link to the next page, Scrapy provides several methods. The most basic is to use the http.Request object.
Request object:
class scrapy.http.Request(url[, callback,
method='GET', headers, body, cookies, meta, encoding='utf-8',
priority=0, dont_filter=False, errback, flags])
>>> yield scrapy.Request(url, callback=self.next_parse)
url (string) – the URL of this request
callback (callable) – the function that will be called with the response of this request (once it's downloaded) as its first parameter.
For convenience, though, Scrapy has a built-in shortcut for creating Request objects: response.follow, where the URL can be an absolute or a relative path.
follow(url, callback=None, method='GET', headers=None, body=None,
cookies=None, meta=None, encoding=None, priority=0, dont_filter=False,
errback=None)
>>> yield response.follow(url, callback=self.next_parse)
In case you have to move to the next link by passing values to a form or any other type of input field, you can use FormRequest objects. The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.
FormRequest object:
from_response(response[, formname=None, formid=None, formnumber=0,
              formdata=None, formxpath=None, formcss=None,
              clickdata=None, dont_click=False, ...])
If you want to simulate an HTML form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Note: If a Request doesn't specify a callback, the spider's parse() method will be used. If exceptions are raised during processing, errback is called instead.

Scrapy spider not following pagination

I am using code from this link (https://github.com/eloyz/reddit/blob/master/reddit/spiders/pic.py), but somehow I am unable to visit the paginated pages.
I am using Scrapy 1.3.0.
You don't have any mechanism for processing the next page; all you do is gather images.
Here is what you should be doing. I wrote some selectors but didn't test them.
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy import Request
import urlparse


class xxx_spider(Spider):
    name = "xxx"
    allowed_domains = ["xxx.com"]

    def start_requests(self):
        url = 'first page url'
        yield Request(url=url, callback=self.parse, meta={"page": 1})

    def parse(self, response):
        page = response.meta["page"] + 1
        html = Selector(response)

        pics = html.css('div.thing')
        for selector in pics:
            item = PicItem()  # assumes PicItem is defined in your items module
            item['image_urls'] = selector.xpath('a/@href').extract()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            yield item

        next_link = html.css("span.next-button a::attr(href)").extract_first()
        if next_link:
            next_url = urlparse.urljoin(response.url, next_link)
            yield Request(url=next_url, callback=self.parse, meta={"page": page})
It's similar to what you did, but after I get the images, I check the next-page link; if it exists, I yield another request with it.
Mehmet

How to download image using Scrapy?

I am a newbie to Scrapy. I am trying to download an image from here. I was following the official docs and this article.
My settings.py looks like:
BOT_NAME = 'shopclues'
SPIDER_MODULES = ['shopclues.spiders']
NEWSPIDER_MODULE = 'shopclues.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 1
}
IMAGES_STORE = "home/pr.singh/Projects"
and items.py looks like:
import scrapy
from scrapy.item import Item


class ShopcluesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class ImgData(Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
I think both of these files are good, but I am unable to write a correct spider for getting the image. I am able to grab the image URL but don't know how to store the image using the ImagesPipeline.
My spider looks like:
from shopclues.items import ImgData
import scrapy
import datetime


class DownloadFirstImg(scrapy.Spider):
    name = "DownloadfirstImg"
    start_urls = [
        'http://www.shopclues.com/canon-powershot-sx410-is-2.html',
    ]

    def parse(self, response):
        url = response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870")
        yield scrapy.Request(url.xpath('@href').extract(), self.parse_page)

    def parse_page(self, response):
        imgURl = response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870::attr(href)").extract()
        yield {
            ImgData(image_urls=[imgURl])
        }
I have written the spider following this article, but I am not getting anything. I run my spider as scrapy crawl DownloadfirstImg -o img5.json,
but I get neither any JSON nor any image. Any help on how to grab an image when its URL is known? I have never worked with Python either, so things seem much more complicated to me. Links to any good tutorial may help. TIA
I don't understand why you yield a request for the image; you just need to save it on the item and the images pipeline will do the rest. This is all you need:
def parse(self, response):
    url = response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870")
    yield ImgData(image_urls=[url.xpath('@href').extract_first()])