How to download image using Scrapy? - scrapy

I am newbie to scrapy. I am trying to download an image from here. I was following Official-Doc and this article.
My settings.py looks like:
BOT_NAME = 'shopclues'
SPIDER_MODULES = ['shopclues.spiders']
NEWSPIDER_MODULE = 'shopclues.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
'scrapy.contrib.pipeline.images.ImagesPipeline':1
}
IMAGES_STORE="home/pr.singh/Projects"
and items.py looks like:
import scrapy
from scrapy.item import Item
class ShopcluesItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
class ImgData(Item):
image_urls=scrapy.Field()
images=scrapy.Field()
I think both these files are good. But I am unable to write correct spider for getting the image. I am able to grab the image URL but don't know how to store image using imagePipeline.
My spider looks like:
from shopclues.items import ImgData
import scrapy
import datetime
class DownloadFirstImg(scrapy.Spider):
name="DownloadfirstImg"
start_urls=[
'http://www.shopclues.com/canon-powershot-sx410-is-2.html',
]
def parse (self, response):
url= response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870")
yield scrapy.Request(url.xpath('#href').extract(),self.parse_page)
def parse_page(self,response):
imgURl=response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870::attr(href)").extract()
yield {
ImgData(image_urls=[imgURl])
}
I have written the spider following this-article. But I am not getting anything. I run my spider as scrapy crawl DownloadfirstImg -o img5.json
but I am not getting any json nor any image? Any help on How to grab image if it's url is known. I have never worked with python also so things seem much complicated to me. Links to any good tutorial may help. TIA

I don't understand why you yield a request for an image you just need to save it on the item and the images pipeline will do the rest, this is all you need.
def parse (self, response):
url= response.css("body div.site-container div#container div.ml_containermain div.content-helper div.aside-site-content div.product form#product_form_83013851 div.product-gallery div#product_images_83013851_update div.slide a#det_img_link_83013851_25781870")
yield ImgData(image_urls=[url.xpath('#href').extract_first()])

Related

How to link the items.py and my spider file?

I'm new to scrapy and trying to scrape a page which has several links. Which I want to follow and scrape the content from that page as well, and from that page there is another link that I want to scrape.
I tried this path on shell and it worked but, I don't know what I am missing here. I want to be able to crawl through two pages by following the links.
I tried reading through tutorials but I don't really understand what I am missing here.
This is my items.py file.
import scrapy
# item class included here
class ScriptsItem(scrapy.Item):
# define the fields for your item here like:
link = scrapy.Field()
attr = scrapy.Field()
And here is my scripts.py file.
import scrapy
import ScriptsItem
class ScriptsSpider(scrapy.Spider):
name = 'scripts'
allowed_domains = ['https://www.imsdb.com/TV/Futurama.html']
start_urls = ['http://https://www.imsdb.com/TV/Futurama.html/']
BASE_URL = 'https://www.imsdb.com/TV/Futurama.html'
def parse(self, response):
links = response.xpath('//table//td//p//a//#href').extract()
for link in links:
absolute_url = self.BASE_URL + link
yield scrapy.Request(absolute_url, callback=self.parse_attr)
def parse_attr(self, response):
item = ScriptsItem()
item["link"] = response.url
item["attr"] = "".join(response.xpath("//table[#class = 'script-details']//tr[2]//td[2]//a//text()").extract())
return item
Replace
import ScriptsItem
to
from your_project_name.items import ScriptsItem
your_project_name - Name of your project

Extract text from table with Scrapy

I'm trying to extract Job title insides a table from this page: http://www.chalmers.se/en/about-chalmers/Working-at-Chalmers/Vacancies/Pages/default.aspx
This is the code, but it always returns empty. Any idea how to fix this?
import os
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
class mySpider(CrawlSpider):
name = "myspider"
allowed_domains = ["www.chalmers.se"]
start_urls = [
"http://www.chalmers.se/en/about-chalmers/Working-at-Chalmers/Vacancies/Pages/default.aspx",
]
def parse(self, response):
sel = response.selector
# try to extract text from a tag inside <td>
for tr in sel.css("table#jobsTable>tbody>tr"):
my_title = tr.xpath('td[#class="jobitem"]/a/text()').extract()
print '================', my_title
I also try to give absolute html path, like bellow but still got empty title:
my_title = response.xpath('/html/body/div/div[1]/div/div[11]/div/table/tbody/tr[1]/td[2]/a/text()').extract()
Your website gets above Jobs table from another source (loading it using AJAX call).
So you just need to start from another url:
start_urls = ['https://web103.reachmee.com/ext/I003/304/main?site=5&validator=a72aeedd63ec10de71e46f8d91d0d57c&lang=UK&ref=&ihelper=http://www.chalmers.se/en/about-chalmers/Working-at-Chalmers/Vacancies/Pages/default.aspx']

Automate page scroll to down in Splash and Scrapy

I am crawling a site which uses lazy loading for product images.
For this reason i included scrapy-splash so that the javascript can be rendered also with splash i can provide a wait argument. Previously i had a though that it is because of the timing that the raw scrapy.Request is returning a placeholder image instead of the originals.
I've tried wait argument to 29.0 secs also, but still my crawler hardly getting 10 items (it should bring 280 items based on calculations). I have a item pipleline which checks if the image is empty in the item so i raise DropItem.
I am not sure, but i also noticed that its not only the wait problem. It looks like that images gets loaded when i scroll down.
What i am looking for is a way to automate a scroll to bottom behaviour within my requests.
Here is my code
Spider
def parse(self, response):
categories = response.css('div.navigation-top-links a.uppercase::attr(href)').extract()
for category in categories:
link = urlparse.urljoin(self.start_urls[0], category)
yield SplashRequest(link, callback=self.parse_products_listing, endpoint='render.html',
args={'wait': 0.5})
Pipeline
class ScraperPipeline(object):
def process_item(self, item, spider):
if not item['images']:
raise DropItem
return item
Settings
IMAGES_STORE = '/scraper/images'
SPLASH_URL = 'http://172.22.0.2:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
ITEM_PIPELINES = {
'scraper.pipelines.ScraperPipeline': 300,
'scrapy.pipelines.images.ImagesPipeline': 1
}
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# 'custom_middlewares.middleware.ProxyMiddleware': 210,
}
If you are set on using splash this answer should give you some guidance: https://stackoverflow.com/a/40366442/7926936
You could also use selenium in a DownloaderMiddleware, this is a example I have for a Twitter scraper that will get the first 200 tweets of a page:
from selenium import webdriver
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
class SeleniumMiddleware(object):
def __init__(self):
self.driver = webdriver.PhantomJS()
def process_request(self, request, spider):
self.driver.get(request.url)
tweets = self.driver.find_elements_by_xpath("//li[#data-item-type='tweet']")
while len(tweets) < 200:
try:
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
WebDriverWait(self.driver, 10).until(
lambda driver: new_posts(driver, len(tweets)))
tweets = self.driver.find_elements_by_xpath("//li[#data-item-type='tweet']")
except TimeoutException:
break
body = self.driver.page_source
return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
def new_posts(driver, min_len):
return len(driver.find_elements_by_xpath("//li[#data-item-type='tweet']")) > min_len
In the while loop I am waiting in each loop for new tweets until there are 200 tweets loaded in the page and have a 10 seconds max wait.

How to use Python scrapy for myltiple URL's

my question is similar to this post:
How to use scrapy for Amazon.com links after "Next" Button?
I want my crawler to traverse through all "Next" links. I've searched a lot, but most people ether focus on how to parse the ULR or simply put all URL's in the initial URL list.
So far, I am able to visit the first page and parse the next page's link. But I don't know how to visit that page using the same crawler(spider). I tried to append the new URL into my URL list, it does appended (I checked the length), but later it doesn't visit the link. I have no idea why...
Note that in my case, I only know the first page's URL. Second page's URL can only be obtained after visiting the first page. The same, (i+1)'th page's URL is hidden in the i'th page.
In the parse function, I can parse and print the correct next page link URL. I just don't know how to visit it.
Please help me. Thank you!
import scrapy
from bs4 import BeautifulSoup
class RedditSpider(scrapy.Spider):
name = "test2"
allowed_domains = ["http://www.reddit.com"]
urls = ["https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch"]
def start_requests(self):
for url in self.urls:
yield scrapy.Request(url, self.parse, meta={
'splash': {
'endpoint': 'render.html',
'args': { 'wait': 0.5 }
}
})
`
def parse(self, response):
page = response.url[-10:]
print(page)
filename = 'reddit-%s.html' % page
#parse html for next link
soup = BeautifulSoup(response.body, 'html.parser')
mydivs = soup.findAll("a", { "rel" : "nofollow next" })
link = mydivs[0]['href']
print(link)
self.urls.append(link)
with open(filename, 'wb') as f:
f.write(response.body)
Update
Thanks to Kaushik's answer, I figured out how to make it work. Though I still don't know why my original idea of appending new URL's doesn't work...
The updated code is as follow:
import scrapy
from bs4 import BeautifulSoup
class RedditSpider(scrapy.Spider):
name = "test2"
urls = ["https://www.reddit.com/r/LifeProTips/search?q=timestamp%3A1427232122..1437773560&sort=new&restrict_sr=on&syntax=cloudsearch"]
def start_requests(self):
for url in self.urls:
yield scrapy.Request(url, self.parse, meta={
'splash': {
'endpoint': 'render.html',
'args': { 'wait': 0.5 }
}
})
def parse(self, response):
page = response.url[-10:]
print(page)
filename = 'reddit-%s.html' % page
with open(filename, 'wb') as f:
f.write(response.body)
#parse html for next link
soup = BeautifulSoup(response.body, 'html.parser')
mydivs = soup.findAll("a", { "rel" : "nofollow next" })
if len(mydivs) != 0:
link = mydivs[0]['href']
print(link)
#yield response.follow(link, callback=self.parse)
yield scrapy.Request(link, callback=self.parse)
What you require is explained very well in the Scrapy docs . I don't think you would need any other explanation other than that. Suggest going through it once for better understanding.
A brief explanation first though:
To follow a link to the next page, Scrapy provides many methods. The most basic methods is using the http.Request method
Request object :
class scrapy.http.Request(url[, callback,
method='GET', headers, body, cookies, meta, encoding='utf-8',
priority=0, dont_filter=False, errback, flags])
>>> yield scrapy.Request(url, callback=self.next_parse)
url (string) – the URL of this request
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter.
For convenience though, Scrapy has inbuilt shortcut for creating Request objects by using response.follow where the url can be an absolute path or a relative path.
follow(url, callback=None, method='GET', headers=None, body=None,
cookies=None, meta=None, encoding=None, priority=0, dont_filter=False,
errback=None)
>>> yield response.follow(url, callback=self.next_parse)
In case if you have to move through to the next link by passing values to a form or any other type of input field, you can use the Form Request objects. The FormRequest class extends the base Request with functionality
for dealing with HTML forms. It uses lxml.html forms to pre-populate
form fields with form data from Response objects.
Form Request object
from_response(response[, formname=None,
formid=None, formnumber=0, formdata=None, formxpath=None,
formcss=None, clickdata=None, dont_click=False, ...])
If you want to simulate a HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
formdata={'name': 'John Doe', 'age': '27'},
callback=self.after_post)]
Note : If a Request doesn’t specify a callback, the spider’s parse() method will be used. If exceptions are raised during processing, errback is called instead.

Scrapy HtmlXPathSelector

Just trying out scrapy and trying to get a basic spider working. I know this is just probably something I'm missing but I've tried everything I can think of.
The error I get is:
line 11, in JustASpider
sites = hxs.select('//title/text()')
NameError: name 'hxs' is not defined
My code is very basic at the moment, but I still can't seem to find where I'm going wrong. Thanks for any help!
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class JustASpider(BaseSpider):
name = "google.com"
start_urls = ["http://www.google.com/search?hl=en&q=search"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//title/text()')
for site in sites:
print site.extract()
SPIDER = JustASpider()
The code looks quite old version. I recommend using these codes instead
from scrapy.spider import Spider
from scrapy.selector import Selector
class JustASpider(Spider):
name = "googlespider"
allowed_domains=["google.com"]
start_urls = ["http://www.google.com/search?hl=en&q=search"]
def parse(self, response):
sel = Selector(response)
sites = sel.xpath('//title/text()').extract()
print sites
#for site in sites: (I dont know why you want to loop for extracting the text in the title element)
#print site.extract()
hope it helps and here is a good example to follow.
I removed the SPIDER call at the end and removed the for loop. There was only one title tag (as one would expect) and it seems that was throwing off the loop. The code I have working is as follows:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class JustASpider(BaseSpider):
name = "google.com"
start_urls = ["http://www.google.com/search?hl=en&q=search"]
def parse(self, response):
hxs = HtmlXPathSelector(response)
titles = hxs.select('//title/text()')
final = titles.extract()
I had a similar problem, NameError: name 'hxs' is not defined, and the problem related to spaces and tabs: the IDE uses spaces instead of tabs, you should check it out.
Code looks correct.
In latest versions of Scrapy
HtmlXPathSelector is deprecated.
Use Selector:
hxs = Selector(response)
sites = hxs.xpath('//title/text()')
Be sure you are running the code you are showing us.
Try deleting *.pyc files in your project.
This works for me:
Save the file as test.py
Use the command scrapy runspider <filename.py>
For example:
scrapy runspider test.py
You should change
from scrapy.selector import HtmlXPathSelector
into
from scrapy.selector import Selector
And use hxs=Selector(response) instead.
I use Scrapy with BeautifulSoup4.0. For me, Soup is easy to read and understand. This is an option if you don't have to use HtmlXPathSelector. Hope this helps!
import scrapy
from bs4 import BeautifulSoup
import Item
def parse(self, response):
soup = BeautifulSoup(response.body,'html.parser')
print 'Current url: %s' % response.url
item = Item()
for link in soup.find_all('a'):
if link.get('href') is not None:
url = response.urljoin(link.get('href'))
item['url'] = url
yield scrapy.Request(url,callback=self.parse)
yield item
this is just a demo but it works. need to be customized offcourse.
#!/usr/bin/env python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class DmozSpider(BaseSpider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul/li')
for site in sites:
title = site.select('a/text()').extract()
link = site.select('a/#href').extract()
desc = site.select('text()').extract()
print title, link, desc