I am using Scrapy to download images from the website https://pixabay.com/.
My working code is below:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from website.imageItems import imageItems

class imageSpider(Spider):
    name = "imageCrawler"
    start_urls = ['https://pixabay.com/en/goose-bird-isolated-feather-1988657/']

    def parse(self, response):
        img = imageItems()
        image_urls = response.xpath('//div[@id="media_container"]/img/@src').extract_first()
        yield imageItems(image_urls=[image_urls])
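For reference, this relies on Scrapy's ImagesPipeline and an item exposing an image_urls field; the project's imageItems module is not shown in the question, but a minimal definition matching the import above would look roughly like this (a sketch, the real project file may define more fields):

# website/imageItems.py -- minimal sketch of the item class imported above
import scrapy

class imageItems(scrapy.Item):
    image_urls = scrapy.Field()  # URLs consumed by ImagesPipeline
    images = scrapy.Field()      # download results written back by ImagesPipeline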
Using this code I am able to download the image https://cdn.pixabay.com/photo/2017/01/18/01/07/goose-1988657_960_720.png perfectly. But if I modify the code to download a bigger size of the same image, it stops working:
def parse(self, response):
    img = imageItems()
    image_urls = 'https://pixabay.com/en/photos/download/' + response.xpath('//tr[@class="no_default"]/td/input/@value').extract_first()
    yield imageItems(image_urls=[image_urls])
In this version the image URL is
https://pixabay.com/en/photos/download/goose-bird-isolated-feather-1988657.png
but the server converts that URL into a hashed one:
https://pixabay.com/get/e83cb9072ef1063ecd1f4107ee4d4697e16ae3d111b4134392f3c27e/goose-1988657.png
Because of the hashed URL my Scrapy code fails. Error:
2017-01-24 08:25:22 [scrapy] DEBUG: Crawled (200) <GET https://pixabay.com/en/photos/download/goose-1988657.png> (referer: None)
2017-01-24 08:25:22 [scrapy] DEBUG: File (downloaded): Downloaded file from <GET https://pixabay.com/en/photos/download/goose-1988657.png> referred in <None>
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing BmpImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing BufrStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing CurImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing DcxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing DdsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing EpsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FitsStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FliImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FpxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FtexImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GbrImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GifImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GribStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing Hdf5StubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IcnsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IcoImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing ImImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing ImtImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IptcImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing JpegImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing Jpeg2KImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing McIdasImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MicImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MpegImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MpoImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MspImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PalmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PcdImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PcxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PdfImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PixarImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PngImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PpmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PsdImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SgiImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SpiderImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SunImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing TgaImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing TiffImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing WebPImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing WmfImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XbmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XpmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XVThumbImagePlugin
2017-01-24 08:25:22 [scrapy] ERROR: File (unknown-error): Error processing file from <GET https://pixabay.com/en/photos/download/goose-1988657.png> referred in <None>
Traceback (most recent call last):
File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 1185, in _inlineCallbacks
result = g.send(result)
File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 1162, in returnValue
raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://pixabay.com/en/photos/download/goose-1988657.png>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\files.py", line 339, in media_downloaded
checksum = self.file_downloaded(response, request, info)
File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 64, in file_downloaded
return self.image_downloaded(response, request, info)
File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 68, in image_downloaded
for path, image, buf in self.get_images(response, request, info):
File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 81, in get_images
orig_image = Image.open(BytesIO(response.body))
File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\PIL\Image.py", line 2349, in open
% (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x000001D25008A888>
2017-01-24 08:25:22 [scrapy] WARNING: Dropped: Item contains no images
{'image_urls': ['https://pixabay.com/en/photos/download/goose-1988657.png']}
2017-01-24 08:25:22 [scrapy] INFO: Closing spider (finished)
2017-01-24 08:25:22 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 574,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 9486,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'file_count': 1,
'file_status_count/downloaded': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 1, 24, 2, 55, 22, 780851),
'item_dropped_count': 1,
'item_dropped_reasons_count/DropItem': 1,
'log_count/DEBUG': 48,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'log_count/WARNING': 2,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 1, 24, 2, 55, 20, 983794)}
2017-01-24 08:25:22 [scrapy] INFO: Spider closed (finished)
This is not a very specific problem: whenever the server generates a dynamic URL for an image, Scrapy fails this way.
Has anyone faced the same type of problem?
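One possible workaround, assuming the /photos/download/ URL answers with an HTTP redirect to the hashed /get/... URL: Scrapy 1.4 and later let the media pipelines follow such redirects via the MEDIA_ALLOW_REDIRECTS setting (this may require upgrading Scrapy). A minimal settings.py sketch:

# settings.py -- sketch only; MEDIA_ALLOW_REDIRECTS exists from Scrapy 1.4 on
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'images'
MEDIA_ALLOW_REDIRECTS = True  # let ImagesPipeline follow the redirect to the real image file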
Related
I'm trying to get the latitude and longitude of different cities. The names of the cities are stored in a JSON file. Here is my code:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)

class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            return scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {response.css('span.coordinatetxt::text').get()}
The objective is to loop through the JSON file and, for each city, send a request to the form on https://www.latlong.net/. However, nothing comes back from these requests. Is this a bad way to write the loop? Should I process the JSON file inside the class?
Log:
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2019-04-01 16:27:17 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17763-SP0
2019-04-01 16:27:17 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-01 16:27:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-01 16:27:17 [scrapy.core.engine] INFO: Spider opened
2019-04-01 16:27:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-01 16:27:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/robots.txt> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.latlong.net/> (referer: None)
2019-04-01 16:27:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.latlong.net/> (referer: https://www.latlong.net/)
2019-04-01 16:27:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.latlong.net/>
{'latlong': '0,0'}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-01 16:27:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 874,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 29252,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 1, 14, 27, 18, 923987),
'item_scraped_count': 1,
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 1, 14, 27, 17, 773592)}
2019-04-01 16:27:18 [scrapy.core.engine] INFO: Spider closed (finished)
Your parse method should be a generator, so you need to use yield instead of return in the for loop; otherwise you finish the loop on the first iteration. Furthermore, the get_geo method is returning a set, but it must return a Request, BaseItem, dict or None.
I suggest changing the code as follows:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)

class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            yield scrapy.FormRequest.from_response(response, formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {'coord': response.css('span.coordinatetxt::text').get()}
https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/
I was curious to know whether Google authentication can be achieved via Scrapy. I tried the following code snippet to do so.
My code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http import FormRequest, Request
import logging
import json

class LoginSpider(scrapy.Spider):
    name = 'hello'
    start_urls = ['https://accounts.google.com']
    handle_httpstatus_list = [404, 405]

    def parse(self, response):
        print('inside parse')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'abc@gmail.com'},
                                          callback=self.log_password)]

    def log_password(self, response):
        print('inside log_password')
        print(response.url)
        return [FormRequest.from_response(response,
                                          formdata={'Passwd': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        print('inside after_login')
        print(response.url)
        if "authentication failed" in response.text:  # response.body is bytes; compare against the decoded text
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        else:
            print("Login Successful!!")

if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'DOWNLOAD_DELAY': 0.2,
        'HANDLE_HTTPSTATUS_ALL': True
    })
    process.crawl(LoginSpider)
    process.start()
But I'm getting the following output when I run the script.
Output
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-08-15 10:38:05 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.4 (default, Mar 22 2018, 14:05:57) - [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Darwin-15.6.0-x86_64-i386-64bit
2018-08-15 10:38:05 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 0.2, 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-15 10:38:05 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-08-15 10:38:05 [scrapy.core.engine] INFO: Spider opened
2018-08-15 10:38:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-15 10:38:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-15 10:38:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ManageAccount> from <GET https://accounts.google.com>
2018-08-15 10:38:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> from <GET https://accounts.google.com/ManageAccount>
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount> (referer: None)
inside parse
https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount
2018-08-15 10:38:09 [scrapy.core.engine] DEBUG: Crawled (405) <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
inside log_password
https://accounts.google.com/
2018-08-15 10:38:10 [scrapy.core.scraper] ERROR: Spider error processing <POST https://accounts.google.com/> (referer: https://accounts.google.com/ServiceLogin?passive=1209600&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&followup=https%3A%2F%2Faccounts.google.com%2FManageAccount)
Traceback (most recent call last):
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "google_login.py", line 79, in log_password
    callback=self.after_login)]
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 48, in from_response
    form = _get_form(response, formname, formid, formnumber, formxpath)
  File "/Users/PathakUmesh/.local/share/virtualenvs/python3env-piKhfpsf/lib/python3.6/site-packages/scrapy/http/request/form.py", line 77, in _get_form
    raise ValueError("No <form> element found in %s" % response)
ValueError: No <form> element found in <405 https://accounts.google.com/>
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-15 10:38:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 1810, 'downloader/request_count': 4, 'downloader/request_method_count/GET': 3, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 357598, 'downloader/response_count': 4, 'downloader/response_status_count/200': 1, 'downloader/response_status_count/302': 2, 'downloader/response_status_count/405': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2018, 8, 15, 4, 53, 10, 164911), 'log_count/DEBUG': 5, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'memusage/max': 41132032, 'memusage/startup': 41132032, 'request_depth_max': 1, 'response_received_count': 2, 'scheduler/dequeued': 4, 'scheduler/dequeued/memory': 4, 'scheduler/enqueued': 4, 'scheduler/enqueued/memory': 4, 'spider_exceptions/ValueError': 1, 'start_time': datetime.datetime(2018, 8, 15, 4, 53, 5, 463699)}
2018-08-15 10:38:10 [scrapy.core.engine] INFO: Spider closed (finished)
Any help would be appreciated.
Error 405 means that the URL doesn't accept the HTTP method, in your case the POST generated in parse.
Google's default login page is much more complex than a simple POST form; it relies heavily on JS and Ajax under the hood. To log in using Scrapy you have to use https://accounts.google.com/ServiceLogin?nojavascript=1 as the start URL instead.
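A sketch of what that change could look like in the spider above; only the start URL moves to the JavaScript-free login page, the callbacks stay as in the question (untested, and it assumes the Email/Passwd field names are unchanged):

class LoginSpider(scrapy.Spider):
    name = 'hello'
    # JavaScript-free login entry point, as suggested above; the parse(),
    # log_password() and after_login() callbacks can stay as in the question.
    start_urls = ['https://accounts.google.com/ServiceLogin?nojavascript=1']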
I'm following the tutorial for Scrapy.
I used this code from the tutorial:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1',
            'http://quotes.toscrape.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/)[-2]")
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
When I then run the command scrapy crawl quotes I get the following output:
2017-05-14 02:19:55 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial)
2017-05-14 02:19:55 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWS
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-14 02:19:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-14 02:19:55 [scrapy.core.engine] INFO: Spider opened
2017-05-14 02:19:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped
2017-05-14 02:19:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/ro
2017-05-14 02:19:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET htt
2017-05-14 02:19:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET htt
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/pa
2017-05-14 02:19:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/pa
2017-05-14 02:19:56 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.tosc
Traceback (most recent call last):
File "c:\users\mehmet\anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _ru
current.result = callback(current.result, *args, **kw)
File "c:\users\mehmet\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 76, in par
raise NotImplementedError
NotImplementedError
2017-05-14 02:19:56 [scrapy.core.scraper] ERROR: Spider error processing <GET http://quotes.tosc
Traceback (most recent call last):
File "c:\users\mehmet\anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _ru
current.result = callback(current.result, *args, **kw)
File "c:\users\mehmet\anaconda3\lib\site-packages\scrapy\spiders\__init__.py", line 76, in par
raise NotImplementedError
NotImplementedError
2017-05-14 02:19:56 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-14 02:19:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1121,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 6956,
'downloader/response_count': 5,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 14, 0, 19, 56, 125822),
'log_count/DEBUG': 6,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/NotImplementedError': 2,
'start_time': datetime.datetime(2017, 5, 14, 0, 19, 55, 659206)}
2017-05-14 02:19:56 [scrapy.core.engine] INFO: Spider closed (finished)
What is going wrong?
This error means you did not implement the parse function, but according to your post you did, so it is probably an indentation error. Your code should look like this:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1',
            'http://quotes.toscrape.com/page/2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/)[-2]")
        filename = 'filename'
        with open(filename, 'wb') as f:  # response.body is bytes, so write in binary mode
            f.write(response.body)
        self.log('Saved file %s' % filename)
I tested it and it works.
Shouldn't the line
page = response.url.split("/)[-2]")
be
page = response.url.split("/)[-1]")
since as written it looks like you are selecting the word "page" when you want a number?
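For clarity, here is a quick standalone demo of what the slicing returns once the separator is quoted on its own as "/" (a hypothetical example, not taken from the posts above):

# demo of response.url.split("/") on the two URL shapes involved
url = 'http://quotes.toscrape.com/page/1/'   # with a trailing slash
print(url.split("/"))        # ['http:', '', 'quotes.toscrape.com', 'page', '1', '']
print(url.split("/")[-2])    # '1'   -> the page number

url = 'http://quotes.toscrape.com/page/1'    # without a trailing slash, as in start_requests
print(url.split("/")[-1])    # '1'   -> the page number
print(url.split("/")[-2])    # 'page'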
I followed the basic Scrapy login procedure. It has always worked, but in this case I ran into problems. FormRequest.from_response didn't request https://www.crowdfunder.com/user/validateLogin; instead it always sent the payload to https://www.crowdfunder.com/user/signup. I tried requesting validateLogin directly with the payload, but it responded with a 404 error. Any idea how to solve this problem? Thanks in advance!
import scrapy
from scrapy.spiders.init import InitSpider

class CrowdfunderSpider(InitSpider):
    name = "crowdfunder"
    allowed_domains = ["crowdfunder.com"]
    start_urls = [
        'http://www.crowdfunder.com/',
    ]
    login_page = 'https://www.crowdfunder.com/user/login/'
    payload = {}

    def init_request(self):
        """This function is called before crawling starts."""
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        self.payload = {'email': 'my_email',
                        'password': 'my_password'}
        # scrapy login
        return scrapy.FormRequest.from_response(response, formdata=self.payload, callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if 'https://www.crowdfunder.com/user/settings' == response.url:
            self.log("Successfully logged in. :) :) :)")
            # start the crawling
            return self.initialized()
        else:
            # login fail
            self.log("login failed :( :( :(")
Here is the payload and request URL sent by clicking login in the browser:
(screenshot: payload and request URL sent by clicking the login button)
Here is the log info:
2016-10-21 21:56:21 [scrapy] INFO: Scrapy 1.1.0 started (bot: crowdfunder_crawl)
2016-10-21 21:56:21 [scrapy] INFO: Overridden settings: {'AJAXCRAWL_ENABLED': True, 'NEWSPIDER_MODULE': 'crowdfunder_crawl.spiders', 'SPIDER_MODULES': ['crowdfunder_crawl.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'crowdfunder_crawl'}
2016-10-21 21:56:21 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-21 21:56:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-21 21:56:21 [scrapy] INFO: Spider opened
2016-10-21 21:56:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-21 21:56:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-21 21:56:21 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/robots.txt> (referer: None)
2016-10-21 21:56:21 [scrapy] DEBUG: Redirecting (301) to <GET http://www.crowdfunder.com/user/login> from <GET https://www.crowdfunder.com/user/login/>
2016-10-21 21:56:22 [scrapy] DEBUG: Redirecting (301) to <GET https://www.crowdfunder.com/user/login> from <GET http://www.crowdfunder.com/user/login>
2016-10-21 21:56:22 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/user/login> (referer: None)
2016-10-21 21:56:23 [scrapy] DEBUG: Crawled (200) <POST https://www.crowdfunder.com/user/signup> (referer: https://www.crowdfunder.com/user/login)
2016-10-21 21:56:23 [crowdfunder] DEBUG: login failed :( :( :(
2016-10-21 21:56:23 [scrapy] INFO: Closing spider (finished)
2016-10-21 21:56:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1569,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 16313,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 22, 4, 56, 23, 232493),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 10, 22, 4, 56, 21, 180030)}
2016-10-21 21:56:23 [scrapy] INFO: Spider closed (finished)
FormRequest.from_response(response) by default uses the first form it finds. If you check what forms the page has, you'd see:
In : response.xpath("//form")
Out:
[<Selector xpath='//form' data='<form action="/user/signup" method="post'>,
<Selector xpath='//form' data='<form action="/user/login" method="POST"'>,
<Selector xpath='//form' data='<form action="/user/login" method="post"'>]
So the form you are looking for is not the first one. The way to fix it is to use one of from_response's many parameters to specify which form to use.
Using formxpath is the most flexible and my personal favorite:
In : FormRequest.from_response(response, formxpath='//*[contains(@action,"login")]')
Out: <POST https://www.crowdfunder.com/user/login>
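Applied to the spider above, the fix would roughly look like this in the login callback (a sketch of the suggestion, not tested against the live site):

    def login(self, response):
        """Generate a login request using the login form rather than the signup form."""
        self.payload = {'email': 'my_email', 'password': 'my_password'}
        return scrapy.FormRequest.from_response(
            response,
            formxpath='//*[contains(@action, "login")]',  # select the /user/login form
            formdata=self.payload,
            callback=self.check_login_response,
        )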
I have coded a Scrapy spider using the loginform library (http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/) and taking this post as reference for dynamic webpages. This is the code:
import time

import scrapy
from scrapy.http import FormRequest
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from loginform import fill_login_form
from selenium import webdriver

from TwitterViz.items import TwitterVizItem  # assumed project import; the items file is not shown in the post

class MySpider(CrawlSpider):
    login_user = 'myusername'
    login_pass = 'mypassword'
    name = "tv"
    allowed_domains = []
    start_urls = ["https://twitter.com/Acrocephalus/followers"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'https://twitter\.com/.*')), callback='parse_items', follow=True),
    )

    def parse(self, response):
        args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args, callback=self.after_login)

    def after_login(self, response):
        # you are logged in here
        pass

    def __init__(self):
        CrawlSpider.__init__(self)
        # use any browser you wish
        self.browser = webdriver.Firefox()

    def __del__(self):
        self.browser.close()

    def parse_item(self, response):
        item = TwitterVizItem()
        self.browser.get(response.url)
        # let JavaScript load
        time.sleep(3)
        # scrape dynamically generated HTML
        hxs = Selector(text=self.browser.page_source)
        item['Follower'] = hxs.select('(//span[@class="u-linkComplex-target"])[position()>2]').extract()
        item['Follows'] = hxs.select('(//span[@class="u-linkComplex-target"])[position()=1]').extract()
        return item
When I run it, I get this output:
2015-07-22 18:46:38 [scrapy] INFO: Scrapy 1.0.1.post1+g5b8c9e5 started (bot: TwitterViz)
2015-07-22 18:46:38 [scrapy] INFO: Optional features available: ssl, http11
2015-07-22 18:46:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TwitterViz.spiders', 'SPIDER_MODULES': ['TwitterViz.spiders'], 'BOT_NAME': 'TwitterViz'}
2015-07-22 18:46:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-22 18:46:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-22 18:46:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-22 18:46:38 [scrapy] INFO: Enabled item pipelines:
2015-07-22 18:46:38 [scrapy] INFO: Spider opened
2015-07-22 18:46:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-22 18:46:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> from <GET https://twitter.com/Acrocephalus/followers>
2015-07-22 18:46:39 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> (referer: None)
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/Acrocephalus/followers> from <POST https://twitter.com/sessions>
2015-07-22 18:46:40 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/Acrocephalus/followers> (referer: https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers)
2015-07-22 18:46:40 [scrapy] INFO: Closing spider (finished)
2015-07-22 18:46:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2506,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 63188,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 22, 16, 46, 40, 538785),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 7, 22, 16, 46, 38, 926487)}
2015-07-22 18:46:40 [scrapy] INFO: Spider closed (finished)
From the output I understand that it logs in properly. However, it doesn't scrape anything. I've tested the XPaths with Firefox's XPath Checker and they seem to work properly. Can anyone have a look and see what is wrong?