Scrapy/Selenium: Send more than 3 failed requests before script stops - selenium

I am currently trying to crawl a website (around 500 subpages).
The script is working quite smoothly. However, after 3 to 4 hours of running, I sometimes get the error message shown below. I don't think the script is causing the problem; rather, the website's server is quite slow.
My question is: is it somehow possible to allow more than 3 failed requests before the script automatically stops and closes the spider?
2019-09-27 10:53:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 1 pages/min), scraped 4480 items (at 10 items/min)
2019-09-27 10:54:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 1 times): 504 Gateway Time-out
2019-09-27 10:54:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:55:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 2 times): 504 Gateway Time-out
2019-09-27 10:55:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:56:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 3 times): 504 Gateway Time-out
2019-09-27 10:56:00 [scrapy.core.engine] DEBUG: Crawled (504) <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (referer: https://blogabet.com/tipsters) ['partial']
2019-09-27 10:56:00 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480>: HTTP status code is not handled or not allowed
2019-09-27 10:56:00 [scrapy.core.engine] INFO: Closing spider (finished)
UPDATED CODE ADDED
from time import sleep

from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']
    start_urls = ('https://blogabet.com',)

    def parse(self, response):
        # Raw string so the backslashes in the Windows path are kept literally
        self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver.get(url)
            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 3 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            while True:
                try:
                    # Keep clicking the "load more" element until it disappears
                    element = self.driver.find_element_by_id('last_item')
                    self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                    sleep(3)
                    self.driver.find_element_by_id('last_item').click()
                    sleep(7)
                except NoSuchElementException:
                    self.logger.info('No more tipps')
                    sel = Selector(text=self.driver.page_source)
                    allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
                    for post in allposts:
                        username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                        publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
                        yield {'Username': username,
                               'Publish date': publish_date}
                    self.driver.quit()
                    break

You can do this by simply changing the RETRY_TIMES setting to a higher number.
You can read about your retry-related options in the RetryMiddleware docs: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-RETRY_TIMES
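As a rough illustration of that answer, the fragment below shows what raising the retry limit might look like in a project's settings.py. The setting names are real Scrapy settings, but the values are assumptions picked for the example, not recommendations. RETRY_TIMES counts retries in addition to the first attempt, and its default is 2, which matches the "failed 3 times" line in the log above.

```python
# settings.py fragment: all values below are illustrative assumptions.
RETRY_ENABLED = True

# Maximum number of retries *in addition to* the first download attempt.
RETRY_TIMES = 10

# Only responses with these status codes are retried; 504 must be listed
# for the Gateway Time-out case shown in the log.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]

# Seconds the downloader waits before giving up on a single request.
DOWNLOAD_TIMEOUT = 180
```

The same keys can also be placed in a spider's custom_settings dictionary if the change should apply to one spider only.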

Related

How to get cookies from response of scrapy splash

I want to get the cookie value from the Splash response object, but it is not working as I expected.
Here is the spider code:
import scrapy
from scrapy_splash import SplashRequest


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']

    def start_requests(self):
        url = 'https://www.amazon.com/gp/goldbox?ref_=nav_topnav_deals'
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        print(response.headers)
Output log:
2019-08-17 11:53:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2019-08-17 11:53:08 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.99.100:8050/robots.txt> (referer: None)
2019-08-17 11:53:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/gp/goldbox?ref_=nav_topnav_deals via http://192.168.99.100:8050/render.html> (referer: None)
{b'Date': [b'Sat, 17 Aug 2019 06:23:09 GMT'], b'Server': [b'TwistedWeb/18.9.0'], b'Content-Type': [b'text/html; charset=utf-8']}
2019-08-17 11:53:24 [scrapy.core.engine] INFO: Closing spider (finished)
You can try the following approach:
- Write a small Lua script that returns the HTML and the cookies, stored as a class attribute of the spider:
lua_request = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        return {
            html = splash:html(),
            cookies = splash:get_cookies()
        }
    end
"""
- Change your request to the following:
yield SplashRequest(
    url,
    self.parse,
    endpoint='execute',
    args={'lua_source': self.lua_request}
)
- Then find the cookies in your parse method as follows:
def parse(self, response):
    cookies = response.data['cookies']
    headers = response.headers
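Once the cookies are available in the callback, a small helper can turn them into a plain name-to-value mapping. This is only a sketch: it assumes that splash:get_cookies() yields a list of HAR-style cookie entries with "name" and "value" keys, and both the helper name cookies_to_dict and the sample data are made up for illustration.

```python
def cookies_to_dict(har_cookies):
    """Collapse a list of HAR-style cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in har_cookies}


# Made-up sample in the assumed shape of response.data['cookies'].
sample = [
    {"name": "session-id", "value": "123-456", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": ".example.com", "path": "/"},
]

print(cookies_to_dict(sample))
```

Inside parse, the call would then be cookies_to_dict(response.data['cookies']).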

Scrapy spider for Tripadvisor Crawled 0 pages (at 0 pages/min)

I'm trying to scrape reviews for all hotels in Punta Cana. The code seems to run but when I call crawl, it doesn't actually crawl any of the sites. Here are my file structures, what I called, and what happened when I ran it.
folder structure:
├── scrapy.cfg
└── tripadvisor_reviews
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-37.pyc
    │   ├── items.cpython-37.pyc
    │   └── settings.cpython-37.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-37.pyc
        │   └── tripadvisorSpider.cpython-37.pyc
        └── tripadvisorSpider.py
tripadvisorSpider.py
import scrapy
from tripadvisor_reviews.items import TripadvisorReviewsItem


class tripadvisorSpider(scrapy.Spider):
    name = "tripadvisorspider"
    allowed_domains = ["www.tripadvisor.com"]

    def start_requests(self):
        urls = [
            'https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)
        next_page = response.xpath(
            '//div[@class="nav next taLnk ui_button primary"]/@href').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[@class="hotels-review-list-parts-ReviewTitle__reviewTitleText"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)
        next_page = response.xpath(
            '//div[@class="ui_button nav next primary "]/@href').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorReviewsItem()
        item['title'] = response.xpath(
            '//div[@class="hotels-review-list-parts-ReviewTitle__reviewTitleText"]/text()').extract()
        item['content'] = response.xpath(
            '//q[@class="hotels-review-list-parts-ExpandableReview__reviewText"]/text()').extract()
        # item['stars'] = response.xpath(
        #     '//span[@class="rate sprite-rating_s rating_s"]/img/@alt').extract()[0]
        print(item)
        yield item
items.py
import scrapy


class TripadvisorReviewsItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()
    # stars = scrapy.Field()
I ran it using the following command in terminal:
scrapy crawl tripadvisorspider -o items.json
This is my terminal output
2019-05-14 12:32:12 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: tripadvisor_reviews)
2019-05-14 12:32:12 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.1 (default, Dec 14 2018, 13:28:58) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.4.2, Platform Darwin-18.5.0-x86_64-i386-64bit
2019-05-14 12:32:12 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tripadvisor_reviews', 'FEED_FORMAT': 'csv', 'FEED_URI': 'items.csv', 'NEWSPIDER_MODULE': 'tripadvisor_reviews.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tripadvisor_reviews.spiders']}
2019-05-14 12:32:12 [scrapy.extensions.telnet] INFO: Telnet Password: aae78556d6b8c59b
2019-05-14 12:32:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2019-05-14 12:32:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-05-14 12:32:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-05-14 12:32:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-05-14 12:32:12 [scrapy.core.engine] INFO: Spider opened
2019-05-14 12:32:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-14 12:32:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2019-05-14 12:32:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/robots.txt> (referer: None)
2019-05-14 12:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html> (referer: None)
2019-05-14 12:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d15025732-Reviews-Impressive_Resort_Spa_Punta_Cana-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d313884-Reviews-Punta_Cana_Princess_All_Suites_Resort_Spa-Bavaro_Punta_Cana_La_Altagracia_Province_Dom.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d4451011-Reviews-The_Westin_Puntacana_Resort_Club-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d10175054-Reviews-Secrets_Cap_Cana_Resort_Spa-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d7307251-Reviews-The_Level_at_Melia_Caribe_Beach-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Re.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d292158-Reviews-Grand_Palladium_Punta_Cana_Resort_Spa-Punta_Cana_La_Altagracia_Province_Dominican_Repub.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d1604057-Reviews-Secrets_Royal_Beach_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d649099-Reviews-Zoetry_Agua_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d150842-Reviews-Iberostar_Dominicana_Hotel-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d15515013-Reviews-Grand_Memories_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d150841-Reviews-Iberostar_Selection_Bavaro-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d1233228-Reviews-Iberostar_Grand_Bavaro-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d149397-Reviews-Bavaro_Princess_Resort_Spa_Casino-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d584407-Reviews-Ocean_Blue_Sand-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d259337-Reviews-Grand_Palladium_Bavaro_Suites_Resort_Spa-Bavaro_Punta_Cana_La_Altagracia_Province_Domi.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d150854-Reviews-Hotel_Riu_Palace_Macao-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d11701188-Reviews-BlueBay_Grand_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d14838260-Reviews-Melia_Punta_Cana_Beach_Resort-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d1595124-Reviews-Luxury_Bahia_Principe_Esmeralda-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d508162-Reviews-Dreams_Punta_Cana_Resort_Spa-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d1076311-Reviews-Hard_Rock_Hotel_Casino_Punta_Cana-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d10595200-Reviews-Grand_Bahia_Principe_Aquamarine-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d611114-Reviews-Hotel_Riu_Palace_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3200043-d8709413-Reviews-Excellence_El_Carmen-Uvero_Alto_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d1199681-Reviews-Luxury_Bahia_Principe_Ambar-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d1889895-Reviews-Karibo_Punta_Cana-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d6454132-Reviews-Premium_Level_at_Barcelo_Bavaro_Palace-Bavaro_Punta_Cana_La_Altagracia_Province_Domin.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d15080584-Reviews-Impressive_Premium_Resorts_Spa-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Re.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g147293-d2687221-Reviews-NH_Punta_Cana-Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g3176298-d579774-Reviews-Iberostar_Punta_Cana-Bavaro_Punta_Cana_La_Altagracia_Province_Dominican_Republic.html> (referer: https://www.tripadvisor.com/Hotels-g147293-Punta_Cana_La_Altagracia_Province_Dominican_Republic-Hotels.html)
2019-05-14 12:32:21 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-14 12:32:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 46023,
'downloader/request_count': 32,
'downloader/request_method_count/GET': 32,
'downloader/response_bytes': 5599418,
'downloader/response_count': 32,
'downloader/response_status_count/200': 32,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 5, 14, 19, 32, 21, 637712),
'log_count/DEBUG': 33,
'log_count/INFO': 8,
'memusage/max': 51412992,
'memusage/startup': 51412992,
'request_depth_max': 1,
'response_received_count': 32,
'scheduler/dequeued': 31,
'scheduler/dequeued/memory': 31,
'scheduler/enqueued': 31,
'scheduler/enqueued/memory': 31,
'start_time': datetime.datetime(2019, 5, 14, 19, 32, 12, 996979)}
2019-05-14 12:32:21 [scrapy.core.engine] INFO: Spider closed (finished)
This selector is not working:
response.xpath('//div[@class="hotels-review-list-parts-ReviewTitle__reviewTitleText"]/a/@href')
The element on the site is <a>, not <div>, and the class name also seems wrong. Perhaps the site appends some random data to the class name, as you can see below:
<span><span>Ótima experiência! Resort amplo, com diversas opções de entretenimento!!!</span></span>
You can try to match only part of the string, for example:
//a[contains(@class, "hotels-review-list-parts-ReviewTitle__reviewTitleText")]
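To avoid retyping the contains() pattern for every selector, the expression can be built by a small helper. This is a minimal sketch: the helper name class_contains is hypothetical, and it only builds the XPath string, leaving the actual query to Scrapy.

```python
def class_contains(tag, fragment):
    """Build an XPath matching <tag> elements whose class attribute contains fragment."""
    return '//{}[contains(@class, "{}")]'.format(tag, fragment)


# Stable part of the class name, taken from the selector in the question.
REVIEW_TITLE = "hotels-review-list-parts-ReviewTitle__reviewTitleText"

# XPath for the href of every review-title link on a hotel page.
review_link_xpath = class_contains("a", REVIEW_TITLE) + "/@href"
print(review_link_xpath)
```

In parse_hotel, response.xpath(review_link_xpath) would then take the place of the exact-match selector.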
If you use Scrapy alone, there is no way to get content that is generated dynamically by JavaScript; you only get the page's original source code. I think you could use scrapy-splash.

How to yield several requests in order in Scrapy?

I need to send my requests in order with Scrapy.
def n1(self, response):
    # self.input = [elem1, elem2, elem3, elem4, elem5, ..., elem100000]
    for (elem,) in self.input:
        link = urljoin(path, elem)
        yield Request(link)
My problem is that the requests are not sent in order.
I read this question but it has no correct answer.
How should I change my code for sending the requests in order?
UPDATE 1
I used priority and changed my code to
def n1(self, response):
    # self.input = [elem1, elem2, elem3, elem4, elem5, ..., elem100000]
    self.prio = len(self.input)
    for (elem,) in self.input:
        self.prio -= 1
        link = urljoin(path, elem)
        yield Request(link, priority=self.prio)
And my setting for this spider is
custom_settings = {
    'DOWNLOAD_DELAY': 0,
    'COOKIES_ENABLED': True,
    'CONCURRENT_REQUESTS': 1,
    'AUTOTHROTTLE_ENABLED': False,
}
Now the order has changed, but it is still not the order of the elements in the array.
Use a return statement instead of yield.
You don't even need to touch any setting:
from scrapy.spiders import Spider, Request


class MySpider(Spider):
    name = 'toscrape.com'
    start_urls = ['http://books.toscrape.com/catalogue/page-1.html']
    urls = (
        'http://books.toscrape.com/catalogue/page-{}.html'.format(i + 1) for i in range(50)
    )

    def parse(self, response):
        # urls is a generator, so each call resumes where the previous one stopped
        for url in self.urls:
            return Request(url)
Output:
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-1.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-2.html> (referer: http://books.toscrape.com/catalogue/page-1.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-3.html> (referer: http://books.toscrape.com/catalogue/page-2.html)
2018-11-20 03:35:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-4.html> (referer: http://books.toscrape.com/catalogue/page-3.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-5.html> (referer: http://books.toscrape.com/catalogue/page-4.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-6.html> (referer: http://books.toscrape.com/catalogue/page-5.html)
2018-11-20 03:35:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-7.html> (referer: http://books.toscrape.com/catalogue/page-6.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-8.html> (referer: http://books.toscrape.com/catalogue/page-7.html)
2018-11-20 03:35:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-9.html> (referer: http://books.toscrape.com/catalogue/page-8.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-10.html> (referer: http://books.toscrape.com/catalogue/page-9.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-11.html> (referer: http://books.toscrape.com/catalogue/page-10.html)
2018-11-20 03:35:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-12.html> (referer: http://books.toscrape.com/catalogue/page-11.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-13.html> (referer: http://books.toscrape.com/catalogue/page-12.html)
2018-11-20 03:35:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: http://books.toscrape.com/catalogue/page-13.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: http://books.toscrape.com/catalogue/page-14.html)
2018-11-20 03:35:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: http://books.toscrape.com/catalogue/page-15.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: http://books.toscrape.com/catalogue/page-16.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: http://books.toscrape.com/catalogue/page-17.html)
2018-11-20 03:35:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: http://books.toscrape.com/catalogue/page-18.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: http://books.toscrape.com/catalogue/page-19.html)
2018-11-20 03:35:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: http://books.toscrape.com/catalogue/page-20.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: http://books.toscrape.com/catalogue/page-21.html)
2018-11-20 03:35:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: http://books.toscrape.com/catalogue/page-22.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: http://books.toscrape.com/catalogue/page-23.html)
2018-11-20 03:35:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: http://books.toscrape.com/catalogue/page-24.html)
With a yield statement, the engine gets all the requests from the generator and executes them in an arbitrary order (I suspect they might be stored in some sort of set to remove duplicates).
I think concurrent requests are at play here. You can try setting
custom_settings = {
    'CONCURRENT_REQUESTS': 1
}
The default for CONCURRENT_REQUESTS is 16 (and 8 per domain via CONCURRENT_REQUESTS_PER_DOMAIN). That would explain why the priority is not honoured when other download slots are free to take work.
You can send the next request only after receiving the previous one:
class MainSpider(Spider):
    urls = [
        'https://www.url1...',
        'https://www.url2...',
        'https://www.url3...',
    ]

    def start_requests(self):
        yield Request(
            url=self.urls[0],
            callback=self.parse,
            meta={'next_index': 1},
        )

    def parse(self, response):
        next_index = response.meta['next_index']
        # do something with response...
        # Process next url
        if next_index < len(self.urls):
            yield Request(
                url=self.urls[next_index],
                callback=self.parse,
                meta={'next_index': next_index+1},
            )

Scrapy FormRequest redirect not wanted link

I followed the basic Scrapy login example. It usually works, but in this case I ran into problems. FormRequest.from_response did not send the request to https://www.crowdfunder.com/user/validateLogin; instead, it always sent the payload to https://www.crowdfunder.com/user/signup. I also tried requesting validateLogin directly with the payload, but it responded with a 404 error. Any idea how to solve this problem? Thanks in advance!!!
import scrapy
from scrapy.spiders.init import InitSpider


class CrowdfunderSpider(InitSpider):
    name = "crowdfunder"
    allowed_domains = ["crowdfunder.com"]
    start_urls = [
        'http://www.crowdfunder.com/',
    ]
    login_page = 'https://www.crowdfunder.com/user/login/'
    payload = {}

    def init_request(self):
        """This function is called before crawling starts."""
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        self.payload = {'email': 'my_email',
                        'password': 'my_password'}
        # scrapy login
        return scrapy.FormRequest.from_response(response, formdata=self.payload, callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if 'https://www.crowdfunder.com/user/settings' == response.url:
            self.log("Successfully logged in. :) :) :)")
            # start the crawling
            return self.initialized()
        else:
            # login fail
            self.log("login failed :( :( :(")
Here is the payload and request link sent by clicking login in browser:
[Image: payload and request URL sent by clicking the login button]
Here is the log info:
2016-10-21 21:56:21 [scrapy] INFO: Scrapy 1.1.0 started (bot: crowdfunder_crawl)
2016-10-21 21:56:21 [scrapy] INFO: Overridden settings: {'AJAXCRAWL_ENABLED': True, 'NEWSPIDER_MODULE': 'crowdfunder_crawl.spiders', 'SPIDER_MODULES': ['crowdfunder_crawl.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'crowdfunder_crawl'}
2016-10-21 21:56:21 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-21 21:56:21 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-21 21:56:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-21 21:56:21 [scrapy] INFO: Spider opened
2016-10-21 21:56:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-21 21:56:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-10-21 21:56:21 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/robots.txt> (referer: None)
2016-10-21 21:56:21 [scrapy] DEBUG: Redirecting (301) to <GET http://www.crowdfunder.com/user/login> from <GET https://www.crowdfunder.com/user/login/>
2016-10-21 21:56:22 [scrapy] DEBUG: Redirecting (301) to <GET https://www.crowdfunder.com/user/login> from <GET http://www.crowdfunder.com/user/login>
2016-10-21 21:56:22 [scrapy] DEBUG: Crawled (200) <GET https://www.crowdfunder.com/user/login> (referer: None)
2016-10-21 21:56:23 [scrapy] DEBUG: Crawled (200) <POST https://www.crowdfunder.com/user/signup> (referer: https://www.crowdfunder.com/user/login)
2016-10-21 21:56:23 [crowdfunder] DEBUG: login failed :( :( :(
2016-10-21 21:56:23 [scrapy] INFO: Closing spider (finished)
2016-10-21 21:56:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1569,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 16313,
'downloader/response_count': 5,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/301': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 22, 4, 56, 23, 232493),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 10, 22, 4, 56, 21, 180030)}
2016-10-21 21:56:23 [scrapy] INFO: Spider closed (finished)
FormRequest.from_response(response) by default uses the first form it finds. If you check what forms the page has you'd see:
In : response.xpath("//form")
Out:
[<Selector xpath='//form' data='<form action="/user/signup" method="post'>,
<Selector xpath='//form' data='<form action="/user/login" method="POST"'>,
<Selector xpath='//form' data='<form action="/user/login" method="post"'>]
So the form you are looking for is not the first one. To fix this, use one of the from_response parameters that select a form (for example formname, formnumber or formxpath) to specify which form to use.
Using formxpath is the most flexible and my personal favorite:
In : FormRequest.from_response(response, formxpath='//*[contains(@action,"login")]')
Out: <POST https://www.crowdfunder.com/user/login>

Conflicts When Generating Start Urls

I'm working on retrieving information from the National Gallery of Art's online catalog. Due to the catalog's structure, I can't navigate by extracting and following links from entry to entry. Fortunately, each object in the collection has a predictable url. I want my spider to navigate the collection by generating start urls.
I have attempted to solve my problem by implementing the solution from this thread. Unfortunately, this seems to break another part of my spider. The error log reveals that my urls are being successfully generated, but they aren't being processed correctly. If I'm interpreting the log correctly—which I suspect I'm not—there is a conflict between the redefinition of the start_urls that allows me to generate the urls I need and the rules section of the spider. As things stand now, the spider also doesn't respect the number of pages that I ask it to crawl.
You'll find my spider and a typical error below. I appreciate any help you can offer.
Spider:
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 10

class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [URL % starting_number]
    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')), callback='parse_CatalogRecord',
             follow=True),
    )

    def __init__(self):
        self.page_number = starting_number

    def start_requests(self):
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i + ".html", callback=self.parse)

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        if r.search(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
            return CatalogRecord.load_item()
Typical Error:
2016-04-29 15:35:00 [scrapy] ERROR: Spider error processing <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1178.html> (referer: None)
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 73, in _parse_response
for request_or_item in self._requests_to_follow(response):
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 51, in _requests_to_follow
for n, rule in enumerate(self._rules):
AttributeError: 'NGASpider' object has no attribute '_rules'
Update in Response to eLRuLL's Solution
Simply removing def __init__ and start_urls allows my spider to crawl my generated urls. However, it also seems to prevent 'def parse_CatalogRecord(self, response)' from being applied. When I run the spider now, it only scrapes pages from outside the range of generated urls. My revised spider and log output follow below.
Spider:
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 1311

class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')), callback='parse_CatalogRecord',
             follow=True),
    )

    def start_requests(self):
        self.page_number = starting_number
        for i in range(self.page_number, number_of_pages, -1):
            yield Request(url=URL % i + ".html", callback=self.parse)

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        if r.search(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
            return CatalogRecord.load_item()
Log:
2016-05-02 15:50:02 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-02 15:50:02 [scrapy] INFO: Optional features available: ssl, http11
2016-05-02 15:50:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-05-02 15:50:02 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-02 15:50:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-02 15:50:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-02 15:50:02 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-05-02 15:50:02 [scrapy] INFO: Spider opened
2016-05-02 15:50:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-02 15:50:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-02 15:50:02 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-05-02 15:50:02 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-05-02 15:50:05 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-05-02 15:50:05 [scrapy] DEBUG: File (uptodate): Downloaded image from <GET http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg> referred in <None>
2016-05-02 15:50:05 [scrapy] DEBUG: Scraped from <200 http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html>
{'accession': u'1942.9.163.b',
'image_urls': [u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'],
'images': [{'checksum': '9d5f2e30230aeec1582ca087bcde6bfa',
'path': 'full/3a692347183d26ffefe9ba0af80b0b6bf247fae5.jpg',
'url': 'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'}],
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
'title': u'House between Two Hills [reverse]'}
2016-05-02 15:50:05 [scrapy] INFO: Closing spider (finished)
2016-05-02 15:50:05 [scrapy] INFO: Stored json feed (1 items) in: items.json
2016-05-02 15:50:05 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 631,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 26324,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'dupefilter/filtered': 3,
'file_count': 1,
'file_status_count/uptodate': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 2, 19, 50, 5, 810570),
'item_scraped_count': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 5, 2, 19, 50, 2, 455508)}
2016-05-02 15:50:05 [scrapy] INFO: Spider closed (finished)
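Don't override the __init__ method unless you also call the parent class's __init__ via super. CrawlSpider's __init__ is what compiles rules into the _rules attribute, which is why the traceback ends with "'NGASpider' object has no attribute '_rules'".
Also, you don't really need to declare start_urls for your spider to work if you are going to use start_requests.
So just remove your def __init__ method; start_urls doesn't need to exist either.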
UPDATE
OK, my mistake: it looks like CrawlSpider needs the start_urls attribute, so just create it instead of using the start_requests method:
    start_urls = [URL % i + '.html' for i in range(starting_number, number_of_pages, -1)]
and remove start_requests.
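One thing to watch with that list comprehension: range(start, stop, step) treats its second argument as an exclusive end value, not as a count, which would explain why the spider did not respect the number of pages requested. The sketch below reuses the URL, starting_number and number_of_pages names from the question; the stop variable is an assumption about the intent (counting down number_of_pages pages from starting_number).

```python
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 10

# As written in the question: counts down from 1312 until just above 10,
# so it covers far more than 10 pages.
as_written = range(starting_number, number_of_pages, -1)
print(len(as_written))

# Assumed intent: exactly number_of_pages pages, counting down from
# starting_number. The stop value is exclusive.
stop = starting_number - number_of_pages
start_urls = [URL % i + ".html" for i in range(starting_number, stop, -1)]
print(len(start_urls))
print(start_urls[0])
print(start_urls[-1])
```

By the same logic, the revised spider with number_of_pages = 1311 generates a single URL (page 1312), which matches the log showing only one start request.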