Can't crawl scrapy with depth more than 1 - scrapy

I couldn't configure Scrapy to run with depth > 1. I have tried the following three options; none of them worked, and request_depth_max in the summary log is always 1:
1) Adding:
from scrapy.conf import settings
settings.overrides['DEPTH_LIMIT'] = 2
to the spider file (the example from the site, just with a different site)
2) Running the command line with -s option:
/usr/bin/scrapy crawl -s DEPTH_LIMIT=2 mininova.org
3) Adding to settings.py and scrapy.cfg:
DEPTH_LIMIT=2
How should it be configured to crawl to a depth of more than 1?

warwaruk is right: the default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".
So let's scrape mininova and see what happens. Starting at the today page we see that there are two tor links:
stav@maia:~$ scrapy shell http://www.mininova.org/today
2012-08-15 12:27:57-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'[APSKAFT-018] Apskaft presents: Musique Concrte', fragment='', nofollow=False), Link(url='http://www.mininova.org/tor/13204737', text=u'e4g020-graphite412', fragment='', nofollow=False)]
Let's scrape the first link, where we see there are no new tor links on that page, just the link to itself, which does not get recrawled by default (scrapy.http.Request(url[, ... dont_filter=False, ...])):
>>> fetch('http://www.mininova.org/tor/13204738')
2012-08-15 12:30:11-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204738> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204738', text=u'General information', fragment='', nofollow=False)]
No luck there, we are still at depth 1. Let's try the other link:
>>> fetch('http://www.mininova.org/tor/13204737')
2012-08-15 12:31:20-0500 [default] DEBUG: Crawled (200) <GET http://www.mininova.org/tor/13204737> (referer: None)
>>> SgmlLinkExtractor(allow=['/tor/\d+']).extract_links(response)
[Link(url='http://www.mininova.org/tor/13204737', text=u'General information', fragment='', nofollow=False)]
Nope, this page only contains one link as well, a link to itself, which also gets filtered. Since there are actually no further links to scrape, Scrapy closes the spider (at depth==1).

I had a similar issue; it helped to set follow=True when defining the Rule:
follow is a boolean which specifies if links should be followed from
each response extracted with this rule. If callback is None follow
defaults to True, otherwise it defaults to False.
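For illustration, a minimal sketch of such a rule with follow=True enabled (spider name, start URL and link pattern are placeholders; it uses the same SgmlLinkExtractor-era API as the spider quoted further down in this thread):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class FollowingSpider(CrawlSpider):
    name = 'following-example'
    start_urls = ['http://www.example.com/']

    rules = [
        # With a callback set, follow defaults to False, so it must be
        # enabled explicitly if the matched pages should also be crawled
        # for further links (which is what pushes the depth above 1).
        Rule(SgmlLinkExtractor(allow=[r'/page/\d+']),
             callback='parse_page',
             follow=True),
    ]

    def parse_page(self, response):
        # extract items from each followed page here
        pass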

The default value of the DEPTH_LIMIT setting is 0, i.e. "no limit is imposed".
You wrote:
request_depth_max at summary log is always 1
What you see in the logs is the statistics, not the settings. When it says that request_depth_max is 1, it means that no other requests have been yielded from the first callback.
You have to show your spider code to understand what is going on.
But create another question for it.
UPDATE:
Ah, I see you are running the mininova spider from the Scrapy intro:
class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent
As you can see from the code, the spider never issues a request for other pages; it scrapes all the data right from the top-level pages. That's why the maximum depth is 1.
If you write your own spider which follows links to other pages, the maximum depth will be greater than 1.
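For contrast, here is a rough sketch of a spider that does yield requests for further pages (written with current Scrapy idioms; the URL and selectors are placeholders). Because the detail pages are reached via requests yielded from the first callback, they are crawled at depth 2 and request_depth_max rises accordingly:

import scrapy

class FollowupSpider(scrapy.Spider):
    # Hypothetical spider: the listing page yields one request per detail
    # link, so those detail pages are crawled at depth 2.
    name = 'followup-example'
    start_urls = ['http://www.example.com/listing']

    def parse(self, response):
        for href in response.css('a.detail::attr(href)').getall():
            # each followed link becomes a new request one level deeper
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url,
               'title': response.css('h1::text').get()}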


How to address urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=58408): Max retries exceeded with url

I am trying to scrape a few pages of a website with Selenium and use the results, but when I run the function twice, the error
[WinError 10061] No connection could be made because the target machine actively refused it
appears for the 2nd function call.
Here's my approach:
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as soup

opts = webdriver.ChromeOptions()
opts.binary_location = os.environ.get('GOOGLE_CHROME_BIN', None)
opts.add_argument("--headless")
opts.add_argument("--disable-dev-shm-usage")
opts.add_argument("--no-sandbox")
browser = webdriver.Chrome(executable_path="CHROME_DRIVER PATH", options=opts)

lst = []

def search(st):
    for i in range(1, 3):
        url = "https://gogoanime.so/anime-list.html?page=" + str(i)
        browser.get(url)
        req = browser.page_source
        sou = soup(req, "html.parser")
        title = sou.find('ul', class_="listing")
        title = title.find_all("li")
        for j in range(len(title)):
            lst.append(title[j].getText().lower()[1:])
    browser.quit()
    print(len(lst))

search("a")
search("a")
OUTPUT
272
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=58408): Max retries exceeded with url: /session/4b3cb270d1b5b867257dcb1cee49b368/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001D5B378FA60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
This error message...
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=58408): Max retries exceeded with url: /session/4b3cb270d1b5b867257dcb1cee49b368/url (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001D5B378FA60>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
...implies that the client failed to establish a new connection, raising MaxRetryError as no connection could be made.
A couple of things:
First and foremost, as per the discussion max-retries-exceeded exceptions are confusing, the traceback is somewhat misleading. Requests wraps the exception for the user's convenience. The original exception is part of the message displayed.
Requests never retries (it sets retries=0 for urllib3's HTTPConnectionPool), so the error would have been much more canonical without the MaxRetryError and HTTPConnectionPool keywords. So an ideal traceback would have been:
ConnectionError(<class 'socket.error'>: [Errno 1111] Connection refused)
Root Cause and Solution
Once you have initiated the webdriver and web client session, within def search(st) you invoke get() to access a url, and further down you also invoke browser.quit(), which calls the /shutdown endpoint, so the webdriver and web-client instances are destroyed completely, closing all pages/tabs/windows. Hence no connection exists anymore.
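A minimal repro of that sequence, stripped of the scraping logic (assuming a working local chromedriver; the second get() fails because the driver session and its local server are gone, which surfaces as the MaxRetryError shown above):

from selenium import webdriver

driver = webdriver.Chrome()            # assumes chromedriver is on PATH
driver.get("https://example.com")      # works: the session is alive
driver.quit()                          # /shutdown: session and local server destroyed
driver.get("https://example.com")      # fails: no connection can be made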
You can find a couple of relevant detailed discussion in:
PhantomJS web driver stays in memory
Selenium : How to stop geckodriver process impacting PC memory, without calling
driver.quit()?
In such a situation, the next time browser.get() is invoked there are no active connections; hence you see the error.
So a simple solution would be to remove the browser.quit() line and invoke browser.get(url) within the same browsing context.
Conclusion
Once you upgrade to Selenium 3.14.1 you will be able to set the timeout and see canonical tracebacks, and you will be able to take the required action.
References
You can find a relevant detailed discussion in:
MaxRetryError: HTTPConnectionPool: Max retries exceeded (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused')))
tl;dr
A couple of relevant discussions:
Adding max_retries as an argument
Removed the bundled charade and urllib3.
Third party libraries committed verbatim
Problem
The driver was asked to crawl the URL after being quit.
Make sure that you're not quitting the driver before getting the content.
Solution
Regarding your code, when executing search("a"), the driver retrieves the url, returns the content and after that it closes.
When search() runs another time, the driver no longer exists, so it is not able to proceed with the URL.
You need to remove browser.quit() from the function and add it at the end of the script.
lst = []

def search(st):
    for i in range(1, 3):
        url = "https://gogoanime.so/anime-list.html?page=" + str(i)
        browser.get(url)
        req = browser.page_source
        sou = soup(req, "html.parser")
        title = sou.find('ul', class_="listing")
        title = title.find_all("li")
        for j in range(len(title)):
            lst.append(title[j].getText().lower()[1:])
    print(len(lst))

search("a")
search("a")
browser.quit()
I faced the same issue in Robot Framework.
MaxRetryError: HTTPConnectionPool(host='options=add_argument("--ignore-certificate-errors")', port=80): Max retries exceeded with url: /session (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001ABA3190F10>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')).
This issue got fixed once I updated all the libraries to their latest versions in PyCharm and also selected Intellibot@SeleniumLibrary.patched.

Scrapy - get depth level of failed requests with no response

When parsing scraped pages I also save the depth the request was scraped from using response.meta['depth'].
I recently started using errback to log all failed requests into a separate file, and having the depth there would help me a lot. (I believe) I could use failure.value.response.meta['depth'] for those pages which actually got a response but failed due to, e.g., an HTTP status error like 403; however, when an error like a TCP timeout is encountered there is no response.
Is it possible to get the depth level of a failed request with no response?
EDIT1: I tried failure.request.meta['depth'], but that gives an error. meta can be found, but it has no depth key.
EDIT2: The issue seems to be that the depth key in failure.request.meta is created only when the first response is received. The way I understand it, if the first request (a start_url) doesn't receive a response, the depth key is not yet created, and accessing it hence throws an exception.
I'm going to experiment with this as per the depth middleware:
if 'depth' not in response.meta:
    response.meta['depth'] = 0
Yep, the issue turns out to be exactly how I described it in EDIT2. This is how I fixed it:
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, errback=self.my_errback)

def my_errback(self, failure):
    if 'depth' not in failure.request.meta:
        failure.request.meta['depth'] = 0
    depth = failure.request.meta['depth']
    # do something with depth...
Big thanks to @Galecio, who pointed me in the right direction!

Amazon reviews: List index out of range

I would like to scrape the customer reviews of the kindle paperwhite of amazon.
I am aware that although Amazon might say they have 5900 reviews, it is only possible to access 5000 of them (after page=500, no more reviews are displayed, with 10 reviews per page).
For the first few pages my spider returns 10 reviews per page, but later this shrinks to just one or two. This results in only about 1300 reviews.
There seems to be a problem with adding the data of the variables "helpful" and "verified". Both throw the following error:
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
Any help would be greatly appreciated!
I tried implementing if statements in case the variables were empty or contained a list, but it didn't work.
My Spider amazon_reviews.py:
import scrapy
from scrapy.extensions.throttle import AutoThrottle

class AmazonReviewsSpider(scrapy.Spider):
    name = 'amazon_reviews'
    allowed_domains = ['amazon.com']
    myBaseUrl = "https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber="
    start_urls = []

    # Creating list of urls to be scraped by appending page number at the end of the base url
    for i in range(1, 550):
        start_urls.append(myBaseUrl + str(i))

    def parse(self, response):
        data = response.css('#cm_cr-review_list')

        # Collecting various data
        star_rating = data.css('.review-rating')
        title = data.css('.review-title')
        text = data.css('.review-text')
        date = data.css('.review-date')

        # Number of people who thought the review was helpful.
        helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
        verified = response.xpath('.//span[@data-hook="avp-badge"]//text()').extract()

        # I scrape more information, but deleted it here not to make the code too big

        # yielding the scraped results
        count = 0  # index into the parallel lists
        for review in star_rating:
            yield {'ASIN': 'B07CXG6C9W',
                   #'ID': ''.join(id.xpath('.//text()').extract()),
                   'stars': ''.join(review.xpath('.//text()').extract_first()),
                   'title': ''.join(title[count].xpath(".//text()").extract_first()),
                   'text': ''.join(text[count].xpath(".//text()").extract_first()),
                   'date': ''.join(date[count].xpath(".//text()").extract_first()),
                   ### There seems to be a problem with adding these two, as I get 5000 reviews back if I delete them. ###
                   'verified purchase': ''.join(verified[count]),
                   'helpful': ''.join(helpful[count])
                   }
            count = count + 1
My settings.py :
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_TIMEOUT = 180
REDIRECT_ENABLED = False
#DOWNLOAD_DELAY =5.0
RANDOMIZE_DOWNLOAD_DELAY = True
The extraction of the data works fine; the reviews I do get have complete and accurate information. It's just that the number of reviews I get is too small.
When I run the spider with the following command:
runspider amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py -o reviews.csv
The output on the console looks like the following:
2019-04-22 11:54:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=164> (referer: None)
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'BRANDI', 'title': 'Bookworms rejoice!', 'text': "The (...) 5 STARS! 🌟🌟🌟🌟🌟", 'date': 'December 7, 2018'}
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'Doug Stender', 'title': 'As good as adverised', 'text': 'I read (...) mazon...', 'date': 'January 8, 2019'}
2019-04-22 11:54:41 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161> (referer: None)
Traceback (most recent call last):
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\OneDrive\Dokumente\Uni\05_SS 19\Masterarbeit\Code\Scrapy\amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py", line 78, in parse
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
Turns out that if a review didn't have the "verified" tag, or if no one commented on it, the html part Scrapy was looking for isn't there, and therefore no item gets added to the list, which makes the "verified" and "comments" lists shorter than the other ones. Because of this error, all items in the list got dropped and weren't added to my csv file. The simple fix below, which checks whether the lists are as long as the other lists, worked just fine :)
Edit:
When using this fix it might happen that values are assigned to the wrong review, because the placeholder is always appended to the end of the list.
If you want to be on the safe side, don't scrape the verified tag, or replace the whole list with "Na" or something else that indicates that the value is unclear.
helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
while len(helpful) != len(date):
    helpful.append("0 people found this helpful")
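If the alignment of values matters, a safer pattern (just a sketch, not the original code; the '.review' container class is an assumption, the other selectors mirror the ones used above) is to loop over each review block and extract the optional fields relative to it, so a missing badge only affects that one review:

def parse(self, response):
    # one selector object per review card; all fields are taken relative
    # to it, so optional elements can be absent without shifting anything
    for review in response.css('#cm_cr-review_list .review'):
        yield {
            'stars': review.css('.review-rating ::text').extract_first(),
            'title': review.css('.review-title ::text').extract_first(),
            'date': review.css('.review-date ::text').extract_first(),
            # these two may be missing; defaults keep the fields present
            'verified purchase': review.xpath('.//span[@data-hook="avp-badge"]//text()').extract_first(default='Na'),
            'helpful': review.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract_first(default='0 people found this helpful'),
        }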

Twisted deferreds block when URI is the same (multiple calls from the same browser)

I have the following code
# -*- coding: utf-8 -*-
# 好
##########################################
import time
from twisted.internet import reactor, threads
from twisted.web.server import Site, NOT_DONE_YET
from twisted.web.resource import Resource
##########################################
class Website(Resource):
    def getChild(self, name, request):
        return self

    def render(self, request):
        if request.path == "/sleep":
            duration = 3
            if 'duration' in request.args:
                duration = int(request.args['duration'][0])
            message = 'no message'
            if 'message' in request.args:
                message = request.args['message'][0]
            #-------------------------------------
            def deferred_activity():
                print 'starting to wait', message
                time.sleep(duration)
                request.setHeader('Content-Type', 'text/plain; charset=UTF-8')
                request.write(message)
                print 'finished', message
                request.finish()
            #-------------------------------------
            def responseFailed(err, deferred):
                pass; print err.getErrorMessage()
                deferred.cancel()
            #-------------------------------------
            def deferredFailed(err, deferred):
                pass; # print err.getErrorMessage()
            #-------------------------------------
            deferred = threads.deferToThread(deferred_activity)
            deferred.addErrback(deferredFailed, deferred) # will get called indirectly by responseFailed
            request.notifyFinish().addErrback(responseFailed, deferred) # to handle client disconnects
            #-------------------------------------
            return NOT_DONE_YET
        else:
            return 'nothing at', request.path
##########################################
reactor.listenTCP(321, Site(Website()))
print 'starting to serve'
reactor.run()
##########################################
# http://localhost:321/sleep?duration=3&message=test1
# http://localhost:321/sleep?duration=3&message=test2
##########################################
My issue is the following:
When I open two tabs in the browser, point one at http://localhost:321/sleep?duration=3&message=test1 and the other at http://localhost:321/sleep?duration=3&message=test2 (the messages differ) and reload the first tab and then ASAP the second one, then they finish almost at the same time: the first tab about 3 seconds after hitting F5, the second tab about half a second after the first tab.
This is expected, as each request got deferred into a thread, and they are sleeping in parallel.
But when I now change the URL of the second tab to be the same as the one of the first tab, that is to http://localhost:321/sleep?duration=3&message=test1, then all this becomes blocking. If I press F5 on the first tab and as quickly as possible F5 on the second one, the second tab finishes about 3 seconds after the first one. They don't get executed in parallel.
As long as the entire URI is the same in both tabs, this server starts to block. This is the same in Firefox as well as in Chrome. But when I start one in Chrome and another one in Firefox at the same time, then it is non-blocking again.
So it may not necessarily be related to Twisted, but maybe to some connection reuse or something like that.
Anyone knows what is happening here and how I can solve this issue?
Coincidentally, someone asked a related question over at the Tornado section. As you suspected, this is not an "issue" in Twisted but rather a "feature" of web browsers :). Tornado's FAQ page has a small section dedicated to this issue. The proposed solution is appending an arbitrary query string.
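For example, a quick way to test this from the same browser is to make the otherwise-identical URLs differ by an arbitrary, ignored query parameter (the parameter name here is made up; the handler above only reads duration and message):

http://localhost:321/sleep?duration=3&message=test1&nocache=1
http://localhost:321/sleep?duration=3&message=test1&nocache=2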
Quote of the day:
One dev's bug is another dev's undocumented feature!

Start multiple spiders sequentially from another spider

I have one spider which creates +100 spiders with arguments.
Those spiders scrape x items and forward them to a mysqlpipeline.
The MySQL database can handle 10 connections at a time.
For that reason I can only have a maximum of 10 spiders running at the same time.
How can I make this happen?
My unsuccessful approach so far is:
- Add spiders to a list in the first spider like this:
if item.get('location_selectors') is not None and item.get('start_date_selectors') is not None:
    spider = WikiSpider.WikiSpider(template=item.get('category'), view_threshold=0, selectors={
        'location': [item.get('location_selectors')],
        'date_start': [item.get('start_date_selectors')],
        'date_end': [item.get('end_date_selectors')]
    })
    self.spiders.append(spider)
Then in the first spider I listen to the close_spider signal:
def spider_closed(self, spider):
    for spider in self.spiders:
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider)
But this approach gives me the following error:
connection to the other side was lost in a non-clean fashion
What is the correct way to start multiple spiders in a sequential manner?
Thanks in advance!
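For reference, a rough sketch of the pattern Scrapy's documentation suggests for running spiders one after another in a single process: chain the crawls with CrawlerRunner and inlineCallbacks (the spider list is a placeholder, and this alone does not enforce the 10-connection limit):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# hypothetical list of spider classes (or (class, kwargs) pairs) to run;
# in this question it would be built from the items of the first spider
spider_classes = []

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_sequentially():
    for spider_cls in spider_classes:
        # each yield waits for the previous crawl to finish, so only one
        # spider is running at any time
        yield runner.crawl(spider_cls)
    reactor.stop()

crawl_sequentially()
reactor.run()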