RetryMiddleware not doing its thing - scrapy

I'm looking at the log of a spider that has the RetryMiddleware active:
[scrapy.middleware] Enabled downloader middlewares:
['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware',
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'lari.downloadermiddlewares.DocumentExtraction',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy_crawlera.CrawleraMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']
And a request failing with a ConnectionRefusedError (which is among the exceptions that RetryMiddleware should intercept):
[scrapy.core.scraper] Error downloading <GET xxx>
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
but the request is not actually retried when the spider is done crawling the rest (yes, the spider finished normally, no cancellation or anything).
Am I missing something?

You can try a few things to narrow this down:
Make sure that your internet connection is stable.
Try different user agents.
Check your Scrapy version.
Also, make sure that you are replicating the browser's request for the given URL exactly (check the headers, cookies, and so on).
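One way to check whether RetryMiddleware touched the request at all is to look at the retry counters in the stats that Scrapy dumps when the spider closes. Below is a minimal sketch, assuming the stats have been copied into a plain dict. The helper name summarize_retries and the sample values are hypothetical; the retry/... key names are the ones RetryMiddleware normally writes.

```python
def summarize_retries(stats):
    """Pull the retry-related counters out of a Scrapy stats dict."""
    prefix = "retry/reason_count/"
    # Map each failure reason (exception class path or HTTP status) to its count.
    reasons = {
        key[len(prefix):]: value
        for key, value in stats.items()
        if key.startswith(prefix)
    }
    return {
        "retried": stats.get("retry/count", 0),
        "gave_up": stats.get("retry/max_reached", 0),
        "reasons": reasons,
    }


# Hypothetical stats, shaped like the dump printed at the end of a crawl.
sample_stats = {
    "downloader/request_count": 120,
    "retry/count": 2,
    "retry/max_reached": 1,
    "retry/reason_count/twisted.internet.error.ConnectionRefusedError": 2,
}
summary = summarize_retries(sample_stats)
print(summary)
```

If "gave_up" is non-zero, that would suggest the request was in fact retried and that the logged error is the final failure after RETRY_TIMES was exhausted, rather than a request the middleware ignored.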

Related

How to resolve 502 response code in Scrapy request?

I created a spider that scrapes data from Yelp using Scrapy. All requests go through the Crawlera proxy. The spider gets the URL to scrape, sends a request, and scrapes the data. This worked fine up until the other day, when I started getting a 502 None response. The 502 None response appears after execution of this line:
r = self.req_session.get(url, proxies=self.proxies, verify='../secret/crawlera-ca.crt').text
The traceback:
2020-11-04 14:27:55 [urllib3.connectionpool] DEBUG: https://www.yelp.com:443 "GET /biz/a-dog-in-motion-arcadia HTTP/1.1" 502 None
So, it seems that spider cannot reach the URL because the connection is closed.
I have checked 502 meaning in Scrapy and Crawlera documentation, and it refers to connection being refused, closed, domain unavailable and similar things.
I have debugged the code related to where the issue is happening, and everything is up to date.
If anyone has ideas or knowledge about this, would love to hear, as I am stuck. What could actually be the issue here?
NOTE: Yelp URLs work normally when I open them in browser.
The website sees that you are a "scraper" and not a human user, from the headers of your request.
You should send a different header with the request, so that the scraped website thinks you are browsing with a regular browser.
For more info, refer to the scrapy documentation.
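As a rough illustration of "sending a different header", here is a sketch of browser-like headers. The header values are assumptions chosen for the example, not a set known to work for Yelp or any other site. USER_AGENT and DEFAULT_REQUEST_HEADERS are regular Scrapy settings, but the call in the question goes through self.req_session, which looks like a requests session; if so, Scrapy settings would not affect it and the headers would have to be passed to that call directly.

```python
# Example browser-like headers; the exact values are assumptions.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/86.0.4240.111 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# For requests made by Scrapy itself, these two settings carry the headers
# (they would normally live in settings.py).
USER_AGENT = BROWSER_HEADERS["User-Agent"]
DEFAULT_REQUEST_HEADERS = {
    name: value for name, value in BROWSER_HEADERS.items() if name != "User-Agent"
}

# For a requests session like the one in the question, the same dict could be
# passed per call instead:
#   self.req_session.get(url, headers=BROWSER_HEADERS, proxies=self.proxies,
#                        verify='../secret/crawlera-ca.crt')
```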
Some pages are not available in some countries, which is why using proxies is recommended. I tried requesting the URL and the connection was successful:
2020-11-05 02:50:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2020-11-05 02:50:40 [scrapy.core.engine] INFO: Spider opened
2020-11-05 02:50:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/biz/a-dog-in-motion-arcadia> (referer: None)

How to make scrapy continue downloading after losing connection for a while

I am trying to scrape a website using Scrapy, but the network in the office is unstable. If we lose the network connection for even a few seconds, Scrapy gets stuck and stops downloading. The last log lines we see are:
2018-08-27 11:50:05 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): *.*.org
2018-08-27 11:50:07 [urllib3.connectionpool] DEBUG: https://**.**.org:443 "GET /01313_**0.jpg HTTP/1.1" 200 135790
I tried changing the timeout settings, but nothing changed.
Thank you!
You can try setting RETRY_TIMES (in settings.py):
RETRY_TIMES=5
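A slightly fuller settings sketch is below. The values are illustrative assumptions, not tuned recommendations. Note also that the urllib3 lines in the log above suggest the images may be fetched with the requests library rather than by Scrapy's downloader; if that is the case, these settings would not apply to those calls, and a timeout would have to be passed to the requests call itself.

```python
# settings.py (sketch): values are examples only.
RETRY_ENABLED = True    # RetryMiddleware is enabled by default
RETRY_TIMES = 5         # retries per request, in addition to the first attempt
DOWNLOAD_TIMEOUT = 30   # seconds before the downloader times out (Scrapy's default is 180)
```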

Problems Pausing and Resuming a Scrapy Spider

I'm doing a very slow crawl on a moderately sized site in order to respect their guidance for web scraping. That situation means that I need to be able to pause and resume my spider. So far, I've been enabling persistence when I deploy the spider in the command line: scrapy crawl ngamedallions -s JOBDIR=pass1 -o items.csv.
Last night, that seemed to be doing the trick. I tested my spider and found that, when I shut it down cleanly, I could start it again and the crawl would resume where I left off. Today, though, the spider starts at the very beginning. I've checked the contents of my pass1 directory, and my requests.seen file has some content, even though the 1600 lines seems a little light for the 3000 pages I crawled last night.
In any case, does anyone have a sense of where I'm going wrong as I try to resume my spider?
Update
I went ahead and manually skipped my spider ahead to continue yesterday's crawl. When I tried closing and resuming the spider with the same command (see above), it worked. The start of my log reflects the spider recognizing that a crawl is being resumed.
2016-05-11 10:59:36 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-11 10:59:36 [scrapy] INFO: Optional features available: ssl, http11
2016-05-11 10:59:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'USER_AGENT': 'ngamedallions', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 10}
2016-05-11 10:59:36 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-11 10:59:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-11 10:59:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-11 10:59:36 [scrapy] INFO: Enabled item pipelines: NgamedallionsCsvPipeline, NgamedallionsImagesPipeline
2016-05-11 10:59:36 [scrapy] INFO: Spider opened
2016-05-11 10:59:36 [scrapy] INFO: Resuming crawl (3 requests scheduled)
When I try to resume the spider after a second graceful close (pause-resume-pause-resume), however, it starts the crawl over again. The beginning of the log in that case follows, but the main takeaway is that the spider does not report recognizing the crawl as resumed.
2016-05-11 11:19:10 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-11 11:19:10 [scrapy] INFO: Optional features available: ssl, http11
2016-05-11 11:19:10 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'USER_AGENT': 'ngamedallions', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 10}
2016-05-11 11:19:11 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-11 11:19:11 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-11 11:19:11 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-11 11:19:11 [scrapy] INFO: Enabled item pipelines: NgamedallionsCsvPipeline, NgamedallionsImagesPipeline
2016-05-11 11:19:11 [scrapy] INFO: Spider opened
Scrapy avoids crawling duplicate URLs. The documentation of the dont_filter parameter of Request explains how to bypass this:
dont_filter (boolean) – indicates that this request should not be
filtered by the scheduler. This is used when you want to perform an
identical request multiple times, to ignore the duplicates filter. Use
it with care, or you will get into crawling loops. Default to False.
Also, look at this question
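To illustrate what dont_filter changes, here is a toy model of a duplicates filter. It is a simplified stand-in, not Scrapy's actual implementation: the class SeenFilter, its method should_schedule, and the SHA-1-of-URL fingerprint are all assumptions made for this sketch.

```python
import hashlib


class SeenFilter:
    """Toy stand-in for a duplicates filter; not Scrapy's implementation."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # Requests flagged dont_filter bypass the filter entirely.
        if dont_filter:
            return True
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            return False  # already seen, so it is dropped as a duplicate
        self.seen.add(fingerprint)
        return True


dupefilter = SeenFilter()
url = "http://example.com/page"  # hypothetical URL
first = dupefilter.should_schedule(url)
second = dupefilter.should_schedule(url)
forced = dupefilter.should_schedule(url, dont_filter=True)
print(first, second, forced)
```

In this model, a URL that is already in the seen set is only scheduled again when dont_filter is set, which is why a resumed crawl with a populated requests.seen file may skip requests it has already recorded.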

scrapy shell did not recognize 'sel' object

I'm a Python newbie trying to use Scrapy for a project.
Scrapy 0.19 is installed on my CentOS machine (Linux 2.6.32) and I followed the instructions on the Scrapy documentation page, but found that the Scrapy shell could not find the 'sel' object. Here are my steps:
[root@localhost rpm]# scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
2014-03-02 06:33:23+0800 [scrapy] INFO: Scrapy 0.19.0 started (bot: scrapybot)
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Optional features available: ssl, http11, libxml2
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled item pipelines:
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-02 06:33:23+0800 [default] INFO: Spider opened
2014-03-02 06:33:24+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html><head><base href="http://example.c'>
[s] item {}
[s] request <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] response <200 http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x3668ed0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>> sel.xpath('//title/text()')
Traceback (most recent call last):
File "<console>", line 1, in <module>
NameError: name 'sel' is not defined
>>>
Can anyone tell me how to solve this? Thanks in advance.
The sel object was added in version 0.20. When you run the shell command, it tells you which objects you can use; in your case that is hxs, which behaves similarly:
>>> hxs.select('//title/text()')
You should try reading the documentation first. The selectors section explains quite clearly how to use them with your current version.

Scrapy forum crawler starting but not returning any scraped data

Here is my code. Can someone please help? For some reason the spider runs but does not actually crawl the forum threads. I am trying to extract all the text in the forum threads for the specific forum in my start URL.
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from xbox.items import xboxItem
from scrapy.item import Item
from scrapy.conf import settings
class xboxSpider(CrawlSpider):
    name = "xbox"
    allowed_domains = ["forums.xbox.com"]
    start_urls= [
        "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/default.aspx",
    ]
    rules= [
        Rule(SgmlLinkExtractor(allow=['/t/\d+']),callback='parse_thread'),
        Rule(SgmlLinkExtractor(allow=('/t/new\?new_start=\d+',)))
    ]
    def parse_thread(self, response):
        hxs=HtmlXPathSelector(response)
        item=xboxItem()
        item['content']=hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
        item['date']=hxs.select("//span[@class='value']/text()").extract()
        return item
Log output:
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Enabled item pipelines:
2013-03-13 11:22:18-0400 [xbox] INFO: Spider opened
2013-03-13 11:22:18-0400 [xbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-13 11:22:20-0400 [xbox] DEBUG: Crawled (200) <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…; (referer: None)
2013-03-13 11:22:20-0400 [xbox] DEBUG: Filtered offsite request to 'forums.xbox.com': <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…;
2013-03-13 11:22:20-0400 [xbox] INFO: Closing spider (finished)
2013-03-13 11:22:20-0400 [xbox] INFO: Dumping spider stats
As a first tweak, you need to modify your first rule by putting a "." at the start of the regex, as follows. I also changed the start url to the actual first page of the forum.
start_urls= [
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx",
]
rules = (
    Rule(SgmlLinkExtractor(allow=('./t/\d+')), callback="parse_thread", follow=True),
    # "?" is escaped so that it matches the literal "?" in the page URL
    Rule(SgmlLinkExtractor(allow=('./310.aspx\?PageIndex=\d+')), ),
)
I've updated the rules so that the spider now crawls all of the pages in the thread.
EDIT: I've found a typo that may be causing an issue, and I've fixed the date xpath.
item['content']=hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
item['date']=hxs.select("(//div[@class='post-author'])[1]//a[@class='internal-link view-post']/text()").extract()
The first line above says "hxs.selec" and should be "hxs.select". I changed that and could now see content being scraped. Through trial and error (I'm a bit rubbish with XPaths), I've managed to get the date of the first post (i.e. the date the thread was created), so this should all work now.
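A quick way to sanity-check allow patterns like the ones in the rules before running the spider is to try them against sample URLs with Python's re module. This is only a sketch: the URLs are hypothetical ones modelled on the start URL (the thread id is made up), and it assumes, as far as I know is the case, that link extractors apply allow patterns with a search rather than a full match. It also shows why a literal "?" in a URL has to be written as "\?" in the pattern.

```python
import re

# Patterns in the style of the rules above. The escaped variant is an
# assumption about what the pagination rule is meant to match.
thread_pattern = re.compile(r"./t/\d+")
unescaped_page = re.compile(r"./310.aspx?PageIndex=\d+")
escaped_page = re.compile(r"./310\.aspx\?PageIndex=\d+")

# Hypothetical URLs modelled on the start URL; the thread id is made up.
base = "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f"
thread_url = base + "/310/t/12345.aspx"
page_url = base + "/310.aspx?PageIndex=2"

thread_hit = thread_pattern.search(thread_url) is not None
# Unescaped, "x?" means "an optional x", so the literal "?" is never matched.
unescaped_hit = unescaped_page.search(page_url) is not None
escaped_hit = escaped_page.search(page_url) is not None

print(thread_hit, unescaped_hit, escaped_hit)
```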