I'm a Python newbie, and I'm trying to use Scrapy for a project.
Scrapy 0.19 is installed on my CentOS box (Linux 2.6.32). I followed the instructions on the Scrapy documentation page, but found that the Scrapy shell could not find the 'sel' object. Here are my steps:
[root@localhost rpm]# scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
2014-03-02 06:33:23+0800 [scrapy] INFO: Scrapy 0.19.0 started (bot: scrapybot)
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Optional features available: ssl, http11, libxml2
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Enabled item pipelines:
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-02 06:33:23+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-02 06:33:23+0800 [default] INFO: Spider opened
2014-03-02 06:33:24+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html><head><base href="http://example.c'>
[s] item {}
[s] request <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] response <200 http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x3668ed0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>> sel.xpath('//title/text()')
Traceback (most recent call last):
File "<console>", line 1, in <module>
NameError: name 'sel' is not defined
>>>
Can anyone tell me how to solve this? Thanks in advance.
The sel object was added in version 0.20. When you run the shell command, it tells you which objects you can use; in your case it is hxs, which has similar behaviour:
>>> hxs.select('//title/text()')
You should try to read the documentation first. The selectors section explains pretty clearly how to use them based on your current version.
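For reference, a minimal sketch of the equivalent calls (the sel line only applies once you upgrade to 0.20 or later):
# Scrapy 0.19 shell: use the hxs (HtmlXPathSelector) object
>>> hxs.select('//title/text()').extract()
# Scrapy 0.20+ shell: the sel shortcut is also available
>>> sel.xpath('//title/text()').extract()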
I wrote this in settings.py after pip install scrapy-rotating-proxies:
ROTATING_PROXY_LIST = ['http://209.50.52.162:9050']
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
Then, if I run the spider like this: scrapy crawl test, it shows this:
2021-05-03 15:03:32 [rotating_proxies.middlewares] WARNING: No proxies available; marking all proxies as unchecked
2021-05-03 15:03:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-03 15:03:50 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 0, reanimated: 1, mean backoff time: 0s)
2021-05-03 15:03:53 [rotating_proxies.expire] DEBUG: Proxy <http://209.50.52.162:9050> is DEAD
2021-05-03 15:03:53 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.google.com> with another proxy (failed 3 times, max retries: 5)
How can I solve this issue?
Notice the message in your logs:
DEBUG: Proxy <http://209.50.52.162:9050> is DEAD
You need to add more proxies as shown in the documentation:
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
You can get a list of proxies from many sites.
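If you keep your proxies in a text file, one way to wire them in is to read the file from settings.py. This is a minimal sketch; proxies.txt is a hypothetical file with one proxy per line, and the middleware entries are the same ones you already have:
# settings.py -- sketch only; proxies.txt is a hypothetical file
# next to settings.py containing one proxy per line, e.g. http://host:port
from pathlib import Path

ROTATING_PROXY_LIST = [
    line.strip()
    for line in Path(__file__).with_name('proxies.txt').read_text().splitlines()
    if line.strip()
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}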
I'm looking at the log of a spider that has the RetryMiddleware active:
[scrapy.middleware] Enabled downloader middlewares:
['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware',
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'lari.downloadermiddlewares.DocumentExtraction',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy_crawlera.CrawleraMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']
And a response raising a ConnectionRefusedError (which is included in the ones that the RetryMiddleware should intercept):
[scrapy.core.scraper] Error downloading <GET xxx>
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
But the request is not actually retried, even by the time the spider is done crawling the rest (yes, the spider finished normally, with no cancellation or anything).
Am I missing something?
You can try a few things to handle this case:
Make sure that your internet connection is working properly.
Try different user agents.
Check your Scrapy version.
Also, make sure that you are replicating an exact copy of the browser request for the given URL (check headers, cookies, and so on).
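If the built-in retries are being exhausted (RetryMiddleware gives up after RETRY_TIMES attempts), another option is to attach an errback to your requests and re-schedule refused connections yourself. The sketch below uses a hypothetical spider name and URL, not your project code, and assumes a reasonably recent Scrapy where requests yielded from an errback are scheduled like callback output:
import scrapy
from twisted.internet.error import ConnectionRefusedError

class RetryRefusedSpider(scrapy.Spider):
    # hypothetical spider name and URL, for illustration only
    name = "retry_refused_example"
    start_urls = ["https://example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # the errback lets download-level failures reach spider code
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info("Got %s", response.url)

    def on_error(self, failure):
        # If the connection was refused, re-schedule the request manually;
        # replace(dont_filter=True) bypasses the duplicates filter.
        if failure.check(ConnectionRefusedError):
            yield failure.request.replace(dont_filter=True)
Note that without a retry counter in request.meta this can loop forever against a host that keeps refusing connections.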
I'm doing a very slow crawl on a moderately sized site in order to respect their guidance for web scraping. That means I need to be able to pause and resume my spider. So far, I've been enabling persistence when I launch the spider from the command line: scrapy crawl ngamedallions -s JOBDIR=pass1 -o items.csv.
Last night, that seemed to be doing the trick. I tested my spider and found that, when I shut it down cleanly, I could start it again and the crawl would resume where I left off. Today, though, the spider starts at the very beginning. I've checked the contents of my pass1 directory, and my requests.seen file has some content, even though 1600 lines seem a little light for the 3000 pages I crawled last night.
In any case, does anyone have a sense of where I'm going wrong as I try to resume my spider?
Update
I went ahead and manually skipped my spider forward to continue yesterday's crawl. When I tried closing and resuming the spider with the same command (see above), it worked. The start of my log reflects the spider recognizing that a crawl is being resumed.
2016-05-11 10:59:36 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-11 10:59:36 [scrapy] INFO: Optional features available: ssl, http11
2016-05-11 10:59:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'USER_AGENT': 'ngamedallions', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 10}
2016-05-11 10:59:36 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-11 10:59:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-11 10:59:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-11 10:59:36 [scrapy] INFO: Enabled item pipelines: NgamedallionsCsvPipeline, NgamedallionsImagesPipeline
2016-05-11 10:59:36 [scrapy] INFO: Spider opened
2016-05-11 10:59:36 [scrapy] INFO: Resuming crawl (3 requests scheduled)
When I try to resume the spider after a second graceful close (pause-resume-pause-resume), however, it starts the crawl over again. The beginning of the log in that case follows, but the main takeaway is that the spider does not report recognizing the crawl as resumed.
2016-05-11 11:19:10 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-11 11:19:10 [scrapy] INFO: Optional features available: ssl, http11
2016-05-11 11:19:10 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'USER_AGENT': 'ngamedallions', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 10}
2016-05-11 11:19:11 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-11 11:19:11 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-11 11:19:11 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-11 11:19:11 [scrapy] INFO: Enabled item pipelines: NgamedallionsCsvPipeline, NgamedallionsImagesPipeline
2016-05-11 11:19:11 [scrapy] INFO: Spider opened
Scrapy avoids crawling duplicate URLs; here and here you can find more information about it.
dont_filter (boolean) – indicates that this request should not be
filtered by the scheduler. This is used when you want to perform an
identical request multiple times, to ignore the duplicates filter. Use
it with care, or you will get into crawling loops. Defaults to False.
Also, look at this question
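If some requests genuinely should be re-crawled on resume (for example the start URL), the way past the duplicates filter is dont_filter=True. A minimal sketch with a hypothetical spider name and placeholder URL:
import scrapy

class ResumableSpider(scrapy.Spider):
    # hypothetical name and URL, for illustration only
    name = "resumable_example"
    start_urls = ["https://example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # dont_filter=True lets this request through the duplicates
            # filter even if it is already recorded in JOBDIR/requests.seen
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        pass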
When you run a Scrapy program, either from the Python shell or the command line, you get items printed to the screen such as the following:
c:\Python27\webscraper2\webscraper2>scrapy crawl mrcrawl2
2014-08-28 00:12:21+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: webscraper2)
2014-08-28 00:12:21+0100 [scrapy] INFO: Optional features available: ssl, http11
2014-08-28 00:12:21+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'webscraper2.spiders', 'SPIDER_MODULES': ['webscraper2.spiders'], 'BOT_NAME': 'webscraper2'}
2014-08-28 00:12:21+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-28 00:12:21+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-28 00:12:21+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-28 00:12:21+0100 [scrapy] INFO: Enabled item pipelines:
2014-08-28 00:12:21+0100 [mrcrawl2] INFO: Spider opened
2014-08-28 00:12:21+0100 [mrcrawl2] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-08-28 00:12:21+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-08-28 00:12:21+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-08-28 00:12:21+0100 [mrcrawl2] DEBUG: Crawled (200) <GET http://www.whoscored.com> (referer: None)
Is there a way to disable printing things to the screen that have not been scraped from a web page? Ideally, I only want the 'DEBUG: Crawled' line to print if the response status is not within the 200-300 range.
I have tried looking on Google for an answer, but I'm not really sure what to search for.
Thanks
Use the -L WARNING option to set the log level to WARNING:
scrapy crawl mrcrawl2 -L WARNING
It will print messages only when something goes wrong.
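If you would rather not pass the flag every time, the equivalent project-wide setting is LOG_LEVEL in settings.py (a one-line sketch):
# settings.py
LOG_LEVEL = 'WARNING'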
Here is my code. Can someone please help? For some reason the spider runs but does not actually crawl the forum threads. I am trying to extract all of the text in the forum threads for the specific forum in my start URL.
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from xbox.items import xboxItem
from scrapy.item import Item
from scrapy.conf import settings
class xboxSpider(CrawlSpider):
    name = "xbox"
    allowed_domains = ["forums.xbox.com"]
    start_urls = [
        "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/default.aspx",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['/t/\d+']), callback='parse_thread'),
        Rule(SgmlLinkExtractor(allow=('/t/new\?new_start=\d+',)))
    ]

    def parse_thread(self, response):
        hxs = HtmlXPathSelector(response)
        item = xboxItem()
        item['content'] = hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
        item['date'] = hxs.select("//span[@class='value']/text()").extract()
        return item
Log output:
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Enabled item pipelines:
2013-03-13 11:22:18-0400 [xbox] INFO: Spider opened
2013-03-13 11:22:18-0400 [xbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-13 11:22:18-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-03-13 11:22:20-0400 [xbox] DEBUG: Crawled (200) <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…; (referer: None)
2013-03-13 11:22:20-0400 [xbox] DEBUG: Filtered offsite request to 'forums.xbox.com': <GET forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/…;
2013-03-13 11:22:20-0400 [xbox] INFO: Closing spider (finished)
2013-03-13 11:22:20-0400 [xbox] INFO: Dumping spider stats
As a first tweak, you need to modify your first rule by putting a "." at the start of the regex, as follows. I also changed the start URL to the actual first page of the forum.
start_urls = [
    "http://forums.xbox.com/xbox_forums/xbox_360_games/e_k/gearsofwar3/f/310.aspx",
]
rules = (
    Rule(SgmlLinkExtractor(allow=('./t/\d+')), callback="parse_thread", follow=True),
    Rule(SgmlLinkExtractor(allow=('./310.aspx?PageIndex=\d+')), ),
)
I've updated the rules so that the spider now crawls all of the pages in the thread.
EDIT: I've found a typo that may be causing an issue, and I've fixed the date xpath.
item['content'] = hxs.selec("//div[@class='post-content user-defined-markup']/p/text()").extract()
item['date'] = hxs.select("(//div[@class='post-author'])[1]//a[@class='internal-link view-post']/text()").extract()
The line above says "hxs.selec" and should be "hxs.select". I changed that and could now see content being scraped. Through trial and error (I'm a bit rubbish with XPaths), I've managed to get the date of the first post (i.e. the date the thread was created), so this should all work now.
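Putting both fixes together, parse_thread would look roughly like this (a sketch; the XPaths are the ones from the answer above and may still need tweaking against the live forum markup):
def parse_thread(self, response):
    hxs = HtmlXPathSelector(response)
    item = xboxItem()
    # "selec" corrected to "select"
    item['content'] = hxs.select(
        "//div[@class='post-content user-defined-markup']/p/text()").extract()
    # date of the first post, i.e. when the thread was created
    item['date'] = hxs.select(
        "(//div[@class='post-author'])[1]"
        "//a[@class='internal-link view-post']/text()").extract()
    return item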