I wrote a spider class to crawl eventbrite.com and collect data about events in the Urbana-Champaign area, but I got the error below. Can you tell me what the error is and how to correct it? I am new to Scrapy, so I am posting it here.
scrapy crawl eventbrite
2015-07-02 17:08:38 [scrapy] INFO: Scrapy 1.0.1 started (bot: tutorial)
2015-07-02 17:08:38 [scrapy] INFO: Optional features available: ssl, http11
2015-07-02 17:08:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2015-07-02 17:08:38 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-07-02 17:08:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-02 17:08:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-02 17:08:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-02 17:08:38 [scrapy] INFO: Enabled item pipelines:
2015-07-02 17:08:38 [scrapy] INFO: Spider opened
2015-07-02 17:08:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-02 17:08:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Error during info_callback
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
self._write(bytes)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
sent = self._tlsConnection.send(toSend)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1270, in send
result = _lib.SSL_write(self._ssl, buf, len(buf))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
return wrapped(connection, where, ret)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
transport = connection.get_app_data()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
return self._app_data
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'
2015-07-02 17:08:49 [twisted] CRITICAL: Error during info_callback
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
self._write(bytes)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
sent = self._tlsConnection.send(toSend)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1270, in send
result = _lib.SSL_write(self._ssl, buf, len(buf))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
return wrapped(connection, where, ret)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
transport = connection.get_app_data()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
return self._app_data
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'
From cffi callback <function infoCallback at 0x106e78a28>:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1059, in infoCallback
connection.get_app_data().failVerification(f)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
return self._app_data
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
return getattr(self._socket, name)
AttributeError: 'NoneType' object has no attribute '_app_data'
Error during info_callback
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
self._write(bytes)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
sent = self._tlsConnection.send(toSend)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1270, in send
result = _lib.SSL_write(self._ssl, buf, len(buf))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
return wrapped(connection, where, ret)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
transport = connection.get_app_data()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
return self._app_data
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'
2015-07-02 17:08:49 [twisted] CRITICAL: Error during info_callback
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
self._write(bytes)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
sent = self._tlsConnection.send(toSend)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1270, in send
result = _lib.SSL_write(self._ssl, buf, len(buf))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
return wrapped(connection, where, ret)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
transport = connection.get_app_data()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
return self._app_data
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'
From cffi callback <function infoCallback at 0x103b1c9b0>:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1059, in infoCallback
connection.get_app_data().failVerification(f)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
return self._app_data
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
return getattr(self._socket, name)
AttributeError: 'NoneType' object has no attribute '_app_data'
2015-07-02 17:08:51 [scrapy] DEBUG: Crawled (200) <GET https://www.eventbrite.com/d/il--urbana/events/?crt=regular&sort=date> (referer: None)
2015-07-02 17:08:51 [scrapy] DEBUG: Crawled (200) <GET https://www.eventbrite.com/d/il--champaign/events/?crt=regular&sort=date> (referer: None)
2015-07-02 17:08:51 [scrapy] INFO: Closing spider (finished)
2015-07-02 17:08:51 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 519,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 55279,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 2, 11, 38, 51, 775192),
'log_count/CRITICAL': 2,
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 7, 2, 11, 38, 38, 972701)}
2015-07-02 17:08:51 [scrapy] INFO: Spider closed (finished)
Take a look at the answer here: https://stackoverflow.com/a/30203408/3941341
Your stack trace looks very similar to the ones in that question, and the related Scrapy bug is still open: https://github.com/scrapy/scrapy/issues/1227
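The UserWarning at the top of your log also points at the likely culprit: the missing service_identity module. A quick way to check whether that is the problem on your machine is to try the import with the same Python 2.7 interpreter that runs Scrapy; a minimal sketch (the pip command in the comment is the usual fix suggested by the warning, not anything Scrapy-specific):

# check_service_identity.py -- run with the same interpreter that runs "scrapy crawl"
try:
    import service_identity  # we only care that the import succeeds
    print("service_identity is installed; Twisted can do full TLS hostname verification")
except ImportError:
    # Usual fix, per the warning in the log: pip install service_identity
    # (together with a recent enough pyOpenSSL).
    print("service_identity is missing; install it with pip")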
Related
I need help. I want to write a crawler for a specific website (underminejournal). I want to pull the data from the site and print it to the console, because I mostly work in consoles and don't want to switch windows that often. I also want to push the data into a database (SQL etc. is no problem). But when I try to execute the crawler I just get the output below, and I don't find the tutorial very helpful:
2016-10-05 10:55:23 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine)
2016-10-05 10:55:23 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-10-05 10:55:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'}
2016-10-05 10:55:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-10-05 10:55:23 [boto] DEBUG: Retrieving credentials from metadata server.
2016-10-05 10:55:24 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-10-05 10:55:24 [boto] ERROR: Unable to read instance data, giving up
2016-10-05 10:55:24 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-10-05 10:55:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-10-05 10:55:24 [scrapy] INFO: Enabled item pipelines:
2016-10-05 10:55:24 [scrapy] INFO: Spider opened
2016-10-05 10:55:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-05 10:55:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-05 10:55:24 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442
2016-10-05 10:55:24 [scrapy] INFO: Closing spider (finished)
2016-10-05 10:55:24 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 710944),
'log_count/DEBUG': 2,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 704378)}
2016-10-05 10:55:24 [scrapy] INFO: Spider closed (finished)
My spider is this:
# -*- coding: utf-8 -*-
import scrapy


class JournalSpider(scrapy.Spider):
    name = "journal"
    allowed_domains = ["theunderminejournal.com"]
    start_urls = (
        'theunderminejournal.com/#eu/eredar/item/124442',
    )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'journal-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Does anyone have a hint?
EDIT: results
2016-10-05 11:21:35 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine)
2016-10-05 11:21:35 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-10-05 11:21:35 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'}
2016-10-05 11:21:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-10-05 11:21:35 [boto] DEBUG: Retrieving credentials from metadata server.
2016-10-05 11:21:36 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-10-05 11:21:36 [boto] ERROR: Unable to read instance data, giving up
ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442
Your URLs should always start with either http:// or https://.
start_urls = (
    'theunderminejournal.com/#eu/eredar/item/124442',
    # ^ should be:
    'http://theunderminejournal.com/#eu/eredar/item/124442',
)
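If you cannot guarantee that every URL already carries a scheme, one defensive option (my own variation, not part of the answer above) is to normalize the URLs in start_requests before building the requests:

import scrapy


class JournalSpider(scrapy.Spider):
    name = "journal"
    allowed_domains = ["theunderminejournal.com"]
    start_urls = (
        'theunderminejournal.com/#eu/eredar/item/124442',
    )

    def start_requests(self):
        # Prepend a scheme when one is missing, so Request() does not raise
        # "Missing scheme in request url". Note that everything after '#' is a
        # URL fragment and is never sent to the server.
        for url in self.start_urls:
            if not url.startswith(('http://', 'https://')):
                url = 'http://' + url
            yield scrapy.Request(url, dont_filter=True)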
I've been trying to get my head around Scrapy but I'm not having much luck getting beyond the basics. When I run my spider I get a "Spider error processing" message for the page and a NotImplementedError spider exception, but if I use scrapy fetch the HTML response is output, so it's not that the site is unavailable. The output is included below along with my items, spider and settings values.
Items.py
import scrapy


class MycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    files = scrapy.Field()
    file_urls = scrapy.Field()
mycrawler.py
import scrapy
from scrapy.spiders import Rule
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from librarycrawler.items import LibrarycrawlerItem


class CrawlSpider(scrapy.Spider):
    name = "mycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]

    #LinkExtractor(),
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True)
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url
        yield item
Settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = 'C:\MySpider\mycrawler\ExtractedText'
Terminal Output
C:\MySpider\mycrawler>scrapy crawl mycrawler -o mycrawler.csv
2016-06-03 16:11:47 [scrapy] INFO: Scrapy 1.0.3 started (bot: mycrawler)
2016-06-03 16:11:47 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-06-03 16:11:47 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mycrawler.spiders', 'FEED_URI': 'mycrawler.csv', 'DEPTH_LIMIT': 3, 'SPIDER_MODULES': ['mycrawler.spiders'], 'BOT_NAME': 'mycrawler', 'USER_AGENT': 'mycrawler(+http://www.example.com)', 'FEED_FORMAT': 'csv'}
2016-06-03 16:11:48 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-06-03 16:11:48 [boto] DEBUG: Retrieving credentials from metadata server.
2016-06-03 16:11:49 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "C:\Anaconda3\envs\scrapy\lib\site-packages\boto\utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 449, in _open
'_open', req)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-06-03 16:11:49 [boto] ERROR: Unable to read instance data, giving up
2016-06-03 16:11:49 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-03 16:11:49 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-03 16:11:49 [scrapy] INFO: Enabled item pipelines: FilesPipeline
2016-06-03 16:11:49 [scrapy] INFO: Spider opened
2016-06-03 16:11:49 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-03 16:11:49 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-03 16:11:49 [scrapy] DEBUG: Redirecting (meta refresh) to <GET http://myexample.com> from <GET http://myexample.com>
2016-06-03 16:11:50 [scrapy] DEBUG: Crawled (200) <GET http://myexample.com> (referer: None)
2016-06-03 16:11:50 [scrapy] ERROR: Spider error processing <GET http://www.example.com> (referer: None)
Traceback (most recent call last):
File "C:\Anaconda3\envs\scrapy\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Anaconda3\envs\scrapy\lib\site-packages\scrapy\spiders\__init__.py", line 76, in parse
raise NotImplementedError
NotImplementedError
2016-06-03 16:11:50 [scrapy] INFO: Closing spider (finished)
2016-06-03 16:11:50 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 449,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 23526,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 3, 15, 11, 50, 227000),
'log_count/DEBUG': 4,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2016, 6, 3, 15, 11, 49, 722000)}
2016-06-03 16:11:50 [scrapy] INFO: Spider closed (finished)
You need to subclass Scrapy's CrawlSpider if you want that functionality; for example, something like this:
from scrapy.item import Field, Item
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.spiders.crawl import CrawlSpider


class LibrarycrawlerItem(Item):
    title = Field()
    file_urls = Field()


class MyCrawlSpider(CrawlSpider):
    name = 'sample'
    allowed_domains = ['example.com', 'iana.org']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(), callback='scrape_page'),
    )

    def scrape_page(self, response):
        item = LibrarycrawlerItem()
        item['title'] = response.xpath('//title/text()').extract_first()
        item['file_urls'] = response.url
        yield item
To understand better how the rules work, please refer to the documentation. By the way, you can also use the LinkExtractor inside your parse method without subclassing CrawlSpider, as sketched below.
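A rough sketch of that second approach, with a plain scrapy.Spider that extracts and follows links manually in parse; the spider name and the fields it yields are only illustrative:

import scrapy
from scrapy.linkextractors import LinkExtractor


class ManualLinkSpider(scrapy.Spider):
    name = 'manual_links'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Scrape the current page...
        yield {
            'title': response.xpath('//title/text()').extract_first(),
            'file_urls': [response.url],
        }
        # ...then follow every link the extractor finds, back into parse().
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)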
I'm having trouble using Scrapy's image pipeline to retrieve images. From the error reports, I think I am feeding Scrapy the right image_urls. However, instead of downloading images from them, Scrapy returns the error: ValueError: Missing scheme in request url: h.
This is my first time using the image pipeline feature, so I suspect I'm making a simple mistake. All the same, I'd appreciate help solving it.
Below you'll find my spider, settings, items, and error output. They're not quite MWEs, but I think they're pretty simple and easy to understand all the same.
Spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ngamedallions.items import NgamedallionsItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join
from scrapy.http import Request
import re


class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [
        'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html'
    ]

    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')), callback='parse_CatalogRecord',
             follow=True),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE | re.MULTILINE | re.UNICODE)
        if r.search(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
            return CatalogRecord.load_item()
Settings:
BOT_NAME = 'ngamedallions'
SPIDER_MODULES = ['ngamedallions.spiders']
NEWSPIDER_MODULE = 'ngamedallions.spiders'
DOWNLOAD_DELAY = 3
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/home/tricia/Documents/Programing/Scrapy/ngamedallions/medallionimages'
Items:
import scrapy


class NgamedallionsItem(scrapy.Item):
    title = scrapy.Field()
    accession = scrapy.Field()
    inscription = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
Error Log:
2016-04-24 19:00:40 [scrapy] INFO: Scrapy 1.0.5.post2+ga046ce8 started (bot: ngamedallions)
2016-04-24 19:00:40 [scrapy] INFO: Optional features available: ssl, http11
2016-04-24 19:00:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-04-24 19:00:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-24 19:00:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-24 19:00:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-24 19:00:40 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-04-24 19:00:40 [scrapy] INFO: Spider opened
2016-04-24 19:00:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-24 19:00:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-24 19:00:40 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-04-24 19:00:44 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: None)
2016-04-24 19:00:48 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:48 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.a',
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg',
'inscription': u'around circumference: IOHANNES FRANCISCVS GON MA; around bottom circumference: MANTVA',
'title': u'Gianfrancesco Gonzaga di Rodigo, 1445-1496, Lord of Bozzolo, Sabbioneta, and Viadana 1478 [obverse]'}
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:48 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-04-24 19:00:51 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:52 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.b',
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg',
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
'title': u'House between Two Hills [reverse]'}
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:55 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html)
2016-04-24 19:01:02 [scrapy] INFO: Closing spider (finished)
2016-04-24 19:01:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1609,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 125593,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'dupefilter/filtered': 5,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 24, 23, 1, 2, 938181),
'log_count/DEBUG': 7,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2016, 4, 24, 23, 0, 40, 851598)}
2016-04-24 19:01:02 [scrapy] INFO: Spider closed (finished)
The TakeFirst processor is making image_urls a string when it should be a list.
Add:
CatalogRecord.image_urls_out = lambda v: v
EDIT:
This could also be:
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
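In context, either line just needs to run before load_item() is called. Here is a trimmed, illustrative stand-in for the original parse_CatalogRecord showing where the override goes; only the image_urls_out line is new, everything else is cut down from the question's callback:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity, TakeFirst

from ngamedallions.items import NgamedallionsItem


class NGASpiderFixed(object):
    # Trimmed stand-in for the original NGASpider; only the loader setup is shown.
    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        # Keep image_urls a list so ImagesPipeline builds one Request per URL
        # instead of iterating over the characters of a single string.
        CatalogRecord.image_urls_out = Identity()
        CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
        return CatalogRecord.load_item()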
I cannot schedule a spider run
The deploy seems to be OK:
Deploying to project "scraper" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "scraper", "version": "1418909664", "spiders": 3}
I schedule a new spider run:
curl http://localhost:6800/schedule.json -d project=scraper -d spider=spider
{"status": "ok", "jobid": "3f81a0e486bb11e49a6800163ed5ae93"}
but on scrapyd I get this error:
2014-12-18 14:39:12+0100 [-] Process started: project='scraper' spider='spider' job='3f81a0e486bb11e49a6800163ed5ae93' pid=28565 log='/usr/scrapyd/logs/scraper/spider/3f81a0e486bb11e49a6800163ed5ae93.log' items='/usr/scrapyd/items/scraper/spider/3f81a0e486bb11e49a6800163ed5ae93.jl'
2014-12-18 14:39:13+0100 [Launcher,28565/stderr] Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/local/lib/python2.7/dist-packages/scrapyd/runner.py", line 39, in <module>
2014-12-18 14:39:13+0100 [Launcher,28565/stderr] main()
File "/usr/local/lib/python2.7/dist-packages/scrapyd/runner.py", line 36, in main
execute()
File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 143, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 89, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 150, in _run_command
cmd.run(args, opts)
File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 58, in run
spider = crawler.spiders.create(spname, **opts.spargs)
2014-12-18 14:39:13+0100 [Launcher,28565/stderr] File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermanager.py", line 48, in create
return spcls(**spider_kwargs)
File "build/bdist.linux-x86_64/egg/scraper/spiders/spider.py", line 104, in __init__
File "/usr/lib/python2.7/os.py", line 157, in makedirs
mkdir(name, mode)
OSError: [Errno 20] Not a directory: '/tmp/scraper-1418909944-dKTRZI.egg/logs/'
2014-12-18 14:39:14+0100 [-] Process died: exitstatus=1 project='scraper'
Any ideas? :(
You are trying to create a directory inside an egg.
OSError: [Errno 20] Not a directory: '/tmp/scraper-1418909944-dKTRZI ---->.egg<----- /logs/'
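The traceback shows the spider's __init__ calling os.makedirs() on a path like 'logs/' that resolves inside the deployed .egg, which is a zip file rather than a real directory. One way around it is to build the log path from an absolute, writable location instead; a minimal sketch, where the choice of tempfile.gettempdir() and the attribute name are my own assumptions, not something scrapyd prescribes:

import os
import tempfile

import scrapy


class Spider(scrapy.Spider):
    name = 'spider'

    def __init__(self, *args, **kwargs):
        super(Spider, self).__init__(*args, **kwargs)
        # Write logs to an absolute, writable directory instead of 'logs/'
        # relative to the code, which points inside the .egg when the project
        # is deployed via scrapyd.
        self.log_dir = os.path.join(tempfile.gettempdir(), 'scraper-logs')
        if not os.path.isdir(self.log_dir):
            os.makedirs(self.log_dir)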
I'm following the Scrapy tutorial documentation at http://media.readthedocs.org/pdf/scrapy/0.14/scrapy.pdf and I've verified that items.py and dmoz_spider.py are typed (not cut & pasted) correctly.
The first "hmmm..." part for me was this instruction:
This is the code for our first Spider; save it in a file named dmoz_spider.py under the dmoz/spiders directory
I'm using the latest version of Ubuntu and there wasn't a dmoz folder created, so I've put this code into ~/tutorial/tutorial/spiders. (Was this my first error?)
So here's my dmoz_spider.py script:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

def parse(self, response):
    filename = response.url.split("/")[-2]
    open(filename, 'wb').write(response.body)
In my terminal I type
scrapy crawl dmoz
And I get this:
2012-10-08 13:20:22-0700 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: tutorial)
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Enabled item pipelines:
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-10-08 13:20:22-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-10-08 13:20:22-0700 [dmoz] INFO: Spider opened
2012-10-08 13:20:22-0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-10-08 13:20:22-0700 [dmoz] ERROR: Spider error processing <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
self._startRunCallbacks(result)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 62, in parse
raise NotImplementedError
exceptions.NotImplementedError:
2012-10-08 13:20:22-0700 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2012-10-08 13:20:22-0700 [dmoz] ERROR: Spider error processing <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
self._startRunCallbacks(result)
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 62, in parse
raise NotImplementedError
exceptions.NotImplementedError:
2012-10-08 13:20:22-0700 [dmoz] INFO: Closing spider (finished)
2012-10-08 13:20:22-0700 [dmoz] INFO: Spider closed (finished)
In my searching, I saw that someone else had said twisted probably wasn't installed... but wouldn't it be installed if I used the Ubuntu package installer for Scrapy?
Thanks in advance!
The parse method in BaseSpider is getting called instead of yours because you have not correctly overridden it. Your indentation is wrong, so parse is declared as a function outside the DmozSpider class. Welcome to Python :)
It has nothing to do with Twisted; Twisted appears in the tracebacks, so it's clearly installed.
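Concretely, def parse and its body need to be indented one level so that parse becomes a method of DmozSpider and actually overrides BaseSpider.parse; something like:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):  # indented inside the class
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)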