Conflicts When Generating Start Urls - scrapy

I'm working on retrieving information from the National Gallery of Art's online catalog. Due to the catalog's structure, I can't navigate by extracting and following links from entry to entry. Fortunately, each object in the collection has a predictable url. I want my spider to navigate the collection by generating start urls.
I have attempted to solve my problem by implementing the solution from this thread. Unfortunately, this seems to break another part of my spider. The error log reveals that my urls are being successfully generated, but they aren't being processed correctly. If I'm interpreting the log correctly—which I suspect I'm not—there is a conflict between the redefinition of the start_urls that allows me to generate the urls I need and the rules section of the spider. As things stand now, the spider also doesn't respect the number of pages that I ask it to crawl.
You'll find my spider and a typical error below. I appreciate any help you can offer.
Spider:
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 10
class NGASpider(CrawlSpider):
name = 'ngamedallions'
allowed_domains = ['nga.gov']
start_urls = [URL % starting_number]
rules = (
Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True))
def __init__(self):
self.page_number = starting_number
def start_requests(self):
for i in range (self.page_number, number_of_pages, -1):
yield Request(url = URL % i + ".html" , callback=self.parse)
def parse_CatalogRecord(self, response):
CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
CatalogRecord.default_output_processor = TakeFirst()
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
keywords = "medal|medallion"
r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
CatalogRecord.add_xpath('title', './/dl[#class="artwork-details"]/dt[#class="title"]/text()')
CatalogRecord.add_xpath('accession', './/dd[#class="accession"]/text()')
CatalogRecord.add_xpath('inscription', './/div[#id="inscription"]/p/text()')
CatalogRecord.add_xpath('image_urls', './/img[#class="mainImg"]/#src')
return CatalogRecord.load_item()
Typical Error:
2016-04-29 15:35:00 [scrapy] ERROR: Spider error processing <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1178.html> (referer: None)
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 73, in _parse_response
for request_or_item in self._requests_to_follow(response):
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 51, in _requests_to_follow
for n, rule in enumerate(self._rules):
AttributeError: 'NGASpider' object has no attribute '_rules'
Update in Resonse to eLRuLL's Solution
Simply removing def __init__ and start_urls allows my spider to crawl my generated urls. However, it also seems to prevent 'def parse_CatalogRecord(self, response)' from being applied. When I run the spider now, it only scrapes pages from outside the range of generated urls. My revised spider and log output follow below.
Spider:
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 1311
class NGASpider(CrawlSpider):
name = 'ngamedallions'
allowed_domains = ['nga.gov']
rules = (
Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True))
def start_requests(self):
self.page_number = starting_number
for i in range (self.page_number, number_of_pages, -1):
yield Request(url = URL % i + ".html" , callback=self.parse)
def parse_CatalogRecord(self, response):
CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
CatalogRecord.default_output_processor = TakeFirst()
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
keywords = "medal|medallion"
r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
CatalogRecord.add_xpath('title', './/dl[#class="artwork-details"]/dt[#class="title"]/text()')
CatalogRecord.add_xpath('accession', './/dd[#class="accession"]/text()')
CatalogRecord.add_xpath('inscription', './/div[#id="inscription"]/p/text()')
CatalogRecord.add_xpath('image_urls', './/img[#class="mainImg"]/#src')
return CatalogRecord.load_item()
Log:
2016-05-02 15:50:02 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-02 15:50:02 [scrapy] INFO: Optional features available: ssl, http11
2016-05-02 15:50:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-05-02 15:50:02 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-02 15:50:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-02 15:50:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-02 15:50:02 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-05-02 15:50:02 [scrapy] INFO: Spider opened
2016-05-02 15:50:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-02 15:50:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-02 15:50:02 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-05-02 15:50:02 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-05-02 15:50:05 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-05-02 15:50:05 [scrapy] DEBUG: File (uptodate): Downloaded image from <GET http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg> referred in <None>
2016-05-02 15:50:05 [scrapy] DEBUG: Scraped from <200 http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html>
{'accession': u'1942.9.163.b',
'image_urls': [u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'],
'images': [{'checksum': '9d5f2e30230aeec1582ca087bcde6bfa',
'path': 'full/3a692347183d26ffefe9ba0af80b0b6bf247fae5.jpg',
'url': 'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'}],
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
'title': u'House between Two Hills [reverse]'}
2016-05-02 15:50:05 [scrapy] INFO: Closing spider (finished)
2016-05-02 15:50:05 [scrapy] INFO: Stored json feed (1 items) in: items.json
2016-05-02 15:50:05 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 631,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 26324,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'dupefilter/filtered': 3,
'file_count': 1,
'file_status_count/uptodate': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 2, 19, 50, 5, 810570),
'item_scraped_count': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 5, 2, 19, 50, 2, 455508)}
2016-05-02 15:50:05 [scrapy] INFO: Spider closed (finished)

don't override the __init__ method if you are not going to call super.
Now, you don't really need to declare start_urls for your spider to work if you are going to use start_requests.
Just remove your def __init__ method and no need for start_urls to exist.
UPDATE
Ok my mistake, looks like CrawlSpider needs the start_urls attribute, so just create it instead of using the start_requests method:
start_urls = [URL % i + '.html' for i in range (starting_number, number_of_pages, -1)]
and remove start_requests

Related

Not sure how to use scrapy's itemLoaders

I'm trying to learn how to use scrappy's itemLoaders, can anybody told me what am I doing wrong???I would like to thank you in advance.
import scrapy
from items.items import ItemsItem
from scrapy.loader import ItemLoader
class ItemspiderSpider(scrapy.Spider):
name = 'itemspider'
allowed_domains = ['yellowpages.com']
start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']
def parse(self, response):
#create the loader using the response
l = ItemLoader(item=ItemsItem(), response=response)
#create a for loop
for listing in response.css('div.search-results.organic div.srp-listing'):
l.add_css('Name', listing.css('a.business-name span::text').extract())
l.add_css('Details', response.urljoin(listing.css('a.business-name::attr(href)')))
l.add_css('WebSite', listing.css('a.track-visit-website::attr(href)').extract_first())
l.add_css('Phones', listing.css('div.phones::text').extract())
yield l.load_item()
When I run the code I keep getting this error:
root#debian:~/Desktop/items/items/spiders# scrapy runspider itemspider.py -o item.csv
/usr/local/lib/python3.5/dist-packages/scrapy/spiderloader.py:37: UserWarning: There are several spiders with the same name:
ItemspiderSpider named 'itemspider' (in items.spiders.itemspider)
ItemspiderSpider named 'itemspider' (in items.spiders.itemspiderLog)
This can cause unexpected behavior.
warnings.warn(msg, UserWarning)
2017-07-04 16:33:20 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: items)
2017-07-04 16:33:20 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'items', 'FEED_FORMAT': 'csv', 'SPIDER_LOADER_WARN_ONLY': True, 'SPIDER_MODULES': ['items.spiders'], 'FEED_URI': 'item.csv', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'items.spiders'}
2017-07-04 16:33:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2017-07-04 16:33:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-04 16:33:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-04 16:33:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-07-04 16:33:20 [scrapy.core.engine] INFO: Spider opened
2017-07-04 16:33:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-04 16:33:20 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-04 16:33:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yellowpages.com/robots.txt> (referer: None)
2017-07-04 16:33:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL> (referer: None)
2017-07-04 16:33:24 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Desktop/items/items/spiders/itemspider.py", line 17, in parse
l.add_css('Details', response.urljoin(listing.css('a.business-name::attr(href)')))
File "/usr/local/lib/python3.5/dist-packages/scrapy/http/response/text.py", line 82, in urljoin
return urljoin(get_base_url(self), url)
File "/usr/lib/python3.5/urllib/parse.py", line 416, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "/usr/lib/python3.5/urllib/parse.py", line 112, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2017-07-04 16:33:24 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-04 16:33:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 503,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 52924,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 7, 4, 21, 33, 24, 121098),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 49471488,
'memusage/startup': 49471488,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/TypeError': 1,
'start_time': datetime.datetime(2017, 7, 4, 21, 33, 20, 705391)}
2017-07-04 16:33:24 [scrapy.core.engine] INFO: Spider closed (finished)
Not sure what is going this is actually the first time I tried to use the ItemLoaders
There are a few issues with your code:
response.urljoin() expects a single string as parameter, not a list. You are passing the result of listing.css(), which is a SelectorList. You can use response.urljoin(listing.css('a.business-name::attr(href)').extract_first())
you need to instantiate one item loader per loop iteration, otherwise, you're accumulating values for each field of a single yielded item
you are using .add_css() with some values (result of .extract...() calls. .add_css() needs a CSS selector string, not the result of a selector extraction. The CSS extraction will then be done by the item loader. Or, you can use .add_value() if you want to pass the "final" field value directly.
Here are 2 versions that should get you going:
import scrapy
from items.items import ItemsItem
from scrapy.loader import ItemLoader
class ItemspiderSpider(scrapy.Spider):
name = 'itemspider'
allowed_domains = ['yellowpages.com']
start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']
def parse(self, response):
for listing in response.css('div.search-results.organic div.srp-listing'):
# create the loader using the SELECTOR, inside the loop
l = ItemLoader(item=ItemsItem())
# use .add_value() since we pass the extraction result directly
l.add_value('Name', listing.css('a.business-name span::text').extract())
# pass a single value to response.urljoin()
l.add_value('Details',
response.urljoin(
listing.css('a.business-name::attr(href)').extract_first()
))
l.add_value('WebSite', listing.css('a.track-visit-website::attr(href)').extract_first())
l.add_value('Phones', listing.css('div.phones::text').extract())
yield l.load_item()
Or, using .add_css():
import scrapy
from items.items import ItemsItem
from scrapy.loader import ItemLoader
class ItemspiderSpider(scrapy.Spider):
name = 'itemspider'
allowed_domains = ['yellowpages.com']
start_urls = ['https://www.yellowpages.com/search?search_terms=handyman&geo_location_terms=Miami%2C+FL']
def parse(self, response):
for listing in response.css('div.search-results.organic div.srp-listing'):
# pass the 'listing' selector to the item loader
# so that CSS selection is relative to it
l = ItemLoader(ItemsItem(), selector=listing)
l.add_css('Name', 'a.business-name span::text')
l.add_css('Details', 'a.business-name::attr(href)')
l.add_css('WebSite', 'a.track-visit-website::attr(href)')
l.add_css('Phones', 'div.phones::text')
yield l.load_item()

Crawler Spider: Spider Error processing raises NotImpmentedError

I've been trying to get my head around scrapy but I'm not having much luck getting beyond the basics. When I run my spider i get a spider error processing the page and a spider exemption that isn't implemented yet but if I use scrapy fetch the html response is outputted so its not that the site is not available. The output is included below along with my Items, spider and settings values
Items.py
class MycrawlerItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = scrapy.Field()
files = scrapy.Field()
file_urls = scrapy.Field()
mycrawler.py
import scrapy
from scrapy.spiders import Rule
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from librarycrawler.items import LibrarycrawlerItem
class CrawlSpider(scrapy.Spider):
name = "mycrawler"
allowed_domains = ["example.com"]
start_urls = [
"http://www.example.com"
]
#LinkExtractor(),
rules = (
Rule(LinkExtractor(),callback='scrape_page', follow=True)
)
def scrape_page(self,response):
page_soup = BeautifulSoup(response.body,"html.parser")
ScrapedPageTitle = page_soup.title.get_text()
item = LibrarycrawlerItem()
item['title'] =ScrapedPageTitle
item['file_urls'] = response.url
yield item
Settings.py
ITEM_PIPELINES = {
'scrapy.pipelines.files.FilesPipeline':300,
}
FILES_STORE = 'C:\MySpider\mycrawler\ExtractedText'
Terminal Output
[scrapy] C:\MySpider\mycrawler>scrapy crawl mycrawler -o mycrawler.csv
2016-06-03 16:11:47 [scrapy] INFO: Scrapy 1.0.3 started (bot: mycrawler)
2016-06-03 16:11:47 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-06-03 16:11:47 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mycrawler.spiders', 'FEED_URI': 'mycrawler.csv', 'DEPTH_LIMIT': 3, 'SPIDER_MODULES': ['mycrawler.spiders'], 'BOT_NAME': 'mycrawler', 'USER_AGENT': 'mycrawler(+http://www.example.com)', 'FEED_FORMAT': 'csv'}
2016-06-03 16:11:48 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-06-03 16:11:48 [boto] DEBUG: Retrieving credentials from metadata server.
2016-06-03 16:11:49 [boto] ERROR: Caught exception reading instance dataTraceback (most recent call last):
File "C:\Anaconda3\envs\scrapy\lib\site-packages\boto\utils.py", line 210, inretry_url
r = opener.open(req, timeout=timeout)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 449, in _open
'_open', req)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-06-03 16:11:49 [boto] ERROR: Unable to read instance data, giving up
2016-06-03 16:11:49 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-03 16:11:49 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-03 16:11:49 [scrapy] INFO: Enabled item pipelines: FilesPipeline
2016-06-03 16:11:49 [scrapy] INFO: Spider opened
2016-06-03 16:11:49 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2016-06-03 16:11:49 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-03 16:11:49 [scrapy] DEBUG: Redirecting (meta refresh) to <GET http://myexample.com> from <GEThttp://myexample.com>
2016-06-03 16:11:50 [scrapy] DEBUG: Crawled (200) <GET http://myexample.com> (referer: None)
2016-06-03 16:11:50 [scrapy] ERROR: Spider error processing <GET http://www.example.com> (referer: None)
Traceback (most recent call last): File "C:\Anaconda3\envs\scrapy\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks current.result = callback(current.result, *args, **kw)
File "C:\Anaconda3\envs\scrapy\lib\site-packages\scrapy\spiders\__init__.py",line 76, in parse raise NotImplementedErrorNotImplementedError
2016-06-03 16:11:50 [scrapy] INFO: Closing spider (finished)
2016-06-03 16:11:50 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 449,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 23526,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 3, 15, 11, 50, 227000),
'log_count/DEBUG': 4,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2016, 6, 3, 15, 11, 49, 722000)}
2016-06-03 16:11:50 [scrapy] INFO: Spider closed (finished)
you need to sub-class from scrapy's CrawlSpider if you want that functionality, for example something like this:
from scrapy.item import Field, Item
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.spiders.crawl import CrawlSpider
class LibrarycrawlerItem(Item):
title = Field()
file_urls = Field()
class MyCrawlSpider(CrawlSpider):
name = 'sample'
allowed_domains = ['example.com', 'iana.org']
start_urls = ['http://www.example.com']
rules = (
Rule(LinkExtractor(), callback='scrape_page'),
)
def scrape_page(self,response):
item = LibrarycrawlerItem()
item['title'] = response.xpath('//title/text()').extract_first()
item['file_urls'] = response.url
yield item
To understand better the how the rules work please refer to the documenation, btw you can also use the LinkExtractor inside your parse method without sub-classing CrawlSpider.

Value Errors When Retrieving Images With Scrapy

I'm having trouble using Scrapy's image pipeline to retrieve images. From the error reports, I think I am feeding Scrapy the right image_urls. However, instead of downloading images from them, Scrapy returns the error: ValueError: Missing scheme in request url: h.
This is my first time using the image pipeline feature, so I suspect I'm making a simple mistake. All the same, I'd appreciate help solving it.
Below you'll find my spider, settings, items, and error output. They're not quite MWEs, but I think they're pretty simple and easy to understand all the same.
Spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ngamedallions.items import NgamedallionsItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join
from scrapy.http import Request
import re
class NGASpider(CrawlSpider):
name = 'ngamedallions'
allowed_domains = ['nga.gov']
start_urls = [
'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html'
]
rules = (
Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True
),)
def parse_CatalogRecord(self, response):
CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
CatalogRecord.default_output_processor = TakeFirst()
keywords = "medal|medallion"
r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
CatalogRecord.add_xpath('title', './/dl[#class="artwork-details"]/dt[#class="title"]/text()')
CatalogRecord.add_xpath('accession', './/dd[#class="accession"]/text()')
CatalogRecord.add_xpath('inscription', './/div[#id="inscription"]/p/text()')
CatalogRecord.add_xpath('image_urls', './/img[#class="mainImg"]/#src')
return CatalogRecord.load_item()
Settings:
BOT_NAME = 'ngamedallions'
SPIDER_MODULES = ['ngamedallions.spiders']
NEWSPIDER_MODULE = 'ngamedallions.spiders'
DOWNLOAD_DELAY=3
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/home/tricia/Documents/Programing/Scrapy/ngamedallions/medallionimages'
Items:
import scrapy
class NgamedallionsItem(scrapy.Item):
title = scrapy.Field()
accession = scrapy.Field()
inscription = scrapy.Field()
image_urls = scrapy.Field()
images = scrapy.Field()
pass
Error Log:
2016-04-24 19:00:40 [scrapy] INFO: Scrapy 1.0.5.post2+ga046ce8 started (bot: ngamedallions)
2016-04-24 19:00:40 [scrapy] INFO: Optional features available: ssl, http11
2016-04-24 19:00:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-04-24 19:00:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-24 19:00:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-24 19:00:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-24 19:00:40 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-04-24 19:00:40 [scrapy] INFO: Spider opened
2016-04-24 19:00:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-24 19:00:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-24 19:00:40 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-04-24 19:00:44 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: None)
2016-04-24 19:00:48 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:48 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.a',
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg',
'inscription': u'around circumference: IOHANNES FRANCISCVS GON MA; around bottom circumference: MANTVA',
'title': u'Gianfrancesco Gonzaga di Rodigo, 1445-1496, Lord of Bozzolo, Sabbioneta, and Viadana 1478 [obverse]'}
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:48 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-04-24 19:00:51 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:52 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.b',
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg',
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
'title': u'House between Two Hills [reverse]'}
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:55 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html)
2016-04-24 19:01:02 [scrapy] INFO: Closing spider (finished)
2016-04-24 19:01:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1609,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 125593,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'dupefilter/filtered': 5,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 24, 23, 1, 2, 938181),
'log_count/DEBUG': 7,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2016, 4, 24, 23, 0, 40, 851598)}
2016-04-24 19:01:02 [scrapy] INFO: Spider closed (finished)
The TakeFirst processor is making image_urls a string when it should be a list.
Add:
CatalogRecord.image_urls_out = lambda v: v
EDIT:
This could also be:
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()

How to scrape data in an authenticated session within a dynamic page?

I have coded a Scrapy spider using the loginform library (http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/) and taking this post as reference for dynamic webpages. This is the code:
class MySpider(CrawlSpider):
login_user = 'myusername'
login_pass = 'mypassword'
name = "tv"
allowed_domains = []
start_urls = ["https://twitter.com/Acrocephalus/followers"]
rules = (
Rule(SgmlLinkExtractor(allow=('https://twitter\.com/.*')), callback='parse_items', follow=True),
)
def parse(self, response):
args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
return FormRequest(url, method=method, formdata=args, callback=self.after_login)
def after_login(self, response):
# you are logged in here
def __init__(self):
CrawlSpider.__init__(self)
# use any browser you wish
self.browser = webdriver.Firefox()
def __del__(self):
self.browser.close()
def parse_item(self, response):
item = TwitterVizItem()
self.browser.get(response.url)
# let JavaScript Load
time.sleep(3)
# scrape dynamically generated HTML
hxs = Selector(text=self.browser.page_source)
item['Follower'] = hxs.select('(//span[#class="u-linkComplex-target"])[position()>2]').extract()
item['Follows'] = hxs.select('(//span[#class="u-linkComplex-target"])[position()=1]').extract()
return item
When I run it, I get this output:
2015-07-22 18:46:38 [scrapy] INFO: Scrapy 1.0.1.post1+g5b8c9e5 started (bot: TwitterViz)
2015-07-22 18:46:38 [scrapy] INFO: Optional features available: ssl, http11
2015-07-22 18:46:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TwitterViz.spiders', 'SPIDER_MODULES': ['TwitterViz.spiders'], 'BOT_NAME': 'TwitterViz'}
2015-07-22 18:46:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-22 18:46:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-22 18:46:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-22 18:46:38 [scrapy] INFO: Enabled item pipelines:
2015-07-22 18:46:38 [scrapy] INFO: Spider opened
2015-07-22 18:46:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-22 18:46:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> from <GET https://twitter.com/Acrocephalus/followers>
2015-07-22 18:46:39 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> (referer: None)
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/Acrocephalus/followers> from <POST https://twitter.com/sessions>
2015-07-22 18:46:40 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/Acrocephalus/followers> (referer: https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers)
2015-07-22 18:46:40 [scrapy] INFO: Closing spider (finished)
2015-07-22 18:46:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2506,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 63188,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 22, 16, 46, 40, 538785),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 7, 22, 16, 46, 38, 926487)}
2015-07-22 18:46:40 [scrapy] INFO: Spider closed (finished)
From the ouput I understand that it logs in properly. However, it doesn't scrape anything. I've tested the XPath with Firefox's XPath Checker and they seem to work properly. Can anyone have a look to see if we can find what is wrong?

Scrapy crawl resume does not crawl anything and just finishes

I start a crawl with a CrawlSpider Derived class, and pause it with Ctrl+C. When I execute the command again to resume it, it does not continue.
My start and resume command:
scrapy crawl mycrawler -s JOBDIR=crawls/test5_mycrawl
Scrapy creates the folder. The permissions are 777.
When I resume the crawl, it just outputs:
/home/adminuser/.virtualenvs/rg_harvest/lib/python2.7/site-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL tosupport it, Twisted can perform only rudimentary TLS client hostnameverification. Many valid certificate/hostname mappings may be rejected.
verifyHostname, VerificationError = _selectVerifyImplementation()
2014-11-21 11:05:10-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: rg_harvest_scrapy)
2014-11-21 11:05:10-0500 [scrapy] INFO: Optional features available: ssl, http11, django
2014-11-21 11:05:10-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'rg_harvest_scrapy.spiders', 'SPIDER_MODULES': ['rg_harvest_scrapy.spiders'], 'BOT_NAME': 'rg_harvest_scrapy'}
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled item pipelines: ValidateMandatory, TypeConversion, ValidateRange, ValidateLogic, RestegourmetImagesPipeline, SaveToDB
2014-11-21 11:05:10-0500 [mycrawler] INFO: Spider opened
2014-11-21 11:05:10-0500 [mycrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-11-21 11:05:10-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-11-21 11:05:10-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-11-21 11:05:10-0500 [mycrawler] DEBUG: Crawled (200) <GET http://eatsmarter.de/suche/rezepte> (referer: None)
2014-11-21 11:05:10-0500 [mycrawler] DEBUG: Filtered duplicate request: <GET http://eatsmarter.de/suche/rezepte?page=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2014-11-21 11:05:10-0500 [mycrawler] INFO: Closing spider (finished)
2014-11-21 11:05:10-0500 [mycrawler] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 225,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 19242,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'dupefilter/filtered': 29,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 11, 21, 16, 5, 10, 733196),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/disk': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/disk': 1,
'start_time': datetime.datetime(2014, 11, 21, 16, 5, 10, 528629)}
I have one start_url. Could this be the reason? My crawler uses one start_url and then follows the pagination by a Rule with a LinkExtractor, and calls the parse item by a specific url format:
My Spider code:
class MyCrawlSpiderBase(CrawlSpider):
name = 'test_spider'
testmode = True
crawl_start = datetime.utcnow().isoformat()
def __init__(self, testmode=True, urls=None, *args, **kwargs):
self.testmode = bool(int(testmode))
super(MyCrawlSpiderBase, self).__init__(*args, **kwargs)
def parse_item(self, response):
# Item Values
l = MyItemLoader(RecipeItem(), response=response)
l.replace_value('url', response.url)
l.replace_value('crawl_start', self.crawl_start)
return l.load_item()
class MyCrawlSpider(MyCrawlSpiderBase):
name = 'example_de'
allowed_domains = ['example.de']
start_urls = [
"http://example.de",
]
rules = (
Rule(
LinkExtractor(
allow=('/search/entry\?page=', )
)
),
Rule(
LinkExtractor(
allow=('/entry/[0-9A-z\-]{3,}$', ),
),
callback='parse_item'
),
)
def parse_item(self, response):
item = super(MyCrawlSpider, self).parse_item(response)
l = MyItemLoader(item=item, response=response)
l.replace_xpath("name", "//h1[#class='fn title']/text()")
(...)
return l.load_item()
Since your URL is always the same, the requests are most likely being filtered.
You can solve this in two ways:
In yoursettings.py file, add this line:
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
This replaces the default RFPDupeFilter with the BaseDupeFilter which will not filter any requests. This my not be what you want if you in fact want to filter out some other requests not relevant to this question.
You can get more involved in the process of creating requests, and create them with the parameter dont_filter=True, which will disable filtering on a per-requests basis. To achieve this, you could remove the start_urls and replace it with a method start_requests() that would yield the requests for parsing. Check out more info in the official documentation.
If you click Ctrl+C twice (force stop) it won't be able to be continued. Click Ctrl+C just once and wait.