I am facing a weird issue while trying to crawl a particular site.
If I use BaseSpider to crawl some pages, the code works perfectly, but if I change the code to use CrawlSpider, the spider completes without any errors yet without crawling anything.
The following piece of code works fine
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log
class hushBabiesSpider(BaseSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com"
    ]

    def parse(self, response):
        print response.body
        print "Inside parse Item"
        return []
The following piece of code does not work
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.loader import XPathItemLoader
from dirbot.items import Website
from urlparse import urlparse
from scrapy import log
class hushBabiesSpider(CrawlSpider):
    name = "hushbabies"
    #download_delay = 10
    allowed_domains = ["hushbabies.com"]
    start_urls = [
        "http://www.hushbabies.com/category/toys-playgear-bath-bedtime.html",
        "http://www.hushbabies.com/category/mommy-newborn.html",
        "http://www.hushbabies.com"
    ]
    rules = (
        Rule(SgmlLinkExtractor(allow=()),
             'parseItem',
             follow=True,
        ),
    )

    def parseItem(self, response):
        print response.body
        print "Inside parse Item"
        return []
The output from the Scrapy run is as follows
scrapy crawl hushbabies
2012-07-23 18:50:37+0000 [scrapy] INFO: Scrapy 0.15.1-198-g831a450 started (bot: SKBot)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, WebService, CoreStats, MemoryUsage, SpiderState, CloseSpider
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Enabled item pipelines: SQLStorePipeline
2012-07-23 18:50:37+0000 [hushbabies] INFO: Spider opened
2012-07-23 18:50:37+0000 [hushbabies] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-23 18:50:37+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-23 18:50:37+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/robots.txt> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] DEBUG: Crawled (200) <GET http://www.hushbabies.com/category/mommy-newborn.html> (referer: None)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Closing spider (finished)
2012-07-23 18:50:39+0000 [hushbabies] INFO: Dumping spider stats:
{'downloader/request_bytes': 634,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 44395,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 23, 18, 50, 39, 674965),
'scheduler/memory_enqueued': 2,
'start_time': datetime.datetime(2012, 7, 23, 18, 50, 37, 700711)}
2012-07-23 18:50:39+0000 [hushbabies] INFO: Spider closed (finished)
2012-07-23 18:50:39+0000 [scrapy] INFO: Dumping global stats:
{'memusage/max': 27820032, 'memusage/startup': 27652096}
Changing the site from hushbabies.com to another one makes the code work.
There seems to be a problem in sgmllib, the SGML parser underlying SgmlLinkExtractor.
The following code returns zero links:
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>> fetch('http://www.hushbabies.com/')
>>> len(SgmlLinkExtractor().extract_links(response))
0
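A quick way to confirm that the problem is sgmllib's handling of this particular page, rather than the crawl setup, is to parse the same response with a more tolerant HTML parser and count the anchors it sees. A rough sketch run from the same scrapy shell session, assuming lxml is installed (outputs omitted):

>>> import lxml.html
>>> doc = lxml.html.fromstring(response.body)  # lxml's parser is far more forgiving than sgmllib
>>> len(doc.xpath('//a/@href'))                # anchors visible to lxml on this page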
You can try an alternative link extractor from Slybot, which depends on Scrapely:
>>> from slybot.linkextractor import LinkExtractor
>>> from scrapely.htmlpage import HtmlPage
>>> p = HtmlPage(body=response.body_as_unicode())
>>> sum(1 for _ in LinkExtractor().links_to_follow(p))
314
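If that route works for you, one option is to skip CrawlSpider's rules entirely and yield the follow-up requests yourself from a plain BaseSpider callback. This is only a sketch, assuming slybot and scrapely are installed; the class name is illustrative and the yielded link objects are assumed to expose a url attribute, so adjust to your slybot version:

from urlparse import urljoin

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapely.htmlpage import HtmlPage
from slybot.linkextractor import LinkExtractor


class HushBabiesSpider(BaseSpider):
    name = "hushbabies"
    allowed_domains = ["hushbabies.com"]
    start_urls = ["http://www.hushbabies.com"]
    link_extractor = LinkExtractor()

    def parse(self, response):
        # Hand the page to slybot's extractor instead of SgmlLinkExtractor.
        page = HtmlPage(response.url, body=response.body_as_unicode())
        for link in self.link_extractor.links_to_follow(page):
            # link.url is assumed here; OffsiteMiddleware still filters out
            # anything outside allowed_domains.
            yield Request(urljoin(response.url, link.url), callback=self.parse)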
Related
I'm trying to figure out how Scrapy works and to use it to find information on forums.
items.py
import scrapy
class BodybuildingItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    pass
spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from bodybuilding.items import BodybuildingItem
class BodyBuildingSpider(BaseSpider):
    name = "bodybuilding"
    allowed_domains = ["forum.bodybuilding.nl"]
    start_urls = [
        "https://forum.bodybuilding.nl/fora/supplementen.22/"
    ]

    def parse(self, response):
        responseSelector = Selector(response)
        for sel in responseSelector.css('li.past.line.event-item'):
            item = BodybuildingItem()
            item['title'] = sel.css('a.data-previewUrl::text').extract()
            yield item
The forum I'm trying to get the post titles from in this example is this: https://forum.bodybuilding.nl/fora/supplementen.22/
However I keep getting no results:
2017-10-07 00:42:28 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: bodybuilding)
2017-10-07 00:42:28 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bodybuilding.spiders', 'SPIDER_MODULES': ['bodybuilding.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'bodybuilding'}
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.corestats.CoreStats']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-07 00:42:28 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-10-07 00:42:28 [scrapy.core.engine] INFO: Spider opened
2017-10-07 00:42:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-07 00:42:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://forum.bodybuilding.nl/robots.txt> (referer: None)
2017-10-07 00:42:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://forum.bodybuilding.nl/fora/supplementen.22/> (referer: None)
2017-10-07 00:42:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-07 00:42:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 469,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 22878,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 6, 22, 42, 29, 223305),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'memusage/max': 31735808,
 'memusage/startup': 31735808,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 10, 6, 22, 42, 28, 816043)}
2017-10-07 00:42:29 [scrapy.core.engine] INFO: Spider closed (finished)
I have been following the guide here: http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html
Update 1:
As suggested, I updated my code to the newer style, but it didn't change the outcome:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from bodybuilding.items import BodybuildingItem
class BodyBuildingSpider(BaseSpider):
    name = "bodybuilding"
    allowed_domains = ["forum.bodybuilding.nl"]
    start_urls = [
        "https://forum.bodybuilding.nl/fora/supplementen.22/"
    ]

    def parse(self, response):
        for sel in response.css('li.past.line.event-item'):
            item = BodybuildingItem()
            item['title'] = sel.css('a.data-previewUrl::text').extract_first()
            yield item
Last update with fix
After some good help I finally got it working with this spider:
import scrapy
class BlogSpider(scrapy.Spider):
    name = 'bodybuilding'
    start_urls = ['https://forum.bodybuilding.nl/fora/supplementen.22/']

    def parse(self, response):
        for title in response.css('h3.title'):
            yield {'title': title.css('a::text').extract_first()}
        next_page_url = response.xpath("//a[text()='Volgende >']/@href").extract_first()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse)
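Assuming the spider sits inside a Scrapy project, it can be run with the built-in feed export to save the collected titles; the output filename here is just an example:
scrapy crawl bodybuilding -o titles.json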
You should use response.css('li.past.line.event-item') directly; there is no need for responseSelector = Selector(response).
Also, the CSS selector you are using, li.past.line.event-item, is no longer valid, so you need to update it based on the current web page first.
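A quick way to check which selectors still match is a scrapy shell session against the forum URL before editing the spider; a small sketch (outputs omitted):

>>> # started with: scrapy shell https://forum.bodybuilding.nl/fora/supplementen.22/
>>> response.css('li.past.line.event-item')           # old selector, now matches nothing
>>> response.css('h3.title a::text').extract_first()  # thread titles, as used in the fixed spider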
To get the next page URL you can use
>>> response.css("a.text::attr(href)").extract_first()
'fora/supplementen.22/page-2'
Then use response.follow to follow this relative URL.
Edit-2: Next Page processing correction
The previous edit didn't work, because on the next page that selector matches the previous page's URL, so you need to use the following instead:
next_page_url = response.xpath("//a[text()='Volgende >']/@href").extract_first()
if next_page_url:
    yield response.follow(next_page_url, callback=self.parse)
Edit-1: Next Page processing
next_page_url = response.css("a.text::attr(href)").extract_first()
if next_page_url:
    yield response.follow(next_page_url, callback=self.parse)
I've been trying to get my head around Scrapy, but I'm not having much luck getting beyond the basics. When I run my spider, I get a "Spider error processing" message and a NotImplementedError spider exception, yet if I use scrapy fetch the HTML response is output, so it's not that the site is unavailable. The output is included below along with my items, spider, and settings values.
Items.py
class MycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    files = scrapy.Field()
    file_urls = scrapy.Field()
mycrawler.py
import scrapy
from scrapy.spiders import Rule
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from librarycrawler.items import LibrarycrawlerItem
class CrawlSpider(scrapy.Spider):
    name = "mycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]
    #LinkExtractor(),
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True)
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url
        yield item
Settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = 'C:\MySpider\mycrawler\ExtractedText'
Terminal Output
[scrapy] C:\MySpider\mycrawler>scrapy crawl mycrawler -o mycrawler.csv
2016-06-03 16:11:47 [scrapy] INFO: Scrapy 1.0.3 started (bot: mycrawler)
2016-06-03 16:11:47 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-06-03 16:11:47 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mycrawler.spiders', 'FEED_URI': 'mycrawler.csv', 'DEPTH_LIMIT': 3, 'SPIDER_MODULES': ['mycrawler.spiders'], 'BOT_NAME': 'mycrawler', 'USER_AGENT': 'mycrawler(+http://www.example.com)', 'FEED_FORMAT': 'csv'}
2016-06-03 16:11:48 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-06-03 16:11:48 [boto] DEBUG: Retrieving credentials from metadata server.
2016-06-03 16:11:49 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "C:\Anaconda3\envs\scrapy\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-06-03 16:11:49 [boto] ERROR: Unable to read instance data, giving up
2016-06-03 16:11:49 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-03 16:11:49 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-03 16:11:49 [scrapy] INFO: Enabled item pipelines: FilesPipeline
2016-06-03 16:11:49 [scrapy] INFO: Spider opened
2016-06-03 16:11:49 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-03 16:11:49 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-03 16:11:49 [scrapy] DEBUG: Redirecting (meta refresh) to <GET http://myexample.com> from <GET http://myexample.com>
2016-06-03 16:11:50 [scrapy] DEBUG: Crawled (200) <GET http://myexample.com> (referer: None)
2016-06-03 16:11:50 [scrapy] ERROR: Spider error processing <GET http://www.example.com> (referer: None)
Traceback (most recent call last):
  File "C:\Anaconda3\envs\scrapy\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Anaconda3\envs\scrapy\lib\site-packages\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2016-06-03 16:11:50 [scrapy] INFO: Closing spider (finished)
2016-06-03 16:11:50 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 449,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 23526,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 3, 15, 11, 50, 227000),
'log_count/DEBUG': 4,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2016, 6, 3, 15, 11, 49, 722000)}
2016-06-03 16:11:50 [scrapy] INFO: Spider closed (finished)
You need to subclass Scrapy's CrawlSpider if you want that functionality, for example something like this:
from scrapy.item import Field, Item
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.spiders.crawl import CrawlSpider
class LibrarycrawlerItem(Item):
    title = Field()
    file_urls = Field()


class MyCrawlSpider(CrawlSpider):
    name = 'sample'
    allowed_domains = ['example.com', 'iana.org']
    start_urls = ['http://www.example.com']
    rules = (
        Rule(LinkExtractor(), callback='scrape_page'),
    )

    def scrape_page(self, response):
        item = LibrarycrawlerItem()
        item['title'] = response.xpath('//title/text()').extract_first()
        item['file_urls'] = response.url
        yield item
To better understand how the rules work, please refer to the documentation. By the way, you can also use a LinkExtractor inside your parse method without subclassing CrawlSpider, as sketched below.
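For completeness, here is a rough sketch of that last alternative: calling LinkExtractor directly from a plain Spider's parse method instead of relying on CrawlSpider rules. The class and field names are illustrative, not from the original project:

import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor


class ManualLinkSpider(scrapy.Spider):
    name = 'manual_links'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Scrape the current page...
        yield {
            'title': response.xpath('//title/text()').extract_first(),
            'file_urls': [response.url],
        }
        # ...then follow every extracted link back into this same callback.
        for link in LinkExtractor().extract_links(response):
            yield Request(link.url, callback=self.parse)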
I'm having trouble using Scrapy's image pipeline to retrieve images. From the error reports, I think I am feeding Scrapy the right image_urls. However, instead of downloading images from them, Scrapy returns the error: ValueError: Missing scheme in request url: h.
This is my first time using the image pipeline feature, so I suspect I'm making a simple mistake. All the same, I'd appreciate help solving it.
Below you'll find my spider, settings, items, and error output. They're not quite MWEs, but I think they're pretty simple and easy to understand all the same.
Spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ngamedallions.items import NgamedallionsItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join
from scrapy.http import Request
import re
class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [
        'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html'
    ]
    rules = (
        Rule(LinkExtractor(allow=('art-object-page.*', 'objects/*')),
             callback='parse_CatalogRecord',
             follow=True),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        if r.search(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
            return CatalogRecord.load_item()
Settings:
BOT_NAME = 'ngamedallions'
SPIDER_MODULES = ['ngamedallions.spiders']
NEWSPIDER_MODULE = 'ngamedallions.spiders'
DOWNLOAD_DELAY=3
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/home/tricia/Documents/Programing/Scrapy/ngamedallions/medallionimages'
Items:
import scrapy
class NgamedallionsItem(scrapy.Item):
    title = scrapy.Field()
    accession = scrapy.Field()
    inscription = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass
Error Log:
2016-04-24 19:00:40 [scrapy] INFO: Scrapy 1.0.5.post2+ga046ce8 started (bot: ngamedallions)
2016-04-24 19:00:40 [scrapy] INFO: Optional features available: ssl, http11
2016-04-24 19:00:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-04-24 19:00:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-24 19:00:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-24 19:00:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-24 19:00:40 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-04-24 19:00:40 [scrapy] INFO: Spider opened
2016-04-24 19:00:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-24 19:00:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-24 19:00:40 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-04-24 19:00:44 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: None)
2016-04-24 19:00:48 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:48 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.a',
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg',
'inscription': u'around circumference: IOHANNES FRANCISCVS GON MA; around bottom circumference: MANTVA',
'title': u'Gianfrancesco Gonzaga di Rodigo, 1445-1496, Lord of Bozzolo, Sabbioneta, and Viadana 1478 [obverse]'}
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:48 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-04-24 19:00:51 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:52 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.b',
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg',
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
'title': u'House between Two Hills [reverse]'}
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:55 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html)
2016-04-24 19:01:02 [scrapy] INFO: Closing spider (finished)
2016-04-24 19:01:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1609,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 125593,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'dupefilter/filtered': 5,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 24, 23, 1, 2, 938181),
'log_count/DEBUG': 7,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2016, 4, 24, 23, 0, 40, 851598)}
2016-04-24 19:01:02 [scrapy] INFO: Spider closed (finished)
The TakeFirst processor is making image_urls a string when it should be a list.
Add:
CatalogRecord.image_urls_out = lambda v: v
EDIT:
This could also be:
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
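In context, that override sits right next to the existing default_output_processor assignment inside parse_CatalogRecord. A trimmed sketch showing only the image_urls part (the other add_xpath calls stay unchanged):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity, TakeFirst

# Method of NGASpider, trimmed to the lines that matter here.
def parse_CatalogRecord(self, response):
    CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
    CatalogRecord.default_output_processor = TakeFirst()
    # Keep image_urls a list: with TakeFirst it collapses to a single string,
    # and ImagesPipeline then iterates over its characters, producing
    # Request('h') and the "Missing scheme in request url: h" error.
    CatalogRecord.image_urls_out = Identity()
    CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')
    return CatalogRecord.load_item()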
I'm trying to learn Scrapy. I manage to crawl some sites, but with others I fail. For example:
I try to crawl: http://www.polyhousestore.com/
I created a test spider that will get all the products in the page
http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60
When I run the spider, it doesn't find any products.
Can someone help me understand what I am doing wrong? Is it related to the CSS ::before and ::after?
And how can I make it work?
Spider code (that fails to get the products on the page)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
class PolySpider(scrapy.Spider):
    name = "poly"
    allowed_domains = ["polyhousestore.com"]
    start_urls = (
        'http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60',
    )

    def parse(self, response):
        sel = Selector(response)
        products = sel.xpath('/html/body/div[4]/div/div[5]/div/div/div/div/div[2]/div[3]/div[2]/div')
        items = []
        if not products:
            print '------------- No products from sel.xpath'
        else:
            print '------------- Found products ' + str(len(products))
The command line I run and its output
D:\scrapyProj\cmdProj>scrapy crawl poly
2016-01-19 10:23:16 [scrapy] INFO: Scrapy 1.0.3 started (bot: cmdProj)
2016-01-19 10:23:16 [scrapy] INFO: Optional features available: ssl, http11
2016-01-19 10:23:16 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cmdProj.spiders', 'SPIDER_MODULES': ['cmdProj.spiders'], 'BOT_NAME': 'cmdProj'}
2016-01-19 10:23:17 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-19 10:23:17 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-19 10:23:17 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-19 10:23:17 [scrapy] INFO: Enabled item pipelines:
2016-01-19 10:23:17 [scrapy] INFO: Spider opened
2016-01-19 10:23:17 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-19 10:23:17 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-19 10:23:17 [scrapy] DEBUG: Crawled (200) <GET http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60> (referer: None)
------------- No products from sel.xpath
2016-01-19 10:23:18 [scrapy] INFO: Closing spider (finished)
2016-01-19 10:23:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 254,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 16091,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 19, 8, 23, 18, 53000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 1, 19, 8, 23, 17, 376000)}
2016-01-19 10:23:18 [scrapy] INFO: Spider closed (finished)
Thanks for the help
When I look at the URL given in your question in Chrome, I see only 2 div tags directly under the body of the page, which means Scrapy sees those same div tags. However, your XPath tries to access the 4th, which does not exist, so your search returns no elements.
If I open a scrapy shell and count the div tags under the body, I get 2 in return:
[<Selector xpath='count(/html/body/div)' data=u'2.0'>]
The above is the same as:
len(response.xpath('/html/body/div'))
All this means you have to modify your query to get the products. If you need the 4 product elements on the page, try:
response.xpath('//div[@class="item-inner"]')
As you can see, you no longer need to wrap the response in a Selector with Scrapy; you can call .xpath() on the response directly.
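Putting that together, the parse callback could iterate over those product containers directly on the response. A minimal sketch; the inner selectors and field names are guesses, since the markup inside each product box isn't shown in the thread:

def parse(self, response):
    # Iterate over each product box instead of the long positional
    # /html/body/... path, which breaks as soon as the layout shifts.
    for product in response.xpath('//div[@class="item-inner"]'):
        yield {
            # First link inside the box: presumably the product name and URL.
            'name': product.xpath('.//a/text()').extract_first(),
            'url': product.xpath('.//a/@href').extract_first(),
        }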
I have coded a Scrapy spider using the loginform library (http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/), taking this post as a reference for dynamic webpages. This is the code:
class MySpider(CrawlSpider):
    login_user = 'myusername'
    login_pass = 'mypassword'
    name = "tv"
    allowed_domains = []
    start_urls = ["https://twitter.com/Acrocephalus/followers"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('https://twitter\.com/.*')), callback='parse_items', follow=True),
    )

    def parse(self, response):
        args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args, callback=self.after_login)

    def after_login(self, response):
        # you are logged in here

    def __init__(self):
        CrawlSpider.__init__(self)
        # use any browser you wish
        self.browser = webdriver.Firefox()

    def __del__(self):
        self.browser.close()

    def parse_item(self, response):
        item = TwitterVizItem()
        self.browser.get(response.url)
        # let JavaScript load
        time.sleep(3)
        # scrape dynamically generated HTML
        hxs = Selector(text=self.browser.page_source)
        item['Follower'] = hxs.select('(//span[@class="u-linkComplex-target"])[position()>2]').extract()
        item['Follows'] = hxs.select('(//span[@class="u-linkComplex-target"])[position()=1]').extract()
        return item
When I run it, I get this output:
2015-07-22 18:46:38 [scrapy] INFO: Scrapy 1.0.1.post1+g5b8c9e5 started (bot: TwitterViz)
2015-07-22 18:46:38 [scrapy] INFO: Optional features available: ssl, http11
2015-07-22 18:46:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TwitterViz.spiders', 'SPIDER_MODULES': ['TwitterViz.spiders'], 'BOT_NAME': 'TwitterViz'}
2015-07-22 18:46:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-22 18:46:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-22 18:46:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-22 18:46:38 [scrapy] INFO: Enabled item pipelines:
2015-07-22 18:46:38 [scrapy] INFO: Spider opened
2015-07-22 18:46:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-22 18:46:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> from <GET https://twitter.com/Acrocephalus/followers>
2015-07-22 18:46:39 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> (referer: None)
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/Acrocephalus/followers> from <POST https://twitter.com/sessions>
2015-07-22 18:46:40 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/Acrocephalus/followers> (referer: https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers)
2015-07-22 18:46:40 [scrapy] INFO: Closing spider (finished)
2015-07-22 18:46:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2506,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 63188,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 22, 16, 46, 40, 538785),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 7, 22, 16, 46, 38, 926487)}
2015-07-22 18:46:40 [scrapy] INFO: Spider closed (finished)
From the output I understand that it logs in properly; however, it doesn't scrape anything. I've tested the XPaths with Firefox's XPath Checker and they seem to work properly. Can anyone take a look and help me find what is wrong?