Here is my Spider:
import scrapy
import urlparse
from scrapy.http import Request
class BasicSpider(scrapy.Spider):
name = "basic2"
allowed_domains = ["cnblogs"]
start_urls = (
'http://www.cnblogs.com/kylinlin/',
)
def parse(self, response):
next_site = response.xpath(".//*[#id='nav_next_page']/a/#href")
for url in next_site.extract():
yield Request(urlparse.urljoin(response.url,url))
item_selector = response.xpath(".//*[#class='postTitle']/a/#href")
for url in item_selector.extract():
yield Request(url=urlparse.urljoin(response.url, url),
callback=self.parse_item)
def parse_item(self, response):
print "+=====================>>test"
Here is the output:
2016-08-12 14:46:20 [scrapy] INFO: Spider opened
2016-08-12 14:46:20 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-12 14:46:20 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) http://www.cnblogs.com/robots.txt> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Crawled (200) http://www.cnblogs.com/kylinlin/> (referer: None)
2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': http://www.cnblogs.com/kylinlin/default.html?page=2>
2016-08-12 14:46:20 [scrapy] INFO: Closing spider (finished)
2016-08-12 14:46:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 445,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5113,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 420000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 11,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 8, 12, 6, 46, 20, 131000)}
2016-08-12 14:46:20 [scrapy] INFO: Spider closed (finished)
Why crawled pages are 0?
I cannot understand why there are no output like "+=====================>>test".
Could someone help me out?
2016-08-12 14:46:20 [scrapy] DEBUG: Filtered offsite request to 'www.cnblogs.com': http://www.cnblogs.com/kylinlin/default.html?page=2>
and your's is set to:
allowed_domains = ["cnblogs"]
which is not even a domain. It should be:
allowed_domains = ["cnblogs.com"]
Related
Looking at Twitter: www.twitter.com/twitter
You will see that the amount of followers are shown as 57.9M but if you hover over that value you will see the exact amount of followers.
This appears in the source as:
<span class="ProfileNav-value" data-count="57939946" data-is-compact="true">57.9M</span>
When I inspect this span on Chrome I use:
(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[3]
I am trying to extract just the attribute "data-count" using the above:
def parseTwitter(self, response):
company_name=response.meta['company_name']
l=ItemLoader(item=TwitterItem(), response=response)
l.add_value('company_name', company_name)
l.add_xpath('twitter_tweets', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[1]/text()")
l.add_xpath('twitter_following', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[2]/text()")
l.add_xpath('twitter_followers', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[3]/text()")
...but I'm not getting anything back:
2018-10-18 10:22:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-18 10:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/ADP> (referer: None)
2018-10-18 10:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/Workday> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/OracleHCM> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-18 10:22:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 892,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 199199,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 18, 10, 22, 16, 833691),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'memusage/max': 52334592,
'memusage/startup': 52334592,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2018, 10, 18, 10, 22, 7, 269320)}
SOLUTION:
As per pwinz suggestion below, I was trying to do a text value extract ".text()" from the attribute where simply #-ing the attribute should give you the value. My final - working - solution is:
def parseTwitter(self, response):
company_name=response.meta['company_name']
print('### ### ### Inside PARSE TWITTER ### ### ###')
l=ItemLoader(item=TwitterItem(), response=response)
l.add_value('company_name', company_name)
l.add_xpath('twitter_tweets', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[1]")
l.add_xpath('twitter_following', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[2]")
l.add_xpath('twitter_followers', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav-value']/#data-count)[3]")
yield l.load_item()
Its because data is manipulated with Javascript but Scrapy only downloads HTML but does not executes any JS/AJAX code.
When scraping with Scrapy, always disable Javascript in browser and then find what you want to scrape, and if its available, just use your selector/xpath, otherwise, inspect JS/AJAX calls on webspage to understand how it is loading data
So, to scrape number of follower
You can use following CSS Selector
.ProfileNav-item.ProfileNav-item--followers a
Scrapy code
item = {}
item["followers"] = response.css(".ProfileNav-item.ProfileNav-item--followers a").extract_first()
yield item
With respect to other answers, dynamic content is not the issue here. You are trying to get the text() from the data-count attribute. You should just be able to get the data from the #data-count.
Try this pattern:
l.add_xpath('twitter_tweets', "(//ul[#class='ProfileNav-list']/li/a/span[#class='ProfileNav
-value']/#data-count)[1]")
It worked for me.
I'm working on retrieving information from the National Gallery of Art's online catalog. Due to the catalog's structure, I can't navigate by extracting and following links from entry to entry. Fortunately, each object in the collection has a predictable url. I want my spider to navigate the collection by generating start urls.
I have attempted to solve my problem by implementing the solution from this thread. Unfortunately, this seems to break another part of my spider. The error log reveals that my urls are being successfully generated, but they aren't being processed correctly. If I'm interpreting the log correctly—which I suspect I'm not—there is a conflict between the redefinition of the start_urls that allows me to generate the urls I need and the rules section of the spider. As things stand now, the spider also doesn't respect the number of pages that I ask it to crawl.
You'll find my spider and a typical error below. I appreciate any help you can offer.
Spider:
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 10
class NGASpider(CrawlSpider):
name = 'ngamedallions'
allowed_domains = ['nga.gov']
start_urls = [URL % starting_number]
rules = (
Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True))
def __init__(self):
self.page_number = starting_number
def start_requests(self):
for i in range (self.page_number, number_of_pages, -1):
yield Request(url = URL % i + ".html" , callback=self.parse)
def parse_CatalogRecord(self, response):
CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
CatalogRecord.default_output_processor = TakeFirst()
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
keywords = "medal|medallion"
r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
CatalogRecord.add_xpath('title', './/dl[#class="artwork-details"]/dt[#class="title"]/text()')
CatalogRecord.add_xpath('accession', './/dd[#class="accession"]/text()')
CatalogRecord.add_xpath('inscription', './/div[#id="inscription"]/p/text()')
CatalogRecord.add_xpath('image_urls', './/img[#class="mainImg"]/#src')
return CatalogRecord.load_item()
Typical Error:
2016-04-29 15:35:00 [scrapy] ERROR: Spider error processing <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1178.html> (referer: None)
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
for x in result:
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 73, in _parse_response
for request_or_item in self._requests_to_follow(response):
File "/usr/lib/pymodules/python2.7/scrapy/spiders/crawl.py", line 51, in _requests_to_follow
for n, rule in enumerate(self._rules):
AttributeError: 'NGASpider' object has no attribute '_rules'
Update in Resonse to eLRuLL's Solution
Simply removing def __init__ and start_urls allows my spider to crawl my generated urls. However, it also seems to prevent 'def parse_CatalogRecord(self, response)' from being applied. When I run the spider now, it only scrapes pages from outside the range of generated urls. My revised spider and log output follow below.
Spider:
URL = "http://www.nga.gov/content/ngaweb/Collection/art-object-page.%d"
starting_number = 1312
number_of_pages = 1311
class NGASpider(CrawlSpider):
name = 'ngamedallions'
allowed_domains = ['nga.gov']
rules = (
Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True))
def start_requests(self):
self.page_number = starting_number
for i in range (self.page_number, number_of_pages, -1):
yield Request(url = URL % i + ".html" , callback=self.parse)
def parse_CatalogRecord(self, response):
CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
CatalogRecord.default_output_processor = TakeFirst()
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
keywords = "medal|medallion"
r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
CatalogRecord.add_xpath('title', './/dl[#class="artwork-details"]/dt[#class="title"]/text()')
CatalogRecord.add_xpath('accession', './/dd[#class="accession"]/text()')
CatalogRecord.add_xpath('inscription', './/div[#id="inscription"]/p/text()')
CatalogRecord.add_xpath('image_urls', './/img[#class="mainImg"]/#src')
return CatalogRecord.load_item()
Log:
2016-05-02 15:50:02 [scrapy] INFO: Scrapy 1.0.5.post4+g4b324a8 started (bot: ngamedallions)
2016-05-02 15:50:02 [scrapy] INFO: Optional features available: ssl, http11
2016-05-02 15:50:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-05-02 15:50:02 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-05-02 15:50:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-05-02 15:50:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-05-02 15:50:02 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-05-02 15:50:02 [scrapy] INFO: Spider opened
2016-05-02 15:50:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-02 15:50:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-02 15:50:02 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-05-02 15:50:02 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-05-02 15:50:05 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-05-02 15:50:05 [scrapy] DEBUG: File (uptodate): Downloaded image from <GET http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg> referred in <None>
2016-05-02 15:50:05 [scrapy] DEBUG: Scraped from <200 http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html>
{'accession': u'1942.9.163.b',
'image_urls': [u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'],
'images': [{'checksum': '9d5f2e30230aeec1582ca087bcde6bfa',
'path': 'full/3a692347183d26ffefe9ba0af80b0b6bf247fae5.jpg',
'url': 'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg'}],
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
'title': u'House between Two Hills [reverse]'}
2016-05-02 15:50:05 [scrapy] INFO: Closing spider (finished)
2016-05-02 15:50:05 [scrapy] INFO: Stored json feed (1 items) in: items.json
2016-05-02 15:50:05 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 631,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 26324,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'dupefilter/filtered': 3,
'file_count': 1,
'file_status_count/uptodate': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 2, 19, 50, 5, 810570),
'item_scraped_count': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 5, 2, 19, 50, 2, 455508)}
2016-05-02 15:50:05 [scrapy] INFO: Spider closed (finished)
don't override the __init__ method if you are not going to call super.
Now, you don't really need to declare start_urls for your spider to work if you are going to use start_requests.
Just remove your def __init__ method and no need for start_urls to exist.
UPDATE
Ok my mistake, looks like CrawlSpider needs the start_urls attribute, so just create it instead of using the start_requests method:
start_urls = [URL % i + '.html' for i in range (starting_number, number_of_pages, -1)]
and remove start_requests
I'm having trouble using Scrapy's image pipeline to retrieve images. From the error reports, I think I am feeding Scrapy the right image_urls. However, instead of downloading images from them, Scrapy returns the error: ValueError: Missing scheme in request url: h.
This is my first time using the image pipeline feature, so I suspect I'm making a simple mistake. All the same, I'd appreciate help solving it.
Below you'll find my spider, settings, items, and error output. They're not quite MWEs, but I think they're pretty simple and easy to understand all the same.
Spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ngamedallions.items import NgamedallionsItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join
from scrapy.http import Request
import re
class NGASpider(CrawlSpider):
name = 'ngamedallions'
allowed_domains = ['nga.gov']
start_urls = [
'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html'
]
rules = (
Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
follow=True
),)
def parse_CatalogRecord(self, response):
CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
CatalogRecord.default_output_processor = TakeFirst()
keywords = "medal|medallion"
r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
CatalogRecord.add_xpath('title', './/dl[#class="artwork-details"]/dt[#class="title"]/text()')
CatalogRecord.add_xpath('accession', './/dd[#class="accession"]/text()')
CatalogRecord.add_xpath('inscription', './/div[#id="inscription"]/p/text()')
CatalogRecord.add_xpath('image_urls', './/img[#class="mainImg"]/#src')
return CatalogRecord.load_item()
Settings:
BOT_NAME = 'ngamedallions'
SPIDER_MODULES = ['ngamedallions.spiders']
NEWSPIDER_MODULE = 'ngamedallions.spiders'
DOWNLOAD_DELAY=3
ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/home/tricia/Documents/Programing/Scrapy/ngamedallions/medallionimages'
Items:
import scrapy
class NgamedallionsItem(scrapy.Item):
title = scrapy.Field()
accession = scrapy.Field()
inscription = scrapy.Field()
image_urls = scrapy.Field()
images = scrapy.Field()
pass
Error Log:
2016-04-24 19:00:40 [scrapy] INFO: Scrapy 1.0.5.post2+ga046ce8 started (bot: ngamedallions)
2016-04-24 19:00:40 [scrapy] INFO: Optional features available: ssl, http11
2016-04-24 19:00:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-04-24 19:00:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-24 19:00:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-24 19:00:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-24 19:00:40 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-04-24 19:00:40 [scrapy] INFO: Spider opened
2016-04-24 19:00:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-24 19:00:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-24 19:00:40 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-04-24 19:00:44 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: None)
2016-04-24 19:00:48 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:48 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.a',
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg',
'inscription': u'around circumference: IOHANNES FRANCISCVS GON MA; around bottom circumference: MANTVA',
'title': u'Gianfrancesco Gonzaga di Rodigo, 1445-1496, Lord of Bozzolo, Sabbioneta, and Viadana 1478 [obverse]'}
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:48 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-04-24 19:00:51 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:52 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.b',
'image_urls': u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg',
'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
'title': u'House between Two Hills [reverse]'}
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:55 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html)
2016-04-24 19:01:02 [scrapy] INFO: Closing spider (finished)
2016-04-24 19:01:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1609,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 125593,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'dupefilter/filtered': 5,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 24, 23, 1, 2, 938181),
'log_count/DEBUG': 7,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'request_depth_max': 2,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2016, 4, 24, 23, 0, 40, 851598)}
2016-04-24 19:01:02 [scrapy] INFO: Spider closed (finished)
The TakeFirst processor is making image_urls a string when it should be a list.
Add:
CatalogRecord.image_urls_out = lambda v: v
EDIT:
This could also be:
CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()
I try to learn Scrapy and I manage to crawl some sites and other I fail for example:
I try to crawl: http://www.polyhousestore.com/
I created a test spider that will get all the products in the page
http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60
When I run the spider I get that it didn’t find any product.
Can someone help me understand what am I doing wrong, is it related to the CSS ::before and ::after ?
And how can I make it to work?
Spider code(that fail to get the products in the page)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
class PolySpider(scrapy.Spider):
name = "poly"
allowed_domains = ["polyhousestore.com"]
start_urls = (
'http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60',
)
def parse(self, response):
sel = Selector(response)
products = sel.xpath('/html/body/div[4]/div/div[5]/div/div/div/div/div[2]/div[3]/div[2]/div')
items = []
if not products:
print '------------- No products from sel.xpath'
else:
print '------------- Found products ' + str(len(products))
Command line that I run and the output
D:\scrapyProj\cmdProj>scrapy crawl poly
2016-01-19 10:23:16 [scrapy] INFO: Scrapy 1.0.3 started (bot: cmdProj)
2016-01-19 10:23:16 [scrapy] INFO: Optional features available: ssl, http11
2016-01-19 10:23:16 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cm
dProj.spiders', 'SPIDER_MODULES': ['cmdProj.spiders'], 'BOT_NAME': 'cmdProj'}
2016-01-19 10:23:17 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsol
e, LogStats, CoreStats, SpiderState
2016-01-19 10:23:17 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddl
eware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
eadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMidd
leware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-19 10:23:17 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddlewa
re, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-19 10:23:17 [scrapy] INFO: Enabled item pipelines:
2016-01-19 10:23:17 [scrapy] INFO: Spider opened
2016-01-19 10:23:17 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 i
tems (at 0 items/min)
2016-01-19 10:23:17 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-19 10:23:17 [scrapy] DEBUG: Crawled (200) <GET http://www.polyhousestore
.com/catalogsearch/result/?cat=&q=lc+60> (referer: None)
------------- No products from sel.xpath
2016-01-19 10:23:18 [scrapy] INFO: Closing spider (finished)
2016-01-19 10:23:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 254,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 16091,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 19, 8, 23, 18, 53000),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 1, 19, 8, 23, 17, 376000)}
2016-01-19 10:23:18 [scrapy] INFO: Spider closed (finished)
Thanks for the help
When I look at the URL given in your question with Chrome I see only 2 div tags in the body of the site. This means scrapy sees these div tags too. However you want to access the 4th, which does not exist so your search returns no elements.
If I open a scrapy shell and execute a count on the div tags on the body I get 2 in return:
[<Selector xpath='count(/html/body/div)' data=u'2.0'>]
This above is the same as
len(response.xpath('/html/body/div'))
All this means you have to modify your query to get all the products. If you need the 4 elements in the site, try:
response.xpath('//div[#class="item-inner"]')
As you can see you do not need to wrap the response into a selector with scrapy anymore.
I have coded a Scrapy spider using the loginform library (http://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/) and taking this post as reference for dynamic webpages. This is the code:
class MySpider(CrawlSpider):
login_user = 'myusername'
login_pass = 'mypassword'
name = "tv"
allowed_domains = []
start_urls = ["https://twitter.com/Acrocephalus/followers"]
rules = (
Rule(SgmlLinkExtractor(allow=('https://twitter\.com/.*')), callback='parse_items', follow=True),
)
def parse(self, response):
args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
return FormRequest(url, method=method, formdata=args, callback=self.after_login)
def after_login(self, response):
# you are logged in here
def __init__(self):
CrawlSpider.__init__(self)
# use any browser you wish
self.browser = webdriver.Firefox()
def __del__(self):
self.browser.close()
def parse_item(self, response):
item = TwitterVizItem()
self.browser.get(response.url)
# let JavaScript Load
time.sleep(3)
# scrape dynamically generated HTML
hxs = Selector(text=self.browser.page_source)
item['Follower'] = hxs.select('(//span[#class="u-linkComplex-target"])[position()>2]').extract()
item['Follows'] = hxs.select('(//span[#class="u-linkComplex-target"])[position()=1]').extract()
return item
When I run it, I get this output:
2015-07-22 18:46:38 [scrapy] INFO: Scrapy 1.0.1.post1+g5b8c9e5 started (bot: TwitterViz)
2015-07-22 18:46:38 [scrapy] INFO: Optional features available: ssl, http11
2015-07-22 18:46:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TwitterViz.spiders', 'SPIDER_MODULES': ['TwitterViz.spiders'], 'BOT_NAME': 'TwitterViz'}
2015-07-22 18:46:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-22 18:46:38 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-22 18:46:38 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-22 18:46:38 [scrapy] INFO: Enabled item pipelines:
2015-07-22 18:46:38 [scrapy] INFO: Spider opened
2015-07-22 18:46:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-22 18:46:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> from <GET https://twitter.com/Acrocephalus/followers>
2015-07-22 18:46:39 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers> (referer: None)
2015-07-22 18:46:39 [scrapy] DEBUG: Redirecting (302) to <GET https://twitter.com/Acrocephalus/followers> from <POST https://twitter.com/sessions>
2015-07-22 18:46:40 [scrapy] DEBUG: Crawled (200) <GET https://twitter.com/Acrocephalus/followers> (referer: https://twitter.com/login?redirect_after_login=%2FAcrocephalus%2Ffollowers)
2015-07-22 18:46:40 [scrapy] INFO: Closing spider (finished)
2015-07-22 18:46:40 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2506,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 63188,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 22, 16, 46, 40, 538785),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 7, 22, 16, 46, 38, 926487)}
2015-07-22 18:46:40 [scrapy] INFO: Spider closed (finished)
From the ouput I understand that it logs in properly. However, it doesn't scrape anything. I've tested the XPath with Firefox's XPath Checker and they seem to work properly. Can anyone have a look to see if we can find what is wrong?