I'm playing around with scraping info from Amazon, but it's giving me a hard time. My spider looks like this so far:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from amz_crawler.items import AmzItem  # project items module path assumed

class AmzCrawlerSpider(CrawlSpider):
    name = 'amz_crawler'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s?ie=UTF8&bbn=1267449011&page=1&rh=n%3A284507%2Cn%3A1055398%2Cn%3A%211063498%2Cn%3A1267449011%2Cn%3A3204211011']

    rules = (Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        category_name = Selector(response).xpath('//*[@id="nav-subnav"]/a[1]/text()').extract()[0]
        products = Selector(response).xpath('//div[@class="s-item-container"]')
        for product in products:
            item = AmzItem()
            item['title'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@title').extract()[0]
            item['url'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@href').extract()[0]
            request = scrapy.Request(item['url'], callback=self.parse_product)
            request.meta['item'] = item
            print "Crawl " + item["title"]
            print "Crawl " + item['url']
            yield request

    def parse_product(self, response):
        print("Parse Product")
        item = response.meta['item']
        sel = Selector(response)
        item['asin'] = sel.xpath('//td[@class="bucket"]/div/ul/li/b[contains(text(),"ASIN:")]/../text()').extract()[0]
        return item
There are two issues which I don't seem to understand:
'Parse Product' is never printed, so I assume the method parse_product is never executed, even though the 'Crawl ...' prints are displayed as expected.
Maybe it's about the Rules?
And then related to the rules:
It's only working for the first page of a category. The crawler doesn't follow links to the 2nd page of a category.
I assume that the pages are generated differently for Scrapy than for the browser? In the console I see a lot of 301 redirects:
2015-06-30 14:57:24+0800 [amz_crawler] DEBUG: Redirecting (301) to <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> from <http://www.amazon.com/gp/search/ref=sr_pg_1?sf=qz&fst=as%3Aoff&rh=n%3A2619533011%2Ck%3Apet+supplies%2Cp_n_date_first_available_absolute%3A2661609011%2Cp_72%3A2661618011&sort=date-desc-rank&keywords=pet+supplies&ie=UTF8&qid=1435312739>
2015-06-30 14:57:29+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> (referer: None)
2015-06-30 14:57:39+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Cp_72%3A2661618011> (referer: http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011)
Crawl 2015-06-30 14:57:39 Precious Cat Ultra Premium Clumping Cat Litter, 40 pound bag
Crawl http://www.amazon.com/Precious-Cat-Premium-Clumping-Litter/dp/B0009X29WK
What did I do wrong?
I am using Scrapy's XMLFeedSpider to crawl a sitemap and extract URLs, and only URLs.
The xml sitemap looks like this:
<url>
<loc>
https://www.example.com/american-muscle-5-pc-kit-box.html
</loc>
<lastmod>2020-10-14T15:40:02+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
<image:image>
<image:loc>
https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg
</image:loc>
<image:title>
5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE
</image:title>
</image:image>
<PageMap>
<DataObject type="thumbnail">
<Attribute name="name" value="5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE"/>
<Attribute name="src" value="https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg"/>
</DataObject>
</PageMap>
</url>
I ONLY want to get the contents of the <loc></loc>
So I set my scrapy spider up like this (some parts omitted for brevity):
start_urls = ['https://www.example.com/sitemap.xml']
namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
itertag = 'url'

def parse_node(self, response, selector):
    item = {}
    item['url'] = selector.select('url').get()
    selector.remove_namespaces()
    yield {
        'url': selector.xpath('//loc/text()').getall()
    }
That ends up giving me the page URL and the URLs of all the product images (after remove_namespaces(), //loc matches the <image:loc> elements too). How can I set this spider up to ONLY get the actual product page URL?
In order to change this part of the sitemap spider logic, you need to override its _parse_sitemap method (source)
and replace section
elif s.type == 'urlset':
    for loc in iterloc(it, self.sitemap_alternate_links):
        for r, c in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c)
                break
with something like this:
elif s.type == 'urlset':
    for entry in it:
        item = entry  # entry: sitemap entry parsed as a dict by the Sitemap helper
        ...
        yield item  # return the item instead of making a request
In this case the spider yields items built from the parsed sitemap entries instead of making a request for every link.
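A minimal sketch of that override, assuming a reasonably recent Scrapy (Sitemap and iterloc are internal helpers, so the import paths may vary between versions):

from scrapy import Request
from scrapy.spiders import SitemapSpider
from scrapy.spiders.sitemap import iterloc  # internal helper
from scrapy.utils.sitemap import Sitemap

class LocOnlySpider(SitemapSpider):
    name = 'loc_only'
    sitemap_urls = ['https://www.example.com/sitemap.xml']

    def _parse_sitemap(self, response):
        body = self._get_sitemap_body(response)
        if body is None:
            return
        s = Sitemap(body)
        if s.type == 'sitemapindex':
            # Follow nested sitemaps, as the stock implementation does.
            for loc in iterloc(s, self.sitemap_alternate_links):
                yield Request(loc, callback=self._parse_sitemap)
        elif s.type == 'urlset':
            # Each entry is a dict parsed from one <url> element;
            # entry['loc'] is the product page URL, image URLs excluded.
            for entry in s:
                yield {'url': entry['loc']}

Since no per-URL requests are made, the yielded items can only contain what the sitemap itself provides.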
Currently, for pagination in an Amazon data scraper using Scrapy, I am using:
next_page = response.xpath('//li[@class="a-last"]/a/@href').get()
if next_page:
    next_page = 'https://www.amazon.com' + next_page
    yield scrapy.Request(url=next_page, callback=self.parse, headers=self.amazon_header, dont_filter=True)
Say I want to fetch data from only the first 3 pages. How do I do it?
Go to the settings.py file and limit the page count as follows:
CLOSESPIDER_PAGECOUNT = 3
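Note that CLOSESPIDER_PAGECOUNT is handled by Scrapy's CloseSpider extension and counts every response the spider downloads, not just the pagination requests, so it only maps cleanly onto "first 3 pages" when the spider makes one request per page.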
Alternative: suppose the URL is
url = ['https://www.quote.toscrape/page=1 something']
Now build the pagination into start_urls this way and drop the next-page logic:
start_urls = ['https://www.quote.toscrape/page=' + str(x) + ' something' for x in range(1, 4)]  # pages 1-3
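Another option, sketched below under the assumption that the rest of the spider stays as posted (MAX_PAGES and the 'page' meta key are names introduced here; the xpath and headers come from the snippet above): carry a page counter in the request meta and stop following the next link after three pages.

import scrapy

MAX_PAGES = 3  # hypothetical limit introduced for this sketch

# inside your spider class:
def parse(self, response):
    page = response.meta.get('page', 1)
    # ... extract and yield the items for this page here ...
    next_page = response.xpath('//li[@class="a-last"]/a/@href').get()
    if next_page and page < MAX_PAGES:
        yield scrapy.Request(
            url='https://www.amazon.com' + next_page,
            callback=self.parse,
            headers=self.amazon_header,
            dont_filter=True,
            meta={'page': page + 1},  # count pages across requests
        )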
I want to crawl a web site (http://theschoolofkyiv.org/participants/220/dan-acostioaei) to extract only the artist's name and biography. Whatever tags and properties I define, the result comes out without any of the text I want to see.
I am using Scrapy to crawl the web site. It works fine for other websites, but here I can't seem to define the correct tags or properties. Can you please have a look at my code?
This is the code that I used to crawl the website.
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem
class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']

    def parse(self, response):
        titles = response.xpath("//div[@id='participants']")
        for title in titles:
            item = ArtistlistItem()
            item['artist'] = response.css('.ng-binding::text').extract()
            item['biography'] = response.css('p::text').extract()
            yield item
This is the output that I get:
{'artist': [],
'biography': ['\n ',
'\n ',
'\n ',
'\n ',
'\n ',
'\n ']}
Simple illustration (assuming you already know about the AJAX request mentioned by Tony Montana: the page is an AngularJS app, so the artist data is loaded from a JSON endpoint rather than being present in the initial HTML):
import scrapy
import re
import json
from artistlist.items import ArtistlistItem

class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']

    def parse(self, response):
        # Guard against a missing match before calling .group()
        match = re.search(r'participants/(\d+)', response.url)
        if match:
            participant_id = match.group(1)
            yield scrapy.Request(
                url="http://theschoolofkyiv.org/wordpress/wp-json/posts/{participant_id}".format(participant_id=participant_id),
                callback=self.parse_participant,
            )

    def parse_participant(self, response):
        data = json.loads(response.body)
        item = ArtistlistItem()
        item['artist'] = data["title"]
        item['biography'] = data["acf"]["en_participant_bio"]
        yield item
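If helpful, the endpoint can be sanity-checked in scrapy shell first (participant id 220 taken from the start URL above):

scrapy shell "http://theschoolofkyiv.org/wordpress/wp-json/posts/220"
>>> import json
>>> json.loads(response.body)["title"]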
I am trying to scrape 2 domains concurrently. I have created a spider like this:
class TestSpider(CrawlSpider):
    name = 'test-spider'
    allowed_domains = ['domain-a.com', 'domain-b.com']
    start_urls = ['http://www.domain-a.com/index.html',
                  'http://www.domain-b.com/index.html']
    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        log.msg('parsing ' + response.url, log.DEBUG)
I would expect to see a mix of domain-a.com and domain-b.com entries in the output, but I only see domain-a mentioned in the logs. However, if I run separate spiders/crawlers, I do see both domains scraped concurrently (not actual code, but it illustrates the point):
def setup_crawler(url):
    spider = TestSpider(start_url=url)
    crawler = Crawler(get_project_settings())
    crawler.configure()
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)  # pass the callable, don't call it
    crawler.crawl(spider)
    crawler.start()

setup_crawler('http://www.domain-a.com/index.html')
setup_crawler('http://www.domain-b.com/index.html')
log.start(loglevel=log.DEBUG)
reactor.run()
Thanks
On my site I created two simple pages:
Here is the HTML for each:
test1.html:
<head>
<title>test1</title>
</head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, "C", "1", "Product", "N");" indepth="true">
<span>cool</span></a>
</body></html>
test2.html:
<head>
<title>test2</title>
</head>
<body></body></html>
I want to scrape the text in the title tag of the two pages, here "test1" and "test2".
But I am a novice with Scrapy and I only manage to scrape the first page.
My Scrapy script:
from scrapy.spider import Spider
from scrapy.selector import Selector
from testscrapy1.items import Website

class DmozSpider(Spider):
    name = "bill"
    allowed_domains = ["http://exemple.com"]
    start_urls = [
        "http://www.exemple.com/test1.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//head')
        items = []
        for site in sites:
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            items.append(item)
        return items
How do I get past the onclick?
And how do I successfully scrape the text of the title tag of the second page?
Thank you in advance
STEF
To use multiple functions in your code, i.e. send multiple requests and parse them, you're going to need: 1) yield instead of return, and 2) callback.
Example:
def parse(self, response):
    for site in response.xpath('//head'):
        item = Website()
        item['title'] = site.xpath('//title/text()').extract()
        yield item
    yield scrapy.Request(url="http://www.domain.com", callback=self.other_function)

def other_function(self, response):
    for other_thing in response.xpath('//this_xpath'):
        item = Website()
        item['title'] = other_thing.xpath('//this/and/that').extract()
        yield item
You cannot run JavaScript with Scrapy, but you can work out what the JavaScript does and replicate it: http://doc.scrapy.org/en/latest/topics/firebug.html
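Applied to the two test pages above, a minimal sketch of that advice (assuming a Scrapy version recent enough for extract_first and response.urljoin, and assuming the onclick is only a tracking call, so following the plain href is enough; the spider name is made up):

import scrapy

class TitleSpider(scrapy.Spider):
    name = "titles"  # hypothetical name
    allowed_domains = ["exemple.com"]  # domain only, without the scheme
    start_urls = ["http://www.exemple.com/test1.html"]

    def parse(self, response):
        # Yield the current page's title ("test1", then "test2").
        yield {'title': response.xpath('//title/text()').extract_first()}
        # Ignore the onclick handler and simply follow the href.
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)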