How to only get the product page url in xml sitemap - scrapy

I am using Scrapy's XMLFeedSpider on an XML sitemap to crawl and extract URLs, and only URLs.
The xml sitemap looks like this:
<url>
<loc>
https://www.example.com/american-muscle-5-pc-kit-box.html
</loc>
<lastmod>2020-10-14T15:40:02+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
<image:image>
<image:loc>
https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg
</image:loc>
<image:title>
5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE
</image:title>
</image:image>
<PageMap>
<DataObject type="thumbnail">
<Attribute name="name" value="5 PC. GAUGE KIT, 3-3/8" & 2-1/16", ELECTRIC SPEEDOMETER, AMERICAN MUSCLE"/>
<Attribute name="src" value="https://www.example.com/pub/media/catalog/product/cache/de5bc950da2c28fc62848f9a6b789a5c/1/2/1202_45.jpg"/>
</DataObject>
</PageMap>
</url>
I ONLY want to get the contents of the <loc></loc> tag.
So I set my Scrapy spider up like this (some parts omitted for brevity):
start_urls = ['https://www.example.com/sitemap.xml']
namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
itertag = 'url'

def parse_node(self, response, selector):
    item = {}
    item['url'] = selector.select('url').get()
    selector.remove_namespaces()
    yield {
        'url': selector.xpath('//loc/text()').getall()
    }
That ends up giving me the page URL plus the URLs of all the product images. How can I set this spider up to ONLY get the actual product page URL?

In order to change this part of the sitemap spider logic you need to override its _parse_sitemap method (source)
and replace this section:
elif s.type == 'urlset':
    for loc in iterloc(it, self.sitemap_alternate_links):
        for r, c in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c)
                break
with something like this:
elif s.type == 'urlset':
    for entry in it:
        item = entry  # entry - sitemap entry parsed as a dictionary by the Sitemap parser
        ...
        yield item  # instead of making a request - return the item
In this case the spider will return items built from the parsed sitemap entries instead of making a request for every link.
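Put together, a minimal sketch of such a spider might look like the following (untested; the spider name is made up, and it keeps the stock behaviour for sitemap index files):

from scrapy import Request
from scrapy.spiders import SitemapSpider
from scrapy.utils.sitemap import Sitemap

class SitemapUrlSpider(SitemapSpider):
    name = 'sitemap_urls'  # hypothetical name
    sitemap_urls = ['https://www.example.com/sitemap.xml']

    def _parse_sitemap(self, response):
        body = self._get_sitemap_body(response)
        if body is None:
            return
        s = Sitemap(body)
        if s.type == 'sitemapindex':
            # nested sitemaps: keep following them, as the stock spider does
            for entry in s:
                yield Request(entry['loc'], callback=self._parse_sitemap)
        elif s.type == 'urlset':
            # each entry is a dict built from the direct children of one <url>
            # element, so entry['loc'] is the product page URL and the nested
            # image:loc values are not included
            for entry in s:
                yield {'url': entry['loc']}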

Related

Extracting data from div tag

So I'm scraping data from a website, and it has some data in its div tag,
like this:
<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n<span class="city state" data-city="city, TX;city, TX;city, TX;city, TX" data-state="TX">STATENAME, CITYNAME\n</span>\n</div>,
I want to scrape the "Donald Duck" part and the state and city name that follow it.
The site contains a lot of data, so the name and state differ from result to result.
The code that I have written is:
div = soup.find_all('div', {'class':'search-result__title'})
print (div.string)
This gives me an error:
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
First, use .text. Second, find_all() will return a list of elements. You need to specify the index value, e.g. print(div[0].text), or, since you will probably have more than one element, just iterate through them:
from bs4 import BeautifulSoup

html = '''<div class="search-result__title">\nDonald Duck <span>\xa0|\xa0</span>\n<span class="city state" data-city="city, TX;city, TX;city, TX;city, TX" data-state="TX">STATENAME, CITYNAME\n</span>\n</div>'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find_all('div', {'class':'search-result__title'})

print(div[0].text)

# OR
for each in div:
    print(each.text)
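If you also need the name and the city/state separately rather than one concatenated string, a small extension along these lines should work (assuming the markup shown above; nothing here is specific to the real site):

for each in div:
    # the first text node inside the div is the name ("Donald Duck")
    name = each.find(string=True).strip()
    # the second span holds the location text and the data-* attributes
    location_span = each.find('span', class_='city state')
    location = location_span.get_text(strip=True) if location_span else ''
    print(name, location)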

How to extract urls from an xml using scrapy - XMLFeedSpider?

I have started using Scrapy recently and I'm trying to use the XMLFeedSpider to extract and load the pages that are listed in an XML page. But the problem is that it is returning an error: "IndexError: list index out of range".
I'm trying to collect and load all product pages that are at this address: "http://www.example.com/feed.xml"
My spider:
from scrapy.spiders import XMLFeedSpider

class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['http://www.example.com']
    start_urls = [
        'http://www.example.com/feed.xml'
    ]
    itertag = 'loc'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
This is how your XML input starts:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.htm</loc></url>
<url><loc>http://www.example.htm</loc></url>
(...)
And there's actually a bug in XMLFeedSpider when using the default iternodes iterator on an XML document that uses a namespace. See this archived discussion in the scrapy-users mailing list.
This spider works after changing the iterator to xml, which lets you reference a namespace, here http://www.sitemaps.org/schemas/sitemap/0.9 using the prefix n (it could be anything really), and using that namespace prefix for the tag to look for, here n:loc:
from scrapy.spiders import XMLFeedSpider

class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/example.xml'
    ]
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
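If you want the spider to actually yield the URLs as scraped items instead of just logging them, parse_node can return a dict; a small (untested) variation on the method above:

def parse_node(self, response, node):
    # with iterator = 'xml' and itertag = 'n:loc', node is the <loc> element,
    # so its text content is the page URL
    yield {'url': node.xpath('text()').get()}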

Scrapy Request callback not being fired

I'm playing around with scraping infos from Amazon, but it's giving me a hard time. My spider looks like this so far:
class AmzCrawlerSpider(CrawlSpider):
    name = 'amz_crawler'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s?ie=UTF8&bbn=1267449011&page=1&rh=n%3A284507%2Cn%3A1055398%2Cn%3A%211063498%2Cn%3A1267449011%2Cn%3A3204211011']

    rules = (Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        category_name = Selector(response).xpath('//*[@id="nav-subnav"]/a[1]/text()').extract()[0]
        products = Selector(response).xpath('//div[@class="s-item-container"]')
        for product in products:
            item = AmzItem()
            item['title'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@title').extract()[0]
            item['url'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@href').extract()[0]
            request = scrapy.Request(item['url'], callback=self.parse_product)
            request.meta['item'] = item
            print "Crawl " + item["title"]
            print "Crawl " + item['url']
            yield request

    def parse_product(self, response):
        print ( "Parse Product" )
        item = response.meta['item']
        sel = Selector(response)
        item['asin'] = sel.xpath('//td[@class="bucket"]/div/ul/li/b[contains(text(),"ASIN:")]/../text()').extract()[0]
        return item
There are two issues which I don't seem to understand:
'Parse Product' is never printed - so I assume the method parse_product is never executed, even though the 'Crawl ...' prints are displayed as expected.
Maybe it's about the Rules?
And then, related to the rules:
It's only working for the first page of a category. The crawler doesn't follow links to the 2nd page of a category.
I assume that the pages are generated for Scrapy in a different way than for the browser? In the console I see a lot of 301 redirects:
2015-06-30 14:57:24+0800 [amz_crawler] DEBUG: Redirecting (301) to http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> from http://www.amazon.com/gp/search/ref=sr_pg_1?sf=qz&fst=as%3Aoff&rh=n%3A2619533011%2Ck%3Apet+supplies%2Cp_n_date_first_available_absolute%3A2661609011%2Cp_72%3A2661618011&sort=date-desc-rank&keywords=pet+supplies&ie=UTF8&qid=1435312739>
2015-06-30 14:57:29+0800 [amz_crawler] DEBUG: Crawled (200) http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> (referer: None)
2015-06-30 14:57:39+0800 [amz_crawler] DEBUG: Crawled (200) http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Cp_72%3A2661618011> (referer: http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011)
Crawl 2015-06-30 14:57:39 Precious Cat Ultra Premium Clumping Cat Litter, 40 pound bag
Crawl http://www.amazon.com/Precious-Cat-Premium-Clumping-Litter/dp/B0009X29WK
What did I do wrong?

How to use scrapy to crawl multiple pages? (two level)

On my site I created two simple pages:
Here is the HTML of each:
test1.html :
<head>
<title>test1</title>
</head>
<body>
<a href="test2.html" onclick="javascript:return xt_click(this, "C", "1", "Product", "N");" indepth="true">
<span>cool</span></a>
</body></html>
test2.html :
<head>
<title>test2</title>
</head>
<body></body></html>
I want to scrape the text in the title tag of the two pages, here "test1" and "test2",
but I am a novice with Scrapy and I only manage to scrape the first page.
My Scrapy script:
from scrapy.spider import Spider
from scrapy.selector import Selector
from testscrapy1.items import Website

class DmozSpider(Spider):
    name = "bill"
    allowed_domains = ["http://exemple.com"]
    start_urls = [
        "http://www.exemple.com/test1.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//head')
        items = []
        for site in sites:
            item = Website()
            item['title'] = site.xpath('//title/text()').extract()
            items.append(item)
        return items
How do I get past the onclick?
And how can I successfully scrape the text of the title tag of the second page?
Thank you in advance
STEF
To use multiple functions in your code and to send multiple requests and parse them, you're going to need: 1) yield instead of return, 2) a callback.
Example:
def parse(self, response):
    for site in response.xpath('//head'):
        item = Website()
        item['title'] = site.xpath('//title/text()').extract()
        yield item
    yield scrapy.Request(url="http://www.domain.com", callback=self.other_function)

def other_function(self, response):
    for other_thing in response.xpath('//this_xpath'):
        item = Website()
        item['title'] = other_thing.xpath('//this/and/that').extract()
        yield item
You cannot parse JavaScript with Scrapy, but you can work out what the JavaScript does and do the same: http://doc.scrapy.org/en/latest/topics/firebug.html
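Applied to the two test pages above, a minimal sketch (untested) could look like this; the onclick handler can simply be ignored, because the plain href="test2.html" is all Scrapy needs to follow the link:

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'bill'
    start_urls = ['http://www.exemple.com/test1.html']

    def parse(self, response):
        # grab the <title> text of the current page
        yield {'title': response.xpath('//title/text()').get()}
        # follow relative links such as "test2.html" and parse them the same way
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)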

Add full link to short link to make it valid using scrapy? [duplicate]

Possible Duplicate:
Scrapy Modify Link to include Domain Name
I use this code to extract data from an HTML website and store the data in an XML file, and it works great for me.
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    site1 = hxs.select('/html/body/div/div[4]/div[3]/div/div/div[2]/div/ul/li')
    for site in site1:
        item = NewsItem()
        item['title'] = site.select('a[2]/text()').extract()
        item['image'] = site.select('a/img/@src').extract()
        item['text'] = site.select('p/text()').extract()
        item['link'] = site.select('a[2]/@href').extract()
        items.append(item)
    return items
But the issue that I am facing is that the website provides a short (relative) link for ['image'], like this:
<img src="/a/small/72/72089be43654dc6d7215ec49f4be5a07_w200_h180.jpg"
while the full link should be like this:
<img src="http://www.aleqt.com/a/small/72/72089be43654dc6d7215ec49f4be5a07_w200_h180.jpg"
I want to know how to modify my code to add the missing part of the link automatically.
You can try this:
from urlparse import urljoin  # urllib.parse.urljoin in Python 3
item['link'] = urljoin(response.url, site.select('a[2]/@href').extract()[0])
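For reference, urljoin resolves a relative path against the URL of the page being parsed, so an absolute path like the one above ends up on the right host (the page URL in this snippet is made up):

>>> from urlparse import urljoin  # urllib.parse.urljoin in Python 3
>>> urljoin('http://www.aleqt.com/some/page.html', '/a/small/72/72089be43654dc6d7215ec49f4be5a07_w200_h180.jpg')
'http://www.aleqt.com/a/small/72/72089be43654dc6d7215ec49f4be5a07_w200_h180.jpg'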
On the assumption that all such image links simply need "http://www.aleqt.com" added to them, you could just do something like this:
def parse(self, response):
    base_url = 'http://www.aleqt.com'
    hxs = HtmlXPathSelector(response)
    items = []
    site1 = hxs.select('/html/body/div/div[4]/div[3]/div/div/div[2]/div/ul/li')
    for site in site1:
        item = NewsItem()
        item['title'] = site.select('a[2]/text()').extract()
        item['image'] = base_url + site.select('a/img/@src').extract()[0]
        item['text'] = site.select('p/text()').extract()
        item['link'] = base_url + site.select('a[2]/@href').extract()[0]
        items.append(item)
    return items
Alternatively, if you've added that exact same URL to the start_urls list (and assuming there's only one), you could replace base_url with self.start_urls[0].
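With current Scrapy selectors the same idea can be written more compactly with response.urljoin, which resolves relative URLs against the page URL; a rough sketch, keeping the XPaths from the question:

def parse(self, response):
    for site in response.xpath('/html/body/div/div[4]/div[3]/div/div/div[2]/div/ul/li'):
        yield {
            'title': site.xpath('a[2]/text()').get(),
            'image': response.urljoin(site.xpath('a/img/@src').get(default='')),
            'text': site.xpath('p/text()').get(),
            'link': response.urljoin(site.xpath('a[2]/@href').get(default='')),
        }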