How to extract URLs from an XML feed using Scrapy's XMLFeedSpider?

I have started using Scrapy recently and I'm trying to use XMLFeedSpider to extract and load the pages listed in an XML page, but it returns an error: "IndexError: list index out of range".
I'm trying to collect and load all the product pages listed at this address: "http://www.example.com/feed.xml"
My spider:
from scrapy.spiders import XMLFeedSpider

class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['http://www.example.com']
    start_urls = [
        'http://www.example.com/feed.xml'
    ]
    itertag = 'loc'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

This is how your XML input starts:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.htm</loc></url>
<url><loc>http://www.example.htm</loc></url>
(...)
And there's actually a bug in XMLFeedSpider when using the (default) iternodes iterator on an XML document that uses a namespace. See this archived discussion on the scrapy-users mailing list.
The spider below works after changing the iterator to xml, which lets you register a namespace. Here the namespace http://www.sitemaps.org/schemas/sitemap/0.9 is registered under the prefix n (it could be anything really), and that prefix is used in the tag to look for, here n:loc:
from scrapy.spiders import XMLFeedSpider

class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/example.xml'
    ]
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
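
If the goal is to actually load each product page rather than just log the node, parse_node can also yield a request for each extracted URL. A minimal sketch, assuming import scrapy at the top of the file and a hypothetical parse_product callback for the product pages:

    def parse_node(self, response, node):
        # the <loc> node's text content is the product page URL
        url = node.xpath('text()').extract_first()
        if url:
            yield scrapy.Request(url, callback=self.parse_product)

    def parse_product(self, response):
        # placeholder: extract whatever product fields you need here
        self.logger.info('Visited %s', response.url)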

Related

What are the correct tags and properties to select?

I want to crawl a web site (http://theschoolofkyiv.org/participants/220/dan-acostioaei) to extract only the artist's name and biography. When I define the tags and properties, the result comes out without any of the text I want to see.
I am using Scrapy to crawl the web site. For other websites it works fine, but here it seems I cannot define the correct tags or properties. Can you please have a look at my code?
This is the code that I used to crawl the website.
import scrapy
from scrapy.selector import Selector
from artistlist.items import ArtistlistItem

class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']
    def parse(self, response):
        titles = response.xpath("//div[@id='participants']")
        for titles in titles:
            item = ArtistlistItem()
            item['artist'] = response.css('.ng-binding::text').extract()
            item['biography'] = response.css('p::text').extract()
            yield item
This is the output that I get:
{'artist': [],
'biography': ['\n ',
'\n ',
'\n ',
'\n ',
'\n ',
'\n ']}
Simple illustration (assuming you already know about the AJAX request mentioned by Tony Montana):
import scrapy
import re
import json
from artistlist.items import ArtistlistItem

class ArtistlistSpider(scrapy.Spider):
    name = "artistlist"
    allowed_domains = ["theschoolofkyiv.org"]
    start_urls = ['http://theschoolofkyiv.org/participants/220/dan-acostioaei']

    def parse(self, response):
        # the numeric participant id is taken from the page URL
        match = re.search(r'participants/(\d+)', response.url)
        if match:
            participant_id = match.group(1)
            yield scrapy.Request(
                url="http://theschoolofkyiv.org/wordpress/wp-json/posts/{participant_id}".format(participant_id=participant_id),
                callback=self.parse_participant,
            )

    def parse_participant(self, response):
        data = json.loads(response.body)
        item = ArtistlistItem()
        item['artist'] = data["title"]
        item['biography'] = data["acf"]["en_participant_bio"]
        yield item
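
As a quick sanity check before wiring the endpoint into the spider (220 is the participant id from the URL in the question, and the JSON keys are the ones used in the answer above), scrapy shell works well:

scrapy shell "http://theschoolofkyiv.org/wordpress/wp-json/posts/220"
>>> import json
>>> data = json.loads(response.body)
>>> data["title"], data["acf"]["en_participant_bio"]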

Scrapy - Copying only the xpath into .csv file

I have many other scripts with similar basic code that work, but when I run this spider in cmd and open the .csv file to look at the saved "titles", I get the XPath selector copied into Excel. Any idea why?
import scrapy

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['https://www.imdb.com/search/title?start=1']
    start_urls = ['https://www.imdb.com/search/title?start=1/']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div[3]/div[1]/div[3]/h3/a')
        pass
        print(titles)
        for title in titles:
            yield {'Title': title}

--- Try two below: ---

        for subject in titles:
            yield {
                'Title': subject.xpath('.//h3[@class="lister-item-header"]/a/text()').extract_first(),
                'Runtime': subject.xpath('.//p[@class="text-muted"]/span/text()').extract_first(),
                'Description': subject.xpath('.//p[@class="text-muted"]/p/text()').extract_first(),
                'Director': subject.xpath('.//*[@id="main"]/a/text()').extract_first(),
                'Rating': subject.xpath('.//div[@class="inline-block ratings-imdb-rating"]/strong/text()').extract_first()
            }
Use extract() or extract_first() to get text rather than selector objects, and use shorter, more robust XPath expressions:
import scrapy

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['www.imdb.com']  # a domain, not a full URL
    start_urls = ['https://www.imdb.com/search/title?start=1/']

    def parse(self, response):
        subjects = response.xpath('//div[@class="lister-item mode-advanced"]')
        for subject in subjects:
            yield {
                'Title': subject.xpath('.//h3[@class="lister-item-header"]/a/text()').extract_first(),
                'Rating': subject.xpath('.//div[@class="inline-block ratings-imdb-rating"]/strong/text()').extract_first(),
                'Runtime': subject.xpath('.//span[@class="runtime"]/text()').extract_first(),
                'Description': subject.xpath('.//p[@class="text-muted"]/text()').extract_first(),
                'Director': subject.xpath('.//p[contains(text(), "Director")]/a[1]/text()').extract_first(),
            }
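With the fields extracted as plain text, writing the .csv file mentioned in the question is just the standard feed export (assuming the spider above is saved in a Scrapy project and named movie):

scrapy crawl movie -o movies.csv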

Scrapy find all links with different (similar) class

I'm trying to scrape links with a certain class "post-item post-item-xxxxx". But since the class is different in each one, how can I capture all of them?
<li class="post-item post-item-18887"><a href="http://example.com/archives/18887.html" title="Post1"></a></li>
<li class="post-item post-item-18883"><a href="http://example.com/archives/18883.html" title="Post2"></a></li>
my code:
# scrape all the cafe links from example.com
import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class DengaSpider(scrapy.Spider):
    name = 'cafes'
    allowed_domains = ['example.com']
    start_urls = [
        'http://example.com/archives/8136.html',
    ]

    rules = [
        Rule(
            LinkExtractor(
                allow=('^http://example\.com/archives/\d+.html$'),
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    def parse(self, response):
        cafelink = response.css('post.item').xpath('//a/@href').extract()
        if cafelink is not None:
            print(cafelink)
the .css part is not working, how can I fix it?
Here's a sample run for the above html in scrapy shell:
>>> from scrapy.http import HtmlResponse
>>> body = '<li class="post-item post-item-18887"><a href="http://example.com/archives/18887.html" title="Post1"></a></li><li class="post-item post-item-18883"><a href="http://example.com/archives/18883.html" title="Post2"></a></li>'
>>> response = HtmlResponse(url="http://example.com", body=body, encoding='utf-8')
>>>
>>> cafelink = response.css('li.post-item a::attr(href)').extract_first()
>>> cafelink
'http://example.com/archives/18887.html'
>>>
>>> cafelink = response.css('li.post-item a::attr(href)').extract()
>>> cafelink
['http://example.com/archives/18887.html', 'http://example.com/archives/18883.html']
XPath has the contains() function for this, so you might try this:
cafelink = response.xpath("//*[contains(@class, 'post-item-')]//a/@href").extract()
Also be careful when using // in XPath: it makes the search start at the document root, no matter which node you are currently on.
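A minimal sketch of the difference, with hypothetical selectors (not taken from the question's page); the leading dot keeps the search relative to the current node:

for li in response.css('li.post-item'):
    everywhere = li.xpath('//a/@href').extract()     # searches the whole document
    inside_li = li.xpath('.//a/@href').extract()     # searches only inside this <li>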
If all the items you want also have the "post-item" class then why do you need to capture them by their other class? In case you still need to do that, try the "starts with" CSS selector:
response.css('li[class^="post-item post-item-"]')
Documentation here.

How to restrict the area in which LinkExtractor is being applied?

I have a scraper with the following rules:
rules = (
    Rule(LinkExtractor(allow=('\S+list=\S+'))),
    Rule(LinkExtractor(allow=('\S+list=\S+'))),
    Rule(LinkExtractor(allow=('\S+view=1\S+')), callback='parse_archive'),
)
As you can see, the first and second rules are exactly the same.
What I would like to do is tell Scrapy to extract the links I am interested in from particular places within a page only. For convenience, here are the corresponding XPaths, although I would prefer a solution based on BeautifulSoup's syntax.
//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/table/tbody/tr/td[1]
//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/form/table/tbody/tr[1]
//*[@id="main_frame"]/tbody/tr[3]/td[2]/table/tbody/tr/td/div/form/table/tbody/tr[2]
EDIT:
Let me give you an example. Let's assume that I want to extract the five (out of six) links at the top of Scrapy's official page:
And here is my spider. Any ideas?
class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = [
        "http://scrapy.org/",
    ]

    rules = (
        Rule(LinkExtractor(allow=('\S+/'), restrict_xpaths=('/html/body/div[1]/div/ul')), callback='first_level'),
    )

    def first_level(self, response):
        taco = dmozItem()
        taco['basic_url'] = response.url
        return taco
This can be done with the restrict_xpaths parameter. See the LxmlLinkExtractor documentation.
Edit:
You can also pass a list to restrict_xpaths.
Edit 2:
Full example that should work:
import scrapy
# note: on recent Scrapy versions these imports live in scrapy.spiders and scrapy.linkextractors
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class dmozItem(scrapy.Item):
    basic_url = scrapy.Field()

class dmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["scrapy.org"]
    start_urls = [
        "http://scrapy.org/",
    ]

    def clean_url(value):
        return value.replace('/../', '/')

    rules = (
        Rule(
            LinkExtractor(
                allow=('\S+/'),
                restrict_xpaths=(['.//ul[@class="navigation"]/a[1]',
                                  './/ul[@class="navigation"]/a[2]',
                                  './/ul[@class="navigation"]/a[3]',
                                  './/ul[@class="navigation"]/a[4]',
                                  './/ul[@class="navigation"]/a[5]']),
                process_value=clean_url
            ),
            callback='first_level'),
    )

    def first_level(self, response):
        taco = dmozItem()
        taco['basic_url'] = response.url
        return taco
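
Since the question mentions preferring a BeautifulSoup-style (i.e. CSS-like) syntax, note that link extractors also accept a restrict_css parameter. A minimal sketch of the same rule, assuming a recent Scrapy and a hypothetical ul.navigation menu on the page:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    Rule(
        # only extract links found inside the navigation list
        LinkExtractor(allow=(r'\S+/',), restrict_css=('ul.navigation',)),
        callback='first_level',
    ),
)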

Does a scrapy spider download from multiple domains concurrently?

I am trying to scrape 2 domains concurrently. I have created a spider like this:
class TestSpider(CrawlSpider):
    name = 'test-spider'
    allowed_domains = ['domain-a.com', 'domain-b.com']
    start_urls = ['http://www.domain-a.com/index.html',
                  'http://www.domain-b.com/index.html']
    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        log.msg('parsing ' + response.url, log.DEBUG)
I would expect to see a mix of domain-a.com and domain-b.com entries in the output, but I only see domain-a mentioned in the logs. However, if I run separate spiders/crawlers I do see both domains scraped concurrently (not actual code, but it illustrates the point):
def setup_crawler(url):
    spider = TestSpider(start_url=url)
    crawler = Crawler(get_project_settings())
    crawler.configure()
    # connect the callable itself, not the result of calling it
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()

setup_crawler('http://www.domain-a.com/index.html')
setup_crawler('http://www.domain-b.com/index.html')
log.start(loglevel=log.DEBUG)
reactor.run()
Thanks