Combining spiders in Scrapy - scrapy

Is it possible to create a spider which inherits/uses functionality from two base spiders?
I'm trying to scrape various sites and I've noticed that in many instances the site provides a sitemap but this just points to category/listing type pages, not 'real' content. Because of this I'm having to use the CrawlSpider (pointing to the website root) instead but this is pretty inefficient as it crawls through all pages including a lot of junk.
What I would like to do is something like this:
Start my Spider which is a subclass of SitemapSpider and pass each response to the parse_items method.
In parse_items test if the page contains 'real' content
If it does then process it, if not pass the response to the
CrawlSpider (actually my subclass of CrawlSpider) to process
CrawlSpider then looks for links in the page, say 2 levels deep and
processes them
Is this possible? I realise that I could copy and paste the code from the CrawlSpider into my spider, but this seems like a poor design.

In the end I decided to just extend SitemapSpider and lift some of the code from CrawlSpider, as it was simpler than trying to deal with multiple inheritance issues. So basically:
from scrapy import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):

    def __init__(self, **kw):
        super(MySpider, self).__init__(**kw)
        self.link_extractor = LxmlLinkExtractor()

    def parse(self, response):
        # perform item extraction etc.
        ...
        links = self.link_extractor.extract_links(response)
        for link in links:
            yield Request(link.url, callback=self.parse)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import SitemapSpider, CrawlSpider, Rule

class MySpider(SitemapSpider, CrawlSpider):
    name = "myspider"
    rules = (
        Rule(LinkExtractor(allow=('', )), callback='parse_item', follow=True),
    )
    sitemap_rules = [
        ('/', 'parse_item'),
    ]
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    start_urls = ['http://www.example.com']
    allowed_domains = ['example.com']

    def parse_item(self, response):
        # Do your stuff here
        ...
        # Return to CrawlSpider that will crawl them
        yield from self.parse(response)
This way, Scrapy will start with the urls in the sitemap, then follow all links within each url.
Source: Multiple inheritance in scrapy spiders

You can inherit as usual; the only thing you have to be careful about is that the base spiders usually override the basic methods start_requests and parse. The other thing to point out is that CrawlSpider will extract links from every response that goes through _parse_response.
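A pure-Python sketch (no Scrapy required; the class names are illustrative stand-ins) of why those overrides matter: when two bases both define parse, Python's MRO resolves the call to the leftmost base only, so the other base's behaviour is silently lost unless you override and delegate explicitly.

```python
# Illustrative classes standing in for SitemapSpider and CrawlSpider:
# when both bases define parse(), Python's MRO picks the leftmost one,
# so the other base's parse() never runs unless called explicitly.
class SitemapBase:
    def parse(self, response):
        return "sitemap parse"

class CrawlBase:
    def parse(self, response):
        return "crawl parse"

class Combined(SitemapBase, CrawlBase):
    pass

print(Combined().parse(None))  # CrawlBase.parse is shadowed
```

This is why the answers above either pick one base and copy the needed code, or make sure each base's callback chain is invoked explicitly (as the `yield from self.parse(response)` trick does).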

Setting a low value for DEPTH_LIMIT should be a way to manage the fact that CrawlSpider will extract links from every response that goes through _parse_response (after reviewing the original question, this was already proposed there).
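For reference, a minimal sketch of that mitigation: DEPTH_LIMIT goes in settings.py (or a spider's custom_settings dict). Depth is counted in link-hops from the start URLs, and the default of 0 means unlimited.

```python
# settings.py -- a minimal sketch: requests more than two link-hops
# away from the start URLs are dropped (the default, 0, means no limit)
DEPTH_LIMIT = 2
```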

Related

Collect items from multiple requests in an array in Scrapy

I wrote a small example spider to illustrate my problem:
import json

import scrapy

class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        for i in json.loads(response.text)['ids']:
            yield scrapy.Request(f'https://example.com/list/{i}', callback=self.parse_lists)

    def parse_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)
I need to have all the items that result from multiple requests (all ListEntryItems) in an array inside the spider, so I can dispatch requests that depend on all items.
My first idea was to chain the requests and pass the remaining IDs and the already extracted items in the request's meta attribute until the last request is reached.
import json
from typing import List

import scrapy

class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        ids = json.loads(response.text)['ids']
        yield self._create_request(ids, [])

    def parse_lists(self, response):
        items = response.meta['items'] + list(self._extract_lists(response))
        yield self._create_request(response.meta['ids'], items)

    def finish(self, response):
        items = response.meta['items'] + list(self._extract_lists(response))

    def _extract_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)

    def _create_request(self, ids: list, items: List[ListEntryItem]):
        i = ids.pop(0)
        return scrapy.Request(
            f'https://example.com/list/{i}',
            meta={'ids': ids, 'items': items},
            callback=self.parse_lists if len(ids) > 1 else self.finish
        )
As you can see, my solution looks very complex. I'm looking for something more readable and less complex.
There are different approaches for this. One is chaining, as you do. Problems occur if one of the requests in the middle of the chain is dropped for any reason: you have to be really careful about that and handle all possible errors / ignored requests.
Another approach is to use a separate spider for all "grouped" requests.
You can start those spiders programmatically and pass a bucket (e.g. a dict) as spider attribute. Within your pipeline you add your items from each request to this bucket. From "outside" you listen to the spider_closed signal and get this bucket which then contains all your items.
Look here for how to start a spider programmatically via a crawler runner:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Pass a bucket to your spider when calling crawl() on your crawler runner:
crawler_runner_object.crawl(YourSpider, bucket=dict())
and catch the spider_closed signal:
from scrapy import signals
from scrapy.signalmanager import dispatcher

def on_spider_closed(spider):
    bucket = spider.bucket

dispatcher.connect(on_spider_closed, signal=signals.spider_closed)
This approach might seem even more complicated than chaining your requests, but it actually takes a lot of complexity out of the problem, because within your spider you can make your requests without taking much care about all the other requests.
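The bucket idea can be sketched without Scrapy at all (the names here are illustrative): the caller owns a mutable dict, hands it to the worker, and reads it back once the worker signals completion. In Scrapy the worker is your spider, the hand-off is crawler_runner.crawl(YourSpider, bucket=...), and the completion signal is spider_closed.

```python
# Pure-Python sketch of the bucket pattern; run_worker stands in for a
# spider run, and closed_callbacks stands in for the spider_closed signal.
closed_callbacks = []

def run_worker(bucket, scraped):
    # in Scrapy, a pipeline would append to the bucket per scraped item
    bucket.setdefault("items", []).extend(scraped)
    for cb in closed_callbacks:  # "spider_closed" fires at the end
        cb(bucket)

collected = []
closed_callbacks.append(lambda b: collected.extend(b["items"]))

run_worker({}, ["item1", "item2"])
print(collected)  # all items are available "outside" after the run
```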

Scrapy + extract only text + carriage returns in output file

I am new to Scrapy and trying to extract content from web page, but getting lots of extra characters in the output. See image attached.
How can I update my code to get rid of the characters? I need to extract only the href from the web page.
My code:
class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    rules = ()

    def create_dirs(dir):
        if not os.path.exists(dir):
            os.makedirs(dir)
        else:
            shutil.rmtree(dir)  # removes all the subdirectories!
            os.makedirs(dir)

    def __init__(self, name=None, **kwargs):
        super(AttractionSpider, self).__init__(name, **kwargs)
        self.items_buffer = {}
        self.base_url = "http://quotes.toscrape.com/page/1/"
        from scrapy.conf import settings
        settings.overrides['DOWNLOAD_TIMEOUT'] = 360

    def write_to_file(file_name, content_list):
        with open(file_name, 'wb') as fp:
            pickle.dump(content_list, fp)

    def parse(self, response):
        print("Start scrapping webcontent....")
        try:
            str = ""
            hxs = Selector(response)
            links = hxs.xpath('//li//@href').extract()
            with open('test1_href', 'wb') as fp:
                pickle.dump(links, fp)
            if not links:
                return
                log.msg("No Data to scrap")
            for link in links:
                v_url = ''.join(link.extract())
                if not v_url:
                    continue
                else:
                    _url = self.base_url + v_url
        except Exception as e:
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))
            raise

    def parse_details(self, response):
        print("Start scrapping Detailed Info....")
        try:
            hxs = Selector(response)
            yield l_venue
        except Exception as e:
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))
            raise
Now I must say... obviously you have some experience with Python programming, congrats, and you're obviously doing the official Scrapy docs tutorial, which is great, but for the life of me I have no idea, given the code snippet you have provided, exactly what you're trying to accomplish. But that's ok, here's a couple of things:
You are using a Scrapy crawl spider. When using a crawl spider, the rules set the following (or pagination, if you will), as well as pointing a callback to the function to run when the appropriate regular expression matches a rule for a page, to then initialize the extraction or itemization. It is absolutely crucial to understand that you cannot use a crawl spider without setting the rules, and equally as important, when using a crawl spider you cannot use the parse function, because the way the crawl spider is built, parse is already a native built-in function within it. Do go ahead and read the docs, or just create a crawl spider and see how it doesn't create a parse.
Your code
class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    rules = ()  # big no no ! s3 rul3s
How it should look:
class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = ['http://quotes.toscrape.com']  # this would be considered a base url
    # regex is our bf, know him well; basically all pages that follow
    # the pattern /page/.* (meaning all that follow it, no exception)
    rules = (
        Rule(LinkExtractor(allow=r'/page/.*'), callback='parse_item', follow=True),
    )
Number two: going back to what I mentioned about using the parse function with a Scrapy crawl spider, you should use "parse_item". I assume that you at least looked over the official docs, but to sum it up: the reason parse cannot be used is that the crawl spider already uses parse within its own logic, so by defining parse in a crawl spider you're overriding a native function it has, which can cause all sorts of bugs and issues.
That's pretty straightforward; I don't think I have to go ahead and show you a snippet but feel free to go to the official Docs and on the right side where it says "spiders" go ahead and scroll down until you hit "crawl spiders" and it gives some notes with a caution...
To my next point: from your initial parse you do not have a callback that goes from parse to parse_details, which leads me to believe that when you perform the crawl you don't go past the first page. Aside from that, you're trying to create a text file (you're using the os module to write out something, but you're not actually writing anything), so I'm super confused as to why you are using the write function instead of read.
I mean, I myself have on many occasions used an external text file or CSV file that includes multiple URLs so I don't have to hard-code them, but you're clearly writing out, or trying to write, to a file, which you said was a pipeline? Now I'm even more confused! But the point is that I hope you're well aware of the fact that if you are trying to create a file or export of your extracted items, there are options to export to three already pre-built formats, those being CSV, JSON and XML. And as you said in your response to my comment, if indeed you're using a pipeline and item exporter in turn, you can create your own export format as you wish. But if it's only the response URL that you need, why go through all that hassle?
My parting words would be: it would serve you well to go over Scrapy's official docs tutorial again, ad nauseam, stressing the importance of also using settings.py as well as items.py.
# -*- coding: utf-8 -*-
import os

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from quotes.items import QuotesItem

class QcrawlSpider(CrawlSpider):
    name = 'qCrawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    rules = (
        Rule(LinkExtractor(allow=r'page/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        rurl = response.url
        item = QuotesItem()
        item['quote'] = response.css('span.text::text').extract()
        item['author'] = response.css('small.author::text').extract()
        item['rUrl'] = rurl
        yield item
        with open(os.path.abspath('') + '_' + "urllisr_" + '.txt', 'a') as a:
            a.write(''.join([rurl, '\n']))
Of course, the items.py would be filled out appropriately with the fields you see in the spider, but by including the response URL as an itemized field I can do both: write out using even the default Scrapy exports (CSV etc.) or create my own.
In this case it is a simple text file, but one can get pretty crafty; for example, written out correctly, that same use of the os module can, as I have done, create an m3u playlist from video hosting sites, or you can get fancy with a custom CSV item exporter. And beyond that, using a custom pipeline you can then write out a custom format for your CSVs or whatever it is that you wish.

scrapy parse_item not being called for a particular domain

I am trying to use Scrapy to crawl Shoescribe, but somehow parse_item is not called. I tried the same code with another website and it works fine. Totally no idea what goes wrong. Any help will be really, really appreciated! Thanks!
import scrapy
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from lsspider.items import *

class ShoeScribeSpider(CrawlSpider):
    name = "shoescribe"
    merchant_name = "shoescribe.com"
    allowed_domains = ["www.shoescribe.com"]
    start_urls = [
        "http://www.shoescribe.com/us/women/ankle-boots_cod44709699mx.html",
    ]
    rules = (
        Rule(LinkExtractor(allow=('http://www.shoescribe.com/us/women/ankle-boots_cod44709699mx.html')), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print 'parse_item'
        item = Item()
        item['url'] = response.url.split('?')[0]
        print item['url']
        return item
I am not sure if you already figured it out but here are a few observations that I have made that might be helpful.
The print statement won't work in your situation, even if the code gets executed. Usually you can use the logging facility as Scrapy recommends, or you have to turn on logging so Scrapy will forward all the stdout/stderr to you.
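A small stdlib-only sketch of the logging route (the spider name is taken from the question): a Scrapy spider's self.logger is essentially a thin wrapper around the stdlib logger named after the spider, so plain logging calls surface in Scrapy's log output where a bare print may not.

```python
# Stdlib-only sketch: Scrapy's self.logger is, roughly,
# logging.getLogger(spider.name) with some extras attached.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shoescribe")
logger.info("parse_item called for %s", "http://www.shoescribe.com/us/...")
```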
I basically copied your code into a brand-new Scrapy project, modifying the item class so it contains the URL field. Then I ran the crawler using parse, and it seems it passed the rule and also used the proper callback function. In the end, it had scraped items generated. I bet if you write a pipeline the result will be properly generated too.
Here is the output that proved your code is working

Scrapy stopping conditions

Illustrative scenario: A Scrapy spider is built to scrape restaurant menus from a start_urls list of various restaurant websites. Once the menu is found for each restaurant, it is no longer necessary to continue crawling that particular restaurant website. The spider should (ideally) abort the queue for that start_url and move on to the next restaurant.
Is there a way to stop Scrapy from crawling the remainder of its request queue *per start_url* once a stopping condition is satisfied? I don't think that a CloseSpider exception is appropriate since I don't want to stop the entire spider, just the queue of the current start_url, and then move on to the next start_url.
Don't use scrapy rules.
All you need:
start_urls = [
    'http://url1.com', 'http://url2.com', ...
]

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, self.parse_url)

def parse_url(self, response):
    hxs = Selector(response)
    item = YourItem()
    # process data
    return item
And don't forget to add all domains to the allowed_domains list.
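The per-start_url stopping behaviour the question asks about falls out of this structure naturally: once a callback finds what it wants, it simply stops yielding further Requests for that site. A pure-Python sketch of the idea (the link graph and menu predicate are made up for illustration):

```python
from collections import deque

def crawl_site(start, get_links, is_menu):
    """Breadth-first crawl of one site that halts as soon as the
    stopping condition (here, finding a menu page) is satisfied."""
    queue, seen = deque([start]), {start}
    while queue:
        url = queue.popleft()
        if is_menu(url):
            return url  # abandon the rest of this site's queue
        for nxt in get_links(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # exhausted the site without finding a menu

# toy link graph standing in for one restaurant website
graph = {"home": ["about", "food"], "food": ["menu"], "about": [], "menu": []}
print(crawl_site("home", lambda u: graph.get(u, []), lambda u: u == "menu"))
```

In a real spider the "stop enqueuing" step is just an early return from the callback before any further `yield Request(...)` for that domain.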

How can I scrape other specific pages on a forum with Scrapy?

I have a Scrapy Crawler that crawls some guides from a forum.
The forum that I'm trying to crawl the data has got a number of pages.
The problem is that I cannot extract the links that I want to because there aren't specific classes or ids to select.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page=1 to 2, 3, 4 and so on but I would like to know what is the best choice to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/
I can't seem to open the forum URL (always redirects me to another website), so here's a best effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",)), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Your code
        ...
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page with only one or two query arguments different (say, order=asc), you can specify deny=(...) in the Rule constructor to filter them out.
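A sketch of that deny usage, reusing the rule above (note SgmlLinkExtractor is long deprecated; modern Scrapy would use scrapy.linkextractors.LinkExtractor, which takes the same allow/deny arguments):

```python
rules = [
    Rule(
        SgmlLinkExtractor(
            allow=(r"forumdisplay\.php.*f=108.*page=",),
            deny=(r"order=asc",),  # filter out near-duplicate sort orders
        ),
        callback="parse_item",
        follow=True,
    ),
]
```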