Illustrative scenario: A Scrapy spider is built to scrape restaurant menus from a start_urls list of various restaurant websites. Once the menu is found for each restaurant, it is no longer necessary to continue crawling that particular restaurant website. The spider should (ideally) abort the queue for that start_url and move on to the next restaurant.
Is there a way to stop Scrapy from crawling the remainder of its request queue *per start_url* once a stopping condition is satisfied? I don't think that a CloseSpider exception is appropriate since I don't want to stop the entire spider, just the queue of the current start_url, and then move on to the next start_url.
Don't use scrapy rules.
All you need is:
start_urls = [
    'http://url1.com', 'http://url2.com', ...
]

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, self.parse_url)

def parse_url(self, response):
    hxs = Selector(response)
    item = YourItem()
    # process data
    return item
And don't forget to add all the domains to the allowed_domains list.
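To tie this back to the per-start_url stopping condition in the question: one common pattern (a minimal sketch under my own assumptions, not part of the answer above) is to track which sites are already finished and simply stop yielding new requests for them. The MenuSpider name, the '#menu' selector, and the yielded dict below are hypothetical placeholders:

import scrapy
from urllib.parse import urlparse

class MenuSpider(scrapy.Spider):
    name = "menus"
    start_urls = ['http://url1.com', 'http://url2.com']
    # remember to set allowed_domains so the spider stays on each restaurant's site

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.finished = set()  # domains whose menu has already been found

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        domain = urlparse(response.url).netloc
        if domain in self.finished:
            return  # this site is done, discard anything still arriving for it
        if response.css('#menu'):  # hypothetical stopping condition
            self.finished.add(domain)
            yield {'menu_url': response.url}
            return
        # menu not found yet: keep following links within the same site
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_page)

Requests that were already scheduled for a finished domain will still be downloaded, but the early return discards them, so the crawl effectively moves on to the next restaurant.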
I created a spider that can scrape a page in an e-commerce site and gather the data on the different items.
The spider works fine with specific pages of the site (www.sitedomain/123-item-category), as well as with the search page (www.sitedomain/searchpage?controller?search=keywords+item+to+be+found).
But, when I run it through Scrapyrt, the specific page works fine, while the search page returns 0 items. No errors, just 0 items. This occurs on two different sites with two different spiders.
Is there something specific to search pages that has to be taken in account when using Scrapyrt?
Take a spider like this:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        urls = [
            "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print("Found ", len(response.css("article")), " items")
        for article in response.css("article"):
            print("Item: ", article.css("img::attr(title)").get())
(I also set ROBOTSTXT_OBEY = False.)
I get 20 items back if I run:
scrapy crawl minimal
But I get 0 items back (no errors, just no results) if I run:
curl "http://localhost:9081/crawl.json?spider_name=minimal&url=https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"
I think you could put "return item" at the end of "process_item" in your pipelines.py:
def process_item(self, item, spider):
    ...
    return item
I ran into the same issue as you described; after I edited my code, I solved it.
Hope this could help =)
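For context, a minimal pipeline sketch (the class name is a placeholder): Scrapy hands the return value of process_item to the next enabled pipeline, so forgetting the return effectively drops the item.

class MyPipeline:
    def process_item(self, item, spider):
        # ... transform or validate the item here ...
        return item  # hand the item on to the next pipeline / exporter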
I wrote a small example spider to illustrate my problem:
import json
import scrapy

class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        for i in json.loads(response.text)['ids']:
            yield scrapy.Request(f'https://example.com/list/{i}', callback=self.parse_lists)

    def parse_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)
I need to have all the items that result from the multiple requests (all ListEntryItems) in an array inside the spider, so that I can dispatch requests that depend on all of them.
My first idea was to chain the requests and pass the remaining IDs and the already extracted items in the request's meta attribute until the last request is reached.
from typing import List

class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        ids = json.loads(response.text)['ids']
        yield self._create_request(ids, [])

    def parse_lists(self, response):
        items = response.meta['items']
        items.extend(self._extract_lists(response))
        yield self._create_request(response.meta['ids'], items)

    def finish(self, response):
        items = response.meta['items']
        items.extend(self._extract_lists(response))

    def _extract_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)

    def _create_request(self, ids: list, items: List[ListEntryItem]):
        i = ids.pop(0)
        return scrapy.Request(
            f'https://example.com/list/{i}',
            meta={'ids': ids, 'items': items},
            callback=self.parse_lists if len(ids) > 1 else self.finish
        )
As you can see, my solution looks very complex. I'm looking for something more readable and less complex.
There are different approaches for this. One is chaining, as you do. Problems occur if one of the requests in the middle of the chain is dropped for any reason. You have to be really careful about that and handle all possible errors / ignored requests.
Another approach is to use a separate spider for all "grouped" requests.
You can start those spiders programmatically and pass a bucket (e.g. a dict) as spider attribute. Within your pipeline you add your items from each request to this bucket. From "outside" you listen to the spider_closed signal and get this bucket which then contains all your items.
Look here for how to start a spider programmatically via a crawler runner:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
pass a bucket to your spider when calling crawl() of your crawler runner
crawler_runner_object.crawl(YourSpider, bucket=dict())
and catch the spider_closed signal
from scrapy import signals
from scrapy.signalmanager import dispatcher

def on_spider_closed(spider):
    bucket = spider.bucket

dispatcher.connect(on_spider_closed, signal=signals.spider_closed)
This approach might seem even more complicated than chaining your requests, but it actually takes a lot of complexity out of the problem, because within your spider you can make your requests without worrying much about all the other requests.
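Putting the pieces together, here is a minimal sketch of the whole pattern (ListEntrySpider is the spider from the question; the pipeline or callback code that actually puts items into spider.bucket is assumed and not shown):

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
from scrapy.utils.log import configure_logging

collected = {}

def on_spider_closed(spider):
    # by now the bucket holds everything that was added to it during the crawl
    collected['items'] = spider.bucket

dispatcher.connect(on_spider_closed, signal=signals.spider_closed)

configure_logging()
runner = CrawlerRunner()
deferred = runner.crawl(ListEntrySpider, bucket=dict())  # bucket becomes a spider attribute
deferred.addBoth(lambda _: reactor.stop())
reactor.run()  # blocks until the crawl has finished

print(collected)

Because each grouped crawl runs in its own spider, each bucket only ever contains its own group's items.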
The site to scrape has multiple projects with multiple pages and requires a log in. I tried:
def start_requests(self):
    return [scrapy.FormRequest(..., callback=self.logged_in)]

def logged_in(self, response):
    with open(...) as f:
        for url in f.readlines():
            return scrapy.Request(url, callback=self.parse)

def parse(self, response):
    # ... do some scraping ...
    # ... determine the url of the next page ...
    return scrapy.Request(... next page ..., self.parse)
This results in scraping all pages of one project (login is successful), but then it stops.
If return scrapy.Request() in the logged_in() function is replaced by yield scrapy.Request(), then it reads the first page of all projects.
I played around with the returns and yields, but I can't get it to scrape all pages of all projects.
BTW, I tried to create a start_urls array, but that doesn't work because the spider first needs to log into the site.
return will always return only once, so don't expect more than one Request to be returned there; if you wish to return more than once, use yield.
Scrapy's Request has a parameter called dont_filter; the scheduler's duplicate filter is probably what is dropping your calls to the parse function:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
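For illustration (next_page_url is a placeholder, not something from the question), bypassing the duplicates filter for a single request looks like this:

yield scrapy.Request(next_page_url, callback=self.parse, dont_filter=True)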
@Guy Gavriely makes a good point about return. I would add that you need another parse method to filter all the pages you want:
def start_requests(self):
    return [scrapy.FormRequest(..., callback=self.logged_in)]

def logged_in(self, response):
    with open(...) as f:
        for url in f.readlines():
            yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    # ... do some scraping ...
    yield scrapy.Request(... next page ..., self.parse_user)

def parse_user(self, response):
    # ... do some scraping ...
    yield items
You're not done yet! The final parse method may need to be iterated for example:
def parse(self, response):
    # ... do some scraping ...
    for sel in some_list_of_users_by_an_xpath:
        user_profile_tag = response.xpath('xpath_to_user_profile_urls')
        # may need cleaning prior to the Request, e.g. 'domain' + user_profile_tag, or .split()/.replace()
        user_profile_url_clean = user_profile_tag
        yield scrapy.Request(user_profile_url_clean, self.parse_user)
In this case the parse function will parse a user each time from that list of users. Then parse_user will do most of the actual digging and scraping. Once it's done, it will return to the original parse method and just move on to the next one in the list.
Good luck!
Is it possible to create a spider which inherits/uses functionality from two base spiders?
I'm trying to scrape various sites and I've noticed that in many instances the site provides a sitemap but this just points to category/listing type pages, not 'real' content. Because of this I'm having to use the CrawlSpider (pointing to the website root) instead but this is pretty inefficient as it crawls through all pages including a lot of junk.
What I would like to do is something like this:
Start my spider, which is a subclass of SitemapSpider, and pass each response to the parse_items method.
In parse_items, test whether the page contains 'real' content.
If it does, process it; if not, pass the response to the CrawlSpider (actually my subclass of CrawlSpider) to process.
The CrawlSpider then looks for links in the page, say 2 levels deep, and processes them.
Is this possible? I realise that I could copy and paste the code from the CrawlSpider into my spider but this seems like a poor design
In the end I decided to just extend SitemapSpider and lift some of the code from CrawlSpider, as it was simpler than trying to deal with multiple inheritance issues. So basically:
from scrapy import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):

    def __init__(self, **kw):
        super(MySpider, self).__init__(**kw)
        self.link_extractor = LxmlLinkExtractor()

    def parse(self, response):
        # perform item extraction etc
        ...
        links = self.link_extractor.extract_links(response)
        for link in links:
            yield Request(link.url, callback=self.parse)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import SitemapSpider, CrawlSpider, Rule

class MySpider(SitemapSpider, CrawlSpider):
    name = "myspider"
    rules = (
        Rule(LinkExtractor(allow=('',)), callback='parse_item', follow=True),
    )
    sitemap_rules = [
        ('/', 'parse_item'),
    ]
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    start_urls = ['http://www.example.com']
    allowed_domains = ['example.com']

    def parse_item(self, response):
        # Do your stuff here
        ...

        # Return to CrawlSpider that will crawl them
        yield from self.parse(response)
This way, Scrapy will start with the urls in the sitemap, then follow all links within each url.
Source: Multiple inheritance in scrapy spiders
You can inherit as usual; the only thing you have to be careful about is that the base spiders usually override the basic methods start_requests and parse. The other thing to point out is that CrawlSpider will get links from every response that goes through _parse_response.
Setting a low value for DEPTH_LIMIT should be a way to manage the fact that CrawlSpider will get links from every response that goes through _parse_response (after reviewing the original question, this was already proposed).
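For example, a sketch of capping the crawl depth on the combined spider from above via the built-in DEPTH_LIMIT setting:

class MySpider(SitemapSpider, CrawlSpider):
    name = "myspider"
    # only follow links up to two hops away from the sitemap / start urls
    custom_settings = {
        "DEPTH_LIMIT": 2,
    }
    # ... rules, sitemap_rules, sitemap_urls as shown above ...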
I have searched a lot of topics but do not seem to find the answer to my specific question.
I have created a crawl spider for a website and it works perfectly. I then made a similar one to crawl a similar website, but this time I have a small issue. Down to business:
My start url looks as follows: www.example.com. The links on the page that I want my spider to work on look like:
www.example.com/locationA
www.example.com/locationB
www.example.com/locationC
...
I now have an issue:
Every time I enter the start url, it automatically redirects to www.example.com/locationA, and all the links my spider ends up working on include
www.example.com/locationB
www.example.com/locationC
...
So my problem is how I can include www.example.com/locationA in the returned URLs. I even got log info like:
2011-11-28 21:25:33+1300 [example.com] DEBUG: Redirecting (302) to <...> from <GET http://www.example.com/>
2011-11-28 21:25:34+1300 [example.com] DEBUG: Redirecting (302) to <...> (referer: None)
2011-11-28 21:25:37+1300 [example.com] DEBUG: Redirecting (302) to <...> (referer: www.example.com/locationB)
Print out from parse_item: www.example.com/locationB
....
I think the issue might somehow be related to that (referer: None). Could anyone please shed some light on this?
I have narrowed down this issue by changing the start url to www.example.com/locationB. Since all the pages contain the list of all locations, this time my spider works on:
www.example.com/locationA
www.example.com/locationC
...
In a nutshell, I am looking for a way to include the url that is the same as (or is redirected from) the start url in the list that the parse_item callback will work on.
For others who have the same problem: after a lot of searching, all you need to do is name your callback function parse_start_url.
Eg:
rules = (
    Rule(LinkExtractor(allow=(),
                       restrict_xpaths=('//*[contains(concat(" ", @class, " "), concat(" ", "pagination-next", " "))]//a',)),
         callback="parse_start_url", follow=True),
)
Adding sample code based on mindcast's suggestion:
I managed using the following approach.
class ExampleSpider(CrawlSpider):
    name = "examplespider"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/A']
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@id='tag_cloud']/a",)), callback="parse_items", follow=True),
    )

    def parse_start_url(self, response):
        self.log('>>>>>>>> PARSE START URL: %s' % response)
        # www.example.com/A will be parsed here
        return self.parse_items(response)

    def parse_items(self, response):
        """Scrape data from links based on Crawl Rules"""
        self.log('>>>>>>>> PARSE ITEM FROM %s' % response.url)
At first I thought that there is a simple solution using start_requests() like:
def start_requests(self):
    yield Request('START_URL_HERE', callback=self.parse_item)
But tests showed that when start_requests() is used instead of a start_urls list, the spider ignores the rules, because CrawlSpider.parse(response) is not called.
So, here is my solution:
import itertools

class SomeSpider(CrawlSpider):
    # ...
    start_urls = ('YOUR_START_URL',)
    rules = (
        Rule(
            SgmlLinkExtractor(allow=(r'YOUR_REGEXP',)),
            follow=True,
            callback='parse_item'),
    )

    def parse(self, response):
        return itertools.chain(
            CrawlSpider.parse(self, response),
            self.parse_item(response))

    def parse_item(self, response):
        yield item
Perhaps there is a better way.
A simple workaround is to add a rule specifically for the start url (in your case: http://example.com/locationA):
class ExampleSpider(CrawlSpider):
    name = "examplespider"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/locationA']
    rules = (
        Rule(LinkExtractor(allow=('locationA')), callback='parse_item'),
        Rule(LinkExtractor(allow=('location\.*?'), restrict_css=('.pagination-next',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('>>>>>>>> PARSE ITEM FROM %s' % response.url)