Spider through Scrapyrt returns 0 items when scraping search page - scrapy

I created a spider that can scrape a page in an e-commerce site and gather the data on the different items.
The spider works fine with specific pages of the site (www.sitedomain/123-item-category), as well as with the search page (www.sitedomain/searchpage?controller=search&s=keywords+item+to+be+found).
But when I run it through Scrapyrt, the specific page works fine, while the search page returns 0 items. No errors, just 0 items. This happens on two different sites with two different spiders.
Is there something specific to search pages that has to be taken into account when using Scrapyrt?
Take a spider like this:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        urls = [
            "https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print("Found ", len(response.css("article")), " items")
        for article in response.css("article"):
            print("Item: ", article.css("img::attr(title)").get())
(I also set ROBOTSTXT_OBEY = False.)
I get 20 items back if I run:
scrapy crawl minimal
But I get 0 items back (no errors, just no results) if I run:
curl "http://localhost:9081/crawl.json?spider_name=minimal&url=https://www.dungeondice.it/ricerca?controller=search&s=ticket+to+ride"

I think you could put return item at the end of process_item in your pipelines.py:

def process_item(self, item, spider):
    ...
    return item

I ran into the same issue you describe; after editing my code this way, it was solved.
Hope this helps =)
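For reference, a minimal pipeline sketch showing the shape Scrapy expects (the class name and the title check are placeholders, not taken from the post): process_item must return the item, or raise DropItem, so that the item keeps flowing to later pipelines and to the exporters.

from scrapy.exceptions import DropItem

class CleanupPipeline:
    def process_item(self, item, spider):
        # hypothetical rule, just for illustration
        if not item.get("title"):
            raise DropItem("missing title")
        # returning the item lets it continue to later pipelines and the output feed
        return item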

Related

Is there a way to get the URL that a link is scraped from?

I've got a spider written out that crawls my website and scrapes a bunch of tags. I'm now trying to have it return the URL that the link was discovered on.
For example:
www.example.com/product/123 was found on www.example.com/page/2.
When scrapy scrapes information from /product/123 I want to have a field called "Scraped From" that returns /page/2. For every URL that is scraped, I want to know the originating page where that URL was found. I've been poring over the docs and can't seem to figure this out. Any help would be appreciated!
The easiest way is to use response.headers. There should be a Referer header.

# note: header values are bytes, so decode if you need a str
referer = response.headers['Referer']
You can also use meta to pass information along to the next URL.
def parse(self, response):
    product_url = response.css('#url').get()
    yield scrapy.Request(product_url, callback=self.parse_product, meta={'referer': response.url})

def parse_product(self, response):
    referer = response.meta['referer']
    item = ItemName()
    item['referer'] = referer
    yield item
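As a side note (not part of the original answer): on Scrapy 1.7+ you can also pass data to the callback with cb_kwargs, which keeps meta free for middleware-related keys. A minimal sketch, reusing the same hypothetical ItemName item:

def parse(self, response):
    product_url = response.css('#url').get()
    yield scrapy.Request(
        product_url,
        callback=self.parse_product,
        cb_kwargs={'referer': response.url},  # delivered to the callback as a keyword argument
    )

def parse_product(self, response, referer):
    item = ItemName()
    item['referer'] = referer
    yield item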

Collect items from multiple requests in an array in Scrapy

I wrote a small example spider to illustrate my problem:
import json
import scrapy

class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        # json.loads (not json.dumps) parses the JSON body
        for i in json.loads(response.text)['ids']:
            yield scrapy.Request(f'https://example.com/list/{i}', callback=self.parse_lists)

    def parse_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)
I need to collect all the items that result from these multiple requests (all ListEntryItems) in an array inside the spider, so that I can dispatch further requests that depend on all of the items.
My first idea was to chain the requests and pass the remaining IDs and the already extracted items in the request's meta attribute until the last request is reached.
import json
from typing import List

import scrapy

class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        ids = json.loads(response.text)['ids']
        yield self._create_request(ids, [])

    def parse_lists(self, response):
        items = response.meta['items']
        items.extend(self._extract_lists(response))
        yield self._create_request(response.meta['ids'], items)

    def finish(self, response):
        items = response.meta['items']
        items.extend(self._extract_lists(response))
        # ... dispatch the requests that depend on all items ...

    def _extract_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)

    def _create_request(self, ids: list, items: List[ListEntryItem]):
        i = ids.pop(0)
        return scrapy.Request(
            f'https://example.com/list/{i}',
            meta={'ids': ids, 'items': items},
            callback=self.parse_lists if len(ids) > 1 else self.finish
        )
As you can see, my solution looks very complex. I'm looking for something more readable and less complex.
There are different approaches for this. One is chaining, as you do. Problems occur if one of the requests in the middle of the chain is dropped for any reason; you have to be really careful about that and handle all possible errors / ignored requests.
Another approach is to use a separate spider for all "grouped" requests.
You can start those spiders programmatically and pass a bucket (e.g. a dict) as a spider attribute. Within your pipeline you add the items from each request to this bucket. From the outside you listen for the spider_closed signal and read this bucket, which then contains all your items.
Look here for how to start a spider programmatically via a crawler runner:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
pass a bucket to your spider when calling crawl() of your crawler runner
crawler_runner_object.crawl(YourSpider, bucket=dict())
and catch the spider_closed signal:

from scrapy import signals
from scrapy.signalmanager import dispatcher

def on_spider_closed(spider):
    bucket = spider.bucket

dispatcher.connect(on_spider_closed, signal=signals.spider_closed)
This approach might seem even more complicated than chaining your requests, but it actually takes a lot of complexity out of the problem: within your spider you can make your requests without worrying much about all the other requests.
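Putting those pieces together, a minimal sketch of the bucket idea, assuming the default Twisted reactor and a standard project layout (the module, spider, and pipeline names are illustrative, not taken from the original code):

# myproject/pipelines.py  (hypothetical module)
# enabled in settings.py via ITEM_PIPELINES = {'myproject.pipelines.BucketPipeline': 100}
class BucketPipeline:
    def process_item(self, item, spider):
        # append every scraped item to the bucket passed in via crawl()
        spider.bucket.setdefault('items', []).append(item)
        return item

# run_crawl.py  (hypothetical script)
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

from myproject.spiders.list_entries import ListEntrySpider  # hypothetical import path

def on_spider_closed(spider):
    # fires once the crawl has finished; spider.bucket now holds every item
    print(f"collected {len(spider.bucket.get('items', []))} items")

dispatcher.connect(on_spider_closed, signal=signals.spider_closed)

runner = CrawlerRunner(get_project_settings())    # picks up ITEM_PIPELINES, including BucketPipeline
d = runner.crawl(ListEntrySpider, bucket=dict())  # bucket becomes an attribute on the spider
d.addBoth(lambda _: reactor.stop())
reactor.run()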

Scraping a list of urls after log in

The site to scrape has multiple projects with multiple pages and requires a log in. I tried:
def start_requests(self):
    return [scrapy.FormRequest(..., callback=self.logged_in)]

def logged_in(self, response):
    with open(...) as f:
        for url in f.readlines():
            return scrapy.Request(url, callback=self.parse)

def parse(self, response):
    ... do some scraping ...
    ... determine the url of the next page ...
    return scrapy.Request(... next page ..., self.parse)
This results in scraping all pages of one project (login is successful), but then it stops.
If return scrapy.Request() in the logged_in() function is replaced by yield scrapy.Request(), then it reads the first page of all projects.
I played around with the returns and yields, but I can't get it to scrape all pages of all projects.
BTW I tried to create a start_urls array, but that doesn't work because the spider first needs to log into the site.
return will only return once, so don't expect more than one Request to be returned there; if you wish to return more than one, use yield.
Scrapy Request has a parameter called dont_filter, and the duplicates filter is probably what is dropping your calls to the parse function:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
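If duplicate filtering really is what's biting you (for example, when several lines in the file resolve to the same URL), it can be disabled per request; a one-line sketch:

yield scrapy.Request(url, callback=self.parse, dont_filter=True)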
@Guy Gavriely makes a good point about return. I would add that you need another parse method to filter all the pages you want:
def start_requests(self):
    return [scrapy.FormRequest(..., callback=self.logged_in)]

def logged_in(self, response):
    with open(...) as f:
        for url in f.readlines():
            yield scrapy.Request(url, callback=self.parse)

def parse(self, response):
    ... do some scraping ...
    yield scrapy.Request(... next page ..., self.parse_user)

def parse_user(self, response):
    ... do some scraping ...
    yield items
You're not done yet! The final parse method may need to iterate over a list, for example:

def parse(self, response):
    ... do some scraping ...
    for sel in (some_list_of_users_by_an_xpath):
        user_profile_tag = response.xpath('xpath_to_user_profile_urls')
        # might need cleaning prior to the Request, e.g. 'domain' + user_profile_tag,
        # or .split()/.replace(), etc.
        user_profile_url_clean = user_profile_tag
        yield scrapy.Request(user_profile_url_clean, self.parse_user)

In this case the parse function yields a request for each user in that list. parse_user then does most of the actual digging and scraping; once it's done, crawling moves on to the next user from the list.
Good luck!
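Since the FormRequest arguments are elided above, here is a rough sketch of one common way to handle the login step with FormRequest.from_response; the spider name, login URL, form field names and file name are all placeholders, not taken from the question:

import scrapy

class ProjectSpider(scrapy.Spider):
    name = "projects"

    def start_requests(self):
        yield scrapy.Request("https://example.com/login", callback=self.login)  # placeholder URL

    def login(self, response):
        # fill in the login form found on the page; field names are placeholders
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "secret"},
            callback=self.logged_in,
        )

    def logged_in(self, response):
        with open("urls.txt") as f:  # placeholder file of project URLs
            for url in f:
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        ...  # scrape the page and follow pagination as in the answer above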

Combining spiders in Scrapy

Is it possible to create a spider which inherits/uses functionality from two base spiders?
I'm trying to scrape various sites and I've noticed that in many instances the site provides a sitemap but this just points to category/listing type pages, not 'real' content. Because of this I'm having to use the CrawlSpider (pointing to the website root) instead but this is pretty inefficient as it crawls through all pages including a lot of junk.
What I would like to do is something like this:
Start my Spider, which is a subclass of SitemapSpider, and pass each response to the parse_items method.
In parse_items, test whether the page contains 'real' content.
If it does, process it; if not, pass the response to the CrawlSpider (actually my subclass of CrawlSpider) to process.
CrawlSpider then looks for links in the page, say 2 levels deep, and processes them.
Is this possible? I realise that I could copy and paste the code from the CrawlSpider into my spider, but this seems like a poor design.
In the end I decided to just extend the sitemap spider and lift some of the code from the crawl spider, as it was simpler than trying to deal with multiple inheritance issues, so basically:
from scrapy import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    def __init__(self, **kw):
        super(MySpider, self).__init__(**kw)
        self.link_extractor = LxmlLinkExtractor()

    def parse(self, response):
        # perform item extraction etc.
        ...
        links = self.link_extractor.extract_links(response)
        for link in links:
            yield Request(link.url, callback=self.parse)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import SitemapSpider, CrawlSpider, Rule

class MySpider(SitemapSpider, CrawlSpider):
    name = "myspider"
    rules = ( Rule(LinkExtractor(allow=('', )), callback='parse_item', follow=True), )
    sitemap_rules = [ ('/', 'parse_item'), ]
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    start_urls = ['http://www.example.com']
    allowed_domains = ['example.com']

    def parse_item(self, response):
        # Do your stuff here
        ...
        # Return to CrawlSpider that will crawl them
        yield from self.parse(response)
This way, Scrapy will start with the urls in the sitemap, then follow all links within each url.
Source: Multiple inheritance in scrapy spiders
You can inherit as usual; the only thing you have to be careful about is that the base spiders usually override the basic methods start_requests and parse. The other thing to point out is that CrawlSpider will get links from every response that goes through _parse_response.
Setting a low value for DEPTH_LIMIT should be a way to manage the fact that CrawlSpider will get links from every response that goes through _parse_response (after reviewing the original question, this was already proposed there).
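For illustration, the depth cap can be set per spider with custom_settings (DEPTH_LIMIT is a standard Scrapy setting; the value 2 just mirrors the "2 levels deep" mentioned in the question):

class MySpider(SitemapSpider, CrawlSpider):
    custom_settings = {
        'DEPTH_LIMIT': 2,  # stop following links more than 2 levels away from the seed pages
    }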

Scrapy stopping conditions

Illustrative scenario: A Scrapy spider is built to scrape restaurant menus from a start_urls list of various restaurant websites. Once the menu is found for each restaurant, it is no longer necessary to continue crawling that particular restaurant website. The spider should (ideally) abort the queue for that start_url and move on to the next restaurant.
Is there a way to stop Scrapy from crawling the remainder of its request queue *per start_url* once a stopping condition is satisfied? I don't think that a CloseSpider exception is appropriate since I don't want to stop the entire spider, just the queue of the current start_url, and then move on to the next start_url.
Don't use scrapy rules.
All you need is:

start_urls = [
    'http://url1.com', 'http://url2.com', ...
]

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, self.parse_url)

def parse_url(self, response):
    hxs = Selector(response)
    item = YourItem()
    # process data
    return item

And don't forget to add all the domains to the allowed_domains list.
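Building on that answer, a rough sketch of one way to express the stopping condition itself: only yield the next-page request while the menu has not been found, so each start_url's queue simply dries up on its own once the condition is met (the CSS selectors and field names are placeholders):

import scrapy

class MenuSpider(scrapy.Spider):
    name = "menus"
    start_urls = ['http://restaurant1.example', 'http://restaurant2.example']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_url)

    def parse_url(self, response):
        menu = response.css('.menu').get()  # placeholder selector
        if menu is not None:
            # stopping condition met for this site: yield the item, no further requests
            yield {'site': response.url, 'menu': menu}
            return
        next_page = response.css('a.next::attr(href)').get()  # placeholder selector
        if next_page:
            yield response.follow(next_page, callback=self.parse_url)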