Invoke a different Spider after a page is parsed by another Spider - scrapy

This has been addressed to some extent here and here
But I'd like to ask here before doing any of what's suggested there because I don't really like any of the approaches.
So basically, I'm trying to scrape Steam games. As you may know, Steam has a link where you can access all the reviews for a game, for example:
https://steamcommunity.com/app/730/reviews/?browsefilter=toprated&snr=1_5_100010_
You can ignore snr and browsefilter query params there.
Anyhow, I have created a single Spider that crawls the list of games here, and it works pretty well:
https://store.steampowered.com/search/?sort_by=Released_DESC
But now, for each game I want to retrieve all reviews.
Originally I created a new Spider that deals with the infinite scroll in the page that has the whole set of reviews for a game, but obviously that spider needs the URL where those reviews live.
So basically what I'm doing now is scrape all game pages and store the URL with the reviews for each game in a txt file that is then passed as a parameter to the second spider. But I don't like this because it forces me to do a 2-step process, and besides, I need to map the results of the second spider to the results of the first one somehow (these reviews belong to this game, etc.).
So my questions are:
Would it be best to send the results of scraping the game page (and thus the URL with all reviews) to the second spider, or at least the URL, and then fetch all reviews for each game using the second spider? This will be O(N*M) in terms of performance, with N being the number of games and M the number of reviews per game; maybe just because of this, having 2 spiders is worth it... thoughts?
Can I actually invoke a Spider from another Spider? From what I've read in the Scrapy documentation, it doesn't look like it. I could probably move everything into one spider, but it will look awful and it doesn't adhere to the single-responsibility principle...

Why don't you use a different parse callback?
https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns
def parse(self, response):
    # follow links to author pages
    for href in response.css('.author + a::attr(href)'):
        yield response.follow(href, self.parse_author)

    # follow pagination links
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, self.parse)

def parse_author(self, response):
    def extract_with_css(query):
        return response.css(query).get(default='').strip()

    yield {
        'name': extract_with_css('h3.author-title::text'),
        'birthdate': extract_with_css('.author-born-date::text'),
        'bio': extract_with_css('.author-description::text'),
    }
And pass the needed values along via the request's meta attribute:
https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta
There's an example in: Is it possible to pass a variable from start_requests() to parse() for each individual request?
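For the Steam case described above, a minimal sketch of this single-spider, two-callback pattern could look like the following (the CSS selectors, the reviews URL template, and the app-id extraction are assumptions for illustration, not verified against Steam's current markup):
import scrapy


class SteamGamesSpider(scrapy.Spider):
    # Hypothetical spider sketching the single-spider, two-callback approach.
    name = 'steam_games'
    start_urls = ['https://store.steampowered.com/search/?sort_by=Released_DESC']

    def parse(self, response):
        # Selector is an assumption; adjust to the real search-results markup.
        for game_url in response.css('a.search_result_row::attr(href)').getall():
            yield response.follow(game_url, callback=self.parse_game)

    def parse_game(self, response):
        game = {
            'name': response.css('div.apphub_AppName::text').get(),
            'url': response.url,
        }
        # Hypothetical reviews URL built from the app id in the game URL
        # (store URLs look like https://store.steampowered.com/app/<id>/...).
        app_id = response.url.split('/app/')[1].split('/')[0]
        reviews_url = f'https://steamcommunity.com/app/{app_id}/reviews/?browsefilter=toprated'
        # Pass the game item along so each review can be mapped back to it.
        yield scrapy.Request(reviews_url, callback=self.parse_reviews,
                             meta={'game': game})

    def parse_reviews(self, response):
        game = response.meta['game']
        # Selector is an assumption; the real page loads reviews via infinite scroll.
        for review in response.css('div.apphub_Card'):
            yield {
                'game': game,
                'review': review.css('div.apphub_CardTextContent::text').get(),
            }
Because the game dict travels in meta, every review item already carries the game it belongs to, which removes the need for the two-step txt-file hand-off. The reviews page's infinite scroll would still need to be handled separately (for example by requesting whatever paged endpoint the scroll uses), just as in the original second spider.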

Related

How can I list the URL of the page the data was scraped from with Scrapy?

I'm a real beginner but I've been searching high and low and can't seem to find a solution. I'm working on building some spiders but I can't figure out how to identify what URL my scraped data comes from.
My spider is extremely basic right now, I'm trying to learn as I go.
I've tried a few lines I've found on Stack Overflow, but the only thing I could get working was a print call (something like "URL: " + response.request.url; I tried a bunch of things) inside the parse method; I can't get anything working in the yield.
I could add other identifiers to the output, but ideally I'd like the URL for the project I'm working towards.
import scrapy


class FanaticsSpider(scrapy.Spider):
    name = 'fanatics'
    start_urls = [
        'https://www.fanaticsoutlet.com/nfl/new-england-patriots/new-england-patriots-majestic-showtime-logo-cool-base-t-shirt-navy/o-9172+t-70152507+p-1483408147+z-8-1114341320',
        'https://www.fanaticsoutlet.com/nfl/new-england-patriots/new-england-patriots-nfl-pro-line-mantra-t-shirt-navy/o-2427+t-69598185+p-57711304142+z-9-2975969489',
    ]

    def parse(self, response):
        yield {
            'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').re(r'[$]\d+\.\d+'),
            #'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').get(),
            'regular-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="regular-price strike-through"]/text()').re(r'[$]\d+\.\d+'),
            #'regular-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="regular-price strike-through"]/text()').get(),
        }
Any help is much appreciated. I haven't begun to learn anything about pipelines yet; I'm not sure if that might hold a solution?
You can simply add the url in the yield like this:
yield {
    ...,
    'url': response.url,
    ...
}
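Applied to the spider above, that's just one more key in the yielded dict (a sketch inside FanaticsSpider, reusing the asker's own XPath expressions):
    def parse(self, response):
        yield {
            'url': response.url,  # the page this data was scraped from
            'sale-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="sale-price"]/text()').re(r'[$]\d+\.\d+'),
            'regular-price': response.xpath('//span[@data-talos="pdpProductPrice"]/span[@class="regular-price strike-through"]/text()').re(r'[$]\d+\.\d+'),
        }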

Collect items from multiple requests in an array in Scrapy

I wrote a small example spider to illustrate my problem:
import json

import scrapy

# ListEntryItem is assumed to be defined elsewhere in the project.


class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        for i in json.loads(response.text)['ids']:
            yield scrapy.Request(f'https://example.com/list/{i}', callback=self.parse_lists)

    def parse_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)
I need to collect all the items that result from these multiple requests (all ListEntryItems in an array inside the spider), so that I can dispatch requests that depend on all of the items.
My first idea was to chain the requests and pass the remaining IDs and the already extracted items in the request's meta attribute until the last request is reached.
import json
from typing import List

import scrapy


class ListEntrySpider(scrapy.Spider):
    start_urls = ['https://example.com/lists']

    def parse(self, response):
        ids = json.loads(response.text)['ids']
        yield self._create_request(ids, [])

    def parse_lists(self, response):
        items = response.meta['items'] + list(self._extract_lists(response))
        yield self._create_request(response.meta['ids'], items)

    def finish(self, response):
        items = response.meta['items'] + list(self._extract_lists(response))
        # ... dispatch the requests that depend on the complete item list here

    def _extract_lists(self, response):
        for entry in json.loads(response.text)['list']:
            yield ListEntryItem(**entry)

    def _create_request(self, ids: list, items: List[ListEntryItem]):
        i = ids.pop(0)
        return scrapy.Request(
            f'https://example.com/list/{i}',
            meta={'ids': ids, 'items': items},
            callback=self.parse_lists if len(ids) > 1 else self.finish
        )
As you can see, my solution looks very complex. I'm looking for something more readable and less complex.
There are different approaches for this. One is chaining, as you do. Problems occur if one of the requests in the middle of the chain is dropped for any reason. You have to be really careful about that and handle all possible errors / ignored requests.
Another approach is to use a separate spider for all "grouped" requests.
You can start those spiders programmatically and pass a bucket (e.g. a dict) as spider attribute. Within your pipeline you add your items from each request to this bucket. From "outside" you listen to the spider_closed signal and get this bucket which then contains all your items.
Look here for how to start a spider programmatically via a crawler runner:
https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Pass a bucket to your spider when calling crawl() on your crawler runner:
crawler_runner_object.crawl(YourSpider, bucket=dict())
and catch the spider_closed signal:
from scrapy import signals
from scrapy.signalmanager import dispatcher

def on_spider_closed(spider):
    bucket = spider.bucket

dispatcher.connect(on_spider_closed, signal=signals.spider_closed)
This approach might seem even more complicated than chaining your requests, but it actually takes a lot of complexity out of the problem, since within your spider you can make your requests without worrying about all the other requests.
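Putting those pieces together, a minimal end-to-end sketch could look like this (it assumes ListEntrySpider from the question is importable, and that the spider or one of its item pipelines appends each scraped item to spider.bucket):
from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher

# from myproject.spiders import ListEntrySpider  # the spider from the question

collected = {}  # holds the bucket of each finished spider run, keyed by spider name


def on_spider_closed(spider):
    # spider.bucket is the attribute passed in via crawl() below and filled
    # by the spider (or by an item pipeline) while the crawl was running.
    collected[spider.name] = spider.bucket


dispatcher.connect(on_spider_closed, signal=signals.spider_closed)

runner = CrawlerRunner()
# bucket=list() becomes an attribute on the spider instance (spider.bucket).
d = runner.crawl(ListEntrySpider, bucket=list())
d.addBoth(lambda _: reactor.stop())
reactor.run()

print(collected)  # all items, grouped per spider run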

Scrapy request url comes from which url response

In Scrapy, we can get response.url and response.request.url, but how do we know which parent URL a given response.url / response.request.url was extracted from?
Thank you,
Ken
You can use Request.meta to keep track of such information.
When you yield your request, include response.url in the meta:
yield response.follow(link, …, meta={'source_url': response.url})
Then read it in your parsing method:
source_url = response.meta['source_url']
That is the most straightforward way to do this, and you can use this method to keep track of original URLs even across different parsing methods, if you wish.
Otherwise, you might want to look into taking advantage of the redirect_urls meta key, which keeps track of redirect jumps.
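A minimal sketch of this pattern across two callbacks (the spider name and the link selector are placeholders for illustration):
import scrapy


class SourceTrackingSpider(scrapy.Spider):
    # Hypothetical spider showing how to carry the parent URL in meta.
    name = 'source_tracking'
    start_urls = ['https://example.com/']

    def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            # Remember which page this link was found on.
            yield response.follow(link, callback=self.parse_detail,
                                  meta={'source_url': response.url})

    def parse_detail(self, response):
        yield {
            'url': response.url,
            'source_url': response.meta['source_url'],  # the parent page
        }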

Looping in Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page and writes those links to a csv file. I then wrote another crawler that loops over those links and downloads some information from the pages they point to.
The loop on the links:
import csv

cr = csv.reader(open("linksToCrawl.csv", "rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate" + ''.join(row[0])[1:len(''.join(row[0]))])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=(r"&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
It seemed to be working fine. However, it seems that the crawler is only retrieving information from the pages with the extended URLs (with the "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override the parse_start_url() method in CrawlSpider:
class MySpider(CrawlSpider):

    def parse_items(self, response):
        # put your code here
        ...

    parse_start_url = parse_items
Your rule only allows URLs containing "&pageinfo=\d". In effect, only the pages with a matching URL will be processed. You need to change the allow parameter if you want the URLs without pageinfo to be processed as well.
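Putting the rule and the parse_start_url override together, a sketch for this spider might look like the following (SgmlLinkExtractor from the question has since been removed from Scrapy, so this uses the current LinkExtractor; the spider name and the yielded fields are placeholders):
import csv

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZapRatingsSpider(CrawlSpider):
    name = 'zap_ratings'  # hypothetical name

    # Build the start URLs from the csv produced by the first crawler,
    # as in the question.
    cr = csv.reader(open("linksToCrawl.csv"))
    start_urls = ["http://www.zap.co.il/rate" + ''.join(row[0])[1:] for row in cr]

    rules = (
        Rule(LinkExtractor(allow=(r"&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # Placeholder extraction; put the real field extraction here.
        yield {'url': response.url}

    # Make sure the start URLs themselves (without &pageinfo=) also go
    # through parse_items, not only the pages matched by the rule.
    parse_start_url = parse_items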

How can I scrape other specific pages on a forum with Scrapy?

I have a Scrapy Crawler that crawls some guides from a forum.
The forum that I'm trying to crawl the data has got a number of pages.
The problem is that I cannot extract the links that I want to because there aren't specific classes or ids to select.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page=1 to 2, 3, 4 and so on, but I would like to know the best way to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/
I can't seem to open the forum URL (it always redirects me to another website), so here's a best-effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Your code
        ...
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page with only one or two query arguments different (say, order=asc), you can specify deny=(...) on the link extractor to filter them out.
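For example (a sketch; the order=asc pattern is just the one mentioned above, adjust it to the duplicates you actually see):
    rules = [
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",),
                               deny=("order=asc",)),
             callback='parse_item', follow=True),
    ]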