How can I scrape other specific pages on a forum with Scrapy?

I have a Scrapy Crawler that crawls some guides from a forum.
The forum that I'm trying to crawl the data from has a number of pages.
The problem is that I cannot extract the links that I want to because there aren't specific classes or ids to select.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page= to 2, 3, 4 and so on, but I would like to know the best way to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/

I can't seem to open the forum URL (it always redirects me to another website), so here's a best-effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]
    rules = [
        # callback and follow are arguments of the Rule, not of the link extractor
        Rule(SgmlLinkExtractor(allow=(r"forumdisplay\.php.*f=108.*page=",)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Your code
        ...
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page that differ only in one or two query arguments (say, order=asc), you can specify deny=(...) in the link extractor to filter them out.
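For example, the rules above could be extended with a deny pattern; this is a minimal sketch, and the order=asc pattern is an assumption about what the duplicate URLs look like:

rules = [
    Rule(SgmlLinkExtractor(allow=(r"forumdisplay\.php.*f=108.*page=",),
                           deny=(r"order=asc",)),  # skip ascending-order duplicates (assumed pattern)
         callback='parse_item', follow=True),
]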


Invoke a different Spider after page is parsed on another Spider

This has been addressed to some extent here and here.
But I'd like to ask before doing any of what's suggested there, because I don't really like any of those approaches.
So basically, I'm trying to scrape Steam games. As you may know, Steam has a link where you can access all the reviews for a game, for example:
https://steamcommunity.com/app/730/reviews/?browsefilter=toprated&snr=1_5_100010_
You can ignore snr and browsefilter query params there.
Anyhow, I have created a single Spider that crawls the list of games here, and it works pretty well:
https://store.steampowered.com/search/?sort_by=Released_DESC
But now, for each game I want to retrieve all reviews.
Originally I created a new Spider that deals with the infinite scroll in the page that has the whole set of reviews for a game, but obviously that spider needs the URL where those reviews live.
So basically, what I'm doing now is scraping all game pages and storing the URL with the reviews for each game in a txt file that is then passed as a parameter to the second spider. But I don't like this because it forces me into a 2-step process, and besides, I need to somehow map the results of the second spider to the results of the first one (these reviews belong to this game, etc.).
So my questions are:
Would it be best to send the results of scraping the game page (and thus the URL with all reviews), or at least that URL, to the second spider and then fetch all reviews for each game with it? This will be O(N*M) in terms of performance, N being the number of games and M the number of reviews per game; maybe just because of this, having 2 spiders is worth it... thoughts?
Can I actually invoke a Spider from another Spider? From what I've read in the Scrapy documentation, it doesn't look like it. I can probably move everything into one spider, but it will look awful and it won't adhere to the single-responsibility principle...
Why don't you use a different parse callback?
https://docs.scrapy.org/en/latest/intro/tutorial.html#more-examples-and-patterns
def parse(self, response):
    # follow links to author pages
    for href in response.css('.author + a::attr(href)'):
        yield response.follow(href, self.parse_author)

    # follow pagination links
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, self.parse)

def parse_author(self, response):
    def extract_with_css(query):
        return response.css(query).get(default='').strip()

    yield {
        'name': extract_with_css('h3.author-title::text'),
        'birthdate': extract_with_css('.author-born-date::text'),
        'bio': extract_with_css('.author-description::text'),
    }

    # follow pagination links
    for href in response.css('li.next a::attr(href)'):
        yield response.follow(href, self.parse_author)
And pass the values you need between callbacks with the request's meta attribute:
https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta
There is an example in:
Is it possible to pass a variable from start_requests() to parse() for each individual request?
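Putting that together for the Steam case, a single spider can parse the search results, follow each game's reviews URL, and carry the game's identity along in meta so every review item maps back to its game. This is only a sketch: the CSS selectors and the way the reviews URL is built from the app id are assumptions, not verified against the live Steam pages, and the infinite-scroll handling is omitted.

import re
import scrapy

class SteamGamesSpider(scrapy.Spider):
    name = "steam_games"
    start_urls = ["https://store.steampowered.com/search/?sort_by=Released_DESC"]

    def parse(self, response):
        # selector for the search result rows (assumed; check the live page)
        for game in response.css("a.search_result_row"):
            store_url = game.attrib["href"]
            title = game.css("span.title::text").get()
            app_id = re.search(r"/app/(\d+)/", store_url)
            if not app_id:
                continue
            reviews_url = (
                f"https://steamcommunity.com/app/{app_id.group(1)}"
                "/reviews/?browsefilter=toprated"
            )
            # carry the game identity along with the request
            yield scrapy.Request(
                reviews_url,
                callback=self.parse_reviews,
                meta={"game_title": title, "game_url": store_url},
            )

    def parse_reviews(self, response):
        # each review item can be mapped back to its game via meta
        yield {
            "game": response.meta["game_title"],
            "game_url": response.meta["game_url"],
            "reviews": response.css("div.apphub_CardTextContent::text").getall(),
        }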

Limiting scrapy Request and Items

Everyone, I've been learning Scrapy for a month. I need assistance with the following problems:
1. Suppose there are 100-200 URLs, and I use a Rule to extract further links from those URLs. I want to limit the requests for those links, say a maximum of 30 requests for each URL. Can I do that?
2. If I'm searching for a keyword on all URLs and the word is found on a particular URL, I want Scrapy to stop searching that URL and move on to the next one.
I've tried limiting the URLs but it doesn't work at all.
Thanks, I hope everything is clear.
You can use a process_links callback function with your Rule; it will be passed the list of extracted links from each response, and you can trim it down to your limit of 30.
Example (untested):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "test"
    allowed_domains = ['example.org']
    rules = (
        Rule(LinkExtractor(), process_links="dummy_process_links"),
    )

    def dummy_process_links(self, links):
        # keep at most 30 of the links extracted from each response
        links = links[:30]
        return links
If I understand correctly, you want to stop after finding some word in the page of the response; all you need to do is find the word:
def my_parse(self, response):
    if b'word' in response.body:
        offset = response.body.find(b'word')
        # do something with it
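If "stop searching that URL" means not following any further links from a page once the word is found, one option (just a sketch, not part of the original answer) is to skip the CrawlSpider rules for that case and decide in an ordinary Spider callback whether to keep following:

import scrapy

class KeywordSpider(scrapy.Spider):
    name = "keyword"
    # placeholder start URLs and keyword
    start_urls = ['https://example.org/page1', 'https://example.org/page2']
    keyword = b'word'

    def parse(self, response):
        if self.keyword in response.body:
            # found it: record the hit and follow nothing else from this page
            yield {'url': response.url, 'found': True}
            return
        # not found yet: keep following links from this page
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)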

Scrapy - target specified URLs only

I'm using Scrapy to browse and collect data, but I'm finding that the spider is crawling lots of unwanted pages. What I'd prefer the spider to do is just begin from a set of defined pages, parse the content on those pages, and then finish. I've tried to implement a rule like the one below, but it's still crawling a whole series of other pages as well. Any suggestions on how to approach this?
rules = (
    Rule(SgmlLinkExtractor(), callback='parse_adlinks', follow=False),
)
Thanks!
Your extractor is extracting every link because it doesn't have any restricting arguments set.
If you take a look at the official documentation, you'll notice that Scrapy link extractors have lots of parameters you can set to customize what they extract.
Example:
rules = (
    # only links from specific domains
    Rule(LxmlLinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']), <..>),
    # only links that match a specific regex
    Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'), <..>),
    # don't crawl specific file extensions
    Rule(LxmlLinkExtractor(deny_extensions=['pdf', 'html']), <..>),
)
You can also set allowed domains for your spider if you don't want it to wander off somewhere:
class MySpider(scrapy.Spider):
    allowed_domains = ['scrapy.org']
    # will only crawl pages from this domain ^
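If the goal really is just to parse a fixed, known set of pages and stop, another option (not from the original answer, only a sketch) is to drop the CrawlSpider rules entirely and use a plain Spider, which only requests its start URLs and follows nothing by itself:

import scrapy

class AdLinksSpider(scrapy.Spider):
    name = "adlinks"
    # only these pages are requested; no links are followed automatically
    start_urls = [
        'http://example.com/page-one',
        'http://example.com/page-two',
    ]

    def parse(self, response):
        # parse the content of each defined page here
        yield {'url': response.url, 'title': response.css('title::text').get()}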

Combining spiders in Scrapy

Is it possible to create a spider which inherits/uses functionality from two base spiders?
I'm trying to scrape various sites, and I've noticed that in many instances the site provides a sitemap, but this just points to category/listing-type pages, not 'real' content. Because of this I'm having to use the CrawlSpider (pointing to the website root) instead, but this is pretty inefficient as it crawls through all pages, including a lot of junk.
What I would like to do is something like this:
1. Start my Spider, which is a subclass of SitemapSpider, and pass each response to the parse_items method.
2. In parse_items, test whether the page contains 'real' content.
3. If it does, process it; if not, pass the response to the CrawlSpider (actually my subclass of CrawlSpider) to process.
4. The CrawlSpider then looks for links in the page, say 2 levels deep, and processes them.
Is this possible? I realise that I could copy and paste the code from the CrawlSpider into my spider, but this seems like a poor design.
In the end I decided to just extend the SitemapSpider and lift some of the code from the CrawlSpider, as it was simpler than trying to deal with multiple inheritance issues. So, basically:
from scrapy import Request
from scrapy.linkextractors import LxmlLinkExtractor
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    def __init__(self, **kw):
        super(MySpider, self).__init__(**kw)
        self.link_extractor = LxmlLinkExtractor()

    def parse(self, response):
        # perform item extraction etc
        ...
        links = self.link_extractor.extract_links(response)
        for link in links:
            yield Request(link.url, callback=self.parse)
Alternatively, you can combine both base spiders through multiple inheritance:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import SitemapSpider, CrawlSpider, Rule

class MySpider(SitemapSpider, CrawlSpider):
    name = "myspider"
    rules = (
        Rule(LinkExtractor(allow=('',)), callback='parse_item', follow=True),
    )
    sitemap_rules = [
        ('/', 'parse_item'),
    ]
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    start_urls = ['http://www.example.com']
    allowed_domains = ['example.com']

    def parse_item(self, response):
        # Do your stuff here
        ...
        # Return to CrawlSpider that will crawl them
        yield from self.parse(response)
This way, Scrapy will start with the URLs in the sitemap, then follow all links within each URL.
Source: Multiple inheritance in scrapy spiders
You can inherit as usual; the only thing you have to be careful about is that the base spiders usually override the basic methods start_requests and parse. The other thing to point out is that CrawlSpider will extract links from every response that goes through _parse_response.
Setting a low value for DEPTH_LIMIT should be a way to manage the fact that CrawlSpider extracts links from every response that goes through _parse_response (after reviewing the original question, this was already proposed there).
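For instance, the depth limit can be set per spider via custom_settings; this is a minimal sketch, and the value of 2 simply mirrors the "2 levels deep" mentioned in the question:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule, SitemapSpider

class MySpider(SitemapSpider, CrawlSpider):
    name = "myspider"
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)
    # keep the CrawlSpider part from following links more than 2 levels deep
    custom_settings = {'DEPTH_LIMIT': 2}

    def parse_item(self, response):
        ...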

Looping on Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page and writes those links to a CSV file. I then wrote another crawler that loops over those links and downloads some information from the pages they point to.
The loop on the links:
import csv

cr = csv.reader(open("linksToCrawl.csv", "rb"))
start_urls = []
for row in cr:
    start_urls.append("http://www.zap.co.il/rate" + ''.join(row[0])[1:len(''.join(row[0]))])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=(r"&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
It seemed to be working fine. However, it seems that the crawler is only retrieving information from the pages with the extended URLs (with the "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override the parse_start_url() method in CrawlSpider:
from scrapy.contrib.spiders import CrawlSpider

class MySpider(CrawlSpider):

    def parse_items(self, response):
        # put your code here
        ...

    parse_start_url = parse_items
Your rule only allows URLs that contain "&pageinfo=\d". In effect, only the pages whose URL matches will be processed. You need to change the allow parameter so that the URLs without pageinfo are processed as well.
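For example, the allow pattern could be broadened so that it matches the model pages both with and without the pageinfo argument; this is only a sketch, and the exact regex is an assumption based on the URLs quoted above:

from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# matches ratemodel.aspx?modelid=... pages with or without &pageinfo=
rules = (
    Rule(SgmlLinkExtractor(allow=(r"ratemodel\.aspx\?modelid=\d+",)),
         callback="parse_items", follow=True),
)

Note that the start URLs themselves are still handled by parse_start_url, so the override shown in the first answer may still be needed for them.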