How do Scrapy CrawlSpider rules work?

I have an issue with CrawlSpider rules.
Here is my rule definition:
rules = (
    Rule(LinkExtractor(allow=(r"/liste/.*/department\.aspx\?categoryId=\d+", ))),
    Rule(LinkExtractor(allow=(r"/liste/.*/department\.aspx\?categoryId=.*&pn=\d+", ))),
    Rule(LinkExtractor(allow=(r'/liste/.*/productDetails\.aspx\?productId=.*&categoryId=.*', )), callback='parse_page'),
)
What I expect: with the first rule, the spider should find the category links and send requests to them; from those responses it should extract links matching the second rule and request those as well. Finally, it should find links matching the third rule and pass them to the parse_page callback.
But it does not work the way I expect. I cannot control the spider: on each run it scrapes a different set of pages rather than all of them. I want to scrape every page that matches the rules.
How should I define my rules to achieve this?
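For reference, here is a minimal sketch of how these rules would sit inside a complete CrawlSpider (the class name, domain and start URL are placeholders). Note that a CrawlSpider applies every rule's extractor to every response it downloads and follows each matching link once, so all three rules run against every crawled page rather than chaining in a strict order.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CategorySpider(CrawlSpider):               # hypothetical name
    name = 'categories'
    allowed_domains = ['example.com']            # assumption: the real shop domain
    start_urls = ['http://www.example.com/']     # assumption

    rules = (
        # 1. category listing pages
        Rule(LinkExtractor(allow=(r'/liste/.*/department\.aspx\?categoryId=\d+',))),
        # 2. paginated category pages (&pn=...)
        Rule(LinkExtractor(allow=(r'/liste/.*/department\.aspx\?categoryId=.*&pn=\d+',))),
        # 3. product detail pages, handled by parse_page
        Rule(LinkExtractor(allow=(r'/liste/.*/productDetails\.aspx\?productId=.*&categoryId=.*',)),
             callback='parse_page'),
    )

    def parse_page(self, response):
        # extract the product data here
        pass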

Related

Scrapy - target specified URLs only

I'm using Scrapy to browse and collect data, but I'm finding that the spider is crawling lots of unwanted pages. What I'd prefer the spider to do is just begin from a set of defined pages, parse the content on those pages, and then finish. I've tried to implement a rule like the one below, but it's still crawling a whole series of other pages as well. Any suggestions on how to approach this?
rules = (
    Rule(SgmlLinkExtractor(), callback='parse_adlinks', follow=False),
)
Thanks!
Your extractor is extracting every link because it doesn't have any rule arguments set.
If you take a look at the official documentation, you'll notice that Scrapy LinkExtractors have lots of parameters you can set to customize what they extract.
Example:
rules = (
    # only links to specific domains
    Rule(LxmlLinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']), <..>),
    # only links that match a specific regex
    Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'), <..>),
    # don't crawl specific file extensions (given without the leading dot)
    Rule(LxmlLinkExtractor(deny_extensions=['pdf', 'html']), <..>),
)
You can also set allowed domains for your spider if you don't want it to wander off somewhere:
class MySpider(scrapy.Spider):
    allowed_domains = ['scrapy.org']
    # will only crawl pages from this domain ^
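Putting those two ideas together, a rough sketch of a spider that parses only the pages you care about and does not wander further might look like this (the domain, allow pattern and start URL are placeholders, not the asker's real values):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AdSpider(CrawlSpider):
    name = 'ads'
    allowed_domains = ['example.com']            # assumption: your target site
    start_urls = ['http://www.example.com/ads']  # assumption

    rules = (
        # follow=False: parse the matched pages, but don't keep crawling from them
        Rule(LinkExtractor(allow=(r'/ad/\d+',)),  # placeholder pattern for the wanted links
             callback='parse_adlinks', follow=False),
    )

    def parse_adlinks(self, response):
        # extract the data you want from each matched page
        pass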

Do scrapy LinkExtractors end up with unique links?

So, I have a page with a lot of articles and page numbers. Now if I want to extract an article I use:
Rule(LinkExtractor(allow=[r'article\/.+\.html']), callback='parse_article')
For pages I use this rule:
Rule(LinkExtractor(allow=r'page=\d+'))
so I end up with these rules:
rules = [
    Rule(LinkExtractor(allow=r'page=\d+')),
    Rule(LinkExtractor(allow=[r'article\/.+\.html']), callback='parse_article'),
]
My question is: will I get repeated pages? That is, will it extract the link to page 3 from pages 1, 2, 4, 5, 6 (until page 3 is no longer visible) and add it to the extracted link list each time, or does it only keep unique URLs in the end?
By default, LinkExtractor should only return unique links. There is an optional parameter, unique, which is True by default.
But that only ensures the links extracted from each page are unique. If the same link occurs on a later page, it will be extracted again.
By default, your spider should automatically ensure it doesn't visit the same URLs again, according to the DUPEFILTER_CLASS setting. The only caveat to this is if you stop and start your spider again, the record of visited URLs is reset. Look at "Jobs: pausing and resuming crawls" in the documentation for how to persist information when you pause and resume a spider.
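Concretely, the record of visited URLs only survives a stop and restart if you give the crawl a job directory, for example (the directory name below is arbitrary):
JOBDIR = 'crawls/articles-1'    # in settings.py

# equivalent command-line form:
#   scrapy crawl myspider -s JOBDIR=crawls/articles-1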

Looping on Scrapy doesn't work properly

I'm trying to write a small web crawler with Scrapy.
I wrote a crawler that grabs the URLs of certain links on a certain page, and wrote the links to a csv file. I then wrote another crawler that loops on those links, and downloads some information from the pages directed to from these links.
The loop on the links:
import csv

cr = csv.reader(open("linksToCrawl.csv", "rb"))
start_urls = []
for row in cr:
    # row[0] holds the link suffix; drop its first character and append it to the base URL
    start_urls.append("http://www.zap.co.il/rate" + ''.join(row[0])[1:])
If, for example, the URL of the page I'm retrieving information from is:
http://www.zap.co.il/ratemodel.aspx?modelid=835959
then more information can (sometimes) be retrieved from following pages, like:
http://www.zap.co.il/ratemodel.aspx?modelid=835959&pageinfo=2
("&pageinfo=2" was added).
Therefore, my rules are:
rules = (
    Rule(SgmlLinkExtractor(allow=(r"&pageinfo=\d",),
                           restrict_xpaths=('//a[@class="NumBtn"]',)),
         callback="parse_items", follow=True),
)
It seemed to be working fine. However, it seems that the crawler is only retrieving information from the pages with the extended URLs (with the "&pageinfo=\d"), and not from the ones without them. How can I fix that?
Thank you!
You can override the parse_start_url() method in CrawlSpider:
class MySpider(CrawlSpider):
    def parse_items(self, response):
        # put your code here
        ...

    parse_start_url = parse_items
Your rule only allows URLs containing "&pageinfo=\d", so in effect only pages with a matching URL are processed. You need to change the allow parameter, or handle the start URLs separately as above, if you want the URLs without pageinfo to be processed as well.
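Putting the two parts together, a rough sketch of the whole spider might look like this (the spider name is hypothetical; the CSV loading and the rule are taken from the question, and the import paths are the old scrapy.contrib ones that still shipped SgmlLinkExtractor):
import csv
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ZapSpider(CrawlSpider):                    # hypothetical name
    name = 'zap'
    allowed_domains = ['www.zap.co.il']

    # start URLs built from the CSV, as in the question
    start_urls = ["http://www.zap.co.il/rate" + ''.join(row[0])[1:]
                  for row in csv.reader(open("linksToCrawl.csv", "rb"))]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"&pageinfo=\d",),
                               restrict_xpaths=('//a[@class="NumBtn"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # extract the model information here
        ...

    # make sure the start URLs (which have no &pageinfo=...) go through the same callback
    parse_start_url = parse_items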

Completely custom path with YII?

I have various products with their own set paths. Eg:
electronics/mp3-players/sony-hg122
fitness/devices/gymboss
I want to be able to access URLs in this format. For example:
http://www.mysite.com/fitness/devices/gymboss
http://www.mysite.com/electronics/mp3-players/sony-hg122
My strategy was to override the init() function of the SiteController in order to catch the paths and then direct them to my own implementation of a render function. However, this doesn't allow me to catch the path.
Am I going about it the wrong way? What would be the correct strategy to do this?
** EDIT **
I figure I have to make use of the URL manager. But how do I dynamically add path formats if they are all custom in a database?
Eskimo's setup is a good solid approach for most Yii systems. However, for yours, I would suggest creating a custom UrlRule to query your database:
http://www.yiiframework.com/doc/guide/1.1/en/topics.url#using-custom-url-rule-classes
Note: the URL rules are parsed on every single Yii request, so be careful in there. If you aren't efficient, you can rapidly slow down your site. By default rules are cached (if you have a cache setup), but I don't know if that applies to dynamic DB rules (I would think not).
In your URL manager config (protected/config/main.php), set urlFormat to path, and optionally set showScriptName to false (this hides the index.php part of the URL):
'urlManager' => array(
    'urlFormat' => 'path',
    'showScriptName' => false,
    // rules go here (see below)
),
Next, in your rules array, you could set up something like:
    'catalogue/<category_url:.+>/<product_url:.+>' => 'product/view',
What this does is route any request with a structure like catalogue/electronics/ipods to the ProductController's actionView. You can then access the category_url and product_url portions of the URL like so:
$_GET['category_url'];
$_GET['product_url'];
How this rule works: any URL that starts with the word catalogue (directly after your domain name), followed by another segment (category_url) and then another segment (product_url), will be directed to that controller/action.
You will notice that in my example I am preceding the category and product with the word catalogue. Obviously you could replace this with whatever you prefer, or leave it out altogether. To see why I have put it in, consider the following URL:
http://mywebsite.com/site/about
If you left out the 'catalogue' portion of the URL and defined your rule only as:
    '<category_url:.+>/<product_url:.+>' => 'product/view',
the URL manager would see the site portion of the URL as the category_url value, and the about portion as the product_url. To prevent this, you can either keep the catalogue portion of the URL, or define rules for the non-catalogue pages (i.e. define a rule for site/about).
Rules are interpreted top to bottom, and only the first matching rule is applied. Obviously you can add as many rules as you need for the different URL structures you have.
I hope this gets you on the right path. Feel free to comment with any questions or clarifications you need.

How can I scrape other specific pages on a forum with Scrapy?

I have a Scrapy Crawler that crawls some guides from a forum.
The forum I'm trying to crawl the data from has a number of pages.
The problem is that I cannot extract the links that I want, because there aren't specific classes or ids to select.
The url structure is like this one: http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1
Obviously I can change the number after desc&page=1 to 2, 3, 4 and so on, but I would like to know the best way to do this.
How can I accomplish that?
PS: This is the spider code
http://dpaste.com/hold/794794/
I can't seem to open the forum URL (it always redirects me to another website), so here's a best-effort suggestion:
If there are links to the other pages on the thread page, you can create a crawler rule to explicitly follow these links. Use a CrawlSpider for that:
class GuideSpider(CrawlSpider):
    name = "Guide"
    allowed_domains = ['www.guides.com']
    start_urls = [
        "http://www.guides.com/forums/forumdisplay.php?f=108&order=desc&page=1",
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",)),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        # Your code
        ...
The spider should automatically deduplicate requests, i.e. it won't follow the same URL twice even if two pages link to it. If there are very similar URLs on the page with only one or two query arguments differing (say, order=asc), you can specify deny=(...) in the link extractor to filter them out.
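For instance, a sketch of the same rule with such a deny pattern added (the order=asc pattern is just an illustration):
rules = [
    Rule(SgmlLinkExtractor(allow=("forumdisplay.php.*f=108.*page=",),
                           deny=(".*order=asc.*",)),   # skip the ascending-order duplicates
         callback='parse_item', follow=True),
]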