Scrapy DupeFilter on a per spider basis?

I currently have a project with quite a few spiders, and around half of them need some custom rule to filter duplicate requests. That's why I have extended the RFPDupeFilter class with custom rules for each spider that needs it.
My custom dupe filter checks if the request url is from a site that needs custom filtering and cleans the url (removes query parameters, shortens paths, extracts unique parts, etc.), so that the fingerprint is the same for all identical pages. So far so good; however, at the moment I have a function with around 60 if/elif statements that each request goes through. This is not only suboptimal, it's also hard to maintain.
So here comes the question. Is there a way to define the rule that 'cleans' the urls inside the spider? The ideal approach for me would be to extend the Spider class and define a clean_url method, which would by default just return the request url, and override it in the spiders that need something custom. I looked into it; however, I can't seem to find a way to access the current spider's methods from the dupe filter class.
Any help would be highly appreciated!

You could implement a downloader middleware.
middleware.py
from scrapy.exceptions import IgnoreRequest

class CleanUrl(object):
    seen_urls = set()

    def process_request(self, request, spider):
        url = spider.clean_url(request.url)
        if url != request.url:
            # reschedule with the cleaned url; it will come back through
            # this middleware with url == request.url
            return request.replace(url=url)
        if url in self.seen_urls:
            raise IgnoreRequest()
        self.seen_urls.add(url)
settings.py
DOWNLOADER_MIDDLEWARES = {'PROJECT_NAME_HERE.middleware.CleanUrl': 500}
# if you want to make sure this is the last middleware to execute increase the 500 to 1000
You would probably want to disable the dupefilter altogether if you did it this way.
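On the spider side, a minimal sketch of what that clean_url hook could look like (the base class and the example rule are assumptions, not existing Scrapy API):

import scrapy


class CleanUrlSpider(scrapy.Spider):
    # hypothetical base class: by default the url is returned untouched
    def clean_url(self, url):
        return url


class ExampleSpider(CleanUrlSpider):
    name = 'example'
    start_urls = ['https://example.com/products']

    def clean_url(self, url):
        # example rule for this site: drop the query string entirely
        return url.split('?', 1)[0]

    def parse(self, response):
        pass

Spiders that don't override clean_url keep their urls unchanged, so the middleware above only filters exact duplicates for them.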

Related

Scrapy: settings, multiple concurrent spiders, and middlewares

I'm used to running spiders one at a time, because we mostly work with scrapy crawl and on scrapinghub, but I know that one can run multiple spiders concurrently, and I have seen that middlewares often have a spider parameter in their callbacks.
What I'd like to understand is:
the relationship between Crawler and Spider. If I run one spider at a time, I'm assuming there's one of each. But if you run more spiders together, like in the example linked above, do you have one crawler for multiple spiders, or are they still 1:1?
is there in any case only one instance of a middleware of a certain class, or do we get one per-spider or per-crawler?
Assuming there's one, what are the crawler.settings in the middleware creation (for example, here)? In the documentation it says that those take into account the settings overridden in the spider, but if there are multiple spiders with conflicting settings, what happens?
I'm asking because I'd like to know how to handle spider-specific settings. Take again the DeltaFetch middleware as an example:
enabling it seems to be a global matter, because DELTAFETCH_ENABLED is read from the crawler.settings
however, the sqlite db is opened in spider_opened and is a unique instance variable (i.e., not depending on the spider); so if you have more than one spider and the instance is shared, when the second spider is opened, the old db is lost. And if you have only one instance of the middleware per spider, why bother passing the spider as a parameter?
Is that a correct way of handling it, or should you rather have a dict spider_dbs indexed by spider name?
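To illustrate that last idea, here is a minimal sketch of a middleware that keeps one store per spider, keyed by spider name (the sqlite usage and file naming are assumptions, just mirroring the DeltaFetch example):

import sqlite3

from scrapy import signals


class PerSpiderDbMiddleware(object):
    # hypothetical middleware keeping one sqlite connection per spider

    def __init__(self):
        self.spider_dbs = {}  # spider.name -> sqlite3.Connection

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def spider_opened(self, spider):
        # one db file per spider, so concurrent spiders don't overwrite each other's state
        self.spider_dbs[spider.name] = sqlite3.connect('%s.db' % spider.name)

    def spider_closed(self, spider):
        self.spider_dbs.pop(spider.name).close()

With a dict like this, a single middleware instance can serve several spiders in the same process without them sharing one database.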

Passing items (or other variables) to middleware (or other modules) of scrapy

I'm improving the spider I wrote a few months ago. I'm trying to make it smarter and download only the new information from the website. For that purpose I am adding code to the downloader middleware module to check whether a URL ID has already been visited. Besides the URL, which I can get fairly easily with request.url, I need to pass an Item from the Spider - that Item is the Date Last Updated.
The idea is to compare both values (URL and Date Last Updated) with the ones from the database (a regular csv file) and, if both are the same, drop the request; if both are missing or the Last Updated date doesn't match, proceed with the request.
The problem is that I don't know how to pass the Item from the Spider to the Middleware. I can see that in the Pipelines module the item object is passed to the class; I tried to add it to the Middleware class, but it doesn't work.
Any ideas how to pass an Item or any other variable from the Spider to the Middleware module?
Usually you pass any additional info in the request meta, either as request.meta['my_thing'] = ... or as an argument, yield Request(url, meta={'my_thing': ...}), which all middlewares further up the chain will be able to access. For your case, however, I'd recommend either using scrapy's built-in cache middleware with the dummy policy, or one of these two modules, which do exactly the thing you have in mind:
https://github.com/TeamHG-Memex/scrapy-crawl-once
https://github.com/scrapy-plugins/scrapy-deltafetch
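A minimal sketch of the request.meta approach described above (the last_updated field and the middleware name are made up for illustration):

from scrapy.exceptions import IgnoreRequest


class SkipUnchangedMiddleware(object):
    # hypothetical downloader middleware reading a value the spider put in meta

    def __init__(self):
        self.seen = {}  # url -> last_updated, standing in for the csv database

    def process_request(self, request, spider):
        last_updated = request.meta.get('last_updated')
        if last_updated is not None and self.seen.get(request.url) == last_updated:
            raise IgnoreRequest('already up to date')
        self.seen[request.url] = last_updated
        return None

In the spider you would attach the value when yielding the request, e.g. yield scrapy.Request(url, meta={'last_updated': item['last_updated']}, callback=self.parse_detail).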

How to force dispatcher to cache urls with GET parameters

As I understood after reading these links:
How to find out what does dispatcher cache?
http://docs.adobe.com/docs/en/dispatcher.html
The Dispatcher always requests the document directly from the AEM instance in the following cases:
If the HTTP method is not GET. Other common methods are POST for form data and HEAD for the HTTP header.
If the request URI contains a question mark "?". This usually indicates a dynamic page, such as a search result, which does not need to be cached.
The file extension is missing. The web server needs the extension to determine the document type (the MIME-type).
The authentication header is set (this can be configured)
But I want to cache urls with parameters.
If I request myUrl/?p1=1&p2=2&p3=3 once,
then the next request to myUrl/?p1=1&p2=2&p3=3 must be served from the dispatcher cache, but myUrl/?p1=1&p2=2&p3=3&newParam=newValue should be served by CQ the first time and from the dispatcher cache for subsequent requests.
I think the config /ignoreUrlParams is what you are looking for. It can be used to whitelist the query parameters that are ignored when determining whether a page is cached / delivered from cache or not.
Check http://docs.adobe.com/docs/en/dispatcher/disp-config.html#Ignoring%20URL%20Parameters for details.
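For reference, a minimal sketch of what that could look like in the dispatcher farm configuration, assuming p1, p2 and p3 are the only parameters you want ignored (adapted from the linked documentation, not tested against your setup):

/ignoreUrlParams
  {
  # any other parameter prevents caching
  /0001 { /glob "*" /type "deny" }
  # these parameters are ignored, so the page is still cached
  /0002 { /glob "p1" /type "allow" }
  /0003 { /glob "p2" /type "allow" }
  /0004 { /glob "p3" /type "allow" }
  }

Note that ignored parameters are not part of the cached file name, so all values of p1, p2 and p3 would share the same cached copy.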
It's not possible to cache requests that contain a query string. Such calls are considered dynamic, so they are not expected to be cached.
On the other hand, if you are certain that such requests should be cached because your application/feature is query driven, you can work around it this way:
Add an Apache rewrite rule that moves the query string of a given parameter into a selector (sketched below).
(optional) Add a CQ filter that recognizes the selector and moves it back to a query string.
The selector can be constructed as key_value, but that puts some constraints on what can be passed here.
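A rough mod_rewrite sketch of the first step, assuming a single known parameter somevalue on mypage.html (hypothetical and untested):

# maps /mypage.html?somevalue=true to /mypage.somevalue-true.html
RewriteEngine On
RewriteCond %{QUERY_STRING} (?:^|&)somevalue=([^&]+)
RewriteRule ^/mypage\.html$ /mypage.somevalue-%1.html? [PT,L]

The trailing ? drops the original query string, so the dispatcher sees a plain, cacheable URL.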
You can do this with Apache rewrites BUT it would not be ideal practice. You'll be breaking the pattern that AEM uses.
Instead, use selectors and extensions. E.g. instead of server.com/mypage.html?somevalue=true, use:
server.com/mypage.myvalue-true.html
Most things you will need to do that would ever get cached will work this way just fine. If you give me more details about your requirements and what you are trying to achieve, I can help you perfect the solution.

Use Scrapy to combine data from multiple AJAX requests into a single item

What is the best way to crawl pages with content coming from multiple AJAX requests? It looks like I have the following options (given that AJAX URLs are already known):
Crawl AJAX URLs sequentially, passing the same item between requests
Crawl AJAX URLs concurrently and output each part as a separate item with a shared key (e.g. source URL)
What is the most common practice? Is there a way to get a single item at the end, but allow some AJAX requests to fail w/o compromising the rest of the data?
Scrapy is built for concurrency and statelessness, so if option 2 is possible, it is always preferred, from both speed and memory consumption standpoints.
In case requests must be serialized, consider accumulating the item in the request meta field, as sketched below.
Check scrapy-inline-requests. It allows you to smoothly process multiple nested requests in a single response handler.
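For completeness, a minimal sketch of the meta-accumulation approach (the URLs and field names are placeholders, and the errback simply emits whatever was collected so far):

import scrapy


class AjaxSpider(scrapy.Spider):
    name = 'ajax_example'
    start_urls = ['https://example.com/page']

    def parse(self, response):
        item = {'source_url': response.url}
        # chain the first AJAX call, carrying the partial item in meta
        yield scrapy.Request('https://example.com/ajax/part1',
                             meta={'item': item}, callback=self.parse_part1)

    def parse_part1(self, response):
        item = response.meta['item']
        item['part1'] = response.text
        yield scrapy.Request('https://example.com/ajax/part2',
                             meta={'item': item}, callback=self.parse_part2,
                             errback=self.errback_part2)

    def parse_part2(self, response):
        item = response.meta['item']
        item['part2'] = response.text
        yield item

    def errback_part2(self, failure):
        # if the second call fails, emit the partial item instead of losing everything
        yield failure.request.meta['item']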

How should I organize this Instapaper-like functionality in Rails?

Instapaper, if you don't know it, is a bookmarklet that saves your current URL to an account of yours. Essentially the bookmarklet loads a script on the page with parameters on that script's URL with something like
z.setAttribute('src', l.protocol + '//www.instapaper.com/j/Jabcdefg?u=' +
encodeURIComponent(l.href) + '&t=' + (new Date().getTime()));
b.appendChild(z);
So that's sending a request to a user-based, obfuscated URL along with the current page's URL.
I'm wondering how a similar service would be set up in a Rails app. The work is clearly being done by something called, perhaps, parser, which would probably be a model (it will run an HTTP request, parse, and save the data, for example). Can you route directly into a model? Do you need a controller over it to handle incoming requests? (I've tried this last bit, and it auto-loads a view, which I don't need/want).
I'd love some advice on this general architecture. Thanks!
I guess you cannot route directly to a model.
So, you need a controller over it to handle incoming requests.
And use "render :nothing => true" if you don't want the view to be sent to the browser.