In scrapy do calls to crawler.engine.crawl() bypass the throttling mechanism? - scrapy

To give some background, I'm writing a spider that listens on a RabbitMQ topic for new URLs to crawl. When it pulls a URL from the queue it adds it to the crawl queue by calling crawler.engine.crawl(request). I've noticed that if I drop 200 URLs onto the queue (all for the same domain) I sometimes get timeouts, but this doesn't happen if I add the same 200 URLs via the start_urls property.
So I'm wondering: do the normal throttling mechanisms (concurrent requests per domain, download delay, etc.) apply when adding URLs via crawler.engine.crawl()?
Here is a small code sample:
from twisted.internet import defer
from scrapy.http import Request
from scrapy import log

@defer.inlineCallbacks
def read(self, queue_object):
    # pull a url from the RabbitMQ topic
    ch, method, properties, body = yield queue_object.get()
    if body:
        req = Request(url=body)
        log.msg('Scheduling ' + body + ' for crawl')
        self.crawler.engine.crawl(req, spider=self)
    yield ch.basic_ack(delivery_tag=method.delivery_tag)

It doesn't bypass the DownloaderMiddlewares or the Downloader. Requests passed this way go to the Scheduler directly, bypassing the SpiderMiddlewares.
Source
IMO you should use a SpiderMiddleware, implementing process_start_requests to extend your spider's start_requests.
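A minimal sketch of that suggestion, assuming a middleware that merges queued URLs into the start requests (the class name, the pending_urls attribute and the settings entry are illustrative, not part of Scrapy):
middlewares.py
from scrapy import Request

class QueueStartRequestsMiddleware(object):
    # Spider middleware: everything yielded here enters the scheduler the
    # normal way, so the usual throttling settings apply.
    def process_start_requests(self, start_requests, spider):
        # pass through whatever the spider already defines
        for request in start_requests:
            yield request
        # then feed in the queued URLs (hypothetical spider attribute)
        for url in getattr(spider, 'pending_urls', []):
            yield Request(url)
settings.py
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.QueueStartRequestsMiddleware': 543,
}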

Related

Prevent scrapy response from being added to cache

I am crawling a website that returns pages with a captcha and a status code 200 suggesting everything is ok. This causes the page to be put into scrapy's cache.
I want to recrawl these pages later. But if they are in the cache, they won't get recrawled.
Is it feasible to overload the process_response function of the httpcache middleware, or to look for a specific string in the response HTML and override the 200 code with an error code?
What would be the easiest way to keep scrapy from putting certain responses into the cache?
Scrapy uses scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache HTTP responses. To bypass this caching for individual requests you can just set the request meta key dont_cache to True, like:
yield Request(url, meta={'dont_cache': True})
The docs above also mention how to disable it project-wide with a setting if you are interested in that too.
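If you would rather keep the cache enabled and only skip the captcha pages, a custom cache policy is another route. A rough sketch, assuming the captcha pages contain a recognizable marker string (the class name, module path and marker are made up):
httpcache.py
from scrapy.extensions.httpcache import DummyPolicy

class SkipCaptchaPolicy(DummyPolicy):
    def should_cache_response(self, response, request):
        # never store responses that look like a captcha page
        if b'captcha' in response.body.lower():
            return False
        return super(SkipCaptchaPolicy, self).should_cache_response(response, request)
settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'myproject.httpcache.SkipCaptchaPolicy'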

Scrapy DupeFilter on a per spider basis?

I currently have a project with quite a few spiders, and around half of them need some custom rule to filter duplicate requests. That's why I have extended the RFPDupeFilter class with custom rules for each spider that needs it.
My custom dupe filter checks if the request url is from a site that needs custom filtering and cleans the url (removes query parameters, shortens paths, extracts unique parts, etc.), so that the fingerprint is the same for all identical pages. So far so good, however at the moment I have a function with around 60 if/elif statements, that each request goes through. This is not only suboptimal, but it's also hard to maintain.
So here comes the question. Is there a way to define the filtering rule that 'cleans' the URLs inside the spider? The ideal approach for me would be to extend the Spider class and define a clean_url method, which would by default just return the request URL, and override it in the spiders that need something custom. I looked into it, however I can't seem to find a way to access the current spider's methods from the dupe filter class.
Any help would be highly appreciated!
You could implement a downloader middleware.
middleware.py
from scrapy.exceptions import IgnoreRequest

class CleanUrl(object):
    seen_urls = set()

    def process_request(self, request, spider):
        url = spider.clean_url(request.url)
        if url != request.url:
            # reschedule the request with the cleaned url; it will pass
            # through this middleware again with the clean url
            return request.replace(url=url)
        if url in self.seen_urls:
            raise IgnoreRequest()
        self.seen_urls.add(url)
settings.py
DOWNLOADER_MIDDLEWARES = {'PROJECT_NAME_HERE.middleware.CleanUrl': 500}
# if you want to make sure this is the last middleware to execute increase the 500 to 1000
You would probably want to disable the dupefilter altogether if you did it this way.
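To get the per-spider cleaning the question asks about, the spiders could share a small base class that the middleware's spider.clean_url(...) call relies on. A sketch (the class and spider names are invented):
spiders.py
import scrapy

class CleanUrlSpider(scrapy.Spider):
    def clean_url(self, url):
        # default: no cleaning
        return url

class ExampleSpider(CleanUrlSpider):
    name = 'example'

    def clean_url(self, url):
        # e.g. drop the query string so identical pages share one URL
        return url.split('?', 1)[0]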

Use Scrapy to combine data from multiple AJAX requests into a single item

What is the best way to crawl pages with content coming from multiple AJAX requests? It looks like I have the following options (given that AJAX URLs are already known):
1. Crawl AJAX URLs sequentially, passing the same item between requests
2. Crawl AJAX URLs concurrently and output each part as a separate item with a shared key (e.g. source URL)
What is the most common practice? Is there a way to get a single item at the end, but allow some AJAX requests to fail w/o compromising the rest of the data?
Scrapy is built for concurrency and statelessness, so if point 2 is possible, it is always preferred, from both speed and memory consumption standpoints.
In case requests must be serialized, consider accumulating the item in the request meta field, as sketched below.
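A bare-bones sketch of that serialized variant (all URLs and field names here are invented): each callback adds its piece to the item carried in meta and passes it on; an errback could additionally yield the partial item if one of the AJAX calls fails.
import scrapy

class AjaxCombineSpider(scrapy.Spider):
    name = 'ajax_combine'
    start_urls = ['http://example.com/page']

    def parse(self, response):
        item = {'source_url': response.url}
        yield scrapy.Request('http://example.com/ajax/part1',
                             callback=self.parse_part1,
                             meta={'item': item})

    def parse_part1(self, response):
        item = response.meta['item']
        item['part1'] = response.text
        yield scrapy.Request('http://example.com/ajax/part2',
                             callback=self.parse_part2,
                             meta={'item': item})

    def parse_part2(self, response):
        item = response.meta['item']
        item['part2'] = response.text
        yield item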
Check out scrapy-inline-requests. It allows you to smoothly process multiple nested requests in a response handler.

Best way to cache RESTful API results of GET calls

I'm thinking about the best way to create a cache layer in front of, or as the first layer of, GET requests to my RESTful API (written in Ruby).
Not every request can be cached, because even for some GET requests the API has to validate the requesting user / application. That means I need to configure which requests are cacheable and how long each cached answer is valid. For a few cases I need a very short expiration time of e.g. 15s and below. And the API application should be able to expire cache entries even if the expiration date has not been reached yet.
I have already thought about many possible solutions; my two best ideas are:
a first layer of the API (even before the routing), with the cache logic implemented by myself (to keep all configuration options in my hands), answers and expiration dates stored in Memcached
a web server proxy (highly configurable), perhaps something like Squid, but I have never used a proxy for a case like this before and I'm absolutely not sure about it
I also thought about a cache solution like Varnish, I used Varnish for "usual" web applications and it's impressive but the configuration is kind of special. But I would use it if it's the fastest solution.
Another thought was to cache in the Solr index, which I'm already using in the data layer to avoid querying the database for most requests.
If someone has a hint or good sources to read about this topic, let me know.
Firstly, build your RESTful API to be RESTful. That means authenticated users can also get cached content: to keep all state in the URL, it needs to contain the auth details. Of course the hit rate will be lower here, but it is cacheable.
With a good number of logged-in users it will be very beneficial to have some sort of model cache behind a full-page cache, as many models are still shared even if some aren't (in a good OOP structure).
Then for a full-page cache you are best off keeping all the requests away from the web server, and especially away from the dynamic processing in the next step (in your case Ruby). The fastest way to cache full pages from a normal web server is always a caching proxy in front of the web servers.
Varnish is in my opinion as good and easy as it gets, but some prefer Squid indeed.
memcached is a great option, and I see you mentioned this already as a possible option. Also Redis seems to be praised a lot as another option at this level.
On the application level, in terms of a more granular, file-by-file and/or module-by-module approach to caching, local storage is always an option for common objects a user may request over and over again; even something as simple as dropping response objects into the session so they can be reused, instead of making another HTTP REST call, and coding appropriately.
Now people go back and forth debating Varnish vs Squid, and both seem to have their pros and cons, so I can't comment on which one is better, but many people say Varnish with a tuned Apache server is great for dynamic websites.
Since REST is an HTTP thing, it could be that the best way of caching requests is to use HTTP caching.
Look into using ETags on your responses, checking the ETag in requests to reply with '304 Not Modified', and having Rack::Cache serve cached data if the ETags are the same. This works great for Cache-Control 'public' content.
Rack::Cache is best configured to use memcache for its storage needs.
I wrote a blog post last week about the interesting way that Rack::Cache uses ETags to detect and return cached content to new clients: http://blog.craz8.com/articles/2012/12/19/rack-cache-and-etags-for-even-faster-rails
Even if you're not using Rails, the Rack middleware tools are quite good for this stuff.
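For illustration only, here is the ETag/304 pattern described above sketched in Python as a plain Django view rather than as Rack::Cache configuration (the view and payload helper are invented):
import hashlib
import json

from django.http import HttpResponse, HttpResponseNotModified

def build_event_json():
    # placeholder payload; real code would serialize your data here
    return json.dumps({'events': []})

def event_list(request):
    body = build_event_json()
    etag = '"%s"' % hashlib.md5(body.encode('utf-8')).hexdigest()
    if request.META.get('HTTP_IF_NONE_MATCH') == etag:
        # the client already has this version; skip the body entirely
        return HttpResponseNotModified()
    response = HttpResponse(body, content_type='application/json')
    response['ETag'] = etag
    response['Cache-Control'] = 'public, max-age=60'
    return response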
Redis is the best option here. It is an open-source, advanced key-value cache and store.
I’ve used redis successfully this way in my REST view:
from django.conf import settings
import hashlib
import json
from redis import StrictRedis
from django.utils.encoding import force_bytes

def get_redis():
    # get redis connection from RQ config in settings
    rc = settings.RQ_QUEUES['default']
    cache = StrictRedis(host=rc['HOST'], port=rc['PORT'], db=rc['DB'])
    return cache

class EventList(ListAPIView):
    queryset = Event.objects.all()
    serializer_class = EventSerializer
    renderer_classes = (JSONRenderer, )

    def get(self, request, format=None):
        if IsAdminUser not in self.permission_classes:  # dont cache requests from admins
            # make a key that represents the request results you want to cache
            # your requirements may vary
            key = get_key_from_request()
            # I find it useful to hash the key, when query parms are added
            # I also preface event cache key with a string, so I can clear the cache
            # when events are changed
            key = "todaysevents" + hashlib.md5(force_bytes(key)).hexdigest()
            # I dont want any cache issues (such as not being able to connect to redis)
            # to affect my end users, so I protect this section
            try:
                cache = get_redis()
                data = cache.get(key)
                if not data:
                    # not cached, so perform standard REST functions for this view
                    queryset = self.filter_queryset(self.get_queryset())
                    serializer = self.get_serializer(queryset, many=True)
                    data = serializer.data
                    # cache the data as a string
                    cache.set(key, json.dumps(data))
                    # manage the expiration of the cache
                    expire = 60 * 60 * 2
                    cache.expire(key, expire)
                else:
                    # this is the place where you save all the time
                    # just return the cached data
                    data = json.loads(data)
                return Response(data)
            except Exception as e:
                logger.exception("Error accessing event cache\n %s" % (e))
        # for Admins or exceptions, BAU
        return super(EventList, self).get(request, format)
In my Event model updates, I clear any event caches. This is hardly ever performed (only Admins create events, and not that often), so I always clear all event caches:
class Event(models.Model):
    ...

    def clear_cache(self):
        try:
            cache = get_redis()
            eventkey = "todaysevents"
            for key in cache.scan_iter("%s*" % eventkey):
                cache.delete(key)
        except Exception as e:
            pass

    def save(self, *args, **kwargs):
        self.clear_cache()
        return super(Event, self).save(*args, **kwargs)

JMeter: How to test a website to render a page regardless of the content

I have a requirement where the site only needs to respond to the user within a certain number of seconds, regardless of the contents.
Now there is an option in JMeter under HTTP Proxy Server -> URL Patterns to Exclude, and then to start recording.
Here I can specify gif, css or other content to ignore. However, before starting the recording I have to be aware of the various contents that are going to be there.
Is there any specific parameter to pass to JMeter, or any other tool, that takes care of loading the page only, so that I can assert the response code of that page while none of the other contents of the page are recorded?
Thanks.
Use the standard HTTP Request sampler with the option Retrieve All Embedded Resources from HTML Files DISABLED (not checked), set via the sampler's control panel:
"It also lets you control whether or not JMeter parses HTML files for
images and other embedded resources and sends HTTP requests to
retrieve them."
NOTE: You may also define the same setting via HTTP Request Defaults.
NOTE: See also "Response size calculation" in the same HTTP Request article.
Add assertions to your HTTP samplers:
Duration Assertion: to test whether the response was received within a defined amount of time;
Response Assertion: to ensure that the request was successful,
e.g.
Response Field to Test = Response Code
Pattern Matching Rules = Equals
Patterns to Test = 200
You want to run a test that would ignore resources after a certain number of seconds?
I don't understand, what are you trying to accomplish by doing that?
Users will still receive those resources when they request your URL, so your tests won't be accurate.
I don't mean any disrespect, but is it possible that you misunderstood the requirements?
I assume that the requirement is to load all the resources within a certain number of seconds, not to cut off the ones that fail to fit in that time?