Best practice to stop a Scrapy crawl when the number of bad requests reaches a limit I set - scrapy

I have 400 pages to crawl, for example, and some pages may return 3xx or 4xx. I would like the Scrapy task to stop automatically when the number of bad requests reaches 100, for example. Thanks!

You can use different approaches:
A counter variable on the spider class (which is not recommended, but is probably the simplest solution)
Storing it in the DB using pipelines
Once you have reached the number that you have configured, you can stop the crawler using:
from scrapy.exceptions import CloseSpider

if errors > maxNumberErrors:
    raise CloseSpider('message error')
or (from this answer; note that the scrapy.project module it relies on has been removed in recent Scrapy versions):
from scrapy.project import crawler
crawler._signal_shutdown(9,0)
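For completeness, here is a minimal sketch of the first approach: a counter kept on the spider itself that raises CloseSpider once a threshold is reached. It is not from the answer above; the spider name, the start URLs and the threshold are illustrative, and the two custom_settings entries are only needed so that 3xx/4xx responses actually reach the callback instead of being redirected or filtered out:

import scrapy
from scrapy.exceptions import CloseSpider


class LimitedErrorsSpider(scrapy.Spider):
    name = "limited_errors"
    start_urls = ["https://example.com/page/%d" % i for i in range(1, 401)]  # illustrative
    # let non-2xx responses reach parse() instead of being redirected or dropped
    custom_settings = {"HTTPERROR_ALLOW_ALL": True, "REDIRECT_ENABLED": False}

    max_bad_responses = 100  # illustrative threshold
    bad_count = 0

    def parse(self, response):
        if response.status >= 300:
            self.bad_count += 1
            if self.bad_count >= self.max_bad_responses:
                raise CloseSpider("reached %d bad responses" % self.bad_count)
            return
        # normal extraction for good responses
        yield {"url": response.url, "status": response.status}

Keeping the counter on the spider is the "variable in the class" option from the answer; the pipeline/DB option works the same way, but persists the count outside the spider instance.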

Related

JMeter assessment

Can anyone help solve this assessment?
Using JMeter framework (https://jmeter.apache.org) please implement a load test script:
The script should send 10 concurrent requests to Capital API: https://restcountries.eu/rest/v2/capital/?fields=name;capital;currencies;latlng;regionalBlocs
The script should read the capital values from a CSV file (contains 10 capital names)
The script should perform a status code verification for the transaction response
The script should run for 2 minutes.
The script should contain at least 2 listeners.
My suggestion is: create the script on your own; it's the best way to learn any subject. Contributors to this forum will be more than happy to answer any specific questions if you get stuck along the way.
The script should send 10 concurrent requests - concurrency is defined at the Thread Group level
Requests are configured using the HTTP Request sampler
Values can be read from the CSV file using the CSV Data Set Config
Status code verification is more or less done automatically by JMeter: it treats status codes below 400 as successful; additionally, you can use a Response Assertion for this
It's not recommended to use Listeners at all
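If it helps to see the requirements in running form, here is a plain-Python sketch of the same load pattern, not a JMeter script: 10 concurrent workers, capital names read from a CSV file, a status-code check, and a 2-minute duration. The file name capitals.csv is an assumption, and appending the capital name to the question's endpoint path follows the restcountries capital API convention:

import csv
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import urlopen

URL = "https://restcountries.eu/rest/v2/capital/{}?fields=name;capital;currencies;latlng;regionalBlocs"

with open("capitals.csv") as f:
    capitals = [row[0] for row in csv.reader(f)]

def worker(capital):
    deadline = time.time() + 120  # run for 2 minutes
    while time.time() < deadline:
        try:
            with urlopen(URL.format(quote(capital))) as resp:
                # status code verification, analogous to a Response Assertion
                assert resp.status == 200, "unexpected status %s" % resp.status
        except Exception as exc:
            print("request for %s failed: %s" % (capital, exc))

with ThreadPoolExecutor(max_workers=10) as pool:  # 10 concurrent "threads"
    pool.map(worker, capitals[:10])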

Stop Scrapy spider when date from page is older than yesterday

This code is part of my Scrapy spider:
# scraping data from the page has been done before this line
# (this block runs inside the loop over the ads on the page)
publish_date_datetime_object = datetime.strptime(publish_date, '%d.%m.%Y.').date()
yesterday = (datetime.now() - timedelta(days=1)).date()
if publish_date_datetime_object > yesterday:
    continue
if publish_date_datetime_object < yesterday:
    raise scrapy.exceptions.CloseSpider('---STOP---DATE IS OLDER THAN YESTERDAY')
# after this is the ItemLoader and yield
This is working fine.
My question is: is the Scrapy spider the best place to have this code/logic?
I do not know how to implement it in another place.
Maybe it could be implemented in a pipeline, but AFAIK the pipeline is evaluated after the scraping has been done, which means I would need to scrape all the ads, even those I do not need.
For scale: it is 5 ads from yesterday versus 500 ads on the whole page.
I do not see any benefit in moving the code to a pipeline if that means processing (downloading and scraping) 500 ads when I only need 5 of them.
The spider is the right place if you need it to stop crawling after something indicates there is no more useful data to collect.
It is also the right way to do it: raising a CloseSpider exception with a verbose closing-reason message.
A pipeline would be more suitable only if there were items to be collected after the threshold is detected, but if they are ALL disposable this would be a waste of resources.
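To make the placement concrete, here is a minimal sketch of how such a check typically sits inside a spider callback. The start URL, the CSS selectors, the AdItem fields and the loop over the ads are illustrative assumptions, not the asker's actual code:

from datetime import datetime, timedelta

import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.loader import ItemLoader


class AdItem(scrapy.Item):
    title = scrapy.Field()  # illustrative field


class AdsSpider(scrapy.Spider):
    name = "ads"
    start_urls = ["https://example.com/ads"]  # illustrative

    def parse(self, response):
        yesterday = (datetime.now() - timedelta(days=1)).date()
        for ad in response.css("div.ad"):  # illustrative selector
            publish_date = ad.css("span.date::text").get()
            publish_date = datetime.strptime(publish_date, "%d.%m.%Y.").date()
            if publish_date > yesterday:
                continue  # skip today's ads, keep scanning
            if publish_date < yesterday:
                # everything from here on is older, so stop the whole crawl
                raise CloseSpider("---STOP---DATE IS OLDER THAN YESTERDAY")
            loader = ItemLoader(item=AdItem(), selector=ad)
            loader.add_css("title", "h2::text")  # illustrative field/selector
            yield loader.load_item()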

Scrapy: use both CPU cores in the system

I am running Scrapy using its internal API and everything is well and good so far. But I noticed that it's not fully using the concurrency of 16 set in the settings. I have changed the delay to 0 and done everything else I can. Looking at the HTTP requests being sent, it is clear that Scrapy is not downloading 16 sites at all points in time. At some points it is downloading only 3 to 4 links, and the queue is not empty at that time.
When I checked the core usage, what I found was that out of 2 cores, one is at 100% and the other is mostly idle.
That is when I learned that Twisted, the library on top of which Scrapy is built, is single-threaded, and that is why it is only using a single core.
Is there any workaround to convince Scrapy to use all the cores?
Scrapy is based on the Twisted framework. Twisted is an event-loop-based framework, so it does scheduled processing and not multiprocessing. That is why your Scrapy crawl runs in just one process. Now you can technically start two spiders using the code below:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
And there is nothing stopping you from using the same class for both spiders.
The process.crawl method takes *args and **kwargs to pass to your spider, so you can parametrize your spiders using this approach. Say your spider is supposed to crawl 100 pages; you can add a start and end parameter to your crawler class and do something like the below:
process.crawl(YourSpider, start=0, end=50)
process.crawl(YourSpider, start=51, end=100)
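On the spider side, those start/end keyword arguments can be picked up in __init__ and used to build the requests. This is only a sketch under the assumption that pages follow a simple numbered URL pattern; the attribute and parameter names are illustrative:

import scrapy
from scrapy.crawler import CrawlerProcess


class YourSpider(scrapy.Spider):
    name = "your_spider"

    def __init__(self, start=0, end=100, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        # kwargs passed to process.crawl() end up here
        self.start_page = int(start)
        self.end_page = int(end)

    def start_requests(self):
        for page in range(self.start_page, self.end_page + 1):
            # illustrative URL pattern
            yield scrapy.Request("https://example.com/page/%d" % page)

    def parse(self, response):
        yield {"url": response.url}


process = CrawlerProcess()
process.crawl(YourSpider, start=0, end=50)
process.crawl(YourSpider, start=51, end=100)
process.start()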
Note that both crawlers will have their own settings, so if you have 16 concurrent requests set for your spider, then the two combined will effectively have 32.
In most cases scraping is less about CPU and more about network access, which is non-blocking in Twisted anyway, so I am not sure this would give you a huge advantage over setting CONCURRENT_REQUESTS to 32 in a single spider.
PS: Consider reading this page to understand more https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Another option is to run your spiders using Scrapyd, which lets you run multiple processes concurrently. See max_proc and max_proc_per_cpu options in the documentation. If you don't want to solve your problem programmatically, this could be the way to go.

Use Scrapy to combine data from multiple AJAX requests into a single item

What is the best way to crawl pages with content coming from multiple AJAX requests? It looks like I have the following options (given that AJAX URLs are already known):
Crawl AJAX URLs sequentially, passing the same item between requests
Crawl AJAX URLs concurrently and output each part as a separate item with a shared key (e.g. the source URL)
What is the most common practice? Is there a way to get a single item at the end, but allow some AJAX requests to fail without compromising the rest of the data?
Scrapy is built for concurrency and statelessness, so if option 2 is possible, it is always preferred, from both a speed and a memory-consumption perspective.
In case the requests must be serialized, consider accumulating the partial data in the request's meta (or cb_kwargs) field, as in the sketch below.
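Here is a minimal sketch of that sequential approach, with an errback so that a failed AJAX call does not throw away the parts already collected. The product URL, the ajax_urls list and the way parts are merged into the item are illustrative assumptions; older Scrapy versions would pass the item through request.meta instead of cb_kwargs:

import scrapy


class AjaxCombineSpider(scrapy.Spider):
    name = "ajax_combine"
    start_urls = ["https://example.com/product/1"]  # illustrative

    def parse(self, response):
        item = {"url": response.url}
        # the already-known AJAX endpoints for this page (assumption)
        ajax_urls = [
            "https://example.com/ajax/price?id=1",
            "https://example.com/ajax/reviews?id=1",
        ]
        return self.next_ajax(item, ajax_urls)

    def next_ajax(self, item, pending):
        if not pending:
            return item  # all parts collected (or skipped): emit a single item
        url = pending.pop(0)
        return scrapy.Request(
            url,
            callback=self.parse_ajax,
            errback=self.ajax_failed,
            cb_kwargs={"item": item, "pending": pending},
            dont_filter=True,
        )

    def parse_ajax(self, response, item, pending):
        # merge this part into the shared item; the key name is illustrative
        item[response.url] = response.text
        return self.next_ajax(item, pending)

    def ajax_failed(self, failure):
        # a failed part should not compromise the rest of the data
        item = failure.request.cb_kwargs["item"]
        pending = failure.request.cb_kwargs["pending"]
        return self.next_ajax(item, pending)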
Check scrapy-inline-requests. It allows you to smoothly process multiple nested requests in a response handler.

Best way to cache RESTful API results of GET calls

I'm thinking about the best way to create a cache layer in front of, or as the first layer for, GET requests to my RESTful API (written in Ruby).
Not every request can be cached, because even for some GET requests the API has to validate the requesting user / application. That means I need to configure which request is cacheable and how long each cached answer is valid. For a few cases I need a very short expiration time of e.g. 15s and below. And I should be able to let cache entries expire by the API application even if the expiration date is not reached yet.
I already thought about many possible solutions, my two best ideas:
the first layer of the API (even before the routing), with the cache logic written by myself (to keep all configuration options in my hands), and answers plus expiration dates stored in Memcached
a web server proxy (highly configurable), perhaps something like Squid, but I have never used a proxy for a case like this before and I'm absolutely not sure about it
I also thought about a cache solution like Varnish. I have used Varnish for "usual" web applications and it's impressive, but the configuration is kind of special. Still, I would use it if it turned out to be the fastest solution.
Another thought was to cache in the Solr index, which I'm already using in the data layer to avoid querying the database for most requests.
If someone has a hint or good sources to read about this topic, let me know.
Firstly, build your RESTful API to be actually RESTful. That means authenticated users can also get cached content: to keep all state in the URL, it needs to contain the auth details. Of course the hit rate will be lower here, but it is cacheable.
With a good number of logged-in users it will be very beneficial to have some sort of model cache behind a full-page cache, as many models are still shared even if some aren't (in a good OOP structure).
Then, for a full-page cache, you are best off keeping all the requests away from the web server, and especially away from the dynamic processing in the next step (in your case Ruby). The fastest way to cache full pages from a normal web server is always a caching proxy in front of the web servers.
Varnish is, in my opinion, as good and easy as it gets, but some indeed prefer Squid.
memcached is a great option, and I see you mentioned this already as a possible option. Redis also seems to be praised a lot as another option at this level.
On the application level, in terms of a more granular approach to caching on a file-by-file and/or module basis, local storage is always an option for common objects a user may request over and over again; even something as simple as dropping response objects into the session, so they can be reused instead of making another HTTP REST call, and coding appropriately.
People go back and forth debating Varnish vs. Squid, and both seem to have their pros and cons, so I can't comment on which one is better, but many people say Varnish in front of a tuned Apache server is great for dynamic websites.
Since REST is an HTTP thing, it could be that the best way of caching requests is to use HTTP caching.
Look into using ETags on your responses, checking the ETag in requests to reply with '304 Not Modified', and having Rack::Cache serve cached data if the ETags are the same. This works great for Cache-Control 'public' content.
Rack::Cache is best configured to use memcache for its storage needs.
I wrote a blog post last week about the interesting way that Rack::Cache uses ETags to detect and return cached content to new clients: http://blog.craz8.com/articles/2012/12/19/rack-cache-and-etags-for-even-faster-rails
Even if you're not using Rails, the Rack middleware tools are quite good for this stuff.
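The asker's stack is Ruby/Rack, but just to make the ETag handshake concrete, here is a framework-agnostic sketch in Python of the logic a cache layer performs. The function names are illustrative, not part of any library:

import hashlib

def make_etag(body: bytes) -> str:
    # a strong ETag derived from the response body
    return '"%s"' % hashlib.sha1(body).hexdigest()

def respond(request_headers: dict, body: bytes):
    """Return (status, headers, body), honouring If-None-Match."""
    etag = make_etag(body)
    if request_headers.get("If-None-Match") == etag:
        # the client already has this representation: no body, just 304
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag, "Cache-Control": "public, max-age=15"}, body

# usage sketch: a client revalidating with a stale ETag gets a fresh 200
status, headers, payload = respond({"If-None-Match": '"abc"'}, b'{"name": "example"}')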
Redis is a good option here as well: an open-source, advanced key-value cache and store.
I’ve used redis successfully this way in my REST view:
from django.conf import settings
from django.utils.encoding import force_bytes
import hashlib
import json
import logging
from redis import StrictRedis
# the view uses Django REST framework classes
from rest_framework.generics import ListAPIView
from rest_framework.permissions import IsAdminUser
from rest_framework.renderers import JSONRenderer
from rest_framework.response import Response

logger = logging.getLogger(__name__)

def get_redis():
    # get a redis connection from the RQ config in settings
    rc = settings.RQ_QUEUES['default']
    cache = StrictRedis(host=rc['HOST'], port=rc['PORT'], db=rc['DB'])
    return cache

class EventList(ListAPIView):
    queryset = Event.objects.all()
    serializer_class = EventSerializer
    renderer_classes = (JSONRenderer, )

    def get(self, request, format=None):
        if IsAdminUser not in self.permission_classes:  # don't cache requests from admins
            # make a key that represents the request results you want to cache
            # your requirements may vary
            key = get_key_from_request()
            # I find it useful to hash the key when query params are added
            # I also preface the event cache key with a string, so I can clear the cache
            # when events are changed
            key = "todaysevents" + hashlib.md5(force_bytes(key)).hexdigest()
            # I don't want any cache issues (such as not being able to connect to redis)
            # to affect my end users, so I protect this section
            try:
                cache = get_redis()
                data = cache.get(key)
                if not data:
                    # not cached, so perform the standard REST functions for this view
                    queryset = self.filter_queryset(self.get_queryset())
                    serializer = self.get_serializer(queryset, many=True)
                    data = serializer.data
                    # cache the data as a string
                    cache.set(key, json.dumps(data))
                    # manage the expiration of the cache
                    expire = 60 * 60 * 2
                    cache.expire(key, expire)
                else:
                    # this is the place where you save all the time:
                    # just return the cached data
                    data = json.loads(data)
                return Response(data)
            except Exception as e:
                logger.exception("Error accessing event cache\n %s" % (e))
        # for Admins or exceptions, business as usual
        return super(EventList, self).get(request, format)
In my Event model updates, I clear any event caches. This is hardly ever performed (only Admins create events, and not that often), so I always clear all event caches:
class Event(models.Model):
    ...

    def clear_cache(self):
        try:
            cache = get_redis()
            eventkey = "todaysevents"
            for key in cache.scan_iter("%s*" % eventkey):
                cache.delete(key)
        except Exception:
            pass

    def save(self, *args, **kwargs):
        self.clear_cache()
        return super(Event, self).save(*args, **kwargs)