Crawlera, cookies, sessions, rate limiting - scrapy

I'm trying to use scrapinghub to crawl a website that heavily limits request rate.
If I run the spider as-is, I get 429 pretty soon.
If I enable crawlera as per standard instructions, the spider doesn't work anymore.
If I set headers = {"X-Crawlera-Cookies": "disable"} the spider works again, but I get 429s -- so I assume the limiter works (also) on the cookie.
So what would an approach be here?

You can try rotating user agents with a RandomUserAgent middleware. If you don't want to write your own implementation, you can use this:
https://github.com/cnu/scrapy-random-useragent
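For reference, wiring that middleware in is mostly a settings change. A minimal sketch of what settings.py might look like with scrapy-random-useragent (the middleware path, priority, and USER_AGENT_LIST setting are taken from that project's README, so verify them against the version you install):
# settings.py (sketch, assuming scrapy-random-useragent is installed)
DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's built-in user agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # let the random user agent middleware pick a User-Agent per request
    'random_useragent.RandomUserAgentMiddleware': 400,
}
# plain-text file with one user agent string per line
USER_AGENT_LIST = "/path/to/useragents.txt"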

Related

RESTDataSource - How to know if response comes from get request or cache

I need to get some data from a REST API in my GraphQL API. For that I'm extending RESTDataSource from apollo-datasource-rest.
From what I understood, RESTDataSource automatically caches requests, but I'd like to verify that this is actually happening. Is there a way to know whether my request is getting its data from the cache or hitting the REST API?
I noticed that the first request takes some time, but the following ones are way faster, and also that the didReceiveResponse method is not called every time I make a query. Is that because the data is loaded from the cache?
I'm using apollo-server-express.
Thanks for your help!
You can time the requests like the following:
console.time('restdatasource get req')
this.get(url)
console.timeEnd('restdatasource get req')
Now, if the time is under 100-150 milliseconds, that should be a request coming from the cache.
You can monitor the Network tab in the developer tools to see which endpoints the application is calling. If it uses cached data, no new request to your endpoint will be logged.
If you are trying to verify this locally, one good option is to set up a local proxy so that you can see all the network calls being made (no network call means the response was read from the cache). You can then configure your app, following the Apollo documentation, to forward all outgoing calls through a proxy like mitmproxy.

Scrapy: use both cores in the system

I am running Scrapy using its internal API and everything is well and good so far. But I noticed that it's not fully using the concurrency of 16 set in the settings. I have changed the delay to 0 and done everything else I can. Looking at the HTTP requests being sent, it's clear that Scrapy is not downloading 16 sites at all points in time. At some points it's downloading only 3 to 4 links, and the queue is not empty at that point.
When I checked the core usage, what I found was that of the 2 cores, one is at 100% and the other is mostly idle.
That is when I learned that Twisted, the library on top of which Scrapy is built, is single-threaded, and that is why it's only using a single core.
Is there any workaround to convince Scrapy to use all the cores?
Scrapy is based on the Twisted framework. Twisted is an event-loop-based framework, so it does scheduled processing rather than multiprocessing. That is why your Scrapy crawl runs in just one process. Now, you can technically start two spiders in the same process using the code below:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
And there is nothing that stops you from having the same class for both the spiders.
The process.crawl method takes *args and **kwargs to pass to your spider, so you can parametrize your spiders using this approach. Let's say your spider is supposed to crawl 100 pages; you can add a start and end parameter to your crawler class and do something like the below:
process.crawl(YourSpider, start=0, end=50)
process.crawl(YourSpider, start=51, end=100)
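On the spider side, those arguments simply arrive as keyword arguments. Here is a minimal sketch, where the spider name, URL pattern, and parameter names are illustrative assumptions rather than anything from the original question:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"  # hypothetical name

    def __init__(self, start=0, end=0, *args, **kwargs):
        super(YourSpider, self).__init__(*args, **kwargs)
        # start/end arrive from process.crawl(YourSpider, start=..., end=...)
        self.start_page = int(start)
        self.end_page = int(end)

    def start_requests(self):
        # hypothetical paginated URL; substitute the real site
        for page in range(self.start_page, self.end_page + 1):
            yield scrapy.Request("https://example.com/page/%d" % page)

    def parse(self, response):
        # your parsing logic goes here
        pass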
Note that both crawlers will have their own settings, so if you have 16 concurrent requests set for your spider, the two combined will effectively have 32.
In most cases scraping is less about CPU and more about network access, which is non-blocking in Twisted anyway, so I am not sure this would give you a very big advantage over setting CONCURRENT_REQUESTS to 32 in a single spider.
PS: Consider reading this page to understand more https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
Another option is to run your spiders using Scrapyd, which lets you run multiple processes concurrently. See max_proc and max_proc_per_cpu options in the documentation. If you don't want to solve your problem programmatically, this could be the way to go.
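For reference, those limits live in scrapyd.conf; a rough sketch (the values here are only illustrative, see the Scrapyd documentation for the defaults):
[scrapyd]
# 0 means: derive the limit from max_proc_per_cpu and the number of CPUs
max_proc = 0
# maximum number of concurrent Scrapy processes per CPU
max_proc_per_cpu = 4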

python - HTTP Error 503 Service Unavailable

I am trying to scrape data from Google and LinkedIn. Somehow it gave me this error:
*** httperror_seek_wrapper: HTTP Error 503: Service Unavailable
Can someone help advice how I solve this?
Google is simply detecting your query as automated. You would need a captcha solver to get unlimited results. The following link might be helpful.
https://support.google.com/websearch/answer/86640?hl=en
Bypassing Captcha using an OCR Engine:
http://www.debasish.in/2012/01/bypass-captcha-using-python-and.html
Simple Approach:
An even simpler approach is to call sleep() between requests and to vary your queries. That way Google is less likely to spot that you are using an automated system, but it is far slower...
Error Handling:
To simply suppress the error message, wrap the request in try/except.
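As a rough illustration of that error-handling idea (Python 2 urllib2 to match the era of the question; the retry count and delay range are arbitrary assumptions):
import random
import time
import urllib2

def fetch_with_retry(url, retries=3):
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url).read()
        except urllib2.HTTPError as e:
            if e.code != 503:
                raise
            # back off with a randomized delay before retrying
            time.sleep(random.uniform(2, 5))
    return None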
I encountered the same situation and tried using the sleep() function before every request to spread the requests out a little. It looked like it was working fine but failed soon enough, even with a delay of 2 seconds. What finally solved it was using:
with contextlib.closing(urllib.urlopen(urlToOpen)) as x:
    # do stuff with x
I did this because I suspected that too many connections were being left open and had to be closed. In any case, it worked quite consistently with as little as a 0.5 s delay.
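Put together, that pattern looks roughly like this (again Python 2 urllib, as in the snippet above; the 0.5 s delay is just the value mentioned):
import contextlib
import time
import urllib

def fetch(url):
    # explicitly close the connection once the response has been read
    with contextlib.closing(urllib.urlopen(url)) as response:
        data = response.read()
    # small pause between requests so they do not arrive as a burst
    time.sleep(0.5)
    return data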

Intercepting with XMLHttpRequest for a specific address using greasemonkey

I'm trying to write a Greasemonkey script that will work on both Chrome and Firefox: a script that will block XMLHttpRequest calls to a certain hard-coded URL.
I am kind of new to this area and would appreciate some help.
thanks.
It is possible now using @run-at document-start:
http://wiki.greasespot.net/Metadata_Block#.40run-at
It still needs more refinement; check the example:
http://userscripts-mirror.org/scripts/show/125936
This is almost impossible to do with Greasemonkey; it is the wrong tool for the job. Here's what to use, most effective first:
Set your hardware firewall, or router, to block the URL.
Set your software firewall to block the URL.
Use Adblock to block the URL.
Write a convoluted userscript that tries to block requests from one set of pages to a specific URL. Note that this potentially has to block inline src requests as well as AJAX, etc.

Server with the sole purpose of setting cookies

At work we ran up against the problem of setting server-side cookies - a lot of them. Right now we have a PHP script, the sole purpose of which is to set a cookie on the client for our domain. This happens a lot more than 'normal' requests to the server (which is running an app), so we've discussed moving it to its own server. This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again.
Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment. Basically, I need something super simple that can sit around all day/night doing the following:
Check if a certain cookie is set, and
If that cookie is not set, fill it with a random hash (right now it's a simple md5(microtime))
Any suggestions?
You could create a simple HTTP server yourself to accept requests and return the Set-Cookie header with an empty body. This would allow you to move the cookie-generation overhead to wherever you see fit.
I echo the sentiments above, though: unless cookie generation is significantly expensive, I don't think you will gain much by moving away from your current setup.
By way of an example, here is an extremely simple server written with Tornado that simply sets a cookie on GET or HEAD requests to '/'. It includes an async example listening for '/async' which may be of use depending on what you are doing to get your cookie value.
import time
import tornado.ioloop
import tornado.web

class CookieHandler(tornado.web.RequestHandler):
    def get(self):
        cookie_value = str(time.time())
        self.set_cookie('a_nice_cookie', cookie_value, expires_days=10)
        # self.set_secure_cookie('a_double_choc_cookie', cookie_value)
        self.finish()

    def head(self):
        return self.get()

class AsyncCookieHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        self._calculate_cookie_value(self._on_create_cookie)

    @tornado.web.asynchronous
    def head(self):
        self._calculate_cookie_value(self._on_create_cookie)

    def _on_create_cookie(self, cookie_value):
        self.set_cookie('double_choc_cookie', cookie_value, expires_days=10)
        self.finish()

    def _calculate_cookie_value(self, callback):
        ## meaningless async example... just wastes 2 seconds
        def _fake_expensive_op():
            val = str(time.time())
            callback(val)
        tornado.ioloop.IOLoop.instance().add_timeout(time.time() + 2, _fake_expensive_op)

application = tornado.web.Application([
    (r"/", CookieHandler),
    (r"/async", AsyncCookieHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()
Launch this process with Supervisord and you'll have a simple, fast, low-overhead server that sets cookies.
You could try using mod_headers (usually available in the default install) to manually construct a Set-Cookie header and emit it -- no programming needed as long as it's the same cookie every time. Something like this could work in an .htaccess file:
Header add Set-Cookie "foo=bar; Path=/; Domain=.foo.com; Expires=Sun, 06 May 2012 00:00:00 GMT"
However, this won't work for you. There's no code here. It's just a stupid header. It can't come up with the new random value you'd want, and it can't adjust the expire date as is standard practice.
This would be an Apache server, probably dedicated, with one PHP script 3 lines long, just running over and over again. [...] Surely there must be a faster, better way of doing this, rather than starting up the whole PHP environment.
Are you using APC or another bytecode cache? If so, there's almost no startup cost. Because you're talking about setting up an entire server just for this, it sounds like you control the server as well. This means that you can turn off apc.stat for even less of a startup hit.
Really though, if all that script is doing is building an MD5 hash and setting a cookie, it should already be blisteringly fast, especially if it's mod_php. Do you already know, through benchmarking and testing, that the script isn't performing as well as you'd like? If so, can you share those benchmarks with us?
It would be interesting to know why you think you need an extra server. Do you actually have a bottleneck in generating the cookie, or is it somewhere else? Is it the log writing, since requests happen a lot? AJAX polling? Client download speed?
At least for starters, I'd look at something more efficient than fetching the time to generate the "random hash". For example, on this Intel i7 laptop, generating 999,999 MD5 hashes from microtime takes roughly 4 seconds, while doing the same thing with random numbers is about a second faster (not taking seeding of rand into account).
Then, if you take the opening and closing of the socket into account, just moving your script (which is most likely already really fast) to another server may actually slow the requests down. Actually, now that I've re-read your question, it makes me think: is your cookie-setter script already a dedicated page, or do you just include it into the real content served by another PHP script? If not, try that approach. This would also be beneficial if you have default logging rules for Apache: when cookies are set on their own page, Apache logs a row for each of those requests, and on high-load systems this accumulates into the total I/O time spent by Apache.
Also, consider that testing whether the cookie is set and then setting it might be slower than simply setting it unconditionally, whether it already exists or not.
But overall, without knowing more about how you handle the cookies now, I don't think you need to set up a server just to offload cookie generation. Unless you are doing something really nasty.
Apache has a module called mod_usertrack which looks like it might do exactly what you want. There's no need for PHP and you could likely create a really optimised lightweight Apache config to serve this with.
If you want something even faster and are happy not to use Apache, you could use lighttpd and its mod_usertrack, or nginx's HttpUserId module.
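For what it's worth, the Apache side of mod_usertrack is only a few directives; a minimal sketch (the cookie name and expiry are placeholders, and the directive names should be checked against the mod_usertrack documentation for your Apache version):
LoadModule usertrack_module modules/mod_usertrack.so

CookieTracking on
CookieName our_id_cookie
CookieExpires "10 years"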