Customizing Crawlera Ban Rules For Scrapy

I'm in the process of updating a (formerly) working website crawler. It appears the website I have been crawling has introduced stricter ban rules, so I have begun using Crawlera in an attempt to circumvent the issue.
The problem I'm having currently is that the target website uses a non-standard banning approach: a 302 redirect to a standard HTML page. Crawlera does not detect this as a ban, and the crawl stops immediately. Is there any way I can customize what Crawlera detects as a ban, or will I need to look into another approach?

I think you can ask them to add that rule to their system, and depending on your plan they may also offer a way to define your own ban rules (I'm not completely sure, so it's worth asking their support).
I would say that is your best bet; if it isn't possible, I would recommend writing your own downloader middleware to retry when that redirect happens. When Crawlera gets a ban it retries the request n more times (you can also control that via headers), so you'll have to set its retries to 0 and handle the retries yourself depending on what response you get.
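A minimal sketch of such a middleware, under those assumptions, could look like the one below. The ban path, project module path, and retry limit are placeholders, and the X-Crawlera-Max-Retries header name should be double-checked against Crawlera's documentation for your plan.

```python
# Minimal sketch (not a drop-in solution): a downloader middleware that treats a
# 302 redirect to the site's ban page as a ban and retries the request itself.
from scrapy.exceptions import IgnoreRequest

BAN_LOCATION = '/access-denied'   # hypothetical path the site redirects banned clients to
MAX_BAN_RETRIES = 3               # how many times to retry before giving up


class BanRedirectRetryMiddleware:

    def process_response(self, request, response, spider):
        location = response.headers.get('Location', b'').decode('utf-8', 'ignore')
        if response.status == 302 and BAN_LOCATION in location:
            retries = request.meta.get('ban_retries', 0)
            if retries >= MAX_BAN_RETRIES:
                raise IgnoreRequest('Still banned after %d retries: %s'
                                    % (MAX_BAN_RETRIES, request.url))
            spider.logger.info('Ban redirect detected, retrying %s', request.url)
            retry_request = request.replace(dont_filter=True)
            retry_request.meta['ban_retries'] = retries + 1
            return retry_request
        return response


# settings.py (sketch): an order number above RedirectMiddleware's default of 600
# means this middleware's process_response() runs first and sees the raw 302.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BanRedirectRetryMiddleware': 650,
}

# Ask Crawlera not to retry on its own (header name per Crawlera's documentation
# at the time; verify it against your plan).
DEFAULT_REQUEST_HEADERS = {
    'X-Crawlera-Max-Retries': '0',
}
```

The order number matters because Scrapy calls process_response() in decreasing middleware order, so anything above 600 intercepts the 302 before RedirectMiddleware follows it.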

Related

How to remove Cloudflare's JavaScript files that are slowing my site?

I have a WordPress site at http://biblicomentarios.com, and I use Cloudflare. No matter what I do, I can't remove two JavaScript files that come from Cloudflare. I use GTmetrix, and I see them in the waterfall tab blocking my site: email-decode.min.js and rocket-loader.min.js. Of course, I've already disabled email obfuscation in the Scrape Shield tab, and I have Rocket Loader disabled. I purged ALL my caches (the Cloudflare cache, the Autoptimize cache, SuperCache, even the cPanel cache). But the scripts are pretty persistent, and they keep appearing in the GTmetrix waterfall as render-blocking scripts, slowing my site. Also, I can't add Expires headers to them, so I have more than one reason to want them off my site. Is there any way to remove them, given that they are already disabled in the Cloudflare panel?
Please, note
- Rocket Loader is disabled; Scrape Shield's email obfuscation is disabled.
- I do not have a "cdn-cgi" directory on my site or server. This path is typically injected by Cloudflare, so both scripts come from Cloudflare.
- I have no “apps” installed through CloudFlare.
- The blocking scripts paths are https://ajax.cloudflare.com/cdn-cgi/scripts/2448a7bd/cloudflare-static/rocket-loader.min.js and http://biblicomentarios.com/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js.
Depending on what Cloudflare plan you're on, you can set up "Page Rules" for your site or subsections of your site.
I'd suggest adding two rules:
- Disable Security: this should prevent email-decode.min.js from loading.
- Disable Performance: this should prevent rocket-loader.min.js from loading.
I think you can have one setting per rule, and 3 page rules if you're using the Free plan.
- Go to Scrape Shield, then disable Email Address Obfuscation. This will disable email-decode.min.js.
- Go to Speed -> Optimization, then disable Rocket Loader™. This will disable rocket-loader.min.js.
Remember to clear the cache afterwards.
Just be careful: Disable Performance will turn off:
- Auto Minify
- Rocket Loader
- Mirage
- Polish

How to create an Apache rewrite rule to redirect the Microsoft Edge browser

How do you create a simple Apache rewrite rule for mod_rewrite that redirects any Microsoft Edge browser to one fixed page? This is needed on an Apache instance that runs on an intranet and serves one specific application that does NOT cater to Edge, only IE. When a user tries to reach the application with Edge, the rule should redirect them to a specific URL on that server that explains how to find IE.
I am facing two issues:
I know the rule needs to act on the User-Agent header, but I don't know what makes the Edge browser's string unique among all the others. Any thoughts on the best place to figure this out? I have looked at Microsoft's web site, and they share what the strings are, but they don't spell out exactly how to tell the browsers apart. I am thinking it might be best to look at an open source library that has already figured it out.
How do I write a rule for any URL that hits the site EXCEPT the 'enlightenment' page?
In general, user agents are really messy to deal with. It is best not to reinvent that wheel and to use a heavily tested library instead. One of the best is ua-parser, a collection of regexes for matching user agents, with flavors in most languages.
If you want to handle this in the Apache logic itself, you can extract the relevant pattern from their list of regexes:
(Edge)(\d+)(?:\.(\d+))?
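As a quick sanity check, you can try that pattern (slightly adapted to include the slash that real Edge tokens carry) against a legacy Edge user-agent string; the UA string below is just a representative example:

```python
import re

# A representative legacy Microsoft Edge user-agent string; the distinguishing
# part is the trailing "Edge/<major>.<build>" token.
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393")

match = re.search(r'Edge/(\d+)(?:\.(\d+))?', ua)
if match:
    print("Edge detected, major version", match.group(1))  # -> major version 14
```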
How do I write a rule for any URL that hits the site EXCEPT the 'enlightenment' page?
The RewriteRule directive, combined with a RewriteCond that excludes that page, is what you want to look at.
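Putting the two parts together, a minimal mod_rewrite sketch could look like the following; the /use-ie.html path is a placeholder for your 'enlightenment' page, and the pattern targets the Edge/<version> token that Edge puts in its User-Agent string:

```apache
RewriteEngine On

# Only act on Edge user agents (they carry an "Edge/<version>" token).
RewriteCond %{HTTP_USER_AGENT} Edge/\d+ [NC]

# Leave the enlightenment page itself alone so the redirect doesn't loop.
RewriteCond %{REQUEST_URI} !^/use-ie\.html$

# Send every other URL to the enlightenment page.
RewriteRule ^ /use-ie.html [R=302,L]
```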

Scrapy on Ubuntu web server getting 417 error

I have been developing a crawling script for a number of news websites and using Scrapy to handle the logic.
When I run my script on an Ubuntu web server (Digital Ocean, if that helps), a lot of the websites that return 200 on my local machine return 417 instead.
I was wondering how I should fix this, if it is a problem at all. I'm not quite sure whether it is affecting the final output, but it seems like it has been.
Some of my own research has turned up:
- http://www.checkupdown.com/status/E417.html. I've tried adding an Expect header to my requests, which hasn't worked.
- I've heard that it might be a problem with HTTP 1.1 vs 1.0? EDIT: Nope. Scrapy's HTTP download handler automatically chooses 1.1 if it is available.
417 (Expectation Failed) is the error a web server returns when it cannot meet the expectation your client sent in the Expect request header (typically Expect: 100-continue).
This looks like a Scrapy bug or, more likely, a misconfiguration.
It seems your public IP address either was already banned, or got banned while you were scraping, by the web server of the page you want to scrape. For the first situation you can reboot your instance to get a new public IP (at least this works on Amazon). For the second scenario, here are some tips from the official documentation to avoid it:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher); see the DOWNLOAD_DELAY setting
- if possible, use Google cache to fetch pages instead of hitting the sites directly
- use a pool of rotating IPs, for example the free Tor project or paid services like ProxyMesh
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages; one example of such a downloader is Crawlera
Additionally, you can reduce the concurrent requests setting in your spider; that worked for me once.
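Several of those tips map directly onto a handful of Scrapy settings. A minimal sketch, assuming the usual settings.py and with purely illustrative numbers and user agent:

```python
# settings.py (sketch) -- illustrative values only; tune them for your target sites.

COOKIES_ENABLED = False             # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2                  # seconds between requests to the same site
CONCURRENT_REQUESTS = 8             # down from the default of 16
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# A single browser-like user agent. Rotating through a pool of them requires a
# small custom downloader middleware (or an off-the-shelf one); Scrapy itself
# only reads this one setting.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36')
```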

Recording AJAX requests and Pop ups using Jmeter or Badboy

I am trying to load test a website where a lot of images are loaded via Ajax, and the very first step, logging into the application, opens a pop-up when we click the log-in button. I tried JMeter's proxy recorder but failed. I also tried Badboy, and that didn't work out either.
The limitation I have is that I need to load test it with JMeter only. Are there any plugins that can be integrated into JMeter so that Ajax and pop-ups in an application can be handled in a much better way?
In general, JMeter does not have a problem with Ajax or pop-ups, so there is no plugin to address them. Both recording and playback happen at the HTTP layer, so things like pop-ups are somewhat irrelevant and Ajax is more a matter of timing; in both cases it is only the HTTP call that matters. So if you are having a problem, it could be something else that is holding you up. Try playing with the proxy settings or using a different browser; beyond that, you would need to expand 'didn't work out' and 'it failed' into a more detailed problem statement!
One solution, regardless of your exact problem, is to build the test plan manually; this is often the easiest way to work with JMeter. You can use a tool like Fiddler or Charles to examine the traffic and create the requests directly based on what you see. You can also use the browser's dev tools for this. You might instinctively think this will be difficult, but it's not, and the added bonus is that the process gives you a solid understanding of how whatever you are testing actually works, which is always nice to have.
JMeter is a tool for testing server-side activity; as long as you record every request to the server side, it doesn't care what happens on the client. Can you give details about why JMeter "failed" to record the Ajax requests?
Are you sure you have included a Recording Controller in your JMeter thread group?
I think something is missing in your configuration of the Recording Controller or the HTTP Request sampler.
Some more information about your "didn't work" situation would be much more helpful.
I'd also suggest having a look at the JMeter log file (jmeter.log) created in the "bin" folder to understand the root cause of the issue.
Thanks,

Why are the files from the Google Libraries API loaded via HTTPS?

Well, the main question says it all: why are the files loaded via HTTPS? I am just adding some new libraries to the website and noticed that the links are all https://.
From what I understand, you use HTTPS when there is sensitive information involved, and I don't think that is the case with these libraries; nobody is particularly interested in intercepting the content of these files.
Is there any explanation for this?
People asked for it so they could use the libraries on things like e-commerce sites, which eventually require an SSL connection. They provide links to the https version by default to make it easier for everyone overall (automatically avoids mixed-content warnings), and for most people the slight performance cost won't matter. But if you know you won't have any need for it, just strip it down to a regular http connection:
https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
They did actually publish the http URLs at one point, but I'd imagine the mixed-content warnings that came up when people later added SSL connections without thinking it through created a bunch of support questions, so it was simpler to default to showing https and let people change it if they really wanted to.