Scrapy on Ubuntu web server getting 417 error

I have been developing a crawling script for a number of news websites and using Scrapy to handle the logic.
When I run my script on an Ubuntu web server (Digital Ocean, if that helps), a lot of the websites that return 200 on my local machine turn out to be 417 instead.
I was wondering how I should fix this, if it is a problem at all. I'm not quite sure whether it is affecting the final output, but it seems to be.
Some of my own research has turned up:
http://www.checkupdown.com/status/E417.html. I've tried adding an Expect header to my requests, which hasn't worked.
I've heard that it might be a problem with HTTP/1.1 vs HTTP/1.0? EDIT: Nope. Scrapy's HTTPDownloaderHandler automatically chooses 1.1 if it is available.

417 (Expectation Failed) is the error a web server returns when it cannot, or will not, meet the expectation declared in the request's Expect header, most commonly Expect: 100-continue.
This looks like a Scrapy bug or, more likely, a misconfiguration.
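If you want to see exactly what the crawl is sending and receiving, a minimal diagnostic spider along these lines can help; run it on both your local machine and the droplet and compare the output (the URL is a placeholder):

import scrapy

class HeaderCheckSpider(scrapy.Spider):
    # Fetch one page and log the status plus the headers Scrapy actually sent,
    # so the output can be compared between the local machine and the server.
    name = "header_check"
    start_urls = ["http://example.com/"]   # placeholder: use one of the failing news URLs
    handle_httpstatus_list = [417]         # let the 417 responses reach parse()

    def parse(self, response):
        self.logger.info("status: %s", response.status)
        self.logger.info("request headers sent: %s", response.request.headers)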

It seems your public IP address was either already banned, or was banned while you were scraping, by the web server of the site you want to scrape. For the first situation you can reboot your instance to get a new public IP (at least this works on Amazon). For the second scenario, here are some tips from the official documentation to avoid this situation:
rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
use download delays (2 or higher). See DOWNLOAD_DELAY setting.
if possible, use Google cache to fetch pages, instead of hitting the sites directly
use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
Additionally, you can reduce the concurrent requests settings in your spider; that worked for me once.
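For reference, here is a minimal sketch of how those tips map onto a Scrapy settings.py; the values are illustrative, not recommendations from this thread:

# settings.py -- illustrative values only
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # rotate from a pool via middleware in practice
COOKIES_ENABLED = False             # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2                  # seconds between requests to the same site
CONCURRENT_REQUESTS = 8             # reduce overall concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True         # optionally let Scrapy back off based on server load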

Related

Is PageSpeed Insights bypassing Google CDN cache?

We're using Google Cloud Platform to host a WordPress site:
Google Load Balancer with CDN -> Instance Group with single VM -> Nginx + WordPress
From step 1 (only VM with WordPress, no cache) to the last step (whole setup with Load Balancer and CDN) I could progressively see the improvement when testing locally from my browser and from GTmetrix. But PageSpeed Insights always showed little improvement.
Now we're proud of an impressive 98/97 score in GTmetrix (woah!), but PSI still shows we're pretty average, especially on mobile (range from 45-55).
Problem: we're concerned about page ranking in Google so we'd like to make PSI happy as well. Also... our client won't understand that we did make an improvement while PSI still shows that score.
I was digging and found a few weird things about PSI:
When we adjusted cache-control in nginx, it was correctly detected by the local browser and by GTmetrix, but the 'Serve static assets with an efficient cache policy' section in PSI showed the old values for a few days.
The homepage has a background video hosted in 3 formats (mp4, webm, ogv). Clients are supposed to request only one of them (my browser and GTmetrix do), but PSI actually requests all three of them. I can see them in the 'Avoid enormous network payloads' section.
When a client requests our homepage, only the GET / request reaches our backend server (which is the expected behaviour) and the rest of the static assets are served from the CDN. But when testing from PSI, all requests reach our backend server. I can see them in the nginx access log.
So... those 3 points are making us get a worse score in PSI (point 1 suddenly fixed itself yesterday, days after we changed cache-control), but from what I understand none of them should be happening. Is there something else I am missing?
Thanks in advance to those who can shed some light on this.
but PSI still shows we're pretty average, especially on mobile (range from 45-55).
PSI defaults to showing you a mobile score on a simulated throttled connection. If you look at the desktop tab, it is comparable to GTmetrix (which uses the same engine, Lighthouse, under the hood without throttling, so it will give similar results on desktop).
Sorry to tell you, but the site really is only average on mobile speed. Test it by going to the Performance tab in developer tools and enabling 'Network: Fast 3G' and 'CPU: 4x slowdown' in the throttling options.
Also, the site seems really JavaScript-computation heavy for some reason; PSI simulates a slower CPU, so this is another factor. One script is taking nearly 1 second to evaluate.
The 'Serve static assets with an efficient cache policy' section in PSI showed the old values for a few days.
This is far more likely to be a config issue than a PSI issue. PSI always runs from an empty cache. Perhaps the rollout across all CDN edges is slow for some reason, and PSI was requesting from a different edge than you?
Videos - but PSI actually requests all three of them. I can see them in the 'Avoid enormous network payloads' section.
Do not confuse what you see here with what Google has used to actually run your test. This figure is calculated separately, from all of the assets it can download, not from the run data gathered by loading the page in a headless browser.
Also, these assets are the same for desktop and mobile, so it could be that for some reason it is using one asset for the mobile test and another for the desktop test.
Either way it does indeed look like a bug but it will not affect your score as that is calculated in other ways.
all requests reach our backend server
Then this points to a similar problem as with point 1 - are you sure your CDN has fully deployed? Either that, or you have some rule set up for a certain user agent, or a robots rule, that bypasses your CDN. Most likely a robots rule needs updating.
What can you do?
Double-check your config, deployment, etc. Ensure it has propagated to all CDN sites and that all of the DNS routing is working as expected (a quick header check like the sketch after this list can help).
Check that you don't have rules set up for robots; I notice the site is 'noindex', so perhaps you do have something set up while you are testing that is interfering.
Run an 'Audit' from Developer Tools in Google Chrome -> this uses exactly the same engine that PSI uses. It may give you better results, as it uses your actual browser rather than a headless browser. Although for me this stops the videos loading at all, so something strange is happening with that.
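A minimal sketch of that header check, using Python's requests library (my assumption for illustration; curl -I against the same URLs works just as well). The URLs are placeholders for real static assets served behind the CDN:

import requests

# Placeholder asset URLs; substitute files that should be cached at the edge.
urls = [
    "https://example.com/wp-content/themes/site/style.css",
    "https://example.com/wp-content/uploads/background.mp4",
]

for url in urls:
    resp = requests.get(url, stream=True, timeout=10)
    print(url, resp.status_code)
    for header in ("Cache-Control", "Age", "Via", "Server"):
        # An Age or Via header usually indicates the response came from a cache.
        print("   ", header, "=", resp.headers.get(header))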

Site functionality diminished over VPN/company network

We currently experience diminished site functionality with one of our customers at our main production site. All subpages and resources seem to be affected as well.
The customer reports a completely broken experience for themselves with the site not working correctly at all, mostly due to assets not loading correctly.
We already started investigating and have found that - so far - nothing seems to be wrong with the site itself.
Quick rundown:
The production site has a Cloudflare layer and almost all of its assets are delivered either via CDNjs or Amazon's CloudFront (behind Cloudflare); all assets are reachable via HTTP as well
The site uses SSL and enforces it (the dynamic cert from Cloudflare)
We secured a HAR capture of a request to one of our sites; the request times are extremely long. If you would like to try, here is an online HAR viewer; be sure to uncheck validation of the file.
The customer uses Internet Explorer 8 and Chrome (39). While the site is not optimized for IE8, it should run fine in Chrome; in fact, it runs just fine in most browsers above IE9 for all of us.
Notes
We already ruled out:
Virtual delivery problems (there could be physical limitations we are not aware of)
General faultiness of our setup (We tried three different open VPNs to verify this)
Being on the customer's blacklist by accident (although we cannot be entirely sure of this)
SSL Server Name Indication (SNI) problems
(Potentially) a general problem with the customer's network; the customer does not report any problems with "the rest of the internet".
The customer will not give access to their VPN or disclose security details, so we cannot really test the situation ourselves. We suspect that the customer uses an internal proxy that might cause the problems described, but we are not sure.
Questions
My questions here are:
Is there any known problem caused by internal networking in conjunction with our setup that can cause this behaviour?
Are there potential problems on our end that we could have overlooked or things that we do different from other sites?
It seems the connection is being made (or routed) through a low-bandwidth, high-latency link (or a very congested one). Most of the DNS lookups and connects seem to be taking ~10 s.
In the HAR you can see that it affects fonts.googleapis.com and cdnjs.cloudflare.com. https://www.google-analytics.com/analytics.js has no data captured. To me the claim that the customer does not report any problems with "the rest of the internet" seems kind of dubious, seeing that in this HAR the browser hasn't been able to load the analytics JS and access to the usual CDNs is very slow.
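If you want to reproduce that measurement outside a browser, timing the DNS lookup and the TCP connect separately makes the bottleneck obvious. A rough sketch in Python (the host list mirrors the HAR):

import socket, time

hosts = ["fonts.googleapis.com", "cdnjs.cloudflare.com", "www.google-analytics.com"]

for host in hosts:
    t0 = time.time()
    ip = socket.gethostbyname(host)                          # DNS lookup
    t1 = time.time()
    sock = socket.create_connection((ip, 443), timeout=30)   # TCP connect
    t2 = time.time()
    sock.close()
    print("%s: dns %.2fs (%s), connect %.2fs" % (host, t1 - t0, ip, t2 - t1))

Running it from the customer's machine and from your own network should make it clear whether the delay is in their network path.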
My guesses (pick one or more):
they are testing on a machine different from the one that has no problems with "the rest of the internet"
this machine is very, very slow
it has some kind of content filter, antivirus, or other web-filtering software (perhaps with an SSL certificate installed in order to forge and inspect HTTPS traffic)
the access is routed through a congested route, or a low-bandwidth, high-latency link
Two hotspots:
It sometimes happens that CDN access points are inconsistent; I spent a lot of time understanding this issue. How? In a live session with the client, opening each loaded resource one by one, I realized there were differences between CDN access points (mine in eastern Europe, his in central Europe). The CDN host was one of the biggest US players in the world; anyhow, we fixed this by invalidating (deleting) all files on the CDN so that new/correct ones were loaded.
You need a CDN that supports serving files over HTTPS, and then use that CDN for the SSL requests.

Capture web driver network traffic across all browsers

I want to capture all the network calls from WebDriver in Java. I am not doing any UI testing, just testing JS execution and the requests and responses of some network calls.
I tried using BrowserMob as is suggested in most forums, but I need it to work across all browsers. It worked flawlessly with Firefox, but I was facing some issues with the others. The Safari driver doesn't even support a Proxy capability.
I don't want to use Fiddler, as it involves some manual steps around invoking it and storing the calls, whereas BrowserMob, being an in-code proxy, can be integrated in a smoother fashion.
I also tried using the RC-like package included in the Selenium standalone server package. But I have some HTTPS calls and some nested iframes across domains; I am particularly interested in a cross-domain POST call, and it doesn't work out that well. Also, people keep saying it's not recommended to use that package.
So I had a solution in mind where we use a standalone proxy server running on a machine. Using host entries, we'll point WebDriver to hit the proxy instead of the actual server. The proxy will record all the incoming calls and route them to the actual server host. Later, I can make a request to the proxy which will return all the calls it intercepted. I am not sure whether it's still called a proxy or a router.
I came across TCPmon, but it's no longer being supported. Does anyone know some similar tools that could run on Unix systems or any alternate solutions?
We modified the Fiddler rules script to include a new exec action. If you use their native script editor, it also provides auto-complete features, and we were able to get around it comfortably. The syntax is similar to that of JavaScript.
The Fiddler package comes with an ExecActions.exe which can be used to pass console arguments to a running Fiddler instance from the command prompt.
The code we wrote processed all the sessions captured by Fiddler and wrote them to a file in a custom JSON format; we later used GSON to deserialize it.
Please let me know if you want further details.
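For what it's worth, the standalone recording-proxy idea from the question can also be sketched with mitmproxy (my suggestion, not something used in this thread; it runs fine on Unix systems). You point WebDriver's proxy settings at mitmdump, and an addon script records every request/response pair to JSON:

# record_calls.py -- run with: mitmdump -s record_calls.py
# Then configure WebDriver to use the proxy (default 127.0.0.1:8080).
import json

class Recorder:
    def __init__(self):
        self.calls = []

    def response(self, flow):
        # Called once per completed request/response pair.
        self.calls.append({
            "method": flow.request.method,
            "url": flow.request.url,
            "status": flow.response.status_code,
            "body_size": len(flow.response.content or b""),
        })

    def done(self):
        # Flush everything to disk when mitmdump shuts down.
        with open("captured_calls.json", "w") as f:
            json.dump(self.calls, f, indent=2)

addons = [Recorder()]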

Does the HTTP user-agent affect the page accessed in practice? Any examples?

I am using wget to download a URL that could be used on Linux/OSX/Windows. My question is whether server behavior could be affected by the user-agent string (the -U option). According to this MS link, a web server can use this information to provide content that is tailored for your specific browser. According to the Apache docs (access control section), you can use these directives to deny access to a particular browser (User-Agent). So I am wondering if I need to download the links with a different user-agent for each OS, or whether one download would suffice.
Is this actually done? I tried a bunch of servers but did not really see different behavior across user agents.
There are sites that prevent scraping by returning an error response when they detect you're hitting their servers with an automation tool instead of a browser, and the user agent is one of the aspects of detecting that difference.
Other than that, not much useful can be said, as we don't know which sites you want to target, what HTTP server they run, or what code runs on top of it.
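A quick way to spot-check a particular server is to fetch the same URL with a few different user-agent strings and compare the responses. A sketch using Python's requests library (wget -U against the same URL would show the same thing); the URL and agent strings are placeholders:

import requests

url = "https://example.com/"          # placeholder: the URL you download
user_agents = [
    "Wget/1.21",                                  # a typical wget UA
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # browser-like UA
    "Mozilla/5.0 (X11; Linux x86_64)",
]

for ua in user_agents:
    resp = requests.get(url, headers={"User-Agent": ua}, timeout=10)
    print("%r: status=%s, length=%d" % (ua, resp.status_code, len(resp.content)))

If the status codes or body lengths differ between agents, the server is varying its response; if they are identical, one download will almost certainly suffice.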

Getting Orbited to work with my Twisted app

I can't seem to get Orbited working with my Twisted app. I have a page, served by Twisted (say localhost:8000/page) which includes Orbited.js from the orbited server (localhost:8001/static/Orbited.js). I then have a TCP chat server example running on port 7777. I try to use Orbited.TCPSocket to connect to the chat server:
conn=new Orbited.TCPSocket();
conn.open("localhost", 7777);
conn.send("test\r\n"); //error: bad readyState
It works fine when Orbited is serving the page, but not when Twisted serves it from a different port. My orbited.cfg looks like this:
[listen]
http://:8001
[access]
* -> localhost:7777
And before (which worked) I had this in it as well:
[static]
test=index.html
Where index.html was another page grabbing localhost:8001/static/Orbited.js, and was accessed from localhost:8001/test.
How do I need to change my config file to work with requests from my twisted site on another port?
Update
I tried changing Orbited.settings.port to 8001 before trying to open the connection, but I got an error: "unsafe javascript attempt to access frame with url http://localhost:8000/page from frame with url http://localhost:8001/static/xsdrBridge.html#1. Domains, protocols and ports must match."
Hmm, also, I just looked at the Orbited wiki, and apparently setting Orbited.settings.port is exactly what I'm supposed to do, but I'm getting horrible errors.
You can call send() only after the connection is in the opened state.
Put a handler on .onopen() and do the .send() from there.
I have used Orbited in the past. It works in general, but there are several quirks to get it set up and running smoothly. The project itself seems to be in a state of flux (it appears to be moving to node.js). Both of these points lead me to suggest that, if you can avoid it, you not use Orbited.
Are there alternatives that are cleaner? I would say yes. You can pretty much emulate Orbited with WebSockets on stock Twisted. This will clearly work for newer browsers. What about older ones? Well, there are open-source projects that wrap WebSockets and fall back to Flash as a transport for older browsers. The setup works quite well, and actually feels cleaner than using a solution like Orbited.
If you check out http://github.com/rlotun/txWebSocket you'll find the current state of Twisted's WebSocket implementation, as well as an example of how to fall back to Flash on older browsers. Hopefully this will be useful enough to serve as a drop-in replacement for Orbited.
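For a rough idea of the 'WebSockets on stock Twisted' route, here is a minimal echo server sketched with the Autobahn library (my choice for illustration; the txWebSocket project linked above has its own, different API):

# A minimal WebSocket echo server on Twisted, using Autobahn for the protocol layer.
from autobahn.twisted.websocket import WebSocketServerProtocol, WebSocketServerFactory
from twisted.internet import reactor

class EchoServerProtocol(WebSocketServerProtocol):
    def onOpen(self):
        print("client connected")

    def onMessage(self, payload, isBinary):
        # Echo whatever the browser sends, much like the chat-server example on port 7777.
        self.sendMessage(payload, isBinary)

if __name__ == "__main__":
    factory = WebSocketServerFactory("ws://127.0.0.1:9000")
    factory.protocol = EchoServerProtocol
    reactor.listenTCP(9000, factory)
    reactor.run()

The browser side then connects with a plain WebSocket (new WebSocket("ws://localhost:9000")) instead of Orbited.TCPSocket, which also sidesteps the cross-port frame-access error from the question.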