ServerErrorException when crawling site - import.io

I am doing a full crawl of a site with the API and I am getting a lot of:
{
"errorType": "ServerErrorException",
"error": "ServerErrorException: Server error. (HTTP 500)"
}
Also, I am getting timeout responses and outright HTTP failures: StatusCode: 504, ReasonPhrase: 'GATEWAY_TIMEOUT'.
I am browsing the site I am crawling at the same time, and it still seems fast and responsive, with no slowdowns.
I removed the multithreading from my code and ran it synchronously, which stops the 500 errors, but it still produces lots of timeouts (and takes ages).
I am running fewer than 100 concurrent connections with my multithreading. Is that too much? I'd like to push it to 1000+. Do I need to add a delay between requests?

The error you are seeing indicates that the website you are trying to crawl is sending you 500s.
The 504 gateway timeouts you are getting are most likely because you are hitting import.io with too many requests.
I would slow down your crawling, perhaps by setting a delay between requests.
If you are crawling lots of pages, you will have to accept that the website you are crawling might not be able to handle a large number of incoming requests in a short amount of time.
I would advise starting slow and increasing the rate until you begin to see errors or timeouts, then backing off a bit; a sketch of this approach follows below.
Crawling an entire website can be a slow process.
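As a rough illustration of that advice, here is a minimal Python sketch of a throttled crawl with bounded concurrency and a fixed delay between request starts. The URL list, the requests library and the specific numbers are assumptions for illustration, not part of the original setup; tune the worker count and delay until the errors disappear.

    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import requests

    URLS = ["https://example.com/page1"]  # hypothetical list of pages to crawl
    MAX_WORKERS = 8                       # start small; raise until errors appear
    DELAY_SECONDS = 0.5                   # pause between request starts

    def fetch(url):
        # A 500 or 504 here means the target (or a proxy in front of it) is
        # overloaded; treat it as a signal to slow down, not to retry harder.
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text

    def crawl(urls):
        results = {}
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            futures = {}
            for url in urls:
                futures[pool.submit(fetch, url)] = url
                time.sleep(DELAY_SECONDS)  # spread the load on the target
            for future in as_completed(futures):
                url = futures[future]
                try:
                    results[url] = future.result()
                except requests.RequestException as exc:
                    results[url] = exc  # record the failure and move on
        return results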

Related

Correctly timing out XHR Requests in React-Native on Android

We're facing an issue with handling unexpected behaviour when performing XMLHttpRequests on Android devices using React-Native. We've seen the app become unable to complete API calls even though the device is connected to the internet perfectly well (the browser can access non-cached sites just fine). The only way to resolve this issue for our users has been to completely restart the app.
To understand the issue and its severity, we wrapped all our API calls in a timer function in production and sent reports to Sentry whenever a request took longer than 30 seconds to finish. We've been receiving these kinds of reports in the hundreds per day, with durations sometimes measured in hours or even days.
First, instead of using whatwg-fetch, we moved to axios so that we could manually set the timeout of each request, but this ended up not helping at all.
Second, we dove deeper into how React-Native actually implements timing out XHR requests on Android and found that it uses OkHttp3 under the hood. OkHttp has default values for its connect, read and write timeouts, and React-Native allows developers to change the value of the connect timeout here. However, OkHttp also has a method for setting a call timeout (covering everything from connect to reading the response body), but this defaults to 0 (no timeout) and React-Native doesn't allow users to change it. More on OkHttp timeouts here.
My question is whether this can be the cause of our worries and whether it should be reported to React-Native as a bug. Let's assume the following case:
app makes an API call to a resource
OkHttp is able to connect to the resource within the specified connect timeout (default 10s)
OkHttp is able to write the request body to the server within the write timeout (10s)
the server processes the request but for some reason fails to start sending a response to the client. I guess there could be many reasons for this, such as the server being disconnected, crashing, or simply losing the connection to the client without the client noticing. As there is no timeout here, OkHttp will happily wait for days on end for the server to start responding.
Am I missing something here or misunderstanding how timeouts work in OkHttp? Is there another, perhaps better solution than implementing the ability for developers to set callTimeout on API calls performed on Android? Also, isn't it a bit silly that developers can't set their own write and read timeouts? Can't this lead to unexpected behaviour when you want to send large amounts of data in either direction on a slow connection? 10s is quite long, but perhaps not long enough for all use cases.
Thanks for your help in advance!
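For what it's worth, the per-phase versus whole-call distinction the question describes is not specific to OkHttp. The sketch below illustrates it in Python (requests and concurrent.futures are stand-ins chosen purely for illustration): the (connect, read) timeouts bound each phase, while the outer deadline plays the role OkHttp's callTimeout would.

    import concurrent.futures

    import requests

    def get_with_deadline(url, connect_timeout=10, read_timeout=10, call_timeout=30):
        # requests' read timeout, like OkHttp's, restarts on every socket
        # read, so a server that dribbles bytes (or goes silent mid-response)
        # can stall the call far longer than either per-phase number.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        try:
            future = pool.submit(requests.get, url,
                                 timeout=(connect_timeout, read_timeout))
            # The overall deadline: raises concurrent.futures.TimeoutError
            # after call_timeout seconds, no matter which phase is stuck.
            return future.result(timeout=call_timeout)
        finally:
            # Don't block on a stuck request; the worker thread may linger
            # until the socket gives up, which a real client would handle
            # by closing the underlying connection.
            pool.shutdown(wait=False)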

Xdebug boosts site speed

I have a WAMP stack for development, and a lot of sites are running slowly, but I have a really big issue with PrestaShop, where loading times average one minute.
Although the content is loaded, the main request responds very slowly, and Chrome's waterfall shows that the delay is caused by content downloading, even though all assets are already downloaded (local storage) or cached.
I noticed that when I enable the Xdebug listener (in VS Code), the site responds as it should, i.e. within milliseconds.
Any idea what might be happening?

Apache - resources randomly hang (resulting in slow page loads)

HTTP requests for resources randomly - roughly 1-5% of the time (per resource, not per page load) - take extremely long to be delivered to the browser (~20 seconds), and not uncommonly even hang indefinitely (server details are listed at the bottom).
As a result, about every fifth request to any page appears to hang, because a JavaScript resource stalls inside the <head> tag.
The resources are CSS, JS and small image files, served directly by Apache (no scripting language involved), although page loads (involving PHP or Rails) also hang on rare occasions, with the same odds as any other resource (1-5% of the time), so this seems to be an issue with Apache's request handling.
Additional information:
I've checked the idle workers on server-status and, as expected, 98% of my workers are still idle. This may still be relevant, since the hangs affect static resources that are not served by FastCGI.
I am not the only one with this problem; someone else is experiencing it too, from a different IP address.
This happens with both Google Chrome and Firefox as HTTP clients.
I tried repeatedly force-refreshing the same JS file in a new tab. It eventually led to the same kind of hang.
Chrome's Timing tab reports 34ms waiting and 19.27s receiving for one of these hanging requests. Would that mean Apache already had the file contents ready to deliver, but had trouble delivering them in a sensible amount of time?
error.log doesn't show anything related to the hanging. There are some expected 404 and 500 entries, but those are genuine errors for nonexistent pages and PHP fatal errors.
I get some suspicious 206 Partial Content responses, mostly for static content, although the hanging happens more often than those partial contents. I mostly get 200 OK responses everywhere, and I can confirm indefinitely hanging resources that were reported as 200 OK in the Apache access.log.
I do have mod_passenger installed for Redmine. I don't know if that helps, but suspiciously this server has it installed, unlike all the other servers I've worked with. Then again, mod_passenger shouldn't affect static content, especially not inside a non-Ruby project folder, should it?
The server runs Apache 2.4 with the Event MPM on Ubuntu 13.10, hosted on DigitalOcean.
What may be causing these hangs, and how could I fix this?
I had the same problem, so after reading this thread I tried setting KeepAlive Off in my Apache config, which seems to have helped: all resources have the expected waiting times now.
Not a great "fix", but at least I am one step closer to figuring out the cause, and pages aren't taking 15s to fully load in the meantime.
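For reference, the change amounts to a single directive in the Apache configuration (in apache2.conf, httpd.conf or a vhost file, depending on the layout; the exact file location is an assumption here):

    # Disable persistent connections; each request gets its own connection.
    # This trades some connection-setup overhead for predictable delivery.
    KeepAlive Off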

mod_python does not process multiple requests from the same browser simultaneously for the same file

I have a page which can take a long time to process, but in the meantime, if the same page is accessed (from the same system), the second request is blocked until the first one finishes. Rather than this blocking behaviour, I would be happy if the second request failed instead of hanging. Is there a way to make the same file accessible at the same time?
I have found that the same problem is also present in PHP, but those replies were PHP-specific. Apache same origin request blocking and Why does apache not process multiple requests from the same browser simultaneously discuss the same problem with PHP.
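For anyone trying to reproduce the behaviour, a minimal mod_python handler along these lines (the module name and the sleep duration are illustrative assumptions) is enough: open the page in two tabs from the same browser and the second response only arrives after the first completes.

    # slowpage.py - hypothetical mod_python handler for reproducing the issue
    import time
    from mod_python import apache

    def handler(req):
        req.content_type = "text/plain"
        time.sleep(30)           # simulate a page that takes long to process
        req.write("done\n")
        return apache.OK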

Heroku, H12 and passthrough upload timeouts

Overview:
I have a photobooth that takes pictures and sends them to my web application.
My web application then stores the user's data and sends the picture to the user's Facebook profile/fan page.
My web app runs Ruby on Rails on the Heroku Cedar stack.
Flow:
My webapp receives the photo from the photobooth via a POST, like a web form.
The booth waits for the server's response. If the upload failed, it sends the picture again.
The webapp's response is only sent after the Facebook upload has completed.
Problems:
The webapp only replies to the photobooth after all processing has completed. This often takes more than 30 seconds, which causes Heroku to fire an H12 timeout.
Solutions?
Keep the request alive while the file is being uploaded (return some response data in order to prevent Heroku from firing an H12 - https://devcenter.heroku.com/articles/http-routing#timeouts). Is that possible? How would I achieve it in Ruby?
Switch to Unicorn + Nginx and activate the upload module (that way the dyno only receives the request after the upload has completed - Unicorn + Rails + Large Uploads). Is that really possible?
Use the rack-timeout gem. This would make a lot of my passthrough uploads fail, so the pictures would never be posted to Facebook, right?
Change the architecture: upload directly to S3, spin up a worker to check for new pictures in the S3 bucket, download them and send them to Facebook (a sketch of the presigned-upload idea follows this list). This might be the best option, but it takes a lot of time and effort; I might go for it in the long term, but I'm looking for a fast solution right now.
Other...
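To make the last option concrete, here is roughly what its first half (getting the booth's picture into S3 without tying up a dyno) could look like. Python and boto3 are used purely for illustration, since the app itself is Rails, and the bucket and key names are hypothetical:

    import boto3

    s3 = boto3.client("s3")

    def presigned_upload_target(key):
        # The photobooth POSTs the file straight to S3 using these fields,
        # so the web dyno never holds a slow upload open.
        return s3.generate_presigned_post(
            Bucket="photobooth-uploads",  # hypothetical bucket name
            Key=key,
            ExpiresIn=3600,               # form is valid for one hour
        )

    target = presigned_upload_target("booth-42/photo-0001.jpg")
    # target["url"] is the S3 endpoint; target["fields"] are the form fields
    # the booth must include in its multipart POST alongside the file.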
More info on this issue:
From Rapgenius:
http://rapgenius.com/Lemon-money-trees-rap-genius-response-to-heroku-lyrics
Ten days ago, spurred by a minor problem serving our compiled javascript, we started running a lot of ab benchmarks. We noticed that the numbers we were getting were consistently worse than the numbers reported to us by Heroku and their analytics partner New Relic. For a static copyright page, for instance, Heroku reported an average response time of 40ms; our tools said 6330ms. What could account for such a big difference?
“Requests are waiting in a queue at the dyno level,” a Heroku engineer told us, “then being served quickly (thus the Rails logs appear fast), but the overall time is slower because of the wait in the queue.”
Waiting in a queue at the dyno level? What?
From Heroku:
https://blog.heroku.com/archives/2013/2/16/routing_performance_update
Over the past couple of years Heroku customers have occasionally reported unexplained latency on Heroku. There are many causes of latency—some of them have nothing to do with Heroku—but until this week, we failed to see a common thread among these reports. We now know that our routing and load balancing mechanism on the Bamboo and Cedar stacks created latency issues for our Rails customers, which manifested themselves in several ways, including:
Unexplainable, high latencies for some requests
Mismatch between reported queuing and service time metrics and the observed reality
Discrepancies between documented and observed behaviors
For applications running on the Bamboo stack, the root cause of these issues is the nature of routing on the Bamboo stack coupled with gradual, horizontal expansion of the routing cluster. On the Cedar stack, the root cause is the fact that Cedar is optimized for concurrent request routing, while some frameworks, like Rails, are not concurrent in their default configurations.
We want Heroku to be the best place to build, deploy and scale web and mobile applications. In this case, we’ve fallen short of that promise. We failed to:
Properly document how routing works on the Bamboo stack
Understand the service degradation being experienced by our customers and take corrective action
Identify and correct confusing metrics reported from the routing layer and displayed by third party tools
Clearly communicate the product strategy for our routing service
Provide customers with an upgrade path from non-concurrent apps on Bamboo to concurrent Rails apps on Cedar
Deliver on the Heroku promise of letting you focus on developing apps while we worry about the infrastructure
We are immediately taking the following actions:
Improving our documentation so that it accurately reflects how our service works across both Bamboo and Cedar stacks
Removing incorrect and confusing metrics reported by Heroku or partner services like New Relic
Adding metrics that let customers determine queuing impact on application response times
Providing additional tools that developers can use to augment our latency and queuing metrics
Working to better support concurrent-request Rails apps on Cedar
The remainder of this blog post explains the technical details and history of our routing infrastructure, the intent behind the decisions we made along the way, the mistakes we made and what we think is the path forward.
1) You can use Unicorn as your app server and set the timeout after which the Unicorn master kills a worker to a number of seconds greater than your requests need. Here is an example setup with a timeout of 30 seconds.
Nginx does not work on Heroku, so that is not an option.
2) Changing the architecture would work well too, though I would choose an option where the upload traffic does not hit your own server, such as TransloadIt. They will help you get the pictures to S3, for example, and do custom transformations, cropping, etc., without you having to add dynos because your processes are blocked by file uploads.
Addition: 3) Yet another change of architecture would be to handle only the receiving part in the web action and hand the upload-to-Facebook task to a background worker (using, for example, Sidekiq); a sketch of that shape follows.
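The receive-then-enqueue shape of option 3, sketched in Python with Celery purely for illustration (the answer suggests Sidekiq, its Ruby counterpart); the task name, broker URL and upload helper are all assumptions:

    from celery import Celery

    app = Celery("photobooth", broker="redis://localhost:6379/0")  # hypothetical broker

    def post_to_facebook(photo_id):
        # Placeholder for the real Facebook upload call.
        print(f"uploading {photo_id} to Facebook")

    @app.task(bind=True, max_retries=3)
    def upload_photo(self, photo_id):
        # Runs on a worker, so the web request that stored the photo can
        # return immediately and never trips Heroku's 30-second H12 limit.
        try:
            post_to_facebook(photo_id)
        except Exception as exc:
            raise self.retry(exc=exc, countdown=10)  # back off and retry

    # In the web action: store the photo, enqueue the job, respond at once.
    # upload_photo.delay(photo_id)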