Heroku, H12 and passthrough upload timeouts - ruby-on-rails-3

Overview:
I have a photobooth that takes pictures and sends them to my web application.
Then my web application stores the user's data and sends the picture to the user's Facebook profile/fan page.
My web app runs Ruby on Rails on the Heroku Cedar stack.
Flow:
My web app receives the photo from the photobooth via a POST, like a web form.
The booth waits for the server response. If the upload has failed, it will send the picture again.
The response from the web app is only sent after the Facebook upload has completed.
Problems:
The web app only responds to the photobooth after all processing has completed. This often takes more than 30 seconds, which causes Heroku to fire an H12 (Request Timeout).
Solutions?
Keep the request alive while the file is being uploaded (return some response data in order to prevent Heroku from firing an H12 - https://devcenter.heroku.com/articles/http-routing#timeouts). Is this possible? How could it be achieved in Ruby? (See the sketch after this list.)
Change to Unicorn + Nginx and activate the upload module (this way the dyno only receives the request after the upload has completed - Unicorn + Rails + Large Uploads). Is this really possible?
Use the rack-timeout gem. This would cause a lot of my passthrough uploads to fail, so the pictures would never be posted to Facebook, right?
Change the architecture. Upload directly to S3, spin up a worker that checks for new pictures in the S3 bucket, downloads them and sends them to Facebook. This might be the best option, but it takes a lot of time and effort; I might go for it in the long term, but I'm looking for a fast solution right now.
Other...
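On solution 1: Heroku's router raises an H12 if no data has been sent within 30 seconds, but once some data has gone out it allows a rolling 55-second window between bytes, so a streamed response can keep a long request alive. A rough, untested sketch of how that could look in a Rails controller - `PhotosController`, `FacebookUploader`, the polling interval and the padding bytes are all assumptions, not code from the question:

```ruby
# Hypothetical controller for solution 1: stream padding bytes to the
# photobooth while the Facebook upload runs in a background thread, so the
# Heroku router keeps seeing traffic and does not raise H12.
class PhotosController < ApplicationController
  def create
    photo = params[:photo] # the file POSTed by the photobooth

    # Any object responding to #each can serve as a streaming response body.
    self.response_body = Enumerator.new do |out|
      worker = Thread.new { FacebookUploader.upload(photo) } # assumed helper
      out << " " until worker.join(5) # emit a padding byte every ~5 seconds
      out << "OK"                     # final acknowledgement the booth waits for
    end
  end
end
```

Whether the padding bytes actually reach the router unbuffered depends on the app server and any middleware in between, so this would need testing against the real stack.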

More info on this issue.
From Rapgenius:
http://rapgenius.com/Lemon-money-trees-rap-genius-response-to-heroku-lyrics
Ten days ago, spurred by a minor problem serving our compiled
javascript, we started running a lot of ab benchmarks. We noticed that
the numbers we were getting were consistently worse than the numbers
reported to us by Heroku and their analytics partner New Relic. For a
static copyright page, for instance, Heroku reported an average
response time of 40ms; our tools said 6330ms. What could account for
such a big difference?
“Requests are waiting in a queue at the dyno level,” a Heroku engineer
told us, “then being served quickly (thus the Rails logs appear fast),
but the overall time is slower because of the wait in the queue.”
Waiting in a queue at the dyno level? What?
From Heroku:
https://blog.heroku.com/archives/2013/2/16/routing_performance_update
Over the past couple of years Heroku customers have occasionally
reported unexplained latency on Heroku. There are many causes of
latency—some of them have nothing to do with Heroku—but until this
week, we failed to see a common thread among these reports. We now
know that our routing and load balancing mechanism on the Bamboo and
Cedar stacks created latency issues for our Rails customers, which
manifested themselves in several ways, including:
Unexplainable, high latencies for some requests
Mismatch between reported queuing and service time metrics and the observed reality
Discrepancies between documented and observed behaviors
For applications running on the Bamboo stack, the root cause of these
issues is the nature of routing on the Bamboo stack coupled with
gradual, horizontal expansion of the routing cluster. On the Cedar
stack, the root cause is the fact that Cedar is optimized for
concurrent request routing, while some frameworks, like Rails, are not
concurrent in their default configurations.
We want Heroku to be the best place to build, deploy and scale web and
mobile applications. In this case, we’ve fallen short of that promise.
We failed to:
Properly document how routing works on the Bamboo stack
Understand the service degradation being experienced by our customers and take corrective action
Identify and correct confusing metrics reported from the routing layer and displayed by third party tools
Clearly communicate the product strategy for our routing service
Provide customers with an upgrade path from non-concurrent apps on Bamboo to concurrent Rails apps on Cedar
Deliver on the Heroku promise of letting you focus on developing apps while we worry about the infrastructure
We are immediately taking the following actions:
Improving our documentation so that it accurately reflects how our service works across both Bamboo and Cedar stacks
Removing incorrect and confusing metrics reported by Heroku or partner services like New Relic
Adding metrics that let customers determine queuing impact on application response times
Providing additional tools that developers can use to augment our latency and queuing metrics
Working to better support concurrent-request Rails apps on Cedar
The remainder of this blog post explains the technical details and history of our routing infrastructure, the intent behind the decisions
we made along the way, the mistakes we made and what we think is the
path forward.

1) You can use Unicorn as your app server and set the timeout (after which the Unicorn master kills a worker) to a number of seconds greater than your requests need. Here is an example setup where you can see a timeout of 30 seconds.
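A minimal `config/unicorn.rb` along those lines (a sketch, not the referenced example; the values are illustrative only):

```ruby
# config/unicorn.rb
worker_processes 3   # keep modest on a single dyno
timeout 30           # seconds before the master kills a worker that is still busy

# Procfile:
#   web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb
```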
Nginx does not work on Heroku, so that is not an option.
2) Changing the architecture would work well too, though I would choose an option where the upload traffic does not block my own server, such as TransloadIt. They will help you get the pictures to S3, for example, and do custom transformations, cropping, etc. without you having to add additional dynos because your processes are blocked by file uploads.
Addition: 3) Another change of architecture would be to only handle the receiving part in one action, and hand the Facebook upload off to a background worker (using, for example, Sidekiq).
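A rough sketch of option 3, assuming a hypothetical `Photo` model and `FacebookUploader` helper (neither is named in the question): the controller stores the file somewhere durable, enqueues the job and responds immediately, so the photobooth gets its acknowledgement well under 30 seconds while the worker makes the slow Facebook call.

```ruby
# app/workers/facebook_upload_worker.rb -- hypothetical names throughout
class FacebookUploadWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5       # retry automatically if the Facebook call fails

  def perform(photo_id)
    photo = Photo.find(photo_id) # assumed model backed by S3, not the dyno disk
    FacebookUploader.upload(photo)
  end
end

# In the controller action receiving the POST from the photobooth:
#   photo = Photo.create!(image: params[:photo])
#   FacebookUploadWorker.perform_async(photo.id)
#   render text: "OK"            # immediate response; no more H12
```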

Related

Is PageSpeed Insights bypassing Google CDN cache?

We're using Google Cloud Platform to host a WordPress site:
Google Load Balancer with CDN -> Instance Group with single VM -> Nginx + WordPress
From step 1 (only VM with WordPress, no cache) to the last step (whole setup with Load Balancer and CDN) I could progressively see the improvement when testing locally from my browser and from GTmetrix. But PageSpeed Insights always showed little improvement.
Now we're proud of an impressive 98/97 score in GTmetrix (woah!), but PSI still shows we're pretty average, especially on mobile (a range of 45-55).
Problem: we're concerned about page ranking in Google, so we'd like to make PSI happy as well. Also... our client won't understand that we did make an improvement when PSI still shows that score.
I was digging and found a few weird things about PSI:
When we adjusted cache-control in nginx, it was correctly detected by the local browser and GTmetrix, but the 'Serve static assets with an efficient cache policy' section in PSI showed the old values for a few days.
The homepage has a background video hosted in 3 formats (mp4, webm, ogv). Clients are supposed to request only one of them (my browser and GTmetrix do), but PSI actually requests all 3 of them. I can see them in the 'Avoid enormous network payloads' section.
When a client requests our homepage, only the GET / request reaches our backend server (which is the expected behaviour) and the rest of the static assets are served from the CDN. But when testing from PSI, all requests reach our backend server. I can see them in the nginx access log.
So... those 3 points are making us get a worse score in PSI (point 1 suddenly fixed itself yesterday, days after we changed cache-control), but as far as I understand none of them should be happening. Is there something else I am missing?
Thanks in advance to those who can shed some light on this.
but PSI still shows we're pretty average, especially on mobile (a range of 45-55).
PSI defaults to showing you a mobile score on a simulated throttled connection. If you look at the desktop tab, that score is comparable to GTmetrix (which uses the same 'Lighthouse' engine under the hood, without throttling, so it gives similar results on desktop).
Sorry to tell you, but the site really is only average on mobile speed. Test it by going to the Performance tab in developer tools and enabling 'Network: Fast 3G' and 'CPU: 4x slowdown' in the throttling options.
Plus, the site seems really heavy on JavaScript computation for some reason; PSI simulates a slower CPU, so this is another factor. One script takes nearly 1 second to evaluate.
Serve static assets with an efficient cache policy in PSI showed the old values for a few days.
This is far more likely to be a config issue than a PSI issue. PSI always runs from an empty cache. Perhaps the rollout across all CDN nodes is slow for some reason and PSI was requesting from a different CDN node than you were?
Videos - but PSI actually requests the 3 of them. I can see them in Avoid enormous network payloads section.
Do not confuse what you see here with what Google actually used to run your test. This is calculated separately from all the assets it can download, not from the run data gathered by loading the page in a headless browser.
Also, these assets are the same for desktop and mobile, so it could be that for some reason it is using one asset for the mobile test and another for the desktop test.
Either way, it does indeed look like a bug, but it will not affect your score, as that is calculated in other ways.
all requests reach our backend server
Then this points to a similar problem as with point 1: are you sure your CDN has fully deployed? Either that, or you have some rule set up for a certain user agent, or a robots rule, that bypasses your CDN. Most likely a robots rule needs updating.
What can you do?
Double-check your config, deployment, etc. Ensure it has propagated to all CDN sites and that all of the DNS routing is working as expected.
Check that you don't have rules set for robots; I notice the site is 'noindex', so perhaps you have something set up while you are testing that is interfering.
Run an 'Audit' from Developer Tools in Google Chrome - this uses exactly the same engine that PSI uses. It may give you better results, as it uses your actual browser rather than a headless browser. Although for me this stops the videos loading at all, so something strange is happening there.

Site functionality diminished over VPN/company network

We currently experience diminished site functionality with one of our customers at our main production site. All subpages and resources seem to be affected as well.
The customer reports a completely broken experience for themselves with the site not working correctly at all, mostly due to assets not loading correctly.
We already started investigating and have found that - so far - nothing seems to be wrong with the site itself.
Quick rundown:
The production site has a Cloudflare layer and almost all of its assets are delivered either via cdnjs or Amazon CloudFront (behind Cloudflare) - all assets are reachable via HTTP as well
The site uses SSL and enforces it (the dynamic cert from Cloudflare)
We could secure a HAR for a request to one of our sites; the request times are extremely long. If you'd like to try, here is an online HAR viewer - be sure to uncheck validation of the file.
The customer uses Internet Explorer 8 and Chrome (39). While the site is not optimized for IE8, it should run fine in Chrome; in fact, it runs just fine in most browsers above IE9 for all of us.
Notes
We already ruled out:
Virtual delivery problems (there could be physical limitations we are not aware of)
General faultiness of our setup (We tried three different open VPNs to verify this)
Being on the customer's blacklist by accident (although we cannot be entirely sure of this)
SSL Server name indication (SNI) problems
(Potentially) a general problem with the customer's network; the customer does not report any problems with "the rest of the internet".
The customer will not give access to their VPN or disclose security details, so we cannot really test the situation ourselves. We suspect that the customer uses an internal proxy that might cause the problems described, but we are not sure.
Questions
My questions here are:
Is there any known problem caused by internal networking in conjunction with our setup that can cause this behaviour?
Are there potential problems on our end that we could have overlooked, or things that we do differently from other sites?
It seems the connection is being made (or routed) through a low-bandwidth, high-latency link (or a very congested one). Most of the DNS lookups and connects seem to be taking ~10s.
In the HAR you can see that it affects fonts.googleapis.com and cdnjs.cloudflare.com. https://www.google-analytics.com/analytics.js has no data captured. To me, the claim that the customer does not report any problems with "the rest of the internet" seems kind of dubious, seeing that in this HAR it hasn't been able to load the analytics JS and access to the usual CDNs is very slow.
My guesses (pick one or more):
they are testing on a machine different from the one that has no problems with "the rest of the internet"
this machine is very, very slow
it has some kind of content filtering, antivirus, or other web filter (perhaps with an SSL certificate installed in order to forge and inspect HTTPS traffic)
the access goes through a congested route, or a low-bandwidth, high-latency link
Two hotspots:
It sometimes happens that CDN edge points are inconsistent; I spent a lot of time understanding this issue. How? In a live session with the client, opening each loaded resource one by one, I realized there were differences between CDN access points (mine in eastern Europe, his in central Europe). The CDN host was one of the biggest US players in the world; anyhow, we fixed this by invalidating (deleting) all files from the CDN so that new/correct ones were loaded.
You need a CDN that supports serving files over HTTPS, then use that CDN for the SSL requests.

Heroku Deploy Images Reset

I have a Rails app deployed to Heroku. It has image uploads. After deploying to Heroku again, I am unable to see the old images that were uploaded.
Does Heroku reset the images folder when the app is re-deployed? Please tell me the reason behind this.
Background
Heroku uses an 'ephemeral filesystem', which from an application architecture point of view should be considered read-only - it is discarded as soon as the dyno is stopped or restarted (which, among other occasions, occurs after each push), and it is also not shared between multiple dynos.
This is fine for executing code from, as most application data is stored in a database that is independent of the dynos. However, for file uploads this presents a problem, so uploads should not be stored directly on the dyno filesystem.
Solution
The simplest solution is to use something like Amazon S3 for file upload storage, and, if you use a gem like Paperclip, this is natively supported within the gem. There is a great overview article in the Heroku Dev Center about using S3 and Heroku (https://devcenter.heroku.com/articles/s3), which leads into an article contributed by Thoughtbot (the developers of Paperclip) on implementation specifics within a Rails app (https://devcenter.heroku.com/articles/paperclip-s3).
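A minimal sketch of what those articles describe, with Paperclip backed by S3; the model name, bucket and environment variable names here are placeholders:

```ruby
# Requires the paperclip and aws-sdk gems in the Gemfile.
# app/models/photo.rb -- hypothetical model
class Photo < ActiveRecord::Base
  has_attached_file :image,
    storage: :s3,
    s3_credentials: {
      bucket:            ENV["S3_BUCKET_NAME"],
      access_key_id:     ENV["AWS_ACCESS_KEY_ID"],
      secret_access_key: ENV["AWS_SECRET_ACCESS_KEY"]
    }

  # Recent Paperclip versions require a content-type validation (or an
  # explicit opt-out), so restrict uploads to images here.
  validates_attachment_content_type :image, content_type: /\Aimage\/.*\z/
end
```

The credentials would then be set with `heroku config:set` rather than committed to the repository.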

Is there a way to spawn Unicorn processes and have them load fully before they are accessed to process requests?

I am running Unicorn on Heroku. I notice that when we scale web dynos up, the new web dynos are accessed right after they are spawned. In the logs we are getting:
"Request timeout" errors with the 30-second limit (i.e. service=30000ms)
As soon as a dyno starts, traffic will be sent to it. If Unicorn is still spinning up child processes as these requests arrive, then they will likely time out, especially if your app takes a while to boot.
Is there a way to spawn Unicorn child processes without them receiving requests until the app is fully loaded?
This is actually something else and something I've encountered before.
Essentially, when you scale you're seeing your dyno get created and your slug being deployed onto it. At this point your slug is initialised. The routing mesh will start sending requests through the moment this has happened, as it sees the dyno as up and ready to rock.
However, (and this is where I found the problem), it takes time for your application to spin up to respond to the request (Unicorn is up, Rails is still initialising) and you get the 30 second timeout.
The only way I could fix this was to get my app starting up in less than 30 seconds consistently, which I finally achieved by updating to a current version of Rails. I've also found some improvement by updating to Ruby 1.9.3.
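The answer's fix was simply a faster boot. A related Unicorn setting worth checking (an assumption on my part, not something the answer mentions) is `preload_app true`, which loads the Rails app once in the master before forking, so each worker is ready to serve as soon as it is forked instead of loading Rails on its own:

```ruby
# config/unicorn.rb -- sketch; values are examples
worker_processes 3
preload_app true   # boot Rails in the master, then fork already-loaded workers

before_fork do |server, worker|
  # Connections opened in the master must not be shared across forks.
  defined?(ActiveRecord::Base) && ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  defined?(ActiveRecord::Base) && ActiveRecord::Base.establish_connection
end
```

This does not change when Heroku starts routing to the dyno, so a boot time under the 30-second window is still the decisive factor, as the answer says.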

Load testing comet based application

We have developed a comet-based chat application (using a streaming approach). The application was developed in ASP.NET 3.5 SP1.
The browser has two connections with the server: one for posting and another for receiving chat messages. While load testing with JMeter or VSTS, the posting portion is recorded and load tested, but not the receiving portion. Can someone please suggest a load testing tool that can address this issue?
I've come across the same problem; the top runner for me at the moment is browsermob.com. It has a complete API that allows you to create test scenarios that can "watch and wait" on pages, recording every HTTP request made as though they were visiting through a real browser. It gets kind of expensive if you need to test with more than 25 concurrent (browser) users, but it seems very reasonably priced from what I have seen so far.
It'd be really interesting to see what tools others who are somewhat more technically adept are using.
http://docs.codehaus.org/display/JETTY/Stress+Testing+Cometd