What's the "average" number of requests per second for a production web application?

I have no frame of reference in terms of what's considered "fast"; I'd always wondered this but have never found a straight answer...

OpenStreetMap seems to have 10-20 per second
Wikipedia seems to be 30000 to 70000 per second spread over 300 servers (100 to 200 requests per second per machine, most of which are caches)
Geograph is getting 7000 images per week (1 upload per 95 seconds)

Not sure anyone is still interested, but this information was posted about Twitter (and here too):
The Stats
Over 350,000 users. The actual numbers are, as always, very super super top secret.
600 requests per second.
Average 200-300 connections per second. Spiking to 800 connections per second.
MySQL handled 2,400 requests per second.
180 Rails instances. Uses Mongrel as the "web" server.
1 MySQL Server (one big 8 core box) and 1 slave. Slave is read only for statistics and reporting.
30+ processes for handling odd jobs.
8 Sun X4100s.
Process a request in 200 milliseconds in Rails.
Average time spent in the database is 50-100 milliseconds.
Over 16 GB of memcached.

When I go to the control panel of my webhost, open up phpMyAdmin, and click on "Show MySQL runtime information", I get:
This MySQL server has been running for 53 days, 15 hours, 28 minutes and 53 seconds. It started up on Oct 24, 2008 at 04:03 AM.
Query statistics: Since its startup, 3,444,378,344 queries have been sent to the server.
Total 3,444 M
per hour 2.68 M
per minute 44.59 k
per second 743.13
That's an average of 743 MySQL queries every single second for the past 53 days!
I don't know about you, but to me that's fast! Very fast!!
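For anyone who wants to sanity-check those averages, here is a small illustrative sketch in Python that reproduces them from the uptime and total query count quoted above:

    # Reproducing the phpMyAdmin averages from the reported uptime and query count.
    uptime_seconds = 53 * 86400 + 15 * 3600 + 28 * 60 + 53   # 53 days, 15:28:53
    total_queries = 3_444_378_344

    per_second = total_queries / uptime_seconds
    print(f"{per_second:.2f} queries/second")            # ~743.13
    print(f"{per_second * 60 / 1000:.2f} k per minute")  # ~44.59 k
    print(f"{per_second * 3600 / 1e6:.2f} M per hour")   # ~2.68 M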

Personally, I like both analyses done every time: requests/second and average time/request, and I love seeing the max request time on top of that as well. It is easy to flip: if you have 61 requests/second, you can just flip it to 1000 ms / 61 requests.
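As a small illustration of that flip (noting that 1000 ms divided by throughput only equals the average service time when requests are handled one at a time on a single worker):

    # Illustrative only: converting between requests/second and ms/request.
    # This simple flip assumes a single worker handling requests serially;
    # with N workers in parallel, the same per-request time supports roughly
    # N times the throughput.
    requests_per_second = 61
    avg_ms_per_request = 1000 / requests_per_second
    print(f"{avg_ms_per_request:.1f} ms/request")   # ~16.4 ms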
To answer your question, we have been doing a huge load test ourselves and found that it varies across the various Amazon hardware we use (the best value was the 32-bit medium-CPU instance when it came down to $ / event / second); our rates ranged from 29 requests/second/node up to 150 requests/second/node.
Better hardware of course gives better results, but not the best ROI. Anyway, this post was great, as I was looking for some parallels to see whether my numbers were in the ballpark, and I shared mine as well in case someone else is looking. Mine are purely from loading the system as high as I can go.
NOTE: thanks to the requests/second analysis (not ms/request), we found a major Linux issue that we are trying to resolve, where Linux (we tested a server written in C and one in Java) freezes all calls into the socket libraries when under too much load, which seems very odd. The full post can be found here:
http://ubuntuforums.org/showthread.php?p=11202389
We are still trying to resolve that, as fixing it gives us a huge performance boost: our test goes from 2 minutes 42 seconds down to 1 minute 35 seconds when this is fixed, so we see a 33% performance improvement. Not to mention, the worse the DoS attack is, the longer these pauses are, so that all CPUs drop to zero and stop processing. In my opinion server processing should continue in the face of a DoS, but for some reason it freezes up every once in a while during the DoS, sometimes for up to 30 seconds!!!
ADDITION: We found out it was actually a JDK race-condition bug. It was hard to isolate on big clusters, but when we ran 1 server and 1 data node, with 10 of those pairs, we could reproduce it every time and just look at the server/data node it occurred on. Switching the JDK to an earlier release fixed the issue. We were on jdk1.6.0_26, I believe.

That is a very open apples-to-oranges type of question.
You are asking
1. the average request load for a production application
2. what is considered fast
These don't necessarily relate.
Your average # of requests per second is determined by (a rough worked sketch follows this list):
a. the number of simultaneous users
b. the average number of page requests they make per second
c. the number of additional requests (e.g. AJAX calls, etc.)
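As a very rough back-of-the-envelope model of how those factors combine (all numbers below are invented purely for illustration):

    # Hypothetical numbers, purely to illustrate how the factors above combine.
    simultaneous_users = 500
    page_requests_per_user_per_sec = 0.1   # one page every ~10 seconds per active user
    extra_requests_per_page = 4            # e.g. AJAX calls, images, API hits

    avg_requests_per_second = (
        simultaneous_users
        * page_requests_per_user_per_sec
        * (1 + extra_requests_per_page)
    )
    print(avg_requests_per_second)   # 250.0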
As to what is considered fast... do you mean how many requests a site can take? Or whether a piece of hardware is considered fast if it can process xyz # of requests per second?

Note that hit-rate graphs will be sinusoidal, with 'peak hours' maybe 2x or 3x the rate that you get while users are sleeping. (This can be useful when you're scheduling the daily batch-processing jobs on your servers.)
You can see the effect even on 'international' (multilingual, localised) sites like Wikipedia.

Less than 2 seconds per user, usually - i.e. users who see slower responses than this think the system is slow.
Now you tell me how many users you have connected.

You can search "slashdot effect analysis" for graphs of what you would see if some aspect of the site suddenly became popular in the news, e.g. this graph on wiki.
Web-applications that survive tend to be the ones which can generate static pages instead of putting every request through a processing language.
There was an excellent video (I think it might have been on ted.com? It might have been by the Flickr web team? Does someone know the link?) with ideas on how to scale websites beyond a single server, e.g. how to allocate connections amongst the mix of read-only and read-write servers to get the best effect for various types of users.
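To make the "generate static pages" idea concrete, here is a minimal, purely illustrative sketch (the cache directory, TTL and page-rendering function are all made up): render a page once, write the HTML to disk, and serve the file on later requests so most hits never touch the application code.

    # Minimal illustrative sketch: cache rendered pages on disk so repeat
    # requests skip the expensive dynamic work. Paths and timings are invented.
    import os, time

    CACHE_DIR = "/tmp/mysite-cache"   # hypothetical cache location
    CACHE_TTL = 60                    # seconds before a page is re-rendered

    def render_page(slug):
        # Stand-in for the expensive dynamic rendering (DB queries, templates, ...).
        return f"<html><body>Page {slug} rendered at {time.time():.0f}</body></html>"

    def get_page(slug):
        path = os.path.join(CACHE_DIR, f"{slug}.html")
        if os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_TTL:
            with open(path) as f:      # cache hit: no dynamic work at all
                return f.read()
        html = render_page(slug)       # cache miss: render once and store
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "w") as f:
            f.write(html)
        return html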

I have a customer that uses our software on commercial web app servers. The software runs on 40 servers. The software is a 10-year-old Java API.
4000 TPS.

Related

Server load is minimal but website responds poorly

I have a VPS at Hetzner. The server is located in Germany.
It has 256 GB RAM and 6 CPUs (12 threads).
I have a file which, since yesterday, is requested about 30 times per second. The file runs 2 SELECT, 2 UPDATE, and 2 INSERT queries, so I assumed (not sure how this works) that from this file the server gets about 180 queries per second. Right after these requests started, all the websites on the server started loading poorly. I made this file run just one SELECT query and then die; this didn't help. In WHM the load is about 0.02.
I've checked the error logs and there is no max_user_connections or any other error there.
I have enabled the slow query log and checked the log file; there is nothing (I've tested it with SELECT SLEEP(10), and that query was logged).
These are the visit statistics; please pay attention to May 30th:
Bandwidth stats for the last 24 hours:
There are many errors like this in ssl_log (different IPs, of course):
188.121.206.150 - - [30/May/2018:19:50:03 +0200] "-" 408 - "-" "-"
I've been searching the web a lot and couldn't find any solution. Could anyone at least tell me what I should monitor, or where? I have full access to everything inside the server. Any help is appreciated.
UPDATE 1
I have a subdomain, banners.analyticson.com (access allowed for now), where I have all the images and HTML5 files that are requested.
Take one image, for example: https://banners.analyticson.com/img/suy8G1S6RU.jpg
It takes too long to load. As far as I can tell, this subdomain has some issue.
The script that I mentioned earlier (with 6 queries) just tries to return one of those banners to the user, so the result of that script is one banner from banners.analyticson.com.
UPDATE 2
I've checked my script; it is fine. It takes less than 1 second to complete.
I've also checked the top command and there is a result. I'm not sure whether the %MEM value is fine.
You're going to have to narrow the problem down...
There are multiple potential issues.
First thing to eliminate would be the performance of your new script on a development laptop - I assume you're using PHP, so use the profiling tools to work out what is going on. If it's a database query, you'll see which one by looking at the profiler.
If your PHP script and database queries are fine, the next thing to look at: it sounds like you've hit some bottleneck resource on your infrastructure. In these cases, scripts that run fine as a single request start queueing for the bottleneck resource, and every new request adds to the queue until the whole server starts to crawl. This can be a bit of a puzzle - start with top and keep digging.
Next, I'd look at the configuration of Apache to make sure everything is squeaky clean - Apache used to have a default of doing a reverse DNS lookup for every request, which slows the server down rather impressively in production. You may also want to look at your SSL configuration - the error you report is linked to a load balancer issue.
If it's not as simple as memory, CPU etc., you're into more esoteric issues. You may need to ramp up a load testing rig so you can experiment without affecting the live site - typically, I do this on a machine as similar to live as possible, using Apache JMeter to generate load, and find the "inflection point". Typically, you see response times increase linearly with the number of concurrent requests, until you hit the bottleneck resource, at which point the response time increases rapidly. As a simple example, if you have 10 database connections available, response time should increase linearly up to 10 concurrent connections, and then become much larger from 11 up.
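To make the inflection-point idea concrete, here is a toy model (all numbers invented purely for illustration) of how response time stays near the base service time until concurrency exceeds the bottleneck resource, then climbs as requests queue:

    # Toy model of the inflection point described above: with 10 database
    # connections, response time stays near the base service time up to 10
    # concurrent requests, then queueing delay grows with every extra request.
    # All numbers are invented purely for illustration.
    BASE_MS = 50          # time one request needs on an uncontended server
    DB_CONNECTIONS = 10   # the bottleneck resource

    def estimated_response_ms(concurrent_requests):
        queued = max(0, concurrent_requests - DB_CONNECTIONS)
        # assume each queued request waits roughly half a service time per request ahead of it
        return BASE_MS + queued * BASE_MS / 2

    for n in (1, 5, 10, 11, 15, 20, 40):
        print(n, estimated_response_ms(n))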
Knowing where the inflection point is and being able to recreate it allows you to use PHP profiling tools under load. This is a lot of work.
UPDATE
You're using php-cgi; this is easily the most inefficient way of running PHP scripts. Your server is barely breaking a sweat - CPU and memory basically idle. Here's a comparison for how to run PHP; consider changing to mod_php.

How to increase Google Sheets v4 API quota limitations

The new Google Sheets API v4 currently has an unlimited read/write quota per day (which is fantastic), but it is restricted to 500 reads/writes per account per 100 seconds, and 100 reads/writes per key per 100 seconds (or, I have found, multiple keys coming from the same IP). This is probably plenty for most use cases, but I have an edge case that requires bringing a frequently updated Google Sheet with 70 tabs down to a Node.js server that distributes the data to users' clients every ~30-60 seconds or so (the users are data annotators who are student research assistants). This wasn't so bad early in the project when there were only 20-30 tabs, but now that the data is large, the server is blowing through the 100-call quota and returning errors every 10-15 minutes.
The problem is such that:
Frequent data updates: Only data on 1-5 of the 70 tabs is likely to be updated on any given minute, but which tabs have new data is random (so I am pulling down the whole sheet of 70 = 70 reads).
Update interval: The need for updates happens randomly at about 30 second to 5-minute intervals (so some within the quota, some about 3-5x the quota).
Throttling: I have tried throttling the update to be within the 100 calls/100 seconds (my previous solution), but this introduces large usability issues, significantly decreasing usability/productivity/work quality.
Quota increase: The sheets API does not currently appear to include a way to pay to increase the quota. It does allow filling out a form to request an increase in the quota, but I'm not sure what the mean response time is on this (my request is only a few days old).
Multiple service accounts: I have tried using multiple service accounts to get the full 500 requests/100 seconds quota (rather than the per-user quota), since this is a server, but Google Sheets looks to rate-limit to 100 requests/100 seconds from a given IP.
Alternatives: I have considered that this project may have just grown beyond the size that Sheets is easily able to handle, but there do not appear to be any good, usable, self-hosted, collaborative spreadsheets with easy-to-interface-to APIs out there.
Are there settings/methods suggested to achieve the full 500 calls/100 seconds for a server?
You can request a quota update in Google Cloud Platform and it will be increased to 2,500 per account and 500 per user (regarding your #4).
You can use spreadsheets.get to read the entire spreadsheet in a single call, rather than 1 call per tab. Alternatively, you can use spreadsheets.values.batchGet to read multiple different ranges in a single call, if all you need are the values.
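Purely as a sketch of what the batchGet approach might look like (shown here with the Google API client for Python, even though the server in the question is Node.js; the spreadsheet ID, ranges, and credentials file are placeholders):

    # Sketch using google-api-python-client; IDs, ranges and the credentials
    # path are placeholders, not values from the question.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        "service-account.json",
        scopes=["https://www.googleapis.com/auth/spreadsheets.readonly"],
    )
    service = build("sheets", "v4", credentials=creds)

    # One API call covering many tabs, instead of one call per tab.
    result = service.spreadsheets().values().batchGet(
        spreadsheetId="YOUR_SPREADSHEET_ID",
        ranges=["Tab1!A1:Z1000", "Tab2!A1:Z1000", "Tab3!A1:Z1000"],
    ).execute()

    for value_range in result.get("valueRanges", []):
        print(value_range["range"], len(value_range.get("values", [])), "rows")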
The Drive API offers "push notifications", so you can get notified when changes occur and react to those, instead of polling for them. The latency of the notifications is a little on the slow side, but it gets the job done.

What caused a 45-minute gap in Apache logs

We have a gap in our apache logs for approximately 45 minutes, after which follows an unusually high burst of log activity.
Normally, we get a few hundred requests per hour in this time, early in the morning. But our traffic was normal, then the logs went quiet for 45 minutes (wherein people reported an inability to log in). After that, 4000 requests were written to the logs within a few minutes.
Is this consistent with the assumption that one or more runaway processes blocked the execution of other processes? Because Apache writes the log entry only after the request completes, nothing got logged until the logjam was broken.
Is that a fair conclusion?
Yes, your assumption is indeed reasonable - we had a situation like this a couple of years back at the place where I work. If all Apache threads are blocked by long-running operations, this kind of behaviour can appear until at least one thread is freed up again. It doesn't need to be a "runaway" process per se, either - a heavy load, like one incurred by a DoS attack (or maybe just intensive site traffic), may also produce this picture.
I admit to only having a very basic familiarity with Apache administration, but have you checked whether your setup has sufficient resources to handle the 'usual' traffic on the affected site?
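If you want to quantify gaps like this after the fact, here is a small illustrative sketch that scans an access log for them; the log path is a placeholder, and it assumes the standard Apache timestamp format (e.g. [30/May/2018:19:50:03 +0200]).

    # Sketch: report gaps longer than N minutes in an Apache access log.
    import re
    from datetime import datetime, timedelta

    GAP = timedelta(minutes=10)
    ts_re = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")

    prev = None
    with open("/var/log/apache2/access.log") as log:   # placeholder path
        for line in log:
            m = ts_re.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
            if prev and ts - prev > GAP:
                print(f"gap of {ts - prev} between {prev} and {ts}")
            prev = ts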

Finding an application's scalability point using JMeter

I am trying to find an application's scalability point using JMeter. I define the scalability point as "the minimum number of concurrent users from which any increase no longer increases the throughput per second".
I am using the following technique: schedule my load test to run for an hour, starting a new thread sending SOAP/XML-RPC requests every 30 seconds. I do this by setting my number of threads to 120 and my ramp-up period to 3600 seconds.
Then I look at the TOTAL row's Throughput in my Summary Report listener. A new thread is added every 30 seconds, and the total throughput number rises until it plateaus - at about 123 requests per second once 80 of the threads are active, in my case. It then slowly drops to 120 requests per second as the last 20 threads are added. I then conclude that my application's scalability point is 123 requests per second with 80 active users.
My question: is this a valid way to find an application's scalability point, or is there a different technique that I should be trying?
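As a sketch of the plateau-finding step described above, here is one way to pick the scalability point programmatically from (active threads, throughput) pairs read off the Summary Report; the sample data and the 2% tolerance are invented for illustration.

    # Report the first point after which throughput stops improving by more
    # than a small tolerance. Sample data is made up for illustration.
    samples = [(10, 25), (20, 50), (40, 90), (60, 115), (80, 123), (100, 122), (120, 120)]
    TOLERANCE = 0.02   # ignore changes smaller than 2%

    def scalability_point(samples):
        best_threads, best_tput = samples[0]
        for threads, tput in samples[1:]:
            if tput > best_tput * (1 + TOLERANCE):
                best_threads, best_tput = threads, tput
        return best_threads, best_tput

    print(scalability_point(samples))   # (80, 123) with this made-up data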
From a technical perspective what you're doing does answer your question regarding one specific user scenario, though I think you might be missing the big picture.
First of all, keep in mind that the actual HTTP requests you're sending and the ramp-up times can often impact what you call the scalability point. Are your requests hitting a cache? Are they not random enough? Are they too random? Do they represent real-world requests? Is 30 seconds going to give you the same results as 20 seconds or 10 seconds?
From my personal experience it's MUCH easier and more intuitive to look at graphs when trying to analyze app performance. It's not just a question of raw numbers, but also of looking at trends and rates of change.
For example, here is a benchmark of the ghost.org blogging platform using JMeter, with an interactive JMeter results graph:
http://blazemeter.com/blog/ghost-performance-benchmark

Affordability of Amazon Simple Storage Service (S3)

I have a website that attracts about 30,000 visitors per month. It has a lot of photos and PDF files which eat up a good deal of bandwidth. It's hosted by site5.com, which offers unlimited bandwidth & storage for ~$5 per month. According to site5's statistics, my site has about 20 GB of downloads per day, but I've seen it as high as 116 GB. Uploads range from 5-15 GB daily. (Though, I don't really upload things every day, so I don't know where they get those numbers from.)
In anticipation of growing my site even more, perhaps by hosting videos, high-res photos, etc., I was looking into other storage options, even though site5 has been pretty good. Specifically, amazon.com's Simple Storage Service (S3) looks pretty good and is supposed to be a "highly scalable, reliable, fast, inexpensive data storage infrastructure."
Using Amazon's Simple Monthly Calculator, I multiplied out my worst-case scenario numbers:
Storage: 2 GB
Data Transfer-in: 15 GB/day * 31 days = 465 GB/month
Data Transfer-out: 116 GB/day * 31 days = 3596 GB/month
With those numbers alone, the calculator estimates my monthly bill to be a whopping $658.27!!! That's insane! Is anyone here using S3? Are your bills outrageous?
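For what it's worth, that estimate is easy to reproduce by hand. A rough sketch, using the approximate S3 prices from around the time this was asked (about $0.15/GB-month for storage, $0.10/GB for transfer in, and $0.17/GB for first-tier transfer out; actual prices are tiered and have changed since):

    # Rough reconstruction of the calculator's estimate; per-GB prices below
    # are approximations of the S3 pricing at the time, not current figures.
    storage_gb, price_storage = 2, 0.15
    transfer_in_gb, price_in = 465, 0.10
    transfer_out_gb, price_out = 3596, 0.17

    monthly = (storage_gb * price_storage
               + transfer_in_gb * price_in
               + transfer_out_gb * price_out)
    print(f"${monthly:.2f}")   # roughly $658; request fees add a little more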
Wow, are you sure about those stats? I suppose that's possible, but you're lucky that your host hasn't given you the boot. Leasing a dedicated server will typically get you somewhere in the neighborhood of 1.5TB/month for at least 20 times what you are paying now. If you're doing 3.5TB for $5 per month and your host isn't complaining, don't even think about moving.
(note: most unlimited plans are indeed limited by the company's terms of service, which usually allows them to give anyone the boot for using "too many" resources.)
I would try to find some way to verify your stats before you continue.
$5/3500GB is $0.0014 per gig. That's insane.
3.6TB/month is kind of a lot. Just as a sanity check, my internet connection seems to deliver somewhere around 100kB/sec reception if I'm lucky (I assume the send/receive rates are about the same). At that bandwidth it would take my computer 417 days of sending continuously to deliver that amount of data.
10c per gigabyte seems pretty reasonable to me. NearlyFreeSpeech.net charges $1/gigabyte delivered but that decreases to 20c/gigabyte at high volumes. Mosso charges 22c/GB delivered.
If you are paying $5 for unlimited transfer and storage I would stick with your current provider as they are offering something that no-one else is going to be able to offer you for that price.
S3 is also a content distribution network; it has certain uptime guarantees and data storage guarantees that your host probably does not. When Amazon says they can deliver your 116 GB a day, they really mean it, whereas your host is probably overselling their capacity and hoping people don't really use their unlimited transfer.
You are getting a steal in terms of what you use. Good luck finding that elsewhere.