Does Internet latency limit my clients to 120k synchronous transactions per hour? - udp

Customer machines send UDP requests to our server. The server processes each request and sends a response. The logic of the transactions requires the client to wait for a response before sending a new request.
Even if all processing by client and server machines is instantaneous, it appears our customers still need about 30ms on average just to send/receive a round trip transaction over the Internet. (That's traveling about 5,580 miles at light speed.)
Does that mean a given customer on average can't do more than about 120,000 synchronous transactions per hour?
1 transaction = .030 seconds minimum
120k transactions = 1 hour

Impact of Latency
Since you must serialize your requests, latency will limit your transaction rate.
However, the speed of light calculation is the theoretical best-case transit time. In real life, there are routers along the way that add latency.
Be sure to measure the actual ping time at various points in the day, over several days, to get real latency numbers.
Also, the client and server code will not run in zero time, and the processing time might well be at least as much as the latency (depending on what you are doing), so it may not be realistic to assume the processing time approaches zero.
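To put numbers on this, here is a minimal sketch of the arithmetic in Python. The function name and the choice of integer milliseconds are mine, not from the question; it just divides one hour by the length of a single request/response cycle.

```python
MS_PER_HOUR = 3_600_000

def max_transactions_per_hour(rtt_ms: int, processing_ms: int = 0) -> int:
    """Upper bound on strictly serialized request/response cycles per hour."""
    cycle_ms = rtt_ms + processing_ms  # the client waits for both before sending again
    if cycle_ms <= 0:
        raise ValueError("cycle time must be positive")
    return MS_PER_HOUR // cycle_ms

print(max_transactions_per_hour(30))      # 120000 (the figure from the question)
print(max_transactions_per_hour(30, 30))  # 60000 if processing takes as long as the round trip
```

So the 120k figure is a ceiling for one serialized client, and any real processing time lowers it further.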
Overcoming Latency
These days, there are a number of fairly inexpensive ways to put your servers (or at least a layer of your architecture) closer to your customers. For example, you could consider using a service such as AWS to place processing resources in geographical proximity to your customers. You can then either give e.g. West Coast customers a different URL to use than East Coast customers, or you can use geographic load balancing so that everyone can use the same URL (your load balancing service routes traffic to the best server, worldwide). I have successfully used UltraDNS for that purpose in the past.

Related

Will cloudflare workers being used as an API be able to handle 10,000 Requests per second?

I have to create an API which occasionally handles a high number of requests, such as bursts of 10,000 requests per second for a few seconds. I will just add the request payload to KV and return success.
I have two questions:
Will I have any significant cold starts (more than 500ms)?
Will Cloudflare Workers be able to handle that occasional burst of requests?
No, you should not see cold starts over 500ms. Even if your app runs an expensive computation at startup, Workers actually applies a 200ms hard limit on such computation, which is detected at deploy time. So if your Worker successfully deploys, it should never take more than 200ms to start (unless Cloudflare is suffering from some sort of internal problem). Most apps take more like 10ms to "cold start", and this can usually be parallelized with the TLS handshake such that no cold start is observed at all.
As long as you are on a paid plan, Workers will have no problem scaling to 10,000 requests per second, even if it is sudden. Workers has customers that do orders of magnitude more traffic than that.
(I am the tech lead for Cloudflare Workers.)

API Traffic Shaping/Throttling Strategies For Tenant Isolation

I'll start my question by providing some context about what we're doing and the problems we're facing.
We are currently building a SaaS (hosted on Amazon AWS) that consists of several microservices that sit behind an API gateway (we're using Kong).
The gateway handles authentication (through consumers with API keys) and exposes the APIs of these microservices that I mentioned, all of which are stateless (there are no sessions, cookies or similar).
Each service is deployed using ECS services (one or more docker containers per service running on one or more EC2 machines) and load balanced using the Amazon Application Load Balancer (ALB).
All tenants (clients) share the same environment, that is, the very same machines and resources. Given our business model, we expect to have few but "big" tenants (at first).
Most of the requests to these services translate into heavy resource usage (CPU mainly) for the duration of the request. The time needed to serve one request is in the range of 2-10 seconds (and not ms like traditional "web-like" applications). This means we serve relatively few requests per minute, where each one of them takes a while to process (background or batch processing is not an option).
Right now, we don't have a strategy to limit or throttle the number of requests that a tenant can make in a given period of time. Taking into account the last two considerations above, it's easy to see this is a problem, since it's almost trivial for a tenant to make more requests than we can handle, causing a degradation in the quality of service (even for other tenants, because of the shared resources approach).
We're thinking of strategies to limit/throttle or in general prepare the system to "isolate" tenants, so one tenant cannot degrade the performance for others by making more requests than we can handle:
Rate limiting: Define a maximum requests/m that a tenant can make. If more requests arrive, drop them. Kong even has a plugin for it. Sadly, we use a "pay-per-request" pricing model and the business does not allow us to use this strategy, because we want to serve as many requests as possible in order to get paid for them. If excess requests take more time for a tenant, that's fine.
Tenant isolation: Create an isolated environment for each tenant. This one has been discarded too, as it makes maintenance harder and leads to lower resource usage and higher costs.
Auto-scaling: Bring up more machines to absorb bursts. In our experience, Amazon ECS is not very fast at doing this and by the time these new machines are ready it's possibly too late.
Request "throttling": Using algorithms like Leaky Bucket or Token Bucket at the API gateway level to ensure that requests hit the services at a rate we know we can handle.
Right now, we're inclined to take option 4. We want to implement the request throttling (traffic shaping) in such a way that all requests made within a previously agreed rate with the tenant (enforced by contract) would be passed along to the services without delay. Since we know in advance how many requests per minute each tenant is gonna be making (estimated at least) we can size our infrastructure accordingly (plus a safety margin).
If a burst arrives, the excess requests would be queued (up to a limit) and then released at a fixed rate (using the leaky bucket or similar algorithm). This would ensure that a tenant can not impact the performance of other tenants, since requests will hit the services at a predefined rate. Ideally, the allowed request rate would be "dynamic" in such a way that a tenant can use some of the "requests per minute" of other tenants that are not using them (within safety limits). I believe this is called the "Dynamic Rate Leaky Bucket" algorithm. The goal is to maximize resource usage.
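For illustration only, the shaping described above could be sketched roughly like this in Python. This is a single-process token bucket with a bounded queue; the class name, parameters and the explicit `now` argument are my own assumptions, and a real gateway plugin would need shared state (e.g. Redis) across instances, which this sketch ignores.

```python
from collections import deque

class TenantShaper:
    """Token bucket plus bounded queue for one tenant (illustrative only)."""

    def __init__(self, rate_per_s: float, burst: int, max_queue: int):
        self.rate = rate_per_s        # agreed steady rate
        self.capacity = burst         # how many requests may pass back to back
        self.tokens = float(burst)
        self.max_queue = max_queue    # excess requests held before we start rejecting
        self.queue = deque()
        self.last = 0.0

    def _refill(self, now: float) -> None:
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def submit(self, request, now: float) -> str:
        """Return 'forward', 'queued' or 'rejected' for an incoming request."""
        self._refill(now)
        if not self.queue and self.tokens >= 1:
            self.tokens -= 1
            return "forward"
        if len(self.queue) < self.max_queue:
            self.queue.append(request)
            return "queued"
        return "rejected"

    def drain(self, now: float) -> list:
        """Release queued requests for which tokens have become available."""
        self._refill(now)
        released = []
        while self.queue and self.tokens >= 1:
            self.tokens -= 1
            released.append(self.queue.popleft())
        return released

shaper = TenantShaper(rate_per_s=1, burst=2, max_queue=2)
print([shaper.submit(i, now=0.0) for i in range(5)])
# ['forward', 'forward', 'queued', 'queued', 'rejected']
print(shaper.drain(now=1.0))  # [2]
```

Requests within the agreed rate pass immediately, a burst is queued up to a limit, and the queue is released at the refill rate. The "dynamic" part (borrowing unused allowance from other tenants) is not covered here.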
My questions are:
Is the proposed strategy a viable one? Do you know of any other viable strategies for this use case?
Is there an open-source, commercial or SaaS service that can provide these traffic shaping capabilities? As far as I know, Kong and Tyk do not support anything like this, so... is there any other API gateway that does?
In case Kong does not support this, how hard is it to implement something like what I've described as a plugin? We have to take into account that it would need some shared state (using Redis, for example), as we're using multiple Kong instances (for load balancing and high availability).
Thank you very much,
Mikel.
Managing a request queue on the gateway side is indeed a tricky thing, and probably the main reason it is not implemented in these gateways is that it is really hard to do right. You need to handle all the distributed-system cases, and in addition it is hard to make it "safe", because "slow" clients quickly consume machine resources.
Such a pattern is usually offloaded to client libraries: when a client hits the rate limit status code, it uses something like an exponential backoff technique to retry requests. That is way easier to scale and implement.
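As a rough sketch of the client-side approach, something like the following Python would do. I am assuming HTTP 429 as the rate limit status code and using "full jitter" delays; `send`, the default values and the function names are placeholders, not part of any gateway's client library.

```python
import random
import time

RATE_LIMITED = 429  # assumed status code; use whatever your gateway returns

def backoff_delays(count, base_s=0.5, cap_s=30.0, rng=None):
    """Yield `count` delays, each uniform in [0, min(cap_s, base_s * 2**n)]."""
    rng = rng or random.Random()
    for n in range(count):
        yield rng.uniform(0.0, min(cap_s, base_s * 2 ** n))

def call_with_retry(send, max_attempts=5, sleep=time.sleep, rng=None):
    """Call send() until it returns something other than RATE_LIMITED."""
    delays = backoff_delays(max_attempts - 1, rng=rng)
    while True:
        status = send()
        if status != RATE_LIMITED:
            return status
        delay = next(delays, None)
        if delay is None:
            return status  # out of attempts, give the 429 back to the caller
        sleep(delay)
```

The `sleep` and `rng` parameters are only there so the behaviour can be tested without actually waiting.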
Can't say for Kong, but Tyk, in this case, provides two basic numbers you can control: quota, the maximum number of requests a client can make in a given period of time, and rate limits, which act as safety protection. You can set a rate limit 1) per "policy", e.g. for a group of consumers (for example, if you have multiple tiers of your service with different allowed usage/rate limits), 2) per individual key, 3) globally for an API (works together with key rate limits). So, for example, you can set some moderate client rate limits and cap the total with the global API setting.
If you want a fully dynamic scheme that re-calculates limits based on cluster load, it should be possible. You will need to write and run a scheduler somewhere; from time to time it will perform the re-calculation based on current total usage (which Tyk calculates for you, and which you can get from Redis) and will talk to the Tyk API, iterating through all keys (or policies) and dynamically updating their rate limits.
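The re-calculation step itself could look something like this Python sketch. It is purely hypothetical: it does not call the Tyk API or read from Redis, and the lending rule (hand out a fixed percentage of unused allowance, split evenly among tenants at their limit) is just one assumption about what "dynamic" could mean.

```python
def rebalance(contracted, usage, lend_percent=50):
    """Return new per-minute limits, lending part of the unused allowance.

    contracted: tenant -> agreed requests per minute
    usage:      tenant -> requests observed in the last minute
    """
    spare = sum(max(0, limit - usage.get(t, 0)) for t, limit in contracted.items())
    busy = [t for t, limit in contracted.items() if usage.get(t, 0) >= limit]
    limits = dict(contracted)
    if busy:
        # Only lend a fraction of the spare capacity, as a safety margin.
        bonus = (spare * lend_percent // 100) // len(busy)
        for t in busy:
            limits[t] += bonus
    return limits

print(rebalance({"a": 100, "b": 100, "c": 100}, {"a": 100, "b": 20, "c": 60}))
# {'a': 160, 'b': 100, 'c': 100}
```

The scheduler would run this periodically and then push the resulting numbers to each key or policy.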
Hope it makes sense :)

Should round-robin technique lbmethod=byrequests in apache2 be enough for most scenarios?

In my system there are different kinds of requests with widely varying memory and time costs.
That is, if there are request types R1, R2, ..., R100, the amount of RAM required to process a request and the response time vary a lot between types, even by a factor of 10 to 100.
Is round-robin the right method for such a scenario, or will round-robin eventually cover most scenarios in this situation?
If round-robin is not the right choice, are there more customization options available in Apache?
Normally I would say that once you're dealing with a sufficiently large number of requests, plus factoring in stickiness, it's just not worth worrying about because it will tend to even out.
But if some requests are one or two orders of magnitude more expensive for the backends, you might consider "bybusyness" or "bytraffic" if those expensive requests happen to take longer to process or generate large responses. Under lower loads, this will give you a better chance of not having one backend get unlucky and handle too many expensive requests in parallel (stickiness aside).
Should round-robin be the right method for such scenario or does
round-robin will eventually cover up most scenarios in this situation?
We did a 36-hour run (duration stress test) and a 4-hour run (peak stress test) with full-volume data for 50 concurrent users, 100 concurrent users and finally 350 concurrent users. There wasn't any difference in CPU and RAM utilization among the different VMs across which we were distributing load.
We did multiple such runs, and the differences in CPU and RAM utilization were not significant.
So, I think it is fair to conclude that round-robin does cover a lot of scenarios, including this one, and is the right method to use for load distribution in this scenario.
The round-robin algorithm distributes requests among nodes in the order that requests are received. Here is a simple example. Let’s say you have 3 nodes: node-A, node-B, and node-C.
• First request is sent to node-A.
• Second request is sent to node-B.
• Third request is sent to node-C.
The load balancer continues sending requests to servers in this order. This makes it sound as though traffic would be distributed equally among the nodes. But that isn’t true.
Read more here to know in detail: What is the problem with Round robin algorithm?
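A small simulation makes the point concrete. This is a toy model in Python with made-up costs, and the "least loaded" policy here is an idealized stand-in that knows each request's cost; it is not how Apache's bybusyness or bytraffic methods work internally.

```python
from itertools import cycle

def round_robin(costs, nodes):
    """Assign each request to the next node in turn; return total cost per node."""
    load = {n: 0 for n in nodes}
    for node, cost in zip(cycle(nodes), costs):
        load[node] += cost
    return load

def least_loaded(costs, nodes):
    """Assign each request to the node with the least accumulated cost so far."""
    load = {n: 0 for n in nodes}
    for cost in costs:
        target = min(nodes, key=lambda n: load[n])
        load[target] += cost
    return load

nodes = ["node-A", "node-B", "node-C"]
costs = [100, 1, 1] * 6  # every third request is 100x more expensive

print(round_robin(costs, nodes))   # {'node-A': 600, 'node-B': 6, 'node-C': 6}
print(least_loaded(costs, nodes))
```

With this (deliberately unlucky) request pattern, round-robin gives every node the same number of requests but sends all the expensive ones to node-A. With a large, random mix of requests the imbalance tends to even out, which is consistent with the test results reported above.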

Apigee SpikeArrest Sync Across MessageProcessors (MPs)

Our organisation is currently migrating to Apigee.
I currently have a problem very similar to this one, but due to the fact that I am a Stack Overflow newcomer and have low reputation I couldn't comment on it: Apigee - SpikeArrest behavior
So, in our organisation we have 6 MessageProcessors (MP) and I assume they are working in a strictly round-robin manner.
Please see this config (It is applied to the TARGET ENDPOINT of the ApiProxy):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<SpikeArrest async="false" continueOnError="false" enabled="true" name="spikearrest-1">
<DisplayName>SpikeArrest-1</DisplayName>
<FaultRules/>
<Properties/>
<Identifier ref="request.header.some-header-name"/>
<MessageWeight ref="request.header.weight"/>
<Rate>3pm</Rate>
</SpikeArrest>
I have a rate of 3pm, which means 1 hit each 20sec, calculated according to ApigeeDoc1.
The problem is that instead of 1 successful hit every 20sec I get 6 successful ones within 20sec and then the SpikeArrest error, meaning each MP was hit once in a round-robin manner.
This means I get 6 hits per 20sec to my API backend instead of the desired 1 hit per 20sec.
Is there any way to sync the spikearrests across the MPs?
ConcurrentRatelimit doesn't seem to help.
SpikeArrest has no ability to be distributed across message processors. It is generally used for stopping large bursts of traffic, not controlling traffic at the levels you are suggesting (3 calls per minute). You generally put it in the Proxy Request Preflow and abort if the traffic is too high.
The closest you can get to 3 per minute using SpikeArrest with your round robin message processors is 1 per minute, which would result in 6 calls per minute. You can only specify SpikeArrests as "n per second" or "n per minute", which does get converted to "1 per 1/n second" or "1 per 1/n minute" as you mentioned above.
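The arithmetic behind those numbers, as a small Python sketch (the function names are mine; this only models the "each MP keeps its own counter" behaviour described above, not Apigee itself):

```python
def spike_arrest_interval_s(rate_per_minute: int) -> float:
    """A rate of n per minute is enforced as one call per 60/n seconds."""
    return 60 / rate_per_minute

def effective_rate_per_minute(rate_per_minute: int, message_processors: int) -> int:
    """Each MP enforces the rate independently, so the backend sees the sum."""
    return rate_per_minute * message_processors

print(spike_arrest_interval_s(3))       # 20.0 seconds between allowed calls on one MP
print(effective_rate_per_minute(3, 6))  # 18 calls per minute reach the backend
print(effective_rate_per_minute(1, 6))  # 6 calls per minute, the lowest you can configure
```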
Do you really only support one call every 20 seconds on your backend? If you are trying to support one call every 20 seconds per user or app, then I suggest you try to accomplish this using the Quota policy. Quotas can share a counter across all message processors. You could also use quotas with all traffic (instead of per user or per app) by specifying a quota identifier that is a constant. You could allow 3 per minute, but they could all come in at the same time during that minute.
If you are just trying to protect against overtaxing your backend, the ConcurrentRateLimit policy is often used.
The last solution is to implement some custom code.
Update to address further questions:
Restating:
6 message processors handled round robin
want 4 apps to each be allowed 5 calls per second
want the rest of the apps to share 10 calls per second
To get the kind of granularity you are looking for, you'll need to use quotas. Unfortunately you can't set a quota to have a "per second" value on a distributed quota (distributed quota shares the count among message processors rather than having each message processor have its own counter). The best you can do is per minute, which in your case would be 300 calls per minute. Otherwise you can use a non-distributed quota (dividing the quota between the 6 message processors), but the issue you'll have there is that calls that land on some MPs will be rejected while others will be accepted, which can be confusing to your developers.
For distributed quotas you'd set the 300 calls per minute in an API Product (see the docs), and assign that product to your four apps. Then, in your code, if that product is not assigned for the current API call's app, you'd use a quota that is hardcoded to 10 per second (600 per minute) and use a constant identifier rather than the client_id, so that all other traffic uses that quota.
Quotas don't keep you from submitting all your requests nearly simultaneously, and I'm assuming your backend can't handle 1200+ requests all at the same time. You'll need to smooth the traffic using a SpikeArrest policy. You'll want to allow the maximum traffic through the SpikeArrest that your backend can handle. This will help protect against traffic spikes, but you'll probably get some traffic rejected that would normally be allowed by the Quota. The SpikeArrest policy should be checked before the Quota, so that rejected traffic is not counted against the app's quota.
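To illustrate why the order of the two checks matters, here is a toy Python model. It is not Apigee code: `SpikeArrest` and `MinuteQuota` are simplified stand-ins I wrote for this answer (single process, fixed one-minute window), and the numbers are only examples.

```python
class SpikeArrest:
    """Allow at most one call per interval_s (smooths traffic, keeps no count)."""
    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.last_allowed = None

    def allow(self, now):
        if self.last_allowed is not None and now - self.last_allowed < self.interval_s:
            return False
        self.last_allowed = now
        return True

class MinuteQuota:
    """Fixed one-minute window counter per identifier."""
    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def allow(self, identifier, now):
        key = (identifier, int(now // 60))
        if self.counts.get(key, 0) >= self.limit:
            return False
        self.counts[key] = self.counts.get(key, 0) + 1
        return True

def handle(spike, quota, app, now):
    # Spike arrest first: calls it rejects never reach the quota counter.
    if not spike.allow(now):
        return "rejected: spike"
    if not quota.allow(app, now):
        return "rejected: quota"
    return "ok"

spike = SpikeArrest(interval_s=0.1)  # roughly "10 per second"
quota = MinuteQuota(limit=300)
print([handle(spike, quota, "app-1", t) for t in (0.0, 0.01, 0.2)])
# ['ok', 'rejected: spike', 'ok']
print(quota.counts[("app-1", 0)])    # 2, the spike-rejected call was not counted
```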
As you can probably see, configuring for situations like yours is more of an art than a science. My suggestion would be to do significant performance/load testing, and tune it until you find the correct values. If you can figure out how to use non-distributed quotas to get acceptable performance and predictability, that will let you work with per second numbers instead of per minute numbers, which will probably make massive spikes less likely.
Good luck!
Unlike Quota limits, Spike Arrest cannot be synchronized across MPs.
But, as you're setting them at a per-minute level, you could use the Quota policy instead: set it to Distributed and Synchronized and it will coordinate across MPs.
Keep in mind there will always be some latency on the synchronization across machines so it will never be a completely precise number.

Difference between server hit rate and throughput in JMeter reports

I'm using JMeter to load test a web application. I also use the "jMeter Plugins" plugin to get more graphs.
My question is:
I can't understand the difference between the server hit rate (Server hit per second graph) and the throughput (Transactions per Second). The two graphs are very close but they differ a bit in some places.
I also wonder whether "transaction" here means request, right?
Thx a lot :)
Both hits per second and throughput describe workload. The hits are the requests sent from the injector over time, while the throughput is the load that the system is able to handle. Both graphs should look the same as long as the application hasn't reached its breaking point; after the breaking point the hits will continue increasing, which triggers an increase in response times.
A test in which you can see the difference is the peak test (you increase load until you crash the application): when the load exceeds the application's throughput, the two plots will diverge.
In the example plot, the blue curve diverges from the green one after 650 RPS; response times then skyrocket and requests start failing.
If we let the test continue running, the injector will run out of threads and the hits curve will be the same as the throughput again. Configuring the injectors pool thread.
The area between the two curves represents active requests: requests that the injector has sent and that are still waiting to be processed.
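A toy model of this in Python, assuming a server that can complete at most a fixed number of requests per second (the 650 figure is taken from the example above; the hit counts and the function name are made up):

```python
def simulate(hits_per_s, capacity_per_s):
    """Return (throughput, waiting) per second for a capacity-limited server."""
    waiting = 0
    throughput, backlog = [], []
    for hits in hits_per_s:
        pending = waiting + hits             # new hits plus requests still in flight
        done = min(pending, capacity_per_s)  # the server cannot exceed its capacity
        waiting = pending - done
        throughput.append(done)
        backlog.append(waiting)
    return throughput, backlog

hits = [200, 400, 600, 800, 1000]
tput, waiting = simulate(hits, capacity_per_s=650)
print(tput)     # [200, 400, 600, 650, 650]
print(waiting)  # [0, 0, 0, 150, 500]
```

Below the breaking point, hits and throughput are identical; past it, throughput flattens while the number of waiting requests keeps growing.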
The hits plot is measured in RPS; it counts requests, not transactions.
The same plot can be generated using JMeter's composite graph.
The server hit rate graph shows how many hits the server handles each second.
The throughput rate is the number of transactions produced over time during a test. It is also expressed as the amount of capacity that a website or application can handle.
http://www.joecolantonio.com/2011/07/05/performance-testing-what-is-throughput/