Algorithm to split traffic across variable number of servers - load-balancing

I have a .NET Core service AAA that retrieves a bit of data from another Core service BBB. BBB has an in-memory cache (ConcurrentDictionary) and is deployed to 10 boxes. The total size of the data to be cached is around 100GB.
AAA will have a list of servers that run BBB and I was thinking of doing something along the lines of ServerId = DataItemId % 10, so that each of the boxes gets to serve and cache 10% of the total dataset. What I can't figure is what to do when one of the BBB boxes goes down (e.g. due to Windows Update).
Is there some algorith to split the traffic, that will allow servers to go down and up, but still redirect most of requests to the server that has revelant data cashed?

Azure Load Balancer does not interact with the application payload. It makes decisions based on a hashing function which includes the 5-tuple of the TCP/UDP transport IP packet. There is a difference between Basic and Standard LB here in that Standard LB uses an improved hashing function. There's no strict guarantee for share of requests, but the number of flows arriving over time should be relatively even. A health probe can be used to detect whether a backend instance is health or sick. This controls whether new flows arrive on a backend instance. https://aka.ms/lbprobes has details.

Related

Custom load balance logic in HAProxy

I am working on a video-conferencing application. We have a pool of servers where rooms are created, a room can have n number of users. I was exploring HAProxy and several other load balancers, but couldn't find any solution for what I was looking for.
My requirements are as follows
A room should be created on the server with the lowest load at the time of creation.
All users of that room should join on the same server.
I have tried url_param balance logic with consistent hashing, but it is distributing load randomly. Is it even possible with modern L7 load balancers or do I need to write some custom logic (in some load balancer) or a separate application for this scenario?
Is there any way of balancing load based on connections or CPU usage while maintaining the session stickiness?
balance documentation says you can choose algorithm like leastconn and that this only applies when no persistence information is available, or when a connection is redispatched to another server.
So the second part of the answer are stick tables. Read docs about stick match and other stick keywords
So with stick table it looks like this:
backend foo
mode http
balance leastconn
stick store-request src
stick-table type ip size 200k expire 30m
server s1 192.168.1.1:8080
server s2 192.168.1.2:8080
There are more examples in the docs.
What you need to figure out (or tell us) is how can we know the room client wants based on the request and make such stick table and rules. If it's in URL or http header then it is perfectly doable in haproxy.
If leastconn is not good enough, then there is an option of dynamically adjusting servers' weights with haproxy's unix socket CLI and use roundrobin algorithm. Also agent options can be configured for servers to dynamically set servers' weights.

why do we need consistent hashing when round robin can distribute the traffic evenly

When the load balancer can use round robin algorithm to distribute the incoming request evenly to the nodes why do we need to use the consistent hashing to distribute the load? What are the best scenario to use consistent hashing and RR to distribute the load?
From this blog,
With traditional “modulo hashing”, you simply consider the request
hash as a very large number. If you take that number modulo the number
of available servers, you get the index of the server to use. It’s
simple, and it works well as long as the list of servers is stable.
But when servers are added or removed, a problem arises: the majority
of requests will hash to a different server than they did before. If
you have nine servers and you add a tenth, only one-tenth of requests
will (by luck) hash to the same server as they did before. Consistent hashing can achieve well-distributed uniformity.
Then
there’s consistent hashing. Consistent hashing uses a more elaborate
scheme, where each server is assigned multiple hash values based on
its name or ID, and each request is assigned to the server with the
“nearest” hash value. The benefit of this added complexity is that
when a server is added or removed, most requests will map to the same
server that they did before. So if you have nine servers and add a
tenth, about 1/10 of requests will have hashes that fall near the
newly-added server’s hashes, and the other 9/10 will have the same
nearest server that they did before. Much better! So consistent
hashing lets us add and remove servers without completely disturbing
the set of cached items that each server holds.
Similarly, The round-robin algorithm is used to the scenario that a list of servers is stable and LB traffic is at random. The consistent hashing is used to the scenario that the backend servers need to scale out or scale in and most requests will map to the same server that they did before. Consistent hashing can achieve well-distributed uniformity.
Let's say we want to maintain user sessions on servers. So, we would want all requests from a user to go to the same server. Using round-robin won't be of help here as it blindly forwards requests in circularly fashion among the available servers.
To achieve 1:1 mapping between a user and a server, we need to use hashing based load balancers. Consistent hashing works on this idea and it also elegantly handles cases when we want to add or remove servers.
References: Check out the below Gaurav Sen's videos for further explanation.
https://www.youtube.com/watch?v=K0Ta65OqQkY
https://www.youtube.com/watch?v=zaRkONvyGr8
For completeness, I want to point out one other important feature of Consistent Hashing that hasn't yet been mentioned: DOS mitigation.
If a load-balancer is getting spammed with requests, (either from too many customers, an attack, or a haywire local service) a round-robin approach will apply the request spam evenly across all upstream services. Even spread out, this load might be too much for each service to handle. So what happens? Your loadbalancer, in trying to be helpful, has brought down your entire system.
If you use a modulus or consistent hashing approach, then only a small subset of services will be DOS'd by the barrage.
Being able to "limit the blast radius" in this manner is a critical feature of production systems
Consistent hashing is fits well for stateful systems(where context of the previous request is required in the current requests), so in stateful systems if previous and current request lands in different servers than for current request context is lost and system won't be able to fulfil the request, so in consistent hashing with the use of hashing we can route of requests to same server for that particular user, while in round robin we cannot achieve this, round robin is good for stateless systems.

API Traffic Shaping/Throttling Strategies For Tenant Isolation

I'll start my question by providing some context about what we're doing and the problems we're facing.
We are currently building a SaaS (hosted on Amazon AWS) that consists of several microservices that sit behind an API gateway (we're using Kong).
The gateway handles authentication (through consumers with API keys) and exposes the APIs of these microservices that I mentioned, all of which are stateless (there are no sessions, cookies or similar).
Each service is deployed using ECS services (one or more docker containers per service running on one or more EC2 machines) and load balanced using the Amazon Application Load Balancer (ALB).
All tenants (clients) share the same environment, that is, the very same machines and resources. Given our business model, we expect to have few but "big" tenants (at first).
Most of the requests to these services translate in heavy resource usage (CPU mainly) for the duration of the request. The time needed to serve one request is in the range of 2-10 seconds (and not ms like traditional "web-like" applications). This means we serve relatively few requests per minute where each one of them take a while to process (background or batch processing is not an option).
Right now, we don't have a strategy to limit or throttle the amount of requests that a tenant can make on a given period of time. Taken into account the last two considerations from above, it's easy to see this is a problem, since it's almost trivial for a tenant to make more requests than we can handle, causing a degradation on the quality of service (even for other tenants because of the shared resources approach).
We're thinking of strategies to limit/throttle or in general prepare the system to "isolate" tenants, so one tenant can not degrade the performance for others by making more requests than we can handle:
Rate limiting: Define a maximum requests/m that a tenant can make. If more requests arrive, drop them. Kong even has a plugin for it. Sadly, we use a "pay-per-request" pricing model and business do not allow us to use this strategy because we want to serve as many requests as possible in order to get paid for them. If excess requests take more time for a tenant that's fine.
Tenant isolation: Create an isolated environment for each tenant. This one has been discarded too, as it makes maintenance harder and leads to lower resource usage and higher costs.
Auto-scaling: Bring up more machines to absorb bursts. In our experience, Amazon ECS is not very fast at doing this and by the time these new machines are ready it's possibly too late.
Request "throttling": Using algorithms like Leaky Bucket or Token Bucket at the API gateway level to ensure that requests hit the services at a rate we know we can handle.
Right now, we're inclined to take option 4. We want to implement the request throttling (traffic shaping) in such a way that all requests made within a previously agreed rate with the tenant (enforced by contract) would be passed along to the services without delay. Since we know in advance how many requests per minute each tenant is gonna be making (estimated at least) we can size our infrastructure accordingly (plus a safety margin).
If a burst arrives, the excess requests would be queued (up to a limit) and then released at a fixed rate (using the leaky bucket or similar algorithm). This would ensure that a tenant can not impact the performance of other tenants, since requests will hit the services at a predefined rate. Ideally, the allowed request rate would be "dynamic" in such a way that a tenant can use some of the "requests per minute" of other tenants that are not using them (within safety limits). I believe this is called the "Dynamic Rate Leaky Bucket" algorithm. The goal is to maximize resource usage.
My questions are:
Is the proposed strategy a viable one? Do you know of any other viable strategies for this use case?
Is there an open-source, commercial or SaaS service that can provide this traffic shaping capabilities? As far as I know Kong or Tyk do not support anything like this, so... Is there any other API gateway that does?
In case Kong does not support this, How hard it is to implement something like what I've described as a plugin? We have to take into account that it would need some shared state (using Redis for example) as we're using multiple Kong instances (for load balancing and high availability).
Thank you very much,
Mikel.
Managing request queue on Gateway side is indeed tricky thing, and probably the main reason why it is not implemented in this Gateways, is that it is really hard to do right. You need to handle all the distributed system cases, and in addition, it hard makes it "safe", because "slow" clients quickly consume machine resources.
Such pattern usually offloaded to client libraries, so when client hits rate limit status code, it uses smth like exponential backoff technique to retry requests. It is way easier to scale and implement.
Can't say for Kong, but Tyk, in this case, provides two basic numbers you can control, quota - maximum number of requests client can make in given period of time, and rate limits - safety protection. You can set rate limit 1) per "policy", e.g for group of consumers (for example if you have multiple tiers of your service, with different allowed usage/rate limits), 2) per individual key 3) Globally for API (works together with key rate limits). So for example, you can set some moderate client rate limits, and cap total limit with global API setting.
If you want fully dynamic scheme, and re-calculate limits based on cluster load, it should be possible. You will need to write and run this scheduler somewhere, from time to time it will perform re-calculation, based on current total usage (which Tyk calculate for you, and you get it from Redis) and will talk with Tyk API, by iterating through all keys (or policies) and dynamically updating their rate limits.
Hope it make sense :)

Asp.net Web Api 2 OWIN self hosted, high CPU, what is the average compute load I should expect?

I built a set of 3 APIs using Asp.net Web Api 2, self-hosted using OWIN in an Azure Cloud service Worker role.
The Worker Role is exposed to the internet with a custom domain.
Each API has a single controller, doing some normal dictionary operations, table calls and Azure Redis calls. 1 request on two just do a single Redis call and return in around 10ms.
The average call when going through all the API code is 150ms.
The answer is a JSON object of around 10k in size.
Everything works fine, but I have a problem.
I'm having around 25 peaks connections per second and no more than 2 Million requests per day, and I can barely get the CPU below 40% with 3 Azure D2_V2 (2 cores , 8GB RAM) instances running.
I'm in trouble because I'm spending almost 1.5k$ a month for an Api serving just 15-25 calls per second.
If I remove or scale down an instance, the CPU go up to 55-60%, Redis and Azure table calls slows down a lot and an API request takes 3- 5 seconds to get back.
I tried everything at the best of my abilities, I thought could be some bots or DDos attack, so I installed the nuget package WebApi Throttle, set a maximum of 1 requests per IP per second.
Nothing changed.
I reviewed all the code configuration to cut unoptimized parts, but 1 call in 2 just call redis and get back and the others are very clean and simple C# returning in 150ms with 2 azure table calls + 1 azure queue set.
The API Controllers are async, everything is async.
I enabled Profiling, the CPU is high in the main azure process, and the Redis Get method, nothing else relevant here, no bottlenecks.
I enabled Diagnostics, no errors.
I installed Application Insights, and here I see something strange that cannot tell if it is normal or not.
I see this IP: 13.88.23.0 doing thousands requests to the APIs with querystring values generally used in normal requests. A lot of them fail.
This IP is Azure itself, why is calling the Api?
Some of these requests are stuck for minutes, I can see that from the Application insights panel, it's always the same IP.
Then I see the remaining logs, dependencies etc,nothing relevant.
Apart from that , what could I do to understand the problem?
I can't think is normal to consume so many CPU resources for an API with just 2 Million calls a day, or not?
Is there an additional profiling technique I could use?
Based on your experience, how many API calls should I expect to serve with 3 dual core 8GB RAM servers in normal conditions? (assuming there is something wrong in my configuration)
Thanks
UPDATE
I separated the API in two cloud services, 2 in one and 1 in another.
I still see in Application Insights calls from another IP belonging to Microsoft.
I suppose this is normal, probably Application insights cannot detect the real IP of the client since is a Worker Role and show the internal one.
But the problem of having to use so much power for so few calls remain.
Any thoughts on that?

ZMQ device queue does not load balance properly

I know that ZMQ offers all of the flexibility to do your own load-balancing. However I would expect the out-of-the-box broker, about 4 lines of code using the line
zmq_device (ZMQ_QUEUE, frontend, backend);
to load balance quite well as the documentation says it does load balance.
ZMQ_QUEUE creates a shared queue that collects requests from a set of clients, and distributes these fairly among a set of services. Requests are fair-queued from frontend connections and load-balanced between backend connections. Replies automatically return to the client that made the original request.
I have an army of back-end services and yet find that often my front-end clients have to wait several seconds for something that takes < 1/10 of a second in a 1:1 setting (there are same # of client and service machines). I suspect that ZMQ is not load-balancing properly out of the box - it's sending too many requests to the same service even though it doesn't have bandwidth, etc.
I think this is partly because the services are multithreaded in a way that lets them take up to 10 concurrent requests yet it slows down greatly at near the 10th request even though it can still accept them. Random distribution would be ideal. Is there an out-of-the-box way to do this or can it be done in a few lines of code, or do I have to write my own broker from scratch?
Fwiw issue was the workers were taking on work when they didn't have room for it, issue was not in ZMQ layer per se.