How much do you end up paying for each metric created by Container Insights for EKS for short-lived pods?

I just enabled Container Insights in my EKS cluster following Amazon's Quick Start Setup for Container Insights on Amazon EKS.
I see that the number of metrics is quickly increasing: CloudWatch > Metrics > Custom Namespaces > ContainerInsights already holds 1279 metrics, with the majority of them under ClusterName, Namespace, PodName (979 metrics).
Under ContainerInsights > ClusterName, Namespace, PodName, 4-5 metrics are created for each new pod, and the number of pods steadily increases since we have Airflow creating new short-lived pods all the time.
Since CloudWatch Pricing indicates that we pay $0.30 per custom metric per month, does that mean that I will get billed ~1000 × $0.30 = $300? The pricing page also indicates that all custom metrics charges are prorated by the hour and metered only when you send metrics to CloudWatch. Since almost all the pods are short-lived (less than an hour), does that mean that I will get billed only ~1000 × ($0.30 / (30 days × 24 hours)) =
~1000 × ($0.30 / 720) ≈ $0.42?

According to Example 8 - Container Insights for Amazon EKS and Kubernetes (k8s) on the [CloudWatch Pricing page](https://aws.amazon.com/cloudwatch/pricing/), up to 9 metrics are sent per pod. There it is also made clear that the metric cost is prorated on an hourly basis.
I guess the best way to think about it is to ask yourself how many pods, on average, will be running during the 30-day period, and use that to calculate the cost:
(n_average_running_pods × 9 × $0.30)
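To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python; the 9-metrics-per-pod figure comes from the pricing example above, and the hourly proration follows the quoted pricing text (treat the numbers as estimates, not a billing guarantee):

# Rough estimate of Container Insights metric cost for short-lived pods.
# Assumes ~9 custom metrics per pod (per the pricing example) and hourly
# proration of the $0.30 per metric-month price over a 720-hour month.
PRICE_PER_METRIC_MONTH = 0.30
METRICS_PER_POD = 9
HOURS_PER_MONTH = 30 * 24  # 720

def monthly_cost(avg_running_pods: float) -> float:
    """Cost if avg_running_pods pods are running, on average, all month."""
    return avg_running_pods * METRICS_PER_POD * PRICE_PER_METRIC_MONTH

def short_lived_cost(pods_per_month: int, avg_lifetime_hours: float) -> float:
    """Prorated cost for pods that each live only a few hours."""
    metric_hours = pods_per_month * METRICS_PER_POD * avg_lifetime_hours
    return metric_hours * PRICE_PER_METRIC_MONTH / HOURS_PER_MONTH

print(monthly_cost(10))             # 10 always-on pods  -> $27.00
print(short_lived_cost(1000, 1.0))  # 1000 one-hour pods -> $3.75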

Related

How to identify the CloudWatch metrics for a specific KCL application in Kinesis Streams

We have multiple Kinesis consumer applications (KCL 2.0) consuming data from the same Kinesis stream. All the consumers send their metrics to CloudWatch, and they show up there.
If I wanted to specifically understand and scale one consumer application to multiple instances, how can we achieve that?
CloudWatch metrics: GetRecords iterator age, Incoming data - Sum (Count)
KCL metrics are published under a separate namespace in CloudWatch. The namespace used to upload the metrics is the applicationName that you provide in the KCL configuration. So if you have multiple KCL applications with different applicationNames, you will find those metrics in the CloudWatch metrics console under "Custom Namespaces".
A complete list of KCL metrics can be found here.
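As a quick way to see exactly which metrics a given application publishes, a small boto3 sketch along these lines should work; the namespace "my-kcl-app" is a made-up example standing in for your applicationName:

# List the CloudWatch metrics published by one KCL application.
# The namespace equals the applicationName from the KCL configuration.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

paginator = cloudwatch.get_paginator("list_metrics")
for page in paginator.paginate(Namespace="my-kcl-app"):  # hypothetical name
    for metric in page["Metrics"]:
        dims = {d["Name"]: d["Value"] for d in metric["Dimensions"]}
        print(metric["MetricName"], dims)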

API Traffic Shaping/Throttling Strategies For Tenant Isolation

I'll start my question by providing some context about what we're doing and the problems we're facing.
We are currently building a SaaS (hosted on Amazon AWS) that consists of several microservices that sit behind an API gateway (we're using Kong).
The gateway handles authentication (through consumers with API keys) and exposes the APIs of these microservices that I mentioned, all of which are stateless (there are no sessions, cookies or similar).
Each service is deployed using ECS services (one or more docker containers per service running on one or more EC2 machines) and load balanced using the Amazon Application Load Balancer (ALB).
All tenants (clients) share the same environment, that is, the very same machines and resources. Given our business model, we expect to have few but "big" tenants (at first).
Most of the requests to these services translate into heavy resource usage (CPU, mainly) for the duration of the request. The time needed to serve one request is in the range of 2-10 seconds (not ms, like traditional "web-like" applications). This means we serve relatively few requests per minute, where each one of them takes a while to process (background or batch processing is not an option).
Right now, we don't have a strategy to limit or throttle the number of requests that a tenant can make in a given period of time. Taking into account the last two considerations from above, it's easy to see this is a problem, since it's almost trivial for a tenant to make more requests than we can handle, causing a degradation in the quality of service (even for other tenants, because of the shared-resources approach).
We're thinking of strategies to limit/throttle or in general prepare the system to "isolate" tenants, so one tenant can not degrade the performance for others by making more requests than we can handle:
1. Rate limiting: Define a maximum number of requests per minute that a tenant can make. If more requests arrive, drop them. Kong even has a plugin for it. Sadly, we use a "pay-per-request" pricing model and the business does not allow us to use this strategy, because we want to serve as many requests as possible in order to get paid for them. If excess requests take more time to serve for a tenant, that's fine.
2. Tenant isolation: Create an isolated environment for each tenant. This one has been discarded too, as it makes maintenance harder and leads to lower resource usage and higher costs.
3. Auto-scaling: Bring up more machines to absorb bursts. In our experience, Amazon ECS is not very fast at doing this, and by the time these new machines are ready it's possibly too late.
4. Request "throttling": Use algorithms like Leaky Bucket or Token Bucket at the API gateway level to ensure that requests hit the services at a rate we know we can handle.
Right now, we're inclined to take option 4. We want to implement the request throttling (traffic shaping) in such a way that all requests made within a previously agreed rate with the tenant (enforced by contract) would be passed along to the services without delay. Since we know in advance how many requests per minute each tenant is going to be making (estimated, at least), we can size our infrastructure accordingly (plus a safety margin).
If a burst arrives, the excess requests would be queued (up to a limit) and then released at a fixed rate (using the leaky bucket or similar algorithm). This would ensure that a tenant can not impact the performance of other tenants, since requests will hit the services at a predefined rate. Ideally, the allowed request rate would be "dynamic" in such a way that a tenant can use some of the "requests per minute" of other tenants that are not using them (within safety limits). I believe this is called the "Dynamic Rate Leaky Bucket" algorithm. The goal is to maximize resource usage.
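For illustration, here is a minimal in-process sketch of a per-tenant token bucket, a close cousin of the leaky bucket described above. The class and the default rates are hypothetical, and a real deployment across multiple gateway instances would keep this state in Redis rather than in process memory:

import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec   # tokens added per second (agreed tenant rate)
        self.capacity = burst      # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # tenant_id -> TokenBucket (in production: shared state in Redis)

def admit(tenant_id: str, rate=5.0, burst=10) -> bool:
    # Within the agreed rate, requests pass immediately; excess requests
    # would be queued (or delayed) instead of dropped, per the design above.
    bucket = buckets.setdefault(tenant_id, TokenBucket(rate, burst))
    return bucket.try_acquire()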
My questions are:
Is the proposed strategy a viable one? Do you know of any other viable strategies for this use case?
Is there an open-source, commercial, or SaaS service that can provide these traffic-shaping capabilities? As far as I know, Kong and Tyk do not support anything like this, so... is there any other API gateway that does?
In case Kong does not support this, how hard would it be to implement something like what I've described as a plugin? We have to take into account that it would need some shared state (using Redis, for example) as we're using multiple Kong instances (for load balancing and high availability).
Thank you very much,
Mikel.
Managing a request queue on the gateway side is indeed a tricky thing, and probably the main reason it is not implemented in these gateways is that it is really hard to do right. You need to handle all the distributed-system cases, and in addition it is hard to make it "safe", because "slow" clients quickly consume machine resources.
This pattern is usually offloaded to client libraries: when a client hits the rate-limit status code, it uses something like an exponential backoff technique to retry requests. That is way easier to scale and implement.
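A minimal client-side sketch of that pattern, assuming the gateway returns HTTP 429 when the limit is hit (the function name and retry limits are made up):

import random
import time

import requests  # third-party: pip install requests

def call_with_backoff(url, max_retries=5, base_delay=0.5):
    for attempt in range(max_retries + 1):
        response = requests.get(url)
        if response.status_code != 429:
            return response
        # Exponential backoff with full jitter: up to 0.5s, 1s, 2s, 4s, ...
        time.sleep(base_delay * (2 ** attempt) * random.random())
    raise RuntimeError(f"still rate-limited after {max_retries} retries")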
I can't say for Kong, but Tyk, in this case, provides two basic numbers you can control: quota - the maximum number of requests a client can make in a given period of time - and rate limits - a safety protection. You can set a rate limit 1) per "policy", e.g. for a group of consumers (for example, if you have multiple tiers of your service with different allowed usage/rate limits), 2) per individual key, 3) globally for the API (works together with key rate limits). So, for example, you can set some moderate per-client rate limits and cap the total with a global API setting.
If you want a fully dynamic scheme that re-calculates limits based on cluster load, it should be possible. You will need to write and run this scheduler somewhere; from time to time it will perform a re-calculation based on current total usage (which Tyk calculates for you, and you can get from Redis) and talk to the Tyk API, iterating through all keys (or policies) and dynamically updating their rate limits.
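A rough sketch of what that scheduler could look like; the endpoint paths, the "rate"/"per" session fields, and the load policy are assumptions based on Tyk's key-management API, so verify them against your Tyk version:

# Periodically recompute per-key rate limits and push them back to Tyk.
import requests

TYK = "http://tyk-gateway:8080"                           # hypothetical host
HEADERS = {"x-tyk-authorization": "your-gateway-secret"}  # hypothetical secret

def recalculate(current_rate: float, cluster_load: float) -> float:
    # Toy policy: shrink the allowed rate as cluster load approaches 1.0.
    return max(1.0, current_rate * (1.0 - cluster_load))

key_ids = requests.get(f"{TYK}/tyk/keys", headers=HEADERS).json().get("keys", [])
for key_id in key_ids:
    session = requests.get(f"{TYK}/tyk/keys/{key_id}", headers=HEADERS).json()
    session["rate"] = recalculate(session.get("rate", 60.0), cluster_load=0.5)
    session["per"] = 60  # rate window in seconds
    requests.put(f"{TYK}/tyk/keys/{key_id}", headers=HEADERS, json=session)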
Hope it makes sense :)

Apigee SpikeArrest Sync Across MessageProcessors (MPs)

Our organisation is currently migrating to Apigee.
I currently have a problem very similar to this one, but due to the fact that I am a Stack Overflow newcomer and have low reputation I couldn't comment on it: Apigee - SpikeArrest behavior
So, in our organisation we have 6 MessageProcessors (MP) and I assume they are working in a strictly round-robin manner.
Please see this config (it is applied to the TARGET ENDPOINT of the ApiProxy):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<SpikeArrest async="false" continueOnError="false" enabled="true" name="spikearrest-1">
  <DisplayName>SpikeArrest-1</DisplayName>
  <FaultRules/>
  <Properties/>
  <Identifier ref="request.header.some-header-name"/>
  <MessageWeight ref="request.header.weight"/>
  <Rate>3pm</Rate>
</SpikeArrest>
I have a rate of 3pm, which means 1 hit every 20 sec, calculated according to ApigeeDoc1.
The problem is that instead of 1 successful hit every 20 sec, I get 6 successful ones within 20 sec and then the SpikeArrest error, meaning each MP was hit once in round-robin fashion.
This means I get 6 hits per 20 sec to my API backend instead of the desired 1 hit per 20 sec.
Is there any way to sync the spikearrests across the MPs?
ConcurrentRatelimit doesn't seem to help.
SpikeArrest has no ability to be distributed across message processors. It is generally used for stopping large bursts of traffic, not controlling traffic at the levels you are suggesting (3 calls per minute). You generally put it in the Proxy Request Preflow and abort if the traffic is too high.
The closest you can get to 3 per minute using SpikeArrest with your round robin message processors is 1 per minute, which would result in 6 calls per minute. You can only specify SpikeArrests as "n per second" or "n per minute", which does get converted to "1 per 1/n second" or "1 per 1/n minute" as you mentioned above.
Do you really only support one call every 20 seconds on your backend? If you are trying to support one call every 20 seconds per user or app, then I suggest you try to accomplish this using the Quota policy. Quotas can share a counter across all message processors. You could also use quotas with all traffic (instead of per user or per app) by specifying a quota identifier that is a constant. You could allow 3 per minute, but they could all come in at the same time during that minute.
If you are just trying to protect against overtaxing your backend, the ConcurrentRateLimit policy is often used.
The last solution is to implement some custom code.
Update to address further questions:
Restating:
6 message processors handled round robin
want 4 apps to each be allowed 5 calls per second
want the rest of the apps to share 10 calls per second
To get the kind of granularity you are looking for, you'll need to use quotas. Unfortunately you can't set a quota to have a "per second" value on a distributed quota (distributed quota shares the count among message processors rather than having each message processor have its own counter). The best you can do is per minute, which in your case would be 300 calls per minute. Otherwise you can use a non-distributed quota (dividing the quota between the 6 message processors), but the issue you'll have there is that calls that land on some MPs will be rejected while others will be accepted, which can be confusing to your developers.
For distributed quotas you'd set the 300 calls per minute in an API Product (see the docs), and assign that product to your four apps. Then, in your code, if that product is not assigned for the current API call's app, you'd use a quota that is hardcoded to 10 per second (600 per minute) and use a constant identifier rather than the client_id, so that all other traffic uses that quota.
Quotas don't keep you from submitting all your requests nearly simultaneously, and I'm assuming your backend can't handle 1200+ requests all at the same time. You'll need to smooth the traffic using a SpikeArrest policy. You'll want to allow the maximum traffic through the SpikeArrest that your backend can handle. This will help protect against traffic spikes, but you'll probably get some traffic rejected that would normally be allowed by the Quota. The SpikeArrest policy should be checked before the Quota, so that rejected traffic is not counted against the app's quota.
As you can probably see, configuring for situations like yours is more of an art than a science. My suggestion would be to do significant performance/load testing, and tune it until you find the correct values. If you can figure out how to use non-distributed quotas to get acceptable performance and predictability, that will let you work with per second numbers instead of per minute numbers, which will probably make massive spikes less likely.
Good luck!
Unlike Quota limits, Spike Arrest cannot be synchronized across MPs.
But, as you're setting them at a per-minute level, you could use the Quota policy instead -- then set it to Distributed and Synchronized and it will coordinate across MPs.
Keep in mind there will always be some latency in the synchronization across machines, so it will never be a completely precise number.

How to analyse JMeter result?

I am new to the JMeter tool. Can anyone help me with the best way to analyse JMeter reports?
Simply a list of related links you may find useful:
Native graphs:
JMeter Report Dashboard
Real-time plotting with a 3rd-party real-time series database like InfluxDB
Free Open source solutions for automated graphs:
JMeter Plugins - look into the custom graphs in this package; some of them provide better results reporting out of the box than JMeter's original ones;
JMeter Result Analysis Plugin
JWeter tool for log analysis & visualization
Recipes with custom development:
JMeter Wiki: Suggestions and Recipes for Log Analysis
Better JMeter Graphs
Plotting your load test with JMeter
3rd party solutions:
Blazemeter Sense
Tricentis flood.io
RedLine13
JAnalyser: browser based results analysis tool
UPD.
Please find, use, and feel free to extend this Awesome JMeter collection, continued as a GitHub repo.
There are 3 tests that are a must when doing performance testing: there should always be a baseline test, a peak test, and a stress test. These tests relate to each other because of Little's law: the long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W; or, expressed algebraically: L = λW.
JMeter already provides the means to check these values: the standard plugins provide plots for response times, hits, as well as throughput. There is no way to directly tell how many users were active on the system; concurrent users are not the same as active users. The plugins are enough to produce the reports, but they do not allow much control over the presentation, so I will use some plots produced with Python (they add labels, and have 2 y-axes).
Baseline Test:
This is a special case of the law: the number of active users is constant and equal to one, so:
L = λW
1 = λW
1/W = λ
If the application runs the same piece of code, the response time will stabilize over time, and then the arrival rate will be constant over time too.
Consider a service that does nothing but wait for some time to pass:
2-second service: the arrival rate was 1/2 TPS.
3-second service: the arrival rate was 1/3 TPS.
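A one-line check of the baseline case (the two waits match the two services above):

# Baseline: with one active user, Little's law gives λ = 1/W.
for w in (2.0, 3.0):  # service response times in seconds
    print(f"W = {w:.0f}s -> arrival rate λ = {1 / w:.3f} TPS")
# -> 0.500 TPS and 0.333 TPS, i.e. 1/2 TPS and 1/3 TPS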
Peak Test:
This is another special case: the load increases until it surpasses the system throughput, and because the load is greater than the throughput, the response times increase. During the test, the thread count should increase fast enough to recover from the long response times.
This time, instead of just running the peak, I will stress the system with more load than it is able to handle during the whole test. To control the service throughput:
The active transactions are those that have left the injector but haven't yet received a response; these are transactions queued somewhere within the system.
λ(t) = c, T(t) = k; both the load and the throughput are constant over time.
L = Σλ - ΣT = ct - kt; the number of active transactions is the difference between the cumulative load and the cumulative throughput.
L = (c - k)t
λW = (c - k)t
cW(t) = (c - k)t
W(t) = t(c - k)/c
Because response times grow as active users do, we will need the injector to create new threads as fast as new connections are required; most of the pool threads are going to be busy.
2 TPS arrival rate, 1 TPS throughput:
The response-time function is (1/2)t.
The injector stresses the system for 300 seconds.
The test lasts 600 seconds.
4 TPS arrival rate, 1 TPS throughput:
The response-time function is (3/4)t.
The injector stresses the system for 300 seconds.
The test lasts 1200 seconds.
6 TPS arrival rate, 5 TPS throughput:
The response-time function is (1/6)t.
The injector stresses the system for 300 seconds.
The test lasts 360 seconds.
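The three cases above can be reproduced directly from the formulas (a small sketch; the drain time after injection stops comes from emptying the backlog at k TPS):

# c = arrival rate (TPS), k = throughput (TPS), stress = injection time (s).
def stress_case(c: float, k: float, stress: float = 300.0):
    response_slope = (c - k) / c  # W(t) = t * (c - k) / c
    backlog = (c - k) * stress    # transactions queued when injection stops
    total = stress + backlog / k  # plus the time to drain the backlog
    return response_slope, total

for c, k in [(2, 1), (4, 1), (6, 5)]:
    slope, total = stress_case(c, k)
    print(f"c={c} TPS, k={k} TPS: W(t) = {slope:.2f}*t, test lasts {total:.0f} s")
# -> 0.50*t and 600 s; 0.75*t and 1200 s; 0.17*t and 360 s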
In simple words, if you want to analyze your JMeter report...
Start with server CPU and RAM utilization. When you run a performance test on your server, see how much CPU and RAM are utilized by the current test.
Issue the following command on the hosted site's server; it will create a log file of CPU usage.
# Log the top CPU-consuming processes (%CPU, %MEM, command) to ps.log every second.
while true; do
  ( echo "%CPU %MEM ARGS $(date)" &&
    ps -e -o pcpu,pmem,args --sort=pcpu | cut -d" " -f1-5 |
    tail ) >> ps.log
  sleep 1
done
See the overall response time; it should not exceed your expected response-time criteria.
See the image below. My expectation is that response time should not go above 525 microseconds, but some requests are crossing it. Find the kinds of requests that are taking this long.
Overall Response Times:
See transactions per second: how many transactions are made per second, and is there any drop within the test time frame?
Inspect the summary report; use the average time and max time to see which requests are taking the most time.
Currently many listeners are available in JMeter, as add-ons or built in, but these are the major things to look at in order to properly guess what's going on. You can use other reports like these as well.
Follow my blog for more details https://softwaretesterfriend.blogspot.in/
Starting with version 3.0, JMeter includes a dynamic HTML report that can be generated either at the end of a load test or from a result file.
See generating-dashboard
In order to analyze your JMeter results, you can use
Listeners in JMeter
Blazemeter Sense
Reports Dashboard
In addition to all the other answers: there is a nice site from BlazeMeter where you can upload your test result file (.jtl) and it will generate all kinds of (interactive) reports for it. It even analyzes it for you and points out when the first error occurs, what the saturation point is, etc.: https://sense.blazemeter.com/gui/
If you have a Graphite/Grafana infrastructure, I can recommend adding the Backend Listener to the project. It will send real-time metrics to the Graphite server, and you can monitor the test in Graphite (or Grafana).
If you are new to JMeter, understanding JMeter listeners and other components will help you. Check the tutorial:
- https://www.youtube.com/watch?v=FfDVIklNjgw

How does one autoscale web dynos on Heroku?

With Heroku, how does one AUTO-scale up in terms of web dynos when needed? Say we get a surge of 100 concurrent users every 2-3 minutes. If our app is stuck on 5-6 web dynos, we are screwed.
Second, I wouldn't be able to monitor traffic 24 hours a day to determine whether a scale-up or scale-down is required.
So far, I've seen http://hirefireapp.com/ and http://www.heroscale.com/
Any suggestions about these two?
The reason Heroku doesn't do this natively is that it's an incredibly complex problem to solve.
For instance, imagine the scenario above: you suddenly start seeing a queue forming and want to ramp up the dynos. You crank on ten more. However, it's not a dyno problem: your database is running slow, so now you've got more dynos all sat waiting for a database which now has even more demand placed on it.
Whilst there are auto-scaling products out there, I've not tried any of them, and fully believe that at the moment only a human can make the correct call on scaling. Your mileage may vary.
I have found in the past that setting the resources to an expected usage level (which may be above the current usage) tends to work best, excluding massive traffic influx (such as being on Hacker News etc)
I built HireFire and would like to share some up-to-date information:
HireFire autoscales both your web and worker dynos using our dyno managers. We currently support the following metric sources:
HireFire (Job Queue) | Worker Dynos
Heroku Logplex (Response Time) | Web Dynos
Heroku Logplex (Connect Time) | Web Dynos
Heroku Logplex (Queue Time) | Web Dynos
Heroku Logplex (Requests Per Minute) | Web Dynos
Heroku Logplex (CPU Load) | Web/Worker Dynos
NewRelic (Apdex) | Web Dynos
NewRelic (Response Time) | Web Dynos
NewRelic (Requests Per Minute) | Web Dynos
HireFire (Job Queue)
Autoscales your worker-based dynos based on the queue size of your jobs. Integration for Ruby and Python applications can be done easily using a first- or third-party library. Any other language and/or framework can be integrated easily without a library as well.
You're able to configure any number of dyno managers for a given application at no extra cost, meaning that you're not limited to a single "worker" entry in your Procfile. This (optionally) allows you to schedule work more efficiently by having for example one Procfile entry per queue and having HireFire scale each individual queue independently.
Heroku Logplex
The Logplex (Logdrain) strategy allows HireFire to consume your logs in order to parse Heroku-emitted metric data, which we then use to autoscale. Metrics include Response Time, Connect Time, and Load. In addition to that, we support Queue Time, which can easily be added by installing our library; or you can write the minimal amount of code yourself to push the necessary data to the logdrain.
This approach (excluding Queue Time) requires no code changes and works with any language/framework. Just set up a Logdrain via the Heroku CLI and you're set.
For metric aggregation, you can choose between average and (any) percentile.
New Relic
We integrate with New Relic. If you're already using it, you can hook it up to HireFire and use their metrics (Apdex, response time, and RPM) to autoscale your web dynos.
If you have any questions, don't hesitate to get in touch!
A lot of my friends in the Rails community use Rails Autoscale for autoscaling on Heroku. Here's how it works:
Rails Autoscale provides a tiny Rack middleware that captures request queue timing and periodically reports it back to the Rails Autoscale service. This is similar to how New Relic works, at a fraction of the size.
The autoscaling that Heroku provides natively is only available on their Performance tier (which starts at $250 / month per dyno). If you're using the Hobby or Standard plans, you'll need to find a 3rd party solution.
One thing I like about Rails Autoscale is it scales your app up and down automatically, based on request queueing.
It's a nice feature, especially for peace of mind. If you're sleeping, and get a traffic spike, you can't manually adjust the number of dynos. Having a tool that can scale up automatically is nice insurance.
Heroku just launched a new addon that does auto scaling. Web dynos only right now though.
Check out this thread https://stackoverflow.com/a/14075781/484689
I wrote a Heroku autoscaling engine called Heroku Vector. It allows you to scale web and Sidekiq dynos based on the volume of traffic you receive (instead of waiting for latencies in response time):
https://github.com/wpeterson/heroku-vector
You can run it as a stand-alone dyno process.
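If you'd rather roll your own, a minimal scaling loop against the Heroku Platform API's formation endpoint might look like this; desired_dynos() is a hypothetical stand-in for your own metric source (queue time, RPM, etc.), and the app name is made up:

import os
import time

import requests  # pip install requests

APP = "my-app"  # hypothetical app name
HEADERS = {
    "Accept": "application/vnd.heroku+json; version=3",
    "Authorization": f"Bearer {os.environ['HEROKU_API_KEY']}",
}

def desired_dynos() -> int:
    # Replace with real logic, e.g. based on request queue time or RPM.
    return 2

while True:
    resp = requests.patch(
        f"https://api.heroku.com/apps/{APP}/formation/web",
        headers=HEADERS,
        json={"quantity": desired_dynos()},
    )
    resp.raise_for_status()
    time.sleep(60)  # re-evaluate once a minute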
As of January 2017, Heroku formally supports autoscaling:
Autoscaling is easy to set up and use, and it recommends a p95 threshold based on your app’s past 24 hours of response times. Response-based autoscaling ensures that your web dyno formation is always sized for optimal efficiency, while capping your costs based on limits you set. Autoscaling is currently included at no additional cost for apps using Performance and Private web dynos.
Here are the docs:
https://devcenter.heroku.com/articles/scaling#autoscaling
Here is the announcement: https://blog.heroku.com/heroku-autoscaling