Will Micrometer metrics (in Spring Boot) remain in the heap after being scraped by Prometheus? - jvm

We have a Spring application with Micrometer metrics collection enabled, and the metrics are collected by a Prometheus server after a certain delay. Since metrics are time series data, my question is: do the metrics collected in the Spring application remain in the heap even after they have been copied to Prometheus? If such time series data keeps growing, the heap will eventually run out of memory, won't it? If anyone knows how Micrometer retains metrics in memory, please let us know.

A monitoring setup separates responsibilities. We can name at least two of them:
1. Hold knowledge of the current status of the application.
This is the responsibility of the Spring application. It should know all statuses and metrics at the current moment. There is no time series data in the application itself. You cannot ask it something like "What were the application metrics one hour ago?", because Spring has no idea what the state of the application was even seconds ago.
2. Collect the history of the application status.
This is where Prometheus comes in. It scrapes the "current" metrics from the application and maps them to the current timestamp. In this way it builds a time series, so you are able to ask Prometheus "What were the application metrics one hour ago?".
In other words, only the current state is stored in the heap of the Spring application, without any history. Because of that, no cleanup is needed.
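
The split described above can be illustrated with a minimal sketch in plain Python. It does not use Micrometer or Prometheus; the `Counter` and `Scraper` classes, the metric name and the 15-second interval are all hypothetical stand-ins chosen for illustration.

```python
class Counter:
    """Application side: holds only the current cumulative value."""

    def __init__(self):
        self.value = 0

    def increment(self, amount=1):
        self.value += amount


class Scraper:
    """Monitoring side (hypothetical stand-in): the only place history is kept."""

    def __init__(self):
        self.series = []  # list of (timestamp, value) samples

    def scrape(self, counter, now):
        # Read the current value and attach the scrape timestamp to it.
        self.series.append((now, counter.value))


requests_total = Counter()
scraper = Scraper()

for tick in range(3):
    for _ in range(10):          # the application handles 10 requests
        requests_total.increment()
    scraper.scrape(requests_total, now=tick * 15)  # assumed 15 s scrape interval

print(requests_total.value)  # application still holds a single number
print(scraper.series)        # the scraper holds one sample per scrape
```

The application-side object stays the same size no matter how many scrapes happen. One caveat worth keeping in mind: the number of such current values grows with the number of distinct meter name and tag combinations, so unbounded tag values can still grow the heap even though no history is kept.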

Related

AWS CloudWatch "Latency" metric and JMeter "Average" metric in Summary Report for API performance testing

While load/performance testing an API behind an ELB in AWS using JMeter, I see
an AWS CloudWatch Latency metric of 10 ms (seems good) and, in JMeter's Summary Report, an Average metric of 3000 ms (seems bad).
The API returns 1 MB of JSON data, but I don't understand why the numbers differ so much. Is this API performance acceptable if the SLA requires a 100 ms API response time?
You are looking into different metrics:
Latency: JMeter measures the latency from just before sending the request to just after the first response has been received.
Elapsed time: JMeter measures the elapsed time from just before sending the request to just after the last response has been received.
So latency is included in the response time: latency is the so-called Time To First Byte, and elapsed time is the Time To Last Byte. My recommendation is to stick to what JMeter reports so that you are not confused by metrics coming from different sources. JMeter is open source, so you can check exactly how its metrics are calculated.
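
The difference between the two definitions can be sketched in a few lines of Python. This is not JMeter code; the `measure` function, the injected `clock` and the millisecond values are assumptions chosen to mirror the numbers in the question.

```python
def measure(chunks, clock):
    """Return (latency, elapsed) for a response delivered as an iterable of chunks.

    latency: time from just before sending until the first chunk has arrived
    elapsed: time from just before sending until the last chunk has arrived
    """
    start = clock()
    latency = elapsed = None
    for _ in chunks:
        now = clock()
        if latency is None:
            latency = now - start   # first chunk: time to first byte
        elapsed = now - start       # updated on every chunk: time to last byte
    return latency, elapsed


# Fake clock in milliseconds: request sent at 0, chunks arrive at 10, 1500, 3000.
ticks = iter([0, 10, 1500, 3000])
latency_ms, elapsed_ms = measure(["chunk1", "chunk2", "chunk3"], clock=lambda: next(ticks))
print(latency_ms, elapsed_ms)
```

With a large response body, the first byte can arrive quickly while the full transfer takes much longer, which is one way a small latency figure and a large elapsed time can coexist.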
If a response time of 3 seconds is too high, you can start looking into the reasons for it, which could be:
Your API server is simply overloaded. Check CPU, RAM, network and disk usage using, for example, the aforementioned Amazon CloudWatch or the JMeter PerfMon Plugin.
Your application configuration might not be ready for high loads. The defaults of most web, application and database servers are suitable for development and debugging only (the same applies to JMeter), so you will most probably need to tune the infrastructure.
Your application uses non-optimal algorithms. Use profiler tools to inspect where it spends time, which methods are the "heaviest", how long database calls last, etc.
Also, if your application is behind the ELB, JMeter can cache the IP address of one of the entry nodes, so all your requests will hit only one host. To avoid this, add a DNS Cache Manager to your Test Plan.
References:
JMeter Glossary
JMeter Best Practices
The DNS Cache Manager: The Right Way To Test Load Balanced Apps

Data not showing up intermittently on the OpenTSDB UI

We are running some high-volume tests by pushing metrics to OpenTSDB (2.3.0) with Bigtable, and a curious problem surfaces from time to time. For some metrics, an hour of data stops showing up on the web UI when we run a query. The span of "missing" data is very clear-cut and aligned to hour boundaries (UTC). After a while, when we rerun the same query, the data shows up. We cannot deduce any pattern here other than the hour span. Any pointers on what to look for to debug this?
How long do you have to wait before the data shows up? Is it always the most recent hour that is missing?
Have you tried using OpenTSDB CLI when this is happening and issuing a scan to see if the data is available that way?
http://opentsdb.net/docs/build/html/user_guide/cli/scan.html
You could also check via an HBase shell scan to see if you can get the raw data that way (here's information on how it's stored in HBase):
http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html
If you can verify the data is there then it seems likely to be a web UI problem. If not, the next likely culprit is something getting backed up in the write pipeline.
I am not aware of any particular issue in the Google Cloud Bigtable backend layer that would cause this behavior, but I believe some people have encountered issues with OpenTSDB compactions during periods of high load that result in degraded performance.
It is worth checking in the Google Cloud Console whether there are any outliers in the latency, CPU or throughput graphs that correlate with the times at which you experience the issue.

BigQuery streaming inserts taking time

During load testing of our module, we found that BigQuery insert calls take a long time (3-4 s). I am not sure whether this is OK. We are using the Java BigQuery client library, and on average we push 500 records per API call. We expect traffic of a million records per second to our module, so BigQuery inserts are the bottleneck in handling this traffic. Currently it takes hours to push the data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size (see the Quota policy), it is easier to talk about times, as the payload is limited in the same way for both of us. I will mention other side effects too.
We measure between 1200 and 2500 ms for each streaming request, and this has been consistent over the last month, as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non-descriptive and so vague that they don't help you; you can only retry.
we see hundreds of such failures each day, so they are pretty much constant and not related to Cloud health.
For all of these we opened cases with paid Google Enterprise Support, but unfortunately they did not resolve them. It seems the recommended option for these is exponential backoff with retry; even support told us to do so, which personally doesn't make me happy.
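
A retry with exponential backoff could look roughly like the following sketch. It is generic Python, not BigQuery client code: `TransientError`, `retry_with_backoff` and all default values are hypothetical names and numbers picked for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (backend or connection error)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=32.0,
                       sleep=time.sleep, jitter=random.random):
    """Call operation(); on TransientError wait 1 s, 2 s, 4 s, ... (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Double the delay on every attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt) + jitter()
            sleep(delay)
```

The `sleep` and `jitter` parameters are injectable only so the behaviour can be checked without real waiting. Given the partial failures mentioned above, a real implementation would probably need to resend only the rows that failed rather than the whole payload.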
If the approach you have chosen takes hours, it does not scale and won't scale. You need to rethink the approach using asynchronous processes. To finish sooner, you need to run multiple workers in parallel; the streaming performance of each request will stay the same. With just 10 workers in parallel, the total time will be roughly 10 times shorter.
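
As a rough sketch of the parallel-worker idea, the rows can be split into batches and handed to a thread pool. The `insert_batch` function below is a hypothetical placeholder for one streaming insert call, and the batch size of 500 and the 10 workers simply reuse the numbers from this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


def insert_batch(batch):
    """Hypothetical placeholder for one streaming insert call; returns rows sent."""
    return len(batch)


def insert_all(rows, batch_size=500, workers=10, insert=insert_batch):
    """Send all batches using `workers` threads and return the total rows sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(insert, batches(rows, batch_size)))


rows = [{"id": i} for i in range(2300)]
total = insert_all(rows)
print(total)
```

Threads are a reasonable fit here because each insert call spends most of its time waiting on the network. The achievable speed-up is still bounded by the service quotas.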
Processing I/O-bound or CPU-bound tasks in the background is now common practice in most web applications. There is plenty of software to help build background jobs, some of it based on a messaging system such as Beanstalkd.
Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. That is exactly what Beanstalkd provides.
Beanstalkd lets you organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer that can put jobs on a tube, for example a JSON representation of the row. This was a killer feature for our use case. We have an API that receives the rows and places them on a tube; this takes just a few milliseconds, so you can achieve a fast response time.
On the other side, you now have a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
Beanstalkd also helps with job management and retries: when a job is successfully processed, the consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can also release a job; Beanstalkd will push this job back into the tube and make it available to another client.
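
The job lifecycle described above (put, reserve, delete, release, bury) can be modelled with a toy in-memory tube. This is not a Beanstalkd client and does not speak its protocol; the `Tube` class, the `consume` loop and the `max_releases` policy are hypothetical and only illustrate the state transitions.

```python
from collections import deque


class Tube:
    """Toy in-memory model of a job tube; not a Beanstalkd client."""

    def __init__(self):
        self.ready = deque()   # jobs waiting for a consumer
        self.reserved = {}     # jobs currently held by a consumer
        self.buried = {}       # failed jobs kept aside for inspection
        self._next_id = 1

    def put(self, body):
        job_id = self._next_id
        self._next_id += 1
        self.ready.append((job_id, body))
        return job_id

    def reserve(self):
        job_id, body = self.ready.popleft()
        self.reserved[job_id] = body
        return job_id, body

    def delete(self, job_id):
        del self.reserved[job_id]

    def release(self, job_id):
        # Back to the ready queue so another consumer can pick it up.
        self.ready.append((job_id, self.reserved.pop(job_id)))

    def bury(self, job_id):
        self.buried[job_id] = self.reserved.pop(job_id)


def consume(tube, handler, max_releases=2):
    """Process ready jobs; release failures for retry, bury them after max_releases."""
    releases = {}
    while tube.ready:
        job_id, body = tube.reserve()
        try:
            handler(body)
            tube.delete(job_id)
        except Exception:
            releases[job_id] = releases.get(job_id, 0) + 1
            if releases[job_id] > max_releases:
                tube.bury(job_id)
            else:
                tube.release(job_id)
```

In this sketch a failing job is retried twice and then buried, so it stops blocking the tube but stays available for inspection.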
Beanstalkd clients are available for most common languages, and a web interface can be useful for debugging.

VisualVM collect performance data over a period of time

For Java, I am using VisualVM to monitor CPU, memory and thread information. Is there a way in VisualVM to collect this information over a range of time so that I can present it in a graph?
In VisualVM, under the Monitor tab, I can see the CPU, Classes, Heap and Threads graphs. I would like to collect this data over the period in which I run my load test, and later present it in a graph for analysis.
If VisualVM is not the right tool, please suggest an alternative.
Thanks
You can use the Tracer plugin for monitoring. Select the probes that suit your needs, and you should be able to export the monitored data, which can then be used to construct the graph of your choice.

Is there a way to leverage Hadoop tools to manage parallel REST API calls to external sources?

I am writing software that creates a large graph database. The software needs to access dozens of different REST APIs with millions of total requests. The data will then be processed by the Hadoop cluster. Each of these APIs has rate limits that vary by requests per second, per window, per day and per user (typically via OAuth).
Does anyone have any suggestions on how I might use either a Map function or another Hadoop-ecosystem tool to manage these queries? The goal would be to leverage the parallel processing in Hadoop.
Because of the varied rate limits, it often makes sense to switch to a different API query while waiting for the first limit to reset. An example would be one API call that creates nodes in the graph and another that enriches the data for that node. I could have the system go out and enrich the data for the new nodes while waiting for the first API limit to reset.
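
The switching idea in the previous paragraph can be sketched independently of Hadoop, SQS or any other framework. The `WindowLimit` class, the `pick_api` helper, the API names and the limits below are all hypothetical; a fixed-window limiter is assumed because it is the simplest to show.

```python
class WindowLimit:
    """Allows at most `limit` calls per `window` seconds (fixed window)."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.window_start = None
        self.used = 0

    def try_acquire(self, now):
        # Start a fresh window on first use or once the current one has expired.
        if self.window_start is None or now - self.window_start >= self.window:
            self.window_start = now
            self.used = 0
        if self.used < self.limit:
            self.used += 1
            return True
        return False


def pick_api(limits, now):
    """Return the first API (in dict order) with budget left, or None if all are exhausted."""
    for name, limit in limits.items():
        if limit.try_acquire(now):
            return name
    return None


limits = {
    "create_nodes": WindowLimit(limit=2, window=60),
    "enrich_nodes": WindowLimit(limit=3, window=60),
}
schedule = [pick_api(limits, now=t) for t in (0, 1, 2, 3, 4, 5, 61)]
print(schedule)
```

Once the first API runs out of budget, the scheduler falls through to the enrichment API, and it returns to the first one when its window resets. Per-user or per-day limits would need additional limiter instances layered on top.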
I have tried using SQS queuing on EC2 to manage the various API limits and states (creating a queue for each API call), but have found it to be ridiculously slow.
Any ideas?
It looks like the best option for my scenario will be Storm, or more specifically its Trident abstraction. It gives me the greatest flexibility for both workload management and process management.