Publishing table count stats with cloudwatch put-metric-data - amazon-cloudwatch

I've been tasked with monitoring a data integration task, and I'm trying to figure out the best way to do this using cloudwatch metrics.
The data integration task populates records in 3 database tables. What I'd like to do is publish custom metrics each day, with the number of rows that have been inserted for each table. If the row count for one or more tables is 0, then it means something has gone wrong with the integration scripts, so we need to send alerts.
My question is, how to most logically structure the calls to put-metric-data.
I'm thinking of the data being structured something like this...
Namespace: Integrations/IntegrationProject1
Metric Name: RowCount
Metric Dimensions: "Table1", "Table2", "Table3"
Metric Values: 10, 100, 50
Does this make sense, or should it logically be structured in some other way? There is no inherent relationship between the tables, other than that they're all associated with a particular project. What I mean is, I don't want to be inferring some kind of meaningful progression from 10 -> 100 -> 50.
Is this something that can be done with a single call to the cloudwatch put-metric-data, or would it need to be 3 separate calls?
Separate calls I think would look something like this...
aws cloudwatch put-metric-data --metric-name RowCount --namespace "Integrations/IntegrationProject1" --unit Count --value 10 --dimensions Table=Table1
aws cloudwatch put-metric-data --metric-name RowCount --namespace "Integrations/IntegrationProject1" --unit Count --value 100 --dimensions Table=Table2
aws cloudwatch put-metric-data --metric-name RowCount --namespace "Integrations/IntegrationProject1" --unit Count --value 50 --dimensions Table=Table3
This seems like it should work, but is there some more efficient way I can do this, and combine it into a single call?
Also is there a way I can qualify that the data has a resolution of only 24 hours?

Your structure looks fine to me. Consider having a dimension for your stage: beta|gamma|prod.
This seems like it should work, but is there some more efficient way I can do this, and combine it into a single call?
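Not with the simple CLI flags you used (--metric-name, --value), which publish one metric per call. The CLI's --metric-data option does accept a JSON list of metrics, though, and with any SDK, e.g. Python Boto3, you can publish multiple metrics in a single PutMetricData call. The per-request limit was 20 metrics when this was written and has since been raised, so check the current CloudWatch quota.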
Also is there a way I can qualify that the data has a resolution of only 24 hours?
No. CloudWatch will aggregate the data it receives on your behalf. If you want to see a daily datapoint, you can change the period to 1 day when graphing the metric on the CloudWatch Console.
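As a minimal sketch of the single-call approach with Boto3: the namespace, metric name, and Table dimension come from the question, while the helper names build_metric_data and publish_row_counts are hypothetical. Only the payload builder is exercised without AWS access; the publish step assumes boto3 is installed and credentials are configured.

```python
from datetime import datetime, timezone

NAMESPACE = "Integrations/IntegrationProject1"  # namespace from the question


def build_metric_data(row_counts, timestamp=None):
    """Turn {table_name: row_count} into a MetricData list for PutMetricData."""
    timestamp = timestamp or datetime.now(timezone.utc)
    return [
        {
            "MetricName": "RowCount",
            # One dimension per datapoint, so each table is its own metric.
            "Dimensions": [{"Name": "Table", "Value": table}],
            "Timestamp": timestamp,
            "Value": float(count),
            "Unit": "Count",
        }
        for table, count in row_counts.items()
    ]


def publish_row_counts(row_counts):
    """Send all table counts in one request (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_metric_data works without AWS access

    client = boto3.client("cloudwatch")
    client.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=build_metric_data(row_counts),
    )
```

Calling publish_row_counts({"Table1": 10, "Table2": 100, "Table3": 50}) would then replace the three separate CLI calls with one request.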

Related

After importing a metric into Victoria Metrics, the metric is repeated for 5 minutes. What controls this behavior?

I am writing some software that will be pushing data to Victoria Metrics, as below:
curl -d 'foo{bar="baz"} 30' -X POST 'http://[Victoria]/insert/0/prometheus/api/v1/import/prometheus'
I noticed that if I push a single metric like this, it shows up not as a single data point but repeatedly, as if it were being scraped every 15 seconds, until either I push a new value for that metric or 5 minutes pass.
What setting/mechanism is causing this 5-minute repeat period?
Pushing data with a timestamp does not change this. Metric gets repeated for 5 minutes after that time or until a change regardless.
I don't necessarily need to alter this behavior, just trying to understand why it's happening.
How do you query the database?
I guess this behaviour is due to the ranged query concept and ephemeral datapoints, check this out:
https://docs.victoriametrics.com/keyConcepts.html#range-query
The interval between datapoints depends on the step parameter, which is 5 minutes when omitted.
If you want to receive only the real datapoints, go via export functions.
https://docs.victoriametrics.com/#how-to-export-time-series
VictoriaMetrics (a TSDB) fills gaps with ephemeral datapoints: when no raw sample exists at the requested timestamp, it uses the closest sample to the left of that timestamp.
So if you make the instant request:
curl "http://<victoria-metrics-addr>/api/v1/query?query=foo_bar&time=2022-05-10T10:03:00.000Z"
The time range at which VictoriaMetrics will try to locate a missing data sample is equal to 5m by default and can be overridden via step parameter.
step - optional max lookback window for searching for raw samples when executing the query. If step is skipped, then it is set to 5m (5 minutes) by default.
GET | POST /api/v1/query?query=...&time=...&step=...
You can read more in the key concepts part of the documentation (https://docs.victoriametrics.com/keyConcepts.html), which also covers range queries and other TSDB concepts.
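To make the lookback behaviour concrete, here is a simplified model of it in plain Python. It is not VictoriaMetrics code: the function name instant_value and the sample data are made up, and the exact boundary handling in the real engine may differ. It only shows why one pushed sample appears to repeat for 5 minutes.

```python
from bisect import bisect_right

LOOKBACK_SECONDS = 5 * 60  # the default 5m window described above


def instant_value(samples, query_ts, lookback=LOOKBACK_SECONDS):
    """Return the closest sample value at or before query_ts, or None.

    samples: list of (unix_timestamp, value) tuples sorted by timestamp.
    A sample older than `lookback` seconds is treated as missing.
    """
    timestamps = [ts for ts, _ in samples]
    idx = bisect_right(timestamps, query_ts) - 1  # last sample <= query_ts
    if idx < 0:
        return None  # nothing was written before the query time
    ts, value = samples[idx]
    if query_ts - ts > lookback:
        return None  # the sample fell out of the lookback window
    return value


# One pushed sample (value 30 at t=1000), evaluated every 15 seconds.
pushed = [(1000, 30)]
series = [(t, instant_value(pushed, t)) for t in range(1000, 1400, 15)]
```

In this model every evaluation point up to 5 minutes after the write sees the value 30, and later points see nothing, which matches the repetition described in the question.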

Using 1 Dataflow Job (Apache Beam Pipeline) to aggregate data on different window periods, and write them to different column families in BigTable

I am trying to optimize my Apache Beam pipeline on Google Cloud Platform Dataflow.
Background information: I am trying to read streaming data from PubSub messages and aggregate it over 3 time windows: 1 min, 5 min and 60 min. The aggregations consist of summing, averaging, finding the maximum or minimum, etc. For example, for all data collected from 1200 to 1201, I want to aggregate it and write the output into BigTable's 1-min column family. For all data collected from 1200 to 1205, I want to aggregate it in the same way and write the output into BigTable's 5-min column family. The same goes for 60 min.
The current approach I took is to have 3 separate dataflow jobs (i.e. 3 separate Beam Pipelines), each one having a different window duration (1min, 5min and 60min). See https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html. And the outputs of all 3 dataflow jobs are written to the same BigTable, but on different column families. Other than that, the function and aggregations of the data are the same for the 3 jobs.
However, this seems to be very computationally inefficient, and cost inefficient, as the 3 jobs are essentially doing the same function, with the only exception being the window time duration and output column family.
One challenge we faced is that, from the Apache Beam documentation, it seems we cannot create multiple windows of different periods in a single Dataflow job. Also, when we write the final data into BigTable, we have to define the table, column family, column, and row key, and unfortunately the column family is a fixed property (i.e. it cannot be redefined or changed based on the window period).
Hence, I am wondering whether there is a way to use only 1 Dataflow job (i.e. 1 Apache Beam pipeline) to fulfil the objective of this project, which is to aggregate data on different window periods and write the results to different column families of the same BigTable.
I was considering using Split stream: first window by 1-min, then split into 3 streams (1 write to bigtable for 1-min interval, another for 5-min aggregation, and another for 60-min aggregation). However, the problem is that we are working with streaming data and not batch data.
Thank you

How to aggregate the time between pairs of logs in CloudWatch

Suppose you have logs with some transaction ID and timestamp
12:00: transactionID1 handled by funcX
12:01: transactionID2 handled by funcX
12:03: transactionID2 handled by funcY
12:04: transactionID1 handled by funcY
I want to get the time between 2 logs of the same event and aggregate (e.g. sum, avg) the time difference.
For example, for transactionID1 the time difference would be (12:04 - 12:00) 4 min, and for transactionID2 it would be (12:03 - 12:01) 2 min. Then I'd like to take the average of all these time differences, so (4+2)/2, or 3 min.
Is there a way to do that?
This doesn't seem possible with CloudWatch alone. I don't know where your logs come from, e.g. EC2, Lambda function. What you could do is to use the AWS SDK to create custom metrics.
Approach 1
If the logs are written by the same process, you can keep a map of transactionID to startTime in memory, calculate the metric value from the startTime when the second log arrives, and publish it as a custom metric with transactionID as a dimension. If the logs come from different processes, e.g. Lambda function invocations, you can use DynamoDB to store the startTime.
Approach 2
If the transactions are independent you could also create custom metrics per transaction and use CloudWatch DIFF_TIME which will create a calculated metric with values for each transaction.
With CloudWatch AVG it should then be possible to calculate the average duration.
Personally, I have used the first approach to calculate a duration across Lambda functions and other services.
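A rough sketch of the in-memory variant of Approach 1, using the four sample log lines from the question. The function names are hypothetical, the parsing assumes exactly the "HH:MM: transactionID handled by func" layout shown there, and publishing the resulting durations as a custom metric is left out.

```python
from datetime import datetime

# The four sample log lines from the question.
LOG_LINES = [
    "12:00: transactionID1 handled by funcX",
    "12:01: transactionID2 handled by funcX",
    "12:03: transactionID2 handled by funcY",
    "12:04: transactionID1 handled by funcY",
]


def transaction_durations(lines):
    """Pair the first and second log of each transaction; return seconds per ID."""
    start_times = {}  # transactionID -> timestamp of the first log
    durations = {}    # transactionID -> elapsed seconds between the two logs
    for line in lines:
        time_part, rest = line.split(": ", 1)
        transaction_id = rest.split()[0]
        timestamp = datetime.strptime(time_part, "%H:%M")
        if transaction_id in start_times:
            elapsed = timestamp - start_times.pop(transaction_id)
            durations[transaction_id] = elapsed.total_seconds()
        else:
            start_times[transaction_id] = timestamp
    return durations


def average(values):
    """Arithmetic mean, or 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


durations = transaction_durations(LOG_LINES)
print(durations)                    # per-transaction seconds
print(average(durations.values()))  # mean duration in seconds
```

For the sample lines this pairs 12:00 with 12:04 and 12:01 with 12:03, giving 240 and 120 seconds. When the two logs come from different processes, the start_times dictionary would be replaced by an external store such as the DynamoDB table mentioned above.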

ElastiCache cloudwatch metrics for Redis: currItems for a single database

I have set up a metric in the AWS console on the ElastiCache Redis cluster. I'm watching for a value of currItems greater than a certain number for a given period (say 1000 for 1 minute).
The issue is that I have two databases in Redis, named 0 and 1. I would like to get the currItems for database 0 only, not database 1, since database 1 holds values for a longer period of time and makes the whole metric look much bigger than it should be (since I want the current items of database 0).
Is there a way to create a metric that would only get the currItems of the database 0?
You will have to create an application for this or use existing redis tools.
https://stackoverflow.com/questions/8614737/what-are-some-recommended-tools-to-monitor-a-redis-database
If you are using new relic
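As one possible shape for such an application, the sketch below parses the text returned by the Redis INFO keyspace command, which reports key counts per database. The function name keys_per_database and the sample text are illustrative, and fetching the text from the cluster and publishing the db0 value as a custom metric are not shown.

```python
import re

# Matches lines such as "db0:keys=1200,expires=10,avg_ttl=3600".
KEYSPACE_LINE = re.compile(r"^db(\d+):keys=(\d+),")


def keys_per_database(info_keyspace_text):
    """Return {database_index: key_count} parsed from INFO keyspace output."""
    counts = {}
    for line in info_keyspace_text.splitlines():
        match = KEYSPACE_LINE.match(line.strip())
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


# Made-up sample of what INFO keyspace returns for two databases.
sample = (
    "# Keyspace\r\n"
    "db0:keys=1200,expires=10,avg_ttl=3600\r\n"
    "db1:keys=50000,expires=0,avg_ttl=0\r\n"
)
db0_items = keys_per_database(sample).get(0, 0)
```

Only the count for database 0 is kept here, so the long-lived keys in database 1 no longer inflate the number.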

How to create inhouse funnel analytics?

I want to create in-house funnel analysis infrastructure.
All the user activity feed information would be written to a database / DW of choice and then, when I dynamically define a funnel I want to be able to select the count of sessions for each stage in the funnel.
I can't find an example of creating such a thing anywhere. Some people say I should use Hadoop and MapReduce for this but I couldn't find any examples online.
Your MapReduce is pretty simple:
Mapper reads a session row from the log file; its output is (stage-id, 1)
Set number of Reducers to be equal to the number of stages.
Reducer sums values for each stage. Like in wordcount example (which is a "Hello World" for Hadoop - https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0).
You will have to set up a Hadoop cluster (or use Elastic Map Reduce on Amazon).
To define funnel dynamically you can use DistributedCache feature of Hadoop. To see results you will have to wait for MapReduce to finish (minimum dozens of seconds; or minutes in case of Amazon's Elastic MapReduce; the time depends on the amount of data and the size of your cluster).
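The mapper and reducer described above can be mimicked in plain Python to see the data flow. This is a local simulation, not Hadoop code: the "session_id,stage_id" row layout and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def mapper(row):
    """Emit (stage_id, 1) for one log row of the assumed form 'session_id,stage_id'."""
    _session_id, stage_id = row.strip().split(",")
    yield stage_id, 1


def reducer(stage_id, counts):
    """Sum the ones emitted for a stage, as in the wordcount example."""
    return stage_id, sum(counts)


def run_job(rows):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


rows = [
    "s1,landing", "s2,landing", "s3,landing",
    "s1,signup", "s2,signup",
    "s1,purchase",
]
funnel = run_job(rows)
```

This counts rows per stage, so it matches session counts only if each session is logged at most once per stage.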
Another solution that may give you results faster - use a database: select stage, count(distinct session_id) from mylogs group by stage;
If you have too much data to quickly execute that query (it does a full table scan; HDD transfer rate is about 50-150MB/sec - the math is simple) - then you can use a distributed analytic database that runs over HDFS (distributed file system of Hadoop).
In this case your options are (I list here open-source projects only):
Apache Hive (based on MapReduce of Hadoop, but if you convert your data to Hive's ORC format - you will get results much faster).
Cloudera's Impala - not based on MapReduce, can return your results in seconds. For fastest results convert your data to Parquet format.
Shark/Spark - in-memory distributed database.
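For small data sets, the plain database route can be tried locally before reaching for any of these engines. The sketch below uses Python's built-in sqlite3 with the mylogs table and the session_id and stage columns from the query above. The in-memory database, the sample rows, and the function name sessions_per_stage are assumptions for illustration.

```python
import sqlite3


def sessions_per_stage(rows):
    """Count distinct sessions per funnel stage from (session_id, stage) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mylogs (session_id TEXT, stage TEXT)")
    conn.executemany("INSERT INTO mylogs VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT stage, COUNT(DISTINCT session_id) FROM mylogs GROUP BY stage"
    ).fetchall()
    conn.close()
    return dict(result)


rows = [
    ("s1", "landing"), ("s1", "landing"),  # a repeated event in one session
    ("s2", "landing"),
    ("s1", "signup"),
]
counts = sessions_per_stage(rows)
```

Because of COUNT(DISTINCT session_id), the repeated landing event for s1 is counted once.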