How do I record multidimensional metrics in CloudWatch?

I want to be able to graph two or more metrics that come from the same event e.g. response times from an API versus the size of the response. Is there a way to group two or more metrics as being a multidimensional data point?
I don't see that multiple MetricDatum logged in one PutMetricDataRequest are grouped in any way.

You need to use the MetricDatum dimension methods (withDimensions, setDimensions).
See: https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/javav2/example_code/cloudwatch/src/main/java/com/example/cloudwatch/PutMetricData.java#L62-L82
An aws-cli example that retrieves a metric with multiple dimensions:
aws cloudwatch get-metric-statistics --metric-name Buffers --namespace MyNameSpace --dimensions Name=InstanceId,Value=1-23456789 Name=InstanceType,Value=m1.small --start-time 2016-10-15T04:00:00Z --end-time 2016-10-19T07:00:00Z --statistics Average --period 60
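To publish such a data point in the first place, here is a hedged sketch using Python/boto3's put_metric_data; the namespace, metric, and dimension values simply mirror the CLI example above and are not from the original answer:

import boto3

cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='MyNameSpace',
    MetricData=[{
        'MetricName': 'Buffers',
        # dimensions identify this particular metric; all values here are placeholders
        'Dimensions': [
            {'Name': 'InstanceId', 'Value': '1-23456789'},
            {'Name': 'InstanceType', 'Value': 'm1.small'},
        ],
        'Unit': 'Bytes',
        'Value': 231434333,
    }],
)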

Related

How to set up job dependencies in google bigquery?

I have a few jobs: say one loads a text file from a Google Cloud Storage bucket into a BigQuery table, and another is a scheduled query that copies data from one table to another table with some transformation. I want the second job to depend on the success of the first one. How do we achieve this in BigQuery, if it is possible at all?
Many thanks.
Right now a developer needs to put together the chain of operations.
It can be done either using Cloud Functions (supports Node.js, Go, Python) or via a Cloud Run container (supports the gcloud API and any programming language).
Basically you need to (see the sketch below):
- issue a job
- get the job ID
- poll for the job ID
- when the job is finished, trigger the other steps
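A minimal sketch of that issue/poll pattern using the google-cloud-bigquery Python client; the query and table names are placeholders, not part of the original answer:

from google.cloud import bigquery

client = bigquery.Client()
# issue a job and get its job id
job = client.query("SELECT COUNT(*) FROM `my_dataset.my_table`")
print("job id:", job.job_id)

try:
    job.result()  # blocks (polls) until the job completes; raises if it failed
    print("job succeeded, trigger the next step here")
except Exception as exc:
    print("job failed:", exc)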
If using Cloud Functions:
- place the file into a dedicated GCS bucket
- set up a GCF that monitors that bucket; when a new file is uploaded, it executes a function that imports the file into BigQuery and waits until the operation ends
- at the end of the GCF you can trigger other functions for the next step
Another use case with Cloud Functions (a rough sketch follows this list):
A: a trigger starts the GCF
B: the function executes the query (copies data to another table)
C: it gets a job ID and fires another function with a bit of delay
I: a function gets the job ID
J: it polls: is the job ready?
K: if not ready, it fires itself again with a bit of delay
L: if ready, it triggers the next step, which could be a dedicated function or a parameterized function
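A hypothetical sketch of steps I..L: a Pub/Sub-triggered Cloud Function that checks the BigQuery job and either re-publishes its own trigger message after a short delay or kicks off the next step. The topic names, payload shape, and the sleep-based delay are assumptions, not part of the original answer.

import base64
import json
import time

from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
publisher = pubsub_v1.PublisherClient()

def check_job(event, context):
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    job = bq.get_job(payload["job_id"])            # I: a function gets the job id
    if job.state != "DONE":                        # J: poll - is the job ready?
        time.sleep(30)                             # K: not ready - wait a bit, then fire itself again
        publisher.publish("projects/my-project/topics/check-job",
                          json.dumps(payload).encode("utf-8"))
    elif job.error_result is None:                 # L: ready - trigger the next step
        publisher.publish("projects/my-project/topics/next-step",
                          json.dumps(payload).encode("utf-8"))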
It is possible to address your scenario with either Cloud Functions (CF) or with a scheduler (Airflow). The first approach is event-driven, so your data gets crunched immediately; with a scheduler, expect some delay in data availability.
As has been stated, once you submit a BigQuery job you get back a job ID, which needs to be checked until the job completes. Then, based on the status, you can handle the on-success or on-failure post actions respectively.
If you were to develop a CF, note that there are certain limitations, like the maximum execution time of 9 minutes, which you would have to address in case the BigQuery job takes longer than that to complete. Another challenge with CF is idempotency: if the same data-file event arrives more than once, processing it should not result in duplicate data.
Alternatively, you can consider using some event-driven serverless open source projects like BqTail - Google Cloud Storage BigQuery Loader with post-load transformation.
Here is an example of the bqtail rule.
rule.yaml
When:
  Prefix: "/mypath/mysubpath"
  Suffix: ".json"
Async: true
Batch:
  Window:
    DurationInSec: 85
Dest:
  Table: bqtail.transactions
  Transient:
    Dataset: temp
    Alias: t
  Transform:
    charge: (CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)
  SideInputs:
    - Table: bqtail.fees
      Alias: f
      'On': t.fee_id = f.id
OnSuccess:
  - Action: query
    Request:
      SQL: SELECT
        DATE(timestamp) AS date,
        sku_id,
        supply_entity_id,
        MAX($EventID) AS batch_id,
        SUM(payment) payment,
        SUM((CASE WHEN type_id = 1 THEN t.payment + f.value WHEN type_id = 2 THEN t.payment * (1 + f.value) END)) charge,
        SUM(COALESCE(qty, 1.0)) AS qty
        FROM $TempTable t
        LEFT JOIN bqtail.fees f ON f.id = t.fee_id
        GROUP BY 1, 2, 3
      Dest: bqtail.supply_performance
      Append: true
    OnFailure:
      - Action: notify
        Request:
          Channels:
            - "#e2e"
          Title: Failed to aggregate data to supply_performance
          Message: "$Error"
    OnSuccess:
      - Action: query
        Request:
          SQL: SELECT CURRENT_TIMESTAMP() AS timestamp, $EventID AS job_id
          Dest: bqtail.supply_performance_batches
          Append: true
      - Action: delete
You want to use an orchestration tool, especially if you want to set up these tasks as recurring jobs.
We use Google Cloud Composer, a managed service based on Airflow, for workflow orchestration, and it works great. It comes with automatic retries, monitoring, alerting, and much more.
You might want to give it a try.
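As an illustration only, here is a hedged sketch of the load-then-transform dependency expressed as an Airflow DAG; the operator names come from the apache-airflow-providers-google package, and the bucket, table, and SQL are placeholders, not from the original answer:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG("load_then_transform", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    load = GCSToBigQueryOperator(
        task_id="load_file",
        bucket="my-bucket",
        source_objects=["incoming/data.csv"],
        destination_project_dataset_table="my_project.my_dataset.staging",
        write_disposition="WRITE_TRUNCATE",
    )
    transform = BigQueryInsertJobOperator(
        task_id="copy_with_transform",
        configuration={"query": {"query": "SELECT * FROM `my_project.my_dataset.staging`",
                                 "useLegacySql": False}},
    )
    # the transform task runs only if the load task succeeded
    load >> transform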
Basically you can use Cloud Logging to observe almost all kinds of operations in GCP.
BigQuery is no exception. When the query job has completed, you can find the corresponding log in the log viewer.
The next question is how to pinpoint the exact query you want; one way to achieve this is to use a labeled query (that is, attach labels to your query) [1].
For example, you can use the bq command below to issue a query with the foo:bar label:
bq query \
--nouse_legacy_sql \
--label foo:bar \
'SELECT COUNT(*) FROM `bigquery-public-data`.samples.shakespeare'
Then, when you go to the Logs Viewer and issue the log filter below, you will find exactly the log generated by the query above.
resource.type="bigquery_resource"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.labels.foo="bar"
The next question is how to emit an event based on this log for the next workload. This is where Cloud Pub/Sub comes into play.
Two ways to publish an event based on a log pattern are:
Log Routers: set a Pub/Sub topic as the destination [2]
Log-based Metrics: create an alert policy whose notification channel is Pub/Sub [3]
So, the next workload can subscribe to the Pub/Sub topic, and be triggered when the previous query has completed.
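As a hypothetical sketch of such a subscriber (the function name and wiring are assumptions): a Pub/Sub-triggered Cloud Function that decodes the routed log entry and kicks off the next workload when the labeled query has completed. The payload path mirrors the log filter above.

import base64
import json

def on_query_completed(event, context):
    # the Pub/Sub message data is the routed LogEntry, base64-encoded JSON
    entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    labels = (entry.get("protoPayload", {})
                   .get("serviceData", {})
                   .get("jobCompletedEvent", {})
                   .get("job", {})
                   .get("jobConfiguration", {})
                   .get("labels", {}))
    if labels.get("foo") == "bar":
        print("labeled query finished, trigger the next workload here")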
Hope this helps ~
[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfiguration
[2] https://cloud.google.com/logging/docs/routing/overview
[3] https://cloud.google.com/logging/docs/logs-based-metrics

AWS Glue ETL "Failed to delete key: target_folder/_temporary" caused by S3 exception "Please reduce your request rate"

A Glue job configured with a maximum capacity of 10 nodes, 1 job in parallel, and no retries on failure is giving the error "Failed to delete key: target_folder/_temporary"; according to the stack trace, the issue is that the S3 service starts blocking the Glue requests because of the volume of requests: "AmazonS3Exception: Please reduce your request rate."
Note: The issue is not with IAM as the IAM role that glue job is using has permissions to delete objects in S3.
I found a suggestion for this issue on GitHub with a proposition of reducing the worker count: https://github.com/aws-samples/aws-glue-samples/issues/20
"I've had success reducing the number of workers."
However, I don't think that 10 is too many workers and would even like to actually increase the worker count to 20 to speed up the ETL.
Did anyone who faced this issue have any success? How would I go about solving it?
Shortened stacktrace:
py4j.protocol.Py4JJavaError: An error occurred while calling o151.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: target_folder/_temporary
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
...
Caused by: java.io.IOException: 1 exceptions thrown from 12 batch deletes
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:384)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
...
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ...
Part of Glue ETL python script (just in case):
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name", transformation_ctx="datasource0")
# ... relationalizing, renaming, etc. Transforming from DynamicFrame to PySpark DataFrame and back.
partition_ready = Map.apply(frame=processed_dataframe, f=map_date_partition, transformation_ctx="map_date_partition")
datasink = glueContext.write_dynamic_frame.from_options(frame=partition_ready, connection_type="s3", connection_options={"path": "s3://bucket/target_folder", "partitionKeys": ["year", "month", "day", "hour"]}, format="parquet", transformation_ctx="datasink")
job.commit()
Solved (kind of), thank you to user ayazabbas.
I accepted the answer that pointed me in the right direction toward a solution. One of the things I was looking for was how to merge many small files into big chunks, and repartition does exactly that. Instead of repartition(x) I used coalesce(x), where x is 4 * the worker count of the Glue job, so that the Glue service can allocate each data chunk to an available vCPU resource (see the sketch below). It might make sense to have x be at least 2 * 4 * worker_count to account for slower and faster transformation parts, if they exist.
Another thing I did was reduce the number of columns by which I was partitioning the data before writing it to S3 from 5 to 4.
The current drawback is that I haven't figured out how to find, from within the Glue script, the worker count that the Glue service allocates for the job, so the number is hardcoded according to the job configuration (the Glue service sometimes allocates more nodes than configured).
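A rough sketch of that workaround, reusing the names from the question's script; worker_count is hardcoded as noted above, and its value here is just an assumption:

from awsglue.dynamicframe import DynamicFrame

worker_count = 10  # hardcoded to match the job configuration
# coalesce into fewer, larger chunks before writing so the S3 request rate stays lower
coalesced_df = partition_ready.toDF().coalesce(4 * worker_count)
partition_ready = DynamicFrame.fromDF(coalesced_df, glueContext, "partition_ready")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=partition_ready,
    connection_type="s3",
    connection_options={"path": "s3://bucket/target_folder",
                        "partitionKeys": ["year", "month", "day", "hour"]},
    format="parquet",
    transformation_ctx="datasink")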
I had this same issue. I worked around it by running repartition(x) on the dynamic frame before writing to S3. This forces x files per partition, and the max parallelism during the write process will be x, reducing the S3 request rate.
I set x to 1 because I wanted 1 parquet file per partition, so I'm not sure what the safe upper limit of parallelism is before the request rate gets too high.
I couldn't figure out a nicer way to solve this issue; it's annoying because you have so much idle capacity during the write process.
Hope that helps.

How to interpret the RabbitMQ Message stats?

I want to get and historize queue metrics for "Enqueued, Dequeued and Size" (terminology formerly encountered on ActiveMQ).
The moving charts provided in the management plugin are not enough for the monitoring that I need to do.
So with RabbitMQ, I'm getting data from https://rabbitmq-server:15672/api/queues/myvhost
This returns JSON. For a queue, I can obtain real-life production data like:
"messages":0, // for "Size"
"message_stats":{
"deliver_get":171528, // for "Dequeued"
"ack":162348,
"redeliver":9513,
"deliver_no_ack":0,
"deliver":171528,
"get":0,
"publish":51293 // for "Enqueued"
(...)
I'm particularly surprised by the publish counter: its value can even decrease between two measurements taken a couple of minutes apart (see the sample chart around 17:00).
As you can see in my data, deliver_get is significantly larger than publish.
https://my-rabbitmq:15672/doc/stats.html doesn't give a lot of details that could explain what I'm actually seeing.
Also, in the message_stats object that I obtain, I'm missing some counters like confirm and return, which could be related to enqueuing.
Are there relationships between these metrics? (Something like deliver_get + messages = redeliver + publish, but that one doesn't work with my figures.)
Is there other, more detailed documentation about these metrics?

Subtract a value x from a Prometheus metric for a Grafana "single stat" with "Delta" activated?

Maybe the problem should be solved in another way.
I am using a Grafana SingleStat panel with "Delta" activated, and the dashboard shows me "today so far".
Prometheus metric: sum(request_duration_count{...})
Problem:
I have a metric counting requests. Between 03:00 and 06:00 an automated test triggers my service and the metric is incremented by value x. (I set a Grafana annotation at the starting point.)
I want to get a single stat request count without the test requests.
Advanced Problem:
Nagios checks every y minutes and also increments the counter.
How can I remove these "test-counts"?
Any ideas?

Aerospike AQL count(*) SQL analogue script

Ok, so the problem is that I need to do aggregation queries on Aerospike's aql console. Specifically, I would like to take the average of a bin across the records in a set and to count all the records in a set. I am not sure how to even begin...
aql> SHOW SETS will give you the number of objects in your sets, in the n_objects column.
Then you can use the n_objects values to calculate your average.
SQL-like aggregation functions are implemented in Aerospike using stream UDFs, which are written in Lua. A stream UDF is a map-reduce operation that is applied on a stream of records returned from a scan or secondary index query.
The stream UDF module (let's assume it's contained in the file aggr_funcs.lua) would implement COUNT(*) by returning 1 for each record it sees, and reducing to an aggregated integer value.
local function one(record)
    return 1
end

local function sum(v1, v2)
    return v1 + v2
end

function count_star(stream)
    return stream : map(one) : reduce(sum)
end
You would register the UDF module with the server, then invoke it. Here's an example of how you'd do that in Python using aerospike.Query.apply:
import aerospike
from aerospike import predicates as p
config = {'hosts': [('127.0.0.1', 3000)],
          'lua': {'system_path': '/usr/local/aerospike/lua/',
                  'user_path': '/usr/local/aerospike/usr-lua/'}}
client = aerospike.client(config).connect()
query = client.query('test', 'demo')
# query.where(p.between('my_val', 1, 9))  # optionally use a WHERE predicate
query.apply('aggr_funcs', 'count_star')
num_records = query.results()
client.close()
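If the module has not been registered with the server yet, here is a minimal sketch of doing so from the Python client, assuming the stream UDF above was saved locally as aggr_funcs.lua:

import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()
client.udf_put('aggr_funcs.lua')  # upload and register the Lua module with the server
client.close()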
However, you should get metrics such as the number of objects using an info command. Aerospike has an info subsystem that is used by the command line tools such as asinfo, the AMC dashboard, and the info methods of the language clients.
To get the number of objects in the cluster:
asinfo -h 33.33.33.91 -v 'objects'
23773
You can also get the number of objects in a specific namespace. I have a two node cluster, and I'll query each one:
asinfo -h 33.33.33.91 -v 'namespace/test'
type=device;objects=23773;sub-objects=0;master-objects=12274;master-sub-objects=0;prole-objects=11499;prole-sub-objects=0;expired-objects=0;evicted-objects=0;set-deleted-objects=0;nsup-cycle-duration=0;nsup-cycle-sleep-pct=0;used-bytes-memory=2139672;data-used-bytes-memory=618200;index-used-bytes-memory=1521472;sindex-used-bytes-memory=0;free-pct-memory=99;max-void-time=202176396;non-expirable-objects=0;current-time=201744558;stop-writes=false;hwm-breached=false;available-bin-names=32765;used-bytes-disk=6085888;free-pct-disk=99;available_pct=99;memory-size=2147483648;high-water-disk-pct=50;high-water-memory-pct=60;evict-tenths-pct=5;evict-hist-buckets=10000;stop-writes-pct=90;cold-start-evict-ttl=4294967295;repl-factor=2;default-ttl=432000;max-ttl=0;conflict-resolution-policy=generation;single-bin=false;ldt-enabled=false;ldt-page-size=8192;enable-xdr=false;sets-enable-xdr=true;ns-forward-xdr-writes=false;allow-nonxdr-writes=true;allow-xdr-writes=true;disallow-null-setname=false;total-bytes-memory=2147483648;read-consistency-level-override=off;write-commit-level-override=off;migrate-order=5;migrate-sleep=1;migrate-tx-partitions-initial=4096;migrate-tx-partitions-remaining=0;migrate-rx-partitions-initial=4096;migrate-rx-partitions-remaining=0;migrate-tx-partitions-imbalance=0;total-bytes-disk=8589934592;defrag-lwm-pct=50;defrag-queue-min=0;defrag-sleep=1000;defrag-startup-minimum=10;flush-max-ms=1000;fsync-max-sec=0;max-write-cache=67108864;min-avail-pct=5;post-write-queue=0;data-in-memory=true;file=/opt/aerospike/data/test.dat;filesize=8589934592;writethreads=1;writecache=67108864;obj-size-hist-max=100
asinfo -h 33.33.33.92 -v 'namespace/test'
type=device;objects=23773;sub-objects=0;master-objects=11499;master-sub-objects=0;prole-objects=12274;prole-sub-objects=0;expired-objects=0;evicted-objects=0;set-deleted-objects=0;nsup-cycle-duration=0;nsup-cycle-sleep-pct=0;used-bytes-memory=2139672;data-used-bytes-memory=618200;index-used-bytes-memory=1521472;sindex-used-bytes-memory=0;free-pct-memory=99;max-void-time=202176396;non-expirable-objects=0;current-time=201744578;stop-writes=false;hwm-breached=false;available-bin-names=32765;used-bytes-disk=6085888;free-pct-disk=99;available_pct=99;memory-size=2147483648;high-water-disk-pct=50;high-water-memory-pct=60;evict-tenths-pct=5;evict-hist-buckets=10000;stop-writes-pct=90;cold-start-evict-ttl=4294967295;repl-factor=2;default-ttl=432000;max-ttl=0;conflict-resolution-policy=generation;single-bin=false;ldt-enabled=false;ldt-page-size=8192;enable-xdr=false;sets-enable-xdr=true;ns-forward-xdr-writes=false;allow-nonxdr-writes=true;allow-xdr-writes=true;disallow-null-setname=false;total-bytes-memory=2147483648;read-consistency-level-override=off;write-commit-level-override=off;migrate-order=5;migrate-sleep=1;migrate-tx-partitions-initial=4096;migrate-tx-partitions-remaining=0;migrate-rx-partitions-initial=4096;migrate-rx-partitions-remaining=0;migrate-tx-partitions-imbalance=0;total-bytes-disk=8589934592;defrag-lwm-pct=50;defrag-queue-min=0;defrag-sleep=1000;defrag-startup-minimum=10;flush-max-ms=1000;fsync-max-sec=0;max-write-cache=67108864;min-avail-pct=5;post-write-queue=0;data-in-memory=true;file=/opt/aerospike/data/test.dat;filesize=8589934592;writethreads=1;writecache=67108864;obj-size-hist-max=100
Notice that the value of master-objects on each of the nodes adds up together to the cluster-wide objects value.
To get the number of objects in a set:
asinfo -h 33.33.33.91 -v 'sets/test/demo'
n_objects=23771:n-bytes-memory=618046:stop-writes-count=0:set-enable-xdr=use-default:disable-eviction=false:set-delete=false;
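The same info commands can also be issued from the language clients; for example, a hedged sketch using the Python client's info_all (the exact return shape depends on the client version, so treat this as an assumption):

import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()
# ask every node in the cluster for the set statistics shown above
responses = client.info_all('sets/test/demo')
print(responses)
client.close()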