How do I find all AWS CloudWatch metrics that use high resolution? - amazon-cloudwatch

I ran into this error in AWS CloudWatch, which does not make sense, as I thought we had zero high-resolution metrics (high-resolution data is only retained for 3 hours). We typically just do 1-minute interval reporting. How do I find all metrics that are high resolution? That way I am hoping I can change them to no longer be high resolution.
I searched the documentation a ton and looked into the Micrometer code, which seems to default to highResolution = false and a step of 2 minutes (we are using Micrometer). I am trying to figure out next steps for working out why AWS thinks this data is high resolution.
I was also thinking, 'OK, perhaps it would roll up to 1-minute data and then 5-minute data,' so in my query I tried 1-minute and 5-minute periods, but I still get the error about only 3 hours of data.

The error is thrown because you're using the query syntax (SELECT ...), which only supports the latest 3 hours of data. The feature is called Metrics Insights; you can see its limits here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-limits.html
The error is not related to high-resolution metrics. Even if your metrics were high resolution, setting the period to 5 minutes means you would only retrieve datapoints aggregated to 5-minute granularity.
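As a minimal sketch with the AWS CLI (the namespace and metric name are placeholders rather than anything from your setup, and the date commands assume GNU date): the plain GetMetricStatistics call below uses no SELECT syntax, so it is not limited to 3 hours; and, as far as I understand the API, a sub-minute period only returns datapoints for metrics that were actually published with StorageResolution=1, so the second call can double as a quick probe for high resolution:
# Standard (non-Metrics-Insights) query over a full week; add
# --dimensions Name=...,Value=... to match how the metric is published.
aws cloudwatch get-metric-statistics \
  --namespace "MyApp" \
  --metric-name "http.server.requests" \
  --start-time "$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average

# Probe for high resolution: sub-minute datapoints are only retained for
# 3 hours, so query a recent window with a 1-second period; a standard
# 1-minute metric should return no datapoints here.
aws cloudwatch get-metric-statistics \
  --namespace "MyApp" \
  --metric-name "http.server.requests" \
  --start-time "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 1 \
  --statistics Average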

Related

After importing a metric into Victoria Metrics, the metric is repeated for 5 minutes. What controls this behavior?

I am writing some software that will be pushing data to Victoria Metrics, as below:
curl -d 'foo{bar="baz"} 30' -X POST 'http://[Victoria]/insert/0/prometheus/api/v1/import/prometheus'
I noticed that if I push a single metric like this, it does not show up as a single data point; rather, it shows up repeatedly, as if it were being scraped every 15 seconds, either until I push a new value for that metric or until 5 minutes pass.
What setting/mechanism is causing this 5-minute repeat period?
Pushing data with a timestamp does not change this: the metric still gets repeated for 5 minutes after that time, or until a change, regardless.
I don't necessarily need to alter this behavior, just trying to understand why it's happening.
How do you query the database?
I guess this behaviour is due to the range query concept and ephemeral datapoints; check this out:
https://docs.victoriametrics.com/keyConcepts.html#range-query
The interval between datapoints depends on the step parameter, which is 5 minutes when omitted.
If you want to receive only the real datapoints, use the export functions:
https://docs.victoriametrics.com/#how-to-export-time-series
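For example (the series selector just mirrors the metric from the question, and the path prefix may differ if you run the cluster version), the export endpoint returns only the raw samples that were actually ingested, without the ephemeral fill-in points:
curl 'http://<victoria-metrics-addr>/api/v1/export' -d 'match[]=foo{bar="baz"}'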
The VictoriaMetrics TSDB has ephemeral datapoints, which fill gaps by repeating the closest sample to the left of the requested timestamp.
So if you make the instant request:
curl "http://<victoria-metrics-addr>/api/v1/query?query=foo_bar&time=2022-05-10T10:03:00.000Z"
The time range in which VictoriaMetrics will try to locate a missing data sample is 5m by default and can be overridden via the step parameter.
step - optional max lookback window for searching for raw samples when executing the query. If step is skipped, then it is set to 5m (5 minutes) by default.
GET | POST /api/v1/query?query=...&time=...&step=...
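For example, passing step explicitly (the 15s value here is arbitrary, chosen to match the scrape-like interval described in the question) narrows the lookback window:
curl "http://<victoria-metrics-addr>/api/v1/query?query=foo_bar&time=2022-05-10T10:03:00.000Z&step=15s"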
You can read more about the key concepts in this part of the documentation: https://docs.victoriametrics.com/keyConcepts.html. There you can also find information about range queries and other TSDB concepts.

Google DataPrep is extremely slow

In Google Dataflow, I have a job that basically looks like this:
Dataset: 100 rows, 1 column.
Recipe: 0 steps
Output: New Table.
But it takes between 6 and 8 minutes to run. What could be the issue?
For a Dataprep/Dataflow setup, run times are usually measured in minutes, not seconds.
These tools are built for large datasets, and the duration stays roughly constant even if your data is 10 times the size.
Dataprep creates a Dataflow workflow for you and provisions a few VMs; that alone takes time, usually somewhere in the minute range. Only later does it scale up to 50 or 1,000 machines.

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (compressed CSV) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
We perform a transformation on 2 of the fields and write each row to BigQuery.
The transformation involves performing 3 regex operations (due to increase to 50 operations) on those 2 fields, which produces new fields/columns.
What we've got running so far:
We built a pipeline that reads a day's worth of files from GCS (31.3M records) and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirement is to process months and years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1 vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex nor CPU-intensive, i.e. it is just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load them into BigQuery using the command line/console. You'd probably save some dollars on instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, from my experience there is some overhead in bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
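A minimal sketch of that approach (the bucket, dataset, table, and schema file names are made up): have the pipeline write the transformed rows as CSV to GCS, then load them with the bq command-line tool, which runs a load job rather than streaming inserts and so is not subject to the per-table streaming limits mentioned in the next answer:
# Load all CSV shards written by the pipeline into one table, using a
# local JSON schema file.
bq load \
  --source_format=CSV \
  mydataset.adserver_logs \
  "gs://my-output-bucket/transformed/part-*.csv" \
  ./adserver_logs_schema.json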
BigQuery has a streaming write limit of 100,000 rows per second per table, i.e. 6M rows per minute. At 31M rows of input, that alone works out to roughly 31.3M / 100,000 ≈ 313 seconds, or a bit over 5 minutes of flat-out writes. When you add back the per-element processing time and then the synchronization time of the graph (read from GCS -> dispatch -> ...), these numbers look about right.
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net, increasing instances is not going to get you much more throughput right now.
Another approach - in the meantime, while we work on improving the BigQuery sync - would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X tables. Might be a fun experiment. :-)
Make sense?

How to reduce time allotted for a batch of HITs?

Today I created a small batch of 20 categorization HITs named "Grammatical or Ungrammatical" using the web UI. Can you tell me the easiest way to manage this batch so that I can reduce its time allotted from 1 hour to 15 minutes and also remove the Categorization Masters qualification? This is a very simple task that's set to auto-approve within 1 hour, and I am fine with that. I just need to make it more lucrative for people to attempt at the penny rate.
You need to register a new HITType with the relevant properties (reduced time allotted and no Masters qualification) and then perform a ChangeHITTypeOfHIT operation on each of the HITs in the batch.
API documentation here: http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_ChangeHITTypeOfHITOperation.html
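As a minimal sketch with the current AWS CLI (the linked documentation covers the legacy API; in the current API the operation is exposed as UpdateHITTypeOfHIT, and all titles, values, and IDs below are illustrative):
# Register a new HIT type: 15-minute assignment duration, 1-hour
# auto-approval, and no Masters qualification requirement (simply omit
# --qualification-requirements).
aws mturk create-hit-type \
  --title "Grammatical or Ungrammatical" \
  --description "Decide whether a sentence is grammatical" \
  --reward "0.01" \
  --assignment-duration-in-seconds 900 \
  --auto-approval-delay-in-seconds 3600

# Then move every HIT in the batch onto the new HIT type
# (aws mturk list-hits can help enumerate the HIT IDs).
aws mturk update-hit-type-of-hit \
  --hit-id <HIT_ID> \
  --hit-type-id <NEW_HIT_TYPE_ID>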

Each request takes 25-30 seconds with the Google Analytics API?

I'm using the GAPI library (in PHP) to query the Google Analytics API.
I request 2 dimensions (pagePath, date), 2 metrics (pageviews, visits), a time range of the past 365 days, and 2 filters on pagePath. The average time to get data for one query is 25-30 seconds.
When I use only 1 metric (pageviews), the average response time is 3 seconds.
Why would there be such a difference when using 1 or 2 metrics?
I'm guessing that path/date/pageviews is stored pre-calculated, while path/date/visits needs to be calculated off the data store (be thankful you're not applying complicated segments; then it gets really slow).
There are hints about how this might work in the Google Bigtable paper.