Count unique values in an AWS CloudWatch metric - amazon-cloudwatch

I have a set of CloudWatch logs in JSON format that contain a username field. How can I write a CloudWatch metric query that counts the number of unique users per month?

Now you can count unique field values using the count_distinct function inside CloudWatch Insights queries.
Example:
fields userId, @timestamp
| stats count_distinct(userId)
More info on CloudWatch Insights: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html
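count_distinct on its own gives one number for the whole query time range, so to get a per-month figure you can either set the query's time range to the month in question and run the stats line above, or bucket by a fixed period with bin(). A rough sketch, assuming bin() accepts a 30-day period as an approximation of a calendar month:
fields userId, @timestamp
| stats count_distinct(userId) as uniqueUsers by bin(30d)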

You can now do this using CloudWatch Insights!
API: https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_StartQuery.html
I am working on a similar problem and my query for this API looks something like:
fields @timestamp, @message
| filter @message like /User ID/
| parse @message "User ID: *" as @userId
| stats count(*) by @userId
to get the user IDs. Right now this returns a list of them with a count for each one. Getting a total count of unique users can either be done after getting the response, or probably by playing with the query some more.
You can easily play with queries using the CloudWatch Insights page in the AWS Console.
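Building on the query above, the total should also be obtainable directly by swapping the final stats line for count_distinct; a sketch using the same parse pattern:
fields @timestamp, @message
| filter @message like /User ID/
| parse @message "User ID: *" as @userId
| stats count_distinct(@userId) as uniqueUsers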

I think you can achieve that with the following query:
Log statement being parsed: "Trying to login user: abc ....."
fields @timestamp, @message
| filter @message like /Trying to login user/
| parse @message "Trying to login user: * and " as user
| sort @timestamp desc
| stats count(*) as loginCount by user | sort loginCount desc
This will print a table like this:
# user loginCount
1 user1 10
2 user2 15
......

I don't think you can.
Amazon CloudWatch Logs can scan log files for a specific string (eg "Out of memory"). When it encounters this string, it will increment a metric. You can then create an alarm for "When the number of 'Out of memory' errors exceeds 10 over a 15-minute period".
However, you are seeking to count unique users, which does not translate well into this method.
You could instead use Amazon Athena, which can run SQL queries against data stored in Amazon S3. For examples, see:
Analyzing Data in S3 using Amazon Athena
Using Athena to Query S3 Server Access Logs
Amazon Athena – Interactive SQL Queries for Data in Amazon S3
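To make the Athena route concrete: once the JSON logs are delivered to S3 and a table is defined over them, a per-month unique-user count is a simple aggregate. A minimal sketch, in which the table name app_logs and the columns username and event_time (an ISO-8601 string) are all assumptions about your schema:
-- Hypothetical table and column names; adjust to your exported log schema.
SELECT date_trunc('month', from_iso8601_timestamp(event_time)) AS month,
       COUNT(DISTINCT username) AS unique_users
FROM app_logs
GROUP BY 1
ORDER BY 1;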

Related

Can we use 'bq' to get only the changed data between two CSVs in a Google Cloud Storage bucket, for cost optimization?

cat compare_data_gse_query.sql | bq query --use_legacy_sql=False
(1) Cost optimization in GCP: a lot of the SQL queries need the changed data.
(2) How to compare CSV data (today | yesterday) for changes in GCS (Google Cloud Storage)?
(3) SELECT * is to be avoided. Is there any utility in the 'bq' CLI that could handle this?
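For illustration, one way to get only the changed rows entirely inside BigQuery is EXCEPT DISTINCT over two external tables; this is a sketch, and the table names staging.today_csv and staging.yesterday_csv are hypothetical, assumed to have already been defined over today's and yesterday's files:
bq query --use_legacy_sql=False \
'SELECT * FROM staging.today_csv
 EXCEPT DISTINCT
 SELECT * FROM staging.yesterday_csv'
The result is only the rows added or changed since yesterday, which downstream queries can read instead of scanning the full CSVs.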

Splunk search - how to reset stats by group instead of all stats for the search

I want to create a report on job execution times by job ID. The same job ID could be executed multiple times. The logs capture each time the job starts, but do not explicitly log when it finishes. For the purposes of this report, we will determine the stop time based on the last log captured for the job. To achieve this, I need to loop through the results, capturing the start time and the latest log time to determine the stop time.
The problem I'm running into is that I need to reset any stats captured for a job ID when the search detects that another instance of the job has started. I tried reset_before/reset_after/reset_on_change to achieve the desired results, but those trigger a reset of stats for all job IDs, not just the one that was re-executed. Here's a visual of the raw data and an example of the report I'm trying to generate.
Input data: (sample screenshot)
Desired result: (sample report screenshot)
Here is the start of the search... I removed the reset-stats attempts to avoid causing any confusion. This search pulls back the data, but I have not been successful in getting the stats to reset by job ID when a new job starts.
index=jobs message="*Started*" OR message="*processing*"
| rex field=message "#(?<JobID>[^\(]+)"
| stats earliest(_time) as start, latest(_time) as stop by JobID
| eval starttime=strftime(start,"%Y-%m-%d %H:%M")
| eval stoptime=strftime(stop,"%Y-%m-%d %H:%M")
| eval runtime=round((stop-start)/60/60,2)
| table JobID, starttime, stoptime, runtime
Any help is appreciated!
Have a look at the transaction command: https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Transaction
Also check out this documentation page: https://docs.splunk.com/Documentation/SplunkCloud/latest/Search/Identifyandgroupeventsintotransactions
I recreated your input data: https://imgur.com/a/bGnouiq
Now I run this transaction command:
yourbasesearch
| sort -_time
| transaction Job_ID startswith=(Message=Started)
This will group your events into transactions (separately for each job ID) whenever there is a new Message=Started event.
This is the result:
https://imgur.com/a/5Vi9gOo
You can get the stop time like this:
| eval stop_time=_time+duration
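Putting the pieces together with the search from the question, a sketch of the full report; the startswith term and the runtime math (hours) are assumptions carried over from the original post, so adjust them to your data:
index=jobs message="*Started*" OR message="*processing*"
| rex field=message "#(?<JobID>[^\(]+)"
| sort -_time
| transaction JobID startswith="Started"
| eval starttime=strftime(_time,"%Y-%m-%d %H:%M")
| eval stoptime=strftime(_time+duration,"%Y-%m-%d %H:%M")
| eval runtime=round(duration/60/60,2)
| table JobID, starttime, stoptime, runtime
Each transaction's _time is its earliest event and duration is the span to its latest event, so the start and stop times come straight from those two fields.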

How do I find out what is inserting data in my Azure Data Warehouse

I am using an Azure 'Synapse SQL Pool' (aka Data Warehouse) containing a table named 'DimClient'. I see in my database that new records are being added every day at a specific time. I've reviewed all the ADF pipelines and triggers but none of them are set to run at that time. I don't see any stored procedures that insert or update records in this table either. I can only conclude there is another process running that is adding those records.
I turned on 'Send to Log Analytics' to forward to a workspace and included the SqlRequests and ExecRequests categories. I waited a day and reviewed the logs using the following query:
AzureDiagnostics
| where Category == "SqlRequests" or Category == "ExecRequests"
| where Command_s contains "DimClient" ;
I get 'No Results Found' but when I query the table in SSMS, it contains new records that were added within the last 24 hours. How do I determine what is inserting these records?
You should get results; it takes some time for data to sync into Log Analytics. Also check the diagnostic settings on the Synapse pool.
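As a quick sanity check on whether the Synapse diagnostics are reaching the workspace at all, something along these lines can help (a sketch against the same AzureDiagnostics table):
// Which diagnostic categories have arrived in the last 24 hours, and how many rows of each.
AzureDiagnostics
| where TimeGenerated > ago(1d)
| summarize count() by Category
If SqlRequests and ExecRequests are missing from the output, the diagnostic settings (or the sync delay) are the likely cause rather than the DimClient filter.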

Load data from csv in google cloud storage as bigquery 'in' query

I want to compose a query like this in BigQuery, with my file stored in Google Cloud Storage:
select * from my_table where id in ('gs://bucket_name/file_name.csv')
I get no results. Is it possible, or am I missing something?
Using the CLI or the API, you can run ad hoc queries against GCS files without creating tables; a full example is covered in Accessing external (federated) data sources with BigQuery’s data access layer.
The code snippet is here:
bq query --external_table_definition=healthwatch::date:DATETIME,bpm:INTEGER,sleep:STRING,type:STRING#CSV=gs://healthwatch2/healthwatchdetail*.csv 'SELECT date,bpm,type FROM healthwatch WHERE type = "elevated" and bpm > 150;'
Waiting on bqjob_r5770d3fba8d81732_00000162ad25a6b8_1 ... (0s)
Current status: DONE
+---------------------+-----+----------+
| date | bpm | type |
+---------------------+-----+----------+
| 2018-02-07T11:14:44 | 186 | elevated |
| 2018-02-07T11:14:49 | 184 | elevated |
+---------------------+-----+----------+
On the other hand, you can create a permanent EXTERNAL table with an auto-detected schema, which makes it usable from the web UI and persistent; read more about that here: Querying Cloud Storage Data
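To tie this back to the IN (...) attempt in the question, a sketch of the permanent-external-table route in standard SQL; the dataset name mydataset, the external table name ids_from_csv, and the id column in the CSV are all assumptions:
-- Define a permanent external table over the CSV; with no explicit column list,
-- BigQuery auto-detects the schema.
CREATE EXTERNAL TABLE mydataset.ids_from_csv
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket_name/file_name.csv']
);
-- Then filter with a subquery instead of putting the path in the IN list.
SELECT *
FROM mydataset.my_table
WHERE id IN (SELECT id FROM mydataset.ids_from_csv);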

Generic way to stay within Google BigQuery SQL Query quota

This is the SQL query I am running against a public dataset:
SELECT
package,
COUNT(*) count
FROM (
SELECT
REGEXP_EXTRACT(line, '(.*)') package,
id
FROM (
SELECT
SPLIT(content, '\n') line,
id
FROM
[bigquery-public-data:github_repos.contents]
WHERE
sample_path LIKE '%.bashrc' OR sample_path LIKE '%.bash_profile')
GROUP BY
package,
id )
GROUP BY
1
ORDER BY
count DESC
LIMIT
400;
and this is the error message:
Error: Quota exceeded: Your project exceeded quota for free query
bytes scanned. For more information, see
https://cloud.google.com/bigquery/troubleshooting-errors
bigquery-public-data:github_repos.contents is too large for my quota.
bigquery-public-data:github_repos.sample_contents is too small for what I'm analyzing.
Is there any way to specify how much quota a query can utilize? For example, if I have a 1TB quota, is there a way to run this query against github_repos.contents (which would consume 2.15TB), but stop processing after consuming 1TB?
You can use custom cost controls. These can be set at the project level or per user, and the user can be a service account. By having a different service account run each query, you can effectively "specify how much quota a query can utilize".
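Separately from custom cost controls, the bq CLI also has a per-query cap, --maximum_bytes_billed; if a query would bill more than the cap it fails without incurring a charge, so it will not stop a 2.15 TB scan partway at 1 TB, but it does bound the worst case. A sketch with a 1 TiB cap:
# Fail the query outright if it would bill more than 1 TiB (1099511627776 bytes).
bq query --maximum_bytes_billed=1099511627776 \
  'SELECT COUNT(*) FROM [bigquery-public-data:github_repos.sample_contents]'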