AWS Glue metrics to populate Job name, job Status, Start time, End time and Elapsed time - amazon-cloudwatch

I tried various metrics options using glue.driver.*, but there is no clear way to get Job name, Job status, Start time, End time and Elapsed time in CloudWatch metrics. This info is already available under the Job runs history, but there is no way to get it in Metrics.
I found a few solutions where this can be achieved using a Lambda function, but there should be an easier way.
Please share ideas. Thanks.

We had the same issue. In order to track Glue job runs we ended up writing a small shell script which transformed the JSON output of https://docs.aws.amazon.com/cli/latest/reference/glue/list-jobs.html into a CSV with the job-run details.
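If it helps, here is a rough sketch of the same idea in Python with boto3 instead of a shell script (the output file name and CSV columns are just placeholders, and pagination via NextToken is left out for brevity):

import csv
import boto3

# Sketch: pull the job-run history that the Glue console shows and flatten it into a CSV.
glue = boto3.client("glue")

with open("glue_job_runs.csv", "w", newline="") as f:   # placeholder file name
    writer = csv.writer(f)
    writer.writerow(["job_name", "status", "start_time", "end_time", "execution_seconds"])
    for job_name in glue.list_jobs()["JobNames"]:
        for run in glue.get_job_runs(JobName=job_name)["JobRuns"]:
            writer.writerow([
                run["JobName"],
                run["JobRunState"],
                run.get("StartedOn"),
                run.get("CompletedOn"),
                run.get("ExecutionTime"),   # execution time in seconds
            ])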

Related

Python APScheduler: stop jobs before starting a new one

I need to start a job every 30 minutes, but before a new job is started I want the old instance of the same job to be terminated. This is to make sure the job always fetches the newest data file, which is constantly being updated.
Right now I'm using the BlockingScheduler paired with my own condition to stop the job (stop the job if it has processed 1k data points, etc.). I was wondering if APScheduler supports this "only one job at a time, and stop the old one before starting the new one" behavior natively.
I've read the docs, but I think the closest is still the default behavior, which equals max_instances=1; this just prevents new jobs from firing before the old job finishes, which is not what I'm looking for.
Any help is appreciated. Thanks!
After further research I came to the conclusion that this is not supported natively in APScheduler, but, inspired by Get number of active instances for BackgroundScheduler jobs, I modified that answer into a working way of detecting the number of currently running instances of the same job. So when you have an infinite loop/long task executing, and you want the new instance to replace the old instance, you can add something like
if scheduler._executors['default']._instances['set_an_id_you_like'] > 1:
    # more than one instance of this job is running: break the loop / return
    return
and this is what it should look like when you start the scheduler:
from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler(timezone='Asia/Taipei')
scheduler.add_job(main, 'cron', minute='*/30', max_instances=3,
                  next_run_time=datetime.now(), id='set_an_id_you_like')
scheduler.start()
But, like the answer in the link says, please refrain from doing this if someday there's a native way to do it. Currently I'm using APScheduler 3.10.
This method at least doesn't rely on calculating time.now() or datetime.datetime.now() in every iteration to check whether enough time has passed since the loop started. In my case, since my job runs every 30 minutes, I didn't want to calculate a time delta, so this is what I went for. Hope this hacky method helps someone who has been googling for a few days and ended up here.

Export BigQuery Logs

I want to analyze the activity on BigQuery during the past month.
I went to the cloud console and the (very inconvenient) log viewer. I set up exports to BigQuery, and now I can run queries on the logs and analyze the activity. There is even a very convenient guide here: https://cloud.google.com/bigquery/audit-logs.
However, all this only helps with data collected from now on. I need to analyze the past month.
Is there a way to export existing logs (rather than new ones) to BigQuery (or to a flat file and later load them into BQ)?
Thanks
While you cannot "backstream" BigQuery's logs from the past, there is still something you can do, depending on what kind of information you're looking for. If you need information about query jobs (job stats, config, etc.), you can call the Jobs: list method of the BigQuery API to list all jobs in your project. The data is preserved there for six months, and if you're the project owner, you can list the jobs of all users, regardless of who actually ran them.
If you don't want to code anything, you can even use the API Explorer to call the method, save the output as a JSON file, and then load it back into a BigQuery table.
Sample code to list jobs with the BigQuery API. It requires some modification, but it should be fairly easy to get done.
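For reference, a minimal sketch of that approach with the google-cloud-bigquery Python client could look like this (the project id is a placeholder, and you would still need to write the rows somewhere in order to load them back into BigQuery):

from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

# Sketch: list the project's job history (kept for roughly six months) so it
# can be saved and loaded back into a BigQuery table for analysis.
client = bigquery.Client(project="my-project")  # placeholder project id

one_month_ago = datetime.now(timezone.utc) - timedelta(days=30)
for job in client.list_jobs(all_users=True, min_creation_time=one_month_ago):
    # Query jobs also expose fields such as total_bytes_processed.
    print(job.job_id, job.user_email, job.job_type, job.state, job.created, job.ended)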
You can use the Jobs: list API to collect job info and upload it to GBQ.
Since it is in GBQ, you can analyze it any way you want using the power of BigQuery.
You can either flatten the result or use the original - I recommend using the original, as it is less of a headache: there is no transformation needed before loading into GBQ (you literally just upload whatever you got from the API). Of course, all this goes into a simple app/script that you still have to write.
Note: make sure you use the full value for the projection parameter.
I was facing the same problem when I found an article which describes how to inspect BigQuery using INFORMATION_SCHEMA, without any script or the Jobs: list call mentioned in the other answers.
I was able to run it and got it working.
# Monitor query costs in BigQuery; standard-sql; 2020-06-21
# @see http://www.pascallandau.com/bigquery-snippets/monitor-query-costs/
DECLARE timezone STRING DEFAULT "Europe/Berlin";
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;

SELECT
  DATE(creation_time, timezone) AS creation_date,
  FORMAT_TIMESTAMP("%F %H:%M:%S", creation_time, timezone) AS query_time,
  job_id,
  ROUND(total_bytes_processed / gb_divisor, 2) AS bytes_processed_in_gb,
  IF(cache_hit != true, ROUND(total_bytes_processed * cost_factor, 4), 0) AS cost_in_dollar,
  project_id,
  user_email
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
  DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
ORDER BY
  bytes_processed_in_gb DESC
Credits: https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/

How to create in-house funnel analytics?

I want to create in-house funnel analysis infrastructure.
All the user activity feed information would be written to a database / DW of choice, and then, when I dynamically define a funnel, I want to be able to select the count of sessions for each stage in the funnel.
I can't find an example of creating such a thing anywhere. Some people say I should use Hadoop and MapReduce for this, but I couldn't find any examples online.
Your MapReduce is pretty simple:
The Mapper reads a session row from the log file; its output is (stage-id, 1).
Set the number of Reducers equal to the number of stages.
The Reducer sums the values for each stage, like in the WordCount example (which is the "Hello World" of Hadoop - https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0).
You will have to set up a Hadoop cluster (or use Elastic MapReduce on Amazon).
To define the funnel dynamically you can use Hadoop's DistributedCache feature. To see results you will have to wait for the MapReduce job to finish (a minimum of dozens of seconds, or minutes in the case of Amazon's Elastic MapReduce; the time depends on the amount of data and the size of your cluster).
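As a rough illustration only, a Hadoop Streaming version of that mapper and reducer in Python could look like this, assuming tab-separated log rows with the stage id in the second column:

#!/usr/bin/env python3
# mapper.py: read one log row per line, emit "stage_id<TAB>1"
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")  # assumed: tab-separated, stage id in column 2
    if len(fields) >= 2:
        print(f"{fields[1]}\t1")

#!/usr/bin/env python3
# reducer.py: sum the 1s emitted by the mapper, one total per stage
# (Hadoop sorts the mapper output by key before the reducer sees it)
import sys

current_stage, count = None, 0
for line in sys.stdin:
    stage, value = line.rstrip("\n").split("\t")
    if stage != current_stage and current_stage is not None:
        print(f"{current_stage}\t{count}")
        count = 0
    current_stage = stage
    count += int(value)
if current_stage is not None:
    print(f"{current_stage}\t{count}")

Run them with the hadoop-streaming jar and set the number of reduce tasks to the number of stages.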
Another solution that may give you results faster: use a database and run something like select stage, count(distinct session_id) from mylogs group by stage;
If you have too much data to execute that query quickly (it does a full table scan, and HDD transfer rate is about 50-150 MB/sec, so the math is simple), then you can use a distributed analytic database that runs over HDFS (Hadoop's distributed file system).
In this case your options are (I list only open-source projects here):
Apache Hive (based on Hadoop's MapReduce, but if you convert your data to Hive's ORC format you will get results much faster).
Cloudera's Impala - not based on MapReduce, it can return results in seconds. For the fastest results, convert your data to the Parquet format.
Shark/Spark - an in-memory distributed database.
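If you go the Spark route, a minimal PySpark sketch of the per-stage distinct-session count might look like this (the file path and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("funnel-counts").getOrCreate()

# Assumed layout: one row per session event, with columns session_id and stage.
logs = spark.read.csv("hdfs:///logs/sessions.csv", header=True)  # placeholder path

(logs.groupBy("stage")
     .agg(F.countDistinct("session_id").alias("sessions"))
     .orderBy("stage")
     .show())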

How to reduce time allotted for a batch of HITs?

Today I created a small batch of 20 categorization HITs named "Grammatical or Ungrammatical" using the web UI. Can you tell me the easiest way to manage this batch so that I can reduce its allotted time from 1 hour to 15 minutes and also remove the Categorization Masters qualification? This is a very simple task that's set to auto-approve within 1 hour, and I am fine with that. I just need to make it more lucrative for people to attempt this at the penny rate.
You need to register a new HITType with the relevant properties (reduced time and no masters qualification) and then perform a ChangeHITTypeOfHIT operation on all of the HITs in the batch.
API documentation here: http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_ChangeHITTypeOfHITOperation.html
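That link is for the legacy SOAP API; the same flow with boto3's MTurk client would look roughly like this (title, description, keywords and the HIT ids below are placeholders):

import boto3

# Sketch only: register a new HIT type with a 15-minute assignment duration,
# a 1-hour auto-approval delay and no Masters qualification, then move every
# HIT in the batch over to it.
mturk = boto3.client('mturk', region_name='us-east-1')

new_hit_type_id = mturk.create_hit_type(
    Title='Grammatical or Ungrammatical',
    Description='Decide whether the sentence is grammatical',  # placeholder description
    Keywords='categorization, grammar',                        # placeholder keywords
    Reward='0.01',
    AssignmentDurationInSeconds=15 * 60,   # reduced from 1 hour
    AutoApprovalDelayInSeconds=60 * 60,    # keep the 1-hour auto-approval
    QualificationRequirements=[],          # no Masters requirement
)['HITTypeId']

# The HIT ids for the batch could come from the web UI's CSV download or a ListHITs call.
for hit_id in ['HIT_ID_1', 'HIT_ID_2']:    # placeholder ids
    mturk.update_hit_type_of_hit(HITId=hit_id, HITTypeId=new_hit_type_id)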

Running Rational Performance Tester on a schedule

Is it possible to run Rational Performance Tester once every hour and generate a report which contains all response times, per hour, for all pages? Like this:
hour 1: hello.html min avg max time
hour 2: hello.html min avg max time
hour 3: hello.html min avg max time
If you use an ordinary schedule and let it iterate once every hour, all response times get lumped together in the report like this:
hello.html min avg max count=24
Would it be possible to start RPT from a script, run a specific project/schedule, and then let cron run that script every hour?
To run Rational Performance Tester tests automatically, you can use the command-line feature that is built into the tool. If you create a Windows scheduled task that runs a .bat file (or a Unix crontab entry that runs a shell script) containing the following command, that solves the first bit of starting the RPT test automatically.
cmdline -workspace "workspace" -project "testproj" -schedule "schedule_or_test"
For more details on the above command, refer to the link below:
Executing Rational Performance tester from command line
For the second bit, producing the response-time report automatically, there seems to be no easy way (which is a shame), but you can write custom Java code in the test to log page response times to a text file.
For sure, you can schedule that task using Rational Quality Manager, IBM's new centralized QA management tool. Moreover, in the same tool you can start your test plan with Java code that allows you to manage that.
Hope this helps.
Why would you want to do that? It sounds like you are looking for a way to monitor a running website! If so, there are much simpler ways, such as adding %D to the Apache LogFormat to write out the time taken to serve each page, and then processing your web logs every hour instead :-)
If you really want to do this then don't use RPT - use JMeter or something more command-line friendly; it would be easy then. In fact, if it's just loading a page, then curl on a cron job would do it.
Well, it is not a single page; it is a WebSphere Portal running on a mainframe, so it is not just a matter of opening up an Apache config.
I haven't looked into JMeter, but we have a couple of steps that must be done in the test (log in, do some things, and log out) that we want to measure, and we already had a test flow in RPT that we use, so it would be nice to reuse it even if it is not what RPT is meant for.
//Olme
You can use several stages for the schedule (select the schedule, then add them in the "User Load" tab):
stage 1, duration 1 hour
stage 2, duration 1 hour
stage 3, duration 1 hour
You would get the test result broken down into several stages. Select all the stages and right-click; there's a "Compare" option. After comparing the stages' results, it looks like this:
stage 1: hello.html min avg max time
stage 2: hello.html min avg max time
stage 3: hello.html min avg max time