How to automate deleting multiple GDG generations of different jobs on a daily basis? - batch-processing

I am working in an environment where the batch jobs that run on a daily basis are named A to Z, 26 jobs in total (just an example).
All these jobs create QUESTION.XXXXXXXX.HELPME.G0002V000 (where XXXXXXXX = job name, A to Z) once the batch process is completed.
Case 1: QUESTION.XXXXXXXX.HELPME.G0002V000 is deleted automatically if the batch run completes successfully.
Case 2: On the other hand, if the job fails, we need to delete QUESTION.XXXXXXXX.HELPME.G0002V000 manually.
Since we have a huge number of batch jobs and plenty of them fail on a daily basis, it is really painful to delete all of them every day.
I want to create a system that can delete all the generation files listed in one PS file or .txt file (the failed jobs' GDG generations) in one shot, for example by submitting one JCL job or executing one REXX exec.
Note: I can't code this in my jobs' JCL because of a few other constraints.

A couple of options:
Use the backup-with-delete option in DFDSS (now DFSMSdss, program ADRDSSU), or whatever it is called these days. It has the advantage of creating a backup of whatever is deleted, and you can use generic (masked) dataset names in DFSMSdss.
If you reference the GDG base QUESTION.XXXXXXXX.HELPME in JCL, it will pick up every generation, and you can use DISP=(OLD,DELETE). (Note: if there is no generation, you get a JCL error.)
Use REXX to check for and delete the datasets (a sketch follows the JCL below).
To do it in JCL:
//* Step 1: allocate a new (+1) generation so the GDG base is never empty
//* (avoids the JCL error mentioned above); supply DISP/SPACE/DCB as needed
//STEP1   EXEC PGM=IEFBR14
//C       DD DSN=QUESTION.XXXXXXXX.HELPME(+1),....
//*
//* Step 2: reference the GDG base to pick up, and delete, every generation
//STEP2   EXEC PGM=IEFBR14
//D       DD DSN=QUESTION.XXXXXXXX.HELPME,DISP=(OLD,DELETE)
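For the REXX route, a minimal sketch along these lines could read a list of the failed jobs' GDG datasets from a sequential file and delete each one. This is only an illustration: the input dataset name YOUR.FAILED.DSLIST is an assumption, and the exec would run under TSO (or in batch via IKJEFT01).
/* REXX - sketch: delete every dataset named in a sequential list */
"ALLOC F(INLIST) DA('YOUR.FAILED.DSLIST') SHR REUSE"
"EXECIO * DISKR INLIST (STEM dsn. FINIS"
"FREE F(INLIST)"
DO i = 1 TO dsn.0
  thisDsn = STRIP(dsn.i)
  IF thisDsn <> '' THEN DO
    IF SYSDSN("'"thisDsn"'") = 'OK' THEN DO   /* delete only if cataloged */
      "DELETE '"thisDsn"'"
      SAY 'Deleted  :' thisDsn 'RC='RC
    END
    ELSE SAY 'Not found:' thisDsn
  END
END
EXIT 0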

Related

How to force Nextflow process to recalculate and ignore cache in resumed workflow

I have a series of processes in a Nextflow pipeline, employing multiple heavy computing steps and database (SQL) insertions/fetches. I need to insert certain (intermediate) process results into the DB and fetch them later for further processing (within the same pipeline). In its most simplified form it will be something like:
process1 (fetch data from DB)
process2 (analyze process1.out)
process3 (inserts process2.out to DB)
The problem is that when any values are changed in the DB, the output from process1 is still cached (when using the -resume flag), so changes in the DB are not reflected at all.
Is there any way to force process1 to be reprocessed while using -resume, i.e. to ignore its cache?
So far, I have been manually deleting the respective work folder or adding a dummy line to process1, but that is an extremely inefficient solution.
Thanks for any help here.
Result caching is enabled by default, but this feature can be disabled using the cache directive by setting the value to false. For example:
process process1 {
cache false
...
}
Not sure if we have the full picture here, but updating a database with some set of process results just to fetch them again later on seems wasteful. Or maybe I've just misunderstood. I would instead try to separate the heavy computational work (hours) from the database transactions (minutes) if at all possible. Note that if you need to make per process database transactions, you might be able to achieve this using the beforeScript and afterScript directives (which can be enabled/disabled using a nextflow.config profile, for example). For example, a beforeScript could be used to create a database object that is updated (using an afterScript) once the process has completed. Since both of these scripts are run from inside the workDir, you could use the basename of the current/working directory (i.e. the task UUID) as a key.
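As a rough illustration of that idea (not the original poster's pipeline), the two directives could be wired up as below; register_task.sh, record_result.sh and analyze.py are hypothetical placeholders for your own DB client calls and analysis step:
// Sketch only: per-process DB bookkeeping via beforeScript/afterScript.
process process2 {

    // Both scripts run from inside the task workDir, so the basename of the
    // working directory can serve as a unique key for this task execution.
    beforeScript 'register_task.sh "$(basename "$PWD")"'
    afterScript  'record_result.sh "$(basename "$PWD")"'

    input:
    path results

    output:
    path 'analysis.out'

    script:
    """
    analyze.py ${results} > analysis.out
    """
}
The directives could then be switched on or off per environment through a profile in nextflow.config, as mentioned above.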

Run Snowflake Task when a data share view is refreshed

Snowflake's documentation illustrates how to have a TASK run on a scheduled basis and act when there are inserts/updates/deletes or other DML operations on a table, by creating a STREAM on that specific table.
Is there any way to have a TASK run if a view from an external Snowflake data share is refreshed, i.e. dropped and recreated?
As part of this proposed pipeline, we receive a one-time refresh of a view within a specific time period each day, and the goal would be to start a downstream pipeline that runs at most once during that time period, when the view is refreshed.
For example, for the following TASK schedule
'USING CRON 0,10,20,30,40,50 8-12 * * MON,WED,FRI America/New_York', the downstream pipeline should run only once every Monday, Wednesday, and Friday between 8 and 12.
Yes, I can point you to the documentation if you would like to check whether this works for the tables you might already have set up:
Is there any way to have a TASK run if a view from an external Snowflake data share is refreshed, i.e. dropped and recreated?
You could create a stored procedure to monitor the existence of the table; I have not tried that before, though, so I will see if I can ask an expert.
Separately, is there any way to guarantee that the task runs at most
once on a specific day or other time period?
Yes, you can use the optional CRON scheduling parameters to restrict a task to specific days of the week or times of day. An example:
CREATE TASK delete_old_data
  WAREHOUSE = deletion_wh
  SCHEDULE = 'USING CRON 0 0 * * * UTC'
AS
  DELETE FROM old_data WHERE created_at < DATEADD(day, -30, CURRENT_TIMESTAMP()); -- illustrative body; CREATE TASK requires an AS clause
Reference: https://docs.snowflake.net/manuals/user-guide/tasks.html more specifically: https://docs.snowflake.net/manuals/sql-reference/sql/create-task.html#optional-parameters
A TASK can only be triggered by a calendar schedule, either directly or indirectly via a predecessor TASK being run by a schedule.
Since the tasks are only run on a schedule, they will not run more often than the schedule says.
A TASK can't be triggered by a data share change, so you have to monitor it on a calendar schedule.
This limitation is bound to be lifted at some point, but it is valid as of December 2019.
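Putting the two answers together, one possible polling pattern looks roughly like the sketch below. The table, task, and warehouse names are hypothetical, and the bookkeeping only guarantees at-most-once-per-day behavior; it does not detect the share refresh itself (that check would have to live in a stored procedure called from the task, as suggested above):
-- Sketch only: poll on the question's schedule, but record each day the
-- pipeline has been kicked off so downstream work runs at most once per day.
CREATE OR REPLACE TABLE pipeline_run_log (run_date DATE);

CREATE OR REPLACE TASK start_downstream_pipeline
  WAREHOUSE = my_wh
  SCHEDULE  = 'USING CRON 0,10,20,30,40,50 8-12 * * MON,WED,FRI America/New_York'
AS
  MERGE INTO pipeline_run_log t
  USING (SELECT CURRENT_DATE() AS run_date) s
  ON t.run_date = s.run_date
  WHEN NOT MATCHED THEN INSERT (run_date) VALUES (s.run_date);

ALTER TASK start_downstream_pipeline RESUME;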

Run a SQL Server job until it succeeds

I have a SQL Server job that has been running for almost 2 years.
It connects to a flaky Oracle database that keeps disconnecting, and the job always fails because of that. When I run it again after 10 or 15 minutes, it works successfully. I'm getting tired of checking it every day...
Is there a way to make the job keep retrying the connection to that Oracle source until it succeeds, or to have another job watch this job's status and, if it failed, run it again until it succeeds?
A solution we are using is something like this:
Wrap your Oracle query in an SSIS package, and after the query, have the package update a SQL table that keeps either a history of executions, or just a single row that tracks the last time the job ran successfully. In short, if the Oracle query was successful, then put something in a table saying the query ran successfully today. If it was not successful, then don't put anything in the table for today.
Then at the beginning of the package, BEFORE the Oracle query, check to see if the query has been run successfully today. If it has already run successfully, then do nothing and exit the package. If it has not run successfully today, then go ahead and try to run it, following the post-query steps described above. If you have any other conditions about when the package should run (like "only after 10 am" or anything like that) you would include that logic here.
Finally, schedule the job to call the package, and schedule to run every 15 minutes, or however often you like. It will try every 15 minutes until it runs successfully, and after that it will stop doing anything until the next day.
As a bonus, you can use this same package and job to initiate all tasks that you want handled the same way. You just need to keep meta data about all these tasks in your history/metadata table.
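As a rough T-SQL sketch of that control-table logic (the table, column, and job names here are assumptions, not part of the original setup):
-- Sketch only: "run at most once per day" bookkeeping.
CREATE TABLE dbo.JobRunHistory (
    JobName     sysname      NOT NULL,
    SucceededOn datetime2(0) NOT NULL
);
GO

-- At the start of the package / job step: skip the work if it already
-- succeeded today.
IF NOT EXISTS (SELECT 1
               FROM dbo.JobRunHistory
               WHERE JobName = N'OracleExtract'
                 AND SucceededOn >= CAST(GETDATE() AS date))
BEGIN
    -- ... run the Oracle query here (e.g. the SSIS data-flow task) ...

    -- Reached only if the step above did not fail:
    INSERT INTO dbo.JobRunHistory (JobName, SucceededOn)
    VALUES (N'OracleExtract', SYSDATETIME());
END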
An alternative is to create the job steps and leave them unscheduled, and create an SSIS job that acts as the master of all your jobs. It runs every minute, checks a config table for all job steps that have yet to succeed today, and executes any it finds using sp_start_job.
If they do run successfully, log the stats to a log table; this prevents them from ever being launched again until the next day. It also avoids all your jobs needing to be scheduled every 15 minutes or so: they launch as soon as possible, and you can add extra logic to handle dependencies, the number running in parallel, importance level, start time, latest start time, maximum number of retries, and so on.
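A rough sketch of what that master job's step might run every minute (the config and log tables, dbo.JobConfig and dbo.JobRunLog, are hypothetical):
-- Sketch only: start every configured job that has not yet succeeded today.
DECLARE @job sysname;

DECLARE job_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT c.JobName
    FROM dbo.JobConfig AS c
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.JobRunLog AS l
                      WHERE l.JobName = c.JobName
                        AND l.SucceededOn >= CAST(GETDATE() AS date));

OPEN job_cursor;
FETCH NEXT FROM job_cursor INTO @job;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC msdb.dbo.sp_start_job @job_name = @job;  -- asynchronous start
    FETCH NEXT FROM job_cursor INTO @job;
END
CLOSE job_cursor;
DEALLOCATE job_cursor;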

How to execute X times a Job Executor step

Introduction
To keep it simple, let's imagine a simple transformation.
This transformation gets an input of 4 rows, from a Data Grid step.
The stream passes through a Job Executor step, which references a simple job containing a Write Log entry.
Expectations
I would like the simple job to execute 4 times, which means 4 log messages.
Results
It turns out that the Job Executor step launches the simple job only once instead of 4 times: I get only one log message.
Hints
The documentation of the Job Executor component specifies the following:
By default the specified job will be executed once for each input row.
This is parametrized in the "Row grouping" tab, with the following field:
The number of rows to send to the job: after every X rows the job will be executed and these X rows will be passed to the job.
Answer
The step actually works correctly: an input of X rows will execute the "Job Executor" step X times. The fact is that I wasn't able to see it in the logs.
To verify it, I added a simple transformation inside the "Job Executor" job, which writes to a text file. After checking this file, it was clear that the "Job Executor" was indeed executed X times.
Research
While trying to understand why I didn't get X log messages after the X executions of the "Job Executor", I added a "Wait for" component inside the initial simple job. Adding a two-second wait finally allowed me to see the X log messages appear during the execution.
Hope this helps because it's pretty tricky. Please feel free to provide further details.
A little late to the party, as a side note:
Pentaho is a set of programs (Spoon, Kettle, Chef, Pan, Kitchen). The engine is Kettle, and everything inside transformations is started in parallel. This makes retrieving logs a challenging task for Spoon (the UI). You don't actually need a "Wait for" entry; try outputting the logs into a file (by specifying a log file in the Job Executor entry properties) and you'll see everything in place.
Sometimes we need to give Spoon a little bit of time to get everything in place; personally, that's why I recommend not relying on the Spoon Execution Results logging tab. It is better to output the logs to a DB or to files.

What happens when bigquery upload job fails after loaded a portion of the JSON file?

As the title says, what happens when I start a BigQuery upload job and, let's say, the job fails after loading 50% of the rows in the JSON file? Does BigQuery roll back everything from the load job, or am I left with 50% of the data loaded?
I am appending data daily into a single table, and keeping it duplicate-free is very important. We are using the HTTP REST API.
BigQuery appends data atomically. You will never get half of the data in the table if the load fails. If the job completes successfully, all of the data will show up at once.
There are two additional tricks you can use to prevent duplicates:
Specify a job id for the load job. Imagine you pull your network cable midway through starting the job... how do you know whether it succeeded? Specifying a job id lets you look up the job later if the job creation request fails.
Perform your loads to a temporary table, and specify WRITE_TRUNCATE as the writeDisposition. This means that you can run import jobs idempotently against the temporary table; if you don't know whether a job succeeded, just run another one, and it will overwrite the data. Once you have a load job that completes successfully, run a table copy job with writeDisposition set to WRITE_APPEND to append the new data to your main table.
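For illustration, here is roughly what that pattern looks like with the google-cloud-bigquery Python client (the question uses the raw REST API, but the job configuration is the same); the project, dataset, and table names, the input file, and the job-id scheme are assumptions:
# Sketch only: idempotent load into a staging table, then an atomic append.
import datetime
from google.cloud import bigquery

client = bigquery.Client()

staging = "my_project.my_dataset.daily_staging"   # hypothetical staging table
main    = "my_project.my_dataset.main_table"      # hypothetical target table

# 1) Load with WRITE_TRUNCATE and a deterministic job id, so re-running the
#    same day's load just overwrites the staging table (or fails with
#    "already exists", which tells you the earlier attempt was created).
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
job_id = "daily-load-{:%Y-%m-%d}".format(datetime.date.today())
with open("rows.json", "rb") as f:
    load_job = client.load_table_from_file(
        f, staging, job_config=load_config, job_id=job_id
    )
load_job.result()  # raises if the load failed; nothing is committed in that case

# 2) Only after the load has definitely succeeded, append staging to main.
copy_config = bigquery.CopyJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND
)
client.copy_table(staging, main, job_config=copy_config).result()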