Pentaho Job - Execute job based on condition - automation

I have a Pentaho job that is scheduled to run every week; it reads data from one table and populates another.
Currently the job executes every week regardless of whether the source table was updated.
I want to add a condition before the job runs that checks whether the source was updated in the last week, and run the job only if it was; otherwise the job should be skipped.

There are many ways you can do this. Assuming you have a table in your database that stores the last date your job ran, you could do something like the following.
Create a job and configure a parameter in it (I called mine RunJob). Create a transformation that gets your max run date (or row count), looks up the run date (or row count) from the previous run, and compares the two; it then sets the value of your job's variable based on the result of the comparison. The last step in the transformation is a Set Variables step from the Job branch.
Then in your job, use a Simple Evaluation step to test the variable.
Note that my transformation sets the value of the variable only if the job needs to run; otherwise it is left NULL.
Also be sure to update your last run date (or row count) after doing the table load. That's what the SQL step at the end of the job does.
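For the comparison itself, the transformation's Table Input step could run something along these lines (a minimal sketch, assuming SQL Server syntax; job_control, last_run_date, source_table, updated_at, and the 'weekly_load' name are all hypothetical):

-- Table Input: returns 'Y' only when the source has changed since
-- the last recorded run; otherwise run_job comes back NULL.
SELECT CASE
         WHEN (SELECT MAX(updated_at) FROM source_table)
            > (SELECT last_run_date FROM job_control
               WHERE job_name = 'weekly_load')
         THEN 'Y'
       END AS run_job;

-- SQL step at the end of the job, after a successful load:
UPDATE job_control
SET last_run_date = GETDATE()
WHERE job_name = 'weekly_load';

Feed run_job into the Set Variables step, and point the job's Simple Evaluation at that variable.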
You could probably handle this with fewer steps by using some JavaScript, but I prefer not to script if I can avoid it.

Related

How do I create a backup for a table which will be used for a full-refresh?

I have an incremental model A where each day is calculated using the previous day's value. Running a full-refresh means that this table needs to be recalculated from the beginning of time, which is very inefficient and takes too long.
I have tried to create a backup table that takes a copy of the table's values each month, and to have model A refer to the backup table during a full-refresh, so that only the values after the backup need to be recalculated and I can arrive at today's value much more quickly. However, this gives me an error:
Encountered an error:
Found a cycle: model.model_A --> model.backup --> model.model_A
This is because the backup refers to the model to get its values each month, while model A refers to the backup to build from in the case of a full-refresh.
Is there a way around this problem, so I can avoid rebuilding the entire model from the beginning of time every time I do a full-refresh?
Yes, there is a way around it, but you can't have 'circular loops' or cycles in your build process.
If there is an application that calculates the values for each day, you could perhaps store the new values back in the same source table(s), adding an 'updated_at' column or something similar. If I understand your use case correctly, you could then use this value whenever you need to query only the last day's information.
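A minimal dbt-style sketch of that idea, assuming the application writes an updated_at column back to a source table (the raw.daily_values source and the column names are hypothetical):

-- model_A.sql: incremental model that only recalculates rows newer
-- than what it has already built, instead of the beginning of time.
{{ config(materialized='incremental') }}

select
    value_date,
    value,
    updated_at
from {{ source('raw', 'daily_values') }}

{% if is_incremental() %}
  -- Only pick up rows written since this model's last run.
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Because the model reads only from the source rather than from a table built from the model itself, there is no cycle.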

CDC LSNs: queries return different minimum value

I have CDC enabled on a table and I'm trying to get the minimum LSN for that table to use in an ETL job. However, when I run this query
select sys.fn_cdc_get_min_lsn('dbo_Table')
I get a different result from this query
select min(__$start_lsn) from cdc.dbo_Table_CT
Shouldn't these queries return the same values? If they don't, why not, and how do I get them back in sync?
The first query:
select sys.fn_cdc_get_min_lsn('dbo_Table')
Invokes a system function that returns the lowest POSSIBLE LSN for the capture instance. This value is set when the cleanup function runs. It is recorded in, and queried from, cdc.change_tables.
The second query:
select min(__$start_lsn) from cdc.dbo_Table_CT
Looks at the actual capture instance and returns the lowest ACTUAL LSN for the instance. This value is set when the first actual change to the instance is logged after the cleanup function runs. It is recorded in, and queried from, cdc.dbo_Table_CT.
They're unlikely to tie out, statistically speaking. For the purposes of an ETL job, the call to the system function will likely be quicker, and it is a more accurate reflection of when the current set of change records started being accumulated.
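For the ETL job itself, the usual pattern is to bracket each run between the minimum valid LSN and the current maximum LSN, then pull the changes in that window. A minimal T-SQL sketch, assuming the capture instance is named dbo_Table as in the question:

DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Lowest LSN for which change data is guaranteed to be available.
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Table');

-- Highest LSN captured so far.
SET @to_lsn = sys.fn_cdc_get_max_lsn();

-- All changes for the capture instance in that window.
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Table(@from_lsn, @to_lsn, N'all');

In an incremental job you would normally persist @to_lsn and use it (advanced with sys.fn_cdc_increment_lsn) as the next run's starting point.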

SQLAgent job with different schedules

I am looking to see if it's possible to have one job that runs on different schedules, with the catch being that one of the schedules needs to pass in a parameter.
I have an executable that runs some base functionality when there is no parameter, but if a parameter is present it runs some additional logic.
Setting up my job, I created a schedule (every 15 minutes) with a step of type Operating system (CmdExec):
runApplication.exe
For the other schedule I would like it to run once per day; however, the executable would need to be: runApplication.exe "1"
I don't think I can create a different step with a separate schedule, or can I?
Does anyone have ideas on how to achieve this without having two separate jobs?
There's no need for two jobs. What you can do is have the daily run store its parameter in a table, and update your secondary logic to reference that table: if there's a parameter value present, run the secondary logic; if not, have the secondary logic return 0 or not run at all. It all stays in one script.
Just make sure you either truncate the parameter table on every run, or store a date in it so you know which row to reference.
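A minimal sketch of that parameter table and check, assuming SQL Server; the table and column names (JobParameter, ParamValue, SetAt) are hypothetical:

-- One-row table the once-per-day schedule writes its parameter to.
CREATE TABLE dbo.JobParameter (
    ParamValue varchar(10) NULL,
    SetAt      datetime NOT NULL DEFAULT (GETDATE())
);

-- The daily schedule seeds the parameter before the run:
TRUNCATE TABLE dbo.JobParameter;
INSERT INTO dbo.JobParameter (ParamValue) VALUES ('1');

-- Inside the shared logic: run the additional branch only when a
-- parameter was stored today.
IF EXISTS (SELECT 1
           FROM dbo.JobParameter
           WHERE ParamValue IS NOT NULL
             AND SetAt >= CAST(GETDATE() AS date))
    PRINT 'Parameter present: run the additional logic';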
Good luck.

Simple Evaluation to run a step once a week only?

I want to run a Job that, if it is the first time it has run this week, goes through an extra step. I found that I can do this evaluation with a Simple Evaluation step, but I can't find how to make the 'once a week' check.
Does anyone have an idea how I can do this?
Do this:
Create a new transformation and use a Get System Info step; in the type column, select "first day of this week". Check your date against this column and, if they are equal, run your main job using a Job Executor; otherwise run the other one.
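If you would rather express the same check in SQL (for example in a Table Input step that feeds the Simple Evaluation), a minimal sketch assuming SQL Server:

-- Returns 1 only on the first day of the week. Note that
-- DATEPART(WEEKDAY, ...) depends on the session's DATEFIRST
-- setting, so verify which day your week starts on.
SELECT CASE
         WHEN DATEPART(WEEKDAY, GETDATE()) = 1 THEN 1
         ELSE 0
       END AS is_first_day_of_week;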

Get the oldest row from table

I'm coding an application that deals with files, so I have a table that contains information about all the files registered in the application.
My "files" table looks like this: Id, Path and LastScanTime.
The algorithm I use in my application is simple:
Take the oldest row (LastScanTime is the oldest)
Extract the file path
Do some magic on this file (takes exactly 5 minutes)
Update the LastScanTime to the current time (now)
Go to step "1"
So far, the task is pretty simple. To do this, I'm going to use this SQL statement to get the oldest item:
SELECT TOP 1 * FROM files ORDER BY [LastScanTime] ASC
and at the end of the item's processing (to prevent the item from being selected again immediately):
UPDATE Files SET [LastScanTime]=GETDATE() WHERE Id=#ItemID
Now I'm going to add some complexity to the algorithm:
Take the 3 oldest rows (LastScanTime is the oldest)
For each row, do:
A. Extract the file path
B. Do some magic on this file (takes exactly 5 minutes)
C. Update the LastScanTime to the current time (now)
D. Go to step "1"
The problem I'm facing now is that the whole process will run in parallel (no more serial processing), so changing my SQL statement to the following is not enough!
SELECT TOP 3 * FROM files ORDER BY [LastScanTime] ASC
Why isn't this SQL statement enough?
Let's say I run my code and start processing the first 3 items. A minute later I want to start processing another 3 items. This SQL statement will retrieve exactly the same "oldest" items that are already being processed.
Possible solution
Implement a combined SELECT & UPDATE that gets the 3 oldest items and immediately updates their last scan time. But if the SELECT and the UPDATE run as two separate statements, what happens when another SELECT comes in between the first SELECT and its UPDATE? Both statements will get the same results. That is one problem. Another problem is that we mark the item as "scanned recently" before the scan has really finished. What happens if the scan is terminated by an error?
I'm looking for tips and tricks to solve this problem. The solution can add columns as needed.
I'd appreciate your help.
Well, I usually have the habit of keeping two different fields in the database: one is AddedDate and the other is ModifiedDate.
So the algorithm in your terms would be:
Take the oldest row (AddedDate is the oldest)
Extract the file path
Do some processing on this file
Update the ModifiedDate to the current time (now)
It seems that you are about to reinvent an event queue in SQL. Standard message-queue systems like RabbitMQ or ActiveMQ may solve your problem.
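If you do stay in SQL, the race described in the question can be avoided by claiming rows and stamping them in one atomic statement. A minimal T-SQL sketch (assuming SQL Server, given the TOP and GETDATE() syntax in the question; treat it as one way to do the combined SELECT & UPDATE, not the only one):

-- Atomically claim the 3 oldest rows: the UPDATE both selects and
-- stamps them, so two workers running concurrently cannot grab the
-- same files. READPAST skips rows another worker has already locked.
WITH oldest AS (
    SELECT TOP (3) Id, Path, LastScanTime
    FROM Files WITH (ROWLOCK, UPDLOCK, READPAST)
    ORDER BY LastScanTime ASC
)
UPDATE oldest
SET LastScanTime = GETDATE()
OUTPUT inserted.Id, inserted.Path;  -- the claimed work items

To cover the second concern (an item failing mid-scan), you could stamp a separate claim column instead of overwriting LastScanTime, and reset it when a scan errors out so the row becomes eligible again.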