I have a Pentaho job, and within the job itself I want to evaluate a condition and send the job in one of two directions based on the result. In particular, I want to check whether the current hour is 3 (i.e. the time is between 3 and 4 a.m.): if it is, send the job in one direction, and if not, in the other.
What is the easiest way to do this? (I am using Pentaho 4.2 Spoon)
Thanks!
A transformation is one option - as "Working Hard" said.
If you want to evaluate a condition in the job itself, you can use the "Simple evaluation" entry in the job.
To do so, check the field value holding the time (important: it should have the date format HH:mm) with the condition "If value is between".
(If you run into problems, such as Pentaho asking for a field or something like that, just add a comment.)
If you don't have any specific constraint, you can create one transformation and execute that transformation from the job.
In the transformation, once you have your desired output, you can use a "Modified Java Script" step and check your condition there: if a row satisfies the condition, send it to step A, otherwise to step B.
Related
I want to transform a CSV file to an XML file. In the transformation I also have a small validation of the data, for example the length of a string must be < 50. So I have a Text file input step > Modified Java Script step with two hops: one to an Abort step (for the error handling hop) and one to an XML output step. My goal is to only create the XML file if no error occurs. At the moment it creates an XML file with 2 rows and then aborts, because row 3 of the CSV contains a very long string. I think it is a very simple scenario, but I have no idea how to solve it. Can someone please give me a tip?
Thanks a lot.
Marko
EDITED:
It seems your flow is indeed halting on strings longer than 50 characters, since it aborts midway. But because Pentaho processes steps in parallel, if the first row is valid and reaches the output step, the output starts writing immediately. What you want is to block that step until all rows have been processed by the prior step.
Simply add a "Blocking Step"(do not mistake the Block this step until steps finish, you want Blocking Step)before your output step. Remenber to check 'Pass all rows?' option ins this step, this will effectively "Hold" all the rows in the transformation right before the output.
I'm working on a rather long transformation in Kettle, and I put some steps in the middle of the flow.
So now my Step metrics are all scrambled up and very hard to read.
Is there any way I could sort these to be in order (following the direction of the flow) again?
If you click on # in the "Step metrics" tab, it will sort the steps by their order. The visualisation in the "Metrics" tab will also be sorted.
Steps are stored in the order of insertion. The step metrics grid allows the steps to be shown in a different order by clicking on the column header, but since a transformation graph can be meshed, it's generally not possible to sort the steps in the order of the data flow. Only a single path in your graph could be sorted by analyzing the hops, anyway.
What you can do is change the name of each step and add a number in front of it. Then sort by name.
Boring, I know, but it is what we have...
It's unfortunate that assigning a step number isn't an option. And maybe it differs by version, but in 8.3 the step metrics # column assignment seems to be somewhat based on the step's order in the flow (which of course breaks down when the flow branches), not by when the step was added. It does ring a bell that it was based on when the step was added in past versions though.
It's also unfortunate that the sort by step name is case sensitive - so steps that start with "a" come after steps that start with "Z". Perhaps there's a way to work that behavior into a naming strategy that actually leverages that for some benefit, but I haven't found one.
So I'm inclined to agree with #recacon - using a number prefix for the step names and then sorting execution metrics by step name seems like the best option. I haven't done much of this yet since without having a team standard it's unlikely to be maintained.
For the few times I have done it, I've used a three digit numeric prefix where values are lowest at the start of the flow and increase farther down the path. To reduce the need for re-sequencing when steps are added later, I start out incrementing by ten from one step to the next, then use a number between when splitting hops later on.
I also increment the 100's digit for branches in the flow or if there's a significant section of logic for a particular purpose.
I want to repetitively execute an SQL query looking like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, except that the spans overlap, which prevents the use of GROUP BY.
That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop transformation" condition would perhaps be implemented implicitly by simply not re-entering the loop.
Here's the flow of the transformation:
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
In Kettle you want to avoid loops, as they can cause real trouble in transforms. Instead, you should do this by adding a step that puts a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transform is only sequential at the row level: rows may be processed in parallel and out of order, and the only guarantee is that each row will pass through the steps in order. Because of this, a loop in a transform does not function as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.
I have column "date" in my table.I need to call my function for this table every time when the current time is equal to time in my "date" column. I don't know if it's possible to do this in ms sql server?
It seems like you are trying to implement some kind of scheduling.
You could try implementing one using a SQL Server service called SQL Server Agent. It may not be a fit for every kind of response to time events, but it should be able to manage tasks like this.
You would need to set up a SQL Server Agent job for it.
A job would need to consist of at least one job step and have at least one schedule to be runnable. Perhaps, it would be easiest for you at this point to use the Transact-SQL type of job step.
A Transact-SQL job step is just a Transact-SQL script, a multi-statement query. In your case it would probably first check if there are rows matching the current time. Then, either for every matching row separately or for the entire set of them, it would perform whatever kind of operation Transact-SQL allows you to perform.
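As a minimal sketch (with hypothetical names: dbo.MyTable with a datetime column [date] and an id column, and dbo.DoWork standing in for "my function"), such a job step could look like this; the job's schedule interval (assumed here to be one minute) defines the window that counts as "equal to the current time":
-- Hypothetical example: process every row whose [date] fell due since the last run.
DECLARE @now  datetime = GETDATE();
DECLARE @prev datetime = DATEADD(MINUTE, -1, @now);  -- window = job schedule interval

DECLARE @id int;
DECLARE match_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT id
    FROM dbo.MyTable
    WHERE [date] > @prev AND [date] <= @now;

OPEN match_cursor;
FETCH NEXT FROM match_cursor INTO @id;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC dbo.DoWork @id = @id;   -- whatever operation you need for this row
    FETCH NEXT FROM match_cursor INTO @id;
END;
CLOSE match_cursor;
DEALLOCATE match_cursor;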
I want to control data flow in pentaho transformation files with system variables. I found a component called 'simple evaluation' which is exactly what I want, however it can only be used in job files.
I have gone through the component tree of a transformation in Spoon but cannot find anything like 'Simple evaluation'.
Can anyone give me some idea, how to make it?
Thanks
IIRC you can't use variables in the Filter rows step. That would probably be a worthy change request to raise on jira.pentaho.com.
So, simply use a "Get Variables" step to get the variable into the stream and then use the Filter rows step (or Switch/Case, depending on complexity).