PDI: Output only if no errors

I want to transform a CSV file into an XML file. The transformation also includes a small validation of the data, for example the length of a string must be < 50. So I have a Text file input step > Modified Java Script step with two hops: one to an Abort step (the error-handling hop) and one to an XML Output step. My goal is to create the XML file only if no error occurs. At the moment it creates an XML file with 2 rows and then aborts, because row 3 of the CSV contains a very long string. I think this is a very simple scenario, but I have no idea how to solve it. Can someone please give me a tip?
Thanks a lot.
Marko

Your flow is indeed halting on strings longer than 50 characters if it aborts midway. But since Pentaho runs all steps in parallel, as soon as the first valid row reaches the output step, the output starts being written. What you want is to block that step until all rows have been processed by the prior steps.
Simply add a "Blocking Step" (do not mistake it for "Block this step until steps finish"; you want "Blocking Step") before your output step. Remember to check the 'Pass all rows?' option in this step; this will effectively hold all the rows in the transformation right before the output.

Related

Apache Spark: count vs head(1).isEmpty

For a given Spark df, I want to know whether a certain column has null values or not. The code I had was -
if (df.filter(col(colName).isNull).count() > 0) {//throw exception}
This was taking long and was being called 2 times for 1 df since I was checking for 2 columns. Each time it was called, I saw a job for count, so 2 jobs for 1 df.
I then changed the code to look like this -
if (!df.filter(col(colName).isNull).head(1).isEmpty) {//throw exception}
With this change, I now see 4 head jobs compared to the 2 count jobs before, increasing the overall time.
Can you experts please help me understand why the number of jobs doubled? The head function should be called only 2 times.
Thanks for your help!
N
Update: added screenshot showing the jobs for both cases. The left side shows the one with count and right side is the head. That's the only line that is different between the 2 runs.
dataframe.head(1) does 2 things -
1. Executes the action behind the dataframe on executor(s).
2. Collects 1st row of the result from executor(s) to the driver.
dataframe.count() does 2 things -
1. Executes the action behind the dataframe on the executor(s). If there are no transformations on the file and the Parquet format is used, then it basically scans the statistics of the file(s).
2. Collects count from executor(s) to the driver.
Given that the dataframe's source is a file format that stores statistics and there are no transformations, count() can run faster than head.
I am not 100% sure why there are 2 jobs vs 4. Can you please paste the screenshot?
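For what it's worth, a minimal sketch of the same check using Dataset.isEmpty (available since Spark 2.4, and as I understand it implemented as a limit(1) followed by a count, so it stops as soon as one matching row is found; df and colName are the question's own names):
import org.apache.spark.sql.functions.col
// True as soon as at least one null is found; only the count of the
// single limited row travels back to the driver, not the row itself.
if (!df.filter(col(colName).isNull).isEmpty) {
  // throw exception, as in the question
  throw new IllegalStateException(s"column $colName contains nulls")
}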
It's hard to say just by looking at this line of code, but there is one reason head can take more time. head is a deterministic request: if you have a sort or orderBy anywhere in the plan, it will request a shuffle so that it always returns the first row. With count you don't need the result ordered, so there is no need to shuffle; it's basically a simple map-reduce step. That is probably why your head can take more time.
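As an aside, since both columns are checked on the same df, the two checks can also be folded into a single filter so the data is scanned only once. A sketch, with c1 and c2 standing in for the question's two column names:
import org.apache.spark.sql.functions.col
val (c1, c2) = ("col1", "col2") // hypothetical column names
// One job for both columns instead of one per column:
val hasNulls = !df.filter(col(c1).isNull || col(c2).isNull).isEmpty
if (hasNulls) {
  // throw exception, as in the question
  throw new IllegalStateException(s"column $c1 or $c2 contains nulls")
}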

Pentaho "Return value id can't be found in the input row"

I have a Pentaho transformation which is used to read a text file and to check some conditions (which can produce errors, such as a number that should be positive). From these errors I'm creating an Excel file, and for my job I need the number of lines in this error file, plus a log of which lines had problems.
The problem is that sometimes I get the error "the return value id can't be found in the input row".
This error does not occur every time. The job runs every night, and sometimes it works without any problem for a month, and then one day I just get this error.
I don't think it comes from the file, because if I execute the job again with the same file, it works. I can't understand the reason for the failure: it mentions the value "id", but I don't have such a value/column. Why is it searching for a value that doesn't exist?
Another strange thing is that the step which fails should normally not be executed at all (as far as I know), because no errors were found, so no rows reach this step.
Maybe the problem is connected with the "Prioritize Stream" step? This is where I collect all the errors (which use exactly the same columns). I tried putting a sort before the grouping steps, but it didn't help. Now I'm thinking of trying a "Blocking step".
The problem is that I don't know why this happens or how to fix it. Any suggestions?
Check if all your aggregates in the Group by step have a name.
However, sometimes the error comes from a previous step: the group (count...) requests data from the Prioritize Stream step, and if that step has an error, the error gets reported mistakenly as coming from the group rather than from the Prioritize.
Also, you mention a step which should not be executed because there is no data: I do not see any Filter which would prevent rows with a missing id from flowing from the Prioritize to the count.
This is a bug. It happens randomly in one of my transformations that often ends up with an empty stream (no rows). It mostly works, but once in a while it gives this error. It seems to only fail when the stream is empty, though.

Order step metrics in Pentaho Data Integration

I'm working on a rather long transformation in Kettle, and I put some steps in the middle of the flow.
So now my step metrics are all scrambled up and very hard to read.
Is there any way I could sort these back into order (following the direction of the flow)?
If you click on # in the "Step metrics" tab, it will sort the steps by their order. The visualisation in the "Metrics" tab will also be sorted.
Steps are stored in the order of insertion. The step metrics grid allows the steps to be shown in a different order by clicking on the column header, but since a transformation graph can be meshed, it's generally not possible to sort the steps in the order of the data flow. Only a single path in your graph could be sorted by analyzing the hops, anyway.
What you can do is change the name of each step and add a number in front of it. Then sort by name.
Boring, I know, but it is what we have...
It's unfortunate that assigning a step number isn't an option. And maybe it differs by version, but in 8.3 the step metrics # column assignment seems to be somewhat based on the step's order in the flow (which of course breaks down when the flow branches), not by when the step was added. It does ring a bell that it was based on when the step was added in past versions though.
It's also unfortunate that the sort by step name is case sensitive - so steps that start with "a" come after steps that start with "Z". Perhaps there's a way to work that behavior into a naming strategy that actually leverages that for some benefit, but I haven't found one.
So I'm inclined to agree with @recacon: using a number prefix for the step names and then sorting execution metrics by step name seems like the best option. I haven't done much of this yet, since without a team standard it's unlikely to be maintained.
For the few times I have done it, I've used a three digit numeric prefix where values are lowest at the start of the flow and increase farther down the path. To reduce the need for re-sequencing when steps are added later, I start out incrementing by ten from one step to the next, then use a number between when splitting hops later on.
I also increment the 100's digit for branches in the flow or if there's a significant section of logic for a particular purpose.

looping in a Kettle transformation

I want to repetitively execute an SQL query looking like this:
SELECT '${date.i}' AS d,
       COUNT(DISTINCT xid) AS n
FROM table
WHERE date BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
           AND '${date.i}';
It is basically a grouping by time spans, except that the spans overlap, which prevents the use of GROUP BY.
That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop transformation" would perhaps be implemented implicitly by just not re-entering the loop.
Here's the flow chart:
Flow of the transformation:
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
In Kettle you want to avoid loops, as they can cause real trouble in transforms. Instead, you should do this by adding a step that puts a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transform is only sequential at the row level: rows may be processed in parallel and out of order, and the only guarantee is that each row will pass through the steps in order. Because of this, a loop in a transform does not behave as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.

SSIS: importing files some with column names, some without

Presumably due to inconsistent configuration of logging devices, I need to load via SSIS a collection of csv files that will sometimes have a first row with column names and will sometimes not. The file format is otherwise identical.
There seems to be a chance that the logging configuration can be standardized, so I don't want to waste programming time on a script task that opens each file, determines whether it has a header row, and then processes it differently depending on the result.
Rather, I would like to specify something like Destination.MaxNumberOfErrors that would allow up to one error row per file (so if the only problem in the file was the header, it would not fail). The Flat File Source error is fatal, though, so I don't see a way of getting it to keep going:
The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.
My best choice seems to be to simply ignore the first data row for now and wait to see if a more uniform configuration can be achieved. Of course, the dataset is invalid while this strategy is in place. I should add that the data is very big, so the ETL routines need to be as efficient as possible. In my opinion this contraindicates any file parsing or conditional splitting if there is any alternative.
The question is: is there a way to configure the Flat File Source to continue past this fatal error?
Yes there is!
In the "Error Output" page in the editor, change the Error response for each row to "Redirect row". Then you can trap the problem rows (the headers, in your case) by taking them as a single column through the error output of your source.
If you can assume the values for header names would never appear in your data, then define your flat file connection manager as having no headers. The first step inside your data flow would check the values of column 1-N vs the header row values. Only let the data flow through if the values don't match.
Is there something more complex to the problem than that?
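For what it's worth, the idea in the last answer sketched outside SSIS (Scala here purely for illustration; the header line and file name are hypothetical):
import scala.io.Source
// Read the file as if it had no header row, then drop any line that
// exactly matches the known header values:
val header = "timestamp,device,value" // hypothetical header line
val dataRows = Source.fromFile("log.csv") // hypothetical file name
  .getLines()
  .filter(_ != header)
dataRows.foreach(println)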