Pentaho Kettle: changing metadata at run-time

I wonder if Kettle (AKA Pentaho PDI) supports metadata changing at run-time.
I've implemented a couple of custom plugins:
The first plugin sends data to the second plugin. The metadata of the rows it outputs can change when certain conditions occur. In practice, this means that processRow() starts with one metadata layout and, after a while, switches to another. Of course, the row sent through putRow() is always synchronized with the related metadata.
The second plugin receives data from the first plugin and calls getInputRowMeta() to determine the metadata of the received row. However, that metadata does not seem to be synchronized with the received row.
Given the results of this simple example, I wonder whether the Kettle engine supports this kind of run-time behavior, i.e. whether getInputRowMeta() returns the correct metadata for the specific row that has been received.
Can anybody provide evidence that changing metadata is actually not possible? Otherwise, is there a safe way to get the metadata of the specific row received in processRow()?

From page 616 of the book Pentaho Kettle Solutions:
"The calculation of the output row metadata is something that needs to happen once and only once because the layout of all the output rows needs to be the same."
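One way to live with a single, fixed output layout is to decide on the full set of fields up front and fill missing ones with nulls. The following is a language-neutral illustration in Python, not Kettle code; OUTPUT_FIELDS and to_fixed_layout are hypothetical names chosen for the example.

```python
from typing import Any, Dict, List, Optional

# Hypothetical fixed layout, decided once before the first row is emitted.
OUTPUT_FIELDS: List[str] = ["id", "name", "amount"]


def to_fixed_layout(record: Dict[str, Any]) -> List[Optional[Any]]:
    """Project a record onto the fixed layout, using None for absent fields."""
    return [record.get(field) for field in OUTPUT_FIELDS]


# Records with differing fields still produce rows of identical shape.
rows = [
    to_fixed_layout(record)
    for record in ({"id": 1, "name": "a"}, {"id": 2, "amount": 9.5})
]
```

Every row then has the same number of values in the same order, so one metadata description is valid for the whole stream.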

Related

Is there any command line client to do data entry in DHIS2?

Is there a command line client for doing data entry in DHIS2?
I found one named dish (https://github.com/baosystems/dish2/), but it is only meant for simplifying common tasks and is suited to batch metadata operations and system maintenance operations.
I want to enter data into data elements directly. Is that possible? If not, is there an alternative method?
As far as I know, there are no command line clients for data entry in DHIS2. There are, however, options to import data into DHIS2 in XML, JSON or CSV format. So one option is to create the data in one of these formats first, then use the API to import it.
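As a rough sketch of the "create the data first, then import it" route, the snippet below builds a JSON data value set with the Python standard library. The field names follow the data value set format as I understand it from the DHIS2 Web API documentation (typically posted to the dataValueSets resource); the UIDs and the period are placeholders, and build_data_value_set is a hypothetical helper. Check the details against the documentation for your DHIS2 version before relying on it.

```python
import json
from typing import Dict


def build_data_value_set(
    data_set: str, period: str, org_unit: str, values: Dict[str, str]
) -> dict:
    """Assemble a data value set for one data set, period and org unit."""
    return {
        "dataSet": data_set,
        "period": period,
        "orgUnit": org_unit,
        "dataValues": [
            {"dataElement": element, "value": value}
            for element, value in values.items()
        ],
    }


# Placeholder identifiers; real ones come from your DHIS2 instance.
payload = build_data_value_set(
    "DATASET_UID", "202401", "ORGUNIT_UID", {"DATAELEMENT_UID": "12"}
)
body = json.dumps(payload)  # request body for the import call
```

The resulting body can then be sent with any HTTP client, such as curl, from the command line.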
When you say you want to enter data into data elements directly, I assume you are referring to actual data and not metadata.
There is no way to use the DHIS2 API to add data directly to a data element. The reason for this is that data elements are either connected to a data set or, if you are using the tracker models, a program stage. A single data element can be connected to multiple data sets or program stages, so adding data directly to a data element wouldn't make sense.
You can however do data entry for a data element, but you need to go through either a data set or program stage that uses the data element.
What is your use-case for needing a command line client for this? Maybe I know of another solution that would help you.

How to display a status depending on the data flow position

Consider for example this modified Simple TCP sample program:
How can I display the current state of the program like
Wait for Connection
Connected
Connection terminated
on the front panel, depending on where the "data flow" currently is.
The easiest way to do this is to place a string indicator on your front panel and write messages to a local variable of this indicator at each point where you want to see a status update.
You need to keep in mind how LabVIEW dataflow works: code will execute as soon as the data it depends on becomes available. Sometimes you can use existing structures to enforce this - for example, if you put a string constant inside your loop and wire it to a local variable terminal outside the loop, the write will only happen after the loop exits. Sometimes you may need to enforce that dataflow artificially, for example by placing your operation inside a sequence frame and connecting a wire to the border of the sequence: then what's inside the sequence will only happen after data arrives on that wire. (This is about the only thing you should use a sequence for!)
This method is not guaranteed to be deterministic, but it's usually good enough for giving a simple status indication to the user.
A better version of the above would be to send the status messages on a queue or notifier which you read, and update the status indicator, in a separate loop. The queue and notifier write functions have error terminals which can help you to enforce sequence. A notifier is like the local variable in that you will only see the most recent update; a queue keeps all the data you write to it in the right order so would be more suitable if you want to log all the updates to a scrolling list or log file. With this solution you could add more features: for example the read loop could add a timestamp in front of each message so you could see how recent it was.
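LabVIEW code is graphical, so the following is only a text-language analogy of the queue approach, written in Python: one loop posts status messages, a separate loop reads them in order and timestamps them. The names status_queue, status_log and display_loop are made up for the illustration.

```python
import queue
import threading
from datetime import datetime

status_queue = queue.Queue()
status_log = []  # stands in for the front-panel indicator or scrolling list


def display_loop():
    """Separate loop: read messages in order and timestamp each one."""
    while True:
        message = status_queue.get()
        if message is None:  # sentinel value: stop the loop
            break
        status_log.append((datetime.now(), message))


reader = threading.Thread(target=display_loop)
reader.start()

# The "main" code only enqueues messages; it never touches the display.
for status in ("Wait for Connection", "Connected", "Connection terminated"):
    status_queue.put(status)

status_queue.put(None)
reader.join()
```

Because the queue preserves order and loses nothing, status_log ends up holding all three messages in the order they were sent.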
A really good solution to this general problem is to use a design pattern based on a state machine. Now your program flow is clearly organised into different states and it's very easy to add in functionality like sending a different message from each state. There are good examples and project templates for these design patterns included with recent versions of LabVIEW.
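The state machine idea can be illustrated in a text language as well. This Python sketch is an analogy rather than LabVIEW code: each state has its own status message, and a simple loop steps through the transitions. The State enum, the transition table and run() are assumptions made for the example; a real TCP program would choose the next state based on events.

```python
from enum import Enum, auto


class State(Enum):
    WAIT_FOR_CONNECTION = auto()
    CONNECTED = auto()
    TERMINATED = auto()


STATUS_TEXT = {
    State.WAIT_FOR_CONNECTION: "Wait for Connection",
    State.CONNECTED: "Connected",
    State.TERMINATED: "Connection terminated",
}

# Fixed transitions, for illustration only.
TRANSITIONS = {
    State.WAIT_FOR_CONNECTION: State.CONNECTED,
    State.CONNECTED: State.TERMINATED,
}


def run(report):
    """Step through the states, reporting the status text of each one."""
    state = State.WAIT_FOR_CONNECTION
    while True:
        report(STATUS_TEXT[state])
        if state is State.TERMINATED:
            break
        state = TRANSITIONS[state]


messages = []
run(messages.append)
```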
You should be able to find more information on any of the terms mentioned above (local variable, sequence, queue, notifier, state machine) in the LabVIEW help or on the NI website.

How to add more data to be stored in jenkins rest api

To keep the question simple: I know that I can get build information from https://jenkins_server/...///api/json|xml|python, and I get a lot of information for that build record.
However, I want to add more information to that build record, for example the Docker image created, or the tickets or files changed since the last build (to create release notes), and so on. How do I do that?
For now, I use a script to create a JSON file as an artifact and read that JSON file to get this information, but that seems redundant if I could add more data to the Jenkins build object directly.
The Jenkins remote access API is designed to provide access to generic Jenkins-internal information, like build numbers, timestamps, fingerprints etc.
If you want to add your own data there, then you must extend Jenkins accordingly, e.g., by designing a plugin that advertises your (custom) information items as standard Jenkins-"internal" data. If you want to do that, you may want to have a look at the way fingerprint information is handled (I found that quite instructive).
However, I'd recommend that you stick with your current approach, and keep generic Jenkins-internal information separated from Job-specific data. It is less effort and clearly separates your own data from Jenkins' data.
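For reference, the artifact approach can be as small as the sketch below. It writes the job-specific data to a JSON file in the workspace and builds the URL under which an archived artifact is normally served (the build URL followed by artifact/ and the relative path). The file name, the example values and both helper functions are hypothetical, and the job still has to archive the file for the URL to work.

```python
import json
import tempfile
from pathlib import Path


def write_build_info(workspace: Path, info: dict, name: str = "build-info.json") -> Path:
    """Write job-specific build data to a JSON file to be archived."""
    path = workspace / name
    path.write_text(json.dumps(info, indent=2, sort_keys=True))
    return path


def artifact_url(build_url: str, relative_path: str) -> str:
    """Build the URL of an archived artifact for a given build."""
    return f"{build_url.rstrip('/')}/artifact/{relative_path}"


# Example values only; a real job would use its own workspace and build URL.
workspace = Path(tempfile.mkdtemp())
info_path = write_build_info(
    workspace, {"docker_image": "example/app:1.0", "tickets": ["PROJ-1"]}
)
url = artifact_url("https://jenkins.example.com/job/demo/42/", info_path.name)
```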

Audit and error handling in SSIS

We are starting a project to handle big, big flat files. These files are kind of 'normalized' and we want to process them first to an intermediate file.
I would like to see a custom table for audit rows and a custom table for errors that are thrown during processing. Also errors must be stored in the Event Log.
What are the best practices for audit and error handling in general for SSIS (VS2008)?
(edit)
We have built what I think is a very elegant solution by designing one master package. This package runs a child package (the one originally intended). The master package subscribes to three events: OnInformation, OnWarning and OnError. These events are routed to a generic audit and logging service that makes calls to the Enterprise Library Logging and Exception Handling blocks.
I would recommend adopting the following philosophy for stable ETL processes that load from files:
Never cast anything in the connector; just import the fields as nvarchars of the maximum length they can reach.
Add a row number to each row as it is read, so that casting errors can be traced back to the source row.
Cast and control each column to your specification.
If a row cannot be read at some stage, you will not know the index, but you will know that the file is malformed (extremely rare in my experience, for half transferred files), and it should be rejected anyway.
A quick screenshot of a part of a file loading process shows how the rejection (after assigning row_id) can work (link to dataflow image). To this you can add further countless checks (duplicates...) and even have a repository for the loaded files to check upon the rejects and whatever else you might want to control (Link to control flow image).
In some of my processes, I even use a flat file connector and just import each row as a bulk text and then split it in columns with an intermediate script component, allowing for different versions of the columns in the files.
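The same philosophy can be shown outside SSIS. The Python sketch below reads every field as text, assigns a row_id while reading, then casts each column and routes rows that fail to a reject list. The sample data, the column names and the CASTS table are invented for the illustration.

```python
import csv
import io
from datetime import datetime

# Invented sample file: row 2 has a bad id, row 3 a bad amount.
RAW = "id,amount,booked\n1,10.50,2020-01-31\nx,3.00,2020-02-01\n3,abc,2020-02-02\n"

# One cast per column; everything is read as text first.
CASTS = {
    "id": int,
    "amount": float,
    "booked": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
}


def load(text):
    """Return (accepted, rejected) rows, each tagged with its row_id."""
    accepted, rejected = [], []
    reader = csv.DictReader(io.StringIO(text))
    for row_id, raw in enumerate(reader, start=1):
        try:
            typed = {col: cast(raw[col]) for col, cast in CASTS.items()}
            accepted.append({"row_id": row_id, **typed})
        except (ValueError, TypeError) as exc:
            rejected.append({"row_id": row_id, "raw": raw, "error": str(exc)})
    return accepted, rejected


accepted, rejected = load(RAW)
```

Here row 1 is accepted, while rows 2 and 3 land in rejected together with their row_id and the error text, ready to be written to an error table.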
Anyway, sorry not to be more detailed (due to my status I can't add more links or any images), but I hope that you understand the concept.
Regards,
Francisco.

Is it possible to force an error in an Integration Services data flow to demonstrate its rollback?

I have been tasked with demoing how Integration Services handles an error during a data flow to show that no data makes it into the destination. This is an existing package and I want to limit the code changes to the package as much as possible (since this is most likely a one time deal).
The scenario we are trying to understand is a "systemic" failure: the source file disappears midstream, the file server loses power, etc.
I know I can make this happen by having the Error Output of the source set to Failure and introducing bad data but I would like to do something lighter than that.
I suppose I could add a Script Transform task and look for a certain value and throw an error but I was hoping someone has come up with something easier / more elegant.
Thanks,
Matt
Mess up the file that you are trying to import by pasting in some bad data or by saving it in another encoding, such as UTF-8.
We always have a task at the end that closes the data flow in our metadata tables. To test errors, I simply remove the ? that stands for the variable passed to the stored proc it runs. It is easy to do and easy to change back, and it doesn't mess up anything data-wise, as our error trapping then closes the data flow with an error. You could do something similar by adding a task that calls a stored proc with an input variable but assigns no parameters to it, so that it fails. Once the test is done, simply disable that task.
Data will make it to the destination if the package is not running as a transaction. If you want to prevent partial data from being loaded, you have to use transactions. There is an option to set the end result of a control flow item to "failed" irrespective of the actual result, but this is not available for data flow items. You will have to either produce an actual error in the data or code a situation that creates an error. There is no other way...
Could you use the transaction-level property of the package?
If the data flow fails, all the data is rolled back from the target.
The data is committed to the target only if the data flow succeeds.
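The all-or-nothing behavior described above is general transaction semantics, and it can be demonstrated without SSIS. The following Python sketch uses an in-memory SQLite database and a deliberately raised error partway through a load; the table name, the load_rows helper and the fail_at parameter are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.commit()


def load_rows(connection, rows, fail_at=None):
    """Insert rows in one transaction; optionally fail at a given index."""
    try:
        with connection:  # commits on success, rolls back on an exception
            for index, (row_id, value) in enumerate(rows):
                if index == fail_at:
                    raise RuntimeError("simulated systemic failure")
                connection.execute(
                    "INSERT INTO target VALUES (?, ?)", (row_id, value)
                )
    except RuntimeError:
        return False
    return True


rows = [(1, "a"), (2, "b"), (3, "c")]

# Fail after two rows have been inserted: nothing should remain.
failed_ok = load_rows(conn, rows, fail_at=2)
count_after_failure = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]

# Same load without the forced error: all rows are committed.
success_ok = load_rows(conn, rows)
count_after_success = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

The first load inserts two rows before the error, yet the table is empty afterwards because the transaction was rolled back; only the second, error-free load leaves rows in the target.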