I am using the Get Data from XML step to loop (by XPath) over all fields in an XML file, but the rows keep flowing through all the subsequent steps in the transformation. How can I stop the loop after all rows of my XML file have been processed?
PS: I am using Kettle 6.1
I am good with XML but a beginner with Pentaho, and I am stumbling over connecting the pieces.
The overall goal is to run an SQL query and format the output with XSLT into a nice email body. I use the XML Output step to get the SQL query result as XML; that is, it saves the XML to a file, but it passes the data on as CSV to the next step. I found this out by hooking up a temporary Text File Output step.
Why would a step named XML Output produce CSV output?
How can I get the XML to the XSL Transformation step without needing to save it to a file on disk in between?
Are there simpler ways of getting a nice email message body from an SQL SELECT query?
Here is the Pentaho transformation where Report Totals does a simple SQL SELECT.
All steps in PDI output rows of data, and every row from a given step has the same structure (field names, data types, etc.).
Each step is responsible for picking up those rows of data and doing something with them.
Your XML Output step will pick up the rows of data coming from the Table input step and write them as XML to an external file. But it looks like what you want is to create an XML field to use later in the XSL Transformation step and then email. The step you're looking for is probably Add XML, not XML Output.
Add XML will take the incoming rows and produce a set of XML rows as strings. Those can then be further manipulated by grouping them together into a single row, inserting them into a root XML element, and then transforming the result and appending it to the email body.
You can right-click each of those steps and click Preview to see what data is going out of each step; that should help you make sense of PDI's internals.
The Text file output step outputs whatever data comes in as a CSV. As no XML fields are coming in, no XML is written out (the XML Output step doesn't add an XML field to the data stream; it only writes to an external file).
I have a Kettle transformation which has a CSV file input step. I would like the transformation to skip all the subsequent steps if the CSV file has no data (is empty). Is there a way to achieve this?
Try the "Detect empty stream" step and check for a NULL condition on any one of the columns from the CSV.
Here is the relevant link from the PDI wiki:
http://wiki.pentaho.com/display/EAI/Detect+empty+stream
Alternatively, you could use the "Get File Names" step. This returns many fields, one of which is "size". After it you can put a "Filter rows" step that sends the flow to a "Dummy" step if the size is equal to 0.
I have files abc.xlsx, 1234.xlsx, and xyz.xlsx in some folder. My requirement is to develop a transformation where the Microsoft Excel Input step in PDI (Pentaho Data Integration) picks a file based on the output of an SQL query. If the query output is abc.xlsx, the Microsoft Excel Input step should pick up abc.xlsx for further processing. How do I achieve this? I would really appreciate your help. Thanks.
Steps in a Kettle transformation all run in parallel, so you're probably looking at needing a job for this.
Files to create
Create a transformation that performs the SQL query you're looking for and populates a variable based on the result
Create a transformation that pulls data from the Excel file, using the variable populated as the filename
Create a job that executes the first transformation, then steps into the second transformation
Jobs run sequentially, so the job will execute the first transformation, perform the query, get the result, and set a variable. Variables need to be set and read in different transformations because all steps within a transformation run in parallel. This is the reason for the second transformation; the job won't step into it until the first one is done running (and therefore not until the variable is populated).
This all assumes you only want to run the transformation once, expecting a single result from the query. If you want to loop it, pulling data from a result set, then the setup is a little different.
The Excel input step has an "Accept filenames from previous step" option. You can have a Table input step build the full path of the file you want to read (or build it later from the base directory and the short filename), pass the filename to the Excel input, tick that box, and specify the step and the field you want to use for the filename.
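For illustration, the Table input query might look something like this (the table, columns, and base directory here are hypothetical; adapt them to wherever the filename actually comes from):

-- Builds the full path the Excel input step will read from.
SELECT
    'C:\data\incoming\' + file_name AS full_path  -- e.g. 'C:\data\incoming\abc.xlsx'
FROM dbo.files_to_process
WHERE status = 'pending';

In the Excel input step's Files tab, you would then tick "Accept filenames from previous step" and point it at the Table input step and the full_path field.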
I'm trying to read multiple XML files from a folder, compile all the data they contain (all of them have the same XML structure), and then save that data in a CSV file.
I already have a 'read-files' transformation with the steps Get File Names and Copy Rows to Result to get all the XML files. (It is working: I print a file with all the file names.)
Then I enter a 'for-each-file' job which has a transformation with the Get Rows from Result step, and then another job to process those files.
I think I'm losing information between the 'read-files' transformation and the transformation in the 'for-each-file' job that gets all the rows. (I print another file with all the file names, but it is empty.)
Can you tell me if I'm thinking about this the right way? Do I have to set some variables, or is there some option I left disabled? Thanks.
Here is an example of how to process a Kettle transformation once per filename:
http://www.timbert.net/doku.php?id=techie:kettle:jobs:processtransonceperfile
What is the best practice for importing and validating an XML file into a single (flattened) table in SQL Server?
I have an XML file which contains about 15 complex types, all related to a single parent element.
The SSIS design could look like this:
But it's getting very complicated with all those (15) joins.
Is it maybe a better idea to just write T-SQL code to:
1) Import the XML into a column of type XML that is bound to an XSD schema (see the sketch after this question).
2) Use this code:
-- Stage the raw XML file into the import table.
TRUNCATE TABLE XML_Import;

INSERT INTO XML_Import (ImportDateTime, XmlData)
SELECT GETDATE(), XMLDATA
FROM (
    SELECT * FROM OPENROWSET(BULK 'c:\XML-Data.xml', SINGLE_BLOB) AS XMLDATA
) AS FileImport (XMLDATA);
-- Reload the flattened table from the staged XML.
DELETE FROM dbo.UserFlat;

INSERT INTO dbo.UserFlat
SELECT
    u.value('(UserIdentifier)[1]', 'varchar(8)')  AS UserIdentifier,
    u.value('(Emailaddress)[1]',   'varchar(70)') AS Emailaddress,
    b.value('(Fax)[1]',            'varchar(70)') AS Fax,
    e.value('(EmploymentData)[1]', 'varchar(8)')  AS EmploymentData
    -- More values here ...
FROM XML_Import
CROSS APPLY XmlData.nodes('//user') AS Users(u)
CROSS APPLY u.nodes('BusinessAddress') AS BusinessAddress(b)
CROSS APPLY u.nodes('Employment') AS Employment(e);
-- More 'joins' here ...
to fill the 'UserFlat' table? One disadvantage is that you have to write the SQL code by hand, but the advantage is that I have more direct control over how the elements are processed and converted. I don't know, though, whether there are any performance differences between processing XML in SSIS and processing it with T-SQL XML statements.
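Here is a minimal self-contained version of the shredding pattern I am considering, which lets me test the nodes()/value() calls before wiring up the full import; the sample XML shape is just my guess at a simplified document:

DECLARE @x XML = N'
<Users>
  <user>
    <UserIdentifier>U0000001</UserIdentifier>
    <Emailaddress>someone@example.com</Emailaddress>
    <BusinessAddress><Fax>555-0100</Fax></BusinessAddress>
    <Employment><EmploymentData>FT</EmploymentData></Employment>
  </user>
</Users>';

-- Shred one <user> row per CROSS APPLY combination, same shape as above.
SELECT
    u.value('(UserIdentifier)[1]', 'varchar(8)')  AS UserIdentifier,
    u.value('(Emailaddress)[1]',   'varchar(70)') AS Emailaddress,
    b.value('(Fax)[1]',            'varchar(70)') AS Fax,
    e.value('(EmploymentData)[1]', 'varchar(8)')  AS EmploymentData
FROM @x.nodes('//user') AS Users(u)
CROSS APPLY u.nodes('BusinessAddress') AS BusinessAddress(b)
CROSS APPLY u.nodes('Employment') AS Employment(e);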
Note that some other requirements are:
Error handling: in case of an error, an email must be sent to a person.
Able to process multiple input files with a specific file name pattern : XML_{date}_{time}.xml
Move the processed XML files to a different folder.
Please advise.
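To make point 1) concrete, this is roughly the kind of typed column I have in mind; the schema content below is only a stub, not the real schema for all 15 complex types:

-- Illustrative XSD stub; the real collection would describe the full user document.
CREATE XML SCHEMA COLLECTION dbo.UserImportSchema AS N'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="user">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="UserIdentifier" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';

-- A typed XML column: documents are validated against the schema on insert.
CREATE TABLE dbo.XML_Import (
    ImportDateTime DATETIME NOT NULL,
    XmlData XML(dbo.UserImportSchema) NULL
);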
Based on the requirements that you have mentioned, I would say that you can use the best of both worlds (T-SQL and SSIS).
I feel that T-SQL gives more flexibility in loading the XML data that you have described in the question.
There are a lot of different ways you can achieve this. Here is one possible option:
Create a stored procedure that takes the path of the XML file as an input parameter (see the sketch after this list).
Inside it, perform your XML data load using the T-SQL approach that you feel is easier.
Use an SSIS package to perform the error handling, file processing, archiving, and email.
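A minimal sketch of such a procedure, reusing the XML_Import table from the question (the procedure name is made up, and the dynamic SQL is needed because OPENROWSET(BULK ...) accepts only a literal file path):

CREATE PROCEDURE dbo.usp_ImportXmlFile
    @FilePath NVARCHAR(260)
AS
BEGIN
    SET NOCOUNT ON;

    -- OPENROWSET(BULK ...) cannot take a variable directly, so the statement
    -- is built dynamically; quotes in the path are doubled to keep it safe.
    DECLARE @sql NVARCHAR(MAX) =
        N'INSERT INTO XML_Import (ImportDateTime, XmlData)
          SELECT GETDATE(), BulkColumn
          FROM OPENROWSET(BULK ''' + REPLACE(@FilePath, N'''', N'''''')
        + N''', SINGLE_BLOB) AS FileImport;';

    EXEC sys.sp_executesql @sql;

    -- The shredding INSERT into dbo.UserFlat from the question would follow here.
END;

The Execute SQL Task in the package can then run something like EXEC dbo.usp_ImportXmlFile ?, mapping the Foreach Loop's file-path variable to the parameter.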
Use the logging feature available in SSIS; it requires only simple configuration. Here is a sample that shows how to configure logging in SSIS: How to track status of rows successfully processed or failed in SSIS data flow task?
A sample mock-up of your flow would be as shown below in the screenshot. Loop over the files using a Foreach Loop container. Pass the file path as a parameter to an Execute SQL Task, which in turn calls the T-SQL that you mentioned. After processing the file, use the File System Task to move the file to an archive folder.
The sample in SSIS reading multiple xml files from folder shows how to loop through files using a Foreach Loop container. It loops through XML files but uses a Data Flow Task because those XML files are in a simpler format.
The sample in How to send the records from a table in an e-mail body using SSIS package? shows how to send e-mail using the Send Mail Task.
The sample in How do I move files to an archive folder after the files have been processed? shows how to move files to an archive folder.
The sample in Branching after a file system task in SSIS without failing the package shows how to continue package execution even after a particular task fails. This will help you proceed with package execution even if the Foreach Loop fails, so you can still send the email. The blue arrow in the screenshot indicates that the next task runs on completion of the previous task, whether it succeeded or failed.
The sample in How do I pick the most recently created folder using Foreach loop container in SSIS package? shows how to perform pattern matching.
Hope that gives you an idea.