Pentaho: why does the XML Output block put out CSV? - sql

I am good with XML but a beginner with Pentaho, and I am stumbling over connecting the pieces.
The overall goal is to run an SQL query and format the output with XSLT into a nice email body. I use the XML Output block to get the SQL query result as XML; it does save the data in XML format to a file, but it passes the data to the next block as CSV. I found this out by hooking up a temporary Text File Output block.
Why would a block named XML Output produce CSV output?
How can I get the XML to the XSL Translation block without needing to save to a file on disk in between?
Are there simpler ways of getting a nice email message body from an SQL SELECT query?
Here is the Pentaho Transformation where Report Totals does a simple SQL SELECT:

All steps in PDI output rows of data, with the same structure (field names, data types, etc.).
Each step is responsible for picking up those rows of data and doing something with them.
Your XML output step will pick up the rows of data coming from the table input step and write them as XML to an external file. But it looks like what you want is to create an XML field to use later in the XSL transformation step and then email. The step you're looking for is probably Add XML, not XML output.
Add XML will take the incoming rows and produce a set of XML rows as strings. Those can then be further manipulated by grouping them together into a single row, then inserted into a root XML element, then transformed and appended to the email body.
You can right click each of those steps and click Preview to see what data is going out of each step; it should help you make sense of PDI's internals.
The Text file output step writes out whatever data comes in as a CSV. As no XML fields are coming in, no XML is written out (the XML Output step doesn't add an XML field to the data stream; it only writes to an external file).

Related

Azure Data Factory 2: How to split a file into multiple output files

I'm using Azure Data Factory and am looking for the complement to the "Lookup" activity. Basically I want to be able to write a single line to a file.
Here's the setup:
Read from a CSV file in blob store using a Lookup activity
Connect the output of that to a For Each
Within the For Each, take each record (a line from the file read by the Lookup activity) and write it to a distinct file, named dynamically.
Any clues on how to accomplish that?
Use a Data Flow: use the Derived Column activity to create a filename column, then use that filename column in the sink. Details on how to implement dynamic filenames in ADF are described here: https://kromerbigdata.com/2019/04/05/dynamic-file-names-in-adf-with-mapping-data-flows/
Data Flow would probably be better for this, but as a quick hack, you can do the following to read the text file line by line in a pipeline:
Define your source dataset to output a line as a single column. Normally I would use "NoDelimiter" for this, but that isn't supported by Lookup. As a workaround, define it with an incorrect Column Delimiter (like | or \t for a CSV file). You should also go to the Schema tab, and CLEAR the schema. This will generate a column in the output named "Prop_0".
In the foreach activity, set the Items to the Lookup's "output.value" and check "Sequential".
Inside the foreach, you can use item().Prop_0 to grab the text of the line:
To the best of my understanding, creating a blob isn't directly supported by pipelines [hence my suggestion above to look into Data Flow]. It is, however, very simple to do in Logic Apps. If I was tackling this problem, I would create a logic app with an HTTP Request Received trigger, then call it from ADF with a Web activity and send the text line and dynamic file name in the payload.

Using SSIS to denormalize the data from an XML file and load it into SQL Server

I am new to SSIS. I am trying to load data from an XML file into a SQL Server table. I have created the project and can transform and load the data into the table, but with one issue. Below is a sample of the XML data:
<EventLocationInfo>
  <Facility>NY 31</Facility>
  <Direction>eastbound</Direction>
  <City>Cicero</City>
  <County>Onondaga</County>
  <State>NY</State>
  <LocationDetails>
    <LocationItem>
      <Intersections>
        <PrimaryLoc>I-81</PrimaryLoc>
        <Article>area of</Article>
      </Intersections>
    </LocationItem>
    <LocationItem>
      <PointCoordinates Datum="NAD83">
        <Lat>43.1755981445313</Lat>
        <Lon>-76.1159973144531</Lon>
      </PointCoordinates>
    </LocationItem>
    <LocationItem>
      <AssociatedCities>
        <PrimaryCity>Cicero</PrimaryCity>
        <Article>area of</Article>
      </AssociatedCities>
    </LocationItem>
  </LocationDetails>
</EventLocationInfo>
The result I am getting is like this:
Is it possible to generate only one row instead of three different rows? If so, which Data Flow transformation can I use to get this result?
Please help, thanks in advance.
Brijesh

Pentaho Data Integration: How to select the output of a SQL query as a filename for Microsoft Excel Input.

I have files abc.xlsx, 1234.xlsx, and xyz.xlsx in some folder. My requirement is to develop a transformation where the Microsoft Excel Input step in PDI (Pentaho Data Integration) should pick a file based only on the output of a SQL query. If the query output is abc.xlsx, the Microsoft Excel Input should pick up abc.xlsx for further processing. How do I achieve this? Would really appreciate your help. Thanks.
Transformations in Kettle run asynchronously, so you're probably looking at needing a job for this.
Files to create
Create a transformation that performs the SQL query you're looking for and populates a variable based on the result
Create a transformation that pulls data from the Excel file, using the variable populated as the filename
Create a job that executes the first transformation, then steps into the second transformation
Jobs run sequentially, so it will execute the first transformation, perform the query, get the result, and set a variable. Variables need to be set and retrieved in different transformations because of their asynchronous nature. This is the reason for the second transformation; the job won't step into the second transformation until the first one is done running (therefore, not until the variable is populated).
This is all assuming you only want to run the transformation once, expecting a single result from the query. If you want to loop it, pulling data from a set, then the setup is a little bit different.
The Excel input step has an "Accept filenames from previous step" option. You can have a Table input step build the full path of the file you want to read (or build it later from the base dir and the short filename), pass the filename to the Excel input, tick that box, and specify the step and the field you want to use for the filename; a sketch of such a query follows below.
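As an illustration only, here is a minimal sketch of that Table input query; the table name file_queue, its columns, and the base directory are placeholders for whatever your own query returns:
-- Hypothetical query: file_queue, short_filename, and the base dir are placeholders.
SELECT 'C:\data\incoming\' + short_filename AS excel_filename
FROM dbo.file_queue;
In the Microsoft Excel Input step you would then tick the box, choose the Table input step as the step to read filenames from, and pick excel_filename as the field to use as the filename.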

Using SSIS to extract an XML representation of table data to a file

I'm trying to use SSIS to extract an XML representation of a query result set to a text file. My query currently produces exactly the XML output I need when I run it in SSMS. I've tried every trick I can find to use this result set in an SSIS package to create a file.
Using a data flow to port an OLE DB Source to a Flat File destination doesn't work because the output of an XML query is treated as TEXT, and SSIS can't push TEXT, NTEXT or IMAGE to a file destination.
I've then tried using an Execute SQL Task to fill a user variable and a Script Task (written in C#) to write the contents of that variable to a file, but the user variable is always empty. I don't know for sure, but I suspect this is, again, because the XML is treated as TEXT or IMAGE and the user variable doesn't handle this.
The query is in this form:
SELECT *
FROM dataTable
WHERE dataTable.FIELD = 'Value'
FOR XML AUTO, ROOT('RootVal')
The resulting dataset is well formed XML, but I can't figure out how to get it from result set to file.
It's a relatively easy task for me to write a console app to do this in C# 4.0, but restrictions require me to at least prove it CAN'T be done with SSIS before I write the console app and a scheduler.
Sorry to spoil, but there's an SSIS option for you: Export Column Transformation.
I defined an OLE DB query with:
SELECT
    *
FROM
(
    SELECT * FROM dbo.spt_values FOR XML AUTO, ROOT('RootVal')
) D (xml_node)
CROSS APPLY
(
    SELECT 'C:\ssisdata\so_xmlExtract.xml'
) F (fileName)
This results in 1 row and 2 columns in the data flow. I then attached the Export Column Transformation and wired it up with xml_node as the Extract Column and fileName as the File Path Column.
Mostly truncated results follow
<RootVal>
  <dbo.spt_values name="rpc" number="1" type="A " status="0"/>
  <dbo.spt_values name="dist" number="8" type="A " status="0"/>
  <dbo.spt_values name="deferred" number="8192" type="V " low="0" high="1" status="0"/>
</RootVal>
A more detailed answer, with pictures, is available on this Q&A: Export Varbinary(max) column with SSIS
BillInKC's answer is the best I've ever seen, but the SQL can be simplified (no need for CROSS APPLY):
SELECT X.*, 'output.xml' AS filename
FROM (SELECT * FROM #t FOR XML PATH('item'), ROOT('itemList')) AS X (xml_node)
It will output the same structure:
xml_node filename
-------------------------------------------------- ----------
<itemList><item><num>1000</num></item></itemList> output.xml
(1 row(s) affected)
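For reference, here is a minimal setup that reproduces that output; the answer above does not show how #t is defined, so a single-column temp table is assumed here:
-- Assumed setup: #t is not shown in the answer; a single int column is inferred from the output.
CREATE TABLE #t (num INT);
INSERT INTO #t (num) VALUES (1000);

SELECT X.*, 'output.xml' AS filename
FROM (SELECT * FROM #t FOR XML PATH('item'), ROOT('itemList')) AS X (xml_node);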

Importing and validating XML file using SSIS or just plain T-SQL?

What is the best practice when importing and validating an XML file to a single table (flattened) in SQL Server?
I have an XML file which contains about 15 complex types, all related to a single parent element.
The SSIS design could look like this:
But it's getting very complicated with all those (15) joins.
Is it maybe a better idea to just write T-SQL code to:
1) Import the XML into a column which is of type XML and is linked to an XSD schema.
2) Use this code:
TRUNCATE TABLE XML_Import;

INSERT INTO XML_Import (ImportDateTime, XmlData)
SELECT GETDATE(), XmlData
FROM
(
    SELECT *
    FROM OPENROWSET (BULK 'c:\XML-Data.xml', SINGLE_BLOB) AS XMLDATA
) AS FileImport (XmlData);

DELETE FROM dbo.UserFlat;

INSERT INTO dbo.UserFlat
SELECT
    [user].value('(UserIdentifier)[1]', 'varchar(8)') AS UserIdentifier,
    [user].value('(Emailaddress)[1]', 'varchar(70)') AS Emailaddress,
    businessaddress.value('(Fax)[1]', 'varchar(70)') AS Fax,
    employment.value('(EmploymentData)[1]', 'varchar(8)') AS EmploymentData
    -- More values here ...
FROM
    XML_Import CROSS APPLY
    XmlData.nodes('//user') AS [User]([user]) CROSS APPLY
    [user].nodes('BusinessAddress') AS BusinessAddress(businessaddress) CROSS APPLY
    [user].nodes('Employment') AS Employment(employment)
    -- More 'joins' here ...
to fill the 'UserFlat' table? Some disadvantages are that you have to manually type the SQL code, but the advantage here is that I have more direct control over how the elements are processed and converted. But I don't know if there are any performance differences between processing XML in SSIS and processing the XML with T-SQL XML statements.
Note that some other requirements are:
Error handling: in case of an error, an email must be sent to a person.
Able to process multiple input files with a specific file name pattern: XML_{date}_{time}.xml
Move the processed XML files to a different folder.
Please advise.
Based on the requirements that you have mentioned, I would say that you can use the best of both worlds (T-SQL & SSIS).
I feel that T-SQL gives more flexibility in loading the XML data that you have described in the question.
There are a lot of different ways you can achieve this. Here is one possible option:
Create a Stored Procedure that would take the path of the XML file as input parameter.
Perform your XML data load operation using the T-SQL way which you feel is easier.
Use SSIS package to perform error handling, file processing, archiving and send email.
Use the logging feature available in SSIS. It just requires simple configuration. Here is a sample that shows how to configure logging in SSIS: How to track status of rows successfully processed or failed in SSIS data flow task?
A sample mock-up of your flow would be as shown below in the screenshot. Loop the files using a Foreach Loop container. Pass the file path as a parameter to an Execute SQL Task, which in turn would call the T-SQL that you had mentioned. After processing the file, use the File System Task to move the file to an archive folder.
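As a rough sketch of what that stored procedure might look like (the procedure name is a placeholder, dbo.XML_Import is the staging table from the question, and OPENROWSET(BULK ...) only accepts a literal path, hence the dynamic SQL):
-- Hypothetical sketch: usp_ImportUserXml is a placeholder name.
CREATE PROCEDURE dbo.usp_ImportUserXml
    @FilePath NVARCHAR(260)
AS
BEGIN
    SET NOCOUNT ON;

    -- OPENROWSET(BULK ...) requires a literal file path, so the statement is built dynamically.
    DECLARE @sql NVARCHAR(MAX) = N'
        INSERT INTO dbo.XML_Import (ImportDateTime, XmlData)
        SELECT GETDATE(), CAST(BulkColumn AS XML)
        FROM OPENROWSET (BULK ''' + REPLACE(@FilePath, N'''', N'''''') + N''', SINGLE_BLOB) AS FileImport;';

    EXEC (@sql);

    -- Shred XML_Import into dbo.UserFlat here, using the CROSS APPLY / .nodes() / .value()
    -- statement from the question.
END
The Execute SQL Task inside the Foreach Loop container would then run something like EXEC dbo.usp_ImportUserXml ?, with the parameter mapped to the package variable that holds the current file path.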
The sample used in SSIS reading multiple xml files from folder shows how to loop through files using a Foreach Loop container. It loops through xml files but uses a Data Flow Task because those xml files are in a simpler format.
Sample used in How to send the records from a table in an e-mail body using SSIS package? shows how to send e-mail using Send Mail Task.
Sample used in How do I move files to an archive folder after the files have been processed? shows how to move files to an Archive folder.
Sample used in Branching after a file system task in SSIS without failing the package shows how to continue package execution even after a particular task fails. This will help you to proceed with package execution even if the Foreach Loop fails, so you can still send the email. The blue arrow in the screenshot indicates 'on completion' of the previous task.
Sample used in How do I pick the most recently created folder using Foreach loop container in SSIS package? shows how to perform pattern matching.
Hope that gives you an idea.