Pentaho Data Integration: How to use the output of a SQL query as the filename for Microsoft Excel Input

I have files abc.xlsx, 1234.xlsx, and xyz.xlsx in some folder. My requirement is to develop a transformation where the Microsoft Excel Input step in PDI (Pentaho Data Integration) should pick a file based only on the output of a SQL query. If the query output is abc.xlsx, the Microsoft Excel Input should pick up abc.xlsx for further processing. How do I achieve this? Would really appreciate your help. Thanks.

Transformations in Kettle run all of their steps asynchronously (in parallel), so you will probably need a job for this.
Files to create
1) Create a transformation that performs the SQL query you're looking for and populates a variable with the result.
2) Create a transformation that pulls data from the Excel file, using that variable as the filename.
3) Create a job that executes the first transformation, then steps into the second transformation.
Jobs run sequentially, so it will execute the first transformation, perform the query, get the result, and set a variable. Variables need to be set and retrieved in different transformations because of their asynchronous nature. This is the reason for the second transformation; the job won't step into the second transformation until the first one is done running (therefore, not until the variable is populated).
This is all assuming you only want to run the transformation once, expecting a single result from the query. If you want to loop it, pulling data from a set, then setup is a little bit different.

The Excel input step has an "Accept filenames from previous step" option. You can have a Table input step build the full path of the file you want to read (or build it later from the base directory and the short filename), pass the filename to the Excel input, tick that box, and specify the step and the field you want to use for the filename.

Related

SSIS: Excel data source - if a column does not exist, use another column

I am using a SELECT statement in the Excel source to import only specific columns from Excel.
But I am wondering: is it possible to select data in such a way that I select, for example, the column named Column_1, but if this column does not exist in the Excel file, it tries to select the column named Column_2 instead? Currently, if Column_1 is missing, the data flow task fails.
Use a Script Component and write .NET code to read the Excel file, then check whether Column_1 is available in the file. If the column is not present, use Column_2 as the input. A Script Component in SSIS can act as a source.
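A minimal sketch of what such a Script Component source could look like, assuming an output column named OutputValue (DT_WSTR) has been defined on Output 0, the ACE OLE DB provider is installed, and the file path and sheet name (placeholders here) are known:

```csharp
// Inside the generated ScriptMain class of a Script Component (Source);
// the usings go at the top of the ScriptMain file. The path, sheet name
// and the OutputValue column are assumptions for illustration.
using System.Data;
using System.Data.OleDb;

public override void CreateNewOutputRows()
{
    const string path = @"C:\Data\source.xlsx";  // hypothetical file
    string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + path +
                     ";Extended Properties=\"Excel 12.0 Xml;HDR=YES\"";

    using (var conn = new OleDbConnection(connStr))
    {
        conn.Open();

        // Read the sheet's column metadata to see whether Column_1 exists.
        DataTable cols = conn.GetOleDbSchemaTable(
            OleDbSchemaGuid.Columns,
            new object[] { null, null, "Sheet1$", null });
        bool hasColumn1 = cols.Select("COLUMN_NAME = 'Column_1'").Length > 0;

        // Fall back to Column_2 when Column_1 is missing; the alias keeps
        // the output metadata identical either way.
        string sql = hasColumn1
            ? "SELECT [Column_1] AS SourceValue FROM [Sheet1$]"
            : "SELECT [Column_2] AS SourceValue FROM [Sheet1$]";

        using (var cmd = new OleDbCommand(sql, conn))
        using (OleDbDataReader reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                Output0Buffer.AddRow();
                if (reader.IsDBNull(0))
                    Output0Buffer.OutputValue_IsNull = true;
                else
                    Output0Buffer.OutputValue = reader.GetValue(0).ToString();
            }
        }
    }
}
```

Note that the hypothetical OutputValue column still has to be declared up front in the Script Component editor: the component's own output metadata remains static, only the choice of which Excel column feeds it is dynamic.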
SSIS is metadata-based and does not support dynamic metadata; however, you can use a Script Component, as #nitin-raj suggested, to handle all known source columns. There is a good post below on how it can be done.
Dynamic File Connections
If you have many such files with varying columns, it is better to create a custom component. However, you cannot have dynamic metadata even with a custom component; the set of columns must be known to SSIS up front.
If the list of columns keeps changing and you cannot know the expected columns in advance, then you are better off handling the entire thing in C#/VB.NET using a Script Task in the control flow.
As a best practice, because SSIS metadata is static, any data quality and formatting issues in source files should be corrected before the SSIS data flow task runs.
I have seen this situation before and there is a very simple fix. At the beginning of your SSIS package, use a File System Task to create a copy of the source Excel file, then run a C# script or a PowerShell script to rename the columns, so that if Column_1 does not exist it is added at the appropriate spot in the Excel file, or corrected if the column name is simply wrong.
As a result, you will not need to refresh your SSIS metadata every time it fails. This is a standard data standardization practice.
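One way to sketch the header-correction part, assuming Excel is installed on the machine running the package and the Microsoft.Office.Interop.Excel assembly is referenced (the path, sheet index and header position are placeholders):

```csharp
// Hypothetical header-standardization routine, called from a Script Task
// after the File System Task has copied the workbook.
using System;
using Excel = Microsoft.Office.Interop.Excel;

public static void FixHeader(string path)
{
    var app = new Excel.Application();
    Excel.Workbook wb = null;
    try
    {
        wb = app.Workbooks.Open(path);
        var ws = (Excel.Worksheet)wb.Worksheets[1];
        var header = (Excel.Range)ws.Cells[1, 1];   // assumed header position

        // If the expected header is missing or misspelled, rewrite it so the
        // data flow's static metadata still matches the file.
        string current = header.Value2 as string;
        if (!string.Equals(current, "Column_1", StringComparison.OrdinalIgnoreCase))
        {
            header.Value2 = "Column_1";
            wb.Save();
        }
    }
    finally
    {
        if (wb != null) wb.Close(false);
        app.Quit();
    }
}
```

Since Interop requires Excel on the box running the package, a library such as EPPlus or the Open XML SDK may be a better fit for unattended servers.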
The easiest way is to add two Data Flow Tasks, one for each Excel source SELECT statement, and use precedence constraints to execute the second data flow when the first one fails.
The disadvantage of this approach is that if the first Data Flow Task fails for another reason, it will also try to execute the second one. You will need some more advanced error handling to check whether the error was thrown because of the missing column or not.
If I had a similar situation, though, I would use a Script Task to check whether the column exists and build the SQL command dynamically. Note that this SQL command must always return the same metadata (you must use aliases).
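A rough sketch of that Script Task approach, assuming a string variable User::ExcelSql listed under ReadWriteVariables, a User::ExcelPath under ReadOnlyVariables, and a sheet named Sheet1 (all of these names are placeholders):

```csharp
// Main() of the generated ScriptMain class in a control-flow Script Task.
using System.Data;
using System.Data.OleDb;

public void Main()
{
    string path = (string)Dts.Variables["User::ExcelPath"].Value;
    string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + path +
                     ";Extended Properties=\"Excel 12.0 Xml;HDR=YES\"";

    bool hasColumn1;
    using (var conn = new OleDbConnection(connStr))
    {
        conn.Open();
        DataTable cols = conn.GetOleDbSchemaTable(
            OleDbSchemaGuid.Columns,
            new object[] { null, null, "Sheet1$", null });
        hasColumn1 = cols.Select("COLUMN_NAME = 'Column_1'").Length > 0;
    }

    // Both branches alias the chosen column to the same name, so the source
    // in the data flow ("SQL command from variable") keeps identical metadata.
    Dts.Variables["User::ExcelSql"].Value = hasColumn1
        ? "SELECT [Column_1] AS SourceValue FROM [Sheet1$]"
        : "SELECT [Column_2] AS SourceValue FROM [Sheet1$]";

    Dts.TaskResult = (int)ScriptResults.Success;
}
```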
Helpful links
Overview of SSIS Precedence Constraints
Working with Precedence Constraints in SQL Server Integration Services
Precedence Constraints

Running SSIS Solution/Package deletes components out of the Data Flow Task

I'm working on a package to import data from a raw text file to a table in SQL Server. My package contains:
1) An Execute Process Task that runs a batch file to compile .txt files
2) An Execute SQL Task that truncates the table I want to import into
3) A Data Flow Task that takes the data from the raw text file and puts it in the table in SQL Server
I was able to run each step individually and they worked as expected - however, when I run the solution from inside SSIS itself, it gives me the "success" message but nothing actually happens. Even worse, the components of the data flow task are now missing.
Has anyone experienced this and found a workaround?
Sorry for the lack of specifics! I actually figured it out. Let me clarify my second paragraph:
The batch portion and the Execute SQL Task work perfectly when I disable the Data Flow Task! However, upon enabling the Data Flow Task, the package would "run" but not complete the Data Flow Task and would delete the Data Flow Task's components completely. Within the data flow task I had:
1) Flat File Source
2) Conditional split that ignored rows in the first column if the value was "".
3) OLE DB destination table
What I found is that changing the Conditional Split criteria from specifically ignoring rows with "" to a check based on value length, rather than looking for that symbol, worked and no longer deleted the components in the Data Flow Task.
TL;DR: For whatever reason, the solution I built didn't like the conditional split criteria being based on the "" character. When I removed that, the solution worked perfectly.

Pentaho writing to log but not text file

I have a transformation that is successfully writing the first row to the log file.
However the same transformation is not writing the first row to a text file.
The text file remains blank.
Does anyone know why this may be?
Edited: I am only focusing on the "applications to run" and "set pm variable" transformations, as the other transformations are replications of "set pm variable" for different fields.
It looks like your Set Variables step is distributing its rows over the two follow-up steps in a round-robin way, which is the default setting in PDI.
Right-click the Set Variables step and under Data Movement, select Copy. That will send all rows to BOTH steps. You should see a documents icon on the hops then.

Set Table Input data to Polling Folder - Pentaho Data Integration

I have a requirement where we get a list of file names from SQL and need to pass these file names as a variable to a step that can poll a folder for these files as text files. Please advise how to set the SQL output of file names as an array variable and pass it to the polling-folder step.
Don't use variables. Variables are only suitable if your input has a single row.
Instead, use two transformations inside a parent job. The first transformation gets the list of filenames and passes them to a Copy rows to result step.
The second transformation can do one of two things:
Process all files at once: just use a Get rows from result step as the entry point to the transformation;
Process one file at a time: create a parameter for the filename on the transformation; then open the parent job, go to Advanced in the properties of the transformation entry, tick the box "Execute for every input row", and on the Parameters tab map the child transformation's parameter name to the stream column coming from the first transformation.

SSIS Script Component - only to change variables

I have a series of tasks that are very similar:
1) SELECT a, b FROM c
2) Look up the value in another table and change the value in column b.
3) Save the new value back to c and, if there is no match, send the row to an error table.
That part is pretty straightforward and illustrated here:
Source ==> Lookup =match=> SQL Update command
=No match=> SQL Save Error command
(Hope you understand what I mean - but it works!)
I now have to repeat this a number of times, where my source-sql changes. So what I want to do is to insert a Script Component in front of the Source and set my User::Sql variable like:
Variables.Sql = "SELECT d, e FROM f"
All of the above is contained in a Data Flow. Once I have created one, I can copy it, change only the Sql variable in the script, and it should all work.
My problem is: when I insert the Script Component, it asks me whether it is a Source, a Destination or a Transformation script. And by only setting the variable, it does not produce any rows for output and cannot connect to my Source.
Anyone know how to make that work?
(I have simplified the above. I actually want to update multiple variables and use those in my Source, Lookup and Error update as well - therefore it is not simpler to just change the SQL script in the initial Source! But if I can do the above, I will be able to achieve what I want.)
You should set the variable containing the SQL query in the control flow, before the data flow executes.
Then use that variable in an expression in your data flow. You can parameterize the query used in the lookup or any other parameters of the data flow.
If your data flows really always have the same structure, you could even generate a list of queries and call your data flow task in a loop, avoiding duplication of the same tasks.
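For instance, a Script Task placed before the Data Flow Task could do nothing more than populate the variable (a minimal sketch; User::Sql is the variable from the question and must be listed under the task's ReadWriteVariables):

```csharp
// Main() of a control-flow Script Task that runs before the Data Flow Task.
public void Main()
{
    // The source inside the data flow can then read this variable, e.g. via
    // "SQL command from variable" on an OLE DB Source, or via an expression
    // on the source's SqlCommand property.
    Dts.Variables["User::Sql"].Value = "SELECT d, e FROM f";

    Dts.TaskResult = (int)ScriptResults.Success;
}
```

If all the data flows share the same structure, the loop suggested above (for example a Foreach Loop over a list of queries) would simply overwrite User::Sql on each iteration before re-running the same Data Flow Task.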