Transition from MS SQL to Pentaho Kettle

I have a few MS SQL scripts which I would like to migrate to Kettle. Ideally, each statement of the script would become a single step in Kettle, but I am finding it difficult to map the MS SQL statements to the related Kettle steps. Could someone please elaborate on which Kettle step can be used to do the following:
1. select * from [table] - This one is obviously [Input->Table input]
2. ALTER TABLE [table] ADD [fieldname] [nvarchar](255)
3. UPDATE b SET b.b_field = a.a_field
   FROM [table_a] a
   INNER JOIN [table_b] b
   ON right(b.b_identity,19)=a.a_identity
   where b.b_field is null
Statement 3 is repeated with many other tables, comparing different fields.
Thank you.

You can't translate it statement by statement. You need to replace the functionality; you can't simply map SQL statements onto PDI steps. It's a completely different paradigm.

As a quick and dirty way to migrate SQL scripts to Kettle, you have the Execute SQL script step, into which you can copy/paste your script as is.
Still on the quick and dirty path, note that you can put more than one statement in the Table input step, provided they are separated by semicolons. You can even create temporary tables with SELECT INTO, index them, and read from them.
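For illustration, a single Table input could contain something like the following sketch of the temporary-table approach (table and column names are taken from the question; the temp table and index names are assumptions):
-- Build and index a temp table, then read from it, all in one Table input.
SELECT a.a_identity, a.a_field
INTO #matched
FROM [table_a] a
INNER JOIN [table_b] b ON RIGHT(b.b_identity,19) = a.a_identity;
CREATE INDEX ix_matched ON #matched (a_identity);
SELECT * FROM #matched;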
But obviously this is not really clean. For (2), you can produce a flow containing the table name and field name, then use a Javascript step to build a column containing the text "ALTER TABLE [table-name] ADD [field-name] NVARCHAR(255)", then a Dynamic SQL row step to execute that statement for each input line.
For (3), the principle is to create the input flow with a Table input containing "SELECT a.a_field FROM [table_a] a INNER JOIN [table_b] b ON RIGHT(b.b_identity,19)=a.a_identity", and then to update table_b with an Update step. I cannot really help there, since I do not see the b-key for the update.
When this is done and tested for one table and one field, you can put these values in parameters, and use a Job to loop over the parameters.
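As a hedged sketch of the parameterized version of (3), assuming b_identity is also the update key (the question doesn't say), the Table input for one iteration of the loop could look like this, with ${TABLE_A}, ${TABLE_B}, ${A_FIELD} and ${B_FIELD} as job parameters and "Replace variables in script" enabled:
-- One iteration of the loop; an Update step downstream would match on
-- b_identity and write new_value into ${B_FIELD}.
SELECT b.b_identity, a.${A_FIELD} AS new_value
FROM ${TABLE_A} a
INNER JOIN ${TABLE_B} b ON RIGHT(b.b_identity,19) = a.a_identity
WHERE b.${B_FIELD} IS NULL;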
You have an example of this use case in the samples directory shipped with your distribution. It sits in the same folder as your spoon.bat, and the job of interest to you is samples/transformations/dynamic-table/Dynamic table creation and population.kjb.

Related

Use SQL Field in SSIS Variable

Is it possible to reference a SQL field in an SSIS variable?
For instance, I would like to use the field from the "table" below
Select '999999' AS Physician_Profile_ID
as a dynamic variable (named "CMSPhysProID" in our example).
I plan on concatenating multiple IDs into an IN statement.
This is possible using an Execute SQL Task. In the General tab of the Execute SQL Task:
1. Set the result set to Single row.
2. Set the connection type to OLE DB.
3. Set the connection and enter the SQL statement, as you mentioned: Select '999999' AS Physician_Profile_ID.
4. Go to the Result Set page in the left-side pane.
5. Add the variable where you want to store '999999'.
6. Click OK.
If you are looking to store the value in a variable to be used later, you can simply use an Execute SQL Task with a single-row result set. More details can be found in the following article:
SSIS Basics: Using the Execute SQL Task to Generate Result Sets
If you are looking to add a computed column while importing data, you must use a Derived Column Transformation within the data flow task to add a column based on another one, you can refer to the following article for more details about this component:
SSIS Derived Columns with Multiple Expressions vs Multiple Transformations
What are you trying to accomplish by concatenating the IDs into an "IN" statement? If the idea is to use the values of the IDs to limit the results, as a dynamic WHERE clause, you may have better luck just using a lookup against either a table you maintain with the desired IDs or even a static list generated in the package with a script task. (If you can use the lookup table method it will be much easier to maintain as you only have to update a table, not your source code.)
Alternatively, you may even be able to accomplish the goal with a join. Create a temp table from the profile IDs you want to keep and join to it, or, again, use it as a lookup component. Dynamically creating a WHERE clause using IN will be a lot slower and will be cumbersome to maintain.
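For illustration, the join alternative could look like the sketch below; the source and lookup table names are assumptions, the latter being a small table you maintain with the IDs to keep:
-- Join against a maintained list of IDs instead of a dynamic IN clause.
SELECT s.*
FROM source_table s
INNER JOIN profile_ids_to_keep k
  ON s.Physician_Profile_ID = k.Physician_Profile_ID;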

How to run a select SQL statement stored in a field in Pentaho?

I have a table with a 'query' field containing a select SQL statement and another 'parameters' field containing the SQL parameters. I have merged these two fields into a new field containing a complete select statement. Now I need to execute the select SQL in this new field, get the result of the select (the output fields), and generate an Excel file.
Use Table-Input if you are interested in a query result set. Table-Input supports SQL parameters, so there is no need to build the statement yourself using e.g. Replace-In-String, tripping over escapes on your way. There is also variable substitution, just in case you can't live with a single template.
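As a minimal sketch of the parameter mechanism (table and field names are assumptions): set "Insert data from step" to the step producing the parameter values, and each ? in the Table-Input SQL is replaced, in order, by that step's fields:
-- The single ? is bound to the first field coming from the parameter step.
SELECT o.order_id, o.total
FROM orders o
WHERE o.customer_id = ?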
Update 21:14 GMT
I'm not very fond of the way you try to prepare the SELECT statement, but here we go, assuming it's a single statement we have:
Create a job with a Start entry and 2 Transformation entries (T1, T2). Let T1 produce the field containing your SELECT statement and use a Set-Variables step to make the statement available to T2 as variable SELECT. In T2 use a Table-Input step referencing ${SELECT} in the SQL statement text area. Don't forget to enable option "Replace variables in script".
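A minimal sketch of T2's Table-Input: since the whole statement was prepared in T1, the SQL text area can consist of nothing but the variable:
-- Entire SQL statement area of the Table-Input in T2
-- ("Replace variables in script" enabled):
${SELECT}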
From now on it's a matter of taste. I would prefer to create a CSV file using Text-File-Output: with the right field separator, Excel will open the file after double-clicking it. The advantage of Text-File-Output is that you don't have to specify the fields you don't know at design time anyway; an empty field list will just handle all fields coming in, comparable to the total projection in a Table-Input, which will create the necessary fields from the retrieved columns downstream.
If you must produce an Excel workbook, you'll have to learn about metadata injection. That would be a separate project for a beginner, though. There are samples in your Kettle installation folder. And there is a very active community if you find yourself in trouble.

Execute SQL step in Pentaho

I have created a transformation which includes a Table input, an Execute SQL step, and an Excel output step.
Table input --> runs a query and returns the field "query", which contains the SQL query select * from dual
Execute SQL step --> dynamically passes that query field using '?' and enabling variable substitution
Excel output --> the expectation is that the SQL query is executed and its result ends up in the Excel output
But I can't get the fields from the Execute SQL step. How can I do this?
Thanks
Kavitha S
Use the Database join step instead of the Execute SQL step. The Database join step allows you to run a query against a database using data obtained from previous steps.
Database join input: you can pass any data you want from the previous step using the ? notation in the SQL query defined inside the step.
Database join output: it executes the parametrized SQL query and adds the result fields to the output.
This step is what you need as your second step. See more info about the Database join step in the documentation.
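As a generic sketch of how the Database join step binds parameters (table and field names are assumptions): each ? in the query is replaced, in order, by the fields listed in the step's parameter grid, and the selected columns are appended to the stream:
-- One lookup per incoming row; 'description' is added as a new field.
SELECT d.description
FROM dim_table d
WHERE d.id = ?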
In PDI, "Execute SQL Step" is not meant for generating rows. It will not add any extra row to the data stream. You got Table Input step to generate multiple rows.
What you can try as an alternative is to break the transformation into two parts.
Part 1: Table input step > (query rows are generated) >> use "Set variables" or "Copy rows to result" to set the query into a variable, e.g. query.
Part 2: take another Table input step (in a next .ktr file) and use the variable substitution ${query} >> finally, output the result set to the Excel output.
For dynamic SQL queries, you can read this blog.
In case you have some lookups to do with the generated query, you can use the Dynamic SQL row step to generate the rows; a sketch follows.
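For the Dynamic SQL row step, a hedged sketch (option names may differ slightly by version): the "SQL field name" option points at the field holding the generated query, while the template SQL is only used to determine the output metadata, so each per-row query must return the same column layout:
-- Template SQL: defines the field layout only, never meant to return data.
SELECT CAST(NULL AS INTEGER) AS id, CAST(NULL AS VARCHAR(50)) AS name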
Hope it helps :)

capture executed sql from input table in pentaho pdi

I am using Pentaho for data migration testing. I have set up a "table input" step where many parts of the query inside the "table input" are variables. I have been looking for a way to capture that query after it gets executed at runtime.
I was wondering if there is any specific system log variable for SQL, or whether this is to do with metadata. Need help! Thanks
Maybe the following approach will help:
We assume a transformation reading a CSV file to get the dynamic portion of the SELECT statement (e.g. the columns) and setting the variable columns with it.
The second transformation uses this variable to generate the SELECT statement and store it into the variable sql_statement.
In the main transformation we use ${sql_statement} as the SELECT statement of the table input and write the data to an output file (that's the business process, so to say). From the same input we copy the output to another path. There we add the current time as a field (using "Get system data") and the generated SQL statement, join them as a cartesian product, and group the result by sql_statement. That way we can compute the first and the last time the statement was used. These results are written to a text file.
The last thing we need is a job calling the three transformations sequentially.
This is a sample output:
sql_statement;min_time;max_time
SELECT my_column FROM test_table;2014/05/08 00:41:21.143;2014/05/08 00:41:21.144
Thank you Marcus! I did something similar, and it works. Awesome.
I gathered parts of the queries from the table field where they were kept and formed a full query in JavaScript. After that, the full query is sent as a parameter to a transformation that runs and logs it.

Pentaho kettle : how to execute "insert into ... select from" with the sql script step?

I am discovering Pentaho DI and I am stuck with this problem:
I want to insert data from a CSV file into a custom DB, which does not support the "insert table" step. So I would like to use the SQL script step with a single statement:
INSERT INTO myTable
SELECT * FROM myInput
And my transformation would look like this:
I don't know how to get all of my data from the CSV injected as the "myInput" input.
Could someone help me ?
Thanks a lot :)
When you first edit the SQL script step, click the 'Get fields' button. This will load the parameters (the fields from your CSV) into the box in the bottom-left corner. Delete the parameters (fields) you don't want to insert.
In your SQL script, write your query something like this, where the question marks are your parameters, in order:
insert into my_table (field1,field2,field3...) values ('?','?','?'...);
Mark the checkboxes "Execute for each row" and "Execute as a single statement". That's really about it. Let me know if you have any more questions, and if you provide sample data I'll make you a sample ktr file to look at.
I think you're going about it the wrong way. You should use a CSV file input step and a Table output step.
As rwilliams said, in the CSV file input step use Get fields. More importantly, the Table output step has a Database fields tab, where Enter field mapping is the right choice; the guess function is amazing.
What's more, the system can generate the CREATE TABLE statement for the target table when it does not yet exist in the target connection's DB server.
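For illustration, the generated DDL might look like the sketch below; the column names and types are assumptions standing in for whatever is derived from the CSV fields:
-- Sketch of an auto-generated CREATE TABLE for a missing target table.
CREATE TABLE myTable
(
  field1 VARCHAR(255)
, field2 INTEGER
, field3 DATE
);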
Use the following code
with cte as
(
SELECT * FROM myInput
)
select *
into myTable
from cte;