Pentaho Kettle Spoon Date manipulation

I am using Pentaho Spoon to do some transformations. I am using 'Table Input' and joining multiple tables to get the final output table.
I need to achieve:
SELECT COUNT(distinct ID)
FROM TBLA join TBLB ON TBLA.ID=TBLB.ID
WHERE
TBLA.ID=334
AND TBLA.date = '2013-1-9'
AND TBLB.date BETWEEN '2012-11-15' AND '2013-1-9';
I am manually inserting '2012-11-15', but I am using Get System Data to insert '2013-1-9'. I am using one Get System Data step.
My query is:
SELECT COUNT(distinct ID)
FROM TBLA join TBLB ON TBLA.ID=TBLB.ID
WHERE
TBLA.ID=334
AND TBLA.date='?'
AND TBLB.date BETWEEN '2012-11-15' AND '?';
I get an error message in Table Input saying: No value specified for parameter 2.
Any suggestion will be appreciated.
Thank you.

This is a simple one: you need to "duplicate" the system date. So add another line in "Get System Data" called "date2" or something, make it the same as the first line, and then it will fill in the second parameter (?).
Or simply change the query to say BETWEEN '2012-11-15' AND TBLA.date;
then you don't need the second parameter.
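For illustration, a minimal sketch of that second option, reusing the table and column names from the question:

SELECT COUNT(DISTINCT TBLA.ID)
FROM TBLA JOIN TBLB ON TBLA.ID = TBLB.ID
WHERE TBLA.ID = 334
AND TBLA.date = ?                                  -- the single Get System Data parameter
AND TBLB.date BETWEEN '2012-11-15' AND TBLA.date;  -- reuses the column, so no second ?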

Personally I prefer the pattern of a Get System Info/Add Constants step that creates one row with multiple columns and feeds into a Database Join step. Then you replace parameters in your query with columns instead of rows, and you can specify a column more than once.
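As a sketch of that pattern: feed one row with a system-date column (the column name sysdate here is an assumption) into a Database Join step, keep both placeholders in the query, and list that column twice in the step's parameter fields:

SELECT COUNT(DISTINCT TBLA.ID)
FROM TBLA JOIN TBLB ON TBLA.ID = TBLB.ID
WHERE TBLA.ID = 334
AND TBLA.date = ?                          -- parameter 1: sysdate
AND TBLB.date BETWEEN '2012-11-15' AND ?;  -- parameter 2: sysdate again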


Azure Data Factory Exists Transformation

Is there a way to compare two tables and then use the case() function?
I am trying to add a new column based on the Exists transformation. In SQL I do it like this:
isnull((SELECT 'YES' FROM sales WHERE salesperson = t1.salesperson GROUP BY salesperson), 'NO') AS registeredSales
T1 is personal.
Or should I include the table in the stream of the joins and then use the case() function to compare the two columns?
If there's another way to compare these two streams, I would be pleased to hear it.
Thanks.
Flat files in a data lake can also be compared. We can use the Derived Column in a data flow to generate a new column.
I created a data flow demo containing two sources: CustomerSource (customer.csv stored in datalake2) and SalesSource (sales.csv stored in datalake2, which contains only one column).
Then I join the two sources on the column CustomerId.
Then I use a Select activity to give an alias to the CustomerId from SalesSource.
In the DerivedColumn, I select Add column and enter the expression iifNull(SalesCustomerID, 'NO', 'YES') to generate a new column named 'registeredSales'.
The last column of the result shows the new registeredSales values.
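For comparison, the data flow above mirrors the asker's T-SQL; written out as a complete statement (assuming personal is the table behind t1, as the question states), it would be:

SELECT t1.salesperson,
       ISNULL((SELECT 'YES' FROM sales s
               WHERE s.salesperson = t1.salesperson
               GROUP BY s.salesperson), 'NO') AS registeredSales
FROM personal AS t1;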

SQL loop through the values in a table

first off, noob alert! :))
I need to construct a query that runs on many tables. The table names vary only in the last digits, which correspond to the client code. The thing is, the values that change aren't sequential, so looping as in i=1,2,3,... does not work. A possible solution would be to have those values in a field in another table.
Here is the code for the first two clients, 015 and 061. The leading zero(s) are essential.
SELECT LnMov2017015.CConta, RsMov2017015.DR, RsMov2017015.NInt, "015" AS CodCli
FROM LnMov2017015 INNER JOIN RsMov2017015 ON LnMov2017015.NReg = RsMov2017015.NReg
WHERE ((LnMov2017015.CConta="6" Or LnMov2017015.CConta="7") AND RsMov2017015.DR=9999)
UNION SELECT LnMov2017061.CConta, RsMov2017061.DR, RsMov2017061.NInt, "061" AS CodCli
FROM LnMov2017061 INNER JOIN RsMov2017061 ON LnMov2017061.NReg = RsMov2017061.NReg
WHERE ((LnMov2017061.CConta="6" Or LnMov2017061.CConta="7") AND RsMov2017061.DR=9999)
...
So for the first SELECT the table name is LnMov2017015, the ending 015 being the client code that changes from table to table; in the second SELECT the table name is LnMov2017061, 061 being what distinguishes it.
For each client code there are two tables, e.g. LnMov2017015 and RsMov2017015 (LnMov2017061 and RsMov2017061 for the second client shown).
Is there a way I can build the SQL, based upon the example SQL above?
Does anyone have an idea for a solution? :)
Apparently it is possible to build a query object that reads data in another database without establishing a table link. Just tested, and to my surprise it works. Example:
SELECT * FROM [SoilsAgg] IN "C:\Users\Owner\June\DOT\Lab\Editing\ConstructionData.accdb";
I was already using this structure in VBA to execute DELETE and UPDATE action statements.
Solution found :)
Thank you all for your input.
Instead of linking 100 tables (password protected), I'll access them with SQL:
FROM Table2 IN '' ';database=C:\db\db2.mdb;PWD=mypwd'
And merge them all with a query, before anything else!
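Putting that together, the merging query might look something like this sketch, where qryMov2017015 and qryMov2017061 are hypothetical saved queries in each client database that do the per-client join, and the paths and password are placeholders:

SELECT CConta, DR, NInt, "015" AS CodCli
FROM qryMov2017015 IN '' ';database=C:\db\Client015.mdb;PWD=mypwd'
UNION ALL
SELECT CConta, DR, NInt, "061" AS CodCli
FROM qryMov2017061 IN '' ';database=C:\db\Client061.mdb;PWD=mypwd'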

Creating a calculated field table based on data in separate tables

It is straightforward to create a calculated field in a table that uses data in that same table, because the expression builder is easy to use. However, it appears to me that the expression builder for a calculated field only works with data in that table;
i.e. the expression builder in table MYTABLE works with fields FIELD1.MYTABLE, FIELD2.MYTABLE, etc.
Inventory Problem
My problem is that I have two 'count' fields resulting from my queries INPUTQUERY and OUTPUTQUERY (giving me a count of all input data added and a count of all output data added), and now I want to subtract the two to get a stock level.
I can't link the table that was created from my query, because it won't continually update through the relationship itself, and thus I'm stuck using either the expression builder or SQL.
First question:
Is it possible to have the expression builder reference data from other tables?
i.e. the expression builder for:
MAINTABLE CALCULATEDFIELD.MAINTABLE = INPUTSUM.INPUTTABLE - OUTPUTSUM.OUTPUTTABLE
(which gives a difference of the two)?
Second question:
If the above isn't possible, can I do this through SQL code?
i.e.:
SELECT(data from INPUTSUM)
FROM(INPUTTABLE)
-
SELECT(data from OUTPUTSUM)
FROM(OUTPUTTABLE)
Try this:
SELECT SUM(T.INPUTSUM) - SUM(T.OUTPUTSUM) AS RESULTSUM
FROM
(
SELECT INPUTSUM, 0 AS OUTPUTSUM
FROM INPUTTABLE
UNION ALL
SELECT 0 AS INPUTSUM, OUTPUTSUM
FROM OUTPUTTABLE
) AS T
Note that UNION ALL is used rather than UNION: plain UNION would silently drop duplicate rows and skew the sums.
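Since this is Access, one more hedged option for the first question: a table-level calculated field cannot reference other tables, but in a query column, form control, or VBA the domain aggregate DSum can compute each total directly (Nz guards against an empty table returning Null):

Nz(DSum("INPUTSUM", "INPUTTABLE"), 0) - Nz(DSum("OUTPUTSUM", "OUTPUTTABLE"), 0)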

Automatically generated SQL code throws type mismatch

My problem is that I would like to make an append query in MS Access 2010. I tried to build it in the query designer, but it throws an error:
Type mismatch in expression
See the generated code below:
INSERT INTO Yield ( ProcessName, Sor, Lot,
ProcessCode, Outgoing, DefectReason, DefectQty, ModifyQty )
SELECT Process.[ProcessName], Sor.[Sor], Qty.[Lot], Qty.[ProcessCode],
Qty.[Outgoing Date], Qty.[Defect Reason], Qty.[Defect Qty], Qty.[Modify_Qty]
FROM (Sor INNER JOIN ProcessCode ON Sor.[SorID] = ProcessCode.[SorID])
INNER JOIN (Process INNER JOIN Qty ON Process.[ProcessID] = Qty.[ProcessID])
ON ProcessCode.[ProcessID] = Process.[ProcessID];
The tables and the attributes all exist. The ID numbers are indexes, the quantities are numeric, and the 'ProcessName', 'Sor', 'Lot', 'ProcessCode', 'DefectReason' attributes are strings.
What could be the problem?
Thanks in advance.
Looks OK. The best advice is to divide it into smaller pieces:
http://importblogkit.com/2015/05/how-do-you-eat-an-elephant/
Try this:
Remove the INSERT part and just run the SELECT to make sure the joins are working properly. If this fails, the problem is in the join fields.
Then put the INSERT back, but instead of selecting table fields, use default values in the SELECT: '' for strings and 0 for numerics, with the right alias for each column name (as sketched below). That way you make sure each target column receives the right data type. If this fails, then one of the fields isn't really a string or a number; as Gustav suggested, probably a DATE.
If that works, then add one real table field back at a time until you find the one causing the problem. Maybe one field doesn't support Null, or is receiving a bigger value than it supports.
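For illustration, that second step might look like this sketch (the defaults are placeholders; since Outgoing is fed from [Outgoing Date], a date literal is assumed for it):

INSERT INTO Yield ( ProcessName, Sor, Lot,
ProcessCode, Outgoing, DefectReason, DefectQty, ModifyQty )
SELECT '' AS ProcessName, '' AS Sor, '' AS Lot, '' AS ProcessCode,
#1/1/2013# AS Outgoing, '' AS DefectReason, 0 AS DefectQty, 0 AS ModifyQty
FROM (Sor INNER JOIN ProcessCode ON Sor.[SorID] = ProcessCode.[SorID])
INNER JOIN (Process INNER JOIN Qty ON Process.[ProcessID] = Qty.[ProcessID])
ON ProcessCode.[ProcessID] = Process.[ProcessID];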
The problem was that the Yield table did not have the listed attributes. I thought that if some of the listed output attributes were not included in the output table, Access would automatically create the missing attributes. I was wrong. The output table has to already contain the attributes (columns); new columns cannot be inserted into it this way.
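If you hit the same thing, the missing columns have to be added to the target table first; a minimal sketch in Access DDL for one of the listed columns (the type and size are assumptions):

ALTER TABLE Yield ADD COLUMN DefectReason TEXT(255);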

Loop in Kettle/Spoon/Pentaho

I have a query like this:
SELECT count(distinct ID) FROM TBLC WHERE date BETWEEN ? AND ?;
I am using Pentaho Spoon. I am using the 'Execute SQL Script' step. The options I see are Execute for each row, Execute as a single statement, and Variable substitution.
If I need to change my query, or if other steps are needed to implement this, please respond.
EDIT:
I am using Pentaho Spoon to transfer data from an Infobright database (table1, table2) to an Infobright database (table3).
Query is similar to:
SELECT table1.column1, table2.column2
FROM table1 JOIN table2 ON table1.id=table2.id
WHERE table2.date BETWEEN '2012-12-01' AND '2012-12-30'
I want a way so that I do not have to manually specify the date range each time I run the transformation. I want to automate the date range.
Thanks in advance.
Based on what you've described, I believe you can accomplish what you want by using a Generate Rows step to inject rows containing the dates you want into the stream, then generating the needed query for each date row in the stream to get all the rows you want from the source tables.
You can use Execute as a single statement and Variable substitution, as they are best suited to your use case.
Add parameters StartDate and EndDate to your transformation and use them in your query as shown below. Enable "Variable substitution" in the Execute SQL Script step.
SELECT table1.column1, table2.column2
FROM table1 JOIN table2 ON table1.id=table2.id
WHERE table2.date BETWEEN '${StartDate}' AND '${EndDate}'
Supply the values of StartDate and EndDate when executing the transformation.
I guess the dates are in a table or a file in the database.
What you can do is:
Create a job that gets those parameters into the stream and sets variables.
In the next job you can use them as variables in your query, via ${date_from} and ${date_to}.
That way, each time you run the jobs, it picks up whatever is inside the database.
You of course need to keep date_from and date_to updated.
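A minimal sketch of that idea, assuming a hypothetical control table etl_date_control: a Table Input step in the first transformation reads the range, and a Set Variables step exposes the row as date_from/date_to:

SELECT date_from, date_to
FROM etl_date_control
WHERE job_name = 'table3_load';

The main transformation's query (with variable substitution enabled) then becomes:

SELECT table1.column1, table2.column2
FROM table1 JOIN table2 ON table1.id = table2.id
WHERE table2.date BETWEEN '${date_from}' AND '${date_to}'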