Pentaho Spoon: how to insert a value into a column conditionally?

So in my table, I have the columns quantity and comment.
If the value in quantity is more than 0, then I need to insert the string "available" into the comment column; if it equals 0, then "to order"; and finally, if it's less than zero, then "warning". What would be the best way?
Edited:
I guess my question above doesn't show all of the work involved.
First, I read a text file from which I get several fields, including quantity.
Then I do some modifications to the data (in a Formula step, I do some calculations on quantity).
In the end I use a Table output step to insert the rows into the DB. One of the fields to insert is quantity.
My main question is:
Is it better to insert the values for the comment column after the Table output step (when quantity is already in the DB), using an SQL script step?

You have basically 3 options:
A Filter rows step to split the stream based on the value of quantity; each output stream then gets an Add constants step that adds the new field you want, and the streams are combined again by connecting the Add constants steps to a Dummy step;
A User Defined Java Expression step;
A JavaScript step.
Option 2 is probably the cleanest; option 3 is basically the same as option 2, but with JavaScript instead of Java code; option 1 has the advantage of not requiring any code (though, as the alternative is a one-liner, that's not really an issue). Also, with option 1 the order of rows isn't necessarily maintained.
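As a sketch of that one-liner (assuming the field is called quantity and the User Defined Java Expression step is configured to write a new String field called comment; adjust the names to your stream), the Java expression could look like:
quantity > 0 ? "available" : (quantity == 0 ? "to order" : "warning")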

** answer no longer applies with new question details **
If you are updating a database table, by far the best and most efficient solution is to do it in a single SQL statement.
In a Pentaho job, add an SQL step (under Scripting).
In that step, enter the SQL command. It will be similar to:
UPDATE MyTable
SET comment =
    CASE
        WHEN quantity > 0 THEN 'available'
        WHEN quantity < 0 THEN 'warning'
        ELSE 'to order'
    END
-- the next line is optional; use it if you only need to update some of the records
WHERE (insert conditions here if you need any)
As an extra comment, it's less than ideal to have two columns that should always be in sync but that depend on an external job to keep them that way. Techniques like database triggers, or calculating the CASE/WHEN while retrieving the rows in a SELECT statement, eliminate the chance of the fields getting out of sync.
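For example (a sketch of the second technique; MyTableWithStatus is a hypothetical view name and the column list is illustrative), the comment can be derived at read time instead of being stored at all:
CREATE VIEW MyTableWithStatus AS
SELECT quantity,
       CASE
           WHEN quantity > 0 THEN 'available'
           WHEN quantity < 0 THEN 'warning'
           ELSE 'to order'
       END AS comment   -- computed on every read, so it can never go stale
FROM MyTable;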

Related

Pentaho step - Use SQL functions to add a column in data before dumping it into the DB

I am fairly new to Pentaho, and while working with it I have stumbled across a problem. Below is my flow:
Read input from a file. Let's say this has 5 columns.
Make some modifications to the existing columns (filter, modify and so on).
Add a new column, which will be equal to an SQL function of the current row's data. For example, it could be sum(id, id+1).
Dump to the database.
Steps 1, 2 and 4 are already in place and working fine. It's step 3 where I am stuck. I've tried Execute SQL, but that is only for modifying DDL and doesn't return data. Table input needs the data to be in a table already, which isn't the case for me.
I have a workaround: I could insert all the rows into the DB and then fire an update query, but I was hoping there is a better way to do this.
You can add a Formula step, and in the formula column you specify what you want to achieve. For example, take your other column + 1 and save it in a new field, or replace the existing field's value.
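For instance (a sketch assuming a numeric input field named id and the Formula step's bracketed field syntax; check the exact syntax against your PDI version), the formula could be:
[id]+1
written to a new field such as id_plus_one, or back into id itself if you want to replace the value.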

Add new column to existing table Pentaho

I have a Table input and I need to add a calculation to it, i.e. add a new column. I have tried:
doing the calculation and then feeding it back. Obviously, that appended the new data to the old data.
doing the calculation and then feeding it back, but truncating the table first. As the process got stuck at some point, I assume I was truncating the table while the data was still being extracted from it.
using a Stream lookup and then feeding back. Of course, this also appended the data on top of the existing data.
using a Stream lookup where I pull the data from the Table input, do the calculation and, at the same time, pull the data from the same table and do a lookup based on the unique combination of date and id, then use the Update step.
As this has been running for a while, I am fairly sure it is not the right option either, but I have exhausted my options.
It seems that you need to update the table your data came from with this new field. Use the Update step with fields A and B as keys.
Actually, once you connect the hop, the result of the first step is automatically carried forward to the next step. So, let's say you have a Table input step and then add a Calculator step where you create the third column. After writing the logic, right-click on the Calculator step and click Preview; you will get the result with all three columns.
I'd say your issue is not ONLY in the Pentaho implementation; there are some things you can do before reaching data staging in Pentaho.
'Workin Hard' is correct when he says you shouldn't use the same table: leave the input untouched and just upload/insert the new values into a new table. It doesn't have to be a new table every time; instead of truncating the original, you truncate the staging (output) table.
How many 'new columns' will you need? Will every iteration of this run create a new column in the output, or will you always have a 'C' column which is always A+B or some other calculation? I'm sorry, but this isn't clear. If it is the latter, you don't need Pentaho for the transformation: updating the 'C' column with a mathematical function of A and B can be done directly in most relational DBMSs with a simple UPDATE statement. Yes, it can be done in Pentaho, but you're adding a lot of overhead and processing time.
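For example (a sketch with placeholder table and column names), that update can be as simple as:
UPDATE my_table
SET c = a + b;   -- recompute the derived column for every row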

BigQuery - remove unused column from schema

I accidentally added a wrong column to my BigQuery table schema.
Instead of reloading the complete table (millions of rows), I would like to know if the following is possible:
remove the bad rows (rows that have a value in the wrong column) by running a "select *" query on the table with some kind of filter and saving the result to the same table;
remove the (now) unused column.
Is this functionality (or something similar) supported?
Possibly the "save result to table" functionality could have a "compact schema" option.
The quickest way to remove a column in BigQuery, according to the documentation:
ALTER TABLE [table_name] DROP COLUMN IF EXISTS [column_name]
If your table does not contain record/repeated type fields, your simplest option is:
Select the valid columns while filtering out the bad records into a new temp table:
SELECT < list of original columns >
FROM YourTable
WHERE < filter to remove bad entries here >
Write the above to a temp table - YourTable_Temp
Make a backup copy of the "broken" table - YourTable_Backup
Delete YourTable
Copy YourTable_Temp to YourTable
Check that everything looks as expected and, if so, get rid of the temp and backup tables (an end-to-end sketch of this plan follows at the end of this answer)
Please note: the cost of #1 above is exactly the same as the action in the first bullet of your question. The rest of the actions (the copies) are free.
In case you have repeated/record fields, you can still execute the plan above, but in #1 you will need to use some BigQuery user-defined functions to get the proper schema in the output.
You can see the examples below - of course this will require some extra dev work, but if you are in a critical situation, this should work for you:
Create a table with Record type column
create a table with a column type RECORD
I hope that at some point the Google BigQuery team will add better support for cases like yours, where you need to manipulate and output repeated/record data, but for now this is the best workaround I have found - at least for myself.
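Pulled together as one standard-SQL sketch (the dataset, table, column and filter names are placeholders; note also that, unlike the free table copies mentioned above, each CREATE TABLE ... AS SELECT here is billed as a query):
-- 1. Keep only the wanted columns and valid rows in a temp table
CREATE TABLE mydataset.YourTable_Temp AS
SELECT col1, col2, col3                          -- the original, wanted columns
FROM mydataset.YourTable
WHERE bad_column IS NULL;                        -- whatever identifies the bad entries
-- 2. Keep a backup of the broken table, then swap in the cleaned data
CREATE TABLE mydataset.YourTable_Backup AS SELECT * FROM mydataset.YourTable;
DROP TABLE mydataset.YourTable;
CREATE TABLE mydataset.YourTable AS SELECT * FROM mydataset.YourTable_Temp;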
Below is the code to do it. Let's say c is the column that you want to delete.
CREATE OR REPLACE TABLE transactions.test_table AS
SELECT * EXCEPT (c) FROM transactions.test_table;
Or, a second method (and my favorite) is to follow the steps below.
Write a SELECT query that selects only the columns you want to keep (i.e. excludes the unwanted column).
Go to Query Settings.
In the Destination settings, choose "Set a destination table for query results" and enter the project name, dataset name and table name of the same table you queried in step 1.
For "Destination table write preference", select "Overwrite table".
Save the query settings and run the query.
"Save results to table" is your way to go. Try it on the big table with only the columns you are interested in, and you can apply a LIMIT to keep the result small.
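For instance (a sketch with placeholder names; the destination table is set in the query settings as described in the previous answer):
SELECT col1, col2            -- only the columns you want to keep
FROM mydataset.YourTable
LIMIT 1000;                  -- optional, just to test on a small result first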

How to remove duplicate rows from the output table using Pentaho DI?

I am creating a transformation that takes input from a CSV file and outputs it to a table. That is running correctly, but the problem is that if I run the transformation more than once, the output table contains the duplicate rows again and again.
Now I want to remove all the duplicate rows from the output table.
And if I run the transformation repeatedly, it should not affect the output table unless it has new rows.
How can I solve this?
Two solutions come to my mind:
Use an Insert / Update step instead of the Table output step to store the data in the output table. It will try to find a row in the output table that matches the incoming stream row according to the key fields you define (all fields/columns in your case). It works like this:
If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
Use the following parameters:
The key(s) to look up the value(s): tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on.
Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on.
If you have already stored duplicate values in the output table, you can remove the duplicates using this approach:
Use an Execute SQL script step in which you define SQL that removes the duplicate entries and keeps only the unique rows. For inspiration on writing such SQL, see: How can I remove duplicate rows?
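A sketch of such a statement (assuming the output table has a surrogate key column id and that the remaining columns together define a duplicate; the exact syntax varies a little between databases):
DELETE FROM output_table
WHERE id NOT IN (
    SELECT MIN(id)                 -- keep one row per distinct combination
    FROM output_table
    GROUP BY col1, col2, col3      -- the columns that define a duplicate
);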
Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.
As long as the number of rows in your CSV that differ from your target table is below 20-25% of the total, this is usually the most performance-friendly option.
Merge rows (diff) takes two input streams that must be sorted on the key fields (with a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input, and possibly on the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires its input to be configured on the Advanced tab; that is where you set the flag field and the values that identify each row operation. After applying the changes, the target table will contain exactly the same data as the input CSV.
Note also that you can use a Switch/Case or Filter rows step to do things like drop the deletes or updates if you want. I often route the "identical" rows away and write the rest to a text file so I can examine only the changes.
I looked for visual answers, but the answers were all text, so I'm adding this visual answer for any Kettle newbie like me.
Case
user-updateslog.csv (has duplicate values) ---> users_table, storing only the latest user details.
Solution
Step 1: Connect the CSV file input to an Insert/Update step.
Step 2: In the Insert/Update step, add the condition that compares the keys to find the candidate row, and set the fields you want updated to "Y".

Quickest way to fill SQL Table with Dummy Data [closed]

What is the quickest way to fill a SQL table with dummy data?
I have a wide table with about 40 fields of different kinds (int, bit, varchar, etc.) and need to do some performance testing. I'm using SQL Server 2008.
You only need GO 1000 after your INSERT to run it 1000 times, like this:
INSERT INTO dbo.Customers(Id, FirstName, LastName) VALUES(1, 'Mohamed', 'Mousavi')
GO 1000
It will give you a table with 1000 identical rows in it.
Another solution is to populate the first rows of your table with some data and then fill the following rows by repeating those rows over and over; in other words, you fill the table from itself:
INSERT INTO dbo.Customers
SELECT * FROM dbo.Customers
GO 10
If one or more columns are identity columns (auto-incrementing columns that only accept unique values), you simply leave them out and name the remaining columns explicitly. For instance, if Id in dbo.Customers is an identity column, the query goes like this:
INSERT INTO dbo.Customers (FirstName, LastName)
SELECT FirstName, LastName FROM dbo.Customers
GO 10
Instead of:
INSERT INTO dbo.Customers
SELECT Id, FirstName, LastName FROM dbo.Customers
GO 10
Otherwise, you'll encounter this error:
An explicit value for the identity column in table 'dbo.Customers' can only be specified when a column list is used and IDENTITY_INSERT is ON.
Note:
The row count grows geometrically, not arithmetically - each execution doubles the table, so GO 10 multiplies the initial rows by 2^10 = 1024 - which means it can take a while; don't use a big number after GO.
If you want the table to be filled with somewhat more elaborate data, you can achieve that in the same way, this time by executing a simple query and following these steps:
Choose one of your tables that has a decent number of rows, say dbo.Customers.
Right-click on it and select Script Table as > Create To > New Query Editor Window.
Rename the new table to something else, like dbo.CustomersTest, and execute the query; you now have a new table with the same structure as dbo.Customers.
Note: keep in mind that if it has an identity field, change its Identity Specification to No, since you are going to fill the new table repeatedly with data from the original one.
Run the following query. It's going to run 1000 times; you can change that to more or fewer, but be aware that it might take minutes depending on your hardware:
INSERT INTO [dbo].[CustomersTest] SELECT * FROM [dbo].[Customers]
GO 1000
After a while, you'll have a table full of dummy rows!
As @SQLMenace mentioned, RedGate Data Generator is a very good tool for this. It costs $369, although there is a 14-day trial.
The good point is that RedGate identifies foreign keys, so you can apply JOINs in your queries.
You get a bunch of options that let you decide how every column is populated; every column is interpreted semantically so that related data is suggested - for instance, a column named 'Department' isn't filled with random characters but with values like "Technical", "Web", "Customer", etc. You can even use regular expressions to restrict the generated characters.
I populated my tables with over 10,000,000 records, which made for an awesome simulation.
Late answer, but it can be useful to other readers of this thread.
Besides the other solutions, I can recommend importing data from a .csv file using SSMS or custom SQL import scripts/programs. There is a step-by-step tutorial on how to do this, so you might want to check it out: http://solutioncenter.apexsql.com/how-to-generate-randomized-test-data-from-a-csv-file/
Be aware that importing a .csv file using SSMS or custom SQL import scripts is easier than creating SQL inserts manually, but there are some limitations, as explained in the tutorial:
If thousands of rows need to be populated and the .csv file contains only a few hundred rows of data, that is just not enough. The workaround is to re-import the same .csv file over and over until you have what you need. The drawback of this method is that it inserts large blocks of rows with the same data, without randomizing them.
The tutorial also explains how to use a third-party SQL data generator called ApexSQL Generate. The tool has an integrated function to generate large amounts of randomized data from the imported .csv file. The application features a fully functional free trial, so you can download it and try it to see if it works for you.
http://filldb.info/dummy/ works best. It offers comprehensive settings, a choice of how many rows to generate, "real"-looking dummy data, and it is all free.
I've never seen anything more effective or better under these conditions.
You can generate a whole database or just a table through an easy-to-use GUI. It is also very elaborate in its settings and options, allowing you to generate dummy data with basically no effort. The GUI has no size limits and is very extensive in its data-type options.
To use it, navigate to the link and insert an SQL command that defines your tables, or use their dummy tables. Then click Next and fill out the data types and settings for the dummy-data population of your rows.
Then click Next and generate the data. Wait; once done, download the database and import it into your own database server.