Pentaho Merge Rows (diff) using text file inputs

I have a text file that I need to load into a database, and I used Merge Rows (diff).
I compared the Text file input with a Table input step. I used a Sorted Merge to sort the columns for both the Text file input and Table input steps, then a Merge Rows (diff) step followed by Synchronize after merge. My problem is that the first time I run the job it inserts the text file data into the database, but the second time it inserts the same rows again. Can anyone tell me what mistake I made?

use " Insert / Update " step in your transformation.. so it will avoid your duplication problem.
Insert/update Description

Related

TYPE command: inserting a CSV file

I have a CSV file I'm looking to load into a T-SQL table using the "type" command.
Code: type yourfilename
When looking in the command prompt, it's breaking the file lines into two different rows and inserting them separately into my destination table.
EX.
"Manheim Chicago","Manheim","IL","199004520601","On
Block","2D4FV47V86H126473","2006","DODGE","MAGNUM 4X2 V6"
I want the solution to look like this: https://i.stack.imgur.com/Bkgf6.png
Here the whole line would be one record in the table.
Question: does anyone know how to format a type command so it displays a full record without line breaks?
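One way around this (not from the thread, and assuming SQL Server 2017 or later plus hypothetical table and file names) is to skip the type command and let SQL Server read the file itself with BULK INSERT, which takes each physical line of the CSV as one record and understands the quoted fields:
-- Hypothetical staging table matching the sample record
CREATE TABLE dbo.AuctionStaging (
    AuctionSite varchar(100),
    Seller      varchar(100),
    State       varchar(10),
    SaleNumber  varchar(50),
    SaleStatus  varchar(50),
    VIN         varchar(20),
    ModelYear   varchar(4),
    Make        varchar(50),
    ModelDesc   varchar(100)
);

-- SQL Server 2017+ can parse quoted CSV fields natively
BULK INSERT dbo.AuctionStaging
FROM 'C:\data\auctions.csv'   -- hypothetical path, readable by the SQL Server service account
WITH (
    FORMAT = 'CSV',           -- honor the "..." quoting around fields
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ','
);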

Pentaho Spoon: does Insert/Update drop columns?

I am using the Insert/Update step (text file to DB) in Spoon and I have a question.
Suppose that in my text file I have 10 columns and in my DB I have 18, because 8 columns will be completed from another text file later.
In the Insert/Update step, I chose a key to look up the value (client_id, for example), and in "Update fields" I mapped those 10 columns. When I checked the SQL query, I saw that those 8 columns will be dropped.
But I want to keep them. Is there any solution for this?
The Insert/Update step will NOT drop columns when run normally.
The SQL button inspects the table and suggests changes based on the fields you specified in the step. It's only a convenience for quick ETL development, for example when sending rows from text files to a staging table with a Table Output step. It only drops columns if you execute the script it generates. Don't do that and your columns will be perfectly safe!
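For illustration only (the table and column names below are made up), the script the SQL button proposes can contain statements like these; the columns are only dropped if you press Execute on that script:
-- Illustrative sketch of a generated script; do NOT execute it
-- if you want to keep the 8 extra columns.
ALTER TABLE clients DROP COLUMN billing_address;
ALTER TABLE clients DROP COLUMN phone_number;
-- ... one DROP COLUMN per column the step does not map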

Pentaho | Issue with CSV file to Table output

I am working in Pentaho Spoon and have a requirement to load CSV file data into a table.
I used , as the delimiter in the CSV file. I can see the correct data in the preview of the CSV file input step, but when I try to insert the data with the Table output step, I get a data truncation error.
This is because one of my columns contains values like this:
"2,ABC Squere"
As you can see, I have "," in my column value, so it is being truncated and throwing an error. How do I solve this problem?
I want to upload data into the table with this kind of value.
Here is one way of doing it:
test.csv
--------
colA,colB,colC
ABC,"2,ABC Squere",test
See the settings below. The key is to use " (double quote) as the enclosure and , as the delimiter.
You can also change the delimiter, say to a PIPE, while still keeping the data as quoted text like "1,Name"; it will be treated as a single column either way, as in the sample below.
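For example, the same data with a PIPE delimiter and the value still quoted would look like this (sample file for illustration):
test_pipe.csv
-------------
colA|colB|colC
ABC|"2,ABC Squere"|test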

Newly inserted or updated row count in Pentaho Data Integration

I am new to Pentaho Data Integration; I need to integrate one database into another location as an ETL job. I want to count the number of inserts/updates during the ETL job and insert that count into another table. Can anyone help me with this?
To date, I don't think there's built-in functionality in PDI for returning the number of rows affected by an Insert/Update step.
Nevertheless, most databases let you get the number of rows affected by a given operation.
In PostgreSQL, for instance, it would look like this:
/* Count affected rows from INSERT */
WITH inserted_rows AS (
    INSERT INTO ...
    VALUES
        ...
    RETURNING 1
)
SELECT count(*) FROM inserted_rows;

/* Count affected rows from UPDATE */
WITH updated_rows AS (
    UPDATE ...
    SET ...
    WHERE ...
    RETURNING 1
)
SELECT count(*) FROM updated_rows;
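To go one step further and store that count in another table, as asked, the same RETURNING pattern can feed an INSERT. A minimal sketch, assuming PostgreSQL 9.5+ and hypothetical tables target_table and etl_audit:
/* Upsert into the target and log how many rows were affected.
   Assumes a unique constraint on target_table(id). */
WITH upserted_rows AS (
    INSERT INTO target_table (id, name)
    VALUES (1, 'foo'), (2, 'bar')
    ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name
    RETURNING 1
)
INSERT INTO etl_audit (job_name, affected_rows, run_at)
SELECT 'my_etl_job', count(*), now()
FROM upserted_rows;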
However, you're aiming to do that from within a PDI job, so I suggest that you try to get to a point where you control the SQL script.
Suggestion: Save the source data in a file on the target DB server, then use it, perhaps with a bulk loading functionality, to insert/update, then save the number of affected rows into a PDI variable. Note that you may need to use the SQL script step in the Job's scope.
EDIT: the implementation is a matter of chosen design, so the suggested solution is one of many. At a very high level, you could do something like the following.
Transformation I - extract data from source
Get the data from the source, be it a database or anything else
Prepare it for output in a way that it fits the target DB's structure
Save a CSV file using the text file output step on the file system
Parent Job
If the PDI server is the same as the target DB server:
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table (ideally, this table could also contain the time-stamp of the operation so you could keep track of things)
If the PDI server is NOT the same as the target DB server:
Upload the source data file to the server, e.g. with the FTP/SFTP file upload steps
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table
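A hedged sketch of what the Execute SQL Script step could run on the target server, assuming PostgreSQL and hypothetical table, column, and file names (the CSV is the file written and uploaded in the previous steps):
/* 1. Load the uploaded CSV into a staging table.
      COPY reads a file on the DB server, which is why the file
      has to be uploaded there first. */
CREATE TEMP TABLE stage_customers (customer_id int, name text, city text);

COPY stage_customers (customer_id, name, city)
FROM '/tmp/customers.csv'
WITH (FORMAT csv, HEADER true);

/* 2. Upsert into the target (unique key on customer_id assumed)
      and write the affected-row count into an audit table. */
WITH upserted AS (
    INSERT INTO customers (customer_id, name, city)
    SELECT customer_id, name, city FROM stage_customers
    ON CONFLICT (customer_id) DO UPDATE
        SET name = EXCLUDED.name,
            city = EXCLUDED.city
    RETURNING 1
)
INSERT INTO etl_audit (job_name, affected_rows, run_at)
SELECT 'load_customers', count(*), now()
FROM upserted;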
EDIT 2: another suggested solution
As suggested by @user3123116, you can use the Compare Fields step (if it's not part of your environment, check the marketplace for it).
The only shortcoming I see is that you have to query the target database before inserting/updating, which is, of course, less performant.
Eventually it could look like this (note that this is just the comparison and counting part): the "Compare Fields" step takes two streams as input for comparison, and its output is four distinct streams for "Identical", "Changed", "Added", and "Removed" records. You can count those four streams and then process the "Changed", "Added", and "Removed" records with an Insert/Update step.
Also note that you can split the input of the source data stream (COPY, not DISTRIBUTE) and do your insert/update, but that stream must wait for the field-comparison stream to finish its query on the target database, otherwise you might end up with the wrong statistics.
You can do it from the Logging option inside the transformation settings. Please follow the steps below:
Click on the Edit menu --> Settings
Switch to the Logging tab
Select Step from the left menu
Provide the log connection and log table name (say StepLog)
Select the required fields for logging (LINES_OUTPUT for the inserted count and LINES_UPDATED for the updated count)
Click on the SQL button and create the table by clicking on the Execute button
Now all the steps will be logged into the log table (StepLog), and you can use it for further actions.
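Once the transformation has run, the counts can be read back from that table for further use. A small sketch, assuming the log table was created as StepLog via the SQL button, the step is named "Insert / Update", the exact columns match the logging fields you selected, and the database supports LIMIT:
-- Counts logged for the most recent run of the Insert/Update step
SELECT TRANSNAME,
       STEPNAME,
       LINES_OUTPUT,    -- inserted count
       LINES_UPDATED,   -- updated count
       LOG_DATE
FROM StepLog
WHERE STEPNAME = 'Insert / Update'
ORDER BY LOG_DATE DESC
LIMIT 1;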
Enjoy

Reading values from a text file and updating a field in a SQL table

I have a text file with data like
Patient name: Patient1 Medical rec #: A1Admit date: 04/26/2009 Discharge date: 04/26/2009
DRG: 982 and so on.
In the format given above, I have several records in the text file; each field is separated by a colon.
I have to read this file, find the values, and update the corresponding fields in my SQL table (say, the DRG value 982 has to go into the drg column of the SQL table).
Please help me do this with a SQL query or an SSIS package.
If I got this task, I'd use SSIS.
Create 2 data sources: a flat file (for the text file) and a SQL Server connection
Use a Lookup task to look up the value from the text file for each record in the DB table
Use an Execute SQL Task to update records with the looked-up value
You MIGHT try doing this by means of BULK INSERT.
Create a temp table to hold the new values
BULK INSERT the file into said table (see the caveats below)
[optionally do some data enrichment/cleaning here]
Merge the information from the temp table into the actual table
The only problems with this MIGHT be that
the server cannot access the file directly (e.g. when the file is on a network share)
the file is of a format that can't be handled by BULK INSERT
Given the example data above, you might need to load the data into one big column and then split it into different columns by means of creative SQL (PatIndex, Substring, the works...). You might try giving the colon as a field separator, but you'll still end up with data that needs (quite a bit of) cleaning.
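A hedged sketch of that single-column approach in T-SQL, with hypothetical table, column, and file names; the PATINDEX/SUBSTRING parsing is deliberately crude and would need adjusting to the real file layout:
-- 1. Staging table: one wide column that holds each raw line of the file
CREATE TABLE #RawLines (RawLine varchar(max));

-- 2. Load the whole file, one line per row
--    (hypothetical path; must be readable by the SQL Server service account)
BULK INSERT #RawLines
FROM 'C:\data\patients.txt'
WITH (ROWTERMINATOR = '\n');

-- 3. "Creative SQL": locate a label with PATINDEX and take the token after it,
--    here the DRG value (the +5 is the length of 'DRG:' plus the following space)
SELECT LTRIM(RTRIM(LEFT(x.Rest, CHARINDEX(' ', x.Rest + ' ') - 1))) AS DrgValue
FROM #RawLines AS r
CROSS APPLY (SELECT SUBSTRING(r.RawLine,
                              PATINDEX('%DRG: %', r.RawLine) + 5,
                              LEN(r.RawLine)) AS Rest) AS x
WHERE r.RawLine LIKE '%DRG: %';

-- 4. The parsed values can then be UPDATEd or MERGEd into the real table,
--    keyed on something like the medical record number.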