I have a Get File Names step with a regular expression that matches 4 CSV files.
After that I have a Text File Input step which defines the fields of the CSV and reads these files.
Once this step completes, a Table Output step is executed.
The problem is that the Text File Input step seems to read all 4 files in a single pass, so the Table Output step inserts the rows of all 4 files at once. As a result my output table ends up with 20 rows (5 per file).
The expected behaviour is: read one file, insert its 5 rows into the output table, and execute a SQL script which moves those rows to a final table and truncates the temp table. Then repeat the process for the second, third and fourth file.
The temporary table is emptied on every file load, but the final table is not; it grows incrementally.
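The move-and-truncate script itself is straightforward; something like this, with temp_table and final_table as placeholder names:
INSERT INTO final_table
SELECT * FROM temp_table;
TRUNCATE TABLE temp_table;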
How can I do that in Pentaho?
Change your current job to a subjob that executes once for each incoming record.
In the new main job you need:
a transformation that runs Get File Names and links it to a Copy Rows to Result step;
a Job entry that runs your current job, configured to execute once for each row.
In the subjob, replace Get File Names with Get Rows from Result and reconfigure the field that contains the filename.
I have a CSV file with multiple rows containing SELECT statements that I would like to execute, saving all of the results to one file.
Each row in the CSV looks something like this:
SELECT 'somehardcodedname' AS name,age FROM account WHERE id='somehardcodedid';
I attempted to run the following commands:
\o output.csv
\i input.csv
This creates a file, but the file is mostly useless: I get the column headers and the row counts before/after each result, and the table structure in the CSV becomes ugly (like a table inside a table).
Is there any way I can modify the commands to combine the results of the individual SELECT queries into one result set that gets saved to the file? The input.csv contains several thousand rows, so I cannot turn it into a single query.
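One sketch of a way around this, assuming psql 12 or newer and that input.csv contains plain SELECT statements: switch psql to CSV output and suppress headers and footers before redirecting the output.
\pset format csv
\pset tuples_only on
\o output.csv
\i input.csv
\o
With tuples_only on, the per-query header row is suppressed (CSV format already omits the row-count footer), so the concatenated results read as one continuous CSV, provided every SELECT returns the same columns in the same order. On older psql versions, \t on together with \a (unaligned output) and \f ',' gives a similar effect.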
I am using an Insert/Update step (text file to DB) in Spoon and I have a question.
Suppose that in my text file I have 10 columns and in my DB table I have 18, because the other 8 columns will be filled from another text file later.
In the Insert/Update step I chose a key to look up the value (client_id, for example), and under "Update fields" I mapped those 10 columns. When I checked the generated SQL, I saw that those 8 columns would be dropped.
But I want to keep them. Is there any solution for this?
The Insert/Update step will NOT drop columns when run normally.
The SQL button inspects the table and suggests changes based on the fields you specified in the step. It's only a convenience for quick ETL development, for example when sending rows from text files to a staging table using a Table Output step. It only drops columns if you execute the script it generates. Don't do that, and your columns will be perfectly safe!
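For illustration, the suggested script in a case like this would contain statements along these lines (hypothetical table and column names); nothing happens to the table unless you actually execute them:
ALTER TABLE customers DROP COLUMN address;
ALTER TABLE customers DROP COLUMN phone;
Simply closing the SQL dialog without executing leaves the table definition untouched.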
I have a requirement where I get a list of file names from SQL and need to pass these file names as a variable to a step which polls a folder for those files as text files. Please advise how to set the SQL output of file names as an array variable and pass it to the folder-polling step.
Don't use variables. Variables are only suitable if your input has a single row.
Instead, use two transformations inside a parent job. The first transformation gets the list of filenames (sketched below) and passes them to a Copy Rows to Result step;
The second transformation can do one of two things:
Process all files at once: just use a Get Rows from Result step as your entry point to the transformation;
Process one file at a time: create a parameter for the filename on the transformation; then, in the parent job, open the properties of the transformation entry, go to Advanced and tick the box "Execute for every input row", and on the Parameters tab map the child transformation's parameter name to the stream column name coming from the first transformation.
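As a sketch, the query behind the first transformation's Table Input could be as simple as the following (the table and column names here are assumptions), with the resulting field wired straight into Copy Rows to Result:
SELECT file_name
FROM incoming_files
WHERE status = 'PENDING';
In the one-file-at-a-time option, the child transformation then references its parameter wherever the filename is needed, for example ${FILENAME} in the file or folder field, if that is the parameter name you chose.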
I have a requirement to create a transformation that runs a select statement. After selecting the values, it should update the status so it doesn't process the same record again.
Select file_id, location, name, status
from files
OUTPUT:
1, c/user/, abc, PROCESS
Updated output should be:
1, c/user/, abc, INPROCESS
Is it possible, in a single PDI transformation, to do a database select and cache the records so the same record isn't reprocessed, so that I don't need to update the status in the database? Something similar to a dynamic lookup in Informatica. If not, what's the best way to update the database after doing the select?
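To make the intended change concrete, the status update amounts to something like this, keyed on the file_id returned by the select (here hard-coded to the sample row above):
UPDATE files
SET status = 'INPROCESS'
WHERE file_id = 1;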
Thanks, that helps. You wouldn't do this in a single transformation, because of the multi-threaded execution model of PDI transformations; you can't count on a variable being set until the transform ends.
The way to do it is to put two transformations in a Job and create a variable in the job. The first transform runs your select and flows the result into a Set Variables step; configure it to set the variable you created in your job. Then you run the second transform, which contains your Excel Input step, and specify your job-level variable as the file name.
If the select gives more than one result, you can store the file names in the job's file results area. You do this with a Set Files in Result step. Then you can configure the job to run the second transform once for each result file.
I have a text file that I need to load into a database. I used Merge Rows (diff).
I compared the Text File Input with a Table Input step. I used Sorted Merge to sort the columns for both the Text File Input and Table Input steps, and then a Merge Rows (diff) step followed by Synchronize After Merge. My problem is that when I run my job the first time, it inserts the text file data into the database; but the second time, it inserts the same rows into the database again. Can anyone please tell me what mistake I made?
use " Insert / Update " step in your transformation.. so it will avoid your duplication problem.
See the documentation for the Insert/Update step for a description of its options.
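Conceptually, for each incoming row the Insert/Update step behaves roughly like the following (placeholder table and column names; the step does this through prepared lookups rather than literal SQL):
-- 1. Look up the target row by the configured key
SELECT col_a, col_b FROM target_table WHERE key_col = ?;
-- 2a. If no row matches, insert the incoming row
INSERT INTO target_table (key_col, col_a, col_b) VALUES (?, ?, ?);
-- 2b. If a row matches and the mapped values differ, update only those fields
UPDATE target_table SET col_a = ?, col_b = ? WHERE key_col = ?;
That is why a second run over the same input updates the existing rows instead of inserting duplicates.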