As a continuation to this question: Pentaho Spoon - Output to multiple files based on field content.
How can I do the same for the "XML output" step, since "XML output" doesn't have an "Accept file name from field" option in its form?
Since the step doesn't accept the filename from a previous field, you can use variables to set the value instead.
It's a longer path, but it gets you there:
You need to create a transformation that sets the filenames in the result (using the "Copy rows to result" step).
You need to create another transformation with the "Execute every input row" option checked. This transformation has to read from the result (the "Get rows from result" step) and set the values with a "Set Variables" step.
Finally, a third transformation creates the XML file, using the previously created variable as the filename (variables are referenced with the syntax ${variable_name}).
You then connect those 3 transformations with a job; a sketch of the first transformation follows.
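As a minimal sketch of that first transformation, the filename field sent to "Copy rows to result" could be built in a "Modified Java Script Value" step (the field "region" and the output path are illustrative assumptions, not from the original answer):

// Hypothetical: build one target filename per input row;
// add xml_filename as a new field in the step's Fields grid
var xml_filename = "D:/output/orders_" + region + ".xml";

The "Set Variables" step in the second transformation then maps the xml_filename field to a variable (say XML_FILENAME), and the "XML output" step in the third transformation uses ${XML_FILENAME} as its filename.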
Related
Can anyone show me an example of a "Copy rows to result" / "Get rows from result" job? I am talking about a successfully tested job.
Goal:
"Get rows from result" is sending some keywords.
"Get file names" delivers all the files a certain directory is containing.
Filter rows" looks for certain files, i.e. it is using the keywords to find the files (transf. will run in a loop, i.e. "Execute every input row" is checked).
Have got:
One Job
Within this job I've got two transformations
Transf. 1 reads a text file and copies the rows to the result. It works properly, since "Preview data" shows the values I want to pass.
"Execute every input row" is checked in the settings of the 2nd transf., as well as "copy results to parameters".
Transf. 2 begins with a "Get rows from result" step, but the preview does not show any values, and "Input fields" can't find any field.
The "Stream column name" is the same as the name of the column copied to the result.
Solution
The 2nd transformation should contain only the following steps:
"Get file names": Use the parameter to find the file(s) (there can be multiple, hence "Execute every input row" stays checked). In my case, I used it within the "Wildcard" column, as shown below.
"Set files in result"
So I've got about 10 JSON files that I have to stuff into an Elasticsearch setup. I currently have 3 steps: "Get file names", "JSON input", and "Elasticsearch bulk insert". When I look at the step metrics, I see that "Get file names" is correctly reading the 10 files, but when it comes to the JSON input, only the first file is processed. What could be going on?
Here is an image of my setup, and I've attached the ktr file.
Link to the ktr file as it stands currently
Any help is greatly appreciated.
In the Content tab of the JSON input step, you have the "Limit" attribute set to 1. You can edit it by unchecking the "Source is from a previous step" option in the File tab; set "Limit" to 0 (0 means no limit, so all incoming files are read) and re-check the option.
New to PDI here. I need to output data from a view in a PostgreSQL database to a file daily. The output file name will look like xxxx_20160427.txt, so I need to append the dynamic date to the file name. How do I do it?
EDIT:
I was not clear in asking how to add a dynamic date: I want to add not just the date but other optional parts to the file name, e.g. a serial number (01) at the end: xxxx_2016042701.txt. So my real question is: how do I build a dynamic file name? In other ETL tools, e.g. SSIS, this would be a simple expression. How is it done in PDI?
In your Text file output step, simply check "Include date in filename?" under the Files tab.
You can create a dynamic filename field with a "Modified Java Script Value" step.
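A minimal sketch of such a script, assuming the date plus a fixed serial suffix (the names and format are illustrative):

// Build the file name per row; add filename_var as a new field
// in the step's Fields grid below the script
var fmt = new Packages.java.text.SimpleDateFormat("yyyyMMdd");
var serial = "01"; // any other dynamic part goes here
var filename_var = "xxxx_" + fmt.format(new Packages.java.util.Date()) + serial + ".txt";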
Then, in the Text file output step, check "Accept file name from field?" and select the field created in the previous step (filename_var in this example).
How can we compute a checksum for the "entire" file data in Pentaho?
I know we can calculate a checksum using the "Add a checksum" step (but it returns a checksum value per row for my CSV file input) and the "Calculator" step (but it returns a null or zero checksum per row for my CSV file input).
Instead, I want a checksum for the entire file data and NOT per individual row. How can we achieve this?
Thank you
You can use Java classes via the "Modified JavaScript Value" step like this ("filename" is a field containing the path to the file):
// "filename" comes in as a field on each row
var file = new Packages.java.io.File(filename);
var fileInputStream = new Packages.java.io.FileInputStream(file);
// DigestUtils reads the whole stream and returns the MD5 digest as a hex string
var md5_hash = Packages.org.apache.commons.codec.digest.DigestUtils.md5Hex(fileInputStream);
fileInputStream.close();
Alternatively, load the entire file into a single row using the "Load file content in memory" step, apply a checksum to that, then join it back to your regular data flow with a cartesian join or a "Stream lookup" on the filename.
Finally, I was able to compute the checksum of the entire file.
I used the "User Defined Class" step with Java's java.security.MessageDigest class to compute and return the checksum of a file read with a FileInputStream.
Thanks
Does anyone know how to set a variable for the file name in "Text File Input"?
I want the file name to depend on when I execute the transformation, for example:
D:\input_file_<variable>.txt
today = D:\input_file_20131128.txt
tomorrow = D:\input_file_20131129.txt
FYI, I'm using Kettle Spoon 4.2.0.
In the step's settings form, you can reference a variable as ${Variable_Name} in the file name field.
To populate it, you'd normally use a "Set Variables" step, but note the warning it shows:
Please remember that the variables you define with this step can't be used in this transformation. This is simply because all steps in a transformation run in parallel without a certain order of execution.
So the correct usage is to set the variables you need in the first transformation of a job, and use them in the transformations that follow, as sketched below.
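A minimal sketch of such a first transformation, assuming a single-row generator (e.g. "Generate Rows" with limit 1) feeding a "Modified Java Script Value" step (the variable name FILE_DATE is illustrative):

// Build a yyyyMMdd stamp and expose it to the rest of the job;
// scope "r" makes the variable valid up to the root job
var fmt = new Packages.java.text.SimpleDateFormat("yyyyMMdd");
setVariable("FILE_DATE", fmt.format(new Packages.java.util.Date()), "r");

A later transformation's "Text File Input" step can then use D:\input_file_${FILE_DATE}.txt, which resolves to D:\input_file_20131128.txt on that day's run.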