Kettle - Read multiple files from a folder - pentaho

I'm trying to read multiple XML files from a folder, to compile all the data they have (all of them have the same XML structure), and than save that data in a CSV file.
I already have a 'read-files' Transformation with the steps: Get File Names and Copy Rows to Result, to get all the XML files. (it's working - I print a file with all the files names)
Then, I enter in a 'for-each-file' Job which has a Transformation with the Get Rows from Result Step, and then another Job to process those files.
I think I'm loosing information from the 'read-files' Transformation to the Transformation in the 'for-each-file' Job which Get all the rows. (I print another file with all the files names, but it is empty)
Can you tell me if I'm thinking in the right way? I have to set some variables, or some option that is disabled? Thanks.

Here is an example of "How to process a Kettle transformation once per filename"
http://www.timbert.net/doku.php?id=techie:kettle:jobs:processtransonceperfile

Related

Convert bulk .xlsx files to .csv (UTF-8) in Pentaho

I am new to Pentaho. I am trying to build a transformation that can convert a bunch of .xlsx files to .csv (utf-8).
I tried Get file Names and Text File Output, but it saves a single file as csv and the content of that file is the file properties.
I also tried Microsoft Excel Input and Microsoft Excel Output and that did not work either.
Any help will be appreciated. TIA!
I have prepare a SOLUTION for you. I have made my solution full dynamic. For that reason solution is combination of 6 (transformation & job). You only need to define following 2 things:-
Source folder location
Destination folder location
Others will work dynamically.
Also, I have learn a lot with this solution.
Would you like to generate a separate CSV for each Excel file?
It is better to do it like this:
Using the Get File Names component, read the list of Excel files from the folder.
Then call Execute Transformation, and pass the name of the file.
Then a separate Transformation will be performed for each file, and a separate CSV will be generated in it for each Excel file.

Capture Multiple File Names into a column on SSIS

Hope you can help me.
I am looking to retrieve multiple file-names into a column from files I wish to upload onto SQL in a custom column on SSIS.
I am aware of using Advanced Editor on the Flat File Source in Component Properties > FileNameColumnName. However, I am not sure how to make sure it picks up all the file names or what to enter in this field?
I can upload the files and all the data it holds but not the filename into a column.

Google Cloud Dataprep - Scan for multiple input csv and create corresponding bigquery tables

I have several csv files on GCS which share the same schema but with different timestamps for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through dataprep and create Bigquery tables with corresponding names. This job should be run everyday with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into raw folder. A Cloud function is then run to move 1 file from raw folder into queue folder if it's empty, do nothing otherwise.
Dataprep scans the queue folder as per scheduler. If a csv file is found (eg. data_20180103.csv) the corresponding job is run, output file is put into wrangled folder (eg. data.csv).
Another Cloud function is run whenever a new file is added to wrangled folder. This one will create a new BigQuery table with name according to the timestamp column in csv file (eg. 20180103). It also delete all files in queue and wrangled folder and proceed to move 1 file from raw folder to queue folder if there's any.
Repeat until all tables are created.
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestion for my use-case is appreciated.

Creating metadata dynamically from a flat .csv file in CC

I am having some difficulties on how to dynamically create a metadata, which need to be extracted from the header line of a flat .csv file in CC.
Usually, I manually define the metadata by select New Metadata --> Extract from flat file in CC. However the metadata of the file may changes with additional columns. Thus, I do not know the metadata of the file and I can not define it in this static approach.
It would be helpful if you could suggest a solution to create metadata dynamically and using this newly created metadata for connecting to other components. Perhaps an example graph file for demonstration would be great.
Thanks,
Andy
I have discovered this kind of solution.
You just have to fill in flat .csv filename into csv readers and writers.
MetaDataMaster.grf - runs the graphs below.
MetaDataCreator.grf - creates metadata according to csv header and
write it into meta_example.fmt file
MetaDataUser.grf - Reads csv according to created meta_example.fmt file - you can add there a reformat and use just some predefined fields.
You can run the 2nd and 3rd graph separately to test it.

How to do loop in pentaho for getting file names?

I have 100 000 files.
I want to get the name of those file names and have to put in database,
I have to do like this
get 10 files name's;
update/insert names into database; and
move those 10 files to another directory;
and loop these three steps till no files are found.
Is this possible?
I'm attaching a working example (I tested it with ~400 text files on kettle 4.3.).
transformation.ktr
job.kjb
Both transformation and job contain detailed notes on what to set and where.
Transformation.ktr It reads first 10 filenames from given source folder, creates destination filepath for file moving. It outputs filenames to insert/update (I used dummy step as a placeholder) and uses "Copy rows to resultset" to output needed source and destination paths for file moving.
job.kjb All the looping is done in this job. It executes "transformation.ktr" (which does insert/update for 10 files), and then moves those 10 files to destination folder. After that, it checks whether there's any more files in source folder. If there is, process is repeated, if not, it declares success.