How to loop in Pentaho to get file names? - pentaho

I have 100,000 files.
I want to get the names of those files and put them into a database.
I have to do it like this:
get the names of 10 files;
update/insert those names into the database; and
move those 10 files to another directory;
then loop these three steps until no files are left.
Is this possible?

I'm attaching a working example (I tested it with ~400 text files on Kettle 4.3).
transformation.ktr
job.kjb
Both transformation and job contain detailed notes on what to set and where.
transformation.ktr reads the first 10 filenames from the given source folder and creates the destination file path for moving each file. It outputs the filenames to the insert/update step (I used a Dummy step as a placeholder) and uses "Copy rows to result" to output the source and destination paths needed for the file move.
job.kjb does all the looping. It executes "transformation.ktr" (which does the insert/update for 10 files) and then moves those 10 files to the destination folder. After that, it checks whether there are any more files in the source folder. If there are, the process is repeated; if not, it declares success.
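If it helps to picture the loop outside of Kettle, the logic amounts to something like this rough Python sketch (the folder paths, batch handling, and the insert call are just placeholders, not part of the attached example):

    import shutil
    from pathlib import Path

    SOURCE = Path("/data/incoming")    # placeholder source folder
    DEST = Path("/data/processed")     # placeholder destination folder
    BATCH_SIZE = 10

    def insert_names(names):
        """Placeholder for the insert/update of file names into the database."""
        print("inserting", names)

    DEST.mkdir(parents=True, exist_ok=True)
    while True:
        # take up to 10 file names from the source folder
        batch = sorted(p for p in SOURCE.iterdir() if p.is_file())[:BATCH_SIZE]
        if not batch:
            break                                    # no files left: done
        insert_names([p.name for p in batch])        # insert/update the names
        for p in batch:
            shutil.move(str(p), str(DEST / p.name))  # move the processed files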

Related

Does a Hive table skip headers in all the files?

Hive has a table property, "skip.header.line.count"="1", that makes an external table skip the header line of the file.
So what is the behavior if the folder has multiple files? I had this doubt and just verified it.
The header is skipped in every file in the folder.
If the folder the table points to contains multiple files, Hive skips the first N rows ("skip.header.line.count"="N") of each and every file in that folder.
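For reference, a table definition along these lines sets that property (the table name, columns, and location are made up for illustration); the first line of every file under the location is then skipped:

    -- Hypothetical external table over a folder of CSV files;
    -- the header line of EVERY file under /data/sales is skipped.
    CREATE EXTERNAL TABLE sales_raw (
        order_id INT,
        amount   DECIMAL(10,2)
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/sales'
    TBLPROPERTIES ("skip.header.line.count"="1");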

inserting multiple text files

I have 4 different text files, each with a different name and different columns, placed in one folder. I want these 4 files to be inserted or updated into 4 different existing tables. How can I read these 4 files dynamically and insert them into their respective tables dynamically in SSIS?
Well, you need to use a Data Flow Task to move data from a Flat File Source to a table destination (an OLE DB Destination, perhaps). Are the columns in your files delimited in any way, for example with (;), (|), or something like that? If they are, you can create a Flat File connection manager and set it to split the columns. If not, you might need to use the fixed-width option to separate your columns. To use the OLE DB Destination, you will need to create an OLE DB connection manager that points to the table in your database. I could help you more if I had more information about the files you want to read the data from.
EDIT
Well, you said at the start you were working with 4 files and 4 tables, so you can create 4 Flat File Sources with 4 OLE DB Destinations as well (one of each per flat file). If I understood you correctly, these 4 files may or may not exist yet. So if you know the names the files will get, change the package property DelayValidation to true, and then create a connection with a sample text file. You do this so the file path gets saved. The tables, in my opinion, DO need to exist. Now, when you said:
i want to load all the text files into each different existing table whenever there is files inside the folder.
The only way I know to do something similar is to schedule the execution of your package at a certain time with a SQL Server Agent job. Please let me know if this is what you were looking for.
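For what it's worth, a rough T-SQL sketch of such a SQL Server Agent job is below; the job name, schedule, and package path are made up for illustration:

    -- Create a SQL Server Agent job that runs the SSIS package every day at 01:00
    USE msdb;
    EXEC dbo.sp_add_job         @job_name = N'Load text files';
    EXEC dbo.sp_add_jobstep     @job_name = N'Load text files',
                                @step_name = N'Run package',
                                @subsystem = N'SSIS',
                                @command   = N'/FILE "C:\Packages\LoadTextFiles.dtsx"';
    EXEC dbo.sp_add_jobschedule @job_name = N'Load text files',
                                @name = N'Daily 1 AM',
                                @freq_type = 4,           -- daily
                                @freq_interval = 1,
                                @active_start_time = 010000;
    EXEC dbo.sp_add_jobserver   @job_name = N'Load text files';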

VBA macro creates temporary files

I am using a macro which creates temporary docx files that are then assembled into one.
I then delete the temporary files.
These files still show up in the Recent Files list, even though they no longer exist.
How can I prevent these temp files from being recognized by Word as a recent file?
Or is there a way to save the contents of the would-be temporary file in an array and then use that array to build the final file? Meaning the temp file never actually exists.
The fifth parameter of Document.SaveAs is AddToRecentFiles. Set that to False.
https://msdn.microsoft.com/en-us/library/office/aa220734
Alternatively, you can create the temporary documents, combine them into one, and then close them without saving them, so they are never written to disk.
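For example, a minimal VBA sketch of both approaches (the file path and content are made up for illustration):

    Sub SaveTempWithoutRecentFiles()
        Dim tempDoc As Document
        Set tempDoc = Documents.Add

        tempDoc.Content.Text = "Temporary content"

        ' AddToRecentFiles:=False keeps the saved file out of Word's Recent Files list
        tempDoc.SaveAs FileName:="C:\Temp\part1.docx", AddToRecentFiles:=False

        ' Alternatively, skip SaveAs entirely: copy the content into the final
        ' document first, then close the temporary document without saving it
        tempDoc.Close SaveChanges:=wdDoNotSaveChanges
    End Sub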

Kettle - Read multiple files from a folder

I'm trying to read multiple XML files from a folder, to compile all the data they have (all of them have the same XML structure), and then save that data in a CSV file.
I already have a 'read-files' Transformation with the steps Get File Names and Copy Rows to Result to get all the XML files. (It's working: I print a file with all the file names.)
Then I enter a 'for-each-file' Job which has a Transformation with the Get Rows from Result step, and then another Job to process those files.
I think I'm losing information between the 'read-files' Transformation and the Transformation in the 'for-each-file' Job that gets all the rows. (I print another file with all the file names, but it is empty.)
Can you tell me if I'm thinking about this the right way? Do I have to set some variables, or enable some option that is disabled? Thanks.
Here is an example of "How to process a Kettle transformation once per filename"
http://www.timbert.net/doku.php?id=techie:kettle:jobs:processtransonceperfile

Best Way to ETL Multiple Different Excel Sheets Into SQL Server 2008 Using SSIS

I've seen plenty of examples of how to enumerate through a collection of Excel workbooks or sheets using the Foreach Loop Container, with the assumption that the data structure of all the source files is identical and the data is going to a single destination table.
What would be the best way to handle the following scenario:
- A single Excel workbook with 10 - 20 sheets, OR 10 - 20 Excel workbooks with 1 sheet each.
- Each workbook/sheet has a different schema.
- There is a 1:1 matching destination table for each source sheet.
- Standard cleanup: workbooks are created and placed in a "loading" folder; the SSIS package runs as a job that reads the files in the loading folder and moves them to an archive folder upon successful completion.
I know that I can create a separate SSIS package for each workbook, but that seems really painful to maintain. Any help is greatly appreciated!
We faced the same issue a while back. I will just summarize what we have done.
We wrote an SSIS package programmatically using C#. A MetaTable is maintained which holds the information about the flat files (target table name, columns, and the positions of those columns in the flat file). We extract the flat file's name and then query the MetaTable for the table this flat file belongs to, the columns it has, and the column positions in the flat file.
We execute the package on SQL Server by passing each flat file as a command-line argument to the package executable, so it reads and processes each flat file.
Example: suppose we have a flat file FF. We first extract the name of the flat file and then get the table name by querying the DB; let's say it is TT, which contains columns COL-1 and COL-2 with positions 1 to 10 and 11 to 20 respectively. By reading this information from the MetaTable, the package then builds a derived column transformation.
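To illustrate that lookup, here is a rough T-SQL sketch of what such a MetaTable and query could look like; the schema and names here are only illustrative, not the actual ones used:

    -- Hypothetical metadata table describing each flat file's target table and layout
    CREATE TABLE MetaTable (
        FlatFileName  VARCHAR(100),   -- e.g. 'FF'
        TargetTable   VARCHAR(100),   -- e.g. 'TT'
        ColumnName    VARCHAR(100),   -- e.g. 'COL-1', 'COL-2'
        StartPosition INT,            -- e.g. 1, 11
        EndPosition   INT             -- e.g. 10, 20
    );

    -- Given a flat file name, look up its target table and column positions
    SELECT TargetTable, ColumnName, StartPosition, EndPosition
    FROM MetaTable
    WHERE FlatFileName = 'FF'
    ORDER BY StartPosition;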
Our application has a set of flat files in a folder, and by using the SSIS For Loop Container we get one flat file at a time and do the above process.