Pentaho Kettle - Get the file names dynamically

I hope this message finds everyone well!
I'm stuck on a situation with the Pentaho PDI tool and I'm looking for an answer (or at least a light at the end of the tunnel) to solve it!
Every month I have to import a bunch of XLS files from different clients. Every file has a different name (which is assigned randomly), and the files sit in a folder named after the client. However, I use the same process for all clients and situations.
Is there a way to pass the name of the directory as a variable and change this variable for every process? How can I read these files from different paths?

The answer you're looking for requires a flow with variables, as you stated. In a job, you start with a KTR that produces the client's name and their respective folder. In the same job you pass these results on as variables, to another job if needed, or to a KTR, using the options "Copy previous results to parameters" and "Execute for every input row" (Advanced tab); in the Parameters tab you name the variables and the stream column each one comes from in the previous KTR (i.e. client name and directory).
If you have trouble creating this flow, I can spare some more time and share a sample if you need.
EDIT:
Sample Here

You have an example of this in the samples directory that ships with your PDI distribution.
Your case is covered by samples/jobs/run_all.

Related

Dynamic path creation

While developing transformations locally, I set my transformation paths to the target folders on my local PC. Once testing is completed locally, I move the transformation to the server repository to schedule it in the server environment, but every time I have to change the paths to point to the server folders. I believe this can be done by creating a dynamic path or a variable, but I am unable to work it out. Is this option available in Pentaho? If yes, can you please help me set up a dynamic path?
In this answer there is a link to a described solution, and in that answer I have a sample KTR that should help.
You can also use the Pentaho properties file across environments, meaning you can use the same variable in both environments, say ${path}, but give it a different value in each environment.
kettle.properties can be found in the .kettle folder under your user folder, e.g. C:\Users\YourUser\.kettle
The standard way to handle environments in Kettle is with variables.
In the home directory there is a (hidden) folder named .kettle which contains everything that should stay local: your preferences, your shared connections, your cache, and, most of all, THE kettle.properties file.
You can define variables in it, like ${myPath}. To do this, use the menu Edit / Edit the kettle.properties file, add a variable named myPath, and give it your preferred path as a value, with an optional description.
Then, whenever you see a blue diamond with a $ on the right of a field in a step window (which covers almost any field you'll need), you can press Ctrl+Enter in the field and choose any variable defined in your kettle.properties. Alternatively, you may type or copy/paste ${your-variable-name} into the field.
Then, when you launch Spoon, it will not use a hard-coded path, but the content of the variable in kettle.properties.
And nothing prevents you from having a different kettle.properties on your dev PC and on the prod server.
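For illustration, a minimal sketch of what this looks like, assuming a variable named myPath and a hypothetical input file referenced from a step; the actual paths are placeholders:

    # kettle.properties on the dev PC (C:\Users\YourUser\.kettle)
    myPath = C:/data/dev/imports

    # kettle.properties on the prod server (/home/etl/.kettle)
    myPath = /srv/etl/imports

    # In any variable-enabled step field (the blue $ diamond), reference it as:
    #   ${myPath}/input.xls

The job or transformation itself never changes; only the value of myPath differs between the two machines.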
While we are at it, three useful tricks:
There is a predefined ${Internal.Job.Filename.Directory} variable containing the directory of the current job/transformation, which can be used for relative paths. For example, ${Internal.Job.Filename.Directory}/../myDir/myFile.ext.
If you right-click anywhere on the canvas and go to Properties/Parameters, you can also define your variables there.
You can also redefine these variables in the Run Options window that annoys you each time you run a transformation (yes, there is a reason for it).
Finally, you can pass these variables from jobs to other jobs and transformations.

Automatically add database entry after ftp upload

Sorry if this seems stupid, but I wonder if it's possible to add a database entry after an FTP upload.
To be clearer: thanks to WinSCP, I have several folders that automatically send everything I put in them to my server.
However, I would like to create a MySQL entry for each uploaded file, again automatically. Is it possible to do that? How?
To give the full details of what I need to do, you can read the following.
I have several folders with pictures, and each folder is uploaded automatically.
Each of those folders belongs to one user, and the goal is to give each user an account and allow them to see and download those files through a web interface. Since one account = one folder, that's fairly easy.
And I think a simple .htaccess can secure things so that a user can only see and download the files in his own directory, no?
However, if I want them to be able to see what's new (= something they didn't download or simply mark as read), I think I need a table to manage those files.
Something like id | file (string) | read (bool).
If you think this approach is bad, then I'm open to changing how things are done, but to be clear, uploading the files needs to work this way, not through any kind of form.
Thanks for reading, and sorry for my English.
Your problem contains three steps:
Folders/files are automatically uploaded to your server directory; as you say, this is already handled efficiently by WinSCP.
You need to update your database with all the files and folders present in your server directory.
You need to track whether or not each file has been read/downloaded by the user.
Since your first step is in place, we don't need anything there. For the second step, you should write a script and schedule it to run at a fixed interval using cron (on Linux/UNIX; on Windows, use Task Scheduler). The script would be responsible for building a list of the file(s) present in the directory and simply inserting the information for the file(s) that are not yet present in your database.
EDIT:
This edit describes how your script file should work. As I explained, the cron job simply runs your script file at a fixed interval (which can be every minute, every hour, every day, and so on). Let's say your database table has the following columns:
fileid (varchar[20])
filepath (varchar[20])
status (boolean)
Your script file should do following things:
Create a list of the file paths existing in your server directory.
Run a select query and create a list of the file paths existing in the database table.
Compare list 1 with list 2 and find the paths that don't exist in list 2 (this gives you the list of file paths that need to be inserted into the table).
Insert the file paths you found above and set their status to false (meaning the file has not been read/downloaded yet).
NOTE: Please keep in mind that I am not prescribing how your database table should look. It can be what you proposed, or it can differ depending on your requirements.
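To make the second step concrete, here is a minimal sketch of such a script in Python (just one possible language), assuming a table with the columns above and a hypothetical name files, the mysql-connector-python package, and a hypothetical upload directory; all names and credentials are placeholders:

    # sync_files.py - insert newly uploaded files into the database.
    # Schedule it with cron, e.g.:  */5 * * * * /usr/bin/python3 /path/to/sync_files.py
    import os
    import uuid
    import mysql.connector  # pip install mysql-connector-python

    UPLOAD_DIR = "/var/uploads"  # hypothetical upload directory

    db = mysql.connector.connect(
        host="localhost", user="app", password="secret", database="uploads"  # placeholders
    )
    cur = db.cursor()

    # 1. List the file paths currently on disk.
    on_disk = set()
    for root, _dirs, files in os.walk(UPLOAD_DIR):
        for name in files:
            on_disk.add(os.path.join(root, name))

    # 2. List the file paths already recorded in the database.
    cur.execute("SELECT filepath FROM files")
    in_db = {row[0] for row in cur.fetchall()}

    # 3. Insert files that are on disk but not yet in the table,
    #    with status = FALSE (not read/downloaded yet).
    for path in on_disk - in_db:
        cur.execute(
            "INSERT INTO files (fileid, filepath, status) VALUES (%s, %s, %s)",
            (uuid.uuid4().hex[:20], path, False),
        )

    db.commit()
    cur.close()
    db.close()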
For the third step, simply keep the status of each file as unread when creating entries in your table during the second step; then, when the user clicks the file link in your application, whether to view or download it, send a POST request to your server that updates the file's status to mark it as read.
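For illustration, the handler behind that POST request only needs to flip the status flag; a rough sketch using the same assumed table and driver as above, where file_id comes from the request:

    # Called by the web application when the user views/downloads a file.
    def mark_file_as_read(db, cur, file_id):
        cur.execute("UPDATE files SET status = TRUE WHERE fileid = %s", (file_id,))
        db.commit()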
Let me know if this helps!

How to illustrate incoming data from one OR two sources in BPMN model?

I've studied BPMN in coursework; this is my first time applying it to real-world scenarios that don't follow any of my textbook examples.
I am trying to illustrate a process where a client can either upload a CSV file, manually enter records, or both. At the end of the day, all records are loaded to a production database via a script. At the moment, I've got it like this:
But, unless one reads the notes attached to each object, this tells me that uploaded AND manual data will be present.
In BPMN, how would I designate that Path "A", Path "B", OR both could be valid? How do I label the gateway? I anticipate putting the scripting step between the data input and the production database, but I'm not quite sure, again, how to specify that the script runs ONCE based on the presence of data from EITHER feed, not necessarily both.
What would this typically look like? Thanks in advance.
In BPMN, to express that Path A, Path B, or both could be valid ways forward, you can use an "inclusive or" gateway. I would typically label the split with a question and the outgoing paths with the "answers", in other words the conditions under which the paths are activated. If I understand your example correctly, a possible solution could look like the following.
Whether you want to use the task types I used depends a bit on your specific context. My task types in that example would mean that for the "upload" the process is "waiting for an incoming message", while in the case of manual entry it is "waiting for a user to complete the task" (by entering the required data).
The example also assumes that you know before you reach the inclusive or gateway which channels you will want to use this time.

Check for multiple files

Okay, I'll try to explain as well as I can... Quite a particular case.
Tools: SSIS 2008
We have a control flow that now needs to be triggered by an event: the presence of one or multiple files. (1,2 or 3)
The variables used:
BO_FileLocation_1
BO_FileLocation_2
BO_FileLocation_3
BO_FileName_1
BO_FileName_2
BO_FileName_3
There can be one, two or three files, defined in the variables above. When they are filled in, the files should be processed. When they are empty, it means there is just one file; the process should ignore them and jump to the next (file watcher?) task.
For example:
BO_FileLocation_1= "C:\"
BO_FileLocation_2 NULL
BO_FileLocation_3 NULL
BO_FileName_1= "test.csv"
BO_FileName_2 NULL
BO_FileName_3 NULL
The report only needs one file.
I need a generic concept that checks for the presence of these files; it may need to be more generic than my SSIS knowledge can handle right now. It would be handy, for example, when there is a 4th file in the future. I was also thinking of using a single script to handle all the logic.
Thanks in advance
If all you want is to trigger the Copy Source File task when one or more of the files is present, just use the OR constraint in your flow. The following image shows you how:
First connect all of them to the destination:
Then click one of the green arrows. This will make its properties window pop up. Select Logical OR instead of Logical AND:
If everything went well, you should now see the connections as dashed lines:
There are several possible solutions:
Create a sequence container and include all the file imports in the sequence container. Add int variables RowCountFile1, RowCountFile2, and RowCountFile3 and set their values to 0 (this is the default when you create an int variable). Add a Row Count transformation to each of the data flows. Create a precedence constraint from the sequence container to the "Do something" task. Set the precedence constraint to success and expression, and set the expression to @RowCountFile1 > 0 || @RowCountFile2 > 0 || @RowCountFile3 > 0. The advantage of this approach is that you can take an action as soon as the files are detected, you import all available files, and you only take an action after all the files have been imported. You could then schedule this SSIS package as a SQL Server Agent job step and run it as frequently as you want.
A variant on solution 1 is to use Foreach File enumerator containers inside the sequence container. This is useful if you don't know the exact name of the file and you expect to import more than one under some circumstances. For instance, if you get a file every few minutes with a timestamp in its file name and your process doesn't run for some reason, then you may have to process multiple files to catch up and then take an action once that is done.
You could use the File Watcher task as you outlined in your question. The only problem I have with the File Watcher task is that the package has to be in a constantly running state, which makes it hard to troubleshoot problems and performance. It can also introduce other issues; I remember having some problems with the File Watcher task years ago when it first came out. It may well be a totally stable task now, but I prefer other methods after having been burned previously. If you really want the package to run continuously instead of having it be called by a job, you could always use a Script Task to check for the file, sleep the thread if it's not found, check again, etc. I'm sure that's what the File Watcher task does, but I would trust my own C# over the task. Power to anyone who has had better experiences than me with File Watcher...
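For what it's worth, that check-sleep-check loop is only a few lines; here is a rough sketch of the logic (shown in Python purely for illustration — inside an SSIS Script Task you would write the equivalent in C# or VB.NET; the directory and wildcard pattern are placeholders):

    # Minimal polling loop: look for matching files, sleep, try again.
    import glob
    import time

    WATCH_PATTERN = r"C:\incoming\BO_*.csv"  # hypothetical wildcard pattern
    POLL_SECONDS = 60

    def wait_for_files(pattern, poll_seconds):
        while True:
            matches = glob.glob(pattern)
            if matches:
                return matches  # hand the file list on to the rest of the process
            time.sleep(poll_seconds)

    files = wait_for_files(WATCH_PATTERN, POLL_SECONDS)
    print("Found:", files)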
Use PowerShell. If you just want to take an action when a file appears and you aren't importing the data, then a PowerShell script could do this just as well as an SSIS package. The drawbacks are that you have to learn some basic PowerShell, it may be hard to maintain in the future since PowerShell is probably not your bread-and-butter language, and you may have to rewrite the code as an SSIS package if you later want to import the data. You would probably call the PowerShell script from a SQL Server Agent job step, so scheduling can be handled pretty easily.
There are more options than what I listed, so let me know if you still want more suggestions.

Pentaho Spoon - Wait for File - Wildcards

I know I've asked a couple of Pentaho-related questions lately, but I'm rushing to evaluate it in a short timeframe :)
The latest obstacle I am trying to overcome: I am building a job that will process an input file when it arrives, but I only know the format of the filename, not the exact filename itself, and the "Wait for file" step does not allow wildcards. This seems like a glaring omission for such a step, so I'm wondering if I've just missed something, but on forums etc. it seems I'm not the only one facing this challenge.
Ideally I need the "Wait for file" step to search on a wildcard/regex and, when it finds a match, pass the resulting file name to the next step in the job for processing.
Any suggestions?
Thanks
Tom
Again, I'll try to answer your question.
Actually, you don't need a job to wait for a file. Based on my answer on the country split question (Pentaho Spoon - Output to multiple files based on field content), you just need to pass through the source name and then archive it using a process-file step (see the pic below).
From here, I think you can adapt my logic using the KTR I provided before (http://pentaho.phi-integration.com/kettle/kettle-files/split_countries.ktr).
Then you can control the repetition of the job (wait for and process files) using a job scheduler (see the pic).
Well, hope this helps, Tom!
Regards,
Dino
I had a similar requirement, and solved this by creating a directory specifically for receiving the files (from a remote host).
The the "Get File Names" step reads the files in the directory and passes the name to the next step. The "Get File Names" allows wildcards, btw.
(Off course, I have to clean up in input queue once I have finished processing the file.)
EDIT: I omitted to mention that you loose the "wake" functionality with the Get File Names, and you'll have to loop and schedule regular parses of the directory.