I know I've asked a couple of Pentaho-related questions lately, but I am rushing to evaluate it in a short timeframe :)
My latest obstacle is that I am building a job that will process an input file when it arrives, but I only know the format of the filename, not the exact filename itself, and the "wait for file" step does not allow wildcards. This seems like a glaring omission for such a step, so I am wondering if I've just missed something, but judging by the forums I'm not the only one facing this challenge.
Ideally, I need the "wait for file" step to search on a wildcard/regex and, when it finds a match, pass the resulting file's name to the next step in the job for processing.
Any suggestions?
Thanks
Tom
Once again I'll try to answer your question.
Actually, you don't need a job to wait for a file. Building on my answer about the country split (Pentaho Spoon - Output to multiple files based on field content), you just need to pass the source filename through and then archive it using a Process Files step (see the pic below).
From here, I think you can adapt my logic using the ktr I provided before (http://pentaho.phi-integration.com/kettle/kettle-files/split_countries.ktr).
Then you can control the repetition of the job (wait for and process the files) using a job scheduler (see the pic).
Well, hope this helps, Tom!
Regards,
Dino
I had a similar requirement, and solved this by creating a directory specifically for receiving the files (from a remote host).
The the "Get File Names" step reads the files in the directory and passes the name to the next step. The "Get File Names" allows wildcards, btw.
(Off course, I have to clean up in input queue once I have finished processing the file.)
EDIT: I omitted to mention that you loose the "wake" functionality with the Get File Names, and you'll have to loop and schedule regular parses of the directory.
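To make the scan-and-clean-up idea concrete outside of Spoon, here is a minimal sketch of the same loop. The directory names, the wildcard pattern and the process_file() hook are all made up for illustration; the real work would of course be done by your transformation.

```python
# Scan an inbox directory for files matching a wildcard, hand each one to the
# processing logic, then move it out of the way so it is not picked up again.
import glob
import os
import shutil

INBOX = "/data/inbox"        # hypothetical receiving directory
ARCHIVE = "/data/archive"    # hypothetical "processed" directory
PATTERN = "input_*.csv"      # the wildcard you would put in Get File Names

def process_file(path):
    print(f"processing {path}")  # stand-in for the real transformation

for path in sorted(glob.glob(os.path.join(INBOX, PATTERN))):
    process_file(path)
    # clean up the input queue once the file has been handled
    shutil.move(path, os.path.join(ARCHIVE, os.path.basename(path)))
```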
I hope this message finds everyone well!
I'm stuck on a situation with the Pentaho PDI tool and I'm looking for an answer (or at least a light at the end of the tunnel)!
I have to import, every month, a bunch of xls files from different clients. Every file has a different name (which is assigned randomly), and the files sit in a folder named after the client. However, I use the same process for all clients and situations.
Is there a way to pass the name of the directory as a variable and change this variable for every process? How can I read these files from different paths?
The answer you're looking for requires a flow with variables, as you stated. In a JOB you start with a KTR that produces the client's name and their respective folder. In the same JOB you pass these results on as variables, either to another JOB if needed or to a KTR, using the options "Copy previous results to parameters" and "Execute for every input row" (Advanced tab). In the Parameters tab you name the variables and the stream column name (where the data comes from in the previous KTR, i.e. the client's name and directory).
If you have trouble creating this flow, I can spare some more time and share a sample if you need.
EDIT:
Sample Here
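To show the "execute for every input row" idea in a language-neutral way, here is a loose sketch: one row per client and folder, and the same processing logic re-run with the folder as a parameter. The client list, base paths and handle_workbook() are invented for the example.

```python
# One "row" per client; the folder plays the role of the KTR parameter.
import glob
import os

clients = [
    {"name": "ClientA", "folder": "/imports/ClientA"},
    {"name": "ClientB", "folder": "/imports/ClientB"},
]

def handle_workbook(client_name, path):
    print(f"{client_name}: importing {path}")  # stand-in for the shared KTR

for row in clients:
    # re-run the same process for every input row, parameterised by folder
    for xls in glob.glob(os.path.join(row["folder"], "*.xls*")):
        handle_workbook(row["name"], xls)
```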
You have an example of this in the samples directory shipped with your PDI distribution.
Your case is covered by samples/jobs/run_all.
I want to get errors generated by the system in Pentaho Kettle and expose them as results in a transformation or job. For example, I want to take the errors of the HL7 input step from the log and expose them as results in the next step.
I want to get errors generated by the system
You mean like Apache or MySQL errors? If that's the case, you can just point a Pentaho transformation at those files. They usually live in a default place like /var/log/apache2, and that would be pretty easy to read.
The part that's not so easy is parsing those errors into something easier to analyse. For that I would use "load file in memory" and some "regex evaluation" steps to get the data you want out of the raw text.
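As a rough sketch of the "load the file, then regex-evaluate it" idea, the snippet below reads a log and pulls out the lines that look like errors. The path and the line format are assumptions; adjust the pattern to whatever your log actually emits.

```python
# Read an error log and extract timestamp + message from lines tagged [error].
import re

LOG_PATH = "/var/log/apache2/error.log"  # example location, not a given
pattern = re.compile(r"^\[(?P<when>[^\]]+)\]\s+\[error\]\s+(?P<message>.*)$")

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = pattern.match(line)
        if match:
            print(match.group("when"), match.group("message"))
```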
But there are better solutions for reading your logs and analysing errors.
See Logstash or similar products for more info.
You could save those results in a temporary CSV file that the next step(s) can consume.
If you go with this solution I would recommend:
Adding a unique jobID or identifier in the file name to ensure that your next step is reading the right file.
Adding a step at the end that removes old temp files
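A minimal sketch of both suggestions follows: write the intermediate results to a CSV whose name carries a unique job id, then sweep away temp files older than a day. The directory, field names and example rows are placeholders, not part of any Kettle API.

```python
# Write a per-run temp CSV, then clean up stale ones from earlier runs.
import csv
import os
import time
import uuid

TMP_DIR = "/tmp/etl"
os.makedirs(TMP_DIR, exist_ok=True)

job_id = uuid.uuid4().hex                                  # unique per run
out_path = os.path.join(TMP_DIR, f"errors_{job_id}.csv")

rows = [{"step": "HL7 input", "error": "parse failure"}]   # example payload
with open(out_path, "w", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["step", "error"])
    writer.writeheader()
    writer.writerows(rows)

# final step: drop temp files older than 24 hours
cutoff = time.time() - 24 * 3600
for name in os.listdir(TMP_DIR):
    path = os.path.join(TMP_DIR, name)
    if name.startswith("errors_") and os.path.getmtime(path) < cutoff:
        os.remove(path)
```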
I have 5000 files in a folder, and new files keep getting loaded into the same folder on a daily basis. I need to pick up only the latest file among all the files each day.
Is it possible to achieve this scenario in Mule out of the box?
I tried keeping a File component inside a Poll component (to make use of the watermark), but it is not working.
Is there any way we can achieve this? If not, please suggest the best approach (any relevant links).
Mule Studio 5.3, Runtime 3.7.2.
Thanks in advance
Short answer: there isn't really an extremely quick out-of-the-box solution, but there are other ways. I'm not saying this is the right or only way of solving it, but I've implemented a similar scenario like this before:
A normal File inbound endpoint with a database table as a file log. Each time a new file is picked up, a component checks whether its name already appears in the table. Using a choice or filter, I only continue if it isn't in there already, and after processing I add the filename to the table.
This is quite a "heavy" solution, though. A simpler approach would be to use an idempotent filter with an object store, for example a Redis server: https://github.com/mulesoft/redis-connector/blob/master/src/test/resources/redis-objectstore-tests-config.xml
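For clarity, here is the idempotent idea sketched in plain Python rather than Mule config: keep a persistent log of filenames that have already been handled and skip anything already in it. The directory and log location are placeholders.

```python
# Process each file in the inbox at most once by tracking names in a log file.
import glob
import os

INBOX = "/data/inbox"
SEEN_LOG = "/data/processed_files.log"

def load_seen():
    if not os.path.exists(SEEN_LOG):
        return set()
    with open(SEEN_LOG, encoding="utf-8") as fh:
        return {line.strip() for line in fh}

seen = load_seen()
for path in sorted(glob.glob(os.path.join(INBOX, "*"))):
    name = os.path.basename(path)
    if name in seen:
        continue                      # already processed: the idempotent check
    print(f"processing {name}")       # stand-in for the real flow
    with open(SEEN_LOG, "a", encoding="utf-8") as fh:
        fh.write(name + "\n")         # record it so reruns skip it
```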
It is actually very simple if your incoming filename contains a timestamp: you can configure the file inbound connector by setting file:filename-regex-filter pattern="myfilename_#[function:timestamp].csv". I hope this helps.
Maybe you can use a Quartz scheduler (specify the time in a cron expression), followed by a Groovy script in which you start the file connector. Keep the file connector in another flow.
We have a need to create a daily process that will manipulate a file that is currently being generated manually before it is FTPed to a vendor. The issues with the current file are as follows:
1) It is currently comma-delimited and it needs to be pipe-delimited.
2) The vendor only wants specific columns to be sent; they have a limit of 26 columns.
We need to develop an automated process that can be scheduled to run once a day and pick up a file with a specific extension, do the file manipulation and FTP the file.
Ideally, we would like to have some error handling in the process. We would want an email to get sent out if there was no file to process or if there was an error during the manipulation or FTP process.
My first thought was to use SQL Server Import/Export. I've done this before, but only for packages that could be run manually. This process needs to be fully automated (after the existing file is manually generated). I don't see a way to pick up any file with a specific extension; it looks like I have to select a specific file.
Is there a way to use Import/Export or some similar tool?
Or, do I need to write a program to do this sort of task? It seems to me like it would be more work to write a program. So, I am trying to avoid that.
Thank you for your help!
You should write a program. Seriously.
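To give an idea of how small that program could be, here is a rough sketch. Every host name, credential, column name and path below is a placeholder, and the error handling is deliberately minimal; it only shows where the notification e-mail would go.

```python
# Pick up the first file matching a pattern, re-emit only the wanted columns
# pipe-delimited, FTP the result, and e-mail on "no file" or any failure.
import csv
import glob
import smtplib
from email.message import EmailMessage
from ftplib import FTP

SOURCE_GLOB = "/exports/*.dat"                     # "specific extension"
WANTED_COLUMNS = ["AccountId", "Amount", "Date"]   # up to 26 vendor columns

def notify(subject, body):
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "etl@example.com", "ops@example.com"
    msg.set_content(body)
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)

def convert(src, dst):
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)                       # comma-delimited in
        writer = csv.DictWriter(fout, fieldnames=WANTED_COLUMNS, delimiter="|")
        writer.writeheader()
        for row in reader:                                 # pipe-delimited out
            writer.writerow({col: row.get(col, "") for col in WANTED_COLUMNS})

def main():
    files = glob.glob(SOURCE_GLOB)
    if not files:
        notify("Vendor feed: no file found", "Nothing matched " + SOURCE_GLOB)
        return
    try:
        src = files[0]
        dst = src + ".pipe"
        convert(src, dst)
        with FTP("ftp.vendor.example.com") as ftp:
            ftp.login("user", "password")
            with open(dst, "rb") as fh:
                ftp.storbinary("STOR " + dst.split("/")[-1], fh)
    except Exception as exc:
        notify("Vendor feed failed", str(exc))

if __name__ == "__main__":
    main()
```

Scheduled once a day (Task Scheduler, cron, or a SQL Server Agent job step), this covers the pick-up, reshape, FTP and alerting requirements in one place.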
Okay, I'll try to explain as well as I can... it's quite a particular case.
Tools: SSIS 2008
We have a control flow that now needs to be triggered by an event: the presence of one or multiple files (1, 2 or 3).
The variables used:
BO_FileLocation_1
BO_FileLocation_2
BO_FileLocation_3
BO_FileName_1
BO_FileName_2
BO_FileName_3
There can be one, two or three files, as defined in the variables above. When they are filled in, they should be processed. When they are empty (meaning there's just one file), the process should ignore them and jump to the next (file watcher?) task.
For example:
BO_FileLocation_1= "C:\"
BO_FileLocation_2 NULL
BO_FileLocation_3 NULL
BO_FileName_1= "test.csv"
BO_FileName_2 NULL
BO_FileName_3 NULL
The report only needs one file.
I need a generic approach that checks for the presence of these files; it may need to be more generic than my SSIS knowledge can handle right now. That would be handy, for example, when a 4th file is added in the future. I was also thinking of using a single script to handle all the logic.
Thanks in advance
A possibly irrelevant image:
If all you want is to trigger the Copy Source File task when one or more of the files is present, just use the OR constraint in your flow. The following images show you how:
First connect all to the destination:
Then click one of the green arrows. This will make its properties window pop up. Select Logical OR instead of Logical AND:
If everything went well, you should now see the connections as dashed lines:
There are several possible solutions:
1) Create a sequence container and include all the file imports in the sequence container. Add int variables for RowCountFile1, RowCountFile2, and RowCountFile3 and set the value to 0 (this is the default value when you create an int variable). Add a RowCount transformation to each of the data flows. Create a precedence constraint from the sequence container to the "Do something" task. Set the precedence constraint to success and expression, and set the expression to @RowCountFile1 > 0 || @RowCountFile2 > 0 || @RowCountFile3 > 0. The advantage of this approach is that you can take an action as soon as the files are detected, you import all available files, and you only take the action after all the files have been imported. You could then schedule this SSIS package as a SQL Server Agent job step and run it as frequently as you want.
2) A variant on solution 1 is to use Foreach File enumerator containers inside the sequence container. This would be useful if you don't know the exact name of the file and you expect to import more than one under some circumstances. For instance, if you get a file every few minutes with a timestamp in its file name and your process doesn't run for some reason, then you may have to process multiple files to catch up and then take an action once that has been done.
3) You could use the file watcher task as you outlined in your question. The only problem I have with the file watcher task is that the package has to be in a constantly running state, which makes it hard to troubleshoot problems and performance. It can also introduce other issues; I remember having some problems with the file watcher task years ago when it first came out. It may well be a totally stable task now, but I prefer other methods after having been burned previously. If you really want the package to run continuously instead of having it be called by a job, then you could always use a script task to check for the file, sleep the thread if it's not found, check again, and so on. I'm sure that's what the file watcher task does, but I would trust my own C# over the task. Power to anyone who has had better experiences than me with File Watcher...
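For illustration, here is that check/sleep/check loop sketched in Python rather than a C# script task, just to show the shape of it. The path, pattern and timeout are placeholders.

```python
# Poll for files matching a pattern, sleeping between checks, until found
# or a timeout is reached.
import glob
import time

PATTERN = "/landing/BO_File_*.csv"   # hypothetical watched location
POLL_SECONDS = 30
TIMEOUT_SECONDS = 3600

def wait_for_file():
    waited = 0
    while waited < TIMEOUT_SECONDS:
        matches = glob.glob(PATTERN)
        if matches:
            return matches            # hand the names to the next task
        time.sleep(POLL_SECONDS)      # nothing yet: sleep and check again
        waited += POLL_SECONDS
    return []                         # give up so the package can fail cleanly

print(wait_for_file())
```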
4) Use PowerShell. If you just want to take an action when a file appears and you aren't importing the data, then a PowerShell script could do this just as well as an SSIS package. The drawbacks are that you have to learn some basic PowerShell, it may be hard to maintain in the future since PowerShell is probably not your bread-and-butter language, and you may have to rewrite the code as an SSIS package if you later want to import the data. You would probably call the PowerShell script from a SQL Server Agent job step, so scheduling can be handled pretty easily.
There are more options than what I listed, so let me know if you still want more suggestions.