How can a single Autosys job monitor multiple folders for files at a time?

I have a task in which there are 3 folders at one location, each of them receiving different incoming files.
I want to create an Autosys file watcher job that will monitor the 3 folders and, if it finds a file in any of them, move that file to the respective folder in another location.
Does anyone have an idea how a single job can monitor multiple folders?

I believe you will need to create a job per file you want to monitor. The File Watch job type will only accept a single file in its watch_file attribute.
If you're using Autosys r11, the File Trigger job type accepts wildcards in its watch_file attribute, which may help in your situation.

Can snowpipe pick up files from within sub-folders?

I'm looking to pick up all Parquet files from an S3 bucket that have been placed into partitioned sub-folders by date.
In the past I've used Snowpipe with a sort of 1-to-1 relationship, one sub-folder to one table, but I would be interested to know if it is possible to crawl over partitioned data into a single table.
Many thanks!
Short answer: Yes!
With COPY INTO you can load a particular file, a whole folder, or all sub-folders within a folder. All you need to do is adjust your path accordingly: put the path in the FROM clause and everything in the sub-folders beneath it will be loaded.
copy into mytable
from @my_stage/your_main_folder/;
Docs: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
Edit: there are variations possible. The stage itself can also point to the main folder already, in which case you do not need to extend the path in the COPY INTO statement.
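Purely as an illustration (not from the original answer), the same COPY over a sub-folder path can also be issued from Python with the Snowflake connector; the connection parameters below are placeholders:

import snowflake.connector  # pip install snowflake-connector-python

# placeholder connection parameters
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="my_schema")

# the trailing folder path in the FROM clause picks up every partitioned sub-folder beneath it;
# for Parquet you may also need a file format (and possibly MATCH_BY_COLUMN_NAME) on the stage or in the COPY
conn.cursor().execute("copy into mytable from @my_stage/your_main_folder/")
conn.close()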
Yes, you can use sub-folders as part of the Snowflake stage that you want to crawl in the pipe definition.
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html#step-3-create-a-pipe-with-auto-ingest-enabled
Make sure that the S3 stage includes the sub-folder path you want to crawl.

Pentaho Kettle - Get the file names dynamically

I hope this message finds everyone well!
I'm stuck on a situation in the Pentaho PDI tool and I'm looking for an answer (or at least a light at the end of the tunnel) to solve it!
Every month I have to import a bunch of XLS files from different clients. Every file has a different name (which is assigned randomly) and the files sit in a folder named after the client. However, I use the same process for all clients and situations.
Is there a way to pass the name of the directory as a variable and change this variable for every run? How can I read these files from different paths?
The answer you're looking for requires a flow with variables, as you stated. In a job you start with a KTR (transformation) that produces the client's name and the respective folder. In the same job you pass these results on, as variables, to another job if needed, or to a KTR, using the options "Copy previous results to parameters" and "Execute for every input row" (Advanced tab). In the Parameters tab you name the variables and the stream column they come from in the previous KTR (i.e. client name and directory).
If you have trouble creating this flow I can spare some more time and share a sample if you need.
EDIT:
Sample Here
You have an example of this in the samples directory shipped with your PDI distribution.
Your case is covered by samples/jobs/run_all.

Pentaho - Check if a csv file is already loaded before loading

I am loading CSV files from a folder using Pentaho, and once files are loaded, I am making an entry into a table with the filenames that are loaded.
Before loading a file I need to check whether it has already been loaded; for that I want to pick up the filename and compare it with the names in the table that holds the already-loaded files. Since I am new to Pentaho, I am struggling to design this approach.
Please suggest how I should go about this, or whether there is a totally different approach.
Your approach is valid. Keep some bookkeeping of the processed filenames in a database (you may also use a CSV file for that).
The difficulty with this approach is that the filename may not be in a field. So you have to write a master job that uses Add file name to results and hands over to a transformation that loads the CSV (press Ctrl-Space in the box and find your variable in the drop-down), checks the database with a Stream lookup, and keeps only the unmatched rows with a Filter rows step. After the load, you Update the bookkeeping table.
Another approach we used successfully in the past was to load the files from a directory and move each processed file into another directory (see the sketch at the end of this answer). This way it was easy to drop new files into the input directory and to retrieve processed files in case of problems.
This could be a start (screenshots in the original answer): the job, and the transformation.
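For that second approach (load, then move the processed file out of the way), the underlying pattern looks roughly like this outside of PDI; a Python sketch only, with placeholder paths and a stand-in load step:

import os
import shutil

INBOX = "/data/csv_inbox"            # placeholder: where new files are dropped
PROCESSED = "/data/csv_processed"    # placeholder: where handled files end up

def load_csv(path):
    # stand-in for the real load (in PDI this would be your CSV-loading transformation)
    print(f"loading {path}")

for name in sorted(os.listdir(INBOX)):
    if not name.lower().endswith(".csv"):
        continue
    path = os.path.join(INBOX, name)
    load_csv(path)
    shutil.move(path, os.path.join(PROCESSED, name))   # each file is processed exactly once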

Automatically add database entry after ftp upload

Sorry if this seems stupid, but I wonder if it's possible to add a database entry after an FTP upload.
To be clearer: thanks to WinSCP, I have several folders that automatically send everything I put in them to my server.
However, I would like to create a MySQL entry for each uploaded file, again automatically. Is it possible to do that? How?
To give the full details of what I need to do, you can read the following.
I have several folders with pictures, and each folder is uploaded automatically.
Each of those folders belongs to one user, and the goal is to give each user an account and allow them to see and download those files through a web interface. Since one account = one folder, that's fairly easy.
And I think a simple .htaccess can secure things so that a user can only see and download the files in their own directory, no?
However, if I want them to be able to see what's new (i.e. something they haven't downloaded or marked as read), I think I need a table to manage those files.
Something like id | file (string) | read (bool).
If you think this way of proceeding is bad, then I'm open to changing how things are done, but to be clear, uploading the files needs to work this way, not through any kind of form.
Thanks for reading, and sorry for my English.
Your problem contains three steps:
Folders/files are automatically uploaded to your server directory; as you say, this is already handled efficiently by WinSCP.
You need to update your database with all the files and folders present in your server directory.
You need to track whether or not each file has been read/downloaded by the user.
Since your first step is in place, nothing is needed there. For the second step, you should write a script and schedule it to run at a fixed interval using cron (on Linux/Unix; Windows has an equivalent scheduler). The script would be responsible for creating a list of the files present in the directory and simply inserting the information for any files that are not yet in your database.
EDIT:
This edit describes how your script file should work. As I explained, the cron job simply runs your script at a fixed interval (which can be every minute, every hour, every day, and so on). Let's say your database table has the following columns:
fileid (varchar(20))
filepath (varchar(20))
status (boolean)
Your script file should do the following things (a sketch follows after the note below):
Create a list of the file paths existing in your server directory.
Run a SELECT query to create a list of the file paths already present in the database table.
Compare list 1 with list 2 and find the entries that don't exist in list 2 (this gives you the list of file paths that need to be inserted into the table).
Insert the file paths you found above, setting their status to false (which means the file has not been read/downloaded yet).
NOTE: Please keep in mind that I am not prescribing how your database table should look. It can be what you have proposed, or it can differ depending on your needs or requirements.
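As an illustration only, here is a minimal Python sketch of such a script; the directory, the credentials, and the table name uploads (with filepath and status columns, fileid omitted for brevity) are all assumptions:

import os
import mysql.connector  # pip install mysql-connector-python

UPLOAD_DIR = "/var/ftp/uploads"   # placeholder: the directory WinSCP uploads into

# placeholder credentials
conn = mysql.connector.connect(
    host="localhost", user="dbuser", password="dbpass", database="mydb")
cur = conn.cursor()

# list 1: file paths currently on disk
on_disk = set()
for root, _dirs, files in os.walk(UPLOAD_DIR):
    for name in files:
        on_disk.add(os.path.join(root, name))

# list 2: file paths already recorded in the table
cur.execute("SELECT filepath FROM uploads")
in_db = {row[0] for row in cur.fetchall()}

# insert whatever is on disk but not yet in the database, with status = unread
for path in sorted(on_disk - in_db):
    cur.execute("INSERT INTO uploads (filepath, status) VALUES (%s, %s)", (path, False))

conn.commit()
cur.close()
conn.close()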
For the third step, simply set the status of each file to unread when creating the entries in the second step; then, when the user clicks the file link in your application to view or download it, send a POST request to your server that updates the file's status to read.
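Again purely as an illustration (the original answer doesn't specify a framework), such an update endpoint could look like this in Python with Flask; the route, the table and the credentials are assumptions:

from flask import Flask, request
import mysql.connector  # pip install mysql-connector-python

app = Flask(__name__)

@app.route("/mark-read", methods=["POST"])
def mark_read():
    filepath = request.form["filepath"]   # sent by the view/download link
    # placeholder credentials; same assumed uploads table as above
    conn = mysql.connector.connect(
        host="localhost", user="dbuser", password="dbpass", database="mydb")
    cur = conn.cursor()
    cur.execute("UPDATE uploads SET status = TRUE WHERE filepath = %s", (filepath,))
    conn.commit()
    cur.close()
    conn.close()
    return "", 204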
Let me know if this helps!

Executing Abaqus Model in Taverna

I'm pretty new to both Taverna and Abaqus, but I am trying to run an Abaqus model using a "Tool" in Taverna remotely on an HPC. This works fine if I already have my model file and inputs on the HPC, but I need a way of uploading the files dynamically in Taverna (I'm trying to generically wrap Abaqus models).
I've tried adding an input port that takes a file list, but I don't know how I can copy it to the "location" that I've set for the tool. Could a Beanshell service be the answer, or can I iterate through the file list and copy the files up before executing the Abaqus model?
Thanks
When you say that you created an input port that takes a file list, I guess you mean an input to the tool service.
Assuming the input port is called my_file_list, when the tool service is run, it will take a list of data values on port my_file_list. As an example, say "hello", "hi" and "hola" are the three values in the list.
On the location where the tool service is run, it executes in a temporary directory - a different directory for each execution of the service. It is normally something like /tmp/usecase-2029778474741087696
Three files will be created in the temporary directory; those files contain the (in this example) three values the tool service received on port my_file_list. The files could be called
/tmp/usecase-2029778474741087696/tempfile.0.tmp containing hello
/tmp/usecase-2029778474741087696/tempfile.1.tmp containing hi
/tmp/usecase-2029778474741087696/tempfile.2.tmp containing hola
There will also be a file called my_file_list. That file will contain
/tmp/usecase-2029778474741087696/tempfile.0.tmp
/tmp/usecase-2029778474741087696/tempfile.1.tmp
/tmp/usecase-2029778474741087696/tempfile.2.tmp
The script of your tool service would normally read the contents of my_file_list line by line and do something with the contents of the listed file(s).
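As a rough illustration only (assuming Python is available on the execution host and that the Abaqus command expects files named input_0.inp, input_1.inp, and so on; both are assumptions), such a script could do:

import shutil

# my_file_list is written into the working directory by Taverna, one temp-file path per line
with open("my_file_list") as listing:
    paths = [line.strip() for line in listing if line.strip()]

# give each listed file a predictable name so the Abaqus command can find it
for i, path in enumerate(paths):
    shutil.copy(path, f"input_{i}.inp")   # hypothetical naming scheme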
I have also seen some scripts that 'cheat' and iterate directly over tempfile*.tmp, but that would be "a bad thing". The problem with that trick is that if you add a second list of files to the tool service, then the file my_file_list could contain
/tmp/usecase7932018053449784034/tempfile.4.tmp
/tmp/usecase7932018053449784034/tempfile.5.tmp
/tmp/usecase7932018053449784034/tempfile.6.tmp
as other temporary files were used for the other file list port.
I hope that helps
The tool service allows you to upload files, but if you are using the HPC through a job submission node, then you would have to modify your command-line tool to use the job file-staging command to push the files further as part of the job. The files will be available in the current (temporary) directory of the specified tool script.
I would try to do it through the Tool service and not involve a Beanshell; that keeps your workflow simpler.
A good thing to remember is that you can write multiple shell commands in the box.
Similarly, you will probably want to retrieve the results so that you can process them further in the workflow (unless they are massive, in which case you should just output their remote filenames and pass them in again to the next HPC job).
The exact commands to use for staging files and retrieving them depends on the HPC job submission system. Which one are you using?
Thanks for the input guys.
It was my misunderstanding of how Taverna uses the file list. All the files in the list are copied to the temporary "sandbox" and are therefore available for use.
Another nice, easy way is to zip the directory and pass the zipped file into an input port of the service, then just unzip it inside the command.
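A minimal sketch of that zip-and-unzip approach in Python (the directory path and archive name are placeholders; in practice the unzip step runs inside the tool's command box on the execution host):

import shutil

# before the workflow runs: pack the model directory into a single archive to pass into the input port
shutil.make_archive("model_inputs", "zip", "/path/to/model_dir")   # creates model_inputs.zip

# inside the tool command: unpack it next to the Abaqus job
shutil.unpack_archive("model_inputs.zip", ".")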
Thanks again