Pentaho - Check if a CSV file is already loaded before loading

I am loading CSV files from a folder using Pentaho, and once the files are loaded, I make an entry in a table with the names of the loaded files.
Before loading a file, I need to check whether it has already been loaded. To do that, I want to take the filename and compare it with the names in the table that holds the already loaded files. Since I am new to Pentaho, I am struggling to design this approach.
Please suggest how I should go about this, or whether there is a totally different approach.

Your approach is valid. Keep a record of the processed filenames in a database (you may also use a CSV file for that).
The difficulty with this approach is that the filename may not be in a field. So you have to write a master job that uses Add file name to results and hands over to a transformation that loads the CSV (press Ctrl-Space in the box and find your variable in the drop-down), checks the database with a Stream lookup, and applies Filter rows to the rows that are not matched. After the load, you 'Update' the bookkeeping table.
Another approach we used successfully in the past was to load the files from a directory and move each processed file into another directory. This way it was easy to drop new files into a directory, and to retrieve processed files in case of problems.
This could be a start:
The Job
The transformation
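Outside Pentaho, the bookkeeping logic itself can be illustrated with a minimal Python sketch. It is not a Pentaho job, only the same idea in script form. The table name `loaded_files`, the function `load_new_csv_files`, and the `load_file` callback are hypothetical names chosen for this example, and SQLite stands in for whatever database holds the bookkeeping table.

```python
import sqlite3
from pathlib import Path


def load_new_csv_files(conn, inbox, load_file):
    """Load every CSV in `inbox` that is not yet recorded in `loaded_files`.

    `loaded_files` and `load_file` are hypothetical names for this sketch.
    Returns the names of the files loaded during this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS loaded_files (filename TEXT PRIMARY KEY)"
    )
    # Names already recorded in the bookkeeping table.
    already_loaded = {
        row[0] for row in conn.execute("SELECT filename FROM loaded_files")
    }
    newly_loaded = []
    for path in sorted(Path(inbox).glob("*.csv")):
        if path.name in already_loaded:
            continue  # already processed in an earlier run
        load_file(path)  # the actual CSV load, supplied by the caller
        # Record the file only after the load succeeded.
        conn.execute(
            "INSERT INTO loaded_files (filename) VALUES (?)", (path.name,)
        )
        conn.commit()
        newly_loaded.append(path.name)
    return newly_loaded
```

Running the function twice on the same folder loads nothing the second time, because every filename is already in the table.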

Related

Automatically add database entry after ftp upload

Sorry if this seems stupid, but I wonder if it's possible to add a database entry after an FTP upload.
To be clearer: thanks to WinSCP, I have several folders that automatically send everything I put in them to my server.
However, I would like to create a MySQL entry for each uploaded file, again automatically. Is it possible to do that? How?
For the full details of what I need to do, read on.
I have several folders with pictures, and each folder is uploaded automatically.
Each of those folders belongs to one user, and the goal is to give each user an account and allow them to see and download those files through a web interface. Since one account = one folder, that's fairly easy.
And I think a simple .htaccess can secure things so that a user can only see and download the files in their own folder, no?
However, if I want them to be able to see what's new (= something they haven't downloaded or marked as read), I think I need a table to manage those files.
Something like id | file (string) | read (bool).
If you think this way of proceeding is bad, then I'm open to changing how I do things, but to be clear, uploading the files needs to work this way, not through any kind of form.
Thanks for reading, and sorry for my English.
Your problem consists of three steps:
Folders/files are automatically uploaded to your server directory. As you say, this is handled efficiently by WinSCP.
You need to update your database with all the files and folders present in your server directory.
You need to record whether or not each file has been read/downloaded by the user.
Since your first step is in place, nothing is needed there. For the second step, you should write a script and schedule it to run at a fixed time interval using cron (on Linux or Unix) or an equivalent scheduler on Windows. The script would be responsible for building a list of the files present in the directory and inserting the information for those files that are not yet present in your database.
EDIT:
This edit describes how your script file should work. As I explained, the cron job simply runs your script at a fixed interval (every minute, every hour, every day, and so on). Let's say your database table has the following columns:
fileid (varchar[20])
filepath (varchar[20])
status (boolean)
Your script file should do the following things:
Create a list of the existing filepaths in your server directory (list1).
Run a select query and create a list of the existing filepaths from the database table (list2).
Compare list1 with list2, and find the entries that don't exist in list2 (this gives you the list of filepaths that need to be inserted into the table).
Insert the list of filepaths you got above, and set their status to false (which means the file has not been read/downloaded yet).
NOTE: Please keep in mind that I am not advising here on what your database table should look like. It can be what you have proposed, or it can differ depending on your preferences or requirements.
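The four script steps above could be sketched roughly as follows in Python. This is only an illustration under assumptions: SQLite is used in place of MySQL so the example is self-contained, the table is called `files` (a hypothetical name), and `fileid` is an auto-assigned integer rather than the varchar suggested above.

```python
import os
import sqlite3


def sync_new_files(conn, root_dir):
    """Insert a row for every file under `root_dir` missing from the table.

    Returns the list of filepaths inserted by this run.
    """
    # Assumed schema for this sketch; status 0 means not read/downloaded yet.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "fileid INTEGER PRIMARY KEY, filepath TEXT UNIQUE, status INTEGER)"
    )
    # Step 1: filepaths that exist in the server directory (list1).
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            on_disk.add(os.path.join(dirpath, name))
    # Step 2: filepaths already recorded in the table (list2).
    in_db = {row[0] for row in conn.execute("SELECT filepath FROM files")}
    # Step 3: entries of list1 that are not in list2.
    missing = sorted(on_disk - in_db)
    # Step 4: insert them as unread.
    conn.executemany(
        "INSERT INTO files (filepath, status) VALUES (?, 0)",
        [(path,) for path in missing],
    )
    conn.commit()
    return missing
```

Scheduling this function from cron at a fixed interval means each run only inserts the files that arrived since the previous run.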
For the third step, keep the status of each file as unread when creating entries in your table in the second step. Then, when the user clicks the file link in your application to view or download it, send a POST request to your server that updates the file status to read.
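On the server side, the handler for that POST request would boil down to a single update. A minimal sketch, assuming a SQLite table named `files` with `filepath` and `status` columns (hypothetical names; the web framework and request handling are left out):

```python
import sqlite3


def mark_as_read(conn, filepath):
    """Set the status of `filepath` to read (1).

    Returns True if a matching row was updated, False otherwise.
    """
    cursor = conn.execute(
        "UPDATE files SET status = 1 WHERE filepath = ?", (filepath,)
    )
    conn.commit()
    return cursor.rowcount > 0
```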
Let me know if this helps!

SSIS - Why won't my Data Flow Task fail?

I've got a simple SSIS package that runs a 'foreach' loop, checking a folder for .csv files. It imports the contents of the CSV into a staging table where the columns map. On success, it moves the file to an archive folder, appending the date. On failure, it is supposed to put the file into a failure folder.
However, I've tested with a random CSV whose column headings don't match the mappings, and the data flow task DOESN'T fail and the file goes to the archive folder (of course the table isn't updated either). Any ideas as to why this is happening?
Here is the package:
Here is the data flow:
OK, I can do this.
Start with seven text files of input data, one of which contains error data.
The control flow executes like this.
The good files get moved to the ProcessedData folder.
The bad file gets moved to the ToReviewData folder.
The only setting you need to make is MaximumErrorCount on the Foreach Loop Container. Set this to a suitably high value.
I haven't changed any of the properties on the Load Cats task. In particular, you can see that FailPackageOnFailure is False; this is only required for checkpoints.
The precedence constraints are as you'd expect. Nothing clever here.
See training kit 70-463 > Chapter 4: Designing and Implementing Control Flow.

NSFileSystemFileNumber is changed after file is edited/updated in objective c

I am working on a file management system exactly like Dropbox in Cocoa.
My problem is that when I edit any text file, its NSFileSystemFileNumber changes.
I want a unique NSFileSystemFileNumber even if the edited file is moved out of its folder.
In short, I just want to know how to fetch the moved file's old or original path from the database.
Is there any alternative way to solve this problem?
Thanks in advance!
It depends on how the editor save functionality is implemented. Each editor will have different functionality and it sounds like the one you are using does the following:
Delete existing file.
Create new file.
Write file data.
Hence you get a new inode each time. Others might:
Truncate existing file.
Write file data.
which would result in the same inode each time.
There is nothing you can do about this, so you will need to track file changes using the name or something else, not the inode.
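The difference between the two save strategies can be observed with a small Python sketch on a POSIX system. Note one assumption: instead of a literal delete-then-create, `save_by_replacing` uses a closely related variant (write a new file, then rename it over the original), because a deleted inode number may be reused by the file system and would make the effect unreliable to demonstrate. The function names are hypothetical.

```python
import os
import tempfile


def save_in_place(path, data):
    """Truncate the existing file and rewrite it; the inode is kept."""
    with open(path, "w") as f:
        f.write(data)


def save_by_replacing(path, data):
    """Write a new file next to the original and rename it over the original.

    The path now refers to the new file, so its inode number changes.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, path)


def inode_of(path):
    """Return the inode number the path currently refers to."""
    return os.stat(path).st_ino
```

Comparing `inode_of(path)` before and after each kind of save shows which strategy an editor would need to use for the file number to stay stable.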

VB.NET Create downloadable resource

I've become stuck at this hurdle. I'm trying to create a database that clients fill in; however, the client can set different database paths to view different information in the program. I want to create template databases so that, should they wish to create a new database, it will work with the SQL queries the program uses.
I'm trying to save the templates into the program so that when a button is clicked, the template file is "downloaded" (copied) to the client's desktop.
Is this even possible?
Thanks
You can open the Resources page of the project properties and add any existing file, including a SQL Server MDF data file. At run time, you can get the data of the file from the appropriate property of My.Resources. The type of the data depends on the type of the file. I'd expect that an MDF file would come back as a Byte array, which you can then write to a file or whatever.
That said, you don't want to make your EXE too big by embedding several sizeable data files in it. You might be better off just using loose files in a subfolder or, if you're determined to use resources, create a satellite assembly, i.e. a DLL that contains just resources.

How do I run a data flow task successfully if certain files in the data flow don't exist

I have a data flow task that imports Excel files. I can't use a Foreach loop to go through the Excel files, as the metadata for each Excel file is completely different.
So in the data flow task I have 10 separate source files and use a union component to combine them, then import the result into SQL.
The problem I am facing now is that sometimes certain Excel files that I am importing might not exist, so when my package runs it fails because the file doesn't exist. Is there any way for me to create a check that allows the package to skip the source files that don't exist and process the rest?
I am using SSIS 2005.
Suggestion: if the file doesn't exist, then create it first.
Have an empty version of each source file somewhere, and in your control flow (before the data flow), check to see if the files exist, and if they don't, copy the blank files to the location of the real files.
This article explains how to implement a file-exists check in SSIS:
http://www.bidn.com/blogs/DevinKnight/ssis/76/does-file-exist-check-in-ssis
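The "copy a blank file if the real one is missing" logic is simple enough to show outside SSIS. The following Python sketch only illustrates the idea and is not SSIS code. It assumes that each blank template has the same file name as the source it stands in for, and `ensure_sources_exist` is a hypothetical name.

```python
import os
import shutil


def ensure_sources_exist(source_paths, template_dir):
    """Copy a blank template into place for every missing source file.

    Returns the list of source paths that had to be created.
    """
    created = []
    for path in source_paths:
        if not os.path.exists(path):
            # Assumption: the template shares the source's file name.
            template = os.path.join(template_dir, os.path.basename(path))
            shutil.copyfile(template, path)
            created.append(path)
    return created
```

Running this before the import means every expected source file exists, so the missing ones simply contribute no rows.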