How do I run a data flow task successfully if certain files in the data flow don't exist - sql

I have a data flow task that imports Excel files. I can't use a Foreach Loop to go through the Excel files because the metadata for each Excel file is completely different.
So in the data flow task I have 10 separate source files and use a Union All component to combine them, then import the result into SQL.
The problem I am facing now is that some of the Excel files I am importing might not exist, so when my package runs it fails because a file doesn't exist. Is there any way to create a check that allows the package to skip the source files that don't exist and run the rest of the source files?
I am using SSIS 2005.

Suggestion: if the file doesn't exist, then create it first.
Have an empty version of each source file somewhere, and in your control flow (before the data flow), check to see if the files exist, and if they don't, copy the blank files to the location of the real files.

This article explains how to perform a check if file exists mechanism in SSIS:
http://www.bidn.com/blogs/DevinKnight/ssis/76/does-file-exist-check-in-ssis
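As a rough illustration of the check-and-copy idea above, here is a minimal Script Task sketch. Note that SSIS 2005 Script Tasks are written in VB.NET, so this C# version (the language available in later SSIS releases) only shows the logic; the template folder path and the User::ExpectedFiles variable are hypothetical.

    // Add to the usings at the top of ScriptMain.
    using System.IO;

    public void Main()
    {
        // Semicolon-separated list of expected source files, held in a package
        // variable listed under the task's ReadOnlyVariables.
        string[] expectedFiles = Dts.Variables["User::ExpectedFiles"].Value
            .ToString().Split(';');
        string templateFolder = @"\\server\share\BlankTemplates";

        foreach (string file in expectedFiles)
        {
            if (!File.Exists(file))
            {
                // Copy the blank workbook with the same name into place so the
                // data flow has something (empty) to read.
                string template = Path.Combine(templateFolder, Path.GetFileName(file));
                File.Copy(template, file);
            }
        }

        Dts.TaskResult = (int)ScriptResults.Success;
    }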

Related

How can I safely import files to SQL Server in SSIS while new files are actively being written to the source directory?

I need to import many XML files into SQL Server every day. I was thinking of running a Foreach Loop Container every few minutes to import the files into the DB table and then move them to another directory, but sometimes over a dozen new files are written to the source folder every minute. Is it going to be an issue if the package tries to loop through the folder at the exact moment new files are being written to it? If so, how can I work around this?
You could loop over the files in a script task and attempt to move them to a separate "ReadyToProcess" folder in a try/catch. Catch the IOException if the file is in use by another process, and continue on to the next file. The skipped file will be picked up on the next run. Then loop over the files in "ReadyToProcess" to read them into the database.
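For illustration, a minimal Script Task sketch of that loop, assuming hypothetical folder paths:

    // Add to the usings at the top of ScriptMain.
    using System.IO;

    public void Main()
    {
        string sourceFolder = @"\\server\xmldrop";
        string readyFolder = @"\\server\xmldrop\ReadyToProcess";

        foreach (string file in Directory.GetFiles(sourceFolder, "*.xml"))
        {
            try
            {
                // The move succeeds only if no other process still has the file open.
                File.Move(file, Path.Combine(readyFolder, Path.GetFileName(file)));
            }
            catch (IOException)
            {
                // File is still being written; skip it and pick it up on the next run.
            }
        }

        Dts.TaskResult = (int)ScriptResults.Success;
    }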
It seems like you know which files are finished writing and which files are still being modified, which makes things a little easier. It is important to remember: if your SSIS task tries to open a file that is currently being modified or used by another process, the SSIS package will fail.
You can work around this by using a script task to generate a list of the files in your source folder at a point in time, and then use a For or Foreach Loop to fetch only the files that are in the generated list. This is in contrast to fetching everything that's in your source folder, as your post implies.
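A sketch of that snapshot idea, assuming you store the list in an Object variable (hypothetically User::FileList) and iterate it with the Foreach From Variable Enumerator rather than the file enumerator:

    // Add to the usings at the top of ScriptMain.
    using System.Collections;
    using System.IO;

    public void Main()
    {
        // Point-in-time snapshot of the folder contents.
        ArrayList snapshot = new ArrayList();
        foreach (string file in Directory.GetFiles(@"\\server\xmldrop", "*.xml"))
        {
            snapshot.Add(file);
        }

        // User::FileList must be an Object variable listed in ReadWriteVariables;
        // the Foreach Loop then enumerates it via the Foreach From Variable Enumerator.
        Dts.Variables["User::FileList"].Value = snapshot;
        Dts.TaskResult = (int)ScriptResults.Success;
    }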
Another solution would be to batch your incoming files and offset the package execution time, so there is no risk of a file being loaded into SQL while it is still being written to your source folder.
For instance, loading your source documents in batches every 30 minutes: 1:00, 1:30, 2...
and execute your SSIS task every 30 minutes, but offset from the batch by 15 minutes: 1:15, 1:45, 2:15...
Lastly, if possible, run your SSIS package during a period when no new files are written to your source folder. While not always possible, if you knew there wouldn't be any new documents coming in at 2 AM, that would be the best time to schedule your SSIS package.

Pentaho - Check if a csv file is already loaded before loading

I am loading CSV files from a folder using Pentaho, and once files are loaded, I am making an entry into a table with the filenames that are loaded.
I need to add a check before loading a file to see whether it has already been loaded. For that I want to pick up the filename and check it against the names in the table that holds the files already loaded. Since I am new to Pentaho, I am struggling to design this approach.
Please suggest how I should go about doing this, or whether there is a totally different approach.
Your approach is valid. Keep a bookkeeping record of the processed filenames in a database (you may also use a CSV file for that).
The difficulty with this approach is that the filename may not be available in a field. So you have to write a master job that adds the file name to the results and hands over to a transformation that loads the CSV (press Ctrl-Space in the filename box and find your variable in the drop-down), checks the database with a Stream lookup, and uses Filter rows to keep the rows that are not matched. After the load, you Update the bookkeeping table.
Another approach we used successfully in the past was to load the files from one directory and move each processed file into another directory. This way it was easy to drop new files into the input directory, and to retrieve processed files in case of problems.
This could be a start:
[Screenshot: the job]
[Screenshot: the transformation]

Is it possible to automate updating Tableau extract for Tableau Reader?

Situation now:
I have a data warehouse job that publishes a .txt file into the Data folder every morning. I open the Tableau workbook, which automatically updates the data visualisations because of a union I made. I save this workbook as an extract, and colleagues without Tableau Desktop can view it via Tableau Reader.
What I need:
This reporting format is heavily dependent on me and I need to automate this.
Is this even possible without Tableau Server?
Since Tableau Reader can only use packaged workbooks with extracted data, you may not be able to achieve this directly.
However, you may automate the packaging process using Tableau's command line parameters, so the process will no longer depend on anyone.
You may check the .PDF file at the link below. Using that help document, you may create a .BAT file and have Task Scheduler run it periodically on your computer. Users can then open the packaged file from the network location where you saved it. Alternatively (if all user computers have Tableau Desktop installed), you may put a file-opening line at the end of the .BAT file, so users can run the .BAT whenever they want to see the report.
https://community.tableau.com/docs/DOC-5209
Bernardo was correct in saying the Extract API can be used to programmatically create extracts, and thus "refresh" an extract by simply recreating it (the point about Tableau Server is only relevant if you want to publish the extract that you create with the Extract API).
Where you might have trouble is that there is currently no supported way to programmatically replace an extract within a .twbx file. That said, it should be possible to do this by simply renaming the .twbx to .zip (it is, after all, just an archive) and then using something like Python's zip module to manipulate the archive and replace the extract with your new extract.
NB: The Extract API can only be used to create .hyper files. If you want to work with .tde files, then you'll need to use the Tableau SDK instead.
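As a rough sketch of that archive manipulation, here is the same idea using .NET's System.IO.Compression instead of Python's zip module; the file names and the entry path inside the .twbx are placeholders you would need to adjust to match your own workbook.

    using System.IO.Compression;

    class ReplaceExtract
    {
        static void Main()
        {
            string workbook = @"C:\Reports\Dashboard.twbx";     // packaged workbook
            string newExtract = @"C:\Reports\refreshed.hyper";  // freshly built extract
            string entryPath = "Data/Extracts/extract.hyper";   // path inside the archive (varies per workbook)

            // A .twbx is just a zip archive, so open it in update mode.
            using (ZipArchive archive = ZipFile.Open(workbook, ZipArchiveMode.Update))
            {
                ZipArchiveEntry oldEntry = archive.GetEntry(entryPath);
                if (oldEntry != null)
                {
                    oldEntry.Delete();                               // remove the stale extract
                }
                archive.CreateEntryFromFile(newExtract, entryPath);  // add the new one under the same name
            }
        }
    }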

SSIS - Why won't my Data Flow Task fail?

I've got a simple SSIS package that runs a Foreach Loop, checking a folder for .csv files. It imports the contents of each CSV into a staging table where the columns map. On success, it moves the file to an archive folder, appending the date. Where it fails, it is supposed to put the file into a failure folder.
However, I've tested with a random CSV whose column headings don't match the mappings, and the data flow task DOESN'T fail and the file goes to the archive folder (of course the table isn't updated either). Any ideas as to why this is happening?
Here is the package:
Here is the data flow:
OK, I can do this.
Start with seven text files of input data, one of which contains error data.
The control flow executes like this.
The good files get moved to the ProcessedData folder.
The bad file gets moved to the ToReviewData folder.
The only setting you need to make is MaximumErrorCount on the Foreach Loop Container. Set this to a suitably high value.
I haven't changed any of the properties on the Load Cats task. In particular, you can see that FailPackageOnFailure is False; this is only required for checkpoints.
The precedence constraints are as you'd expect. Nothing clever here.
See training kit 70-463 > Chapter 4: Designing and Implementing Control Flow.

How to use the latest file in a folder for source

I have an SSIS package which pulls in a CSV file for processing. It pulls one file as the source, \\server\dash\LABORDERS.CSV, and is working fine.
We wanted to keep older files for historical purposes, so every day there will be new files instead of just overwriting the old one, and it looks like this:
I know I am supposed to add a script task, but I am not sure where to add it or how to invoke it so that the source always looks in the folder for the latest file and uses that file to transfer data to its SQL destination.
How can I achieve it?
What have you tried? You could create a script task at the start of your control flow that uses the .NET Framework file system classes to search a directory and get the file with the most recent timestamp. You could then assign that file name to an SSIS variable and use that variable in your file connection manager.
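A minimal sketch of that Script Task, assuming a hypothetical file name pattern and a hypothetical User::SourceFile variable:

    // Add to the usings at the top of ScriptMain.
    using System.IO;
    using System.Linq;

    public void Main()
    {
        // Pick the most recently written file matching the pattern.
        string latest = new DirectoryInfo(@"\\server\dash")
            .GetFiles("LABORDERS*.CSV")
            .OrderByDescending(f => f.LastWriteTime)
            .First()
            .FullName;

        // User::SourceFile must be listed in the task's ReadWriteVariables.
        Dts.Variables["User::SourceFile"].Value = latest;
        Dts.TaskResult = (int)ScriptResults.Success;
    }

The connection manager can then pick up the new path at run time via an expression on its ConnectionString property, for example @[User::SourceFile].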