Check for multiple files - sql

Okay, I'll try to explain as well as I can... It's quite a particular case.
Tools: SSIS 2008
We have a control flow that now needs to be triggered by an event: the presence of one or more files (1, 2 or 3).
The variables used:
BO_FileLocation_1
BO_FileLocation_2
BO_FileLocation_3
BO_FileName_1
BO_FileName_2
BO_FileName_3
There can be one, two or three files, defined in the variables above. When they are filled in, the files should be processed. When they are empty, it means there is just one file; the process should ignore the empty ones and jump to the next (file watcher?) task.
For example:
BO_FileLocation_1 = "C:\"
BO_FileLocation_2 = NULL
BO_FileLocation_3 = NULL
BO_FileName_1 = "test.csv"
BO_FileName_2 = NULL
BO_FileName_3 = NULL
The report only needs one file.
I need a generic approach that checks for the presence of these files; it may need to be more generic than my SSIS knowledge can handle right now. It would be handy, for example, if a 4th file is added in the future. I was also thinking of using a single script to handle all the logic.
Thanks in advance

If all you want is to trigger the Copy Source File task when one or more of the files is present, just use the Logical OR constraint in your flow. Here is how:
First connect all of them to the destination task.
Then click one of the green arrows. This will make its properties window pop up. Select Logical OR instead of Logical AND.
If everything went well, you should now see the connections as dashed lines.

There are several possible solutions:
1) Create a sequence container and include all the file imports in it. Add int variables RowCountFile1, RowCountFile2, and RowCountFile3 and leave their values at 0 (the default value when you create an int variable). Add a Row Count transformation to each of the data flows. Create a precedence constraint from the sequence container to the "Do something" task, set it to Success and Expression, and set the expression to @RowCountFile1 > 0 || @RowCountFile2 > 0 || @RowCountFile3 > 0. The advantage of this approach is that you can take an action as soon as the files are detected, you import all available files, and you only take the action after all the files have been imported. You could then schedule this SSIS package as a SQL Server Agent job step and run it as frequently as you want.
2) A variant on solution 1 is to use Foreach File enumerator containers inside the sequence container. This would be useful if you don't know the exact name of the file and you expect to import more than one under some circumstances. For instance, if you get a file every few minutes with a timestamp in its name and your process doesn't run for some reason, you may have to process multiple files to get caught up and then take an action once that has been done.
3) You could use the File Watcher Task as you outlined in your question. The only problem I have with the File Watcher Task is that the package has to be in a constantly running state, which makes it hard to troubleshoot problems and performance. It can also introduce other problems; I remember having some issues with the File Watcher Task years ago when it first came out. It may well be a totally stable task now, but I prefer other methods after having been burned previously. If you really want the package to run continuously instead of having it be called by a job, you could always use a Script Task to check for the file, sleep the thread if it is not found, check again, and so on (a sketch of such a polling loop follows this list). I'm sure that's roughly what the File Watcher Task does, but I would trust my own C# over the task. Power to anyone who has had better experiences than me with File Watcher...
4) Use PowerShell. If you just want to take an action when a file appears and you aren't importing the data, then a PowerShell script could do this just as well as an SSIS package. The drawbacks are that you have to learn some basic PowerShell, it may be hard to maintain in the future since PowerShell is probably not your bread-and-butter language, and you may have to rewrite the code as an SSIS package if you later want to import the data. You would probably call the PowerShell script from a SQL Server Agent job step, so scheduling can be handled pretty easily.
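For solution 3, a minimal sketch of the Script Task polling logic might look like the C# below. The variable names (User::BO_FileLocation_1, User::BO_FileName_1, User::FileFound) and the timeout/interval values are only assumptions for illustration, and the ScriptMain boilerplate that SSIS generates around Main() is omitted. User::FileFound would have to be listed in the task's ReadWriteVariables so a precedence constraint expression can branch on it afterwards.

using System;
using System.IO;
using System.Threading;

public void Main()
{
    // Read the folder and file name from the package variables
    // (variable names are assumptions; adjust to your own).
    string folder = Dts.Variables["User::BO_FileLocation_1"].Value.ToString();
    string fileName = Dts.Variables["User::BO_FileName_1"].Value.ToString();
    string fullPath = Path.Combine(folder, fileName);

    TimeSpan timeout = TimeSpan.FromMinutes(30);   // give up after 30 minutes
    TimeSpan interval = TimeSpan.FromSeconds(10);  // poll every 10 seconds
    DateTime started = DateTime.Now;

    bool found = false;
    while (DateTime.Now - started < timeout)
    {
        if (File.Exists(fullPath))
        {
            found = true;
            break;
        }
        Thread.Sleep(interval);
    }

    // Expose the result so the control flow can branch on it.
    Dts.Variables["User::FileFound"].Value = found;

    // ScriptResults is the enum from the SSIS Script Task template.
    Dts.TaskResult = (int)ScriptResults.Success;
}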
There are more options than what I listed, so let me know if you still want more suggestions.


Multiple users executing the same workflow

Are there guidelines regarding how to share a Snakemake workflow among multiple users on the same data under Linux, or is the whole thing considered bad practice?
Let me explain in case it's not clear:
Suppose user A executes a workflow in directory dir/. Assume the workflow terminates successfully, and he/she then properly sets file/directory permissions recursively on all output and intermediate files and the .snakemake/ subdirectory for other users to read/write, of course.
User B subsequently navigates to dir/, adds input files to the workflow, then executes it. Can anything go wrong?
TL;DR: I'm asking about non-concurrent execution of the same workflow by distinct users on the same system, and on the same data on disk. Is Snakemake designed for such use cases?
It's possible to run snakemake --nolock, which will prevent locking of the directory so multiple runs can be made from inside the same directory. However, without the lock there is now an opening for errors caused by concurrent runs trying to modify the same files. It's probably OK if you are certain that this will be avoided, e.g. if you are in constant communication with the other user about which files will be modified.
An alternative option is to create a third directory/path and put all the data there. This way you can work from separate directories/paths and avoid costly recomputes.
I would say that from the point of view of snakemake, and workflow management in general, it's ok for user B to add or update input files and re-run the pipeline. After all, one of the advantages of a workflow management system is to update results according to new input. The problem is that user A could find her results updated without being aware of it.
Off the top of my head, and without more detail, this is what I would suggest: make snakemake read the list of input files from a table (pandas comes in handy for this) or from a configuration file. Keep this sample sheet under version control (with git/github) together with the Snakefile and other source code.
When users update the working directory with new files, they will also need to update the sample sheet in order for snakemake to "see" the new input, and other users will learn about it via version control. I prefer this setup over dumping files in a directory and letting snakemake process whatever is in there.

Audit and error handling in SSIS

We are starting a project to handle big, big flat files. These files are kind of 'normalized', and we want to process them first into an intermediate file.
I would like to see a custom table for audit rows and a custom table for errors that are thrown during processing. Errors must also be stored in the Event Log.
What are the best practices for audit and error handling in SSIS (VS2008) in general?
(edit)
We have made (I think) a very elegant solution by designing one master package. This package runs a child package (the one originally intended). The master package subscribes to three events: OnInformation, OnWarning and OnError. These events are routed to a generic audit & logging service that makes calls to the Enterprise Library Logging & Exception Handling blocks.
What I would recommend is to adopt the following philosophy for stable ETL processes that load from files:
Never cast anything in the connector; just import the fields as nvarchars of the maximum length they will reach.
Procedurally add a row count (or row_id) so casting errors can be tracked back to the source row.
Cast and check each column against your specification.
If a row cannot be read at some stage you will not know its index, but you will know that the file is malformed (extremely rare in my experience, except for half-transferred files), and it should be rejected anyway.
A quick screenshot of part of a file loading process shows how the rejection (after assigning row_id) can work (link to dataflow image). To this you can add countless further checks (duplicates...) and even keep a repository of the loaded files to check against the rejects and whatever else you might want to control (link to control flow image).
In some of my processes I even use a flat file connector and just import each row as one bulk text column, then split it into columns with an intermediate Script Component, allowing for different versions of the columns in the files (see the sketch below).
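As a rough illustration of that splitting step (not Francisco's actual code), a Script Component transformation might look something like this; the column names (RawLine, Col1 to Col3, IsMalformed) and the semicolon delimiter are assumptions, and the buffer class boilerplate that SSIS generates is omitted:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Split the single wide text column into its parts.
    string[] parts = Row.RawLine.Split(';');

    if (parts.Length >= 3)
    {
        Row.Col1 = parts[0].Trim();
        Row.Col2 = parts[1].Trim();
        Row.Col3 = parts[2].Trim();
        Row.IsMalformed = false;
    }
    else
    {
        // Flag short or malformed rows so they can be routed to a reject table.
        Row.IsMalformed = true;
    }
}

A check on parts.Length is also the natural place to handle different file versions with more or fewer columns.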
Anyway, sorry not to be more detailed (due to my status I can't add more links or any images), but I hope that you understand the concept.
Regards,
Francisco.

How to implement a system wide text replacement in windows programmatically?

I have a small VB .Net application that, among other things, attempts to substitute text typed by the user system-wide (the hotstrings concept). To achieve that, I have deployed 'ahk2exe' and 'AutoHotkeySC.bin' with my application and do the following:
When a user assigns a new 'hotstring':
Kill the hotstring script exe if it is running
Append the new hotstring to the script file (if none exists, create a new one)
Convert edited/new script file to exe (using ahk2exe)
Run the newly converted script exe
(somewhere in there I also check whether the hotstring has already been assigned)
However, I am not totally satisfied with this method for the following two main reasons:
The extra resources deployed with the application.
Lag: killing the process and then restarting it takes a minimum of 5 seconds on my fast computer and longer on other computers. That is much more than the time it takes the user to assign the hotstring, minimize/close the window and then test his/her new hotstring. When the initial test fails, users will think the whole process failed, so this method is not very good for the user experience.
So, I am looking for a different method or implementation. Maybe using keyboard hooks? Or maybe adding a .dll library that achieves the same thing. Are there any resources you know about that might help (free or commercial)? What is the best way to achieve my goal?
Many thanks for your help.
Implementing what AutoHotkey does would be a pretty non-trivial task.
But I'm pretty sure that AHK supports an "auto-reload" option for scripts.
Googling "autohotkey auto reload" turned up several pages discussing that very concept. If that works, all you'd have to do is update the script file; AHK should automatically reload it.

SSIS Intermittent variable error: The system cannot find the file specified

Our SSIS packages are structured as one control package and many child packages (about 30) that are invoked from the control package. The child packages are invoked with an Execute Package Task, one per child package. Each Execute Package Task uses a File Connection Manager to specify the path to the child package's dtsx file, and there is one File Connection Manager per child package. Each File Connection Manager has an expression defined for its ConnectionString property. The expression looks like this:
@[Template::FolderPackages] + "MyPackage.dtsx"
The file name is different for each package. The variable (FolderPackages) is specified in the SSIS package configuration file.
The error that is generated during run time is
Error 0x80070002 while loading package file "MyPackage.dtsx". The system cannot find the file specified.
The package that fails is different from run to run, and sometimes no packages fail at all, even when run on exactly the same environment/data.
I ran FileMon during this error and found that when the error happens, SSIS tries to read the dtsx file from the wrong place, namely from system32. I checked that this is identical to what would happen if the @[Template::FolderPackages] variable were empty, but since the very same variable is used for every child package and sometimes works for some but not for others, I have no explanation for this.
Anything obvious, or time to raise a support call with Microsoft?
Are you using expressions on the SSIS variables directly? Variables with expressions are recalculated each time the variable is referenced by the consuming object that needs it. That is where the race condition bug exists: sometimes the expression doesn't get evaluated if another thread is already evaluating a different variable, and the default value of the variable is handed to the consuming object instead.
If that matches your design, these two bugs on the connect site discuss the problem, and the workarounds:
https://connect.microsoft.com/SQLServer/feedback/details/332372/ssis-variable-expressions-dont-always-evaluate
A second one is at connect.microsoft.com/SQLServer/feedback/details/406534/ssis-2008-variable-expressions-dont-always-evaluate
A summary of the workarounds:
- Note the parallel tasks that could run in your SSIS control flow and use these expression variables. If you have two tasks side by side and each relies on the same variable, and that variable has an expression to set its value, then you could hit this.
- Manually sequentialize such tasks so that they don't run in parallel, i.e. add a green arrow on the control flow so that the tasks occur in order Task1, Task2, Task3, rather than side by side on parallel paths and rather than inside the same container with no paths.
- Avoid variable expressions: assign the variables in the required order using a home-made Script Task that does the same kind of work, so that variables are not evaluated using expressions (i.e. the thing that can hit this race condition). In other words, manually assign the variable values at a point in your control flow just before they are used (a sketch follows this list). The point of using expressions on variables is to dynamically set a value based on another value whenever it is used, so this achieves a similar design goal, just in a manual way.
- Reduce threads to minimize the potential: set the Data Flow task's EngineThreads to 1 and the package's MaxConcurrentExecutables to 1. This helps sequentialize execution of your package to one task at a time, but it has the side effect of slower performance.
- Create and set values on distinct copies of variables at different scope levels in the design, so that they are evaluated in different parallel execution scopes and you avoid the expression evaluation on parallel threads: Master::Var1, Child1::Var1, Child2::Var1.
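A minimal sketch of the third workaround (assigning the value from a Script Task instead of an expression), placed just before the tasks that consume the variable: the derived variable name User::PackagePath is an assumption for illustration and would need to be in the task's ReadWriteVariables, and the ScriptMain boilerplate generated by SSIS is omitted.

public void Main()
{
    // Build the derived value explicitly, once, instead of letting an
    // expression evaluate it on demand from parallel threads.
    string folder = Dts.Variables["Template::FolderPackages"].Value.ToString();
    Dts.Variables["User::PackagePath"].Value = folder + "MyPackage.dtsx";

    // ScriptResults is the enum from the SSIS Script Task template.
    Dts.TaskResult = (int)ScriptResults.Success;
}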
A bit of a stab in the dark but...
I've had a similar issue with variables where ReadOnly = False and multiple components were reading the variable at the same time, causing locking issues.
I consistently recreated the problem by running a pair of data flows that did nothing but reference the variable inside a For Loop container; changing the variable to read-only resolved the problem.
If you temporarily hardcode the package name does this resolve the issue?
Turns out after sending trace info to Microsoft that we are encountering heap corruption. I'll update this question if we get to the bottom of it.
The current suggestion is to disable heap lookaside for dtexec.exe.
The official answer to this issue is that it is a bug in SQL 2005 and 2008. Many tasks accessing the same variable cause a race condition, and some tasks get the default value for the expression instead of the evaluated value.
The workaround is to ensure that the default value (the value defined in the property sheet for whatever property you are having trouble with) is the value that will work in your production environment.
This way, when the race condition happens in prod, SSIS will fall back to the package value, which will still work.
In dev? Well you're just going to have to deal with that manually until we get a bug fix from Microsoft.
There is a KB article relating to this issue: http://support.microsoft.com/kb/2448991 which states when and where this was fixed.

How can I speed up batch processing job in Coldfusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC, where it seems not to have the timeout issue. However, at some point during the processing it causes ColdFusion to crash and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each of the 8,000+ lines of the file, as well as other logic I provide in CFML.
My question is: what would be the best way to go through this file? One caveat: I am not able to move the file to the database server and process it entirely with the DB. However, would it be more efficient to pass each line to a stored procedure that took care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (parent table). For every file you upload, keep a record of the file and its status (uploaded, processed, unprocessed).
2) Maintain a temp table (child table) to store all the rows of the data file. Import the entire data file into this temporary table; attempting to do it all in memory will inevitably lead to errors. Each row in this table links back to a file upload table entry from step 1.
3) Maintain a processing status. For each row of the data file you bring in, set a processed/unprocessed flag. This way, if it breaks, you can start from where you left off. As you run through each line, mark it as processed.
4) Use transactions. Use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way, if something goes boom, you don't end up with one row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file's entry in the table from step 1 to processed.
By using the approach above, if something fails you can restart where it left off, or at least have a clearer path of where to start investigating, or worst case clean up your data. You will also have a clear way of displaying to the user the status of the current upload: where it is, and where it left off if there was an error.
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, and go 64-bit, but all of those will only increase the capacity of your system so much. It's a good idea to do these per call and in conjunction with the above.
Java has some neat file processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
If you are playing with XML, do not use ColdFusion's XML parsing. It works well for smaller files but has fits when things get bigger. There are several CFCs out there (check RIAForge, etc.) that wrap some excellent Java libraries for parsing XML data. You can then create a cfquery manually with this data if need be.
It's hard to tell without more info, but from what you have said I'll throw out three ideas.
The first thing is that, with so many database operations, it's possible you are generating too much debugging output. Make sure the following settings under Debug Output Settings in the administrator are turned off:
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are using indexes, etc.
The third thing I would suspect is that the file hanging out in memory is probably suboptimal.
I would try looping through the file using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be, can you work up an interface between Cold Fusion and SSIS?
If you can, upgrade to CF8 and take advantage of cfloop file="", which would give you greater speed; the file would not be put in memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF: very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles on using Java file I/O to read and write files more speedily.
They really helped improve the performance of our CSV importing application.