Audit and error handling in SSIS

We are starting a project to handle very large flat files. These files are more or less 'normalized' and we want to process them first into an intermediate file.
I would like to have a custom table for audit rows and a custom table for errors that are thrown during processing. Errors must also be stored in the Event Log.
What are the best practices for audit and error handling in general for SSIS (VS2008)?
(edit)
We have built (I think) a very elegant solution by designing one master package. This package runs a child package (the one originally intended). The master package subscribes to three events: OnInformation, OnWarning and OnError. These events are routed to a generic audit & logging service that makes calls to the Enterprise Library Logging & Exception Handling blocks.

What I would recommend is adopting the following philosophy for stable ETL processes that load from files:
Never cast anything in the connector; just import the fields as nvarchars of the maximum length they will reach.
Procedurally add a row count so that casting errors can be tracked back to a row.
Cast and validate each column against your specification.
If a row cannot be read at some stage you will not know its index, but you will know that the file is malformed (extremely rare in my experience, usually only for half-transferred files), and it should be rejected anyway.
A quick screenshot of part of a file loading process shows how the rejection (after assigning row_id) can work (link to dataflow image). To this you can add countless further checks (duplicates...) and even keep a repository of the loaded files to check the rejects against, and whatever else you might want to control (link to control flow image).
In some of my processes I even use a flat file connector and just import each row as one bulk text column, then split it into columns with an intermediate script component, allowing for different versions of the columns in the files.
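As a minimal sketch (not production code), the core of such a script component could look like this in C#; the buffer and column names (Input0Buffer, RawLine, CustomerId, Amount, HasLegacyLayout) and the delimiter are only placeholders for whatever you define in the component's inputs and outputs:

// Runs once per input row inside an SSIS script component (transformation).
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // The whole record arrives as one bulk nvarchar column; split it here.
    string[] parts = Row.RawLine.Split(';');

    // Keep everything as strings at this stage; cast and validate in a later
    // step so a bad value rejects the row instead of failing the whole load.
    Row.CustomerId = parts.Length > 0 ? parts[0].Trim() : string.Empty;
    Row.Amount     = parts.Length > 1 ? parts[1].Trim() : string.Empty;

    // A flag like this lets the same component cope with older file layouts
    // that carry fewer columns.
    Row.HasLegacyLayout = parts.Length < 5;
}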
Anyway, sorry not to be more detailed (due to my status I can't add more links or any images), but I hope that you understand the concept.
Regards,
Francisco.

Related

How to transfer data from SQL Server to mongodb (using mongoose schema for validation)

Goal
We have a MEAN stack application that implements a strict mongoose schema. The MEAN stack app needs to be seeded with data that originates from a SQL Server database. The app should function as expected as long as the seeded data complies with the mongoose schema.
Problem
Currently, the data transfer job is being done through the mongo CLI which does not perform validation. Issues that have come up have been Date objects being saved as strings, missing keys that are required on our schema, entire documents missing, etc. The dev team has lost hours of development time debugging the app and discovering these data issues.
Solution we are looking for
How can we validate data so it:
Throws errors
Fails and halts the transfer
Or gives some other indication that the data is not clean
Disclaimer
I was not part of the data transfer process so I don't have more detail on the specifics of that process.
This is a general problem of what you might call "batch import", "extract-transform-load (ETL)", or "data store migration", disconnected from any particular tech. I'd approach it like this:
Export the data into some portable format (e.g. CSV or JSON)
Push the data into the new system through the same validation logic that will handle new data on an ongoing basis.
It's often necessary to modify that logic a bit. For example, maybe your API will autogenerate timestamps during normal operation, but for data import you want to set them explicitly from the old data source. A more complicated situation is when there are constraints that cross your models/entities and need to be suspended until all the data is present.
Typically, you write your import script or system to generate a summary of how many records were processed, which ones failed, and why. Then you fix the issues and rerun it on the remaining records. Repeat until you're happy.
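To make the summary idea concrete, here is a minimal, stack-agnostic sketch in C#; the file name, record layout and validation rules are placeholders for whatever your real schema and validation layer enforce:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ImportRunner
{
    static void Main()
    {
        var failures = new List<string>();
        int processed = 0;

        // Read the exported data (hypothetical "export.csv"), skipping the header row.
        foreach (var line in File.ReadLines("export.csv").Skip(1))
        {
            processed++;
            var fields = line.Split(',');

            string error = Validate(fields);
            if (error != null)
            {
                failures.Add($"record {processed}: {error}");
                continue;   // fail the record, not the whole run
            }

            // ...push the validated record into the new data store here...
        }

        // The summary is what lets you fix issues and rerun only the leftovers.
        Console.WriteLine($"processed: {processed}, failed: {failures.Count}");
        failures.ForEach(Console.WriteLine);
    }

    // Stand-in for the same validation logic the live system already applies.
    static string Validate(string[] fields)
    {
        if (fields.Length < 3) return "missing required columns";
        if (!DateTime.TryParse(fields[2], out _)) return "third column is not a valid date";
        return null;
    }
}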
P.S. It's a good idea to version control your import script.
Export to CSV and write a small script using Node; that will solve your problem. You can use the fast-csv npm package.

How to display a status depending on the data flow position

Consider for example this modified Simple TCP sample program:
How can I display the current state of the program like
Wait for Connection
Connected
Connection terminated
on the front panel, depending on where the "data flow" currently is.
The easiest way to do this is to place a string indicator on your front panel and write messages to a local variable of this indicator at each point where you want to see a status update.
You need to keep in mind how LabVIEW dataflow works: code will execute as soon as the data it depends on becomes available. Sometimes you can use existing structures to enforce this - for example, if you put a string constant inside your loop and wire it to a local variable terminal outside the loop, the write will only happen after the loop exits. Sometimes you may need to enforce that dataflow artificially, for example by placing your operation inside a sequence frame and connecting a wire to the border of the sequence: then what's inside the sequence will only happen after data arrives on that wire. (This is about the only thing you should use a sequence for!)
This method is not guaranteed to be deterministic, but it's usually good enough for giving a simple status indication to the user.
A better version of the above would be to send the status messages on a queue or notifier which you read, and update the status indicator, in a separate loop. The queue and notifier write functions have error terminals which can help you to enforce sequence. A notifier is like the local variable in that you will only see the most recent update; a queue keeps all the data you write to it in the right order so would be more suitable if you want to log all the updates to a scrolling list or log file. With this solution you could add more features: for example the read loop could add a timestamp in front of each message so you could see how recent it was.
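(LabVIEW is graphical, so text code can only illustrate the shape of this pattern, but in C# form the queue version looks roughly like the sketch below; the names are made up, and in LabVIEW the equivalent pieces are the Obtain Queue, Enqueue Element and Dequeue Element functions.)

// Conceptual C# sketch only: one loop produces status messages, a separate
// loop consumes them and updates the "indicator" (here, the console).
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class StatusQueueDemo
{
    static void Main()
    {
        var statusQueue = new BlockingCollection<string>();

        // Display loop: the analogue of a separate while loop reading the
        // queue and writing to the front-panel string indicator.
        var display = Task.Run(() =>
        {
            foreach (var msg in statusQueue.GetConsumingEnumerable())
                Console.WriteLine($"{DateTime.Now:HH:mm:ss}  {msg}");   // timestamp each update
        });

        // Main program flow: enqueue a message at each point where the status changes.
        statusQueue.Add("Wait for Connection");
        statusQueue.Add("Connected");
        statusQueue.Add("Connection terminated");

        statusQueue.CompleteAdding();   // no more updates; lets the display loop finish
        display.Wait();
    }
}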
A really good solution to this general problem is to use a design pattern based on a state machine. Now your program flow is clearly organised into different states and it's very easy to add in functionality like sending a different message from each state. There are good examples and project templates for these design patterns included with recent versions of LabVIEW.
You should be able to find more information on any of the terms mentioned above (local variables, sequence structures, queues, notifiers, state machines) in the LabVIEW help or on the NI website.

Check for multiple files

Okay, I'll try to explain as well as I can... It's quite a particular case.
Tools: SSIS 2008
We have a control flow that now needs to be triggered by an event: the presence of one or multiple files (1, 2 or 3).
The variables used:
BO_FileLocation_1
BO_FileLocation_2
BO_FileLocation_3
BO_FileName_1
BO_FileName_2
BO_FileName_3
There can be one, two or three files, defined in the variables above. When they are filled in, they should be processed. When they are empty, it means there is just one file; the process should ignore them and jump to the next (file watcher?) task.
For example:
BO_FileLocation_1 = "C:\"
BO_FileLocation_2 = NULL
BO_FileLocation_3 = NULL
BO_FileName_1 = "test.csv"
BO_FileName_2 = NULL
BO_FileName_3 = NULL
The report only needs one file.
I'd need a generic concept that checks for the presence of these files; it may have to be more generic than my SSIS knowledge can handle right now. It would be handy, for example, if a 4th file is added in the future. I was also thinking of using a single script to handle all the logic.
Thanks in advance
If all you want is to trigger the Copy Source File task when one or more of the files is present, just use the OR constraint in your flow. The following image shows you how:
First connect all of them to the destination:
Then click one of the green arrows. This will make its properties window pop up. Select Logical OR instead of Logical AND:
If everything went well, you should now see the connections as dashed lines:
There are several possible solutions:
Create a sequence container and include all the file imports in the sequence container. Add int variables for RowCountFile1, RowCountFile2, and RowCountFile3 and set their value to 0 (this is the default value when you create an int variable). Add a Row Count transformation to each of the data flows. Create a precedence constraint from the sequence container to the "Do something" task. Set the precedence constraint to evaluate both the constraint (Success) and an expression. Set the expression to @RowCountFile1 > 0 || @RowCountFile2 > 0 || @RowCountFile3 > 0. The advantage of this approach is that you can take an action as soon as the files are detected, you import all available files, and you only take the action after all the files have been imported. You could then schedule this SSIS package as a SQL Server Agent job step and run it as frequently as you want.
A variant on solution 1 is to use Foreach Loop containers with a file enumerator inside the sequence container. This is useful if you don't know the exact name of the file and you expect to import more than one under some circumstances. For instance, if you get a file every few minutes with a timestamp in its file name and your process doesn't run for some reason, then you may have to process multiple files to get caught up and then take an action once that has been done.
You could use the File Watcher task as you outlined in your question. The only problem I have with the File Watcher task is that the package has to be in a constantly running state. This makes it hard to troubleshoot problems and performance. It can also introduce other problems, since I remember having some issues with the File Watcher task years ago when it first came out. It may well be a totally stable task now, but I prefer other methods after having been burned previously. If you really want the package to run continuously instead of having it be called by a job, then you could always use a Script Task to check for the file, sleep the thread if it is not found, check again, and so on (a rough sketch of such a polling loop appears at the end of this answer). I'm sure that's what the File Watcher task does, but I would trust my own C# over the task. Power to anyone who has had better experiences with the File Watcher than I have...
Use PowerShell. If you just want to take an action when a file appears and you aren't importing the data, then a PowerShell script could do this just as well as an SSIS package. The drawbacks are that you have to learn some basic PowerShell, it may be hard to maintain in the future since PowerShell is probably not your bread-and-butter language, and you may have to rewrite the code as an SSIS package if you later want to import the data. You would probably call the PowerShell script from a SQL Server Agent job step, so scheduling can be handled pretty easily.
There are more options than what I listed, so let me know if you still want more suggestions.
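As a rough sketch of the polling loop mentioned in option 3, this is what the body of an SSIS Script Task's Main method might look like; the variable names, the FileFound variable, the path and the timeout are placeholders:

public void Main()
{
    // Build the expected path from the package variables (names are examples).
    string path = System.IO.Path.Combine(
        Dts.Variables["User::BO_FileLocation_1"].Value.ToString(),
        Dts.Variables["User::BO_FileName_1"].Value.ToString());

    System.DateTime giveUpAt = System.DateTime.Now.AddMinutes(60);   // don't poll forever
    bool found = false;

    while (System.DateTime.Now < giveUpAt)
    {
        if (System.IO.File.Exists(path)) { found = true; break; }
        System.Threading.Thread.Sleep(30000);   // wait 30 seconds and check again
    }

    // A downstream precedence constraint expression can test this variable
    // (hypothetical boolean variable User::FileFound).
    Dts.Variables["User::FileFound"].Value = found;
    Dts.TaskResult = (int)ScriptResults.Success;
}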

How can I speed up batch processing job in Coldfusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC, where it seems not to have the timeout issue. However, at some point during the processing it causes ColdFusion to crash and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each of the 8,000+ lines of the file, as well as other logic I provide in the form of CFML.
My question is: what would be the best way to go through this file? One caveat: I am not able to move the file to the database server and process it entirely with the DB. However, would it be more efficient to pass each line to a stored procedure that takes care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (parent table). For every file you upload, keep a record of the file and what status it is in (uploaded, processed, unprocessed).
2) Temp table (child table) to store all the rows of the data file. Import the entire data file into a temporary table; attempting to do it all in memory will inevitably lead to errors. Each row in this table links back to a file upload table entry above.
3) Maintain a processing status - for each row of the data file you bring in, set a "processed/unprocessed" flag. This way, if it breaks, you can start from where you left off. As you run through each line, set it to "processed".
4) Transaction - use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way, if something goes boom, you don't have one row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file name entry in the table from step 1 to "processed".
By using the approach above, if something fails, you can set it to start where it left off, or at least have a clearer path of where to start investigating, or, worst case, clean up your data. You will have a clear way of displaying to the user the status of the current upload processing, where it's at, and where it left off if there was an error.
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, and move to 64-bit, but all of those will only increase the capacity of your system so much. It's a good idea to set these per call and to do it in conjunction with the above.
Java has some neat file-processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
If you are working with XML, do not use ColdFusion's XML parsing. It works well for smaller files but has fits when things get bigger. There are several CFCs out there (check RIAForge, etc.) that wrap some excellent Java libraries for parsing XML data. You can then create a cfquery manually if need be with this data.
It's hard to tell without more info, but from what you have said I'll throw out three ideas.
The first thing is that, with so many database operations, it's possible you are generating too much debugging output. Make sure that the following settings are turned off under Debug Output Settings in the ColdFusion Administrator:
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are using indexes, etc.
The third thing I would suspect is that holding the whole file in memory is probably suboptimal.
I would try looping through the file using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be, can you work up an interface between Cold Fusion and SSIS?
If you can, upgrade to CF8 and take advantage of cfloop file="", which will give you greater speed and means the file is not held in memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF - very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles on using Java file I/O to read and write files more quickly, etc.
It really helped improve the performance of our CSV importing application.

What are the pitfalls of using a separate SSIS package for centralized error management?

We have a few loosely coupled SSIS packages that are in charge of batch integration. When they have an error (a validation issue with data, or an actual OnError error), they all do the same thing: they email a message to a distribution list. The content of the message varies, and sometimes other people need to be cc'd on the message. But it is basically the same process for everything.
I am thinking of creating a single ErrorHandler package that has a few parameters (error message, cc address, subject line, etc.) and just getting the parent packages to run an Execute Package task when they need to send an error message.
The way I see it, we then have one single SSIS package that allows us to manage what we do with the incoming errors. If we decide we want to write stuff into a log file, or call a web service, it only has to be changed in the one place.
Limited testing so far looks fine. Am I missing something obvious here? Why doesn't everybody do this? Is there a transactional or cascading issue that could be a problem?
Well, one pitfall we have encountered now that we've started down this road is that you have to keep your SSIS packages in the same project, otherwise they can't invoke each other.
This negates some of the good reasons we chose to use this pattern in the first place.
For instance, SSIS Parameters can be defined at the package level or the project level, but not in between, so now we have parameters that need to be shared between three packages, but which are visible to/shared across ALL packages because they must be kept at the project level.
You can't organise SSIS packages within a project by, for instance, grouping them into subfolders. The only way to visually organise them is to introduce some kind of naming scheme.
There are ways to overcome this, by invoking packages via stored procs or script tasks, but these come with their own shortfalls. (For instance, you can specify the Environment you want to use when invoking a package from a stored proc, but how does the calling package know which Environment it is being run in, in order to invoke the child package with the same Environment? By Environment I mean the logical grouping of parameters into environments such as Test, Staging, Prod that can be set up on the SQL Server.)
Microsoft is aware of this issue, but does not seem to want to do anything about it. https://connect.microsoft.com/SQLServer/feedback/details/779789/ssis-2012-execute-package-task-external-reference-does-not-work-with-ssisdb