Design a process to archive data (SQL Server 2005)

We're designing a process to archive a set of records based on different criteria like date, status, etc.
Simplified set of tables: Claim, ClaimDetails, StatusHistory (tracks changes in Claim status), Comments & Files
Environment: SQL Server 2005, ASP.Net MVC (.NET Framework v3.5 SP1)
Claim is the main entity and it has child rows in the sub-tables mentioned. Some hold details and others track changes. Eventually, based on the criteria above, a Claim becomes "ready to archive". In simple words, archived Claims will be identified in the database and treated differently in the web app.
Here's the simplest version (top-level view):
Create a script which marks a Claim "archived" in db.
Archived row and its child row(s) can either be kept in the same table (set a flag) or moved to a different table which will be a replica of the original one.
In the web app we'll let the user filter Claims based on the archive status.
There might be a need to "unarchive" a Claim in the future.
What I need to know:
We need to keep this as simple and easy as possible, and also flexible enough to adapt to future changes - please suggest an approach.
Is a regularly scheduled SQL script the best option for this scenario?
Should I consider using separate tables or just add an "Archived" flag to each table?
For performance considerations of the selected approach: what difference would it make if we plan to have about 10,000 Claims, and what if it's 1 million? In short, please describe both a light-load and a heavy-load approach.
We store uploaded files physically on our server - I believe this can stay as it is.
Note: A Claim can have any number of child records in all the mentioned tables, so the volume multiplies.
Is there a standard framework or pattern I should refer to in order to learn about archiving processes? Any sample reference article or tool would help me learn more.

You can find some good discussion on this topic here and this is another one.
Having archived data in a separate table gives more flexibility: for example, you can track the user who marked a claim as archived, or the date when a claim was archived, or see all the changes made to a claim after it was created.
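For illustration, a minimal sketch of the separate-table approach, run as a scheduled job; the table names, column list, and criteria below are assumptions, not your actual schema:

-- Sketch only: Claim/ClaimArchive and the archive criteria are assumed names.
-- Child tables (ClaimDetails, StatusHistory, ...) would be moved the same way,
-- children first if their foreign keys point at Claim.
BEGIN TRANSACTION;

INSERT INTO ClaimArchive (ClaimId, Status, CreatedDate, ArchivedBy, ArchivedDate)
SELECT ClaimId, Status, CreatedDate, SUSER_SNAME(), GETDATE()
FROM Claim
WHERE Status = 'Closed'
  AND CreatedDate < DATEADD(year, -2, GETDATE());  -- your "ready to archive" rule

DELETE FROM Claim
WHERE Status = 'Closed'
  AND CreatedDate < DATEADD(year, -2, GETDATE());

COMMIT TRANSACTION;

The flag alternative is a single UPDATE Claim SET Archived = 1 with the same WHERE clause; it is simpler and easier to unarchive, but every query in the app then has to filter on the flag, which matters more at 1 million rows than at 10,000.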

Related

PET technology Fluent NHibernate

For a web application (with some real private data) we want to use privacy enhancing technology to limit the risk when someone gains access to our database.
The application is built with different layers, and we use (as said in the title) Fluent NHibernate to connect to our database, and we've created our own wrapper class to create queries.
Security is a big issue for the kind of application we're building. I'll try to explain the setting by a simple example:
Our customers have clients in their application (each installation of the application uses its own database), for which some sensitive data is added; there is a client table and a person table, which are linked.
The base table, which links to the other tables (there will be hundreds of them soon) and probably contains sensitive data, is the client table.
At the moment, the client has a client_id and a table_id in the database; our customer only knows the client_id, and the system links the data by the table_id, which is unknown to the user.
What we want to ensure:
A possible hacker who has gained access to our database should not be able to see the link between the customer and the other tables just by opening the database. So actually there should be some kind of "hidden link" between the customer and the other tables. The personal data and all the other sensitive tables should not be obviously linked together.
Because of the data's sensitivity we're looking for a more robust solution than "statically hash the table_id and use it in the other tables", so that when one of the persons is linked to the corresponding client, the data of all the other clients is not compromised as well.
Ultimately, the customer table should not be linkable to the other tables at all just by working inside the database; the application code should be needed to link the tables.
To accomplish this we've been looking into different methods, but because of the multiple tables linked to this client, and further development (thus probably even more tables), we're looking for a centralised solution. That's why we concluded this should be handled in the database connector. Searching on the internet and here on Stack Overflow did not point us in the right direction; perhaps we couldn't find it because of the wrong search terms (PET and privacy enhancing technology combined with NHibernate did not give us any direction).
How can we accomplish our goals in this specific situation, or where should we search for help with this?
We have a similar requirement for our application, and we ended up using database schemas.
We have one database and each customer has a separate schema, where all the data for that customer is stored. It is possible to link from the schema to the rest of the database, but not to other schemas.
Security can be set for each schema separately so you can make the life of a hacker harder.
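As a rough illustration of the per-schema security (all names here are invented), each customer gets its own schema plus a database user that can only reach that schema:

-- Illustration only: one schema and one restricted user per customer.
CREATE SCHEMA Customer001;
GO
CREATE USER Customer001User WITHOUT LOGIN WITH DEFAULT_SCHEMA = Customer001;
GO
GRANT SELECT, INSERT, UPDATE, DELETE ON SCHEMA::Customer001 TO Customer001User;
DENY SELECT ON SCHEMA::dbo TO Customer001User;  -- keep shared tables out of reach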
That being said, I can also imagine a solution where you let NHibernate encrypt every piece of data it sends to the database and decrypt everything it gets back. The data will be stored safely, but it will be very difficult to query over it.
So there is probably no single answer to this question, and you have to decide what is better: not being able to query, or only making it more difficult for a hacker to get to the data.

SQL Server Auditing Alternatives with Application User Tracking

I'm looking for an auditing solution that does exactly what Change Data Capture (CDC) does, except I need it to also track the application user that made the change. I'm currently using SQL Server 2012 Enterprise and may be upgrading to 2014 later this year.
We already have an auditing solution in place that leverages Delete, Insert, and Update triggers, but some new requirements might force us to update every audit trigger and corresponding audit table. Given various problems we've run in to with that solution over the years, this seems like as good a time as any to reevaluate and potentially replace the solution.
To give you an idea of what I'm currently working with (and may be able to leverage), we use a stored procedure (ConnectionInitialize) to store a user id with a SPID in a table (ApplicationUser) and then we delete the row using another stored procedure (ConnectionReset) once we're done making our deletes, inserts, and updates.
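A bare-bones sketch of that pattern (the procedure and table names come from the question; the column names are my assumptions):

-- Column names are assumed; procedure/table names are from the question.
CREATE TABLE ApplicationUser (Spid INT PRIMARY KEY, UserId INT NOT NULL);
GO
CREATE PROCEDURE ConnectionInitialize @UserId INT
AS
    INSERT INTO ApplicationUser (Spid, UserId) VALUES (@@SPID, @UserId);
GO
CREATE PROCEDURE ConnectionReset
AS
    DELETE FROM ApplicationUser WHERE Spid = @@SPID;
GO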
Were we to use CDC, I looked into adding a trigger to something like the cdc.lsn_time_mapping table, but I couldn't find a way to map the LSN back to the SPID (and therefore the user id) that was being used. This also presented some other issues in that CDC is always a little bit behind.
I looked into SQL Server Audit a little bit, but that presented some challenges of its own. We're using Transparent Data Encryption (TDE) to appease some of our security requirements, but SQL Server Audit looks like it'd need a separate encryption strategy; that and I'm more interested in the columns than in the actual SQL statements. Even so, these aren't deal-breakers for me, so I'm still looking into it.
Given what I'm trying to accomplish, does anyone have any feedback or recommendations?
By itself, CDC doesn't meet the requirements. The reason is that CDC only captures changes to your data, not the context under which those changes were made. You can, however, get what you're looking for if you're willing to tag your data with some audit columns. The basic idea is that you append a column to your table (or to a different table, if you aren't able to modify the actual table for whatever reason) and populate it with the user who last modified the record (pretty simple to do via an insert/update trigger). Once that is actual data, you can consume it however you need to (CDC being one possible mechanism).
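For example, a minimal sketch of such a trigger against a hypothetical dbo.Widget table, reusing the ApplicationUser/SPID mapping from the question (all names are assumptions):

-- Hypothetical table/columns; stamps the row with the question's SPID-to-user
-- mapping. Assumes RECURSIVE_TRIGGERS is OFF (the default), so the UPDATE
-- below does not re-fire the trigger; rows with no SPID mapping are skipped.
CREATE TRIGGER trg_Widget_StampUser ON dbo.Widget
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE w
    SET    w.LastModifiedBy = au.UserId
    FROM   dbo.Widget AS w
    JOIN   inserted AS i ON i.WidgetId = w.WidgetId
    JOIN   ApplicationUser AS au ON au.Spid = @@SPID;
END;

CDC then captures LastModifiedBy like any other column, giving you the user alongside the change.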
Late answer but hopefully useful.
There is a third-party tool, ApexSQL Audit, capable of meeting your requirements. My previous company has been using it for years and has been satisfied with it.
There is a helpful comparison article you can read to find more details about audited data, auditing mechanisms, integrity protection, etc. for both CDC and the Audit tool in one place.

What is the best way to log all user request operations (inserts, updates, deletes) in SQL Server 2008?

I have a database with 50 tables and I want to log user requests, such as inserts, updates or deletes, on all the tables in the database. I could also create a trigger for each request type for this.
What is the best way to do this from a performance perspective or is there a better way to track this?
You can also create audit tables which are populated by triggers (and which allow much more flexibility than Change Data Capture). The critical component is to capture sets of data, not to work row-by-row. It does add some overhead, yes, but if you write the triggers correctly, it isn't that much. Be sure to capture who (including which application, if you have multiple applications hitting the database) and when, as well as the old and new values. Set up one audit table per table you want audited (too much locking if you use only one audit table). And at the time you set up your system, write the code to get data back from a bad transaction or set of transactions. That makes it easier to recover when something does go wrong and you need to revert.
We use two tables per table audited: one contains the info about the process that made the changes (name of the application, date, user, etc. and an audit id); the other contains the details about what was changed (old and new values, the ID of the record affected, and the column affected). This structure lets us use the same layout for each table being audited, allows the audited tables to change without having to change the audit tables, and makes it easy to script the audit tables for new tables.
It is also easy for us to see what records were changed at the same time or in the same process, or to find out which of the many applications touching our database was responsible for bad data, as well as who in particular was responsible. This helps us track down application bugs and find out why the data was changed the way it was in some cases. It also makes it easier to track down all the data affected by a broken process, rather than just the rows we knew about.
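A stripped-down sketch of that two-table layout (all names invented; you would create a pair like this per audited table):

-- Invented names: one header row per audited change, one detail row per
-- changed column. Triggers populate both, set-based, from inserted/deleted.
CREATE TABLE AuditHeader_Claim (
    AuditId   INT IDENTITY PRIMARY KEY,
    AppName   NVARCHAR(128) NOT NULL DEFAULT APP_NAME(),
    UserName  SYSNAME       NOT NULL DEFAULT SUSER_SNAME(),
    AuditDate DATETIME      NOT NULL DEFAULT GETDATE()
);

CREATE TABLE AuditDetail_Claim (
    AuditDetailId INT IDENTITY PRIMARY KEY,
    AuditId       INT     NOT NULL REFERENCES AuditHeader_Claim (AuditId),
    RecordId      INT     NOT NULL,   -- PK of the row that changed
    ColumnName    SYSNAME NOT NULL,
    OldValue      NVARCHAR(MAX) NULL,
    NewValue      NVARCHAR(MAX) NULL
);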
If you have Enterprise Edition, look into Change Data Capture. If you don't have Enterprise and aren't interested in capturing the historical values of the columns that change, look into Change Tracking.
See Comparing Change Data Capture and Change Tracking to understand the differences between the two.
Assuming all requests to insert, update and/or delete data go through some middle-tier data access layer, I would suggest you do your logging there. This is where we do all of ours. It is much simpler than trying to extract the actual insert/delete/update statements out of SQL Server.
If you want to do auditing of data, you can look into Change Data Capture (CDC). But this requires the Enterprise Edition.
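Where CDC is available, enabling it is two documented system procedures; the schema and table names below are just examples:

-- Requires Enterprise Edition; run per database, then per table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;  -- NULL = don't gate access behind a database role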

Is there an easy way to track all changes in a SQL 2005 Database

I've been tasked with hooking in our product with another third party product. One of the things I need to do is mimic some of the third-party's product functionality when adding new "projects" - which can touch several database tables. Is there any way to add some kind of global hook to a database that would record all changes made to data?
I'd like to add the hook, create a project using the third-party application, then check out what all tables were affected.
I know it's more than just new rows as well; I've come across a number of count fields that look to be incremented for new projects, and I worry that there might be other records that are modified on a new project insert, not just new rows being added.
Thanks for any help
~Prescott
I can think of the following ways you can track changes:
Run SQL Server Profiler, which will capture all queries that run on the server. You can filter these by database, schema, a set of tables, etc.
Use a third-party transaction log reader. This is a much less intrusive process. You have to ensure that the database is set to FULL recovery.
Make sure the log will not be reused:
the database is in full recovery mode (true full, with an initial backup)
the log backup maintenance tasks are suspended for the duration of the test
Then:
write down the current database LSN
run your 3rd party project create
check the newly added log information with select * from ::fn_dblog(oldcurrentLSN, NULL);
All write operations will appear in the log. From the physical operation (allocation unit ID) you can get to the logical operation (object id).
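Put together, the workflow might look like this; note that fn_dblog is undocumented, so treat it as exploratory only (the LSN value below is a placeholder):

-- Exploratory only: fn_dblog is undocumented and its output can change.
-- 1. Note the current end of the log before the test.
SELECT MAX([Current LSN]) FROM ::fn_dblog(NULL, NULL);

-- 2. Run the third-party "create project" action.

-- 3. List everything written since then (substitute the LSN from step 1).
SELECT [Current LSN], Operation, AllocUnitName
FROM ::fn_dblog(NULL, NULL)
WHERE [Current LSN] > '00000014:0000005e:0001';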
Now that being said, you should probably have a decent understanding of the 3rd party schema and data model if you plan to interact with it straight at the database level. If you are planning to update the 3rd party tool and you don't even know what tables to update, you'll more than likely end up corrupting its data.

Ideas for Combining Thousand Databases into One Database

We have a SQL server that has a database for each client, and we have hundreds of clients. So imagine the following: database001, database002, database003, ..., database999. We want to combine all of these databases into one database.
Our thought is to add a siteId column: 001, 002, 003, ..., 999.
We are exploring options to make this transition as smoothly as possible. And we would LOVE to hear any ideas you have. It's proving to be a VERY challenging problem.
I've heard of a technique that would create a view that would match and then filter.
Any ideas guys?
Create a client database id for each of the client databases. You will use this id to keep the data logically separated. This is the "site id" concept, but you can use a derived key (identity field) instead of manually creating these numbers. Create a table that has database name and id, with any other metadata you need.
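That metadata table could be as simple as this (names are assumptions, not a prescription):

-- Illustrative only; the identity value becomes the derived "site id".
CREATE TABLE ClientDatabase (
    ClientDatabaseId INT IDENTITY(1,1) PRIMARY KEY,
    DatabaseName     SYSNAME  NOT NULL UNIQUE,
    ReadyToMigrate   BIT      NOT NULL DEFAULT 0,  -- driven by the SSIS loop later
    MigratedDate     DATETIME NULL
);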
The next step would be to create an SSIS package that gets the ID for the database in question and adds it to the tables whose data has to be separated out logically. You can then run that same package over each database, with the lookup providing the ID for the database in question.
After you have a unique id on the data and have imported it, you will have to alter your apps to fit the new schema (actually before, or you are pretty much screwed).
If you want to do this in steps, you can create views or functions in the different "databases" so the old client can still hit the client's data, even though it has been moved. This step may not be necessary if you deploy with some downtime.
The method I propose is fairly flexible and can be applied to one client at a time, depending on your client application deployment methodology.
Why do you want to do that?
You can read about Multi-Tenant Data Architecture and also listen to SO #19 (around 40-50 min) about this design.
The "site-id" solution is what's done.
Another possibility that may not work out as well (but is still appealing) is multiple schemas within a single database. You can pull common tables into a "common" schema, and leave the customer-specific stuff in customer-specific schemas. In some database products, however, each schema is -- effectively -- a separate database. In other products (Oracle and DB2, for example) you can easily write queries that work across multiple schemas.
Also note that -- as an optimization -- you may not need to add a siteId column to EVERY table.
Sometimes you have a "contains" relationship. It's a master-detail FK, often defined with a cascade delete so that detail cannot exist without the parent. In this case, the children don't need siteId because they don't have an independent existence.
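For instance, in a hypothetical Order/OrderLine pair, only the parent needs the column:

-- Hypothetical pair: the child gets its scope from the parent, and the
-- cascade delete stops it from existing independently.
CREATE TABLE [Order] (
    OrderId INT PRIMARY KEY,
    SiteId  INT NOT NULL
);

CREATE TABLE OrderLine (
    OrderLineId INT PRIMARY KEY,
    OrderId     INT NOT NULL
        REFERENCES [Order] (OrderId) ON DELETE CASCADE
    -- no SiteId here; reach it through the parent Order
);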
Your first step will be to determine if these databases even have the same structure. Even if you think they do, you need to compare them to make sure they do. Chances are there will be some that are customized or missed an upgrade cycle or two.
Now, depending on the number of clients and the number of records per client, your tables may get huge. Are you sure this will not create a performance problem? At any rate, you may need to take a fresh look at indexing. You may need a much more powerful set of servers and may also need to partition by client anyway for performance.
Next, yes each table will need a site id of some sort. Further, depending on your design, you may have primary keys that are now no longer unique. You may need to redefine all primary keys to include the siteid. Always index this field when you add it.
Now all your queries, stored procs, views, and UDFs will need to be rewritten to ensure that the siteid is part of them. Pay particular attention to any dynamic SQL. Otherwise you could be showing client A's information to client B, and clients don't tend to like that.
We brought a client from a separate database into the main application one time (when they decided they no longer wanted to pay for a separate server). The developer missed just one place where the client_id had to be added. Unfortunately, that sent emails to every client concerning this client's proprietary information; to make matters worse, it was a nightly process that ran in the middle of the night, so it wasn't known about until the next day. (The developer was very lucky not to get fired.) The point is: be very, very careful when you do this, and test, test, test, and then test some more. Make sure to test all the automated behind-the-scenes processing as well as the UI.
What I was explaining in Florence towards the end of last year applies if you have to keep the database names and the logical layer of the database the same for the application. In that case you'd do the following:
Collapse all the data into consolidated tables into one master, consolidated database (hereafter referred to as the consolidated DB).
Those tables would have to have an identifier like SiteID.
Create the new databases with the existing names.
Create views with the old table names which query the tables in the consolidated DB, using the SiteID to filter (a row-level security approach; see the sketch after this list).
Set up the databases for cross-database ownership chaining so that the service accounts can't "accidentally" query the base tables in the consolidated DB. Access must happen through the views or through stored procedures and other constructs that will enforce row-level security. Now, if it's the same service account for all sites, you can avoid the cross DB ownership chaining and assign the rights on the objects in the consolidated DB.
Rewrite the stored procedures to either handle the change (since they now refer to views and don't know to hit the base tables and include SiteID) or use INSTEAD OF triggers on the views to intercept update requests and put the appropriate site-specific information into the base tables.
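A rough sketch of the view step above (all names invented): a view in the per-site shell database filters the consolidated table down to that site's rows.

-- Lives in the shell database that kept the old name; names are invented.
CREATE VIEW dbo.Orders
AS
SELECT OrderId, CustomerId, OrderDate   -- the old column list, minus SiteID
FROM   ConsolidatedDB.dbo.Orders
WHERE  SiteID = 42;                     -- this shell database belongs to site 42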
If the data is large you could look at using a partitioned view. This would simplify your access code, as all you'd have to maintain is the view; however, if the data is not large, just add a column to identify the customer.
Depending on what the data is and your security requirements the threat of cross contamination may be a show stopper.
Assuming you have considered this and deem it "safe enough", you may need/want to create VIEWS or impose some other access control to prevent customers from seeing each other's data.
IIRC a product called "Trusted Oracle" had the ability to partition data based on such a key (about the time Oracle 7 or 8 was out). The idea was that any given query would automagically have "and sourceKey = #userSecurityKey" (or some such) appended. The feature may have been rolled into later versions of the popular commercial product.
To expand on Gregory's answer, you can also make a parent SSIS package that calls the package doing the actual moving inside a Foreach Loop container.
The parent package queries a config table and puts the result in an object variable. The Foreach Loop then uses this recordset to pass variables to the child package, such as the database name and any other details the package might need.
Your table could list all of your client databases and have a flag to mark when you are ready to move them. This way you are not sitting around running the SSIS package on 32,767 databases. I'm hooked on the Foreach Loop in SSIS.