Refactoring a database while preserving existing data: best practices? - sql

I have been working on a very data-intensive application that has around 250 tables. Recently some design changes have been required. Some of them involve adding new tables and linking them up with existing tables (via foreign keys) in a 1-N manner for parent-child relationships (in the ORM).
Take this example: the current design allows one rental Vehicle per Contract; the new design requires multiple Vehicles in the same Contract, with multiple rates.
So the data in one table now needs to be split across two additional tables.
I have completed the changes to the schema, but I can't deploy those changes to the test environment until I find a way to convert the existing data and put it in the new design format.
My current process:
1. Add 3 new tables: nContract, nContractedAsset, nContractRate.
2. Copy the information from the Contract table into the 3 new tables, preserving the primary key values on nContract (sketched just below this list).
3. Copy foreign key references, indexes, and rights from Contract to nContract.
4. Drop the Contract table.
5. Rename nContract to Contract, and so on.
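For concreteness, step 2 amounts to something like the following plain SQL. The column names are placeholders, since the real schema isn't shown, and the IDENTITY_INSERT lines apply only if the primary key is an identity column:

SET IDENTITY_INSERT nContract ON;

-- Preserve the original primary key values so existing references keep working
INSERT INTO nContract (ContractId, CustomerId, StartDate, EndDate)
SELECT ContractId, CustomerId, StartDate, EndDate
FROM Contract;

SET IDENTITY_INSERT nContract OFF;

-- Each old single-vehicle contract becomes one row in each new child table
INSERT INTO nContractedAsset (ContractId, VehicleId)
SELECT ContractId, VehicleId FROM Contract;

INSERT INTO nContractRate (ContractId, Rate)
SELECT ContractId, Rate FROM Contract;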
The only issue I have is that I am not comfortable doing step 2 in SQL. I want to use the power of the ORM and .NET to do more intelligent and complex work in scenarios more complicated than this example.
Is there a way I can write the data migration for step 2 using ADO.NET or an ORM?
What are the best practices or processes for this? Am I doing something wrong?

I ended up using FluentMigrator https://github.com/schambers/fluentmigrator
It allowed me to do Entity Framework-style migrations (see: Ruby on Rails ActiveRecord migrations).
Most of the DDL can be written in .NET in a fluent format. It supports Up and Down migrations wrapped in transactions, and it even supports full SQL scripts for data migration.
The best thing about it is that all your migration scripts can be put in source control and even tested.
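As an illustration of that last point, the cut-over steps (4 and 5 from the question) can live in a migration as an ordinary SQL script. A rough sketch, assuming SQL Server (sp_rename is SQL Server specific, and any foreign keys pointing at the old table must already have been moved in step 3):

-- Swap the rebuilt tables into place
DROP TABLE Contract;
EXEC sp_rename 'nContract', 'Contract';
EXEC sp_rename 'nContractedAsset', 'ContractedAsset';
EXEC sp_rename 'nContractRate', 'ContractRate';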


What is the best structure to separate all tables per client

For instance, I have these entities:
Client : table
TransactionA : table
TransactionB : table
..
TransactionZ : table
TransactionA through TransactionZ each reference Client.
For the database structure, I've been thinking of creating a new TransactionA table for every new Client registered, under a schema named after the Client.Code, so it looks like clientA.tbl_TransactionA.
With this structure, though, my database would accumulate thousands of tables depending on how many clients register, which I think would make maintenance hard whenever the core schema changes.
I would like to ask for your opinion on the best approach to this matter, with its advantages and disadvantages.
PS:
I am using Entity Framework (code first), MSSQL
Thanks in advance.
Creating a table per client would not be a good idea on many levels. To pick one of the more obvious ones: with Entity Framework you would have to alter and recompile your code each time you wanted to add a client, and you'd probably have to use reflection to figure out which client's DbSet to reference when looking up a transaction.
It isn't clear what has driven you to this design consideration, but the more reasonable model would be a single Transactions table with a foreign key / navigation property to the Client table. I assume there's some good but unstated reason why that would not suffice, though.
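A minimal sketch of that single-table shape (table and column names here are illustrative, not taken from the question):

CREATE TABLE Client (
    ClientId INT IDENTITY(1,1) PRIMARY KEY,
    Code     NVARCHAR(20) NOT NULL UNIQUE
);

CREATE TABLE Transactions (
    TransactionId   INT IDENTITY(1,1) PRIMARY KEY,
    ClientId        INT NOT NULL REFERENCES Client (ClientId),
    TransactionType CHAR(1) NOT NULL,  -- replaces the per-letter table split
    Amount          DECIMAL(18, 2) NOT NULL
);

-- Queries will almost always filter by client, so index the foreign key
CREATE INDEX IX_Transactions_ClientId ON Transactions (ClientId);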

How to sync database schema with Entity Framework in model-first without migrations

I have made changes to my model-first EDMX file, and now I want to apply the changes to my database. I've just added a new table and some new fields, nothing destructive. I want to apply the "diff" to my database, but without all the hassle of database migrations. What I actually need is a non-destructive SQL file containing only the differences.
Currently, I am doing this manually: I generate the full database-creation SQL from the model, delete everything not relevant to the table I am creating, and run the result. My table is currently empty, so for now I can get away with doing this destructively. Moreover, if the changes reference other entities too (e.g. adding a new foreign key to one of the existing tables), the generated SQL is, obviously, destructive, so I have to write that part of the SQL myself.
Is there any tool or shorter workaround that would automate this whole process? I am looking for something that will compare the current database with the newly created EDMX and apply only the diff to the database, as a one-time process. The whole database migration system of Entity Framework is extreme overhead and unnecessary work for a process which will run only once and can be boiled down to a single SQL file. Is there such a tool or method? What is the best practice for this (other than EF migrations)?
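For what it's worth, the hand-written "diff" being described boils down to a short, purely additive script along these lines (entity and column names are made up for illustration):

-- Non-destructive diff: only CREATEs and ADDs, no DROPs
CREATE TABLE NewEntity (
    Id   INT IDENTITY(1,1) PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL
);

-- Nullable at first, so rows already in the table survive the change
ALTER TABLE ExistingEntity ADD NewEntityId INT NULL;

ALTER TABLE ExistingEntity
    ADD CONSTRAINT FK_ExistingEntity_NewEntity
    FOREIGN KEY (NewEntityId) REFERENCES NewEntity (Id);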

How to avoid manually writing/managing SQL

My team and I are rapidly developing a web app backed by an Oracle DB. We use the Flyway Maven plugin to manage our DB creation and population from SQL INSERT scripts. Typically we add 3-4 tables per sprint and/or modify the structure of existing tables.
We model the schema in an external tool that generates the schema including the constraints; we run this first, followed by the SQL INSERTs, to ensure the integrity of all the data.
We spend too much time managing the changes to the SQL to cover the new tables: by this I mean adding the extra column data to the existing INSERT statements, not to mention manually creating the new INSERT data, particularly when it references a foreign key.
Surely there is another way, maybe maintaining the raw data in Excel and passing it through a parser to the DB. Has anyone any ideas?
We have 10 tables so far and up to 1000 SQL statements; the DB is not live, so we tear it down on every build.
Thanks
Edit: The inserted data is static reference data that the platform depends on to function (menus, etc.).
The architecture is Tomcat, JSF, Spring, JPA, Oracle
Please store your raw data in tables in the database. Hey, why on earth would you want to use Excel for this? You have Oracle Database, the best tool for the job!
Load your unpolished data into regular staging tables in the database using SQL*Loader or external tables.
From there you have SQL, the most powerful RDBMS tool for manipulating your data.
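For example, an external table over a CSV file might look like this (the directory object, file name, and columns are assumptions for the sketch):

-- The file sits in an Oracle DIRECTORY object; Oracle then reads it like a table
CREATE TABLE raw_data (
    x NUMBER,
    y VARCHAR2(100),
    z VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY data_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ','
    )
    LOCATION ('raw_data.csv')
);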
NEVER do row-by-row ("slow by slow") inserts, i.e. 1000 individual SQL statements. Do CTAS instead.
Add/enable the constraints AFTER you have loaded all the data.
-- CTAS: build the real table straight from the staged data
create table t as select * from raw_data;
or, if the target table already exists:
insert into t (x,y,z) select x,y,z from raw_data;
Using this method, you can bypass conventional row-at-a-time processing and do direct-path inserts (a direct path load). This can even be done in parallel to get your data into the database superfast!
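For example (the APPEND and PARALLEL hints are standard Oracle; the table names and the degree of parallelism are placeholders):

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(t, 4) */ INTO t (x, y, z)
SELECT x, y, z FROM raw_data;

-- A direct-path insert must be committed before the table can be read again
COMMIT;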
Do all of your data manipulation in SQL or PL/SQL, not in the application.
Please invest time in learning the Oracle Database. It is full of features for you to use!
Don't just use it as a data dump (a place where you store your data). Create packages as interfaces to your application: your API to the database.
Don't throw thousands of statements compiled into your application around. It will get messy.
Build your business logic in PL/SQL inside the database; use your application for presentation.
Best of luck!
Alternatively, you also have the option of implementing a Java migration (Flyway supports these alongside SQL scripts). It could read whatever input data you have (Excel, CSV, ...) and do the proper inserts.

Design a process to archive data (SQL Server 2005)

We're designing a process to archive a set of records based on different criteria like date, status, etc...
Simplified set of tables: Claim, ClaimDetails, StatusHistory (tracks changes in Claim status), Comments & Files
Environment: SQL Server 2005, ASP.Net MVC (.NET Framework v3.5 SP1)
Claim is the main entity, and it has child rows in the sub-tables mentioned. Some of those hold details and others track changes. Eventually, based on some criteria, a Claim becomes "ready to archive" as explained above. In simple words, archived Claims will be identified in the database and treated differently in the web app.
Here's the simplest version (top-level view):
Create a script which marks a Claim "archived" in the DB.
An archived row and its child row(s) can either be kept in the same tables (set a flag) or moved to different tables that are replicas of the originals.
In the web app we'll let the user filter Claims based on archive status.
There might be a need to "unarchive" a Claim in the future.
What I need to know:
We need to keep this as simple and easy as possible, yet flexible enough to adapt to future changes. Please suggest an approach.
Is a scheduled SQL script/job the best option for this scenario?
Should I consider using separate tables, or just add an "Archived" flag to each table?
Regarding the performance of the selected approach: what difference would it make if we have about 10,000 Claims versus 1 million? In short, please describe both a light-load and a heavy-load approach.
We store uploaded files physically on our server. I believe this can stay as it is.
Note: a Claim can have any number of child records in all the mentioned tables, so the volume grows n-fold.
Is there a standard framework or pattern I should refer to in order to learn about archiving processes? Any sample reference article or tool would help me learn more.
Keeping archived data in separate tables gives you more flexibility: for example, you can track the user who marked a claim as archived, record the date it was archived, or see all the changes made to a claim after it was created.
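As a sketch of the separate-table approach on SQL Server 2005, the OUTPUT clause lets you move rows in a single statement. The tables, columns, and archive criteria below are placeholders:

BEGIN TRANSACTION;

-- Child rows (ClaimDetails, StatusHistory, ...) must be moved first if foreign
-- keys are enforced; each table gets the same DELETE ... OUTPUT pattern
DELETE FROM Claim
OUTPUT DELETED.ClaimId, DELETED.Status, DELETED.CreatedDate
INTO ClaimArchive (ClaimId, Status, CreatedDate)
WHERE Status = 'Closed'
  AND CreatedDate < DATEADD(YEAR, -1, GETDATE());

COMMIT TRANSACTION;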

Ideas for Combining a Thousand Databases into One Database

We have a SQL server that has a database for each client, and we have hundreds of clients. So imagine the following: database001, database002, database003, ..., database999. We want to combine all of these databases into one database.
Our thought is to add a siteId column: 001, 002, 003, ..., 999.
We are exploring options to make this transition as smoothly as possible. And we would LOVE to hear any ideas you have. It's proving to be a VERY challenging problem.
I've heard of a technique where you create views that match the old structure and then filter by client.
Any ideas guys?
Create a client database id for each of the client databases. You will use this id to keep the data logically separated. This is the "site id" concept, but you can use a derived key (an identity field) instead of manually creating these numbers. Create a table that holds the database name and id, along with any other metadata you need.
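Something like this, as a sketch (names are illustrative):

-- One row per source database; the identity value becomes the site id
CREATE TABLE ClientDatabase (
    SiteId       INT IDENTITY(1,1) PRIMARY KEY,
    DatabaseName SYSNAME NOT NULL UNIQUE,
    ImportedOn   DATETIME NULL
);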
The next step would be to create an SSIS package that looks up the ID for the database in question and adds it to the tables that need their data logically separated. You can then run that same package over each database, with the ID lookup changing per database.
After you have a unique id for the data and have imported it, you will have to alter your apps to fit the new schema (actually before, or you are pretty much screwed).
If you want to do this in steps, you can create views or functions in the different "databases" so the old client can still hit the client's data, even though it has been moved. This step may not be necessary if you deploy with some downtime.
The method I propose is fairly flexible and can be applied to one client at a time, depending on your client application deployment methodology.
Why do you want to do that?
You can read about Multi-Tenant Data Architecture, and also listen to the StackOverflow podcast #19 (around the 40-50 minute mark), which discusses this design.
The "site-id" solution is what's done.
Another possibility that may not work out as well (but is still appealing) is multiple schemas within a single database. You can pull common tables into a "common" schema and leave the customer-specific stuff in customer-specific schemas. In some database products, however, each schema is, effectively, a separate database. In other products (Oracle and DB2, for example) you can easily write queries that span multiple schemas.
Also note that -- as an optimization -- you may not need to add siteId column to EVERY table.
Sometimes you have a "contains" relationship: a master-detail FK, often defined with a cascade delete so that the detail cannot exist without the parent. In this case the children don't need a siteId, because they have no independent existence.
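For example (hypothetical tables):

CREATE TABLE Invoice (
    InvoiceId INT PRIMARY KEY,
    SiteId    INT NOT NULL  -- the tenant lives on the parent
);

CREATE TABLE InvoiceLine (
    InvoiceLineId INT PRIMARY KEY,
    InvoiceId     INT NOT NULL
        REFERENCES Invoice (InvoiceId) ON DELETE CASCADE
    -- no SiteId needed: a line is only reachable through its Invoice
);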
Your first step will be to determine if these databases even have the same structure. Even if you think they do, you need to compare them to make sure they do. Chances are there will be some that are customized or missed an upgrade cycle or two.
Now, depending on the number of clients and the number of records per client, your tables may get huge. Are you sure this will not create a performance problem? At any rate, you may need to take a fresh look at indexing. You may need a much more powerful set of servers, and may also need to partition by client anyway for performance.
Next, yes, each table will need a site id of some sort. Further, depending on your design, you may have primary keys that are no longer unique. You may need to redefine all primary keys to include the siteId. Always index this field when you add it.
Now all your queries, stored procs, views, and UDFs will need to be rewritten to ensure that the siteId is part of them. Pay particular attention to any dynamic SQL. Otherwise you could be showing client A's information to client B. Clients don't tend to like that.
We brought a client from a separate database into the main application one time (when they decided they no longer wanted to pay for a separate server). The developer missed just one place where client_id had to be added. Unfortunately, that sent emails concerning this client's proprietary information to every client, and to make matters worse it was a nightly process that ran in the middle of the night, so it wasn't noticed until the next day. (The developer was very lucky not to get fired.)
The point is: be very, very careful when you do this, and test, test, test, and then test some more. Make sure to test all the automated behind-the-scenes stuff as well as the UI.
What I was explaining in Florence towards the end of last year applies if you have to keep the database names and the logical layer of the database the same for the application. In that case you'd do the following:
Collapse all the data into consolidated tables in one master, consolidated database (hereafter referred to as the consolidated DB).
Those tables would have to have an identifier like SiteID.
Create the new databases with the existing names.
Create views with the old table names which use row-level security, i.e. the SiteID filter, to query the tables in the consolidated DB (a sketch follows this list).
Set up the databases for cross-database ownership chaining so that the service accounts can't "accidentally" query the base tables in the consolidated DB directly. Access must happen through the views, or through stored procedures and other constructs that enforce the row-level security. If it's the same service account for all sites, you can skip the cross-DB ownership chaining and assign the rights on the objects in the consolidated DB instead.
Rewrite the stored procedures to either handle the change (since they now refer to views and don't know to hit the base tables and include SiteID), or use INSTEAD OF triggers on the views to intercept update requests and put the appropriate site-specific information into the base tables.
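A sketch of step 4 for one hypothetical client database and table:

-- Inside the per-client database, the old table name becomes a view over the
-- consolidated table, pinned to that client's SiteID
CREATE VIEW dbo.Orders
AS
SELECT OrderId, OrderDate, Amount   -- expose only the original columns
FROM ConsolidatedDB.dbo.Orders
WHERE SiteID = 42;                  -- this client's id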
If the data is large, you could look at using a partitioned view. This would simplify your access code, as all you'd have to maintain is the view (a sketch appears at the end of this answer); however, if the data is not large, just add a column to identify the customer.
Depending on what the data is and your security requirements, the threat of cross-contamination may be a show stopper.
Assuming you have considered this and deem it "safe enough", you may need/want to create VIEWs or impose some other access control to prevent customers from seeing each other's data.
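A minimal sketch of the partitioned view mentioned above, in SQL Server flavor; the CHECK constraints on the partitioning column are what let the optimizer route a query to the right member table (all names here are invented):

CREATE TABLE TransactionsClient1 (
    TransactionId INT PRIMARY KEY,
    CustomerId    INT NOT NULL CHECK (CustomerId = 1),
    Amount        DECIMAL(18, 2) NOT NULL
);

CREATE TABLE TransactionsClient2 (
    TransactionId INT PRIMARY KEY,
    CustomerId    INT NOT NULL CHECK (CustomerId = 2),
    Amount        DECIMAL(18, 2) NOT NULL
);
GO

CREATE VIEW AllTransactions
AS
SELECT TransactionId, CustomerId, Amount FROM TransactionsClient1
UNION ALL
SELECT TransactionId, CustomerId, Amount FROM TransactionsClient2;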
IIRC a product called "Trusted Oracle" had the ability to partition data based on such a key (about the time Oracle 7 or 8 was out). The idea was that any given query would automagically have "and sourceKey = #userSecurityKey" (or some such) appended. The feature may have been rolled into later versions of the popular commercial product.
To expand on Gregory's answer, you can also make a parent SSIS package that calls the package doing the actual moving, inside a Foreach Loop container.
The parent package queries a config table and puts the result in an object variable. The Foreach Loop then uses this recordset to pass variables, such as the database name and any other details it might need, to the child package.
Your table could list all of your client databases and have a flag to mark when each one is ready to move. That way you are not sitting around running the SSIS package on 32,767 databases. I'm hooked on the Foreach Loop in SSIS.
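The control table can be as simple as this (illustrative):

-- The parent package reads this into its object variable
CREATE TABLE MigrationQueue (
    DatabaseName SYSNAME PRIMARY KEY,
    ReadyToMove  BIT NOT NULL DEFAULT 0,  -- flip when a client is cleared to move
    MovedOn      DATETIME NULL
);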