Copying relational data from database to database - SQL

Edit: Let me completely rephrase this, because I'm not sure there's an XML way like I was originally describing.
Yet another edit: This needs to be a repeatable process, and it has to be set up in a way that it can be called from C# code.
In database A, I have a set of tables, related by PKs and FKs. A parent table, with child and grandchild tables, let's say.
I want to copy a set of rows from database A to database B, which has identically named tables and fields. For each table, I want to insert into the same table in database B. But I can't be constrained to use the same primary keys. The copy routine must create new PKs for each row in database B, and must propagate those to the child rows. I'm keeping the same relations between the data, in other words, but not the same exact PKs and FKs.
How would you solve this? I'm open to suggestions. SSIS isn't completely ruled out, but it doesn't look to me like it'll do this exact thing. I'm also open to a solution in LINQ, or using typed DataSets, or using some XML thing, or just about anything that'll work in SQL Server 2005 and/or C# (.NET 3.5). The best solution wouldn't require SSIS, and wouldn't require writing a lot of code. But I'll concede that this "best" solution may not exist.
(I didn't make this task up myself, nor the constraints; this is how it was given to me.)

I think the SQL Server utility tablediff.exe might be what you are looking for.
See also this thread.

First, let me say that SSIS is your best bet. But, to answer the question you asked...
I don't believe you will be able to get away with creating new IDs all around; you could, but you would need to keep the original IDs to use for lookups.
The best you can get is one insert statement per table. Here is an example of the code to do SELECTs that get you the data from your XML sample:
declare @xml xml
set @xml = '<People Key="1" FirstName="Bob" LastName="Smith">
  <PeopleAddresses PeopleKey="1" AddressesKey="1">
    <Addresses Key="1" Street="123 Main" City="St Louis" State="MO" ZIP="12345" />
  </PeopleAddresses>
</People>
<People Key="2" FirstName="Harry" LastName="Jones">
  <PeopleAddresses PeopleKey="2" AddressesKey="2">
    <Addresses Key="2" Street="555 E 5th St" City="Chicago" State="IL" ZIP="23456" />
  </PeopleAddresses>
</People>
<People Key="3" FirstName="Sally" LastName="Smith">
  <PeopleAddresses PeopleKey="3" AddressesKey="1">
    <Addresses Key="1" Street="123 Main" City="St Louis" State="MO" ZIP="12345" />
  </PeopleAddresses>
</People>
<People Key="4" FirstName="Sara" LastName="Jones">
  <PeopleAddresses PeopleKey="4" AddressesKey="2">
    <Addresses Key="2" Street="555 E 5th St" City="Chicago" State="IL" ZIP="23456" />
  </PeopleAddresses>
</People>
'

select t.b.value('./@Key', 'int') PeopleKey,
       t.b.value('./@FirstName', 'nvarchar(50)') FirstName,
       t.b.value('./@LastName', 'nvarchar(50)') LastName
from @xml.nodes('//People') t(b)

select t.b.value('../../@Key', 'int') PeopleKey,
       t.b.value('./@Street', 'nvarchar(50)') Street,
       t.b.value('./@City', 'nvarchar(50)') City,
       t.b.value('./@State', 'char(2)') [State],
       t.b.value('./@ZIP', 'char(5)') Zip
from @xml.nodes('//Addresses') t(b)
What this does is take nodes from the XML and parse out the data. To get the relational ID from People, we use ../../ to go up the chain.
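Building on that, a minimal sketch of the single INSERT per table mentioned above; the destination dbo.People table and its columns are hypothetical, and it reuses the @xml variable declared above. In practice you would also carry the original Key into a staging or lookup column, as noted earlier.
insert into dbo.People (FirstName, LastName)
select t.b.value('./@FirstName', 'nvarchar(50)'),
       t.b.value('./@LastName', 'nvarchar(50)')
from @xml.nodes('//People') t(b)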

Dump the XML approach and use the import wizard / SSIS.

By far the easiest way is Red Gate's SQL Data Compare. You can set it up to do just what you described in a minute or two.

I love Red Gate's SQL Compare and Data Compare too but it won't meet his requirements for the changing primary keys as far as I can tell.
If cross-database queries/linked servers are an option, you could do this with a stored procedure that copies the parent/child records from DB A into temporary tables on DB B, then adds a column for the new primary key in the temp child table, which you would update after inserting the headers.
My question is: if the records don't have the same primary key, how do you tell whether it's a new record? Is there some other candidate key? If these are new tables, why can't they have the same primary key?

I have created the same thing with a set of stored procedures.
Database B will have its own primary keys, but stores Database A's primary keys for debugging purposes. It means I can have more than one Database A!
Data is copied via a linked server. Not too fast; SSIS is faster. But SSIS is not for beginners, and it is not easy to code something that works with changing source tables.
And it is easy to call a stored procedure from C#.

I'd script it in a stored procedure, using INSERTs to do the hard work. Your code will capture the new PKs as it inserts (presumably via SCOPE_IDENTITY()) - I assume that the PK for Table A is an identity field?
You could use temporary tables, cursors or you might prefer to use the CLR - it might lend itself to this kind of operation.
I'd be surprised to find a tool that could do this off the shelf with either a) pre-determined keys, or b) identity fields (clearly Tables B & C don't have them).
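For what it's worth, a rough per-row sketch of that idea; all database, table and column names below are hypothetical, and in practice this fragment would sit inside a loop or cursor over the source parent rows:
DECLARE @OldParentID int, @NewParentID int
SET @OldParentID = 1  -- the source key currently being processed

-- copy one parent row; the destination generates a new identity value
INSERT INTO DatabaseB.dbo.Parent (Name)
SELECT Name FROM DatabaseA.dbo.Parent WHERE ParentID = @OldParentID

SET @NewParentID = SCOPE_IDENTITY()  -- the identity value just generated in DatabaseB

-- copy its children, pointing them at the new parent key
INSERT INTO DatabaseB.dbo.Child (ParentID, Detail)
SELECT @NewParentID, Detail
FROM DatabaseA.dbo.Child
WHERE ParentID = @OldParentID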

Are you clearing the destination tables each time and then starting again? That will make a big difference to the solution you need to implement. If you are doing a complete re-import each time then you could do something like the following:
Create a temporary table or table variable to record the old and new primary keys for the parent table.
Insert the parent table data into the destination and use the OUTPUT clause to capture the new IDs and insert them, along with the old IDs, into the temp table.
NOTE: Using the OUTPUT clause is efficient and allows you to do the insert in bulk without cycling through each record to be inserted.
Insert the child table data. Join to the temp table to retrieve the new foreign key required.
The above process could be done using T-SQL Script, C# code or SSIS. My preference would be for SSIS.
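A hedged T-SQL sketch of the steps above; DatabaseA/DatabaseB and the Parent/Child tables and columns are hypothetical, and Parent.Name is assumed to be unique so old and new keys can be matched up:
DECLARE @Inserted TABLE (NewParentID int, Name nvarchar(100))
DECLARE @KeyMap TABLE (OldParentID int, NewParentID int)

-- bulk insert the parents and capture the new identity values with OUTPUT
-- (in SQL Server 2005 the OUTPUT clause of an INSERT can only reference the
-- inserted columns, so a unique natural key is used to join back to the source;
-- MERGE ... OUTPUT in SQL Server 2008 removes that restriction)
INSERT INTO DatabaseB.dbo.Parent (Name)
OUTPUT inserted.ParentID, inserted.Name INTO @Inserted (NewParentID, Name)
SELECT Name FROM DatabaseA.dbo.Parent

-- build the old-key-to-new-key map
INSERT INTO @KeyMap (OldParentID, NewParentID)
SELECT s.ParentID, i.NewParentID
FROM DatabaseA.dbo.Parent s
JOIN @Inserted i ON i.Name = s.Name

-- child rows pick up the new foreign key from the map
INSERT INTO DatabaseB.dbo.Child (ParentID, Detail)
SELECT m.NewParentID, c.Detail
FROM DatabaseA.dbo.Child c
JOIN @KeyMap m ON m.OldParentID = c.ParentID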

If you are adding each time then you may need to keep a permanent table to track the relationship between source database primary keys and destination database primary keys (at least for the parent table). If you needed to keep this kind of data out of the destination database, you could get SSIS to store/retrieve it from some kind of logging database or even a flat file.
You could probably avoid the above scenario if there is a combination of fields in the parent table that can be used to uniquely identify that record and therefore "find" the primary key for that record in the destination database.

I think most likely what I'm going to use is typed datasets. It won't be a generalized solution; we'll have to regenerate them if any of the tables change. But based on what I've been told, that's not a problem; the tables aren't expected to change much.
Datasets will make it reasonably easy to loop through the data hierarchically and refresh PKs from the database after insert.

When dealing with similar tasks I simply created a set of stored procedures to do the job.
As the task that you specified is pretty custom, you are not likely to find "ready to use" solution.
Just to give you some hints:
If the databases are on different servers, use linked servers so you can access both source and destination tables directly through T-SQL
In the stored procedure:
Identify the parent items that need to be copied - you said that the primary keys are different, so you need to use unique constraints instead (you should be able to define them if the tables are normalised); see the sketch below
Identify the child items that need to be copied based on the identified parents; to check whether some of them are already in the destination DB, use the unique constraints approach again
Identify the grandchild items (same logic as with parent-child)
Copy data over starting with the lowest level (grandchildren, children, parents)
There is no need for cursors etc.; simply store the intermediate results in a temporary table (or a table variable if working within one stored procedure)
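A minimal sketch of the "identify new parents via a unique constraint" hint, assuming a linked server named SRCSERVER and a hypothetical unique column Parent.SourceCode:
INSERT INTO dbo.Parent (SourceCode, Name)
SELECT s.SourceCode, s.Name
FROM SRCSERVER.DatabaseA.dbo.Parent AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Parent AS d
                  WHERE d.SourceCode = s.SourceCode)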
That approach worked for me pretty well.
You can of course add a parameter to the main stored procedure so you can either copy all new records or only the ones that you specify.
Let me know if that is of any help.

Related

Reading uncommitted data inside transaction (MS SQL Server 2008)

I have written a program in C# to read data from two tables, transform it and write it to four other tables.
Three of the destination tables have a not-null integer pointing to the primary key of the last destination table. I call this table V.
When my program has read a chunk of data into memory and transformed it, it uses SqlBulkCopy to write to table V. Upon completion it uses a SELECT statement to retrieve the primary keys.
These primary keys are then assigned to the three other destination tables in memory. Finally, the last three tables are written to the database using SqlBulkCopy in one transaction.
However, if the program writes successfully to table V but fails to write the other three tables, I have a bunch of dirty data.
Can I somehow map the data going into table V with their primary keys, inside a transaction surrounding all the code?
I would be sincerely happy for any suggestion to solve my problem.
Place a foreign key on the three tables' columns that refer back to the primary key on table V. That will make certain there is no dirty data. You may have to rework your logic and the order the tables are written in, but adding the foreign keys should clean up the process and make it work more smoothly after refactoring.
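For illustration, a hedged example of one such constraint (the referring table and column names are hypothetical):
ALTER TABLE dbo.DestinationA
ADD CONSTRAINT FK_DestinationA_V
FOREIGN KEY (V_ID) REFERENCES dbo.V (V_ID)
With the constraint in place, any rows that would point at a missing V row are rejected instead of silently becoming dirty data.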

How to create a Primary Key on quasi-unique data keys?

I have a nightly SSIS process that exports a TON of data from an AS400 database system. Due to bugs in the AS400 DB software, occasional duplicate keys are inserted into data tables. Every time a new duplicate is added to an AS400 table, it kills my nightly export process. This issue has moved from being a nuisance to a problem.
What I need is an option to insert only unique data. If there are duplicates, select the first encountered of the duplicate rows. Is there SQL syntax available that could help me do this? I know of the DISTINCT ROW clause, but that doesn't work in my case because for most of the offending records the entirety of the data is non-unique except for the fields which comprise the PK.
In my case, it is more important for my primary keys to remain unique in my SQL Server DB cache than to have a full snapshot of the data. Is there something I can do to force this constraint on the export in SSIS/SQL Server without crashing the process?
EDIT
Let me further clarify my request. What I need is to ensure that the data in my exported SQL Server tables maintains the same keys that are maintained in the AS400 data tables. In other words, creating a unique row-count identifier wouldn't work, nor would inserting all of the data without a primary key.
If a bug in the AS400 software allows for mistaken, duplicate PKs, I want to either ignore those rows or, preferably, just select one of the rows with the duplicate key but not both of them.
This filtering should probably happen in the SELECT statement in my SSIS project, which connects to the mainframe through an ODBC connection.
I suspect that there may not be a "simple" solution to my problem. I'm hoping, however, that I'm wrong.
Since you are using SSIS, you must be using OLE DB Source to fetch the data from AS400 and you will be using OLE DB Destination to insert data into SQL Server.
Let's assume that you don't have any transformations.
Add a Sort transformation after the OLE DB Source. In the Sort transformation there is a check box at the bottom to remove duplicate rows based on a given set of column values. Check all the fields, but don't select the primary key that comes from the AS400. This will eliminate the duplicate rows but will still insert the data that you need.
I hope that is what you are looking for.
In SQL Server 2005 and above:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY almost_unique_field ORDER BY id) AS rn
    FROM import_table
) q
WHERE rn = 1
There are several options.
If you use the IGNORE_DUP_KEY option (http://www.sqlservernation.com/home/creating-indexes-with-ignore_dup_key.html) on your primary key, SQL Server will issue a warning and only the duplicate records will fail.
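For example, a hedged sketch of such an index on a hypothetical staging table and key column (the same option can be set on a PRIMARY KEY constraint); rows whose key already exists are discarded with a warning while the rest of the statement succeeds:
CREATE UNIQUE INDEX IX_import_table_key
ON import_table (as400_key)
WITH (IGNORE_DUP_KEY = ON)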
You can also group/roll-up your data but this can get very expensive. What I mean by that is:
SELECT Id, MAX(value1), MAX(value2), MAX(value3) etc
Another option is to add an identity column (and cluster on this for an efficient join later) to your staging table and then create a mapping in a temp table. The mapping table would be:
CREATE TABLE #mapping
(
    RowID INT PRIMARY KEY CLUSTERED,
    PKID  INT
)

INSERT INTO #mapping (RowID, PKID)
SELECT MIN(RowID), PKID
FROM staging_table
GROUP BY PKID

INSERT INTO presentation_table
SELECT S.*
FROM staging_table S
INNER JOIN #mapping M
    ON S.RowID = M.RowID
If I understand you correctly, you have duplicated PKs that have different data in the other fields.
First, put the data from the other database into a staging table. I find it easier to research issues with imports (especially large ones) if I do this. Actually I use two staging tables (and for this case I strongly recommend it), one with the raw data and one with only the data I intend to import into my system.
Now you can use an Execute SQL task to grab one of the records for each key (see Quassnoi's answer for an idea of how to do that; you may need to adjust his query for your situation). Personally, I put an identity column in my staging table so I can identify which is the first or last occurrence of duplicated data. Then put the record you chose for each key into your second staging table. If you are using an exception table, copy the records you are not moving to it, and don't forget a reason code for the exception ("Duplicated key", for instance).
Now that you have only one record per key in a staging table, your next task is to decide what to do about the other data that is not unique. If there are two different business addresses for the same customer, which do you choose? This is a matter of business-rule definition, not strictly speaking SSIS or SQL code. You must define the business rules for how you choose the data when it needs to be merged between two records (what you are doing is the equivalent of a de-duping process). If you are lucky there is a date field or some other way to determine which is the newest or oldest data, and that is the data they want you to use. In that case, once you have selected just one record, you have finished the initial transform.
More likely, though, you may need different rules for each field to choose the correct value. In this case you write SSIS transforms in a data flow or Execute SQL tasks to pick the correct data and update the staging table.
Once you have the exact record you want to import, do the data flow to move it into the correct production tables.

Are relationship tables really needed?

Relationship tables mostly contain two columns: IDTABLE1, and IDTABLE2.
Only thing that seems to change between relationship tables is the names of those two columns, and table name.
Would it be better if we create one table Relationships and in this table we place 3 columns:
TABLE_NAME, IDTABLE1, IDTABLE2, and then use this table for all relationships?
Is this a good/acceptable solution in web/desktop application development? What would be downside of this?
Note:
Thank you all for feedback. I appreciate it.
But I think you are taking it a bit too far... every solution works up to a point.
For data storage a simple text file is good up to a certain point, then Excel is better, then MS Access, then SQL Server, then...
To be honest, I haven't seen any argument that states why this solution is bad for small projects (with a DB size of a few GB).
It would be a monster of a table; it would also be cumbersome. Performance-wise, such a table would not be a great idea. Also, foreign keys are impossible to add to such a table. I really can't see a lot of advantages to such a solution.
Bad idea.
How would you enforce the foreign keys if IDTABLE1 could contain ids from any table at all?
To achieve acceptable performance on joins, without a load of unnecessary IO to bring in completely unrelated rows, you would need a composite index with leading column TABLE_NAME that basically ends up partitioning the table into sections anyway.
Obviously, even with this pseudo-partitioning going on, you would still be wasting a lot of space in the table/indexes just repeating the table name for each row.
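For illustration, a sketch of the generic table and the kind of index described above (names hypothetical); note that there is nothing the IDTABLE1/IDTABLE2 columns can reference with a foreign key:
CREATE TABLE dbo.Relationships
(
    TABLE_NAME sysname NOT NULL,
    IDTABLE1   int NOT NULL,
    IDTABLE2   int NOT NULL
)

CREATE CLUSTERED INDEX IX_Relationships
ON dbo.Relationships (TABLE_NAME, IDTABLE1, IDTABLE2)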
Isn't it a big IF that you're only going to store the two ID fields? Say I have a StudentCourse (or better yet, Enrollment) table that has StudentID and CourseID - wouldn't EnrollmentDate go in this table as well, since not all students enroll on the first day of class? It seems like a bad idea to add that column to an already bloated table where most records will be null.
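By contrast, a conventional junction table (hypothetical names) keeps the foreign keys enforceable and has room for relationship attributes such as EnrollmentDate:
CREATE TABLE dbo.Enrollment
(
    StudentID      int NOT NULL,
    CourseID       int NOT NULL,
    EnrollmentDate datetime NOT NULL,
    CONSTRAINT PK_Enrollment PRIMARY KEY (StudentID, CourseID),
    CONSTRAINT FK_Enrollment_Student FOREIGN KEY (StudentID) REFERENCES dbo.Student (StudentID),
    CONSTRAINT FK_Enrollment_Course FOREIGN KEY (CourseID) REFERENCES dbo.Course (CourseID)
)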
The benefit of a single table could be a requirement that the application allow a user/admin to create these relationships at run time (similar to having a single lookup or reference list table), avoiding the need to create a new table for these user-created references. A need for dynamic querying may benefit as well. An application with such dynamic data-structure requirements might be better suited to a schemaless or NoSQL database.

SQL, How to change column in SQL table without breaking other dependencies?

I'm sure this is quite a common question, but I couldn't find a good answer so far.
Here is my question:
I've got a table named Contacts with a varchar column Title. Now, in the middle of development, I want to replace the field Title with TitleID, which is a foreign key to a ContactTitles table. At the moment the Contacts table has over 60 dependencies (other tables, views, functions).
How can I do that the safest and easiest way?
We use: MSSQL 2005, data has already been migrated, just want to change schema.
Edit:
Thanks to all for the quick replies.
As mentioned, the Contacts table has over 60 dependents, but when the following query was run, only 5 of them used the Title column. The migration script was already run, so no data changes are required.
/*gets all objects which use specified column */
SELECT Name
FROM syscomments sc
JOIN sysobjects so ON sc.id = so.id
WHERE TEXT LIKE '%Title%' AND TEXT LIKE '%TitleID%'
Then I went through those 5 views and updated them manually.
Use refactoring methods. Start off by creating a new field called TitleID, then copy all the titles into the ContactTitles table. Then, one by one, update each of the dependencies to use the TitleID field. Just make sure you've still got a working system after each step.
If the data is going to be changing, you'll have to be careful and make sure that any changes to the Title column also change the ContactTitles table. You'll only have to keep them in sync while you're doing the refactoring.
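A hedged sketch of those first steps; it assumes ContactTitles has an identity TitleID column and a Title column, and each statement is run as its own batch so the new column is visible to the later ones:
ALTER TABLE dbo.Contacts ADD TitleID int NULL
GO

-- seed the lookup table from the existing free-text titles
INSERT INTO dbo.ContactTitles (Title)
SELECT DISTINCT Title FROM dbo.Contacts WHERE Title IS NOT NULL
GO

-- backfill the new foreign key column
UPDATE c
SET TitleID = ct.TitleID
FROM dbo.Contacts c
JOIN dbo.ContactTitles ct ON ct.Title = c.Title
GO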
Edit: There's even a book about it! Refactoring Databases.
As others pointed out it depends on your RDBMS.
There are two approaches:
make a change to the table and fix all dependencies
make a view that you can use instead of direct access to the table (this can guard you against future changes in the underlying core table(s), but you might lose some update functionality, depending on your DBMS)
For Microsoft SQL Server Redgate have a (not free) product that can help with this refactoring http://www.red-gate.com/products/sql_refactor/index.htm
In the past I have managed to do this quite easily (if primitively) by simply getting a list of things to review
SELECT * FROM sys.objects
WHERE OBJECT_DEFINITION(OBJECT_ID) LIKE '%Contacts%'
(and possibly taking dependencies information into account and filtering by object type)
Script all the ones of interest in Management Studio, then simply go down the list, review them all, and change the CREATE to ALTER. It should be quite a simple and repetitive change, even for 60 possible dependencies. Additionally, if you refer to a non-existent column you should get an error message when you run the ALTER script.
If you use * in your queries or adhoc SQL in your applications obviously things may be a bit more difficult.
Use sp_depends 'Table Name' to check the dependencies of the table,
and then use sp_rename to rename the column, which is very useful.
sp_rename automatically renames the associated index whenever a PRIMARY KEY or UNIQUE constraint is renamed. If a renamed index is tied to a PRIMARY KEY constraint, the PRIMARY KEY constraint is also automatically renamed by sp_rename.
Then start updating the procedures and functions one by one; there is no other good option for a change like this that I know of (if you find one, tell me too).
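A brief usage example based on the question's Contacts table (the new column name is just illustrative):
EXEC sp_depends 'dbo.Contacts'                             -- list dependent objects
EXEC sp_rename 'dbo.Contacts.Title', 'TitleID', 'COLUMN'   -- rename the column
Note that sp_rename only changes the name; converting the column into an integer foreign key is still a separate ALTER TABLE step.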

Is there any way to fake an ID column in NHibernate?

Say I'm mapping a simple object to a table that contains duplicate records and I want to allow duplicates in my code. I don't need to update/insert/delete on this table, only display the records.
Is there a way that I can put a fake (generated) ID column in my mapping file to trick NHibernate into thinking the rows are unique? Creating a composite key won't work because there could be duplicates across all of the columns.
If this isn't possible, what is the best way to get around this issue?
Thanks!
Edit: Query seemed to be the way to go
The NHibernate mapping makes the assumption that you're going to want to save changes, hence the requirement for an ID of some kind.
If you're allowed to modify the table, you could add an identity column (SQL Server naming - your database may differ) to autogenerate unique Ids - existing code should be unaffected.
If you're allowed to add to the database, but not to the table, you could try defining a view that includes a RowNumber synthetic (calculated) column, and using that as the data source to load from. Depending on your database vendor (and the product's handling of views and indexes) this may face some performance issues.
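A hedged sketch of such a view for SQL Server (table and column names hypothetical):
CREATE VIEW dbo.PersonWithRowNumber
AS
SELECT ROW_NUMBER() OVER (ORDER BY LastName, FirstName) AS RowNumber,
       FirstName,
       LastName
FROM dbo.Person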
The other alternative, which I've not tried, would be to map your class to a SQL query instead of a table. IIRC, NHibernate supports having named SQL queries in the mapping file, and you can use those as the "data source" instead of a table or view.
If your data is read-only, one simple way we found was to wrap the query in a view, build the entity off the view, and add a NEWID() column; the result is something like
SELECT NEWID() AS ID, * FROM TABLE
ID then becomes your unique primary key. As stated above, this is only useful for read-only views, since the ID has no relevance after the query.