SQL: Inserting into a (dynamic) lookup table - sql

Most articles about lookup tables deal with its creation, initial population and use (for looking up: id-->value).
My question is about dynamic updating (inserting new values) of the lookup table, as new data is stored in data tables.
For example, we have a table of persons, and one attribute (column) of it is city of residency. Many persons would have the same value, so it makes sense to use a lookup table for it. As the list of cities that would appear is not known beforehand, the lookup table is initially empty.
To clarify, the value(s) of city is/are:
not know beforehand (we don't know what customer might contact us tomorrow)
there is no "list of all possible cities" (real life cities come and go, get renamed etc)
many persons will share the same value
initially, there will be a few different values (up to 10), later more (but not very much, a few hundred)
Also, the expected number of person objects will be thousands if not millions.
So the basic algorithm is (pseudocode):
procedure insertPerson(name,age,city)
{
cityId := lookup(city);
if cityId == null
cityId := insertIntoLookupTableAndReturnId(city);
INSERT INTO person_table VALUES (name,age,cityId);
}
What is a good lookup table organization for this problem? What exact code to use?
The goal is high performance of person insertion (whether the city is already in the lookup table or not).
General answers are welcome and Oracle 11g would be great.
Note: This is about an OLTP scenario. New persons are inserted in real time. There is no known list of persons that can be used for initialization of the lookup table.

Your basic approach appears to be OK except for one small change I would do: The function lookup(city) will search for the city and return the ID and, if the city is not found, will insert a new record and return its ID. This way, you are further encapsulating the management of the lookup table (cities). As such, your code would become:
procedure insertPerson(name,age,city)
{
INSERT INTO person_table VALUES (name,age,lookup(city));
}
One additional thing you may consider is to create a VIEW that would be used to query for persons' information, including the name of the city.

After some testing, the best performance (least block accesses) I could find was with an index organized table as the lookup table and the below SQL for inserting data.
create table citylookup (key number primary key, city varchar2(100)) organization index;
create unique index cltx1 on citylookup(city);
create sequence lookupkeys;
create sequence datakeys;
create table data (x number primary key, k number references citylookup(key) not null);
-- "Rome" is the city we try to insert
insert all
when oldkey is null then -- if the city is not in the lookup yet
into citylookup values (lookupkeys.nextval, 'Rome') -- then insert it
-- finally, insert the data row with the correct lookup key
when 1=1 then into data values (datakeys.nextval,nvl(oldkey, lookupkeys.nextval))
select (select key from citylookup where city='Rome') as oldkey from dual;
Result: 6+2 blocks for city-exists case, 10+2 for city-doesn't-exists yet (as reported by SQL*Plus with set autotrace on: first value is db block gets, the second consistent gets).
Alternatively, as suggested by Dudu Markovitz, the lookup table could cached in the application and in the hit case just perform an simple INSERT into the DATA table, which then costs only 6+1 block accesses (for the above test case). Here the problem is keeping the cached lookup table in sync with the database and possible other instances of the server application.
PS: The above INSERT ALL command "wastes" a sequence value from the lookupkeys sequence on each run, even if no new city is inserted into the lookup table. It is an additional exercise to solve that.

Related

Dynamically generate criteria in SQL

I have a Users table that contains dozens of columns like date of birth, year of vehicle owned, make and model of the vehicle, color and many other personal fields unrelated to the vehicle
There's also a 2nd table called Coupons that needs to be designed in a way to support a qualification like "user qualifies if younger than 30 yrs old", "user qualifies if vehicle is greater than 10 yrs old", "user qualifies if vehicle color is green".
When a user logs in, I need to present all coupons the user qualifies for. The problem that I'm having is that the coupon qualifications could be numerous, could have qualifiers like equal, greater than or less than and may have different combinations.
My only solution at this point is to store the actual sql string within one of the coupons table columns like
select * from Users where UserId = SOME_PLACEHOLDER and VehicleYear < 10
Then I could execute the sql for each coupon row and return true or false. Seems very inefficient as I would potentially have to execute 1000s of sql statements for each coupon code.
Any insight, help is appreciated. I do have server-side code where I could potentially be able to do looping.
Thank you.
Very difficult problem. Seems like users will be added at high volume speed, with coupons at a fairly regular frequency.
Adding SQL to a table to be used dynamically is workable - at least you'll get a fresh execution plan - BUT your plan cache may balloon up.
I have a feeling that running a single coupon for all users is probably likely to be your highest performing query because it's one single set of criteria which will be fairly selective on users first and total number of coupons is small, whereas running all coupons for a single user is separate criteria for each coupon for that user. Running all coupons for all users may still perform well, even though it's effectively a cross join first - I guess it is just going to depend.
Anyway, the case for all coupons for all users (or sliced either way, really) will be something like this:
SELECT user.id, coupon.id
FROM user
INNER JOIN coupon
ON (
CASE WHEN <coupon.criteria> THEN <coupon.id> -- code generated from the coupon rules table
CASE WHEN <coupon.criteria> THEN <coupon.id> -- etc.
ELSE NULL
) = coupon.id
To generate the coupon rules, you can relatively easily do the string concatenation in a single swipe (and you can combine an individual rule lines design for a coupon with AND with a further inner template):
DECLARE #outer_template AS varchar(max) = 'SELECT user.id, coupon.id
FROM user
INNER JOIN coupon
ON (
{template}
ELSE NULL
) = coupon.id
';
DECLARE #template AS varchar(max) = 'CASE WHEN {coupon.rule} THEN {coupon.id}{crlf}';
DECLARE #coupon AS TABLE (id INT, [rule] varchar(max));
INSERT INTO #coupon VALUES
(1, 'user.Age BETWEEN 20 AND 29')
,(2, 'user.Color = ''Yellow''');
DECLARE #sql AS varchar(MAX) = REPLACE(
#outer_template
,'{template}',
REPLACE((
SELECT REPLACE(REPLACE(
#template
,'{coupon.rule}', coupon.[rule])
, '{coupon.id}', coupon.id)
FROM #coupon AS coupon
FOR XML PATH('')
), '{crlf}', CHAR(13) + CHAR(10)));
PRINT #sql;
// EXEC (#sql);
There's ways to pretty that up - play with it here: https://data.stackexchange.com/stackoverflow/q/115098/
I would consider adding computed columns (possibly persisted and indexed) to assist. For instance, age - non-persisted computed column will likely perform better than a scalar function.
I would consider batching this with a table which says whether a coupon is valid for a user and when it was last validated.
Seems like ages can change and a user can become valid or invalid for a coupon as their birthday passes.
When a user logs in you could spawn a background job to update their coupons. On subsequent logons, there won't be any need to update (since it's not likely to change until the next day or a triggering event).
Just a few ideas.
I would also add that you should have a way to test a coupon before it is approved to ensure there are no syntax errors (since the SQL is ad hoc or arbitrary) - this can be done relatively easily - perhaps a test user table (test_user as user in the generated code template instead) is required to contain pass and fail rows and the coupon rule points to those. Not only does the EXEC have to work - the rows it returns should be the expected and only the expected rows for that coupon.
This is not an easy problem. Here are some quick ideas that may help depending on your domain requirements:
Restrict the type of criteria you will be filtering on so that you can use dynamic or non-dynamic sql to execute them efficiently. For example if you are going to only have integers between a range of min and max values as a criteria then the problem becomes simpler. (You only need to know the field name, and the min max values to describe a criterian, not the full where statement.)
Create a number of views which expose the attributes in a helpful way. Then perform queries against those views -- or have those views pre-select in some way. For example, an age group view that has a field which can contain the values < 21, 21-30, 30-45, >45. Then your select just needs to return the rows from this view that match these strings.
Create a table which stores the results of running your criteria matching query (This can be run off line by a back ground process). Then for a given user check for membership by looking where in the table this user's ID exists.
Thinking about this some more I realize all my suggestions are based on one idea.
A query for an individual user will work faster overall if you first perform an SQL query against all users and cache that result in some way. If every user is reproducing queries against the whole dataset you will lose efficiency. You need some way to cache results and reuse them.
Hope this helps -- comment if these ideas are not clear.
My first thought on an approach (similar to Hogan's) would be to test for coupon applicability at the time the coupon is created. Store those results in a table (User_Coupons for example). If any user data is changed, your system would then retest any changed users for which coupons are applicable to them. At coupon creation (or change) time it would only check versus that coupon. At use creation (or change) time it would only check versus that user.
The coupon criteria should be from a known set of possible criteria and any time that you want to add a new type of criteria, it would possibly involve a code change. For example, let's say that you have a table set up similar to this:
CREATE TABLE Coupon_Criteria (
coupon_id INT NOT NULL,
age_minimum SMALLINT NULL,
age_maximum SMALLINT NULL,
vehicle_color VARCHAR(20) NULL,
...
CONSTRAINT PK_Coupon_Criteria PRIMARY KEY CLUSTERED (coupon_id)
)
If you wanted to add the ability to base a coupon on vehicle age then you would have to add a column to the table and likewise you would have to adjust your search code. You would use NULL values to indicate that the criteria is unused for that coupon.
An example query for the above table:
SELECT
CC.coupon_id
FROM
Users U
INNER JOIN Coupon_Criteria CC ON
(CC.age_maximum IS NULL OR dbo.f_GetAge(U.birthday) <= age_maximum) AND
(CC.age_minimum IS NULL OR dbo.f_GetAge(U.birthday) >= age_minimum) AND
(CC.vehicle_color IS NULL OR U.vehicle_color = CC.vehicle_color) AND
...
This can get unwieldy if the number of possible criteria gets to be very large.
Another possibility would be to save the coupon criteria in XML and have a business object for your application use that to determine eligibility. It could use the XML to generate a proper query against the User table (and any other necessary tables).
Here's another possibility. Each criteria could be given a query template which you could append to your queries. This would just involve updates to the data instead of DDL and could have good performance. It would involve dynamic SQL.
CREATE TABLE Coupons (
coupon_id INT NOT NULL,
description VARCHAR(2000) NOT NULL,
...
CONSTRAINT PK_Coupons PRIMARY KEY CLUSTERED (coupon_id)
)
CREATE TABLE Coupon_Criteria (
coupon_id INT NOT NULL,
criteria_num SMALLINT NOT NULL,
description VARCHAR(50) NOT NULL,
code_template VARCHAR(500) NOT NULL,
CONSTRAINT PK_Coupon_Criteria PRIMARY KEY CLUSTERED (coupon_id, criteria_num),
CONSTRAINT FK_Coupon_Criteria_Coupon FOREIGN KEY (coupon_id) REFERENCES Coupons (coupon_id)
)
INSERT INTO Coupons (coupon_id, description)
VALUES (1, 'Young people save $200 on yellow vehicles!')
INSERT INTO Coupon_Criteria (coupon_id, criteria_num, description, code_template)
VALUES (1, 1, 'Young people', 'dbo.Get_Age(U.birthday) <= 20')
INSERT INTO Coupon_Criteria (coupon_id, criteria_num, description, code_template)
VALUES (1, 2, 'Yellow Vehicles', U.vehicle_color = ''Yellow''')
You could then build a query by simply concatenating all of the criteria for any given coupon. The big downside to this one is that it's only one-directional. Given a coupon you can easily find who is qualified for it, but given a user you cannot find all coupons for which they are eligible except by going through all of the coupons. My guess is that the second is what you'd probably be most interested in unfortunately. Maybe this will give you some other ideas though.
For example, you could potentially have it work the other way by having a set number of criteria in a table and for the coupon/criteria linking table indicate whether or not that criteria is active. When querying you could then include that in your query. In other words, the query would look something like:
WHERE
(CC.is_active = 0 OR <code from the code column>) AND
The querying gets very complex though since you either need to join once for every possible criteria or you need to query to compare the number of active requirements for a coupon versus the number that are fulfilled. That is possible in SQL, but it's similar to working with an EAV model - which is basically what this turns into: a variation on an EAV model (yuck)

What is the preferred way of saving dynamic lists in database?

In our application user can create different lists (like sharepoint) for example a user can create a list of cars (name, model, brand) and a list of students (name, dob, address, nationality), e.t.c.
Our application should be able to query on different columns of the list so we can't just serialize each row and save it in one row.
Should I create a new table at runtime for each newly created list? If this was the best solution then probably Microsoft SharePoint would have done it as well I suppose?
Should I use the following schema
Lists (Id, Name)
ListColumns (Id, ListId, Name)
ListRows (Id, ListId)
ListData(RowId, ColumnId, Value)
Though a single row will create as many rows in list data table as there are columns in the list, this just doesn't feel right.
Have you dealt with this situation? How did you handle it in database?
what you did is called EAV (Entity-Attribute-Value Model).
For a list with 3 columns and 1000 entries:
1 record in Lists
3 records in ListColumns
and 3000 Entries in ListData
This is fine. I'm not a fan of creating tables on-the-fly because it could mess up your database and you would have to "generate" your SQL queries dynamically. I would get a strange feeling when users could CREATE/DROP/ALTER Tables in my database!
Another nice feature of the EAV model is that you could merge two lists easily without droping and altering a table.
Edit:
I think you need another table called ListRows that tells you which ListData records belong together in a row!
Well I've experienced something like this before - I don't want to share the actual table schema so lets do some thought exercises using some of the suggested table structures:
Lets have a lists table containing a list of all my lists
Lets also have a columns table containing the metadata (column names)
Now we need a values table which contains the column values
We also need a rows table which contains a list of all the rows, otherwise it gets very difficult to work out how many rows there actually are
To keep things simple lets just make everything a string (VARCAHR) and have a go at coming up with some queries:
Counting all the rows in a table
SELECT COUNT(*) FROM [rows]
JOIN [lists]
ON [rows].list_id = [Lists].id
WHERE [Lists].name = 'Cars'
Hmm, not too bad, compared to:
SELECT * FROM [Cars]
Inserting a row into a table
BEGIN TRANSACTION
DECLARE #row_id INT
DECLARE #list_id INT
SELECT #list_id = id FROM [lists] WHERE name = 'Cars'
INSERT INTO [rows] (list_id) VALUES (#list_id)
SELECT #row_id = ##IDENTITY
DECLARE #column_id INT
-- === Need one of these for each column ===
SELECT #column_id = id FROM [columns]
WHERE name = 'Make'
AND list_id = #list_id
INSERT INTO [values] (column_id, row_id, value)
VALUES (#column_id, #row_id, 'Rover')
-- === Need one of these for each column ===
SELECT #column_id = id FROM [columns]
WHERE name = 'Model'
AND list_id = #list_id
INSERT INTO [values] (column_id, row_id, value)
VALUES (#column_id, #row_id, 'Metro')
COMMIT TRANSACTION
Um, starting to get a little bit hairy compared to:
INSERT INTO [Cars] ([Make], [Model}) VALUES ('Rover', 'Metro')
Simple queries
I'm now getting bored of constructing tediously complex SQL statements so maybe you can have a go at coming up with equivalent queries for the followng statements:
SELECT [Model] FROM [Cars] WHRE [Make] = 'Rover'
SELECT [Cars].[Make], [Cars].[Model], [Owners].[Name] FROM [Cars]
JOIN [Owners] ON [Owners].id = [Cars].owner_id
WHERE [Owners].Age > 50
SELECT [Cars].[Make], [Cars].[Model], [Owners].[Name] FROM [Cars]
JOIN [Owners] ON [Owners].id = [Cars].owner_id
JOIN [Addresses] ON [Addresses].id = [Owners].address_id
WHERE [Addresses].City = 'London'
I hope you are beginning to get the idea...
In short - I've experienced this before and I can assure you that creating a database inside a database in this way is definitely a Bad Thing.
If you need to do anything but the most basic querying on these lists (and literally I mean "Can I have all the items in this list please?"), you should try and find an alternative.
As long as each user pretty much has their own database I'll definitely recommend the CREATE TABLE approach. Even if they don't I'd still recommend that you at least consider it.
Perhaps a potential solution would be the creating of lists can involve CREATE TABLE statements for those entities/lists?
It sounds like the db structure or schema can change at runtime, or at the user's command, so perhaps something like this might help?
User wants to create a new list of an entity never seen before. Call it Computer.
User defines the attributes (screensize, CpuSpeed, AmountRAM, NumberOfCores)
System allows user to create in the UI
system generally lets them all be strings, unless can tell when all supplied values are indeed dates or numbers.
build the CREATE scripts, execute them against the DB.
insert the data that the user defined into that new table.
Properly coded, we're working with the requirements given: let users create new entities. There was no mention of scale here. Of course, this requires all input to be sanitized, queries parameterized, actions logged, etc.
The negative comment below doesn't actually give any good reasons, but creates a bit of FUD. I'd be interested in addressing any concerns with this potential solution. We haven't heard about scale, security, performance, or usage (internal LAN vs. internet).
You should absolutely not dynamically create tables when your users create lists. That isn't how databases are meant to work.
Your schema is correct, and the pluralization is, in my opinion, also correct, though I would remove the camel case and call them lists, list_columns, list_rows and list_data.
I would further improve upon your schema by skipping rows and columns tables, they serve no purpose. Simply have a row/column number attached to each cell, and keep things sparse: Don't bother holding empty cells in the database. You retain the ability to query/sort based on row/column, your queries will be (potentially very much) faster because the number of list_cells will be reduced, and you won't have to do any crazy joining to link your data back to its table.
Here is the complete schema:
create table lists (
id int primary key,
name varchar(25) not null
);
create table list_cells (
id int primary key,
list_id int not null references lists(id)
on delete cascade on update cascade,
row int not null,
col int not null,
data varchar(25) not null
);
It sounds like you might have Sharepoint already deployed in your environment.
Consider integrating your application with Sharepoint, and have it be your datastore. No need to recreate all the things you like about Sharepoint, when you could leverage it.
It'd take a bit of configuring, but you could call SP web services to CRUD your list data for you.
inserting list data into Sharepoint via web services
reading SP lists via web services
Sharepoint 2010 can also expose lists via OData, which would be simple to consume from any application.

Auto increment with a Unit Of Work

Context
I'm building a persistence layer to abstract different types of databases that I'll be needing. On the relational part I have mySQL, Oracle and PostgreSQL.
Let's take the following simplified MySQL tables:
CREATE TABLE Contact (
ID varchar(15),
NAME varchar(30)
);
CREATE TABLE Address (
ID varchar(15),
CONTACT_ID varchar(15),
NAME varchar(50)
);
I use code to generate system specific alpha numeric unique ID's fitting 15 chars in this case. Thus, if I insert a Contact record with it's Addresses I have my generated Contact.ID and Address.CONTACT_IDs before committing.
I've created a Unit of Work (amongst others) as per Martin Fowler's patterns to add transaction support. I'm using a key based Identity Map in the UoW to track the changed records in memory. It works like a charm for the scenario above, all pretty standard stuff so far.
The question scenario comes in when I have a database that is not under my control and the ID fields are auto-increment (or in Oracle sequences). In this case I do not have the db generated Contact.ID beforehand, so when I create my Address I do not have a value for Address.CONTACT_ID. The transaction has not been started on the DB session since all is kept in the Identity Map in memory.
Question: What is a good approach to address this? (Avoiding unnecessary db round trips)
Some ideas:
Retrieve the last ID: I can do a call to the database to retrieve the last Id like:
SELECT Auto_increment FROM information_schema.tables WHERE table_name='Contact';
But this is MySQL specific and probably something similar can be done for the other databases. If do this then would need to do the 1st insert, get the ID and then update the children (Address.CONTACT_IDs) – all in the current transaction context.
Avoid explicitly referencing the CONTACT_ID entirely. Assuming that Contact.NAME has a UNIQUE constraint and that the CONTACT_ID column REFERENCES Contact(ID):
INSERT INTO Contact (NAME) VALUES ('Joe Bloggs'); -- Contact.ID auto-generated
INSERT INTO Address (CONTACT_ID, NAME)
VALUES ((SELECT ID FROM Contact WHERE NAME = 'Joe Bloggs'),
'123 Apple Lane');
Now Address.CONTACT_ID is correct without your code knowing the key's value or even its type.

Merging databases how to handle duplicate PK's

We have three databases that are physically separated by region, one in LA, SF and NY. All the databases share the same schema but contain data specific to their region. We're looking to merge these databases into one and mirror it. We need to preserve the data for each region but merge them into one db. This presents quite a few issues for us, for example we will certainly have duplicate Primary Keys, and Foreign Keys will be potentially invalid.
I'm hoping to find someone who has had experience with a task like this who could provide some tips, strategies and words of experience on how we can accomplish the merge.
For example, one idea was to create composite keys and then change our code and sprocs to find the data via the composite key (region/original pk). But this requires us to change all of our code and sprocs.
Another idea was to just import the data and let it generate new PK's and then update all the FK references to the new PK. This way we potentially don't have to change any code.
Any experience is welcome!
I have no first-hand experience with this, but it seems to me like you ought to be able to uniquely map PK -> New PK for each server. For instance, generate new PKs such that data from LA server has PK % 3 == 2, SF has PK % 3 == 1, and NY has PK % 3 == 0. And since, as I understood your question anyway, each server only stores FK relationships to its own data, you can update the FKs in identical fashion.
NewLA = OldLA*3-1
NewSF = OldLA*3-2
NewNY = OldLA*3
You can then merge those and have no duplicate PKs. This is essentially, as you already said, just generating new PKs, but structuring it this way allows you to trivially update your FKs (assuming, as I did, that the data on each server is isolated). Good luck.
BEST: add a column for RegionCode, and include it on your PKs, but you don't want to do all the leg work.
HACK: if your IDs are INTs, a quick fix would be to add a fixed value based on region to each key on import. INTs can be as large as: 2,147,483,647
local server data:
LA IDs: 1,2,3,4,5,6
SF IDs: 1,2,3,4,5
NY IDs: 1,2,3,4,5,6,7,9
add 100000000 to LA's IDs
add 200000000 to SF's IDs
add 300000000 to NY's IDs
combined server data:
LA IDs: 100000001,100000002,100000003,100000004,100000005,100000006
SF IDs: 200000001,200000002,200000003,200000004,200000005
NY IDs: 300000001,300000002,300000003,300000004,300000005,300000006,300000007,300000009
I have done this and I say change your keys (pick a method) rather than changing your code. Invariably you will either miss a stored procedure or introduce a bug. With data changes, it is pretty easy to write tests to look for orphaned records or to verify that things were matched up correctly. With code changes, especially code that is working correctly, it is too easy to miss something.
One thing you could do is set up the tables with regional data to use GUID's. That way, the primary keys in each region are unique, and you can mix and match data (import data from one region to another). For the tables which have shared data (like type tables), you can keep the primary keys the way they are (since they should be the same everywhere).
Here is some information about GUID's:
http://www.sqlteam.com/article/uniqueidentifier-vs-identity
Maybe SQL Server Management Studio lets you convert columns to use GUID's easily. I hope so!
Best of luck.
what i have done in a situation like this is this:
create a new db with the same schema
but only tables. no pk fk, checks
etc.
transfer data from DB1 to this
source db
for each table in target database
find the top number for the PK
for each table in the source
database update their pk, fk etc
starting with the (top number + 1)
from the target db
for each table in target database
set identity insert to on
import data from source db to target
db
for each table in target database
set identity insert to off
clear source db
repeat for DB2
As Jon mentioned, I would use GUIDs to solve the merge task. And I see two different solutions that required GUIDs:
1) Permanently change your database schema to use GUIDs instead of INTEGER (IDENTITY) as primary key.
This is a good solution in general, but if you have a lot of non SQL code that is somehow bound to the way your identifiers work, it could require quite some code changes. Probably since you merge databases, you may anyways need to update your application so that it is working with one region data only based on the user logged in etc.
2) Temporarily add GUIDs for migration purposes only, and after the data is migrated, drop them:
This one is kind-of more tricky, but once you write this migration script, you can (re-)run it multiple times to merge databases again in case you screw it the first time. Here is an example:
Table: PERSON (ID INT PRIMARY KEY, Name VARCHAR(100) NOT NULL)
Table: ADDRESS (ID INT PRIMARY KEY, City VARCHAR(100) NOT NULL, PERSON_ID INT)
Your alter scripts are (note that for all PK we automatically generate the GUID):
ALTER TABLE PERSON ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT (NEWID())
ALTER TABLE ADDRESS ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT (NEWID())
ALTER TABLE ADDRESS ADD PERSON_UID UNIQUEIDENTIFIER NULL
Then you update the FKs to be consistent with INTEGER ones:
--// set ADDRESS.PERSON_UID
UPDATE ADDRESS
SET ADDRESS.PERSON_UID = PERSON.UID
FROM ADDRESS
INNER JOIN PERSON
ON ADDRESS.PERSON_ID = PERSON.ID
You do this for all PKs (automatically generate GUID) and FKs (update as shown above).
Now you create your target database. In this target database you also add the UID columns for all the PKs and FKs. Also disable all FK constraints.
Now you insert from each of your source databases to the target one (note: we do not insert PKs and integer FKs):
INSERT INTO TARGET_DB.dbo.PERSON (UID, NAME)
SELECT UID, NAME FROM SOURCE_DB1.dbo.PERSON
INSERT INTO TARGET_DB.dbo.ADDRESS (UID, CITY, PERSON_UID)
SELECT UID, CITY, PERSON_UID FROM SOURCE_DB1.dbo.ADDRESS
Once you inserted data from all the databases, you run the code opposite to the original to make integer FKs consistent with GUIDs on the target database:
--// set ADDRESS.PERSON_ID
UPDATE ADDRESS
SET ADDRESS.PERSON_ID = PERSON.ID
FROM ADDRESS
INNER JOIN PERSON
ON ADDRESS.PERSON_UID = PERSON.UID
Now you may drop all the UID columns:
ALTER TABLE PERSON DROP COLUMN UID
ALTER TABLE ADDRESS DROP COLUMN UID
ALTER TABLE ADDRESS DROP COLUMN PERSON_UID
So at the end you should get a rather long migration script, that should do the job for you. The point is - IT IS DOABLE
NOTE: all written here is not tested.

Generate unique ID to share with multiple tables SQL 2008

I have a couple of tables in a SQL 2008 server that I need to generate unique ID's for. I have looked at the "identity" column but the ID's really need to be unique and shared between all the tables.
So if I have say (5) five tables of the flavour "asset infrastructure" and I want to run with a unique ID between them as a combined group, I need some sort of generator that looks at all (5) five tables and issues the next ID which is not duplicated in any of those (5) five tales.
I know this could be done with some sort of stored procedure but I'm not sure how to go about it. Any ideas?
The simplest solution is to set your identity seeds and increment on each table so they never overlap.
Table 1: Seed 1, Increment 5
Table 2: Seed 2, Increment 5
Table 3: Seed 3, Increment 5
Table 4: Seed 4, Increment 5
Table 5: Seed 5, Increment 5
The identity column mod 5 will tell you which table the record is in. You will use up your identity space five times faster so make sure the datatype is big enough.
Why not use a GUID?
You could let them each have an identity that seeds from numbers far enough apart never to collide.
GUIDs would work but they're butt-ugly, and non-sequential if that's significant.
Another common technique is to have a single-column table with an identity that dispenses the next value each time you insert a record. If you need them pulling from a common sequence, it's not unlikely to be useful to have a second column indicating which table it was dispensed to.
You realize there are logical design issues with this, right?
Reading into the design a bit, it sounds like what you really need is a single table called "Asset" with an identity column, and then either:
a) 5 additional tables for the subtypes of assets, each with a foreign key to the primary key on Asset; or
b) 5 views on Asset that each select a subset of the rows and then appear (to users) like the 5 original tables you have now.
If the columns on the tables are all the same, (b) is the better choice; if they're all different, (a) is the better choice. This is a classic DB spin on the supertype / subtype relationship.
Alternately, you could do what you're talking about and recreate the IDENTITY functionality yourself with a stored proc that wraps INSERT access on all 5 tables. Note that you'll have to put a TRANSACTION around it if you want guarantees of uniqueness, and if this is a popular table, that might make it a performance bottleneck. If that's not a concern, a proc like that might take the form:
CREATE PROCEDURE InsertAsset_Table1 (
BEGIN TRANSACTION
-- SELECT MIN INTEGER NOT ALREADY USED IN ANY OF THE FIVE TABLES
-- INSERT INTO Table1 WITH THAT ID
COMMIT TRANSACTION -- or roll back on error, etc.
)
Again, SQL is highly optimized for helping you out if you choose the patterns I mention above, and NOT optimized for this kind of thing (there's overhead with creating the transaction AND you'll be issuing shared locks on all 5 tables while this process is going on). Compare that with using the PK / FK method above, where SQL Server knows exactly how to do it without locks, or the view method, where you're only inserting into 1 table.
I found this when searching on google. I am facing a simillar problem for the first time. I had the idea to have a dedicated ID table specifically to generate the IDs but I was unsure if it was something that was considered OK design. So I just wanted to say THANKS for confirmation.. it looks like it is an adequate sollution although not ideal.
I have a very simple solution. It should be good for cases when the number of tables is small:
create table T1(ID int primary key identity(1,2), rownum varchar(64))
create table T2(ID int primary key identity(2,2), rownum varchar(64))
insert into T1(rownum) values('row 1')
insert into T1(rownum) values('row 2')
insert into T1(rownum) values('row 3')
insert into T2(rownum) values('row 1')
insert into T2(rownum) values('row 2')
insert into T2(rownum) values('row 3')
select * from T1
select * from T2
drop table T1
drop table T2
This is a common problem for example when using a table of people (called PERSON singular please) and each person is categorized, for example Doctors, Patients, Employees, Nurse etc.
It makes a lot of sense to create a table for each of these people that contains thier specific category information like an employees start date and salary and a Nurses qualifications and number.
A Patient for example, may have many nurses and doctors that work on him so a many to many table that links Patient to other people in the PERSON table facilitates this nicely. In this table there should be some description of the realtionship between these people which leads us back to the categories for people.
Since a Doctor and a Patient could create the same Primary Key ID in their own tables, it becomes very useful to have a Globally unique ID or Object ID.
A good way to do this as suggested, is to have a table designated to Auto Increment the primary key. Perform an Insert on that Table first to obtain the OID, then use it for the new PERSON.
I like to go a step further. When things get ugly (some new developer gets got his hands on the database, or even worse, a really old developer, then its very useful to add more meaning to the OID.
Usually this is done programatically, not with the database engine, but if you use a BIG INT for all the Primary Key ID's then you have lots of room to prefix a number with visually identifiable sequence. For example all Doctors ID's could begin with 100, all patients with 110, all Nurses with 120.
To that I would append say a Julian date or a Unix date+time, and finally append the Auto Increment ID.
This would result in numbers like:
110,2455892,00000001
120,2455892,00000002
100,2455892,00000003
since the Julian date 100yrs from now is only 2492087, you can see that 7 digits will adequately store this value.
A BIGINT is 64-bit (8 byte) signed integer with a range of -9.22x10^18 to 9.22x10^18 ( -2^63 to 2^63 -1). Notice the exponant is 18. That's 18 digits you have to work with.
Using this design, you are limited to 100 million OID's, 999 categories of people and dates up to... well past the shelf life of your databse, but I suspect thats good enough for most solutions.
The operations required to created an OID like this are all Multiplication and Division which avoids all the gear grinding of text manipulation.
The disadvantage is that INSERTs require more than a simple TSQL statement, but the advantage is that when you are tracking down errant data or even being clever in your queries, your OID is visually telling you alot more than a random number or worse, an eyesore like GUID.