Moving from ints to GUIDs as primary keys - sql

I use several referenced tables with integer primary keys. Now I want to change ints to GUIDs leaving all references intact. What is the easiest way to do it?
Thank you!
Addition
I do understand the process in general, so I need more detailed advice, for example, on how to fill the new GUID column. Using a default value of newid() is correct, but what about already existing rows?

Create a new column for the guid value in the master table. Use the uniqueidentifier data type, and make it not null with a newid() default so all existing rows will be populated.
Create new uniqueidentifier columns in the child tables.
Run update statements to build the guid relationships, using the existing int relationships to reference the entities.
Drop the original int columns.
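A minimal T-SQL sketch of the first three steps, assuming a master table named Parent, a child table named Child, and an existing int key column named ParentId (all names here are hypothetical):
-- Add the new guid column to the master table; the newid() default populates existing rows.
ALTER TABLE Parent ADD ParentGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID();

-- Add a matching guid column to each child table.
ALTER TABLE Child ADD ParentGuid UNIQUEIDENTIFIER NULL;

-- Build the guid relationship from the existing int relationship.
UPDATE c
SET c.ParentGuid = p.ParentGuid
FROM Child c
JOIN Parent p ON p.ParentId = c.ParentId;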
In addition, leave some space in your data/index pages (specify fillfactor < 100) as guids are not sequential like int identity columns are. This means inserts can be anywhere in the data range and will cause page splits if your pages are 100% full.

Firstly: Dear God why?!?!?
Secondly, you're going to have to add the GUID column to all your tables first, then populate them based on the int values. Once that's done, you can make the GUID columns the primary/foreign keys, then drop the int columns.
To update the value you'd do something like
Set the new GUIDs in the primary key table
Run this:
-- Copy the new GUID from the primary key table into each referencing table.
UPDATE f
SET f.guidCol = p.guidCol
FROM foreignTable f
JOIN primaryTable p ON p.intCol = f.intCol

This is relevant in a system that implements a distributed computing model. If the system is required to know the primary key at the time you persist information, the use of an auto-incrementing primary key maintained by ONE handler will slow down the system. Instead, you need a mechanism like a GUID generator to create primary keys (keep in mind that the defining feature of a primary key is its uniqueness). That way, I can scale out with multiple services, each creating its own primary keys independently of the others.
I had the dubious privilege of doing this before, and basically what I had to do was export the whole damned database into XML. Next, I had a Java application that used java.util.Random's nextLong() function to replace the primary keys with their new guid keys. After that I imported the whole thing back into the database.
Of course, the first time I tried to import the XML files back, I forgot to turn off the auto-number feature of the primary key field, so do learn from my mistakes. I'm sure there are better ways of doing it, but this was a quick and dirty way ... and it worked. In case you're wondering, the project was to make the application scale.

Yeah, I'm with Glenn... I was actually hesitating to post the same thing before he posted it....
Why would you not want an auto-increment int primary key separate from your GUID? It's a lot more flexible, and you can just have the GUID column indexed so you get good performance on your queries...
As for the flexibility, I like to keep my IDs as auto-increment ints, because then the other seemingly unique, primary-key-worthy item can change.
A great case for the flexibility is if you use usernames as a primary key. Even if they are unique, it is nice to be able to change them. What if users use an email address as their username? Being able to change the username and have it not affect all your queries is a big plus, and I suspect the same could be true with your GUIDs....

I think you must do it manually, or you can write some utility for it. The scenario should be:
Duplicate the "int" PK/FK columns with new "guid" columns.
Generate new values for the "guid" PK columns.
Update values in the "guid" FK columns with the corresponding values (you find the records via the "int" PK).
Remove the references (relations) on the "int" PK/FK columns.
Create similar references (relations) on the "guid" PK/FK columns.
Remove the "int" PK/FK columns.

It's a very good choice. I switched from longs to UUIDs for one of my applications and I don't regret it. If you use MS SQL Server, the type is included as standard (I use PostgreSQL, and it's only included as standard from 8.3 on).
As mentioned by Glenn Slaven, you can recreate UUIDs from the keys you have in your current records. Be aware that those will not be truly unique, but that way it's easy to keep the relationships intact. New records you create after the move will be unique.
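One possible way to derive such values in SQL Server, as a sketch only (the derivation and the table/column names are my own assumptions, not something prescribed in this thread):
-- Derive a repeatable uniqueidentifier from the existing int key so that related
-- tables can compute the same value; these GUIDs are predictable, not random.
UPDATE Parent
SET ParentGuid = CAST(CAST(ParentId AS BINARY(16)) AS UNIQUEIDENTIFIER);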

DON'T DO IT! We started out using GUIDs, and now we've almost finished moving to INTs as PKs; we're retaining the GUID for logging purposes (and for some tables of, er, "negotiable relational integrity" ;) ), but the speed increase of using ints has been phenomenal.
This only really became apparent when the table rowcounts crossed into millions, mind you.
Our biggest folly by far was using a NEWID() as the PK of our (sequential) log table - there was much head-smacking when we realised our error.

Related

Is it a bad practice to use a single column table with a nominal value as PK?

When dealing with NoSQL databases this would not be a problem, as a relation would not make sense (or at least would be unnecessary), but when the infrastructure does not support NoSQL databases, scenarios such as creating tags for articles raise some concerns (from the perspective of finding the best practice).
Assume that I have three tables: Articles, Tags, and ArticleTags. In this case, the Tag names must be unique, as duplicate tags in the Tags table would not make any sense. Taking this into account, I can do the following:
CREATE TABLE [Tags](
[TagId] UNIQUEIDENTIFIER ROWGUIDCOL PRIMARY KEY DEFAULT NEWSEQUENTIALID(),
[Name] NVARCHAR(50) NOT NULL UNIQUE
)
This approach can be considered standard practice. However, since Tags.Name is unique, I can also use the Name column as the primary key and remove the TagId column. The question is: if I do this and use Tags.Name as the primary key, even the Tags table becomes redundant, and I could simply add a new column such as Tag to the ArticleTags table without any relation, which would be fine if we want to allow users to generate new tags when necessary (losing the FK constraint).
However, would this violate the normalization rules? And would it be a better practice in comparison with the standard approach (having both id and name)?
It is a rather bad idea to have a GUID as a primary key (there might be exceptions). If you have a table where you are going to be doing frequent inserts, then a GUID as a primary key is definitely a bad idea.
Why? Well, by default, the primary key is clustered in SQL Server. You can override this, but let's stick with the default.
Because a newly generated GUID can have a smaller value than existing ones, inserts end up going between existing rows. That tends to cause fragmentation and (much) slower inserts.
Note that this even occurs with NEWSEQUENTIALID(). As the documentation explains:
After restarting Windows, the GUID can start again from a lower range,
but is still globally unique.
If you are doing all the inserts at once, then it doesn't matter, much.
However, this seems so much simpler:
CREATE TABLE [Tags](
[TagId] INT IDENTITY(1, 1) PRIMARY KEY,
[Name] NVARCHAR(50) NOT NULL UNIQUE
);
Here are some reasons:
Identity columns take up less space (ints are smaller than GUIDs).
Identity columns are much handier for referring tables (foreign keys take up less space).
Integer ids are easier for people to recognize when looking at the data or typing in an id (say for debugging).
Inserts always go at the end of the table.
I would just avoid the habit of ever using GUIDs (or UUIDs in other databases) as primary keys. The one case where I've had to relax this is when I generate data using SparkSQL or BigQuery. However, I consider it a bug in those tools that they cannot readily do row_number() on a large data set.
As for using Name, I would discourage that. You might want to rename a tag at some point in the future or decide that 50 characters is not large enough. Although you can have cascading foreign key references, I think a unique integer id is a safer approach. In addition, the id gives some inadvertent information -- such as the last tag inserted into the table.

Do I need a primary key if something will NOT be changed?

If I had a site where a user can flag another user's post and it cannot be undone or changed, do I need to have a primary key? All my selects would be on post_id, with a where clause to see if the user already flagged it.
It seems to me from some of your other posts that the reason you are trying to avoid adding a primary key to your table is to save space.
Stop thinking like that.
It's a bad idea to make non-standard optimizations like this without having tested them first to see if they actually work. Have you run some tests that shows that you save a significant amount of space in your database by omitting the primary key on this table? Or are you just guessing?
Using a primary key doesn't necessarily mean that you will use more space. Depending on the database, if you omit the primary key it might add a hidden field for you anyway (for example if you don't have a PK in MySQL/InnoDB it adds a hidden clustered index on a synthetic column containing 6 byte row ID values (source)). If you do use a primary key, rather than adding a new column you can just choose some existing columns that you know should be unique anyway. It won't take up any more space, it will just mean that the data will be stored in a different order to make it easier to search.
When you add an index, that index is going to take up extra space, as an index is basically just a copy of a few columns of the table, plus a link back to the row in the original table. Remember that hidden column the database uses when you don't have a PK? Well now it has to use that to find your rows, so you'll get a copy of it in your index too. If you use a primary key then you probably don't need one of your indexes that you would have added, so you're actually saving space here.
Besides all this, some useful database tools just won't work well if you don't have a primary key on your table. You will annoy everyone that has to maintain your database after you are gone.
So tell me, why do you think it's a good idea to NOT have one?
A primary key has nothing to do with whether data can be changed - it's a single point of reference for an entire row, which can make looking up and/or changing data faster.
All my selects would be on the post_id and with a where clause to see if the user already flagged it.
You need to provide more information about business rules. For example, should the system support more than one user flagging the same post?
If the answer is "no", then I would model a POST_STATUS_CODE table and have a foreign key to the table in your POSTS table.
If the answer is "yes", then I would still have a POST_STATUS_CODE table but also a table linking the POSTS and POST_STATUS_CODE tables - say POSTS_STATUS_XREF.
I have a post_flag table with post_id, user_id (who flagged it) and flag_type (ATM as a byte). I don't see how a PK will make it faster in this case, but I imagine it will take up 4 or 8 bytes per row. I was thinking about indexing post_id. If I do, should I still create a PK?
At a minimum, I would make the primary key to be a combination of:
post_id
user_id
The reason being that a primary key ensures that there can't be duplicates.
A primary key can be made up of more than one column - this is called a compound key. It means that the pair of values is unique, i.e. you can't have more than one row with the combination (1, 1), but you could have (1, 2), (1, 3), etc. (and vice versa). Attempts to add duplicates will result in duplicate primary key errors.
Primary keys help speed up lookups and joins, so it's always nice to have if you can.
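A minimal sketch of that compound key on the commenter's post_flag table (the column types are assumptions based on the comment above):
-- The composite primary key stops the same user flagging the same post twice.
CREATE TABLE post_flag (
    post_id   INT     NOT NULL,
    user_id   INT     NOT NULL,
    flag_type TINYINT NOT NULL, -- "a byte" per the comment; TINYINT is an assumption
    PRIMARY KEY (post_id, user_id)
);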
You don't need a primary key, not even if users are going to modify rows. A primary key optimizes performance every time you query that table, though. If you think your table will grow larger than about a thousand rows or so, then setting a primary key will give a noticeable performance boost.
The only advantage in not creating a primary key really is that it means you don't have to create one, which is fair enough I suppose :-P
You could just not bother creating one for now. You can always add one later. Not a big deal. Don't let anyone bully you into thinking you absolutely must create a primary key right now! You'll see it being horribly slow soon enough :-P and then you can just add the primary key at that point. If you don't have too many duplicates by then :-P
Best to have one, if only because you may have to delete the occasional record manually (e.g. duplicates) and one should have a unique identifier for that.
The simple answer is yes. Every table should have a primary key (made of at least one column). What benefit do you get from not having one?
In such a situation, you might be able to get away without one, but I'd be inclined to throw a primary key on there anyway, simply because it's relatively simple to do and will save rework if the requirements change.
The software requirements may change rapidly. The customer may introduce new requirements. So having a primary key may be useful because you can eliminate totally unnecessary data migrations in such situations.
Read this: "Is it OK not to use a Primary Key When I don’t Need one?"
Yes, you do need a primary key.
You may as well use text files for storage if you don't think you do because it means you don't understand them...

SQL primary key - complex primary or string with concatenation?

I have a table with 16 columns. It will be the most frequently used table in a web application, and it will contain about a few hundred thousand rows. The database is created on SQL Server 2008.
My question is about the choice of primary key. Which is quicker: a composite primary key with two bigints, or one varchar value that I will need to concatenate afterwards?
There are many more factors you must consider:
prevalent data access pattern: how are you going to access the table?
how many non-clustered indexes?
frequency of updates
pattern of updates (sequential inserts, random)
pattern of deletes
All these factors, and specially the first two, should drive your choice of the clustered key. Note that the primary key and clustered key are different concepts, often confused. Read up my answer on Should I design a table with a primary key of varchar or int? for a lengthier discussion on the criteria that drive a clustered key choice.
Without any information on your access patterns I can answer very briefly, concisely, and actually correctly: the narrower key is always quicker (for reasons of IO). However, this response carries absolutely no value. The only thing that will make your application faster is to choose a key that is going to be used by the query execution plans.
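To illustrate the distinction, a hedged sketch in SQL Server syntax with made-up table and column names: the primary key only enforces uniqueness, while the clustered key is chosen to match the prevalent access pattern.
-- Primary key declared nonclustered: it enforces uniqueness of OrderId and nothing more.
CREATE TABLE Orders (
    OrderId    BIGINT   NOT NULL,
    CustomerId BIGINT   NOT NULL,
    OrderDate  DATETIME NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderId)
);

-- Clustered index chosen for the dominant query pattern (e.g. range scans per customer).
CREATE CLUSTERED INDEX CIX_Orders_CustomerId_OrderDate
    ON Orders (CustomerId, OrderDate);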
A primary key which does not rely on any underlying values (called a surrogate key) is a good choice. That way, if the row changes, the ID doesn't have to, and any tables referring to it (foreign keys) will not need to change. I would choose an autonumber (i.e. IDENTITY) column for the primary key column.
In terms of performance, a shorter, integer based primary key is best.
You can still create your clustered index on multiple columns.
Why not just a single INT auto-generated primary key? INT is 32-bit, so it can handle over 4 billion records.
CREATE TABLE Records (
recordId INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
...
);
A surrogate key might be a fine idea if there are foreign key relationships on this table. Using a surrogate will save tables that refer to it from having to duplicate all those columns in their tables.
Another important consideration is indexes on columns that you'll be using in WHERE clauses. Your performance will suffer if you don't have them. Make sure that you add appropriate indexes, over and above the primary key, to avoid table scans.
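For example, a minimal sketch assuming the Records table above also has a hypothetical customerId column that appears in WHERE clauses:
-- Nonclustered index so that WHERE customerId = ... queries avoid a table scan.
CREATE NONCLUSTERED INDEX IX_Records_customerId
    ON Records (customerId);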
What do you mean by quicker? If you need to search quicker, you can create an index on any column or use full-text search. The primary key just makes sure you do not have duplicated records.
The decision depends upon its use. If you are mostly using the table to save data and not retrieve it, then a simple key will do. If you are mostly querying the data and it is mostly static data where the key values will not change, your index strategy needs to optimize for the most frequent query that will be used. Personally, I like the idea of using GUIDs for the primary key and an int for the clustered index. That allows for easy data imports. But it really depends upon your needs.
Lots of variables you haven’t mentioned: whether the data in the two columns is “natural” and there is a benefit in identifying records by a logical ID, whether disclosure of the key via a UI poses a risk, and how important performance is (a few hundred thousand rows is pretty minimal).
If you’re not too fussy, go the auto number path for speed and simplicity. Also take a look at all the posts on the site about SQL primary key types. Heaps of info here.
Is it an ER model or a dimensional model? In an ER model, they should be separate and should not be surrogated. The entire record could have a single surrogate for easy references in URLs etc. This could be a hash of all parts of the composite key or an identity.
In a dimensional model, they must also be separate, and they all should be surrogated.

Should I have a primary ID? I am indexing another field

Using SQLite, I need a table holding a blob to store an MD5 hash and a 4-byte int. I plan to index the int, but this value will not be unique.
Do I need a primary key for this table? And is there an issue with indexing a non-unique value? (I assume there is no issue, nor any reason for concern.)
Personally, I like to have a unique primary id on all tables. It makes finding unique records for updating/deleting easier.
How are you going to reference a row in a SELECT * FROM Table WHERE ... or an UPDATE ... WHERE ...? Are you sure you want to hit every one that matches?
You already have one.
SQLite automatically creates an integer ROWID column for every row of every table. This can function as a primary key if you don't declare your own.
In general it's a good idea to declare your own primary key column. In the particular instance you mentioned, ROWID will probably be fine for you.
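A small sketch of both options in SQLite, with table and column names of my own choosing:
-- Option 1: rely on the implicit rowid; just index the non-unique int.
CREATE TABLE hashes (
    hash  BLOB NOT NULL,
    value INTEGER NOT NULL
);
CREATE INDEX idx_hashes_value ON hashes (value);

-- Option 2: declare an explicit key; INTEGER PRIMARY KEY becomes an alias for rowid.
CREATE TABLE hashes2 (
    id    INTEGER PRIMARY KEY,
    hash  BLOB NOT NULL,
    value INTEGER NOT NULL
);
CREATE INDEX idx_hashes2_value ON hashes2 (value);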
My advice is to go with a primary key if you want referential integrity. However, there is no issue with indexing a non-unique value. The only thing is that your performance will degrade a little.
What are the consequences of letting two identical rows somehow get into this table?
One consequence is, of course, wasted space. But I'm talking about something more fundamental, here. There are times when duplicate rows in data give you wrong results. For example, if you grouped by the int column (field), and listed the count of rows in each group, a duplicate row (record) might throw you off, depending on what you are really looking for.
Relational databases work better if they are based on relations. Relations are always in first normal form. The primary reason for declaring a primary key is to prevent the table from getting out of first normal form, and thus not representing a relation.

Should I use an ENUM for primary and foreign keys?

An associate has created a schema that uses an ENUM() column for the primary key on a lookup table. The table turns a product code "FB" into its name "Foo Bar".
This primary key is then used as a foreign key elsewhere. And at the moment, the FK is also an ENUM().
I think this is not a good idea. This means that to join these two tables, we end up with four lookups. The two tables, plus the two ENUM(). Am I correct?
I'd prefer to have the FKs be CHAR(2) to reduce the lookups. I'd also prefer that the PKs were also CHAR(2) to reduce it completely.
The benefit of the ENUM()s is to get constraints on the values. I wish there was something like: CHAR(2) ALLOW('FB', 'AB', 'CD') that we could use for both the PK and FK columns.
What is:
Best practice?
Your preference?
This concept is used elsewhere too. What if the ENUM()'s values are longer? ENUM('Ding, dong, dell', 'Baa baa black sheep'). Now the ENUM() is useful from a space point-of-view. Should I only care about this if there are several million rows using the values? In which case, the ENUM() saves storage space.
ENUM should be used to define a possible range of values for a given field. This also implies that you may have multiple rows which have the same value for this particular field.
I would not recommend using an ENUM as a primary key type or foreign key type.
Using an ENUM for a primary key means that adding a new key would involve modifying the table since the ENUM has to be modified before you can insert a new key.
I am guessing that your associate is trying to limit who can insert a new row and that number of rows is limited. I think that this should be achieved through proper permission settings either at the database level or at the application and not through using an ENUM for the primary key.
IMHO, using an ENUM for the primary key type violates the KISS principle.
But when you are only dealing with roughly 10 rows or fewer, that won't be a problem.
For example:
CREATE TABLE `grade` (
`grade` ENUM('A','B','C','D','E','F') PRIMARY KEY,
`description` VARCHAR(50) NOT NULL
);
With a table like this, though, DML is more than a little difficult: adding a new grade means altering the ENUM definition first.
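To make that concrete, a sketch of what adding one new grade would require in MySQL (the new value 'G' is just an illustration):
-- The ENUM definition itself must change before the new key value can be inserted,
-- and the same change has to be repeated on every foreign key column that uses the ENUM.
ALTER TABLE `grade`
    MODIFY `grade` ENUM('A','B','C','D','E','F','G') NOT NULL;

INSERT INTO `grade` (`grade`, `description`) VALUES ('G', 'New grade');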
We've had more discussion about it and here's what we've come up with:
Use CHAR(2) everywhere, for both the PK and FK. Then use MySQL's foreign key constraints to disallow creating an FK to a row that doesn't exist in the lookup table.
That way, given the lookup table is L, and two referring tables X and Y, we can join X to Y without any looking up of ENUM()s or table L and can know with certainty that there's a row in L if (when) we need it.
I'm still interested in comments and other thoughts.
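For reference, a rough sketch of that arrangement in MySQL (the table names L, X, and Y follow the description above; the column names are assumptions):
-- Lookup table keyed directly on the two-character code.
CREATE TABLE L (
    code CHAR(2) NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

-- Referring tables carry the same CHAR(2) and are constrained to rows that exist in L.
CREATE TABLE X (
    x_id INT NOT NULL PRIMARY KEY,
    code CHAR(2) NOT NULL,
    FOREIGN KEY (code) REFERENCES L (code)
) ENGINE=InnoDB;

CREATE TABLE Y (
    y_id INT NOT NULL PRIMARY KEY,
    code CHAR(2) NOT NULL,
    FOREIGN KEY (code) REFERENCES L (code)
) ENGINE=InnoDB;

-- X joins to Y on the code without touching L at all.
SELECT x.x_id, y.y_id
FROM X x
JOIN Y y ON y.code = x.code;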
Having a lookup table and an enum means you are changing values in two places all the time. Funny... we spent too many years using enums, causing issues where we needed to recompile to add values. In recent years, we have moved away from enums in many situations and are using the values in our lookup tables. The biggest thing I like about lookup tables is that you can add or change values without needing to recompile. Even with millions of rows, I would stick to the lookup tables and just be intelligent in your database design.