Naming for constraints when creating a table - sql

I see the following SQL in a system:
create table Account
(
[AccountId] int not null,
[EquityStatus] int not null constraint DF_Account_EquityStatus default(1),
[DerivativeStatus] int not null constraint DF_Account_DerivativeStatus default(1),
[HasSecurityAgreement] tinyint not null constraint DF_Account_HasSecurityAgreement default(0),
...
)
Why does it name the constraints for almost every column? Is there any benefit to doing this?

You can let your system assign a system-generated name for your constraints but then it becomes very difficult, from a maintenance perspective, if you want to alter or drop a constraint later - you generally have to perform such operations by name and if you let the system auto-generate the name, you won't know the correct name to use.
In addition, if scripts are your primary means of deployment and you have multiple environments (e.g. development, staging, test, etc) then the system generated names will differ in each environment which makes it far more difficult if you want to compare two environments to establish current differences.

Related

Having an Identity column in a table with primary key with varchar type

I was told to create an autID identity column in the table with GUID varchar(40) as the primary key and use the autID column as a reference key to help in the join process. But is that a good approach?
This causes a lot of problems like this
CREATE TABLE OauthClientInfo
(
autAppID INT IDENTITY(1,1) UNIQUE, -- must be unique to be referenced by a foreign key
strClientID VARCHAR(40) PRIMARY KEY, -- GUID
strClientSecret VARCHAR(40)
)
CREATE TABLE OAuth_AuthToken
(
autID INT IDENTITY(1,1),
strAuthToken VARCHAR(40),
autAppID_fk INT
FOREIGN KEY REFERENCES OauthClientInfo(autAppID)
)
I was told that having autAppID_fk helps with joins compared to having strClientID_fk of varchar(40), but my counter-argument is that we are unnecessarily adding a new ID as a reference, which sometimes forces extra joins.
For example, to know which strClientID a strAuthToken belongs to: if we had strClientID_fk as the reference key, the OAuth_AuthToken table data would make a lot more sense to me. Please share your views on this.
"I was told to create an autID identity column in the table with GUID varchar(40) as the primary key and use the autID column as a reference key to help in the join process. But is that a good approach?"
You were told this by someone that confuses clustering and primary keys. They are not one and the same, despite the confusing implementation of the database engine that "helps" the lazy developer.
You might get arguments about adding an identity column to every table and designating it as the primary key. I'll disagree with all of this. One does not BLINDLY do anything of this type in a schema. You do the proper analysis, identify (and enforce) any natural keys, and then you decide whether a synthetic key is both useful and needed. And then you determine which columns to use for the clustered index (because you only have one). And then you verify the appropriateness of your decisions by testing how efficient and effective your schema is under load. There are no absolute rules about how to implement your schema.
Many of your indexing (and again note - indexing and primary key are completely separate things) choices will be affected by how your tables are updated over time. Do you have hotspots that need to be minimized? Does your table experience lots of random inserts, updates, and deletes over time? Or maybe just lots of inserts but relatively few updates or deletes? These are just some of the factors that guide your decision.
You need to use the UNIQUEIDENTIFIER data type for GUID columns, not VARCHAR.
As far as I have read, an auto-increment int is the most suitable column for a clustered index.
And strClientID is the worst candidate for the PK or the clustered index.
Most importantly, you haven't mentioned the purpose of strClientID. What kind of data does it hold, and how does it get populated?

Creating a table with a field (with a foreign key) which can reference many tables and maintain referential integrity?

What is the best way of creating a table which can hold a key to a lot of other tables?
As far as I know I have two options:
1) I create a table with a lot of foreign key fields
2) I create two fields, one which indicates the referenced table and another field which holds the primary key of that table.
The latter has a lot of issues due to the fact there's no way to maintain referential integrity (because there's no foreign key to each table).
Besides a link to this table I want to add a description so I can show all notifications in a grid. By clicking a line in the grid I want to open the corresponding program and fix the issue in that program.
It's a bit hard to explain, perhaps this example explains better:
I need to create a system which handles task/notes/notifications for every program in our business application. We have invoices, sales-orders, deliveries, production-orders, etc
Our software detects that something is wrong with any of these. For instance, if the profits on a sales-order are not high enough, the order can't be validated automatically. In this case I want to create a notification for the sales-manager so that he can check out what's wrong with the sales-order.
FYI: I am using Sybase SQL Anywhere 12.
Does it make any sense?
This can be solved the other way around. Let's say you have a table Alerts where you put all kinds of alerts about bad things that happened elsewhere. You may reference this table from ALL other tables in your system and create a non-mandatory relationship from them. In short, it may look like this (I'm using MSSQL syntax):
create table Alerts(
ID int not null identity,
SomeInfoAboutTheProblem varchar(255),
constraint PK_Alerts primary key (ID)
)
create table Invoices(
ID....
AlertID int NULL,
....
constraint FK_Invoices2Alerts foreign key (AlertID) references Alerts(ID)
)
If you cannot modify your tables with business information, you may create an "extension" table for Alerts that stores some specific problem information and the actual reference to the problematic record. For example:
create table Alerts(
ID int not null identity,
SomeInfoAboutTheProblem varchar(255),
constraint PK_Alerts primary key (ID)
)
create table Alerts_for_Invoices(
AlertID int NOT NULL,
InvoiceID int NOT NULL,
SomeAdditionalInvoiceProblemInfo ....,
constraint FK_Alerts_for_Invoices2Alerts foreign key (AlertID) references Alerts(ID),
constraint FK_Alerts_for_Invoices2Invoices foreign key (InvoiceID) references Invoices(ID)
)
To show the list of problems, you may just select general information from the Alerts table; when opening the dialog, you may select all the appropriate information regarding the problem.
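The "extension table" idea above can be tried out with a small sketch. It uses Python's built-in sqlite3 module purely so that it runs anywhere; SQLite stands in for SQL Anywhere or MSSQL here, and the Total column and the function names are hypothetical, not taken from the question.

```python
import sqlite3

def build_alerts_db():
    """Create a generic Alerts table plus one extension table per business table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
    conn.executescript("""
        CREATE TABLE Invoices (
            ID INTEGER PRIMARY KEY,
            Total REAL                      -- hypothetical column
        );
        CREATE TABLE Alerts (
            ID INTEGER PRIMARY KEY,
            SomeInfoAboutTheProblem VARCHAR(255)
        );
        CREATE TABLE Alerts_for_Invoices (
            AlertID INTEGER NOT NULL,
            InvoiceID INTEGER NOT NULL,
            CONSTRAINT FK_Alerts_for_Invoices2Alerts
                FOREIGN KEY (AlertID) REFERENCES Alerts(ID),
            CONSTRAINT FK_Alerts_for_Invoices2Invoices
                FOREIGN KEY (InvoiceID) REFERENCES Invoices(ID)
        );
    """)
    return conn

def add_invoice_alert(conn, invoice_id, info):
    """Insert the generic alert and its link to the invoice in one transaction."""
    with conn:  # commits on success, rolls back if either insert fails
        cur = conn.execute(
            "INSERT INTO Alerts (SomeInfoAboutTheProblem) VALUES (?)", (info,))
        alert_id = cur.lastrowid
        conn.execute(
            "INSERT INTO Alerts_for_Invoices (AlertID, InvoiceID) VALUES (?, ?)",
            (alert_id, invoice_id))
    return alert_id
```

Calling add_invoice_alert with an InvoiceID that does not exist raises sqlite3.IntegrityError and leaves no orphaned row in Alerts, which is the referential integrity that the "table name plus key" design cannot give you.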

Identity field and primary key in SQL Server when values are unique

When a set of values that will be stored in a table have a name or a code that should be unique across the system, should it be created with a primary key of ID auto increment (int)?
Take the situation of State Abbreviations. Other than consistency, what would be the purpose of an ID on the table that was the primary key other than the state name or abbreviation?
If, for example, the foreign key from a shipping address referenced the state abbreviation, which is not mutable, then ... is there a purpose for having an auto increment int ID?
You highlighted one positive aspect of a separate table: consistency. It is much easier to have this:
CREATE TABLE dbo.States
(
StateID TINYINT PRIMARY KEY,
Name VARCHAR(32),
Abbreviation CHAR(2)
);
CREATE TABLE dbo.CustomerAddresses
(
AddressID INT PRIMARY KEY,
...,
StateID TINYINT NOT NULL FOREIGN KEY REFERENCES dbo.States(StateID)
);
Than to have a trigger or check constraint like:
CHECK (StateAbbreviation IN ('AL', 'AK', /* 50+ more states/territories... */))
Now, with something static and small like a 2-character state abbreviation, this design might make more sense, eliminating some unnecessary mapping between the abbreviations and some surrogate ID:
CREATE TABLE dbo.States
(
Abbreviation CHAR(2) PRIMARY KEY,
Name VARCHAR(32)
);
CREATE TABLE dbo.CustomerAddresses
(
AddressID INT PRIMARY KEY,
...,
StateAbbreviation CHAR(2) FOREIGN KEY REFERENCES dbo.States(Abbreviation)
);
This constrains the data to the known set of states, allows you to store the actual data in the table (which can eliminate a lot of joins in queries), actually saves you some space, and avoids having any messy hard-coded check constraints (or constraints using UDFs, or triggers validating the data).
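As a rough illustration of the natural-key design, here is a sketch using Python's built-in sqlite3 module (SQLite stands in for SQL Server, and the City column, sample rows, and function names are made up for the example). It shows the two effects described above: the lookup table constrains the values, and queries can filter on the abbreviation without a join.

```python
import sqlite3

def build_states_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is opt-in in SQLite
    conn.executescript("""
        CREATE TABLE States (
            Abbreviation CHAR(2) PRIMARY KEY,
            Name VARCHAR(32)
        );
        CREATE TABLE CustomerAddresses (
            AddressID INTEGER PRIMARY KEY,
            City VARCHAR(56),               -- hypothetical column
            StateAbbreviation CHAR(2) REFERENCES States(Abbreviation)
        );
        INSERT INTO States VALUES ('AL', 'Alabama'), ('AK', 'Alaska');
        INSERT INTO CustomerAddresses VALUES (1, 'Juneau', 'AK'), (2, 'Mobile', 'AL');
    """)
    return conn

def cities_in(conn, abbreviation):
    """No join needed: the readable key is stored in the address row itself."""
    rows = conn.execute(
        "SELECT City FROM CustomerAddresses "
        "WHERE StateAbbreviation = ? ORDER BY City", (abbreviation,))
    return [row[0] for row in rows]
```

With a surrogate StateID, the same query would have to join to States first to translate 'AK' into an ID. An insert with an unknown abbreviation such as 'ZZ' fails with sqlite3.IntegrityError, with no CHECK list to maintain.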
That all said, there is no magic blanket answer that satisfies all designs. As your string gets larger, it can make more sense to use an integer instead of just storing the string. A counter-example would be storing all of the User Agent strings from your web logs - it makes a lot of sense to store the same string once and assign an integer to it, than to store the same 255-character string over and over and over again.
Other things that can make this design troublesome:
What if you expand beyond the US later?
Forget about state abbreviations for a moment (which are pretty static); what if your lookups are things that do change frequently?
State Abbreviation is a rare example of a good non-increment primary key for the following reasons:
They are small (2-character)
They don't change
The set of values is relatively static - new records are unlikely
Just because the natural key is unique doesn't make it a good candidate for the primary key.
Even real-world values that are unique (like SSN) may not be good candidates if they are entered by humans. For example, suppose someone enters a bunch of related data for a person, then gets a letter saying that the SSN is wrong - now you can't just update the primary key - you need to update all of the foreign keys as well!
As a general rule (which may not apply in every single case), it's better to use integers as primary keys for performance reasons. So if your unique key is a string, create an autoincrement primary key.
Also, states don't have to be necessarily unique. It's true in one country but when you look at all countries in the world, same abbreviations may happen.
EDIT
I can't find very good evidence on string vs. integer performance, but take a look, e.g., here: Strings as Primary Keys in SQL Database
Having said that, there's never a lot of states so performance gain will be small in this case.

Database Design and the use of non-numeric Primary Keys

I'm currently in the process of designing the database tables for a customer & website management application. My question is in regards to the use of primary keys as functional parts of a table (and not assigning "ID" numbers to every table just because).
For example, here are four related tables from the database so far, one of which uses the traditional primary key number, the others which use unique names as the primary key:
--
-- website
--
CREATE TABLE IF NOT EXISTS `website` (
`name` varchar(126) NOT NULL,
`client_id` int(11) NOT NULL,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`notes` text NOT NULL,
`website_status` varchar(26) NOT NULL,
PRIMARY KEY (`name`),
KEY `client_id` (`client_id`),
KEY `website_status` (`website_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
--
-- website_status
--
CREATE TABLE IF NOT EXISTS `website_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `website_status` (`name`) VALUES
('demo'),
('disabled'),
('live'),
('purchased'),
('transfered');
--
-- client
--
CREATE TABLE IF NOT EXISTS `client` (
`id` int(11) NOT NULL auto_increment,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`client_status` varchar(26) NOT NULL,
`firstname` varchar(26) NOT NULL,
`lastname` varchar(46) NOT NULL,
`address` varchar(78) NOT NULL,
`city` varchar(56) NOT NULL,
`state` varchar(2) NOT NULL,
`zip` int(11) NOT NULL,
`country` varchar(3) NOT NULL,
`phone` text NOT NULL,
`email` varchar(78) NOT NULL,
`notes` text NOT NULL,
PRIMARY KEY (`id`),
KEY `client_status` (`client_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;
--
-- client_status
--
CREATE TABLE IF NOT EXISTS `client_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `client_status` (`name`) VALUES
('affiliate'),
('customer'),
('demo'),
('disabled'),
('reseller');
As you can see, 3 of the 4 tables use their 'name' as the primary key. I know that these will always be unique. In 2 of the cases (the *_status tables) I am basically using a dynamic replacement for ENUM, since status options could change in the future, and for the 'website' table, I know that the 'name' of the website will always be unique.
I'm wondering if this is sound logic, getting rid of table ID's when I know the name is always going to be a unique identifier, or a recipe for disaster? I'm not a seasoned DBA so any feedback, critique, etc. would be extremely helpful.
Thanks for taking the time to read this!
There are 2 reasons I would always add an ID number to a lookup / ENUM table:
If you are referencing a single column table with the name then you may be better served by using a constraint
What happens if you wanted to rename one of the client_status entries? e.g. if you wanted to change the name from 'affiliate' to 'affiliate user' you would need to update the client table which should not be necessary. The ID number serves as the reference and the name is the description.
In the website table, if you are confident that the name will be unique then it is fine to use as a primary key. Personally I would still assign a numeric ID as it reduces the space used in foreign key tables and I find it easier to manage.
EDIT:
As stated above, you will run into problems if the website name is renamed. By making this the primary key you will be making it very difficult if not impossible for this to be changed at a later date.
When making natural PRIMARY KEY's, make sure their uniqueness is under your control.
If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's.
Since website_status and client_status seem to be generated and used by you and only by you, it's acceptable to use them as a PRIMARY KEY, though having a long key may impact performance.
The website name seems to be under the control of the outside world, which is why I'd make it a plain field. What if they want to rename their website?
The counterexamples would be SSN and ZIP codes: it's not you who generates them and there is no guarantee that they won't be ever duplicated.
Kimberly Tripp has an excellent series of blog articles (GUIDs as PRIMARY KEYs and/or the clustering key and The Clustered Index Debate Continues) on the issue of creating clustered indexes, and choosing the primary key (related issues, but not always exactly the same). Her recommendation is that a clustered index/primary key should be:
1) Unique (otherwise useless as a key)
2) Narrow (the key is used in all non-clustered indexes, and in foreign-key relationships)
3) Static (you don't want to have to change all related records)
4) Always Increasing (so new records always get added to the end of the table, and don't have to be inserted in the middle)
Using "Name" as your key, while it seems to satisfy #1, doesn't satisfy ANY of the other three.
Even for your "lookup" table, what if your boss decides to change all affiliates to partners instead? You'll have to modify all rows in the database that use this value.
From a performance perspective, I'm probably most concerned that a key be narrow. If your website name is actually a long URL, then that could really bloat the size of any non-clustered indexes, and all tables that use it as a foreign key.
Besides all the other excellent points that have already been made, I would add one more word of caution against using large fields as clustering keys in SQL Server (if you're not using SQL Server, then this probably doesn't apply to you).
I add this because in SQL Server, the primary key on a table is by default also the clustering key (you can change that, if you want to and know about it, but in most cases it's not done).
The clustering key that determines the physical ordering of the SQL Server table is also being added to every single non-clustered index on that table. If you have only a few hundred to a few thousand rows and one or two indices, that's not a big deal. But if you have really large tables with millions of rows, and potentially lots of indices to speed up the queries, this will indeed cause a lot of disk space and server memory to be wasted unnecessarily.
E.g. if your table has 10 million rows, 10 non-clustered indices, and your clustering key is 26 bytes instead of 4 (for an INT), then you're wasting 10 million × 10 × 22 bytes, for a total of 2.2 billion bytes (approx. 2.2 GB) - that's not peanuts anymore!
Again - this only applies to SQL Server, and only if you have really large tables with lots of non-clustered indices on them.
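The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only the same rough arithmetic, assuming a fixed-width key and ignoring page overhead and compression; the function name is made up for the example.

```python
def extra_index_bytes(rows, nonclustered_indexes, key_bytes, int_key_bytes=4):
    """Extra bytes carried because every non-clustered index entry
    stores the clustering key instead of a 4-byte INT."""
    return rows * nonclustered_indexes * (key_bytes - int_key_bytes)

wasted = extra_index_bytes(rows=10_000_000, nonclustered_indexes=10, key_bytes=26)
print(f"{wasted:,} bytes (~{wasted / 1e9:.1f} GB)")  # 2,200,000,000 bytes (~2.2 GB)
```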
Marc
"If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's."
If you're absolutely sure you will never ever have uniqueness violation, then don't bother to define the key.
Personally, I think you will run into trouble using this idea. As you end up with more parent-child relationships, you end up with a huge amount of work when the names change (as they always will sooner or later). There can be a big performance hit when having to update a child table that has thousands of rows when the name of the website changes. And you have to plan for how to make sure that those changes happen. Otherwise, the website name changes (oops, we let the name expire and someone else bought it) either break because of the foreign key constraint, or you need to put in an automated way (cascade update) to propagate the change through the system. If you use cascading updates, then you can suddenly bring your system to a dead halt while a large change is processed. This is not considered to be a good thing. It really is more effective and efficient to use ids for relationships and then put unique indexes on the name field to ensure they stay unique. Database design needs to consider maintenance of the data integrity and how that will affect performance.
Another thing to consider is that website names tend to be longer than a few characters. This means the performance difference between using an id field for joins and the name for joins could be quite significant. You have to think of these things at the design phase, as it is too late to change to an ID when you have a production system with millions of records that is timing out and the fix is to completely restructure the database and rewrite all of the SQL code. Not something you can fix in fifteen minutes to get the site working again.
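A minimal sketch of the "ids for relationships, unique index on the name" approach, using Python's built-in sqlite3 module so it can be run as-is (SQLite stands in for MySQL/SQL Server; the page table, the constraint names, and the function names are hypothetical):

```python
import sqlite3

def build_website_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE website (
            id INTEGER PRIMARY KEY,
            name VARCHAR(126) NOT NULL,
            CONSTRAINT UQ_website_name UNIQUE (name)   -- name stays unique
        );
        CREATE TABLE page (                            -- hypothetical child table
            id INTEGER PRIMARY KEY,
            website_id INTEGER NOT NULL,
            path VARCHAR(255) NOT NULL,
            CONSTRAINT FK_page_website
                FOREIGN KEY (website_id) REFERENCES website(id)
        );
        INSERT INTO website (id, name) VALUES (1, 'example.com'), (2, 'example.org');
        INSERT INTO page (website_id, path) VALUES (1, '/'), (1, '/about'), (2, '/');
    """)
    return conn

def rename_website(conn, website_id, new_name):
    """Renaming touches exactly one row; child rows keep pointing at the id."""
    with conn:
        cur = conn.execute(
            "UPDATE website SET name = ? WHERE id = ?", (new_name, website_id))
    return cur.rowcount
```

A rename updates one row in website no matter how many page rows exist, while the unique constraint still rejects a rename to a name that is already taken.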
This just seems like a really bad idea. What if you need to change the value of the enum? The idea is to make it a relational database and not a set of flat files. At this point, why have the client_status table? Moreover, if you are using the data in an application, by using a type like a GUID or INT, you can validate the type and avoid bad data (in so far as validating the type). Thus, it is another of many lines to deter hacking.
I would argue that a database that is resistant to corruption, even if it runs a little slower, is better than one that isn’t.
In general, surrogate keys (such as arbitrary numeric identifiers) undermine the integrity of the database. Primary keys are the main way of identifying rows in the database; if the primary key values are not meaningful, the constraint is not meaningful. Any foreign keys that refer to surrogate primary keys are therefore also suspect. Whenever you have to retrieve, update or delete individual rows (and be guaranteed of affecting only one), the primary key (or another candidate key) is what you must use; having to work out what a surrogate key value is when there is a meaningful alternative key is a redundant and potentially dangerous step for users and applications.
Even if it means using a composite key to ensure uniqueness, I would advocate using a meaningful, natural set of attributes as the primary key, whenever possible. If you need to record the attributes anyway, why add another one? That said, surrogate keys are fine when there is no natural, stable, concise, guaranteed-to-be-unique key (e.g. for people).
You could also consider using index key compression, if your DBMS supports it. This can be very effective, especially for indexes on composite keys (think trie data structures), and especially if the least selective attributes can appear first in the index.
I think I am in agreement with cheduardo. It has been 25 years since I took a course in database design but I recall being told that database engines can more efficiently manage and load indexes that use character keys. The comments about the database having to update thousands of records when a key is changed and on all of the added space being taken up by the longer keys and then having to be transferred across systems, assumes that the key is actually stored in the records and that it does not have to be transferred across systems anyway. If you create an index on a column(s) of a table, I do not think the value is stored in the records of the table (unless you set some option to do so).
If you have a natural key for a table, even if it is changed occasionally, creating another key creates a redundancy that could result in data integrity issues and actually creates even more information that needs to be stored and transferred across systems. I work for a team that decided to store the local application settings in the database. They have an identity column for each setting, a section name, a key name, and a key value. They have a stored procedure (another holy war) to save a setting that ensures it does not appear twice. I have yet to find a case where I would use a setting's ID. I have, however, ended up with multiple records with the same section and key name that caused my application to fail. And yes, I know that could have been avoided by defining a constraint on the columns.
Here are a few points to consider before deciding on the keys for a table:
A numeric key is more suitable when you use references (foreign keys); since you are not using foreign keys, it is OK in your case to use a non-numeric key.
A non-numeric key uses more space than a numeric key, which can decrease performance.
Numeric keys make the database simpler to understand (you can easily tell the number of rows just by looking at the last row).
You NEVER know when the company you work for suddenly explodes in growth and you have to hire 5 developers overnight. Your best bet is to use numeric (integer) primary keys because they will be much easier for the entire team to work with AND will help your performance if and when the database grows. If you have to break records out and partition them, you might want to use the primary key. If you are adding records with a datetime stamp (as every table should), and there is an error somewhere in the code that updates that field incorrectly, the only way to confirm whether the record was entered in the proper sequence is to check the primary keys. There are probably 10 more TSQL or debugging reasons to use INT primary keys, not the least of which is writing a simple query to select the last 5 records entered into the table.

What is the purpose of constraint naming

What is the purpose of naming your constraints (unique, primary key, foreign key)?
Say I have a table which is using natural keys as a primary key:
CREATE TABLE Order
(
LoginName VARCHAR(50) NOT NULL,
ProductName VARCHAR(50) NOT NULL,
NumberOrdered INT NOT NULL,
OrderDateTime DATETIME NOT NULL,
PRIMARY KEY(LoginName, OrderDateTime)
);
What benefits (if any) does naming my PK bring?
Eg.
Replace:
PRIMARY KEY(LoginName, OrderDateTime)
With:
CONSTRAINT Order_PK PRIMARY KEY(LoginName, OrderDateTime)
Sorry if my data model is not the best, I'm new to this!
Here are some pretty basic reasons.
(1) If a query (insert, update, delete) violates a constraint, SQL will generate an error message that will contain the constraint name. If the constraint name is clear and descriptive, the error message will be easier to understand; if the constraint name is a random guid-based name, it's a lot less clear. Particularly for end-users, who will (ok, might) phone you up and ask what "FK__B__B_COL1__75435199" means.
(2) If a constraint needs to be modified in the future (yes, it happens), it's very hard to do if you don't know what it's named. (ALTER TABLE MyTable drop CONSTRAINT um...) And if you create more than one instance of the database "from scratch" and use system-generated default names, no two names will ever match.
(3) If the person who gets to support your code (aka a DBA) has to waste a lot of pointless time dealing with case (1) or case (2) at 3am on Sunday, they're quite probably in a position to identify where the code came from and be able to react accordingly.
To identify the constraint in the future (e.g. you want to drop it in the future), it should have a unique name. If you don't specify a name for it, the database engine will probably assign a weird name (e.g. containing random stuff to ensure uniqueness) for you.
It keeps the DBAs happy, so they let your schema definition into the production database.
When your code randomly violates some foreign key constraint, it sure as hell saves time on debugging to figure out which one it was. Naming them greatly simplifies debugging your inserts and your updates.
It helps someone to know quickly what constraints are doing without having to look at the actual constraint, as the name gives you all the info you need.
So, I know if it is a primary key, unique key or default key, as well as the table and possibly columns involved.
By correctly naming all constraints, we can quickly associate a particular constraint with our data model. This gives us two real advantages:
We can quickly identify and fix any errors.
We can reliably modify or drop constraints.
By naming the constraints you can differentiate violations of them. This is not only useful for admins and developers, but your program can also use the constraint names. This is much more robust than trying to parse the error message. By using constraint names your program can react differently depending on which constraint was violated.
Constraint names are also very useful to display appropriate error messages in the user’s language mentioning which field caused a constraint violation instead of just forwarding a cryptic error message from the database server to the user.
See my answer on how to do this with PostgreSQL and Java.
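To make the "program reacts to the constraint name" idea concrete, here is a small sketch in Python with the built-in sqlite3 module. It is not the PostgreSQL/Java approach mentioned above, just the same idea under different assumptions: SQLite puts the name of a violated CHECK constraint in the error text, so the sketch matches on that. The Orders table, the constraint names, and the messages are all made up for the example.

```python
import sqlite3

# Hypothetical mapping from constraint name to a user-facing message.
MESSAGES = {
    "CK_Orders_NumberOrdered_Positive": "Quantity must be greater than zero.",
    "CK_Orders_LoginName_NotEmpty": "A login name is required.",
}

def open_orders_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE Orders (
            LoginName VARCHAR(50) NOT NULL,
            NumberOrdered INT NOT NULL,
            CONSTRAINT CK_Orders_NumberOrdered_Positive CHECK (NumberOrdered > 0),
            CONSTRAINT CK_Orders_LoginName_NotEmpty CHECK (LoginName <> '')
        )
    """)
    return conn

def place_order(conn, login_name, number_ordered):
    """Return 'ok', or a friendly message chosen by the violated constraint's name."""
    try:
        with conn:
            conn.execute("INSERT INTO Orders VALUES (?, ?)",
                         (login_name, number_ordered))
        return "ok"
    except sqlite3.IntegrityError as exc:
        for name, message in MESSAGES.items():
            if name in str(exc):  # SQLite includes the CHECK constraint name
                return message
        raise  # unknown constraint: let the caller see the original error
```

Other database drivers may expose the constraint name as a structured field rather than in the message text, which is more robust than matching on a string; with system-generated names, neither approach is practical.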
While the OP's example used a permanent table, just remember that named constraints on temp tables behave like named constraints on permanent tables (i.e. you can't have multiple sessions with the exact same code handling the temp table without it generating an error, because the constraints are named the same). Because named constraints must be unique, if you absolutely must name a constraint on a temp table, try to do so with some sort of randomized GUID (like SELECT NEWID() ) on the end of it to ensure that it will be uniquely named across sessions.
Another good reason to name constraints is if you are using version control on your database schema. In this case, if you have to drop and re-create a constraint using the default database naming (in my case SQL Server) then you will see differences between your committed version and the working copy because it will have a newly generated name. Giving an explicit name to the constraint will avoid this being flagged as a change.