Unique Constraint column can only contain one NULL value - sql

A Unique Constraint can be created upon a column that can contain NULLs. However, at most, only a single row may ever contain a NULL in that column.
I do not understand why this is the case since, by definition, a NULL is not equal to another NULL (since NULL is really an unknown value and one unknown value does not equal another unknown value).
My questions:
1. Why is this so?
2. Is this specific to MsSQL?
I have a hunch that it is because a Unique Constraint can act as a reference field for a Foreign Key, and the FK would otherwise not know which record in the referenced table it was referring to if more than one record with NULL existed. But it is just a hunch.
(Yes, I understand that UCs can be across multiple columns, but that doesn't change the question; rather it just complicates it a bit.)

Yes, it's "specific" to Microsoft SQL Server, in that some other database systems take the opposite approach (the one you expected, and the one defined in the ANSI standard), although I believe there are other database systems that behave the same way as SQL Server.
If you're working on a version of SQL Server that supports filtered indexes, you can apply one of those:
CREATE UNIQUE INDEX IX_T ON [Table] ([Column]) WHERE [Column] IS NOT NULL
(But note that this index cannot be the target of an FK constraint)
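As a rough sketch of how that filtered index behaves, assuming SQL Server 2008 or later, default session SET options, and a hypothetical table dbo.T with a nullable column Col (these names are illustrative, not from the question):

```sql
-- Hypothetical table; names are placeholders for illustration only.
CREATE TABLE dbo.T (
    Id  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Col INT NULL
);

-- Filtered unique index: only rows with a non-NULL Col are indexed and checked.
CREATE UNIQUE INDEX IX_T_Col ON dbo.T (Col) WHERE Col IS NOT NULL;

INSERT INTO dbo.T (Col) VALUES (NULL);  -- accepted
INSERT INTO dbo.T (Col) VALUES (NULL);  -- accepted: NULL rows fall outside the index
INSERT INTO dbo.T (Col) VALUES (1);     -- accepted
INSERT INTO dbo.T (Col) VALUES (1);     -- rejected: duplicate key in IX_T_Col
```

With a plain UNIQUE constraint on Col instead of the filtered index, the second NULL insert would be the one that fails.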
The "Why" of it really just comes down to, that's how it was implemented long ago (possibly pre-standards) and it's one of those awkward situations where to change it now could potentially break a lot of existing systems.
Re: Foreign Keys - you would be correct, if it wasn't for the fact that a NULL value in a foreign key column causes the foreign key not to be checked - there's no way (in SQL Server) to use NULL as an actual key.

Yes it's a SQL Server feature (and a feature of a few other DBMSs) that is contrary to the ISO SQL standard. It perhaps doesn't make much sense given the logic applied to nulls in other places in SQL - but then the ISO SQL Standard isn't very consistent about its treatment of nulls either. The behaviour of nullable uniqueness constraints in Standard SQL is not very helpful. Such constraints aren't necessarily "unique" at all because they permit duplicate rows. E.g., the constraint UNIQUE(foo,bar) permits the following rows to exist simultaneously in a table:
foo bar
------ ------
999 NULL
999 NULL
(!)
Avoid nullable uniqueness constraints. It's usually straightforward to move the columns to a new table as non-nullable columns and put the uniqueness constraint there. The information that would have been represented by populating those columns with nulls can (presumably) be represented by simply not populating those columns in the new table at all.
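A minimal sketch of that redesign, assuming a hypothetical Person table whose optional PassportNo would otherwise have been a nullable unique column (all names here are invented for illustration):

```sql
-- Parent table: no nullable unique column here.
CREATE TABLE Person (
    PersonId INT          NOT NULL PRIMARY KEY,
    Name     VARCHAR(100) NOT NULL
);

-- The optional attribute lives in its own table as a NOT NULL column.
-- A person without a passport simply has no row here.
CREATE TABLE PersonPassport (
    PersonId   INT         NOT NULL PRIMARY KEY REFERENCES Person (PersonId),
    PassportNo VARCHAR(20) NOT NULL UNIQUE
);
```

The uniqueness constraint now only ever sees real values, so its behaviour no longer depends on how a particular DBMS treats NULLs.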

Related

Confusing t-sql exam answer about sequence or uniqueidentifier

I found a t-sql question and its answer. It is too confusing. I could use a little help.
The question is:
You develop a database application. You create four tables. Each table stores different categories of products. You create a Primary Key field on each table.
You need to ensure that the following requirements are met:
The fields must use the minimum amount of space.
The fields must be an incrementing series of values.
The values must be unique among the four tables.
What should you do?
A. Create a ROWVERSION column.
B. Create a SEQUENCE object that uses the INTEGER data type.
C. Use the INTEGER data type along with IDENTITY
D. Use the UNIQUEIDENTIFIER data type along with NEWSEQUENTIALID()
E. Create a TIMESTAMP column.
The said answer is D. But, I think the more suitable answer is B. Because sequence will use less space than GUID and it satisfies all the requirements.
D is a wrong answer, because NEWSEQUENTIALID doesn't guarantee "an incrementing series of values" (second requirement).
NEWSEQUENTIALID()
Creates a GUID that is greater than any GUID previously generated by this function on a specified computer since Windows was started. After restarting Windows, the GUID can start again from a lower range, but is still globally unique.
I'd say that B (sequence) is the correct answer. At least, you can use a sequence to fulfil all three requirements, if you don't restart/recycle it manually. I think it is the easiest way to meet all three requirements.
Of the choices provided, B is the correct answer, since it is the only one that meets all the requirements:
ROWVERSION is a bad choice for a primary key, as stated in MSDN:
Every time that a row with a rowversion column is modified or inserted, the incremented database rowversion value is inserted in the rowversion column. This property makes a rowversion column a poor candidate for keys, especially primary keys. Any update made to the row changes the rowversion value and, therefore, changes the key value. If the column is in a primary key, the old key value is no longer valid, and foreign keys referencing the old value are no longer valid.
TIMESTAMP is deprecated, as stated in that same page:
The timestamp syntax is deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
An IDENTITY column does not guarantee uniqueness unless all its values are only ever generated automatically (you can use SET IDENTITY_INSERT to insert values manually), nor does it guarantee uniqueness between tables for any value.
A GUID is practically guaranteed to be unique per system, so if a GUID is the primary key for all 4 tables it ensures uniqueness across all tables. The one requirement it doesn't fulfill is storage size: its storage size is quadruple that of an int (16 bytes instead of 4).
A SEQUENCE, when it is not declared with CYCLE, guarantees uniqueness and has the lowest storage size.
The sequence of numeric values is generated in an ascending or descending order at a defined interval and can be configured to restart (cycle) when exhausted.
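For illustration, one way option B might look, assuming SQL Server 2012 or later. The object names are hypothetical, and only two of the four category tables are shown since the others would follow the same pattern:

```sql
-- One sequence shared by every category table.
CREATE SEQUENCE dbo.ProductKeySeq
    AS INT
    START WITH 1
    INCREMENT BY 1
    NO CYCLE;   -- never wrap around, so generated values are not reused

CREATE TABLE dbo.CategoryA (
    Id   INT NOT NULL
         CONSTRAINT DF_CategoryA_Id DEFAULT (NEXT VALUE FOR dbo.ProductKeySeq)
         CONSTRAINT PK_CategoryA PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.CategoryB (
    Id   INT NOT NULL
         CONSTRAINT DF_CategoryB_Id DEFAULT (NEXT VALUE FOR dbo.ProductKeySeq)
         CONSTRAINT PK_CategoryB PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL
);

-- Each insert draws its key from the same sequence.
INSERT INTO dbo.CategoryA (Name) VALUES (N'Widget');
INSERT INTO dbo.CategoryB (Name) VALUES (N'Gadget');
```

Note that the default only supplies a value when none is given. Nothing here stops someone from inserting an explicit Id that collides with a key in another table, so cross-table uniqueness holds only as long as all keys come from the sequence.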
However,
I would actually probably choose a different option altogether: create a base table with a single identity column and link it with a 1:1 relationship to all the category tables. Then use an INSTEAD OF INSERT trigger on each category table that first inserts a record into the base table, then uses SCOPE_IDENTITY() to get the value and inserts it as the primary key for the category table.
This will enforce uniqueness as well as make it possible to use a single foreign key reference between the categories and products.
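A rough sketch of the base-table approach, with hypothetical names and two simplifications that are assumptions of this example rather than part of the suggestion: the trigger only handles single-row inserts, and the caller passes a placeholder Id that the trigger ignores. GO is the batch separator used by tools such as SSMS and sqlcmd.

```sql
-- Base table: its IDENTITY column is the single source of key values.
CREATE TABLE dbo.ProductBase (
    Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY
);

-- One category table; the other three would look the same.
CREATE TABLE dbo.CategoryA (
    Id   INT NOT NULL PRIMARY KEY REFERENCES dbo.ProductBase (Id),
    Name NVARCHAR(100) NOT NULL
);
GO

CREATE TRIGGER dbo.trg_CategoryA_Insert ON dbo.CategoryA
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Simplification: SCOPE_IDENTITY() yields one value, so refuse multi-row inserts.
    IF (SELECT COUNT(*) FROM inserted) <> 1
    BEGIN
        RAISERROR('This sketch supports single-row inserts only.', 16, 1);
        RETURN;
    END;

    -- Allocate a new key in the base table...
    INSERT INTO dbo.ProductBase DEFAULT VALUES;

    -- ...and use it as the category row's primary key, ignoring the supplied Id.
    INSERT INTO dbo.CategoryA (Id, Name)
    SELECT SCOPE_IDENTITY(), Name
    FROM inserted;
END;
GO

-- The 0 is a placeholder; the trigger replaces it with the generated key.
INSERT INTO dbo.CategoryA (Id, Name) VALUES (0, N'Widget');
```

A production version would need to handle multi-row inserts, for example by capturing the generated keys with an OUTPUT clause.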
The issue has been discussed extensively in the past, in general:
http://blog.codinghorror.com/primary-keys-ids-versus-guids/
The constraint #3 is why a SEQUENCE could run into issues as there is a higher risk of collision/lowered number of possible rows in each table.

Using VARCHAR as PRIMARY KEY for an 'ORPHAN' table

I'm to create an orphan table (no relationships with any other table whatsoever) that contains 3 columns.
Col1 - String field - VARCHAR(32) - Contains unique data not more than 32 characters
Col2 - String field - TEXT - Contains larger non-unique data of characters
Col3 - Numeric (Bool) - INT(1) - 0/1 for Flagging
I'm thinking of using Col1 as my PRIMARY KEY. I have done some research and see people argue that using a meaningless INT column as a PRIMARY KEY to avoid Foreign Key/Storage issues is the way to go.
However, IMO, since this is an orphan table, it should not matter. Besides, I would require to place an INDEX on Col1 anyway.
As a side note, I'm not expecting more than ~1000 rows in this table.
Thoughts please.
I'd still just use an INT PK and put an index on COL1. I suppose you could use COL1 as the index if you can ensure that nothing will ever be joined to that table, but if nothing else the index will give you an idea of the order in which items are added/deleted from the table. I also like to add an IsActive boolean so that you never delete anything and a DateCreated datetime to almost every table.
If col1 is your real primary key, there is no reason not to use it. Especially if the table is that tiny.
You would need to maintain a unique index on that column anyway, so by adding an artificial primary key you just add more overhead for insert and delete operations (as two indexes must be maintained).
Unless you are referencing that PK from really, really many other rows (and other tables) you should just go with what is the natural primary for your business rules.
I see where you are coming from but it just makes sense to index the first column anyway. It may be because I am used to excel but the usefulness of the initial column for a primary key also has an order to it along with readability while debugging or capturing data. If you use a more random generated number you still would be searching through a few hundred rows looking for a hard to distinguish key. In the end I highly recommend the extra column of ints. It is well worth it.
Whenever I design database tables I keep my INT column.
I believe it's faster to compare numbers than strings.
So it all depends on how often you will query the database and compare strings there.
I'm still unclear what the question is. Judging from the answers, I've deduced it down to two plausible questions:
Is it okay to use a VARCHAR instead of an INTEGER as a primary key?
Yes it is okay to use a VARCHAR instead.
In many cases it is preferred, especially if your table is expected to grow beyond 2,147,483,647 records (yes this happens). Performance-wise, even if INTs had a minimal speed advantage, on a ~1000 record table, you would not see it. Designated PKs are indexed by default. The one problem is that you'll lose any auto-generating sequence that the database can do for you.
Is it okay to use your unique COL1 field as a primary key, instead of some other unique ID field?
Yes it is okay.
The whole notion of having a primary key is to establish a unique field. What you're losing, though, may be some intrinsic comprehension. When other users want to join on that table, it's far easier to understand that id is a unique field, whereas col1 (some varchar) may or may not be unique.
In your given scenario, it should be okay. If the scope does grow up then you can always introduce an auto_increment PK column. Just make sure that your field is both indexed and unique.

Primary key on a column

I have a table with 2 columns and 2 records. Column1 will never change. Column2 might change, but the table will only ever have 2 records.
Column1   Column2
-------   -------------------------------
Missing   \\sqlserver\destination\missing
Invalid   \\sqlserver\destination\invalid
I am a little confused about where to put the primary key on this table, as there is no Id column. Which column should be the primary key? Or do I have to add one more column with identity and put the primary key on that?
Thanks
Yes. The PK column can never contain duplicates. It doesn't have to be an integer however, but it needs to be a unique non-null column.
The criteria for choosing candidate and primary keys are:
uniqueness, irreducibility, stability, simplicity and familiarity
From what you have written, Column1 is definitely a candidate key. It has all 5 of the above criteria.
Column2 might be a candidate key if the two values in the table must always be unique. However, it is not stable so Column1 is a better key to choose for foreign key references to the table (primary key).
You could create a 3rd numeric column. Since you constrain the table to 2 rows, it makes little difference whether the new column has a system-maintained sequence (identity attribute).
Column1 has familiarity and the new column would not. At a logical level of discourse, both Column1 and this new column are equally simple. Physically, a 7 character string is at least as large as a 64-bit number so a 32-bit number occupies less space.
However, if you choose to add a new column due to physical size, I would consider a char(1) column with 'M' for missing or 'I' for invalid, which would still have all 5 criteria while occupying less physical space in referencing tables.
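A small sketch of that char(1) variant, assuming SQL Server and hypothetical table and column names (the question only calls the columns Column1 and Column2):

```sql
-- Hypothetical lookup table keyed on a one-character code.
CREATE TABLE dbo.Destination (
    Code CHAR(1)      NOT NULL PRIMARY KEY
         CHECK (Code IN ('M', 'I')),      -- restricts the table to the two known rows
    Name VARCHAR(7)   NOT NULL UNIQUE,    -- Column1 remains a candidate key
    Path VARCHAR(260) NOT NULL            -- Column2, the value that may change
);

INSERT INTO dbo.Destination (Code, Name, Path) VALUES
    ('M', 'Missing', '\\sqlserver\destination\missing'),
    ('I', 'Invalid', '\\sqlserver\destination\invalid');
```

Referencing tables would then carry the one-byte Code rather than the seven-character name.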
IMHO, you should "always" put an id key that then becomes the primary key.
"always" is in quotes because it's possible to argue for cases when it's not required, but generally this is the way to go, and certainly it's safe to say it is the default approach, and any deviation from it should be investigated rigorously for its benefits.
There is an argument for "natural" keys; that is to say, you put the primary key on the field that is guaranteed to be unique and never change. But, in my experience, almost everything does end up changing, so it's safer to go with a built-in default auto-incrementing ID.
You don't need a primary key on a table that has only two records. Primary key is meant for increasing query speed; with 2 records you will hardly see any difference.
Edit:
In response to the comments, I'd like to point out that no mainstream DB vendor enforces the use of primary keys. There is a reason for them being optional: unless the primary key is required by functionality, it doesn't belong there; YAGNI.
On a table that small the creation of an index would actually slow things down. Indexes are stored as a B-tree, so a lookup on a table this small will cause more reads than a table scan unless the index is clustered. A clustered index would actually "be" the table, but again on such a small table the cost/benefit is moot.
I like always having unique rows, but in this case just leaving the table unindexed (known as a "heap") might actually be the most efficient. I wouldn't throw an index on the table unless you need to enforce constraints. Indexing it for query performance isn't going to do anything for you with a table this small.
If you are required to put a PK on the table for some other reason, then I would say put it on the first column, as it is the shortest and least likely to change, and it looks like this table is basically just used as a look-up anyway.

What GUID can I use as a placeholder

I have a database table that has a non null column of type uniqueidentifier. This was put in place for use in the near future. But for now, I need to use some placeholder. Can I simply use:
00000000-0000-0000-0000-000000000000
for all the rows until a real guid is used when new rows are inserted in the future? Does SQL Server enforce uniqueness on this column?
SQL Server will enforce uniqueness, IF and only if you put a unique constraint or unique index on that field. Otherwise, SQL Server will only enforce that the value must be NOT NULL.
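To make that concrete, a short sketch assuming a hypothetical table dbo.Item with a placeholder column ExternalRef (both names invented for this example):

```sql
CREATE TABLE dbo.Item (
    ItemId      INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ExternalRef UNIQUEIDENTIFIER  NOT NULL
        CONSTRAINT DF_Item_ExternalRef
        DEFAULT ('00000000-0000-0000-0000-000000000000')
);

-- No unique constraint yet, so repeated placeholder values are accepted.
INSERT INTO dbo.Item DEFAULT VALUES;
INSERT INTO dbo.Item DEFAULT VALUES;

-- Adding uniqueness later only succeeds once the duplicates are gone;
-- run against the two rows above, this statement would fail.
ALTER TABLE dbo.Item
    ADD CONSTRAINT UQ_Item_ExternalRef UNIQUE (ExternalRef);
```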
As marc_s says, you can do that because uniqueness is not enforced for uniqueidentifiers, even in the same column of a table, without an explicitly declared unique index/constraint (after all, two rows can legitimately have the same foreign key).
IF this is just a temporary bootstrap, and in the future only real GUIDs (NOT NULL) are going to be allowed, I think this is an OK workaround to avoid generating GUIDs which just need to be replaced later and so not needing to keep a separate partially-initialized flag column or table of the temporary rows so you can fill in the appropriate GUIDs later.
However, from a design point of view, I'm more concerned about the semantics of what this special reserved GUID is, and why allowing a special reserved value is OK, but NULLs are not. Like I said, if it's just temporary and in the steady state, you don't want to ever allow NULLs OR this special reserved 0, that's fine, but if you are going to continue to allow this special reserved GUID in steady state operations, I think that raises design questions.
Is it meant to be a foreign key? If so, NULLs can be used (but a reserved key value like 0 which is not in the referenced table cannot). If it's a loose association, storing the GUID in this table might not be a great design.

Should I use an ENUM for primary and foreign keys?

An associate has created a schema that uses an ENUM() column for the primary key on a lookup table. The table turns a product code "FB" into its name "Foo Bar".
This primary key is then used as a foreign key elsewhere. And at the moment, the FK is also an ENUM().
I think this is not a good idea. This means that to join these two tables, we end up with four lookups. The two tables, plus the two ENUM(). Am I correct?
I'd prefer to have the FKs be CHAR(2) to reduce the lookups. I'd also prefer that the PKs were also CHAR(2) to reduce it completely.
The benefit of the ENUM()s is to get constraints on the values. I wish there was something like: CHAR(2) ALLOW('FB', 'AB', 'CD') that we could use for both the PK and FK columns.
What is:
Best practice
Your preference
This concept is used elsewhere too. What if the ENUM()'s values are longer? ENUM('Ding, dong, dell', 'Baa baa black sheep'). Now the ENUM() is useful from a space point-of-view. Should I only care about this if there are several million rows using the values? In which case, the ENUM() saves storage space.
ENUM should be used to define a possible range of values for a given field. This also implies that you may have multiple rows which have the same value for this particular field.
I would not recommend using an ENUM for a primary key type or foreign key type.
Using an ENUM for a primary key means that adding a new key would involve modifying the table since the ENUM has to be modified before you can insert a new key.
I am guessing that your associate is trying to limit who can insert a new row and that number of rows is limited. I think that this should be achieved through proper permission settings either at the database level or at the application and not through using an ENUM for the primary key.
IMHO, using an ENUM for the primary key type violates the KISS principle.
But when you are only dealing with 10 or fewer distinct rows, that won't be a problem.
E.g.:
CREATE TABLE `grade`(
`grade` ENUM('A','B','C','D','E','F') PRIMARY KEY,
`description` VARCHAR(50) NOT NULL
)
With this table it is more than difficult to run DML.
We've had more discussion about it and here's what we've come up with:
Use CHAR(2) everywhere, for both the PK and the FK. Then use MySQL's foreign key constraints to disallow creating an FK to a row that doesn't exist in the lookup table.
That way, given the lookup table is L, and two referring tables X and Y, we can join X to Y without any looking up of ENUM()s or table L and can know with certainty that there's a row in L if (when) we need it.
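A minimal sketch of that layout, assuming MySQL with InnoDB (the engine that enforces foreign keys). The tables are named l, x and y after the letters used above; the column names are hypothetical:

```sql
-- Lookup table L: CHAR(2) code as the primary key.
CREATE TABLE l (
    code CHAR(2)     NOT NULL PRIMARY KEY,
    name VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

-- Referring tables X and Y carry the same CHAR(2) code as a foreign key.
CREATE TABLE x (
    id           INT     NOT NULL AUTO_INCREMENT PRIMARY KEY,
    product_code CHAR(2) NOT NULL,
    FOREIGN KEY (product_code) REFERENCES l (code)
) ENGINE=InnoDB;

CREATE TABLE y (
    id           INT     NOT NULL AUTO_INCREMENT PRIMARY KEY,
    product_code CHAR(2) NOT NULL,
    FOREIGN KEY (product_code) REFERENCES l (code)
) ENGINE=InnoDB;

INSERT INTO l (code, name) VALUES ('FB', 'Foo Bar');
INSERT INTO x (product_code) VALUES ('FB');   -- accepted: 'FB' exists in l
INSERT INTO x (product_code) VALUES ('ZZ');   -- rejected: no such row in l

-- X and Y join directly on the code, without touching l.
SELECT x.id AS x_id, y.id AS y_id
FROM x
JOIN y ON y.product_code = x.product_code;
```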
I'm still interested in comments and other thoughts.
Having a lookup table and an enum means you are changing values in two places all the time. Funny... We spent too many years using enums, which caused issues where we needed to recompile to add values. In recent years, we have moved away from enums in many situations and now use the values in our lookup tables. The biggest benefit of lookup tables, to me, is that you can add or change values without needing to recompile. Even with millions of rows I would stick to the lookup tables and just be intelligent in your database design.