I'm designing a database schema, and I'm wondering what criteria I should use for deciding whether each column should be nullable or not.
Should I mark as NOT NULL only those columns that absolutely must be filled out for a row to make any sense at all to my application?
Or should I mark all columns that I intend to never be null?
What are the performance implications of small vs large numbers of NOT NULL columns?
I assume lots of NOT NULL columns would slow down inserts a bit, but it might actually speed up selects, since the query execution plan generator has more information about the columns.
Can someone with more knowledge than me give me the low-down?
Honestly, I've always thought NOT NULL should be the default. NULL is the odd special case, and you should make a case for it whenever you use it. Plus it's much easier to change a column from NOT NULL to nullable than it is to go the other way.
There are no significant performance consequences. Don't even think about considering this as an issue. To do so is a classic premature-optimization antipattern.
"Should I only mark as NOT NULL only those columns that absolutely must be filled out for a row to make any sense at all to my application?"
Yes. It's as simple as that. You're a lot better off with a NULLable column without any NULL values in it, than with the need for NULLs and having to fake it. And anyway, any ambiguous cases are better filtered out in your Business Rules.
EDIT:
There's another argument for nullable fields that I think is ultimately the most compelling, which is the Use Case argument. We've all been subject to data entry forms that require values for some fields; and we've all abandoned forms where we had no sensible values for required fields. Ultimately, the application, the form, and the database design are only defensible if they reflect the user requirements; and it's clear that there are many, many database columns for which users can present no value - sometimes at given points in the business process, sometimes ever.
Err on the side of NOT NULL. You will, at some point, have to decide what NULL "means" in your application - more than likely, it will be different things for different columns. Some of the common cases are "not specified", "unknown", "inapplicable", "hasn't happened yet", etc. You will know when you need one of those values, and then you can appropriately allow a NULLable column and code the logic around it.
Allowing random things to be NULL is, sooner or later, always a nightmare IME. Use NULL carefully and sparingly - and know what it means in your logic.
Edit: There seems to be an idea that I'm arguing for NO null columns, ever. That's ridiculous. NULL is useful, but only where it's expected.
Le Dorfier's DateOfDeath example is a good example. A NULL DateOfDeath would indicate "not happened yet". Now, I can write a view LivingPersons WHERE DateOfDeath IS NULL.
But, what does a NULL OrderDate mean? That the order wasn't placed yet? Even though there's a record in the Order table? How about a NULL address? Those are the thoughts that should go through your head before you let NULL be a value.
Back to DateOfDeath - a query of persons WHERE DateOfDeath > '1/1/1999' would not return the NULL records - even though we logically know they must die after 1999. Is that what you want? If not, then you better include OR DateOfDeath IS NULL in that query. If you allow all columns to be NULL, you have to think about that every single time you write a query. IME, that's too much of a mental tax for the 10% or so of columns that actually have legit meaning when they're NULL.
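A minimal sketch of that trap, assuming a hypothetical Persons table with a nullable DateOfDeath column:

-- This silently drops the still-living (NULL) rows:
SELECT Name FROM Persons WHERE DateOfDeath > '1/1/1999'
-- To include them, the NULL case has to be spelled out every time:
SELECT Name FROM Persons WHERE DateOfDeath > '1/1/1999' OR DateOfDeath IS NULL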
I have found marking a column as NOT NULL is usually a good idea unless you have a useful meaning for NULL in the column. Otherwise you may unexpectedly find NULL in there later, when you realise you don't want it, and changing it is harder.
I try to avoid using NULLs in the database as much as possible. This means that character fields are always not null. Same for numeric fields, especially anything representing money or similar (shares, units, etc).
I have 2 exceptions:
Dates where the date might not be known (eg. DivorcedOn)
Optional foreign key relationships (MarriedToPersonId). Though on occasion I have used "blank" rows in the foreign key table and made the relationship mandatory (eg. JobDescriptionCode)
I have also on occasion used explicit bit fields for "unknown"/"not set" (eg. JobDescriptionCode and IsEmployed).
I have a few core reasons why:
NULLs will always cause problems in numeric fields. Always. Always. Always. Doesn't matter how careful you are; at some point select X + Y as Total is going to happen, and it will return NULL.
NULLs can easily cause problems in string fields, typically address fields (eg. select AddrLine1 + AddrLine2 from Addresses).
Guarding against NULLs in the business logic tier is a tedious waste of effort... just don't let them into the DB and you can save hundreds of lines of code.
My preferred defaults:
Strings -> "", aka an empty string
Numbers -> 0
Dates -> Today or NULL (see exception #1)
Bit -> false
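A sketch of what those defaults look like in a table definition (T-SQL syntax; the Addresses table and its columns are made up for illustration):

create table Addresses (
    Id int identity primary key,
    AddrLine1 varchar(100) not null default (''),   -- string -> empty string
    AddrLine2 varchar(100) not null default (''),
    Units decimal(10,2) not null default (0),       -- number -> 0
    IsActive bit not null default (0),              -- bit -> false
    CreatedOn datetime not null default (GetDate()) -- date -> today
)
-- With these in place, select AddrLine1 + AddrLine2 can never come back NULL.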
You may find Chris Date's Database In Depth a useful resource for these kinds of questions. You can get a taste for his ideas in this interview, where he says among other things:
So yes, I do think SQL is pretty bad. But you explicitly ask what its major flaws are. Well, here are a few:
Duplicate rows
Nulls
Left-to-right column ordering
Unnamed columns and duplicate column names
Failure to support "=" properly
Pointers
High redundancy
In my own experience, nearly all "planned nulls" can be represented better with a child table that has a foreign key to a base table. Participating in the child table is optional, and that's where the null/not null distinction is actually made.
This maps well to the interpretation of a relation as a first-order logic proposition. It also is just common sense. When one does not know Bob's address, does one write in one's Rolodex:
Bob. ____
Or does one merely refrain from filling out an address card for Bob until one has an actual address for him?
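A sketch of that pattern, using hypothetical Person/PersonAddress tables; the address lives in an optional child table instead of a nullable column:

create table Person (
    PersonId int primary key,
    Name varchar(100) not null
)
-- A row exists here only when an address is actually known:
create table PersonAddress (
    PersonId int primary key foreign key references Person,
    Address varchar(200) not null
)
-- "Who has no address on file?" becomes an anti-join, not a NULL test:
select p.Name
from Person p
where not exists (select 1 from PersonAddress a where a.PersonId = p.PersonId)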
Edit: Date's argument appears on pages 53-55 of Database In Depth, under the section heading "Why Nulls are Prohibited."
I lean toward NOT NULL unless I see a reason otherwise -- like someone else said, like it or not, NULL is the weird special case.
One of my favorites in regards to NULL is:
SELECT F1 FROM T WHERE F2 <> 'OK'
...which (in DB2 at least) won't include any rows where F2 is null -- because in relational jargon, (NULL <> 'OK') evaluates to NULL. But your intent was to return all not-OK rows. You need an extra OR predicate, or to write F2 IS DISTINCT FROM 'OK' instead (which is special-case coding in the first place).
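Both fixes sketched out; note that IS DISTINCT FROM is standard SQL but support varies by DBMS, so treat its availability as an assumption about your platform:

-- Fix 1: handle the NULL case with an extra predicate:
SELECT F1 FROM T WHERE F2 <> 'OK' OR F2 IS NULL
-- Fix 2: a null-safe comparison, where the DBMS supports it:
SELECT F1 FROM T WHERE F2 IS DISTINCT FROM 'OK'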
IMO, NULL is just one of those programmer's tools, like pointer arithmetic or operator overloading, that requires as much art as science.
Joe Celko writes about this in SQL For Smarties -- the trap of using NULL in an application is that its meaning is, well, undefined. It could mean unknown, uninitialized, incomplete, not applicable -- or as in the dumb example above, does it mean OK or not-OK?
Thanks for all the great answers, guys. You gave me a lot to think about, and helped me form my own opinion/strategy, which boils down to this:
Allow nulls if-and-only-if a null in that column would have a specific meaning to your application.
A couple of common meanings for null:
Anything that comes directly from the user
Here null means "user did not enter"
For these columns, it's better to allow nulls, or you'll just get asdasd@asd.com type input anyway.
Foreign keys for "0 or 1" relationships
null means "no related row"
So allow nulls for these columns
This one is controversial, but this is my opinion.
In general, if you cannot think of a useful meaning for null in a column, it should be NOT NULL. You can always change it to nullable later.
Example of the sort of thing I ended up with:
create table SalesOrderLine (
Id int identity primary key,
-- a line must have exactly one header:
IdHeader int not null foreign key references SalesOrderHeader,
LineNumber int not null, -- a line must have a line number
IdItem int not null, -- cannot have null item
Quantity decimal not null, -- maybe could sell 0, but not null
UnitPrice decimal not null, -- price can be 0, but not null
-- a null delivery address means not for delivery:
IdDeliveryAddress int foreign key references Address,
Comment varchar(100), -- null means user skipped it
Cancelled bit not null default (0) -- true boolean, not three-state!
Delivered datetime, -- null means not yet delivered
Logged datetime not null default (GetDate()) -- must be filled out
)
I would tend to agree with dorfier.
Be serious in your application about being flexible when receiving database NULL values and treating them as empty values, and you give yourself a lot of flexibility to let NULLs get inserted for values you don't specify.
There's probably a lot of cases where you need some very serious data integrity (and/or the intense speed optimization of disallowing NULL fields) but I think that these concerns are tempered against the extra effort it takes to make sure every field has a default value and/or gets set to a sensible value.
Stick with NOT NULL on everything until someone squeaks with pain about it. Then remove it on one column at a time, as reluctantly as possible. Avoid nulls in your DB as much as you can, for as long as you can.
Personally I think you should mark columns as NULL or NOT NULL based on what kind of data they contain, whether there is a genuine requirement for the data to always be there, and whether the data is always known at the time of input. Marking a column as NOT NULL when the users don't have the data will force them to make up the data, which makes all your data useless (this is how you end up with junk data such as an email field containing "thisissilly@Ihatethisaplication.com"). Failing to require something that must be there for the process to work (say, the key field showing which customer made the order) is equally stupid. NULL versus NOT NULL is a data integrity issue at heart; do what makes the most sense toward keeping your data usable.
If you can think long term, having NULLs in a column affects how you can design your queries. Whether you use CASE statements, COALESCE, or have to explicitly test for NULL values can make the decision for you.
From a performance standpoint, it's faster to not have to worry about NULLS. From a design standpoint, using NULL is an easy way to know that an item has never been filled in. Useful examples include "UpdatedDateTime" columns. NULL means an item has never been updated.
Personally I allow NULLs in most situations.
What are the performance implications of small vs large numbers of NOT NULL columns?
This may be stating the obvious, but, when a column is nullable, each record will require 1 extra bit of storage. So a BIT column will consume 100% more storage when it is nullable, while a UNIQUEIDENTIFIER will consume only 0.8% more storage when it is nullable.
In the pathological case, if your database has a single table consisting of a single BIT column, the decision to make that column nullable would cut your database's performance in half. However, under the vast majority of real-world scenarios, nullability will not have a measurable performance impact.
Using 'Not Null' or 'Null' should be primarily driven by your particular persistence requirements.
Having a value be nullable means there are two or three states (three states with bit fields).
For instance, if I had a bit field called 'IsApproved' whose value is set at a later stage than insertion, then there are three states:
'IsApproved' Not answered
'IsApproved' Is Approved
'IsApproved' Is Not Approved
So if a field can legitimately be considered Not Answered, and there is no suitable default value, the field should be considered for being nullable.
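A minimal sketch of those three states, assuming a hypothetical Requests table:

create table Requests (
    Id int primary key,
    IsApproved bit null -- NULL = not answered yet
)
select Id from Requests where IsApproved is null -- not answered
select Id from Requests where IsApproved = 1     -- approved
select Id from Requests where IsApproved = 0     -- not approved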
Any nullable column is a violation of third normal form.
But, that's not an answer.
Maybe this is: there are two types of columns in databases - ones that hold the structure of the data, and ones that hold the content of the data. Keys are structure, user-enterable fields are data. Other things - well - it's a judgment call.
Stuff that's structure, that is used in join clauses, is typically not null. Stuff that's data is typically nullable.
When you have a column that hold one of a list of choices or null (no choice made), it is usually a good idea to have a specific value for "no choice made" rather than a nullable column. These types of columns often participate in joins.
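One way to sketch that, with a hypothetical lookup table that carries an explicit "no choice made" row so the referencing column can stay NOT NULL:

create table Color (
    ColorCode char(4) primary key,
    Label varchar(30) not null
)
insert into Color values ('NONE', 'No choice made'), ('RED', 'Red'), ('BLUE', 'Blue')
create table Widget (
    Id int primary key,
    ColorCode char(4) not null default ('NONE') foreign key references Color
)
-- Joins from Widget to Color never lose rows to NULL.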
Related
Recently I was looking at a SQL table we have and noticed the following.
[FooColumn] CHAR (1) DEFAULT ('N') NOT NULL,
Above you can see that FooColumn would always default to 'N' but still has "NOT NULL" specified.
Would there be some Storage/Performance differences in setting a column to "NOT NULL" instead of "NULL" ?
How would SQL Server treat a "NOT NULL" column differently from a "NULL" column?
NOTE: This is ONLY for SQL and not the overhead of externally doing NULL checks
You should only use NOT NULL when you have a reason (i.e. a required field for the UI or for backend relationships). NOT NULL vs NULL performance is negligible and, as per this article from 2016 (SQL Server), performance shouldn't be a consideration when deciding NOT NULL vs NULL.
Even though that field will default to 'N', a command could still set it to NULL if nulls were allowed. It comes down to whether NULL is a valid piece of data for that column.
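A quick sketch of the difference, using the FooColumn definition above in a hypothetical Foo table:

create table Foo (
    FooColumn char(1) default ('N') not null
)
insert into Foo default values            -- OK: stored as 'N'
insert into Foo (FooColumn) values ('Y')  -- OK: stored as 'Y'
insert into Foo (FooColumn) values (NULL) -- rejected under NOT NULL;
                                          -- with a nullable column this would store NULL, not 'N'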
EDIT
In a data-driven technical application, in my experience these are some guidelines we use:
for numeric fields, NULL is unknown to the user, and all numbers have meaning.
for string fields, NULL and "" are identical to the user, so it depends on your backend application.
I know that your question excluded ISNULL checks, but if you are doing a lot of them, it might be a code smell; those fields should be NOT NULL if possible, since the checks can get expensive.
It's a complicated "debate".
NULL means unknown. It's different from 0 or empty string.
NOT NULL means you NEED to insert a value in there, always, even if it's a blank string or a 0. Many designers argue that's it's better design. Other see no issues with having NULL values. Different software houses will enforce different rules.
Having a "default" value simply means that when you create new records without specifying a value, it will use the default value instead. This is regardless of whether the field is NULL or NOT NULL.
Having NULL values MAY have an impact on performance (as the DBMS needs to deal with this special case), it will depend on which DBMS you are using, which version, which config etc... You need to do bench-marking with your own setup to see what's what.
Here's a good article: http://www.itprotoday.com/microsoft-sql-server/designing-performance-null-or-not-null
As the question is asked, "NULL vs NOT NULL performance differences", the answer must be based on the storage structure of the row and the difference in treatment of the row in the event of a NULL.
The answer is: there is no difference.
Here are articles discussing row structure in SQL Server:
https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/
https://aboutsqlserver.com/2013/10/15/sql-server-storage-engine-data-pages-and-data-rows/
Here the column is defined as CHAR(1), so it is a fixed-size column.
The difference between an empty string '' and NULL is recorded in the row-structure information. There is no structural space saving in storing NULL rather than an empty string; the structural information does not change depending on the definition of the constraint.
If you are looking for performance gains from the data structure, you need to look elsewhere.
IMHO:
A column defined as CHAR(1) often contains coded information with few distinct values.
It is also common that this kind of column points to a "translation" table through FK.
So, if it is a "2-state indicator value" then the BIT type can be used, knowing that all columns of this type are grouped together in the same byte.
If more distinct values are needed, then the TINYINT type will also occupy 1 byte of fixed size but will not require the collation to be consulted when processing comparisons. (Note: TINYINT offers more values than CHAR(1).)
Also, if you don't yet have an FK constraint, this must be weighed.
[FooColumn] CHAR (1) DEFAULT ('N') NOT NULL,
It is far better than NCHAR(1), VARCHAR(1) or NVARCHAR(1) !
(For MySQL check FooColumn CHARACTER SET)
But, depending on your RDBMS and existing development, investigate whether you can use BIT or TINYINT (no collation).
The extra cost of the test needed to enforce 'NOT NULL', compared to none for 'NULL', is very, very minimal.
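The alternatives above, sketched side by side in one hypothetical table definition:

create table Example (
    IsApproved bit not null default (0),      -- 2-state flag; BIT columns are packed together by byte
    StatusCode tinyint not null default (0),  -- up to 256 distinct values in 1 fixed byte, no collation involved
    StatusChar char(1) not null default ('N') -- also 1 fixed byte, but comparisons go through the collation
)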
I have a (dummy) table structure as follows:
ticket
id: int(11) PK
name: varchar(255)
status: ?????????
The question is, what data type should I use for status? Here are my options, as I see them:
varchar representing the status - BAD because there's no integrity
enum representing the status - BAD because to change the value, I'd have to alter the table, and then any code with dropdowns for the values, etc etc etc
int FK to a status table - GOOD because it's dynamic, BAD because it's harder to inspect by sight (which may be useful)
varchar FK to a status table - GOOD because it's dynamic, and visible on inspection. BAD because the keys are meaningful, which is generally frowned upon. Interestingly, in this case it's entirely possible for the status table to have just 1 column, making it a glorified enum
Have I got an accurate read of the situation? Is having a meaningful key really that bad? Because while it does give me goosebumps, I don't have any reason for it doing so...
Update:
For option 4, the proposed structure would be status: char(4) FK, to a status table. So,
OPEN => "Open"
CLOS => "Closed"
"PEND" => "Pending Authorization"
"PROG" => "In Progress
What's the disadvantage in this case? The only benefit I can see of using int over char here is a slight performance edge.
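Option 4 sketched out with the codes from the update above (T-SQL-flavored syntax; adjust to your DBMS):

create table status (
    code char(4) primary key,
    name varchar(50) not null
)
insert into status values
    ('OPEN', 'Open'),
    ('CLOS', 'Closed'),
    ('PEND', 'Pending Authorization'),
    ('PROG', 'In Progress')
create table ticket (
    id int identity primary key,
    name varchar(255),
    status char(4) not null foreign key references status
)
-- Readable on sight, no join needed:
select * from ticket where status = 'PEND'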
I would go with number 4, but I'd use a char(x) column. If you're worried about performance, a char(4) takes up as much space (and, or so one would think, disk i/o, bandwidth, and processing time) as an int, which also takes 4 bytes to store. If you're really worried about performance, make it a char(2) or even char(1).
Don't think of it as "meaningful data", think of it as an abbreviation of the natural key. Yes, the data has meaning, but as you've noticed that can be a good thing when working with the data--it means you don't always have to join (even if to a trivially small table) to extract meaning from the database. And of course the foreign key constraint ensures that the data is valid, since it must be in the lookup table. (This can be done with CHECK constraints as well, but Lookup tables are generally easier to manage and maintain over time.)
The downside is that you can get caught up with trying to find meaning. char(1) has a strong appeal, but if you get to ten or more values, it can get hard to come up with good meaningful values. Less of a problem with char(4), but still a possible issue. Another downside: if the data is likely to change, then yes, your meaningful data ("PEND" = "Pending Authorization") can lose its meaning ("PEND" = "Forward to home office for initial approval"). That's a poor example; if codes like that do change, you're probably much better off refactoring your system to reflect the change in business rules. I guess my point should be: if it's a user-entered lookup value, surrogate keys (integers) will be your friend, but if they're internally defined and maintained you should definitely consider more human-friendly values. That, or you'll need Post-it notes on your monitor to remind you what the heck Status = 31 is supposed to mean. (I've got three on mine, and the stickum wears out every few months. Talk about cost to maintain...)
Go with number 3. Create a view that joins in the status value if you want something inspectable.
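A sketch of such a view, assuming the option-3 design where ticket.status_id is an int FK to a status table:

create view ticket_with_status as
select t.id, t.name, s.name as status_name
from ticket t
join status s on s.id = t.status_id
-- Then: select * from ticket_with_status where status_name = 'Open'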
I would use an INT, and create a foreign key relationship to the status table. An INT should definitely be safe for an enumerated status column.
May I recommend you go with a statusID field instead, and have a separate table mapping the ID to a varchar?
EDIT: I guess that's exactly what you outlined in point 3. I think that is the best option.
I'm assuming that your database has a front end of some description, and that regular users are not exposed to the status code.
So, your convenience is only for programmers and DBAs - important people, but I wouldn't optimize my design for them.
Stronger - I would be very careful of using "meaningful" abbreviations - the most egregious data foul-up I've ever seen happened when a developer was cleansing some data, and interpreted the "meaningful" key incorrectly; turns out that "PROG" does not mean "programmed", but "in progress".
Go with option 3.
I've been working with a lot of databases recently that require a lot of statuses AND I've got a few notes that might be worth adding to the conversation.
INT: One thing I found is that if an application has a lot of tracking going on, the number of reference tables can quickly get unwieldy and, as you've mentioned, make inspecting the database at a glance impractical. (Which, for some of my clients, has mattered much more than the scant milliseconds it's saved in processing time.)
VARCHAR: Terrible idea for programming, but it's important to consider if a given status is actually going to be used by the code, or just human eyes. For the latter, you get unlimited range and don't have to maintain any relationships.
CHAR(4): Using a descriptive char column can actually be a very good approach. I'd typically only consider it if the value range were going to be low and obvious, but only because I consider this a nonstandard approach (risking confusion to new devs). Realistically, you could use a CHAR value as a foreign key just the same as an INT, gain legibility and maintain performance parity.
The one thing you couldn't do that I'd miss is mathematical operations (like "<" and ">").
INT Range: A hybrid strategy I've tried out is to use INT, but adding a degree of semantics to the numbers. So, for instance,
1-10 being for initial stages,
11-20 being in progress, and
21-30 being the final stages.
60-69 for errors, rejections
The problem here is that if you discover you need more numbers, you're SOL, since the next range is already taken. So, what I ended up doing was (sort of) mimicking HTTP responses:
100-199 being for initial stages,
200-299 being in progress, and
300-399 being the final stages.
500-599 for errors, rejections
I prefer this to a simple INT, and while it can be less descriptive than CHAR, it can also be less ambiguous. Whereas "PROG" could mean a number of things, good, bad or benign, if I can see something is in the 500 range, I may not know what the problem is, but I will be able to tell you there is a problem.
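The payoff of the ranged scheme is that whole categories can be queried without consulting a lookup, as in this sketch (the orders table is hypothetical; ranges as defined above):

-- All errors/rejections, whatever the exact code:
select * from orders where status between 500 and 599
-- Everything not yet in the final stages:
select * from orders where status < 300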
Creating a separate status table is a good idea when you want to show the list of statuses in an HTML form. You can show the verbose description from the lookup table, and it will help the user choose a status if the requirements call for it.
From the development perspective, I would go with an integer as the primary key. You can optimize it by using a small/tiny integer if you know it will not exceed the limit.
If you use an abbreviation as a foreign key then you have to take care every time to keep it unique, as @Philip Kelley mentioned as a downside.
Lastly, you can declare the table type MyISAM if you like.
Update:
Reflecting @Philip Kelley's opinion: if there are too many statuses, then it's better to use an integer as the foreign key. If there are only a couple of statuses, then maybe use the abbreviation as the foreign key.
A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I make are often more complicated because of this. They often look like this
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Often times, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant or so in the application? Something like
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign id, I set it as an enum and queried off of that?
Thanks.
The answer depends a little on whether you are limited to freeware such as PostgreSQL (not fully SQL compliant), or whether you are thinking about SQL (i.e. SQL compliant) and large databases.
In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if those values change, you may have to update millions of rows, which is very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
The second point is ease of change: if Closed were changed to Expired, again, one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows need to be changed.
Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.
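A sketch of that lookup table and the one-row changes just described (table and column names hypothetical):

create table OrderStatus (
    StatusCode char(2) primary key,
    Name varchar(30) not null
)
insert into OrderStatus values ('O', 'Open'), ('C', 'Closed'), ('U', '[NotKnown]')
-- Renaming "Closed" to "Expired" touches one row, not millions:
update OrderStatus set Name = 'Expired' where StatusCode = 'C'
-- Adding a new value is a single insert:
insert into OrderStatus values ('H', 'HalfOpen')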
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
ENUM is non-SQL; do not use it. In SQL the "enum" is a Lookup table.
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well, don't change it, why would you want to. Those values were supposed to be meaningless, remember; only a meaningful Key needs to be changed.
if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open, as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
Your Example
SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'
Another Example
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE 'Eq%'
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.
For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.
Create a function for each lookup.
There is no easy way. You want performance and query simplicity. Ensure the following is maintained: you could create an SP_TestAppEnums to compare existing lookup values against the function and look for out-of-sync or zero returns.
CREATE FUNCTION [Enum_Post](@postname varchar(10))
RETURNS int
AS
BEGIN
    DECLARE @postId int
    SET @postId =
        CASE @postname
            WHEN 'Open' THEN 1
            WHEN 'Closed' THEN 2
        END
    RETURN @postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')
Question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened.
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your where clause can use a friendly name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute in the where clause ('name' in your example above) is more likely to change than the key attribute ('status_id'): you're more protected by referencing the desired key value(s) of the domain table in your join.
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table,
to populate UI controls and the like with user-friendly information.
Naturally, you'd need to suck domain tables into your queries where you actually require the non-key attributes from the domain table (e.g., the descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.
The answer is "whatever makes sense".
Lookup tables involve joins or subqueries, which are not always efficient. I make use of enums a lot to do this job; it's efficient and fast.
Where possible (and it is not always...), I use this rule of thumb: if I need to hard-code a value into my application (vs. letting it remain a record in the database), and also store that value in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a piece of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always achieve "perfect" normalization in a database design (for reasons of performance, or simply because you CAN take things too far in pursuit of academic perfection...), we may make some conscious choices about where we locate our "look-up" values.
Personally, though, I try to stand by my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the database, I try to keep it out of the application code (except, of course, to retrieve it from the database...). If it is hard-coded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so three years later some poor soul will be able to figure out how it broke, if that happens.
The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.
If I want one of the columns in my table to hold the values Yes, No, or Optional, what data type should I use?
BIT:
takes 1 byte, but up to 8 BIT fields can be merged into a single BYTE in SQL Server.
stores one of two values: 1 (meaning true) and 0 (meaning false) so the column needs to be nullable in order for NULL to pass as your third value
CHAR(1)
takes 1 byte
26 possible values if case-insensitive ASCII, vs 52 if case-sensitive
TINYINT
takes 1 byte
values zero to 255
Performance
All of the options take the same amount of space, making performance equivalent for JOINs/etc.
Comparison
BIT is not the wisest choice if there's any chance of the possible values changing. CHAR(1) is immediately readable IE: Y, N, O. TINYINT is a good choice for the primary key in a table you want to relate via foreign key, and store the descriptive text in another column.
Conclusion:
CHAR(1) would be my choice if not using a foreign key relationship, TINYINT otherwise.
With CHAR(1), having a natural primary key that is a single character is very unlikely. Assuming a natural key based on the leading character fails as soon as you have two or more values that start with the same character, and causes grief if the label needs to change, because the key should change with it and be perpetuated (unless you're lazy and like explaining why a code doesn't follow the same scheme as the others). CHAR(1) also provides roughly a fifth of the possibilities (assuming the upper end, 52 case-sensitive values) that TINYINT does; the artificial/surrogate key insulates you from description changes.
Use BIT for a True / False or in your case use CHAR(1) Y/N or CHAR(3) Yes / No.
Really I would use a CHAR(1) here because the extra 2 chars don't add any real value.
I'm surprised to see so many votes for "Bit" here. It's a bad choice.
Semantically, NULL means "unknown", so it's not a good choice as a third (known) value. If you use it that way you can run into a lot of problems down the road. For example, aggregate functions, GROUP BY, and joins may not behave the way you're expecting. User interfaces may not handle treating NULL as a value well either (MS Access has trouble with null bit fields, for example). You'll also be unable to help preserve data integrity by defining the field NOT NULL.
Finally, you'll probably confuse any other database/application developer down the road who is used to the normal use of the value.
Go with the CHAR or TinyInt.
Both Sergey and JonVD offer good solutions, but I'll agree with Sergey. A nullable bit offers your three options. If it's not null, then you know the user took the option.
I would use char(1) or an INT with a check constraint.
Just to minimize potential mismatches between the database and whatever abstraction layer you are using to access it. JDBC has no TINYINT for example.
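A sketch of the CHECK-constraint version (the Answers table is hypothetical; the same idea works with an int and IN (0, 1, 2)):

create table Answers (
    Id int primary key,
    Answer char(1) not null check (Answer in ('Y', 'N', 'O'))
)
-- All three states are real NOT NULL values; no NULL-as-data tricks needed.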
I agree with the options "OMG Ponies" presents, but not with his conclusion in this case.
Use the bit column! Having a single bit assigned to hold the data gives one six other Y/N/O and a single Y/N storage location for free. When one uses the bit data type, always define at least one bit as NOT NULL so SQL will reserve a place in the data page for the values.
The object part is misleading. My question is not specific to one type of SQL.
ATM I am using SQLite, but I will be switching to T-SQL (it looks to be what my host is offering), and I am rewriting some tables and logic to clean things up.
One pattern I notice is that I have a bigint that could possibly be one of two or more keys, sometimes with a bit or byte as an ID of which type it is. Two major things that come to mind are:
1) If a bigint is signed and I happen to have more than 2^32 PKs in a table, would bigint still be able to access the keys? I'm thinking that since the value will be negative and PKs are always positive, I will get an error. (My mistake; I forgot bigint goes up to 2^63, so I have nothing to worry about.)
2) If I have a bigint that represents the PK of two or more tables, would this be bad practice? For whatever reason I think there is a better way of doing it than bigint the_id, byte the_type.
1) T-SQL bigint is not limited to 2^32; it ranges from -2^63 to +2^63 - 1, far more than you are likely to use: something like 9.2 quintillion.
2) I'd prefer not to use one ID to represent two different PKs in other tables, but without knowing the problem you are solving it is hard to say whether it is the right decision or the only one you really have.
As a rule of thumb, I always design my columns to hold one piece of data and only one type of data (by type, I don't mean data type although that is generally true as well.)
If nothing else, putting two different IDs in the same column will prevent the use of foreign keys to make sure that your data is accurate and valid.
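One common alternative, sketched with hypothetical Post/Comment parents: give each possible parent its own nullable FK column and add a CHECK that exactly one is set, so both foreign keys stay enforceable:

create table Post (Id bigint primary key)
create table Comment (Id bigint primary key)
create table Attachment (
    Id bigint primary key,
    PostId bigint null foreign key references Post,
    CommentId bigint null foreign key references Comment,
    check ((PostId is not null and CommentId is null)
        or (PostId is null and CommentId is not null))
)
-- Each row points at exactly one parent, and the FKs do the validating.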