About choosing a data type - SQL

If I want one of the columns in my table to hold the values Yes, No or Optional, what data type should I use?

BIT:
takes 1 byte, but up to 8 BIT fields can be merged into a single byte in SQL Server.
stores one of two values: 1 (meaning true) and 0 (meaning false), so the column needs to be nullable for NULL to serve as your third value
CHAR(1)
takes 1 byte
26 possible values if case-insensitive ASCII vs 52 if case-sensitive
TINYINT
takes 1 byte
values zero to 255
Performance
All of the options take the same amount of space, making performance equivalent for JOINs/etc.
Comparison
BIT is not the wisest choice if there's any chance of the possible values changing. CHAR(1) is immediately readable, i.e. Y, N, O. TINYINT is a good choice for the primary key in a lookup table you relate via foreign key, storing the descriptive text in another column.
Conclusion:
CHAR(1) would be my choice if not using a foreign key relationship, TINYINT otherwise.
With CHAR(1), having a natural primary key that is a single character is very unlikely. Assuming a natural key based on the leading character fails if you have 2+ words that start with the same character, and causes grief if the label needs to change because the key should also change and be perpetuated (unless you're lazy & like explaining why a code doesn't follow the same scheme as the others). CHAR(1) also provides roughly a fifth of the possibilities (assuming the upper end, 52 case sensitive values) that TINYINT does -- the artificial/surrogate key insulates from description changes.
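To make the TINYINT-plus-lookup-table idea concrete, here is a rough sketch; the table and column names (ApprovalState, Document, etc.) are made up for illustration, SQL Server syntax assumed:

CREATE TABLE ApprovalState (
    ApprovalStateId TINYINT NOT NULL PRIMARY KEY,   -- 1-byte surrogate key
    Description VARCHAR(20) NOT NULL UNIQUE         -- descriptive text lives here
);

INSERT INTO ApprovalState (ApprovalStateId, Description)
VALUES (0, 'No'), (1, 'Yes'), (2, 'Optional');

CREATE TABLE Document (
    DocumentId INT IDENTITY(1, 1) PRIMARY KEY,
    ApprovalStateId TINYINT NOT NULL
        REFERENCES ApprovalState (ApprovalStateId)  -- only the 1-byte key is stored per row
);

Renaming 'Optional' to something else later is then a one-row update in ApprovalState rather than a mass update of every referencing row.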

Use BIT for a True/False, or in your case use CHAR(1) Y/N or CHAR(3) Yes/No.
Really I would use a CHAR(1) here because the extra 2 chars don't add any real value.

I'm surprised to see so many votes for "Bit" here. It's a bad choice.
Semantically, NULL means "unknown", so it's not a good choice as a third (known) value. If you use it that way you can run into a lot of problems down the road. For example, aggregate functions, GROUP BY, and joins may not behave the way you're expecting. User interfaces may not handle treating NULL as a value well either (MS Access has trouble with null bit fields, for example). You'll also be unable to help preserve data integrity by defining the field NOT NULL.
Finally, you'll probably confuse any other database/application developer down the road who is used to the normal use of the value.
Go with the CHAR or TinyInt.

Both Sergey and JonVD offer good solutions, but I'll agree with Sergey. A nullable bit offers your three options. If it's not null, then you know the user took the option.

I would use char(1) or an INT with a check constraint.
Just to minimize potential mismatches between the database and whatever abstraction layer you are using to access it. JDBC has no TINYINT for example.
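As a rough sketch of what that looks like (hypothetical table and column names, SQL Server syntax):

CREATE TABLE Submission (
    SubmissionId INT NOT NULL PRIMARY KEY,
    -- CHAR(1) variant, limited to the three codes
    AnswerChar CHAR(1) NOT NULL DEFAULT ('N')
        CHECK (AnswerChar IN ('Y', 'N', 'O')),
    -- INT variant with the same idea
    AnswerInt INT NOT NULL DEFAULT (0)
        CHECK (AnswerInt IN (0, 1, 2))
);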

I agree with the options "OMG Ponies" presents, but not with his conclusion in this case.
Use the bit column! Since up to eight BIT columns share a single byte, assigning one bit to hold this data effectively gives you room for several more flag columns for free. When you use the BIT data type, always define at least one BIT column as NOT NULL so SQL Server will reserve a place in the data page for the values.


NULL vs NOT NULL Performance differences

Recently I was looking at a SQL table we have and noticed the following.
[FooColumn] CHAR (1) DEFAULT ('N') NOT NULL,
Above you can see that FooColumn would always default to 'N' but still has "NOT NULL" specified.
Would there be some Storage/Performance differences in setting a column to "NOT NULL" instead of "NULL" ?
How would SQL Server treat a "NOT NULL" column differently from a "NULL" column?
NOTE: This is ONLY for SQL and not the overhead of externally doing NULL checks
You should only use NOT NULL when you have a reason (i.e. a required field for the UI or for backend relationships). NOT NULL vs NULL performance is negligible and, as per this article from 2016 (SQL Server), performance shouldn't be a consideration when deciding NOT NULL vs NULL.
Even though that field will default to 'N', a command could still set it to NULL if nulls were allowed. It comes down to whether NULL is a valid piece of data for that column.
EDIT
In a data-driven technical application, in my experience these are some guidelines we use:
for numeric fields, NULL is unknown to the user, and all numbers have meaning.
for string fields, NULL and "" are identical to the user, so it depends on your backend application.
I know that your question was excluding ISNULL checks but if you are doing a lot of them then it might be a code smell that those fields should be NOT NULL if possible since they can get expensive.
It's a complicated "debate".
NULL means unknown. It's different from 0 or empty string.
NOT NULL means you NEED to insert a value in there, always, even if it's a blank string or a 0. Many designers argue that it's better design. Others see no issues with having NULL values. Different software houses will enforce different rules.
Having a "default" value simply means that when you create new records without specifying a value, it will use the default value instead. This is regardless of whether the field is NULL or NOT NULL.
Having NULL values MAY have an impact on performance (as the DBMS needs to deal with this special case); it will depend on which DBMS you are using, which version, which config, etc. You need to do benchmarking with your own setup to see what's what.
Here's a good article: http://www.itprotoday.com/microsoft-sql-server/designing-performance-null-or-not-null
As the question asks about
"NULL vs NOT NULL Performance differences",
the answer must be based on the storage structure of the row and any difference in how the row is treated when a value is NULL.
The answer is: there is no difference.
Here are articles discussing row structure in SQL Server:
https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/
https://aboutsqlserver.com/2013/10/15/sql-server-storage-engine-data-pages-and-data-rows/
Here the column is defined as CHAR(1) so it is a fixed size column.
The difference between an empty string '' and NULL is recorded in the row structure information. There is no structural space saving from storing NULL rather than an empty string; the structural information does not change depending on the definition of the constraint.
If you are looking for performance gains related to the data structure, you need to look elsewhere.
IMHO :
A column defined as CHAR(1) often contains coded information with few distinct values.
It is also common that this kind of column points to a "translation" table through FK.
So, if it is a "2-state indicator value" then the BIT type can be used knowing that all columns of this type are grouped together in the same byte.
If more distinct values are needed, then TINYINT will also occupy 1 fixed byte but will not require collation processing for comparisons. (Note: TINYINT offers more values than CHAR(1).)
On the other hand, if you don't have an FK constraint yet, this has to be weighed up.
[FooColumn] CHAR (1) DEFAULT ('N') NOT NULL,
It is far better than NCHAR(1), VARCHAR(1) or NVARCHAR(1) !
(For MySQL check FooColumn CHARACTER SET)
But, depending on your RDBMS and existing development, investigate whether you can use BIT or TINYINT (no collation).
The extra cost of checking 'NOT NULL', compared to none for 'NULL', is very, very minimal.

Which variable type to use when using text as primary key in SQL Server database

I am aware of the pros and cons of using text as a primary key in a table
(there is discussion).
However, I just wonder whether I should use varchar(10), char(10), or something else.
Values will look like 1-115115151 (length may differ)
For that string, I would recommend varchar(11).
If the string length can grow EVER, I'd recommend making it 12 or even 15.
Varchar uses space equal to the number of characters in use + 2 bytes. Data that can fit in either varchar(11) or varchar(15) will use the same amount of space in both.
Strictly speaking you can use either, but agreed with the others here - you'll want to use varchar for variable length items.
When your database scales and you start foreign keying from other tables there's a good chance you'll be happy you saved the space from not having a fixed character type.
You'll also be very happy when/if, down the road, your key length winds up being longer than you'd anticipated.
Do not use a text data type as a primary key; use only integers. It's important for easier programming and SQL connections.

SQL Index - Difference Between char and int

I have a table on Sql Server 2005 database.
The primary key field of the table is a code number.
As a standard, the code must contain exactly 4 numeric digits. For example: 1234, 7834, ...
Do you suggest the field type be char(4), int or numeric(4) in terms of efficient select operations?
Would indexing the table on any of these types differ from the others?
Integer / Identity columns are often used for primary keys in database tables for a number of reasons. Primary key columns must be unique, should not be updatable, and really should be meaningless. This makes an identity column a pretty good choice because the server will get the next value for you, they must be unique, and integers are relatively small and useable (compared to a GUID).
Some database architects will argue that other data types should be used for primary key values and the "meaningless" and "not updatable" criteria can be argued convincingly on both sides. Regardless, integer / identity fields are pretty convenient and many database designers find that they make suitable key values for referential integrity.
The best choice for a primary key is an integer data type, since integer values are processed faster than character data type values. A character data type (as a primary key) needs to be converted to ASCII-equivalent values before processing.
Fetching a record by primary key will be faster with integer primary keys, as more index records will be present on a single page, so the total search time decreases. Joins will also be faster. But this applies when your query uses a clustered index seek and not a scan, and if only one table is used. In the case of a scan, not having the additional column will mean more rows fit on one data page.
Hopefully this will help you!
I advocate a SMALLINT column, simply because it is the most sensible datatype that will fit the required range (up to 32,767, comfortably in excess of 4 digits). Use a check constraint to enforce the 4-digit limitation and a computed column to return the char(4) representation.
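A hedged sketch of that suggestion (the table and column names are illustrative; SQL Server syntax):

CREATE TABLE ItemCode (
    Code SMALLINT NOT NULL PRIMARY KEY
        CHECK (Code BETWEEN 0 AND 9999),                      -- keep it to 4 digits
    CodeChar AS RIGHT('0000' + CAST(Code AS VARCHAR(4)), 4)   -- '0042'-style display value
);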
If I remember correctly, ints take up less storage than chars, so you should go with int.
These two links say the same:
http://www.eggheadcafe.com/software/aspnet/31759030/varcharschars-vs-intbigint-as-keys.aspx
http://sql-server-performance.com/Community/forums/p/16020/94489.aspx
"It depends"
In this case, char(4) captures the data stored correctly with no storage overhead (4 bytes each). And 0001 is not the same as 1 of course.
You do have some overhead for processing collation etc. with character data, but it shouldn't matter for reasonably sized databases. And with a 4-digit code you do have an upper bound on the number of rows, especially if numeric (10k).
If your new codes are not strictly increasing, then you get the page split issue associated with GUID clustered keys
If they are strictly increasing, then use int and add a computed column to add leading zeros

How liberal should I be with NOT NULL columns?

I'm designing a database schema, and I'm wondering what criteria I should use for deciding whether each column should be nullable or not.
Should I mark as NOT NULL only those columns that absolutely must be filled out for a row to make any sense at all to my application?
Or should I mark all columns that I intend to never be null?
What are the performance implications of small vs large numbers of NOT NULL columns?
I assume lots of NOT NULL columns would slow down inserts a bit, but they might actually speed up selects, since the query execution plan generator has more information about the columns.
Can someone with more knowledge than me give me the low-down?
Honestly, I've always thought NOT NULL should be the default. NULL is the odd special case, and you should make a case for it whenever you use it. Plus it's much easier to change a column from NOT NULL to nullable than it is to go the other way.
There are no significant performance consequences. Don't even think about considering this as an issue. To do so is a huge early optimization antipattern.
"Should I only mark as NOT NULL only those columns that absolutely must be filled out for a row to make any sense at all to my application?"
Yes. It's as simple as that. You're a lot better off with a NULLable column without any NULL values in it, than with the need for NULLs and having to fake it. And anyway, any ambiguous cases are better filtered out in your Business Rules.
EDIT:
There's another argument for nullable fields that I think is ultimately the most compelling, which is the Use Case argument. We've all been subject to data entry forms that require values for some fields; and we've all abandoned forms where we had no sensible values for required fields. Ultimately, the application, the form, and the database design are only defensible if they reflect the user requirements; and it's clear that there are many, many database columns for which users can present no value - sometimes at given points in the business process, sometimes ever.
Err on the side of NOT NULL. You will, at some point, have to decide what NULL "means" in your application - more than likely, it will be different things for different columns. Some of the common cases are "not specified", "unknown", "inapplicable", "hasn't happened yet", etc. You will know when you need one of those values, and then you can appropriately allow a NULLable column and code the logic around it.
Allowing random things to be NULL is, sooner or later, always a nightmare IME. Use NULL carefully and sparingly - and know what it means in your logic.
Edit: There seems to be an idea that I'm arguing for NO null columns, ever. That's ridiculous. NULL is useful, but only where it's expected.
Le Dorfier's DateOfDeath example is a good one. A NULL DateOfDeath would indicate "not happened yet". Now I can write a view LivingPersons WHERE DateOfDeath IS NULL.
But, what does a NULL OrderDate mean? That the order wasn't placed yet? Even though there's a record in the Order table? How about a NULL address? Those are the thoughts that should go through your head before you let NULL be a value.
Back to DateOfDeath - a query of persons WHERE DateOfDeath > '1/1/1999' would not return the NULL records - even though we logically know they must die after 1999. Is that what you want? If not, then you better include OR DateOfDeath IS NULL in that query. If you allow all columns to be NULL, you have to think about that every single time you write a query. IME, that's too much of a mental tax for the 10% or so of columns that actually have legit meaning when they're NULL.
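For example (a hypothetical Person table), every such query has to carry that extra predicate:

SELECT PersonId, FullName
FROM Person
WHERE DateOfDeath > '19990101'
   OR DateOfDeath IS NULL;   -- without this line, the still-living are silently excluded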
I have found marking a column as NOT NULL is usually a good idea unless you have a useful meaning for NULL in the column. Otherwise you may unexpectedly find NULL in there later when you realise you don't want it, and changing is harder.
I try to avoid using NULL's in the database as much as possible. This means that character fields are always not null. Same for numeric fields, especially anything representing money or similar (shares, units, etc).
I have 2 exceptions:
Dates where the date might not be known (eg. DivorcedOn)
Optional foreign key relationships (MarriedToPersonId). Though on occasion I have used "blank" rows in the foreign key table and made the relationship mandatory (eg. JobDescriptionCode)
I have also on occasion used explicit bit fields for "unknown"/"not set" (eg. JobDescriptionCode and IsEmployeed).
I have a few core reasons why:
NULLs will always cause problems in numeric fields. Always. Always. Always. It doesn't matter how careful you are; at some point select X + Y as Total is going to happen and it will return NULL.
NULLs can easily cause problems in string fields, typically address fields (eg. select AddrLine1 + AddrLine2 from Addresses).
Guarding against NULLs in the business logic tier is a tedious waste of effort... just don't let them in the DB and you can save 100's of lines of code.
My preferred defaults:
Strings -> "", aka an empty string
Numbers -> 0
Dates -> Today or NULL (see exception #1)
Bit -> false
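Put together, those defaults might look like this (a hypothetical table, SQL Server syntax):

CREATE TABLE Person (
    PersonId INT IDENTITY(1, 1) PRIMARY KEY,
    FullName VARCHAR(100) NOT NULL DEFAULT (''),                  -- strings: empty string
    SharesHeld DECIMAL(18, 4) NOT NULL DEFAULT (0),               -- numbers: 0
    CreatedOn DATETIME NOT NULL DEFAULT (GETDATE()),              -- dates: today
    DivorcedOn DATETIME NULL,                                     -- exception 1: date might not be known
    MarriedToPersonId INT NULL REFERENCES Person (PersonId),      -- exception 2: optional FK
    IsEmployeed BIT NOT NULL DEFAULT (0)                          -- bit: false
);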
You may find Chris Date's Database In Depth a useful resource for these kinds of questions. You can get a taste for his ideas in this interview, where he says among other things:
So yes, I do think SQL is pretty bad.
But you explicitly ask what its major
flaws are. Well, here are a few:
Duplicate rows
Nulls
Left-to-right column ordering
Unnamed columns and duplicate column names
Failure to support "=" properly
Pointers
High redundancy
In my own experience, nearly all "planned nulls" can be represented better with a child table that has a foreign key to a base table. Participating in the child table is optional, and that's where the null/not null distinction is actually made.
This maps well to the interpretation of a relation as a first-order logic proposition. It also is just common sense. When one does not know Bob's address, does one write in one's Rolodex:
Bob. ____
Or does one merely refrain from filling out an address card for Bob until one has an actual address for him?
Edit: Date's argument appears on pages 53-55 of Database In Depth, under the section heading "Why Nulls are Prohibited."
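A minimal sketch of that pattern, using the Bob's-address example (the table and column names are mine, not Date's; SQL Server syntax):

CREATE TABLE Person (
    PersonId INT IDENTITY(1, 1) PRIMARY KEY,
    FullName VARCHAR(100) NOT NULL
);

CREATE TABLE PersonAddress (
    PersonId INT NOT NULL PRIMARY KEY REFERENCES Person (PersonId),
    Address VARCHAR(200) NOT NULL   -- a row exists only when an address is actually known
);

People with no known address simply have no PersonAddress row; a LEFT JOIN reintroduces the "missing" case only in the queries that actually need it.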
I lean toward NOT NULL unless I see a reason otherwise -- like someone else said, like it or not, NULL is the weird special case.
One of my favorites in regards to NULL is:
SELECT F1 FROM T WHERE F2 <> 'OK'
...which (in DB2 at least) won't include any rows where F2 is null -- because in relational jargon, (NULL <> 'OK') evaluates to NULL/unknown. But your intent was to return all not-OK rows. You need an extra OR predicate, or to write F2 IS DISTINCT FROM 'OK' instead (which is special-case coding in the first place).
IMO, NULL is just one of those programmer's tools, like pointer arithmetic or operator overloading, that requires as much art as science.
Joe Celko writes about this in SQL For Smarties -- the trap of using NULL in an application is that its meaning is, well, undefined. It could mean unknown, uninitialized, incomplete, not applicable -- or as in the dumb example above, does it mean OK or not-OK?
Thanks for all the great answers, guys. You gave me a lot to think about, and helped me form my own opinion/strategy, which boils down to this:
Allow nulls if-and-only-if a null in
that column would have a specific
meaning to your application.
A couple of common meanings for null:
Anything that comes directly from the user
Here null means "user did not enter"
For these columns, it's better to allow nulls, or you'll just get asdasd#asd.com type input anyway.
Foreign keys for "0 or 1" relationships
null means "no related row"
So allow nulls for these columns
This one is controversial, but this is my opinion.
In general, if you cannot think of a useful meaning for null in a column, it should be NOT NULL. You can always change it to nullable later.
Example of the sort of thing I ended up with:
create table SalesOrderLine (
    Id int identity primary key,
    -- a line must have exactly one header:
    IdHeader int not null foreign key references SalesOrderHeader,
    LineNumber int not null,             -- a line must have a line number
    IdItem int not null,                  -- cannot have null item
    Quantity decimal not null,            -- maybe could sell 0, but not null
    UnitPrice decimal not null,           -- price can be 0, but not null
    -- a null delivery address means not for delivery:
    IdDeliveryAddress int foreign key references Address,
    Comment varchar(100),                 -- null means user skipped it
    Cancelled bit not null default (0),   -- true boolean, not three-state!
    Delivered datetime,                   -- null means not yet delivered
    Logged datetime not null default (GetDate())  -- must be filled out
)
I would tend to agree with dorfier.
Be serious in your application about flexibly receiving database NULL values and treating them as empty values, and you give yourself a lot of flexibility to let NULLs get inserted for values you don't specify.
There's probably a lot of cases where you need some very serious data integrity (and/or the intense speed optimization of disallowing NULL fields) but I think that these concerns are tempered against the extra effort it takes to make sure every field has a default value and/or gets set to a sensible value.
Stick with NOT NULL on everything until someone squeaks with pain about it. Then remove it on one column at a time, as reluctantly as possible. Avoid nulls in your DB as much as you can, for as long as you can.
Personally, I think you should mark columns as NULL or NOT NULL based on what kind of data they contain, whether there is a genuine requirement for the data to always be there, and whether the data is always known at the time of input. Marking a column as NOT NULL when the users don't have the data will force them to make up the data, which makes all your data useless (this is how you end up with junk data such as an email field containing "thisissilly#Ihatethisaplication.com"). Failing to require something that must be there for the process to work (say, the key field showing which customer made the order) is equally stupid. NULL versus NOT NULL is a data integrity issue at heart; do what makes the most sense towards keeping your data usable.
If you can think long term, having NULLs in a column affects how you can design your queries. Whether you use CASE statements, COALESCE, or have to explicitly test for NULL values can make the decision for you.
From a performance standpoint, it's faster to not have to worry about NULLS. From a design standpoint, using NULL is an easy way to know that an item has never been filled in. Useful examples include "UpdatedDateTime" columns. NULL means an item has never been updated.
Personally I allow NULLs in most situations.
What are the performance implications of small vs large numbers of NOT NULL columns?
This may be stating the obvious, but, when a column is nullable, each record will require 1 extra bit of storage. So a BIT column will consume 100% more storage when it is nullable, while a UNIQUEIDENTIFIER will consume only 0.8% more storage when it is nullable.
In the pathological case, if your database has a single table consisting of a single BIT column, the decision to make that column nullable would reduce your database's performance in half. However, under the vast majority of real world scenarios, nullability will not have a measurable performance impact.
Using 'NOT NULL' or 'NULL' should be primarily driven by your particular persistence requirements.
Having a value be nullable means there are two or three states (three states with bit fields).
For instance; if I had a bit field which was called 'IsApproved' and the value is set at a later stage than insertion. Then there are three states:
'IsApproved' Not answered
'IsApproved' Is Approved
'IsApproved' Is Not Approved
So if a field can legitimately be considered "not answered" and there is no suitable default value, that field should be considered for being nullable.
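In code, that looks something like this (hypothetical table name, SQL Server syntax):

CREATE TABLE ApprovalRequest (
    RequestId INT IDENTITY(1, 1) PRIMARY KEY,
    IsApproved BIT NULL   -- NULL = not answered, 1 = approved, 0 = not approved
);

SELECT COUNT(*) FROM ApprovalRequest WHERE IsApproved IS NULL;  -- not answered yet
SELECT COUNT(*) FROM ApprovalRequest WHERE IsApproved = 1;      -- approved
SELECT COUNT(*) FROM ApprovalRequest WHERE IsApproved = 0;      -- not approved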
Any nullable column is a violation of third normal form.
But, that's not an answer.
Maybe this is: there are two types of columns in databases - ones that hold the structure of the data, and ones that hold the content of the data. Keys are structure, user-enterable fields are data. Other things - well - it's a judgment call.
Stuff that's structure, that is used in join clauses, is typically not null. Stuff that's data is typically nullable.
When you have a column that hold one of a list of choices or null (no choice made), it is usually a good idea to have a specific value for "no choice made" rather than a nullable column. These types of columns often participate in joins.

Flags in database rows, best practices

I am asking this out of curiosity. Basically my question is: when you have a database which needs a row entry to have things which act like flags, what is the best practice? A good example of this would be the badges on Stack Overflow, or the operating system field in Bugzilla. Any subset of the flags may be set for a given entry.
Usually, I do C and C++ work, so my gut reaction is to use an unsigned integer field as a set of bits which can be flipped... But I know that isn't a good solution for several reasons. The most obvious of which is scalability: there will be a hard upper limit on how many flags I can have.
I can also think of a couple of other solutions which scale better but would have performance issues because they would require multiple selects to get all the information.
So, what is the "right" way to do this?
Generally speaking, I avoid bitmask fields. They're difficult to read in the future and they require a much more in-depth knowledge of the data to understand.
The relational solution has been proposed previously. Given the example you outlined, I would create something like this (in SQL Server):
CREATE TABLE Users (
    UserId INT IDENTITY(1, 1) PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    EmailAddress VARCHAR(255)
);

CREATE TABLE Badges (
    BadgeId INT IDENTITY(1, 1) PRIMARY KEY,
    [Name] VARCHAR(50),
    [Description] VARCHAR(255)
);

CREATE TABLE UserBadges (
    UserId INT REFERENCES Users(UserId),
    BadgeId INT REFERENCES Badges(BadgeId)
);
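A query against that schema then stays straightforward, for example listing each user's badges:

SELECT u.UserId, u.FirstName, u.LastName, b.[Name] AS BadgeName
FROM Users u
JOIN UserBadges ub ON ub.UserId = u.UserId
JOIN Badges b ON b.BadgeId = ub.BadgeId
ORDER BY u.UserId, b.[Name];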
If you really need an unbounded selection from a closed set of flags (e.g. stackoverflow badges), then the "relational way" would be to create a table of flags and a separate table which relates those flags to your target entities. Thus, users, flags and usersToFlags.
However, if space efficiency is a serious concern and query-ability is not, an unsigned mask would work almost as well.
A Very Relational Approach
For databases without the SET type, you could create a new table to represent the set of entities for which each flag is set.
E.g. for a table "Students" you could have tables "RegisteredStudents", "SickStudents", "TroublesomeStudents", etc. Each table will have only one column: the student_id. This would actually be very fast if all you want to know is which students are "Registered" or "Sick", and it would work the same way in every DBMS.
For many cases, it depends on a lot of things - like your database backend. If you're using MySQL, for example, the SET datatype is exactly what you want.
Basically, it's just a bitmask, with values assigned to each bit. MySQL supports up to 64-bit values (meaning 64 different toggles). If you only need 8, then it only takes a byte per row, which is pretty awesome savings.
If you honestly have more than 64 values in a single field, your field might be getting more complicated. You may want to move up to the BLOB datatype, which is just a raw set of bits that MySQL has no inherent understanding of. Using this, you can create an arbitrary number of bit fields that MySQL is happy to treat as binary, hex, or decimal values, however you need. If you need more than 64 options, create as many fields as is appropriate for your application. The downside is that it is difficult to make the field human-readable. The BIT datatype is also limited to 64.
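A small illustrative sketch of the SET type (the table and column names are made up; MySQL syntax):

CREATE TABLE bug_report (
    id INT AUTO_INCREMENT PRIMARY KEY,
    os SET('windows', 'linux', 'macos', 'bsd') NOT NULL DEFAULT ''
);

INSERT INTO bug_report (os) VALUES ('windows,linux');

SELECT id FROM bug_report WHERE FIND_IN_SET('linux', os);  -- rows with the 'linux' flag set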
If the flags have very different meanings and are used directly in SQL queries or VIEWS, then using multiple columns of type BOOLEAN might be a good idea.
Put each flag into an extra column, because you'll read and modify them separately anyway. If you want to group the flags, just give their column names a common prefix, i.e. instead of:
CREATE TABLE ... (
    warnings INTEGER,
    errors INTEGER,
    ...
)
you should use:
CREATE TABLE ... (
    warning_foo BOOLEAN,
    warning_bar BOOLEAN,
    warning_...
    error_foo BOOLEAN,
    error_bar BOOLEAN,
    error_... BOOLEAN,
    ...
)
Although MySQL doesn't have a BOOLEAN type, you can use the quasi standard TINYINT(1) for that purpose, and set it only to 0 or 1.
I would recommend using a BOOLEAN datatype if your database supports this.
Otherwise, the best approach is to use NUMBER(1) or equivalent, and put a check constraint on the column that limits valid values to (0,1) and perhaps NULL if you need that. If there is no built-in type, using a number is less ambiguous than using a character column. (What's the value for true? "T" or "Y" or "t"?)
The nice thing about this is that you can use SUM() to count the number of TRUE rows.
SELECT COUNT(1), SUM(ActiveFlag)
FROM myusers;
If there are more than just a few flags, or likely to be so in the future, I'll use a separate table of flags and a many-to-many table between them.
If there are a handful of flags and I'm never going to use them in a WHERE, I'll use a SET() or bitfield or whatever. They're easy to read and more compact, but a pain to query and sometimes even more of a headache with an ORM.
If there are only a few flags -- and only ever going to be a few flags -- then I'll just make a couple BIT/BOOLEAN/etc columns.
Came across this when I was pondering best way to store bitmask flags (similar to OP's original use of integers) in a database.
The other answers are all valid solutions, but I think it's worth mentioning that you may not have to resign yourself to horrible query problems if you choose to store bitmasks directly in the database.
If you are working on an application that uses bitmasks and you really want the convenience of storing them in the database as one integer or byte column, go ahead and do that. Down the road, you can write yourself a little utility that will generate another table of flags (in whatever pattern of rows/columns you choose) from the bitmasks in your primary working table. You can then do ordinary SQL queries on that computed/derived table.
This way your application gets the convenience of only reading/writing the bitmask field/column. But you can still use SQL to really dive into your data if that becomes necessary at a later time.
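One hedged way to implement that utility directly in the database (all names here are hypothetical: an existing Items table with an ItemId and an integer Flags column, plus a FlagDefinitions table mapping each bit to a name; SQL Server syntax):

CREATE TABLE FlagDefinitions (
    BitValue INT NOT NULL PRIMARY KEY,   -- 1, 2, 4, 8, ...
    FlagName VARCHAR(50) NOT NULL
);

-- Derived, queryable form: one row per (item, set flag)
CREATE VIEW ItemFlags AS
SELECT i.ItemId, f.FlagName
FROM Items i
JOIN FlagDefinitions f ON (i.Flags & f.BitValue) <> 0;

The application keeps reading and writing Items.Flags as a single integer, while reports and ad-hoc SQL can query ItemFlags like an ordinary table.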