PK, IDs and simulating an 'object' in a table - sql

The object part is misleading. My question is not specific to one type of sql.
ATM I am using SQLite, but I will be switching to T-SQL (it looks to be what my host is offering), and I am rewriting some tables and logic to clean things up.
One pattern I notice is that I have a bigint that could possibly be one of 2+ keys, and sometimes, if I need it, a bit or byte as an ID for what type it is. Two major things come to mind:
1) If a bigint is signed and I happen to have more than 2^32 PKs in a table, would bigint still be able to access the keys? I'm thinking that since the value will be negative and PKs are always positive, I will get an error. Mistake: I forgot bigint is 2^63, so I have nothing to worry about.
2) If I have a bigint that represents the PK of 2 or more tables, would this be bad practice? For whatever reason I think there is a better way than doing bigint the_id, byte the_id

1) T-SQL bigint is not limited to 2^32; it ranges from -2^63 to +2^63 - 1, which is far more than you are likely to use (about 9.2 quintillion).
2) I'd prefer not to use one ID to represent two different PKs in other tables, but without knowing the problem you are solving it is hard to say whether it is the right decision or the only one you really have.

As a rule of thumb, I always design my columns to hold one piece of data and only one type of data (by type, I don't mean data type although that is generally true as well.)
If nothing else, putting two different IDs in the same column will prevent the use of foreign keys to make sure that your data is accurate and valid.
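A minimal sketch of one alternative, run through Python's built-in sqlite3 module since the question mentions SQLite: give each possible parent its own nullable foreign key column and let a CHECK constraint require exactly one of them. The table and column names (orders, deliveries, note) are invented for illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when this is on
conn.executescript("""
CREATE TABLE orders     (id INTEGER PRIMARY KEY);
CREATE TABLE deliveries (id INTEGER PRIMARY KEY);
-- Hypothetical child table: one nullable FK per possible parent,
-- plus a CHECK that exactly one of the two is filled in.
CREATE TABLE note (
    id          INTEGER PRIMARY KEY,
    order_id    INTEGER REFERENCES orders(id),
    delivery_id INTEGER REFERENCES deliveries(id),
    body        TEXT NOT NULL,
    CHECK ((order_id IS NULL) <> (delivery_id IS NULL))
);
INSERT INTO orders (id) VALUES (1);
INSERT INTO deliveries (id) VALUES (1);
""")

conn.execute("INSERT INTO note (order_id, body) VALUES (1, 'valid parent')")

def try_insert(sql):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False

# Points at an order that does not exist: rejected by the foreign key.
bad_fk = try_insert("INSERT INTO note (order_id, body) VALUES (99, 'no such order')")
# Fills in both parents at once: rejected by the CHECK constraint.
two_keys = try_insert(
    "INSERT INTO note (order_id, delivery_id, body) VALUES (1, 1, 'both set')")
```

With a single shared ID column, neither of those two bad rows could be caught by the database itself.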

Related

Type to use for "Status" columns in a sql table

I have a (dummy) table structure as follows:
ticket
id: int(11) PK
name: varchar(255)
status: ?????????
The question is, what data type should I use for status? Here are my options, as I see them:
varchar representing the status - BAD because there's no integrity
enum representing the status - BAD because to change the value, I'd have to alter the table, and then any code with dropdowns for the values, etc etc etc
int FK to a status table - GOOD because it's dynamic, BAD because it's harder to inspect by sight (which may be useful)
varchar FK to a status table - GOOD because it's dynamic, and visible on inspection. BAD because the keys are meaningful, which is generally frowned upon. Interestingly, in this case it's entirely possible for the status table to have just 1 column, making it a glorified enum
Have I got an accurate read of the situation? Is having a meaningful key really that bad? Because while it does give me goosebumps, I don't have any reason for it doing so...
Update:
For option 4, the proposed structure would be status: char(4) FK, to a status table. So,
"OPEN" => "Open"
"CLOS" => "Closed"
"PEND" => "Pending Authorization"
"PROG" => "In Progress"
What's the disadvantage in this case ? The only benefit I can see of using int over char in this case is slight performance.
I would go with number 4, but I'd use a char(x) column. If you're worried about performance, a char(4) takes up as much space (and, or so one would think, disk i/o, bandwidth, and processing time) as an int, which also takes 4 bytes to store. If you're really worried about performance, make it a char(2) or even char(1).
Don't think of it as "meaningful data", think of it as an abbreviation of the natural key. Yes, the data has meaning, but as you've noticed that can be a good thing when working with the data--it means you don't always have to join (even if to a trivially small table) to extract meaning from the database. And of course the foreign key constraint ensures that the data is valid, since it must be in the lookup table. (This can be done with CHECK constraints as well, but Lookup tables are generally easier to manage and maintain over time.)
The downside is that you can get caught up with trying to find meaning. char(1) has a strong appeal, but if you get to ten or more values, it can get hard to come up with good meaningful values. Less of a problem with char(4), but still a possible issue. Another downside: if the data is likely to change, then yes, your meaningful data ("PEND" = "Pending Authorization") can lose its meaning ("PEND" = "Forward to home office for initial approval"). That's a poor example; if codes like that do change, you're probably much better off refactoring your system to reflect the change in business rules. I guess my point should be, if it's a user-entered lookup value, surrogate keys (integers) will be your friend, but if they're internally defined and maintained you should definitely consider more human-friendly values. That, or you'll need post-em notes on your monitor to remind you what the heck Status = 31 is supposed to mean. (I've got three on mine, and the stickum wears out every few months. Talk about cost to maintain...)
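A rough sketch of option 4, assuming SQLite via Python's sqlite3 module so it can be run as-is. The codes are the ones from the question; the column names code and description are my own choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce the FK
conn.executescript("""
CREATE TABLE status (
    code        CHAR(4) PRIMARY KEY,
    description VARCHAR(50) NOT NULL
);
CREATE TABLE ticket (
    id     INTEGER PRIMARY KEY,
    name   VARCHAR(255) NOT NULL,
    status CHAR(4) NOT NULL REFERENCES status(code)
);
INSERT INTO status VALUES ('OPEN', 'Open'), ('CLOS', 'Closed'),
                          ('PEND', 'Pending Authorization'), ('PROG', 'In Progress');
INSERT INTO ticket (name, status) VALUES ('Printer on fire', 'OPEN');
""")

# The status is readable without joining to the lookup table.
rows = conn.execute("SELECT name, status FROM ticket").fetchall()

# The foreign key still rejects codes that are not in the lookup table.
try:
    conn.execute("INSERT INTO ticket (name, status) VALUES ('Typo', 'OPNE')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```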
Go with number 3. Create a view that joins in the status value if you want something inspectable.
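Something along these lines, sketched in SQLite through Python's sqlite3 module. The view name ticket_readable and the column status_id are placeholders of mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status (
    id          INTEGER PRIMARY KEY,
    description VARCHAR(50) NOT NULL
);
CREATE TABLE ticket (
    id        INTEGER PRIMARY KEY,
    name      VARCHAR(255) NOT NULL,
    status_id INTEGER NOT NULL REFERENCES status(id)
);
-- Hypothetical view: it only joins the description back in for inspection.
CREATE VIEW ticket_readable AS
    SELECT t.id, t.name, s.description AS status
    FROM ticket AS t
    JOIN status AS s ON s.id = t.status_id;
INSERT INTO status VALUES (1, 'Open'), (2, 'Closed');
INSERT INTO ticket (name, status_id) VALUES ('Printer on fire', 1);
""")

# Inspect the view instead of the raw table when you want readable statuses.
readable = conn.execute("SELECT name, status FROM ticket_readable").fetchall()
```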
I would use an INT, and create a foreign key relationship to the status table. An INT should definitely be safe for an enumerated status column.
May I recommend you go with a statusID field instead, and have a separate table mapping the ID to a varchar?
EDIT: I guess that's exactly what you outlined in point 3. I think that is the best option.
I'm assuming that your database has a front end of some description, and that regular users are not exposed to the status code.
So, your convenience is only for programmers and DBAs - important people, but I wouldn't optimize my design for them.
Stronger - I would be very careful of using "meaningful" abbreviations - the most egregious data foul-up I've ever seen happened when a developer was cleansing some data, and interpreted the "meaningful" key incorrectly; turns out that "PROG" does not mean "programmed", but "in progress".
Go with option 3.
I've been working with a lot of databases recently that require a lot of statuses AND I've got a few notes that might be worth adding to the conversation.
INT: One thing I found is that if an application has a lot of tracking going on, the number of reference tables can quickly get unwieldy and, as you've mentioned, make inspecting the database at a glance impractical. (Which, for some of my clients, has mattered much more than the scant milliseconds it's saved in processing time.)
VARCHAR: Terrible idea for programming, but it's important to consider if a given status is actually going to be used by the code, or just human eyes. For the latter, you get unlimited range and don't have to maintain any relationships.
CHAR(4): Using a descriptive char column can actually be a very good approach. I'd typically only consider it if the value range were going to be low and obvious, but only because I consider this a nonstandard approach (risking confusion to new devs). Realistically, you could use a CHAR value as a foreign key just the same as an INT, gain legibility and maintain performance parity.
The one thing you couldn't do that I'd miss is mathematical operations (like "<" and ">").
INT Range: A hybrid strategy I've tried out is to use INT, but adding a degree of semantics to the numbers. So, for instance,
1-10 being for initial stages,
11-20 being in progress, and
21-30 being the final stages.
60-69 for errors, rejections
The problem here is that if you discover you need more numbers, you're SOL, since the next range is already taken. So, what I ended up doing was (sort of) mimicking HTTP responses:
100-199 being for initial stages,
200-299 being in progress, and
300-399 being the final stages.
500-599 for errors, rejections
I prefer this to simple INT, and while it can be less descriptive than CHAR, it can also be less ambiguous. Whereas "PROG" could mean a number of things, good, bad or benign, if I can see something is in the 500 range, I may not know what the problem is, but I will be able to tell you there is a problem.
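On the application side, the range idea can be sketched like this in Python. The family names and the helper status_family are made up for the example; only the numeric ranges come from the list above.

```python
# Ranges from the HTTP-like scheme above; the labels are illustrative only.
STATUS_RANGES = [
    (100, 199, "initial"),
    (200, 299, "in progress"),
    (300, 399, "final"),
    (500, 599, "error"),
]

def status_family(code: int) -> str:
    """Map a numeric status to its family, or 'unknown' for unassigned ranges."""
    for low, high, family in STATUS_RANGES:
        if low <= code <= high:
            return family
    return "unknown"
```

In SQL the same check is just a range predicate, for example status BETWEEN 500 AND 599, which is the "<" and ">" convenience that CHAR codes do not give you.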
Creating a separate table with status is a good idea when you want to show the list of the status in the HTML form. You can show the verbose description from the lookup table and it will help the user to choose status if the requirements are like that.
From the development perspective, I would like to go integer as a primary key. You can optimize it by using small/tiny integer if you know it will not exceed the limit.
If you use an abbreviation as a foreign key, then you have to take care every time to keep it unique, which @Philip Kelley mentioned as a downside.
Lastly, you can declare the table type MYISAM if you like.
Update:
Reflecting @Philip Kelley's opinion: if there are many statuses, it's better to use an integer as the foreign key. If there are only a couple of statuses, then maybe use an abbreviation as the foreign key.

uniqueidentifier vs identity

I noticed asp_membership uses uniqueidentifier, and to me, it seemed like a waste of space, because I could not think of a particular reason to not replace it with identity.
However, SQL Server is not letting me change it to int. It says "Conversion from uniqueidentifier to int is not supported on the connected database server". After googling around, it seems like I would have to break all the relationships etc and then manually delete the column and re-add it as int. Do you guys know of a better approach?
I don't think I would be dealing with multiple databases, so uniqueidentifier seems unneeded for me. Do you agree?
PLEASE NOTE: I am starting out a new web application. Do you still think fixing this would be THAT hard?
Also, note that any of my primary keys would not be part of my URL.
My advice would be to leave it alone. You're right about the fact that you would need to go to a lot of work (as you described) to get rid of it. Why? Why mess with something that just works? To save space in the database? With as cheap as storage is these days, and the average size of a hard drive or storage array, you're talking about a completely insignificant amount of space savings.
There is no Return on Investment for this idea.
You can't convert a UNIQUEIDENTIFIER column to INT. What you'd need to do in this case is:
add your new ID column as INT IDENTITY - this will create and fill the column with values
drop the old GUID column you don't need anymore
Of course, since the "UserID" is the primary key, and thus will be referenced from a lot of places, you'll have to do a lot of housekeeping before being able to drop the UserId column.
While I applaud your idea and realization of UNIQUEIDENTIFIER being a really bad choice for a primary and clustering key in a SQL Server table, I think in this particular case, I'd probably leave it "as is" - trying to convert that will cause lots of ripple changes all throughout the ASP.NET tables - probably just not worth the effort.
Marc
Uniqueidentifier and int are two very different datatypes. You can't just change a GUID to an int. You would need to remove all the relationships, assign numeric values to the parent table, add numeric columns to all the child tables, and use joins to set all the values in the child tables. Then remove all the GUID columns, make the int column the primary key, and set up all the relationships again.
And not to mention change every app that talks to the database and expects the value to be a guid so that it can work with the integer value.
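To give a feel for the remapping part only, here is a small sketch in SQLite through Python's sqlite3 module. It is not the SQL Server procedure: the tables users and orders are invented, and dropping the old GUID columns and re-creating the keys is deliberately left out.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_guid TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id        INTEGER PRIMARY KEY,
    user_guid TEXT NOT NULL REFERENCES users(user_guid)
);
""")
alice, bob = str(uuid.uuid4()), str(uuid.uuid4())
conn.executemany("INSERT INTO users VALUES (?, ?)", [(alice, "alice"), (bob, "bob")])
conn.executemany("INSERT INTO orders (user_guid) VALUES (?)",
                 [(alice,), (bob,), (alice,)])

# 1) Assign a numeric value to every parent row (rowid is SQLite-specific).
conn.execute("ALTER TABLE users ADD COLUMN user_id INTEGER")
conn.execute("UPDATE users SET user_id = rowid")

# 2) Add a numeric column to the child and fill it by matching on the old GUID.
conn.execute("ALTER TABLE orders ADD COLUMN user_id INTEGER")
conn.execute("""
    UPDATE orders
    SET user_id = (SELECT u.user_id FROM users AS u
                   WHERE u.user_guid = orders.user_guid)
""")

# 3) The data now joins on the new integer column.
per_user = conn.execute("""
    SELECT u.name, COUNT(*) FROM orders AS o
    JOIN users AS u ON u.user_id = o.user_id
    GROUP BY u.name ORDER BY u.name
""").fetchall()
unmapped = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE user_id IS NULL").fetchone()[0]
```

Every child table that references the GUID needs its own version of step 2, which is where most of the effort goes.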
You are looking at a lot of work for very little gain.
Yes you would get some performance improvement because the guids aren't sequential and the numbers would be. But that really isn't enough of a reason to actually make the change unless you are seeing some major performance issues from this.
At the least you'd be looking at weeks of work and testing; in reality, months, to ensure that you catch every little possible change.

Database-wide unique-yet-simple identifiers in SQL Server

First, I'm aware of this question, and the suggestion (using GUID) doesn't apply in my situation.
I want simple UIDs so that my users can easily communicate this information over the phone :
Hello, I've got a problem with order
1584
as opposed to
hello, I've got a problem with order
4daz33-d4gerz384867-8234878-14
I want those to be unique (database wide) because I have a few different kind of 'objects' ... there are order IDs, and delivery IDs, and billing-IDs and since there's no one-to-one relationship between those, I have no way to guess what kind of object an ID is referring to.
With database-wide unique IDs, I can immediately tell what object my customer is referring to. My user can just input an ID in a search tool, and I save him the extra click to further refine what he is looking for.
My current idea is to use identity columns with different seeds 1, 2, 3, etc, and an increment value of 100.
This raises a few question though :
What if I eventually get more than 100 object types? granted I could use 1000 or 10000, but something that doesn't scale well "smells"
Is there a possibility the seed is "lost" (during a replication, a database problem, etc?)
more generally, are there other issues I should be aware of?
is it possible to use an non integer (I currently use bigints) as an identity columns, so that I can prefix the ID with something representing the object type? (for example a varchar column)
would it be a good idea to use a "master table" containing only an identity column, and maybe the object type, so that I can just insert a row in it whenever I need a new ID? I feel like it might be a bit overkill, and I'm afraid it would complicate all my insertion requests. Plus there is the fact that I won't be able to determine an object type without looking at the database
are there other clever ways to address my problem?
Why not use identities on all the tables, but any time you present it to the user, simply tack on a single char for the type? e.g. O1234 is an order, D123213 is a delivery, etc.? That way you don't have to engineer some crazy scheme...
Handle it at the user interface--add a prefix letter (or letters) onto the ID number when reporting it to the users. So o472 would be an order, b531 would be a bill, and so on. People are quite comfortable mixing letters and digits when giving "numbers" over the phone, and are more accurate than with straight digits.
You could use an autoincrement column to generate the unique ID. Then have a computed column which takes the value of this column and prepends a fixed identifier that reflects the entity type. For example, OR1542 and DL1542 would represent order #1542 and delivery #1542, respectively. Your prefix could be extended as much as you want, and the format could be arranged to help distinguish between items with the same autoincrement value, say OR011542 and DL021542, with the prefixes being OR01 and DL02.
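If the prefix is added in the application rather than in a computed column, the formatting and parsing could look roughly like this in Python. The prefix table and both function names are hypothetical.

```python
import re
from typing import Tuple

# Hypothetical prefixes; the real set would come from your own entity types.
PREFIXES = {"order": "OR", "delivery": "DL"}
KINDS = {prefix: kind for kind, prefix in PREFIXES.items()}

def to_display_id(kind: str, row_id: int) -> str:
    """Build the ID shown to users, e.g. ('order', 1542) -> 'OR1542'."""
    return f"{PREFIXES[kind]}{row_id}"

def parse_display_id(text: str) -> Tuple[str, int]:
    """Split a user-supplied ID back into (entity kind, numeric key)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", text.strip())
    if match is None or match.group(1).upper() not in KINDS:
        raise ValueError(f"not a recognised ID: {text!r}")
    return KINDS[match.group(1).upper()], int(match.group(2))
```

A search box can then call parse_display_id on whatever the user typed and go straight to the right table.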
I would implement this by defining a generic root table. For lack of a better name, call it Entity. The Entity table should have at a minimum a single Identity column on it. You could also include other fields that are common across all your objects, or even metadata that tells you this row is an order, for example.
Each of your actual Order, Delivery...tables will have a FK reference back to the Entity table. This will give you a single unique ID column
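A minimal sketch of that layout, using SQLite through Python's sqlite3 module rather than SQL Server. The entity_type column and the helper functions are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
CREATE TABLE entity (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_type TEXT NOT NULL
);
-- Each concrete table reuses the entity id as its own primary key.
CREATE TABLE orders     (id INTEGER PRIMARY KEY REFERENCES entity(id),
                         customer TEXT NOT NULL);
CREATE TABLE deliveries (id INTEGER PRIMARY KEY REFERENCES entity(id),
                         address TEXT NOT NULL);
""")

def new_entity(entity_type):
    """Reserve a database-wide unique id for a new object of the given type."""
    cur = conn.execute("INSERT INTO entity (entity_type) VALUES (?)", (entity_type,))
    return cur.lastrowid

def lookup_type(entity_id):
    """Tell what kind of object an id refers to, or None if it is unknown."""
    row = conn.execute(
        "SELECT entity_type FROM entity WHERE id = ?", (entity_id,)).fetchone()
    return row[0] if row else None

order_id = new_entity("order")
conn.execute("INSERT INTO orders (id, customer) VALUES (?, ?)", (order_id, "Bob"))
delivery_id = new_entity("delivery")
conn.execute("INSERT INTO deliveries (id, address) VALUES (?, ?)",
             (delivery_id, "1 Main St"))
```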
Using the seeds in my opinion is a bad idea, and one that could lead to problems.
Edit
Some of the problems you mentioned already. I also see this being a pain to track and ensure you setup all new entities correctly. Imagine a developer updating the system two years from now.
After I wrote this answer I thought a bit more about why you're doing this, and I came to the same conclusion that Matt did.
MS's Intentional Programming project had a GUID-to-word system that gave pronounceable names from random IDs
Why not a simple Base36 representation of a bigint? http://en.wikipedia.org/wiki/Base_36
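For what it's worth, a small Python sketch of the conversion. The function names are mine, and only non-negative values are handled.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase  # 0-9 followed by A-Z

def to_base36(n: int) -> str:
    """Encode a non-negative integer using digits and upper-case letters."""
    if n < 0:
        raise ValueError("only non-negative IDs are handled in this sketch")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def from_base36(text: str) -> int:
    """Decode a base-36 string; int() accepts either letter case."""
    return int(text, 36)
```

Even the largest positive bigint fits in 13 characters this way, although small IDs gain little: order 1584 becomes 180.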
We faced a similar problem on a project. We solved it by first creating a simple table that only has one column: a BIGINT set as auto-increment identity.
And we created an sproc that inserts a new row in that table, using default values and inside a transaction. It then stores the SCOPE_IDENTITY in a variable, rolls back the transaction and then returns the stored SCOPE_IDENTITY.
This gives us a unique ID inside the database without filling up a table.
If you want to know what kind of object the ID is referring to, I'd lose the transaction rollback and also store the type of object alongside the ID. That way, finding out what kind of object the ID is referring to is only one select (or inner join) away.
I use a high/low algorithm for this. I can't find a description for this online though. Must blog about it.
In my database, I have an ID table with a counter field. This is the high part. In my application, I have a counter that goes from 0 to 99. This is the low part. The generated key is 100 * high + low.
To get a key, I do the following
initially high = -1
initially low = 0
method GetNewKey()
begin
    if high = -1 then
        high = GetNewHighFromDatabase
    newKey = 100 * high + low
    Inc low
    if low = 100 then
        low = 0
        high = -1
    return newKey
end
The real code is more complicated with locks etc but that is the general gist.
There are a number of ways of getting the high value from the database including auto inc keys, generators etc. The best way depends on the db you are using.
This algorithm gives simple keys while avoiding most of the db hit of looking up a new key every time. In testing, I found it had similar performance to GUIDs and vastly better performance than retrieving an auto inc key every time.
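As a runnable illustration of the pseudocode above, here is a Python sketch with a simple lock. The class name is mine, and fetch_high_from_db is a stand-in that uses an in-memory counter instead of a real database call.

```python
import itertools
import threading

class HiLoKeyGenerator:
    """Hands out keys of the form block_size * high + low."""

    def __init__(self, fetch_high, block_size=100):
        self._fetch_high = fetch_high   # callable that returns the next high value
        self._block_size = block_size
        self._high = None               # None plays the role of high = -1
        self._low = 0
        self._lock = threading.Lock()

    def next_key(self):
        with self._lock:
            if self._high is None:
                self._high = self._fetch_high()
            key = self._block_size * self._high + self._low
            self._low += 1
            if self._low == self._block_size:
                # Block used up: force a new high on the next call.
                self._low = 0
                self._high = None
            return key

# Stand-in for the database counter; a real version would run a query here.
_fake_db_counter = itertools.count(0)
db_calls = []

def fetch_high_from_db():
    high = next(_fake_db_counter)
    db_calls.append(high)
    return high

gen = HiLoKeyGenerator(fetch_high_from_db)
keys = [gen.next_key() for _ in range(250)]
```

Generating 250 keys here touches the fake database only three times, once per block of 100.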
You could create a master UniqueObject table with your identity and a subtype field. Subtables (Orders, Users, etc.) would have a FK to UniqueObject. INSTEAD OF INSERT triggers should keep the pain to a minimum.
Maybe an itemType-year-week-orderNumberThisWeek variant?
o2009-22-93402
Such identifier can consist of several database column values and simply formatted into a form of an identifier by the software.
I had a similar situation with a project.
My solution: By default, users only see the first 7 characters of the GUID.
It's sufficiently random that collisions are extremely unlikely (1 in 268 million), and it's efficient for speaking and typing.
Internally, of course, I'm using the entire GUID.
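One caveat worth checking for your own volumes: 1 in 268 million (16^7) is the chance that one given pair of prefixes matches, and the chance of any clash grows with the number of rows. A back-of-the-envelope Python sketch using the usual birthday approximation, and assuming the first 7 hex characters are uniformly random:

```python
import math

def collision_probability(n_ids: int, hex_chars: int = 7) -> float:
    """Approximate chance that at least two of n_ids share the same prefix."""
    space = 16 ** hex_chars  # 268,435,456 possible 7-character prefixes
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2 * space))

p_thousand = collision_probability(1_000)
p_twenty_thousand = collision_probability(20_000)
```

Under that assumption a thousand rows give well under a 1% chance of any clash, while around twenty thousand rows the odds are about even, so a lookup by prefix should be prepared to return more than one match.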

How liberal should I be with NOT NULL columns?

I'm designing a database schema, and I'm wondering what criteria I should use for deciding whether each column should be nullable or not.
Should I mark as NOT NULL only those columns that absolutely must be filled out for a row to make any sense at all to my application?
Or should I mark all columns that I intend to never be null?
What are the performance implications of small vs large numbers of NOT NULL columns?
I assume lots of NOT NULL columns would slow down inserts a bit, but it might actually speed up selects, since the query execution plan generator has more information about the columns..
Can someone with more knowledge than me give me the low-down?
Honestly, I've always thought NOT NULL should be the default. NULL is the odd special case, and you should make a case for it whenever you use it. Plus it's much easier to change a column from NOT NULL to nullable than it is to go the other way.
There are no significant performance consequences. Don't even think about considering this as an issue. To do so is a huge early optimization antipattern.
"Should I only mark as NOT NULL only those columns that absolutely must be filled out for a row to make any sense at all to my application?"
Yes. It's as simple as that. You're a lot better off with a NULLable column without any NULL values in it, than with the need for NULLs and having to fake it. And anyway, any ambiguous cases are better filtered out in your Business Rules.
EDIT:
There's another argument for nullable fields that I think is ultimately the most compelling, which is the Use Case argument. We've all been subject to data entry forms that require values for some fields; and we've all abandoned forms where we had no sensible values for required fields. Ultimately, the application, the form, and the database design are only defensible if they reflect the user requirements; and it's clear that there are many, many database columns for which users can present no value - sometimes at given points in the business process, sometimes ever.
Err on the side of NOT NULL. You will, at some point, have to decide what NULL "means" in your application - more than likely, it will be different things for different columns. Some of the common cases are "not specified", "unknown", "inapplicable", "hasn't happened yet", etc. You will know when you need one of those values, and then you can appropriately allow a NULLable column and code the logic around it.
Allowing random things to be NULL is, sooner or later, always a nightmare IME. Use NULL carefully and sparingly - and know what it means in your logic.
Edit: There seems to be an idea that I'm arguing for NO null columns, ever. That's ridiculous. NULL is useful, but only where it's expected.
Le Dorfier's DateOfDeath example is a good example. A NULL DateOfDeath would indicate "not happened yet". Now, I can write a view LivingPersons WHERE DateOfDeath IS NULL.
But, what does a NULL OrderDate mean? That the order wasn't placed yet? Even though there's a record in the Order table? How about a NULL address? Those are the thoughts that should go through your head before you let NULL be a value.
Back to DateOfDeath - a query of persons WHERE DateOfDeath > '1/1/1999' would not return the NULL records - even though we logically know they must die after 1999. Is that what you want? If not, then you better include OR DateOfDeath IS NULL in that query. If you allow all columns to be NULL, you have to think about that every single time you write a query. IME, that's too much of a mental tax for the 10% or so of columns that actually have legit meaning when they're NULL.
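A quick way to see this for yourself, sketched with SQLite through Python's sqlite3 module. The Person table and its rows are invented, and dates are stored as ISO text because SQLite has no date type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (Name TEXT NOT NULL, DateOfDeath TEXT);
INSERT INTO Person VALUES ('Alice', '1985-03-14'),
                          ('Bob',   '2005-06-30'),
                          ('Carol', NULL);   -- still alive
""")

# The plain comparison silently drops Carol, whose DateOfDeath is NULL.
after_1999 = [row[0] for row in conn.execute(
    "SELECT Name FROM Person WHERE DateOfDeath > '1999-01-01' ORDER BY Name")]

# The NULL case has to be asked for explicitly.
after_1999_or_alive = [row[0] for row in conn.execute(
    "SELECT Name FROM Person "
    "WHERE DateOfDeath > '1999-01-01' OR DateOfDeath IS NULL ORDER BY Name")]
```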
I have found marking a column as NOT NULL is usually a good idea unless you have a useful meaning for NULL in the column. Otherwise you may unexpectedly find NULL in there later when you realise you don't want it, and changing is harder.
I try to avoid using NULL's in the database as much as possible. This means that character fields are always not null. Same for numeric fields, especially anything representing money or similar (shares, units, etc).
I have 2 exceptions:
Dates where the date might not be known (eg. DivorcedOn)
Optional foreign key relationships (MarriedToPersonId). Though on occasion I have used "blank" rows in the foreign key table and made the relationship mandatory (eg. JobDescriptionCode)
I have also on occasion used explicit bit fields for "unknown"/"not set" (eg. JobDescriptionCode and IsEmployeed).
I have a few core reasons why:
NULLs will always cause problems in numeric fields. Always. Always. Always. It doesn't matter how careful you are: at some point select X + Y as Total is going to happen and it will return NULL.
NULLs can easily cause problems in string fields, typically address fields (eg. select AddrLine1 + AddrLine2 from Addresses).
Guarding against NULLs in the business logic tier is a tedious waste of effort... just don't let them in the DB and you can save 100's of lines of code.
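The first two points are easy to reproduce. A tiny sketch using SQLite through Python's sqlite3 module (note that SQLite concatenates strings with || where T-SQL uses +):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Arithmetic with a NULL operand yields NULL unless each operand is guarded.
total, guarded_total = conn.execute(
    "SELECT 10 + NULL, COALESCE(10, 0) + COALESCE(NULL, 0)").fetchone()

# String concatenation behaves the same way.
address, guarded_address = conn.execute(
    "SELECT '1 Main St' || NULL, '1 Main St' || COALESCE(NULL, '')").fetchone()
```

Here total and address both come back as None, while the COALESCE versions return 10 and '1 Main St'.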
My preferred defaults:
Strings -> "", aka an empty string
Numbers -> 0
Dates -> Today or NULL (see exception #1)
Bit -> false
You may find Chris Date's Database In Depth a useful resource for these kinds of questions. You can get a taste for his ideas in this interview, where he says among other things:
So yes, I do think SQL is pretty bad.
But you explicitly ask what its major
flaws are. Well, here are a few:
Duplicate rows
Nulls
Left-to-right column ordering
Unnamed columns and duplicate column names
Failure to support "=" properly
Pointers
High redundancy
In my own experience, nearly all "planned nulls" can be represented better with a child table that has a foreign key to a base table. Participating in the child table is optional, and that's where the null/not null distinction is actually made.
This maps well to the interpretation of a relation as a first-order logic proposition. It also is just common sense. When one does not know Bob's address, does one write in one's Rolodex:
Bob. ____
Or does one merely refrain from filling out an address card for Bob until one has an actual address for him?
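In table form, the address-card idea might look like the following sketch (SQLite through Python's sqlite3 module; the table names Person and PersonAddress are my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Person (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
-- A row exists here only once the address is actually known,
-- so neither table needs a nullable column.
CREATE TABLE PersonAddress (
    PersonId INTEGER PRIMARY KEY REFERENCES Person(Id),
    Address  TEXT NOT NULL
);
INSERT INTO Person VALUES (1, 'Bob'), (2, 'Alice');
INSERT INTO PersonAddress VALUES (2, '1 Main St');
""")

# People with a known address.
with_address = conn.execute("""
    SELECT p.Name, a.Address
    FROM Person AS p JOIN PersonAddress AS a ON a.PersonId = p.Id
""").fetchall()

# People for whom no address card has been filled out yet.
without_address = [row[0] for row in conn.execute("""
    SELECT p.Name FROM Person AS p
    WHERE NOT EXISTS (SELECT 1 FROM PersonAddress AS a WHERE a.PersonId = p.Id)
""")]
```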
Edit: Date's argument appears on pages 53-55 of Database In Depth, under the section heading "Why Nulls are Prohibited."
I lean toward NOT NULL unless I see a reason otherwise -- like someone else said, like it or not, NULL is the weird special case.
One of my favorites in regards to NULL is:
SELECT F1 FROM T WHERE F2 <> 'OK'
...which (in DB2 at least) won't include any rows where F2 is null -- because in relational jargon, (NULL <> 'OK') IS NULL. But your intent was to return all not-OK rows. You need an extra OR predicate, or write F2 IS DISTINCT FROM 'OK' instead (which is special case coding in the first place).
IMO, NULL is just one of those programmer's tools, like pointer arithmetic or operator overloading, that requires as much art as science.
Joe Celko writes about this in SQL For Smarties -- the trap of using NULL in an application is that its meaning is, well, undefined. It could mean unknown, uninitialized, incomplete, not applicable -- or as in the dumb example above, does it mean OK or not-OK?
Thanks for all the great answers, guys. You gave me a lot to think about, and helped me form my own opinion/strategy, which boils down to this:
Allow nulls if-and-only-if a null in
that column would have a specific
meaning to your application.
A couple of common meanings for null:
Anything that comes directly from the user
Here null means "user did not enter"
For these columns, it's better to allow nulls, or you'll just get asdasd@asd.com type input anyway.
Foreign keys for "0 or 1" relationships
null means "no related row"
So allow nulls for these columns
This one is controversial, but this is my opinion.
In general, if you cannot think of a useful meaning for null in a column, it should be NOT NULL. You can always change it to nullable later.
Example of the sort of thing I ended up with:
create table SalesOrderLine (
    Id int identity primary key,
    -- a line must have exactly one header:
    IdHeader int not null foreign key references SalesOrderHeader,
    LineNumber int not null, -- a line must have a line number
    IdItem int not null, -- cannot have null item
    Quantity decimal not null, -- maybe could sell 0, but not null
    UnitPrice decimal not null, -- price can be 0, but not null
    -- a null delivery address means not for delivery:
    IdDeliveryAddress int foreign key references Address,
    Comment varchar(100), -- null means user skipped it
    Cancelled bit not null default (0), -- true boolean, not three-state!
    Delivered datetime, -- null means not yet delivered
    Logged datetime not null default (GetDate()) -- must be filled out
)
I would tend to agree with dorfier.
Be serious in your application about being flexible when receiving database NULL values and treating them as empty values, and you give yourself a lot of flexibility to let NULL's get inserted for values you don't specify.
There's probably a lot of cases where you need some very serious data integrity (and/or the intense speed optimization of disallowing NULL fields) but I think that these concerns are tempered against the extra effort it takes to make sure every field has a default value and/or gets set to a sensible value.
Stick with NOT NULL on everything until someone squeaks with pain about it. Then remove it on one column at a time, as reluctantly as possible. Avoid nulls in your DB as much as you can, for as long as you can.
Personally I think you should mark the columns as null or not null based on what kind of data they contain, whether there is a genuine requirement for the data to always be there, and whether the data is always known at the time of input. Marking a column as not null when the users don't have the data will force them to make up the data, which makes all your data useless (this is how you end up with junk data such as an email field containing "thisissilly@Ihatethisaplication.com"). Failing to require something that must be there for the process to work (say, the key field that shows which customer made the order) is equally stupid. Null versus not null is a data integrity issue at heart; do what makes the most sense towards keeping your data usable.
If you can think long term, having NULLs in a column affects how you can design your queries. Whether you use CASE statements, COALESCE, or have to explicitly test for NULL values can make the decision for you.
From a performance standpoint, it's faster to not have to worry about NULLS. From a design standpoint, using NULL is an easy way to know that an item has never been filled in. Useful examples include "UpdatedDateTime" columns. NULL means an item has never been updated.
Personally I allow NULLs in most situations.
What are the performance implications of small vs large numbers of NOT NULL columns?
This may be stating the obvious, but, when a column is nullable, each record will require 1 extra bit of storage. So a BIT column will consume 100% more storage when it is nullable, while a UNIQUEIDENTIFIER will consume only 0.8% more storage when it is nullable.
In the pathological case, if your database has a single table consisting of a single BIT column, the decision to make that column nullable would reduce your database's performance in half. However, under the vast majority of real world scenarios, nullability will not have a measurable performance impact.
Using 'Not Null' or 'Null' should be primarily driven by your particular persistence requirements.
Having a value being Nullable means there are two or three states (three states with Bit fields)
For instance; if I had a bit field which was called 'IsApproved' and the value is set at a later stage than insertion. Then there are three states:
'IsApproved' Not answered
'IsApproved' Is Approved
'IsApproved' Is Not Approved
So if a field can legitimately be considered Not Answered and there is no suitable default value, it should be considered for being nullable.
Any nullable column is a violation of third normal form.
But, that's not an answer.
Maybe this is: there are two types of columns in databases - ones that hold the structure of the data, and ones that hold the content of the data. Keys are structure, user-enterable fields are data. Other things - well - it's a judgment call.
Stuff that's structure, that is used in join clauses, is typically not null. Stuff that's data is typically nullable.
When you have a column that hold one of a list of choices or null (no choice made), it is usually a good idea to have a specific value for "no choice made" rather than a nullable column. These types of columns often participate in joins.

Flags in a database rows, best practices

I am asking this out of a curiosity. Basically my question is when you have a database which needs a row entry to have things which act like flags, what is the best practice? A good example of this would be the badges on stack overflow, or the operating system field in bugzilla. Any subset of the flags may be set for a given entry.
Usually, I do C and C++ work, so my gut reaction is to use an unsigned integer field as a set of bits which can be flipped... But I know that isn't a good solution for several reasons. The most obvious of these is scalability: there will be a hard upper limit on how many flags I can have.
I can also think of a couple of other solutions which scale better but would have performance issues because they would require multiple selects to get all the information.
So, what is the "right" way to do this?
Generally speaking, I avoid bitmask fields. They're difficult to read in the future, and they require much more in-depth knowledge of the data to understand.
The relational solution has been proposed previously. Given the example you outlined, I would create something like this (in SQL Server):
CREATE TABLE Users (
UserId INT IDENTITY(1, 1) PRIMARY KEY,
FirstName VARCHAR(50),
LastName VARCHAR(50),
EmailAddress VARCHAR(255)
);
CREATE TABLE Badges (
BadgeId INT IDENTITY(1, 1) PRIMARY KEY,
[Name] VARCHAR(50),
[Description] VARCHAR(255)
);
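CREATE TABLE UserBadges (
-- One row per badge a user holds; the composite key prevents duplicate rows.
UserId INT NOT NULL REFERENCES Users(UserId),
BadgeId INT NOT NULL REFERENCES Badges(BadgeId),
PRIMARY KEY (UserId, BadgeId)
);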
If you really need an unbounded selection from a closed set of flags (e.g. stackoverflow badges), then the "relational way" would be to create a table of flags and a separate table which relates those flags to your target entities. Thus, users, flags and usersToFlags.
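As a rough sketch of the users / flags / usersToFlags layout, the following uses Python's built-in sqlite3 module. The table and column names, the sample data, and the helper flags_by_user are all hypothetical. It also shows that every user's flags can be fetched with one SELECT rather than one query per user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE flags (flag_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE users_to_flags (
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    flag_id INTEGER NOT NULL REFERENCES flags(flag_id),
    PRIMARY KEY (user_id, flag_id)
);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO flags VALUES (1, 'Teacher'), (2, 'Editor'), (3, 'Critic');
INSERT INTO users_to_flags VALUES (1, 1), (1, 3), (2, 2);
""")

def flags_by_user(conn):
    """Return {user name: [flag names]} using a single SELECT."""
    result = {}
    query = """
        SELECT u.name, f.name
        FROM users AS u
        LEFT JOIN users_to_flags AS uf ON uf.user_id = u.user_id
        LEFT JOIN flags AS f ON f.flag_id = uf.flag_id
        ORDER BY u.user_id, f.flag_id
    """
    for user, flag in conn.execute(query):
        result.setdefault(user, [])  # users with no flags still appear
        if flag is not None:
            result[user].append(flag)
    return result

all_flags = flags_by_user(conn)
print(all_flags)
```

Adding a new flag is then an INSERT into flags, with no schema change and no upper limit on the number of flags.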
However, if space efficiency is a serious concern and query-ability is not, an unsigned mask would work almost as well.
A Very Relational Approach
For databases without the set type, you could open a new table to represent the set of entities for which each flag is set.
E.g. for a table "Students" you could have tables "RegisteredStudents", "SickStudents", "TroublesomeStudents", etc. Each table will have only one column: the student_id. This would actually be very fast if all you want to know is which students are "Registered" or "Sick", and would work the same way in every DBMS.
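A small sketch of this one-table-per-flag layout, run through Python's built-in sqlite3 module. The snake_case table names and the sample students are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One single-column table per flag; a row's presence means the flag is set.
CREATE TABLE registered_students (
    student_id INTEGER PRIMARY KEY REFERENCES students(student_id)
);
CREATE TABLE sick_students (
    student_id INTEGER PRIMARY KEY REFERENCES students(student_id)
);
INSERT INTO students VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cy');
INSERT INTO registered_students VALUES (1), (2);
INSERT INTO sick_students VALUES (2);
""")

# Students with a given flag: a single join against that flag's table.
sick = [name for (name,) in conn.execute("""
    SELECT s.name FROM students AS s
    JOIN sick_students AS k ON k.student_id = s.student_id
    ORDER BY s.student_id
""")]

# Combining flags: registered but not sick.
healthy_registered = [name for (name,) in conn.execute("""
    SELECT s.name FROM students AS s
    JOIN registered_students AS r ON r.student_id = s.student_id
    WHERE NOT EXISTS (
        SELECT 1 FROM sick_students AS k WHERE k.student_id = s.student_id
    )
    ORDER BY s.student_id
""")]
print(sick, healthy_registered)
```

The trade-off in this sketch is that every new flag needs a new table, so it suits a small, stable set of flags.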
For many cases, it depends on a lot of things - like your database backend. If you're using MySQL, for example, the SET datatype is exactly what you want.
Basically, it's just a bitmask, with values assigned to each bit. MySQL supports up to 64-bit values (meaning 64 different toggles). If you only need 8, then it only takes a byte per row, which is pretty awesome savings.
If you honestly have more than 64 values in a single field, your field might be getting more complicated. You may want to expand then to the BLOB datatype, which is just a raw set of bits that MySQL has no inherent understanding of. Using this, you can create an arbitrary number of bit fields that MySQL is happy to treat as binary, hex, or decimal values, however you need. If you need more than 64 options, create as many fields as is appropriate for your application. The downside is that it is difficult to make the field human readable. The BIT datatype is also limited to 64.
If the flags have very different meanings and are used directly in SQL queries or VIEWS, then using multiple columns of type BOOLEAN might be a good idea.
Put each flag into an extra column, because you'll read and modify them separately anyway. If you want to group the flags, just give their column names a common prefix, i.e. instead of:
CREATE TABLE ... (
warnings INTEGER,
errors INTEGER,
...
)
you should use:
CREATE TABLE ... (
warning_foo BOOLEAN,
warning_bar BOOLEAN,
warning_... BOOLEAN,
error_foo BOOLEAN,
error_bar BOOLEAN,
error_... BOOLEAN,
...
)
Although MySQL doesn't have a real BOOLEAN type (BOOL and BOOLEAN are just synonyms for TINYINT(1)), you can use the quasi-standard TINYINT(1) for that purpose, and set it only to 0 or 1.
I would recommend using a BOOLEAN datatype if your database supports this.
Otherwise, the best approach is to use NUMBER(1) or equivalent, and put a check constraint on the column that limits valid values to (0,1), and perhaps NULL if you need that. If there is no built-in type, using a number is less ambiguous than using a character column. (What's the value for true? "T" or "Y" or "t"?)
The nice thing about this is that you can use SUM() to count the number of TRUE rows.
SELECT COUNT(1), SUM(ActiveFlag)
FROM myusers;
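The check constraint and the SUM() trick can be tried together in a short sketch using Python's built-in sqlite3 module. The myusers and ActiveFlag names come from the query above; the UserId column and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE myusers (
        UserId     INTEGER PRIMARY KEY,
        ActiveFlag INTEGER NOT NULL CHECK (ActiveFlag IN (0, 1))
    )
""")
conn.executemany(
    "INSERT INTO myusers (ActiveFlag) VALUES (?)", [(1,), (0,), (1,)]
)

# COUNT gives all rows; SUM adds up the 1s, i.e. counts the TRUE rows.
total, active = conn.execute(
    "SELECT COUNT(1), SUM(ActiveFlag) FROM myusers"
).fetchone()
print(total, active)

# The CHECK constraint rejects anything other than 0 or 1.
try:
    conn.execute("INSERT INTO myusers (ActiveFlag) VALUES (2)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

Note that this relies on the flag being stored as a number. In SQL Server, for instance, a BIT column has to be cast to an integer type before it can be passed to SUM().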
If there are more than just a few flags, or likely to be so in the future, I'll use a separate table of flags and a many-to-many table between them.
If there are a handful of flags and I'm never going to use them in a WHERE, I'll use a SET() or bitfield or whatever. They're easy to read and more compact, but a pain to query and sometimes even more of a headache with an ORM.
If there are only a few flags -- and only ever going to be a few flags -- then I'll just make a couple BIT/BOOLEAN/etc columns.
Came across this when I was pondering the best way to store bitmask flags (similar to OP's original use of integers) in a database.
The other answers are all valid solutions, but I think it's worth mentioning that you may not have to resign yourself to horrible query problems if you choose to store bitmasks directly in the database.
If you are working on an application that uses bitmasks and you really want the convenience of storing them in the database as one integer or byte column, go ahead and do that. Down the road, you can write yourself a little utility that will generate another table of flags (in whatever pattern of rows/columns you choose) from the bitmasks in your primary working table. You can then do ordinary SQL queries on that computed/derived table.
This way your application gets the convenience of only reading/writing the bitmask field/column. But you can still use SQL to really dive into your data if that becomes necessary at a later time.
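One possible shape for such a utility, sketched with Python's built-in sqlite3 module. The table names (entries, entry_flags), the FLAG_BITS mapping, and the function rebuild_entry_flags are all hypothetical; a real application would use its own flag definitions.

```python
import sqlite3

# Hypothetical flag names and bit values, for illustration only.
FLAG_BITS = {"registered": 1 << 0, "sick": 1 << 1, "troublesome": 1 << 2}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (id INTEGER PRIMARY KEY, flags INTEGER NOT NULL DEFAULT 0);
CREATE TABLE entry_flags (
    entry_id INTEGER NOT NULL REFERENCES entries(id),
    flag     TEXT    NOT NULL,
    PRIMARY KEY (entry_id, flag)
);
INSERT INTO entries VALUES (1, 5), (2, 2), (3, 0);
""")

def rebuild_entry_flags(conn):
    """Regenerate the derived table from the bitmask column."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM entry_flags")
        for name, bit in FLAG_BITS.items():
            # One INSERT ... SELECT per flag: pick rows whose mask has the bit set.
            conn.execute(
                "INSERT INTO entry_flags (entry_id, flag) "
                "SELECT id, ? FROM entries WHERE (flags & ?) <> 0",
                (name, bit),
            )

rebuild_entry_flags(conn)
derived = conn.execute(
    "SELECT entry_id, flag FROM entry_flags ORDER BY entry_id, flag"
).fetchall()
print(derived)
```

The derived table is only as fresh as the last rebuild, so in this sketch it would be regenerated before any ad hoc querying session.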