A table's boolean fields can be named using the positive vs the negative...
for example, calling a field:
"ACTIVE" , 1=on / 0=off
or
"INACTIVE" , 0=on / 1=off
Question:
Is there a proper way to make this type of table design decision, or is it arbitrary?
My specific example is a messages table with a bool field (private/public). This field will be set using a form checkbox when a user enters a new message. Is there a benefit in naming the field "public" vs "private"?
thanks.
I always prefer positive names, to avoid double negatives in code. "Is not inactive" is often cause for a double take when reading. "Is inactive" can always be written as "if (!Active)" whilst taking advantage of built-in language semantics.
My personal preference:
Use prefixes like "Is", "Has", etc. for Boolean fields to make their purpose clear.
Always name variables in the affirmative. For Active/Inactive, I would name it IsActive.
Don't make a bit field nullable unless you really have a specific purpose in doing so.
In your specific use case, the field should be named either IsPublic or IsPrivate--whichever name would result in a True answer when the user ticks the checkbox.
i would not disagree with some of the other answers but definitely avoid the incorrect answer which is not to put in double negatives always
Always use positive names.
If use negative names, you very quickly get into double negation. Not that double negation is rocket surgery, but it's a brain cycle and those are valuable :)
Always use positive.
It's simpler.
Take using the negation to the logical extreme: if InActive is better than Active, then why not InInActive, or InInInActive?
Because it would be less simple.
The proper way to handle these situations is to create a table to house the values associated with the column, and create a foreign key relationship between the two tables. IE:
WIDGETS table:
WIDGET_ID
WIDGET_STATUS (fk)
WIDGET_STATUS_CODES table:
WIDGET_STATUS_CODE (pk)
DESCRIPTION
If possible, WIDGET_STATUS_CODE would be a natural key (IE: ACT for "Active", INA for "Inactive"). This would make records more human readable, but isn't always possible so you'd use an artificial/surrogate key (like an auto-number/sequence/etc).
You want to do this because:
It's readable what status indicates (which was the original question)
Future proof in the need to define/use more statuses
Provides referencial integrity so someone couldn't set the value to 2, 3, 4, etc.
Space is cheap; there's nothing efficient about allowing bad data
Try to avoid boolean fields in databases alltogether.
One, the RM has a much better way to represent truth-valued information than via boolean fields : via the presence of a tuple in a table.
Two, boolean fields are very bad discriminators when querying. It's virtually complete madness to index them, so when querying, the presence of boolean fields gives no benefit at all.
Related
I have a table with guid identifier and one field that is a 5 characters string that can be specified by user, but it is optional, and it should be unique per user. I'm looking for a way to have this field always there, even if user doesn't specify it. The easiest approach is to have it like "00001", "00002"... etc. in case that user doesn't specify it, it is stored like this. I'm using SQL and entity framework core. What is the best way to achieve this?
EDIT: maybe trigger that will check after insert if that field is not specified and then just take current row number and convert it to string? does this make sense?
Cheers
Setting a default value to '00001' can be done define the field with:
NOT NULL DEFAULT right('0000' || to_char(SomeSequence.nextval),5) (pseudo-code to be adapted to the DBMS you are connected to).
Compared to the solution in your EDIT, this will at least guarantee that 2 inserts at the same time from 2 different users get assigned different values.
The real problem comes with the unique constraint on the column. This does not work nicely when mixing manual input with calculated values.
If as a user, I input (manually) 00005, then the insertion will fail when SomeSequence reaches 5.
I think this problem will exist regardless of how you implement the generation of values (sequence, trigger, external code, ...)
Even if you are fine with coding some additional (and probably complicated) logic to manage that, it will probably decrease concurrency.
I use a string type for my Id attribute on all my domain abjects. E.g.:
public class Person {
property string Id { get; set; }
// ... more properties
}
no tricks here. null represents a "no-value" value, when a new Person is created and before it is persisted, Id will remain null.
Now there is a discussion to enhance "no-value" space and say that null, empty string and white-space strings are all "no-value" values.
I.e. to check if entity is new instead of doing: if (person.Id == null) it will become if (string.IsNullOrWhiteSpace(person.Id))
In my humble opinion this is a smell or a design principle violation, but I can't figure out which one.
Question: which (if any) design principle does this decision violate (the decision to allow for more than just null to represent no-value value)?
(I think it should be something similar to Occam's razor principle or entropy or KISS, I just not sure)
It does violate the KISS principle. If there is no special need for handling empty strings beside nulls, then why do it? All operations must now check for two values, instead of one. When exploring the DB, a simple SELECT to find "NULL" records, becomes slightly less trivial for no good reason.
Another violated principle is the principle of least surprise - usually people expect only NULL values to represent NULL objects. The design with two special values is less obvious, and less "readable".
If something more should be hidden behind these "second-category-special-objects", then it should be made explicit. Otherwise, it should be trivial to handle empty string input, and store it as NULL to be coherent with the rest of the system.
EDIT:
I've also found another "principle" in Bob Martin's book Clean code - "one word per concept" which is somehow related to this case. Empty string and a NULL are two "words" used for one concept so they clearly violate this guideline.
I'm gonna go out on a limb and say defining both null and "" as the empty String for your application does not violate any design principles and is not a code smell. You need to just clearly define the semantics of the field for your purpose, and you have done so (i.e., in your application, but null and "" mean "no value").
You should have tests that ensure behavior is correct for both null and "".
This is not to say that you also can't make the decision to force all empty strings to null. That is an equally valid decision. You would need to have tests that verify that in all cases where you set the "No value", the actual value is null. You might want to go this way if your persistence layer expects null and only null to indicate no value.
So, in this case, there are no wrong decisions, just decisions.
Outside of the argument of whether or not NULLs should ever be used: I am responsible for an existing database that uses NULL to mean "missing or never entered" data. It is different from empty string, which means "a user set this value, and they selected 'empty'."
Another contractor on the project is firmly on the "NULLs do not exist for me; I never use NULL and nobody else should, either" side of the argument. However, what confuses me is that since the contractor's team DOES acknowledge the difference between "missing/never entered" and "intentionally empty or indicated by the user as unknown," they use a single character 'Z' throughout their code and stored procedures to represent "missing/never entered" with the same meaning as NULL throughout the rest of the database.
Although our shared customer has asked for this to be changed, and I have supported this request, the team cites this as "standard practice" among DBAs far more advanced than I; they are reluctant to change to use NULLs based on my ignorant request alone. So, can anyone help me overcome my ignorance? Is there any standard, or small group of individuals, or even a single loud voice among SQL experts which advocates the use of 'Z' in place of NULL?
Update
I have a response from the contractor to add. Here's what he said when the customer asked for the special values to be removed to allow NULL in columns with no data:
Basically, I designed the database to avoid NULLs whenever possible. Here is the rationale:
• A NULL in a string [VARCHAR] field is never necessary because an empty (zero-length) string furnishes exactly the same information.
• A NULL in an integer field (e.g., an ID value) can be handled by using a value that would never occur in the data (e.g, -1 for an integer IDENTITY field).
• A NULL in a date field can easily cause complications in date calculations. For example, in logic that computes date differences, such as the difference in days between a [RecoveryDate] and an [OnsetDate], the logic will blow up if one or both dates are NULL -- unless an explicit allowance is made for both dates being NULL. That's extra work and extra handling. If "default" or "placeholder" dates are used for [RecoveryDate] and [OnsetDate] (e.g., "1/1/1900") , mathematical calculations might show "unusual" values -- but date logic will not blow up.
NULL handling has traditionally been an area where developers make mistakes in stored procedures.
In my 15 years as a DBA, I've found it best to avoid NULLs wherever possible.
This seems to validate the mostly negative reaction to this question. Instead of applying an accepted 6NF approach to designing out NULLs, special values are used to "avoid NULLs wherever possible." I posted this question with an open mind, and I am glad I learned more about the "NULLs are useful / NULLs are evil" debate, but I am now quite comfortable labeling the 'special values' approach to be complete nonsense.
an empty (zero-length) string furnishes exactly the same information.
No, it doesn't; in the existing database we are modifying, NULL means "never entered" and empty string means "entered as empty".
NULL handling has traditionally been an area where developers make mistakes in stored procedures.
Yes, but those mistakes have been made thousands of times by thousands of developers, and the lessons and caveats for avoiding those mistakes are known and documented. As has been mentioned here: whether you accept or reject NULLs, representation of missing values is a solved problem. There is no need to invent a new solution just because developers continue make easy-to-overcome (and easy-to-identify) mistakes.
As a footnote: I have been a DBE and developer for more than 20 years (which is certainly enough time for me to know the difference beetween a database engineer and a database administrator). Throughout my career I have always been in the "NULLs are useful" camp, though I was aware that several very smart people disagreed. I was extremely skeptical about the "special values" approach, but not well-versed enough in the academics of "How To Avoid NULL the Right Way" to make a firm stand. I always love learning new things—and I still have lots to learn after 20 years. Thanks to all who contributed to make this a useful discussion.
Sack your contractor.
Okay, seriously, this isn't standard practice. This can be seen simply because all RDBMS that I have ever worked with implement NULL, logic for NULL, take account of NULL in foreign keys, have different behaviour for NULL in COUNT, etc, etc.
I would actually contend that using 'Z' or any other place holder is worse. You still require code to check for 'Z'. But you also need to document that 'Z' doesn't mean 'Z', it means something else. And you have to ensure that such documentation is read. And then what happens if 'Z' ever becomes a valid piece of data? (Such as a field for an initial?)
At a basic level, even without debating the validity of NULL vs 'Z', I would insist that the contractor conforms to standard practices that exist within your company, not his. Instituting his standard practice in an environment with an alternative standard practice will cause confusion, maintenance overheads, mis-understanding, and in the end increased costs and mistakes.
EDIT
There are cases where using an alternative to NULL is valid in my opinion. But only where doing so reduces code, rather than creating special cases which require accounting for.
I've used that for date bound data, for example. If data is valid between a start-date and an end-date, code can be simplified by not having NULL values. Instead a NULL start-date could be replaced with '01 Jan 1900' and a NULL end-date could be replaced with '31 Dec 2079'.
This still can change behaviour from what may be expected, and so should be used with care:
WHERE end-date IS NULL no longer give data that is still valid
You just created your own millennium bug
etc.
This is equivalent to reforming abstractions such that all properties can always have valid values. It is markedly different from implicitly encoding specific meaning into arbitrarily chosen values.
Still, sack the contractor.
This is easily one of the weirdest opinions I've ever heard. Using a magic value to represent "no data" rather than NULL means that every piece of code that you have will have to post-process the results to account/discard the "no-data"/"Z" values.
NULL is special because of the way that the database handles it in queries. For instance, take these two simple queries:
select * from mytable where name = 'bob';
select * from mytable where name != 'bob';
If name is ever NULL, it obviously won't show up in the first query's results. More importantly, neither will it show up in the second queries results. NULL doesn't match anything other than an explicit search for NULL, as in:
select * from mytable where name is NULL;
And what happens when the data could have Z as a valid value? Let's say you're storing someone's middle initial? Would Zachary Z Zonkas be lumped in with those people with no middle initial? Or would your contractor come up with yet another magic value to handle this?
Avoid magic values that require you to implement database features in code that the database is already fully capable of handling. This is a solved and well understood problem, and it may just be that your contractor never really grokked the notion of NULL and therefore avoids using it.
If the domain allows missing values, then using NULL to represent 'undefined' is perfectly OK (that's what it is there for). The only downside is that code that consumes the data has to be written to check for NULLs. This is the way I've always done it.
I have never heard of (or seen in practice) the use of 'Z' to represent missing data. As to "the contractor cites this as 'standard practice' among DBAs", can he provide some evidence of that assertion? As #Dems mentioned, you also need to document that 'Z' doesn't mean 'Z': what about a MiddleInitial column?
Like Aaron Alton and many others, I believe that NULL values are an integral part of database design, and should be used where appropriate.
Even if you somehow manage to explain to all your current and future developers and DBAs about "Z" instead of NULL, and even if they code everything perfectly, you will still confuse the optimizer because it will not know that you've cooked this up.
Using a special value to represent NULL (which is already a special value to represent NULL) will result in skews in the data. e.g. So many things happened on 1-Jan-1900 that it will throw out the optimizer's ability to understand that actual range of dates that really are relevant to your application.
This is like a manager deciding: "Wearing a tie is bad for productivity, so we're all going to wear masking tape around our necks. Problem solved."
I've never heard about the wide-spread use of 'Z' as a substitute for NULL.
(Incidentally, I'd not particularly like to work with a contractor who tells you in the face that they and other "advanced" DBAs are so much more knowledgeable and better than you.)
+=================================+
| FavoriteLetters |
+=================================+
| Person | FavoriteLetter |
+--------------+------------------+
| 'Anna' | 'A' |
| 'Bob' | 'B' |
| 'Claire' | 'C' |
| 'Zaphod' | 'Z' |
+---------------------------------+
How would your contractor interpret the data from the last row?
Probably he would choose a different "magic value" in this table to avoid collision with the real data 'Z'? Meaning you'd have to remember several magic values and also which one is used where... how is this better than having just one magic token NULL, and having to remember the three-valued logic rules (and pitfalls) that go with it? NULL at least is standardized, unlike your contractor's 'Z'.
I don't particularly like NULL either, but mindlessly substituting it with an actual value (or worse, with several actual values) everywhere is almost definitely worse than NULL.
Let me repeat my above comment here for better visibility: If you want to read something serious and well-grounded by people who are against NULL, I would recommend the short article "How to handle missing information without using NULLs" (links to a PDF from The Third Manifesto homepage).
Nothing in principle requires nulls for correct database design. In fact there are plenty of databases designed without using null and there are plenty of very good database designers and whole development teams who design databases without using nulls. In general it's a good thing to be cautious about adding nulls to a database because they inevitably lead to incorrect or ambiguous results later on.
I've not heard of using Z being called "standard practice" as a placeholder value instead of nulls but I expect your contractor is referring to the concept of sentinel values in general, which are sometimes used in database design. However, a much more common and flexible way to avoid nulls without using "dummy" data is simply to design them out. Decompose the table such that each type of fact is recorded in a table that doesn't have "extra", unspecified attributes.
In reply to contractors comments
Empty string <> NULL
Empty string requires 2 bytes storage + an offset read
NULL uses null bitmap = quicker
IDENTITY doesn't always start at 1 (why waste half your range?)
The whole concept is flawed as per most other answers here
While I have never seen 'Z' as a magic value to represent null, I have seen 'X' used to represent a field that has not been filled in. That said, I have only ever seen this in one place, and my interface to it was not a database, but rather an XML file… so I would not be prepared to use this an argument for being common practice.
Note that we do have to handle the 'X' specially, and, as Dems mentioned, we do have to document it, and people have been confused by it. In our defence, this is forced on us by an external supplier, not something that we cooked up ourselves!
The AdventureWorks Data Dictionary specifies that the [EmailPromotion] column in the [Contact] table is an int and that:
0 = Contact does not wish to receive e-mail promotions.
1 = Contact does wish to receive e-mail promotions.
and [Employee].[CurrentFlag] uses bit as follows:
0 = Inactive
1 = Active
My question has two parts:
Is there a good reason to use the int datatype in place bit (both uses will be documented)?
What naming conventions for boolean and boolean-like columns do you recommend? (e.g. IsActive, ActiveFlag, Active)
For the first part they are probably allowing for future status codes (although with that column name it is hard to imagine what that could be...).
You will find a lot of dissent on your second question, but I prefer IsActive as a name in this case. I find it reads well, and prevents double negatives in code that you would get if you used something like IsInactive.
Is there a good reason to use the int datatype in place bit (both uses will be documented)?
INT is consistently supported. While BIT is becoming more common, I'll wager the support is just a mask for an INT column with a CHECK constraint. There's a good asktom question about why Oracle doesn't have a specific BIT/BOOLEAN data type...
What naming conventions for boolean and boolean-like columns do you recommend? (e.g. IsActive, ActiveFlag, Active)
Same as programming, they should be prefixed with "is" and when read, should present a yes/no question to infer the column is a boolean indicator. So "isActive" would be my decision, but I would be probing to see if the situation didn't require a STATUS table & foreign key.
Whatever floats your boat. The only important thing is naming convention and consistency across the database.
Using int for a flag isn't justified, I would've used tinyint, I hardly ever use bit at all, because I'm always open for all kinds of possibilities out there.
I usually name those columns Active (if there are only two statuses) and StatusId (if there is more than one status).
We generally prefer to have all our varchar/nvarchar columns non-nullable with a empty string ('') as a default value. Someone on the team suggested that nullable is better because:
A query like this:
Select * From MyTable Where MyColumn IS NOT NULL
is faster than this:
Select * From MyTable Where MyColumn == ''
Anyone have any experience to validate whether this is true?
On some platforms (and even versions), this is going to depend on how NULLs are indexed.
My basic rule of thumb for NULLs is:
Don't allow NULLs until justified
Don't allow NULLs unless the data can really be unknown
A good example of this is modeling address lines. If you have an AddressLine1 and AddressLine2, what does it mean for the first to have data and the second to be NULL? It seems to me, you either know the address or not, and having partial NULLs in a set of data just asks for trouble when somebody concatenates them and gets NULL (ANSI behavior). You might solve this with allowing NULLs and adding a check constraint - either all the Address information is NULL or none is.
Similar thing with middle initial/name. Some people don't have one. Is this different from it being unknown and do you care?
ALso, date of death - what does NULL mean? Not dead? Unknown date of death? Many times a single column is not sufficient to encode knowledge in a domain.
So to me, whether to allow NULLs would depend very much on the semantics of the data first - performance is going to be second, because having data misinterpreted (potentially by many different people) is usually a far more expensive problem than performance.
It might seem like a little thing (in SQL Server the implementation is a bitmask stored with the row), but only allowing NULLs after justification seems to me to work best. It catches things early in development, forces you to address assumptions and understand your problem domain.
If you want to know that there is no value, use NULL.
As for speed, IS NULL should be faster, because it doesn't use string comparison.
If you need NULL, use NULL. Ditto empty string.
As for performance, "it depends"
If you have varchar, you are storing an actual value in the row for the length. If you have char, then you store the actual length. NULL won't be stored in-row depending on the engine (NULL bitmap for SQL Server for example).
This means IS NULL is quicker, query for query, but it could add COALESCE/NULLIF/ISNULL complexity.
So, your colleague is partially correct but may not appreciate it fully.
Blindly using empty string is use of a sentinel value rather then working through the NULL semantic issue
FWIW and personally:
I would tend to use NULL but don't always. I like to avoid dates like 31 Dec 9999 which is where NULL avoidance leads you.
From Cade Roux's answer... I also find that discussions about "Is date of death nullable" pointless. For an field, in practical terms, either there is a value or there isn't.
Sentinel values are worse then NULLs. Magic numbers. anyone?
Tell that guy on your team to get his prematurely optimizin' head out of his ass! (But in a nice way).
Developers like that can be poison to the team, full of low-level optimization myths, all of which may be true or have been true at one point in time for some specific vendor or query pattern, or possibly only true in theory but never true in practice. Acting upon these myths is a costly waste of time, and can destroy an otherwise good design.
He probably means well and wants to contribute his knowledge to the team. Unfortunately, he is wrong. Not wrong in the sense of whether a benchmark will prove his statement correct or incorrect. He's wrong in the sense that this is not how you design a database. The question of whether to make a field NULL-able is a question about domain of the data for the purposes of defining the type of the field. It should be answered in terms of what it means for the field to have no value.
In a nutshell, NULL = UNKNOWN!.. Which means (using date of death example) that the entity could be 1)alive, 2)dead but date of death is not known, or 3)unknown if entity is dead or alive. For numeric columns I always default them to 0 (ZERO) because somewhere along the line you may have to perform aggregate calculations and NULL + 123 = NULL. For alphanumerics I use NULL since its least expensive performance-wise and easier to say '...where a IS NULL' than saying '...where a = "" '. Using '...where a = " "[space]' is not a good idea because [space] is not a NULL! For dates, if you have to leave a date column NULL, you may want to add a status indicator column which, in the above example, A=Alive, D=Dead, Q=Dead, date of death not known, N=Alive or Dead is unknown.