Standard use of 'Z' instead of NULL to represent missing data? - sql

Outside of the argument of whether or not NULLs should ever be used: I am responsible for an existing database that uses NULL to mean "missing or never entered" data. It is different from empty string, which means "a user set this value, and they selected 'empty'."
Another contractor on the project is firmly on the "NULLs do not exist for me; I never use NULL and nobody else should, either" side of the argument. However, what confuses me is that since the contractor's team DOES acknowledge the difference between "missing/never entered" and "intentionally empty or indicated by the user as unknown," they use a single character 'Z' throughout their code and stored procedures to represent "missing/never entered" with the same meaning as NULL throughout the rest of the database.
Although our shared customer has asked for this to be changed, and I have supported this request, the team cites this as "standard practice" among DBAs far more advanced than I; they are reluctant to change to use NULLs based on my ignorant request alone. So, can anyone help me overcome my ignorance? Is there any standard, or small group of individuals, or even a single loud voice among SQL experts which advocates the use of 'Z' in place of NULL?
Update
I have a response from the contractor to add. Here's what he said when the customer asked for the special values to be removed to allow NULL in columns with no data:
Basically, I designed the database to avoid NULLs whenever possible. Here is the rationale:
• A NULL in a string [VARCHAR] field is never necessary because an empty (zero-length) string furnishes exactly the same information.
• A NULL in an integer field (e.g., an ID value) can be handled by using a value that would never occur in the data (e.g., -1 for an integer IDENTITY field).
• A NULL in a date field can easily cause complications in date calculations. For example, in logic that computes date differences, such as the difference in days between a [RecoveryDate] and an [OnsetDate], the logic will blow up if one or both dates are NULL -- unless an explicit allowance is made for both dates being NULL. That's extra work and extra handling. If "default" or "placeholder" dates are used for [RecoveryDate] and [OnsetDate] (e.g., "1/1/1900"), mathematical calculations might show "unusual" values -- but date logic will not blow up.
NULL handling has traditionally been an area where developers make mistakes in stored procedures.
In my 15 years as a DBA, I've found it best to avoid NULLs wherever possible.
This seems to validate the mostly negative reaction to this question. Instead of applying an accepted 6NF approach to designing out NULLs, special values are used to "avoid NULLs wherever possible." I posted this question with an open mind, and I am glad I learned more about the "NULLs are useful / NULLs are evil" debate, but I am now quite comfortable labeling the 'special values' approach to be complete nonsense.
an empty (zero-length) string furnishes exactly the same information.
No, it doesn't; in the existing database we are modifying, NULL means "never entered" and empty string means "entered as empty".
NULL handling has traditionally been an area where developers make mistakes in stored procedures.
Yes, but those mistakes have been made thousands of times by thousands of developers, and the lessons and caveats for avoiding those mistakes are known and documented. As has been mentioned here: whether you accept or reject NULLs, representation of missing values is a solved problem. There is no need to invent a new solution just because developers continue to make easy-to-overcome (and easy-to-identify) mistakes.
As a footnote: I have been a DBE and developer for more than 20 years (which is certainly enough time for me to know the difference between a database engineer and a database administrator). Throughout my career I have always been in the "NULLs are useful" camp, though I was aware that several very smart people disagreed. I was extremely skeptical about the "special values" approach, but not well-versed enough in the academics of "How To Avoid NULL the Right Way" to make a firm stand. I always love learning new things—and I still have lots to learn after 20 years. Thanks to all who contributed to make this a useful discussion.

Sack your contractor.
Okay, seriously, this isn't standard practice. This can be seen simply because all RDBMS that I have ever worked with implement NULL, logic for NULL, take account of NULL in foreign keys, have different behaviour for NULL in COUNT, etc, etc.
I would actually contend that using 'Z' or any other place holder is worse. You still require code to check for 'Z'. But you also need to document that 'Z' doesn't mean 'Z', it means something else. And you have to ensure that such documentation is read. And then what happens if 'Z' ever becomes a valid piece of data? (Such as a field for an initial?)
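To make that overhead concrete, here is a minimal sketch (table and column names are made up) of what every consumer of the data ends up writing under the two conventions:

-- Sentinel convention: every query must know (and document) that 'Z' means "missing"
SELECT * FROM Patients WHERE MiddleInitial <> 'Z';        -- and hope 'Z' never becomes real data

-- NULL convention: the intent is visible in the query itself
SELECT * FROM Patients WHERE MiddleInitial IS NOT NULL;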
At a basic level, even without debating the validity of NULL vs 'Z', I would insist that the contractor conforms to standard practices that exist within your company, not his. Instituting his standard practice in an environment with an alternative standard practice will cause confusion, maintenance overheads, misunderstanding, and in the end increased costs and mistakes.
EDIT
There are cases where using an alternative to NULL is valid in my opinion. But only where doing so reduces code, rather than creating special cases which require accounting for.
I've used that for date bound data, for example. If data is valid between a start-date and an end-date, code can be simplified by not having NULL values. Instead a NULL start-date could be replaced with '01 Jan 1900' and a NULL end-date could be replaced with '31 Dec 2079'.
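A sketch of how this plays out (table and column names assumed; the sentinel dates are the ones above):

-- With sentinel bounds, the validity check is a single BETWEEN with no special cases:
SELECT *
FROM PriceList
WHERE '2015-06-01' BETWEEN StartDate AND EndDate
-- With NULLable bounds, the same check needs explicit NULL handling:
-- WHERE (StartDate IS NULL OR StartDate <= '2015-06-01')
--   AND (EndDate   IS NULL OR EndDate   >= '2015-06-01')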
This still can change behaviour from what may be expected, and so should be used with care:
WHERE end-date IS NULL no longer returns data that is still valid
You just created your own millennium bug
etc.
This is equivalent to reforming abstractions such that all properties can always have valid values. It is markedly different from implicitly encoding specific meaning into arbitrarily chosen values.
Still, sack the contractor.

This is easily one of the weirdest opinions I've ever heard. Using a magic value to represent "no data" rather than NULL means that every piece of code you have will have to post-process the results to account for or discard the "no-data"/"Z" values.
NULL is special because of the way that the database handles it in queries. For instance, take these two simple queries:
select * from mytable where name = 'bob';
select * from mytable where name != 'bob';
If name is ever NULL, it obviously won't show up in the first query's results. More importantly, neither will it show up in the second query's results. NULL doesn't match anything other than an explicit search for NULL, as in:
select * from mytable where name is NULL;
And what happens when the data could have Z as a valid value? Let's say you're storing someone's middle initial? Would Zachary Z Zonkas be lumped in with those people with no middle initial? Or would your contractor come up with yet another magic value to handle this?
Avoid magic values that require you to implement database features in code that the database is already fully capable of handling. This is a solved and well understood problem, and it may just be that your contractor never really grokked the notion of NULL and therefore avoids using it.

If the domain allows missing values, then using NULL to represent 'undefined' is perfectly OK (that's what it is there for). The only downside is that code that consumes the data has to be written to check for NULLs. This is the way I've always done it.
I have never heard of (or seen in practice) the use of 'Z' to represent missing data. As to "the contractor cites this as 'standard practice' among DBAs", can he provide some evidence of that assertion? As @Dems mentioned, you also need to document that 'Z' doesn't mean 'Z': what about a MiddleInitial column?
Like Aaron Alton and many others, I believe that NULL values are an integral part of database design, and should be used where appropriate.

Even if you somehow manage to explain to all your current and future developers and DBAs about "Z" instead of NULL, and even if they code everything perfectly, you will still confuse the optimizer because it will not know that you've cooked this up.
Using a special value to represent NULL (which is already a special value to represent NULL) will result in skews in the data. E.g., so many things happened on 1-Jan-1900 that it will throw out the optimizer's ability to understand the actual range of dates that is really relevant to your application.
This is like a manager deciding: "Wearing a tie is bad for productivity, so we're all going to wear masking tape around our necks. Problem solved."

I've never heard about the wide-spread use of 'Z' as a substitute for NULL.
(Incidentally, I'd not particularly like to work with a contractor who tells you to your face that they and other "advanced" DBAs are so much more knowledgeable and better than you.)
+=================================+
|         FavoriteLetters         |
+=================================+
| Person       | FavoriteLetter   |
+--------------+------------------+
| 'Anna'       | 'A'              |
| 'Bob'        | 'B'              |
| 'Claire'     | 'C'              |
| 'Zaphod'     | 'Z'              |
+--------------+------------------+
How would your contractor interpret the data from the last row?
Probably he would choose a different "magic value" in this table to avoid collision with the real data 'Z'? Meaning you'd have to remember several magic values and also which one is used where... how is this better than having just one magic token NULL, and having to remember the three-valued logic rules (and pitfalls) that go with it? NULL at least is standardized, unlike your contractor's 'Z'.
I don't particularly like NULL either, but mindlessly substituting it with an actual value (or worse, with several actual values) everywhere is almost definitely worse than NULL.
Let me repeat my above comment here for better visibility: If you want to read something serious and well-grounded by people who are against NULL, I would recommend the short article "How to handle missing information without using NULLs" (links to a PDF from The Third Manifesto homepage).

Nothing in principle requires nulls for correct database design. In fact there are plenty of databases designed without using null and there are plenty of very good database designers and whole development teams who design databases without using nulls. In general it's a good thing to be cautious about adding nulls to a database because they inevitably lead to incorrect or ambiguous results later on.
I've not heard of the use of 'Z' as a placeholder value instead of nulls being called "standard practice", but I expect your contractor is referring to the concept of sentinel values in general, which are sometimes used in database design. However, a much more common and flexible way to avoid nulls without using "dummy" data is simply to design them out. Decompose the table such that each type of fact is recorded in a table that doesn't have "extra", unspecified attributes.
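For example, rather than a nullable column, the optional fact can be moved to its own table (a sketch with assumed names):

-- Instead of Employee(..., TerminationDate NULL):
CREATE TABLE Employee (
    EmployeeId INT PRIMARY KEY,
    Name       VARCHAR(100) NOT NULL
);
CREATE TABLE EmployeeTermination (
    EmployeeId      INT PRIMARY KEY REFERENCES Employee (EmployeeId),
    TerminationDate DATE NOT NULL      -- a row exists only when the fact is known
);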

In reply to the contractor's comments:
• Empty string <> NULL
• An empty string requires 2 bytes of storage + an offset read
• NULL uses the null bitmap = quicker
• IDENTITY doesn't always start at 1 (why waste half your range?)
• The whole concept is flawed, as per most other answers here

While I have never seen 'Z' as a magic value to represent null, I have seen 'X' used to represent a field that has not been filled in. That said, I have only ever seen this in one place, and my interface to it was not a database, but rather an XML file… so I would not be prepared to use this as an argument for it being common practice.
Note that we do have to handle the 'X' specially, and, as Dems mentioned, we do have to document it, and people have been confused by it. In our defence, this is forced on us by an external supplier, not something that we cooked up ourselves!

Related

Why is null an absorbing element on relations?

null is the lack of value, or, more theatrically, it is the unknown. From here, it is perfectly logical that null + a, null * a, null / a, etc. result in null. This means that null is an absorbing element for these operations. I wonder why it has to be an absorbing element for relations as well. null > 5 could be considered to be false as well, with an explanation at least as plausible as we can give for the current behavior. Currently we can say that null > 5 is null, since the unknown might be greater than 5, or not, so the result is the unknown. But if it was false, then we could say that null > 5 is false, since the lack of value is not greater than 5.
Take a look at these queries:
select *
from books
where author like 'Alex%'
This will return all the books, which have their author starting with Alex. Let us see the other books:
select *
from books
where author not like 'Alex%'
This will return all the books where author does not start with Alex, right? Wrong! It will return all the books which have an author value which does not start with Alex. If we want to select the books whose author does not start with Alex, we have to explicitly include null values, like this:
select *
from books
where (author is null) or (author not like 'Alex%')
This seems to be an unnecessary complication to me, which could be sorted out in future versions. But the question is: what is the explanation of this behavior? Why do we use null as the unknown instead of the lack of value?
Why do we use null as the unknown instead of lack of value?
Part of the foundation of the Relational Model is predicate logic. While there are logics that have more than two values (true & false), the simplest and best defined, not to mention most familiar, is 2-valued: Boolean logic.
For reasons of industrial acceptance, into that fine mathematical model SQL introduced NULL. In Boolean logic we can prove the value of arbitrary expressions like NOT(A AND B), but there's no provision for missing values. Missing values are, quite simply, outside the domain of Boolean logic.
Having left academe behind, SQL makes arbitrary choices. What is the sum of N NULLs? NULL. What is the count of N NULLs? 0. Is a value greater or lesser than NULL? To sort, it has to be one or the other. Are two NULLs distinct, or identical, in GROUP BY? The SQL choices all "make sense" at some level, even when implementations contradict each other. There's no right answer, because they're extra-logical.
So the answer to your question really is, because that's what the vendors chose. The unknown has no more meaning, logically, than lack of value. You could make an argument to treat NULL differently. It might win you a beer. If you want to see it manifested in a DBMS, though, you'll have to get it implemented.
This seems to be an unnecessary complication
You might be right, but you won't be surprised to learn that in 40 years many people have proposed your solution, namely X = NULL is false. The community settled on X = NULL is NULL, avoiding an implicit conversion. Considering how deeply nested and complicated SQL queries can be, that's probably a good thing.
CJ Date takes the position that NULL should be abolished, and all missing values should have a default value. I take exception to that for three reasons:
Missingness is meaningful. If I record a default value for a missing one, I need another column (is_missing) to record its missingness.
Default values can be picked up in computations by mistake. Any use of a complementary is_missing column is ad hoc and outside the purview of the logic engine.
The "right default" varies by context. Sometimes, the "previous" known value is sufficient (because, say, yesterday's price might stand for today's, absent better information). Sometimes there's a known proxy, like average lifespan. Sometimes it's zero, as in a covariance matrix. And sometimes there's no good default: the "value" should be excluded because it's missing.
I have a pet solution, too, that's both simple and strict. I would like to see an SQL option, say, SET STRICT_BOOLEAN ON that would treat missing values as errors for logical and computational purposes. You can insert a NULL; you can select one. You cannot compare one or add one or concatenate one. To do those things, you must supply a default (appropriate to your context) with COALESCE or similar. Any "undefaulted" use of NULL simply raises an error, just like divide by zero does. And for the same reason: like zero as a divisor, NULL in logic is outside the domain.
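Until something like that exists, the defaulting has to be supplied by hand; a minimal sketch (table and column names are made up):

SELECT OrderId,
       Freight + Surcharge              AS naive_total,      -- NULL whenever Surcharge is NULL
       Freight + COALESCE(Surcharge, 0) AS defaulted_total   -- explicit, context-appropriate default
FROM Orders;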
I have not read all the answers... but I believe this can help if you are using Oracle. Oracle has implemented the LNNVL function since Oracle 10 to deal with this.
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions078.htm
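For instance, reusing the books example from earlier (a sketch): LNNVL(condition) is true when the condition is false or unknown, so rows with a NULL author are included without a separate IS NULL test:

SELECT *
FROM books
WHERE LNNVL(author LIKE 'Alex%');   -- books whose author does not start with 'Alex', including NULL authors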

SQL: What is the purpose of having "basically-null" default values?

So I am working with a table with a lot of default values:
CREATE TABLE BLAH.BLAH
(
...
A NUMBER(19) DEFAULT 0 NOT NULL,
B VARCHAR2(50) DEFAULT ' ' NOT NULL,
C TIMESTAMP(6) DEFAULT '01-JAN-1900' NOT NULL,
...
)
I am just wondering if there is any logical purpose for setting such null-like defaults to columns that would be MUCH better (in my opinion) as being set to actual NULL.
EDIT: I am mostly irked with the varchar2 default. The others are a bit more reasonable and easier to work with. It's just a pain when a lot of the code involves trimming, and I end up getting NULL where I'm expecting a single space.
It doesn't make much sense if you understand nulls and are used to dealing with them.
In its own crazy world it has some kind of logic, since if you never have nulls the values are easier to map to programming languages (all have an integer type, but not all have a nullable integer type) and comparing for 'equality' is easier (0 = 0, but it is not the case that null = null). But the tradeoff is that you then have to handle these magic default values with boilerplate checks in your program and in any SQL queries.
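For example, with the ' ' default in the table above, a "no data" check in Oracle ends up needing trimming boilerplate, whereas a nullable column would not:

SELECT * FROM BLAH.BLAH WHERE TRIM(B) IS NULL;   -- Oracle: trimming the single-space default yields NULL
SELECT * FROM BLAH.BLAH WHERE B IS NULL;         -- what it would be if the column were simply nullable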
It could also be the result of some misapplied, cargo-culted 'coding standard' that forbids the use of nullable columns in the database, combined with a need to have some kind of 'unknown' value anyway.
It is true that the inclusion of nulls in SQL (and in the relational model in general) adds complexity and has been controversial. (I do not intend to give a full rundown with citations here or to argue for any particular point of view; I'm just saying that these arguments exist.) Some suggest that if you need nullable columns you should instead fix your data model to be fully normalized; Codd thought that a single null wasn't adequate and there should be separate 'missing' and 'not applicable'. But I'm sure that nobody who advocated getting rid of null suggested replacing it with a bogus 0 or 1900-01-01 value instead... So again it could be a case of misunderstanding and misapplying some design rule about not using nulls.
This kind of twisted thought process is quite common - I inherited a database where the rules forbade nullable columns (and there were APIs depending on that) but the string '?' was used instead. In fact, there were meant to be '?' and '/' for 'missing' and 'not applicable' - following Codd's suggestion - but you can guess how much real-world data or code observed that distinction and how useful it was in practice.
Is there a reason? There may be in some development groups. When you say:
select *
from blah
where timestamp <> date '2015-01-01'
You might want the query to return the default values. If the default is NULL, then the query will not return them. If the default is a date far in the past, then it will.
Similarly, you might not want to clutter code with lots of or X is null. This can be more than an aesthetic issue. The presence of or in where and on provides opportunities for code mistakes (due to missing parentheses) and can confuse the optimizer.
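For comparison, the NULL-default version of the same query needs the extra predicate (and the parentheses) referred to above:

select *
from blah
where (timestamp <> date '2015-01-01' or timestamp is null)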
I am not saying that the use of such defaults is a best practice. NULL is such an important part of the SQL language that anyone writing production code should know how to deal with it. However, there are good reasons for the style you mention.

Is NULL an appropriate value for an applicable, known non-value in a record? (i.e. the value is known to be "no value")

Wikipedia says NULL is for representing "missing information and inapplicable information".
I have always understood NULL to be appropriate to represent "Unknown" information, as in the field is applicable to the record, it is just an unknown value.
Given the problems that NULL introduces to queries (3-point logic and tons of exceptions that have to be accounted for) should NULL still be used in records for fields that are not applicable to a record? Or should it just be used for applicable fields whose values are unknown for a given record?
I too accept null to mean "unknown", but "inapplicable" also fits: for example, when saving the CEO's employee record, we set employee_manager_id = null because the CEO has no manager.
One technique to avoid the hassle of nulls is to use a special value instead, for example saving -1 for a person.age, but then you have a bit more complexity checking this value. When using a special value for a foreign key (as in the manager id example), you actually have to create a dummy record with id = 0 for example, and this may introduce problems when processing "all employees" - you must skip the dummy record.
Personally, I think things stay cleaner just using null and suffering the hassle of more complex SQL - at least everyone knows what's going on.
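A sketch of the two approaches for the CEO row (table and column names assumed from the manager example above):

-- With NULL: no extra rows needed
INSERT INTO employee (employee_id, name, employee_manager_id) VALUES (1, 'Alice (CEO)', NULL);

-- With a sentinel: a dummy "no manager" row must exist, and every "all employees" query must skip it
INSERT INTO employee (employee_id, name, employee_manager_id) VALUES (0, 'NO MANAGER', 0);
INSERT INTO employee (employee_id, name, employee_manager_id) VALUES (1, 'Alice (CEO)', 0);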
In fact, Null does not connote 'unknown' so much as just 'no data'. It is used in SQL (and other environments) where data is simply absent for a given field.
With regard to your concern about 3-point logic and exceptions, your application is probably making use of more language than just sql. The code of your system that interfaces with SQL should be handling the question of what to do with Null fields.
If Null is simply unacceptable (i.e. you can't have your data structure without a non-null value), then you had better avoid the concepts of 'unknown' and 'no data' altogether. Make the field required by declaring the column NOT NULL; that way NULL cannot be entered as a valid value. E.g.
CREATE TABLE foo (bar INT NOT NULL);

Nullable vs. non-null varchar data types - which is faster for queries?

We generally prefer to have all our varchar/nvarchar columns non-nullable with an empty string ('') as a default value. Someone on the team suggested that nullable is better because:
A query like this:
Select * From MyTable Where MyColumn IS NOT NULL
is faster than this:
Select * From MyTable Where MyColumn = ''
Anyone have any experience to validate whether this is true?
On some platforms (and even versions), this is going to depend on how NULLs are indexed.
My basic rule of thumb for NULLs is:
Don't allow NULLs until justified
Don't allow NULLs unless the data can really be unknown
A good example of this is modeling address lines. If you have an AddressLine1 and AddressLine2, what does it mean for the first to have data and the second to be NULL? It seems to me, you either know the address or not, and having partial NULLs in a set of data just asks for trouble when somebody concatenates them and gets NULL (ANSI behavior). You might solve this with allowing NULLs and adding a check constraint - either all the Address information is NULL or none is.
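A sketch of such a check constraint (table and column names are assumed for illustration):

ALTER TABLE Customers ADD CONSTRAINT CK_Address_AllOrNothing
CHECK (
    (AddressLine1 IS NULL AND AddressLine2 IS NULL AND City IS NULL)      -- no address captured at all
 OR (AddressLine1 IS NOT NULL AND City IS NOT NULL)                       -- or a usable address is present
);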
Similar thing with middle initial/name. Some people don't have one. Is this different from it being unknown and do you care?
Also, date of death - what does NULL mean? Not dead? Unknown date of death? Many times a single column is not sufficient to encode knowledge in a domain.
So to me, whether to allow NULLs would depend very much on the semantics of the data first - performance is going to be second, because having data misinterpreted (potentially by many different people) is usually a far more expensive problem than performance.
It might seem like a little thing (in SQL Server the implementation is a bitmask stored with the row), but only allowing NULLs after justification seems to me to work best. It catches things early in development, forces you to address assumptions and understand your problem domain.
If you want to know that there is no value, use NULL.
As for speed, IS NULL should be faster, because it doesn't use string comparison.
If you need NULL, use NULL. Ditto empty string.
As for performance, "it depends"
If you have varchar, you are storing an actual value in the row for the length. If you have char, then you store the actual length. NULL won't be stored in-row depending on the engine (NULL bitmap for SQL Server for example).
This means IS NULL is quicker, query for query, but it could add COALESCE/NULLIF/ISNULL complexity.
So, your colleague is partially correct but may not appreciate it fully.
Blindly using empty string is use of a sentinel value rather than working through the NULL semantic issue
FWIW and personally:
I would tend to use NULL but don't always. I like to avoid dates like 31 Dec 9999 which is where NULL avoidance leads you.
From Cade Roux's answer... I also find discussions about "Is date of death nullable?" pointless. For any field, in practical terms, either there is a value or there isn't.
Sentinel values are worse than NULLs. Magic numbers, anyone?
Tell that guy on your team to get his prematurely optimizin' head out of his ass! (But in a nice way).
Developers like that can be poison to the team, full of low-level optimization myths, all of which may be true or have been true at one point in time for some specific vendor or query pattern, or possibly only true in theory but never true in practice. Acting upon these myths is a costly waste of time, and can destroy an otherwise good design.
He probably means well and wants to contribute his knowledge to the team. Unfortunately, he is wrong. Not wrong in the sense of whether a benchmark will prove his statement correct or incorrect. He's wrong in the sense that this is not how you design a database. The question of whether to make a field NULL-able is a question about domain of the data for the purposes of defining the type of the field. It should be answered in terms of what it means for the field to have no value.
In a nutshell, NULL = UNKNOWN! Which means (using the date-of-death example) that the entity could be 1) alive, 2) dead but with the date of death not known, or 3) unknown whether the entity is dead or alive. For numeric columns I always default them to 0 (zero) because somewhere along the line you may have to perform aggregate calculations and NULL + 123 = NULL. For alphanumerics I use NULL since it's least expensive performance-wise and it's easier to say '...where a IS NULL' than '...where a = "" '. Using '...where a = " " [space]' is not a good idea because [space] is not NULL! For dates, if you have to leave a date column NULL, you may want to add a status indicator column where, in the above example, A=Alive, D=Dead, Q=Dead with date of death not known, N=Alive or Dead is unknown.
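For what it's worth, a small illustration of the arithmetic behaviour mentioned above (values made up): row-level expressions propagate NULL, while standard aggregates skip it.

SELECT 123 + NULL;                                    -- NULL: row-level arithmetic propagates NULL
SELECT SUM(x) FROM (VALUES (123), (NULL)) AS t(x);    -- 123: SUM() ignores the NULL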

Why does NULL = NULL evaluate to false in SQL server

In SQL server if you have nullParam=NULL in a where clause, it always evaluates to false. This is counterintuitive and has caused me many errors. I do understand the IS NULL and IS NOT NULL keywords are the correct way to do it. But why does SQL server behave this way?
Think of the null as "unknown" in that case (or "does not exist"). In either of those cases, you can't say that they are equal, because you don't know the value of either of them. So, null=null evaluates to not true (false or null, depending on your system), because you don't know the values to say that they ARE equal. This behavior is defined in the ANSI SQL-92 standard.
EDIT:
This depends on your ANSI_NULLS setting. If you have ANSI_NULLS off, this WILL evaluate to true. Run the following code for an example...
set ansi_nulls off
if null = null
print 'true'
else
print 'false'
set ansi_nulls ON
if null = null
print 'true'
else
print 'false'
How old is Frank? I don't know (null).
How old is Shirley? I don't know (null).
Are Frank and Shirley the same age?
Correct answer should be "I don't know" (null), not "no", as Frank and Shirley might be the same age, we simply don't know.
Here I will hopefully clarify my position.
That NULL = NULL evaluates to FALSE is wrong. Hacker and Mister correctly answered NULL.
Here is why. Dewayne Christensen wrote to me, in a comment to Scott Ivey:
Since it's December, let's use a
seasonal example. I have two presents
under the tree. Now, you tell me if I
got two of the same thing or not.
They can be different or they can be equal; you don't know until you open both presents. Who knows? You may have invited two people who don't know each other and who both gave you the same gift - rare, but not impossible §.
So the question: are these two UNKNOWN presents the same (equal, =)? The correct answer is: UNKNOWN (i.e. NULL).
This example was intended to demonstrate that "..(false or null, depending on your system).." is not a correct answer - only NULL is correct in 3VL (or is it OK for you to accept a system which gives wrong answers?)
A correct answer to this question must emphasize these two points:
three-valued logic (3VL) is counterintuitive (see countless other questions on this subject on Stack Overflow and in other forums to make sure);
SQL-based DBMSes often do not even respect 3VL; they sometimes give wrong answers (as, the original poster asserts, SQL Server does in this case).
So I reiterate: SQL does no good by forcing one to interpret the reflexive property of equality, which states that:
for any x, x = x §§ (in plain English: whatever the universe of discourse, a "thing" is always equal to itself).
.. in a 3VL (TRUE, FALSE, NULL). People's expectations conform to 2VL (TRUE, FALSE - which even in SQL is valid for all other values), i.e. x = x always evaluates to TRUE, for any possible value of x - no exceptions.
Note also that NULLs are considered valid "non-values" (as their apologists pretend them to be) which one can assign as attribute values (??) as part of relation variables. So they are acceptable values of every type (domain), not only of the type of logical expressions.
And this was my point: NULL, as a value, is a "strange beast". Without euphemism, I prefer to say: nonsense.
I think that this formulation is much clearer and less debatable - sorry for my poor English proficiency.
This is only one of the problems of NULLs. Better to avoid them entirely, when possible.
§ we are concerned about values here, so the fact that the two presents are always two different physical objects is not a valid objection; if you are not convinced I'm sorry, but this is not the place to explain the difference between value and "object" semantics (Relational Algebra has value semantics from the start - see Codd's information principle; I think that some SQL DBMS implementors don't even care about a common semantics).
§§ to my knowledge, this is an axiom accepted (in one form or another, but always interpreted in a 2VL) since antiquity, exactly because it is so intuitive. 3VL (in reality a family of logics) is a much more recent development (but I'm not sure when it was first developed).
Side note: if someone introduces Bottom, Unit and Option Types as attempts to justify SQL NULLs, I will be convinced only after a quite detailed examination that shows how SQL implementations with NULLs have a sound type system and that finally clarifies what NULLs (these "values-not-quite-values") really are.
In what follows I will quote some authors. Any error or omission is probably mine and not the original authors'.
Joe Celko on SQL NULLs
I see Joe Celko often cited on this forum. Apparently he is a much respected author here. So, I said to myself: "what has he written about SQL NULLs? How does he explain NULL's numerous problems?". One of my friends has an ebook version of Joe Celko's SQL for smarties: advanced SQL programming, 3rd edition. Let's see.
First, the table of contents. The thing that strikes me most is the number of times that NULL is mentioned and in the most varied contexts:
3.4 Arithmetic and NULLs 109
3.5 Converting Values to and from NULL 110
3.5.1 NULLIF() Function 110
6 NULLs: Missing Data in SQL 185
6.4 Comparing NULLs 190
6.5 NULLs and Logic 190
6.5.1 NULLS in Subquery Predicates 191
6.5.2 Standard SQL Solutions 193
6.6 Math and NULLs 193
6.7 Functions and NULLs 193
6.8 NULLs and Host Languages 194
6.9 Design Advice for NULLs 195
6.9.1 Avoiding NULLs from the Host Programs 197
6.10 A Note on Multiple NULL Values 198
10.1 IS NULL Predicate 241
10.1.1 Sources of NULLs 242
...
and so on. It rings "nasty special case" to me.
I will go into some of these cases with excerpts from this book, trying to limit myself to the essential, for copyright reasons. I think these quotes fall within the "fair use" doctrine and they may even stimulate you to buy the book - so I hope that no one will complain (otherwise I will need to delete most of it, if not all). Furthermore, I shall refrain from reporting code snippets for the same reason. Sorry about that. Buy the book to read the detailed reasoning.
Page numbers are given in parentheses in what follows.
NOT NULL Constraint (11)
The most important column constraint is the NOT NULL, which forbids
the use of NULLs in a column. Use this constraint routinely, and remove
it only when you have good reason. It will help you avoid the
complications of NULL values when you make queries against the data.
It is not a value; it is a marker that holds a place where a value might go.
Again this "value but not quite a value" nonsense. The rest seems quite sensible to me.
(12)
In short, NULLs cause a lot of irregular features in SQL, which we will discuss
later. Your best bet is just to memorize the situations and the rules for NULLs
when you cannot avoid them.
Apropos of SQL, NULLs and infinity:
(104) CHAPTER 3: NUMERIC DATA IN SQL
SQL has not accepted the IEEE model for mathematics for several reasons.
...
If the IEEE rules for math were allowed in
SQL, then we would need type conversion rules for infinite and a way to
represent an infinite exact numeric value after the conversion. People
have enough trouble with NULLs, so let’s not go there.
SQL implementations undecided on what NULL really means in particular contexts:
3.6.2 Exponential Functions (116)
The problem is that logarithms are undefined when (x <= 0). Some SQL
implementations return an error message, some return a NULL and DB2/
400; version 3 release 1 returned *NEGINF (short for “negative infinity”)
as its result.
Joe Celko quoting David McGoveran and C. J. Date:
6 NULLs: Missing Data in SQL (185)
In their book A Guide to Sybase and SQL Server, David McGoveran
and C. J. Date said: “It is this writer’s opinion that NULLs, at least as
currently defined and implemented in SQL, are far more trouble than
they are worth and should be avoided; they display very strange and
inconsistent behavior and can be a rich source of error and confusion.
(Please note that these comments and criticisms apply to any system
that supports SQL-style NULLs, not just to SQL Server specifically.)”
NULLs as a drug addiction:
(186/187)
In the rest of this book, I will be urging you not to use
them, which may seem contradictory, but it is not. Think of a NULL
as a drug; use it properly and it works for you, but abuse it and it can ruin
everything. Your best policy is to avoid NULLs when you can and use
them properly when you have to.
My only objection here is to "use them properly", which interacts badly with
specific implementation behaviors.
6.5.1 NULLS in Subquery Predicates (191/192)
People forget that a subquery often hides a comparison with a NULL.
Consider these two tables:
...
The result will be empty. This is counterintuitive, but correct.
(separator)
6.5.2 Standard SQL Solutions (193)
SQL-92 solved some of the 3VL (three-valued logic) problems by adding
a new predicate of the form:
<search condition> IS [NOT] TRUE | FALSE | UNKNOWN
But UNKNOWN is a source of problems in itself, so that C. J. Date,
in his book cited below, recommends in chapter 4.5, Avoiding Nulls in SQL:
Don't use the keyword UNKNOWN in any context whatsoever.
Read "ASIDE" on UNKNOWN, also linked below.
6.8 NULLs and Host Languages (194)
However, you should know how NULLs are handled when they have
to be passed to a host program. No standard host language for
which an embedding is defined supports NULLs, which is another
good reason to avoid using them in your database schema.
(separator)
6.9 Design Advice for NULLs (195)
It is a good idea to declare all your base tables with NOT NULL
constraints on all columns whenever possible. NULLs confuse people
who do not know SQL, and NULLs are expensive.
Objection: NULLs confuses even people that know SQL well,
see below.
(195)
NULLs should be avoided in FOREIGN KEYs. SQL allows this “benefit
of the doubt” relationship, but it can cause a loss of information in
queries that involve joins. For example, given a part number code in
Inventory that is referenced as a FOREIGN KEY by an Orders table, you
will have problems getting a listing of the parts that have a NULL. This is
a mandatory relationship; you cannot order a part that does not exist.
(separator)
6.9.1 Avoiding NULLs from the Host Programs (197)
You can avoid putting NULLs into the database from the Host Programs
with some programming discipline.
...
Determine impact of missing data on programming and reporting:
Numeric columns with NULLs are a problem, because queries
using aggregate functions can provide misleading results.
(separator)
(227)
The SUM() of an empty set is always NULL. One of the most common
programming errors made when using this trick is to write a query that
could return more than one row. If you did not think about it, you might
have written the last example as: ...
(separator)
10.1.1 Sources of NULLs (242)
It is important to remember where NULLs can occur. They are more than
just a possible value in a column. Aggregate functions on empty sets,
OUTER JOINs, arithmetic expressions with NULLs, and OLAP operators
all return NULLs. These constructs often show up as columns in
VIEWs.
(separator)
(301)
Another problem with NULLs is found when you attempt to convert
IN predicates to EXISTS predicates.
(separator)
16.3 The ALL Predicate and Extrema Functions (313)
It is counterintuitive at first that these two predicates are not the same in SQL:
...
But you have to remember the rules for the extrema functions—they
drop out all the NULLs before returning the greater or least values. The
ALL predicate does not drop NULLs, so you can get them in the results.
(separator)
(315)
However, the definition in the standard is worded in the
negative, so that NULLs get the benefit of the doubt.
...
As you can see, it is a good idea to avoid NULLs in UNIQUE
constraints.
Discussing GROUP BY:
NULLs are treated as if they were all equal to each other, and
form their own group. Each group is then reduced to a single
row in a new result table that replaces the old one.
This means that for the GROUP BY clause, NULL = NULL does not
evaluate to NULL, as in 3VL, but evaluates to TRUE.
SQL standard is confusing:
The ORDER BY and NULLs (329)
Whether a sort key value that is NULL is considered greater or less than a
non-NULL value is implementation-defined, but...
... There are SQL products that do it either way.
In March 1999, Chris Farrar brought up a question from one of his
developers that caused him to examine a part of the SQL Standard that
I thought I understood. Chris found some differences between the
general understanding and the actual wording of the specification.
And so on. I think that is enough from Celko.
C. J. Date on SQL NULLs
C. J. Date is more radical about NULLs: avoid NULLs in SQL, period.
In fact, chapter 4 of his SQL and Relational Theory: How to Write Accurate
SQL Code is titled "NO DUPLICATES, NO NULLS", with subchapters
"4.4 What's Wrong with Nulls?" and "4.5 Avoiding Nulls in SQL" (follow the link:
thanks to Google Books, you can read some pages on-line).
Fabian Pascal on SQL NULLs
From his Practical Issues in Database Management - A Reference
for the Thinking Practitioner (no excerpts on-line, sorry):
10.3 Practical Implications
10.3.1 SQL NULLs
... SQL suffers from the problems inherent in 3VL as well as from many
quirks, complications, counterintuitiveness, and outright errors [10, 11];
among them are the following:
Aggregate functions (e.g., SUM(), AVG()) ignore NULLs (except for COUNT()).
A scalar expression on a table without rows evaluates incorrectly to NULL, instead of 0.
The expression "NULL = NULL" evaluates to NULL, but is actually invalid in SQL; yet ORDER BY treats NULLs as equal (whatever they precede or follow "regular" values is left to DBMS vendor).
The expression "x IS NOT NULL" is not equal to "NOT(x IS NULL)", as is the case in 2VL.
...
All commercially implemented SQL dialects follow this 3VL approach and thus
not only exhibit these problems, but also have specific implementation
problems, which vary across products.
The answers here all seem to come from a CS perspective so I want to add one from a developer perspective.
For a developer NULL is very useful. The answers here say NULL means unknown, and maybe in CS theory that's true - I don't remember, it's been a while. In actual development though, at least in my experience, that happens about 1% of the time. The other 99% of the time it is used for cases where the value is not UNKNOWN but is KNOWN TO BE ABSENT.
For example:
Client.LastPurchase, for a new client. It is not unknown, it is known that he hasn't made a purchase yet.
When using an ORM with a Table per Class Hierarchy mapping, some values are just not mapped for certain classes.
When mapping a tree structure a root will usually have Parent = NULL
And many more...
I'm sure most developers at some point wrote WHERE value = NULL,
didn't get any results, and that's how they learned about IS NULL syntax. Just look how many votes this question and the linked ones have.
SQL Databases are a tool, and they should be designed the way which is easiest for their users to understand.
Just because you don't know what two things are does not mean they're equal. If when you think of NULL you think of "NULL" (the string), then you probably want a different test of equality, like PostgreSQL's IS DISTINCT FROM and IS NOT DISTINCT FROM.
From the PostgreSQL docs on "Comparison Functions and Operators"
expression IS DISTINCT FROM expression
expression IS NOT DISTINCT FROM expression
For non-null inputs, IS DISTINCT FROM is the same as the <> operator. However, if both inputs are null it returns false, and if only one input is null it returns true. Similarly, IS NOT DISTINCT FROM is identical to = for non-null inputs, but it returns true when both inputs are null, and false when only one input is null. Thus, these constructs effectively act as though null were a normal data value, rather than "unknown".
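A few illustrative comparisons (PostgreSQL syntax; results shown in the comments):

SELECT NULL = NULL;                      -- null (unknown)
SELECT NULL IS NOT DISTINCT FROM NULL;   -- true
SELECT 1 IS DISTINCT FROM NULL;          -- true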
Maybe it depends, but I thought NULL=NULL evaluates to NULL like most operations with NULL as an operand.
At technet there is a good explanation for how null values work.
Null means unknown.
Therefore the Boolean expression
value=null
does not evaluate to false, it evaluates to null, but if that is the final result of a where clause, then nothing is returned. That is a practical way to do it, since returning null would be difficult to conceive.
It is interesting and very important to understand the following:
If in a query we have
where (value=@param Or @param is null) And id=@anotherParam
and
value=1
@param is null
id=123
@anotherParam=123
then
"value=@param" evaluates to null
"@param is null" evaluates to true
"id=@anotherParam" evaluates to true
So the expression to be evaluated becomes
(null Or true) And true
We might be tempted to think that here "null Or true" will be evaluated to null and thus the whole expression becomes null and the row will not be returned.
This is not so. Why?
Because "null Or true" evaluates to true, which is very logical, since if one operand is true with the Or-operator, then no matter the value of the other operand, the operation will return true. Thus it does not matter that the other operand is unknown (null).
So we finally have true=true and thus the row will be returned.
Note: with the same crystal clear logic that "null Or true" evaluates to true, "null And true" evaluates to null.
Update:
Ok, just to make it complete I want to add the rest here too which turns out quite fun in relation to the above.
"null Or false" evaluates to null, "null And false" evaluates to false. :)
The logic is of course still as self-evident as before.
MSDN has a nice descriptive article on nulls and the three state logic that they engender.
In short, the SQL92 spec defines NULL as unknown, and NULL used in the following operators causes unexpected results for the uninitiated:
=        | NULL   true   false
---------+---------------------
NULL     | NULL   NULL   NULL
true     | NULL   true   false
false    | NULL   false  true

AND      | NULL   true   false
---------+---------------------
NULL     | NULL   NULL   false
true     | NULL   true   false
false    | false  false  false

OR       | NULL   true   false
---------+---------------------
NULL     | NULL   true   NULL
true     | true   true   true
false    | NULL   true   false
The concept of NULL is questionable, to say the least. Codd introduced the relational model and the concept of NULL in context (and went on to propose more than one kind of NULL!) However, relational theory has evolved since Codd's original writings: some of his proposals have since been dropped (e.g. primary key) and others never caught on (e.g. theta operators). In modern relational theory (truly relational theory, I should stress) NULL simply does not exist. See The Third Manifesto. http://www.thethirdmanifesto.com/
The SQL language suffers the problem of backwards compatibility. NULL found its way into SQL and we are stuck with it. Arguably, the implementation of NULL in SQL is flawed (SQL Server's implementation makes things even more complicated due to its ANSI_NULLS option).
I recommend avoiding the use of NULLable columns in base tables.
Although perhaps I shouldn't be tempted, I just wanted to assert a correction of my own about how NULL works in SQL:
NULL = NULL evaluates to UNKNOWN.
UNKNOWN is a logical value.
NULL is a data value.
This is easy to prove e.g.
SELECT NULL = NULL
correctly generates an error in SQL Server. If the result was a data value then we would expect to see NULL, as some answers here (wrongly) suggest we would.
The logical value UNKNOWN is treated differently in SQL DML and SQL DDL respectively.
In SQL DML, UNKNOWN causes rows to be removed from the resultset.
For example:
CREATE TABLE MyTable
(
key_col INTEGER NOT NULL UNIQUE,
data_col INTEGER
CHECK (data_col = 55)
);
INSERT INTO MyTable (key_col, data_col)
VALUES (1, NULL);
The INSERT succeeds for this row, even though the CHECK condition resolves to NULL = NULL. This is due to the following rule, defined in the SQL-92 ("ANSI") Standard:
11.6 table constraint definition
3)
If the table constraint is a check
constraint definition, then let SC be
the search condition immediately
contained in the check constraint
definition and let T be the table name
included in the corresponding table
constraint descriptor; the table
constraint is not satisfied if and
only if
EXISTS ( SELECT * FROM T WHERE NOT
( SC ) )
is true.
Read that again carefully, following the logic.
In plain English, our new row above is given the 'benefit of the doubt' about being UNKNOWN and allowed to pass.
In SQL DML, the rule for the WHERE clause is much easier to follow:
The search condition is applied to
each row of T. The result of the where
clause is a table of those rows of T
for which the result of the search
condition is true.
In plain English, rows that evaluate to UNKNOWN are removed from the resultset.
Because NULL means 'unknown value' and two unknown values cannot be equal.
So, if to our logic NULL N°1 is equal to NULL N°2, then we have to tell that somehow:
SELECT 1
WHERE ISNULL(nullParam1, -1) = ISNULL(nullParam2, -1)
where known value -1 N°1 is equal to -1 N°2
NULL isn't equal to anything, not even itself. My personal solution to understanding the behavior of NULL is to avoid using it as much as possible :).
The question:
Does one unknown equal another unknown?
(NULL = NULL)
That question is something no one can answer so it defaults to true or false depending on your ansi_nulls setting.
However the question:
Is this unknown variable unknown?
This question is quite different and can be answered with true.
nullVariable = null is comparing the values
nullVariable is null is comparing the state of the variable
The confusion arises from the level of indirection (abstraction) that comes about from using NULL.
Going back to the "what's under the Christmas tree" analogy, "Unknown" describes the state of knowledge about what is in Box A.
So if you don't know what's in Box A, you say it's "Unknown", but that doesn't mean that "Unknown" is inside the box. Something other than unknown is in the box, possibly some kind of object, or possibly nothing is in the box.
Similarly, if you don't know what's in Box B, you can label your state of knowledge about the contents as being "Unknown".
So here's the kicker: Your state of knowledge about Box A is equal to your state of knowledge about Box B. (Your state of knowledge in both cases is "Unknown" or "I don't know what's in the Box".) But the contents of the boxes may or may not be equal.
Going back to SQL, ideally you should only be able to compare values when you know what they are. Unfortunately, the label that describes a lack of knowledge is stored in the cell itself, so we're tempted to use it as a value. But we should not use that as a value, because it would lead to "the content of Box A equals the content of Box B when we don't know what's in Box A and/or we don't know what's in Box B".
(Logically, the implication "if I don't know what's in Box A and if I don't know what's in Box B, then what's in Box A = What's in Box B" is false.)
Yay, Dead Horse.
There are two sensible ways to handle NULL = NULL comparisons in a WHERE clause, and they boil down to "What do you mean by NULL?" One way assumes NULL means "unknown," and the other assumes NULL means "data does not exist." SQL has chosen a third way which is wrong all around.
The "NULL means unknown" solution: Throw an error.
Unknown = unknown should evaluate to 3VL null. But the output of a WHERE clause is 2VL: You either return the row or you don't. It's like being asked to divide by zero and return a number: There is no correct response. So you throw an error instead, and force the programmer to explicitly handle this situation.
The "NULL means no data" solution: Return the row.
No data = no data should evaluate to true. If I'm comparing two people, and they have the same first name, and the same last name, and neither has a middle name, then it is correct to say "These people have the same name."
The SQL solution: Don't return the row.
This is always wrong. If NULL means "unknown," then you don't know if the row should be returned or not, and you should not try to guess. If NULL means "no data," then you should return the row. Either way, silently removing the row is incorrect and will cause problems. It's the worst of both worlds.
Setting aside theory and speaking in practical terms, I'm with AlexDev: I have almost never encountered a case where "return the row" was not the desired result. However, "almost never" is not "never," and SQL databases often serve as the backbones of big important systems, so I can see a fair case for being rigorous and throwing an error.
What I cannot see is a case for silently coercing 3VL null into 2VL false. Like most silent type coercions, it's a rabid weasel waiting to be set loose in your system, and when the weasel finally jumps out and bites someone, you'll have the merry devil of a time tracking it back to its nest.
NULL is unknown in SQL, so we can't expect two unknowns to be the same.
However, you can get that behavior by setting ANSI_NULLS to OFF (it's ON by default).
You will then be able to use the = operator for NULLs:
SET ANSI_NULLS off
if null=null
print 1
else
print 2
set ansi_nulls on
if null=null
print 1
else
print 2
You work for the government registering information about citizens. This includes the national ID for every person in the country. A child was left at the door of a church some 40 years ago, nobody knows who their parents are. This person's father ID is NULL. Two such people exist. Count people who share the same father ID with at least one other person (people who are siblings). Do you count those two too?
The answer is no, you don’t, because we don’t know if they are siblings or not.
Suppose you don’t have a NULL option, and instead use some pre-determined value to represent “the unknown”, perhaps an empty string or the number 0 or a * character, etc. Then you would have in your queries that * = *, 0 = 0, and “” = “”, etc. This is not what you want (as per the example above), and as you might often forget about these cases (the example above is a clear fringe case outside ordinary everyday thinking), then you need the language to remember for you that NULL = NULL is not true.
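For instance, the sibling count from the example could be written as below (schema and column names assumed); the self-join on father ID naturally leaves the two foundlings out, because NULL never matches NULL:

SELECT COUNT(DISTINCT p.person_id)
FROM people p
JOIN people s
  ON  s.father_id = p.father_id      -- NULL = NULL is unknown, so NULL fathers never pair up
  AND s.person_id <> p.person_id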
Necessity is the mother of invention.
Just an addition to other wonderful answers:
AND: The result of true and unknown is unknown, false and unknown is false,
while unknown and unknown is unknown.
OR: The result of true or unknown is true, false or unknown is unknown, while unknown or unknown is unknown.
NOT: The result of not unknown is unknown
If you are looking for an expression returning true for two NULLs you can use:
SELECT 1
WHERE EXISTS (
SELECT NULL
INTERSECT
SELECT NULL
)
It is helpful if you want to replicate data from one table to another.
The equality test, for example, in a case statement when clause, can be changed from
XYZ = NULL
to
XYZ IS NULL
If I want to treat blanks and empty string as equal to NULL I often also use an equality test like:
(NULLIF(ltrim( XYZ ),'') IS NULL)
To quote the Christmas analogy again:
In SQL, NULL basically means "closed box" (unknown). So, the result of comparing two closed boxes will also be unknown (null).
I understand, for a developer, this is counter-intuitive, because in programming languages, often NULL rather means "empty box" (known). And comparing two empty boxes will naturally yield true / equal.
This is why JavaScript for example distinguishes between null and undefined.
Null isn't equal to anything including itself
The best way to test whether a value is null is to check whether it equals itself, since NULL is the only value in SQL that does not compare equal to itself (the comparison yields unknown rather than true):
DECLARE @obj INT = NULL;
IF @obj = @obj PRINT 'not null' ELSE PRINT 'null'   -- prints 'null' (with ANSI_NULLS ON)