What does Postgres' NULLS NOT DISTINCT really mean?

The documentation for Postgres seems to say that the phrase means the opposite of what it says. I've added emphasis to the quotation below to highlight the contradiction. Is this counter-intuitive language intended or did the person writing the documentation make a mistake?
By default, NULL values are not treated as distinct entries. Specifying NULLS NOT DISTINCT on unique indexes / constraints will cause NULL values to be treated distinctly.
Really? Seriously? Saying they are NOT distinct causes them to be treated AS DISTINCT!? What on earth?

Related

What's the most efficient way to do a case-insensitive like expression?

In Pervasive v13, is there a "more performant" way to perform a case-insensitive like expression than is shown below?
select * from table_name
where upper(field_name) like '%TEST%'
The UPPER function above has performance cost that I'd like to avoid.
I disagree with those who say that the performance-overhead of UPPER is minor; it is doubling the execution time compared to the exact same query without UPPER.
Background:
I was very satisfied with the execution time of this wildcard-like-expression until I realized the result set was missing records due to mismatched capitalization.
Then, I implemented the UPPER technique (above). This brought back the missing records, but it doubled the execution time of my query.
This UPPER technique, for case-insensitive comparison, seems outlandishly intensive to me at even a conceptual level. Instead of changing a field's case, for every record in a large database table, I'm hoping that the SQL standard provides some type of syntactical flag that modifies the like-expression's behavior regarding case-sensitivity.
From there, behind the scenes, the database engine could generate a compiled regular expression (or some other optimized case-insensitive evaluator) that could well outperform this UPPER technique. This seems like a possibility that might exist.
However, I must admit, at some level there still must be a conversion to make the letter-comparisons. And perhaps, this UPPER technique is no worse than any other method that might achieve the same result set.
Regardless, I'm posting this question in hopes someone might reveal a more performant syntax I'm unaware of.

You do not need UPPER if you define the column with the CASE attribute.
The CASE keyword causes PSQL to ignore case when evaluating
restriction clauses involving a string column. CASE can be specified
as a column attribute in a CREATE TABLE or ALTER TABLE statement, or
in an ORDER BY clause of a SELECT statement.
(see: https://docs.actian.com/psql/psqlv13/index.html#page/sqlref%2Fsqlref.CASE_(string).htm )
CREATE TABLE table_name (field_name VARCHAR(100) CASE)

Why should I avoid NULL values in a SQL database?

I read a 45-tips-database-performance-tips-for-developers document from a famous commercial vendor for SQL tools today and there was one tip that confused me:
If possible, avoid NULL values in your database. If not, use the
appropriate IS NULL and IS NOT NULL code.
I like having NULL values because to me it makes a difference whether a value was never set or is 0 or an empty string. So databases have this for a purpose.
So is this tip nonsense or should I take action to prevent having NULL values at all in my database tables? Does having a NULL value instead of a filled number or string value affect performance a lot?

Besides the reasons mentioned in other answers, we can look at NULLs from a different angle.
Regarding duplicate rows, Codd said
If something is true, saying it twice doesn’t make it any more true.
Similarly, you can say
If something is not known, saying it is unknown doesn't make it known.
Databases are used to record facts. The facts (truths) serve as axioms from which we can deduce other facts.
From this perspective, unknown things should not be recorded - they are not useful facts.
Anyway, anything that is not recorded is unknown. So why bother recording them?
Not to mention that their existence complicates the deduction.

The NULL question is not simple... Every professional has a personal opinion about it.
Relational theory based on Two-Valued Logic (2VL: TRUE and FALSE) rejects NULL, and Chris Date is one of the fiercest opponents of NULLs. Ted Codd, by contrast, accepted Three-Valued Logic as well (TRUE, FALSE and UNKNOWN).
Just a few things to note for Oracle:
Single column B*Tree Indexes don't contain NULL entries. So the Optimizer can't use an Index if you code "WHERE XXX IS NULL".
Oracle treats an empty string as NULL, so:
WHERE SOME_FIELD = NULL
behaves the same as:
WHERE SOME_FIELD = ''
Neither condition ever matches a row; use IS NULL instead.
Moreover, with NULLs you must be careful in your queries, because every comparison with NULL evaluates to NULL.
And, sometimes, NULLs are insidious. Consider a WHERE condition like the following:
WHERE SOME_FIELD NOT IN (SELECT C FROM SOME_TABLE)
If the subquery returns one or more NULLs, you get an empty result set!
These are just the first few cases that come to mind, but we could talk about NULLs for a long time...
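The NOT IN pitfall is easy to reproduce. Here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle; the table name main_table and the sample values are made up for illustration, and the same three-valued logic should apply in other engines.

```python
import sqlite3


def not_in_demo():
    """Show how a single NULL in the subquery empties a NOT IN result."""
    conn = sqlite3.connect(":memory:")
    # main_table and the sample rows are hypothetical.
    conn.executescript("""
        CREATE TABLE main_table (some_field INTEGER);
        CREATE TABLE some_table (c INTEGER);
        INSERT INTO main_table VALUES (1), (2), (3);
        INSERT INTO some_table VALUES (1), (NULL);
    """)
    # 2 NOT IN (1, NULL) evaluates to UNKNOWN, so no row passes the filter.
    naive = conn.execute(
        "SELECT some_field FROM main_table "
        "WHERE some_field NOT IN (SELECT c FROM some_table) "
        "ORDER BY some_field"
    ).fetchall()
    # Filtering the NULLs out of the subquery restores the expected rows.
    filtered = conn.execute(
        "SELECT some_field FROM main_table "
        "WHERE some_field NOT IN "
        "(SELECT c FROM some_table WHERE c IS NOT NULL) "
        "ORDER BY some_field"
    ).fetchall()
    conn.close()
    return naive, filtered


naive, filtered = not_in_demo()
print(naive)     # []
print(filtered)  # [(2,), (3,)]
```

A NOT EXISTS subquery is another common way to sidestep the problem.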

It's usually good practice to avoid or minimise the use of nulls. Nulls cause some queries to return results that are "incorrect" (i.e. the results won't correspond with the intended meaning of the database). Unfortunately SQL and SQL-style databases can make nulls difficult, though not necessarily impossible, to avoid. It's a very real problem and even experts often have trouble spotting flaws in query logic caused by nulls.
Since there is nothing like nulls in the real world, using them means making some compromises in the way your database represents reality. In fact there is no single consistent "meaning" of nulls and little general agreement on what they are for. In practice, nulls get used to represent all sorts of different situations. If you do use them it's a good idea to document exactly what a null means for any given attribute.
Here's an excellent lecture about the "null problem" by Chris Date:
http://www.youtube.com/watch?v=kU-MXf2TsPE

There are various downsides to NULLs that can make using them more difficult than actual values. For example:
In some cases they are not indexed.
They make join syntax more difficult.
They need special treatment for comparisons.
For string columns it might be appropriate to use "N/A", or "N/K" as a special value that helps distinguish between different classes of what could otherwise be NULL, but that's tricky to do for numerics or dates -- special values are generally tricky to use, and it may be better to add an extra column (e.g. for date_of_birth you might have a column that specifies "reason_for_no_date_of_birth"), which can help the application be more useful.
For many cases where data values are genuinely unknown or not relevant they can be entirely appropriate of course -- date_of_death is a good example, or date_of_account_termination.
Sometimes even these examples can be rendered irrelevant by normalising events out to a different table, so you have a table for "ACCOUNT_DATES" with DATE_TYPES of "Open", "Close", etc.

I think using NULL values in the database is feasible as long as your application has proper logic to handle them, but there may be some problems, as discussed in this post:
http://databases.aspfaq.com/general/why-should-i-avoid-nulls-in-my-database.html

Should I use an empty string instead of NULL when checking for unique?

I'm still relatively new to database design, and I'm making a table with SQLite. I thought I was taught that it's best to use NULL in place of empty strings, so that's what I've been doing. I'm building an address table with the line:
CREATE TABLE addresses (
addressID INTEGER PRIMARY KEY,
officeName TEXT,
address TEXT NOT NULL CHECK(address<>''),
UNIQUE (officeName, address)
);
And adding addresses to the database (through PHP PDO) using the line
INSERT OR IGNORE INTO addresses (officeName,address) VALUES (?,?)
That line should check to see if the officeName/address is already in the database, and ignore it if it is, or add it if it isn't. "Address" is always a non-null string, but sometimes the officeName is blank. And if I make it NULL, it keeps getting added as if each NULL was distinct (it works fine if it's just an empty string). I did find this article saying that yes, NULLs are treated as distinct in a unique column. That now makes me wonder… should I always just use an empty string instead of NULL? Is there ever a case where it's "best practice" to use NULL instead? I thought it was always best practice, but now I'm thinking it might never be best practice.

NULL and the empty string are semantically different, just as NULL and 0 are semantically different.
NULL means "no value". In your case, that would be "no address".
The empty string is a string value of zero length. In your case, that would be an address that is the empty string.
Whether or not to use NULL or the empty string depends on the semantics of the situation, just like the decision of whether to use NULL or 0.
However, NULLs are a bit of a mess when it comes to comparison, IN, indexes, DISTINCT, and GROUP BY. Everyone seems to do things a little differently (FYI, this link doesn't cover SQL Server, which does it yet another way), so unfortunately, compromises are often made to accommodate particular desired behavior, depending on the DBMS.
In your case, you will have to use empty strings if you want to use the SQLite functionality you are interested in.
SQLite was originally coded in such a way that [NULLs are never distinct]. But the experiments run
on other SQL engines showed that none of them worked this way. So
SQLite was modified to work the same as Oracle, PostgreSQL, and DB2.
This involved making NULLs indistinct for the purposes of the SELECT
DISTINCT statement and for the UNION operator in a SELECT. NULLs are
still distinct in a UNIQUE column. This seems somewhat arbitrary, but
the desire to be compatible with other engines outweighed that
objection.
Know, however, that INSERT OR IGNORE is unique to SQLite; for no other DBMS would you be asking about using that statement.
Best practice is to base your decision on what you mean: no value, or the value with no characters. (Of course, you may always choose to forgo best practice for your own personal reasons.)
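To see the behaviour from the question, here is a small sketch using Python's sqlite3 module instead of PHP PDO. It reuses the table definition from the question; the helper name count_rows and the sample address are made up for illustration.

```python
import sqlite3


def count_rows(office_name):
    """Insert the same (officeName, address) pair twice and count the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE addresses (
            addressID INTEGER PRIMARY KEY,
            officeName TEXT,
            address TEXT NOT NULL CHECK(address <> ''),
            UNIQUE (officeName, address)
        )
    """)
    for _ in range(2):
        # "1 Main St" is an arbitrary sample value.
        conn.execute(
            "INSERT OR IGNORE INTO addresses (officeName, address) "
            "VALUES (?, ?)",
            (office_name, "1 Main St"),
        )
    (n,) = conn.execute("SELECT COUNT(*) FROM addresses").fetchone()
    conn.close()
    return n


print(count_rows(None))  # 2: each NULL officeName counts as distinct
print(count_rows(""))    # 1: the second empty-string row is ignored
```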

Should I allow null values in a db schema?

I know that logically, there are some cases where NULL values make sense in a DB schema, for example if some values plain haven't been specified. That said, working around DBNull in code tends to be a royal pain. For example, if I'm rendering a view, and I want to see a string, I would expect no value to be a blank string, not "Null", and I hate having to code around that scenario.
Additionally, avoiding NULLs makes querying easier. Admittedly, you can do "foo is not null" very easily, but for junior SQL devs, it's counterintuitive to not be able to use "foo != null" (and yes, I know about options to turn off ANSI nulls, etc, but that's definitely NOT simpler, and I don't like working away from the standard).
What good reason is there for having/allowing nulls in a database schema?

The most significant reason for allowing NULLS is that there is no reasonable alternative. Logically, a NULL value represents "undefined". For lack of NULLS, you'll end up trying to specify a "dummy" value wherever the result is undefined, and then you'll have to account for said "dummy" value in ALL of your application logic.
I wrote a blog article on the reasons for including NULL values in your database. You can find it here. In short, I DO believe that NULL values are an integral part of database design, and should be used where appropriate.

C.J. Date in his book "SQL and Relational Theory" (2009: O'Reilly; ISBN 978-0-596-52306-0) takes a very strong stand against NULLs. He demonstrates that the presence of NULLs in SQL gives wrong answers to certain queries. (The argument does not apply to the relational model itself because the relational model does not allow NULLs.)
I'll try to summarize his example in words. He presents a table S with attributes SNO (Supplier Number) and City (City where supplier is located) and one row: (S1, London). Also a table P with attributes PNO (Part Number) and City (City where part is produced) and one row: (P1, NULL). Now he does the query "Get (SNO,PNO) pairs where either the supplier and part cities are different or the part city isn't Paris (or both)."
In the real world, P1 is produced in a city that either is or is not Paris, so the query should return (S1, P1) because the part city either is Paris or is not Paris. (The mere presence of P1 in table P means that the part has a city associated with it, even if unknown.) If it is Paris, then supplier and part cities are different. If it is not Paris, then the part city is not Paris. However, by the rules of three-valued logic, ('London' <> NULL) evaluates to UNKNOWN, (NULL <> 'Paris') evaluates to UNKNOWN, and UNKNOWN OR UNKNOWN reduces to UNKNOWN, which is not TRUE (and not FALSE either), and so the row isn't returned. The result of the query "SELECT S.SNO, P.PNO FROM S, P WHERE S.CITY <> P.CITY OR P.CITY <> 'Paris'" is an empty table, which is the wrong answer.
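The example is small enough to run. The sketch below uses Python's sqlite3 module as a stand-in for any SQL engine, with the tables and the query as described above; only the function name is invented.

```python
import sqlite3


def date_example():
    """Reproduce the S/P query and the truth value of its WHERE clause."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE S (SNO TEXT, CITY TEXT);
        CREATE TABLE P (PNO TEXT, CITY TEXT);
        INSERT INTO S VALUES ('S1', 'London');
        INSERT INTO P VALUES ('P1', NULL);
    """)
    rows = conn.execute(
        "SELECT S.SNO, P.PNO FROM S, P "
        "WHERE S.CITY <> P.CITY OR P.CITY <> 'Paris'"
    ).fetchall()
    # The predicate itself: UNKNOWN OR UNKNOWN, surfaced by sqlite3 as None.
    (predicate,) = conn.execute(
        "SELECT ('London' <> NULL) OR (NULL <> 'Paris')"
    ).fetchone()
    conn.close()
    return rows, predicate


rows, predicate = date_example()
print(rows)       # []
print(predicate)  # None
```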
I'm not an expert and not currently equipped to take the pro or con here. I do consider C.J. Date to be one of the foremost authorities on relational theory.
P.S. It is also true that you can use SQL as something other than a relational database. It can do many things.

What good reason is there for having/allowing nulls in a database schema?
From the theory's point of view, having a NULL means that the value is not defined for a column.
Use it wherever you need to say "I don't know / I don't care" to answer the question "What is the value of this column?"
And here are some tips from performance's point of view:
In Oracle, NULL's are not indexed. You can save the index space and speed up the queries by using NULL's for the values you don't need to index.
In Oracle, trailing NULL's occupy no space.
Unlike zeroes, NULL's can be safely divided by.
NULL's do contribute into COUNT(*), but don't contribute into COUNT(column)
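The last two points can be checked quickly. This is a sketch in SQLite via Python's sqlite3 module, not Oracle, so it says nothing about the Oracle-specific index and storage claims; the table t and its values are made up.

```python
import sqlite3


def count_demo():
    """Compare COUNT(*) with COUNT(column) and divide by NULL."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table with one NULL among three rows.
    conn.executescript("""
        CREATE TABLE t (col INTEGER);
        INSERT INTO t VALUES (1), (NULL), (3);
    """)
    count_star, count_col = conn.execute(
        "SELECT COUNT(*), COUNT(col) FROM t"
    ).fetchone()
    # Dividing by NULL yields NULL rather than raising an error.
    (quotient,) = conn.execute("SELECT 42 / NULL").fetchone()
    conn.close()
    return count_star, count_col, quotient


print(count_demo())  # (3, 2, None)
```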

Nulls are good when your column can really have an unknown value which has no default.
We can't answer if your column applies to that rule.
For example, if you have an end date you might be tempted to put in datetime.maxvalue as the default instead of null. It's completely valid, but you have to take into account reporting being done on that and stuff like that.

In theory, there is no difference between theory and practice. In practice, there is.
In theory, you can design a database that never needs a NULL in it, because it's fully normalized. Whenever a value is to be omitted, the entire row containing it can be omitted, so there's no need for any NULL.
However, the extent of table decomposition you have to go through in order to get this result is just simply not worth the gain from the aspect of theoretical esthetics. It's often best to let some columns contain NULLS.
Good candidates for nullable columns are ones where, in addition to the data being optional, you are never using the column in a comparison condition in a WHERE or HAVING clause. Believe it or not, foreign keys often work OK with NULLS in them, to indicate an instance of a relationship that is not present. INNER JOINS will drop the NULLS out along with the rows that contain them.
When a value is often used in boolean conditions, it's best to design so that NULLS won't happen. Otherwise you are apt to end up with the mysterious result that, in SQL, the value of "NOT UNKNOWN" is "UNKNOWN". This has caused bugs for a number of people before you.
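Here is a short sketch of that "NOT UNKNOWN is UNKNOWN" surprise, using Python's sqlite3 module; the items table and its flag column are hypothetical.

```python
import sqlite3


def not_unknown_demo():
    """Show that a NULL row matches neither a condition nor its negation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items (id INTEGER, flag INTEGER);
        INSERT INTO items VALUES (1, 1), (2, 0), (3, NULL);
    """)
    # NULL = 1 is UNKNOWN, and negating it is still UNKNOWN (None in Python).
    (not_unknown,) = conn.execute("SELECT NOT (NULL = 1)").fetchone()
    matches = conn.execute(
        "SELECT id FROM items WHERE flag = 1"
    ).fetchall()
    negated = conn.execute(
        "SELECT id FROM items WHERE NOT (flag = 1)"
    ).fetchall()
    conn.close()
    return not_unknown, matches, negated


not_unknown, matches, negated = not_unknown_demo()
print(not_unknown)  # None
print(matches)      # [(1,)]
print(negated)      # [(2,)]  -- row 3 appears in neither result
```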

Generally, if you allow NULL for a column in a database, that NULL value has some separate meaning with regards to the structure of the database itself. For example, in the StackOverflow database schema, NULL for the ParentId or Tags column in the Post table indicates whether the post is a question or an answer. Just make sure that in each case, the meaning is well documented.
Now your particular complaint is about handling these values in client code. There are two ways to mitigate the issue:
Most cases with a meaning like the one described above should never come back to the client in the first place. Use the NULL in your queries to gather the correct results, but don't return the NULL column itself.
For the remaining cases, you can generally use functions like COALESCE() or ISNULL() to return something that's easier to process.
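As a rough illustration of the second point, this sketch uses COALESCE() in SQLite through Python's sqlite3 module. The posts table and its columns are invented for the example and are not the real Stack Overflow schema.

```python
import sqlite3


def coalesce_demo():
    """Replace NULL titles with an empty string before they reach the client."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; row 2 has no title.
    conn.executescript("""
        CREATE TABLE posts (id INTEGER, title TEXT);
        INSERT INTO posts VALUES (1, 'Hello'), (2, NULL);
    """)
    rows = conn.execute(
        "SELECT id, COALESCE(title, '') FROM posts ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


print(coalesce_demo())  # [(1, 'Hello'), (2, '')]
```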

A null is useful whenever you need to specify that there is no value at all.
You could use a magic number instead, but it's more intuitive to handle nulls than to handle magic values, and it's easier to remember which value to handle. (Hm... was it -1 or 99999 or 999999 that was the magic value...?)
Also, magic values don't have any real magic; there is no fail-safe to keep you from using the value anyway. The computer doesn't know that you can't multiply 42 with -1 because -1 happens to be an unreasonable value in this situation, but it knows that you can't multiply 42 with null.
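A small sketch of that difference, using Python's sqlite3 module. The AVG comparison is an extra illustration of the same idea and not something claimed above; the readings table and its values are made up.

```python
import sqlite3


def magic_value_demo():
    """Contrast arithmetic on a magic value (-1) with arithmetic on NULL."""
    conn = sqlite3.connect(":memory:")
    (with_magic,) = conn.execute("SELECT 42 * -1").fetchone()
    (with_null,) = conn.execute("SELECT 42 * NULL").fetchone()
    # Hypothetical readings where the third value is missing.
    conn.executescript("""
        CREATE TABLE readings (magic INTEGER, nullable INTEGER);
        INSERT INTO readings VALUES (10, 10), (20, 20), (-1, NULL);
    """)
    avg_magic, avg_null = conn.execute(
        "SELECT AVG(magic), AVG(nullable) FROM readings"
    ).fetchone()
    conn.close()
    return with_magic, with_null, avg_magic, avg_null


with_magic, with_null, avg_magic, avg_null = magic_value_demo()
print(with_magic)  # -42: the engine happily computes with the magic value
print(with_null)   # None: NULL propagates instead
print(avg_null)    # 15.0: AVG skips the NULL, while avg_magic is skewed by -1
```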
For a textual value an empty string can work as "no value", but there are some drawbacks even there. If you for example have three spaces in a field it's not always possible to visually distinguish from the empty string, but they are different values.

Nulls should and must be used anytime the information may not be available at the time the original data is entered (Example, ship date on an order).
Certainly there are situations where nulls may indicate the need to redesign (a table consisting of mostly null entries in most fields is probably not properly normalized, and a field that contains all null values is probably not needed).
To not use nulls because your jr developers don't properly understand them indicates that you have a bigger problem than the nulls. Any developer who doesn't understand how to access data that includes nulls, needs to be given basic training in SQL. This is as silly as not using triggers to enforce data integrity rules because the devs forget to look at them when there is a problem or not using joins because the devs don't understand them or using select * because the devs are too lazy to add the field names.

In addition to the great reasons mentioned in other answers NULL can be very important for new releases of existing products.
Adding a new Nullable column to an already existing table has relatively low impact. Adding a new non-Nullable column is a much more involved process because of data migration. If you or your customers have lots of data the time and complexity of the migration can become a significant problem.

Reasons for having nulls
It's an accepted practice, and everyone who does database work knows how nulls function.
It clearly shows that there is an absence of a value.

For what it's worth, SQL-99 defines a predicate IS [NOT] DISTINCT FROM which returns true or false, even if the operands are NULL.
foo IS DISTINCT FROM 1234
Is equivalent to:
foo <> 1234 OR foo IS NULL
PostgreSQL, IBM DB2, and Firebird support IS DISTINCT FROM.
Oracle and Microsoft SQL Server don't (yet).
MySQL has their own operator <=>, which works like IS NOT DISTINCT FROM.
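The equivalence above can be sketched in SQLite through Python's sqlite3 module. The example deliberately avoids the IS DISTINCT FROM keyword itself, since older SQLite builds may not accept it, and uses the expanded form plus SQLite's own null-safe IS NOT operator. The table t and its values are made up.

```python
import sqlite3


def distinct_from_demo():
    """Compare plain <> with two null-safe ways of writing 'is distinct from'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t (foo INTEGER);
        INSERT INTO t VALUES (1234), (5), (NULL);
    """)
    # Plain inequality silently drops the NULL row.
    plain = conn.execute("SELECT foo FROM t WHERE foo <> 1234").fetchall()
    # The expanded form from the answer keeps it.
    expanded = conn.execute(
        "SELECT foo FROM t WHERE foo <> 1234 OR foo IS NULL"
    ).fetchall()
    # SQLite's IS NOT treats NULL as an ordinary comparable value.
    is_not = conn.execute(
        "SELECT foo FROM t WHERE foo IS NOT 1234"
    ).fetchall()
    conn.close()
    return set(plain), set(expanded), set(is_not)


plain, expanded, is_not = distinct_from_demo()
print(plain)               # {(5,)}
print(expanded == is_not)  # True: both contain (5,) and (None,)
```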

A database is corrupt to the extent that it contains null.

There is NEVER a case where NULL makes sense logically. NULL is not a part of the relational model, and relational theory does not have such a concept as NULL.
NULL is "useful", in the sense that crappy DBMS's leave you no other choice but to use it, at the PHYSICAL level, which those very same crappy DBMS's themselves gravely confuse with the logical level, and more or less force their users to do the same.

I agree with most of the answers on here, but to phrase it a different way, "you can't have a value that means two things". It's just confusing. Does 0 actually mean 0? Or does it mean we don't know yet? etc.

When an entity has no value for an attribute, we use a null value. A null value is not 0; it is the absence of a value. For example, most Korean names have no middle name. If there is a name attribute with first, middle and last name, the special value null should be given for the middle name.

Why can you have a column named ORDER in DB2?

In DB2, you can name a column ORDER and write SQL like
SELECT ORDER FROM tblWHATEVER ORDER BY ORDER
without even needing to put any special characters around the column name. This is causing me pain that I won't get into, but my question is: why do databases allow the use of SQL keywords for object names? Surely it would make more sense to just not allow this?

I largely agree with the sentiment that keywords shouldn't be allowed as identifiers. Most modern computing languages have 20 or maybe 30 keywords, in which case imposing a moratorium on their use as identifiers is entirely reasonable. Unfortunately, SQL comes from the old COBOL school of languages ("computing languages should be as similar to English as possible"). Hence, SQL (like COBOL) has several hundred keywords.
I don't recall if the SQL standard says anything about whether reserved words must be permitted as identifiers, but given the extensive (excessive!) vocabulary it's unsurprising that several SQL implementations permit it.
Having said that, using keywords as identifiers isn't half as silly as the whole concept of quoted identifiers in SQL (and these aren't DB2 specific). Permitting case sensitive identifiers is one thing, but quoted identifiers permit all sorts of nonsense including spaces, diacriticals and in some implementations (yes, including DB2), control characters! Try the following for example:
CREATE TABLE "My
Tablé" ( A INTEGER NOT NULL );
Yes, that's a line break in the middle of an identifier along with an e-acute at the end... (which leads to interesting speculation on what encoding is used for database meta-data and hence whether a non-Unicode database would permit, say, a table definition containing Japanese column names).

Many SQL parsers (especially DB2/z, which I use) are smarter than some of the regular parsers which sometimes separate lexical and semantic analysis totally (this separation is mostly a good thing).
The SQL parsers can figure out based on context whether a keyword is valid or should be treated as an identifier.
Hence you can get columns called ORDER or GROUP or DATE (that's a particularly common one).
It does annoy me with some of the syntax coloring editors when they brand an identifier with the keyword color. Their parsers aren't as 'smart' as the ones in DB2.

Because object names are ... names. All database systems let you use quoted names to stop you from running into trouble.
If you are running into issues, the fault lies not with the practice of permitting object names to be names, but with faulty implementations, or with faulty code libraries which don't automatically quote everything or cannot be made to quote names as-needed.

Interestingly, you can use keywords as field names in SQL Server as well. The only difference is that you need to put square brackets around the name of the field,
so you can do something like
create table [order](
id int,
[order] varchar(50) )
and then :)
select
[order]
from
[order]
order by [order]
That is of course a bit of an extreme example, but at least with the brackets you can see that [order] is not a keyword.
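SQLite happens to accept the same square-bracket quoting for compatibility, so the statements above can be tried without a SQL Server instance. This is only a sketch using Python's sqlite3 module; the sample rows are made up, and real SQL Server behaviour may differ in other respects.

```python
import sqlite3


def bracket_demo():
    """Use the keyword ORDER as both table and column name via [brackets]."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE [order] (id INT, [order] VARCHAR(50))")
    conn.executemany(
        "INSERT INTO [order] (id, [order]) VALUES (?, ?)",
        [(1, "b"), (2, "a")],  # arbitrary sample rows
    )
    rows = conn.execute(
        "SELECT [order] FROM [order] ORDER BY [order]"
    ).fetchall()
    # Without the brackets, SQLite rejects the keyword as an identifier.
    try:
        conn.execute("SELECT order FROM order")
        unquoted_ok = True
    except sqlite3.OperationalError:
        unquoted_ok = False
    conn.close()
    return rows, unquoted_ok


rows, unquoted_ok = bracket_demo()
print(rows)         # [('a',), ('b',)]
print(unquoted_ok)  # False
```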
The reason I would see people using names already reserved by keywords is when there is a direct mapping between column names, or names of the tables and the data presentation. You can call that being lazy or convenient.