How to handle null when comparing equality of value objects? - nhibernate

Note: I use C# as an example, but the problem is virtually the same in Java and probably many other languages.
Assume you implement a value object (as in value object pattern by M. Fowler) and it has some nullable field:
class MyValueObject
{
// Nullable field (with public access to keep the example short):
public string MyField;
}
Then, when overriding Equals(), how do you treat the case when both value objects have their MyField set to null? Are they equal or not?
In C#, treating them as equal seems obvious, because:
This is the behaviour of Equals() when you use a C# struct instead of a class and do not override Equals().
The following expressions are true:
null == null
object.ReferenceEquals(null, null)
object.Equals(null, null)
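Concretely, the "treat them as equal" implementation I have in mind looks something like this (just a sketch to make the question concrete):

class MyValueObject
{
    public string MyField;

    public override bool Equals(object obj)
    {
        var other = obj as MyValueObject;
        if (other == null)
            return false;
        // string.Equals(null, null) returns true, so two instances
        // whose MyField is null compare as equal here.
        return string.Equals(MyField, other.MyField);
    }

    public override int GetHashCode()
    {
        // Map a null field to a fixed hash so equal objects hash equally.
        return MyField != null ? MyField.GetHashCode() : 0;
    }
}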
However, in SQL (at least in SQL Server's dialect), NULL = NULL does not evaluate to true (it yields UNKNOWN), whereas NULL IS NULL is true.
I am wondering what implementation is expected when using an O/R mapper (in my case, NHibernate). If you implement the "natural" C# equality semantics, may there be any ill effects when the O/R mapper maps them to the database?
Or maybe allowing nullable fields in value objects is wrong anyway?

Since ORMs know the relational model, they usually expose a way to query using SQL semantics.
NHibernate, for example, provides the is [not] null operator in HQL, and Restrictions.Is[Not]Null in Criteria.
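For example (a sketch; MyEntity and MyField are hypothetical mapped names):

using NHibernate;
using NHibernate.Criterion;

public static class NullQueries
{
    public static void FindMissing(ISession session)
    {
        // HQL keeps SQL semantics: "is null", not "= null".
        var viaHql = session
            .CreateQuery("from MyEntity e where e.MyField is null")
            .List();

        // The Criteria API makes the same intent explicit.
        var viaCriteria = session
            .CreateCriteria("MyEntity")
            .Add(Restrictions.IsNull("MyField"))
            .List();
    }
}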
Of course, there's an API where these paradigms collide: LINQ. Most ORMs try to do the right thing when comparing to null (i.e. replacing the comparison with is null), although there can be issues sometimes, especially where the intended behavior is not obvious.
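NHibernate's LINQ provider, for instance, translates a C# null comparison into an is null test (a sketch under the same hypothetical mapping):

using System.Linq;
using NHibernate;
using NHibernate.Linq;

public class MyEntity
{
    public virtual int Id { get; set; }
    public virtual string MyField { get; set; }  // nullable column
}

public static class LinqNullQuery
{
    public static void Run(ISession session)
    {
        // Translated to "where MyField is null", not "MyField = null",
        // so the C# and SQL semantics line up here.
        var missing = session.Query<MyEntity>()
            .Where(e => e.MyField == null)
            .ToList();
    }
}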

Personally, I think that if the field can legitimately be null (in error-free code), then two objects whose fields are both null should be treated as equal.
However, if it should never be null (e.g. a Name for a Customer, or a Street Address for a Delivery), then it should never become null in the first place.

I think you have two issues:
One is that you need to know whether one instance of MyValueObject is equal to another instance.
The second is how that should translate to persistence.
I think you need to look at these separately, as it seems that your angle couples them too closely to each other, which to me violates some DDD principles - the Domain should not know or care about persistence.
If you are unsure of the effect of the null value of MyField, either (a) have it return a different type other than string; (b) have it return a Special Case stand-in such as an EmptyString type (note that System.String is sealed in C#, so this would be a wrapper rather than a true subclass); or (c) override the Equals method and specify exactly what it means for these instances to be equal.
If your ORM cannot translate a particular expression (one that involves MyValueObject) to SQL, then perhaps it's OK to do the harder work in the persistence layer (perform the comparison outside the SQL translation - yes, there are performance implications, but I'm sure they're not impossible to solve) in favour of keeping your Domain Model clean. It seems to me the solution should derive from "what's best for the domain model".
James Anderson makes a good point: reserve null for error and failure states. I think Special Case seems more and more appropriate, as sketched below.
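A minimal Special Case sketch along those lines (FieldValue is a made-up name; since System.String is sealed in C#, the stand-in wraps a string rather than deriving from it):

public sealed class FieldValue
{
    // One well-known instance represents "no value".
    public static readonly FieldValue Missing = new FieldValue(null);

    private readonly string value;

    private FieldValue(string value) { this.value = value; }

    public static FieldValue Of(string value)
    {
        return value == null ? Missing : new FieldValue(value);
    }

    public bool IsMissing { get { return ReferenceEquals(this, Missing); } }

    public override bool Equals(object obj)
    {
        var other = obj as FieldValue;
        // Two Missing instances are the same object, hence equal;
        // the equality semantics are pinned down explicitly here.
        return other != null && string.Equals(value, other.value);
    }

    public override int GetHashCode()
    {
        return value != null ? value.GetHashCode() : 0;
    }
}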

Related

Why is null an absorbing element on relations?

null is the lack of value, or, more theatrically, it is the unknown. From here, it is perfectly logical that null + a, null * a, null / a, etc. result in null. This means that null is an absorbing element for these operations. I wonder why it has to be an absorbing element on relations as well. null > 5 could be considered false, with an explanation at least as plausible as the one we can give for the current behavior. Currently we can say that null > 5 is null, since the unknown might be greater than 5, or not, so the result is the unknown. But if it were false, we could say that null > 5 is false, since the lack of value is not greater than 5.
Take a look at these queries:
select *
from books
where author like 'Alex%'
This will return all the books whose author starts with Alex. Let us see the other books:
select *
from books
where author not like 'Alex%'
This will return all the books where the author does not start with Alex, right? Wrong! It will return all the books that have a non-NULL author value which does not start with Alex. If we want to select the books whose author does not start with Alex, we have to explicitly include null values, like this:
select *
from books
where (author is null) or (author not like 'Alex%')
This seems to be an unnecessary complication to me, one which could be sorted out in future versions. But the question is: what is the explanation of this behavior? Why do we use null as the unknown instead of the lack of value?
Why do we use null as the unknown instead of lack of value?
Part of the foundation of the Relational Model is predicate logic. While there are logics that have more than two values (true & false), the simplest and best defined, not to mention most familiar, is 2-valued: Boolean logic.
For reasons of industrial acceptance, into that fine mathematical model SQL introduced NULL. In Boolean logic we can prove the value of arbitrary expressions like NOT(A AND B), but there's no provision for missing values. Missing values are, quite simply, outside the domain of Boolean logic.
Having left academe behind, SQL makes arbitrary choices. What is the sum of N NULLs? NULL. What is the count of N NULLs? 0. Is a value greater or lesser than NULL? To sort, it has to be one or the other. Are two NULLs distinct, or identical, in GROUP BY? The SQL choices all "make sense" at some level, even when implementations contradict each other. There's no right answer, because they're extra-logical.
So the answer to your question really is, because that's what the vendors chose. The unknown has no more meaning, logically, than lack of value. You could make an argument to treat NULL differently. It might win you a beer. If you want to see it manifested in a DBMS, though, you'll have to get it implemented.
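Other languages did make the other choice. C#'s nullable value types, for example, absorb null in arithmetic but return false for relational comparisons - exactly the alternative the question proposes (a small illustration outside SQL):

using System;

class NullableDemo
{
    static void Main()
    {
        int? a = null;

        Console.WriteLine((a + 5) == null); // True  - null absorbs arithmetic
        Console.WriteLine(a > 5);           // False - not "unknown", unlike SQL
        Console.WriteLine(a == null);       // True  - unlike SQL's NULL = NULL
    }
}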
This seems to be an unnecessary complication
You might be right, but you won't be surprised to learn that in 40 years many people have proposed your solution, namely X = NULL is false. The community settled on X = NULL is NULL, avoiding an implicit conversion. Considering how deeply nested and complicated SQL queries can be, that's probably a good thing.
CJ Date takes the position that NULL should be abolished, and all missing values should have a default value. I take exception to that for three reasons:
Missingness is meaningful. If I record a default value for a missing one, I need another column (is_missing) to record its missingness.
Default values can erroneously be picked up by computations. Any use of a complementary is_missing column is ad hoc and outside the purview of the logic engine.
The "right default" varies by context. Sometimes, the "previous" known value is sufficient (because, say, yesterday's price might stand for today's, absent better information). Sometimes there's a known proxy, like average lifespan. Sometimes it's zero, as in a covariance matrix. And sometimes there's no good default: the "value" should be excluded because it's missing.
I have a pet solution, too, that's both simple and strict. I would like to see an SQL option, say, SET STRICT_BOOLEAN ON that would treat missing values as errors for logical and computational purposes. You can insert a NULL; you can select one. You cannot compare one or add one or concatenate one. To do those things, you must supply a default (appropriate to your context) with COALESCE or similar. Any "undefaulted" use of NULL simply raises an error, just like divide by zero does. And for the same reason: like zero as a divisor, NULL in logic is outside the domain.
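A rough C# analogy of that proposal (Strict is a made-up helper; the ?? operator plays the role of COALESCE):

using System;

class StrictNullDemo
{
    // Any "undefaulted" use of a missing value fails fast.
    static int Strict(int? value)
    {
        if (value == null)
            throw new InvalidOperationException(
                "null used in a computation without an explicit default");
        return value.Value;
    }

    static void Main()
    {
        int? price = null;

        // int bad = Strict(price) + 10;   // would throw, like divide by zero
        int ok = Strict(price ?? 0) + 10;  // explicit default, like COALESCE
        Console.WriteLine(ok);             // 10
    }
}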
I have not read all the answers, but this may help if you are using Oracle: since Oracle 10, Oracle provides the LNNVL function to deal with this.
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions078.htm

SQL: What is the purpose of having "basically-null" default values?

So I am working with a table with a lot of default values:
CREATE TABLE BLAH.BLAH
(
...
A NUMBER(19) DEFAULT 0 NOT NULL,
B VARCHAR2(50) DEFAULT ' ' NOT NULL,
C TIMESTAMP(6) DEFAULT '01-JAN-1900' NOT NULL,
...
)
I am just wondering if there is any logical purpose for setting such null-like defaults to columns that would be MUCH better (in my opinion) as being set to actual NULL.
EDIT: I am mostly irked by the varchar2 default. The others are a bit more reasonable and easier to work with. It's just a pain when a lot of the code involves trimming, and I end up with NULL where I'm expecting a single space.
It doesn't make much sense if you understand nulls and are used to dealing with them.
In its own crazy world it has some kind of logic, since if you never have nulls the values are easier to map to programming languages (all have an integer type, but not all have a nullable integer type) and comparing for 'equality' is easier (0 = 0, but it is not the case that null = null). But the tradeoff is that you then have to handle these magic default values with boilerplate checks in your program and in any SQL queries.
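A sketch of what that boilerplate looks like on the application side, assuming the defaults from the table above (names are hypothetical):

using System;

static class MagicDefaults
{
    // Translate the ' ' sentinel back into "no value" on every read.
    public static string ReadB(string raw)
    {
        return raw == " " ? null : raw;
    }

    // Same for the 1900-01-01 sentinel on the timestamp column.
    public static DateTime? ReadC(DateTime raw)
    {
        return raw == new DateTime(1900, 1, 1) ? (DateTime?)null : raw;
    }
}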
It could also be the result of some misapplied, cargo-culted 'coding standard' that forbids the use of nullable columns in the database, combined with a need to have some kind of 'unknown' value anyway.
It is true that the inclusion of nulls in SQL (and in the relational model in general) adds complexity and has been controversial. (I do not intend to give a full rundown with citations here or to argue for any particular point of view; I'm just saying that these arguments exist.) Some suggest that if you need nullable columns you should instead fix your data model to be fully normalized; Codd thought that a single null wasn't adequate and there should be separate 'missing' and 'not applicable'. But I'm sure that nobody who advocated getting rid of null suggested replacing it with a bogus 0 or 1900-01-01 value instead... So again it could be a case of misunderstanding and misapplying some design rule about not using nulls.
This kind of twisted thought process is quite common - I inherited a database where the rules forbade nullable columns (and there were APIs depending on that) but the string '?' was used instead. In fact, there were meant to be '?' and '/' for 'missing' and 'not applicable' - following Codd's suggestion - but you can guess how much real-world data or code observed that distinction and how useful it was in practice.
Is there a reason? There may be in some development groups. When you say:
select *
from blah
where timestamp <> date '2015-01-01'
You might want the query to return the default values. If the default is NULL, then the query will not return them. If the default is a date far in the past, then it will.
Similarly, you might not want to clutter code with lots of or X is null. This can be more than an aesthetic issue. The presence of or in where and on provides opportunities for code mistakes (due to missing parentheses) and can confuse the optimizer.
I am not saying that the use of such defaults is a best practice. NULL is such an important part of the SQL language that anyone writing production code should know how to deal with it. However, there are good reasons for the style you mention.

Allowing for more than one "no-value" value in value space

I use a string type for my Id attribute on all my domain objects. E.g.:
public class Person {
public string Id { get; set; }
// ... more properties
}
No tricks here: null represents a "no-value" value; when a new Person is created and before it is persisted, Id remains null.
Now there is a discussion about enlarging the "no-value" space to say that null, the empty string, and whitespace-only strings are all "no-value" values.
I.e., to check whether an entity is new, instead of if (person.Id == null) it becomes if (string.IsNullOrWhiteSpace(person.Id)).
In my humble opinion this is a smell or a design principle violation, but I can't figure out which one.
Question: which (if any) design principle does this decision violate (the decision to allow for more than just null to represent no-value value)?
(I think it should be something similar to Occam's razor principle, or entropy, or KISS; I'm just not sure.)
It does violate the KISS principle. If there is no special need for handling empty strings besides nulls, then why do it? All operations must now check for two values instead of one. When exploring the DB, a simple SELECT to find "NULL" records becomes slightly less trivial for no good reason.
Another violated principle is the principle of least surprise - usually people expect only NULL values to represent NULL objects. The design with two special values is less obvious, and less "readable".
If something more should be hidden behind these "second-category-special-objects", then it should be made explicit. Otherwise, it should be trivial to handle empty string input, and store it as NULL to be coherent with the rest of the system.
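One way to make that explicit in code: normalize at the boundary so only null ever represents "no value" (a sketch):

public class Person
{
    private string id;

    public string Id
    {
        get { return id; }
        // Empty and whitespace-only input collapse to null here,
        // so the rest of the system sees a single "no value" value.
        set { id = string.IsNullOrWhiteSpace(value) ? null : value; }
    }

    public bool IsNew { get { return Id == null; } }
}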
EDIT:
I've also found another "principle" in Bob Martin's book Clean code - "one word per concept" which is somehow related to this case. Empty string and a NULL are two "words" used for one concept so they clearly violate this guideline.
I'm gonna go out on a limb and say defining both null and "" as the empty string for your application does not violate any design principles and is not a code smell. You need to just clearly define the semantics of the field for your purpose, and you have done so (i.e., in your application, both null and "" mean "no value").
You should have tests that ensure behavior is correct for both null and "".
This is not to say that you also can't make the decision to force all empty strings to null. That is an equally valid decision. You would need to have tests that verify that in all cases where you set the "No value", the actual value is null. You might want to go this way if your persistence layer expects null and only null to indicate no value.
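For instance, a test sketch for the first option (assuming NUnit and a Person like the one in the question):

using NUnit.Framework;

[TestFixture]
public class PersonIdTests
{
    // Both null and blank strings are defined to mean "no value".
    [TestCase(null)]
    [TestCase("")]
    [TestCase("   ")]
    public void BlankIdsCountAsNoValue(string id)
    {
        var person = new Person { Id = id };
        Assert.IsTrue(string.IsNullOrWhiteSpace(person.Id));
    }
}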
So, in this case, there are no wrong decisions, just decisions.

Subclassing NSPredicate to add operator

Cocoa defines predicate classes (NSPredicate, NSExpression, etc.) which "provide a general means of specifying queries in Cocoa" (Predicate Programming Guide). This set of classes describes what I need, but with one little shortcoming: I'd like additional operators.
NSComparisonPredicate already handles 14 operators (NSPredicateOperatorType) but I would like to add, say, temporal operators... or operators to represent things such as:
" variable has at least n entries" (binary operator)
" variable has value for, at most, n consecutive days" (ternary operator)
Obviously, I would need to implement these myself and the data model on which such queries are performed will have to support these operators. But, is there a way to implement it and benefit from the existing NSPredicate classes? Since operators were defined as an enum, I doubt I can extend on that front. Or am I completely missing the boat on this?!
Having spent a lot of time playing around with NSPredicate, I'm not sure this is the greatest idea.
Theoretically, you'd subclass NSPredicate, create your new initializer and properties, and then override the -evaluateWithObject:substitutionVariables: method to do your custom comparison.
Practically, it's probably a lot more difficult than that.
You might consider using FUNCTION() instead. I wrote a blog post about FUNCTION a while ago and how it plays with NSExpression and therefore with NSPredicate. Personally, I'd probably go with this, because then you could still use the +predicateWithFormat: syntax to create the NSPredicate. Creating a subclass to add an operator would necessarily prevent you from using the built-in parser.

Nullable vs. non-null varchar data types - which is faster for queries?

We generally prefer to have all our varchar/nvarchar columns non-nullable with an empty string ('') as the default value. Someone on the team suggested that nullable is better because:
A query like this:
Select * From MyTable Where MyColumn IS NOT NULL
is faster than this:
Select * From MyTable Where MyColumn = ''
Anyone have any experience to validate whether this is true?
On some platforms (and even versions), this is going to depend on how NULLs are indexed.
My basic rule of thumb for NULLs is:
Don't allow NULLs until justified
Don't allow NULLs unless the data can really be unknown
A good example of this is modeling address lines. If you have an AddressLine1 and AddressLine2, what does it mean for the first to have data and the second to be NULL? It seems to me, you either know the address or not, and having partial NULLs in a set of data just asks for trouble when somebody concatenates them and gets NULL (ANSI behavior). You might solve this with allowing NULLs and adding a check constraint - either all the Address information is NULL or none is.
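A domain-level version of that all-or-none constraint, as a sketch (the Address type and the rule are illustrative):

using System;

public class Address
{
    public string Line1 { get; private set; }
    public string Line2 { get; private set; }

    public Address(string line1, string line2)
    {
        // Either the whole address is known or none of it is -
        // no partial nulls that blow up on concatenation later.
        if ((line1 == null) != (line2 == null))
            throw new ArgumentException(
                "Supply both address lines or neither.");
        Line1 = line1;
        Line2 = line2;
    }
}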
Similar thing with middle initial/name. Some people don't have one. Is this different from it being unknown and do you care?
Also, date of death - what does NULL mean? Not dead? Unknown date of death? Many times a single column is not sufficient to encode knowledge in a domain.
So to me, whether to allow NULLs would depend very much on the semantics of the data first - performance is going to be second, because having data misinterpreted (potentially by many different people) is usually a far more expensive problem than performance.
It might seem like a little thing (in SQL Server the implementation is a bitmask stored with the row), but only allowing NULLs after justification seems to me to work best. It catches things early in development, forces you to address assumptions and understand your problem domain.
If you want to know that there is no value, use NULL.
As for speed, IS NULL should be faster, because it doesn't use string comparison.
If you need NULL, use NULL. Ditto empty string.
As for performance, "it depends"
If you have varchar, you store the actual value in the row along with its length. If you have char, you always store the full declared length. NULL may not be stored in-row at all, depending on the engine (a NULL bitmap in SQL Server, for example).
This means IS NULL is quicker, query for query, but it could add COALESCE/NULLIF/ISNULL complexity.
So, your colleague is partially correct but may not appreciate it fully.
Blindly using an empty string is using a sentinel value rather than working through the NULL semantics issue.
FWIW and personally:
I would tend to use NULL, but not always. I like to avoid dates like 31 Dec 9999, which is where NULL avoidance leads you.
From Cade Roux's answer... I also find discussions about "is date of death nullable" pointless. For a field, in practical terms, either there is a value or there isn't.
Sentinel values are worse than NULLs. Magic numbers, anyone?
Tell that guy on your team to get his prematurely optimizin' head out of his ass! (But in a nice way).
Developers like that can be poison to the team, full of low-level optimization myths, all of which may be true or have been true at one point in time for some specific vendor or query pattern, or possibly only true in theory but never true in practice. Acting upon these myths is a costly waste of time, and can destroy an otherwise good design.
He probably means well and wants to contribute his knowledge to the team. Unfortunately, he is wrong. Not wrong in the sense of whether a benchmark will prove his statement correct or incorrect. He's wrong in the sense that this is not how you design a database. The question of whether to make a field NULL-able is a question about the domain of the data, for the purposes of defining the field's type. It should be answered in terms of what it means for the field to have no value.
In a nutshell, NULL means UNKNOWN. Using the date-of-death example, that means the entity could be (1) alive, (2) dead but with the date of death not known, or (3) not known to be dead or alive at all. For numeric columns I always default to 0 (zero), because somewhere along the line you may have to perform aggregate calculations, and NULL + 123 = NULL. For alphanumerics I use NULL, since it's the least expensive performance-wise and it's easier to write ...where a IS NULL than ...where a = ''. Using a single space (...where a = ' ') is not a good idea, because a space is not NULL! For dates, if you have to leave a date column NULL, you may want to add a status indicator column: in the example above, A = alive, D = dead, Q = dead with date of death unknown, N = unknown whether alive or dead.