Negative integer indexes: are they evil? - sql

I have this database that I'm designing.
It needs to contain a couple dozen tables with records that we provide (a bunch of defaults) as well as records that the user can add. In order to keep the user from shooting himself in the foot, it's necessary to keep him from modifying the default records.
There are lots of ways to facilitate this, but I like the idea of giving protected records negative integer indexes, while reserving 0 as an invalid record id and giving user records positive integer indexes.
CREATE TABLE t1 (
ixt1 integer AUTOINCREMENT,
d1 double,
CONSTRAINT pk_ixt1 PRIMARY KEY (ixt1),
CONSTRAINT ch_zero CHECK (ixt1 <> 0)
);
-2 | 171.3 <- canned record
-1 | 100.0 <- canned record
1 | 666.6 <- user record
Reasons this seems good:
it doesn't use significantly more space
it's easy to understand
it doesn't require lots of additional tables to implement
"select * from table" gets all the pertinent records, with no additional indirection
the canned records can grow in the negative direction, and the user records can grow in the positive direction
However, I'm relatively new to database design. And after using this solution for a little while, I'm starting to worry that using negative indexes might be bad, because
Negative indexes might not be supported consistently among different DBMSs, making it difficult to write code that is database-agnostic
It might be just too easy to screw stuff up by inserting something at recid 0
It might make it hard to use tools (like db grids, perhaps) that expect integer indexes with nonnegative values.
And maybe there are some other really obvious reasons that would make this a Very Bad Idea.
So what's the definitive answer? Are negative integer indexes evil?

The most important flaw in this is the "Intelligent Key" problem.
Negative integers work fine as a key. In all databases.
No tool requires positive integer index values.
It's relatively easy to screw this up because the index has a "rule" which isn't obvious and no one will remember after you've won the lottery and left.
Further, when you invent a third status code ('pre-canned' vs. 'customer-specific canned' vs. 'the other canned invented by a product line' vs. 'the old canned before version 3') you're doomed.
The issue with "Intelligent keys" is that you're asking the key to do two unrelated jobs.
It's the unique identifier for a record. That's what an key is supposed to be.
You're also asking it to provide status, control and authorization to change properties. Oops. This is fraught with danger. You can't expand the meaning because it's a single bit buried in a key.
Just add a column with "owned by". If it's owned by "magical super user", then it's not shown to users. Use a VIEW to assure this, if you can't trust your application developers to enforce it.
If it is owned by "magical super user", then it's the default data, and whatever rules apply to that ownership.

I worked on a very large billing system. We had a very similar problem... needing to mark some records as being "special". The customers had tens of millions of rows of existing data in their databases for the affected table, and it was deemed unacceptable to migrate all of that data to a new structure (i.e. adding a column).
The decision was taken to do exactly what you suggest.
The trouble with that is, you require every bit of business logic to know about (and remember) the special meaning of negative indices and correctly treat it. That's quite error prone (speaking from experience).
Unless you have unusual circumstances that very strongly speak in favor of this non-traditional approach, I suggest you stick with a more traditional extra column. It's what most developers are used to and therefore less likely to cause errors. I wish we would have bitten the bullet and added the extra column.

It's your data, but I don't think this is a good idea. An 'index' value like this should be meaningless - don't use sign or number range or whatever to mean 'something' or 'something else'. I think that in the long run you'd be much better off having a 'record type' column that indicates clearly what kind of record you're looking at. In my experience this is a much better approach.
Good luck.

Related

SQL table design: one or multiple line per entity? [duplicate]

I was wondering if you have a website with a dozen different types of listings (Shops, Restaurants, Clubs, Hotels, Events) that require different fields, is there a benefit of creating a table with columns defined like so
Example Shop:
shop_id | name | X | Y | city | district | area | metro | station | address | phone | email | website | opening_hours
Or a more abstract approach similar to this:
object_id | name
---------------
1 | Messy Joe's
2 | Bate's Motel
type_id | name
---------------
1 | hotel
2 | restaurant
object_id | type_id
---------------
1 | 2
2 | 1
field_id | name | field_type
---------------
1 | address | text
2 | opening_hours | date
3 | speciality | text
type_id | field_id
---------------
1 | 1
1 | 2
2 | 1
2 | 3
object_id | field_id | value
1 | 1 | 1st street....
1 | 3 | English Cuisine
Of course it can be more abstract if value's are predefined (Example: specialties could have their own list)
If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins.
But I don't know if this affects the performance, executing these 'more complex' queries.
I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this.
Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.
Pre-requisite Resolution
Labels
In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.
What you have posted as FixedTables, is Unnormalised. Fair enough, it may be an attempt at Third Normal form, but in fact it is a flat file, Unnormalised (not "denormalised). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.
The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns or to resolve dependencies) and Nulls, it is a performance hog in many ways, and prevents concurrency.
In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.
It cannot be said to be denormalised "for performance", because being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design], and "for performance" is it. Even the smallest formal scrutiny exposed the misrepresentation (but very few people can provide, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).
Normalised structures perform far better than Unnormalised structures. More normalised structures (EAV/6NF) perform better than less normalised structures (3NF/5NF).
I am agreeing with the thrust of OMG Ponies, but not their labels and definitions
rather than saying 'don't "denormalise" unless you have to', I am saying, 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.
Wikipedia
The entries for Normal Forms and Normalisation offer definitions that are incorrect; they confuse the Normal Forms; they are lacking regarding the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wikipedia adds to an already confused and rarely understood subject. So don't waste your time.
However, in order to progress, without that reference posing a hindrance, let me say this.
The definition of 3NF is stable, and has not changed.
There is a lot of confusion of the NFs between 3NF and 5NF. The truth is that this is an area that progressed over the last 15 years; and many orgs, academics as well as vendors with their products with limitations, jumped to create a new "Normal Form" to validate their offerings. All serving commercial interests and academically unsound. 3NF in its original untampered state intended and guaranteed certain attributes.
The sum total is, 5NF is today, what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in-between, some of which are identified in Wikipedia, and even that in confusing terms.
Fifth Normal Form
Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course a true Relational Model is pre-requisite, strong keys, etc. Fifth Normal Form is, since we are skipping the Fourth:
Third Normal Form
which in simple definitive terms is, every non-key column in every table has a 1::1 relationship to the Primary Key of the table,
and to no other non-key columns
Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal without the formal process)
no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
If you understand the above, 4NF, BCNF, and all the silly "NFs" can be dismissed, they are required for physicalised Record Filing Systems, as promoted by academics, quite foreign to the Relational Model (Codd).
Sixth Normal Form
The purpose is elimination of missing data (attribute columns), aka elimination of Nulls
This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
Technically, is not a true Normal Form, because it does not have 5NF as a pre-requisite, but it has a value
EAV vs Sixth Normal Form
All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented many true 6NF databases. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example.
The difference is this: because it is loose, and because implementers do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.
Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:
6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).
Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
Discussion
Now, we can start the discussion.
"Of course it can be more abstract if
value's are predefined (Example:
specialities could have their own
list)"
Sure. But do not get too "abstract". Maintain consistency and implement such lists in the same EAV (or 6NF) manner as other lists.
"If I take the abstract approach it
can be very flexible, but queries will
be more complex with a lot of joins.
But I don't know if this affects the
performance, executing these 'more
complex' queries."
Joins are pedestrian in relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.
EAV and 6NF databases have more Joins, which just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.
The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL. Eliminates an entire class of errors as well.
It is a common myth that Joins somehow have a cost. Totally false.
The join is implemented at compile time, there is nothing of substance to 'cost' CPU cycles.
The issue is the size of tables being joined, not the cost of the Join between those same tables.
Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which have the appropriate indices
(Unique on the parent [PK] side; Unique on the Child side [PK=parent FK + something]
is instantaneous
Where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow.
None of it has to do with Join cost.
Where many rows are returned, the bottleneck will be the network and the disk layout; not the join processing.
Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
I would be interested to know what are
the up and downsides of both methods.
I can just imagine for myself, but I
don't have the experience to confirm
this.
5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance.
The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine is some cases, but not in most cases, due to change control in place, quite onerous.
Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.
EAV (which is what you have posted), allows columns to be added without DDL changes. That is the single reason people choose it. (code handling the new column does not count, because that is an imperative). If implemented well, it will not affect existing code; if not, it will.
But you need EAV-capable developers.
When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised which is what most databases out there are (misrepresented as "denormalised for performance").
Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed.
Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain", the result was, as you can imagine, one mother of a data heap with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.
There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, no more so than Unnormalised.
Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.
In your question, you have presented at least two major issues at the same time. Those two issues are E-A-V and gen-spec.
First, let's talk about E-A-V. Your last table (object_id, field_id, value) is essentially an E-A-V. There is an upside to E-A-V and a downside to E-A-V. The upside is that the structure is so generic that it can accomodate almost any body of data describing almost any subject matter. That means that you can proceed to design and implementation with no data analysis and no understanding of the subject matter, and not worry about wrong assumptions. The down side is that at retrieval time, you have to do the data analysis that you skipped over before building the data base, in order to come up with queries that mean anything. This is much more serious than just retrieval efficiency. But you are also going to have terrible problems with retrieval efficiency. There are only two ways to learn about this pitfall: live through it or read about it from those who have. I recommend the reading.
Second, you have a gen-spec case. Your table (object_id, type_id) captures a gen-spec (generalization-specialization) pattern, along with the related tables. If I had to generalize between hotels and restaurants, I might call it something like "public accomodations" or "venues". But I'm not sure I understand your case, and you may be driving for something even more general than those two names suggest. After all, you've included "events" in your list, and an event is not a type of venue in my mind.
I've referred other people to readings on gen-spec and the relational model in previous responses.
When two tables are very similar, when should they be combined?
But I hesitate to send you off in the same direction, because it's not clear to me that you want to come up with a relational model of the data before building your database. A relational model of a body of data and an E-A-V model of the same data are almost totally at odds with each other. It seems to me you have to make that choice before you even explore how to express gen-spec in the relational model of data.
When you start to require a large number of different entities (or even before...), a nosql solution would be vastly simpler than either choice.
Just store each entity/record with the exact fields you require.
{
"id": 1,
"type":"Restaurant",
"name":"Messy Joe",
"address":"1 Main St.",
"tags":["asian","fusion","casual"]
}
The "abstract" approach is better known as "Normalization", looks like 3rd Normal Form (3NF).
The other one is called "Denormalized", and can be a valid performance option... when you've encountered speed issues using the Normalized approach, not before.
How do you have the listings represented in code? I'd guess Listing as a supertype, with Shop, Restuarant, etc. as subtypes?
Assuming so, this is a case of how to map subtypes to a relational database. There are generally three choices:
Option 1: single table per subtype,
with common attributes repeated in
each table (name, id, etc).
Option 2: single table for all objects (your single table approach)
Option 3: table for the supertype and one for each subtype
There's no universally correct solution. My preference is generally to start with option 3; it provides an intituitive structure to work with, is pretty well normalised and can easily be extended. It means a single join for retrieving each instance - but RDBMS are well optimised for doing joins so it doesn't really cause performance problems in practice.
Option 2 can be more performant for queries (no joins) but causes problems if other tables need to refer to all supertype instances (proliferation of foreign keys).
Option 1 appears at first sight to be the most performant, although 2 caveats: (1) It's not resilient to change. If you add a new subtype (and so different attributes) you'll need to change the table structure and migrate it. (2) It can be less efficient than it seems. Because the table population is sparse, some DBs don't store it particularly efficiently. As a consequence it can be less efficicent than option 1 - since the query engine can do joins faster than it can search bloated sparse table spaces.
Which to choose really comes down to knowing details of your problem. I'd suggest reading up a bit on the options: this article is a good place to start.
hth

single fixed table with multiple columns vs flexible abstract tables

I was wondering if you have a website with a dozen different types of listings (Shops, Restaurants, Clubs, Hotels, Events) that require different fields, is there a benefit of creating a table with columns defined like so
Example Shop:
shop_id | name | X | Y | city | district | area | metro | station | address | phone | email | website | opening_hours
Or a more abstract approach similar to this:
object_id | name
---------------
1 | Messy Joe's
2 | Bate's Motel
type_id | name
---------------
1 | hotel
2 | restaurant
object_id | type_id
---------------
1 | 2
2 | 1
field_id | name | field_type
---------------
1 | address | text
2 | opening_hours | date
3 | speciality | text
type_id | field_id
---------------
1 | 1
1 | 2
2 | 1
2 | 3
object_id | field_id | value
1 | 1 | 1st street....
1 | 3 | English Cuisine
Of course it can be more abstract if value's are predefined (Example: specialties could have their own list)
If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins.
But I don't know if this affects the performance, executing these 'more complex' queries.
I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this.
Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.
Pre-requisite Resolution
Labels
In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.
What you have posted as FixedTables, is Unnormalised. Fair enough, it may be an attempt at Third Normal form, but in fact it is a flat file, Unnormalised (not "denormalised). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.
The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns or to resolve dependencies) and Nulls, it is a performance hog in many ways, and prevents concurrency.
In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.
It cannot be said to be denormalised "for performance", because being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design], and "for performance" is it. Even the smallest formal scrutiny exposed the misrepresentation (but very few people can provide, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).
Normalised structures perform far better than Unnormalised structures. More normalised structures (EAV/6NF) perform better than less normalised structures (3NF/5NF).
I am agreeing with the thrust of OMG Ponies, but not their labels and definitions
rather than saying 'don't "denormalise" unless you have to', I am saying, 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.
Wikipedia
The entries for Normal Forms and Normalisation offer definitions that are incorrect; they confuse the Normal Forms; they are lacking regarding the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wikipedia adds to an already confused and rarely understood subject. So don't waste your time.
However, in order to progress, without that reference posing a hindrance, let me say this.
The definition of 3NF is stable, and has not changed.
There is a lot of confusion of the NFs between 3NF and 5NF. The truth is that this is an area that progressed over the last 15 years; and many orgs, academics as well as vendors with their products with limitations, jumped to create a new "Normal Form" to validate their offerings. All serving commercial interests and academically unsound. 3NF in its original untampered state intended and guaranteed certain attributes.
The sum total is, 5NF is today, what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in-between, some of which are identified in Wikipedia, and even that in confusing terms.
Fifth Normal Form
Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course a true Relational Model is pre-requisite, strong keys, etc. Fifth Normal Form is, since we are skipping the Fourth:
Third Normal Form
which in simple definitive terms is, every non-key column in every table has a 1::1 relationship to the Primary Key of the table,
and to no other non-key columns
Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal without the formal process)
no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
If you understand the above, 4NF, BCNF, and all the silly "NFs" can be dismissed, they are required for physicalised Record Filing Systems, as promoted by academics, quite foreign to the Relational Model (Codd).
Sixth Normal Form
The purpose is elimination of missing data (attribute columns), aka elimination of Nulls
This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
Technically, is not a true Normal Form, because it does not have 5NF as a pre-requisite, but it has a value
EAV vs Sixth Normal Form
All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented many true 6NF databases. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example.
The difference is this: because it is loose, and because implementers do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.
Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:
6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).
Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
Discussion
Now, we can start the discussion.
"Of course it can be more abstract if
value's are predefined (Example:
specialities could have their own
list)"
Sure. But do not get too "abstract". Maintain consistency and implement such lists in the same EAV (or 6NF) manner as other lists.
"If I take the abstract approach it
can be very flexible, but queries will
be more complex with a lot of joins.
But I don't know if this affects the
performance, executing these 'more
complex' queries."
Joins are pedestrian in relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.
EAV and 6NF databases have more Joins, which just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.
The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL. Eliminates an entire class of errors as well.
It is a common myth that Joins somehow have a cost. Totally false.
The join is implemented at compile time, there is nothing of substance to 'cost' CPU cycles.
The issue is the size of tables being joined, not the cost of the Join between those same tables.
Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which have the appropriate indices
(Unique on the parent [PK] side; Unique on the Child side [PK=parent FK + something]
is instantaneous
Where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow.
None of it has to do with Join cost.
Where many rows are returned, the bottleneck will be the network and the disk layout; not the join processing.
Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
I would be interested to know what are
the up and downsides of both methods.
I can just imagine for myself, but I
don't have the experience to confirm
this.
5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance.
The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine is some cases, but not in most cases, due to change control in place, quite onerous.
Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.
EAV (which is what you have posted), allows columns to be added without DDL changes. That is the single reason people choose it. (code handling the new column does not count, because that is an imperative). If implemented well, it will not affect existing code; if not, it will.
But you need EAV-capable developers.
When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised which is what most databases out there are (misrepresented as "denormalised for performance").
Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed.
Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain", the result was, as you can imagine, one mother of a data heap with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.
There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, no more so than Unnormalised.
Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.
In your question, you have presented at least two major issues at the same time. Those two issues are E-A-V and gen-spec.
First, let's talk about E-A-V. Your last table (object_id, field_id, value) is essentially an E-A-V. There is an upside to E-A-V and a downside to E-A-V. The upside is that the structure is so generic that it can accomodate almost any body of data describing almost any subject matter. That means that you can proceed to design and implementation with no data analysis and no understanding of the subject matter, and not worry about wrong assumptions. The down side is that at retrieval time, you have to do the data analysis that you skipped over before building the data base, in order to come up with queries that mean anything. This is much more serious than just retrieval efficiency. But you are also going to have terrible problems with retrieval efficiency. There are only two ways to learn about this pitfall: live through it or read about it from those who have. I recommend the reading.
Second, you have a gen-spec case. Your table (object_id, type_id) captures a gen-spec (generalization-specialization) pattern, along with the related tables. If I had to generalize between hotels and restaurants, I might call it something like "public accomodations" or "venues". But I'm not sure I understand your case, and you may be driving for something even more general than those two names suggest. After all, you've included "events" in your list, and an event is not a type of venue in my mind.
I've referred other people to readings on gen-spec and the relational model in previous responses.
When two tables are very similar, when should they be combined?
But I hesitate to send you off in the same direction, because it's not clear to me that you want to come up with a relational model of the data before building your database. A relational model of a body of data and an E-A-V model of the same data are almost totally at odds with each other. It seems to me you have to make that choice before you even explore how to express gen-spec in the relational model of data.
When you start to require a large number of different entities (or even before...), a nosql solution would be vastly simpler than either choice.
Just store each entity/record with the exact fields you require.
{
"id": 1,
"type":"Restaurant",
"name":"Messy Joe",
"address":"1 Main St.",
"tags":["asian","fusion","casual"]
}
The "abstract" approach is better known as "Normalization", looks like 3rd Normal Form (3NF).
The other one is called "Denormalized", and can be a valid performance option... when you've encountered speed issues using the Normalized approach, not before.
How do you have the listings represented in code? I'd guess Listing as a supertype, with Shop, Restuarant, etc. as subtypes?
Assuming so, this is a case of how to map subtypes to a relational database. There are generally three choices:
Option 1: single table per subtype,
with common attributes repeated in
each table (name, id, etc).
Option 2: single table for all objects (your single table approach)
Option 3: table for the supertype and one for each subtype
There's no universally correct solution. My preference is generally to start with option 3; it provides an intituitive structure to work with, is pretty well normalised and can easily be extended. It means a single join for retrieving each instance - but RDBMS are well optimised for doing joins so it doesn't really cause performance problems in practice.
Option 2 can be more performant for queries (no joins) but causes problems if other tables need to refer to all supertype instances (proliferation of foreign keys).
Option 1 appears at first sight to be the most performant, although 2 caveats: (1) It's not resilient to change. If you add a new subtype (and so different attributes) you'll need to change the table structure and migrate it. (2) It can be less efficient than it seems. Because the table population is sparse, some DBs don't store it particularly efficiently. As a consequence it can be less efficicent than option 1 - since the query engine can do joins faster than it can search bloated sparse table spaces.
Which to choose really comes down to knowing details of your problem. I'd suggest reading up a bit on the options: this article is a good place to start.
hth

SQL Best Practices - Ok to rely on auto increment field to sort rows chronologically?

I'm working with a client who wants to add timestamps to a bunch of tables so that they may sort the records in those tables chronologically. All of the tables also have an auto incrementing integer field as their primary key (id).
The (simple) idea - save the overhead/storage and rely on the primary key to sort fields chronologically. Sure this works, but I'm uncertain whether or not this approach is acceptable in sound database design.
Pros: less storage required per record, simpler VO classes, etc. etc.
Con: it implies a characteristic of that field, an otherwise simple identifer, whose definition does not in any way define or guarantee that it should/will function as such.
Assume for the sake of my question that the DB table definitions are set in stone. Still - is this acceptable in terms of best practices?
Thanks
You asked for "best practices", rather than "not terrible practices" so: no, you should not rely on an autoincremented primary key to establish chronology. One day you're going to introduce a change to the db design and that will break. I've seen it happen.
A datetime column whose default value is GETDATE() has very little overhead (about as much as an integer) and (better still) tells you not just sequence but actual date and time, which often turns out to be priceless. Even maintaining an index on the column is relatively cheap.
These days, I always put a CreateDate column data objects connected to real world events (such as account creation).
Edited to add:
If exact chronology is crucial to your application, you can't rely on either auto-increment or timestamps (since there can always be identical timestamps, no matter how high the resolution). You'll probably have to make something application-specific instead.
Further to egrunin's answer, a change to the persistence or processing logic of these rows may cause rows to be inserted into the database in a non-sequential or nondeterministic manner. You may implement a parallelized file processor that throws a row into the DB as soon as the thread finishes transforming it, which may be before another thread has finished processing a row that occurred earlier in the file. Using an ORM for record persistence may result in a similar behavior; the ORM may just maintain a "bag" (unordered collection) of object graphs awaiting persistence, and grab them at random to persist them to the DB when it's told to "flush" its object buffer.
In either case, trusting the autoincrement column to tell you the order in which records came into the SYSTEM is bad juju. It may or may not be able to tell you the order in which records his the DATABASE; that depends on the DB implementation.
You can acheive the same goal in the short term by sorting on the ID column. This would be better that adding additional data to acheive the same result. I don't think that it would be confusing for anyone to look at the data table and know that it's chronological when they see that it's an identity column.
There are a few drawbacks or limitations that I see however.
The chronological sort can be messed up if someone re-seeds the column
Chronology for a date period cannot be ascertained without the additional data
This setup prevents you from sorting chronologically if the system ever accepts new, non-chronological data
Based on the realistic evaluation of these "limitations" you should be able to advise a proper approach.
Auto-incrementing ID will give you an idea of order as Brad points out, but do it right - if you want to know WHEN something was added, have a datetime column. Then you can not only chronologically sort but also apply filters.
Don't do it. You should never rely on the actual value of your ID column. Treat it like a black box, only useful for doing key lookups.
You say "less storage required per record," but how important is that? How big are the rows we're talking about? If you've got 200-byte rows, another 4 bytes probably isn't going to matter much.
Don't optimize without measuring. Get it working right first, and THEN optimize.
#MadBreaker
There's to separate things, if you need to know the order you create a column order with autoincrement, however if you want to know the date and time it was inserted you use datetime2.
Chronological order can be garanteed if you don't allow updates or deletes, but if you want time control over select you should use datetime2.
You didnt mention if you are running on a single db or clustered. If you are clustered, be wary of increment implementations, as you are not always guaranteed things will come out in the order you would naturally think. For example, Oracle sequences can cache groups of next values (depending on your setup) and give you a 1,3,2,4,5 sort of list...

Is an overuse of nullable columns in a database a "code smell"?

I'm just stepping into a project and it has a fairly large database backend. I've started digging through this database and 95% of the fields are nullable.
Is this normal practice in the database world? I'm just a lowly programmer, not a DBA but I would think you would want to keep nullable fields to a minimum, only where they make sense.
Is it a "code smell" if most columns are nullable?
Default values are typically the exception and NULLs are the norm, in my experience.
True, nulls are annoying.
It's also extremely useful because null is the best indicator of "NO VALUE". A concrete default value is very misleading, and you can lose information or introduce confusion down the road.
Anyone who has developed a data entry application knows how common it is for some of the fields to be unknown at the time of entry -- even for columns that are business-critical, to address #Chris McCall's answer.
However, a "code smell" is merely an indicator that something might be coded in a sloppy way. You use smells to identify things that need more investigation, not necessarily things that must be changed.
So yes, if you see nullable columns so consistently, you're right to be suspicious. It might indicate that someone was being lazy, or afraid to declare NOT NULL columns unequivocally. You can justify doing your own analysis.
I'm of the Extreme NO camp: I avoid NULLs all the time. Putting aside fundamental considerations about what they actually mean (because talk to different people, you'll get different answers such as "no value", "unknown value", "missing", "my ginger cat called Null"), the worst problem NULLs cause is that they often ruin your queries in mysterious ways.
I've lost count of the number of times I've had to debug someone's query (okay, maybe 9) and traced the problem to a join against a NULL. If your code needs ISNULL to repair joins then the chances are you've also lost index applicability and performance with it.
If you do have to store a "missing/unknown/null/cat" value (and it's something I prefer to avoid), it is better to be explicit about it.
Those skilled at NULLs may disagree. NULL use tends to split SQL crowds down the middle.
In my experience, heavy NULL use has been positively correlated with database abuse but I wouldn't carve this into stone tablets as some Law of Nature. My experience is just my experience.
EDIT: Additional thought. It is possible that those who are anti-null racists like myself are more excited by normalization than those who are pro-NULL. I don't think rabid normalizers would be too happy with ragged edges on their tables that can take NULLs. Lots of nulls may indicate that the the database developers are not into heavy normalisation. So rather than NULL suggesting code is "bad" it may alternatively suggest the philosophical position of the developers on normalisation. Maybe this is reaching. Just a thought.
Don't know if I consider it always a bad thing, but if the columns are being added because a single record (or maybe a few) need to have values while most don't, then it indicates a pretty flat table structure. If you're seeing column names like "addr1", "addr2", "addr3", then it stinks!
I would bet that most of the columns you have could be removed and represented in other tables. You could find the "non-null" ones through a foreign key relationship. This will increase the joins that you'll be doing, but it could be more preformant that doing a "where not col1 is null".
I think nullable columns should be avoided. Wherever the semantics of the domain make it possible to use a value that clearly indicates missing data, it should be used instead of NULL.
For instance, let's imagine a table that contains a Comment field. Most developers would place a NULL here to indicate that there's no data in the column. (And, hopefully, a check constraint that disallows zero-length strings so that we have a well-known "value" to indicate the lack of a value.) My approach is usually the opposite. The Comment column is NOT NULL and a zero-length string indicates the lack of a value. (I use a check constraint to ensure that the zero-length string is really a zero-length string, and not whitespace.)
So, why would I do this? Two reasons:
NULLs require special logic in SQL, and this technique avoids that.
Many client-side libraries have special values to indicate NULL. For instance, if you use Microsoft's ADO.NET, the constant DBNull.Value indicates a NULL, and you have to test for that. Using a zero-length string on a NOT NULL column obviates the need.
Despite all of this, there are many circumstances in which NULLs are fine. In fact, I have no objection to their use in the scenario above, although it wouldn't be my preferred way.
Whatever you do, be kind to those who will use your tables. Be consistent. Allow them to SELECT with confidence. Let me explain what I mean by this. I recently worked on a project whose database was not designed by me. Nearly every column was nullable and had no constraints. There was no consistency about what represented the absence of a value. It could be NULL, a zero-length string, or even a bunch of spaces, and often was. (How that soup of values got there, I don't know.)
Imagine the ugly code a developer has to write to find all of those records with a missing Comment field in this scenario:
SELECT * FROM Foo WHERE LEN(ISNULL(Comment, '')) = 0
Amazingly there are developers who regard this as perfectly acceptable, even normal, despite possible performance implications. Better would be:
SELECT * FROM Foo WHERE Comment IS NULL
Or
SELECT * FROM Foo WHERE Comment = ''
If your table is properly designed, the above two SQL statements can be relied upon to produce quality data.
In short, I would say yes, this is probably a code smell.
Whether a column is nullable or not is very important and should be determined carefully. The question should be assessed for every column. I am not a believer in a single "best practices" default for NULL. The "best practice" for me is to address the nullability thoroughly during the design and/or refactoring of the table.
To start with, none of your primary key columns are going to be nullable. Then, I strongly lean towards NOT NULL for anything which is a foreign key.
Some other things I consider:
Criteria where NULL should be strongly avoided:
money columns - is there really a possibility that this amount will be unknown?
Criteria where NULL can be justified most frequently:
datetime columns - there are no reserved dates, so NULL is effectively your best option
Other data types:
char/varchar columns - for codes/identifiers - NOT NULL almost exclusively
int columns - mostly NOT NULL unless it's something like "number of children" where you want to distinguish an unknown response.
No, whether or not a field should be nullable is a data concept and can't be a code smell. Whether or not NULLs are annoying to code has nothing to do with the usefulness of having nullable data fields.
They are a (very common) smell, I'm afraid. Look up C.J. Date writings on the topic.
As a best practice, if a column shouldn't be nullable, then it should be marked as such. However, I don't believe in going completely insane with things like this.
I think so. If you don't need the data, then it's not important to your business. If it is important to your business, it should be required.
This is all completely dependent on the scope and requirements of the project. I wouldn't use number of nullable fields alone as a metric for poorly written or designed code. Have a look at the business domain, if there are many non nullable fields represented there that are nullable in the database, then you have some issues.
In my experience, it is a problem when Null and Not Null don't match up to the required field /not required field.
It is in the realm of possibility that those really are all optional fields. If you find in the business tier or the UI tier that those fields are required, then I think this means the data model has drifted away from the business object model and is a sign of overly conservative DB change policies, or oversight.
If you run a sample data generator on your data, and then try to load the data that is valid according to SQL, you would find out right away if the rules match up.
That seems like a lot, it probably means you should at least investigate. Note that if this is mature product with a lot of data, convincing anyone to change the structure may be difficult. The earlier in the design phase you catch something like this the easier it is to fix up all the related code to adjust for the change.
Whether it is bad that they used the nulls would depend on whether the columns allowing nulls look as if they should be related tables (home phone, cell phone, business phone etc which should be in aspearate phone table) or if they look like things that might not be applicable to all records (possibly could bea related table with a one-to-one relationship)or might not be known at the time of data entry (probably ok). I would also check to see if they in fact alwAys do have a value (then you might be able to change to not null if the information is genuinely required by the busniess logic). If you have a few records with null
In my experience, a lot nullable field in a large database like you have is very normal. Considering it perhaps is used by a lot of applications written by different people. Making columns nullable is annoying but it is perhaps the best way to keep the application robust.
One of the many ways to map inheritance (e.g. c# objects) to a database is to create a table for the class at the top of the hierarchy, then add the columns for all the other classes. The columns have to be nullable for when an object of a different subclass is stored in the database. This is called Single-table inheritance mapping (or Map Hierarchy To A Single Table) and is a standard design pattern.
A side effect of Single-table inheritance mapping is that most columns are nullable.
Also in Oracle an empty string (0 length) is considered to be null, therefore in some companies all strings columns are made nullable even on SqlServer. (just because the first customer wants the software on SqlServer does not mean the 2nd customer does not have a Oracle DBA that will not let SqlServer onto there network)
To throw the opposite opinion out there. Every single field in a database should nullable. There is nothing more frustrating than working with a database that on every single insert throws an exception about required this or required that. Nothing should be required.
There is one exception to that, keys. Obviously all primary and foreign keys should be enforced to exist.
It should be the application's job to validate data and the database to simply store and retrieve what you give it. Having it process validation logic even as simple as null or not null makes a project way more complex to maintain for having different rules spread over everything.
As mentioned by others, front-facing data entry should allow omittance of many fields. This is complicated by how people interpret the trinary nature of NULL (e.g. empty versus missing).
As such, I am only answering about one facet of database design: foreign keys.
In general, foreign keys do not suffer from the arbitrary nature of business logic, therefore seeing these columns allowing NULL is definitely a code smell.
For example, if you had a [Person] table, in no situation would you ever have a [Person].[FatherID] value that was NULL intentionally.
For a large database, an attempt to save NULL to such a column is likely to occur at some point due to the inevitability of bugs, which would have been brought to light much sooner by having a NOT NULL constraint. So for version 1 or a table, you should never allow nullable columns without justification.
But things get much trickier in an evolving code base, especially one that is staying online and thus requires migration scripting to upgrade. In particular, you may find nullable columns added to tables later on, because properly adding them as non-nullable can be quite hard depending on your integration process.
Furthermore, visual table designers (such as in SQL Server Management Studio and Visual Studio) default to allowing NULL so it could simply be a matter of inadequate code review.
I don't want to attempt a proper answer for flag (i.e. boolean) columns, but I strongly suggest considering how they can be implemented without allowing NULL, since I have usually found ways to avoid nullability even under the constraints of business logic.

Char(4) versus int as StatusID/StatusCode column in a table

I need a status column that will have about a dozen possible values.
Is there any reason why I should choose int (StatusID) over char(4) (StatusCode)?
Since sql server doesn't support named constants, char is far more descriptive than int when used in stored procedure and views as constants.
To clarify, I would still use a lookup table either way. Since the I will need a more descriptive text for the UI. So this decision is only to help me as the developer when I'm maintaining the stored procedures and views.
Right now I'm leaning toward char(4). Especially since designing views in SQL Server Management Studio prevents me from adding comments (I know it's possible to add it in the script editor, but realistically I will use the View Designer far more often, especially if the view is trivial). StateCODE = 'NEW' is much more readable than StateID = 1000.
I guess the question is will there be cases where char(4) is problematic, and since the database is pretty small, I'm not too concerned about slight performance hit (like using TinyInt versus int), but more afraid of code maintenance problems.
Database purists will say a key should have no meaning in the business domain, and that you should create a status table where you look up the description and other meanings of the status.
But for operators and end users, having a descriptive status code can be a blessing. And it doesn't even have to be char(4), you can make it varchar(20). This allows them to query without joins, and inspect the database in an easier way.
In the end, I think the char(20) organization will run more smoothly, and go home earlier on Friday. But the int organization has a better abstraction of the database, and they can enjoy meta programming on friday evening (or boosting on forums.)
(All of this assuming that you're writing business support software. One of the more succesful business support systems, SAP, makes successful use of meaningful keys.)
There are many pro's and con's to each method. I'm sure other arguments will come up in favour of using a char(4). My reasons for choosing an int over a char include:
I always use lookup tables. They allow for an audit trail of the value to be retained and easily examined. For example, if one of your status codes is 'MING' and a business decision is made to change it from 'MING' to 'MONG' from a certain date, my lookup table handles this.
Smaller index - if you need to index this column, it will be thinner.
Extendability - OK, I made that word up, but if you need to go from 4 chars to 5 chars for example, a lookup table would be a blessing.
Descriptions: We use a lot of TLA's here which once you know what they are is great but if I gave a business user a report that said "GDA's 2007 1001", they wouldn't necessarily twig that GDA = Good Dead on Arrival. With a lookup table, I can add this description.
Best practice: Can't find the link to hand but it might be something I read in a K.Tripp article. Aim to make your clustered primary key incrementing integers to optimise the index.
Of course if you are absolutely positive that you will never need any more than a handful of 4 characters, there is no reason not to bang it in the table.
The best thing should be a lookup table with defined values and then relate it to original table, that uses that enumeration.
Collation ambigities are one reason to say no to char 4: Does ABcD = abCD = äBCd?
If you have 12 possible values, why not tinyint/byte and a Status table?
If you have to store the status for 10 million rows the 3 bytes different and the collation/string compares add up.
The place where I've run into this use case is columns that would map onto things that I would typically use an Enum for when programming. Do you store the integer value of the Enum or the name of the Enum in the database column? Honestly, I've done it both ways. Usually, I ask myself if the database will be used outside the application I'm building. If so, I will choose the human readable format to store in the database. If not, then I'll choose the integer value as it saves a little time when reconstituting (it's just a cast instead of a parse operation) the Enum in code.
You could also use a tinyint over an int
i always choose int's simply because they are easier to map to enums in code.
If you're dealing with huge amounts of data and high throughput then a smallint or tinyint can give better performance and a smaller footprint on the hard disk. If the data in your application is often viewed directly through applications like Access or Cognos then your business people will probably appreciate the descriptive values. I know that when I'm analyzing data as part of my Database Developer role I get tired of joining a lot of lookup tables because I can't remember if 1 = Foo and 2 = Bar or 1 = Bar and 2 = Foo.
Also, although performance will be enhanced if you have to lookup rows by these codes which can have smaller indexes, it can also be hurt (in a minor way) by having to do the joins if you are often looking up rows regardless of the code but where you have to include the text value. In most applications that's not an issue though and would probably only come into play in large data warehousing/reporting environments.