How to differentiate between unknown and known absence of information - sql

Context
NULL is often conflated to mean both unknown data and the known absence of data
In aggregates (such as DATEs or worse, DATETIMEs), sparse data must be represented with magic values
Examples
The middle name of a person
A person who does not have a middle name compared to someone whose middle name is not known
Someone's birth date and time
Differentiating between the time not being known at all vs. the time not being known by your system
Knowing someone's month and day of birth but not the year
Other thoughts, context, and/or approaches
If this were a NoSQL context, one could have a "rule" that if a field is known to be absent, it's not stored at all and if it's unknown, stored as a null
This might make more sense with the rule flipped
Aggregates could be broken up and the rule could be applied to individual fields
I am admittedly ignorant in the NoSQL realm, but it seems like this would be easy to get wrong
For better or for worse, this doesn't apply to a SQL database; omission and NULL are the same
Any field that can be either unknown or absent could have an associated BOOLEAN field that states whether it is absent or not (see the sketch after this list)
This is the only approach that seems bulletproof to me
Could seemingly grow to cover all fields
Seems extremely tedious at the very least
Some "special value" (or values since there are various types) to represent the difference
For a text field, perhaps my-application/unknown and/or my-application/absent (or pick NULL for one)
Impossible to enforce without ambiguity (if one chose, for example, 42 for a Unix time, that is also the legitimate instant 1970-01-01T00:00:42+00:00)
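A minimal sketch of the companion-BOOLEAN idea from the list above (table and column names are hypothetical):

CREATE TABLE person (
    person_id          integer PRIMARY KEY,
    first_name         varchar(50) NOT NULL,
    middle_name        varchar(50) NULL,
    -- TRUE means "known to have no middle name"
    has_no_middle_name boolean NOT NULL DEFAULT FALSE,
    -- a known absence must not also carry a value
    CHECK (NOT (has_no_middle_name AND middle_name IS NOT NULL))
);
-- unknown:      middle_name IS NULL AND has_no_middle_name = FALSE
-- known absent: middle_name IS NULL AND has_no_middle_name = TRUE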
Question
What are the best practices for dealing with the difference between unknown data and known absence of data?

This is too long for a comment.
The "best practices" depends on the context and the overall data model. There are lots of different approaches:
Using a "magic" non-NULL value.
Using a separate boolean/tinyint column.
Using an EAV (entity-attribute-value) model.
Storing the values as JSON or XML (similar to your NoSQL approach).
(and no doubt other methods.)
For your example with the middle name, I find it hard to imagine a scenario where someone is known to have a middle name, but the name (or initial) is not known. So, NULL seems quite appropriate.
This applies to phone numbers, email addresses, preferred honorifics and so on.
In practice, I don't find that this is a major concern. NULL seems to work well for most columns. Admittedly, there are some cases, but those are rather rare.
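For the JSON/XML route, the "rule" from the question's NoSQL idea can be carried over directly. A hedged sketch using PostgreSQL's jsonb (the type choice and all names here are my assumption, not part of the answer above): an absent key means known absence, a JSON null means unknown:

CREATE TABLE person_doc (
    person_id integer PRIMARY KEY,
    name      text NOT NULL,
    extras    jsonb NOT NULL DEFAULT '{}'
);
INSERT INTO person_doc VALUES
    (1, 'Ann', '{"middle_name": "Marie"}'),  -- known value
    (2, 'Bob', '{"middle_name": null}'),     -- unknown
    (3, 'Cho', '{}');                        -- known to have none
-- known to have no middle name: the key is absent
SELECT * FROM person_doc WHERE NOT extras ? 'middle_name';
-- middle name unknown: the key is present but JSON null
SELECT * FROM person_doc WHERE extras -> 'middle_name' = 'null'::jsonb;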

Related

What is atomicity in dbms

I read something like the following about the 1NF form in DBMS.
There was a sentence as follows:
"Every column should be atomic."
Can anyone please explain it to me thoroughly with an example?
Re "atomic"
In Codd's original 1969 and 1970 papers he defined relations as having a value for every attribute in a row. The value could be anything, including a relation. This used no notion of "atomic". He explained that "atomic" meant not relation-valued (ie not table-valued):
So far, we have discussed examples of relations which are defined on simple domains--domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements.
He used "simple", "atomic" and "nondecomposable" as informal expository notions. He understood that a relation has rows of which each column has an associated name and value; attributes are by definition "single-valued"; the value is of any type. The only structural property that matters relationally is being a relation. It is also just a value, but you can query it relationally. Then he used "nonsimple" etc meaning relation-valued.
By the time of Codd's 1990 book The Relational Model for Database Management: Version 2:
From a database perspective, data can be classified into two types: atomic and compound. Atomic data cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions). Compound data, consisting of structured combinations of atomic data, can be decomposed by the DBMS.
In the relational model there is only one type of compound data: the relation. The values in the domains on which each relation is defined are required to be atomic with respect to the DBMS. A relational database is a collection of relations of assorted degrees. All of the query and manipulative operators are upon relations, and all of them generate relations as results. Why focus on just one type of compound data? The main reason is that any additional types of compound data add complexity without adding power.
"In the relational model there is only one type of compound data: the relation."
Sadly, "atomic = non-relation" is not what you're going to hear. (Unfortunately Codd was not the clearest writer and his expository remarks get confused with his bottom line.) Virtually all presentations of the relational model get no further than what was for Codd merely a stepping stone. They promote an unhelpful confused fuzzy notion canonicalized/canonized as "atomic" determining "normalized". Sometimes they wrongly use it to define realtion. Whereas Codd used everyday "nonatomic" to introduce defining relational "nonatomic" as relation-valued and defined "normalized" as free of relation-valued domains.
(Neither is "not a repeating group" helpful as "atomic", defining it as not something that is not even a relational notion. And sure enough in 1970 Codd says "terms attribute and repeating group in present database terminology are roughly analogous to simple domain and nonsimple domain, respectively".)
Eg: This misinterpretation was promoted for a long time from early on by Chris Date, honourable early relational explicator and proselytizer, primarily in his seminal, still-current book An Introduction to Database Systems, which now (2004, 8th edition) thankfully presents the helpful relationally-oriented extended notion of distinguishing relation, row and "scalar" (non-relation, non-row) domains:
This definition merely states that all [relation variables] are in 1NF
Eg: Maier's classic The Theory of Relational Databases (1983):
The definition of atomic is hazy; a value that is atomic in one application could be non-atomic in another. For a general guideline, a value is non-atomic if the application deals with only a part of the value.
Eg: The current Wikipedia article on First NF (Normal Form) section Atomicity actually quotes from the introductory parts above. And then ignores the precise meaning. (Then it says something unintelligible about when the nonatomic turtles should stop.):
Codd states that the "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS." Codd defines an atomic value as one that "cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions)" meaning a field should not be divided into parts with more than one kind of data in it such that what one part means to the DBMS depends on another part of the same field.
Re "normalized" and "1NF"
When Codd used "normalize" in 1970, he meant eliminate relation-valued ("non-simple") domains from a relational database:
For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating. There is, in fact, a very simple elimination procedure, which we shall call normalization.
Later the notion of "higher NFs" (involving FDs (functional dependencies) & then JDs (join dependencies)) arose and "normalize" took on a different meaning. Since Codd's original normalization paper, normalization theory has always given results relevant to all relations, not just those in Codd's 1NF. So one can "normalize" in the original sense of going from just relations to a "normalized" "1NF" without relation-valued columns. And one can "normalize" in the normalization-theory sense of going from a just-relations "1NF" to higher NFs while ignoring whether domains are relations. And "normalization" is commonly also used for the "hazy" notion of eliminating values with "parts". And "normalization" is also wrongly used for designing a relational version of a non-relational database (whether just relations and/or some other sense of "1NF").
Relational spirit is to eschew multiple columns with the same meaning or domains with interesting parts in favour of another base table. But we must always come to an informal ergonomic decision about when to stop representing parts and just treat a column as "atomic" (non-relation-valued) vs "nonatomic" (relation-valued).
Normalization in database management system
Atomicity and 1NF... that is not about atomic transactions, but about definition and column content.
"Atomic" means "cannot be divided or split in smaller parts". Applied to 1NF this means that a column should not contain more than one value. It should not compose or combine values that have a meaning of their own.
This typically concerns two very common mistakes made by database designers:
1. multiple values in one column (list columns)
columns that contain a list of values, typically space- or comma-separated, like this blog post table:
id | title | date_posted | content | tags
1 | new idea | 2014-05-23 | ... | tag1,tag2,tag3
2 | why this? | 2014-05-24 | ... | tag2,tag5
3 | towel day | 2014-05-26 | ... | tag42
or this contacts table:
id | room | phones
4 | 432 | 111-111-111 222-222-222
5 | 456 | 999-999-999
6 | 512 | 888-888-8888 333-3333-3333
This type of denormalization is rare, as most database designers see that it cannot be a good thing. But you do find tables like this. They usually come from modifications to the database, where it may seem simpler to widen a column and stuff multiple values into it instead of adding a normalized related table (which often breaks existing applications).
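As a hedged sketch of the normalized alternative (names invented), the tag list becomes a related table with one row per tag:

CREATE TABLE blog_post (
    id          integer PRIMARY KEY,
    title       varchar(100) NOT NULL,
    date_posted date NOT NULL,
    content     text NOT NULL
);
CREATE TABLE blog_post_tag (
    post_id integer NOT NULL REFERENCES blog_post(id),
    tag     varchar(50) NOT NULL,
    PRIMARY KEY (post_id, tag)  -- one row per (post, tag); no list column
);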
2. complex multi-part columns
In this case one column contains different bits of information and could maybe be designed as a set of separate columns.
Typical examples are fullname and address columns:
id | fullname | address
1 | Mark Tomers | 56 Tomato Road
2 | Fred Askalong | 3277 Hadley Drive
3 | May Anne Brice | 225 Century Avenue - apartment 43/a
These types of denormalizations are very common, as it is quite difficult to draw the line between what is atomic and what is not. Depending on the application, a multi-part column could very well be the best solution in some cases. It is less structured, but simpler.
Structuring an address in many atomic columns may mean having more complex code to handle results for output. Another complexity comes from the structure not being adequate to fit all types of addresses. Using one single VARCHAR column does not pose this problem, but may pose others... typically about searching and sorting.
An extreme case of multi-part columns are dates and times. Most RDBMS provide date and time data types and provide functions to handle date and time algebra and the extraction of the various bits (month, hour, etc...). Few people would consider it convenient to have separate year, month, and day columns in a relational database. But I've seen it... and with good reasons: the use case was birth dates for a justice department database. They had to handle many immigrants with few or no documents. Sometimes you just knew a person was born in a certain year, but you would not know the day or month of birth. You can't handle that type of info with a single date column.
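A minimal sketch of that use case (hypothetical names): splitting the date lets each part be unknown independently:

CREATE TABLE person_birth (
    person_id   integer PRIMARY KEY,
    birth_year  integer NULL,  -- NULL = not known
    birth_month integer NULL CHECK (birth_month BETWEEN 1 AND 12),
    birth_day   integer NULL CHECK (birth_day BETWEEN 1 AND 31)
);
-- "born in 1971, month and day unknown" is now representable:
INSERT INTO person_birth (person_id, birth_year) VALUES (1, 1971);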
"Every column should be atomic."
Chris Date says, "Please note very carefully that it is not just simple things like the integer 3 that are legitimate values. On the contrary, values can be arbitrarily complex; for example, a value might be a geometric point, or a polygon, or an X ray, or an XML document, or a fingerprint, or an array, or a stack, or a list, or a relation (and so on)."[1]
He also says, "A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute."[2]
He generally discourages the use of the word atomic, because it has confusing connotations. Single value is probably a better term to use.
For example, a date like '2014-01-01' is a single value. It's not indivisible; on the contrary, it quite clearly is divisible. But the dbms does one of two things with single values that have parts. The dbms either returns those values as a whole, or the dbms provides functions to manipulate the parts. (Clients don't have to write code to manipulate the parts.)[3]
In the case of dates, SQL can
return dates as a whole (SELECT CURRENT_DATE),
return one or more parts of a date (EXTRACT(YEAR FROM CURRENT_DATE)),
add and subtract intervals (CURRENT_DATE + INTERVAL '1' DAY),
subtract one date from another (CURRENT_DATE - DATE '2014-01-01'),
and so on. In this (narrow) respect, SQL is quite relational.
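For instance, in a dialect that allows SELECT without a FROM clause (PostgreSQL here; that choice is my assumption, not part of the answer), all four operations fit in one query:

SELECT CURRENT_DATE                       AS today,
       EXTRACT(YEAR FROM CURRENT_DATE)    AS this_year,
       CURRENT_DATE + INTERVAL '1' DAY    AS tomorrow,
       CURRENT_DATE - DATE '2014-01-01'   AS days_elapsed;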
[1] An Introduction to Database Systems, 8th ed, p 113. Emphasis in the original.
[2] Ibid, p 358.
[3] In the case of a "user-defined" type, the "user" is presumed to be a database programmer, not a client of the database.
It means a column should not contain multiple values (like comma-separated values).
Please see the link below.
http://www.studytonight.com/dbms/database-normalization.php

SQL table design: one or multiple line per entity? [duplicate]

I was wondering: if you have a website with a dozen different types of listings (Shops, Restaurants, Clubs, Hotels, Events) that require different fields, is there a benefit to creating a table with columns defined like so:
Example Shop:
shop_id | name | X | Y | city | district | area | metro | station | address | phone | email | website | opening_hours
Or a more abstract approach similar to this:
object_id | name
---------------
1 | Messy Joe's
2 | Bate's Motel
type_id | name
---------------
1 | hotel
2 | restaurant
object_id | type_id
---------------
1 | 2
2 | 1
field_id | name | field_type
---------------
1 | address | text
2 | opening_hours | date
3 | speciality | text
type_id | field_id
---------------
1 | 1
1 | 2
2 | 1
2 | 3
object_id | field_id | value
1 | 1 | 1st street....
1 | 3 | English Cuisine
Of course it can be more abstract if values are predefined (Example: specialties could have their own list)
If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins.
But I don't know if this affects the performance, executing these 'more complex' queries.
I would be interested to know what the upsides and downsides of both methods are. I can just imagine for myself, but I don't have the experience to confirm this.
Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.
Pre-requisite Resolution
Labels
In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.
What you have posted as FixedTables is Unnormalised. Fair enough, it may be an attempt at Third Normal Form, but in fact it is a flat file, Unnormalised (not "denormalised"). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal Form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.
The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns or to resolve dependencies) and Nulls; it is a performance hog in many ways, and it prevents concurrency.
In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.
It cannot be said to be denormalised "for performance", because being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design, and "for performance" is it. Even the smallest formal scrutiny exposes the misrepresentation (but very few people can provide that scrutiny, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).
Normalised structures perform far better than Unnormalised structures. More normalised structures (EAV/6NF) perform better than less normalised structures (3NF/5NF).
I am agreeing with the thrust of OMG Ponies, but not their labels and definitions: rather than saying 'don't "denormalise" unless you have to', I am saying, 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.
Wikipedia
The entries for Normal Forms and Normalisation offer definitions that are incorrect; they confuse the Normal Forms; they are lacking regarding the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wikipedia adds to an already confused and rarely understood subject. So don't waste your time.
However, in order to progress, without that reference posing a hindrance, let me say this.
The definition of 3NF is stable, and has not changed.
There is a lot of confusion of the NFs between 3NF and 5NF. The truth is that this is an area that progressed over the last 15 years, and many orgs, academics as well as vendors with their limited products, jumped to create a new "Normal Form" to validate their offerings: all serving commercial interests and academically unsound. 3NF in its original untampered state intended and guaranteed certain attributes.
The sum total is: 5NF today is what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in between, some of which are identified in Wikipedia, and even that in confusing terms.
Fifth Normal Form
Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course a true Relational Model is pre-requisite, strong keys, etc. Fifth Normal Form is, since we are skipping the Fourth:
Third Normal Form
which in simple definitive terms is, every non-key column in every table has a 1::1 relationship to the Primary Key of the table,
and to no other non-key columns
Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal without the formal process)
no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
If you understand the above, 4NF, BCNF, and all the silly "NFs" can be dismissed, they are required for physicalised Record Filing Systems, as promoted by academics, quite foreign to the Relational Model (Codd).
Sixth Normal Form
The purpose is elimination of missing data (attribute columns), aka elimination of Nulls
This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
Technically, it is not a true Normal Form, because it does not have 5NF as a pre-requisite, but it has value.
EAV vs Sixth Normal Form
All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented many true 6NF databases. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example.
The difference is this: because it is loose, and because implementers do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.
Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:
6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).
Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
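A hedged illustration of that setup (all names invented; this is not the author's actual catalogue-driven code): each optional attribute gets its own table, so no Nulls are stored, and a 5NF-style view reassembles the familiar row shape:

CREATE TABLE customer (
    customer_id integer PRIMARY KEY,
    name        varchar(100) NOT NULL
);
-- one table per optional attribute; a missing fact is simply a missing row
CREATE TABLE customer_phone (
    customer_id integer PRIMARY KEY REFERENCES customer(customer_id),
    phone       varchar(20) NOT NULL
);
-- the 5NF view that users query; NULL appears only at presentation time
CREATE VIEW customer_5nf AS
SELECT c.customer_id, c.name, p.phone
FROM customer c
LEFT JOIN customer_phone p ON p.customer_id = c.customer_id;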
Discussion
Now, we can start the discussion.
"Of course it can be more abstract if
value's are predefined (Example:
specialities could have their own
list)"
Sure. But do not get too "abstract". Maintain consistency and implement such lists in the same EAV (or 6NF) manner as other lists.
"If I take the abstract approach it
can be very flexible, but queries will
be more complex with a lot of joins.
But I don't know if this affects the
performance, executing these 'more
complex' queries."
Joins are pedestrian in relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.
EAV and 6NF databases have more Joins, which are just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.
The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL. Eliminates an entire class of errors as well.
It is a common myth that Joins somehow have a cost. Totally false.
The join is implemented at compile time, there is nothing of substance to 'cost' CPU cycles.
The issue is the size of tables being joined, not the cost of the Join between those same tables.
Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which has the appropriate indices (Unique on the parent [PK] side; Unique on the Child side [PK = parent FK + something]), is instantaneous.
Where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow.
None of it has to do with Join cost.
Where many rows are returned, the bottleneck will be the network and the disk layout, not the join processing.
Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
"I would be interested to know what the upsides and downsides of both methods are. I can just imagine for myself, but I don't have the experience to confirm this."
5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance.
The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine in some cases, but not in most cases: due to the change control in place, it is quite onerous.
Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.
EAV (which is what you have posted) allows columns to be added without DDL changes. That is the single reason people choose it (again, code handling the new column does not count, because that is an imperative). If implemented well, it will not affect existing code; if not, it will.
But you need EAV-capable developers.
When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised which is what most databases out there are (misrepresented as "denormalised for performance").
Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed.
Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain", the result was, as you can imagine, one mother of a data heap with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.
There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, no more so than Unnormalised.
Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.
In your question, you have presented at least two major issues at the same time. Those two issues are E-A-V and gen-spec.
First, let's talk about E-A-V. Your last table (object_id, field_id, value) is essentially an E-A-V. There is an upside to E-A-V and a downside to E-A-V. The upside is that the structure is so generic that it can accommodate almost any body of data describing almost any subject matter. That means that you can proceed to design and implementation with no data analysis and no understanding of the subject matter, and not worry about wrong assumptions. The downside is that at retrieval time, you have to do the data analysis that you skipped over before building the database, in order to come up with queries that mean anything. This is much more serious than just retrieval efficiency. But you are also going to have terrible problems with retrieval efficiency. There are only two ways to learn about this pitfall: live through it or read about it from those who have. I recommend the reading.
Second, you have a gen-spec case. Your table (object_id, type_id) captures a gen-spec (generalization-specialization) pattern, along with the related tables. If I had to generalize between hotels and restaurants, I might call it something like "public accommodations" or "venues". But I'm not sure I understand your case, and you may be driving for something even more general than those two names suggest. After all, you've included "events" in your list, and an event is not a type of venue in my mind.
I've referred other people to readings on gen-spec and the relational model in previous responses.
When two tables are very similar, when should they be combined?
But I hesitate to send you off in the same direction, because it's not clear to me that you want to come up with a relational model of the data before building your database. A relational model of a body of data and an E-A-V model of the same data are almost totally at odds with each other. It seems to me you have to make that choice before you even explore how to express gen-spec in the relational model of data.
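To make the E-A-V retrieval cost mentioned above concrete, here is a hedged sketch (the table and column names are my assumption, since the question's tables are unnamed) of pivoting E-A-V rows back into one row per object:

SELECT o.object_id,
       o.name,
       MAX(CASE WHEN f.name = 'address'       THEN v.value END) AS address,
       MAX(CASE WHEN f.name = 'opening_hours' THEN v.value END) AS opening_hours
FROM object_tbl o
JOIN object_field_value v ON v.object_id = o.object_id
JOIN field_def f          ON f.field_id  = v.field_id
GROUP BY o.object_id, o.name;
-- every additional attribute means another CASE branch (or another join)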
When you start to require a large number of different entities (or even before...), a NoSQL solution would be vastly simpler than either choice.
Just store each entity/record with the exact fields you require.
{
    "id": 1,
    "type": "Restaurant",
    "name": "Messy Joe",
    "address": "1 Main St.",
    "tags": ["asian", "fusion", "casual"]
}
The "abstract" approach is better known as "Normalization", looks like 3rd Normal Form (3NF).
The other one is called "Denormalized", and can be a valid performance option... when you've encountered speed issues using the Normalized approach, not before.
How do you have the listings represented in code? I'd guess Listing as a supertype, with Shop, Restaurant, etc. as subtypes?
Assuming so, this is a case of how to map subtypes to a relational database. There are generally three choices:
Option 1: single table per subtype, with common attributes repeated in each table (name, id, etc).
Option 2: single table for all objects (your single table approach)
Option 3: table for the supertype and one for each subtype
There's no universally correct solution. My preference is generally to start with option 3; it provides an intuitive structure to work with, is pretty well normalised and can easily be extended. It means a single join for retrieving each instance - but RDBMS are well optimised for doing joins so it doesn't really cause performance problems in practice.
Option 2 can be more performant for queries (no joins) but causes problems if other tables need to refer to all supertype instances (proliferation of foreign keys).
Option 1 appears at first sight to be the most performant, although with 2 caveats: (1) It's not resilient to change. If you add a new subtype (and so different attributes) you'll need to change the table structure and migrate it. (2) It can be less efficient than it seems. Because the table population is sparse, some DBs don't store it particularly efficiently. As a consequence it can be less efficient than option 3 - since the query engine can do joins faster than it can search bloated sparse table spaces.
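A minimal sketch of the option 3 layout for the listings in the question (assumed names, not a complete design):

CREATE TABLE listing (                    -- supertype: attributes common to all
    listing_id integer PRIMARY KEY,
    name       varchar(100) NOT NULL
);
CREATE TABLE restaurant (                 -- one subtype table per listing type
    listing_id integer PRIMARY KEY REFERENCES listing(listing_id),
    speciality varchar(100) NOT NULL
);
-- retrieving an instance is the "single join" mentioned above:
SELECT l.name, r.speciality
FROM listing l
JOIN restaurant r ON r.listing_id = l.listing_id;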
Which to choose really comes down to knowing details of your problem. I'd suggest reading up a bit on the options: this article is a good place to start.
hth


Is an overuse of nullable columns in a database a "code smell"?

I'm just stepping into a project and it has a fairly large database backend. I've started digging through this database and 95% of the fields are nullable.
Is this normal practice in the database world? I'm just a lowly programmer, not a DBA but I would think you would want to keep nullable fields to a minimum, only where they make sense.
Is it a "code smell" if most columns are nullable?
Default values are typically the exception and NULLs are the norm, in my experience.
True, nulls are annoying.
It's also extremely useful because null is the best indicator of "NO VALUE". A concrete default value is very misleading, and you can lose information or introduce confusion down the road.
Anyone who has developed a data entry application knows how common it is for some of the fields to be unknown at the time of entry -- even for columns that are business-critical, to address @Chris McCall's answer.
However, a "code smell" is merely an indicator that something might be coded in a sloppy way. You use smells to identify things that need more investigation, not necessarily things that must be changed.
So yes, if you see nullable columns so consistently, you're right to be suspicious. It might indicate that someone was being lazy, or afraid to declare NOT NULL columns unequivocally. You can justify doing your own analysis.
I'm of the Extreme NO camp: I avoid NULLs all the time. Putting aside fundamental considerations about what they actually mean (talk to different people and you'll get different answers, such as "no value", "unknown value", "missing", "my ginger cat called Null"), the worst problem NULLs cause is that they often ruin your queries in mysterious ways.
I've lost count of the number of times I've had to debug someone's query (okay, maybe 9) and traced the problem to a join against a NULL. If your code needs ISNULL to repair joins then the chances are you've also lost index applicability and performance with it.
If you do have to store a "missing/unknown/null/cat" value (and it's something I prefer to avoid), it is better to be explicit about it.
Those skilled at NULLs may disagree. NULL use tends to split SQL crowds down the middle.
In my experience, heavy NULL use has been positively correlated with database abuse but I wouldn't carve this into stone tablets as some Law of Nature. My experience is just my experience.
EDIT: Additional thought. It is possible that those who are anti-NULL zealots like myself are more excited by normalization than those who are pro-NULL. I don't think rabid normalizers would be too happy with ragged edges on their tables that can take NULLs. Lots of nulls may indicate that the database developers are not into heavy normalisation. So rather than NULL suggesting the code is "bad", it may alternatively suggest the philosophical position of the developers on normalisation. Maybe this is reaching. Just a thought.
Don't know if I consider it always a bad thing, but if the columns are being added because a single record (or maybe a few) need to have values while most don't, then it indicates a pretty flat table structure. If you're seeing column names like "addr1", "addr2", "addr3", then it stinks!
I would bet that most of the columns you have could be removed and represented in other tables. You could find the "non-null" ones through a foreign key relationship. This will increase the joins that you'll be doing, but it could be more performant than doing a "where not col1 is null".
I think nullable columns should be avoided. Wherever the semantics of the domain make it possible to use a value that clearly indicates missing data, it should be used instead of NULL.
For instance, let's imagine a table that contains a Comment field. Most developers would place a NULL here to indicate that there's no data in the column. (And, hopefully, a check constraint that disallows zero-length strings so that we have a well-known "value" to indicate the lack of a value.) My approach is usually the opposite. The Comment column is NOT NULL and a zero-length string indicates the lack of a value. (I use a check constraint to ensure that the zero-length string is really a zero-length string, and not whitespace.)
So, why would I do this? Two reasons:
NULLs require special logic in SQL, and this technique avoids that.
Many client-side libraries have special values to indicate NULL. For instance, if you use Microsoft's ADO.NET, the constant DBNull.Value indicates a NULL, and you have to test for that. Using a zero-length string on a NOT NULL column obviates the need.
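A minimal sketch of that convention (the table is hypothetical; TRIM is standard SQL):

CREATE TABLE ticket (
    ticket_id integer PRIMARY KEY,
    -- '' is the well-known "no comment" value; whitespace-only text is rejected
    comment   varchar(500) NOT NULL DEFAULT ''
        CHECK (comment = '' OR TRIM(comment) <> '')
);
-- finding records with no comment is then unambiguous:
-- SELECT * FROM ticket WHERE comment = '';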
Despite all of this, there are many circumstances in which NULLs are fine. In fact, I have no objection to their use in the scenario above, although it wouldn't be my preferred way.
Whatever you do, be kind to those who will use your tables. Be consistent. Allow them to SELECT with confidence. Let me explain what I mean by this. I recently worked on a project whose database was not designed by me. Nearly every column was nullable and had no constraints. There was no consistency about what represented the absence of a value. It could be NULL, a zero-length string, or even a bunch of spaces, and often was. (How that soup of values got there, I don't know.)
Imagine the ugly code a developer has to write to find all of those records with a missing Comment field in this scenario:
SELECT * FROM Foo WHERE LEN(ISNULL(Comment, '')) = 0
Amazingly there are developers who regard this as perfectly acceptable, even normal, despite possible performance implications. Better would be:
SELECT * FROM Foo WHERE Comment IS NULL
Or
SELECT * FROM Foo WHERE Comment = ''
If your table is properly designed, the above two SQL statements can be relied upon to produce quality data.
In short, I would say yes, this is probably a code smell.
Whether a column is nullable or not is very important and should be determined carefully. The question should be assessed for every column. I am not a believer in a single "best practices" default for NULL. The "best practice" for me is to address the nullability thoroughly during the design and/or refactoring of the table.
To start with, none of your primary key columns are going to be nullable. Then, I strongly lean towards NOT NULL for anything which is a foreign key.
Some other things I consider:
Criteria where NULL should be strongly avoided:
money columns - is there really a possibility that this amount will be unknown?
Criteria where NULL can be justified most frequently:
datetime columns - there are no reserved dates, so NULL is effectively your best option
Other data types:
char/varchar columns - for codes/identifiers - NOT NULL almost exclusively
int columns - mostly NOT NULL unless it's something like "number of children" where you want to distinguish an unknown response.
No, whether or not a field should be nullable is a data concept and can't be a code smell. Whether or not NULLs are annoying to code has nothing to do with the usefulness of having nullable data fields.
They are a (very common) smell, I'm afraid. Look up C.J. Date's writings on the topic.
As a best practice, if a column shouldn't be nullable, then it should be marked as such. However, I don't believe in going completely insane with things like this.
I think so. If you don't need the data, then it's not important to your business. If it is important to your business, it should be required.
This is all completely dependent on the scope and requirements of the project. I wouldn't use the number of nullable fields alone as a metric for poorly written or designed code. Have a look at the business domain: if many fields that are required there are nullable in the database, then you have some issues.
In my experience, it is a problem when NULL and NOT NULL don't match up with required and not-required fields.
It is in the realm of possibility that those really are all optional fields. If you find in the business tier or the UI tier that those fields are required, then I think this means the data model has drifted away from the business object model and is a sign of overly conservative DB change policies, or oversight.
If you run a sample data generator on your data, and then try to load data that is valid according to the SQL schema, you will find out right away whether the rules match up.
That seems like a lot; it probably means you should at least investigate. Note that if this is a mature product with a lot of data, convincing anyone to change the structure may be difficult. The earlier in the design phase you catch something like this, the easier it is to fix up all the related code to adjust for the change.
Whether it is bad that they used the nulls would depend on whether the columns allowing nulls look as if they should be related tables (home phone, cell phone, business phone, etc., which should be in a separate phone table), or if they look like things that might not be applicable to all records (possibly a related table with a one-to-one relationship), or might not be known at the time of data entry (probably OK). I would also check to see if they in fact always do have a value (then you might be able to change to NOT NULL if the information is genuinely required by the business logic). If you have a few records with null
In my experience, a lot of nullable fields in a large database like yours is very normal, considering it is probably used by a lot of applications written by different people. Making columns nullable is annoying, but it is perhaps the best way to keep the application robust.
One of the many ways to map inheritance (e.g. C# objects) to a database is to create a table for the class at the top of the hierarchy, then add the columns for all the other classes. The columns have to be nullable for when an object of a different subclass is stored in the database. This is called Single-table inheritance mapping (or Map Hierarchy To A Single Table) and is a standard design pattern.
A side effect of Single-table inheritance mapping is that most columns are nullable.
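A hedged sketch of the pattern (a made-up Vehicle hierarchy, not from the answer above):

CREATE TABLE vehicle (
    vehicle_id   integer PRIMARY KEY,
    vehicle_type varchar(10) NOT NULL CHECK (vehicle_type IN ('Car', 'Truck')),
    make         varchar(50) NOT NULL,  -- common to every subclass
    seat_count   integer NULL,          -- Car only; NULL on every Truck row
    payload_kg   integer NULL           -- Truck only; NULL on every Car row
);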
Also, in Oracle an empty string (zero length) is considered to be NULL, so in some companies all string columns are made nullable even on SQL Server. (Just because the first customer wants the software on SQL Server does not mean the second customer won't have an Oracle DBA who refuses to let SQL Server onto their network.)
To throw the opposite opinion out there: every single field in a database should be nullable. There is nothing more frustrating than working with a database that throws an exception about required-this or required-that on every single insert. Nothing should be required.
There is one exception to that, keys. Obviously all primary and foreign keys should be enforced to exist.
It should be the application's job to validate data, and the database's to simply store and retrieve what you give it. Having the database process validation logic, even something as simple as null or not null, makes a project much more complex to maintain, because different rules end up spread over everything.
As mentioned by others, front-facing data entry should allow many fields to be omitted. This is complicated by how people interpret the three-valued nature of NULL (e.g. empty versus missing).
As such, I am only answering about one facet of database design: foreign keys.
In general, foreign keys do not suffer from the arbitrary nature of business logic, therefore seeing these columns allowing NULL is definitely a code smell.
For example, if you had a [Person] table, in no situation would you ever have a [Person].[FatherID] value that was NULL intentionally.
For a large database, an attempt to save NULL to such a column is likely to occur at some point due to the inevitability of bugs, and a NOT NULL constraint would bring that bug to light much sooner. So for version 1 of a table, you should never allow nullable columns without justification (a sketch follows).
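A minimal sketch of a NOT NULL foreign key, assuming a customer table already exists (the names are illustrative, not from the answer above):

    CREATE TABLE orders (
        order_id    INT NOT NULL PRIMARY KEY,
        customer_id INT NOT NULL REFERENCES customer (customer_id)
        -- a buggy INSERT that supplies NULL for customer_id fails fast here
    );

If the relationship really is optional for some rows, a separate link table often models that more honestly than a nullable column.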
But things get much trickier in an evolving code base, especially one that is staying online and thus requires migration scripting to upgrade. In particular, you may find nullable columns added to tables later on, because properly adding them as non-nullable can be quite hard depending on your integration process.
Furthermore, visual table designers (such as in SQL Server Management Studio and Visual Studio) default to allowing NULL so it could simply be a matter of inadequate code review.
I don't want to attempt a proper answer for flag (i.e. boolean) columns, but I strongly suggest considering how they can be implemented without allowing NULL, since I have usually found ways to avoid nullability even under the constraints of business logic (one such way is sketched below).
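One hedged sketch of avoiding a nullable flag: model the "unknown" state explicitly with a NOT NULL default instead of overloading NULL (the names are illustrative):

    -- Instead of is_active BIT NULL, where NULL silently means "unknown":
    CREATE TABLE account (
        account_id INT         NOT NULL PRIMARY KEY,
        status     VARCHAR(10) NOT NULL DEFAULT 'unknown'
                   CHECK (status IN ('active', 'inactive', 'unknown'))
    );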

Why do we care about data types?

Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMSs use principles that were designed in the early '80s, when such choices were vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and stores its NUMBERs as sets of base-100 (centesimal) digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, non-indexed values were stored as VARCHAR2s and cast dynamically into appropriate datatypes depending on several conditions.
That was quite a special thing, though: it was used for bulk loading key-value pairs in one call to the database using collections.
Dynamic SQL statements were used for parsing data and putting them into appropriate tables based on key name.
All values were loaded into the temporary VARCHAR2 column as-is and then converted into NUMBERs and DATETIMEs to be put into their columns.
Explicit data types are huge for efficiency and storage. If types are implicit, they have to be "figured out" and therefore incur speed costs. Indexes would be hard to implement as well.
I would suspect, although I'm not positive, that explicit types also incur less storage space on average. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... Your question is sort of confusing.
If I understand it correctly, you're asking why we specify data types for table columns instead of letting the "engine" automatically determine what is needed for the user.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't automatically decided for you, you specify it when you create the database - almost always using SQL.
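A minimal sketch of a type acting as a constraint (the table name is illustrative; most engines reject the second insert outright, though some, e.g. MySQL in non-strict mode, coerce instead):

    CREATE TABLE measurement (
        reading INT NOT NULL
    );
    INSERT INTO measurement (reading) VALUES (42);     -- accepted
    INSERT INTO measurement (reading) VALUES ('abc');  -- rejected: not an integer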
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that a data item is supposed to be a numeric integer and you deliberately choose NOT to let the DBMS enforce this, then it becomes YOUR responsibility to ensure all sorts of things: data integrity (no value 'A' can ever be entered in the column, nor can the value 1.5), consistency of system behaviour (the value '01' is considered equal to the value '1', which is not the behaviour you get from type String), and so on.
Types take care of all of those things for you, as the comparison below illustrates.
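A small illustration of the '01' versus '1' point (generic SQL; some engines require a FROM clause, and exact results can vary by collation):

    SELECT CASE WHEN '01' = '1' THEN 'equal' ELSE 'not equal' END;
        -- 'not equal': compared as strings
    SELECT CASE WHEN CAST('01' AS INT) = CAST('1' AS INT)
                THEN 'equal' ELSE 'not equal' END;
        -- 'equal': compared as integers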
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
Not all databases work this way. SQLite was mentioned earlier, but a much older set of databases also does this: multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require that you specify what type it is. Searches are still (relatively) fast, it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using meta-data (dictionary items), but that is the limit of how you restrict the data.
See the wikipedia article on UniVerse
When you're pushing half a billion rows in 5 months after go live, every byte counts (in our system)
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about datatypes when it comes to filtering (WHERE clause) or sorting (ORDER BY). For example "200" is LOWER than "3" if those values are strings, and the opposite when they are integers.
I believe sooner or later you will have to sort or filter your data ("200" > "3"?) or use some aggregate functions in reports (like SUM() or AVG()). Until then you are good with a text datatype :)
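A quick illustration of that ordering difference (the table and column names are hypothetical):

    -- Assume val holds the strings '3', '200', '45'.
    SELECT val FROM samples ORDER BY val;
        -- text order:    '200', '3', '45'
    SELECT val FROM samples ORDER BY CAST(val AS INT);
        -- numeric order: 3, 45, 200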
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though since I've never seen it implemented, I don't know what it would be like to use. I can see that it would reduce the chance of errors when using values whose implementations happen to be the same but whose conceptual domains are quite different. It might also help keep people from comparing cm and inches, for instance.
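For what it's worth, a sketch of the standard's CREATE DOMAIN syntax (PostgreSQL implements a version of it, though it does not prevent comparisons between domains that share a base type; the names here are illustrative):

    CREATE DOMAIN height_cm AS NUMERIC(10,2) CHECK (VALUE > 0);
    CREATE DOMAIN width_cm  AS NUMERIC(10,2) CHECK (VALUE > 0);

    CREATE TABLE box (
        height height_cm NOT NULL,
        width  width_cm  NOT NULL
    );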
Constraint is perhaps the most important thing mentioned here. Data types exist to ensure the correctness of your data, so you can be sure you can manipulate it correctly. There are two ways we can store a date: in a DATE type, or as a string such as "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893", or similar. Data types constrain that and define a canonical form for a date.
Furthermore, a datatype has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be accepted as a date. How about "30th of February 1983"? Poor databases, like MySQL, do not make these checks by default (although you can configure MySQL to do so -- and you should!). A sketch of that configuration follows.
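A hedged sketch of the MySQL configuration alluded to above (the exact behaviour depends on the server version and the sql_mode in effect; the table name is illustrative):

    SET sql_mode = 'STRICT_ALL_TABLES,NO_ZERO_DATE,NO_ZERO_IN_DATE';
    CREATE TABLE events (happened_on DATE NOT NULL);
    INSERT INTO events VALUES ('1983-02-30');  -- rejected: not a real calendar date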
Data types will ensure the consistency of your data. This is one of the most important concepts, as keeping your data sane will spare your head from insanity.
RDBMs generally require definition of column types so it can perform lookups fast. If you want to get the 5th column of every row in a huge dataset, having the columns defined is a huge optimisation.
Instead of scanning each row for some form of delimiter to find the 5th column (as it would have to if column widths were not fixed), the RDBMS can seek directly to the offset sizeOf(column1) + sizeOf(column2) + sizeOf(column3) + sizeOf(column4) within the row and read sizeOf(column5) bytes from there. Imagine how much quicker this is on a table of, say, 10,000,000 rows.
Alternatively, if you don't want to specify the types of each column, you have two options that I'm aware of. Specify each column as a varchar(255) and decide what you want to do with it within the calling program. Or you can use a different database system that uses key-value pairs such as Redis.
A database is ultimately about physical storage, and data types define exactly that.