Is an overuse of nullable columns in a database a "code smell"? - sql

I'm just stepping into a project and it has a fairly large database backend. I've started digging through this database and 95% of the fields are nullable.
Is this normal practice in the database world? I'm just a lowly programmer, not a DBA but I would think you would want to keep nullable fields to a minimum, only where they make sense.
Is it a "code smell" if most columns are nullable?

Default values are typically the exception and NULLs are the norm, in my experience.
True, nulls are annoying.
It's also extremely useful because null is the best indicator of "NO VALUE". A concrete default value is very misleading, and you can lose information or introduce confusion down the road.

Anyone who has developed a data entry application knows how common it is for some of the fields to be unknown at the time of entry -- even for columns that are business-critical, to address @Chris McCall's answer.
However, a "code smell" is merely an indicator that something might be coded in a sloppy way. You use smells to identify things that need more investigation, not necessarily things that must be changed.
So yes, if you see nullable columns so consistently, you're right to be suspicious. It might indicate that someone was being lazy, or afraid to declare NOT NULL columns unequivocally. You can justify doing your own analysis.

I'm of the Extreme NO camp: I avoid NULLs all the time. Putting aside fundamental considerations about what they actually mean (ask different people and you'll get different answers, such as "no value", "unknown value", "missing", "my ginger cat called Null"), the worst problem NULLs cause is that they often ruin your queries in mysterious ways.
I've lost count of the number of times I've had to debug someone's query (okay, maybe 9) and traced the problem to a join against a NULL. If your code needs ISNULL to repair joins then the chances are you've also lost index applicability and performance with it.
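For illustration, here's the kind of thing I mean (the tables and columns are made up, and both RegionCode columns are nullable):
-- An inner join on the raw column silently drops every row whose RegionCode is NULL,
-- because NULL = NULL evaluates to UNKNOWN rather than TRUE:
SELECT o.OrderID, r.RegionName
FROM Orders o
JOIN Regions r ON r.RegionCode = o.RegionCode;
-- The usual "repair" wraps both sides in ISNULL so that NULLs match each other,
-- but the predicate is no longer sargable, so any index on RegionCode goes unused:
SELECT o.OrderID, r.RegionName
FROM Orders o
JOIN Regions r ON ISNULL(r.RegionCode, '') = ISNULL(o.RegionCode, '');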
If you do have to store a "missing/unknown/null/cat" value (and it's something I prefer to avoid), it is better to be explicit about it.
Those skilled at NULLs may disagree. NULL use tends to split SQL crowds down the middle.
In my experience, heavy NULL use has been positively correlated with database abuse but I wouldn't carve this into stone tablets as some Law of Nature. My experience is just my experience.
EDIT: Additional thought. It is possible that those of us who are firmly anti-NULL are more excited by normalization than those who are pro-NULL. I don't think rabid normalizers would be too happy with ragged edges on their tables that can take NULLs. Lots of nulls may indicate that the database developers are not into heavy normalization. So rather than NULL suggesting the code is "bad", it may alternatively suggest the philosophical position of the developers on normalization. Maybe this is reaching. Just a thought.

Don't know if I consider it always a bad thing, but if the columns are being added because a single record (or maybe a few) need to have values while most don't, then it indicates a pretty flat table structure. If you're seeing column names like "addr1", "addr2", "addr3", then it stinks!
I would bet that most of the columns you have could be removed and represented in other tables. You could find the "non-null" ones through a foreign key relationship. This will increase the joins that you'll be doing, but it could be more performant than doing a "where not col1 is null".
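For example (a hypothetical sketch, not your actual schema), the "addr1"/"addr2"/"addr3" columns could be pulled out into a child table:
-- Addresses move to their own table; a customer with no address simply has no rows here.
CREATE TABLE CustomerAddress (
    CustomerID  int          NOT NULL REFERENCES Customer (CustomerID),
    LineNo      int          NOT NULL,
    AddressLine varchar(200) NOT NULL,
    PRIMARY KEY (CustomerID, LineNo)
);
-- "Customers that have an address" becomes a join instead of WHERE addr1 IS NOT NULL:
SELECT DISTINCT c.CustomerID
FROM Customer c
JOIN CustomerAddress ca ON ca.CustomerID = c.CustomerID;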

I think nullable columns should be avoided. Wherever the semantics of the domain make it possible to use a value that clearly indicates missing data, it should be used instead of NULL.
For instance, let's imagine a table that contains a Comment field. Most developers would place a NULL here to indicate that there's no data in the column. (And, hopefully, a check constraint that disallows zero-length strings so that we have a well-known "value" to indicate the lack of a value.) My approach is usually the opposite. The Comment column is NOT NULL and a zero-length string indicates the lack of a value. (I use a check constraint to ensure that the zero-length string is really a zero-length string, and not whitespace.)
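A minimal sketch of that approach (assuming SQL Server syntax; the table and constraint names are made up):
-- Comment is NOT NULL; a zero-length string means "no comment";
-- the check rejects values that are whitespace only.
CREATE TABLE Foo (
    FooID   int          NOT NULL PRIMARY KEY,
    Comment varchar(500) NOT NULL
        CONSTRAINT CK_Foo_Comment_NotWhitespace
        CHECK (DATALENGTH(Comment) = 0 OR LTRIM(RTRIM(Comment)) <> '')
);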
So, why would I do this? Two reasons:
NULLs require special logic in SQL, and this technique avoids that.
Many client-side libraries have special values to indicate NULL. For instance, if you use Microsoft's ADO.NET, the constant DBNull.Value indicates a NULL, and you have to test for that. Using a zero-length string on a NOT NULL column obviates the need.
Despite all of this, there are many circumstances in which NULLs are fine. In fact, I have no objection to their use in the scenario above, although it wouldn't be my preferred way.
Whatever you do, be kind to those who will use your tables. Be consistent. Allow them to SELECT with confidence. Let me explain what I mean by this. I recently worked on a project whose database was not designed by me. Nearly every column was nullable and had no constraints. There was no consistency about what represented the absence of a value. It could be NULL, a zero-length string, or even a bunch of spaces, and often was. (How that soup of values got there, I don't know.)
Imagine the ugly code a developer has to write to find all of those records with a missing Comment field in this scenario:
SELECT * FROM Foo WHERE LEN(ISNULL(Comment, '')) = 0
Amazingly there are developers who regard this as perfectly acceptable, even normal, despite possible performance implications. Better would be:
SELECT * FROM Foo WHERE Comment IS NULL
Or
SELECT * FROM Foo WHERE Comment = ''
If your table is properly designed, the above two SQL statements can be relied upon to produce quality data.

In short, I would say yes, this is probably a code smell.
Whether a column is nullable or not is very important and should be determined carefully. The question should be assessed for every column. I am not a believer in a single "best practices" default for NULL. The "best practice" for me is to address the nullability thoroughly during the design and/or refactoring of the table.
To start with, none of your primary key columns are going to be nullable. Then, I strongly lean towards NOT NULL for anything which is a foreign key.
Some other things I consider:
Criteria where NULL should be strongly avoided:
money columns - is there really a possibility that this amount will be unknown?
Criteria where NULL can be justified most frequently:
datetime columns - there are no reserved dates, so NULL is effectively your best option
Other data types:
char/varchar columns - for codes/identifiers - NOT NULL almost exclusively
int columns - mostly NOT NULL unless it's something like "number of children" where you want to distinguish an unknown response.
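A sketch of how those preferences might look in practice (hypothetical table, SQL Server-ish syntax):
CREATE TABLE LoanApplication (
    ApplicationID      int           NOT NULL PRIMARY KEY,                       -- PK: never nullable
    CustomerID         int           NOT NULL REFERENCES Customer (CustomerID),  -- FK: lean towards NOT NULL
    ApplicationCode    varchar(20)   NOT NULL,                                   -- code/identifier: NOT NULL
    RequestedAmount    decimal(19,4) NOT NULL,                                   -- money: should always be known
    ClosedDate         datetime      NULL,                                       -- no reserved dates, so NULL is the honest value
    NumberOfDependents int           NULL                                        -- an "unknown response" is meaningful here
);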

No, whether or not a field should be nullable is a data concept and can't be a code smell. Whether or not NULLs are annoying to code has nothing to do with the usefulness of having nullable data fields.

They are a (very common) smell, I'm afraid. Look up C.J. Date's writings on the topic.

As a best practice, if a column shouldn't be nullable, then it should be marked as such. However, I don't believe in going completely insane with things like this.

I think so. If you don't need the data, then it's not important to your business. If it is important to your business, it should be required.

This is all completely dependent on the scope and requirements of the project. I wouldn't use the number of nullable fields alone as a metric for poorly written or designed code. Have a look at the business domain; if there are many non-nullable fields represented there that are nullable in the database, then you have some issues.

In my experience, it is a problem when NULL and NOT NULL don't match up with the required/not-required status of the field.
It is in the realm of possibility that those really are all optional fields. If you find in the business tier or the UI tier that those fields are required, then I think this means the data model has drifted away from the business object model and is a sign of overly conservative DB change policies, or oversight.
If you run a sample data generator against the schema and then try to load data that is valid according to SQL into the application, you'll find out right away whether the two sets of rules match up.

That seems like a lot, so it probably means you should at least investigate. Note that if this is a mature product with a lot of data, convincing anyone to change the structure may be difficult. The earlier in the design phase you catch something like this, the easier it is to fix up all the related code to adjust for the change.
Whether it is bad that they used the nulls would depend on whether the columns allowing nulls look as if they should be related tables (home phone, cell phone, business phone, etc., which should be in a separate phone table), or look like things that might not be applicable to all records (possibly a related table with a one-to-one relationship), or things that might not be known at the time of data entry (probably OK). I would also check to see if they in fact always do have a value (then you might be able to change to NOT NULL if the information is genuinely required by the business logic). If you have a few records with null

In my experience, a lot of nullable fields in a large database like yours is very normal, considering it is probably used by a lot of applications written by different people. Making columns nullable is annoying, but it is perhaps the best way to keep the applications robust.

One of the many ways to map inheritance (e.g. c# objects) to a database is to create a table for the class at the top of the hierarchy, then add the columns for all the other classes. The columns have to be nullable for when an object of a different subclass is stored in the database. This is called Single-table inheritance mapping (or Map Hierarchy To A Single Table) and is a standard design pattern.
A side effect of Single-table inheritance mapping is that most columns are nullable.
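For example, a hypothetical Vehicle hierarchy with Car and Boat subclasses mapped to one table might look like this (all names invented):
CREATE TABLE Vehicle (
    VehicleID   int          NOT NULL PRIMARY KEY,
    VehicleType varchar(10)  NOT NULL CHECK (VehicleType IN ('Car', 'Boat')),
    Name        varchar(100) NOT NULL,
    DoorCount   int          NULL,   -- applies to Car rows only
    DraftMetres decimal(5,2) NULL    -- applies to Boat rows only
);
A Boat row can never have a DoorCount, so the subclass-specific columns have to allow NULL.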
Also, in Oracle an empty string (zero length) is considered to be NULL, so in some companies all string columns are made nullable even on SQL Server. (Just because the first customer wants the software on SQL Server does not mean the second customer won't have an Oracle DBA who will not let SQL Server onto their network.)

To throw the opposite opinion out there: every single field in a database should be nullable. There is nothing more frustrating than working with a database that throws an exception on every single insert about required this or required that. Nothing should be required.
There is one exception to that, keys. Obviously all primary and foreign keys should be enforced to exist.
It should be the application's job to validate data, and the database's job simply to store and retrieve what you give it. Having it enforce validation logic, even something as simple as null or not null, makes a project much more complex to maintain, because different rules end up spread over everything.

As mentioned by others, front-facing data entry should allow many fields to be omitted. This is complicated by how people interpret the three-valued nature of NULL (e.g. empty versus missing).
As such, I am only answering about one facet of database design: foreign keys.
In general, foreign keys do not suffer from the arbitrary nature of business logic, therefore seeing these columns allowing NULL is definitely a code smell.
For example, if you had a [Person] table, in no situation would you ever have a [Person].[FatherID] value that was NULL intentionally.
For a large database, an attempt to save NULL to such a column is likely to occur at some point due to the inevitability of bugs, which would have been brought to light much sooner by having a NOT NULL constraint. So for version 1 of a table, you should never allow nullable columns without justification.
But things get much trickier in an evolving code base, especially one that is staying online and thus requires migration scripting to upgrade. In particular, you may find nullable columns added to tables later on, because properly adding them as non-nullable can be quite hard depending on your integration process.
Furthermore, visual table designers (such as in SQL Server Management Studio and Visual Studio) default to allowing NULL so it could simply be a matter of inadequate code review.
I don't want to attempt a proper answer for flag (i.e. boolean) columns, but I strongly suggest considering how they can be implemented without allowing NULL, since I have usually found ways to avoid nullability even under the constraints of business logic.
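For what it's worth, one common way to do that in SQL Server (the table, column and constraint names here are invented) is to add the flag with a default, so that existing rows get a value and the column can be NOT NULL from the start:
ALTER TABLE Person
    ADD IsArchived bit NOT NULL
        CONSTRAINT DF_Person_IsArchived DEFAULT (0);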

Why should I avoid NULL values in a SQL database?

I read a 45-tips-database-performance-tips-for-developers document from a famous commercial vendor of SQL tools today, and there was one tip that confused me:
If possible, avoid NULL values in your database. If not, use the
appropriate IS NULL and IS NOT NULL code.
I like having NULL values because to me it makes a difference whether a value was never set or whether it is 0 or an empty string. So databases have this for a purpose.
So is this tip nonsense, or should I take action to prevent having NULL values at all in my database tables? Does it affect performance much to have a NULL value instead of a filled-in number or string value?
Besides the reasons mentioned in other answers, we can look at NULLs from a different angle.
Regarding duplicate rows, Codd said
If something is true, saying it twice doesn’t make it any more true.
Similarly, you can say
If something is not known, saying it is unknown doesn't make it known.
Databases are used to record facts. The facts (truths) serve as axioms from which we can deduce other facts.
From this perspective, unknown things should not be recorded - they are not useful facts.
Anyway, anything that is not recorded is unknown. So why bother recording them?
Not to mention that their existence makes the deduction more complicated.
The NULL question is not simple... Every professional has a personal opinion about it.
Relational theory based on Two-Valued Logic (2VL: TRUE and FALSE) rejects NULL, and Chris Date is one of the fiercest enemies of NULLs. Ted Codd, on the other hand, also accepted Three-Valued Logic (3VL: TRUE, FALSE and UNKNOWN).
Just a few things to note for Oracle:
Single column B*Tree Indexes don't contain NULL entries. So the Optimizer can't use an Index if you code "WHERE XXX IS NULL".
Oracle considers a NULL the same as an empty string, so:
WHERE SOME_FIELD = NULL
is the same as:
WHERE SOME_FIELD = ''
Moreover, with NULLs you must pay attention in your queries, because every compare with NULL returns NULL.
And, sometimes, NULLs are insidious. Think for a moment about a WHERE condition like the following:
WHERE SOME_FIELD NOT IN (SELECT C FROM SOME_TABLE)
If the subquery returns one or more NULLs, you get the empty recordset!
These are just the first few cases that come to mind, but we could talk about NULLs for a long time...
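To make the NOT IN trap concrete, here is a small sketch (tables A and B are hypothetical, and B.C allows NULL):
-- If even one B.C is NULL, the NOT IN predicate can never evaluate to TRUE
-- for any row, so this returns the empty recordset:
SELECT *
FROM A
WHERE A.C NOT IN (SELECT C FROM B);
-- NOT EXISTS is not tripped up by NULLs in the subquery and usually expresses the intent better:
SELECT *
FROM A
WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.C = A.C);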
It's usually good practice to avoid or minimise the use of nulls. Nulls cause some queries to return results that are "incorrect" (i.e. the results won't correspond with the intended meaning of the database). Unfortunately SQL and SQL-style databases can make nulls difficult, though not necessarily impossible, to avoid. It's a very real problem and even experts often have trouble spotting flaws in query logic caused by nulls.
Since there is nothing like nulls in the real world, using them means making some compromises in the way your database represents reality. In fact there is no single consistent "meaning" of nulls and little general agreement on what they are for. In practice, nulls get used to represent all sorts of different situations. If you do use them it's a good idea to document exactly what a null means for any given attribute.
Here's an excellent lecture about the "null problem" by Chris Date:
http://www.youtube.com/watch?v=kU-MXf2TsPE
There are various downsides to NULLs that can make using them more difficult than actual values. For example:
In some cases they are not indexed.
They make join syntax more difficult.
They need special treatment for comparisons.
For string columns it might be appropriate to use "N/A" or "N/K" as a special value that helps distinguish between different classes of what could otherwise be NULL, but that's tricky to do for numerics or dates. Special values are generally tricky to use, and it may be better to add an extra column (e.g. alongside date_of_birth you might have a reason_for_no_date_of_birth column), which can help the application be more useful.
For many cases where data values are genuinely unknown or not relevant they can be entirely appropriate of course -- date_of_death is a good example, or date_of_account_termination.
Sometimes even these examples can be rendered irrelevant by normalising events out to a different table, so you have a table for "ACCOUNT_DATES" with DATE_TYPES of "Open", "Close", etc.
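A sketch of that event-table idea (all names are illustrative only):
CREATE TABLE ACCOUNT_DATES (
    ACCOUNT_ID int         NOT NULL REFERENCES ACCOUNTS (ACCOUNT_ID),
    DATE_TYPE  varchar(10) NOT NULL CHECK (DATE_TYPE IN ('Open', 'Close')),
    EVENT_DATE date        NOT NULL,
    PRIMARY KEY (ACCOUNT_ID, DATE_TYPE)
);
-- An account that has not been closed simply has no 'Close' row,
-- so there is no need for a nullable date_of_account_termination column.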
I think using NULL values in the database is fine as long as your application has proper logic to handle them, but there can be some problems, as discussed in this post:
http://databases.aspfaq.com/general/why-should-i-avoid-nulls-in-my-database.html

Which is reasonable to use for a field that appears in three tables: an enum or a table?

I have a field that should contain the place of a seminar. A seminar can be held:
In-house
In another city
In another country
At first I had only one table, so I used an enum. But now I have three tables which can't be merged, they all need this information, and the customer wants this field to be customizable so that options can be added or removed in the future. They say the number of options will be limited, though, probably 5 or so.
Now, my question is, should I use an enum or a table for this field? More importantly, what would be the proper way to decide between an enum or a table?
PS: enum fields are dynamically retrieved from the database, they are not embedded in the code.
As a general rule of thumb, in an application, use an enum if:
Your values change infrequently, and
There are only a few values.
Use a lookup table if:
Values change frequently, or
Your customer expects to be able to add, change, or delete values in realtime, or
There are a lot of values.
Use both if the prior criteria for an enum are met and:
You need to be able to use a tool external to your application to report on the data, or
You believe you may need to eliminate the enum in favor of a pure lookup table approach later.
You'll need to pick what you feel is a reasonable cutoff for the number of values; I often hear somewhere between 10 and 15 suggested.
If you are using an enum construct provided by your database engine, the first 2 groups of rules still apply.
In your specific case I'd go with the lookup table.
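For instance, a lookup table along these lines (all names hypothetical) keeps the option list in one place that the customer can edit, while the three seminar tables just hold a foreign key to it:
CREATE TABLE SeminarLocation (
    SeminarLocationID int         NOT NULL PRIMARY KEY,
    Name              varchar(50) NOT NULL UNIQUE  -- 'In-house', 'In another city', 'In another country', ...
);
-- and in each of the three tables:
--   SeminarLocationID int NOT NULL REFERENCES SeminarLocation (SeminarLocationID)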
If the elements are fixed, an enum can be used; in the case of variable data, an enum should not be used and a table is preferable.
I think in your case a table should be fine.
I agree with the top answers that enums are lean, slightly faster, and reduce the complexity of your database when used reasonably (that is, without a lot of fast-changing values).
BUT: There are some enormous caveats to consider:
The ENUM data type isn't standard SQL, and beyond MySQL not many other DBMS's have native support for it. PostgreSQL, MariaDB, and Drizzle (the latter two are forks of MySQL anyway), are the only three that I know of. Should you or someone else want to move your database to another system, someone is going to have to add more steps to the migration procedure to deal with all of your clever ENUMs.
(Source and even more points on http://komlenic.com/244/8-reasons-why-mysqls-enum-data-type-is-evil/)
ENUMs aren't supported by most ORMs, such as Doctrine, exactly because they're not standard SQL. So if you ever consider using an ORM in your project, or even something like the Doctrine Migrations Bundle,
you'll probably end up writing a complex extension bundle (which I've tried) or using an existing one like this, which in PostgreSQL for example cannot add more than pseudo-support for enums: it treats enums as a string type with a check constraint. An example:
CREATE TABLE test (id SERIAL NOT NULL, pseudo_enum_type VARCHAR(255) CHECK(pseudo_enum_type IN ('value1', 'value2', 'value3')) , ...
So in slightly more complex setups, the gain from using enums really drops below zero.
Seriously: If I don't absolutely have to (and I don't) I'd always avoid enums in a database.

Negative integer indexes: are they evil?

I have this database that I'm designing.
It needs to contain a couple dozen tables with records that we provide (a bunch of defaults) as well as records that the user can add. In order to keep the user from shooting himself in the foot, it's necessary to keep him from modifying the default records.
There are lots of ways to facilitate this, but I like the idea of giving protected records negative integer indexes, while reserving 0 as an invalid record id and giving user records positive integer indexes.
CREATE TABLE t1 (
ixt1 integer AUTOINCREMENT,
d1 double,
CONSTRAINT pk_ixt1 PRIMARY KEY (ixt1),
CONSTRAINT ch_zero CHECK (ixt1 <> 0)
);
-2 | 171.3 <- canned record
-1 | 100.0 <- canned record
1 | 666.6 <- user record
Reasons this seems good:
it doesn't use significantly more space
it's easy to understand
it doesn't require lots of additional tables to implement
"select * from table" gets all the pertinent records, with no additional indirection
the canned records can grow in the negative direction, and the user records can grow in the positive direction
However, I'm relatively new to database design. And after using this solution for a little while, I'm starting to worry that using negative indexes might be bad, because
Negative indexes might not be supported consistently among different DBMSs, making it difficult to write code that is database-agnostic
It might be just too easy to screw stuff up by inserting something at recid 0
It might make it hard to use tools (like db grids, perhaps) that expect integer indexes with nonnegative values.
And maybe there are some other really obvious reasons that would make this a Very Bad Idea.
So what's the definitive answer? Are negative integer indexes evil?
The most important flaw in this is the "Intelligent Key" problem.
Negative integers work fine as a key. In all databases.
No tool requires positive integer index values.
It's relatively easy to screw this up because the index has a "rule" which isn't obvious and no one will remember after you've won the lottery and left.
Further, when you invent a third status code ('pre-canned' vs. 'customer-specific canned' vs. 'the other canned invented by a product line' vs. 'the old canned before version 3') you're doomed.
The issue with "Intelligent keys" is that you're asking the key to do two unrelated jobs.
It's the unique identifier for a record. That's what a key is supposed to be.
You're also asking it to provide status, control and authorization to change properties. Oops. This is fraught with danger. You can't expand the meaning because it's a single bit buried in a key.
Just add a column with "owned by". If it's owned by "magical super user", then it's not shown to users. Use a VIEW to assure this, if you can't trust your application developers to enforce it.
If it is owned by "magical super user", then it's the default data, and whatever rules apply to that ownership.
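A rough sketch of that alternative, based on the table from the question (generic SQL; the owner column and view name are invented):
CREATE TABLE t1 (
    ixt1  integer     NOT NULL PRIMARY KEY,
    d1    double precision,
    owner varchar(20) NOT NULL DEFAULT 'user'   -- 'system' marks the canned default records
);
-- Users only ever see and edit their own rows through the view:
CREATE VIEW t1_user AS
    SELECT ixt1, d1
    FROM t1
    WHERE owner <> 'system';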
I worked on a very large billing system. We had a very similar problem... needing to mark some records as being "special". The customers had tens of millions of rows of existing data in their databases for the affected table, and it was deemed unacceptable to migrate all of that data to a new structure (i.e. adding a column).
The decision was taken to do exactly what you suggest.
The trouble with that is, you require every bit of business logic to know about (and remember) the special meaning of negative indices and correctly treat it. That's quite error prone (speaking from experience).
Unless you have unusual circumstances that very strongly speak in favor of this non-traditional approach, I suggest you stick with a more traditional extra column. It's what most developers are used to and therefore less likely to cause errors. I wish we would have bitten the bullet and added the extra column.
It's your data, but I don't think this is a good idea. An 'index' value like this should be meaningless - don't use sign or number range or whatever to mean 'something' or 'something else'. I think that in the long run you'd be much better off having a 'record type' column that indicates clearly what kind of record you're looking at. In my experience this is a much better approach.
Good luck.

Why do we care about data types?

Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMS's use principles that were designed in early 80's, when it was vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and keeps its NUMBERs as sets of centesimal (base-100) digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, non-indexed values were stored as VARCHAR2s and cast dynamically into appropriate datatypes depending on several conditions.
That was quite a special thing, though: it was used for bulk loading key-value pairs in one call to the database using collections.
Dynamic SQL statements were used for parsing data and putting them into appropriate tables based on key name.
All values were loaded to the temporary VARCHAR2 column as is and then converted into NUMBER's and DATETIME's to be put into their columns.
Explicit data types are huge for efficiency, and storage. If they are implicit they have to be 'figured' out and therefore incur speed costs. Indexes would be hard to implement as well.
I would suspect, although I'm not positive, that having explicit types also incurs less storage space on average. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... Your question is sort of confusing.
If I understand it correctly, you're asking why it is that we specify data types for table columns rather than having the "engine" automatically determine what is needed for the user.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't automatically decided for you; you specify it when you create the table, almost always using SQL.
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that some data item is supposed to be numeric integer, and you deliberately choose NOT to let the DBMS take care of enforcing this, then it becomes YOUR responsibility to ensure all sorts of things such as data integrity (ensuring that no value 'A' can be entered in the column, ensuring that no value 1.5 can be entered in the column), such as consistency of system behaviour (ensuring that the value '01' is considered equal to the value '1', which is not the behaviour you get from type String), ...
Types take care of all those sorts of things for you.
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
Not all databases work this way. SQLite was mentioned earlier, but a much older set of databases also does this: multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require that you specify what type it is. Searches are still (relatively) fast, and it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using meta-data (dictionary items), but that is the limit of how you restrict the data.
See the wikipedia article on UniVerse
When you're pushing half a billion rows in 5 months after go live, every byte counts (in our system)
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about datatypes when it comes to filtering (WHERE clause) or sorting (ORDER BY). For example "200" is LOWER than "3" if those values are strings, and the opposite when they are integers.
I believe sooner or later you will have to sort or filter your data ("200" > "3"?) or use some aggregate functions in reports (like sum() or avg()). Until then you are fine with a text datatype :)
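A tiny illustration of the difference (generic SQL, hypothetical table):
CREATE TABLE t (val_text varchar(10), val_int int);
INSERT INTO t VALUES ('200', 200);
INSERT INTO t VALUES ('3', 3);
SELECT * FROM t ORDER BY val_text;      -- '200' sorts before '3' (character comparison)
SELECT * FROM t ORDER BY val_int;       -- 3 sorts before 200 (numeric comparison)
SELECT * FROM t WHERE val_text > '3';   -- returns nothing: as strings, '200' > '3' is false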
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though, since I've never seen it implemented, I don't know what it would be like to use it. I can see that it would reduce the chance of errors in using values whose implementation happen to be the same, when their conceptual domain is quite different. It might also help keep people from comparing cm and inches, for instance.
Constraint is perhaps the most important thing mentioned here. Data types exist to ensure the correctness of your data, so you are sure you can manipulate it correctly. There are two ways we can store a date: in a date type, or as a string like "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893" or similar. Datatypes constrain that and define a canonical form for a date.
Furthermore, a datatype has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be accepted as a date. How about "30th of February 1983"? Poor databases, like MySQL, do not make these checks by default (although you can configure MySQL to do it -- and you should!).
Data types will ensure the consistency of your data. This is one of the most important concepts, as keeping your data sane will spare your head from insanity.
RDBMSs generally require column types to be defined so that they can perform lookups fast. If you want to get the 5th column of every row in a huge dataset, having the columns defined is a huge optimisation.
Instead of scanning each row for some form of delimiter to retrieve the 5th column (which would be necessary if column widths were not fixed), the RDBMS can jump straight to the offset sizeof(column1) + sizeof(column2) + sizeof(column3) + sizeof(column4) bytes into the row and read sizeof(column5) bytes. Imagine how much quicker that is on a table of, say, 10,000,000 rows.
Alternatively, if you don't want to specify the types of each column, you have two options that I'm aware of. Specify each column as a varchar(255) and decide what you want to do with it within the calling program. Or you can use a different database system that uses key-value pairs such as Redis.
A database is ultimately about physical storage, and data types define that storage.

Why use "Y"/"N" instead of a bit field in Microsoft SQL Server?

I'm working on an application developed by another mob and am confounded by the use of a char field instead of bit for all the boolean columns in the database. It uses "Y" for true and "N" for false (these have to be uppercase). The type name itself is then aliased with some obscure name like ybln.
This is very annoying to work with for a lot of reasons, not the least of which is that it just looks downright aesthetically unpleasing.
But maybe it's me that's stupid - why would anyone do this? Is it a database compatibility issue or some design pattern that I am not aware of?
Can anyone enlighten me?
I've seen this practice in older database schemas quite often. One advantage I've seen is that using CHAR(1) fields provides support for more than Y/N options, like "Yes", "No", "Maybe".
Other posters have mentioned that Oracle might have been used. The schema I referred to was in fact deployed on Oracle and SQL Server. It limited the usage of data types to a common subset available on both platforms.
They did diverge in a few places between Oracle and SQL Server but for the most part they used a common schema between the databases to minimize the development work needed to support both DBs.
Welcome to brownfield. You've inherited an app designed by old-schoolers. It's not a design pattern (at least not a design pattern with something good going for it), it's a vestige of coders who cut their teeth on databases with limited data types. Short of refactoring the DB and lots of code, grit your teeth and gut your way through it (and watch your case)!
Other platforms (e.g. Oracle) do not have a bit SQL type. In which case, it's a choice between NUMBER(1) and a single character field. Maybe they started on a different platform or wanted cross platform compatibility.
I don't like the Y/N char(1) field as a replacement to a bit column too, but there is one major down-side to a bit field in a table: You can't create an index for a bit column or include it in a compound index (at least not in SQL Server 2000).
Sure, you could discuss if you'll ever need such an index. See this request on a SQL Server forum.
They may have started development back with Microsoft SQL Server 6.5.
Back then, adding a bit field to an existing table with data in place was a royal pain in the rear. Bit fields couldn't be null, so the only way to add one to an existing table was to create a temp table with all the existing fields of the target table plus the bit field, and then copy the data over, populating the bit field with a default value. Then you had to delete the original table and rename the temp table to the original name. Throw in some foreign key relationships and you've got a long script to write.
Having said that, there were always 3rd party tools to help with the process. If the previous developer chose to use char fields in lieu of bit fields, the reason, in a nutshell, was probably laziness.
The reasons are as follows (btw, they are not good reasons):
1) Y/N can quickly become "X" (for unknown), "L" (for likely), etc. - What I mean by this is that I have personally worked with programmers who were so used to not collecting requirements correctly that they just started with Y/N as a sort of 'flag', with the superstition that it might need to expand (in which case they should have used an int as a status ID instead).
2) "Performance" - but as was mentioned above, SQL indexes are ruled out if they are not 'selective' enough... a field that only has 2 possible values will never use that index.
3) Laziness. - Sometimes developers want to output directly to some visual display with the letter "Y" or "N" for human readability, and they don't want to convert it themselves :)
Those are all 3 bad reasons that I've heard/seen before.
I can't imagine any disadvantage in not being able to index a "BIT" column, as it would be unlikely to have enough different values to help the execution of a query at all.
I also imagine that in most cases the storage difference between BIT and CHAR(1) is negligible (is that CHAR a NCHAR? does it store a 16bit, 24bit or 32bit unicode char? Do we really care?)
This is terribly common in mainframe files, COBOL, etc.
If you only have one such column in a table, it's not that terrible in practice (no real bit-wasting); after all, SQL Server will not let you write the natural WHERE BooleanColumn, so you have to write WHERE BitColumn = 1, and IF @BitFlag = 1 instead of the far more natural IF @BooleanFlag. When you have multiple bit columns, SQL Server will pack them. The case of the Y/N should only be an issue if a case-sensitive collation is used, and to stop invalid data there is always the option of a constraint.
Having said all that, my personal preference is for bits and only allowing NULLs after careful consideration.
Apparently, bit columns aren't a good idea in MySQL.
They probably were used to using Oracle and didn't properly read up on the available datatypes for SQL Server. I'm in exactly that situation myself (and the Y/N field is driving me nuts).
I've seen worse ...
One O/R mapper I had occasion to work with used 'true' and 'false' as they could be cleanly cast into Java booleans.
Also, on a reporting database such as a data warehouse, the DB is the user interface (metadata-based reporting tools notwithstanding). You might want to do this sort of thing as an aid to people developing reports. Also, an index with two values will still get used by index intersection operations on a star schema.
Sometimes such quirks are more associated with the application than the database. For example, handling booleans between PHP and MySQL is a bit hit-and-miss and makes for non-intuitive code. Using CHAR(1) fields and 'Y' and 'N' makes for much more maintainable code.
I don't have any strong feelings either way. I can't see any great benefit to doing it one way over another. I know philosophically the bit fields are better for storage. My reality is that I have very few databases that contain a lot of logical fields in a single record. If I had a lot then I would definitely want bit fields. If you only have a few I don't think it matters. I currently work with Oracle and SQL server DB's and I started with Cullinet's IDMS database (1980) where we packed all kinds of data into records and worried about bits and bytes. While I do still worry about the size of data, I long ago stopped worrying about a few bits.