Is CHECK better then invalid data? - sql

I'm writing a code for a database. I have a table with log of machines activity, looking like:
CREATE TABLE Work(
id SERIAL PRIMARY KEY,
machine_ID integer NOT NULL DEFAULT 0,
start_work timestamp,
etc...
);
I know machine_ID can be in 1 to 5.
Here my question comes:
Are there any benefits of using CHECK(machine_ID >= 1 AND machine_ID <=5)?
Wouldn't it be better to accept polluted data in order to repair possible bugs later, or even to use the possibility to clean up the data?

Perhaps this is more appropriate as a comment.
But a column called machine_id should not be validated using a check constraint. Instead, you should have a table -- say machines -- that is a reference table for machines.
Your code should use a foreign key constraint. This is a special type of data validation called relational integrity.
As for your question. In my experience, it is usually better to capture errors when they go into the database. Bad data in the database usually leads to problems further down the road -- problems that could have been avoided.

That will be okey if you want to limit how may id can be in your database, but nothing else.

This is massively subjective, with different approaches and opinions, and is highly dependent on your business case.
One principle is to keep things simple, but even what that means is subjective.
Reject Early
In a pre-processing layer, apply rules to the data being received. As soon as possible reject any row/file/source and request re-submission.
This makes it clear what has been accepted, what hasn't been accepted, how to deal with such scenarios, etc.
Consume Everything
Where failures can be complicated and varied, a database can often be the best place to do analysis to understand the source, cause, magnitude and/or impact of failures. In which case ingesting everything can be beneficial.
But that can have a potentially uncontrolled impact. As such, my general experience of this is to have a separate staging area in the database. Essentially as unstructured as possible, to allow as much data in as possible. But even that's not always helpful. If you use strings for all fields, you can accept data that would normally have to be accepted. But even then, what if you have a file with 9 columns being supplied for a table of 8 columns? You could accept the whole row as a single string, but analyzing that for meaningful results will be close to impossible.
What this means in your case depends on the project you are working on, and that's what you haven't described.
My personal default position is to re-categorise certain source data failures as predicted/business-as-usual inconsistencies. You can then structure your staging area to handle these, for reporting, reconciliation, remedial action, etc. And, more importantly, explicitly build them in to all related business processes. Doing so will introduce a cost, which can be estimated against the benefit of accommodating them, with the intention of only accommodating inconsistencies where it's actually a material benefit. (Rather than just being a hoarder that keeps everything just in case it might help someone someday, maybe.)
Whatever direction that goes, however, the operational database itself (not the staging area) is heavily structured, with integrity constraints in place to prevent bugs, unexpected inputs, etc, from corrupting existing data or causing unexpected, unmonitored consequences.
If you truly believe that your operational database (be that transactional, analytical, or anything else) would benefit from being absent some/many/all of these data integrity tools, then a SQL Relational Database is probably the wrong tool for you. Instead consider the enormous number of unstructured data storage and processing platforms.

There are a bunch of different considerations.
Firstly, I prefer not to use check constraints for business rules that are likely to change - I don't want to roll out a database schema change (and a check constraint is a schema change) in response to a predictable business event. Adding or removing machines feels like a predictable business event; so as #gordonLinoff suggests, I'd recommend using a foreign key to the "machines" table.
Secondly, I don't like "hidden code" - from a development and maintenance point of view, I like to keep all the validation as intelligible as possible. Check constraints and triggers are relatively "hidden", hard to document, hard to debug, hard to test for non-trivial use cases; foreign keys, on the other hand, are explicit and developers expect them.

Related

Database Design Without Foreign Keys

After having worked at various employers I've noticed a trend of "bad" database design with some of these companies - primarily the exclusion of Foreign Keys Constraints. It has always bugged me that these transactional systems didn't have FK's, which would've promoted referential integrity.
Are there any scenarios, in transactional systems, whereby the omission of FK's would be beneficial?
Has anyone else experienced this, if so what was the outcome?
What should one do if they're presented with this scenario and their asked to maintain/enhance the system?
I cannot think of any scenario where, if two columns have a dependency, they should not have a FK constraint set up between them. Removing referential integrity may certainly speed up database operations but there's a pretty high cost to pay for that.
I have experienced such systems and the usual outcome is corrupted data, in the sense that records exists that shouldn't exist (or vice versa). These are the sort of systems where people believe they're okay because the application takes care of it, not caring that:
Every application has to take care of it, rather than one DB server.
It only takes one bug, or malignant app, to screw it up for everyone.
It is the responsibility of the database to protect itself! That is one of its best features.
As to what you should do, I simply put forward the possible things that can go wrong and how using FKs will prevent that (often with a cost/benefit analysis "skewed" toward my viewpoint, if necessary). Then let the company decide - it is their database, after all.
There is a school of thought that a well-written application does not need referential integrity. If the application does things right, the thinking goes, there's no need for constraints.
Such thinking is akin to not doing defensive programming because if you write the code correctly, you won't have bugs. While true, it simply won't happen. Not using appropriate constraints is asking for data corruption.
As for what you should do, you should encourage the company to add constraints at every opportunity. You don't want to push it to the point of getting in trouble or making a bad name for yourself, but as long as the environment is appropriate, keep pushing for it. Everyone's life will be better in the long run.
Personally, I have no problem with a database not having explicit declarations for foreign keys. But, it depends on how the database is being used.
Most of the databases that I work with are relatively static data derived from one or more transactional systems. I am not particularly concerned with rogue updates affecting the database, so an explicit definition of a foreign key relationship is not particularly important.
One thing that I do have is very consistent naming. Basically, every table has a first column called ID, which is exactly how the column is refered to in other tables (or, sometimes with a prefix, when there are multiple relationships between two entities). I also try to insist that every column in such a database has a unique name that describes the attribute (so "CustomerStartDate" is different from "ProductStartDate").
If I were dealing with data that had more "cooks in the pot", then I would want to be more explicit about the foreign key relationships. And, I then I am more willing to have the overhead of foreign key definitions.
This overhead arises in many places. When creating a new table, I may want to use use "create table as" or "select into" and not worry about the particulars of constraints. When running update or insert queries, I may not want the database overhead of checking things that I know are ok. However, I must emphasize that consistent naming greatly increases my confidence that things are ok.
Clearly, my perspective is not one of a DBA but of a practitioner. However, invalid relationships between tables are something I -- or the rest of my team -- almost never has to deal with.
As long as there's a single point of entry into the database it ultimately doesn't matter which "layer" is maintaining referential integrity. Using the "built-in layer" of foreign key constraints seems to make the most sense, but if you have a rock solid service layer responsible for the same thing then it has freedom to break the rules if necessary.
Personally I use foreign key constraints and engineer my apps so they don't have to break the rules. Relational data with guaranteed referential integrity is just easier to work with.
The performance gained is probably equivalent to the performance lost from having to maintain integrity outside of the db.
In an OLTP database, the only reason I can think of is if you care about performance more than data integrity. Enforcing a FK when row is inserted to the child table requires an index seek on the parent table and I can imagine there may be extreme situations where even this relatively quick index seek is too much. For example, some kind of very intensive logging where you can live with incorrect log entries and the application doing the writing is simple and unlikely to have bugs.
That being said, if you can live with corrupt data, you can probably live without a database in the first place.
Defensive Programming withot foreign keys works if you primarily use stored procedures and every application uses those stored procedures, instead of writing their own queries. Then you can control it quite easily and more flexible than the standard foreign keys.
One situation I can think of off the top of my head where foreign key constraints are not readily usable is a permissions module where permissions can be applied per user or per group, determined by a Boolean. So some of the records in the permissions table have a user id and others have a group id. If you still wanted foreign key constraints, you would have to have two different fields for the same mutally exclusive information and allow them to be null. Meaning adding another constraint saying that one is allowed to be null but they can't both be null, as well as a combination of 3 fields must be unique instead of a combination of 2 fields (user/group id and permission id). And the alternative is two separate tables containing the same data, meaning maintaining both tables separately.
But perhaps in that scenario, it's best to separate the data. Anything where you need the same field to connect to different tables based on other data in that record, you cannot use foreign field constraints, and it becomes best to keep the constraints in the stored procedures and views instead.

Is there a severe performance hit for using Foreign Keys in SQL Server?

I'm trying my best to persuade my boss into letting us use foreign keys in our databases - so far without luck.
He claims it costs a significant amount of performance, and says we'll just have jobs to cleanup the invalid references now and then.
Obviously this doesn't work in practice, and the database is flooded with invalid references.
Does anyone know of a comparison, benchmark or similar which proves there's no significant performance hit to using foreign keys? (Which I hope will convince him)
There is a tiny performance hit on inserts, updates and deletes because the FK has to be checked. For an individual record this would normally be so slight as to be unnoticeable unless you start having a ridiculous number of FKs associated to the table (Clearly it takes longer to check 100 other tables than 2). This is a good thing not a bad thing as databases without integrity are untrustworthy and thus useless. You should not trade integrity for speed. That performance hit is usually offset by the better ability to optimize execution plans.
We have a medium sized database with around 9 million records and FKs everywhere they should be and rarely notice a performance hit (except on one badly designed table that has well over 100 foreign keys, it is a bit slow to delete records from this as all must be checked). Almost every dba I know of who deals with large, terabyte sized databases and a true need for high performance on large data sets insists on foreign key constraints because integrity is key to any database. If the people with terabyte-sized databases can afford the very small performance hit, then so can you.
FKs are not automatically indexed and if they are not indexed this can cause performance problems.
Honestly, I'd take a copy of your database, add properly indexed FKs and show the time difference to insert, delete, update and select from those tables in comparision with the same from your database without the FKs. Show that you won't be causing a performance hit. Then show the results of queries that show orphaned records that no longer have meaning because the PK they are related to no longer exists. It is especially effective to show this for tables which contain financial information ("We have 2700 orders that we can't associate with a customer" will make management sit up and take notice).
From Microsoft Patterns and Practices: Chapter 14 Improving SQL Server Performance:
When primary and foreign keys are
defined as constraints in the database
schema, the server can use that
information to create optimal
execution plans.
This is more of a political issue than a technical one. If your project management doesn't see any value in maintaining the integrity of your data, you need to be on a different project.
If your boss doesn't already know or care that you have thousands of invalid references, he isn't going to start caring just because you tell him about it. I sympathize with the other posters here who are trying to urge you to do the "right thing" by fighting the good fight, but I've tried it many times before and in actual practice it doesn't work. The story of David and Goliath makes good reading, but in real life it's a losing proposition.
It is OK to be concerned about performance, but making paranoid decisions is not.
You can easily write benchmark code to show results yourself, but first you'll need to find out what performance your boss is concerned about and detail exactly those metrics.
As far as the invalid references ar concerned, if you don't allow nulls on your foreign keys, you won't get invalid references. The database will esception if you try to assign an invalid foreign key that does not exist. If you need "nulls", assign a key to be "UNDEFINED" or something like that, and make that the default key.
Finally, explain database normalisation issues to your boss, because I think you will quickly find that this issue will be more of a problem than foreign key performance ever will.
Does anyone know of a comparison, benchmark or similar which proves there's no significant performance hit to using foreign keys ? (Which I hope will convince him)
I think you're going about this the wrong way. Benchmarks never convince anyone.
What you should do, is first uncover the problems that result from not using foreign key constraints. Try to quantify how much work it costs to "clean out invalid references". In addition, try and gauge how many errors result in the business process because of these errors. If you can attach a dollar amount to that - even better.
Now for a benchmark - you should try and get insight into your workload, identify which type of operations are done most often. Then set up a testing environment, and replay those operations with foreign keys in place. Then compare.
Personally I would not claim right away without knowledge of the applications that are running on the database that foreign keys don't cost performance. Especially if you have cascading deletes and/or updates in combination with composite natural primary keys, then I personally would have some fear of performance issues, especially timed-out or deadlocked transactions due to side-effects of cascading operations.
But no-one can tell you- you have to test yourself, with your data, your workload, your number of concurrent users, your hardware, your applications.
A significant factor in the cost would be the size of the index the foreign key references - if it's small and frequently used, the performance impact will be negligible, large and less frequently used indexes will have more impact, but if your foreign key is against a clustered index, it still shouldn't be a huge hit, but #Ronald Bouman is right - you need to test to be sure.
i know that this is a decade post.
But database primitives are always on demand.
I will refer to my own experience.
In one of the projects that i have worked has to deal with a telecommunication switch database. They have developed a database with no FKs, the reason was because they wanted as much faster inserts they could have. Because sy system it self it have to deal with calls, it make some sense.
Before, there was no need for any intensive queries and if you wanted any report, you could use the GUI software of the switch. After some time you could have some basic reports.
But when i was involved they wanted to develop and AI thus to be able to create smart reports and have something like an automatic troubleshooting.
It was completely a nightmare, having millions of records, you couldn't execute any long query and many times facing sql server timeout. And don't even think using Entity Framework.
It is much difference when you have to face a situation like this instead of describing.
My advice is that you have to be very specific on your design and having a very good reason why not using FKs.

Storing multiple choice values in database

Say I offer user to check off languages she speaks and store it in a db. Important side note, I will not search db for any of those values, as I will have some separate search engine for search.
Now, the obvious way of storing these values is to create a table like
UserLanguages
(
UserID nvarchar(50),
LookupLanguageID int
)
but the site will be high load and we are trying to eliminate any overhead where possible, so in order to avoid joins with main member table when showing results on UI, I was thinking of storing languages for a user in the main table, having them comma separated, like "12,34,65"
Again, I don't search for them so I don't worry about having to do fulltext index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
Thanks,
Andrey
Don't.
You don't search for them now
Data is useless to anything but this one situation
No data integrity (eg no FK)
You still have to change to "English,German" etc for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...
You might not be missing anything now, but when you're requirements change you might regret that decision. You should store it normalized like your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer using the normalized data to a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.
This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or the other will definitely come back and say.. "hey, now that we store this, can you write me a report on... "
I would suggest going with a normalized design. Put it in a separate table.
Problems:
You lose join capability (obviously).
You have to reparse the list on each page load / post back. Which results in more code client side.
You lose all pretenses of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the sql going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't then you have to code deploy for each little change. bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for it's profile page?
I generally stay away at the solution you described, you asking for troubles when you store relational data in such fashion.
As alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
--and so on powers of 2
So if someone selected English and Russian the value would be 17, and you could easily query the values with Bitwise operators.
Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the necessary performance necessary for a particular problem domain - one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo - changing your data model once it is in production takes a lot more effort than when it initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).
If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.
You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceed a few extra table joins. Especially if you are using indexes for your primary and foreign keys (as you should be). It's like you are planning a multi-day cross-country trip and you are worried about 1 extra 10 minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).
Nooooooooooooooooo!!!!!!!!
As stated very well in the above few posts.
If you want a contrary view to this debate, look at wordpress. Tables are chocked full of delimited data, and it's a great, simple platform.

When is referential integrity not appropriate?

I understand the need to have referential integrity for limiting specific values on entry or possibly preventing them from removal upon a request of deletion. However, I am unclear as to a valid use case which would exclude this mechanism from always being used.
I guess this would fall into several sub-questions:
When is referential integrity not appropriate?
Is it appropriate to have fields containing multiple and/or possibly incomplete subsets of a foreign key's list?
Typically, should this be a schema structure design decision or an interface design decision? (Or possibly neither or both)
Thoughts?
When is referential integrity not appropriate?
Referential intergrity if typically not used on Data Warehouses where the data is a read only copy of a transactional datbase. Another example of when you'd not need RI is when you want to log information which includes row ids; maintaining referential integrity for a read-only log table is a waste of database overhead.
Is it appropriate to have fields containing multiple and/or possibly incomplete subsets of a foreign key's list?
Sometimes you care more about capturing data than data quality. Imagine you are aggregating a large amount of data from disparate systems which each in their own right suffer from data quality issues. Sometimes you are after the greater good of data quality and having everything in one place even with broken keys etc. represents a starting point for moving towards true data quality. It's not ideal, but it does happen as the beenfits could outweigh the tradeoffs.
Typically, should this be a schema structure design decision or an interface design decision? (Or possibly neither or both)
Everything about systems development is centered around information security, and a key element of that is data integrity. The database structure should lean towards enforcing these things when possible, however you often are not dealing with modern database systems. Sometimes your data source is an old school AS400 with long-antiquated apps. Sometimes you have to build a data and business layer which provide for data integrity.
Just my thoughts.
The only case I have heard of is if you are going to load a vast amount of data into your database; in that case, it may make sense to turn referential integrity off, as long as you know for certain that the data is valid. Once your loading/migration is complete, referential integrity should be turned back on.
There are arguments about putting data validation rules in programming code vs. the database, and I think it depends on the use cases of your software. If a single application is the only path to the database, you could put validation into the program itself and probably be alright. But if several different programs are using the database at the same time (e.g. your application and your friend's application), you'll want business rules in the database so that your data is always valid.
By 'validation rules', I am talking about rules such as 'items in cart > 0'. You may or may not want validation rules. But I think that primary/foreign keys are always important (or you could find later on that you wish you had them). I think they are required if you want to do replication at some point.
When is referential integrity not appropriate?
Sometimes when you are copying lots
of records in bulk, or restoring
data from some sort of backup, it is
convenient to temporarily turn off
the constraints of referential
integrity.
Is it appropriate to have fields containing multiple and/or possibly incomplete subsets of a foreign key's list?
Duplicating data in this way goes
against the concept of
normalization. There are are
advantages and disadvantages to this
approach.
Typically, should this be a schema structure design decision or an interface design decision? (Or possibly neither or both)
I would consider it a schema design
decision. Think about the best way
to model your problem in relational
terms. Use the database in the way it
was intended.
Referential integrity would always be appropriate if it didn't come at the cost of performance, scalability, and/or other features.
In some applications, referential integrity may be traded for something more important than the quality of the data.
Never, though a few people in the NoSQL, the multi-value, and oo-db realms will feel differently. Don't listen to them, they're wrong.
Yes. For example, if a vehicle is identified uniquely as (lotid,vin) then lotid is a foreign key to the lot table. If you want to find all pictures for a lot you can join the vehicle_pictures table right to the lot table, by using a subset of the vehicle_pictures key (lotid in (lotid,vin)). Or, am I not understanding you?
Schema, interface comes second. If the schema is bad, having a nice interface is not a long term goal.

database relationships

does setting up proper relationships in a database help with anything else other than data integrity?
do they improve or hinder performance?
As long as you have the obvious indexes in place corresponding to the foreign keys, there should be no perceptible negative effect on performance. It's one of the more foolproof database features you have to work with.
I'd have to say that proper relationships will help people to understand the data (or the intention of the data) better than if omitting them, especially as the overall cost is quite low in maintaining them.
Their presence doesn't hinder performance except in terms of architecture (as others have pointed out, data integrity will occasionally cause foreign key violations which may have some effect) but IMHO is outweighed by the many benefits (if used correctly).
I know you weren't asking whether to use FKs or not, but I thought I'd just add a couple of viewpoints about why to use them (and have to deal with the consequences):
There are other considerations too, such as if you ever plan to use an ORM (perhaps later on) you'll require foreign keys. They can also be very helpful for ETL/Data Import and Export and later for reporting and data warehousing.
It's also helpful if other applications will make use of the schema - since Foreign Keys implement a basic business logic. So your application (and any others) only need to be aware of the relationships (and honour them). It'll keep the data consistent and most likely reduce the number of data errors in any consuming applications.
Lastly, it gives you a pretty decent hint as to where to put indexes - since it's likely you'll lookup table data by an FK value.
It neither helps nor hurts performance in any significant way. The only hindrance is the check for integrity when inserting/updating/deleting.
Foreign keys are an important part of database design because they ensure consistency. You should use them because it offers the lowest level of protection against data screw ups that can wreck your applications. Another benefit is that database tools (visualization/analysis/code generation) use foreign keys to relate data.
Do relationships in databases improve or hinder performance?
Like any tool in your toolbox, the results you'll get depend on how you use it. Properly specified relationships and a well-designed logical database can be an enormous boon to performance -- consider the difference between searching through normalized and denormalized data, for example.
Depending on your database engine, relationships defined through foreign key constraints can benefit performance. The constraint allows the engine to make certain assumptions about the existence of data in tables on the parent side of the key.
A brief explanation for MS SQL Server can be found at http://www.microsoft.com/technet/abouttn/flash/tips/tips_122104.mspx. I don't know about other engines, but the concept would make sense in other platforms.
Relationships in the data exist whether you declare them or not. Declaring and enforcing the relationships via FK constraints will prevent certain kinds of errors in the data, at a small cost of checking data when inserts/updates/deletes occur.
Declaring cascading deletes via relationships helps prevent certain kinds of errors when deleting data.
Knowing the relationships helps to make flexible and correct use of the data when forming queries.
Designing the tables well can make the relationships more obvious and more useful. Using relationships in the data is the primary power behind using relational databases in the first place.
About impact on performance: In my experience with MS Access 2003, if you have a multi-user application and use Relationships to enforce a lot of referential integrity, you can take a big hit in terms of response time for the end-user.
There are different ways to take care of enforcing referential integrity. I decided to take out some rules in Relationships, build more enforcement into the front-end and live with some loss of RI. Of course in the multi-user environment, you want to be very careful with that bit of liberty.
In my experience building performance-sensitive databases, Foreign Keys hurt performance pretty significantly, since they have to be checked every time the referring record is inserted/updated or master record is deleted. If you need a proof, just look at the execution plan.
I still keep them for documentation and for tools to use but I usually disable them, especially in high-performance systems where access to DB is only through the application layer.