I have read through some related questions, but did not find anything covering the specifics of my question.
If I have a stable application that is not going to change and has been thoroughly tested and used in the wild... one might consider removing the referential integrity / foreign key constraints from the database schema, with the aim of improving performance.
Without discussing the cons of doing this, does anyone know how much of a performance benefit one might experience? Has anyone done this and experienced noticeable performance benefits?
From my experience with Oracle:
Foreign Keys provide information to the optimizer ("you're going to find exactly one match on this join"), so removing those might result in (not so) funny things happening to your execution plans.
Foreign Keys do perform checks, which costs performance. I have seen them use up a big chunk of execution time in batch processing (hours, on jobs that run for a large part of the day), which led us to use deferred constraints.
Since dropping foreign keys changes the semantics (think of cascades, or of the application relying on not being able to remove a master entry that is still referenced by something else, at least under concurrent access), I would only consider such a step once foreign keys are proven to dominate this application's performance.
The benefits (however small) will be insignificant compared to the cons.
If performance is a problem, check the indexes. Throw more hardware at it. There is a host of techniques for improving performance.
I know you said not to mention the cons - but you should consider them. The data is a very valuable asset and ensuring its validity keeps your business going. If the data becomes invalid, you have a huge problem on your hands fixing it.
Trying to decide between performance and validity is like choosing which arm you'd rather live without. As others have pointed out, there are better ways to address performance concerns (like index optimization, hardware, query tuning). In any well-designed database system, the performance gain from reducing referential integrity should be minimal.
It will vary from application to application. So "how much" will be a relative term.
Performance benefit will come while inserting or deleting records.
So if you have big insert or delete operations that are taking time, it might help, but I would not suggest dropping the constraints even if your application is stable, because future development could run into big issues.
one might consider removing referential integrity / foreign key constraints in the database schema, with the aim to improve performance ... [does] anyone know how much of a performance benefit one might experience
You've given us no information about your database schema or how it's used, so I'll be conservative and estimate your performance benefit could be between ±∞% (give or take).
Removing foreign keys can improve performance from the point of view that they don't have to be checked.
Removing foreign keys can reduce performance from the point of view that the query plan generation can't trust them and can't take the same shortcuts it would if they were trusted. See Can you trust your constraints? for a SQL Server example.
Foreign keys have more than just performance implications (e.g. ON DELETE CASCADE). So trying to remove them to improve performance without considering exactly what functionality you are removing is naïve at best.
It is not really a fair question in the context of "only speak to the performance gains and not the drawbacks" of this decision (or likely most / all decisions). Since you can't have the pros without the cons you need to know the full extent of both in order to make a truly informed decision. And for this particular question, since there is at best only one benefit (I say "at best" since the performance gain is not as guaranteed as most people would like to believe), then we have little to discuss if we can't talk about the drawbacks (but we can at least start out with the benefit :).
BENEFITS:
Performance: removing Foreign Keys could get you a performance gain on DML statements (INSERT, UPDATE, and DELETE), but any specifics as to how much is highly dependent on the size of the tables in question, indexes, usage patterns (how often are rows updated and are any of the FK fields updatable; how often are rows inserted and/or deleted), etc. While some questions of best practice can be stated to "nearly always" have a performance gain, any effects on performance related to any change can only be determined through testing. With regards to the typical performance gains related to removing FKs, that is something you are not likely to see until you get into relatively large numbers of rows (as in Millions or more) and bulk operations.
DRAWBACKS:
Performance: most people would not expect to see that performance could be negatively impacted by removing FKs, but it could very well happen. I am not sure how all RDBMS's work, but the Query Optimizer in Microsoft SQL Server uses the existence of FKs (that are both Enabled and Trusted) to short-cut certain operations between tables. Not having properly defined FKs prohibits the Optimizer from having that added insight, sometimes resulting in slower queries.
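As a hedged illustration (the table and constraint names below are hypothetical), you can check whether SQL Server currently trusts a foreign key, and re-validate it so the optimizer can use it again:

    -- list foreign keys the optimizer will not trust (e.g. ones re-enabled WITH NOCHECK)
    SELECT name, is_disabled, is_not_trusted
    FROM sys.foreign_keys
    WHERE is_not_trusted = 1;

    -- re-check existing rows so the constraint becomes trusted again
    ALTER TABLE dbo.Orders WITH CHECK CHECK CONSTRAINT FK_Orders_Customers;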
Data Integrity: the primary responsibility of the database is to ensure data integrity. Performance is secondary (even if a very close second). You should never sacrifice the main goal for a lower-priority goal, especially since performance gains can be achieved via other methods, such as: indexes, more/faster CPU, more/faster RAM, etc. Once your data is bad, you might not be able to correct it. With this in mind:
Don't ever trust that an application is "stable" or won't change. Unless the software is obsolete and nobody has the source code to make a change, it more than likely will change.
Don't trust that you have found all of the bugs in the code yet, no matter how "thoroughly tested" you believe it is. The app might appear stable now, but who is to say that a problem won't be discovered later. If you have more than 10 lines of code in your app, it is doubtful that it is 100% bug free.
Even if the app code doesn't change, can you guarantee that no other app code will be written against the DB? If this is software that leaves your control (i.e. is NOT SaaS), can you stop anyone who has installed it from writing their own custom code to add functionality that they want that was not provided in your app? It happens. And even in SaaS companies, other departments might try writing tools against the DB (such as Support who needs to do an operation to help customers). Anyone considering removing FKs is likely to not have set up permissions / security to prevent such a thing.
Ability to fix / update: The app might be "stable" now, but companies often change direction and decisions. FKs give guidance as to the rules by which the data lives. Even if no data integrity issues are happening now, if there comes a time when the app will have new features added (or bugs fixed), not having the FKs defined will make it more likely that bugs will be introduced due to lack of "documentation" that would have been provided by the FKs.
Referential integrity constraints may [in some databases, not SQL Server] automatically create indexes on those FKs, delivering much better performance in queries that join or filter on those columns.
These indexes often help query performance, giving a potentially large boost to efficiency. They also provide additional information to the optimizer, enabling better query plans.
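For engines that do not create the index for you (SQL Server, for example), a minimal sketch with hypothetical names is simply:

    -- index the referencing column yourself so joins and FK checks stay cheap
    CREATE INDEX IX_OrderLine_OrderID ON dbo.OrderLine (OrderID);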
If performance were an issue, there are many other things (caching, prepared statements, bulk inserts) I would look at before removing referential integrity. But if you had large numbers of active indexes and were reaching serious limits on insert speed, it might be considered as a last option.
Related
I'm building a query and as I'm building it, I'm realizing that it'd be easier to write if some of the tables contained redundant fields; it'd save a few joins. However, doing so would mean that the database model would not be totally normalized.
I'm aiming for performance; will having a denormalized database impede performance? I'm using SQL Server.
Thanks.
I don't know exactly what your implementation is, but it normally helps to have redundant index references, but not redundant fields per se.
For example, say you have three tables: tbl_building, tbl_room, and tbl_equipment. (A piece of equipment belongs to a room, which belongs to a building.)
tbl_building has a buildingID, tbl_room has a roomID and a reference to buildingID. It would save you a join if your tbl_equipment had a reference to both roomID and buildingID, even though you could infer the buildingID from the roomID.
Now, it would not be good if, for example, you have the field buildingSize on tbl_building and copy that buildingSize field to tbl_room and tbl_equipment.
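A minimal sketch of those tables (column types and the extra reference are assumptions, just to make the idea concrete):

    CREATE TABLE tbl_building (
        buildingID INT PRIMARY KEY
    );

    CREATE TABLE tbl_room (
        roomID     INT PRIMARY KEY,
        buildingID INT NOT NULL REFERENCES tbl_building (buildingID)
    );

    -- the redundant buildingID reference saves a join from equipment up to building
    CREATE TABLE tbl_equipment (
        equipmentID INT PRIMARY KEY,
        roomID      INT NOT NULL REFERENCES tbl_room (roomID),
        buildingID  INT NOT NULL REFERENCES tbl_building (buildingID)
    );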
In this type of situation I often find your best option is to create an indexed view that is a denormalized version of your normalized tables. This will allow you to easily query data while not creating a maintenance nightmare.
A few things to note:
This won't work if you are using LEFT JOINs.
It will slow down INSERT/UPDATE/DELETE operations.
It will take up space (it's persisted).
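A rough sketch of such an indexed view, reusing the hypothetical building/room/equipment tables from the earlier answer (SQL Server syntax; the view and index names are made up):

    CREATE VIEW dbo.vw_EquipmentLocation
    WITH SCHEMABINDING
    AS
    SELECT e.equipmentID, r.roomID, b.buildingID
    FROM dbo.tbl_equipment AS e
    JOIN dbo.tbl_room      AS r ON r.roomID = e.roomID
    JOIN dbo.tbl_building  AS b ON b.buildingID = r.buildingID;
    GO

    -- the unique clustered index is what persists (materializes) the view
    CREATE UNIQUE CLUSTERED INDEX IX_vw_EquipmentLocation
        ON dbo.vw_EquipmentLocation (equipmentID);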
Here is an article that goes over some of the benefits of Indexed Views.
In answer to your question; having a denormalized structure will often improve performance but it will create a maintenance nightmare.
Once you know for a fact that the joins are causing performance issues, and upgrading the hardware isn't an option, then it's either time to denormalize or if dealing with certain use cases (multiple users getting the same data e.g. for a home page of a site) start caching.
To answer your question, "will having a non-normalized database impede performance?", the answer is "it depends". Normalization is a constraint. It won't improve database performance, unless your access patterns are such that a lot of data is ignored in your queries (you have smaller result sets). But non-normalization can improve performance where you have many joins (you have bigger result sets).
Normalization does not determine performance. Normalization is about correctness and preventing certain data integrity problems.
A database in Normal Form does also help reduce design bias (a biased schema means one designed to suit some types of query better than others). In that sense it should give the best chance for the database optimiser to do its work. Denormalization means adding redundancy and in many cases that also means more storage is required for the same information - potentially impacting performance.
Denormalisation typically happens after normalisation, when you have an issue, perhaps with performance.
You don't design it in up front: I can pretty much guarantee that your assumptions will be wrong, and it'll be a world of pain to deal with a denormalised schema that is used in unexpected ways.
For instance, Data modification anomalies
And, perhaps I've misunderstood this last decade and a half, but aren't database engines designed to JOIN tables efficiently?
The basic purpose of normalization is to reduce redundancy of data in your tables, which reduces wasted storage and inconsistency. As far as performance is concerned, it depends on the way your database is designed. If there is too much redundancy, then checking and searching for an element in a relation will increase search time and reduce efficiency. On the other hand, if there is less redundancy, there won't be much effect on performance. But it is always better to have a normalized schema.
I'm building a Ruby on Rails 2.3.5 app. By default, Ruby on Rails doesn't provide foreign key constraints, so I have to do it manually. I was wondering if introducing foreign keys reduces query performance on the database side enough to make it not worth doing. Performance in this case is my first priority, as I can check for data consistency with code. What is your recommendation in general? Do you recommend using foreign keys? And how do you suggest I should measure this?
Assuming:
You are already using a storage engine that supports FKs (i.e. InnoDB)
You already have indexes on the columns involved
Then I would guess that you'll get better performance by having MySQL enforce integrity. Enforcing referential integrity is, after all, something that database engines are optimized to do. Writing your own code to manage integrity in Ruby is going to be slow in comparison.
If you need to move from MyISAM to InnoDB to get the FK functionality, you need to consider the tradeoffs in performance between the two engines.
If you don't already have indices, you need to decide if you want them. Generally speaking, if you're doing more reads than writes, you want (need, even) the indices.
Stacking an FK on top of stuff that is currently indexed should cause less of an overall performance hit than implementing those kinds of checks in your application code.
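Since Rails 2.x migrations won't create the constraint for you, here is a minimal sketch of adding it straight in MySQL (table, column, and constraint names are hypothetical):

    -- make sure the referencing column is indexed (InnoDB wants this anyway)
    ALTER TABLE comments ADD INDEX idx_comments_post_id (post_id);

    -- then add the constraint itself
    ALTER TABLE comments
        ADD CONSTRAINT fk_comments_post
        FOREIGN KEY (post_id) REFERENCES posts (id);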
Generally speaking, more keys (foreign or otherwise) will reduce INSERT/UPDATE performance and increase SELECT performance.
The added benefit of data integrity is likely just about always worth the small performance decrease that comes with adding your foreign keys. What good is a fast app if the data within it is junk (missing parts, etc.)?
Found a similar query here: Does Foreign Key improve query performance?
You should define foreign keys. In general (though I do not know the specifics of MySQL), there is no effect on queries (and when there is an optimizer, like the cost-based optimizer in Oracle, it may even have a positive effect, since the optimizer can rely on the foreign key information to choose better access plans).
As for the effect on inserts and updates, there may be an impact, but the benefits that you get (referential integrity and data consistency) far outweigh the performance impact. Of course, you can design a system that will not perform at all, but the main reason will not be because you added the foreign keys. And the impact on maintaining your code when you decide to use some other language, or because the business rules have slightly changed, or because a new programmer joins your team, etc., is far more expensive than the performance impact.
My recommendation, then, is yes, go and define the foreign keys. Your end product will be more robust.
It is a good idea to use foreign keys because that assures you of data consistency ( you do not want orphan rows and other inconsistent data problems).
But at the same time, adding a foreign key does introduce some performance hit. Assuming you are using InnoDB as the storage engine, it uses a clustered index for PKs, where the data is essentially stored along with the PK. Accessing data through a secondary index requires a pass over the secondary index tree (whose nodes contain the PK) and then a second pass over the clustered index to actually fetch the data. So any DML on the parent table which involves the FK in question will require two passes over the index in the child table. Of course, the impact of the performance hit depends on the amount of data, your disk performance, and your memory constraints (data/index caching). So it is best to measure it with your target system in mind. I would say the best way to measure it is with your sample target data, or at least some representative target data for your system. Then try to run some benchmarks with and without FK constraints. Write client-side scripts which generate the same load in both cases.
Though if you are manually checking FK constraints, I would recommend that you leave it up to MySQL and let MySQL handle it.
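A minimal sketch of such a benchmark (MySQL, all names hypothetical): load identical data into two otherwise identical child tables, one with the FK and one without, and compare the timings reported by the client.

    CREATE TABLE parent (id INT PRIMARY KEY) ENGINE=InnoDB;

    CREATE TABLE child_with_fk (
        id        INT PRIMARY KEY,
        parent_id INT NOT NULL,
        KEY idx_parent (parent_id),
        FOREIGN KEY (parent_id) REFERENCES parent (id)
    ) ENGINE=InnoDB;

    CREATE TABLE child_without_fk (
        id        INT PRIMARY KEY,
        parent_id INT NOT NULL,
        KEY idx_parent (parent_id)
    ) ENGINE=InnoDB;

    -- run the same representative bulk INSERT/UPDATE/DELETE script against both
    -- tables (same row counts, same order) and compare the elapsed times.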
Two points:
1. Are you sure that checking integrity at the application level would be better in terms of performance?
2. Run your own test - testing whether FKs have a positive or negative influence on performance should be almost trivial.
I'm trying my best to persuade my boss into letting us use foreign keys in our databases - so far without luck.
He claims it costs a significant amount of performance, and says we'll just have jobs to cleanup the invalid references now and then.
Obviously this doesn't work in practice, and the database is flooded with invalid references.
Does anyone know of a comparison, benchmark or similar which proves there's no significant performance hit to using foreign keys? (Which I hope will convince him)
There is a tiny performance hit on inserts, updates and deletes because the FK has to be checked. For an individual record this would normally be so slight as to be unnoticeable unless you start having a ridiculous number of FKs associated to the table (Clearly it takes longer to check 100 other tables than 2). This is a good thing not a bad thing as databases without integrity are untrustworthy and thus useless. You should not trade integrity for speed. That performance hit is usually offset by the better ability to optimize execution plans.
We have a medium sized database with around 9 million records and FKs everywhere they should be and rarely notice a performance hit (except on one badly designed table that has well over 100 foreign keys, it is a bit slow to delete records from this as all must be checked). Almost every dba I know of who deals with large, terabyte sized databases and a true need for high performance on large data sets insists on foreign key constraints because integrity is key to any database. If the people with terabyte-sized databases can afford the very small performance hit, then so can you.
FKs are not automatically indexed and if they are not indexed this can cause performance problems.
Honestly, I'd take a copy of your database, add properly indexed FKs and show the time difference to insert, delete, update and select from those tables in comparison with the same from your database without the FKs. Show that you won't be causing a performance hit. Then show the results of queries that show orphaned records that no longer have meaning because the PK they are related to no longer exists. It is especially effective to show this for tables which contain financial information ("We have 2700 orders that we can't associate with a customer" will make management sit up and take notice).
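The orphaned-record report mentioned above is just an anti-join; with hypothetical Orders/Customers tables it might look like:

    -- orders whose customer no longer exists
    SELECT o.OrderID, o.CustomerID
    FROM Orders AS o
    LEFT JOIN Customers AS c ON c.CustomerID = o.CustomerID
    WHERE c.CustomerID IS NULL;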
From Microsoft Patterns and Practices: Chapter 14 Improving SQL Server Performance:
When primary and foreign keys are defined as constraints in the database schema, the server can use that information to create optimal execution plans.
This is more of a political issue than a technical one. If your project management doesn't see any value in maintaining the integrity of your data, you need to be on a different project.
If your boss doesn't already know or care that you have thousands of invalid references, he isn't going to start caring just because you tell him about it. I sympathize with the other posters here who are trying to urge you to do the "right thing" by fighting the good fight, but I've tried it many times before and in actual practice it doesn't work. The story of David and Goliath makes good reading, but in real life it's a losing proposition.
It is OK to be concerned about performance, but making paranoid decisions is not.
You can easily write benchmark code to show results yourself, but first you'll need to find out what performance your boss is concerned about and detail exactly those metrics.
As far as the invalid references are concerned, if you don't allow nulls on your foreign keys, you won't get invalid references. The database will raise an exception if you try to assign a foreign key value that does not exist. If you need "nulls", assign a key to be "UNDEFINED" or something like that, and make that the default key.
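A sketch of that sentinel-key approach, with made-up names (whether it fits your model is a separate question):

    CREATE TABLE Categories (
        CategoryID INT PRIMARY KEY,
        Name       VARCHAR(50) NOT NULL
    );
    INSERT INTO Categories (CategoryID, Name) VALUES (0, 'UNDEFINED');

    -- the FK column is NOT NULL and defaults to the placeholder row
    CREATE TABLE Products (
        ProductID  INT PRIMARY KEY,
        CategoryID INT NOT NULL DEFAULT 0,
        FOREIGN KEY (CategoryID) REFERENCES Categories (CategoryID)
    );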
Finally, explain database normalisation issues to your boss, because I think you will quickly find that this issue will be more of a problem than foreign key performance ever will.
Does anyone know of a comparison, benchmark or similar which proves there's no significant performance hit to using foreign keys? (Which I hope will convince him)
I think you're going about this the wrong way. Benchmarks never convince anyone.
What you should do, is first uncover the problems that result from not using foreign key constraints. Try to quantify how much work it costs to "clean out invalid references". In addition, try and gauge how many errors result in the business process because of these errors. If you can attach a dollar amount to that - even better.
Now for a benchmark - you should try and get insight into your workload, identify which type of operations are done most often. Then set up a testing environment, and replay those operations with foreign keys in place. Then compare.
Personally I would not claim right away without knowledge of the applications that are running on the database that foreign keys don't cost performance. Especially if you have cascading deletes and/or updates in combination with composite natural primary keys, then I personally would have some fear of performance issues, especially timed-out or deadlocked transactions due to side-effects of cascading operations.
But no one can tell you - you have to test it yourself, with your data, your workload, your number of concurrent users, your hardware, and your applications.
A significant factor in the cost would be the size of the index the foreign key references - if it's small and frequently used, the performance impact will be negligible; large and less frequently used indexes will have more impact. But if your foreign key is against a clustered index, it still shouldn't be a huge hit, and @Ronald Bouman is right - you need to test to be sure.
I know that this is a decade-old post.
But database fundamentals are always relevant.
I will refer to my own experience.
One of the projects I worked on had to deal with a telecommunication switch database. They had developed the database with no FKs; the reason was that they wanted inserts to be as fast as possible. Since the system itself has to deal with live calls, that made some sense.
At first there was no need for any intensive queries, and if you wanted a report you could use the switch's GUI software, which after some time could produce some basic reports.
But by the time I was involved, they wanted to develop an AI layer, to be able to create smart reports and have something like automatic troubleshooting.
It was a complete nightmare: with millions of records you couldn't execute any long query without repeatedly facing SQL Server timeouts. And don't even think about using Entity Framework.
Facing a situation like this is very different from merely describing it.
My advice is to be very deliberate in your design and to have a very good reason for not using FKs.
Does setting up proper relationships in a database help with anything other than data integrity?
Do they improve or hinder performance?
As long as you have the obvious indexes in place corresponding to the foreign keys, there should be no perceptible negative effect on performance. It's one of the more foolproof database features you have to work with.
I'd have to say that proper relationships will help people understand the data (or the intention of the data) better than omitting them, especially as the overall cost of maintaining them is quite low.
Their presence doesn't hinder performance except in terms of architecture (as others have pointed out, integrity checks will occasionally surface foreign key violations, which may have some effect), but IMHO that is outweighed by the many benefits (if used correctly).
I know you weren't asking whether to use FKs or not, but I thought I'd just add a couple of viewpoints about why to use them (and have to deal with the consequences):
There are other considerations too, such as if you ever plan to use an ORM (perhaps later on) you'll require foreign keys. They can also be very helpful for ETL/Data Import and Export and later for reporting and data warehousing.
It's also helpful if other applications will make use of the schema - since Foreign Keys implement a basic business logic. So your application (and any others) only need to be aware of the relationships (and honour them). It'll keep the data consistent and most likely reduce the number of data errors in any consuming applications.
Lastly, it gives you a pretty decent hint as to where to put indexes - since it's likely you'll lookup table data by an FK value.
It neither helps nor hurts performance in any significant way. The only hindrance is the check for integrity when inserting/updating/deleting.
Foreign keys are an important part of database design because they ensure consistency. You should use them because it offers the lowest level of protection against data screw ups that can wreck your applications. Another benefit is that database tools (visualization/analysis/code generation) use foreign keys to relate data.
Do relationships in databases improve or hinder performance?
Like any tool in your toolbox, the results you'll get depend on how you use it. Properly specified relationships and a well-designed logical database can be an enormous boon to performance -- consider the difference between searching through normalized and denormalized data, for example.
Depending on your database engine, relationships defined through foreign key constraints can benefit performance. The constraint allows the engine to make certain assumptions about the existence of data in tables on the parent side of the key.
A brief explanation for MS SQL Server can be found at http://www.microsoft.com/technet/abouttn/flash/tips/tips_122104.mspx. I don't know about other engines, but the concept would make sense in other platforms.
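For example (hypothetical tables): with a trusted foreign key from Orders.CustomerID to Customers.CustomerID and the column declared NOT NULL, SQL Server knows every order has a matching customer, so for a query that selects no Customers columns it can drop the join from the plan entirely:

    SELECT o.OrderID, o.OrderDate
    FROM Orders AS o
    JOIN Customers AS c ON c.CustomerID = o.CustomerID;
    -- no columns from Customers are used, so the join can be eliminated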
Relationships in the data exist whether you declare them or not. Declaring and enforcing the relationships via FK constraints will prevent certain kinds of errors in the data, at a small cost of checking data when inserts/updates/deletes occur.
Declaring cascading deletes via relationships helps prevent certain kinds of errors when deleting data.
Knowing the relationships helps to make flexible and correct use of the data when forming queries.
Designing the tables well can make the relationships more obvious and more useful. Using relationships in the data is the primary power behind using relational databases in the first place.
About impact on performance: In my experience with MS Access 2003, if you have a multi-user application and use Relationships to enforce a lot of referential integrity, you can take a big hit in terms of response time for the end-user.
There are different ways to take care of enforcing referential integrity. I decided to take out some rules in Relationships, build more enforcement into the front-end and live with some loss of RI. Of course in the multi-user environment, you want to be very careful with that bit of liberty.
In my experience building performance-sensitive databases, Foreign Keys hurt performance pretty significantly, since they have to be checked every time the referring record is inserted/updated or master record is deleted. If you need a proof, just look at the execution plan.
I still keep them for documentation and for tools to use but I usually disable them, especially in high-performance systems where access to DB is only through the application layer.
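In SQL Server, for instance, disabling and later re-validating a constraint looks roughly like this (the table and constraint names are hypothetical):

    -- stop checking the constraint (it stays in the schema as documentation)
    ALTER TABLE dbo.CallDetail NOCHECK CONSTRAINT FK_CallDetail_Switch;

    -- re-enable it later; WITH CHECK re-validates existing rows so it is trusted again
    ALTER TABLE dbo.CallDetail WITH CHECK CHECK CONSTRAINT FK_CallDetail_Switch;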
I am thinking about a DB Design Problem.
For example, suppose I am designing a Stack Overflow-like website where I have a list of Questions.
Each Question contains certain meta data that will probably not change.
Each Question also contains certain data that will be consistently changing (Recently Viewed Date, Total Views...etc)
Would it be better to have a main table for reading the constant meta data, keep the consistently changing values in a different table, and do a join?
OR
Would it be better to keep everything in one table?
I am not sure if this is the case, but when updating, does the row get locked?
When designing a database structure, it's best to normalize first and change for performance after you've profiled and benchmarked your queries. Normalization aims to prevent data-duplication, increase integrity and define the correct relationships between your data.
Bear in mind that performing the join comes at a cost as well, so it's hard to say if your idea would help any. Proper indexing with a normalized structure would be much more helpful.
And regarding row-level locks, that depends on the storage engine - some use row-level locking and some use table-locks.
Your initial database design should be based on conceptual and relational considerations only, completely independent of physical considerations. Database software is designed and intended to support good relational design. You will hardly ever need to relax those considerations to deal with performance. Don't even think about the costs of joins, locking, and activity type at first; put off those considerations until all other avenues have been explored.
Your rdbms is your friend, not your adversary.
You should have the two tables separated out, as you might want to record the history of the question. The main Question table is keyed by question ID; the Status table is keyed by question ID plus a date/time stamp and contains a row for each time the status changes.
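A minimal sketch of that split (column names and types are assumptions):

    CREATE TABLE Question (
        QuestionID INT PRIMARY KEY,
        Title      VARCHAR(200) NOT NULL   -- mostly static meta data lives here
    );

    CREATE TABLE QuestionStatus (
        QuestionID INT      NOT NULL,
        StatusTime DATETIME NOT NULL,      -- one row per change keeps the history
        TotalViews INT      NOT NULL,
        PRIMARY KEY (QuestionID, StatusTime),
        FOREIGN KEY (QuestionID) REFERENCES Question (QuestionID)
    );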
I don't know that the updates are really significant unless you were using pessimistic locking, where the row would be locked for a period of time.
I would look at caching your results either locally with Asp.net caching or using MemCached.
This would certainly be a bad idea if you were using Oracle. In Oracle, you can quite happily read records while other sessions are modifying them, due to its multi-version concurrency control. You would incur an extra performance penalty for the join for no savings.
A design pattern that is useful, however, is to pre-join tables, pre-calculate aggregates or pre-apply WHERE clauses using materialized views.
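A minimal Oracle sketch of that pattern (object names are made up, and a fast-refreshable view would need materialized view logs):

    CREATE MATERIALIZED VIEW mv_question_summary
    BUILD IMMEDIATE
    REFRESH FORCE ON DEMAND
    AS
    SELECT q.question_id, q.title, s.total_views
    FROM   question q
    JOIN   question_status s ON s.question_id = q.question_id;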
As already said, better to start with a clean normalized design. It's just easier to denormalize later than to go the other way around. Experience teaches that you will never normalize that one big table! You will just throw more columns in as needed, and you will need more and more indexes, and updates will go slower and slower.
You should also take a look at the expected loads: will there be more new answers or just more querying? What other operations will you have? When it comes to optimization, you can use the features of your DBMS: indexing, views, ...
Eran Galperin already provided most of my answer. In addition, the structure you propose really wouldn't help you in terms of locking. If there are relatively static and dynamic attributes in the same row, breaking the static and dynamic apart into two tables isn't of much benefit. It doesn't matter if static data is being locked, since no one is trying to change it anyway.
In fact, you may actually do worse with this design. Some database engines use page locking. If a table has fewer/smaller columns, more rows will fit on a page. The more rows there are on a page, the more likely there will be lock contention. By having the static data mixed in with the dynamic, the rows are bigger, therefore there are fewer rows in a page, and therefore fewer waits on page locks.
If you have two independent sets of dynamic attributes, and they are normally modified by different actors, then you might get some benefit by breaking them into different tables. This is a pretty unusual case, however.
I'd also point out that breaking the table into a static and dynamic portion may not be of benefit in a relatively small environment, but in a large distributed environment it may be useful to cache and replicate the dynamic data at different rates than the static data.