We have a legacy database that uses strings as primary keys. I want to implement objects on top of that legacy database to better implement some business logic and provide more functionality to the user.
I have read in places that using strings for primary keys on tables is bad. I'm wondering why that is. Is it because of case-sensitivity issues? Character sets?
... why is this particularly bad for NHibernate?
... and following up on that ... if strings do make bad primary keys, is it worth it to replace the primary keys in the database with ints or GUIDs or the like? (we only have about 25-30 tables involved)
Okay, I will have a stab at this. I will give a couple of quick caveats - I am not an expert on databases and my experience is with Hibernate (Java) rather than NHibernate, but here goes.
I think the issue with primary keys as strings comes down to the SQL data type used to represent them in the database. Because the primary key is used all the time when inserting, querying and so on, the database engine has to spend lots of time comparing primary keys. If you are using numbers, these are stored as a few bytes, which computers can compare very quickly. As soon as you start using strings, the cost of these operations (comparisons mainly) goes up significantly. Even if the database engine is using really neat strategies to compare keys, it will still always be faster to compare small fixed-size numbers than variable-length strings.
On modern hardware, though, this is becoming much less of an issue than it used to be, and with indexes the problem almost disappears.
I don't know for sure why this is really bad in Hibernate (and NHibernate), but in my experience, because my application has a complex graph of objects that often have references to other persisted objects, often as lists or sets, the references are all stored using the ID of the other object, and because of the rules I have in place for cascading saves, fetching and so on, this means that the primary keys are being used ALL the time. Hibernate - which I quite like - tends to do exactly what it's told to, and sometimes people (especially me!) tell it to do really dumb things. As a result, even seemingly simple updates or queries end up generating quite complex SQL.
So - in summary - strings as primary keys are bad due to the cost of simple operations on them, and using Hibernate may magnify this. In practice though, modern database engines have lots of neat strategies to ensure that the performance hit is not that bad. (Postgres - and presumably others - creates indexes for primary keys by default.)
For your follow-up - should you replace your keys? Well, that depends on the performance of your application. If performance is critical, then for a high-volume, very intensive application it may be a good idea; otherwise there will probably be minimal benefit, with the downside of having to spend time changing all your tables. You could expect to get much better results from refining the strategies you are using with NHibernate (i.e. fetching strategies and when you are cascading saves and so on).
Andy K seems to imply that strings are not stored as bytes. That would be funny! In fact it all depends on how long the string PK is and what collation you use. It might even be faster than a bigint or int identity and will almost certainly be faster than GUIDs. If these strings are something you'd have to search by anyway, then you would need an index (perhaps even a clustered index) on them anyway, so why not make them PKs!
Using strings or chars adds a huge amount of accidental complexity to your system. Consider these questions:
how to handle case sensitivity;
how to handle padding. NHibernate lets you insert a shorter string, and the database will silently pad it, but the padding won't be reflected in your persisted entity. Trying to fetch the entity again with the in-memory ID returns null (see the sketch after this list);
how to handle encoding issues. C# uses Unicode strings; your database might not. Can you tell how the conversion will be handled? I don't think so.
synthetic integer keys can be autogenerated by most databases without extra effort. With strings you most probably create them "by hand". Unless you hide them behind a Factory (in the DDD sense), the resulting code will clutter your domain model.
Though the performance overhead mentioned by Andy K can be diminished by indexing, you still often do ID comparisons in memory (hash maps?), where the DB optimizations do not apply.
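To make the padding point concrete, here's a minimal SQL sketch (all table and column names are invented; exact behaviour depends on your database, collation and driver):

    -- Fixed-width CHAR key: shorter values are silently padded on insert.
    CREATE TABLE Customer (
        CustomerCode CHAR(10) PRIMARY KEY,
        Name         VARCHAR(100) NOT NULL
    );

    -- 'ABC' is actually stored as 'ABC       ' (padded to 10 characters).
    INSERT INTO Customer (CustomerCode, Name) VALUES ('ABC', 'Acme');

    -- Whether this lookup matches depends on how the driver types the
    -- parameter and how the collation treats trailing spaces - exactly the
    -- kind of mismatch that makes an ORM's in-memory ID come back empty.
    SELECT * FROM Customer WHERE CustomerCode = 'ABC';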
I have been working on a project with a legacy database having string primary keys and no foreign keys at all. We are not allowed to touch the old schema because a legacy app depends on every minor aspect of it. I feel that the string primary keys hurt consistency more than the missing foreign keys, since NHibernate handles the latter quite gracefully.
As far as I know, SAP CRM and HANA both utilise GUIDs to uniquely identify records instead of using classic incremented integers. Are there best practices or clear guidelines that cover their use?
Here are some factors I've considered in favour of GUIDs:
Offline creation of objects. IIRC GUIDs are near-guaranteed to be unique in these situations so merging or integration of disparate data sets is not an issue.
Surrogate keys have distinct development advantages. While incrementing integers are a form of surrogate key, use of different number sequences can impose a functional meaning on them.
And some scenarios that favour classic keys:
Users require human-readable keys to identify records in the system. This can be handled in GUID tables by also specifying an external ID with a readable value.
Users want to use number sequences to identify different types of records, similar to sales or purchase documents. Though I actually consider this bad design.
What scenarios for custom development would make you prefer GUIDs over classic keys?
Is blanket-usage of GUIDs for all tables a good idea?
To answer the question at the end: No, it isn’t (at least not in an ABAP environment, and I doubt it’s sensible elsewhere). Using GUIDs for primary keys everywhere makes it awfully hard to maintain and follow complex foreign key relationships at runtime. Just imagine having to debug a program that handles everything using GUIDs instead of the semantic keys you’re used to. And remember that the total length of the primary key may not exceed 255, and should not exceed 120 if you want to be able to transport table entries using fully qualified keys. Using GUIDs in composite keys blows the keys up unnecessarily, and using them as synthetic keys makes using foreign key relationships virtually impossible. So no, using GUIDs everywhere is not a good idea, especially not for configuration / customizing data.
It is however a good idea to use GUIDs in almost every place where you would have used a number range object in “old-school ABAP development”. GUIDs can be generated by the application server, while number ranges require network communication to the enqueuing server. (Yes, there is some buffering involved, but generally speaking, GUIDs are a lot faster and easier to handle). So unless you need your keys to follow a certain pattern, you should consider using a GUID. Even if you need some kind of sequential number for whatever business reasons, it might be sensible to use a GUID as the primary key and store the sequential number inside an (indexed) attribute to increase flexibility at development time.
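Outside ABAP, the same pattern might look like the following generic SQL sketch (T-SQL flavoured; all names invented): a GUID primary key, with the business-facing sequential number demoted to an ordinary indexed attribute.

    CREATE TABLE SalesDocument (
        DocGuid   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
        DocNumber BIGINT           NOT NULL,  -- sequential number for business use
        Payload   NVARCHAR(MAX)    NULL
    );

    -- The sequential number stays available for queries and business rules
    -- without being baked into every foreign key.
    CREATE UNIQUE INDEX IX_SalesDocument_DocNumber ON SalesDocument (DocNumber);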
After having worked at various employers I've noticed a trend of "bad" database design with some of these companies - primarily the exclusion of Foreign Key constraints. It has always bugged me that these transactional systems didn't have FKs, which would have promoted referential integrity.
Are there any scenarios, in transactional systems, whereby the omission of FKs would be beneficial?
Has anyone else experienced this? If so, what was the outcome?
What should one do if they're presented with this scenario and they're asked to maintain/enhance the system?
I cannot think of any scenario where, if two columns have a dependency, they should not have a FK constraint set up between them. Removing referential integrity may certainly speed up database operations but there's a pretty high cost to pay for that.
I have experienced such systems and the usual outcome is corrupted data, in the sense that records exist that shouldn't (or vice versa). These are the sort of systems where people believe they're okay because the application takes care of it (see the sketch after this list), not caring that:
Every application has to take care of it, rather than one DB server.
It only takes one bug, or malignant app, to screw it up for everyone.
It is the responsibility of the database to protect itself! That is one of its best features.
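As a minimal sketch of that last point (table names invented), a single FK constraint is enough to make the database reject an orphaned row, no matter which application tries to write it:

    CREATE TABLE Customers (
        CustomerId INT PRIMARY KEY
    );

    CREATE TABLE Orders (
        OrderId    INT PRIMARY KEY,
        CustomerId INT NOT NULL,
        CONSTRAINT FK_Orders_Customers
            FOREIGN KEY (CustomerId) REFERENCES Customers (CustomerId)
    );

    INSERT INTO Customers (CustomerId) VALUES (1);
    INSERT INTO Orders (OrderId, CustomerId) VALUES (10, 1);  -- succeeds
    INSERT INTO Orders (OrderId, CustomerId) VALUES (11, 99); -- rejected: no customer 99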
As to what you should do, I simply put forward the possible things that can go wrong and how using FKs will prevent that (often with a cost/benefit analysis "skewed" toward my viewpoint, if necessary). Then let the company decide - it is their database, after all.
There is a school of thought that a well-written application does not need referential integrity. If the application does things right, the thinking goes, there's no need for constraints.
Such thinking is akin to skipping defensive programming because, if you write the code correctly, you won't have bugs. That may be true in theory, but in practice it simply doesn't happen. Not using appropriate constraints is asking for data corruption.
As for what you should do, you should encourage the company to add constraints at every opportunity. You don't want to push it to the point of getting in trouble or making a bad name for yourself, but as long as the environment is appropriate, keep pushing for it. Everyone's life will be better in the long run.
Personally, I have no problem with a database not having explicit declarations for foreign keys. But, it depends on how the database is being used.
Most of the databases that I work with are relatively static data derived from one or more transactional systems. I am not particularly concerned with rogue updates affecting the database, so an explicit definition of a foreign key relationship is not particularly important.
One thing that I do have is very consistent naming. Basically, every table has a first column called ID, which is exactly how the column is referred to in other tables (or sometimes with a prefix, when there are multiple relationships between two entities). I also try to insist that every column in such a database has a unique name that describes the attribute (so "CustomerStartDate" is different from "ProductStartDate").
If I were dealing with data that had more "cooks in the pot", then I would want to be more explicit about the foreign key relationships. And then I am more willing to accept the overhead of foreign key definitions.
This overhead arises in many places. When creating a new table, I may want to use "create table as" or "select into" and not worry about the particulars of constraints. When running update or insert queries, I may not want the database overhead of checking things that I know are ok. However, I must emphasize that consistent naming greatly increases my confidence that things are ok.
Clearly, my perspective is not one of a DBA but of a practitioner. However, invalid relationships between tables are something that I -- or the rest of my team -- almost never have to deal with.
As long as there's a single point of entry into the database it ultimately doesn't matter which "layer" is maintaining referential integrity. Using the "built-in layer" of foreign key constraints seems to make the most sense, but if you have a rock solid service layer responsible for the same thing then it has freedom to break the rules if necessary.
Personally I use foreign key constraints and engineer my apps so they don't have to break the rules. Relational data with guaranteed referential integrity is just easier to work with.
The performance gained is probably equivalent to the performance lost from having to maintain integrity outside of the db.
In an OLTP database, the only reason I can think of is if you care about performance more than data integrity. Enforcing a FK when a row is inserted into the child table requires an index seek on the parent table, and I can imagine there may be extreme situations where even this relatively quick index seek is too much. For example, some kind of very intensive logging where you can live with incorrect log entries and the application doing the writing is simple and unlikely to have bugs.
That being said, if you can live with corrupt data, you can probably live without a database in the first place.
Defensive programming without foreign keys works if you primarily use stored procedures and every application uses those stored procedures instead of writing its own queries. Then you can control it quite easily and more flexibly than with standard foreign keys.
One situation I can think of off the top of my head where foreign key constraints are not readily usable is a permissions module where permissions can be applied per user or per group, determined by a Boolean. Some of the records in the permissions table have a user ID and others have a group ID. If you still wanted foreign key constraints, you would need two different fields for the same mutually exclusive information and have to allow both to be null - meaning an extra constraint saying that one may be null but not both, and a combination of three fields that must be unique instead of two (user/group ID and permission ID). The alternative is two separate tables containing the same kind of data, which means maintaining both separately.
But perhaps in that scenario, it's best to separate the data. Anywhere you need the same field to connect to different tables based on other data in that record, you cannot use foreign key constraints, and it becomes best to keep the constraints in the stored procedures and views instead.
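For what it's worth, here is a sketch of keeping the constraints in that scenario (all names invented): two nullable FK columns plus a CHECK that exactly one of them is set.

    CREATE TABLE AppUser    (UserId       INT PRIMARY KEY);
    CREATE TABLE AppGroup   (GroupId      INT PRIMARY KEY);
    CREATE TABLE Permission (PermissionId INT PRIMARY KEY);

    CREATE TABLE PermissionAssignment (
        PermissionId INT NOT NULL REFERENCES Permission (PermissionId),
        UserId       INT NULL     REFERENCES AppUser (UserId),
        GroupId      INT NULL     REFERENCES AppGroup (GroupId),
        -- exactly one of UserId / GroupId must be set
        CHECK ((UserId IS NOT NULL AND GroupId IS NULL)
            OR (UserId IS NULL AND GroupId IS NOT NULL)),
        -- the three-column uniqueness described above
        UNIQUE (PermissionId, UserId, GroupId)
    );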
I'm trying my best to persuade my boss into letting us use foreign keys in our databases - so far without luck.
He claims it costs a significant amount of performance, and says we'll just have jobs to cleanup the invalid references now and then.
Obviously this doesn't work in practice, and the database is flooded with invalid references.
Does anyone know of a comparison, benchmark or similar which proves there's no significant performance hit to using foreign keys? (Which I hope will convince him)
There is a tiny performance hit on inserts, updates and deletes because the FK has to be checked. For an individual record this would normally be so slight as to be unnoticeable unless you start having a ridiculous number of FKs associated with the table (clearly it takes longer to check 100 other tables than 2). This is a good thing, not a bad thing, as databases without integrity are untrustworthy and thus useless. You should not trade integrity for speed. That performance hit is usually offset by the better ability to optimize execution plans.
We have a medium-sized database with around 9 million records and FKs everywhere they should be, and we rarely notice a performance hit (except on one badly designed table that has well over 100 foreign keys; it is a bit slow to delete records from it, as all must be checked). Almost every DBA I know of who deals with large, terabyte-sized databases and a true need for high performance on large data sets insists on foreign key constraints, because integrity is key to any database. If the people with terabyte-sized databases can afford the very small performance hit, then so can you.
FKs are not automatically indexed and if they are not indexed this can cause performance problems.
Honestly, I'd take a copy of your database, add properly indexed FKs and show the time difference to insert, delete, update and select from those tables in comparison with the same from your database without the FKs. Show that you won't be causing a performance hit. Then show the results of queries that show orphaned records that no longer have meaning because the PK they are related to no longer exists. It is especially effective to show this for tables which contain financial information ("We have 2700 orders that we can't associate with a customer" will make management sit up and take notice).
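Something like this covers both demonstrations (invented names) - index the FK column yourself, since most engines don't do it automatically, then count the orphans:

    -- FKs are not indexed automatically in most engines; add the index yourself.
    CREATE INDEX IX_Orders_CustomerId ON Orders (CustomerId);

    -- Count child rows whose parent no longer exists.
    SELECT COUNT(*) AS OrphanedOrders
    FROM Orders o
    LEFT JOIN Customers c ON c.CustomerId = o.CustomerId
    WHERE c.CustomerId IS NULL;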
From Microsoft Patterns and Practices: Chapter 14 Improving SQL Server Performance:
When primary and foreign keys are defined as constraints in the database schema, the server can use that information to create optimal execution plans.
This is more of a political issue than a technical one. If your project management doesn't see any value in maintaining the integrity of your data, you need to be on a different project.
If your boss doesn't already know or care that you have thousands of invalid references, he isn't going to start caring just because you tell him about it. I sympathize with the other posters here who are trying to urge you to do the "right thing" by fighting the good fight, but I've tried it many times before and in actual practice it doesn't work. The story of David and Goliath makes good reading, but in real life it's a losing proposition.
It is OK to be concerned about performance, but making paranoid decisions is not.
You can easily write benchmark code to show results yourself, but first you'll need to find out what performance your boss is concerned about and detail exactly those metrics.
As far as the invalid references are concerned, if you don't allow nulls on your foreign keys, you won't get invalid references. The database will throw an exception if you try to assign an invalid foreign key that does not exist. If you need "nulls", assign a key to be "UNDEFINED" or something like that, and make that the default key.
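A sketch of that sentinel idea (T-SQL flavoured, invented names): keep the FK NOT NULL and point unassigned rows at a well-known placeholder row.

    -- A well-known placeholder row instead of NULL.
    INSERT INTO Customers (CustomerId, Name) VALUES (0, 'UNDEFINED');

    -- New rows default to the placeholder; the FK still rejects anything
    -- that doesn't actually exist in Customers.
    ALTER TABLE Orders
        ADD CONSTRAINT DF_Orders_CustomerId DEFAULT 0 FOR CustomerId;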
Finally, explain database normalisation issues to your boss, because I think you will quickly find that this issue will be more of a problem than foreign key performance ever will.
Does anyone know of a comparison, benchmark or similar which proves there's no significant performance hit to using foreign keys? (Which I hope will convince him)
I think you're going about this the wrong way. Benchmarks never convince anyone.
What you should do is first uncover the problems that result from not using foreign key constraints. Try to quantify how much work it costs to "clean out invalid references". In addition, try to gauge how many errors these invalid references cause in the business process. If you can attach a dollar amount to that - even better.
Now for a benchmark - you should try and get insight into your workload, identify which type of operations are done most often. Then set up a testing environment, and replay those operations with foreign keys in place. Then compare.
Personally I would not claim right away without knowledge of the applications that are running on the database that foreign keys don't cost performance. Especially if you have cascading deletes and/or updates in combination with composite natural primary keys, then I personally would have some fear of performance issues, especially timed-out or deadlocked transactions due to side-effects of cascading operations.
But no-one can tell you- you have to test yourself, with your data, your workload, your number of concurrent users, your hardware, your applications.
A significant factor in the cost would be the size of the index the foreign key references - if it's small and frequently used, the performance impact will be negligible; large and less frequently used indexes will have more impact. But if your foreign key is against a clustered index, it still shouldn't be a huge hit. Ronald Bouman is right, though - you need to test to be sure.
I know this is a decade-old post, but database fundamentals are always in demand, so I will refer to my own experience.
One of the projects I worked on dealt with a telecommunication switch database. It had been developed with no FKs because they wanted inserts to be as fast as possible; since the system itself had to handle live calls, that made some sense.
At first there was no need for any intensive queries, and if you wanted a report you could use the switch's GUI software. After some time, some basic reports became available.
But by the time I was involved, they wanted to develop an AI that could produce smart reports and provide something like automatic troubleshooting.
It was a complete nightmare: with millions of records you couldn't execute any long-running query without hitting SQL Server timeouts. And don't even think about using Entity Framework.
Facing a situation like this yourself is very different from reading a description of it. My advice is to be very deliberate in your design and to have a very good reason before leaving out FKs.
The reason I want to use a GUID is that, in the event that I have to split the database in two, I won't have primary keys that overlap across both databases. If I use a GUID, there won't be any overlap. I also want to use the GUID in the URL, so the GUID will need to be indexed.
I will be using ASP.NET C# as my web server.
Postgres has a UUID type. MySQL has a UUID function. Oracle has a SYS_GUID function.
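For illustration, hedged sketches of each (exact syntax varies by version; for example gen_random_uuid() is built into PostgreSQL 13+, while older versions need the pgcrypto extension):

    -- PostgreSQL: native UUID column type
    CREATE TABLE item (
        id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        name TEXT NOT NULL
    );

    -- MySQL: no UUID column type; commonly stored as CHAR(36) or BINARY(16)
    CREATE TABLE item (
        id   CHAR(36) PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    );
    INSERT INTO item (id, name) VALUES (UUID(), 'example');

    -- Oracle: SYS_GUID() returns a RAW(16) value
    CREATE TABLE item (
        id   RAW(16) DEFAULT SYS_GUID() PRIMARY KEY,
        name VARCHAR2(100) NOT NULL
    );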
As others have said, you can use GUIDs/UUIDs in pretty much any modern DB. The algorithm for generating a GUID is pretty straightforward and you can be reasonably sure that you won't get dupes; however, there are some considerations.
+) Although GUIDs are generally representations of 128-bit values, the actual format used differs from implementation to implementation - you may want to consider normalizing them by removing non-significant characters (usually dashes or spaces).
+) To absolutely ensure uniqueness you can also append a value to the GUID. For example, if you're worried about MS and Oracle GUIDs colliding, add "MS" to the former and "Or" to the latter - now even if the GUIDs themselves do collide, the keys won't.
As others have mentioned however there is a potentially severe price to pay here: your keys will be large (128 bits) and won't index very well (although this is somewhat dependent on the implementation).
The technique works very well for small databases (especially those where the entire dataset can fit in memory), but as DBs grow you'll definitely have to accept a performance trade-off.
One thing you might consider is a hybrid approach. Without more information it's hard to really know what you're trying to do so these might not help:
1) Remember that primary keys don't have to be a single column - you can have a simple numeric key to identify your rows and another column, containing a single value, that identifies the database that hosts the data or created the key. Creating the primary key as an aggregate of both columns lets the index work with simpler values and should be significantly faster (see the sketch after these options).
2) You can "fake it" by constructing the key as a concatenated field (as in the above idea to append a DB identifier to the key). So your key would be a simple number followed by some DB identifier (perhaps a guid for each DB).
Indexing such a value (since the values would still be sequential) should be much faster.
In both cases you'll have some manual work to do if you ever do split the DB(s) - you'll have to update some keys with a new DB ID, but this would be a one-time, infrequent event. In exchange you can tune your DB much better.
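A sketch of idea (1) with invented names - the composite key keeps each half small and simple:

    CREATE TABLE Customer (
        LocalId  BIGINT   NOT NULL,       -- sequential within one database
        SourceDb SMALLINT NOT NULL,       -- which database created the row
        Name     VARCHAR(100) NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY (LocalId, SourceDb)
    );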
There are definitely other ways to ensure data integrity across multiple databases. Many enterprise DBMSs have tools built in for clustering data across multiple servers or databases, some have special tools or design patterns that make it easier, etc.
In short I would say that guids are nice and simple and do what you want, but that you should only consider them if either a) the dataset is small or b) the DBMS has specific features to optimize their use as keys (for example sequential guids). If the datasets are going to be very large or if you're trying to limit DBMS-specific dependencies I would play around more with optimizing a "key + identifier" strategy.
Most any RDBMS you will use can take any number and type of columns as a PK. So, if you're storing the GUID as a CHAR(n) for some length n, you should be fine. Now, I'm not sure if this is advisable, as I'm guessing indexing on CHARs is not as efficient as on integers.
Hope that helps.
I suppose you could store a GUID as an int128 as well.
Both MySQL and Postgres are known to support GUID datatypes (I believe it's called UUID, but it's the same thing).
Unless I have completely lost my memory, a properly designed 3rd+ normal form database schema does not rely on unique ints, or by extension GUIDs or UUIDs for primary keys. Nor does it use intermediate lookup tables of ints/GUIDS/UUIDS to relate the tables containing the data.
You should grind your schema until it expresses the relations amongst tables of data in terms of the data in the tables, not auto-generated identifiers that have no intrinsic relationship to the data.
I freely grant that you may just possibly be doing something that really really requires GUIDs (or auto-increment integers) for primary keys. But I seriously doubt that is the case - it almost never is.
You can implement your own membership provider based on whatever database schema you choose to design. It's nowhere near as tricky as it may look at first.
Google "roll your own membership provider" for plenty of pointers.
In my theoretical little world, you'd be able to do this with SQLite. You'd generate the Guid from .Net and write it to the SQLite database as a string. You could also index that field.
You do lose some of the index benefits because it'd be stored as a string, but it should be fully backwards compatible so that you could import/export to/from SQL Server.
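Roughly like this in SQLite (names invented) - note that SQLite backs a TEXT primary key with an automatic unique index, so lookups by the Guid string are already indexed:

    CREATE TABLE item (
        guid TEXT PRIMARY KEY,  -- e.g. '3f2504e0-4f89-11d3-9a0c-0305e82c3301'
        name TEXT NOT NULL
    );

    INSERT INTO item (guid, name)
    VALUES ('3f2504e0-4f89-11d3-9a0c-0305e82c3301', 'example');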
From looking through the comments it looks like you are trying to use a different database to MS SQL with the ASP.NET membership provider - as others have mentioned you could roll your own provider to use a different DB, but a quick Google search turned up a few ready-made options:
MySQL Provider
MySQL Provider 2
SQLite Provider
Hope these help
If you are using other MS technologies already, you should consider SQL Server Express.
http://www.microsoft.com/express/sql/default.aspx
It is a real implementation of MS SQL Server and it is free. It does have significant limitations, as you might imagine, but if your product can fit inside those you get the support, developer community and stability of SQL Server and a clear upgrade path if you need to grow.
Does setting up proper relationships in a database help with anything other than data integrity?
Do they improve or hinder performance?
As long as you have the obvious indexes in place corresponding to the foreign keys, there should be no perceptible negative effect on performance. It's one of the more foolproof database features you have to work with.
I'd have to say that proper relationships will help people understand the data (or the intention of the data) better than omitting them, especially as the overall cost of maintaining them is quite low.
Their presence doesn't hinder performance except in terms of architecture (as others have pointed out, data integrity will occasionally cause foreign key violations which may have some effect) but IMHO is outweighed by the many benefits (if used correctly).
I know you weren't asking whether to use FKs or not, but I thought I'd just add a couple of viewpoints about why to use them (and have to deal with the consequences):
There are other considerations too, such as if you ever plan to use an ORM (perhaps later on) you'll require foreign keys. They can also be very helpful for ETL/Data Import and Export and later for reporting and data warehousing.
It's also helpful if other applications will make use of the schema - since Foreign Keys implement a basic business logic. So your application (and any others) only need to be aware of the relationships (and honour them). It'll keep the data consistent and most likely reduce the number of data errors in any consuming applications.
Lastly, it gives you a pretty decent hint as to where to put indexes - since it's likely you'll look up table data by an FK value.
It neither helps nor hurts performance in any significant way. The only hindrance is the check for integrity when inserting/updating/deleting.
Foreign keys are an important part of database design because they ensure consistency. You should use them because it offers the lowest level of protection against data screw ups that can wreck your applications. Another benefit is that database tools (visualization/analysis/code generation) use foreign keys to relate data.
Do relationships in databases improve or hinder performance?
Like any tool in your toolbox, the results you'll get depend on how you use it. Properly specified relationships and a well-designed logical database can be an enormous boon to performance -- consider the difference between searching through normalized and denormalized data, for example.
Depending on your database engine, relationships defined through foreign key constraints can benefit performance. The constraint allows the engine to make certain assumptions about the existence of data in tables on the parent side of the key.
A brief explanation for MS SQL Server can be found at http://www.microsoft.com/technet/abouttn/flash/tips/tips_122104.mspx. I don't know about other engines, but the concept would make sense in other platforms.
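As a concrete illustration of that benefit (invented schema): with a trusted, NOT NULL foreign key from Orders.CustomerId to Customers, SQL Server can eliminate the join below entirely, because the constraint already guarantees a matching parent row.

    -- No Customers columns are selected, so with the FK in place (and trusted)
    -- the optimizer may never touch the Customers table at all.
    SELECT o.OrderId, o.OrderDate
    FROM Orders o
    INNER JOIN Customers c ON c.CustomerId = o.CustomerId;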
Relationships in the data exist whether you declare them or not. Declaring and enforcing the relationships via FK constraints will prevent certain kinds of errors in the data, at a small cost of checking data when inserts/updates/deletes occur.
Declaring cascading deletes via relationships helps prevent certain kinds of errors when deleting data.
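For instance (invented names), a declared cascade removes the children along with the parent, so no orphans can appear:

    CREATE TABLE Orders (
        OrderId    INT PRIMARY KEY,
        CustomerId INT NOT NULL,
        CONSTRAINT FK_Orders_Customers
            FOREIGN KEY (CustomerId) REFERENCES Customers (CustomerId)
            ON DELETE CASCADE
    );

    DELETE FROM Customers WHERE CustomerId = 42;  -- matching Orders rows are removed too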
Knowing the relationships helps to make flexible and correct use of the data when forming queries.
Designing the tables well can make the relationships more obvious and more useful. Using relationships in the data is the primary power behind using relational databases in the first place.
About impact on performance: In my experience with MS Access 2003, if you have a multi-user application and use Relationships to enforce a lot of referential integrity, you can take a big hit in terms of response time for the end-user.
There are different ways to take care of enforcing referential integrity. I decided to take out some rules in Relationships, build more enforcement into the front-end and live with some loss of RI. Of course in the multi-user environment, you want to be very careful with that bit of liberty.
In my experience building performance-sensitive databases, foreign keys hurt performance pretty significantly, since they have to be checked every time the referring record is inserted/updated or the master record is deleted. If you need proof, just look at the execution plan.
I still keep them for documentation and for tools to use but I usually disable them, especially in high-performance systems where access to DB is only through the application layer.