ER inheritance modeling - sql

A supply farm can have a transportation document. If present it can be one of two types: internal or external. Both documents share some common data, but have different specialized fields.
I thought of modeling this in an OO-ish fashion like this:
[diagram: http://www.arsmaior.com/tmp/mod1.png]
In the document table, one of the two doc_*_id is null, the other is the foreign key with the corresponding table.
That is opposed to the other schema where the common data is redundant:
[diagram: http://www.arsmaior.com/tmp/mod2.png]
I'm trying to work out the pros and cons of both approaches.
How do I SELECT all the internal docs in each case? We have a pair of mutually exclusive foreign keys, so the JOINs are not trivial.
Is the first approach completely junky?

Classical ER modeling doesn't include foreign keys, and the gist of your question revolves around how the foreign keys are going to work. I think that what you are really doing is relational modeling, even though you are using ER diagrams.
In terms of relational modeling, there is a third way to model inheritance. That is to use the same ID for the specialized tables as is used for the generalized table. Then the ID field of the doc_internal table is both the primary key for the doc_internal table and also a foreign key referencing the supply_farm table. Ditto for the doc_external table.
The ID field in the supply_farm table is both the primary key of the supply_farm table and also a foreign key that references either the doc_internal or the doc_external table, depending. The joins magically get the right data together.
It takes a little programming to set this up, but it's well worth it.
For more details I suggest you google "generalization specialization relational modeling". There are some excellent articles on this subject out there on the web.
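To make the shared-ID idea a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column names (description, internal_field, external_field) and the sample rows are my own assumptions, and only the foreign key from each specialized table back to supply_farm is declared. Nothing here stops the same ID from appearing in both doc_internal and doc_external; that is part of the "little programming" mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE supply_farm (
    id          INTEGER PRIMARY KEY,
    description TEXT                     -- data common to both document types
);
CREATE TABLE doc_internal (
    -- same ID as the generalized row: primary key and foreign key at once
    id             INTEGER PRIMARY KEY REFERENCES supply_farm(id),
    internal_field TEXT                  -- hypothetical specialized column
);
CREATE TABLE doc_external (
    id             INTEGER PRIMARY KEY REFERENCES supply_farm(id),
    external_field TEXT                  -- hypothetical specialized column
);
INSERT INTO supply_farm VALUES (1, 'first'), (2, 'second'), (3, 'no document');
INSERT INTO doc_internal VALUES (1, 'warehouse A');
INSERT INTO doc_external VALUES (2, 'carrier X');
""")

# All internal docs: a plain inner join on the shared ID.
internal_docs = conn.execute("""
    SELECT s.id, s.description, i.internal_field
    FROM supply_farm AS s
    JOIN doc_internal AS i ON i.id = s.id
    ORDER BY s.id
""").fetchall()
print(internal_docs)
```

With this layout, selecting the internal docs needs no NULL checks on mutually exclusive columns; the inner join simply drops rows that have no matching doc_internal row.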

Both approaches are correct and their usage will totally depend on the use cases, the kind and volume of data you want to store and the type of queries you want to mostly fire. You can also think of combining these two strategies when the inheritance hierarchies are complex.
One use case where the first approach would be preferred I think is when you want to search through all the documents, for example, based on description or any common field.
This document (although specific to Hibernate) can provide a little more insight into different inheritance modelling strategies.

If I have understood this correctly, then a supply farm corresponds to either 0 or 1 documents, and that document is always either internal or external (never both).
If so, then why not just use a single table, like so:
**SUPPLY_FARM_DOC**
ID Int (PK)
DOC_ID Int
INTERNAL_FLAG Boolean
DESCRIPTION Varchar(40)
SOME_DATA Varchar(40)
OTHER_DATA Varchar(40)
etc.
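A rough sketch of that single-table layout, run through Python's sqlite3 module. SQLite has no Boolean type, so the flag is stored as 0/1 here; which column belongs to which document type, and the sample rows, are assumptions made only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supply_farm_doc (
    id            INTEGER PRIMARY KEY,
    doc_id        INTEGER,
    internal_flag INTEGER CHECK (internal_flag IN (0, 1)),  -- NULL = no document
    description   TEXT,
    some_data     TEXT,   -- assumed to be filled only for internal documents
    other_data    TEXT    -- assumed to be filled only for external documents
);
INSERT INTO supply_farm_doc VALUES
    (1, 100,  1,    'internal transfer',       'dock 4', NULL),
    (2, 101,  0,    'external shipment',       NULL,     'carrier X'),
    (3, NULL, NULL, 'farm without a document', NULL,     NULL);
""")

# All internal docs: a simple filter, no joins needed.
internal_ids = [row[0] for row in conn.execute(
    "SELECT id FROM supply_farm_doc WHERE internal_flag = 1 ORDER BY id")]
print(internal_ids)
```

The trade-off is that every row carries the columns of both document types, and unused ones stay NULL.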

Related

<select> for an entity with composite keys - strategy needed

So say I have database table tours (PK tour_id) holding region independent information and tours_regional_details (PK tour_id, region_id) holding region specific information.
Let's say I want to populate select control with entities from tours_regional_details table (my real scenarios are bit different, just imagine this for the sake of simplicity).
So, how would you tackle this? My gut says to concatenate the PKs into delimited strings, like "pk1|pk2" or "pk1,pk2", and use that as the value of the select control. While it works, it feels dirty and possibly needs additional validation steps before splitting the string again, which again feels dirty.
I don't want to start a composite vs single PK holy war, but might this be a bad database design decision on my part? I always believed identifying relationships and composite keys are there for a reason, but I feel tempted to alter my tables and just stuff them with auto-incremental IDs and unique constraints. I'm just not sure what kind of fresh hell that will introduce.
I am a little bit flabbergasted that I encounter this for the first time now after so many years.
EDIT: Yes, there is a table regions (PK region_id) but is mostly irrelevant for the topic. While in some scenarios two select boxes would make sense, let's say here they don't, let's say I want only one select box and want to select from:
Dummy tour (Region 1)
Dummy tour (Region 2)
Another dummy tour (region 3)
...
Composite primary keys aren't bad database design. In an ideal world, our programming languages and UI libraries would support tuples and relations as first-class values, so you'd be able to assign a pair of values as the value of an option in your dropdown control. However, since they generally only support scalar variables, we're stuck trying to encode or reduce our identifiers.
You can certainly add surrogate keys / autoincrement columns (and unique constraints on the natural keys where available) to every table. It's a very common pattern, most databases I've seen have at least some tables set up like this. You may be able to keep existing composite foreign keys as is, or you may want/need to change them to reference the surrogate primary keys instead.
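As an illustration of that pattern, here is a small sketch in Python with sqlite3. The surrogate column name tours_regional_detail_id and the name column on tours are assumptions; the point is only that each row gets one scalar value that can serve as the option value, while the unique constraint keeps (tour_id, region_id) honest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tours (tour_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tours_regional_details (
    tours_regional_detail_id INTEGER PRIMARY KEY,  -- surrogate key
    tour_id   INTEGER NOT NULL REFERENCES tours(tour_id),
    region_id INTEGER NOT NULL,
    UNIQUE (tour_id, region_id)                     -- natural key stays unique
);
INSERT INTO tours VALUES (1, 'Dummy tour'), (2, 'Another dummy tour');
INSERT INTO tours_regional_details (tour_id, region_id)
    VALUES (1, 1), (1, 2), (2, 3);
""")

# One scalar value per option, plus the label shown to the user.
options = conn.execute("""
    SELECT d.tours_regional_detail_id,
           t.name || ' (Region ' || d.region_id || ')'
    FROM tours_regional_details AS d
    JOIN tours AS t ON t.tour_id = d.tour_id
    ORDER BY d.tours_regional_detail_id
""").fetchall()
for value, label in options:
    print(value, label)
```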
The risk with using surrogate keys for foreign keys is that your access paths in the database become fixed. For example, let's assume tours_regional_details had a primary key tours_regional_detail_id that's referenced by a foreign key in another table. Queries against this other table would always need to join with tours_regional_details to obtain the tour_id or region_id. Natural keys allow more flexible access paths since identifiers are reused throughout the database. This becomes significant in deep hierarchies of dependent concepts. These are exactly the scenarios where opponents of composite keys complain about the "explosion" of keys, and I can at least agree that it becomes cumbersome to remember and type out joins on numerous columns when writing queries.
You could duplicate the natural key columns into the referencing tables, but storing redundant information requires additional effort to maintain consistency. I often see this done for performance or convenience reasons where surrogate keys were used as foreign keys, since it allows querying a table without having to do all the joins to dereference the surrogate identifiers. In these cases, it might've been better to reference the natural key instead.
If I'm allowed to return to my ideal world, perhaps DBMSs could allow naming and storing joins.
In practice, surrogate keys help balance the complexity we have to deal with. Use them, but don't worship them.

Database design for many-to-many relations with restrictions

I have one database with users and one with questions. What I want is to ensure that every user can answer every question only once.
I thought of a database that has all the question id's as columns and all the user id's as records, but this gets very big (and slow I guess) when the questions and the user count grow.
Is there another way to do this with better performance?
You probably want a setup like this.
Questions table (QuestionID Primary Key, QuestionText)
Users table (UserID Primary Key, Username)
Answers table (QuestionID, UserID, Date) -- plus AnswerText/Score/Etc as needed.
In the Answers table the first two columns together form a compound primary key (QuestionID, UserID), and they are foreign keys to Questions(QuestionID) and Users(UserID) respectively.
The compound primary key ensures that each combination of QuestionID/UserID is only allowed once. If you want to allow users to answer the same question multiple times you could extend the compound primary key to include the date (it would then be a composite key).
This is a normalized design and should be efficient enough. It's common to use a surrogate primary key (like AnswerID) instead of the compound key and use a unique constraint instead to ensure uniqueness - the use of a surrogate key is often motivated by ease of use, but it's by no means necessary.
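Here is a minimal sketch of that setup using Python's sqlite3 module, just to show the compound key rejecting a second answer. The AnswerText column and the sample data are assumptions added for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Questions (QuestionID INTEGER PRIMARY KEY, QuestionText TEXT);
CREATE TABLE Users     (UserID INTEGER PRIMARY KEY, Username TEXT);
CREATE TABLE Answers (
    QuestionID INTEGER NOT NULL REFERENCES Questions(QuestionID),
    UserID     INTEGER NOT NULL REFERENCES Users(UserID),
    Date       TEXT,
    AnswerText TEXT,
    PRIMARY KEY (QuestionID, UserID)   -- one answer per user per question
);
INSERT INTO Questions VALUES (1, 'Favourite colour?');
INSERT INTO Users VALUES (1, 'alice');
INSERT INTO Answers VALUES (1, 1, '2024-01-01', 'blue');
""")

# A second answer by the same user to the same question violates the key.
try:
    conn.execute("INSERT INTO Answers VALUES (1, 1, '2024-01-02', 'green')")
    second_answer_accepted = True
except sqlite3.IntegrityError:
    second_answer_accepted = False

answer_count = conn.execute("SELECT COUNT(*) FROM Answers").fetchone()[0]
print(second_answer_accepted, answer_count)
```

If you go with a surrogate AnswerID instead, a UNIQUE (QuestionID, UserID) constraint would play the same role.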
Diagram
Below is a diagram of my own table design, quite similar to the correct Answer by jpw. I made up a few column names to give more of a flavor of the nature of the table. I used Postgres data types.
As the last paragraph of that Answer discusses, I would go with a simple single primary key on the response_ ("Answers") table rather than a compound primary key combining fkey_user_ & fkey_question_.
Unrealistic
This diagram fits the problem description in the Question. However this design is not practicable. This scenario is for a single set of questions to be put to the user, only a single survey or quiz ever. In real life in a situation like a school, opinion survey, or focus group, I expect we would put more than one questionnaire to a user. But I will ignore that to directly address the Question as worded.
Also in some scenarios we might have versions of a question, as it is tweaked and revised over time when given on successive quizzes/questionnaires.
Performance
Your Question correctly identifies this problem as a Many-To-Many relationship between a user and a question, where each user can answer many questions and each question may be answered by many users. In relational database design there is only one proper way to represent a many-to-many. That way is to add a third child table, sometimes called a "bridge table", with a foreign key linking to each of the two parent tables.
In a diagram where you draw parent tables vertically higher up the page than child tables, I personally see such a many-to-many diagram as a butterfly or bird pattern where the child bridge table is the body/thorax and the two parents are wings.
Performance is irrelevant in a sense, as this is the only correct design. Fortunately, modern relational databases are optimized for such situations. You should see good performance for many millions of records, especially if you use a sequential number as your primary key values. I tend to use the UUID data type instead; their arbitrary bit values may have less efficient index performance when table size reaches the millions (but I don't know the details).

Designing Tables Sql Server

Good Morning,
In designing a database, I have a table (let's call it TabA) that could have relationships with four other tables. That is, TabA can be linked to the first of the four, to the second, to the third, and to the fourth, or it could have no links with them at all; it could also have one link (with any of the tables), two links (with any two of them), and so on.
To TabA I added four fields that refer to the four tables; each can be NULL when there is no connection.
I am wondering whether this design (the four fields in TabA) is optimal, or whether there is a better design for this type of situation.
Many thanks for your reply.
dave
In answer to the question and clarification in your comment, the answer is that your design can't be improved in terms of the number of foreign key columns. Having a specific foreign key column for every potential foreign key relationship is a best practice design.
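For reference, a rough sketch of the one-nullable-column-per-relationship layout, written with Python's sqlite3 module. The names tab_a and level_1 to level_4 are placeholders of mine, not your real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE level_1 (id INTEGER PRIMARY KEY);
CREATE TABLE level_2 (id INTEGER PRIMARY KEY);
CREATE TABLE level_3 (id INTEGER PRIMARY KEY);
CREATE TABLE level_4 (id INTEGER PRIMARY KEY);
CREATE TABLE tab_a (
    id         INTEGER PRIMARY KEY,
    level_1_id INTEGER REFERENCES level_1(id),  -- NULL means "no link"
    level_2_id INTEGER REFERENCES level_2(id),
    level_3_id INTEGER REFERENCES level_3(id),
    level_4_id INTEGER REFERENCES level_4(id)
);
INSERT INTO level_1 VALUES (10);
INSERT INTO level_3 VALUES (30);
INSERT INTO tab_a VALUES (1, NULL, NULL, NULL, NULL);  -- no links
INSERT INTO tab_a VALUES (2, 10,   NULL, 30,   NULL);  -- two links
""")

# Count how many of the four optional links each row actually has.
link_counts = conn.execute("""
    SELECT id,
           (level_1_id IS NOT NULL) + (level_2_id IS NOT NULL)
         + (level_3_id IS NOT NULL) + (level_4_id IS NOT NULL)
    FROM tab_a ORDER BY id
""").fetchall()

# A non-NULL value must still point at an existing row.
try:
    conn.execute("INSERT INTO tab_a VALUES (3, 99, NULL, NULL, NULL)")
    dangling_accepted = True
except sqlite3.IntegrityError:
    dangling_accepted = False
print(link_counts, dangling_accepted)
```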
However, the schema design itself seems questionable. I don't have enough information to tell whether the "Distributori_[N]_Livello" tables are a truly hierarchical structure or not. If it is, it is often possible to use a self-referential table for hierarchical structures rather than a set of N tables, as the diagram you linked seems to use. If you are able to refactor your design in such a way, it might be possible to reduce the number of foreign key columns required.
Whether this is possible or not is not for me to say given the data provided.

Composite primary key

I am working on the design of a database that will be used to store data that originates from a number of different sources. The instances I am storing are assigned unique IDs by the original sources. Each instance I store should contain information about the source it came from, along with the ID it was assigned by this source.
As an example, consider the following table that illustrates the problem:
-----------------------------------
| source_id | id_on_source | data |
-----------------------------------
| 1         | 17600        | ...  |
| 1         | 17601        | ...  |
| 2         | 1            | ...  |
| 3         | 1            | ...  |
-----------------------------------
Note that while the id_on_source is unique for each source, it is possible for the same id_on_source to be found for different sources.
I have a decent understanding of relational databases, but am far from an expert or even an experienced user. The problem I face with this design is what I should use as primary key. The data seems to dictate the use of a composite primary key of (source_id, id_on_source). After a little googling I found some heated debates on the pros and cons of composite primary keys however, leaving me a little confused.
The table will have one-to-many relationships with other tables, and will thus be referred to in the foreign keys of other tables.
I am not tied to a specific RDBMS and I am not sure if it matters for the sake of the argument, but let's say that I prefer to work with SQLite and MySQL.
What are the pros and cons of using a composite primary key in this case? Which would you prefer?
I personally find composite primary keys to be painful. For every table that you wish to join to your "sources" table you will need to add both the source_id and id_on_source fields.
I would create a standard auto-incrementing primary key on your sources table and add a unique index on source_id and id_on_source columns.
This then allows you to add just the id of the sources table as a foreign key on other tables.
Generally I have also found support for composite primary keys within many frameworks and tooling products to be "patchy" at best and non-existent in others.
Composite keys are tough to manage and slow to join. Since you're building a summary table, use a surrogate key (i.e., an autoincrement/identity column). Leave your natural key columns there.
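Roughly, the auto-incrementing key plus unique index could look like the sketch below (Python with sqlite3, since SQLite was mentioned). The table names instances and notes, and the notes table itself, are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE instances (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    source_id    INTEGER NOT NULL,
    id_on_source INTEGER NOT NULL,
    data         TEXT
);
-- The natural key is still enforced, just not as the primary key.
CREATE UNIQUE INDEX instances_natural_key
    ON instances (source_id, id_on_source);

-- Hypothetical referencing table: one column is enough for the foreign key.
CREATE TABLE notes (
    id          INTEGER PRIMARY KEY,
    instance_id INTEGER NOT NULL REFERENCES instances(id),
    body        TEXT
);
INSERT INTO instances (source_id, id_on_source, data) VALUES
    (1, 17600, 'a'), (1, 17601, 'b'), (2, 1, 'c'), (3, 1, 'd');
INSERT INTO notes (instance_id, body) VALUES (3, 'note on source 2, id 1');
""")

# Getting the natural key back from a referencing row takes a join.
rows = conn.execute("""
    SELECT i.source_id, i.id_on_source, n.body
    FROM notes AS n
    JOIN instances AS i ON i.id = n.instance_id
""").fetchall()
print(rows)
```

The flip side is visible in the last query: whenever you want source_id and id_on_source alongside the referencing data, you have to join back to the parent table.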
This has a lot of other benefits, too. Primarily, if you merge with a company and they have one of the same sources, but reused keys, you're going to get into trouble if you aren't using a surrogate key.
This is the widely acknowledged best practice in data warehousing (a much larger undertaking than what you're doing, but still relevant), and for good reason. Surrogates provide data integrity and quick joins. You can get burned very quickly with natural keys, so stay away from them as an identifier, and only use them on the import process.
You have a business requirement that the combination of those two attributes is unique. So, you should have a UNIQUE constraint on those two attributes. Whether you call that UNIQUE constraint "primary" is really just a preference; it doesn't have much impact aside from documentation.
The only question is whether you then add an extra column and mark it UNIQUE. The only reason I can see to do that is performance, which is a legitimate reason.
Personally, I don't like the approach of turning every database into essentially a graph, where the generated columns are essentially pointers and you are just traversing from one to the next. I think that throws away all of the greatness of a relational system. If you step back and think about it, you're introducing a bunch of columns that have no meaning to your business, at all. You may be interested in my related blog post.
I believe that composite keys create a very natural and descriptive data model. My experience comes from Oracle and I don't think there are any technical issues when creating a composite PK. In fact, anyone analysing the data dictionary would immediately understand something about the table. In your case it would be obvious that each source_id must have a unique id_on_source.
The use of natural keys often creates a hot debate, but people whom I work with like natural keys from a good data model perspective.
Pretty much the only time I use a composite primary key is when the high-order part of the key is the key to another table. For example, I might create an OrderLineItem table with a primary key of OrderId + LineNumber. As many accesses against the OrderLineItem table will be "order join orderlineitem using (orderid)" or some variation of that, this is often handy. It also makes it easy when looking at database dumps to figure out what line items are connected to what order.
As others have noted, composite keys are a pain in most other circumstances because your joins have to involve all the pieces. It's more to type which means more potential for mistakes, queries are slower, etc.
Two-part keys aren't bad; I do those fairly often. I'm reluctant to use a three-part key. More than three parts, I'd say forget it.
In your example, I suspect there's little to be gained by using the composite key. Just invent a new sequence number and let the source and source key be ordinary attributes.
I ran into problems using a lot of composite keys and so I wouldn't recommend it (more below). I've also found there to be benefits in an independent/surrogate key (rather than natural) when trying to roll back user mistakes.
The problem was that via a set of relations, one table joined two tables where for each row part of the composite was the same (this was appropriate in 3rd normal form - a comparison between two parts of a parent). I de-duplicated that part of the composite relationship in the join table (so instead of parent1ID, other1ID, parent2ID, other2ID there was parentID, other1ID, other2ID) but now the relation couldn't update changes to the primary key, because it tried to do it twice via each route and failed in the middle.
Some people recommend you use a Globally Unique ID (GUID): merge replication and transactional replication with updating subscriptions use uniqueidentifier columns to guarantee that rows are uniquely identified across multiple copies of the table. If the value is globally unique when it's created, then you don't need to add the source_id to make it unique.
Although a uniqueid is a good primary key, I agree that it's usually better to use a different, natural (not necessarily unique) key as your clustered index. For example, if a uniqueid is the PK which identifies employees, you might want the clustered index to be the department (if your select statements usually retrieve all employees within a given department). If you do want to use a uniqueid as the clustered index, see the NEWSEQUENTIALID() function: this creates sequential uniqueid values, which (being sequential) have better clustering performance.
Adding an extra ID column will leave you having to enforce TWO uniqueness constraints instead of one.
Using that extra ID column as the foreign key in other referencing tables, instead of the key that presents itself naturally, will cause you to have to do MORE joins, namely in all the cases where you need the original source_id plus id_on_source along with data from the referencing table.

ID fields in SQL tables: rule or law?

Just a quick database design question: Do you ALWAYS use an ID field in EVERY table, or just most of them? Clearly most of your tables will benefit, but are there ever tables that you might not want to use an ID field?
For example, I want to add the ability to add tags to objects in another table (foo). So I've got a table FooTag with a varchar field to hold the tag, and a fooID field to refer to the row in foo. Do I really need to create a clustered index around an essentially arbitrary ID field? Wouldn't it be more efficient to use fooID and my text field as the clustered index, since I will almost always be searching by fooID anyway? Plus using my text in the clustered index would keep the data sorted, making sorting easier when I have to query my data. The downside is that inserts would take longer, but wouldn't that be offset by the gains during selection, which would happen far more often?
What are your thoughts on ID fields? Bendable rule, or unbreakable law?
edit: I am aware that the example provided is not normalized. If tagging is to be a major part of the project, with multiple tables being tagged, and other 'extras', a two-table solution would be a clear answer. However, in this simplest case, would normalization be worthwhile? It would save some space, but require an extra join when running queries.
As in much of programming: rule, not law.
Proof by exception: Some two-column tables exist only to form relationships between other more meaningful tables.
If you are making tables that bridge between two or more other tables and the only fields you need are the dual PK/FKs, then I don't know why you would need an ID column in there as well.
ID columns generally can be very helpful, but that doesn't mean you should go peppering them in at every occasion.
As others have said, it's a general, rather than absolute, rule and there are plenty of exceptions (tables with composite keys for example).
There are occasional but useful cases where you might want to create an artificial ID in a table that already has a (usually composite) unique identifier. For example, in one system I've created a table to store part numbers; although the part numbers are unique, they may actually change, so we add an arbitrary integer PartID. Not so common, but it's a typical real-world example.
In general what you really want is to be able, if at all possible, to uniquely identify a record. It could be an id field or it could be a unique index (which does not have to be on just one field). Anytime I thought I could get away without creating a way to uniquely identify a record, I have been proven wrong. Not all tables have a natural key, though, and if they do not, you really need to have an id field of some kind. If you have a natural key, you could use that instead, but I find that even then I need an id field in most cases to prevent having to do too much updating when the natural key changes (it always seems to change). Plus, having worked with literally hundreds of databases concerning many, many different topics, I can tell you that a true natural key is rare. As others have mentioned, there is no need for an id field in a table that is simply there to join two tables that have a many-to-many relationship, but even this should have a unique index.
If you need to retrieve records from that table with unique id then yes. If you will retrieve them by some other composite key made up of foreign keys then no. The last thing you need is fields, data, and indexes that you do not use.
A clustered index does not need to be on primary key or a surrogate (identity column) either.
Your design, however, is not normalized. Typically for tagging, I use two tables: a table of tags (with a surrogate key) and a table of links from the tags to the subject table(s), using the surrogate key in the tag table and the primary key in the subject table. This allows your tags to apply to different entities (photos, articles, employees, locations, products, whatever). It allows you to enforce foreign key relationships to multiple tables, and also allows you to invent tag hierarchies and other things about the tag table.
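A small sketch of that two-table tagging layout for the foo example, using Python's sqlite3 module. The table and column names (tag, foo_tag, label) are assumptions; the link table has no ID of its own, only the compound key of the two foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE foo (fooID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tag (
    tagID INTEGER PRIMARY KEY,           -- surrogate key for the tag
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE foo_tag (                   -- link table, no separate ID column
    fooID INTEGER NOT NULL REFERENCES foo(fooID),
    tagID INTEGER NOT NULL REFERENCES tag(tagID),
    PRIMARY KEY (fooID, tagID)
);
INSERT INTO foo VALUES (1, 'first foo'), (2, 'second foo');
INSERT INTO tag VALUES (1, 'red'), (2, 'large');
INSERT INTO foo_tag VALUES (1, 1), (1, 2), (2, 2);
""")

# Tags for one foo row, sorted by label.
tags_for_foo_1 = [row[0] for row in conn.execute("""
    SELECT t.label
    FROM foo_tag AS ft
    JOIN tag AS t ON t.tagID = ft.tagID
    WHERE ft.fooID = ?
    ORDER BY t.label
""", (1,))]
print(tags_for_foo_1)
```

Leading the compound key with fooID matches the stated access pattern of searching by fooID first.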
As far as the indexes on this design, it will be dictated by the usage patterns.
In general, developers love having an ID field on all tables except for 'linking' tables because it makes development much easier, and I am no exception to this. DBAs, on the other hand, see no problem with making natural primary keys made up of 3 or 4 columns. It can be a butting of heads to try and get a good database design.