Database Design: Storing the primary key as a separate field in the same table - sql-server-2005

I have a table that must reference another record, but of the same table. Here's an example:
Customer
********
ID
ManagerID (the ID of another customer)
...
I have a bad feeling about doing this. My other idea was to just have a separate table that just stored the relationship.
CustomerRelationship
***************
ID
CustomerID
ManagerID
I feel I may be over complicating such a trivial thing however, I would like to get some idea's on the best approach for this particular scenario?
Thanks.

There's nothing wrong about the first design. The second one, where you have an 'intermediate' table, is used for many-to-many relationships, which i don't think is yours.
BTW, that intermediate table wouldn't have and ID of its own.

Why do you have a "bad feeling" about this? It's perfectly acceptable for a table to reference its own primary key. Introducing a secondary table only increases the complexity of your queries and negatively impacts performance.

Can a Customer have multiple managers? If so, then you need a separate table.
Otherwise, a single table is fine.

You can use the first approach. See also Using Self-Joins

There's absolutely nothing wrong with the first approach, in fact Oracle has included the 'CONNECT BY' extension to SQL since at least version 6 which is intended to directly support this type of hierarchical structure (and possibly makes Oracle worth considering as your database if you are going to be doing a lot of this).
You'll need self-joins in databases which don't have something analogous, but that's also a perfectly fine and standard solution.

As a programmer I like the first approach. I like to have less number of tables. Here we are not even talking of normalization and why do we need more tables? That is just me.

Follow the KISS principle here: Keep it simple, (silly | stupid | stud | [whatever epithet starting with S you prefer]). Go with one table, unless you have a reason to need more.
Note that if the one-to-many/many-to-many relationship ends up being the case, you can extract the existing column into a table of its own, and fill in the new entries at that time.

The only reason I would ever recommend avoiding such self-referecing tables is that SQL Server does have a few spots where there are limitations with self-referencing tables.
For one, if you ever happen to come across the need for an indexed view, then you'd find out that if one of the tables used in a view definition is indeed self-referencing, you won't be able to create a clustered index on your view :-(
But apart from that - the design per se is sound and absolutely valid - go for it! I always like to keep things as simple as possible (but no simpler than that).
Marc

Related

Which is the correct form to create one to many relationships between two tables on a Oracle database

I have a Oracle database in which I have two tables, RegistroPPL and Alias. One RegistroPPL can have multiple Aliases. Now, my question is, what is the right way to create this relationship, using a bridge table like this:
or creating a direct relationship like this
What is the best way to create one to many relationships according to the normalization rules, and how can I avoid redundant data?.
In my experience, the two-table design is used for one-to-many relationships.
The only reason I can think of you might want to use an intermediate table is if there's some reason to think that in the future this might become a many-to-many relationship.
The question is: can a single alias be used for multiple registro_ppl? Presumably not. Hence, you can get away with one row per alias.
There are some situations where a more complicated design might be needed -- for instance, if alias is a slowly changing dimension, then you might want version_eff_dt and version_end_dt to handle that.
However, the data model does not suggest having an intermediate table.

Redundant field in SQL for Performance

Let's say I have two Tables, called Person, and Couple, where each Couple record stores a pair of Person id's (also assume that each person is bound to at most another different person).
I am planning to support a lot of queries where I will ask for Person records that are not married yet. Do you guys think it's worthwhile to add a 'partnerId' field to Person? (It would be set to null if that person is not married yet)
I am hesitant to do this because the partnerId field is something that is computable - just go through the Couple table to find out. The performance cost for creating new couple will also increase because I have to do this extra book keeping.
I hope that it doesn't sound like I am asking two different questions here, but I felt that this is relevant. Is it a good/common idea to include extra fields that are redundant (computable/inferable by joining with other tables), but will make your query a lot easier to write and faster?
Thanks!
A better option is to keep the data normalized, and utilize a view (indexed, if supported by your rdbms). This gets you the convenience of dealing with all the relevant fields in one place, without denormalizing your data.
Note: Even if a database doesn't support indexed views, you'll likely still be better off with a view as the indexes on the underlying tables can be utilized.
Is there always a zero to one relationship between Person and Couples? i.e. a person can have zero or one partner? If so then your Couple table is actually redundant, and your new field is a better approach.
The only reason to split Couple off to another table is if one Person can have many partners.
When someone gets a partner you either write one record to the Couple table or update one record in the Person table. I argue that your Couple table is redundant here. You haven't indicated that there is any extra info on the Couple record besides the link, and it appears that there is only ever zero or one Couple record for every Person record.
How about one table?
-- This is psuedo-code, the syntax is not correct, but it should
-- be clear what it's doing
CREATE TABLE Person
(
PersonId int not null
primary key
,PartnerId int null
foreign key references Person (PersonId)
)
With this,
Everyone on the system has a row and a PersonId
If you have a partner, they are listed in the PartnerId column
Unnormalized data is always bad. Denormalized data, now, that can be beneficial under very specific circumstances. The best advice I ever heard on this subject it to first fully normalize your data, assess performance/goals/objectives, and then carefully denormalize only if it's demonstrably worth the extra overhead.
I agree with Nick. Also consider the need for history of the couples. You could use row versioning in the same table, but this doesn't work very well for application databases, works best in a in a DW scenario. A history table in theory would duplicate all the data in the table, not just the relationship. A secondary table would give you this flexibility to add additional information about the relationship including StartDate and EndDate.

Is it better to name the primary key column id or *_id?

I've been using Rails for a few years and I've grown used to the convention of naming the primary key column id. But I've run across lots of examples in SQL books that name the primary key column something like employee_id for an employees table or feed_id for a feeds table.
One advantage of the 2nd system seems to be that you can use USING() more to produce more concise SQL queries:
select feeds.title, items.title from items inner join feeds USING(feed_id);
As opposed to
select feeds.title, items.title from items inner join feeds on feeds.id = items.feed_id;
Which naming convention is better? Which is favored by experienced database administrators?
Also, is it better to pluralize the name of the table?
I always use the verbose form (i.e. 'employee_id' rather than 'id') as it is more descriptive. If you are joining more than one table and both have 'id' column you will have to alias 'id' if you need to SELECT both of the ids. Also, as you mentioned, you get the advantage of USING clause. In the grand scheme of things it isn't a huge factor one way or the other but the more verbose form gives you advantages.
Both options are valid but the purists will say use id as its name is specified by the table.
I use table_id because I find it to be more descriptive and makes debugging easier. It's more practical.
Re: Tablenames. Another hotly debated topic among database nerds but I say Singular.
Tablename_Id is my strong preference. When you do joins to Fks you know exactly what to join to what and don't make mistakes where you join to ID in table a when you meant tableb below is an example of how easy this is to do especially if you copy the on clause from somewhere else
FROM tablea a
JOIN tableb b
ON a.ID = b.tableaid
JOIN tablec c
ON a.ID = c.tablebid
In the case above, you really wanted to join to B.Id but forgot to change it from a when you copied. It will work and give you a resultset that isn't correct. If you use table_id instead, the query would fail the syntax check.
Another problem with using Id is when you are doing complex reports. Since the repport queries have to have fields with individual names, you can end up wasting time writing a bunch of aliases you wouldn't need if you had named the id with the tablename.
Now people who use ORMs don't write a lot of SQl but what they do write and what report writers write are generally complex, complicated statements. You need to design you database to make it easier to do those things than simple queries.
The use of ID as the name for the identifying field is considered a SQl antipattern. http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?s=books&ie=UTF8&qid=1308929815&sr=1-1
This is user preference, but I always name the primary keys of my tables Id. I always name references of that Id in other tables as [SingularEntityName][Id] e.g.
Credentials
Id Password
Users
Id Name CredentialId
Descriptions
Id UserId
Keeps my references clean. However, just be consistant in your naming and it really shouldn't matter how you set up your schemas.
To open the can of worms again,
I'm willing to bet those who select tablename_id are older, more experienced programmers.
Those who use just id are younger.
Why ? because you learn redundancy and constancy is not always a bad thing.
the one thing I would add to the #1 answer, use the "_" helps make it easier to pick out the variable in code, in the table, etc... I do the same for foreign keys. TableName_FK Some will argue over that but it works for me and it's obvious what it is.
I have had to work on other's code many times over the years. Consistency is critical, obfuscation is worthless and meaningful variable names very helpful.
There are those who argue that verbosity makes code harder to read. I don't think that argument flies in today's world of objects.that.derive.from.some.microsoft.class.twenty.layers.deep.that.you.have.to.fully.reference.
BTW - as so many have said, it's your choice. Those folks who spend time arguing over coding syntax don't have enough work to do. Learn to be flexible and to use the standards of the workplace where you are employed. If you are lucky enough to set your own standards, then have at it. The fact your are wondering is great. But choose one and then be consistent (until you change jobs or decide you have a paradigm shift that means you want to change your style.)
You can often pick out what era someone started learning to code by their personal preferences and styles. Guys that write very tight, minmal, hard to read code, started back when memory was very limited (DOS) and probably wrote a lot of assembler, those that use Hungarian started back with the Win SDK, etc...
This discussion has been evolving for decades. The older I get, the more I document my code, the more meaningful I make my variable names, etc... because in a week I will have forgotten what I wrote and I need the road maps to make sense of it. Not so much that I'm forgetful, although that's part of the equation, but more so because I'm writing code on so many different projects.
it's entirely your choice. But personally I prefer the second one as I wouldn't need to look for table names in my code when I come across an id. I think tablename_id is better.
Another advantage to giving your primary keys names that are unique to that table is that it makes it easier to have a naming convention, when referring to those keys in different tables, that indicates the corresponding key.
For example, suppose everything in your alpha table begins alpha_, so that you have alpha_id as your primary key. In your beta table - where everything would begin beta_ - you would use beta_alpha_id to have a reference in that table to the keys in the alpha table.

Are relationship tables really needed?

Relationship tables mostly contain two columns: IDTABLE1, and IDTABLE2.
Only thing that seems to change between relationship tables is the names of those two columns, and table name.
Would it be better if we create one table Relationships and in this table we place 3 columns:
TABLE_NAME, IDTABLE1, IDTABLE2, and then use this table for all relationships?
Is this a good/acceptable solution in web/desktop application development? What would be downside of this?
Note:
Thank you all for feedback. I appreciate it.
But, I think you are taking it a bit too far... Every solution works until one point.
As data storage simple text file is good till certain point, than excel is better, than MS Access, than SQL Server, than...
To be honest, I haven't seen any argument that states why this solution is bad for small projects (with DB size of few GB).
It would be a monster of a table; it would also be cumbersome. Performance-wise, such a table would not be a great idea. Also, foreign keys are impossible to add to such a table. I really can't see a lot of advantages to such a solution.
Bad idea.
How would you enforce the foreign keys if IDTABLE1 could contain ids from any table at all?
To achieve acceptable performance on joins without a load of unnecessary IO to bring in completely unrelated rows you would need a composite index with leading column TABLE_NAME that basically ends up partitioning the table into sections anyway.
Obviously even with this pseudo partitioning going on you would still be wasting a lot of space in the table/indexes just repeating the table name for each row.
Isn't it a big IF that you're only going to store the 2 ID fields? If I have a StudentCourse (or better yet Enrollment) table that has StudentID & CourseID, but wouldn't EnrollmentDate go in this table as well since not all students enroll on the first day of class. Seems like a bad idea to add this column to an already bloated table where most records will be null.
The benefit of a single table could be a requirement that the application has the ability to allow user/admin to create these relationships with data (Similar to have a single lookup or reference list table) and avoid having to create a new table to address these User Created References. Needing dynamic querying may benefit as well. An application that requires such dynamic data structure requirements might be better suited for a schemaless or nosql database.

Optimal DB structure for additional fields entity

I have a table in a DB (Postgres based), which acts like a superclass in object-oriented programming. It has a column 'type' which determines, which additional columns should be present in the table (sub-class properties). But I don't want the table to include all possible columns (all properties of all possible types).
So I decided to make a table, containg the 'key' and 'value' columns (i.e. 'filename' = '/file', or 'some_value' = '5'), which contain any possible property of the object, not included in the superclass table. And also made one related table to contain the available 'key' values.
But there is a problem with such architecture - the 'value' column should be of a string data type by default, to be able to contain anything. But I don't think converting to and from strings is a good decision. What is the best way to bypass this limitation?
The design you're experimenting with is a variation of Entity-Attribute-Value, and it comes with a whole lot of problems and inefficiencies. It's not a good solution for what you're doing, except as a last resort.
What could be a better solution is what fallen888 describes: create a "subtype" table for each of your subtypes. This is okay if you have a finite number of subtypes, which sounds like what you have. Then your subtype-specific attributes can have data types, and also a NOT NULL constraint if appropriate, which is impossible if you use the EAV design.
One remaining weakness of the subtype-table design is that you can't enforce that a row exists in the subtype table just because the main row in the superclass table says it should. But that's a milder weakness than those introduced by the EAV design.
edit: Regarding your additional information about comments-to-any-entity, yes this is a pretty common pattern. Beware of a broken solution called "polymorphic association" which is a technique many people use in this situation.
How about this instead... each sub-type gets its own DB table. And the base/super table just has a varchar column that holds the name of the sub-type DB table. Then you can have something like this...
Entity
------
ID
Name
Type
SubTypeName (value of this column will be 'Dog')
Dog
---
VetName
VetNumber
etc
If you don't want your (sub-)table names to be varchar values in the base table, you can also just have a SubType table whose primary key will be in the base table.
The only workaround (while retaining your strucure) is to have separate tables:
create table IntProps(...);
create table StringProps(...);
create table CurrencyProps(...);
But I do not think that this is a good idea...
One common approach is having the key-value table contain multiple columns, one for each data type, i.e. StringValue, DecimalValue, etc.
Just know you're trading queryability and performance for a database schema you don't need to change. You could also consider ORM mapping or an object database.
You could have a per type key/value table. The available table would need to encode the availability of a specific key/type pair to point to the correctly typed key/value table.
This seems like a highly inefficient architecture in for a row based relational databases however.
Perhaps you should take a look at a column oriented relational database?
Thanks for the answers. I'll explain a little bit more specifically what i need.
There's a need to program a blog+forum website, and I've been looking at the WordPress DB structure.
There's a strong need for the ability to place comments to any kind of 'object', like a blog entry, or a video file attachment to it. The above DB structure being very easy to scale and to fulfill all our needs was the reason of its choice.
But that's not late to change it, cause this is in stage of early engineering. Also our model smells now like a completely tree-hierarchy based DB. For now I'll accept Bill Karwin's and fallen888 answers, but maybe I'm going in a totally wrong direction?
about the user being able to add a new field to the table:
I admire all these people making comments.
I used to be interested in this kind of thing a few years ago, but have written little code recently (apart from a little bit of PHP and MYSQL).
I think it's fine if you want to keep going - you may end up with something new.
Sorry to pour any cold water on the scheme - I admire your efforts. My personal belief is that if you go far enough in this direction, you will end up with a system that interprets more of natural language than SQL does. (Around 1970, SQL was actually spelt Sequel, and it actually stood for "structured english query language", but after they standardized it in the 1970's - I think someone said that Oracle was the first commercial implementation, 19079, the "English" got dropped off, because I guess they decided that it was only a tiny subset of English.
I have run out of steam in this area, because I haven't got a job. Without an easy job that pays the bills, where I can experiment with these ideas, it's a bit hard to concentrate on this area.
Best wishes to all.
sorry, I wrote 19079 above, I meant the year 1979. Oracle got their first contract writing a database for the CIA.