SQL: Reference one to one-of-many - sql

I'm having what some would call a rather strange problem/question.
Suppose I have a table, which may reference one (and only one) of many different other tables. How would I do that in the best way?? I'm looking for a solution which should work in a majority of databases (MS SQL, MySQL, PostgreSQL etc). The way I see it, there are a couple of different solutions (is any better than the other?):
Have one column for each possible reference. Only one of these columns may contain a value for any given row, all others are null. Allows for strict foreign keys, but it gets tedious when the number of "many" (possible referenced tables) gets large
Have a two column relationship, i.e. one column "describing" which table is referenced, and one referencing the instance (row in that table). Easily extended when the number of "many" (referenced tables) grows, though I can't perform single query lookup in a straightforward way (either left join all possible tables, or union multiple queries which joins towards one table each)
??
Make sense? What's best practise (if any) in this case?
I specifically want to be able to query data from the referenced entity, without really knowing which of the tables are being referenced.
How would you do?

Both of these methods are suitable in any relational database, so you don't have to worry about that consideration. Both result in rather cumbersome queries. For the first method:
select . . .
from t left outer join
ref1
on t.ref1id = ref1.ref1id left outer join
ref2
on t.ref2id = ref2.ref2id . . .
For the second method:
select . . .
from t left outer join
ref1
on t.anyid = ref1.ref1id and anytype = 'ref1' left outer join
ref2
on t.anyid = ref2.ref2id and anytype = 'ref2' . . .
So, from the perspective of query simplicity, I don't see a major advantage for one versus the other. The second version has a small disadvantage -- when writing queries, you have to remember what the name is for the join. This might get lost over time. (Of course, you can use constraints or triggers to ensure that only a fixed set of values make it into the column.)
From the perspective of query performance, the first version has a major advantage. You can identify the column as a foreign key and the database can keep statistics on it. This can help the database choose the right join algorithm, for instance. The second method does not readily offer this possibility.
From the perspective of data size, the first version requires storing the id for each of the possible values. The second is more compact. From the perspective of maintainability, the first is hard to add a new object type; the second is easy.
If you have a set of things that are similar to each other, then you can consider storing them in a single table. Attributes that are not appropriate can be NULLed out. You can even create views for the different flavors of the thing. One table may or may not be an option.
In other words, there is no right answer to this question. As with many aspects of database design, it depends on how the data is going to be used. Absent other information, I would probably first try to coerce the data into a single table. If that is just not reasonable, I would go with the first option if the number of tables can be counted on one hand, and the second if there are more tables.

1)
This is legitimate for small number of static tables. If you anticipate a number of new tables might need to be added in the future, take a look at 3) below...
2)
Please don't do that. You'd be forfeiting the declarative FOREIGN KEYs, which is one of the most important mechanisms for maintaining data integrity.
3)
Use inheritance. More info in this post:
What is the best design for a database table that can be owned by two different resources, and therefore needs two different foreign keys?
You might also be interested in looking at:
Implementing comments and Likes in database
Multiple one to many relationship design
How to avoid multiple tables tables to relations M: M?
database table design thoughts
Relating two database tables (associating an employee with an activity)
How to structure table Activities in a database?

Related

Working with DBs, SQL: table "inheritance" separation decision

Let's say I have a database in which one entity (i.e. table) inherits from another one, for example:
Table 1, named person: (name,surname)
Table 2, named car_owner
In this case, car_owner inherits from person, i.e. a car-owner IS a person.
I'm now in a point where I have to decide whether I should:
create the table car_owner, even though it has no extra columns except the ones in person, although in the future this might change => doing this results in car_owner = table with columns (id,person_id), where person_id is FK to person
or
leave only the person table for now and only do (1) when/if extra information regarding a car-owner will appear => note that if I do this FKs to a car-owner from other tables would actually be FKs to the person table
The tables I'm dealing with have different names and semantics and the choice between (1) and (2) is not clear, because the need for extra columns in car_owner might never pop-up.
Conceptually, (1) seems to be the right choice, but I guess what I'm asking is if there are any serious issues I might run into later if I instead resort to (2)
I would suggest that option 1 is the better answer. While it creates more work to join the tables for queries, it is neater to put "optional" data in it's own table. And if more types of persons are required (pedestrian, car_driver, car_passenger) they can be accommodated with more child tables. You can always use a view to make them look like one table.
BTW for databases, we say Parent and Child, not "inherets".
To answer the part about problems/consequences of option 2 - well, none too serious. This is a database, so you can always re-arrange things later, but there will be a price to pay in rewriting queries and code if you restructure tables. Why I don't like Option 2 is because it can lead to extra tables not re-using the person part. If it looks like that table is for car_owners, I might make an entirely new table for car_passengers with duplication of all the person columns.
In short, nothing too tragic should happen with either approach, they are each preferable for different reasons and the drawbacks are mainly inconvenience and potential future messiness.

Multiple record types and how to split them amongst tables

I'm working on a database structure and trying to imagine the best way to split up a host of related records into tables. Records all have the same base type they inherit from, but each then expands on it for their particular use.
These 4 properties are present for every type.
id, name, groupid, userid
Here are the types that expand off those 4 properties.
"Static": value
"Increment": currentValue, maxValue, overMaxAllowed, underNegativeAllowed
"Target": targetValue, result, lastResult
What I tried initially was to create a "records" table with the 4 base properties in it. I then created 3 other tables named "records_static/increment/target", each with their specific properties as columns. I then forged relationships between a "rowID" column in each of these secondary tables with the main table's "id".
Populating the tables with dummy data, I am now having some major problems attempting to extract the data with a query. The only parameter is the userid, beyond that what I need is a table with all of the columns and data associated with the userid.
I am unsure if I should abandon that table design, or if I just am going about the query incorrectly.
I hope I explained that well enough, please let me know if you need additional detail.
Make the design as simple as possible.
First I'd try a single table that contains all attributes that might apply to a record. Irrelevant attributes can be null. You can enforce null values for a specific type with a check constraint.
If that doesn't work out, you can create three tables for each record type, without a common table.
If that doesn't work out, you can create a base table with 1:1 extension tables. Be aware that querying that is much harder, requiring join for every operation:
select *
from fruit f
left join
apple a
on a.fruit_id = f.id
left join
pear p
on p.fruit_id = f.id
left join
...
The more complex the design, the more room for an inconsistent database state. The second option you could have a pear and an apple with the same id. In the third option you can have missing rows in either the base or the extension table. Or the tables can contradict each other, for example a base row saying "pear" with an extension row in the Apple table. I fully trust end users to find a way to get that into your database :)
Throw out the complex design and start with the simplest one. Your first attempt was not a failure: you now know the cost of adding relations between tables. Which can look deceptively trivial (or even "right") at design time.
This is a typical "object-oriented to relational" mapping problem. You can find books about this. Also a lot of google hits like
http://www.ibm.com/developerworks/library/ws-mapping-to-rdb/
The easiest for you to implement is to have one table containing all columns necessary to store all your types. Make sure you define them as nullable. Only the common columns can be not null if necessary.
Just because object share some of the same properties does not mean you need to have one table for both objects. That leads to unnecessary right outer joins that have a 1 to 1 relationship which is not what I think of as good database design.
but...
If you want to continue in your fashion I think all you need is a primary key in the table with common columns "id, name, groupid, userid" (I assume ID) then that would be the foreign key to your table with currentValue, maxValue, overMaxAllowed, underNegativeAllowed

Multiple many to many relationships between same tables or one n:m relationship with identifying attribute?

Currently I have multiple n:m relationships between 2 tables:
Users --> Favourites(user_id,post_id) <-- Posts
Users --> Follow(user_id,post_id) <-- Posts
Would you rather have 2 join tables or just one join table with an attribute which marks the type of the join, so something like:
Users --> Users_Posts (user_id,post_id,type(VALUES="favourite,follow") <-- Posts
It's not exactly the same example as I have in my application, but I think you can get the idea.
I don't think there is one "right" answer here. I think it depends. If you use a single table with a "relationship type" column and frequently want to extract just relationships of a single type--say just favorites--then each query against that table will need to apply a WHERE clause to filter out the types you don't want. That might make for slow queries if you don't properly index the non-key "relationship type" column. Also, it makes it so future developers will always need to know and remember to filter to the types they want or they may unexpectedly get relationships data back that they don't intend. Having two separate tables is easier to understand. For example, it is easier for me to quickly know what to expect in a "Favorites" table than in a "Users_Posts" table, so separate tables may communicate the differences more quickly.
On the other hand, if you frequently need to select both relationship types in a single set, then having them in a single table is simpler because you don't need to worry about doing a UNION to combine the data from two tables into a single view. What if there were 10,000 different possible relationship types? Would you want 10,000 different tables, or would you prefer a single table? Most people would prefer a single table in that case.
So I think it depends on many factors, such as expected usage, size, etc. The "right" answer is more of an art than a science.

Are relationship tables really needed?

Relationship tables mostly contain two columns: IDTABLE1, and IDTABLE2.
Only thing that seems to change between relationship tables is the names of those two columns, and table name.
Would it be better if we create one table Relationships and in this table we place 3 columns:
TABLE_NAME, IDTABLE1, IDTABLE2, and then use this table for all relationships?
Is this a good/acceptable solution in web/desktop application development? What would be downside of this?
Note:
Thank you all for feedback. I appreciate it.
But, I think you are taking it a bit too far... Every solution works until one point.
As data storage simple text file is good till certain point, than excel is better, than MS Access, than SQL Server, than...
To be honest, I haven't seen any argument that states why this solution is bad for small projects (with DB size of few GB).
It would be a monster of a table; it would also be cumbersome. Performance-wise, such a table would not be a great idea. Also, foreign keys are impossible to add to such a table. I really can't see a lot of advantages to such a solution.
Bad idea.
How would you enforce the foreign keys if IDTABLE1 could contain ids from any table at all?
To achieve acceptable performance on joins without a load of unnecessary IO to bring in completely unrelated rows you would need a composite index with leading column TABLE_NAME that basically ends up partitioning the table into sections anyway.
Obviously even with this pseudo partitioning going on you would still be wasting a lot of space in the table/indexes just repeating the table name for each row.
Isn't it a big IF that you're only going to store the 2 ID fields? If I have a StudentCourse (or better yet Enrollment) table that has StudentID & CourseID, but wouldn't EnrollmentDate go in this table as well since not all students enroll on the first day of class. Seems like a bad idea to add this column to an already bloated table where most records will be null.
The benefit of a single table could be a requirement that the application has the ability to allow user/admin to create these relationships with data (Similar to have a single lookup or reference list table) and avoid having to create a new table to address these User Created References. Needing dynamic querying may benefit as well. An application that requires such dynamic data structure requirements might be better suited for a schemaless or nosql database.

How important are lookup tables?

A lot of the applications I write make use of lookup tables, since that was just the way I was taught (normalization and such). The problem is that the queries I make are often more complicated because of this. They often look like this
get all posts that are still open
"SELECT * FROM posts WHERE status_id = (SELECT id FROM statuses WHERE name = 'open')"
Often times, the lookup tables themselves are very short. For instance, there may only be 3 or so different statuses. In this case, would it be okay to search for a certain type by using a constant or so in the application? Something like
get all posts that are still open
"SELECT * FROM posts WHERE status_id = ".Status::OPEN
Or, what if instead of using a foreign id, I set it as an enum and queried off of that?
Thanks.
The answer depends a little if you are limited to freeware such as PostGreSQL (not fully SQL compliant), or if you are thinking about SQL (ie. SQL compliant) and large databases.
In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.
Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.
if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if you those values change, you may have to update millions of rows, which is very limited design.
Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:
O Open
C Closed
U [NotKnown]
The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.
the second point is ease of change, if Closed were changed to Expired, again, one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows need to be changed.
Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.
in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.
ENUM is a Non-SQL, do not use it. In SQL the "enum" is a Lookup table.
The next point relates to the meaningfulness of the key.
If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).
{M,F} are unlikely to change
if you have used {0,1,2,4,6}, well don't change it, why would you want to. Those values were supposed to be meaningless, remember, only a meaningful Key need to be changed.
if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.
Since PKs are stable, particularly in Lookup tables, you can safely code:
WHERE status_code = 'O' -- Open
You do not have to JOIN the Lookup table and obtain the data value Open, as a developer, you are supposed to know what the Lookup PKs mean.
Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.
If you do that to your database from the outset, you will not have to upgrade it (and the code) later.
Your Example
SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.
A report tool would generate:
SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'
Another Exaple
For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):
Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future
Code such as this is common:
WHERE InstrumentTypeCode LIKE "Eq%"
And the users of the GUI would choose the value from a drop-down that displays
{Equity/Common Share;Over The Counter},
not {Eq;OTC;OF}, not {M;F;U}.
Without a lookup table, you can't do that, either in the apps, or in the report tool.
For look-up tables I use a sensible primary key -- usually just a CHAR(1) that makes sense in the domain with an additional Title (VARCHAR) field. This can maintain relationship enforcement while "keeping the SQL simple". The key to remember here is the look-up table does not "contain data". It contains identities. Some other identities might be time-zone names or assigned IOC country codes.
For instance gender:
ID Label
M Male
F Female
N Neutral
select * from people where gender = 'M'
Alternatively, an ORM could be used and manual SQL generation might never have to be done -- in this case the standard "int" surrogate key approach is fine because something else deals with it :-)
Happy coding.
Create a function for each lookup.
There is no easy way. You want performance and query simplicity. Ensure the following is maintained. You could create a SP_TestAppEnums to compare existing lookup values against the function and look for out of sync/zero returned.
CREATE FUNCTION [Enum_Post](#postname varchar(10))
RETURNS int
AS
BEGIN
DECLARE #postId int
SET #postId =
CASE #postname
WHEN 'Open' THEN 1
WHEN 'Closed' THEN 2
END
RETURN #postId
END
GO
/* Calling the function */
SELECT dbo.Enum_Post('Open')
SELECT dbo.Enum_Post('Closed')
Question is: do you need to include the lookup tables (domain tables 'round my neck of the woods) in your queries? Presumably, these sorts of tables are usually
pretty static in nature — the domain might get extended, but it probably won't get shortened.
their primary key values are pretty unlikely to change as well (e.g., the status_id for a status of 'open' is unlikely to suddenly get changed to something other than what it was created as).
If the above assumptions are correct, there's no real need to add all those extra tables to your joins just so your where clause can use a friend name instead of an id value. Just filter on status_id directly where you need to. I'd suspect the non-key attribute in the where clause ('name' in your example above) is more likely to get changes than the key attribute ('name' in your example above): you're more protected by referencing the desire key value(s) of the domain table in your join.
Domain tables serve
to limit the domain of the variable via a foreign key relationship,
to allow the domain to be expanded by adding data to the domain table,
to populate UI controls and the like with user-friendly information,
Naturally, you'd need to suck domain tables into your queries where you you actually required the non-key attributes from the domain table (e.g., descriptive name of the value).
YMMV: a lot depends on context and the nature of the problem space.
The answer is "whatever makes sense".
lookup tables involve joins or subqueries which are not always efficient. I make use of enums a lot to do this job. its efficient and fast
Where possible (and It is not always . . .), I use this rule of thumb: If I need to hard-code a value into my application (vs. let it remain a record in the database), and also store that vlue in my database, then something is amiss with my design. It's not ALWAYS true, but basically, whatever the value in question is, it either represents a piece of DATA, or a peice of PROGRAM LOGIC. It is a rare case that it is both.
NOT that you won't find yourself discovering which one it is halfway into the project. But as the others said above, there can be trade-offs either way. Just as we don't always acheive "perfect" normalization in a database design (for reason of performance, or simply because you CAN take thngs too far in pursuit of acedemic perfection . . .), we may make some concious choices about where we locate our "look-up" values.
Personally, though, I try to stand on my rule above. It is either DATA, or PROGRAM LOGIC, and rarely both. If it ends up as (or IN) a record in the databse, I try to keep it out of the Application code (except, of course, to retrieve it from the database . . .). If it is hardcoded in my application, I try to keep it out of my database.
In cases where I can't observe this rule, I DOCUMENT THE CODE with my reasoning, so three years later, some poor soul will be able to ficure out how it broke, if that happens.
The commenters have convinced me of the error of my ways. This answer and the discussion that went along with it, however, remain here for reference.
I think a constant is appropriate here, and a database table is not. As you design your application, you expect that table of statuses to never, ever change, since your application has hard-coded into it what those statuses mean, anyway. The point of a database is that the data within it will change. There are cases where the lines are fuzzy (e.g. "this data might change every few months or so…"), but this is not one of the fuzzy cases.
Statuses are a part of your application's logic; use constants to define them within the application. It's not only more strictly organized that way, but it will also allow your database interactions to be significantly speedier.