Grouping records with or without a foreign key constraint - sql

I have a table containing objects, something like this:
PK ObjectId
FK ObjectTypeId
Description
etc
The objects need to be grouped. I have been given a number of suggestions, all of which 'work' and some I like more than others. None of them are perfect so I'm struggling to settle on any particular model.
1/ Add a self-referential foreign key. This is clean but not ideal because (a) there is no logical parent (it's a group, not a hierarchy) and (b) it is potentially a pain for LINQ-to-SQL to traverse a self-referential hierarchy - would need to check to see if the current object is a 'parent' or a 'child' object, etc.
PK ObjectId
FK ParentObjectId
2/ Add a parent table & a foreign key. This adds constraints but the Group table doesn't contain any useful information - it exists only to provide a GroupId constraint & identity.
Table Object
PK ObjectId
FK GroupId
Table Group
PK GroupId
3/ Add a GroupId without a constraint or foreign key. No data integrity. The theory is that when a new object group is inserted, each object is given GroupId = the first assigned ObjectId. Probably the simplest & most practical solution.
e.g.
ObjectId GroupId
...
15 10
16 16
17 16
...
21 16
22 22
My question is which of these is the best in theory and/or practice, and why. Or, please tell me a better way to do this! I personally like (2) because it is normalized, but am told that a table with just one field is bad design. Thoughts and suggestions?

For option two, you can always add additional fields, for discriptions, display orders, and GroupParentID, for multiple levels of grouping. Also, you can implement securit on these grouping structures.
In our application we use that structure.

From the limited info you present it is difficult to recommend an option, because the method depends on your usage of the tables. Storing the data is one thing, but using it is a completely different issue. All three store the data is a way possible to retrieve it, however what queries will you need to construct to load it, aggregate it, and search it?
I would implement each method, populate with a small set of data and try to write a few queries to gather data as your application would. Any problems or difficulties in using the table design will become apparent then.
You should always design your tables so data retrieval is fast and easy. You shouldn't have to fight your tables to get your data out. If do you find yourself fighting with your tables to get the data out, then you designed them poorly. Your table structure should make your life easier not harder.

Related

Modelling many-to-many relation between more than two tables

I'm modelling a tier-list database using PostgreSQL. This is how it works:
A user can create a new Tier List;
A user can add as many tiers he wants to the list;
A user can add as many items as he can. Initially, the items are added to an "unranked" section (not assigned to any tier), then the user can rank them as he wants.
Modeling details:
A tier necessarily belongs to a tier_list;
An item can be in multiple tier_lists and in multiple tiers as well;
An item added to a tier_list has not necessarily been added to one of the tiers.
For modelling the relations between item-tier and item-tier_list, I thought about two scenarios:
Creating a junction with a composite PFK key of item and tier_list with a nullable tier FK. The records with no tier value would be the unranked ones, while the ones with an assigned tier would be the ranked;
Creating two M-N relations: one between item and tier, storing ranked items, and another between item and tier_list, storing unranked items.
I feel like the first option would be easier to deal with when having to persist things like moving a product between tiers (or even unranking it), while the second looks more compliant to SQL standards. Am I missing something?
First proposed solution model:
Second proposed solution model:
You can create a joint key using 3 different fields.
First of all, why using smallint and not int? Not fluent in Posgres, but it's usually better to have the biggest integer possible as primary key (things can grow faster than you expect).
Second, I strongly suggest to put ID_ before and not after the name of the filed used for lookup. It makes it easier to read.
As how to build your tables:
Item
ID PK
Title
Descriptions
I see no problems here. I'd just change the name in tblProducts, for easier reading.
Tier_List
ID PK
Description
Works fine too. Again I'll look for a better name. I'd call this one tblTiers or tblLegues instead. Usign similar names can bring troubles in 2-3 years when you have to add things and you're not sure what's what. Better use distinctive names for the tables.
Tier (suggesting tblTiers or tblRankings)
ID PK
Tier_List_ID PK FK
Title
Description
Here I see a HUGE problem. For experience, I don't really understand why you create a combination key here with ID and Tier_List_ID. Do you need to reuse the same ID for different tiers? If that ID has a meaning bring it out from the PK absolutely! PK must be simple counters, that will NEVER be changed. I saw people using the ID with a meaning for the end-user. It was a total disaster! I can't even start describing the quantity of garbage data that that DB was containing.
I suppose, because you were talking about ranking, that the ID there is a Rank, a level or something like that.
The table should become
ID PK uuid
Tier_List_ID FK
Rank smallint
Title
Description
There's another reason why I had you do this: when you have a combined PK, certain DBRMs require you to use the same combined key in the lookup tables, and that can become messy fast!
Now, the lookup table:
tier_list_item (tblRankingLookup?)
ID_Product FK PK
ID_Tier_List FK PK
ID_Tier FK PK
You don't need anything else to make it work smoothly! At least, that's how I'd envision it.
Instead I'd add an ID_User (because I'm not sure if all users can see all tiers and all rankings, or they can see only theirs).
Addendum: if you need to have unique combinations of different elements, I'm pretty sure you can create a combined index and mark it as "unique" (don't remember the correct syntax, not sure it is the same in Postgres).
In exmple, if you don't want the Tier table to have the rank repeated only once per tier_list_ID, you can create an index using tier_list_ID and Ranking and mark it unique. This way a two tiers in the same tier_list will not have the same value for the field Rank (rank can still be null).

Why is it a bad idea to have a table without a primary key?

I am very new to data modeling, and according to Microsoft's Entity Framework, tables without primary keys are not allowed and apparently a bad idea. I am trying to figure out why this is a bad idea, and how to fix my model so that I do not have this hole.
I have 4 tables in my current model: User, City, HelloCity, and RateCity. It is modeled as shown in the picture. The idea is that many users can visit many cities, and a user can only rate a city once, but they can greet a city many times. For this reason, I did not have a PK in HelloCity table.
Any insight as to how I can change this to comply with best practices, and why this is against best practices to begin with?
This response is mainly opinion/experience-based, so I'll list a few reasons that come to mind. Note that this is not exhaustive.
Here're some reasons why you should use primary keys (PKs):
They allow you to have a way to uniquely identify a given row in a table to ensure that there're no duplicates.
The RDBMS enforces this constraint for you, so you don't have to write additional code to check for duplicates before inserting, avoiding a full table scan, which implies better performance here.
PKs allow you to create foreign keys (FKs) to create relations between tables in a way that the RDBMS is "aware" of them. Without PKs/FKs, the relationship only exists inside the programmer's mind, and the referenced table might have a row with its "PK" deleted, and the other table with the "FK" still thinks the "PK" exists. This is bad, which leads to the next point.
It allows the RDBMS to enforce integrity constraints. Is TableA.id referenced by TableB.table_a_id? If TableB.table_a_id = 5 then, you're guaranteed to have a row with id = 5 in TableA. Data integrity and consistency is maintained, and that is good.
It allows the RDBMS to perform faster searches b/c PK fields are indexed, which means that a table doesn't need to have all of its rows checked when searching for something (e.g. a binary search on a tree structure).
In my opinion, not having a PK might be legal (i.e. the RDBMS will let you), but it's not moral (i.e. you shouldn't do it). I think you'd need to have extraordinarily good/powerful reasons to argue for not using a PK in your DB tables (and I'd still find them debatable), but based on your current level of experience (i.e. you say you're "new to data modeling"), I'd say it's not yet enough to attempt justifying a lack of PKs.
There're more reasons, but I hope this gives you enough to work through it.
As far as your M:M relations go, you need to create a new table, called an associative table, and a composite PK in it, that PK being a combination of the 2 PKs of the other 2 tables.
In other words, if there's a M:M relation between tables A and B, then we create a table C that has a 1:M relation to with both tables A and B. "Graphically", it'd look similar to:
+---+ 1 M +---+ M 1 +---+
| A |------| C |------| B |
+---+ +---+ +---+
With the C table PK somewhat like this:
+-----+
| C |
+-----+
| id | <-- C.id = A.id + B.id (i.e. combined/concatenated, not addition!)
+-----+
or like this:
+-------+
| C |
+-------+
| a_id | <--|
+-------+ +-- composite PK columns instead
| b_id | <--| of concatenation (recommended)
+-------+
Two primary reasons for a primary key:
To uniquely identify a record for later reference.
To join to other tables accurately and efficiently.
A primary key essentially tags a row with a unique identifier. This can be composed of one or more columns in a row but most commonly just uses one. Part of what makes this useful is when you have additional tables (such as the one in your scenario) you can refer to this value in other tables. Since it's unique, I can look at a column with that unique ID in another table (say HelloCity) and instantly know where to look in the User table to get more information about the person that column refers to.
For example, HelloCity only stores the IDs for the User and City. Why? Because it'd be silly to re-record ALL the data about the City and ALL the data about the User in another table when you already have it stored elsewhere. The beauty of it is, say the user needs to update their DisplayName for some reason. To do so, you just need to change it in User. Now, any row that refers to the user instantly returns the new DisplayName; otherwise you would have to find every record using the old DisplayName and update it accordingly, which in larger databases could take a considerable amount of time.
Note that the primary key is only unique in that specific table though - you could theoretically see the same primary key value in your City and User tables (this is especially common if you're using simple integers as IDs) but your database will know the difference based on the relationship you build between tables as well as your JOIN statements in your queries.
Another way primary keys help is they automatically have an index generated on their column(s). This increases performance in queries where your WHERE clause searches on the primary key column value. And, since you'll likely be referring to that primary key in other tables, it makes that lookup faster as well.
In your data model I see some columns that already have 'Id' in them. Without knowing your dataset I would hope those already have all-unique values so it should be fine to place a PK on those. If you get errors doing that there are likely duplicates.
Back to your question about HelloCity - Entity Framework is a little finicky when it comes to keys. If you really want to play it safe you can auto-generate a unique ID for every entry and call it good. This makes sense because it's a many-to-many relationship, meaning that any combination can appear any number of times, so in theory there's no reliable way to distinguish between unique entries. In the event you want to drop a single entry in the future, how would you know what row to refer to? You could make the argument that you search on all the fields and the greeting may be different, but if there are multiple visits to a city with the same greeting, you may accidentally drop all of those records instead of just one.
However, if it was a one-to-one relationship you could get away with making the combination of both CityId and UserId the primary key since that combination should always be unique (because you should never see multiple rows making that same combination).
Late to the party but I wanted to add there are special cases where a table does not need to have a primary key, or any type of key.
Take the case of a singleton, for example. A table that includes a single row (or a very well known number of rows) always. The dual table in Oracle is one case.
Formally, the primary key for a singleton is (): that is, a key with no columns. I'm not aware of any database that allows it, though.
There are other cases where a PK is not needed, typically with log tables that usually are "end tables" since you typically draw them at the border of your diagram; no other tables refer to them (i.e. they have no children). A good use of indexes is enough to deal with them since, by their nature, they don't need to enforce row uniqueness.
But, to close, yes 99.99% of tables in a relational database should have a PK.

Table referenced by other tables having different PKs

I would like to create a table called "NOTES". I was thinking this table would contain a "table_name" VARCHAR(100) which indicates what table put in the note, a "key" or multiple "key" columns representing the primary key values of the table that this note applies to and a "note" field VARCHAR(MAX). When other tables use this table they would supply THEIR primary key(s) and their "table_name" and get all the notes associated with the primary key(s) they supplied. The problem is that other tables might have 1, 2 or more PKs so I am looking for ideas on how I can design this...
What you're suggesting sounds a little convoluted to me. I would suggest something like this.
Notes
------
Id - PK
NoteTypeId - FK to NoteTypes.Id
NoteContent
NoteTypes
----------
Id - PK
Description - This could replace the "table_name" column you suggested
SomeOtherTable
--------------
Id - PK
...
Other Columns
...
NoteId - FK to Notes.Id
This would allow you to keep your data better normalized, but still get the relationships between data that you want. Note that this assumes a 1:1 relationship between rows in your other tables and Notes. If that relationship will be many to one, you'll need a cross table.
Have a look at this thread about database normalization
What is Normalisation (or Normalization)?
Additionally, you can check this resource to learn more about foreign keys
http://www.w3schools.com/sql/sql_foreignkey.asp
Instead of putting the other table name's and primary key's in this table, have the primary key of the NOTES table be NoteId. Create an FK in each other table that will make a note, and store the corresponding NoteId's in the other tables. Then you can simply join on NoteId from all of these other tables to the NOTES table.
As I understand your problem, you're attempting to "abstract" the auditing of multiple tables in a way that you might abstract a class in OOP.
While it's a great OOP design principle, it falls flat in databases for multiple reasons. Perhaps the largest single reason is that if you cannot envision it, neither will someone (even you) looking at it later have an easy time reassembling the data. Smaller that that though, is that while you tend to think of a table as a container and thus similar to an object, in reality they are implemented instances of this hypothetical container you are trying to put together and operate better if you treat them as such. By creating an audit table specific to a table or a subset of tables that share structural similarity and data similarity, you increase the performance of your database and you won't run in to strange trigger or select related issues later.
And you can't envision it not because you're not good at what you're doing, but rather, the structure is not conducive to database logging.
Instead, I would recommend that you create separate logging tables that manage the auditing of each table you want to audit or log. In fact, some fast google searches show many scripts already written to do much of this tasking for you: Example of one such search
You should create these individual tables and then if you want to be able to report on multiple table or even all tables at once, you can create a stored procedure (if you want to make queries based on criterion) or a view with an included SELECT statement that JOINs and/or UNIONs the tables you are interested in - in a fashion that makes sense to the report type. You'll still have to write new objects in to the view, but even with your original table design, you'd have to account for that.

What is the best way to tie 1 table to many tables

I have a table. Call it TableA
this table will link to many tables and ideally be enforced by database relationships in (many-1)(TableA-TableB)
(many-1)(TableA-TableC) ... etc
The solution i have is to put all the foreign keys of TableB, TableC, etc in TableA along with a "Type" field (which contains a word version of which relationship is to be enforced). however i think there must be a better way.
What would you do?
I'd appreciate any advice in this and thanks.
This is a perfectly acceptable approach - foreign keys are indeed the correct way of modeling a many-to-one relationship.
Generally, you can't just say you want to make a solution "better"; rather, you should have a specific goal in mind. Faster, shorter implementation, less memory, whatever. Even better is if you have a specific use case you would like to optimize for.
Edit: your question is more clear now that you've edited it. If I understand correctly, you feel your current implementation is inefficient because one of your TableA items can be attached to at most one other item, be it from TableC, TableC, etc.
If that is correct, what I might do is implement the foreign key in Table A as both an ID and a table name, rather than having a new column for each new type of object you want to add to your system. Of course, this would prevent you from changing table names, so a more robust solution would be to have another table mapping unique ids to object types (stored as table names). Then the foreign key in Table A would be item_id and object_type_id, and you could retrieve the object by looking up object_type_id in the object_types table to get the table name.
If you want to add referential integrity enforced by the database server, each key must be represented by a unique column in TableA.
It's hard to give more advice than that without knowing more about what your design is attempting to do.

ID fields in SQL tables: rule or law?

Just a quick database design question: Do you ALWAYS use an ID field in EVERY table, or just most of them? Clearly most of your tables will benefit, but are there ever tables that you might not want to use an ID field?
For example, I want to add the ability to add tags to objects in another table (foo). So I've got a table FooTag with a varchar field to hold the tag, and a fooID field to refer to the row in foo. Do I really need to create a clustered index around an essentially arbitrary ID field? Wouldn't it be more efficient to use fooID and my text field as the clustered index, since I will almost always be searching by fooID anyway? Plus using my text in the clustered index would keep the data sorted, making sorting easier when I have to query my data. The downside is that inserts would take longer, but wouldn't that be offset by the gains during selection, which would happen far more often?
What are your thoughts on ID fields? Bendable rule, or unbreakable law?
edit: I am aware that the example provided is not normalized. If tagging is to be a major part of the project, with multiple tables being tagged, and other 'extras', a two-table solution would be a clear answer. However in this simplest case, would normalization be worthwhile? It would save some space, but require an extra join when running queries
As in much of programming: rule, not law.
Proof by exception: Some two-column tables exist only to form relationships between other more meaningful tables.
If you are making tables that bridge between two or more other tables and the only fields you need are the dual PK/FK's, then I don't know why you would need ID column in there as well.
ID columns generally can be very helpful, but that doesn't mean you should go peppering them in at every occasion.
As others have said, it's a general, rather than absolute, rule and there are plenty of exceptions (tables with composite keys for example).
There are some occasional but useful occasions where you might want to create an artificial ID in a table that already has a (usually composite) unique identifier. For example, in one system I've created a table to store part numbers; although the part numbers are unique, they may actually change - we add an arbitrary integer PartID. Not so common, but it's a typical real-world example.
In general what you really want is to be able if at all possible to have some kind of way to uniquely identify a record. It could be an id field or it could be a unique index (which does not have to be on just one field). Anytime I thought I could get away without creating a way to uniquely identify a record, I have been proven wrong. All tables do not have a natural key though and if they do not, you really need to have an id file of some kind. If you have a natural key, you could use that instead, but I find that even then I need an id field in most cases to prevent having to do too much updating when the natural key changes (it always seems to change). Plus having worked with literally hundreds of databases concerning many many differnt topics, I can tell you that a true natural key is rare. As others have nmentioned there is no need for an id field in a table that is simply there to join two tables that havea many to many relationship, but even this should have a unique index.
If you need to retrieve records from that table with unique id then yes. If you will retrieve them by some other composite key made up of foreign keys then no. The last thing you need is fields, data, and indexes that you do not use.
A clustered index does not need to be on primary key or a surrogate (identity column) either.
Your design, however, is not normalized. Typically for tagging, I use two tables, a table of tags (with a surrogate key) and a table of links from the tags to the subject table(s) using the surrogate key in the tag table and theprimary key in the subject table. This allows your tags to apply to different entities (photos, articles, employees, locations, products, whatever). It allows you to enforce foreign key relationships to multiple tables, and also allows you to invent tag hierarchies and other things about the tag table.
As far as the indexes on this design, it will be dictated by the usage patterns.
In general developers love having an ID field on all tables except for 'linking' tables because it makes development much easier, and I am no exception to this. DBA's on the other hand see no problem with making natural primary keys made up of 3 or 4 columns. It can be a butting of heads to try and get a good database design.