Modelling many-to-many relation between more than two tables - sql

I'm modelling a tier-list database using PostgreSQL. This is how it works:
A user can create a new Tier List;
A user can add as many tiers as they want to the list;
A user can add as many items as they want. Initially, the items are added to an "unranked" section (not assigned to any tier), then the user can rank them as they want.
Modeling details:
A tier necessarily belongs to a tier_list;
An item can be in multiple tier_lists and in multiple tiers as well;
An item added to a tier_list has not necessarily been added to one of the tiers.
For modelling the relations between item-tier and item-tier_list, I thought about two scenarios:
Creating a junction with a composite PFK key of item and tier_list with a nullable tier FK. The records with no tier value would be the unranked ones, while the ones with an assigned tier would be the ranked;
Creating two M-N relations: one between item and tier, storing ranked items, and another between item and tier_list, storing unranked items.
I feel like the first option would be easier to deal with when having to persist things like moving a product between tiers (or even unranking it), while the second looks more compliant to SQL standards. Am I missing something?
First proposed solution model:
Second proposed solution model:
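As a rough sketch of the first scenario (not the model from the missing diagram), the junction table could look like the following. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs as-is. All table and column names are assumptions, and so is the composite foreign key that keeps an item's tier inside the same tier list.

```python
import sqlite3

# SQLite stands in for PostgreSQL; every name below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tier_list (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tier (
    id INTEGER PRIMARY KEY,
    tier_list_id INTEGER NOT NULL REFERENCES tier_list(id),
    title TEXT NOT NULL,
    UNIQUE (id, tier_list_id)      -- target for the composite FK below
);
CREATE TABLE tier_list_item (
    item_id INTEGER NOT NULL REFERENCES item(id),
    tier_list_id INTEGER NOT NULL REFERENCES tier_list(id),
    tier_id INTEGER,               -- NULL means "unranked"
    PRIMARY KEY (item_id, tier_list_id),
    FOREIGN KEY (tier_id, tier_list_id) REFERENCES tier (id, tier_list_id)
);
""")

conn.execute("INSERT INTO item (id, title) VALUES (1, 'Some item')")
conn.execute("INSERT INTO tier_list (id, title) VALUES (1, 'My list')")
conn.executemany(
    "INSERT INTO tier (id, tier_list_id, title) VALUES (?, 1, ?)",
    [(1, "S"), (2, "A")],
)
# Adding an item to the list leaves tier_id NULL, i.e. unranked.
conn.execute("INSERT INTO tier_list_item (item_id, tier_list_id) VALUES (1, 1)")


def tier_of(item_id, tier_list_id):
    """Return the tier_id of an item in a list (None when unranked)."""
    row = conn.execute(
        "SELECT tier_id FROM tier_list_item WHERE item_id = ? AND tier_list_id = ?",
        (item_id, tier_list_id),
    ).fetchone()
    return row[0]


def set_tier(item_id, tier_list_id, tier_id):
    """Rank, move, or unrank an item with a single UPDATE."""
    conn.execute(
        "UPDATE tier_list_item SET tier_id = ? WHERE item_id = ? AND tier_list_id = ?",
        (tier_id, item_id, tier_list_id),
    )


initially = tier_of(1, 1)
set_tier(1, 1, 1)
after_ranking = tier_of(1, 1)
set_tier(1, 1, 2)
after_moving = tier_of(1, 1)
set_tier(1, 1, None)
after_unranking = tier_of(1, 1)
```

With this shape, ranking, moving, and unranking are all the same single-row UPDATE, which is the convenience I was referring to.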

You can create a composite key using three different fields.
First of all, why use smallint and not int? I'm not fluent in Postgres, but it's usually better to have the biggest integer possible as the primary key (things can grow faster than you expect).
Second, I strongly suggest putting ID_ before, not after, the name of the field used for lookup. It makes it easier to read.
As how to build your tables:
Item
ID PK
Title
Descriptions
I see no problems here. I'd just change the name to tblProducts, for easier reading.
Tier_List
ID PK
Description
Works fine too. Again, I'd look for a better name. I'd call this one tblTiers or tblLeagues instead. Using similar names can cause trouble in 2-3 years when you have to add things and you're not sure what's what. Better to use distinctive names for the tables.
Tier (suggesting tblTiers or tblRankings)
ID PK
Tier_List_ID PK FK
Title
Description
Here I see a HUGE problem. I don't really understand why you create a composite key here with ID and Tier_List_ID. Do you need to reuse the same ID for different tiers? If that ID has a meaning, take it out of the PK absolutely! PKs must be simple counters that will NEVER be changed. From experience: I have seen people use IDs that had a meaning for the end user. It was a total disaster! I can't even begin to describe the quantity of garbage data that DB contained.
I suppose, because you were talking about ranking, that the ID there is a Rank, a level or something like that.
The table should become
ID PK uuid
Tier_List_ID FK
Rank smallint
Title
Description
There's another reason why I'd have you do this: when you have a composite PK, certain DBMSs require you to use the same composite key in the lookup tables, and that can become messy fast!
Now, the lookup table:
tier_list_item (tblRankingLookup?)
ID_Product FK PK
ID_Tier_List FK PK
ID_Tier FK PK
You don't need anything else to make it work smoothly! At least, that's how I'd envision it.
Instead, I'd add an ID_User (because I'm not sure whether all users can see all tiers and all rankings, or only their own).
Addendum: if you need unique combinations of different elements, I'm pretty sure you can create a composite index and mark it as "unique" (I don't remember the correct syntax, and I'm not sure it is the same in Postgres).
For example, if you don't want the same rank repeated within a Tier_List_ID in the Tier table, you can create an index on Tier_List_ID and Rank and mark it unique. This way, two tiers in the same tier list cannot have the same value in the Rank field (Rank can still be null).
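A minimal sketch of that unique index, run through Python's sqlite3 module as a stand-in for Postgres. The `CREATE UNIQUE INDEX` statement has the same form in PostgreSQL, where NULLs are also treated as distinct by default. The column name `tier_rank` and the index name are assumptions.

```python
import sqlite3

# SQLite stands in for PostgreSQL; table, column and index names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tier (
    id INTEGER PRIMARY KEY,
    tier_list_id INTEGER NOT NULL,
    tier_rank INTEGER,             -- nullable on purpose
    title TEXT NOT NULL
);
-- One rank value per tier list; rows with a NULL rank never collide.
CREATE UNIQUE INDEX tier_rank_per_list ON tier (tier_list_id, tier_rank);
""")

insert = "INSERT INTO tier (tier_list_id, tier_rank, title) VALUES (?, ?, ?)"
conn.executemany(insert, [
    (1, 1, "S"),
    (1, 2, "A"),
    (2, 1, "S"),            # same rank, different list: allowed
    (1, None, "Draft 1"),   # NULL rank: allowed
    (1, None, "Draft 2"),   # a second NULL rank in the same list: also allowed
])

try:
    conn.execute(insert, (1, 1, "Duplicate rank"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM tier").fetchone()[0]
```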

Related

Adding an artificial primary key versus using a unique field [duplicate]

Recently I inherited a huge app from somebody who left the company.
This app uses a SQL Server DB.
The developer always defined an int-based primary key on tables. For example, even though the Users table has a unique UserName field, he still added an integer identity primary key.
This is done for every table, no matter whether other fields are unique and could serve as the primary key.
Do you see any benefits whatsoever in this? That is, using UserName as the primary key vs. adding UserID (an identity column) and setting that as the primary key?
I feel like I have to add another element to my comments, which were starting to turn into an essay, so I think it is better to post it all as an answer instead.
Sometimes there are domain specific reasons why a candidate key is not a good candidate for joins (maybe people change user names so often that the required cascades start causing performance problems). But another reason to add an ever-increasing surrogate is to make it the clustered index. A static and ever-increasing clustered index alleviates a high-cost IO operation known as a page split. So even with a good natural candidate key, it can be useful to add a surrogate and cluster on that. Read this for further details.
But if you add such a surrogate, recognise that the surrogate is purely internal, it is there for performance reasons only. It does not guarantee the integrity of your data. It has no meaning in the model, unless it becomes part of the model. For example, if you are generating invoice numbers as an identity column, and sending those values out into the real world (on invoice documents/emails/etc), then it's not a surrogate, it's part of the model. It can be meaningfully referenced by the customer who received the invoice, for example.
One final thing that is typically left out of this discussion is one particular aspect of join performance. It is often said that the primary key should also be narrow, because it can make joins more performant, as well as reducing the size of non-clustered indexes. And that's true.
But a natural primary key can eliminate the need for a join in the first place.
Let's put all this together with an example:
create table Countries
(
    countryCode char(2) not null primary key clustered,
    countryName varchar(64) not null
);
insert Countries values
    ('AU', 'Australia'),
    ('FR', 'France');
create table TourLocations
(
    tourLocationName varchar(64) not null,
    tourLocationId int identity(1,1) unique clustered,
    countryCode char(2) not null foreign key references Countries(countryCode),
    primary key (countryCode, tourLocationName)
);
insert TourLocations (tourLocationName, countryCode) values
    ('Bondi Beach', 'AU'),
    ('Eiffel Tower', 'FR');
I did not add a surrogate key to Countries, because there aren't many rows and we're not going to be constantly inserting new rows. I already know what all the countries are, and they don't change very often.
On the TourLocations table I have added an identity and clustered on it. There could be very many tour locations, changing all the time.
But I still must have a natural key on TourLocations. Otherwise I could insert the same tour location name with the same country twice. Sure, the Id's will be different. But the Id's don't mean anything. As far as any real human is concerned, two tour locations with the same name and country code are completely indistinguishable. Do you intend to have actual users using the system? Then you've got a problem.
By putting the same country and location name in twice I haven't created two facts in my database. I have created the same fact twice! No good. The natural key is necessary. In this sense The Impaler's answer is strictly, necessarily, wrong. You cannot not have a natural key. If the natural key can't be defined as anything other than "every meaningful column in the table" (that is to say, excluding the surrogate), so be it.
OK, now let's investigate the claim that an int identity key is advantageous because it helps with joins. Well, in this case my char(2) country code is narrower than an int would have been.
But even if it wasn't (maybe we think we can get away with a tinyint), those country codes are meaningful to real people, which means a lot of the time I don't have to do the join at all.
Suppose I gave the results of this query to my users:
select countryCode, tourLocationName
from TourLocations
order by 1, 2;
Very many people will not need me to provide the countries.countryName column for them to know which country is represented by the code in each of those rows. I don't have to do the join.
When you're dealing with a specific business domain that becomes even more likely. Meaningful codes are understood by the domain users. They often don't need to see the long description columns from the key table. So in many cases no join is required to give the users all of the information they need.
If I had foreign keyed to an identity surrogate I would have to do the join, because the identity surrogate doesn't mean anything to anyone.
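For anyone who wants to try both points without a SQL Server instance, here is a rough equivalent using Python's sqlite3 module. It is only a sketch: SQLite has no `identity` or `clustered` keywords, so the surrogate is declared as `INTEGER PRIMARY KEY` and the natural key as a `UNIQUE` constraint, which enforces the same two uniqueness rules.

```python
import sqlite3

# SQLite stand-in for the SQL Server tables above (an approximation, not a port).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Countries (
    countryCode TEXT NOT NULL PRIMARY KEY,
    countryName TEXT NOT NULL
);
CREATE TABLE TourLocations (
    tourLocationId INTEGER PRIMARY KEY,        -- surrogate, generated by the engine
    tourLocationName TEXT NOT NULL,
    countryCode TEXT NOT NULL REFERENCES Countries (countryCode),
    UNIQUE (countryCode, tourLocationName)     -- the natural key
);
INSERT INTO Countries VALUES ('AU', 'Australia'), ('FR', 'France');
INSERT INTO TourLocations (tourLocationName, countryCode) VALUES
    ('Bondi Beach', 'AU'),
    ('Eiffel Tower', 'FR');
""")

# Stating the same fact twice is rejected by the natural key,
# even though the surrogate id would have been different.
try:
    conn.execute(
        "INSERT INTO TourLocations (tourLocationName, countryCode) VALUES (?, ?)",
        ("Bondi Beach", "AU"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The country code is meaningful on its own, so no join to Countries is needed.
rows = conn.execute(
    "SELECT countryCode, tourLocationName FROM TourLocations ORDER BY 1, 2"
).fetchall()
```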
You are talking about the difference between synthetic and natural keys.
In my [very] personal opinion, I would recommend to always use synthetic keys (and always call it id). The main problem is that natural keys are never unique; they are unique in theory, yes, but in the real world there are a myriad of unexpected and inexorable events that will make this false.
In database design:
Natural keys correspond to values present in the domain model. For example, UserName, SSN, VIN can be considered natural keys.
Synthetic keys are values not present in the domain model. They are just numeric/string/UUID values that have no relationship with the actual data. They only serve as unique identifiers for the rows.
I would say, stick to synthetic keys and sleep well at night. You never know what the Marketing Department will come up with on Monday, and suddenly "the username is not unique anymore".
Yes, having a dedicated int is a good thing for PK use.
You may have multiple alternate keys; that's OK too.
Two great reasons for it:
It is performant
It protects against key mutation (editing a name, etc.)
A username or any such unique field that holds meaningful data is subject to changes. A name may have been misspelled or you might want to edit a name to choose a better one, etc. etc.
Primary keys are used to identify records and, in conjunction with foreign keys, to connect records in different tables. They should never change. Therefore, it is better to use a meaningless int field as primary key.
By meaningless I mean that apart from being the primary key it has no meaning to the users.
An int identity column has other advantages over a text field as primary key.
It is generated by the database engine and is guaranteed to be unique in multi-user scenarios.
It is faster than a text column.
Text can have leading spaces, hidden characters and other oddities.
There are multiple kinds of text data types, multiple character sets and culture dependent behaviors resulting in text comparisons not always working as expected.
int primary keys generated in ascending order have a superior performance in conjunction with clustered primary keys (which is a SQL-Server specialty).
Note that I am talking from a database point of view. In the user interface, users will prefer identifying entries by name or e-mail address, etc.
But commands like SELECT, INSERT, UPDATE or DELETE will always identify records by the primary key.
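To illustrate the "keys should never change" point, here is a small sketch using Python's sqlite3 module rather than SQL Server. The `Users` and `Comments` tables and their columns are made up for the example. Correcting a misspelled user name touches one row, and the referencing rows are left alone because they only hold the int key.

```python
import sqlite3

# Hypothetical tables; SQLite is used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Users (
    UserID INTEGER PRIMARY KEY,        -- meaningless int key
    UserName TEXT NOT NULL UNIQUE      -- still unique, but free to change
);
CREATE TABLE Comments (
    CommentID INTEGER PRIMARY KEY,
    UserID INTEGER NOT NULL REFERENCES Users (UserID),
    Body TEXT NOT NULL
);
INSERT INTO Users (UserID, UserName) VALUES (1, 'jsmiht');
INSERT INTO Comments (UserID, Body) VALUES (1, 'First'), (1, 'Second');
""")

# Fix the misspelling: one UPDATE on one row, no cascade into Comments.
cursor = conn.execute("UPDATE Users SET UserName = 'jsmith' WHERE UserID = 1")
rows_updated = cursor.rowcount

authors = [
    row[0]
    for row in conn.execute(
        "SELECT u.UserName FROM Comments c "
        "JOIN Users u ON u.UserID = c.UserID ORDER BY c.CommentID"
    )
]
```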
This subject is much like Gulliver's Travels and the wars fought over which end of the egg you are supposed to crack open.
However, using the SAME "id" name for all tables, with an autonumber? Yes, it is a LONG-established choice.
There are of course MANY different views on this subject, and many advantages and disadvantages.
Regardless of which choice one prefers (or even needs), this is a long-established concept in our industry. In fact, SharePoint tables use "ID" and autonumber by default. So does MS Access, and there are probably more that do this.
The simple concept?
You can build your tables with the PK and child tables with foreign keys.
At that point you set up your relationships between the tables.
Now, you might decide to add say some invoice number or whatever. Rules might mean that such invoice number is not duplicated.
But WHY do we care if you have some "user" name, or some "invoice" number or whatever? Why should that fact affect your relational database model?
You mean I don't have a user name, or don't have an invoice number, and the whole database and relationships don't work anymore? We don't care!!!!
The concept of data, even required fields, or even a column having to be unique?
That has ZERO to do with a working relational data model.
And maybe you decide that the invoice number is not generated until, say, it is sent to the customer. So, the fact of some user name, invoice number or whatever? Don't care - you can have all kinds of business rules for those numbers, but they have ZERO to do with the fact that you designed a working relational data model based on so-called "surrogate", sometimes called synthetic, keys.
So, once you build that data model - even with JUST the PK "id" and FKs (foreign keys) - you are NOW free to start adding columns and define what type of data you are going to put in each table. But what you shove into each table has ZERO to do with that working relational data model. They are to be thought of as separate concepts.
So, if you have a user name - add that column to the table. If you don't want user names, remove the column. The data you store in the table has ZERO to do with the automatic PK ID you are using - it's not really any different from, say, what area of memory the computer is going to allocate to load that data. Basic data operations of the system have nothing to do with having built a database with relationships that simply exist. And the data columns you add after having built those relationships are up to you - but they will not, and should not, affect the operation of the database and the relationships you built and set up. Not only are these two concepts separate, but they free the developer from having to worry about the part that maintains the relationships, as opposed to the data columns you add to such tables to store user data.
I mean, in json data, xml? We often have a master + child table relationship. We don't care how that relationship is maintained - but only that it exists.
Thus yes, all tables have that PK "ID". Even better? In code, you NEVER have to guess what the PK id is - it's always the same!!!
So, the data and columns you put and toss into a table? Those columns and data have zero to do with the PK id, and while it is the database generating that PK? It could be a web service call to some monkeys living in a faraway jungle eating bananas, and they give you a PK value based on how many bananas they have eaten. We just really don't care about that number - it is just an internal housekeeping number - one that we don't see or even care about in most code. And thus the number one rule for such automatic PK values?
You NEVER give that auto PK number any meaning from a user and application point of view.
In summary:
Yes, using a PK called "id" for all tables? Common, and in fact in SharePoint and many systems, it is not only the default, but is in fact required for such systems to operate.
It's better to use userid. The User table is referenced by many other tables.
The referencing tables would contain the primary key of the user table as a foreign key.
It's better to use userid since it's an integer value,
it takes less space than the string values of username, and
searches by the database engine would be faster
user(userid, username, name)
comments(commentid, comment, userid) would be better than
comments(commentid, comment, username)

A different way to Model a portion of this ERD

I have a very simple table diagram from modeling my application. The problem is I am second guessing my relation between Vendor and VendorOrder. The VendorOrders table should store all vendororders in the system. To get all orders for a certain apartment, you would just use the PK and FK relationship to gather that data. Is there anything I should improve with the overall design?
Diagram:
There are three things I see that you could do to improve this.
Create an intersection table between your Apartment and Resident tables called ApartmentResidents, where each table references the intersection table with a one to many relationship. In this ERD, it only allows for one resident to be registered to an apartment. If a resident lives in more than one apartment for the lifetime of this database, you'll need to register them as an entirely new resident.
Intersection table example
In your Vendor table, instead of using a name as your primary key I would create an id instead. Using things that have a real-world value as your primary key can get messy for a number of reasons:
If two vendors have the same name, like "Johnson's Repair", you'll need to misspell one of them for it to be a valid key.
If you typo a vendor's name, the foreign key tables will also contain that typo (which might also keep it from showing up in results if you run a select query for the correct spelling).
Placing an index on a string is less performant than if you put it on an auto incrementing integer key.
(Optional) I usually name my database tables pluralized, like "Apartments", or "Vendors". It makes the SQL syntax read more like a sentence inside the query. If I remember right that's also one of the things that SQL's creator was going for too with the syntax design.
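Here is a rough sketch of the first two suggestions, written with Python's sqlite3 module so it can be run without a server. The column names (`UnitNumber`, `FullName`, and so on) are guesses, since they are not taken from your diagram.

```python
import sqlite3

# Column names are assumptions; only the table shapes matter here.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Apartments (ApartmentId INTEGER PRIMARY KEY, UnitNumber TEXT NOT NULL);
CREATE TABLE Residents (ResidentId INTEGER PRIMARY KEY, FullName TEXT NOT NULL);
-- Intersection table: one row per (apartment, resident) pairing.
CREATE TABLE ApartmentResidents (
    ApartmentId INTEGER NOT NULL REFERENCES Apartments (ApartmentId),
    ResidentId INTEGER NOT NULL REFERENCES Residents (ResidentId),
    PRIMARY KEY (ApartmentId, ResidentId)
);
-- Vendors keyed by an id, so the name is free to repeat or be corrected.
CREATE TABLE Vendors (VendorId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
""")

conn.executemany("INSERT INTO Apartments VALUES (?, ?)", [(1, "1A"), (2, "2B")])
conn.executemany("INSERT INTO Residents VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany(
    "INSERT INTO ApartmentResidents VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)]
)
conn.executemany(
    "INSERT INTO Vendors (Name) VALUES (?)",
    [("Johnson's Repair",), ("Johnson's Repair",)],
)

# Two residents share apartment 1A ...
residents_of_1a = [r[0] for r in conn.execute(
    "SELECT r.FullName FROM ApartmentResidents ar "
    "JOIN Residents r ON r.ResidentId = ar.ResidentId "
    "WHERE ar.ApartmentId = 1 ORDER BY r.FullName"
)]
# ... and one resident is registered to two apartments.
apartments_of_alice = [r[0] for r in conn.execute(
    "SELECT a.UnitNumber FROM ApartmentResidents ar "
    "JOIN Apartments a ON a.ApartmentId = ar.ApartmentId "
    "WHERE ar.ResidentId = 1 ORDER BY a.UnitNumber"
)]
vendor_count = conn.execute("SELECT COUNT(*) FROM Vendors").fetchone()[0]
```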

Should I use composite primary keys in this example?

I am not a SQL expert, so I defer to someone with more knowledge. So here is my question. I have designed a database where every table has an Id column (auto increment) that is the primary key. I use this design without any issue, and it makes sense to me: I simply enforce referential integrity by way of this simple primary key, since the Id column of each table uniquely identifies each row.
Some of my colleagues have suggested that I use composite primary keys, but I see no value in doing that. The purpose of a primary key is to enable referential integrity, and that is what it does.
For example, this is a toy example but it demonstrates my design:
tbl_Customers
-------------
Id (PK)
Code (VARCHAR)
Name (VARCHAR)
Surname (VARCHAR)
tbl_CustomerDetails
-----------------
Id (PK)
CustomerId (FK to tbl_Customers)
SomeDetails (VARCHAR)
This does not use a separate 'linking' table, but that does not matter; it demonstrates my design.
Some of my colleagues noted that I should have a composite primary key on tbl_Customers to not only include Id as I do now, but also Code. They say that this will improve performance and that it will ensure that Code will not duplicate.
My counter argument is that if I want Code to not duplicate, I can create a UNIQUE INDEX on Code. And that, since my front-end only ever works with Ids and never allows for example searching (SELECTing) by Code, that there can not be a performance improvement. On my presentation layer, if I show for example Customers and I allow the user to select one to see the associated CustomerDetails, I will select the corresponding tbl_CustomerDetails rows on CustomerId where it matches the selected Id of the clicked customer.
What do you suggest? Am I correct or am I wrong? I am always willing to learn, and if I am wrong here I'd love to learn. But at the moment, I do not feel their arguments are valid. Which is why I am asking the community.
Thanks!
I would suggest going with a single-column primary key instead of composite keys. The biggest drawback of a composite key is that you require more than one value/column to identify a row. If your application uses an O/RM (Object/Relational Mapping) layer, then you will have fits mapping these database rows to objects in a programming language. O/RMs are easiest to set up when every table has a single-column primary key.
Programming aside, the major drawback of composite keys in general, and especially composite keys requiring this many columns, is that all of this data needs to be specified and copied to child tables in order to set up proper relationships between tables, which wastes space and adds unnecessary complexity too.
The biggest headache I've run into with developers is they assume "uniqueness of data" equates "identifying a row in the database". This is rarely the case. I've found applications and databases to be much more maintainable and easy to build by defaulting to single column primary keys, and using composite keys as an exception to the rule, then enforcing data uniqueness by using unique constraints or indexes on those columns.
After reading your question and arguments, I would like to say you are not wrong.
You have an auto-incremented ID, which will always provide uniqueness for your rows.
Now, about the Code column: if Code should be unique, you can always put a UNIQUE constraint on that column, which will not allow duplicate values for Code. Since your front end only works with Ids, there is no need for a composite primary key on (ID, Code), but make sure you add the UNIQUE constraint on the Code column.
You have already given the explanation, buddy, and I believe you are totally right.
If you are going to make a composite primary key, then you have to consider two things:
A composite PK on (ID, Code) will allow duplicate IDs and duplicate Codes; it will only disallow duplicate combinations.
You have to add the Code column to the tbl_CustomerDetails table as well if you are going to link both tables.
In summary, I don't feel that a composite primary key is required in this case.
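The first of those two points is easy to check. The sketch below uses Python's sqlite3 module and two made-up table names to feed the same four rows to a composite-PK table and to a single-PK table with a UNIQUE Code. The behaviour shown is standard SQL, but only SQLite was used to run it.

```python
import sqlite3

# Hypothetical table names; the same four rows are offered to both designs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomersComposite (
    Id INTEGER NOT NULL,
    Code TEXT NOT NULL,
    PRIMARY KEY (Id, Code)         -- only the combination must be unique
);
CREATE TABLE CustomersSingle (
    Id INTEGER PRIMARY KEY,        -- Id alone must be unique
    Code TEXT NOT NULL UNIQUE      -- Code alone must be unique
);
""")

candidate_rows = [(1, "A"), (1, "B"), (2, "A"), (1, "A")]


def count_accepted(table):
    """Try to insert every candidate row and count how many are accepted."""
    accepted = 0
    for row in candidate_rows:
        try:
            conn.execute(f"INSERT INTO {table} (Id, Code) VALUES (?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # rejected by a PRIMARY KEY or UNIQUE constraint
    return accepted


# The composite PK lets Id 1 and Code 'A' each appear twice.
composite_accepted = count_accepted("CustomersComposite")
# The single PK plus UNIQUE accepts only the first row.
single_accepted = count_accepted("CustomersSingle")
```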
If your question is, should you use a composite key in your example, the answer to that is a resounding NO! Your colleague's suggestion to add code as a composite key is not only unnecessary but will more than likely introduce problems for you down the road. Let me illustrate:
Let's say that you'd like to distinguish customers by code: all members have the code MEMB plus the Id number, all vendors have the code VEND plus the Id number, and all customers have the code CUST plus the Id.
Among the "customers" are donors who don't purchase anything but give a contribution. You decide to make a distinction between donors and customers.
That means you'll have to change the code of some of your customers from CUST to DONOR plus Id. To make that change you will have to UPDATE EVERY INSTANCE of CUST that's a donor into DONOR. That could be a nightmare to say the least as you'll need to figure out every table that has that Id as a reference.
With your current set up, all you have to do is update the Code in ONE place and no more changes are needed. So you're right in your implementation.

How to add user customized data to database?

I am trying to design a sqlite database that will store notes. Each of these notes will have common fields like title, due date, details, priority, and completed.
In addition though, I would like to add data for more specialized notes like price for shopping list items and author/publisher data for books.
I also want to have a few general purpose fields that users can fill with whatever text data they want.
How can I design my database table in this case?
I could just have a field for each piece of data for every note, but that would waste a lot of fields and I'd like to have other options and suggestions.
There are several standard approaches you could use for solving this situation.
You could create separate tables for each kind of note, copying over the common columns in each case. This would be easy, but it would make it difficult to query over all notes.
You could create one large table with many columns and some kind of type field which would let you know which type of note it is (and therefore which subset of columns to use):
CREATE TABLE NOTE ( ID int PRIMARY KEY, NOTE_TYPE int, DUEDATE datetime, ...more common fields, price NUMBER NULL, author VARCHAR(100) NULL,.. more specific fields)
You could break your tables up into an inheritance relationship, something like this:
CREATE TABLE NOTE ( ID int PRIMARY KEY, NOTE_TYPE int, DUEDATE datetime, ...more common fields);
CREATE TABLE SHOPPINGLISTITEM (ID int PRIMARY KEY, NOTE_ID int REFERENCES NOTE(ID), price number ... more shopping list item fields)
Option 1 would be easy to implement but would involve lots of mostly redundant table definitions.
Option 2 would be easy to create and easy to write queries on, but would be space inefficient.
Option 3 would be more space efficient and less redundant, but would possibly have slower queries because of all the foreign keys.
This is the typical set of trade-offs for modeling these kinds of relationships in SQL. Any of these solutions could be appropriate for your use case, depending on your performance requirements.
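As an illustration of option 3, here is a small runnable sketch using Python's sqlite3 module. The column names and sample rows are invented, and as a small variation on the snippet above the child tables reuse the note's id as their own primary key instead of carrying a separate ID.

```python
import sqlite3

# Invented columns and sample data; a sketch of the "inheritance" layout.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE note (
    id INTEGER PRIMARY KEY,
    note_type TEXT NOT NULL,
    title TEXT NOT NULL,
    due_date TEXT
);
-- One child table per specialised kind of note, keyed by the parent note's id.
CREATE TABLE shopping_list_item (
    note_id INTEGER PRIMARY KEY REFERENCES note (id),
    price REAL NOT NULL
);
CREATE TABLE book (
    note_id INTEGER PRIMARY KEY REFERENCES note (id),
    author TEXT,
    publisher TEXT
);
INSERT INTO note (id, note_type, title) VALUES
    (1, 'shopping', 'Milk'),
    (2, 'book', 'A book'),
    (3, 'plain', 'Call plumber');
INSERT INTO shopping_list_item (note_id, price) VALUES (1, 1.5);
INSERT INTO book (note_id, author, publisher) VALUES (2, 'Some Author', 'Some Publisher');
""")

# Querying over all notes takes one LEFT JOIN per specialised table;
# columns that do not apply to a note simply come back as NULL (None).
all_notes = conn.execute("""
    SELECT n.title, s.price, b.author
    FROM note n
    LEFT JOIN shopping_list_item s ON s.note_id = n.id
    LEFT JOIN book b ON b.note_id = n.id
    ORDER BY n.id
""").fetchall()
```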
You could create something like a custom_field table. It gets pretty messy once you start to normalize.
So you have your note table with its common fields.
Now add:
dynamic_note_field
id  label
1   publisher
2   color
3   size
dynamic_note_field_data
id  dynamic_note_field_id  value
1   1                      Penguin
2   1                      Marvel
3   2                      Red
Finally, you can relate instances of your data with the fields they use through
note_dynamic_note_field_data
note_id  dynamic_note_field_data_id
1        1
1        3
2        2
So now we've said: note_id 1 has two additional fields. The first one has a value "Penguin" and represents a publisher. The second one has a value of "Red" and represents a color.
So what's the point of normalizing it this far?
You're not wasting space adding fields to every item (you relate a note with its additional dynamic field via the m2m table).
You're not storing redundant labels (you may continue to store redundant data, however, as the same publisher is likely to appear many times... this aspect is extremely subjective. If you want rich data about your publishers, you typically want to take the step of turning them into their own entity rather than an ad-hoc string. Be careful when making this leap because it adds an extra level of hairiness to the db. Evaluate the use case accordingly).
The dynamic_note_field acts as your data definition. If you're interested in answering a question such as "what are the additional fields I've created" this lets you do it easily without searching all of your dynamic_note_field_data. Eventually, you might add extra info to this table such as a type field. I like to create this separation off the bat, but that might be a violation of the YAGNI principle in your case.
Disadvantages:
It's not too bad to search for all notes that have a publisher, where that publisher is "Penguin".
What's tricky is something like "Find any note with a value of 'Penguin' in any field". You don't know up front which field you're searching. At this point you're better off with a separate index that's generated alongside your normalized db data, which acts as the point of truth. Again, the nice thing about normalization is that you maintain the data in a very lossless, non-destructive state.
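Here is a sketch of the three tables with the sample rows from above, plus the "publisher is Penguin" search, using Python's sqlite3 module. The `note` table, its `title` column, and the constraints are assumptions added so that the example runs.

```python
import sqlite3

# Same sample data as the tables above; the note table itself is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE note (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE dynamic_note_field (id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
CREATE TABLE dynamic_note_field_data (
    id INTEGER PRIMARY KEY,
    dynamic_note_field_id INTEGER NOT NULL REFERENCES dynamic_note_field (id),
    value TEXT NOT NULL
);
CREATE TABLE note_dynamic_note_field_data (
    note_id INTEGER NOT NULL REFERENCES note (id),
    dynamic_note_field_data_id INTEGER NOT NULL REFERENCES dynamic_note_field_data (id),
    PRIMARY KEY (note_id, dynamic_note_field_data_id)
);
INSERT INTO note VALUES (1, 'First note'), (2, 'Second note');
INSERT INTO dynamic_note_field VALUES (1, 'publisher'), (2, 'color'), (3, 'size');
INSERT INTO dynamic_note_field_data VALUES (1, 1, 'Penguin'), (2, 1, 'Marvel'), (3, 2, 'Red');
INSERT INTO note_dynamic_note_field_data VALUES (1, 1), (1, 3), (2, 2);
""")

# Shared join: note -> link table -> field data -> field definition.
joins = """
    FROM note n
    JOIN note_dynamic_note_field_data l ON l.note_id = n.id
    JOIN dynamic_note_field_data d ON d.id = l.dynamic_note_field_data_id
    JOIN dynamic_note_field f ON f.id = d.dynamic_note_field_id
"""

# All notes whose publisher is 'Penguin'.
penguin_notes = [r[0] for r in conn.execute(
    "SELECT n.id" + joins + "WHERE f.label = ? AND d.value = ? ORDER BY n.id",
    ("publisher", "Penguin"),
)]

# All extra fields attached to note 1.
note_1_fields = conn.execute(
    "SELECT f.label, d.value" + joins + "WHERE n.id = ? ORDER BY d.id", (1,)
).fetchall()
```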
For data you want to store but does not have to be searchable, another option is to serialize it to/from JSON and store it in a TEXT column. This gives you arbitrary structure, but you cannot readily query against those values.
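A minimal sketch of the JSON-in-a-TEXT-column idea, using Python's json and sqlite3 modules. The `extra` column name and the sample values are assumptions.

```python
import json
import sqlite3

# The "extra" column holds whatever structure a given note needs, as JSON text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE note (id INTEGER PRIMARY KEY, title TEXT NOT NULL, extra TEXT)"
)

extra = {"author": "Some Author", "publisher": "Some Publisher"}
conn.execute(
    "INSERT INTO note (id, title, extra) VALUES (?, ?, ?)",
    (1, "A book", json.dumps(extra)),
)

# Reading it back: the database only sees an opaque string,
# so the application has to decode it before using the fields.
stored = conn.execute("SELECT extra FROM note WHERE id = 1").fetchone()[0]
restored = json.loads(stored)
```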
Yet another option is to dump SQLite and go with an object database. I seem to recall there are one or two working for Android. I have not tried any of these, however.
Just create a small table which contains the common fields of all your notes.
Then create a table for each class of special note you have, containing all the extra fields plus a reference to your first table.
For each note you enter, you create a row in your main table (which contains the common fields) and a row in your extra table that contains the extra fields and a reference to the row in your main table.
Then you will just have to do a join in your query.
With this solution:
1) you have a safe design (you can't access fields that are not part of your note)
2) your db will be optimized

ID fields in SQL tables: rule or law?

Just a quick database design question: Do you ALWAYS use an ID field in EVERY table, or just most of them? Clearly most of your tables will benefit, but are there ever tables that you might not want to use an ID field?
For example, I want to add the ability to add tags to objects in another table (foo). So I've got a table FooTag with a varchar field to hold the tag, and a fooID field to refer to the row in foo. Do I really need to create a clustered index around an essentially arbitrary ID field? Wouldn't it be more efficient to use fooID and my text field as the clustered index, since I will almost always be searching by fooID anyway? Plus using my text in the clustered index would keep the data sorted, making sorting easier when I have to query my data. The downside is that inserts would take longer, but wouldn't that be offset by the gains during selection, which would happen far more often?
What are your thoughts on ID fields? Bendable rule, or unbreakable law?
edit: I am aware that the example provided is not normalized. If tagging is to be a major part of the project, with multiple tables being tagged, and other 'extras', a two-table solution would be a clear answer. However in this simplest case, would normalization be worthwhile? It would save some space, but require an extra join when running queries
As in much of programming: rule, not law.
Proof by exception: Some two-column tables exist only to form relationships between other more meaningful tables.
If you are making tables that bridge between two or more other tables and the only fields you need are the dual PK/FKs, then I don't know why you would need an ID column in there as well.
ID columns generally can be very helpful, but that doesn't mean you should go peppering them in at every occasion.
As others have said, it's a general, rather than absolute, rule and there are plenty of exceptions (tables with composite keys for example).
There are some rare but useful occasions where you might want to create an artificial ID in a table that already has a (usually composite) unique identifier. For example, in one system I created a table to store part numbers; although the part numbers are unique, they may actually change, so we added an arbitrary integer PartID. Not so common, but it's a typical real-world example.
In general, what you really want is, if at all possible, to have some way to uniquely identify a record. It could be an id field or it could be a unique index (which does not have to be on just one field). Any time I thought I could get away without creating a way to uniquely identify a record, I have been proven wrong. Not all tables have a natural key, though, and if they do not, you really need to have an id field of some kind. If you have a natural key, you could use that instead, but I find that even then I need an id field in most cases to avoid having to do too much updating when the natural key changes (it always seems to change). Plus, having worked with literally hundreds of databases concerning many, many different topics, I can tell you that a true natural key is rare. As others have mentioned, there is no need for an id field in a table that is simply there to join two tables that have a many-to-many relationship, but even this should have a unique index.
If you need to retrieve records from that table with unique id then yes. If you will retrieve them by some other composite key made up of foreign keys then no. The last thing you need is fields, data, and indexes that you do not use.
A clustered index does not need to be on primary key or a surrogate (identity column) either.
Your design, however, is not normalized. Typically for tagging, I use two tables: a table of tags (with a surrogate key) and a table of links from the tags to the subject table(s), using the surrogate key in the tag table and the primary key in the subject table. This allows your tags to apply to different entities (photos, articles, employees, locations, products, whatever). It allows you to enforce foreign key relationships to multiple tables, and also allows you to invent tag hierarchies and other things about the tag table.
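A rough sketch of that two-table layout for a single subject table, using Python's sqlite3 module. The names `tag` and `foo_tag` and the sample rows are assumptions. The link table has no ID of its own, because its composite primary key already identifies each row.

```python
import sqlite3

# Assumed names: foo is the subject table, tag holds the labels, foo_tag links them.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE foo (fooID INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tag (tagID INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
CREATE TABLE foo_tag (
    fooID INTEGER NOT NULL REFERENCES foo (fooID),
    tagID INTEGER NOT NULL REFERENCES tag (tagID),
    PRIMARY KEY (fooID, tagID)     -- no separate ID column needed
);
INSERT INTO foo VALUES (1, 'first'), (2, 'second');
INSERT INTO tag VALUES (1, 'red'), (2, 'blue');
INSERT INTO foo_tag VALUES (1, 1), (1, 2), (2, 2);
""")

# Tags for one foo row, found through the link table.
tags_of_first = [r[0] for r in conn.execute(
    "SELECT t.label FROM foo_tag ft JOIN tag t ON t.tagID = ft.tagID "
    "WHERE ft.fooID = ? ORDER BY t.label",
    (1,),
)]

# Attaching the same tag to the same row twice is rejected by the composite key.
try:
    conn.execute("INSERT INTO foo_tag VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```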
As far as the indexes on this design, it will be dictated by the usage patterns.
In general developers love having an ID field on all tables except for 'linking' tables because it makes development much easier, and I am no exception to this. DBA's on the other hand see no problem with making natural primary keys made up of 3 or 4 columns. It can be a butting of heads to try and get a good database design.