Implementing many-to-many with one "primary" value - sql

I have many products that can each be in many categories.
products: id, ...
products_categories: product_id, category_id
categories: id, ...
Now I want to have many products, each with one master category, and 0 or more secondary categories. I can think of two ways to model this in SQL.
Add an is_primary column to products_categories
OR
Add a primary_category_id column to products
What is the best way to implement this in pure SQL and/or ActiveRecord? I'm using PostgreSQL, for what it's worth.

I would go with the first option unless there is a good reason to choose the second (such as the cost of an extra join when fetching the primary category).
Reason: you will probably need to add the primary category to the products_categories table anyway, so that queries such as "get all categories for a product" stay uniform and simple.
Option 1 avoids storing the primary category twice, so it is simpler.

I would go with option (1). Because your products can belong to more than one category, the relationship attribute (that it's a 'primary' category) belongs in the table that defines the relationship.
I would even go further and suggest that instead of labeling the field 'is_primary', you should have the field labeled as 'association_type'. And instead of just adding a bit field, make it an integer field, and have all the association types defined. In your case today, there are only two association types - secondary and primary. The advantage is that this design is much more scalable. If tomorrow, you are asked to define a 'primary', a 'secondary' and all other tertiary categories, this design will be able to handle it, instead of having to add another field to designate the 'secondary' field.

It really depends on the exact details of what you're trying to accomplish. Here are some of the things to consider while deciding what's best for you. Other answers already tackled the first case, so I'm going to focus on the second one.
If you have primary_category_id:
It seems cleaner to have one field in product that tells which category is the primary one, than to have a field in every product_category which has 1 in one row and 0 in every other row, although the suggestion by M.R. to use association_type sounds clean too - but what's the chance you're going to have "tertiary" categories?
It's slightly easier to get to the primary category
It's easy to ensure every product always has a primary category (just make the field NOT NULL)
It automatically enforces that a product may only have one primary category
Should you also insert the primary category to products_categories?
Neither option is enforced.
If you don't, it's awkward to query all the categories
If you do, it's still easy to query, but without additional work, nothing guarantees the primary category is also inserted in the other table
If you use the is_primary method, you should somehow ensure that every product always has exactly one primary category.
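For what it's worth, the "no more than one primary" half of that guarantee can be handled with a partial unique index. The sketch below is only an illustration: it uses Python's sqlite3 module with an in-memory SQLite database as a stand-in for PostgreSQL (which also supports partial unique indexes), and the index name one_primary_per_product is made up. It does not guarantee that a product has at least one primary category; that part would still need separate handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products_categories (
    product_id  INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    is_primary  INTEGER NOT NULL DEFAULT 0 CHECK (is_primary IN (0, 1)),
    PRIMARY KEY (product_id, category_id)
);
-- Partial unique index: at most one row per product may have is_primary = 1.
-- (Hypothetical index name; only rows matching the WHERE clause are indexed.)
CREATE UNIQUE INDEX one_primary_per_product
    ON products_categories (product_id)
    WHERE is_primary = 1;
""")

conn.execute("INSERT INTO products_categories VALUES (1, 10, 1)")  # primary
conn.execute("INSERT INTO products_categories VALUES (1, 11, 0)")  # secondary


def try_second_primary():
    """Return True if a second primary category for product 1 is rejected."""
    try:
        conn.execute("INSERT INTO products_categories VALUES (1, 12, 1)")
    except sqlite3.IntegrityError:
        return True
    return False


second_primary_rejected = try_second_primary()
primary_count = conn.execute(
    "SELECT COUNT(*) FROM products_categories WHERE product_id = 1 AND is_primary = 1"
).fetchone()[0]
```

Any number of secondary rows per product is still allowed, because rows with is_primary = 0 never enter the index.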

What are each way's pros and cons?
Option 1. I can be sure that the primary category for a product is indeed one of its categories. But there may be a problem of ensuring that a product has no more than one primary category.
Option 2. This lets me make sure that a product has only one primary category. But then I don't seem to have a way to make sure that it's one of this same product's categories.
So, I would probably go for a third option, using a table Products_PrimaryCategories:
Products_PrimaryCategories: product_id, category_id
It seems the same as product_categories, but has some additional properties:
product_id has an associated unique index, making sure you can only have one primary category for each product;
(product_id, category_id) is a foreign key referencing products_categories (product_id, category_id) ensuring that a product's primary category is one of its categories (which implies that (product_id, category_id) should be products_categories's primary key).
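A minimal sketch of this third option, assuming SQLite (via Python's sqlite3) as a stand-in for PostgreSQL; the table is written in lower case and the sample ids are invented. SQLite only enforces foreign keys after PRAGMA foreign_keys = ON, which PostgreSQL does not need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE products_categories (
    product_id  INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    PRIMARY KEY (product_id, category_id)
);
CREATE TABLE products_primary_categories (
    product_id  INTEGER NOT NULL UNIQUE,   -- one primary category per product
    category_id INTEGER NOT NULL,
    -- The primary category must be one of the product's own categories.
    FOREIGN KEY (product_id, category_id)
        REFERENCES products_categories (product_id, category_id)
);
INSERT INTO products_categories VALUES (1, 10), (1, 11), (2, 20);
INSERT INTO products_primary_categories VALUES (1, 10);
""")


def rejected(sql):
    """Run one statement and report whether a constraint rejected it."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Category 99 is not among product 2's categories: the composite FK fails.
foreign_category_rejected = rejected(
    "INSERT INTO products_primary_categories VALUES (2, 99)")
# Product 1 already has a primary category: the unique index fails.
second_primary_rejected = rejected(
    "INSERT INTO products_primary_categories VALUES (1, 11)")
primary_rows = conn.execute(
    "SELECT product_id, category_id FROM products_primary_categories").fetchall()
```

As with option 1, nothing here forces every product to have a primary category; product 2 above has none.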

Related

Why add an ID field to the product categories of an online shopping database?

I have just started to learn about relational databases. When I studied the databases of online shopping websites, I found that many examples create a category table and add an ID field alongside the category name. I don't know why they need to create a category table and use the category ID as a foreign key to relate it to the products table. What will happen if I remove the category table and add the category name directly to the products table?
I think there are a lot of cases where you want a website menu showing your categories. This menu lets people view your categories (Men Clothing, Women Clothing, Kids, Accessories), and once they click one they can see the products relevant to it.
If you put the category name in the product row, it is very hard to update your menu content, because you need to loop over and group the categories in the products table. It is also harder to update a category name in the products table, as the same name could appear in many product records.
Whereas if you have a category table, you only need to maintain that table (view what is in it and update a record there if you want your menu to change).
For long-term maintenance, a category table is desirable.
I have also come across a case where I wanted an empty category just to show in the website menu (a menu item that contains no products), which is not possible without a category table.
By inserting just the category name you may complete your POC, but you need to understand what normalization is and why it is needed.
First normal form (1NF) : An entity type is in 1NF when it contains no repeating groups of data.
Second normal form (2NF) : An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its primary key.
Third normal form (3NF) : An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the primary key.
Source
What will happen if I remove the category table and add the category name directly to the products table?
Suppose you store the category with each product, and one day your boss tells you that you misspelled a category name. Which one?
"Theater" he says. Or did he say "theatre?" Which is correct? You check and find that "theater" and "theatre" are used close to evenly among the products that have either one.
So which spelling did your boss mean was the mistake, and which one is correct?
If you store the correct spelling in one place, in its own categories table, then you can be sure. You can correct it, and all the products that reference it will implicitly get the correction.
That's an argument for normalization, but keep in mind using an integer id is only a convention. It has nothing to do with normalization. You can use a string as a primary key of a table, and therefore you can use a string as a foreign key in a table that references it.
It's okay to use a non-integer for key columns. As long as there is one instance that stores the canonical value, it satisfies the goal of normalization -- that is to reduce data anomalies.
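To make the string-key point concrete, here is a rough sketch using Python's sqlite3 module (SQLite standing in for whatever database you use). The table layout, column names and sample products are assumptions for illustration only. The category name itself is the primary key, and ON UPDATE CASCADE carries a spelling correction to every product that references it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE categories (
    name TEXT PRIMARY KEY          -- the canonical spelling lives here, once
);
CREATE TABLE products (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    category TEXT NOT NULL REFERENCES categories (name) ON UPDATE CASCADE
);
INSERT INTO categories VALUES ('Theatre');
INSERT INTO products (title, category)
VALUES ('Opera glasses', 'Theatre'), ('Stage lamp', 'Theatre');
""")

# Correct the spelling in one place; the cascade updates every product row.
conn.execute("UPDATE categories SET name = 'Theater' WHERE name = 'Theatre'")
product_categories = [
    row[0] for row in conn.execute("SELECT DISTINCT category FROM products")
]


def misspelling_rejected():
    """A category that is not in the categories table cannot be used."""
    try:
        conn.execute(
            "INSERT INTO products (title, category) VALUES ('Curtain', 'Theatr')")
    except sqlite3.IntegrityError:
        return True
    return False


typo_rejected = misspelling_rejected()
```

The trade-off against an integer id is that a rename has to touch every referencing row, which the cascade does for you here.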

Relational Database Design: Conditionals

I'm designing a relational database that I plan to implement with SQL. I have a use case that I'm working on and seem to be having a bit of trouble thinking through the solution. The design is for an e-commerce order system.
Use Case:
The ORDER_DETAILS table contains a deliveryMethod attribute. I then have a SHIPPING_DETAILS table that contains address information and a PICKUP_DETAILS table that contains location, date, and time information for an in-person pickup. When a user places an order, they have the option to have their order shipped to their address or to pick up their order in person. My current thought is to have a shippingId foreign key and pickupId foreign key in the ORDER_DETAILS table. Then, basically run a conditional check on the deliveryMethod attribute and retrieve data from the appropriate table depending on the value of that attribute (either "shipping" or "pickup"). With this thought, however, I would be allowing for null values to be present in the ORDER_DETAILS for either the shippingId or the pickupId attributes. From my understanding, null values are viewed negatively in relational designs. So I'm looking for some feedback on this design. Is this okay? Am I overthinking the nulls? Is there a more efficient way to design this particular schema?
If I understand your problem correctly,
1. The cardinality of the relationship of ORDER to SHIPPING is 1 ---> (0, 1)
2. The cardinality of the relationship of ORDER to PICKUP is 1 ---> (0, 1)
3. An ORDER MUST have either a SHIPPING or a PICKUP, but not both.
To enforce the constraint (#3) you could define a functional constraint in the database. That gets into interesting stuff.
Anyway, like you say, you could make columns in ORDER that are FKs to the SHIPPING or PICKUP tables, but both of those are nullable. I don't think null FKs are evil or anything, but they do get messy especially if you had a whole bunch of delivery methods and not just two.
If you don't like the nulls, you could have separate association tables: (1) ORDER_DELIVERY, which has just an order_id and a delivery_id, each an FK to the respective table, and (2) ORDER_PICKUP, also a two-column table. In each case the primary key would be order_id. Now there are no nulls: the orders with delivery are in the ORDER_DELIVERY table and the orders with pickup are in ORDER_PICKUP.
Of course there's a tradeoff, as maintaining the constraint that there be one and only one delivery method is now a consistency check across tables.
Another idea is to make the delivery and pickup details be JSON fields. Here you are doing more work on the application side, enforcing constraints programmatically, but you won't have nulls.
I wish I could say that there was a slam-dunk go-to design pattern here, but I don't see one. Personally with only two types of delivery methods, I would not shy from having nulls (as I'm not a purist). But I do love it when the database does the work, so....
(Oh, the answer to the question "are you over thinking things?" is no, this thinking is really good!)
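For the two-nullable-FK variant, one way the "either shipping or pickup, but not both" rule could be pushed into the database is a CHECK constraint. This is a sketch under assumptions, not the only way to do it: it uses Python's sqlite3 with an in-memory SQLite database, keeps only the two id columns from the question, and leaves out the actual FOREIGN KEY clauses and the SHIPPING_DETAILS / PICKUP_DETAILS tables for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ORDER_DETAILS (
    id         INTEGER PRIMARY KEY,
    shippingId INTEGER,   -- would reference SHIPPING_DETAILS
    pickupId   INTEGER,   -- would reference PICKUP_DETAILS
    -- Exactly one of the two references must be present:
    -- the two IS NULL tests must disagree.
    CHECK ((shippingId IS NULL) <> (pickupId IS NULL))
);
""")


def accepted(shipping_id, pickup_id):
    """Try to insert an order; return True if the database accepts it."""
    try:
        conn.execute(
            "INSERT INTO ORDER_DETAILS (shippingId, pickupId) VALUES (?, ?)",
            (shipping_id, pickup_id),
        )
    except sqlite3.IntegrityError:
        return False
    return True


shipped_ok = accepted(1, None)      # shipping only
pickup_ok = accepted(None, 2)       # pickup only
neither_ok = accepted(None, None)   # no delivery method at all
both_ok = accepted(1, 2)            # both methods at once
```

With this in place the deliveryMethod attribute becomes derivable from which column is non-null, so it may not need to be stored separately.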

Understanding the role of foreign key constraints. Am I using them properly?

I need help in understanding the applicability of foreign keys when setting up constraints. I understand that the role of setting up foreign keys is to prevent orphaned data, but I have found a desire to put the foreign key in the child, which seems to break a pattern. Not sure if I am doing this right, and would like some advice if I have my constraints correctly.
Here is the design I have:
(1) I want all my "product"s to have a type of unit associate with the quantity. Units being like "Each", "Foot", "Gallon", etc, so between the quantity and the unit, you would have something like:
Quantity Unit
5 Gallons
I do not want to allow a bunch of crazy units, so I set this constraint up. This is pretty much by the book.
(2) I also believe that not all products will have an "Image", so I put the foreign key in the "ProductImage" table so I would not have "Product"s with a column with an empty row because I am also trying to "Normalize" the design.
The same issue with "FeeTypes" because not all "Product"s will have fees.
I feel guilt about breaking the pattern of putting the foreign key constraint in the child and not the parent. I just cannot wrap my head around "FeeType" being a parent. This conflict in logic is where I have the question.
Is my design correct, from a design perspective?
Am I still constraining the data properly?
Is there another "role" besides preventing orphaned data?
Thanks in advance.
There are three cases here (from the Product table's point of view):
1. Many-to-one relationship, e.g. many products having the same unit type - one unit type per product. In this case the foreign key must be in the Product table, referencing the primary key UnitType.UnitTypeID.
2. One-to-many relationship, e.g. one product can have multiple images - one image can belong to only one product. In this case the foreign key must be in the ProductImages table, referencing Product.ProductID.
3. Many-to-many relationship, e.g. any product can have many categories - any category might describe many products. In this case you will need a connection table that contains ProductID/CategoryID pairs, with the columns being foreign keys referencing Product.ProductID and Category.CategoryID respectively.
So, the design of UnitType (case 1.) and ProductImage (case 2.) tables is OK, but FeeType should probably be case 1. and Category should be case 3.
BTW, it would be perfectly OK to have NULL in a foreign key column; it would not break the rules of normalization. So, for example, if some products do not have fees associated, you can have NULL in the Product.FeeTypeID column. But you will need to use an outer join in your queries to ensure that products with no fees are not excluded from the results.
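A small sketch of the outer-join point, using Python's sqlite3 with made-up sample rows. Only Product.FeeTypeID comes from the discussion above; the other column names and the data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FeeType (
    FeeTypeID INTEGER PRIMARY KEY,
    Name      TEXT NOT NULL
);
CREATE TABLE Product (
    ProductID INTEGER PRIMARY KEY,
    Name      TEXT NOT NULL,
    FeeTypeID INTEGER REFERENCES FeeType (FeeTypeID)  -- NULL means "no fee"
);
INSERT INTO FeeType VALUES (1, 'Disposal');
INSERT INTO Product VALUES (1, 'Paint', 1), (2, 'Brush', NULL);
""")

# Inner join: the product without a fee type silently disappears.
inner_rows = conn.execute("""
    SELECT p.Name, f.Name
    FROM Product p
    JOIN FeeType f ON f.FeeTypeID = p.FeeTypeID
    ORDER BY p.ProductID
""").fetchall()

# Left outer join: every product is kept, with NULL for the missing fee.
outer_rows = conn.execute("""
    SELECT p.Name, f.Name
    FROM Product p
    LEFT JOIN FeeType f ON f.FeeTypeID = p.FeeTypeID
    ORDER BY p.ProductID
""").fetchall()
```

Here inner_rows contains only the 'Paint' row, while outer_rows also contains ('Brush', None).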

Should I use an index column in a many to many "link" table?

I have two tables, products and categories which have a many to many relationship, so I'm adding a products_categories table which will contain category_id and product_id.
Should I add another (auto-incrementing) index column or use the two existing ones as primary key?
That depends.
Do you see your data more as a set of objects (with the relational database as just a storage medium), or as a set of facts represented and analyzed natively by relational algebra?
Some ORMs/frameworks/tools don't have good support for multi-column primary keys. If you happen to use one of them, you'll need an additional id column.
If it's just a many-to-many relationship with no additional data associated with it, it's better to avoid the additional id column and have both columns as the primary key.
If you start adding additional information to this association, it may reach a point where it becomes something more than a many-to-many relationship between two entities. It becomes an entity in its own right, and it would be more convenient if it had its own id, independent of the entities it connects.
You don't need to add an extra, auto-incrementing index column, but I (perhaps contrary to most others) still recommend that you do. First, it is easier in the application program to refer to a row using a single number, for example when you delete a row. Second, it sometimes turns out to be useful to be able to know the order in which the rows were added.
No, it's not necessary at all, given that these two columns are already executing the function of a primary key.
This third column would just add more space to your table.
But... you could maybe use it to see the order in which your records were added to the table. That's the only function I can see for this column.
You don't need to add an auto-incrementing index column. Standard practice is to use just the two existing columns as your primary key for M:M association tables like you describe.
I would make the primary key category_id and product_id. Add an auto-increment column only if the insertion order will ever be relevant later.
There's a conceptual question - is products_categories an entity, or is it simply a table that represents a relationship between two entities? If it's an entity then, even if there are no additional attributes, I'd advocate a separate ID column for the entity. If it's a relationship then, even if there are additional attributes (say, begin_date, end_date or something like that), I'd advocate a multi-column primary key.
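For reference, the two-column primary key that several answers recommend might look like the sketch below (Python's sqlite3, in-memory database, invented ids). The secondary index and its name are an assumption added for lookups by category; nothing in the answers above requires it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products_categories (
    product_id  INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    PRIMARY KEY (product_id, category_id)
);
-- The primary key index leads with product_id; this optional index
-- (hypothetical name) serves "all products in a category" lookups.
CREATE INDEX products_categories_by_category
    ON products_categories (category_id, product_id);
INSERT INTO products_categories VALUES (1, 10), (1, 11), (2, 10);
""")


def duplicate_rejected():
    """The composite key stops the same pair from being linked twice."""
    try:
        conn.execute("INSERT INTO products_categories VALUES (1, 10)")
    except sqlite3.IntegrityError:
        return True
    return False


pair_is_unique = duplicate_rejected()

# A row is addressed by its pair of keys; no surrogate id is needed to delete it.
deleted = conn.execute(
    "DELETE FROM products_categories WHERE product_id = ? AND category_id = ?",
    (1, 11),
).rowcount
```

A surrogate id column on its own would not prevent duplicate pairs; you would still want a unique constraint on (product_id, category_id).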

Use of null values in related tables with foreign key constraints

I have the following tables:
Categories
CategoryID (int) Primary Key
CategoryName (varchar)
Items
ItemID (int) Primary Key
CategoryID (int)
ItemName (varchar)
There is a foreign key constraint on Items.CategoryID. There is a chance that when a new item is created that there will be no category assigned.
Is it better to set Items.CategoryID to allow nulls and deal with the nulls in my code OR better to not allow nulls, set the default CategoryID to 1, and create a dummy record in the Categories table called "Uncategorized" and then deal with that dummy category in my code?
The logically correct way would be for the CategoryID column to be NULL when there is no Category for the item.
If you get trapped by any of the gotchas associated with using NULL, that is most likely a sign that the design hasn't taken account of the fact that an item may have no category. Fix the design. The NULL will ensure you stick to solving the correct problem.
It depends:
If your items really have no category, then I would allow NULLs, as that is what you have: no CategoryId.
If you want to list all categories, you do not want to display the dummy row, so you would have to ignore that.
If you want to display all items and show the categories, you'd better be aware that there are items without category, so you would use a LEFT JOIN in that case.
If possible, change your application to select a category before actually saving your item.
If you want to treat that Uncategorized category just like the other categories (list it with the other categories, count the items assigned to it, select it in lists/dropdowns), then it should get its own row in the Categories table, and Items.CategoryID should be NOT NULL.
Ideally you'd want to force a category choice before allowing an item to be created. If an item will have no category at any point in the future then you'll need to create a category specifically to deal with that. I personally wouldn't call it "Uncategorized" though as this implies that a user can just chase it up later - which they will forget to do with alarming regularity!
Go for logical consistency or you'll end up in a mess. If that means creating a "Miscellaneous" category then do that and make sure that (a) Users know when to use it and (b) It is reported on regularly to make sure items are categorised correctly.
For simple lookup tables of this type it is almost always better to disallow NULLs and have the unknown value in your lookup table.
Why?
Because the ANSI NULL specifications are inconsistent and very complex. Dealing with NULLs greatly increases the likelihood of coding defects and takes a lot more code.
Because few developers really understand how NULLs work in all scenarios
Because it simplifies your model and queries nicely. You can join things together nicely with inner joins from either direction with very simple sql.
However, a few cautions:
You may want more than one "dummy" value: one for "unknown" and another for "not assigned". Of course, NULL bundles both into a single value, so you're going above & beyond the minimal standard if you do this.
You will sometimes end up with additional non-key attributes that either must be nullable or carry 'n/a' type values for the dummy rows. For heavily denormalized lookup tables (like warehousing dimensions) you'll probably want NULLs allowed for these columns, because 'n/a' doesn't work well for timestamps, amounts, etc.
If you apply this technique to more than just simple lookup tables it will dramatically complicate your design. Don't do that.
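As an illustration of the kind of NULL gotcha meant here, the sketch below uses the Items table from the question with invented rows (Python's sqlite3, in-memory database). A filter on CategoryID <> 1 quietly drops the uncategorized item, because comparing NULL with anything yields NULL rather than true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Items (
    ItemID     INTEGER PRIMARY KEY,
    CategoryID INTEGER,             -- nullable: NULL means "no category"
    ItemName   TEXT
);
INSERT INTO Items VALUES
    (1, 1, 'Hammer'),
    (2, 2, 'Saw'),
    (3, NULL, 'Mystery box');
""")

# "Everything not in category 1" misses the uncategorized item,
# because NULL <> 1 evaluates to NULL, which WHERE treats as not true.
naive = [row[0] for row in conn.execute(
    "SELECT ItemName FROM Items WHERE CategoryID <> 1 ORDER BY ItemID")]

# The NULL case has to be spelled out explicitly.
explicit = [row[0] for row in conn.execute(
    "SELECT ItemName FROM Items "
    "WHERE CategoryID <> 1 OR CategoryID IS NULL ORDER BY ItemID")]
```

With a sentinel "Uncategorized" row and a NOT NULL column, the first query would already return both items.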
SQL NULLs are tricky, so I think you're better off with a sentinel category.
In this case I believe it's really a matter of personal preference. Either way you'll have to deal with the uncategorized items in your code.
I do not believe that either of the alternatives are very good.
If you choose the NULL approach, you will have problems with the gotchas involved in working with NULLs. If you choose not to allow NULLs, you will need to handle the case where deleting a category cascades to its items.
IMO the best design is to have three tables.
Categories
ID
Name
Items
ID
Name
Categories2Items
CategoryID
ItemID
This eliminates the need for NULL (and the gotchas involved), and it allows you to have uncategorized items as well as items that belong to several categories. This design is also in Boyce-Codd normal form, which is always a good thing.
en.wikipedia.org/wiki/BCNF
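A rough sketch of the three-table design, with made-up rows, using Python's sqlite3 and an in-memory database. The composite primary key on Categories2Items is an assumption; the answer above only lists the two columns. An uncategorized item is simply one with no row in Categories2Items.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE Categories (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Items      (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Categories2Items (
    CategoryID INTEGER NOT NULL REFERENCES Categories (ID),
    ItemID     INTEGER NOT NULL REFERENCES Items (ID),
    PRIMARY KEY (CategoryID, ItemID)
);
INSERT INTO Categories VALUES (1, 'Tools'), (2, 'Garden');
INSERT INTO Items VALUES (1, 'Spade'), (2, 'Hammer'), (3, 'Mystery box');
-- Spade is in two categories, Hammer in one, Mystery box in none.
INSERT INTO Categories2Items VALUES (1, 1), (2, 1), (1, 2);
""")

# Uncategorized items: no link row exists, and no NULL column is involved.
uncategorized = [row[0] for row in conn.execute("""
    SELECT i.Name
    FROM Items i
    WHERE NOT EXISTS (
        SELECT 1 FROM Categories2Items ci WHERE ci.ItemID = i.ID
    )
""")]

# All categories of one item (the Spade, ID 1).
spade_categories = [row[0] for row in conn.execute("""
    SELECT c.Name
    FROM Categories c
    JOIN Categories2Items ci ON ci.CategoryID = c.ID
    WHERE ci.ItemID = 1
    ORDER BY c.Name
""")]
```

Listing items together with their categories still needs a LEFT JOIN through the link table if uncategorized items should appear, so the outer-join consideration does not go away entirely.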