Products database design for product lines, categories, manufacturers, related software, product attributes, etc - sql

I am redeveloping the front end and database for a medium size products database so that it can support categories/subcategories, product lines, manufacturers, supported software and product attributes. Right now there is only a products table. There will be pages for products by line, by category/subcategory, by manufacturer, by supported software (optional). Each page will have additional filtering based on the other classifications.
Categories/Subcategories (multi level)
Products and product lines can be assigned to multiple category trees. Up to 5 levels deep should be supported.
Product lines (single level)
Groups of products. Product can only be in single product line.
Manufacturers (single level)
Products and product lines can be assigned to single manufacturer.
Supported software (single level)
Certain products only work with specific software, so a product/line can be assigned to none, one or more software packages.
Attributes (type / options - could be treated so each type is a category and items are children)
Products and product lines can be assigned attributes (eg - color > red / blue / green). Attributes should be able to be assigned to one or more categories.
Since all these items are basically types of subcategories, do I put them all together in a master table OR split them into separate tables for each one?
Master table idea:
ClassificationTypes (product line, category/sub, manufacturer, software, attribute would all be types)
-TypeID
-Name
Classifications
-ClassID
-TypeID
-ParentClassID
-Name
ClassificationsProductsAssociations
-ProductID
-ClassID
I would still need at least one more table to link types together (eg - to link attributes to a category) and a way to link product lines to various types.
If I go with a table for each type it can get messy quick and I will still need a way to link everything together.
Multiple table setup:
Categories
-CategoryID
-Name
-ParentCategoryID
CategoriesAssociations
-CategoryID
-ProductID
-ProductLineID ?
Attributes
-AttributeID
-Name
-ParentAttributeID (use this as the parent would be "color" and child would be "red")
AttributesAssociations
-AttributeID
-ProductID
-CategoryID (do I also need to link the category to the parent attribute?)
CompatibleSoftware
-SoftwareID
-Name
CompatibleSoftwareAssociations
-SoftwareID
-ProductID
-ProductLineID ?
Manufacturers
-ManufacturerID
-Name
ProductLines
-ProductLineID
-ManufacturerID
-Name
Products
-ProductID
-ProductLineID
-ManufacturerID
-Name
Other option for associations is to have a single associations table to link the tables above:
Master Associations
-ProductID
-ProductLineID
-ManufacturerID
-CategoryID
-SoftwareID
-AttributeID
What is the best solution?

Go for multiple tables, it makes the design more obvious and more extensible, in my opinion. While it may fit your solution now, further changes may be more difficult.

I agree with Paddy. It makes your life easier in the future and you are much more flexible. You might want to put in stock control and other stuff. To link everything together, use the integer IDs (parent/child) of the tables.

I think multiple tables is the way to go, but to really know, do this: Flesh out the design for both ways and then take a sample of 5-10 products.
Populate the tables in both designs for the 5-10 products.
Now start writing the queries for both ways. You will start to see which is easier to write (the single table I bet), and you might find cases that only work in one design (the multi-table I bet.)
When you are done you have not lost the work -- you can use the table schema to move forward and some of your queries will already be written.
If you get to a query that does not make sense, seems too complicated, or similar, you can post it here and get feedback -- having real code always gets better comments.

Just wanted to post my decision and since I was not satisfied with any of the answers provided, I have elected to answer my own question.
I ended up setting up a single set of tables:
Classification Types (eg - product lines, categories, manufacturers, etc)
Classifications (supporting parent/child adjacency list, nested sets, and materialized path all at once in order to take advantage of strengths of each. I have a SQL CTE that can populate all the fields in one go when the data changes)
Classifications Relations (with ability to relate products to classifications, relate classifications to other classifications and also relate classifications to other types)
I will admit that the solution is not 100% normalized, but this setup gives me ultimate flexibility to expand by creating new types and is very powerful and easy to query.
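The CTE mentioned above is not shown in the answer. As a minimal sketch of one part of the idea, the following fills a materialized path column from the parent/child adjacency list with a recursive CTE. It uses SQLite through Python as a stand-in engine; the Path column, the ' > ' separator, and the sample rows are assumptions for illustration, and nested-set values are not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ClassificationTypes (
    TypeID INTEGER PRIMARY KEY,
    Name   TEXT NOT NULL UNIQUE
);
CREATE TABLE Classifications (
    ClassID       INTEGER PRIMARY KEY,
    TypeID        INTEGER NOT NULL REFERENCES ClassificationTypes(TypeID),
    ParentClassID INTEGER REFERENCES Classifications(ClassID),
    Name          TEXT NOT NULL,
    Path          TEXT  -- hypothetical materialized path column
);
INSERT INTO ClassificationTypes VALUES (1, 'category'), (2, 'attribute');
INSERT INTO Classifications (ClassID, TypeID, ParentClassID, Name) VALUES
    (1, 1, NULL, 'Electronics'),
    (2, 1, 1,    'Audio'),
    (3, 1, 2,    'CD Players'),
    (4, 2, NULL, 'Color'),
    (5, 2, 4,    'Red');
""")


def rebuild_paths(conn):
    """Recompute Path for every row by walking the adjacency list."""
    rows = conn.execute("""
        WITH RECURSIVE tree(ClassID, Path) AS (
            -- roots: rows without a parent
            SELECT ClassID, Name FROM Classifications
            WHERE ParentClassID IS NULL
            UNION ALL
            -- children: append the child's name to the parent's path
            SELECT c.ClassID, tree.Path || ' > ' || c.Name
            FROM Classifications AS c
            JOIN tree ON c.ParentClassID = tree.ClassID
        )
        SELECT Path, ClassID FROM tree
    """).fetchall()
    conn.executemany(
        "UPDATE Classifications SET Path = ? WHERE ClassID = ?", rows
    )
    conn.commit()


rebuild_paths(conn)
paths = dict(conn.execute("SELECT ClassID, Path FROM Classifications"))
```

With a path stored per row, listing everything under a branch becomes a prefix match on Path, while ParentClassID stays available for simple parent/child lookups.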

Related

Modeling N-to-N with DynamoDB

I'm working in a project that uses DynamoDB for most persistent data. I'm now trying to model a data structure that more resembles what one would model in a traditional SQL database, but I'd like to explore the possibilities for a good NoSQL design also for this kind of data.
As an example, consider a simple N-to-N relation such as items grouped into categories. In SQL, this might be modeled with a connection table such as
items
-----
item_id (PK)
name
categories
----------
category_id (PK)
name
item_categories
---------------
item_id (PK)
category_id (PK)
To list all items in a category, one could perform a join such as
SELECT items.name FROM items
JOIN item_categories ON items.item_id = item_categories.item_id
WHERE item_categories.category_id = ?
And to list all categories to which an item belongs, the corresponding query could be made:
SELECT categories.name FROM categories
JOIN item_categories ON categories.category_id = item_categories.category_id
WHERE item_categories.item_id = ?
Is there any hope in modeling a relation like this with a NoSQL database in general, and DynamoDB in particular, in a fairly efficient way (not requiring a lot of (N, even?) separate operations) for a simple use-case like the ones above - when there is no equivalent of JOINs?
Or should I just go for RDS instead?
Things I have considered:
Inline categories as an array within item. This makes it easy to find the categories of an item, but does not solve getting all items within a category. And I would need to duplicate the needed attributes such as category name etc within each item. Category updates would be awkward.
Duplicate each item for each category and use category_id as range key, and add a GSI with the reverse (category_id as hash, item_id as range). De-normalizing being common for NoSQL, but I still have doubts. Possibly split items into items and item_details and only duplicate the most common attributes that are needed in listings etc.
Go for a connection table mapping items to categories and vice versa. Use [item_id, category_id] as key and [category_id, item_id] as GSI, to support both queries. Duplicate the most common attributes (name etc) here. To get all full items for a category I would still need to perform one query followed by N get operations though, which consumes a lot of CUs. Updates of item or category names would require multiple update operations, but that is not too difficult.
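To make the access pattern of the connection-table option concrete, here is a small in-memory sketch. It is not DynamoDB API code: the class and method names are hypothetical, and two dictionaries stand in for the table keyed by [item_id, category_id] and for the reversed index keyed by [category_id, item_id], with the names duplicated into each row.

```python
from collections import defaultdict


class ItemCategoryTable:
    """In-memory stand-in for a connection table plus a reversed index."""

    def __init__(self):
        self.by_item = defaultdict(dict)      # item_id -> {category_id: row}
        self.by_category = defaultdict(dict)  # category_id -> {item_id: row}

    def put(self, item_id, category_id, item_name, category_name):
        # Names are duplicated into the row so listings need no second lookup.
        row = {
            "item_id": item_id,
            "category_id": category_id,
            "item_name": item_name,
            "category_name": category_name,
        }
        self.by_item[item_id][category_id] = row
        self.by_category[category_id][item_id] = row

    def categories_of(self, item_id):
        """Single lookup by item, as with the table's own key."""
        rows = self.by_item.get(item_id, {}).values()
        return sorted(r["category_name"] for r in rows)

    def items_in(self, category_id):
        """Single lookup by category, as with the reversed index."""
        rows = self.by_category.get(category_id, {}).values()
        return sorted(r["item_name"] for r in rows)


table = ItemCategoryTable()
table.put("i1", "c1", "Lamp", "Lighting")
table.put("i1", "c2", "Lamp", "Office")
table.put("i2", "c2", "Desk", "Office")
```

Both listings are answered by one lookup each, at the cost of rewriting every duplicated row when an item or category name changes.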
The dilemma I have is that the format of the data itself suits a document database perfectly, while the relations I need fit an SQL database. If possible I'd like to stay with DynamoDB, but obviously not at any cost...
You are already looking in the right direction!
In order to make an informed decision you will need to also consider the cardinality of your data:
Will you be expecting to have just a few (fewer than ten?) categories? Or quite a lot (i.e. hundreds, thousands, tens of thousands etc.)?
How about items per category: do you expect to have many categories with a few items in each, or lots of items in a few categories?
Then, you need to consider the cardinality of the total data set and the frequency of various types of queries. Will you most often need to retrieve only items in a single category? Or will you mostly be querying to retrieve items individually, and you just need statistics for the number of items per category etc.?
Finally, consider the expected growth of your dataset over time. DynamoDB will generally outperform an RDBMS at scale as long as your queries partition well.
Also consider the acceptable latency for each type of query you expect to perform, especially at scale. For instance, if you expect to have hundreds of categories with hundreds of thousands of items each, what does it mean to retrieve all items in a category? Surely you wouldn't be displaying them all to the user at once.
I encourage you to also consider another type of data store to accompany DynamoDB if you need statistics for your data, such as ElasticSearch or a Redis cluster.
In the end, if aggregate queries or joins are essential to your use case, or if the dataset at scale can generally be processed comfortably on a single RDBMS instance, don't try to fit a square peg in a round hole. A managed RDBMS solution like Aurora might be a better fit.

Structuring Categories in SQL Server 2008 R2

Hello I looked at a few similar posts to what I am looking to do but none are the same to what I need to accomplish. I am trying to come up with my structure for categories using SQL Server 2008 R2.
I want to make categories for lets say...Clothing, Electronics, Furniture, Tools......and so on.
I am looking at a 3 field table to start with a category table (category ID (PK), categoryname, parentID) which from what I am finding is a standard practice and can go several layers deep without having to restructure.
The problem lies where it is fine for lets say (electronics-cd players-cd changer), (electronics-lighting-studio lighting) or (clothing-womens-skirts), (clothing-womens-pants) perhaps one level deeper?
What do I do for brands? I was planning to have a brand table (brandID(PK),Brand)
then Category_Brand table (categoryID, BrandID) to link brands to categories when I want to use a cascading dropdown list that populates from the database.
What do I do for deeper attributes where the rest of the attributes apply to the item itself, but are dependent on the category? color, pattern, material, size? which can apply to clothing, but not to electronics or tools, also Mens clothing has different sizing than womens clothing.
Or furniture where I want to store dresser dimensions and color, or beds where I want to store bed size (king, queen, twin) and to store the type (Spring, air, foam, water)
What I need is to connect the item-specific attributes to each item based on which category the item belongs to. On another forum it was suggested that I just add all the misc. attributes to the item table and leave the ones I don't use null. That doesn't make sense to me; it seems there should be different sub-attribute tables with fields related to the categories they represent. I am thinking that clothing size, for example, would have a lookup table where each size has a (sizeid), plus a link table for a many-to-many relationship to connect the size with the (itemid). There would either need to be a few different size tables, because men's sizes and women's sizes are different, or they could all go in one table with the (categoryid) as a sort of parent foreign key. Would dimensions for another item like (length, width, height) then be stored in their own table along with the (itemid) as the foreign key?
Or is it a good idea to store the (sizeid) or (dimensionid) right into the item table?
This seemed simple to me when I started, but the more I look at it, the more confused I get about the correct way to structure this. I want it to perform well, as this may become a high-volume application. But doesn't everyone wish that?
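The single size table with a category foreign key described above could look roughly like this. It is only a sketch of that one option, run on SQLite through Python as a stand-in for SQL Server; the table names, column names, and sample sizes are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Category (
    CategoryID   INTEGER PRIMARY KEY,
    CategoryName TEXT NOT NULL,
    ParentID     INTEGER REFERENCES Category(CategoryID)
);
CREATE TABLE Item (
    ItemID     INTEGER PRIMARY KEY,
    ItemName   TEXT NOT NULL,
    CategoryID INTEGER NOT NULL REFERENCES Category(CategoryID)
);
-- One size lookup table; CategoryID says which category a size belongs to.
CREATE TABLE Size (
    SizeID     INTEGER PRIMARY KEY,
    CategoryID INTEGER NOT NULL REFERENCES Category(CategoryID),
    Label      TEXT NOT NULL
);
-- Link table: many-to-many between items and sizes.
CREATE TABLE ItemSize (
    ItemID INTEGER NOT NULL REFERENCES Item(ItemID),
    SizeID INTEGER NOT NULL REFERENCES Size(SizeID),
    PRIMARY KEY (ItemID, SizeID)
);
INSERT INTO Category VALUES (1, 'Clothing', NULL), (2, 'Mens', 1), (3, 'Womens', 1);
INSERT INTO Size VALUES (1, 2, '32x34'), (2, 2, '34x34'), (3, 3, '6'), (4, 3, '8');
""")


def sizes_for_category(category_id):
    """Sizes to offer in a dropdown once a category has been chosen."""
    rows = conn.execute(
        "SELECT Label FROM Size WHERE CategoryID = ? ORDER BY SizeID",
        (category_id,),
    )
    return [label for (label,) in rows]
```

Calling sizes_for_category with the chosen category ID is the kind of lookup a cascading dropdown would need.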
try to understand normalization first. Here is a good article for you.

Delimited string of ids as a field or a separate table?

I have a database in which I store a large amount of user-created products.
The user can also create "views" which work as containers holding these products; let's call them categories.
It would be simple if each product had a category-field containing the id of the category, however each product can be added to multiple categories so that doesn't work.
Presently I solve this by having a string-field "products" in the category-table which is a comma-separated list of product-ids.
What I'm wondering is basically if it's "okay" to do it this way? Is it generally accepted? Will it cause some kind of problem I'm not realizing?
Would it be better to create another table named something like productsInCategories which has 2 fields, one with a category-id and one with product-id and link them together this way?
Will one of these methods perform better or be better in some other way?
I'm using sqlce at the moment if that matters, but that will most likely change soon.
I would go for the second option: a separate table.
Makes it easier to handle if you need to query from the product perspective. Also the join to the categories will be simple and fast. This is exactly what relational databases are made for.
Imagine a simple query like what categories a product is in. With your solution you need to check all categories one by one, parse the csv-list of each category to find the products. With a separate table it is one clean query.
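As a sketch of that "one clean query", here is the separate-table layout with the lookup from the product side. It runs on SQLite through Python rather than sqlce, and the sample rows and the exact column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products   (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE categories (category_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One row per (category, product) pair replaces the comma-separated list.
CREATE TABLE productsInCategories (
    category_id INTEGER NOT NULL REFERENCES categories(category_id),
    product_id  INTEGER NOT NULL REFERENCES products(product_id),
    PRIMARY KEY (category_id, product_id)
);
INSERT INTO products   VALUES (1, 'Lamp'), (2, 'Desk');
INSERT INTO categories VALUES (10, 'Office'), (20, 'Lighting'), (30, 'Garden');
INSERT INTO productsInCategories VALUES (10, 1), (20, 1), (10, 2);
""")


def categories_for(product_id):
    """All categories a product is in, without parsing any string lists."""
    rows = conn.execute("""
        SELECT c.name
        FROM categories AS c
        JOIN productsInCategories AS pc ON pc.category_id = c.category_id
        WHERE pc.product_id = ?
        ORDER BY c.name
    """, (product_id,))
    return [name for (name,) in rows]
```

The composite primary key also prevents the same product from being added to a category twice, which a delimited string cannot enforce.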

Advice on database model for ecommerce with custom products

I need some advice on modeling an ecommerce domain.
The client sells two products:
Custom art work, the design specified by the customer.
Prints of art with a message on the back specified by the customer.
Here is my cut down database model so far.
Products:
Id
Description
Price
Orderlines:
Id
OrderId
ProductId
Attributes:
Id
Name
OrderAttributes:
AttributeId
OrderlineId
Value
The products table will have the 2 products from above.
The order line links the selected product to an order.
The attributes holds the custom field names for each product.
For example the custom artwork product would have the attribute design.
The order attributes table links the ordered product to its custom attributes and holds the value.
For example custom artwork product, with an attribute of design, with a value of paint a house.
I would also like to map this database model to code as well using nhibernate.
Is there a better way of modeling this data?
A couple of suggestions:
The Orderlines table should contain the price (and possibly the description) of the product so that item prices can change without affecting existing orders. Similarly, the Orders table (not shown) should contain customer information (e.g. shipping address) that may change. The data that makes up an order can't change and the easiest approach is to flatten and denormalize it.
The OrderAttributes structure is called an entity-attribute-value model and it has many drawbacks. In general I recommend avoiding it and adding the needed columns to the Orderlines table. If needed, your application can subclass Product and OrderLine so that a CustomArtWorkProduct creates a CustomArtWorkOrderLine when it's added to an order.
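A rough sketch of both suggestions, assuming SQLite through Python as a stand-in database: the order line copies the price and description at order time and carries the custom fields as plain columns. UnitPrice, Design, and BackMessage are hypothetical column names, and the prices are sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (
    Id          INTEGER PRIMARY KEY,
    Description TEXT NOT NULL,
    Price       NUMERIC NOT NULL
);
CREATE TABLE Orderlines (
    Id          INTEGER PRIMARY KEY,
    OrderId     INTEGER NOT NULL,
    ProductId   INTEGER NOT NULL REFERENCES Products(Id),
    UnitPrice   NUMERIC NOT NULL,  -- copied from Products at order time
    Description TEXT NOT NULL,     -- copied from Products at order time
    Design      TEXT,              -- custom art work only
    BackMessage TEXT               -- prints only
);
INSERT INTO Products VALUES (1, 'Custom art work', 200), (2, 'Art print', 25);
""")


def add_order_line(order_id, product_id, design=None, back_message=None):
    """Snapshot the product's current price and description into the line."""
    conn.execute("""
        INSERT INTO Orderlines
            (OrderId, ProductId, UnitPrice, Description, Design, BackMessage)
        SELECT ?, Id, Price, Description, ?, ?
        FROM Products WHERE Id = ?
    """, (order_id, design, back_message, product_id))


add_order_line(1, 1, design="paint a house")
# A later price change must not touch the existing order.
conn.execute("UPDATE Products SET Price = 250 WHERE Id = 1")
ordered_price = conn.execute(
    "SELECT UnitPrice FROM Orderlines WHERE OrderId = 1"
).fetchone()[0]
current_price = conn.execute(
    "SELECT Price FROM Products WHERE Id = 1"
).fetchone()[0]
```

The order line keeps the price that applied when the order was placed, while the product row carries the new one.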
In an object-oriented program relations are expressed as associations.
That is:
If Product has Orders then Product must have a collection of Orders.
If an Order is for a Product, the Order must have a property Product.
and so on.
In object-oriented programming you don't associate by an identifier: you don't need this because this is a different world ruled by hierarchical data.
Honestly, if you follow what I said before, NHibernate will be a very powerful tool, as it will be capable of loading objects and properties without your intervention.
Think about "getting all orders of some product": you're not going to intentionally execute an SQL join; you're going to access the Orders property of Product, and NHibernate will translate this access to the database world.
This is the point of using an OR/M. It's not just "I map tables as is". It's about joining two very different worlds: the object-oriented hierarchical world with relational data with no pain.
Check this very old (2004!) CodeProject article and how it creates the Northwind SQL Server database-based model:
http://www.codeproject.com/Articles/8773/NHibernate-in-real-world-applications
Don't pay attention to how it maps the model to the database, but to the model design.
Check this article, it's more modern than the other one:
http://litemedia.info/introduction-to-nhibernate

Do these database design styles (or anti-pattern) have names?

Consider a database with tables Products and Employees. There is a new requirement to model current product managers, being the sole employee responsible for a product, noting that some products are simple or mature enough to require no product manager. That is, each product can have zero or one product manager.
Approach 1: alter table Product to add a new NULLable column product_manager_employee_ID so that a product with no product manager is modelled by the NULL value.
Approach 2: create a new table ProductManagers with non-NULLable columns product_ID and employee_ID, with a unique constraint on product_ID, so that a product with no product manager is modelled by the absence of a row in this table.
There are other approaches but these are the two I seem to encounter most often.
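For concreteness, the two approaches can be put side by side. This is a sketch on SQLite through Python; the tables are named Products1 and Products2 only so both variants can coexist in one database, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (employee_ID INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Approach 1: NULLable column; NULL means "no product manager".
CREATE TABLE Products1 (
    product_ID INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    product_manager_employee_ID INTEGER REFERENCES Employees(employee_ID)
);

-- Approach 2: separate table; a missing row means "no product manager".
CREATE TABLE Products2 (product_ID INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE ProductManagers (
    product_ID  INTEGER NOT NULL UNIQUE REFERENCES Products2(product_ID),
    employee_ID INTEGER NOT NULL REFERENCES Employees(employee_ID)
);

INSERT INTO Employees VALUES (1, 'Ann');
INSERT INTO Products1 VALUES (100, 'Widget', 1), (101, 'Bolt', NULL);
INSERT INTO Products2 VALUES (100, 'Widget'), (101, 'Bolt');
INSERT INTO ProductManagers VALUES (100, 1);
""")

# Product name with manager name (or NULL), one query per approach.
query1 = """
    SELECT p.name, e.name
    FROM Products1 AS p
    LEFT JOIN Employees AS e
           ON e.employee_ID = p.product_manager_employee_ID
    ORDER BY p.product_ID
"""
query2 = """
    SELECT p.name, e.name
    FROM Products2 AS p
    LEFT JOIN ProductManagers AS pm ON pm.product_ID = p.product_ID
    LEFT JOIN Employees AS e ON e.employee_ID = pm.employee_ID
    ORDER BY p.product_ID
"""
result1 = conn.execute(query1).fetchall()
result2 = conn.execute(query2).fetchall()
```

Both queries return the same rows; the difference is whether the missing manager is stored as a NULL or appears only as the result of an outer join.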
Assuming these are both legitimate design choices (as I'm inclined to believe) and merely represent differing styles, do they have names? I prefer approach 2 and find it hard to convey the difference in style to someone who prefers approach 1 without employing an actual example (as I have done here!). It would be nice if I could say, "I prefer the inclination-towards-6NF (or whatever) style myself."
Assuming one of these approaches is in fact an anti-pattern (as I merely suspect may be the case for approach 1 by modelling a relationship between two entities as an attribute of one of those entities) does this anti-pattern have a name?
Well, the first is nothing more than a one-to-many relationship (one employee to many products). This is sometimes referred to as a O:M relationship (zero to many) because it's optional (not every product has a product manager). Also, not every employee is a product manager, so it's optional on the other side too.
The second is a join table, usually used for a many-to-many relationship. But since one side is only one-to-one (each product is only in the table once) it's really just a convoluted one-to-many relationship.
Personally I prefer the first one but neither is wrong (or bad).
The second would be used for two reasons that come to mind.
You envision the possibility that a product will have more than one manager; or
You want to track the history of who the product manager is for a product. You do this with, say a current_flag column set to 'Y' (or similar) where only one at a time can be current. This is actually a pretty common pattern in database-centric applications.
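One way to enforce "only one at a time can be current" in the database itself is a unique index restricted to the current rows. The sketch below uses a SQLite partial index through Python; other engines offer similar features under different syntax, and the index name and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductManagers (
    product_ID   INTEGER NOT NULL,
    employee_ID  INTEGER NOT NULL,
    current_flag TEXT NOT NULL CHECK (current_flag IN ('Y', 'N'))
);
-- At most one current manager per product; any number of past ones.
CREATE UNIQUE INDEX one_current_manager
    ON ProductManagers (product_ID) WHERE current_flag = 'Y';
INSERT INTO ProductManagers VALUES (100, 1, 'N'), (100, 2, 'Y');
""")


def try_add_current(product_id, employee_id):
    """Return True if the row was inserted, False if the index rejected it."""
    try:
        conn.execute(
            "INSERT INTO ProductManagers VALUES (?, ?, 'Y')",
            (product_id, employee_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

To change managers, the old row is first updated to 'N' and the new one inserted with 'Y', ideally in one transaction.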
It looks to me like the two model different behaviour. In the first example, you can have one product manager per product and one employee can be product manager for more than one product (one to many). The second appears to allow for more than one product manager per product (many to many). This would suggest the two solutions are equally valid in different situations and which one you use would depend on the business rule.
There is a flaw in the first approach. Imagine for a second that the business requirements have changed and now you need to be able to assign two Product Managers to a product. What will you do? Add another column to the table Product? Yuck. This obviously violates 1NF then.
Another option the second approach gives is the ability to store some attributes for a certain Product Manager <-> Product relation. For example, if you have two Product Managers for a product, then you can set one of them as primary...
Or, for example, an employee can have a phone number, but as a product manager he/she can have another phone number... This also goes to the special table then.
Approach 1)
Slows down the use of the Product table with the additional Product Manager field (maybe not for all databases but for some).
Linking from the Product table to the Employee table is simple.
Approach 2)
Existing queries using the Product table are not affected.
Increases the size of your database. You've now duplicated the Product ID column to another table as well as added unique constraints and indexes to that table.
Linking from the Product table to the Employee table is more cumbersome and costly as you have to link to the intermediate table first.
How often must you link between the two tables?
How many other queries use the Product table?
How many records in the Product table?
In the particular case you give, I think the main motivation for two tables is avoiding nulls for missing data, and that's how I would characterise the two approaches.
There's a discussion of the pros and cons on Wikipedia.
I am pretty sure that, given C. J. Date's dislike of this, he defines relational theory so that only the multiple-table solution is "valid". For example, you could call the single-table approach "poorly typed" (since the type of null is unclear - see quote on p4).