Avoid storing permutations of all possible product configurations - sql

I'll simplify my problem using a shirt analogy. I have the following tables:
shirt sizes (e.g. small, medium, large...)
shirt colors (e.g. red, green, blue...)
shirt styles (e.g. short sleeve, long sleeve, collared...)
Now, I'd like to create prices for my inventory. Not all shirts are available in every configuration, but some are. For example:
All shirt styles and sizes are available in green. These are $1.
Only large, collared shirts are available in blue. These are $2.
Short sleeve red shirts in all sizes are $3, but long sleeve and collared red shirts are $4.
I could create another table with all available combinations of the three tables and store the prices. This seems inefficient and prone to error. How else can I store these relationships?

Background
It's important to know the terminology and the lexicon when you are researching how to accomplish something. What you are looking to do here is basically design a product configurator or configure, price, quote (CPQ) system. These systems exist as proprietary and open source customizable off-the-shelf (COTS) solutions. As a software architect for a mid-market B2B company, I am quite familiar with software that implements CPQ from scratch and also with software that integrates with COTS solutions. If this is anything but an academic exercise, I would highly suggest you look at the myriad of free OSS CPQ tools. However, since this is Stack Overflow, I will address your question on a more technical level.
Four abstract layers
There are essentially four abstract layers to designing a product configuration system (which we will call a product configuration model).
Components and subcomponents
Attributes shared between those components
Tables / Relational Constraints connecting the components and sub components with their shared attributes
Expressions and expression constraints (which are non reusable statements that are conceptually the bottom layer)
Components
Let's take something simple like skateboards as a use case here. You may have a components table similar to the following
| id | Name   |
|----|--------|
| 1  | Decks  |
| 2  | Wheels |
| 3  | Trucks |
Sub Components
You may then have a sub components table similar to the following
| id | Name         | component_id |
|----|--------------|--------------|
| 1  | Bearings     | 2            |
| 2  | Bushing      | 3            |
| 3  | Grip Tape    | 1            |
| 4  | Nuts / Bolts | 1            |
As you can see in this simple example, you have one-to-one and one-to-many relationships between components and sub components. It is important that you do not confuse this with attributes, which we have not addressed yet.
Attributes
Your next layer of abstraction is attributes. Generally, all your attributes are associated with components and sub components through table constraints (and they are not limited by whether that particular combination exists or not).
For a simplified example, you might have an attributes table with the following rows
| id | Category    | Value        |
|----|-------------|--------------|
| 1  | Size        | 7.5          |
| 2  | Size        | 7.75         |
| 3  | Size        | 6.25         |
| 4  | Brand       | Toy Machine  |
| 5  | Brand       | Bird House   |
| 5  | Brand       | Nike         |
| 5  | Model       | Nyjah Pro    |
| 5  | Model       | Vice Monster |
| 6  | ABEC Rating | class 6      |
| 7  | ABEC Rating | class 3      |
As you can see, this table is not constrained in the same way your components (product major) and sub components (product minor) are. It simply lists all attributes. (This is an oversimplification, however: you'd obviously be using business keys in place of attribute labels like ABEC Rating.)
Expressions
Finally, you would have a table for expressions. These expressions would be stored as rows in the table. They may be relational with other expressions (recursive keys), but should not be relational with your tables. Rather, they should use a mixture of boolean logic, predefined functions, and the surrogate keys from your previous tables to specify the actual configurations available. These are generally NOT reusable (but can be combined with recursive keys for a bit more re-usability).
There are a variety of expression languages out there, some proprietary, some open. I manage a custom-built product configuration model that uses DMN (from the people who brought you BPMN) to express my statements.
Additionally, I have seen people use XML, XSLT, and XPath in place of the relational model listed above. An expression row might look something like the following
(/component/id#1 & (/attribute/#id == 6 | /attribute/#id == 7))
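To make the idea of an expression row more concrete, here is a minimal sketch of evaluating a rule like the one above against a chosen configuration. The nested-tuple representation and the `evaluate` function are hypothetical illustrations, not DMN, XPath, or any particular CPQ product's syntax; the ids are the surrogate keys from the example tables.

```python
def evaluate(expr, components, attributes):
    """Evaluate a nested-tuple rule against the selected component and attribute ids.

    Hypothetical rule format (an assumption for illustration only):
      ("component", id), ("attribute", id), ("and", ...), ("or", ...)
    """
    op, *args = expr
    if op == "component":
        return args[0] in components
    if op == "attribute":
        return args[0] in attributes
    if op == "and":
        return all(evaluate(a, components, attributes) for a in args)
    if op == "or":
        return any(evaluate(a, components, attributes) for a in args)
    raise ValueError(f"unknown operator: {op}")


# Component 1 combined with attribute 6 or attribute 7, mirroring the row above.
rule = ("and", ("component", 1), ("or", ("attribute", 6), ("attribute", 7)))

print(evaluate(rule, components={1}, attributes={6}))  # configuration allowed?
print(evaluate(rule, components={1}, attributes={4}))
```

In a real system the rule would be stored as text in the expressions table and parsed first; the sketch skips parsing and only shows the boolean evaluation step.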
In Conclusion
Like any software system, abstraction is key. I have seen almost all CPQ and product configuration models boil down to these four abstractions (with hundreds of other abstractions in between). Unless this is an academic exercise, I highly suggest you find a COTS solution. Knowing your products well enough to abstract between major, minor, and attributes is key, but the bread and butter (and unfortunately the "least clean" part) is definitely the expression language you store in your tables.

Storing all the combinations isn't such a bad idea. But, you could also use wildcards. Your conditions would look like:
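| style                 | size  | color | price |
|-----------------------|-------|-------|-------|
| NULL                  | NULL  | green | $1    |
| collared              | large | blue  | $2    |
| short sleeve          | NULL  | red   | $3    |
| long sleeve, collared | NULL  | red   | $4    |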
If you have only a handful of different prices, then this is probably okay. However, querying such a table would be less efficient than expanding it out for every combination.
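As a rough illustration of how such a wildcard table could be queried, here is a minimal sketch using Python's built-in `sqlite3`. The table name `price_rule`, the function `lookup_price`, and the choice to split the "long sleeve, collared" row into two rows are assumptions made for the example. It also assumes the rules never overlap, so at most one rule matches any shirt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price_rule (style TEXT, size TEXT, color TEXT, price INTEGER)"
)
# NULL acts as a wildcard: the rule applies to every value of that column.
conn.executemany(
    "INSERT INTO price_rule VALUES (?, ?, ?, ?)",
    [
        (None, None, "green", 1),
        ("collared", "large", "blue", 2),
        ("short sleeve", None, "red", 3),
        ("long sleeve", None, "red", 4),  # the combined row, split in two
        ("collared", None, "red", 4),
    ],
)


def lookup_price(style, size, color):
    """Return the price of a shirt, or None if that configuration is not sold."""
    row = conn.execute(
        """
        SELECT price FROM price_rule
        WHERE (style IS NULL OR style = ?)
          AND (size  IS NULL OR size  = ?)
          AND (color IS NULL OR color = ?)
        """,
        (style, size, color),
    ).fetchone()
    return row[0] if row else None


print(lookup_price("long sleeve", "small", "green"))
print(lookup_price("short sleeve", "large", "blue"))
```

If rules could overlap, you would also need a priority column or a "most specific rule wins" ordering, which this sketch does not attempt.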

Related

Flexibility of scenarios in Gherkin.

I am looking for a mechanism that will allow me to build more flexible scenarios.
For example for these two very similar scenarios that test existence of records in database:
Scenario Outline: Testing query with 1 attribute with these 2 record in and another 2 out of result
Given I'm connected to <db> database
When I select <query> from database
Then Result should contain fields:
| <row> |
| <yes1> |
| <yes2> |
And Result should not contain fields:
| <row> |
| <no1> |
| <no2> |
Examples:
| db | row | yes1 | yes2 | no1 | no2 | query |
| 1 | model | 1013 | 1006 | 1012 | 1007 | "SELECT model FROM pc WHERE speed >= 3.0;" |
| 1 | maker | E | A | C | H | "SELECT maker FROM product NATURAL JOIN laptop WHERE hd >= 100;" |
Scenario Outline: Testing query with 2 attributes with these 2 record in and another 2 out of result
Given I'm connected to <db> database
When I select <query> from database
Then Result should contain fields:
| <rowA> | <rowB> |
| <yes1A> | <yes1B> |
| <yes2A> | <yes2B> |
And Result should not contain fields:
| <rowA> | <rowB> |
| <no1A> | <no1B> |
| <no2A> | <no2B> |
Examples:
| db | rowA | rowB | yes1A | yes1B | yes2A | yes2B | no1A | no1B | no2A | no2B | query |
| 1 | model | price | 1004 | 649 | 2007 | 1429 | 2004 | 1150 | 3007 | 200 | "SELECT model,price FROM product" |
| 2 | name | country | Yamato | Japan | North | USA | Repulse | Brit | Cal | USA | "SELECT name, country FROM clases" |
I would like to be able to write one scenario that works for any number of attributes. It would be great if the number of tested rows were not fixed either.
My dream is to write only one general scenario:
Testing query with N attribute with these M record in and another L out of result
How to do this in Gherkin? Is it possible with any hacks?
The short answer is no. Gherkin is not about flexibility; Gherkin is about concrete examples. Concrete examples are anything but flexible.
A long answer is:
You are describing a usage of Gherkin as a test tool. The purpose of Gherkin is, however, not to test things. The purpose of Gherkin is to facilitate communication between development and the stakeholders that want a specific behaviour.
If you want to test something, there is other tooling that will support exactly what you want. Any test framework will be usable. My personal choice would be JUnit since I work mostly with Java.
The litmus test for deciding on the tooling is, who will have to be able to understand this?
If the answer is non techs, I would probably use Gherkin with very concrete examples. Concrete examples are most likely not comparing things in a database. Concrete examples tend to describe external, observable behaviour of the system.
If the answer is developers, then I would probably use a test framework where I have access to a programming language. This would allow for the flexibility you are asking for.
In your case, you are asking for a programming language. Gherkin and Cucumber are not the right tools in your case.
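To show the kind of flexibility a programming language gives you here, below is a minimal sketch in Python with the built-in `sqlite3` module. The helper `check_query` is a hypothetical name, and the `pc` table with its speed and price values is made-up sample data chosen so that the model numbers from the question behave as described. The helper works for any number of columns and any number of expected or excluded rows.

```python
import sqlite3


def check_query(conn, query, must_contain=(), must_not_contain=()):
    """Run a query and return a list of problems; an empty list means success."""
    rows = {tuple(r) for r in conn.execute(query)}
    problems = []
    for expected in must_contain:
        if tuple(expected) not in rows:
            problems.append(f"missing row: {expected}")
    for unexpected in must_not_contain:
        if tuple(unexpected) in rows:
            problems.append(f"unexpected row: {unexpected}")
    return problems


# Hypothetical sample data, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pc (model INTEGER, speed REAL, price INTEGER)")
conn.executemany(
    "INSERT INTO pc VALUES (?, ?, ?)",
    [(1013, 3.06, 529), (1006, 3.2, 1049), (1012, 2.8, 649), (1007, 2.2, 510)],
)

# One attribute, two rows in, two rows out.
print(check_query(
    conn,
    "SELECT model FROM pc WHERE speed >= 3.0",
    must_contain=[(1013,), (1006,)],
    must_not_contain=[(1012,), (1007,)],
))
```

The same function handles two or more attributes simply by passing longer tuples, for example `must_contain=[(1013, 529)]` with a query that selects `model, price`.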
You can do it without any hacks, but I don't think you want to, at least not the entire scenario in a single line.
You will want to follow BDD structure, else why use BDD?
You should have and follow a structure like:
Given
When
Then
You need to split the scenario and clearly delimit the initial context, the action(s), and the result(s). It is bad practice not to separate these.
Also note that a clear delimitation will increase reusability and readability, and will also help you a lot in debugging.
Please do some research into what BDD means and how it helps. A checklist of BDD best practices may also help when reviewing the code of the automated scenarios.

Converting a table into Third Normal Form

I'm studying for an exam at the moment and I need a hand thoroughly understanding how to convert a table into 3NF. I understand going from unnormalised to 1NF, but I'm getting stuck on 1NF to 2NF. I've been given this example from a tutorial.
filmNo | fTitle | dirNo | director | actorNo | aName | role | timeOnScreen
F1100 | Happy Days | D101 | Jim Alan | A1020 | Sheila Toner | Jean Simpson | 15.45
| | D101 | Jim Alan | A1222 | Peter Watt | Tom Kinder | 25.38
| | D101 | Jim Alan | A1020 | Sheila Toner | Silvia Simpson| 22.56
F1109 | Snake Bite | D076 | Sue Ramsay | A1567 | Steve Mcdonald| Tim Rosey | 19.56
| | D076 | Sue Ramsay | A1222 | Peter Watt | Archie Bold | 10.44
So converting this table to 1NF is quite easy, but it's getting to 2NF and 3NF that I'm struggling with. I'm getting lost on determining the dependencies on the columns. Am I correct in saying that role and timeOnScreen are dependent on the actor, but also on the film? How would this convert to 2NF? I think from 2NF I'd be able to go to 3NF. But I'd really like to go through the steps to do this so I can fully understand it for my exam.
"I'm getting lost on determining the dependencies on the columns".
First and foremost, dependencies between "columns" in a relation schema should be derived from business rules which should be given. Determining the dependencies from a data sample is just a guessing game, always fallible and never to be relied upon. If your exercise requires you to do that, try and make the best of it, but forget about the approach as soon as your exam is over.
A functional dependency (which is the only type of dependency you need to consider if nothing beyond 3NF is targeted) is a rule "AB->CD" to the effect that for any valid value of our relation schema, IF you take the relational projection over {ABCD} of that relation value, then the resulting relation value will be one in which any combination of {AB} values will appear at most once. That's where the name "functional" dependency derives from : "AB->CD" expresses that both C and D values are mathematical functions of the combinations of AB values. Conversely, it expresses that the AB combination is a determinant for finding single C and D values. Applying that to your sample should allow you to find some reasonable FDs that your exercise is wrongly expecting you to guess.
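The definition above can be turned into a small check. The sketch below (the function name `fd_holds` is an assumption) tests whether a candidate dependency is contradicted by the sample rows from the question, with the blank filmNo and fTitle cells filled in as they would be in 1NF. Keep the earlier warning in mind: a sample can only refute a dependency, never prove that it is a business rule.

```python
COLUMNS = ("filmNo", "fTitle", "dirNo", "director",
           "actorNo", "aName", "role", "timeOnScreen")
SAMPLE = [
    ("F1100", "Happy Days", "D101", "Jim Alan", "A1020", "Sheila Toner", "Jean Simpson", 15.45),
    ("F1100", "Happy Days", "D101", "Jim Alan", "A1222", "Peter Watt", "Tom Kinder", 25.38),
    ("F1100", "Happy Days", "D101", "Jim Alan", "A1020", "Sheila Toner", "Silvia Simpson", 22.56),
    ("F1109", "Snake Bite", "D076", "Sue Ramsay", "A1567", "Steve Mcdonald", "Tim Rosey", 19.56),
    ("F1109", "Snake Bite", "D076", "Sue Ramsay", "A1222", "Peter Watt", "Archie Bold", 10.44),
]
ROWS = [dict(zip(COLUMNS, r)) for r in SAMPLE]


def fd_holds(rows, determinant, dependent):
    """Return False if two rows agree on `determinant` but differ on `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)
        value = tuple(row[c] for c in dependent)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


print(fd_holds(ROWS, ["filmNo"], ["fTitle", "dirNo"]))
print(fd_holds(ROWS, ["filmNo", "actorNo"], ["role"]))
```

In this sample, actor A1020 plays two roles in film F1100, so {filmNo, actorNo} does not determine role, which suggests the key of the 1NF relation has to include role as well.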

Schema design: to-one foreign relation with heterogeneous type (different targets), but same role

I have been working to build a more abstract schema, where there had been several tables modeling remarkably similar relationships, I want to model just the "essence". Due to the environment I am working with (Drupal 7), I can't change the nature of the issue: that a relationship of the same essential type could reference one of two different tables for the object in one role. Let's bring in some example to clarify (this is not my actual problem domain, but a similar problem). Here are the requirements:
First, if you are unfamiliar with Drupal, here's the gist: Users in one table, every other entity in a single second table (gross generalization, but enough).
Let's say we want to model the "works for" relationship, and let's take it as given that "companies" are of type "entity" and "supervisor" is of type "user" (and by "type" I mean that's the table in the database where their tuples reside). Here are the simplified requirements:
A user can work for a company
A company can work for a company
These "works for" relationships should be in the same table.
I have two ideas, and both don't exactly sit well with my current disposition toward schema quality, and this is where I would like some insight.
One foreign-key column paired with a 'type' column
Two foreign-key columns, always at most one utilized (ick!)
In case you are a visual thinker, here are the two options representing the fact that users 123 and 632, as well as entity 123 all work for entity 435:
Option 1
+---------------+-------------+---------------+-------------+
| employment_id | employee_id | employee_type | employer_id |
+---------------+-------------+---------------+-------------+
| 1 | 123 | user | 435 |
+---------------+-------------+---------------+-------------+
| 2 | 123 | entity | 435 |
+---------------+-------------+---------------+-------------+
| 3 | 632 | user | 435 |
+---------------+-------------+---------------+-------------+
Option 2
+---------------+------------------+--------------------+-------------+
| employment_id | employee_user_id | employee_entity_id | employer_id |
+---------------+------------------+--------------------+-------------+
| 1 | 123 | <NULL> | 435 |
+---------------+------------------+--------------------+-------------+
| 2 | <NULL> | 123 | 435 |
+---------------+------------------+--------------------+-------------+
| 3 | 632 | <NULL> | 435 |
+---------------+------------------+--------------------+-------------+
Thoughts on option 1: I like that the employee_id column has a concrete role, but I despise that it has an ambiguous target. Option 2 has an ambiguous role (which column is the employee?), but has a concrete target for any given FK, so I can think of it this way:
+-----------+-----------+----------+
| | ROLE |
| | ambiguous | concrete |
+-----------+-----------+----------+
| T | | |
| A ambig. | | 1 |
| R | | |
| G -------+-----------+----------+
| E | | |
| T concr. | 2 | ? |
| | | |
+-----------+-----------+----------+
Option two has very pragmatic benefits for my project, but I do not feel comfortable with so many nulls (you might not even call it 1NF!)
So here's the crux of my question for SO: How can option 1 be improved, or else what knowledge gap might I have that leaves me unsettled? While I can't bring to mind a specific rule which it violates, the design clearly is not in keeping with the intentions of normalization (requiring two columns to uniquely identify a relationship is not doing me any favors for safeguarding against anomalies).
I do understand that the ideal solution would be to redesign the users entity to be the same as what I have been calling "entity" here, but please consider that beside the point/circumstantial (or at least let's draw the pragmatic line right exactly there for this question).
Again, the essential question: What, in terms of normalization, is wrong with schema option 1, and how might you model this relationship given the constraint of not refactoring "user" into "entity"?
note: For this, I am more interested in theoretical purity than a pragmatic solution
The solutions you present contravene 4th normal form, as @podiluska says. If the design is recast into the form below, this difficulty is removed and the result is in 5NF (and possibly even 6NF).
Adopt one of the patterns for sub/super types. This uses the relation definitions set out below, plus the super/subtype constraint. This constraint is that each tuple in the supertype relation must correspond to exactly one subtype tuple. In other words, the subtypes must form a disjoint, covering set over the supertype.
I suspect the performance of this in a real situation might require some heavy tuning:
Table: Employment
+---------------+-------------+
| employee_id | employer_id |
+---------------+-------------+
| 1 | 435 |
+---------------+-------------+
| 2 | 435 |
+---------------+-------------+
| 3 | 435 |
+---------------+-------------+
Table: Employee (SuperType)
+---------------+
| employee_id |
+---------------+
| 1 |
+---------------+
| 2 |
+---------------+
| 3 |
+---------------+
Table: User employee (SubType)
+---------------+-------------+
| employee_id | user_id |
+---------------+-------------+
| 1 | 123 |
+---------------+-------------+
| 3 | 632 |
+---------------+-------------+
Table: Entity employee (SubType)
+---------------+-------------+
| employee_id | entity_id |
+---------------+-------------+
| 2 | 123 |
+---------------+-------------+
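The four tables above could be declared roughly as follows. This is a minimal sketch using Python's built-in `sqlite3`. The snake_case table names and the UNIQUE constraints on `user_id` and `entity_id` are assumptions, the foreign keys into Drupal's own user and entity tables are omitted, and the disjoint, covering constraint between the subtypes is not enforced here (that would need triggers or application logic).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;

CREATE TABLE employee (                 -- supertype
    employee_id INTEGER PRIMARY KEY
);
CREATE TABLE user_employee (            -- subtype: employee is a user
    employee_id INTEGER PRIMARY KEY REFERENCES employee(employee_id),
    user_id     INTEGER NOT NULL UNIQUE
);
CREATE TABLE entity_employee (          -- subtype: employee is an entity
    employee_id INTEGER PRIMARY KEY REFERENCES employee(employee_id),
    entity_id   INTEGER NOT NULL UNIQUE
);
CREATE TABLE employment (
    employee_id INTEGER NOT NULL REFERENCES employee(employee_id),
    employer_id INTEGER NOT NULL,       -- would reference the entity table
    PRIMARY KEY (employee_id, employer_id)
);

INSERT INTO employee VALUES (1), (2), (3);
INSERT INTO user_employee VALUES (1, 123), (3, 632);
INSERT INTO entity_employee VALUES (2, 123);
INSERT INTO employment VALUES (1, 435), (2, 435), (3, 435);
""")

# Everyone who works for entity 435, with the kind of employee resolved by join.
WORKS_FOR = """
SELECT 'user' AS employee_type, u.user_id AS id
FROM employment e JOIN user_employee u USING (employee_id)
WHERE e.employer_id = ?
UNION ALL
SELECT 'entity', n.entity_id
FROM employment e JOIN entity_employee n USING (employee_id)
WHERE e.employer_id = ?
ORDER BY 1, 2
"""
staff = conn.execute(WORKS_FOR, (435, 435)).fetchall()
print(staff)
```

Every foreign key now has exactly one target table, and the employee's kind is recovered by joining rather than by a type column.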
What is wrong with option 1 (and option 2) is that it contains a multivalued dependency and, as such, is in breach of 4th normal form. However, within the constraints you have given, there's not a lot you can do about that.
If you could replace the worksfor table with a view, then you could keep user-company and company-company relations separate.
Of your two choices, Option 2 has the advantage that it may be easier to enforce the referential integrity, depending on your platform.
One potential, if icky, pragmatic solution within your current constraints could be to give companies positive IDs and users negative IDs. This eliminates the empty column of option 2 and turns the type column of option 1 into an implication, but I feel dirty even suggesting it.
Similarly, if you don't need to know what type the entity is as long as you can determine it via joining, then using GUIDs as IDs would eliminate the need for the type column.

What type of data structure should I use for mimicking a file-system?

The title might be worded strange, but it's probably because I don't even know if I'm asking the right question.
So essentially what I'm trying to build is a "breadcrumbish" categorization system (like a file directory) where each node has a parent (except for root) and each node can contain either data or another node. This will be used for organizing email addresses in a database. I have a system right now where you can create a "group" and add email addresses to that group, but it would be very nice to add an organizational system to it.
This (in my head) is in a tree format, but I don't know what tree.
The issue I'm having is building it using MySQL. It's easy to traverse trees that are in memory, but on database, it's a bit trickier.
Image of tree: http://j.imagehost.org/0917/asdf.png
SELECT * FROM Businesses:
Tim's Hardware Store, 7-11, Kwik-E-Mart, Cub Foods, Bob's Grocery Store, CONGLOM-O
SELECT * FROM Grocery Stores:
Cub Foods, Bob's Grocery Store, CONGLOM-O
SELECT * FROM Big Grocery Stores:
CONGLOM-O
SELECT * FROM Churches:
St. Peter's Church, St. John's Church
I think this should be enough information so I can accurately describe what my goal is.
Well, there are a few patterns you could use. Which one is right depends on your needs.
Do you need to select a node and everything beneath it? If so, then a Nested Set Model may be better for you. The table would look like this:
| Name | Left | Right |
| Emails | 1 | 12 |
| Business | 2 | 7 |
| Tim's | 3 | 4 |
| 7-11 | 5 | 6 |
| Churches | 8 | 11 |
| St. Pete | 9 | 10 |
So then, to find anything below a node, just do the following (LEFT and RIGHT are reserved words in MySQL, so the column names are quoted with backticks):
SELECT name FROM nodes WHERE `Left` > *yourleftnode* AND `Right` < *yourrightnode*
To find everything above the node:
SELECT name FROM nodes WHERE `Left` < *yourleftnode* AND `Right` > *yourrightnode*
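Here is a runnable sketch of the two nested set queries, using Python's built-in `sqlite3` and the six rows from the table above. The column names `lft` and `rgt` are assumptions chosen to sidestep the reserved words LEFT and RIGHT, and the helper names `descendants` and `ancestors` are hypothetical. Instead of pasting in the bounds by hand, the queries join the table to itself to look them up by name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("Emails", 1, 12),
        ("Business", 2, 7),
        ("Tim's", 3, 4),
        ("7-11", 5, 6),
        ("Churches", 8, 11),
        ("St. Pete", 9, 10),
    ],
)


def descendants(name):
    """Every node strictly inside the (lft, rgt) interval of the named node."""
    rows = conn.execute(
        """
        SELECT child.name
        FROM nodes AS child
        JOIN nodes AS parent
          ON child.lft > parent.lft AND child.rgt < parent.rgt
        WHERE parent.name = ?
        ORDER BY child.lft
        """,
        (name,),
    )
    return [r[0] for r in rows]


def ancestors(name):
    """Every node whose interval strictly contains the named node, root first."""
    rows = conn.execute(
        """
        SELECT parent.name
        FROM nodes AS child
        JOIN nodes AS parent
          ON parent.lft < child.lft AND parent.rgt > child.rgt
        WHERE child.name = ?
        ORDER BY parent.lft
        """,
        (name,),
    )
    return [r[0] for r in rows]


print(descendants("Business"))
print(ancestors("St. Pete"))
```

The trade-off is on the write side: inserting or moving a node means renumbering the `lft` and `rgt` values of many other rows.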
If you only want to query for a specific level, you could use an Adjacency List Model:
| Id | Name | Parent_Id |
| 1 | Email | null |
| 2 | Business | 1 |
| 3 | Tim's | 2 |
To find everything on the same level, just do:
SELECT name FROM nodes WHERE parent_id = *yourparentnode*
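For completeness, here is a minimal sketch of the adjacency list queries using Python's built-in `sqlite3`. The `Churches` row, the helper names, and the use of a recursive common table expression to fetch a whole subtree are additions for illustration, not part of the original table. Recursive CTEs are available in SQLite and in MySQL 8.0 and later, but not in older MySQL versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " parent_id INTEGER REFERENCES nodes(id))"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        (1, "Email", None),
        (2, "Business", 1),
        (3, "Tim's", 2),
        (4, "Churches", 1),  # extra sample row, not in the table above
    ],
)


def children(parent_id):
    """Direct children only: the single-level query from the text."""
    rows = conn.execute(
        "SELECT name FROM nodes WHERE parent_id = ? ORDER BY id", (parent_id,)
    )
    return [r[0] for r in rows]


def subtree(root_id):
    """The root and all of its descendants, via a recursive CTE."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id, name) AS (
            SELECT id, name FROM nodes WHERE id = ?
            UNION ALL
            SELECT n.id, n.name FROM nodes AS n JOIN sub AS s ON n.parent_id = s.id
        )
        SELECT name FROM sub ORDER BY id
        """,
        (root_id,),
    )
    return [r[0] for r in rows]


print(children(1))
print(subtree(2))
```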
Of course, there's nothing stopping you from taking a hybrid approach, which lets you use whichever kind of query suits the task at hand:
| Id | Name | Parent_Id | Left | Right | Path |
| 1 | Email | null | 1 | 6 | / |
| 2 | Business | 1 | 2 | 5 | /Email/ |
| 3 | Tim's | 2 | 3 | 4 | /Email/Business/ |
Really, it's just a matter of your needs...
The easiest way to do it would be something like this:
Group
- GroupID (PK)
- ParentGroupID
- GroupName
People
- PersonID (PK)
- EmailAddress
- FirstName
- LastName
GroupMembership
- GroupID (PK)
- PersonID (PK)
That should establish a structure where you can have groups that have parent groups and people that can be members of groups (or multiple groups). If a person can only be a member of one group, then get rid of the GroupMembership table and just put a GroupID on the People table.
Complex queries against this structure can get difficult though. There are other less intuitive ways to model this that make querying easier (but often make updates more difficult). If the number of groups is small, the easiest way to handle queries against this is often to load the whole tree of Groups into memory, cache it, and use that to build your queries.
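The "load the whole tree into memory" idea might look like the sketch below. The group ids and the hierarchy (Businesses containing Grocery Stores containing Big Grocery Stores, with Churches separate) are assumptions based on the example queries in the question, and the function names are hypothetical.

```python
from collections import defaultdict

# Rows as they might come back from: SELECT GroupID, ParentGroupID, GroupName FROM `Group`
GROUPS = [
    (1, None, "Businesses"),
    (2, 1, "Grocery Stores"),
    (3, 2, "Big Grocery Stores"),
    (4, None, "Churches"),
]


def build_children_index(groups):
    """Map each parent group id to the list of its direct child group ids."""
    children = defaultdict(list)
    for group_id, parent_id, _name in groups:
        children[parent_id].append(group_id)
    return children


def group_and_descendants(children, group_id):
    """Collect a group id together with the ids of all groups nested below it."""
    result, stack = [], [group_id]
    while stack:
        current = stack.pop()
        result.append(current)
        stack.extend(children.get(current, []))
    return result


index = build_children_index(GROUPS)  # build once and cache
ids = sorted(group_and_descendants(index, 1))
placeholders = ", ".join("?" for _ in ids)
query = f"SELECT PersonID FROM GroupMembership WHERE GroupID IN ({placeholders})"
print(ids, query)
```

The resulting id list is then bound as parameters to a single flat query against GroupMembership, so the database never has to walk the hierarchy itself.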
As always when I see questions about modeling trees and hierarchies, my suggestion is that you get a hold of a copy of Joe Celko's book on the subject. He presents various ways to model them in a RDBMS, some of which are fairly imaginative, and he gives the pros and cons for each pattern.
Create an object Group which has a name, many email addresses, and a parent, which can be null.

How should I implement items that are normalized in the database in object oriented design?

How should I implement items that are normalized in the database in object oriented classes? In the database I have a table of items and Groups. Each item belongs to one group:
+----------------------------------------+
| Inventory |
+----+------+-------+----------+---------+
| Id | Name | Price | Quantity | GroupId |
+----+------+-------+----------+---------+
| 43 | Box | 34.00 | 456 | 4 |
| 56 | Ball | 56.50 | 3 | 6 |
| 66 | Tin | 23.00 | 14 | 4 |
+----+------+-------+----------+---------+
(3,000 rows in total)
+----------------------+
| Groups |
+---------+------+-----+
| GroupId | Name | VAT |
+---------+------+-----+
| 4 | Mini | 0.2 |
| 6 | Big | 0.3 |
+---------+------+-----+
(10 rows in total)
I will use the OOP classes in a GUI where the user can edit items and groups in the inventory. It should also be easy to do calculations with a bunch of items. Group information like VAT is needed for calculations.
I will write an Item class, but do I need a Group class? If so, should I keep them in a global location or how do I access them when I need it for item calculations? Is there any design pattern for this case?
First of all, the most common practice is to use an ORM (object-relational mapping) tool. These are available for almost any modern OO language, and they take care of generating the classes needed to interact with the database, managing retrieval and updating, and managing connection lifetime.
That aside, yes, you need a Group class that has a collection of Items, and (ideally) a reference from Item to its parent Group. This is one of the areas where an ORM can help, since it can ensure that these two references (the collection of children and the parent reference) stay in sync.
Yes, you need a Group class. It looks like it has a 1:many relationship with Inventory, where Group is the parent. Group will have a reference to a collection or set of Inventory.
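Without an ORM, the two classes and their mutual references could be sketched as below. The method names and the assumption that the gross price is `price * quantity * (1 + VAT)` are illustrative choices, not something the question specifies. `Decimal` is used to avoid floating point rounding in money calculations.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class Group:
    group_id: int
    name: str
    vat: Decimal
    items: List["Item"] = field(default_factory=list)

    def add(self, item):
        """Keep both sides of the relationship in sync."""
        item.group = self
        self.items.append(item)

    def total_with_vat(self):
        return sum((i.total_with_vat() for i in self.items), Decimal(0))


@dataclass
class Item:
    item_id: int
    name: str
    price: Decimal
    quantity: int
    # Excluded from repr and comparison to avoid recursing through the back-reference.
    group: Optional[Group] = field(default=None, repr=False, compare=False)

    def total_with_vat(self):
        # Assumed formula: net value of the stock plus the group's VAT rate.
        return self.price * self.quantity * (1 + self.group.vat)


mini = Group(4, "Mini", Decimal("0.2"))
box = Item(43, "Box", Decimal("34.00"), 456)
tin = Item(66, "Tin", Decimal("23.00"), 14)
mini.add(box)
mini.add(tin)
print(box.total_with_vat(), mini.total_with_vat())
```

Because each Item holds a reference to its Group, no global lookup is needed during calculations. With only about ten groups, loading them all once and attaching items as they are read is cheap.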