Converting a table into Third Normal Form - sql

I'm studying for an exam at the moment and I need a hand thoroughly understanding how to convert a table into 3NF. I understand going from unnormalised to 1NF, but I'm getting stuck on 1NF to 2NF. I've been given this example from a tutorial:
filmNo | fTitle     | dirNo | director   | actorNo | aName          | role           | timeOnScreen
F1100  | Happy Days | D101  | Jim Alan   | A1020   | Sheila Toner   | Jean Simpson   | 15.45
       |            | D101  | Jim Alan   | A1222   | Peter Watt     | Tom Kinder     | 25.38
       |            | D101  | Jim Alan   | A1020   | Sheila Toner   | Silvia Simpson | 22.56
F1109  | Snake Bite | D076  | Sue Ramsay | A1567   | Steve Mcdonald | Tim Rosey      | 19.56
       |            | D076  | Sue Ramsay | A1222   | Peter Watt     | Archie Bold    | 10.44
Getting this table to 1NF is quite easy, but it's getting to 2NF and 3NF that I'm struggling with. I'm getting lost on determining the dependencies on the columns. Am I correct in saying that role and timeOnScreen are dependent on the actor, but also on the film? How would this convert to 2NF? I think from 2NF I'd be able to go to 3NF, but I'd really like to go through the steps so I can fully understand it for my exam.

"I'm getting lost on determining the dependencies on the columns".
First and foremost, dependencies between "columns" in a relation schema should be derived from business rules which should be given. Determining the dependencies from a data sample is just a guessing game, always fallible and never to be relied upon. If your exercise requires you to do that, try and make the best of it, but forget about the approach as soon as your exam is over.
A functional dependency (which is the only type of dependency you need to consider if nothing beyond 3NF is targeted) is a rule "AB->CD" saying that, for any valid value of our relation schema, if you take the relational projection over {ABCD} of that relation value, then the resulting relation value will be one in which any combination of {AB} values appears at most once. That's where the name "functional" dependency comes from: "AB->CD" expresses that both the C and D values are mathematical functions of the combination of AB values. Conversely, it expresses that the AB combination is a determinant for finding single C and D values. Applying that to your sample should allow you to find some reasonable FDs that your exercise is, wrongly, expecting you to guess.
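For example, from this sample one might plausibly guess the FDs filmNo -> fTitle, dirNo; dirNo -> director; actorNo -> aName; and {filmNo, actorNo, role} -> timeOnScreen (role has to be part of the key, since Sheila Toner plays two roles in F1100). Moving to 2NF removes the partial dependencies on filmNo alone and actorNo alone; moving to 3NF then removes the transitive dependency filmNo -> dirNo -> director. Under exactly those guessed FDs, and only as an illustrative sketch, the decomposition could be written as:
CREATE TABLE Director (
    dirNo    VARCHAR(10) PRIMARY KEY,
    director VARCHAR(50) NOT NULL
);
CREATE TABLE Film (
    filmNo VARCHAR(10) PRIMARY KEY,
    fTitle VARCHAR(50) NOT NULL,
    dirNo  VARCHAR(10) NOT NULL REFERENCES Director (dirNo)  -- director moved out: filmNo -> dirNo -> director was transitive
);
CREATE TABLE Actor (
    actorNo VARCHAR(10) PRIMARY KEY,
    aName   VARCHAR(50) NOT NULL
);
CREATE TABLE FilmRole (
    filmNo       VARCHAR(10) NOT NULL REFERENCES Film (filmNo),
    actorNo      VARCHAR(10) NOT NULL REFERENCES Actor (actorNo),
    role         VARCHAR(50) NOT NULL,
    timeOnScreen DECIMAL(6, 2),
    PRIMARY KEY (filmNo, actorNo, role)  -- timeOnScreen depends on the whole key; no partial dependencies remain
);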

Related

Pivot Data in a BigQuery Standard SQL View Definition

I'm not sure whether this is possible with some of the new BigQuery scripting capabilities, UDFs, or array/string functions (or anything else!), but I simply can't figure it out.
I'm trying to write the SQL for a view in BigQuery which dynamically defines columns based on query results, similar to a pivot table in a spreadsheet/BI tool (or pivot in pandas). I can do this externally in Python or hard-code it using case statements, but I'm sure that a SQL solution to this would be incredibly useful to a huge number of people.
Essentially I'm trying to write a query which would transform a table like this:
year | name | number
-----------------------
1963 | Michael | 9246
1961 | Michael | 9055
1958 | Michael | 9203
1957 | Michael | 9116
1953 | Robert | 9061
1952 | Robert | 9205
1951 | Robert | 9054
1948 | Robert | 9015
1947 | Robert | 10025
1947 | John | 9634
1946 | Robert | 9295
----------------------
SQL to generate initial example table:
SELECT year, name, number
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE number > 9000
ORDER BY year DESC
Into a table with the following structure:
year | John  | Michael | Robert
---------------------------------
1946 |       |         | 9,295
1947 | 9,634 |         | 10,025
1948 |       |         | 9,015
...
This then needs to be connected to downstream tools, without requiring maintenance when the data changes. I know that this is not always a great idea and that tidy form data is more universally useful, but there are still some scenarios where this behaviour is desirable.
I have seen a few solutions on here, but they all seem to involve string generation and then manually pasting the query... I can do this via the BigQuery API but am desperate to find a dynamic solution using nothing but SQL so I don't have to maintain an external function.
Thanks in advance for any pointers!
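One workaround worth sketching, with the caveat that it uses BigQuery scripting rather than a plain view (scripts cannot back a view, so the result would have to be materialised into a table), is to build the column list with a query and then EXECUTE IMMEDIATE a PIVOT statement, assuming the PIVOT operator is available in your project:
DECLARE names STRING;
-- Collect the distinct pivot values as a quoted, comma-separated list
SET names = (
  SELECT STRING_AGG(DISTINCT CONCAT("'", name, "'"))
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  WHERE number > 9000
);
-- Build and run the pivot query; prepend CREATE OR REPLACE TABLE ... AS to persist the result
EXECUTE IMMEDIATE FORMAT("""
  SELECT *
  FROM (
    SELECT year, name, number
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE number > 9000
  )
  PIVOT (SUM(number) FOR name IN (%s))
  ORDER BY year
""", names);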

Avoid storing permutations of all possible product configurations

I'll simplify my problem using a shirt analogy. I have the following tables:
shirt sizes (e.g. small, medium, large...)
shirt colors (e.g. red, green, blue...)
shirt styles (e.g. short sleeve, long sleeve, collared...)
Now, I'd like to create prices for my inventory. Not all shirts are available in every configuration, but some are. For example:
All shirt styles and sizes are available in green. These are $1.
Only large, collared shirts are available in blue. These are $2.
Short sleeve red shirts in all sizes are $3, but long sleeve and collared red shirts are $4.
I could create another table with all available combinations of the three tables and store the prices. This seems inefficient and prone to error. How else can I store these relationships?
Background
It's important to know the terminology and the lexicon when you are researching how to accomplish something. What you are looking to do here is basically design a product configurator, or configure-price-quote (CPQ), system. These systems exist as proprietary and open-source customizable off-the-shelf solutions. As a software architect for a mid-market B2B company, I am quite familiar with software that implements CPQ from scratch and also software that integrates with COTS solutions. If this is anything but an academic exercise, I would highly suggest you look at the myriad of free OSS CPQ tools. However, since this is Stack Overflow, I will address your question on a more technical level.
Four abstract layers
There are essentially four abstract layers to designing a product configuration system (which we will call a product configuration model).
1. Components and sub-components
2. Attributes shared between those components
3. Tables / relational constraints connecting the components and sub-components with their shared attributes
4. Expressions and expression constraints (non-reusable statements that are conceptually the bottom layer)
Components
Let's take something simple like skateboards as a use case here. You may have a components table similar to the following
|---------------------|------------------|
| id | Name |
|---------------------|------------------|
| 1 | Decks |
|---------------------|------------------|
| 2 | Wheels |
|---------------------|------------------|
| 3 | Trucks |
|---------------------|------------------|
Sub Components
You may then have a sub components table similar to the following
|---------------------|------------------|------------------|
| id | Name | component_id |
|---------------------|------------------|------------------|
| 1 | Bearings | 2 |
|---------------------|------------------|------------------|
| 2 | Bushing | 3 |
|---------------------|------------------|------------------|
| 3 | Grip Tape | 1 |
|---------------------|------------------|------------------|
| 4 | Nuts / Bolts | 1 |
|---------------------|------------------|------------------|
As you can see in this simple example, you have one-to-one and one-to-many relationships between components and sub-components. It is important that you do not confuse this with attributes, which we have not addressed yet.
Attributes
Your next layer of abstraction is attributes. Generally, all your attributes are associated with components and sub-components through table constraints, and they are not limited by whether a particular combination actually exists.
For a simplified example, you might have an attributes table with the following rows:
|---------------------|------------------|------------------|
| id                  | Category         | Value            |
|---------------------|------------------|------------------|
| 1                   | Size             | 7.5              |
|---------------------|------------------|------------------|
| 2                   | Size             | 7.75             |
|---------------------|------------------|------------------|
| 3                   | Size             | 6.25             |
|---------------------|------------------|------------------|
| 4                   | Brand            | Toy Machine      |
|---------------------|------------------|------------------|
| 5                   | Brand            | Bird House       |
|---------------------|------------------|------------------|
| 6                   | Brand            | Nike             |
|---------------------|------------------|------------------|
| 7                   | Model            | Nyjah Pro        |
|---------------------|------------------|------------------|
| 8                   | Model            | Vice Monster     |
|---------------------|------------------|------------------|
| 9                   | ABEC Rating      | class 6          |
|---------------------|------------------|------------------|
| 10                  | ABEC Rating      | class 3          |
|---------------------|------------------|------------------|
As you can see, this table is not constrained in the same way your product major and product minor are (this is an oversimplification, though, and you'd obviously be using business keys in place of attribute labels like ABEC Rating, etc.). It simply lists all attributes.
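As a rough, illustrative sketch (table names are made up, and the linking table is just one possible reading of the "table constraints" layer above), the first three layers could be carried by tables like these:
CREATE TABLE components (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);
CREATE TABLE sub_components (
    id           INT PRIMARY KEY,
    name         VARCHAR(100) NOT NULL,
    component_id INT NOT NULL REFERENCES components (id)
);
CREATE TABLE attributes (
    id       INT PRIMARY KEY,
    category VARCHAR(50)  NOT NULL,  -- e.g. Size, Brand, ABEC Rating
    value    VARCHAR(100) NOT NULL
);
-- The "table constraints" layer: which attributes can apply to which component
CREATE TABLE component_attributes (
    component_id INT NOT NULL REFERENCES components (id),
    attribute_id INT NOT NULL REFERENCES attributes (id),
    PRIMARY KEY (component_id, attribute_id)
);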
Expressions
Finally, you would have a table for expressions. These expressions would be stored as rows in the table. They may be relational with other expressions (recursive keys), but should not be relational with your tables. Rather, they should use a mixture of boolean logic, predefined functions, and the surrogate keys from your previous tables to specify the actual configurations available. These are generally NOT reusable (but can be combined with recursive keys for a bit more re-usability).
There are a variety of expression languages out there, some proprietary some open. I manage a custom built product configuration model that uses DMN (from the people who brought you BPMN) to express my statements.
Additionally, I have seen people use XML, XSLT, and XPath in place of the relational model listed above. An expression row might look something like the following:
(/component/id#1 & (/attribute/#id == 6 | /attribute/#id == 7))
In Conclusion
Like any software system, abstraction is key. I have seen almost all CPQ and product configuration models boil down to these four abstractions (with hundreds of other abstractions in between). Unless this is an academic exercise, I highly suggest you find a COTS solution. Knowing your products well enough to abstract between major, minor, and attributes is key, but the bread and butter (and unfortunately the "least clean" part) is definitely the expression language you store in your tables.
Storing all the combinations isn't such a bad idea. But, you could also use wildcards. Your conditions would look like:
| style                 | size  | color | price |
|-----------------------|-------|-------|-------|
| NULL                  | NULL  | green | $1    |
| collared              | large | blue  | $2    |
| short sleeve          | NULL  | red   | $3    |
| long sleeve, collared | NULL  | red   | $4    |
If you have only a handful of different prices, then this is probably okay. However, querying such a table would be less efficient than expanding it out for every combination.
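To make the lookup concrete, here is a hedged sketch (the table and column names are invented, it assumes one style per row rather than the comma-separated list above, and LIMIT syntax varies by platform): NULL is treated as "matches anything", and when several rules match, the most specific one wins.
-- Price for a large, blue, collared shirt; NULL columns act as wildcards
SELECT price
FROM shirt_prices
WHERE (style IS NULL OR style = 'collared')
  AND (size  IS NULL OR size  = 'large')
  AND (color IS NULL OR color = 'blue')
ORDER BY
  CASE WHEN style IS NOT NULL THEN 1 ELSE 0 END DESC,  -- prefer rules that name a style
  CASE WHEN size  IS NOT NULL THEN 1 ELSE 0 END DESC,  -- then a size
  CASE WHEN color IS NOT NULL THEN 1 ELSE 0 END DESC   -- then a color
LIMIT 1;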

Explanation about Direct Mapping to convert Many-to-Many relationship

One of the W3C standards for RDB2RDF is Direct Mapping. I heard that there is a problem when converting many-to-many relationships from a relational database, and they say it loses semantics; I need more explanation about this.
...there is a problem when converting many-to-many relationships from a relational database
I'd say that direct mapping introduces additional "parasitic" semantics, treating normalization artefacts as first-class objects.
Let's consider the D011-M2MRelations testcase.
Student
+---------+-----------+----------+
| ID (PK) | FirstName | LastName |
+---------+-----------+----------+
| 10 | Venus | Williams |
| 11 | Fernando | Alonso |
| 12 | David | Villa |
+---------+-----------+----------+
Student_Sport
+------------+----------+
| ID_Student | ID_Sport |
+------------+----------+
| 10 | 110 |
| 11 | 111 |
| 11 | 112 |
| 12 | 111 |
+------------+----------+
Sport
+---------+-------------+
| ID (PK) | Description |
+---------+-------------+
| 110 | Tennis |
| 111 | Football |
| 112 | Formula1 |
+---------+-------------+
Direct mapping generates a lot of triples of this kind:
<Student_Sport/ID_Student=11;ID_Sport=111> <Student_Sport#ref-ID_Student> <Student/ID=11>.
<Student_Sport/ID_Student=11;ID_Sport=111> <Student_Sport#ref-ID_Sport> <Sport/ID=111>.
<Student_Sport/ID_Student=11;ID_Sport=112> <Student_Sport#ref-ID_Student> <Student/ID=11>.
<Student_Sport/ID_Student=11;ID_Sport=112> <Student_Sport#ref-ID_Sport> <Sport/ID=112>.
Modeling from scratch, you'd probably write something like this (R2RML allows you to achieve that):
<http://example.com/student/11> <http://example.com/plays> <http://example.com/sport/111>.
<http://example.com/student/11> <http://example.com/plays> <http://example.com/sport/112>.
Moreover, one can't improve the results by denormalizing the original tables or creating SQL views: without primary keys, the results are probably even worse.
In order to improve the results, a subsequent DELETE/INSERT (or CONSTRUCT) seems to be the only option available. The process should be called ELT rather than ETL. Perhaps the following DM-generated triples were intended to help in such a transformation:
<Student_Sport/ID_Student=11;ID_Sport=111> <Student_Sport#ID_Student> "11"^^xsd:integer.
<Student_Sport/ID_Student=11;ID_Sport=111> <Student_Sport#ID_Sport> "111"^^xsd:integer.
<Student_Sport/ID_Student=11;ID_Sport=112> <Student_Sport#ID_Student> "11"^^xsd:integer.
<Student_Sport/ID_Student=11;ID_Sport=112> <Student_Sport#ID_Sport> "112"^^xsd:integer.
...they say it loses semantics
@JuanSequeda means that DM doesn't generate an OWL ontology from a relational schema; this behaviour is not specific to many-to-many relations.
See also links from Issue 14.

Flexibility of scenarios in Gherkin

I'm looking for a mechanism that will allow me to build more flexible scenarios.
For example, take these two very similar scenarios that test the existence of records in a database:
Scenario Outline: Testing query with 1 attribute with these 2 record in and another 2 out of result
Given I'm connected to <db> database
When I select <query> from database
Then Result should contain fields:
| <row> |
| <yes1> |
| <yes2> |
And Result should not contain fields:
| <row> |
| <no1> |
| <no2> |
Examples:
| db | row | yes1 | yes2 | no1 | no2 | query |
| 1 | model | 1013 | 1006 | 1012 | 1007 | "SELECT model FROM pc WHERE speed >= 3.0;" |
| 1 | maker | E | A | C | H | "SELECT maker FROM product NATURAL JOIN laptop WHERE hd >= 100;" |
Scenario Outline: Testing query with 2 attributes with these 2 record in and another 2 out of result
Given I'm connected to <db> database
When I select <query> from database
Then Result should contain fields:
| <rowA> | <rowB> |
| <yes1A> | <yes1B> |
| <yes2A> | <yes2B> |
And Result should not contain fields:
| <rowA> | <rowB> |
| <no1A> | <no1B> |
| <no2A> | <no2B> |
Examples:
| db | rowA | rowB | yes1A | yes1B | yes2A | yes2B | no1A | no1B | no2A | no2B | query |
| 1 | model | price | 1004 | 649 | 2007 | 1429 | 2004 | 1150 | 3007 | 200 | "SELECT model,price FROM product" |
| 2 | name | country | Yamato | Japan | North | USA | Repulse | Brit | Cal | USA | "SELECT name, country FROM clases" |
I would like to be able to write one scenario with a general number of attributes. It would be great if the number of tested rows were not fixed either.
My dream is to write only one general scenario:
Testing query with N attributes, with these M records in and another L out of the result
How to do this in Gherkin? Is it possible with any hacks?
The short answer is: no. Gherkin is not about flexibility; Gherkin is about concrete examples. Concrete examples are anything but flexible.
A long answer is:
You are describing using Gherkin as a test tool. The purpose of Gherkin, however, is not to test things. The purpose of Gherkin is to facilitate communication between the developers and the stakeholders who want a specific behaviour.
If you want to test something, there is other tooling that will support exactly what you want. Any test framework would be usable. My personal choice would be JUnit, since I work mostly with Java.
The litmus test for deciding on the tooling is, who will have to be able to understand this?
If the answer is non-techs, I would probably use Gherkin with very concrete examples. Concrete examples are most likely not comparing things in a database. Concrete examples tend to describe external, observable behaviour of the system.
If the answer is developers, then I would probably use a test framework where I have access to a programming language. This would allow for the flexibility you are asking for.
In your case, you are asking for a programming language. Gherkin and Cucumber are not the right tools in your case.
You can do it without any hacks, but I don't think you want to, at least not the entire scenario in a single line.
You will want to follow BDD structure, else why use BDD?
You should have and follow a structure like:
Given
When
Then
You need to split them up and have a clear delimitation between the initial context, the action(s), and the result(s). It is bad practice not to have a boundary between these.
Also note that a clear delimitation will increase reusability and readability, and will also help you a lot in debugging.
Please do some research into what BDD means and how it helps; it may be useful to have a checklist of BDD best practices that could also help in code review of the automated scenarios.

Schema design: to-one foreign relation with heterogeneous type (different targets), but same role

I have been working to build a more abstract schema: where there had been several tables modeling remarkably similar relationships, I want to model just the "essence". Due to the environment I am working with (Drupal 7), I can't change the nature of the issue: a relationship of the same essential type could reference one of two different tables for the object in one role. Let's bring in an example to clarify (this is not my actual problem domain, but a similar problem). Here are the requirements:
First, if you are unfamiliar with Drupal, here's the gist: Users in one table, every other entity in a single second table (gross generalization, but enough).
Let's say we want to model the "works for" relationship, and let's take it as given that "companies" are of type "entity" and a "supervisor" is of type "user" (and by "type" I mean that's the table in the database where their tuples reside). Here are the simplified requirements:
A user can work for a company
A company can work for a company
These "works for" relationships should be in the same table.
I have two ideas, and neither sits entirely well with my current disposition toward schema quality, and this is where I would like some insight.
One foreign-key column paired with a 'type' column
Two foreign-key columns, always at most one utilized (ick!)
In case you are a visual thinker, here are the two options representing the fact that users 123 and 632, as well as entity 123 all work for entity 435:
Option 1
+---------------+-------------+---------------+-------------+
| employment_id | employee_id | employee_type | employer_id |
+---------------+-------------+---------------+-------------+
| 1 | 123 | user | 435 |
+---------------+-------------+---------------+-------------+
| 2 | 123 | entity | 435 |
+---------------+-------------+---------------+-------------+
| 3 | 632 | user | 435 |
+---------------+-------------+---------------+-------------+
Option 2
+---------------+------------------+--------------------+-------------+
| employment_id | employee_user_id | employee_entity_id | employer_id |
+---------------+------------------+--------------------+-------------+
| 1 | 123 | <NULL> | 435 |
+---------------+------------------+--------------------+-------------+
| 2 | <NULL> | 123 | 435 |
+---------------+------------------+--------------------+-------------+
| 3 | 632 | <NULL> | 435 |
+---------------+------------------+--------------------+-------------+
Thoughts on option 1: I like that the employee_id column has a concrete role, but I despise that it has an ambiguous target. Option 2 has an ambiguous role (which column is the employee?), but a concrete target for any given FK, so I can think of it this way:
+------------------+-----------------------+
|                  |         ROLE          |
|                  | ambiguous | concrete  |
+------------------+-----------+-----------+
| TARGET ambiguous |           |     1     |
+------------------+-----------+-----------+
| TARGET concrete  |     2     |     ?     |
+------------------+-----------+-----------+
Option two has very pragmatic benefits for my project, but I do not feel comfortable with so many nulls (you might not even call it 1NF!)
So here's the crux of my question for SO: How can option 1 be improved, or else what knowledge gap might I have that leaves me unsettled? While I can't bring to mind a specific rule which it violates, the design clearly is not in keeping with the intentions of normalization (requiring two columns to uniquely identify a relationship is not doing me any favors for safeguarding against anomalies).
I do understand that the ideal solution would be to redesign the users entity to be the same as what I have been calling "entity" here, but please consider that beside the point/circumstantial (or at least let's draw the pragmatic line right exactly there for this question).
Again, the essential question: What, in terms of normalization, is wrong with schema option 1, and how might you model this relationship given the constraint of not refactoring "user" into "entity"?
note: For this, I am more interested in theoretical purity than a pragmatic solution
The solutions you present contravene 4th normal form, as @podiluska says. If this is recast into the form below, then the solution removes this difficulty and is in 5NF (and even 6NF?).
Adopt one of the patterns for sub/supertypes. This uses the relation definitions set out below, plus the super/subtype constraint: each tuple in the supertype relation must correspond to exactly one subtype tuple. In other words, the subtypes must form a disjoint, covering set over the supertype.
I suspect the performance of this in a real situation might require some heavy tuning:
Table: Employment
+---------------+-------------+
| employee_id | employer_id |
+---------------+-------------+
| 1 | 435 |
+---------------+-------------+
| 2 | 435 |
+---------------+-------------+
| 3 | 435 |
+---------------+-------------+
Table: Employee (SuperType)
+---------------+
| employee_id |
+---------------+
| 1 |
+---------------+
| 2 |
+---------------+
| 3 |
+---------------+
Table: User employee (SubType)
+---------------+-------------+
| employee_id | user_id |
+---------------+-------------+
| 1 | 123 |
+---------------+-------------+
| 3 | 632 |
+---------------+-------------+
Table: Entity employee (SubType)
+---------------+-------------+
| employee_id | entity_id |
+---------------+-------------+
| 2 | 123 |
+---------------+-------------+
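A minimal DDL sketch of that pattern (names are illustrative; the user_id and entity_id columns would point at Drupal's own users and entity tables, and the disjoint/covering rule itself still needs triggers or application code, since plain foreign keys only enforce the subtype-to-supertype direction):
CREATE TABLE employee (
    employee_id INT PRIMARY KEY
);
CREATE TABLE user_employee (
    employee_id INT PRIMARY KEY REFERENCES employee (employee_id),
    user_id     INT NOT NULL UNIQUE   -- would reference the users table
);
CREATE TABLE entity_employee (
    employee_id INT PRIMARY KEY REFERENCES employee (employee_id),
    entity_id   INT NOT NULL UNIQUE   -- would reference the entity table
);
CREATE TABLE employment (
    employee_id INT NOT NULL REFERENCES employee (employee_id),
    employer_id INT NOT NULL,          -- the employing company, in the entity table
    PRIMARY KEY (employee_id, employer_id)
);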
What is wrong with option 1 (and option 2) is that it contains a multivalued dependency and, as such, is a breach of 4th normal form. However, within the constraints you have given, there's not a lot you can do about that.
If you could replace the worksfor table with a view, then you could keep user-company and company-company relations separate.
Of your two choices, Option 2 has the advantage that it may be easier to enforce the referential integrity, depending on your platform.
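On that note, option 2's "exactly one of the two FK columns populated" rule can at least be stated declaratively with a check constraint; a sketch only, since enforcement varies by platform (older MySQL versions, for instance, parse but ignore CHECK):
-- Ensure exactly one of the two employee columns is filled in
ALTER TABLE employment
  ADD CONSTRAINT one_employee_reference CHECK (
       (employee_user_id IS NOT NULL AND employee_entity_id IS NULL)
    OR (employee_user_id IS NULL AND employee_entity_id IS NOT NULL)
  );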
One potential, if icky, pragmatic solution within your current constraints could be to give companies positive IDs and users negative IDs, which eliminates the empty column of option 2 and turns the type column of option 1 into an implication, but I feel dirty even suggesting it.
Similarly, if you don't need to know what type the entity is, as long as you can determine it via joining, then using GUIDs as IDs would eliminate the need for the type column.