Database Schema - Many-to-Many Normalisation - sql

I'm designing a schema where a case can have many forms attached and a form can be used for many cases. The Form table basically holds the structure of an HTML form, which gets rendered on the client side. When the form is submitted, the name/value pairs for the fields are stored separately. Is there any value in keeping the name/value attributes separate from the join table, as follows?
CREATE TABLE [Case] (
ID int NOT NULL PRIMARY KEY,
...
);
CREATE TABLE CaseForm (
CaseID int NOT NULL FOREIGN KEY REFERENCES [Case] (ID),
FormID int NOT NULL FOREIGN KEY REFERENCES Form (ID),
CONSTRAINT PK_CaseForm PRIMARY KEY (CaseID, FormID)
);
CREATE TABLE CaseFormAttribute (
ID int NOT NULL PRIMARY KEY,
CaseID int NOT NULL,
FormID int NOT NULL,
Name varchar(255) NOT NULL,
Value varchar(max) NULL,
CONSTRAINT FK_CaseFormAttribute_CaseForm FOREIGN KEY (CaseID, FormID) REFERENCES CaseForm (CaseID, FormID)
);
CREATE TABLE Form (
ID int NOT NULL PRIMARY KEY,
FieldsJson varchar(max) NOT NULL
);
Am I overcomplicating the schema, since the same many-to-many relationship can be achieved by turning the CaseFormAttribute table into the join table and getting rid of the CaseForm table altogether, as follows?
CREATE TABLE CaseFormAttribute (
ID int NOT NULL PRIMARY KEY,
CaseID int NOT NULL FOREIGN KEY REFERENCES [Case] (ID),
FormID int NOT NULL FOREIGN KEY REFERENCES Form (ID),
Name varchar(255) NOT NULL,
Value varchar(max) NULL
);
Basically what I'm trying to ask is which is the better design?

The main benefit of splitting up the two would depend on whether or not additional fields would ever be added to the CaseForm table. For instance, say that you want to record if a Form is incomplete. You may add an Incomplete bit field to that effect. Now, you have two main options for retrieving that information:
Clustered index scan on CaseForm
Create a nonclustered index on CaseForm.Incomplete which includes CaseID, FormID, and scan that
If you didn't split the tables, your two main options would be:
Clustered index scan on CaseFormAttribute
Create a nonclustered index on CaseFormAttribute.Incomplete which includes CaseID, FormID, and scan that
For the purposes of this example, query options 1 and 2 are roughly the same in terms of performance. Introducing the nonclustered index adds overhead in multiple ways. It's a little less streamlined than the clustered index (it may take more reads to scan in this particular example), it's additional storage space that CaseForm will take up, and the index has to be maintained for updates to the table. Option 4 will also perform similarly, with the same caveats as option 2. Option 3 will be your worst performer, as a clustered index scan will include reading all of the BLOB data in your Value field, even though it only needs the bit in Incomplete to determine whether or not to return that (Case, Form) pair.
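For concreteness, a sketch of what option 2 might look like (the Incomplete column is the hypothetical addition discussed above, and the index name is illustrative):
ALTER TABLE CaseForm ADD Incomplete bit NOT NULL DEFAULT 0; -- hypothetical flag
CREATE NONCLUSTERED INDEX IX_CaseForm_Incomplete
ON CaseForm (Incomplete) INCLUDE (CaseID, FormID);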
So it really does depend on what direction you're going in the future.
Also, if you stay with the split approach, consider shifting CaseFormAttribute.ID to CaseForm, and then use CaseForm.ID as your PK/FK in CaseFormAttribute. The caveat here is that we're assuming that all Forms will be inserted at the same time for a given Case. If that's not true, then you would invite some page splits because your inserts will be somewhat random, though still generally increasing.
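A minimal sketch of that shifted design (constraint names and the IDENTITY choice are illustrative):
CREATE TABLE CaseForm (
ID int IDENTITY NOT NULL PRIMARY KEY,
CaseID int NOT NULL FOREIGN KEY REFERENCES [Case] (ID),
FormID int NOT NULL FOREIGN KEY REFERENCES Form (ID),
CONSTRAINT UQ_CaseForm UNIQUE (CaseID, FormID) -- still one row per (Case, Form) pair
);
CREATE TABLE CaseFormAttribute (
ID int NOT NULL PRIMARY KEY,
CaseFormID int NOT NULL FOREIGN KEY REFERENCES CaseForm (ID),
Name varchar(255) NOT NULL,
Value varchar(max) NULL
);
The narrower single-column key also keeps each attribute row a little smaller.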

Related

Can I use identity for primary key in more than one table in the same ER model

As the title says, my question is: can I use int identity(1,1) for the primary key in more than one table in the same ER model? I found on the Internet that a primary key needs to hold a unique value in each row. For example, if I set int identity(1,1) for this table:
CREATE TABLE dbo.Persons
(
Personid int IDENTITY(1,1) PRIMARY KEY,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Age int
);
GO
and the other table
CREATE TABLE dbo.Job
(
jobID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
nameJob NVARCHAR(25) NOT NULL,
Personid int FOREIGN KEY REFERENCES dbo.Persons(Personid)
);
Wouldn't Personid and jobID have the same value and because of that cause an error?
Constraints in general are defined with a scope of one table (object) in the database. The only exception is the FOREIGN KEY, which usually has a REFERENCE to another table.
The PRIMARY KEY (or any UNIQUE key) sets a constraint only on the table it is defined on; it does not affect and is not affected by constraints on other tables.
The PRIMARY KEY defines a column or a set of columns which can be used to uniquely identify one record in one table. None of its columns can hold NULL; UNIQUE, on the other hand, allows NULLs, and how they are treated differs between database engines.
So yes, you might have the same value for PersonID and JobID, but their meaning is different. (And to select the one unique record, you will need to tell SQL Server in which table and in which column of that table you are looking for it, this is the table list and the WHERE or JOIN conditions in the query).
The queries SELECT * FROM dbo.Job WHERE jobID = 1; and SELECT * FROM dbo.Persons WHERE Personid = 1; have different meanings even when the value you are searching for is the same.
You will define the IDENTITY on the table (the table can have only one IDENTITY column). You don't need to have an IDENTITY definition on a column to have the value 1 in it, the IDENTITY just gives you an easy way to generate unique values per table.
You can share sequences across tables by using a SEQUENCE, but that will not prevent you from manually inserting the same values into multiple tables.
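For example, a minimal sketch of one SEQUENCE feeding the keys of two tables (SQL Server 2012+; the sequence and table names are illustrative):
CREATE SEQUENCE dbo.GlobalID START WITH 1 INCREMENT BY 1;
CREATE TABLE dbo.Customers (
CustomerID int NOT NULL PRIMARY KEY DEFAULT (NEXT VALUE FOR dbo.GlobalID),
Name nvarchar(100) NOT NULL
);
CREATE TABLE dbo.Suppliers (
SupplierID int NOT NULL PRIMARY KEY DEFAULT (NEXT VALUE FOR dbo.GlobalID),
Name nvarchar(100) NOT NULL
);
-- IDs drawn from the sequence never collide across the two tables,
-- but a manual INSERT can still supply any value it likes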
In short, the value stored in the column is just a value, the table name, the column name and the business rules and roles will give it a meaning.
To the notion "every table needs to have a PRIMARY KEY and IDENTITY", I would like to add that in most cases there are multiple (independent) keys in a table. Usually every entity has what you might call a business key: in loose terms, the key that the business (humans) use to identify something. This key has characteristics very similar to, but usually not the same as, a PRIMARY KEY with IDENTITY.
This can be a product's barcode, an employee's ID card number, something generated in another system (say HR), or a code assigned to a customer or partner.
These business keys are useful for humans, and while they are not always useful for computers, they could serve as the PRIMARY KEY.
In databases we (the developers, architects) like simplicity, and a business key can be very complex in computer terms: it can consist of multiple columns and can cause performance issues (comparing strings is not the same as comparing numbers, and comparing multiple columns is less efficient than comparing one column), but worst of all, it might change over time. To resolve this, we tend to create our own technical key, which computers can use more easily and which we have more control over, so we use things like IDENTITYs and GUIDs and whatnot.
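For example, a sketch of a surrogate key living alongside a business key (the table and columns are made up for illustration):
CREATE TABLE dbo.Products
(
ProductID int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- technical key, referenced by FKs
Barcode varchar(13) NOT NULL UNIQUE, -- business key, used by humans and may change
ProductName nvarchar(100) NOT NULL
);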

sqlite text as primary key vs autoincrement integers

I'm currently debating between two strategies to using a text column as a key.
The first one is to simply use the text column itself as a key, as such:
create table a(
key_a text primary key
)
create table b(
key_b text primary key
)
create table c(
key_a text,
key_b text,
foreign key("key_a") references a("key_a"),
foreign key("key_b") references b("key_b")
)
I'm concerned that this would result in every key being duplicated: once in a or b, and again in c, since text isn't stored inline.
My second approach is to use an autoincrement id on the first two tables as a primary key, and use those ids on table c to refer to them, as such:
create table a(
id_a integer,
key_a text unique,
primary key("id_a" autoincrement)
)
create table b(
id_b integer,
key_b text unique,
primary key("id_a" autoincrement)
)
create table c(
id_a integer,
id_b integer,
foreign key("id_a") references a("id_a"),
foreign key("id_b") references b("id_b")
)
Am I right to be concerned about text duplication in the first case? Or does sqlite somehow intern these and just use an id for both, akin to what the second strategy does?
SQLite does not automatically compress text. So the answer to your question is "no".
Should you use text or an auto-incrementing id as the primary key? This can be a complex question. But happily, the answer is that it doesn't make much difference. That said, there are some considerations:
Integers are of fixed length. In general, fixed-length keys are slightly more efficient in B-tree indexes than variable-length keys.
If the strings are short (like 1 or 2 or 3 characters), then they may be shorter -- or no longer -- than integers.
If you change the string (say, if it is originally misspelled), then using an "artificial" primary key makes this easy: just change the value in one table. Using the string itself as a key can result in lots of updates to lots of tables.
Am I right to be concerned about text duplication in the first case? Or does sqlite somehow intern these and just use an id for both, akin to what the second strategy does?
Yes, you are right to be concerned. The text will be duplicated.
Also, even if you did not define an integer primary key in your 1st approach, there is one.
From Rowid Tables:
The PRIMARY KEY of a rowid table (if there is one) is usually not the true primary key for the table, in the sense that it is not the unique key used by the underlying B-tree storage engine. The exception to this rule is when the rowid table declares an INTEGER PRIMARY KEY. In the exception, the INTEGER PRIMARY KEY becomes an alias for the rowid. The true primary key for a rowid table (the value that is used as the key to look up rows in the underlying B-tree storage engine) is the rowid.
In your 2nd approach, you are not actually creating a new column in each of the tables a and b by defining an integer primary key.
What you are doing is aliasing the existing rowid column:
id_a becomes the alias of rowid of the table a
id_b becomes the alias of rowid of the table b.
So, defining these integer primary keys is not more expensive in terms of space in the parent tables.
Although with your 1st approach you could avoid explicit updates in the child tables when you update a value in the parent tables (by defining the foreign keys with ON UPDATE CASCADE), your 2nd approach is what I would suggest.
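For reference, the cascade variant of the 1st approach would look like this (a sketch; SQLite only enforces it when PRAGMA foreign_keys = ON):
create table c(
key_a text,
key_b text,
foreign key("key_a") references a("key_a") on update cascade,
foreign key("key_b") references b("key_b") on update cascade
)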
An integer primary key whose value is assigned by the system, and which you don't even have to know or worry about, is common practice.
All you have to do is use that primary key and its corresponding foreign keys in the queries you write against the parent tables when you want to fetch the text values from them.
For performance (and as good database practice in general) you should stick to a numeric/int value for the primary key.
As for the second approach, I'm not getting the concept you are after. Could you elaborate more on this?

Storing single form table questions in 1 or multiple tables

I have been coding ASP.NET forms inside web applications for a long time now. Generally, most web apps have a user who logs in, picks a form to fill out, and answers questions, so your table looks like this:
Table: tblInspectionForm
Fields:
inspectionformid (either autoint or guid)
userid (user ID who entered it)
datestamp (added, modified, whatever)
Question1Answer: boolean (maybe a yes/no)
Question2Answer: int (maybe foreign key for sub table 1 with dropdown values)
Question3Answer: int (foreign key for sub table 2 with dropdown values)
If I'm not mistaken, this meets both 2nd and 3rd normal forms. You're not storing user names in the tables, just the IDs. You aren't storing the dropdown or yes/no values in Questions 1-3, just the IDs of other tables.
However, IF all the questions have exactly the same data type (assume there's no Q1, or that Q1 is also an int) and link to the exact same foreign key (e.g. a form that has 20 questions, all on a 1-10 scale or with the same answers to choose from), would it be better to do something like this?
So... Table: tblInspectionForm
userid (user ID who entered it)
datestamp (added, modified, whatever)
...and that's it for table 1. Then:
Table2: tblInspectionAnswers
inspectionformid (composite key that links back to table1 record)
userid (composite key that links back to table1 record)
datestamp (composite key that links back to table1 record)
QuestionIDNumber: int (question 1, question 2, question3)
QuestionAnswer: int (foreign key)
This wouldn't just apply to forms that only have the same types of answers for a single form. Maybe your form has 10 of these 1-10 ratings (int), 10 boolean-valued questions, and then 10 freeform. You could break it into three tables.
The disadvantage would be that when you save the form, you're making one call for every question on your form. The upside is that if you have a lot of nightly integrations or replications that pull your data and you decide to add a new question, you don't have to manually modify any replications to reporting data sources or anything else that's been designed to read/query your form data. If you originally had 20 questions and you deploy a change to your app that adds a 21st, it will automatically get pulled into any outside replications, data sources, and reporting that query this data. Another advantage is that if you have a really long form (this happens a lot in, say, the real estate industry, where inspection forms can have hundreds of questions that push past the 8 KB limit for a table row), you won't run into row-size problems.
Would this kind of scenario ever been the preferred way of saving form data?
As a rule of thumb, whenever you see a set of columns with numbers in their names, you know the database is poorly designed.
What you want to do in most cases is have a table for the form / questionnaire, a table for the questions, a table for the potential answers (for multiple-choice questions), and a table for answers that the user chooses.
You might also need a table for question type (i.e. free-text, multiple-choice, yes/no).
Basically, the schema should look like this:
create table Forms
(
id int identity(1,1) not null primary key,
name varchar(100) not null, -- with a unique index
-- other form related fields here
)
create table QuestionTypes
(
id int identity(1,1) not null primary key,
name varchar(100) not null -- with a unique index
)
create table Questions
(
id int identity(1,1) not null primary key,
form_id int not null foreign key references Forms(id),
type_id int not null foreign key references QuestionTypes(id),
content varchar(1000)
)
create table Answers
(
id int identity(1,1) not null primary key,
question_id int not null foreign key references Questions(id),
content varchar(1000)
-- For quizzes, uncomment the next row:
--, isCorrect bit not null
)
create table Results
(
id int identity(1,1) not null primary key,
form_id int not null foreign key references Forms(id)
-- in case only registered users can fill the form, uncomment the next row:
--, user_id int not null foreign key references Users(id)
)
create table UserAnswers
(
result_id int not null foreign key references Results(id),
question_id int not null foreign key references Questions(id),
answer_id int null foreign key references Answers(id), -- null for free-text questions
content varchar(1000) null -- for free-text questions
)
This design will require a few joins when generating the forms (and if you have multiple forms per application, you just add an application table that the form can reference), and a few joins to get the results, but it's the best dynamic forms database design I know.
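For instance, fetching all the answers for one submitted form might look roughly like this (a sketch against the schema above; @result_id is a placeholder for the submission in question):
select q.content as question,
coalesce(ua.content, a.content) as answer
from Results r
join UserAnswers ua on ua.result_id = r.id
join Questions q on q.id = ua.question_id
left join Answers a on a.id = ua.answer_id -- null for free-text questions
where r.id = @result_id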
I'm not sure whether it's "preferred" but I have certainly seen that format used commercially.
You could potentially make the secondary table more flexible with multiple answer columns (answer_int, answer_varchar, answer_datetime) and assign each question a type value that tells you which column to read the answer from.
So if q_type = 2 you know to look in answer_varchar, whereas q_type = 1 tells you the answer is an int that requires a lookup (the name of the lookup table could also be specified with the question and stored in a column).
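A minimal sketch of that more flexible secondary table (the table, column names, and type codes are illustrative, not from the question):
create table tblInspectionAnswers
(
inspectionformid int not null,
questionid int not null,
q_type int not null, -- 1 = int (lookup), 2 = varchar, 3 = datetime
answer_int int null,
answer_varchar varchar(1000) null,
answer_datetime datetime null,
primary key (inspectionformid, questionid)
)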
I use an application at the moment which splits answers into combobox, textfield, numeric, date, etc. in this fashion. The application actually uses a JSON form and splits out the data into the separate columns as it saves. It's a bit flawed in that it saves JSON into these columns, but the principle can work.
You could go with a single identity field for the parent table key that the child table would reference.

Performance: Typed Column vs Distinct Tables

Are there differences between distinct tables and type columns in terms of performance or query optimization?
For example:
Create Table AllInOne(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null,
OneType Integer Not Null
)
where OneType only receives the integer values 1, 2, or 3.
Versus the following architecture:
Create Table One(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null
)
Create Table Two(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null
)
Create Table Three(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null
)
Another possible architecture:
Create Table Root(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null
)
Create Table One(
[Key] Integer Primary Key references Root
)
Create Table Two(
[Key] Integer Primary Key references Root
)
Create Table Three(
[Key] Integer Primary Key references Root
)
In the 3rd approach, all common data is kept in Root, which has one-to-one relationships with the One, Two, and Three tables.
I asked my teacher some time ago and he couldn't say whether there is any difference.
Let's suppose I have to choose between these three approaches.
Assume that commonly used queries filter on the type, and that there are no child tables referencing these.
To make it easier to understand, let's think about a payroll system.
One = Incomings
Two = Discounts
Three = Base for calculation.
Having separate tables, as in (2), will mean that someone who needs to access data for a particular OneType can ignore data for the other types, thereby doing less I/O for a table scan. Also, indexes on the tables in (2) would be smaller and potentially of lesser height, meaning fewer I/Os for index accesses.
Given the low selectivity of OneType (only three distinct values), indexes would not help filtering in (1). However, table partitioning could be used to get all the benefits mentioned above.
There would also be an additional benefit. When querying (2), you need to know which OneType you need in order to know which table to query. In a partitioned version of (1), partition elimination for unneeded partitions can happen through values supplied in a where-clause predicate, making the process much easier.
Other benefits include easier database management (when you add a column to a partitioned table, it gets added to all partitions) and easier scaling (adding partitions for new OneType values is easy). Also, as mentioned, the table can be targeted by foreign keys.
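A hedged sketch of the partitioned version of (1) in SQL Server (the boundary values and object names are illustrative; note the partitioning column must be part of the clustered key):
Create Partition Function pfOneType (Integer)
As Range Left For Values (1, 2) -- three partitions: <= 1, = 2, > 2
Create Partition Scheme psOneType
As Partition pfOneType All To ([PRIMARY])
Create Table AllInOne(
[Key] Integer Identity Not Null,
[Desc] varchar(20) Not Null,
OneType Integer Not Null,
Constraint PK_AllInOne Primary Key ([Key], OneType)
) On psOneType (OneType)
-- partition elimination: only the OneType = 2 partition is scanned
Select [Key], [Desc] From AllInOne Where OneType = 2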

Where do you store ad-hoc properties in a relational database?

Let's say you have a relational DB table like INVENTORY_ITEM. It's generic in the sense that anything that's in inventory needs a record here. Now let's say there are tons of different types of inventory, and each different type might have unique fields it wants to keep track of (e.g. forks might track the number of tines, but refrigerators would have no use for that field). These fields must be user-definable per category type.
There are many ways to solve this:
Use ALTER TABLE statements to actually add nullable columns on the fly (yuk)
Have two tables with a one-to-one mapping, INVENTORY_ITEM, and INVENTORY_ITEM_USER, and use ALTER TABLE statements to add and remove nullable columns from the latter table on the fly (a bit nicer).
Add a CUSTOM_PROPERTY table, and a CUSTOM_PROPERTY_VALUE table, and add/remove rows in CUSTOM_PROPERTY when the user adds and removes rows, and store the values in the latter table. This is nice and generic, but the performance would suffer. If you had an average of 20 values per item, the number of rows in CUSTOM_PROPERTY_VALUE goes up at 20 times the rate, and you still need to include columns in CUSTOM_PROPERTY_VALUE for every different data type that you might want to store.
Have one big varchar(MAX) field on INVENTORY_ITEM to store custom properties as XML.
I guess you could have individual tables for each category type that hang off the INVENTORY_ITEM table; these get created/destroyed on the fly when the user creates inventory types, and their columns get updated when the user adds/removes properties on those types. Seems messy though.
Is there a best-practice for this? It seems to me that option 4 is clean, but doesn't allow you to easily search by the metadata. I've used a variant of 3 before, but only on a table that had a really small number of rows, so performance wasn't an issue. It always seemed to me that 2 was a good idea, but it doesn't fit well with auto-generated entity frameworks, so you'd have to exclude the custom properties table from the entity generation and just write your own custom data access code to handle it.
Am I missing any alternatives? Is there a way for SQL server to "look into" XML data in a column so it could actually do stuff with option 4 now?
I am using the xml type column for this kind of situation...
http://msdn.microsoft.com/en-us/library/ms189887.aspx
Before xml we had to use option 3, which in my view is still a good way to do it, especially if you have a data access layer that can handle the type conversions properly for you. We stored everything as string values and defined a column that held the original data type for the conversion.
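A minimal sketch of option 4 with an xml column (the column and the props element names are made up for illustration):
ALTER TABLE INVENTORY_ITEM ADD CustomProps xml NULL;
-- find all items whose custom property tineCount is 4
SELECT *
FROM INVENTORY_ITEM
WHERE CustomProps.value('(/props/tineCount)[1]', 'int') = 4;
-- an XML index can speed these lookups up (requires a clustered primary key on the table)
CREATE PRIMARY XML INDEX PXML_InventoryItem_CustomProps
ON INVENTORY_ITEM (CustomProps);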
Options 1 and 2 are a no-go. Don't change the database schema in production on the fly.
Option 5 could be done in a separate database... But still no control over the schema and the user would need the rights to create tables etc.
Definitely option 3. Sometimes option 4, if you have a very good reason to do so.
Do not ever dynamically modify the database structure to accommodate incoming data. One day something could break and damage your database. It is simply not done this way.
3 or 4 are the only ones I would consider - you don't want to be changing the schema on the fly, especially if you're using some kind of mapping layer.
I've generally gone with option 3. As a bit of a sanity check, I always have a type column in the CUSTOM_PROPERTY table, which is repeated in the CUSTOM_PROPERTY_VALUE table. By adding a superkey of <Primary Key, Type> to the CUSTOM_PROPERTY table, you can then have a foreign key that references it (as well as the simpler foreign key to just the primary key). And finally, a check constraint ensures that only the relevant column in CUSTOM_PROPERTY_VALUE is non-null, based on this type column.
In this way, you know that if someone has defined a CUSTOM_PROPERTY, say, Tine count, of type int, that you're actually only ever going to find an int stored in the CUSTOM_PROPERTY_VALUE table, for all instances of this property.
Edit
If you need it to reference multiple entity tables, then it can get more complex, especially if you want full referential integrity. For instance (with two distinct entity types in the database):
create table dbo.Entities (
EntityID uniqueidentifier not null,
EntityType varchar(10) not null,
constraint PK_Entities PRIMARY KEY (EntityID),
constraint CK_Entities_KnownTypes CHECK (
EntityType in ('Foo','Bar')),
constraint UQ_Entities_KnownTypes UNIQUE (EntityID,EntityType)
)
go
create table dbo.Foos (
EntityID uniqueidentifier not null,
EntityType as CAST('Foo' as varchar(10)) persisted,
FooFixedProperty1 int not null,
FooFixedProperty2 varchar(150) not null,
constraint PK_Foos PRIMARY KEY (EntityID),
constraint FK_Foos_Entities FOREIGN KEY (EntityID) references dbo.Entities (EntityID) on delete cascade,
constraint FK_Foos_Entities_Type FOREIGN KEY (EntityID,EntityType) references dbo.Entities (EntityID,EntityType)
)
go
create table dbo.Bars (
EntityID uniqueidentifier not null,
EntityType as CAST('Bar' as varchar(10)) persisted,
BarFixedProperty1 float not null,
BarFixedProperty2 int not null,
constraint PK_Bars PRIMARY KEY (EntityID),
constraint FK_Bars_Entities FOREIGN KEY (EntityID) references dbo.Entities (EntityID) on delete cascade,
constraint FK_Bars_Entities_Type FOREIGN KEY (EntityID,EntityType) references dbo.Entities (EntityID,EntityType)
)
go
create table dbo.ExtendedProperties (
PropertyID uniqueidentifier not null,
PropertyName varchar(100) not null,
PropertyType int not null,
constraint PK_ExtendedProperties PRIMARY KEY (PropertyID),
constraint CK_ExtendedProperties CHECK (
PropertyType between 1 and 4), --Or make type a varchar, and change check to IN('int', 'float'), etc
constraint UQ_ExtendedProperty_Names UNIQUE (PropertyName),
constraint UQ_ExtendedProperties_Types UNIQUE (PropertyID,PropertyType)
)
go
create table dbo.PropertyValues (
EntityID uniqueidentifier not null,
PropertyID uniqueidentifier not null,
PropertyType int not null,
IntValue int null,
FloatValue float null,
DecimalValue decimal(15,2) null,
CharValue varchar(max) null,
EntityType varchar(10) not null,
constraint PK_PropertyValues PRIMARY KEY (EntityID,PropertyID),
constraint FK_PropertyValues_ExtendedProperties FOREIGN KEY (PropertyID) references dbo.ExtendedProperties (PropertyID) on delete cascade,
constraint FK_PropertyValues_ExtendedProperty_Types FOREIGN KEY (PropertyID,PropertyType) references dbo.ExtendedProperties (PropertyID,PropertyType),
constraint FK_PropertyValues_Entities FOREIGN KEY (EntityID) references dbo.Entities (EntityID) on delete cascade,
constraint FK_PropertyValues_Entitiy_Types FOREIGN KEY (EntityID,EntityType) references dbo.Entities (EntityID,EntityType),
constraint CK_PropertyValues_OfType CHECK (
(IntValue is null or PropertyType = 1) and
(FloatValue is null or PropertyType = 2) and
(DecimalValue is null or PropertyType = 3) and
(CharValue is null or PropertyType = 4)),
--Shoot for bonus points
FooID as CASE WHEN EntityType='Foo' THEN EntityID END persisted,
constraint FK_PropertyValues_Foos FOREIGN KEY (FooID) references dbo.Foos (EntityID),
BarID as CASE WHEN EntityType='Bar' THEN EntityID END persisted,
constraint FK_PropertyValues_Bars FOREIGN KEY (BarID) references dbo.Bars (EntityID)
)
go
--Now we wrap up inserts into the Foos, Bars, and PropertyValues tables as either stored procs or INSTEAD OF triggers,
--to get the proper additional columns and/or base tables populated
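For instance, a wrapper for inserting a Foo might look like this (a sketch only; the procedure name is illustrative and error handling is omitted):
create procedure dbo.InsertFoo
@EntityID uniqueidentifier,
@FooFixedProperty1 int,
@FooFixedProperty2 varchar(150)
as
begin
set nocount on;
begin transaction;
-- parent row first, so the foreign keys on dbo.Foos are satisfied
insert dbo.Entities (EntityID, EntityType) values (@EntityID, 'Foo');
insert dbo.Foos (EntityID, FooFixedProperty1, FooFixedProperty2)
values (@EntityID, @FooFixedProperty1, @FooFixedProperty2);
commit transaction;
end
go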
My inclination would be to store things as XML if the database supports that nicely, or else have a small number of different tables for different data types (try to format data so it will fit one of a small number of types; don't use one table for VARCHAR(15), another for VARCHAR(20), etc.). Something like #5, but with all tables pre-created and everything shoehorned into the existing tables. Each row should hold a main-record ID, a record-type indicator, and a piece of data. Set up an index based on record type, subsorted by data, and it will be possible to query for particular field values (WHERE RecType = 19 AND Data = 'Fred'). Querying for records that match multiple field values would be harder, but such is life.
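A minimal sketch of that layout for the string case (table and column names are illustrative):
CREATE TABLE PropertyStrings (
MainID int NOT NULL, -- ID of the main record
RecType int NOT NULL, -- which user-defined field this is
Data varchar(255) NOT NULL,
CONSTRAINT PK_PropertyStrings PRIMARY KEY (MainID, RecType)
);
-- index by record type, subsorted by data, to support value lookups
CREATE INDEX IX_PropertyStrings_TypeData ON PropertyStrings (RecType, Data);
-- all main records where field 19 holds 'Fred'
SELECT MainID FROM PropertyStrings WHERE RecType = 19 AND Data = 'Fred';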