Database Design - Optional Bit Fields - sql

I'm building a database that will be used to store Questions and Answers. There are varying Question types that deal so far with a DateTime answer, Free Text, DropDownList, and some that link to other tables in the database. My design question is this: Some Question types have other boolean attributes that are unique to that type. Is it better to have the boolean column in the generic Questions table or create some sort of Flag table for each question type?
As an example, a DropDownList Question might have a boolean attribute to tell whether or not to display a TextBox when a value "Other" is selected, but a Free Text Question would have no use for this.
Thanks heaps
EDIT:
I guess it boils down to this: is it better to store unused columns in a generic Questions table, or to extend out each Question type into its own table with lots of keys back to the Question table, using Views to access the data for the various Question types?

Strip out all the extra attributes from the base question table, add a field for the 'Question Type', and create a set of tables, one per question type. In your application code, based on the question's type, retrieve the row from the particular question-type table and use it.
An example:
Base Question Table: t_question <QuestionID, Question, QuestionType, QuestionTypeLink>
Let's say you have two question types: Comprehensive and Simple. Create a table for each of them, with schemas t_compflags <linkID, field1, field2...> and t_simpleflags <linkID, field1, field2...>.
Now in the base question table, QuestionType takes one of two values: Comprehensive or Simple. Based on that field, the application uses the QuestionTypeLink to refer to the corresponding row in one of the two tables.
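For instance, retrieving a comprehensive question's flags would look something like this (a sketch against the schemas above; field1/field2 stand in for the real flags):
SELECT q.QuestionID, q.Question, f.field1, f.field2
FROM t_question q
JOIN t_compflags f ON f.linkID = q.QuestionTypeLink
WHERE q.QuestionID = @QuestionID -- parameter supplied by the application
AND q.QuestionType = 'Comprehensive'; -- the application picks which flags table to join on this value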
EDIT:
You can't directly enforce a PK-FK constraint on these tables; you have to do that in application code. But if you would like to enforce that constraint, there is a dirty way of doing it: instead of QuestionTypeLink, have two columns, CompQuestionTypeLink and SimpQuestionTypeLink, which allow NULLs and reference the two flag tables. But I personally think this is a bad design.
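For illustration, that variant would look like the sketch below; the CHECK constraint is my addition to keep the two links mutually exclusive:
CREATE TABLE t_question (
QuestionID INT IDENTITY(1, 1) PRIMARY KEY,
Question VARCHAR(512) NOT NULL,
QuestionType VARCHAR(16) NOT NULL,
CompQuestionTypeLink INT NULL REFERENCES t_compflags (linkID),
SimpQuestionTypeLink INT NULL REFERENCES t_simpleflags (linkID),
-- at most one of the two links may be populated
CONSTRAINT CK_t_question_OneLink CHECK (CompQuestionTypeLink IS NULL OR SimpQuestionTypeLink IS NULL)
);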

This depends entirely on how much normalisation you want to do and how many columns you're talking about.
If you are expecting quite a number of them, then you should have a 1:1 table relationship simply to extend that question type. Something like:
Create Table QuestionType_DropDownList
(QuestionID int Primary Key References Questions (QuestionID), -- the 1:1 link back to the generic Questions table
OtherDisplay bit,
SomethingElse bit)
This is easier to read and easier to query. But it's not easily maintainable. It is unfortunately very much a pros/cons thing.
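For example, pulling a drop-down question together with its type-specific flags is then a single join (a sketch, assuming the generic Questions table carries a QuestionID key and the extension table links to it as above):
Select q.QuestionID, d.OtherDisplay, d.SomethingElse
From Questions q
Join QuestionType_DropDownList d On d.QuestionID = q.QuestionID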
In my experience I would pick this solution as you never know what the future may hold.

Depending on how many combinations you have, you could just express each combination as its own type:
DateTime
DropDownList
DropDownListWithOptionalOther
FreeText
FreeTextNumbersOnly
...
This flattens your design a little at the expense of a potential combinatorial explosion. But I don't know how many combinations you have, or will have.
Could you include the text box automatically if you have a DropDownList choice of "Other?" Or would there be a case when the user wouldn't have to specify what "other" is?
If you have too many combinations to consider, then it still sounds like you'll need to specify the flags on a per-question basis, so it makes sense to include another field in the Questions table for these flags. Maybe store them as plain text, like a comma-separated list of flags, so you can extend it later if you need to?
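A sketch of that plain-text variant (the column and flag names are made up; note that searching a delimited list means a LIKE scan rather than an index seek):
ALTER TABLE Questions ADD Flags VARCHAR(256) NULL; -- e.g. 'ShowOtherTextBox,NumbersOnly'
SELECT QuestionID
FROM Questions
WHERE ',' + Flags + ',' LIKE '%,ShowOtherTextBox,%'; -- wrap in commas so only whole flags match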

I am mulling this over as a possible solution; it seems more abstracted to me and allows for the most future extension.
CREATE TABLE dbo.QuestionTypes
(
Id INT IDENTITY(1, 1) PRIMARY KEY,
Type VARCHAR(256) NOT NULL
);
CREATE TABLE dbo.TypeSpecificFlags
(
Id INT IDENTITY(1, 1) PRIMARY KEY,
TypeId INT REFERENCES dbo.QuestionTypes(Id) NOT NULL,
Flag VARCHAR(256) NOT NULL
);
CREATE TABLE dbo.Questions
(
Id INT IDENTITY(1, 1) PRIMARY KEY,
Name VARCHAR(256) NOT NULL,
ShortName VARCHAR(32),
TypeId INT REFERENCES dbo.QuestionTypes(Id) NOT NULL,
AllowNulls BIT NOT NULL DEFAULT 1,
Sort INT
);
CREATE TABLE dbo.QuestionsFlags
(
Id INT IDENTITY(1, 1) PRIMARY KEY,
QuestionId INT REFERENCES dbo.Questions(Id) NOT NULL,
FlagId INT REFERENCES dbo.TypeSpecificFlags(Id) NOT NULL,
Answer BIT NOT NULL
);
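Reading back the flag answers for one question under this schema is then a pair of joins, e.g.:
SELECT q.Name, f.Flag, qf.Answer
FROM dbo.Questions q
JOIN dbo.QuestionsFlags qf ON qf.QuestionId = q.Id
JOIN dbo.TypeSpecificFlags f ON f.Id = qf.FlagId
WHERE q.Id = @QuestionId; -- parameter supplied by the application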

Related

What table structure to go for if there are two objects of same type but of different nature?

Given that there are two kinds of products, X and Y. X has A, B and C as its primary key whereas Y has A and D as its primary key. Should I put them in the same table? Why should I, and if I should not, then why is that?
I have currently put them in two separate tables but some colleagues are suggesting that they belong in the same table. My question is should I consider putting them in the same table or continue with different tables?
Below I have given example tables for the above case.
CREATE TABLE `product_type_b` (
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`COMPONENT_CODE` VARCHAR(50) NOT NULL,
`GROUP_INDICATOR` VARCHAR(50) NULL DEFAULT NULL,
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_CODE`, `COMPONENT_CODE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `product_type_a` (
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`CHOICE_OF_COVER` VARCHAR(50) NOT NULL,
`PLAN_TYPE` VARCHAR(50) NOT NULL,
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
`PRODUCT_TENURE` INT(11) NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_CODE`, `CHOICE_OF_COVER`, `PLAN_TYPE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
As you can see there are certain fields that are not common to both tables but are part of the primary key. There are also some other fields which are not common to both tables.
Here is the bigger picture of the system in consideration.
Each product type has a different source from where it is sent to the system.
We need to store these products in the database.
I would like to have a balance between normalization and performance so that my read-write speeds aren't compromised much due to over-normalization.
There is also a web-app which will have a page where these products are searchable by the user.
The user will populate certain column fields as filters, based on which we need to fetch the products and show them in the UI.
The variation in subtypes is currently 2 and is not expected to increase beyond 4-5, and that over a decade maybe. This again is an approximation.
I hope this presents a bigger picture of the system.
I want to have good read and write speeds without compromising much. So should I go ahead with this design? If not, what design should be implemented?
For a trading system, and taking into account max 5 product types and a very limited number of attributes, I'd prefer a single table for all products with a surrogate PK. Think about references to products from trading transactions; this is the biggest part of the total DB content in the long run.
A metadata table describing every product-specific attribute and its mapping to the general table column would help to build UI and backend/frontend communications.
Search indexes would reflect the most popular user searches depending on product type.
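A sketch of that shape, in the MySQL dialect of the tables above (the metadata table's layout is my guess at what was described; names are illustrative):
CREATE TABLE product (
PRODUCT_ID INT AUTO_INCREMENT PRIMARY KEY, -- surrogate PK for trading transactions to reference
PRODUCT_TYPE CHAR(1) NOT NULL, -- 'A' or 'B'
PRODUCT_CODE VARCHAR(50) NOT NULL,
COMPONENT_CODE VARCHAR(50) NULL, -- type B only
GROUP_INDICATOR VARCHAR(50) NULL, -- type B only
CHOICE_OF_COVER VARCHAR(50) NULL, -- type A only
PLAN_TYPE VARCHAR(50) NULL, -- type A only
PRODUCT_TENURE INT NULL, -- type A only
RECORD_TIMESTAMP DATE NULL
) ENGINE=InnoDB;
CREATE TABLE product_attribute_meta (
PRODUCT_TYPE CHAR(1) NOT NULL,
COLUMN_NAME VARCHAR(64) NOT NULL, -- which product column holds this attribute
DISPLAY_LABEL VARCHAR(100) NOT NULL, -- what the UI should call it
PRIMARY KEY (PRODUCT_TYPE, COLUMN_NAME)
);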
This is the typical Category/SubCategory model issue. There are a few options:
(1) Put everything in one table, which will have some columns NULLable because different subtypes do not have the same attributes;
(2) One parent table for all the common attributes, plus a type indication column; then each subtype has its own table just for the columns that subtype has;
(3) Each subtype has its own table, including all the common columns of all the subtypes.
(1) is good if the subtypes are very limited;
(3) is suitable if the variations of the subtypes are very limited.
The advantage of (2) is that it is easy to return all the records with the common columns. And if an artificial key (like an auto-increment id) is used, it ensures all records, regardless of subtype, have a unique id.
In your case, where no artificial PK is used, I think your choice is not bad.
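For reference, option (2) applied to the products above would look roughly like this (names illustrative):
CREATE TABLE product_common (
PRODUCT_ID INT AUTO_INCREMENT PRIMARY KEY, -- artificial key shared by all subtypes
PRODUCT_TYPE CHAR(1) NOT NULL, -- the type indication column
PRODUCT_CODE VARCHAR(50) NOT NULL,
RECORD_TIMESTAMP DATE NULL
);
CREATE TABLE product_type_a_ext (
PRODUCT_ID INT PRIMARY KEY, -- same key as the parent row
CHOICE_OF_COVER VARCHAR(50) NOT NULL,
PLAN_TYPE VARCHAR(50) NOT NULL,
PRODUCT_TENURE INT NULL,
FOREIGN KEY (PRODUCT_ID) REFERENCES product_common (PRODUCT_ID)
);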

Combine multiple outrigger tables into one?

I have a dimensional table and several outriggers.
create table dimFoo (
FooKey int primary key,
......
)
create table triggerA (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
create table triggerB (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
create table triggerC (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
Should these outrigger tables be merged into one table?
create table Triggers (
FooKey int references dimFoo (FooKey),
TriggerType varchar(20), -- triggerA, triggerB, triggerC, etc.
Value varchar(255),
primary key nonclustered (FooKey, TriggerType, Value)
)
For this kind of scenario, such as a dimCustomer where customers potentially have multiple hobbies, the typical Kimball approach is to use a Bridge table between the dimensions (dimCustomer and dimHobby).
This link provides a summary of how bridge tables could solve this problem and also alternatives which may work better for you.
Without knowing more about your specific scenario, including what the business requirements are, how many of these value types you have, how 'uniform' the various value types and values are, and the BI technology you'll be using for accessing the data, it's hard to give a definitive answer to whether you should combine the bridges into one uber-bridge that caters for the various many-to-manys. All of the above influence the answer to some extent.
Typically the 'generic tables' approach is more useful behind the scenes for administration than it is for presenting for analytics. My default approach would be to have specific bridge tables until/unless this became unmanageable from an ETL perspective or was perceived as much more complex from a user-query perspective. I wouldn't look to 'optimise' to a combined table from the get-go.
If your situation is outside the usual norms (do you have three as per your example, or ten?), combining could well be a good idea. This would make it more like a factless fact, with dimensions of dimCustomer, dimValueType and dimValue, and would be a perfectly reasonable solution.
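As a concrete sketch of the Kimball bridge pattern mentioned above (assuming a dimCustomer keyed on CustomerKey; names illustrative):
create table dimHobby (
HobbyKey int primary key,
HobbyName varchar(255)
)
create table bridgeCustomerHobby (
CustomerKey int references dimCustomer (CustomerKey),
HobbyKey int references dimHobby (HobbyKey),
primary key nonclustered (CustomerKey, HobbyKey)
)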

Variable amount of sets as SQL database tables

More of a question concerning the database model for a specific problem. The problem is as follows:
I have a number of objects that make up the rows in a fixed table; they are all distinct (of course). One would like to create sets that contain a variable number of these stored objects. These sets would be user-defined, therefore no hard-coding. Each set will be characterized by a number.
My question is: what advice can you experienced SQL programmers give me to implement such a feature? My most direct approach would be to create a table for each such set, using table variables or temporary tables, plus an already-present table that contains the names of the sets (as a way to let the user know what sets are currently present in the database).
If this is not efficient, what direction should I be looking in to solve this?
Thanks.
Table variables and temporary tables are short-lived, narrow in scope, and probably not what you want to use for this. One table for each Set is also not a solution I would choose.
By the sound of it you need three tables. One for Objects, one for Sets and one for the relationship between Objects and Sets.
Something like this (using SQL Server syntax to describe the tables).
create table [Object]
(
ObjectID int identity primary key,
Name varchar(50)
-- more columns here necessary for your object.
)
go
create table [Set]
(
SetID int identity primary key,
Name varchar(50)
)
go
create table [SetObject]
(
SetID int references [Set](SetID),
ObjectID int references [Object](ObjectID),
primary key (SetID, ObjectID)
)
The SetObject table above implements the m:m relation between sets and objects.
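Listing the members of a set is then a straightforward join, e.g.:
select o.ObjectID, o.Name
from [SetObject] so
join [Object] o on o.ObjectID = so.ObjectID
where so.SetID = @SetID -- the set the user picked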

Generic Database table design

Just trying to figure out the best way to design my table for the following scenario:
I have several areas in my system (documents, projects, groups and clients) and each of these can have comments logged against them.
My question is should I have one table like this:
CommentID
DocumentID
ProjectID
GroupID
ClientID
etc
Where only one of the ids will have data and the rest will be NULL, or should I have a separate CommentType table and have my comments table like this:
CommentID
CommentTypeID
ResourceID (this being the id of the project/doc/client)
etc
My thoughts are that option 2 would be more efficient from an indexing point of view. Is this correct?
Option 2 is not a good solution for a relational database. It's called a polymorphic association (as mentioned by @Daniel Vassallo) and it breaks the fundamental definition of a relation.
For example, suppose you have a ResourceId of 1234 on two different rows. Do these represent the same resource? It depends on whether the CommentTypeId is the same on these two rows. This violates the concept of a type in a relation. See SQL and Relational Theory by C. J. Date for more details.
Another clue that it's a broken design is that you can't declare a foreign key constraint for ResourceId, because it could point to any of several tables. If you try to enforce referential integrity using triggers or something, you find yourself rewriting the trigger every time you add a new type of commentable resource.
I would solve this with the solution that @mdma briefly mentions (but then ignores):
CREATE TABLE Commentable (
ResourceId INT NOT NULL IDENTITY,
ResourceType INT NOT NULL,
PRIMARY KEY (ResourceId, ResourceType)
);
CREATE TABLE Documents (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 1),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
CREATE TABLE Projects (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 2),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now each resource type has its own table, but the serial primary key is allocated uniquely by Commentable. A given primary key value can be used only by one resource type.
CREATE TABLE Comments (
CommentId INT IDENTITY PRIMARY KEY,
ResourceId INT NOT NULL,
ResourceType INT NOT NULL,
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now Comments reference Commentable resources, with referential integrity enforced. A given comment can reference only one resource type. There's no possibility of anomalies or conflicting resource ids.
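Creating a commentable document then takes one extra insert, going through Commentable first. A sketch (type code 1 = document, per the CHECK constraints above):
INSERT INTO Commentable (ResourceType) VALUES (1);
DECLARE @rid INT = SCOPE_IDENTITY(); -- the ResourceId just allocated
INSERT INTO Documents (ResourceId, ResourceType) VALUES (@rid, 1);
INSERT INTO Comments (ResourceId, ResourceType) VALUES (@rid, 1);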
I cover more about polymorphic associations in my presentation Practical Object-Oriented Models in SQL and my book SQL Antipatterns.
Read up on database normalization.
Nulls in the way you describe would be a big indication that the database isn't designed properly.
You need to split up all your tables so that the data held in them is fully normalized. This will save you a lot of time further down the line, guaranteed, and it's much better practice to get into the habit of.
From a foreign key perspective, the first example is better because you can have multiple foreign key constraints on a column but the data has to exist in all those references. It's also more flexible if the business rules change.
To continue from @OMG Ponies' answer, what you describe in the second example is called a Polymorphic Association, where the foreign key ResourceID may reference rows in more than one table. However, in SQL databases a foreign key constraint can only reference exactly one table. The database cannot enforce the foreign key according to the value in CommentTypeID.
You may be interested in checking out the following Stack Overflow post for one solution to tackle this problem:
MySQL - Conditional Foreign Key Constraints
The first approach is not great, since it is quite denormalized: each time you add a new entity type, you need to update the table. You may be better off making this an attribute of the document, i.e. storing the comment inline in the document table.
For the ResourceID approach to work with referential integrity, you will need a Resource table and a ResourceID foreign key in all of your Document, Project, etc. entities (or use a mapping table). Making "ResourceID" a jack-of-all-trades that can be a documentID, projectID, etc. is not a good solution, since it cannot be used for sensible indexing or foreign key constraints.
To normalize, you need to split the comment table into one table per resource type.
Comment
-------
CommentID
CommentText
...etc
DocumentComment
---------------
DocumentID
CommentID
ProjectComment
--------------
ProjectID
CommentID
If only one comment is allowed, then you add a unique constraint on the foreign key for the entity (DocumentID, ProjectID, etc.). This ensures that there can only be one row for the given item and so only one comment. You can also ensure that comments are not shared by using a unique constraint on CommentID.
EDIT: Interestingly, this is almost parallel to the normalized implementation of ResourceID - replace "Comment" in the table names with "Resource", change "CommentID" to "ResourceID", and you have the structure needed to associate a ResourceID with each resource. You can then use a single table "ResourceComment".
If there are going to be other entities that are associated with any type of resource (e.g. audit details, access rights, etc..), then using the resource mapping tables is the way to go, since it will allow you to add normalized comments and any other resource related entities.
I wouldn't go with either of those solutions. Depending on some of the specifics of your requirements, you could go with a super-type table:
CREATE TABLE Commentable_Items (
commentable_item_id INT NOT NULL,
CONSTRAINT PK_Commentable_Items PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Projects (
commentable_item_id INT NOT NULL,
... (other project columns)
CONSTRAINT PK_Projects PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Documents (
commentable_item_id INT NOT NULL,
... (other document columns)
CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (commentable_item_id))
GO
If each item can only have one comment and comments are not shared (i.e. a comment can only belong to one entity), then you could just put the comments in the Commentable_Items table. Otherwise you could link the comments off of that table with a foreign key.
I don't like this approach very much in your specific case though, because "having comments" isn't enough to put items together like that in my mind.
I would probably go with separate Comments tables (assuming that you can have multiple comments per item - otherwise just put them in your base tables). If a comment can be shared between multiple entity types (i.e., a document and a project can share the same comment) then have a central Comments table and multiple entity-comment relationship tables:
CREATE TABLE Comments (
comment_id INT NOT NULL,
comment_text NVARCHAR(MAX) NOT NULL,
CONSTRAINT PK_Comments PRIMARY KEY CLUSTERED (comment_id))
GO
CREATE TABLE Document_Comments (
document_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Document_Comments PRIMARY KEY CLUSTERED (document_id, comment_id))
GO
CREATE TABLE Project_Comments (
project_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Project_Comments PRIMARY KEY CLUSTERED (project_id, comment_id))
GO
If you want to constrain comments to a single document (for example) then you could add a unique index (or change the primary key) on the comment_id within that linking table.
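For example, to ensure a given comment can appear against at most one document (the index name is illustrative):
CREATE UNIQUE NONCLUSTERED INDEX UQ_Document_Comments_comment_id
ON Document_Comments (comment_id)
GO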
It's all of these "little" decisions that will affect the specific PKs and FKs. I like this approach because each table is clear on what it is. In databases that's usually better than having "generic" tables/solutions.
Of the options you give, I would go for number 2.
Option 2 is a good way to go. The issue that I see with it is that you are putting the resource key on that table. Each of the IDs from the different resources could be duplicated, so when you join resources to the comments you will more than likely come up with comments that do not belong to that particular resource. This would be, in effect, a many-to-many join. A better option would be to have your resource tables, the comments table, and then tables that cross-reference each resource type and the comments table.
If you carry the same sort of data about all comments regardless of what they are comments about, I'd vote against creating multiple comment tables. Maybe a comment is just "thing it's about" and text, but if you don't have other data now, it's likely you will: date the comment was entered, user id of person who made it, etc. With multiple tables, you have to repeat all these column definitions for each table.
As noted, using a single reference field means that you cannot put a foreign key constraint on it. This is too bad, but it doesn't break anything; it just means you have to do the validation with a trigger or in code. More seriously, joins get difficult. You can't just say "from comment join document using (documentid)"; you need a complex join based on the value of the type field.
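With the single-reference-field design, that join ends up looking something like this sketch (the type codes and the selected columns are made up):
SELECT c.CommentID, d.DocumentName, p.ProjectName
FROM Comments c
LEFT JOIN Documents d ON c.CommentTypeID = 1 AND c.ResourceID = d.DocumentID
LEFT JOIN Projects p ON c.CommentTypeID = 2 AND c.ResourceID = p.ProjectID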
So while the multiple pointer fields are ugly, I tend to think that's the right way to go. I know some DB people say there should never be a null field in a table, and that you should always break things off into another table to prevent that from happening, but I fail to see any real advantage to following this rule.
Personally I'd be open to hearing further discussion on pros and cons.
Pawnshop Application:
I have separate tables for Loan, Purchase, Inventory & Sales transactions.
Each table's rows are joined to their respective customer rows by:
customer.pk [serial] = loan.fk [integer];
= purchase.fk [integer];
= inventory.fk [integer];
= sale.fk [integer];
I have consolidated the four tables into one table called "transaction", where a column:
transaction.trx_type char(1) {L=Loan, P=Purchase, I=Inventory, S=Sale}
Scenario:
A customer initially pawns merchandise, makes a couple of interest payments, then decides he wants to sell the merchandise to the pawnshop, who then places merchandise in Inventory and eventually sells it to another customer.
I designed a generic transaction table where for example:
transaction.main_amount DECIMAL(7,2)
in a loan transaction holds the pawn amount,
in a purchase holds the purchase price,
in inventory and sale holds sale price.
This is clearly a denormalized design, but it has made programming a lot easier and improved performance. Any type of transaction can now be performed from within one screen, without the need to switch between different tables.
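A sketch of the consolidated table, using the serial/integer types mentioned above (columns beyond those described are illustrative):
CREATE TABLE transaction (
trx_id SERIAL PRIMARY KEY,
customer_fk INTEGER NOT NULL REFERENCES customer (pk),
trx_type CHAR(1) NOT NULL CHECK (trx_type IN ('L', 'P', 'I', 'S')), -- Loan, Purchase, Inventory, Sale
main_amount DECIMAL(7,2) NOT NULL -- pawn amount, purchase price, or sale price, depending on trx_type
);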

Performance - Int vs Char(3)

I have a table and am debating between 2 different ways to store information. It has a structure like so:
int id
int FK_id
varchar(50) info1
varchar(50) info2
varchar(50) info3
int forTable or char(3) forTable
The FK_id can be a foreign key to one of 6 tables so I need another field to determine which table it's for.
I see two solutions:
An integer that is a FK to a settings table which has its actual value.
A char(3) field with an abbreviated version of the table name.
I am wondering if anyone knows whether one will be more beneficial speed-wise than the other, or if there will be any major problems with using the char(3).
Note: I will be creating an indexed view on each of the 6 different values for this field. This table will contain ~30k rows and will need to be joined with much larger tables
In this case, it probably doesn't matter except for the collation overhead (A vs a vs ä vs à).
I'd use char(3), say for a currency code like CHF, GBP etc. But if my natural key was "Swiss Franc", "British Pound" etc., I'd take the numeric.
3 bytes + collation vs 4 bytes numeric? You'd need a zillion rows or be running a medium-sized country before it mattered...
Have you considered using a tinyint? It only takes one byte to store its value. Tinyint has a range of values between 0 and 255.
Is the reason you need a single table that you want to ensure that, when the six parent tables reference a given instance of a child row, it is guaranteed to be the same instance? This is the classic "multi-parent" problem. An example of where you might run into this is with addresses or phone numbers shared across multiple person/contact tables.
I can think of a couple of options:
Choice 1: A link table for each parent table. This would be the Hoyle architecture. So, something like:
Create Table MyTable(
id int not null Primary Key Clustered
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
)
Create Table LinkTable1(
MyTableId int not null
, ParentTable1Id int not null
, Constraint PK_LinkTable1 Primary Key Clustered( MyTableId, ParentTable1Id )
, Constraint FK_LinkTable1_MyTable
Foreign Key ( MyTableId )
References MyTable ( Id )
, Constraint FK_LinkTable1_ParentTable1
Foreign Key ( ParentTable1Id )
References ParentTable1 ( Id )
)
...
Create Table LinkTable2...LinkTable3
Choice 2: If you knew that you would never have more than, say, six tables and were willing to accept some denormalization and a fugly design, you could add six foreign keys to your main table. That avoids the problem of populating a bunch of link tables and ensures proper referential integrity. However, that design can quickly get out of hand if the number of parents grows.
If you are content with your existing design, then with respect to the field size, I would use the full table name. Frankly, the difference in performance between a char(3) and a varchar(50) or even varchar(128) will be negligible for the amount of data you are likely to put in the table. If you really thought you were going to have millions of rows, then I would strongly consider the option of linking tables.
If you wanted to stay with your design and wanted the maximum performance, then I would use a tinyint with a foreign key to a table that contained the list of the six tables with a tinyint primary key. That prevents the number from being "magic" and ensures that you narrow down the list of parent tables. Of course, it still does not prevent orphaned records. In this design, you have to use triggers to do that.
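That lookup would look something like the sketch below (restating the main table with the type column; names illustrative):
Create Table TableType(
TableTypeId tinyint not null Primary Key Clustered
, TableName varchar(128) not null
)
Create Table MyTable(
id int not null Primary Key Clustered
, FK_id int not null
, ForTable tinyint not null
Constraint FK_MyTable_TableType
References TableType ( TableTypeId )
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
)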
Because your FK cannot be enforced by a database constraint (since it varies depending upon the type), I would strongly consider re-evaluating your design to use link tables, where each link table includes two FK columns: one to the PK of the entity and one to the PK of one of the 6 tables.
While this might seem to be overkill, it makes a lot of things simpler and adding new link tables is no more complex than accommodating new FK-types. In addition, it is more easily expandable to the case where an entity needs more than a 1-1 relationship to a single table, or needs multiple 1-1 relationships to the 6 other entities.
In a varying-FK scenario, you can lose database consistency, you can join to the wrong entity by neglecting to filter on type code, etc.
I should add that another huge benefit of link tables is that you can link to tables which have keys of varying data types (ints, natural keys, etc.) without having to add surrogate keys or store the key in a varchar or similar workarounds, which are prone to problems.
I think a small integer (tinyint) is called for here. An "abbreviated version" looks too much like a magic number.
I also think that, performance-wise, the integer should beat the char(3).
First off, a 50 character Id that is not globally unique sounds a little scary. Do the IDs have some meaning? If not, you can easily get a GUID in less space. Personally, I am a big fan of making things human readable whenever possible. I would, and have, put the full name in graphs until I needed to do otherwise. My preference would be to have linking tables for each possible related table though.
Unless you are talking about really large scale, you are much better off decreasing the size of the IDs and taking a few more characters for the name of the table. For really large scale, I would decrease the size of the IDs and use an integer.
Jacob