How to design my ingredient database tables? - sql

I'm working on a recipe module (ASP.NET MVC, Entity Framework, SQL Server), and one of the entities I have to set up in the database is ingredients, with their characteristics and translations into a number of languages.
I was thinking of creating two tables as follows:
Table Ingredient
Id, nvarchar(20), primary key
EnergyInKCal, float
... other characteristics
Source, nvarchar(50)
Table IngredientTranslation
Id, nvarchar(20), primary key
LanguageCode, nvarchar(2)
Name, nvarchar(200)
So each ingredient will be defined once in the Ingredient table, with a unique code as their primary key, for example:
'N32-004669', 64, 368, 'NUBEL'
and translated in the IngredientTranslation table, for example
'N32-004669', 'NL', 'Aardappel, zoete'
'N32-004669', 'FR', 'Pomme de terre, douce'
'N32-004669', 'EN', 'Potato, sweet'
I think querying ingredients becomes easy this way... do you think it's a good idea to use the code (which is nvarchar(20)) as a primary key? Or is a simple bigint better, even though I would then have to use JOINs in my queries? Are there other approaches that are better, performance-wise?
EDIT: after reading the answers, I redesigned the tables as follows:
Table Ingredient
Id, bigint, primary key
ExternalId, nvarchar(20)
EnergyInKCal, float
... other characteristics
Source, nvarchar(50)
Table IngredientTranslation
Id, bigint, primary key
IngredientId, bigint (relation with Id of Ingredient table)
LanguageCode, nvarchar(2)
Name, nvarchar(200)
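In T-SQL the redesigned tables might look roughly like this (a sketch; the IDENTITY and nullability choices are assumptions):
CREATE TABLE Ingredient (
Id bigint IDENTITY(1,1) PRIMARY KEY,
ExternalId nvarchar(20) NOT NULL,
EnergyInKCal float NULL,
-- ... other characteristics
Source nvarchar(50) NULL
);
CREATE TABLE IngredientTranslation (
Id bigint IDENTITY(1,1) PRIMARY KEY,
IngredientId bigint NOT NULL REFERENCES Ingredient (Id),
LanguageCode nvarchar(2) NOT NULL,
Name nvarchar(200) NOT NULL
);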
Thanks,
L

Since the primary key (the clustered key by default in SQL Server) is included in every nonclustered index, it's best to keep the primary key small. So an int identity is an excellent choice.
One side note: storing translations in a database has a rather hefty performance impact, both on the database and on the rendering engine that has to build the web page. Since translations are fairly constant, most websites store them outside the database; in ASP.NET, the typical choice would be resource files.

In a purely relational schema, using a natural key (such as the Id code above) as the primary key should be fine, although in OO design surrogate keys are likely to be preferred.
A couple of additional notes:
The primary key on IngredientTranslation would need to be a compound key on Id and LanguageCode, not just Id.
Normalization also says to remove derived values, so you only need one field for energy: pick one of the energy units (kJ or kcal) and use the appropriate multiplication factor to convert where appropriate.
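As an illustration (a sketch; the computed-column name is an assumption, and 1 kcal = 4.184 kJ), the other unit can be exposed as a computed column in SQL Server instead of being stored:
ALTER TABLE Ingredient
ADD EnergyInKJ AS (EnergyInKCal * 4.184);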
EDIT: in the alternative scenario, I suggest adding a unique index to each of the Ingredient (on the code) and IngredientTranslation (on the code and language code) tables. I also suggest renaming the code field to IngredientCode on the IngredientTranslation table.
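Adapted to the redesigned column names, those unique indexes might look like this (index names are assumptions):
CREATE UNIQUE INDEX UX_Ingredient_ExternalId
ON Ingredient (ExternalId);
CREATE UNIQUE INDEX UX_IngredientTranslation_Ingredient_Language
ON IngredientTranslation (IngredientId, LanguageCode);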

Related

What table structure to go for if there are two objects of same type but of different nature?

Given that there are two kinds of products, X and Y. X has A, B and C as its primary key, whereas Y has A and D as its primary key. Should I put them in the same table? Why should I, and if I should not, then why is that?
I have currently put them in two separate tables but some colleagues are suggesting that they belong in the same table. My question is should I consider putting them in the same table or continue with different tables?
Below I have given example tables for the above case.
CREATE TABLE `product_type_b` (
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`COMPONENT_CODE` VARCHAR(50) NOT NULL,
`GROUP_INDICATOR` VARCHAR(50) NULL DEFAULT NULL,
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_CODE`, `COMPONENT_CODE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `product_type_a` (
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`CHOICE_OF_COVER` VARCHAR(50) NOT NULL,
`PLAN_TYPE` VARCHAR(50) NOT NULL,
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
`PRODUCT_TENURE` INT(11) NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_CODE`, `CHOICE_OF_COVER`, `PLAN_TYPE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
As you can see there are certain fields that are not common to both tables but are part of the primary key. There are also some other fields which are not common to both tables.
Here is the bigger picture of the system in consideration.
Each product type has a different source from where it is sent to the system.
We need to store these products in the database.
I would like to have a balance between normalization and performance so that my read-write speeds aren't compromised much due to over normalization.
There is also a web-app which will have a page where these products are searchable by the user.
User will populate certain column fields as filters based on which we need to fetch the products and show on the UI.
The variation in subtypes is currently 2 and is not expected to increase beyond 4-5, which again is going to be over a decade maybe. This again is an approximation.
I hope this presents a bigger picture of the system.
I want to have good read and write speeds without compromising much. So should I go ahead with this design ? If not, what design should be implemented ?
For a trading system, and taking into account a maximum of 5 product types and a very limited number of attributes, I'd prefer a single table for all products with a surrogate PK. Think about references to products from trading transactions; that is the biggest part of the total DB content in the long run.
A metadata table describing every product-specific attribute and its mapping to a column of the general table would help to build the UI and the backend/frontend communication.
Search indexes would reflect the most popular user searches depending on product type.
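A minimal sketch of such a metadata table (table and column names are assumptions):
CREATE TABLE `product_attribute_metadata` (
`PRODUCT_TYPE` CHAR(1) NOT NULL, -- e.g. 'A' or 'B'
`ATTRIBUTE_NAME` VARCHAR(50) NOT NULL, -- e.g. 'CHOICE_OF_COVER'
`COLUMN_NAME` VARCHAR(64) NOT NULL, -- column in the generic product table
`DISPLAY_LABEL` VARCHAR(100) NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_TYPE`, `ATTRIBUTE_NAME`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;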
This is a typical Category/SubCategory modelling issue. There are a few options:
Put everything in one table, which will have some NULLable columns because different subtypes do not have the same attributes;
One parent table for all the common attributes, plus a type-indicator column. Each subtype then has its own table holding just the columns that subtype has;
Each subtype has its own table, including all the columns common to all the subtypes.
(1) is good if the subtypes are very limited;
(3) is suitable if the variations between the subtypes are very limited.
The advantage of (2) is that it is easy to return all the records with the common columns, and if an artificial key (like an auto-increment id) is used, it ensures that all records, regardless of subtype, have a unique id.
In your case, since no artificial PK is used, I think your choice is not bad.
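For comparison, a minimal sketch of option (2) applied to the product example above (table and column names are assumptions):
CREATE TABLE `product` (
`PRODUCT_ID` INT NOT NULL AUTO_INCREMENT,
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`PRODUCT_TYPE` CHAR(1) NOT NULL, -- 'A' or 'B'
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_ID`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `product_type_a_detail` (
`PRODUCT_ID` INT NOT NULL,
`CHOICE_OF_COVER` VARCHAR(50) NOT NULL,
`PLAN_TYPE` VARCHAR(50) NOT NULL,
`PRODUCT_TENURE` INT NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_ID`, `CHOICE_OF_COVER`, `PLAN_TYPE`),
FOREIGN KEY (`PRODUCT_ID`) REFERENCES `product` (`PRODUCT_ID`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;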

Combine multiple outrigger tables into one?

I have a dimension table and several outriggers.
create table dimFoo (
FooKey int primary key,
......
)
create table triggerA (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
create table triggerB (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
create table triggerC (
FooKey int references dimFoo (FooKey),
Value varchar(255),
primary key nonclustered (FooKey, Value)
)
Should these outrigger tables be merged into one table?
create table Triggers (
FooKey int references dimFoo (FooKey),
TriggerType varchar(20), -- triggerA, triggerB, triggerC, etc.
Value varchar(255),
primary key nonclustered (FooKey, TriggerType, Value)
)
To handle this kind of scenario, such as a dimCustomer where customers potentially have multiple hobbies, the typical Kimball approach is to use a bridge table between the dimensions (dimCustomer and dimHobby).
This link provides a summary of how bridge tables could solve this problem and also alternatives which may work better for you.
Without knowing more about your specific scenario, including what the business requirements are, how many of these value types you have, how 'uniform' the various value types and values are, and the BI technology you'll be using for accessing the data, it's hard to give a definitive answer as to whether you should combine the bridges into one uber-bridge that caters for the various many-to-manys. All of the above influence the answer to some extent.
Typically the 'generic tables' approach is more useful behind the scenes for administration than it is for presenting for analytics. My default approach would be to have specific bridge tables until/unless this became unmanageable from an ETL perspective or perceived as much more complex from a user query perspective. I wouldn't look to 'optimise' to a combined table from the get-go.
If your situation is outside the usual norms (do you have three as per your example, or ten?), combining could well be a good idea. This would make it more like a factless fact, with dimensions of dimCustomer, dimValueType and dimValue, and would be a perfectly reasonable solution.
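A minimal sketch of that bridge-table approach, assuming a dimCustomer table already exists (table and column names are assumptions):
create table dimHobby (
HobbyKey int primary key,
HobbyName varchar(255)
)
create table bridgeCustomerHobby (
CustomerKey int references dimCustomer (CustomerKey),
HobbyKey int references dimHobby (HobbyKey),
primary key (CustomerKey, HobbyKey)
)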

What is the most suitable database technology for financial time series data with heterogenous attributes?

I need to store large amounts of financial time series data where different data points have potentially different attributes.
For instance consider a situation where your database needs to store a time series of financial instruments that include stocks and options. Both stocks and options have prices at any given point in time, but options have additional attributes such as greeks (delta, gamma, vega), etc.
A relational database seems most appropriate here and one possibility would be to create one column per attribute, and set the unused attributes to NULL. So in the example above, for records that represent stocks you would use only some of the columns, and for options you would use some of the others.
The problem with this approach is that it is very inefficient (you end up storing a large number of NULLs) and that it is very inflexible (you need to add or drop a column every time you add or remove attributes).
One alternative might be to store all attributes in a vertical table (i.e. Key-Name-Value) but that has the disadvantage of forcing you to make all attributes type-unsafe (for example they might all be stored as strings).
Another option I thought of might be to store attributes as an XML document in a single column in the time series table. I tested this approach and it is impractical from a performance standpoint. If you want to extract attributes for any larger number of time series records, parsing the XML in each row is too slow.
The ideal database technology would be a hybrid of NoSQL and an RDBMS, where the key-timestamp pair behaves like a row in a relational, tabular database but all attributes are stored in a row-level bag, with fast access to each.
Is anyone aware of such a system? Are there other suggestions for storing the type of data I am describing?
Use "financial_instruments" to store information common to all financial instruments. Use "stocks" to store attributes that apply only to stocks; "options" to store attributes that apply only to options.
create table financial_instruments (
inst_id integer primary key,
inst_name varchar(57) not null unique,
inst_type char(1) check (inst_type in ('s', 'o')),
other_columns char(1), -- columns common to all financial instruments
unique (inst_id, inst_type) -- required for the FK constraint below.
);
create table stocks (
inst_id integer primary key,
inst_type char(1) not null default 's' check (inst_type = 's'),
other_columns char(1), -- columns unique to stocks.
foreign key (inst_id, inst_type) references financial_instruments (inst_id, inst_type)
);
create table options (
inst_id integer primary key,
inst_type char(1) not null default 'o' check (inst_type = 'o'),
other_columns char(1), -- columns unique to options; delta, gamma, vega.
foreign key (inst_id, inst_type) references financial_instruments (inst_id, inst_type)
);
To ease the programming job, you can build updatable views that join "financial_instruments" with each of its subtypes. Application code can just use the views.
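For example, a sketch of such a view for stocks (column aliases are assumptions; depending on the DBMS, making it fully updatable may require INSTEAD OF triggers):
create view stock_details as
select f.inst_id, f.inst_name, f.other_columns as instrument_columns, s.other_columns as stock_columns
from financial_instruments f
join stocks s on s.inst_id = f.inst_id;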
Additional tables that store related information about all financial instruments would set a foreign key reference to "financial_instruments"."inst_id". Tables that store related information about just, say, stocks would set a foreign key reference to "stocks"."inst_id".
A different option: a master table with affiliated tables for the attributes of similar objects (think object-oriented with inheritance). You have 1-1 relationships between the master and the subs, with the primary key of the master table reused as the primary key in the related tables.

Generic Database table design

Just trying to figure out the best way to design my table for the following scenario:
I have several areas in my system (documents, projects, groups and clients) and each of these can have comments logged against them.
My question is should I have one table like this:
CommentID
DocumentID
ProjectID
GroupID
ClientID
etc
Where only one of the ids will have data and the rest will be NULL or should I have a separate CommentType table and have my comments table like this:
CommentID
CommentTypeID
ResourceID (this being the id of the project/doc/client)
etc
My thoughts are that option 2 would be more efficient from an indexing point of view. Is this correct?
Option 2 is not a good solution for a relational database. It's called polymorphic associations (as mentioned by @Daniel Vassallo) and it breaks the fundamental definition of a relation.
For example, suppose you have a ResourceId of 1234 on two different rows. Do these represent the same resource? It depends on whether the CommentTypeId is the same on these two rows. This violates the concept of a type in a relation. See SQL and Relational Theory by C. J. Date for more details.
Another clue that it's a broken design is that you can't declare a foreign key constraint for ResourceId, because it could point to any of several tables. If you try to enforce referential integrity using triggers or something, you find yourself rewriting the trigger every time you add a new type of commentable resource.
I would solve this with the solution that @mdma briefly mentions (but then ignores):
CREATE TABLE Commentable (
ResourceId INT NOT NULL IDENTITY,
ResourceType INT NOT NULL,
PRIMARY KEY (ResourceId, ResourceType)
);
CREATE TABLE Documents (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 1),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
CREATE TABLE Projects (
ResourceId INT NOT NULL,
ResourceType INT NOT NULL CHECK (ResourceType = 2),
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now each resource type has its own table, but the serial primary key is allocated uniquely by Commentable. A given primary key value can be used only by one resource type.
CREATE TABLE Comments (
CommentId INT IDENTITY PRIMARY KEY,
ResourceId INT NOT NULL,
ResourceType INT NOT NULL,
FOREIGN KEY (ResourceId, ResourceType) REFERENCES Commentable
);
Now Comments reference Commentable resources, with referential integrity enforced. A given comment can reference only one resource type. There's no possibility of anomalies or conflicting resource ids.
I cover more about polymorphic associations in my presentation Practical Object-Oriented Models in SQL and my book SQL Antipatterns.
Read up on database normalization.
NULLs used in the way you describe would be a big indication that the database isn't designed properly.
You need to split up your tables so that the data held in them is fully normalized; this will save you a lot of time further down the line, guaranteed, and it's a much better practice to get into the habit of.
From a foreign key perspective, the first example is better: you can put multiple foreign key constraints on a single column, but the value would then have to exist in all of the referenced tables. It's also more flexible if the business rules change.
To continue from @OMG Ponies' answer, what you describe in the second example is called a polymorphic association, where the foreign key ResourceID may reference rows in more than one table. However, in SQL databases a foreign key constraint can only reference exactly one table; the database cannot enforce the foreign key according to the value in CommentTypeID.
You may be interested in checking out the following Stack Overflow post for one solution to tackle this problem:
MySQL - Conditional Foreign Key Constraints
The first approach is not great, since it is quite denormalized. Each time you add a new entity type, you need to update the table. You may be better off making this an attribute of the document, i.e. storing the comment inline in the document table.
For the ResourceID approach to work with referential integrity, you will need to have a Resource table, and a ResourceID foreign key in all of your Document, Project, etc. entities (or use a mapping table). Making "ResourceID" a jack-of-all-trades that can be a DocumentID, ProjectID, etc. is not a good solution, since it cannot be used for sensible indexing or a foreign key constraint.
To normalize, you need to split the comment table into one table per resource type.
Comment
-------
CommentID
CommentText
...etc
DocumentComment
---------------
DocumentID
CommentID
ProjectComment
--------------
ProjectID
CommentID
If only one comment is allowed, then you add a unique constraint on the foreign key for the entity (DocumentID, ProjectID etc.) This ensures that there can only be one row for the given item and so only one comment. You can also ensure that comments are not shared by using a unique constraint on CommentID.
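For example, to enforce at most one comment per document (a sketch; the constraint name is an assumption):
ALTER TABLE DocumentComment
ADD CONSTRAINT UQ_DocumentComment_DocumentID UNIQUE (DocumentID);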
EDIT: Interestingly, this is almost parallel to the normalized implementation of ResourceID - replace "Comment" in the table name, with "Resource" and change "CommentID" to "ResourceID" and you have the structure needed to associate a ResourceID with each resource. You can then use a single table "ResourceComment".
If there are going to be other entities that are associated with any type of resource (e.g. audit details, access rights, etc..), then using the resource mapping tables is the way to go, since it will allow you to add normalized comments and any other resource related entities.
I wouldn't go with either of those solutions. Depending on some of the specifics of your requirements you could go with a super-type table:
CREATE TABLE Commentable_Items (
commentable_item_id INT NOT NULL,
CONSTRAINT PK_Commentable_Items PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Projects (
commentable_item_id INT NOT NULL,
... (other project columns)
CONSTRAINT PK_Projects PRIMARY KEY CLUSTERED (commentable_item_id))
GO
CREATE TABLE Documents (
commentable_item_id INT NOT NULL,
... (other document columns)
CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (commentable_item_id))
GO
If each item can only have one comment and comments are not shared (i.e. a comment can only belong to one entity), then you could just put the comments in the Commentable_Items table. Otherwise you could link the comments off of that table with a foreign key.
I don't like this approach very much in your specific case though, because "having comments" isn't enough to put items together like that in my mind.
I would probably go with separate Comments tables (assuming that you can have multiple comments per item - otherwise just put them in your base tables). If a comment can be shared between multiple entity types (i.e., a document and a project can share the same comment) then have a central Comments table and multiple entity-comment relationship tables:
CREATE TABLE Comments (
comment_id INT NOT NULL,
comment_text NVARCHAR(MAX) NOT NULL,
CONSTRAINT PK_Comments PRIMARY KEY CLUSTERED (comment_id))
GO
CREATE TABLE Document_Comments (
document_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Document_Comments PRIMARY KEY CLUSTERED (document_id, comment_id))
GO
CREATE TABLE Project_Comments (
project_id INT NOT NULL,
comment_id INT NOT NULL,
CONSTRAINT PK_Project_Comments PRIMARY KEY CLUSTERED (project_id, comment_id))
GO
If you want to constrain comments to a single document (for example) then you could add a unique index (or change the primary key) on the comment_id within that linking table.
It's all of these "little" decisions that will affect the specific PKs and FKs. I like this approach because each table is clear on what it is. In databases that's usually better than having "generic" tables/solutions.
Of the options you give, I would go for number 2.
Option 2 is a good way to go. The issue that I see with it is that you are putting the resource key on that table. Each of the IDs from the different resources could be duplicated, so when you join resources to the comments you will more than likely come up with comments that do not belong to that particular resource; this effectively becomes a many-to-many join. I would think a better option would be to have your resource tables, the comments table, and then tables that cross-reference each resource type with the comments table.
If you carry the same sort of data about all comments regardless of what they are comments about, I'd vote against creating multiple comment tables. Maybe a comment is just "thing it's about" and text, but if you don't have other data now, it's likely you will: date the comment was entered, user id of person who made it, etc. With multiple tables, you have to repeat all these column definitions for each table.
As noted, using a single reference field means that you cannot put a foreign key constraint on it. This is too bad, but it doesn't break anything; it just means you have to do the validation with a trigger or in code. More seriously, joins get difficult: you can't just say "from comment join document using (documentid)"; you need a complex join based on the value of the type field.
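For illustration, that conditional join against a single ResourceID plus type column might look like this (table and column names are assumptions):
SELECT c.CommentID, d.Name AS DocumentName, p.Name AS ProjectName
FROM Comments c
LEFT JOIN Documents d ON c.CommentTypeID = 1 AND c.ResourceID = d.DocumentID
LEFT JOIN Projects p ON c.CommentTypeID = 2 AND c.ResourceID = p.ProjectID;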
So while the multiple pointer fields are ugly, I tend to think that's the right way to go. I know some DB people say there should never be a null field in a table, and that you should always break it off into another table to prevent that from happening, but I fail to see any real advantage to following this rule.
Personally I'd be open to hearing further discussion on pros and cons.
Pawnshop Application:
I have separate tables for Loan, Purchase, Inventory & Sales transactions.
Each table's rows are joined to their respective customer rows by:
customer.pk [serial] = loan.fk [integer];
= purchase.fk [integer];
= inventory.fk [integer];
= sale.fk [integer];
I have consolidated the four tables into one table called "transaction", where a column:
transaction.trx_type char(1) {L=Loan, P=Purchase, I=Inventory, S=Sale}
Scenario:
A customer initially pawns merchandise, makes a couple of interest payments, then decides he wants to sell the merchandise to the pawnshop, who then places merchandise in Inventory and eventually sells it to another customer.
I designed a generic transaction table where for example:
transaction.main_amount DECIMAL(7,2)
in a loan transaction holds the pawn amount,
in a purchase holds the purchase price,
in inventory and sale holds the sale price.
This is clearly a denormalized design, but it has made programming a lot easier and improved performance. Any type of transaction can now be performed from within one screen, without the need to switch between different tables.
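A rough sketch of that consolidated table (column names and types beyond trx_type and main_amount are assumptions):
CREATE TABLE "transaction" ( -- quoted since TRANSACTION is a reserved word in some DBMSs
pk SERIAL PRIMARY KEY,
customer_fk INTEGER NOT NULL REFERENCES customer (pk),
trx_type CHAR(1) NOT NULL CHECK (trx_type IN ('L', 'P', 'I', 'S')),
main_amount DECIMAL(7,2),
trx_date DATE
);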

Performance - Int vs Char(3)

I have a table and am debating between 2 different ways to store information. It has a structure like so
int id
int FK_id
varchar(50) info1
varchar(50) info2
varchar(50) info3
int forTable or char(3) forTable
The FK_id can be a foreign key to one of 6 tables so I need another field to determine which table it's for.
I see two solutions:
An integer that is a FK to a settings table which has its actual value.
A char(3) field with an abbreviated version of the table name.
I am wondering if anyone knows whether one will be more beneficial speed-wise than the other, or if there will be any major problems with using the char(3).
Note: I will be creating an indexed view on each of the 6 different values for this field. This table will contain ~30k rows and will need to be joined with much larger tables
In this case, it probably doesn't matter except for the collation overhead (A vs a vs ä vs à).
I'd use char(3), say for currency codes like CHF, GBP etc. But if my natural key were "Swiss Franc", "British Pound" etc., I'd take the numeric.
3 bytes + collation vs 4 bytes numeric? You'd need a zillion rows or be running a medium-sized country before it mattered...
Have you considered using a tinyint? It only takes one byte to store its value, and tinyint has a range of 0 to 255.
Is the reason you need a single table that you want to ensure that, when the six parent tables reference a given child row, it is guaranteed to be the same instance? This is the classic "multi-parent" problem. An example of where you might run into this is with addresses or phone numbers shared across multiple person/contact tables.
I can think of a couple of options:
Choice 1: A link table for each parent table. This would be the Hoyle architecture. So, something like:
Create Table MyTable(
id int not null Primary Key Clustered
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
)
Create Table LinkTable1(
MyTableId int not null
, ParentTable1Id int not null
, Constraint PK_LinkTable1 Primary Key Clustered( MyTableId, ParentTable1Id )
, Constraint FK_LinkTable1_MyTable
Foreign Key ( MyTableId )
References MyTable ( Id )
, Constraint FK_LinkTable1_ParentTable1
Foreign Key ( ParentTable1Id )
References ParentTable1 ( Id )
)
...
Create Table LinkTable2...LinkTable3
Choice 2. If you knew that you would never have more than say six tables and were willing to accept some denormalization and a fugly design, you could add six foreign keys to your main table. That avoids the problem of populating a bunch of link tables and ensures proper referential integrity. However, that design can quickly get out of hand if the number of parents grows.
If you are content with your existing design, then with respect to the field size, I would use the full table name. Frankly, the difference in performance between a char(3) and a varchar(50) or even varchar(128) will be negligible for the amount of data you are likely to put in the table. If you really thought you were going to have millions of rows, then I would strongly consider the option of linking tables.
If you wanted to stay with your design and wanted the maximum performance, then I would use a tinyint with a foreign key to a table that contained the list of the six tables with a tinyint primary key. That prevents the number from being "magic" and ensures that you narrow down the list of parent tables. Of course, it still does not prevent orphaned records. In this design, you have to use triggers to do that.
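A sketch of that tinyint lookup-table approach, adding the type column to the MyTable defined earlier (table and constraint names are assumptions; the column is nullable here because adding a NOT NULL column to a populated table would also need a default):
Create Table ParentTableType(
ParentTableTypeId tinyint not null Primary Key
, TableName varchar(128) not null Unique
)
Alter Table MyTable Add forTable tinyint null
Constraint FK_MyTable_ParentTableType
References ParentTableType ( ParentTableTypeId )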
Because your FK cannot be enforced by a database constraint (since it varies depending on the type), I would strongly consider re-evaluating your design to use link tables, where each link table includes two FK columns: one to the PK of the entity and one to the PK of one of the 6 tables.
While this might seem to be overkill, it makes a lot of things simpler and adding new link tables is no more complex than accommodating new FK-types. In addition, it is more easily expandable to the case where an entity needs more than a 1-1 relationship to a single table, or needs multiple 1-1 relationships to the 6 other entities.
In a varying-FK scenario, you can lose database consistency, you can join to the wrong entity by neglecting to filter on type code, etc.
I should add that another huge benefit of link tables is that you can link to tables which have keys of varying data types (ints, natural keys, etc.) without having to add surrogate keys or store the key in a varchar or similar workarounds, which are prone to problems.
I think a small integer (tinyint) is called for here. An "abbreviated version" looks too much like a magic number.
I also think performance wise the integer should beat the char(3).
First off, a 50 character Id that is not globally unique sounds a little scary. Do the IDs have some meaning? If not, you can easily get a GUID in less space. Personally, I am a big fan of making things human readable whenever possible. I would, and have, put the full name in graphs until I needed to do otherwise. My preference would be to have linking tables for each possible related table though.
Unless you are talking about really large scale, you are much better off decreasing the size of the IDs and taking a few more characters for the name of the table. For really large scale, I would decrease the size of the IDs and use an integer.
Jacob