Performance - Int vs Char(3) - sql

I have a table and am debating between 2 different ways to store information. It has a structure like so
int id
int FK_id
varchar(50) info1
varchar(50) info2
varchar(50) info3
int forTable or char(3) forTable
The FK_id can be a foreign key to one of 6 tables so I need another field to determine which table it's for.
I see two solutions:
An integer that is a FK to a settings table which has its actual value.
A char(3) field with an abbreviated version of the table name.
I am wondering if anyone knows whether one will be more beneficial speed-wise than the other, or if there will be any major problems using the char(3).
Note: I will be creating an indexed view on each of the 6 different values for this field. This table will contain ~30k rows and will need to be joined with much larger tables

In this case, it probably doesn't matter except for the collation overhead (A vs a vs ä vs à).
I'd use char(3), say for currency codes like CHF, GBP etc. But if my natural key was "Swiss Franc", "British Pound" etc., I'd take the numeric.
3 bytes + collation vs 4 bytes numeric? You'd need a zillion rows or be running a medium-sized country before it mattered...

Have you considered using a tinyint? It only takes one byte to store its value. Tinyint has a range of values between 0 and 255.

Is the reason you need a single table that you want to ensure that, when the six parent tables reference a given child row, it is guaranteed to be the same instance? This is the classic "multi-parent" problem. An example of where you might run into this is with addresses or phone numbers shared across multiple person/contact tables.
I can think of a couple of options:
Choice 1: A link table for each parent table. This would be the Hoyle architecture. So, something like:
Create Table MyTable(
id int not null Primary Key Clustered
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
)
Create Table LinkTable1(
MyTableId int not null
, ParentTable1Id int not null
, Constraint PK_LinkTable1 Primary Key Clustered( MyTableId, ParentTable1Id )
, Constraint FK_LinkTable1_MyTable
Foreign Key ( MyTableId )
References MyTable ( Id )
, Constraint FK_LinkTable1_ParentTable1
Foreign Key ( ParentTable1Id )
References ParentTable1 ( Id )
)
...
Create Table LinkTable2...LinkTable3
Choice 2. If you knew that you would never have more than say six tables and were willing to accept some denormalization and a fugly design, you could add six foreign keys to your main table. That avoids the problem of populating a bunch of link tables and ensures proper referential integrity. However, that design can quickly get out of hand if the number of parents grows.
If you are content with your existing design, then with respect to the field size, I would use the full table name. Frankly, the difference in performance between a char(3) and a varchar(50) or even varchar(128) will be negligible for the amount of data you are likely to put in the table. If you really thought you were going to have millions of rows, then I would strongly consider the option of linking tables.
If you wanted to stay with your design and wanted the maximum performance, then I would use a tinyint with a foreign key to a table that contained the list of the six tables with a tinyint primary key. That prevents the number from being "magic" and ensures that you narrow down the list of parent tables. Of course, it still does not prevent orphaned records. In this design, you have to use triggers to do that.
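To make that concrete, here is a minimal sketch of the tinyint lookup approach (table and column names are invented for illustration, not taken from your schema):
Create Table RefTable(
id tinyint not null Primary Key Clustered
, TableName varchar(128) not null Unique -- one row per parent table, six rows total
)
Create Table MyTable(
id int not null Primary Key Clustered
, FK_id int not null -- id of the row in whichever parent table applies
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
, forTable tinyint not null
, Constraint FK_MyTable_RefTable
Foreign Key ( forTable )
References RefTable ( id )
)
The foreign key keeps forTable limited to the six known values, but, as noted above, it cannot stop FK_id from pointing at a non-existent row in the chosen parent table; that check would still need triggers.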

Because your FK cannot be enforced by a database constraint (since it varies depending upon type), I would strongly consider re-evaluating your design to use link tables, where each link table includes two FK columns, one to the PK of the entity and one to the PK of one of the 6 tables.
While this might seem to be overkill, it makes a lot of things simpler and adding new link tables is no more complex than accommodating new FK-types. In addition, it is more easily expandable to the case where an entity needs more than a 1-1 relationship to a single table, or needs multiple 1-1 relationships to the 6 other entities.
In a varying-FK scenario, you can lose database consistency, you can join to the wrong entity by neglecting to filter on type code, etc.
I should add that another huge benefit of link tables is that you can link to tables which have keys of varying data types (ints, natural keys, etc.) without having to add surrogate keys or store the key in a varchar or similar workarounds, which are prone to problems.

I think a small integer (tinyint) is called for here. An "abbreviated version" looks too much like a magic number.
I also think that, performance-wise, the integer should beat the char(3).

First off, a 50 character Id that is not globally unique sounds a little scary. Do the IDs have some meaning? If not, you can easily get a GUID in less space. Personally, I am a big fan of making things human readable whenever possible. I would, and have, put the full name in graphs until I needed to do otherwise. My preference would be to have linking tables for each possible related table though.
Unless you are talking about really large scale, you are much better off decreasing the size of the IDs and taking a few more characters for the name of the table. For really large scale, I would decrease the size of the IDs and use an integer.
Jacob

Related

What table structure to go for if there are two objects of same type but of different nature?

Given that there are two kinds of products, X and Y: X has A, B and C as its primary key, whereas Y has A and D as its primary key. Should I put them in the same table? If so, why, and if not, why not?
I have currently put them in two separate tables, but some colleagues are suggesting that they belong in the same table. My question is: should I consider putting them in the same table or continue with different tables?
Below I have given example tables for the above case.
CREATE TABLE `product_type_b` (
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`COMPONENT_CODE` VARCHAR(50) NOT NULL,
`GROUP_INDICATOR` VARCHAR(50) NULL DEFAULT NULL,
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_CODE`, `COMPONENT_CODE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `product_type_a` (
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`CHOICE_OF_COVER` VARCHAR(50) NOT NULL,
`PLAN_TYPE` VARCHAR(50) NOT NULL,
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
`PRODUCT_TENURE` INT(11) NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_CODE`, `CHOICE_OF_COVER`, `PLAN_TYPE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
As you can see there are certain fields that are not common to both tables but are part of the primary key. There are also some other fields which are not common to both tables.
Here is the bigger picture of the system in consideration.
Each product type has a different source from where it is sent to the system.
We need to store these products in the database.
I would like to have a balance between normalization and performance so that my read-write speeds aren't compromised much due to over normalization.
There is also a web-app which will have a page where these products are searchable by the user.
User will populate certain column fields as filters based on which we need to fetch the products and show on the UI.
Variations in subtypes is currently 2 and is not expected to increase beyond 4-5, which again is going to be over a decade maybe. This again is an approximation.
I hope this presents a bigger picture of the system.
I want to have good read and write speeds without compromising much. So should I go ahead with this design? If not, what design should be implemented?
For a trading system, and taking into account a maximum of 5 product types and a very limited number of attributes, I'd prefer a single table for all products with a surrogate PK. Think about references to products from trading transactions; this is the biggest part of the total DB content in the long run.
A metadata table describing every product-specific attribute and its mapping to the general table column would help to build UI and backend/frontend communications.
Search indexes would reflect the most popular user searches depending on product type.
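As a rough illustration of the metadata-table idea mentioned above (all names here are hypothetical, not part of your schema), it could map each product-specific attribute to a column of the general product table:
CREATE TABLE `product_attribute_metadata` (
`PRODUCT_TYPE` VARCHAR(10) NOT NULL, -- e.g. 'A' or 'B'
`ATTRIBUTE_NAME` VARCHAR(50) NOT NULL, -- label shown in the UI filter
`COLUMN_NAME` VARCHAR(64) NOT NULL, -- column of the general product table holding the value
`IS_SEARCHABLE` TINYINT(1) NOT NULL DEFAULT 0,
PRIMARY KEY (`PRODUCT_TYPE`, `ATTRIBUTE_NAME`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
The backend can read this table to decide which filters to show and which columns to query for a given product type.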
This is a typical Category/SubCategory model issue. There are a few options:
1. Put everything in one table, which will have some columns NULLable because different subtypes do not have the same attributes.
2. One parent table for all the common attributes, plus a type indicator column. Each subtype then has its own table with just the columns that subtype needs.
3. Each subtype has its own table, including all the common columns of all the subtypes.
(1) is good if the subtypes are very limited; (3) is suitable if the variations of the subtypes are very limited. The advantage of (2) is that it is easy to return all the records with the common columns, and if an artificial key (like an auto-increment id) is used, it ensures all records, regardless of subtype, have a unique id.
In your case, since no artificial PK is used, I think your choice is not bad.
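If you did want to go with option (2), a minimal sketch could look like the following (the `product` parent table and its columns are assumptions for illustration, not your actual schema):
CREATE TABLE `product` (
`PRODUCT_ID` INT NOT NULL AUTO_INCREMENT,
`PRODUCT_CODE` VARCHAR(50) NOT NULL,
`PRODUCT_TYPE` CHAR(1) NOT NULL, -- 'A' or 'B'
`RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_ID`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `product_type_a_detail` (
`PRODUCT_ID` INT NOT NULL,
`CHOICE_OF_COVER` VARCHAR(50) NOT NULL,
`PLAN_TYPE` VARCHAR(50) NOT NULL,
`PRODUCT_TENURE` INT NULL DEFAULT NULL,
PRIMARY KEY (`PRODUCT_ID`, `CHOICE_OF_COVER`, `PLAN_TYPE`),
FOREIGN KEY (`PRODUCT_ID`) REFERENCES `product` (`PRODUCT_ID`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
A matching `product_type_b_detail` table would hold the type-B-only columns.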

Problems with having a field that will be null very often on a table in SQL Server

I have a column that sometimes will be null. This column is also a foreign key, so I want to know if I'll have problems with performance or with data consistency given that this column will often be null.
I know it's a foolish question but I want to be sure.
There is no problem necessarily with this, other than it is likely an indication that you might have a poorly normalized design. There might be performance implications due to the way indexes are structured and the sparseness of the column with nulls, but without knowing your structure or intended querying scenarios, any conclusions one might draw would be pure speculation.
A better solution might be a shared primary key where table A has a primary key, and there is zero or one records in B with the same primary key.
If table A can have one or zero B, but more than one A can refer to B, then what you have is a one-to-many relationship. This can be represented as Pieter laid out in his answer. This allows multiple A records to refer to the same B, and in turn each B may optionally refer to an A.
So you see there are two optional structures to address this problem, and choosing between them is not guesswork. There is a distinct rationale for why you would choose one or the other, but it depends on the nature of the relationships you are modelling.
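As a minimal sketch of the shared-primary-key option described above (table names are generic placeholders):
create table A (
ID int identity not null primary key
)
go
create table B (
-- same key value as its A row; zero or one B exists per A
ID int not null primary key references A(ID)
)
go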
Instead of this design:
create table Master (
ID int identity not null primary key,
DetailID int null references Detail(ID)
)
go
create table Detail (
ID int identity not null primary key
)
go
consider this instead
create table Master (
ID int identity not null primary key
)
go
create table Detail (
ID int identity not null primary key,
MasterID int not null references Master(ID)
)
go
Now the Foreign Key is never null, rather the existence (or not) of the Detail record indicates whether it exists.
If a Detail can exist for multiple records, create a mapping table to manage the relationship.
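A minimal sketch of such a mapping table, assuming Detail then drops its MasterID column (names are illustrative):
create table MasterDetail (
MasterID int not null references Master(ID),
DetailID int not null references Detail(ID),
constraint PK_MasterDetail primary key (MasterID, DetailID)
)
go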

What is the most suitable database technology for financial time series data with heterogenous attributes?

I need to store large amounts of financial time series data where different data points have potentially different attributes.
For instance consider a situation where your database needs to store a time series of financial instruments that include stocks and options. Both stocks and options have prices at any given point in time, but options have additional attributes such as greeks (delta, gamma, vega), etc.
A relational database seems most appropriate here and one possibility would be to create one column per attribute, and set the unused attributes to NULL. So in the example above, for records that represent stocks you would use only some of the columns, and for options you would use some of the others.
The problem with this approach is that it is very inefficient (you end up storing a large number of NULLs) and that it is very inflexible (you need to add or drop a column every time you add or remove attributes).
One alternative might be to store all attributes in a vertical table (i.e. Key-Name-Value) but that has the disadvantage of forcing you to make all attributes type-unsafe (for example they might all be stored as strings).
Another option I thought of might be to store attributes as an XML document in a single column in the time series table. I tested this approach and it is impractical from a performance standpoint. If you want to extract attributes for any larger number of time series records, parsing the XML in each row is too slow.
The ideal database technology would be a combination between NoSQL and RDBMS where the key-timestamp pair behaves like a row in a relational, tabular database but all attributes are stored in a row-level bag, with fast access to each.
Is anyone aware of such a system? Are there other suggestions for storing the type of data I am describing?
Use "financial_instruments" to store information common to all financial instruments. Use "stocks" to store attributes that apply only to stocks; "options" to store attributes that apply only to options.
create table financial_instruments (
inst_id integer primary key,
inst_name varchar(57) not null unique,
inst_type char(1) check (inst_type in ('s', 'o')),
other_columns char(1), -- columns common to all financial instruments
unique (inst_id, inst_type) -- required for the FK constraint below.
);
create table stocks (
inst_id integer primary key,
inst_type char(1) not null default 's' check (inst_type = 's'),
other_columns char(1), -- columns unique to stocks.
foreign key (inst_id, inst_type) references financial_instruments (inst_id, inst_type)
);
create table options (
inst_id integer primary key,
inst_type char(1) not null default 'o' check (inst_type = 'o'),
other_columns char(1), -- columns unique to options; delta, gamma, vega.
foreign key (inst_id, inst_type) references financial_instruments (inst_id, inst_type)
);
To ease the programming job, you can build updatable views that join "financial_instruments" with each of its subtypes. Application code can just use the views.
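For instance, a view for stocks might look like the sketch below (column names follow the placeholder schema above; whether it is directly updatable, or needs INSTEAD OF triggers, depends on your DBMS):
create view stock_details as
select fi.inst_id,
fi.inst_name,
fi.other_columns as common_columns,
s.other_columns as stock_columns
from financial_instruments fi
join stocks s on s.inst_id = fi.inst_id;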
Additional tables that store related information about all financial instruments would set a foreign key reference to "financial_instruments"."inst_id". Tables that store related information about just, say, stocks would set a foreign key reference to "stocks"."inst_id".
A different option.
Master table with affiliated tables for the attributes of similar objects (think object-oriented with inheritance). You have 1-1 relationships between the master and the subs, with the primary key of the master table used as the primary key in the related tables.

ORACLE Table design: M:N table best practice

I'd like to hear your suggestions on this very basic question:
Imagine these three tables:
--DROP TABLE a_to_b;
--DROP TABLE a;
--DROP TABLE b;
CREATE TABLE A
(
ID NUMBER NOT NULL ,
NAME VARCHAR2(20) NOT NULL ,
CONSTRAINT A_PK PRIMARY KEY ( ID ) ENABLE
);
CREATE TABLE B
(
ID NUMBER NOT NULL ,
NAME VARCHAR2(20) NOT NULL ,
CONSTRAINT B_PK PRIMARY KEY ( ID ) ENABLE
);
CREATE TABLE A_TO_B
(
id NUMBER NOT NULL,
a_id NUMBER NOT NULL,
b_id NUMBER NOT NULL,
somevalue1 VARCHAR2(20) NOT NULL,
somevalue2 VARCHAR2(20) NOT NULL,
somevalue3 VARCHAR2(20) NOT NULL
) ;
How would you design table a_to_b?
I'll give some discussion starters:
synthetic id-PK column or combined a_id,b_id-PK (dropping the "id" column)
When synthetic: What other indices/constraints?
When combined: Also index on b_id? Or even b_id,a_id (don't think so)?
Also combined when these entries are referenced themselves?
Also combined when these entries perhaps are referenced themselves in the future?
Heap or Index-organized table
Always or only up to x "somevalue"-columns?
I know that the decision for one of the designs is closely related to the question how the table will be used (read/write ratio, density, etc.), but perhaps we get a 20/80 solution as blueprint for future readers.
I'm looking forward to your ideas!
Blama
I have always made the PK be the combination of the two FKs, a_id and b_id in your example. Adding a synthetic id field to this table does no good, since you never end up looking for a row based on a knowledge of its id.
Using the compound PK gives you a constraint that prevents the same instance of the relationship between a and b from being inserted twice. If duplicate entries need to be permitted, there's something wrong with your data model at the conceptual level.
The index you get behind the scenes (for every DBMS I know of) will be useful to speed up common joins. An extra index on b_id is sometimes useful, depending on the kinds of joins you do frequently.
Just as a side note, I don't use the name "id" for all my synthetic pk columns. I prefer a_id, b_id. It makes it easier to manage the metadata, even though it's a little extra typing.
CREATE TABLE A_TO_B
(
a_id NUMBER NOT NULL REFERENCES A (a_id),
b_id NUMBER NOT NULL REFERENCES B (b_id),
PRIMARY KEY (a_id, b_id),
...
) ;
It's not unusual for ORMs to require (or, in more clueful ORMs, hope for) an integer column named "id" in addition to whatever other keys you have. Apart from that, there's no need for it. An id number like that makes the table wider (which usually degrades I/O performance just slightly), and adds an index that is, strictly speaking, unnecessary. It isn't necessary to identify the entity--the existing key does that--and it leads new developers into bad habits. (Specifically, giving every table an integer column named "id", and believing that that column alone is the only key you need.)
You're likely to need one or more of these indexed.
a_id
b_id
{a_id, b_id}
{b_id, a_id}
I believe Oracle should automatically index {a_id, b_id}, because that's the primary key. Oracle doesn't automatically index foreign keys. Oracle's indexing guidelines are online.
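So if you join into the table from B frequently, you would create that index yourself; a minimal sketch against the compound-PK design above (index names are arbitrary):
-- (a_id, b_id) is already covered by the primary key index.
CREATE INDEX a_to_b_b_id_ix ON a_to_b (b_id);
-- or, if queries starting from B usually also need a_id:
CREATE INDEX a_to_b_b_a_ix ON a_to_b (b_id, a_id);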
In general, you need to think carefully about whether you need ON UPDATE CASCADE or ON DELETE CASCADE. In Oracle, you only need to think carefully about whether you need ON DELETE CASCADE. (Oracle doesn't support ON UPDATE CASCADE.)
The other comments so far are good.
Also consider adding begin_dt and end_dt to the relationship. In this way, you can manage a good number of questions about each relationship through time. (Consider baseline issues.)
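A rough sketch of what that could look like on the link table (the default here is only a suggestion):
ALTER TABLE A_TO_B ADD (
begin_dt DATE DEFAULT SYSDATE NOT NULL,
end_dt DATE -- null while the relationship is still current
);
If the same a_id/b_id pair can recur over time, begin_dt would also need to become part of the primary key.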

VARCHAR as foreign key/primary key in database good or bad?

Is it better if I use ID numbers instead of VARCHARs as foreign keys?
And is it better to use ID numbers instead of VARCHARs as primary keys?
By ID number I mean INT!
This is what I have now:
category table:
cat_id ( INT ) (PK)
cat_name (VARCHAR)
category options table:
option_id ( INT ) (PK)
cat_id ( INT ) (FK)
option_name ( VARCHAR )
I COULD HAVE THIS I THINK:
category table:
cat_name (VARCHAR) (PK)
category options table:
cat_name ( VARCHAR ) (FK)
option_name ( VARCHAR ) ( PK )
Or am I thinking completely wrong here?
The problem with VARCHAR being used for any KEY is that it can hold WHITE SPACE. White space consists of ANY non-screen-readable character, like spaces, tabs, carriage returns, etc. Using a VARCHAR as a key can make your life difficult when you start to hunt down why tables aren't returning records that have extra spaces at the end of their keys.
Sure, you CAN use VARCHAR, but you do have to be very careful with the input and output. They also take up more space and are likely slower when doing queries.
Integer types have a small list of 10 valid characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. They are a much better solution to use as keys.
You could always use an integer-based key and use VARCHAR as a UNIQUE value if you wanted to have the advantages of faster lookups.
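That suggestion could look something like this against your category tables (MySQL-style syntax assumed; adjust AUTO_INCREMENT for your DBMS):
CREATE TABLE category (
cat_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
cat_name VARCHAR(50) NOT NULL UNIQUE -- still enforced unique, but not used as the key
);
CREATE TABLE category_options (
option_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
cat_id INT NOT NULL,
option_name VARCHAR(50) NOT NULL,
FOREIGN KEY (cat_id) REFERENCES category (cat_id)
);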
My 2 cents:
From a performance perspective, using CHAR or VARCHAR as primary key or index is a nightmare.
I've tested compound primary keys (INT + CHAR, INT + VARCHAR, INT + INT) and by far INT + INT had the best performance (loading a data warehouse). Let's say about twice the performance if you keep only numeric primary keys/indexes.
When I'm doing design work I ask myself: have I got anything in this data that I can guarantee is going to be non-NULL, unique, and unchanging? If so that's a candidate to be the primary key. If not, I know I have to generate a key value to use. Assuming, then, that my candidate key happens to be a VARCHAR I then look at the data. Is it reasonably short in length (meaning, say, 20 characters or less)? Or is the VARCHAR field rather long? If it's short it's usable as a key - if it's long, perhaps it's better to not use it as a key (although if it's in consideration for being the primary key I'm probably going to have to index it anyways). At least part of my concern is that the primary key is going to have to be indexed and will perhaps be used as a foreign key from some other table. Comparisons of VARCHAR fields tend to be slower than the comparison of numeric fields (particularly binary numeric fields such as integers) so using a long VARCHAR field as a key may result in slow performance. YMMV.
With an int you can store values up to about 2 billion in 4 bytes; with a varchar you need 10 bytes or so to store the same number, and varchars also carry a 2-byte overhead.
So now you add up the 6 extra bytes in every PK and FK, plus the 2-byte varchar overhead.
I would say it is fine to use VARCHAR as both PRIMARY and FOREIGN KEYS.
The only issue I could foresee is if you have a table, let's say Instruments (share instruments), and you create the PRIMARY/FOREIGN KEY as VARCHAR, and it happens that the CODE changes.
This does happen on stock exchanges, and would require you to rename all references to this CODE, whereas an ID number would not require this from you.
So to conclude, I would say this depends on your intended use.
EDIT
When I say CODE, I mean the ticker code for, let's say, GOOG or any other share. It is possible for these codes to change over time, for example if you look at derivative/future instruments.
If you make the category name into the ID you will have a problem if you ever decide to rename a category.
There's nothing wrong with either approach, although this question might start the usual argument of which is better: natural or surrogate keys.
If you use CHAR or VARCHAR as a primary key you'll end up using it as a foreign key at some point. When it comes down to it, as #astander says, it depends on your data and how you are going to use it.