SQL static data / lookup lists

Regarding static data table design: given static data in tables like the following,
Currencies (Code, Name). Row example: USD, United States Dollar
Countries (Code, Name). Row example: DE, Germany
XXXObjectType (Code, Name, ... additional attributes)
...
does it make sense to have another (INTEGER) column as a Primary Key so that all Foreign Key references would use it?
Possible solutions:
Use additional INTEGER as PK and FK
Use Code (usually CHAR(N), where N is small) as PK and FK
Use Code only if less than a certain size... What size?
Other _______
What would be your suggestion? Why?
I have usually used INT IDENTITY columns, but very often the short code is good enough to show to the user in the UI, in which case the query needs one less JOIN.

An INT IDENTITY is absolutely not needed here. Use the 2- or 3-character mnemonic codes instead. If you have an entity that has no small, unique property, you should then consider using a synthetic key. But currency codes and country codes aren't the time to do it.
I once worked on a system where someone actually had a table of years, and each year had a YearID. And, true to form, 2001 was year 3, and 2000 was year 4. It made everything else in the system so much harder to understand and query for, and it was for nothing.

If you use an ID INT or a CHAR, referential integrity is preserved in both cases.
An INT is 4 bytes long, so it's equal in size to a CHAR(4); if you use CHAR(x) where x < 4, your CHAR key will be shorter than an INT one; if you use CHAR(x) where x > 4, your CHAR key will be larger than an INT one. For short keys it doesn't usually make sense to use VARCHAR, as the latter has a 2-byte overhead. In any case, when talking about tables with - say - 500 records, the total overhead of a CHAR(5) over an INT key would be just 500 bytes, a laughably small amount for a database where some tables could have millions of records.
Considering that countries and currencies (for example) are limited in number (a few hundred, at most), you have no real gain in using an ID INT instead of a CHAR(4); moreover, a CHAR(4) key is easier for the end user to remember, and can ease your life when you have to debug/test your SQL and/or data.
Therefore, though I usually use an ID INT key for most of my tables, in several circumstances I choose to have a PK/FK made of CHARs: countries, languages and currencies are among those cases.
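For illustration, a minimal sketch of the natural-key approach (the Invoices table and its columns are hypothetical, not from the question), where the code itself serves as both PK and FK and the UI can display it without an extra join:

CREATE TABLE Currencies (
    Code CHAR(3) PRIMARY KEY,   -- e.g. 'USD'
    Name VARCHAR(100) NOT NULL
);

CREATE TABLE Invoices (
    InvoiceId INT IDENTITY(1, 1) PRIMARY KEY,
    CurrencyCode CHAR(3) NOT NULL REFERENCES Currencies(Code)
    -- other columns ...
);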

SQL table with incompatible columns (only 1 must be used at a time)

Context:
Let's consider that I have a database with a table "house". I also have tables "tiledRoof" and "thatchedRoof".
Aim:
All my houses must have only one roof at a time. It can be a tiled one or a thatched one, but not both. Even if it doesn't make a lot of sense, imagine that we might change the roof of our houses many times.
My solution:
I can figure out 2 solutions to link houses to roofs:
Solution 1: Delete/create roofs every time:
The database would look something like this (more or less pseudo-SQL):
house{
tiledRoof_id int DEFAULT NULL FOREIGN KEY REFERENCES tiledRoof(id)
thatchedRoof_id int DEFAULT NULL FOREIGN KEY REFERENCES thatchedRoof(id)
// Other columns ...
}
tiledRoof{
id
// Other columns ...
}
thatchedRoof{
id
// Other columns ...
}
So, I make "tiledRoof_id" and "thatchedRoof_id" nullable. Then if I want to link a house with a tiled roof, I do an upsert on the table "tiledRoof". If a row has been created, I update "tiledRoof_id" to match the created id. Then, if my house was linked to a thatched roof, I delete a row in "thatchedRoof" and set "thatchedRoof_id" to NULL (I guess I can do this automatically by implementing the ON DELETE of my foreign key constraint).
Downsides:
Deleting a row and later creating a similar one might not be very clever. If I change my roof 50 times, I will create 50 rows and delete 49 of them...
More queries to run than with the second solution.
Solution 2: Add "enabler columns":
The database would look something like this (more or less pseudo-SQL):
house{
tiledRoof_id int DEFAULT(...) FOREIGN KEY REFERENCES tiledRoof(id)
thatchedRoof_id int DEFAULT(...) FOREIGN KEY REFERENCES thatchedRoof(id)
tiledRoof_enabled boolean DEFAULT True
thatchedRoof_enabled boolean DEFAULT False
// Other columns ...
}
tiledRoof{
id
// Other columns ...
}
thatchedRoof{
id
// Other columns ...
}
I fill both "tiledRoof_id" and "thatchedRoof_id" with a foreign id that links each of my houses to a tiled roof AND to a thatched roof.
To make my house not really have both roofs, I just enable one of them. To do so I add two additional columns, "tiledRoof_enabled" and "thatchedRoof_enabled", which define which roof is enabled.
Alternatively, I could use a single integer column to indicate the enabled roof (1 would mean the tiled one is enabled and 2 would mean the thatched one).
Difficulty:
To make that solution work, it would require an implementation of the default value of "tiledRoof_id" and "thatchedRoof_id" that might not be possible: it would have to insert a new row into the corresponding roof table and use the resulting row id as the default value.
If that cannot be done, I have to start by running queries to create my roofs and then create my house.
Question:
What is the best way to achieve my goal? One of the solutions that I proposed? Another one? If it's the second of my propositions, I would be grateful if you could explain whether my difficulty can be resolved and how.
Note:
I'm working with sqlite3 (mentioned just in case of syntax differences).
It sounds like you want a slowly changing dimension. Given only two types, I would suggest:
create table house_roofs (
    house_id int references houses(house_id),
    thatched_roof_id int references thatched_roofs(thatched_roof_id),
    tiled_roof_id int references tiled_roofs(tiled_roof_id),
    version_eff_dt datetime not null,
    version_end_dt datetime,
    check (thatched_roof_id is null or tiled_roof_id is null) -- only one at a time
);
This allows you to have properly declared foreign key relationships.
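As a hedged sketch of reading the current roof back (assuming the convention that the active version row keeps version_end_dt as NULL):

select hr.house_id,
       case
           when hr.tiled_roof_id is not null then 'tiled'
           when hr.thatched_roof_id is not null then 'thatched'
       end as roof_type
from house_roofs hr
where hr.version_end_dt is null;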
Are you sure you need to normalize the roof type? Why not simply add a boolean for each of the roof types in your house table? SQLite doesn't actually have a boolean type, so you could use an integer 0 or 1.
Note: You would still want to have the tables thatchedRoof and tiledRoof if there are details about each of those types that are generic for all roofs of that type.
If the tables thatchedRoof and tiledRoof contain details that are specific to each individual house, then this strategy may not work too well.
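A hedged sketch of that simpler alternative in SQLite (column names are placeholders; the defaults mirror solution 2 in the question):

CREATE TABLE house (
    id INTEGER PRIMARY KEY,
    has_tiled_roof INTEGER NOT NULL DEFAULT 1,      -- 0 or 1
    has_thatched_roof INTEGER NOT NULL DEFAULT 0,   -- 0 or 1
    CHECK (has_tiled_roof + has_thatched_roof = 1)  -- exactly one roof at a time
);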

Efficient storage pattern for millions of values of different types

I am about to build an SQL database that will contain the results of statistics calculations for hundreds of thousands of objects. It is planned to use Postgres, but the question equally applies to MySQL.
For example, hypothetically, let's assume I have half a million records of phone calls. Each PhoneCall will now, through a background job system, have statistics calculated. For example, a PhoneCall has the following statistics:
call_duration: in seconds (float)
setup_time: in seconds (float)
dropouts: periods in which audio dropout was detected (array), e.g. [5.23, 40.92]
hung_up_unexpectedly: true or false (boolean)
These are just simple examples; in reality, the statistics are more complex. Each statistic has a version number associated with it.
I am unsure which storage pattern for this type of calculated data will be the most efficient. I'm not looking to fully normalize everything in the database, though. So far, I have come up with the following options:
Option 1 – long format in one column
I store the statistic name and its value in one column each, with a reference to the main transaction object. The value column is a text field; the value will be serialized (e.g. as JSON or YAML) so that different types (strings, arrays, ...) can be stored. The database layout for the statistics table would be:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value (text, serialized)
statistic_version (integer)
created_at (datetime)
I have worked with this pattern for a while, and what's good about it is that I can easily filter statistics according to phone call and the statistic name. I can also add new types of statistics easily and filter by version and creation time.
But it seems to me that the (de)serialization of values makes it quite inefficient for handling lots of data. Also, I cannot perform calculations at the SQL level; I always have to load and deserialize the data. Or is the JSON support in Postgres good enough that I could still pick this pattern?
Option 2 – statistics as attributes of main object
I could also think about collecting all types of statistic names and adding them as new columns to the phone call object, e.g.:
id (PK)
call_duration
setup_time
dropouts
hung_up_unexpectedly
...
This would be very efficient, and each column would have its own type, but I can no longer store different versions of statistics, or filter them according to when they were created. The whole business logic of statistics disappears. Adding new statistics is also not easily possible, since the names are baked in.
Option 3 – statistics as different columns
This would probably be the most complex. I am storing only a reference to the statistic type, and the column will be looked up according to that:
statistic_id (PK)
phone_call_id (FK)
statistic_name (string)
statistic_value_bool (boolean)
statistic_value_string (string)
statistic_value_float (float)
statistic_value_complex (serialized or complex data type)
statistic_value_type (string that indicates bool, string etc.)
statistic_version (integer)
created_at (datetime)
This would mean that the table is going to be very sparse, as only one of the statistic_value_ columns would be populated. Could that lead to performance issues?
Option 4 – normalized form
Trying to normalize option 3, I would create two tables:
statistics
id (PK)
version
created_at
statistic_mapping
phone_call_id (FK)
statistic_id (FK)
statistic_type_mapping
statistic_id (FK)
type (string, indicates bool, string etc.)
statistic_values_boolean
statistic_id (FK)
value (bool)
…
But this isn't going anywhere since I can't dynamically join to another table name, can I? Or should I then just join to all statistic_values_* tables based on the statistic ID anyway? My application would then have to make sure that no duplicate entries exist.
To summarize, given this use case, what would be the most efficient approach for storing millions of statistic values in a relational DB (e.g. Postgres), when the requirement is that statistic types may be added or changed, and that several versions exist at the same time, and that querying of the values should be somewhat efficient?
IMO you can use the following simple database structure to solve your problem.
Statistics type dictionary
A very simple table - just name and description of the stat. type:
create table stat_types (
type text not null constraint stat_types_pkey primary key,
description text
);
(You can replace it with an enum if you have a finite number of elements.)
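For example, a hedged sketch of the enum alternative (the type name and value list are assumptions):

create type stat_type as enum ('CALL_DURATION', 'SETUP_TIME', 'DROPOUTS', 'UNEXPECTED_HANGUP');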
Stat table for every type of object in the project
It contains an FK to the object, an FK to the stat. type (or just the enum) and, importantly, a jsonb field with arbitrary stat. data related to its type. For example, such a table for phone calls:
create table phone_calls_statistics (
phone_call_id uuid not null references phone_calls,
stat_type text not null references stat_types,
data jsonb,
constraint phone_calls_statistics_pkey primary key (phone_call_id, stat_type)
);
I assume here that the phone_calls table has a uuid PK:
create table phone_calls (
id uuid not null constraint phone_calls_pkey primary key
-- ...
);
The data field's structure depends on its stat. type. An example for call duration:
{
"call_duration": 120.0
}
or for dropouts:
{
"dropouts": [5.23, 40.92]
}
Let's play with data:
insert into phone_calls_statistics values
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'CALL_DURATION', '{"call_duration": 100.0}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'CALL_DURATION', '{"call_duration": 110.0}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'CALL_DURATION', '{"call_duration": 120.0}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'CALL_DURATION', '{"call_duration": 130.0}'),
('9fc1f6c3-a9d3-4828-93ee-cf5045e93c4c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('86d1a2a6-f477-4ed6-a031-b82584b1bc7e', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": true}'),
('cfd4b301-bdb9-4cfd-95db-3844e4c0625c', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}'),
('39465c2f-2321-499e-a156-c56a3363206a', 'UNEXPECTED_HANGUP', '{"unexpected_hungup": false}');
Get the average, min and max call duration:
select
avg((pcs.data ->> 'call_duration')::float) as avg,
min((pcs.data ->> 'call_duration')::float) as min,
max((pcs.data ->> 'call_duration')::float) as max
from
phone_calls_statistics pcs
where
pcs.stat_type = 'CALL_DURATION';
Get the number of unexpected hang-ups:
select
sum(case when (pcs.data ->> 'unexpected_hungup')::boolean is true then 1 else 0 end) as hungups
from
phone_calls_statistics pcs
where
pcs.stat_type = 'UNEXPECTED_HANGUP';
I believe that this solution is very simple and flexible, has good performance potential and perfect scalability. The main table has a simple index; all queries will perform inside it. You can always extend the number of stat. types and their calculations.
Live example: https://www.db-fiddle.com/f/auATgkRKrAuN3jHjeYzfux/0
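If you later also need to filter on values inside data, one further option (a hedged sketch, not part of the fiddle above; whether it pays off depends on your actual queries) is a GIN index on the jsonb column:

create index phone_calls_statistics_data_idx
    on phone_calls_statistics using gin (data jsonb_path_ops);

-- usable with containment queries such as:
-- select * from phone_calls_statistics where data @> '{"unexpected_hungup": true}';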

How to select only the records whose primary keys are only int types and vice versa?

Suppose I have a table in which the primary keys are of type double: 1, 1.2, 1.4, 3, 3.2, 5, 6.2, 7 and so on.
In Microsoft Access, I would like to have a record source or a recordset based on a query (SQL statement) that selects only the records with primary keys 1, 3, 5 and 7. Similarly, I would also like a query that selects only the non-integer values (1.2, 1.4, 3.2 and 6.2). How do I write such a query?
Here is one method:
select t.*
from t
where pk = int(pk);
The query for the decimal types is quite similar.
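For instance, a hedged sketch of the complementary query (rows whose key has a fractional part):

select t.*
from t
where pk <> int(pk);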
By the way, double is a very bad type for a primary key. Two values can look the same but be different (this is how floating point representations work). If you need decimal points for such a key, you should use decimal instead.

MySQL. Working with Integer Data Interval

I've just started using SQL, so I have no idea how to work with non-standard data types.
I'm working with MySQL...
Say, there are 2 tables: Stats and Common. The Common table looks like this:
CREATE TABLE Common (
Mutation VARCHAR(10) NOT NULL,
Deletion VARCHAR(10) NOT NULL,
Stats_id ??????????????????????,
UNIQUE(Mutation, Deletion) );
Instead of the ? symbols there must be some type that references the Stats table (Stats.id).
The problem is that this type must make it possible to store data in a format like 1..30 (an interval between 1 and 30). My idea with such a type was to shorten the Common table's length.
Is it possible to do this, are there any different ideas?
Assuming that Stats.id is an INTEGER (if not, change the below items as appropriate):
first_stats_id INTEGER NOT NULL REFERENCES Stats(id)
last_stats_id INTEGER NOT NULL REFERENCES Stats(id)
Given that your table contains two VARCHAR fields and a unique index over them, having an additional integer field is the least of your concerns as far as storage goes (seriously, one integer field represents a mere 1 GB for 262 million rows).
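Putting it together, a hedged sketch of what the Common table could look like (assuming Stats.id is INTEGER; explicit FOREIGN KEY clauses are used because MySQL ignores inline column-level REFERENCES):

CREATE TABLE Common (
    Mutation VARCHAR(10) NOT NULL,
    Deletion VARCHAR(10) NOT NULL,
    first_stats_id INTEGER NOT NULL,
    last_stats_id INTEGER NOT NULL,
    UNIQUE (Mutation, Deletion),
    FOREIGN KEY (first_stats_id) REFERENCES Stats(id),
    FOREIGN KEY (last_stats_id) REFERENCES Stats(id)
);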

Generate unique ID to share with multiple tables SQL 2008

I have a couple of tables in a SQL Server 2008 database that I need to generate unique IDs for. I have looked at the "identity" column, but the IDs really need to be unique and shared between all the tables.
So if I have, say, five tables of the flavour "asset infrastructure" and I want to use a unique ID across them as a combined group, I need some sort of generator that looks at all five tables and issues the next ID that is not duplicated in any of those five tables.
I know this could be done with some sort of stored procedure but I'm not sure how to go about it. Any ideas?
The simplest solution is to set your identity seeds and increment on each table so they never overlap.
Table 1: Seed 1, Increment 5
Table 2: Seed 2, Increment 5
Table 3: Seed 3, Increment 5
Table 4: Seed 4, Increment 5
Table 5: Seed 5, Increment 5
The identity column mod 5 will tell you which table the record is in. You will use up your identity space five times faster so make sure the datatype is big enough.
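A hedged sketch of the seed/increment setup in T-SQL (table and column names are placeholders):

CREATE TABLE AssetTable1 (ID BIGINT IDENTITY(1, 5) PRIMARY KEY, Name VARCHAR(100));
CREATE TABLE AssetTable2 (ID BIGINT IDENTITY(2, 5) PRIMARY KEY, Name VARCHAR(100));
-- ... and so on, up to IDENTITY(5, 5) for the fifth table.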
Why not use a GUID?
You could let them each have an identity that seeds from numbers far enough apart never to collide.
GUIDs would work but they're butt-ugly, and non-sequential if that's significant.
Another common technique is to have a single-column table with an identity that dispenses the next value each time you insert a record. If you need the IDs pulled from a common sequence, it can be useful to have a second column indicating which table the value was dispensed to.
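A hedged sketch of such a dispenser table (names are placeholders; SCOPE_IDENTITY() returns the value just generated in the current scope):

CREATE TABLE KeyDispenser (
    ID BIGINT IDENTITY(1, 1) PRIMARY KEY,
    DispensedTo VARCHAR(50) NOT NULL
);

-- Grab the next shared ID for, say, Table3:
INSERT INTO KeyDispenser (DispensedTo) VALUES ('Table3');
SELECT SCOPE_IDENTITY() AS NewSharedID;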
You realize there are logical design issues with this, right?
Reading into the design a bit, it sounds like what you really need is a single table called "Asset" with an identity column, and then either:
a) 5 additional tables for the subtypes of assets, each with a foreign key to the primary key on Asset; or
b) 5 views on Asset that each select a subset of the rows and then appear (to users) like the 5 original tables you have now.
If the columns on the tables are all the same, (b) is the better choice; if they're all different, (a) is the better choice. This is a classic DB spin on the supertype / subtype relationship.
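For illustration, a hedged sketch of option (a), the supertype/subtype pattern (table and column names are placeholders, not from the original question):

CREATE TABLE Asset (
    AssetID INT IDENTITY(1, 1) PRIMARY KEY,
    Name VARCHAR(100) NOT NULL
);

CREATE TABLE NetworkAsset (
    AssetID INT PRIMARY KEY REFERENCES Asset(AssetID),
    IpAddress VARCHAR(45)  -- subtype-specific columns go here
);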
Alternately, you could do what you're talking about and recreate the IDENTITY functionality yourself with a stored proc that wraps INSERT access on all 5 tables. Note that you'll have to put a TRANSACTION around it if you want guarantees of uniqueness, and if this is a popular table, that might make it a performance bottleneck. If that's not a concern, a proc like that might take the form:
CREATE PROCEDURE InsertAsset_Table1
    -- parameters for Table1's columns go here
AS
BEGIN
    BEGIN TRANSACTION
    -- SELECT MIN INTEGER NOT ALREADY USED IN ANY OF THE FIVE TABLES
    -- INSERT INTO Table1 WITH THAT ID
    COMMIT TRANSACTION -- or roll back on error, etc.
END
Again, SQL Server is highly optimized for helping you out if you choose the patterns I mention above, and NOT optimized for this kind of thing (there's overhead in creating the transaction AND you'll be issuing shared locks on all 5 tables while this process is going on). Compare that with using the PK/FK method above, where SQL Server knows exactly how to do it without locks, or the view method, where you're only inserting into one table.
I found this when searching on Google. I am facing a similar problem for the first time. I had the idea to have a dedicated ID table specifically to generate the IDs, but I was unsure if it was considered OK design. So I just wanted to say THANKS for the confirmation.. it looks like it is an adequate solution, although not ideal.
I have a very simple solution. It should be good for cases when the number of tables is small:
create table T1(ID int primary key identity(1,2), rownum varchar(64))
create table T2(ID int primary key identity(2,2), rownum varchar(64))
insert into T1(rownum) values('row 1')
insert into T1(rownum) values('row 2')
insert into T1(rownum) values('row 3')
insert into T2(rownum) values('row 1')
insert into T2(rownum) values('row 2')
insert into T2(rownum) values('row 3')
select * from T1
select * from T2
drop table T1
drop table T2
This is a common problem, for example when using a table of people (called PERSON, singular please) where each person is categorized, for example as Doctors, Patients, Employees, Nurses etc.
It makes a lot of sense to create a table for each of these categories of people that contains their specific information, like an employee's start date and salary, or a nurse's qualifications and number.
A Patient, for example, may have many nurses and doctors that work on him, so a many-to-many table that links a Patient to other people in the PERSON table facilitates this nicely. In this table there should be some description of the relationship between these people, which leads us back to the categories of people.
Since a Doctor and a Patient could create the same Primary Key ID in their own tables, it becomes very useful to have a Globally unique ID or Object ID.
A good way to do this, as suggested, is to have a table dedicated to auto-incrementing the primary key. Perform an insert on that table first to obtain the OID, then use it for the new PERSON.
I like to go a step further. When things get ugly (some new developer gets his hands on the database, or even worse, a really old developer), it's very useful to add more meaning to the OID.
Usually this is done programmatically, not with the database engine, but if you use a BIGINT for all the Primary Key IDs then you have lots of room to prefix the number with a visually identifiable sequence. For example, all Doctor IDs could begin with 100, all Patients with 110, all Nurses with 120.
To that I would append, say, a Julian date or a Unix date+time, and finally append the auto-increment ID.
This would result in numbers like:
110,2455892,00000001
120,2455892,00000002
100,2455892,00000003
Since the Julian date 100 years from now is only 2492087, you can see that 7 digits will adequately store this value.
A BIGINT is a 64-bit (8-byte) signed integer with a range of -9.22x10^18 to 9.22x10^18 (-2^63 to 2^63 - 1). Notice the exponent is 18. That's 18 digits you have to work with.
Using this design, you are limited to 100 million OIDs, 999 categories of people and dates up to... well past the shelf life of your database, but I suspect that's good enough for most solutions.
The operations required to create an OID like this are all multiplication and division, which avoids all the gear grinding of text manipulation.
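A hedged sketch of composing such an OID with arithmetic only (the category, Julian date and sequence values are illustrative):

DECLARE @category BIGINT = 110;        -- e.g. patients
DECLARE @julianDate BIGINT = 2455892;  -- 7-digit Julian day number
DECLARE @seq BIGINT = 1;               -- auto-increment value

DECLARE @oid BIGINT =
    @category * 1000000000000000   -- shift left by 15 digits
  + @julianDate * 100000000        -- shift left by 8 digits
  + @seq;

SELECT @oid AS OID;                -- 110245589200000001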
The disadvantage is that INSERTs require more than a simple T-SQL statement, but the advantage is that when you are tracking down errant data, or even being clever in your queries, your OID visually tells you a lot more than a random number or, worse, an eyesore like a GUID.