Undoing two kinds of values in the same column - sql

Help! I have a Postgres table like:
create table events(
    event_id bigserial primary key,
    reason text not null,
    lifecycle_id bigint not null,
    ...
);
I've got two kinds of values in the same column: lifecycle_id is populated either with an identifier for another resource (transaction_id) or with a randomly generated number when no transaction_id is present. This number is not unique; it is used to connect multiple rows, i.e. there can be several rows sharing the same lifecycle_id, and we will want to query on it.
Obviously this is not great, because it will inevitably lead to collisions. So I'm introducing a nullable transaction_id column to the table and actually populating it, and I will also start generating the future lifecycle values properly. But this isn't enough for me: I'd like to find a way to migrate the lifecycle column to something less "hacky", either a sequence or a serial column.
What would you do in my shoes to avoid having to maintain collision-management code?
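For concreteness, a minimal Postgres sketch of the migration I have in mind (the sequence name is invented, and the backfill predicate is a placeholder, since only the application knows which lifecycle_id values are really transaction ids):
alter table events add column transaction_id bigint;
-- backfill (placeholder predicate): copy over the values that really are transaction ids
-- update events set transaction_id = lifecycle_id where <lifecycle_id came from a transaction>;
-- start a dedicated sequence above everything already in the column,
-- so newly generated values cannot collide with historical ones
create sequence events_lifecycle_seq;
select setval('events_lifecycle_seq', (select coalesce(max(lifecycle_id), 1) from events));
-- new rows draw from the sequence by default
alter table events alter column lifecycle_id set default nextval('events_lifecycle_seq');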

Related

What table structure to go for if there are two objects of same type but of different nature?

There are two kinds of products, X and Y. X has {A, B, C} as its primary key, whereas Y has {A, D} as its primary key. Should I put them in the same table? If so, why, and if not, why not?
I have currently put them in two separate tables, but some colleagues are suggesting that they belong in the same table. Should I consider putting them in the same table, or continue with different tables?
Below I have given example tables for the above case.
CREATE TABLE `product_type_b` (
    `PRODUCT_CODE` VARCHAR(50) NOT NULL,
    `COMPONENT_CODE` VARCHAR(50) NOT NULL,
    `GROUP_INDICATOR` VARCHAR(50) NULL DEFAULT NULL,
    `RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
    PRIMARY KEY (`PRODUCT_CODE`, `COMPONENT_CODE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `product_type_a` (
    `PRODUCT_CODE` VARCHAR(50) NOT NULL,
    `CHOICE_OF_COVER` VARCHAR(50) NOT NULL,
    `PLAN_TYPE` VARCHAR(50) NOT NULL,
    `RECORD_TIMESTAMP` DATE NULL DEFAULT NULL,
    `PRODUCT_TENURE` INT(11) NULL DEFAULT NULL,
    PRIMARY KEY (`PRODUCT_CODE`, `CHOICE_OF_COVER`, `PLAN_TYPE`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
As you can see, some fields that are part of the primary key are not common to both tables, and there are other non-key fields that differ between them as well.
Here is the bigger picture of the system under consideration.
Each product type arrives from a different source, from which it is sent to the system.
We need to store these products in the database.
I would like a balance between normalization and performance, so that read-write speeds aren't compromised much by over-normalization.
There is also a web app with a page where these products are searchable by the user.
The user will populate certain column fields as filters, based on which we need to fetch the products and show them in the UI.
The number of subtype variations is currently 2 and is not expected to grow beyond 4-5, and that growth would play out over a decade or so; this again is an approximation.
I hope this presents a bigger picture of the system.
I want good read and write speeds without compromising much. So should I go ahead with this design? If not, what design should be implemented?
For a trading system, and taking into account at most 5 product types and a very limited number of attributes, I'd prefer a single table for all products with a surrogate PK. Think about references to products from trading transactions: in the long run, those are the biggest part of the total DB content.
A metadata table describing every product-specific attribute and its mapping to a column of the general table would help build the UI and the backend/frontend communication.
Search indexes would reflect the most popular user searches, depending on product type.
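Sketched roughly (all names here are invented for illustration):
CREATE TABLE product (
    product_id INT AUTO_INCREMENT PRIMARY KEY, -- surrogate PK referenced by trades
    product_type VARCHAR(10) NOT NULL,
    attr1 VARCHAR(50) NULL, -- generic slots; their meaning depends on product_type
    attr2 VARCHAR(50) NULL,
    attr3 VARCHAR(50) NULL
);
CREATE TABLE product_attribute_map (
    product_type VARCHAR(10) NOT NULL,
    column_name VARCHAR(30) NOT NULL, -- e.g. 'attr1'
    attribute_name VARCHAR(50) NOT NULL, -- e.g. 'CHOICE_OF_COVER'
    PRIMARY KEY (product_type, column_name)
);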
This is a typical Category/SubCategory modelling issue. There are a few options:
(1) Put everything in one table, which will have some NULLable columns, because different subtypes do not have the same attributes.
(2) One parent table for all the common attributes, together with a type-indicator column. Each subtype then has its own table holding just the columns that subtype needs.
(3) Each subtype has its own table, including all the common columns.
(1) is good if the number of subtypes is very limited; (3) is suitable if the variations between the subtypes are very limited. The advantage of (2) is that it is easy to return all records with their common columns, and if an artificial key (such as an auto-increment id) is used, it ensures that all records, regardless of subtype, have unique ids.
In your case, where no artificial PK is used, I think your choice is not bad.
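For illustration, here is a minimal sketch of option (2) applied to the question's tables (the split of columns between parent and subtype tables is my assumption):
CREATE TABLE product_common (
    product_id INT AUTO_INCREMENT PRIMARY KEY, -- artificial key, unique across subtypes
    product_type CHAR(1) NOT NULL, -- 'A' or 'B'
    PRODUCT_CODE VARCHAR(50) NOT NULL,
    RECORD_TIMESTAMP DATE NULL
);
CREATE TABLE product_a_detail (
    product_id INT PRIMARY KEY,
    CHOICE_OF_COVER VARCHAR(50) NOT NULL,
    PLAN_TYPE VARCHAR(50) NOT NULL,
    PRODUCT_TENURE INT NULL,
    FOREIGN KEY (product_id) REFERENCES product_common (product_id)
);
CREATE TABLE product_b_detail (
    product_id INT PRIMARY KEY,
    COMPONENT_CODE VARCHAR(50) NOT NULL,
    GROUP_INDICATOR VARCHAR(50) NULL,
    FOREIGN KEY (product_id) REFERENCES product_common (product_id)
);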

Is a text column with a small number of distinct values a bad smell?

For example, let's say I have a table like this:
CREATE TABLE people (
    firstname TEXT,
    lastname TEXT,
    weight FLOAT,
    zodiac_sign TEXT
);
The first three columns are going to have many distinct values, and the number of them will grow without bound as I add more rows. But zodiac_sign will always be one of 12 values.
I'm assuming that SQLite will use 11 bytes for every instance of 'sagittarius' (i.e. that it's not smart enough to infer that zodiac_sign is basically an enum that can be stored in a single byte).
Does this suggest that, if the number of rows I'll be dealing with is non-trivial, I should split off another table like this:
CREATE TABLE people (
    firstname TEXT,
    lastname TEXT,
    weight FLOAT,
    zodiac_id INTEGER NOT NULL REFERENCES zodiac_signs(zodiac_id)
);
CREATE TABLE zodiac_signs (
    zodiac_id INTEGER PRIMARY KEY,
    name TEXT
);
And would this still be a good practice for a text column that holds a small number of distinct values but which isn't constrained to some set of values that will never change? e.g. if I had a column for country of birth.
In the first table design, the field zodiac_sign could just as well be renamed enter_anything_you_want. Good data integrity practices demand that the domain of each field be constrained whenever it is meaningful. For example, for general use, a birthdate field may be constrained only to be in the past, never in the future. For a DMV database, the same field may be constrained to be at least 16 years in the past.
When the field is text with a fixed number of valid alternatives, you could define it with a check constraint:
zodiac_sign text check( zodiac_sign in ( 'Leo', 'Cancer', ... ) ),
(The examples I give may not directly translate into SQLite code. But I wanted to make this a more general discussion.)
However, this is somewhat awkward, and the constraint would have to be repeated in every table containing that field, meaning that should the list ever change, a lot of ALTER TABLE commands could be needed to roll out the new list. Although it may seem obvious that zodiac signs are not likely to change, that is not generally a dependable property of such fields.
Implementing a lookup table of the textual values, giving them a key value for foreign key reference is a much better way. It limits the valid values just like a check constraint, but all the values are in one location so changes are localized. Plus, additional fields could be added.
create table Zodiac(
    ID integer not null,
    Name text not null,
    StartDate date,
    EndDate date,
    constraint PK_Zodiac primary key( ID )
);
The dates would make calculating which sign to associate with a given date a lot easier. (Yes, SQLite does not have a native date datatype -- use whatever type is already your preference.)
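As a usage sketch (using the table and column names from the question's second design), the sign name is resolved at read time with a join:
SELECT p.firstname, p.lastname, z.name AS zodiac_sign
FROM people p
JOIN zodiac_signs z ON z.zodiac_id = p.zodiac_id
WHERE z.name = 'sagittarius';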

Is it a bad idea to make a count the primary key?

Just thinking about database design issues. Suppose I have a table like this:
CREATE TABLE LEGACYD.CV_PLSQL_COUNT
(
    RUN_DTTM DATE NOT NULL,
    TABLE_NAME VARCHAR (80) NOT NULL,
    COUNT Number(50) PRIMARY KEY
);
Is it a bad idea to make COUNT the primary key? And generally, how should I decide what the primary key should be?
Candidate keys are based on functional dependencies. A primary key is one of the candidate keys. There's no formal way to look at a set of candidate keys and say, "This one must be the primary key." That decision is based on practical matters, not on formal logic.
Your table structure tells us these things.
COUNT is unique. No matter how many rows this table has, you'll never find two rows that have the same value for COUNT.
COUNT determines TABLE_NAME. That is, given a value for COUNT, we will forever find one and only one value for TABLE_NAME.
TABLE_NAME is not unique. We expect to find several rows that have the same value for TABLE_NAME.
COUNT determines RUN_DTTM. That is, given a value for COUNT, we will forever find one and only one value for RUN_DTTM.
RUN_DTTM is not unique. We expect to find several rows that have the same value for RUN_DTTM.
The combination of TABLE_NAME and RUN_DTTM is not unique. We expect to find several rows that share the same pair of values for TABLE_NAME and RUN_DTTM.
There are no other determinants. That means that given a value for TABLE_NAME, we'll find multiple unrelated values for COUNT and RUN_DTTM. Likewise, if we're given a value for RUN_DTTM, or values for the pair of columns {TABLE_NAME, RUN_DTTM}.
If all those things are true, then COUNT might be a good primary key. But I doubt that all those things are true.
Based only on the column names--a risky way to proceed--I think it's far more likely that the only candidate key is {TABLE_NAME, RUN_DTTM}. I think it's also likely that either RUN_DTTM is misnamed, or RUN_DTTM has the wrong data type. If it's a date, you probably meant to name it RUN_DT; if it's a timestamp, the data type should be TIMESTAMP.
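If that guess is right, the table might look like this instead (a sketch; I have renamed COUNT to ROW_COUNT to avoid clashing with the aggregate function, and reduced NUMBER(50) to Oracle's maximum precision of 38):
CREATE TABLE LEGACYD.CV_PLSQL_COUNT
(
    RUN_DTTM TIMESTAMP NOT NULL,
    TABLE_NAME VARCHAR2(80) NOT NULL,
    ROW_COUNT NUMBER(38) NOT NULL,
    CONSTRAINT CV_PLSQL_COUNT_PK PRIMARY KEY (TABLE_NAME, RUN_DTTM)
);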

Should I use a unique constraint in a table even though it isn't necessarily required?

In Microsoft SQL Server, when creating tables, are there any downsides to using a unique constraint on a column even though you don't really need it to be unique?
An example would be descriptions for say a role in a user management system:
CREATE TABLE Role
(
    ID TINYINT PRIMARY KEY NOT NULL IDENTITY(0, 1),
    Title CHARACTER VARYING(32) NOT NULL UNIQUE,
    Description CHARACTER VARYING(MAX) NOT NULL UNIQUE
)
My fear is that validating this constraint during frequent insertions in other tables will be very time consuming. I am unsure how this constraint is validated, but I feel it could be done very efficiently, or it could be a linear comparison.
Your fear is well founded: UNIQUE constraints are implemented as indexes, and that costs time and space.
So whenever you insert a new row, the database has to update the table, plus one index for each unique constraint.
So, regarding your words:
using a unique constraint on a column even though you don't really need it to be unique
the answer is no, don't use it; there are time and space downsides.
Your sample table would need a clustered index for the ID and two extra indexes, one for each unique constraint. That takes up space, as well as time to update all three indexes on every insert.
This would only be justified if your queries filter by those fields.
By the way, the sample table in the original post has several flaws:
that syntax is not SQL Server syntax (and you tagged this as SQL Server);
you cannot create an index on a varchar(max) column.
If you correct the syntax and create this table:
CREATE TABLE Role
(
    ID tinyint IDENTITY(0, 1) NOT NULL PRIMARY KEY,
    Title varchar(32) NOT NULL UNIQUE,
    Description varchar(32) NOT NULL UNIQUE
)
You can then execute sp_help Role and you'll find the three indexes.
The database creates an index which backs up the UNIQUE constraint, so it should be very low-cost to do the uniqueness check.
http://msdn.microsoft.com/en-us/library/ms177420.aspx
The Database Engine automatically creates a UNIQUE index to enforce the uniqueness requirement of the UNIQUE constraint. Therefore, if an attempt to insert a duplicate row is made, the Database Engine returns an error message that states the UNIQUE constraint has been violated and does not add the row to the table. Unless a clustered index is explicitly specified, a unique, nonclustered index is created by default to enforce the UNIQUE constraint.
Is it typically a good practice to constrain it if you know the data will always be unique but it doesn't necessarily need to be unique for the application to function correctly?
My question to you: would it make sense for two roles to have different titles but the same description? e.g.
INSERT INTO Role ( Title , Description )
VALUES ( 'CEO' , 'Senior manager' ),
( 'CTO' , 'Senior manager' );
To me it would seem to devalue the use of the description; if there were many duplications then it might make more sense to do something more like this:
INSERT INTO Role ( Title )
VALUES ( 'CEO' ),
( 'CTO' );
INSERT INTO SeniorManagers ( Title )
VALUES ( 'CEO' ),
( 'CTO' );
But then again you are not expecting duplicates.
I assume this is a low activity table. You say you fear validating this constraint when doing frequent insertions in other tables. Well, that will not happen (unless there is a trigger we cannot see that might update this table when another table is updated).
Personally, I would ask the designer (business analyst, whatever) to justify not applying a unique constraint. If they cannot, then I would impose the unique constraint based on common sense. As is usual for such a text column, I would also apply CHECK constraints, e.g. to disallow leading/trailing/double spaces, zero-length strings, etc.
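In T-SQL, those CHECK constraints might look something like this (a sketch with invented constraint names; the exact rules are a policy decision):
ALTER TABLE Role ADD CONSTRAINT CK_Role_Title_NotBlank
    CHECK (LEN(Title) > 0); -- rejects zero-length (and all-space) titles
ALTER TABLE Role ADD CONSTRAINT CK_Role_Title_NoStraySpaces
    CHECK (Title NOT LIKE ' %' -- no leading space
       AND Title NOT LIKE '% ' -- no trailing space
       AND Title NOT LIKE '%  %'); -- no double spaces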
On SQL Server, the data type tinyint only gives you 256 distinct values. No matter what you do outside of the id column, you're not going to end up with a very big table. It will surely perform quickly even with a dozen indexed columns.
You usually need at least one unique constraint besides the surrogate key, though. If you don't have one, you're liable to end up with data like this.
1 First title First description
2 First title First description
3 First title First description
...
17 Third title Third description
18 First title First description
Tables that permit data like that are usually wrong. Any table that uses foreign key references to this table won't be able to report correctly, say, the number of "First title" used.
I'd argue that allowing multiple, identical titles for roles in a user management system is a design error. I'd probably argue that "title" is a really bad name for that column, too.
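Concretely, that means keeping at least the unique constraint on Title; had it been omitted, it could be added after the fact (a sketch):
ALTER TABLE Role ADD CONSTRAINT UQ_Role_Title UNIQUE (Title);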

Performance - Int vs Char(3)

I have a table and am debating between two different ways to store information. It has a structure like so:
int id
int FK_id
varchar(50) info1
varchar(50) info2
varchar(50) info3
int forTable or char(3) forTable
The FK_id can be a foreign key to one of 6 tables so I need another field to determine which table it's for.
I see two solutions:
An integer that is a FK to a settings table which holds its actual value.
A char(3) field with an abbreviated version of the table name.
I am wondering whether one will be faster than the other, and whether there will be any major problems with using the char(3).
Note: I will be creating an indexed view on each of the 6 different values of this field. This table will contain ~30k rows and will need to be joined with much larger tables.
In this case, it probably doesn't matter, except for the collation overhead (A vs a vs ä vs à).
I'd use char(3) for, say, currency codes like CHF, GBP etc. But if my natural key were "Swiss Franc", "British Pound" etc., I'd take the numeric.
3 bytes + collation vs 4 bytes numeric? You'd need a zillion rows, or to be running a medium-sized country, before it mattered...
Have you considered using a tinyint? It takes only one byte to store its value, with a range of 0 to 255.
Is the reason you need a single table that you want to ensure that, when the six parent tables reference a given child row, it is guaranteed to be the same instance? This is the classic "multi-parent" problem. An example of where you might run into this is with addresses or phone numbers referenced by multiple person/contact tables.
I can think of a couple of options:
Choice 1: A link table for each parent table. This would be the Hoyle architecture. So, something like:
Create Table MyTable(
    id int not null Primary Key Clustered
    , info1 varchar(50) null
    , info2 varchar(50) null
    , info3 varchar(50) null
)
Create Table LinkTable1(
    MyTableId int not null
    , ParentTable1Id int not null
    , Constraint PK_LinkTable1 Primary Key Clustered( MyTableId, ParentTable1Id )
    , Constraint FK_LinkTable1_MyTable
        Foreign Key ( MyTableId )
        References MyTable ( Id )
    , Constraint FK_LinkTable1_ParentTable1
        Foreign Key ( ParentTable1Id )
        References ParentTable1 ( Id )
)
...
Create Table LinkTable2...LinkTable3
Choice 2. If you knew that you would never have more than say six tables and were willing to accept some denormalization and a fugly design, you could add six foreign keys to your main table. That avoids the problem of populating a bunch of link tables and ensures proper referential integrity. However, that design can quickly get out of hand if the number of parents grows.
If you are content with your existing design, then with respect to the field size, I would use the full table name. Frankly, the difference in performance between a char(3) and a varchar(50) or even varchar(128) will be negligible for the amount of data you are likely to put in the table. If you really thought you were going to have millions of rows, then I would strongly consider the option of linking tables.
If you wanted to stay with your design and wanted the maximum performance, then I would use a tinyint with a foreign key to a table that contained the list of the six tables with a tinyint primary key. That prevents the number from being "magic" and ensures that you narrow down the list of parent tables. Of course, it still does not prevent orphaned records. In this design, you have to use triggers to do that.
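That variant might look like this (a sketch with invented names):
-- lookup of the six parent tables; tinyint keeps the key narrow
CREATE TABLE TableType (
    TableTypeId tinyint NOT NULL PRIMARY KEY,
    TableName varchar(128) NOT NULL UNIQUE
);
CREATE TABLE MyTable (
    id int NOT NULL PRIMARY KEY,
    info1 varchar(50) NULL,
    info2 varchar(50) NULL,
    info3 varchar(50) NULL,
    FK_id int NOT NULL, -- row id in the table named by TableTypeId
    TableTypeId tinyint NOT NULL REFERENCES TableType (TableTypeId)
);
-- FK_id itself still cannot be a declared foreign key; as noted above,
-- triggers are needed to catch orphaned rows.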
Because your FK cannot be enforced (since it is a variant depending upon type) by database constraint, I would strongly consider re-evaluating your design to use link tables, where each link table includes two FK columns, one to the PK of the entity and one to the PK of one of the 6 tables.
While this might seem to be overkill, it makes a lot of things simpler and adding new link tables is no more complex than accommodating new FK-types. In addition, it is more easily expandable to the case where an entity needs more than a 1-1 relationship to a single table, or needs multiple 1-1 relationships to the 6 other entities.
In a varying-FK scenario, you can lose database consistency, you can join to the wrong entity by neglecting to filter on type code, etc.
I should add that another huge benefit of link tables is that you can link to tables whose keys have varying data types (ints, natural keys, etc.) without having to add surrogate keys, or store the key in a varchar, or use similar workarounds, which are prone to problems.
I think a small integer (tinyint) is called for here. An "abbreviated version" looks too much like a magic number.
I also think that, performance-wise, the integer should beat the char(3).
First off, a 50-character ID that is not globally unique sounds a little scary. Do the IDs have some meaning? If not, you can easily get a GUID in less space. Personally, I am a big fan of making things human-readable whenever possible. I would, and have, put the full name in graphs until I needed to do otherwise. My preference would be to have linking tables for each possible related table, though.
Unless you are talking about a really large scale, you are much better off decreasing the size of the IDs and spending a few more characters on the name of the table. For a really large scale, I would decrease the size of the IDs and use an integer.
Jacob