I am currently working on a database where a log is required to track a bunch of different changes to data: things like price changes, project status changes, etc. To accomplish this, I've made different 'log' tables that store the data that needs to be kept.
To give a solid example, in order to track the changing prices of parts that need to be ordered, I've created a table called Part_Price_Log. Its primary key is a composite made up of the date on which the part price is modified and a foreign key to the part's unique ID in the Parts table.
My logic here is that if you need to look up the current price for a part, you just find the most recent entry for that Part ID. However, I am being told not to implement it this way, because using a date as part of a primary key is an easy way to get errors in your data.
So my question is this:
What are the pros/cons of using a Date column as part of a composite primary key? What are some better alternatives?
In general, I think the best primary keys are synthetic auto-incremented keys. These have certain advantages:
The key value records the insertion order.
The keys are fixed length (typically 4 bytes).
Single keys are much simpler for foreign key references.
In databases (such as SQL Server by default) that cluster the data based on the primary key, inserts go "at the end".
They are relatively easy to type and compare (my eyes just don't work well for comparing UUIDs).
The fourth of these is a really big concern in a database that has lots of inserts, as suggested by your data.
There is nothing a priori wrong with composite primary keys. They are sometimes useful. But that is not a direction I would go in.
I agree that it is better to use an identity column/uniqueidentifier as the primary key in this scenario. Also, if you make PartID and date a composite primary key, it is going to fail when two concurrent users try to update the part price at the same time. So the better approach is to have an identity column as the primary key and keep appending the changes to the log table. If you hit performance barriers later on, you can partition the table year-wise to overcome them.
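For the partitioning idea, a minimal sketch of what year-wise partitioning could look like (SQL Server syntax; the boundary dates, the Price column and all object names are assumptions, not from the original post):
-- Partition function and scheme splitting rows by year
CREATE PARTITION FUNCTION pf_PartPriceLogByYear (DATE)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01', '2025-01-01');
CREATE PARTITION SCHEME ps_PartPriceLogByYear
AS PARTITION pf_PartPriceLogByYear ALL TO ([PRIMARY]);
-- The partitioning column has to be part of the clustered key, so it is added to the PK here
CREATE TABLE Part_Price_Log (
LogID BIGINT IDENTITY NOT NULL,
PartID INT NOT NULL,
Price DECIMAL(10, 2) NOT NULL,  -- assumed column
ModifiedDate DATE NOT NULL,
CONSTRAINT PK_Part_Price_Log PRIMARY KEY CLUSTERED (LogID, ModifiedDate)
) ON ps_PartPriceLogByYear (ModifiedDate);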
Pros and cons will vary depending on the performance requirements and how often you will query this table.
As a first example think about the following:
CREATE TABLE Part_Price_Log (
ModifiedDate DATE,
PartID INT,
PRIMARY KEY (ModifiedDate, PartID))
If ModifiedDate comes first and this is a logging table with insert-only rows, then every new row will be placed at the end, which is good (it reduces fragmentation). This approach is also good when you want to filter directly by ModifiedDate, or by ModifiedDate + PartID, as ModifiedDate is the first column in the primary key. A con here is searching by PartID alone, as the clustered index of the primary key won't be able to seek directly on PartID.
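For example (the date and ID values are placeholders):
-- Can seek on the clustered index, because ModifiedDate is the leading column:
SELECT * FROM Part_Price_Log WHERE ModifiedDate = '2019-01-15';
-- Has to scan, because PartID is not the leading column of the key:
SELECT * FROM Part_Price_Log WHERE PartID = 42;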
A second example would be the same but inverted primary key ordering:
CREATE TABLE Part_Price_Log (
ModifiedDate DATE,
PartID INT,
PRIMARY KEY (PartID, ModifiedDate))
This is good for queries by PartID, but not so much for queries directly by ModifiedDate. Also, having PartID first will make inserts displace data pages whenever the inserted PartID is lower than the max PartID (which increases fragmentation).
The last example would be using a surrogate primary key like an IDENTITY.
CREATE TABLE Part_Price_Log (
LogID BIGINT IDENTITY PRIMARY KEY,
ModifiedDate DATE,
PartID INT)
This will make all inserts go at the end and reduce fragmentation, but you will need additional indexes to query your data, such as:
CREATE NONCLUSTERED INDEX NCI_Part_Price_Log_Date_PartID ON Part_Price_Log (ModifiedDate, PartID)
CREATE NONCLUSTERED INDEX NCI_Part_Price_Log_PartID_Date ON Part_Price_Log (PartID, ModifiedDate)
The con of this last one is that insert operations will take longer (as the indexes also have to be updated) and the size of the table will increase due to the indexes.
Also keep in mind that if your data allows multiple updates of the same part on the same day, then using a compound PRIMARY KEY would make the second update fail. Your choices here are to use a surrogate key, use DATETIME instead of DATE (which gives you more margin for updates), or use a CLUSTERED INDEX with no PRIMARY KEY or UNIQUE constraint.
I would suggest the following. You keep only one index (the actual table, as it is clustered), inserts always go at the end, you don't need to worry about a repeated ModifiedDate for the same PartID, and your queries by date will be fast.
CREATE TABLE Part_Price_Log (
LogID INT IDENTITY PRIMARY KEY NONCLUSTERED,
ModifiedDate DATE,
PartID INT)
CREATE CLUSTERED INDEX CI_Part_Price_Log_Date_PartID ON Part_Price_Log (ModifiedDate, PartID)
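For instance, a date-range query like the following (placeholder dates) is answered straight from the clustered index:
SELECT PartID, ModifiedDate
FROM Part_Price_Log
WHERE ModifiedDate >= '2019-01-01' AND ModifiedDate < '2019-02-01';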
Without knowing your domain, it's really hard to advise. How do you identify a part in the real world? Let's assume you use EAN. That is your 'natural key'. Now, does a part get a new EAN each time the price changes? Probably not, in which case the real-world identifier for a part price is a composite of its EAN and the period of time during which that price was effective.
I think the comment about "an easy way to get errors in your data" is referring to the fact that temporal databases are not only more complex by nature (they have an additional dimension: time), but support for temporal functionality is also lacking in most SQL DBMSs.
For example, does your SQL product of choice have an interval data type, or do you need to roll your own using a pair of start_date and end_date columns? Does it have the capability to express intra-table constraints, e.g. to prevent overlapping or non-contiguous intervals for the same part? Does it have temporal functions to query temporal data easily?
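To illustrate what such support can look like where it exists, here is a hedged PostgreSQL sketch (the table and column names are mine, and it needs the btree_gist extension) of a constraint that prevents overlapping price periods for the same part:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE part_price (
part_id int NOT NULL,
price numeric NOT NULL,
valid_from date NOT NULL,
valid_until date NOT NULL,  -- exclusive upper bound
EXCLUDE USING gist (part_id WITH =, daterange(valid_from, valid_until) WITH &&)
);
Where nothing like this is available, the same rule has to be enforced with triggers or in application code.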
Related
Is there any benefit to using a table schema like this:
CREATE TABLE review (
review_id SERIAL PRIMARY KEY,
account_id INT REFERENCES account(account_id) NOT NULL,
product_id INT REFERENCES product(product_id) NOT NULL,
rating SMALLINT NOT NULL,
comment TEXT,
UNIQUE (account_id, product_id)
);
Or should the constraint itself be the primary key, like this:
CREATE TABLE review (
account_id INT REFERENCES account(account_id) NOT NULL,
product_id INT REFERENCES product(product_id) NOT NULL,
rating SMALLINT NOT NULL,
comment TEXT,
CONSTRAINT review_pkey PRIMARY KEY (account_id, product_id)
);
The second version is clearly preferable, because it requires one less column and one less index, and there is no downside.
The column is obvious; the indexes aren't, because you forgot to add them:
You need indexes on all the foreign key columns so that deletes on the referenced tables can be fast. With the artificial primary key, you need indexes on review_id, account_id and product_id, while without it you can make do with indexes on (account_id, product_id) and product_id.
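Concretely, the only index you have to add by hand in either design is one led by product_id (the index name here is made up); account_id is already covered as the leading column of the (account_id, product_id) index:
CREATE INDEX review_product_id_idx ON review (product_id);
-- In the surrogate-key design this sits alongside the review_pkey index on review_id
-- and the UNIQUE (account_id, product_id) index; in the natural-key design it sits
-- alongside the (account_id, product_id) primary key index only.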
The only people who will advocate the first solution are people who hold a religious belief that every table has to have an artificially generated numerical primary key, no matter what. In reality, the combination of the two artificially generated keys from the referenced tables is just as good.
Besides religion, habits, personal preferences and convenience with certain client tools, there are other good reasons for an additional surrogate PK as demonstrated in your first example.
If you are going to reference that table with foreign keys from other tables:
Referencing table(s) only need to include the single surrogate PK column as the FK reference, which is smaller, faster and simpler. If the referencing table(s) have many rows and review does not, a single instance of this may already outweigh the additional cost to review. Otherwise, multiple instances may.
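A hedged sketch of that difference, assuming a hypothetical review_comment table that references review (the two definitions are alternatives, not meant to coexist):
-- Surrogate-key design: the FK is a single small column
CREATE TABLE review_comment (
comment_id serial PRIMARY KEY,
review_id int NOT NULL REFERENCES review (review_id),
body text
);
-- Natural-key design: the FK must carry both columns
CREATE TABLE review_comment (
comment_id serial PRIMARY KEY,
account_id int NOT NULL,
product_id int NOT NULL,
body text,
FOREIGN KEY (account_id, product_id) REFERENCES review (account_id, product_id)
);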
For small lookup tables that are referenced in many rows, even consider a smallserial surrogate PK - if that actually helps. See:
Calculating and saving space in PostgreSQL
Typically, there will be an index on the FK columns of referencing tables, too. Your example with two integer columns is the most favorable case for the multicolumn PK / FK, as it keeps index size to a minimum. A B-tree index on two integer columns is no bigger than one on a single integer column (8 bytes is typically the minimum "payload" for index tuples). Other, bigger data types would make more of a difference.
If review receives many updates to one of the columns (account_id, product_id), those will cascade to all referencing tables based on those two columns. Multiplies write costs, bloats multiple tables and indexes. If it cascades to wide rows or many referencing rows, costs may increase substantially. All of this may be avoided with a surrogate PK - if the relational design is actually supposed to work that way.
If review is involved in many queries with joins, joining on two columns instead of just one is more tedious to write and slightly more expensive. Again, more so for bigger data types.
That said, if you have none of the above (or similar), look to Laurenz' answer.
Weigh actual costs, not religious beliefs.
I want to design a primary key for my table with row versioning. My table contains two main fields, ID and Timestamp, plus a bunch of other fields. For a unique ID, I want to store previous versions of a record, hence I am creating the primary key for the table as a combination of the ID and Timestamp fields.
Hence, to see all the versions of a particular ID, I can run:
Select * from table_name where ID=<ID_value>
To return the most recent version of an ID, I can use
Select * from table_name where ID=<ID_value> ORDER BY timestamp desc
and get the first element.
My question here is: will this query be efficient and run in O(1), rather than scanning the entire table to find all entries matching the same ID, given that the ID field is part of the primary key? Ideally, to get a result in O(1), I should have provided the entire primary key. If it does need to do a full table scan, then how else can I design my primary key so that this request is done in O(1)?
Thanks,
Sriram
The canonical reference on this subject is Effective Timestamping in Databases:
https://www.cs.arizona.edu/~rts/pubs/VLDBJ99.pdf
I usually design with a subset of this paper's recommendations, using a table containing a primary key only, with another referencing table that has that key as well as change_user, valid_from and valid_until columns with appropriate defaults. This makes referential integrity easy, as well as future-value insertion and history retention. Index as appropriate, and consider check constraints or triggers to prevent overlaps and gaps if you expose these fields to the application for direct modification. These have an obvious performance overhead.
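A rough sketch of that layout (PostgreSQL-flavoured; the item/item_version names, types and defaults are assumptions, not from the answer):
create table item (
id int primary key
);
create table item_version (
id int not null references item (id),
change_user text not null default current_user,
valid_from timestamp not null default now(),
valid_until timestamp not null default 'infinity',
-- ...the versioned attributes themselves go here...
primary key (id, valid_from)
);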
We then make a "current values view" which is exposed to developers, and is also insertable via an "instead of" trigger.
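Continuing that sketch, the current-values view might look like the following (the "instead of" trigger for inserts is left out):
create view item_current as
select *
from item_version
where now() >= valid_from
and now() < valid_until;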
It's far easier and better to use the History Table pattern for this.
create table foo (
foo_id int primary key,
name text
);
create table foo_history (
foo_id int,
version int,
name text,
operation char(1) check ( operation in ('u','d') ),
modified_at timestamp,
modified_by text,
primary key (foo_id, version)
);
Create a trigger to copy a foo row to foo_history on update or delete.
https://wiki.postgresql.org/wiki/Audit_trigger_91plus for a full example with postgres
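A minimal sketch of such a trigger for the tables above (PostgreSQL 11+ syntax; older versions use EXECUTE PROCEDURE instead of EXECUTE FUNCTION):
create or replace function foo_history_trigger() returns trigger as $$
begin
-- append the pre-change row with the next version number for this foo_id
insert into foo_history (foo_id, version, name, operation, modified_at, modified_by)
select old.foo_id,
coalesce(max(version), 0) + 1,
old.name,
case tg_op when 'UPDATE' then 'u' else 'd' end,
now(),
current_user
from foo_history
where foo_id = old.foo_id;
return null;  -- return value is ignored for AFTER triggers
end;
$$ language plpgsql;
create trigger foo_history_trg
after update or delete on foo
for each row execute function foo_history_trigger();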
I have a table that tracks statuses that a particular file goes through as it is checked over by our system. It looks like this:
FileID int
Status tinyint
TouchedBy varchar(50)
TouchedWhen datetime
There is currently NO primary key on this table; however, there is a clustered index on Status and TouchedWhen.
As the table has continued to grow and query performance against it has decreased, one thought I've had is to add a primary key so that I get away from heap lookups -- a primary key on FileID, Status and TouchedWhen.
The problem I'm running into is that TouchedWhen, due to its rounding issues, occasionally has two entries with the exact same datetime.
So then I started researching what it takes to convert that to a datetime2(7) and alter those that are duplicate at that time. My table would then look like:
FileID int
Status tinyint
TouchedBy varchar(50)
TouchedWhen datetime2(7)
And a primary key on FileID, Status and TouchedWhen.
My question is this -- what is the best way to go through and add a millisecond to the existing rows where there are duplicates? How can I do this to a table that needs to remain online?
In advance, thanks,
Brent
You shouldn't need to add a primary key to make queries faster - just adding an index on FileID, Status, TouchedWhen will have just as much of a performance impact as adding a primary key. The main benefit of defining a primary key is record identity and referential integrity, which could be accomplished with an auto-increment primary key.
(I'm NOT saying you shouldn't have a primary key; I'm saying the performance impact comes from the index itself, not from the fact that it's a primary key.)
On the other hand, changing your clustered index to include FileID would likely have a bigger impact as lookups using those columns would not need to search the index then look up the data - the data pages would be right there with the index values.
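If you do re-cluster, a hedged sketch of that change (SQL Server; dbo.FileStatus is a placeholder table name, the index name must match your existing clustered index when using DROP_EXISTING, and ONLINE = ON needs Enterprise edition):
CREATE CLUSTERED INDEX IX_FileStatus_Status_TouchedWhen
ON dbo.FileStatus (FileID, Status, TouchedWhen)
WITH (DROP_EXISTING = ON, ONLINE = ON);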
I'm designing a database and have come across a performance-related problem. Please note that we are still in the design phase, not implementation, so I can't test anything yet.
I have the following table structure
Events
----------------------
EventID INT PK
SourceID INT FK
TypeID INT FK
Date DATETIME
The table is expected to contain tens of millions of entries. SourceID and TypeID both reference tables with at most hundreds of entries.
What I want is to have the tuple (SourceID, TypeID, Date) unique across the table. The question is: can I somehow specify which of the three columns will be used first to determine uniqueness when I insert a new item into the table?
Because if the index compared Date first, then the insertion would be much faster than if it used, for example, TypeID first, right? Or is this the wrong question altogether, and should I trust SQL Server to optimize this itself?
Any feedback is appreciated.
The underlying index created to support the unique constraint will have the same column order as defined by the constraint.
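So, if you want Date to be the leading column of that index, declare the constraint in that column order (the constraint name here is made up):
ALTER TABLE Events
ADD CONSTRAINT UQ_Events_Date_Source_Type UNIQUE (Date, SourceID, TypeID);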
I've read similar questions both on Google and Stack Overflow, for example this thread: Should each and every table have a primary key? and I understand it's generally a good idea to have a primary key in every table.
I'm now trying to create a simple table that stores end-of-day prices for a list of stocks, so it has three columns: stock ticker, date and price. None of these three columns is unique on its own, and to use the table I'll need to join on both date and stock ticker (I have a unique constraint on that pair). Of course, I can add another surrogate id column just for the sake of having a primary key, but I just want to check whether this is an acceptable design or whether there are better ways to model the data I'm storing.
Many thanks.