I'm designing a database and have come across a performance-related problem. Please note that we are still in the design phase, not implementation, so I can't test anything yet.
I have the following table structure
Events
----------------------
EventID INT PK
SourceID INT FK
TypeID INT FK
Date DATETIME
The table is expected to contain tens of millions of entries. SourceID and TypeID both reference tables with at most hundreds of entries.
What I want is to have the tuple (SourceID, TypeID, Date) unique across the table. The question is: can I somehow specify which of the three columns will be used as the first to determine uniqueness when I would insert a new item in the table?
Because if the index compared the Date first, then insertion would be much faster than if it used, for example, TypeID first, right? Or is this the wrong question altogether, and should I trust SQL Server to optimize this itself?
Any feedback is appreciated.
The underlying index created to support the unique constraint will have the same column order as defined by the constraint.
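For example, to make Date the leading column of the index, declare it first in the constraint (a sketch in T-SQL; the constraint name is illustrative):

ALTER TABLE Events
    ADD CONSTRAINT UQ_Events_Date_Source_Type
    UNIQUE (Date, SourceID, TypeID);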
I'm currently working on a simple dummy project to refresh my knowledge on SQL and to learn a few new things :)
I have a table Article with the columns:
aID, price
I have another table Storage:
sID, aID, count
The Storage table references aID as a foreign key, and the count column says how much of an article is stored.
Now I want to add a column value to my Storage table. This column should be calculated by Article.price * Storage.count.
After searching the web, I found that you can have calculated columns like this:
CREATE TABLE tbl
(
int1 INT,
int2 INT,
product BIGINT GENERATED ALWAYS AS (int1 * int2) STORED
);
But I haven't found an example of how to do this with columns from another table.
What do I have to do in order to use the price from the referenced aID in the calculation?
You cannot define a generated column based on values from other tables. Per the documentation:
The generation expression can refer to other columns in the table, but not other generated columns. Any functions and operators used must be immutable. References to other tables are not allowed.
You can achieve the expected behavior by creating triggers on both tables, but usually creating a view over the tables is a simpler and more efficient solution.
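For example, the view could look like this (a minimal sketch using the table and column names from the question; the view name is illustrative):

CREATE VIEW storage_with_value AS
SELECT s.sID,
       s.aID,
       s.count,
       s.count * a.price AS value   -- computed on the fly, always up to date
FROM Storage s
JOIN Article a ON a.aID = s.aID;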
I am currently working on a database where a log is required to track a bunch of different changes to data: price changes, project status changes, etc. To accomplish this, I've made different 'log' tables that will store the data that needs to be kept.
To give a solid example: in order to track the changing prices of parts which need to be ordered, I've created a table called Part_Price_Log. Its primary key is a composite made up of the date on which the part price is modified and a foreign key to the part's unique ID in the Parts table.
My logic here is that if you need to look up the current price for a part, you just need to find the most recent entry for that part ID. However, I am being told not to implement it this way, because using a date as part of a primary key is an easy way to get errors in your data.
So my question is this:
What are the pros/cons of using a Date column as part of a composite primary key? What are some better alternatives?
In general, I think the best primary keys are synthetic auto-incremented keys. These have certain advantages:
The key value records the insertion order.
The keys are fixed length (typically 4 bytes).
Single keys are much simpler for foreign key references.
In databases (such as SQL Server by default) that cluster the data based on the primary key, inserts go "at the end".
They are relatively easy to type and compare (my eyes just don't work well for comparing UUIDs).
The fourth of these is a really big concern in a database that has lots of inserts, as suggested by your data.
There is nothing a priori wrong with composite primary keys. They are sometimes useful. But that is not a direction I would go in.
I agree that it is better to use an identity column/uniqueidentifier as the primary key in this scenario. If you make (PartID, Date) a composite primary key, it is going to fail when two concurrent users try to update the price of the same part at the same time. So the better approach is to have an identity column as the primary key and keep appending the changes to the log table. If you hit performance barriers later on, you can partition the table by year to overcome them.
Pros and cons will vary depending on the performance requirements and how often you will query this table.
As a first example think about the following:
CREATE TABLE Part_Price_Log (
ModifiedDate DATE,
PartID INT,
PRIMARY KEY (ModifiedDate, PartID))
If ModifiedDate is first and this is a logging table with insert-only rows, then every new row will be placed at the end, which is good (it reduces fragmentation). This approach is also good when you want to filter directly by ModifiedDate, or by ModifiedDate + PartID, since ModifiedDate is the first column in the primary key. A con here is searching by PartID alone, as the clustered index behind the primary key won't be able to seek directly on PartID.
A second example would be the same but inverted primary key ordering:
CREATE TABLE Part_Price_Log (
ModifiedDate DATE,
PartID INT,
PRIMARY KEY (PartID, ModifiedDate))
This is good for queries by PartID, but not so much for queries directly by ModifiedDate. Also, having PartID first will make inserts displace data pages whenever the inserted PartID is lower than the max PartID (which increases fragmentation).
The last example would be using a surrogate primary key like an IDENTITY.
CREATE TABLE Part_Price_Log (
LogID BIGINT IDENTITY PRIMARY KEY,
ModifiedDate DATE,
PartID INT)
This will make all inserts go at the end and reduce fragmentation, but you will need additional indexes to query your data, such as:
CREATE NONCLUSTERED INDEX NCI_Part_Price_Log_Date_PartID ON Part_Price_Log (ModifiedDate, PartID)
CREATE NONCLUSTERED INDEX NCI_Part_Price_Log_PartID_Date ON Part_Price_Log (PartID, ModifiedDate)
The con of this last one is that insert operations will take longer (as the indexes also have to be updated) and the size of the table will increase due to the indexes.
Also keep in mind that if your data allows multiple updates of the same part on the same day, then the compound PRIMARY KEY would make the second update fail. Your choices here are to use a surrogate key, to use a DATETIME instead of a DATE (which gives you more margin for updates), or to use a CLUSTERED INDEX with no PRIMARY KEY or UNIQUE constraint.
I would suggest doing the following. You keep only one index (the table itself, since it is clustered), inserts always go at the end, you don't need to worry about a repeated ModifiedDate with the same PartID, and your queries by date will be fast.
CREATE TABLE Part_Price_Log (
LogID INT IDENTITY PRIMARY KEY NONCLUSTERED,
ModifiedDate DATE,
PartID INT)
CREATE CLUSTERED INDEX CI_Part_Price_Log_Date_PartID ON Part_Price_Log (ModifiedDate, PartID)
Without knowing your domain, it's really hard to advise. How do you identify a part in the real world? Let's assume you use EAN. This is your 'natural key'. Now, does a part get a new EAN each time the price changes? Probably not, in which case the real-world identifier for a part price is a composite of its EAN and the period of time during which that price was effective.
I think the comment about "an easy way to get errors in your data" refers to the fact that temporal databases are not only more complex by nature (they have an additional dimension: time), but that support for temporal functionality is lacking in most SQL DBMSs.
For example, does your SQL product of choice have an interval data type, or do you need to roll your own using a pair of start_date and end_date columns? Does it have the ability to enforce intra-table constraints, e.g. to prevent overlapping or non-contiguous intervals for the same part? Does it have temporal functions to query temporal data easily?
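To illustrate the roll-your-own option: in PostgreSQL, for instance, you could model the validity period as a range and let an exclusion constraint reject overlapping prices for the same part (a sketch; the table and column names are made up):

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed to mix = with && in an EXCLUDE constraint

CREATE TABLE part_price (
    part_ean   text NOT NULL,
    price      numeric NOT NULL,
    effective  daterange NOT NULL,
    EXCLUDE USING gist (part_ean WITH =, effective WITH &&)  -- no overlapping periods per part
);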
I have a basic reverse lookup table in which the ids are already sorted in ascending numerical order:
id INT NOT NULL,
value INT NOT NULL
The ids are not unique; each id has from 5 to 25,000 associated values. Each id is independent, i.e., no relationships between the ids.
The table is static: read-only, no inserts or updates ever. It has 100-200 million records, and the database itself will be around 7-12 GB. SQLite.
I will do frequent lookups in this table and want the fastest response time for each query. Lookups are one-direction only, unordered, and always of the form:
SELECT value WHERE id IN (x,y,z)
What advantages does the pre-sorted order give me in terms of database efficiency? What should I do differently than I would with typical unordered tables? How do I tell SQL that it's an ordered list?
What about indices: is it necessary or even helpful to create an index on id?
[Updated for the clustered comment, thanks to Gordon Linoff.] As far as I can tell, SQLite doesn't support clustered indices directly. The wiki says: "Are [clustered indices] supported? No, but if you use INTEGER PRIMARY KEY it acts as a clustered index." In my situation, the column id is not unique...
Assuming that space is not an issue, you should create an index on (id, value). This should be sufficient for your purposes.
However, if the table is static, then I would recommend that you create a clustered index when you create the table. The index would have the same keys, (id, value).
If the table happens to be sorted, the database does not know about this, so you'd still need an index.
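A minimal sketch, assuming the table is named lookup (the question doesn't name it): a covering index on (id, value) lets SQLite answer the query from the index alone, without touching the table.

CREATE INDEX idx_lookup_id_value ON lookup (id, value);

-- the lookup then becomes an index-only scan:
SELECT value FROM lookup WHERE id IN (1, 42, 99);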
It is a better idea to use a WITHOUT ROWID table (what other DBs call a clustered index):
CREATE TABLE MyLittleLookupTable (
id INTEGER,
value INTEGER,
PRIMARY KEY (id, value)
) WITHOUT ROWID;
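With WITHOUT ROWID, the rows themselves are stored in a B-tree keyed on (id, value), so lookups by id are answered directly from the table, with no separate index to maintain and no duplicated storage. Note that WITHOUT ROWID requires SQLite 3.8.2 or later.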
I am attempting to replace all records for a given day in a certain table. The table has a composite primary key comprised of 7 fields, one of which is a date.
I have deleted all records which have a date value of 2/8/2010. When I then try to insert records into the table for 2/8/2010, I get a primary key violation. The records I am attempting to insert are only for 2/8/2010.
Since date is a component of the PK, shouldn't there be no way to violate the constraint as long as the date I'm inserting is not already in the table?
Thanks in advance.
You could have duplicates in the data you are inserting.
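A quick way to check for that, assuming the new rows sit in a staging table before the insert (the table and column names here are placeholders, since the question doesn't list the seven fields):

SELECT Field1, Field2, Field3, Field4, Field5, Field6, DateField, COUNT(*)
FROM StagingTable
GROUP BY Field1, Field2, Field3, Field4, Field5, Field6, DateField
HAVING COUNT(*) > 1;  -- any rows returned are duplicates within the new data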
Also, it is a very, very, very poor practice to have a primary key that consists of 7 fields. The proper way to handle this is to have a surrogate identity key and a unique index on the seven fields. Joining to child tables on 7 fields is a guarantee of poor performance, and updating records when they have child records becomes a nightmare that can completely lock up your system. A primary key should be unique and it should NEVER change.
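A sketch of that arrangement in T-SQL (constraint, column, and index names are placeholders):

ALTER TABLE YourTable DROP CONSTRAINT PK_YourTable;   -- the existing 7-column primary key
ALTER TABLE YourTable ADD Id INT IDENTITY(1, 1) NOT NULL;
ALTER TABLE YourTable ADD CONSTRAINT PK_YourTable PRIMARY KEY (Id);
CREATE UNIQUE INDEX UX_YourTable_NaturalKey
    ON YourTable (Field1, Field2, Field3, Field4, Field5, Field6, DateField);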
Do all the rows have only a date component in that field (i.e. the time is always set to midnight: 00:00:00)? If not, you'll need to delete the rows >= 2/8/2010 and < 2/9/2010.
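For example (a sketch using unambiguous 'YYYYMMDD' literals and assuming the column is named Date):

DELETE FROM YOUR_TABLE
WHERE [Date] >= '20100208'
  AND [Date] < '20100209';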
Also, are you sure you're not accidentally trying to insert two records with the same date (and same values in the other 6 PK fields)?
Perhaps there's something going on here you're not aware of. When you insert a row and get a primary key violation, try doing a SELECT with the appropriate key values from the row which could not be inserted (after doing a ROLLBACK, of course) and see what you get. Or perhaps there's a trigger on the table into which you're inserting data that is inserting rows into another table which uses the same primary key but was not cleaned out.
You might try the following SELECT to see what turns up:
SELECT *
FROM YOUR_TABLE
WHERE [Date] > '20100207' AND
      [Date] < '20100209';
(I haven't used SQL Server in a few years, but the 'YYYYMMDD' string format should be interpreted unambiguously as a date; in any case, you get the idea.) See what you get.
Good luck.
I have a table and am debating between 2 different ways to store information. It has a structure like so
int id
int FK_id
varchar(50) info1
varchar(50) info2
varchar(50) info3
int forTable or char(3) forTable
The FK_id can be a foreign key to one of 6 tables, so I need another field to determine which table it's for.
I see two solutions:
An integer that is a FK to a settings table which has its actual value.
A char(3) field with an abbreviated version of the table name.
I am wondering if anyone knows whether one will be more beneficial speed-wise than the other, or if there will be any major problems with using the char(3).
Note: I will be creating an indexed view for each of the 6 different values of this field. This table will contain ~30k rows and will need to be joined with much larger tables.
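(For reference, one of those indexed views might look like the following in SQL Server; the view name and the 'TBA' code are illustrative.)

CREATE VIEW dbo.MyTable_TableA
WITH SCHEMABINDING
AS
SELECT id, FK_id, info1, info2, info3
FROM dbo.MyTable
WHERE forTable = 'TBA';   -- one view per value of the discriminator
GO
CREATE UNIQUE CLUSTERED INDEX IX_MyTable_TableA
    ON dbo.MyTable_TableA (id);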
In this case, it probably doesn't matter except for the collation overhead (A vs a vs ä vs à).
I'd use char(3), say, for currency codes like CHF, GBP, etc. But if my natural key was "Swiss Franc", "British Pound", etc., I'd take the numeric.
3 bytes + collation vs 4 bytes numeric? You'd need a zillion rows, or be running a medium-sized country, before it mattered...
Have you considered using a tinyint? It takes only one byte to store its value, and tinyint has a range of 0 to 255.
Is the reason you need a single table that you want to ensure that, when the six parent tables reference a given child row, they are all guaranteed to get the same instance? This is the classic "multi-parent" problem. An example of where you might run into this is with addresses or phone numbers shared across multiple person/contact tables.
I can think of a couple of options:
Choice 1: A link table for each parent table. This would be the Hoyle architecture. So, something like:
Create Table MyTable(
id int not null Primary Key Clustered
, info1 varchar(50) null
, info2 varchar(50) null
, info3 varchar(50) null
)
Create Table LinkTable1(
MyTableId int not null
, ParentTable1Id int not null
, Constraint PK_LinkTable1 Primary Key Clustered( MyTableId, ParentTable1Id )
, Constraint FK_LinkTable1_MyTable
Foreign Key ( MyTableId )
References MyTable ( Id )
, Constraint FK_LinkTable1_ParentTable1
Foreign Key ( ParentTable1Id )
References ParentTable1 ( Id )
)
...
Create Table LinkTable2...LinkTable3
Choice 2: If you knew that you would never have more than, say, six tables and were willing to accept some denormalization and a fugly design, you could add six foreign keys to your main table. That avoids the problem of populating a bunch of link tables and ensures proper referential integrity. However, that design can quickly get out of hand if the number of parents grows.
If you are content with your existing design, then with respect to the field size, I would use the full table name. Frankly, the difference in performance between a char(3) and a varchar(50) or even varchar(128) will be negligible for the amount of data you are likely to put in the table. If you really thought you were going to have millions of rows, then I would strongly consider the option of linking tables.
If you wanted to stay with your design and wanted the maximum performance, then I would use a tinyint with a foreign key to a table that contained the list of the six tables with a tinyint primary key. That prevents the number from being "magic" and ensures that you narrow down the list of parent tables. Of course, it still does not prevent orphaned records. In this design, you have to use triggers to do that.
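A sketch of that version (table, column, and constraint names are illustrative):

CREATE TABLE RefTables (
    RefTableID TINYINT NOT NULL PRIMARY KEY,
    TableName  VARCHAR(128) NOT NULL UNIQUE   -- the six parent table names live here
);

CREATE TABLE MyTable (
    id       INT NOT NULL PRIMARY KEY CLUSTERED,
    FK_id    INT NOT NULL,
    info1    VARCHAR(50) NULL,
    info2    VARCHAR(50) NULL,
    info3    VARCHAR(50) NULL,
    forTable TINYINT NOT NULL
        CONSTRAINT FK_MyTable_RefTables REFERENCES RefTables (RefTableID)
);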
Because your FK cannot be enforced (since it is a variant depending upon type) by database constraint, I would strongly consider re-evaluating your design to use link tables, where each link table includes two FK columns, one to the PK of the entity and one to the PK of one of the 6 tables.
While this might seem to be overkill, it makes a lot of things simpler and adding new link tables is no more complex than accommodating new FK-types. In addition, it is more easily expandable to the case where an entity needs more than a 1-1 relationship to a single table, or needs multiple 1-1 relationships to the 6 other entities.
In a varying-FK scenario, you can lose database consistency, you can join to the wrong entity by neglecting to filter on type code, etc.
I should add that another huge benefit of link tables is that you can link to tables which have keys of varying data types (ints, natural keys, etc.) without having to add surrogate keys or store the key in a varchar or similar workarounds, which are prone to problems.
I think a small integer (tinyint) is called for here. An "abbreviated version" looks too much like a magic number.
I also think that, performance-wise, the integer should beat the char(3).
First off, a 50-character ID that is not globally unique sounds a little scary. Do the IDs have some meaning? If not, you can easily get a GUID in less space. Personally, I am a big fan of making things human-readable whenever possible. I would, and have, put the full name in graphs until I needed to do otherwise. My preference would be to have linking tables for each possible related table, though.
Unless you are talking about really large scale, you are much better off decreasing the size of the IDs and taking a few more characters for the name of the table. For really large scale, I would decrease the size of the IDs and use an integer.
Jacob