I have a table that tracks statuses that a particular file goes through as it is checked over by our system. It looks like this:
FileID int
Status tinyint
TouchedBy varchar(50)
TouchedWhen datetime
There is currently NO primary key on this table; however, there is a clustered index on Status and TouchedWhen.
As the table has continued to grow and query performance has decreased, one thought I've had is to add a primary key so that I can get away from heap lookups -- a primary key on FileID, Status and TouchedWhen.
The problem I'm running into is that TouchedWhen, due to its rounding issues, has, on occasion, 2 entries with the exact same datetime.
So then I started researching what it takes to convert that column to datetime2(7) and alter the entries that are duplicates at that point. My table would then look like:
FileID int
Status tinyint
TouchedBy varchar(50)
TouchedWhen datetime2(7)
And a primary key on FileID, Status and TouchedWhen.
My question is this -- what is the best way to go through the existing rows and add a millisecond where there are duplicates? And how can I do this against a table that needs to remain online?
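For what it's worth, here is the rough shape of the de-duplication pass I've been considering once the column is datetime2(7). The table name is made up and this is untested:

;WITH dupes AS
(
    SELECT TouchedWhen,
           ROW_NUMBER() OVER (PARTITION BY FileID, Status, TouchedWhen
                              ORDER BY TouchedBy) AS rn
    FROM dbo.FileStatusLog  -- hypothetical table name
)
UPDATE dupes
SET TouchedWhen = DATEADD(MILLISECOND, rn - 1, TouchedWhen)
WHERE rn > 1;
-- A bumped value could itself collide with a later row, so this may
-- need to be re-run until no duplicates remain.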
In advance, thanks,
Brent
You shouldn't need to add a primary key to make queries faster; adding an index on FileID, Status, TouchedWhen will have just as much of a performance impact as adding a primary key. The main benefit of defining a primary key is record identity and referential integrity, which could be accomplished with an auto-increment primary key.
(I'm NOT saying you shouldn't have a primary key, I'm saying the performance impact of a primary key is in the index itself, not the fact that it's a primary key)
On the other hand, changing your clustered index to include FileID would likely have a bigger impact: lookups using those columns would not need to search the index and then look up the data, because the data pages would be right there with the index values.
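If you go that route, a rebuild along these lines should keep the table available while it runs. Here I assume the existing clustered index is named IX_FileStatusLog_Status_TouchedWhen on dbo.FileStatusLog (both names hypothetical); note that ONLINE = ON requires Enterprise Edition:

CREATE CLUSTERED INDEX IX_FileStatusLog_Status_TouchedWhen
    ON dbo.FileStatusLog (Status, TouchedWhen, FileID)
    WITH (DROP_EXISTING = ON, ONLINE = ON);  -- DROP_EXISTING reuses the existing index name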
Related
I have the following table in an SQLite database
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP,
`data` BLOB NOT NULL
) WITHOUT ROWID;
CREATE INDEX `time_index` ON `log`(`time`);
The index is created because the most frequent query is going to be
SELECT * FROM `log` WHERE `time` BETWEEN ? AND ?
Since the time is going to be always the current time when the new record is added, the index is not really required here. So I would like to "tell" the SQLite engine something like "The lines are going to be added with the 'time' column always having increasing value (similar to AUTO_INCREMENT), and if something goes wrong I will take all responsibility".
Is it possible at all?
You don't want a separate index. You want to declare the column to be the primary key:
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP PRIMARY KEY,
`data` BLOB NOT NULL
) WITHOUT ROWID;
This creates a single b-tree index for the log based on the primary key. In other databases, this structure would be called a "clustered index". You have probably already read the documentation but I'm referencing it anyway.
You would have an issue, or not, depending upon how you consider it: you cannot use :-
CREATE TABLE `log` (
`time` REAL NOT NULL DEFAULT CURRENT_TIMESTAMP,
`data` BLOB NOT NULL
) WITHOUT ROWID;
because :-
Every WITHOUT ROWID table must have a PRIMARY KEY. An error is raised
if a CREATE TABLE statement with the WITHOUT ROWID clause lacks a
PRIMARY KEY.
Clustered Indexes and the WITHOUT ROWID Optimization
So you might as well make the time column the PRIMARY KEY.
But the problem is that the precision of REAL is not enough to handle
microsecond resolution, and thus two adjacent records may have the
same time value which would violate the PRIMARY KEY constraint.
Then you could use a composite PRIMARY KEY, where the precision required is satisfied by multiple columns (a second column would likely more than suffice), perhaps along the lines of :-
CREATE TABLE log (
time_datepart INTEGER,
time_microsecondpart INTEGER,
data BLOB NOT NULL,
PRIMARY KEY (time_datepart,time_microsecondpart)
) WITHOUT ROWID;
The time_microsecondpart column needn't necessarily hold microseconds; it could be a counter derived from another table, similar to how the sqlite_sequence table is utilised when AUTOINCREMENT is used (minus the column that holds the name of the table a row is attached to).
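As an untested illustration of the counter idea (the helper-table name is invented), something along the lines of :-

-- Helper table holding the last counter value
CREATE TABLE log_counter (seq INTEGER NOT NULL);
INSERT INTO log_counter (seq) VALUES (0);

-- Bump the counter and insert within one transaction so they stay in step
BEGIN;
UPDATE log_counter SET seq = seq + 1;
INSERT INTO log (time_datepart, time_microsecondpart, data)
VALUES (strftime('%s','now'), (SELECT seq FROM log_counter), x'0102');
COMMIT;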
I am currently working on a database, where a log is required to track a bunch of different changes of data. Stuff like price changes, project status changes, etc. To accomplish this I've made different 'log' tables that will be storing the data needing to be kept.
To give a solid example: in order to track the changing prices for parts which need to be ordered, I've created a table called Part_Price_Log. The primary key is a composite made up of the date on which the part price is modified and a foreign key to the part's unique ID in the Parts table.
My logic here is that if you need to look up the current price for a part, you just need to find the most recent entry for that Part ID. However, I am being told not to implement it this way, because using a date as part of a primary key is an easy way to get errors in your data.
So my question is thus.
What are the pros/cons of using a Date column as part of a composite primary key? What are some better alternatives?
In general, I think the best primary keys are synthetic auto-incremented keys. These have certain advantages:
The key value records the insertion order.
The keys are fixed length (typically 4 bytes).
Single keys are much simpler for foreign key references.
In databases (such as SQL Server by default) that cluster the data based on the primary key, inserts go "at the end".
They are relatively easy to type and compare (my eyes just don't work well for comparing UUIDs).
The fourth of these is a really big concern in a database that has lots of inserts, as suggested by your data.
There is nothing a priori wrong with composite primary keys. They are sometimes useful. But that is not a direction I would go in.
I agree that it is better to keep an identity column/uniqueidentifier as the primary key in this scenario. Also, if you make PartID and date a composite primary key, it is going to fail when two concurrent users try to update the part price at the same time. So the better approach is to have an identity column as the primary key and keep dumping the changes into the log table. If you hit performance barriers later on, you can partition your table year-wise to overcome that challenge.
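A rough illustration of the year-wise partitioning (SQL Server syntax; the names, extra columns and boundary dates here are all invented):

CREATE PARTITION FUNCTION pf_PartPriceLogYear (DATE)
    AS RANGE RIGHT FOR VALUES ('2016-01-01', '2017-01-01', '2018-01-01');

CREATE PARTITION SCHEME ps_PartPriceLogYear
    AS PARTITION pf_PartPriceLogYear ALL TO ([PRIMARY]);

CREATE TABLE Part_Price_Log (
    LogID INT IDENTITY NOT NULL,
    PartID INT NOT NULL,
    Price MONEY NOT NULL,
    ModifiedDate DATE NOT NULL,
    -- the partitioning column must be part of the clustered key
    CONSTRAINT PK_Part_Price_Log PRIMARY KEY CLUSTERED (LogID, ModifiedDate)
) ON ps_PartPriceLogYear (ModifiedDate);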
Pros and cons will vary depending on the performance requirements and how often you will query this table.
As a first example think about the following:
CREATE TABLE Part_Price_Log (
ModifiedDate DATE,
PartID INT,
PRIMARY KEY (ModifiedDate, PartID))
If ModifiedDate is first and this is a logging table with insert-only rows, then every new row will be placed at the end, which is good (it reduces fragmentation). This approach is also good when you want to filter directly by ModifiedDate, or by ModifiedDate + PartID, as ModifiedDate is the first column in the primary key. A con here would be searching by PartID, as the clustered index of the primary key won't be able to seek directly on PartID.
A second example would be the same but inverted primary key ordering:
CREATE TABLE Part_Price_Log (
ModifiedDate DATE,
PartID INT,
PRIMARY KEY (PartID, ModifiedDate))
This is good for queries by PartID, but not so much for queries directly by ModifiedDate. Also, having PartID first means inserts displace data pages whenever the inserted PartID is lower than the max PartID (which increases fragmentation).
The last example would be using a surrogate primary key like an IDENTITY.
CREATE TABLE Part_Price_Log (
LogID BIGINT IDENTITY PRIMARY KEY,
ModifiedDate DATE,
PartID INT)
This will make all inserts go last and reduce fragmentation but you will need an additional index to query your data, such as:
CREATE NONCLUSTERED INDEX NCI_Part_Price_Log_Date_PartID ON Part_Price_Log (ModifiedDate, PartID)
CREATE NONCLUSTERED INDEX NCI_Part_Price_Log_PartID_Date ON Part_Price_Log (PartID, ModifiedDate)
The con about this last one is that insert operations will take longer (as the index also has to be updated) and the size of the table will increase due to indexes.
Also keep in mind that if your data allows multiple updates of the same part on the same day, then a compound PRIMARY KEY would make the 2nd update fail. Your choices here are to use a surrogate key, to use DATETIME instead of DATE (which gives you more margin between updates), or to use a CLUSTERED INDEX with no PRIMARY KEY or UNIQUE constraint.
I would suggest doing the following: you keep only one index (the actual table, as it is clustered), inserts always go at the end, you don't need to worry about repeating ModifiedDate with the same PartID, and your queries by date will be fast.
CREATE TABLE Part_Price_Log (
LogID INT IDENTITY PRIMARY KEY NONCLUSTERED,
ModifiedDate DATE,
PartID INT)
CREATE CLUSTERED INDEX CI_Part_Price_Log_Date_PartID ON Part_Price_Log (ModifiedDate, PartID)
Without knowing your domain, it's really hard to advise. How do you identify a part in the real world? Let's assume you use EAN. This is your 'natural key'. Now, does a part get a new EAN each time the price changes? Probably not, in which case the real-world identifier for a part price is a composite of its EAN and the period of time during which that price was effective.
I think the comment about "an easy way to get errors in your data" is referring to the fact that temporal databases are not only more complex by nature (they have an additional dimension, time), but support for temporal functionality is also lacking in most SQL DBMSs.
For example, does your SQL product of choice have an interval data type, or do you need to roll your own using a pair of start_date and end_date columns? Does your SQL product of choice have the capability to enforce intra-table constraints, e.g. to prevent overlapping or non-contiguous intervals for the same part? Does your SQL product have temporal functions to query temporal data easily?
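To make the roll-your-own option concrete, a minimal sketch (table and column names are my assumptions; note that a plain CHECK cannot prevent overlapping intervals for the same part, which would need a trigger or application logic):

CREATE TABLE Part_Price (
    PartID INT NOT NULL,
    EffectiveFrom DATE NOT NULL,
    EffectiveTo DATE NULL,  -- NULL means 'current price'
    Price DECIMAL(10,2) NOT NULL,
    CONSTRAINT PK_Part_Price PRIMARY KEY (PartID, EffectiveFrom),
    CONSTRAINT CK_Part_Price_Interval
        CHECK (EffectiveTo IS NULL OR EffectiveTo > EffectiveFrom)
);

-- Price in effect on a given date:
SELECT Price
FROM Part_Price
WHERE PartID = 42
  AND EffectiveFrom <= '2017-06-01'
  AND (EffectiveTo IS NULL OR EffectiveTo > '2017-06-01');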
I'm designing a schema where a case can have many forms attached and a form can be used for many cases. The Form table basically holds the structure of an HTML form which gets rendered on the client side. When the form is submitted, the name/value pairs for the fields are stored separately. Is there any value in keeping the name/value attributes separate from the join table, as follows?
CREATE TABLE [Case] (  -- Case is a reserved word, hence the brackets
ID int NOT NULL PRIMARY KEY,
...
);
CREATE TABLE CaseForm (
CaseID int NOT NULL FOREIGN KEY REFERENCES [Case] (ID),
FormID int NOT NULL FOREIGN KEY REFERENCES Form (ID),
CONSTRAINT PK_CaseForm PRIMARY KEY (CaseID, FormID)
);
CREATE TABLE CaseFormAttribute (
ID int NOT NULL PRIMARY KEY,
CaseID int NOT NULL,
FormID int NOT NULL,
Name varchar(255) NOT NULL,
Value varchar(max) NULL,
-- a foreign key must reference the whole composite key of CaseForm
CONSTRAINT FK_CaseFormAttribute_CaseForm
    FOREIGN KEY (CaseID, FormID) REFERENCES CaseForm (CaseID, FormID)
);
CREATE TABLE Form (
ID int NOT NULL PRIMARY KEY,
FieldsJson varchar (max) NOT NULL
);
Am I overcomplicating the schema, since the same many-to-many relationship can be achieved by turning the CaseFormAttribute table into the join table and getting rid of the CaseForm table altogether, as follows?
CREATE TABLE CaseFormAttribute (
ID int NOT NULL PRIMARY KEY,
CaseID int NOT NULL FOREIGN KEY REFERENCES [Case] (ID),
FormID int NOT NULL FOREIGN KEY REFERENCES Form (ID),
Name varchar(255) NOT NULL,
Value varchar(max) NULL
);
Basically what I'm trying to ask is which is the better design?
The main benefit of splitting up the two would depend on whether or not additional fields would ever be added to the CaseForm table. For instance, say that you want to record if a Form is incomplete. You may add an Incomplete bit field to that effect. Now, you have two main options for retrieving that information:
Clustered index scan on CaseForm
Create a nonclustered index on CaseForm.Incomplete which includes CaseID, FormID, and scan that
If you didn't split the tables, your two main options would be:
Clustered index scan on CaseFormAttribute
Create a nonclustered index on CaseFormAttribute.Incomplete which includes CaseID, FormID, and scan that
For the purposes of this example, query options 1 and 2 are roughly the same in terms of performance. Introducing the nonclustered index adds overhead in multiple ways. It's a little less streamlined than the clustered index (it may take more reads to scan in this particular example), it's additional storage space that CaseForm will take up, and the index has to be maintained for updates to the table. Option 4 will also perform similarly, with the same caveats as option 2. Option 3 will be your worst performer, as a clustered index scan will include reading all of the BLOB data in your Value field, even though it only needs the bit in Incomplete to determine whether or not to return that (Case, Form) pair.
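For reference, the option 2 index might look something like this (assuming the hypothetical Incomplete column has been added to CaseForm):

CREATE NONCLUSTERED INDEX IX_CaseForm_Incomplete
    ON CaseForm (Incomplete);
-- CaseID and FormID come along automatically, as they form the clustered key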
So it really does depend on what direction you're going in the future.
Also, if you stay with the split approach, consider shifting CaseFormAttribute.ID to CaseForm, and then use CaseForm.ID as your PK/FK in CaseFormAttribute. The caveat here is that we're assuming that all Forms will be inserted at the same time for a given Case. If that's not true, then you would invite some page splits because your inserts will be somewhat random, though still generally increasing.
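That restructuring might look roughly like this; the UNIQUE constraint and the (CaseFormID, Name) key are my assumptions about the intent:

CREATE TABLE CaseForm (
    ID int IDENTITY PRIMARY KEY,
    CaseID int NOT NULL FOREIGN KEY REFERENCES [Case] (ID),
    FormID int NOT NULL FOREIGN KEY REFERENCES Form (ID),
    CONSTRAINT UQ_CaseForm UNIQUE (CaseID, FormID)
);

CREATE TABLE CaseFormAttribute (
    CaseFormID int NOT NULL FOREIGN KEY REFERENCES CaseForm (ID),
    Name varchar(255) NOT NULL,
    Value varchar(max) NULL,
    CONSTRAINT PK_CaseFormAttribute PRIMARY KEY (CaseFormID, Name)
);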
I have a requirement to log application events in a SQL 2012 database. The basic record structure requirement is pretty simple:
CREATE TABLE [dbo].[EventLog]
(
[ProcessId] INT NOT NULL,
[ApplicationId] INT NOT NULL,
[Created] DateTime NOT NULL,
CONSTRAINT [PK_EventLog] PRIMARY KEY CLUSTERED ([ProcessId], [ApplicationId], [Created] ASC)
)
The problem I am having is one of performance. Up to 1 million events per day can be generated and as the number of rows increase, the insert performance is diminishing - to the point where the logger will not be able to keep up with the events.
I am already writing batches of logs out to intermediary plain text files and then processing these files using a service running separately from the main application logger.
I suspect that the culprit may be maintaining the index and I would like some advice on how I can approach this problem more efficiently/effectively.
Any advice would be much appreciated.
The main cause of the performance problem is probably the choice of columns forming the clustered index.
In a clustered index, the data is actually stored in the leaf-level pages of the index, in the order defined by the index key columns. Hence, in your table, the data is stored in the order ProcessID, ApplicationID, Created.
Without seeing your data, I would assume that log entries are being created as time passes for a variety of ProcessIDs and ApplicationIDs. If this is the case, for every insert, SQL will actually be inserting each log entry at the appropriate point in the middle of your log table. This is more time-consuming for SQL Server to do than inserting records at the end of the table. Also, when an inserted record cannot fit on the appropriate page, a page split will occur which will result in the clustered index being fragmented - which will decrease the performance further.
Ideally, you should aim to have a clustering key that is as small as possible while also being unique. Therefore, one approach would be to create a new ID column as an identity and create a clustered index on that. For example:
CREATE TABLE [dbo].[EventLog]
(
[EventLogId] INT IDENTITY(1,1),
[ProcessId] INT NOT NULL,
[ApplicationId] INT NOT NULL,
[Created] DateTime NOT NULL,
CONSTRAINT [PK_EventLog] PRIMARY KEY CLUSTERED ([EventLogId])
)
Are there differences between distinct tables and type columns in terms of performance or query optimization?
For example:
Create Table AllInOne(
[Key] Integer Identity Primary Key,  -- Key and Desc are reserved words, hence the brackets
[Desc] varchar(20) Not Null,
OneType Integer Not Null
)
where OneType only receives the integer values 1, 2 or 3.
Versus the following architecture:
Create Table One(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null
)
Create Table Two(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null
)
Create Table Three(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null
)
Another possible architecture:
Create Table Root(
[Key] Integer Identity Primary Key,
[Desc] varchar(20) Not Null
)
Create Table One(
[Key] Integer Primary Key references Root
)
Create Table Two(
[Key] Integer Primary Key references Root
)
Create Table Three(
[Key] Integer Primary Key references Root
)
In the third approach, all the common data lives in Root, and One, Two and Three are each related to it.
I asked my teacher some time ago and he couldn't say whether there is any difference.
Let's suppose I have to choose between these three approaches.
Assume that commonly used queries filter on the type, and that there are no child tables referencing these.
To make it easier to understand, let's think about a payroll system:
One = Incomings
Two = Discounts
Three = Base for calculation.
Having separate tables, as in (2), means that someone who needs to access data for a particular OneType can ignore data for the other types, thereby doing less I/O for a table scan. Also, indexes on the tables in (2) would be smaller and potentially of lesser height, meaning fewer I/Os for index accesses.
Given the low selectivity of OneType (each of its three values matches a large share of the rows), indexes would not help filtering in (1). However, table partitioning could be used to get all the benefits mentioned above.
There would also be an additional benefit. When querying (2), you need to know which OneType you need in order to know which table to query. In a partitioned version of (1), partition elimination for unneeded partitions happens through values supplied in a WHERE clause predicate, making the process much easier.
Other benefits include easier database management (when you add a column to a partitioned table, it gets added to all partitions) and easier scaling (adding partitions for new OneType values is easy). Also, as mentioned, the table can be targeted by foreign keys.
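A rough sketch of that partitioned version of (1), with invented names (SQL Server syntax):

CREATE PARTITION FUNCTION pf_OneType (int)
    AS RANGE RIGHT FOR VALUES (2, 3);  -- three partitions, holding 1, 2 and 3

CREATE PARTITION SCHEME ps_OneType
    AS PARTITION pf_OneType ALL TO ([PRIMARY]);

CREATE TABLE AllInOne(
    [Key] int IDENTITY NOT NULL,
    [Desc] varchar(20) NOT NULL,
    OneType int NOT NULL,
    -- the partitioning column must be part of the clustered key
    CONSTRAINT PK_AllInOne PRIMARY KEY CLUSTERED ([Key], OneType)
) ON ps_OneType (OneType);

-- Partition elimination: only the OneType = 2 partition is read
SELECT [Key], [Desc] FROM AllInOne WHERE OneType = 2;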