SQL Indexing Strategy on Link Tables

I often find myself creating 'link tables'. For example, the following table maps a user record to an event record.
CREATE TABLE [dbo].[EventLog](
[EventId] [int] NOT NULL,
[UserId] [int] NOT NULL,
[Time] [datetime] NOT NULL,
[Timestamp] [timestamp] NOT NULL
)
For the purposes of this question, please assume the combination of EventId plus UserId is unique and that the database in question is a MS SQL Server 2008 installation.
The problem I have is that I am never sure as to how these tables should be indexed. For example, I might want to list all users for a particular event, or I might want to list all events for a particular user or, perhaps, retrieve a particular EventId/UserId record. Indexing options I have considered include:
Creating a compound primary key on EventId and UserId (but I understand the index won't be useful when accessing by UserId on its own).
Creating a compound primary key on EventId and UserId and adding a supplemental index on UserId.
Creating a primary key on EventId and a supplemental index on UserId.
Any advice would be appreciated.

Indexes are designed to solve performance problems. If you don't yet have such a problem and can't say exactly where you will run into trouble, then you shouldn't create indexes. Indexes are quite expensive: they not only take up disk space but also add overhead to every insert and update. So you should clearly understand which specific performance problem you are solving by creating an index, so you can judge whether it is really needed.

The answer to your question depends on several aspects.
It depends on the DBMS you are going to use. Some prefer single-column indexes (like PostgreSQL), some can take more advantage of multi-column indexes (like Oracle). Some can answer a query completely from a covering index (like SQLite), others cannot and eventually have to read the pages of the actual table (again, like PostgreSQL).
It depends on the queries you want to answer. For example, do you navigate in both directions, i.e., do you join on both of your Id columns?
It depends on your space and processing-time requirements for data modification, too. Keep in mind that indexes are often bigger than the actual table they index, and that updating indexes is often more expensive than just updating the underlying table.
EDIT:
When your conceptual model has a many-to-many relationship R between two entities E1 and E2, i.e., the logical semantics of R is simply "related" or "not related", then I would always declare that combined primary key for R. That will create a unique index. The primary motivation, however, is data consistency, not query optimization:
CREATE TABLE [dbo].[EventLog](
[EventId] [int] NOT NULL,
[UserId] [int] NOT NULL,
[Time] [datetime] NOT NULL,
[Timestamp] [timestamp] NOT NULL,
PRIMARY KEY([EventId],[UserId])
)
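If you also need fast lookups by UserId alone (the second option in the question), a supplemental nonclustered index can be added on top of that primary key; a minimal sketch, with the index name chosen only for illustration:
CREATE NONCLUSTERED INDEX IX_EventLog_UserId
ON [dbo].[EventLog] ([UserId]);
Because the clustered primary key is (EventId, UserId), those key columns travel along in the nonclustered index, so "all events for a particular user" can be answered from this index alone.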

Related

How to lock a specific row that doesn't exist yet in SQL Server

I have an API rate limit table that I'm managing for one of our applications. Here's the definition of it.
CREATE TABLE [dbo].[RateLimit]
(
[UserId] [int] NOT NULL,
[EndPointId] [smallint] NOT NULL,
[AllowedRequests] [smallint] NOT NULL,
[ResetDateUtc] [datetime2](0) NOT NULL,
CONSTRAINT [PK_RateLimit]
PRIMARY KEY CLUSTERED ([UserId] ASC, [EndPointId] ASC)
) ON [PRIMARY]
The process that performs CRUD operations on this table is multi-threaded, and therefore careful consideration needs to be given to this table, which acts as the go-to for rate-limit checks (i.e. have we surpassed our rate limit, can we make another request, etc.).
I'm trying to introduce SQL locks to enable the application to reliably INSERT, UPDATE, and SELECT values without having the value changed from under it. Besides the normal complexity of this, the big pain point is that the RateLimit record for the UserId+EndPointId may not exist - and would need to be created.
I've been investigating SQL locks, but the thing is that there might be no row to lock if the rate limit record doesn't exist yet (i.e. first run).
I've thought about creating a temp table used specifically for controlling the lock flow - but I'm unsure how this would work.
At the farthest extreme, I could wrap the SQL statement in a SERIALIZABLE transaction (or something to that degree), but locking the entire table would have drastic performance impacts - I only care about the userid+endpointid primary key, and making sure that the specific row is read + updated/inserted by one process at a time.
How can I handle this situation?
Version: SQL Server 2016
Notes: READ_COMMITTED_SNAPSHOT is enabled
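One pattern worth sketching here (the variable names and defaults below are hypothetical, not from the question) is to take UPDLOCK and HOLDLOCK on the key inside the transaction, which holds a range lock for that UserId/EndPointId even when the row does not exist yet, so concurrent callers for the same key serialize without locking the whole table:
DECLARE @UserId int = 42, @EndPointId smallint = 7;   -- hypothetical inputs
DECLARE @DefaultAllowedRequests smallint = 100;       -- hypothetical default quota

BEGIN TRANSACTION;

-- UPDLOCK + HOLDLOCK locks the key range for this UserId/EndPointId,
-- so a second caller for the same key waits here until COMMIT.
IF EXISTS (SELECT 1 FROM dbo.RateLimit WITH (UPDLOCK, HOLDLOCK)
           WHERE UserId = @UserId AND EndPointId = @EndPointId)
    UPDATE dbo.RateLimit
    SET AllowedRequests = AllowedRequests - 1
    WHERE UserId = @UserId AND EndPointId = @EndPointId;
ELSE
    INSERT INTO dbo.RateLimit (UserId, EndPointId, AllowedRequests, ResetDateUtc)
    VALUES (@UserId, @EndPointId, @DefaultAllowedRequests,
            DATEADD(HOUR, 1, SYSUTCDATETIME()));      -- hypothetical reset window

COMMIT TRANSACTION;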

Having an Identity column in a table with primary key with varchar type

I was told to create an autID identity column in the table with GUID varchar(40) as the primary key and use the autID column as a reference key to help in the join process. But is that a good approach?
I think this causes a lot of problems. Here is what the schema looks like:
CREATE TABLE OauthClientInfo
(
autAppID INT IDENTITY(1,1) UNIQUE, -- must be unique (or the PK) so the FK below can reference it
strClientID VARCHAR(40), -- GUID
strClientSecret VARCHAR(40)
)
CREATE TABLE OAuth_AuthToken
(
autID INT IDENTITY(1,1),
strAuthToken VARCHAR(40),
autAppID_fk INT FOREIGN KEY REFERENCES OauthClientInfo(autAppID)
)
I was told that having autAppID_fk helps with joins compared to having a varchar(40) strClientID_fk, but my point is that we are unnecessarily adding a new ID as the reference key, which sometimes forces extra joins.
For example, to find out which strClientID a given strAuthToken belongs to: if strClientID_fk were the reference key, the OAuth_AuthToken table data would make sense on its own. Please share your views on this.
I was told to create an autID identity column in the table with GUID varchar(40) as the primary key and use the autID column as a reference key to help in the join process. But is that a good approach?
You were told this by someone who confuses clustering and primary keys. They are not one and the same, despite the confusing implementation of the database engine that "helps" the lazy developer.
You might get arguments about adding an identity column to every table and designating it as the primary key. I'll disagree with all of this. One does not BLINDLY do anything of this type in a schema. You do the proper analysis, identify (and enforce) any natural keys, and then you decide whether a synthetic key is both useful and needed. And then you determine which columns to use for the clustered index (because you only get one). And then you verify the appropriateness of your decisions based on how efficient and effective your schema is under load, by testing. There are no absolute rules about how to implement your schema.
Many of your indexing (and again note - indexing and primary key are completely separate things) choices will be affected by how your tables are updated over time. Do you have hotspots that need to be minimized? Does your table experience lots of random inserts, updates, and deletes over time? Or maybe just lots of inserts but relatively few updates or deletes? These are just some of the factors that guide your decision.
You need to use the UNIQUEIDENTIFIER data type for GUID columns, not VARCHAR.
As far as I have read, an auto-increment INT is the most suitable column for a clustered index.
And strClientID is the worst candidate for a PK or clustered index.
Most importantly, you haven't mentioned the purpose of strClientID. What kind of data does it hold, and how does it get populated?
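A minimal sketch of that first point, using a hypothetical _v2 table name so it doesn't collide with the definition above; the GUID is stored natively in 16 bytes rather than as a 40-character string:
CREATE TABLE OauthClientInfo_v2
(
autAppID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
strClientID UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() UNIQUE, -- 16 bytes instead of VARCHAR(40)
strClientSecret VARCHAR(40)
)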

How to make bulk insert work with multiple tables

How can I use SQL Server bulk insert to insert into multiple tables when there is a foreign key relationship?
What I mean is that the tables look like this:
CREATE TABLE [dbo].[UndergroundFacilityShape]
([FacilityID] [int] IDENTITY(1,1) NOT NULL,
[FacilityTypeID] [int] NOT NULL,
[FacilitySpatialData] [geometry] NOT NULL)
CREATE TABLE [dbo].[UndergroundFacilityDetail]
([FacilityDetailID] [int] IDENTITY(1,1) NOT NULL,
[FacilityID] [int] NOT NULL,
[Name] [nvarchar](50) NOT NULL,
[Value] [nvarchar](255) NOT NULL)
So each UndergroundFacilityShape can have multiple UndergroundFacilityDetail rows. The problem is that the FacilityID is not defined until the insert is done because it is an identity column. If I bulk insert the data into the Shape table then I cannot match it back up to the Detail data I have in my C# application.
I am guessing the solution is to run a SQL statement to find out what the next identity value is, populate the values myself, and turn off the identity column for the bulk insert? Bear in mind that only one person is going to be running this application to insert data, and it will be done infrequently, so we don't have to worry about identity values clashing or anything like that.
I am trying to import thousands of records, which takes about 3 minutes using standard inserts, but bulk insert will take a matter of seconds.
In the future I am expecting to import data that is much bigger than 'thousands' of records.
Turns out that this is quite simple. Get the current identity values for each of the tables, and populate them into the DataTable myself, incrementing them as I use them. I also have to make sure that the correct values are used to maintain the relationship. That's it. It doesn't seem to matter whether I turn off the identity columns or not.
I have been using this tool on live data for a while now and it works fine.
It was well worth it, though: the import now takes no longer than 3 seconds (rather than 3 minutes), and I expect to receive larger datasets at some point.
So what happens if more than one person uses the tool at the same time? Well, yes, I would expect issues, but this is never going to be the case for us.
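A minimal sketch of the "get the current identity values" step, assuming the two tables above:
SELECT IDENT_CURRENT('dbo.UndergroundFacilityShape') AS CurrentShapeIdentity,
       IDENT_CURRENT('dbo.UndergroundFacilityDetail') AS CurrentDetailIdentity;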
Peter, you mentioned that you already have a solution with straight INSERTs.
If the destination table does not have a clustered index (or has a clustered index and is empty), just using the TABLOCK query hint will make it a minimally-logged transaction, resulting on a considerable speed up.
If the destination table has a clustered index and is not empty, you can also enable trace flag 610 in addition to the TABLOCK query hint to make it a minimally-logged transaction.
Check the "Using INSERT INTO…SELECT to Bulk Import Data with Minimal Logging" section on the INSERT MSDN page.

performance improvements to database table

I have the following sql server 2008 database table:
CREATE TABLE [dbo].[cache](
[cache_key] [nvarchar](50) NOT NULL,
[cache_data] [nvarchar](max) NOT NULL,
[expiry_date] [datetime] NOT NULL) ON [PRIMARY]
I want to add a primary key to it, i.e. make the cache_key column the primary key. This column contains unique strings. My question is, are there any implications to making an nvarchar(50) column a primary key? Is it possible to add a primary key to this column when it already contains data, given that the cache_key values are unique?
I also have another script that runs each day that removes data from the table based on the expiry_date column. This could mean up to 5000 records deleted based upon comparison to this field. Would it help performance if I created an index on this field?
You can make a primary key out of anything that's indexable and unique. A varchar(50) is no problem. You can define a primary key after the fact, as long as every record has a unique value in that column. You won't be allowed to "primary-ize" a column (or columns) that isn't unique.
As for the index, if it only ever gets referenced in a single delete query that runs once a day, then don't bother indexing it. The overhead of maintaining the index through every single insert/update on the table won't be worth the microscopic time savings you'd get on the once-a-day delete. On the other hand, if that field is used frequently in where/join clauses in other queries, then go ahead and put on a index - you'll definitely improve performance.
Basically, technically, you can make any column whose maximum size is less than 900 bytes your primary key; e.g. you cannot make an NVARCHAR(2000) your primary key, but an nvarchar(50) works.
The requirements for the primary key are:
must be unique
must not be NULL
If those requirements are met - you're good to go.
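A minimal sketch of adding that key to the existing table (the constraint name is an assumption):
ALTER TABLE [dbo].[cache]
ADD CONSTRAINT [PK_cache] PRIMARY KEY CLUSTERED ([cache_key]);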
One thing to keep in mind is this: your primary key is - by default - also your clustering key, the key by which the table's contents are physically ordered (slightly simplified). As such, the clustering key is like the address or pointer of your data row in the table, and it will be included in each and every non-clustered index you have on your table, too.
If your table has no non-clustered indexes, or just a single one - no worries. But if your table has quite a few non-clustered indexes (like a Customer table, which might have four, five indexes or even more), then having such a wide (up to 100 bytes), variable-width clustering key is not ideal. In this case, you're better off using something like an INT IDENTITY as your surrogate key, and putting your primary key / clustered index on that column. It will save you a lot of disk space and make your table perform much better.
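A sketch of that surrogate-key variant, with illustrative table and constraint names:
CREATE TABLE [dbo].[cache_v2](
[cache_id] [int] IDENTITY(1,1) NOT NULL,
[cache_key] [nvarchar](50) NOT NULL,
[cache_data] [nvarchar](max) NOT NULL,
[expiry_date] [datetime] NOT NULL,
CONSTRAINT [PK_cache_v2] PRIMARY KEY CLUSTERED ([cache_id]),
CONSTRAINT [UQ_cache_v2_key] UNIQUE ([cache_key])
) ON [PRIMARY]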
Read more about what makes a good clustering key (on a busy, large table) in Kimberly Tripp's blog post Ever-increasing clustering key - the Clustered Index Debate..........again! - highly educational!

Database Design and the use of non-numeric Primary Keys

I'm currently in the process of designing the database tables for a customer & website management application. My question is in regards to the use of primary keys as functional parts of a table (and not assigning "ID" numbers to every table just because).
For example, here are four related tables from the database so far, one of which uses the traditional primary key number, the others which use unique names as the primary key:
--
-- website
--
CREATE TABLE IF NOT EXISTS `website` (
`name` varchar(126) NOT NULL,
`client_id` int(11) NOT NULL,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`notes` text NOT NULL,
`website_status` varchar(26) NOT NULL,
PRIMARY KEY (`name`),
KEY `client_id` (`client_id`),
KEY `website_status` (`website_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
--
-- website_status
--
CREATE TABLE IF NOT EXISTS `website_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `website_status` (`name`) VALUES
('demo'),
('disabled'),
('live'),
('purchased'),
('transfered');
--
-- client
--
CREATE TABLE IF NOT EXISTS `client` (
`id` int(11) NOT NULL auto_increment,
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
`client_status` varchar(26) NOT NULL,
`firstname` varchar(26) NOT NULL,
`lastname` varchar(46) NOT NULL,
`address` varchar(78) NOT NULL,
`city` varchar(56) NOT NULL,
`state` varchar(2) NOT NULL,
`zip` int(11) NOT NULL,
`country` varchar(3) NOT NULL,
`phone` text NOT NULL,
`email` varchar(78) NOT NULL,
`notes` text NOT NULL,
PRIMARY KEY (`id`),
KEY `client_status` (`client_status`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;
--
-- client_status
--
CREATE TABLE IF NOT EXISTS `client_status` (
`name` varchar(26) NOT NULL,
PRIMARY KEY (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `client_status` (`name`) VALUES
('affiliate'),
('customer'),
('demo'),
('disabled'),
('reseller');
As you can see, 3 of the 4 tables use their 'name' as the primary key. I know that these will always be unique. In 2 of the cases (the *_status tables) I am basically using a dynamic replacement for ENUM, since status options could change in the future, and for the 'website' table, I know that the 'name' of the website will always be unique.
I'm wondering if this is sound logic, getting rid of table ID's when I know the name is always going to be a unique identifier, or a recipe for disaster? I'm not a seasoned DBA so any feedback, critique, etc. would be extremely helpful.
Thanks for taking the time to read this!
There are 2 reasons I would always add an ID number to a lookup / ENUM table:
If you are referencing a single column table with the name then you may be better served by using a constraint
What happens if you wanted to rename one of the client_status entries? e.g. if you wanted to change the name from 'affiliate' to 'affiliate user' you would need to update the client table which should not be necessary. The ID number serves as the reference and the name is the description.
In the website table, if you are confident that the name will be unique then it is fine to use as a primary key. Personally I would still assign a numeric ID as it reduces the space used in foreign key tables and I find it easier to manage.
EDIT:
As stated above, you will run into problems if the website name is renamed. By making this the primary key you will be making it very difficult if not impossible for this to be changed at a later date.
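A sketch of the lookup-table shape described here, where the ID serves as the reference and the name is just the description (this revised client_status definition is illustrative, not from the question):
CREATE TABLE IF NOT EXISTS `client_status` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(26) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;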
When making natural PRIMARY KEY's, make sure their uniqueness is under your control.
If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's.
Since website_status and client_status seem to be generated and used by you and only by you, it's acceptable to use them as a PRIMARY KEY, though having a long key may impact performance.
The website name seems to be under the control of the outside world, which is why I'd make it a plain field. What if they want to rename their website?
The counterexamples would be SSN and ZIP codes: it's not you who generates them and there is no guarantee that they won't be ever duplicated.
Kimberly Tripp has an excellent series of blog articles (GUIDs as PRIMARY KEYs and/or the clustering key and The Clustered Index Debate Continues) on the issue of creating clustered indexes and choosing the primary key (related issues, but not always exactly the same). Her recommendation is that a clustered index/primary key should be:
Unique (otherwise useless as a key)
Narrow (the key is used in all non-clustered indexes, and in foreign-key relationships)
Static (you don't want to have to change all related records)
Always Increasing (so new records always get added to the end of the table, and don't have to be inserted in the middle)
Using "Name" as your key, while it seems to satisfy #1, doesn't satisfy ANY of the other three.
Even for your "lookup" table, what if your boss decides to change all affiliates to partners instead? You'll have to modify all rows in the database that use this value.
From a performance perspective, I'm probably most concerned that a key be narrow. If your website name is actually a long URL, then that could really bloat the size of any non-clustered indexes, and all tables that use it as a foreign key.
Besides all the other excellent points that have already been made, I would add one more word of caution against using large fields as clustering keys in SQL Server (if you're not using SQL Server, then this probably doesn't apply to you).
I add this because in SQL Server, the primary key on a table is by default also the clustering key (you can change that, if you want to and know about it, but in most cases it's not done).
The clustering key that determines the physical ordering of the SQL Server table is also being added to every single non-clustered index on that table. If you have only a few hundred to a few thousand rows and one or two indices, that's not a big deal. But if you have really large tables with millions of rows, and potentially lots of indices to speed up the queries, this will indeed cause a lot of disk space and server memory to be wasted unnecessarily.
E.g. if your table has 10 million rows, 10 non-clustered indexes, and your clustering key is 26 bytes instead of 4 (for an INT), then you're wasting 10 million x 10 x 22 bytes, for a total of 2.2 billion bytes (roughly 2.2 GB) - that's not peanuts anymore!
Again - this only applies to SQL Server, and only if you have really large tables with lots of non-clustered indices on them.
Marc
"If you're absolutely sure you will never ever have uniqueness violation, then it's OK to use these values as PRIMARY KEY's."
If you're absolutely sure you will never ever have uniqueness violation, then don't bother to define the key.
Personally, I think you will run into trouble using this idea. As you end up with more parent-child relationships, you end up with a huge amount of work when the names change (as they always will, sooner or later). There can be a big performance hit when a child table with thousands of rows has to be updated because the name of the website changes. And you have to plan for how to make sure those changes happen. Otherwise, when the website name does change (oops, we let the name expire and someone else bought it), things either break because of the foreign key constraint, or you need an automated way (cascading updates) to propagate the change through the system. If you use cascading updates, you can suddenly bring your system to a dead halt while a large change is processed. This is not considered to be a good thing. It really is more effective and efficient to use IDs for relationships and then put unique indexes on the name field to ensure the names stay unique. Database design needs to consider maintenance of data integrity and how that will affect performance.
Another thing to consider is that website names tend to be longer than a few characters. This means the performance difference between using an ID field for joins and using the name for joins could be quite significant. You have to think about these things at the design phase, because it is too late to change to an ID when you have a production system with millions of records that is timing out and the fix is to completely restructure the database and rewrite all of the SQL code. Not something you can fix in fifteen minutes to get the site working again.
This just seems like a really bad idea. What if you need to change the value of the enum? The idea is to make it a relational database and not a set of flat files. At this point, why have the client_status table at all? Moreover, if you are using the data in an application, then by using a type like a GUID or INT you can validate the type and avoid bad data (insofar as the type can be validated). Thus, it is another of many lines of defense to deter hacking.
I would argue that a database that is resistant to corruption, even if it runs a little slower, is better than one that isn’t.
In general, surrogate keys (such as arbitrary numeric identifiers) undermine the integrity of the database. Primary keys are the main way of identifying rows in the database; if the primary key values are not meaningful, the constraint is not meaningful. Any foreign keys that refer to surrogate primary keys are therefore also suspect. Whenever you have to retrieve, update or delete individual rows (and be guaranteed of affecting only one), the primary key (or another candidate key) is what you must use; having to work out what a surrogate key value is when there is a meaningful alternative key is a redundant and potentially dangerous step for users and applications.
Even if it means using a composite key to ensure uniqueness, I would advocate using a meaningful, natural set of attributes as the primary key, whenever possible. If you need to record the attributes anyway, why add another one? That said, surrogate keys are fine when there is no natural, stable, concise, guaranteed-to-be-unique key (e.g. for people).
You could also consider using index key compression, if your DBMS supports it. This can be very effective, especially for indexes on composite keys (think trie data structures), and especially if the least selective attributes can appear first in the index.
I think I am in agreement with cheduardo. It has been 25 years since I took a course in database design, but I recall being told that database engines can more efficiently manage and load indexes that use character keys. The comments about the database having to update thousands of records when a key is changed, and about all of the added space being taken up by the longer keys and then having to be transferred across systems, assume that the key is actually stored in the records, and that it does not have to be transferred across systems anyway. If you create an index on a column (or columns) of a table, I do not think the value is stored in the records of the table (unless you set some option to do so).
If you have a natural key for a table, even if it is changed occasionally, creating another key creates a redundancy that could result in data integrity issues, and it actually creates even more information that needs to be stored and transferred across systems. I work for a team that decided to store the local application settings in the database. They have an identity column for each setting, a section name, a key name, and a key value. They have a stored procedure (another holy war) to save a setting that ensures it does not appear twice. I have yet to find a case where I would use a setting's ID. I have, however, ended up with multiple records with the same section and key name that caused my application to fail. And yes, I know that could have been avoided by defining a constraint on the columns.
Here are a few points that should be considered before deciding on keys for a table:
A numeric key is more suitable when you use references (foreign keys); since you are not using foreign keys, it is OK in your case to use a non-numeric key.
Non-numeric keys use more space than numeric keys, which can decrease performance.
Numeric keys make the DB look simpler to understand (you can easily know the number of rows just by looking at the last row).
You NEVER know when the company you work for will suddenly explode in growth and you have to hire 5 developers overnight. Your best bet is to use numeric (integer) primary keys because they will be much easier for the entire team to work with AND will help your performance if and when the database grows. If you have to break records out and partition them, you might want to use the primary key. If you are adding records with a datetime stamp (as every table should), and there is an error somewhere in the code that updates that field incorrectly, the only way to confirm whether the record was entered in the proper sequence is to check the primary keys. There are probably 10 more TSQL or debugging reasons to use INT primary keys, not the least of which is writing a simple query to select the last 5 records entered into the table.
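As a minimal sketch of that last point, run against the client table from the question (the highest auto_increment id values are the most recently inserted rows):
SELECT *
FROM client
ORDER BY id DESC
LIMIT 5;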