How to lock a specific row that doesn't exist yet in SQL Server

I have an API rate limit table that I'm managing for one of our applications. Here's the definition of it.
CREATE TABLE [dbo].[RateLimit]
(
[UserId] [int] NOT NULL,
[EndPointId] [smallint] NOT NULL,
[AllowedRequests] [smallint] NOT NULL,
[ResetDateUtc] [datetime2](0) NOT NULL,
CONSTRAINT [PK_RateLimit]
PRIMARY KEY CLUSTERED ([UserId] ASC, [EndPointId] ASC)
) ON [PRIMARY]
The process that performs CRUD operations on this table is multi-threaded, so careful consideration needs to be given to this table, which acts as the go-to for rate limit checks (i.e. have we surpassed our rate limit, can we make another request, etc.).
I'm trying to introduce SQL locks to enable the application to reliably INSERT, UPDATE, and SELECT values without having the value changed from under it. Besides the normal complexity of this, the big pain point is that the RateLimit record for the UserId+EndPointId may not exist - and would need to be created.
I've been investigating SQL locks, but there might be no row to lock if the rate limit record doesn't exist yet (i.e. first run).
I've thought about creating a temp table used specifically for controlling the lock flow - but I'm unsure how this would work.
At the farthest extreme, I could wrap the SQL statement in a SERIALIZABLE transaction (or something to that degree), but locking the entire table would have drastic performance impacts - I only care about the userid+endpointid primary key, and making sure that the specific row is read + updated/inserted by one process at a time.
How can I handle this situation?
Version: SQL Server 2016
Notes: READ_COMMITTED_SNAPSHOT is enabled
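One commonly suggested pattern here is to take the lock yourself on the key you care about, using UPDLOCK plus HOLDLOCK on the primary-key lookup. With an equality predicate on the full key, HOLDLOCK acquires a key-range lock even when the row does not exist yet, so only that key range is serialized rather than the whole table. A rough sketch (the parameter names, reset window, and decrement logic are just placeholders):
DECLARE @UserId int = 42,                 -- example values; in practice these are
        @EndPointId smallint = 7,         -- parameters of a stored procedure
        @DefaultAllowedRequests smallint = 100;

BEGIN TRANSACTION;

-- UPDLOCK + HOLDLOCK: blocks other sessions doing the same lookup for this
-- UserId+EndPointId, whether or not the row exists yet.
SELECT [AllowedRequests], [ResetDateUtc]
FROM [dbo].[RateLimit] WITH (UPDLOCK, HOLDLOCK)
WHERE [UserId] = @UserId AND [EndPointId] = @EndPointId;

IF @@ROWCOUNT = 0
    -- First run for this key: create the row (the reset window is illustrative).
    INSERT INTO [dbo].[RateLimit] ([UserId], [EndPointId], [AllowedRequests], [ResetDateUtc])
    VALUES (@UserId, @EndPointId, @DefaultAllowedRequests, DATEADD(HOUR, 1, SYSUTCDATETIME()));
ELSE
    -- Placeholder rate-limit logic: consume one request.
    UPDATE [dbo].[RateLimit]
    SET [AllowedRequests] = [AllowedRequests] - 1
    WHERE [UserId] = @UserId AND [EndPointId] = @EndPointId;

COMMIT TRANSACTION;
An alternative that avoids key-range locks entirely is sp_getapplock (with @LockOwner = 'Transaction' and a resource name built from the UserId/EndPointId pair), which serializes callers on an application lock instead of on table rows.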

Related

Is there a "standard" Primary Key pool table implementation method?

Back in the "good ol' days" I used Sybase's SQL Anywhere database. It had a feature to avoid collisions when multiple users created new records: a separate table of key values existed that would be used to dole out blocks of unique keys to client applications to be used in subsequent inserts into other tables. When a client's pool of keys got low, the client would request another block of keys from the server. The keys could be specific to a single table (that is, each table has its own key pool), or the keys could be "shared" among tables such that an INSERT INTO Table1 might use Key=100, and a following INSERT INTO Table2 would then use Key=101.
This key pool had the benefit that the primary key assigned at the client side could also be used as a foreign key in creating inserts into other tables - all on the client side without first committing the transaction if the user ultimately abandons the new data.
I've searched for similar functionality, but I only seem to find database replication and mirroring, not anything about a table of keys.
We are using a shared database and multiple clients running a VB.NET application for data access and creation.
The basic table I had in mind looks something like:
CREATE TABLE [KeyPool] (
[KeyNo] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[AssignedTo] [varchar](50) NULL,
[Status] [nchar](10) NULL,
[LastTouched] [datetime2] NULL
)
The Status and LastTouched columns would allow for recovery of "lost keys" if garbage collection was desired, but these are not really necessary. In fact, simply having a single row that stores the last key value given to a client would be the minimum requirement: Just hand out keys in blocks of 1000 upon request and increment the counter to know what block to hand out next. However, without the table that is tracking who has what keys, there would be lots of "wasted" key values (which may or may not be an issue depending on the potential number of records expected in the database).
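A minimal sketch of that single-counter variant (the KeyCounter table and its columns are hypothetical):
-- One row per key scope; 'shared' means all tables draw from the same pool.
CREATE TABLE [dbo].[KeyCounter] (
    [Scope]   [varchar](50) NOT NULL PRIMARY KEY,
    [LastKey] [int] NOT NULL
);
INSERT INTO [dbo].[KeyCounter] ([Scope], [LastKey]) VALUES ('shared', 0);

-- Reserve a block of 1000 keys atomically; OUTPUT returns the new high-water
-- mark, so the caller owns (BlockEnd - 999) .. BlockEnd.
DECLARE @BlockSize int = 1000;
UPDATE [dbo].[KeyCounter]
SET [LastKey] = [LastKey] + @BlockSize
OUTPUT inserted.[LastKey] AS [BlockEnd]
WHERE [Scope] = 'shared';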
I'm looking for any "standard" methods in SQL Server before I go out and duplicate the effort of creating my own solution.
Use a Sequence object. You can use NEXT VALUE FOR in a default or query, or request blocks of keys from the client using sp_sequence_get_range.
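A minimal sketch of that suggestion (the sequence name and block size are placeholders):
-- A sequence replaces the hand-rolled key pool table.
CREATE SEQUENCE dbo.KeyPoolSeq AS int START WITH 1 INCREMENT BY 1;

-- Single keys, e.g. in a DEFAULT constraint or directly in a query:
SELECT NEXT VALUE FOR dbo.KeyPoolSeq;

-- Or hand a client a block of 1000 keys in one call:
DECLARE @first sql_variant, @last sql_variant;
EXEC sys.sp_sequence_get_range
     @sequence_name     = N'dbo.KeyPoolSeq',
     @range_size        = 1000,
     @range_first_value = @first OUTPUT,
     @range_last_value  = @last OUTPUT;
SELECT @first AS FirstKey, @last AS LastKey;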

Two SQL statements should return the same results, but they don't (on AWS Aurora DB)

This is the table definition for GpsPosition:
CREATE TABLE GpsPosition
(
altitudeInMeters SMALLINT NOT NULL,
dateCreated BIGINT NOT NULL,
dateRegistered BIGINT NOT NULL,
deviceId BINARY(16) NOT NULL,
emergencyId BINARY(16) NULL,
gpsFix SMALLINT NOT NULL,
heading SMALLINT NOT NULL,
horizontalUncertaintyInMeters SMALLINT NOT NULL,
id BINARY(16) NOT NULL,
latestForDevice BOOLEAN NOT NULL,
latestForUser BOOLEAN NOT NULL,
latitude DOUBLE PRECISION NOT NULL,
longitude DOUBLE PRECISION NOT NULL,
numSatellites SMALLINT NOT NULL,
speedInKmph SMALLINT NOT NULL,
stale BOOLEAN NOT NULL,
userId BINARY(16) NULL,
verticalUncertaintyInMeters SMALLINT NOT NULL,
PRIMARY KEY (id)
);
ALTER TABLE GpsPosition
ADD CONSTRAINT GpsPosition_deviceId_fkey
FOREIGN KEY (deviceId) REFERENCES Device(id)
ON UPDATE CASCADE ON DELETE CASCADE;
ALTER TABLE GpsPosition
ADD CONSTRAINT GpsPosition_emergencyId_fkey
FOREIGN KEY (emergencyId) REFERENCES Emergency(id)
ON UPDATE CASCADE ON DELETE SET NULL;
ALTER TABLE GpsPosition
ADD CONSTRAINT GpsPosition_userId_fkey
FOREIGN KEY (userId) REFERENCES User(id)
ON UPDATE CASCADE ON DELETE SET NULL;
ALTER TABLE GpsPosition
ADD CONSTRAINT deviceId_dateCreated_must_be_unique
UNIQUE (deviceId, dateCreated);
CREATE INDEX i2915035553 ON GpsPosition (deviceId);
CREATE INDEX deviceId_latestForDevice_is_non_unique ON GpsPosition (deviceId, latestForDevice);
CREATE INDEX i3210815937 ON GpsPosition (emergencyId);
CREATE INDEX i1689669068 ON GpsPosition (userId);
CREATE INDEX userId_latestForUser_is_non_unique ON GpsPosition (userId, latestForUser);
Note that userId in GpsPosition is a UUID that is stored as a binary(16).
This SQL code is executing on AWS AuroraDB engine version 5.7.12.
I would expect the queries below to return the same results, but the first one returns many results and the second returns no results. Any idea as to why?
select *
from GpsPosition
where exists (select *
from User
where id = GpsPosition.userId and
id = UNHEX( '3f4163aab2ac46d6ad15164222aca89e' )
);
select *
from GpsPosition
where userId = UNHEX( '3f4163aab2ac46d6ad15164222aca89e' );
Note that the following SQL statement returns a single row, as you would expect:
select *
from User
where id = UNHEX( '3f4163aab2ac46d6ad15164222aca89e' );
I see no semantic equivalence at all.
The one with exists is checking to see if a row exists in another table. If no such matching row exists, then the outer query does not return anything.
That is very different from just returning a matching row in a single table.
The observation that two queries return the same results on a particular set of data does not make them semantically equivalent. They would have to be guaranteed to return the same results on any appropriate data for the query. For instance, 2 + 2 = 2 * 2, but that doesn't make addition and multiplication "semantically equivalent."
I should also add that it is not hard to fool database optimizers, even when two expressions are guaranteed to be equivalent.
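A useful cross-check when trying to reproduce this is to compare query forms that, given the GpsPosition_userId_fkey constraint, should all return the same rows on healthy data; a sketch:
-- EXISTS form (returned many rows in the report above):
select g.*
from GpsPosition g
where exists (select 1 from User u
              where u.id = g.userId
                and u.id = UNHEX('3f4163aab2ac46d6ad15164222aca89e'));

-- JOIN form:
select g.*
from GpsPosition g
join User u on u.id = g.userId
where u.id = UNHEX('3f4163aab2ac46d6ad15164222aca89e');

-- Direct predicate (returned no rows in the report above):
select *
from GpsPosition
where userId = UNHEX('3f4163aab2ac46d6ad15164222aca89e');
On healthy data these should agree; the fact that they diverge on the affected data set is what points at the engine rather than at the queries themselves.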
So my team has spent literally a couple of months trying to understand this issue and many other inconsistencies (like this one in this posting) we were able to reproduce on AWS Aurora DB 5.7 but unable to reproduce on MySQL 5.7 or anything else for that matter.
As a part of this effort, we engaged AWS support, which was remarkably unhelpful. They confirmed they could reproduce the inconsistencies by executing the same queries we did against the same database we did, but then said they couldn't copy that data to another database and still reproduce the issue, and this apparently satisfied them enough to mark the support case as resolved. Now granted, this is a very insidious defect, since it is so difficult to reproduce and so intermittent and rare, but once it is hit, it becomes reliably reproducible within the affected data set. And once you do hit this defect, well, your applications depending on the database can no longer operate correctly in those affected areas ;)
While we do not believe the defect is limited to cascade deletes, deleting rows in tables that have cascade deletes appears to be a way to produce it "more reliably". Even then it is incredibly rare and difficult to trigger; we could produce it by running a huge automated test suite in a tight loop, however. And once you actually do hit this defect, the affected data will reliably reproduce the inconsistencies - it's just VERY hard to hit it in the first place.
So what conclusions did we draw at the end of all of our analysis?
1) First and foremost, Thorsten Kettner (see his posted comment above) is correct - this is a defect in the RDBMS server itself. We don't have access to the AWS AuroraDB source code or underlying infrastructure, and so we cannot root cause this defect to something much more specific, but it is a defect possibly in the RDBMS server, possibly in the data persistence layer, and possibly somewhere else.
2) Based upon (1) above, we decided that AWS Aurora 5.7.x is not mature enough for us to use for a production application. Even though it works correctly 99.9999% of the time, that 0.0001% was causing development and production database servers to do the wrong things and return incorrect results, which is absolutely unacceptable to us. We also detected cases where integrity constraints on the tables were not reliably honored, resulting in very strange orphaned rows that should have been deleted as part of cascade deletes in the schema definition, which again is absolutely unacceptable to us.
3) We were unable to reproduce any of these inconsistencies on AWS MySQL 5.6, AWS MySQL 5.7, AWS AuroraDB with MySQL 5.6 compatibility, non-AWS Windows MySQL 5.6, or non-AWS MySQL 5.7. In short, we believe that whatever is going wrong is specific to AWS AuroraDB with MySQL 5.7 compatibility. We did extensive testing on AWS AuroraDB with MySQL 5.6 compatibility in particular and could not reproduce any of these inconsistency defects, so we believe at this time that AuroraDB with MySQL 5.6 compatibility is mature and suitable for production use.

SQL Indexing Strategy on Link Tables

I often find myself creating 'link tables'. For example, the following table maps a user record to an event record.
CREATE TABLE [dbo].[EventLog](
[EventId] [int] NOT NULL,
[UserId] [int] NOT NULL,
[Time] [datetime] NOT NULL,
[Timestamp] [timestamp] NOT NULL
)
For the purposes of this question, please assume the combination of EventId plus UserId is unique and that the database in question is a MS SQL Server 2008 installation.
The problem I have is that I am never sure as to how these tables should be indexed. For example, I might want to list all users for a particular event, or I might want to list all events for a particular user or, perhaps, retrieve a particular EventId/UserId record. Indexing options I have considered include:
Creating a compound primary key on EventId and UserId (but I understand the index won't be useful when accessing by UserId on its own).
Creating a compound primary key on EventId and UserId and adding a supplemental index on UserId.
Creating a primary key on EventId and a supplemental index on UserId.
Any advice would be appreciated.
Indexes are designed to solve performance problems. If you don't yet have such a problem and can't predict where you'll run into trouble, you shouldn't create indexes speculatively. Indexes are quite expensive: they not only take up disk space but also add overhead to every insert or update. So you should clearly understand which specific performance problem you are solving by creating an index, so you can judge whether it is actually needed.
The answer to your question depends on several aspects.
It depends on the DBMS you are going to use. Some prefer single-column indexes (like PostgreSQL), some can take more advantage of multi-column indexes (like Oracle). Some can answer a query completely from a covering index (like SQLite), others cannot and eventually have to read the pages of the actual table (again, like PostgreSQL).
It depends on the queries you want to answer. For example, do you navigate in both directions, i.e., do you join on both of your Id columns?
It depends on your space and processing time requirements for data modification, too. Keep in mind that indexes are often bigger than the actual table that they index, and that updating indexes is often more expensive than just updating the underlying table.
EDIT:
When your conceptual model has a many-to-many relationship R between two entities E1 and E2, i.e., the logical semantics of R is either "related" or "not related", then I would always declare the combined primary key for R. That will create a unique index. The primary motivation is, however, data consistency, not query optimization, i.e.:
CREATE TABLE [dbo].[EventLog](
[EventId] [int] NOT NULL,
[UserId] [int] NOT NULL,
[Time] [datetime] NOT NULL,
[Timestamp] [timestamp] NOT NULL,
PRIMARY KEY([EventId],[UserId])
)
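If the workload also navigates from UserId to events (option 2 in the question), the supplemental index would look something like this (the index name is arbitrary):
-- Covers the "all events for a particular user" direction; the
-- (EventId, UserId) primary key already serves the other direction and the
-- exact-pair lookup.
CREATE NONCLUSTERED INDEX IX_EventLog_UserId
    ON [dbo].[EventLog] ([UserId], [EventId]);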

How to make bulk insert work with multiple tables

How can I use SQL Server bulk insert to insert into multiple tables when there is a foreign key relationship?
What I mean is that the tables are this,
CREATE TABLE [dbo].[UndergroundFacilityShape]
([FacilityID] [int] IDENTITY(1,1) NOT NULL,
[FacilityTypeID] [int] NOT NULL,
[FacilitySpatialData] [geometry] NOT NULL)
CREATE TABLE [dbo].[UndergroundFacilityDetail]
([FacilityDetailID] [int] IDENTITY(1,1) NOT NULL,
[FacilityID] [int] NOT NULL,
[Name] [nvarchar](50) NOT NULL,
[Value] [nvarchar](255) NOT NULL)
So each UndergroundFacilityShape can have multiple UndergroundFacilityDetail rows. The problem is that the FacilityID is not defined until the insert is done, because it is an identity column. If I bulk insert the data into the Shape table then I cannot match it back up to the Detail data I have in my C# application.
I am guessing the solution is to run a SQL statement to find out what the next identity value is, populate the values myself, and turn off the identity column for the bulk insert? Bear in mind that only one person is going to be running this application to insert data, and it will be done infrequently, so we don't have to worry about identity values clashing or anything like that.
I am trying to import thousands of records, which takes about 3 minutes using standard inserts, but bulk insert will take a matter of seconds.
In the future I am expecting to import data that is much bigger than 'thousands' of records.
Turns out that this is quite simple. Get the current identity values for each of the tables, and populate them into the DataTable myself, incrementing them as I use them. I also have to make sure that the correct values are used to maintain the relationship. That's it. It doesn't seem to matter whether I turn off the identity columns or not.
I have been using this tool on live data for a while now and it works fine.
It was well worth it, though: the import takes no longer than 3 seconds (rather than 3 minutes), and I expect to receive larger datasets at some point.
So what if more than one person uses the tool at the same time? Well, yes, I'd expect issues, but that is never going to be the case for us.
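A rough sketch of the identity-reading step (the client-side assignment is only described in comments, and this assumes the single-user scenario described above):
-- Last identity value generated for each table (or the seed if the table
-- has never contained rows).
SELECT IDENT_CURRENT('dbo.UndergroundFacilityShape')  AS LastFacilityID,
       IDENT_CURRENT('dbo.UndergroundFacilityDetail') AS LastFacilityDetailID;

-- Client side (conceptually): assign FacilityID = LastFacilityID + 1, + 2, ...
-- to the Shape rows in the DataTable, reuse the same values in the matching
-- Detail rows, then bulk insert both tables.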
Peter, you mentioned that you already have a solution with straight INSERTs.
If the destination table does not have a clustered index (or has a clustered index and is empty), just using the TABLOCK query hint will make it a minimally-logged transaction, resulting in a considerable speed-up.
If the destination table has a clustered index and is not empty, you can also enable trace flag 610 in addition to the TABLOCK query hint to make it a minimally-logged transaction.
Check the "Using INSERT INTO…SELECT to Bulk Import Data with Minimal Logging" section on the INSERT MSDN page.

Logic in the db for maintaining a points system relationship?

I'm making a little web based game and need to determine where to put logic that checks the integrity of some underlying data in the sql database.
Each user keeps track of points assigned to him, and points are awarded by various tasks. I keep a record of each task transaction to make sure they're not repeated, and to keep track of the value of the task at the time of completion, since an individual award level may fluctuate over time.
My schema looks like this so far:
create table player (
player_ID serial primary key,
player_Points int not null default 0
);
create table task (
task_ID serial primary key,
task_PointsAwarded int not null
);
create table task_list (
player_ID int references player(player_ID),
task_ID int references task(task_ID),
when_completed timestamp default current_timestamp,
point_value int not null, --not fk because task value may change later
constraint pk_player_task_id primary key (player_ID, task_ID)
);
So, the player.player_Points should be the total of all his cumulative task points in the task_list.
Now where do I put the logic to enforce this?
Should I do away with player.player_Points altogether and do queries every time I want to know the total score? That seems wasteful, since I'll be doing that query a lot over the course of a game.
Or should I put a trigger on task_list that automatically updates player.player_Points? Is that too much logic to have in the database - should I just maintain this relationship in the application?
Thanks.
From a relational standpoint you'd want to do away with player.player_Points altogether so that you don't have to worry about the integrity of the data. If that is too much of a performance burden, you could denormalize it as you have, but I'd do some stress testing on the app first to make sure it actually is a burden; there's no need to prematurely optimize, and it's a pretty simple query to get the value (not just logically, but from a database workload perspective as well).
Personally, even if you go the denormalized route, I would not use a trigger - but that's just a personal bias; you certainly could go that way. I'd probably just set up some integration tests to make sure everything updates properly whenever you perform inserts/updates/deletes, and possibly save a query that you can run periodically if you suspect there is a problem.
I'd also consider adding a daterange on task so that you don't have to store the points_value in the task_list.
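For reference, the "pretty simple query" would be something along these lines (player 1 is just an example id):
-- Current score for one player, computed from the task transactions.
select coalesce(sum(point_value), 0) as player_points
from task_list
where player_ID = 1;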
Actually, the trigger suggestion is the best. That's exactly the kind of task that triggers are ideally suited to (you can also use a trigger or constraint to check the totals computed by the app, but it's the same amount of work, so why bother adding the computation work to the app?).
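A sketch of such a trigger for the PostgreSQL-style schema above (insert-only; the function and trigger names are made up, and UPDATE/DELETE handling is left as an exercise):
create or replace function apply_task_points() returns trigger as $$
begin
    -- Keep the denormalized total in sync whenever a task completion is recorded.
    update player
    set player_Points = player_Points + new.point_value
    where player_ID = new.player_ID;
    return new;
end;
$$ language plpgsql;

create trigger trg_task_list_points
after insert on task_list
for each row execute procedure apply_task_points();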