Using a database table as a queue - sql

I want to use a database table as a queue. I want to insert in it and take elements from it in the inserted order (FIFO). My main consideration is performance because I have thousands of these transactions each second. So I want to use a SQL query that gives me the first element without searching the whole table. I do not remove a row when I read it.
Does SELECT TOP 1 ..... help here?
Should I use any special indexes?

I'd use an IDENTITY field as the primary key to provide the uniquely incrementing ID for each queued item, and stick a clustered index on it. This would represent the order in which the items were queued.
To keep the items in the queue table while you process them, you'd need a "status" field to indicate the current status of a particular item (e.g. 0=waiting, 1=being processed, 2=processed). This is needed to prevent an item being processed twice.
When processing items in the queue, you'd need to find the next item in the table NOT currently being processed. This would need to be in such a way so as to prevent multiple processes picking up the same item to process at the same time as demonstrated below. Note the table hints UPDLOCK and READPAST which you should be aware of when implementing queues.
e.g. within a sproc, something like this:
DECLARE @NextID INT
BEGIN TRANSACTION
-- Find the next queued item that is waiting to be processed
SELECT TOP 1 @NextID = ID
FROM MyQueueTable WITH (UPDLOCK, READPAST)
WHERE Status = 0
ORDER BY ID ASC
-- If we've found one, mark it as being processed
IF @NextID IS NOT NULL
    UPDATE MyQueueTable SET Status = 1 WHERE ID = @NextID
COMMIT TRANSACTION
-- If we've got an item from the queue, return it to whatever is going to process it
IF @NextID IS NOT NULL
    SELECT * FROM MyQueueTable WHERE ID = @NextID
If processing an item fails, do you want to be able to try it again later? If so, you'll need to reset the status back to 0, or record a separate "failed" state. That will require a bit more thought.
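For example, a rough sketch of a retry reset (this assumes you add a RetryCount column; the column name and the limit of 3 are illustrative):
-- Put a failed item back in the queue, assuming a RetryCount column exists
UPDATE MyQueueTable
SET Status = 0, RetryCount = RetryCount + 1
WHERE ID = @NextID AND RetryCount < 3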
Alternatively, don't use a database table as a queue, but something like MSMQ - just thought I'd throw that in the mix!

If you do not remove your processed rows, then you are going to need some sort of flag that indicates that a row has already been processed.
Put an index on that flag, and on the column you are going to order by.
Partition your table over that flag, so the dequeued transactions are not clogging up your queries.
If you really do get 1,000 messages every second, that would result in 86,400,000 rows a day. You might want to think of some way to clean up old rows.
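For instance, a rough sketch of the index and the cleanup (SQL Server syntax; the table and column names here are assumptions):
-- Index covering the processed flag plus the ordering column
CREATE NONCLUSTERED INDEX IX_Queue_Processed_Id ON dbo.QueueTable (Processed, Id);

-- Periodic cleanup of old, already-processed rows (the 7-day cutoff is arbitrary)
DELETE FROM dbo.QueueTable
WHERE Processed = 1 AND CreatedAt < DATEADD(DAY, -7, GETDATE());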

Everything depends on your database engine/implementation.
For me, a simple queue on a table with the following columns:
id / task / priority / date_added
usually works.
I used priority and task to group tasks, and in the case of a duplicated task I chose the one with the higher priority.
And don't worry - for modern databases "thousands" is nothing special.
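A minimal sketch of such a table (SQL Server syntax shown; adjust the identity and datetime defaults to your engine):
CREATE TABLE task_queue (
    id         BIGINT IDENTITY(1,1) PRIMARY KEY,
    task       VARCHAR(255) NOT NULL,
    priority   INT NOT NULL DEFAULT 0,
    date_added DATETIME NOT NULL DEFAULT GETDATE()
);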

This will not be any trouble at all as long as you use something to keep track of the datetime of the insert. See here for the MySQL options. The question is whether you only ever need the absolute most recently submitted item or whether you need to iterate. If you need to iterate, then what you need to do is grab a chunk with an ORDER BY statement, loop through, and remember the last datetime so that you can use that when you grab your next chunk.
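A rough sketch of that chunked iteration (MySQL-style syntax; the table and column names are assumptions):
-- @last_seen holds the datetime remembered from the previous chunk
SELECT *
FROM queue_items
WHERE inserted_at > @last_seen
ORDER BY inserted_at
LIMIT 100;
-- Note: if several rows can share the same datetime, also track the last id to avoid skipping rows.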

Perhaps adding LIMIT 1 to your SELECT statement would help, forcing the return after a single match.

Since you don't delete the records from the table, you need a composite index on (processed, id), where processed is the column that indicates whether the current record has been processed.
The best thing would be creating a partitioned table for your records and making the PROCESSED field the partitioning key. This way, you can keep three or more local indexes.
However, if you always process the records in id order and have only two states, updating a record would mean just taking it from the first leaf of the index and appending it to the last leaf.
The currently processed record would always have the least id of all unprocessed records and the greatest id of all processed records.
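A short sketch of that composite index and the lookup it serves (names are placeholders):
CREATE INDEX ix_queue_processed_id ON queue_table (processed, id);

-- The "next unprocessed item" lookup then becomes a seek on the first matching index entry
SELECT TOP 1 id
FROM queue_table
WHERE processed = 0
ORDER BY id;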

Create a clustered index over a date (or autoincrement) column. This will keep the rows in the table roughly in index order and allow fast index-based access when you ORDER BY the indexed column. Using TOP X (or LIMIT X, depending on your RDBMS) will then only retrieve the first X items from the index.
Performance warning: you should always review the execution plans of your queries (on real data) to verify that the optimizer doesn't do unexpected things. Also try to benchmark your queries (again on real data) to be able to make informed decisions.
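For example, in SQL Server you can switch on I/O and timing statistics while testing the dequeue query (a sketch; the query itself is illustrative):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT TOP 1 *
FROM MyQueueTable
WHERE Status = 0
ORDER BY ID;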

I had the same general question of "how do I turn a table into a queue" and couldn't find the answer I wanted anywhere.
Here is what I came up with for Node/SQLite/better-sqlite3.
Basically just modify the inner WHERE and ORDER BY clauses for your use case.
const crypto = require("crypto"); // needed for the random batch identifier
// "status" is assumed to be defined elsewhere in the module (a map of status string constants)

module.exports.pickBatchInstructions = (db, batchSize) => {
  const buf = crypto.randomBytes(8); // Create a unique batch identifier
  const q_pickBatch = `
    UPDATE
      instructions
    SET
      status = '${status.INSTRUCTION_INPROGRESS}',
      run_id = '${buf.toString("hex")}',
      mdate = datetime(datetime(), 'localtime')
    WHERE
      id IN (SELECT id
             FROM instructions
             WHERE
               status is not '${status.INSTRUCTION_COMPLETE}'
               and run_id is null
             ORDER BY
               length(targetpath), id
             LIMIT ${batchSize});
  `;
  db.prepare(q_pickBatch).run(); // Change the status and set the run id

  const q_getInstructions = `
    SELECT
      *
    FROM
      instructions
    WHERE
      run_id = '${buf.toString("hex")}'
  `;
  const rows = db.prepare(q_getInstructions).all(); // Get all rows with this batch id
  return rows;
};

A very easy solution that avoids transactions, locks, etc. is to use the change tracking mechanism (not change data capture). It uses versioning for each added/updated/removed row, so you can track what changes happened after a specific version.
So, you persist the last version and query the new changes.
If a query fails, you can always go back and query data from the last version.
Also, if you don't want to get all changes with one query, you can get the top n ordered by version and store the greatest version you got, to use in the next query.
See, for example, Using Change Tracking in SQL Server 2008.
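A minimal sketch of that pattern, assuming change tracking is already enabled on the database and on the queue table (the table and column names are placeholders):
DECLARE @last_sync_version BIGINT = 0;  -- persisted from the previous run

SELECT CT.ID, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES dbo.MyQueueTable, @last_sync_version) AS CT
ORDER BY CT.SYS_CHANGE_VERSION;

-- Remember where to start next time
SELECT CHANGE_TRACKING_CURRENT_VERSION();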

Related

Mass Updating a single column based on ID

I keep track of users and their activity, assigning a numerical value for what they do, storing it in a cache and updating the DB every 2 hours logging their activity.
I usually have about 10000 users during this period, all with different activity points - so for example, I would have to update 10000 rows of activity column in the table users based on column user_id every 2 hours, with something simple like activity = activity + 500 per row.
What would be an effective way to do so? Obviously it would be really slow if I sent a query each time for each user. Some methods I researched were using CASE, but ultimately 10,000 cases would also take really long and would be inefficient as well. I'm sure there's a good method to do so that I haven't seen yet.
You can use a values list in order to create a virtual user-supplied table, and then do an update with a join to that table.
update users set activity=activity+t.y
from (values (1,5),(2,9),(3,19) /*, ...*/ ) t(id,y)
where users.user_id=t.id;
First, do you need this optimization? 10,000 users over 2 hours is only about 1.4 updates per second. Consider instead simply inserting activities into a user_activity table as needed. Inserting rows into a different table, rather than updating users, avoids needing write locks on user rows. Such an "insert-only" table should perform well.
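A rough sketch of that insert-only approach (the table and column names are assumptions; adjust types and defaults to your engine):
CREATE TABLE user_activity (
    user_id    INT NOT NULL,
    points     INT NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Current activity per user is then an aggregate over the log
SELECT user_id, SUM(points) AS activity
FROM user_activity
GROUP BY user_id;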
Second, 2 hours between updates seems excessive. The win of caching is to avoid a flurry of update queries per second, but the benefits rapidly drop off. Try 1 minute or even less. This will reduce the size of the update, greatly simplify the update process, and avoid possibly locking a bunch of rows.
If you do need this optimization, you can do it by updating from a temp table.
Make a temp table with user ID and activity count.
Copy your cached user IDs and activity counts into the temp table.
Update from the temp table.
The update would look something like this...
update users u
set activity = u.activity + tmp.activity
from tmp_user_activity tmp
where tmp.user_id = u.id
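For completeness, a sketch of the first two steps (syntax varies by engine; the names are illustrative):
CREATE TEMPORARY TABLE tmp_user_activity (
    user_id  INT PRIMARY KEY,
    activity INT NOT NULL
);

-- Bulk-load the cached counts, e.g. as one multi-row insert
INSERT INTO tmp_user_activity (user_id, activity)
VALUES (1, 500), (2, 750) /*, ... */;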

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images yet to the board since I'm a new user, so see appropriate links below
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an SQL query that will renumber the column I specify. I prefer not to drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
  on ItemTicket.item_ticket_id = ItemVoid.item_ticket_id
where ItemTicket.ID in (select ID from results)
Example Tables Before:
Example Tables After:
As you can see, 2 rows were deleted from both tables based on the ID column. So now I gotta figure out how to renumber the item_ticket_id and item_void_id columns, where the higher number decreases to the missing value, the next highest one decreases, and so on. Problem #2: if the item_ticket_id changes in order to be sequential in ItemTickets, then that change has to be reflected in ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
Resequencing an ID column (not an identity one) that has gaps can be performed using a simple CTE with ROW_NUMBER() to generate a new sequence.
The UPDATE works via the CTE 'virtual table' without any extra problems, actually updating the underlying original table.
Don't worry about the ID fields clashing during the update; if you wonder what happens when IDs are set to values that already exist, it doesn't suffer from that problem: the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
    SELECT
        ID,
        ROW_NUMBER() OVER (ORDER BY ID) AS ID_New
    FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is you need to redesign this as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag that will mark the records as inactive. Then when you are querying the records, just include a WHERE clause to only include the records that are active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them then you can perform this task the following way:
Create a new table
Insert your original data into your new table using the new numbers
Drop your old table
Rename your new table with the corrected numbers
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET Inactive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In standard SQL (SQL Server, MySQL), the function is row_number(). You use it as follows:
select row_number() over (order by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you against using cursors, since they have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.

Ensure unique value

I have a table with unique values within it and once a stored procedure is called, I use the following code within a sub-query to get a random value from the table:
SELECT TOP 1 UniqueID FROM UniqueValues
WHERE InitiatingID is NULL
ORDER BY NewID() ASC
I have, however, noticed that now and then I manage to retrieve the same unique value twice (I'm guessing two calls running simultaneously cause it), which causes some issues within the program.
Is there any way (preferably not locking the table) to make the unique values ID generation completely unique - or unique enough to not affect two simultaneous calls? As a note, I need to keep the unique values and cannot use GUIDs directly here.
Thanks,
Kyle
Edit for clarification:
I am buffering the unique values. That's what the WHERE InitiatingID is NULL is all about. As a value gets picked out of the query, the InitiatingID is set and therefore cannot be used again until released. The problem is that in the milliseconds of that process setting the InitiatingID it seems that the value is getting picked up again, thus harming the process.
Random implies that you will get the same value twice randomly.
Why not use IDENTITY columns?
I wrote a blog post about manual ID generation some days ago here. Maybe that helps.
What you're doing isn't really generating random unique values - which has a low probability of generating duplicates if you use the appropriate routines, but randomly selecting one item from a population - which, depending on the size of your population, will have a much higher chance of repeat occurrences. In fact, given enough repeated drawing, there will occasionally be repeats - if there weren't, it wouldn't be truly random.
If what you want is to never draw the same unique id in a row, you might consider buffering the 'old' unique id somewhere, and discarding your draw if it matches (or running a WHERE <> currentlydrawuniqueID).
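A sketch of that exclusion, assuming the previously drawn value is passed in as @LastUniqueID:
SELECT TOP 1 UniqueID
FROM UniqueValues
WHERE InitiatingID IS NULL
  AND UniqueID <> @LastUniqueID
ORDER BY NEWID()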
What about using UPDATE with the OUTPUT clause to select the UniqueId and set InitiatingID all at once? http://msdn.microsoft.com/en-US/library/ms177564(v=SQL.90).aspx
Something like: (Though I don't have SQL Server handy, so not tested.)
DECLARE @UniqueIDTable TABLE
(
    UniqueId int
)

UPDATE UniqueValues
SET InitiatingID = @InitiatingID
OUTPUT INSERTED.UniqueId INTO @UniqueIDTable
WHERE UniqueID =
    (SELECT TOP 1 UniqueID FROM UniqueValues
     WHERE InitiatingID IS NULL
     ORDER BY NewID() ASC)
AND InitiatingID IS NULL

Do I need a second table for this database logic?

Firstly, this DB question could be a bit DB agnostic, but I am using Sql Server 2008 if that has a specialised solution for this problem, but please keep reading this if you're not an MS Sql Server person .. please :)
Ok, I read in a log file that contains data for a game. Eg. when a player connects, disconnects, does stuff, etc. Nothing too hard. Works great.
Now, two of the log file entry types are
NewConnection
LostConnection
What I'm trying to keep track of are the currently connected players, to the game.
So what I originally thought of was to create a second table where it contains the most recent new connection, per player. When a player disconnects/loses connection, I then remove this entry from this second table.
Eg.
Table 1: LogEntries
LogEntryId INT PK NOT NULL
EntryTypeId TINYINT NOT NULL
PlayerId INT NOT NULL
....
Table 2: ConnectedPlayers
LogEntryId INT FK (back to LogEntries table) NOT NULL
Then, I thought I could use a trigger to insert this cached data into the ConnectedPlayers table. Don't forget, if it's a trigger, it needs to handle multiple records, updates and deletes.
But I'm not sure if this is the best way. Like, could I have an Indexed View?
I would love to know people's thoughts on this.
Oh, one more thing: for simplicity, let's just assume that when a player drops connection/lags out/modem dies/etc, the application is smart enough to know this and DOES record this as a LostConnection entry. There will be no phantom users reported as connected when they have really been disconnected accidentally, etc.
UPDATE:
I was thinking that maybe I could use a view instead? (And I can index this view if I want to, also :) ) By partitioning my results, I could get the most recent event type per player, where the event is a NewConnection or a LostConnection. Then only grab those most recent NewConnection rows, which means they are connected. No second table/triggers/extra insert .NET code/whatever needed ...
eg..
SELECT LogEntryId, EntryTypeId, PlayerId
FROM
    (SELECT LogEntryId, EntryTypeId, PlayerId,
        RANK() OVER (PARTITION BY PlayerId ORDER BY LogEntryId DESC) AS MostRecentRank
     FROM LogEntries
     WHERE (EntryTypeId = 2   -- NewConnection
         OR EntryTypeId = 4)  -- LostConnection
    ) SubQuery
WHERE MostRecentRank = 1
    AND EntryTypeId = 2       -- still connected: most recent event was a NewConnection
How does that sound/look?
You don't need a second table, but you do need a date column, which I assume is part of your log data. I would normalize the data and avoid the temptation to optimize prematurely. Make sure you index the key columns, mainly the LogEntryDate and PlayerId columns in the case of your query.
Then, use a standard aggregate query to determine the newest log entry for each user, and then filter out the ones that are not connected. You could further optimize this by only selecting from log entries from the last 24 hours (or last week or whatever makes sense for your app).
select l.*
from (
select PlayerId, max(LogEntryDate) as MaxLogEntryDate
from LogEntries
where EntryTypeId in (2,4)
and LogEntryDate > GetDate() - 7 --only look at the last week, as connections older than that have timed out
group by PlayerId
) lm
inner join LogEntries l on lm.PlayerId = l.PlayerId and lm.MaxLogEntryDate = l.LogEntryDate
where l.EntryTypeId = 2 --new connections only
If you find that you are still not getting the speed you want out of the query, then look at strategies for optimizing. You seem reluctant to cache in the application layer, so your proposal of indexed views would work. You could use the query above as a basis for this to create a Player view that includes a boolean IsConnected column.
Note: if you do not receive a date with each log entry but the LogEntryId is generated by the game, that should work as a substitute for the date. If you are generating the LogEntryId on insert though, I would caution against relying on that as it would only take one out of order import to throw off all of your data.
Depending on the size of the original table LogEntries, this almost seems like overkill.
The triggers would have to update with each change to the original table, whereas with the correct indexing a simple query could give you these results when you require the data.
I would thus go against the option of a secondary table.
I'd make an is_connected flag, login_time and that's about it.
You can use a simple MySQL query to check every 10/60 seconds even and cache the data in a file.
Where is_connected=1, order by login_time limit 10/20/100 ...
A second table seems way too much and pretty useless. Extra caching (if needed, in a huge database..) could be done on files.
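A rough sketch of that query (MySQL-style; the table and column names are assumptions):
SELECT player_id, login_time
FROM players
WHERE is_connected = 1
ORDER BY login_time DESC
LIMIT 100;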
I personally would use a view that gets the most recent NewConnection or LostConnection(whichever is more recent) for each player, which implies you need some sort of date-time stamp in the log or an id that is ever increasing, and then filter that further throwing out all the LostConnection entries. This will leave you with all players having a NewConnection without a more recent LostConnection, hence they are connected.
The problem you might have with this approach is the log table might be huge. I would probably try performance testing with an index on the timestamp column or whatever column you use to determine what is the "most recent" entry.
As I understand it, what you really need is
Player (ID)
Game (ID)
Connection (PlayerID, GameID, LastActivity DateTime)
..
@interestingTime is some time before the current time
select PlayerID, GameID
from Connection
where LastActivity > @interestingTime
gives you all currently connected players.
select PlayerID, GameID
from Connection
where LastActivity <= @interestingTime
gives you lost connections.

SQL trigger for deleting old results

We have a database that we are using to store test results for an embedded device. There's a table with columns for different types of failures (details not relevant), along with a primary key 'keynum' and a 'NUM_FAILURES' column that lists the number of failures. We store passes and failures, so a pass has a '0' in 'NUM_FAILURES'.
In order to keep the database from growing without bounds, we want to keep the last 1000 results, plus any of the last 50 failures that fall outside of the 1000. So, worst case, the table could have 1050 entries in it. I'm trying to find the most efficient SQL insert trigger to remove extra entries. I'll give what I have so far as an answer, but I'm looking to see if anyone can come up with something better, since SQL isn't something I do very often.
We are using SQLITE3 on a non-Windows platform, if it's relevant.
EDIT: To clarify, the part that I am having problems with is the DELETE, and specifically the part related to the last 50 failures.
The reason you want to remove these entries is to keep the database from growing too big, not to keep it in some special state. For that I would really not use triggers and instead set up a job to run at some interval cleaning up the table.
So far, I have ended up using a View combined with a Trigger, but I'm not sure it's going to work for other reasons.
CREATE VIEW tablename_view AS SELECT keynum FROM tablename WHERE NUM_FAILURES != '0'
    ORDER BY keynum DESC LIMIT 50;

CREATE TRIGGER tablename_trig
AFTER INSERT ON tablename WHEN (((SELECT COUNT(*) FROM tablename) >= 1000) or
    ((SELECT COUNT(NUM_FAILURES) FROM tablename WHERE NUM_FAILURES != '0') >= 50))
BEGIN
    DELETE FROM tablename WHERE ((((SELECT MAX(keynum) FROM tablename) - keynum) >= 1000)
        AND
        ((NUM_FAILURES == '0') OR ((SELECT MIN(keynum) FROM tablename_view) > keynum)));
END;
I think you may be using the wrong data structure. Instead I'd create two tables and pre-populate one with 1000 rows (successes) and the other with 50 (failures). Put a primary ID on each. Then, when you record a result, instead of inserting a new row, find the ID+1 value for the last timestamped record entered (looping back to 0 if it's greater than max(id) in the table) and update it with your new values.
This has the advantage of pre-allocating your storage, not requiring a trigger, and internally consistent logic. You can also adjust the size of the log very simply by just pre-populating more records rather than to have to change program logic.
There's several variations you can use on this, but the idea of using a closed loop structure rather than an open list would appear to match the problem domain more closely.
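A rough sketch of that ring-buffer update in SQLite (the schema here is an assumption: a results_ring table with 1000 pre-populated slots and a ring_meta row holding the last slot written):
-- Overwrite the slot after the last one written, wrapping around at 1000
UPDATE results_ring
SET num_failures = 0, recorded_at = datetime('now')
WHERE slot = (SELECT (last_slot % 1000) + 1 FROM ring_meta);

-- Advance the write pointer
UPDATE ring_meta SET last_slot = (last_slot % 1000) + 1;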
How about this:
DELETE
FROM table
WHERE ( id < ( SELECT max(id) - 1000 FROM table )
        AND num_failures = 0
      )
   OR id < ( SELECT max(id) - 1050 FROM table )
If performance is a concern, it might be better to delete on a periodic basis, rather than on each insert.