Ensure unique value - sql

I have a table with unique values within it and once a stored procedure is called, I use the following code within a sub-query to get a random value from the table:
SELECT TOP 1 UniqueID FROM UniqueValues
WHERE InitiatingID is NULL
ORDER BY NewID() ASC
I have, however, noticed that every now and then I retrieve the same unique value twice (I'm guessing two calls running simultaneously cause it), which causes some issues within the program.
Is there any way (preferably not locking the table) to make the unique values ID generation completely unique - or unique enough to not affect two simultaneous calls? As a note, I need to keep the unique values and cannot use GUIDs directly here.
Thanks,
Kyle
Edit for clarification:
I am buffering the unique values. That's what the WHERE InitiatingID is NULL is all about. As a value gets picked out of the query, the InitiatingID is set and therefore cannot be used again until released. The problem is that in the milliseconds of that process setting the InitiatingID it seems that the value is getting picked up again, thus harming the process.

Random implies that you will get the same value twice randomly.
Why not use IDENTITY columns?
I wrote a blog post about manual ID generation some days ago here. Maybe that helps.

What you're doing isn't really generating random unique values - which has a low probability of generating duplicates if you use the appropriate routines, but randomly selecting one item from a population - which, depending on the size of your population, will have a much higher chance of repeat occurrences. In fact, given enough repeated drawing, there will occasionally be repeats - if there weren't, it wouldn't be truly random.
If what you want is to never draw the same unique id in a row, you might consider buffering the 'old' unique id somewhere, and discarding your draw if it matches (or running a WHERE <> currentlydrawuniqueID).

What about using update with the output clause to select the UniqueId and set InitiatingId all at once. http://msdn.microsoft.com/en-US/library/ms177564(v=SQL.90).aspx
Something like: (Though I don't have SQL Server handy, so not tested.)
DECLARE @UniqueIDTable TABLE
(
UniqueId int
)
UPDATE UniqueValues
SET InitiatingID = @InitiatingID
OUTPUT INSERTED.UniqueId INTO @UniqueIDTable
WHERE UniqueID =
(SELECT TOP 1 UniqueID FROM UniqueValues
WHERE InitiatingID is NULL
ORDER BY NewID() ASC)
AND InitiatingID is NULL
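As a rough illustration of claiming a value and marking it as taken in a single statement, here is a minimal Python sketch. It uses the standard-library sqlite3 module in place of SQL Server, purely so that the example is runnable. SQLite has no OUTPUT clause here, so the sketch reads the claimed row back by its InitiatingID, which assumes each initiating ID holds at most one value at a time. The function name claim_unique_id and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UniqueValues (UniqueID INTEGER PRIMARY KEY, InitiatingID TEXT)"
)
conn.executemany(
    "INSERT INTO UniqueValues (UniqueID) VALUES (?)", [(i,) for i in range(1, 6)]
)
conn.commit()


def claim_unique_id(conn, initiating_id):
    """Claim one random free value for initiating_id; return it, or None if none are free."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            """
            UPDATE UniqueValues
               SET InitiatingID = ?
             WHERE UniqueID = (SELECT UniqueID FROM UniqueValues
                                WHERE InitiatingID IS NULL
                                ORDER BY RANDOM() LIMIT 1)
               AND InitiatingID IS NULL
            """,
            (initiating_id,),
        )
        if cur.rowcount == 0:
            return None  # nothing was free, so nothing was claimed
        # Assumption: each initiating ID holds at most one value at a time.
        row = conn.execute(
            "SELECT UniqueID FROM UniqueValues WHERE InitiatingID = ?",
            (initiating_id,),
        ).fetchone()
        return row[0]


# Six callers compete for five values; the last one should get nothing.
claimed = [claim_unique_id(conn, f"caller-{n}") for n in range(6)]
```

Because the selection and the marking happen in the same UPDATE, there is no window between "pick" and "mark" in which a second caller could pick the same row.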

Related

Can SQL return different results for two runs of the same query using ORDER BY?

I have the following table:
CREATE TABLE dbo.TestSort
(
Id int NOT NULL IDENTITY (1, 1),
Value int NOT NULL
)
The Value column could (and is expected to) contain duplicates.
Let's also assume there are already 1000 rows in the table.
I am trying to prove a point about unstable sorting.
Given this query that returns a 'page' of 10 results from the first 1000 inserted results:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value
My intuition tells me that two runs of this query could return different rows if the Value column contains repeated values.
I'm basing this on the facts that:
the sort is not stable
if new rows are inserted in the table between the two runs of the query, it could possibly create a re-balancing of B-trees (the Value column may be indexed or not)
EDIT: For completeness: I assume rows never change once inserted, and are never deleted.
In contrast, a query with stable sort (ordering also by Id) should always return the same results, since IDs are unique:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value, Id
The question is: Is my intuition correct? If yes, can you provide an actual example of operations that would produce different results (at least "on your machine")? You could modify the query, add indexes on the Value column etc.
I don't care about the exact query, but about the principle.
I am using MS SQL Server (2014), but am equally satisfied with answers for any SQL database.
If not, then why?
Your intuition is correct. In SQL, the sort for order by is not stable. So, if you have ties, they can be returned in any order. And, the order can change from one run to another.
The documentation sort of explains this:
Using OFFSET and FETCH as a paging solution requires running the query
one time for each "page" of data returned to the client application.
For example, to return the results of a query in 10-row increments,
you must execute the query one time to return rows 1 to 10 and then
run the query again to return rows 11 to 20 and so on. Each query is
independent and not related to each other in any way. This means that,
unlike using a cursor in which the query is executed once and state is
maintained on the server, the client application is responsible for
tracking state. To achieve stable results between query requests using
OFFSET and FETCH, the following conditions must be met:
The underlying data that is used by the query must not change. That is, either the rows touched by the query are not updated or all
requests for pages from the query are executed in a single transaction
using either snapshot or serializable transaction isolation. For more
information about these transaction isolation levels, see SET
TRANSACTION ISOLATION LEVEL (Transact-SQL).
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
Although this specifically refers to offset/fetch, it clearly applies to running the query multiple times without those clauses.
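To make the second condition concrete, here is a small Python sketch using the standard-library sqlite3 module (SQLite rather than SQL Server, an assumption made only so the example runs anywhere). It pages through a table full of ties while ordering by Value, Id, so the unique Id column breaks every tie. The function name page and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestSort (Id INTEGER PRIMARY KEY, Value INTEGER NOT NULL)")
# Only three distinct values across 30 rows, so ties are guaranteed.
conn.executemany(
    "INSERT INTO TestSort (Value) VALUES (?)", [(i % 3,) for i in range(30)]
)
conn.commit()


def page(conn, page_number, page_size=10):
    """Return one page of Ids ordered by Value, with the unique Id as tie-breaker."""
    rows = conn.execute(
        "SELECT Id FROM TestSort ORDER BY Value, Id LIMIT ? OFFSET ?",
        (page_size, page_number * page_size),
    )
    return [r[0] for r in rows]


pages = [page(conn, n) for n in range(3)]
```

With the tie-breaker in place, each row has exactly one possible position, so the pages cannot overlap or skip rows as long as the underlying data does not change between requests.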
If there are ties in the ordering column, ORDER BY is not stable.
LiveDemo
CREATE TABLE #TestSort
(
Id INT NOT NULL IDENTITY (1, 1) PRIMARY KEY,
Value INT NOT NULL
) ;
DECLARE @c INT = 0;
WHILE @c < 100000
BEGIN
INSERT INTO #TestSort(Value)
VALUES ('2');
SET @c += 1;
END
Example:
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
DBCC DROPCLEANBUFFERS; -- run to clear cache
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
The point is that I force the query optimizer to use a parallel plan, so there is no guarantee that it will read the data sequentially, as a clustered index scan probably would when no parallelism is involved.
You cannot be sure how the query optimizer will read the data unless you explicitly force the result to be sorted in a specific way, for example with ORDER BY Id, Value.
For more info read No Seatbelt - Expecting Order without ORDER BY.
I think this post will answer your question:
Is SQL order by clause guaranteed to be stable ( by Standards)
The result is the same every time when you are in a single-threaded environment. Once multi-threading is involved, that can no longer be guaranteed.

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images yet to the board since I'm a new user, so see appropriate links below
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an sql query that will renumber the column I specify. I prefer to not drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
on ItemTicket.item_ticket_id = itemvoid.item_ticket_id
where itemticket.ID in (select ID
from results)
Example Tables Before:
Example Tables After:
As you can see, 2 rows were deleted from both tables based on the ID column. So now I have to figure out how to renumber the item_ticket_id and item_void_id columns so that the next higher number drops down to fill the missing value, the one after that drops down in turn, and so on. Problem #2: if the item_ticket_id changes in order to be sequential in ItemTicket, then that change also has to be applied to ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
Resequencing an ID column (not an Identity one) that has gaps can be done with a simple CTE that uses row_number() to generate the new sequence.
The UPDATE works through the CTE 'virtual table' without any extra problems and actually updates the underlying original table.
Don't worry about ID values clashing during the update. If you wonder what happens when an ID is set to a value that already exists, it doesn't suffer from that problem: the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
SELECT
ID,
ROW_NUMBER() OVER (ORDER BY ID) as ID_New
FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is you need to redesign this as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag that will mark the records as Inactive. Then when you are querying the records, just include a WHERE clause to only include the records that are active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them then you can perform this task the following way:
create a new table
Insert your original data into your new table using the new numbers
drop your old table
rename your new table with the corrected numbers
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
on ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (select ID from results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In standard SQL (SQL Server, MySQL), the function is row_number(). You use it as follows:
select row_number() over (order by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you against using cursors, since these have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.
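The set-based renumbering described above, including carrying the new item_ticket_id values over to ItemVoid, could look roughly like the following Python sketch. It uses the standard-library sqlite3 module instead of SQL Server and assumes SQLite 3.25 or newer for ROW_NUMBER(). The table layouts, the id_map helper table, and the function name renumber_item_tickets are hypothetical, and the renumbered column deliberately carries no unique constraint in this demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ItemTicket (ID INTEGER PRIMARY KEY, item_ticket_id INTEGER NOT NULL);
    CREATE TABLE ItemVoid (item_void_id INTEGER PRIMARY KEY, item_ticket_id INTEGER NOT NULL);
    -- Gaps at 2, 5 and 6 simulate rows that were already deleted.
    INSERT INTO ItemTicket (ID, item_ticket_id) VALUES (10, 1), (11, 3), (12, 4), (13, 7);
    INSERT INTO ItemVoid (item_void_id, item_ticket_id) VALUES (1, 1), (2, 4), (5, 7);
""")


def renumber_item_tickets(conn):
    """Close the gaps in ItemTicket.item_ticket_id and mirror the change in ItemVoid."""
    with conn:  # commits the updates together, rolls back on error
        # Map every existing value to its gap-free replacement.
        conn.execute("""
            CREATE TEMP TABLE id_map AS
            SELECT item_ticket_id AS old_id,
                   ROW_NUMBER() OVER (ORDER BY item_ticket_id) AS new_id
              FROM ItemTicket
        """)
        # Update the child first, then the parent, both from the same mapping.
        # Assumes every ItemVoid row points at an existing ItemTicket row.
        conn.execute("""
            UPDATE ItemVoid
               SET item_ticket_id = (SELECT new_id FROM id_map
                                      WHERE old_id = ItemVoid.item_ticket_id)
        """)
        conn.execute("""
            UPDATE ItemTicket
               SET item_ticket_id = (SELECT new_id FROM id_map
                                      WHERE old_id = ItemTicket.item_ticket_id)
        """)
        conn.execute("DROP TABLE id_map")


renumber_item_tickets(conn)
tickets = [r[0] for r in conn.execute("SELECT item_ticket_id FROM ItemTicket ORDER BY ID")]
voids = conn.execute(
    "SELECT item_void_id, item_ticket_id FROM ItemVoid ORDER BY item_void_id"
).fetchall()
```

Building the old-to-new mapping once and applying it to both tables keeps the parent and child values consistent without any row-by-row loop.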

SQLite: UPDATE column of multiple rows with sequential value

I have a table including the columns (int id pkey, int group, double sortOrder) where sortOrder implements a user-specified sort order within one group. On some modifications, I need to re-number the sort order of all items in the group. SortOrder values in different groups are independent of each other (i.e. they are only ever compared within one group).
SELECT id, sortOrder FROM tbl WHERE group=X ORDER BY SortOrder
This gives me all elements in that group, in the current order, in which I would have to assign sequential values (1, 2, 3, ...) to SortOrder.
Q: Is there any reasonable way - preferably portable to other SQL implementations - to do that without updating every row individually?
More info: I am using doubles, because that allows to trivially assign a new sort order without modifying other items (MIN-1 for before, MIN+1 for insert at end, and (A+B)/2 for inserting between A and B.) - this is limited of course by double resolution (~52 inserts in the worst case). I'm not sure yet if this is worth the additional checking for overflow, but I'd have the same problem with any other data type anyway.
The only other idea I came up with was simulating infinite resolution with strings and custom COLLATE and ADD/SUB/AVG functions. However, this seems immensely non-portable.
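The "~52 inserts in the worst case" figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: it repeatedly inserts at the midpoint right next to the same neighbour and counts how often that works before the midpoint collapses onto one of its neighbours. The starting values 1.0 and 2.0 and the function name inserts_until_exhausted are chosen for illustration; the exact count depends on the magnitude of the values involved.

```python
def inserts_until_exhausted(a, b):
    """Count how many times a new value fits strictly between a and its nearest neighbour."""
    count = 0
    while True:
        mid = (a + b) / 2
        if mid == a or mid == b:
            # No representable double is left between the two neighbours.
            return count
        b = mid  # the new item becomes a's nearest neighbour
        count += 1


worst_case = inserts_until_exhausted(1.0, 2.0)
```

For neighbours 1.0 and 2.0 this yields 52, matching the 52 explicit mantissa bits of an IEEE 754 double. A check like mid == a or mid == b is also a cheap way to detect when a full renumbering of the group is due.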
I don't believe that SQLite has a trivial way of doing this.
What you may be better looking at (if this really is a concern) is sticking with integer sorting and implementing an algorithm that maintains the sequence.
If, for example, all changes to sorting involve moving an item up/down one place, you just need to swap the sort orders of both values...
UPDATE table
SET sort_order = CASE WHEN sort_order = 3 THEN 4 ELSE 3 END
WHERE sort_order IN (3,4)
Or if you can move an item to any position, use something like...
UPDATE table
SET sort_order = CASE WHEN sort_order = @old THEN @new ELSE sort_order + 1 END
WHERE sort_order >= @new AND sort_order <= @old
I'm aware that you said that you don't want to update other items each time, but this has actually proven to be pretty efficient in most cases for me in the past.
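The second UPDATE above only covers moving an item towards the front (new position smaller than the old one). A hedged sketch that handles both directions with one statement is shown below, using Python's standard-library sqlite3 module. The items table, its columns, and the function names are hypothetical, and sort_order is assumed to have no unique constraint, since values overlap briefly while the statement runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT PRIMARY KEY, sort_order INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)],
)
conn.commit()


def move_item(conn, old, new):
    """Move the item at position old to position new, shifting the items in between."""
    lo, hi = min(old, new), max(old, new)
    # Moving towards the front pushes the others back by one, and vice versa.
    shift = 1 if new < old else -1
    with conn:
        conn.execute(
            """
            UPDATE items
               SET sort_order = CASE WHEN sort_order = :old THEN :new
                                     ELSE sort_order + :shift END
             WHERE sort_order BETWEEN :lo AND :hi
            """,
            {"old": old, "new": new, "shift": shift, "lo": lo, "hi": hi},
        )


def current_order(conn):
    return [r[0] for r in conn.execute("SELECT name FROM items ORDER BY sort_order")]


move_item(conn, 4, 2)  # "d" moves from position 4 to position 2
order_after_up = current_order(conn)
move_item(conn, 1, 5)  # "a" moves from position 1 to the end
order_after_down = current_order(conn)
```

Only the rows between the old and the new position are touched, which is why this tends to stay cheap even though more than one row is updated.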

Should I use a unique ID for a row in a junction table?

I am using SQL Server 2008.
A while back, I asked the question "should I use RecordID in a junction table". The tables would look like this:
// Images
ImageID// PK
// Persons
PersonID // pk
// Images_Persons
RecordID // pk
ImageID // fk
PersonID // fk
I was strongly advised NOT to use RecordID because it's useless in a table where the two IDs create a unique combination, meaning there will be no duplicate records.
Now, I am trying to find a random record in the junction table to create a quiz. I want to pull the first id and see if someone can match the second id. Specifically, I grab a random image and display it with three possible choices of persons.
The following query works, but I've read quite a bit of negativity that suggests that it's very slow. My database might have 10,000 records, so I don't think that matters much. I've also read that the values generated aren't truly random.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Should I add the RecordID column or not? Is there a better way to find a random record in this case?
Previous questions for reference
Should I use "RecordID" as a column name?
SQL - What is the best table design to store people as musicians and artists?
NEWID is random enough and probably best
10k rows is peanuts
You don't need a surrogate key for a junction (link, many-many) table
Edit: in case you want to prematurely optimise...
You could ignore this and read these from @Mitch Wheat. But with just 10k rows your development time will be longer than any saved execution time.
Efficiently select random rows from large resultset with LINQ (ala TABLESAMPLE)
Efficiently randomize (shuffle) data in Sql Server table
Personally, I don't think that having the RecordID column should be advised AGAINST. Rather I'd advise that often it is UNNECESSARY.
There are cases where having a single value to identify a row makes for simpler code. But they're at the cost of additional storage, often additional indexes, etc. The overheads realistically are small, but so are the benefits.
In terms of the selection of random records, the existence of a single unique identifier can make the task easier if the identifiers are both sequential and consecutive.
The reason I say this is because your proposed solution requires the assignment of NEWID() to every record, and the sorting of all records to find the first one. As the table size grows this operation grows, and can become relatively expensive. Whether it's expensive enough to be worth optimising depends on whatever else is happening, how often, etc.
Where there are sequential consecutive unique identifiers, however, one can then choose a random value between MIN(id) and MAX(id), and then SEEK that value out. The requirement that all values are consecutive, however, is often a constraint too far; you're never allowed to delete a value mid-table, for example...
To overcome this, and depending on indexes, you may find the following approach useful.
DECLARE
@max_id INT
-- Count the rows so a random position between 1 and the row count can be chosen
SELECT
@max_id = COUNT(*)
FROM
Images_Persons
SELECT
*
FROM
(
SELECT
*,
ROW_NUMBER() OVER (ORDER BY ImageID, PersonID) AS id
FROM
Images_Persons
)
AS data
WHERE
data.id = CAST(@max_id * RAND() + 1 AS INT)
-- Assuming that `ImageID, PersonID` is the clustered index.
A downside here is that RAND() is notoriously poor at being truly random. Yet it is normally perfectly suitable if executed at a random time relative to any other call to RAND().
Consider what you've got.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Not truly random? Excluding the 'truly random is impossible' bit, you're probably right - I believe that there are patterns in generated uniqueidentifiers. But you should test this yourself. It'd be simple; just create a table with 1 to 100 in it, order by newid() a lot of times, and look at the results. If it's random 'enough' for you (which it probably will be, for a quiz) then it's good enough.
Very slow? I wouldn't worry about that. I'd be very surprised if the newid() is slower than reading the record from the table. But again, test and benchmark.
I'd be happy with the solution you have, pending tests if you're concerned about it.
I've always used order by newid().

Using a database table as a queue

I want to use a database table as a queue. I want to insert in it and take elements from it in the inserted order (FIFO). My main consideration is performance because I have thousands of these transactions each second. So I want to use a SQL query that gives me the first element without searching the whole table. I do not remove a row when I read it.
Does SELECT TOP 1 ..... help here?
Should I use any special indexes?
I'd use an IDENTITY field as the primary key to provide the uniquely incrementing ID for each queued item, and stick a clustered index on it. This would represent the order in which the items were queued.
To keep the items in the queue table while you process them, you'd need a "status" field to indicate the current status of a particular item (e.g. 0=waiting, 1=being processed, 2=processed). This is needed to prevent an item from being processed twice.
When processing items in the queue, you'd need to find the next item in the table NOT currently being processed. This would need to be in such a way so as to prevent multiple processes picking up the same item to process at the same time as demonstrated below. Note the table hints UPDLOCK and READPAST which you should be aware of when implementing queues.
e.g. within a sproc, something like this:
DECLARE @NextID INTEGER
BEGIN TRANSACTION
-- Find the next queued item that is waiting to be processed
SELECT TOP 1 @NextID = ID
FROM MyQueueTable WITH (UPDLOCK, READPAST)
WHERE Status = 0
ORDER BY ID ASC
-- If we've found one, mark it as being processed
IF @NextID IS NOT NULL
UPDATE MyQueueTable SET Status = 1 WHERE ID = @NextID
COMMIT TRANSACTION
-- If we've got an item from the queue, return it to whatever is going to process it
IF @NextID IS NOT NULL
SELECT * FROM MyQueueTable WHERE ID = @NextID
If processing an item fails, do you want to be able to try it again later? If so, you'll need to either reset the status back to 0 or something. That will require more thought.
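For readers who want to experiment with the claim-and-release cycle without a SQL Server instance, here is a minimal Python sketch using the standard-library sqlite3 module. SQLite has no UPDLOCK or READPAST hints, so this sketch swaps in BEGIN IMMEDIATE, which takes the write lock up front and serializes competing consumers instead of letting them skip locked rows. The Payload column and the function names claim_next and release are hypothetical.

```python
import sqlite3

WAITING, PROCESSING, PROCESSED = 0, 1, 2

# isolation_level=None puts the connection in autocommit mode so that
# transactions can be opened explicitly with BEGIN IMMEDIATE.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE MyQueueTable ("
    "ID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "Payload TEXT, "
    "Status INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO MyQueueTable (Payload) VALUES (?)", [("first",), ("second",)]
)


def claim_next(conn):
    """Mark the oldest waiting item as being processed and return (ID, Payload), or None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT ID, Payload FROM MyQueueTable WHERE Status = ? ORDER BY ID LIMIT 1",
            (WAITING,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE MyQueueTable SET Status = ? WHERE ID = ?", (PROCESSING, row[0])
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def release(conn, item_id):
    """Put a failed item back in the waiting state so it can be retried."""
    conn.execute("UPDATE MyQueueTable SET Status = ? WHERE ID = ?", (WAITING, item_id))


first = claim_next(conn)
second = claim_next(conn)
release(conn, first[0])  # pretend processing of the first item failed
retried = claim_next(conn)
empty = claim_next(conn)
```

Resetting the status to waiting is the simplest retry policy; a real queue would probably also want a retry counter or a timestamp so that a permanently failing item does not loop forever.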
Alternatively, don't use a database table as a queue, but something like MSMQ - just thought I'd throw that in the mix!
If you do not remove your processed rows, then you are going to need some sort of flag that indicates that a row has already been processed.
Put an index on that flag, and on the column you are going to order by.
Partition your table over that flag, so the dequeued transactions are not clogging up your queries.
If you would really get 1.000 messages every second, that would result in 86.400.000 rows a day. You might want to think of some way to clean up old rows.
Everything depends on your database engine/implementation.
For me, simple queue tables with the following columns:
id / task / priority / date_added
usually work.
I used priority and task to group tasks, and in the case of a duplicated task I chose the one with the higher priority.
And don't worry - for modern databases "thousands" is nothing special.
This will not be any trouble at all as long as you use something to keep track of the datetime of the insert. See here for the mysql options. The question is whether you only ever need the absolute most recently submitted item or whether you need to iterate. If you need to iterate, then what you need to do is grab a chunk with an ORDER BY statement, loop through, and remember the last datetime so that you can use that when you grab your next chunk.
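The "grab a chunk, remember where you stopped" loop described above can be sketched as follows in Python with the standard-library sqlite3 module. This sketch remembers the last id rather than the last datetime, an assumption made because two rows can share the same timestamp while ids cannot tie. The queue table and the function name read_chunks are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO queue (payload) VALUES (?)", [(f"msg-{i}",) for i in range(1, 8)]
)
conn.commit()


def read_chunks(conn, chunk_size):
    """Yield the queue in insertion order, chunk_size rows at a time."""
    last_id = 0  # position of the last row handed out so far
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM queue WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last row of this chunk


chunks = list(read_chunks(conn, 3))
```

Filtering on id > last_id lets each chunk start directly at the right place in the primary-key index instead of rescanning rows that were already read.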
Perhaps adding LIMIT 1 to your select statement would help, forcing the return after a single match.
Since you don't delete the records from the table, you need to have a composite index on (processed, id), where processed is the column that indicates if the current record had been processed.
The best thing would be creating a partitioned table for your records and make the PROCESSED field the partitioning key. This way, you can keep three or more local indexes.
However, if you always process the records in id order, and have only two states, updating the record would mean just taking the record from the first leaf of the index and appending it to the last leaf
The currently processed record would always have the least id of all unprocessed records and the greatest id of all processed records.
Create a clustered index over a date (or autoincrement) column. This will keep the rows in the table roughly in index order and allow fast index-based access when you ORDER BY the indexed column. Using TOP X (or LIMIT X, depending on your RDBMS) will then only retrieve the first X items from the index.
Performance warning: you should always review the execution plans of your queries (on real data) to verify that the optimizer doesn't do unexpected things. Also try to benchmark your queries (again on real data) to be able to make informed decisions.
I had the same general question of "how do I turn a table into a queue" and couldn't find the answer I wanted anywhere.
Here is what I came up with for Node/SQLite/better-sqlite3.
Basically just modify the inner WHERE and ORDER BY clauses for your use case.
const crypto = require("crypto");
// `status` is assumed to be an object of status constants defined elsewhere in the application.
module.exports.pickBatchInstructions = (db, batchSize) => {
const runId = crypto.randomBytes(8).toString("hex"); // Create a unique batch identifier
// better-sqlite3 runs SQL through prepared statements; values are bound as parameters
const pickBatch = db.prepare(`
UPDATE
instructions
SET
status = ?,
run_id = ?,
mdate = datetime(datetime(), 'localtime')
WHERE
id IN (SELECT id
FROM instructions
WHERE
status is not ?
and run_id is null
ORDER BY
length(targetpath), id
LIMIT ?)
`);
// Change the status and set the run id
pickBatch.run(status.INSTRUCTION_INPROGRESS, runId, status.INSTRUCTION_COMPLETE, batchSize);
const getInstructions = db.prepare(`
SELECT
*
FROM
instructions
WHERE
run_id = ?
`);
return getInstructions.all(runId); // Get all rows with this batch id
};
A very easy solution for this in order not to have transactions, locks etc is to use the change tracking mechanisms (not data capture). It utilizes versioning for each added/updated/removed row so you can track what changes happened after a specific version.
So, you persist the last version and query the new changes.
If a query fails, you can always go back and query data from the last version.
Also, if you don't want to get all the changes with one query, you can get the top n ordered by version and store the greatest version you have got, to use when you query again.
See this for example Using Change Tracking in SQL Server 2008