How to select random entry from table? - sql

I need to display a random last name of a person who entered into an employment contract in a specified month, using the RAND function.
go
CREATE OR ALTER function [dbo].[User_Surname]
(@mont int)
returns nvarchar(50)
begin
Declare @surname nvarchar(50)
Set @surname = (Select top(1) surname from dbo.Tenants
inner join dbo.lease_agreements on Tenants.tenant_code = lease_agreements.tenant_code
where MONTH(lease_agreements.rental_start_date) = @mont and dbo.Tenants.tenant_code = (select * from randNumber))
return @surname
end
go
select dbo.User_Surname (1)
create or alter view randNumber as
Select FLOOR((RAND() * (MAX(tenant_code + 1) - 1)) + 1) as value from Tenants

So what if tenant #42 has been removed? If the random number function returns 42, then your query will yield nothing.
One way to fix this problem, although it would be quite difficult to implement correctly, is a row-sequence-number column: an integer that increments sequentially and never contains gaps. To avoid a gap when a row is deleted, you must pick the last row from the table and give it the row-sequence-number of the deleted row. Consistently doing so without ever forgetting seems like a tough proposition. Achieving this without concurrency problems when rows are being deleted concurrently also seems like a tough proposition. Furthermore, the possibility that the last row may be re-sequenced means that you cannot use an SQL SEQUENCE for issuing row sequence numbers, or that your RDBMS must support counting down on a sequence, which is a tough proposition.
A better approach would be to create a random number N between zero and the number of rows, instead of the maximum row id number, and then to pick the Nth row from the table. In pseudo-SQL, that would be something like SELECT BOTTOM 1 FROM (SELECT TOP N FROM...
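A minimal sketch of the pick-the-Nth-row idea in T-SQL, assuming SQL Server 2012 or later (for OFFSET ... FETCH) and the table and column names from the question. The variable names are hypothetical, and the statements are meant to run as an ordinary batch.

```sql
DECLARE @month int = 1;  -- hypothetical input: the month to filter on

-- Count the candidate rows first.
DECLARE @matches int = (
    SELECT COUNT(*)
    FROM dbo.Tenants AS t
    INNER JOIN dbo.lease_agreements AS la ON t.tenant_code = la.tenant_code
    WHERE MONTH(la.rental_start_date) = @month
);

-- RAND() returns a float between 0 and 1 (exclusive),
-- so this is an integer from 0 to @matches - 1.
DECLARE @skip int = FLOOR(RAND() * @matches);

-- Skip @skip rows in a fixed order and take the next one.
-- Returns no row when nothing matches the month.
SELECT t.surname
FROM dbo.Tenants AS t
INNER JOIN dbo.lease_agreements AS la ON t.tenant_code = la.tenant_code
WHERE MONTH(la.rental_start_date) = @month
ORDER BY t.tenant_code, la.rental_start_date
OFFSET @skip ROWS FETCH NEXT 1 ROWS ONLY;
```

Because the count is taken over the joined rows, a tenant with several leases starting in that month is counted once per lease and is therefore more likely to be picked.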
An SQL-only solution (involving no stored procedures) would be very inefficient. It would involve joining the table of interest with the random-number function, (just real random numbers between 0.0 and 1.0,) essentially creating a new table which also contains a random number field, then using ORDER BY the random field, and then using TOP 1 to get the first row. To achieve this, your RDBMS would be performing a full table scan, and creating an entire new sorted temporary table, and it would be doing that each time you ask for a row at random, so it would be preposterously inefficient.
A performance improvement on the above idea would be to permanently add the random number column to each row, (and to issue a new random number between 0.0 and 1.0 to each row later inserted,) and then use a SEQUENCE for issuing sequential row index numbers, so that each time you want a new random row you pick the next number N from the sequence, you compute its modulus by the number of rows in the table, and you get the Nth row from the table sorted by random-number-column. It will probably be a good idea to make that random number column indexed. The problem with this approach is that it does not truly yield records at random, it yields all records in random order. Truly yielding records at random means that the same row might be yielded twice in two successive queries. This approach will only yield a record again once all other records have first been yielded.

As you want only one tenant, use ORDER BY RAND(). (Note that in SQL Server RAND() is evaluated once per query rather than once per row, so there ORDER BY NEWID() is the usual way to get a per-row random order.)
As always with randomness, you could also get the same tenant 100 times in a row, especially when only a small number of tenants fit the bill.
This will never be fast, as the table needs to be fully scanned,
but you should at least have an index on (tenant_code, rental_start_date) so that selecting the matching tenants is faster.
CREATE OR ALTER function [dbo].[User_Surname]
(@mont int)
returns nvarchar(50)
begin
Declare @surname nvarchar(50)
Set @surname = (Select top(1) surname from dbo.Tenants
inner join dbo.lease_agreements on Tenants.tenant_code = lease_agreements.tenant_code
where MONTH(lease_agreements.rental_start_date) = @mont
ORDER By RAND())
return @surname
end
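One caveat with wrapping this in a scalar function: SQL Server rejects side-effecting built-ins such as RAND() and NEWID() inside a user-defined function. A minimal sketch of one workaround, assuming a stored procedure is acceptable in place of the function. The procedure name dbo.Get_Random_Surname and its parameter names are hypothetical; the tables and columns are the ones from the question.

```sql
CREATE OR ALTER PROCEDURE dbo.Get_Random_Surname
    @month   int,
    @surname nvarchar(50) OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- NEWID() produces a new value for every row, so the matching rows
    -- come back in a random order and TOP (1) keeps one of them.
    -- @surname is left unchanged (NULL) when no row matches.
    SELECT TOP (1) @surname = t.surname
    FROM dbo.Tenants AS t
    INNER JOIN dbo.lease_agreements AS la ON t.tenant_code = la.tenant_code
    WHERE MONTH(la.rental_start_date) = @month
    ORDER BY NEWID();
END;
GO

-- Usage: fetch a random surname for leases that started in January.
DECLARE @result nvarchar(50);
EXEC dbo.Get_Random_Surname @month = 1, @surname = @result OUTPUT;
SELECT @result AS surname;
```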

Related

Optimize SQL query with pagination

I have a query running against a SQL Server database that is taking over 10 seconds to execute. The table being queried has over 14 million rows.
I want to display the Text column from a Notes table by a given ServiceUserId in date order. There could be thousands of entries so I want to limit the returned values to a manageable level.
SELECT Text
FROM
(SELECT
ROW_NUMBER() OVER (ORDER BY [DateDone]) AS RowNum, Text
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2') AS RowConstrainedResult
WHERE
RowNum >= 40 AND RowNum < 60
ORDER BY
RowNum
Below is the execution plan for the above query.
Nonclustered Index - nonclustered index on the ServiceUserId and DateDone columns in ascending order.
Key lookup - Primary key for the table which is the NoteId
If I run the same query a second time but with different row numbers, I get a response in milliseconds, I assume from a cached execution plan. The same query run for a different ServiceUserId still takes ~10 seconds, though.
Any suggestions for how to speed up this query?
You should look into Keyset Pagination.
It is far more performant than Rowset Pagination.
It differs fundamentally in that, instead of referencing a particular block of row numbers, you reference a starting point at which to look up the index key.
The reason it is much faster is that you don't care how many rows come before a particular key; you just seek to a key and move forward (or backward).
Say you are filtering by a single ServiceUserId, ordering by DateDone. You need an index as follows (you could leave out the INCLUDE if it's too big, it doesn't change the maths very much):
create index IX_DateDone on Notes (ServiceUserId, DateDone) INCLUDE (TEXT);
Now, when you select some rows, instead of giving the start and end row numbers, give the starting key:
SELECT TOP (20)
Text,
DateDone
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
AND DateDone > @startingDate
ORDER BY
DateDone;
On the next run, you pass the last DateDone value you received. This gets you the next batch.
The one small downside is that you cannot jump pages. However, it is much rarer than some may think (from a UI perspective) for a user to want to jump to page 327. So that doesn't really matter.
The key must be unique. If it is not unique you can't seek to exactly the next row. If you need to use an extra column to guarantee uniqueness, it gets a little more complicated:
WITH NotesFiltered AS
(
SELECT * FROM Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
)
-- NoteId is selected throughout so that it can be used in the final ORDER BY
-- and passed back as the starting key for the next batch
SELECT TOP (20)
Text,
DateDone,
NoteId
FROM (
SELECT
Text,
DateDone,
NoteId,
0 AS ordering
FROM NotesFiltered
WHERE
DateDone = @startingDate AND NoteId > @startingNoteId
UNION ALL
SELECT
Text,
DateDone,
NoteId,
1 AS ordering
FROM NotesFiltered
WHERE
DateDone > @startingDate
) n
ORDER BY
ordering, DateDone, NoteId;
Side Note
In RDBMSs that support row-value comparisons, the multi-column example could be simplified back to the original code by writing:
WHERE (DateDone, NoteId) > (@startingDate, @startingNoteId)
Unfortunately SQL Server does not support this currently.
Please vote for the Azure Feedback request for this
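Where row-value comparison is unavailable, the predicate can also be expanded by hand into an equivalent OR form. This is a sketch rather than a tuned query: the data types and starting values of the two variables are assumptions, and it is worth checking the execution plan, since the optimizer may not turn the OR into a single index seek.

```sql
-- Assumed types and starting values; in practice these are the DateDone
-- and NoteId of the last row returned by the previous batch.
DECLARE @startingDate datetime2 = '2020-01-01';
DECLARE @startingNoteId int = 0;

SELECT TOP (20)
    NoteId,
    Text,
    DateDone
FROM Notes
WHERE ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
  -- Hand-expanded form of (DateDone, NoteId) > (@startingDate, @startingNoteId)
  AND (DateDone > @startingDate
       OR (DateDone = @startingDate AND NoteId > @startingNoteId))
ORDER BY DateDone, NoteId;
```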
I would suggest using ORDER BY with OFFSET ... FETCH:
It starts from row number x and fetches the next z rows, and both values can be parameterized.
SELECT
Text
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
Order by DateDone
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY
Also make sure you have a proper index for DateDone; if the index you already have on Notes does not cover it, consider adding it there.
You may also need to include the Text column in your index:
create index IX_DateDone on Notes(DateDone) INCLUDE (TEXT,ServiceUserId)
However, be aware that adding such a large column to the index will affect your insert/update performance, and of course it will need extra disk space.

How to create a queue like structure in SQL Server

Is there a good way to create a queue like structure in SQL Server?
Requirements:
When I insert rows, I want them to default to the bottom of the queue
When I select rows, I want to easily be able to get the top of the queue
Here's the tough one: I want to be able to easily move something up the queue, and reorient the rest. Example: move item 5 up to number 1, then 1-4 becomes 2-5
A simple identity column would work for requirements 1 and 2, but how would I handle 3?
Solution
I ended up implementing the solution from @roger-wolf
One difference, I used a trigger rather than a stored procedure to renumber. Here's my trigger code:
CREATE TRIGGER [dbo].[TR_Queue]
ON [dbo].[Queue]
AFTER INSERT, DELETE, UPDATE
AS
BEGIN
SET NOCOUNT ON;
-- Get the current max value in priority
DECLARE @maxPriority INT = COALESCE((SELECT MAX([priority]) FROM [dbo].[Queue]), 0);
WITH newValues AS (
-- Renumber by priority, starting at 1
SELECT [queueID]
,ROW_NUMBER() OVER(ORDER BY [priority] ASC) AS [priority]
FROM (
-- Pretend all nulls are greater than previous max priority
SELECT [queueID]
,COALESCE([priority], @maxPriority+1) AS [priority]
FROM [dbo].[Queue]
) AS tbl
)
UPDATE q
SET q.[priority] = newValues.[priority]
FROM [dbo].[Queue] AS q
INNER JOIN newValues
ON q.[queueID] = newValues.[queueID]
END
This works well for me as the queue is always relatively small and infrequently updated, so I don't have to worry about the performance of the trigger.
Use a float column for prioritisation and an approach similar to Celko trees:
If you have items with priorities 1, 2, and 3 and the last needs to become second, calculate an average between its new neighbours, 1.5 in this example;
If another one needs to become second, its priority would be 1.25. This can go on for quite a while;
When displaying queued items by their priority, use row_number() instead of float values in UI;
If items become too close together (say, 1e-10 or less), have a stored procedure ready to renumber them as integers.
The only deficiency I see here is that it becomes a bit more difficult to find N-th item in a middle of a queue, when it's neither first nor last. If you don't need that, the approach should work.
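A minimal sketch of the float-priority idea, using a hypothetical table dbo.QueueDemo with made-up rows. It moves the last of three items into second place by giving it the average of its new neighbours' priorities, then lists the queue with ROW_NUMBER() so the fractional values never reach the UI.

```sql
-- Hypothetical demo table; only the float priority column matters here.
CREATE TABLE dbo.QueueDemo (
    queueID  int IDENTITY(1, 1) PRIMARY KEY,
    payload  nvarchar(100) NOT NULL,
    priority float NOT NULL
);

INSERT INTO dbo.QueueDemo (payload, priority)
VALUES (N'first', 1), (N'second', 2), (N'third', 3);

-- Move 'third' between 'first' and 'second':
-- its new priority is the average of the two neighbours, (1 + 2) / 2 = 1.5.
DECLARE @before float = (SELECT priority FROM dbo.QueueDemo WHERE payload = N'first');
DECLARE @after  float = (SELECT priority FROM dbo.QueueDemo WHERE payload = N'second');

UPDATE dbo.QueueDemo
SET priority = (@before + @after) / 2.0
WHERE payload = N'third';

-- Show integer positions instead of the raw float values.
-- The order is now: first, third, second.
SELECT ROW_NUMBER() OVER (ORDER BY priority) AS position,
       payload
FROM dbo.QueueDemo
ORDER BY priority;
```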
You could add a Priority column of type DateTime, and when you set a row as a priority row you set the current date-time in the Priority column and then use that as part of your order by criteria?
I had a similar requirement in a past project. What I did (and it worked):
1. Add a column update_at_utc of type datetime2.
2. When inserting, set update_at_utc = GETUTCDATE().
3. When retrieving, order by update_at_utc.
4. When moving a row in the queue, for example between rows 3 and 4, simply take the average of update_at_utc of these rows and use it to set update_at_utc of the row being moved.
Note 1: Point 4 assumes that the frequency of inserts and of moving the rows up/down the queue is such that the datetime2 type has sufficient resolution. For example, if you insert 2 rows 1 millisecond apart, and then try to move 1000 rows between these 2 rows, then datetime2 resolution will be insufficient (https://learn.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql?view=sql-server-2017). In such a case, moving rows up/down the queue would need to be more complicated. When moving a row N places lower down:
1. Remember update_at_utc of the row N places lower down.
2. For all rows between the current and the new position: assign the row's update_at_utc to the preceding row's update_at_utc.
3. Assign update_at_utc of the row being moved to the date remembered in point 1 above.
Note 2: I suggest UTC dates instead of local dates to avoid issues during a daylight saving switch.

Get random data from SQL Server but no repeated values

I need to get 10 random rows from a table each time, and rows must never repeat when I repeat the query.
Once all rows have been returned, it should start over. For example, if the table has 20 rows, the first query gets 10 random rows, the second query must get the remaining 10 rows, and the third query gets 10 random rows again.
Currently my query for getting 10 rows randomly:
SELECT TOP 10 *
FROM tablename
ORDER BY NEWID()
But MSDN suggests this query
SELECT TOP 10 * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
for good performance. But this query does not return constant rows. Could you please suggest something on this?
Since the required outcome of your second query depends on the (random) outcome of the first query, the querying cannot be stateless. You'll need to store the state (info about the previous query/queries) somewhere, somehow.
The simplest solution would probably be storing the already-retrieved rows or their IDs in a temporary table and then querying ... where id not in (select id from temp_table) in the second query.
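A minimal sketch of that bookkeeping, under a few assumptions that are not in the question: tablename has a unique integer key column named id, and the history is kept in a permanent table (hypothetically named dbo.AlreadyReturned) rather than a session-scoped temporary table, so that it survives between connections.

```sql
-- Run once: hypothetical table that remembers which rows were already handed out.
CREATE TABLE dbo.AlreadyReturned (id int NOT NULL PRIMARY KEY);
GO

-- Run for every request.
DECLARE @batch TABLE (id int NOT NULL PRIMARY KEY);

-- Pick up to 10 random rows that have not been returned yet.
INSERT INTO @batch (id)
SELECT TOP (10) t.id
FROM tablename AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.AlreadyReturned AS r WHERE r.id = t.id)
ORDER BY NEWID();

-- Everything has been returned: forget the history and start a new round.
IF NOT EXISTS (SELECT 1 FROM @batch)
BEGIN
    DELETE FROM dbo.AlreadyReturned;

    INSERT INTO @batch (id)
    SELECT TOP (10) t.id
    FROM tablename AS t
    ORDER BY NEWID();
END;

-- Remember this batch, then return the full rows.
INSERT INTO dbo.AlreadyReturned (id)
SELECT id FROM @batch;

SELECT t.*
FROM tablename AS t
INNER JOIN @batch AS b ON b.id = t.id;
```

When fewer than 10 unreturned rows remain, this sketch returns just the remainder, and the next request starts a fresh round.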
As Jiri Tousek said, each query that you run has to know what previous queries returned.
Instead of inserting the IDs of previously returned rows in a table and then checking that new result is not in that table yet, I'd simply add a column to the table with the random number that would define a new random order of rows.
You populate this column with random numbers once.
This will remember the random order of rows and make it stable, so all you need to remember between your queries is how many random rows you have requested so far. Then just fetch as many rows as needed starting from where you stopped in the previous query.
Add a column RandomNumber binary(8) to the table. You can choose a different size. 8 bytes should be enough.
Populate it with random numbers. Once.
UPDATE tablename
SET RandomNumber = CRYPT_GEN_RANDOM(8)
Create an index on RandomNumber column. Unique index. If it turns out that there are repeated random numbers (which is unlikely for 20,000 rows and random numbers 8 bytes long), then re-generate random numbers (run the UPDATE statement once again) until all of them are unique.
Request first 10 random rows:
SELECT TOP(10) *
FROM tablename
ORDER BY RandomNumber
As you process/use these 10 random rows remember the last used random number. The best way to do it depends on how you process these 10 random rows.
DECLARE @VarLastRandomNumber binary(8);
SET @VarLastRandomNumber = ...
-- the random number from the last row returned by the previous query
Request next 10 random rows:
SELECT TOP(10) *
FROM tablename
WHERE RandomNumber > @VarLastRandomNumber
ORDER BY RandomNumber
Process them and remember the last used random number.
Repeat. As a bonus you can request different number of random rows on each iteration (it doesn't have to be 10 each time).
What I would do is add two new fields, SELECTED (int) and TimesSelected (int), then:
UPDATE tablename SET SELECTED = 0;
WITH CTE AS (SELECT TOP 10 *
FROM tablename
ORDER BY TimesSelected ASC, NEWID())
UPDATE CTE SET SELECTED = 1, TimesSelected = TimesSelected + 1;
SELECT * from tablename WHERE SELECTED = 1;
If you use that each time, a record's TimesSelected count goes up once it has been selected, so it sorts behind the records that have been selected fewer times, and records with the same count are picked randomly.
You might want to put an index on SELECTED and do
UPDATE tablename SET SELECTED = 0 WHERE SELECTED = 1; -- for performance
The most elegant solution, provided you do the consecutive queries within a certain amount of time, would be to use a cursor:
DECLARE rnd_cursor CURSOR FOR
SELECT col1, col2, ...
FROM tablename
ORDER BY NEWID();
OPEN rnd_cursor;
FETCH NEXT FROM rnd_cursor; -- Repeat ten times
Keep the cursor open and just keep fetching rows as you need them. Close the cursor when you're done:
CLOSE rnd_cursor;
DEALLOCATE rnd_cursor;
As for the second part of your question, once you fetched the last row, open a new cursor:
IF @@FETCH_STATUS <> 0
BEGIN
CLOSE rnd_cursor;
OPEN rnd_cursor;
END;

Can SQL return different results for two runs of the same query using ORDER BY?

I have the following table:
CREATE TABLE dbo.TestSort
(
Id int NOT NULL IDENTITY (1, 1),
Value int NOT NULL
)
The Value column could (and is expected to) contain duplicates.
Let's also assume there are already 1000 rows in the table.
I am trying to prove a point about unstable sorting.
Given this query that returns a 'page' of 10 results from the first 1000 inserted results:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value
My intuition tells me that two runs of this query could return different rows if the Value column contains repeated values.
I'm basing this on the facts that:
the sort is not stable
if new rows are inserted in the table between the two runs of the query, it could possibly create a re-balancing of B-trees (the Value column may be indexed or not)
EDIT: For completeness: I assume rows never change once inserted, and are never deleted.
In contrast, a query with stable sort (ordering also by Id) should always return the same results, since IDs are unique:
SELECT TOP 10 * FROM TestSort WHERE Id <= 1000 ORDER BY Value, Id
The question is: Is my intuition correct? If yes, can you provide an actual example of operations that would produce different results (at least "on your machine")? You could modify the query, add indexes on the Values column etc.
I don't care about the exact query, but about the principle.
I am using MS SQL Server (2014), but am equally satisfied with answers for any SQL database.
If not, then why?
Your intuition is correct. In SQL, the sort for order by is not stable. So, if you have ties, they can be returned in any order. And, the order can change from one run to another.
The documentation sort of explains this:
Using OFFSET and FETCH as a paging solution requires running the query
one time for each "page" of data returned to the client application.
For example, to return the results of a query in 10-row increments,
you must execute the query one time to return rows 1 to 10 and then
run the query again to return rows 11 to 20 and so on. Each query is
independent and not related to each other in any way. This means that,
unlike using a cursor in which the query is executed once and state is
maintained on the server, the client application is responsible for
tracking state. To achieve stable results between query requests using
OFFSET and FETCH, the following conditions must be met:
The underlying data that is used by the query must not change. That is, either the rows touched by the query are not updated or all
requests for pages from the query are executed in a single transaction
using either snapshot or serializable transaction isolation. For more
information about these transaction isolation levels, see SET
TRANSACTION ISOLATION LEVEL (Transact-SQL).
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
Although this specifically refers to offset/fetch, it clearly applies to running the query multiple times without those clauses.
If you have ties when ordering, the ORDER BY is not stable.
LiveDemo
CREATE TABLE #TestSort
(
Id INT NOT NULL IDENTITY (1, 1) PRIMARY KEY,
Value INT NOT NULL
) ;
DECLARE @c INT = 0;
WHILE @c < 100000
BEGIN
INSERT INTO #TestSort(Value)
VALUES (2);
SET @c += 1;
END
Example:
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
DBCC DROPCLEANBUFFERS; -- run to clear cache
SELECT TOP 10 *
FROM #TestSort
ORDER BY Value
OPTION (MAXDOP 4);
The point is that I force the query optimizer to use a parallel plan, so there is no guarantee that it will read the data sequentially, as a clustered index scan probably would when no parallelism is involved.
You cannot be sure how the query optimizer will read the data unless you explicitly force the result to be sorted in a specific way using ORDER BY Id, Value.
For more info read No Seatbelt - Expecting Order without ORDER BY.
I think this post will answer your question:
Is SQL order by clause guaranteed to be stable ( by Standards)
The result is the same every time when you are in a single-threaded environment. Once multi-threading is used, there is no such guarantee.

Working with sequential numbers in SQL Server 2005 without cursors

I'm currently working on a project that needs to have a process that assigns "control numbers" to some records. This also needs to be able to be run at a later date and include records without a control number that changed, and assign an unused control number to these records. These control numbers are preassigned by an outside entity and are 9 digits long. You would usually get a range depending on how many estimated records your company will generate. For example one of the companies estimated they would need 50, so they assigned us the range 790123401 to 790123450.
The problem: right now I'm using cursors to assign these numbers. For each individual record, I go and check if the first number in the sequence is already taken in the table, if it is, I increment the number, and recheck. This check goes on and on for each record in the table. One of the companies has 17,000 records, which means that for each of the records, I could be potentially iterating at worst 17,000 times if all numbers have been taken.
I really don't mind all the repetition on the initial run since the first run will assign control numbers to a lot of records. My problem is that if later a record gets changed and now should have a control number associated with it, then re-running the process would mean it would go through each available number until I get an unused one.
I've seen numerous examples on how to use sequences without cursors, but most are specific to Oracle. I'm using SQL Server 2005 for this particular project.
Suggestions?
You are looking for all unassigned numbers in a range? If so, you can outer join onto a numbers table. The example below uses a CTE to create one on the fly, but I would suggest a permanent one containing at least 17,000 numbers if that is the maximum size of your range.
DECLARE @StartRange int, @EndRange int
SET @StartRange = 790123401
SET @EndRange = 790123450;
WITH YourTable(ControlNumber) AS
(
SELECT 790123401 UNION ALL
SELECT 790123402 UNION ALL
SELECT 790123403 UNION ALL
SELECT 790123406
),
Nums(N) AS
(
SELECT @StartRange
UNION ALL
SELECT N+1
FROM Nums
WHERE N < @EndRange
)
SELECT N
FROM Nums
WHERE NOT EXISTS(SELECT *
FROM YourTable
WHERE ControlNumber = N )
OPTION (MAXRECURSION 0)
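Once the unused numbers can be listed, they can also be handed out in one set-based statement instead of a cursor. The sketch below pairs the n-th record that lacks a control number with the n-th free number using ROW_NUMBER(), which is available in SQL Server 2005. The table and column names (dbo.Records, RecordId, ControlNumber, and a permanent numbers table dbo.Numbers with column N) are hypothetical.

```sql
DECLARE @StartRange int, @EndRange int;
SET @StartRange = 790123401;
SET @EndRange = 790123450;

WITH FreeNumbers AS
(
    -- Numbers in the assigned range that no record uses yet, numbered 1, 2, 3, ...
    SELECT n.N,
           ROW_NUMBER() OVER (ORDER BY n.N) AS rn
    FROM dbo.Numbers AS n
    WHERE n.N BETWEEN @StartRange AND @EndRange
      AND NOT EXISTS (SELECT *
                      FROM dbo.Records AS r
                      WHERE r.ControlNumber = n.N)
),
Pending AS
(
    -- Records still waiting for a control number, numbered the same way.
    SELECT r.RecordId,
           ROW_NUMBER() OVER (ORDER BY r.RecordId) AS rn
    FROM dbo.Records AS r
    WHERE r.ControlNumber IS NULL
)
-- Give the n-th pending record the n-th free number.
UPDATE r
SET r.ControlNumber = f.N
FROM dbo.Records AS r
INNER JOIN Pending AS p ON p.RecordId = r.RecordId
INNER JOIN FreeNumbers AS f ON f.rn = p.rn;
```

If there are more pending records than free numbers, the surplus records simply keep a NULL control number, which makes an exhausted range easy to detect afterwards.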