Get random data from SQL Server but no repeated values - sql

I need to get 10 random rows from table at each time, but rows shall never repeat when I repeat the query.
But if I get all rows it will repeat again from one, like table has 20 rows, at first time I get 10 random rows, 2nd time I will need to get remaining 10 rows and at my 3rd query I need to get 10 rows randomly.
Currently my query for getting 10 rows randomly:
SELECT TOP 10 *
FROM tablename
ORDER BY NEWID()
But MSDN suggest this query
SELECT TOp 10 * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
For good performance. But this query not return constant rows. Could you please suggest something on this

Since required outcome of your second query depends on the (random) outcome of the first query, the querying cannot be stateless. You'll need to store the state (info about the previous query/queries) somewhere, somehow.
The simplest solution would probably be storing the already-retrieved rows or their IDs in a temporary table and then querying ... where id not in (select id from temp_table) in the second query.

As Jiri Tousek said, each query that you run has to know what previous queries returned.
Instead of inserting the IDs of previously returned rows in a table and then checking that new result is not in that table yet, I'd simply add a column to the table with the random number that would define a new random order of rows.
You populate this column with random numbers once.
This will remember the random order of rows and make it stable, so all you need to remember between your queries is how many random rows you have requested so far. Then just fetch as many rows as needed starting from where you stopped in the previous query.
Add a column RandomNumber binary(8) to the table. You can choose a different size. 8 bytes should be enough.
Populate it with random numbers. Once.
UPDATE tablename
SET RandomNumber = CRYPT_GEN_RANDOM(8)
Create an index on RandomNumber column. Unique index. If it turns out that there are repeated random numbers (which is unlikely for 20,000 rows and random numbers 8 bytes long), then re-generate random numbers (run the UPDATE statement once again) until all of them are unique.
Request first 10 random rows:
SELECT TOP(10) *
FROM tablename
ORDER BY RandomNumber
As you process/use these 10 random rows remember the last used random number. The best way to do it depends on how you process these 10 random rows.
DECLARE #VarLastRandomNumber binary(8);
SET #VarLastRandomNumber = ...
-- the random number from the last row returned by the previous query
Request next 10 random rows:
SELECT TOP(10) *
FROM tablename
WHERE RandomNumber > #VarLastRandomNumber
ORDER BY RandomNumber
Process them and remember the last used random number.
Repeat. As a bonus you can request different number of random rows on each iteration (it doesn't have to be 10 each time).

what I would do is have two new fields, SELECTED (int) and TimesSelected (integer) then
UPDATE tablename SET SELECTED = 0;
WITH CTE AS (SELECT TOP 10 *
FROM tablename
ORDER BY TimesSelected ASC, NEWID())
UPDATE CTE SET SELECTED = 1, TimesSelected = TimesSelected + 1;
SELECT * from tablename WHERE SELECTED = 1;
so if you use that each time, once selected a record goes to the top of the pile, and records below it are selected randomly.
you might want to put an index on SELECTED and do
UPDATE tablename SET SELECTED = 0 WHERE SELECTED = 1; -- for performance

The most elegant solution, provided you do the consecutive queries within a certain amount of time, would be to use a cursor:
DECLARE rnd_cursor CURSOR FOR
SELECT col1, col2, ...
FROM tablename
ORDER BY NEWID();
OPEN rnd_cursor;
FETCH NEXT FROM rnd_cursor; -- Repeat ten times
Keep the cursor open and just keep fetching rows as you need them. Close the cursor when you're done:
CLOSE rnd_cursor;
DEALLOCATE rnd_cursor;
As for the second part of your question, once you fetched the last row, open a new cursor:
IF ##FETCH_STATUS <> 0
BEGIN
CLOSE rnd_cursor;
OPEN rnd_cursor;
END;

Related

How to select random entry from table?

I need to display a random last name of a person who entered into an employment contract in a specified month using the rand function
go
CREATE OR ALTER function [dbo].[User_Surname]
(#mont int)
returns nvarchar(50)
begin
Declare #surname nvarchar(50)
Set #surname = (Select top(1) surname from dbo.Tenants
inner join dbo.lease_agreements on Tenants.tenant_code = lease_agreements.tenant_code
where MONTH(lease_agreements.rental_start_date) = #mont and dbo.Tenants.tenant_code = (select * from randNumber))
return #surname
end
go
select dbo.User_Surname (1)
create or alter view randNumber as
Select FLOOR((RAND() * (MAX(tenant_code + 1) - 1)) + 1) as value from Tenants
So what if tenant #42 has been removed? If the random number function returns 42, then your query will yield nothing.
To fix this problem, an approach which would be quite difficult to correctly implement would involve a row-sequence-number column which is an integer which sequentially increments and does not contain gaps. In order to avoid a gap when a row is deleted, you must pick the last row from the table and give it the row-sequence-number of the deleted column. Consistently doing so without ever forgetting to do it seems like a tough proposition. Achieving this without concurrency problems when rows are being concurrently deleted also seems like a tough proposition. Furthermore, the possibility that the last row may be re-sequenced means that you cannot use an SQL SEQUENCE for issuing row sequence numbers, or that your RDBMS must support the ability to count-down on a sequence, which is a tough proposition.
A better approach would be to create a random number N between zero and the number of rows instead of the maximum row id number, and then to pick the Nth row from the table. That would be something like SELECT BOTTOM 1 FROM (SELECT TOP N FROM...
An SQL-only solution (involving no stored procedures) would be very inefficient. It would involve joining the table of interest with the random-number function, (just real random numbers between 0.0 and 1.0,) essentially creating a new table which also contains a random number field, then using ORDER BY the random field, and then using TOP 1 to get the first row. To achieve this, your RDBMS would be performing a full table scan, and creating an entire new sorted temporary table, and it would be doing that each time you ask for a row at random, so it would be preposterously inefficient.
A performance improvement on the above idea would be to permanently add the random number column to each row, (and to issue a new random number between 0.0 and 1.0 to each row later inserted,) and then use a SEQUENCE for issuing sequential row index numbers, so that each time you want a new random row you pick the next number N from the sequence, you compute its modulus by the number of rows in the table, and you get the Nth row from the table sorted by random-number-column. It will probably be a good idea to make that random number column indexed. The problem with this approach is that it does not truly yield records at random, it yields all records in random order. Truly yielding records at random means that the same row might be yielded twice in two successive queries. This approach will only yield a record again once all other records have first been yielded.
As you want only one tenant, use 'ORDEr BY RAND()`
As always with randomness, you could alos get 100 times the same Tennant, especially when you have only a small number of tennants that fit the bill.
This will never be fast as the table needs to be full scnanned
but at least you should have an index on (tenant_code ,rental_start_date) so that it will be faster to select the correct tennants
CREATE OR ALTER function [dbo].[User_Surname]
(#mont int)
returns nvarchar(50)
begin
Declare #surname nvarchar(50)
Set #surname = (Select top(1) surname from dbo.Tenants
inner join dbo.lease_agreements on Tenants.tenant_code = lease_agreements.tenant_code
where MONTH(lease_agreements.rental_start_date) = #mont
ORDER By RAND())
return #surname
end

how to get all row count of a table from rowcount after having top row filter

I have a huge table in my database and a stored procedure accessing it, which needs pagination.
To achieve this I want total records of the table, and for that, I am facing performance issue because for doing that I need to run this query twice:
First time to get count for all records
Secondly when I need to select records in that page range
Is there any way I can avoid the first query for getting the total count instead of I can use row count or something else?
One way to do it would be something like this:
SELECT
(your list of columns),
COUNT(*) OVER ()
FROM
dbo.YourTable
ORDER BY
(whatever column you want to order by)
OFFSET x ROWS FETCH NEXT y ROWS ONLY;
With the OFFSET / FETCH, you retrieve only a page of data - and the COUNT(*) OVER() will give you the total count of the rows in the table - all in a single query

Is there a way to select different rows each time avoiding ORDER BY clause?

I have a table with approximately 100 million rows (TABLE_A), I need to select 6 millons different rows each query, once the entire table is selected, the process ends. TABLE_A does not have index or primary key, and ORDER BY is very expensive in terms of time, also I don't need any order here, just different rows. I have tried to order using ROWID, according to this,
They are the fastest way to access a single row.
This query works but takes about 5 minutes (I would like to avoid this order by)
SELECT * FROM TABLE_A ORDER BY ROWID
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
This query works faster but has no sense since ROWNUM, according to this
returns a number indicating the order in which Oracle selects the row
from a table
SELECT * FROM TABLE_A ORDER BY ROWNUM asc
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
As expected, same query returns different results each time.
This query seems to be conceptually better.
SELECT * FROM TABLE_A WHERE ROWID >= 6000000 AND ROWID <12000000;
But it can't be done in this way, ROWID (UROWID Datatype) has values like AAAZIUAHWAAC6XcAAI
So, Is there a way to select different rows avoiding order? and just call the rows using some kind of internal ID, maybe a direction in the storage or maybe a default order. The whole approach was likely wrong, so I'm open to radical changes.
I've also tried somethig like this
SELECT * FROM TABLE_A
WHERE dbms_rowid.rowid_block_number(rowid)
BETWEEN 2049 AND 261281;
it's surprisingly fast but unfortunately a row could have more than one block number.
Based on your last comment, some things to look at:
DBMS_PARALLEL_EXECUTE
If you are going through 100 million rows, the best place to process them is on the database itself. If your processing is done with PL/SQL, then dbms_parallel_execute can manage most of the parallelisation for you, and carve up the rows.
ROWID ranges
Even if you don't process the rows on the database, you can use DBMS_PARALLEL_EXECUTE to produce the rowid ranges for you. Then use those start-end pairs as inputs to whatever app you are using to do the processing
simple MOD
Each instance of your app gets an ID from 0 to 'n-1' and each issues a query
select *
from (
select rownum r, m.* from my_table
)
where mod(r,"n") = :x
where x is that app's ID. If you already have a numeric sequence column of some sort that is reasonably distributed, you can substitute that in for the rownum

I am using select top 5 from(select * from tbl2) tbl, will it get all records from tbl2 or will it get only specific records i.e 5

I am using select top 5 from(select * from tbl2) tbl, will it get all records from tbl2 and then it will it get only specific records i.e 5? or do it only get 5 records from internal memory. suppose we have 1000 records in tbl2.
For future reference, SQL actually has a huge amount of amazing documentation/resources online. For instance: This google search.
As to your question, it will pull the top five results matching your criteria, so it depends. It'll go through, and find the first five results matching your criteria. If those are the last results it goes through, it'll still have to do the comparisons and filtering on all rows, the only difference will be that it will have to send less rows to your computer.
For example, let's say we have a table my_table with two columns: Customer_ID (which is unique) and First_Purchase_Date (which is a date value between 2015-01-01 and 2017-07-26). If we simply do SELECT TOP 5 * FROM my_table then it will go through and pull the first five rows it finds, without looking at the rest of the rows. On the other hand, if we do SELECT TOP 5 * FROM my_table WHERE First_Purchase_Date = '2017-05-17' then it will have to go through all the rows until it can find five rows with a First_Purchase_Date of 2017-05-17. If First_Purchase_Date is indexed, this should not be very expensive, as it'll more or less know where to look. If it's not, then it depends on your how SQL has decided to structure your table, and if it has created any useful statistics. Worst case, it could not have any statistics and the desired rows could be the last five in the database, in which case it will have to complete the comparison on all the rows in the database.
By the way, this is a somewhat poor idea, as the columns returned will not necessarily stay consistent over time. It may be a good idea to throw in an ORDER BY clause, to ensure you get the same records every time.
The SELECT TOP clause is used to specify the number of records to return.
It limits the rows returned in a query result set to a specified number of rows or percentage of rows. When TOP is used in conjunction with the ORDER BY clause, the result set is limited to the first N number of ordered rows; otherwise, it returns the first N number of rows in an undefined order.
SELECT TOP number|percent column_name(s)
FROM table_name
WHERE condition;

Stop at first match in a sqlite query

I have a sqlite3 database worth 4GB and 400k rows. The id column is a consecutive number starting at 1. If I query the following
select * from game where id = 1
After printing the first match the query continues until it reach the 400k row, thus taking a few seconds to finish the query.
How do I make the query stop at the first match?
Or how do I go directly to specific row since id and rowcount are the same?
Just add a LIMIT 1 to your query:
SELECT * FROM game WHERE id = 1 LIMIT 1;