mysql count performance - sql

select count(*) from mytable;
select count(table_id) from mytable; -- table_id is the primary key
Both queries were running slow on a table with 10 million rows.
I am wondering why, since wouldn't it be easy for MySQL to keep a counter that gets updated on every insert, update and delete?
And is there a way to improve this query? I used EXPLAIN but it didn't help much.

Take a look at the following blog posts:
1) COUNT(*) vs COUNT(col)
2) Easy MySQL Performance Tips
3) Fast count(*) for InnoDB
btw, which engine do you use?
EDITED: A technique to speed up the count when you only need to know whether a certain number of rows exists. Sorry, I was wrong with my original query. So, when you just need to know whether there are, e.g., at least 300 rows matching a specific condition, you can try a subquery:
select count(*) FROM
( select 1 FROM _table_ WHERE _conditions_ LIMIT 300 ) AS result
First you reduce the result set, and then you count the result; it will still scan the result set, but you can limit it (once more, this works when the question to the DB is "are there more or fewer than 300 rows?"), and if the DB contains more than 300 rows that satisfy the condition, this query is faster.
Test results (my table has 6.7 million rows):
1) SELECT count(*) FROM _table_ WHERE START_DATE > '2011-02-01'
returns 4.2 million in 65.4 seconds
2) SELECT count(*) FROM ( select 1 FROM _table_ WHERE START_DATE > '2011-02-01' LIMIT 100 ) AS result
returns 100 in 0.03 seconds
Below is the result of the EXPLAIN query to see what is going on there:
EXPLAIN SELECT count(*) FROM ( select 1 FROM _table_ WHERE START_DATE > '2011-02-01' LIMIT 100 ) AS result

As cherouvim pointed out in the comments, it depends on the storage engine.
MyISAM does keep a count of the table rows, and can keep it accurate since the only locking MyISAM supports is table-level locking.
InnoDB, however, supports transactions and needs to do a table scan to count the rows.
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
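If an approximate count is good enough for InnoDB, the table statistics can be read instead of scanning the table. A minimal sketch (the estimate can be off by a fair margin, since InnoDB only maintains a rough row count):
SELECT table_rows
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'mytable';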

Related

Query using Rownum and order by clause does not use the index

I am using Oracle (Enterprise Edition 10g) and I have a query like this:
SELECT * FROM (
SELECT * FROM MyTable
ORDER BY MyColumn
) WHERE rownum <= 10;
MyColumn is indexed; however, Oracle is for some reason doing a full table scan before it cuts off the first 10 rows. So for a table with 4 million records the above takes around 15 seconds.
Now consider this equivalent query:
SELECT MyTable.*
FROM
(SELECT rid
FROM
(SELECT rowid as rid
FROM MyTable
ORDER BY MyColumn
)
WHERE rownum <= 10
)
INNER JOIN MyTable
ON MyTable.rowid = rid
ORDER BY MyColumn;
Here Oracle scans the index and finds the top 10 rowids, and then uses nested loops to find the 10 records by rowid. This takes less than a second for a 4-million-row table.
My first question is: why is the optimizer making such an apparently bad decision for the first query above?
And my second and most important question: is it possible to make the first query perform better? I have a specific need to use the first query as unmodified as possible. I am looking for something simpler than my second query above. Thank you!
Please note that for particular reasons I am unable to use the /*+ FIRST_ROWS(n) */ hint, or the ROW_NUMBER() OVER (ORDER BY column) construct.
If this is acceptable in your case, adding a WHERE ... IS NOT NULL clause will help the optimizer to use the index instead of doing a full table scan when using an ORDER BY clause:
SELECT * FROM (
SELECT * FROM MyTable
WHERE MyColumn IS NOT NULL
-- ^^^^^^^^^^^^^^^^^^^^
ORDER BY MyColumn
) WHERE rownum <= 10;
The rationale is that Oracle does not store NULL values in the index. As your query was originally written, the optimizer decided to do a full table scan, because if there were fewer than 10 non-NULL values it would have to retrieve some "NULL rows" to "fill in" the remaining ones. Apparently it is not smart enough to check first whether the index contains enough rows...
With the added WHERE MyColumn IS NOT NULL, you inform the optimizer that you don't want, under any circumstances, any row having NULL in MyColumn. So it can blindly use the index without worrying about hypothetical rows having NULL in MyColumn.
For the same reason, declaring the ORDER BY column as NOT NULL should prevent the optimizer from doing a full table scan. So, if you can change the schema, a cleaner option would be:
ALTER TABLE MyTable MODIFY (MyColumn NOT NULL);
See http://sqlfiddle.com/#!4/e3616/1 for various comparisons (click on view execution plan)

Update n random rows in SQL

I have a table which has about 1000 rows. I have to update a column ("X") in the table to 'Y' for n random rows. For this I can use the following query:
update xyz set X='Y' where m in (
SELECT m FROM (SELECT m
FROM xyz
order by dbms_random.value
) RNDM
where rownum < n+1);
Is there a more efficient way to write this query? The table has no index.
Please help.
I would use the ROWID:
UPDATE xyz SET x='Y' WHERE rowid IN (
SELECT r FROM (
SELECT ROWID r FROM xyz ORDER BY dbms_random.value
) RNDM WHERE rownum < n+1
)
The actual reason I would use ROWID isn't for efficiency though (it will still do a full table scan) - your SQL may not update the number of rows you want if column m isn't unique.
With only 1000 rows, you shouldn't really be worried about efficiency (maybe with a hundred million rows). Without any index on this table, you're stuck doing a full table scan to select random records.
[EDIT:] "But what if there are 100,000 rows"
Well, that's still 3 orders of magnitude less than 100 million.
I ran the following:
create table xyz as select * from all_objects;
[created about 50,000 rows on my system - non-indexed, just like your table]
UPDATE xyz SET owner='Y' WHERE rowid IN (
SELECT r FROM (
SELECT ROWID r FROM xyz ORDER BY dbms_random.value
) RNDM WHERE rownum < 10000
);
commit;
This took approximately 1.5 seconds. Maybe it was 1 second, maybe up to 3 seconds (didn't formally time it, it just took about enough time to blink).
You can improve performance by replacing the full table scan with a sample.
The first problem you run into is that you can't use SAMPLE in a DML subquery, ORA-30560: SAMPLE clause not allowed. But logically this is what is needed:
UPDATE xyz SET x='Y' WHERE rowid IN (
SELECT r FROM (
SELECT ROWID r FROM xyz sample(0.15) ORDER BY dbms_random.value
) RNDM WHERE rownum < 100/*n*/+1
);
You can get around this by using a collection to store the rowids, and then update the rows using the rowid collection. Normally breaking a query into separate parts and gluing them together with PL/SQL leads to horrible performance. But in this case you can still save a lot of time by significantly reducing the amount of data read.
declare
type rowid_nt is table of rowid;
rowids rowid_nt;
begin
--Get the rowids
SELECT r bulk collect into rowids
FROM (
SELECT ROWID r
FROM xyz sample(0.15)
ORDER BY dbms_random.value
) RNDM WHERE rownum < 100/*n*/+1;
--update the table
forall i in 1 .. rowids.count
update xyz set x = 'Y'
where rowid = rowids(i);
end;
/
I ran a simple test with 100,000 rows (on a table with only two columns), and N = 100.
The original version took 0.85 seconds, @Gerrat's answer took 0.7 seconds, and the PL/SQL version took 0.015 seconds.
But that's only one scenario, I don't have enough information to say my answer will always be better. As N increases the sampling advantage is lost, and the writing will be more significant than the reading. If you have a very small amount of data, the PL/SQL context switching overhead in my answer may make it slower than @Gerrat's solution.
For performance issues, the size of the table in bytes is usually much more important than the size in rows. 1000 rows that use a terabyte of space is much larger than 100 million rows that only use a gigabyte.
Here are some problems to consider with my answer:
Sampling does not always return exactly the percent you asked for. With 100,000 rows and a 0.15% sample size the number of rows returned was 147, not 150. That's why I used 0.15 instead of 0.10. You need to over-sample a little bit to ensure that you get more than N. How much do you need to over-sample? I have no idea, you'll probably have to test it and pick a safe number.
You need to know the approximate number of rows to pick the percent.
The percent must be a literal, so as the number of rows and N change, you'll need to use dynamic SQL to change the percent (see the sketch below).
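A hedged sketch of that dynamic-SQL variant, reusing the xyz table and the rowid-collection idea from above; it assumes the dictionary statistics in user_tables are reasonably current, and the over-sampling factor (150 rows' worth for N = 100) is a guess you would need to tune:
declare
  type rowid_nt is table of rowid;
  rowids rowid_nt;
  v_rows    number;
  v_percent varchar2(30);
begin
  --approximate row count from the data dictionary (assumes statistics exist)
  select num_rows into v_rows from user_tables where table_name = 'XYZ';
  --over-sample, clamped to a valid SAMPLE percentage
  v_percent := to_char(round(greatest(0.0001, least(99.9999, 100 * 150 / v_rows)), 4));
  --the sample percent must appear as a literal, hence the string concatenation
  execute immediate
    'select r from (select rowid r from xyz sample(' || v_percent || ') ' ||
    'order by dbms_random.value) where rownum < 101'
    bulk collect into rowids;
  --update the sampled rows
  forall i in 1 .. rowids.count
    update xyz set x = 'Y' where rowid = rowids(i);
end;
/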
The following solution works just fine. It's performant and seems to be similar to sample():
create table t1 as
select level id, cast ('item'||level as varchar2(32)) item
from dual connect by level<=100000;
Table T1 created.
update t1 set item='*'||item
where exists (
select rnd from (
select dbms_random.value() rnd
from t1
) t2 where t2.rowid = t1.rowid and rnd < 0.15
);
14,858 rows updated.
Elapsed: 00:00:00.717
Note that the alias rnd must be included in the select clause. Otherwise the optimizer changes the filter predicate from RND < 0.15 to DBMS_RANDOM.VALUE() < 0.15, in which case dbms_random.value will be executed only once.
As mentioned in @JonHeller's answer, the best solution remains the PL/SQL code block, because it avoids a full table scan. Here is my suggestion:
create or replace type rowidListType is table of varchar(18);
/
create or replace procedure updateRandomly (prefix varchar2 := '*') is
rowidList rowidListType;
begin
select rowidtochar (rowid) bulk collect into rowidList
from t1 sample(15)
;
update t1 set item=prefix||item
where exists (
select 1 from table (rowidList) t2
where chartorowid(t2.column_value) = t1.rowid
);
dbms_output.put_line ('updated '||sql%rowcount||' rows.');
end;
/
begin updateRandomly; end;
/
Elapsed: 00:00:00.293
updated 14892 rows.

total number of rows of a query

I have a very large query that is supposed to return only the top 10 results:
select top 10 ProductId from .....
The problem is that I also want the total number of results that match the criteria without the 'top 10', but at the same time it's considered unacceptable to return all rows (we are talking about roughly 100 thousand results).
Is there a way to get the total number of rows affected by the previous query, either in it or afterwards, without running it again?
PS: please no temp tables of 100 000 rows :))
dump the count in a variable and return that
declare @count int
select @count = count(*) from ..... --same where clause as your query
--now you add that to your query..of course it will be the same for every row..
select top 10 ProductId, @count as TotalCount from .....
Assuming that you're using an ORDER BY clause already (to properly define which results are the "TOP 10"), you could add a call to ROW_NUMBER as well, with the opposite sort order, and pick the highest value returned.
E.g., the following:
select top 10 *,ROW_NUMBER() OVER (order by id desc) from sysobjects order by ID
has a final column with values 2001, 2000, 1999, etc., descending. And the following:
select COUNT(*) from sysobjects
Confirms that there are 2001 rows in sysobjects.
I suppose you could hack it with a union select
select top 10 ... from ... where ...
union
select count(*) from ... where ...
For you to get away with this type of hack you will need to add fake columns to the count query so that it returns the same number of columns as the main query. For example:
select top 10 id, first_name from people
union
select count(*), '' as first_name from people
I don't recommend using this solution. Using two separate queries is how it should be done.
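For completeness, the two-query version of the same example might look like this (a sketch; both statements can be sent in a single batch, so there is still only one round trip):
select top 10 id, first_name from people order by id;
select count(*) from people;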
Generally speaking, no - the reasoning is as follows:
If(!) the query planner can make use of TOP 10 to return only 10 rows, then the RDBMS will not even know the exact number of rows that satisfy the full criteria; it just gets the TOP 10.
Therefore, when you want to find out the count of all rows satisfying the criteria, you are not running it a second time, but a first time.
Having said that, proper indexes might make both queries execute pretty fast.
Edit
MySQL has SQL_CALC_FOUND_ROWS, which returns the number of rows the query would have returned if no LIMIT had been applied - googling for an equivalent in MS SQL points to an analytical SQL and CTE variant; see this forum (though I'm not sure that either would qualify as running it only once, but feel free to check - and let us know).
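For reference, a sketch of both variants (the table and WHERE clause are placeholders; note that SQL_CALC_FOUND_ROWS still makes MySQL compute the full result internally, and it is deprecated in recent MySQL versions):
-- MySQL
SELECT SQL_CALC_FOUND_ROWS ProductId FROM ..... WHERE ..... LIMIT 10;
SELECT FOUND_ROWS();
-- SQL Server: window aggregate repeated on each of the 10 rows
SELECT TOP 10 ProductId, COUNT(*) OVER () AS TotalCount
FROM ..... WHERE ..... ORDER BY ProductId;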

How to select first 'N' records from a database containing million records?

I have an Oracle database populated with a million records. I am trying to write a SQL query that returns the first 'N' sorted records (say 100 records) from the database based on a certain condition.
SELECT *
FROM myTable
Where SIZE > 2000
ORDER BY NAME DESC
Then programmatically select the first N records.
The problems with this approach are:
1) The query results in half a million records, and "ORDER BY NAME" causes all of them to be sorted on NAME in descending order. This sorting takes a lot of time (nearly 30-40 seconds; if I omit ORDER BY, it takes only 1 second).
2) After the sort I am interested in only the first N (100) records, so sorting the complete result set is not useful.
My questions are:
1) Is it possible to specify 'N' in the query itself (so that the sort applies to only N records and the query becomes faster)?
2) Is there a better way in SQL to sort only N elements and return them quickly?
If your purpose is to find 100 random rows and sort them afterwards, then Lasse's solution is correct. If, as I think, you want the first 100 rows sorted by name while discarding the others, you would build a query like this:
SELECT *
FROM (SELECT *
FROM myTable
WHERE SIZE > 2000 ORDER BY NAME DESC)
WHERE ROWNUM <= 100
The optimizer will understand that it is a TOP-N query and will be able to use an index on NAME. It won't have to sort the entire result set, it will just start at the end of the index and read it backwards and stop after 100 rows.
You could also add a hint to your original query to let the optimizer understand that you are interested in the first rows only. This will probably generate a similar access path:
SELECT /*+ FIRST_ROWS*/* FROM myTable WHERE SIZE > 2000 ORDER BY NAME DESC
Edit: just adding AND rownum <= 100 to the query won't work, since in Oracle rownum is assigned before sorting: this is why you have to use a subquery. Without the subquery, Oracle would select 100 random rows and then sort them.
This shows how to pick the top N rows depending on your version of Oracle.
From Oracle 9i onwards, the RANK() and DENSE_RANK() functions can be used to determine the TOP N rows. Examples:
Get the top 10 employees based on their salary:
SELECT ename, sal
FROM ( SELECT ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
       FROM emp )
WHERE sal_rank <= 10;
Select the employees making the top 10 salaries:
SELECT ename, sal
FROM ( SELECT ename, sal, DENSE_RANK() OVER (ORDER BY sal DESC) sal_dense_rank
       FROM emp )
WHERE sal_dense_rank <= 10;
The difference between the two is explained here
Add this:
AND rownum <= 100
to your WHERE-clause.
However, this won't do what you're asking.
If you want to pick 100 random rows, sort those, and then return them, you'll have to formulate a query without the ORDER BY first, then limit that to 100 rows, then select from that and sort.
This could work, but unfortunately I don't have an Oracle server available to test:
SELECT *
FROM (
SELECT *
FROM myTable
WHERE SIZE > 2000
AND rownum <= 100
) x
ORDER BY NAME DESC
But note the "random" part there, you're saying "give me 100 rows with SIZE > 2000, I don't care which 100".
Is that really what you want?
And no, you won't actually get a random result, in the sense that it'll change each time you query the server, but you are at the mercy of the query optimizer. If the data load and index statistics for that table change over time, at some point you might get different data than you did on the previous query.
Your problem is that the sort is being done every time the query is run. You can eliminate the sort operation by using an index - the optimiser can use an index to eliminate a sort operation - if the sorted column is declared NOT NULL.
(If the column is nullable, it is still possible, by either (a) adding a NOT NULL predicate to the query, or (b) adding a function-based index and modifying the ORDER BY clause accordingly).
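For example (a sketch; the index name is made up, and option (a), the NOT NULL predicate, is used):
CREATE INDEX mytable_name_ix ON myTable (NAME);
SELECT *
FROM (SELECT *
      FROM myTable
      WHERE SIZE > 2000
        AND NAME IS NOT NULL   -- lets the optimizer read the index backwards instead of sorting
      ORDER BY NAME DESC)
WHERE ROWNUM <= 100;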
Just for reference, in Oracle 12c, this task can be done using FETCH clause. You can see here for examples and additional reference links regarding this matter.
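A sketch of the 12c form, applied to the query from this question:
SELECT *
FROM myTable
WHERE SIZE > 2000
ORDER BY NAME DESC
FETCH FIRST 100 ROWS ONLY;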

How to request a random row in SQL?

How can I request a random row (or as close to truly random as is possible) in pure SQL?
See this post: SQL to Select a random row from a database table. It goes through methods for doing this in MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2 and Oracle (the following is copied from that link):
Select a random row with MySQL:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
Select a random row with PostgreSQL:
SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
Select a random row with Microsoft SQL Server:
SELECT TOP 1 column FROM table
ORDER BY NEWID()
Select a random row with IBM DB2
SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY
Select a random record with Oracle:
SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1
Solutions like Jeremie's:
SELECT * FROM table ORDER BY RAND() LIMIT 1
work, but they need a sequential scan of the whole table (because the random value associated with each row needs to be calculated so that the smallest one can be determined), which can be quite slow even for medium-sized tables. My recommendation would be to use some kind of indexed numeric column (many tables have these as their primary keys), and then write something like:
SELECT * FROM table WHERE num_value >= RAND() *
( SELECT MAX (num_value ) FROM table )
ORDER BY num_value LIMIT 1
This works in logarithmic time, regardless of the table size, if num_value is indexed. One caveat: this assumes that num_value is equally distributed in the range 0..MAX(num_value). If your dataset strongly deviates from this assumption, you will get skewed results (some rows will appear more often than others).
I don't know how efficient this is, but I've used it before:
SELECT TOP 1 * FROM MyTable ORDER BY newid()
Because GUIDs are pretty random, the ordering means you get a random row.
ORDER BY NEWID()
takes 7.4 milliseconds
WHERE num_value >= RAND() * (SELECT MAX(num_value) FROM table)
takes 0.0065 milliseconds!
I will definitely go with the latter method.
You didn't say which server you're using. In older versions of SQL Server, you can use this:
select top 1 * from mytable order by newid()
In SQL Server 2005 and up, you can use TABLESAMPLE to get a random sample that's repeatable:
SELECT FirstName, LastName
FROM Contact
TABLESAMPLE (1 ROWS) ;
For SQL Server
newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.
TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
              / CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.
When run against a table with 1,000,000 rows, here are my results:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.
If possible, use prepared statements to avoid the inefficiency of both indexes on RND() and creating a record-number field.
PREPARE RandomRecord FROM "SELECT * FROM table LIMIT ?,1";
SET @n=FLOOR(RAND()*(SELECT COUNT(*) FROM table));
EXECUTE RandomRecord USING @n;
The best way is putting a random value in a new column just for that purpose, and using something like this (pseudo code + SQL):
randomNo = random()
execSql("SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo")
This is the solution employed by the MediaWiki code. Of course, there is some bias against smaller values, but they found that it was sufficient to wrap the random value around to zero when no rows are fetched.
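A rough MySQL sketch of that idea (the column and index names are made up; a nightly job could refresh the column):
ALTER TABLE MyTable ADD COLUMN randomness DOUBLE;
CREATE INDEX ix_mytable_randomness ON MyTable (randomness);
UPDATE MyTable SET randomness = RAND();   -- MySQL evaluates RAND() per row here
SET @randomNo = RAND();
SELECT * FROM MyTable WHERE randomness > @randomNo ORDER BY randomness LIMIT 1;
-- if that returned nothing, wrap around to zero
SELECT * FROM MyTable WHERE randomness >= 0 ORDER BY randomness LIMIT 1;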
The newid() solution may require a full table scan so that each row can be assigned a new GUID, which will be much less performant.
The rand() solution may not work at all (e.g. with MSSQL) because the function will be evaluated just once, and every row will be assigned the same "random" number.
For SQL Server 2005 and 2008, if we want a random sample of individual rows (from Books Online):
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
I'm late, but I got here via Google, so for the sake of posterity, I'll add an alternative solution.
Another approach is to use TOP twice, with alternating orders. I don't know if it is "pure SQL", because it uses a variable in the TOP, but it works in SQL Server 2008. Here's an example I use against a table of dictionary words, if I want a random word.
SELECT TOP 1
word
FROM (
SELECT TOP(@idx)
word
FROM
dbo.DictionaryAbridged WITH(NOLOCK)
ORDER BY
word DESC
) AS D
ORDER BY
word ASC
Of course, @idx is some randomly-generated integer that ranges from 1 to COUNT(*) on the target table, inclusive. If your column is indexed, you'll benefit from it too. Another advantage is that you can use it in a function, since NEWID() is disallowed there.
Lastly, the above query runs in about 1/10 of the execution time of a NEWID()-type query on the same table. YMMV.
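As noted, NEWID() cannot be called inside a user-defined function, but the TOP-twice pattern can, as long as the random index is passed in as a parameter. A hedged sketch (the function name and return type are made up):
CREATE FUNCTION dbo.RandomWordAt (@idx int)
RETURNS nvarchar(128)
AS
BEGIN
    RETURN (
        SELECT TOP 1 word
        FROM (SELECT TOP (@idx) word
              FROM dbo.DictionaryAbridged
              ORDER BY word DESC) AS D
        ORDER BY word ASC
    );
END
The caller still generates @idx (between 1 and COUNT(*)) outside the function.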
Instead of using RAND(), as it is not encouraged, you may simply get the max ID (=Max):
SELECT MAX(ID) FROM TABLE;
get a random number between 1..Max (=My_Generated_Random)
My_Generated_Random = rand_in_your_programming_lang_function(1..Max);
and then run this SQL:
SELECT ID FROM TABLE WHERE ID >= My_Generated_Random ORDER BY ID LIMIT 1
Note that it will check for any rows whose IDs are EQUAL to or HIGHER than the chosen value.
It's also possible to hunt for the row downwards in the table, and get an ID equal to or lower than My_Generated_Random, by modifying the query like this:
SELECT ID FROM TABLE WHERE ID <= My_Generated_Random ORDER BY ID DESC LIMIT 1
As pointed out in @BillKarwin's comment on @cnu's answer...
When combining with a LIMIT, I've found that it performs much better (at least with PostgreSQL 9.1) to JOIN with a random ordering rather than to directly order the actual rows: e.g.
SELECT * FROM tbl_post AS t
JOIN ...
JOIN ( SELECT id, CAST(-2147483648 * RANDOM() AS integer) AS rand
FROM tbl_post
WHERE create_time >= 1349928000
) r ON r.id = t.id
WHERE create_time >= 1349928000 AND ...
ORDER BY r.rand
LIMIT 100
Just make sure that 'r' generates a 'rand' value for every possible key value in the complex query it is joined with, but still limit the number of rows of 'r' where possible.
The CAST as integer is especially helpful for PostgreSQL 9.2, which has specific sort optimisations for integer and single-precision floating types.
For MySQL, to get a random record:
SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
More detail http://jan.kneschke.de/projects/mysql/order-by-rand/
With SQL Server 2012+ you can use the OFFSET FETCH query to do this for a single random row
select * from MyTable ORDER BY id OFFSET n ROW FETCH NEXT 1 ROWS ONLY
where id is an identity column, and n is the row you want - calculated as a random number between 0 and COUNT(*)-1 of the table (offset 0 is the first row, after all).
This works with holes in the table data, as long as you have an index to work with for the ORDER BY clause. It's also very good for the randomness, as you work that out yourself to pass in, and the niggles of other methods are not present. In addition the performance is pretty good; on a smaller dataset it holds up well, though I've not tried serious performance tests against several million rows.
The random function from SQL could help. Also, if you would like to limit the result to just one row, just add that at the end.
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
For SQL Server and needing "a single random row"..
If not needing a true sampling, generate a random value [0, max_rows) and use the ORDER BY..OFFSET..FETCH from SQL Server 2012+.
This is very fast if the COUNT and ORDER BY are over appropriate indexes - such that the data is 'already sorted' along the query lines. If these operations are covered it's a quick request and does not suffer from the horrid scalability of using ORDER BY NEWID() or similar. Obviously, this approach won't scale well on a non-indexed HEAP table.
declare @rows int
select @rows = count(1) from t
-- Other issues if row counts in the bigint range..
-- This is also not 'true random', although such is likely not required.
declare @skip int = convert(int, @rows * rand())
select t.*
from t
order by t.id -- Make sure this is clustered PK or IX/UCL axis!
offset (@skip) rows
fetch first 1 row only
Make sure that the appropriate transaction isolation levels are used and/or account for 0 results.
For SQL Server and needing a "general row sample" approach..
Note: This is an adaptation of the answer as found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.
While a general sampling approach should be used with caution here, it's still potentially useful information in context of other answers (and the repetitious suggestions of non-scaling and/or questionable implementations). Such a sampling approach is less efficient than the first code shown and is error-prone if the goal is to find a "single random row".
Here is an updated and improved form of sampling a percentage of rows. It is based on the same concept of some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID() can never be stable/repeatable.
Does not use ORDER BY NEWID() of the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.
Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
Here is the gist. See this answer for additional details and notes.
Naïve try:
declare @sample_percent decimal(7, 4)
-- Looking at this value should be an indicator of why a
-- general sampling approach can be error-prone to select 1 row.
select @sample_percent = 100.0 / count(1) from t
-- BAD!
-- When choosing appropriate sample percent of "approximately 1 row"
-- it is very reasonable to expect 0 rows, which definitely fails the ask!
-- If choosing a larger sample size the distribution is heavily skewed forward,
-- and is very much NOT 'true random'.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
This can be largely remedied by a hybrid query, by mixing sampling and ORDER BY selection from the much smaller sample set. This limits the sorting operation to the sample size, not the size of the original table.
-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t
declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
when @rows <= 1000 then 100 -- not enough rows
when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
else 100.0 * @sample_size / @rows -- everything else
end
-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
SELECT * FROM table ORDER BY RAND() LIMIT 1
Most of the solutions here aim to avoid sorting, but they still need to make a sequential scan over a table.
There is also a way to avoid the sequential scan by switching to an index scan. If you know the index value of your random row you can get the result almost instantly. The problem is - how to guess an index value.
The following solution works on PostgreSQL 8.4:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
limit 1;
In the above solution you guess 10 various random index values in the range 0 .. [last value of id].
The number 10 is arbitrary - you may use 100 or 1000 as it (amazingly) doesn't have a big impact on the response time.
There is also one problem - if you have sparse ids you might miss. The solution is to have a backup plan :) In this case, a plain old order by random() query. When combined it looks like this:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
union all (select * from cms_refs order by random() limit 1)
limit 1;
Note the UNION ALL clause. In this case, if the first part returns any data, the second one is NEVER executed!
You may also try using the newid() function.
Just write your query and use ORDER BY newid(). It's quite random.
Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.
For MS SQL:
Minimum example:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00
NewId() example:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02
NewId() is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.
Selection with Initial Seed:
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work.
In MSSQL (tested on 11.0.5569) using
SELECT TOP 100 * FROM employee ORDER BY CRYPT_GEN_RANDOM(10)
is significantly faster than
SELECT TOP 100 * FROM employee ORDER BY NEWID()
For Firebird:
Select FIRST 1 column from table ORDER BY RAND()
In SQL Server you can combine TABLESAMPLE with NEWID() to get pretty good randomness and still have speed. This is especially useful if you really only want 1, or a small number, of rows.
SELECT TOP 1 * FROM [table]
TABLESAMPLE (500 ROWS)
ORDER BY NEWID()
I have to agree with CD-MaN: Using "ORDER BY RAND()" will work nicely for small tables or when you do your SELECT only a few times.
I also use the "num_value >= RAND() * ..." technique, and if I really want to have random results I have a special "random" column in the table that I update once a day or so. That single UPDATE run will take some time (especially because you'll have to have an index on that column), but it's much faster than creating random numbers for every row each time the select is run.
Be careful because TableSample doesn't actually return a random sample of rows. It directs your query to look at a random sample of the 8KB pages that make up your row. Then, your query is executed against the data contained in these pages. Because of how data may be grouped on these pages (insertion order, etc), this could lead to data that isn't actually a random sample.
See: http://www.mssqltips.com/tip.asp?tip=1308
This MSDN page for TableSample includes an example of how to generate an actually random sample of data.
http://msdn.microsoft.com/en-us/library/ms189108.aspx
It seems that many of the ideas listed still use ordering.
However, if you use a temporary table, you are able to assign a random index (like many of the solutions have suggested), and then grab the first one that is greater than an arbitrary number between 0 and 1.
For example (for DB2):
WITH TEMP AS (
SELECT COLUMN, RAND() AS IDX FROM TABLE)
SELECT COLUMN FROM TEMP WHERE IDX > .5
FETCH FIRST 1 ROW ONLY
A simple and efficient way from http://akinas.com/pages/en/blog/mysql_random_row/
SET @i = (SELECT FLOOR(RAND() * COUNT(*)) FROM table); PREPARE get_stmt FROM 'SELECT * FROM table LIMIT ?, 1'; EXECUTE get_stmt USING @i;
There is a better solution for Oracle than using dbms_random.value, since ordering rows by dbms_random.value requires a full scan and is quite slow for large tables.
Use this instead:
SELECT *
FROM employee sample(1)
WHERE rownum=1
For SQL Server 2005 and above, extending @GreyPanther's answer for cases where num_value does not have continuous values. This also works for cases where the dataset is not evenly distributed and where num_value is not a number but a unique identifier.
WITH CTE_Table (SelRow, num_value)
AS
(
SELECT ROW_NUMBER() OVER(ORDER BY ID) AS SelRow, num_value FROM table
)
SELECT * FROM table Where num_value = (
SELECT TOP 1 num_value FROM CTE_Table WHERE SelRow >= RAND() * (SELECT MAX(SelRow) FROM CTE_Table)
)
select r.id, r.name from table AS r
INNER JOIN(select CEIL(RAND() * (select MAX(id) from table)) as id) as r1
ON r.id >= r1.id ORDER BY r.id ASC LIMIT 1
This requires less computation time.