Update n random rows in SQL - sql

I have table which is having about 1000 rows.I have to update a column("X") in the table to 'Y' for n ramdom rows. For this i can have following query
update xyz set X='Y' when m in (
'SELECT m FROM (SELECT m
FROM xyz
order by dbms_random.value
) RNDM
where rownum < n+1);
Is there another efficient way to write this query. The table has no index.
Please help?

I would use the ROWID:
UPDATE xyz SET x='Y' WHERE rowid IN (
SELECT r FROM (
SELECT ROWID r FROM xyz ORDER BY dbms_random.value
) RNDM WHERE rownum < n+1
)
The actual reason I would use ROWID isn't for efficiency though (it will still do a full table scan) - your SQL may not update the number of rows you want if column m isn't unique.
With only 1000 rows, you shouldn't really be worried about efficiency (maybe with a hundred million rows). Without any index on this table, you're stuck doing a full table scan to select random records.
[EDIT:] "But what if there are 100,000 rows"
Well, that's still 3 orders of magnitude less than 100 million.
I ran the following:
create table xyz as select * from all_objects;
[created about 50,000 rows on my system - non-indexed, just like your table]
UPDATE xyz SET owner='Y' WHERE rowid IN (
SELECT r FROM (
SELECT ROWID r FROM xyz ORDER BY dbms_random.value
) RNDM WHERE rownum < 10000
);
commit;
This took approximately 1.5 seconds. Maybe it was 1 second, maybe up to 3 seconds (didn't formally time it, it just took about enough time to blink).

You can improve performance by replacing the full table scan with a sample.
The first problem you run into is that you can't use SAMPLE in a DML subquery, ORA-30560: SAMPLE clause not allowed. But logically this is what is needed:
UPDATE xyz SET x='Y' WHERE rowid IN (
SELECT r FROM (
SELECT ROWID r FROM xyz sample(0.15) ORDER BY dbms_random.value
) RNDM WHERE rownum < 100/*n*/+1
);
You can get around this by using a collection to store the rowids, and then update the rows using the rowid collection. Normally breaking a query into separate parts and gluing them together with PL/SQL leads to horrible performance. But in this case you can still save a lot of time by significantly reducing the amount of data read.
declare
type rowid_nt is table of rowid;
rowids rowid_nt;
begin
--Get the rowids
SELECT r bulk collect into rowids
FROM (
SELECT ROWID r
FROM xyz sample(0.15)
ORDER BY dbms_random.value
) RNDM WHERE rownum < 100/*n*/+1;
--update the table
forall i in 1 .. rowids.count
update xyz set x = 'Y'
where rowid = rowids(i);
end;
/
I ran a simple test with 100,000 rows (on a table with only two columns), and N = 100.
The original version took 0.85 seconds, #Gerrat's answer took 0.7 seconds, and the PL/SQL version took 0.015 seconds.
But that's only one scenario, I don't have enough information to say my answer will always be better. As N increases the sampling advantage is lost, and the writing will be more significant than the reading. If you have a very small amount of data, the PL/SQL context switching overhead in my answer may make it slower than #Gerrat's solution.
For performance issues, the size of the table in bytes is usually much more important than the size in rows. 1000 rows that use a terabyte of space is much larger than 100 million rows that only use a gigabyte.
Here are some problems to consider with my answer:
Sampling does not always return exactly the percent you asked for. With 100,000 rows and a 0.15% sample size the number of rows returned was 147, not 150. That's why I used 0.15 instead of 0.10. You need to over-sample a little bit to ensure that you get more than N. How much do you need to over-sample? I have no idea, you'll probably have to test it and pick a safe number.
You need to know the approximate number of rows to pick the percent.
The percent must be a literal, so as the number of rows and N change, you'll need to use dynamic SQL to change the percent.

The following solution works just fine. It's performant and seems to be similar to sample():
create table t1 as
select level id, cast ('item'||level as varchar2(32)) item
from dual connect by level<=100000;
Table T1 created.
update t1 set item='*'||item
where exists (
select rnd from (
select dbms_random.value() rnd
from t1
) t2 where t2.rowid = t1.rowid and rnd < 0.15
);
14,858 rows updated.
Elapsed: 00:00:00.717
Consider that alias rnd must be included in select clause. Otherwise changes the omptimizer the filter predicat from RND<0.1 to DBMS_RANDOM.VALUE()<0.1. In that case dbms_random.value will be executed only once.
As mentioned in answer #JonHeller, the best solution remains the pl/sql code block because it allows to avoid full table scan. Here is my suggestion:
create or replace type rowidListType is table of varchar(18);
/
create or replace procedure updateRandomly (prefix varchar2 := '*') is
rowidList rowidListType;
begin
select rowidtochar (rowid) bulk collect into rowidList
from t1 sample(15)
;
update t1 set item=prefix||item
where exists (
select 1 from table (rowidList) t2
where chartorowid(t2.column_value) = t1.rowid
);
dbms_output.put_line ('updated '||sql%rowcount||' rows.');
end;
/
begin updateRandomly; end;
/
Elapsed: 00:00:00.293
updated 14892 rows.

Related

Sampling a large number of rows from a table

I want to extract a roughly 5 million row sample from a table that will contain somewhere between 10 million and 20 million rows.
Due to the large number of rows, efficiency is key. As such, I am trying to avoid sorting the rows where possible, hence why I am avoiding the dbms_random.value solution that I have seen in similar questions.
I tried to do something like the following:
SELECT *
FROM full_table
SAMPLE (CEIL(100 * 5000000 / (SELECT COUNT(*) FROM full_table)));
However, I can't seem to do arithmetic in the SAMPLE clause (ORA-00933: SQL command not properly ended - I tried this with a simple SAMPLE(10/2) and still get the same thing).
Is this a reasonable approach, and, if so, how do I calculate the number of rows in the sample clause?
You could use PL/SQL with dynamic SQL like this:
declare
cnt integer;
begin
select count(*) into cnt from full_table;
dbms_output.put_line(cnt);
execute immediate
'insert into target_table'
||' select * from full_tablesample (' || ceil(100 * 5000000/cnt) || ')';
end;
If you need roughly 5 million rows, you can do:
SELECT ft.*
FROM full_table ft cross join
(SELECT COUNT(*) as cnt FROM full_table) x
WHERE dbms_random.value * cnt < 5000000;
This uses simple arithmetic to determine which rows go into the table. The result should be pretty close to 5,000,000.

Efficient repeated sampling with replacement of a table in a PostgreSQL alike?

I'm trying to check the distribution of numbers in a column of a table. Rather than calculate on the entire table (which is large - tens of gigabytes) I want to estimate via repeated sampling. I think the typical Postgres method for this is
select COLUMN
from TABLE
order by RANDOM()
limit 1;
but this is slow for repeated sampling, especially since (I suspect) it manipulates the entire column each time I run it.
Is there a better way?
EDIT: Just to make sure I expressed it right, I want to do the following:
for(i in 1:numSamples)
draw 500 random rows
end
without having to reorder the entire massive table each time. Perhaps I could get all of the table row IDs and sample from it in R or something, and then just request those rows?
As you want a sample of the data, what about using the estimated size of the table and then calculate a percentage of that as the sample?
The table pg_class stores an estimate of the number of rows for each table (updated by the vacuum process if I'm not mistaken).
So the following would select 1% of all rows from that table:
with estimated_rows as (
select reltuples as num_rows
from pg_class t
join pg_namespace n on n.oid = t.relnamespace
where t.relname = 'some_table'
and n.nspname = 'public'
)
select *
from some_table
limit (select 0.01 * num_rows from estimated_rows)
;
If you do that very often you might want to create a function so you could do something like this:
select *
from some_table
limit (select estimate_percent(0.01, 'public', 'some_table'))
;
Create a temporary table from the target table adding a row number column
drop table if exists temp_t;
create temporary table temp_t as
select *, (row_number() over())::int as rn
from t
Create a lighter temporary table by selecting only the columns that will be used in the sampling and filtering as necessary.
Index it by the row number column
create index temp_t_rn on temp_t(rn);
analyze temp_t;
Issue this query for each sample
with r as (
select ceiling(random() * (select max(rn) from temp_t))::int as rn
from generate_series(1, 500) s
)
select *
from temp_t
where rn in (select rn from r)
SQL Fiddle

Best way to select random rows PostgreSQL

I want a random selection of rows in PostgreSQL, I tried this:
select * from table where random() < 0.01;
But some other recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 Million rows, I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Fast ways
Given your specifications (plus additional info in the comments),
You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
Obviously no or few write operations.
Your ID column has to be indexed! A primary key serves nicely.
The query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
SELECT count(*) AS ct -- optional
, min(id) AS min_id
, max(id) AS max_id
, max(id) - min(id) AS id_span
FROM big;
The only possibly expensive part is the count(*) (for huge tables). Given above specifications, you don't need it. An estimate to replace the full count will do just fine, available at almost no cost:
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM pg_class
WHERE oid = 'big'::regclass; -- your table name
Detailed explanation:
Fast way to discover the row count of a table in PostgreSQL
As long as ct isn't much smaller than id_span, the query will outperform other approaches.
WITH params AS (
SELECT 1 AS min_id -- minimum id <= current min id
, 5100000 AS id_span -- rounded up. (max_id - min_id + buffer)
)
SELECT *
FROM (
SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
FROM params p
, generate_series(1, 1100) g -- 1000 + buffer
GROUP BY 1 -- trim duplicates
) r
JOIN big USING (id)
LIMIT 1000; -- trim surplus
Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
Join the ids to the big table. This should be very fast with the index in place.
Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.
Short version
You can simplify this query. The CTE in the query above is just for educational purposes:
SELECT *
FROM (
SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
FROM generate_series(1, 1100) g
) r
JOIN big USING (id)
LIMIT 1000;
Refine with rCTE
Especially if you are not so sure about gaps and estimates.
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM generate_series(1, 1030) -- 1000 + few percent - adapt to your needs
LIMIT 1030 -- hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
UNION -- eliminate dupe
SELECT b.*
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM random_pick r -- plus 3 percent - adapt to your needs
LIMIT 999 -- less than 1000, hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
)
TABLE random_pick
LIMIT 1000; -- actual limit
We can work with a smaller surplus in the base query. If there are too many gaps so we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer which defies the purpose of optimizing performance.
Duplicates are eliminated by the UNION in the rCTE.
The outer LIMIT makes the CTE stop as soon as we have enough rows.
This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
Wrap into function
For repeated use with the same table with varying parameters:
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
RETURNS SETOF big
LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
_surplus int := _limit * _gaps;
_estimate int := ( -- get current estimate from system
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'big'::regclass);
BEGIN
RETURN QUERY
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM generate_series(1, _surplus) g
LIMIT _surplus -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM random_pick -- just to make it recursive
LIMIT _limit -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
)
TABLE random_pick
LIMIT _limit;
END
$func$;
Call:
SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);
Generic function
We can make this generic to work for any table with a unique integer column (typically the PK): Pass the table as polymorphic type and (optionally) the name of the PK column and use EXECUTE:
CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
, _id text = 'id'
, _limit int = 1000
, _gaps real = 1.03)
RETURNS SETOF anyelement
LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
-- safe syntax with schema & quotes where needed
_tbl text := pg_typeof(_tbl_type)::text;
_estimate int := (SELECT (reltuples / relpages
* (pg_relation_size(oid) / 8192))::bigint
FROM pg_class -- get current estimate from system
WHERE oid = _tbl::regclass);
BEGIN
RETURN QUERY EXECUTE format(
$$
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * $1)::int
FROM generate_series(1, $2) g
LIMIT $2 -- hint for query planner
) r(%2$I)
JOIN %1$s USING (%2$I) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * $1)::int
FROM random_pick -- just to make it recursive
LIMIT $3 -- hint for query planner
) r(%2$I)
JOIN %1$s USING (%2$I) -- eliminate misses
)
TABLE random_pick
LIMIT $3;
$$
, _tbl, _id
)
USING _estimate -- $1
, (_limit * _gaps)::int -- $2 ("surplus")
, _limit -- $3
;
END
$func$;
Call with defaults (important!):
SELECT * FROM f_random_sample(null::big); --!
Or more specifically:
SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);
About the same performance as the static version.
Related:
Refactor a PL/pgSQL function to return the output of various SELECT queries - chapter "Various complete table types"
Return SETOF rows from PostgreSQL function
Format specifier for integer variables in format() for EXECUTE?
INSERT with dynamic table name in trigger function
This is safe against SQL injection. See:
Table name as a PostgreSQL function parameter
SQL injection in Postgres functions vs prepared queries
Possible alternative
I your requirements allow identical sets for repeated calls (and we are talking about repeated calls) consider a MATERIALIZED VIEW. Execute above query once and write the result to a table. Users get a quasi random selection at lightening speed. Refresh your random pick at intervals or events of your choosing.
Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)
Where n is a percentage. The manual:
The BERNOULLI and SYSTEM sampling methods each accept a single
argument which is the fraction of the table to sample, expressed as a
percentage between 0 and 100. This argument can be any real-valued expression.
Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:
The SYSTEM method is significantly faster than the BERNOULLI method
when small sampling percentages are specified, but it may return a
less-random sample of the table as a result of clustering effects.
The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
Related:
Fast way to discover the row count of a table in PostgreSQL
Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
See Evan's answer for details.
But that's still not exactly random.
You can examine and compare the execution plan of both by using
EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;
A quick test on a large table1 shows, that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.1 only scans the complete table once.
For large tables this might not what you want as even one complete table scan might take to long.
A third proposal would be
select * from table where random() < 0.01 limit 1000;
This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course this bogs down the randomness a bit, but perhaps this is good enough in your case.
Edit: Besides of this considerations, you might check out the already asked questions for this. Using the query [postgresql] random returns quite a few hits.
quick random row selection in Postgres
How to retrieve randomized data rows from a postgreSQL table?
postgres: get random entries from table - too slow
And a linked article of depez outlining several more approaches:
http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
1 "large" as in "the complete table will not fit into the memory".
postgresql order by random(), select rows in random order:
These are all slow because they do a tablescan to guarantee that every row gets an exactly equal chance of being chosen:
select your_columns from your_table ORDER BY random()
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
select your_columns from your_table ORDER BY random() limit 1
If you know how many rows are in the table N:
offset by floored random is constant time. However I am NOT convinced that OFFSET is producing a true random sample. It's simulating it by getting 'the next bunch' and tablescanning that, so you can step through, which isn't quite the same as above.
SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;
Roll your own constant Time Select Random N rows with periodic table scan to be absolutely sure of a random:
If your table is huge then the above table-scans are a show stopper taking up to 5 minutes to finish.
To go faster you can schedule a behind the scenes nightly table-scan reindexing which will guarantee a perfectly random selection in an O(1) constant-time speed, except during the nightly reindexing table-scan, where it must wait for maintenance to finish before you may receive another random row.
--Create a demo table with lots of random nonuniform data, big_data
--is your huge table you want to get random rows from in constant time.
drop table if exists big_data;
CREATE TABLE big_data (id serial unique, some_data text );
CREATE INDEX ON big_data (id);
--Fill it with a million rows which simulates your beautiful data:
INSERT INTO big_data (some_data) SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000000);
--This delete statement puts holes in your index
--making it NONuniformly distributed
DELETE FROM big_data WHERE id IN (2, 4, 6, 7, 8);
--Do the nightly maintenance task on a schedule at 1AM.
drop table if exists big_data_mapper;
CREATE TABLE big_data_mapper (id serial, big_data_id int);
CREATE INDEX ON big_data_mapper (id);
CREATE INDEX ON big_data_mapper (big_data_id);
INSERT INTO big_data_mapper(big_data_id) SELECT id FROM big_data ORDER BY id;
--We have to use a function because the big_data_mapper might be out-of-date
--in between nightly tasks, so to solve the problem of a missing row,
--you try again until you succeed. In the event the big_data_mapper
--is broken, it tries 25 times then gives up and returns -1.
CREATE or replace FUNCTION get_random_big_data_id()
RETURNS int language plpgsql AS $$
declare
response int;
BEGIN
--Loop is required because big_data_mapper could be old
--Keep rolling the dice until you find one that hits.
for counter in 1..25 loop
SELECT big_data_id
FROM big_data_mapper OFFSET floor(random() * (
select max(id) biggest_value from big_data_mapper
)
) LIMIT 1 into response;
if response is not null then
return response;
end if;
end loop;
return -1;
END;
$$;
--get a random big_data id in constant time:
select get_random_big_data_id();
--Get 1 random row from big_data table in constant time:
select * from big_data where id in (
select get_random_big_data_id() from big_data limit 1
);
┌─────────┬──────────────────────────────────┐
│ id │ some_data │
├─────────┼──────────────────────────────────┤
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │
└─────────┴──────────────────────────────────┘
--Get 4 random rows from big_data in constant time:
select * from big_data where id in (
select get_random_big_data_id() from big_data limit 3
);
┌─────────┬──────────────────────────────────┐
│ id │ some_data │
├─────────┼──────────────────────────────────┤
│ 2722848 │ fab6a7d76d9637af89b155f2e614fc96 │
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │
│ 9475611 │ 36ac3eeb6b3e171cacd475e7f9dade56 │
└─────────┴──────────────────────────────────┘
--Test what happens when big_data_mapper stops receiving
--nightly reindexing.
delete from big_data_mapper where 1=1;
select get_random_big_data_id(); --It tries 25 times, and returns -1
--which means wait N minutes and try again.
Adapted from: https://www.gab.lc/articles/bigdata_postgresql_order_by_random
Alternatively if all the above is too much work.
A simpler good 'nuff solution for constant time select random row is to make a new column on your big table called big_data.mapper_int make it not null with a unique index. Every night reset the column with a unique integer between 1 and max(n). To get a random row you "choose a random integer between 0 and max(id)" and return the row where mapper_int is that. If there's no row by that id, because the row has changed since re-index, choose another random row. If a row is added to big_data.mapper_int then populate it with max(id) + 1
Alternatively TableSample to the rescue:
If you have postgresql version > 9.5 then tablesample can do a constant time random sample without a heavy tablescan.
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
--Select 1 percent of rows from yourtable,
--display the first 100 rows, order by column a_column
select * from yourtable TABLESAMPLE SYSTEM (1)
order by a_column
limit 100;
TableSample is doing some stuff behind the scenes that takes some time and I don't like it, but is faster than order by random(). Good, fast, cheap, choose any two on this job.
Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table :
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
This example will give you 5% of elements from mytable.
See more explanation on the documentation: http://www.postgresql.org/docs/current/static/sql-select.html
The one with the ORDER BY is going to be the slower one.
select * from table where random() < 0.01; goes record by record, and decides to randomly filter it or not. This is going to be O(N) because it only needs to check each record once.
select * from table order by random() limit 1000; is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the order by is O(N * log N).
The downside to the random() < 0.01 one is that you'll get a variable number of output records.
Note, there is a better way to shuffling a set of data than sorting by random: The Fisher-Yates Shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though.
select * from table order by random() limit 1000;
If you know how many rows you want, check out tsm_system_rows.
tsm_system_rows
module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.
This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.
First install the extension
CREATE EXTENSION tsm_system_rows;
Then your query,
SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);
Here is a decision that works for me. I guess it's very simple to understand and execute.
SELECT
field_1,
field_2,
field_2,
random() as ordering
FROM
big_table
WHERE
some_conditions
ORDER BY
ordering
LIMIT 1000;
If you want just one row, you can use a calculated offset derived from count.
select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
One lesson from my experience:
offset floor(random() * N) limit 1 is not faster than order by random() limit 1.
I thought the offset approach would be faster because it should save the time of sorting in Postgres. Turns out it wasn't.
I think the best and simplest way in postgreSQL is:
SELECT * FROM tableName ORDER BY random() LIMIT 1
A variation of the materialized view "Possible alternative" outlined by Erwin Brandstetter is possible.
Say, for example, that you don't want duplicates in the randomized values that are returned. An example use case is to generate short codes which can only be used once.
The primary table containing your (non-randomized) set of values must have some expression that determines which rows are "used" and which aren't — here I'll keep it simple by just creating a boolean column with the name used.
Assume this is the input table (additional columns may be added as they do not affect the solution):
id_values id | used
----+--------
1 | FALSE
2 | FALSE
3 | FALSE
4 | FALSE
5 | FALSE
...
Populate the ID_VALUES table as needed. Then, as described by Erwin, create a materialized view that randomizes the ID_VALUES table once:
CREATE MATERIALIZED VIEW id_values_randomized AS
SELECT id
FROM id_values
ORDER BY random();
Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.
In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting id_values from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN
(SELECT i.id
FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
WHERE (NOT i.used)
LIMIT 1)
RETURNING id;
Change LIMIT as necessary -- if you need multiple random values at a time, change LIMIT to n where n is the number of values needed.
With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new id's added to the id_values table need to be made available.
Add a column called r with type serial. Index r.
Assume we have 200,000 rows, we are going to generate a random number n, where 0 < n <= 200, 000.
Select rows with r > n, sort them ASC and select the smallest one.
Code:
select * from YOUR_TABLE
where r > (
select (
select reltuples::bigint AS estimate
from pg_class
where oid = 'public.YOUR_TABLE'::regclass) * random()
)
order by r asc limit(1);
The code is self-explanatory. The subquery in the middle is used to quickly estimate the table row counts from https://stackoverflow.com/a/7945274/1271094 .
In application level you need to execute the statement again if n > the number of rows or need to select multiple rows.
I know I'm a little late to the party, but I just found this awesome tool called pg_sample:
pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.
I tried this with a 350M rows database and it was really fast, don't know about the randomness.
./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db

mysql count performance

select count(*) from mytable;
select count(table_id) from mytable; //table_id is the primary_key
both query were running slow on a table with 10 million rows.
I am wondering why since wouldn't it easy for mysql to keep a counter that gets updated on all insert,update and delete?
and is there a way to improve this query? I used explain but didn't help much.
take a look at the following blog posts:
1) COUNT(***) vs COUNT(col)
2) Easy MySQL Performance Tips
3) Fast count(*) for InnoDB
btw, which engine do you use?
EDITED: About technique to speed up count when you need just to know if there are some amount of rows. Sorry, just was wrong with my query. So, when you need just to know, if there is e.g. 300 rows by specific condition you can try subquery:
select count(*) FROM
( select 1 FROM _table_ WHERE _conditions_ LIMIT 300 ) AS result
at first you minify result set, and then count the result; it will still scan result set, but you can limit it (once more, it works when the question to DB is "is here more or less than 300 rows), and if DB contains more than 300 rows which satisfy condition that query is faster
Testing results (my table has 6.7mln rows):
1) SELECT count(*) FROM _table_ WHERE START_DATE > '2011-02-01'
returns 4.2mln for 65.4 seconds
2) SELECT count(*) FROM ( select 1 FROM _table_ WHERE START_DATE > '2011-02-01' LIMIT 100 ) AS result
returns 100 for 0.03 seconds
Below is result of the explain query to see what is going on there:
EXPLAIN SELECT count(*) FROM ( select 1 FROM _table_ WHERE START_DATE > '2011-02-01' LIMIT 100 ) AS result
As cherouvim pointed out in the comments, it depends on the storage engine.
MyISAM does keep a count of the table rows, and can keep it accurate since the only locks MyISAM supports is a table lock.
InnoDB however supports transactions, and needs to do a table scan to count the rows.
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/

How to request a random row in SQL?

How can I request a random row (or as close to truly random as is possible) in pure SQL?
See this post: SQL to Select a random row from a database table. It goes through methods for doing this in MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2 and Oracle (the following is copied from that link):
Select a random row with MySQL:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
Select a random row with PostgreSQL:
SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
Select a random row with Microsoft SQL Server:
SELECT TOP 1 column FROM table
ORDER BY NEWID()
Select a random row with IBM DB2
SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY
Select a random record with Oracle:
SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1
Solutions like Jeremies:
SELECT * FROM table ORDER BY RAND() LIMIT 1
work, but they need a sequential scan of all the table (because the random value associated with each row needs to be calculated - so that the smallest one can be determined), which can be quite slow for even medium sized tables. My recommendation would be to use some kind of indexed numeric column (many tables have these as their primary keys), and then write something like:
SELECT * FROM table WHERE num_value >= RAND() *
( SELECT MAX (num_value ) FROM table )
ORDER BY num_value LIMIT 1
This works in logarithmic time, regardless of the table size, if num_value is indexed. One caveat: this assumes that num_value is equally distributed in the range 0..MAX(num_value). If your dataset strongly deviates from this assumption, you will get skewed results (some rows will appear more often than others).
I don't know how efficient this is, but I've used it before:
SELECT TOP 1 * FROM MyTable ORDER BY newid()
Because GUIDs are pretty random, the ordering means you get a random row.
ORDER BY NEWID()
takes 7.4 milliseconds
WHERE num_value >= RAND() * (SELECT MAX(num_value) FROM table)
takes 0.0065 milliseconds!
I will definitely go with latter method.
You didn't say which server you're using. In older versions of SQL Server, you can use this:
select top 1 * from mytable order by newid()
In SQL Server 2005 and up, you can use TABLESAMPLE to get a random sample that's repeatable:
SELECT FirstName, LastName
FROM Contact
TABLESAMPLE (1 ROWS) ;
For SQL Server
newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.
TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
If you really want a random sample of
individual rows, modify your query to
filter out rows randomly, instead of
using TABLESAMPLE. For example, the
following query uses the NEWID
function to return approximately one
percent of the rows of the
Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in
the CHECKSUM expression so that
NEWID() evaluates once per row to
achieve sampling on a per-row basis.
The expression CAST(CHECKSUM(NEWID(),
SalesOrderID) & 0x7fffffff AS float /
CAST (0x7fffffff AS int) evaluates to
a random float value between 0 and 1.
When run against a table with 1,000,000 rows, here are my results:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be last resort if you have a large result set.
If possible, use stored statements to avoid the inefficiency of both indexes on RND() and creating a record number field.
PREPARE RandomRecord FROM "SELECT * FROM table LIMIT ?,1";
SET #n=FLOOR(RAND()*(SELECT COUNT(*) FROM table));
EXECUTE RandomRecord USING #n;
Best way is putting a random value in a new column just for that purpose, and using something like this (pseude code + SQL):
randomNo = random()
execSql("SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo")
This is the solution employed by the MediaWiki code. Of course, there is some bias against smaller values, but they found that it was sufficient to wrap the random value around to zero when no rows are fetched.
newid() solution may require a full table scan so that each row can be assigned a new guid, which will be much less performant.
rand() solution may not work at all (i.e. with MSSQL) because the function will be evaluated just once, and every row will be assigned the same "random" number.
For SQL Server 2005 and 2008, if we want a random sample of individual rows (from Books Online):
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
In late, but got here via Google, so for the sake of posterity, I'll add an alternative solution.
Another approach is to use TOP twice, with alternating orders. I don't know if it is "pure SQL", because it uses a variable in the TOP, but it works in SQL Server 2008. Here's an example I use against a table of dictionary words, if I want a random word.
SELECT TOP 1
word
FROM (
SELECT TOP(#idx)
word
FROM
dbo.DictionaryAbridged WITH(NOLOCK)
ORDER BY
word DESC
) AS D
ORDER BY
word ASC
Of course, #idx is some randomly-generated integer that ranges from 1 to COUNT(*) on the target table, inclusively. If your column is indexed, you'll benefit from it too. Another advantage is that you can use it in a function, since NEWID() is disallowed.
Lastly, the above query runs in about 1/10 of the exec time of a NEWID()-type of query on the same table. YYMV.
Insted of using RAND(), as it is not encouraged, you may simply get max ID (=Max):
SELECT MAX(ID) FROM TABLE;
get a random between 1..Max (=My_Generated_Random)
My_Generated_Random = rand_in_your_programming_lang_function(1..Max);
and then run this SQL:
SELECT ID FROM TABLE WHERE ID >= My_Generated_Random ORDER BY ID LIMIT 1
Note that it will check for any rows which Ids are EQUAL or HIGHER than chosen value.
It's also possible to hunt for the row down in the table, and get an equal or lower ID than the My_Generated_Random, then modify the query like this:
SELECT ID FROM TABLE WHERE ID <= My_Generated_Random ORDER BY ID DESC LIMIT 1
As pointed out in #BillKarwin's comment on #cnu's answer...
When combining with a LIMIT, I've found that it performs much better (at least with PostgreSQL 9.1) to JOIN with a random ordering rather than to directly order the actual rows: e.g.
SELECT * FROM tbl_post AS t
JOIN ...
JOIN ( SELECT id, CAST(-2147483648 * RANDOM() AS integer) AS rand
FROM tbl_post
WHERE create_time >= 1349928000
) r ON r.id = t.id
WHERE create_time >= 1349928000 AND ...
ORDER BY r.rand
LIMIT 100
Just make sure that the 'r' generates a 'rand' value for every possible key value in the complex query which is joined with it but still limit the number of rows of 'r' where possible.
The CAST as Integer is especially helpful for PostgreSQL 9.2 which has specific sort optimisation for integer and single precision floating types.
For MySQL to get random record
SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
More detail http://jan.kneschke.de/projects/mysql/order-by-rand/
With SQL Server 2012+ you can use the OFFSET FETCH query to do this for a single random row
select * from MyTable ORDER BY id OFFSET n ROW FETCH NEXT 1 ROWS ONLY
where id is an identity column, and n is the row you want - calculated as a random number between 0 and count()-1 of the table (offset 0 is the first row after all)
This works with holes in the table data, as long as you have an index to work with for the ORDER BY clause. Its also very good for the randomness - as you work that out yourself to pass in but the niggles in other methods are not present. In addition the performance is pretty good, on a smaller dataset it holds up well, though I've not tried serious performance tests against several million rows.
Random function from the sql could help. Also if you would like to limit to just one row, just add that in the end.
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
For SQL Server and needing "a single random row"..
If not needing a true sampling, generate a random value [0, max_rows) and use the ORDER BY..OFFSET..FETCH from SQL Server 2012+.
This is very fast if the COUNT and ORDER BY are over appropriate indexes - such that the data is 'already sorted' along the query lines. If these operations are covered it's a quick request and does not suffer from the horrid scalability of using ORDER BY NEWID() or similar. Obviously, this approach won't scale well on a non-indexed HEAP table.
declare #rows int
select #rows = count(1) from t
-- Other issues if row counts in the bigint range..
-- This is also not 'true random', although such is likely not required.
declare #skip int = convert(int, #rows * rand())
select t.*
from t
order by t.id -- Make sure this is clustered PK or IX/UCL axis!
offset (#skip) rows
fetch first 1 row only
Make sure that the appropriate transaction isolation levels are used and/or account for 0 results.
For SQL Server and needing a "general row sample" approach..
Note: This is an adaptation of the answer as found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.
While a general sampling approach should be used with caution here, it's still potentially useful information in context of other answers (and the repetitious suggestions of non-scaling and/or questionable implementations). Such a sampling approach is less efficient than the first code shown and is error-prone if the goal is to find a "single random row".
Here is an updated and improved form of sampling a percentage of rows. It is based on the same concept of some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID() can never be stable/repeatable.
Does not use ORDER BY NEWID() of the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.
Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
Here is the gist. See this answer for additional details and notes.
Naïve try:
declare #sample_percent decimal(7, 4)
-- Looking at this value should be an indicator of why a
-- general sampling approach can be error-prone to select 1 row.
select #sample_percent = 100.0 / count(1) from t
-- BAD!
-- When choosing appropriate sample percent of "approximately 1 row"
-- it is very reasonable to expect 0 rows, which definitely fails the ask!
-- If choosing a larger sample size the distribution is heavily skewed forward,
-- and is very much NOT 'true random'.
select top 1
t.*
from t
where 1=1
and ( -- sample
#sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * #sample_percent)
)
This can be largely remedied by a hybrid query, by mixing sampling and ORDER BY selection from the much smaller sample set. This limits the sorting operation to the sample size, not the size of the original table.
-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare #rows int
select #rows = count(1) from t
declare #sample_size int = 1000
declare #sample_percent decimal(7, 4) = case
when #rows <= 1000 then 100 -- not enough rows
when (100.0 * #sample_size / #rows) < 0.0001 then 0.0001 -- min sample percent
else 100.0 * #sample_size / #rows -- everything else
end
-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
t.*
from t
where 1=1
and ( -- sample
#sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * #sample_percent)
)
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
SELECT * FROM table ORDER BY RAND() LIMIT 1
Most of the solutions here aim to avoid sorting, but they still need to make a sequential scan over a table.
There is also a way to avoid the sequential scan by switching to index scan. If you know the index value of your random row you can get the result almost instantially. The problem is - how to guess an index value.
The following solution works on PostgreSQL 8.4:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
limit 1;
I above solution you guess 10 various random index values from range 0 .. [last value of id].
The number 10 is arbitrary - you may use 100 or 1000 as it (amazingly) doesn't have a big impact on the response time.
There is also one problem - if you have sparse ids you might miss. The solution is to have a backup plan :) In this case an pure old order by random() query. When combined id looks like this:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
union all (select * from cms_refs order by random() limit 1)
limit 1;
Not the union ALL clause. In this case if the first part returns any data the second one is NEVER executed!
You may also try using new id() function.
Just write a your query and use order by new id() function. It quite random.
Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.
For MS SQL:
Minimum example:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00
NewId() example:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02
NewId() is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.
Selection with Initial Seed:
declare #seed int
set #seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work.
In MSSQL (tested on 11.0.5569) using
SELECT TOP 100 * FROM employee ORDER BY CRYPT_GEN_RANDOM(10)
is significantly faster than
SELECT TOP 100 * FROM employee ORDER BY NEWID()
For Firebird:
Select FIRST 1 column from table ORDER BY RAND()
In SQL Server you can combine TABLESAMPLE with NEWID() to get pretty good randomness and still have speed. This is especially useful if you really only want 1, or a small number, of rows.
SELECT TOP 1 * FROM [table]
TABLESAMPLE (500 ROWS)
ORDER BY NEWID()
I have to agree with CD-MaN: Using "ORDER BY RAND()" will work nicely for small tables or when you do your SELECT only a few times.
I also use the "num_value >= RAND() * ..." technique, and if I really want to have random results I have a special "random" column in the table that I update once a day or so. That single UPDATE run will take some time (especially because you'll have to have an index on that column), but it's much faster than creating random numbers for every row each time the select is run.
Be careful because TableSample doesn't actually return a random sample of rows. It directs your query to look at a random sample of the 8KB pages that make up your row. Then, your query is executed against the data contained in these pages. Because of how data may be grouped on these pages (insertion order, etc), this could lead to data that isn't actually a random sample.
See: http://www.mssqltips.com/tip.asp?tip=1308
This MSDN page for TableSample includes an example of how to generate an actualy random sample of data.
http://msdn.microsoft.com/en-us/library/ms189108.aspx
It seems that many of the ideas listed still use ordering
However, if you use a temporary table, you are able to assign a random index (like many of the solutions have suggested), and then grab the first one that is greater than an arbitrary number between 0 and 1.
For example (for DB2):
WITH TEMP AS (
SELECT COMLUMN, RAND() AS IDX FROM TABLE)
SELECT COLUMN FROM TABLE WHERE IDX > .5
FETCH FIRST 1 ROW ONLY
A simple and efficient way from http://akinas.com/pages/en/blog/mysql_random_row/
SET #i = (SELECT FLOOR(RAND() * COUNT(*)) FROM table); PREPARE get_stmt FROM 'SELECT * FROM table LIMIT ?, 1'; EXECUTE get_stmt USING #i;
There is better solution for Oracle instead of using dbms_random.value, while it requires full scan to order rows by dbms_random.value and it is quite slow for large tables.
Use this instead:
SELECT *
FROM employee sample(1)
WHERE rownum=1
For SQL Server 2005 and above, extending #GreyPanther's answer for the cases when num_value has not continuous values. This works too for cases when we have not evenly distributed datasets and when num_value is not a number but a unique identifier.
WITH CTE_Table (SelRow, num_value)
AS
(
SELECT ROW_NUMBER() OVER(ORDER BY ID) AS SelRow, num_value FROM table
)
SELECT * FROM table Where num_value = (
SELECT TOP 1 num_value FROM CTE_Table WHERE SelRow >= RAND() * (SELECT MAX(SelRow) FROM CTE_Table)
)
select r.id, r.name from table AS r
INNER JOIN(select CEIL(RAND() * (select MAX(id) from table)) as id) as r1
ON r.id >= r1.id ORDER BY r.id ASC LIMIT 1
This will require a lesser computation time