Postgres: limit by the results of a sum function

CREATE TABLE inventory_box (
box_id varchar(10),
value integer
);
INSERT INTO inventory_box VALUES ('1', 10), ('2', 15), ('3', 20);
I prepared an SQL Fiddle with the schema.
I would like to select a list of inventory boxes with a combined value of at least 20.
One possible result: box 1 + box 2 (10 + 15 = 25 >= 20).
Here is what I am doing right now:
SELECT * FROM inventory_box LIMIT 1 OFFSET 0;
-- count on the client side and see if I got enough
-- got 10
SELECT * FROM inventory_box LIMIT 1 OFFSET 1;
-- count on the client side and see if I got enough
-- got 15, add it to the first query which returned 10
-- total is 25, ok, got enough, return answer
I am looking for a solution where the scan will stop as soon as it reaches the target value

One possible approach scans the table in box_id order until the running total reaches the threshold (30 in the linked fiddle), then returns all the previous rows plus the row that tipped the sum over the limit. Note that the scan doesn't stop when the sum is reached; it totals the whole table and then goes back over the results to pick the qualifying rows.
http://sqlfiddle.com/#!15/1c502/4
SELECT
array_agg(box_id ORDER BY box_id) AS box_ids,
max(boxsum) AS boxsum
FROM
(
SELECT
box_id,
sum(value) OVER (ORDER BY box_id) AS boxsum,
sum(value) OVER (ORDER BY box_id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prevboxsum
FROM
inventory_box
) x
WHERE prevboxsum < 30 OR prevboxsum IS NULL;
but really, this is going to be pretty gruesome to do in a general and reliable manner in SQL (or at all).
You can ORDER BY value ASC instead of ORDER BY box_id if you like; this will add boxes from the smallest to the biggest. However, this will catastrophically fail if you then remove all the small boxes from the pool and run it again, and repeat. Soon it'll just be lumping two big boxes together inefficiently.
Solving this for the general case of finding the smallest combination is a hard optimization problem that probably benefits from imprecise, sampling-based and probabilistic methods.
To scan the table in order until the sum reaches the target, lock the table, then use PL/pgSQL to read rows from a cursor that returns them in value order along with array_agg(box_id) OVER (ORDER BY value) and sum(value) OVER (ORDER BY value). When you reach the desired sum, return the current row's array. This won't produce an optimal solution, but it will produce a solution, and I think it will do so without a full table scan if a suitable index is in place.
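A rough sketch of that idea, as I read it (the function name and the SHARE lock level are my own assumptions, and an index on (value) is assumed for the early exit to pay off):

CREATE OR REPLACE FUNCTION pick_boxes(_target int)
  RETURNS text[] AS
$func$
DECLARE
   _ids text[];
   _sum int;
BEGIN
   -- keep the chosen set stable against concurrent writers (assumed lock level)
   LOCK TABLE inventory_box IN SHARE MODE;

   FOR _ids, _sum IN
      SELECT array_agg(box_id) OVER (ORDER BY value)
           , sum(value)        OVER (ORDER BY value)
      FROM   inventory_box
   LOOP
      EXIT WHEN _sum >= _target;  -- stop fetching rows as soon as the running sum is big enough
   END LOOP;

   IF _sum >= _target THEN
      RETURN _ids;                -- boxes collected so far
   END IF;
   RETURN NULL;                   -- total value of all boxes is below the target
END
$func$ LANGUAGE plpgsql;

The plpgsql FOR loop fetches from an implicit cursor, so rows after the early EXIT are never read.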

Your question update clarifies that your actual requirements are much simpler than a full-blown "subset sum problem" as suspected by @GhostGambler:
Just fetch rows until the sum is big enough.
I am sorting by box_id to get deterministic results. You might even drop the ORDER BY altogether and get any valid result a bit faster still.
Slow: Recursive CTE
WITH RECURSIVE i AS (
   SELECT *, row_number() OVER (ORDER BY box_id) AS rn
   FROM   inventory_box
   )
, r AS (
   SELECT box_id, value, value AS total, 2 AS rn
   FROM   i
   WHERE  rn = 1

   UNION ALL
   SELECT i.box_id, i.value, r.total + i.value, r.rn + 1
   FROM   r
   JOIN   i USING (rn)
   WHERE  r.total < 20
   )
SELECT box_id, value, total
FROM   r
ORDER  BY box_id;
Fast: PL/pgSQL function with FOR loop
Using sum() as window aggregate function (cheapest this way).
CREATE OR REPLACE FUNCTION f_shop_for(_total int)
  RETURNS TABLE (box_id text, value int, total int) AS
$func$
BEGIN
   total := 0;
   FOR box_id, value, total IN
      SELECT i.box_id, i.value
           , sum(i.value) OVER (ORDER BY i.box_id) AS total
      FROM   inventory_box i
   LOOP
      RETURN NEXT;
      EXIT WHEN total >= _total;
   END LOOP;
END
$func$ LANGUAGE plpgsql STABLE;
SELECT * FROM f_shop_for(35);
I tested both with a big table of 1 million rows. The function only reads the necessary rows from index and table. The CTE is very slow; it seems to scan the whole table ...
SQL Fiddle for both.
Aside: sorting by a varchar column (box_id) containing numeric data yields dubious results. Maybe this should be a numeric type, really?
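For illustration, assuming box_id holds only digits, you could either cast in the ORDER BY or change the column type for good:

SELECT * FROM inventory_box ORDER BY box_id::int;  -- sorts 2 before 10

ALTER TABLE inventory_box ALTER COLUMN box_id TYPE int USING box_id::int;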

Related

SQL: Give up/return different result if too many rows

Short version: I have a SQL statement where I only want the results if the number of rows returned is less than some value (say 1000); otherwise I want a different result set. What's the best way to do this without incurring the overhead of returning the 1000 rows (as would happen if I used LIMIT) when I'm just going to throw them away?
For instance, I want to return the results of
SELECT *
FROM T
WHERE updated_at > timestamp
AND name <= 'Michael'
ORDER BY name ASC
provided there are at most 1000 entries but if there are more than that I want to return
SELECT *
FROM T
ORDER BY name ASC
LIMIT 25
Two queries isn't bad, but I definitely don't want to get 1000 records back from the first query only to toss them.
(Happy to use Postgres extensions too but prefer SQL)
--
To explain: I'm refreshing data requested by the client in batches, and sometimes the client needs to know if there have been any changes in the part they've already received. If there are too many changes, however, I just give up and start sending the records from the start again.
WITH max1000 AS (
SELECT the_row, count(*) OVER () AS total
FROM (
SELECT the_row -- named row type
FROM T AS the_row
WHERE updated_at > timestamp
AND name <= 'Michael'
ORDER BY name
LIMIT 1001
) sub
)
SELECT (the_row).* -- parentheses required
FROM max1000 m
WHERE total < 1001
UNION ALL
( -- parentheses required
SELECT *
FROM T
WHERE (SELECT total > 1000 FROM max1000 LIMIT 1)
ORDER BY name
LIMIT 25
)
The subquery sub in CTE max1000 gets the complete, sorted result for the first query - wrapped as row type, and with LIMIT 1001 to avoid excess work.
The outer SELECT adds the total row count. See:
Run a query with a LIMIT/OFFSET and also get the total number of rows
The first SELECT of the outer UNION query returns decomposed rows as result - if there are fewer than 1001 of them.
The second SELECT of the outer UNION query returns the alternate result - if there were more than 1000. Parentheses are required - see:
Combining 3 SELECT statements to output 1 table
Or:
WITH max1000 AS (
SELECT *
FROM T
WHERE updated_at > timestamp
AND name <= 'Michael'
ORDER BY name
LIMIT 1001
)
, ct(ok) AS (SELECT count(*) < 1001 FROM max1000)
SELECT *
FROM max1000 m
WHERE (SELECT ok FROM ct)
UNION ALL
( -- parentheses required
SELECT *
FROM T
WHERE (SELECT NOT ok FROM ct)
ORDER BY name
LIMIT 25
);
I think I like the 2nd better. Not sure which is faster.
Either version optimizes performance for the case of fewer than 1001 qualifying rows, which should be the rule for most calls. If that's the exception, I would first run a somewhat cheaper count. A lot also depends on available indexes ...
You get no row if the first query finds no row. (Seems like an odd result.)
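As a sketch of the "somewhat cheaper count" mentioned above (capped at 1001 so it never counts more than necessary; placeholders as in the question), you could branch in the application on its result:

SELECT count(*) >= 1001 AS too_many
FROM  (
   SELECT 1
   FROM   T
   WHERE  updated_at > timestamp
   AND    name <= 'Michael'
   LIMIT  1001
   ) sub;

If too_many comes back true, run only the alternate LIMIT 25 query; otherwise run the first query unchanged.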

Select first 50 rows then order

Is it possible to select the first 50 rows in Postgres with select * from yellow_tripdata_staging fetch first 50 rows only and after that sort the results by column?
If so, how?
Edit: the table is really big, and it is not really important which rows I get.
I asked this question because I was using Redash to visualise the data and was getting some weird order on the sorted results. Then I realized that the column I was using to order was not numerical but char, which causes values like 11 and 10 to come before 2 and 3.
I'm sorry for this dumb question.
It's not completely clear how your first 50 rows are identified and in what order they shall be returned. There is no "natural order" in tables of a relational database. No guarantees without explicit ORDER BY.
However, there is a current physical order of rows you can (ab-)use. And by default that's the order in which rows have been inserted - as long as nothing else has happened to that table. But the RDBMS is free to change the physical order any time, so the physical order is not reliable. Results can and will change with write operations to the table (including VACUUM or other utility commands).
Let's call the column used for sorting after the first 50 rows sort_col.
( -- parentheses required
TABLE yellow_tripdata_staging LIMIT 50
)
UNION ALL
( -- parentheses required
SELECT *
FROM (TABLE yellow_tripdata_staging OFFSET 50) sub
ORDER BY sort_col
);
More explanation (incl. TABLE and parentheses):
Is there a shortcut for SELECT * FROM in psql?
Get n grouped categories and sum others into one
Or, assuming sort_col is defined NOT NULL:
SELECT *
FROM yellow_tripdata_staging
ORDER BY CASE WHEN row_number() OVER () > 50 THEN sort_col END NULLS FIRST;
The window function row_number() is allowed to appear in the ORDER BY clause.
row_number() OVER () (with an empty OVER clause) attaches serial numbers according to the current physical order of rows - all the disclaimers above still apply.
The CASE expression replaces the first 50 row numbers with NULL, which sorts first due to the attached NULLS FIRST. In effect, the first 50 rows are unsorted and the rest is sorted by sort_col.
Or, if you actually mean to take the first 50 rows according to sort_col and leave them unsorted, while the rest is to be sorted:
SELECT *
FROM yellow_tripdata_staging
ORDER BY GREATEST (row_number() OVER (ORDER BY sort_col), 50);
Or, if you just mean to fetch the "first" 50 rows according to current physical order or some other undisclosed (more reliable) criteria, you need a subquery or CTE to sort those 50 rows in the outer SELECT:
SELECT *
FROM (TABLE yellow_tripdata_staging LIMIT 50) sub
ORDER BY sort_col;
You need to define your requirements clearly.
You can order by two different columns. For instance:
select yts.*
from (select yts.*,
row_number() over (order by id) as seqnum
from yellow_tripdata_staging yts
) yts
order by (seqnum <= 50)::int desc,
(case when seqnum <= 50 then id end),
col

Joining a series in postgres with a select query

I'm looking for a way to join these two queries (or run these two together):
SELECT s
FROM generate_series(1, 50) s;
With this query:
SELECT id FROM foo ORDER BY RANDOM() LIMIT 50;
In a way where I get 50 rows like this:
series, ids_from_foo
1, 53
2, 34
3, 23
I've been at it for a couple days now and I can't figure it out. Any help would be great.
Use row_number()
select row_number() over() as rn, a
from (
select a
from foo
order by random()
limit 50
) s
order by rn;
Picking the top n rows from a randomly sorted table is a simple, but slow way to pick 50 rows randomly. All rows have to be sorted that way.
Doesn't matter much for small to medium tables and one-time, ad-hoc use. For repeated use on a big table, there are much more efficient ways.
If the ratio of gaps / islands in the primary key is low, use this:
SELECT row_number() OVER() AS rn, *
FROM (
SELECT *
FROM (
SELECT trunc(random() * 999999)::int AS foo_id
FROM generate_series(1, 55) g
GROUP BY 1 -- fold duplicates
) sub1
JOIN foo USING (foo_id)
LIMIT 50
) sub2;
With an index on foo_id, this is blazingly fast, no matter how big the table. (A primary key serves just fine.) Compare performance with EXPLAIN ANALYZE.
How?
999999 is an estimated row count of the table, rounded up. You can get it cheaply from:
SELECT reltuples FROM pg_class WHERE oid = 'foo'::regclass;
Round up to easily include possible new entries since the last ANALYZE. You can also use the expression itself in a generic query dynamically; it's cheap. Details:
Fast way to discover the row count of a table in PostgreSQL
55 is your desired number of rows (50) in the result, multiplied by a low factor to easily make up for the gap ratio in your table and (unlikely but possible) duplicate random numbers.
If your primary key does not start near 1 (does not have to be 1 exactly, gaps are covered), add the minimum pk value to the calculation:
min_pkey + trunc(random() * 999999)::int
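A hedged variant of the query above with that adjustment, fetching the minimum key inline (min() on an indexed column is cheap):

SELECT row_number() OVER() AS rn, *
FROM (
   SELECT *
   FROM (
      SELECT (SELECT min(foo_id) FROM foo)
             + trunc(random() * 999999)::int AS foo_id
      FROM generate_series(1, 55) g
      GROUP BY 1 -- fold duplicates
      ) sub1
   JOIN foo USING (foo_id)
   LIMIT 50
   ) sub2;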
Detailed explanation here:
Best way to select random rows PostgreSQL

SQL stored proc runs extremely slow when filtering by high row numbers

This query is generated from a very long dynamic SQL stored procedure -- the procedure returns the requested number of records starting at a given index, to be displayed in a Telerik RadGrid, effectively handling paging. A simplified version of the stored proc's output:
SELECT r.* FROM (
SELECT ROW_NUMBER() OVER(ORDER BY InventoryId DESC) as row,
v.* FROM vInventorySearch v
) as R WHERE [ROW] BETWEEN 1 AND 10
When the "BETWEEN" clause is between 1 and 10, it runs in a fraction of a second, but if it's between something like 10000 and 10010 it takes almost a full minute to execute.
I feel like I may be missing something fundamental here, but it seems to me that it shouldn't matter which 10 records I'm retrieving, it should take the same amount of time.
Thanks for any input, I'm looking forward to being embarrassed!
Solution, courtesy of Martin Smith (below):
SELECT r.*, inv.* FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY InventoryId DESC) as row, v.InventoryID
FROM vInventorySearch v
WHERE 1=1
) as R
inner join vInventory inv on r.InventoryID = inv.InventoryID
WHERE [ROW] BETWEEN 10001 AND 10010
Thanks for your help!
Paginating by ROW_NUMBER can indeed be pretty inefficient for higher row numbers.
Sometimes it is better to break it up a bit and have the ROW_NUMBER query on a narrow index to retrieve the matching PKs with a join back onto the base table to retrieve the missing columns.
SQL Server 2012 has a more efficient paging mechanism:
http://stevestedman.com/2012/04/tsql-2012-offset-and-fetch/
SELECT DepartmentID, Revenue, Year
FROM REVENUE
ORDER BY Year, DepartmentID ASC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;

Best way to select random rows PostgreSQL

I want a random selection of rows in PostgreSQL, I tried this:
select * from table where random() < 0.01;
But some other recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 Million rows, I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Fast ways
Given your specifications (plus additional info in the comments),
You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
Obviously no or few write operations.
Your ID column has to be indexed! A primary key serves nicely.
The query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
SELECT count(*) AS ct -- optional
, min(id) AS min_id
, max(id) AS max_id
, max(id) - min(id) AS id_span
FROM big;
The only possibly expensive part is the count(*) (for huge tables). Given above specifications, you don't need it. An estimate to replace the full count will do just fine, available at almost no cost:
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint AS ct
FROM pg_class
WHERE oid = 'big'::regclass; -- your table name
Detailed explanation:
Fast way to discover the row count of a table in PostgreSQL
As long as ct isn't much smaller than id_span, the query will outperform other approaches.
WITH params AS (
SELECT 1 AS min_id -- minimum id <= current min id
, 5100000 AS id_span -- rounded up. (max_id - min_id + buffer)
)
SELECT *
FROM (
SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
FROM params p
, generate_series(1, 1100) g -- 1000 + buffer
GROUP BY 1 -- trim duplicates
) r
JOIN big USING (id)
LIMIT 1000; -- trim surplus
Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
Join the ids to the big table. This should be very fast with the index in place.
Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.
Short version
You can simplify this query. The CTE in the query above is just for educational purposes:
SELECT *
FROM (
SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
FROM generate_series(1, 1100) g
) r
JOIN big USING (id)
LIMIT 1000;
Refine with rCTE
Especially if you are not so sure about gaps and estimates.
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM generate_series(1, 1030) -- 1000 + few percent - adapt to your needs
LIMIT 1030 -- hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
UNION -- eliminate dupe
SELECT b.*
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM random_pick r -- plus 3 percent - adapt to your needs
LIMIT 999 -- less than 1000, hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
)
TABLE random_pick
LIMIT 1000; -- actual limit
We can work with a smaller surplus in the base query. If there are too many gaps, so that we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space or the recursion may run dry before the limit is reached - or we have to start with a buffer so large that it defeats the purpose of optimizing performance.
Duplicates are eliminated by the UNION in the rCTE.
The outer LIMIT makes the CTE stop as soon as we have enough rows.
This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
Wrap into function
For repeated use with the same table with varying parameters:
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
RETURNS SETOF big
LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
_surplus int := _limit * _gaps;
_estimate int := ( -- get current estimate from system
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'big'::regclass);
BEGIN
RETURN QUERY
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM generate_series(1, _surplus) g
LIMIT _surplus -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM random_pick -- just to make it recursive
LIMIT _limit -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
)
TABLE random_pick
LIMIT _limit;
END
$func$;
Call:
SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);
Generic function
We can make this generic to work for any table with a unique integer column (typically the PK): Pass the table as polymorphic type and (optionally) the name of the PK column and use EXECUTE:
CREATE OR REPLACE FUNCTION f_random_sample(_tbl_type anyelement
, _id text = 'id'
, _limit int = 1000
, _gaps real = 1.03)
RETURNS SETOF anyelement
LANGUAGE plpgsql VOLATILE ROWS 1000 AS
$func$
DECLARE
-- safe syntax with schema & quotes where needed
_tbl text := pg_typeof(_tbl_type)::text;
_estimate int := (SELECT (reltuples / relpages
* (pg_relation_size(oid) / 8192))::bigint
FROM pg_class -- get current estimate from system
WHERE oid = _tbl::regclass);
BEGIN
RETURN QUERY EXECUTE format(
$$
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * $1)::int
FROM generate_series(1, $2) g
LIMIT $2 -- hint for query planner
) r(%2$I)
JOIN %1$s USING (%2$I) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * $1)::int
FROM random_pick -- just to make it recursive
LIMIT $3 -- hint for query planner
) r(%2$I)
JOIN %1$s USING (%2$I) -- eliminate misses
)
TABLE random_pick
LIMIT $3;
$$
, _tbl, _id
)
USING _estimate -- $1
, (_limit * _gaps)::int -- $2 ("surplus")
, _limit -- $3
;
END
$func$;
Call with defaults (important!):
SELECT * FROM f_random_sample(null::big); --!
Or more specifically:
SELECT * FROM f_random_sample(null::"my_TABLE", 'oDD ID', 666, 1.15);
About the same performance as the static version.
Related:
Refactor a PL/pgSQL function to return the output of various SELECT queries - chapter "Various complete table types"
Return SETOF rows from PostgreSQL function
Format specifier for integer variables in format() for EXECUTE?
INSERT with dynamic table name in trigger function
This is safe against SQL injection. See:
Table name as a PostgreSQL function parameter
SQL injection in Postgres functions vs prepared queries
Possible alternative
If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a MATERIALIZED VIEW. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
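A minimal sketch of that alternative, reusing f_random_sample() from above (the view name is my choice):

CREATE MATERIALIZED VIEW mv_random_pick AS
SELECT * FROM f_random_sample(1000);

-- serve users from the precomputed pick:
SELECT * FROM mv_random_pick;

-- renew the random selection at intervals or events of your choosing:
REFRESH MATERIALIZED VIEW mv_random_pick;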
Postgres 9.5 introduces TABLESAMPLE SYSTEM (n)
Where n is a percentage. The manual:
The BERNOULLI and SYSTEM sampling methods each accept a single
argument which is the fraction of the table to sample, expressed as a
percentage between 0 and 100. This argument can be any real-valued expression.
Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:
The SYSTEM method is significantly faster than the BERNOULLI method
when small sampling percentages are specified, but it may return a
less-random sample of the table as a result of clustering effects.
The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
Related:
Fast way to discover the row count of a table in PostgreSQL
Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
See Evan's answer for details.
But that's still not exactly random.
You can examine and compare the execution plan of both by using
EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;
A quick test on a large table [1] shows that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.01 only scans the complete table once.
For large tables this might not be what you want, as even one complete table scan might take too long.
A third proposal would be
select * from table where random() < 0.01 limit 1000;
This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course this bogs down the randomness a bit, but perhaps this is good enough in your case.
Edit: Besides these considerations, you might check out questions already asked on this topic. Searching for [postgresql] random returns quite a few hits.
quick random row selection in Postgres
How to retrieve randomized data rows from a postgreSQL table?
postgres: get random entries from table - too slow
And a linked article by depesz outlining several more approaches:
http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
[1] "large" as in "the complete table will not fit into memory".
postgresql order by random(), select rows in random order:
These are all slow because they do a tablescan to guarantee that every row gets an exactly equal chance of being chosen:
select your_columns from your_table ORDER BY random()
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
select your_columns from your_table ORDER BY random() limit 1
If you know how many rows are in the table N:
offset by floored random is constant time. However I am NOT convinced that OFFSET is producing a true random sample. It's simulating it by getting 'the next bunch' and tablescanning that, so you can step through, which isn't quite the same as above.
SELECT myid FROM mytable OFFSET floor(random() * N) LIMIT 1;
Roll your own constant-time "select random N rows" with a periodic table scan, to be absolutely sure of a random pick:
If your table is huge, then the above table scans are a show-stopper, taking up to 5 minutes to finish.
To go faster, you can schedule a behind-the-scenes nightly table-scan reindexing which guarantees a perfectly random selection in O(1) constant time, except during the nightly reindexing scan itself, where you must wait for maintenance to finish before you can receive another random row.
--Create a demo table with lots of random nonuniform data, big_data
--is your huge table you want to get random rows from in constant time.
drop table if exists big_data;
CREATE TABLE big_data (id serial unique, some_data text );
CREATE INDEX ON big_data (id);
--Fill it with ten million rows which simulate your beautiful data:
INSERT INTO big_data (some_data) SELECT md5(random()::text) AS some_data
FROM generate_series(1,10000000);
--This delete statement puts holes in your index
--making it NONuniformly distributed
DELETE FROM big_data WHERE id IN (2, 4, 6, 7, 8);
--Do the nightly maintenance task on a schedule at 1AM.
drop table if exists big_data_mapper;
CREATE TABLE big_data_mapper (id serial, big_data_id int);
CREATE INDEX ON big_data_mapper (id);
CREATE INDEX ON big_data_mapper (big_data_id);
INSERT INTO big_data_mapper(big_data_id) SELECT id FROM big_data ORDER BY id;
--We have to use a function because the big_data_mapper might be out-of-date
--in between nightly tasks, so to solve the problem of a missing row,
--you try again until you succeed. In the event the big_data_mapper
--is broken, it tries 25 times then gives up and returns -1.
CREATE or replace FUNCTION get_random_big_data_id()
RETURNS int language plpgsql AS $$
declare
response int;
BEGIN
--Loop is required because big_data_mapper could be old
--Keep rolling the dice until you find one that hits.
for counter in 1..25 loop
SELECT big_data_id
FROM big_data_mapper OFFSET floor(random() * (
select max(id) biggest_value from big_data_mapper
)
) LIMIT 1 into response;
if response is not null then
return response;
end if;
end loop;
return -1;
END;
$$;
--get a random big_data id in constant time:
select get_random_big_data_id();
--Get 1 random row from big_data table in constant time:
select * from big_data where id in (
select get_random_big_data_id() from big_data limit 1
);
┌─────────┬──────────────────────────────────┐
│ id │ some_data │
├─────────┼──────────────────────────────────┤
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │
└─────────┴──────────────────────────────────┘
--Get 3 random rows from big_data in constant time:
select * from big_data where id in (
select get_random_big_data_id() from big_data limit 3
);
┌─────────┬──────────────────────────────────┐
│ id │ some_data │
├─────────┼──────────────────────────────────┤
│ 2722848 │ fab6a7d76d9637af89b155f2e614fc96 │
│ 8732674 │ f8d75be30eff0a973923c413eaf57ac0 │
│ 9475611 │ 36ac3eeb6b3e171cacd475e7f9dade56 │
└─────────┴──────────────────────────────────┘
--Test what happens when big_data_mapper stops receiving
--nightly reindexing.
delete from big_data_mapper where 1=1;
select get_random_big_data_id(); --It tries 25 times, and returns -1
--which means wait N minutes and try again.
Adapted from: https://www.gab.lc/articles/bigdata_postgresql_order_by_random
Alternatively, if all the above is too much work:
A simpler, good-enough solution for constant-time random row selection is to add a new column to your big table, big_data.mapper_int, NOT NULL and with a unique index. Every night, reset the column to unique integers between 1 and max(n). To get a random row, "choose a random integer between 0 and max(id)" and return the row where mapper_int equals it. If there's no row for that id (because the row has changed since the last re-index), choose another random number. If a row is added to big_data, populate its mapper_int with max(id) + 1.
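A rough sketch of that idea (column and constraint names are illustrative; the deferrable constraint avoids transient duplicate errors during the bulk renumbering):

ALTER TABLE big_data ADD COLUMN mapper_int int;
ALTER TABLE big_data ADD CONSTRAINT big_data_mapper_uni
   UNIQUE (mapper_int) DEFERRABLE INITIALLY DEFERRED;

--nightly renumbering to 1 .. max(n) without holes:
UPDATE big_data b
SET    mapper_int = r.rn
FROM  (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM big_data) r
WHERE  b.id = r.id;

--pick one random row; retry in the application if nothing comes back
--(the picked number may have vanished since the last renumbering):
SELECT b.*
FROM  (
   SELECT 1 + floor(random() * max(mapper_int))::int AS pick
   FROM   big_data
   ) p
JOIN   big_data b ON b.mapper_int = p.pick;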
Alternatively, TABLESAMPLE to the rescue:
If you have PostgreSQL 9.5 or later, TABLESAMPLE can do a constant-time random sample without a heavy table scan.
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
--Select 1 percent of rows from yourtable,
--display the first 100 rows, order by column a_column
select * from yourtable TABLESAMPLE SYSTEM (1)
order by a_column
limit 100;
TableSample is doing some stuff behind the scenes that takes some time and I don't like it, but is faster than order by random(). Good, fast, cheap, choose any two on this job.
Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
This example will give you 5% of elements from mytable.
See more explanation on the documentation: http://www.postgresql.org/docs/current/static/sql-select.html
The one with the ORDER BY is going to be the slower one.
select * from table where random() < 0.01; goes record by record and decides at random whether to filter it or not. This is going to be O(N) because it only needs to check each record once.
select * from table order by random() limit 1000; is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the order by is O(N * log N).
The downside to the random() < 0.01 one is that you'll get a variable number of output records.
Note, there is a better way to shuffle a set of data than sorting by random(): the Fisher-Yates Shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though.
select * from table order by random() limit 1000;
If you know how many rows you want, check out tsm_system_rows.
The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.
This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.
First install the extension
CREATE EXTENSION tsm_system_rows;
Then your query,
SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);
Here is an approach that works for me. I guess it's very simple to understand and execute.
SELECT
field_1,
field_2,
field_3,
random() as ordering
FROM
big_table
WHERE
some_conditions
ORDER BY
ordering
LIMIT 1000;
If you want just one row, you can use a calculated offset derived from count.
select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
One lesson from my experience:
offset floor(random() * N) limit 1 is not faster than order by random() limit 1.
I thought the offset approach would be faster because it should save the time of sorting in Postgres. Turns out it wasn't.
I think the best and simplest way in PostgreSQL is:
SELECT * FROM tableName ORDER BY random() LIMIT 1
A variation of the materialized-view approach outlined by Erwin Brandstetter under "Possible alternative" above is possible.
Say, for example, that you don't want duplicates in the randomized values that are returned. An example use case is to generate short codes which can only be used once.
The primary table containing your (non-randomized) set of values must have some expression that determines which rows are "used" and which aren't — here I'll keep it simple by just creating a boolean column with the name used.
Assume this is the input table (additional columns may be added as they do not affect the solution):
id_values:

 id | used
----+-------
  1 | FALSE
  2 | FALSE
  3 | FALSE
  4 | FALSE
  5 | FALSE
...
Populate the id_values table as needed. Then, as described by Erwin, create a materialized view that randomizes the id_values table once:
CREATE MATERIALIZED VIEW id_values_randomized AS
SELECT id
FROM id_values
ORDER BY random();
Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.
In order to obtain (and "consume") random values, use an UPDATE ... RETURNING on id_values, selecting ids from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN
(SELECT i.id
FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
WHERE (NOT i.used)
LIMIT 1)
RETURNING id;
Change LIMIT as necessary -- if you need multiple random values at a time, change LIMIT to n where n is the number of values needed.
With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new id's added to the id_values table need to be made available.
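For completeness, the refresh itself is a single statement; the CONCURRENTLY variant (which requires a unique index on the materialized view, e.g. on id) avoids blocking readers during the refresh:

REFRESH MATERIALIZED VIEW id_values_randomized;
-- or, with a unique index on id_values_randomized (id):
REFRESH MATERIALIZED VIEW CONCURRENTLY id_values_randomized;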
Add a column called r with type serial. Index r.
Assume we have 200,000 rows; we are going to generate a random number n, where 0 < n <= 200,000.
Select rows with r > n, sort them ASC and select the smallest one.
Code:
select * from YOUR_TABLE
where r > (
select (
select reltuples::bigint AS estimate
from pg_class
where oid = 'public.YOUR_TABLE'::regclass) * random()
)
order by r asc limit(1);
The code is self-explanatory. The subquery in the middle is used to quickly estimate the table row count (from https://stackoverflow.com/a/7945274/1271094).
At the application level, you need to execute the statement again if n > the number of rows or if you need to select multiple rows.
I know I'm a little late to the party, but I just found this awesome tool called pg_sample:
pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.
I tried this with a 350M-row database and it was really fast; I don't know about the randomness, though.
./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db