Select random row from a PostgreSQL table with weighted row probabilities - sql

Example input:
SELECT * FROM test;
id | percent
----+----------
1 | 50
2 | 35
3 | 15
(3 rows)
How would you write such a query so that, on average, 50% of the time I get the row with id=1, 35% of the time the row with id=2, and 15% of the time the row with id=3?
I tried something like SELECT id FROM test ORDER BY p * random() DESC LIMIT 1, but it gives wrong results. After 10,000 runs I get a distribution like {1=6293, 2=3302, 3=405}, whereas I expected it to be close to {1=5000, 2=3500, 3=1500}.
Any ideas?

This should do the trick:
WITH CTE AS (
    SELECT random() * (SELECT SUM(percent) FROM YOUR_TABLE) R
)
SELECT *
FROM (
    SELECT id, SUM(percent) OVER (ORDER BY id) S, R
    FROM YOUR_TABLE CROSS JOIN CTE
) Q
WHERE S >= R
ORDER BY id
LIMIT 1;
The sub-query Q gives the following running sums:
id | S
---+-----
 1 | 50
 2 | 85
 3 | 100
We then simply generate a random number in the range [0, 100) and pick the first row whose running sum is at or beyond that number (the WHERE clause). We use a common table expression (WITH) to ensure the random number is calculated only once.
BTW, the SELECT SUM(percent) FROM YOUR_TABLE means you can use any weights in percent - they don't strictly need to be percentages (i.e. add up to 100).

ORDER BY random() ^ (1.0 / p)
from the algorithm described by Efraimidis and Spirakis.
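A hedged sketch of the complete query against the question's test table: Efraimidis-Spirakis assigns each row the key random() ^ (1.0 / weight) and keeps the row with the largest key, so the sort must be descending.
-- weighted reservoir-sampling key; the largest key wins, hence DESC
SELECT id
FROM test
ORDER BY random() ^ (1.0 / percent) DESC
LIMIT 1;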

Branko's accepted solution is great (thanks!). However, I'd like to contribute an alternative that is just as performant (according to my tests), and perhaps easier to visualize.
Let's recap. The original question can perhaps be generalized as follows:
Given a map of ids and relative weights, create a query that returns a random id from the map, with probability proportional to its relative weight.
Note the emphasis on relative weights, not percent. As Branko points out in his answer, using relative weights will work for anything, including percents.
Now, consider some test data, which we'll put in a temporary table:
CREATE TEMP TABLE test AS
SELECT * FROM (VALUES
    (1, 25),
    (2, 10),
    (3, 10),
    (4, 05)
) AS test(id, weight);
Note that I'm using a more complicated example than the one in the original question: the weights do not conveniently add up to 100, and the same weight (10) is used more than once (for ids 2 and 3), which is important to consider, as you'll see later.
The first thing we have to do is turn the weights into probabilities from 0 to 1, which is nothing more than a simple normalization (weight / sum(weights)):
WITH p AS ( -- probability
    SELECT *,
        weight::NUMERIC / sum(weight) OVER () AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumprobability
    FROM p
)
SELECT
    cp.id,
    cp.weight,
    cp.probability,
    cp.cumprobability - cp.probability AS startprobability,
    cp.cumprobability AS endprobability
FROM cp;
This will result in the following output:
id | weight | probability | startprobability | endprobability
----+--------+-------------+------------------+----------------
1 | 25 | 0.5 | 0.0 | 0.5
2 | 10 | 0.2 | 0.5 | 0.7
3 | 10 | 0.2 | 0.7 | 0.9
4 | 5 | 0.1 | 0.9 | 1.0
The query above admittedly does more work than strictly necessary for our needs, but I find it helpful to visualize the relative probabilities this way, and it makes the final step of choosing the id trivial:
SELECT id FROM (query above)
WHERE random() BETWEEN startprobability AND endprobability;
Now, let's put it all together with a test that ensures the query is returning data with the expected distribution. We'll use generate_series() to generate a random number a million times:
WITH p AS ( -- probability
    SELECT *,
        weight::NUMERIC / sum(weight) OVER () AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumprobability
    FROM p
),
fp AS ( -- final probability
    SELECT
        cp.id,
        cp.weight,
        cp.probability,
        cp.cumprobability - cp.probability AS startprobability,
        cp.cumprobability AS endprobability
    FROM cp
)
SELECT fp.id, count(*)
FROM fp
CROSS JOIN (SELECT random() FROM generate_series(1, 1000000)) AS random(val)
WHERE random.val BETWEEN fp.startprobability AND fp.endprobability
GROUP BY fp.id
ORDER BY count(*) DESC;
This will result in output similar to the following:
id | count
----+--------
1 | 499679
3 | 200652
2 | 199334
4 | 100335
As you can see, the results track the expected distribution closely.
Performance
The query above is quite performant. Even on my average machine, with PostgreSQL running in a WSL1 instance (the horror!), execution is relatively fast:
count | time (ms)
-----------+----------
1,000 | 7
10,000 | 25
100,000 | 210
1,000,000 | 1950
Adaptation to generate test data
I often use a variation of the query above when generating test data for unit/integration tests. The idea is to generate random data that approximates a probability distribution that tracks reality.
In that situation I find it useful to compute the start and end probabilities once and store the results in a table:
CREATE TEMP TABLE test AS
WITH test(id, weight) AS (VALUES
    (1, 25),
    (2, 10),
    (3, 10),
    (4, 05)
),
p AS ( -- probability
    SELECT *, (weight::NUMERIC / sum(weight) OVER ()) AS probability
    FROM test
),
cp AS ( -- cumulative probability
    SELECT *,
        sum(p.probability) OVER (
            ORDER BY probability DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS cumprobability
    FROM p
)
SELECT
    cp.id,
    cp.weight,
    cp.probability,
    cp.cumprobability - cp.probability AS startprobability,
    cp.cumprobability AS endprobability
FROM cp;
I can then use these precomputed probabilities repeatedly, which improves performance and simplifies use.
I can even wrap it all in a function that I can call any time I want to get a random id:
CREATE OR REPLACE FUNCTION getrandomid(p_random FLOAT8 = random())
RETURNS INT AS
$$
    SELECT id
    FROM test
    WHERE p_random BETWEEN startprobability AND endprobability;
$$
LANGUAGE SQL STABLE STRICT;
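A quick usage sketch (assuming the precomputed temp table above):
SELECT getrandomid();     -- samples with a fresh random() by default
SELECT getrandomid(0.42); -- or pass an explicit probability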
Window function frames
It's worth noting that the technique above uses a window function with a non-default frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is necessary to deal with the fact that some weights might be repeated, which is why I chose test data with repeated weights in the first place! With the default RANGE frame, rows tied on the ORDER BY value are peers and would share the same cumulative sum; a sketch of the difference follows.
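An illustrative sketch, assuming the temp table above (ids 2 and 3 both have weight 10):
SELECT id, weight,
    sum(weight) OVER (ORDER BY weight DESC) AS range_frame, -- default RANGE frame: ties share one total
    sum(weight) OVER (ORDER BY weight DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame
FROM test;
-- range_frame yields 25, 45, 45, 50 (ids 2 and 3 collapse together),
-- while rows_frame yields 25, 35, 45, 50, keeping every bracket distinct.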

Your proposed query appears to work; see this SQLFiddle demo. It creates the wrong distribution though; see below.
To prevent PostgreSQL from optimising the subquery away, I've wrapped it in a VOLATILE SQL function (a sketch follows). PostgreSQL has no way to know that you intend the subquery to run once for every row of the outer query, so if you don't force it to be volatile it'll just execute it once. Another possibility - though one that the query planner might optimize out in the future - is to make it appear to be a correlated subquery, like this hack that uses an always-true WHERE clause: http://sqlfiddle.com/#!12/3039b/9
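A hedged sketch of that VOLATILE wrapper (the function name is illustrative):
CREATE OR REPLACE FUNCTION weighted_pick() RETURNS int AS $$
    -- the OP's original ordering trick, re-evaluated on every call
    SELECT id FROM test ORDER BY percent * random() DESC LIMIT 1;
$$ LANGUAGE sql VOLATILE;

-- each call re-runs the body, so repeated sampling behaves as intended:
SELECT weighted_pick() FROM generate_series(1, 10000);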
At a guess (before you updated to explain why it didn't work), your testing methodology was at fault, or you were using this as a subquery in an outer query, where PostgreSQL noticed it wasn't a correlated subquery and executed it just once.
UPDATE: The distribution produced isn't what you're expecting. The issue is that you're skewing the distribution by taking multiple samples of random(); you need a single sample.
This query produces the correct distribution (SQLFiddle):
WITH random_weight(rw) AS (SELECT random() * (SELECT sum(percent) FROM test))
SELECT id
FROM (
    SELECT
        id,
        sum(percent) OVER (ORDER BY id),
        coalesce(sum(prev_percent) OVER (ORDER BY id), 0)
    FROM (
        SELECT
            id,
            percent,
            lag(percent) OVER () AS prev_percent
        FROM test
    ) x
) weighted_ids(id, weight_upper, weight_lower)
CROSS JOIN random_weight
WHERE rw BETWEEN weight_lower AND weight_upper;
Performance is, needless to say, horrible. It's using two nested sets of window functions. What I'm doing is:
Creating (id, percent, prev_percent), then using that to create two running sums of weights that are used as range brackets; then
Taking a random value, scaling it to the range of weights, and picking the row whose bracket contains that value.

Here is something for you to play with:
select t1.id as id1,
    coalesce(t2.id, 0) as id2,
    t1.percent as percent1,
    coalesce(t2.percent, 0) as percent2
from "Test1" t1
left outer join "Test1" t2 on t1.id = t2.id + 1
where random() * 100 between t1.percent and coalesce(t2.percent, 0);
Essentially, perform a left outer join so that you have two columns to apply a BETWEEN clause to.
Note that it will only work if your table is ordered in the right way.

Based on Branko Dimitrijevic's answer, I wrote this query, which may or may not be faster. It uses the sum total of percent via tiered window functions (not unlike a ROLLUP):
WITH random AS (SELECT random() AS random)
SELECT id FROM (
    SELECT id, percent,
        SUM(percent) OVER (ORDER BY id) AS rank,
        SUM(percent) OVER () * random AS roll
    FROM test CROSS JOIN random
) t
WHERE roll <= rank
LIMIT 1;
If the ordering isn't important, SUM(percent) OVER (ROWS UNBOUNDED PRECEDING) AS rank may be preferable because it avoids having to sort the data first.
I also tried Mechanic Wei's answer (apparently based on the Efraimidis and Spirakis paper), which seems very promising in terms of performance, but after some testing the distribution appears to be off - possibly because the key should be sorted descending, as noted earlier:
SELECT id
FROM test
ORDER BY random() ^ (1.0/percent)
LIMIT 1

Related

SQL- Sample the 3rd from the beginning and backwards

I have a table with id and score. I want to create a new set of data with a sampling method. The sampling method is to order the ids in decreasing order of score and sample every 3rd id, starting from the beginning, until we get 10k positive samples. Then we would like to do the same in the other direction, starting from the end, to get 10k negative samples.
id  | score
----+------
24  | 0.55
58  | 0.43
987 | 0.93
How can I write a SQL query to execute this sampling and get the expected output?
To start with, it would be more straightforward to write an answer if you had included the database you use (SQL Server, MySQL, etc.). Different SQL dialects have different syntax.
BACKGROUND
To answer this question, the main tools you need are the ability to sort, and an ability to take every 3rd row.
I'm using SQL Server here, so the relevant tools include:
the TOP modifier in SELECT statements - in other databases it's often LIMIT (e.g., SELECT * FROM Test LIMIT 1000)
ROW_NUMBER(), which I believe is relatively common
To get every third row, I use the 'modulus' mathematical function - signified in SQL Server by the % symbol - for example:
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0
4 % 3 = 1
APPROACH
There is an example of this in this db<>fiddle - but note that it is only dealing with test data (1000 rows, selecting top and bottom 10).
Running through the steps - and assuming your data is stored in #DataTable:
The following command assigns a row number rn to the data, sorted by the score.
SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable;
To get every third value, start with that data and keep only the rows whose row number is a multiple of 3 (note that SQL Server requires an alias on the derived table):
SELECT *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) AS rn
      FROM #DataTable) AS RankedScores
WHERE rn % 3 = 0;
To get the first 10,000 of them, use TOP (or LIMIT, etc.):
SELECT TOP 10000 *
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) AS rn
      FROM #DataTable) AS RankedScores
WHERE rn % 3 = 0
ORDER BY rn;
Note - to get it the other way/get the highest scores, take the ROW_NUMBER() in reverse order (e.g., ORDER BY score DESC, id DESC).
FINAL ANSWER
Take the above 10,000 rows, do the same for the other direction (i.e., the highest scores), then UNION them together. Below it is done with CTEs.
WITH TopScores AS
(SELECT TOP 10000 id, score
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score DESC, id DESC) as rn
FROM #DataTable
) AS RankedScores_down
WHERE RankedScores_down.rn % 3 = 0
ORDER BY RankedScores_down.rn
),
LowScores AS
(SELECT TOP 10000 id, score
FROM (SELECT id, score, ROW_NUMBER() OVER (ORDER BY score, id) as rn
FROM #DataTable
) AS RankedScores_up
WHERE RankedScores_up.rn % 3 = 0
ORDER BY RankedScores_up.rn
)
SELECT * FROM TopScores
UNION
SELECT * FROM LowScores
ORDER BY score, id;
Notes
I used UNION rather than UNION ALL because, on the chance that there is overlap (e.g., you have fewer than 60,000 data points), we only want to include each sample once.
If you use a different database, you'll need to translate this! Hence the benefit of specifying the database you use.
Note that taking every third value (when sorted by score) is not really 'independent' sampling - one might ask why you don't just use all of the top/bottom 30,000 scores. If you want to sample 1 in 3 of them, you could use id % 3 instead of rn % 3. But once again, why would you do this? Why not just collect fewer results and use them all?
Of course, one good reason is to hold out data for validating your statistics - e.g., build your model on half the data, then check how good it is against the other half.

How to use a group by but still access every raw value cleverly

I need to "group by" my data to distinguish between tests (each test has a specific id, name and temperature) and to calculate their count, standard deviation, etc. But I also need access every raw data value from each group, for further indexes calculations that I do in a python script.
I have found two solutions to this problem, but both seem non-optimal/flawed:
1) Using listagg to store every raw value that was grouped into a single string row. It does the job but it is not optimized: I concatenate multiple float values into a giant string that I will immediately de-concatenate and convert back to floats. That seems unnecessarily costly.
2) Removing the group by entirely and computing the count and standard deviation through partitioning. But that seems even worse to me. I don't know whether PL/SQL/Oracle optimizes this; it could be calculating the same count and standard deviation for every line (I don't know how to check this). The query result also becomes messy: since there is no 'group by' anymore, I have to add multiple checks in my Python file in order to differentiate every test's data (specific id, name and temperature).
I think that my first solution can be improved, but I don't know how. How can I use a group by but still access every raw value cleverly?
A function similar to listagg but with a collection/array output type instead of a string output type could maybe do the trick (a sort of array_agg compatible with Oracle), but I don't know of any.
EDIT:
The sample data is complex and probably restricted to company viewing, but I can show you my simplified query for option 1):
SELECT
    rav.rav_testid AS test_id,
    tte.tte_testname AS test_name,
    tsc.tsc_temperature AS temperature,
    listagg(rav.rav_value, ' ') WITHIN GROUP (ORDER BY rav.rav_value) AS all_specific_test_values,
    COUNT(rav.rav_value) AS n,
    STDDEV(rav.rav_value) AS sigma
FROM
    ...
    (8 inner joins)
GROUP BY
    rav.rav_testid, tte.tte_testname, tsc.tsc_temperature
ORDER BY
    rav.rav_testid, tte.tte_testname, spd.SPD_SPLITNAMEINTERNAL, tsc.tsc_temperature
The result looks like :
test_id | test_name | temperature | all_specific_test_values | n | sigma
-------------------------------------------------------------------------
6001 |VADC_A(...) | -40 | ,8094034194946289 ,8(...)| 58 | 0,54
6001 |VADC_A(...) | 25 | ,5054857852946545 ,6(...)| 56 | 0,24
6001 |VADC_A(...) | 150 | ,8625754277452524 ,4(...)| 56 | 0,26
6002 |VADC_B(...) | -40 | ,9874514651548454 ,5(...)| 57 | 0,44
I think you want analytic functions:
select t.*,
count(*) over (partition by test) as cnt,
avg(value) over (partition by test) as avg_value,
stddev(value) over (partition by test) as stddev_value
from t;
This adds additional columns on each row.
I would suggest going with Gordon Linoff's solution. That is likely the most standard approach.
If you want to go with a less standard solution, you can have a GROUP BY that returns a collection as one of the columns. Presumably your script could iterate through that collection, though it might take a bit of work in the script to do so.
create type num_tbl as table of number;
/

create table foo (
    grp integer,
    val number
);

insert into foo values (1, 1.1);
insert into foo values (2, 1.2);
insert into foo values (1, 1.3);
insert into foo values (2, 1.4);

select grp, avg(val), cast(collect(val) as num_tbl)
from foo
group by grp;
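A hedged follow-up sketch, using the same num_tbl type and foo table from the example above: if the script side prefers rows, the collection can also be unnested back with CAST/MULTISET.
-- unnest one group's values back into rows (Oracle)
select column_value as val
from table(cast(multiset(select val from foo where grp = 1) as num_tbl));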

Filtering Rows in SQL

My data looks like this: Number(String), Number2(String), Transaction Type(String), Cost(Integer)
For number 1, Cost 10 and -10 cancel out so the remaining cost is 100
For number 2, Cost 50 and -50 cancel out, Cost 87 and -87 cancel out
For number 3, Cost remains 274
For number 4, Cost 316 and -316 cancel out, 313 remains as the cost
The output I am looking for keeps one row per number, showing the cost that remains after the cancellations.
How do I do this in SQL?
I have tried "sum(price)" and group by "number", but Oracle rejects the query because the other columns are neither grouped nor aggregated.
When you're doing an aggregate query, you have to pick one value for each column - either by including it in the group by, or wrapping it in an aggregate function.
It's not clear what you want to display for columns 2 and 3 in your output, but from your example data it looks like you're taking the MAX, so that's what I did here.
select number, max(number2), max(transaction_type), sum(cost)
from my_data
group by number
having sum(cost) <> 0;
Oracle has very nice functionality equivalent to first(), but the syntax is a little cumbersome:
select number,
max(number2) keep (dense_rank first order by cost desc) as number2,
max(transaction_type) keep (dense_rank first order by cost desc) as transaction_type,
max(cost) as cost
from t
group by number;
In my experience, keep has good performance characteristics.
You're almost there... you'll need to get the sum for each number without the other columns, and then join back to your table:
select t.*, sums.total
from my_data t
join (select number, sum(cost) as total
      from my_data
      group by number) sums on sums.number = t.number;
You can use a correlated subquery:
select t.*
from table t
where t.cost = (select sum(t1.cost) from table t1 where t1.number = t.number);

Implementing a total order ranking in PostgreSQL 8.3

The issue with 8.3 is that rank() was only introduced in 8.4.
Consider the numbers [10,6,6,2].
I wish to achieve a rank of those numbers where the rank is equal to the row number:
rank | score
-----+------
1 | 10
2 | 6
3 | 6
4 | 2
A partial solution is to self-join and count items with a higher or equal score. This produces:
1 | 10
3 | 6
3 | 6
4 | 2
But that's not what I want.
Is there a way to rank, or even just order by score somehow and then extract that row number?
If you want a row number equivalent to the window function row_number(), you can improvise in version 8.3 (or any version) with a (temporary) SEQUENCE:
CREATE TEMP SEQUENCE foo;
SELECT nextval('foo') AS rn, *
FROM (SELECT score FROM tbl ORDER BY score DESC) s;
db<>fiddle here
Old sqlfiddle
The subquery is necessary to order rows before calling nextval().
Note that the sequence (like any temporary object) ...
is only visible in the same session it was created.
hides any other table object of the same name.
is dropped automatically at the end of the session.
To reuse the sequence in the same session, run this before each query:
SELECT setval('foo', 1, FALSE);
There's a method using an array that works with PG 8.3. It's probably not very efficient, performance-wise, but will do OK if there aren't a lot of values.
The idea is to sort the values in a temporary array, then extract the bounds of the array, then join that with generate_series to extract the values one by one, the index into the array being the row number.
Sample query assuming the table is scores(value int):
SELECT i AS row_number, arr[i] AS score
FROM (SELECT arr, generate_series(1, nb) AS i
      FROM (SELECT arr, array_upper(arr, 1) AS nb
            FROM (SELECT array(SELECT value FROM scores ORDER BY value DESC) AS arr
                 ) AS s2
           ) AS s1
     ) AS s0;
Do you have a PK for this table?
Just self-join and count items with a higher or equal score, using the PK to break ties; a hedged sketch follows.
The PK comparison resolves ties and gives you the desired result.
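A hedged sketch, assuming a table scores(id PRIMARY KEY, score):
-- every row t1 is ranked by counting the rows that outrank it:
-- a strictly higher score, or an equal score with a smaller-or-equal PK
-- (the tie-break direction is arbitrary; <= makes the count include t1 itself)
SELECT count(*) AS rank, t1.id, t1.score
FROM scores t1
JOIN scores t2
  ON t2.score > t1.score
  OR (t2.score = t1.score AND t2.id <= t1.id)
GROUP BY t1.id, t1.score
ORDER BY rank;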
And after you upgrade to 9.1 - use row_number().

PostgreSQL - fetch the rows which have the Max value for a column in each GROUP BY group

I'm dealing with a Postgres table (called "lives") that contains records with columns for time_stamp, usr_id, transaction_id, and lives_remaining. I need a query that will give me the most recent lives_remaining total for each usr_id
There are multiple users (distinct usr_id's)
time_stamp is not a unique identifier: sometimes user events (one per row in the table) will occur with the same time_stamp.
trans_id is unique only for very small time ranges: over time it repeats
remaining_lives (for a given user) can both increase and decrease over time
example:
time_stamp|lives_remaining|usr_id|trans_id
-----------------------------------------
07:00 | 1 | 1 | 1
09:00 | 4 | 2 | 2
10:00 | 2 | 3 | 3
10:00 | 1 | 2 | 4
11:00 | 4 | 1 | 5
11:00 | 3 | 1 | 6
13:00 | 3 | 3 | 1
As I will need to access other columns of the row with the latest data for each given usr_id, I need a query that gives a result like this:
time_stamp|lives_remaining|usr_id|trans_id
-----------------------------------------
11:00 | 3 | 1 | 6
10:00 | 1 | 2 | 4
13:00 | 3 | 3 | 1
As mentioned, each usr_id can gain or lose lives, and sometimes these timestamped events occur so close together that they have the same timestamp! Therefore this query won't work:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp) AS max_timestamp
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp = b.time_stamp
Instead, I need to use both time_stamp (first) and trans_id (second) to identify the correct row. I also then need to pass that information from the subquery to the main query that will provide the data for the other columns of the appropriate rows. This is the hacked up query that I've gotten to work:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp || '*' || trans_id)
AS max_timestamp_transid
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp_transid = b.time_stamp || '*' || b.trans_id
ORDER BY b.usr_id
Okay, so this works, but I don't like it. It requires a query within a query and a self-join, and it seems to me that it could be much simpler by grabbing the row that MAX found to have the largest timestamp and trans_id. The table "lives" has tens of millions of rows to parse, so I'd like this query to be as fast and efficient as possible. I'm new to RDBMSs and to Postgres in particular, so I know that I need to make effective use of the proper indexes. I'm a bit lost on how to optimize.
I found a similar discussion here. Can I perform some type of Postgres equivalent to an Oracle analytic function?
Any advice on accessing related column information used by an aggregate function (like MAX), creating indexes, and creating better queries would be much appreciated!
P.S. You can use the following to create my example case:
create TABLE lives (time_stamp timestamp, lives_remaining integer,
usr_id integer, trans_id integer);
insert into lives values ('2000-01-01 07:00', 1, 1, 1);
insert into lives values ('2000-01-01 09:00', 4, 2, 2);
insert into lives values ('2000-01-01 10:00', 2, 3, 3);
insert into lives values ('2000-01-01 10:00', 1, 2, 4);
insert into lives values ('2000-01-01 11:00', 4, 1, 5);
insert into lives values ('2000-01-01 11:00', 3, 1, 6);
insert into lives values ('2000-01-01 13:00', 3, 3, 1);
I would propose a clean version based on DISTINCT ON (see docs):
SELECT DISTINCT ON (usr_id)
time_stamp,
lives_remaining,
usr_id,
trans_id
FROM lives
ORDER BY usr_id, time_stamp DESC, trans_id DESC;
On a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30):
By query cost, below, I am referring to Postgres' cost-based optimizer's cost estimate (with Postgres' default xxx_cost values), which is a weighted estimate of required I/O and CPU resources; you can obtain this by firing up pgAdminIII and running "Query/Explain (F7)" on the query with "Query/Explain options" set to "Analyze"
Quassnoi's query has a cost estimate of 745k (!), and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp))
Bill's query has a cost estimate of 93k, and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
Query #1 below has a cost estimate of 16k, and completes in 800ms (given a compound index on (usr_id, trans_id, time_stamp))
Query #2 below has a cost estimate of 14k, and completes in 800ms (given a compound function index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id)); this is Postgres-specific
Query #3 below (Postgres 8.4+) has a cost estimate and completion time comparable to (or better than) query #2 (given a compound index on (usr_id, time_stamp, trans_id)); it has the advantage of scanning the lives table only once and, should you temporarily increase (if needed) work_mem to accommodate the sort in memory, it will be by far the fastest of all queries.
All times above include retrieval of the full 10k rows result-set.
Your goal is minimal cost estimate and minimal query execution time, with an emphasis on estimated cost. Query execution can depend significantly on runtime conditions (e.g. whether relevant rows are already fully cached in memory or not), whereas the cost estimate does not. On the other hand, keep in mind that a cost estimate is exactly that, an estimate.
The best query execution time is obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC). Query time will vary in production based on actual machine load and data access spread. When one query appears slightly faster (<20%) than another but has a much higher cost, it will generally be wiser to choose the one with the longer execution time but lower cost.
When you expect that there will be no competition for memory on your production machine at the time the query is run (e.g. the RDBMS cache and filesystem cache won't be thrashed by concurrent queries and/or filesystem activity) then the query time you obtained in standalone (e.g. pgAdminIII on a development PC) mode will be representative. If there is contention on the production system, query time will degrade proportionally to the estimated cost ratio, as the query with the lower cost does not rely as much on cache whereas the query with higher cost will revisit the same data over and over (triggering additional I/O in the absence of a stable cache), e.g.:
cost | time (dedicated machine) | time (under load) |
-------------------+--------------------------+-----------------------+
some query A: 5k | (all data cached) 900ms | (less i/o) 1000ms |
some query B: 50k | (all data cached) 900ms | (lots of i/o) 10000ms |
Do not forget to run ANALYZE lives once after creating the necessary indices.
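A hedged sketch of the index setup referenced above (index names are illustrative; the expression in the functional index needs its own parentheses, and EXTRACT(EPOCH FROM ...) is immutable for timestamp-without-time-zone columns):
-- compound index for Query #1 (and the DISTINCT ON variant)
CREATE INDEX lives_usr_ts_trans ON lives (usr_id, time_stamp, trans_id);
-- compound function index for Query #2
CREATE INDEX lives_usr_epoch_trans
    ON lives (usr_id, (EXTRACT(EPOCH FROM time_stamp)), trans_id);
ANALYZE lives;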
Query #1
-- incrementally narrow down the result set via inner joins
-- the CBO may elect to perform one full index scan combined
-- with cascading index lookups, or as hash aggregates terminated
-- by one nested index lookup into lives - on my machine
-- the latter query plan was selected given my memory settings and
-- histogram
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(time_stamp) AS time_stamp_max
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
l1.time_stamp = l2.time_stamp_max
INNER JOIN (
SELECT
usr_id,
time_stamp,
MAX(trans_id) AS trans_max
FROM
lives
GROUP BY
usr_id, time_stamp
) AS l3
ON
l1.usr_id = l3.usr_id AND
l1.time_stamp = l3.time_stamp AND
l1.trans_id = l3.trans_max
Query #2
-- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
-- this results in a single table scan and one nested index lookup into lives,
-- by far the least I/O intensive operation even in case of great scarcity
-- of memory (least reliant on cache for the best performance)
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp),trans_id])
AS compound_time_stamp
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
l1.trans_id = l2.compound_time_stamp[2]
2013/01/29 update
Finally, as of version 8.4, Postgres supports window functions, meaning you can write something as simple and efficient as:
Query #3
-- use Window Functions
-- performs a SINGLE scan of the table
SELECT DISTINCT ON (usr_id)
last_value(time_stamp) OVER wnd,
last_value(lives_remaining) OVER wnd,
usr_id,
last_value(trans_id) OVER wnd
FROM lives
WINDOW wnd AS (
PARTITION BY usr_id ORDER BY time_stamp, trans_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
Here's another method, which happens to use no correlated subqueries or GROUP BY. I'm not expert in PostgreSQL performance tuning, so I suggest you try both this and the solutions given by other folks to see which works better for you.
SELECT l1.*
FROM lives l1 LEFT OUTER JOIN lives l2
ON (l1.usr_id = l2.usr_id AND (l1.time_stamp < l2.time_stamp
OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
WHERE l2.usr_id IS NULL
ORDER BY l1.usr_id;
I am assuming that trans_id is unique at least over any given value of time_stamp.
There is another option in PostgreSQL, DISTINCT ON (available since well before 9.5):
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
It eliminates duplicate rows and leaves only the first row as defined by the ORDER BY clause.
See the official documentation.
I like the style of Mike Woodhouse's answer on the other page you mentioned. It's especially concise when the thing being maximised over is just a single column, in which case the subquery can simply use MAX(some_col) and GROUP BY the other columns. Since in your case there is a 2-part quantity to be maximised, you can instead use ORDER BY plus LIMIT 1 (as done by Quassnoi):
SELECT *
FROM lives o
WHERE (usr_id, time_stamp, trans_id) IN (
    SELECT usr_id, time_stamp, trans_id
    FROM lives sq
    WHERE sq.usr_id = o.usr_id
    ORDER BY time_stamp DESC, trans_id DESC
    LIMIT 1
);
I find using the row-constructor syntax WHERE (a, b, c) IN (subquery) nice because it cuts down on the amount of verbiage needed.
Actually there's a hacky solution for this problem. Let's say you want to select the biggest tree of each forest in a region.
SELECT (array_agg(tree.id ORDER BY tree.size DESC))[1]
FROM tree JOIN forest ON (tree.forest = forest.id)
GROUP BY forest.id;
When you group trees by forest you get an unsorted list of trees, and you need to find the biggest one. The first thing to do is to sort the rows by their sizes and select the first one in the list. It may seem inefficient, but if you have millions of rows it will be quite a bit faster than solutions that involve JOINs and WHERE conditions.
BTW, note that ORDER BY for array_agg was introduced in PostgreSQL 9.0.
You can do it with window functions (the trans_id tie-break matters because equal timestamps occur):
SELECT t.*
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY usr_id
                           ORDER BY time_stamp DESC, trans_id DESC) AS r
    FROM lives
) AS t
WHERE t.r = 1;
SELECT l.*
FROM (
SELECT DISTINCT usr_id
FROM lives
) lo, lives l
WHERE l.ctid = (
SELECT ctid
FROM lives li
WHERE li.usr_id = lo.usr_id
ORDER BY
time_stamp DESC, trans_id DESC
LIMIT 1
)
Creating an index on (usr_id, time_stamp, trans_id) will greatly improve this query.
You should always, always have some kind of PRIMARY KEY in your tables.
I think you've got one major problem here: there's no monotonically increasing "counter" to guarantee that a given row has happened later in time than another. Take this example:
timestamp lives_remaining user_id trans_id
10:00 4 3 5
10:00 5 3 6
10:00 3 3 1
10:00 2 3 2
You cannot determine from this data which is the most recent entry. Is it the second one or the last one? There is no sort or max() function you can apply to any of this data to give you the correct answer.
Increasing the resolution of the timestamp would be a huge help. Since the database engine serializes requests, with sufficient resolution you can guarantee that no two timestamps will be the same.
Alternatively, use a trans_id that won't roll over for a very, very long time. Having a trans_id that rolls over means you can't tell (for the same timestamp) whether trans_id 6 is more recent than trans_id 1 unless you do some complicated math.
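A hedged sketch of that suggestion (the column name is illustrative): adding a monotonically increasing counter makes "latest per user" unambiguous.
-- add a never-rolling event counter; existing rows are backfilled in
-- arbitrary order, so the counter is only reliable for new events
ALTER TABLE lives ADD COLUMN event_seq BIGSERIAL;

-- the latest row per user is then simply:
SELECT DISTINCT ON (usr_id) *
FROM lives
ORDER BY usr_id, event_seq DESC;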