Ordered count of consecutive repeats / duplicates - sql

I highly doubt I'm doing this in the most efficient manner, which is why I tagged plpgsql on here. I need to run this on 2 billion rows for a thousand measurement systems.
You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity in short spurts but sometimes for a long time. You need to aggregate, but when you do so you need to know how long the value was repeating and apply various filters based on that information. Say you are measuring mpg on a car but it's stuck at 20 mpg for an hour, then moves around to 20.1 and so on. You'll want to evaluate the accuracy while it's stuck. You could also place some alternative rules that look for when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado:
--here's my data, you have different systems, the time of measurement, and the actual measurement
--as well, the raw data has whether or not it's a repeat (hence the included window function)
select * into temporary table cumulative_repeat_calculator_data
FROM
(
select
system_measured, time_of_measurement, measurement,
case when
measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
then 1 else 0 end as repeat
FROM
(
SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
UNION
SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
UNION
SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
) as data
) as data;
--unfortunately you can't have window functions within window functions, so I had to break it down into a subquery
--what we need is something to partition on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a row that stays the same when your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
(
select
*,
sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
from cumulative_repeat_calculator_data
order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example
select *,
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement
So, what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking plpgsql because I suspect this needs to be done in-database, or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to sub-queries?
I have tested one alternative method, but it still relies on a sub-query and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join it to the larger table, and if a timestamp falls between a start and stop you classify the row as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of 'events' (a rough sketch is below). Do you think that's a better way to go?
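Here is roughly what that alternative looks like (the state_events table and its columns are placeholders for illustration, not something defined above):
--state_events(system_measured, start_timestamp, end_timestamp) holds one row per repeat-spurt
select d.*, e.start_timestamp as state_started
from cumulative_repeat_calculator_data d
join state_events e
on d.system_measured = e.system_measured
and d.time_of_measurement between e.start_timestamp and e.end_timestamp;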

Test case
First, a more useful way to present your data - or even better, in an sqlfiddle, ready to play with:
CREATE TEMP TABLE data(
system_measured int
, time_of_measurement int
, measurement int
);
INSERT INTO data VALUES
(1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);
Simplified query
Since it remains unclear, I am assuming only the above as given.
Next, I simplified your query to arrive at:
WITH x AS (
SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
ORDER BY time_of_measurement) = measurement
THEN 0 ELSE 1 END AS step
FROM data
)
, y AS (
SELECT *, sum(step) OVER(PARTITION BY system_measured
ORDER BY time_of_measurement) AS grp
FROM x
)
SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM y
ORDER BY system_measured, time_of_measurement;
Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do the job in a single table scan whereas this query needs at least three scans.
Faster with plpgsql function:
CREATE OR REPLACE FUNCTION x.f_repeat_ct()
RETURNS TABLE (
system_measured int
, time_of_measurement int
, measurement int, repeat_ct int
) LANGUAGE plpgsql AS
$func$
DECLARE
r data; -- table name serves as record type
r0 data;
BEGIN
-- SET LOCAL work_mem = '1000 MB'; -- uncomment and adapt if needed, see below!
repeat_ct := 0; -- init
FOR r IN
SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
IF r.system_measured = r0.system_measured
AND r.measurement = r0.measurement THEN
repeat_ct := repeat_ct + 1; -- same value as previous row: count the repeat
ELSE
repeat_ct := 0; -- start new count
END IF;
RETURN QUERY SELECT r.*, repeat_ct;
r0 := r; -- remember last row
END LOOP;
END
$func$;
Call:
SELECT * FROM x.f_repeat_ct();
Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters which would take precedence if not qualified.
Billions of rows
If you have billions of rows, you may want to split this operation up. I quote the manual here:
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation. Currently, the point at
which data begins being written to disk is controlled by the work_mem
configuration variable. Administrators who have sufficient memory to
store larger result sets in memory should consider increasing this
parameter.
Consider computing rows for one system at a time, or set a high enough value for work_mem to cope with the load. Follow the link in the quote for more about work_mem.
One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.
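In practice the call might be wrapped like this (the value is only an illustration; size it to your result set):
BEGIN;
SET LOCAL work_mem = '256MB'; -- only affects the current transaction
SELECT * FROM x.f_repeat_ct();
COMMIT;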

Related

WHILE Window Operation with Different Starting Point Values From Column - SQL Server [duplicate]

In SQL there are aggregation operators, like AVG, SUM, COUNT. Why doesn't it have an operator for multiplication? "MUL" or something.
I was wondering, does it exist for Oracle, MSSQL, MySQL? If not, is there a workaround that would give this behaviour?
By MUL do you mean progressive multiplication of values?
Even with 100 rows of modest values (say in the tens), your MUL(column) is going to overflow any data type! With such a high probability of misuse and abuse, and very limited scope for legitimate use, it does not need to be part of the SQL standard. As others have shown, there are mathematical ways of working it out, just as there are many, many ways to do tricky calculations in SQL using only standard (and common-use) methods.
Sample data:
Column
1
2
4
8
COUNT : 4 items (1 for each non-null)
SUM : 1 + 2 + 4 + 8 = 15
AVG : 3.75 (SUM/COUNT)
MUL : 1 x 2 x 4 x 8 ? ( =64 )
For completeness, the Oracle, MSSQL, MySQL core implementations *
Oracle : EXP(SUM(LN(column))) or POWER(N,SUM(LOG(column, N)))
MSSQL : EXP(SUM(LOG(column))) or POWER(N,SUM(LOG(column)/LOG(N)))
MySQL : EXP(SUM(LOG(column))) or POW(N,SUM(LOG(N,column)))
Take care when using EXP/LOG in SQL Server; watch the return type: http://msdn.microsoft.com/en-us/library/ms187592.aspx
The POWER form allows for larger numbers (using bases larger than Euler's number). In cases where the result grows too large to turn back using POWER, you can return just the logarithmic value and calculate the actual number outside of the SQL query.
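A sketch of that last idea in SQL Server syntax, reusing the yourColumn/yourTable names used elsewhere in this thread:
-- return only the log-sum when the product itself would overflow;
-- the application then computes EXP(log_product) at whatever precision it needs
SELECT SUM(LOG(yourColumn)) AS log_product
FROM yourTable
WHERE yourColumn > 0; -- LOG is undefined for zero and negatives, see the note below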
* LOG(0) and LOG of negative numbers are undefined. The example below shows only how to handle this in SQL Server; equivalents can be built for the other SQL flavours using the same concept.
create table MUL(data int)
insert MUL select 1 yourColumn union all
select 2 union all
select 4 union all
select 8 union all
select -2 union all
select 0
select CASE WHEN MIN(abs(data)) = 0 then 0 ELSE
EXP(SUM(Log(abs(nullif(data,0))))) -- the base mathematics
* round(0.5-count(nullif(sign(sign(data)+0.5),1))%2,0) -- pairs up negatives
END
from MUL
Ingredients:
Taking the abs() of data: if the min is 0, multiplying by anything else is futile; the result is 0.
When data is 0, NULLIF converts it to NULL. The abs() and log() then both return NULL, so the row is excluded from SUM().
If data is not 0, abs() allows us to multiply negative numbers using the LOG method - we keep track of the negativity elsewhere.
Working out the final sign
sign(data) returns 1 for >0, 0 for 0 and -1 for <0.
We add another 0.5 and take the sign() again, so we have now classified 0 and 1 both as 1, and only -1 as -1.
again use NULLIF to remove from COUNT() the 1's, since we only need to count up the negatives.
% 2 against the count() of negative numbers returns either
--> 1 if there is an odd number of negative numbers
--> 0 if there is an even number of negative numbers
more mathematical tricks: we take 1 or 0 off 0.5, so that the above becomes
--> (0.5-1=-0.5=>round to -1) if there is an odd number of negative numbers
--> (0.5-0= 0.5=>round to 1) if there is an even number of negative numbers
we multiply this final 1/-1 against the SUM-PRODUCT value for the real result
No, but you can use Mathematics :)
if yourColumn is always bigger than zero:
select EXP(SUM(LOG(yourColumn))) As ColumnProduct from yourTable
I see an Oracle answer is still missing, so here it is:
SQL> with yourTable as
2 ( select 1 yourColumn from dual union all
3 select 2 from dual union all
4 select 4 from dual union all
5 select 8 from dual
6 )
7 select EXP(SUM(LN(yourColumn))) As ColumnProduct from yourTable
8 /
COLUMNPRODUCT
-------------
64
1 row selected.
Regards,
Rob.
With PostgreSQL, you can create your own aggregate functions, see http://www.postgresql.org/docs/8.2/interactive/sql-createaggregate.html
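For instance, a minimal product aggregate could be sketched like this (the function and aggregate names are illustrative, not built-ins):
CREATE FUNCTION product_step(numeric, numeric) RETURNS numeric
LANGUAGE sql IMMUTABLE AS 'SELECT $1 * $2';
CREATE AGGREGATE mul(numeric) (
SFUNC = product_step,
STYPE = numeric,
INITCOND = '1' -- an empty group yields 1 instead of NULL
);
-- usage: SELECT mul(yourColumn) FROM yourTable; -- 1, 2, 4, 8 gives 64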
To create an aggregate function on MySQL, you'll need to build an .so (linux) or .dll (windows) file. An example is shown here: http://www.codeproject.com/KB/database/mygroupconcat.aspx
I'm not sure about MSSQL and Oracle, but I bet they have options to create custom aggregates as well.
You'll break any datatype fairly quickly as numbers mount up.
Using LOG/EXP is tricky because of numbers <= 0, which fail when passed to LOG. I wrote a solution in this question that deals with this.
Using CTE in MS SQL:
CREATE TABLE Foo(Id int, Val int)
INSERT INTO Foo VALUES(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)
;WITH cte AS
(
SELECT Id, Val AS Multiply, row_number() over (order by Id) as rn
FROM Foo
WHERE Id=1
UNION ALL
SELECT ff.Id, cte.multiply*ff.Val as multiply, ff.rn FROM
(SELECT f.Id, f.Val, (row_number() over (order by f.Id)) as rn
FROM Foo f) ff
INNER JOIN cte
ON ff.rn -1= cte.rn
)
SELECT * FROM cte
Not sure about Oracle or sql-server, but in MySQL you can just use * like you normally would.
mysql> select count(id), count(id)*10 from tablename;
+-----------+--------------+
| count(id) | count(id)*10 |
+-----------+--------------+
|       961 |         9610 |
+-----------+--------------+
1 row in set (0.00 sec)

Given a table of numbers, can I get all the rows which add up to less than or equal to a number?

Say I have a table with an incrementing id column and a random positive non zero number.
id   rand
1    12
2    5
3    99
4    87
Write a query to return the rows which add up to a given number.
A couple rules:
Rows must be "consumed" in order, even if a later row makes it a perfect match. For example, querying for 104 would be a perfect match for rows 1, 2, and 4, but rows 1-3 would still be returned.
You can use a row partially if it holds more than is needed to reach whatever is left of the number. E.g. rows 1, 2, and 3 would be returned if your max number is 50, because 12 + 5 + 33 equals 50; only 33 of row 3's 99 is consumed, so it is a partial result.
If there are not enough rows to satisfy the amount, then return ALL the rows. E.g. in the above example a query for 1,000 would return rows 1-4. In other words, the sum of the rows should be less than or equal to the queried number.
It's possible for the answer to be "no this is not possible with SQL alone" and that's fine but I was just curious. This would be a trivial problem with a programming language but I was wondering what SQL provides out of the box to do something as a thought experiment and learning exercise.
You didn't mention which RDBMS, but assuming SQL Server:
DROP TABLE #t;
CREATE TABLE #t (id int, rand int);
INSERT INTO #t (id,rand)
VALUES (1,12),(2,5),(3,99),(4,87);
DECLARE @target int = 104;
WITH dat
AS
(
SELECT id, rand, SUM(rand) OVER (ORDER BY id) as runsum
FROM #t
),
dat2
as
(
SELECT id, rand
, runsum
, COALESCE(LAG(runsum,1) OVER (ORDER BY id),0) as prev_runsum
from dat
)
SELECT id, rand
FROM dat2
WHERE @target >= runsum
OR @target BETWEEN prev_runsum AND runsum;
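With the sample data and @target = 104, the running sums are 12, 17, 116 and 203, so rows 1 and 2 pass the first predicate and row 3 passes the BETWEEN predicate as the partially consumed row:
id | rand
1  | 12
2  | 5
3  | 99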

Using "order by" and fetch inside a union in SQL on as400 database

Let's say I have this table
Table name: Traffic
Seq. Type Amount
1 in 10
2 out 30
3 in 50
4 out 70
What I need is to get the previous smaller and next larger amount of a value. So, if I have 40 as a value, I will get...
Table name: Traffic
Seq. Type Amount
2 out 30
3 in 50
I already tried doing it with MySQL and am quite satisfied with the results
(select * from Traffic where
Amount < 40 order by Amount desc limit 1)
union
(select * from Traffic where
Amount > 40 order by Amount asc limit 1)
The problem lies in converting it to an SQL statement acceptable to the AS400. It appears that order by and fetch (AS400 doesn't have a limit clause, so we use fetch - or does it?) are not allowed inside a select statement when I use it with a union. I always get a "keyword not expected" error. Here is my statement:
(select seq as sequence, type as status, amount as price from Traffic where
Amount < 40 order by price asc fetch first 1 rows only)
union
(select seq as sequence, type as status, amount as price from Traffic where
Amount > 40 order by price asc fetch first 1 rows only)
Can anyone please tell me what's wrong and how it should be? Also, please share if you know other ways to achieve my desired result.
How about a CTE? From memory (no machine to test with):
with
less as (select * from traffic where amount < 40),
more as (select * from traffic where amount > 40)
select * from traffic
where seq = (select seq from less where amount = (select max(amount) from less))
or seq = (select seq from more where amount = (select min(amount) from more))
I looked at this question from possibly another point of view. I have seen other questions about date-time ranges between rows, and I thought perhaps what you might be trying to do is establish what range some value might fall in.
If working with these ranges will be a recurring theme, then you might want to create a view for it.
create or replace view traffic_ranges as
with sorted as
( select t.*
, smallint(row_number() over (order by amount)) as pos
from traffic t
)
select b.pos range_seq
, b.seq beg_seq
, e.seq end_seq
, b.type beg_type
, e.type end_type
, b.amount beg_amt
, e.amount end_amt
from sorted b
join sorted e on e.pos = b.pos+1
;
Once you have this view, it becomes very simple to get your answer:
select *
from traffic_ranges
where 40 between beg_amt and end_amt
Or to get only one range where the search amount happens to be an amount in your base table, you would want to pick whether to include the beginning value or ending value as part of the range, and exclude the other:
where beg_amt < 40 and end_amt >= 40
One advantage of this approach is performance. If you are finding the range for multiple values, such as a column in a table or query, then having the range view should give you significantly better performance than a query where you must aggregate all the records that are more or less than each search value.
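For example, classifying a whole set of amounts at once could look like this (search_values is a hypothetical table holding the amounts to look up):
select s.amount, r.range_seq, r.beg_amt, r.end_amt
from search_values s
join traffic_ranges r
on s.amount > r.beg_amt
and s.amount <= r.end_amt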
Here's my query using CTE and union inspired by Buck Calabro's answer. Credits go to him and WarrenT for being SQL geniuses!
I won't be accepting my own answer. That will be unfair. hehe
with
apple(seq, type, amount) as (select seq, type, amount from traffic where amount < 40
order by amount desc fetch first 1 rows only),
banana(seq, type, amount) as (select seq, type, amount from traffic where
amount > 40 order by amount asc fetch first 1 rows only)
select * from apple
union
select * from banana
It's a bit slow but I can accept that since I'll only use it once in the program.
This is just a sample. The actual query is a bit different.

PostgreSQL - fetch the rows which have the Max value for a column in each GROUP BY group

I'm dealing with a Postgres table (called "lives") that contains records with columns for time_stamp, usr_id, transaction_id, and lives_remaining. I need a query that will give me the most recent lives_remaining total for each usr_id
There are multiple users (distinct usr_id's)
time_stamp is not a unique identifier: sometimes user events (one per row in the table) will occur with the same time_stamp.
trans_id is unique only for very small time ranges: over time it repeats
lives_remaining (for a given user) can both increase and decrease over time
example:
time_stamp|lives_remaining|usr_id|trans_id
-----------------------------------------
07:00 | 1 | 1 | 1
09:00 | 4 | 2 | 2
10:00 | 2 | 3 | 3
10:00 | 1 | 2 | 4
11:00 | 4 | 1 | 5
11:00 | 3 | 1 | 6
13:00 | 3 | 3 | 1
As I will need to access other columns of the row with the latest data for each given usr_id, I need a query that gives a result like this:
time_stamp|lives_remaining|usr_id|trans_id
-----------------------------------------
11:00 | 3 | 1 | 6
10:00 | 1 | 2 | 4
13:00 | 3 | 3 | 1
As mentioned, each usr_id can gain or lose lives, and sometimes these timestamped events occur so close together that they have the same timestamp! Therefore this query won't work:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp) AS max_timestamp
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp = b.time_stamp
Instead, I need to use both time_stamp (first) and trans_id (second) to identify the correct row. I also then need to pass that information from the subquery to the main query that will provide the data for the other columns of the appropriate rows. This is the hacked up query that I've gotten to work:
SELECT b.time_stamp,b.lives_remaining,b.usr_id,b.trans_id FROM
(SELECT usr_id, max(time_stamp || '*' || trans_id)
AS max_timestamp_transid
FROM lives GROUP BY usr_id ORDER BY usr_id) a
JOIN lives b ON a.max_timestamp_transid = b.time_stamp || '*' || b.trans_id
ORDER BY b.usr_id
Okay, so this works, but I don't like it. It requires a query within a query and a self join, and it seems to me that it could be much simpler by grabbing the row that MAX found to have the largest timestamp and trans_id. The table "lives" has tens of millions of rows to parse, so I'd like this query to be as fast and efficient as possible. I'm new to RDBMSs, and Postgres in particular, so I know that I need to make effective use of the proper indexes. I'm a bit lost on how to optimize.
I found a similar discussion here. Can I perform some type of Postgres equivalent to an Oracle analytic function?
Any advice on accessing related column information used by an aggregate function (like MAX), creating indexes, and creating better queries would be much appreciated!
P.S. You can use the following to create my example case:
create TABLE lives (time_stamp timestamp, lives_remaining integer,
usr_id integer, trans_id integer);
insert into lives values ('2000-01-01 07:00', 1, 1, 1);
insert into lives values ('2000-01-01 09:00', 4, 2, 2);
insert into lives values ('2000-01-01 10:00', 2, 3, 3);
insert into lives values ('2000-01-01 10:00', 1, 2, 4);
insert into lives values ('2000-01-01 11:00', 4, 1, 5);
insert into lives values ('2000-01-01 11:00', 3, 1, 6);
insert into lives values ('2000-01-01 13:00', 3, 3, 1);
I would propose a clean version based on DISTINCT ON (see docs):
SELECT DISTINCT ON (usr_id)
time_stamp,
lives_remaining,
usr_id,
trans_id
FROM lives
ORDER BY usr_id, time_stamp DESC, trans_id DESC;
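If this runs frequently, an index matching the sort order should let Postgres pick each user's latest row without a full sort (the index name is arbitrary; verify the plan with EXPLAIN):
CREATE INDEX lives_usr_latest_idx ON lives (usr_id, time_stamp DESC, trans_id DESC);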
On a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id uniformly distributed between 0 and 30):
By query cost, below, I am referring to Postgres' cost-based optimizer's cost estimate (with Postgres' default xxx_cost values), which is a weighted estimate of required I/O and CPU resources; you can obtain this by firing up PgAdminIII and running "Query/Explain (F7)" on the query with "Query/Explain options" set to "Analyze".
Quassnoi's query has a cost estimate of 745k (!), and completes in 1.3 seconds (given a compound index on (usr_id, trans_id, time_stamp))
Bill's query has a cost estimate of 93k, and completes in 2.9 seconds (given a compound index on (usr_id, trans_id))
Query #1 below has a cost estimate of 16k, and completes in 800ms (given a compound index on (usr_id, trans_id, time_stamp))
Query #2 below has a cost estimate of 14k, and completes in 800ms (given a compound function index on (usr_id, EXTRACT(EPOCH FROM time_stamp), trans_id))
this is Postgres-specific
Query #3 below (Postgres 8.4+) has a cost estimate and completion time comparable to (or better than) query #2 (given a compound index on (usr_id, time_stamp, trans_id)); it has the advantage of scanning the lives table only once and, should you temporarily increase (if needed) work_mem to accommodate the sort in memory, it will be by far the fastest of all queries.
All times above include retrieval of the full 10k rows result-set.
Your goal is minimal cost estimate and minimal query execution time, with an emphasis on estimated cost. Query execution can depend significantly on runtime conditions (e.g. whether relevant rows are already fully cached in memory or not), whereas the cost estimate is not. On the other hand, keep in mind that a cost estimate is exactly that, an estimate.
The best query execution time is obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC.) Query time will vary in production based on actual machine load/data access spread. When one query appears slightly faster (<20%) than the other but has a much higher cost, it will generally be wiser to choose the one with higher execution time but lower cost.
When you expect that there will be no competition for memory on your production machine at the time the query is run (e.g. the RDBMS cache and filesystem cache won't be thrashed by concurrent queries and/or filesystem activity) then the query time you obtained in standalone (e.g. pgAdminIII on a development PC) mode will be representative. If there is contention on the production system, query time will degrade proportionally to the estimated cost ratio, as the query with the lower cost does not rely as much on cache whereas the query with higher cost will revisit the same data over and over (triggering additional I/O in the absence of a stable cache), e.g.:
cost | time (dedicated machine) | time (under load) |
-------------------+--------------------------+-----------------------+
some query A: 5k | (all data cached) 900ms | (less i/o) 1000ms |
some query B: 50k | (all data cached) 900ms | (lots of i/o) 10000ms |
Do not forget to run ANALYZE lives once after creating the necessary indices.
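For reference, two of the indexes referred to above might be created like this (index names are arbitrary; the first serves query #3, the second is the functional index for query #2 and needs extra parentheses around the expression):
CREATE INDEX lives_usr_ts_trans_idx ON lives (usr_id, time_stamp, trans_id);
CREATE INDEX lives_usr_epoch_trans_idx
ON lives (usr_id, (EXTRACT(EPOCH FROM time_stamp)), trans_id);
ANALYZE lives;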
Query #1
-- incrementally narrow down the result set via inner joins
-- the CBO may elect to perform one full index scan combined
-- with cascading index lookups, or as hash aggregates terminated
-- by one nested index lookup into lives - on my machine
-- the latter query plan was selected given my memory settings and
-- histogram
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(time_stamp) AS time_stamp_max
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
l1.time_stamp = l2.time_stamp_max
INNER JOIN (
SELECT
usr_id,
time_stamp,
MAX(trans_id) AS trans_max
FROM
lives
GROUP BY
usr_id, time_stamp
) AS l3
ON
l1.usr_id = l3.usr_id AND
l1.time_stamp = l3.time_stamp AND
l1.trans_id = l3.trans_max
Query #2
-- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
-- this results in a single table scan and one nested index lookup into lives,
-- by far the least I/O intensive operation even in case of great scarcity
-- of memory (least reliant on cache for the best performance)
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp),trans_id])
AS compound_time_stamp
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
l1.trans_id = l2.compound_time_stamp[2]
2013/01/29 update
Finally, as of version 8.4, Postgres supports window functions, meaning you can write something as simple and efficient as:
Query #3
-- use Window Functions
-- performs a SINGLE scan of the table
SELECT DISTINCT ON (usr_id)
last_value(time_stamp) OVER wnd,
last_value(lives_remaining) OVER wnd,
usr_id,
last_value(trans_id) OVER wnd
FROM lives
WINDOW wnd AS (
PARTITION BY usr_id ORDER BY time_stamp, trans_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
Here's another method, which happens to use no correlated subqueries or GROUP BY. I'm not expert in PostgreSQL performance tuning, so I suggest you try both this and the solutions given by other folks to see which works better for you.
SELECT l1.*
FROM lives l1 LEFT OUTER JOIN lives l2
ON (l1.usr_id = l2.usr_id AND (l1.time_stamp < l2.time_stamp
OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
WHERE l2.usr_id IS NULL
ORDER BY l1.usr_id;
I am assuming that trans_id is unique at least over any given value of time_stamp.
Another option in PostgreSQL is DISTINCT ON:
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
It eliminates duplicate rows and leaves only the first row of each set, as defined by the ORDER BY clause.
see the official documentation
I like the style of Mike Woodhouse's answer on the other page you mentioned. It's especially concise when the thing being maximised over is just a single column: the subquery can simply use MAX(some_col) and GROUP BY the other columns. In your case you have a 2-part quantity to be maximised, but you can still do it by using ORDER BY plus LIMIT 1 instead (as done by Quassnoi):
SELECT *
FROM lives o
WHERE (usr_id, time_stamp, trans_id) IN (
SELECT usr_id, time_stamp, trans_id
FROM lives sq
WHERE sq.usr_id = o.usr_id
ORDER BY time_stamp DESC, trans_id DESC
LIMIT 1
)
I find using the row-constructor syntax WHERE (a, b, c) IN (subquery) nice because it cuts down on the amount of verbiage needed.
Actually there's a hacky solution for this problem. Let's say you want to select the biggest tree of each forest in a region.
SELECT (array_agg(tree.id ORDER BY tree.size DESC))[1]
FROM tree JOIN forest ON (tree.forest = forest.id)
GROUP BY forest.id
When you group trees by forest you get an unsorted list of trees and you need to find the biggest one. First, sort the rows by their sizes and select the first one in the list. It may seem inefficient, but if you have millions of rows it can be quite a bit faster than solutions that involve JOINs and WHERE conditions.
BTW, note that ORDER BY for array_agg was introduced in PostgreSQL 9.0
You can do it with window functions
SELECT t.*
FROM
(SELECT
*,
ROW_NUMBER() OVER(PARTITION BY usr_id ORDER BY time_stamp DESC) as r
FROM lives) as t
WHERE t.r = 1
SELECT l.*
FROM (
SELECT DISTINCT usr_id
FROM lives
) lo, lives l
WHERE l.ctid = (
SELECT ctid
FROM lives li
WHERE li.usr_id = lo.usr_id
ORDER BY
time_stamp DESC, trans_id DESC
LIMIT 1
)
Creating an index on (usr_id, time_stamp, trans_id) will greatly improve this query.
You should always, always have some kind of PRIMARY KEY in your tables.
I think you've got one major problem here: there's no monotonically increasing "counter" to guarantee that a given row has happened later in time than another. Take this example:
timestamp lives_remaining user_id trans_id
10:00 4 3 5
10:00 5 3 6
10:00 3 3 1
10:00 2 3 2
You cannot determine from this data which is the most recent entry. Is it the second one or the last one? There is no sort or max() function you can apply to any of this data to give you the correct answer.
Increasing the resolution of the timestamp would be a huge help. Since the database engine serializes requests, with sufficient resolution you can guarantee that no two timestamps will be the same.
Alternatively, use a trans_id that won't roll over for a very, very long time. Having a trans_id that rolls over means you can't tell (for the same timestamp) whether trans_id 6 is more recent than trans_id 1 unless you do some complicated math.

Ranking in MySQL, how do I get the best performance with frequent updates and a large data set?

I want grouped ranking on a very large table. I've found a couple of solutions to this problem, e.g. in this post and other places on the web. I am, however, unable to figure out the worst-case complexity of these solutions. The specific problem consists of a table where each row has a number of points and a name associated. I want to be able to request rank intervals such as 1-4. Here are some data examples:
name | points
Ab   | 14
Ac   | 14
B    | 16
C    | 16
Da   | 15
De   | 13
With these values the following "ranking" is created:
Query id | Rank | Name
1        | 1    | B
2        | 1    | C
3        | 3    | Da
4        | 4    | Ab
5        | 4    | Ac
6        | 6    | De
And it should be possible to request the following interval on query ids: 2-5, giving ranks 1, 3, 4 and 4.
The database holds about 3 million records, so if possible I want to avoid a solution with complexity greater than log(n). There are constant updates and inserts on the database, so these actions should preferably be performed in log(n) complexity as well. I am not sure it's possible though, and I've tried wrapping my head around it for some time. I've come to the conclusion that a binary search should be possible, but I haven't been able to create a query that does this. I am using a MySQL server.
I will elaborate on how the pseudo code for the filtering could work. Firstly, an index on (points, name) is needed. As input you give a fromrank and a tillrank. The total number of records in the database is n. The pseudocode should look something like this:
Find median point value, count rows less than this value (the count gives a rough estimate of rank, not considering those with same amount of points). If the number returned is greater than the fromrank delimiter, we subdivide the first half and find median of it. We keep doing this until we are pinpointed to the amount of points where fromrank should start. then we do the same within that amount of points with the name index, and find median until we have reached the correct row. We do the exact same thing for tillrank.
The result should be log(n) subdivisions. So, given that the median and count can be computed in log(n) time, it should be possible to solve the problem in worst-case complexity log(n). Correct me if I am wrong.
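A single probe of that subdivision might look like the following (the table name scores and the :candidate placeholder are illustrative assumptions, not part of any schema above):
-- how many rows fall strictly below the current median guess;
-- with an index on (points, name) this is an index range scan
SELECT COUNT(*) AS rows_below
FROM scores
WHERE points < :candidate;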
You need a stored procedure to be able to call this with parameters:
CREATE TABLE rank (name VARCHAR(20) NOT NULL, points INTEGER NOT NULL);
CREATE INDEX ix_rank_points ON rank(points, name);
CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
SET @fromrank = fromrank;
SET @tillrank = tillrank;
PREPARE STMT FROM
'
SELECT rn, rank, name, points
FROM (
SELECT CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
@rn := @rn + 1 AS rn,
@cp := points,
r.*
FROM (
SELECT @cp := -1, @rn := 0, @rank := 1
) var,
(
SELECT *
FROM rank
FORCE INDEX (ix_rank_points)
ORDER BY
points DESC, name DESC
LIMIT ?
) r
) o
WHERE rn >= ?
';
EXECUTE STMT USING @tillrank, @fromrank;
END;
CALL prc_ranks (2, 5);
If you create the index and force MySQL to use it (as in my query), then the complexity of the query will not depend on the number of rows at all; it will depend only on tillrank.
It will actually take the last tillrank values from the index, perform some simple calculations on them and filter out the first fromrank values.
The time of this operation, as you can see, depends only on tillrank; it does not depend on how many records there are.
I just checked it on 400,000 rows; it selects ranks from 5 to 100 in 0.004 seconds (that is, instantly).
Important: this only works if you sort names in DESCENDING order. MySQL does not support a DESC clause in indices, which means points and name must be sorted in the same direction for the index sort to be usable (either both ASCENDING or both DESCENDING). If you want fast ASC sorting by name, you will need to keep negated points in the database and change the sign in the SELECT clause (see the sketch below).
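A sketch of that workaround (neg_points is a hypothetical column that stores -points):
CREATE INDEX ix_rank_neg_points ON rank(neg_points, name);
-- ascending neg_points equals descending points, so the index order now
-- matches ORDER BY points DESC, name ASC; flip the sign on output:
SELECT name, -neg_points AS points
FROM rank
ORDER BY neg_points, name;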
You may also remove name from the index altogether and perform the final ordering without an index:
CREATE INDEX ix_rank_points ON rank(points);
CREATE PROCEDURE prc_ranks(fromrank INT, tillrank INT)
BEGIN
SET @fromrank = fromrank;
SET @tillrank = tillrank;
PREPARE STMT FROM
'
SELECT rn, rank, name, points
FROM (
SELECT CASE WHEN @cp = points THEN @rank ELSE @rank := @rn + 1 END AS rank,
@rn := @rn + 1 AS rn,
@cp := points,
r.*
FROM (
SELECT @cp := -1, @rn := 0, @rank := 1
) var,
(
SELECT *
FROM rank
FORCE INDEX (ix_rank_points)
ORDER BY
points DESC
LIMIT ?
) r
) o
WHERE rn >= ?
ORDER BY rank, name
';
EXECUTE STMT USING @tillrank, @fromrank;
END;
That will impact performance on big ranges, but you will hardly notice it on small ranges.