How to consolidate blocks of time? - sql

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
ID INT
, TimeFrom INT
, TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where one record's TimeTo extends past the TimeFrom of another record (i.e. the intervals overlap):
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 |       10 |     30 |
| 10 |       50 |     70 |
| 10 |       60 |    150 |
| 10 |       75 |    150 |
| .. |      ... |    ... |
+----+----------+--------+
The result set is meant to be a flattened linear idle report, but with too many of these overlaps I end up with negative time in use. For example, if the window above for ID = 10 was 150 seconds long and I summed the differences of relative seconds to subtract from the window size, I'd wind up with 150 - (20 + 20 + 90 + 75) = -55. I've tried this approach, and it's what led me to realize there were overlaps that needed to be flattened.
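For reference, the naive subtraction described above would look something like this (a sketch; the 150-second window length is taken from the example):
SELECT 150 - SUM(TimeTo - TimeFrom) AS IdleSeconds -- 150 - 205 = -55 here, because the intervals overlap
FROM Times
WHERE ID = 10;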
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 |       10 |     30 |
| 10 |       50 |    150 |
| .. |      ... |    ... |
+----+----------+--------+
Considerations: Performance is very important here, as this is part of a larger query that performs well on its own, and I'd rather not impact its performance much if I can help it.
Regarding a comment about "Which seconds have an interval": this is something I have tried for the end result, and I am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
    SELECT A.N, ROW_NUMBER() OVER (ORDER BY A.N) RowID
    FROM (SELECT TOP 60 1 N FROM master..spt_values) A
       , (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
    SELECT 1
    FROM Times SE
    WHERE SE.ID = 10
      AND SE.TimeFrom <= C.RowID
      AND SE.TimeTo >= C.RowID
      AND EXISTS (
            SELECT 1
            FROM Times2 D
            WHERE D.ID = SE.ID
              AND D.TimeFrom <= C.RowID
              AND D.TimeTo >= C.RowID
      )
    GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate why getting around this approach is for the best. Even with the Row Count Spool reducing the cost of the query by quite a bit, its execution count increases the cost of the query as a whole by quite a bit as well.
Further Edit: The end goal is to put this in a procedure, so Table Variables and Temp Tables are also a possible tool to use.
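For reference, the flattening itself is a classic gaps-and-islands problem; a minimal window-function sketch (assuming SQL Server 2012+ for the window frame syntax) could look like this:
WITH Marked AS (
    SELECT ID, TimeFrom, TimeTo,
           CASE WHEN TimeFrom <= MAX(TimeTo) OVER (PARTITION BY ID
                                                   ORDER BY TimeFrom, TimeTo
                                                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 0 ELSE 1 END AS IsStart -- 1 = this row starts a new, non-overlapping block
    FROM Times
), Grouped AS (
    SELECT *,
           SUM(IsStart) OVER (PARTITION BY ID ORDER BY TimeFrom, TimeTo
                              ROWS UNBOUNDED PRECEDING) AS Grp -- running count of block starts = block number
    FROM Marked
)
SELECT ID, MIN(TimeFrom) AS TimeFrom, MAX(TimeTo) AS TimeTo
FROM Grouped
GROUP BY ID, Grp;
For the sample data this returns the two rows (10, 10, 30) and (10, 50, 150) shown above.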

OK. I'm still trying to do this with just one SELECT, but this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)

INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY Id, TimeFrom

DECLARE @timeTo int, @id int, @groupId int
SET @groupId = 0

UPDATE @tmp
SET
    @groupId = CASE WHEN id != @id THEN 0
                    WHEN TimeFrom > @timeTo THEN @groupId + 1
                    ELSE @groupId END,
    GroupId = @groupId,
    @timeTo = TimeTo,
    @id = id

SELECT Id, MIN(TimeFrom), MAX(TimeTo) FROM @tmp
GROUP BY ID, GroupId ORDER BY ID

Left join each row to its successor overlapping row on the same ID value (where one exists).
Now, for each row in the result set of LHS LEFT JOIN RHS, the contribution to the elapsed time for that ID is:
isnull(RHS.TimeFrom,LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer.
Note that:
- where there isn't an overlapping successor row the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row the calculation will net to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
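A minimal sketch of that idea in T-SQL, using OUTER APPLY to pick the nearest overlapping successor per row (the exact query shape is an assumption, not the answerer's code):
SELECT LHS.ID,
       SUM(ISNULL(RHS.TimeFrom, LHS.TimeTo) - LHS.TimeFrom) AS TimeElapsed
FROM Times AS LHS
OUTER APPLY (
    SELECT TOP 1 T.TimeFrom
    FROM Times AS T
    WHERE T.ID = LHS.ID
      AND T.TimeFrom > LHS.TimeFrom
      AND T.TimeFrom < LHS.TimeTo -- successor row overlaps LHS
    ORDER BY T.TimeFrom
) AS RHS
GROUP BY LHS.ID;
For the sample data this yields 120 elapsed seconds for ID 10 (20 + 10 + 15 + 75), which matches the flattened intervals 10-30 and 50-150.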

What about something like below (assumes SQL Server 2005+ due to the CTE):
WITH Overlaps AS
(
    SELECT t1.Id,
           TimeFrom = MIN(t1.TimeFrom),
           TimeTo   = MAX(t2.TimeTo)
    FROM dbo.Times t1
    INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
        AND t2.TimeFrom > t1.TimeFrom
        AND t2.TimeFrom < t1.TimeTo
    GROUP BY t1.Id
)
SELECT o.Id,
       o.TimeFrom,
       o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
       t.TimeFrom,
       t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
    AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with, but it seems decent on the smaller data sets I have.

I also wrapped my head around this issue, and in the end I found that the problem is your data.
You claim (if I get that right) that these entries should reflect the relative times at which a user goes idle / comes back.
So you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, the two lines:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 |       50 |     70 |
| 10 |       60 |    150 |
+----+----------+--------+
How can it be possible that a user is idle until second 70, but goes idle again at second 60? This already implies that he came back at around second 59 at the latest.
I can only assume that this issue comes from different threads and/or browser windows (tabs) the user might be using your application with (each having its own "idle detection").
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table at all? You could avoid this by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo
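For example (the constraint name is hypothetical; MySQL syntax to match the fiddle below):
ALTER TABLE Times ADD CONSTRAINT uq_times_id_timeto UNIQUE (ID, TimeTo);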
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID, TimeFrom, TimeTo) VALUES ('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever an comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
The fiddle linked here: http://sqlfiddle.com/#!2/dcb17/1 would reproduce the chain of events for your example, but resulting in a clean and logical set of idle-windows:
ID TIMEFROM TIMETO
10 10 30
10 50 70
10 75 150
Note: The output is slightly different from the output you desired, but I feel that this is more accurate, because of the reason outlined above: a user cannot go idle at second 70 without returning from his current idle state before that. He either STAYS idle (and a second thread/tab runs into the idle event), or he returned in between.
Especially given your need to maximize performance, you should fix the data and not invent a work-around query. This is maybe 3 ms upon inserts, but could be worth 20 seconds upon select!
Edit: if multi-threading / multiple sessions is the cause of the wrong insert, you would also need to implement a check whether most_recent_come_back_time < now() - idleTimeout; otherwise a user might come back on tab1 and be recorded as idle on tab2 a few seconds later, because tab2 ran into its idle timeout, because the user only refreshed tab1.

I had the 'same' problem once with 'days' (additionally, without counting weekends and holidays).
The word 'counting' gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
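If you'd rather not type all the values out, the Seconds table can also be filled with a number generator; a sketch using the same master..spt_values trick from the question (SQL Server; 172,800 rows covers two days of seconds):
INSERT INTO Seconds (sec)
SELECT TOP (172800) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM master..spt_values a
CROSS JOIN master..spt_values b;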
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom and t.timeto-1
and id=10;
You can shift the start to 0 (I put the '10' here in parentheses):
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom- (10) and t.timeto- (10)-1
and id=10;
and finally:
select count(distinct sec) from times t, seconds s,
(select min(timefrom) m from times where id=10) as m
where s.sec between t.timefrom-m.m and t.timeto-m.m-1
and id=10;
Additionally, you can "ignore" e.g. 10 seconds by dividing; you lose some precision but gain speed:
select count(distinct sec)*d from times t, seconds s,
(select min(timefrom) m from times where id=10) as m,
(select 10 d) as d
where s.sec between (t.timefrom-m)/d and (t.timeto-m)/d-1
and id=10;
Sure, it depends on the range you have to look at, but a 'day' or two of seconds should work (although I did not test it).
fiddle ...

Related

PostgreSQL efficiently find last descendant in linear list

I'm currently trying to efficiently retrieve the last descendant from a linked-list-like structure.
Essentially there's a table with a data series; with certain criteria I split it up to get a list like this:
current_id | next_id
for example
1 | 2
2 | 3
3 | 4
4 | NULL
42 | 43
43 | 45
45 | NULL
etc...
would result in lists like
1 -> 2 -> 3 -> 4
and
42 -> 43 -> 45
Now I want to get the first and the last id from each of those lists.
This is what I have right now:
WITH RECURSIVE contract(rstart_ts, rend_ts) AS ( -- recursive query to traverse the "linked list" of continuous timestamps
    SELECT start_ts, end_ts
    FROM track_caps tc
    UNION
    SELECT c.rstart_ts, tc.end_ts AS end_ts0
    FROM contract c
    INNER JOIN track_caps tc
        ON (tc.start_ts = c.rend_ts AND c.rend_ts IS NOT NULL AND tc.end_ts IS NOT NULL)
),
fcontract AS ( -- final step: after traversing the "linked list", pick the largest timestamp found as the end_ts and the smallest as the start_ts
    SELECT DISTINCT ON (start_ts, end_ts) min(rstart_ts) AS start_ts, rend_ts AS end_ts
    FROM (
        SELECT rstart_ts, max(rend_ts) AS rend_ts
        FROM contract
        GROUP BY rstart_ts
    ) sq
    GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts
In this case I just used timestamps which work fine for the given data.
Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on StackOverflow and other sites. The next query removes all the sub-steps and returns what I want, like in the first list example: 1 | 4
Just for illustration, the produced result set by the recursive query looks like this:
1 | 2
2 | 3
3 | 4
1 | 3
2 | 4
1 | 4
As nicely as it works, it's quite a memory hog, however, which is absolutely unsurprising when looking at the results of EXPLAIN ANALYZE.
For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. It was actually supposed to process around 2,000,000 rows, but with that solution it currently seems very unfeasible.
Did I just improperly use recursive queries? Is there a way to reduce the amount of data it produces (like removing the sub-steps)?
Or are there better single-query solutions to this problem?
The main problem is that your recursive query doesn't properly filter the root nodes, which is caused by the model you have. So the non-recursive part already selects the entire table, and then Postgres needs to recurse for each and every row of the table.
To make that more efficient, only select the root nodes in the non-recursive part of your query. This can be done using:
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
                  from track_caps t2
                  where t2.next_id = t1.current_id)
Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least it makes sure the recursion doesn't need to process more rows than necessary.
To find the root node of each tree, just select that as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.
So you wind up with something like this:
with recursive contract as (
    select t1.current_id, t1.next_id, t1.current_id as root_id
    from track_caps t1
    where not exists (select *
                      from track_caps t2
                      where t2.next_id = t1.current_id)
    union
    select c.current_id, c.next_id, p.root_id
    from track_caps c
    join contract p on c.current_id = p.next_id and c.next_id is not null
)
select *
from contract
order by current_id;
Online example: http://rextester.com/DOABC98823
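To then read off the first and last id of each list, the final SELECT of the query above can be swapped for something like this (a sketch; the terminal row of a chain is the one whose next_id never appears as a current_id under the same root):
select c.root_id as first_id,
       coalesce(c.next_id, c.current_id) as last_id
from contract c
where not exists (select 1
                  from contract c2
                  where c2.root_id = c.root_id
                    and c2.current_id = c.next_id)
order by first_id;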

SQL Query to select certain number of records with WHERE clause

Let's say I have a table that looks like the following:
ID | EntityType | Foo  | Bar
---+------------+------+---------
 1 | Business   | test | test
 2 | Family     | welp | testing
 3 | Individual | hmm  | 100
 4 | Family     | test | test
 5 | Business   | welp | testing
 6 | Individual | hmm  | 100
This table is fairly large, and there are random (fairly infrequent) instances of "Business" in the EntityType column.
A query like
SELECT TOP 500 * FROM Records WHERE EntityType='Business' ORDER BY ID DESC
works perfectly for grabbing the first set of Businesses, now how would I page backwards and get the previous set of 500 records which meet my criteria?
I understand I could look at records between IDs, but there is no guarantee on what ID that would be, for example it wouldn't just be the last ID of the previous query minus 500 because the Business EntityType is so infrequent.
I've also looked at some paging models but I'm not sure how I can integrate them while keeping my WHERE clause just how it is (only accepting EntityType of Business) and guaranteeing 500 records (I've used one that "pages" back 500 records, and only shows about 18 businesses because they're within the 500 total records returned).
I appreciate any help on this matter!
select * from (
    select top 500 * from (
        select top 1000 * FROM Records WHERE EntityType='Business' ORDER BY ID DESC
    ) x
    order by id
) y
order by id desc
Innermost query - take the top 1000, to get page 2 and page 1 results
2nd level query - take the page 2 records from the first query
outermost - reorder the results
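On SQL Server 2012 and later, the same idea generalizes to any page with OFFSET/FETCH (a sketch; @PageNumber and @PageSize are hypothetical parameters, with page 1 being the most recent 500 businesses):
DECLARE @PageNumber INT = 2;
DECLARE @PageSize INT = 500;

SELECT *
FROM Records
WHERE EntityType = 'Business'
ORDER BY ID DESC
OFFSET (@PageNumber - 1) * @PageSize ROWS
FETCH NEXT @PageSize ROWS ONLY;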
I believe what you need is called paging. There is a great article on paging in SQL Server on CodeGuru (I think it was mentioned here before):
http://www.codeguru.com/csharp/.net/net_data/article.php/c19611/Paging-in-SQL-Server-2005.htm
I think you will find everything you need there.
So, I'd do this slightly differently from the other answers. My query to always pull the last 500 rows, given a minimum row, would look like this, and it requires a row count.
Note that getting the row count outside the query makes it much easier to express in SQL syntax. I wish it weren't necessary.
Declare @row_min as integer
Declare @row_count as integer
set @row_min = 500
SELECT @row_count = COUNT(*) FROM Records WHERE EntityType='Business';

WITH MyCTE AS
(
    SELECT
        ID, EntityType, Foo, Bar,
        ROW_NUMBER() OVER (ORDER BY ID) AS 'RowNum'
    FROM Records
    WHERE EntityType='Business'
)
Select TOP 500 *, (Select Max(RowNum) From MyCTE) As RowMax
FROM MyCTE
WHERE EntityType='Business'
AND RowNum >
    Case sign(@row_count - 500 - @row_min)
        When -1 Then (@row_count - 500)
        ELSE @row_min
    end
AND RowNum <
    Case sign(@row_count - 500 - @row_min)
        When -1 Then (@row_count)
        ELSE @row_min + 500
    end

--Note: Debugging purposes.
select sign(@row_count - 500 - @row_min), (@row_count - 500 - @row_min), @row_count, @row_min

How to label a big set of “transitive groups” with a constraint?

EDIT after #NealB's solution: #NealB's solution is very, very fast compared with any other one, and renders this new question about "adding a constraint to improve performance" unnecessary. #NealB's solution needs no improvement; it has O(n) time and is very simple.
The problem of "labeling transitive groups with SQL" has an elegant solution using recursion and a CTE... But this solution consumes exponential time (!). I need to work with 10000 items: with 1000 items it needs 1 second, with 2000 it needs 1 day...
Constraint: in my case it is possible to break the problem into pieces of ~100 items or less, but only to select one group of ~10 items, and discard all the other ~90 labeled items...
Is there a generic algorithm to add and use this kind of "pre-selection", to reduce the quadratic, O(N^2), time? Perhaps, as shown by comments and #wildplasser, O(N log(N)) time; but I expect, with "pre-selection", to reduce it to O(N) time.
(EDIT)
I tried to use an alternative algorithm, but it needs some improvement to be used as a solution here; or, to really increase performance (to O(N) time), it needs to use the "pre-selection".
The "pre-selection" (constraint) is based on a "super-set grouping"... Starting from the original "How to label 'transitive groups' with SQL?" question's t1 table,
table T1
(original T1 augmented by a "super-set grouping label" ssg, and one more row)
ID1 | ID2 | ssg
----+-----+----
  1 |   2 |  1
  1 |   5 |  1
  4 |   7 |  1
  7 |   8 |  1
  9 |   1 |  1
 10 |  11 |  2
So there are three groups,
g1: {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
g2: {4,7,8} because "4 t 7" and "7 t 8"
g3: {10,11} because "10 t 11"
The super-group is only an auxiliary grouping,
ssg1: {g1,g2}
ssg2: {g3}
If we have M super-group items and N total T1 items, the average group length will be less than N/M. We can suppose (for my typical problem) also that the ssg maximum length is ~N/M.
So, the "label algorithm" needs to run only M times with ~N/M items if it uses the ssg constraint.
An SQL-only solution appears to be a bit of a problem here. With the help of some procedural programming on top of SQL, the solution appears to be fairly simple and efficient. Here is a brief outline of a solution as it could be implemented using any procedural language invoking SQL.
Declare a table R with primary key ID, where ID corresponds to the same domain as ID1 and ID2 of table T1. Table R contains one other non-key column, a Label number.
Populate table R with the range of values found in T1. Set Label to zero (no label).
Using your example data, the initial setup for R would look like:
Table R
ID Label
== =====
1 0
2 0
4 0
5 0
7 0
8 0
9 0
Using a host-language cursor plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of four cases:
Case 1: ID1.Label == 0 and ID2.Label == 0
In this case neither one of these IDs has been "seen" before: Add 1 to the counter and then update both rows of R to the value of the counter: update R set R.Label = :counter where R.ID in (:ID1, :ID2)
Case 2: ID1.Label == 0 and ID2.Label <> 0
In this case, ID1 is new but ID2 has already been assigned a label. ID1 needs to be assigned the same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1
Case 3: ID1.Label <> 0 and ID2.Label == 0
In this case, ID2 is new but ID1 has already been assigned a label. ID2 needs to be assigned the same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2
Case 4: ID1.Label <> 0 and ID2.Label <> 0
In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not, there is some sort of data integrity problem. Ahhhh... not quite, see edit...
EDIT I just realized that there are situations where both Label values here could be non-zero and different. If both are non-zero and different then two Label groups need to be merged at this point. All you need to do is choose one Label and update the others to match, with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged under the same Label value.
Upon completion of the cursor, table R will contain Label values needed to update T2.
Table R
ID Label
== =====
1 1
2 1
4 2
5 1
7 2
8 2
9 1
Process table T2 using something along the lines of: update T2 set T2.Label = R.Label where T2.ID1 = R.ID. The end result should be:
table T2
ID1 | ID2 | LABEL
1 | 2 | 1
1 | 5 | 1
4 | 7 | 2
7 | 8 | 2
9 | 1 | 1
This process is purely iterative and should scale to fairly large tables without difficulty.
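As a rough plpgsql sketch of this outline (the table names t1(id1, id2) and r(id, label) and the function name are illustrative, not taken from the answer):
CREATE OR REPLACE FUNCTION label_groups() RETURNS void AS $$
DECLARE
    rec record;
    l1 integer;
    l2 integer;
    counter integer := 0;
BEGIN
    -- populate R with every id seen in T1; label 0 means "not labeled yet"
    INSERT INTO r (id, label)
    SELECT x.id, 0
    FROM (SELECT id1 AS id FROM t1 UNION SELECT id2 FROM t1) x;

    FOR rec IN SELECT id1, id2 FROM t1 LOOP
        SELECT label INTO l1 FROM r WHERE id = rec.id1;
        SELECT label INTO l2 FROM r WHERE id = rec.id2;

        IF l1 = 0 AND l2 = 0 THEN           -- case 1: both new
            counter := counter + 1;
            UPDATE r SET label = counter WHERE id IN (rec.id1, rec.id2);
        ELSIF l1 = 0 THEN                   -- case 2: only ID1 new
            UPDATE r SET label = l2 WHERE id = rec.id1;
        ELSIF l2 = 0 THEN                   -- case 3: only ID2 new
            UPDATE r SET label = l1 WHERE id = rec.id2;
        ELSIF l1 <> l2 THEN                 -- case 4 (see edit): merge the two label groups
            UPDATE r SET label = l1 WHERE label = l2;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;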
I suggest you check this out and use some general-purpose language for solving it:
http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Traverse the graph, maybe run DFS or BFS from each node, then use this disjoint-set hint. I think this should work.
The #NealB solution is the fastest(!) See an example of a PostgreSQL implementation here.
Below is an example of another "brute force algorithm", only for curiosity!
As #peter.petrov and #RBarryYoung suggested, some performance problems can be avoided by abandoning the CTE recursion... I fixed some issues in the basic labeler, and, above it, I added the constraint for grouping by a super-set label. This new transgroup1_loop() function is working!
PS: this solution still has performance limitations; please post your answer with a better one, or with some adaptation of this one.
-- DROP table transgroup1;
CREATE TABLE transgroup1 (
id serial NOT NULL PRIMARY KEY,
items integer[], -- two or more items in the transitive relationship
ssg_label varchar(12), -- the super-set grouping label
dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items,ssg_label) values
(array[1, 2],'1'),
(array[1, 5],'1'),
(array[4, 7],'1'),
(array[7, 8],'1'),
(array[9, 1],'1'),
(array[10, 11],'2');
-- or SELECT array[id1, id2],ssg_label FROM t1, with 10000 items
Then, with these two functions, we can solve the problem:
CREATE FUNCTION transgroup1_loop(p_ssg varchar, p_max_i integer DEFAULT 100)
RETURNS integer AS $funcBody$
DECLARE
cp_dels integer[];
i integer;
BEGIN
i:=1;
LOOP
UPDATE transgroup1
SET items = array_uunion(transgroup1.items,t2.items),
dels = transgroup1.dels || t2.id
FROM transgroup1 AS t1, transgroup1 AS t2
WHERE transgroup1.id=t1.id AND t1.ssg_label=$1 AND
t1.id>t2.id AND t1.items && t2.items;
cp_dels := array(
SELECT DISTINCT unnest(dels) FROM transgroup1
); -- ensures all items to delete
RAISE NOTICE '-- bug, repeting dels, item-%; % dels! %', i, array_length(cp_dels,1), array_to_string(cp_dels,';','*');
EXIT WHEN i>p_max_i OR array_length(cp_dels,1)=0;
DELETE FROM transgroup1
WHERE ssg_label=$1 AND id IN (SELECT unnest(cp_dels));
UPDATE transgroup1 SET dels=array[]::integer[];
i:=i+1;
END LOOP;
UPDATE transgroup1 -- only to beautify
SET items = ARRAY(SELECT unnest(items) ORDER BY 1 desc);
RETURN i;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;
To run it and see the results, you can use:
SELECT transgroup1_loop('1'); -- run with ssg-1 items only
SELECT transgroup1_loop('2'); -- run with ssg-2 items only
-- show all with a sequential group label:
SELECT *, dense_rank() over (ORDER BY id) AS group_label from transgroup1;
results:
id | items | ssg_label | dels | group_label
----+-----------+-----------+------+-------------
4 | {8,7,4} | 1 | {} | 1
5 | {9,5,2,1} | 1 | {} | 2
6 | {11,10} | 2 | {} | 3
PS: the function array_uunion() is the same as in the original,
CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
-- ensures distinct items of a concatenation
SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql immutable;

How to generate records and spread them among pairs from a table?

I have to generate about a million random trips between about 40K destinations. Each destination has its own weight (total_probability); the higher it is, the more trips should start or end in that place.
Either the trips should be generated randomly but with destinations (start and end points) weighted by probability, or it's possible to just pre-calculate an exact number of trips (divide each weight by the sum of weights, multiply by 1M and round to integers).
Problem is how to make it in PostgreSQL without generating the 40K*40K table with all destinations pairs.
Table "public.dests"
Column | Type | Modifiers
-------------------+------------------+-----------
id | integer |
total_probability | double precision |
Table "public.trips"
Column | Type | Modifiers
------------+------------------+-----------
from_id | integer |
to_id | integer |
trips_num | integer |
...
some other metrics...
primary key for trips is (from_id, to_id)
Should I generate a table with 1M records and then update it iteratively, or will a for loop with 1M inserts be fast enough? I work on a 2-core lightweight laptop.
P.S. I gave up and did this in Python. To perform the set of queries and the transformation in Python, I'll run the SQL scripts from Python rather than from a shell script. Thanks for the suggestions!
In 9.1, you can use TRIGGERs on VIEWs, which effectively lets you create materialized views (albeit manually). I think your first run may be expensive, and using a loop is probably the way to go, but after that I'd use a series of TRIGGERs to maintain the data in a table.
At the end of the day you need to decide whether you want to calculate the results for every query, or memoize the result via a materialized view.
I'm confused by your requirement, but I guess this can get you started:
select
    f.id as "from", t.id as to,
    f.total_prob as from_prob, t.total_prob as to_prob
from
(
    select id, total_prob
    from dests
    order by random()
    limit 1010
) f
inner join
(
    select id, total_prob
    from dests
    order by random()
    limit 1010
) t on f.id != t.id
order by random()
limit 1000000
;
EDIT:
This took about ten minutes on my not-that-modern desktop:
create table trips (from_id integer, to_id integer, trip_prob double precision);
insert into trips (from_id, to_id, trip_prob)
select
f.id, t.id, f.total_prob * t.total_prob
from
(
select id, total_prob
from dests
) f
inner join
(
select id, total_prob
from dests
) t on f.id != t.id
where random() <= f.total_prob * t.total_prob
order by random()
limit 1000000
;
alter table trips add primary key (from_id, to_id);
select * from trips limit 5;
from_id | to_id | trip_prob
---------+-------+--------------------
1 | 6 | 0.0728749980226821
1 | 11 | 0.239824750923743
1 | 14 | 0.235899211677577
1 | 15 | 0.176168172647811
1 | 17 | 0.19708509944588
(5 rows)

Is there a [straightforward] way to order results *first*, *then* group by another column, with SQL?

I see that in an SQL query, the GROUP BY has to precede the ORDER BY expression. Does this imply that ordering is done after grouping would have discarded identical rows?
Because I seem to need to order rows by a timestamp first, then discard the rows with identical timestamp. And I don't know how to accomplish this.
I am using MySQL 5.1.41.
Here is the definition of the table expressed with create table:
create table tbl
(
    A int,
    B timestamp
)
The data could be:
+-----+-----------+
|  A  |     B     |
+-----+-----------+
|  1  | today     |
|  1  | yesterday |
|  2  | yesterday |
|  2  | tomorrow  |
+-----+-----------+
The results of the query on the above table, which I am after, would be:
+-----+-----------+
|  A  |     B     |
+-----+-----------+
|  1  | today     |
|  2  | tomorrow  |
+-----+-----------+
Basically, I want the rows with the latest timestamp in column "B" (hence the mention of ORDER BY), and only one row for each value in column "A" (think DISTINCT or GROUP BY).
The actual problem behind the simplified example above:
In reality, I have two tables - users and payment_receipts:
create table users
(
phone_nr int(10) unsigned not null,
primary key (phone_nr)
)
create table payment_receipts
(
phone_nr int(10) unsigned not null,
payed_ts timestamp default current_timestamp not null,
payed_until_ts timestamp not null,
primary key (phone_nr, payed_ts, payed_until_ts)
)
The tables may include other columns but I omit these as irrelevant. Implementing a payment scheme, I have to send SMS to users across the cellular network, in periodic intervals depending on whether the payment is due or not. The payment is actualized when the SMS is sent as the recipient is taxed for it. I use the payment_receipts table to keep records of all payments done, i.e. for book-keeping. This is intended to model a real shop where both the buyer and the seller get a copy of the receipt of purchase, for reference. This table stores my (seller's) copy [of each receipt]. The customer's receipt is the received SMS itself. Each time an SMS is sent (and thus a payment is accomplished), the table is inserted a receipt record, stating who paid, when and "until when". To explain the latter, imagine a subscription service, but one which spans indefinitely until the user opt-out explicitly, at which point the corresponding user record is removed. A payment is made a month in advance, so as a rule, the difference between the payed_ts and payed_until_ts is 30 days worth of time.
I have a batch job that executes every day and needs to select a list of users that are due monthly payment as part of the automatic subscription renewal described above. To link this to the dummy example earlier, the phone number column phone_nr would be the column "A" and payed_until_ts would be column "B", but in reality there are two tables, which has to do with the following behaviour: when a user record is removed, the receipt must remain, for book-keeping. So not only do I need to group payments by date and discard all but the latest payment receipt date, I also need to watch out not to select receipts for which there no longer is a matching user record.
To solve the problem of selecting required records -- those that are due payment -- I need to find receipts with the latest payed_until_ts timestamp for each phone_nr (there may be several, obviously) and out of those records I further need to select only those phone numbers where payed_until_ts is earlier than the time the batch job executes. I then would send an SMS to each of these numbers, inserting a receipt record for each sent SMS, where payed_ts is now() and payed_until_ts is now() + interval 30 days.
But I can't seem to come up with the query required.
Select a,b from (select a,b from table order by b) as c group by a;
Yes, grouping is done first, and it affects a single select whereas ordering affects all the results from all select statements in a union, such as:
select a, 'max', max(b) from tbl group by a
union all select a, 'min', min(b) from tbl group by a
order by 1, 2
(using field numbers in order by since I couldn't be bothered to name my columns). Each group by affects only its select, the order by affects the combined result set.
It seems that what you're after can be achieved with:
select A, max(B) from tbl group by A
This uses the max aggregation function to basically do your pre-group ordering (it doesn't actually sort in any decent DBMS; rather, it will simply choose the maximum from a suitable index if available).
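For instance, an index like the one below (the name is hypothetical) lets the engine read MAX(B) per A straight off the index:
CREATE INDEX idx_tbl_a_b ON tbl (A, B);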
SELECT DISTINCT a,b
FROM tbl t
WHERE b = (SELECT MAX(b) FROM tbl WHERE tbl.a = t.a);
According to your new rules (tested with PostgreSQL)
Query You'd Want:
SELECT pr.phone_nr, pr.payed_ts, pr.payed_until_ts
FROM payment_receipts pr
JOIN users
ON (pr.phone_nr = users.phone_nr)
JOIN (select phone_nr, max(payed_until_ts) as payed_until_ts
from payment_receipts
group by phone_nr
) sub
ON ( pr.phone_nr = sub.phone_nr
AND pr.payed_until_ts = sub.payed_until_ts)
ORDER BY pr.phone_nr, pr.payed_ts, pr.payed_until_ts;
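To further restrict that to users who are actually due for renewal (latest payed_until_ts already in the past, per the batch-job requirement above), a follow-up sketch would be:
SELECT u.phone_nr, MAX(pr.payed_until_ts) AS latest_payed_until
FROM users u
JOIN payment_receipts pr ON pr.phone_nr = u.phone_nr
GROUP BY u.phone_nr
HAVING MAX(pr.payed_until_ts) < NOW();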
Original Answer (with updates):
CREATE TABLE foo (a NUMERIC, b TEXT, c DATE);
INSERT INTO foo VALUES
(1,'a','2010-07-30'),
(1,'b','2010-07-30'),
(1,'c','2010-07-31'),
(1,'d','2010-07-31'),
(1,'a','2010-07-29'),
(1,'c','2010-07-29'),
(2,'a','2010-07-29'),
(2,'a','2010-08-01');
-- table contents
SELECT * FROM foo ORDER BY c,a,b;
a | b | c
---+---+------------
1 | a | 2010-07-29
1 | c | 2010-07-29
2 | a | 2010-07-29
1 | a | 2010-07-30
1 | b | 2010-07-30
1 | c | 2010-07-31
1 | d | 2010-07-31
2 | a | 2010-08-01
-- The following solutions both retrieve records based on the latest date
-- they both return the same result set, solution 1 is faster, solution 2
-- is easier to read
-- Solution 1:
SELECT foo.a, foo.b, foo.c
FROM foo
JOIN (select a, max(c) as c from foo group by a) bar
ON (foo.a=bar.a and foo.c=bar.c)
ORDER BY foo.a, foo.b, foo.c;
-- Solution 2:
SELECT a, b, MAX(c) AS c
FROM foo main
GROUP BY a, b
HAVING MAX(c) = (select max(c) from foo sub where main.a=sub.a group by a)
ORDER BY a, b;
a | b | c
---+---+------------
1 | c | 2010-07-31
1 | d | 2010-07-31
2 | a | 2010-08-01
(3 rows)
Comment:
1 is returned twice because there are multiple b values. This is acceptable (and advised). Your data should never have this problem, because c is based on b's value.
create table user_payments
(
phone_nr int NOT NULL,
payed_until_ts datetime NOT NULL
)
insert into user_payments
(phone_nr, payed_until_ts)
values
(1, '2016-01-28'), -- today
(1, '2016-01-27'), -- yesterday
(2, '2016-01-27'), -- yesterday
(2, '2016-01-29') -- tomorrow
select phone_nr, MAX(payed_until_ts) as latest_payment
from user_payments
group by phone_nr
-- OUTPUT:
-- phone_nr latest_payment
-- 1 2016-01-28 00:00:00.000
-- 2 2016-01-29 00:00:00.000
In the above example I have used a datetime column, but a similar query should work for a timestamp column.
The MAX function will basically do the "ORDER BY" on the payed_until_ts column and pick the latest value for each phone_nr.
Also, you will get only one value for each phone_nr due to the "GROUP BY" clause.