How to select based on maximum time interval - SQL

I have a table as follows in PostgreSQL 8.4:
1 | John Smith | 2011-08-12 12:44:13.125+08
2 | John Smith | 2011-08-16 08:38:57.968+08
3 | John Smith | 2011-08-16 08:38:58.062+08
4 | Kenny Long | 2011-08-16 17:06:35.843+08
5 | Kenny Long | 2011-08-16 17:06:35.906+08
6 | Kenny Long | 2011-08-16 17:06:59.281+08
7 | Kenny Long | 2011-08-16 17:07:00.234+08
8 | Kenny Long | 2011-08-16 17:07:32.859+08
9 | Kenny Long | 2011-08-16 17:08:00.437+08
10 | Kenny Long | 2011-08-16 17:08:22.718+08
11 | Kenny Long | 2011-08-16 17:08:22.781+08
I would like to select rows based on the timestamp: only one row is needed for records that fall within 2 minutes of each other. For example, records 4 to 9 should return only row 4 and ignore the rest.
How can I achieve this? Your help is greatly appreciated.
Thank you in advance.
Joe Liew

I've tried a recursive approach. I'm not sure it's the best way, and I'm quite sure I should study some window functions to shrink it. But it worked on my test case. The goal is to start with the minimum timestamp per person, then track which rows are to be discarded (within the 2-minute range) and which row is the next valid one. At each iteration we continue from this valid row (one per person).
So here is the query for table myschema.mytable with columns id, name, tm. Note that the thelevel column is only there to track recursion and debug; it is not necessary:
WITH RECURSIVE mytmp(id, name, thetime, thelevel) AS (
    -- Non-recursive term: the starting point, one row per person,
    -- with a subquery to get the minimum time (maybe there is a better
    -- way to fetch the id along with it).
    (
        SELECT t.id, t.name, t.tm AS thetime, 1 AS thelevel
        FROM (
            SELECT name, MIN(tm) AS mintm
            FROM myschema.mytable
            GROUP BY name
        ) q
        JOIN myschema.mytable t ON t.name = q.name
                               AND t.tm = q.mintm
        ORDER BY t.name ASC
    ) -- end of the starting point of the recursive query
    UNION ALL
    -- Recursive term, evaluated again and again (loop): from the rows
    -- produced by the previous iteration (one per person), find the first
    -- new valid row (> 2 min) for that person.
    -- Removing the bad rows (< 2 min) is easy; keeping only ONE valid row
    -- (and not all the others) is harder, because LIMIT and aggregate
    -- functions are restricted in a recursive term. We must keep exactly
    -- one, since the future valid rows depend on the 2-minute range that
    -- starts from it. Maybe some window function could help here, but at
    -- least this is a working solution.
    SELECT t.id, t.name, t.tm AS thetime, q2.thelevel
    FROM myschema.mytable t,
    (
        -- keep only the first valid row per person
        SELECT t2.name, MIN(t2.tm) AS tm, mytmp2.thelevel + 1 AS thelevel
        FROM myschema.mytable t2,
        (
            SELECT id, name, thetime, thelevel
            FROM mytmp
        ) mytmp2
        -- hack: mytmp2 is a useless indirection around mytmp; referencing
        -- mytmp directly makes the MIN call fail with "ERROR: aggregate
        -- functions not allowed in a recursive query's recursive term".
        -- I do not know why this works :-)
        WHERE t2.name = mytmp2.name
        -- only rows in the future
        AND t2.tm - mytmp2.thetime > INTERVAL '0'
        GROUP BY
            -- hack the GROUP BY to make two groups: one for rows beyond the
            -- 2-minute range and one for rows inside it
            CASE WHEN (t2.tm - mytmp2.thetime) > INTERVAL '2 minutes' THEN 1 ELSE 2 END,
            t2.name, mytmp2.thelevel, mytmp2.thetime
        -- the HAVING keeps only the group containing the first valid (> 2 min) row
        HAVING (MIN(t2.tm) - mytmp2.thetime) > INTERVAL '2 minutes'
    ) q2
    -- q2 is joined back to the table to recover the id, which we cannot
    -- carry through the GROUP BY
    WHERE q2.tm = t.tm
      AND q2.name = t.name
) -- end of recursive query
SELECT *
FROM mytmp
ORDER BY name ASC, thelevel ASC, thetime ASC
-- LIMIT 100 -- while debugging, to avoid infinite loops
Another solution would be a stored procedure doing the same thing with a temporary table (take the valid row, delete the rows in the 2-minute range, then take the next valid one, and so on), which might be easier to maintain.
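For illustration, a minimal plpgsql sketch of that idea (untested; the function name pick_first_per_window is made up, the integer/text/timestamptz column types are assumptions, and instead of a temporary table it simply walks the rows in order, emitting one row per person per 2-minute window):
CREATE OR REPLACE FUNCTION myschema.pick_first_per_window()
RETURNS TABLE (id integer, name text, tm timestamptz) AS $$
DECLARE
    cur record;
    last_name text;
    last_tm timestamptz;
BEGIN
    FOR cur IN
        SELECT t.id, t.name, t.tm
        FROM myschema.mytable t
        ORDER BY t.name, t.tm
    LOOP
        -- emit a row when the person changes or when more than
        -- 2 minutes have passed since the last emitted row
        IF last_name IS DISTINCT FROM cur.name
           OR cur.tm - last_tm > INTERVAL '2 minutes' THEN
            id := cur.id;
            name := cur.name;
            tm := cur.tm;
            RETURN NEXT;
            last_name := cur.name;
            last_tm := cur.tm;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- usage: SELECT * FROM myschema.pick_first_per_window();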

Just an idea, not tested. Window functions need 8.4 or later.
SELECT * FROM
(
  SELECT
    name,
    tm,
    CASE
      WHEN lagname IS NULL                 -- first row of everything
        OR (name <> lagname)               -- we order by name, so this is the first row of this name
        OR (name = lagname AND lagtm + INTERVAL '2 minutes' < tm)  -- more than 2 minutes after the previous row
      THEN 1
      ELSE 0
    END AS flags
  FROM
  (
    SELECT name,
           tm,
           lag(name) OVER (ORDER BY name, tm) AS lagname,
           lag(tm)   OVER (ORDER BY name, tm) AS lagtm
    FROM "table"."table"
  ) AS lagtable
) AS blar
WHERE "flags" = 1

Related

How do I create a repeating number sequence in a column in BigQuery

I'd like to create a query that returns a column with a repeating number sequence in it.
For example:
row_num | repeat
----------------
1 | 1
2 | 2
3 | 3
4 | 1
5 | 2
6 | 3
I'm struggling to understand how I could achieve this with BigQuery Standard SQL.
So far I've generated the row number (ROW_NUMBER() OVER()) as row_num in my select, and I was thinking I could use a modulus function to determine the repeat number, but this would split it into several separate columns, so I'd need additional steps to merge them into one column. I wondered if there was a more elegant way of achieving this.
Many Thanks!
In fact, the modulus should work here. Assuming your table already has a row_num column, and you want to generate the repeat column, you may try:
SELECT
  row_num,
  MOD(row_num - 1, 3) + 1 AS repeat
FROM yourTable
ORDER BY row_num;
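If you don't have a row_num column yet, here is a self-contained sketch of the same idea; GENERATE_ARRAY simply fabricates row numbers 1 through 9 for illustration:
SELECT
  row_num,
  MOD(row_num - 1, 3) + 1 AS repeat
FROM UNNEST(GENERATE_ARRAY(1, 9)) AS row_num
ORDER BY row_num;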

PostgreSQL: efficiently find the last descendant in a linear list

I am currently trying to efficiently retrieve the last descendant from a linked-list-like structure.
Essentially there's a table with a data series; using certain criteria I split it up to get a list like this
current_id | next_id
for example
1 | 2
2 | 3
3 | 4
4 | NULL
42 | 43
43 | 45
45 | NULL
etc...
would result in lists like
1 -> 2 -> 3 -> 4
and
42 -> 43 -> 45
Now I want to get the first and the last id from each of those lists.
This is what I have right now:
WITH RECURSIVE contract(rstart_ts, rend_ts) AS ( -- recursive query to traverse the "linked list" of continuous timestamps
    SELECT start_ts, end_ts
    FROM track_caps tc
  UNION
    SELECT c.rstart_ts, tc.end_ts
    FROM contract c
    INNER JOIN track_caps tc ON (tc.start_ts = c.rend_ts
                                 AND c.rend_ts IS NOT NULL
                                 AND tc.end_ts IS NOT NULL)
),
fcontract AS ( -- final step: after traversing the "linked list", pick the largest timestamp found as end_ts and the smallest as start_ts
    SELECT DISTINCT ON (start_ts, end_ts) MIN(rstart_ts) AS start_ts, rend_ts AS end_ts
    FROM (
        SELECT rstart_ts, MAX(rend_ts) AS rend_ts
        FROM contract
        GROUP BY rstart_ts
    ) sq
    GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts
In this case I just used timestamps which work fine for the given data.
Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on StackOverflow and other sites. The next query removes all the sub-steps and returns what I want, like in the first list example: 1 | 4
Just for illustration, the produced result set by the recursive query looks like this:
1 | 2
2 | 3
3 | 4
1 | 3
2 | 4
1 | 4
As nicely as it works, it's quite a memory hog, which is unsurprising when looking at the results of EXPLAIN ANALYZE.
For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. It is actually supposed to process around 2,000,000 rows, but with that solution this currently seems very unfeasible.
Did I just improperly use recursive queries? Is there a way to reduce the amount of data it produces (like removing the sub-steps)?
Or are there better single-query solutions to this problem?
The main problem is that your recursive query doesn't properly filter the root nodes, which is caused by the model you have. So the non-recursive part already selects the entire table, and then Postgres needs to recurse for each and every row of it.
To make that more efficient, only select the root nodes in the non-recursive part of your query. This can be done using:
select t1.current_id, t1.next_id
from track_caps t1
where not exists (select *
                  from track_caps t2
                  where t2.next_id = t1.current_id)
Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least it makes sure the recursion doesn't need to process more rows than necessary.
To find the root node of each tree, just select that as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.
So you wind up with something like this:
with recursive contract as (
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
from track_caps t2
where t2.next_id = t1.current_id)
union
select c.current_id, c.next_id, p.root_id
from track_caps c
join contract p on c.current_id = p.next_id
and c.next_id is not null
)
select *
from contract
order by current_id;
Online example: http://rextester.com/DOABC98823
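To then answer the original question (the first and last id of each list), one option is to also let the terminal rows into the recursion by dropping the c.next_id is not null condition; then root_id is the first id and current_id is the last one of each chain. A sketch:
with recursive contract as (
  select t1.current_id, t1.next_id, t1.current_id as root_id
  from track_caps t1
  where not exists (select *
                    from track_caps t2
                    where t2.next_id = t1.current_id)
  union
  select c.current_id, c.next_id, p.root_id
  from track_caps c
    join contract p on c.current_id = p.next_id
)
select root_id as first_id, current_id as last_id
from contract
where next_id is null
order by first_id;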

How to consolidate blocks of time?

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
    ID INT
    , TimeFrom INT
    , TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where one record's TimeFrom falls before another record's TimeTo, i.e. they overlap:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 70 |
| 10 | 60 | 150 |
| 10 | 75 | 150 |
| .. | ... | ... |
+----+----------+--------+
The result set is meant to be a flattened linear idle report, but with too many of these overlaps, I end up with negative time in use. I.e., if the window above for ID = 10 was 150 seconds long and I summed the differences of relative seconds to subtract from the window size, I'd wind up with 150 - (20 + 20 + 90 + 75) = -55. I've tried this approach, and it's what led me to realize there were overlaps that needed to be flattened.
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 150 |
| .. | ... | ... |
+----+----------+--------+
Considerations: performance is very important here, as this is part of a larger query that performs well on its own, and I'd rather not impact its performance much if I can help it.
Regarding a comment about "which seconds have an interval": this is something I have tried for the end result, and I am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
    SELECT A.N, ROW_NUMBER() OVER (ORDER BY A.N) RowID
    FROM (SELECT TOP 60  1 N FROM master..spt_values) A,
         (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
    SELECT 1
    FROM Times SE
    WHERE SE.ID = 10
      AND SE.TimeFrom <= C.RowID
      AND SE.TimeTo >= C.RowID
      AND EXISTS (
          SELECT 1
          FROM Times D
          WHERE D.ID = SE.ID
            AND D.TimeFrom <= C.RowID
            AND D.TimeTo >= C.RowID
      )
    GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate that getting around this approach is for the best: even with the Row Count Spool reducing the cost of the query quite a bit, its execution count increases the cost of the query as a whole quite a bit as well.
Further Edit: The end goal is to put this in a procedure, so Table Variables and Temp Tables are also a possible tool to use.
OK. I'm still trying to do this with just one SELECT. But this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)

INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY Id, TimeFrom

DECLARE @timeTo int, @id int, @groupId int
SET @groupId = 0

UPDATE @tmp
SET
    @groupId = CASE WHEN id != @id THEN 0
                    WHEN TimeFrom > @timeTo THEN @groupId + 1
                    ELSE @groupId END,
    GroupId = @groupId,
    @timeTo = TimeTo,
    @id = id

SELECT Id, MIN(TimeFrom), MAX(TimeTo)
FROM @tmp
GROUP BY ID, GroupId
ORDER BY ID
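For what it's worth, a single-SELECT alternative along classic gaps-and-islands lines could look like the sketch below (untested; needs SQL Server 2012+ for the window frames, unlike the update above). A row starts a new block whenever its TimeFrom exceeds the running maximum TimeTo of all earlier rows for the same ID:
WITH Ordered AS (
    SELECT ID, TimeFrom, TimeTo,
           CASE WHEN TimeFrom > MAX(TimeTo) OVER (PARTITION BY ID
                                                  ORDER BY TimeFrom, TimeTo
                                                  ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 1 ELSE 0 END AS IsStart  -- 1 = this row starts a new block
    FROM Times
), Grouped AS (
    SELECT ID, TimeFrom, TimeTo,
           SUM(IsStart) OVER (PARTITION BY ID ORDER BY TimeFrom, TimeTo
                              ROWS UNBOUNDED PRECEDING) AS GroupId  -- running block number
    FROM Ordered
)
SELECT ID, MIN(TimeFrom) AS TimeFrom, MAX(TimeTo) AS TimeTo
FROM Grouped
GROUP BY ID, GroupId;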
Left join each row to its overlapping successor row on the same ID value (where one exists).
Now, for each row in the result set of LHS left join RHS, the contribution to the elapsed time for the ID is:
isnull(RHS.TimeFrom,LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer.
Note that:
- where there isn't an overlapping successor row, the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row, the calculation nets to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
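A minimal T-SQL sketch of that idea (untested; it assumes each row overlaps at most one successor, as in the sample data):
SELECT LHS.ID,
       SUM(ISNULL(RHS.TimeFrom, LHS.TimeTo) - LHS.TimeFrom) AS TimeElapsed
FROM Times LHS
LEFT JOIN Times RHS
       ON RHS.ID = LHS.ID
      AND RHS.TimeFrom > LHS.TimeFrom  -- the successor starts after this row...
      AND RHS.TimeFrom < LHS.TimeTo    -- ...but before it ends, i.e. it overlaps
GROUP BY LHS.ID;
Against the sample rows for ID 10 this yields 20 + 10 + 15 + 75 = 120 elapsed seconds, matching the flattened windows (10, 30) and (50, 150).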
What about something like the below (CTEs need SQL Server 2005 or later):
WITH Overlaps
AS
(
    SELECT t1.Id,
           TimeFrom = MIN(t1.TimeFrom),
           TimeTo = MAX(t2.TimeTo)
    FROM dbo.Times t1
    INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
                           AND t2.TimeFrom > t1.TimeFrom
                           AND t2.TimeFrom < t1.TimeTo
    GROUP BY t1.Id
)
SELECT o.Id,
       o.TimeFrom,
       o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
       t.TimeFrom,
       t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
                     AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with, but it seems decent on the smaller data sets I have.
I also wrapped my head around this issue, and in the end I found that the problem is your data.
You claim (if I understand correctly) that these entries should reflect the relative times when a user goes idle / comes back.
So you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, the two lines:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 50 | 70 |
| 10 | 60 | 150 |
how can it be possible that a user is idle until second 70, but goes idle again at second 60? That already implies he came back at the latest around second 59.
I can only assume that this issue comes from different threads and/or browser windows (tabs) a user might be using your application with (each having its own "idle detection").
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table? You could avoid this by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID,TimeFrom,TimeTo)VALUES('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever a comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
The fiddle linked here, http://sqlfiddle.com/#!2/dcb17/1, reproduces the chain of events from your example, but results in a clean and logical set of idle windows:
ID TIMEFROM TIMETO
10 10 30
10 50 70
10 75 150
Note: the output is slightly different from the output you desired, but I feel this is more accurate, for the reason outlined above: a user cannot go idle at second 70 without first returning from his current idle state. Either he STAYS idle (and a second thread/tab runs into the idle event), or he returned in between.
Especially given your need to maximize performance, you should fix the data and not invent a work-around query. This might cost 3 ms on inserts, but could be worth 20 seconds on selects!
Edit: if multi-threading / multiple sessions are the cause of the wrong inserts, you would also need to check that most_recent_come_back_time < now() - idleTimeout; otherwise a user might come back on tab1 and be recorded idle on tab2 a few seconds later, because tab2 ran into its idle timeout while the user only refreshed tab1.
I had the 'same' problem once with 'days' (additionally without counting weekends and holidays).
The word 'counting' gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
select count(distinct sec)
from times t, seconds s
where s.sec between t.timefrom and t.timeto - 1
  and id = 10;
You can shift the start to 0 (I put the '10' here in parentheses):
select count(distinct sec)
from times t, seconds s
where s.sec between t.timefrom - (10) and t.timeto - (10) - 1
  and id = 10;
and finally
select count(distinct sec)
from times t, seconds s,
     (select min(timefrom) m from times where id = 10) as m
where s.sec between t.timefrom - m.m and t.timeto - m.m - 1
  and id = 10;
Additionally you can "ignore" e.g. 10-second buckets by dividing; you lose some precision but gain speed:
select count(distinct sec) * d.d
from times t, seconds s,
     (select min(timefrom) m from times where id = 10) as m,
     (select 10 d) as d
where s.sec between (t.timefrom - m.m) / d.d and (t.timeto - m.m) / d.d - 1
  and id = 10;
Sure, it depends on the range you have to look at, but a 'day' or two of seconds should work (although I did not test it).
fiddle ...

SQL - Min difference between two integer fields

How can I get the minimum difference between two integer fields (value_0 - value)?
value_0 >= value always
value_0 | value
-------------------
15 | 10
12 | 10
15 | 11
11 | 11
Try this:
SELECT MIN(value_0 - value) AS MinDiff
FROM TableName
WHERE value_0 >= value
With the sample data you have given, the output is 0 (11 - 11).
Here is one way:
select min(value_0 - value)
from table t;
This is pretty basic SQL. If you want to see other values on the same row as the minimum, use order by and choose one row:
select (value_0 - value)
from table t
order by (value_0 - value)
limit 1;
The limit 1 works in some databases for getting one row. Others use top 1 in the select clause. Or fetch first 1 rows only. Or even something else.
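For instance, the standard-SQL spelling of the same query (a sketch) would be:
select value_0 - value as diff
from table t
order by diff
fetch first 1 rows only;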

Select distinct values for a particular column choosing arbitrarily from duplicates

I have health data relating to deaths. An individual should die at most once. In the database they sometimes don't, probably because causes of death were changed but the original entry was not deleted. I don't really understand how this was allowed to happen, but it has. So, as a made-up example, I have:
Row_number | Individual_ID | Cause_of_death | Date_of_death
------------+---------------+-----------------------+---------------
1 | 1 | Stroke | 3 march 2008
2 | 2 | Myocardial infarction | 1 jan 2009
3 | 2 | Pulmonary Embolus | 1 jan 2009
I want each individual to have only one cause of death.
In the example, I want a query that returns row 1 and either row 2 or row 3 (not both). I have to make an arbitrary choice between rows 2 and 3 because there is no timestamp in any of the fields that can be used to determine which is the revision; it's not ideal but is unavoidable.
I can't make the SQL work to do this. I've tried inner joining distinct Individual_ID to the other fields, but this still gives all the rows. I've tried adding a 'having count(Individual_ID) = 1' clause with it. This leaves out people with more than one cause of death completely. Suggestions on the internet seem to be based on using a timestamped field to choose the most recent, but I don't have that.
IBM DB2. Windows XP. Any thoughts gratefully received.
Have you tried using MIN (or MAX) against the cause of death (and the date of death, if they died on two different dates)?
SELECT IndividualID, MIN(Cause_Of_Death), MIN(Date_Of_Death)
FROM deaths
GROUP BY IndividualID
I don't know DB2 so I'll answer in general. There are two main approaches:
select *
from T
join (
select keys, min(ID) as MinID
from T
group by keys
) on T.ID = MinID
And
select *, row_number() over (partition by keys) as r
from T
where r = 1
Both return all rows, duplicated or not, but only one row per "key".
Notice that both statements are pseudo-SQL.
The row_number() approach is probably preferable from a performance standpoint. Here is usr's example, in DB2 syntax:
select * from (
select T.*, row_number() over (partition by Individual_ID) as r
from T
)
where r=1;
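If you'd rather make the arbitrary choice deterministic, an order by inside the over clause pins it down; for example (a sketch using the Row_number column from the example data, picking the lowest one per individual):
select * from (
select T.*, row_number() over (partition by Individual_ID order by Row_number) as r
from T
)
where r=1;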