PostgreSQL: How to return rows with respect to a found row (relative results)?

Forgive my example if it does not make sense. I'm going to try with a simplified one to encourage more participation.
Consider a table like the following:
dt | mnth | foo
--------------+------------+--------
2012-12-01 | December |
...
2012-08-01 | August |
2012-07-01 | July |
2012-06-01 | June |
2012-05-01 | May |
2012-04-01 | April |
2012-03-01 | March |
...
1997-01-01 | January |
If you look for the record with dt closest to today w/o going over, what would be the best way to also return the 3 records beforehand and 7 records after?
I decided to try windowing functions:
WITH dates AS (
    select row_number() over (order by dt desc)
         , dt
         , dt - now()::date as dt_diff
    from foo
)
, closest_date AS (
    select * from dates
    where dt_diff = ( select max(dt_diff) from dates where dt_diff <= 0 )
)
SELECT *
FROM dates
WHERE row_number - (select row_number from closest_date) >= -3
AND   row_number - (select row_number from closest_date) <= 7;
I feel like there must be a better way to return relative records with a window function, but it's been some time since I've looked at them.

create table foo (dt date);
insert into foo values
('2012-12-01'),
('2012-08-01'),
('2012-07-01'),
('2012-06-01'),
('2012-05-01'),
('2012-04-01'),
('2012-03-01'),
('2012-02-01'),
('2012-01-01'),
('1997-01-01'),
('2012-09-01'),
('2012-10-01'),
('2012-11-01'),
('2013-01-01')
;
select dt
from (
    (
        select dt
        from foo
        where dt <= current_date
        order by dt desc
        limit 4
    )
    union all
    (
        select dt
        from foo
        where dt > current_date
        order by dt
        limit 7
    )
) s
order by dt;
dt
------------
2012-03-01
2012-04-01
2012-05-01
2012-06-01
2012-07-01
2012-08-01
2012-09-01
2012-10-01
2012-11-01
2012-12-01
2013-01-01
(11 rows)

You could use the window function lead():
SELECT dt_lead7 AS dt
FROM (
SELECT *, lead(dt, 7) OVER (ORDER BY dt) AS dt_lead7
FROM foo
) d
WHERE dt <= now()::date
ORDER BY dt DESC
LIMIT 11;
Somewhat shorter, but the UNION ALL version will be faster with a suitable index.
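For example, a plain b-tree index on dt is the kind of suitable index meant here (the index name is just a placeholder):
-- lets both halves of the UNION ALL query resolve to two small index range scans
CREATE INDEX foo_dt_idx ON foo (dt);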
That leaves a corner case where "date closest to today" is within the first 7 rows. You can pad the original data with 7 rows of -infinity to take care of this:
SELECT d.dt_lead7 AS dt
FROM (
    SELECT *, lead(dt, 7) OVER (ORDER BY dt) AS dt_lead7
    FROM (
        SELECT '-infinity'::date AS dt FROM generate_series(1,7)
        UNION ALL
        SELECT dt FROM foo
    ) x
) d
WHERE d.dt <= now()::date       -- same as: WHERE dt <= now()::date ¹
ORDER BY d.dt_lead7 DESC        -- same as: ORDER BY dt DESC ¹
LIMIT 11;
I table-qualified the columns in the second query to clarify what happens. See below.
The result will include NULL values if the "date closest to today" is within the last 7 rows of the base table. You can filter those with an additional sub-select if you need to.
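A minimal sketch of that extra filter, simply wrapping the first lead() query from above in one more sub-select (same foo table, no new assumptions):
SELECT dt
FROM (
    SELECT dt_lead7 AS dt
    FROM (
        SELECT *, lead(dt, 7) OVER (ORDER BY dt) AS dt_lead7
        FROM foo
    ) d
    WHERE dt <= now()::date
    ORDER BY dt DESC
    LIMIT 11
) sub
WHERE dt IS NOT NULL    -- drop the padding NULLs produced by lead() near the end of the table
ORDER BY dt;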
¹ To address your doubts about output names versus column names in the comments - consider the following quotes from the manual.
Where to use an output column's name:
An output column's name can be used to refer to the column's value in
ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING clauses;
there you must write out the expression instead.
Bold emphasis mine. WHERE dt <= now()::date references the column d.dt, not the output column of the same name - thereby working as intended.
Resolving conflicts:
If an ORDER BY expression is a simple name that matches both an output
column name and an input column name, ORDER BY will interpret it as
the output column name. This is the opposite of the choice that GROUP BY
will make in the same situation. This inconsistency is made to be
compatible with the SQL standard.
Bold emphasis mine again. ORDER BY dt DESC in the example references the output column's name - as intended. Anyway, either column would sort the same. The only difference could be with the NULL values of the corner case. But that falls flat, too, because:
the default behavior is NULLS LAST when ASC is specified or implied,
and NULLS FIRST when DESC is specified
As the NULL values come after the biggest values, the order is identical either way.
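A tiny self-contained demo of these two rules (a throwaway VALUES list, nothing to do with the foo table):
-- the derived column reuses the input column's name on purpose
SELECT val * -1 AS val
FROM  (VALUES (1), (2)) t(val)
WHERE val > 0     -- WHERE sees the input column t.val, so both rows qualify
ORDER BY val;     -- ORDER BY sees the output column, so -2 sorts before -1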
Or, without LIMIT (as per request in comment):
WITH x AS (
    SELECT *
         , row_number() OVER (ORDER BY dt) AS rn
         , first_value(dt) OVER (ORDER BY (dt > '2011-11-02'), dt DESC) AS dt_nearest
    FROM foo
)
, y AS (
    SELECT rn AS rn_nearest
    FROM x
    WHERE dt = dt_nearest
)
SELECT dt
FROM x, y
WHERE rn BETWEEN rn_nearest - 3 AND rn_nearest + 7
ORDER BY dt;
If performance is important, I would still go with @Clodoaldo's UNION ALL variant. It will be fastest. Database-agnostic SQL will only get you so far: other RDBMS do not have window functions at all yet (MySQL), or use different function names (like first_val instead of first_value), and you might just as well have to replace LIMIT with TOP n (MS SQL) or whatever the local dialect uses.

You could use something like this:
select * from foo
where dt between now() - interval '7 months' and now() + interval '3 months'

Related

SQL How to subtract 2 row values of a same column based on same key

How to extract the difference of a specific column of multiple rows with same id?
Example table:
id | prev_val | new_val | date
---+----------+---------+------------------
 1 |        0 |       1 | 2020-01-01 10:00
 1 |        1 |       2 | 2020-01-01 11:00
 2 |        0 |       1 | 2020-01-01 10:00
 2 |        1 |       2 | 2020-01-02 10:00
expected result:
id | duration_in_hours
---+-------------------
 1 |                  1
 2 |                 24
summary:
with id=1, (2020-01-01 10:00 - 2020-01-01 11:00) is 1 hour;
with id=2, (2020-01-01 10:00 - 2020-01-02 10:00) is 24 hours
Can we achieve this with SQL?
This solution should be an effective way:
with pd as (
    select
        id,
        max(date) filter (where prev_val = 0) as "prev",
        max(date) filter (where prev_val = 1) as "new"
    from my_table   -- the question never names the table; "table" itself is a reserved word
    group by id
)
select
    id,
    "new" - "prev" as diff
from pd;
If you need the difference between successive readings, something like this should work:
select a.id, a.new_val, a.date - b.date
from my_table a join my_table b
on a.id = b.id and a.prev_val = b.new_val
You could use min/max subqueries. For example:
SELECT mn.id, (mx.maxdate - mn.mindate) as "duration"
FROM (SELECT id, min(date) as mindate FROM my_table GROUP BY id) mn
JOIN (SELECT id, max(date) as maxdate FROM my_table GROUP BY id) mx
  ON mx.id = mn.id
Let me know if you need help in converting duration to hours.
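If you do want hours, one way (PostgreSQL syntax; my_table is a hypothetical name, and "date" is quoted because it is also a type name) is to go through EXTRACT(EPOCH FROM ...), which returns seconds:
SELECT id,
       EXTRACT(EPOCH FROM (max("date") - min("date"))) / 3600 AS duration_in_hours
FROM   my_table    -- hypothetical table name; the question never names the table
GROUP  BY id;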
You can use the lead()/lag() window functions to access data from the next/previous row. You can further subtract timestamps to give an interval and extract the parts needed.
select id, floor( extract('day' from diff)*24 + extract('hour' from diff) ) "Time Difference: Hours"
from (select id, date_ts - lag(date_ts) over (partition by id order by date_ts) diff
from example
) hd
where diff is not null
order by id;
NOTE: Your expected results, as presented, are incorrect. The results would be -1 and -24 respectively.
DATE is a very poor choice for a column name. It is both a Postgres data type (at best leads to confusion) and a SQL Standard reserved word.

Deleting rows within floating data ranges using SQL

I have some date data as follows:-
Person | Date
1 | 1/1/2000
1 | 6/1/2000
1 | 11/1/2000
1 | 21/1/2000
1 | 28/1/2000
I need to delete rows within 14 days of a previous one. However, if a row is deleted, it should not later become a 'base' date against which later rows are checked. It's perhaps easier to show the results needed:-
Person | Date
1 | 1/1/2000
1 | 21/1/2000
My feeling is that recursive SQL will be needed but I'm not sure how to set it up. I'll be running this on Teradata.
Thanks.
--- Edit ---
Well, this is embarrassing. It turns out this question has been asked before - and it was asked by me! See this old question for an excellent answer from @dnoeth:-
Drop rows identified within moving time window
Use a recursive CTE. Use ROW_NUMBER() to order and number the dates, and DATEDIFF() to get the number of days passed since the previous date.
SQL Server 2012 and above can maybe simplify this using SUM() OVER (PARTITION BY ... RANGE ...), but I didn't find it useful in this case.
DECLARE @Tab TABLE ([MyDate] SMALLDATETIME)
INSERT INTO @Tab ([MyDate])
VALUES
('2000-01-06'),
('2000-01-01'),
('2000-01-11'),
('2000-01-21'),
('2000-01-28')
;
WITH DOrder (MyDate, SortID) AS (
    SELECT MyDate,
           ROW_NUMBER() OVER (ORDER BY MyDate) AS SortID
    FROM @Tab t
)
, Summarize (MyDate, SortID, sSum, rSum) AS (
    SELECT MyDate, SortID, 0, 0
    FROM DOrder
    WHERE SortID = 1
    UNION ALL
    SELECT t.MyDate, t.SortID,
           DATEDIFF(D, ISNULL(s.MyDate, t.MyDate), t.MyDate),           -- days since previous row
           CASE WHEN DATEDIFF(D, ISNULL(s.MyDate, t.MyDate), t.MyDate) + s.rSum > 14
                THEN 0                                                   -- more than 14 days since last kept row: keep it, reset
                ELSE DATEDIFF(D, ISNULL(s.MyDate, t.MyDate), t.MyDate) + s.rSum
           END
    FROM DOrder t
    INNER JOIN Summarize s ON t.SortID = s.SortID + 1
)
SELECT MyDate
FROM Summarize
WHERE rSum = 0

Counting an already counted column in SQL (db2)

I'm pretty new to SQL and have this problem:
I have a populated table with a date column and other, uninteresting columns:
date | name | name2
2015-03-20 | peter | pan
2015-03-20 | john | wick
2015-03-18 | harry | potter
What I'm doing right now is counting everything for a date:
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
What I want to do now is count the resulting lines and only return them if there are fewer than 10 resulting lines.
What I tried so far is surrounding the whole query with a temp table and then counting everything, which gives me the number of resulting lines (yeah):
with temp_count (date, counter) as
(
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
)
select count(*)
from temp_count
What is still missing is the check whether the number is smaller than 10.
I was searching in this forum and came across some HAVING constructs to use, but those forced me to use a GROUP BY, which I can't.
I was thinking about something like this:
with temp_count (date, counter) as
(
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
)
select *
from temp_count
having count(*) < 10
Maybe I'm too tired to think of an easy solution, but I can't solve this so far.
Edit: a picture for clarification, since my English is horrible:
http://imgur.com/1O6zwoh
I want to see the two-column results ONLY IF there are fewer than 10 rows overall.
I think you just need to move your having clause to the inner query so that it is paired with the GROUP BY:
with temp_count (date, counter) as
(
select date, count(*)
from testtable
where date >= current date - 10 days
group by date
having count(*) < 10
)
select *
from temp_count
If what you want is to know whether the total number of records (after grouping) should be returned, then you could do this:
with temp_count (date, counter) as
(
    select date, count(*)
    from testtable
    where date >= current date - 10 days
    group by date
)
select date, counter
from (
    select date, counter, row_number() over (order by date) as rseq
    from temp_count
) x
group by date, counter
having max(rseq) >= 10
This will return 0 rows if there are less than 10 total, and will deliver ALL the results if there are 10 or more (you can just get the first 10 rows if needed with this also).
In your temp_count table, you can filter results with the WHERE clause:
with temp_count (date, counter) as
(
select date, count(distinct date)
from testtable
where date >= current date - 10 days
group by date
)
select *
from temp_count
where counter < 10
Something like:
with t(dt, rn, cnt) as (
select dt, row_number() over (order by dt) as rn
, count(1) as cnt
from testtable
where dt >= current date - 10 days
group by dt
)
select dt, cnt
from t where 10 >= (select max(rn) from t);
will do what you want (I think)

How to select more than 1 record per day?

This is a postgresql problem.
PostgreSQL 8.3.3 on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-9).
The table looks like:
date_time other_column
2012-11-01 00:00:00 ...
2012-11-02 01:00:00 ...
2012-11-02 02:00:00 ...
2012-11-02 03:00:00 ...
2012-11-02 04:00:00 ...
2012-11-03 05:00:00 ...
2012-11-03 06:00:00 ...
2012-11-05 00:00:00 ...
2012-11-07 00:00:00 ...
2012-11-07 00:00:00 ...
...
I want to select at most 3 records per day from a specific date range.
For example, I want to select at most 3 records from 2012-11-02 to 2012-11-05.
The expected result would be:
date_time other_column
2012-11-02 01:00:00 ...
2012-11-02 02:00:00 ...
2012-11-02 03:00:00 ...
2012-11-03 05:00:00 ...
2012-11-03 06:00:00 ...
2012-11-05 00:00:00 ...
I have spent a few hours on this and still cannot figure it out. Please help me. :(
UPDATE:
The current sql I tried could only select one record per day:
SELECT DISTINCT ON (TO_DATE(SUBSTRING((date_time || '') FROM 1 FOR 10), 'YYYY-MM-DD')) *
FROM myTable
WHERE date_time >= '20121101 00:00:00'
AND date_time <= '20121130 23:59:59'
I want to select at most 3 records per day from a specific date range.
SELECT date_time, other_column
FROM (
SELECT *, row_number() OVER (PARTITION BY date_time::date) AS rn
FROM tbl
WHERE date_time >= '2012-11-01 0:0'
AND date_time < '2012-12-01 0:0'
) x
WHERE rn < 4;
Major points
Use the window function row_number(). rank() or dense_rank() would be wrong according to the question - more than 3 records might be selected with timestamp duplicates.
Since you do not define which rows you want per day, the correct answer is not to include an ORDER BY clause in the window function. This gives you an arbitrary selection, which matches the question.
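If you did want a deterministic pick instead, say the earliest three rows of each day, the ORDER BY would go inside the window definition; a sketch of that variant (like the query above, it needs window functions, i.e. 8.4 or later):
SELECT date_time, other_column
FROM (
    SELECT *, row_number() OVER (PARTITION BY date_time::date
                                 ORDER BY date_time) AS rn    -- earliest rows per day first
    FROM tbl
    WHERE date_time >= '2012-11-01 0:0'
    AND   date_time <  '2012-12-01 0:0'
) x
WHERE rn < 4;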
I changed your WHERE clause from
WHERE date_time >= '20121101 00:00:00'
AND date_time <= '20121130 23:59:59'
to
WHERE date_time >= '2012-11-01 0:0'
AND date_time < '2012-12-01 0:0'
Your syntax would fail for corner cases like '20121130 23:59:59.123'.
What @Craig suggested:
date_time::date BETWEEN '2012-11-02' AND '2012-11-05'
.. would work correctly, but is an anti-pattern regarding performance. If you apply a cast or a function to your database column in the expression, plain indexes cannot be used.
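One way around that is an index on the expression itself, so the cast can use an index again; this sketch only works if date_time is timestamp without time zone, because only then is the cast to date immutable:
-- expression index; the index name is a placeholder
CREATE INDEX tbl_date_time_day_idx ON tbl ((date_time::date));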
Solution for PostgreSQL 8.3
Best solution: Upgrade to a more recent version, preferably to the current version 9.2.
Other solutions:
For only a few days you could employ UNION ALL:
(
SELECT date_time, other_column
FROM tbl t1
WHERE date_time >= '2012-11-01 0:0'
AND date_time < '2012-11-02 0:0'
LIMIT 3
)
UNION ALL
(
SELECT date_time, other_column
FROM tbl t1
WHERE date_time >= '2012-11-02 0:0'
AND date_time < '2012-11-03 0:0'
LIMIT 3
)
...
Parentheses are not optional here.
For more days there are workarounds with generate_series() - something like I posted here (including a link to more).
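For illustration only (it needs LATERAL, so PostgreSQL 9.3+, not 8.3), the generate_series() idea can look like this on a current version:
SELECT t.date_time, t.other_column
FROM   generate_series('2012-11-02'::date, '2012-11-05'::date, interval '1 day') AS d(day)
CROSS  JOIN LATERAL (
    SELECT date_time, other_column
    FROM   tbl
    WHERE  date_time >= d.day
    AND    date_time <  d.day + interval '1 day'
    ORDER  BY date_time   -- or leave out for an arbitrary pick, as discussed above
    LIMIT  3
) t;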
I might have solved it with a plpgsql function back in the old days before we had window functions:
CREATE OR REPLACE FUNCTION x.f_foo (date, date, integer
                                  , OUT date_time timestamp, OUT other_column text)
  RETURNS SETOF record AS
$BODY$
DECLARE
    _last_day date;          -- remember last day
    _ct       integer := 1;  -- count
BEGIN
    FOR date_time, other_column IN
        SELECT t.date_time, t.other_column
        FROM   tbl t
        WHERE  t.date_time >= $1::timestamp
        AND    t.date_time <  ($2 + 1)::timestamp
        ORDER  BY t.date_time::date
    LOOP
        IF date_time::date = _last_day THEN
            _ct := _ct + 1;
        ELSE
            _ct := 1;
        END IF;
        IF _ct <= $3 THEN
            RETURN NEXT;
        END IF;
        _last_day := date_time::date;
    END LOOP;
END;
$BODY$ LANGUAGE plpgsql STABLE STRICT;
COMMENT ON FUNCTION f_foo(date, date, integer) IS 'Return n rows per day
$1 .. date_from (incl.)
$2 .. date_to (incl.)
$3 .. maximum rows per day';
Call:
SELECT * FROM f_foo('2012-11-01', '2012-11-05', 3);
The following answers all use date_trunc('day',date_time) or just cast to date to truncate a timestamp to a date. There's no need to jump through hoops with date formatting and strings. See Date/time functions in the manual.
This SQLFiddle shows three possible answers: http://sqlfiddle.com/#!12/0fd51/14, all of which produce the same result for the input data (but not necessarily the same result if date_time can have duplicates in it).
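For reference, the two truncation styles behave like this (a throwaway example, not tied to the fiddle):
SELECT date_trunc('day', timestamp '2012-11-02 03:04:05') AS day_start  -- 2012-11-02 00:00:00
     , (timestamp '2012-11-02 03:04:05')::date            AS day_only;  -- 2012-11-02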
To solve your problem you could use a correlated subquery with a limit to generate an IN-list to filter on:
SELECT a.date_time, a.other_column
FROM table1 a
WHERE a.date_time IN (
SELECT b.date_time
FROM table1 b
WHERE b.date_time IS NOT NULL
AND a.date_time::date = b.date_time::date
ORDER BY b.date_time
LIMIT 3
)
AND a.date_time::date BETWEEN '2012-11-02' AND '2012-11-05';
This should be the most portable approach - though it won't work with MySQL (at least as of 5.5) because MySQL doesn't support LIMIT in a subquery used in an IN clause. It works in SQLite3 and PostgreSQL, though, and should work in most other DBs.
Another option would be to select the range of dates you wanted, annotate the rows within the range with a row number using a window function, then filter the output to exclude excess rows:
SELECT date_time, other_column
FROM (
    SELECT
        date_time,
        other_column,
        row_number() OVER (PARTITION BY date_trunc('day', date_time) ORDER BY date_time) AS n
    FROM Table1
    WHERE date_trunc('day', date_time) BETWEEN '2012-11-02' AND '2012-11-05'
    ORDER BY date_time
) numbered_rows
WHERE n < 4;
If ties are a possibility, ie if date_time is not unique, then consider using either the rank or dense_rank window functions instead of row_number to get deterministic results, or add an additional clause to the ORDER BY in row_number to break the tie.
If you use rank then it'll include none of the rows if it can't fit all of them in; if you use dense_rank it'll include all of them even if it has to go over the 3-row-per-day limit to do so.
All sorts of other processing are possible this way too, using the window specification.
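As a concrete example of the tie-breaker idea mentioned above: appending any unique column (here a hypothetical id primary key) to the ORDER BY makes row_number deterministic even with duplicate timestamps:
SELECT date_time, other_column
FROM (
    SELECT date_time, other_column,
           row_number() OVER (PARTITION BY date_trunc('day', date_time)
                              ORDER BY date_time, id) AS n   -- id: assumed unique column, breaks ties
    FROM Table1
    WHERE date_trunc('day', date_time) BETWEEN '2012-11-02' AND '2012-11-05'
) numbered_rows
WHERE n < 4;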
Here's yet another formulation that uses array aggregation and slicing, which is completely PostgreSQL specific but fun.
SELECT b.date_time, b.other_column
FROM (
SELECT array_agg(a.date_time ORDER BY a.date_time)
FROM table1 a
WHERE a.date_time::date BETWEEN '2012-11-02'
AND '2012-11-05'
GROUP BY a.date_time::date
) x(arr)
INNER JOIN table1 b ON (b.date_time = ANY (arr[1:3]));
I would use a sub-select and a left outer join. This should do the trick:
select distinct(date_format(a.date_time,"%Y-%m-%d")) date_time, b.* from table a
left outer join (
select date_format(date_time,"%Y-%m-%d") dt, * from table limit 3
) b
on date_format(a.date_time,"%Y-%m-%d") = b.dt;

Teradata SQL: select a literal

I want to use a list of arbitrary numbers as a sort of input to a select. Option A, of course, is to create a temporary table that contains just the values (e.g., 1,2,3).
I hope that you folks know what "Option > A" (i.e. something better than Option A) would be.
Suppose the statement is like:
select Fx,
XXXXXX as Foo
from MyTable
where MyTest depends on each XXXXXX
So if I could magically make XXXXXX a list of values (1,2,3), I'd have a resultset like:
My val | Foo
-------+---
cat | 1
mouse | 2
cheesecake | 3
Again, I could source the inputs from a table, but I prefer not to if it's not necessary. Gurus, please chime in.
TIA.
You will probably find success using the ROW_NUMBER() Window Aggregate function.
Random Order
SELECT CALENDAR_DATE
, ROW_NUMBER()
OVER (ORDER BY 1)
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN DATE '2010-06-01' AND DATE
;
OR Order by the column
SELECT CALENDAR_DATE
, ROW_NUMBER()
OVER (ORDER BY CALENDAR_DATE)
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN DATE '2010-06-01' AND DATE
;
OR Partition by another column to restart the sequence
SELECT CALENDAR_DATE
, YEAR_OF_CALENDAR
, ROW_NUMBER()
OVER (PARTITION BY YEAR_OF_CALENDAR
ORDER BY CALENDAR_DATE)
FROM SYS_CALENDAR.CALENDAR
WHERE CALENDAR_DATE BETWEEN DATE '2009-11-01' AND DATE
;
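If the goal really is a short, fixed list of literal values rather than a generated sequence, a derived table built from a UNION ALL of single-row selects also avoids a temporary table. This is only a sketch: the join condition is a placeholder, since the original question leaves open how MyTest relates each row to a value:
SELECT m.Fx
     , v.Foo
FROM MyTable m
JOIN (
    SELECT 1 AS Foo
    UNION ALL SELECT 2
    UNION ALL SELECT 3
) v
ON m.Fx = v.Foo   -- placeholder condition; adapt to whatever "MyTest" requires
;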