Select first & last date in window - sql

I'm trying to select first & last date in window based on month & year of date supplied.
Here is example data:
F.rates
| id | c_id | date | rate |
---------------------------------
| 1 | 1 | 01-01-1991 | 1 |
| 1 | 1 | 15-01-1991 | 0.5 |
| 1 | 1 | 30-01-1991 | 2 |
.................................
| 1 | 1 | 01-11-2014 | 1 |
| 1 | 1 | 15-11-2014 | 0.5 |
| 1 | 1 | 30-11-2014 | 2 |
Here is the PostgreSQL SELECT I came up with:
SELECT c_id, first_value(date) OVER w, last_value(date) OVER w FROM F.rates
WINDOW w AS (PARTITION BY EXTRACT(YEAR FROM date), EXTRACT(MONTH FROM date), c_id
ORDER BY date ASC)
Which gives me a result pretty close to what I want:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 15-01-1991 |
| 1 | 01-01-1991 | 30-01-1991 |
.................................
Should be:
| c_id | first_date | last_date |
----------------------------------
| 1 | 01-01-1991 | 30-01-1991 |
.................................
For some reason last_value(date) returns a different value for every record in the window, which makes me think I'm misunderstanding how windows in SQL work. It's as if SQL forms a new window for each row it iterates through, rather than one window per YEAR and MONTH for the entire table.
So could anyone be kind enough to explain whether I'm wrong and how I can achieve the result I want?
There is a reason why I'm not using MAX/MIN with a GROUP BY clause. My next step would be to retrieve the associated rates for the dates I selected, like:
| c_id | first_date | last_date | first_rate | last_rate | avg rate |
-----------------------------------------------------------------------
| 1 | 01-01-1991 | 30-01-1991 | 1 | 2 | 1.1 |
.......................................................................

If you want your output to become grouped into a single (or just fewer) row(s), you should use simple aggregation (i.e. GROUP BY), if avg_rate is enough:
SELECT c_id, min(date), max(date), avg(rate)
FROM F.rates
GROUP BY c_id, date_trunc('month', date)
More about window functions in PostgreSQL's documentation:
But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.
...
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.
...
There are options to define the window frame in other ways ... See Section 4.2.8 for details.
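The frame behaviour quoted above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example runs anywhere; SQLite 3.25+ uses the same default frame), ISO-formatted dates, and strftime('%Y-%m', date) in place of EXTRACT. The table and variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rates (c_id INTEGER, date TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO rates VALUES (?, ?, ?)",
    [(1, "1991-01-01", 1.0), (1, "1991-01-15", 0.5), (1, "1991-01-30", 2.0)],
)

# Default frame: with ORDER BY, the frame ends at the current row (and its
# peers), so last_value() simply returns the current row's own date.
default_frame = conn.execute("""
    SELECT date, last_value(date) OVER (
        PARTITION BY c_id, strftime('%Y-%m', date) ORDER BY date
    )
    FROM rates ORDER BY date
""").fetchall()

# Explicit frame covering the whole partition: last_value() now sees every
# row of the month and returns the month's last date on each row.
full_frame = conn.execute("""
    SELECT date, last_value(date) OVER (
        PARTITION BY c_id, strftime('%Y-%m', date) ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    )
    FROM rates ORDER BY date
""").fetchall()

print(default_frame)
print(full_frame)
```

With the explicit frame every row of the partition still appears in the output, so a DISTINCT or a GROUP BY is still needed to collapse each month to one row.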
EDIT:
If you want to collapse your data (min/max aggregation) and also collect more columns than those listed in GROUP BY, you have two choices:
The SQL way
Select the min/max value(s) in a sub-query, then join the original rows back (but this way you have to deal with the fact that the min/max-ed column(s) are usually not unique):
SELECT agg.c_id,
agg.first_date,
agg.last_date,
first.rate first_rate,
last.rate last_rate,
agg.avg_rate
FROM (SELECT c_id, min(date) first_date, max(date) last_date, avg(rate) avg_rate
FROM F.rates
GROUP BY c_id, date_trunc('month', date)) agg
JOIN F.rates first ON agg.c_id = first.c_id AND agg.first_date = first.date
JOIN F.rates last ON agg.c_id = last.c_id AND agg.last_date = last.date
PostgreSQL's DISTINCT ON
DISTINCT ON is typically meant for this task, but it relies heavily on ordering (only one extremum can be searched for at a time this way):
SELECT DISTINCT ON (c_id, date_trunc('month', date))
c_id,
date first_date,
rate first_rate
FROM F.rates
ORDER BY c_id, date_trunc('month', date), date
You can join this query with other aggregated sub-queries of F.rates, but at this point (if you really need both the minimum and the maximum, and in your case even an average) the standard SQL way is more suitable.

Windowing functions aren't appropriate for this. Use aggregate functions instead.
select
c_id, date_trunc('month', date)::date,
min(date) first_date, max(date) last_date
from rates
group by c_id, date_trunc('month', date)::date;
c_id | date_trunc | first_date | last_date
------+------------+------------+------------
1 | 2014-11-01 | 2014-11-01 | 2014-11-30
1 | 1991-01-01 | 1991-01-01 | 1991-01-30
create table rates (
id integer not null,
c_id integer not null,
date date not null,
rate numeric(2, 1),
primary key (id, c_id, date)
);
insert into rates values
(1, 1, '1991-01-01', 1),
(1, 1, '1991-01-15', 0.5),
(1, 1, '1991-01-30', 2),
(1, 1, '2014-11-01', 1),
(1, 1, '2014-11-15', 0.5),
(1, 1, '2014-11-30', 2);

Related

Is there a difference between Oracle SQL 'KEEP' for multiple columns and 'KEEP' for one and GROUP BY for the rest?

I'm just now learning about KEEP in Oracle SQL, but I cannot seem to find documentation that explains why their examples use KEEP in all columns that are not indexed.
I have a table with 5 columns
PERSON_ID | BRANCH | YEAR | STATUS | TIME_STAMP
123456 | 0001 | 2017 | 1 | 1-1-2017 (ROW 1)
123456 | 0001 | 2017 | 2 | 2-1-2017 (ROW 2)
123456 | 0002 | 2017 | 3 | 3-1-2017 (ROW 3)
123456 | 0001 | 2017 | 2 | 4-1-2017 (ROW 4)
123456 | 0001 | 2018 | 2 | 1-1-2018 (ROW 5)
123456 | 0001 | 2018 | 3 | 2-1-2018 (ROW 6)
I want to return the row of the most recent timestamp by person, branch, and year, so rows 3, 4, and 6.
RESULTS
PERSON_ID | BRANCH | YEAR | STATUS | TIME_STAMP
123456 | 0002 | 2017 | 3 | 3-1-2017 (ROW 3)
123456 | 0001 | 2017 | 2 | 4-1-2017 (ROW 4)
123456 | 0001 | 2018 | 3 | 2-1-2018 (ROW 6)
To get the entire row, I would normally write something like this:
SELECT *
FROM STATUS_TABLE a
WHERE a.TIME_STAMP =
(
SELECT MAX(sub.TIME_STAMP)
FROM STATUS_TABLE sub
WHERE a.PERSON_ID = sub.PERSON_ID
AND a.YEAR = sub.YEAR
AND a.BRANCH = sub.BRANCH
)
But I'm learning I can write this:
SELECT
a.PERSON_ID,
a.YEAR,
a.BRANCH,
MAX(a.STATUS) KEEP (DENSE_RANK FIRST ORDER BY TIME_STAMP DESC)
FROM STATUS_TABLE a
GROUP BY a.PERSON_ID, a.YEAR, a.BRANCH;
My concern is that a lot of the documentation and example I'm finding doesn't put all the group-by columns in GROUP BY, but rather they write a KEEP statement for many columns.
Like this:
SELECT
a.PERSON_ID,
MAX(a.YEAR) KEEP (DENSE_RANK FIRST ORDER BY TIME_STAMP DESC),
MAX(a.BRANCH) KEEP (DENSE_RANK FIRST ORDER BY TIME_STAMP DESC),
MAX(a.STATUS) KEEP (DENSE_RANK FIRST ORDER BY TIME_STAMP DESC)
FROM STATUS_TABLE a
GROUP BY a.PERSON_ID;
QUESTION
If I know that there will never be duplicates on TIME_STAMP for an ID, YEAR, and BRANCH, can I write it the first way, or do I still need to write it the second way? Using the first way, I get the results I'm expecting, but I can't seem to find any explanation of this method and what the differences may be.
Are there any?
Your aggregation queries are different. When you have:
GROUP BY a.PERSON_ID, a.YEAR, a.BRANCH
Your result set will have one row for each combination of the three columns.
If you specify:
GROUP BY a.PERSON_ID
Then there is one row only for each PERSON_ID. Under some circumstances, this is the same as the above version. But only when there is one YEAR and BRANCH per PERSON_ID. That is not true in your data.
The version that groups by all three columns is functionally equivalent for most practical purposes to your version with the correlated subquery. One difference is what happens if any of the grouping/correlation columns are NULL. The GROUP BY keeps these groupings. The correlated subquery filters them out.
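The NULL difference can be seen in a small sketch. It uses Python's built-in sqlite3 module rather than Oracle (an assumption made only so the example is runnable), plain MAX instead of KEEP, and a cut-down hypothetical status_table without the YEAR column; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE status_table "
    "(person_id INTEGER, branch TEXT, status INTEGER, time_stamp TEXT)"
)
conn.executemany(
    "INSERT INTO status_table VALUES (?, ?, ?, ?)",
    [
        (123456, "0001", 1, "2017-01-01"),
        (123456, "0001", 2, "2017-01-04"),
        (123456, None, 3, "2017-01-03"),  # row with a NULL grouping column
    ],
)

# GROUP BY treats NULL as its own group, so the NULL branch gets a result row.
grouped = conn.execute("""
    SELECT branch, MAX(time_stamp)
    FROM status_table
    GROUP BY person_id, branch
""").fetchall()

# In the correlated subquery, a.branch = sub.branch is never true for NULL,
# so the subquery returns NULL and the outer row is filtered out.
correlated = conn.execute("""
    SELECT a.branch, a.time_stamp
    FROM status_table a
    WHERE a.time_stamp = (
        SELECT MAX(sub.time_stamp)
        FROM status_table sub
        WHERE a.person_id = sub.person_id
          AND a.branch = sub.branch
    )
""").fetchall()

print(grouped)
print(correlated)
```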

Group function with "partition by" still duplicate values

I'm trying to sum some values from one column (total_stake) based on a second column (node_id) and group the results by node_id. Right now it sums everything perfectly, but it still returns duplicate rows with the same node_id and summed value, and I don't fully understand why.
Here is my query:
WITH events AS (
SELECT n.id as node_id, n.event_time FROM nodes n
)
SELECT
node_id,
sum(total) FILTER (WHERE prior_to=0 OR prior_to=2) OVER (PARTITION BY node_id) as node_total_previous_days,
sum(total) FILTER (WHERE prior_to=1) OVER (PARTITION BY node_id) as node_total_same_day,
sum(total) FILTER (WHERE prior_to=2) OVER (PARTITION BY node_id) as node_total_previous_day
FROM (
SELECT e.node_id,
n.total,
CASE
WHEN date_trunc('day', np.event_time) - INTERVAL '1 day' = date_trunc('day', np.placed_time) THEN 2
WHEN date_trunc('day', np.event_time) - INTERVAL '1 day' > n.placed_time THEN 0
WHEN date_trunc('day', np.event_time) = date_trunc('day', n.placed_time) THEN 1
end as prior_to
FROM events e
JOIN net_parts np on np.node_id = e.node_id
JOIN nets n ON n.id = np.net_id) as summary
GROUP BY node_id, total_stake, prior_to ORDER BY node_id;
Result of the query is:
node_id | node_total_previous_days | node_total_same_day | node_total_previous_day |
---------+--------------------------+---------------------+-------------------------+
6194 | | | 3.00 |
6187 | | 60.00 | 200.00 |
6305 | 150.00 | 569.00 | |
6305 | 150.00 | 569.00 | |
6305 | 150.00 | 569.00 | |
6305 | 150.00 | 569.00 | |
6305 | 150.00 | 569.00 | |
So the question is: how do I get a grouped result without duplicated values? And, to understand it properly, why does it duplicate those values?
Use group by to determine the rows you want. If you want one row per node_id, then use:
GROUP BY node_id
ORDER BY node_id;
Your additional group by keys are generating more rows. You would see the additional values if you included total_stake and prior_to in the outermost select.
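As a rough illustration of one row per node_id, the sketch below groups by node_id only and uses plain conditional aggregates instead of windowed sums; with a GROUP BY on node_id alone, the sum(...) OVER (...) expressions would most likely need to become ordinary aggregates anyway, since total is no longer a grouping column. It runs on Python's built-in sqlite3 module as a stand-in for PostgreSQL, swaps FILTER for the equivalent SUM(CASE ...) form for portability, and uses a hypothetical summary table with invented rows in place of the original sub-query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the "summary" sub-query of the question.
conn.execute("CREATE TABLE summary (node_id INTEGER, total REAL, prior_to INTEGER)")
conn.executemany(
    "INSERT INTO summary VALUES (?, ?, ?)",
    [
        (6305, 100.0, 0),
        (6305, 50.0, 0),
        (6305, 569.0, 1),
        (6187, 60.0, 1),
        (6187, 200.0, 2),
    ],
)

# One output row per node_id: each SUM only counts rows matching its CASE,
# which mirrors sum(total) FILTER (WHERE ...) without the OVER clause.
rows = conn.execute("""
    SELECT node_id,
           SUM(CASE WHEN prior_to IN (0, 2) THEN total END) AS node_total_previous_days,
           SUM(CASE WHEN prior_to = 1 THEN total END)       AS node_total_same_day,
           SUM(CASE WHEN prior_to = 2 THEN total END)       AS node_total_previous_day
    FROM summary
    GROUP BY node_id
    ORDER BY node_id
""").fetchall()

for row in rows:
    print(row)
```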

Aggregating multiple rows more than once

I've got a set of data which has a type column and a created_at time column. I already have a query which pulls the relevant data from the database, and this is the data that is returned.
type | created_at | row_num
-----------------------------------------------------
"ordersPage" | "2015-07-21 11:32:40.568+12" | 1
"getQuote" | "2015-07-21 15:49:47.072+12" | 2
"completeBrief" | "2015-07-23 01:00:15.341+12" | 3
"sendBrief" | "2015-07-24 08:59:42.41+12" | 4
"sendQuote" | "2015-07-24 18:43:15.967+12" | 5
"acceptQuote" | "2015-08-03 04:40:20.573+12" | 6
The row number is returned from the standard row number function in postgres
ROW_NUMBER() OVER (ORDER BY created_at ASC) AS row_num
What I want to do is somehow aggregate this data to get the time distance between every pair of consecutive events, so the output data might look something like this
type_1 | type_2 | time_distance
--------------------------------------------------------
"ordersPage" | "getQuote" | 123423.3423
"getQuote" | "completeBrief" | 123423.3423
"completeBrief" | "sendBrief" | 123423.3423
"sendBrief" | "sendQuote" | 123423.3423
"sendQuote" | "acceptQuote" | 123423.3423
The time distance would be a float in milliseconds, in other queries I've been using something like this to get time differences.
EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at)))
But this time I need it for every pair of events in the sequential order of row_num, so I need the aggregate for (1,2), (2,3), (3,4)...
Any ideas if this is possible? Also doesn't have to be exact, I can deal with duplicates, and with type_1 and type_2 columns returning an existing row in a different order. I just need a way to at least get those values above.
What about a self join? It would look like this:
SELECT
t1.type AS type_1
, t2.type AS type_2
, EXTRACT(EPOCH FROM (t2.created_at - t1.created_at)) AS time_diff
FROM your_table t1
INNER JOIN your_table t2
ON t2.row_num = t1.row_num + 1
You can use the LAG window function to compare the current value with the previous:
with
t(type,created_at) as (
values
('ordersPage', '2015-07-21 11:32:40.568+12'::timestamptz),
('getQuote', '2015-07-21 15:49:47.072+12'),
('completeBrief', '2015-07-23 01:00:15.341+12'),
('sendBrief', '2015-07-24 08:59:42.41+12'),
('sendQuote', '2015-07-24 18:43:15.967+12'),
('acceptQuote', '2015-08-03 04:40:20.573+12'))
select *, EXTRACT(EPOCH FROM created_at - lag(created_at) over (order by created_at))
from t
order by created_at
select type_1,
type_2,
created_at_2-created_at_1 as time_distance
from
(select
type type_1,
lead(type,1) over (order by row_num) type_2,
created_at created_at_1,
lead(created_at,1) over (order by row_num) created_at_2
from table_name) temp
where type_2 is not null

Finding gaps in huge event streams?

I have about 1 million events in a PostgreSQL database that are of this format:
id | stream_id | timestamp
----------+-----------------+-----------------
1 | 7 | ....
2 | 8 | ....
There are about 50,000 unique streams.
I need to find all of the events where the time between any two of the events is over a certain time period. In other words, I need to find event pairs where there was no event in a certain period of time.
For example:
a b c d e f g h i j k
| | | | | | | | | | |
\____2 mins____/
In this scenario, I would want to find the pair (f, g) since those are the events immediately surrounding a gap.
I don't care if the query is (that) slow, i.e. on 1 million records it's fine if it takes an hour or so. However, the data set will keep growing, so hopefully if it's slow it scales sanely.
I also have the data in MongoDB.
What's the best way to perform this query?
You can do this with the lag() window function over a partition by stream_id that is ordered by the timestamp. The lag() function gives you access to previous rows in the partition; without an explicit offset, it returns the row immediately before the current one. So if the partition on stream_id is ordered by time, then the previous row is the previous event for that stream_id.
-- Window functions cannot be used in WHERE, so compute diff in a sub-query first.
SELECT *
FROM (
SELECT stream_id, lag(id) OVER pair AS start_id, id AS end_id,
("timestamp" - lag("timestamp") OVER pair) AS diff
FROM my_table
WINDOW pair AS (PARTITION BY stream_id ORDER BY "timestamp")
) AS pairs
WHERE diff > interval '2 minutes';
In Postgres it can be done very easily with the help of the lag() window function. Check the fiddle below as an example:
SQL Fiddle
PostgreSQL 9.3 Schema Setup:
CREATE TABLE Table1
("id" int, "stream_id" int, "timestamp" timestamp)
;
INSERT INTO Table1
("id", "stream_id", "timestamp")
VALUES
(1, 7, '2015-06-01 15:20:30'),
(2, 7, '2015-06-01 15:20:31'),
(3, 7, '2015-06-01 15:20:32'),
(4, 7, '2015-06-01 15:25:30'),
(5, 7, '2015-06-01 15:25:31')
;
Query 1:
with c as (select *,
lag("timestamp") over(partition by stream_id order by id) as pre_time,
lag(id) over(partition by stream_id order by id) as pre_id
from Table1
)
select * from c where "timestamp" - pre_time > interval '2 sec'
Results:
| id | stream_id | timestamp | pre_time | pre_id |
|----|-----------|------------------------|------------------------|--------|
| 4 | 7 | June, 01 2015 15:25:30 | June, 01 2015 15:20:32 | 3 |

PostgreSQL return multiple rows with DISTINCT though only latest date per second column

Let's say I have the following database table (date truncated for example only; the two 'id_' prefix columns join with other tables)...
+-----------+---------+------+--------------------+-------+
| id_table1 | id_tab2 | date | description | price |
+-----------+---------+------+--------------------+-------+
| 1 | 11 | 2014 | man-eating-waffles | 1.46 |
+-----------+---------+------+--------------------+-------+
| 2 | 22 | 2014 | Flying Shoes | 8.99 |
+-----------+---------+------+--------------------+-------+
| 3 | 44 | 2015 | Flying Shoes | 12.99 |
+-----------+---------+------+--------------------+-------+
...and I have a query like the following...
SELECT id, date, description FROM inventory ORDER BY date ASC;
How do I SELECT all the descriptions, but only once each, while keeping only the latest year for each description? So I need the database query to return the first and last row from the sample data above; the second is not returned because the last row has a later date.
Postgres has something called distinct on. This is usually more efficient than using window functions. So, an alternative method would be:
SELECT distinct on (description) id, date, description
FROM inventory
ORDER BY description, date desc;
The row_number window function should do the trick:
SELECT id, date, description
FROM (SELECT id, date, description,
ROW_NUMBER() OVER (PARTITION BY description
ORDER BY date DESC) AS rn
FROM inventory) t
WHERE rn = 1
ORDER BY date ASC;