I have a little table to try to understand how the LAST_VALUE function works in PostgreSQL. It looks like this:
id | value
----+--------
0 | A
1 | B
2 | C
3 | D
4 | E
5 | [null]
6 | F
What I want to do is to use LAST_VALUE to fill the NULL value with the preceding non-NULL value, so the result should be this:
id | value
----+--------
0 | A
1 | B
2 | C
3 | D
4 | E
5 | E
6 | F
The query I tried to accomplish that is:
SELECT LAST_VALUE(value)
OVER (PARTITION BY id ORDER BY case WHEN value IS NULL THEN 0 ELSE 1 END ASC)
FROM test;
From what I understand of the LAST_VALUE function, it takes all the rows before the current one as a window, sorts them according to the ORDER BY clause, and then returns the last row of the window. With my ORDER BY, all the rows containing a NULL should be put at the top of the window, so LAST_VALUE should return the last non-NULL value. But it doesn't.
I am clearly missing something. Please help.
I'm not sure last_value will do what you want. It would be better to use lag:
select id,
       coalesce(value, lag(value) OVER (order by id))
FROM test;
id | coalesce
----+----------
0 | A
1 | B
2 | C
3 | D
4 | E
5 | E
6 | F
(7 rows)
last_value will return the last value of the current frame. Since you partitioned by id, there's only ever one value in each frame. lag will return the value from the previous row (one row back by default), which seems to be exactly what you want.
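As a side note, lag() also accepts an explicit offset and a default value for rows that have no predecessor. A minimal sketch, assuming value is a text column:
-- lag(expr, offset, default): offset defaults to 1 and default to NULL.
-- Here the first row gets 'n/a' instead of NULL.
SELECT id,
       value,
       lag(value, 1, 'n/a') OVER (ORDER BY id) AS prev_value
FROM test;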
To expand on this answer a bit, you can use row_number() to give you a good idea of the frame you are looking at. For your proposed solution, look at the row number for each row when you partition by id:
SELECT id, row_number() OVER (PARTITION BY id ORDER BY case WHEN value IS NULL THEN 0 ELSE 1 END ASC)
FROM test;
id | row_number
----+------------
0 | 1
1 | 1
2 | 1
3 | 1
4 | 1
5 | 1
6 | 1
(7 rows)
Each row is its own frame, so you won't be able to get any values from other rows.
If we don't partition by id, but still use your ordering, you can see why this still won't work for last_value:
SELECT id, row_number() OVER (ORDER BY case WHEN value IS NULL THEN 0 ELSE 1 END ASC, id)
FROM test;
id | row_number
----+------------
5 | 1
0 | 2
1 | 3
2 | 4
3 | 5
4 | 6
6 | 7
(7 rows)
In this case, the row that was NULL is first. By default, last_value will include rows up to the current row, which in this case is just the current row for id 5. You could include all rows in your frame:
SELECT id,
       row_number() OVER (ORDER BY case WHEN value IS NULL THEN 0 ELSE 1 END ASC, id
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
       last_value(value) OVER (ORDER BY case WHEN value IS NULL THEN 0 ELSE 1 END ASC, id
                               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;
id | row_number | last_value
----+------------+------------
5 | 1 | F
0 | 2 | F
1 | 3 | F
2 | 4 | F
3 | 5 | F
4 | 6 | F
6 | 7 | F
(7 rows)
But now the last row is the end of the frame and it's clearly not what you want. If you're looking for the previous row, choose lag().
So, thanks to Jeremy's explanations and another post (PostgreSQL last_value ignore nulls), I finally figured it out:
SELECT id, value,
       first_value(value) OVER (PARTITION BY t.isnull ORDER BY id) AS new_val
FROM (
    SELECT id, value,
           SUM(CASE WHEN value IS NOT NULL THEN 1 END) OVER (ORDER BY id) AS isnull
    FROM test
) t;
This query returns the result I expected.
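To see why this works, it helps to run the inner query by hand against the sample table. The running count of non-NULL values stalls on the NULL row, so the NULL lands in the same group as the last non-NULL row before it (the ORDER BY id inside the outer window pins down which row counts as "first" in each group):
id | value  | isnull
----+--------+--------
0 | A | 1
1 | B | 2
2 | C | 3
3 | D | 4
4 | E | 5
5 | [null] | 5
6 | F | 6
first_value(value) within the group where isnull = 5 is E, which fills the NULL at id 5.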
The trick here is to provide the BETWEEN frame parameters, like this:
SELECT
    id,
    COALESCE(value, LAST_VALUE(value) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING))
FROM test;
The issue with your first attempt was, aside from the partitioning, that since the BETWEEN parameters weren't provided, this default was assumed:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Even more confusing is that a few window functions, like RANK, ROW_NUMBER, NTILE, etc., assume this by default:
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
But your final solution is still more robust, since it handles consecutive NULL values. I just wanted to point out this default behavior, since I've seen people run into it many times.
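To make that robustness point concrete, here is a small self-contained sketch (hypothetical data, with NULLs at ids 5 and 6) comparing the two approaches:
WITH test_two_nulls(id, value) AS (
    VALUES (0, 'A'), (1, 'B'), (2, 'C'), (3, 'D'),
           (4, 'E'), (5, NULL), (6, NULL), (7, 'F')
)
SELECT id,
       -- the frame ends one row back: for id 6 that is id 5, whose value
       -- is NULL, so COALESCE still yields NULL
       COALESCE(value, LAST_VALUE(value) OVER (ORDER BY id
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) AS frame_fill,
       -- the running-count grouping puts ids 4, 5 and 6 in one group,
       -- so both NULLs are filled with E
       first_value(value) OVER (PARTITION BY grp ORDER BY id) AS group_fill
FROM (SELECT id, value,
             SUM(CASE WHEN value IS NOT NULL THEN 1 END) OVER (ORDER BY id) AS grp
      FROM test_two_nulls) t;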
Hi, I was trying to group data based on a particular pattern.
I have a table with two columns, as below:
Name | rollingsum
-----+-----------
A    |          5
A    |         10
A    |          0
A    |          5
A    |          0
B    |          6
B    |          0
I need to generate a key column that increments only after a rollingsum of 0 is encountered, as given below:
Name | rollingsum | key
-----+------------+-----
A    |          5 |   1
A    |         10 |   1
A    |          0 |   1
A    |          5 |   2
A    |          0 |   2
B    |          6 |   3
B    |          0 |   3
I am using Postgres. I tried to increment a variable in a CASE expression, as below:
Declare a int;
a:=1;
........etc
Case when rolling sum =0 then a:=a+1 else a end as key
But I am getting an error near ":".
Thanks in advance for all help
You need an ordering column, because the results depend on the ordering of the rows -- and SQL tables represent unordered sets.
Then do a cumulative sum of the 0 counts from the end of the data. That runs in reverse order, so subtract it from the total:
select t.*,
       (1 + sum( (rollingsum = 0)::int ) over () -
            sum( (rollingsum = 0)::int ) over (order by ordercol desc)
       ) as key
from t;
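Worked by hand against the sample data (assuming a hypothetical ordercol numbering the rows 1 through 7 in display order): there are 3 zeros in total, so key = 1 + 3 - (zeros from the current row to the end):
Name | rollingsum | zeros from here on | key
-----+------------+--------------------+-----
A    |          5 |                  3 |   1
A    |         10 |                  3 |   1
A    |          0 |                  3 |   1
A    |          5 |                  2 |   2
A    |          0 |                  2 |   2
B    |          6 |                  1 |   3
B    |          0 |                  1 |   3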
Assuming that you have a column called id to order the rows, here is one option using a cumulative count and a window frame:
select name, rollingsum,
       1 + count(*) filter (where rollingsum = 0) over (
           order by id
           rows between unbounded preceding and 1 preceding
       ) as key
from mytable;
Demo on DB Fiddle:
name | rollingsum | key
:--- | ---------: | --:
A | 5 | 1
A | 10 | 1
A | 0 | 1
A | 5 | 2
A | 0 | 2
B | 6 | 3
B | 0 | 3
I have a table like this:
Events
----+------+-----
id |start | end
----+------+-----
1 | 3 | 5
2 | 8 | 10
3 | 14 | 17
4 | 6 | 6
5 | 19 | 20
I would like to find the biggest number of empty days between two consecutive events.
Desired result:
3
This query returns the MAX() gap, but I can't seem to find a way to order the result by the end column first:
SELECT MAX(empty)
FROM (
    SELECT a.start - b.end - 1 AS empty
    FROM Reservations AS a,
         Reservations AS b
    WHERE a.id = b.id + 1
    GROUP BY b.end
    ORDER BY b.end
);
Use lag():
select max(start - prev_end) - 1 as diff
from (select t.*,
             lag("end") over (order by start) as prev_end  -- "end" needs quoting: it is a reserved word
      from t
     ) t
where prev_end is not null;
Note: This assumes that the periods are not overlapping, which is consistent with the data you have provided.
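Tracing the subquery by hand over the sample events (ordered by start) shows where the 3 comes from:
id | start | end | prev_end | start - prev_end
---+-------+-----+----------+-----------------
1  |     3 |   5 |   [null] |           [null]
4  |     6 |   6 |        5 |                1
2  |     8 |  10 |        6 |                2
3  |    14 |  17 |       10 |                4
5  |    19 |  20 |       17 |                2
The maximum of start - prev_end is 4, and 4 - 1 = 3, the expected result.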
I have the following info in my SQLite database:
ID | timestamp | val
1 | 1577644027 | 0
2 | 1577644028 | 0
3 | 1577644029 | 1
4 | 1577644030 | 1
5 | 1577644031 | 2
6 | 1577644032 | 2
7 | 1577644033 | 3
8 | 1577644034 | 2
9 | 1577644035 | 1
10 | 1577644036 | 0
11 | 1577644037 | 1
12 | 1577644038 | 1
13 | 1577644039 | 1
14 | 1577644040 | 0
I want to perform a query that returns the elements that compose an episode. An episode is an ordered set of records that meets the following requirements:
The first element is greater than zero.
The previous element of the first one is zero.
The last element is greater than zero.
The next element of the last one is zero.
The expected result of the query on this example would be something like this:
[
[{"id":3, tmstamp:1577644029, value:1}
{"id":4, tmstamp:1577644030, value:1}
{"id":5, tmstamp:1577644031, value:2}
{"id":6, tmstamp:1577644032, value:2}
{"id":7, tmstamp:1577644033, value:3}
{"id":8, tmstamp:1577644034, value:2}
{"id":9, tmstamp:1577644035, value:1}],
[{"id":11, tmstamp:1577644037, value:1}
{"id":12, tmstamp:1577644038, value:1}
{"id":13, tmstamp:1577644039, value:1}]
]
Currently, I am avoiding this query and using an auxiliary table to store the start and end timestamps of episodes, but only because I do not know how to perform this query.
Therefore, my question is quite straightforward: does anyone know how I can perform this query in order to obtain something similar to the stated output?
This answer assumes that the "before" and "after" conditions are not really important. That is, an episode can be the first row in the table.
You can identify the episodes by counting the number of 0s before each row. Then filter out the 0 values:
select t.*,
       dense_rank() over (order by grp) as episode
from (select t.*,
             sum(case when val = 0 then 1 else 0 end) over (order by timestamp) as grp
      from t
     ) t
where val <> 0;
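Worked against the sample rows: the two leading zeros push grp to 2 before the first episode starts, and the zero at id 10 bumps it to 3, so after filtering out the zero rows the query returns:
id | timestamp  | val | grp | episode
---+------------+-----+-----+--------
 3 | 1577644029 |   1 |   2 |      1
 4 | 1577644030 |   1 |   2 |      1
 5 | 1577644031 |   2 |   2 |      1
 6 | 1577644032 |   2 |   2 |      1
 7 | 1577644033 |   3 |   2 |      1
 8 | 1577644034 |   2 |   2 |      1
 9 | 1577644035 |   1 |   2 |      1
11 | 1577644037 |   1 |   3 |      2
12 | 1577644038 |   1 |   3 |      2
13 | 1577644039 |   1 |   3 |      2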
If this is not the case, then lag() and lead() and a cumulative sum can handle the previous value being 0:
select t.*,
       sum(case when prev_val = 0 and val > 0 then 1 else 0 end) over (order by timestamp) as episode
from (select t.*,
             lag(val) over (order by timestamp) as prev_val,
             lead(val) over (order by timestamp) as next_val
      from t
     ) t
where val <> 0;
If you want the result as JSON objects then you must use the JSON1 Extension functions of SQLite:
with cte as (
    select *, sum(val = 0) over (order by timestamp) grp
    from tablename
)
select json_group_array(
           json_object('id', id, 'timestamp', timestamp, 'val', val)
       ) result
from cte
where val > 0
group by grp;
Results:
| result |
| ------ |
| [{"id":3,"timestamp":1577644029,"val":1},{"id":4,"timestamp":1577644030,"val":1},{"id":5,"timestamp":1577644031,"val":2},{"id":6,"timestamp":1577644032,"val":2},{"id":7,"timestamp":1577644033,"val":3},{"id":8,"timestamp":1577644034,"val":2},{"id":9,"timestamp":1577644035,"val":1}] |
| [{"id":11,"timestamp":1577644037,"val":1},{"id":12,"timestamp":1577644038,"val":1},{"id":13,"timestamp":1577644039,"val":1}] |
I was trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my query.
id integer | num integer
___________|_____________
1 | 4
2 | 4
3 | 5
4 | 6
When I run the following query, I get the output as I expected.
SELECT id, COUNT(id) OVER(PARTITION BY num) from test;
id | count
___________|_____________
1 | 2
2 | 2
3 | 1
4 | 1
But, when I add ORDER BY to the partition,
SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;
id | count
___________|_____________
1 | 1
2 | 2
3 | 1
4 | 1
My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?
When you add an ORDER BY to an aggregate used as a window function, that aggregate turns into a "running count" (or a running version of whatever aggregate you use).
The count(*) will return the number of rows up until the "current one" based on the order specified.
The following query shows the different results for aggregates used with an order by. With sum() instead of count() it's a bit easier to see (in my opinion).
with test (id, num, x) as (
    values
        (1, 4, 1),
        (2, 4, 1),
        (3, 5, 2),
        (4, 6, 2)
)
select id,
       num,
       x,
       count(*) over () as total_rows,
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
will result in:
id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
1 | 4 | 1 | 4 | 1 | 1 | 8 | 4 | 4
2 | 4 | 1 | 4 | 2 | 2 | 8 | 8 | 8
3 | 5 | 2 | 4 | 3 | 1 | 11 | 13 | 5
4 | 6 | 2 | 4 | 4 | 2 | 11 | 19 | 11
There are more examples in the Postgres manual.
Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
Why would you expect these to return the same values? The syntax is different for a reason.
The first returns the overall count for each num -- essentially joining back the aggregated value.
The second does a cumulative count. For each row, it counts all rows with an id up to and including that row's id.
Note that such cumulative counts would normally be implemented using RANK() (or related functions).
The cumulative count is subtly different from RANK(). The cumulative count implements:
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
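A minimal sketch of that difference, on hypothetical data with a tie at 20:
WITH t(n) AS (VALUES (10), (20), (20), (30))
SELECT n,
       -- RANGE includes all peers of the current row, so both 20s count 3 rows
       COUNT(*) OVER (ORDER BY n
           RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_count,
       -- RANK() reports the position of the first peer instead
       RANK() OVER (ORDER BY n) AS rnk
FROM t;
n  | cumulative_count | rnk
---+------------------+-----
10 |                1 |   1
20 |                3 |   2
20 |                3 |   2
30 |                4 |   4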
The "why" has already been explained by others. Sometimes you have an ordered window, and you have to do a count over the whole partition despite having an ORDER BY.
To do so, use an unbounded range with RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
create table search_log
(
    id            bigint       not null primary key,
    query         varchar(255) not null,
    stemmed_query varchar(255) not null,
    created       timestamp    not null
);
SELECT query,
       created AS seen_on,
       first_value(created) OVER query_window AS last_seen,
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurrence
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
                        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
Disclaimer: I don't mean partition in the window function sense, nor table partitioning; I mean it in the more general sense, i.e. to divide up.
Here's a table:
id | y
----+------------
1 | 1
2 | 1
3 | 1
4 | 2
5 | 2
6 | null
7 | 2
8 | 2
9 | null
10 | null
I'd like to partition by checking equality on y, such that I end up with counts of the number of times each value of y appears contiguously, when sorted on id (i.e. in the order shown).
Here's the output I'm looking for:
y | count
-----+----------
1 | 3
2 | 2
null | 1
2 | 2
null | 2
So reading down the rows in that output we have:
The first partition of three 1's
The first partition of two 2's
The first partition of a null
The second partition of two 2's
The second partition of two nulls
Try:
SELECT y, count(*)
FROM (
    SELECT y,
           sum(is_new_group) OVER (
               ORDER BY id
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS grp
    FROM (
        SELECT *,
               CASE
                   WHEN y IS NULL AND lag(y) OVER (ORDER BY id) IS NULL THEN 0
                   WHEN y = lag(y) OVER (ORDER BY id) THEN 0
                   ELSE 1
               END AS is_new_group
        FROM table1
    ) flagged
) grouped
GROUP BY grp, y
ORDER BY grp;
demo: http://sqlfiddle.com/#!15/b1794/12
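For reference, tracing the inner CASE flag (is_new_group) and its running sum by hand over the sample data shows how the groups form:
id | y      | is_new_group | grp
---+--------+--------------+-----
 1 | 1      |            1 |   1
 2 | 1      |            0 |   1
 3 | 1      |            0 |   1
 4 | 2      |            1 |   2
 5 | 2      |            0 |   2
 6 | [null] |            1 |   3
 7 | 2      |            1 |   4
 8 | 2      |            0 |   4
 9 | [null] |            1 |   5
10 | [null] |            0 |   5
Grouping by (grp, y) and counting reproduces the desired output exactly.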