Unexpected behavior of window function first_value

I have two columns, OrderNo and Value. As a table value constructor:
(1, null)
,(2, 5)
,(3, null)
,(4, null)
,(5, 2)
,(6, 1)
I need to get
(1, 5) -- i.e. first nonnull Value if I go from current row and order by OrderNo
,(2, 5)
,(3, 2) -- i.e. first nonnull Value if I go from current row and order by OrderNo
,(4, 2) -- analogous
,(5, 2)
,(6, 1)
This is the query that I think should work:
;with SourceTable as (
select *
from (values
(1, null)
,(2, 5)
,(3, null)
,(4, null)
,(5, 2)
,(6, 1)
) as T(OrderNo, Value)
)
select
*
,first_value(Value) over (
order by
case when Value is not null then 0 else 1 end
, OrderNo
rows between current row and unbounded following
) as X
from SourceTable
order by OrderNo
The issue is that it returns exactly the same result set as SourceTable, and I don't understand why. For example, when the first row (OrderNo = 1) is processed, I'd expect column X to be 5, because the frame should include all rows (current row and unbounded following) and it is ordered by Value, non-nulls first, then by OrderNo. So the first row in the frame should be OrderNo = 2. Obviously it doesn't work like that, but I don't get why.
I'd much appreciate it if someone could explain how the first frame is constructed. I need this for SQL Server and also PostgreSQL.
Many thanks

Although probably more expensive than two window functions, you can do this without a subquery using arrays:
with SourceTable as (
select *
from (values (1, null),
(2, 5),
(3, null),
(4, null),
(5, 2),
(6, 1)
) T(OrderNo, Value)
)
select st.*,
(array_remove(array_agg(value) over (order by orderno rows between current row and unbounded following), null))[1] as x
from SourceTable st
order by OrderNo;
Or using a lateral join:
select st.*, st2.value
from SourceTable st left join lateral
(select st2.*
from SourceTable st2
where st2.value is not null and st2.orderno >= st.orderno
order by st2.orderno asc
limit 1
) st2
on 1=1
order by OrderNo;
With the right indexes on the source table, the lateral join might be the best solution from a performance perspective (I have been surprised by the performance of lateral joins under the right circumstances).
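Since the question also asks about SQL Server, which has neither LATERAL nor LIMIT, the same idea can be sketched there with OUTER APPLY and TOP (1). This is an adaptation of the lateral join above rather than something I have benchmarked, and the alias nx is made up for illustration:

```sql
-- SQL Server sketch: for each row, look up the nearest non-null Value
-- at or after the current OrderNo.
with SourceTable as (
    select *
    from (values (1, null), (2, 5), (3, null), (4, null), (5, 2), (6, 1)
         ) as T(OrderNo, Value)
)
select st.OrderNo,
       st.Value,
       nx.Value as X
from SourceTable st
outer apply (
    -- TOP (1) with ORDER BY plays the role of LIMIT 1 in the lateral version
    select top (1) st2.Value
    from SourceTable st2
    where st2.Value is not null
      and st2.OrderNo >= st.OrderNo
    order by st2.OrderNo
) nx
order by st.OrderNo;
```

OUTER APPLY keeps rows that have no non-null value after them (X is null there), in the same way the left join lateral does.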

It's pretty easy to see why first_value doesn't work if you order the results by case when Value is not null then 0 else 1 end, orderno:
orderno | value | x
---------+-------+---
2 | 5 | 5
5 | 2 | 2
6 | 1 | 1
1 | |
3 | |
4 | |
(6 rows)
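For orderno=1, the frame (the current row and everything after it in this ordering) contains only orderno 1, 3 and 4, so there's nothing in it that would be not-null. More generally, a frame that starts at the current row always has the current row as its first row, so first_value simply returns that row's own Value. That is why the result looks identical to SourceTable.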
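To see the frames directly, here is a small PostgreSQL sketch that lists which OrderNo values each row's frame contains, using the same ORDER BY and frame clause as the query in the question. The column name frame_ordernos is made up for illustration:

```sql
-- PostgreSQL sketch: show the contents of each row's window frame.
with SourceTable as (
    select *
    from (values (1, null), (2, 5), (3, null), (4, null), (5, 2), (6, 1)
         ) as T(OrderNo, Value)
)
select OrderNo,
       Value,
       array_agg(OrderNo) over w as frame_ordernos
from SourceTable
window w as (
    order by case when Value is not null then 0 else 1 end, OrderNo
    rows between current row and unbounded following
)
order by OrderNo;
```

Every array starts with the row's own OrderNo. For OrderNo = 1 the frame is {1,3,4}, which contains no row with a non-null Value.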
Instead, we can arrange the orders into groups using count as a window function in a sub-query. We then use max as a window function over that group (this is arbitrary, min would work just as well) to get the one non-null value in that group:
with SourceTable as (
select *
from (values
(1, null)
,(2, 5)
,(3, null)
,(4, null)
,(5, 2)
,(6, 1)
) as T(OrderNo, Value)
)
select orderno, order_group, max(value) OVER (PARTITION BY order_group) FROM (
SELECT *,
count(value) OVER (ORDER BY orderno DESC) as order_group
from SourceTable
) as sub
order by orderno;
orderno | order_group | max
---------+-------------+-----
1 | 3 | 5
2 | 3 | 5
3 | 2 | 2
4 | 2 | 2
5 | 2 | 2
6 | 1 | 1
(6 rows)
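For the SQL Server side of the question, newer versions (SQL Server 2022, as far as I know) accept IGNORE NULLS on first_value, which expresses the requirement directly. A sketch under that assumption; it will not run on older versions, and PostgreSQL versions without IGNORE NULLS still need one of the other approaches:

```sql
-- SQL Server 2022+ sketch: first non-null Value from the current row onward.
with SourceTable as (
    select *
    from (values (1, null), (2, 5), (3, null), (4, null), (5, 2), (6, 1)
         ) as T(OrderNo, Value)
)
select OrderNo,
       Value,
       first_value(Value) ignore nulls over (
           order by OrderNo
           rows between current row and unbounded following
       ) as X
from SourceTable
order by OrderNo;
```

Note that the window is ordered by OrderNo only. The null handling is done by IGNORE NULLS, not by the sort order.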

Related

How check value by consecutive date variable

I have database table in SNOWFLAKE, where I need check for each customer if there is FLAG_1 == 1 at minimum 3 days in row. Flag_1 indicates whether the order contained any specific goods. And create new table with customer_id and flag_2. I really don't know how to handle this problem.
Sample table:
CREATE TABLE TMP_TEST
(
CUSTOMER_ID INT,
ORDER_DATE DATE,
FLAG_1 INT
);
INSERT INTO TMP_TEST (CUSTOMER_ID, ORDER_DATE, FLAG_1)
VALUES
(001, '2020-04-01', 0),
(001, '2020-04-02', 1),
(001, '2020-04-03', 1),
(001, '2020-04-04', 1),
(001, '2020-04-05', 1),
(001, '2020-04-06', 0),
(001, '2020-04-07', 0),
(001, '2020-04-08', 0),
(001, '2020-04-09', 1),
(002, '2020-04-10', 1),
(002, '2020-04-11', 0),
(002, '2020-04-12', 0),
(002, '2020-04-13', 1),
(002, '2020-04-14', 1),
(002, '2020-04-15', 0),
(002, '2020-04-16', 1),
(002, '2020-04-17', 1);
Expected output table:
CUSTOMER_ID FLAG_2
001 1
002 0

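Maybe this can help. It sums FLAG_1 over the current row and the two rows before it, so a sum of 3 means three flagged rows in a row (a frame of 3 preceding to 1 preceding would miss a streak that ends on the customer's last row):
with calcflag as (
select customer_id, IFF( sum(flag_1) over (PARTITION by customer_id order by order_date rows between 2 preceding and current row) = 3, 1, 0 ) as new_flag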
from tmp_Test)
select customer_id, max(new_flag) flag_2
from calcflag
group by 1
order by 1;
+-------------+--------+
| CUSTOMER_ID | FLAG_2 |
+-------------+--------+
| 1 | 1 |
| 2 | 0 |
+-------------+--------+

Using COUNT_IF also works:
with calcflag as (
select
customer_id,
IFF(
count_if(flag_1 = 1) over (
PARTITION by customer_id
order by order_date
rows between 2 preceding and current row
) = 3, 1, 0
) as new_flag
from tmp_Test
)
select
customer_id,
max(new_flag) flag_2
from calcflag
group by 1
+-------------+--------+
| CUSTOMER_ID | FLAG_2 |
|-------------+--------|
| 1 | 1 |
| 2 | 0 |
+-------------+--------+
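Both window queries above count rows, so they treat adjacent rows as adjacent days. If a customer can have missing dates, a date-based variant may be safer. The following Snowflake sketch is my own adaptation, assuming at most one row per customer per date; the names flagged, streaks, grp and streak_len are made up:

```sql
-- Snowflake sketch: group flagged dates into streaks of truly consecutive days.
with flagged as (
    select customer_id,
           order_date,
           -- consecutive dates minus an increasing row number give the same anchor date
           dateadd(day,
                   -row_number() over (partition by customer_id order by order_date),
                   order_date) as grp
    from tmp_test
    where flag_1 = 1
),
streaks as (
    select customer_id, grp, count(*) as streak_len
    from flagged
    group by customer_id, grp
)
select c.customer_id,
       iff(max(s.streak_len) >= 3, 1, 0) as flag_2
from (select distinct customer_id from tmp_test) c
left join streaks s
       on s.customer_id = c.customer_id
group by c.customer_id
order by c.customer_id;
```

The left join keeps customers that have no flagged rows at all, and they come out with flag_2 = 0.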

Snowflake supports MATCH_RECOGNIZE, which is the easiest way to detect advanced patterns across multiple rows.
To find 3 or more consecutive occurrences, the pattern is PATTERN ( a{3,} ):
SELECT *
FROM TMP_TEST
MATCH_RECOGNIZE (
PARTITION BY CUSTOMER_ID
ORDER BY ORDER_DATE
MEASURES MATCH_NUMBER() AS mn
ALL ROWS PER MATCH WITH UNMATCHED ROWS
PATTERN ( a{3,} )
DEFINE a AS FLAG_1 = 1
) mr
ORDER BY CUSTOMER_ID, ORDER_DATE;
Collapsing to single row per group:
SELECT CUSTOMER_ID, COALESCE(MIN(MN),0) AS FLAG_2
FROM TMP_TEST
MATCH_RECOGNIZE (
PARTITION BY CUSTOMER_ID
ORDER BY ORDER_DATE
MEASURES MATCH_NUMBER() AS mn
ALL ROWS PER MATCH WITH UNMATCHED ROWS
PATTERN ( a{3,})
DEFINE a AS FLAG_1 = 1
) mr
GROUP BY CUSTOMER_ID;
The power of this solution lies in the PATTERN clause, which can easily be extended with new conditions. For instance:
PATTERN ( a b{1,2} a )
DEFINE a AS FLAG_1 = 1,
b AS FLAG_1 = 0;
Here: find a row with flag = 1, followed by one or two rows with flag = 0, and ending with a row with flag = 1.

Select rows using group by and in each group get column values based on highest of another column value

I need to get the latest values of some fields within each group, where "latest" is decided by another field.
We have the table "SchoolReview":
Id | SchoolId | Review | Point
---+----------+--------+------
1 | 1 | rv1 | 8
2 | 1 | rv2 | 7
3 | 2 | rv3 | 4
4 | 2 | rv4 | 7
5 | 3 | rv5 | 2
6 | 3 | rv6 | 8
I need to group by SchoolId and, inside each group, get Review and Point from the row with the highest "Id".
I don't need the "Id" column, but it's okay if the solution returns it.
The result I am looking for looks like this:
SchoolId | Review | Point
---------+--------+------
1 | rv2 | 7
2 | rv4 | 7
3 | rv6 | 8
Can anyone experienced with MS SQL Server help with this?

Using the sample data from the other answer:
SELECT *
INTO #Data
FROM (VALUES
(1, 1, 'rv1', 8),
(2, 1, 'rv2', 7),
(3, 2, 'rv3', 4),
(4, 2, 'rv4', 7),
(5, 3, 'rv5', 2),
(6, 3, 'rv6', 8)
) v (Id, SchoolId, Review, Point)
SELECT S.SchoolId,
S.Review,
S.Point
FROM #Data S
INNER JOIN
(
SELECT Id = MAX(S1.Id),
S1.SchoolId
FROM #Data S1
GROUP BY SchoolId
) X ON X.Id = S.Id AND X.schoolId = S.SchoolId
ORDER BY X.SchoolId
;

You do not need to group the rows, you simply need to select the appropriate rows from the table. In this case, using ROW_NUMBER() is an option:
Table:
SELECT *
INTO Data
FROM (VALUES
(1, 1, 'rv1', 8),
(2, 1, 'rv2', 7),
(3, 2, 'rv3', 4),
(4, 2, 'rv4', 7),
(5, 3, 'rv5', 2),
(6, 3, 'rv6', 8)
) v (Id, SchoolId, Review, Point)
Statement:
SELECT SchoolId, Review, Point
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY SchoolId ORDER BY Id DESC) AS Rn
FROM Data
) t
WHERE Rn = 1
Result:
SchoolId Review Point
---------------------
1 rv2 7
2 rv4 7
3 rv6 8
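A more compact variant of the same idea on SQL Server is to sort by the row number and keep every row that ties for first place. This is only a sketch against the Data table created above. It is not necessarily faster than the subquery, and the order of the returned rows is not guaranteed:

```sql
-- SQL Server sketch: rows with ROW_NUMBER() = 1 all tie for the first position
-- in the ORDER BY, so TOP (1) WITH TIES returns one row per SchoolId.
SELECT TOP (1) WITH TIES
       SchoolId, Review, Point
FROM Data
ORDER BY ROW_NUMBER() OVER (PARTITION BY SchoolId ORDER BY Id DESC);
```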

SQL Occurrence of Sequence Number

I want to find any Name that has 4 or more rows whose SeqNo values are consecutive.
A Name still qualifies if there is a break somewhere in its SeqNo values, as long as 4 or more of its rows are consecutive.
Example:
SeqNo | Name
10 | A
15 | A
16 | A
17 | A
18 | A
9 | B
10 | B
13 | B
14 | B
6 | C
7 | C
9 | C
10 | C
OUTPUT:
A
Below is a script for anyone helping:
create table testseq (Id int, Name char)
INSERT into testseq values
(10, 'A'),
(15, 'A'),
(16, 'A'),
(17, 'A'),
(18, 'A'),
(9, 'B'),
(10, 'B'),
(13, 'B'),
(14, 'B'),
(6, 'C'),
(7, 'C'),
(9, 'C'),
(10, 'C')
SELECT * FROM testseq

You can use some gaps-and-islands techniques for this.
If you want names that have at least 4 consecutive records where seqno increases by 1 (the column is named Id in the script above), you can use the difference between seqno and row_number() to define the groups, and then aggregate:
select distinct name
from (
select t.*, row_number() over(partition by name order by seqno) rn
from testseq t
) t
group by name, rn - seqno
having count(*) >= 4
For the sample data shown in the question, this returns A: its seqno values 15 through 18 form a run of four consecutive records. B and C each have runs of only two.
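To make the grouping visible, this sketch prints the key that identifies each run for the testseq table from the question's script, where the sequence column is named Id. The column name island_key is made up for illustration:

```sql
-- Rows in the same run of consecutive Ids share the same island_key,
-- because Id and the row number both increase by 1 from row to row.
select Name,
       Id,
       row_number() over (partition by Name order by Id) as rn,
       Id - row_number() over (partition by Name order by Id) as island_key
from testseq
order by Name, Id;
```

For A, the Id 10 gets key 9, while 15, 16, 17 and 18 all get key 13, so that group has four rows. B and C only produce groups of two rows each.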

I don't really view this as a "gaps-and-islands" problem. You are just looking for a minimum number of adjacent rows. This is easily handled using lag() or lead():
select t.*
from (select t.*,
lead(seqno, 3) over (partition by name order by seqno) as seqno_name_3
from t
) t
where seqno_name_3 = seqno + 3;
This looks at the sequence number three rows ahead within the same name. If it equals seqno + 3, the current row and the three rows after it form four consecutive sequence numbers.
If you just want the name and to handle duplicates:
select distinct name
from (select t.*,
lead(seqno, 3) over (partition by name order by seqno) as seqno_name_3
from t
) t
where seqno_name_3 = seqno + 3;
If the sequence numbers can have gaps (but are otherwise adjacent):
select distinct name
from (select t.*,
lead(seqno, 3) over (partition by name order by seqno) as seqno_name_3,
lead(seqno, 3) over (order by seqno) as seqno_3
from t
) t
where seqno_name_3 = seqno_3;

A solution in plain SQL, no LAG() or LEAD() or ROW_NUMBER():
SELECT t1.Name
FROM testseq t1
WHERE (
SELECT count(t2.Id)
FROM testseq t2
WHERE t2.Name=t1.Name
and t2.Id between t1.Id and t1.Id+3
GROUP BY t2.Name)>=4
GROUP BY t1.Name;

Filling in missing values with a median in postgres

How can I replace avg with a median calculation in this?
select *
, coalesce(val, avg(val) over (order by t rows between 3 preceding and 1 preceding)) as fixed
from (
values
(1, 10),
(2, NULL),
(3, 10),
(4, 15),
(5, 11),
(6, NULL),
(7, NULL),
(8, NULL),
(9, NULL)
) as test(t, val)
;
Is there a legal version of this?
percentile_cont(0.5) within group(order by val) over (order by t rows between 3 preceding and 1 preceding)

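Unfortunately, percentile_cont() is an ordered-set aggregate function, and Postgres does not allow those to be used as window functions with OVER.
One workaround is to use an inline subquery to do the aggregate computation (the queries below assume the data is stored in a table test(id, val)).
If ids are consecutive integers with no gaps, then you can do: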
select
t.*,
coalesce(
t.val,
(
select percentile_cont(0.5) within group(order by t1.val)
from test t1
where t1.id between t.id - 3 and t.id - 1
)
) fixed
from test t
Otherwise, you need an additional level of nesting:
select
t.*,
coalesce(
t.val,
(
select percentile_cont(0.5) within group(order by t1.val)
from (select val from test t1 where t1.id < t.id order by t1.id desc limit 3) t1
)
) fixed
from test t
Both queries yield:
id | val | fixed
-: | ---: | :----
1 | 10 | 10
2 | null | 10
3 | 10 | 10
4 | 15 | 15
5 | 11 | 11
6 | null | 11
7 | null | 13
8 | null | 11
9 | null | null
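Another option that stays close to the original window syntax is to collect the frame into an array with array_agg(), which can be used as a window function, and then take the median of the unnested array. This is a sketch of my own rather than part of the answer above; the names framed, prev_vals, u and x are made up, and the question's values are inlined as a CTE called test:

```sql
-- PostgreSQL sketch: median of the previous 3 rows via array_agg + unnest.
with test(id, val) as (
    values (1, 10), (2, null), (3, 10), (4, 15), (5, 11),
           (6, null), (7, null), (8, null), (9, null)
),
framed as (
    select id,
           val,
           -- same frame as in the question; nulls are kept in the array
           array_agg(val) over (order by id
                                rows between 3 preceding and 1 preceding) as prev_vals
    from test
)
select id,
       val,
       coalesce(val,
                (select percentile_cont(0.5) within group (order by u.x)
                 from unnest(prev_vals) as u(x))) as fixed
from framed
order by id;
```

percentile_cont() skips nulls, so rows whose three predecessors are all null still end up with a null in fixed.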

SELECT TOP 20 Percent SQL

I have a query that selects the top 20 percent of employees by highest GrandTotal, but the result is not quite fair. For example, 20 percent of 10 people is 2, so the output is:
EmpName GrandTotal
Kelvin 50
Gem 40
But the 3rd and 4th people also have a GrandTotal of 40. I need some ideas and advice on how to solve this problem.
SELECT TOP 20 PERCENT
EmpName,
SUM(Scoring) AS GrandTotal
FROM
[masterView]
GROUP BY
EmpName
ORDER BY
GrandTotal DESC, EmpName ASC

On SQL Server you can use WITH TIES to include rows that tie with the last row returned:
SELECT TOP 20 PERCENT WITH TIES Id, sum(Score) as GrandTotal
FROM myTable GROUP BY Id
ORDER BY GrandTotal DESC

Test Data
CREATE TABLE Table1
([ID] int, [Score] int)
;
INSERT INTO Table1
([ID], [Score])
VALUES
(1, 10), (2, 20),
(3, 30), (4, 20),
(5, 10), (6, 40),
(7, 40), (8, 50),
(9, 10), (10, 5);
Query
with ranked as (
select
id,
rank() over (order by Score desc) as rnk
from Table1
),
total as (
select count(*) as total
from Table1
)
SELECT *
FROM ranked
CROSS JOIN total
WHERE ranked.rnk <= 0.2 * total.total
OUTPUT
| id | rnk | total |
|----|-----|-------|
| 8 | 1 | 10 |
| 6 | 2 | 10 |
| 7 | 2 | 10 |
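Applied to the question's own table, the same rank-based filter might look like the sketch below. It uses masterView, EmpName and Scoring from the question; the CTE names totals and ranked are made up. The data has to be aggregated first, because the ranking is on the per-employee GrandTotal:

```sql
-- Sketch: rank employees by their summed score and keep every employee
-- whose rank falls within the top 20 percent of the employee count.
WITH totals AS (
    SELECT EmpName,
           SUM(Scoring) AS GrandTotal
    FROM masterView
    GROUP BY EmpName
),
ranked AS (
    SELECT EmpName,
           GrandTotal,
           RANK() OVER (ORDER BY GrandTotal DESC) AS rnk,
           COUNT(*) OVER () AS total
    FROM totals
)
SELECT EmpName, GrandTotal
FROM ranked
WHERE rnk <= 0.2 * total
ORDER BY GrandTotal DESC, EmpName ASC;
```

One difference to be aware of: TOP ... PERCENT rounds a fractional row count up, while this comparison does not, so the two approaches can disagree when 20 percent of the row count is not a whole number.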
