Making a partition query, reporting the first NOT NULL occurrence within a partition at or before the current row (if any) - sql

I have a logins table which looks like this:
 person_id |  login_at  | points_won
-----------+------------+------------
         1 | 2017-02-02 |
         1 | 2017-02-01 |
         2 | 2017-02-01 |          2
         1 | 2017-01-29 |          2
         2 | 2017-01-28 |
         2 | 2017-01-25 |          1
         3 | 2017-01-22 |
         3 | 2017-01-21 |
         1 | 2017-01-10 |          3
         1 | 2017-01-01 |          1
I want to generate a result set containing a last_points_won column that works like this: for each row, partition by person_id, order the partition by login_at descending, then report as last_points_won the first non-NULL points_won found in the ordered partition starting at the current row (if any).
It should result in something like this:
 person_id |  login_at  | points_won | last_points_won
-----------+------------+------------+-----------------
         1 | 2017-02-02 |            |               2
         1 | 2017-02-01 |            |               2
         2 | 2017-02-01 |          2 |               2
         1 | 2017-01-29 |          2 |               2
         2 | 2017-01-28 |            |               1
         2 | 2017-01-25 |          1 |               1
         3 | 2017-01-22 |            |
         3 | 2017-01-21 |            |
         1 | 2017-01-10 |          3 |               3
         1 | 2017-01-01 |          1 |               1
Or in plain words: for each row, give me either the points won during this login or, if none, the points won at the person's latest previous login where they actually won some points.
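For reference, here is a minimal setup that reproduces the sample data (a sketch: the table is named tbl to match the queries in the answers below, while the question calls it logins):

CREATE TABLE tbl (
  person_id  int,
  login_at   date,
  points_won int
);

INSERT INTO tbl (person_id, login_at, points_won) VALUES
  (1, '2017-02-02', NULL),
  (1, '2017-02-01', NULL),
  (2, '2017-02-01', 2),
  (1, '2017-01-29', 2),
  (2, '2017-01-28', NULL),
  (2, '2017-01-25', 1),
  (3, '2017-01-22', NULL),
  (3, '2017-01-21', NULL),
  (1, '2017-01-10', 3),
  (1, '2017-01-01', 1);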

This could be achieved within a single window with the IGNORE NULLS option of the last_value() window function, but that's not supported in PostgreSQL yet. One alternative is the FILTER (WHERE ...) clause, but that only works when the window function is an aggregate function in the first place (which last_value() is not, though something similar can easily be created with CREATE AGGREGATE). To solve this with only built-in aggregates, you can use array_agg():
SELECT (tbl).*,
       all_points_won[array_upper(all_points_won, 1)] last_points_won
FROM  (SELECT tbl,
              array_agg(points_won)
                FILTER (WHERE points_won IS NOT NULL)
                OVER (PARTITION BY person_id ORDER BY login_at) all_points_won
       FROM   tbl) s
Note: the sub-query is not needed if you create a dedicated last_agg() aggregate, like:
CREATE FUNCTION last_val(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE SQL
  IMMUTABLE
  CALLED ON NULL INPUT
AS 'SELECT $2';

CREATE AGGREGATE last_agg(anyelement) (
  SFUNC = last_val,
  STYPE = anyelement
);

SELECT tbl.*,
       last_agg(points_won)
         FILTER (WHERE points_won IS NOT NULL)
         OVER (PARTITION BY person_id ORDER BY login_at) last_points_won
FROM   tbl;
Rextester sample
Edit: once the IGNORE NULLS option is supported in PostgreSQL, you can use the following query (which should already work in Amazon Redshift):

SELECT tbl.*,
       last_value(points_won IGNORE NULLS)
         OVER (PARTITION BY person_id ORDER BY login_at
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) last_points_won
FROM   tbl;

This alternative uses only built-in aggregates and no FILTER clause: a running count of the non-NULL points_won values assigns each row a group_id that increments exactly when points are won, so every (person_id, group_id) group contains at most one non-NULL points_won, which min() then extracts:

select *,
       min(points_won) over (partition by person_id, group_id) as last_points_won
from  (select *,
              count(points_won) over (partition by person_id
                                      order by login_at) as group_id
       from mytable) t
+-----------+------------+------------+----------+-----------------+
| person_id | login_at   | points_won | group_id | last_points_won |
+-----------+------------+------------+----------+-----------------+
|         1 | 2017-01-01 |          1 |        1 |               1 |
|         1 | 2017-01-10 |          3 |        2 |               3 |
|         1 | 2017-01-29 |          2 |        3 |               2 |
|         1 | 2017-02-01 |     (null) |        3 |               2 |
|         1 | 2017-02-02 |     (null) |        3 |               2 |
|         2 | 2017-01-25 |          1 |        1 |               1 |
|         2 | 2017-01-28 |     (null) |        1 |               1 |
|         2 | 2017-02-01 |          2 |        2 |               2 |
|         3 | 2017-01-21 |     (null) |        0 |          (null) |
|         3 | 2017-01-22 |     (null) |        0 |          (null) |
+-----------+------------+------------+----------+-----------------+

Related

Get row for each unique user based on highest column value

I have the following data
+--------+-----------+--------+
| UserId | Timestamp | Rating |
+--------+-----------+--------+
|      1 |         1 |   1202 |
|      2 |         1 |   1198 |
|      1 |         2 |   1204 |
|      2 |         2 |   1196 |
|      1 |         3 |   1206 |
|      2 |         3 |   1194 |
|      1 |         4 |   1198 |
|      2 |         4 |   1202 |
+--------+-----------+--------+
I am trying to find the distribution of each user's Rating, based on their latest row in the table (latest is determined by Timestamp). On the path to that, I am trying to get a list of user IDs and Ratings which would look like the following
+--------+--------+
| UserId | Rating |
+--------+--------+
|      1 |   1198 |
|      2 |   1202 |
+--------+--------+
Trying to get here, I sorted the list on UserId and Timestamp (desc) which gives the following.
+--------+-----------+--------+
| UserId | Timestamp | Rating |
+--------+-----------+--------+
|      1 |         4 |   1198 |
|      2 |         4 |   1202 |
|      1 |         3 |   1206 |
|      2 |         3 |   1194 |
|      1 |         2 |   1204 |
|      2 |         2 |   1196 |
|      1 |         1 |   1202 |
|      2 |         1 |   1198 |
+--------+-----------+--------+
So now I just need to take the top N rows, where N is the number of players. But I can't use a LIMIT clause, because LIMIT needs a constant expression: I want to use count(id) as the input to LIMIT, and that doesn't seem to work.
Any suggestions on how I can get the data I need?
Cheers!
Andy
This should work:
SELECT test.UserId, Rating
FROM test
JOIN (SELECT UserId, MAX(Timestamp) Timestamp
      FROM test
      GROUP BY UserId) m
  ON test.UserId = m.UserId AND test.Timestamp = m.Timestamp
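Note that if a user has several rows sharing the same maximum Timestamp, this join returns all of them; the window-function version below keeps exactly one row per user.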
If you can use window functions, then you can use the following:
SELECT UserId, Rating
FROM (SELECT UserId, Rating,
             ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY Timestamp DESC) row_num
      FROM test) m
WHERE row_num = 1

Count rows in table that are the same in a sequence

I have a table that looks like this
+----+------------+------+
| ID | Session_ID | Type |
+----+------------+------+
|  1 |          1 |    2 |
|  2 |          1 |    4 |
|  3 |          1 |    2 |
|  4 |          2 |    2 |
|  5 |          2 |    2 |
|  6 |          3 |    2 |
|  7 |          3 |    1 |
+----+------------+------+
And I would like to count all occurrences of a type that appear in a sequence.
The output should look something like this:
+------------+------+-----+
| Session_ID | Type | cnt |
+------------+------+-----+
|          1 |    2 |   1 |
|          1 |    4 |   1 |
|          1 |    2 |   1 |
|          2 |    2 |   2 |
|          3 |    2 |   1 |
|          3 |    1 |   1 |
+------------+------+-----+
A simple group by like
SELECT session_id, type, COUNT(type)
FROM table
GROUP BY session_id, type
doesn't work, since I need to group only rows that are "touching".
Is this possible with a plain SQL select, or will I need some sort of coding, such as a stored procedure or application-side code?
UPDATE, to clarify "sequence":
If the following row (ordered by ID) has the same type, it should be counted together with the current one.
Within a session_ID, the ID determines the order, since I only want to group rows that share the same session_ID.
So if there are 3 rows in one session:
the row with ID 1 has type 1,
the second row also has type 1,
and row 3 has type 2.
Input:
+----+------------+------+
| ID | Session_ID | Type |
+----+------------+------+
|  1 |          1 |    1 |
|  2 |          1 |    1 |
|  3 |          1 |    2 |
+----+------------+------+
The sequence here spans rows 1 and 2. These three rows should produce:
Output:
+------------+------+-------+
| Session_ID | Type | count |
+------------+------+-------+
|          1 |    1 |     2 |
|          1 |    2 |     1 |
+------------+------+-------+
You can use the difference between id and row_number() to identify the gaps (a gaps-and-islands technique) and then perform your count:
;with cte as
(
  select *,
         id - row_number() over (partition by session_id, type order by id) as grp
  from mytable
)
select session_id, type, count(*) as cnt
from cte
group by session_id, type, grp
order by max(id)
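To see why this works, here are the intermediate values for the sample data (a sketch, using the mytable name from the query above):

 id | session_id | type | row_number | grp
----+------------+------+------------+-----
  1 |          1 |    2 |          1 |   0
  2 |          1 |    4 |          1 |   1
  3 |          1 |    2 |          2 |   1
  4 |          2 |    2 |          1 |   3
  5 |          2 |    2 |          2 |   3
  6 |          3 |    2 |          1 |   5
  7 |          3 |    1 |          1 |   6

Rows 4 and 5 share grp = 3, so they collapse into a single group with cnt = 2. The grp values only need to be distinct within each (session_id, type) pair, which is why the final query groups by all three columns.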

SQL - Identify consecutive numbers in a table

Is there a way to flag consecutive numbers in an SQL table?
Based on the values in the 'value_group_4' column, is it possible to tag continuous values? This needs to be done within groups of each 'date_group_1'.
I tried using row_number(), rank() and dense_rank(), but was unable to come up with a foolproof way.
This has nothing to do with consecutiveness. You simply want to mark all rows where date_group_1 and value_group_4 are not unique.
One way:
select mytable.*,
       case when exists
            (
              select null
              from mytable agg
              where agg.date_group_1 = mytable.date_group_1
                and agg.value_group_4 = mytable.value_group_4
              group by agg.date_group_1, agg.value_group_4
              having count(*) > 1
            ) then 1 else 0 end as flag
from mytable
order by date_group_1, value_group_4;
In a later version of SQL Server you'd use COUNT OVER instead.
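For instance, a minimal sketch of that COUNT OVER variant (window aggregates with PARTITION BY are available from SQL Server 2005 onward):

select mytable.*,
       -- flag rows whose (date_group_1, value_group_4) combination occurs more than once
       case when count(*) over (partition by date_group_1, value_group_4) > 1
            then 1
            else 0
       end as flag
from mytable
order by date_group_1, value_group_4;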
SQL tables represent unordered sets. There is no such thing as consecutive values, unless a column specifies the ordering. Your data does not have such an obvious column, but I'll assume one exists and just call it id for convenience.
With such a column, lag()/lead() does what you want:
select t.*,
       (case when lag(value_group_4) over (partition by date_group_1 order by id) = value_group_4
             then 1
             when lead(value_group_4) over (partition by date_group_1 order by id) = value_group_4
             then 1
             else 0
        end) as flag
from t;
On close inspection, value_group_3 may do what you want. So you can use that for the id.
If your version of SQL Server doesn't have a full suite of window functions, it should still be possible. This problem looks like a last-non-null problem, for which Itzik Ben-Gan has a good example here: http://www.itprotoday.com/software-development/last-non-null-puzzle
Also, look at Mikael Eriksson's answer here which uses no windowing functions.
If the order of your data is determined by the date_group_1, value_group_3 column values, then why not make it as simple as the following query:
select *,
       rank() over (partition by date_group_1 order by value_group_3) - 1 as value_rank,
       case
         when count(*) over (partition by date_group_1, value_group_3) > 1 then 1
         else 0
       end as expected_result
from data;
Output:
| date_group_1 | category_group_2 | value_group_3 | value_rank | expected_result |
+--------------+------------------+---------------+------------+-----------------+
| 2018-01-11   | A                |          15.3 |          0 |               0 |
| 2018-01-11   | B                |          17.3 |          1 |               1 |
| 2018-01-11   | A                |          17.3 |          1 |               1 |
| 2018-01-11   | B                |            21 |          3 |               0 |
| 2018-01-22   | A                |          15.3 |          0 |               0 |
| 2018-01-22   | B                |          17.3 |          1 |               0 |
| 2018-01-22   | A                |            21 |          2 |               0 |
| 2018-01-22   | B                |            23 |          3 |               0 |
| 2018-03-13   | A                |          15.3 |          0 |               0 |
| 2018-03-13   | B                |          17.3 |          1 |               1 |
| 2018-03-13   | A                |          17.3 |          1 |               1 |
| 2018-03-13   | B                |            23 |          3 |               0 |
| 2018-05-15   | A                |             6 |          0 |               0 |
| 2018-05-15   | B                |           6.3 |          1 |               0 |
| 2018-05-15   | A                |            15 |          2 |               0 |
| 2018-05-15   | B                |          16.3 |          3 |               1 |
| 2018-05-15   | A                |          16.3 |          3 |               1 |
| 2018-05-15   | B                |            22 |          5 |               0 |
| 2019-05-04   | A                |             0 |          0 |               0 |
| 2019-05-04   | B                |             7 |          1 |               0 |
| 2019-05-04   | A                |          15.3 |          2 |               0 |
| 2019-05-04   | B                |          17.3 |          3 |               0 |
Test it online with SQL Fiddle.

Selecting latest consecutive records that match a condition with PostgreSQL

I am looking for a PostgreSQL query to find the latest consecutive records that match a condition. Let me explain it better with an example:
| ID   | HEATING STATE   | DATE       |
| ---- | --------------- | ---------- |
| 1    | ON              | 2018-02-19 |
| 2    | ON              | 2018-02-20 |
| 3    | OFF             | 2018-02-20 |
| 4    | OFF             | 2018-02-21 |
| 5    | ON              | 2018-02-21 |
| 6    | OFF             | 2018-02-21 |
| 7    | ON              | 2018-02-22 |
| 8    | ON              | 2018-02-22 |
| 9    | ON              | 2018-02-22 |
| 10   | ON              | 2018-02-23 |
I need to find all the recent consecutive records with date >= 2018-02-20 and heating_state ON, i.e. the ones with ID 7, 8, 9, 10. My main issue is with the fact that they must be consecutive.
For further clarification, if needed:
ID 1 is excluded because it is older than 2018-02-20
ID 2 is excluded because it is followed by ID 3, which has heating state OFF
ID 3 is excluded because it has heating state OFF
ID 4 is excluded because it has heating state OFF
ID 5 is excluded because it is followed by ID 6, which has heating state OFF
ID 6 is excluded because it has heating state OFF
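For reference, a minimal setup that reproduces the sample (a sketch: the table and column names tab/state/date follow the first answer's query below; the second answer uses t/heating_state/dt instead):

CREATE TABLE tab (
  id    int,
  state text,
  date  date
);

INSERT INTO tab (id, state, date) VALUES
  (1,  'ON',  '2018-02-19'),
  (2,  'ON',  '2018-02-20'),
  (3,  'OFF', '2018-02-20'),
  (4,  'OFF', '2018-02-21'),
  (5,  'ON',  '2018-02-21'),
  (6,  'OFF', '2018-02-21'),
  (7,  'ON',  '2018-02-22'),
  (8,  'ON',  '2018-02-22'),
  (9,  'ON',  '2018-02-22'),
  (10, 'ON',  '2018-02-23');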
I think this is best solved using window functions and a filtered aggregate.
For each row, add the number of later rows that have state = 'OFF', then use only the rows where that count is 0.
You need a subquery because you cannot use a window function result in the WHERE condition (WHERE is evaluated before window functions).
SELECT id, state, date
FROM (SELECT id, state, date,
             count(*) FILTER (WHERE state = 'OFF')
               OVER (ORDER BY date DESC, state DESC) AS later_off_count
      FROM tab) q
WHERE later_off_count = 0;
 id | state |    date
----+-------+------------
 10 | ON    | 2018-02-23
  9 | ON    | 2018-02-22
  8 | ON    | 2018-02-22
  7 | ON    | 2018-02-22
(4 rows)
Use the LEAD function with a CASE expression.
SQL Fiddle
Query 1:
SELECT id,
       heating_state,
       dt
FROM (SELECT t.*,
             CASE
               WHEN dt >= timestamp '2018-02-20'
                    AND heating_state = 'ON'
                    AND LEAD(heating_state, 1, heating_state)
                          OVER (ORDER BY dt) = 'ON' THEN 1
               ELSE 0
             END on_state
      FROM t) s
WHERE on_state = 1
Results:
| id | heating_state | dt                   |
|----|---------------|----------------------|
| 7  | ON            | 2018-02-22T00:00:00Z |
| 8  | ON            | 2018-02-22T00:00:00Z |
| 9  | ON            | 2018-02-22T00:00:00Z |
| 10 | ON            | 2018-02-23T00:00:00Z |

Update using Self Join Sql Server

I have a huge amount of data, and a sample of the table looks like this:
+-----------+------------+-----------+-----------+
| Unique_ID | Date       | RowNumber | Flag_Date |
+-----------+------------+-----------+-----------+
| 1         | 6/3/2014   | 1         | 6/3/2014  |
| 1         | 5/22/2015  | 2         | NULL      |
| 1         | 6/3/2015   | 3         | NULL      |
| 1         | 11/20/2015 | 4         | NULL      |
| 2         | 2/25/2014  | 1         | 2/25/2014 |
| 2         | 7/31/2014  | 2         | NULL      |
| 2         | 8/26/2014  | 3         | NULL      |
+-----------+------------+-----------+-----------+
Now I need to compare the Date in the 2nd row with the Flag_Date in the 1st row. If the difference is more than 180 days, the 2nd row's Flag_Date should be updated with that row's Date; otherwise it should be updated with the 1st row's Flag_Date. The same rule applies to all subsequent rows with the same Unique_ID.
update a
   set a.Flag_Date = case when DATEDIFF(dd, b.Flag_Date, a.[Date]) > 180
                          then a.[Date]
                          else b.Flag_Date end
from Table1 a
inner join Table1 b
   on a.RowNumber = b.RowNumber + 1 and a.Unique_ID = b.Unique_ID
When the above update query is executed once, only the second row under each Unique_ID gets updated, and the result looks like this:
+-----------+------------+-----------+------------+
| Unique_ID | Date       | RowNumber | Flag_Date  |
+-----------+------------+-----------+------------+
| 1         | 2014-06-03 | 1         | 2014-06-03 |
| 1         | 2015-05-22 | 2         | 2015-05-22 |
| 1         | 2015-06-03 | 3         | NULL       |
| 1         | 2015-11-20 | 4         | NULL       |
| 2         | 2014-02-25 | 1         | 2014-02-25 |
| 2         | 2014-07-31 | 2         | 2014-02-25 |
| 2         | 2014-08-26 | 3         | NULL       |
+-----------+------------+-----------+------------+
And I need to run it four times to achieve my desired result:
+-----------+------------+-----------+------------+
| Unique_ID | Date       | RowNumber | Flag_Date  |
+-----------+------------+-----------+------------+
| 1         | 2014-06-03 | 1         | 2014-06-03 |
| 1         | 2015-05-22 | 2         | 2015-05-22 |
| 1         | 2015-06-03 | 3         | 2015-05-22 |
| 1         | 2015-11-20 | 4         | 2015-11-20 |
| 2         | 2014-02-25 | 1         | 2014-02-25 |
| 2         | 2014-07-31 | 2         | 2014-02-25 |
| 2         | 2014-08-26 | 3         | 2014-08-26 |
+-----------+------------+-----------+------------+
Is there a way to run the update only once so that all the rows are updated?
Thank you!
If you are using SQL Server 2012+, then you can use lag():
with toupdate as (
     select t1.*,
            lag(flag_date) over (partition by unique_id order by rownumber) as prev_flag_date
     from table1 t1
)
update toupdate
   set Flag_Date = (case when DATEDIFF(day, prev_Flag_Date, toupdate.[Date]) > 180
                         then toupdate.[Date]
                         else prev_Flag_Date
                    end);
Both this version and your version can take advantage of an index on table1(unique_id, rownumber) or, better yet, table1(unique_id, rownumber, flag_date).
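For example, the suggested indexes could be created like this (a sketch; the index names are arbitrary):

-- key on the join/partition columns
CREATE INDEX ix_table1_uid_rownum
    ON table1 (unique_id, rownumber);
-- or, better, covering flag_date as well
CREATE INDEX ix_table1_uid_rownum_flag
    ON table1 (unique_id, rownumber, flag_date);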
EDIT:
In earlier versions, this might have better performance:
with toupdate as (
     select t1.*, t2.flag_date as prev_flag_date
     from table1 t1 outer apply
          (select top 1 t2.flag_date
           from table1 t2
           where t2.unique_id = t1.unique_id and
                 t2.rownumber < t1.rownumber
           order by t2.rownumber desc
          ) t2
)
update toupdate
   set Flag_Date = (case when DATEDIFF(day, prev_Flag_Date, toupdate.[Date]) > 180
                         then toupdate.[Date]
                         else prev_Flag_Date
                    end);
The CTE can make use of the same index -- and it is important to have the index. The reason for the better performance is that your join on row_number() cannot use an index on that field.