Oracle - Dealing with NULLS in DENSE_RANK - sql

I have this table structure
+----------------+----------------+
| DATE | VALUE |
|----------------|----------------|
| 2015-01-01 | 5 |
| 2015-01-02 | 4 |
| 2015-01-03 | NULL |
| 2015-02-10 | 2 |
| 2015-02-25 | 1 |
+----------------+----------------+
I'm trying to get the most recent non null value within each month. In this case it would be this:
+----------------+----------------+
| MONTH | VALUE |
|----------------|----------------|
| 2015-01 | 4 |
| 2015-02 | 1 |
+----------------+----------------+
I've tried DENSE_RANK but i'm having a difficult time dealing with the null values.
Using:
SELECT TO_CHAR(date,'YYYY-MM'),
MAX(value) KEEP (DENSE_RANK FIRST ORDER BY date DESC)
FROM mytable
GROUP BY TO_CHAR(date,'YYYY-MM')
I'm getting
+----------------+----------------+
| MONTH | VALUE |
|----------------|----------------|
| 2015-01 | NULL |
| 2015-02 | 1 |
+----------------+----------------+
Obviously I'm doing something wrong.
Can you help me figure this out?
Thanks in advance.
EDIT: Unfortunately, adding the condition "WHERE value IS NOT NULL" can't be applied to my situation.

Unfortunately, MAX() KEEP doesn't have an IGNORE NULLS clause, as far as I know. But LAST_VALUE does. So, how about this:
SELECT mth,
MAX (last_val)
FROM (SELECT TO_CHAR (d, 'YYYY-MM') mth,
d,
n,
LAST_VALUE (
n IGNORE NULLS)
OVER (PARTITION BY TO_CHAR (d, 'YYYY-MM')
ORDER BY d ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
last_val
FROM matt_test)
GROUP BY mth

This solution eliminates null values and uses row_number to get the latest date and the corresponding value for each month.
select myr, value
from (
SELECT date, value,TO_CHAR(date,'YYYY-MM') myr,
row_number() over(partition by TO_CHAR(date,'YYYY-MM') order by date desc) rn
FROM mytable
where value is not null) t
where rn = 1

I have a personal averseness to construction like SELECT ... FROM (SELECT ... FROM ...), so this is my proposal:
SELECT DISTINCT
TRUNC(THE_DATE, 'MM') AS MONTH,
FIRST_VALUE(THE_VALUE IGNORE NULLS) OVER (PARTITION BY TRUNC(THE_DATE, 'MM') ORDER BY THE_VALUE) AS VALUE
FROM MY_TABLE;

I could not use LAST_VALUE because of Group By and many other reasons.
So this worked for me, example line:
SELECT
MAX(the_value) KEEP (dense_rank LAST ORDER BY (CASE WHEN the_value IS NOT NULL THEN 1 END) NULLS FIRST, the_date) the_value
FROM ...
or like that:
SELECT
MAX(the_value) KEEP (dense_rank FIRST ORDER BY (CASE WHEN the_value IS NOT NULL THEN 1 END) NULLS LAST, the_date DESC) the_value
FROM ...

Related

Get Max And Min dates for consecutive values in T-SQL

I have a log table like below and want to simplfy it by getting min start date and max end date for consecutive Status values for each Id. I tried many window function combinations but no luck.
This is what I have:
This is what want to see:
This is a typical gaps-and-islands problem. You want to aggregate groups of consecutive records that have the same Id and Status.
No need for recursion, here is one way to solve it using window functions:
select
Id,
Status,
min(StartDate) StartDate,
max(EndDate) EndDate
from (
select
t.*,
row_number() over(partition by id order by StartDate) rn1,
row_number() over(partition by id, status order by StartDate) rn2
from mytable t
) t
group by
Id,
Status,
rn1 - rn2
order by Id, min(StartDate)
The query works by ranking records over two different partitions (by Id, and by Id and Status). The difference between the ranks gives you the group each record belongs to. You can run the subquery independently to see what it returns and understand the logic.
Demo on DB Fiddle:
Id | Status | StartDate | EndDate
-: | :----- | :------------------ | :------------------
1 | B | 07/02/2019 00:00:00 | 18/02/2019 00:00:00
1 | C | 18/02/2019 00:00:00 | 10/03/2019 00:00:00
1 | B | 10/03/2019 00:00:00 | 01/04/2019 00:00:00
2 | A | 05/02/2019 00:00:00 | 22/04/2019 00:00:00
2 | D | 22/04/2019 00:00:00 | 05/05/2019 00:00:00
2 | A | 05/05/2019 00:00:00 | 30/06/2019 00:00:00
Try the following query. First order the data by StartDate and generate a sequence (rid). Then you the recursive cte to get the first row (rid=1) for each group (id,status), and recursively get the next row and compare the start/end date.
;WITH cte_r(id,[Status],StartDate,EndDate,rid)
AS
(
SELECT id,[Status],StartDate,EndDate, ROW_NUMBER() OVER(PARTITION BY Id,[Status] ORDER BY StartDate) AS rid
FROM log_table
),
cte_range(id,[Status],StartDate,EndDate,rid)
AS
(
SELECT id,[Status],StartDate,EndDate,rid
FROM cte_r
WHERE rid=1
UNION ALL
SELECT p.id, p.[Status], CASE WHEN c.StartDate<p.EndDate THEN p.StartDate ELSE c.StartDate END AS StartDate, c.EndDate,c.rid
FROM cte_range p
INNER JOIN cte_r c
ON p.id=c.id
AND p.[Status]=c.[Status]
AND p.rid+1=c.rid
)
SELECT id,[Status],StartDate,MAX(EndDate) AS EndDate FROM cte_range GROUP BY id,StartDate ;

Running total sum over date presto SQL

I'm trying to calculate the cumulative sum of columns t and s over a date from my sample data below, using Presto SQL.
Date | T | S
1/2/19 | 2 | 5
2/1/19 | 5 | 1
3/1/19 | 1 | 1
I would like to get
Date | T | S | cum_T | cum_S
1/2/19 | 2 | 5 | 2 | 5
2/1/19 | 5 | 1 | 7 | 6
3/1/19 | 1 | 1 | 8 | 7
However when I run the below query using Presto SQL I am receiving an unexpected error message, telling me to put columns T and S into the group by section of my query.
Is this expected? When I remove the group by from my query it runs without error, but produces duplicate date rows. +
select
date_trunc('day',tb1.date),
sum(tb1.S) over (partition by date_trunc('day',tb1.date) order by date_trunc('day',tb1.date) rows unbounded preceding ) as cum_S,
sum(tb1.T) over (partition by date_trunc('day',tb1.date) order by date_trunc('day',tb1.date) rows unbounded preceding) as cum_T
from esi_dpd_bi_esds_prst.points_tb1_use_dedup_18months_vw tb1
where
tb1.reason_id not in (45,264,418,983,990,997,999,1574)
and tb1.group_id not in (22)
and tb1.point_status not in (3)
and tb1.date between cast(DATE '2019-01-01' as date) and cast( DATE '2019-01-03' as date)
group by
1
order by date_trunc('day',tb1.date) desc
Error looks like this:
Error: line 3:1: '"sum"(tb1.S) OVER (PARTITION BY "date_trunc"('day', tb1.tb1) ORDER BY "date_trunc"('day', tb1.tb1) ASC ROWS UNBOUNDED PRECEDING)' must be an aggregate expression or appear in GROUP BY clause.
You have an aggregation query and you want to mix the aggregations with window functions. The correct syntax is:
select date_trunc('day', tb1.date),
sum(tbl1.S) as S,
sum(tbl1.T) as T,
sum(sum(tb1.S)) over (order by date_trunc('day', tb1.date) rows unbounded preceding ) as cum_S,
sum(sum(tb1.T)) over (order by date_trunc('day', tb1.date) rows unbounded preceding) as cum_T
from esi_dpd_bi_esds_prst.points_tb1_use_dedup_18months_vw tb1
where tb1.reason_id not in (45, 264, 418, 983, 990, 997, 999, 1574) and
tb1.group_id not in (22) and
tb1.point_status not in (3) and
tb1.date between cast(DATE '2019-01-01' as date) and cast( DATE '2019-01-03' as date)
group by 1
order by date_trunc('day', tb1.date) desc ;
That is, the window function is running after the aggregation and needs to process the aggregated value.

Redshift window function for change in column

I have a redshift table with amongst other things an id and plan_type column and would like a window function group clause where the plan_type changes so that if this is the data for example:
| user_id | plan_type | created |
|---------|-----------|------------|
| 1 | A | 2019-01-01 |
| 1 | A | 2019-01-02 |
| 1 | B | 2019-01-05 |
| 2 | A | 2019-01-01 |
| 2 | A | 2-10-01-05 |
I would like a result like this where I get the first date that the plan_type was "new":
| user_id | plan_type | created |
|---------|-----------|------------|
| 1 | A | 2019-01-01 |
| 1 | B | 2019-01-05 |
| 2 | A | 2019-01-01 |
Is this possible with window functions?
EDIT
Since I have some garbage in the data where plan_type can sometimes be null and the accepted solution does not include the first row (since I can't have the OR is not null I had to make some modifications. Hopefully his will help other people if they have similar issues. The final query is as follows:
SELECT * FROM
(
SELECT
user_id,
plan_type,
created_at,
lag(plan_type) OVER (PARTITION by user_id ORDER BY created_at) as prev_plan,
row_number() OVER (PARTITION by user_id ORDER BY created_at) as rownum
FROM tablename
WHERE plan_type IS NOT NULL
) userHistory
WHERE
userHistory.plan_type <> userHistory.prev_plan
OR userHistory.rownum = 1
ORDER BY created_at;
The plan_type IS NOT NULL filters out bad data at the source table and the outer where clause gets any changes OR the first row of data that would not be included otherwise.
ALSO BE CAREFUL about the created_at timestamp if you are working of your prev_plan field since it would of course give you the time of the new value!!!
This is a gaps-and-islands problem. I think lag() is the simplest approach:
select user_id, plan_type, created
from (select t.*,
lag(plan_type) over (partition by user_id order by created) as prev_plan_type
from t
) t
where prev_plan_type is null or prev_plan_type <> plan_type;
This assumes that plan types can move back to another value and you want each one.
If not, just use aggregation:
select user_id, plan_type, min(created)
from t
group by user_id, plan_type;
use row_number() window function
select * from
(select *,row_number()over(partition by user_id,plan_type order by created) rn
) a where a.rn=1
use lag()
select * from
(
select user_id, plant_type, lag(plan_type) over (partition by user_id order by created) as changes, created
from tablename
)A where plan_type<>changes and changes is not null

Query with conditional lag statement

I'm trying to find the previous value of a column where the row meets some criteria. Consider the table:
| user_id | session_id | time | referrer |
|---------|------------|------------|------------|
| 1 | 1 | 2018-01-01 | [NULL] |
| 1 | 2 | 2018-02-01 | google.com |
| 1 | 3 | 2018-03-01 | google.com |
I want to find, for each session, the previous value of session_id where the referrer is NULL. So, for the second AND third rows, the value of parent_session_id should be 1.
However, by just using lag(session_id) over (partition by user_id order by time), I will get parent_session_id=2 for the 3rd row.
I suspect it can be done using a combination of window functions, but I just can't figure it out.
I'd use last_value() in combination with if():
WITH t AS (SELECT * FROM UNNEST([
struct<user_id int64, session_id int64, time date, referrer string>(1, 1, date('2018-01-01'), NULL),
(1,2,date('2018-02-01'), 'google.com'),
(1,3,date('2018-03-01'), 'google.com')
]) )
SELECT
*,
last_value(IF(referrer is null, session_id, NULL) ignore nulls)
over (partition by user_id order by time rows between unbounded preceding and 1 preceding) lastNullrefSession
FROM t
You could even do this via a correlated subquery:
SELECT
session_id,
(SELECT MAX(t2.session_id) FROM yourTable t2
WHERE t2.referrer IS NULL AND t2.session_id < t1.session_id) prev_session_id
FROM yourTable t1
ORDER BY
session_id;
Here is an approach using analytic functions which might work:
WITH cte AS (
SELECT *,
SUM(CASE WHEN referrer IS NULL THEN 1 ELSE 0 END)
OVER (ORDER BY session_id) cnt
FROM yourTable
)
SELECT
session_id,
CASE WHEN cnt = 0
THEN NULL
ELSE MIN(session_id) OVER (PARTITION BY cnt) END prev_session_id
FROM cte
ORDER BY
session_id;

get the id based on condition in group by

I'm trying to create a sql query to merge rows where there are equal dates. the idea is to do this based on the highest amount of hours, so that i in the end gets the corresponding id for each date with the highest amount of hours. i've been trying to do with a simple group by, but does not seem to work, since i CANT just put a aggregate function on id column, since it should be based the hours condition
+------+-------+--------------------------------------+
| id | date | hours |
+------+-------+--------------------------------------+
| 1 | 2012-01-01 | 37 |
| 2 | 2012-01-01 | 10 |
| 3 | 2012-01-01 | 5 |
| 4 | 2012-01-02 | 37 |
+------+-------+--------------------------------------+
desired result
+------+-------+--------------------------------------+
| id | date | hours |
+------+-------+--------------------------------------+
| 1 | 2012-01-01 | 37 |
| 4 | 2012-01-02 | 37 |
+------+-------+--------------------------------------+
If you want exactly one row -- even if there are ties -- then use row_number():
select t.*
from (select t.*, row_number() over (partition by date order by hours desc) as seqnum
from t
) t
where seqnum = 1;
Ironically, both Postgres and Oracle (the original tags) have what I would consider to be better ways of doing this, but they are quite different.
Postgres:
select distinct on (date) t.*
from t
order by date, hours desc;
Oracle:
select date, max(hours) as hours,
max(id) keep (dense_rank first over order by hours desc) as id
from t
group by date;
Here's one approach using row_number:
select id, dt, hours
from (
select id, dt, hours, row_number() over (partition by dt order by hours desc) rn
from yourtable
) t
where rn = 1
You can use subquery with correlation approach :
select t.*
from table t
where id = (select t1.id
from table t1
where t1.date = t.date
order by t1.hours desc
limit 1);
In Oracle you can use fetch first 1 row only in subquery instead of LIMIT clause.