Fill NULL rows based on some mathematical operations - sql

I have a table A that contains id, report_day, and other columns, and a table B that contains id, report_day, and subscribers. I want to create a view with the columns id, report_day, and subscribers, so it's a simple join:
select a.id, a.report_day, b.subscribers from schema.a
left join schema.b on a.id = b.id
and a.report_day = b.report_day
Now I want to add a column subscribers_increment based on subscribers. For some days I have no stats, so the subscribers column is NULL. subscribers_increment is just subscribers(current_day) - subscribers(prev_day).
I read some articles and added the following expression:
case WHEN row_number() OVER (PARTITION BY b.id ORDER BY b.report_day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) = 1 THEN b.subscribers
else b.subscribers - COALESCE(last_value(b.subscribers) OVER (PARTITION BY b.id ORDER BY b.report_day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0::bigint::numeric)
END::integer AS subscribers_increment
And now I get the following result:
NULL is still NULL.
For example, the increment around 2021-04-07 is incorrect: it is an increment over 2 days. Can I divide the value from 2021-04-08 by the number of days (here 2) and write the same value for 2021-04-07 and 2021-04-08 (or at least for 2021-04-07, where it was NULL)? And apply the same logic to all days where subscribers is NULL?
So I need to follow these rules:
If I see a NULL value in the subscribers column, I should go to the next (future) NOT NULL day and take its value, then subtract from this future value the last NOT NULL value (from the past, ordering by date, so looking back). Then divide the result of the subtraction by the number of days and fill these rows of the subscribers_increment column with it.
Is it possible?
UPDATE:
For my data it should look like this:
UPDATE v2
After applying script:
UPDATE v3
The case (our increment) for 25.03-27.03 is still NULL.

The basic idea is:
1. Use lag() to get the previous subscribers and dates before joining. This assumes that the left join is the cause of all the NULL values.
2. Use a cumulative count in reverse to assign a grouping, so that each NULL is combined with the next non-NULL value in one group.
3. As a result of (2), the number of rows in a group (the NULLs plus the value that closes them) is the denominator.
4. As a result of (1), the difference between subscribers and prev_subscribers is the numerator.
The actual calculation requires more window functions and case logic.
Putting this together:
with t as (
select a.id, a.report_day, b.subscribers, b.prev_report_day, b.prev_subscribers,
count(b.subscribers) over (partition by a.id order by a.report_day desc) as grp
from first_table a left join
(select b.*,
lag(b.report_day) over (partition by id order by report_day) as prev_report_day,
lag(b.subscribers) over (partition by id order by report_day) as prev_subscribers
from second_table b
) b
on a.id = b.id and a.report_day = b.report_day
)
select t.*,
(case when t.subscribers is not null and t.prev_report_day = t.report_day - interval '1 day'
then t.subscribers - t.prev_subscribers
when t.subscribers is not null
then (t.subscribers - t.prev_subscribers) / count(*) over (partition by id, grp)
when t.subscribers is null
then (max(t.subscribers) over (partition by id, grp) - max(t.prev_subscribers) over (partition by id, grp)
) / count(*) over (partition by id, grp)
end)
from t;
Here is a db<>fiddle.
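To see the grouping and the division at work on a tiny dataset, here is a minimal sketch using Python's built-in sqlite3 module. It is an adaptation, not the query above verbatim: it assumes SQLite 3.25 or newer (for window functions), report_day is a made-up integer day number instead of a date (so "previous day" is report_day - 1 rather than an interval), and the four sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data: report_day is an integer day number, not a date.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE first_table (id INTEGER, report_day INTEGER);
    CREATE TABLE second_table (id INTEGER, report_day INTEGER, subscribers INTEGER);
    INSERT INTO first_table VALUES (1, 1), (1, 2), (1, 3), (1, 4);
    -- Day 2 has no row here, so the left join yields NULL for it.
    INSERT INTO second_table VALUES (1, 1, 100), (1, 3, 130), (1, 4, 140);
""")

query = """
WITH t AS (
    SELECT a.id, a.report_day, b.subscribers, b.prev_report_day, b.prev_subscribers,
           -- Reverse cumulative count: a NULL row shares grp with the next non-NULL row.
           COUNT(b.subscribers) OVER (PARTITION BY a.id ORDER BY a.report_day DESC) AS grp
    FROM first_table a
    LEFT JOIN (
        SELECT s.id, s.report_day, s.subscribers,
               LAG(s.report_day) OVER (PARTITION BY s.id ORDER BY s.report_day) AS prev_report_day,
               LAG(s.subscribers) OVER (PARTITION BY s.id ORDER BY s.report_day) AS prev_subscribers
        FROM second_table s
    ) b ON a.id = b.id AND a.report_day = b.report_day
)
SELECT t.report_day, t.grp,
       CASE
           WHEN t.subscribers IS NOT NULL AND t.prev_report_day = t.report_day - 1
               THEN t.subscribers - t.prev_subscribers
           WHEN t.subscribers IS NOT NULL
               THEN (t.subscribers - t.prev_subscribers) / COUNT(*) OVER (PARTITION BY t.id, t.grp)
           ELSE (MAX(t.subscribers) OVER (PARTITION BY t.id, t.grp)
                 - MAX(t.prev_subscribers) OVER (PARTITION BY t.id, t.grp))
                / COUNT(*) OVER (PARTITION BY t.id, t.grp)
       END AS subscribers_increment
FROM t
ORDER BY t.report_day
"""
# Note: "/" on two integers is integer division in SQLite.
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With this data, days 2 and 3 land in the same group, so the 30-subscriber jump between day 1 and day 3 is split into 15 for each of them. Day 1 has no previous value, so its increment stays NULL.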

Related

How can I obtain the minimum date for a value that is equal to the maximum date?

I am trying to find the earliest date on which the value equals the value at the maximum date. So far I'm able to obtain the value at its maximum date, but I can't seem to obtain the minimum date from which that value has remained the same.
Here is what I got so far and the query result:
select a.id, a.end_date, a.value
from database1 as a
inner join (
select id, max(end_date) as end_date
from database1
group by id
) as b on a.id = b.id and a.end_date = b.end_date
where value is not null
order by id, end_date
This returns the most recent record, but I'm looking for the record with the earliest end date on which the value is still the same as the most recent one.
In the following sample table, I'd like to obtain the row where id = 3, as it has the minimum end date from which the value remains the same:
id  end_date  value
--  --------  -----
1   02/12/22  5
2   02/13/22  5
3   02/14/22  4
4   02/15/22  4
Another option, which approaches the problem as described for the sample data shown: get the value at the maximum date, then the row with the minimum id that has that value:
select top(1) t.*
from (
select top(1) Max(end_date)d, [value]
from t
group by [value]
order by d desc
)d
join t on t.[value] = d.[value]
order by t.id;
DB<>Fiddle
I'm most likely overthinking this as a Gaps & Island problem, but you can do:
select min(end_date) as first_date
from (
select *, sum(inc) over (order by end_date desc) as grp
from (
select *,
case when value <> lag(value) over (order by end_date desc) then 1 else 0 end as inc
from t
) x
) y
where grp = 0
Result:
first_date
----------
2022-02-14
See running example at SQL Fiddle.
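with data as (
select *,
-- order by end_date so that rn = 1 is the earliest row for each value
row_number() over (partition by value order by end_date) as rn,
-- the default frame ends at the current row, so widen it to reach the final row
last_value(value) over (order by end_date
rows between unbounded preceding and unbounded following) as lv
from T
)
select * from data
where value = lv and rn = 1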
This isn't looking strictly for streaks of consecutive days. Any date that happened to have the same value as on the final date would be in contention.
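The difference between the streak-based and the value-based approach only shows up when the final value also occurs earlier, outside the last streak. Below is a small sketch with Python's sqlite3 module (assuming SQLite 3.25+ for window functions). The row with id 0 is a hypothetical addition to the question's sample data, and dates are stored as ISO strings so they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, end_date TEXT, value INTEGER);
    -- id 0 is a hypothetical extra row; ids 1-4 mirror the question's sample.
    INSERT INTO t VALUES
        (0, '2022-02-11', 4),
        (1, '2022-02-12', 5),
        (2, '2022-02-13', 5),
        (3, '2022-02-14', 4),
        (4, '2022-02-15', 4);
""")

# Gaps-and-islands: only the unbroken streak that ends at the latest date.
streak_start = conn.execute("""
    SELECT MIN(end_date)
    FROM (
        SELECT end_date, SUM(inc) OVER (ORDER BY end_date DESC) AS grp
        FROM (
            SELECT end_date,
                   CASE WHEN value <> LAG(value) OVER (ORDER BY end_date DESC)
                        THEN 1 ELSE 0 END AS inc
            FROM t
        ) x
    ) y
    WHERE grp = 0
""").fetchone()[0]

# Value match: earliest date carrying the final value, streak or not.
first_match = conn.execute("""
    SELECT end_date
    FROM (
        SELECT end_date, value,
               ROW_NUMBER() OVER (PARTITION BY value ORDER BY end_date) AS rn,
               LAST_VALUE(value) OVER (
                   ORDER BY end_date
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS lv
        FROM t
    ) d
    WHERE value = lv AND rn = 1
""").fetchone()[0]

print(streak_start, first_match)
```

Here the streak-based query stops at 2022-02-14, while the value-based one reaches back to the extra row on 2022-02-11. Which of the two is wanted depends on how "remains the same" is meant.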

SUM a specific column in next rows until a condition is true

Here is a table of articles. I want to store the sum of the mass column from the following rows in the sumNext column, based on a condition.
If the next row has the same floor (floorNo column) as the current row, then add up the mass of the following rows until the floor changes.
E.g.: row three has sumNext = 2. That is computed by adding the mass from row four and row five, because both rows have the same floor number as row three.
id       mass  symbol  floorNo  sumNext
-------  ----  ------  -------  -------
2891176  1     D       1        0
2891177  1     L       8        0
2891178  1     L       1        2
2891179  1     L       1        1
2891180  1             1        0
2891181  1             5        2
2891182  1             5        1
2891183  1             5        0
Here is the query that generates this table; I just want to add the sumNext column with the right values.
WITH items AS (SELECT
SP.id,
SP.mass,
SP.symbol,
SP.floorNo
FROM articles SP
ORDER BY
DECODE(SP.symbol,
'P',1,
'D',2,
'L',3,
4 ) asc)
SELECT CLS.*
FROM items CLS;
You could use the solution below, which:
uses a recursive common table expression (cte) to put all consecutive rows with the same FLOORNO value in the same group (new grp column);
then uses the analytic version of the SUM function to sum the MASS of all following rows per grp, as required.
with Items_RowsNumbered (id, mass, symbol, floorNo, rnb) as (
select ID, MASS, SYMBOL, FLOORNO
, row_number()over(
order by DECODE(symbol, 'P',1, 'D',2, 'L',3, 4 ) asc, ID )
/*
You need to add ID column (or any others columns that can identify each row uniquely)
in the "order by" clause to make the result deterministic
*/
from (Your source query)Items
)
, cte(id, mass, symbol, floorNo, rnb, grp) as (
select id, mass, symbol, floorNo, rnb, 1 grp
from Items_RowsNumbered
where rnb = 1
union all
select t.id, t.mass, t.symbol, t.floorNo, t.rnb
, case when t.floorNo = c.floorNo then c.grp else c.grp + 1 end grp
from Items_RowsNumbered t
join cte c on (c.rnb + 1 = t.rnb)
)
select
ID, MASS, SYMBOL, FLOORNO
/*, RNB, GRP*/
, nvl(
sum(MASS)over(
partition by grp
order by rnb
ROWS BETWEEN 1 FOLLOWING and UNBOUNDED FOLLOWING)
, 0
) sumNext
from cte
;
demo on db<>fiddle
This is a typical gaps-and-islands problem. You can use the difference between two ROW_NUMBER() values to determine the exact partitions, and then the SUM() analytic function, such as
WITH ii AS
(
SELECT i.*,
ROW_NUMBER() OVER (ORDER BY id DESC) AS rn2,
ROW_NUMBER() OVER (PARTITION BY floorNo ORDER BY id DESC) AS rn1
FROM items i
)
SELECT id,mass,symbol, floorNo,
SUM(mass) OVER (PARTITION BY floorNo, rn2-rn1 ORDER BY id DESC) - mass AS sumNext
FROM ii
ORDER BY id
Demo
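For comparison, a LAG()-based variant of the same grouping can be tried locally. This is a sketch under assumptions, not either answer's exact query: it uses Python's sqlite3 module (SQLite 3.25+), orders rows by id only (the symbol-based DECODE ordering is left out), and drops the symbol column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER, mass INTEGER, floorNo INTEGER);
    INSERT INTO items VALUES
        (2891176, 1, 1), (2891177, 1, 8), (2891178, 1, 1), (2891179, 1, 1),
        (2891180, 1, 1), (2891181, 1, 5), (2891182, 1, 5), (2891183, 1, 5);
""")

rows = conn.execute("""
    WITH flagged AS (
        -- is_new = 1 whenever the floor differs from the previous row's floor
        SELECT id, mass, floorNo,
               CASE WHEN floorNo = LAG(floorNo) OVER (ORDER BY id) THEN 0 ELSE 1 END AS is_new
        FROM items
    ),
    grouped AS (
        -- running sum of the flags numbers each run of consecutive equal floors
        SELECT id, mass, floorNo, SUM(is_new) OVER (ORDER BY id) AS grp
        FROM flagged
    )
    SELECT id, floorNo,
           COALESCE(SUM(mass) OVER (PARTITION BY grp ORDER BY id
                                    ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS sumNext
    FROM grouped
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

The frame that starts at 1 FOLLOWING is empty for the last row of each run, which is why the COALESCE to 0 is needed.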

BigQuery Standard SQL - Cumulative Count of (almost) Duplicated Rows

With the following data:
id  field  eventTime
--  -----  ---------
1   A      1
1   A      2
1   B      3
1   A      4
1   B      5
1   B      6
1   B      7
For visualisation purposes, I would like to turn it into the below. Consecutive occurrences of the same field value essentially get aggregated into one.
id  field  eventTime
--  -----  ---------
1   Ax2    1
1   B      3
1   A      4
1   Bx3    5
I will then use STRING_AGG() to turn it into "Ax2 > B > A > Bx3".
I've tried using ROW_NUMBER() to count the repeated instances, with the plan being to utilise the highest row number to modify the string in field, but if I partition on eventTime, there are no consecutive "duplicates", and if I don't partition on it then all rows with the same field value are counted - not just consecutive ones.
I thought about bringing in the previous field with LAG() for a comparison to reset the row count, but that only works for transitions from one field value to the other and is a problem if the same field is repeated consecutively.
I've been struggling with this to the point where I'm considering writing a script that just CASE WHENs up to a reasonable number of consecutive hits, but I've seen it get as high as 17 on a given day and really don't want to be doing that!
My other alternative will just be to enforce a maximum number of field values to help control this, but now I've started this problem I'd quite like to solve it without that, if at all possible.
Thanks!
Consider below
select id,
any_value(field) || if(count(1) = 1, '', 'x' || count(1)) field,
min(eventTime) eventTime
from (
select id, field, eventTime,
countif(ifnull(flag, true)) over(partition by id order by eventTime) grp
from (
select id, field, eventTime,
field != lag(field) over(partition by id order by eventTime) flag
from `project.dataset.table`
)
)
group by id, grp
# order by eventTime
If applied to sample data in your question - output is
Just use lag() to detect when the value of field changes. You can now do that with qualify:
select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field;
For your final step, you can use a subquery:
select id, string_agg(field, '->' order by eventtime)
from (select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field
) t
group by id;
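The flag-and-running-count idea from the first answer can be checked outside BigQuery. The sketch below is a SQLite translation run through Python's sqlite3 module (SQLite 3.25+ assumed): the table name events is made up, COUNTIF is replaced by a SUM over a 0/1 flag, and the final concatenation is done in Python rather than with STRING_AGG.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER, field TEXT, eventTime INTEGER);
    INSERT INTO events VALUES
        (1, 'A', 1), (1, 'A', 2), (1, 'B', 3), (1, 'A', 4),
        (1, 'B', 5), (1, 'B', 6), (1, 'B', 7);
""")

rows = conn.execute("""
    SELECT id,
           MIN(field) || CASE WHEN COUNT(*) = 1 THEN '' ELSE 'x' || COUNT(*) END AS label,
           MIN(eventTime) AS first_time
    FROM (
        -- running sum of the flags gives one group number per consecutive run
        SELECT id, field, eventTime,
               SUM(flag) OVER (PARTITION BY id ORDER BY eventTime) AS grp
        FROM (
            -- flag = 1 on the first row and wherever field changes
            SELECT id, field, eventTime,
                   CASE WHEN field = LAG(field) OVER (PARTITION BY id ORDER BY eventTime)
                        THEN 0 ELSE 1 END AS flag
            FROM events
        ) AS f
    ) AS g
    GROUP BY id, grp
    ORDER BY id, first_time
""").fetchall()

# Only one id in the sample, so all labels can be joined into one string.
summary = " > ".join(label for _id, label, _first_time in rows)
print(summary)
```

With more than one id, the join would have to be done per id.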

2 different values for the same key

My table:
booking_id
arrivel_time
departure_time
date
In some cases I have 2 records for the same key (booking_id): the first has NULL in arrivel_time and departure_time, and the second has values (date and time) in both arrivel_time and departure_time, or only in arrivel_time.
When that happens, I would like to select only the record that has values for that booking_id.
I am struggling with how to select that. Could you explain how to achieve this?
You can see an example of my desired results here
One method uses row_number():
select t.*
from (select t.*,
row_number() over (partition by booking_id order by arrival_time nulls last) as seqnum
from t
) t
where seqnum = 1;
In older versions of Hive, you can use:
select t.*
from (select t.*,
row_number() over (partition by booking_id
order by (case when arrival_time is not null then 1 else 2 end)
) as seqnum
from t
) t
where seqnum = 1;
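As a quick check of the CASE-based ordering, here is a sketch using Python's sqlite3 module (SQLite 3.25+ assumed). The table name bookings and its three rows are hypothetical, and the date column is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings (booking_id INTEGER, arrival_time TEXT, departure_time TEXT);
    INSERT INTO bookings VALUES
        (1, NULL, NULL),
        (1, '2021-04-05 08:00', '2021-04-05 10:00'),
        (2, NULL, NULL);
""")

rows = conn.execute("""
    SELECT booking_id, arrival_time, departure_time
    FROM (
        SELECT b.*,
               -- rows with an arrival_time sort first within each booking_id
               ROW_NUMBER() OVER (
                   PARTITION BY booking_id
                   ORDER BY CASE WHEN arrival_time IS NOT NULL THEN 1 ELSE 2 END
               ) AS seqnum
        FROM bookings b
    ) AS ranked
    WHERE seqnum = 1
    ORDER BY booking_id
""").fetchall()

for row in rows:
    print(row)
```

Booking 1 keeps only its populated record, while booking 2, which has nothing but the NULL record, is still returned.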
You can use MAX() together with DISTINCT to get the desired result:
SELECT DISTINCT
booking_id,
MAX(arrivel_time),
MAX(departure_time),
date
FROM MyTable
GROUP BY booking_id, date
Note that MAX() returns the latest value in each group. So if one record has 04/05/2021 08:00 as its arrival time and another record with the same booking id has 04/05/2021 09:00, the query ignores the first and takes the second.
The query above therefore gives the intended result only if at least one of the two records for a booking id has NULL times.
DISTINCT then consolidates rows that have exactly the same values into one row.

SQL - Window function to get values from previous row where value is not null

I am using Exasol. In other DBMSs it was possible to use analytic functions such as LAST_VALUE() and specify a condition after the ORDER BY clause within OVER(), like:
select ...
LAST_VALUE(customer)
OVER (PARTITION BY ID ORDER BY date_x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) as the_last
Unfortunately I get the following error:
ERROR: [0A000] Feature not supported: windowing clause (Session:
1606983630649130920)
The same does not happen if I use CURRENT ROW instead of 1 PRECEDING.
Basically, what I want is the last value according to the ORDER BY that is NOT the current row. In this example it would be the $customer of the previous row.
I know that I could use the LAG(customer,1) OVER ( ...) but the problem is that I want the previous customer that is NOT null, so the offset is not always 1...
How can I do that?
Many thanks!
Does this work?
select lag(customer) over (partition by id
order by (case when customer is not null then 1 else 0 end),
date
)
You can do this with two steps:
select t.*,
max(customer) over (partition by id, max_date) as max_customer
from (select t.*,
max(case when customer is not null then date end) over (partition by id order by date) as max_date
from t
) t;
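To see what the two-step query returns, here is a sketch run in SQLite through Python's sqlite3 module (SQLite 3.25+ assumed, not Exasol). The column is named date_x as in the question, with made-up integer values, and the customers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, date_x INTEGER, customer TEXT);
    INSERT INTO t VALUES
        (1, 1, 'alice'), (1, 2, NULL), (1, 3, NULL), (1, 4, 'bob'), (1, 5, NULL);
""")

rows = conn.execute("""
    SELECT date_x, customer,
           -- step 2: every row sharing a max_date gets the customer found at that date
           MAX(customer) OVER (PARTITION BY id, max_date) AS max_customer
    FROM (
        -- step 1: latest date so far (current row included) with a non-NULL customer
        SELECT t.*,
               MAX(CASE WHEN customer IS NOT NULL THEN date_x END)
                   OVER (PARTITION BY id ORDER BY date_x) AS max_date
        FROM t
    ) AS s
    ORDER BY date_x
""").fetchall()

for row in rows:
    print(row)
```

Rows with a NULL customer are filled from the most recent non-NULL row. A row that already has a customer gets its own value back, so if the current row must be excluded even then, the result needs a further step.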