How can I backfill null values in BigQuery? - sql

I'm trying to perform a null backfill in BigQuery, similar to bfill on a pandas DataFrame. From the docs, the last_value function seems like a good choice. However, it leaves nulls in every row before the first non-null value (quite reasonable, given the name of the function). How can I backfill those nulls? Or do I just have to drop them?
This is a sample query:
select table_path.*, last_value(sn_6 ignore nulls) over (order by time)
from (select 1 as time, null as sn_6 union all
select 2, 1 union all
select 3, null union all
select 4, null union all
select 5, null union all
select 6, 0 union all
select 7, null union all
select 8, null
) table_path;
Actual output:
time sn_6 f0_
1 null null
2 1 1
3 null 1
4 null 1
5 null 1
6 0 0
7 null 0
8 null 0
Desired output:
time sn_6 f0_
1 null 1 <---Back fill all the gaps!
2 1 1
3 null 1
4 null 1
5 null 1
6 0 0
7 null 0
8 null 0
The real data has a timestamp column followed by 6 float columns and there are null values everywhere.

If the intention is to forward-fill first and then back-fill the leading nulls that remain, you can use the first_value function to look forward for the first non-null value and coalesce the two results:
select table_path.*,
coalesce(
last_value(sn_6 ignore nulls) over (order by time),
first_value(sn_6 ignore nulls) over (order by time RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
)
from (select 1 as time, null as sn_6 union all
select 2, 1 union all
select 3, null union all
select 4, null union all
select 5, null union all
select 6, 0 union all
select 7, null union all
select 8, null
) table_path;
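To make the logic of that coalesce explicit, here is a minimal Python sketch of the same two passes over the sample column. It only illustrates the idea and is not BigQuery code; the helper names forward_fill, backward_fill and fill_both are made up for this example.

```python
def forward_fill(values):
    """Carry the last non-null value forward, like last_value(... ignore nulls)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Look ahead for the next non-null value, like first_value(... ignore nulls)
    over a frame from the current row to the end."""
    return forward_fill(values[::-1])[::-1]


def fill_both(values):
    """Coalesce: prefer the forward-filled value, fall back to the back-filled one."""
    ff = forward_fill(values)
    bf = backward_fill(values)
    # Compare against None explicitly so that a legitimate 0 is kept.
    return [f if f is not None else b for f, b in zip(ff, bf)]


sn_6 = [None, 1, None, None, None, 0, None, None]
result = fill_both(sn_6)
print(result)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Only the leading null is taken from the backward pass; every other row already has a forward-filled value.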

Related

Shifting values in an Oracle table [duplicate]

I have a table like this:
Key values
1 null
2 value1
3 null
4 null
5 null
6 value2
7 null
8 null
I need a table where each value is shifted down if (and only if) the following cell is null. When I find a different value I keep it, and if I then find another null cell I shift the new value down.
Is there a query that does this? Thank you.
I want to obtain a table like this:
Key values
1 null
2 value1
3 value1
4 value1
5 value1
6 value2
7 value2
8 value2
See LAST_VALUE() with IGNORE NULLS: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/LAST_VALUE.html
with data(Key, val) as (
select 1, null from dual union all
select 2, 'value1' from dual union all
select 3, null from dual union all
select 4, null from dual union all
select 5, null from dual union all
select 6, 'value2' from dual union all
select 7, null from dual union all
select 8, null from dual -- union all
)
select key, val, last_value(val) ignore nulls over(order by key )
from data
;
1 null null
2 value1 value1
3 null value1
4 null value1
5 null value1
6 value2 value2
7 null value2
8 null value2

Need a SQL select statement to return rows that have the same id in one column and distinct value in another column

I have a table that contains a group number column and a data column:
GROUP DataColumn
1 NULL
1 NULL
1 "hello"
1 NULL
2 "bye"
2 "sorry"
3 NULL
3 NULL
3 NULL
I want to return the string in the DataColumn as long as all rows in that group contain a string (no row is null).
If any row in the group is NULL then I'd like to return all rows in that group with NULL in the DataColumn.
My desired output would be:
GROUP DataColumn
1 NULL
1 NULL
1 NULL (swap "hello" to null since the other values for group 1 are null)
1 NULL
2 "bye"
2 "sorry"
3 NULL
3 NULL
3 NULL
Use COUNT() window function to count all the rows of each GROUP and compare the result to the number of the rows with non-null values:
SELECT "GROUP",
CASE
WHEN COUNT(*) OVER (PARTITION BY "GROUP") =
COUNT("DataColumn") OVER (PARTITION BY "GROUP")
THEN "DataColumn"
END "DataColumn"
FROM tablename;
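As a rough way to try this comparison locally, the sketch below runs the same CASE expression against an in-memory SQLite database from Python. It assumes a SQLite build with window-function support (3.25 or later). The column names grp and data_column and the extra id column are stand-ins chosen for this example, not the asker's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The id column exists only to give the output a stable order.
conn.execute(
    "CREATE TABLE tablename (id INTEGER PRIMARY KEY, grp INTEGER, data_column TEXT)"
)
conn.executemany(
    "INSERT INTO tablename (grp, data_column) VALUES (?, ?)",
    [
        (1, None), (1, None), (1, "hello"), (1, None),
        (2, "bye"), (2, "sorry"),
        (3, None), (3, None), (3, None),
    ],
)

# COUNT(*) counts every row in the group; COUNT(data_column) skips NULLs.
# The two are equal only when the group contains no NULL at all.
rows = conn.execute(
    """
    SELECT grp,
           CASE
             WHEN COUNT(*) OVER (PARTITION BY grp) =
                  COUNT(data_column) OVER (PARTITION BY grp)
             THEN data_column
           END AS data_column
    FROM tablename
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

Group 1 comes back as all NULL because one of its counts is 4 and the other is 1, while group 2 keeps its strings.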
Here's one option: check whether a group contains both null and non-null values (that is, both counts are greater than zero); if so, return null for every row in that group.
Sample data:
set null NULL
with test (cgroup, datacolumn) as
(select 1, null from dual union all
select 1, null from dual union all
select 1, 'hello' from dual union all
select 1, null from dual union all
select 2, 'bye' from dual union all
select 2, 'sorry' from dual union all
select 3, null from dual union all
select 3, null from dual union all
select 3, null from dual
),
-- Query begins here:
temp as
(select cgroup, datacolumn,
sum(case when datacolumn is null then 1 else 0 end) over (partition by cgroup) cnt_null,
sum(case when datacolumn is null then 0 else 1 end) over (partition by cgroup) cnt_not_null
from test
)
select cgroup,
case when cnt_null > 0 and cnt_not_null > 0 then null
else datacolumn
end as datacolumn
from temp;
CGROUP DATACOLUMN
---------- ---------------
1 NULL
1 NULL
1 NULL
1 NULL
2 bye
2 sorry
3 NULL
3 NULL
3 NULL
9 rows selected.

Oracle how to return rows based on condition?

I am trying to return id and name based on the flag column. If an id has rows with flag = 1, my query should return only those rows. If it has no row with flag = 1, it should return the rows with flag = 0. What is the best way to do this? Here is some sample data:
id name flag
5 aa 1
5 bb 0
6 cc 1
10 dd 0
11 ee 1
11 ee 0
Expected output:
id name flag
5 aa 1
6 cc 1
10 dd 0
11 ee 1
Assuming the flag column contains only 0 or 1, select the rows whose flag equals the maximum flag value for their id:
select id, name, flag
from (
select id, name, flag, max(flag) over (partition by id) as m
from your_table
) x
where x.flag = x.m
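The filter is easy to check by hand. The following Python sketch applies the same rule to the sample rows from the question: compute the maximum flag per id, then keep only the rows that match it. It is an illustration of the logic, not a replacement for the SQL, and the variable names are made up for this example.

```python
rows = [
    (5, "aa", 1),
    (5, "bb", 0),
    (6, "cc", 1),
    (10, "dd", 0),
    (11, "ee", 1),
    (11, "ee", 0),
]

# Equivalent of max(flag) over (partition by id).
max_flag = {}
for id_, _name, flag in rows:
    max_flag[id_] = max(flag, max_flag.get(id_, flag))

# Equivalent of: where x.flag = x.m
selected = [row for row in rows if row[2] == max_flag[row[0]]]

for row in selected:
    print(row)
```

An id with only flag = 0 rows has a maximum of 0, so those rows survive the filter, which is the fallback the question asks for.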
You can use the keep (dense_rank last) aggregate function to achieve that, as below.
with t (id, name, flag) as (
select 5 , 'aa', 1 from dual union all
select 5 , 'bb', 0 from dual union all
select 6 , 'cc', 1 from dual union all
select 10, 'dd', 0 from dual union all
select 11, 'ee', 1 from dual union all
select 11, 'ee', 0 from dual
)
select id
, max(name)keep(dense_rank last order by id, flag) name
, max(flag)keep(dense_rank last order by id, flag) flag
from t
where flag in (0, 1)
group by id
order by id
;

Update with lag

I would like to set the ACTIVE value of a table as follows:
If FLAG=E => ACTIVE=1 and for any subsequent FLAG values, until FLAG=H
If FLAG=H => ACTIVE=0 and for any subsequent FLAG values, until FLAG=E
and so on and so forth.
Example
ID | FLAG | ACTIVE
---+------+-------
1 | E | 1
2 | V | 1
3 | H | 0
4 | V | 0
5 | E | 1
6 | S | 1
7 | V | 1
8 | D | 1
9 | H | 0
The values are ordered by date.
For simplicity, I added an ID column to show the row order.
Question
What would the SQL update statement be?
Note:
The business rule can also be expressed as follows:
If, for a given row, the count of preceding E minus the count of preceding H is 1, then ACTIVE is 1 for this row, 0 otherwise.
You can get the active value with the last_value() analytic function:
select id, flag,
last_value(case when flag = 'E' then 1 when flag = 'H' then 0 end) ignore nulls
over (order by id) as active
from your_table;
As a demo:
create table your_table (id, flag) as
select 1, 'E' from dual
union all select 2, 'V' from dual
union all select 3, 'H' from dual
union all select 4, 'V' from dual
union all select 5, 'E' from dual
union all select 6, 'S' from dual
union all select 7, 'V' from dual
union all select 8, 'D' from dual
union all select 9, 'H' from dual;
select id, flag,
last_value(case when flag = 'E' then 1 when flag = 'H' then 0 end) ignore nulls
over (order by id) as active
from your_table;
ID F ACTIVE
---------- - ----------
1 E 1
2 V 1
3 H 0
4 V 0
5 E 1
6 S 1
7 V 1
8 D 1
9 H 0
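The carry-forward behaviour of last_value(...) ignore nulls with that case expression can be sketched in a few lines of Python. This is only an illustration of the rule; the function name active_flags is hypothetical.

```python
def active_flags(flags):
    """Return the ACTIVE value for each flag, in order.

    'E' switches the state to 1, 'H' switches it to 0, and any other
    flag inherits the most recent state. Rows before the first E or H
    stay None, which matches a NULL result from the SQL.
    """
    active, current = [], None
    for flag in flags:
        if flag == "E":
            current = 1
        elif flag == "H":
            current = 0
        active.append(current)
    return active


flags = ["E", "V", "H", "V", "E", "S", "V", "D", "H"]
result = active_flags(flags)
print(result)  # [1, 1, 0, 0, 1, 1, 1, 1, 0]
```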
You can use the same thing for an update, though a merge is probably going to be simpler:
alter table your_table add active number;
merge into your_table
using (
select id,
last_value(case when flag = 'E' then 1 when flag = 'H' then 0 end) ignore nulls
over (order by id) as active
from your_table
) tmp
on (your_table.id = tmp.id)
when matched then update set active = tmp.active;
9 rows merged.
select * from your_table;
ID F ACTIVE
---------- - ----------
1 E 1
2 V 1
3 H 0
4 V 0
5 E 1
6 S 1
7 V 1
8 D 1
9 H 0
You said your real data is actually ordered by a date, and I guess there are multiple flags for each of multiple IDs, so something like this is probably more realistic:
create table your_table (id, flag_time, flag) as
select 1, timestamp '2018-07-04 00:00:00', 'E' from dual
union all select 1, timestamp '2018-07-04 00:00:01', 'V' from dual
union all select 1, timestamp '2018-07-04 00:00:02', 'H' from dual
union all select 1, timestamp '2018-07-04 00:00:03', 'V' from dual
union all select 1, timestamp '2018-07-04 00:00:04', 'E' from dual
union all select 1, timestamp '2018-07-04 00:00:05', 'S' from dual
union all select 1, timestamp '2018-07-04 00:00:06', 'V' from dual
union all select 1, timestamp '2018-07-04 00:00:07', 'D' from dual
union all select 1, timestamp '2018-07-04 00:00:08', 'H' from dual;
alter table your_table add active number;
merge into your_table
using (
select id, flag_time,
last_value(case when flag = 'E' then 1 when flag = 'H' then 0 end) ignore nulls
over (partition by id order by flag_time) as active
from your_table
) tmp
on (your_table.id = tmp.id and your_table.flag_time = tmp.flag_time)
when matched then update set active = tmp.active;
select * from your_table;
ID FLAG_TIME F ACTIVE
---------- ----------------------- - ----------
1 2018-07-04 00:00:00.000 E 1
1 2018-07-04 00:00:01.000 V 1
1 2018-07-04 00:00:02.000 H 0
1 2018-07-04 00:00:03.000 V 0
1 2018-07-04 00:00:04.000 E 1
1 2018-07-04 00:00:05.000 S 1
1 2018-07-04 00:00:06.000 V 1
1 2018-07-04 00:00:07.000 D 1
1 2018-07-04 00:00:08.000 H 0
The main difference is the partition by id and changing the ordering to use flag_time - or whatever your real columns are called.
There is potentially an issue if two flags can share a time; with a timestamp column that's hopefully very unlikely, but with a date the precision of the column may allow it. There isn't much you can do about that though, except maybe get into some logic to break ties by assuming flags should arrive in a certain order, and give them a weighting based on that. Rather off-topic though.

counting most recent consecutive rows with like data using tabibitosan

My project is using an Oracle SQL database. I have a historical table that appends task status on a weekly basis, and am attempting to query the number of weeks a task that is currently off track has been off track. Here's an example excerpt from my source historical table:
ID WEEK ON_TRACK
1 1 N
1 2 Y
1 3 N
1 4 N
1 5 N
2 1 N
2 2 N
2 3 Y
2 4 Y
2 5 N
3 1 N
3 2 N
3 3 Y
3 4 Y
3 5 Y
I'm looking to return the count of consecutive "N" values in ON_TRACK starting backwards from the latest append. For the above example data, I'd like the query to return:
ID WKS_OFF_TRACK
1 3
2 1
3 0
I've done some research, and it looks like the Tabibitosan method is the most logical approach. I've found ample examples that give the maximum run of consecutive values matching one criterion, but I'm having trouble adapting them to return the most recent run of consecutive values matching two criteria (ID and ON_TRACK).
Here's what I have so far:
--this step creates a temp table with unique IDs for each weekly append to the historical table, and a 1 (if ON_TRACK = N) or 0 (if ON_TRACK = Y). This results in the expected info.
WITH HIST_TBL AS (
SELECT DISTINCT(ID),
CASE ON_TRACK
WHEN 'N' THEN 1
ELSE 0
END AS OFF_TRACK,
WEEK
FROM SOURCE_HISTORICAL_TBL
ORDER BY ID,WEEK DESC)
-- end of temp table
-- This is where I'm struggling: I want one line per project number, with the sum of the latest run of 1s (weeks the task has been off track) until a 0 is reached.
SELECT ID,
SUM(OFF_TRACK) AS WKS_OFF_TRACK
FROM (SELECT WEEK,
ID,
OFF_TRACK,
ROW_NUMBER() OVER (ORDER BY WEEK DESC) - ROW_NUMBER() OVER
(PARTITION BY ID,OFF_TRACK ORDER BY WEEK DESC) GRP
FROM HIST_TBL)
GROUP BY ID, GRP
ORDER BY ID;
This code results in a cumulative sum of all the weeks each project has been off track, which for my example data would be:
ID WKS_OFF_TRACK
1 4
2 3
3 2
Any ideas where I'm going wrong?
Here is one method that assumes people were "on track" at some point in time:
select sht.id, count(*)
from SOURCE_HISTORICAL_TBL sht
where sht.week > (select max(sht2.week)
from SOURCE_HISTORICAL_TBL sht2
where sht2.id = sht.id and sht2.on_track = 'Y'
)
group by sht.id;
Otherwise, you need one more condition:
select sht.id, count(*)
from SOURCE_HISTORICAL_TBL sht
where sht.week > (select max(sht2.week)
from SOURCE_HISTORICAL_TBL sht2
where sht2.id = sht.id and sht2.on_track = 'Y'
) or
not exists (select 1
from SOURCE_HISTORICAL_TBL sht2
where sht2.id = sht.id and sht2.on_track = 'Y'
)
group by sht.id;
You can also phrase these as analytic functions:
select id,
sum(case when week > max_week_y or max_week_y is null then 1 else 0 end) as max_off_track
from (select sht.*,
max(case when on_track = 'Y' then week end) over (partition by id) as max_week_y
from SOURCE_HISTORICAL_TBL sht
) sht
group by id;
Note that this version will return 0s for people currently on track.
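The rule behind these queries (count the rows that come after the latest on-track week, or all rows if there is none) can be checked with a short Python sketch on the sample data from the question. The function name weeks_off_track is made up for this example.

```python
rows = [
    (1, 1, "N"), (1, 2, "Y"), (1, 3, "N"), (1, 4, "N"), (1, 5, "N"),
    (2, 1, "N"), (2, 2, "N"), (2, 3, "Y"), (2, 4, "Y"), (2, 5, "N"),
    (3, 1, "N"), (3, 2, "N"), (3, 3, "Y"), (3, 4, "Y"), (3, 5, "Y"),
]


def weeks_off_track(rows):
    """Count, per id, the rows later than that id's most recent 'Y' week."""
    # Equivalent of max(case when on_track = 'Y' then week end) per id.
    last_on_track = {}
    for id_, week, on_track in rows:
        if on_track == "Y":
            last_on_track[id_] = max(week, last_on_track.get(id_, week))

    # Start every id at 0 so tasks that are currently on track are reported.
    counts = {id_: 0 for id_, _week, _on_track in rows}
    for id_, week, _on_track in rows:
        # Any row after the last 'Y' week is an 'N' row by construction.
        if id_ not in last_on_track or week > last_on_track[id_]:
            counts[id_] += 1
    return counts


result = weeks_off_track(rows)
print(result)  # {1: 3, 2: 1, 3: 0}
```

Because it counts rows instead of subtracting week numbers, this version does not depend on the weeks being consecutive.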
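You can do it in a single table scan. Note that this subtracts week numbers rather than counting rows, so it relies on WEEK being consecutive integers starting at 1 for each ID: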
Oracle 11g R2 Schema Setup:
CREATE TABLE SOURCE_HISTORICAL_TBL ( ID, WEEK, ON_TRACK ) AS
SELECT 1, 1, 'N' FROM DUAL UNION ALL
SELECT 1, 2, 'Y' FROM DUAL UNION ALL
SELECT 1, 3, 'N' FROM DUAL UNION ALL
SELECT 1, 4, 'N' FROM DUAL UNION ALL
SELECT 1, 5, 'N' FROM DUAL UNION ALL
SELECT 2, 1, 'N' FROM DUAL UNION ALL
SELECT 2, 2, 'N' FROM DUAL UNION ALL
SELECT 2, 3, 'Y' FROM DUAL UNION ALL
SELECT 2, 4, 'Y' FROM DUAL UNION ALL
SELECT 2, 5, 'N' FROM DUAL UNION ALL
SELECT 3, 1, 'N' FROM DUAL UNION ALL
SELECT 3, 2, 'N' FROM DUAL UNION ALL
SELECT 3, 3, 'Y' FROM DUAL UNION ALL
SELECT 3, 4, 'Y' FROM DUAL UNION ALL
SELECT 3, 5, 'Y' FROM DUAL UNION ALL
SELECT 4, 1, 'N' FROM DUAL UNION ALL
SELECT 5, 1, 'Y' FROM DUAL;
Query 1:
SELECT ID,
GREATEST(
COALESCE( MAX( CASE ON_TRACK WHEN 'N' THEN WEEK END ), 0 )
- COALESCE( MAX( CASE ON_TRACK WHEN 'Y' THEN WEEK END ), 0 ),
0
) AS weeks
FROM SOURCE_HISTORICAL_TBL
GROUP BY id
ORDER BY id
Results:
| ID | WEEKS |
|----|-------|
| 1 | 3 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| 5 | 0 |