How to fill irregularly missing values with linear interpolation in BigQuery? - sql

I have data with irregularly missing values, and I'd like to resample it at a fixed interval using linear interpolation in BigQuery Standard SQL.
Specifically, I have data like this:
# data is missing irregularly
+------+-------+
| time | value |
+------+-------+
|    1 |   3.0 |
|    5 |   5.0 |
|    7 |   1.0 |
|    9 |   8.0 |
|   10 |   4.0 |
+------+-------+
and I'd like to convert this table as follows:
# interpolated with interval of 1
+------+--------------------+
| time | value_interpolated |
+------+--------------------+
|    1 |                3.0 |
|    2 |                3.5 |
|    3 |                4.0 |
|    4 |                4.5 |
|    5 |                5.0 |
|    6 |                3.0 |
|    7 |                1.0 |
|    8 |                4.5 |
|    9 |                8.0 |
|   10 |                4.0 |
+------+--------------------+
Any smart solution for this?
Supplement: this question is similar to this question on Stack Overflow, but differs in that the data is missing irregularly.
Thank you.

Below is for BigQuery Standard SQL
#standardSQL
select time,
       ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
  select time, value,
         first_value(tick ignore nulls) over win1 as start_tick,
         first_value(value ignore nulls) over win1 as start_value,
         first_value(tick ignore nulls) over win2 as end_tick,
         first_value(value ignore nulls) over win2 as end_value
  from (
    select time, t.time as tick, value
    from (
      select generate_array(min(time), max(time)) times
      from `project.dataset.table`
    ), unnest(times) time
    left join `project.dataset.table` t
    using(time)
  )
  window win1 as (order by time desc rows between current row and unbounded following),
         win2 as (order by time rows between current row and unbounded following)
)
Applied to the sample data from your question, this produces the expected output shown above.
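For a self-contained test, here is the same query with the sample data inlined as a CTE (a sketch: the CTE named sample merely stands in for `project.dataset.table`):
#standardSQL
with sample as (
  select 1 as time, 3.0 as value union all
  select 5, 5.0 union all
  select 7, 1.0 union all
  select 9, 8.0 union all
  select 10, 4.0
)
select time,
       ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
from (
  select time, value,
         first_value(tick ignore nulls) over win1 as start_tick,
         first_value(value ignore nulls) over win1 as start_value,
         first_value(tick ignore nulls) over win2 as end_tick,
         first_value(value ignore nulls) over win2 as end_value
  from (
    select time, t.time as tick, value
    from (
      select generate_array(min(time), max(time)) times
      from sample
    ), unnest(times) time
    left join sample t
    using(time)
  )
  window win1 as (order by time desc rows between current row and unbounded following),
         win2 as (order by time rows between current row and unbounded following)
)
order by time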

Here is an example of how to solve this in PostgreSQL.
https://dbfiddle.uk/?rdbms=postgres_9.5&fiddle=c560dd9a8db095920d0a15834b6768f1
with data as (
  select time
        ,lead(time) over (order by time) as next_time
        ,value
        ,lead(value) over (order by time) as next_value
        ,(lead(value) over (order by time) - value) as val_diff
        ,(lead(time) over (order by time) - time) as time_diff
  from t
)
select *
      ,generate_series - time as grp
      ,case when generate_series - time = 0 then value
            else value + (val_diff * 1.0 / time_diff) * (generate_series - time) * 1.0
       end as val_grp
from data
cross join generate_series(time, coalesce(next_time - 1, time))
+------+-----------------+-----+-------------------------+
| time | generate_series | grp | val_grp                 |
+------+-----------------+-----+-------------------------+
|    1 |               1 |   0 |                     3.0 |
|    1 |               2 |   1 | 3.500000000000000000000 |
|    1 |               3 |   2 | 4.000000000000000000000 |
|    1 |               4 |   3 | 4.500000000000000000000 |
|    5 |               5 |   0 |                     5.0 |
|    5 |               6 |   1 |     3.00000000000000000 |
|    7 |               7 |   0 |                     1.0 |
|    7 |               8 |   1 |     4.50000000000000000 |
|    9 |               9 |   0 |                     8.0 |
|   10 |              10 |   0 |                     4.0 |
+------+-----------------+-----+-------------------------+
I believe the syntax would be different in BigQuery using UNNEST and GENERATE_ARRAY as follows. You could give it a try.
with data as (
  select time
        ,lead(time) over (order by time) as next_time
        ,value
        ,lead(value) over (order by time) as next_value
        ,(lead(value) over (order by time) - value) as val_diff
        ,(lead(time) over (order by time) - time) as time_diff
  from t
)
select *
      ,generate_series - time as grp
      ,case when generate_series - time = 0 then value
            else value + (val_diff * 1.0 / time_diff) * (generate_series - time) * 1.0
       end as val_grp
from data
cross join UNNEST(GENERATE_ARRAY(time, coalesce(next_time - 1, time))) as generate_series

In BigQuery you can generate the extra rows for each row using generate_array(). Then you can use lead() to get information from the next row and some arithmetic for interpolation:
with t as (
      select 1 as time, 3.0 as value union all
      select 5, 5.0 union all
      select 7, 1.0 union all
      select 9, 8.0 union all
      select 10, 4.0
     ),
     tt as (
      select t.*,
             lead(time) over (order by time) as next_time,
             lead(value) over (order by time) as next_value
      from t
     )
select coalesce(n, tt.time) as time,
       (case when n = tt.time or n is null then value
             else tt.value + (tt.next_value - tt.value) * (n - tt.time) / (tt.next_time - tt.time)
        end) as value
from tt left join
     unnest(generate_array(tt.time, tt.next_time - 1, 1)) n
     on true
order by 1;
Note: You have a column called time that contains an integer. If this is really a date/time data type of some type, I would suggest that you ask a new question with more appropriate sample data and desired results -- if you don't see how to adapt this answer.
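For instance, if the column really were a DATE sampled daily, the same idea might look like this (a hypothetical sketch with made-up dates, using GENERATE_DATE_ARRAY and DATE_DIFF; not from the original answer):
with t as (
      select date '2023-01-01' as dt, 3.0 as value union all
      select date '2023-01-05', 5.0 union all
      select date '2023-01-07', 1.0
     ),
     tt as (
      select t.*,
             lead(dt) over (order by dt) as next_dt,
             lead(value) over (order by dt) as next_value
      from t
     )
select coalesce(d, tt.dt) as dt,
       (case when d = tt.dt or d is null then value
             else tt.value + (tt.next_value - tt.value) * date_diff(d, tt.dt, day) / date_diff(tt.next_dt, tt.dt, day)
        end) as value_interpolated
from tt left join
     unnest(generate_date_array(tt.dt, date_sub(tt.next_dt, interval 1 day))) d
     on true
order by 1;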

Related

SQL SERVER How to select the latest record in each group? [duplicate]

| ID | TimeStamp | Item |
|----|-----------|------|
| 1  | 0:00:20   | 0    |
| 1  | 0:00:40   | 1    |
| 1  | 0:01:00   | 1    |
| 2  | 0:01:20   | 1    |
| 2  | 0:01:40   | 0    |
| 2  | 0:02:00   | 1    |
| 3  | 0:02:20   | 1    |
| 3  | 0:02:40   | 1    |
| 3  | 0:03:00   | 0    |
I have this and I would like to turn it into
| ID | TimeStamp | Item |
|----|-----------|------|
| 1  | 0:01:00   | 1    |
| 2  | 0:02:00   | 1    |
| 3  | 0:03:00   | 0    |
Please advise, thank you!
A correlated subquery is often the fastest method:
select t.*
from t
where t.timestamp = (select max(t2.timestamp)
                     from t t2
                     where t2.id = t.id
                    );
For this, you want an index on (id, timestamp).
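For example (a sketch; the index name is arbitrary):
create index ix_t_id_timestamp on t (id, [timestamp]);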
You can also use row_number():
select t.*
from (select t.*,
             row_number() over (partition by id order by timestamp desc) as seqnum
      from t
     ) t
where seqnum = 1;
This is typically a wee bit slower because it needs to assign the row number to every row, even those not being returned.
You can rank the records per id by descending TimeStamp in a subquery using an analytic function, and then keep only the rows ranked first (with value 1):
SELECT *
FROM
(
    SELECT *,
           DENSE_RANK() OVER (PARTITION BY ID ORDER BY TimeStamp DESC) AS dr
    FROM t
) t
WHERE t.dr = 1
The DENSE_RANK() analytic function is used so that records tied on the latest timestamp are included as well.

Partition & consecutive in SQL

Fellow stackers,
I have a data set like so:
+---------+------+--------+
| user_id | date | metric |
+---------+------+--------+
|       1 |    1 |      1 |
|       1 |    2 |      1 |
|       1 |    3 |      1 |
|       2 |    1 |      1 |
|       2 |    2 |      1 |
|       2 |    3 |      0 |
|       2 |    4 |      1 |
+---------+------+--------+
I am looking to flag those customers who have three consecutive "1"s in the metric column. I have a solution, as below.
select distinct user_id
from (
  select user_id
        ,metric +
         ifnull( lag(metric, 1) OVER (PARTITION BY user_id ORDER BY date), 0 ) +
         ifnull( lag(metric, 2) OVER (PARTITION BY user_id ORDER BY date), 0 )
         as consecutive_3
  from df
) b
where consecutive_3 = 3
While it works, it is not scalable; one can imagine what the above query would look like if I were looking for 50 consecutive values.
May I ask if there is a scalable solution? Any cloud SQL will do. Thank you.
If you only want such users, you can use a sum(). Assuming that metric is only 0 or 1:
select user_id,
       (case when max(metric_3) = 3 then 1 else 0 end) as flag_3
from (select df.*,
             sum(metric) over (partition by user_id
                               order by date
                               rows between 2 preceding and current row
                              ) as metric_3
      from df
     ) df
group by user_id;
By using a windowing clause, you can easily expand to as many adjacent 1s as you like.
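For example, here is a sketch of the 50-consecutive variant; only the frame size and the target sum change:
select user_id,
       (case when max(metric_50) = 50 then 1 else 0 end) as flag_50
from (select df.*,
             sum(metric) over (partition by user_id
                               order by date
                               rows between 49 preceding and current row
                              ) as metric_50
      from df
     ) df
group by user_id;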

How to detect an interval between consecutive rows?

Consider the following rows:
Id RecordedOn
1 9/3/19 11:15:00
2 9/3/19 11:15:01
3 9/3/19 11:15:02
4 9/3/19 11:18:55
5 9/3/19 11:18:01
As you can see, there are typically records every second, but from row 3 to row 4, there is a gap.
How do I find gaps like these? Preferably I'd like the starting and ending row of the gap, so 3, 4 in this case.
If you want both the before and after rows, use lag() and lead():
select t.*
from (select t.*,
             lag(recordedon) over (order by recordedon) as prev_ro,
             lead(recordedon) over (order by recordedon) as next_ro
      from t
     ) t
where prev_ro < dateadd(second, -1, recordedon) or
      next_ro > dateadd(second, 1, recordedon);
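If you would rather get each gap as one row with its bounding ids, here is a sketch using lead() only (note that in this sample data the ids are not in RecordedOn order, so the pairs follow time order):
select t.id as gap_start_id, t.next_id as gap_end_id
from (select t.id, t.recordedon,
             lead(t.id) over (order by t.recordedon) as next_id,
             lead(t.recordedon) over (order by t.recordedon) as next_ro
      from t
     ) t
where t.next_ro > dateadd(second, 1, t.recordedon);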
SQL DEMO
SELECT *, DATEDIFF(second, previous, [RecordedOn]) as diff
FROM (
  SELECT [Id], [RecordedOn], LAG([RecordedOn]) OVER (ORDER BY [RecordedOn]) previous
  FROM t
) t
OUTPUT
| Id | RecordedOn           | previous             | diff   |
|----|----------------------|----------------------|--------|
| 1  | 2019-09-03T11:15:00Z | (null)               | (null) |
| 2  | 2019-09-03T11:15:01Z | 2019-09-03T11:15:00Z | 1      |
| 3  | 2019-09-03T11:15:02Z | 2019-09-03T11:15:01Z | 1      |
| 5  | 2019-09-03T11:18:01Z | 2019-09-03T11:15:02Z | 179    |
| 4  | 2019-09-03T11:18:55Z | 2019-09-03T11:18:01Z | 54     |
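To keep only the rows that begin after a gap, you could filter on that difference; a sketch building on the query above:
SELECT *
FROM (
  SELECT [Id], [RecordedOn], LAG([RecordedOn]) OVER (ORDER BY [RecordedOn]) previous
  FROM t
) t
WHERE DATEDIFF(second, previous, [RecordedOn]) > 1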
You can also use LAG() to get the previous id if you need it.
You could self-join the table with a LEFT JOIN anti-join pattern to find records for which no record exists 1 second later, like:
SELECT t.id
FROM mytable t
LEFT JOIN mytable t1 ON t1.RecordedOn = DATEADD(second, 1, t.RecordedOn)
WHERE t1.id IS NULL
Demo on DB Fiddle:
| id |
| -: |
| 3 |
| 4 |
| 5 |

How to increment the counting for each non-consecutive value?

Below is a simple representation of my table:
ID | GA
----------
1 | 1.5
2 | 1.5
3 | 1.2
4 | 1.5
5 | 1.3
I would like to count the number of occurrences of the GA column's values, BUT the count should not increment when the value is the same as in the next row.
What I would like to expect is like this:
ID | GA | COUNT
-------------------
1 | 1.5 | 1
2 | 1.5 | 1
3 | 1.2 | 1
4 | 1.5 | 2
5 | 1.3 | 1
Notice that the count for GA = 1.5 at ID 4 is 2. This is because the row between IDs 2 and 4 breaks the run of 1.5s.
NOTE: The ordering by ID also matters.
Here's what I've done so far:
SELECT ID, GA, COUNT(*) OVER (
         PARTITION BY GA
         ORDER BY ID
       ) COUNT
FROM (
  SELECT 1 AS ID, '1.5' AS GA FROM DUAL
  UNION
  SELECT 2, '1.5' FROM DUAL
  UNION
  SELECT 3, '1.2' FROM DUAL
  UNION
  SELECT 4, '1.5' FROM DUAL
  UNION
  SELECT 5, '1.3' FROM DUAL
) FOO
ORDER BY ID;
But the result is far from expectation:
ID | GA | COUNT
-------------------
1 | 1.5 | 1
2 | 1.5 | 2
3 | 1.2 | 1
4 | 1.5 | 3
5 | 1.3 | 1
Notice that even if they are consecutive values, the count is still incrementing.
It seems that you are asking for a kind of running total, not just a global count.
Assuming, that the input data is in a table named input_data, this should do the trick:
WITH
with_previous AS (
SELECT id, ga, LAG(ga) OVER (ORDER BY id) AS previous_ga
FROM input_data
),
just_new AS (
SELECT id,
ga,
CASE
WHEN previous_ga IS NULL
OR previous_ga <> ga
THEN ga
END AS new_ga
FROM with_previous
)
SELECT id,
ga,
COUNT(new_ga) OVER (PARTITION BY ga ORDER BY id) AS ga_count
FROM just_new
ORDER BY 1
See sqlfiddle: http://sqlfiddle.com/#!4/187e13/1
Result:
ID | GA | GA_COUNT
----+-----+----------
1 | 1.5 | 1
2 | 1.5 | 1
3 | 1.2 | 1
4 | 1.5 | 2
5 | 1.3 | 1
6 | 1.5 | 3
7 | 1.5 | 3
8 | 1.3 | 2
I took sample data from #D-Shih's sqlfiddle
As I understand the problem, this is a variation of a gaps-and-islands problem. You want to enumerate the groups for each ga value independently.
If this interpretation is correct, then I would go for dense_rank() and the difference of row numbers:
select t.*, dense_rank() over (partition by ga order by seqnum_1 - seqnum_2)
from (select t.*,
             row_number() over (order by id) as seqnum_1,
             row_number() over (partition by ga order by id) as seqnum_2
      from t
     ) t
order by id;
Here is a rextester.
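To see why this works, here are the intermediate values for the sample data (worked out by hand):
ID | GA  | seqnum_1 | seqnum_2 | seqnum_1 - seqnum_2
---+-----+----------+----------+---------------------
 1 | 1.5 |        1 |        1 |                   0
 2 | 1.5 |        2 |        2 |                   0
 3 | 1.2 |        3 |        1 |                   2
 4 | 1.5 |        4 |        3 |                   1
 5 | 1.3 |        5 |        1 |                   4
Within each ga, the rows of one island share the same difference, so dense_rank() over that difference enumerates the islands per value: 1, 1, 1, 2, 1.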
Use a subquery with the LAG and SUM analytic functions:
SELECT id, ga,
       sum( cnt ) over (partition by ga order by id) as cnt
FROM (
  select t.*,
         case lag(ga) over (order by id)
              when ga then 0 else 1
         end cnt
  from Tab t
)
order by id
| ID | GA  | CNT |
|----|-----|-----|
| 1  | 1.5 | 1   |
| 2  | 1.5 | 1   |
| 3  | 1.2 | 1   |
| 4  | 1.5 | 2   |
| 5  | 1.3 | 1   |
Demo: http://sqlfiddle.com/#!4/5ddd1/5

Calculating consecutive range of dates with a value in Hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of Id's and return the calculated value(s) of each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
|  1 |     8091 |    0.9 |
|  1 |     8092 |     20 |
|  1 |     8095 |   0.22 |
|  1 |     8096 |   0.23 |
|  1 |     8098 |   0.23 |
|  2 |     8095 |     12 |
|  2 |     8096 |     18 |
|  2 |     8097 |      3 |
|  2 |     8098 |   0.25 |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
|  1 |                             1 |
|  1 |                             2 |
|  1 |                             1 |
|  2 |                             1 |
+----+-------------------------------+
In this case, the ranges are the runs of consecutive days with credit less than 1. If there is a gap in the date_key column, the range must not carry over to the next value, as with ID 1 between date keys 8096 and 8098.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!
You can do this with a running sum that classifies rows into groups: the sum increments by 1 (in date_key order, per id) every time a row has credit >= 1 or follows a gap in date_key, so each unbroken run of credit < 1 days ends up in its own group. Thereafter it is just a group by.
select id, count(*) as range_days_credit_lt_1
from (select t.*
            ,sum(case when credit < 1
                       and date_key = lag(date_key) over (partition by id order by date_key) + 1
                      then 0 else 1
                 end) over (partition by id order by date_key) as grp
      from tbl t
     ) t
where credit < 1
group by id, grp
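To see how the grouping works, here is the intermediate grp value for ID 1 in the sample data (worked out by hand from the query above):
+----+----------+--------+-----+
| ID | DATE_KEY | CREDIT | grp |
+----+----------+--------+-----+
|  1 |     8091 |    0.9 |   1 |
|  1 |     8092 |     20 |   2 |
|  1 |     8095 |   0.22 |   3 |
|  1 |     8096 |   0.23 |   3 |
|  1 |     8098 |   0.23 |   4 |
+----+----------+--------+-----+
Filtering credit < 1 and grouping by (id, grp) then yields the counts 1, 2 and 1 for ID 1, matching the expected output.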
The key is to collapse each consecutive sequence and compute its length; I managed to achieve this in a relatively clumsy way:
with t_test as
(
  select num, row_number() over (order by num) as rn
  from
  (
    select explode(array(1,3,4,5,6,9,10,15)) as num
  ) nums
)
select length(sign) + 1
from
(
  select explode(continue_sign) as sign
  from
  (
    select split(concat_ws('', collect_list(if(d > 1, 'v', d))), 'v') as continue_sign
    from
    (
      select t0.num - t1.num as d
      from t_test t0
      join t_test t1 on t0.rn = t1.rn + 1
    ) diffs
  ) marks
) runs
Get the previous number b in the sequence for each original number a;
check whether a - b == 1: if not, there is a gap, which is marked with 'v';
merge all the marks into one string, split it on 'v', and compute the length of each remaining run.
To carry the ID column through, another string encoding the id would have to be considered.