Skip rows in BigQuery based on difference from value in previous row

Assuming the table below is ordered by value (DESC), how can I return only those rows where the difference between the current value and the value in the previous row is less than some number x (e.g. 2), and discard all following rows once this condition fails for the first time?
I.e. return only rows 1 and 2 below: the difference between the values of rows 2 and 3 (9.0 - 4.0 = 5.0) is greater than 2, so rows 3 and 4 are skipped.
with table as (
select 1 as id, "a" as name, 10.0 as value UNION ALL
select 2, "b", 9.0 UNION ALL
select 3, "c", 4.0 UNION ALL
select 4, "d", 1.0
)
output
id, name, value
1, a, 10.0
2, b, 9.0

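We can use lag() to compute each row's difference from the previous row, then self-join on a.id >= b.id so that every row sees the differences of all rows up to and including itself. Keeping only the rows where max(value_diff) <= 2 discards everything from the first large gap onward.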
with t1 as (
select 1 as id, 'a' as name, 10.0 as value UNION ALL
select 2, 'b', 9.0 UNION ALL
select 3, 'c', 4.0 UNION ALL
select 4, 'd', 1.0
)
select
a.id, a.name, a.value,
max(b.value_diff) max_diff
from t1 a
join (select id, abs(coalesce(value - lag(value) over (order by id),0)) as value_diff from t1 )b
on a.id >= b.id
group by a.id, a.name, a.value
having max(b.value_diff) <= 2;
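To make the logic of the query above concrete, here is a minimal sketch of the same idea in plain Python (not BigQuery code). It assumes the rows are already in the desired order, and the function name rows_before_first_gap is made up for illustration.

```python
from itertools import accumulate

# Same sample data as the query: (id, name, value), ordered by value DESC.
rows = [(1, "a", 10.0), (2, "b", 9.0), (3, "c", 4.0), (4, "d", 1.0)]
threshold = 2  # the "x" from the question


def rows_before_first_gap(rows, threshold):
    """Keep rows until the first jump larger than `threshold`."""
    # Mirrors abs(coalesce(value - lag(value), 0)): the first row gets 0.
    diffs = [0.0] + [abs(cur[2] - prev[2]) for prev, cur in zip(rows, rows[1:])]
    # The running maximum plays the role of max(b.value_diff) for b.id <= a.id.
    running_max = accumulate(diffs, max)
    return [row for row, m in zip(rows, running_max) if m <= threshold]


result = rows_before_first_gap(rows, threshold)
print(result)  # [(1, 'a', 10.0), (2, 'b', 9.0)]
```

Because the running maximum never decreases, a single large gap excludes every later row, which is what the having clause achieves in the query.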

Related

SQL/Snowflake Sampling with specific probability

Suppose I have table 1 below, how can I select the values from table 1 with the specified probabilities, where each probability is the chance of the respective value getting selected?
Table 1:
Group Value Probability
A 1 5%
A 10 5%
A 50 20%
A 30 70%
B 5 5%
B 25 70%
B 100 25%
A possible outcome is (assuming 30 and 25 are selected simply because of their higher probabilities):
Table 2:
Group Value
A 30
B 25
I'm trying to solve this on Snowflake and have not managed to with the methods I tried, including partitioning the values and comparing their ranks, and using the Uniform function to create random probabilities. I'm not sure if there's a more elegant way to sample and partition by Group. The end goal is to have the Value field in Table 1 deduplicated, so that each value has a chance of being selected according to its probability.
Within each group, give each value a consecutive range whose width equals its probability. For example, a value with 15% might get the range between 30 and 45.
Pick a random number between 0 and 100.
Find the range in which that random number falls:
create or replace temp table probs
as
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5
union all select 'b', 2, 7
union all select 'b', 3, 8
union all select 'b', 4, 80;
with calculated_ranges as (
select *, range_prob2-prob range_prob1
from (
select *, sum(prob) over(partition by id order by prob) range_prob2
from probs
)
)
select id, random_draw, value, prob
from (
select id, any_value(uniform(0, 100, random())) random_draw
from probs group by id
) a
join calculated_ranges b
using (id)
where range_prob1<=random_draw and range_prob2>random_draw
;
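The three steps above can also be sketched in plain Python (this is not Snowflake code). The function name pick_per_group is hypothetical, and unlike the query the sketch builds the ranges in input order rather than ordering by prob, which does not change the selection probabilities.

```python
import random
from collections import defaultdict
from itertools import accumulate

# Same sample data as the probs table: (id, value, prob in percent).
probs = [
    ("a", 1, 20), ("a", 2, 30), ("a", 3, 40), ("a", 4, 10),
    ("b", 1, 5), ("b", 2, 7), ("b", 3, 8), ("b", 4, 80),
]


def pick_per_group(probs, rng):
    """Pick one value per id, weighted by prob, using cumulative ranges."""
    groups = defaultdict(list)
    for group_id, value, prob in probs:
        groups[group_id].append((value, prob))

    picked = {}
    for group_id, members in groups.items():
        # Upper bound of each value's range (the running sum of prob).
        uppers = list(accumulate(prob for _, prob in members))
        draw = rng.random() * uppers[-1]  # one draw per group
        for (value, prob), upper in zip(members, uppers):
            if upper - prob <= draw < upper:
                picked[group_id] = value
                break
        else:
            # Defensive fallback in case rounding puts the draw on the top edge.
            picked[group_id] = members[-1][0]
    return picked


print(pick_per_group(probs, random.Random()))
```

Each value owns a half-open range [lower, upper), so the ranges do not overlap and a draw can match only one value per group.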
Felipe's answer is great, it definitely solved the problem.
While trying out different approaches yesterday, I tested out this approach on Felipe's table and it seems to be working as well.
I'm giving each record a random probability and comparing it against the actual probability. The idea is that if the actual probability is greater than or equal to the random probability, the record is accepted, and the partitioning then deduplicates by ordering on the probabilities in descending order.
create or replace temp table probs
as
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5
union all select 'b', 2, 7
union all select 'b', 3, 8
union all select 'b', 4, 80;
create or replace temp table t2 as
select *,
min(compare_prob) over(partition by id) as min_compare_prob,
max(compare_prob) over(partition by id) as max_compare_prob,
min_compare_prob <> max_compare_prob as not_all_identical --TRUE when the compare_prob values within a group are not all the same
from (select id,
value,
prob,
UNIFORM(0.00001::float,1::float,random(2)) as rand_prob, --random probability
case when prob >= rand_prob then 1 else 0 end as compare_prob
from (select id, value, prob/100 as prob from probs)
);
--dedup results
select id, value, prob, rand_prob
from (select *,
row_number() over(partition by id order by prob desc, rand_prob desc) as rn
from t2
where not_all_identical = FALSE
union all
select *,
row_number() over(partition by id order by prob desc, COMPARE_PROB desc) as rn
from t2
where not_all_identical = TRUE)
where rn = 1;

Oracle SQL - Count based on a condition to include distinct rows with zero matches

Is there a "better" way to refactor the query below, which returns the number of occurrences of a particular value (e.g. 'A') for each distinct id? The challenge seems to be keeping id = 2 in the result set even though its count is zero (id = 2 is never related to 'A'). The query uses a common table expression, the NVL function, an in-line view, distinct, and a left join. Is all of that really needed to get this job done? (Oracle 19c)
create table T (id, val) as
select 1, 'A' from dual
union all select 1, 'B' from dual
union all select 1, 'A' from dual
union all select 2, 'B' from dual
union all select 2, 'B' from dual
union all select 3, 'A' from dual
;
with C as (select id, val, count(*) cnt from T where val = 'A' group by id, val)
select D.id, nvl(C.cnt, 0) cnt_with_zero from (select distinct id from T) D left join C on D.id = C.id
order by id
;
ID CNT_WITH_ZERO
---------- -------------
1 2
2 0
3 1
A simple way is conditional aggregation:
select id,
sum(case when val = 'A' then 1 else 0 end) as num_As
from t
group by id;
If you have another table with one row per id, I would recommend:
select i.id,
(select count(*) from t where t.id = i.id and t.val = 'A') as num_As
from ids i;
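To see why conditional aggregation keeps ids with zero matches, here is a minimal Python sketch of the same counting logic (not Oracle code). The function name count_matches_per_id is made up for illustration.

```python
# Same sample data as table T: (id, val).
rows = [(1, "A"), (1, "B"), (1, "A"), (2, "B"), (2, "B"), (3, "A")]


def count_matches_per_id(rows, target):
    """Count rows equal to `target` per id, keeping ids with no match."""
    counts = {}
    for row_id, val in rows:
        # Every row adds 1 or 0, so an id without a match still gets an entry.
        counts[row_id] = counts.get(row_id, 0) + (1 if val == target else 0)
    return counts


num_as = count_matches_per_id(rows, "A")
print(num_as)  # {1: 2, 2: 0, 3: 1}
```

No row is filtered out before grouping, which is the difference from putting val = 'A' in a where clause.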

How to compare column in one table with array from another table in BigQuery?

This continues from the answer to my previous question.
I want to get all values from table b (as rows) for an id if there is any difference from the values in the array for the same id in table a.
WITH a as (SELECT 1 as id, ['123', 'abc', '456', 'qaz', 'uqw'] as value
UNION ALL SELECT 2, ['123', 'wer', 'thg', '10', '200']
UNION ALL SELECT 3, ['200']
UNION ALL SELECT 4, null
UNION ALL SELECT 5, ['140']),
b as (SELECT 1 as id, '123' as value
UNION ALL SELECT 1, 'abc'
UNION ALL SELECT 1, '456'
UNION ALL SELECT 1, 'qaz'
UNION ALL SELECT 1, 'uqw'
UNION ALL SELECT 2, '123'
UNION ALL SELECT 2, 'wer'
UNION ALL SELECT 2, '10'
UNION ALL SELECT 3, null
UNION ALL SELECT 4, 'wer'
UNION ALL SELECT 4, '234'
UNION ALL SELECT 5, '140'
UNION ALL SELECT 5, '121'
)
SELECT * EXCEPT(flag)
FROM (
SELECT b.*, COUNTIF(b.value IS NULL) OVER(PARTITION BY id) flag
FROM a LEFT JOIN a.value
FULL OUTER JOIN b
USING(id, value)
)
WHERE flag > 0
AND NOT id IS NULL
It works well for all ids except 5.
In my case I need to return all values if there is any difference.
In the example, the array with id 5 in table a has only one value, '140', while table b has two rows with id 5. So in this case all values with id 5 from table b must also appear in the expected output.
How do I need to modify this query to get what I want?
UPDATED
This seems to work for me, but I can't be 100% sure:
SELECT * EXCEPT(flag)
FROM (
SELECT b.*, COUNTIF((b.value IS NULL AND a.value IS NOT NULL) OR (b.value IS NOT NULL AND a.value IS NULL)) OVER(PARTITION BY id) flag
FROM a LEFT JOIN a.value
FULL OUTER JOIN b
USING(id, value)
)
WHERE flag > 0
AND NOT id IS NULL
#standardSQL
SELECT *
FROM table_b
WHERE id IN (
SELECT id FROM table_a a
JOIN table_b b USING(id)
GROUP BY id
HAVING STRING_AGG(IFNULL(b.value, 'NULL') ORDER BY b.value) !=
IFNULL(ANY_VALUE((SELECT STRING_AGG(IFNULL(value, 'NULL') ORDER BY value) FROM a.value)), 'NULL')
)
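The comparison in the query above can be sketched in plain Python (not BigQuery code). The sketch compares sorted lists directly instead of aggregated strings and treats a NULL array as an empty list; both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import defaultdict

# Same sample data as the question: table a maps id -> array (or None),
# table b is a list of (id, value) rows.
table_a = {
    1: ["123", "abc", "456", "qaz", "uqw"],
    2: ["123", "wer", "thg", "10", "200"],
    3: ["200"],
    4: None,
    5: ["140"],
}
table_b = [
    (1, "123"), (1, "abc"), (1, "456"), (1, "qaz"), (1, "uqw"),
    (2, "123"), (2, "wer"), (2, "10"),
    (3, None),
    (4, "wer"), (4, "234"),
    (5, "140"), (5, "121"),
]


def sort_key(value):
    # Sort None after every string so mixed lists stay comparable.
    return (value is None, value or "")


def rows_with_differences(table_a, table_b):
    """Return every b row whose id has any difference from a's array."""
    b_values = defaultdict(list)
    for row_id, value in table_b:
        b_values[row_id].append(value)

    differing = {
        row_id
        for row_id, values in b_values.items()
        if row_id in table_a  # inner join on id
        and sorted(values, key=sort_key)
        != sorted(table_a[row_id] or [], key=sort_key)
    }
    return [row for row in table_b if row[0] in differing]


result = rows_with_differences(table_a, table_b)
print(result)
```

With the sample data, id 1 matches exactly and is dropped, while every row of ids 2, 3, 4 and 5 is returned, including both rows for id 5.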

Does BigQuery `ANY_VALUE` give me any guarantees of values being from the same row?

Consider the following query:
with abc as (
select 1 as a, 1 as b, 2 as c
union all select 1, 3, 4
union all select 1, 5, 6
union all select 2, 7, 8
union all select 2, 9, 10
)
select
a,
any_value(b),
any_value(c)
from abc
group by a
Are there any guarantees as to whether the b and c values picked by ANY_VALUE will be from the same row? In other words, can I be sure that if the values picked for b are 1 and 9 (yes, I know there are no guarantees that this will be the case, or that it will be the same each time) then c is 2 and 10, respectively?
The below is for BigQuery Standard SQL.
ANY_VALUE returns the expression for some row in the group. Which row is chosen is nondeterministic.
To make sure that b and c are taken from the same row, use the approach below:
#standardSQL
SELECT AS VALUE ANY_VALUE(t)
FROM abc t
GROUP BY a
To also make sure those values come not just from the same row but from the same row on every run, use the approach below:
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY b,c LIMIT 1)[OFFSET(0)]
FROM abc t
GROUP BY a
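The deterministic variant above picks one whole row per group after ordering by b, c. A minimal Python sketch of that idea (not BigQuery code; first_row_per_group is a made-up name):

```python
from itertools import groupby
from operator import itemgetter

# Same sample data as the abc CTE: (a, b, c).
abc = [(1, 1, 2), (1, 3, 4), (1, 5, 6), (2, 7, 8), (2, 9, 10)]


def first_row_per_group(rows):
    """Pick the whole row with the smallest (b, c) within each a."""
    ordered = sorted(rows)  # tuples sort by a, then b, then c
    # Taking the first row of each group keeps b and c from the same row.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


picked = first_row_per_group(abc)
print(picked)  # [(1, 1, 2), (2, 7, 8)]
```

Selecting the row as a unit, rather than choosing b and c separately, is what guarantees the two values belong together.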

SQL query: need to get the latest created data in the child records

I have a requirement in which I need to get the latest created data in the child records.
Suppose there are two tables, A and B. A is the parent and B is the child, and they have a 1:M relation. Both have some columns, and table B also has a 'created date' column which holds the creation date of each record in table B.
Now I need to write a query which fetches all records from table A together with the latest created child record from table B. If, say, two child records were created today in table B for a parent record, then the later of the two should be fetched.
One record of table A could have many children, so how can we achieve this?
The result should be: columns of table A, columns of table B (latest created one).
I hope the 'created date' is a DATETIME column. This would give you the most recent child record. Assuming you have a consistent ID in the parent table with the same ParentID in the child table as a foreign key....
select A.*, B.*
from A
join B on A.ParentID = B.ParentID
join (
select ParentID, max([created date]) as [created date]
from B
group by ParentID
) maxchild on A.ParentID = maxchild.ParentID
where B.ParentID = maxchild.ParentID and B.[created date] = maxchild.[created date]
Below is the query that can help you out.
select x, y from ( select a.coloumn_TAB_A x, b.coloumn_TAB_B y from TableA a ,
TableB b where a.primary_key=b.primary_key
and a.Primary_key ='XYZ' order by b.created_date desc) where rownum < 2
Here we have two tables, A and B, joined on their primary keys and ordered by the created date column of table B in descending order.
Use this output as an inline view for the outer query and select whichever columns you want, such as x and y. The where rownum < 2 condition fetches only the latest record of table B.
This is not the most efficient but will work (SQL Only):
SELECT [Table_A].[Columns], [Table_B].[Columns]
FROM [Table_A]
LEFT OUTER JOIN [Table_B]
ON [Table_B].ForeignKey = [Table_A].PrimaryKey
AND [Table_B].PrimaryKey = (SELECT TOP 1 [Table_B].PrimaryKey
FROM [Table_B]
WHERE [Table_B].ForeignKey = [Table_A].PrimaryKey
ORDER BY [Table_B].CREATIONDATE DESC)
You can use analytic functions to avoid hitting each table (or specifically B) more than once.
Using CTEs to provide dummy data for A and B, you can do this:
with A as (
select 1 as id from dual
union all select 2 from dual
union all select 3 from dual
),
B as (
select 1 as a_id, date '2012-01-01' as created_date, 'First for 1' as value
from dual
union all select 1, date '2012-01-02', 'Second for 1' from dual
union all select 1, date '2012-01-03', 'Third for 1' from dual
union all select 2, date '2012-02-01', 'First for 2' from dual
union all select 2, date '2012-02-03', 'Second for 2' from dual
union all select 3, date '2012-02-01', 'First for 3' from dual
union all select 3, date '2012-02-03', 'Second for 3' from dual
union all select 3, date '2012-02-05', 'Third for 3' from dual
union all select 3, date '2012-02-09', 'Fourth for 3' from dual
)
select id, created_date, value from (
select a.id, b.created_date, b.value,
row_number() over (partition by a.id order by b.created_date desc) as rn
from a
join b on b.a_id = a.id
)
where rn = 1
order by id;
ID CREATED_D VALUE
---------- --------- ------------
1 03-JAN-12 Third for 1
2 03-FEB-12 Second for 2
3 09-FEB-12 Fourth for 3
You can select any columns you want from A and B, but you'll need to alias them in the subquery if there are any with the same name in both tables.
You may also need to use rank() or dense_rank() instead of row_number() to handle ties appropriately, if you can have child records with the same created date.
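For readers who want to check the "latest child per parent" logic outside a database, here is a minimal Python sketch of what the row_number() query computes (not Oracle code; latest_child_per_parent is a made-up name). Like row_number(), it keeps a single row per parent, so on a tie in created date it keeps whichever tied child it meets first.

```python
from datetime import date

# Same dummy data as the A and B CTEs.
parents = [1, 2, 3]
children = [  # (a_id, created_date, value)
    (1, date(2012, 1, 1), "First for 1"),
    (1, date(2012, 1, 2), "Second for 1"),
    (1, date(2012, 1, 3), "Third for 1"),
    (2, date(2012, 2, 1), "First for 2"),
    (2, date(2012, 2, 3), "Second for 2"),
    (3, date(2012, 2, 1), "First for 3"),
    (3, date(2012, 2, 3), "Second for 3"),
    (3, date(2012, 2, 5), "Third for 3"),
    (3, date(2012, 2, 9), "Fourth for 3"),
]


def latest_child_per_parent(parents, children):
    """Return the most recently created child for each parent that has one."""
    latest = {}
    for child in children:
        parent_id, created, _ = child
        # Single pass over B: remember only the newest child seen so far.
        if parent_id not in latest or created > latest[parent_id][1]:
            latest[parent_id] = child
    # Inner-join semantics: parents without children are left out.
    return [latest[p] for p in parents if p in latest]


result = latest_child_per_parent(parents, children)
for row in result:
    print(row)
```

The single pass over the child rows corresponds to the point made above about not hitting table B more than once.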