Suppose I have Table 1 below. How can I select one value per group with the specified probabilities, where each probability is the chance of the respective value being selected?
Table 1:
Group Value Probability
A 1 5%
A 10 5%
A 50 20%
A 30 70%
B 5 5%
B 25 70%
B 100 25%
A possible outcome is (assuming 30 and 25 are selected simply because of their higher probabilities):
Table 2:
Group Value
A 30
B 25
I'm trying to solve this in Snowflake and have not been able to through various methods, including partitioning the values and comparing their ranks, and using the UNIFORM function to generate random probabilities. I'm not sure if there's a more elegant way to sample with a partition by Group. The end goal is to deduplicate the Value field in Table 1 so that each value has a chance of being selected based on its probability.
Give each value within a group a consecutive range. For example, a value with a 15% probability whose range starts at 30 gets the range 30 to 45.
Pick a random number between 0 and 100 for each group.
Find which range that random number falls in:
create or replace temp table probs
as
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5
union all select 'b', 2, 7
union all select 'b', 3, 8
union all select 'b', 4, 80;
with calculated_ranges as (
    select *, range_prob2 - prob as range_prob1  -- lower bound of this value's range
    from (
        select *, sum(prob) over(partition by id order by prob) as range_prob2  -- running upper bound
        from probs
    )
)
select id, random_draw, value, prob
from (
    -- one random draw per group
    select id, any_value(uniform(0, 100, random())) as random_draw
    from probs
    group by id
) a
join calculated_ranges b
using (id)
where range_prob1 <= random_draw
  and range_prob2 > random_draw;
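For intuition, calculated_ranges assigns each value a half-open interval [range_prob1, range_prob2). With the sample data above, group a's values get the ranges [0,10), [10,30), [30,60) and [60,100) in ascending order of prob, so a draw of, say, 45 falls in [30,60) and selects value 2. To inspect the ranges themselves, you can reuse the same CTE:
with calculated_ranges as (
    select *, range_prob2 - prob as range_prob1
    from (
        select *, sum(prob) over(partition by id order by prob) as range_prob2
        from probs
    )
)
select id, value, prob, range_prob1, range_prob2
from calculated_ranges
order by id, range_prob1;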
Felipe's answer is great, it definitely solved the problem.
While trying out different approaches yesterday, I tested out this approach on Felipe's table and it seems to be working as well.
I'm giving each record a random probability and comparing it against the actual probability. The idea is that if the random probability is less than or equal to the actual probability, the record is accepted, and the partitioning then deduplicates based on descending order of the probabilities.
create or replace temp table probs
as
select 'a' id, 1 value, 20 prob
union all select 'a', 2, 30
union all select 'a', 3, 40
union all select 'a', 4, 10
union all select 'b', 1, 5
union all select 'b', 2, 7
union all select 'b', 3, 8
union all select 'b', 4, 80;
create or replace temp table t2 as
select *,
       min(compare_prob) over(partition by id) as min_compare_prob,
       max(compare_prob) over(partition by id) as max_compare_prob,
       min_compare_prob <> max_compare_prob as not_all_identical -- true when the group mixes accepted (1) and rejected (0) records
from (select id,
             value,
             prob,
             uniform(0.00001::float, 1::float, random(2)) as rand_prob, -- random probability; random(2) is seeded
             case when prob >= rand_prob then 1 else 0 end as compare_prob -- 1 = accepted
      from (select id, value, prob/100 as prob from probs)
     );
-- dedup results
select id, value, prob, rand_prob
from (select *,
             row_number() over(partition by id order by prob desc, rand_prob desc) as rn
      from t2
      where not_all_identical = FALSE
      union all
      select *,
             row_number() over(partition by id order by prob desc, compare_prob desc) as rn
      from t2
      where not_all_identical = TRUE)
where rn = 1;
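One way to sanity-check either sampler is to repeat the draw many times and compare the empirical frequencies against the target probabilities. Below is a minimal sketch for the range-based approach; it assumes the ranges have been materialized into a (hypothetical) temp table named ranges, and the 10,000-trial row count is arbitrary:
create or replace temp table ranges as
select *, range_prob2 - prob as range_prob1
from (
    select *, sum(prob) over(partition by id order by prob) as range_prob2
    from probs
);
with draws as (
    -- one uniform draw per (group, trial)
    select g.id, t.trial, uniform(0, 100, random()) as random_draw
    from (select distinct id from probs) g
    cross join (select seq4() as trial from table(generator(rowcount => 10000))) t
)
select r.id, r.value,
       any_value(r.prob) as target_pct,
       100 * count(*) / 10000 as empirical_pct -- should land near target_pct
from draws d
join ranges r
  on r.id = d.id
 and r.range_prob1 <= d.random_draw
 and r.range_prob2 > d.random_draw
group by r.id, r.value
order by r.id, r.value;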
I have origin table A:
dt         c1  value
2022/10/1  1   1
2022/10/2  1   2
2022/10/3  1   3
2022/10/1  2   4
2022/10/2  2   6
2022/10/3  2   5
Currently I get the latest dt's percent_rank with:
select * from
(
select
*,
percent_rank() over (partition by c1 order by value) as prank
from A
) as pt
where pt.dt = Date'2022-10-3'
Demo: https://www.db-fiddle.com/f/rXynTaD5nmLqFJdjDSCZpL/0
The expected result looks like:
dt         c1  value  prank
2022/10/3  1   3      1
2022/10/3  2   5      0.5
Which means that at 2022-10-03, the latest value's percent_rank over the group's history is 100% for group c1 = 1 and 50% for group c1 = 2.
But this SQL will sort every partition, which I believe has time complexity O(n log n).
I just need the latest date's rank, and I thought I could get it by calculating count(last_value > value)/count(), which costs O(n).
Any suggestions?
Rather than hard-coding the maximum date, you can use the ROW_NUMBER() analytic function:
SELECT *
FROM (
SELECT t.*,
PERCENT_RANK() OVER (PARTITION BY c1 ORDER BY value) AS prank,
ROW_NUMBER() OVER (PARTITION BY c1 ORDER BY dt DESC) AS rn
FROM table_name t
) t
WHERE rn = 1
Which, for the sample data:
CREATE TABLE table_name (dt, c1, value) AS
SELECT DATE '2022-10-01', 1, 1 FROM DUAL UNION ALL
SELECT DATE '2022-10-02', 1, 2 FROM DUAL UNION ALL
SELECT DATE '2022-10-03', 1, 3 FROM DUAL UNION ALL
SELECT DATE '2022-10-01', 2, 4 FROM DUAL UNION ALL
SELECT DATE '2022-10-02', 2, 6 FROM DUAL UNION ALL
SELECT DATE '2022-10-03', 2, 5 FROM DUAL;
Outputs:
DT                   C1  VALUE  PRANK  RN
2022-10-03 00:00:00  1   3      1      1
2022-10-03 00:00:00  2   5      .5     1
fiddle
But this SQL will sort every partition, which I believe has time complexity O(n log n).
Whatever you do, you will need to iterate over the entire result set.
I just need the latest date's rank and I thought I could do that by calculating count(last_value > value)/count().
Then you will need to find the last value, which (unless you are hard-coding the last date) will involve an index or table scan over all the values in each partition plus a sort, and finding the count of greater values will require a second index or table scan. You can profile both solutions, but I expect you will find that using analytic functions is at least as efficient as trying to use aggregate functions.
For example:
SELECT c1,
dt,
value,
( SELECT ( COUNT(CASE WHEN value <= t.value THEN 1 END) - 1 )
/ ( COUNT(*) - 1 )
FROM table_name c
WHERE c.c1 = t.c1
) AS prank
FROM table_name t
WHERE dt = DATE '2022-10-03'
This is going to access the table twice, and you are likely to find that the I/O costs of the table accesses far outweigh any potential savings from using a different method. Moreover, if you look at the explain plan (fiddle), the query is still performing an aggregate sort, so there are no cost savings, only additional costs, from this method.
Try this
select t.c1, t.dt, t.value
from TABLENAME t
inner join (
    select c1, max(dt) as MaxDate
    from TABLENAME
    group by c1 -- group by c1 (not dt) so we get each group's latest date
) tm on t.c1 = tm.c1 and t.dt = tm.MaxDate
order by t.dt desc;
Or as simple as
SELECT * from TABLENAME ORDER BY dt DESC;
I fiddled with it a bit; it is almost the same answer as MT0 already posted.
select dt, c1, val, prank * 100 as percent_rank
from (select t1.*,
             percent_rank() over (partition by c1 order by val) as prank,
             row_number() over (partition by c1 order by dt desc) as rn
      from t1)
where rn = 1;
result
DT C1 VAL PERCENT_RANK
2022-10-03 1 3 100
2022-10-03 2 5 50
http://sqlfiddle.com/#!4/ec60a/23
I used row_number() = 1 to get the latest date, and also converted percent_rank to a percentage.
Is this what you want?
Assuming the table below is ordered by value (descending), how can I return only the rows where the difference between the current value and the previous row's value is less than some number x (e.g. 2), and also discard all remaining rows once this condition first fails?
I.e. return only rows 1 and 2 below, since the difference between the values of rows 3 and 2 (9.0 - 4.0 = 5.0) > 2, so we skip rows 3 and 4.
with t as (
select 1 as id, "a" as name, 10.0 as value UNION ALL
select 2, "b", 9.0 UNION ALL
select 3, "c", 4.0 UNION ALL
select 4, "d", 1.0
)
output
id, name, value
1, a, 10.0
2, b, 9.0
We can use lag() to find the difference between consecutive rows, then join each row to all earlier rows (b.id <= a.id) and keep only rows where max(difference) <= 2.
with t1 as (
select 1 as id, 'a' as name, 10.0 as value UNION ALL
select 2, 'b', 9.0 UNION ALL
select 3, 'c', 4.0 UNION ALL
select 4, 'd', 1.0
)
select
    a.id, a.name, a.value,
    max(b.value_diff) as max_diff
from t1 a
join (
    -- difference from the previous row; 0 for the first row
    select id, abs(coalesce(value - lag(value) over (order by id), 0)) as value_diff
    from t1
) b
    on b.id <= a.id -- all rows up to and including the current one
group by a.id, a.name, a.value
having max(b.value_diff) <= 2; -- no gap so far exceeds 2
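If the dialect supports QUALIFY (e.g. BigQuery or Snowflake), a sketch that avoids the O(n²) self-join is to take a running maximum of the differences and keep rows only while it stays within the threshold:
with t1 as (
select 1 as id, 'a' as name, 10.0 as value UNION ALL
select 2, 'b', 9.0 UNION ALL
select 3, 'c', 4.0 UNION ALL
select 4, 'd', 1.0
),
diffs as (
    select *, abs(coalesce(value - lag(value) over (order by id), 0)) as value_diff
    from t1
)
select id, name, value
from diffs
where true -- BigQuery needs a WHERE/GROUP BY/HAVING alongside QUALIFY; harmless in Snowflake
-- keep a row only while no difference so far has exceeded 2
qualify max(value_diff) over (order by id rows unbounded preceding) <= 2;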
In my database, which represents a car service station, I am trying to figure out a SQL query that would give me the total average of how much a customer pays for a single service. Instead of taking AVG() of the price over all existing invoices, I want to group the invoices by the same reservation_id and then take the total average of those grouped results.
I am using the two tables listed in the picture below. I want to get the total average price by applying AVG() to all the averages produced by grouping prices by the same FK Reservation_reservation_id.
I tried to make this into a single query but failed, so I came looking for help from more experienced users. Also, I need to select (get) only the result of the total average. This result should give me an overview of how much each customer pays on average for one reservation.
Thanks for your time
You appear to want to aggregate twice:
SELECT AVG( avg_price ) avg_avg_price
FROM (
SELECT AVG( price ) AS avg_price
FROM invoice
GROUP BY reservation_reservation_id
)
Which, for the sample data:
CREATE TABLE invoice ( reservation_reservation_id, price ) AS
SELECT 1, 10 FROM DUAL UNION ALL
SELECT 1, 12 FROM DUAL UNION ALL
SELECT 1, 14 FROM DUAL UNION ALL
SELECT 1, 16 FROM DUAL UNION ALL
SELECT 2, 10 FROM DUAL UNION ALL
SELECT 2, 11 FROM DUAL UNION ALL
SELECT 2, 12 FROM DUAL;
Outputs:
AVG_AVG_PRICE
12
db<>fiddle here
If you want this per customer:
SELECT customer_customer_id, AVG(avg_reservation_price)
FROM (SELECT i.customer_customer_id, i.reservation_reservation_id,
AVG(i.price) as avg_reservation_price
FROM invoice i
GROUP BY i.customer_customer_id, i.reservation_reservation_id
) ir
GROUP BY customer_customer_id;
If you want this for a particular "checkup type" -- which is the closest I can imagine to what "service" means -- then join in the reservations table and filter:
SELECT customer_customer_id, AVG(avg_reservation_price)
FROM (SELECT i.customer_customer_id, i.reservation_reservation_id,
AVG(i.price) as avg_reservation_price
FROM invoice i JOIN
reservation r
ON i.reservation_reservation_id = r.reservation_id
WHERE r.checkup_type = ?
GROUP BY i.customer_customer_id, i.reservation_reservation_id
) ir
GROUP BY customer_customer_id;
You might want to try the below:
with aux (gr, subgr, val) as (
select 'a', 'a1', 1 from dual union all
select 'a', 'a2', 2 from dual union all
select 'a', 'a3', 3 from dual union all
select 'a', 'a4', 4 from dual union all
select 'b', 'b1', 5 from dual union all
select 'b', 'b2', 6 from dual union all
select 'b', 'b3', 7 from dual union all
select 'b', 'b4', 8 from dual)
SELECT
gr,
avg(val) average_gr,
avg(avg(val)) over () average_total
FROM
aux
group by gr;
Which, applied to your table, would result in:
SELECT
reservation_id,
avg(price) average_rn,
avg(avg(price)) over () average_total
FROM
invoices
group by reservation_id;
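Note that this form returns one row per reservation_id, with the same average_total repeated on each row. If only the single overall figure is needed, one option (a sketch, using the same assumed column names) is to collapse the result with DISTINCT:
select distinct avg(avg(price)) over () as average_total
from invoices
group by reservation_id;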
Here's my table:
ID|2ndID|Value
1|ABC|103
2|ABC|102
3|DEF|103
4|XYZ|105
My query should return every instance of an ID whose 2ndID has only one row, with that row's Value = 103. It shouldn't return IDs 1 and 2 because, apart from 103, ABC also has 102. 3|DEF, on the other hand, has only one value, 103, and I need such rows back. I don't need 4|XYZ either, since its value <> 103. Based on the above sample set, my result should only be:
3|DEF|103
I can use group by 2ndID having COUNT(*) = 1, which will return all single-row groups, but I don't know how to filter them to only Value = 103.
Thanks in advance.
This should return all the rows whose 2ndId appears only once:
select *
from my_table
where 2ndId in (
    select 2ndId
    from my_table
    group by 2ndId
    having count(*) = 1
)
And if you need to enforce the filter for value 103:
select *
from my_table
where 2ndId in (
    select 2ndId
    from my_table
    group by 2ndId
    having count(*) = 1
)
and value = 103
This is a standard application of the HAVING clause in aggregate queries. You want to group by the second id and select only the groups that have exactly one row and where MIN(value) is 103. MIN(value) is simply the unique value in the unique row for the groups that have only one row to begin with, and you don't care about any other groups.
COMMENT: This solution assumes that the combination (second_id, value) is unique - it can't appear in the table more than once, for different id's. I asked the OP in a Comment under the original question to clarify whether this is in fact the case.
with
mytable ( id, second_id, value ) as (
select 1, 'ABC', 103 from dual union all
select 2, 'ABC', 102 from dual union all
select 3, 'DEF', 103 from dual union all
select 4, 'XYZ', 105 from dual
)
-- End of SIMULATED inputs (for testing only, NOT PART OF THE SOLUTION).
-- SQL query begins BELOW THIS LINE. Use your actual table and column names.
select min(id) as id, second_id, min(value) as value
from mytable
group by second_id
having count(*) = 1 and min(value) = 103
;
ID SECOND_ID VALUE
-- --------- -----
3 DEF 103
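If the combination (second_id, value) can in fact repeat, a variant of the same idea (a sketch, untested against the OP's real data) is to count distinct values instead of rows, so duplicates of the same value still count as a single candidate:
select min(id) as id, second_id, min(value) as value
from mytable
group by second_id
having count(distinct value) = 1 and min(value) = 103;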
Get the difference between each row of a single column in SQL Server.
In my table, ORIGINAL_DATA is my current column, and I want to generate a new EXPECTED_OUTPUT column showing the difference between each row and the previous one,
like 40-30 = 10, 30-25 = 5, 25-10 = 15.
SELECT 10 ORIGINAL_DATA,0 EXPECTED_OUTPUT UNION
SELECT 25, 15 UNION
SELECT 30, 5 UNION
SELECT 40, 10
I have used the LEAD function, but it is not supported by my current version of SQL Server.
So can you please help me solve this without LEAD and without a self join?
As my query is already taking too much time, I have included only sample data here.
Try this:
WITH rows AS
(SELECT Column1,
        ROW_NUMBER() OVER (ORDER BY Column1) AS rn -- order by the value itself so "previous row" is well-defined
 FROM (
     SELECT 10 Column1 UNION
     SELECT 25 UNION
     SELECT 30 UNION
     SELECT 40
 ) M)
SELECT
    mc.Column1,
    CAST(mc.Column1 AS float) - CAST(COALESCE(mp.Column1, mc.Column1) AS float) AS EXPECTED_OUTPUT -- COALESCE makes the first row 0 rather than NULL
FROM rows mc
LEFT JOIN rows mp
    ON mp.rn = mc.rn - 1;
Try this; change the SQL script to match your table structure (PARTITION BY A, ORDER BY B):
SELECT ORIGINAL_DATA,
       ORIGINAL_DATA - LEAD(ORIGINAL_DATA, 1, ORIGINAL_DATA) OVER (PARTITION BY A ORDER BY B DESC) AS EXPECTED_OUTPUT -- defaulting to ORIGINAL_DATA makes the last row 0 rather than the row's own value
FROM Table1
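Since the question says LEAD is unavailable, here is a sketch that avoids both LEAD and an explicit self-join by using a correlated subquery. It assumes ORIGINAL_DATA itself defines the row order (as in the sample data) and that Table1 is the actual table name:
SELECT t1.ORIGINAL_DATA,
       t1.ORIGINAL_DATA - COALESCE(
           (SELECT MAX(t2.ORIGINAL_DATA) -- the closest smaller value, i.e. the "previous row"
            FROM Table1 t2
            WHERE t2.ORIGINAL_DATA < t1.ORIGINAL_DATA),
           t1.ORIGINAL_DATA) AS EXPECTED_OUTPUT -- first row has no predecessor, so the difference is 0
FROM Table1 t1;
Note that the correlated subquery still reads the table once per row, so with an index on ORIGINAL_DATA it should be fine, but on a large unindexed table it may not beat the ROW_NUMBER approach above.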