Select n random rows where n is proportional to each value's % of the total population - sql

I have a table of 58 million customer records. Each customer record has a market (EN, US, FR, etc.).
I'm trying to select a 100k sample set that contains customers from every market. The ratio of customers per market in the sample must match the ratios in the actual table.
So if UK customers account for 15% of the records in the customer table, then there must be 15k UK customers in the 100k sample set, and likewise for every other market.
Is there a way to do this?

First, a simple random sample should already do pretty well at representing the market sizes. What you are asking for is a stratified sample.
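For comparison, a plain (non-stratified) random sample in SQL Server might look like the sketch below; only the customers table and the 100k size are taken from the question, the rest is just one common way to do it:

select top (100000) c.*
from customers c
order by newid();     -- shuffle the whole table, keep the first 100k rows
-- (on 58M rows this sorts everything; TABLESAMPLE (100000 ROWS) is cheaper,
--  but it samples pages and returns only approximately 100k rows)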
One way to get a stratified sample is to order the data randomly within each group and assign a sequential number in each group. Then normalize the sequential number to be between 0 and 1, and finally order by the normalized value and choose the top "n" rows:
select top 100000 c.*
from (select c.*,
             row_number() over (partition by market
                                order by rand(checksum(newid()))) as seqnum,
             count(*) over (partition by market) as cnt
      from customers c
     ) c
order by cast(seqnum as float) / cnt
It may be clearer what is happening if you look at some data. Consider taking a sample of 5 from:
1 A
2 B
3 C
4 D
5 D
6 D
7 B
8 A
9 D
10 C
The first step assigns a sequential number randomly within each market:
1 A 1
2 B 1
3 C 1
4 D 1
5 D 2
6 D 3
7 B 2
8 A 2
9 D 4
10 C 2
Next, normalize these values by dividing each sequential number by its market's count:
1 A 1 0.50
2 B 1 0.50
3 C 1 0.50
4 D 1 0.25
5 D 2 0.50
6 D 3 0.75
7 B 2 1.00
8 A 2 1.00
9 D 4 1.00
10 C 2 1.00
Now, if you take the top 5 ordered by the normalized value, you get rows 4, 1, 2, 3 and 5: two D customers and one each from A, B and C. That matches the 40/20/20/20 split of the original data, which is exactly a stratified sample.

With a sample that big, a casual (purely random) extraction will already give you a good statistical approximation of the original population, as pointed out by Gordon Linoff.
To force the percentages in the sample to match the population, you can compute all the needed parameters: the size of the whole population, the size of each partition, and a random row number within each partition.
Declare @sampleSize INT;
Set @sampleSize = 100000;

With D AS (
    SELECT customerID
         , Country
         , Count(customerID) OVER ()                     AS TotalData   -- rows in the whole table
         , Count(customerID) OVER (PARTITION BY Country) AS CountryData -- rows for this country
         , Row_Number() OVER (PARTITION BY Country
                              ORDER BY rand(checksum(newid()))) AS ID   -- random rank within the country
    FROM customer
)
SELECT customerID
     , Country
FROM D
WHERE ID <= Round((Cast(CountryData AS Float) / TotalData) * @sampleSize, 0)
ORDER BY Country
There is a SQLFiddle demo with less data. Be aware that the rounding in the WHERE condition can make the number of returned rows slightly smaller or larger than the requested sample size; in the demo, for example, 9 rows are returned instead of 10.
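If you need exactly 100,000 rows, one option is to combine the two answers: over-select slightly with CEILING and then trim back with TOP, ordering by the normalized rank so the trimming stays roughly proportional. This is only a sketch built on the query above, not part of the original answer:

Declare @sampleSize INT;
Set @sampleSize = 100000;

With D AS (
    SELECT customerID
         , Country
         , Count(customerID) OVER ()                     AS TotalData
         , Count(customerID) OVER (PARTITION BY Country) AS CountryData
         , Row_Number() OVER (PARTITION BY Country
                              ORDER BY rand(checksum(newid()))) AS ID
    FROM customer
)
SELECT TOP (@sampleSize) customerID
     , Country
FROM D
-- Ceiling keeps at least each country's proportional share (at most one extra row per country)
WHERE ID <= Ceiling((Cast(CountryData AS Float) / TotalData) * @sampleSize)
-- ordering by the normalized rank trims the surplus roughly evenly across countries
ORDER BY Cast(ID AS Float) / CountryData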

Related

Merge row values based on other column value

I'm trying to merge the values of two rows based on the value of another column. Below is my base table:
Customer ID   Property ID   Bookings per customer   Cancellations per customer
A             1             0                       1
B             2             10                      1
C             3             100                     1
C             4             100                     1
D             5             20                      1
Here is the SQL query I used
select customer_id, property_id, bookings_per_customer, cancellations_per_customer
from table
And this is what I want to see (below). Any idea what the query to get this would be? We use Presto SQL. Thanks!
Customer ID   Property ID   Bookings per customer   Cancellations per customer
A             1             0                       1
B             2             10                      1
C             3, 4          100                     1
D             5             20                      1
We can try:
SELECT
    customer_id,
    ARRAY_JOIN(ARRAY_AGG(property_id), ',') AS properties,
    bookings_per_customer,
    cancellations_per_customer
FROM yourTable
GROUP BY
    customer_id,
    bookings_per_customer,
    cancellations_per_customer;
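This relies on bookings_per_customer and cancellations_per_customer being identical on every row for a given customer, as they are in the sample data. If they could differ, a sketch of a variant (not part of the answer above) would aggregate those columns instead of grouping by them:

SELECT
    customer_id,
    ARRAY_JOIN(ARRAY_AGG(property_id), ',') AS properties,
    MAX(bookings_per_customer)              AS bookings_per_customer,
    MAX(cancellations_per_customer)         AS cancellations_per_customer
FROM yourTable
GROUP BY customer_id;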

Select a column's occurrence order without group by

I currently have two tables, users and coupons
users:
id   first_name
1    Roberta
2    Oliver
3    Shayna
4    Fechin

coupons:
id   discount   user_id
1    20%        1
2    40%        2
3    15%        3
4    30%        1
5    10%        1
6    70%        4
What I want to do is select from the coupons table until I've selected X users.
So if I chose X = 2, the resulting table would be:
id   discount   user_id
1    20%        1
2    40%        2
4    30%        1
5    10%        1
I've tried using both dense_rank and row_number, but they return the count of occurrences of each user_id, not its order.
SELECT id,
       discount,
       user_id,
       dense_rank() OVER (PARTITION BY user_id)
FROM coupons
I'm guessing I need to do it in multiple subqueries (which is fine) where the first subquery would return something like
id   discount   user_id   order_of_occurence
1    20%        1         1
2    40%        2         2
3    15%        3         3
4    30%        1         1
5    10%        1         1
6    70%        4         4
which I can then use to filter by what I need.
PS: I'm using postgresql.
You've stated that you want to parameterize the query so that you can retrieve X users. I'm reading that as all coupons for the first X distinct user_ids in coupon id column order.
It appears your attempt was close: dense_rank() is the right idea, but since you want to look across the entire table you can't partition by user_id. You also need a sorting key that reflects each user's first appearance, which is the user's smallest coupon id.
with data as (
    select c.*,
           dense_rank() over (order by first_id) as dr
    from (select *,
                 -- the earliest coupon id per user, so dr ranks users by first appearance
                 min(id) over (partition by user_id) as first_id
          from coupons
         ) c
)
select * from data where dr <= <X>;
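Since the question mentions PostgreSQL, the same result can also be expressed without window functions. The following is only a sketch, not part of the answer above: pick the first X user_ids by their earliest coupon id and join back. With X = 2 it returns coupons 1, 2, 4 and 5, matching the desired output.

select c.id, c.discount, c.user_id
from coupons c
join (select user_id
      from coupons
      group by user_id
      order by min(id)     -- order users by their first appearance
      limit 2              -- X = 2
     ) u on u.user_id = c.user_id
order by c.id;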

How to count a field's distinct values cumulatively using a recursive CTE or another method in SQL?

Using the example below, Day 1 has 1, 1 and 3 distinct name(s) for houses A, B and C respectively.
When calculating the distinct name(s) for each house on Day 2, all data up to Day 2 is used.
When calculating the distinct name(s) for each house on Day 3, all data up to Day 3 is used.
Can a recursive CTE be used?
Data:
Day   House   Name
1     A       Jack
1     B       Pop
1     C       Anna
1     C       Dew
1     C       Franco
2     A       Jon
2     B       May
2     C       Anna
3     A       Jon
3     B       Ken
3     C       Dew
3     C       Dew
Result:
Day   House   Distinct names
1     A       1
1     B       1
1     C       3
2     A       2 (Jack and Jon)
2     B       2
2     C       3
3     A       2 (Jack and Jon)
3     B       3
3     C       3
Without knowing the purpose and the size of the data it is hard to give an ideal/optimal solution. Assuming a small dataset that needs a quick and dirty calculation, just use a correlated subquery like this:
SELECT p.[Day]
     , p.House
     , (SELECT COUNT(DISTINCT [Name])
        FROM #Bing
        WHERE [Day] <= p.[Day] AND House = p.House) AS DistinctNames
FROM #Bing p
GROUP BY [Day], House
ORDER BY 1
There is no need for a recursive CTE. Just mark the first time a name is seen in a house and use a cumulative sum:
select day, house,
       sum(sum(case when seqnum = 1 then 1 else 0 end))
           over (partition by house order by day) as num_unique_names
from (select t.*,
             row_number() over (partition by house, name order by day) as seqnum
      from t
     ) t
group by day, house
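The nested sum is needed because most engines do not support a cumulative COUNT(DISTINCT ...) window: the inner sum counts, per day and house, how many names make their first ever appearance that day, and the outer windowed sum accumulates those counts across days. If you happen to be on PostgreSQL, the same idea can be written with a FILTER clause; a sketch, keeping the answer's generic table name t:

select day, house,
       sum(count(*) filter (where seqnum = 1))
           over (partition by house order by day) as num_unique_names
from (select t.*,
             -- seqnum = 1 marks the first day a given name is seen in a house
             row_number() over (partition by house, name order by day) as seqnum
      from t
     ) t
group by day, house;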

Assign a category to a product without repeating

I have a function that produces a table like below. The sequence is important here.
I want each product to be assigned to a separate category, but if there is a change over time, e.g. Product D to Product C (row 11), then another category should be created. The result I want to get is in the Result column.
Order   Number   Product     Result
1       106893   Product A   1
2       108468   Product B   2
3       108468   Product B   2
4       107011   Product C   3
5       107011   Product C   3
6       107011   Product C   3
7       107011   Product D   4
8       107011   Product D   4
9       107011   Product D   4
10      107011   Product D   4
11      107011   Product C   5
12      107011   Product E   6
13      107011   Product E   6
I tried to do it with rank(), but at row 11 it again gives me 3 instead of 5.
I did manage to do it with a CTE, but it takes a long time to calculate even on a small sample. There must be a simpler and faster way.
Use lag() to see where there is a switch. Then use a cumulative sum:
select t.*,
       sum(case when prev_product = product then 0 else 1 end) over (order by order)
from (select t.*,
             lag(product) over (order by order) as prev_product
      from t
     ) t;
order is, of course, a SQL keyword. It is a bad name for a column and may need to be escaped if that is the actual name.
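For example, with standard double-quote identifier quoting (SQL Server also accepts square brackets) and assuming the column really is named Order as shown in the question, a sketch of the same query:

select t.*,
       sum(case when prev_product = product then 0 else 1 end)
           over (order by "Order") as result
from (select t.*,
             lag(product) over (order by "Order") as prev_product
      from t
     ) t;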

Oracle SQL Count grouped rows in table

I was wondering if it is possible, preferably using a single select statement on Oracle 11 (PL/SQL), to get the following results from this table:
Area Store Product
10 1 A
10 1 B
11 1 E
11 1 D
10 2 C
10 2 B
10 2 A
10 3 B
10 3 A
13 1 B
13 1 A
and return this result: it groups by Area and Store, then looks for other area/store combinations that have exactly the same set of products and counts them. For example, Area 10 Store 1 has products A and B, so it looks through the list for other stores that have only A and B and counts them; in this example that counts Area 10 Store 1, Area 10 Store 3 and Area 13 Store 1.
Product   Count of groups
AB        3
ABC       1
DE        1
Thanks in advance for the help.
Yes, you can use listagg() and then another group by:
select products, count(*)
from (select listagg(product) within group (order by product) as products
      from t
      group by area, store
     ) p
group by products;
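If you want a separator between the products in the output (for example A,B instead of AB), LISTAGG accepts an optional delimiter argument; a sketch, with t standing in for your table name as in the answer:

select products, count(*) as group_count
from (select listagg(product, ',') within group (order by product) as products
      from t
      group by area, store
     ) p
group by products;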