How to find number of distinct phones per customer and put the customers (counts) in different buckets as per the counts?

Below is the table where I have customer_id and different phones they have.
customer_id phone_number
101 123456789
102 234567891
103 345678912
102 456789123
101 567891234
104 678912345
105 789123456
106 891234567
106 912345678
106 456457234
101 655435664
107 453426782
Now, I want to find customer_id and distinct phone number count.
So I used this query:
select customer_id, count(distinct phone_number)
from customer_phone
group by customer_id;
customer_id no of phones
101 3
102 2
103 1
104 1
105 1
106 3
107 1
And, from the above table my final goal is to achieve the below output which takes the counts and puts in different buckets and then count number of consumers that fall in those buckets.
Buckets no of consumers
3 2
2 1
1 4
There are close to 200 million records. Can you please explain an efficient way to work on this?

You can use width_bucket for that:
select bucket, count(*)
from (
select width_bucket(count(distinct phone_number), 1, 10, 10) as bucket
from customer_phone
group by customer_id
) t
group by bucket;
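width_bucket(..., 1, 10, 10) creates ten equal-width buckets over the range from 1 (inclusive) to 10 (exclusive), so counts 1 through 9 land in buckets 1 through 9. A count of 10 or more lands in the overflow bucket 11, and anything below 1 would land in bucket 0.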
Online Example: http://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=1e6d55305570499f363837aba21bdc7e
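To see where individual counts land, here is a minimal Python sketch that mimics the width_bucket semantics described above for ascending bounds. The function below is a hypothetical re-implementation for illustration only, not the database's own code.

```python
import math

def width_bucket(value, low, high, buckets):
    """Mimic SQL width_bucket for low < high (illustrative sketch only)."""
    if value < low:
        return 0                      # underflow bucket
    if value >= high:
        return buckets + 1            # overflow bucket
    # Equal-width buckets over the half-open range [low, high).
    return math.floor((value - low) * buckets / (high - low)) + 1

# Distinct-phone counts 1..10 mapped with the same arguments as the query.
print([width_bucket(v, 1, 10, 10) for v in range(1, 11)])
```

With these arguments the counts 1 to 9 keep their own bucket number, while 10 falls into bucket 11. If every count should map to itself, grouping directly on the count avoids the issue.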

Use two aggregations:
select cnt, count(*), min(customer_id), max(customer_id)
from (select customer_id, count(distinct phone_number) as cnt
from customer_phone
group by customer_id
) c
group by cnt
order by cnt;
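As a quick check of the two-level aggregation, here is a small Python sketch that runs the same idea against the question's sample rows. It assumes an in-memory SQLite database instead of the real one, and the helper name bucket_counts is made up for this example.

```python
import sqlite3

# Sample rows copied from the question: (customer_id, phone_number).
ROWS = [
    (101, 123456789), (102, 234567891), (103, 345678912), (102, 456789123),
    (101, 567891234), (104, 678912345), (105, 789123456), (106, 891234567),
    (106, 912345678), (106, 456457234), (101, 655435664), (107, 453426782),
]

def bucket_counts(rows):
    """Return (distinct phone count, number of customers) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table customer_phone (customer_id int, phone_number int)")
    conn.executemany("insert into customer_phone values (?, ?)", rows)
    result = conn.execute("""
        select cnt, count(*) as customers
        from (select customer_id, count(distinct phone_number) as cnt
              from customer_phone
              group by customer_id) c
        group by cnt
        order by cnt desc
    """).fetchall()
    conn.close()
    return result

print(bucket_counts(ROWS))  # [(3, 2), (2, 1), (1, 4)]
```

The inner query produces one row per customer, and the outer query counts customers per distinct-phone count, which matches the buckets table in the question.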

Related

Distribute large quantities over multiple rows

I have a simple Order table. One order can contain different products, each with a quantity and the product's total weight, as below:
OrderID ProductName Qty Weight
101 ProductA 2 24
101 ProductB 1 24
101 ProductC 1 48
101 ProductD 1 12
101 ProductE 1 12
102 ProductA 5 60
102 ProductB 1 12
I am trying to partition and group the products so that, within an order, the total weight of each group does not exceed 48.
The expected table looks like this:
OrderID ProductName Qty Weight GroupedID
101 ProductA 2 24 1
101 ProductB 1 24 1
101 ProductC 1 48 2
101 ProductD 1 12 3
101 ProductE 1 12 3
102 ProductA 4 48 1
102 ProductA 1 12 2
102 ProductB 1 12 2
Kindly let me know if this is possible.
Thank you.
This is a bin packing problem, which is non-trivial in general. It is NP-hard: no known algorithm finds an optimal packing in polynomial time, so exact solutions quickly become impractical as the number of items grows. Dai posted a link to Hugo Kornelis's article series, which is referenced by everyone trying to solve this problem. The set-based solution performs really badly. For realistic scenarios you need iteration, preferably using bin packing libraries, e.g. in Python.
For production work it would be better to take advantage of SQL Server 2017+'s support for Python scripts and use a bin packing library like Google's OR Tools or the binpacking module. Even if you don't want to use sp_execute_external_script you can use a Python script to read the data from the database and split them.
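To give a feel for what an iterative approach looks like, here is a minimal Python sketch of the first-fit-decreasing heuristic. This is a plain hand-written heuristic chosen for illustration, not the API of OR-Tools or the binpacking module, and it does not guarantee an optimal packing.

```python
def first_fit_decreasing(weights, capacity):
    """Pack weights into bins of the given capacity (heuristic, not optimal)."""
    bins = []
    # Placing the largest items first tends to leave fewer awkward gaps.
    for weight in sorted(weights, reverse=True):
        if weight > capacity:
            raise ValueError(f"item of weight {weight} exceeds capacity {capacity}")
        for current in bins:
            if sum(current) + weight <= capacity:
                current.append(weight)   # first bin with enough room
                break
        else:
            bins.append([weight])        # no bin had room: open a new one
    return bins

# Individual item weights of order 101 from the question, limit 48.
print(first_fit_decreasing([12, 12, 24, 48, 12, 12], 48))
# [[48], [24, 12, 12], [12, 12]]
```

A script like this could read the order lines from the database, pack each order separately, and write the group numbers back.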
The question's numbers are so regular though you could cheat a bit (actually quite a lot) and distribute all order lines into individual items, calculate the running total per order and then divide the total by the limit to produce the group number.
This works only because the running totals are guaranteed to align with the bin size.
Distributing into items can be done using a Tally/Numbers table, a table with a single Number column storing numbers from 0 to eg 1M.
Given the question's data:
create table #OrderItems (id int identity(1,1) primary key, OrderID int, ProductName varchar(20), Qty int, Weight int);
insert into #OrderItems(OrderId,ProductName,Qty,Weight)
values
(101,'ProductA',2,24),
(101,'ProductB',1,24),
(101,'ProductC',1,48),
(101,'ProductD',1,12),
(101,'ProductE',1,12),
(102,'ProductA',5,60),
(102,'ProductB',1,12);
The following query splits each order item into individual items. It repeats each order item row as many times as there are individual items and calculates the weight of a single item:
select o.*, Weight/Qty as ItemWeight
from #OrderItems o inner join Numbers ON Qty >Numbers.Number;
This row:
1 101 ProductA 2 24
Becomes
1 101 ProductA 2 24 12
1 101 ProductA 2 24 12
Calculating the running total inside a query can be done with :
SUM(ItemWeight) OVER(Partition By OrderId
Order By Itemweight
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
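The Order By ItemWeight clause means the smallest items are picked first. The running total is then cut into consecutive bins of 48, which is closer to a next-fit pass over items sorted by weight than to an optimized packing.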
The overall query calculating the total and Group ID is
with items as (
select o.*, Weight/Qty as ItemWeight
from #OrderItems o INNER JOIN Numbers ON Qty > Numbers.Number
)
select Id,OrderId,ProductName,Qty,Weight, ItemWeight,
ceiling(SUM(ItemWeight) OVER(Partition By OrderId
Order By Itemweight
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)/48.0)
As GroupId
from items;
After that, individual items need to be grouped back into order items and groups. This produces the final query:
with items as (
select o.*, Weight/Qty as ItemWeight
from #OrderItems o INNER JOIN Numbers ON Qty > Numbers.Number
)
,bins as(
select Id,OrderId,ProductName,Qty,Weight, ItemWeight,
ceiling(SUM(ItemWeight) OVER(Partition By OrderId
Order By Itemweight
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)/48.0) As GroupId
from items
)
select
max(OrderId) as orderid,
max(productname) as ProductName,
count(*) as Qty,
sum(ItemWeight) as Weight,
max(GroupId) as GroupId
from bins
group by id,groupid
order by orderid,groupid
This returns
orderid ProductName Qty Weight GroupId
101 ProductA 2 24 1
101 ProductD 1 12 1
101 ProductE 1 12 1
101 ProductB 1 24 2
101 ProductC 1 48 3
102 ProductA 4 48 1
102 ProductA 1 12 2
102 ProductB 1 12 2
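The whole pipeline can be tried outside SQL Server with a small Python sketch. It rests on several assumptions: an in-memory SQLite database (3.25 or later, for window functions) stands in for SQL Server, a recursive CTE stands in for the Numbers table, integer arithmetic replaces ceiling(), and id is added as a tie-breaker in the window ordering so the result is deterministic. The helper name group_order_items is made up for this example.

```python
import sqlite3

# (id, order_id, product_name, qty, weight) copied from the question.
ORDER_ITEMS = [
    (1, 101, "ProductA", 2, 24), (2, 101, "ProductB", 1, 24),
    (3, 101, "ProductC", 1, 48), (4, 101, "ProductD", 1, 12),
    (5, 101, "ProductE", 1, 12), (6, 102, "ProductA", 5, 60),
    (7, 102, "ProductB", 1, 12),
]

QUERY = """
with recursive numbers(number) as (      -- stand-in for the Numbers table
    select 0 union all select number + 1 from numbers where number < 99
),
items as (                               -- one row per individual item
    select o.*, weight / qty as item_weight
    from order_items o join numbers n on o.qty > n.number
),
bins as (                                -- running total cut into bins of 48
    select id, order_id, product_name, item_weight,
           (sum(item_weight) over (partition by order_id
                                   order by item_weight, id
                                   rows between unbounded preceding and current row)
            + 47) / 48 as group_id       -- integer ceiling of total / 48
    from items
)
select order_id, product_name, count(*) as qty, sum(item_weight) as weight, group_id
from bins
group by id, order_id, product_name, group_id
order by order_id, group_id, id
"""

def group_order_items(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("create table order_items (id int, order_id int, "
                 "product_name text, qty int, weight int)")
    conn.executemany("insert into order_items values (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

for row in group_order_items(ORDER_ITEMS):
    print(row)
```

Without the id tie-breaker, items of equal weight could be ordered arbitrarily, so which product spills into the next group for order 102 would not be guaranteed.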

Need help in forming a SQL query

We have 2 tables called tbl1 and tbl2. They contain columns such as Visit_ID, Customer ID, and so on. There are instances where a Visit_ID is associated with multiple Customer IDs.
For example, when a customer logs into a website, a unique Visit_ID is generated for each visit to the website.
In one visit, multiple customers can log in to their accounts and make individual purchases.
If a visit has more than 2 Customer IDs, the remaining retail Customer IDs should be appended together in one additional column.
For instance, there are visits that have 200 Customer IDs attached.
For example, if there are 7 Customer IDs in 1 visit, the first column should show the 1st Customer ID
and the second column should show the 2nd Customer ID.
The 3rd to 7th Customer IDs, all 5 of them, should be comma separated in a third column.
Can someone help me frame a SQL query using this logic?
with CTE as (
SELECT
visit_id,
B.visitpg_nbr::INT AS visitpg_nbr,
CUSTOMER_ID,
dense_rank()over( PARTITION BY VISIT_ID order by CUSTOMER_ID) as rank
from
db_name.schema_name.tbl_1 A
JOIN db_name.schema_name.tbl_2 B
ON B.id_column = A.id_column
JOIN db_name.schema_name.tbl_3 C
ON CAST(C.xid as VARCHAR)= A.CUSTOMER_ID
WHERE flg_col = '0'
AND so_cd NOT IN ('0','1','2','3')
AND DATE_COL = '2022-01-17'
and visit_id='12345'
ORDER BY visitpg_nbr
)
select VISIT_ID, arr[0], arr[1], array_to_string( array_slice(arr, 2, 99999), ', ')
from (
select VISIT_ID, array_agg(distinct CUSTOMER_ID) within group(order by CUSTOMER_ID) arr
from CTE
group by 1
);
Thanks for those who have responded. I really appreciate their guidance. The logic worked fine. When I'm joining 3 tables inside CTE, I'm getting lot of duplicates. I want to eliminate the duplicate values.
When I run the below query that I have included inside CTE, I'm getting records which are duplicates.
SELECT
visit_id,
B.visitpg_nbr::INT AS visitpg_nbr,
CUSTOMER_ID,
dense_rank()over( PARTITION BY VISIT_ID order by CUSTOMER_ID) as rank
from
db_name.schema_name.tbl_1 A
JOIN db_name.schema_name.tbl_2 B
ON B.id_column = A.id_column
JOIN db_name.schema_name.tbl_3 C
ON CAST(C.xid as VARCHAR)= A.CUSTOMER_ID
WHERE flg_col = '0'
AND so_cd NOT IN ('0','1','2','3')
AND DATE_COL = '2022-01-17'
and visit_id='12345'
ORDER BY visitpg_nbr
Row VISIT_ID CUSTOMER_ID VISITPG_NBR RANK
**1 12345 100 1 1**
2 12345 100 2 1
3 12345 100 3 1
4 12345 100 4 1
5 12345 100 5 1
**6 67891 101 6 2**
7 67891 101 7 2
8 67891 101 8 2
9 67891 101 9 2
10 67891 101 10 2
**11 78910 102 11 3**
12 78910 102 12 3
13 78910 102 13 3
14 78910 102 14 3
Is there any logic to display the distinct results in the CTE temp table?
The final result should be populated as below.
VISIT_ID First_Customer Second_Customer Other_Customers
1 100 101 102,103,104,105,106
2 200 201 202,203,204,205
The first Customer_ID should be displayed in the First_Customer column and the second Customer_ID in the Second_Customer column. All the other Customer_IDs should be displayed in the final column, comma separated.
Also, I wanted the results to be ordered by visitpg_nbr
You should be able to get this with array_agg(), and then choosing the first, second, and subsequent (array_slice()) elements:
with data as (
select *
from snowflake_sample_data.tpch_sf100.orders
where o_custkey between 5411266 and 5411290
)
select o_custkey, arr[0], arr[1], array_to_string(array_slice(arr, 2, 99999), ', ')
from (
select o_custkey, array_agg(o_orderkey) within group(order by o_orderdate) arr
from data
group by 1
);
You might need to deduplicate the ids first if there are many repeats; you can do that with a subquery before array_agg().
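The first, second and rest split, including the deduplication step, can be sketched in plain Python to make the intended result explicit. The function name split_customers and the input list are made up for this example; the list is assumed to be already ordered the way the query's ORDER BY would order it.

```python
def split_customers(customer_ids):
    """Return (first, second, comma-separated rest) for one visit."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(customer_ids))
    first = unique[0] if len(unique) > 0 else None
    second = unique[1] if len(unique) > 1 else None
    others = ", ".join(str(c) for c in unique[2:])
    return first, second, others

# Duplicated ids, as produced by the joins in the question.
print(split_customers([100, 100, 101, 101, 102, 103, 103, 104]))
# (100, 101, '102, 103, 104')
```

This mirrors arr[0], arr[1] and array_slice(arr, 2, ...) applied to a deduplicated array.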
This is slightly different from Felipe's answer, and I'm not sure which would be more performant. I suspect his, but here is another way to try it.
SELECT visit_id, first_customer, second_customer
,array_agg(other_ids) within group (order by order_id) as other_customer
FROM(
SELECT visit_id,
order_id,
first_value(customer_id) over (partition by visit_id order by order_id) as first_customer,
nth_value(customer_id, 2) over (partition by visit_id order by order_id rows between unbounded preceding and unbounded following) as second_customer,
IFF(row_number() over (partition by visit_id order by order_id) > 2, customer_id, null) as other_ids
FROM VALUES
(1,100, 1),
(1,101, 2),
(1,102, 3),
(1,103, 5),
(1,104, 6),
(1,105, 6),
(1,106, 7),
(2,200, 1),
(2,201, 2),
(2,202, 3),
(2,203, 4)
v(visit_id, customer_id, order_id)
)
GROUP BY 1,2,3
ORDER BY 1,2,3;
VISIT_ID FIRST_CUSTOMER SECOND_CUSTOMER OTHER_CUSTOMER
1 100 101 [ 102, 103, 104, 105, 106 ]
2 200 201 [ 202, 203 ]

How to Select ID's in SQL (Databricks) in which at least 2 items from a list are present

I'm working with patient-level data in Azure Databricks and I'm trying to build out a cohort of patients that have at least 2 diagnoses from a list of specific diagnosis codes. This is essentially what the table looks like:
CLAIM_ID | PTNT_ID | ICD_CD | DATE
---------+---------+--------+------------
1 101 2500 01_25_2020
2 101 3850 03_13_2018
3 222 2500 10_26_2018
4 222 8888 11_30_2018
5 222 9155 04_01_2019
6 871 2500 02_17_2020
7 871 3200 09_09_2019
The list of ICD_CD codes of interest is something like [2500, 3850, 8888]. In this case, I would want to return TOTAL UNIQUE PTNT_ID = 2. These would be PTNT_ID = (101, 222) as these are the only two patients that have at least 2 ICD_CD codes of interest.
When I use something like this, I'm able to return all of the relevant PTNT_ID values, but I'm not able to get the total count of these PTNT_ID:
select mc.PTNT_ID
from MEDICAL_CLAIMS mc
where mc.ICD_CD in ( -- list of ICD_CD of interest
)
group by mc.PTNT_ID
having count(distinct mc.ICD_CD) >= 2
When I try to add a COUNT statement in, it returns an error
Just select from the query:
select count(*)
from
(
select mc.PTNT_ID
from MEDICAL_CLAIMS mc
where mc.ICD_CD in (2500, 3850, 8888) -- list of ICD_CD of interest
group by mc.PTNT_ID
having count(distinct mc.ICD_CD) >= 2
) ptnts;
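To check the wrapped count against the question's sample table, here is a small Python sketch. It assumes an in-memory SQLite database in place of Databricks, filters on ICD_CD, and uses a made-up helper name count_patients; the DATE column is left out because it plays no role in the count.

```python
import sqlite3

# (claim_id, ptnt_id, icd_cd) copied from the question's sample table.
CLAIMS = [
    (1, 101, 2500), (2, 101, 3850), (3, 222, 2500), (4, 222, 8888),
    (5, 222, 9155), (6, 871, 2500), (7, 871, 3200),
]

def count_patients(claims, codes, minimum=2):
    """Count patients with at least `minimum` distinct codes from `codes`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table medical_claims (claim_id int, ptnt_id int, icd_cd int)")
    conn.executemany("insert into medical_claims values (?, ?, ?)", claims)
    placeholders = ", ".join("?" for _ in codes)   # one ? per code of interest
    (total,) = conn.execute(f"""
        select count(*)
        from (select mc.ptnt_id
              from medical_claims mc
              where mc.icd_cd in ({placeholders})
              group by mc.ptnt_id
              having count(distinct mc.icd_cd) >= ?) ptnts
    """, (*codes, minimum)).fetchone()
    conn.close()
    return total

print(count_patients(CLAIMS, [2500, 3850, 8888]))  # 2
```

Patients 101 and 222 each have two of the listed codes, while 871 has only one, so the count is 2.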

Find Duplicates in a table

My table contains multiple lots (LOT_ID) and each lot contains multiple products(PRODUCT_ID) and there are multiple orders (ORDER_ID) under each Product. I would like to know the order ID’s which are repeated for multiple products for a given LOT
S.NO LOT_ID Product_ID Order_ID
1 101 P108 90001
2 101 P109 90001
3 101 P110 80900
4 102 S189 10098
5 102 S234 10087
6 102 S465 10098
7 102 S342 10050
8 103 L109 20090
9 103 L110 20098
10 103 L111 20020
Desired result
S.NO LOT_ID Product_ID Order_ID
1 101 P108 90001
2 101 P109 90001
3 102 S189 10098
4 102 S465 10098
I think you should group by Order_ID first to find the repeated orders, and then select the matching rows. The query below is untested:
select LOT_ID, Product_ID, Order_ID
from <tableName>
where Order_ID IN (SELECT Order_ID FROM <tableName> where LOT_ID in (101,102)
GROUP BY Order_ID HAVING COUNT(*) > 1);
Count the repeats with window functions, then filter on the counts you need:
select t.*, count(*) over (partition by t.LOT_ID, t.Product_ID, t.Order_ID) as c
, count(*) over (partition by t.LOT_ID, t.Order_ID) as c2
from t
The rows you want are those where c2 is greater than c, i.e. the same Order_ID appears under more than one Product_ID within a lot.
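As an alternative way to express the same filter, here is a Python sketch that groups per lot and order and keeps only orders tied to more than one product. It assumes an in-memory SQLite database and a hypothetical table name lots; this is a swapped-in GROUP BY variant, not the window-function query itself.

```python
import sqlite3

# (lot_id, product_id, order_id) copied from the question.
LOT_ROWS = [
    (101, "P108", 90001), (101, "P109", 90001), (101, "P110", 80900),
    (102, "S189", 10098), (102, "S234", 10087), (102, "S465", 10098),
    (102, "S342", 10050), (103, "L109", 20090), (103, "L110", 20098),
    (103, "L111", 20020),
]

def repeated_orders(rows):
    """Return rows whose order is shared by several products in one lot."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table lots (lot_id int, product_id text, order_id int)")
    conn.executemany("insert into lots values (?, ?, ?)", rows)
    result = conn.execute("""
        select t.lot_id, t.product_id, t.order_id
        from lots t
        join (select lot_id, order_id
              from lots
              group by lot_id, order_id
              having count(distinct product_id) > 1) d
          on d.lot_id = t.lot_id and d.order_id = t.order_id
        order by t.lot_id, t.product_id
    """).fetchall()
    conn.close()
    return result

for row in repeated_orders(LOT_ROWS):
    print(row)
```

Grouping by both lot and order keeps an order that happens to recur in a different lot from being reported as a duplicate.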

How to select rows based on condition

The following is sample data.
I have added it just to illustrate the design.
A user can be assigned to multiple groups.
I want to select one row of details per person.
Here person 103 has two different permissions for the same product.
Only the higher permission should be selected for that person.
If the person is not assigned to multiple groups, the default permission should be selected.
Sample data
ProdId PersonId GroupId Permission
10103 78 55 15
10103 99 33 15
10103 100 33 0
10103 103 33 15
10103 103 40 0
10103 112 33 15
Result data should be
ProdId PersonId Permission
10103 78 15
10103 99 15
10103 100 0
10103 103 15
10103 112 15
You should use ROW_NUMBER() :
SELECT * FROM (
SELECT t.*,
ROW_NUMBER() OVER(PARTITION BY t.prodid,t.personID ORDER BY t.permission DESC) as rnk
FROM YourTable t) s
WHERE s.rnk = 1
I assumed you want the highest number on permission by your example? If not, change the ORDER BY clause to what you want.
Right now it will select all columns, specify the ones you want.
If you are using Oracle, try the query below:
select * from (
select ProdID, PersonID, Permission, row_number() over (partition by PersonID order by Permission Desc) as column1 from table1)
where column1 = 1;
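Both answers rely on the same ROW_NUMBER() pattern, which can be checked against the question's sample data with a short Python sketch. It assumes an in-memory SQLite database (3.25 or later, for window functions) and a hypothetical table name your_table; the helper highest_permission is made up for this example.

```python
import sqlite3

# (prod_id, person_id, group_id, permission) copied from the question.
PERMISSIONS = [
    (10103, 78, 55, 15), (10103, 99, 33, 15), (10103, 100, 33, 0),
    (10103, 103, 33, 15), (10103, 103, 40, 0), (10103, 112, 33, 15),
]

def highest_permission(rows):
    """Keep one row per (product, person): the one with the top permission."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table your_table (prod_id int, person_id int, "
                 "group_id int, permission int)")
    conn.executemany("insert into your_table values (?, ?, ?, ?)", rows)
    result = conn.execute("""
        select prod_id, person_id, permission
        from (select t.*,
                     row_number() over (partition by t.prod_id, t.person_id
                                        order by t.permission desc) as rnk
              from your_table t) s
        where s.rnk = 1
        order by prod_id, person_id
    """).fetchall()
    conn.close()
    return result

for row in highest_permission(PERMISSIONS):
    print(row)
```

Person 103 appears once with permission 15, and people with a single group keep their only row, which matches the expected result in the question.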