Counting unique combinations of values across multiple columns regardless of order? - sql

I have a table that looks a bit like this:
Customer_ID | Offer_1 | Offer_2 | Offer_3
------------|---------|---------|--------
111 | A01 | 001 | B01
222 | A01 | B01 | 001
333 | A02 | 001 | B01
I want to write a query to figure out how many unique combinations of offers there are in the table, regardless of what order the offers appear in.
So in the example above there are two unique combinations: customers 111 & 222 both have the same three offers so they count as one unique combination, and customer 333 is the only customer with the particular three offers that they have. So the desired output of the query would be 2.
For some additional context:
The Customer_ID column is in integer format, and all the offer columns are in varchar format.
There are 12 offer columns and over 3 million rows in the actual table, with over 100 different values in the offer columns. I simplified the example to better illustrate what I'm trying to do, but any solution needs to scale to this number of possible combinations.
I can concatenate all of the offer columns together and then run a count distinct statement on the result, but this doesn't account for customers who have the same unique combination of offers but ordered differently (like customers 111 & 222 in the example above).
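(For reference, the concatenation attempt looks roughly like this, assuming the table is called t; it returns 3 for the sample above because 'A01/001/B01' and 'A01/B01/001' are different strings even though they are the same set of offers:)
select count(distinct offer_1 || '/' || offer_2 || '/' || offer_3) as combos
from t;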
Does anyone know how to solve this problem please?

Assuming the character / doesn't show up in any of the offer names, you can do:
select count(distinct offer_combo) as distinct_offers
from (
select listagg(offer, '/') within group (order by offer) as offer_combo
from (
select customer_id, offer_1 as offer from t
union all select customer_id, offer_2 from t
union all select customer_id, offer_3 from t
) x
group by customer_id
) y
Result:
DISTINCT_OFFERS
---------------
2
See running example at db<>fiddle.

One way to do it would be to union all the offers into one column, then use select distinct listagg... to get the combinations of offers. Try this:
with u as
(select Customer_ID, Offer_1 as Offer from table_name union all
select Customer_ID, Offer_2 as Offer from table_name union all
select Customer_ID, Offer_3 as Offer from table_name)
select distinct listagg(Offer, ',') within group(order by Offer) from u
group by Customer_ID
Fiddle

A solution without UNION ALLs; it should have better performance:
/*
WITH MYTAB (Customer_ID, Offer_1, Offer_2, Offer_3) AS
(
VALUES
(111, 'A01', '001', 'B01')
, (222, 'A01', 'B01', '001')
, (333, 'A02', '001', 'B01')
)
*/
SELECT COUNT (DISTINCT LIST)
FROM
(
SELECT LISTAGG (V.Offer, '|') WITHIN GROUP (ORDER BY V.Offer) LIST
FROM MYTAB T
CROSS JOIN TABLE (VALUES T.Offer_1, T.Offer_2, T.Offer_3) V (Offer)
GROUP BY T.CUSTOMER_ID
)

Related

Need help in forming a SQL query

We have 2 tables called tbl1 and tbl2. They contain columns such as Visit_ID, Customer_ID, and so on. There are instances where a Visit_ID will be associated with multiple Customer IDs.
For example, when a customer logs into the website, a unique Visit_ID is generated for each visit.
In one visit, multiple customers can log in to their accounts and make individual purchases.
There are instances where a visit is associated with multiple Customer IDs. If there are more than 2, any other retail Customer IDs should be appended in one column.
For instance, there are visits that have 200 Customer IDs attached to them.
For example, if there are 7 Customer IDs in 1 visit, the first column should show the 1st Customer ID and the second column the 2nd Customer ID. The 3rd to 7th (all 5 of them) should be comma separated in the last column.
Can someone help me frame a SQL query using this logic?
with CTE as (
SELECT
visit_id,
B.visitpg_nbr::INT AS visitpg_nbr,
CUSTOMER_ID,
dense_rank()over( PARTITION BY VISIT_ID order by CUSTOMER_ID) as rank
from
db_name.schema_name.tbl_1 A
JOIN db_name.schema_name.tbl_2 B
ON B.id_column = A.id_column
JOIN db_name.schema_name.tbl_3 C
ON CAST(C.xid as VARCHAR)= A.CUSTOMER_ID
WHERE flg_col = '0'
AND so_cd NOT IN ('0','1','2','3')
AND DATE_COL = '2022-01-17'
and visit_id='12345'
ORDER BY visitpg_nbr
)
select VISIT_ID, arr[0], arr[1], array_to_string( array_slice(arr, 2, 99999), ', ')
from (
select VISIT_ID, array_agg(distinct CUSTOMER_ID) within group(order by CUSTOMER_ID) arr
from CTE
group by 1
);
Thanks to those who have responded; I really appreciate the guidance. The logic worked fine. However, when I join the 3 tables inside the CTE, I get a lot of duplicates that I want to eliminate.
When I run the query below (the one I have included inside the CTE), I get duplicate records.
SELECT
visit_id,
B.visitpg_nbr::INT AS visitpg_nbr,
CUSTOMER_ID,
dense_rank()over( PARTITION BY VISIT_ID order by CUSTOMER_ID) as rank
from
db_name.schema_name.tbl_1 A
JOIN db_name.schema_name.tbl_2 B
ON B.id_column = A.id_column
JOIN db_name.schema_name.tbl_3 C
ON CAST(C.xid as VARCHAR)= A.CUSTOMER_ID
WHERE flg_col = '0'
AND so_cd NOT IN ('0','1','2','3')
AND DATE_COL = '2022-01-17'
and visit_id='12345'
ORDER BY visitpg_nbr
Row  VISIT_ID  CUSTOMER_ID  VISITPG_NBR  RANK
**1  12345     100          1            1**
2    12345     100          2            1
3    12345     100          3            1
4    12345     100          4            1
5    12345     100          5            1
**6  67891     101          6            2**
7    67891     101          7            2
8    67891     101          8            2
9    67891     101          9            2
10   67891     101          10           2
**11 78910     102          11           3**
12   78910     102          12           3
13   78910     102          13           3
14   78910     102          14           3
Is there any way to get distinct results inside the CTE? The final result should be populated as below.
VISIT_ID First_Customer Second_Customer Other_Customers
1 100 101 102,103,104,105,106
2 200 201 202,203,204,205
The first Customer_ID should be displayed in the First_Customer column and the second Customer_ID in the Second_Customer column. All the other Customer_IDs should be displayed in the final column, comma separated.
Also, I want the results to be ordered by visitpg_nbr.
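For what it's worth, here is a sketch of one way to collapse the duplicates before aggregating (reusing the table and column names from the query above; taking min(visitpg_nbr) keeps one row per visit/customer while still giving an order to aggregate by):
with CTE as (
select visit_id,
CUSTOMER_ID,
min(B.visitpg_nbr::INT) as visitpg_nbr -- one row per visit/customer
from db_name.schema_name.tbl_1 A
JOIN db_name.schema_name.tbl_2 B ON B.id_column = A.id_column
JOIN db_name.schema_name.tbl_3 C ON CAST(C.xid as VARCHAR) = A.CUSTOMER_ID
where flg_col = '0'
and so_cd NOT IN ('0','1','2','3')
and DATE_COL = '2022-01-17'
group by visit_id, CUSTOMER_ID
)
select visit_id, arr[0], arr[1], array_to_string(array_slice(arr, 2, 99999), ', ')
from (
select visit_id, array_agg(CUSTOMER_ID) within group (order by visitpg_nbr) arr
from CTE
group by 1
);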
You should be able to get this with array_agg(), and then choosing the first, second, and subsequent (array_slice()) elements:
with data as (
select *
from snowflake_sample_data.tpch_sf100.orders
where o_custkey between 5411266 and 5411290
)
select o_custkey, arr[0], arr[1], array_to_string(array_slice(arr, 2, 99999), ', ')
from (
select o_custkey, array_agg(o_orderkey) within group(order by o_orderdate) arr
from data
group by 1
);
You might need to get unique ids in case there are duplicates; you can solve that with a subquery before the array_agg(), as sketched below.
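For example, a sketch of that de-duplication step, reusing the same sample data; the extra group by collapses repeated ids before they are aggregated:
with data as (
select *
from snowflake_sample_data.tpch_sf100.orders
where o_custkey between 5411266 and 5411290
), dedup as (
-- one row per (customer, id), keeping the earliest date to order by
select o_custkey, o_orderkey, min(o_orderdate) as o_orderdate
from data
group by o_custkey, o_orderkey
)
select o_custkey, arr[0], arr[1], array_to_string(array_slice(arr, 2, 99999), ', ')
from (
select o_custkey, array_agg(o_orderkey) within group(order by o_orderdate) arr
from dedup
group by 1
);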
Slightly different from Felipe's answer; not sure which would be more performant (I suspect his), but anyway, here is another way to try it.
SELECT visit_id, first_customer, second_customer
,array_agg(other_ids) within group (order by order_id) as other_customer
FROM(
SELECT visit_id,
order_id,
first_value(customer_id) over (partition by visit_id order by order_id) as first_customer,
nth_value(customer_id, 2) over (partition by visit_id order by order_id) as second_customer, -- 2nd customer per visit (nth_value rather than a copy of first_value)
IFF(row_number() over (partition by visit_id order by order_id) > 2, customer_id, null) as other_ids
FROM VALUES
(1,100, 1),
(1,101, 2),
(1,102, 3),
(1,103, 5),
(1,104, 6),
(1,105, 6),
(1,106, 7),
(2,200, 1),
(2,201, 2),
(2,202, 3),
(2,203, 4)
v(visit_id, customer_id, order_id)
)
GROUP BY 1,2,3
ORDER BY 1,2,3;
VISIT_ID  FIRST_CUSTOMER  SECOND_CUSTOMER  OTHER_CUSTOMER
1         100             101              [ 102, 103, 104, 105, 106 ]
2         200             201              [ 202, 203 ]

oracle join one to many relationship, return 1st joining date

I am trying to join two tables whereby one person can have more than one card and some of them might be canceled.
For example :
**Customer Card**
Cust ID | Cust Acct | Card No | Join Date | Cancel Date
1 | 10001 | E100001 | 20150501 | 20160101
1 | 10001 | E100002 | 20151001 | 0
2 | 10002 | E100003 | 20150101 | 20160601
3 | 10003 | E100004 | 20150201 | 0
4 | 10003 | E100005 | 20160101 | 0
**Customer Account**
Cust ID | Cust Acct
1 | 10001
2 | 10002
3 | 10003
Basically, I want to show each account with its 1st joined card no and join date, even if that card is canceled. But if the 1st card is canceled and a later card is not, then the 2nd card's joining date needs to be shown instead.
The expected result :
Cust ID | Cust Acct | Card No | Join Date | Cancel Date
1 | 10001 | E100002 | 20151001 | 0
2 | 10002 | E100003 | 20150101 | 20160601
3 | 10003 | E100004 | 20150201 | 0
Thanks for the assistance! Any ideas?
One method uses row_number():
select ca.*, cc.CardNo, cc.JoinDate, cc.CancelDate
from customeraccount ca join
(select cc.*,
row_number() over (partition by custid order by joindate asc) as seqnum
from customercard cc
) cc
on ca.custid = cc.custid and cc.seqnum = 1;
This can be done in one pass over the data (without requiring a subquery and outer query), using GROUP BY and KEEP(DENSE_RANK FIRST).
First some housekeeping.
Table and column names cannot have spaces in them (unless you use double-quoted names, which is an unnecessary and very poor practice in most cases).
Your date columns seem to be in number format, which is a very poor practice. How can you prevent an input like 20151490 (the 90th day of the 14th month) from being stored in the db? All dates SHOULD be stored as dates. However, storing them in exactly that format allows correct order comparison (although that is just by accident and shouldn't be relied on). Since that is not the main point of your question, though, I used the data as is.
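(An aside, purely illustrative and not part of the solution below: in Oracle, a numeric "date" like these can be turned into a real DATE along these lines.)
-- convert one of the numeric values to a real DATE
select to_date(to_char(20151001), 'YYYYMMDD') as real_date
from dual;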
Why do you need a join? The first table should not include the cust_id - including it violates the second normal form of database design. If you do, in fact, have that column in the first table, I don't see the need for the second table, or for a join. (If the cust_id is not in the first table, then you do need a join, but I will leave that aside since the question is really about picking the right rows, not about joining - despite the title).
In the first table you have two cust_id, 3 and 4, associated with the same account (and contradicting the second table, too). I assume that's a typo and in fact 4 should be 3 - but this illustrates EXACTLY why second normal form is so important. You SHOULD NOT have cust_id in the first table.
The key to your reformulated requirement is conditional ordering. If for a given account all cards on file are canceled, or if none is canceled, then pick the one with the earliest join_date. However, if an account has a mix of both kinds of cards, then pick the earliest card that is not canceled. In SQL, that can be achieved with a composite ordering (by two expressions, of which the SECOND is join_date). The first criterion is the "conditional" part. In the solution below, I use the expression CASE when cancel_date = 0 then 0 end. That is, a card that has NOT been canceled will have a flag of 0, and one that is canceled will have the flag NULL (the default if there is no ELSE part in the CASE expression). By default, NULL comes last in an ordering (which is ascending by default). So, if all cards are still valid they will all have the flag 0 and the ordering by this flag won't matter. If all are canceled the flag is NULL for all, so ordering by this flag won't matter. But if some are valid and some canceled, then the valid ones will come first, so the earliest date will be picked only from valid cards.
Note that then 0 (the flag value of 0) is irrelevant; I could make it 1, or even a string (then 'a') and the "conditional ordering" would work just the same, and for the same reason. I attach something that is not NULL to valid cards and NULL to canceled cards; that's all that matters.
This is the change that Gordon would need to make his solution work, too. But, in cases like this, I prefer the KEEP(DENSE_RANK FIRST) approach, especially if performance is important (as might be the case when you have a very large number of customers, accounts, and credit cards on file).
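Applied to the row_number() approach from the earlier answer, that change would look roughly like this (just a sketch; the one-pass KEEP solution follows below):
select ca.*, cc.CardNo, cc.JoinDate, cc.CancelDate
from customeraccount ca join
(select cc.*,
-- non-canceled cards get flag 0, canceled ones NULL (sorted last), then earliest join date
row_number() over (partition by custid
order by case when canceldate = 0 then 0 end, joindate) as seqnum
from customercard cc
) cc
on ca.custid = cc.custid and cc.seqnum = 1;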
with
customer_card ( cust_id , cust_acct , card_no , join_date , cancel_date ) as (
select 1, 10001, 'E100001', 20150501, 20160101 from dual union all
select 1, 10001, 'E100002', 20151001, 0 from dual union all
select 2, 10002, 'E100003', 20150101, 20160601 from dual union all
select 3, 10003, 'E100004', 20150201, 0 from dual union all
select 3, 10003, 'E100005', 20160101, 0 from dual
)
-- end of test data; actual solution begins HERE
select cust_id, cust_acct,
min(card_no) keep (dense_rank first
order by case when cancel_date = 0 then 0 end, join_date) as card_no,
min(join_date) keep (dense_rank first
order by case when cancel_date = 0 then 0 end, join_date) as join_date,
min(cancel_date) keep (dense_rank first
order by case when cancel_date = 0 then 0 end, join_date) as cancel_date
from customer_card
group by cust_id, cust_acct
order by cust_id, cust_acct -- ORDER BY is optional
;
Output:
CUST_ID CUST_ACCT CARD_NO JOIN_DATE CANCEL_DATE
--------- ---------- ------- --------- -----------
1 10001 E100002 20151001 0
2 10002 E100003 20150101 20160601
3 10003 E100004 20150201 0
Try this:
Edit: I'm stealing mathguy's "customer_card" table creation. I'd imagine his way works too, so here's another solution:
with
customer_card ( cust_id , cust_acct , card_no , join_date , cancel_date ) as (
select 1, 10001, 'E100001', 20150501, 20160101 from dual union all
select 1, 10001, 'E100002', 20151001, 0 from dual union all
select 2, 10002, 'E100003', 20150101, 20160601 from dual union all
select 3, 10003, 'E100004', 20150201, 0 from dual union all
select 3, 10003, 'E100005', 20160101, 0 from dual
)
, allresults as(
select
cust_id,
cust_acct,
card_no,
join_date,
cancel_date,
rank() over(partition by cust_acct order by decode(cancel_date, 0, 1, 2), join_date, rownum) DATE_RANK
from customer_card
)
select
*
from allresults
where DATE_RANK = 1

countif type function in SQL where total count could be retrieved in other column

I have 36 columns in a table, but one of the columns has repeated values, like below:
ID Name Ref
abcd john doe 123
1234 martina 100
123x brittany 123
ab12 joe 101
and I want results like:
ID Name Ref cnt
abcd john doe 123 2
1234 martina 100 1
123x brittany 123 2
ab12 joe 101 1
As 123 appears twice, I want it to show 2 in the cnt column, and so on.
select ID, Name, Ref, (select count(ID) from [table] where Ref = A.Ref) as cnt
from [table] A
Edit:
As mentioned in comments below, this approach may not be the most efficient in all cases, but should be sufficient on reasonably small tables.
In my testing:
a table of 5,460 records and 976 distinct 'Ref' values returned in less than 1 second.
a table of 600,831 records and 8,335 distinct 'Ref' values returned in 6 seconds.
a table of 845,218 records and 15,147 distinct 'Ref' values returned in 13 seconds.
You should specify which SQL product you are using, since capabilities differ:
1) If your DB supports window functions:
Select
*,
count(*) over ( partition by ref ) as cnt
from your_table
2) If not:
Select
T.*, G.cnt
from
( select * from your_table ) T inner join
( select ref, count(*) as cnt from your_table group by ref ) G
on T.ref = G.ref
You can use COUNT with OVER as in the following:
QUERY
select ID,
Name,
ref,
count(ref) over (partition by ref) cnt
from #t t
SAMPLE DATA
create table #t
(
ID NVARCHAR(400),
Name NVARCHAR(400),
Ref INT
)
insert into #t values
('abcd','john doe', 123),
('1234','martina', 100),
('123x','brittany', 123),
('ab12','joe', 101)

SQL - Looking to show when 2 columns combined have the same data

I have a database table that has a Vendor_ID column and a Vendor_Item column.
Vendor_id Vendor_item
101 111
101 111
101 123
I need a way to show the rows where the combination of vendor_id and vendor_item has a count greater than 1. The vendor_item number can appear multiple times as long as it has a different vendor_id.
Vendor_id Vendor_item
101 111
101 111
I have done the following, but it only shows the combinations that have more than 1 and doesn't show both records like the above example.
SELECT vendor_id,vendor_item
From Inventory_master
group by vendor_id,vendor_item
having count(*) >1
If possible I would like a way to add another column ( UPC ) to the results. The system I am working on can import back into the system with UPC so I would be able to fix what is duplicated.
Vendor_id Vendor_item UPC
101 111 456
101 111 789
Not sure about the UPC column (where and how you are getting it), but you can change your existing query a bit, like below, to get the desired data:
SELECT * FROM Inventory_master WHERE (vendor_id, vendor_item) IN (
-- compare the (vendor_id, vendor_item) pair, not vendor_item alone
SELECT vendor_id, vendor_item
From Inventory_master
group by vendor_id, vendor_item
having count(*) > 1);
You can use a subquery and then JOIN back to the inventory_master table:
SELECT im.*
FROM
Inventory_master im INNER JOIN (
SELECT vendor_id, vendor_item
From Inventory_master
group by vendor_id,vendor_item
having count(*) >1) s
ON im.vendor_id = s.vendor_id AND im.vendor_item = s.vendor_item
Try this
select * from (
select vendor_id, vendor_item, count(*) over (partition by vendor_id, vendor_item) cnt
from Inventory_master
) t
where cnt > 1

SQL:How to get min Quantity?

I've got this problem: I tried to summarize the min quantity of each nation's products and it did not work.
I have 2 tables below
PRODUCT:
ID|NAME |NaID|Qty
-------------------
01|Fruit|JP |50
02|MEAT |AUS |10
03|MANGA|JP |80
04|BOOK |AUS |8
NATION:
NaID |NAME
-------------------
AUS |Australia
JP |Japan
I want my result like this:
ID|NAME |Name|minQty
-------------------
01|Fruit|JP |50
04|BOOK |AUS |8
and I used:
select p.id,p.name, p.NaID,n.name,min(P.Qty)as minQty
from Product p,Nation n
where p.NaID=n.NaID
group by p.id,p.name, p.NaID,n.name,p.Qty
and I got this (T_T):
ID|NAME |NaID|minQty
-------------------
01|Fruit|JP |50
02|MEAT |AUS |10
03|MANGA|JP |80
04|BOOK |AUS |8
Please, could someone help me? I'm thinking that I'm bad at SQL now.
SQL Server 2005 supports window functions, so you can do something like this:
select id,
name,
NaID,
nation_name,
qty
from (
select p.id,
p.name,
p.NaID,
n.name as nation_name, -- alias avoids a duplicate column name in the derived table
min(P.Qty) over (partition by n.naid) as min_qty,
p.qty
from Product p
join Nation n on p.NaID = n.NaID
) t
where qty = min_qty;
If more than one product within a nation has the same minimum quantity, you will get each of them. If you don't want that, you need to use row_number():
select id,
name,
NaID,
nation_name,
qty
from (
select p.id,
p.name,
p.NaID,
n.name as nation_name, -- alias avoids a duplicate column name in the derived table
row_number() over (partition by n.naid order by p.qty) as rn,
p.qty
from Product p
join Nation n on p.NaID = n.NaID
) t
where rn = 1;
As your example output only includes the NaID but not the nation's name, you don't really need the join between product and nation.
(There is no DBMS product called "SQL 2005". SQL is just a standard query language. The DBMS product you mean is called Microsoft SQL Server 2005, or just SQL Server 2005.)
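Under that assumption, a join-free sketch of the same idea would be:
select id, name, NaID, qty
from (
select p.id,
p.name,
p.NaID,
row_number() over (partition by p.NaID order by p.qty) as rn,
p.qty
from Product p
) t
where rn = 1;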
In Oracle, you can use several techniques. You can use subqueries and analytic functions, but the most efficient one is to use aggregate functions MIN and FIRST.
Your tables:
SQL> create table nation (naid,name)
2 as
3 select 'AUS', 'Australia' from dual union all
4 select 'JP', 'Japan' from dual
5 /
Table created.
SQL> create table product (id,name,naid,qty)
2 as
3 select '01', 'Fruit', 'JP', 50 from dual union all
4 select '02', 'MEAT', 'AUS', 10 from dual union all
5 select '03', 'MANGA', 'JP', 80 from dual union all
6 select '04', 'BOOK', 'AUS', 8 from dual
7 /
Table created.
The query:
SQL> select max(p.id) keep (dense_rank first order by p.qty) id
2 , max(p.name) keep (dense_rank first order by p.qty) name
3 , p.naid "NaID"
4 , n.name "Nation"
5 , min(p.qty) "minQty"
6 from product p
7 inner join nation n on (p.naid = n.naid)
8 group by p.naid
9 , n.name
10 /
ID NAME NaID Nation minQty
-- ----- ---- --------- ----------
01 Fruit JP Japan 50
04 BOOK AUS Australia 8
2 rows selected.
Since you're not using Oracle, here is a less efficient query that works in many (though not all) RDBMS products:
SQL> select p.id
2 , p.name
3 , p.naid
4 , n.name
5 , p.qty
6 from product p
7 inner join nation n on (p.naid = n.naid)
8 where ( p.naid, p.qty )
9 in
10 ( select p2.naid
11 , min(p2.qty)
12 from product p2
13 group by p2.naid
14 )
15 /
ID NAME NAID NAME QTY
-- ----- ---- --------- ----------
01 Fruit JP Japan 50
04 BOOK AUS Australia 8
2 rows selected.
Note that if you have several rows with the same minimum quantity per nation, all those rows will be returned, instead of just one as in the previous "Oracle"-query.
with cte as (
select *,
row_number() over (partition by NaID order by qty) as [rn]
from product
)
select * from cte where [rn] = 1