De-duplicating combinations

I recently ran a SQL query that returns the most common combinations of products in a basket. Here's what the query looks like:
WITH cte AS (
SELECT a.order_id, a.SKU_number as product_1, b.SKU_number as product_2, c.SKU_number as product_3, d.SKU_number as product_4
FROM [cons_customer].[sales_detail_time] a
JOIN [cons_customer].[sales_detail_time] b
ON a.order_id = b.order_id AND a.SKU_number <> b.SKU_number
JOIN [cons_customer].[sales_detail_time] c
ON a.order_id = c.order_id AND a.SKU_number <> c.SKU_number AND b.SKU_number <> c.SKU_number
JOIN [cons_customer].[sales_detail_time] d
ON a.order_id = d.order_id AND a.SKU_number <> d.SKU_number AND b.SKU_number <> d.SKU_number AND c.SKU_number <> d.SKU_number
WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP 50 product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC;
However, there's one small problem with the results. I'm getting duplicated combinations, because the same products swap around the product_2, product_3 and product_4 columns. Here's an example:
I have one combination of 3 products: X, Y and Z.
The query I'm running shows me three rows:
| product_2 | product_3 | product_4 | count |
| X | Y | Z | 18 |
| Y | Z | X | 18 |
| Z | X | Y | 18 |
As you can see, there are no duplicates within a row, but these three rows are the same combination in a different order. Is there any way of de-duplicating them?

Use < in place of <> in the JOIN conditions.
WITH cte AS (
SELECT a.order_id,
a.SKU_number as product_1,
b.SKU_number as product_2,
c.SKU_number as product_3,
d.SKU_number as product_4
FROM [cons_customer].[sales_detail_time] a
JOIN [cons_customer].[sales_detail_time] b
ON a.order_id = b.order_id AND a.SKU_number < b.SKU_number
JOIN [cons_customer].[sales_detail_time] c
ON a.order_id = c.order_id AND a.SKU_number < c.SKU_number
AND b.SKU_number < c.SKU_number
JOIN [cons_customer].[sales_detail_time] d
ON a.order_id = d.order_id AND a.SKU_number < d.SKU_number
AND b.SKU_number < d.SKU_number
AND c.SKU_number < d.SKU_number
WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP(50) product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC;
Given that you enforce a < b < c < d, you can try removing some conditions too.
WITH cte AS (
SELECT a.order_id,
a.SKU_number as product_1,
b.SKU_number as product_2,
c.SKU_number as product_3,
d.SKU_number as product_4
FROM [cons_customer].[sales_detail_time] a
JOIN [cons_customer].[sales_detail_time] b
ON a.order_id = b.order_id AND a.SKU_number < b.SKU_number
JOIN [cons_customer].[sales_detail_time] c
ON a.order_id = c.order_id AND b.SKU_number < c.SKU_number
JOIN [cons_customer].[sales_detail_time] d
ON a.order_id = d.order_id AND c.SKU_number < d.SKU_number
WHERE a.SKU_number = 'PBPR108BAU.H01'
)
SELECT TOP(50) product_2, product_3, product_4, COUNT(*) as count
FROM cte
GROUP BY product_2, product_3, product_4
ORDER BY count DESC;
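A minimal sketch of the ordering trick, run against an in-memory SQLite database from Python. The table name sales, the column sku and the sample rows are made up for illustration, and the sketch assumes each SKU appears at most once per order. One caveat worth checking on real data: because the anchor product is fixed by the WHERE clause, a condition like a.SKU_number < b.SKU_number also discards every product whose SKU sorts before the anchor. The variant below keeps <> against the anchor and applies < only among the other three products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, sku TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        (1, "T"), (1, "X"), (1, "Y"), (1, "Z"),
        (2, "T"), (2, "Z"), (2, "Y"), (2, "X"),  # same basket, different row order
        (3, "T"), (3, "A"), (3, "X"), (3, "Y"),  # "A" sorts before the anchor "T"
    ],
)

query = """
SELECT b.sku, c.sku, d.sku, COUNT(*) AS n
FROM sales a
JOIN sales b ON a.order_id = b.order_id AND b.sku <> a.sku
JOIN sales c ON a.order_id = c.order_id AND c.sku <> a.sku AND b.sku < c.sku
JOIN sales d ON a.order_id = d.order_id AND d.sku <> a.sku AND c.sku < d.sku
WHERE a.sku = 'T'
GROUP BY b.sku, c.sku, d.sku
ORDER BY n DESC, b.sku
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Each combination now appears once, in ascending SKU order, and the basket containing "A" is still counted even though "A" sorts before the anchor.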

Huge thanks to @lemon for this.
His suggestion was totally right, but the details of my table were a bit more complex than I thought.
As my table has one entry for every product purchased, there were several purchases with more than 3 items that were not being identified by the query, since the query was looking only for the most common combinations of 3 products.
Therefore I had to adjust it a little, with an extra challenge: there are also entries for items where revenue = 0, such as packaging or trial sachets.
This is what my final query looks like. Basically, I've asked SQL to bring me all the transactions with exactly 3 items where revenue > 0, plus a fourth item, which is the product I wanted to explore.
WITH cte AS (
SELECT order_id
FROM [cons_customer].[sales_detail_time]
WHERE sku_number = 'PBPR108BAU.H01' OR [revenue_tax_inc_AUD] > 0
GROUP BY order_id
HAVING SUM(CASE WHEN sku_number = 'PBPR108BAU.H01' THEN 1 ELSE 0 END) > 0
AND COUNT(DISTINCT CASE WHEN sku_number <> 'PBPR108BAU.H01' AND [revenue_tax_inc_AUD] > 0 THEN sku_number END) = 3
)
SELECT 'PBPR108BAU.H01' AS product_01,
t1.sku_number AS product_02,
t2.sku_number AS product_03,
t3.sku_number AS product_04,
COUNT(DISTINCT t1.order_id) AS count
FROM [cons_customer].[sales_detail_time] t1
JOIN [cons_customer].[sales_detail_time] t2
ON t1.order_id = t2.order_id AND t1.sku_number < t2.sku_number
JOIN [cons_customer].[sales_detail_time] t3
ON t2.order_id = t3.order_id AND t2.sku_number < t3.sku_number
AND t1.sku_number <> t3.sku_number
WHERE t1.order_id IN (SELECT order_id FROM cte)
AND t1.sku_number < 'PBPR108BAU.H01'
AND t2.sku_number < 'PBPR108BAU.H01'
AND t3.sku_number < 'PBPR108BAU.H01'
GROUP BY t1.sku_number, t2.sku_number, t3.sku_number
ORDER BY count DESC

Related

SQL is it possible to have multiple subqueries in From clause?

I have 2 tables.
The first one is the bill:
| id | total | client_code | created_at |
| 1 | 10 | 1 | 2022-02-01 |
| 2 | 20 | 1 | 2022-02-01 |
| 3 | 20 | 3 | 2022-03-01 |
The second is the product (for a bill):
| bill_id | category | total |
| 1 | Electronic | 2 |
| 1 | Food | 5 |
| 1 | Food | 3 |
| 2 | Food | 10 |
| 2 | Food | 10 |
| 3 | Food | 10 |
| 3 | Food | 10 |
What I want from my query is the average spending per client for each month over, for example, the last 4 months.
Right now my query looks like this for the first 2 months, and it works; I get the result I want:
SELECT
COALESCE(AVG(first.total),0),
COALESCE(AVG(second.total),0)
FROM
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-02-01' AND b.created_at < '2022-03-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as first),
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-02-01' AND b.created_at < '2022-03-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as second)
But as soon as I get to 3 months, it doesn't work anymore:
SELECT
COALESCE(AVG(first.total),0),
COALESCE(AVG(second.total),0),
COALESCE(AVG(third.total),0)
FROM
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-02-01' AND b.created_at < '2022-03-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as first),
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-02-01' AND b.created_at < '2022-03-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as second),
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-03-01' AND b.created_at < '2022-01-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as three)
Is there a rule that keeps me from having more than two subqueries in my FROM clause?
If so, is there another way of doing this? My real problem actually covers 12 months, not 4. I already have a working solution, but its performance is bad, which is why I am trying this.
My working solution looks like this:
(SELECT
COALESCE(AVG(req.total),0)
FROM
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-02-01' AND b.created_at < '2022-03-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as req)
UNION ALL
(SELECT
COALESCE(AVG(req.total),0)
FROM
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-03-01' AND b.created_at < '2022-04-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as req)
UNION ALL
(SELECT
COALESCE(AVG(req.total),0)
FROM
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-04-01' AND b.created_at < '2022-05-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as req)
UNION ALL
(SELECT
COALESCE(AVG(req.total),0)
FROM
(SELECT
SUM(p.total) as total
FROM
bill b
INNER JOIN product p on b.id = p.bill_id
WHERE
b.created_at >= '2022-05-01' AND b.created_at < '2022-06-01' AND
p.category = 'FOOD'
GROUP BY
p.client_code) as req)

Something like this (I'm pretty sure this isn't the right way to get months and years in Postgres, so you will have to fix that):
SELECT
A.client_code,
avg(A.total) as AvgMonthlyTotal
FROM
(
SELECT
SUM(p.total) as total,
month(b.created_at),
p.client_code
FROM
bill b
INNER JOIN product p
on b.id = p.bill_id
WHERE
p.Category = 'FOOD'
and year(b.created_at) = 2022
GROUP BY
month(b.created_at),
p.client_code
) A
GROUP BY A.client_code;
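As a rough sketch of the single-pass idea, the snippet below loads the sample rows from the question into an in-memory SQLite database and produces one row per month, the same shape as the UNION ALL version. Several things here are assumptions: it groups on b.client_code because that column sits on bill in the sample tables, it filters on 'Food' as spelled in the sample data (string comparison is case-sensitive in Postgres), and it uses SQLite's strftime to bucket by month, where Postgres would use something like date_trunc('month', ...) or EXTRACT. On the original question: SQL itself puts no limit on the number of derived tables in FROM. In the three-month attempt the select list references third.total while the subquery is aliased three, and each subquery carries an extra closing parenthesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bill (id INTEGER, total INTEGER, client_code INTEGER, created_at TEXT);
CREATE TABLE product (bill_id INTEGER, category TEXT, total INTEGER);
INSERT INTO bill VALUES (1, 10, 1, '2022-02-01'), (2, 20, 1, '2022-02-01'),
                        (3, 20, 3, '2022-03-01');
INSERT INTO product VALUES (1, 'Electronic', 2), (1, 'Food', 5), (1, 'Food', 3),
                           (2, 'Food', 10), (2, 'Food', 10),
                           (3, 'Food', 10), (3, 'Food', 10);
""")

query = """
SELECT ym, AVG(total) AS avg_per_client
FROM (
    -- inner level: one row per (month, client) with that client's Food spending
    SELECT strftime('%Y-%m', b.created_at) AS ym,
           b.client_code,
           SUM(p.total) AS total
    FROM bill b
    INNER JOIN product p ON b.id = p.bill_id
    WHERE p.category = 'Food'
      AND b.created_at >= '2022-02-01' AND b.created_at < '2022-06-01'
    GROUP BY strftime('%Y-%m', b.created_at), b.client_code
) per_client
GROUP BY ym
ORDER BY ym
"""
monthly = conn.execute(query).fetchall()
for row in monthly:
    print(row)
```

Unlike the COALESCE version, a month with no matching bills simply produces no row here; joining against a generated list of months would be one way to get explicit zeros.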

How to join columns with different field ids?

Hi everyone. As you can see below, I want to count c.key_value for two different combinations of c.config_field_id and relation_type_id.
So I just want to join these two select statements. The result should show 3 columns:
| parent_id | count(c.key_value) | count(c.key_value(with another config_field_id, relation_type_id)) |
Please help me to solve this problem. Thanks in advance
select parent_id, count(c.key_value) from relation as r
left join config_value_number as c on r.child_id = c.key_value
where c.config_field_id = 100 and relation_type_id = 150
group by parent_id
select parent_id, count(c.key_value) from relation as r
left join config_value_number as c on r.child_id = c.key_value
where c.config_field_id = 101 and relation_type_id = 151
group by parent_id

You can simply use conditional aggregation for that:
select parent_id,
count(case when c.config_field_id=100 then c.key_value end) as key_value_150,
count(case when c.config_field_id=101 then c.key_value end) as key_value_151
from relation r
left join config_value_number c on r.child_id = c.key_value
where (c.config_field_id =100 and relation_type_id =150)
or (c.config_field_id =101 and relation_type_id =151)
group by parent_id
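Here is a small runnable sketch of conditional aggregation using SQLite from Python. The sample rows are invented, and the column aliases are hypothetical. It differs from the query above in one deliberate way: the field and relation conditions live only inside the CASE expressions, so a parent with no match for one pair still shows up with a count of 0 instead of being filtered out by the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relation (parent_id INTEGER, child_id INTEGER, relation_type_id INTEGER);
CREATE TABLE config_value_number (key_value INTEGER, config_field_id INTEGER);
-- made-up sample data
INSERT INTO relation VALUES (1, 10, 150), (1, 11, 150), (1, 12, 151), (2, 13, 151);
INSERT INTO config_value_number VALUES (10, 100), (11, 100), (12, 101), (13, 101);
""")

query = """
SELECT r.parent_id,
       -- COUNT ignores NULL, so rows that fail the CASE condition are not counted
       COUNT(CASE WHEN c.config_field_id = 100 AND r.relation_type_id = 150
                  THEN c.key_value END) AS cnt_100_150,
       COUNT(CASE WHEN c.config_field_id = 101 AND r.relation_type_id = 151
                  THEN c.key_value END) AS cnt_101_151
FROM relation r
LEFT JOIN config_value_number c ON r.child_id = c.key_value
GROUP BY r.parent_id
ORDER BY r.parent_id
"""
counts = conn.execute(query).fetchall()
for row in counts:
    print(row)
```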

You just need to join those two select statements as subqueries, in place of actual tables.
SELECT parent_id, left_count, right_count
FROM (select parent_id, count(c.key_value) as left_count
from relation as r
left join config_value_number as c
on r.child_id = c.key_value
where c.config_field_id = 100 and relation_type_id = 150 group by parent_id) a
JOIN (select parent_id, count(c.key_value) as right_count
from relation as r
left join config_value_number as c
on r.child_id = c.key_value
where c.config_field_id = 101 and relation_type_id = 151 group by parent_id) b
ON a.parent_id = b.parent_id;

Having trouble using COUNT with INTERSECT in Teradata

I am trying to run the code below in Teradata. However, I keep getting an error when I try to count the number of rows this intersection has. The error is: Failed [2616 : 22003] Numeric overflow occurred during computation.
I tried using a CAST to BIGINT, but now the result comes back empty. When I run the actual intersect (without the COUNT), I am able to see the list of rows it returns. I want to be able to count those rows. Do you know how I can do this?
select CAST(count(a.main_id) AS BIGINT) from second_database.tra_rock a
database.game_active b ON a.main_key=b.main_key AND description_detail LIKE 'AC'
database.release_day c ON a.release_key = c.release_key AND g_description = 'FW'
database.ft_feature d on a.main_id = d.main_id AND first_time >= 20200319
where action_date_key between 20200319 and 20200324 and a.main_id IN
(select a.main_id
From second_database.tra_rock a
database.game_active b ON a.main_key=b.main_key AND description_detail LIKE 'AC'
where action_date > 20200324 and release_key = 200)
INTERSECT
select a.main_id
From second_database.tra_rock a
database.game_active b ON a.main_key=b.main_key AND description_detail LIKE 'AC'
database.release_day c ON a.release_key = c.release_key AND g_description = 'FW'
database.ft_feature d on a.main_id = d.main_id AND DATE_KEY >= 20200319
where action_date_key between 20200319 and 20200324 and a.main_id IN
(select a.main_id
From second_database.tra_rock a
database.game_active b ON a.genome_key=b.genome_key AND description_detail <> 'AC'
where action_date > 20200324 and release_key = 200)

The COUNT is applied to the first Select only and then you try to Intersect the counts and the main_id from the second Select.
You need to wrap the full query into a Derived Table or a Common Table Expression:
select cast(count(*) as bigint)
from
(
select a.main_id from second_database.tra_rock a
database.game_active b ON a.main_key=b.main_key AND description_detail LIKE 'AC'
database.release_day c ON a.release_key = c.release_key AND g_description = 'FW'
database.ft_feature d on a.main_id = d.main_id AND first_time >= 20200319
where action_date_key between 20200319 and 20200324 and a.main_id IN
(select a.main_id
From second_database.tra_rock a
database.game_active b ON a.main_key=b.main_key AND description_detail LIKE 'AC'
where action_date > 20200324 and release_key = 200)
INTERSECT
select a.main_id
From second_database.tra_rock a
database.game_active b ON a.main_key=b.main_key AND description_detail LIKE 'AC'
database.release_day c ON a.release_key = c.release_key AND g_description = 'FW'
database.ft_feature d on a.main_id = d.main_id AND DATE_KEY >= 20200319
where action_date_key between 20200319 and 20200324 and a.main_id IN
(select a.main_id
From second_database.tra_rock a
database.game_active b ON a.genome_key=b.genome_key AND description_detail <> 'AC'
where action_date > 20200324 and release_key = 200)
) as dt
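The same pattern in miniature, as a sketch on SQLite with two invented tables t1 and t2: the INTERSECT runs inside a derived table and the outer query counts its rows. Note that INTERSECT returns distinct rows, so the count is the number of distinct main_id values common to both sides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical stand-ins for the two SELECTs in the question
CREATE TABLE t1 (main_id INTEGER);
CREATE TABLE t2 (main_id INTEGER);
INSERT INTO t1 VALUES (1), (2), (3), (3);
INSERT INTO t2 VALUES (2), (3), (4);
""")

n = conn.execute("""
SELECT COUNT(*)
FROM (
    SELECT main_id FROM t1
    INTERSECT
    SELECT main_id FROM t2
) AS dt
""").fetchone()[0]
print(n)
```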

SQL - SUM within subquery

I have the following code that looks at the SalesVol of different products and groups it by transaction_week
SELECT a.transaction_week,
SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END) as SalesVol
FROM table 1 a
LEFT JOIN table 2 b ON b.Date = a.transaction_date
LEFT JOIN table 3 c ON c.sku = a.product
WHERE series in (62,236,501,52)
GROUP BY a.transaction_week
ORDER BY a.transaction_week
| tw | SalesVol |
| 1 | 4768 |
| 2 | 4567 |
| 3 | 4354 |
| 4 | 4678 |
I want to be able to have multiple subqueries where I change the series numbers for example.
SELECT a.transaction_week,
(SELECT SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END) as SalesVol
FROM table 1 a
LEFT JOIN table 2 b ON b.Date = a.transaction_date
LEFT JOIN table 3 c ON c.sku = a.product
WHERE series in (62,236,501,52)) as personal care
(SELECT SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END) as SalesVol
FROM table 1 a
LEFT JOIN table 2 b ON b.Date = a.transaction_date
LEFT JOIN table 3 c ON c.sku = a.product
WHERE series in (37,202,203,456)) as white goods
FROM table 1 a
LEFT JOIN table 2 b ON b.Date = a.transaction_date
LEFT JOIN table 3 c ON c.sku = a.product
GROUP BY a.transaction_week
ORDER BY a.transaction_week
I can't get the subqueries to work: each one gives me the overall sum instead of grouping it by transaction_week.

Instead of using subqueries, add series to the condition of the CASE expressions:
SELECT a.transaction_week,
sum(CASE WHEN series IN (62,236,501,52) AND record_type IN (6,37,13)
THEN quantity ELSE 0 END) as personal_care,
sum(CASE WHEN series IN (37,202,203,456) AND record_type IN (6,37,13)
THEN quantity ELSE 0 END) as white_goods
FROM table 1 a
LEFT JOIN table 2 b ON b.Date = a.transaction_date
LEFT JOIN table 3 c ON c.sku = a.product
GROUP BY a.transaction_week
ORDER BY a.transaction_week;

You are just missing a.transaction_week in your subqueries. The JOINs in the outer query are unnecessary.
SELECT a.transaction_week,
(
SELECT SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END) as SalesVol
FROM table 1 a2
LEFT JOIN table 2 b ON b.Date = a2.transaction_date
LEFT JOIN table 3 c ON c.sku = a2.product
WHERE series in (62,236,501,52) AND a2.transaction_week = a.transaction_week
) as personal_care,
(
SELECT SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END) as SalesVol
FROM table 1 a2
LEFT JOIN table 2 b ON b.Date = a2.transaction_date
LEFT JOIN table 3 c ON c.sku = a2.product
WHERE series in (37,202,203,456) AND a2.transaction_week = a.transaction_week
) as white_goods
FROM table 1 a
GROUP BY a.transaction_week
ORDER BY a.transaction_week
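To compare the conditional-aggregation answer with the correlated-subquery answer, here is a sketch on SQLite that runs both against the same made-up data. It assumes a single flattened table named sales that already carries series, record_type and quantity, so the joins to the other two tables are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (transaction_week INTEGER, series INTEGER,
                    record_type INTEGER, quantity INTEGER);
INSERT INTO sales VALUES
  (1, 62, 6, 5), (1, 37, 13, 7), (1, 62, 99, 100),
  (2, 236, 37, 3), (2, 202, 6, 4), (2, 203, 6, 6);
""")

# Variant 1: one pass over the table, the series filter moved into the CASE.
conditional = conn.execute("""
SELECT transaction_week,
       SUM(CASE WHEN series IN (62,236,501,52) AND record_type IN (6,37,13)
                THEN quantity ELSE 0 END) AS personal_care,
       SUM(CASE WHEN series IN (37,202,203,456) AND record_type IN (6,37,13)
                THEN quantity ELSE 0 END) AS white_goods
FROM sales
GROUP BY transaction_week
ORDER BY transaction_week
""").fetchall()

# Variant 2: scalar subqueries correlated to the outer row on transaction_week.
correlated = conn.execute("""
SELECT a.transaction_week,
       (SELECT SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END)
        FROM sales a2
        WHERE a2.series IN (62,236,501,52)
          AND a2.transaction_week = a.transaction_week) AS personal_care,
       (SELECT SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END)
        FROM sales a2
        WHERE a2.series IN (37,202,203,456)
          AND a2.transaction_week = a.transaction_week) AS white_goods
FROM sales a
GROUP BY a.transaction_week
ORDER BY a.transaction_week
""").fetchall()

print(conditional)
print(correlated)
```

On this data both variants agree. They can differ when a week has no rows for a series group: the correlated subquery then returns NULL, while the conditional sum returns 0.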

Try this; it should run fast and meet your requirement:
SELECT a.transaction_week ,
whitegoods.SalesVol AS 'White Goods' ,
personalcare.SalesVol1 AS 'Personal Care'
FROM table1 a
LEFT JOIN table2 b ON b.[Date] = a.transaction_date
LEFT JOIN table3 c ON c.sku = a.product
CROSS APPLY ( SELECT SUM(CASE WHEN record_type IN ( 6, 37, 13 )
THEN quantity
ELSE 0
END) AS SalesVol
FROM table1 a2
WHERE b.[Date] = a2.transaction_date
AND c.sku = a2.product
AND series IN ( 37, 202, 203, 456 )
AND a2.transaction_week = a.transaction_week
) whitegoods
CROSS APPLY ( SELECT SUM(CASE WHEN record_type IN ( 6, 37, 13 )
THEN quantity
ELSE 0
END) AS SalesVol1
FROM table1 a2
WHERE b.[Date] = a2.transaction_date
AND c.sku = a2.product
AND series IN ( 62, 236, 501, 52 )
AND a2.transaction_week = a.transaction_week
) personalcare
GROUP BY a.transaction_week
ORDER BY a.transaction_week

You should use the UNION operator. Please refer to the query below:
select a.transaction_week, SalesVol from
(SELECT a.transaction_week as transaction_week,
SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END) as SalesVol
FROM table 1 a
LEFT JOIN table 2 b ON b.Date = a.transaction_date
LEFT JOIN table 3 c ON c.sku = a.product
WHERE series in (62,236,501,52)
UNION
SELECT a.transaction_week as transaction_week,
SUM(CASE WHEN record_type IN (6,37,13) THEN quantity ELSE 0 END) as SalesVol
FROM table 1 a
LEFT JOIN table 2 b ON b.Date = a.transaction_date
LEFT JOIN table 3 c ON c.sku = a.product
WHERE series in (37,202,203,456)
) AS tbl1
GROUP BY tbl1.transaction_week
ORDER BY tbl1.transaction_week

Query for Select Id,name,Cost on Delivery using sub query

I want a single query that selects the customer id, the customer name and the total COD orders. I wrote these two queries, but they calculate the COD and non-COD counts separately. How can I combine them into a single query using a subquery?
SELECT o.cust_id, UPPER(c.name), count(o.order_no) AS 'Total COD Orders'
FROM T_Acct_CompanyProfile c INNER JOIN
T_Inv_Order o
ON c.id = o.cust_id
WHERE c.type_id = 1 AND o.cod = 1
GROUP BY o.cust_id, c.name;
SELECT o.cust_id, UPPER(c.name), count(o.order_no) AS 'Total COD Orders'
FROM T_Acct_CompanyProfile c INNER JOIN
T_Inv_Order o
ON c.id = o.cust_id
WHERE c.type_id = 1 AND o.cod = 0
GROUP BY o.cust_id, c.name;

With conditional aggregation:
SELECT o.cust_id,
UPPER(c.name),
SUM(CASE WHEN o.cod = 1 THEN 1 ELSE 0 END) AS 'Total COD Orders',
SUM(CASE WHEN o.cod = 0 THEN 1 ELSE 0 END) AS 'Total non COD Orders'
FROM T_Acct_CompanyProfile c
INNER JOIN T_Inv_Order o ON c.id = o.cust_id
WHERE c.type_id = 1
group by o.cust_id, c.name

As the queries seem to be very similar you could use a union. So your queries become:
SELECT o.cust_id, UPPER(c.name), count(o.order_no) AS 'Total COD Orders' FROM T_Acct_CompanyProfile c
INNER JOIN T_Inv_Order o ON c.id = o.cust_id
WHERE c.type_id = 1 AND o.cod = 1
group by o.cust_id, c.name
UNION
SELECT o.cust_id, UPPER(c.name), count(o.order_no) AS 'Total COD Orders' FROM T_Acct_CompanyProfile c
INNER JOIN T_Inv_Order o ON c.id = o.cust_id
WHERE c.type_id = 1 AND o.cod = 0
group by o.cust_id, c.name
This will combine the two queries into one result set. See http://www.w3schools.com/sql/sql_union.asp for more information on the UNION keyword.

Just use conditional aggregation:
SELECT o.cust_id, UPPER(c.name),
SUM(CASE WHEN o.cod = 1 THEN 1 ELSE 0 END) as TotalCODOrders,
SUM(CASE WHEN o.cod = 0 THEN 1 ELSE 0 END) as TotalNonCODOrders
FROM T_Acct_CompanyProfile c INNER JOIN
T_Inv_Order o
ON c.id = o.cust_id
WHERE c.type_id = 1
GROUP BY o.cust_id, c.name;
If o.cod only takes on the values 0 and 1, then you can use the short-cut:
SELECT o.cust_id, UPPER(c.name),
SUM(o.cod = 1) as TotalCODOrders,
SUM(1 - o.cod) as TotalNonCODOrders,
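A quick sketch to check that the shortcut and the CASE form agree, using SQLite and an invented orders table. The expression SUM(o.cod = 1) depends on a comparison evaluating to 0 or 1, which holds in MySQL and SQLite; SQL Server does not allow a comparison as a value, so the CASE form is the portable one. SUM(1 - o.cod) is plain arithmetic and only assumes cod is never NULL and never anything other than 0 or 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (cust_id INTEGER, cod INTEGER);  -- hypothetical table
INSERT INTO orders VALUES (1, 1), (1, 1), (1, 0), (2, 0);
""")

rows = conn.execute("""
SELECT cust_id,
       SUM(CASE WHEN cod = 1 THEN 1 ELSE 0 END) AS cod_case,
       SUM(cod = 1)                             AS cod_shortcut,
       SUM(CASE WHEN cod = 0 THEN 1 ELSE 0 END) AS non_cod_case,
       SUM(1 - cod)                             AS non_cod_shortcut
FROM orders
GROUP BY cust_id
ORDER BY cust_id
""").fetchall()
for row in rows:
    print(row)
```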

Use Union
SELECT o.cust_id, UPPER(c.name), count(o.order_no) AS 'Total COD Orders'
FROM T_Acct_CompanyProfile c INNER JOIN
T_Inv_Order o
ON c.id = o.cust_id
WHERE c.type_id = 1 AND o.cod = 1
GROUP BY o.cust_id, c.name
UNION
SELECT o.cust_id, UPPER(c.name), count(o.order_no) AS 'Total COD Orders'
FROM T_Acct_CompanyProfile c INNER JOIN
T_Inv_Order o
ON c.id = o.cust_id
WHERE c.type_id = 1 AND o.cod = 0
GROUP BY o.cust_id, c.name;
