Cross Selling Matrix - In Snowflake - sql

I am trying to build a cross selling matrix with the following structure pivoted as seen below where X is the % of frequency in a basket with the other product:
I need to pivot this data in excel or another tool afterwards so I assume the query in Snowflake needs to output tabular dataset ready for pivoting, and I am struggling with its logic.
This is what I have so far:
SELECT FCT.TRANSACTION_ID,
PRD.PRODUCT_TYPE,
COUNT(DISTINCT FCT.PRODUCT_ID),
COUNT(DISTINCT FCT1.PRODUCT_ID)
FROM TRANSACTION_ORDERS FCT
INNER JOIN DIM_PRODUCT PRD ON FCT.PRODUCT_ID = PRD.PRODUCT_ID
LEFT JOIN FACT_TRANSACTION_ORDERS FCT1 ON FCT.TRANSACTION_ID = FCT1.TRANSACTION_ID
AND FCT.PRODUCT_ID != FCT1.PRODUCT_ID
GROUP BY FCT.TRANSACTION_ID, FCT.PRODUCT_ID, FCT1.PRODUCT_ID
Is the joining even correct? Or should I be doing a cross join? Also, how to capture percent frequency of both products in the same basket?
Many thanks!
EDIT: I am trying to capture the frequency of different product types appearing in the same basket.
The values are the same for combinations in both directions. ProductType1 intersection with column ProductType2 is the same value as column Product Type1 row ProductType2.
When in a basket cross analysis they should vary. It is not the same per direction. In other words, baskets with ProductType1 may have ProductType2 X % of the time but baskets with ProductType2 should have ProductType1 with Y% of the time.

You want a self join. I would expect the products to be in the same order, but you seem to be using the same transaction. In any case, this is the structure of the query:
WITH TP AS (
SELECT T.*, P.PRODUCT_TYPE
FROM TRANSACTION_ORDERS T JOIN
DIM_PRODUCT P
ON T.PRODUCT_ID = P.PRODUCT_ID
)
SELECT TP.PRODUCT_TYPE, TP2.PRODUCT_TYPE,
COUNT(DISTINCT TP.TRANSACTION_ID) as NUM_ORDERS
FROM TP JOIN
TP TP2
ON TP2.TRANSACTION_ID = TP.TRANSACTION_ID
GROUP BY TP.PRODUCT_TYPE, TP2.PRODUCT_TYPE;
If this were per order, you would just change the ON clause in the outer query to use the order id.
Note that this uses COUNT(DISTINCT) rather than COUNT(*) because a transaction/order could have multiple products of the same type. Presumably, you want that counted only once.
EDIT:
If you want to divide by the number of transactions that have either product type (which makes sense to me), then I would approach this as:
WITH TP AS (
SELECT DISTINCT T.TRANSACTION_ID, P.PRODUCT_TYPE
FROM TRANSACTION_ORDERS T JOIN
DIM_PRODUCT P
ON T.PRODUCT_ID = P.PRODUCT_ID
)
SELECT TP.PRODUCT_TYPE, TP2.PRODUCT_TYPE,
COUNT(*) as NUM_ORDERS,
( MAX(CASE WHEN TP.PRODUCT_TYPE = TP2.PRODUCT_TYPE THEN COUNT(*) END) OVER (PARTITION BY TP.PRODUCT_TYPE) +
MAX(CASE WHEN TP.PRODUCT_TYPE = TP2.PRODUCT_TYPE THEN COUNT(*) END) OVER (PARTITION BY TP2.PRODUCT_TYPE) -
COUNT(*)
) as Num_Orders_Either,
( COUNT(*) * 1.0 /
( MAX(CASE WHEN TP.PRODUCT_TYPE = TP2.PRODUCT_TYPE THEN COUNT(*) END) OVER (PARTITION BY TP.PRODUCT_TYPE) +
MAX(CASE WHEN TP.PRODUCT_TYPE = TP2.PRODUCT_TYPE THEN COUNT(*) END) OVER (PARTITION BY TP2.PRODUCT_TYPE) -
COUNT(*)
)
) as ratio
FROM TP JOIN
TP TP2
ON TP2.TRANSACTION_ID = TP.TRANSACTION_ID
GROUP BY TP.PRODUCT_TYPE, TP2.PRODUCT_TYPE;
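This calculates the number of orders containing either product type as the number of orders with the first type plus the number with the second type, minus the number with both. The ratio is then the number of orders with both types divided by the number with either.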
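The counting logic above can be checked on a tiny data set. The following is a minimal sketch that runs the same idea in SQLite through Python rather than in Snowflake; the table name basket and its seven rows are made up for illustration, and the window expressions are replaced by plain CTEs. It also adds a directional column, share_of_a (baskets containing both types divided by baskets containing the row type), which is one possible reading of the per-direction percentages asked about in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: one row per (transaction, product type).
conn.executescript("""
CREATE TABLE basket (transaction_id INTEGER, product_type TEXT);
INSERT INTO basket VALUES
  (1, 'A'), (1, 'B'),
  (2, 'A'),
  (3, 'A'), (3, 'B'), (3, 'C'),
  (4, 'B');
""")

query = """
WITH tp AS (
  SELECT DISTINCT transaction_id, product_type FROM basket
),
per_type AS (   -- number of baskets containing each type
  SELECT product_type, COUNT(*) AS n FROM tp GROUP BY product_type
),
pairs AS (      -- number of baskets containing both types of a pair
  SELECT a.product_type AS type_a, b.product_type AS type_b, COUNT(*) AS n_both
  FROM tp a JOIN tp b ON a.transaction_id = b.transaction_id
  WHERE a.product_type <> b.product_type
  GROUP BY a.product_type, b.product_type
)
SELECT p.type_a, p.type_b, p.n_both,
       ta.n + tb.n - p.n_both                    AS n_either,
       1.0 * p.n_both / (ta.n + tb.n - p.n_both) AS ratio_either,
       1.0 * p.n_both / ta.n                     AS share_of_a
FROM pairs p
JOIN per_type ta ON ta.product_type = p.type_a
JOIN per_type tb ON tb.product_type = p.type_b
ORDER BY p.type_a, p.type_b
"""

rows = conn.execute(query).fetchall()
# Key the result by (row type, column type) so it is easy to pivot later.
matrix = {(a, b): (both, either, ratio, share)
          for a, b, both, either, ratio, share in rows}

for key, value in matrix.items():
    print(key, value)
```

The ratio_either column is symmetric, while share_of_a differs by direction: every basket with C also contains A, but only a third of the baskets with A contain C.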

Related

Prevent duplicate rows when using LEFT JOIN in Postgres without DISTINCT

I have 4 tables:
Item
Purchase
Purchase Item
Purchase Discount
In these tables, the Purchase Discount has two entries, all the others have only one entry. But when I query them, due to the LEFT JOIN, I'm getting duplicate entries.
This query will be running in a large database, and I heard using DISTINCT will reduce the performance. Is there any other way I can remove duplicates without using DISTINCT?
Here is the SQL Fiddle.
The result shows:
[{"item_id":1,"purchase_items_ids":[1234,1234],"total_sold":2}]
But the result should come as:
[{"item_id":1,"purchase_items_ids":[1234],"total_sold":1}]
Using correlated subquery instead of LEFT JOIN:
SELECT array_to_json(array_agg(p_values)) FROM
(
SELECT t.item_id, t.purchase_items_ids, t.total_sold, t.discount_amount FROM
(
SELECT purchase_items.item_id AS item_id,
ARRAY_AGG(purchase_items.id) AS purchase_items_ids,
SUM(purchase_items.sold) as total_sold,
SUM((SELECT SUM(pd.discount_amount) FROM purchase_discounts pd
WHERE pd.purchase_id = purchase.id)) as discount_amount
FROM items
INNER JOIN purchase_items ON purchase_items.item_id = items.id
INNER JOIN purchase ON purchase.id = purchase_items.purchase_id
WHERE purchase.id = 200
GROUP by purchase_items.item_id
) as t
INNER JOIN items i ON i.id = t.item_id
) AS p_values;
db<>fiddle demo
Output:
[{"item_id":1,"purchase_items_ids":[1234],"total_sold":1,"discount_amount":12}]
First, I would suggest removing INNER JOIN items i ON i.id = t.item_id from the query, since it has no reason to be there.
Then, instead of left joining the purchase_discounts table, use a subquery to get the discount_amount (as mentioned in Lukasz Szozda's answer).
If there is no discount for a product, the discount_amount column will display NULL. If you want to avoid that, you can use COALESCE() as below instead:
COALESCE(SUM((select sum(discount_amount) from purchase_discounts
where purchase_discounts.purchase_id = purchase.id)),0) as discount_amount
Db-Fiddle:
SELECT array_to_json(array_agg(p_values)) FROM
(
SELECT t.item_id, t.purchase_items_ids, t.total_sold, t.discount_amount FROM
(
SELECT purchase_items.item_id AS item_id,
ARRAY_AGG(purchase_items.id) AS purchase_items_ids,
SUM(purchase_items.sold) as total_sold,
SUM((select sum(discount_amount) from purchase_discounts
where purchase_discounts.purchase_id = purchase.id)) as discount_amount
FROM items
INNER JOIN purchase_items ON purchase_items.item_id = items.id
INNER JOIN purchase ON purchase.id = purchase_items.purchase_id
WHERE
purchase.id = 200
GROUP by
purchase_items.item_id
) as t
) AS p_values;
Output:
array_to_json
[{"item_id":1,"purchase_items_ids":[1234],"total_sold":1,"discount_amount":12}]
db<>fiddle here
The core problem is that your LEFT JOIN multiplies rows. See:
Two SQL LEFT JOINS produce incorrect result
Aggregate discounts to a single row before the join. Or use a (uncorrelated) subquery expression:
SELECT json_agg(items)
FROM (
SELECT pi.item_id
, array_agg(pi.id) AS purchase_items_ids
, sum(pi.sold) AS total_sold
,(SELECT COALESCE(sum(pd.discount_amount), 0)
FROM purchase_discounts pd
WHERE pd.purchase_id = 200) AS discount_amount
FROM purchase_items pi
WHERE pi.purchase_id = 200
GROUP BY 1
) AS items;
Result:
[{"item_id":1,"purchase_items_ids":[1234],"total_sold":1,"discount_amount":12}]
db<>fiddle here
I added a couple of additional improvements:
Assuming referential integrity enforced by FK constraints, we don't need to involve the tables purchase and items at all.
Removed a subquery level doing nothing.
Using json_agg() instead of array_to_json(array_agg()).
Added COALESCE() to output 0 instead of NULL for no discounts.
Since discounts apply to the purchase in your model, not to individual items, it doesn't make sense to output discount_amount for every single item. Consider this query instead to return an array of items and a single, separate discount_amount:
SELECT json_build_object(
'items'
, json_agg(items)
, 'discount_amount'
, (SELECT COALESCE(sum(pd.discount_amount), 0)
FROM purchase_discounts pd
WHERE pd.purchase_id = 200)
)
FROM (
SELECT pi.item_id
, array_agg(pi.id) AS purchase_items_ids
, sum(pi.sold) AS total_sold
FROM purchase_items pi
WHERE pi.purchase_id = 200
GROUP BY 1
) AS items;
Result:
{"items" : [{"item_id":1,"purchase_items_ids":[1234],"total_sold":1}], "discount_amount" : 12}
db<>fiddle here
Using json_build_object() to assemble the JSON object.
Your example with a single item in the purchase isn't too revealing. I added a purchase with multiple items and no discount to my fiddle.
If you can have multiple values only in the purchase_discounts table, then a subquery that aggregates the multiple purchase_discounts rows into one before the join can solve the problem:
SELECT array_to_json(array_agg(p_values)) FROM
(
SELECT t.item_id, t.purchase_items_ids, t.total_sold, t.discount_amount FROM
(
SELECT purchase_items.item_id AS item_id,
ARRAY_AGG(purchase_items.id) AS purchase_items_ids,
SUM(purchase_items.sold) as total_sold,
X.discount_amount
FROM items
INNER JOIN purchase_items ON purchase_items.item_id = items.id
INNER JOIN purchase ON purchase.id = purchase_items.purchase_id
LEFT JOIN (SELECT purchase_id, sum(purchase_discounts.discount_amount) AS discount_amount FROM purchase_discounts GROUP BY purchase_id) X ON X.purchase_id = purchase.id
WHERE
purchase.id = 200
GROUP by
purchase_items.item_id,
X.discount_amount
) as t
INNER JOIN items i ON i.id = t.item_id
) AS p_values;
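To see why aggregating before the join helps, here is a minimal sketch in SQLite via Python (not Postgres). The schema is reduced to two tables, and the discount amounts 5 and 7 are assumptions chosen only so that they add up to the 12 shown in the outputs above. The first query joins the raw discount rows and double counts sold; the second joins a one-row-per-purchase subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of the schema: one purchase item, two discounts.
conn.executescript("""
CREATE TABLE purchase_items (id INTEGER, purchase_id INTEGER, item_id INTEGER, sold INTEGER);
CREATE TABLE purchase_discounts (purchase_id INTEGER, discount_amount INTEGER);
INSERT INTO purchase_items VALUES (1234, 200, 1, 1);
INSERT INTO purchase_discounts VALUES (200, 5), (200, 7);
""")

# Joining the raw discount rows duplicates the purchase item once per discount.
naive = conn.execute("""
SELECT i.item_id, SUM(i.sold) AS total_sold, SUM(d.discount_amount) AS discount_amount
FROM purchase_items i
LEFT JOIN purchase_discounts d ON d.purchase_id = i.purchase_id
WHERE i.purchase_id = 200
GROUP BY i.item_id
""").fetchall()

# Collapsing discounts to one row per purchase first keeps the item rows unique.
fixed = conn.execute("""
SELECT i.item_id, SUM(i.sold) AS total_sold, COALESCE(d.discount_amount, 0) AS discount_amount
FROM purchase_items i
LEFT JOIN (SELECT purchase_id, SUM(discount_amount) AS discount_amount
           FROM purchase_discounts
           GROUP BY purchase_id) d ON d.purchase_id = i.purchase_id
WHERE i.purchase_id = 200
GROUP BY i.item_id, d.discount_amount
""").fetchall()

print(naive)  # total_sold is inflated to 2
print(fixed)  # total_sold is 1
```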
The LEFT JOIN is not causing your duplicates, I understand why you need it as there may not be any discounts, but for the data provided changing to an inner join produces the same result. You are getting duplicate entries because you use ARRAY_AGG(purchase_items.id). Further, with the data presented, the tables item and purchase are not necessary. You can use the window version of sum and distinct on to reduce the duplication of purchase_id, and eliminate the mentioned tables. Finally the middle select ... ) t can be completely removed. Resulting in: (see demo)
select array_to_json(array_agg(p_values))
from (select distinct on (pi.item_id, pi.id)
pi.item_id
, pi.id purchase_items_ids
, sum(pi.sold) over (partition by pi.item_id) total_sold
, sum(pd.discount_amount) over(partition by pi.item_id) discount_amount
from purchase_items pi
left join purchase_discounts pd
on pd.purchase_id = pi.purchase_id
order by pi.item_id, pi.id
) as p_values;
I don't think the LEFT JOIN is the cause, because an INNER JOIN returns the same result as the LEFT JOIN. The discount table has 2 rows for purchase_id = 200, so you can use ROW_NUMBER() with PARTITION BY, like this:
ROW_NUMBER() OVER(PARTITION BY purchase_items.id order by purchase_items.id) rn
Then keep only the rows where rn = 1.
For the sum, I think you can also change your query to use a window function with PARTITION BY.

Select most Occurred Value SQL with Inner Join

I am using this query to get the following data from different linked tables. But let's say an item has three vendors. In the result I want to show the vendor that occurred most often. That is, if item ABC was supplied by 3 different vendors many times, I want to get the vendor who supplied item ABC the most times.
My query is this.
use iBusinessFlex;
SELECT Items.Name,
Max(Items.ItemID) as ItemID ,
MAX(Items.Description)as Description,
MAX(ItemsStock.CurrentPrice) as UnitPrice,
MAX(ItemsStock.Quantity) as StockQuantiity,
MAX(Vendors.VendorName) as VendorName,
SUM(ItemReceived.Quantity) as TotalQuantity
From ItemReceived
INNER JOIN Items ON ItemReceived.ItemId=Items.ItemID
INNER JOIN ItemsStock ON ItemReceived.ItemId=ItemsStock.ItemID
INNER JOIN PurchaseInvoices ON PurchaseInvoices.PurchaseInvoiceId = ItemReceived.PurchaseInvoiceId
INNER JOIN Vendors ON Vendors.VendorId = PurchaseInvoices.VendorId
Group By Items.Name
EDIT: I have included this subquery, but I am not sure it shows the correct result, i.e. for each item, the vendor who provided that item the most times.
use iBusinessFlex;
SELECT Items.Name,
Max(Items.ItemID) as ItemID ,
MAX(Items.Description)as Description,MAX(ItemsStock.CurrentPrice) as UnitPrice,
MAX(ItemsStock.Quantity) as StockQuantiity,MAX(Vendors.VendorName) as VendorName,
SUM(ItemReceived.Quantity) as TotalQuantity
From ItemReceived
INNER JOIN Items ON ItemReceived.ItemId=Items.ItemID INNER JOIN ItemsStock
ON ItemReceived.ItemId=ItemsStock.ItemID INNER JOIN PurchaseInvoices
ON PurchaseInvoices.PurchaseInvoiceId = ItemReceived.PurchaseInvoiceId INNER JOIN Vendors
ON Vendors.VendorId IN (
SELECT Top 1 MAX(PurchaseInvoices.VendorId) as VendorOccur
FROM PurchaseInvoices INNER JOIN Vendors ON Vendors.VendorId=PurchaseInvoices.VendorId
GROUP BY PurchaseInvoices.VendorId
ORDER BY COUNT(*) DESC
And the Result Looks like this.
First, I would start with who ordered what thing the most. But MOST is based on what: the most quantity? Price? Number of times? If you order 6 times from one vendor with a quantity of 10 each, you have 60 things. But if you order 1 time from another vendor for a quantity of 100, which one wins? You have to decide the basis of MOST, but I will go with most times, per your original question.
Everything comes from PurchaseInvoices, which has a vendor ID. For the counts I only need the vendor's ID and the item's ID, not their names. The query below shows, per item, each vendor with the number of times ordered and the quantity ordered. I added the Items and Vendors joins just to show the names.
select
IR.ItemID,
PI.VendorID,
max( I.Name ) Name,
max( V.VendorName ) VendorName,
count(*) as TimesOrderedFrom,
SUM( IR.Quantity ) as QuantityFromVendor
from
ItemReceived IR
JOIN PurchaseInvoices PI
on IR.PurchaseInvoiceID = PI.PurchaseInvoiceID
JOIN Items I
on IR.ItemID = I.ItemID
JOIN Vendors V
on PI.VendorID = V.VendorID
group by
IR.ItemID,
PI.VendorID
order by
-- Per item
IR.ItemID,
-- Most count ordered first
count(*) DESC,
-- If multiple vendors, same count, highest total quantity first
sum( IR.Quantity ) DESC
Now, to get only 1 vendor per item, turn this into a correlated subquery and add 'TOP 1' so that only the first row in this order is returned. Since the aggregate of count is already done, you can then get the vendor contact info.
select
I.Name,
V.VendorName,
TopVendor.TimesOrderedFrom,
TopVendor.QuantityFromVendor
from
Items I
-- CROSS APPLY (rather than JOIN) lets the derived table reference I.ItemID
CROSS APPLY ( select TOP 1
IR.ItemID,
PI.VendorID,
count(*) as TimesOrderedFrom,
SUM( IR.Quantity ) as QuantityFromVendor
from
ItemReceived IR
JOIN PurchaseInvoices PI
on IR.PurchaseInvoiceID = PI.PurchaseInvoiceID
where
-- correlated subquery based on the outer-most item
IR.ItemID = I.ItemID
group by
IR.ItemID,
PI.VendorID
order by
-- Most count ordered first
count(*) DESC,
-- If multiple vendors, same count, highest total quantity first
sum( IR.Quantity ) DESC ) TopVendor
JOIN Vendors V
on TopVendor.VendorID = V.VendorID
No sense in having the INNER Subquery joining on the vendor and items just for the names. Get those once and only at the end when the top vendor is selected.
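The "most times, then most quantity" rule can be tried out on a small data set. This is a minimal sketch in SQLite via Python, not SQL Server: it uses a scalar correlated subquery with LIMIT 1 where SQL Server would use TOP 1, and the snake_case table names and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced versions of Items, PurchaseInvoices and ItemReceived.
conn.executescript("""
CREATE TABLE items (item_id INTEGER, name TEXT);
CREATE TABLE purchase_invoices (purchase_invoice_id INTEGER, vendor_id INTEGER);
CREATE TABLE item_received (item_id INTEGER, purchase_invoice_id INTEGER, quantity INTEGER);
INSERT INTO items VALUES (1, 'ABC'), (2, 'XYZ');
INSERT INTO purchase_invoices VALUES (1, 10), (2, 10), (3, 20), (4, 10), (5, 20);
INSERT INTO item_received VALUES
  (1, 1, 10), (1, 2, 10), (1, 3, 100),  -- ABC: vendor 10 twice, vendor 20 once
  (2, 4, 5), (2, 5, 8);                 -- XYZ: one delivery each, vendor 20 has more quantity
""")

top_vendors = conn.execute("""
SELECT i.name,
       (SELECT inv.vendor_id
        FROM item_received ir
        JOIN purchase_invoices inv
          ON inv.purchase_invoice_id = ir.purchase_invoice_id
        WHERE ir.item_id = i.item_id          -- correlated on the outer item
        GROUP BY inv.vendor_id
        ORDER BY COUNT(*) DESC,               -- most deliveries first
                 SUM(ir.quantity) DESC        -- tie-breaker: most quantity
        LIMIT 1) AS top_vendor_id
FROM items i
ORDER BY i.item_id
""").fetchall()

print(top_vendors)
```

For ABC, vendor 10 wins on number of deliveries even though vendor 20 delivered a larger quantity; for XYZ the counts tie and the quantity decides.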

Joining two tables on SQL Server

I have two SQL Server tables: ORDR (orders) and RDR1 (order's items). I'm trying to create a report which shows:
DocEntry, CardName, DocDueDate: info about the order
pTot: total amount of items in the order
ItemCode: item's code (any of them, only one is needed)
Dscription: item's name
My last attempt was:
SELECT
dbo.ORDR.DocEntry, dbo.ORDR.CardName, dbo.ORDR.DocDueDate,
SUM(dbo.RDR1.Quantity) AS pTot,
dbo.RDR1.ItemCode,
dbo.RDR1.Dscription
FROM
dbo.ORDR
INNER JOIN
dbo.RDR1 ON dbo.ORDR.DocEntry = dbo.RDR1.DocEntry
GROUP BY
dbo.ORDR.DocEntry, dbo.ORDR.CardName, dbo.ORDR.DocDueDate,
dbo.RDR1.ItemCode, dbo.RDR1.Dscription
Items' code/name in one order are very similar so I need only the first RDR1's record associated to that order
I have 2 problems:
I'm getting one row for each RDR1 record
pTot is not summing the amount of items
Can you show me how to join these tables properly?
You could use ROW_NUMBER to get the first RDR1 item for each ORDR and SUM OVER to get the total amount of items.
SELECT
o.DocEntry,
o.CardName,
o.DocDueDate,
r.pTot,
r.ItemCode,
r.Dscription
FROM dbo.ORDR o
INNER JOIN (
SELECT *,
rn = ROW_NUMBER() OVER(PARTITION BY DocEntry ORDER BY ItemCode),
pTot = SUM(Quantity) OVER(PARTITION BY DocEntry)
FROM dbo.RDR1
) r
ON r.DocEntry = o.DocEntry
WHERE r.rn = 1
Additionally, you might want to use meaningful table aliases to improve readability.
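As a quick check of the ROW_NUMBER plus SUM OVER approach, here is a minimal sketch that runs the same query shape in SQLite via Python (window functions need SQLite 3.25 or later). The column names follow the question; the two orders and their items are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample rows for the two tables from the question.
conn.executescript("""
CREATE TABLE ORDR (DocEntry INTEGER, CardName TEXT, DocDueDate TEXT);
CREATE TABLE RDR1 (DocEntry INTEGER, ItemCode TEXT, Dscription TEXT, Quantity INTEGER);
INSERT INTO ORDR VALUES (1, 'Acme', '2024-01-10'), (2, 'Zenith', '2024-02-01');
INSERT INTO RDR1 VALUES
  (1, 'B-2', 'Bolt 2', 3),
  (1, 'B-1', 'Bolt 1', 2),
  (2, 'N-1', 'Nut 1', 5);
""")

report = conn.execute("""
SELECT o.DocEntry, o.CardName, o.DocDueDate, r.pTot, r.ItemCode, r.Dscription
FROM ORDR o
JOIN (
  SELECT DocEntry, ItemCode, Dscription,
         -- number the items of each order, lowest ItemCode first
         ROW_NUMBER() OVER (PARTITION BY DocEntry ORDER BY ItemCode) AS rn,
         -- total quantity of the whole order, repeated on every item row
         SUM(Quantity) OVER (PARTITION BY DocEntry) AS pTot
  FROM RDR1
) r ON r.DocEntry = o.DocEntry
WHERE r.rn = 1
ORDER BY o.DocEntry
""").fetchall()

for row in report:
    print(row)
```

Each order comes back exactly once, with the order-wide total and the first item by ItemCode.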
Here is my proposed solution.
SELECT
[rowno] = ROW_NUMBER() OVER(PARTITION BY O.DocEntry ORDER BY R.ItemCode),
O.DocEntry,
O.CardName,
O.DocDueDate,
-- windowed sum over the per-item sums gives the total for the whole order
SUM(SUM(R.Quantity)) OVER(PARTITION BY O.DocEntry) AS pTot,
R.ItemCode,
R.Dscription
INTO #TEMP_ORDER
FROM dbo.ORDR O
INNER JOIN dbo.RDR1 R
ON O.DocEntry = R.DocEntry
GROUP BY O.DocEntry, O.CardName, O.DocDueDate, R.ItemCode, R.Dscription;
SELECT
DocEntry,
CardName,
DocDueDate,
pTot,
ItemCode,
Dscription
FROM #TEMP_ORDER
WHERE rowno = 1;
DROP TABLE #TEMP_ORDER;

Find MAX with JOIN where Field also shows up in another Table

I have 3 tables: Master, Paper and iCodes. For a certain set of Master.Ref's, I need to find Max(Paper.Date), where the Paper.Code is also in the iCodes table (i.e., Paper.Code is a type of iCode). Master is joined to Paper by the File field.
EDIT:
I only need the Max(Paper.Date) and its corresponding Code; I do not need all of the Codes.
I wrote the following but it is very slow. I have a few hundred ref #'s to look for. What is a better way to do this?
SELECT Master.Ref,
Paper.Code,
mp.MaxDate
FROM ( SELECT p.File ,
MAX(p.Date) AS MaxDate
FROM Paper AS p
LEFT JOIN Master AS m ON p.File = m.File
WHERE m.Ref IN ('ref1', 'ref2', 'ref3', 'ref4', 'ref5', 'ref6'... )
AND p.Code IN ( SELECT DISTINCT i.iCode
FROM iCodes AS i
)
GROUP BY p.File
) AS mp
LEFT JOIN Master ON mp.File = Master.File
LEFT JOIN Paper ON Master.File = Paper.File
AND mp.MaxDate = Paper.Date
WHERE Paper.Code IN ( SELECT DISTINCT iCodes.iCode
FROM iCodes
)
Does this do what you want?
SELECT m.Ref, p.Code, max(p.date)
FROM Master m LEFT JOIN
Paper p
ON m.File = p.File
WHERE p.Code IN (SELECT DISTINCT iCodes.iCode FROM iCodes) and
m.Ref IN ('ref1','ref2','ref3','ref4','ref5','ref6'...)
GROUP BY m.Ref, p.Code;
EDIT:
To get the code on the max date, then use window functions:
select ref, code, date
from (SELECT m.Ref, p.Code, p.date,
row_number() over (partition by m.Ref order by p.date desc) as seqnum
FROM Master m LEFT JOIN
Paper p
ON m.File = p.File
WHERE p.Code IN (SELECT DISTINCT iCodes.iCode FROM iCodes) and
m.Ref IN ('ref1','ref2','ref3','ref4','ref5','ref6'...)
) mp
where seqnum = 1;
The function row_number() assigns a sequential number starting at 1 to a group of rows. The groups are defined by the partition by clause, so in this case everything with the same m.Ref value would be in a single group. Within the group, rows are assigned the number based on the order by clause. So, the one with the biggest date gets the value of 1. That is the row you want.
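The numbering described above can be seen on a small example. This is a minimal sketch in SQLite via Python (window functions need SQLite 3.25 or later); the table and column names follow the question, while the rows are invented. Code Z is deliberately missing from iCodes so that the newest paper of ref1 is filtered out before the numbering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample rows for Master, Paper and iCodes.
conn.executescript("""
CREATE TABLE Master (Ref TEXT, File TEXT);
CREATE TABLE Paper (File TEXT, Code TEXT, Date TEXT);
CREATE TABLE iCodes (iCode TEXT);
INSERT INTO Master VALUES ('ref1', 'F1'), ('ref2', 'F2');
INSERT INTO Paper VALUES
  ('F1', 'X', '2020-01-01'),
  ('F1', 'Y', '2021-05-05'),
  ('F1', 'Z', '2022-01-01'),  -- newest, but Z is not an iCode
  ('F2', 'X', '2019-03-03');
INSERT INTO iCodes VALUES ('X'), ('Y');
""")

latest = conn.execute("""
SELECT Ref, Code, Date
FROM (
  SELECT m.Ref, p.Code, p.Date,
         -- 1 = most recent qualifying paper within each Ref
         ROW_NUMBER() OVER (PARTITION BY m.Ref ORDER BY p.Date DESC) AS seqnum
  FROM Master m
  JOIN Paper p ON p.File = m.File
  WHERE p.Code IN (SELECT iCode FROM iCodes)
    AND m.Ref IN ('ref1', 'ref2')
) mp
WHERE seqnum = 1
ORDER BY Ref
""").fetchall()

print(latest)
```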

Select SUM from multiple tables

I keep getting the wrong sum value when I join 3 tables.
Here is a pic of the ERD of the table:
(Original here: http://dl.dropbox.com/u/18794525/AUG%207%20DUMP%20STAN.png )
Here is the query:
select SUM(gpCutBody.actualQty) as cutQty , SUM(gpSewBody.quantity) as sewQty
from jobOrder
inner join gpCutHead on gpCutHead.joNum = jobOrder.joNum
inner join gpSewHead on gpSewHead.joNum = jobOrder.joNum
inner join gpCutBody on gpCutBody.gpCutID = gpCutHead.gpCutID
inner join gpSewBody on gpSewBody.gpSewID = gpSewHead.gpSewID
If you are only interested in the quantities of cuts and sews for all orders, the simplest way to do it would be like this:
select (select SUM(gpCutBody.actualQty) from gpCutBody) as cutQty,
(select SUM(gpSewBody.quantity) from gpSewBody) as sewQty
(This assumes that cuts and sews will always have associated job orders.)
If you want to see a breakdown of cuts and sews by job order, something like this might be preferable:
select joNum, SUM(actualQty) as cutQty, SUM(quantity) as sewQty
from (select joNum, actualQty, 0 as quantity
from gpCutBody
union all
select joNum, 0 as actualQty, quantity
from gpSewBody) sc
group by joNum
Mark's approach is a good one. I want to suggest the alternative of doing the group by's before the union, simply because this can be a more general approach for summing along multiple dimensions.
Your problem is that you have two dimensions that you want to sum along, and you are getting a cross product of the values in the join.
select coalesce(act.joNum, q.joNum) as joNum, act.quantity as ActualQty, q.quantity as Quantity
from (select joNum, sum(actualQty) as quantity
from gpCutBody
group by joNum
) act full outer join
(select joNum, sum(quantity) as quantity
from gpSewBody
group by joNum
) q
on act.joNum = q.joNum
(I have kept Mark's assumption that doing this by joNum is the desired output.)
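The cross product effect is easy to reproduce. The following is a minimal sketch in SQLite via Python; the head and body tables are flattened into two hypothetical tables, cut and sew, each carrying joNum directly, and the quantities are invented. The direct join repeats every cut row once per sew row (and the reverse), while the UNION ALL form sums each table independently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened tables: 2 cut rows and 3 sew rows for the same job order.
conn.executescript("""
CREATE TABLE cut (joNum INTEGER, actualQty INTEGER);
CREATE TABLE sew (joNum INTEGER, quantity INTEGER);
INSERT INTO cut VALUES (1, 10), (1, 20);
INSERT INTO sew VALUES (1, 5), (1, 7), (1, 9);
""")

# Direct join: 2 x 3 = 6 rows, so both sums are inflated.
naive = conn.execute("""
SELECT SUM(c.actualQty), SUM(s.quantity)
FROM cut c
JOIN sew s ON s.joNum = c.joNum
""").fetchone()

# UNION ALL keeps the two sets of rows separate before summing.
fixed = conn.execute("""
SELECT joNum, SUM(actualQty) AS cutQty, SUM(quantity) AS sewQty
FROM (SELECT joNum, actualQty, 0 AS quantity FROM cut
      UNION ALL
      SELECT joNum, 0 AS actualQty, quantity FROM sew) sc
GROUP BY joNum
""").fetchall()

print(naive)  # (90, 42): 30 counted 3 times, 21 counted twice
print(fixed)  # [(1, 30, 21)]
```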