Prevent duplicate rows when using LEFT JOIN in Postgres without DISTINCT - sql

I have 4 tables:
Item
Purchase
Purchase Item
Purchase Discount
In these tables, Purchase Discount has two entries; all the others have only one. But when I query them, the LEFT JOIN gives me duplicate entries.
This query will run against a large database, and I have heard that using DISTINCT hurts performance. Is there another way to remove the duplicates without using DISTINCT?
Here is the SQL Fiddle.
The result shows:
[{"item_id":1,"purchase_items_ids":[1234,1234],"total_sold":2}]
But the result should come as:
[{"item_id":1,"purchase_items_ids":[1234],"total_sold":1}]

Using correlated subquery instead of LEFT JOIN:
SELECT array_to_json(array_agg(p_values)) FROM
(
SELECT t.item_id, t.purchase_items_ids, t.total_sold, t.discount_amount FROM
(
SELECT purchase_items.item_id AS item_id,
ARRAY_AGG(purchase_items.id) AS purchase_items_ids,
SUM(purchase_items.sold) as total_sold,
SUM((SELECT SUM(pd.discount_amount) FROM purchase_discounts pd
WHERE pd.purchase_id = purchase.id)) as discount_amount
FROM items
INNER JOIN purchase_items ON purchase_items.item_id = items.id
INNER JOIN purchase ON purchase.id = purchase_items.purchase_id
WHERE purchase.id = 200
GROUP by purchase_items.item_id
) as t
INNER JOIN items i ON i.id = t.item_id
) AS p_values;
db<>fiddle demo
Output:
[{"item_id":1,"purchase_items_ids":[1234],"total_sold":1,"discount_amount":12}]

First, I would suggest removing INNER JOIN items i ON i.id = t.item_id from the query, since there is no reason for it to be there.
Then, instead of left joining the purchase_discounts table, use a subquery to get the discount_amount (as mentioned in Lukasz Szozda's answer).
If a purchase has no discounts, the discount_amount column will show NULL. To avoid that, you can use COALESCE() as below:
COALESCE(SUM((select sum(discount_amount) from purchase_discounts
where purchase_discounts.purchase_id = purchase.id)),0) as discount_amount
Db-Fiddle:
SELECT array_to_json(array_agg(p_values)) FROM
(
SELECT t.item_id, t.purchase_items_ids, t.total_sold, t.discount_amount FROM
(
SELECT purchase_items.item_id AS item_id,
ARRAY_AGG(purchase_items.id) AS purchase_items_ids,
SUM(purchase_items.sold) as total_sold,
SUM((select sum(discount_amount) from purchase_discounts
where purchase_discounts.purchase_id = purchase.id)) as discount_amount
FROM items
INNER JOIN purchase_items ON purchase_items.item_id = items.id
INNER JOIN purchase ON purchase.id = purchase_items.purchase_id
WHERE
purchase.id = 200
GROUP by
purchase_items.item_id
) as t
) AS p_values;
Output:
array_to_json
[{"item_id":1,"purchase_items_ids":[1234],"total_sold":1,"discount_amount":12}]
db<>fiddle here

The core problem is that your LEFT JOIN multiplies rows. See:
Two SQL LEFT JOINS produce incorrect result
Aggregate discounts to a single row before the join. Or use a (uncorrelated) subquery expression:
SELECT json_agg(items)
FROM (
SELECT pi.item_id
, array_agg(pi.id) AS purchase_items_ids
, sum(pi.sold) AS total_sold
,(SELECT COALESCE(sum(pd.discount_amount), 0)
FROM purchase_discounts pd
WHERE pd.purchase_id = 200) AS discount_amount
FROM purchase_items pi
WHERE pi.purchase_id = 200
GROUP BY 1
) AS items;
Result:
[{"item_id":1,"purchase_items_ids":[1234],"total_sold":1,"discount_amount":12}]
db<>fiddle here
I added a couple of additional improvements:
Assuming referential integrity enforced by FK constraints, we don't need to involve the tables purchase and items at all.
Removed a subquery level doing nothing.
Using json_agg() instead of array_to_json(array_agg()).
Added COALESCE() to output 0 instead of NULL for no discounts.
Since discounts apply to the purchase in your model, not to individual items, it doesn't make sense to output discount_amount for every single item. Consider this query instead to return an array of items and a single, separate discount_amount:
SELECT json_build_object(
'items'
, json_agg(items)
, 'discount_amount'
, (SELECT COALESCE(sum(pd.discount_amount), 0)
FROM purchase_discounts pd
WHERE pd.purchase_id = 200)
)
FROM (
SELECT pi.item_id
, array_agg(pi.id) AS purchase_items_ids
, sum(pi.sold) AS total_sold
FROM purchase_items pi
WHERE pi.purchase_id = 200
GROUP BY 1
) AS items;
Result:
{"items" : [{"item_id":1,"purchase_items_ids":[1234],"total_sold":1}], "discount_amount" : 12}
db<>fiddle here
Using json_build_object() to assemble the JSON object.
Your example with a single item in the purchase isn't too revealing. I added a purchase with multiple items and no discount to my fiddle.
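To see the row multiplication and the subquery fix side by side, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python (SQLite rather than Postgres, purely so it is self-contained). The table layout and the two discount amounts (5 and 7) are assumptions chosen to add up to the 12 shown above; they are not taken from the fiddle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchase_items (id INTEGER, purchase_id INTEGER, item_id INTEGER, sold INTEGER);
    CREATE TABLE purchase_discounts (id INTEGER, purchase_id INTEGER, discount_amount INTEGER);
    INSERT INTO purchase_items VALUES (1234, 200, 1, 1);
    -- two discount rows for the same purchase (amounts are made up)
    INSERT INTO purchase_discounts VALUES (1, 200, 5), (2, 200, 7);
""")

# Join first, aggregate later: the single purchase item is repeated
# once per discount row, so both the count and the sum are doubled.
joined = conn.execute("""
    SELECT pi.item_id, COUNT(pi.id), SUM(pi.sold)
    FROM purchase_items pi
    LEFT JOIN purchase_discounts pd ON pd.purchase_id = pi.purchase_id
    WHERE pi.purchase_id = 200
    GROUP BY pi.item_id
""").fetchall()

# Subquery expression: discounts are summed on their own and never
# multiply the purchase_items rows.
fixed = conn.execute("""
    SELECT pi.item_id, COUNT(pi.id), SUM(pi.sold),
           (SELECT COALESCE(SUM(pd.discount_amount), 0)
            FROM purchase_discounts pd
            WHERE pd.purchase_id = 200) AS discount_amount
    FROM purchase_items pi
    WHERE pi.purchase_id = 200
    GROUP BY pi.item_id
""").fetchall()

print(joined)  # [(1, 2, 2)]
print(fixed)   # [(1, 1, 1, 12)]
```

The first result reproduces the doubled total_sold from the question; the second matches the expected output without any DISTINCT.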

If only the purchase_discounts table can have multiple rows per purchase, then a subquery that aggregates the purchase_discounts rows into one row per purchase before the join solves the problem:
SELECT array_to_json(array_agg(p_values)) FROM
(
SELECT t.item_id, t.purchase_items_ids, t.total_sold, t.discount_amount FROM
(
SELECT purchase_items.item_id AS item_id,
ARRAY_AGG(purchase_items.id) AS purchase_items_ids,
SUM(purchase_items.sold) as total_sold,
X.discount_amount
FROM items
INNER JOIN purchase_items ON purchase_items.item_id = items.id
INNER JOIN purchase ON purchase.id = purchase_items.purchase_id
LEFT JOIN (SELECT purchase_id, sum(purchase_discounts.discount_amount) AS discount_amount FROM purchase_discounts GROUP BY purchase_id) X ON X.purchase_id = purchase.id
WHERE
purchase.id = 200
GROUP by
purchase_items.item_id,
X.discount_amount
) as t
INNER JOIN items i ON i.id = t.item_id
) AS p_values;
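The pre-aggregation idea can be tried out in isolation. The sketch below uses SQLite from Python with a reduced, assumed schema (only purchase_items and purchase_discounts) and made-up rows, including a second purchase 201 with no discounts to show what the LEFT JOIN contributes. The COALESCE is an addition for the no-discount case and is not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchase_items (id INTEGER, purchase_id INTEGER, item_id INTEGER, sold INTEGER);
    CREATE TABLE purchase_discounts (id INTEGER, purchase_id INTEGER, discount_amount INTEGER);
    INSERT INTO purchase_items VALUES (1234, 200, 1, 1), (1235, 201, 1, 3);
    INSERT INTO purchase_discounts VALUES (1, 200, 5), (2, 200, 7);
""")

# The derived table x has at most one row per purchase, so joining it
# cannot multiply purchase_items rows.
rows = conn.execute("""
    SELECT pi.purchase_id, pi.item_id, SUM(pi.sold) AS total_sold,
           COALESCE(x.discount_amount, 0) AS discount_amount
    FROM purchase_items pi
    LEFT JOIN (SELECT purchase_id, SUM(discount_amount) AS discount_amount
               FROM purchase_discounts
               GROUP BY purchase_id) x ON x.purchase_id = pi.purchase_id
    GROUP BY pi.purchase_id, pi.item_id, x.discount_amount
    ORDER BY pi.purchase_id
""").fetchall()

print(rows)  # [(200, 1, 1, 12), (201, 1, 3, 0)]
```

Purchase 201 survives only because of the LEFT JOIN; with an INNER JOIN it would drop out of the result.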

The LEFT JOIN as such is not causing your duplicates. I understand why you need it, as there may not be any discounts, but for the data provided, changing it to an inner join produces the same result. You are getting duplicate entries because ARRAY_AGG(purchase_items.id) collects one value per joined row. Further, with the data presented, the tables item and purchase are not necessary. You can use the window version of sum and distinct on to reduce the duplication of purchase_id, and eliminate the mentioned tables. Finally, the middle select ... ) t can be removed completely, resulting in (see demo):
select array_to_json(array_agg(p_values))
from (select distinct on (pi.item_id, pi.id)
pi.item_id
, pi.id purchase_items_ids
, sum(pi.sold) over (partition by pi.item_id) total_sold
, sum(pd.discount_amount) over(partition by pi.item_id) discount_amount
from purchase_items pi
left join purchase_discounts pd
on pd.purchase_id = pi.purchase_id
order by pi.item_id, pi.id
) as p_values;

I don't think the LEFT JOIN as such is the cause, because an INNER JOIN returns the same result. The discounts table has 2 rows for purchase_id = 200, so every joined row appears twice. You can use ROW_NUMBER() with PARTITION BY, like this:
ROW_NUMBER() OVER(PARTITION BY purchase_items.id order by purchase_items.id) rn
and then keep only the rows where rn = 1.
For the SUM in your query, I think you can use PARTITION BY in the same way.

Related

PostgreSQL GROUP BY column must appear in the GROUP BY

SELECT
COUNT(follow."FK_accountId"),
score.*
FROM
(
SELECT items.*, AVG(reviews.score) as "averageScore" FROM "ITEM_VARIATION" as items
INNER JOIN "ITEM_REVIEW" as reviews ON reviews."FK_itemId"=items.id
GROUP BY items.id
) as score
INNER JOIN "ITEM_FOLLOWER" as follow ON score.id=follow."FK_itemId"
GROUP BY score.id
The inner block works by itself, and I believe I followed the same format for the outer query.
However, it outputs this error:
ERROR: column "score.name" must appear in the GROUP BY clause or be used in an aggregate function
LINE 18: score.*
^
Is listing all the columns of score the only solution?
There are over 10 columns to list, so I'd like to avoid that if it's not the only option.
Columns that are not aggregated must be listed in the GROUP BY clause:
SELECT
COUNT(follow."FK_accountId"),
score.id,
score.name
FROM
(
SELECT items.id as id, items.name as name, AVG(reviews.score) as "averageScore" FROM "ITEM_VARIATION" as items
INNER JOIN "ITEM_REVIEW" as reviews ON reviews."FK_itemId"=items.id
GROUP BY items.id, items.name
) as score
INNER JOIN "ITEM_FOLLOWER" as follow ON score.id=follow."FK_itemId"
GROUP BY score.id, score.name
I would suggest you use correlated subqueries or a lateral join:
SELECT i.*,
(SELECT AVG(r.score)
FROM "ITEM_REVIEW" r
WHERE r."FK_itemId" = i.id
) as averageScore,
(SELECT COUNT(*)
FROM "ITEM_FOLLOWER" f
WHERE f."FK_itemId" = i.id
)
FROM "ITEM_VARIATION" i;
With the right indexes, this is probably faster as well.
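As a rough illustration of the correlated-subquery form, the following sketch runs it on SQLite from Python. The columns of the three tables are reduced to the ones used here, the sample rows are invented, and the "followerCount" alias is a name added for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "ITEM_VARIATION" (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE "ITEM_REVIEW" ("FK_itemId" INTEGER, score REAL);
    CREATE TABLE "ITEM_FOLLOWER" ("FK_itemId" INTEGER, "FK_accountId" INTEGER);
    INSERT INTO "ITEM_VARIATION" VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO "ITEM_REVIEW" VALUES (1, 4.0), (1, 2.0);
    INSERT INTO "ITEM_FOLLOWER" VALUES (1, 10), (1, 11), (1, 12);
""")

# No GROUP BY is needed: each subquery is evaluated per item row,
# so i.* can be selected without listing every column.
rows = conn.execute("""
    SELECT i.*,
           (SELECT AVG(r.score) FROM "ITEM_REVIEW" r
            WHERE r."FK_itemId" = i.id) AS "averageScore",
           (SELECT COUNT(*) FROM "ITEM_FOLLOWER" f
            WHERE f."FK_itemId" = i.id) AS "followerCount"
    FROM "ITEM_VARIATION" i
    ORDER BY i.id
""").fetchall()

print(rows)  # [(1, 'widget', 3.0, 3), (2, 'gadget', None, 0)]
```

Note one behavioral difference from the INNER JOIN versions: an item with no reviews or followers (item 2 here) still appears, with a NULL average and a count of 0.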

Select most Occurred Value SQL with Inner Join

I am using this query to get the following data from several linked tables. Say an item has three vendors. In the result I want to show the vendor that occurred most often. That is, if item ABC was supplied many times by 3 different vendors, I want the vendor who supplied item ABC the most times.
My query is this:
use iBusinessFlex;
SELECT Items.Name,
Max(Items.ItemID) as ItemID ,
MAX(Items.Description)as Description,
MAX(ItemsStock.CurrentPrice) as UnitPrice,
MAX(ItemsStock.Quantity) as StockQuantiity,
MAX(Vendors.VendorName) as VendorName,
SUM(ItemReceived.Quantity) as TotalQuantity
From ItemReceived
INNER JOIN Items ON ItemReceived.ItemId=Items.ItemID
INNER JOIN ItemsStock ON ItemReceived.ItemId=ItemsStock.ItemID
INNER JOIN PurchaseInvoices ON PurchaseInvoices.PurchaseInvoiceId = ItemReceived.PurchaseInvoiceId
INNER JOIN Vendors ON Vendors.VendorId = PurchaseInvoices.VendorId
Group By Items.Name
EDIT: I have included this subquery, but I am not sure whether it shows the correct result, i.e. for each item, the vendor who supplied that item most often.
use iBusinessFlex;
SELECT Items.Name,
Max(Items.ItemID) as ItemID ,
MAX(Items.Description)as Description,MAX(ItemsStock.CurrentPrice) as UnitPrice,
MAX(ItemsStock.Quantity) as StockQuantiity,MAX(Vendors.VendorName) as VendorName,
SUM(ItemReceived.Quantity) as TotalQuantity
From ItemReceived
INNER JOIN Items ON ItemReceived.ItemId=Items.ItemID INNER JOIN ItemsStock
ON ItemReceived.ItemId=ItemsStock.ItemID INNER JOIN PurchaseInvoices
ON PurchaseInvoices.PurchaseInvoiceId = ItemReceived.PurchaseInvoiceId INNER JOIN Vendors
ON Vendors.VendorId IN (
SELECT Top 1 MAX(PurchaseInvoices.VendorId) as VendorOccur
FROM PurchaseInvoices INNER JOIN Vendors ON Vendors.VendorId=PurchaseInvoices.VendorId
GROUP BY PurchaseInvoices.VendorId
ORDER BY COUNT(*) DESC
And the Result Looks like this.
First, I would start with who supplied which item the most. But what is MOST based on: the most quantity, price, or number of times? If you order from one vendor 6 times with a quantity of 10, you have 60 things. If you order once from another vendor with a quantity of 100, which one wins? You have to decide the basis of MOST, but I will go with the number of times, per your original question.
Everything comes from PurchaseInvoices, which has a vendor ID. For the counts alone I only need the vendor ID and the item ID, not the names. The query below shows, per item, each vendor with the number of times ordered and the quantity ordered. I added the Items and Vendors joins just to show the names.
select
IR.ItemID,
PI.VendorID,
max( I.Name ) Name,
max( V.VendorName ) VendorName,
count(*) as TimesOrderedFrom,
SUM( IR.Quantity ) as QuantityFromVendor
from
ItemReceived IR
JOIN PurchaseInvoices PI
on IR.PurchaseInvoiceID = PI.PurchaseInvoiceID
JOIN Items I
on IR.ItemID = I.ItemID
JOIN Vendors V
-- the vendor ID lives on the invoice, not on the received item
on PI.VendorID = V.VendorID
group by
IR.ItemID,
PI.VendorID
order by
-- Per item
IR.ItemID,
-- Most count ordered first
count(*) DESC,
-- If multiple vendors, same count, larger total quantity first
sum( IR.Quantity ) DESC
Now, to get only 1 per item, turn this into a correlated subquery and add 'TOP 1' so that it returns only the first row. In SQL Server a derived table can only reference the outer query through APPLY, so it is attached with CROSS APPLY instead of a plain JOIN. Since the aggregate of count is already done, you can then get the vendor contact info.
select
I.Name,
V.VendorName,
TopVendor.TimesOrderedFrom,
TopVendor.QuantityFromVendor
from
Items I
CROSS APPLY ( select TOP 1
IR.ItemID,
PI.VendorID,
count(*) as TimesOrderedFrom,
SUM( IR.Quantity ) as QuantityFromVendor
from
ItemReceived IR
JOIN PurchaseInvoices PI
on IR.PurchaseInvoiceID = PI.PurchaseInvoiceID
where
-- correlated subquery based on the outer-most item
IR.ItemID = I.ItemID
group by
IR.ItemID,
PI.VendorID
order by
-- Per item
IR.ItemID,
-- Most count ordered first
count(*) DESC,
-- If multiple vendors, same count, larger total quantity first
sum( IR.Quantity ) DESC ) TopVendor
JOIN Vendors V
on TopVendor.VendorID = V.VendorID
No sense in having the INNER Subquery joining on the vendor and items just for the names. Get those once and only at the end when the top vendor is selected.
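The "most times wins" rule can be checked on a small data set. The sketch below is not the SQL Server query above: it swaps in SQLite (run from Python), which has no TOP or APPLY, so it uses a correlated scalar subquery with ORDER BY ... LIMIT 1 instead. The reduced table layouts and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Items (ItemID INTEGER, Name TEXT);
    CREATE TABLE Vendors (VendorId INTEGER, VendorName TEXT);
    CREATE TABLE PurchaseInvoices (PurchaseInvoiceId INTEGER, VendorId INTEGER);
    CREATE TABLE ItemReceived (ItemId INTEGER, PurchaseInvoiceId INTEGER, Quantity INTEGER);
    INSERT INTO Items VALUES (1, 'ABC');
    INSERT INTO Vendors VALUES (1, 'Acme'), (2, 'Globex');
    -- Acme supplied the item twice (10 + 10), Globex once (100)
    INSERT INTO PurchaseInvoices VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO ItemReceived VALUES (1, 1, 10), (1, 2, 10), (1, 3, 100);
""")

# For each item, pick the vendor with the most deliveries;
# ties are broken by the larger total quantity.
rows = conn.execute("""
    SELECT i.Name,
           (SELECT v.VendorName
            FROM ItemReceived ir
            JOIN PurchaseInvoices pi ON pi.PurchaseInvoiceId = ir.PurchaseInvoiceId
            JOIN Vendors v ON v.VendorId = pi.VendorId
            WHERE ir.ItemId = i.ItemID
            GROUP BY v.VendorId, v.VendorName
            ORDER BY COUNT(*) DESC, SUM(ir.Quantity) DESC
            LIMIT 1) AS TopVendor
    FROM Items i
""").fetchall()

print(rows)  # [('ABC', 'Acme')]
```

Acme wins with two deliveries even though Globex delivered the larger quantity, which is exactly the "basis of MOST" decision discussed above.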

Returning IDs from two other tables or null if no IDs found using a left join SQL Server

I am wondering if someone could help me. I am trying to join two tables and return an id if one exists; if there is no id, I want to return null but still return the row for that product rather than drop it. My query below returns twice as many records as expected, and I cannot figure out why.
SELECT
T2.ProductID, FirstChild.SupplierID, SecondChild.AccountID
FROM
Products T2
LEFT OUTER JOIN
(
SELECT TOP(1) SupplierID, Reference,CompanyID, Row_Number() OVER (Partition By SupplierID Order By SupplierID) AS RowNo FROM Suppliers
)
FirstChild ON T2.SupplierReference = FirstChild.Reference AND RowNo = 1AND FirstChild.CompanyID =T2.CompanyID
LEFT OUTER JOIN
(
SELECT TOP(1) AccountID, SageKey,CompanyID, Row_Number() OVER (Partition By AccountID Order By AccountID) AS RowNo2 FROM Accounts
)
SecondChild ON T2.ProductAccountReference = SecondChild.Reference AND RowNo2 = 1 AND SecondChild.CompanyID =T2.CompanyID
Example of what I am trying to do
ProductID SupplierID AccountID
1 5 2
2 6 NULL
3 NULL NULL
OUTER APPLY and ditching the ROW_NUMBER seems like a better choice here:
SELECT
p.ProductId
,FirstChild.SupplierId
,SecondChild.AccountId
FROM
Products p
OUTER APPLY (SELECT TOP (1) s.SupplierId
FROM
Suppliers s
WHERE
p.SupplierReference = s.Reference
AND p.CompanyId = s.CompanyId
ORDER BY
s.SupplierId
) FirstChild
OUTER APPLY (SELECT TOP (1) a.AccountId
FROM
Accounts a
WHERE
p.ProductAccountReference = a.Reference
AND p.CompanyId = a.CompanyId
ORDER BY
a.AccountID
) SecondChild
The way your query is written, the derived tables are not correlated with the outer query. Each one returns a single row: whichever SupplierId (or AccountId) SQL Server happens to pick first, and if that row doesn't match the product you get no value. You need to correlate the derived table with the outer table and then select TOP 1; adding an ORDER BY inside it identifies which row you want.
If it's just showing duplicate records, wouldn't an inelegant solution just be to add distinct in the select line?

SQL Sum returning wrong number

I am adding up the number of tickets sold for a sporting event. The answer should be under 100, but my result is in the thousands.
SELECT Stubhub.Active.Opponent,
SUM(Stubhub.Active.Qty) AS AQty, SUM(Stubhub.Sold.Qty) AS SQty
FROM Stubhub.Active INNER JOIN
Stubhub.Sold ON Stubhub.Active.Opponent = Stubhub.Sold.Opponent
GROUP BY Stubhub.Active.Opponent
This type of problem occurs because, for each opponent, you get a Cartesian product of the matching rows in the two tables. The solution is to pre-aggregate by opponent:
SELECT a.Opponent, a.AQty, s.SQty
FROM (SELECT a.Opponent, SUM(a.Qty) as AQty
FROM Stubhub.Active a
GROUP BY a.Opponent
) a INNER JOIN
(SELECT s.Opponent, SUM(s.QTY) as SQty
FROM Stubhub.Sold s
GROUP BY s.Opponent
) s
ON a.Opponent = s.Opponent;
Notice that in this case, you do not need the aggregation in the outer query.
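The inflation is easy to reproduce with a handful of rows. This is a small sketch on SQLite from Python; the tables are named active and sold without the Stubhub schema prefix, and the quantities are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE active (opponent TEXT, qty INTEGER);
    CREATE TABLE sold (opponent TEXT, qty INTEGER);
    INSERT INTO active VALUES ('Lions', 2), ('Lions', 3);
    INSERT INTO sold VALUES ('Lions', 1), ('Lions', 4), ('Lions', 2);
""")

# Join then aggregate: 2 active rows x 3 sold rows = 6 joined rows,
# so each side is summed several times over.
naive = conn.execute("""
    SELECT a.opponent, SUM(a.qty), SUM(s.qty)
    FROM active a
    JOIN sold s ON a.opponent = s.opponent
    GROUP BY a.opponent
""").fetchall()

# Aggregate each table to one row per opponent, then join.
pre_aggregated = conn.execute("""
    SELECT a.opponent, a.aqty, s.sqty
    FROM (SELECT opponent, SUM(qty) AS aqty FROM active GROUP BY opponent) a
    JOIN (SELECT opponent, SUM(qty) AS sqty FROM sold GROUP BY opponent) s
      ON a.opponent = s.opponent
""").fetchall()

print(naive)           # [('Lions', 15, 14)]
print(pre_aggregated)  # [('Lions', 5, 7)]
```

With real tables holding dozens of rows per opponent, the same multiplication is what pushes a total that should be under 100 into the thousands.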

SQL Selecting Distinct Count of items where 2 conditions are met

I am struggling to get a DISTINCT COUNT working with a SELECT DISTINCT query.
I'm not sure whether I should even be using DISTINCT here. I do get the correct result using a subquery, but it is very heavy processing-wise.
This query produces the results I ultimately want (but is too heavy):
SELECT DISTINCT
product_brandNAME,
product_classNAME,
(SELECT COUNT(productID) FROM products
WHERE products.product_classID = product_class.product_classID
AND products.product_brandID = product_brand.product_brandID) as COUNT
FROM products
JOIN product_brand
JOIN product_class
ON products.product_brandID = product_brand.product_brandID
AND products.product_classID = product_class.product_classID
GROUP BY productID
ORDER BY product_brandNAME
This gets close and is much more efficient, but I can't get the count working: because it groups by productID, every count is 1.
SELECT DISTINCT product_brandNAME, product_classNAME, COUNT(*) as COUNT
FROM products
JOIN product_brand
JOIN product_class
ON products.product_brandID = product_brand.product_brandID
AND products.product_classID = product_class.product_classID
GROUP BY productID
ORDER BY product_brandNAME
Any suggestions? I'm sure it's something small, but I have been searching the net for hours without finding an answer for matching 2 conditions.
Thanks,
Have you tried the following query?
SELECT product_brandNAME
, product_classNAME
, COUNT(*)
FROM products
JOIN product_brand ON products.product_brandID = product_brand.product_brandID
JOIN product_class ON products.product_classID = product_class.product_classID
GROUP BY
product_brandNAME
, product_classNAME
When using GROUP BY you do not need to use a DISTINCT clause. Try the following:
SELECT productID,
product_brandNAME,
product_classNAME,
COUNT(*) as COUNT
FROM products JOIN product_brand ON products.product_brandID = product_brand.product_brandID
JOIN product_class ON products.product_classID = product_class.product_classID
GROUP BY productID,
product_brandNAME,
product_classNAME
ORDER BY product_brandNAME
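Which columns go into GROUP BY decides what COUNT(*) counts. The sketch below, run on SQLite from Python with invented brands, classes, and products, compares grouping by productID with grouping by brand and class name only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_brand (product_brandID INTEGER, product_brandNAME TEXT);
    CREATE TABLE product_class (product_classID INTEGER, product_classNAME TEXT);
    CREATE TABLE products (productID INTEGER, product_brandID INTEGER, product_classID INTEGER);
    INSERT INTO product_brand VALUES (1, 'Acme'), (2, 'Bolt');
    INSERT INTO product_class VALUES (1, 'Hammer'), (2, 'Saw');
    INSERT INTO products VALUES (1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 2, 2);
""")

joins = """
    FROM products p
    JOIN product_brand b ON p.product_brandID = b.product_brandID
    JOIN product_class c ON p.product_classID = c.product_classID
"""

# Grouping by productID puts every product in its own group,
# so each count is 1.
per_product = conn.execute(
    "SELECT COUNT(*)" + joins + "GROUP BY p.productID ORDER BY p.productID"
).fetchall()

# Grouping by brand and class name only counts the products
# that share both conditions.
per_pair = conn.execute(
    "SELECT b.product_brandNAME, c.product_classNAME, COUNT(*)" + joins +
    "GROUP BY b.product_brandNAME, c.product_classNAME "
    "ORDER BY b.product_brandNAME, c.product_classNAME"
).fetchall()

print(per_product)  # [(1,), (1,), (1,), (1,)]
print(per_pair)     # [('Acme', 'Hammer', 2), ('Acme', 'Saw', 1), ('Bolt', 'Saw', 1)]
```

Any query that keeps productID in the GROUP BY list will therefore keep returning counts of 1, with or without DISTINCT.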