Left join + aggregate functions resulting in incorrect values

Left join + aggregate functions resulting in incorrect values - sql

It feels like there is a very simple solution but I am new to SQL and have been struggling with this for a while.
I am working on joining two tables so that I can map the cost center in one to its associated product group in another. There is a parent child relationship here with product group as parent and cost center as child.
I am ultimately trying to see the spend data, currently only available by cost center in table 1, by the parent product group categorization. Prior to the join, I see the correct dollar value by cost center. After the join, the number dramatically increases and is incorrect.
Table 1 (purchase orders): cost_center_id, amount_ordered
Table 2 (employee plus): cost center_id, product_group_name
Below is a simplified sample of the query I am working with.
SELECT
po.po_cost_center_id,
ep.product_group_name,
SUM(po.amount_ordered)
FROM purchase_orders po
LEFT JOIN d_employee_plus ep on po.cost_center_id = ep.cost_center_id and ep.ds =
po.ds
WHERE
po.ds = (select max(ds) from purchase_orders)
GROUP BY 1,2

My take here is that you have duplicate rows because you're not aggregating on both the columns you're joining on. I may be wrong, though try using the following:
SELECT po.po_cost_center_id,
ep.ds,
ep.product_group_name,
SUM(po.amount_ordered)
FROM purchase_orders po
LEFT JOIN d_employee_plus ep
ON po.cost_center_id = ep.cost_center_id
AND ep.ds = po.ds
WHERE po.ds = (SELECT max(ds) FROM purchase_orders)
GROUP BY po.po_cost_center_id,
ep.ds,
ep.product_group_name
For more troubleshooting, if you can provide some sample data, I may help you further.
Does it work for you?

Related

SQL - join three tables based on (different) latest dates in two of them

Using Oracle SQL Developer, I have three tables with some common data that I need to join.
Appreciate any help on this!
Please refer to https://i.stack.imgur.com/f37Jh.png for the input and desired output (table formatting doesn't work on all tables).
These tables are made up in order to anonymize them, and in reality contain other data with millions of entries, but you could think of them as representing:
Product = Main product categories in a grocery store.
Subproduct = Subcategory products to the above. Each time the table is updated, the main product category may loses or get some new suproducts assigned to it. E.g. you can see that from May to June the Pulled pork entered while the Fishsoup was thrown out.
Issues = Status of the products, for example an apple is bad if it has brown spots on it..
What I need to find is: for each P_NAME, find the latest updated set of subproducts (SP_ID and SP_NAME), and append that information with the latest updated issue status (STATUS_FLAG).
Please note that each main product category gets its set of subproducts updated at individual occasions i.e. 1234 and 5678 might be "latest updated" on different dates.
I have tried multiple queries but failed each time. I am using combos of SELECT, LEFT OUTER JOIN, JOIN, MAX and GROUP BY.
Latest attempt, which gives me the combo of the first two tables, but missing the third:
SELECT
PRODUCT.P_NAME,
SUBPRODUCT.SP_PRODUCT_ID, SUBPRODUCT.SP_NAME, SUBPRODUCT.SP_ID, SUPPRODUCT.SP_VALUE_DATE
FROM SUBPRODUCT
LEFT OUTER JOIN PRODUCT ON PRODUCT.P_ID = SUBPRODUCT.SP_PRODUCT_ID
JOIN(SELECT SP_PRODUCT_ID, MAX(SP_VALUE_DATE) AS latestdate FROM SUBPRODUCT GROUP BY SP_PRODUCT_ID) sub ON
sub.SP_PRODUCT_ID = SUBPRODUCT.SP_PRODUCT_ID AND sub.latestDate = SUBPRODUCT.SP_VALUE_DATE;

Trying to find a row with a max value is a common SQL pattern - you can do it with a join, like your example, but it's usually more clear to use a subquery or a window function.
Correlated subquery example
select
PRODUCT.P_NAME,
SUBPRODUCT.SP_PRODUCT_ID, SUBPRODUCT.SP_NAME, SUBPRODUCT.SP_ID, SUPPRODUCT.SP_VALUE_DATE,
ISSUES.STATUS_FLAG, ISSUES.STATUS_LAST_UPDATED
from PRODUCT
join SUBPRODUCT
on PRODUCT.P_ID = SUBPRODUCT.SP_PRODUCT_ID
and SUBPRODUCT.SP_VALUE_DATE = (select max(S2.SP_VALUE_DATE) as latestDate
from SUBPRODUCT S2
where S2.SP_PRODUCT_ID = SUBPRODUCT.SP_PRODUCT_ID)
join ISSUES
on ISSUES.ISSUE_ID = SUBPRODUCT.SP_ID
and ISSUES.STATUS_LAST_UPDATED = (select max(I2.STATUS_LAST_UPDATED) as latestDate
from ISSUES I2
where I2.ISSUE_ID = ISSUES.ISSUE_ID)
Window function / inline view example
select
PRODUCT.P_NAME,
S.SP_PRODUCT_ID, S.SP_NAME, S.SP_ID, S.SP_VALUE_DATE,
I.STATUS_FLAG, I.STATUS_LAST_UPDATED
from PRODUCT
join (select SUBPRODUCT.*,
max(SP_VALUE_DATE) over (partition by SP_PRODUCT_ID) as latestDate
from SUBPRODUCT) S
on PRODUCT.P_ID = S.SP_PRODUCT_ID
and S.SP_VALUE_DATE = S.latestDate
join (select ISSUES.*,
max(STATUS_LAST_UPDATED) over (partition by ISSUE_ID) as latestDate
from ISSUES) I
on I.ISSUE_ID = S.SP_ID
and I.STATUS_LAST_UPDATED = I.latestDate
This often performs a bit better, but window functions can be tricky to understand.

Use group by with sum in query

These 3 tables that you see in the image are related
Course table and coaching table and sales table
I want to make a report from this table on how much each coach has sold by each course period.
The query I created is as follows, but unfortunately it has a problem and I do not know where the problem is.
Please help me fix the problem
Thank you
SELECT
dbo.tblCustomersOrders.id, dbo.tblCustomersOrders.pid, dbo.tblPost.postTitle,
dbo.tblArticleAuthor.authorName, SUM(dbo.tblCustomersOrders.prodPrice) AS TotalBuys
FROM
dbo.tblPost
INNER JOIN
dbo.tblArticleAuthor ON dbo.tblPost.id = dbo.tblArticleAuthor.articleID
INNER JOIN
dbo.tblCustomersOrders ON dbo.tblPost.id = dbo.tblCustomersOrders.pid
GROUP BY dbo.tblCustomersOrders.pid

For this use, SUM() is an Aggregate Function, so you need to refer all the
fields that you want to get in your result set.
Example:
SELECT
dbo.tblCustomersOrders.id, dbo.tblCustomersOrders.pid, dbo.tblPost.postTitle,
dbo.tblArticleAuthor.authorName, SUM(dbo.tblCustomersOrders.prodPrice) AS TotalBuys
FROM dbo.tblPost
INNER JOIN
dbo.tblArticleAuthor ON dbo.tblPost.id = dbo.tblArticleAuthor.articleID
INNER JOIN
dbo.tblCustomersOrders ON dbo.tblPost.id = dbo.tblCustomersOrders.pid
GROUP BY dbo.tblCustomersOrders.id, dbo.tblCustomersOrders.pid,
dbo.tblPost.postTitle, dbo.tblArticleAuthor.authorName
But this query does not solve the need for your report.
If you just need to get "how much each coach has sold by each course" , you can try the query bellow.
SELECT
dbo.tblArticleAuthor.authorName, dbo.tblPost.postTitle,
SUM(dbo.tblCustomersOrders.prodPrice) AS TotalBuys
FROM dbo.tblPost
INNER JOIN
dbo.tblArticleAuthor ON dbo.tblPost.id = dbo.tblArticleAuthor.articleID
INNER JOIN
dbo.tblCustomersOrders ON dbo.tblPost.id = dbo.tblCustomersOrders.pid
GROUP BY dbo.tblArticleAuthor.authorName, dbo.tblPost.postTitle
If you need, send more details regarding the desired result.
Here you can find more information about SQL SERVER Aggregate Functions:
https://learn.microsoft.com/en-us/sql/t-sql/functions/aggregate-functions-transact-sql?view=sql-server-ver15
And here a quick example regarding SQL Aliases to build queries with a simple
and effective way:
https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_alias_table

Per your description of the task, the problem is that you only GROUPed BY dbo.tblCustomersOrders.pid, which is the period's id I guess, but you also need to GROUP BY the coach, which is dbo.tblArticleAuthor.authorName, I guess again. Plus in the SELECT field list you can not use more columns only that are aggregated + GROUPed.

Expand Join to not limit data

I have a weird question - I understand that Joins return matching data based on the 'ON' stipulation, however the problem I am facing is I need the Business date back for both tables but at the same time i need to join on the date in order to get the totals correct
See below code:
Select
o.Resort,
o.Business_Date,
Occupied,
Comps,
House,
ADR,
Room_Revenue,
Occupied-(Comps+House) AS DandT,
Coalesce(gd.Projected_Occ1,0) AS Projected_Occ1,
Occupied-(Comps+House)+Coalesce(gd.Projected_Occ1,0) as Total
from Occupancy o
left join Group_Details_HF gd
on o.Business_Date = gd.Business_Date
and o.Resort = gd.resort
UNION ALL
select
o.Resort,
o.Business_Date,
Occupied,
Comps,
House,
ADR,
Room_Revenue,
Occupied-(Comps+House) AS DandT,
Coalesce(gd.Projected_Occ1,0) AS Projected_Occ1,
Coalesce(Occupied-(Comps+House),0)+Coalesce(gd.Projected_Occ1,0) as Total
from Occupancy_Forecast o
FULL OUTER JOIN Group_Details_HF gd
on o.Business_Date = gd.Business_Date
and o.Resort = gd.resort
Currently, this gives me the desired results from the Occupancy and Occupancy forecast table however when the business date does not exist in the occupancy forecast table it ignores the group_details table, I need the results to combine the dates when they exist in both or give the unique results for each when there is no match

I have decided to create another pivot table storing the details from Group_Details_HF and then Union together the two tables which has given me the desired result rather than fiddling with the join :)

Include missing years in Group By query

I am fairly new in Access and SQL programming. I am trying to do the following:
Sum(SO_SalesOrderPaymentHistoryLineT.Amount) AS [Sum Of PaymentPerYear]
and group by year even when there is no amount in some of the years. I would like to have these years listed as well for a report with charts. I'm not certain if this is possible, but every bit of help is appreciated.
My code so far is as follows:
SELECT
Base_CustomerT.SalesRep,
SO_SalesOrderT.CustomerId,
Base_CustomerT.Customer,
SO_SalesOrderPaymentHistoryLineT.DatePaid,
Sum(SO_SalesOrderPaymentHistoryLineT.Amount) AS [Sum Of PaymentPerYear]
FROM
Base_CustomerT
INNER JOIN (
SO_SalesOrderPaymentHistoryLineT
INNER JOIN SO_SalesOrderT
ON SO_SalesOrderPaymentHistoryLineT.SalesOrderId = SO_SalesOrderT.SalesOrderId
) ON Base_CustomerT.CustomerId = SO_SalesOrderT.CustomerId
GROUP BY
Base_CustomerT.SalesRep,
SO_SalesOrderT.CustomerId,
Base_CustomerT.Customer,
SO_SalesOrderPaymentHistoryLineT.DatePaid,
SO_SalesOrderPaymentHistoryLineT.PaymentType,
Base_CustomerT.IsActive
HAVING
(((SO_SalesOrderPaymentHistoryLineT.PaymentType)=1)
AND ((Base_CustomerT.IsActive)=Yes))
ORDER BY
Base_CustomerT.SalesRep,
Base_CustomerT.Customer;

You need another table with all years listed -- you can create this on the fly or have one in the db... join from that. So if you had a table called alltheyears with a column called y that just listed the years then you could use code like this:
WITH minmax as
(
select min(year(SO_SalesOrderPaymentHistoryLineT.DatePaid) as minyear,
max(year(SO_SalesOrderPaymentHistoryLineT.DatePaid) as maxyear)
from SalesOrderPaymentHistoryLineT
), yearsused as
(
select y
from alltheyears, minmax
where alltheyears.y >= minyear and alltheyears.y <= maxyear
)
select *
from yearsused
join ( -- your query above goes here! -- ) T
ON year(T.SO_SalesOrderPaymentHistoryLineT.DatePaid) = yearsused.y

You need a data source that will provide the year numbers. You cannot manufacture them out of thin air. Supposing you had a table Interesting_year with a single column year, populated, say, with every distinct integer between 2000 and 2050, you could do something like this:
SELECT
base.SalesRep,
base.CustomerId,
base.Customer,
base.year,
Sum(NZ(data.Amount)) AS [Sum Of PaymentPerYear]
FROM
(SELECT * FROM Base_CustomerT INNER JOIN Year) AS base
LEFT JOIN
(SELECT * FROM
SO_SalesOrderT
INNER JOIN SO_SalesOrderPaymentHistoryLineT
ON (SO_SalesOrderPaymentHistoryLineT.SalesOrderId = SO_SalesOrderT.SalesOrderId)
) AS data
ON ((base.CustomerId = data.CustomerId)
AND (base.year = Year(data.DatePaid))),
WHERE
(data.PaymentType = 1)
AND (base.IsActive = Yes)
AND (base.year BETWEEN
(SELECT Min(year(DatePaid) FROM SO_SalesOrderPaymentHistoryLineT)
AND (SELECT Max(year(DatePaid) FROM SO_SalesOrderPaymentHistoryLineT))
GROUP BY
base.SalesRep,
base.CustomerId,
base.Customer,
base.year,
ORDER BY
base.SalesRep,
base.Customer;
Note the following:
The revised query first forms the Cartesian product of BaseCustomerT with Interesting_year in order to have base customer data associated with each year (this is sometimes called a CROSS JOIN, but it's the same thing as an INNER JOIN with no join predicate, which is what Access requires)
In order to have result rows for years with no payments, you must perform an outer join (in this case a LEFT JOIN). Where a (base customer, year) combination has no associated orders, the rest of the columns of the join result will be NULL.
I'm selecting the CustomerId from Base_CustomerT because you would sometimes get a NULL if you selected from SO_SalesOrderT as in the starting query
I'm using the Access Nz() function to convert NULL payment amounts to 0 (from rows corresponding to years with no payments)
I converted your HAVING clause to a WHERE clause. That's semantically equivalent in this particular case, and it will be more efficient because the WHERE filter is applied before groups are formed, and because it allows some columns to be omitted from the GROUP BY clause.
Following Hogan's example, I filter out data for years outside the overall range covered by your data. Alternatively, you could achieve the same effect without that filter condition and its subqueries by ensuring that table Intersting_year contains only the year numbers for which you want results.
Update: modified the query to a different, but logically equivalent "something like this" that I hope Access will like better. Aside from adding a bunch of parentheses, the main difference is making both the left and the right operand of the LEFT JOIN into a subquery. That's consistent with the consensus recommendation for resolving Access "ambiguous outer join" errors.

Thank you John for your help. I found a solution which works for me. It looks quiet different but I learned a lot out of it. If you are interested here is how it looks now.
SELECT DISTINCTROW
Base_Customer_RevenueYearQ.SalesRep,
Base_Customer_RevenueYearQ.CustomerId,
Base_Customer_RevenueYearQ.Customer,
Base_Customer_RevenueYearQ.RevenueYear,
CustomerPaymentPerYearQ.[Sum Of PaymentPerYear]
FROM
Base_Customer_RevenueYearQ
LEFT JOIN CustomerPaymentPerYearQ
ON (Base_Customer_RevenueYearQ.RevenueYear = CustomerPaymentPerYearQ.[RevenueYear])
AND (Base_Customer_RevenueYearQ.CustomerId = CustomerPaymentPerYearQ.CustomerId)
GROUP BY
Base_Customer_RevenueYearQ.SalesRep,
Base_Customer_RevenueYearQ.CustomerId,
Base_Customer_RevenueYearQ.Customer,
Base_Customer_RevenueYearQ.RevenueYear,
CustomerPaymentPerYearQ.[Sum Of PaymentPerYear]
;

SUM(a*b) not working

I have a PHP page running in postgres. I have 3 tables - workorders, wo_parts and part2vendor. I am trying to multiply 2 table column row datas together, ie wo_parts has a field called qty and part2vendor has a field called cost. These 2 are joined by wo_parts.pn and part2vendor.pn. I have created a query like this:
$scoreCostQuery = "SELECT SUM(part2vendor.cost*wo_parts.qty) as total_score
FROM part2vendor
INNER JOIN wo_parts
ON (wo_parts.pn=part2vendor.pn)
WHERE workorder=$workorder";
But if I add the costs of the parts multiplied by the qauntities supplied, it adds to a different number than what the script is doing. Help....I am new to this but if someone can show me in SQL I can modify it for postgres. Thanks

Without seeing example data, there's no way for us to know why you're query totals are coming out differently that when you do the math by hand. It could be a bad join, so you are getting more/less records than you expected. It's also possible that your calculations are off. Pick an example with the smallest number of associated records & compare.
My suggestion is to add a GROUP BY to the query:
SELECT SUM(p.cost * wp.qty) as total_score
FROM part2vendor p
JOIN wo_parts wp ON wp.pn = p.pn
WHERE workorder = $workorder
GROUP BY workorder
FYI: MySQL was designed to allow flexibility in the GROUP BY, while no other db I've used does - it's a source of numerous questions on SO "why does this work in MySQL when it doesn't work on db x...".
To Check that your Quantities are correct:
SELECT wp.qty,
p.cost
FROM WO_PARTS wp
JOIN PART2VENDOR p ON p.pn = wp.pn
WHERE p.workorder = $workorder
Check that the numbers are correct for a given order.

You could try a sub-query instead.
(Note, I don't have a Postgres installation to test this on so consider this more like pseudo code than a working example... It does work in MySQL tho)
SELECT
SUM(p.`score`) AS 'total_score'
FROM part2vendor AS p2v
INNER JOIN (
SELECT pn, cost * qty AS `score`
FROM wo_parts
) AS p
ON p.pn = p2v.pn
WHERE p2n.workorder=$workorder"

In the question, you say the cost column is in part2vendor, but in the query you reference wo_parts.cost. If the wo_parts table has its own cost column, that's the source of the problem.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas