Avoiding double counting values when aggregating tables used in joins - sql

I have two tables that have a parent/child relationship, where one record in the parent table may correspond to N records in the child. Both tables have an amount column that I want to aggregate in one query, so I can see the total amount for the parent and the children.
When I join the tables together, each parent amount is counted once per matching child, which makes the parent aggregate values incorrect.
Here is a simplified version of the problem that has the bad results and my desired results.
drop table if exists parent;
CREATE TABLE parent (
id numeric,
amount numeric,
person text
);
drop table if exists child;
CREATE TABLE child (
id numeric,
parentId numeric,
amount numeric,
person text
);
insert into parent (id, amount, person) values
(1, 5, 'P1'),
(2, 15, 'P1'),
(3, 5, 'P2'),
(4, 20, 'P2');
insert into child (id, parentId, amount) values
(1, 1, 3),
(2, 1, 5),
(3, 2, 10),
(4, 3, 6),
(5, 4, 12),
(5, 4, 8);
-- Parent is double counted for each child joined onto
select
p.person,
p.id,
sum(p.amount) as parent_sum,
sum(c.amount) as child_sum
from
parent p
left outer join child as c on
c.parentId = p.id
group by rollup (p.person, p.id)
order by (p.person, p.id)
/*
Output:
| person | id | parent_sum | child_sum |
|--------|--------|------------|-----------|
| P1 | 1 | 10 | 8 |
| P1 | 2 | 15 | 10 |
| P1 | (null) | 25 | 18 |
| P2 | 3 | 5 | 6 |
| P2 | 4 | 40 | 20 |
| P2 | (null) | 45 | 26 |
| (null) | (null) | 70 | 44 |
Desired output:
| person | id | parent_sum | child_sum |
|--------|--------|------------|-----------|
| P1 | 1 | 5 | 8 |
| P1 | 2 | 15 | 10 |
| P1 | (null) | 20 | 18 |
| P2 | 3 | 5 | 6 |
| P2 | 4 | 20 | 20 |
| P2 | (null) | 25 | 26 |
| (null) | (null) | 45 | 44 |
*/
Here is an sql fiddle showing this: http://sqlfiddle.com/#!17/c4af6/2
I think I might be able to get this to work using a window function but I am looking for a better solution that is as performant as possible. Any ideas on this?

You can aggregate each table first and then join the results.
Query 1:
-- Each table is aggregated before the join, so parent amounts are counted once
select
p.person,
p.id,
SUM(parent_sum) as parent_sum,
SUM(child_sum) as child_sum
from
(SELECT p.person,
p.id,
sum(p.amount) as parent_sum FROM parent p
GROUP BY p.person,
p.id) p
left outer join (SELECT
c.parentId,
sum(c.amount) as child_sum FROM child c
GROUP BY c.parentId) c on
c.parentId = p.id
group by rollup (p.person, p.id)
order by (p.person, p.id)
Results:
| person | id | parent_sum | child_sum |
|--------|--------|------------|-----------|
| P1 | 1 | 5 | 8 |
| P1 | 2 | 15 | 10 |
| P1 | (null) | 20 | 18 |
| P2 | 3 | 5 | 6 |
| P2 | 4 | 20 | 20 |
| P2 | (null) | 25 | 26 |
| (null) | (null) | 45 | 44 |
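The effect of aggregating before joining can be checked with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and because SQLite has no ROLLUP it groups by person only, so it reproduces just the per-person subtotal rows from the tables above.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the poster's PostgreSQL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER, amount NUMERIC, person TEXT);
CREATE TABLE child (id INTEGER, parentId INTEGER, amount NUMERIC);
INSERT INTO parent (id, amount, person) VALUES
  (1, 5, 'P1'), (2, 15, 'P1'), (3, 5, 'P2'), (4, 20, 'P2');
-- Child rows copied as posted in the question (including the repeated id 5).
INSERT INTO child (id, parentId, amount) VALUES
  (1, 1, 3), (2, 1, 5), (3, 2, 10), (4, 3, 6), (5, 4, 12), (5, 4, 8);
""")

# Naive join: a parent row is repeated once per child, inflating SUM(p.amount).
naive = conn.execute("""
SELECT p.person, SUM(p.amount), SUM(c.amount)
FROM parent p
LEFT JOIN child c ON c.parentId = p.id
GROUP BY p.person
ORDER BY p.person
""").fetchall()

# Pre-aggregated join: child is collapsed to one row per parent before joining.
fixed = conn.execute("""
SELECT p.person, SUM(p.amount), SUM(c.child_sum)
FROM parent p
LEFT JOIN (
  SELECT parentId, SUM(amount) AS child_sum
  FROM child
  GROUP BY parentId
) c ON c.parentId = p.id
GROUP BY p.person
ORDER BY p.person
""").fetchall()

print(naive)
print(fixed)
```

The child sums are the same in both queries; only the parent sums change, because the derived table guarantees at most one joined row per parent.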

Using rollup will get you the extra rows but I think you have to do the summary a bit manually like this:
select p.person, p.id,
coalesce(
case when grouping(p.id) = 0 then min(p.amount) end,
case when grouping(p.id) = 1 then sum(case when grouping(p.id) = 0 then min(p.amount) end) over (partition by p.person) end,
case when grouping(p.person) = 1 then sum(case when grouping(p.id) = 0 then min(p.amount) end) over () end,
0
) as parent_sum,
sum(c.amount) as child_sum
from parent p left outer join child as c on c.parentId = p.id
group by rollup(p.person, p.id)
order by p.person, p.id;
You should be able to extend the pattern to deeper levels if necessary.
https://dbfiddle.uk/CxD1n2T1

Related

Can someone help me figure out if I'm making a mistake in my query?

I'm trying to create a query that returns the names of all people in my database that have less than half of the money of the person with the most money.
This is my query:
select P1.name
from Persons P1 left join
AccountOf A1 on A1.person_id = P1.id left join
BankAccounts B1 on B1.id = A1.account_id
group by name
having SUM(B1.balance) < MAX((select SUM(B1.balance) as b
from AccountOf A1 left join
BankAccounts B1 on B1.id = A1.account_id
group by A1.person_id
order by b desc
LIMIT 1)) * 0.5
This is the result:
+-------+
| name |
+-------+
| Evert |
+-------+
I have the following tables in the database:
+---------+--------+--+
| Persons | | |
+---------+--------+--+
| id | name | |
| 11 | Evert | |
| 12 | Xavi | |
| 13 | Ludwig | |
| 14 | Ziggy | |
+---------+--------+--+
+--------------+---------+
| BankAccounts | |
+--------------+---------+
| id | balance |
| 11 | 525000 |
| 12 | 750000 |
| 13 | 1900000 |
| 14 | 1600000 |
+--------------+---------+
+-----------+-----------+------------+
| AccountOf | | |
+-----------+-----------+------------+
| id | person_id | account_id |
| 301 | 11 | 12 |
| 302 | 13 | 12 |
| 303 | 13 | 14 |
| 304 | 14 | 11 |
| 305 | 14 | 13 |
+-----------+-----------+------------+
What am I missing here? I should get two entries in the result (Evert, Xavi)
I wouldn't approach the logic this way (I would use window functions). But your final having has two levels of aggregation. That shouldn't work. You want:
having SUM(B1.balance) < (select 0.5 * SUM(B1.balance) as b
from AccountOf A1 join
BankAccounts B1 on B1.id = A1.account_id
group by A1.person_id
order by b desc
limit 1
)
I also moved the 0.5 into the subquery and changed the left join to a join -- the tables need to match to get balances.
I would recommend window functions, if your - undisclosed! - database supports them.
You can join and aggregate just once, and then use a window max() to get the top balance. All that is left is to filter in an outer query:
select *
from (
select p.id, p.name, coalesce(sum(balance), 0) balance,
max(sum(balance)) over() max_balance
from persons p
left join accountof ao on ao.person_id = p.id
left join bankaccounts ba on ba.id = ao.account_id
group by p.id, p.name
) t
where balance < max_balance * 0.5
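As a rough check of the window-function approach against the sample data, here is a sketch using Python's built-in sqlite3 module. It assumes the bundled SQLite is version 3.25 or later (needed for window functions), and it splits the aggregation and the window max() into two CTEs (named totals and ranked here purely for illustration) rather than nesting them in a single select.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (id INTEGER, name TEXT);
CREATE TABLE bankaccounts (id INTEGER, balance INTEGER);
CREATE TABLE accountof (id INTEGER, person_id INTEGER, account_id INTEGER);
INSERT INTO persons VALUES (11, 'Evert'), (12, 'Xavi'), (13, 'Ludwig'), (14, 'Ziggy');
INSERT INTO bankaccounts VALUES (11, 525000), (12, 750000), (13, 1900000), (14, 1600000);
INSERT INTO accountof VALUES
  (301, 11, 12), (302, 13, 12), (303, 13, 14), (304, 14, 11), (305, 14, 13);
""")

rows = conn.execute("""
WITH totals AS (
  -- One row per person; people without accounts get a balance of 0.
  SELECT p.id, p.name, COALESCE(SUM(ba.balance), 0) AS balance
  FROM persons p
  LEFT JOIN accountof ao ON ao.person_id = p.id
  LEFT JOIN bankaccounts ba ON ba.id = ao.account_id
  GROUP BY p.id, p.name
),
ranked AS (
  -- Attach the overall maximum balance to every row.
  SELECT name, balance, MAX(balance) OVER () AS max_balance
  FROM totals
)
SELECT name, balance
FROM ranked
WHERE balance < max_balance * 0.5
ORDER BY name
""").fetchall()

print(rows)
```

The left joins matter here: Xavi has no accounts, so an inner join would drop him before the comparison is made.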

PostgreSQL can't make Self Join

I have a table:
| acctg_cath_id | parent | description |
| 1 | 20 | Bills |
| 9 | 20 | Invoices |
| 20 | | Expenses |
| 88 | 30 | |
| 89 | 30 | |
| 30 | | |
And I want to create a self join in order to group my items under a parent.
Have tried this, but it doesn't work:
SELECT
accounting.categories.acctg_cath_id,
accounting.categories.parent
FROM accounting.categories a1, accounting.categories a2
WHERE a1.acctg_cath_id=a2.parent
I get error: invalid reference to FROM-clause entry for table "categories"
When I try:
a.accounting.categories.acctg_cath_id
b.accounting.categories.acctg_cath_id
I get error: cross-database references are not implemented: a.accounting.categories.acctg_cath_id
Desired output:
Expenses (Parent 20)
Bills (Child 1)
Invoices (Child 9)
What am I doing wrong here?
It seems you merely want to sort the rows:
select *
from accounting.categories
order by coalesce(parent, acctg_cath_id), parent nulls first, acctg_cath_id;
Result:
+---------------+--------+-------------+
| acctg_cath_id | parent | description |
+---------------+--------+-------------+
| 20 | | Expenses |
| 1 | 20 | Bills |
| 9 | 20 | Invoices |
| 30 | | |
| 88 | 30 | |
| 89 | 30 | |
+---------------+--------+-------------+
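Your syntax performs a cross join (filtered by the WHERE clause), and once a table has an alias, the select list must reference it through that alias: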
FROM accounting.categories a1, accounting.categories a2
Try the following:
SELECT
a2.acctg_cath_id,
a2.parent
FROM accounting.categories a1
JOIN accounting.categories a2 ON (a1.acctg_cath_id = a2.parent)
;
Examine the DBFiddle.
You don't need grouping, only self join:
select
c.acctg_cath_id parentid, c.description parent,
cc.acctg_cath_id childid, cc.description child
from (
select distinct parent
from categories
) p inner join categories c
on p.parent = c.acctg_cath_id
inner join categories cc on cc.parent = p.parent
where p.parent = 20
You can remove the WHERE clause if you want all the parents with all their children.
See the demo.
Results:
> parentid | parent | childid | child
> -------: | :------- | ------: | :-------
> 20 | Expenses | 1 | Bills
> 20 | Expenses | 9 | Invoices
You don't need a self-join. You don't need aggregation. You just need an order by clause:
SELECT ac.*
FROM accounting.categories ac
ORDER BY COALESCE(ac.parent, ac.acctg_cath_id),
(CASE WHEN ac.parent IS NULL THEN 1 ELSE 2 END),
ac.acctg_cath_id;
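The ordering trick can be tried on the sample rows with a short script. This is a sketch only: it uses Python's built-in sqlite3 module instead of PostgreSQL, and the table is created as plain categories because an in-memory SQLite database has no accounting schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (acctg_cath_id INTEGER, parent INTEGER, description TEXT);
INSERT INTO categories VALUES
  (1, 20, 'Bills'), (9, 20, 'Invoices'), (20, NULL, 'Expenses'),
  (88, 30, NULL), (89, 30, NULL), (30, NULL, NULL);
""")

rows = conn.execute("""
SELECT acctg_cath_id, parent, description
FROM categories
ORDER BY COALESCE(parent, acctg_cath_id),          -- family key: the parent's id
         CASE WHEN parent IS NULL THEN 1 ELSE 2 END, -- parent row before its children
         acctg_cath_id                              -- children in id order
""").fetchall()

ordered_ids = [r[0] for r in rows]
print(ordered_ids)
```

COALESCE(parent, acctg_cath_id) gives a parent and its children the same sort key, so no join is needed to keep each family together.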

Multiply quantities for all parent child relationships

I have a table like this:
======================================
ID | Description|Quantity| Parentid|
=====================================
1 | Main | NULL | NULL |
2 | Sub | 20 | 1 |
3 | Sub2 | 21 | 1 |
4 | A1 | 32 | 2 |
5 | B1 | 51 | 3 |
6 | B2 | 43 | 3 |
7 | C1 | 34 | 4 |
9 | D1 | 22 | 5 |
10 | D2 | 90 | 5 |
11 | E1 | 21 | 7 |
12 | F1 | 2 | 11 |
13 | F2 | 42 | 11 |
14 | G1 | 12 | 13 |
-------------------------------------
I want the total quantity of G1. The parent of G1 is F2, the parent of F2 is E1, the parent of E1 is C1, the parent of C1 is A1, the parent of A1 is Sub, and the parent of Sub is Main. So the total quantity of G1 is 12*42*21*34*32*20 = 230307840.
How can I get that answer with a SQL query?
WITH TotalQuantity AS
(
SELECT Quantity, ParentID
FROM MyTable
WHERE Description = 'G1'
UNION ALL
SELECT TQ.Quantity * COALESCE(T.Quantity,1), T.ParentID
FROM TotalQuantity TQ
INNER JOIN MyTable T ON T.ID = TQ.ParentID
)
SELECT * FROM TotalQuantity
WHERE ParentID IS NULL
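The bottom-up recursion above can be checked against the sample table. This sketch uses Python's built-in sqlite3 module, so the CTE needs the RECURSIVE keyword that SQL Server omits; the table name MyTable is taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (ID INTEGER, Description TEXT, Quantity INTEGER, ParentID INTEGER);
INSERT INTO MyTable VALUES
  (1, 'Main', NULL, NULL), (2, 'Sub', 20, 1), (3, 'Sub2', 21, 1),
  (4, 'A1', 32, 2), (5, 'B1', 51, 3), (6, 'B2', 43, 3),
  (7, 'C1', 34, 4), (9, 'D1', 22, 5), (10, 'D2', 90, 5),
  (11, 'E1', 21, 7), (12, 'F1', 2, 11), (13, 'F2', 42, 11),
  (14, 'G1', 12, 13);
""")

total = conn.execute("""
WITH RECURSIVE TotalQuantity(Quantity, ParentID) AS (
  -- Anchor: start at the leaf we are interested in.
  SELECT Quantity, ParentID FROM MyTable WHERE Description = 'G1'
  UNION ALL
  -- Step: multiply by the parent's quantity (NULL counts as 1) and move up.
  SELECT TQ.Quantity * COALESCE(T.Quantity, 1), T.ParentID
  FROM TotalQuantity TQ
  JOIN MyTable T ON T.ID = TQ.ParentID
)
-- The row whose ParentID is NULL is the one that reached the root.
SELECT Quantity FROM TotalQuantity WHERE ParentID IS NULL
""").fetchone()[0]

print(total)
```

Each recursion step carries the running product one level up the tree, so the row that reaches Main holds the full product for G1.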
This will give the increasing totals for each generation.
WITH Hierarchy(ChildId, Description, Quantity, Generation, ParentId)
AS
(
SELECT Id, Description, Quantity, 0 as Generation, ParentId
FROM Table1 AS FirstGeneration
WHERE ParentId IS NULL
UNION ALL
SELECT NextGeneration.Id, NextGeneration.Description,
ISNULL(NextGeneration.Quantity, 1) * ISNULL(Parent.Quantity, 1),
Parent.Generation + 1, Parent.ChildId
FROM Table1 AS NextGeneration
INNER JOIN Hierarchy AS Parent ON NextGeneration.ParentId = Parent.ChildId
)
SELECT *
FROM Hierarchy
For G1 simply
select quantity from Hierarchy where description = 'G1' -- result = 230307840

Sub-sub-selects and grouping: Get name column from the row containing the max value of a group

I have two tables: States, and Items.
States:
+----+------+-------+----------+
| id | name | state | priority |
+----+------+-------+----------+
| 1 | AA | 10 | 1 |
| 2 | AB | 10 | 2 |
| 3 | AC | 10 | 3 |
| 4 | BA | 20 | 1 |
| 5 | BB | 20 | 5 |
| 6 | BC | 20 | 10 |
| 7 | BD | 20 | 50 |
+----+------+-------+----------+
Items:
+----+--------+-------+
| id | item | state |
+----+--------+-------+
| 1 | Blue | 10 |
| 2 | Red | 20 |
| 3 | Green | 20 |
| 4 | Yellow | 10 |
| 5 | Brown | 10 |
+----+--------+-------+
The priority column is not used in the Items table, but complicates getting the data I need, as shown below.
What I want is a list of the rows in the Items table, replacing the state.id value in each row with the name of the highest priority state.
Results would look like this:
+----+--------+-------+
| id | item | state |
+----+--------+-------+
| 1 | Blue | AC |
| 2 | Red | BD |
| 3 | Green | BD |
| 4 | Yellow | AC |
| 5 | Brown | AC |
+----+--------+-------+
Here's the tiny monster I've come up with. Is this the best way, or can I be more efficient / less verbose? (Sub-sub-selects make my palms itch. :-P )
SELECT *
FROM
Items AS itm
INNER JOIN (SELECT sta.name, sta.state
FROM (SELECT state, MAX(priority) [highest]
FROM States
GROUP BY state) AS pri
INNER JOIN States AS sta
ON sta.state = pri.state
AND sta.priority = pri.highest) AS nam
ON itm.state = nam.state
Update: I'm using MS-SQL 2005 and MS-SQL 2008R2
You did not post your version of SQL-Server. Assuming you are on 2005 or later you can use the ROW_NUMBER() function together with a cross apply like this:
CREATE TABLE dbo.States(id INT, name NVARCHAR(25), state INT, priority INT);
INSERT INTO dbo.States
VALUES
( 1 ,'AA', 10 , 1 ),
( 2 ,'AB', 10 , 2 ),
( 3 ,'AC', 10 , 3 ),
( 4 ,'BA', 20 , 1 ),
( 5 ,'BB', 20 , 5 ),
( 6 ,'BC', 20 , 10 ),
( 7 ,'BD', 20 , 50 );
CREATE TABLE dbo.Items( id INT ,item NVARCHAR(25), state INT );
INSERT INTO dbo.Items
VALUES
( 1 ,'Blue', 10 ),
( 2 ,'Red', 20 ),
( 3 ,'Green', 20 ),
( 4 ,'Yellow', 10 ),
( 5 ,'Brown', 10 );
SELECT i.id,
i.item,
s.name,
s.priority
FROM dbo.Items i
CROSS APPLY (
SELECT *,ROW_NUMBER()OVER(ORDER BY priority DESC) rn FROM dbo.States si WHERE si.state = i.state
)s
WHERE s.rn = 1;
CROSS APPLY works like a join but lets the right side reference columns from the left side, as you can see in the WHERE clause inside the applied subquery. The ROW_NUMBER() function numbers all rows in the States table that match the current state value in descending priority order, so the row with the highest priority always gets the number 1. The final WHERE clause keeps just those rows.
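For readers without SQL Server at hand, the same "row number 1 per group" idea can be sketched with Python's built-in sqlite3 module. SQLite has no CROSS APPLY, so this variant swaps in ROW_NUMBER() with PARTITION BY state inside a derived table and an ordinary join; it is not the query from the answer, and it assumes a bundled SQLite of 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE States (id INTEGER, name TEXT, state INTEGER, priority INTEGER);
INSERT INTO States VALUES
  (1, 'AA', 10, 1), (2, 'AB', 10, 2), (3, 'AC', 10, 3),
  (4, 'BA', 20, 1), (5, 'BB', 20, 5), (6, 'BC', 20, 10), (7, 'BD', 20, 50);
CREATE TABLE Items (id INTEGER, item TEXT, state INTEGER);
INSERT INTO Items VALUES
  (1, 'Blue', 10), (2, 'Red', 20), (3, 'Green', 20), (4, 'Yellow', 10), (5, 'Brown', 10);
""")

rows = conn.execute("""
SELECT i.id, i.item, s.name
FROM Items i
JOIN (
  -- Number the rows of each state from highest to lowest priority.
  SELECT name, state,
         ROW_NUMBER() OVER (PARTITION BY state ORDER BY priority DESC) AS rn
  FROM States
) s ON s.state = i.state AND s.rn = 1   -- keep only the top-priority row
ORDER BY i.id
""").fetchall()

print(rows)
```

Because the numbering is computed once per state rather than once per item, this form ranks the States table a single time regardless of how many items there are.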
EDIT:
I just started a blog series about joins: A Join A Day
The Cross Apply will be topic of day 8 (12/8/2012).

Left Join on Associative Table

I have three tables
Prospect -- holds prospect information
id
name
projectID
Sample data for Prospect
id | name | projectID
1 | p1 | 1
2 | p2 | 1
3 | p3 | 1
4 | p4 | 2
5 | p5 | 2
6 | p6 | 2
Conjoint -- holds conjoint information
id
title
projectID
Sample data
id | title | projectID
1 | color | 1
2 | size | 1
3 | qual | 1
4 | color | 2
5 | price | 2
6 | weight | 2
There is an associative table that holds the conjoint values for the prospects:
ConjointProspect
id
prospectID
conjointID
value
Sample Data
id | prospectID | conjointID | value
1 | 1 | 1 | 20
2 | 1 | 2 | 30
3 | 1 | 3 | 50
4 | 2 | 1 | 10
5 | 2 | 3 | 40
There are one or more prospects and one or more conjoints in their respective tables. A prospect may or may not have a value for each conjoint.
I'd like an SQL statement that extracts all conjoint values for each prospect of a given project, displaying NULL where the ConjointProspect table has no value for a given conjoint and prospect.
Something along the lines of this for projectID = 1
prospectID | conjoint ID | value
1 | 1 | 20
1 | 2 | 30
1 | 3 | 50
2 | 1 | 10
2 | 2 | NULL
2 | 3 | 40
3 | 1 | NULL
3 | 2 | NULL
3 | 3 | NULL
I've tried using an inner join on the prospect and conjoint tables and then a left join on ConjointProspect, but somewhere I'm getting Cartesian products for prospect/conjoint pairs that don't make any sense (to me):
SELECT p.id, p.name, c.id, c.title, cp.value
FROM prospect p
INNER JOIN conjoint c ON p.projectID = c.projectid
LEFT JOIN conjointProspect cp ON cp.prospectID = p.id
WHERE p.projectID = 1
ORDER BY p.id, c.id
prospectID | conjoint ID | value
1 | 1 | 20
1 | 2 | 30
1 | 3 | 50
1 | 1 | 20
1 | 2 | 30
1 | 3 | 50
1 | 1 | 20
1 | 2 | 30
1 | 3 | 50
2 | 1 | 10
2 | 2 | 40
2 | 1 | 10
2 | 2 | 40
2 | 1 | 10
2 | 2 | 40
3 | 1 | NULL
3 | 2 | NULL
3 | 3 | NULL
Guidance is very much appreciated!!
This will work for you. First build the Cartesian product of all prospects and conjoints within the project as a derived table in the FROM clause, then left join it to ConjointProspect. You can obviously change or drop columns from the result, but everything is there, joined the way you want and producing exactly the results you expect:
SELECT
PJ.*,
CJP.Value
FROM
( SELECT
P.ID ProspectID,
P.Name,
P.ProjectID,
CJ.Title,
CJ.ID ConJointID
FROM
Prospect P,
ConJoint CJ
where
P.ProjectID = 1
AND P.ProjectID = CJ.ProjectID
) PJ
LEFT JOIN conjointProspect cjp
ON PJ.ProspectID = cjp.prospectID
AND PJ.ConjointID = cjp.conjointid
ORDER BY
PJ.ProspectID,
PJ.ConJointID
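The same idea, every prospect/conjoint pair of the project first and then a left join on both keys, can be checked against the sample data. This is a sketch using Python's built-in sqlite3 module; it writes the pairing as an explicit JOIN instead of the comma syntax, which should be equivalent here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Prospect (id INTEGER, name TEXT, projectID INTEGER);
CREATE TABLE Conjoint (id INTEGER, title TEXT, projectID INTEGER);
CREATE TABLE ConjointProspect (id INTEGER, prospectID INTEGER, conjointID INTEGER, value INTEGER);
INSERT INTO Prospect VALUES
  (1, 'p1', 1), (2, 'p2', 1), (3, 'p3', 1), (4, 'p4', 2), (5, 'p5', 2), (6, 'p6', 2);
INSERT INTO Conjoint VALUES
  (1, 'color', 1), (2, 'size', 1), (3, 'qual', 1),
  (4, 'color', 2), (5, 'price', 2), (6, 'weight', 2);
INSERT INTO ConjointProspect VALUES
  (1, 1, 1, 20), (2, 1, 2, 30), (3, 1, 3, 50), (4, 2, 1, 10), (5, 2, 3, 40);
""")

rows = conn.execute("""
SELECT p.id, c.id, cp.value
FROM Prospect p
JOIN Conjoint c ON c.projectID = p.projectID      -- all pairs within a project
LEFT JOIN ConjointProspect cp
       ON cp.prospectID = p.id
      AND cp.conjointID = c.id                    -- match on BOTH keys
WHERE p.projectID = 1
ORDER BY p.id, c.id
""").fetchall()

for row in rows:
    print(row)
```

Joining ConjointProspect on the prospect alone is what multiplied the rows in the question; adding the conjointID condition leaves exactly one row per pair, with NULL where no value exists.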
Your Cartesian product is a result of joining by project id - in your sample data there are 3 prospects with a project id of 1 and 3 conjoints with a project id of 1. Joining based on project id should then result in 9 rows of data, which is what you're getting. It looks like you really need to join via the ConjointProspect table, as that is what holds the mapping between prospects and conjoints.
What if you try something like:
SELECT p.id, p.name, c.id, c.title, cp.value
FROM prospect p
LEFT JOIN conjointProspect cp ON cp.prospectID = p.id
RIGHT JOIN conjoint c ON cp.conjointID = c.id
WHERE p.projectID = 2
ORDER BY p.id, c.id
Not sure if that will work, but it seems like conjointprospects needs to be at the center of your join in order to correctly map prospects to conjoints.