prevent from double/triple SUMing when JOINing - sql

i am joining two tables: accn_demographics and accn_payments. The relationship between the two tables is one to many between accn_demographics.accn_id and accn_payments.accn_id
My question is when I am summing the PAID_AMT and COPAY_AMT, I am getting double/triple/quadrouple the number that I should be getting.
Is there an obvious problem with my join condition?
select sum(p.paid_amt) as SumPaidAmount
, sum(p.copay_amt) as SumCoPay
, p.pmt_date
, d.load_Date
, p.ACCN_ID
from accn_payments p
join
(
select distinct load_date, accn_id
from accn_demographics
) d
on p.ACCN_ID=d.ACCN_ID
where p.POSTED='Y'
and p.pmt_date between '20120701' and '20120731'
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
thanks so much for your guidance.

You need to do the summation in a subquery:
select sum(p.SumPaidAmount) as SumPaidAmount, sum(p.SumCoPay) as SumCoPay,
p.pmt_date, d.load_Date, p.ACCN_ID
from (select accn_id, p.pmt_date, sum(paid_amt) as SumPaidAmt,
sum(copay_amt) as SumCoPay
from accn_payments p
where p.POSTED='Y' and
p.pmt_date between '20120701' and '20120731'
group by accn_id, pmt_date
) p join
(select distinct load_date, accn_id from accn_demographics) d
on p.ACCN_ID=d.ACCN_ID
group by p.pmt_date, d.load_Date,p.ACCN_ID
order by 3 desc
Question: do you really intend for pmt_date to be in the final results? It looks like you want to remove it from both the outer SELECT and the subquery.

The only thing I can see if that (select distinct load_date, accn_id from accn_demographics) might return several matches. Look at your data and run a separate query
select distinct load_date, accn_id from accn_demographics WHERE accn_id=SomeID
where SomeID is one of the result accounts that is returning double/triple values. That should pinpoint your problem.

Yes, but it's not so obvious for beginners. What happens is that for every accn_payments record, you're matching on ONLY the accn_id, which means if there are multiple records in accn_demographics for that particular accn_id, then you will get duplicate accn_payment records due to the join. Is there another limiting field on accn_demographics to join back to the payments?
Ultimately, think of it this way:
accn_payments (p):
accn_id | paid_amt | copay_amt | ...
----------------------------------------------------
1 | 100.00 | 20.00 | ...
accn_demographics (d):
accn_id | load_date | ...
------------------------------------
1 | 2012/01/01 | ...
1 | 2012/03/05 | ...
1 | 2012/06/23 | ...
After joining, your results will look like this:
p.accn_id | p.paid_amt | p.copay_amt | p... | d.accn_id | d.load_date | d...
----------------------------------------------------------------------------
1 | 100.00 | 20.00 | .... | 1 | 2012/01/01 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/03/05 | ....
1 | 100.00 | 20.00 | .... | 1 | 2012/06/21 | ....
As you can see, the same row from accn_payments gets replicated for every matching accn_demographics record, since you specified only the accn_id column to be the join criteria. It can't limit the results any further, so it the DB engine says "Hey, look, this p record matches for all these d records, this must be what he was asking for!" Obviously not what was intended, as when you sum on the p.paid_amt and p.copay_amt, it performs a sum for ALL ROWS (even though they are duplicated).
Ultimately, see if you can limit the join criteria for accn_demographics even further (by some date, perhaps), that way you limit the number of duplicate payment records during the join.

Related

Complex SQL Joins with Where Clauses

Being pretty new to SQL, I ask for your patience. I have been banging my head trying to figure out how to create this VIEW by joining 3 tables. I am going to use mock tables, etc to keep this very simple. So that I can try to understand the answer - no just copy and paste.
ICS_Supplies:
Supplies_ ID Item_Description
-------------------------------------
1 | PaperClips
2 | Rubber Bands
3 | Stamps
4 | Staples
ICS_Orders:
ID SuppliesID RequisitionNumber
----------------------------------------------------
1 | 1 | R1234a
6 | 4 | R1234a
2 | 1 | P2345b
3 | 2 | P3456c
4 | 3 | R4567d
5 | 4 | P5678e
ICS_Transactions:
ID RequsitionNumber OrigDate TransType OpenClosed
------------------------------------------------------------------
1 | R1234a | 06/12/20 | Req | Open
2 | P2345b | 07/09/20 | PO | Open
3 | P3456c | 07/14/20 | PO | Closed
4 | R4567d | 08/22/20 | Req | Open
5 | P5678e | 11/11/20 | PO | Open
And this is what I want to see in my View Results
Supplies_ID Item RequsitionNumber OriginalDate TransType OpenClosed
---------------------------------------------------------------------------------------
1 | Paper Clips | P2345b | 07/09/20 | PO | OPEN
2 | Rubber Bands | Null | Null | Null | Null
3 | Stamps | Null | Null | Null | Null
4 | Staples | P56783 | 11/11/20 | PO | OPEN
I just can't get there. I want to always have the same amount of records that we have in the ICS_Supplies Table. I need to join to the ICS_Orders Table in order to grab the Requisition Number because that's what I need to join on the ICS_Transactions Table. I don't want to see data in the new added fields UNLESS ICS_Transactions.TransType = 'PO' AND ICS_Transactions.OpenClosed = 'OPEN', otherwise the joined fields should be seen as null, regardless to what they contain. IF that is possible?
My research shows this is probably a LEFT Join, which is very new to me. I had made many attempts on my own, and then posted my question yesterday. But I was struggling to ask the correct question and it was recommended by other members that I post the question again . .
If needed, I can share what I have done, but I fear it will make things overly confusing as I was going in the wrong direction.
I am adding a link to the original question, for those that need some background info
Original Question
If there is any additional information needed, just ask. I do apologize in advance if I have left out any needed details.
This is a bit tricky, because you want to exclude rows in the second table depending on whether there is a match in the third table - so two left joins are not what you are after.
I think this implements the logic you want:
select s.supplies_id, s.item_description,
t.requisition_number, t.original_date, t.trans_type, t.open_closed
from ics_supplies s
left join ics_transaction t
on t.transtype = 'PO'
and t.open_closed = 'Open'
and exists (
select 1
from ics_order o
where o.supplies_id = s.supplies_id and o.requisition_number = t.requisition_number
)
Another way to phrase this would be an inner join in a subquery, then a left join:
select s.supplies_id, s.item_description,
t.requisition_number, t.original_date, t.trans_type, t.open_closed
from ics_supplies s
left join (
select o.supplies_id, t.*
from ics_order o
inner join ics_transaction t
on t.requisition_number = o.requisition_number
where t.transtype = 'PO' and t.open_closed = 'Open'
) t on t.supplies_id = s.supplies_id
This query should return the data for supplies. The left join will add in all orders that have a supply_id (and return null for the orders that don't).
select
s.supplies_id
,s.Item_Description as [Item]
,t.RequisitionNumber
,t.OrigDate as [OriginalDate]
,t.TransType
,t.OpenClosed
from ICS_Supplies s
left join ICS_Orders o on o.supplies_id = s.supplies_id
left join ICS_Transactions t on t.RequisitionNumber = o.RequisitionNumber
where t.TransType = 'PO'
and t.OpenClosed = 'Open'
The null values will automatically show null if the record doesn't exist. For example, you are joining to the Transactions table and if there isn't a transaction_id for that supply then it will return 'null'.
Modify your query, run it, then maybe update your question using real examples if it's possible.
In the original question you wrote:
"I only need ONE matching record from the ICS_Transactions Table.
Ideally, the one that I want is the most current
'ICS_Transactions.OriginalDate'."
So the goal is to get the most recent transaction for which the TransType is 'PO' and OpenClosed is 'Open'. That the purpose of the CTE 'oa_cte' in this code. The appropriate transactions are then LEFT JOIN'ed on SuppliesId. Something like this
with oa_cte(SuppliesId, RequsitionNumber, OriginalDate,
TransType, OpenClosed, RowNum) as (
select o.SuppliesId, o.RequsitionNumber,
t.OrigDate, t.TransType, t.OpenClosed,
row_number() over (partition by o.SuppliesId
order by t.OrigDate desc)
from ICS_Orders o
join ICS_Transactions t on o.RequisitionNumber=t.RequisitionNumber
where t.TransType='PO'
and t.OpenClosed='OPEN')
select s.*, oa.*
from ICS_Supplies s
left join oa_cte oa on s.SuppliesId=oa.SuppliesId
and oa.RowNum=1;

Access query combine two tables with criteria

The below code references two tables. Each table are identical in structure, only difference being the "PRICE" and "PRICE_DATE" values. This is because it's the same table created one year ago. All I want to do is have a new table which takes the latest price in each table for each fund and inserts that into a new table. In addition to this, I also want another column which calculates the growth.
The code below works for this purpose.
SELECT [2015_11_Fund_Prices].FUND_CODE, [2015_11_Fund_Prices].PRICE AS
[PRICE_#_112015], [2016_11_Fund_Prices].PRICE AS [PRICE_#_112016]
([2016_11_Fund_Prices].[PRICE]/[2015_11_Fund_Prices].[PRICE]-1) AS Growth INTO 2016_11_Monthly_Fund_Prices
FROM 2016_11_Fund_Prices INNER JOIN 2015_11_Fund_Prices ON [2016_11_Fund_Prices].FUND_CODE = [2015_11_Fund_Prices].FUND_CODE
GROUP BY [2015_11_Fund_Prices].FUND_CODE, [2015_11_Fund_Prices].PRICE_DATE, [2015_11_Fund_Prices].PRICE, [2016_11_Fund_Prices].PRICE, [2016_11_Fund_Prices].PRICE_DATE, ([2016_11_Fund_Prices].[PRICE]/[2015_11_Fund_Prices].[PRICE]-1)
HAVING ((([2015_11_Fund_Prices].PRICE_DATE)=#24/11/2015#) AND (([2016_11_Fund_Prices].PRICE_DATE)=#24/11/2016#));
However, this code assumes that the latest price is 24/11 in both tables. I want to replace this with a max function that will result in the query referencing only the price in the row with the highest date value.
Can anyone help?
Tabels used are
+-----------+------------+-------+
| Fund_Code | PRICE_DATE | PRICE |
+-----------+------------+-------+
| 1 | 12/12/12 | 1 |
| 1 | 13/12/12 | 1.2 |
| 1 | 14/12/12 | 1.1 |
| 2 | 12/12/12 | 1.12 |
| 2 | 13/12/12 | 1.13 |
| 2 | 14/12/12 | 1.11 |
So the second table is exactly the same but dates corresponding to the following year.
All I want is a table with:
Fund_Code Price1 Price2 Growth
Thanks
You need a sub-query like this:
SELECT FUND_CODE, MAX(PRICE_DATE) AS MaxPriceDate FROM 2016_11_Fund_Prices GROUP BY FUND_CODE
If you add this sub-query to the above and link it to the 2016_11_Fund_Prices table on FUND_CODE and PRICE_DATE=MaxPriceDate it should do what you need.
SELECT 2016_11_Fund_Prices.FUND_CODE, PRICE, PRICE_DATE
FROM 2016_11_Fund_Prices
INNER JOIN (SELECT FUND_CODE, MAX(PRICE_DATE) AS MaxPriceDate FROM 2016_11_Fund_Prices GROUP BY FUND_CODE) mp
ON 2016_11_Fund_Prices.FUND_CODE=mp.FUND_CODE AND 2016_11_Fund_Prices.PRICE_DATE=mp.MaxPriceDate

Rewriting the HAVING clause and COUNT function

I'm trying to conceptually understand how to rewrite the HAVING clause and COUNT function.
I was asked "Find the names of all reviewers who have contributed three or more ratings. (As an extra challenge, try writing the query without HAVING or without COUNT.)" in relation this this simple database: http://sqlfiddle.com/#!5/35779/2/0
The query with HAVING and COUNT is easy. Without, I'm having difficulty.
Help would be very much appreciated. Thank you.
One option would be to use SUM(1) in place of COUNT in a subquery, and using WHERE instead of HAVING:
SELECT b.name
FROM (SELECT rID,SUM(1) Sum1
FROM rating
GROUP BY rID
)a
JOIN reviewer b
ON a.rID = b.rID
WHERE Sum1 >= 3
Demo: SQL Fiddle
Update: Some explanation of SUM(1):
Adding a constant to a SELECT statement will result in that value being repeated for every row returned, for example:
SELECT rID
,1 as Col1
FROM rating
Returns:
| rID | Col1 |
|-----|------|
| 201 | 1 |
| 201 | 1 |
| 202 | 1 |
| 203 | 1 |
| 203 | 1 |
......
SUM(1) is applying a constant 1 to every row and aggregating it.

Removing duplicate SQL records to permit a unique key

I have a table ('sales') in a MYSQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. To first remove the dupes and set the constraint is proving a bit tricky.
Table structure (simplified):
'id (unique, autoinc)'
product_id
The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, eg: the highest id.
Or to put another way, I would like to delete only duplicate records, excluding the ids matched by the following query whilst also preserving the existing non-duped records:
select id
from sales s
inner join (select product_id,
max(id) as maxId
from sales
group by product_id
having count(product_id) > 1) groupedByProdId on s.product_id
and s.id = groupedByProdId.maxId
I've struggled with this on two fronts - writing the query to select the correct records to delete and then also the constraint in MYSQL where a subselect FROM clause of a DELETE cannot reference the same table from which data is being removed.
I checked out this answer and it seemed to deal with the subject, but seem specific to sql-server, though I wouldn't rule this question out from duplicating another.
In reply to your comment, here's a query that works in MySQL:
delete YourTable
from YourTable
inner join YourTable yt2
on YourTable.product_id = yt2.product_id
and YourTable.id < yt2.id
This would only remove duplicate rows. The inner join will filter out the latest row for each product, even if no other rows for the same product exist.
P.S. If you try to alias the table after FROM, MySQL requires you to specify the name of the database, like:
delete <DatabaseName>.yt
from YourTable yt
inner join YourTable yt2
on yt.product_id = yt2.product_id
and yt.id < yt2.id;
Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY.
For example:
describe sales;
+------------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| product_id | int(11) | NO | | NULL | |
+------------+---------+------+-----+---------+----------------+
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 3 |
| 6 | 2 |
+----+------------+
ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC;
Query OK, 6 rows affected (0.03 sec)
Records: 6 Duplicates: 3 Warnings: 0
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 6 | 2 |
| 5 | 3 |
| 2 | 1 |
+----+------------+
See this pythian post for more information.
Note that the ids end up in reverse order. I don't think this matters, since order of the ids should not matter in a database (as far as I know!). If this displeases you however, the post linked to above shows a way to solve this problem too. However, it involves creating a temporary table which requires more hard drive space than the in-place method I posted above.
I might do the following in sql-server to eliminate the duplicates:
DELETE FROM Sales
FROM Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
It looks like the analogous delete statement for mysql might be:
DELETE FROM Sales
USING Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
This type of problem is easier to solve with CTEs and Ranking functions, however, you should be able to do something like the following to solve your problem:
Delete Sales
Where Exists(
Select 1
From Sales As S2
Where S2.product_id = Sales.product_id
And S2.id > Sales.Id
Having Count(*) > 0
)

Using multiple left joins to calculate averages and counts

I am trying to figure out how to use multiple left outer joins to calculate average scores and number of cards. I have the following schema and test data. Each deck has 0 or more scores and 0 or more cards. I need to calculate an average score and card count for each deck. I'm using mysql for convenience, I eventually want this to run on sqlite on an Android phone.
mysql> select * from deck;
+----+-------+
| id | name |
+----+-------+
| 1 | one |
| 2 | two |
| 3 | three |
+----+-------+
mysql> select * from score;
+---------+-------+---------------------+--------+
| scoreId | value | date | deckId |
+---------+-------+---------------------+--------+
| 1 | 6.58 | 2009-10-05 20:54:52 | 1 |
| 2 | 7 | 2009-10-05 20:54:58 | 1 |
| 3 | 4.67 | 2009-10-05 20:55:04 | 1 |
| 4 | 7 | 2009-10-05 20:57:38 | 2 |
| 5 | 7 | 2009-10-05 20:57:41 | 2 |
+---------+-------+---------------------+--------+
mysql> select * from card;
+--------+-------+------+--------+
| cardId | front | back | deckId |
+--------+-------+------+--------+
| 1 | fron | back | 2 |
| 2 | fron | back | 1 |
| 3 | f1 | b2 | 1 |
+--------+-------+------+--------+
I run the following query...
mysql> select deck.name, sum(score.value)/count(score.value) "Ave",
-> count(card.front) "Count"
-> from deck
-> left outer join score on deck.id=score.deckId
-> left outer join card on deck.id=card.deckId
-> group by deck.id;
+-------+-----------------+-------+
| name | Ave | Count |
+-------+-----------------+-------+
| one | 6.0833333333333 | 6 |
| two | 7 | 2 |
| three | NULL | 0 |
+-------+-----------------+-------+
... and I get the right answer for the average, but the wrong answer for the number of cards. Can someone tell me what I am doing wrong before I pull my hair out?
Thanks!
John
It's running what you're asking--it's joining card 2 and 3 to scores 1, 2, and 3--creating a count of 6 (2 * 3). In card 1's case, it joins to scores 4 and 5, creating a count of 2 (1 * 2).
If you just want a count of cards, like you're currently doing, COUNT(Distinct Card.CardId).
select deck.name, coalesce(x.ave,0) as ave, count(card.*) as count -- card.* makes the intent more clear, i.e. to counting card itself, not the field. but do not do count(*), will make the result wrong
from deck
left join -- flatten the average result rows first
(
select deckId,sum(value)/count(*) as ave -- count the number of rows, not count the column name value. intent is more clear
from score
group by deckId
) as x on x.deckId = deck.id
left outer join card on card.deckId = deck.id -- then join the flattened results to cards
group by deck.id, x.ave, deck.name
order by deck.id
[EDIT]
sql has built-in average function, just use this:
select deckId, avg(value) as ave
from score
group by deckId
What's going wrong is that you're creating a Cartesian product between score and card.
Here's how it works: when you join deck to score, you may have multiple rows match. Then each of these multiple rows is joined to all of the matching rows in card. There's no condition preventing that from happening, and the default join behavior when no condition restricts it is to join all rows in one table to all rows in another table.
To see it in action, try this query, without the group by:
select *
from deck
left outer join score on deck.id=score.deckId
left outer join card on deck.id=card.deckId;
You'll see a lot of repeated data in the columns that come from score and card. When you calculate the AVG() over data that has repeats in it, the redundant values magically disappear (as long as the values are repeated uniformly). But when you COUNT() or SUM() them, the totals are way off.
There may be remedies for inadvertent Cartesian products. In your case, you can use COUNT(DISTINCT) to compensate:
select deck.name, avg(score.value) "Ave", count(DISTINCT card.front) "Count"
from deck
left outer join score on deck.id=score.deckId
left outer join card on deck.id=card.deckId
group by deck.id;
This solution doesn't solve all cases of inadvertent Cartesian products. The more general-purpose solution is to break it up into two separate queries:
select deck.name, avg(score.value) "Ave"
from deck
left outer join score on deck.id=score.deckId
group by deck.id;
select deck.name, count(card.front) "Count"
from deck
left outer join card on deck.id=card.deckId
group by deck.id;
Not every task in database programming must be done in a single query. It can even be more efficient (as well as simpler, easier to modify, and less error-prone) to use individual queries when you need multiple statistics.
Using left joins isn't a good approach, in my opinion. Here's a standard SQL query for the result you want.
select
name,
(select avg(value) from score where score.deckId = deck.id) as Ave,
(select count(*) from card where card.deckId = deck.id) as "Count"
from deck;