sum 'distinct' rows with same values - sql

I have a database which has a feeder that may have several distributors, each which may have several transformers, each which may have several clients and a certain kVA (power that gets to the clients).
And I have the following code:
SELECT f.feeder,
       d.distributor,
       count(DISTINCT t.transformer) AS total_transformers,
       sum(t.Kvan) AS Total_KVA,
       count(c.client) AS Clients
FROM feeders f
LEFT JOIN distributors d
       ON (d.feeder = f.feeder)
LEFT JOIN transformers t
       ON (t.transformer = d.transformer)
LEFT JOIN clients c
       ON (c.transformer = t.transformer)
WHERE d.transformer IS NOT NULL
GROUP BY f.feeder,
         d.distributor
ORDER BY f.feeder,
         d.distributor
The sum is supposed to return the total of the different kVA values the transformers have. Each transformer has a single kVA rating shared by all the clients connected to it, but the query sums that rating once per client instead of once per transformer.
I need to group on the feeder and distributor (I want to see how much kVA each distributor has and how many clients in total).
So what should be "feeder1|dist1|2|600|374" comes back as "feeder1|dist1|2|130000|374" (one transformer has 200 kVA and the other one 400, but the query sums those two values 374 times instead of returning 200+400).

Your data model seems a little messy, in that you've said a distributor can have many transformers (and logic suggests a transformer sits on a single distributor), yet your query implies that the transformer ID is on the distributor record, which normally implies the opposite relationship...
So if that's right, you must have multiple records in the distributors table for the same distributor, i.e. distributor can't be a unique key in that table, which makes the query quite hard to reason about accurately. (For example, what happens if the records for a distributor don't all carry the same feeder ID? I'm guessing you wouldn't like the answer... Presumably you mean for that to be impossible, but with the model as described it isn't. And now I'm second-guessing whether the apparent keys on the other tables are in fact unique... But I digress.)
Or maybe something else is off; the point is that the information you've given may be inconsistent or incomplete. Since I'm inferring an unusual data model, I can't guarantee the following is bug-free (though if you provide more detail so I can make fewer guesses, I may be able to refine the answer).
The trouble is that by the time you're ready to do the aggregation, the transformer data is embedded in a larger row that isn't keyed on the transformer alone. There are a couple of ways you could fix that, all centered on changing how the values are aggregated. Here's one option:
select f.feeder
     , dtc.distributor
     -- the next values work because transformer is already unique per group
     , count(dtc.transformer) total_transformers
     , sum(dtc.kvan) total_kva
     , sum(dtc.clients) clients
from feeders f
join (select d.distributor
           , d.feeder
           , t.transformer
           , max(t.kvan) as kvan -- or min, doesn't matter
           , count(distinct c.client) clients
      from distributors d
      left join transformers t
             on d.transformer = t.transformer
      left join clients c
             on c.transformer = t.transformer
      where d.transformer is not null
      group by d.distributor, d.feeder, t.transformer
     ) dtc
  on dtc.feeder = f.feeder
group by f.feeder, dtc.distributor
A few notes:
I changed the outer query's join to an inner join, because any null rows from the original left join from feeders would have been eliminated by the original where clause anyway.
I kept the where clause; having it alongside the distributors-to-transformers left join is a little odd, but it's different from both an inner join and an outer join without the where clause (since the where clause acts on the left table's column). I'm avoiding changing the semantics of your original query, but it's odd enough that you may want to take another look at it.
What the subquery buys you is this: the inner query returns one row per feeder/distributor/transformer, i.e. for each feeder/distributor it returns one row per transformer. That row is itself an aggregate so that we can count clients, but since every row in that aggregation comes from the same transformer record, max() safely carries that single record's kvan value up into the aggregation.
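If you'd rather not nest one aggregate inside another, a second option is to compute the kVA totals and the client counts in two independent derived tables and then join them, so neither aggregation can fan out the other's rows. A minimal sketch, assuming the same table and column names as above and that distributors holds one row per distributor/transformer pair:
select k.feeder
     , k.distributor
     , k.total_transformers
     , k.total_kva
     , cl.clients
from (select d.feeder
           , d.distributor
           , count(distinct t.transformer) as total_transformers
           , sum(t.kvan) as total_kva -- assumes one distributors row per distributor/transformer pair
      from distributors d
      join transformers t
        on t.transformer = d.transformer
      group by d.feeder, d.distributor
     ) k
left join (select d.distributor
                , count(distinct c.client) as clients
           from distributors d
           join clients c
             on c.transformer = d.transformer
           group by d.distributor
          ) cl
  on cl.distributor = k.distributor
order by k.feeder, k.distributor
Which version performs better will depend on your indexes and row counts; the single-subquery version above walks the distributor/transformer/client chain only once, which is usually the safer default.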

My Joins in query not pulling through correctly

Good evening. Could someone please help me with the following? I am trying to join two tables. The first is wbr_global.gl_ap_details, which stores historic GL information. The second table, sandbox.utr_fixed_mapping, is where account mapping is stored. For example, account number 60820 is mapped as Employee Relation. The first table needs the mapping from the second table, linked on the account number. The output I am getting is not right and far too big. Any help would be appreciated!
select sandbox.utr_fixed_mapping_na.new_mapping_1,
       sum(wbr_global.gl_ap_details.amount)
from wbr_global.gl_ap_details
LEFT JOIN sandbox.utr_fixed_mapping_na
       ON wbr_global.gl_ap_details.account_number = sandbox.utr_fixed_mapping_na.account_number
Where gl_ap_details.cost_center = '1172'
  and gl_ap_details.period_name = 'JUL-21'
  and gl_ap_details.ledger_name = 'Amazon.com, Inc.'
Group by 1;
I tried adding the cast function but after 5000 seconds of the query running I canceled it.
The query itself appears OK, with only minor changes needed. Learn to use table aliases: that way you don't have to keep typing the long database.table.column names all over, and the SQL is easier to read anyhow.
Notice the aliases "gl" and "fm" after the tables are declared; those aliases are then used to qualify the columns. Easier to read, would you agree?
I also added the GL account number to the query, as described below it.
select
      gl.account_number,
      fm.new_mapping_1,
      sum(gl.amount)
   from
      wbr_global.gl_ap_details gl
         LEFT JOIN sandbox.utr_fixed_mapping_na fm
            ON gl.account_number = fm.account_number
   Where
          gl.cost_center = '1172'
      and gl.period_name = 'JUL-21'
      and gl.ledger_name = 'Amazon.com, Inc.'
   Group by
      gl.account_number,
      fm.new_mapping_1
Now, as for your query returning nulls: that just means there are records in the gl_ap_details table with an account number that is not found in the utr_fixed_mapping_na table. To see WHICH GL account numbers do NOT exist there, I have added the account number to the query. It's possible there are MULTIPLE records in gl_ap_details that are not found in the mapping table, so you may get
GLAccount    Description    SumOfAmount
glaccount1   null           $someAmount
glaccount37  null           $someAmount
glaccount49  null           $someAmount
glaccount72  Depreciation   $someAmount
glaccount87  Real Estate    $someAmount
glaccount92  Building       $someAmount
glaccount99  Salaries       $someAmount
I obviously made up the glaccount values just to show the point. You may have multiple accounts where the nulls' total amount is masking how many different GL account numbers were NOT found.
Once you find which are missing, you can check/confirm whether they SHOULD be in the mapping table.
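For instance, a sketch of that check using the same filters as above: an anti-join that lists only the unmapped account numbers and the money they represent (the fm.account_number is null test keeps just the rows the left join failed to match):
select
      gl.account_number,
      sum(gl.amount) as unmapped_amount
   from
      wbr_global.gl_ap_details gl
         LEFT JOIN sandbox.utr_fixed_mapping_na fm
            ON gl.account_number = fm.account_number
   Where
          gl.cost_center = '1172'
      and gl.period_name = 'JUL-21'
      and gl.ledger_name = 'Amazon.com, Inc.'
      and fm.account_number is null
   Group by
      gl.account_number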
FEEDBACK.
Since you do realize the missing numbers, let's consider a Cartesian result. If there are multiple entries in the mapping table for the same G/L account number, you will get a Cartesian result, thus bloating your numbers. To clarify, let's say your mapping table has
Mapping file.
GL   Descr1    NewMapping
1    test      Salaries
1    testView  Buildings
1    Another   Depreciation
And your GL_AP_Details has
GL   Amount
1    $100
Your total for the query would come out to $300, because the query joins the AP Details GL #1 to EACH of the entries in the mapping file, thus bloating the amount. You could also add a COUNT(*) as NumberOfEntries to the query to see how many transactions it THINKS it is processing. Is there some unique ID in the GL_AP_Details table? If so, you could also count DISTINCT ID values. If the two counts differ (distinct is lower than the number of entries), I think THAT is your culprit.
select
      fm.new_mapping_1,
      sum(gl.amount),
      count(*) as NumberOfEntries,
      count(distinct gl.UniqueIdField) as DistinctTransactions
   from
      wbr_global.gl_ap_details gl
         LEFT JOIN sandbox.utr_fixed_mapping_na fm
            ON gl.account_number = fm.account_number
   Where
          gl.cost_center = '1172'
      and gl.period_name = 'JUL-21'
      and gl.ledger_name = 'Amazon.com, Inc.'
   Group by
      fm.new_mapping_1
Might you also need to limit the mapping table for a specific prophecy or mec view?
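If a Cartesian result does turn out to be the culprit, a quick way to confirm it is to look for account numbers that appear more than once in the mapping table itself, something like:
select
      account_number,
      count(*) as mapping_rows
   from
      sandbox.utr_fixed_mapping_na
   group by
      account_number
   having
      count(*) > 1
Any account number this returns will multiply every matching gl_ap_details row by its mapping_rows count in the aggregate.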
If you "think" that the result of an aggregate is wrong, then the easiest way to verify this is to select the individual rows that correlate to 1 record in the aggregate output and inspect the records, looking for duplications.
For instance, pick 'Building Management':
SELECT fixed.new_mapping_1, details.amount, *
FROM wbr_global.gl_ap_details details
LEFT JOIN sandbox.utr_fixed_mapping_na fixed
       ON details.account_number = fixed.account_number
WHERE details.cost_center = '1172'
  AND details.period_name = 'JUL-21'
  AND details.ledger_name = 'Amazon.com, Inc.'
  AND fixed.new_mapping_1 = 'Building Management'
Notice that we tack a ,* onto the end of the projection; this shows you everything the query has access to. Look for repeating sections of data that you were not expecting, then, depending on which table they originate from, you might add additional criteria to the JOIN or to the WHERE, or you might need to group by additional columns.
This type of issue is really hard to comment on in a forum like this, because it is highly specific to your schema and the data contained in it, making any solution subjective to criteria you are not likely to publish online.
Generally, if you think a calculation is wrong, you need to manually compute it to verify it. The advice above helps you inspect the data your query is using; construct your own query, or use other tools, to build a data set that lets you manually compute the correct values, then work those checks back into (or replace) your original query.
The speed issues are out of scope here; we can comment on the poor schema design, but I suspect you don't have a choice. In the utr_fixed_mapping_na table, you should make account_number the same column type as the source data, or add a new column that has the data in the original type; then you can set up indexes on the columns to improve the speed of the join.
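As a sketch of that last suggestion (the exact DDL syntax and the bigint type are assumptions; adjust them to your database and the real account-number format):
-- add a column whose type matches the source data, backfill it,
-- then index both sides of the join
alter table sandbox.utr_fixed_mapping_na add account_number_num bigint;
update sandbox.utr_fixed_mapping_na
   set account_number_num = cast(account_number as bigint);
create index idx_fm_account on sandbox.utr_fixed_mapping_na (account_number_num);
create index idx_gl_account on wbr_global.gl_ap_details (account_number);
With matching types and indexes in place, the join no longer has to convert every row at query time, which is usually what turns a multi-thousand-second query into one that finishes in seconds.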

MS Access - query that sums a sum through multiple joined tables

I have three tables Models, Buildups, and Components with many-to-many join tables in between them. Each model can have multiple buildups and buildups are composed of multiple components. The components table has a field called Retail.
I'm trying to create a query for a report where the user can see a model and know the total buildup retail amount, which would be the sum of the Retail field of each component in the buildup, and then the sum of each buildup in the model.
I need a way to reference the sum of the sum of components without the Enter Parameter box appearing when the query is run (strangely enough, when the parameter box is left blank the query calculates correctly, but I don't want the box to pop up).
Is the solution a nested query? If so, how would I do that? Or is the solution to use DSum()? Once again, if so, how would I implement that?
I'm not sure what to reference to make the criteria portion of the DSum() formula work correctly.
Unless I've misunderstood your database structure or what you are looking to obtain, it seems like this would be sufficient:
select
    mo.JandelModelID,
    sum(co.retail) as Total_Retail
from
    (
        (
            tblJandelModels mo inner join tblJandelModelBuildups mb
                on mo.JandelModelID = mb.JandelModelID
        )
        inner join tblBuildupComponents bc on mb.BuildupID = bc.BuildupID
    )
    inner join tblComponents co on bc.ComponentID = co.ComponentID
group by
    mo.JandelModelID
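If you also want the per-buildup subtotal (the sum of sums the question describes), a nested query works and won't trigger the Enter Parameter box, since every referenced name is a real field. A sketch assuming the same table names; the derived-table (AS bu) syntax is standard Access SQL, but verify it against your version:
select
    mb.JandelModelID,
    sum(bu.Buildup_Retail) as Total_Retail
from
    tblJandelModelBuildups mb inner join
    (
        select bc.BuildupID, sum(co.Retail) as Buildup_Retail
        from tblBuildupComponents bc inner join tblComponents co
            on bc.ComponentID = co.ComponentID
        group by bc.BuildupID
    ) as bu
    on mb.BuildupID = bu.BuildupID
group by
    mb.JandelModelID
The parameter prompt you were seeing usually means some name in the query or report doesn't match any field Access can resolve, so double-check spelling against the table designs before reaching for DSum().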

SQL query malfunction

So I'm trying to use INNER JOIN in my SQL command because I am trying to replace the foreign key ID numbers with the text value of each column. However, when I use INNER JOIN, the column for "Standard" always gives me the same value. The following is what I started with:
SELECT Grade_Id, Cluster_Eng_Id, Domain_Math_Eng_Id, Standard
FROM `math_standards_eng`
WHERE 1
which returns this (which is good). Notice that the Standard values are different:
Grade_Id  Cluster_Eng_Id  Domain_Math_Eng_Id  Standard
103       131             107                 Explain equivalence of fractions in special cases...
104       143             105                 Know relative sizes of measurement units within o...
When I try to use INNER JOIN, the values for Grade_Id, Cluster_Eng_Id, and Domain_Math_Eng_Id are changed from numbers to the actual text. The Standard column, however, seems to return the same value over and over. Here is my code:
SELECT
grades_eng.Grade, domain_math_eng.Domain, cluster_eng.Cluster,
math_standards_eng.Standard
FROM
math_standards_eng
INNER JOIN
grades_eng ON math_standards_eng.Grade_Id = grades_eng.Id
INNER JOIN
domain_math_eng ON math_standards_eng.Domain_Math_Eng_Id
INNER JOIN
cluster_eng ON math_standards_eng.Cluster_Eng_Id
This is what I get when I run the query:
Grade Domain Cluster Standard
3rd Counting and cardinality Know number names and the count sequence Explain equivalence of fractions in special cases...
3rd Expressions and Equations Know number names and the count sequence Explain equivalence of fractions in special cases...
3rd Functions Know number names and the count sequence Explain equivalence of fractions in special cases.
4th Counting and cardinality Know number names and the count sequence Know relative sizes of measurement units within o...
4th Expressions and Equations Know number names and the count sequence Know relative sizes of measurement units within o...
The text value for Standard keeps showing the same value per grade, and I do not know why: 3rd will keep showing the same thing, then the next grade changes to a new value, and it repeats over and over. Lastly, each table has a 1:M relationship with standards, as each appears multiple times in the standards table. Any advice would be greatly appreciated.
You are missing the = part of your INNER JOIN conditions on domain_math_eng and cluster_eng. As written, MySQL evaluates the bare column as a boolean, so any nonzero id counts as true and each of those joins degenerates into a cross join, which is why the same Standard repeats. I would expect something like:
SELECT grades_eng.Grade, domain_math_eng.Domain, cluster_eng.Cluster, math_standards_eng.Standard
FROM math_standards_eng
INNER JOIN grades_eng ON math_standards_eng.Grade_Id = grades_eng.Id
INNER JOIN domain_math_eng ON math_standards_eng.Domain_Math_Eng_Id = domain_math_eng.Id
INNER JOIN cluster_eng ON math_standards_eng.Cluster_Eng_Id = cluster_eng.Id

SQL query seems to work for 'AND T1.email_address_ IN (subquery)', but returns 0 rows for 'AND T1.email_address_ NOT IN (subquery)'

Good morning. I'm working in Responsys Interact, an Oracle-based email campaign management SaaS product. I'm creating a query to filter a target list for an email campaign designed to target a specific subset of our master email contact list. Here's the query I created a few weeks ago that appears to work:
/*
Table                   Symbolic Name
CONTACTS_LIST           $A$
Engaged                 $B$
TRANSACTIONS_RAW        $C$
TRANSACTION_LINES_RAW   $D$

A Responsys Filter (Engaged) will return only an RIID_, nothing else,
according to John @ Responsys... so let's join on that to the contact list.
*/
SELECT DISTINCT
    $A$.EMAIL_ADDRESS_,
    $A$.RIID_,
    $A$.FIRST_NAME,
    $A$.LAST_NAME,
    $A$.EMAIL_PERMISSION_STATUS_
FROM
    $A$
    JOIN $B$ ON $B$.RIID_ = $A$.RIID_
    LEFT JOIN $C$ ON $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_
    LEFT JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
    $A$.EMAIL_DOMAIN_ NOT IN ('none.com', 'noemail.com', 'mailinator.com', 'nomail.com') AND
    /* don't include hp customers */
    $A$.HP_PLAN_START_DATE IS NULL AND
    $A$.EMAIL_ADDRESS_ NOT IN
    (
        SELECT
            $C$.EMAIL_ADDRESS_
        FROM
            $C$
            JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
        WHERE
            /* Get only purchase transactions for certain item_id's/SKU's */
            ($D$.ITEM_FAMILY_ID IN (3,4,5,8,14,15) OR $D$.ITEM_ID IN (704,769,1893,2808,3013)) AND
            /* .... within last 60 days (i.e. 2 months) */
            $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -2)
    )
;
This seems to work: if I run the query without the subquery, we get 720K rows; if I add back the AND NOT IN (subquery), we get about 700K rows, which appears correct based on what my user knows about her data. What I'm (supposedly) doing with the NOT IN subquery is filtering out any email addresses where the customer has purchased certain items from us in the last 60 days.
Now I need to add another constraint. We still don't want customers who made certain purchases in the last 60 days, as above, but we now also want to exclude customers who have purchased another particular item within the last 12 months. So I thought I would add another subquery, as shown below. This has introduced several problems:
Performance: the query, which took a couple of minutes to run before, now takes quite a few more minutes; in fact it seems to time out...
So I wondered if there's an issue with having two subqueries, but before I thought about alternatives I decided to test the new subquery by temporarily deleting the first one, leaving just one subquery similar to the one above, but with the new item = 11 and within-the-last-12-months logic. With this, the query finally returned after a few minutes, but with zero rows.
Trying to figure out why, I tried simply changing the AND NOT IN (subquery) to AND IN (subquery), and that worked: it returned a few thousand rows, as expected.
So why would the same SQL "work" with AND IN (subquery), but return zero rows when changed to AND NOT IN (subquery), instead of what I would expect, which is my 700-something-thousand rows less the couple of thousand captured by the subquery result?
Also, what is the best, i.e. most performant, way to accomplish what I'm trying to do: filter by some purchases made within one date range, AND by some other purchases made within a different date range?
Here's the modified version:
SELECT DISTINCT
    $A$.EMAIL_ADDRESS_,
    $A$.RIID_,
    $A$.FIRST_NAME,
    $A$.LAST_NAME,
    $A$.EMAIL_PERMISSION_STATUS_
FROM
    $A$
    JOIN $B$ ON $B$.RIID_ = $A$.RIID_
    LEFT JOIN $C$ ON $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_
    LEFT JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
    $A$.EMAIL_DOMAIN_ NOT IN ('none.com', 'noemail.com', 'mailinator.com', 'nomail.com') AND
    /* don't include hp customers */
    $A$.HP_PLAN_START_DATE IS NULL AND
    $A$.EMAIL_ADDRESS_ NOT IN
    (
        SELECT
            $C$.EMAIL_ADDRESS_
        FROM
            $C$
            JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
        WHERE
            /* Get only purchase transactions for certain item_id's/SKU's */
            ($D$.ITEM_FAMILY_ID IN (3,4,5,8,14,15) OR $D$.ITEM_ID IN (704,769,1893,2808,3013)) AND
            /* .... within last 60 days (i.e. 2 months) */
            $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -2)
    )
    AND
    $A$.EMAIL_ADDRESS_ NOT IN
    (
        /* get purchase transactions for another type of item within last year */
        SELECT
            $C$.EMAIL_ADDRESS_
        FROM
            $C$
            JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
        WHERE
            $D$.ITEM_FAMILY_ID = 11 AND $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -12)
    )
;
Thanks for any ideas/insights. I may be missing or mis-remembering some basic SQL concept here - if so please help me out! Also, Responsys Interact runs on top of Oracle - it's an Oracle product - but I don't know off hand what version/flavor. Thanks!
It looks like my problem with the new subquery was poor performance due to missing indexes. Thanks to Alex Poole's comments, I found that Responsys has a facility to get an 'explain'-type analysis; it was throwing warnings and suggesting I build some indexes. I found the way to do that on the data sources, went back to the explain, and it said, "The query should run without placing an unnecessary burden on the system". And while it still ran for quite a few minutes, it did finally come back with close to the expected number of rows.
Now I'm on to the other half of the issue, which is to incorporate this second subquery in addition to the first, original subquery...
Upon further testing/analysis and refining my Stack Overflow search criteria, the answer to the main part of my question, the IN vs. NOT IN behavior, can be found here: SQL "select where not in subquery" returns no results
My performance was helped by using Responsys's explain-like feature and adding some indexes, but when I did that I also happened to add a little extra SQL to my subquery's WHERE clause. When I removed that, even with the indexes built, I was back to zero rows returned. It turned out that at least one of the transaction rows for the item family id in this additional subquery had a null email address. As the link above explains, with NOT IN a single null poisons the whole test: nothing can be definitively compared to null, so the NOT IN predicate evaluates to unknown rather than true, and zero rows come back. With IN, nulls don't matter as long as there is one positive match; that match makes the predicate true, which is why you get rows with IN but not with NOT IN. I hadn't realized some of our transaction data had null email addresses. Now I know, so I added a not-null condition on the email address to the subquery's WHERE clause, and all is good.
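For reference, the fix amounts to adding that null check inside the new subquery; NOT EXISTS is a null-safe (and often faster) alternative. Both sketches below would drop into the modified query in place of the second NOT IN block; the c2/d2 aliases are hypothetical, and whether Responsys allows aliasing the $C$/$D$ substitutions is an assumption worth checking:
/* option 1: keep NOT IN, but exclude the null email addresses */
AND $A$.EMAIL_ADDRESS_ NOT IN
(
    SELECT $C$.EMAIL_ADDRESS_
    FROM $C$
        JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
    WHERE $D$.ITEM_FAMILY_ID = 11
        AND $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -12)
        AND $C$.EMAIL_ADDRESS_ IS NOT NULL
)
/* option 2: NOT EXISTS ignores the nulls automatically */
AND NOT EXISTS
(
    SELECT 1
    FROM $C$ c2
        JOIN $D$ d2 ON d2.TRANSACTION_ID = c2.TRANSACTION_ID
    WHERE c2.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_
        AND d2.ITEM_FAMILY_ID = 11
        AND c2.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -12)
)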

sql SUM value incorrect when using joins and group by

I'm writing a query that sums order values broken down by product groups. The problem is that when I add joins, the aggregated SUM gets greatly inflated; I assume that's because it's adding in duplicate rows. I'm kinda new to SQL, but I think I need to construct the query with sub-selects or nested joins?
All data returns as expected and my joins pull out the needed data, but the SUM(inv.item_total) AS Value returned is much higher than it should be. SQL below:
SELECT so.Company_id, SUM(inv.item_total) AS Value, co.company_name,
       agents.short_desc, stock_type.short_desc AS Type
FROM SORDER as so
JOIN company AS co ON co.company_id = so.company_id
JOIN invoice AS inv ON inv.Sorder_id = so.Sorder_id
JOIN sorder_item AS soitem ON soitem.sorder_id = so.Sorder_id
JOIN STOCK AS stock ON stock.stock_id = soitem.stock_id
JOIN stock_type AS stock_type ON stock_type.stype_id = stock.stype_id
JOIN AGENTS AS agents ON agents.agent_id = co.agent_id
WHERE co.last_ordered > '01-JAN-2012'
  AND so.Sotype_id = '1'
GROUP BY so.Company_id, co.company_name, agents.short_desc, stock_type.short_desc
Any guidance on how I should structure this query to pull out an "un-duplicated" SUM(inv.item_total) AS Value would be much appreciated.
To get an accurate sum, you want only the joins that are needed. So, this version should work:
SELECT so.Company_id, SUM(inv.item_total) AS Value, co.company_name
FROM SORDER so
JOIN company co ON co.company_id = so.company_id
JOIN invoice inv ON inv.Sorder_id = so.Sorder_id
GROUP BY so.Company_id, co.company_name
You can then add in one join at a time to see where the multiplication is taking place. I'm guessing it has to do with the agents.
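One hypothetical way to watch that happen is to carry a COUNT(*) along while you reintroduce the joins; the join that makes cnt jump past the true number of invoices is the one fanning out the rows:
SELECT so.Company_id, COUNT(*) AS cnt, SUM(inv.item_total) AS Value
FROM SORDER so
JOIN company co ON co.company_id = so.company_id
JOIN invoice inv ON inv.Sorder_id = so.Sorder_id
-- add the remaining joins back one at a time here and re-check cnt
GROUP BY so.Company_id
ORDER BY cnt DESC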
It sounds like the joins are not accurate.
First suspect join
For example, would an agent be per company, or per invoice?
If it is per invoice, then the join should be something along the lines of
JOIN AGENTS AS AGENTS ON agents.agent_id = inv.agent_id
Second suspect join
Can one order have many items and many invoices at the same time? That can cause problems as well. Say an order has 3 items and 3 invoices were sent out. According to your joins, the same item will show up 3 times, which means a total of 9 line items where there should be only 3. You may need to eliminate the invoices table.
A possible way to solve this on your own:
I would remove all the grouping and sums, and see if filtering by one invoice produces a unique set of rows for all the data.
Start with an invoice that has just one item and inspect your result set for accuracy. If that works, add an invoice that has multiple items and check the rows to see if you get your perfect dataset back. If not, the columns that have repeating values (Company Name, Item Name, Agent Name, etc.) are usually a good starting point for working out why the duplicates are showing up.
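Building on that: if the invoice totals are order-level while the stock types hang off the order items, one hedged way to get an un-duplicated company-level sum is to pre-aggregate the invoices per order in a derived table before joining anything that can repeat. A sketch under that assumption:
SELECT so.Company_id, co.company_name, agents.short_desc,
       SUM(inv.order_total) AS Value
FROM SORDER so
JOIN company co ON co.company_id = so.company_id
JOIN (SELECT Sorder_id, SUM(item_total) AS order_total
      FROM invoice
      GROUP BY Sorder_id) inv ON inv.Sorder_id = so.Sorder_id
JOIN AGENTS agents ON agents.agent_id = co.agent_id
WHERE co.last_ordered > '01-JAN-2012'
  AND so.Sotype_id = '1'
GROUP BY so.Company_id, co.company_name, agents.short_desc
Note the stock_type breakdown is deliberately dropped here: an invoice total belongs to the whole order, so splitting it across the order's stock types requires a business decision about apportioning the amount, not just another join.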