MS Access SQL Query using Sum() and Count() gives incorrect results

I am having an issue with a query which returns results that are very far from reality (not only does it not make sense at all but I can also calculate the correct answer using filters).
I am building a KPI db for work and this query returns KPIs by employee by period. I have a very similar query from which this one is derived which returns KPIs by sector by period which gives the exact results I have calculated using a spreadsheet. I really have no idea what happens here. Basically, I want to sum a few measures that are in the maintenances table like temps_requete_min, temps_analyse_min, temps_maj_min and temps_rap_min and then create a subtotal AND present these measures as hours (measures are presented in minutes, thus the divide by 60).
SELECT
[anal].[prenom] & " " & [anal].[nom] AS Analyste,
maint.periode, maint.annee,
Round(Sum(maint.temps_requete_min)/60,2) AS REQ,
Round(Sum(maint.temps_analyse_min)/60,2) AS ANA,
Round(Sum(maint.temps_maj_min)/60,2) AS MAJ,
Round(Sum(maint.temps_rap_min)/60,2) AS RAP,
Round((Sum(maint.temps_requete_min)+Sum(maint.temps_analyse_min)+Sum(maint.temps_maj_min)+Sum(maint.temps_rap_min))/60,2) AS STOTAL,
Count(maint.periode) AS Nombre,
a.description
FROM
rapports AS rap,
analyste AS anal,
maintenances AS maint,
per_annuelle,
annees AS a
WHERE
(((rap.id_anal_maint)=anal.id_analyste) And
((maint.id_fichier)=rap.id_rapport) And
((maint.maint_effectuee)=True) And
((maint.annee)=per_annuelle.annee) And
((per_annuelle.annee)=a.annees))
GROUP BY
[anal].[prenom] & " " & [anal].[nom],
maint.periode,
maint.annee,
a.description,
anal.id_analyste
ORDER BY
maint.annee, maint.periode;
All measures are many orders of magnitude higher than what they should be. I suspect that my Count() is wrong, but I can't see what would be wrong with the sums :|
Edit: Finally I have come up with this query which shows the same measures I have calculated using Excel from the advice given in the comments and the answer provided. Many thanks to everyone. What I would like to know however, is why it makes a difference to use explicit joins rather than implicit joins (WHERE clause on PKs).
SELECT
maintenances.periode,
[analyste].[prenom] & " " & analyste.nom,
Round(Sum(maintenances.temps_requete_min)/60,2) AS REQ,
Round(Sum(maintenances.temps_analyse_min)/60,2) AS ANA,
Round(Sum(maintenances.temps_maj_min)/60,2) AS MAJ,
Round(Sum(maintenances.temps_rap_min)/60,2) AS RAP,
Round((Sum(maintenances.temps_requete_min)+Sum(maintenances.temps_analyse_min)+Sum(maintenances.temps_maj_min)+Sum(maintenances.temps_rap_min))/60,2) AS STOTAL,
Count(maintenances.periode) AS Nombre
FROM
(maintenances INNER JOIN rapports ON maintenances.id_fichier = rapports.id_rapport)
INNER JOIN analyste ON rapports.id_anal_maint = analyste.id_analyste
GROUP BY analyste.prenom, maintenances.periode

In this case, the problem is typically that your joins are bringing together multiple dimensions. You end up doing a cross product across two or more categories.
The fix is to do the summaries independently along each dimension. That means that the "from" clause contains subqueries with group bys, and these are then joined together. The group by would disappear from the outer query.
This would suggest having a subquery such as:
from (select maint.periode, maint.annee,
Round(Sum(maint.temps_requete_min)/60,2) AS REQ,
Round(Sum(maint.temps_analyse_min)/60,2) AS ANA,
Round(Sum(maint.temps_maj_min)/60,2) AS MAJ,
Round(Sum(maint.temps_rap_min)/60,2) AS RAP,
Round((Sum(maint.temps_requete_min)+Sum(maint.temps_analyse_min) +Sum(maint.temps_maj_min)+Sum(maint.temps_rap_min))/60,2) AS STOTAL,
Count(maint.periode) AS Nombre
from maintenances maint
group by maint.periode, maint.annee
) m
I say "such as" because without a layout of the tables, it is difficult to see exactly where the problem is and what the exact solution is.
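To make the row-multiplication effect concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Access. The table contents are invented, the single temps_min column stands in for the four time columns, and the assumption that per_annuelle holds several rows per year is a guess about the asker's schema, not something the question confirms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE maintenances (id INTEGER PRIMARY KEY, annee INTEGER,
                               periode INTEGER, temps_min INTEGER);
    CREATE TABLE per_annuelle (annee INTEGER, periode INTEGER);
    INSERT INTO maintenances (annee, periode, temps_min) VALUES
        (2013, 1, 60), (2013, 1, 120), (2013, 2, 30);
    -- Assumed layout: one lookup row per period, so three rows for 2013.
    INSERT INTO per_annuelle VALUES (2013, 1), (2013, 2), (2013, 3);
""")

# Joining on annee alone pairs every maintenance row with all three
# lookup rows, so each SUM and COUNT comes out three times too large.
naive = conn.execute("""
    SELECT m.periode, SUM(m.temps_min), COUNT(*)
    FROM maintenances AS m, per_annuelle AS p
    WHERE m.annee = p.annee
    GROUP BY m.periode
    ORDER BY m.periode
""").fetchall()

# Aggregating in a subquery first, then joining to one row per year,
# keeps the totals independent of how many lookup rows exist.
pre_aggregated = conn.execute("""
    SELECT s.periode, s.total_min, s.nombre
    FROM (SELECT annee, periode, SUM(temps_min) AS total_min, COUNT(*) AS nombre
          FROM maintenances
          GROUP BY annee, periode) AS s
    INNER JOIN (SELECT DISTINCT annee FROM per_annuelle) AS y
        ON y.annee = s.annee
    ORDER BY s.periode
""").fetchall()

print(naive)           # inflated totals
print(pre_aggregated)  # totals that match the raw data
```

Comparing the two result sets against a hand calculation is a quick way to find which joined table is multiplying the rows.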

Related

MS Access 2013, How to add totals row within SQL

I'm in need of some assistance. I have searched and not found what I'm looking for. I have an assignment for school that requires me to use SQL. I have a query that pulls some columns from two tables:
SELECT Course.CourseNo, Course.CrHrs, Sections.Yr, Sections.Term, Sections.Location
FROM Course
INNER JOIN Sections ON Course.CourseNo = Sections.CourseNo
WHERE Sections.Term="spring";
I need to add a Totals row at the bottom to count the CourseNo and Sum the CrHrs. It has to be done through SQL query design as I need to paste the code. I know it can be done with the datasheet view but she will not accept that. Any advice?
To accomplish this, you can union your query together with an aggregation query. It's not clear from your question which columns you are trying to get "Totals" from, but here's an example of what I mean using your query and getting counts of each (a somewhat useless example, but you should be able to apply it to what you are doing):
SELECT
[Course].[CourseNo]
, [Course].[CrHrs]
, [Sections].[Yr]
, [Sections].[Term]
, [Sections].[Location]
FROM
[Course]
INNER JOIN [Sections] ON [Course].[CourseNo] = [Sections].[CourseNo]
WHERE [Sections].[Term] = "spring"
UNION ALL
SELECT
"TOTALS"
, SUM([Course].[CrHrs])
, count([Sections].[Yr])
, Count([Sections].[Term])
, Count([Sections].[Location])
FROM
[Course]
INNER JOIN [Sections] ON [Course].[CourseNo] = [Sections].[CourseNo]
WHERE [Sections].[Term] = "spring"
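As a rough illustration of the union approach with the two totals the question asks for (a count of CourseNo and a sum of CrHrs), here is a sketch run through Python's sqlite3 module. The sample rows and the is_total helper column are made up for the example, and SQLite's || operator stands in for Access's & string concatenation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Course (CourseNo TEXT PRIMARY KEY, CrHrs INTEGER);
    CREATE TABLE Sections (CourseNo TEXT, Yr INTEGER, Term TEXT, Location TEXT);
    INSERT INTO Course VALUES ('CS101', 3), ('MA201', 4);
    INSERT INTO Sections VALUES
        ('CS101', 2014, 'spring', 'North'),
        ('MA201', 2014, 'spring', 'South'),
        ('MA201', 2014, 'fall',   'South');
""")

# Detail rows and the totals row are produced by two SELECTs glued together
# with UNION ALL; is_total (a hypothetical helper column) keeps the totals
# row at the bottom when the combined result is sorted.
rows = conn.execute("""
    SELECT label, CrHrs FROM (
        SELECT c.CourseNo AS label, c.CrHrs AS CrHrs, 0 AS is_total
        FROM Course AS c
        INNER JOIN Sections AS s ON c.CourseNo = s.CourseNo
        WHERE s.Term = 'spring'
        UNION ALL
        SELECT 'TOTALS (' || COUNT(c.CourseNo) || ' courses)', SUM(c.CrHrs), 1
        FROM Course AS c
        INNER JOIN Sections AS s ON c.CourseNo = s.CourseNo
        WHERE s.Term = 'spring'
    )
    ORDER BY is_total, label
""").fetchall()

for row in rows:
    print(row)
```

Sorting on an explicit marker column avoids relying on the order in which the engine happens to return the two halves of the union.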
You can prepare your "total" query separately, and then output both query results together with "UNION".
It might look like:
SELECT Course.CourseNo, Course.CrHrs, Sections.Yr, Sections.Term, Sections.Location
FROM Course
INNER JOIN Sections ON Course.CourseNo = Sections.CourseNo
WHERE Sections.Term="spring"
UNION
SELECT "Total", SUM(Course.CrHrs), COUNT(Sections.Yr), COUNT(Sections.Term), COUNT(Sections.Location)
FROM Course
INNER JOIN Sections ON Course.CourseNo = Sections.CourseNo
WHERE Sections.Term="spring";
Whilst you can certainly union the aggregated totals query to the end of your original query, in my opinion this would be really bad practice and would be undesirable for any real-world application.
Consider that the resulting query could no longer be used for any meaningful analysis of the data: if displayed in a datagrid, the user would not be able to sort the data without the totals row being interspersed amongst the rest of the data; the user could no longer use the built-in Totals option to perform their own aggregate operation, and the insertion of a row only identifiable by the term totals could even conflict with other data within the set.
Instead, I would suggest displaying the totals within an entirely separate form control, using a separate query such as the following (based on your own example):
SELECT Count(Course.CourseNo) as Courses, Sum(Course.CrHrs) as Hours
FROM Course INNER JOIN Sections ON Course.CourseNo = Sections.CourseNo
WHERE Sections.Term = "spring";
However, since CrHrs is a field in your Course table and not in your Sections table, the above may yield multiples of the desired result, with each course's hours multiplied by the number of corresponding records in the Sections table.
If this is the case, the following may be more suitable:
SELECT Count(Course.CourseNo) as Courses, Sum(Course.CrHrs) as Hours
FROM
Course INNER JOIN
(SELECT DISTINCT s.CourseNo FROM Sections s WHERE s.Term = "spring") q
ON Course.CourseNo = q.CourseNo
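The difference between the two totals queries can be checked on a small data set. The sketch below uses Python's sqlite3 module with invented rows in which one course has two spring sections; it is only meant to show the multiplication and the DISTINCT fix, not the asker's real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Course (CourseNo TEXT PRIMARY KEY, CrHrs INTEGER);
    CREATE TABLE Sections (CourseNo TEXT, Term TEXT, Location TEXT);
    INSERT INTO Course VALUES ('CS101', 3), ('MA201', 4);
    -- CS101 runs two spring sections, so a plain join repeats its row.
    INSERT INTO Sections VALUES
        ('CS101', 'spring', 'North'),
        ('CS101', 'spring', 'South'),
        ('MA201', 'spring', 'North');
""")

# Plain join: CS101 is counted twice and its 3 hours are added twice.
joined = conn.execute("""
    SELECT COUNT(c.CourseNo), SUM(c.CrHrs)
    FROM Course AS c INNER JOIN Sections AS s ON c.CourseNo = s.CourseNo
    WHERE s.Term = 'spring'
""").fetchone()

# Joining to the DISTINCT list of spring courses counts each course once.
deduplicated = conn.execute("""
    SELECT COUNT(c.CourseNo), SUM(c.CrHrs)
    FROM Course AS c INNER JOIN
         (SELECT DISTINCT CourseNo FROM Sections WHERE Term = 'spring') AS q
         ON c.CourseNo = q.CourseNo
""").fetchone()

print(joined, deduplicated)
```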

MS Access Update SQL Query Extremely Slow and Multiplying the Amount of Records Updated

I am stumped on how to make this query run more efficiently/correctly. Here is the query first and then I can describe the tables that are involved:
UPDATE agg_pivot_test AS p
LEFT JOIN jd_cleaning AS c
ON c.Formerly = IIF(c.Formerly LIKE '*or*', '*' & p.LyFinalCode & '*', CStr(p.LyFinalCode))
SET p.CyFinalCode = c.FinalCode
WHERE p.CyFinalCode IS NULL AND c.Formerly IS NOT NULL;
agg_pivot_test has 200 rows of data and only 99 fit the criteria of WHERE p.CyFinalCode IS NULL. The JOIN needs some explaining. It is an IIF because some genius decided to link last year's data to this year's data using Formerly. It is a string because sometimes multiple items have been consolidated down to one so they use "or" (e.g., 632 or 631 or 630). So if I want to match this year's data I have to use Formerly to match last year's LyFinalCode. So this year the code might be 629, but I have to use the Formerly to map the items that were 632, 631, or 630 to the new code. Make sense? That is why the ON has an IIF. Also, Formerly is a string and LyFinalCode is an integer... fun.
Anyway, when you run the query it says it is updating 1807 records when again, there are only 200 records and only 99 that fit the criteria.
Any suggestions about why this is happening or how to fix it?
An interesting problem. I don't think I've ever come across something quite like this before.
I'm guessing that rows where CyFinalCode is null are being matched multiple times by the join, so the join produces a Cartesian product of row matches, and that is the basis of the rows-updated message. It seems odd, as I would have expected Access to complain about multiple row matches, when row matches should only be 1:1 in an update statement.
I would suggest rewriting the query (with this join) as a select statement, and seeing what the query gives you in the way of output; Something like:
SELECT p.*, c.*
FROM agg_pivot_test p LEFT JOIN jd_cleaning c
ON c.Formerly = IIF(c.Formerly LIKE '*or*', '*' & p.LyFinalCode & '*', CStr(p.LyFinalCode))
WHERE p.CyFinalCode IS NULL AND c.Formerly IS NOT NULL
I'm also inclined to suggest changing "... & p.LyFinalCode & ..." to "... & CStr(p.LyFinalCode) & ..." - though I can't really see why it should make a difference.
The only other thing I can think to suggest is to change the join a bit (this isn't guaranteed to be better, though it might be):
UPDATE agg_pivot_test AS p LEFT JOIN jd_cleaning AS c
ON (c.Formerly = CStr(p.LyFinalCode) OR InStr(c.Formerly, CStr(p.LyFinalCode)) > 0)
SET p.CyFinalCode = c.FinalCode
WHERE p.CyFinalCode IS NULL AND c.Formerly IS NOT NULL;
(Given the syntax of your statement, I assume this SQL is running within Access via ODBC, in which case this should be fine. If I'm wrong and the SQL is running server side, you'll need to replace InStr with the server's equivalent, such as CHARINDEX on SQL Server.)
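One way to test the multiple-match theory is to count how many jd_cleaning rows each agg_pivot_test row joins to. The sketch below does this in Python's sqlite3 module with made-up rows; SQLite's instr() stands in for Access's InStr, and the id column is an assumption added so that rows can be told apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agg_pivot_test (id INTEGER PRIMARY KEY,
                                 LyFinalCode INTEGER, CyFinalCode INTEGER);
    CREATE TABLE jd_cleaning (FinalCode INTEGER, Formerly TEXT);
    INSERT INTO agg_pivot_test (LyFinalCode, CyFinalCode) VALUES
        (63, NULL), (631, NULL), (700, NULL);
    INSERT INTO jd_cleaning VALUES
        (629, '632 or 631 or 630'), (640, '63'), (701, '700');
""")

# Count the jd_cleaning rows matched by each row that still needs a code.
# A match_count above 1 means the UPDATE would touch that row more than once.
matches = conn.execute("""
    SELECT p.id, p.LyFinalCode, COUNT(c.FinalCode) AS match_count
    FROM agg_pivot_test AS p
    LEFT JOIN jd_cleaning AS c
        ON instr(c.Formerly, CAST(p.LyFinalCode AS TEXT)) > 0
    WHERE p.CyFinalCode IS NULL
    GROUP BY p.id, p.LyFinalCode
    ORDER BY p.id
""").fetchall()

ambiguous = [row for row in matches if row[2] > 1]
print(matches)
print(ambiguous)
```

In this sample, code 63 matches both the row whose Formerly is '63' and the one containing '632', which also shows that a plain substring test can pair a short code with longer codes that merely contain it.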

Use of the HAVING clause when using multiple sums

I was having a problem getting multiple sums from multiple tables. Short story, my answer was solved in the "sql sum data from multiple tables" thread on this site. But where it came up short is that now I'd like to only show sums that are greater than a certain amount. So while I have sub-selects in my select, I think I need to use a HAVING clause to filter out the summed amounts that are too low.
Example, using the code specified in the link above (more specifically the answer that the owner has chosen as correct), I would only like to see a query result if SUM(AP2.Value) > 1500. Any thoughts?
If you need to filter on the results of ANY aggregate function, you MUST use a HAVING clause. WHERE is applied at the row level as the DB scans the tables for matching rows. HAVING is applied basically immediately before the result set is sent out to the client. At the time WHERE operates, the aggregate function results are not (and cannot be) available, so you have to use a HAVING clause, which is applied after the main query is complete and all aggregate results are available.
So... long story short, yes, you'll need to do
SELECT ...
FROM ...
WHERE ...
HAVING (SUM_AP > 1500)
Note that some databases (MySQL, for example) let you use column aliases in the HAVING clause, while others (Access and SQL Server among them) require you to repeat the aggregate expression. Logically, a HAVING on a query like the one above works the same as wrapping the initial query in another query and applying a WHERE clause on the wrapper:
SELECT *
FROM (
SELECT ...
) AS child
WHERE (SUM_AP > 1500)
You could wrap that query as a subselect and then specify your criteria in the WHERE clause:
SELECT
PROJECT,
SUM_AP,
SUM_INV
FROM (
SELECT
AP1.[PROJECT],
(SELECT SUM(AP2.Value) FROM AP AS AP2 WHERE AP2.PROJECT = AP1.PROJECT) AS SUM_AP,
(SELECT SUM(INV2.Value) FROM INV AS INV2 WHERE INV2.PROJECT = AP1.PROJECT) AS SUM_INV
FROM AP AS AP1
INNER JOIN INV AS INV1 ON
AP1.[PROJECT] = INV1.[PROJECT]
WHERE
AP1.[PROJECT] = 'XXXXX'
GROUP BY
AP1.[PROJECT]
) SQ
WHERE
SQ.SUM_AP > 1500
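The equivalence between HAVING and a WHERE on a wrapping query can be checked on a toy table. This sketch uses Python's sqlite3 module; the AP rows are invented, and the HAVING version repeats the aggregate expression rather than the alias so that it carries over to engines that do not accept aliases there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AP (PROJECT TEXT, Value REAL);
    INSERT INTO AP VALUES
        ('A', 1000), ('A', 900), ('B', 400), ('B', 300), ('C', 2000);
""")

# Filter on the aggregate directly with HAVING.
with_having = conn.execute("""
    SELECT PROJECT, SUM(Value) AS SUM_AP
    FROM AP
    GROUP BY PROJECT
    HAVING SUM(Value) > 1500
    ORDER BY PROJECT
""").fetchall()

# Same filter expressed as a WHERE on a wrapping query.
with_wrapper = conn.execute("""
    SELECT PROJECT, SUM_AP
    FROM (SELECT PROJECT, SUM(Value) AS SUM_AP FROM AP GROUP BY PROJECT) AS SQ
    WHERE SQ.SUM_AP > 1500
    ORDER BY PROJECT
""").fetchall()

print(with_having)
print(with_wrapper)
```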

Subqueries and AVG() on a subtraction

Working on a query to return the average time from when an employee begins his/her shift and then arrives at the first home (this DB assumes they are salesmen).
What I have:
SELECT l.OFFICE_NAME, crew.EMPLOYEE_NAME, //avg(first arrival time)
FROM LOCAL_OFFICE l, CREW_WORK_SCHEDULE crew
WHERE l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
You can see the AVG() command is commented out, because I know the time that they arrive at work, and the time they get to the first house, and can find the value using this:
(SELECT MIN(c.ARRIVE)
FROM ORDER_STATUS c
WHERE c.USER_ID = crew.CREW_ID)
-(SELECT START_TIME
FROM CREW_SHIFT_CODES
WHERE WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE)
Would the best way be to simply put the above into the AVG() parentheses? Just trying to learn the best methods to create queries. If you want more info on any of the tables, etc. just ask, but hopefully they're all named so you know what they're returning.
As per my comment, the example you gave would only return one record to the AVG function, and so not do very much.
If the sub-query was returning multiple records, however, your suggestion of placing the sub-query inside the AVG() would work...
SELECT
AVG((SELECT MIN(sub.val) FROM sub WHERE sub.id = main.id GROUP BY sub.group))
FROM
main
GROUP BY
main.group
(Averaging a set of minima, and so requiring two levels of GROUP BY.)
In many cases this gives good performance, and is maintainable. But sometimes the sub-query grows large, and it can be better to reformat it using an inline view...
SELECT
main.group,
AVG(sub_query.val)
FROM
main
INNER JOIN
(
SELECT
sub.id,
sub.group,
MIN(sub.val) AS val
FROM
sub
GROUP BY
sub.id,
sub.group
)
AS sub_query
ON sub_query.id = main.id
GROUP BY
main.group
Note: Although this looks as though the inline view will calculate a load of values that are not needed (and so be inefficient), most RDBMSs optimise this so only the required records get processed. (The optimiser knows how the inner query is being used by the outer query, and builds the execution plan accordingly.)
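A small runnable version of the inline-view pattern, averaging per-id minima within each group, might look like this. It uses Python's sqlite3 module, the rows are invented, and the column is called grp only because GROUP is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (id INTEGER PRIMARY KEY, grp TEXT);
    CREATE TABLE sub (id INTEGER, val REAL);
    INSERT INTO main VALUES (1, 'x'), (2, 'x'), (3, 'y');
    INSERT INTO sub VALUES (1, 10), (1, 4), (2, 8), (2, 6), (3, 9);
""")

# Inner level: one minimum per id. Outer level: average those minima per group.
averages = conn.execute("""
    SELECT main.grp, AVG(sub_query.val)
    FROM main
    INNER JOIN (SELECT id, MIN(val) AS val FROM sub GROUP BY id) AS sub_query
        ON sub_query.id = main.id
    GROUP BY main.grp
    ORDER BY main.grp
""").fetchall()

print(averages)
```

Group x averages the minima 4 and 6, while group y has the single minimum 9.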
Don't think in terms of subqueries: they're often quite slow. In effect, they are row-by-row (RBAR) operations rather than set-based ones. Instead:
- Join all the tables together
- Use a derived table to calculate the first arrival time
- Aggregate
Something like:
SELECT
l.OFFICE_NAME, crew.EMPLOYEE_NAME,
AVG(os.minARRIVE - cs.START_TIME)
FROM
LOCAL_OFFICE l
JOIN
CREW_WORK_SCHEDULE crew ON l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
JOIN
CREW_SHIFT_CODES cs ON cs.WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE
JOIN
(SELECT MIN(ARRIVE) AS minARRIVE, USER_ID
FROM ORDER_STATUS
GROUP BY USER_ID
) os ON os.USER_ID = crew.CREW_ID
GROUP BY
l.OFFICE_NAME, crew.EMPLOYEE_NAME
This probably won't give correct data because of the minARRIVE grouping: there isn't enough info from ORDER_STATUS to show "which day" or "which shift". It's simply "first arrival for that user for all time"
Edit:
This will give you average minutes
You can add this back to minARRIVE using DATEADD, or change to hh:mm with some %60 (modulo) and /60 (integer divide):
AVG(
DATEDIFF(minute, cs.START_TIME, os.minARRIVE)
)
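The modulo and integer-divide step mentioned above can be sketched in a few lines of Python. The function name is hypothetical, and it assumes the average has already been rounded to a whole, non-negative number of minutes.

```python
def minutes_to_hhmm(total_minutes: int) -> str:
    """Format a whole number of minutes as hh:mm."""
    # divmod returns (total_minutes // 60, total_minutes % 60) in one call.
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


print(minutes_to_hhmm(95))
print(minutes_to_hhmm(45))
```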

Slow Query - Help with Optimization

Hey guys. This is a follow-on from this question:
After getting the right data and making some tweaks based on requests from business, I've now got this mini-beast on my hands. This query should return the total number of new jobseeker registrations and the number of new uploaded CVs:
SELECT COUNT(j.jobseeker_id) as new_registrations,
(
SELECT
COUNT(c.cv_id)
FROM
tb_cv as c, tb_jobseeker, tb_industry
WHERE
UNIX_TIMESTAMP(c.created_at) >= '1241125200'
AND
UNIX_TIMESTAMP(c.created_at) <= '1243717200'
AND
tb_jobseeker.industry_id = tb_industry.industry_id
)
AS uploaded_cvs
FROM
tb_jobseeker as j, tb_industry as i
WHERE
j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
AND
i.industry_id = j.industry_id
GROUP BY i.description, MONTH(j.created_at)
Notes:
- The two values in the UNIX TIMESTAMP functions are passed in as parameters from the report module in our backend.
Every time I run it, MySQL chokes and lingers silently into the ether of the Interweb.
Help is appreciated.
Update: Hey guys. Thanks a lot for all the thoughtful and helpful comments. I'm only 2 weeks into my role here, so I'm still learning the schema. So, this query is somewhere between a thumbsuck and an educated guess. Will start to answer all your questions now.
tb_cv is not connected to the other tables in the sub-query. I guess this is the root cause for the slow query. It causes generation of a Cartesian product, yielding a lot more rows than you probably need.
Other than that I'd say you need indexes on tb_jobseeker.created_at, tb_cv.created_at and tb_industry.industry_id, and you might want to get rid of the UNIX_TIMESTAMP() calls in the sub-query since they prevent use of an index. Use BETWEEN and the actual field values instead.
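The effect of leaving tb_cv unconnected can be seen on a handful of rows. The following sketch uses Python's sqlite3 module with invented data; the jobseeker_id column on tb_cv is an assumption about the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb_industry (industry_id INTEGER PRIMARY KEY);
    CREATE TABLE tb_jobseeker (jobseeker_id INTEGER PRIMARY KEY, industry_id INTEGER);
    -- jobseeker_id on tb_cv is an assumed link column.
    CREATE TABLE tb_cv (cv_id INTEGER PRIMARY KEY, jobseeker_id INTEGER);
    INSERT INTO tb_industry VALUES (1), (2);
    INSERT INTO tb_jobseeker VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO tb_cv VALUES (100, 10), (101, 12);
""")

# tb_cv has no join condition, so each CV is paired with every
# jobseeker/industry pair: 2 CVs x 3 pairs = 6 rows counted.
unconnected = conn.execute("""
    SELECT COUNT(c.cv_id)
    FROM tb_cv AS c, tb_jobseeker AS j, tb_industry AS i
    WHERE j.industry_id = i.industry_id
""").fetchone()[0]

# Linking each CV to its own jobseeker brings the count back to 2.
connected = conn.execute("""
    SELECT COUNT(c.cv_id)
    FROM tb_cv AS c
    INNER JOIN tb_jobseeker AS j ON j.jobseeker_id = c.jobseeker_id
    INNER JOIN tb_industry AS i ON i.industry_id = j.industry_id
""").fetchone()[0]

print(unconnected, connected)
```

On real tables the multiplication is by the full size of the other tables, which is why the original query never seems to finish.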
Here is my attempt at understanding your query and writing a better version. I guess you want to get the count of new jobseeker registrations and new uploaded CVs per month per industry:
SELECT
i.industry_id,
i.description,
MONTH(j.created_at) AS month_created,
YEAR(j.created_at) AS year_created,
COUNT(DISTINCT j.jobseeker_id) AS new_registrations,
COUNT(cv.cv_id) AS uploaded_cvs
FROM
tb_cv AS cv
INNER JOIN tb_jobseeker AS j ON j.jobseeker_id = cv.jobseeker_id
INNER JOIN tb_industry AS i ON i.industry_id = j.industry_id
WHERE
j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
AND cv.created_at BETWEEN '2009-05-01' AND '2009-05-31'
GROUP BY
i.industry_id,
i.description,
MONTH(j.created_at),
YEAR(j.created_at)
A few things I noticed while writing the query:
- You GROUP BY values you don't output in the end. Why? (I've added the grouped fields to the output list.)
- You JOIN three tables in the sub-query while only ever using values from one of them. Why? I don't see what it would be good for, other than filtering out CV records that don't have a jobseeker or an industry attached, which I find hard to imagine. (I've removed the entire sub-query and used a simple COUNT instead.)
- Your sub-query returns the same value every time. Did you maybe mean to correlate it in some way, to the industry maybe?
- The sub-query runs once for every record in a grouped query without being wrapped in an aggregate function.
First and foremost, it may be worth moving the UNIX_TIMESTAMP conversions to the other side of the comparison (that is, apply the reverse function, FROM_UNIXTIME in MySQL, to the literal timestamp values on the other side of the >= and <=). That way the conversion is performed once for the query rather than once for every record.
Also, why does the uploaded_cvs query not have any where clause linking it to the outer query? Am I missing something here?
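The point about converting the literal instead of the column can be tried out with Python's sqlite3 module. This is only an analogy for MySQL's behaviour: SQLite's strftime('%s', ...) plays the role of UNIX_TIMESTAMP(), the index name and the to_sql_datetime helper are made up, and the timestamps are interpreted as UTC, which may not match the server's time zone.

```python
import sqlite3
from datetime import datetime, timezone


def to_sql_datetime(epoch_seconds: int) -> str:
    """Convert a Unix timestamp to a 'YYYY-MM-DD HH:MM:SS' string (UTC assumed)."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Convert the two report parameters once, outside the query.
start, end = to_sql_datetime(1241125200), to_sql_datetime(1243717200)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb_cv (cv_id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE INDEX idx_cv_created_at ON tb_cv (created_at);
""")


def plan(sql: str, params: tuple) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " ".join(row[3] for row in rows)


# Wrapping the column in a function hides it from the index: full scan.
wrapped = plan(
    "SELECT COUNT(*) FROM tb_cv "
    "WHERE CAST(strftime('%s', created_at) AS INTEGER) BETWEEN ? AND ?",
    (1241125200, 1243717200),
)

# Comparing the bare column to pre-converted literals allows an index search.
bare = plan(
    "SELECT COUNT(*) FROM tb_cv WHERE created_at BETWEEN ? AND ?",
    (start, end),
)

print(start, end)
print(wrapped)
print(bare)
```

The exact wording of the plan differs between SQLite versions, but the wrapped form is reported as a scan and the bare-column form as a search on the index.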