2010 Access Query with nested JOIN, WHERE and GROUP BY - sql

I appreciate everyone's help and patience as I continue learning through converting a large Excel/vba system to Access.
I have the following query:
SELECT AccountWeeklyBalances.AccountNumber,
AccountWeeklyBalances.AccountBalance,
AccountWeeklyBalances.AccountDate,
AccountMaster.AccountName,
AccountCurrentModel.Model,
ModelDetailAllHistory.Risk
FROM ((AccountWeeklyBalances
INNER JOIN AccountMaster
ON AccountMaster.[AccountNumber] = AccountWeeklyBalances.AccountNumber)
INNER JOIN AccountCurrentModel
ON AccountWeeklyBalances.AccountNumber=AccountCurrentModel.AccountNumber)
INNER JOIN ModelDetailAllHistory
ON AccountCurrentModel.Model=ModelDetailAllHistory.ModelName
WHERE AccountWeeklyBalances.AccountDate=[MatchDate]
;
This works, except I want to GROUP BY the Model. I tried adding
GROUP BY AccountCurrentModel.Model
and
GROUP BY ModelDetailAllHistory.ModelName
after the WHERE clause, but both give me an error:
Tried to execute a query that does not include the specified expression
'AccountNumber' as part of an aggregate function.
I've read several other posts here, but cannot figure out what I've done wrong.

It depends on what you're trying to do. If you just want to sum the AccountBalance by ModelName, then all the other columns would have to be removed from the select statement. If you want the sum of each model for each account, then you would just add the AccountNumber to the GROUP BY, probably before the ModelName.
When aggregating, you can't include anything in the select list that's not either an aggregate function (min, max, sum, etc) or something you are grouping by, because there's no way to represent that in the query results. How could you show the sum of AccountBalance by ModelName, but also include the AccountNumber? The only way to do that would be to group by both AccountNumber and ModelName.
----EDIT----
After discussing in the comments I have a clearer idea of what's going on. There is no aggregation, but there are multiple records in ModelDetailAllHistory for each Model. However, the only value we need from that table is Risk, and that will always be the same per model. So we need to eliminate the duplicate Risk values. This can be done by joining into a subquery instead of joining directly into ModelDetailAllHistory
INNER JOIN (SELECT DISTINCT ModelName, Risk FROM ModelDetailAllHistory) mh
ON AccountCurrentModel.Model=mh.ModelName
or
INNER JOIN (SELECT ModelName, max(Risk) FROM ModelDetailAllHistory GROUP BY ModelName) mh
ON AccountCurrentModel.Model=mh.ModelName
Both methods collapse the multiple Risk values into a single value per Model, eliminating the duplicate records. I tend to prefer the first option because if for some reason there were multiple Risk values for a single Model, you'd end up with duplicate records and you'd know there was something wrong. Using max() is basically choosing an arbitrary record from ModelDetailAllHistory that matches the given Model and getting the Risk value from it, since you know all the Risk values for that model should be the same. What I don't like about this method is it will hide data inconsistencies from you (e.g. if for some reason there are some ModelDetailAllHistory records for the same Model that don't have the same Risk value), and while it's nice to know you'll never ever get duplicate records, the underlying problem could end up rearing its ugly head in other unexpected ways.

Related

When to Use * in SQL Query Containing JOINs & Aggregations?

Question
Web_events table contain id,..., channel,account_id
accounts table contain id, ..., sales_rep_id
sales_reps table contains id, name
Given the above tables, write an SQL query to determine the number of times a particular channel was used in the web_events table for each name in sales_reps. Your final table should have three columns - the name of the sales_reps, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
Answer
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
The COUNT(*) is confusing to me. I don't get how SQL figure out thatCOUNT(*) is COUNT(w.channel). Can anyone clarify?
I don't get how SQL figure out that COUNT(*) is COUNT(w.channel)
COUNT() is an aggregation function that counts the number of rows that match a condition. In fact, COUNT(<expression>) in general (or COUNT(column) in particular) counts the the number of rows where the expression (or column) is not NULL.
In general, the following do exactly the same thing:
COUNT(*)
COUNT(1)
COUNT(<primary key used on inner join>)
In general, I prefer COUNT(*) because that is the SQL standard for this. I can accept COUNT(1) as a recognition that COUNT(*) is just feature bloat. However, I see no reason to use the third version, because it just requires excess typing.
More than that, I find that new users often get confused between these two constructs:
COUNT(w.channel)
COUNT(DISTINCT w.channel)
People learning SQL often think the first really does the second. For this reason, I recommend sticking with the simpler ways of counting rows. Then use COUNT(DISTINCT) when you really want to incur the overhead to count unique values (COUNT(DISTINCT) is more expensive than COUNT()).

Issue counting joined rows from two tables

My schema, query, and problematic results can be seen here:
http://sqlfiddle.com/#!17/55bc3/5/0
I've created a schema for storing posts, comments, and favourites. ( I've simplified my example for the sake of demonstration ). I'm trying to write a query to aggregate the like/favourite counts for each post, for display on a 'front page'.
To model the relationships between users/posts/favourites I've used multiple intersection tables. In the query I'm using two LEFT JOINs, and then COUNTing distinct columns in the results. I've encountered an issue where the COUNT I'm storing as comment_count overrides favourite_count when it returns anything above 0, causing it to return duplicate values for both columns.
I think I understand the mechanism behind this, being that the GROUPing of the results is causing the resulting rows to get squashed together to yield an incorrect result. I was wondering if anyone could let me know some of the theory behind what this is called, and how you would correctly write queries to handle this scenario.
As they are unrelated tables, you can count individually and then join.
SELECT p.id
,coalesce(c.comment_count,0) as comment_count
,coalesce(f.favorite_count,0) as favorite_count
FROM post p
LEFT JOIN (select post_id,count(*) as comment_count
from comment group by post_id) c ON c.post_id=p.id
LEFT JOIN (select post_id,count(*) as favorite_count
from favourite group by post_id) f ON f.post_id=p.id

Time based accumulation based on type: Speed considerations in SQL

Based on surfing the web, I came up with two methods of counting the records in a table "Table1". The counter field increments according to a date field "TheDate". It does this by summing records with an older TheDate value. Furthermore, records with different values for the compound field (Field1,Field2) are counted using separate counters. Field3 is just an informational field that is included for added awareness and does not affect the counting or how records are grouped for counting.
Method 1: Use corrrelated subquery
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
(
SELECT SUM(1) FROM Table1 InnerQuery
WHERE InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
Method 2: Use join and group-by
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
SUM(1) AS RunningCounter
FROM Table1 MainQuery INNER JOIN Table1 InnerQuery
ON InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
There is no inner query per se in Method 2, but I use the table alias InnerQuery so that a ready parellel with Method 1 can be drawn. The role is the same; the 2nd instance of Table 1 is for accumulating the counts of the records which have TheDate less than that of any record in MainQuery (1st instance of Table 1) with the same Field1 and Field2 values.
Note that in Method 2, Field 3 is include in the Group-By clause even though I said that it does not affect how the records are grouped for counting. This is still true, since the counting is done using the matching records in InnerQuery, whereas the GROUP By applies to Field 3 in MainQuery.
I found that Method 1 is noticably faster. I'm surprised by this because it uses a correlated subquery. The way I think of a correlated subquery is that it is executed for each record in MainQuery (whether or not that is done in practice after optimization). On the other hand, Method 2 doesn't run an inner query over and over again. However, the inner join still has multiple records in InnerQuery matching each record in MainQuery, so in a sense, it deals with a similar order of complexity.
Is there a decent intuitive explanation for this speed difference, as well as best practice or considerations in choosing an approach for time-base accumulation?
I've posted this to
Microsoft Answers
Stack Exchange
In fact, I think the easiest way is to do this:
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
COUNT(*)
FROM Table1 MainQuery
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
(The order by isn't required to get the same data, just to order it. In other words, removing it will not change the number or contents of each row returned, just the order in which they are returned.)
You only need to specify the table once. Doing a self-join (joining a table to itself as both your queries do) is not required. The performance of your two queries will depend on a whole load of things which I don't know - what the primary keys are, the number of rows, how much memory is available, and so on.
First, your experience makes a lot of sense. I'm not sure why you need more intuition. I imagine you learned, somewhere along the way, that correlated subqueries are evil. Well, as with some of the things we teach kids as being really bad ("don't cross the street when the walk sign is not green") turn out to be not so bad, the same is true of correlated subqueries.
The easiest intuition is that the uncorrelated subquery has to aggregate all the data in the table. The correlated version only has to aggregate matching fields, although it has to do this over and over.
To put numbers to it, say you have 1,000 rows with 10 rows per group. The output is 100 rows. The first version does 100 aggregations of 10 rows each. The second does one aggregation of 1,000 rows. Well, aggregation generally scales in a super-linear fashion (O(n log n), technically). That means that 100 aggregations of 10 records takes less time than 1 aggregation of 1000 records.
You asked for intuition, so the above is to provide some intuition. There are a zillion caveats that go both ways. For instance, the correlated subquery might be able to make better use of indexes for the aggregation. And, the two queries are not equivalent, because the correct join would be LEFT JOIN.
Actually, I was wrong in my original post. The inner join is way, way faster than the correlated subquery. However, the correlated subquery is able to display its results records as they are generated, so it appears faster.
As a side curiosity, I'm finding that if the correlated sub-query approach is modified to use sum(-1) instead of sum(1), the number of returned records seems to vary from N-3 to N (where N is the correct number, i.e., the number of records in Table1). I'm not sure if this is due to some misbehaviour in Access's rush to display initial records or what-not.
While it seems that the INNER JOIN wins hands-down, there is a major insidious caveat. If the GROUP BY fields do not uniquely distinguish each record in Table1, then you will not get an individual SUM for each record of Table1. Imagine that a particular combination of GROUP BY field values matching (say) THREE records in Table1. You will then get a single SUM for all of them. The problem is, each of these 3 records in MainQuery also matches all 3 of the same records in InnerQuery, so those instances in InnerQuery get counted multiple times. Very insidious (I find).
So it seems that the sub-query may be the way to go, which is awfully disturbing in view of the above problem with repeatability (2nd paragraph above). That is a serious problem that should send shivers down any spine. Another possible solution that I'm looking at is to turn MainQuery into a subquery by SELECTing the fields of interest and DISTINCTifying them before INNER JOINing the result with InnerQuery.

Select Statement with Distinct returning multiple rows and need only first result

I having a challenge with my query returning multiple results.
SELECT DISTINCT gpph.id, gpph.cname, gc2a.assetfilename, gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
A record, 5679599, has 2 'Primary Images' and is returning 2 results for that id but I only need the first result back. Is there any way to do this IN the current query? Do I need to write multiple queries?
I need some direction on how to constrain the results to only 1 result on Primary Image. I have looked at a ton of similar questions but most typically are just requiring the guidance of adding 'distinct' to the beginning of their query rather than on the where clause.
Edit: This problem is created by a user inputting 2 Primary Images on one record in the database. My business requirements only state to take the first result.
Any help would be awesome!
Given the choice is arbitary which to return, we can just use an aggregate on the value. This then needs a group by clause, which eliminates the need for the distinct.
SELECT gpph.id, gpph.cname, max(gc2a.assetfilename), gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
GROUP BY gpph.id, gpph.cname, gpph.alternateURL
In this instance, using max(gc2a.assetfilename) is going to give you the alphabetically highest value in the event of there being more than one record. It's not the ideal choice, some kind of timestamp knowing the order of the records might be more helpful, since then the meaning of the word 'first' could make more sense.
Replace distinct to group by :
SELECT MAX(gpph.id), gpph.cname, gc2a.assetfilename, gpph.alternateURL
FROM [StepMirror].[dbo].[stepview_nwppck_ngn_getpimproducthierarchy] gpph
INNER JOIN [StepMirror].[dbo].[stepview_nwppck_ngn_getclassification2assetrefs] gc2a
ON gpph.id=gc2a.id
WHERE gpph.subtype='Level_4' AND gpph.parentId=#ID AND gc2a.assettype='Primary Image'
AND gpph.id = MAX(gpph.id)
GROUP BY gpph.cname, gc2a.assetfilename, gpph.alternateURL

does the order of columns in a SQL select matters?

my question is regarding a left join I've tried to count how many people are tracking a certain project.
(there can be zero followers)
now the only way i can get it to work is by adding
group by idproject
my question is if the is a way to avoid using this and only selecting and implicitly
setting that group option.
SQL:
select `project_view`.`idproject` AS `idproject`,
count(`track`.`iduser`) AS `c`,`name`
from `project_view` left join `track` using(idproject)
I expected it count null as zero but it doesn't appear at all, if i neglect counting then it shows as null where there are no followers.
If you have a WHERE clause to specify a certain project then you don't need a GROUP BY.
SELECT project_view.idproject, COUNT(track.iduser) AS c, name
FROM project_view
LEFT JOIN track USING (idproject)
WHERE idproject = 4
If you want a count for each project then you do need a GROUP BY.
SELECT project_view.idproject, COUNT(track.iduser) AS c, name
FROM project_view
LEFT JOIN track USING (idproject)
GROUP BY idproject
Yes the order of selecting matters. For performance reasons you (typically) want your most limiting select first to narrow your data set. This makes every subsequent query operate on a smaller dataset.