My schema, query, and problematic results can be seen here:
http://sqlfiddle.com/#!17/55bc3/5/0
I've created a schema for storing posts, comments, and favourites (I've simplified my example for the sake of demonstration). I'm trying to write a query that aggregates the comment/favourite counts for each post, for display on a 'front page'.
To model the relationships between users, posts, and favourites I've used multiple intersection tables. In the query I'm using two LEFT JOINs and then COUNTing distinct columns in the results. I've run into an issue where the COUNT stored as comment_count overrides favourite_count whenever it is above 0, so both columns come back with the same duplicated value.
I think I understand the mechanism behind this: the GROUPing squashes the joined rows together, yielding an incorrect result. I was wondering if anyone could tell me the theory behind this, what it is called, and how you would correctly write queries to handle this scenario.
As they are unrelated tables, the two LEFT JOINs produce every combination of comment row and favourite row per post (a partial Cartesian product, often called join "fan-out"), which inflates both counts. Instead, you can count each table individually and then join.
SELECT p.id
      ,COALESCE(c.comment_count, 0)   AS comment_count
      ,COALESCE(f.favourite_count, 0) AS favourite_count
FROM post p
LEFT JOIN (SELECT post_id, COUNT(*) AS comment_count
           FROM comment GROUP BY post_id) c ON c.post_id = p.id
LEFT JOIN (SELECT post_id, COUNT(*) AS favourite_count
           FROM favourite GROUP BY post_id) f ON f.post_id = p.id
I seem to be missing something. Most articles I read say you should use a join instead of a sub-select, yet a quick experiment of my own shows a big win for the sub-query in execution time.
Trying to get the first names of all people that have made a bid (I presume the tables speak for themselves) results in the following.
This join takes 10 seconds
select U.firstname
from Bid B
inner join [User] U on U.userName = B.[user]
This query with sub-query takes 3 seconds
select firstname
from [User]
where userName in (select [user] from bid)
Why is my experiment not in line with what I keep reading everywhere, or am I missing something?
Experimenting further, I found that the execution times are the same after adding DISTINCT to both queries.
They're not the same thing. In the query with the join you can potentially multiply rows, or have rows removed entirely from the results.
An INNER JOIN removes rows with non-matching keys, and it multiplies rows whenever a matched key repeats in one or both of the joined tables. The INNER JOIN therefore goes through the additional steps of multiplying and removing rows.
The subquery you used is a plain SELECT. Since it has no WHERE filter it is as fast as a simple SELECT, and since it contains no joins the results come back as fast as they can be read.
Some may argue that outer joins return NULLs much like sub-queries do, but outer joins can still multiply rows. Hence, sub-queries and joins are not the same thing.
Of the queries you provided, you want the second one (with the subquery), since it doesn't multiply or remove rows.
Good Read for Subquery vs Inner Join
https://www.essentialsql.com/subquery-versus-inner-join/
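The semantic difference is easy to see with a small sqlite3 sketch (made-up data; table and column names follow the question). A join returns a user's first name once per matching bid, while the IN-subquery returns each matching user at most once:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "User" (userName TEXT PRIMARY KEY, firstname TEXT);
CREATE TABLE Bid (id INTEGER PRIMARY KEY, "user" TEXT);
INSERT INTO "User" VALUES ('alice', 'Alice'), ('bob', 'Bob');
INSERT INTO Bid VALUES (1, 'alice'), (2, 'alice'), (3, 'alice');
""")

# The join multiplies: one output row per bid made by the user.
joined = con.execute("""
SELECT U.firstname
FROM Bid B
INNER JOIN "User" U ON U.userName = B."user"
""").fetchall()
print(joined)   # three ('Alice',) rows -- one per bid

# The membership test returns each matching user once.
filtered = con.execute("""
SELECT firstname FROM "User"
WHERE userName IN (SELECT "user" FROM Bid)
""").fetchall()
print(filtered)  # [('Alice',)]
```

Adding DISTINCT to both queries collapses the joined duplicates, which is consistent with the asker's observation that the timings then converge.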
I appreciate everyone's help and patience as I continue learning through converting a large Excel/vba system to Access.
I have the following query:
SELECT AccountWeeklyBalances.AccountNumber,
AccountWeeklyBalances.AccountBalance,
AccountWeeklyBalances.AccountDate,
AccountMaster.AccountName,
AccountCurrentModel.Model,
ModelDetailAllHistory.Risk
FROM ((AccountWeeklyBalances
INNER JOIN AccountMaster
ON AccountMaster.[AccountNumber] = AccountWeeklyBalances.AccountNumber)
INNER JOIN AccountCurrentModel
ON AccountWeeklyBalances.AccountNumber=AccountCurrentModel.AccountNumber)
INNER JOIN ModelDetailAllHistory
ON AccountCurrentModel.Model=ModelDetailAllHistory.ModelName
WHERE AccountWeeklyBalances.AccountDate=[MatchDate]
;
This works, except I want to GROUP BY the Model. I tried adding
GROUP BY AccountCurrentModel.Model
and
GROUP BY ModelDetailAllHistory.ModelName
after the WHERE clause, but both give me an error:
Tried to execute a query that does not include the specified expression
'AccountNumber' as part of an aggregate function.
I've read several other posts here, but cannot figure out what I've done wrong.
It depends on what you're trying to do. If you just want to sum the AccountBalance by ModelName, then all the other columns would have to be removed from the select statement. If you want the sum of each model for each account, then you would just add the AccountNumber to the GROUP BY, probably before the ModelName.
When aggregating, you can't include anything in the select list that's not either an aggregate function (min, max, sum, etc) or something you are grouping by, because there's no way to represent that in the query results. How could you show the sum of AccountBalance by ModelName, but also include the AccountNumber? The only way to do that would be to group by both AccountNumber and ModelName.
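The rule can be illustrated with a small sqlite3 sketch (made-up table and balances; note SQLite is laxer than Access and won't raise this error itself, so only the valid forms are shown):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Balances (AccountNumber TEXT, Model TEXT, AccountBalance REAL);
INSERT INTO Balances VALUES
  ('A1', 'Growth', 100.0),
  ('A2', 'Growth', 200.0),
  ('A3', 'Income', 50.0);
""")

# Sum by Model alone: AccountNumber cannot appear in the SELECT list,
# because one Model row summarises several accounts.
by_model = con.execute("""
SELECT Model, SUM(AccountBalance) FROM Balances GROUP BY Model
""").fetchall()
print(by_model)

# To keep AccountNumber, it must be part of the GROUP BY as well.
by_account = con.execute("""
SELECT AccountNumber, Model, SUM(AccountBalance)
FROM Balances
GROUP BY AccountNumber, Model
""").fetchall()
print(by_account)
```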
----EDIT----
After discussing in the comments I have a clearer idea of what's going on. There is no aggregation, but there are multiple records in ModelDetailAllHistory for each Model. However, the only value we need from that table is Risk, and that will always be the same per model. So we need to eliminate the duplicate Risk values, which can be done by joining to a subquery instead of joining directly to ModelDetailAllHistory:
INNER JOIN (SELECT DISTINCT ModelName, Risk FROM ModelDetailAllHistory) mh
ON AccountCurrentModel.Model=mh.ModelName
or
INNER JOIN (SELECT ModelName, MAX(Risk) AS Risk FROM ModelDetailAllHistory GROUP BY ModelName) mh
ON AccountCurrentModel.Model=mh.ModelName
Both methods collapse the multiple Risk values into a single value per Model, eliminating the duplicate records. I tend to prefer the first option because if for some reason there were multiple Risk values for a single Model, you'd end up with duplicate records and you'd know there was something wrong. Using max() is basically choosing an arbitrary record from ModelDetailAllHistory that matches the given Model and getting the Risk value from it, since you know all the Risk values for that model should be the same. What I don't like about this method is it will hide data inconsistencies from you (e.g. if for some reason there are some ModelDetailAllHistory records for the same Model that don't have the same Risk value), and while it's nice to know you'll never ever get duplicate records, the underlying problem could end up rearing its ugly head in other unexpected ways.
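The dedup-via-subquery pattern can be sketched with sqlite3 (hypothetical data; names follow the question). The direct join emits one row per history record, while joining to the DISTINCT subquery emits one row per model:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE AccountCurrentModel (AccountNumber TEXT, Model TEXT);
CREATE TABLE ModelDetailAllHistory (ModelName TEXT, Risk TEXT, AsOfDate TEXT);
INSERT INTO AccountCurrentModel VALUES ('A1', 'Growth');
-- Several history rows per model, all carrying the same Risk.
INSERT INTO ModelDetailAllHistory VALUES
  ('Growth', 'High', '2023-01-01'),
  ('Growth', 'High', '2023-02-01'),
  ('Growth', 'High', '2023-03-01');
""")

# Direct join: duplicated output, one row per history record.
direct = con.execute("""
SELECT a.AccountNumber, a.Model, h.Risk
FROM AccountCurrentModel a
INNER JOIN ModelDetailAllHistory h ON a.Model = h.ModelName
""").fetchall()
print(len(direct))  # 3

# Join to the deduplicated subquery: one row per model.
deduped = con.execute("""
SELECT a.AccountNumber, a.Model, mh.Risk
FROM AccountCurrentModel a
INNER JOIN (SELECT DISTINCT ModelName, Risk
            FROM ModelDetailAllHistory) mh ON a.Model = mh.ModelName
""").fetchall()
print(deduped)  # [('A1', 'Growth', 'High')]
```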
I have three tables
PackingLists
ItemsToPackingLists
Items
I would like to have a list of all PackingLists with the Number of items per PackingList and the WeightInGramms for the PackingList.
I wrote the following query, but it gives wrong results. I guess I have to arrange the joins differently.
Any help how to refactor the query is appreciated.
SELECT p.ID,
p.NameOfPackingList,
COUNT(ItemsToP.ItemID) AS NumberOfDifferentItems,
SUM(items.WeightInGrams * ItemsToP.Quantity) AS WeightInGramms
FROM PackingLists AS p
LEFT OUTER JOIN ItemsToPackingLists AS ItemsToP
ON (ItemsToP.PackingListID = p.ID)
LEFT OUTER JOIN Items AS items
ON (ItemsToP.ItemID = items.ID)
GROUP BY p.ID,p.NameOfPackingList
It's not really clear what you want to get, but here are two options to check.
Use COUNT(DISTINCT ItemsToP.ItemID) instead of COUNT(ItemsToP.ItemID): you might be including the same item twice in one packing list (with different quantities), and the column name 'NumberOfDifferentItems' suggests using DISTINCT as well.
However, your question asks for the 'Number of items per PackingList'. To my understanding you should sum the quantities, i.e. SUM(ItemsToP.Quantity), instead of counting the IDs.
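Both suggestions can be checked with a sqlite3 sketch (made-up data; an item appears twice on the same list to expose the difference):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE PackingLists (ID INTEGER PRIMARY KEY, NameOfPackingList TEXT);
CREATE TABLE Items (ID INTEGER PRIMARY KEY, WeightInGrams INTEGER);
CREATE TABLE ItemsToPackingLists (PackingListID INTEGER, ItemID INTEGER, Quantity INTEGER);
INSERT INTO PackingLists VALUES (1, 'Hiking');
INSERT INTO Items VALUES (10, 500), (11, 200);
-- Item 10 is listed twice for the same packing list.
INSERT INTO ItemsToPackingLists VALUES (1, 10, 1), (1, 10, 2), (1, 11, 1);
""")

row = con.execute("""
SELECT p.ID,
       p.NameOfPackingList,
       COUNT(DISTINCT i2p.ItemID)          AS NumberOfDifferentItems,
       SUM(i2p.Quantity)                   AS NumberOfItems,
       SUM(i.WeightInGrams * i2p.Quantity) AS WeightInGramms
FROM PackingLists p
LEFT JOIN ItemsToPackingLists i2p ON i2p.PackingListID = p.ID
LEFT JOIN Items i ON i2p.ItemID = i.ID
GROUP BY p.ID, p.NameOfPackingList
""").fetchone()
print(row)  # (1, 'Hiking', 2, 4, 1700)
```

Here COUNT(DISTINCT ItemID) gives 2 different items, SUM(Quantity) gives 4 items in total, and the weighted sum is 1*500 + 2*500 + 1*200 = 1700 grams.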
I am joining a table of about 70,000 rows with a slightly bigger second table through an INNER JOIN EACH. Now COUNT(a.business_column) and COUNT(*) give different results: the former correctly reports back ~70,000, while the latter gives ~200,000. But this only happens when I select COUNT(*) alone; when I select both together they give the same result (~70,000). How is this possible?
select
count(*)
/*,count(a.business_column)*/
from table_a a
inner join each table_b b
on b.key_column = a.business_column
UPDATE: For a step by step explanation on how this works, see BigQuery flattens when using field with same name as repeated field instead.
To answer the title question: COUNT(*) in BigQuery is always accurate.
The caveat is that in SQL COUNT(*) and COUNT(column) have semantically different meanings - and the sample query can be interpreted in different ways.
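The generic-SQL part of that difference can be sketched with sqlite3 (made-up data): COUNT(*) counts rows, while COUNT(column) skips NULLs in that column.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (business_column TEXT)")
# Three rows, one of which has a NULL in the counted column.
con.executemany("INSERT INTO t VALUES (?)", [('a',), ('b',), (None,)])

star, col = con.execute(
    "SELECT COUNT(*), COUNT(business_column) FROM t"
).fetchone()
print(star, col)  # 3 2
```

(The ~70,000 vs ~200,000 discrepancy in the question is the BigQuery-specific repeated-field behaviour discussed below, not plain NULL-skipping, but the two COUNT forms being non-equivalent is the common thread.)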
See: http://www.xaprb.com/blog/2009/04/08/the-dangerous-subtleties-of-left-join-and-count-in-sql/
There they have this sample query:
select user.userid, count(email.subject)
from user
inner join email on user.userid = email.userid
group by user.userid;
That query turns out to be ambiguous, and the article's author changes it to a more explicit one, adding this comment:
But what if that’s not what the author of the query meant? There’s no
way to really know. There are several possible intended meanings for
the query, and there are several different ways to write the query to
express those meanings more clearly. But the original query is
ambiguous, for a few reasons. And everyone who reads this query
afterwards will end up guessing what the original author meant. “I
think I can safely change this to…”
COUNT(*) counts the most repeated field in your query; if you want to count full records, use COUNT(0).
I've written the query below, but I'm getting multiple duplicate rows in the results. Can anyone see where I'm going wrong?
use Customers
select customer_details.Customer_ID,
customer_details.customer_name,
metering_point_details.MPAN_ID,
Agents.DA_DC_Charge
from Customer_Details
left join Metering_Point_Details
on customer_details.customer_id = Metering_Point_Details.Customer_ID
left join agents
on customer_details.Customer_ID = agents.customer_id
order by customer_id
It doesn't really matter here, but you're not using an INNER JOIN. Regardless, your unexpected rows indicate that your JOIN criteria are not specific enough to return your expected output. You can use SELECT DISTINCT if the rows are fully duplicated. If you'd like to see why you're getting those duplicates, use SELECT * to see the full detail of the multiple rows returned by your JOIN criteria; that should help you either make the criteria more specific or show you that you've got duplicated records in one of the joined tables.
With sample data we can dissect the problem more, but odds are you won't need it once you see why the rows are duplicated.
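A sqlite3 sketch of the likely cause (made-up data; names follow the question): when a customer has several metering points and several agent rows, the two LEFT JOINs multiply out, and SELECT * / SELECT DISTINCT expose it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customer_Details (Customer_ID INTEGER, customer_name TEXT);
CREATE TABLE Metering_Point_Details (Customer_ID INTEGER, MPAN_ID TEXT);
CREATE TABLE Agents (customer_id INTEGER, DA_DC_Charge REAL);
INSERT INTO Customer_Details VALUES (1, 'Acme');
INSERT INTO Metering_Point_Details VALUES (1, 'MPAN-1'), (1, 'MPAN-2');
INSERT INTO Agents VALUES (1, 9.99), (1, 9.99);   -- duplicated agent row
""")

# 2 metering points x 2 agent rows = 4 output rows for one customer.
rows = con.execute("""
SELECT cd.Customer_ID, cd.customer_name, mpd.MPAN_ID, a.DA_DC_Charge
FROM Customer_Details cd
LEFT JOIN Metering_Point_Details mpd ON cd.Customer_ID = mpd.Customer_ID
LEFT JOIN Agents a ON cd.Customer_ID = a.customer_id
""").fetchall()
print(len(rows))  # 4

# DISTINCT collapses the fully duplicated rows: one per MPAN remains.
distinct_rows = con.execute("""
SELECT DISTINCT cd.Customer_ID, cd.customer_name, mpd.MPAN_ID, a.DA_DC_Charge
FROM Customer_Details cd
LEFT JOIN Metering_Point_Details mpd ON cd.Customer_ID = mpd.Customer_ID
LEFT JOIN Agents a ON cd.Customer_ID = a.customer_id
""").fetchall()
print(len(distinct_rows))  # 2
```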