When to Use * in a SQL Query Containing JOINs & Aggregations?

Question
The web_events table contains id, ..., channel, account_id.
The accounts table contains id, ..., sales_rep_id.
The sales_reps table contains id, name.
Given the above tables, write an SQL query to determine the number of times a particular channel was used in the web_events table for each name in sales_reps. Your final table should have three columns: the sales rep's name, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
Answer
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
The COUNT(*) is confusing to me. I don't get how SQL figures out that COUNT(*) is COUNT(w.channel). Can anyone clarify?

I don't get how SQL figures out that COUNT(*) is COUNT(w.channel)
COUNT() is an aggregation function that counts the number of rows that match a condition. In fact, COUNT(<expression>) in general (or COUNT(column) in particular) counts the number of rows where the expression (or column) is not NULL.
In general, the following do exactly the same thing:
COUNT(*)
COUNT(1)
COUNT(<primary key used on inner join>)
In general, I prefer COUNT(*) because that is the SQL standard for this. I can accept COUNT(1) as a recognition that COUNT(*) is just feature bloat. However, I see no reason to use the third version, because it just requires excess typing.
More than that, I find that new users often get confused between these two constructs:
COUNT(w.channel)
COUNT(DISTINCT w.channel)
People learning SQL often think the first really does the second. For this reason, I recommend sticking with the simpler ways of counting rows. Then use COUNT(DISTINCT) when you really want to incur the overhead of counting unique values (COUNT(DISTINCT) is more expensive than COUNT()).
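As a quick illustration of the difference (assuming a hypothetical web_events table of 10 rows, where channel is NULL in 2 rows and the remaining 8 rows are split between 'email' and 'organic'), the three forms below return three different numbers:
SELECT COUNT(*) AS all_rows,                      -- 10: every row, NULLs included
       COUNT(channel) AS non_null_channels,       -- 8: rows where channel IS NOT NULL
       COUNT(DISTINCT channel) AS unique_channels -- 2: distinct non-NULL values
FROM web_events;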

Related

Columns selected neither in GROUP BY clause nor aggregate function?

I have a database with cats, toys, and their relationship cat_toys.
To find the names of the cats with more than 7 toys, I have the following query:
select
cats.name
from
cats
join
cat_toys on cats.id = cat_toys.cat_id
group by
cats.id
having
count(cat_toys.toy_id) > 7
order by
cats.name
Column cats.name neither appears in the GROUP BY nor is used in an aggregate function, yet this query works. In contrast, I cannot select anything from the cat_toys table.
Is this something special with psql?
The error message is trying to tell you: it is a general requirement in SQL that you list in the GROUP BY clause all non-aggregated columns that appear in the SELECT clause.
Postgres, unlike most other databases, is a bit more clever about that, and understands the notion of a functionally-dependent column: since you are grouping by the primary key of the cats table, you are free to add any other column from that table (since they are functionally dependent on the primary key). This is why your existing query works.
Now if you want to bring in values from the cat_toys table, it is different. There are potentially multiple rows in this table for each row in cats, which, as a consequence, are not functionally dependent on cats.id. If you still want one row per cat, you need to make use of an aggregate function.
As an example, this generates a comma-separated list of all toy_ids that relate to each cat:
select c.name, string_agg(ct.toy_id::text, ', ') as toy_ids -- cast needed if toy_id is not already text
from cats c
inner join cat_toys ct on c.id = ct.cat_id
group by c.id
having count(*) > 7
order by c.name
Side notes:
table aliases make the query easier to write and read
for this query, I recommend count(*) instead of count(cat_toys.toy_id); this produces the same result (unless you have null values in cat_toys.toy_id, which seems unlikely here), and incurs less work for the database (since it does not need to check each value in the column against null)
This is your query:
select c.name
from cats c join
cat_toys ct
on c.id = ct.cat_id
group by c.id
having count(ct.toy_id) > 7
order by c.name;
You are asking why it works: you are rightly observing that c.id is in the GROUP BY but not in the SELECT -- and another column is in the SELECT. Seems wrong. But it isn't. Postgres supports a little-known part of the standard, related to functional dependency in aggregation queries.
Let me avoid the technical jargon. cats.id is the primary key of cats. That means the id is unique, so knowing the id determines all other columns from cats. The database knows this -- that is, it knows that the value of name is always the same for a given id. So, by aggregating on the primary key, you can access the other columns without using aggregation functions -- and it is consistent with the standard.
This is explained in the documentation:
When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
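A minimal sketch of the rule in action (a hypothetical table following the question's schema; the functional-dependency exception applies in PostgreSQL 9.1 and later):
CREATE TABLE cats (id integer PRIMARY KEY, name text);
-- Valid: id is the primary key, so name is functionally dependent on it
SELECT c.id, c.name FROM cats c GROUP BY c.id;
-- Rejected: name is not a key, so id is not functionally dependent on it
-- SELECT c.id, c.name FROM cats c GROUP BY c.name;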

How to randomly select a certain percent of rows in Access with another condition?

I have a table with, let's say, boys, girls, number of purchases and price. I want to randomly select 10% of girls with a condition that money they spent should be 30% of a total amount of money spent from both groups. To select 10% of girls I use this code:
SELECT TOP 10 PERCENT * FROM Students
Where StudentType='girl'
ORDER BY rnd(ID)
How should I place an additional condition?
Since you are already selecting a random portion, the question is really just about selection criteria involving a "total". The key here is that you need another query, an aggregate query. The other query can be either another saved query, an embedded subquery, or a call to a function which performs the query.
Using a subquery to get the total:
SELECT TOP 10 PERCENT *
FROM Students
WHERE StudentType='girl'
AND (Students.[Spent] / (SELECT SUM(S2.[Spent]) FROM Students As S2) = 0.30)
ORDER BY rnd(ID)
Make sure to give the table in the subquery a different alias, since Access can get confused if the subquery uses a table with the same name as the main query. The question did not mention the "amount spent" column, so I just guessed. This also assumes that "both groups" is essentially the same as "all student records". If that's not the case, then you could add WHERE S2.StudentType IN ('girl', 'boy') to the subquery.
Using a domain aggregate function:
SELECT TOP 10 PERCENT *
FROM Students
WHERE StudentType='girl'
AND (Students.[Spent] / DSum("[Spent]", "Students") = 0.30)
ORDER BY rnd(ID)
Using another saved query:
First create and save the separate aggregate query as [Summed]:
SELECT SUM(S2.[Spent]) As TotalSpent FROM Students As S2
Now do a cross join so that each row is paired with the total:
SELECT TOP 10 PERCENT *
FROM Students, Summed
WHERE StudentType='girl'
AND (Students.[Spent] / Summed.TotalSpent = 0.30)
ORDER BY rnd(ID)
The efficiency of each solution may vary. For a small table of students it might not matter. If it does become an issue, I have found that the Domain Aggregate functions are not very efficient even though they appear to be simpler to use. More powerful query engines (not Access) are often better at analyzing a query plan and automatically reducing redundant calculations, but with Access you have to plan that out yourself.
Last note: If you have more complicated grouping, any of the solutions will have additional join conditions. For instance, if the aggregate query also had a GROUP BY clause on an ID, then instead of a cross join, you'd now want an INNER JOIN matching the ID in the main table. In the case of the domain aggregate function, you'd want to specify a criteria parameter that refers to a table field value. The point is that the above examples are not a precise template for all cases.
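For instance (the GroupID column here is hypothetical, just to illustrate the shape), the saved aggregate query would gain a GROUP BY:
SELECT S2.GroupID, SUM(S2.[Spent]) As TotalSpent
FROM Students As S2
GROUP BY S2.GroupID
and the main query would join on the ID instead of doing a cross join:
SELECT TOP 10 PERCENT Students.*
FROM Students INNER JOIN Summed ON Students.GroupID = Summed.GroupID
WHERE StudentType='girl'
AND (Students.[Spent] / Summed.TotalSpent = 0.30)
ORDER BY rnd(ID)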

Time-based accumulation based on type: Speed considerations in SQL

Based on surfing the web, I came up with two methods of counting the records in a table "Table1". The counter field increments according to a date field "TheDate": it does this by summing the records with an older (or equal) TheDate value. Furthermore, records with different values for the compound field (Field1, Field2) are counted using separate counters. Field3 is just an informational field that is included for added awareness and does not affect the counting or how records are grouped for counting.
Method 1: Use correlated subquery
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
(
SELECT SUM(1) FROM Table1 InnerQuery
WHERE InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
Method 2: Use join and group-by
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
SUM(1) AS RunningCounter
FROM Table1 MainQuery INNER JOIN Table1 InnerQuery
ON InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
There is no inner query per se in Method 2, but I use the table alias InnerQuery so that a ready parallel with Method 1 can be drawn. The role is the same; the 2nd instance of Table1 is for accumulating the counts of the records whose TheDate is no later than that of a given record in MainQuery (1st instance of Table1) with the same Field1 and Field2 values.
Note that in Method 2, Field3 is included in the GROUP BY clause even though I said that it does not affect how the records are grouped for counting. This is still true, since the counting is done using the matching records in InnerQuery, whereas the GROUP BY applies to Field3 in MainQuery.
I found that Method 1 is noticeably faster. I'm surprised by this because it uses a correlated subquery. The way I think of a correlated subquery is that it is executed for each record in MainQuery (whether or not that is done in practice after optimization). On the other hand, Method 2 doesn't run an inner query over and over again. However, the inner join still has multiple records in InnerQuery matching each record in MainQuery, so in a sense, it deals with a similar order of complexity.
Is there a decent intuitive explanation for this speed difference, as well as best practices or considerations in choosing an approach for time-based accumulation?
I've posted this to Microsoft Answers and Stack Exchange.
In fact, I think the easiest way is to do this:
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
COUNT(*)
FROM Table1 MainQuery
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
(The order by isn't required to get the same data, just to order it. In other words, removing it will not change the number or contents of each row returned, just the order in which they are returned.)
You only need to specify the table once. Doing a self-join (joining a table to itself as both your queries do) is not required. The performance of your two queries will depend on a whole load of things which I don't know - what the primary keys are, the number of rows, how much memory is available, and so on.
First, your experience makes a lot of sense. I'm not sure why you need more intuition. I imagine you learned, somewhere along the way, that correlated subqueries are evil. Well, just as some of the things we teach kids as being really bad ("don't cross the street when the walk sign is not green") turn out to be not so bad, the same is true of correlated subqueries.
The easiest intuition is that the uncorrelated version has to aggregate all the data in the table. The correlated version only has to aggregate the matching rows, although it has to do this over and over.
To put numbers to it, say you have 1,000 rows with 10 rows per group. The output is 100 rows. The first version does 100 aggregations of 10 rows each. The second does one aggregation of 1,000 rows. Well, aggregation generally scales in a super-linear fashion (O(n log n), technically). That means that 100 aggregations of 10 records take less time than 1 aggregation of 1,000 records.
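Putting rough numbers on that (using n * log2(n) as the cost model, which is only a heuristic, not a real cost estimate):
100 aggregations of 10 rows: 100 * (10 * log2(10)) = 100 * 33.2 = ~3,300 work units
1 aggregation of 1,000 rows: 1,000 * log2(1,000) = 1,000 * 10.0 = ~10,000 work units
so under this model the many small aggregations come out roughly three times cheaper.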
You asked for intuition, so the above is to provide some intuition. There are a zillion caveats that go both ways. For instance, the correlated subquery might be able to make better use of indexes for the aggregation. And, the two queries are not equivalent, because the correct join would be LEFT JOIN.
Actually, I was wrong in my original post. The inner join is way, way faster than the correlated subquery. However, the correlated subquery is able to display its result records as they are generated, so it appears faster.
As a side curiosity, I'm finding that if the correlated subquery approach is modified to use SUM(-1) instead of SUM(1), the number of returned records seems to vary from N-3 to N (where N is the correct number, i.e., the number of records in Table1). I'm not sure if this is due to some misbehaviour in Access's rush to display initial records or what-not.
While it seems that the INNER JOIN wins hands down, there is a major insidious caveat. If the GROUP BY fields do not uniquely distinguish each record in Table1, then you will not get an individual SUM for each record of Table1. Imagine that a particular combination of GROUP BY field values matches (say) three records in Table1. You will then get a single SUM for all of them. The problem is that each of these three records in MainQuery also matches all three of the same records in InnerQuery, so those instances in InnerQuery get counted multiple times. Very insidious (I find).
So it seems that the subquery may be the way to go, which is awfully disturbing in view of the above problem with repeatability (2nd paragraph above). That is a serious problem that should send shivers down any spine. Another possible solution that I'm looking at is to turn MainQuery into a subquery by SELECTing the fields of interest and DISTINCTifying them before INNER JOINing the result with InnerQuery, as sketched below.
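A sketch of that last idea (untested; it assumes the same Field1/Field2/Field3/TheDate columns, and that Access accepts a derived table in the FROM clause):
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
SUM(1) AS RunningCounter
FROM (SELECT DISTINCT Field1, Field2, Field3, TheDate FROM Table1) AS MainQuery
INNER JOIN Table1 AS InnerQuery
ON InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate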

Does the order of columns in a SQL SELECT matter?

My question is regarding a LEFT JOIN. I've tried to count how many people are tracking a certain project
(there can be zero followers).
Now the only way I can get it to work is by adding
group by idproject
My question is whether there is a way to avoid using this, and to set that grouping implicitly just by selecting.
SQL:
select `project_view`.`idproject` AS `idproject`,
count(`track`.`iduser`) AS `c`,`name`
from `project_view` left join `track` using(idproject)
I expected it to count NULL as zero, but the row doesn't appear at all; if I leave out the count, the row shows up with NULL where there are no followers.
If you have a WHERE clause to specify a certain project then you don't need a GROUP BY.
SELECT project_view.idproject, COUNT(track.iduser) AS c, name
FROM project_view
LEFT JOIN track USING (idproject)
WHERE idproject = 4
If you want a count for each project then you do need a GROUP BY.
SELECT project_view.idproject, COUNT(track.iduser) AS c, name
FROM project_view
LEFT JOIN track USING (idproject)
GROUP BY idproject
Yes, the order of selecting matters. For performance reasons you (typically) want your most limiting selection first to narrow your data set. This makes every subsequent operation work on a smaller data set.

SQL GROUP BY/COUNT even if no results

I am attempting to get the information from one table (games) and count the entries in another table (tickets) that correspond to each entry in the first. I want each entry in the first table to be returned even if there aren't any entries in the second. My query is as follows:
SELECT g.*, count(*)
FROM games g, tickets t
WHERE (t.game_number = g.game_number
OR NOT EXISTS (SELECT * FROM tickets t2 WHERE t2.game_number=g.game_number))
GROUP BY t.game_number;
What am I doing wrong?
You need to do a left-join:
SELECT g.Game_Number, g.PutColumnsHere, count(t.Game_Number)
FROM games g
LEFT JOIN tickets t ON g.Game_Number = t.Game_Number
GROUP BY g.Game_Number, g.PutColumnsHere
Alternatively, I think this is a little clearer with a correlated subquery:
SELECT g.Game_Number, G.PutColumnsHere,
(SELECT COUNT(*) FROM Tickets T WHERE t.Game_Number = g.Game_Number) Tickets_Count
FROM Games g
Just make sure you check the query plan to confirm that the optimizer interprets this well.
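For example, in MySQL or PostgreSQL you could prefix the query with EXPLAIN to see the plan (the output format differs per engine; this is a sketch, not a template for every database):
EXPLAIN
SELECT g.Game_Number, g.PutColumnsHere,
(SELECT COUNT(*) FROM Tickets T WHERE t.Game_Number = g.Game_Number) Tickets_Count
FROM Games g;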
You need to learn more about how to use joins in SQL:
SELECT g.*, count(*)
FROM games g
LEFT OUTER JOIN tickets t
USING (game_number)
GROUP BY g.game_number;
Note that unlike some database brands, MySQL permits you to list many columns in the select-list even if you only GROUP BY their primary key. As long as the columns in your select-list are functionally dependent on the GROUP BY column, the result is unambiguous.
Other brands of database (Microsoft, Firebird, etc.) give you an error if you list any columns in the select-list without including them in GROUP BY or in an aggregate function.
"FROM games g, tickets t" is the problem line. This performs an inner join. Any where clause can't add on to this. I think you want a LEFT OUTER JOIN.