columns selected neither in GROUP BY clause nor in aggregate function? - sql

I have a database with cats, toys and their relationship cat_toys
To find the names of the cats with more than 7 toys, I have the following query:
select
cats.name
from
cats
join
cat_toys on cats.id = cat_toys.cat_id
group by
cats.id
having
count(cat_toys.toy_id) > 7
order by
cats.name
Column cats.name does not appear in the GROUP BY clause and is not used in an aggregate function, yet this query works. In contrast, I cannot select anything from the cat_toys table.
Is this something special with psql?

That is what the error message is trying to tell you. It is a general requirement in SQL that you list in the group by clause all non-aggregated columns that belong to the select clause.
Postgres, unlike most other databases, is a bit more clever about that, and understands the notion of a functionally-dependent column: since you are grouping by the primary key of the cats table, you are free to add any other column from that table (since they are functionally dependent on the primary key). This is why your existing query works.
Now if you want to bring in values from the cat_toys table, it is different. There are potentially multiple rows in this table for each row in cats, so its columns are not functionally dependent on cats.id. If you still want one row per cat, you need to use an aggregate function.
As an example, this generates a comma-separated list of all toy_ids that relate to each cat:
select c.name, string_agg(ct.toy_id::text, ', ') as toy_ids
from cats c
inner join cat_toys ct on c.id = ct.cat_id
group by c.id
having count(*) > 7
order by c.name
Side notes:
table aliases make the query easier to write and read
for this query, I recommend count(*) instead of count(cat_toys.toy_id); this produces the same result (unless you have null values in cat_toys.toy_id, which seems unlikely here), and incurs less work for the database (since it does not need to check each value in the column against null)
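The NULL sensitivity of count(column) is easy to demonstrate; a minimal sketch with a made-up single-column table:

```sql
-- Hypothetical table just to illustrate count(*) vs count(column)
create table t (col int);
insert into t values (1), (2), (null);

select count(*)   from t;  -- 3: counts all rows
select count(col) from t;  -- 2: NULL values are skipped
```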

This is your query:
select c.name
from cats c join
cat_toys ct
on c.id = ct.cat_id
group by c.id
having count(ct.toy_id) > 7
order by c.name;
You are asking why it works: you are rightly observing that c.id is in the group by but not in the select -- and another column is in the select. Seems wrong. But it isn't. Postgres supports a little-known part of the standard, related to functional dependency in aggregation queries.
Let me avoid the technical jargon. cats.id is the primary key of cats. That means the id is unique, so knowing the id determines all other columns from cats. The database knows this -- that is, it knows that the value of name is always the same for a given id. So, by aggregating on the primary key, you can access the other columns without using aggregate functions -- and it is consistent with the standard.
This is explained in the documentation:
When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
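To see both sides of that rule against the question's schema (assuming cats.id is the primary key of cats):

```sql
-- Works: name is functionally dependent on the grouped primary key cats.id
select cats.name
from cats
join cat_toys on cats.id = cat_toys.cat_id
group by cats.id;

-- Fails: toy_id is not functionally dependent on cats.id, so Postgres raises
-- ERROR: column "cat_toys.toy_id" must appear in the GROUP BY clause
--        or be used in an aggregate function
select cats.name, cat_toys.toy_id
from cats
join cat_toys on cats.id = cat_toys.cat_id
group by cats.id;
```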


Why group by is forced on specific column but not on the others

I have the following query:
(sample schema here: https://www.db-fiddle.com/f/5e9gVC6oRidjYwigPWRKm/3)
SELECT
t.customer_id
, t.ticket_id
, c.combination_id
, c.possible_prize + c.confirmed_prize + coalesce(cb.bonus_amount,0) AS won_amount
, round(mul(o.value)::numeric,3) AS odds
FROM tickets t
JOIN combinations c ON c.ticket_id = t.ticket_id
LEFT JOIN combination_bonus cb ON cb.combination_id = c.combination_id
JOIN outcomes o ON o.ticket_id = t.ticket_id AND o.outcome_id = ANY(c.outcomes)
GROUP BY 1,2,3, cb.bonus_amount
ORDER BY 1
Without + coalesce(cb.bonus_amount,0) it runs fine. Why does only this column need to be grouped, and not the other two in this expression?
Also, if I put this expression into sum(), the results are totally wrong, as the value gets multiplied several times, and I don't get why or how.
Would much appreciate explanation on both cases.
Without + coalesce(cb.bonus_amount,0) it runs fine. Why does only this column need to be grouped, and not the other two in this expression?
When it comes to columns c.possible_prize and c.confirmed_prize: you don't need these in the group by clause because that clause already contains c.combination_id (which is hidden behind positional parameter 3), and that column is the primary key of table combinations.
Postgres is one of the rare databases (if not the only one) that properly implements the concept of functionally-dependent columns: once you put the primary key of a table in a group by clause, you don't need to add other columns of the same table; the primary key uniquely identifies a row.
On the other hand, you don't have the primary key of table combination_bonus in the group by clause. You could argue that you are bringing in table combination_bonus with a join condition on precisely that column, which somehow guarantees uniqueness:
LEFT JOIN combination_bonus cb ON cb.combination_id = c.combination_id
Well, Postgres is probably not that smart. Your query should work just fine if you put it in there:
GROUP BY 1,2,3, cb.combination_id
This is a bit long for a comment.
SQL allows -- and Postgres supports -- using group by on a unique or primary key and then selecting other columns without using aggregations. This is called functional dependency (the other columns are functionally dependent on the unique/primary key).
If your first query works, then it is using this functionality in Postgres -- based on combinations.combination_id being the primary key (or at least unique). However, combination_bonus has no key in the group by. And even if combination_bonus.combination_id is the primary key, Postgres may not be smart enough to use this information for functional dependence.
So, just include the entire expression coalesce(cb.bonus_amount, 0) in the group by. Or use an aggregation function.
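As a sketch of that aggregation alternative (this assumes there is at most one combination_bonus row per combination, so max() simply picks up that single value, or yields NULL when there is none):

```sql
SELECT
    t.customer_id
  , t.ticket_id
  , c.combination_id
  , c.possible_prize + c.confirmed_prize
      + coalesce(max(cb.bonus_amount), 0) AS won_amount
  , round(mul(o.value)::numeric, 3) AS odds
FROM tickets t
JOIN combinations c ON c.ticket_id = t.ticket_id
LEFT JOIN combination_bonus cb ON cb.combination_id = c.combination_id
JOIN outcomes o ON o.ticket_id = t.ticket_id AND o.outcome_id = ANY(c.outcomes)
GROUP BY 1, 2, 3
ORDER BY 1;
```

Here bonus_amount no longer needs to appear in the group by at all, because it only occurs inside the aggregate.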

When to Use * in SQL Query Containing JOINs & Aggregations?

Question
The web_events table contains id, ..., channel, account_id
The accounts table contains id, ..., sales_rep_id
The sales_reps table contains id, name
Given the above tables, write an SQL query to determine the number of times a particular channel was used in the web_events table for each name in sales_reps. Your final table should have three columns - the name of the sales_reps, the channel, and the number of occurrences. Order your table with the highest number of occurrences first.
Answer
SELECT s.name, w.channel, COUNT(*) num_events
FROM accounts a
JOIN web_events w
ON a.id = w.account_id
JOIN sales_reps s
ON s.id = a.sales_rep_id
GROUP BY s.name, w.channel
ORDER BY num_events DESC;
The COUNT(*) is confusing to me. I don't get how SQL figures out that COUNT(*) is COUNT(w.channel). Can anyone clarify?
I don't get how SQL figures out that COUNT(*) is COUNT(w.channel)
COUNT() is an aggregation function that counts the number of rows that match a condition. In fact, COUNT(<expression>) in general (or COUNT(column) in particular) counts the number of rows where the expression (or column) is not NULL.
In general, the following do exactly the same thing:
COUNT(*)
COUNT(1)
COUNT(<primary key used on inner join>)
In general, I prefer COUNT(*) because that is the SQL standard for this. I can accept COUNT(1) as a recognition that COUNT(*) is just feature bloat. However, I see no reason to use the third version, because it just requires excess typing.
More than that, I find that new users often get confused between these two constructs:
COUNT(w.channel)
COUNT(DISTINCT w.channel)
People learning SQL often think the first really does the second. For this reason, I recommend sticking with the simpler ways of counting rows. Then use COUNT(DISTINCT) when you really want to incur the overhead to count unique values (COUNT(DISTINCT) is more expensive than COUNT()).
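The distinction is easy to check; a minimal sketch with made-up data:

```sql
-- Hypothetical table: four rows, one NULL, two equal values
create table w (channel text);
insert into w values ('direct'), ('direct'), ('organic'), (null);

select count(*)                from w;  -- 4: all rows
select count(channel)          from w;  -- 3: NULL is ignored
select count(distinct channel) from w;  -- 2: distinct non-NULL values
```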

Semi-join vs Subqueries

What is the difference between semi-joins and subqueries? I am currently taking a course on this on DataCamp and I'm having a hard time making a distinction between the two.
Thanks in advance.
A join or a semi-join is required whenever you want to combine records from two or more entities based on some common conditional attributes.
A subquery, by contrast, is used whenever you want to have a lookup or a reference on the same table or on other tables.
In short, when your requirement is to add reference columns from other tables to the existing table's attributes, go for a join; when you want a lookup on records from the same table or other tables while keeping only the existing columns in the output, go for a subquery.
Also, a semi-join is usually expressed as a subquery: most of the time we don't actually join the right table, but instead maintain a check via a subquery that limits the records of the existing table. That makes it a semi-join, but the semi-join itself is not a subquery.
I don't really think of a subquery and a semi-join as anything similar. A subquery is nothing more interesting than a query that is used inside another query:
select *           -- this is often called the "outer" query
from (
    select columnA -- this is the subquery inside the parentheses
    from mytable
    where columnB = 'Y'
) t                -- most databases require an alias for the derived table
A semi-join is a concept based on join. Of course, joining tables will combine both tables and return the combined rows based on the join criteria. From there you select the columns you want from either table based on further where criteria (and of course whatever else you want to do). The concept of a semi-join is when you want to return rows from the first table only, but you need the 2nd table to decide which rows to return. Example: you want to return the people in a class:
select p.FirstName, p.LastName, p.DOB
from people p
inner join classes c on c.pID = p.pID
where c.ClassName = 'SQL 101'
group by p.pID
This accomplishes the concept of a semi-join. We are only returning columns from the first table (people). The use of the group by is necessary for the concept of a semi-join because a true join can return duplicate rows from the first table (depending on the join criteria). The above example is not often referred to as a semi-join, and is not the most typical way to accomplish it. The following query is a more common method of accomplishing a semi-join:
select FirstName, LastName, DOB
from people
where pID in (select pID
from classes
where ClassName = 'SQL 101'
)
There is no formal join here. But we're using the 2nd table to determine which rows from the first table to return. It's a lot like saying if we did join the 2nd table to the first table, what rows from the first table would match?
For performance, exists is typically preferred:
select FirstName, LastName, DOB
from people p
where exists (select pID
from classes c
where c.pID = p.pID
and c.ClassName = 'SQL 101'
)
In my opinion, this is the most direct way to understand the semi-join. There is still no formal join, but you can see the idea of a join hinted at by the usage of directly matching the first table's pID column to the 2nd table's pID column.
Final note. The last 2 queries above each use a subquery to accomplish the concept of a semi-join.

Why does FULL JOIN order make a difference in these queries?

I'm using PostgreSQL. Everything I read here suggests that in a query using nothing but full joins on a single column, the order of tables joined basically doesn't matter.
My intuition says this should also go for multiple columns, so long as every common column is listed in the query where possible (that is, wherever both joined tables have the column in common). But this is not the case, and I'm trying to figure out why.
Simplified to three tables a, b, and c.
Columns in table a: id, name_a
Columns in table b: id, id_x
Columns in table c: id, id_x
This query:
SELECT *
FROM a
FULL JOIN b USING(id)
FULL JOIN c USING(id, id_x);
returns a different number of rows than this one:
SELECT *
FROM a
FULL JOIN c USING(id)
FULL JOIN b USING(id, id_x);
What I want/expect is hard to articulate, but basically I'd like a "complete" full merger. I want no null fields anywhere unless that is unavoidable.
For example, whenever there is a non-null id, I want the corresponding name column to always have the name_a value and not be null. Instead, one of those example queries returns semi-redundant results, with one row having a name_a but no id, and another having an id but no name_a, rather than a single merged row.
When the joins are listed in the other order, I do get that desired result (but I'm not sure what other problems might occur, because future data is unknown).
Your queries are different.
In the first, you are doing a full join to b using a single column, id.
In the second, you are doing a full join to b using two columns.
Although the two queries could return the same results under some circumstances, there is no reason to think that the results would be comparable.
Argument order matters in OUTER JOINs, except that FULL NATURAL JOIN is symmetric. They return what an INNER JOIN (ON, USING or NATURAL) does but also the unmatched rows from the left (LEFT JOIN), right (RIGHT JOIN) or both (FULL JOIN) tables extended by NULLs.
USING returns the single shared value for each specified column in INNER JOIN rows; in NULL-extended rows another common column can have NULL in one table's version and a value in the other's.
Join order matters too. Even FULL NATURAL JOIN is not associative, since with multiple tables each pair of tables (either operand being an original table or a join result) can have a unique set of common columns, i.e. in general (A ⟗ B) ⟗ C ≠ A ⟗ (B ⟗ C).
There are a lot of special cases where certain additional identities hold. Eg FULL JOIN USING all common column names and OUTER JOIN ON equality of same-named columns are symmetric. Some cases involve CKs (candidate keys), FKs (foreign keys) and other constraints on arguments.
Your question doesn't make clear exactly what input conditions you are assuming or what output conditions you are seeking.
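That said, if the goal is the most complete merge of these particular three tables, one hedged option (whether it is right depends on data you haven't shown) is to join the two tables that share both columns first, so the (id, id_x) pairs are merged before a is brought in:

```sql
-- Sketch: merge b and c on both common columns first, then attach a by id
SELECT *
FROM b
FULL JOIN c USING (id, id_x)
FULL JOIN a USING (id);
```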

SELECT fields from one table with aggregates from related table

Here is a simplified description of 2 tables:
CREATE TABLE jobs(id PRIMARY KEY, description);
CREATE TABLE dates(id PRIMARY KEY, job REFERENCES jobs(id), date);
There may be one or more dates per job.
I would like create a query which generates the following (in pidgin):
jobs.id, jobs.description, min(dates.date) as start, max(dates.date) as finish
I have tried something like this:
SELECT id, description,
(SELECT min(date) as start FROM dates d WHERE d.job=j.id),
(SELECT max(date) as finish FROM dates d WHERE d.job=j.id)
FROM jobs j;
which works, but looks very inefficient.
I have tried an INNER JOIN, but can’t see how to join jobs with a suitable aggregate query on dates.
Can anybody suggest a clean efficient way to do this?
While retrieving all rows: aggregate first, join later:
SELECT id, j.description, d.start, d.finish
FROM jobs j
LEFT JOIN (
SELECT job AS id, min(date) AS start, max(date) AS finish
FROM dates
GROUP BY job
) d USING (id);
About JOIN .. USING
It's not a "different type of join". USING (col) is a standard SQL (!) syntax shortcut for ON a.col = b.col. More precisely, quoting the manual:
The USING clause is a shorthand that allows you to take advantage of
the specific situation where both sides of the join use the same name
for the joining column(s). It takes a comma-separated list of the
shared column names and forms a join condition that includes an
equality comparison for each one. For example, joining T1 and T2 with
USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns:
there is no need to print both of the matched columns, since they must
have equal values. While JOIN ON produces all columns from T1 followed
by all columns from T2, JOIN USING produces one output column for each
of the listed column pairs (in the listed order), followed by any
remaining columns from T1, followed by any remaining columns from T2.
It's particularly convenient that you can write SELECT * FROM ... and joining columns are only listed once.
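A minimal sketch of the column-suppression behavior the manual describes (table names are made up):

```sql
create table t1 (a int, b int, x text);
create table t2 (a int, b int, y text);

-- JOIN ON: all columns from both sides, so a and b each appear twice
select * from t1 join t2 on t1.a = t2.a and t1.b = t2.b;
-- output columns: a, b, x, a, b, y

-- JOIN USING: one merged column per listed pair, listed first
select * from t1 join t2 using (a, b);
-- output columns: a, b, x, y
```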
In addition to Erwin's solution, you can also use a window clause:
SELECT DISTINCT j.id, j.description,
first_value(d.date) OVER w AS start,
last_value(d.date) OVER w AS finish
FROM jobs j
JOIN dates d ON d.job = j.id
WINDOW w AS (PARTITION BY j.id ORDER BY d.date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
Window functions effectively group by one or more columns (the PARTITION BY clause) and/or ORDER BY some other columns and then you can apply some window function to it, or even a regular aggregate function, without affecting grouping or ordering of any other columns (description in your case). It requires a somewhat different way of constructing queries, but once you get the idea it is pretty brilliant.
In your case you need the first value of a partition, which is easy, because it is accessible by default. You also need to look beyond the window frame (which by default ends at the current row) to the last value in the partition, and for that you need the ROWS clause. Since you produce two columns using the same window definition, the WINDOW clause is used here; when it applies to a single column, you can just write the window function in the select list followed by OVER and the window definition in parentheses, without naming it in a separate WINDOW clause.
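The frame subtlety in that last point is worth a sketch against the same jobs/dates schema; with the default frame, last_value() only sees rows up to the current one:

```sql
select d.job,
       d.date,
       -- default frame ends at the current row, so this is just d.date itself
       last_value(d.date) over (partition by d.job order by d.date) AS not_the_finish,
       -- explicit full frame: last_value() sees the whole partition
       last_value(d.date) over (partition by d.job order by d.date
           rows between unbounded preceding and unbounded following) AS finish
from dates d;
```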