How to use postgres group by statement when joining tables - sql

I have a very simple query that I want to execute in Postgres.
table1 has a one-to-many relation to table2 and table3.
The pseudo query is as follows:
select * from table1
left join table2 ON table2.table1_id = table1.id
left join table3 ON table3.table1_id = table1.id
group by table1.id
This gives me an error:
"column "table2.id" must appear in the GROUP BY clause or be used in an aggregate function",
same for table3.id
What is the point of GROUP BY if it forces me to add the ids of all the tables to the GROUP BY clause, thus defeating its purpose (all ids are unique, so no grouping occurs)?

The purpose of the group by is to summarize data. There is one row in the result set for every combination of keys in the group by.
The columns in the result set are either keys in the group by or are aggregations. There is one exception to this rule, involving grouping by unique or primary keys on a table and using other columns.
The use of select * with group by is simply not a correct use of aggregation in SQL.
You seem to be misunderstanding the purpose of this construct. It is possible that you really mean ORDER BY: that will order the result set by the ORDER BY keys without changing (i.e. summarizing) the number of rows.
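As a rough illustration of "keys plus aggregations", the sketch below groups by table1.id and summarizes the joined rows with COUNT instead of selecting *. It runs the SQL through Python's sqlite3 module as a stand-in for Postgres, and the sample rows are invented for the example. COUNT(DISTINCT ...) is used because joining two one-to-many tables multiplies the rows for each table1.id.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, table1_id INTEGER);
    CREATE TABLE table3 (id INTEGER PRIMARY KEY, table1_id INTEGER);
    INSERT INTO table1 (id) VALUES (1), (2);
    INSERT INTO table2 (id, table1_id) VALUES (10, 1), (11, 1), (12, 2);
    INSERT INTO table3 (id, table1_id) VALUES (20, 1);
""")

# Every selected column is either the grouping key or an aggregate.
summary = conn.execute("""
    SELECT table1.id,
           COUNT(DISTINCT table2.id) AS table2_rows,
           COUNT(DISTINCT table3.id) AS table3_rows
    FROM table1
    LEFT JOIN table2 ON table2.table1_id = table1.id
    LEFT JOIN table3 ON table3.table1_id = table1.id
    GROUP BY table1.id
    ORDER BY table1.id
""").fetchall()

print(summary)
```

One row comes back per table1.id, with the child rows reduced to counts.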

Related

SQL do i use Subquery or what?

OK, so I have 3 tables:
I need to return which cars have been fixed more than 150 times.
Thanks in advance :)
The query would look something like this:
SELECT T1.Car, COUNT(t3.*)
FROM
Table1 T1
JOIN Table2 T2 ON T1.Id = T2.table2ID
JOIN Table3 T3 on T3.Id = T2.table3Id
GROUP BY T1.Car
Order by T1.Car
Yes, you can also do a subquery: you would select from table 1 and, instead of the count, use a subquery with table 2 and table 3 joined back to table 1.
But you can use joins. I think they will be more efficient here.
First of all, you are using a relational database. Secondly, you happen to have 2 dimension tables and 1 FACT table.
The dimension tables make searching the FACT table easier, though this only is valid if you need a characteristic from those DIMENSION tables that you cannot get in the FACT table (such as [type] of fixes).
Since you want the raw results of Cars and their number of repairs, use a GROUP BY with a HAVING Clause in your query. Remember that the HAVING clause is still a PREDICATE, so use proper SARGS.
SELECT CAR_ID, COUNT(*) --or COUNT(CAR_ID), it really does not matter
FROM FACT_TABLE
GROUP BY CAR_ID
HAVING COUNT(FIX_ID) > 150
The GROUP BY smashes the table by CAR_ID and counts the combined rows in the COUNT function, while the HAVING, being a predicate, filters the results of the aggregate functions.
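A minimal sketch of the GROUP BY plus HAVING pattern, run through Python's sqlite3 module. The fact_table layout and its rows are assumptions made for the example, and the threshold is lowered from 150 to 2 so that a handful of rows is enough to show the filtering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table: one row per repair.
conn.execute("CREATE TABLE fact_table (fix_id INTEGER PRIMARY KEY, car_id INTEGER)")

# Car 1 has 4 repairs, car 2 has 2, car 3 has 3.
repairs = [(1,)] * 4 + [(2,)] * 2 + [(3,)] * 3
conn.executemany("INSERT INTO fact_table (car_id) VALUES (?)", repairs)

threshold = 2  # stands in for 150 in the original question
frequent = conn.execute("""
    SELECT car_id, COUNT(*) AS fixes
    FROM fact_table
    GROUP BY car_id
    HAVING COUNT(*) > ?
    ORDER BY car_id
""", (threshold,)).fetchall()

print(frequent)
```

Car 2 is dropped because its group count does not exceed the threshold; the other groups survive with their counts.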
Nope, just use two inner joins, then group by car and count the number of rows.

Update a column of a table with a column of another table in PostgreSQL

I want to copy all the values from one column val1 of a table table1 to one column val2 of another table table2. I tried this command in PostgreSQL:
update table2
set val2 = (select val1 from table1)
But I got this error:
ERROR: more than one row returned by a subquery used as an expression
Is there an alternative to do that?
Your UPDATE query should look like this:
UPDATE table2 t2
SET val2 = t1.val1
FROM table1 t1
WHERE t2.table2_id = t1.table2_id
AND t2.val2 IS DISTINCT FROM t1.val1; -- optional, see below
The way you had it, there was no link between individual rows of the two tables. Every row would be fetched from table1 for every row in table2. This made no sense (in an expensive way) and also triggered the error, because a subquery expression in this place is only allowed to return a single value.
I fixed this by joining the two tables on table2_id. Replace that with your actual join condition.
I rewrote the UPDATE to join in table1 (with the FROM clause) instead of running correlated subqueries, because that is typically faster.
It also prevents table2.val2 from being set to NULL where no matching row is found in table1. Instead, nothing happens to such rows with this form of the query.
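The nullification mentioned above can be seen in a small experiment. This is only a sketch: it uses Python's sqlite3 module instead of PostgreSQL, the sample rows are invented, and it sticks to the correlated-subquery form (with and without a WHERE EXISTS guard, which is an illustration added here) because older SQLite builds do not accept UPDATE ... FROM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (table2_id INTEGER, val1 TEXT);
    CREATE TABLE table2 (table2_id INTEGER PRIMARY KEY, val2 TEXT);
    INSERT INTO table1 VALUES (1, 'new');
    INSERT INTO table2 VALUES (1, 'old'), (2, 'keep');
""")

def snapshot():
    return conn.execute(
        "SELECT table2_id, val2 FROM table2 ORDER BY table2_id"
    ).fetchall()

# Unguarded correlated subquery: row 2 has no match in table1,
# so the subquery yields NULL and val2 is overwritten with NULL.
conn.execute("""
    UPDATE table2
    SET val2 = (SELECT val1 FROM table1
                WHERE table1.table2_id = table2.table2_id)
""")
unguarded = snapshot()

# Restore the original values, then guard the update with EXISTS
# so that rows without a match are left untouched.
conn.execute("UPDATE table2 SET val2 = 'old' WHERE table2_id = 1")
conn.execute("UPDATE table2 SET val2 = 'keep' WHERE table2_id = 2")
conn.execute("""
    UPDATE table2
    SET val2 = (SELECT val1 FROM table1
                WHERE table1.table2_id = table2.table2_id)
    WHERE EXISTS (SELECT 1 FROM table1
                  WHERE table1.table2_id = table2.table2_id)
""")
guarded = snapshot()

print(unguarded)
print(guarded)
```

The guarded form behaves like the joined UPDATE described above: only rows with a partner in table1 are changed.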
You can add table expressions to the FROM list like you would in a plain SELECT (tables, subqueries, set-returning functions, ...). The manual:
from_item
A table expression allowing columns from other tables to appear in the WHERE condition and update expressions. This uses the same syntax as the FROM clause of a SELECT statement; for example, an alias for the table name can be specified. Do not repeat the target table as a from_item unless you intend a self-join (in which case it must appear with an alias in the from_item).
The final WHERE clause prevents updates that wouldn't change anything - at almost full cost but no gain (exotic exceptions apply). If both old and new value are guaranteed to be NOT NULL, simplify to:
AND t2.val2 <> t1.val1
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
UPDATE table1 SET table1_column = table2.column FROM table2 WHERE table1_id = table2.id
Do not use an alias name for table1.
The tables are table1 and table2.

Problem with sql query

I'm using MySQL and I'm trying to construct a query to do the following:
I have:
Table1 [ID,...]
Table2 [ID, tID, start_date, end_date,...]
What I want from my query is:
Select all entries from Table2 where Table1.ID=Table2.tID
**where at least one** end_date<today.
The way I have it working right now is that if Table 2 contains (for example) 5 entries but only 1 of them is end_date< today then that's the only entry that will be returned, whereas I would like to have the other (expired) ones returned as well. I have the actual query and all the joins working well, I just can't figure out the ** part of it.
Any help would be great!
Thank you!
SELECT * FROM Table2
WHERE tID IN
(SELECT Table2.tID FROM Table1
INNER JOIN Table2 ON Table1.ID = Table2.tID
WHERE Table2.end_date < NOW()
)
The subquery will select all tId's that match your where clause. The main query will use this subquery to filter the entries in table 2.
Note: the use of inner join will filter all rows from table 1 with no matching entry in table 2. This is no problem; these entries wouldn't have matched the where clause anyway.
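Here is a small sketch of that IN-subquery approach, run with Python's sqlite3 module rather than MySQL. The rows are invented, dates are stored as ISO text, and a fixed `today` value is passed as a parameter in place of NOW() so the result does not depend on when the code runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, tID INTEGER, end_date TEXT);
    INSERT INTO Table1 (ID) VALUES (1), (2);
    INSERT INTO Table2 (ID, tID, end_date) VALUES
        (1, 1, '2020-01-01'),   -- expired
        (2, 1, '2030-01-01'),   -- not expired, same tID as row 1
        (3, 2, '2030-06-01');   -- not expired, tID has no expired entry
""")

today = "2025-01-01"  # fixed stand-in for NOW() (assumption)

rows = conn.execute("""
    SELECT ID, tID, end_date FROM Table2
    WHERE tID IN (SELECT Table2.tID FROM Table1
                  INNER JOIN Table2 ON Table1.ID = Table2.tID
                  WHERE Table2.end_date < ?)
    ORDER BY ID
""", (today,)).fetchall()

print(rows)
```

Both entries for tID 1 come back, including the one that has not expired, while tID 2 is excluded entirely.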
Maybe, just maybe, you could create a subquery to join with your actual tables, and in this subquery use a COUNT() that you can refer to later in your WHERE clause.

Difference between "and" and "where" in joins

What's the difference between
SELECT DISTINCT field1
FROM table1 cd
JOIN table2
ON cd.Company = table2.Name
and table2.Id IN (2728)
and
SELECT DISTINCT field1
FROM table1 cd
JOIN table2
ON cd.Company = table2.Name
where table2.Id IN (2728)
Both return the same result and both have the same EXPLAIN output.
Firstly there is a semantic difference. When you have a join, you are saying that the relationship between the two tables is defined by that condition. So in your first example you are saying that the tables are related by cd.Company = table2.Name AND table2.Id IN (2728). When you use the WHERE clause, you are saying that the relationship is defined by cd.Company = table2.Name and that you only want the rows where the condition table2.Id IN (2728) applies. Even though these give the same answer, it means very different things to a programmer reading your code.
In this case, the WHERE clause is almost certainly what you mean so you should use it.
Secondly, there is actually a difference in the result if you use a LEFT JOIN instead of an INNER JOIN. If you include the second condition as part of the join, you will still get a result row if the condition fails: you will get values from the left table and NULLs for the right table. If you include the condition as part of the WHERE clause and that condition fails, you won't get the row at all.
Here is an example to demonstrate this.
Query 1 (WHERE):
SELECT DISTINCT field1
FROM table1 cd
LEFT JOIN table2
ON cd.Company = table2.Name
WHERE table2.Id IN (2728);
Result:
field1
200
Query 2 (AND):
SELECT DISTINCT field1
FROM table1 cd
LEFT JOIN table2
ON cd.Company = table2.Name
AND table2.Id IN (2728);
Result:
field1
100
200
Test data used:
CREATE TABLE table1 (Company NVARCHAR(100) NOT NULL, Field1 INT NOT NULL);
INSERT INTO table1 (Company, Field1) VALUES
('FooSoft', 100),
('BarSoft', 200);
CREATE TABLE table2 (Id INT NOT NULL, Name NVARCHAR(100) NOT NULL);
INSERT INTO table2 (Id, Name) VALUES
(2727, 'FooSoft'),
(2728, 'BarSoft');
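The same experiment can be reproduced with Python's sqlite3 module. This is a sketch under the assumption that SQLite behaves like the original database here; NVARCHAR and INT are swapped for TEXT and INTEGER, and an ORDER BY is added so the output order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Company TEXT NOT NULL, Field1 INTEGER NOT NULL);
    INSERT INTO table1 (Company, Field1) VALUES
        ('FooSoft', 100),
        ('BarSoft', 200);
    CREATE TABLE table2 (Id INTEGER NOT NULL, Name TEXT NOT NULL);
    INSERT INTO table2 (Id, Name) VALUES
        (2727, 'FooSoft'),
        (2728, 'BarSoft');
""")

# The two queries differ only in the keyword before the Id condition.
template = """
    SELECT DISTINCT Field1
    FROM table1 cd
    LEFT JOIN table2 ON cd.Company = table2.Name
    {keyword} table2.Id IN (2728)
    ORDER BY Field1
"""

with_where = [row[0] for row in conn.execute(template.format(keyword="WHERE"))]
with_and = [row[0] for row in conn.execute(template.format(keyword="AND"))]

print(with_where)
print(with_and)
```

With AND, the FooSoft row fails the join condition but is still kept with NULLs on the right side; with WHERE, that NULL-extended row is filtered out afterwards.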
SQL comes from relational algebra.
One way to look at the difference is that JOINs are operations on sets that can produce more or fewer records in the result than you had in the original tables. WHERE, on the other hand, will always restrict the number of results.
The rest of the text is extra explanation.
For an overview of join types, see the article again.
When I said that the WHERE condition will always restrict the results, you have to take into account that queries on two (or more) tables must somehow pair records from those tables, even if there is no JOIN keyword.
So in SQL, if the tables are simply separated by a comma, you are actually using a CROSS JOIN (Cartesian product), which returns every row from one table for each row in the other.
And since this is the maximum number of combinations of rows from two tables, the result of any WHERE on cross-joined tables can be expressed as a JOIN operation.
But hold on, there are exceptions to this maximum when you introduce LEFT, RIGHT and FULL OUTER joins.
A LEFT JOIN joins records from the left table with records from the right table on a given criterion, BUT if the join criterion is not satisfied by any record in the right table for a given row of the left table, the LEFT JOIN still returns that row from the left table, with NULLs in the columns that would come from the right table (RIGHT JOIN works similarly but from the other side; FULL OUTER works like both at the same time).
Since the default cross join does NOT return those records, you cannot express these join criteria with a WHERE condition and are forced to use JOIN syntax (Oracle was an exception to this, with an extension to the SQL standard and to the = operator, but this was not accepted by other vendors nor by the standard).
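The points above can be checked on a toy schema. The sketch below uses Python's sqlite3 module; the tables `a` and `b` and their rows are hypothetical, chosen so that one row of `a` has no partner in `b`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (a_id INTEGER, note TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (1, 'first'), (1, 'second');
""")

# Comma-separated tables form a Cartesian product: 2 rows x 2 rows.
cross_count = conn.execute("SELECT COUNT(*) FROM a, b").fetchone()[0]

# A WHERE on the cross join and an explicit inner JOIN pair the same rows.
comma_where = conn.execute(
    "SELECT a.id, b.note FROM a, b WHERE a.id = b.a_id ORDER BY b.note"
).fetchall()
inner_join = conn.execute(
    "SELECT a.id, b.note FROM a JOIN b ON a.id = b.a_id ORDER BY b.note"
).fetchall()

# The LEFT JOIN additionally keeps a.id = 2, padded with NULL.
left_join = conn.execute(
    "SELECT a.id, b.note FROM a LEFT JOIN b ON a.id = b.a_id "
    "ORDER BY a.id, b.note"
).fetchall()

print(cross_count, comma_where, inner_join, left_join)
```

The row (2, None) exists in no subset of the cross product, which is why filtering a comma join with WHERE cannot reproduce the LEFT JOIN result.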
Also, joins usually, but not always, coincide with existing referential integrity and suggest relationships between entities, but I would not put much weight on that, since WHERE conditions can do the same (except in the aforementioned case), and to a good RDBMS it will not make a difference where you specify your criteria.
The join is used to reflect the entity relations
the where clause filters down results.
So the join clauses are 'static' (unless the entity relations change), while the where clauses are use-case specific.
There is no difference (for an inner join). "ON" is like a synonym for "WHERE", so the second kind of reads like:
JOIN table2 WHERE cd.Company = table2.Name AND table2.Id IN (2728)
There is no difference when the query optimisation engine breaks it down to its relevant query operators.

PostgreSQL - Correlated Sub-Query Fail?

I have a query like this:
SELECT t1.id,
(SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) as num_things
FROM t1
WHERE num_things = 5;
The goal is to get the id of all the elements that appear 5 times in the other table. However, I get this error:
ERROR: column "num_things" does not exist
SQL state: 42703
I'm probably doing something silly here, as I'm somewhat new to databases. Is there a way to fix this query so I can access num_things? Or, if not, is there any other way of achieving this result?
A few important points about using SQL:
You cannot use column aliases in the WHERE clause, but you can in the HAVING clause. That's the cause of the error you got. (Referencing an alias in HAVING works in MySQL; PostgreSQL rejects it, so the query below repeats the aggregate expression.)
You can do your count better using a JOIN and GROUP BY than by using correlated subqueries. It will usually be faster.
Use the HAVING clause to filter groups.
Here's the way I'd write this query:
SELECT t1.id, COUNT(t2.id) AS num_things
FROM t1 JOIN t2 USING (id)
GROUP BY t1.id
HAVING COUNT(t2.id) = 5;
I realize this query can skip the JOIN with t1, as in Charles Bretana's solution. But I assume you might want the query to include some other columns from t1.
Re: the question in the comment:
The difference is that the WHERE clause is evaluated on rows, before GROUP BY reduces groups to a single row per group. The HAVING clause is evaluated after groups are formed. So you can't, for example, change the COUNT() of a group by using HAVING; you can only exclude the group itself.
SELECT t1.id, COUNT(t2.id) AS num
FROM t1 JOIN t2 USING (id)
WHERE t2.attribute = <value>
GROUP BY t1.id
HAVING COUNT(t2.id) > 5;
In the above query, WHERE filters for rows matching a condition, and HAVING filters for groups whose count is greater than five.
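To make the row-versus-group distinction concrete, here is a small sketch using Python's sqlite3 module as a stand-in for PostgreSQL. The rows and the attribute values are invented, the threshold is lowered to 1, and the join is written with ON rather than USING to keep the column references unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY);
    CREATE TABLE t2 (id INTEGER, attribute TEXT);
    INSERT INTO t1 (id) VALUES (1), (2);
    INSERT INTO t2 (id, attribute) VALUES
        (1, 'a'), (1, 'a'), (1, 'b'),
        (2, 'a'), (2, 'b'), (2, 'b');
""")

result = conn.execute("""
    SELECT t1.id, COUNT(t2.id) AS num
    FROM t1 JOIN t2 ON t2.id = t1.id
    WHERE t2.attribute = ?        -- applied to rows, before grouping
    GROUP BY t1.id
    HAVING COUNT(t2.id) > 1       -- applied to groups, after grouping
    ORDER BY t1.id
""", ("a",)).fetchall()

print(result)
```

Each id has three rows in t2, but WHERE removes the 'b' rows first, so the counts seen by HAVING are 2 and 1, and only the first group passes.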
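The point that causes most people confusion is when they don't have a GROUP BY clause, so it seems like HAVING and WHERE are interchangeable (MySQL accepts a HAVING clause like the one below; PostgreSQL rejects it because primaryKey is neither grouped nor aggregated).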
WHERE is evaluated before expressions in the select-list. This may not be obvious because SQL syntax puts the select-list first. So you can save a lot of expensive computation by using WHERE to restrict rows.
SELECT <expensive expressions>
FROM t1
HAVING primaryKey = 1234;
If you use a query like the above, the expressions in the select-list are computed for every row, only to discard most of the results because of the HAVING condition. However, the query below computes the expression only for the single row matching the WHERE condition.
SELECT <expensive expressions>
FROM t1
WHERE primaryKey = 1234;
So to recap, queries are run by the database engine according to a series of steps:
1. Generate the set of rows from the table(s), including any rows produced by JOIN.
2. Evaluate WHERE conditions against the set of rows, filtering out rows that don't match.
3. Compute expressions in the select-list for each row in the set.
4. Apply column aliases (note this is a separate step, which means you can't use aliases in expressions in the select-list).
5. Condense groups to a single row per group, according to the GROUP BY clause.
6. Evaluate HAVING conditions against groups, filtering out groups that don't match.
7. Sort the result, according to the ORDER BY clause.
All the other suggestions would work, but to answer your basic question it would be sufficient to write
SELECT id From T2
Group By Id
Having Count(*) = 5
I'd like to mention that in PostgreSQL there is no way to use an aliased column in the HAVING clause.
i.e.
SELECT usr_id AS my_id FROM user HAVING my_id = 1
won't work.
Another example that is not going to work:
SELECT su.usr_id AS my_id, COUNT(*) AS val FROM sys_user AS su GROUP BY su.usr_id HAVING val >= 1
There will be the same error: the val column is not known.
I'm highlighting this because Bill Karwin wrote something not really true for Postgres:
"You cannot use column aliases in the WHERE clause, but you can in the HAVING clause. That's the cause of the error you got."
I think you could just rewrite your query like so:
SELECT t1.id
FROM t1
WHERE (SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) = 5;
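Try this (the count is computed in a derived table so that the outer WHERE can refer to num_things):
SELECT id, num_things
FROM (SELECT t1.id,
(SELECT COUNT(t2.id)
FROM t2
WHERE t2.id = t1.id
) AS num_things
FROM t1
) AS counted
WHERE num_things = 5;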