I'm using Postgres and I have the following schemes.
Orders
| id | status |
|----|-------------|
| 1 | delivered |
| 2 | recollected |
Comments
| id | text | user | order |
|----|---------|------|-------|
| 1 | texto 1 | 10 | 20 |
| 2 | texto 2 | 20 | 20 |
So, in this case, an order can have many comments.
I need to iterate over the orders and get something like this:
| id | status | comments |
|----|-------------|----------------|
| 1 | delivered | text 1, text 2 |
| 2 | recollected | |
I tried to use LEFT JOIN but it didn't work
SELECT
Order.id,
Order.status,
"Comment".text
FROM "Order"
LEFT JOIN "Comment" ON Order.id = "Comment"."order"
it returns this:
| id | status | text |
|----|-------------|--------|
| 1 | delivered | text 1 |
| 1 | delivered | text 2 |
| 2 | recollected| |
You are almost there - you just need aggregation:
SELECT
o.id,
o.status,
STRING_AGG(c.text, ',') comments
FROM "Order" o
LEFT JOIN "Comment" c ON p.id = c."order"
GROUP BY o.id, o.status
I would strongly recommend against having a table (and/or a column) called order: because it conflicts with a language keyword. I would also recommend avoiding quoted identifiers as much as possible - they make the queries longer to write, for no benefit.
Note that you can also use a correlated subquery:
SELECT
o.id,
o.status,
(SELECT STRING_AGG(c.text, ',') FROM "Comment" c WHERE c."order" = p.id) comments
FROM "Order" o
You can make it work with LEFT JOIN and aggregate after the join. But it's typically more efficient to aggregate first and join later.
If most or all rows in "Comment" are involved:
SELECT o.id, o.status, c.comments
FROM "Order" o
LEFT JOIN (
SELECT "order" AS id, string_agg(text, ', ') AS comments
FROM "Comment"
GROUP BY 1
) c USING (id);
Indexes won't matter, while most rows have to be read anyway.
For only a small percentage of rows (like, if you have a selective filter on "Order"):
SELECT o.id, o.status, c.comments
FROM "Order" o
LEFT JOIN LATERAL (
SELECT string_agg(text, ', ') AS comments
FROM "Comment"
WHERE "order" = o.id
) c ON true
WHERE <some_selective_filter>;
In this case, be sure to have an index on ("Comment"."order"), or more specialized, a covering index including text:
CREATE INDEX foo ON "Comment" ("order") INCLUDE (text);
Related:
Concatenate multiple result rows of one column into one, group by another column
Multiple array_agg() calls in a single query
Does a query with a primary key and foreign keys run faster than a query with just primary keys?
Aside: Consider legal, lower-case, unquoted identifiers in Postgres. In particular, don't (ab-)use completely reserved SQL keywords like ORDER as identifier. Much clearer and less potential for sneaky errors. See:
Are PostgreSQL column names case-sensitive?
Related
I´m doing the query below where I´m repeating the same joins multiple times, there is a better way to do it? (SQL Server Azure)
Ex.
Table: [Customer]
[Id_Customer] | [CustomerName]
1 | Tomy
...
Table: [Store]
[Id_Store] | [StoreName]
1 | SuperMarket
2 | BestPrice
...
Table: [SalesFrutes]
[Id_SalesFrutes] | [FruteName] | [Fk_Id_Customer] | [Fk_Id_Store]
1 | Orange | 1 | 1
...
Table: [SalesVegetable]
[Id_SalesVegetable] | [VegetableName] | [Fk_Id_Customer] | [Fk_Id_Store]
1 | Pea | 1 | 2
...
Select * From [Customer] as C
left join [SalesFrutes] as SF on SF.[Fk_Id_Customer] = C.[Id_Customer]
left join [SalesVegetable] as SV on SV.[Fk_Id_Customer] = C.[Id_Customer]
left join [Store] as S1 on S1.[Id_Store] = SF.[Fk_Id_Store]
left join [Store] as S2 on S1.[Id_Store] = SV.[Fk_Id_Store]
In my real case, I have many [Sales...] to Join with [Customer] and many other tables similar to [Store] to join to each [Sales...]. So it starts to scale a lot the number on joins repeating. There is a better way to do it?
Bonus question: I do like also to have FruteName, VegetableName, StoreName, and each Food table name under the same column.
The Expected Result is:
[CustomerName] | [FoodName] | [SalesTableName] | [StoreName]
Tomy | Orange | SalesFrute | SuperMarket
Tomy | Pea | SalesVegetable | BestPrice
...
Thank you!!
So based on the information provided, I would have suggested the below, to use a cte to "fix" the data model and make writing your query easier.
Since you say your real-world scenario is different to the info provided it might not work for you, but could still be applicable if you have say 80% shared columns, you can just use placeholder/null values where relevant for unioning the data sets and still minimise the number of joins eg to your store table.
with allSales as (
select Id_SalesFrutes as Id, FruitName as FoodName, 'Fruit' as SaleType, Fk_Id_customer as Id_customer, Fk_Id_Store as Id_Store
from SalesFruits
union all
select Id_SalesVegetable, VegetableName, 'Vegetable', Fk_Id_customer, Fk_Id_Store
from SalesVegetable
union all... etc
)
select c.CustomerName, s.FoodName, s.SaleType, st.StoreName
from Customer c
join allSales s on s.Id_customer=c.Id_customer
join Store st on st.Id_Store=s.Id_Store
Let's say I have the following tables:
+-------------------------------------------+
| t_classroom |
+-------------------------------------------+
| PK | id |
| | admin_user_id |
| | name |
| | students |
+-------------------------------------------+
+-------------------------------------------+
| t_shared |
+-------------------------------------------+
| | admin_user_id |
| | classroom_id |
| | expiry |
+-------------------------------------------+
I want to write a query that will pull all classrooms that an admin_user_id has access to. In essence, I want a union of classroom rows when I search by admin_user_id in the t_classroom table as well as classroom rows when I search by admin_user_id in the t_shared table. I made the following attempt:
SELECT
id,
admin_user_id,
name,
students
FROM
t_classroom
WHERE
admin_user_id = 1
UNION ALL
SELECT
c.id,
c.admin_user_id,
c.name,
students
FROM
t_classroom c
INNER JOIN t_shared s
ON c.id = s.classroom_id
WHERE
admin_user_id = 1
Does the above look correct? Is there anything more efficient/cleaner?
Depending on how much data you have you could probably get away with just using an IN clause to look at the other table.
SELECT
c.id,
c.admin_user_id,
c.name,
c.students
FROM
t_classroom c
WHERE
c.admin_user_id = 1
OR c.id IN ( select s.classroom_id from t_shared s where s.admin_user_id = 1 )
Your union wont work because you're left-joining to the t_shared table and checking only the classroom admin user.
If you join the shared room you would also end up with duplicates and would need to distinct the result too.
Edit:
Because of the large number of rows it might be better to use an exists check on the 2nd table.
SELECT
c.id,
c.admin_user_id,
c.name,
c.students
FROM
t_classroom c
WHERE
c.admin_user_id = 1
OR EXISTS ( select 1 from t_shared s where s.classroom_id = c.id AND s.admin_user_id = 1 )
Your solution is fundamentally fine, the only two problems I can detect when eyeballing your query are:
You need to write s.admin_user_id instead of admin_user_id in the last line to avoid an error message, because there is a column of that name in both tables. Best practice is to always qualify column names with the table names.
You might want to use UNION instead of UNION ALL if you want to avoid a duplicate result row in the case that both tables have admin_user_id = 1 for the same classroom.
I have an "insert only" database, wherein records aren't physically updated, but rather logically updated by adding a new record, with a CRUD value, carrying a larger sequence. In this case, the "seq" (sequence) column is more in line with what you may consider a primary key, but the "id" is the logical identifier for the record. In the example below,
This is the physical representation of the table:
seq id name | CRUD |
----|-----|--------|------|
1 | 10 | john | C |
2 | 10 | joe | U |
3 | 11 | kent | C |
4 | 12 | katie | C |
5 | 12 | sue | U |
6 | 13 | jill | C |
7 | 14 | bill | C |
This is the logical representation of the table, considering the "most recent" records:
seq id name | CRUD |
----|-----|--------|------|
2 | 10 | joe | U |
3 | 11 | kent | C |
5 | 12 | sue | U |
6 | 13 | jill | C |
7 | 14 | bill | C |
In order to, for instance, retrieve the most recent record for the person with id=12, I would currently do something like this:
SELECT
*
FROM
PEOPLE P
WHERE
P.ID = 12
AND
P.SEQ = (
SELECT
MAX(P1.SEQ)
FROM
PEOPLE P1
WHERE P.ID = 12
)
...and I would receive this row:
seq id name | CRUD |
----|-----|--------|------|
5 | 12 | sue | U |
What I'd rather do is something like this:
WITH
NEW_P
AS
(
--CTE representing all of the most recent records
--i.e. for any given id, the most recent sequence
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
The first SQL example using the the subquery already works for us.
Question: How can I leverage a CTE to simplify our predicates when needing to leverage the "most recent" logical view of the table. In essence, I don't want to inline a subquery every single time I want to get at the most recent record. I'd rather define a CTE and leverage that in any subsequent predicate.
P.S. While I'm currently using DB2, I'm looking for a solution that is database agnostic.
This is a clear case for window (or OLAP) functions, which are supported by all modern SQL databases. For example:
WITH
ORD_P
AS
(
SELECT p.*, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY seq DESC) rn
FROM people p
)
,
NEW_P
AS
(
SELECT * from ORD_P
WHERE rn = 1
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
PS. Not tested. You may need to explicitly list all columns in the CTE clauses.
I guess you already put it together. First find the max seq associated with each id, then use that to join back to the main table:
WITH newp AS (
SELECT id, MAX(seq) AS latestseq
FROM people
GROUP BY id
)
SELECT p.*
FROM people p
JOIN newp n ON (n.latestseq = p.seq)
ORDER BY p.id
What you originally had would work, or moving the CTE into the "from" clause. Maybe you want to use a timestamp field rather than a sequence number for the ordering?
Following up from #Glenn's answer, here is an updated query which meets my original goal and is on par with #mustaccio's answer, but I'm still not sure what the performance (and other) implications of this approach vs the other are.
WITH
LATEST_PERSON_SEQS AS
(
SELECT
ID,
MAX(SEQ) AS LATEST_SEQ
FROM
PERSON
GROUP BY
ID
)
,
LATEST_PERSON AS
(
SELECT
P.*
FROM
PERSON P
JOIN
LATEST_PERSON_SEQS L
ON
(
L.LATEST_SEQ = P.SEQ)
)
SELECT
*
FROM
LATEST_PERSON L2
WHERE
L2.ID = 12
I have a table ('sales') in a MYSQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. To first remove the dupes and set the constraint is proving a bit tricky.
Table structure (simplified):
'id (unique, autoinc)'
product_id
The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, eg: the highest id.
Or to put another way, I would like to delete only duplicate records, excluding the ids matched by the following query whilst also preserving the existing non-duped records:
select id
from sales s
inner join (select product_id,
max(id) as maxId
from sales
group by product_id
having count(product_id) > 1) groupedByProdId on s.product_id
and s.id = groupedByProdId.maxId
I've struggled with this on two fronts - writing the query to select the correct records to delete and then also the constraint in MYSQL where a subselect FROM clause of a DELETE cannot reference the same table from which data is being removed.
I checked out this answer and it seemed to deal with the subject, but seem specific to sql-server, though I wouldn't rule this question out from duplicating another.
In reply to your comment, here's a query that works in MySQL:
delete YourTable
from YourTable
inner join YourTable yt2
on YourTable.product_id = yt2.product_id
and YourTable.id < yt2.id
This would only remove duplicate rows. The inner join will filter out the latest row for each product, even if no other rows for the same product exist.
P.S. If you try to alias the table after FROM, MySQL requires you to specify the name of the database, like:
delete <DatabaseName>.yt
from YourTable yt
inner join YourTable yt2
on yt.product_id = yt2.product_id
and yt.id < yt2.id;
Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY.
For example:
describe sales;
+------------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| product_id | int(11) | NO | | NULL | |
+------------+---------+------+-----+---------+----------------+
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 3 |
| 6 | 2 |
+----+------------+
ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC;
Query OK, 6 rows affected (0.03 sec)
Records: 6 Duplicates: 3 Warnings: 0
select * from sales;
+----+------------+
| id | product_id |
+----+------------+
| 6 | 2 |
| 5 | 3 |
| 2 | 1 |
+----+------------+
See this pythian post for more information.
Note that the ids end up in reverse order. I don't think this matters, since order of the ids should not matter in a database (as far as I know!). If this displeases you however, the post linked to above shows a way to solve this problem too. However, it involves creating a temporary table which requires more hard drive space than the in-place method I posted above.
I might do the following in sql-server to eliminate the duplicates:
DELETE FROM Sales
FROM Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
It looks like the analogous delete statement for mysql might be:
DELETE FROM Sales
USING Sales
INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id
This type of problem is easier to solve with CTEs and Ranking functions, however, you should be able to do something like the following to solve your problem:
Delete Sales
Where Exists(
Select 1
From Sales As S2
Where S2.product_id = Sales.product_id
And S2.id > Sales.Id
Having Count(*) > 0
)
the two bits of SQL below get the same result
SELECT c.name, o.product
FROM customer c, order o
WHERE c.id = o.cust_id
AND o.value = 150
SELECT c.name, o.product
FROM customer c
INNER JOIN order o on c.id = o.cust_id
WHERE o.value = 150
I've seen both styles used as standard at different companies. From what I've seen, the 2nd one is what most people recommend online. Is there any real reason for this other than style? Does using an Inner Join sometimes have better performance?
I've noticed Ingres and Oracle developers tend to use the first style, whereas Microsoft SQL Server users have tended to use the second, but that might just be a coincidence.
Thanks for any insight, I've wondered about this for a while.
Edit: I've changed the title from 'SQL Inner Join versus Cartesian Product' as I was using the incorrect terminlogy. Thanks for all the responses so far.
Both queries are an inner joins and equivalent. The first is the older method of doing things, whereas the use of the JOIN syntax only became common after the introduction of the SQL-92 standard (I believe it's in the older definitions, just wasn't particularly widely used before then).
The use of the JOIN syntax is strongly preferred as it separates the join logic from the filtering logic in the WHERE clause. Whilst the JOIN syntax is really syntactic sugar for inner joins it's strength lies with outer joins where the old * syntax can produce situations where it is impossible to unambiguously describe the join and the interpretation is implementation-dependent. The [LEFT | RIGHT] JOIN syntax avoids these pitfalls, and hence for consistency the use of the JOIN clause is preferable in all circumstances.
Note that neither of these two examples are Cartesian products. For that you'd use either
SELECT c.name, o.product
FROM customer c, order o
WHERE o.value = 150
or
SELECT c.name, o.product
FROM customer c CROSS JOIN order o
WHERE o.value = 150
To answer part of your question, I think early bugs in the JOIN ... ON syntax in Oracle discouraged Oracle users away from that syntax. I don't think there are any particular problems now.
They are equivalent and should be parsed into the same internal representation for optimization.
Actually these examples are equivalent and neither is a cartesian product. A cartesian product is returned when you join two tables without specifying a join condition, such as in
select *
from t1,t2
There is a good discussion of this on Wikipedia.
Oracle was late in supporting the JOIN ... ON (ANSI) syntax (not until Oracle 9), that's why Oracle developers often don't use it.
Personally, I prefer using ANSI syntax when it is logically clear that one table is driving the query and the others are lookup tables. When tables are "equal", I tend to use the cartesian syntax.
The performance should not differ at all.
The JOIN... ON... syntax is a more recent addition to ANSI and ISO specs for SQL. The JOIN... ON... syntax is generally preferred because it 1) moves the join criteria out of the WHERE clause making the WHERE clause just for filtering and 2) makes it more obvious if you are creating a dreaded Cartesian product since each JOIN must be accompanied by at least one ON clause. If all the join criteria are just ANDed in the WHERE clause, it's not as obvious when one or more is missing.
Both queries are performing an inner join, just different syntax.
TL;DR
An INNER JOIN statement can be rewritten as a CROSS JOIN with a WHERE clause matching the same condition you used in the ON clause of the INNER JOIN query.
Table relationship
Considering we have the following post and post_comment tables:
The post has the following records:
| id | title |
|----|-----------|
| 1 | Java |
| 2 | Hibernate |
| 3 | JPA |
and the post_comment has the following three rows:
| id | review | post_id |
|----|-----------|---------|
| 1 | Good | 1 |
| 2 | Excellent | 1 |
| 3 | Awesome | 2 |
SQL INNER JOIN
The SQL JOIN clause allows you to associate rows that belong to different tables. For instance, a CROSS JOIN will create a Cartesian Product containing all possible combinations of rows between the two joining tables.
While the CROSS JOIN is useful in certain scenarios, most of the time, you want to join tables based on a specific condition. And, that's where INNER JOIN comes into play.
The SQL INNER JOIN allows us to filter the Cartesian Product of joining two tables based on a condition that is specified via the ON clause.
SQL INNER JOIN - ON "always true" condition
If you provide an "always true" condition, the INNER JOIN will not filter the joined records, and the result set will contain the Cartesian Product of the two joining tables.
For instance, if we execute the following SQL INNER JOIN query:
SELECT
p.id AS "p.id",
pc.id AS "pc.id"
FROM post p
INNER JOIN post_comment pc ON 1 = 1
We will get all combinations of post and post_comment records:
| p.id | pc.id |
|---------|------------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
So, if the ON clause condition is "always true", the INNER JOIN is simply equivalent to a CROSS JOIN query:
SELECT
p.id AS "p.id",
pc.id AS "pc.id"
FROM post p
CROSS JOIN post_comment
WHERE 1 = 1
ORDER BY p.id, pc.id
SQL INNER JOIN - ON "always false" condition
On the other hand, if the ON clause condition is "always false", then all the joined records are going to be filtered out and the result set will be empty.
So, if we execute the following SQL INNER JOIN query:
SELECT
p.id AS "p.id",
pc.id AS "pc.id"
FROM post p
INNER JOIN post_comment pc ON 1 = 0
ORDER BY p.id, pc.id
We won't get any result back:
| p.id | pc.id |
|---------|------------|
That's because the query above is equivalent to the following CROSS JOIN query:
SELECT
p.id AS "p.id",
pc.id AS "pc.id"
FROM post p
CROSS JOIN post_comment
WHERE 1 = 0
ORDER BY p.id, pc.id
SQL INNER JOIN - ON clause using the Foreign Key and Primary Key columns
The most common ON clause condition is the one that matches the Foreign Key column in the child table with the Primary Key column in the parent table, as illustrated by the following query:
SELECT
p.id AS "p.id",
pc.post_id AS "pc.post_id",
pc.id AS "pc.id",
p.title AS "p.title",
pc.review AS "pc.review"
FROM post p
INNER JOIN post_comment pc ON pc.post_id = p.id
ORDER BY p.id, pc.id
When executing the above SQL INNER JOIN query, we get the following result set:
| p.id | pc.post_id | pc.id | p.title | pc.review |
|---------|------------|------------|------------|-----------|
| 1 | 1 | 1 | Java | Good |
| 1 | 1 | 2 | Java | Excellent |
| 2 | 2 | 3 | Hibernate | Awesome |
So, only the records that match the ON clause condition are included in the query result set. In our case, the result set contains all the post along with their post_comment records. The post rows that have no associated post_comment are excluded since they can not satisfy the ON Clause condition.
Again, the above SQL INNER JOIN query is equivalent to the following CROSS JOIN query:
SELECT
p.id AS "p.id",
pc.post_id AS "pc.post_id",
pc.id AS "pc.id",
p.title AS "p.title",
pc.review AS "pc.review"
FROM post p, post_comment pc
WHERE pc.post_id = p.id
The non-struck rows are the ones that satisfy the WHERE clause, and only these records are going to be included in the result set. That's the best way to visualize how the INNER JOIN clause works.
| p.id | pc.post_id | pc.id | p.title | pc.review |
|------|------------|-------|-----------|-----------|
| 1 | 1 | 1 | Java | Good |
| 1 | 1 | 2 | Java | Excellent |
| 1 | 2 | 3 | Java | Awesome |
| 2 | 1 | 1 | Hibernate | Good |
| 2 | 1 | 2 | Hibernate | Excellent |
| 2 | 2 | 3 | Hibernate | Awesome |
| 3 | 1 | 1 | JPA | Good |
| 3 | 1 | 2 | JPA | Excellent |
| 3 | 2 | 3 | JPA | Awesome |
Not that this only applies to INNER JOIN, not for OUTER JOIN.