Semi-join vs Subqueries - sql

What is the difference between a semi-join and a subquery? I am currently taking a course on this on DataCamp and I'm having a hard time making a distinction between the two.
Thanks in advance.

A join (or a semi-join) is used whenever you want to combine records from two or more entities based on some common attributes.
A subquery, by contrast, is used whenever you want a lookup or a reference against the same table or other tables.
In short: when you need additional reference columns added to the existing table's attributes, go for a join; when you only want to look up records in the same table or other tables while keeping the existing columns as output, go for a subquery.
Also, a semi-join is often written using a subquery, because most of the time we don't actually join the right-hand table. Instead we maintain a check via a subquery to limit the records of the existing table. That is what makes it a semi-join, but a semi-join isn't a subquery by itself.

I don't really think of a subquery and a semi-join as anything similar. A subquery is nothing more interesting than a query that is used inside another query:
select * -- this is often called the "outer" query
from (
select columnA -- this is the subquery inside the parentheses
from mytable
where columnB = 'Y'
) t -- many databases require an alias for a subquery in the from clause
A semi-join is a concept based on join. Of course, joining tables will combine both tables and return the combined rows based on the join criteria. From there you select the columns you want from either table based on further where criteria (and of course whatever else you want to do). The concept of a semi-join is when you want to return rows from the first table only, but you need the 2nd table to decide which rows to return. Example: you want to return the people in a class:
select p.FirstName, p.LastName, p.DOB
from people p
inner join class c on c.pID = p.pID
where c.ClassName = 'SQL 101'
group by p.pID, p.FirstName, p.LastName, p.DOB
This accomplishes the concept of a semi-join. We are only returning columns from the first table (people). The use of the group by is necessary for the concept of a semi-join because a true join can return duplicate rows from the first table (depending on the join criteria). The above example is not often referred to as a semi-join, and is not the most typical way to accomplish it. The following query is a more common method of accomplishing a semi-join:
select FirstName, LastName, DOB
from people
where pID in (select pID
from class
where ClassName = 'SQL 101'
)
There is no formal join here. But we're using the 2nd table to determine which rows from the first table to return. It's a lot like saying if we did join the 2nd table to the first table, what rows from the first table would match?
For performance, exists is often preferred, although many query optimizers execute in and exists the same way:
select FirstName, LastName, DOB
from people p
where exists (select pID
from class c
where c.pID = p.pID
and c.ClassName = 'SQL 101'
)
In my opinion, this is the most direct way to understand the semi-join. There is still no formal join, but you can see the idea of a join hinted at by the usage of directly matching the first table's pID column to the 2nd table's pID column.
Final note. The last 2 queries above each use a subquery to accomplish the concept of a semi-join.
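To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are invented for illustration, and the class table deliberately contains a duplicate enrollment so the effect of a plain inner join is visible next to the two semi-join forms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (pID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE class  (pID INTEGER, ClassName TEXT);
    -- Hypothetical sample data, not from the original post.
    INSERT INTO people VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Poe');
    -- Ann has two matching rows in class; Cy has none for SQL 101.
    INSERT INTO class VALUES (1, 'SQL 101'), (1, 'SQL 101'), (2, 'SQL 101'), (3, 'Art 1');
""")

# Plain inner join without group by: one output row per matching class row.
inner_join = conn.execute("""
    SELECT p.FirstName FROM people p
    JOIN class c ON c.pID = p.pID
    WHERE c.ClassName = 'SQL 101'
    ORDER BY p.pID
""").fetchall()

# Semi-join written with IN: each person appears at most once.
semi_in = conn.execute("""
    SELECT FirstName FROM people
    WHERE pID IN (SELECT pID FROM class WHERE ClassName = 'SQL 101')
    ORDER BY pID
""").fetchall()

# Semi-join written with EXISTS: same result as the IN form.
semi_exists = conn.execute("""
    SELECT p.FirstName FROM people p
    WHERE EXISTS (SELECT 1 FROM class c
                  WHERE c.pID = p.pID AND c.ClassName = 'SQL 101')
    ORDER BY p.pID
""").fetchall()

print(inner_join)   # [('Ann',), ('Ann',), ('Bob',)]
print(semi_in)      # [('Ann',), ('Bob',)]
print(semi_exists)  # [('Ann',), ('Bob',)]
```

The join repeats Ann because she matches two class rows, while both subquery forms only ask whether at least one match exists.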

Related

How to avoid duplicated rows when joining multiple tables?

I am trying to SELECT the data from the education and experience tables. Both have two entries for the given candidate_id. When I try using GROUP BY and json_agg, I get four rows in the aggregated JSON values. What am I doing wrong? I want two education objects and two experience objects in their respective arrays.
SQL:
SELECT
json_agg(education) as education,
json_agg(experience) as experience
FROM application
LEFT JOIN candidate ON application.candidate_id = candidate.id
LEFT JOIN education ON candidate.id = education.candidate_id
LEFT JOIN experience ON candidate.id = experience.candidate_id
WHERE application.candidate_id = 2
GROUP BY education.candidate_id, experience.candidate_id;
Result:
education
[{"id":3,"candidate_id":2,"school":"school1 candidate2","qualification":"qualification1 candidate2","dates":"dates1 candidate2","note":null},
{"id":3,"candidate_id":2,"school":"school1 candidate2","qualification":"qualification1 candidate2","dates":"dates1 candidate2","note":null},
{"id":4,"candidate_id":2,"school":"school2 candidate2","qualification":"qualification2 candidate2","dates":"dates2 candidate2","note":null},
{"id":4,"candidate_id":2,"school":"school2 candidate2","qualification":"qualification2 candidate2","dates":"dates2 candidate2","note":null}]
experience
[{"id":3,"candidate_id":2,"employer":"emploer1 candidate2","title":"title1 candidate2","dates":"dates1 candidate2","job_duties":"duties1 candidate2"},
{"id":4,"candidate_id":2,"employer":"emploer2 candidate2","title":"title2 candidate2","dates":"dates2 candidate2","job_duties":"duties2 candidate2"},
{"id":3,"candidate_id":2,"employer":"emploer1 candidate2","title":"title1 candidate2","dates":"dates1 candidate2","job_duties":"duties1 candidate2"},
{"id":4,"candidate_id":2,"employer":"emploer2 candidate2","title":"title2 candidate2","dates":"dates2 candidate2","job_duties":"duties2 candidate2"}]
I tried multiple variants of this query ...
Multiple joins that do not (also) associate rows among the joined tables effectively act like a CROSS JOIN by proxy, multiplying rows. See:
Two SQL LEFT JOINS produce incorrect result
Aggregate before joining (so that only a single row per parent row remains, hence no duplication). Or use lowly correlated subqueries for this simple case. Well, not even correlated for just your single candidate_id, rather plain subquery expressions in the SELECT list:
SELECT (SELECT json_agg(e.*)
FROM education e
WHERE e.candidate_id = 2) AS education
, (SELECT json_agg(e.*)
FROM experience e
WHERE e.candidate_id = 2) AS experience
WHERE EXISTS (SELECT FROM application a WHERE a.candidate_id = 2);
I removed the table candidate from your query, which was dead freight (unless you must verify that a related row exists in that table), but might additionally multiply rows in the same way.
And the table application only needs to be checked for the existence of any qualifying rows.
You might alternatively use (LATERAL) subqueries for more complex cases. (I suspect you over-simplified.) See:
How to SUM numbers from a plain jsonb array?
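The row multiplication described in this answer can be reproduced with a small sketch. It uses Python's sqlite3 module and count() in place of PostgreSQL's json_agg (an assumption made to keep the example portable); the tables are cut down to the columns needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE education  (id INTEGER PRIMARY KEY, candidate_id INTEGER, school TEXT);
    CREATE TABLE experience (id INTEGER PRIMARY KEY, candidate_id INTEGER, employer TEXT);
    INSERT INTO education  VALUES (3, 2, 'school1'), (4, 2, 'school2');
    INSERT INTO experience VALUES (3, 2, 'employer1'), (4, 2, 'employer2');
""")

# Joining both child tables on candidate_id pairs every education row with
# every experience row, so 2 x 2 = 4 rows feed each aggregate.
joined = conn.execute("""
    SELECT count(ed.id), count(ex.id)
    FROM education ed
    JOIN experience ex ON ex.candidate_id = ed.candidate_id
    WHERE ed.candidate_id = 2
""").fetchone()

# Aggregating each table in its own subquery keeps the two counts independent.
separate = conn.execute("""
    SELECT (SELECT count(*) FROM education  WHERE candidate_id = 2),
           (SELECT count(*) FROM experience WHERE candidate_id = 2)
""").fetchone()

print(joined)    # (4, 4)
print(separate)  # (2, 2)
```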

SQL subselect statement very slow on certain machines

I've got an SQL statement where I get a list of all Ids from a table (Machines).
I then need the latest instance of a row in another table (Events) where the IDs match, so I have been doing a subselect.
I need the latest instance of quite a few fields that match the ID, so I have these subselects one after another within this single statement and end up with results similar to this...
This works and the results are spot on, it's just becoming very slow as the Events table has millions of records. The Machines table would have on average 100 records.
Is there a better solution than subselects? Maybe doing inner joins or a stored procedure?
Help appreciated :)
You can use apply. You don't specify how "latest instance" is defined. Let me assume it is based on the time column:
Select a.id, b.*
from TableA a outer apply
(select top(1) b.Name, b.time, b.weight
from b
where b.id = a.id
order by b.time desc
) b;
Both APPLY and the correlated subquery need an ORDER BY to do what you intend.
APPLY is a lot like a correlated subquery in the FROM clause, with two convenient enhancements: a lateral join (technically what APPLY does) can return multiple rows and multiple columns.
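The "latest row per id" idea can be tried out with a rough sketch in Python's sqlite3 module. SQLite has no APPLY, so a single correlated subquery with ORDER BY ... LIMIT 1 stands in for it here; the table layout, sample rows, and the index are assumptions for illustration, not part of the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE machines (id INTEGER PRIMARY KEY);
    CREATE TABLE events (id INTEGER, name TEXT, time TEXT, weight REAL);
    -- Assumption: an index on (id, time) so each per-machine lookup can
    -- seek to the newest row instead of scanning the whole events table.
    CREATE INDEX events_id_time ON events (id, time);
    INSERT INTO machines VALUES (1), (2), (3);
    INSERT INTO events VALUES
        (1, 'start', '2024-01-01', 10.0),
        (1, 'stop',  '2024-01-03', 12.5),
        (2, 'start', '2024-01-02', 7.0);
""")

# One correlated lookup per machine picks the newest event row, and the
# join then exposes all of that row's columns at once, instead of running
# a separate subselect for every column.
latest = conn.execute("""
    SELECT m.id, e.name, e.time, e.weight
    FROM machines m
    LEFT JOIN events e
      ON e.rowid = (SELECT e2.rowid
                    FROM events e2
                    WHERE e2.id = m.id
                    ORDER BY e2.time DESC
                    LIMIT 1)
    ORDER BY m.id
""").fetchall()

for row in latest:
    print(row)
```

Machine 3 has no events, so the LEFT JOIN returns it with NULLs, which mirrors what OUTER APPLY does in SQL Server.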

Avoid repeated information when having multiple joins?

I have the following query that uses joins to join multiple tables
select DISTINCT
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblTypes.Article_Type_Name,
tblimages.image_path as "Extra images"
from tblArticles inner join tblWriters
on tblArticles.Writer_ID_Fkey = tblWriters.Writer_ID inner join
tblArticleType on tblArticles.Article_ID = tblArticleType.Article_ID_Fkey inner join
tblTypes on tblArticleType.Article_Type_ID_Fkey = tblTypes.Article_Type_ID left outer join tblExtraImages
on tblArticles.Article_ID = tblExtraImages.Article_ID_Fkey left outer join tblimages
on tblExtraImages.image_id_fkey = tblimages.image_id
order by tblArticles.Article_Sequence, tblArticles.Article_Date_Created;
And I get the following results:
If an article has more than one type_name, then I get repeated rows in which all the other columns are identical. Is there another way of joining these tables that would prevent that from happening?
The simplest method is to just remove column Article_Type_Name from the select clause. This allows SELECT DISTINCT to identify the rows as duplicates, and eliminate them.
Another option is to use an aggregate function on the column. In recent SQL Server versions, STRING_AGG() comes in handy (you can also use MIN() or MAX()):
select
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
string_agg(tblTypes.Article_Type_Name, ',')
within group(order by tblTypes.Article_Type_Name) Article_Type_Name_List,
tblimages.image_path as Extra_Images
from ..
group by
tblArticles.Article_Title,
tblArticles.Article_img,
tblArticles.Article_Content,
tblArticles.Article_Date_Created,
tblArticles.Article_Sequence,
tblWriters.Writer_Name,
tblimages.image_path
What you're seeing here is a Cartesian product: you've joined tables in such a way that multiple rows from one side match with rows from the other.
If you don't care about the article type, then group by the other columns and take the max(article_type), or omit it by using a subquery that selects distinct records, not including the article type column, from the table that contains the article type. If your SQL Server is recent enough and you want to know all the article types, you could STRING_AGG them into a CSV list.
Ultimately, what you choose to do depends on what you want them for: filter the rows out, or group them down.
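The "group them down" option can be sketched with Python's sqlite3 module. SQLite's group_concat is used here as a stand-in for SQL Server's STRING_AGG, and the two tables are simplified, hypothetical versions of the ones in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE article_types (article_id INTEGER, type_name TEXT);
    INSERT INTO articles VALUES (1, 'Intro to joins'), (2, 'Indexing');
    INSERT INTO article_types VALUES (1, 'SQL'), (1, 'Tutorial'), (2, 'SQL');
""")

# Plain join: article 1 is repeated once per type.
repeated = conn.execute("""
    SELECT a.title, t.type_name
    FROM articles a
    JOIN article_types t ON t.article_id = a.article_id
""").fetchall()

# Grouped: one row per article, types collapsed into a comma-separated list.
grouped = conn.execute("""
    SELECT a.title, group_concat(t.type_name, ',') AS type_list
    FROM articles a
    JOIN article_types t ON t.article_id = a.article_id
    GROUP BY a.article_id, a.title
    ORDER BY a.article_id
""").fetchall()

print(len(repeated))  # 3 rows before grouping
for title, type_list in grouped:
    # group_concat does not promise an order, so sort before displaying.
    print(title, sorted(type_list.split(",")))
```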

Does a SQL JOIN's ON imply WHERE?

When joining a very large table to a small table, I try to be as specific as possible in my join query. Am I going overboard, however?
Let's say I have SmallTable with one column and just three values: "Peter", "Paul", and "Mary". I'll end up joining a bunch of huge tables to this. Should I put a WHERE statement in my join in order to narrow the join's select statement? Or does a join imply the where condition?
SELECT
Username,
click.TotalClicks,
otherjoin.SneezePercent,
anotherjoin.Coats
FROM
SmallTable
LEFT JOIN (
SELECT
Person,
SUM(Clicks) AS TotalClicks
FROM
HugeTable
WHERE
Person LIKE 'Peter' OR Person LIKE 'Paul' OR Person LIKE 'Mary'
GROUP BY
Person
) click
ON click.Person = Username
LEFT JOIN (
...
I think the version you currently have is the optimal one, because your WHERE restriction will save the database from aggregating over names whose results you ultimately will be discarding anyway in the join, in the outer query. Your current use of LIKE might preclude an index, but the database also might be able to use an index in that WHERE clause, for even better performance.
The alternative to this, namely relying on the join with the small table, would filter out names you don't want, but by then the aggregation would have already been done on the entire large table.
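As a small illustration of the two alternatives, the sketch below runs both with Python's sqlite3 module on made-up rows. It only shows that the results match and that the unfiltered version aggregates names the join later throws away; it does not measure performance, which depends on the database and its optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SmallTable (Username TEXT);
    CREATE TABLE HugeTable (Person TEXT, Clicks INTEGER);
    INSERT INTO SmallTable VALUES ('Peter'), ('Paul'), ('Mary');
    -- Hypothetical data: 'Zed' and 'Yan' are not in SmallTable.
    INSERT INTO HugeTable VALUES
        ('Peter', 3), ('Peter', 4), ('Paul', 5), ('Zed', 100), ('Yan', 50);
""")

query = """
    SELECT s.Username, click.TotalClicks
    FROM SmallTable s
    LEFT JOIN (
        SELECT Person, SUM(Clicks) AS TotalClicks
        FROM HugeTable
        {where}
        GROUP BY Person
    ) click ON click.Person = s.Username
    ORDER BY s.Username
"""

# Filter inside the subquery: only the three wanted names are aggregated.
filtered = conn.execute(
    query.format(where="WHERE Person IN ('Peter', 'Paul', 'Mary')")
).fetchall()

# No filter: every name is aggregated, and the join discards the extras.
unfiltered = conn.execute(query.format(where="")).fetchall()

print(filtered)                # [('Mary', None), ('Paul', 5), ('Peter', 7)]
print(filtered == unfiltered)  # True
```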

Whether Inner Queries Are Okay?

I often see something like...
SELECT events.id, events.begin_on, events.name
FROM events
WHERE events.user_id IN ( SELECT contacts.user_id
FROM contacts
WHERE contacts.contact_id = '1')
OR events.user_id IN ( SELECT contacts.contact_id
FROM contacts
WHERE contacts.user_id = '1')
Is it okay to have a query in a query? Is it an "inner query"? A "sub-query"? Does it count as three queries (my example)? If it's bad to do so... how can I rewrite my example?
Your example isn't too bad. The biggest problems usually come from cases where there is what's called a "correlated subquery". That's when the subquery is dependent on a column from the outer query. These are particularly bad because the subquery effectively needs to be rerun for every row in the potential results.
You can rewrite your subqueries using joins and GROUP BY, but as you have it performance can vary, especially depending on your RDBMS.
It varies from database to database, especially depending on whether the columns compared are indexed or not and nullable or not, but generally, if your query is not using columns from the table joined to, you should be using either IN or EXISTS:
SELECT e.id, e.begin_on, e.name
FROM EVENTS e
WHERE EXISTS (SELECT NULL
FROM CONTACTS c
WHERE ( c.contact_id = '1' AND c.user_id = e.user_id )
OR ( c.user_id = '1' AND c.contact_id = e.user_id ))
Using a JOIN (INNER or OUTER) can inflate records if the child table has more than one record related to a parent table record. That's fine if you need that information, but if not then you need to use either GROUP BY or DISTINCT to get a result set of unique values -- and that can cost you when you review the query costs.
EXISTS
Though EXISTS clauses look like correlated subqueries, they do not execute as such (RBAR: Row By Agonizing Row). EXISTS returns a boolean based on the criteria provided, and exits on the first instance that is true -- this can make it faster than IN when dealing with duplicates in a child table.
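The claims about row inflation can be checked with a short sketch in Python's sqlite3 module. The rows are invented, including a deliberately duplicated contact, and the join condition simply mirrors the EXISTS predicate so the three forms are comparable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, user_id TEXT, name TEXT);
    CREATE TABLE contacts (user_id TEXT, contact_id TEXT);
    INSERT INTO events VALUES (10, '2', 'picnic'), (11, '3', 'concert'), (12, '4', 'hike');
    -- User 2 lists user 1 as a contact twice (duplicate row); user 1 lists user 3.
    INSERT INTO contacts VALUES ('2', '1'), ('2', '1'), ('1', '3');
""")

with_in = conn.execute("""
    SELECT e.id FROM events e
    WHERE e.user_id IN (SELECT c.user_id FROM contacts c WHERE c.contact_id = '1')
       OR e.user_id IN (SELECT c.contact_id FROM contacts c WHERE c.user_id = '1')
    ORDER BY e.id
""").fetchall()

with_exists = conn.execute("""
    SELECT e.id FROM events e
    WHERE EXISTS (SELECT NULL FROM contacts c
                  WHERE (c.contact_id = '1' AND c.user_id = e.user_id)
                     OR (c.user_id = '1' AND c.contact_id = e.user_id))
    ORDER BY e.id
""").fetchall()

# Same predicate as a join, with no GROUP BY or DISTINCT to remove duplicates.
with_join = conn.execute("""
    SELECT e.id FROM events e
    JOIN contacts c
      ON (c.contact_id = '1' AND c.user_id = e.user_id)
      OR (c.user_id = '1' AND c.contact_id = e.user_id)
    ORDER BY e.id
""").fetchall()

print(with_in)      # [(10,), (11,)]
print(with_exists)  # [(10,), (11,)]
print(with_join)    # [(10,), (10,), (11,)]
```

Event 10 shows up twice in the join because its owner matches two contact rows, which is the inflation that GROUP BY or DISTINCT would then have to undo.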
You could JOIN to the Contacts table instead:
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (contacts.contact_id = '1' AND events.user_id = contacts.user_id)
OR (contacts.user_id = '1' AND events.user_id = contacts.contact_id)
GROUP BY events.id
-- exercise: without the GROUP BY, how many duplicate rows can you end up with?
This leaves the following question up to the database: "Should we look through all the contacts table and find all the '1's in the various columns, or do something else?" where your original SQL didn't give it much choice.
The most common term for this sort of query is "subquery." There is nothing inherently wrong with using them, and they can make your life easier. However, performance can often be improved by rewriting queries with subqueries to use JOINs instead, because the server can find optimizations.
In your example, three queries are executed: the main SELECT query, and the two SELECT subqueries.
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (contacts.contact_id = '1' AND events.user_id = contacts.user_id)
OR (contacts.user_id = '1' AND events.user_id = contacts.contact_id)
GROUP BY events.id
In your case, I believe the JOIN version will be better as you can avoid two SELECT queries on contacts, opting for the JOIN instead.
See the MySQL docs on the topic.