What is semi-join in database? - sql

I am having trouble while trying to understand the concept of semi-join and how it is different from conventional join. I have tried some article already but not satisfied with the explanation, could someone please help me to understand it?

Simple example. Let's select students with grades using left outer join:
SELECT DISTINCT s.id
FROM students s
LEFT JOIN grades g ON g.student_id = s.id
WHERE g.student_id IS NOT NULL
Now the same with left semi-join:
SELECT s.id
FROM students s
WHERE EXISTS (SELECT 1 FROM grades g
WHERE g.student_id = s.id)
The latter is generally more efficient (depending on concrete DBMS and query optimizer).

As far as I know SQL dialects that support SEMIJOIN/ANTISEMI are U-SQL/Cloudera Impala.
SEMIJOIN:
Semijoins are U-SQL’s way filter a rowset based on the inclusion of its rows in another rowset. Other SQL dialects express this with the SELECT * FROM A WHERE A.key IN (SELECT B.key FROM B) pattern.
More info Semi Join and Anti Join Should Have Their Own Syntax in SQL:
“Semi” means that we don’t really join the right hand side, we only check if a join would yield results for any given tuple.
-- IN
SELECT *
FROM Employee
WHERE DeptName IN (
SELECT DeptName
FROM Dept
)
-- EXISTS
SELECT *
FROM Employee
WHERE EXISTS (
SELECT 1
FROM Dept
WHERE Employee.DeptName = Dept.DeptName
)
EDIT:
Another dialect that supports SEMI/ANTISEMI join is KQL:
kind=leftsemi (or kind=rightsemi)
Returns all the records from the left side that have matches from the right. The result table contains columns from the left side only.
let t1 = datatable(key:long, value:string)
[1, "a",
2, "b",
3, "c"];
let t2 = datatable(key:long)
[1,3];
t1 | join kind=leftsemi (t2) on key
demo
Output:
key value
1 a
3 c

As I understand, a semi join is a left join or right join:
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
So the difference between a left (semi) join and a "conventional" join is that you only retrieve the data of the left table (where you have a match on your join condition). Whereas with a full (outer) join (I think thats what you mean by conventional join), you retrieve the data of both tables where your condition matches.

Related

What's the purpose of a JOIN where no column from 2nd table is being used?

I am looking through some hive queries we are running as part of analytics on our hadoop cluster, but I am having trouble understanding one. This is the Hive QL query
SELECT
c_id, v_id, COUNT(DISTINCT(m_id)) AS participants,
cast(date_sub(current_date, ${window}) as string) as event_date
from (
select
a.c_id, a.v_id, a.user_id,
case
when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2 else a.m_id
end as m_id
from (
select * from first
where event_date <= cast(date_sub(current_date, ${window}) as string)
) a
join (
select * from second
) b on a.c_id = b.c_id
left join third c
on a.user_id = c.id1
) dx
group by c_id, v_id;
I have changed the names but otherwise this is the select statement being used to insert overwrite to another table.
Regarding the join
join (
select * from second
) b on a.c_id = b.c_id
b is not used anywhere except for join condition, so is this join serving any purpose at all?
Is it for making sure that this join only has entries where c_id is present in second table? Would a where IN condition be better if thats all this is doing.
Or I can just remove this join and it won't make any difference at all.
Thanks.
Join (any inner, left or right) can duplicate rows if join key in joined dataset is not unique. For example if a contains single row with c_id=1 and b contains two rows with c_id=1, the result will be two rows with a.c_id=1.
Join (inner) can filter rows if join key is absent in joined dataset. I believe this is what it meant to do.
If the goal is to get only rows with keys present in both datasets(filter) and you do not want duplication, and you do not use columns from joined dataset, then better use LEFT SEMI JOIN instead of JOIN, it will work as filter only even if there are duplicated keys in joined dataset:
left semi join (
select c_id from second
) b on a.c_id = b.c_id
This is much safer way to filter rows only which exist in both a and b and avoid unintended duplication.
You can replace join with WHERE IN/EXISTS, but it makes no difference, it is implemented as the same JOIN, check the EXPLAIN output and you will see the same query plan. Better use LEFT SEMI JOIN, it implements uncorrelated IN/EXISTS in efficient way.
If you prefer to move it to the WHERE:
WHERE a.c_id IN (select c_id from second)
or correlated EXISTS:
WHERE EXISTS (select 1 from second b where a.c_id=b.c_id)
But as I said, all of them are implemented internally using JOIN operator.

Joins in oracle

If I write simply this query -
SELECT * FROM AA,BB WHERE AA.ID = BB.ID
then what kind of join it will be. If we apply any kind of join then we need to specify like inner join, outer join, cross join. Then what kind of join it is?
And what's the actual difference between Inner Join and Cross join?
Your query represents inner join. For example, on Scott's sample schema, it'll join rows from EMP and DEPT tables on DEPTNO (which is a common column):
SQL> select count(*)
2 from emp e inner join dept d on e.deptno = d.deptno;
COUNT(*)
----------
14
SQL>
(you can omit inner keyword).
You asked what is a difference between inner and cross join; cross join represents a Cartesian product, which means that the result will be pairs of all rows from the first table with all rows from the second table:
SQL> select count(*)
2 from emp e cross join dept d;
COUNT(*)
----------
56
SQL>
Using "your" old syntax, that's a join without WHERE clause:
SQL> select count(*)
2 from emp e, dept d;
COUNT(*)
----------
56
SQL>
Outer join will take rows that don't have a "pair" in another table. In Scott's schema, it is department 40 in DEPT table as there are no employees who work there:
SQL> select count(*)
2 from emp e right outer join dept d on e.deptno = d.deptno;
COUNT(*)
----------
15
SQL>
The old Oracle outer join operator ((+)) is something you might still see somewhere:
SQL> select count(*)
2 from emp e, dept d
3 where e.deptno (+) = d.deptno;
COUNT(*)
----------
15
SQL>
Basically, you should switch to modern joins. More about the subject in documentation about joins, here: https://docs.oracle.com/cd/B28359_01/server.111/b28286/queries006.htm#SQLRF52331 (11g version; find the one related to your database version, although there shouldn't be anything revolutionary in more recent versions).
You asked:
If I write simply this query -
SELECT * FROM AA,BB WHERE AA.ID = BB.ID
then what kind of join it will be?
it's an INNER JOIN using pre ANSI-92 syntax.
If we apply any kind of join then we need to specify like inner join, outer join, cross join.
Only if using the current ANSI join syntax (which you should be in 2019).
Then what kind of join it is?
Really it's a cross join which gets records eliminated base on the where clause making it behave like an inner join. Eliminate the where clause and you get every record in AA to every record in BB.
And what's the actual difference between Inner Join and Cross join?>
An inner join only returns records that EXIST & RELATE IN both tables involved
A cross join relates EVERY record in one table to EVERY record in the other.
Take for example two Tables (A,B,C) (A,B,D)
an INNER JOIN would return the set {A.A,B.B}(2)
a cross join would return {A.A, A.B, A.D, B.A, B.B, B.D, C.A, C.B, C.D}(9)
It looks like you don't quite understand the various types of joins. Overall, there are 5 different join types: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, CROSS. I'll show you an example of all 5 using these 2 tables
Tables AA and BB will only have the field ID. Table AA will have 'A', and 'B' as values for ID. Table BB will have 'B', 'C', and 'D' as values for ID.
For an INNER JOIN, only those rows with matching values are included. So the INNER join of AA and BB would be:
'B','B'
For a LEFT OUTER JOIN, every row in the left table is returned, and for those rows in the right table without a matching value, NULL is returned. So we get:
'A',NULL
'B','B'
For a RIGHT OUTER JOIN, every row in the right table is returned, and for those rows in the left table without a matching value, NULL is returned. So we get:
'B','B'
NULL,'C'
NULL,'D'
For a FULL OUTER JOIN, every row in both tables are returned. And missing values in either table are NULL. So we get:
'A',NULL
'B','B'
NULL,'C'
NULL,'D'
And finally, for a CROSS JOIN, every row in each table is returned with every row in the other table. So we get:
'A','B'
'A','C'
'A','D'
'B','B'
'B','C'
'B','D'
For the example SQL statement you gave, you're returning an INNER JOIN. As for the difference between an INNER JOIN and a CROSS JOIN, the above example should have illustrated the differences, but for an INNER JOIN you're examining the minimum number of rows and for a CROSS JOIN, you're examining the maximum possible number of rows. In general if you examine the plan for a SQL query and find out that a CROSS JOIN is being used in the plan, more often than not, you have an error in your SQL since cross joins tend to be extremely processor and I/O intensive.

PostgreSQL LEFT OUTER JOIN Conditionals not working

This LEFT OUTER JOIN with several conditionals is not working, it's probably something obvious. It is returning the result of all distinct sid and not performing conditionals at all.
SELECT
count(distinct student_status.sid)
FROM studentcoursedb.student_status
LEFT OUTER JOIN studentcoursedb.student_status AS t0
ON t0.sid = student_status.sid
AND t0.term < student_status.term
AND student_status.major LIKE 'ABC%';
The result, 32684 is the count of total distinct sids, the same value returned by this query:
select count(distinct sid)
from studentcoursedb.student_status;
The two query
SELECT
count(distinct student_status.sid)
FROM studentcoursedb.student_status
LEFT OUTER JOIN studentcoursedb.student_status AS t0
ON t0.sid = student_status.sid
AND t0.term < student_status.term
AND student_status.major LIKE 'ABC%';
select count(distinct sid)
from studentcoursedb.student_status;
return the same number of rows correctly because
You are left joining (left join or left outer join is the same) the same table this mean that the resulting number of rows is ever the same number of the main table
If you want a subset matching you should use inner join (or other join relation)
You are counting a column from the left table that might have duplicate rows as a result of the LEFT JOIN, but certainly no filtered rows.
A LEFT OUTER JOIN keeps all rows in the first table along with matching rows in the second. Hence, it does not filter the first table. You are counting a column from the first table. So, the LEFT OUTER JOIN does not affect the distinct count.
If you want to filter rows, then use INNER JOIN instead. I would also move the conditions to the WHERE clause:
SELECT count(distinct ss.sid)
FROM studentcoursedb.student_status ss INNER JOIN
studentcoursedb.student_status ss2
ON ss2.sid = ss.sid
WHERE ss2.term < ss.term AND ss.major LIKE 'ABC%';
I should note that I don't think you need a self join. Have you considered:
select dense_rank(ss.term) over (order by term)
from studentcoursedb.student_status ss
where ss.major like 'ABC%';
Much simpler and should have better performance.

Joining multiple tables in SQL

Can sombody Explains me about joins?
Inner join selects common data based on where condition.
Left outer join selects all data from left irrespective of common but takes common data from right table and vice versa for Right outer.
I know the basics but question stays when it comes to join for than 5, 8, 10 tables.
Suppose I have 10 tables to join. If I have inner join with the first 5 tables and now try to apply a left join with the 6th table, now how the query will work?
I mean to say now the result set of first 5 tables will be taken as left table and the 6th one will be considerded as Right table? Or only Fifth table will be considered as left and 6th as right? Please help me regarding this.
When joining multiple tables the output of each join logically forms a virtual table that goes into the next join.
So in the example in your question the composite result of joining the first 5 tables would be treated as the left hand table.
See Itzik Ben-Gan's Logical Query Processing Poster for more about this.
The virtual tables involved in the joins can be controlled by positioning the ON clause. For example
SELECT *
FROM T1
INNER JOIN T2
ON T2.C = T1.C
INNER JOIN T3
LEFT JOIN T4
ON T4.C = T3.C
ON T3.C = T2.C
is equivalent to (T1 Inner Join T2) Inner Join (T3 Left Join T4)
It's helpful to think of JOIN's in sequence, so the former is correct.
SELECT *
FROM a
INNER JOIN b ON b.a = a.id
INNER JOIN c ON c.b = b.id
LEFT JOIN d ON d.c = c.id
LEFT JOIN e ON e.d = d.id
Would be all the fields from a and b and c where all the ON criteria match, plus the values from d where its criteria match plus all the contents of e where all its criteria match.
I know RIGHT JOIN is perfectly acceptable, but I've found in my experience that it's unnecessary - I almost always just join things from left to right.
> Simple INNER JOIN VIEW code...
CREATE VIEW room_view
AS SELECT a.*,b.*
FROM j4_booking a INNER JOIN j4_scheduling b
on a.room_id = b.room_id;
You can apply join like this..
select a.*,b.*,c.*,d.*,e.*
from [DatabaseName].[Table_a] a
INNER JOIN [DatabaseName].[Table_b] b ON a.id = b.id
INNER JOIN [DatabaseName].[Table_c] c ON b.id=c.id
INNER JOIN [DatabaseName].[Table_d] d on c.id=d.id
INNER JOIN [DatabaseName].[Table_e] e on d.id=e.id where a.con=5 and
b.con=6
Here, at place of a.* and in where condition, you can show column(filed) which you like and according condition in where condition. You can insert more table and database as per your choice. But mind that you need to mention database name and alias if you work in different database.
Just tried the following from the Example DataBase given in W3School. Worked Fine for me.
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate, Products.ProductName, Products.ProductID
FROM Orders
INNER JOIN Products
INNER JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;
Join used to combine rows from two or more tables, based on a related column between them. This example from Adventure works:
SELECT a.[EmailAddress],b.[FirstName],b.[LastName],c.[PhoneNumber],d.[Name]
FROM [Person].[EmailAddress] a
INNER JOIN [Person].[Person] b
ON a.BusinessEntityID = b.BusinessEntityID
INNER JOIN [Person].[PersonPhone] c
ON b.BusinessEntityID = c.BusinessEntityID
INNER JOIN [Person].[PhoneNumberType] d
ON c.phoneNumberTypeID = d.phoneNumberTypeID

Is inner join the same as equi-join?

Can you tell me if inner join and equi-join are the same or not ?
An 'inner join' is not the same as an 'equi-join' in general terms.
'equi-join' means joining tables using the equality operator or equivalent. I would still call an outer join an 'equi-join' if it only uses equality (others may disagree).
'inner join' is opposed to 'outer join' and determines how to join two sets when there is no matching value.
Simply put: an equi-join is a possible type of inner-joins
For a more in-depth explanation:
An inner-join is a join that returns only rows from joined tables where a certain condition is met. This condition may be of equality, which means we would have an equi-join; if the condition is not that of equality - which may be a non-equality, greater than, lesser than, between, etc. - we have a nonequi-join, called more precisely theta-join.
If we do not want such conditions to be necessarily met, we can have
outer joins (all rows from all tables returned), left join (all rows
from left table returned, only matching for right table), right join
(all rows from right table returned, only matching for left table).
The answer is NO.
An equi-join is used to match two columns from two tables using explicit operator =:
Example:
select *
from table T1, table2 T2
where T1.column_name1 = T2.column_name2
An inner join is used to get the cross product between two tables, combining all records from both tables. To get the right result you can use a equi-join or one natural join (column names between tables must be the same)
Using equi-join (explicit and implicit)
select *
from table T1 INNER JOIN table2 T2
on T1.column_name = T2.column_name
select *
from table T1, table2 T2
where T1.column_name = T2.column_name
Or Using natural join
select *
from table T1 NATURAL JOIN table2 T2
The answer is No,here is the short and simple for readers.
Inner join can have equality (=) and other operators (like <,>,<>) in the join condition.
Equi join only have equality (=) operator in the join condition.
Equi join can be an Inner join,Left Outer join, Right Outer join
If there has to made out a difference then ,I think here it is .I tested it with DB2.
In 'equi join'.you have to select the comparing column of the table being joined , in inner join it is not compulsory you do that . Example :-
Select k.id,k.name FROM customer k
inner join dealer on(
k.id =dealer.id
)
here the resulted rows are only two columns rows
id name
But I think in equi join you have to select the columns of other table too
Select k.id,k.name,d.id FROM customer k,dealer d
where
k.id =d.id
and this will result in rows with three columns , there is no way you cannot have the unwanted compared column of dealer here(even if you don't want it) , the rows will look like
id(from customer) name(from Customer) id(from dealer)
May be this is not true for your question.But it might be one of the major difference.
The answer is YES, But as a resultset. So here is an example.
Consider three tables:
orders(ord_no, purch_amt, ord_date, customer_id, salesman_id)
customer(customer_id,cust_name, city, grade, salesman_id)
salesman(salesman_id, name, city, commission)
Now if I have a query like this:
Find the details of an order.
Using INNER JOIN:
SELECT * FROM orders a INNER JOIN customer b ON a.customer_id=b.customer_id
INNER JOIN salesman c ON a.salesman_id=c.salesman_id;
Using EQUI JOIN:
SELECT * FROM orders a, customer b,salesman c where
a.customer_id=b.customer_id and a.salesman_id=c.salesman_id;
Execute both queries. You will get the same output.
Coming to your question There is no difference in output of equijoin and inner join. But there might be a difference in inner executions of both the types.