Left semi join in Hive for multiple tables - sql

How can we use a left semi join across multiple tables? For example, in SQL the query to retrieve the employees working in the US is:
select name,job_id,sal
from emp
where dept_id IN (select dept_id
from dept d
INNER JOIN Location L
on d.location_id = L.location_id
where L.city='US'
)
As the IN query is not supported in Hive, how can we write this in Hive?

Seems like a simple inner join:
select e.name
,e.job_id
,e.sal
from emp as e
join dept as d
on d.dept_id = e.dept_id
join location as l
on l.location_id = d.location_id
where l.city='US'
P.s.
Hive does support IN.
The only issue with your query is that dept_id of emp is not qualified (should be emp.dept_id).
This works:
select name,job_id,sal
from emp
where emp.dept_id IN (select dept_id
from dept d
INNER JOIN Location L
on d.location_id = L.location_id
where L.city='US'
)

Use exists instead:
select e.name, e.job_id, e.sal
from emp e
where exists (select 1
from dept d join
location L
on d.location_id = L.location_id
where l.city = 'US' and d.dept_id = e.dept_id
);
You can refer to the documentation, which covers subqueries in the WHERE clause.
This query appears to answer the question: which employees work in departments that have a location in the US? You can also do this in the FROM clause with a subquery:
select e.name, e.job_id, e.sal
from emp e join
(select distinct d.dept_id
from dept d join
location L
on d.location_id = L.location_id
where l.city = 'US'
) d
on d.dept_id = e.dept_id;
I should note, though, that "US" is not usually considered a city.
EDIT:
Obviously, if a department can only have one location, then a "semi-join" is not necessary. The SELECT DISTINCT can just be SELECT . . . Or, you can use the JOINs as in Dudu's answer. In any case, the EXISTS will work. In many databases it would have good (sometimes the best) performance; I'm not sure about the performance implications in Hive.
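To see when the semi-join matters, here is a minimal sketch using Python's built-in sqlite3 module rather than Hive. The toy rows are hypothetical, and it assumes a department can appear at more than one location (dept_id is not unique in dept). Under that assumption the plain inner join repeats an employee once per matching department row, while EXISTS returns each employee once.

```python
import sqlite3

# Hypothetical toy data: department 10 sits at two 'US' locations,
# so dept_id is deliberately NOT unique in dept (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, job_id TEXT, sal INTEGER, dept_id INTEGER);
    CREATE TABLE dept (dept_id INTEGER, location_id INTEGER);
    CREATE TABLE location (location_id INTEGER, city TEXT);
    INSERT INTO emp VALUES ('Ann', 'DEV', 100, 10), ('Bob', 'OPS', 90, 20);
    INSERT INTO dept VALUES (10, 1), (10, 2), (20, 3);
    INSERT INTO location VALUES (1, 'US'), (2, 'US'), (3, 'UK');
""")

# Plain inner join: one output row per matching (emp, dept, location) combination.
inner_join_rows = conn.execute("""
    SELECT e.name
    FROM emp e
    JOIN dept d ON d.dept_id = e.dept_id
    JOIN location l ON l.location_id = d.location_id
    WHERE l.city = 'US'
""").fetchall()

# Semi-join via EXISTS: each employee is returned at most once.
exists_rows = conn.execute("""
    SELECT e.name
    FROM emp e
    WHERE EXISTS (SELECT 1
                  FROM dept d
                  JOIN location l ON d.location_id = l.location_id
                  WHERE l.city = 'US' AND d.dept_id = e.dept_id)
""").fetchall()

print("inner join:", inner_join_rows)
print("exists:    ", exists_rows)
```

If dept_id were unique in dept, both queries would return the same rows, which is the case described above where the semi-join is unnecessary.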

Related

Complex query - retrieve employees working in both two specific departments

I'm trying to figure a way to retrieve employees working in two different departments.
I have 3 simple tables:
employee (employee_id, employee_name)
department (department_id, department_name)
working (eid, did, work_time)
So I have tried to write a SQL query:
select employee_name
from employee, working,department
where eid = employee_id
and did = department_id
and department_name = 'software'
and dname = 'hardware';
But it doesn't work, what is my problem?
The problem is that you are requiring department to be both 'software' and 'hardware'. Also, dname is not a field.
Correcting your query:
select employee_name
from employee, working, department
where eid = employee_id and did = department_id
and (department_name = 'software' or department_name = 'hardware');
But I would prefer this kind of query:
SELECT DISTINCT e.employee_name
FROM employee e
JOIN working w ON w.eid = e.employee_id
JOIN department d ON d.department_id = w.did
WHERE d.department_name IN ('software', 'hardware');
That gets the employees who work in either of the two departments (or both).
If you want only employees that work in both departments, try this:
SELECT e.employee_id, e.employee_name
FROM employee e
JOIN working w ON w.eid = e.employee_id
JOIN department d ON d.department_id = w.did
WHERE d.department_name IN ('software', 'hardware')
GROUP BY e.employee_id, e.employee_name HAVING COUNT(DISTINCT d.department_id) = 2;
Would something like this work for you?
SELECT
count(*) as cnt,
employee.employee_name
FROM
employee
JOIN working ON working.eid = employee.employee_id
JOIN department ON department.department_id = working.did
WHERE
department.department_name = 'software' or department.department_name = 'hardware'
GROUP BY employee.employee_name
HAVING cnt > 1
This counts, for each employee, the rows linked to the software or hardware department and keeps the employees with more than one. Or you can leave out the WHERE clause to get all employees working in more than one department.
What is my problem?
There is no dname column in your tables.
You can simplify the problem as you don't need the department table since the working table contains the department id in the did column.
Then you need to GROUP BY each employee and find those HAVING a COUNT of two DISTINCT department ids:
SELECT MAX(e.employee_name)
FROM employee e
INNER JOIN working w
ON e.employee_id = w.eid
GROUP BY e.employee_id
HAVING COUNT(DISTINCT w.did) = 2
If you want to consider only the software and hardware departments then:
SELECT MAX(e.employee_name)
FROM employee e
INNER JOIN working w
ON e.employee_id = w.eid
INNER JOIN department d
ON w.did = d.department_id
WHERE d.department_name IN ('software', 'hardware')
GROUP BY e.employee_id
HAVING COUNT(DISTINCT w.did) = 2
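The GROUP BY / HAVING COUNT(DISTINCT ...) approach can be checked on a small dataset. This is a sketch using Python's sqlite3 module with made-up rows (Ann, Bob, Cy and a third 'sales' department are hypothetical). It also runs the unfiltered variant to show why the WHERE filter on department names matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, employee_name TEXT);
    CREATE TABLE department (department_id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE working (eid INTEGER, did INTEGER, work_time INTEGER);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO department VALUES (1, 'software'), (2, 'hardware'), (3, 'sales');
    -- Ann: software + hardware; Bob: software only; Cy: hardware + sales
    INSERT INTO working VALUES (1, 1, 10), (1, 2, 5), (2, 1, 8), (3, 2, 4), (3, 3, 6);
""")

# Employees who work in BOTH software and hardware.
both = conn.execute("""
    SELECT e.employee_id, MAX(e.employee_name)
    FROM employee e
    JOIN working w ON w.eid = e.employee_id
    JOIN department d ON d.department_id = w.did
    WHERE d.department_name IN ('software', 'hardware')
    GROUP BY e.employee_id
    HAVING COUNT(DISTINCT w.did) = 2
""").fetchall()

# Without the name filter: employees who work in exactly two departments of any kind.
any_two = conn.execute("""
    SELECT e.employee_id
    FROM employee e
    JOIN working w ON w.eid = e.employee_id
    GROUP BY e.employee_id
    HAVING COUNT(DISTINCT w.did) = 2
    ORDER BY e.employee_id
""").fetchall()

print("both software and hardware:", both)
print("any two departments:", any_two)
```

Cy shows up only in the second result, because one of Cy's two departments is neither software nor hardware.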
You can easily obtain employees who work in one specific department:
select *
from Employee e inner join
Working w on e.employee_id = w.eid inner join
Department d on w.did = d.department_id
where d.department_name = 'software'
Now the ambiguity comes in. If you want all employees who work in either software or hardware:
-- Employees who work in software, hardware, or both departments
select *
from Employee e inner join
Working w on e.employee_id = w.eid inner join
Department d on w.did = d.department_id
where d.department_name in ('software', 'hardware')
If you want the employees who work in both the software and hardware departments:
-- Employees who work in both hardware and software departments simultaneously
-- (select only the employee columns; with select * the differing department columns would make the intersection empty)
select e.*
from Employee e inner join
Working w on e.employee_id = w.eid inner join
Department d on w.did = d.department_id
where d.department_name = 'software'
intersect
select e.*
from Employee e inner join
Working w on e.employee_id = w.eid inner join
Department d on w.did = d.department_id
where d.department_name = 'hardware'

How can I write this query correctly?

How can I write a query that gives the country name, city, postal code, street address and the number of departments where at least 2 employees work? Below is the query I wrote, but I get "not a GROUP BY expression" error as a result of the query.
SELECT k.COUNTRY_NAME,
l.CITY,
l.POSTAL_CODE,
l.STREET_ADDRESS,
e.DEPARTMENT_ID,
COUNT(EMPLOYEE_ID)
FROM hr.employees e
JOIN hr.departments c
ON (c.DEPARTMENT_ID = e.DEPARTMENT_ID)
JOIN hr.locations l
ON (c.LOCATION_ID = l.LOCATION_ID)
JOIN hr.countries k
ON (k.COUNTRY_ID = l.COUNTRY_ID)
GROUP BY e.DEPARTMENT_ID
HAVING COUNT(EMPLOYEE_ID) > 2;
The error occurs because every non-aggregated column in the SELECT list (c.country_name, l.city, l.postal_code, l.street_address, e.department_id) must also be listed in the GROUP BY clause, which does not suit your case. Instead, use the COUNT(..) OVER (..) analytic function with PARTITION BY e.department_id to count per department, such as:
SELECT DISTINCT *
FROM
(
SELECT c.country_name,
l.city,
l.postal_code,
l.street_address,
e.department_id,
COUNT(e.employee_id) OVER (PARTITION BY e.department_id) AS count
FROM hr.employees e
JOIN hr.departments d
ON d.department_id = e.department_id
JOIN hr.locations l
ON d.location_id = l.location_id
JOIN hr.countries c
ON c.country_id = l.country_id
)
WHERE count >= 2 -- equality is added considering "at least" 2
ORDER BY count
Btw, the parentheses around the ON conditions are redundant.
You would first get the set of all departments that have at least 2 employees (atleast_two), as follows.
After that, you would join that set with the remaining tables and pull the attributes of interest.
with atleast_two
as (select c.DEPARTMENT_ID
,c.LOCATION_ID -- carried along so the join to hr.locations below has a column to use
,count(e.EMPLOYEE_ID) as cnt_employees
from hr.employees e
join hr.departments c
on (c.DEPARTMENT_ID=e.DEPARTMENT_ID)
group by c.DEPARTMENT_ID, c.LOCATION_ID
having count(e.EMPLOYEE_ID)>=2 -- "at least 2"
)
select k.COUNTRY_NAME
, l.CITY
, l.POSTAL_CODE
, l.STREET_ADDRESS
, c.DEPARTMENT_ID
, c.cnt_employees
from atleast_two c
join hr.locations l
on (c.LOCATION_ID=l.LOCATION_ID)
join hr.countries k
on (k.COUNTRY_ID=l.COUNTRY_ID);
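The "aggregate first, then join" idea can be tried on toy data. This is a sketch only: it uses Python's sqlite3 module instead of Oracle, drops the hr. schema prefix, and uses invented rows. It assumes the aggregate carries location_id along so that the later join to locations has a column to join on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, department_id INTEGER);
    CREATE TABLE departments (department_id INTEGER, location_id INTEGER);
    CREATE TABLE locations (location_id INTEGER, city TEXT, country_id TEXT);
    CREATE TABLE countries (country_id TEXT, country_name TEXT);
    -- Department 10 has two employees, department 20 has one (hypothetical rows).
    INSERT INTO employees VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO departments VALUES (10, 1), (20, 2);
    INSERT INTO locations VALUES (1, 'Seattle', 'US'), (2, 'London', 'UK');
    INSERT INTO countries VALUES ('US', 'United States'), ('UK', 'United Kingdom');
""")

rows = conn.execute("""
    WITH atleast_two AS (
        -- One row per department with at least 2 employees.
        SELECT d.department_id, d.location_id,
               COUNT(e.employee_id) AS cnt_employees
        FROM employees e
        JOIN departments d ON d.department_id = e.department_id
        GROUP BY d.department_id, d.location_id
        HAVING COUNT(e.employee_id) >= 2
    )
    SELECT k.country_name, l.city, a.department_id, a.cnt_employees
    FROM atleast_two a
    JOIN locations l ON l.location_id = a.location_id
    JOIN countries k ON k.country_id = l.country_id
""").fetchall()

print(rows)
```

Because the grouping happens before the descriptive columns are joined in, the outer SELECT needs no GROUP BY at all, which sidesteps the "not a GROUP BY expression" error.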

What are Oracle's old-syntax join equivalents of these queries?

What are the equivalents of these queries written in Oracle's old join syntax?
SELECT first_name, last_name, department_name, job_title
FROM employees e RIGHT JOIN departments d
ON(e.department_id = d.department_id)
RIGHT JOIN jobs j USING(job_id);
-->106 rows returned
SELECT first_name, last_name, department_name, job_title
FROM employees e RIGHT JOIN jobs j
ON(e.job_id = j.job_id)
RIGHT JOIN departments d
USING(department_id);
--> 122 rows returned
I would do something like this (for the first query) - making explicit the fact that a multiple join is, by definition, an iteration of joins of two tables (or more generally "rowsets") at a time. Think of it as "using parentheses explicitly".
select first_name, last_name, department_name, job_title
from (
select first_name, last_name, job_id, department_name
from employees e, departments d
where e.department_id (+) = d.department_id
) sq
, jobs j
where sq.job_id (+) = j.job_id
;
This can be rewritten (perhaps) using a single SELECT statement, with more WHERE conditions - but the query will be less readable; it won't be quite as clear what it is doing.
Respectively:
SELECT first_name,
last_name,
department_name,
job_title
FROM employees e,
jobs j,
departments d
WHERE e.job_id (+) = j.job_id
AND e.department_id = d.department_id (+);
and:
SELECT first_name,
last_name,
department_name,
job_title
FROM employees e,
departments d,
jobs j
WHERE e.department_id (+) = d.department_id
AND e.job_id = j.job_id (+);
However, please just use the ANSI join syntax. The old legacy join syntax is confusing to read, and you will get errors from putting the (+) on the wrong side of the join condition. You should be teaching people the less confusing "new" syntax (it's hard to call it new when it's been around since Oracle 9i in 2001) rather than reverting to old methods.
Just to add to Mathguy's answer, this is interesting because those innocent-looking right joins are not what they seem. My first (incorrect) attempt was this:
select e.department_id, e.job_id, e.first_name, e.last_name, d.department_name
from jobs j
, departments d
, employees e
where e.job_id(+) = j.job_id
and e.department_id(+) = d.department_id;
but as Mathguy points out it gives different results because of the departments with no employees and the cross join between departments and jobs, and a subtle join precedence effect that appears as a result of the right joins not being in one chain.
I'm not sure what the intention of the original query is. Using the Oracle HR demo schema, the results are the same as an inner join, but only because every job has at least one employee. This illustrates a pitfall in testing outer join queries, as you might run a test, get the same results, and think your rewrite was logically the same thing when it is not.
If you rewrite the original right joins as left joins, it would have to become something like this:
select e.department_id, e.job_id, e.first_name, e.last_name, d.department_name
from jobs j
left join (
departments d
left join employees e on e.department_id = d.department_id
)
on e.job_id = j.job_id;
(You could also expand the departments > employees join into an inline view or with clause, or use an outer apply construction to include the job_id join.)
This is because the two right joins in the original query are driven from jobs and departments, so even though the outer join from departments to employees includes the 16 departments with no employees, once we outer join from jobs to that, we implicitly exclude rows with no job_id, because we are driving it from jobs. So the outer join to departments is filtered to become in effect an inner join, and so long as all jobs have corresponding employees then that gives the same results as an inner join too. To see the difference you would have to insert another job, which adds a row in the results with the job title but no employee details.
Therefore the old-style version needs to be either this:
select de.first_name, de.last_name, de.department_name, j.job_title
from jobs j
, lateral (
select e.department_id, e.job_id, e.first_name, e.last_name, d.department_name
from departments d
, employees e
where e.department_id(+) = d.department_id
) de
where de.job_id(+) = j.job_id;
or without lateral:
select first_name, last_name, department_name, job_title
from jobs j
, ( select e.first_name, e.last_name, e.job_id, d.department_name
from departments d, employees e
where e.department_id (+) = d.department_id ) de
where de.job_id(+) = j.job_id
The second query just switches jobs and departments:
select first_name, last_name, department_name, job_title
from departments d
, ( select e.first_name, e.last_name, e.department_id, e.job_id, j.job_title
from jobs j, employees e
where e.job_id(+) = j.job_id ) je
where je.department_id(+) = d.department_id
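The difference between the nested form and the flat first attempt can be seen on a tiny dataset. This sketch uses Python's sqlite3 module, which has no (+) operator, so both shapes are written with ANSI joins: the nested one as an inline view, the flat one as a cross join of jobs and departments with employees outer joined on both keys. The rows are invented: job J2 has no employees, department 20 has no employees, and Kim has no department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (job_id TEXT, job_title TEXT);
    CREATE TABLE departments (department_id INTEGER, department_name TEXT);
    CREATE TABLE employees (first_name TEXT, job_id TEXT, department_id INTEGER);
    INSERT INTO jobs VALUES ('J1', 'Dev'), ('J2', 'Ops');
    INSERT INTO departments VALUES (10, 'Eng'), (20, 'Empty');
    INSERT INTO employees VALUES ('Ann', 'J1', 10), ('Kim', 'J1', NULL);
""")

# Nested shape: (departments outer-joined to employees) is built first,
# then jobs is outer-joined to that intermediate result.
nested = conn.execute("""
    SELECT de.first_name, de.department_name, j.job_title
    FROM jobs j
    LEFT JOIN (SELECT e.first_name, e.job_id, d.department_name
               FROM departments d
               LEFT JOIN employees e ON e.department_id = d.department_id) de
           ON de.job_id = j.job_id
    ORDER BY j.job_id
""").fetchall()

# Flat shape: jobs and departments are not joined to each other, so they
# cross join, and employees is outer-joined to every (job, department) pair.
flat = conn.execute("""
    SELECT e.first_name, d.department_name, j.job_title
    FROM jobs j
    CROSS JOIN departments d
    LEFT JOIN employees e
           ON e.job_id = j.job_id AND e.department_id = d.department_id
""").fetchall()

print("nested:", nested)
print("flat row count:", len(flat))
```

The nested form yields one row for Ann and one for the unstaffed job, and drops both Kim and the empty department, while the flat form yields a row for every job and department combination.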

Hive SQL + FROM not in to JOIN

I have a query with a NOT IN clause that I need to convert into a join statement.
SELECT EMP_NBR
FROM employees not in (select emp_id from departments where dept_id = 10 and division = 'sales')
I think the proper transformation would be a left join:
select EMP_NBR
from employees e left join
departments d
on e.dept_id = d.dept_id and
d.dept_id = 10 and
d.division = 'sales'
where d.dept_id is null;
Note: I added what I consider to be correct JOIN conditions.
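The anti-join pattern (LEFT JOIN plus an IS NULL filter) can be compared against NOT IN on toy data. This is a sketch with Python's sqlite3 module and invented rows. Note the assumption: it joins on emp_nbr = emp_id to mirror the NOT IN subquery from the question, which is a different join key from the dept_id condition used above. It also shows that the IS NULL filter only works with an outer join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_nbr INTEGER);
    CREATE TABLE departments (emp_id INTEGER, dept_id INTEGER, division TEXT);
    INSERT INTO employees VALUES (1), (2), (3);
    -- Only employee 1 is in dept 10 of the 'sales' division (hypothetical rows).
    INSERT INTO departments VALUES (1, 10, 'sales'), (2, 20, 'sales'), (3, 10, 'hr');
""")

not_in_rows = conn.execute("""
    SELECT emp_nbr FROM employees
    WHERE emp_nbr NOT IN (SELECT emp_id FROM departments
                          WHERE dept_id = 10 AND division = 'sales')
    ORDER BY emp_nbr
""").fetchall()

# Anti-join: keep the employees for which the outer join found no match.
left_join_rows = conn.execute("""
    SELECT e.emp_nbr
    FROM employees e
    LEFT JOIN departments d
           ON d.emp_id = e.emp_nbr AND d.dept_id = 10 AND d.division = 'sales'
    WHERE d.emp_id IS NULL
    ORDER BY e.emp_nbr
""").fetchall()

# With an inner join, unmatched employees are already gone, so the
# IS NULL filter can never be satisfied.
inner_join_rows = conn.execute("""
    SELECT e.emp_nbr
    FROM employees e
    JOIN departments d
      ON d.emp_id = e.emp_nbr AND d.dept_id = 10 AND d.division = 'sales'
    WHERE d.emp_id IS NULL
""").fetchall()

print(not_in_rows, left_join_rows, inner_join_rows)
```

The two forms agree here because emp_id contains no NULLs; if the subquery could return a NULL, NOT IN would return no rows at all while the anti-join would still work.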
NOT IN could be mimicked in SQL using just NOT in the WHERE clause, e.g.
SELECT EMP_NBR FROM employees inner join departments on
employees.emp_id = departments.emp_id
where NOT (dept_id = 10 and division = 'sales')

SQL Oracle Missing keyword

create table EmployeesUK_9035
as
select e.employee_id,e.first_name || e.Last_name "Name",e.Salary,l.City,e.hire_date from employees e join locations l
where e.employee_id in(select e.employee_id from employees e join departments d on(e.department_id=d.department_id) join locations l on(d.location_id=l.location_id) where l.city='London');
...from employees e join locations l
where
You have missed out the ON section of the JOIN clause (in the main query, not the subquery).
This is the sort of bloomer which should be easy to spot. But because your code is all bunched up it's hard to diagnose. Laying code out nicely isn't just some neat-freakery on the part of experienced developers: readability is actually a feature of the code. Like this ...
create table EmployeesUK_9035
as
select e.employee_id,
e.first_name || e.Last_name "Name",
e.Salary,
l.City,
e.hire_date
from employees e
join locations l
where e.employee_id in (select e.employee_id
from employees e
join departments d
on (e.department_id=d.department_id)
join locations l
on (d.location_id=l.location_id)
where l.city = 'London')
;
See how easy it is to spot the missing line? You need an ON clause to join EMPLOYEES and LOCATIONS. However, given the join in the subquery, you probably also need to include DEPARTMENTS in the main query, because there appears to be no direct join between the two tables. In which case the query might simplify to:
create table EmployeesUK_9035
as
select e.employee_id,
e.first_name || e.Last_name "Name",
e.Salary,
l.City,
e.hire_date
from employees e
join departments d
on (e.department_id=d.department_id)
join locations l
on (d.location_id=l.location_id)
where l.city = 'London'
;
Incidentally please don't use double-quotes and mixed-case for column-aliases when creating a table. You will have to use "Name" in double-quotes and the exact same case every time you reference it, which is a pain because Oracle code is generally case insensitive; that is, all Oracle identifiers are in upper-case by default but case doesn't matter provided we don't wrap the identifiers in double-quotes.
JOIN should have the ON (to show what you're joining those tables on), while yours doesn't.
I set ON 1 = 1, but you should use columns from EMPLOYEES and LOCATIONS tables.
CREATE TABLE EmployeesUK_9035
AS
SELECT e.employee_id,
e.first_name || e.Last_name "Name",
e.Salary,
l.City,
e.hire_date
FROM employees e JOIN locations l
ON 1 = 1 --> this
WHERE e.employee_id IN (SELECT e.employee_id
FROM employees e
JOIN departments d
ON (e.department_id = d.department_id)
JOIN locations l
ON (d.location_id = l.location_id)
WHERE l.city = 'London');
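To see why ON 1 = 1 is only a placeholder, here is a small sketch using Python's sqlite3 module with invented rows. With an always-true condition every employee is paired with every location, whereas joining through departments keeps only the real location of each employee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, first_name TEXT, department_id INTEGER);
    CREATE TABLE departments (department_id INTEGER, location_id INTEGER);
    CREATE TABLE locations (location_id INTEGER, city TEXT);
    INSERT INTO employees VALUES (1, 'Ann', 10), (2, 'Bob', 20);
    INSERT INTO departments VALUES (10, 1), (20, 2);
    INSERT INTO locations VALUES (1, 'London'), (2, 'Paris'), (3, 'Rome');
""")

# ON 1 = 1 is always true, so this is a Cartesian product:
# 2 employees x 3 locations.
cross_rows = conn.execute("""
    SELECT e.first_name, l.city
    FROM employees e
    JOIN locations l ON 1 = 1
""").fetchall()

# Joining through departments ties each employee to one actual location.
proper_rows = conn.execute("""
    SELECT e.first_name, l.city
    FROM employees e
    JOIN departments d ON e.department_id = d.department_id
    JOIN locations l ON d.location_id = l.location_id
    WHERE l.city = 'London'
""").fetchall()

print(len(cross_rows), proper_rows)
```

In the ON 1 = 1 version the City column would therefore list every city for each London employee, not just London.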