Multiple array_agg() calls in a single query - sql

I'm trying to accomplish something with my query but it's not really working. My application used to have a mongo db so the application is used to get arrays in a field, now we had to change to Postgres and I don't want to change my applications code to keep v1 working.
In order to get arrays in 1 field within Postgres I used array_agg() function. And this worked fine so far. However, I'm at a point where I need another array in a field from another different table.
For example:
I have my employees. employees have multiple address and have multiple workdays.
SELECT name, age, array_agg(ad.street) FROM employees e
JOIN address ad ON e.id = ad.employeeid
GROUP BY name, age
Now this worked fine for me, this would result in for example:
| name | age| array_agg(ad.street)
| peter | 25 | {1st street, 2nd street}|
Now I want to join another table for working days so I do:
SELECT name, age, array_agg(ad.street), arrag_agg(wd.day) FROM employees e
JOIN address ad ON e.id = ad.employeeid
JOIN workingdays wd ON e.id = wd.employeeid
GROUP BY name, age
This results in:
| peter | 25 | {1st street, 1st street, 1st street, 1st street, 1st street, 2nd street, 2nd street, 2nd street, 2nd street, 2nd street}| "{Monday,Tuesday,Wednesday,Thursday,Friday,Monday,Tuesday,Wednesday,Thursday,Friday}
But I need it to result:
| peter | 25 | {1st street, 2nd street}| {Monday,Tuesday,Wednesday,Thursday,Friday}
I understand it has to do with my joins, because of the multiple joins the rows multiple but I don't know how to accomplish this, can anyone give me the correct tip?

DISTINCT is often applied to repair queries that are rotten from the inside, and that's often expensive and / or incorrect. Don't multiply rows to begin with, then you don't have to fold unwanted duplicates at the end.
Joining to multiple n-tables ("has many") multiplies rows in the result set. That's efectively a CROSS JOIN or Cartesian product by proxy. See:
Two SQL LEFT JOINS produce incorrect result
There are various ways to avoid this mistake.
Aggregate first, join later
Technically, the query works as long as you join to one table with multiple rows at a time before you aggregate:
SELECT e.id, e.name, e.age, e.streets, array_agg(wd.day) AS days
FROM (
SELECT e.id, e.name, e.age, array_agg(ad.street) AS streets
FROM employees e
JOIN address ad ON ad.employeeid = e.id
GROUP BY e.id -- PK covers whole row
) e
JOIN workingdays wd ON wd.employeeid = e.id
GROUP BY e.id, e.name, e.age;
It's best to include the primary key id and GROUP BY it, because name and age are not necessarily unique. Else you might merge employees by mistake.
But better aggregate in a subquery before the join, that's superior without selective WHERE conditions on employees:
SELECT e.id, e.name, e.age, ad.streets, array_agg(wd.day) AS days
FROM employees e
JOIN (
SELECT employeeid, array_agg(ad.street) AS streets
FROM address
GROUP BY 1
) ad ON ad.employeeid = e.id
JOIN workingdays wd ON e.id = wd.employeeid
GROUP BY e.id, ad.streets;
Or aggregate both:
SELECT name, age, ad.streets, wd.days
FROM employees e
JOIN (
SELECT employeeid, array_agg(ad.street) AS streets
FROM address
GROUP BY 1
) ad ON ad.employeeid = e.id
JOIN (
SELECT employeeid, array_agg(wd.day) AS days
FROM workingdays
GROUP BY 1
) wd ON wd.employeeid = e.id;
The last one is typically faster if you retrieve all or most of the rows in the base tables.
Note that using JOIN and not LEFT JOIN removes employees from the result that have no row in address or none in workingdays. That may or may not be intended. Switch to LEFT JOIN to retain all employees in the result.
Correlated subqueries / JOIN LATERAL
For selective filters on employees, consider correlated subqueries instead:
SELECT name, age
, (SELECT array_agg(street) FROM address WHERE employeeid = e.id) AS streets
, (SELECT array_agg(day) FROM workingdays WHERE employeeid = e.id) AS days
FROM employees e
WHERE e.namer = 'peter'; -- very selective
Or LATERAL joins in Postgres 9.3 or later:
SELECT e.name, e.age, a.streets, w.days
FROM employees e
LEFT JOIN LATERAL (
SELECT array_agg(street) AS streets
FROM address
WHERE employeeid = e.id
GROUP BY 1
) a ON true
LEFT JOIN LATERAL (
SELECT array_agg(day) AS days
FROM workingdays
WHERE employeeid = e.id
GROUP BY 1
) w ON true
WHERE e.name = 'peter'; -- very selective
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
The last two queries retain all qualifying employees in the result.

Whenever you need values that aren't repeated, use DISTINCT, like so:
SELECT name, age, array_agg(DISTINCT ad.street), array_agg(DISTINCT wd.day) FROM employees e
JOIN address ad ON e.id = ad.employeeid
JOIN workingdays wd ON e.id = wd.employeeid
GROUP BY name, age

Related

subquery and join not giving the same result

1
select *
from employees
where salary > (select max(salary) from employees where department_id=50)
2
select *
from employees e left join
employees d
on e.DEPARTMENT_ID =d.DEPARTMENT_ID
where d.salary > (select max(salary) from employees where department_id=50)
why the second query is giving multiple record
i want achieve the same result as of 1st query using join.....
Thanks in Advance......
Rocky, the first select is correct. Why do you want to do any join? Without further information the objective of the second select is not clear (nonsense).
I can't see the point about joining against the same table by DEPARTMENT_ID. Anyway, the problem about duplicates is because you are joining the same two tables by a key is not pk, basically you are multiplyng each employee for all the employees of the same department. This version eliminate duplicates but still has no improvement from the first one.
select *
from employees e left join
employees d
on e.employee_ID = d.employee_ID
where d.salary > (select max(salary) from employees where department_id=50)
You are probably looking for an anti join. This is a pattern mainly used in a young DBMS where IN and EXISTS clauses are slow compared to joins, because the developers focused on joins only.
You are looking for all employees whose salaries are greater than all salaries in department 50. With other words: WHERE NOT EXISTS a salary greater or equal in department 50.
Your query can hence be written as:
select *
from employees e
where not exists
(
select null
from employees e50
where e50.department_id = 50
and e50.salary >= e.salary
);
As an anti join (an outer join where you dismiss all matches):
select *
from employees e
left join employees e50 on e50.department_id = 50 and e50.salary >= e.salary
where e50.salary is null;

SELECT and JOIN to return only one row for each employee

I have a user table that stores the employeeId, lastName, firstName, department, hire date and mostRecentLogin.
I have another table that stores the employeeId, emailAddress.
The emailAddress table can have multiple rows for an employee if they have multiple email addresses.
I'm trying to return results that only show one row for each employee. I don't care which email address, just as long as it only picks one.
But all the queries I've tried always return all possible rows.
Here is my most recent attempt:
select *
from EmployeeInfo i
left join EmployeeEmail e ON i.employeeId = e.employeeId
where i.hireDate = 2015
and employeeId IN (
SELECT MIN(employeeId)
FROM EmployeeInfo
GROUP BY employeeId
)
But then again, this returns all possible rows.
Is there a way to get this to work?
Use a sub-query instead of a join:
select *
, (select top 1 E.EmailAddress from EmplyeeEmail E where E.employeeId = I.employeeId)
from EmployeeInfo I
where I.hireDate = 2015;
Note: If you change your mind and decide you do have a preference as to which email address is returned then just add an order by to the sub-query - otherwise it is truly unknown which one you will get.
This should work.
SELECT *
FROM EmployeeInfo
Left JOIN EmployeeEmail
ON EmployeeInfo.employeeId = EmployeeEmail.employeeId
WHERE EmployeeInfo.hireDate = '2015'
GROUP BY EmployeeInfo.employeeId;

JOIN - 2 tasks - sql developer

I have 2 tasks:
1. FIRST TASK
Show first_name, last_name (from employees), job_title, employee_id (from jobs) start_date, end_date (from job_history)
My idea:
SELECT s.employee_id
, first_name
, last_name
, job_title
, employee_id
, start_date
, end_date
FROM employees
INNER JOIN jobs hp
on s.employee_id = hp.employee_id
INNER JOIN job_history
on hp.jobs = h.jobs
I know it doesn't work. I'm receiving: "HP"."EMPLOYEE_ID": invalid identifier
What does it mean "on s.employee_id = hp.employee_id". Maybe I should write sthg else instead of this.
2. SECOND TASK
Show department_name (from departments), average and max salary for each department (those data are from employees) and how many employees are working in those departments (from employees). Choose only departments with more than 1 person. The result round to 2 decimal places.
I have the pieces, but i don't know to connect it
My idea:
SELECT department_name,average(salary),max(salary),count(employees_id)
FROM employees
INNER JOIN departments
on employees_id = departments_id
HAVING count(department) > 1
SELECT ROUND(average(salary),2) from employees
I modified your queries a bit by improving table aliasing. Hopefully, if the right columns are present in the tables as you say, it should work:
SELECT s.employee_id, s.first_name, s.last_name,
hp.job_title, hp.employee_id,
h.start_date, h.end_date
FROM employees s
INNER JOIN jobs hp
on s.employee_id = hp.employee_id
INNER JOIN job_history h
on hp.jobs = h.jobs;
When we say on s.employee_id = hp.employee_id it means that if, for example, there is an employee_id = 1234 present in both the tables employees and jobs, then SQL will bring all the columns from both the tables in the same line that corresponds to employee_id = 1234. You can now pick different columns in the SELECT clause as if they are in the same/single table(which was not the case before joining). This is the main logic behind SQL joins.
As to your 2nd task, try the below query. I made some modifications in aggregation by introducing COUNT(DISTINCT s.employees_id). If the same employees_id is present twice for some reason, you still want to count that as one person.
SELECT d.department_name, avg(s.salary), max(s.salary), count(distinct s.employees_id)
FROM employees s
INNER JOIN departments d
on e.employees_id = d.departments_id
GROUP BY d.department_name
HAVING COUNT(DISTINCT s.employees_id) > 1;
Let me know if there is still any issue. Hopefully, this works.

SQL Join left join or left outer join

I am having a question in SQL Joins. I have table employee with employeeid as primary key and some other columns for employee. And there is another table called employeeaddress where there can be multiple employeeid is a foreign key. One employee can have many employeeaddresses just to explain one to many relationship.
If I want to write a query which will fetch the following columns
employee.employeeid, employee.empname,
employeeaddress.employeeaddressid, employeeaddress.addr1,
employeeaddress.addr2
So there can be an employee with no employeeaddress. But anyway I wanted to fetch all the employees who may have zero or multiple addresses.
Do I need to apply left join or left outer join? I want the following result for a table that has 2 employees John and Michael where John has two employeeaddresses with employeeaddressid 21 and 22 and Michael has no employeeaddress
1, John, 21, addr1 for John, addr2 for John
1, John, 22, another addr1 for John, another addr2 for John
2, Michael, NULL , NULL , NULL
The above result is arranged in the following fashion
employee.employeeid, employee.empname, employeeaddress.employeeaddressid, employeeaddress.addr1, employeeaddress.addr2
Please help.
Based on your description it sounds like you're looking for a query as follows. If you also wanted the address details, you'll just have to add a left join to the outer query.
Also, as comments have eluded to, LEFT JOIN is shorthand for LEFT OUTER JOIN, they will produce the same results.
SELECT *
FROM employee
inner join
(
SELECT
employeeid,
count(*) as addresscount
FROM employee
left join employeeaddress ON employeeaddress.employeeaddressid = employee.employeeaddressid
group by employeeid
) counts on counts.employeeid = employee.employeeid
WHERE counts.addresscount = 0 -- Or 1, or 5 or > 1, etc.
LEFT JOIN should be all you need.
SQL Fiddle Example
SELECT e.employeeID ,
e.empName ,
ea.employeeAddressID ,
ea.addr1 ,
ea.addr2
FROM Employee e
LEFT JOIN EmployeeAddress ea ON ea.employeeID = e.employeeID

Representing 'not in' subquery as join

I am trying to convert the following query:
select *
from employees
where emp_id not in (select distinct emp_id from managers);
into a form where I represent the subquery as a join. I tried doing:
select *
from employees a, (select distinct emp_id from managers) b
where a.emp_id!=b.emp_id;
I also tried:
select *
from employees a, (select distinct emp_id from managers) b
where a.emp_id not in b.emp_id;
But it does not give the same result. I have tried the 'INNER JOIN' syntax as well, but to no avail. I have become frustrated with this seemingly simple problem. Any help would be appreciated.
Assume employee Data set of
Emp_ID
1
2
3
4
5
6
7
Assume Manger data set of
Emp_ID
1
2
3
4
5
8
9
select *
from employees
where emp_id not in (select distinct emp_id from managers);
The above isn't joining tables so no Cartesian product is generated... you just have 7 records you're looking at...
The above would result in 6 and 7 Why? only 6 and 7 from Employee Data isn't in the managers table. 8,9 in managers is ignored as you're only returning data from employee.
select *
from employees a, (select distinct emp_id from managers) b
where a.emp_id!=b.emp_id;
The above didnt' work because a Cartesian product is generated... All of Employee to all of Manager (assuming 7 records in each table 7*7=49)
so instead of just evaluating the employee data like you were in the first query. Now you also evaluate all managers to all employees
so Select * results in
1,1
1,2
1,3
1,4
1,5
1,8
1,9
2,1
2,2...
Less the where clause matches...
so 7*7-7 or 42. and while this may be the answer to the life universe and everything in it, it's not what you wanted.
I also tried:
select *
from employees a, (select distinct emp_id from managers) b
where a.emp_id not in b.emp_id;
Again a Cartesian... All of Employee to ALL OF Managers
So this is why a left join works
SELECT e.*
FROM employees e
LEFT OUTER JOIN managers m
on e.emp_id = m.emp_id
WHERE m.emp_id is null
This says join on ID first... so don't generate a Cartesian but actually join on a value to limit the results. but since it's a LEFT join return EVERYTHING from the LEFT table (employee) and only those that match from manager.
so in our example would be returned as e.emp_Di = m.Emp_ID
1,1
2,2
3,3
4,4
5,5
6,NULL
7,NULL
now the where clause so
6,Null
7,NULL are retained...
older ansii SQL standards for left joins would have been *= in the where clause...
select *
from employees a, managers b
where a.emp_id *= b.emp_id --I never remember if the * is the LEFT so it may be =*
and b.emp_ID is null;
But I find this notation harder to read as the join can get mixed in with the other limiting criteria...
Try this:
select e.*
from employees e
left join managers m on e.emp_id = m.emp_id
where m.emp_id is null
This will join the two tables. Then we discard all rows where we found a matching manager and are left with employees who aren't managers.
Your best bet would probably be a left join:
select
e.*
from employees e
left join managers m on e.emp_id = m.emp_id
where
m.emp_id is null;
The idea here is you're saying that you want to select everything from employees, including anything that matches in the manager table based on emp_id and then filtering out the rows that actually have something in the manager table.
Use Left Outer Join instead
select e.*
from employees e
left outer join managers m
on e.emp_id = m.emp_id
where m.emp_id is null
left outer join will preserve the rows from m table even if they do not have a match i e table based on the emp_id field. The we filter on where m.emp_id is null - give me all the rows from e where there's no matching record in m table.
A bit more on the subject can be found here:
Visual representation of joins
from employees a, (select distinct emp_id from managers) b implies cross join - all posible combinations between tables (and you needed left outer join instead)
The MINUS keyword should do the trick:
SELECT e.* FROM employees e
MINUS
Select m.* FROM managers m
Hope that helps...
select *
from employees
where Not (emp_id in (select distinct emp_id from managers));