How do I write a standard SQL GROUP BY that includes columns not in the GROUP BY clause - sql

Let's say I have a table called Customer, defined like this:
Id Name DepartmentId Hired
1 X 101 2001/01/01
2 Y 102 2002/01/01
3 Z 102 2003/01/01
And I want to retrieve the date of the last hiring in each department.
Obviously I would do this
SELECT c.DepartmentId, MAX(c.Hired)
FROM Customer c
GROUP BY c.DepartmentId
Which returns:
101 2001/01/01
102 2003/01/01
But what do I do if I want to return the name of the guy hired? I.e. I would want this result set:
101 2001/01/01 X
102 2003/01/01 Z
Note that the following does not work, as it would return three rows rather than the two I'm looking for:
SELECT c.DepartmentId, c.Name, MAX(c.Hired)
FROM Customer c
GROUP BY c.DepartmentId
I can't remember seeing a query that achieves this.
NOTE: It's not acceptable to join on the Hired field, as that would not be guaranteed to be accurate.

A subselect would do the job and would handle the case where more than one person was hired in the same department on the same day:
SELECT c.DepartmentId, c.Name, c.Hired from Customer c,
(SELECT DepartmentId, MAX(Hired) as MaxHired
FROM Customer
GROUP BY DepartmentId) as sub
WHERE c.DepartmentId = sub.DepartmentId AND c.Hired = sub.MaxHired
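To try the subselect approach quickly, here is a minimal sketch that loads the three Customer rows from the question into an in-memory SQLite database and runs an equivalent query through Python's built-in sqlite3 module. The use of SQLite, the explicit JOIN syntax and the variable names are assumptions made for illustration; the dates are stored as text, which compares correctly only because the yyyy/mm/dd format sorts chronologically.

```python
import sqlite3

# In-memory SQLite database holding the Customer rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer (Id INTEGER, Name TEXT, DepartmentId INTEGER, Hired TEXT)"
)
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?, ?)",
    [
        (1, "X", 101, "2001/01/01"),
        (2, "Y", 102, "2002/01/01"),
        (3, "Z", 102, "2003/01/01"),
    ],
)

# Join every row to the latest hire date of its department (derived table "sub").
latest_hires = conn.execute(
    """
    SELECT c.DepartmentId, c.Hired, c.Name
    FROM Customer c
    JOIN (SELECT DepartmentId, MAX(Hired) AS MaxHired
          FROM Customer
          GROUP BY DepartmentId) AS sub
      ON c.DepartmentId = sub.DepartmentId AND c.Hired = sub.MaxHired
    ORDER BY c.DepartmentId
    """
).fetchall()

for row in latest_hires:
    print(row)
```

If two people were hired in the same department on the same latest date, this query returns both of them.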

Standard Sql:
select *
from Customer C
where exists
(
-- LINQ to SQL emits NULL here ;-)
-- In fact, you can even put 1/0 here and it would not cause a division-by-zero error,
-- since an RDBMS typically does not evaluate the select list of an EXISTS subquery
SELECT NULL
FROM Customer
where c.DepartmentId = DepartmentId
GROUP BY DepartmentId
having c.Hired = MAX(Hired)
)
If your RDBMS supports tuple (row-value) testing, this is the most succinct form. SQL Server does not support it, while PostgreSQL, MySQL and Oracle do:
select *
from Customer
where (DepartmentId, Hired) in
(select DepartmentId, MAX(Hired)
from Customer
group by DepartmentId)

SELECT a.*
FROM Customer AS a
JOIN
(SELECT DepartmentId, MAX(Hired) AS Hired
FROM Customer GROUP BY DepartmentId) AS b
USING (DepartmentId,Hired);

Related

Join results of two queries in SQL and produce a result given some condition

I've never used SQL until now, so please bear with me. I have a table of departments:
I have written two queries as follows:
-- nbr of staff associated with each dept.
SELECT count(departmentId) as freq
FROM staff
GROUP BY departmentId
-- nbr of students associated with each dept.
SELECT count(departmentId) as freq
FROM StudentAssignment
GROUP BY departmentId
These produce the following two tables:
For each department id 1 to 5, I need to divide the studentFreq by the staffFreq and show the department id and description if the result is greater than 2.
If the staffFreq i.e. number of staff, for a department id is zero then I need to show that department id and description too.
So for example, in this case I want to produce a table with the department ids of 1, 4 and 5 and their corresponding descriptions: Computing, Classics and Mechanical Engineering.
Computing because 7 / 2 > 2. Classics and ME because 0 staff are assigned to those depts.
One method is a left join, starting with the departments table:
SELECT d.*,
s.freq as num_staff, sa.freq as num_students,
sa.freq * 1.0 / s.freq as student_staff_ratio
FROM departments d LEFT JOIN
(SELECT departmentId, count(*) as freq
FROM staff
GROUP BY departmentId
) s
ON s.departmentId = d.departmentId LEFT JOIN
(SELECT departmentId, count(*) as freq
FROM StudentAssignment
GROUP BY departmentId
) sa
ON sa.departmentId = d.departmentId;
Notes:
This shows missing values as NULL rather than 0. You can assign 0 instead using COALESCE(): COALESCE(s.freq, 0) as num_staff.
SQL Server does integer division, so 7 / 2 = 3, not 3.5. I think you would typically want the fractional component, which is why the query multiplies by 1.0.
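The query above returns every department with its ratio; the filter the question asks for (ratio greater than 2, or no staff at all) can be added as a WHERE clause. The following is only a sketch run against SQLite through Python's sqlite3 module. The original tables were not preserved, so the table layouts, the staffId and studentId columns, the two placeholder department descriptions and all row counts other than Computing's 7 students and 2 staff are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (departmentId INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE staff (staffId INTEGER, departmentId INTEGER);
    CREATE TABLE StudentAssignment (studentId INTEGER, departmentId INTEGER);
    """
)
# Hypothetical sample data: departments 4 and 5 have no staff at all.
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [
        (1, "Computing"),
        (2, "Dept B (hypothetical)"),
        (3, "Dept C (hypothetical)"),
        (4, "Classics"),
        (5, "Mechanical Engineering"),
    ],
)
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3)],
)
students = [(i, 1) for i in range(1, 8)] + [(8, 2), (9, 2), (10, 2), (11, 3), (12, 3)]
conn.executemany("INSERT INTO StudentAssignment VALUES (?, ?)", students)

# Keep a department when it has no staff row (s.freq IS NULL after the
# LEFT JOIN) or when students / staff exceeds 2 using fractional division.
flagged = conn.execute(
    """
    SELECT d.departmentId, d.description
    FROM departments d
    LEFT JOIN (SELECT departmentId, COUNT(*) AS freq
               FROM staff GROUP BY departmentId) s
      ON s.departmentId = d.departmentId
    LEFT JOIN (SELECT departmentId, COUNT(*) AS freq
               FROM StudentAssignment GROUP BY departmentId) sa
      ON sa.departmentId = d.departmentId
    WHERE s.freq IS NULL
       OR COALESCE(sa.freq, 0) * 1.0 / s.freq > 2
    ORDER BY d.departmentId
    """
).fetchall()

for row in flagged:
    print(row)
```

Because departments without staff produce NULL rather than 0 for s.freq, the division never divides by zero; the IS NULL test handles those departments explicitly.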

How does one 'group' 2 ids on one row to filter data on another table?

Probably a newbie question, but most of my SQL Server experience is basic reporting, with all of my formatting and grouping being made somewhat manually in Excel. Now I am tackling a homework problem that I must solve everything within SQL...
I have a database with 2 tables:
Employees(id, job title, partnerID)
Bugs_Fixed(employeeID, bugs2010, bugs2011, bugs2012)
Each employee has one partner, who is on the same table (Like if an employee with ID 34 had partner ID 201, then ID 201 would have partnerID 34)
I need to essentially group those 2 together and calculate the combined total number of bugs they fixed (each year combined) without repeating the data for the inverse partner/employee relationship.
For example:
| Team | AMT |
| 34, 20 | 717 |
| 76, 16 | 576 |
| 102, 3 | 901 |
I've gotten the query to select based on id, then sum the # of bugs, but that is for each individual employee and it needs to be represented as a group.
SELECT employeeID, partnerID, SUM (bugs2010 + bugs2011 + bugs2012) as 'AMT'
FROM Bugs_Fixed
JOIN Employees on Employees.id = Bugs_Fixed.employeeID
GROUP BY employeeID, partnerID
It calculates the yearly bugfixes correctly, but obviously doesn't partner up the 2 ids and their combined total.
Edit: Clarified SQL Server
You might be able to address this by generating a concatenated key made of the partnerID and the employeeID. The trick is to order the IDs, like:
SELECT
CONCAT(GREATEST(employeeID, partnerID), ',', LEAST(employeeID, partnerID)) as team,
SUM(bugs2010 + bugs2011 + bugs2012) as 'AMT'
FROM Bugs_Fixed
JOIN Employees on Employees.id = Bugs_Fixed.employeeID
GROUP BY team
Notes - you did not tag the RDBMS that you are using:
LEAST() and GREATEST() are not supported by all RDBMS (notably, SQL Server versions before 2022 do not support them, while MySQL, Oracle and Postgres do).
using a column alias in the GROUP BY clause (here, team) is not allowed in SQL Server, while MySQL, Postgres and SQLite do support it
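Where GREATEST() and LEAST() are unavailable, the same ordering trick can be written with CASE expressions, and grouping in an outer query avoids relying on an alias in GROUP BY. The sketch below runs such a variant on SQLite through Python's sqlite3 module. The jobTitle column name and all sample rows are hypothetical; the bug counts were chosen so that the totals match the first two rows of the example table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employees (id INTEGER PRIMARY KEY, jobTitle TEXT, partnerID INTEGER);
    CREATE TABLE Bugs_Fixed (employeeID INTEGER, bugs2010 INTEGER,
                             bugs2011 INTEGER, bugs2012 INTEGER);
    -- Hypothetical data: 34 and 20 are partners, as are 76 and 16.
    INSERT INTO Employees VALUES
        (34, 'dev', 20), (20, 'dev', 34), (76, 'qa', 16), (16, 'qa', 76);
    INSERT INTO Bugs_Fixed VALUES
        (34, 100, 200, 300), (20, 50, 40, 27),
        (76, 100, 100, 100), (16, 76, 100, 100);
    """
)

# The inner query gives both partners of a pair the same (hi, lo) key,
# so the outer GROUP BY merges their rows into one team.
teams = conn.execute(
    """
    SELECT hi, lo, SUM(total) AS amt
    FROM (
        SELECT CASE WHEN e.id > e.partnerID THEN e.id ELSE e.partnerID END AS hi,
               CASE WHEN e.id < e.partnerID THEN e.id ELSE e.partnerID END AS lo,
               b.bugs2010 + b.bugs2011 + b.bugs2012 AS total
        FROM Bugs_Fixed b
        JOIN Employees e ON e.id = b.employeeID
    ) t
    GROUP BY hi, lo
    ORDER BY hi
    """
).fetchall()

for row in teams:
    print(row)
```

The two IDs are returned as separate columns here; they can be concatenated into a single team label with whatever string function the target RDBMS provides.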
SELECT
ARRAY[e1.id, e2.id] AS team,
SUM(b.bugs2010) + SUM(b.bugs2011) + SUM(b.bugs2012) AS amt
FROM employees e1
INNER JOIN employees e2 ON e2.partnerID = e1.id AND e1.id < e2.id
INNER JOIN bugs_fixed b ON b.employeeId IN (e1.id, e2.id)
GROUP BY e1.id, e2.id
Hope this is the solution you are looking for:
select concat(e.id, ', ', e.partnerID) as TEAM,
AMT = (select sum(bugs2010 + bugs2011 + bugs2012) from Bugs_Fixed
where employeeID in (e.id, e.partnerID))
from Employees e
where e.id > e.partnerID -- list each pair only once

Need to count the filtered employees

I wrote a query that needs to filter employee data based on employee codes.
For instance, my XYZ table has 200 employees that I need to insert into the ABC table, but before inserting I need to check whether all 200 employees exist in the system. So I first filter the employees and then insert them into the ABC table.
Suppose 180 of the 200 employees match; then I insert those 180 into the ABC table.
Now I also want the difference, 200 - 180 = 20.
My query fetches only the matched records, not the employees that were filtered out.
Select distinct SD.EMP_code
FROm SALARY_DETAIL_REPORT_012018 SD /*219 Employees*/
JOIN
(SELECT * FROM EMPLOYEE) tbl
ON tbl.EMP_CODE=to_char(SD.EMP_CODE)
WHERE SD.REFERENCE_ID like '1-%';
final output : 213 employees
I want 219-213=6
i want those 6 employees. I also tried INTERSECT but i got same result.
Select distinct to_char(SD.EMP_code)
FROm SALARY_DETAIL_REPORT_012018 SD
WHERE SD.REFERENCE_ID like '1-%'
INTERSECT
SELECT EMP_CODE FROm EMPLOYEE;
OUTPUT
213 Employees
Kindly help me to find out the count of filtered employees
You can use NOT EXISTS :
SELECT DISTINCT SD.EMP_code
FROM SALARY_DETAIL_REPORT_012018 sd
WHERE NOT EXISTS (SELECT 1 FROM EMPLOYEE e WHERE e.EMP_CODE = TO_CHAR(SD.EMP_CODE)) AND
SD.REFERENCE_ID LIKE '1-%';
Use the EXCEPT operator (spelled MINUS in Oracle versions that lack EXCEPT):
Select distinct to_char(SD.EMP_code)
FROM SALARY_DETAIL_REPORT_012018 SD
WHERE SD.REFERENCE_ID like '1-%'
except
SELECT EMP_CODE FROm EMPLOYEE;
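Since the question asks for the count of unmatched employees rather than the list, the NOT EXISTS form can be wrapped in COUNT(DISTINCT ...). Below is a small sketch on SQLite via Python's sqlite3 module. The sample rows are hypothetical, and CAST(... AS TEXT) stands in for TO_CHAR(), which SQLite does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE SALARY_DETAIL_REPORT_012018 (EMP_CODE INTEGER, REFERENCE_ID TEXT);
    CREATE TABLE EMPLOYEE (EMP_CODE TEXT);
    -- Hypothetical data: codes 103 and 105 are missing from EMPLOYEE,
    -- and code 104 is excluded by the REFERENCE_ID filter.
    INSERT INTO SALARY_DETAIL_REPORT_012018 VALUES
        (101, '1-A'), (102, '1-B'), (103, '1-C'),
        (103, '1-D'), (105, '1-E'), (104, '2-A');
    INSERT INTO EMPLOYEE VALUES ('101'), ('102');
    """
)

# Count distinct report codes that have no matching EMPLOYEE row.
missing_count = conn.execute(
    """
    SELECT COUNT(DISTINCT sd.EMP_CODE)
    FROM SALARY_DETAIL_REPORT_012018 sd
    WHERE sd.REFERENCE_ID LIKE '1-%'
      AND NOT EXISTS (SELECT 1
                      FROM EMPLOYEE e
                      WHERE e.EMP_CODE = CAST(sd.EMP_CODE AS TEXT))
    """
).fetchone()[0]

print(missing_count)
```

Replacing COUNT(DISTINCT sd.EMP_CODE) with DISTINCT sd.EMP_CODE returns the unmatched codes themselves.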

SQL: Finding duplicate records based on custom criteria

I need to find duplicates based on two tables and based on custom criteria. The following determines whether it's a duplicate, and if so, show only the most recent one:
If the Employee Name and all EmployeePolicy CoverageId(s) exactly match those of another record, then that's considered a duplicate.
--Employee Table
EmployeeId Name Salary
543 John 54000
785 Alex 63000
435 John 75000
123 Alex 88000
333 John 67000
--EmployeePolicy Table
EmployeePolicyId EmployeeId CoverageId
1 543 8888
2 543 7777
3 785 5555
4 435 8888
5 435 7777
6 123 4444
7 333 8888
8 333 7776
For example, the duplicates in the example above are the following:
EmployeeId Name Salary
543 John 54000
435 John 75000
This is because they are the only ones that have a matching name in the Employee table as well as both have the same exact CoverageIds in the EmployeePolicy table.
Note: EmployeeId 333 also with Name = John is not a match because both of his CoverageIDs are not the same as the other John's CoverageIds.
At first I have been trying to find duplicates the old fashioned way by Grouping records and saying having count(*) > 1, but then quickly realized that it would not work because while in English my criteria defines a duplicate, in SQL the CoverageIDs are different so they are NOT considered duplicates.
By that same accord, I tried something like:
-- Create a TMP table
INSERT INTO #tmp
SELECT *
FROM Employee e join EmployeePolicy ep on e.EmployeeId = ep.EmployeeId
SELECT info.*
FROM
(
SELECT
tmp.*,
ROW_NUMBER() OVER(PARTITION BY tmp.Name, tmp.CoverageId ORDER BY tmp.EmployeeId DESC) AS RowNum
FROM #tmp tmp
) info
WHERE
info.RowNum = 1 AND
Again, this does not work because SQL does not see this as duplicates. Not sure how to translate my English definition of duplicate into SQL definition of duplicate.
Any help is most appreciated.
The easiest way is to concatenate the policies into a string. That, alas, is cumbersome in SQL Server. Here is a set-based approach:
with ep as (
select ep.*, count(*) over (partition by employeeid) as cnt
from employeepolicy ep
)
select ep.employeeid, ep2.employeeid
from ep join
ep ep2
on ep.employeeid < ep2.employeeid and
ep.CoverageId = ep2.CoverageId and
ep.cnt = ep2.cnt
group by ep.employeeid, ep2.employeeid, ep.cnt
having count(*) = ep.cnt -- all match
The idea is to match the coverages for different employees. A simple criterion is that the number of coverages needs to match. Then, it checks that the number of matching coverages equals that count.
Note: This puts the employee id pairs in a single row. You can join back to the employees table to get the additional information.
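The matching idea can be checked against the sample data from the question. The sketch below runs on SQLite through Python's sqlite3 module and makes two changes that are assumptions of this example rather than part of the answer: the per-employee count comes from a grouped subquery instead of a window function, and the Name comparison is folded into the join so that only same-named employees are paired.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Employee (EmployeeId INTEGER, Name TEXT, Salary INTEGER);
    CREATE TABLE EmployeePolicy (EmployeePolicyId INTEGER, EmployeeId INTEGER,
                                 CoverageId INTEGER);
    INSERT INTO Employee VALUES
        (543, 'John', 54000), (785, 'Alex', 63000), (435, 'John', 75000),
        (123, 'Alex', 88000), (333, 'John', 67000);
    INSERT INTO EmployeePolicy VALUES
        (1, 543, 8888), (2, 543, 7777), (3, 785, 5555), (4, 435, 8888),
        (5, 435, 7777), (6, 123, 4444), (7, 333, 8888), (8, 333, 7776);
    """
)

# Two employees are duplicates when they share a name, hold the same number
# of coverages, and every one of those coverages matches.
duplicate_pairs = conn.execute(
    """
    WITH ep AS (
        SELECT p.EmployeeId, p.CoverageId, e.Name, c.cnt
        FROM EmployeePolicy p
        JOIN Employee e ON e.EmployeeId = p.EmployeeId
        JOIN (SELECT EmployeeId, COUNT(*) AS cnt
              FROM EmployeePolicy
              GROUP BY EmployeeId) c ON c.EmployeeId = p.EmployeeId
    )
    SELECT a.EmployeeId, b.EmployeeId
    FROM ep a
    JOIN ep b
      ON a.EmployeeId < b.EmployeeId
     AND a.Name = b.Name
     AND a.CoverageId = b.CoverageId
     AND a.cnt = b.cnt
    GROUP BY a.EmployeeId, b.EmployeeId, a.cnt
    HAVING COUNT(*) = a.cnt
    """
).fetchall()

print(duplicate_pairs)
```

Employee 333 drops out because only one of his two coverages (8888) matches the other Johns, so the matched count falls short of cnt.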
I have not tested the T-SQL but I believe the following should give you the output you are looking for.
;WITH CTE_Employee
AS
(
SELECT E.[Name]
,E.[EmployeeId]
,P.[CoverageId]
,E.[Salary]
FROM Employee E
INNER JOIN EmployeePolicy P ON E.EmployeeId = P.EmployeeId
)
, CTE_DuplicateCoverage
AS
(
SELECT E.[Name]
,E.[CoverageId]
FROM CTE_Employee E
GROUP BY E.[Name], E.[CoverageId]
HAVING COUNT(*) > 1
)
SELECT E.[EmployeeId]
,E.[Name]
,MAX(E.[Salary]) AS [Salary]
FROM CTE_Employee E
INNER JOIN CTE_DuplicateCoverage D ON E.[Name] = D.[Name] AND E.[CoverageId] = D.[CoverageId]
GROUP BY E.[EmployeeId], E.[Name]
HAVING COUNT(*) > 1
ORDER BY E.[EmployeeId]

group by in sql to find max count

Update
I have a query like this
select sl.College_ID,sl.Department_ID,COUNT(sl.RegisterNumber) from StudentList sl
group by sl.College_ID,sl.Department_ID
order by sl.College_ID,sl.Department_ID asc
The above query gives the result below.
I have 200 college IDs and each college has 6 Department_IDs; with this query I can get the count (number of students) in each department:
College_Id Dept_Id count
1 1 100
1 2 210
2 3 120
2 6 80
3 1 340
But what I need is to display the maximum student count for each department,
something like this:
college_ID Dept_Id count
3 1 340
26 2 250
I tried this, but I am getting an error:
select sl.College_ID,sl.Department_ID,COUNT(sl.RegisterNumber) from StudentList sl
group by sl.College_ID,sl.Department_ID
having COUNT(sl.RegisterNumber)=max(COUNT(sl.RegisterNumber))
order by sl.College_ID,sl.Department_ID asc
What went wrong? Can someone help me?
Maybe something like this?
SELECT sl.College_ID, sl.Department_ID, COUNT(sl.RegisterNumber) As StudentCount, s2.MaxCount
FROM StudentList sl
INNER JOIN (
SELECT Department_ID, MAX(StudentCount) AS MaxCount
FROM (
SELECT College_ID, Department_ID, COUNT(*) As StudentCount
FROM StudentList
GROUP BY College_ID, Department_ID
) s1
GROUP BY Department_ID
) s2 ON sl.Department_ID = s2.Department_ID
GROUP BY sl.College_ID, sl.Department_ID, s2.MaxCount
HAVING COUNT(sl.RegisterNumber) = s2.MaxCount
ORDER BY sl.College_ID, sl.Department_ID ASC
EDIT: I've updated the query to more accurately answer your question, I missed the part where you want the College_ID with the max count.
EDIT 2: Okay, this should work now, I needed a second nested subquery for aggregating the aggregates. I don't know of a better way to compare the aggregates of different groups.
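To see the nested aggregation at work, the sketch below runs the same query against a small in-memory SQLite table through Python's sqlite3 module. The student counts are hypothetical and far smaller than in the question: department 1 has 2 students in college 1 and 4 in college 3, and department 2 has 3 students in college 1 and 5 in college 26.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE StudentList (College_ID INTEGER, Department_ID INTEGER, "
    "RegisterNumber INTEGER)"
)

# Hypothetical group sizes: (College_ID, Department_ID) -> number of students.
group_sizes = {(1, 1): 2, (3, 1): 4, (1, 2): 3, (26, 2): 5}
rows = []
register_number = 0
for (college_id, department_id), size in group_sizes.items():
    for _ in range(size):
        register_number += 1
        rows.append((college_id, department_id, register_number))
conn.executemany("INSERT INTO StudentList VALUES (?, ?, ?)", rows)

# s1 counts students per college and department, s2 takes the maximum of
# those counts per department, and HAVING keeps only the college that hits it.
max_per_department = conn.execute(
    """
    SELECT sl.College_ID, sl.Department_ID,
           COUNT(sl.RegisterNumber) AS StudentCount, s2.MaxCount
    FROM StudentList sl
    INNER JOIN (
        SELECT Department_ID, MAX(StudentCount) AS MaxCount
        FROM (
            SELECT College_ID, Department_ID, COUNT(*) AS StudentCount
            FROM StudentList
            GROUP BY College_ID, Department_ID
        ) s1
        GROUP BY Department_ID
    ) s2 ON sl.Department_ID = s2.Department_ID
    GROUP BY sl.College_ID, sl.Department_ID, s2.MaxCount
    HAVING COUNT(sl.RegisterNumber) = s2.MaxCount
    ORDER BY sl.College_ID, sl.Department_ID
    """
).fetchall()

for row in max_per_department:
    print(row)
```

If two colleges tie for the largest count in a department, both rows are returned.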
The result you want is grouped by Department_ID, and it looks like you do not really care about College_ID, since in your example it is not clear which College_ID the row with Dept_Id = 1 should show.
In that case, you can remove College_ID from your select statement and do a SUM with GROUP BY
based on your query, something like:
SELECT t.Department_ID, SUM(t.c)
FROM (
select sl.College_ID,sl.Department_ID,COUNT(sl.RegisterNumber) c from StudentList sl
group by sl.College_ID,sl.Department_ID
) t
GROUP BY t.Department_ID
ORDER BY SUM(t.c)
Note: If you really want college_ID in your result, you can do a JOIN to get your college_ID