Cannot exclude previous non-duplicate rows - sql

In a nutshell, here it is:
I have 1000(ish) employees who have multiple recurrent annual training requirements
I need to be able to sort the employees by County, Facility, Employee, and Type of Training (and also allow for sorted lists at each level)
I want to display only the most recent date the Employee took the training
What I've tried so far:
I've been successful when dealing with only one Employee's record:
DECLARE @Skill int
SET @Skill = 81
SELECT TOP 1
P.lastname+', '+P.firstname AS Employee,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days as ExpireInterval,
PO.course_startdate,
DATEADD(DD,SV.schedule_days,PO.course_startdate) as ExpireDate
FROM portfolio PO
INNER JOIN person P ON PO.person_id=P.person_id
INNER JOIN e_component EC ON PO.component_id=EC.component_id
JOIN skill_value SV ON EC.component_id=SV.object_id
JOIN skill_description SD ON SV.skill_id=SD.skill_id
JOIN person_custom PC ON P.person_id=PC.person_id
GROUP BY
PO.person_id,
PO.course_startdate,
SV.skill_id,
P.lastname,
P.firstname,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days,
SD.language_id
HAVING SD.language_id=26
AND PO.person_id=123456
AND SV.skill_id= @Skill
ORDER BY Employee, PO.course_startdate DESC
NOTE: The excessive JOINS are due to the lack of FK relationships in the host database. Our vendor designed it to rely mostly on code built into their front end so I'm working with what I've got.
The previously listed code returns the following result:
[Screenshot: most recent record for employee #123456]
When I try to pull the most recent record from a list of employees however:
DECLARE @Skill int
SET @Skill = 81
SELECT
P.lastname+', '+P.firstname AS Employee,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days as ExpireInterval,
PO.course_startdate,
DATEADD(DD,SV.schedule_days,PO.course_startdate) as ExpireDate
FROM portfolio PO
INNER JOIN person P ON PO.person_id=P.person_id
INNER JOIN e_component EC ON PO.component_id=EC.component_id
JOIN skill_value SV ON EC.component_id=SV.object_id
JOIN skill_description SD ON SV.skill_id=SD.skill_id
JOIN person_custom PC ON P.person_id=PC.person_id
GROUP BY
PO.person_id,
PO.course_startdate,
SV.skill_id,
P.lastname,
P.firstname,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days,
SD.language_id
HAVING SD.language_id=26
AND PO.person_id IN (SELECT DISTINCT person_id FROM portfolio)
AND SV.skill_id= @Skill
ORDER BY Employee, PO.course_startdate DESC
I get multiple entries for the same Employee (e.g. different times that employee has taken the training with the same skill_id).
What I want to do is something like this:
IF count(SV.skill_id)>1
THEN SELECT TOP 1 component_id --for each individual
FROM portfolio
I just can't figure out where to put the condition to have it give me one record per person. I've tried assigning local variables, moving the SELECT subquery around to various columns, adding and removing constraints... etc. Nothing has worked so far.
I'm using the following software:
SQL Server Management Studio 2014 & 2017 (the live DB is on 2014 and I have a static one on 2017 for development purposes)
Report Builder 3.0 (my company hasn't upgraded to the latest and greatest yet)
P.S. If there is a method of sorting the records on the report form itself using Regular Expressions, please let me know!

A couple of observations, then an answer.
In SQL Server, INNER JOIN and JOIN mean the same thing.
As @DaleBurrell notes, unless you're filtering by an aggregated value, use a WHERE clause rather than a HAVING clause. WHERE is logically applied before grouping, so you may see modestly better performance putting your filtering there. Also, it's more "standard", if you will.
Finally, I removed your filtering sub-query for person_id because it's a self-join to portfolio that I couldn't see a good reason for. If there are additional criteria in there that make it useful, go ahead and put it back.
With that said, your second attempt was really close. If you RANK your results using your existing ORDER BY clause, then apply TOP (1) WITH TIES, it will return the #1 ranked result for each employee, ordered by date.
DECLARE @Skill int
SET @Skill = 81
SELECT TOP (1) WITH TIES
P.lastname+', '+P.firstname AS Employee,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days as ExpireInterval,
PO.course_startdate,
DATEADD(DD,SV.schedule_days,PO.course_startdate) as ExpireDate
FROM portfolio PO
JOIN person P ON PO.person_id=P.person_id
JOIN e_component EC ON PO.component_id=EC.component_id
JOIN skill_value SV ON EC.component_id=SV.object_id
JOIN skill_description SD ON SV.skill_id=SD.skill_id
JOIN person_custom PC ON P.person_id=PC.person_id
WHERE SD.language_id=26
AND SV.skill_id= @Skill
GROUP BY
PO.person_id,
PO.course_startdate,
SV.skill_id,
P.lastname,
P.firstname,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days,
SD.language_id
-- The Employee alias cannot be referenced inside OVER (), so partition by the key column instead
ORDER BY RANK() OVER (PARTITION BY PO.person_id ORDER BY PO.course_startdate DESC)

You pretty much found the issue with your "what I want to do" snippet: you can't use TOP 1 + ORDER BY to get the most recent record when you have more than one user (i.e. you want more than one row returned).
ROW_NUMBER() is a good way to handle this. It assigns a number to each row based on conditions.
For instance, ROW_NUMBER() OVER (PARTITION BY PO.person_id ORDER BY PO.course_startdate DESC) as RN will assign a 1 to each row with the most recent PO.course_startdate for each PO.person_id. If you do this within a derived table or CTE, then you simply need to filter to RN = 1 in your final/outer select in order to find the most recent row for each user.
CTE example:
DECLARE @Skill int
SET @Skill = 81
;WITH yourCTE as (
SELECT
P.lastname+', '+P.firstname AS Employee,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days as ExpireInterval,
PO.course_startdate,
DATEADD(DD,SV.schedule_days,PO.course_startdate) as ExpireDate,
ROW_NUMBER() OVER (PARTITION BY PO.person_id ORDER BY PO.course_startdate DESC) as RN
FROM portfolio PO
JOIN person P ON PO.person_id=P.person_id
JOIN e_component EC ON PO.component_id=EC.component_id
JOIN skill_value SV ON EC.component_id=SV.object_id
JOIN skill_description SD ON SV.skill_id=SD.skill_id
JOIN person_custom PC ON P.person_id=PC.person_id
WHERE SD.language_id=26
AND SV.skill_id= @Skill
)
SELECT Employee, external_id, job_title, name,
ExpireInterval, course_startdate, ExpireDate
FROM yourCTE
WHERE RN = 1
I also moved your HAVING conditions to WHERE (and removed one redundant one), shorthanded the INNER JOINs to JOINs (just to be consistent), and removed your GROUP BY and ORDER BY. I didn't see a point to the grouping, but you can add the ORDER BY to the final select if you still want it.
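To see the ROW_NUMBER() pattern in isolation, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The `training` table and its rows are hypothetical stand-ins for the joined portfolio data, and it assumes the bundled SQLite is 3.25 or newer (the first version with window functions).

```python
import sqlite3

# Hypothetical toy data: one row per time a person took a training.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE training (person_id INTEGER, skill_id INTEGER, course_startdate TEXT);
    INSERT INTO training VALUES
        (1, 81, '2017-01-10'),
        (1, 81, '2018-02-15'),
        (2, 81, '2016-06-01'),
        (2, 81, '2018-03-20'),
        (2, 81, '2017-09-09'),
        (3, 82, '2018-05-05');
""")

# Number each person's rows from newest to oldest, then keep only row 1.
latest_per_person = conn.execute("""
    WITH ranked AS (
        SELECT person_id,
               course_startdate,
               ROW_NUMBER() OVER (
                   PARTITION BY person_id
                   ORDER BY course_startdate DESC
               ) AS rn
        FROM training
        WHERE skill_id = ?
    )
    SELECT person_id, course_startdate
    FROM ranked
    WHERE rn = 1
    ORDER BY person_id
""", (81,)).fetchall()

print(latest_per_person)
```

Each person with skill 81 comes back exactly once, with their newest date; person 3 is absent because they only hold a different skill.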

If you take PO.course_startdate out of the GROUP BY and select MAX(PO.course_startdate) instead, you get the most recent date per employee and skill (for the whole list of employees, drop the TOP 1 and the person_id filter), e.g.
DECLARE @Skill int
SET @Skill = 81
SELECT TOP 1
P.lastname+', '+P.firstname AS Employee,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days as ExpireInterval,
max(PO.course_startdate) most_recent_course_startdate,
max(DATEADD(DD,SV.schedule_days,PO.course_startdate)) as ExpireDate
FROM portfolio PO
INNER JOIN person P ON PO.person_id=P.person_id
INNER JOIN e_component EC ON PO.component_id=EC.component_id
JOIN skill_value SV ON EC.component_id=SV.object_id
JOIN skill_description SD ON SV.skill_id=SD.skill_id
JOIN person_custom PC ON P.person_id=PC.person_id
where SD.language_id=26
AND PO.person_id=123456
AND SV.skill_id= @Skill
GROUP BY
PO.person_id,
-- PO.course_startdate is no longer grouped, so MAX() collapses the dates into one row
SV.skill_id,
P.lastname,
P.firstname,
P.external_id,
PC.job_title,
SD.name,
SV.schedule_days,
SD.language_id
ORDER BY Employee, most_recent_course_startdate DESC
Also HAVING is for using aggregate conditions, otherwise just stick to WHERE.

Related

Filtering SELECT TOP WITH TIES When No Records Exist for a Specific Column

Question: How can I filter my results (see below) to exclude erroneous data? I'm guessing my problem is somewhere in the WHERE clause but for the life of me, I can't figure it out.
End Goal: Return NULL values for the CDA_Orientation column where no values exist in the portfolio and e_component tables (e.g. employee has not had Orientation yet)
DB Schema:
Result Set with Errors:
NOTE: The Orientation dates for Eastman, DeLuca, and Fontano are the same date and represent the TOP 1 result from the course_startdate column of the portfolio table.
What I Want the Results to Look Like:
If I've done my JOINS correctly, the CDA_Orientation column should show NULL because there is no entry in the portfolio table (and accordingly, the e_component table) for these three individuals. The entry is only created by the front end when the Employee is assigned to a course.
Here is My Code:
SELECT TOP (1) WITH TIES
P.lastname+', '+P.firstname AS Employee,
P.person_id,
CONVERT(DATE,PC.CDAI_EXP_DATE) AS CDA_Infant,
CONVERT(DATE,PC.CDAP_EXP_DATE) AS CDA_Preschool,
CONVERT(DATE,PO.course_startdate) AS CDA_Orientation
FROM person P
JOIN person_custom PC ON PC.person_id=P.person_id
LEFT JOIN portfolio PO ON P.person_id=PO.person_id
FULL JOIN e_component EC ON PO.component_id=EC.component_id
WHERE (cdai_exp_date IS NOT NULL OR cdap_exp_date IS NOT NULL)
AND PO.course_startdate IN (SELECT course_startdate
FROM portfolio PO
LEFT JOIN e_component EC ON PO.component_id=EC.component_id
WHERE (EC.userdefined_id LIKE '000150%' AND PO.status=11))
ORDER BY ROW_NUMBER() OVER(PARTITION BY P.lastname+', '+P.firstname
ORDER BY PO.person_id)
NOTE: The TOP (1) WITH TIES has successfully pulled the most recent orientation date (employees can have more than one) from the portfolio table for Tarkin and Rust. I've cut out any and all unnecessary JOINS and caveats.
Thanks in advance!
I believe the joins are the issue. Using WITH TIES in that way is also confusing if you're just trying to get a record for each person; I would use a GROUP BY. If you wanted to do it without a sub-query you could do:
SELECT
P.lastname+', '+P.firstname AS Employee,
P.person_id,
CONVERT(DATE,PC.CDAI_EXP_DATE) AS CDA_Infant,
CONVERT(DATE,PC.CDAP_EXP_DATE) AS CDA_Preschool,
MAX(CONVERT(DATE,PO.course_startdate)) AS CDA_Orientation
FROM #person P
JOIN #person_custom PC
ON PC.person_id=P.person_id
LEFT JOIN
(#portfolio PO
JOIN #e_component EC
ON PO.component_id=EC.component_id
AND EC.userdefined_id LIKE '000150%'
AND PO.status=11)
ON P.person_id=PO.person_id
WHERE (cdai_exp_date IS NOT NULL OR cdap_exp_date IS NOT NULL)
GROUP BY P.lastname, P.firstname, P.person_id,PC.CDAI_EXP_DATE,PC.CDAP_EXP_DATE
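The reason the filter belongs in the join condition can be shown with a small sketch. It uses Python's sqlite3 module and hypothetical two-column versions of `person` and `portfolio`, not the real schema: a condition on the outer-joined table placed in WHERE discards the NULL-extended rows, while the same condition in ON keeps them.

```python
import sqlite3

# Hypothetical, simplified tables: Eastman has no status-11 (orientation) row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER, lastname TEXT);
    CREATE TABLE portfolio (person_id INTEGER, status INTEGER, course_startdate TEXT);
    INSERT INTO person VALUES (1, 'Tarkin'), (2, 'Eastman');
    INSERT INTO portfolio VALUES
        (1, 11, '2018-01-05'),
        (1, 11, '2018-04-02'),
        (2, 5,  '2018-02-01');
""")

# Filter inside ON: unmatched people survive with a NULL date.
filter_in_on = conn.execute("""
    SELECT p.lastname, MAX(po.course_startdate) AS orientation
    FROM person p
    LEFT JOIN portfolio po
           ON po.person_id = p.person_id
          AND po.status = 11
    GROUP BY p.person_id, p.lastname
    ORDER BY p.person_id
""").fetchall()

# Same filter in WHERE: the LEFT JOIN effectively becomes an inner join.
filter_in_where = conn.execute("""
    SELECT p.lastname, MAX(po.course_startdate) AS orientation
    FROM person p
    LEFT JOIN portfolio po ON po.person_id = p.person_id
    WHERE po.status = 11
    GROUP BY p.person_id, p.lastname
    ORDER BY p.person_id
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```

In the first result Eastman appears with `None` (NULL) for the orientation date; in the second, Eastman disappears entirely.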

How to find max value in SQL Server 2014?

I have a table named StatementSummary.
SELECT *
FROM StatementSummary
WHERE AccountID = 1234
Results
StatementId StatementDate AccountId AmountDue
-------------------------------------------------
100 2017-10-16 1234 600
99 2017-09-16 1234 500
98 2017-08-16 1234 400
I have another table that has a list of Accounts. I am trying to give results that show the last AmountDue for each account
My code:
SELECT
AccountID,
(SELECT MAX(StatementDate)
FROM StatementSummary
GROUP BY AccountID) LastStatementDate,
AmountDue
FROM
Accounts A
INNER JOIN
StatementSummary S ON A.AccountId = S.AccountId
Basically, I want to show all the details of the last statement for every AccountId.
You can use the SQL Server Windowing functions in cases like this.
SELECT DISTINCT
a.AccountId,
FIRST_VALUE(s.StatementDate) OVER (PARTITION BY s.AccountId ORDER BY s.StatementDate DESC) As LastStatementDate,
FIRST_VALUE(s.AmountDue) OVER (PARTITION BY s.AccountId ORDER BY s.StatementDate DESC) As LastAmountDue
FROM Accounts a
INNER JOIN StatementSummary s
ON a.AccountId = s.AccountId
Basically what happens is the OVER clause creates partitions in your data, in this case, by the account number (these partitions are the windows). We then tell SQL Server to sort the data within each partition by the statement date in descending order, so the last statement will be at the top of the partition, and then the FIRST_VALUE function is used to just grab the first row.
Finally, since you perform this operation for every account/statement combo between the two tables, you need the DISTINCT to say you just want one copy of each row for each account.
There are quite a few useful things you can do with the windowing functions in SQL Server. This article gives a good introduction to them: https://www.red-gate.com/simple-talk/sql/learn-sql-server/window-functions-in-sql-server/
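As a rough illustration of why the DISTINCT is needed, the sketch below runs the same FIRST_VALUE idea in SQLite through Python's sqlite3 module (assuming SQLite 3.25+ for window functions). The rows for account 1234 are the ones from the question; account 5678 is a hypothetical extra row added for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StatementSummary (
        StatementId INTEGER, StatementDate TEXT, AccountId INTEGER, AmountDue INTEGER);
    INSERT INTO StatementSummary VALUES
        (100, '2017-10-16', 1234, 600),
        (99,  '2017-09-16', 1234, 500),
        (98,  '2017-08-16', 1234, 400),
        (97,  '2017-10-01', 5678, 50);
""")

# The window functions are evaluated for every row, so every statement row
# carries its account's latest values.
window_sql = """
    SELECT {distinct}
           AccountId,
           FIRST_VALUE(StatementDate) OVER (
               PARTITION BY AccountId ORDER BY StatementDate DESC) AS LastStatementDate,
           FIRST_VALUE(AmountDue) OVER (
               PARTITION BY AccountId ORDER BY StatementDate DESC) AS LastAmountDue
    FROM StatementSummary
    ORDER BY AccountId
"""

without_distinct = conn.execute(window_sql.format(distinct="")).fetchall()
with_distinct = conn.execute(window_sql.format(distinct="DISTINCT")).fetchall()

print(len(without_distinct))  # one output row per statement
print(with_distinct)          # one output row per account
```

Without DISTINCT there is one (identical) output row per statement; with it, the duplicates collapse to one row per account.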
Derived table over row_number and left join - to display all accounts regardless of whether there is a statement:
select a.*, l.statementdate, l.amountdue
from Accounts a
left outer join
(select row_number() over (partition by accountid order by statementdate desc) rn,
accountid, statementdate, amountdue
from StatementSummary
) l
on l.accountid = a.accountid
and l.rn = 1
"That sounds like a job for me," sings the lateral join, a.k.a. CROSS APPLY in T-SQL.
SELECT a.*, last_ss.*
FROM Accounts A
cross apply (
select top 1 *
from StatementSummary S
where S.AccountId = A.AccountId
order by StatementDate desc
) last_ss
Alternatively you can use a CTE to get the last date for each account:
; with l as (
select accountid, max(StatementDate) as StatementDate
from StatementSummary
group by accountid
)
select ...
from Accounts a
inner join l on l.accountid = a.accountid
inner join StatementSummary ss on ss.accountid = a.accountid
and l.StatementDate = ss.StatementDate
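The join-back-to-the-maximum approach can be tried out with a small sketch in Python's sqlite3 module. The `Accounts` rows and the statement for account 5678 are hypothetical; account 9999 has no statements, to show that inner joins drop it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Accounts (AccountId INTEGER);
    CREATE TABLE StatementSummary (
        StatementId INTEGER, StatementDate TEXT, AccountId INTEGER, AmountDue INTEGER);
    INSERT INTO Accounts VALUES (1234), (5678), (9999);
    INSERT INTO StatementSummary VALUES
        (100, '2017-10-16', 1234, 600),
        (99,  '2017-09-16', 1234, 500),
        (98,  '2017-08-16', 1234, 400),
        (97,  '2017-10-01', 5678, 50);
""")

# Step 1 (CTE): latest statement date per account.
# Step 2: join back to StatementSummary on account + date to fetch the full row.
last_statements = conn.execute("""
    WITH l AS (
        SELECT AccountId, MAX(StatementDate) AS StatementDate
        FROM StatementSummary
        GROUP BY AccountId
    )
    SELECT a.AccountId, ss.StatementId, ss.StatementDate, ss.AmountDue
    FROM Accounts a
    JOIN l ON l.AccountId = a.AccountId
    JOIN StatementSummary ss
      ON ss.AccountId = a.AccountId
     AND ss.StatementDate = l.StatementDate
    ORDER BY a.AccountId
""").fetchall()

print(last_statements)
```

Note that if two statements for one account share the same latest date, this pattern returns both of them, whereas a row-number filter would return one.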

How to make this complex query more efficient?

I want to select employees, having more than 10 products and older than 50. I also want to have their last product selected. I use the following query:
SELECT
PE.EmployeeID, E.Name, E.Age,
COUNT(*) as ProductCount,
(SELECT TOP(1) xP.Name
FROM ProductEmployee xPE
INNER JOIN Product xP ON xPE.ProductID = xP.ID
WHERE xPE.EmployeeID = PE.EmployeeID
AND xPE.Date = MAX(PE.Date)) as LastProductName
FROM
ProductEmployee PE
INNER JOIN
Employee E ON PE.EmployeeID = E.ID
WHERE
E.Age > 50
GROUP BY
PE.EmployeeID, E.Name, E.Age
HAVING
COUNT(*) > 10
Here is the execution plan link: https://www.dropbox.com/s/rlp3bx10ty3c1mf/ximExPlan.sqlplan?dl=0
However it takes too much time to execute it. What's wrong with it? Is it possible to make a more efficient query?
I have one limitation - I can not use CTE. I believe it will not bring performance here anyway though.
Before creating Index I believe we can restructure the query.
Your query can be rewritten like this
SELECT E.ID,
E.NAME,
E.Age,
CS.ProductCount,
CS.LastProductName
FROM Employee E
CROSS apply(SELECT TOP 1 P.NAME AS LastProductName,
ProductCount
FROM (SELECT *,
Count(1)OVER(partition BY EmployeeID) AS ProductCount -- to find product count for each employee
FROM ProductEmployee PE
WHERE PE.EmployeeID = E.Id) PE
JOIN Product P
ON PE.ProductID = P.ID
WHERE ProductCount > 10 -- to filter the employees who is having more than 10 products
ORDER BY date DESC) CS -- To find the latest sold product
WHERE age > 50
This should work:
SELECT *
FROM Employee AS E
INNER JOIN (
SELECT PE.EmployeeID
FROM ProductEmployee AS PE
GROUP BY PE.EmployeeID
HAVING COUNT(*) > 10
) AS PE
ON PE.EmployeeID = E.ID
CROSS APPLY (
SELECT TOP (1) P.*
FROM Product AS P
INNER JOIN ProductEmployee AS PE2
ON PE2.ProductID = P.ID
WHERE PE2.EmployeeID = E.ID
ORDER BY PE2.Date DESC
) AS P
WHERE E.Age > 50;
Proper indexes should speed the query up.
You're filtering by Age, so the following one should help:
CREATE INDEX ix_Employee_Age_Name
ON Employee (Age, Name);
The subquery that finds employees with more than 10 records should be calculated first, and CROSS APPLY with the TOP operator should bring back the data more efficiently than comparing against a MAX value.
The answer by @Prdp is great, but I thought I'd drop an alternative in. Sometimes windowed functions do not perform very well and it's worth replacing them with good old subqueries.
Also, do not use datetime, use datetime2. This is suggested by Microsoft:
https://msdn.microsoft.com/en-us/library/ms187819.aspx
Use the time, date, datetime2 and datetimeoffset data
types for new work. These types align with the SQL Standard. They are
more portable. time, datetime2 and datetimeoffset provide
more seconds precision. datetimeoffset provides time zone support
for globally deployed applications.
By the way, here's a tip. Try to name your surrogate primary keys after the table, so they become more meaningful and joins feel more natural. For example:
In Employee table replace ID with EmployeeID
In Product table replace ID with ProductID
I find these a good practice.
with usersOver50with10productsOrMore (employeeID, productID, date, id, name, age, products ) as (
select employeeID, productID, date, id, name, age, count(productID) from productEmployee
join employee on productEmployee.employeeID = employee.id
where age >= 50
group by employeeID, productID, date, id, name, age
having count(productID) >= 10
)
select sfq.name, sfq.age, pro.name, sfq.products, max(date) from usersOver50with10productsOrMore as sfq
join product pro on sfq.productID = pro.id
group by sfq.name, sfq.age, pro.name, sfq.products
;
There is no need to find the last productID for the entire table, just filter the last product from the results of employees with 10 or more products and over the age of 50.

DB2 return first match

In DB2 for i (a.k.a. DB2/400) at V6R1, I want to write a SQL SELECT statement that returns some columns from a header record and some columns from ONLY ONE of the matching detail records. It can be ANY of the matching records, but I only want info from ONE of them. I am able to accomplish this with the following query below, but I'm thinking that there has to be an easier way than using a WITH clause. I'll use it if I need it, but I keep thinking, "There must be an easier way". Essentially, I'm just returning the firstName and lastName from the Person table ... plus ONE of the matching email-addresses from the PersonEmail table.
Thanks!
with theMinimumOnes as (
select personId,
min(emailType) as emailType
from PersonEmail
group by personId
)
select p.personId,
p.firstName,
p.lastName,
pe.emailAddress
from Person p
left outer join theMinimumOnes tmo
on tmo.personId = p.personId
left outer join PersonEmail pe
on pe.personId = tmo.personId
and pe.emailType = tmo.emailType
PERSONID  FIRSTNAME  LASTNAME  EMAILADDRESS
1         Bill       Ward      p1@home.com
2         Tony       Iommi     p2@cell.com
3         Geezer     Butler    p3@home.com
4         John       Osbourne  -
This sounds like a job for row_number():
select p.personId, p.firstName, p.lastName, pe.emailAddress
from Person p left outer join
(select pe.*,
row_number() over (partition by personId order by personId) as seqnum
from PersonEmail pe
) pe
on pe.personId = p.personId and seqnum = 1;
If it truly does not matter which row is selected from the PersonEmail file, there is little reason to run either a summary query or an OLAP query to pick that row: the MIN aggregate in the CTE implies an ordering, and row_number() requests one explicitly. The following use of the FETCH FIRST clause should suffice, with no ordering requirement on the secondary file. It simply returns any matching row, which is likely to be the first or last by personId key but depends entirely on how the query is implemented, which might not use a key at all:
select p.personId, p.firstName, p.lastName
, pe.emailAddress
from Person as p
left outer join lateral
( select pe.*
from PersonEmail pe
where pe.personId = p.personId
fetch first 1 row only
) as pe
on p.personId = pe.personId

How to retrieve results for top items

I have run a query to give me the total number of students within each school, but now I need to know the names of those students within each school while keeping the school with the highest total at the top. How can I add to this query to show me the names of the students?
Here is what I have to show me the total number of students at each school:
SELECT
dbo_Schools.Schools,
Count(dbo_tStudent.Student) AS NumberOfStudents
FROM
dbo_tStudent
INNER JOIN dbo_tSchools ON dbo_tStudent.SchoolID=dbo_tSchool.SchoolID
GROUP BY dbo_tSchool.School
ORDER BY Count(dbo_tStudent.Student) DESC;
It's important that I keep the schools in order from the highest number of students while listing the students.
In this case you could use a subquery to achieve your result set.
To use ORDER BY inside a subquery you also need a TOP or LIMIT operator, but that ordering does not carry over to the outer query, so the outer SELECT needs its own ORDER BY.
SELECT sc.schoolname
,st.columns...
FROM dbo_tStudent st
INNER JOIN (
SELECT TOP 1000 dbo_tSchools.SchoolID
,min(schoolname) schoolname
,Count(dbo_tStudent.Student) AS NumberOfStudents
FROM dbo_tStudent
INNER JOIN dbo_tSchools ON dbo_tStudent.SchoolID = dbo_tSchools.SchoolID
GROUP BY dbo_tSchools.SchoolID
ORDER BY Count(dbo_tStudent.Student) DESC
) sc ON st.SchoolID = sc.SchoolID
ORDER BY sc.NumberOfStudents DESC, sc.schoolname
Assuming that you are using SQL Server, you can use a CTE to join the first aggregate with the details like this:
;WITH cte as (
SELECT TOP 1000 dbo_tSchools.SchoolID, Count(dbo_tStudent.Student) AS NumberOfStudents
FROM
dbo_tStudent
INNER JOIN dbo_tSchools ON dbo_tStudent.SchoolID = dbo_tSchools.SchoolID
GROUP BY dbo_tSchools.SchoolID
ORDER BY Count(dbo_tStudent.Student) DESC
)
SELECT
sc.<your school name column>,
st.<your student columns>
from
dbo_tStudent st
INNER JOIN cte ON st.SchoolID = cte.SchoolID
INNER JOIN dbo_tSchools sc on cte.SchoolID = sc.SchoolID
ORDER BY cte.NumberOfStudents DESC
More generally speaking: you need a derived table (your aggregation containing the group by clause) that is joined with the select statement for the student details. In this example, the CTE basically is a SQL Server feature that facilitates the use of derived tables.
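The derived-table idea can be sketched end to end with Python's sqlite3 module. The `school` and `student` tables and their column names here are hypothetical simplifications, not the asker's schema; the point is that the per-school count is computed once in a derived table, joined to the detail rows, and used in the outer ORDER BY.

```python
import sqlite3

# Hypothetical toy schema: South has three students, North has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE school (school_id INTEGER, school_name TEXT);
    CREATE TABLE student (student_name TEXT, school_id INTEGER);
    INSERT INTO school VALUES (1, 'North'), (2, 'South');
    INSERT INTO student VALUES ('Ann', 1), ('Bob', 2), ('Cy', 2), ('Di', 2);
""")

# Aggregate in a derived table, join it to each student row, and sort the
# final result by the count so the largest school's students come first.
rows = conn.execute("""
    SELECT sc.school_name, counts.number_of_students, st.student_name
    FROM student st
    JOIN (
        SELECT school_id, COUNT(*) AS number_of_students
        FROM student
        GROUP BY school_id
    ) counts ON counts.school_id = st.school_id
    JOIN school sc ON sc.school_id = st.school_id
    ORDER BY counts.number_of_students DESC, sc.school_name, st.student_name
""").fetchall()

for row in rows:
    print(row)
```

The sort lives only in the outermost query, so no TOP or LIMIT is needed inside the derived table in this variant.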