Finding Total Tenure of Employees with Multiple Stints in the Company - sql

I have a dataset of employee history, including hire and termination data, that goes back to Jan 1, 2018. There is no history before 2018, but I do have original hire dates and continuous service dates from before then. For example, I might have a data point showing that employee 123456 was termed on 2/15/2018 and had a hire date of 1/9/2014. But for anyone who was termed prior to 2018 and never rehired, there is no data.
The logic I implemented was as follows:
Always consider the earliest of the original hire date or continuous service date as the date of first hire.
If earliest hire was before 2018 and the employee's first action in the database was to be rehired, then assume 1/1/2018 was the first termination date.
Add the hire and term activity after 1/1/2018
Add up all the days employed from all their stints
Here is some example data from one employee:
Employee ID  Employee Original Hire Date  Employment Status  Employee Termination Date  Employee Trend Date  Employee Action  Service Date
123456       2015-03-31                   Leave              2019-06-24                 2018-01-01           Data Changes     2020-02-24
123456       2015-03-31                   Active             2019-06-24                 2018-02-26           Leave            2020-02-24
123456       2015-03-31                   Active             2019-06-24                 2019-02-04           Leave            2020-02-24
123456       2015-03-31                   Term               2019-06-24                 2019-06-24           Voluntary        2020-02-24
123456       2015-03-31                   Active             2022-06-17                 2020-02-24           Rehire Employee  -
123456       2015-03-31                   Active             2022-06-17                 2020-02-26           Transfer         2020-02-24
123456       2015-03-31                   Leave              2022-06-17                 2020-11-23           Leave            2020-02-24
123456       2015-03-31                   Active             2022-06-17                 2021-02-22           Leave            2020-02-24
123456       2015-03-31                   Leave              2022-06-17                 2021-11-12           Leave            2020-02-24
123456       2015-03-31                   Leave              2022-06-17                 2021-12-27           Data Changes     2020-02-24
123456       2015-03-31                   Active             2022-06-17                 2022-02-13           Leave            2020-02-24
123456       2015-03-31                   Term               2022-06-17                 2022-06-17           Involuntary      2020-02-24
Note: Voluntary and Involuntary both refer to being Termed
And here is an intermediate step of my function:
stint  Hired       Termed      stintlength
1      2015-03-31  2019-06-24  1546
2      2020-02-24  2022-06-17  844
The final output simply sums the stintlength values and returns the total. So the function seems to work.
The problem is it is slow. Super slow. About 3,500 rows a minute slow. I'd like to improve the speed by 10x.
Here is the code:
DECLARE @eeid int
DECLARE @lastupdated date
DECLARE @firsttrend date
DECLARE @firststatus nvarchar(50)
DECLARE @firstservicedate date
DECLARE @activejan2018 bit
DECLARE @mintermdate date
DECLARE @firsttermdate date
DECLARE @return int
SET @eeid = 143914
SET @lastupdated = (SELECT MAX(lastupdated) FROM cleandata.employeefullhistory2)
SET @firsttrend = (SELECT MIN(Employee_Trend_Date) FROM cleandata.employeefullhistory2 WHERE Employee_ID = @eeid)
SET @mintermdate = (SELECT MIN(Employee_Trend_Date) FROM cleandata.employeefullhistory2 WHERE Employee_ID = @eeid AND (Employee_Action = 'Voluntary' OR Employee_Action = 'Involuntary'))
SET @firstservicedate = (SELECT IIF(MIN(Employee_Original_Hire_Date)>MIN(Service_Date),MIN(Service_Date),MIN(Employee_Original_Hire_Date)) FROM cleandata.employeefullhistory2 WHERE Employee_ID = @eeid)
SET @firststatus = (SELECT TOP 1 Employee_Action FROM cleandata.employeefullhistory2 WHERE Employee_ID = @eeid AND Employee_Trend_Date = @firsttrend)
SET @activejan2018 = CASE WHEN @firstservicedate <= '2018-01-01' AND NOT (@firststatus = 'Rehire Employee' OR @firststatus = 'Hire Employee') THEN 1 ELSE 0 END
SET @firsttermdate = CASE WHEN @firstservicedate <= '2018-01-01' AND @activejan2018 = 0 THEN '2018-01-01' ELSE @mintermdate END
SET @return = (SELECT SUM(stintlength) FROM (
    SELECT
        stint
        ,[Hired]
        ,CASE WHEN [Termed] IS NULL THEN @lastupdated ELSE [Termed] END Termed
        ,DATEDIFF(DAY,[Hired],CASE WHEN [Termed] IS NULL THEN @lastupdated ELSE [Termed] END) stintlength
    FROM (
        SELECT
            t.trenddate
            ,t.status
            ,ROW_NUMBER() OVER (PARTITION BY t.status ORDER BY t.trenddate asc) stint
        FROM (
            SELECT @firstservicedate trenddate, 'Hired' status
            UNION
            SELECT @firsttermdate trenddate, 'Termed'
            UNION
            SELECT
                Employee_Trend_Date
                ,CASE WHEN Employee_Action = 'Voluntary' OR Employee_Action = 'Involuntary' THEN 'Termed' ELSE 'Hired' END
            FROM cleandata.employeefullhistory2
            WHERE Employee_ID = @eeid AND (Employee_Action = 'Voluntary' OR Employee_Action = 'Involuntary' OR Employee_Action = 'Hire Employee' OR Employee_Action = 'Rehire Employee')
        ) t
    ) t
    PIVOT (
        MIN(trenddate)
        FOR status IN ([Hired],[Termed])
    ) pt
) t)
SELECT @return

I think you've probably made it much more complicated than it should be.
Rather than doing a PIVOT etc, you can simply 'group' the relevant rows and get max/min dates - and then do any modifications to those to account for results prior to 1/1/2018. Then (as you have), simply sum the stint_lengths.
I have a running example in this db<>fiddle - note though that I have called the table #employeefullhistory2.
I would suggest the first step (which is used in the later code) is to just pull out the relevant data e.g.,
SELECT Employee_ID,
Employee_Original_Hire_Date,
Employee_Termination_Date,
MIN(Employee_Trend_Date) AS First_Trend_Date,
MAX(Employee_Trend_Date) AS Last_Trend_Date
FROM #employeefullhistory2
GROUP BY Employee_ID,
Employee_Original_Hire_Date,
Employee_Termination_Date;
The above uses a simple GROUP BY with MIN/MAX calculations to work out the key dates. Results as below.
Employee_ID Employee_Original_Hire_Date Employee_Termination_Date First_Trend_Date Last_Trend_Date
123456 2015-03-31 2019-06-24 2018-01-01 2019-06-24
123456 2015-03-31 2022-06-17 2020-02-24 2022-06-17
Once you have this, it's easy enough to calculate the relevant Stint_Lengths (see the db<>fiddle linked above for that) and then take the total.
The full/final SQL query is below.
WITH EmpDates AS
    (SELECT Employee_ID,
            Employee_Original_Hire_Date,
            Employee_Termination_Date,
            MIN(Employee_Trend_Date) AS First_Trend_Date,
            MAX(Employee_Trend_Date) AS Last_Trend_Date
     FROM #employeefullhistory2
     GROUP BY Employee_ID,
              Employee_Original_Hire_Date,
              Employee_Termination_Date
    ),
EmpStints AS
    (SELECT *,
            DATEDIFF(day,
                     CASE WHEN First_Trend_Date = '2018-01-01'
                          THEN Employee_Original_Hire_Date
                          ELSE First_Trend_Date END,
                     ISNULL(Employee_Termination_Date, Last_Trend_Date)
                    ) AS Stint_Length
     FROM EmpDates
    )
SELECT Employee_ID, SUM(Stint_Length) AS Total_Stint_Length
FROM EmpStints
GROUP BY Employee_ID;
The results are as follows
Employee_ID Total_Stint_Length
123456 2390
I think the simplicity of the above approach will make it run much faster. If you need the results for a single employee (e.g., you pass a specific Employee_ID) then it's worthwhile on the original data set to have an index on Employee_ID.
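As a rough sketch of that index, assuming the table is cleandata.employeefullhistory2 as in the question: the index name and the INCLUDE list are my own choices, picked so that the grouping query above could be answered from the index alone.

```sql
-- Hypothetical index name; key on Employee_ID as suggested above.
-- The INCLUDE columns are an assumption: they are the other columns
-- the stint query reads, so lookups for one employee can stay in the index.
CREATE NONCLUSTERED INDEX IX_employeefullhistory2_Employee_ID
    ON cleandata.employeefullhistory2 (Employee_ID)
    INCLUDE (Employee_Original_Hire_Date,
             Employee_Termination_Date,
             Employee_Trend_Date);
```

Whether this helps in practice depends on the table's existing indexes and size, so it is worth checking the execution plan before and after.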
Note: For current employees, I have used the last trend date in place of a termination date when calculating stint length (the ISNULL(Employee_Termination_Date, Last_Trend_Date) expression). This means that someone who was hired (say) 2 years ago and is still employed (i.e., has no termination date) will have a value of ~700 days for the current stint, counted up to their latest trend record rather than up to today. Feel free to modify the approach as needed.
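One possible modification is to end an open stint at a report date instead of the last trend date. Below is a minimal sketch of that idea. The temp table #emp_demo, the variable @asof and employee 999999 are made up for illustration; the rows for 123456 are the first and last rows of each stint from the question.

```sql
-- Demo table with only the columns the query needs (hypothetical name).
CREATE TABLE #emp_demo (
    Employee_ID                 int,
    Employee_Original_Hire_Date date,
    Employee_Termination_Date   date NULL,
    Employee_Trend_Date         date
);

INSERT INTO #emp_demo VALUES
    (123456, '2015-03-31', '2019-06-24', '2018-01-01'),
    (123456, '2015-03-31', '2019-06-24', '2019-06-24'),
    (123456, '2015-03-31', '2022-06-17', '2020-02-24'),
    (123456, '2015-03-31', '2022-06-17', '2022-06-17'),
    -- Hypothetical current employee: hired 2021-03-01, never terminated.
    (999999, '2021-03-01', NULL,         '2021-03-01'),
    (999999, '2021-03-01', NULL,         '2021-09-01');

-- Fixed here so the result is reproducible. In practice this could be
-- CAST(GETDATE() AS date) or the MAX(lastupdated) value from the question.
DECLARE @asof date = '2023-03-01';

WITH EmpDates AS
    (SELECT Employee_ID,
            Employee_Original_Hire_Date,
            Employee_Termination_Date,
            MIN(Employee_Trend_Date) AS First_Trend_Date
     FROM #emp_demo
     GROUP BY Employee_ID,
              Employee_Original_Hire_Date,
              Employee_Termination_Date
    )
SELECT Employee_ID,
       SUM(DATEDIFF(day,
                    CASE WHEN First_Trend_Date = '2018-01-01'
                         THEN Employee_Original_Hire_Date
                         ELSE First_Trend_Date END,
                    -- Open stints end at the report date, not the last trend date.
                    ISNULL(Employee_Termination_Date, @asof)
                   )) AS Total_Stint_Length
FROM EmpDates
GROUP BY Employee_ID;
-- 123456 -> 2390 (1546 + 844), 999999 -> 730
```

With the report date above, the still-employed 999999 gets the full two years (730 days) instead of only the days up to their latest trend record.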

Related

query to find the records between two tables for a given date range

I have an employee table that records which department an employee belonged to during a given period. At any point in time an employee can belong to only one department. The enddate column holds the last date the employee stayed in a particular department. If enddate holds a future date, that row is the employee's latest department.
empid  deptname   startdate    enddate
1      sales      jan-20-2022  jan-24-2022
1      marketing  jan-25-2022  feb-03-2022
1      support    feb-04-2022  feb-06-2022
1      training   feb-07-2022  dec-31-2050
I have a call details table which holds the information of which employee took the call and what is call start time and call end time.
call_id  empid  callstart_time        callendtime
10       1      jan-21-2022 10:00:00  jan-21-2022 10:30:00
11       1      jan-21-2022 10:40:00  jan-21-2022 10:45:00
12       1      feb-01-2022 11:20:00  feb-01-2022 11:30:00
13       1      feb-05-2022 09:00:00  feb-05-2022 10:00:00
14       1      feb-08-2022 10:00:00  feb-08-2022 11:00:00
Now my question is:
I am looking for input and a sample query to find out which department an employee was in at the time they took a call.
For example, if I want to know which calls an employee took from jan-20-2022 to feb-02-2022 and what their department was at the time of each call, I need the below output.
call_id  empid  callstart_time        callendtime           deptname
10       1      jan-21-2022 10:00:00  jan-21-2022 10:30:00  sales
11       1      jan-21-2022 10:40:00  jan-21-2022 10:45:00  sales
12       1      feb-01-2022 11:20:00  feb-01-2022 11:30:00  marketing
If i run the query for a date range from feb-04-2022 to feb-10-2022, i want to see the below output
call_id  empid  callstart_time        callendtime           deptname
13       1      feb-05-2022 09:00:00  feb-05-2022 10:00:00  support
14       1      feb-08-2022 10:00:00  feb-08-2022 11:00:00  training
Please share some input on how to achieve this output with a SQL query.
A CROSS APPLY lets you define a subselect to pick the applicable employee record. In your case, the latest employee record prior to the call. Something like:
SELECT C.*, E.deptname
FROM calldetails C
CROSS APPLY (
SELECT TOP 1 *
FROM employee E
WHERE E.empid = C.empid
AND E.startdate <= C.callstart_time
ORDER BY E.startdate DESC
) E
ORDER BY C.callstart_time
See this db<>fiddle
Or since you have end date, a simple join will do
SELECT C.*, E.deptname
FROM calldetails C
JOIN employee E
ON E.empid = C.empid
AND E.startdate <= C.callstart_time
AND E.enddate > DATEADD(day, -1, C.callstart_time)
ORDER BY C.callstart_time
Note the date adjustment needed for the enddate comparison. This is needed because you are using inclusive end dates. Using exclusive end dates (where enddate equals the startdate of the next record) works much better for range checks and calculations.
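To illustrate the exclusive end date idea, here is a minimal sketch using table variables in place of the real tables. It assumes the employee rows have been rewritten so that each enddate equals the next row's startdate, and it adds the report range from the question through two made-up variables, @rangestart and @rangeend.

```sql
-- Sample data from the question, but with EXCLUSIVE end dates
-- (each enddate equals the next row's startdate). This is an assumption,
-- not how the question's table is stored.
DECLARE @employee TABLE (empid int, deptname varchar(20), startdate date, enddate date);
INSERT INTO @employee VALUES
    (1, 'sales',     '2022-01-20', '2022-01-25'),
    (1, 'marketing', '2022-01-25', '2022-02-04'),
    (1, 'support',   '2022-02-04', '2022-02-07'),
    (1, 'training',  '2022-02-07', '2050-12-31');

DECLARE @calldetails TABLE (call_id int, empid int,
                            callstart_time datetime2(0), callendtime datetime2(0));
INSERT INTO @calldetails VALUES
    (10, 1, '2022-01-21 10:00:00', '2022-01-21 10:30:00'),
    (11, 1, '2022-01-21 10:40:00', '2022-01-21 10:45:00'),
    (12, 1, '2022-02-01 11:20:00', '2022-02-01 11:30:00'),
    (13, 1, '2022-02-05 09:00:00', '2022-02-05 10:00:00'),
    (14, 1, '2022-02-08 10:00:00', '2022-02-08 11:00:00');

-- Hypothetical report range; @rangeend is the last day to include.
DECLARE @rangestart date = '2022-01-20';
DECLARE @rangeend   date = '2022-02-02';

SELECT C.call_id, C.empid, C.callstart_time, C.callendtime, E.deptname
FROM @calldetails C
JOIN @employee E
    ON  E.empid = C.empid
    AND E.startdate <= C.callstart_time
    AND C.callstart_time < E.enddate              -- half-open range, no DATEADD needed
WHERE C.callstart_time >= @rangestart
  AND C.callstart_time <  DATEADD(day, 1, @rangeend)
ORDER BY C.callstart_time;
-- Returns calls 10 and 11 (sales) and call 12 (marketing)
```

The same half-open pattern is used for the report range, so a call late on the last day of the range is still included.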

Using Distinct and MAX(date) in a large data

I have a table that stores the list of users who have accessed a product(with the accessed date).
I have written the below query to get the list of users who have accessed the product B between '2021-02-01' and '2021-02-26'.
SELECT DISTINCT UserName,Country,ADate,Product FROM Report WHERE UserName != '-' and Product='B' and (CAST(ADate AS DATE) BETWEEN @startdate AND @enddate)
then it gives the below result:
UserName Country ADate Product
-------- ------ -------- ---------
asson IN 2021-02-10 00:00:00.000 B
rajan US 2021-02-23 00:00:00.000 B
rajan US 2021-02-25 00:00:00.000 B
moody US 2021-02-14 00:00:00.000 B
rajon US 2021-02-01 00:00:00.000 B
lukman US 2021-02-10 00:00:00.000 B
Since the user rajan accessed the product on 2 days, there are 2 entries for rajan even though I added DISTINCT. So I modified the query as below:
SELECT DISTINCT UserName,Country,max(ADate),Product FROM Report WHERE UserName != '-' and Product='B' and (CAST(ADate AS DATE) BETWEEN @startdate AND @enddate) group by UserName,Country,Product
This query gives me the required result. But the problem I am facing now is that when I query a range longer than a month (say, data spanning 2 months), some data is missing from the output. I believe it might be due to the MAX(ADate). Can anyone suggest a good way to fix this?
This will give you the latest access date of each user by month
SELECT UserName, Country, MONTH(ADate) AS [month], MAX(ADate) AS LastAccess, Product FROM Report WHERE UserName != '-' AND Product = 'B' GROUP BY UserName, Country, MONTH(ADate), Product
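One caveat: grouping on the month number alone puts February 2021 and February 2022 in the same group. A minimal sketch that groups by year and month and keeps the date range filter might look like this. The table variable @Report stands in for the real table, and the March row for rajan is made up to show a second month.

```sql
DECLARE @Report TABLE (UserName varchar(50), Country char(2),
                       ADate datetime2(3), Product char(1));
INSERT INTO @Report VALUES
    ('asson', 'IN', '2021-02-10', 'B'),
    ('rajan', 'US', '2021-02-23', 'B'),
    ('rajan', 'US', '2021-02-25', 'B'),
    ('rajan', 'US', '2021-03-05', 'B');   -- hypothetical row in a second month

DECLARE @startdate date = '2021-02-01';
DECLARE @enddate   date = '2021-03-31';

SELECT UserName, Country, Product,
       YEAR(ADate)  AS AccessYear,
       MONTH(ADate) AS AccessMonth,
       MAX(ADate)   AS LastAccess
FROM @Report
WHERE UserName <> '-'
  AND Product = 'B'
  AND ADate >= @startdate
  AND ADate <  DATEADD(day, 1, @enddate)   -- includes all of @enddate
GROUP BY UserName, Country, Product, YEAR(ADate), MONTH(ADate)
ORDER BY UserName, AccessYear, AccessMonth;
```

Here rajan gets one row for February (latest access 2021-02-25) and one for March, rather than a single row for the whole range.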

Can not understand the logic of this query

This query is trying to get the s1ppmp (the price of the product) for each s1ilie (size) and s1iref (reference), as of s1ydat (the latest date) for the price, because one product can have more than one price on different dates, for example a Black Friday price versus the normal price on other days.
The anmoisjour column comes from the calendar table, but there is no connection between the calendar table and the main table msk110, so I don't understand the logic of this query.
SELECT s1isoc,
       s1ilie,
       s1iref,
       s1ydat,
       anmoisjour,
       s1ppmp
FROM msk110
     INNER JOIN (SELECT s1isoc AS isoc,
                        s1ilie AS ilie,
                        s1iref AS iref,
                        MAX(s1ydat) AS ydat,
                        anmoisjour
                 FROM calendrier,
                      msk110
                 WHERE s1ydat <= anmoisjour
                   AND anmoisjour BETWEEN 20100101 AND 20302131
                 GROUP BY s1isoc,
                          s1ilie,
                          s1iref,
                          anmoisjour) a ON s1isoc = isoc
                                       AND s1ilie = ilie
                                       AND s1iref = iref
                                       AND s1ydat = ydat
WHERE s1isoc = 1
  AND anmoisjour BETWEEN 20100101 AND 20302131
ORDER BY anmoisjour,
         s1ydat;
s1isoc, s1ilie, s1iref, s1ydat, and s1ppmp come from msk110, and anmoisjour belongs to the calendar table, which is a date table.
I believe the confusion is the way that the calendar table is joined.
If anmoisjour is the day column of the calendar table and this table holds 1 row per day, the WHERE filter anmoisjour BETWEEN 20100101 AND 20302131 restricts calendrier to one row for each day over roughly 20 years (2010 to 2030).
The product prices table msk110 is not linked to the calendar table calendrier by an exact date match, but by a "latest date so far" condition (msk110.s1ydat <= calendrier.anmoisjour). This means that, for example, a row with an msk110.s1ydat of 2015-01-01 will join against every row of the calendar table that's between 2015-01-01 and 2030-12-31.
The GROUP BY includes the calendar table's date (calendrier.anmoisjour). This means that if a particular product and size has prices on different dates, say only on 2015-01-01, 2017-01-01 and 2020-01-01, then the result of the GROUP BY would be the following (ordered by calendar date, with NULL rows shown only to demonstrate):
MAX(s1ydat) anmoisjour
null 2010-01-01
null ...
null 2014-12-31
2015-01-01 2015-01-01
2015-01-01 2015-01-02
2015-01-01 ...
2015-01-01 2016-01-01
2015-01-01 ...
2017-01-01 2017-01-01
2017-01-01 2017-01-02
2017-01-01 ...
2017-01-01 2019-12-31
2020-01-01 2020-01-01
2020-01-01 2025-01-01
2020-01-01 ...
What your query shows is, for each day over those 20 years, the product table row that carries the most recent price date on or before that day, restricted to s1isoc = 1 (I don't know what that column means).
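If it helps to see the same logic without the derived table, here is a sketch of what I believe is an equivalent formulation, using an explicit join and a correlated subquery. It assumes calendrier holds one row per day and that anmoisjour and s1ydat are numeric yyyymmdd values. The aliases c, p and p2 are mine.

```sql
SELECT p.s1isoc,
       p.s1ilie,
       p.s1iref,
       p.s1ydat,
       c.anmoisjour,
       p.s1ppmp
FROM calendrier c
     INNER JOIN msk110 p
             ON p.s1ydat <= c.anmoisjour
WHERE p.s1isoc = 1
  -- Upper bound written as 20301231 here; with numeric yyyymmdd values the
  -- original 20302131 selects the same calendar days.
  AND c.anmoisjour BETWEEN 20100101 AND 20301231
  -- Keep only the most recent price row on or before each calendar day.
  AND p.s1ydat = (SELECT MAX(p2.s1ydat)
                  FROM msk110 p2
                  WHERE p2.s1isoc = p.s1isoc
                    AND p2.s1ilie = p.s1ilie
                    AND p2.s1iref = p.s1iref
                    AND p2.s1ydat <= c.anmoisjour)
ORDER BY c.anmoisjour,
         p.s1ydat;
```

Read this way, the calendar table is only there to generate one output row per day, and each day picks up whichever price was in effect on that day.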

Expanding/changing my query to find more entries using (potentially) IFELSE

My question will use this dataset as an example. I have a query setup (I have changed variables to more generic variables for the sake of posting this on the internet so the query may not make perfect sense) that picks the most recent date for a given account. So the query returns values with a reason_type of 1 with the most recent date. This query has effective_date set to is not null.
account date effective_date value reason_type
123456 4/20/2017 5/1/2017 5 1
123456 1/20/2017 2/1/2017 10 1
987654 2/5/2018 3/1/2018 15 1
987654 12/31/2017 2/1/2018 20 1
456789 4/27/2018 5/1/2018 50 1
456789 1/24/2018 2/1/2018 60 1
456123 4/25/2017 null 15 2
789123 5/1/2017 null 16 2
666888 2/1/2018 null 31 2
333222 1/1/2018 null 20 2
What I am looking to do now is to basically use that logic to only apply to reason_type
if there is an entry for it, otherwise have it default to reason_type
I think I should be using an IFELSE, but I'm admittedly not knowledgeable about how I would go about that.
Here is the code that I currently have to return the reason_type 1s most recent entry.
I hope my question is clear.
SELECT account, date, effective_date, value, reason_type
from
(
    SELECT account, date, effective_date, value, reason_type,
           ROW_NUMBER() over (partition by account order by date desc) rn
    from mytable
    WHERE value is not null
    AND effective_date is not null
)
WHERE rn = 1
I think you might want something like this (do you really have a column named date by the way? That seems like a bad idea):
SELECT account, date, effective_date, value, reason_type
FROM (
    SELECT account, date, effective_date, value, reason_type
         , ROW_NUMBER() OVER ( PARTITION BY account ORDER BY date DESC ) AS rn
    FROM mytable
    WHERE value IS NOT NULL
) WHERE rn = 1
-- effective_date IS NULL or is on or before today's date
  AND ( effective_date IS NULL OR effective_date < TRUNC(SYSDATE+1) );
Hope this helps.

Teradata SQL: Determine how many accounts had status change in given month

Ok, so I have a table that looks something like this:
Acct_id Eff_dt Expr_dt Prod_cd Open_dt
-------------------------------------------------------
111 2012-05-01 2013-06-01 A 2012-05-01
111 2013-06-02 2014-03-08 A 2012-05-01
111 2014-03-09 9999-12-31 B 2012-05-01
222 2015-07-15 2015-11-11 A 2015-07-15
222 2015-11-12 2016-08-08 B 2015-07-15
222 2016-08-09 9999-12-31 A 2015-07-15
333 2016-01-01 2016-04-15 B 2016-01-01
333 2016-04-16 2016-08-08 B 2016-01-01
333 2016-08-09 9999-12-31 A 2016-01-01
444 2017-02-03 2017-05-15 A 2017-02-03
444 2017-05-16 2017-12-02 A 2017-02-03
444 2017-12-03 9999-12-31 B 2017-02-03
555 2017-12-12 9999-12-31 B 2017-12-12
There are many more columns that I'm not including as they're otherwise not relevant.
What I'm trying to determine is how many accounts had a change in Prod_cd in a given month, but then only in one direction (so from A > B in this example). Sometimes however an account was first opened as B, and then later changed to A. Or it was opened as A, changed to B, and moved back to A. I only want to know the current set of accounts where in a given month the Prod_cd changed from A to B.
Eff_dt is the date when a change was made to an account (could be any change, such as address change, name change, or what I'm looking for, product code change).
Expr_dt is the expiration date of that row, essentially the last day before a new change was made. When the date of that row is 9999-12-31, that's the most current row.
Open_dt is the date the account was created.
I created a query at first that was something like this:
select
count(distinct acct_id)
from table
where prod_cd = 'B'
and expr_dt = '9999-12-31'
and eff_dt between '2017-12-01' and '2017-12-31'
and open_dt < '2017-12-01'
But it's giving me results that don't look right. I want to specifically track the # of conversions that happened, but the count of accounts I'm getting seems way too high.
There is probably a way to create a more reliable query using window functions, but given that the Prod_cd changes can happen in multiple directions, I'm not sure how to write that query. Any help would be appreciated!
If you are specifically looking for the switch A --> B, then the simplest method is to use lag(). But, Teradata requires a slightly different formulation:
select count(distinct acct_id)
from (select t.*,
             max(prod_cd) over (partition by acct_id order by eff_dt rows between 1 preceding and 1 preceding) as prev_prod_cd
      from t
     ) t
where prod_cd = 'B' and prev_prod_cd = 'A' and
      expr_dt = '9999-12-31' and
      eff_dt between '2017-12-01' and '2017-12-31' and
      open_dt < '2017-12-01';
I am guessing that the date conditions go in the outer query -- meaning that the lag() does not use them.
Similar to Gordon's answer, but using a supported window function (instead of LAG) and using Teradata's QUALIFY clause to do the lag-gy lookup:
SELECT DISTINCT acct_id
FROM mytable
QUALIFY
MAX(prod_cd) OVER (PARTITION BY acct_id ORDER BY eff_dt ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) = 'A'
AND prod_cd = 'B'
AND expr_dt = '9999-12-31'
AND eff_dt between DATE '2013-01-01' and DATE '2017-12-31'
AND open_dt < DATE '2017-12-01'
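For completeness, if your Teradata release supports LAG (an assumption on my part; older releases do not, which is why the answers above use MAX with ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), the same idea can be sketched more directly. The table name mytable and the December 2017 window are taken from the question and answers above.

```sql
SELECT COUNT(DISTINCT acct_id) AS converted_accounts
FROM (
    SELECT acct_id,
           prod_cd,
           eff_dt,
           expr_dt,
           open_dt,
           -- Product code on the previous row of the same account
           -- (NULL for the account's first row).
           LAG(prod_cd) OVER (PARTITION BY acct_id ORDER BY eff_dt) AS prev_prod_cd
    FROM mytable
) t
WHERE prod_cd = 'B'
  AND prev_prod_cd = 'A'                 -- only A -> B switches
  AND expr_dt = DATE '9999-12-31'        -- row is still current
  AND eff_dt BETWEEN DATE '2017-12-01' AND DATE '2017-12-31'
  AND open_dt < DATE '2017-12-01';
```

On the sample rows in the question, only account 444 qualifies: its current row switched from A to B on 2017-12-03, while 555 was opened as B and has no previous row.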