How to use variable lag window functions? (SQL)

I have a table with the following schema:
CREATE TABLE example (
    userID INT,
    status VARCHAR(10), -- 'SUCCESS' or 'FAIL'
    date   DATE         -- event date
);
INSERT INTO example
VALUES (123, 'SUCCESS', '2021-10-10'),
       (123, 'SUCCESS', '2021-10-11'),
       (123, 'SUCCESS', '2021-10-28'),
       (123, 'FAIL',    '2021-10-29'),
       (123, 'SUCCESS', '2021-11-05'),
       (123, 'SUCCESS', '2021-11-10')
I am trying to use a LAG or LEAD function to assess whether the current row is within a 2-week window of the previous 'SUCCESS'. Given the data above, I would expect an isWithin2WeeksofSuccessFlag as follows:
123, 'SUCCESS', '2021-10-10', 0 -- first instance, so no prior success
123, 'SUCCESS', '2021-10-11', 1
123, 'SUCCESS', '2021-10-28', 1
123, 'FAIL',    '2021-10-29', 1 -- failed, but the criterion is being within 2 weeks of the last success, which it is
123, 'SUCCESS', '2021-11-05', 1 -- last success is 2 rows back, but it is within 2 weeks
123, 'SUCCESS', '2021-11-28', 0 -- outside of 2 weeks
I would initially think to do something like this:
SELECT userID, status, date,
       CASE WHEN LAG(status, 1) OVER (PARTITION BY userID ORDER BY date ASC) = 'SUCCESS'
             AND date_add('day', -14, date) <= LAG(date, 1) OVER (PARTITION BY userID ORDER BY date ASC)
            THEN 1 END AS isWithin2WeeksofSuccessFlag
FROM example
This would work if I didn't have the 'FAIL' row in there. To handle it, I could change the lag offset to 2 (instead of 1), but what if I have 2, 3, 4, n 'FAIL's in a row? I would then need to lag by 3, 4, 5, n+1. The number of FAILs in between is variable. How do I handle this variability?
NOTE: I am querying billions of rows. Efficiency isn't really a concern (since this is for analysis), but running into memory allocation errors is. Endlessly stacking more window functions would likely get the query terminated automatically because its memory requirement exceeds the node limit.
How should I handle this?
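One way to sidestep the variable lag entirely is a conditional MAX over a window frame that stops one row before the current one: the most recent prior 'SUCCESS' date then comes out of a single window function, no matter how many 'FAIL's sit in between. A minimal sketch, assuming the same Presto/Trino-style date functions as in the attempt above:

SELECT userID, status, date,
       CASE WHEN MAX(CASE WHEN status = 'SUCCESS' THEN date END)
                 OVER (PARTITION BY userID ORDER BY date
                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                 >= date_add('day', -14, date)
            THEN 1 ELSE 0 END AS isWithin2WeeksofSuccessFlag
FROM example

For the first row per user the frame is empty, the MAX is NULL, and the flag falls through to 0, matching the expected output.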

Here's an approach, also using window functions, with each common table expression (CTE) handling one step at a time: cte1 assigns a running group number that increments at each 'SUCCESS', cte2 finds the success date within each group, and cte3 flags each row whose previous row's success date falls within two weeks.
Note: The expected result in the question does not match the data in the question: '2021-11-28' doesn't exist in the actual data. I used the example INSERT statement as given.
In the test case, I changed the column name to xdate to avoid any potential SQL reserved-word issues.
The SQL:
WITH cte1 AS (
    SELECT *
         , SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END)
               OVER (PARTITION BY userID ORDER BY xdate) AS grp
    FROM example
)
, cte2 AS (
    SELECT *
         , MAX(CASE WHEN status = 'SUCCESS' THEN xdate END)
               OVER (PARTITION BY userID, grp) AS lastdate
    FROM cte1
)
, cte3 AS (
    SELECT *
         , CASE WHEN LAG(lastdate) OVER (PARTITION BY userID ORDER BY xdate)
                     > (xdate - INTERVAL '2' WEEK)
                THEN 1 ELSE 0 END AS isNear
    FROM cte2
)
SELECT * FROM cte3
ORDER BY userID, xdate
;
The result:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 123 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 123 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 123 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 123 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 123 | SUCCESS | 2021-11-10 | 5 | 2021-11-10 | 1 |
+--------+---------+------------+------+------------+--------+
and with the data adjusted to match your expected result, plus a new user introduced, the result is this:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 123 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 123 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 123 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 123 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 123 | SUCCESS | 2021-11-28 | 5 | 2021-11-28 | 0 |
| 323 | SUCCESS | 2021-10-10 | 1 | 2021-10-10 | 0 |
| 323 | SUCCESS | 2021-10-11 | 2 | 2021-10-11 | 1 |
| 323 | SUCCESS | 2021-10-28 | 3 | 2021-10-28 | 0 |
| 323 | FAIL | 2021-10-29 | 3 | 2021-10-28 | 1 |
| 323 | SUCCESS | 2021-11-05 | 4 | 2021-11-05 | 1 |
| 323 | SUCCESS | 2021-11-28 | 5 | 2021-11-28 | 0 |
+--------+---------+------------+------+------------+--------+
Here's an extra test case, which might expose problems in some solutions:
INSERT INTO example VALUES
(123, 'SUCCESS', '2021-10-11')
, (123, 'FAIL' , '2021-10-12')
, (123, 'FAIL' , '2021-10-13')
;
The result:
+--------+---------+------------+------+------------+--------+
| userID | status | xdate | grp | lastdate | isNear |
+--------+---------+------------+------+------------+--------+
| 123 | SUCCESS | 2021-10-11 | 1 | 2021-10-11 | 0 |
| 123 | FAIL | 2021-10-12 | 1 | 2021-10-11 | 1 |
| 123 | FAIL | 2021-10-13 | 1 | 2021-10-11 | 1 |
+--------+---------+------------+------+------------+--------+

If your DBMS doesn't support window function filters, you can ORDER BY status DESC so 'SUCCESS' sorts before 'FAIL'.
select userID, status, date,
       case when lag(status, 1) over (partition by userID order by status desc, date asc) = 'SUCCESS'
             and dateadd(d, -14, date) <= lag(date, 1) over (partition by userID order by status desc, date asc)
            then 1 end as isWithin2WeeksofSuccessFlag
from example
order by date
Sql Server fiddle
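For reference, the filtered form alluded to above could look like the following sketch, assuming PostgreSQL, where FILTER is allowed on aggregate functions used as window functions:

select userID, status, date,
       case when max(date) filter (where status = 'SUCCESS')
                 over (partition by userID order by date
                       rows between unbounded preceding and 1 preceding)
                 >= date - interval '14 days'
            then 1 else 0 end as isWithin2WeeksofSuccessFlag
from example;

Unlike the ORDER BY status trick, this keeps the window in pure date order, so any number of consecutive 'FAIL' rows is handled.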

Related

How do I count occurrences with conditions in PostgreSQL?

I'm working in PostgreSQL. Suppose I have this Person table:
| id | time                  | name  | log_type    |
|----|-----------------------|-------|-------------|
| 1 | 2022-04-25 07:49:58.0 | Brian | Rejection 1 |
| 2 | 2022-04-25 07:49:58.0 | Brian | Rejection 2 |
| 3 | 2022-04-27 13:05:51.0 | Fredd | Rejection 1 |
| 4 | 2022-05-01 02:13:44.0 | Janet | Rejection 1 |
| 5 | 2022-05-01 03:45:06.0 | Janet | Rejection 2 |
| 6 | 2022-05-01 08:01:34.0 | Peter | Approval |
| 7 | 2022-05-01 12:12:53.0 | Frank | Rejection 2 |
| 8 | 2022-05-02 01:26:38.0 | Frank | Approval |
Note: we have 2 rejection types, Rejection 1 and Rejection 2.
I would like to make a query that counts the number of rejections and the number of approvals for each name. However, if there are 2 rejections at the same time for the same name, like the first two rows in the example, they should only count as one.
Let me just add that it's possible for there to be one of each rejection type at the same time for the same name, but it's impossible for there to be two rejections of the same type at the same time for the same name.
So this is what I'm expecting it to return:
| name | approvals | rejections |
----------------------------------
| Brian | 0 | 1 |
| Fredd | 0 | 1 |
| Janet | 0 | 2 |
| Peter | 1 | 0 |
| Frank | 1 | 1 |
The closest I could get to this is the following:
SELECT
name,
COALESCE(SUM(CASE WHEN log_type = 'Approval' THEN 1 ELSE 0 END), 0) approvals,
COALESCE(SUM(CASE WHEN log_type = 'Rejection 1' OR log_type = 'Rejection 2' THEN 1 ELSE 0 END), 0) rejections
FROM
person
GROUP BY
name
The problem with this is that it counts two rejections with same time and name as 2 instead of 1.
You can use DISTINCT inside COUNT() to count the distinct times where log_type is one of the 'Rejection' types:
SELECT name,
COUNT(CASE WHEN log_type = 'Approval' THEN 1 END) approvals,
COUNT(DISTINCT CASE WHEN log_type IN ('Rejection 1', 'Rejection 2') THEN time END) rejections
FROM person
GROUP BY name;
See the demo.
Use ROW_NUMBER to remove duplicates, then use a simple count query to find the counts:
SELECT
name,
COUNT(*) FILTER (WHERE log_type = 'Approval') approvals,
COUNT(*) FILTER (WHERE log_type LIKE 'Rejection%') rejections
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY time, name, SUBSTRING(log_type FROM '\w+')) rn
FROM person
) t
WHERE rn = 1
GROUP BY name;
We can fetch the date in the CASE and then use COUNT(DISTINCT ...), which ignores nulls.
I have given a first query to show the intermediate results, and the counts with and without DISTINCT, to show what it is doing. I have used the test LEFT(log_type,6) = 'Reject' to group the 2 rejection types.
I suggest that it would be a good idea to round the time, so that 2 rejections close together will be treated as repetitions. With the current queries, even a 1-second difference will be treated as a different rejection.
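A minimal sketch of that rounding idea, assuming PostgreSQL's date_trunc and a timestamp column (the demo below declares time as date, which already truncates to the day):

SELECT name,
       COUNT(DISTINCT CASE WHEN LEFT(log_type,6) = 'Reject'
                           THEN date_trunc('minute', time) END) AS distinct_rejections
FROM person
GROUP BY name;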
create table person(
id int,
time date,
name varchar(20),
log_type varchar(20));
insert into person values
( 1,'2022-04-25 07:49:58.0','Brian','Rejection 1'),
( 2,'2022-04-25 07:49:58.0','Brian','Rejection 2'),
( 3,'2022-04-27 13:05:51.0','Fredd','Rejection 1'),
( 4,'2022-05-01 02:13:44.0','Janet','Rejection 1'),
( 5,'2022-05-01 03:45:06.0','Janet','Rejection 2'),
( 6,'2022-05-01 08:01:34.0','Peter','Approval'),
( 7,'2022-05-01 12:12:53.0','Frank','Rejection 2'),
( 8,'2022-05-02 01:26:38.0','Frank','Approval');
SELECT
name,
CASE WHEN LEFT(log_type,6) = 'Reject' THEN time END R,
CASE WHEN log_type = 'Approval' THEN time END A
FROM person;
name | r | a
:---- | :--------- | :---------
Brian | 2022-04-25 | null
Brian | 2022-04-25 | null
Fredd | 2022-04-27 | null
Janet | 2022-05-01 | null
Janet | 2022-05-01 | null
Peter | null | 2022-05-01
Frank | 2022-05-01 | null
Frank | null | 2022-05-02
SELECT
name,
COUNT(CASE WHEN LEFT(log_type,6) = 'Reject' THEN time END) all_rejections,
COUNT(CASE WHEN log_type = 'Approval' THEN time END) all_approvals,
COUNT(DISTINCT CASE WHEN LEFT(log_type,6) = 'Reject' THEN time END) distinct_rejections,
COUNT(DISTINCT CASE WHEN log_type = 'Approval' THEN time END) distinct_approvals
FROM person
GROUP BY name;
name | all_rejections | all_approvals | distinct_rejections | distinct_approvals
:---- | -------------: | ------------: | ------------------: | -----------------:
Brian | 2 | 0 | 1 | 0
Frank | 1 | 1 | 1 | 1
Fredd | 1 | 0 | 1 | 0
Janet | 2 | 0 | 1 | 0
Peter | 0 | 1 | 0 | 1
db<>fiddle here

Replace value with last value where flag was set to 1

I have a table where each row contains all the fields that changed during some event, plus a flag per field indicating whether that field was updated. For simplicity I only show the "status" field here, but there are several other fields as well.
Where a given field was not modified by the event, both the field and its flag are set to null.
+----+---------------------+--------+---------------------+
| id | date | status | flag_changed_status |
+----+---------------------+--------+---------------------+
| 1 | 2020-01-03 19:32:17 | TODO | 1 |
| 1 | 2020-01-08 15:46:07 | WIP | 1 |
| 1 | 2020-01-08 15:53:53 | | | //this line was generated because another field changed
| 1 | 2020-01-08 15:56:53 | | | //this line was generated because another field changed
| 1 | 2020-01-08 16:02:31 | Done | 1 |
+----+---------------------+--------+---------------------+
My goal is to replace the field values for the rows where the field was not changed with the last value it had when the flag was equal to one, e.g get :
+----+---------------------+--------+---------------------+
| id | date | status | flag_changed_status |
+----+---------------------+--------+---------------------+
| 1 | 2020-01-03 19:32:17 | TODO | 1 |
| 1 | 2020-01-08 15:46:07 | WIP | 1 |
| 1 | 2020-01-08 15:53:53 | WIP | |
| 1 | 2020-01-08 15:56:53 | WIP | |
| 1 | 2020-01-08 16:02:31 | Done | 1 |
+----+---------------------+--------+---------------------+
I understand that I want to use the last_value analytic function in BigQuery, and I tried:
SELECT id, date, status,
       last_value(status) OVER (ORDER BY flag_changed_status, date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS current_status,
       flag_changed_status
FROM table
ORDER BY id, date
The idea was that, by using the flag in the ORDER BY, the rows where the flag is null would be placed first, so that last_value(status) would return the last value where flag_changed_status was set to 1.
But this can only work if I use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, because the ORDER BY clause is processed before the window frame clause (ROWS BETWEEN ...): for the rows where flag_changed_status is null, the ORDER BY moves them to the start of the partition, so the frame ending at CURRENT ROW never reaches a flagged row, and the last value between UNBOUNDED PRECEDING and CURRENT ROW is always null.
Is there any way to first apply the ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING and then the ORDER BY, so that last_value(status) returns the last value preceding the current row where the flag was set to one? Or is there something much simpler, still using analytic functions, that would let me complete all the different fields in one query?
Edit :
I really want to copy the status that was set the last time the flag was set, even if this status is null, that is why I am trying to use the flag in the order by. That is if the initial table is :
+----+---------------------+--------+---------------------+
| id | date | status | flag_changed_status |
+----+---------------------+--------+---------------------+
| 1 | 2020-01-03 19:32:17 | TODO | 1 |
| 1 | 2020-01-08 15:46:07 | null | 1 |
| 1 | 2020-01-08 15:53:53 | null | null |
| 1 | 2020-01-08 15:56:53 | null | null |
| 1 | 2020-01-08 15:57:53 | WIP | 1 |
| 1 | 2020-01-08 15:58:53 | null | null |
| 1 | 2020-01-08 16:02:31 | Done | 1 |
+----+---------------------+--------+---------------------+
I would need:
+----+---------------------+--------+---------------------+
| id | date | status | flag_changed_status |
+----+---------------------+--------+---------------------+
| 1 | 2020-01-03 19:32:17 | TODO | 1 |
| 1 | 2020-01-08 15:46:07 | null | 1 |
| 1 | 2020-01-08 15:53:53 | null | null | // we copy the last status where the flag was 1, and it is null
| 1 | 2020-01-08 15:56:53 | null | null |
| 1 | 2020-01-08 15:57:53 | WIP | 1 |
| 1 | 2020-01-08 15:58:53 | WIP | null | //only this line changes
| 1 | 2020-01-08 16:02:31 | Done | 1 |
+----+---------------------+--------+---------------------+
But it seems to be too complicated, so I will just replace all the nulls where the flag is set to 1 with a custom status, and then a simple last_value(status IGNORE NULLS), as #gordon-linoff was suggesting, will provide almost the desired result.
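A sketch of that sentinel workaround, assuming BigQuery Standard SQL; the '(none)' marker value is a hypothetical placeholder:

SELECT * EXCEPT(masked_status),
       NULLIF(LAST_VALUE(masked_status IGNORE NULLS)
                  OVER (PARTITION BY id ORDER BY `date`), '(none)') AS imputed_status
FROM (
  SELECT *,
         -- keep the status (or the marker for NULL) only on flagged rows
         CASE WHEN flag_changed_status = 1
              THEN IFNULL(status, '(none)') END AS masked_status
  FROM `project.dataset.table`
)

NULLIF turns the marker back into NULL at the end, so a flagged row whose status was genuinely null propagates null, as required above.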
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(grp),
LAST_VALUE(status IGNORE NULLS) OVER (PARTITION BY grp ORDER BY date) AS updated_status
FROM (
SELECT *,
COUNTIF(flag_changed_status = 1) OVER(ORDER BY `date`) grp
FROM `project.dataset.table`
)
Applied to the sample data from your question, the result is:
Row id date status flag_changed_status updated_status
1 1 2020-01-03 19:32:17 TODO 1 TODO
2 1 2020-01-08 15:46:07 null 1 null
3 1 2020-01-08 15:53:53 null null null
4 1 2020-01-08 15:56:53 null null null
5 1 2020-01-08 15:57:53 WIP 1 WIP
6 1 2020-01-08 15:58:53 null null WIP
7 1 2020-01-08 16:02:31 Done 1 Done
I prefer lag(ignore nulls). But BigQuery doesn't support that. Instead, use first_value()/last_value():
with t as (
select 1 as id, '2020-01-03 19:32:17' as date, 'TODO' as status, 1 as file_changed_status union all
select 1 as id, '2020-01-08 15:46:07' as date, 'WIP ' as status, 1 as file_changed_status union all
select 1 as id, '2020-01-08 15:53:53' as date, null as status, null as file_changed_status union all
select 1 as id, '2020-01-08 15:56:53' as date, null as status, null as file_changed_status union all
select 1 as id, '2020-01-08 16:02:31' as date, 'Done' as status, 1 as file_changed_status
)
select t.*,
last_value(status ignore nulls) over (order by date) as imputed_status
from t;

SQL: how to check for neither overlapping nor holes in payment records

I have a table PaymentSchedules with percentage info, and the from/to dates for which those
percentages are valid, resource by resource:
| auto_numbered | res_id | date_start | date_end | org | pct |
|---------------+--------+------------+------------+-------+-----|
| 1 | A | 2018-01-01 | 2019-06-30 | One | 100 |
| 2 | A | 2019-07-01 | (NULL) | One | 60 |
| 3 | A | 2019-07-02 | 2019-12-31 | Two | 40 |
| 4 | A | 2020-01-01 | (NULL) | Two | 40 |
| 5 | B | (NULL) | (NULL) | Three | 100 |
| 6 | C | 2018-01-01 | (NULL) | One | 100 |
| 7 | C | 2019-11-01 | (NULL) | Four | 100 |
(Records #3 and #4 could be summarized onto just one line, but duplicated on purpose, to show that there are many combinations of date_start and date_end.)
A quick reading of the data:
Org "One" is fully paying for resource A up to 2019-06-30; then it continues
to pay 60% of the cost, while the rest (40%) is paid by org "Two" from
2019-07-02.
This should have begun on 2019-07-01... a small encoding error, provoking a 1-day gap.
Org "Three" is fully paying for resource B, at all times.
Org "One" is fully paying for resource C from 2018-01-01... but, starting on
2019-11-01, org "Four" is paying for it too...
... and there lies an encoding error: we have 200% of resource C being
taken into account since 2019-11-01: record #6 should have been closed
(date_end set to 2019-10-31), but hasn't been...
So, when we generate a financial report for the year 2019 (from 2019-01-01 to
2019-12-31), we will have calculation errors...
So, question: how can we make sure we don't have overlapping payments for
resources, or -- also the contrary -- "holes" for some period of times?
How is it possible to write an SQL query to check that there are neither
underpaid nor overpaid resources? That is, every resource in the table should be
paid, for every single day of the financial period being looked at, by one or
more organizations, in such a way that the summed-up percentage is always
exactly 100%.
I don't see how to proceed with such a query. Anybody able to give hints, to put
me on track?
EDIT -- Working with both SQL Server and Oracle.
EDIT -- I don't own the DB, I can't add triggers or views. I need to be able to detect things "after the facts"... Need to easily spot the conflictual records, or the "missing" ones (in case of "period holes"), fix them by hand, and then re-run the financial report.
EDIT -- If we make an analysis for 2019, the following report would be desired:
| res_id | pct_sum | date |
|--------+---------+------------|
| A | 60 | 2019-07-01 |
| C | 200 | 2019-11-01 |
| C | 200 | 2019-11-02 |
| C | 200 | ... |
| C | 200 | ... |
| C | 200 | ... |
| C | 200 | 2019-12-30 |
| C | 200 | 2019-12-31 |
or, of course, an even better version -- certainly unobtainable? -- where each
type of problem would appear only once, with the relevant date range for
which the problem is observed:
| res_id | pct_sum | date_start | date_end |
|--------+---------+------------+------------|
| A | 60 | 2019-07-01 | 2019-07-01 |
| C | 200 | 2019-11-01 | 2019-12-31 |
EDIT -- Fiddle code: db<>fiddle here
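As a day-granularity starting point, a calendar CTE joined to the schedules and summed per resource and day flags both overlaps and holes in one pass; a sketch, assuming SQL Server and a fixed 2019 reporting window:

WITH days AS (
    SELECT CAST('2019-01-01' AS date) AS d
    UNION ALL
    SELECT DATEADD(day, 1, d) FROM days WHERE d < '2019-12-31'
),
res AS (
    SELECT DISTINCT res_id FROM PaymentSchedules
)
SELECT r.res_id, d.d AS [date], COALESCE(SUM(p.pct), 0) AS pct_sum
FROM days d
CROSS JOIN res r
LEFT JOIN PaymentSchedules p
       ON p.res_id = r.res_id
      AND d.d >= COALESCE(p.date_start, d.d) -- NULL start = open-ended in the past
      AND d.d <= COALESCE(p.date_end, d.d)   -- NULL end = still running
GROUP BY r.res_id, d.d
HAVING COALESCE(SUM(p.pct), 0) <> 100
ORDER BY r.res_id, d.d
OPTION (MAXRECURSION 366);

This yields the per-day report from the question (A at 60 on 2019-07-01, C at 200 from 2019-11-01 on); collapsing consecutive days into ranges would be a separate gaps-and-islands step.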
Here's an incomplete attempt for SQL Server.
Basically, the idea was to use a recursive CTE to unfold months for each res_id,
then left join 'what could be' to the existing date ranges.
But I doubt it can be done in SQL that would work for both Oracle and MS SQL Server.
Sure, both have window functions and CTEs.
But the datetime functions are rarely the same in different RDBMSs.
So I give up.
Maybe someone else finds an easier solution.
create table PaymentSchedules
(
auto_numbered int identity(1,1) primary key,
res_id varchar(30),
date_start date,
date_end date,
org varchar(30),
pct decimal(3,0)
)
GO
insert into PaymentSchedules
(res_id, org, pct, date_start, date_end)
values
('A', 'One', 100, '2018-01-01', '2018-06-30')
, ('A', 'One', 100, '2019-01-01', '2019-06-30')
, ('A', 'One', 60, '2019-07-01', null)
, ('A', 'Two', 40, '2019-07-02', '2019-12-31')
, ('A', 'Two', 40, '2020-01-01', null)
, ('B', 'Three', 100, null, null)
, ('C', 'One', 100, '2018-01-01', null)
, ('C', 'Four', 100, '2019-11-01', null)
;
GO
declare @MaxEndDate date;
set @MaxEndDate = (select max(iif(date_start > date_end, date_start, isnull(date_end, date_start))) from PaymentSchedules);
;with rcte as
(
select res_id
, datefromparts(year(min(date_start)), month(min(date_start)), 1) as month_start
, eomonth(coalesce(max(date_end), @MaxEndDate)) as month_end
, 0 as lvl
from PaymentSchedules
group by res_id
having min(date_start) is not null
union all
select res_id
, dateadd(month, 1, month_start)
, month_end
, lvl + 1
from rcte
where dateadd(month, 1, month_start) < month_end
)
, cte_gai as
(
select c.res_id, c.month_start, c.month_end
, t.org, t.pct, t.auto_numbered
, sum(isnull(t.pct,0)) over (partition by c.res_id, c.month_start) as res_month_pct
, count(t.auto_numbered) over (partition by c.res_id, c.month_start) as cnt
from rcte c
left join PaymentSchedules t
on t.res_id = c.res_id
and c.month_start >= datefromparts(year(t.date_start), month(t.date_start), 1)
and c.month_start <= coalesce(t.date_end, @MaxEndDate)
)
select *
from cte_gai
where res_month_pct <> 100
order by res_id, month_start
GO
res_id | month_start | month_end | org | pct | auto_numbered | res_month_pct | cnt
:----- | :---------- | :--------- | :--- | :--- | ------------: | :------------ | --:
A | 2018-07-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-08-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-09-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-10-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-11-01 | 2019-12-31 | null | null | null | 0 | 0
A | 2018-12-01 | 2019-12-31 | null | null | null | 0 | 0
C | 2019-11-01 | 2020-01-31 | One | 100 | 7 | 200 | 2
C | 2019-11-01 | 2020-01-31 | Four | 100 | 8 | 200 | 2
C | 2019-12-01 | 2020-01-31 | One | 100 | 7 | 200 | 2
C | 2019-12-01 | 2020-01-31 | Four | 100 | 8 | 200 | 2
C | 2020-01-01 | 2020-01-31 | One | 100 | 7 | 200 | 2
C | 2020-01-01 | 2020-01-31 | Four | 100 | 8 | 200 | 2
db<>fiddle here
I am not giving the full answer here, but I think you are after cursors (https://learn.microsoft.com/en-us/sql/t-sql/language-elements/declare-cursor-transact-sql?view=sql-server-ver15).
A cursor allows you to iterate through the table, checking all of the records.
This is considered bad practice: even though the idea is really good, cursors are quite heavy and slow, and they block the involved tables.
Some people rewrite cursors as loops (WHILE, probably), so you need to understand the cursor, work out how you would implement it, and then translate it into a loop (https://www.sqlbook.com/advanced/sql-cursors-how-to-avoid-them/).
Also, views can be helpful, but I am assuming that you know how to use them already.
The algorithm should be something like (a T-SQL skeleton of steps 2 and 3 is sketched after this list):
1. Have table1 and table2, where table2 is a copy of table1 (https://www.tutorialrepublic.com/sql-tutorial/sql-cloning-tables.php).
2. Iterate through all of the records of table1 (in the first instance I would use a cursor for this), picking up one record from table1 at a time.
3. If it has overlapping dates (check it against table2), do something.
4. Else, do something else.
5. Pick another record from table1 and go to step 2.
6. Drop the unnecessary tables.
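A minimal T-SQL skeleton of that idea, assuming the PaymentSchedules table from the question and flagging only overlaps; what to 'do' on a hit is left open, as in the algorithm above:

DECLARE @id int, @res varchar(30), @start date, @end date;

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT auto_numbered, res_id, date_start, date_end
    FROM PaymentSchedules;

OPEN cur;
FETCH NEXT FROM cur INTO @id, @res, @start, @end;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- step 3: classic interval-overlap test against every other record
    -- of the same resource, treating NULL dates as open-ended
    IF EXISTS (
        SELECT 1
        FROM PaymentSchedules o
        WHERE o.res_id = @res
          AND o.auto_numbered <> @id
          AND COALESCE(o.date_start, '00010101') <= COALESCE(@end, '99991231')
          AND COALESCE(@start, '00010101') <= COALESCE(o.date_end, '99991231')
    )
        PRINT CONCAT('record ', @id, ' overlaps another schedule for resource ', @res);

    FETCH NEXT FROM cur INTO @id, @res, @start, @end;
END;

CLOSE cur;
DEALLOCATE cur;

Note that an overlap by itself is not necessarily an error (resource A is legitimately split 60/40), so a real check would also sum pct over the overlapping records, as in the calendar sketch above.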

SQL - Selecting employees that have ended most recent contract but have other contracts open

I've been going around in circles trying to figure this one out.
I'm trying to select employees who have ended their most recent contract but still have an older contract open.
For example, an employee has several contracts (some may be temporary or part-time; this is irrelevant) and ends their most recent contract, yet continues in their older contracts.
Please see the table below as to what I'm trying to achieve - with relevant fields:
+------+-------------+-------------+------------+------------+
| ID | CONTRACT_ID | EMPLOYEE_ID | START_DATE | END_DATE |
+------+-------------+-------------+------------+------------+
| 4321 | 974 | 321 | 21/01/2004 | 31/12/2016 |
+------+-------------+-------------+------------+------------+
| 4322 | 1485 | 321 | 09/01/2009 | 31/08/2014 |
+------+-------------+-------------+------------+------------+
| 4323 | NULL | 321 | 25/07/2009 | 31/01/2010 |
+------+-------------+-------------+------------+------------+
| 4324 | 2440 | 321 | 01/06/2012 | NULL |
+------+-------------+-------------+------------+------------+
| 4325 | 7368 | 321 | 01/01/2017 | NULL |
+------+-------------+-------------+------------+------------+
| 4326 | 7612 | 321 | 14/02/2017 | 06/06/2017 |
+------+-------------+-------------+------------+------------+
Here is the code I currently have, which is not bringing back the correct data:
select
cond.EMPLOYEE_ID
,cond.END_DATE
from
contracts as cond
join
(select
EMPLOYEE_ID
,START_DATE
,END_DATE
from
contracts
where
END_DATE is null) a on a.EMPLOYEE_ID = cond.employee_id and a.START_DATE <
cond.END_DATE
group by cond.end_date, cond.EMPLOYEE_ID
having
max(cond.START_DATE) is not null AND cond.END_DATE is not null
This is what the code results in (example):
+------+-------------+-------------+------------+------------+
| ID | CONTRACT_ID | EMPLOYEE_ID | START_DATE | END_DATE |
+------+-------------+-------------+------------+------------+
| 1234 | NULL | 123 | 03/12/2014 | 26/10/2015 |
+------+-------------+-------------+------------+------------+
| 1235 | NULL | 123 | 30/10/2015 | 28/01/2016 |
+------+-------------+-------------+------------+------------+
| 1236 | NULL | 123 | 06/11/2015 | 28/01/2016 |
+------+-------------+-------------+------------+------------+
| 1237 | 1234 | 123 | 07/03/2016 | NULL |
+------+-------------+-------------+------------+------------+
| 1238 | NULL | 123 | 04/04/2017 | 13/04/2017 |
+------+-------------+-------------+------------+------------+
| 1239 | NULL | 123 | 18/04/2017 | NULL |
+------+-------------+-------------+------------+------------+
As you can see, the most recent contract does not have an end date, i.e. it is still open, so this employee should not have been returned.
Any help much appreciated.
Using cross apply() to get the most recent start_date and end_date, and the count of open contracts via the windowed aggregate count() over():
select
c.id
, c.contract_id
, c.employee_id
, start_date
, end_date
, max_start_date = x.start_date
, max_end_date = x.end_date
, x.open_contracts
from contracts c
cross apply (
select top 1
i.start_date
, i.end_date
, open_contracts = count(case when i.end_date is null then 1 end) over(partition by i.employee_id)
from contracts i
where i.employee_id = c.employee_id
order by i.start_date desc
) x
where x.end_date is not null
and x.open_contracts > 0
order by c.employee_id, c.start_date asc
test setup with some additional cases:
create table contracts (id int, contract_id int, employee_id int, start_date date, end_date date);
insert into contracts values
(4321, 974, 321, '20040121', '20161231')
,(4322, 1485, 321, '20090109', '20140831')
,(4323, null, 321, '20090725', '20100131')
,(4324, 2440, 321, '20120601', null)
,(4325, 7368, 321, '20170101', null)
,(4326, 7612, 321, '20170214', '20170606')
,(1, 1, 1, '20160101', null)
,(2, 2, 1, '20160701', '20161231')
,(3, 3, 1, '20170101', null) /* most recent is open, do not return */
,(4, 4, 2, '20160101', '20170630')
,(5, 5, 2, '20160701', '20161231')
,(6, 6, 2, '20170101', '20170630') /* most recent is closed, no others open, do not return */
,(7, 7, 3, '20160101', '20170630')
,(8, 8, 3, '20160701', null)
,(9, 9, 3, '20170101', '20170630') /* most recent is closed, one other open, return */
;
rextester demo: http://rextester.com/BUYKJ77928
returns:
+------+-------------+-------------+------------+------------+----------------+--------------+----------------+
| id | contract_id | employee_id | start_date | end_date | max_start_date | max_end_date | open_contracts |
+------+-------------+-------------+------------+------------+----------------+--------------+----------------+
| 7 | 7 | 3 | 2016-01-01 | 2017-06-30 | 2017-01-01 | 2017-06-30 | 1 |
| 8 | 8 | 3 | 2016-07-01 | NULL | 2017-01-01 | 2017-06-30 | 1 |
| 9 | 9 | 3 | 2017-01-01 | 2017-06-30 | 2017-01-01 | 2017-06-30 | 1 |
| 4321 | 974 | 321 | 2004-01-21 | 2016-12-31 | 2017-02-14 | 2017-06-06 | 2 |
| 4322 | 1485 | 321 | 2009-01-09 | 2014-08-31 | 2017-02-14 | 2017-06-06 | 2 |
| 4323 | NULL | 321 | 2009-07-25 | 2010-01-31 | 2017-02-14 | 2017-06-06 | 2 |
| 4324 | 2440 | 321 | 2012-06-01 | NULL | 2017-02-14 | 2017-06-06 | 2 |
| 4325 | 7368 | 321 | 2017-01-01 | NULL | 2017-02-14 | 2017-06-06 | 2 |
| 4326 | 7612 | 321 | 2017-02-14 | 2017-06-06 | 2017-02-14 | 2017-06-06 | 2 |
+------+-------------+-------------+------------+------------+----------------+--------------+----------------+
I'm not a SQL-server expert, but you might try something similar to this:
SELECT *
FROM contracts cont
WHERE cont.end_date IS NOT NULL
AND cont.end_date <= SYSDATE
AND NOT EXISTS (SELECT *
FROM contracts recent
WHERE recent.employee_id = cont.employee_id
AND recent.start_date > cont.start_date)
AND EXISTS (SELECT *
FROM contracts openc
WHERE openc.employee_id = cont.employee_id
AND (openc.end_date IS NULL OR openc.end_date > SYSDATE))
The first 2 conditions search for closed contracts.
The next one ("NOT EXISTS") makes sure the selected contract is the most recent one.
The last part assures there are other open contracts.
Try this dude.
SELECT [EMPLOYEE_ID]
FROM [contracts]
WHERE [END_DATE] IS NULL
AND [EMPLOYEE_ID] IN (SELECT B.[EMPLOYEE_ID] FROM (
SELECT * FROM (
SELECT RowN = Row_Number() over (partition by [EMPLOYEE_ID] ORDER BY[START_DATE] DESC)
, [EMPLOYEE_ID]
, [CONTRACT_ID]
, [END_DATE]
FROM [contracts]
) A
WHERE A.[END_DATE] IS NOT NULL
AND A.[RowN] = 1) B)
You can do this with ROW_NUMBER() and a CTE
See it in action: http://rextester.com/HQVXF56741
In the code below, I changed the dateformat which you may not have to do.
set dateformat dmy
declare #table table (ID int,CONTRACT_ID int, EMPLOYEE_ID int, [START_DATE] datetime, END_DATE datetime)
insert into #table
values
(4321,974,321,'21/01/2004','31/12/2016'),
(4322,1485,321,'09/01/2009','31/08/2014'),
(4323,NULL,321,'25/07/2009','31/01/2010'),
(4324,2440,321,'01/06/2012',NULL),
(4325,7368,321,'01/01/2017',NULL),
(4326,7612,321,'14/02/2017','06/06/2017')
--this applies a row_number to each contract per employee
--the most recent contract (by start date) gets a 1
;with cte as(
select
EMPLOYEE_ID
,ID
,row_number() over (partition by EMPLOYEE_ID order by [START_DATE] desc) as ContractRecency
from #table)
--this will return all contracts that are open, which aren't the most recent for the employee.
select
t.*
from
#table t
where
t.END_DATE is null
and t.ID not in (select ID from cte where ContractRecency = 1)
set dateformat mdy

Union in outer query

I'm attempting to combine multiple rows using a UNION but I need to pull in additional data as well. My thought was to use a UNION in the outer query but I can't seem to make it work. Or am I going about this all wrong?
The data I have is like this:
+------+------+-------+---------+---------+
| ID | Time | Total | Weekday | Weekend |
+------+------+-------+---------+---------+
| 1001 | AM | 5 | 5 | 0 |
| 1001 | AM | 2 | 0 | 2 |
| 1001 | AM | 4 | 1 | 3 |
| 1001 | AM | 5 | 3 | 2 |
| 1001 | PM | 5 | 3 | 2 |
| 1001 | PM | 5 | 5 | 0 |
| 1002 | PM | 4 | 2 | 2 |
| 1002 | PM | 3 | 3 | 0 |
| 1002 | PM | 1 | 0 | 1 |
+------+------+-------+---------+---------+
What I want to see is like this:
+------+---------+------+-------+
| ID | DayType | Time | Tasks |
+------+---------+------+-------+
| 1001 | Weekday | AM | 9 |
| 1001 | Weekend | AM | 7 |
| 1001 | Weekday | PM | 8 |
| 1001 | Weekend | PM | 2 |
| 1002 | Weekday | PM | 5 |
| 1002 | Weekend | PM | 3 |
+------+---------+------+-------+
The closest I've come so far is using UNION statement like the following:
SELECT * FROM
(
SELECT Weekday, 'Weekday' as 'DayType' FROM t1
UNION
SELECT Weekend, 'Weekend' as 'DayType' FROM t1
) AS X
Which results in something like the following:
+---------+---------+
| Weekday | DayType |
+---------+---------+
| 2 | Weekend |
| 0 | Weekday |
| 2 | Weekday |
| 0 | Weekend |
| 10 | Weekday |
+---------+---------+
I don't see any rhyme or reason to the numbers under the 'Weekday' column; I suspect they're being grouped somehow. And of course several other columns are missing, but since I can't put a large scope in the outer query with this as the inner one, I can't figure out how to pull those in. Help is greatly appreciated.
It looks like you want to union all a pair of aggregation queries that use sum() and group by id, time, one for Weekday and one for Weekend:
select Id, DayType = 'Weekend', [time], Tasks=sum(Weekend)
from t
group by id, [time]
union all
select Id, DayType = 'Weekday', [time], Tasks=sum(Weekday)
from t
group by id, [time]
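The same result can also come from a single scan by unpivoting each row into a Weekday row and a Weekend row first; a sketch, assuming SQL Server's CROSS APPLY (VALUES ...) and the same table t as above:

select t.Id, v.DayType, t.[time], sum(v.cnt) as Tasks
from t
cross apply (values ('Weekday', t.Weekday), ('Weekend', t.Weekend)) as v(DayType, cnt)
group by t.Id, v.DayType, t.[time];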
Try with this:
select ID, 'Weekday' as DayType, Time, sum(Weekday) as Tasks
from t1
group by ID, Time
union all
select ID, 'Weekend', Time, sum(Weekend)
from t1
group by ID, Time
order by 1, 3, 2
Not tested, but it should do the trick. It may require two PROC SQL steps for the calculation: one for the summing and one for the CASE WHEN statements. If you have extra lines, just use a MAX statement and group by ID, Time, Day_Type.
Proc sql; create table want as select ID, Time,
    sum(weekday) as weekdayTask,
    sum(weekend) as weekendTask,
    case when calculated weekdayTask > 0 then calculated weekdayTask
         when calculated weekendTask > 0 then calculated weekendTask
         else . end as Task,
    case when calculated weekdayTask > 0 then "Weekday"
         when calculated weekendTask > 0 then "Weekend"
         end as Day_Type
from have
group by ID, Time
;quit;
Proc sql; create table want2 as select ID, Time, Day_Type, Task
from want
;quit;
;quit;