Getting distinct rows for overlapping timestamps in SQL Server

I have the following result set which I get from SQL Server:
employeeNumber | start_date | start_time | end_date | end_time
---------------+------------+------------+--------------+----------
123 | 10-03-2020 | 18:13:55 | 10-03-2020 | 22:59:46
123 | 10-03-2020 | 18:24:22 | 10-03-2020 | 22:59:51
123 | 10-03-2020 | 23:24:22 | 10-03-2020 | 23:59:51
123 | 11-03-2020 | 18:25:25 | 11-03-2020 | 20:59:51
123 | 12-03-2020 | 18:40:22 | 12-03-2020 | 22:59:52
In some cases I have multiple rows for the same overlapping period (rows 1 and 2 above), with start and end times that differ by only seconds or minutes.
My query is a simple select that fetches the data from the source table. What can I add to the where clause to fetch only distinct rows for such overlapping timestamps? That is, for the above data I would want the result set to return the following:
employeeNumber | start_date | start_time | end_date | end_time
---------------+------------+------------+--------------+----------
123 | 10-03-2020 | 18:13:55 | 10-03-2020 | 22:59:46
123 | 10-03-2020 | 23:24:22 | 10-03-2020 | 23:59:51
123 | 11-03-2020 | 18:25:25 | 11-03-2020 | 20:59:51
123 | 12-03-2020 | 18:40:22 | 12-03-2020 | 22:59:52
Below is my query :
select
employeeNumber, start_date, start_time, end_date, end_time
from
emp_data
where
employeeNumber = 123
order by
employeeNumber;
I can probably do with fetching only the first record, but what would the where clause be?
Any help is appreciated as I am not very familiar with SQL Server.

This is complicated. You need to keep track of "starts" and "ends". I am going to assume that your columns are datetimes or something similar that can be combined into a single column:
with e as (
      select e.employeeNumber, v.dt, sum(v.inc) as inc,
             sum(sum(v.inc)) over (partition by e.employeeNumber order by v.dt) as in_outs
      from emp_data e cross apply
           (values (start_date + start_time, 1),
                   (end_date + end_time, -1)
           ) v(dt, inc)
      group by e.employeeNumber, v.dt
     )
select employeeNumber, min(dt) as start_datetime, max(dt) as end_datetime
from (select e.*,
             sum(case when in_outs = 0 then 1 else 0 end) over (partition by employeeNumber order by dt) as grp
      from e
     ) e
where in_outs <> 0
group by employeeNumber, grp;
Here is a db<>fiddle.
What is this doing?
First, the separate date and time columns are combined into single datetime values.
Then the columns are unpivoted and identified as starts and ends, along with +1 or -1 to indicate whether the employee is "entering" or "exiting" at that time.
These increments are accumulated.
Now you have a gaps-and-islands problem, where you want to find continuous periods of "in"s. The "islands" are identified using a cumulative sum of the "ins".
Then these are aggregated.
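A minimal setup to try this, as a sketch: it assumes the columns are datetime values, as the answer supposes (the real table presumably already exists). With datetime, a time-only literal carries a 1900-01-01 date part, so start_date + start_time yields the combined point in time:
create table emp_data (
    employeeNumber int,
    start_date datetime,   -- date part only
    start_time datetime,   -- time part only (1900-01-01 date component)
    end_date   datetime,
    end_time   datetime
);

insert into emp_data values
(123, '2020-03-10', '18:13:55', '2020-03-10', '22:59:46'),
(123, '2020-03-10', '18:24:22', '2020-03-10', '22:59:51'),
(123, '2020-03-10', '23:24:22', '2020-03-10', '23:59:51'),
(123, '2020-03-11', '18:25:25', '2020-03-11', '20:59:51'),
(123, '2020-03-12', '18:40:22', '2020-03-12', '22:59:52');
If the columns are instead the date and time types, they cannot be added with +; something like CAST(start_date AS datetime) + CAST(start_time AS datetime) would be needed first.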
EDIT:
You can replace the cumulative sum with:
from (select e.*,
             (select sum(case when e2.in_outs = 0 then 1 else 0 end)
              from e e2
              where e2.employeeNumber = e.employeeNumber and
                    e2.dt <= e.dt
             ) as grp
      from e
     ) e
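Presumably the point of this variant is compatibility: the ordered window aggregate (sum(...) over (order by ...)) used above requires SQL Server 2012 or later, while the correlated subquery computes the same running count on older versions.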

Related

Finding created on dates for duplicates in SQL

I have one table of contact records and I'm trying to get the count of duplicate records that were created on each date. I'm not looking to include the original instance in the count. I'm using SQL Server.
Here's an example table
| email | created_on |
| ------------- | ---------- |
| aaa@email.com | 08-16-22 |
| bbb@email.com | 08-16-22 |
| zzz@email.com | 08-16-22 |
| bbb@email.com | 07-12-22 |
| aaa@email.com | 07-12-22 |
| zzz@email.com | 06-08-22 |
| aaa@email.com | 06-08-22 |
| bbb@email.com | 04-21-22 |
And I'm expecting to return
| created_on | dupe_count |
| ---------- | ---------- |
| 08-16-22 | 3 |
| 07-12-22 | 2 |
| 06-08-22 | 0 |
| 04-21-22 | 0 |
I created a subquery that numbers each email's rows by created date. Then you query that and ignore the date when the email was first created (row number 1). Works perfectly fine in this case.
Entire code:
Create table #Temp
(
    email varchar(50),
    dateCreated date
)

insert into #Temp (email, dateCreated) values
('aaa@email.com', '08-16-22'),
('bbb@email.com', '08-16-22'),
('zzz@email.com', '08-16-22'),
('bbb@email.com', '07-12-22'),
('aaa@email.com', '07-12-22'),
('zzz@email.com', '06-08-22'),
('aaa@email.com', '06-08-22'),
('bbb@email.com', '04-21-22')

select dateCreated, sum(case when r = 1 then 0 else 1 end) as duplicates
from
(
    select email, dateCreated,
           row_number() over (partition by email order by dateCreated) as r
    from #Temp
) b
group by dateCreated

drop table #Temp
Output:
datecreated duplicates
2022-04-21 0
2022-06-08 0
2022-07-12 2
2022-08-16 3
You can calculate the difference between the total count of emails for each day and the count of unique emails for that day:
select created_on,
count(email) - count(distinct email) as dupe_count
from cte
group by created_on
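As a runnable illustration against the #Temp table from the answer above (substituting #Temp for cte; run it before the drop table): this counts only same-day duplicates, which is why it does not match the expected output and the revision below was needed:
select dateCreated as created_on,
       count(email) - count(distinct email) as dupe_count
from #Temp
group by dateCreated;
-- every email appears at most once per day in the sample data,
-- so this returns 0 for every date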
It seems I have misunderstood your request, and you wanted to consider previous created_on dates too:
ct as (
    select created_on,
           (select case when (select count(*)
                              from cte t2
                              where t1.email = t2.email and t1.created_on > t2.created_on
                             ) > 0 then email end) as c
    from cte t1)
select created_on,
       count(distinct c) as dupe_count
from ct
group by created_on
order by 1
It seems that in Oracle it is also possible to do the aggregation in a single query:
select created_on,
       count(distinct case when (select count(*)
                                 from cte t2
                                 where t1.email = t2.email and t1.created_on > t2.created_on
                                ) > 0 then email end) as c
from cte t1
group by created_on
order by 1
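For comparison, a sketch of the same cumulative counting with a window function instead of a correlated subquery (essentially the accepted answer's approach; cte again stands for the source table): number each email's occurrences by date and count everything past the first as a duplicate:
select created_on,
       sum(case when rn > 1 then 1 else 0 end) as dupe_count
from (select email, created_on,
             row_number() over (partition by email order by created_on) as rn
      from cte
     ) t
group by created_on
order by created_on;
This matches the expected output: 3 for 08-16-22, 2 for 07-12-22, and 0 for the two earliest dates.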

Calculating elapsed time in non-contiguous rows using SQL

I need to deduce uptime for servers using SQL with a table that looks as follows:
| Row | ID | Status | Timestamp |
-----------------------------------
| 1 | A1 | UP | 1598451078 |
-----------------------------------
| 2 | A2 | UP | 1598457488 |
-----------------------------------
| 3 | A3 | UP | 1598457489 |
-----------------------------------
| 4 | A1 | DOWN | 1598458076 |
-----------------------------------
| 5 | A3 | DOWN | 1598461096 |
-----------------------------------
| 6 | A1 | UP | 1598466510 |
-----------------------------------
In this example, A1 went down on Wed, 26 Aug 2020 16:07:56 and came back up at Wed, 26 Aug 2020 18:28:30. This means I need to find the difference between rows 6 and 4 using the ID field and display it as an additional column named "Uptime".
I have found several answers that explain how to use aliases and inner joins to calculate the difference between contiguous rows (e.g. How to get difference between two rows for a column field?), but none that explains how to do so for non-contiguous rows.
For example, this piece of code from https://www.mysqltutorial.org/mysql-tips/mysql-compare-calculate-difference-successive-rows/ gives a possible solution, but I don't know how to adapt it to compare the rows based on the ID field:
SELECT
g1.item_no,
g1.counted_date from_date,
g2.counted_date to_date,
(g2.qty - g1.qty) AS receipt_qty
FROM
inventory g1
INNER JOIN
inventory g2 ON g2.id = g1.id + 1
WHERE
g1.item_no = 'A';
Any help would be much appreciated.
Basically, you need the total time minus the downtime.
If you want the different periods, you can use:
select status, max(timestamp), min(timestamp),
max(timestamp) - min(timestamp)
from (select t.*,
row_number() over (order by timestamp) as seqnum,
row_number() over (partition by status order by timestamp) as seqnum2
from t
) t
group by status, (seqnum - seqnum2);
However, for your purposes, for the total uptime:
select sum(coalesce(next_timestamp, max_uptimestamp) - timestamp) as total_uptime
from (select t.*,
             lag(status) over (order by timestamp) as prev_status,
             lead(timestamp) over (order by timestamp) as next_timestamp,
             max(case when status = 'UP' then timestamp end) over () as max_uptimestamp
      from t
     ) t
where status = 'UP' and
      (prev_status = 'DOWN' or prev_status is null);
Basically, this counts all the time from the first UP to the next DOWN or to the last UP. It then sums that up.
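A sketch of sample data to run this against, assuming the timestamps are stored as plain epoch-second integers as in the question. Note that the query as written mixes all servers together; per-server figures would likely need partition by id added to each window function:
create table t (
    row_num   int,
    id        varchar(10),
    status    varchar(10),
    timestamp bigint        -- unix epoch seconds
);

insert into t values
(1, 'A1', 'UP',   1598451078),
(2, 'A2', 'UP',   1598457488),
(3, 'A3', 'UP',   1598457489),
(4, 'A1', 'DOWN', 1598458076),
(5, 'A3', 'DOWN', 1598461096),
(6, 'A1', 'UP',   1598466510);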

Return Min Start Date, Max End Date and Latest Category for a group of consecutive records based on date

I have a table which contains a Person ID, Category_ID, Start Date, End Date and Category. When the Start Date is the same as the Previous End Date then this is a continuation and merely denotes a Category Change. There can be many Category changes within a continuous date period.
I want to return the First Start Date and Last End Date and Category Type for each person.
I thought about identifying all those that have continuous date period for a person and return max and min etc. But this doesn't take into account when a person has multiple continuous date periods, i.e. one period ends and there is a break and then there is another continuous period with category changes.
Example output:
+---------+------------+------------+---------------+
| ID | start_dt | end_dt | category_type |
+---------+------------+------------+---------------+
| 8105755 | 26/01/2016 | 21/04/2016 | D |
| 8105859 | 21/04/2016 | 22/04/2016 | A |
| 8105861 | 22/04/2016 | 26/04/2016 | D |
| 8105870 | 26/04/2016 | 19/10/2016 | A |
+---------+------------+------------+---------------+
So in this case, as each start_dt is the same as the preceding row's end_dt, this is one continuous period, so I want to return one row with the First Start Date, Last End Date, and Latest Category Type, as below:
+---------+------------+------------+---------------+
| ID | start_dt | end_dt | category_type |
+---------+------------+------------+---------------+
| 8105870 | 26/01/2016 | 19/10/2016 | A |
+---------+------------+------------+---------------+
This is a type of gaps-and-islands problem, which you can solve using a cumulative sum to identify the groups. The sum is based on when groups start. So:
select distinct
       first_value(t.id) over (partition by grp order by t.start_dt desc) as id,
       min(t.start_dt) over (partition by grp) as start_dt,
       max(t.end_dt) over (partition by grp) as end_dt,
       first_value(t.category) over (partition by grp order by t.start_dt desc) as category_type
from (select t.*,
             sum(case when tprev.id is null then 1 else 0 end) over (order by t.start_dt) as grp
      from t left join
           t tprev
           on tprev.end_dt = t.start_dt
     ) t;
Note: This uses select distinct simply because SQL Server does not offer "first()"/"last()" functions for aggregation.
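If select distinct feels awkward, a hedged alternative sketch (same derived table and column assumptions as above) is to rank the rows within each island and keep only the latest one, carrying the island's bounds along as window aggregates:
select id, start_dt, end_dt, category_type
from (select t.id, t.category as category_type,
             min(t.start_dt) over (partition by grp) as start_dt,
             max(t.end_dt) over (partition by grp) as end_dt,
             row_number() over (partition by grp order by t.start_dt desc) as seqnum
      from (select t.*,
                   sum(case when tprev.id is null then 1 else 0 end) over (order by t.start_dt) as grp
            from t left join
                 t tprev
                 on tprev.end_dt = t.start_dt
           ) t
     ) t
where seqnum = 1;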

Cumulative sum based on a condition in another column

I would like to create a view based on data in the following structure:
CREATE TABLE my_table (
date date,
daily_cumulative_precip float4
);
INSERT INTO my_table (date, daily_cumulative_precip)
VALUES
('2016-07-28', 3.048)
, ('2016-08-04', 2.286)
, ('2016-08-11', 5.334)
, ('2016-08-12', 0.254)
, ('2016-08-13', 2.794)
, ('2016-08-14', 2.286)
, ('2016-08-15', 3.302)
, ('2016-08-17', 3.81)
, ('2016-08-19', 15.746)
, ('2016-08-20', 46.739998);
I would like to accumulate the precipitation for consecutive days only.
Below is the desired result for a different test case - except that days without rain should be omitted:
I have tried window functions with OVER(PARTITION BY date, rain_on_day) but they do not yield the desired result.
How could I solve this?
SELECT date
, dense_rank() OVER (ORDER BY grp) AS consecutive_group_nr -- optional
, daily_cumulative_precip
, sum(daily_cumulative_precip) OVER (PARTITION BY grp ORDER BY date) AS cum_precipitation_mm
FROM (
SELECT date, t.daily_cumulative_precip
, row_number() OVER (ORDER BY date) - t.rn AS grp
FROM (
SELECT generate_series (min(date), max(date), interval '1 day')::date AS date
FROM my_table
) d
LEFT JOIN (SELECT *, row_number() OVER (ORDER BY date) AS rn FROM my_table) t USING (date)
) x
WHERE daily_cumulative_precip > 0
ORDER BY date;
db<>fiddle here
Returns all rainy days with cumulative sums for consecutive days (and a running group number).
Basics:
Select longest continuous sequence
Here's a way to calculate cumulative precipitation without having to explicitly enumerate all dates:
SELECT date, daily_cumulative_precip, sum(daily_cumulative_precip) over (partition by group_num order by date) as cum_precip
FROM
(SELECT date, daily_cumulative_precip, sum(start_group) over (order by date) as group_num
FROM
(SELECT date, daily_cumulative_precip, CASE WHEN (date != prev_date + 1) THEN 1 ELSE 0 END as start_group
FROM
(SELECT date, daily_cumulative_precip, lag(date, 1, '-infinity'::date) over (order by date) as prev_date
FROM my_table) t1) t2) t3
yields
| date | daily_cumulative_precip | cum_precip |
|------------+-------------------------+------------|
| 2016-07-28 | 3.048 | 3.048 |
| 2016-08-04 | 2.286 | 2.286 |
| 2016-08-11 | 5.334 | 5.334 |
| 2016-08-12 | 0.254 | 5.588 |
| 2016-08-13 | 2.794 | 8.382 |
| 2016-08-14 | 2.286 | 10.668 |
| 2016-08-15 | 3.302 | 13.97 |
| 2016-08-17 | 3.81 | 3.81 |
| 2016-08-19 | 15.746 | 15.746 |
| 2016-08-20 | 46.74 | 62.486 |

Query to find records that were created one after another in BigQuery

I am playing around with BigQuery. The following input is given:
+---------------+---------+---------+--------+----------------------+
| customer | agent | value | city | timestamp |
+---------------+---------+---------+--------+----------------------+
| 1 | 1 | 106 | LA | 2019-02-12 03:05pm |
| 1 | 1 | 251 | LA | 2019-02-12 03:06pm |
| 3 | 2 | 309 | NY | 2019-02-12 06:41pm |
| 1 | 1 | 654 | LA | 2019-02-12 05:12pm |
+---------------+---------+---------+--------+----------------------+
I want to find transactions that were issued one after another (say, within 5 minutes) by one and the same agent. So the output for the above table should look like:
+---------------+---------+---------+--------+----------------------+
| customer | agent | value | city | timestamp |
+---------------+---------+---------+--------+----------------------+
| 1 | 1 | 106 | LA | 2019-02-12 03:05pm |
| 1 | 1 | 251 | LA | 2019-02-12 03:06pm |
+---------------+---------+---------+--------+----------------------+
The query should somehow group by agent and find such transactions. However the result is not really grouped as you can see from the output. My first thought was using the LEAD function, but I am not sure. Do you have any ideas?
Ideas for a query:
sort by agent and timestamp DESC
start with the first row, look at the following row (using LEAD?)
check if timestamp difference is less than 5 minutes
if so, this two rows should be in the output
continue with next (2nd) row
When the 2nd and 3rd row match the criteria, too, the 2nd row will get into the output, which would cause duplicate rows. I am not sure how to avoid that, yet.
There must be an easier way but does this achieve what you are after?
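-- assumption: CTE (referenced below) is a preceding WITH clause holding the
-- source rows, with the timestamp strings already parsed to TIMESTAMPs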
CTE2 AS (
SELECT customer, agent, value, city, timestamp,
lead(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lead,
lead(customer,1) OVER (PARTITION BY agent ORDER BY timestamp) customer_lead,
lead(value,1) OVER (PARTITION BY agent ORDER BY timestamp) value_lead,
lead(city,1) OVER (PARTITION BY agent ORDER BY timestamp) city_lead,
lag(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lag
FROM CTE
)
SELECT agent,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(customer as string),', ',cast(customer_lead as string)),cast(customer as string)) customer,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(value as string),', ',cast(value_lead as string)),cast(value as string)) value,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(city as string),', ',cast(city_lead as string)),cast(city as string)) cities,
if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(timestamp as string),', ',cast(timestamp_lead as string)),cast(timestamp as string)) timestamps
FROM CTE2
WHERE (timestamp_diff(timestamp_lead,timestamp,MINUTE)<5 OR NOT timestamp_diff(timestamp,timestamp_lag,MINUTE)<5)
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM (
SELECT *,
IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts),
NULL).*
FROM `project.dataset.yourtable`
)
WHERE NOT next_customer IS NULL
You can test, play with above using sample data from your question as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer, 1 agent, 106 value,'LA' city, '2019-02-12 03:05pm' ts UNION ALL
SELECT 1, 1, 251,'LA', '2019-02-12 03:06pm' UNION ALL
SELECT 3, 2, 309,'NY', '2019-02-12 06:41pm' UNION ALL
SELECT 1, 1, 654,'LA', '2019-02-12 05:12pm'
), temp AS (
SELECT customer, agent, value, city, PARSE_TIMESTAMP('%Y-%m-%d %I:%M%p', ts) ts
FROM `project.dataset.table`
)
SELECT * FROM (
SELECT *,
IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5,
LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts),
NULL).*
FROM temp
)
WHERE NOT next_customer IS NULL
-- ORDER BY ts
with result
Row | customer | agent | value | city | ts                      | next_customer | next_value
1   | 1        | 1     | 106   | LA   | 2019-02-12 15:05:00 UTC | 1             | 251