T-SQL filtering records based on dates and time difference with other records - sql

I have a table for which I have to perform a rather complex filter: first a filter by date is applied, but then records from the previous and next days should be included if their time difference does not exceed 8 hours compared to its prev or next record (depending if the date is less or greater than filter date).
For those adjacent days the selection should stop at the first record that does not satisfy this condition.
This is how my raw data looks like:
Id
Desc
EntryDate
1
Event type 1
2021-03-12 21:55:00.000
2
Event type 1
2021-03-12 01:10:00.000
3
Event type 1
2021-03-11 20:17:00.000
4
Event type 1
2021-03-11 05:04:00.000
5
Event type 1
2021-03-10 23:58:00.000
6
Event type 1
2021-03-10 11:01:00.000
7
Event type 1
2021-03-10 10:00:00.000
In this example set, if my filter date is '2021-03-11', my expected result set should be all records from that day plus adjacent records from 03-12 and 03-10 that satisfy the 8 hours condition. Note how record with Id 7 is not be included because record with Id 6 does not comply:
Id
EntryDate
2
2021-03-12 01:10:00.000
3
2021-03-11 20:17:00.000
4
2021-03-11 05:04:00.000
5
2021-03-10 23:58:00.000
Need advice how to write this complex query

This is a variant of gaps-and-islands. Define the difference . . . and then groups based on the differences:
with e as (
select t.*
from (select t.*,
sum(case when prev_entrydate > dateadd(hour, -8, entrydate) then 0 else 1 end) over (order by entrydate) as grp
from (select t.*,
lag(entrydate) over (order by entrydate) as prev_entrydate
from t
) t
)
select e.*
from e.*
where e.grp in (select e2.grp
from t e2
where date(e2.entrydate) = #filterdate
);
Note: I'm not sure exactly how filter date is applied. This assumes that it is any events on the entire day, which means that there might be multiple groups. If there is only one group (say the first group on the day), the query can be simplified a bit from a performance perspective.

declare #DateTime datetime = '2021-03-11'
select *
from t
where t.EntryDate between DATEADD(hour , -8 , #DateTime) and DATEADD(hour , 32 , #DateTime)

Related

How to LEFT JOIN on ROW_NUM using WITH

Right now I'm in the testing phase of this query so I'm only testing it on two Queries. I've gotten stuck on the final part where I want to left join everything (this will have to be extended to 12 separate queries). The problem is basically as the title suggests--I want to join 12 queries on the created Row_Num column using the WITH() statement, instead of creating 12 separate tables and saving them as table in a database.
WITH Jan_Table AS
(SELECT ROW_NUMBER() OVER (ORDER BY a.SALE_DATE) as Row_ID, a.SALE_DATE, sum(a.revenue) as Jan_Rev
FROM ba.SALE_TABLE a
WHERE a.SALE_DATE BETWEEN '2015-01-01' and '2015-01-31'
GROUP BY a.SALE_DATE)
SELECT ROW_NUMBER() OVER (ORDER BY a.SALE_DATE) as Row_ID, a.SALE_DATE, sum(a.revenue) as Jun_Rev, j.Jan_Rev
FROM ba.SALE_TABLE a
LEFT JOIN Jan_Table j
on "j.Row_ID" = a.Row_ID
WHERE a.SALE_DATE BETWEEN '2015-06-01' and '2015-06-30'
GROUP BY a.SALE_DATE
And then I get this error message:
ERROR: column "j.Row_ID" does not exist
I put in the "j.Row_ID" because the previous message was:
ERROR: column a.row_id does not exist Hint: Perhaps you meant to
reference the column "j.row_id".
Each query works individually without the JOIN and WITH functions. I have one for every month of the year and want to join 12 of these together eventually.
The output should be a single column with ROW_NUM and 12 Monthly Revenues columns. Each row should be a day of the month. I know not every month has 31 days. So, for example, Feb only has 28 days, meaning I'd want days 29, 30, and 31 as NULLs. The query above still has the dates--but I will remove the "SALE_DATE" column after I can just get these two queries to join.
My initially thought was just to create 12 tables but I think that'd be a really bad use of space and not the most logical solution to this problem if I were to extend this solution.
edit
Below are the separate outputs of the two qaruies above and the third table is what I'm trying to make. I can't give you the raw data. Everything above has been altered from the actual column names and purposes of the data that I'm using. And I don't know how to create a dataset--that's too above my head in SQL.
Jan_Table (first five lines)
Row_Num Date Jan_Rev
1 2015-01-01 20
2 2015-01-02 20
3 2015-01-03 20
4 2015-01-04 20
5 2015-01-05 20
Jun_Table (first five lines)
Row_Num Date Jun_Rev
1 2015-06-01 30
2 2015-06-02 30
3 2015-06-03 30
4 2015-06-04 30
5 2015-06-05 30
JOINED_TABLE (first five lines)
Row_Num Date Jun_Rev Date Jan_Rev
1 2015-06-01 30 2015-01-01 20
2 2015-06-02 30 2015-01-02 20
3 2015-06-03 30 2015-01-03 20
4 2015-06-04 30 2015-01-04 20
5 2015-06-05 30 2015-01-05 20
It seems like you can just use group by and conditional aggregation for your full query:
select day(sale_date),
max(case when month(sale_date) = 1 then sale_date end) as jan_date,
max(case when month(sale_date) = 1 then revenue end) as jan_revenue,
max(case when month(sale_date) = 2 then sale_date end) as feb_date,
max(case when month(sale_date) = 2 then revenue end) as feb_revenue,
. . .
from sale_table s
group by day(sale_date)
order by day(sale_date);
You haven't specified the database you are using. DAY() is a common function to get the day of the month; MONTH() is a common function to get the months of the year. However, those particular functions might be different in your database.

SQL : How to count number of times each ID exists continuously from previous period

My SQL data set is like this;
Date firm_id
======================
2010-01 1
2010-01 2
2010-01 3
----------------------
2010-02 1
2010-02 2
----------------------
2010-03 1
2010-03 2
2010-03 3
----------------------
2010-04 1
2010-04 3
How can I create a variable, name firm_age, to represent age of firms existing continuously from the previous period? like this,
Date firm_id firm_age
=================================
2010-01 1 0
2010-01 2 0
2010-01 3 0
-----------------------------------
2010-02 1 1
2010-02 2 1
-----------------------------------
2010-03 1 2
2010-03 2 2
2010-03 3 0
-----------------------------------
2010-04 1 3
2010-04 3 1
Thank you
This is a use case for the PACK operator from "Time & Relational Theory", which is not supported, at least not directly, in SQL.
You are trying to find [for each given row of the table] the smallest month such that there does not exist any intervening month between that smallest month and the month of the given row such that the company of the given row did not exist at that intervening month. Given two months, assessing the [non-]existence of such an intervening month is relatively trivial, however, finding the smallest month that makes the condition true for all intervening months is another order (*). I wouldn't try to do this completely in plain SQL.
(*) which set of months are you going to SELECT that "smallest month" from ? You cannot rely on the fact that all months will be mentioned in your table as there is always the slight theoretical possibility that one particular month, no companies existed at all. (This possibility also breaks any attack on the problem based on window functions ans row_numbers.)
This is a gaps-and-islands problem. You want "islands" where the values are sequential. Then you want to enumerate them. You can use row_number() for this:
select t.*,
row_number() over (partition by firm_id, date - seqnum * interval '1 month'
order by date
) as firm_age
from (select t.*,
row_number() over (partition by firm_id order by date) as seqnum
from t
) t;
Note that date functions are not standard across databases. This makes some assumptions about the data representation, but the idea for the processing should work in almost any database.

Generate cyclic column variable conditional on lag of another variable postgreSQL

So suppose I have a table as follows:
user date
a 10/15/2015
a 11/15/2015
a 12/15/2015
a 2/15/2015
b 1/15/2015
b 2/15/2015
b 4/15/2015
b 6/15/2015
I need to create three column variables (acutally two - i figured the time lag variable) (1) one that counts the number of successive logins by month and if there is a lapse restarts the counter (2) the numbers of days between logins (figured this one out) (3) if the counter resets then their cycle count increases by one. The resulting table should look as follows: (I am going to just use 30 days for 1 month span for illustrative purposes.)
user date count timelapse cycle
a 10/15/2015 1 0 1
a 11/15/2015 2 30 1
a 12/15/2015 3 30 1
a 2/15/2015 1 60 2
b 1/15/2015 1 0 1
b 2/15/2015 2 30 1
b 4/15/2015 1 60 2
b 6/15/2015 1 60 3
Any ideas? I was able to get the count column to work - but I could not get it to reset when the timelapse was greater than 30. Since the cycle was conditional on two columns I was at a bit of a loss there.
Any help or ideas would be greatly appreciated.
Here is the idea. Use lag() to determine when a gap occurs. You can do this by truncating the date to the beginning of the month, for comparison purposes.
Then, do a cumulative sum of the gap flags. This provides the cycle column. The count is then row_number() using the cycle:
select t.*,
row_number() over (partition by user, cycle order by date) as count
from (select t.*, sum(IsGap) over (partition by user order by date) as cycle
from (select user, date,
(case when date_trunc('month', date) = date_trunc('month', lag(date) over (partition by user order by date) + interval '1 month'
then 0
else 1
end) as IsGap
from t
) t
) t

Need to get the minimum start date and maximum end date, when there is no break in months

i have 8 rows as shown below,
Column1 Start_date end_date Row_number
1 2014-02-01 2014-02-28 1
1 2014-03-01 2014-03-31 2
1 2014-04-01 2014-04-30 3
1 2014-05-01 2014-05-31 4
1 2014-07-01 2014-07-31 5
1 2015-02-01 2015-02-28 6
1 2015-03-01 2015-03-31 7
I need result like below,
Column1 Start_date end_date
1 2014-02-01 2014-05-31
1 2014-07-01 2014-07-31
1 2015-02-01 2015-03-31
so when the end_date of first row is one day less than the start_date in next row, I need to group all the continuous rows like that and get the result as I shown. I need to do this only via SQL. please let me know, if anyone have any idea to solve this.
In the input record, you can see, first 4 rows are continuous, and 5th row is not continuous and 6th and 7th row is a continuous one.
Thanks in advance.
The trick here is that you need to first filter out only entries that are the ends of an interval, and then merge them together, rather than trying to keep a running count in one go.
So I don't know what flavour of SQL you're running, and I have no idea what you're trying to signify with Column1, but this should do the trick (written in SQL server flavour, but the only functions you need to adjust are the dateadd and the isnull). The fiddle is here
SELECT DISTINCT
CASE WHEN Q1.IsStart = 1
THEN Q1.start_date
ELSE LAG(start_date) OVER(ORDER BY Q1.Row_number) END AS start_date,
CASE WHEN Q1.IsEnding = 1
THEN Q1.end_date
ELSE LEAD(end_date) OVER(ORDER BY Q1.Row_number) END AS end_date
FROM
(SELECT
start_date,
end_date,
Row_number,
CASE WHEN DATEADD(day,1,end_date) =
ISNULL(LEAD(start_date) OVER(ORDER BY Row_number),
end_date)
THEN 0
ELSE 1 END AS IsEnding,
CASE WHEN DATEADD(day,-1,start_date) =
ISNULL(LAG(end_date) OVER(ORDER BY Row_number),
start_date)
THEN 0
ELSE 1 END AS IsStart
FROM table1) Q1
WHERE Q1.IsEnding = 1 OR Q1.IsStart = 1
For ANSI SQL/For those of you without LAG or LEAD:
SELECT
StartDates.start_date,
MIN(EndDates.end_date)
FROM
(SELECT
MainEntry.start_date,
MainEntry.row_number
FROM
mytable MainEntry
LEFT OUTER JOIN mytable PrevEntry ON PrevEntry.row_number - 1 = MainEntry.row_number
WHERE
PrevEntry.end_date IS NULL OR
EXTRACT(day FROM (MainEntry.start_date - PrevEntry.end_date)) > 1) StartDates
INNER JOIN
(SELECT
MainEntry.end_date,
MainEntry.row_number
FROM
mytable MainEntry
LEFT OUTER JOIN mytable NextEntry ON NextEntry.row_number + 1 = MainEntry.row_number
WHERE
NextEntry.start_date IS NULL OR
EXTRACT(day FROM (NextEntry.start_date - MainEntry.end_date)) > 1) EndDates
ON StartDates.row_number <= EndDates.row_number
GROUP BY
StartDates.start_date
Note that the GROUP BY could contain StartDates.row_number if that takes advantage of an index. Also note that this ANSI solution initially missed the edge cases of rows without any pairs (had INNER JOINs inside the subqueries).

How to aggregate 7 days in SQL

I was trying to aggregate a 7 days data for FY13 (starts on 10/1/2012 and ends on 9/30/2013) in SQL Server but so far no luck yet. Could someone please take a look. Below is my example data.
DATE BREAD MILK
10/1/12 1 3
10/2/12 2 4
10/3/12 2 3
10/4/12 0 4
10/5/12 4 0
10/6/12 2 1
10/7/12 1 3
10/8/12 2 4
10/9/12 2 3
10/10/12 0 4
10/11/12 4 0
10/12/12 2 1
10/13/12 2 1
So, my desired output would be like:
DATE BREAD MILK
10/1/12 1 3
10/2/12 2 4
10/3/12 2 3
10/4/12 0 4
10/5/12 4 0
10/6/12 2 1
Total 11 15
10/7/12 1 3
10/8/12 2 4
10/9/12 2 3
10/10/12 0 4
10/11/12 4 0
10/12/12 2 1
10/13/12 2 1
Total 13 16
--------through 9/30/2013
Please note, since FY13 starts on 10/1/2012 and ends on 9/30/2012, the first week of FY13 is 6 days instead of 7 days.
I am using SQL server 2008.
You could add a new computed column for the date values to group them by week and sum the other columns, something like this:
SELECT DATEPART(ww, DATEADD(d,-2,[DATE])) AS WEEK_NO,
SUM(Bread) AS Bread_Total, SUM(Milk) as Milk_Total
FROM YOUR_TABLE
GROUP BY DATEPART(ww, DATEADD(d,-2,[DATE]))
Note: I used DATEADD and subtracted 2 days to set the first day of the week to Monday based on your dates. You can modify this if required.
Use option with GROUP BY ROLLUP operator
SELECT CASE WHEN DATE IS NULL THEN 'Total' ELSE CONVERT(nvarchar(10), DATE, 101) END AS DATE,
SUM(BREAD) AS BREAD, SUM(MILK) AS MILK
FROM dbo.test54
GROUP BY ROLLUP(DATE),(DATENAME(week, DATE))
Demo on SQLFiddle
Result:
DATE BREAD MILK
10/01/2012 1 3
10/02/2012 2 4
10/03/2012 2 3
10/04/2012 0 4
10/05/2012 4 0
10/06/2012 2 1
Total 11 15
10/07/2012 1 3
10/08/2012 4 7
10/10/2012 0 4
10/11/2012 4 0
10/12/2012 2 1
10/13/2012 2 1
Total 13 16
You are looking for a rollup. In this case, you will need at least one more column to group by to do your rollup on, the easiest way to do that is to add a computed column that groups them into weeks by date.
Take a lookg at: Summarizing Data Using ROLLUP
Here is the general idea of how it could be done:
You need a derived column for each row to determine which fiscal week that record belongs to. In general you could subtract that record's date from 10/1, get the number of days that have elapsed, divide by 7, and floor the result.
Then you can GROUP BY that derived column and use the SUM aggregate function.
The biggest wrinkle is that 6 day week you start with. You may have to add some logic to make sure that the weeks start on Sunday or whatever day you use but this should get you started.
The WITH ROLLUP suggestions above can help; you'll need to save the data and transform it as you need.
The biggest thing you'll need to be able to do is identify your weeks properly. If you don't have those loaded into tables already so you can identify them, you can build them on the fly. Here's one way to do that:
CREATE TABLE #fy (fyear int, fstart datetime, fend datetime);
CREATE TABLE #fylist(fyyear int, fydate DATETIME, fyweek int);
INSERT INTO #fy
SELECT 2012, '2011-10-01', '2012-09-30'
UNION ALL
SELECT 2013, '2012-10-01', '2013-09-30';
INSERT INTO #fylist
( fyyear, fydate )
SELECT fyear, DATEADD(DAY, Number, DATEADD(DAY, -1, fy.fstart)) AS fydate
FROM Common.NUMBERS
CROSS APPLY (SELECT * FROM #fy WHERE fyear = 2013) fy
WHERE fy.fend >= DATEADD(DAY, Number, DATEADD(DAY, -1, fy.fstart));
WITH weekcalc AS
(
SELECT DISTINCT DATEPART(YEAR, fydate) yr, DATEPART(week, fydate) dt
FROM #fylist
),
ridcalc AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY yr, dt) AS rid, yr, dt
FROM weekcalc
)
UPDATE #fylist
SET fyweek = rid
FROM #fylist
JOIN ridcalc
ON DATEPART(YEAR, fydate) = yr
AND DATEPART(week, fydate) = dt;
SELECT list.fyyear, list.fyweek, p.[date], COUNT(bread) AS Bread, COUNT(Milk) AS Milk
FROM products p
JOIN #fylist list
ON p.[date] = list.fydate
GROUP BY list.fyyear, list.fyweek, p.[date] WITH ROLLUP;
The Common.Numbers reference above is a simple numbers table that I use for this sort of thing (goes from 1 to 1M). You could also build that on the fly as needed.