SQL Server iterating through time series data

I am using SQL Server and wondering if it is possible to iterate through time series data until a specific condition is met and, based on that, label my data in another table.
For example, let's say I have a table like this:
Id | Date       | Some_kind_of_event
---+------------+-------------------
 1 | 2018-01-01 | dsdf...
 1 | 2018-01-06 | sdfs...
 1 | 2018-01-29 | fsdfs...
 2 | 2018-05-10 | sdfs...
 2 | 2018-05-11 | fgdf...
 2 | 2018-05-12 | asda...
 3 | 2018-02-15 | sgsd...
 3 | 2018-02-16 | rgw...
 3 | 2018-02-17 | sgs...
 3 | 2018-02-28 | sgs...
What I want is, for each key, to calculate the difference between every two adjacent events and check whether any such difference is greater than 10 days. If it is, I want to stop iterating for that specific key and put the label 'inactive' in my other table, otherwise 'active'. After we finish with one key, we start with the next.
So, for example, id = 1 would get the label 'inactive' because there are two adjacent dates with a difference bigger than 10 days. The final result would look like this:
Id | Label
---+---------
 1 | inactive
 2 | active
 3 | inactive
Any ideas how to do that? Is it possible to do it with SQL?

When working with a DBMS you need to get away from the idea of thinking iteratively. Instead you need to try and think in sets. "Instead of thinking about what you want to do to a row, think about what you want to do to a column."
If I understand correctly, is this what you're after?
CREATE TABLE SomeEvent (ID int, EventDate date, EventName varchar(10));
INSERT INTO SomeEvent
VALUES (1,'20180101','dsdf...'),
(1,'20180106','sdfs...'),
(1,'20180129','fsdfs..'),
(2,'20180510','sdfs...'),
(2,'20180511','fgdf...'),
(2,'20180512','asda...'),
(3,'20180215','sgsd...'),
(3,'20180216','rgw....'),
(3,'20180217','sgs....'),
(3,'20180228','sgs....');
GO
WITH Gaps AS(
SELECT *,
DATEDIFF(DAY,LAG(EventDate) OVER (PARTITION BY ID ORDER BY EventDate),EventDate) AS EventGap
FROM SomeEvent)
SELECT ID,
CASE WHEN MAX(EventGap) > 10 THEN 'inactive' ELSE 'active' END AS Label
FROM Gaps
GROUP BY ID
ORDER BY ID;
GO
DROP TABLE SomeEvent;
GO
This assumes you are using SQL Server 2012+, as it uses the LAG function, and SQL Server 2008 has less than 12 months of any kind of support.
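If you are stuck on a version without LAG (anything before SQL Server 2012), here is a rough sketch of the same gap check using OUTER APPLY to fetch the previous event instead; it reuses the SomeEvent table above and should behave the same way, though treat it as a sketch rather than a tested answer:
-- Sketch: emulate LAG with OUTER APPLY for older SQL Server versions.
WITH Gaps AS (
    SELECT e.ID,
           DATEDIFF(DAY, prev.EventDate, e.EventDate) AS EventGap
    FROM SomeEvent e
    OUTER APPLY (SELECT TOP (1) p.EventDate
                 FROM SomeEvent p
                 WHERE p.ID = e.ID
                   AND p.EventDate < e.EventDate
                 ORDER BY p.EventDate DESC) prev
)
SELECT ID,
       CASE WHEN MAX(EventGap) > 10 THEN 'inactive' ELSE 'active' END AS Label
FROM Gaps
GROUP BY ID
ORDER BY ID;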

Try this. Note, replace #MyTable with your actual table.
WITH Diffs AS (
    SELECT
        Id
        -- partition by Id so gaps are never measured across two different Ids
        ,DATEDIFF(DAY, [Date], LEAD([Date]) OVER (PARTITION BY [Id] ORDER BY [Date])) Diff
    FROM #MyTable)
SELECT
    Id
    ,CASE WHEN MAX(Diff) > 10 THEN 'Inactive' ELSE 'Active' END AS Label
FROM Diffs
GROUP BY Id

Just to share another approach (without a CTE).
SELECT
ID
, CASE WHEN SUM(TotalDays) = (MAX(CNT) - 1) THEN 'Active' ELSE 'Inactive' END Label
FROM (
SELECT
ID
, EventDate
, CASE WHEN DATEDIFF(DAY, EventDate, LEAD(EventDate) OVER(PARTITION BY ID ORDER BY EventDate)) <= 10 THEN 1 ELSE 0 END TotalDays -- <= 10, so only gaps of more than 10 days mark an ID as inactive
, COUNT(ID) OVER(PARTITION BY ID) CNT
FROM EventsTable
) D
GROUP BY ID
The method counts how many records each ID has, and builds TotalDays from the date difference (in days) between the current date and the next one: if the difference is 10 days or less it yields 1, otherwise 0.
Then, if the sum of TotalDays equals the number of records for that ID minus one, the ID is labelled Active, otherwise Inactive.
This is just another way of doing it that doesn't use a CTE.

Related

How do I do conditional logic between rows of a bigquery table?

I'm trying to write a query that goes through a table row by row, comparing the current row with the next. Then, if a condition is true, it performs a calculation which is output in a column on the same table, and outputs a null value if the condition is false.
Consider the example above:
Row 8703 will be referred to as Row 1
Row 8704 will be referred to as Row 2
I would like to, if possible, compare Row 1 bookedEnd with Row 2 bookedStart. If they are of equal value (which in this case they are) I would like to subtract Row 2 actualStartdate from Row 1 actualEnddate and output the value in minutes in a separate column named 'difference' on Row 2.
If they are not of equal value (which is true for all the other rows in the example above) I would like to output a null value.
For the above table the extra column named difference would have the row values of:
8701 - Null
8702 - Null
8703 - Null
8704 - 12
8705 - Null
Since you are writing to "Row 2", I use the LAG() function so you are comparing on the row you are writing.
with data as (
  select * from `project.dataset.table`
),
lagged as (
  select
    *,
    lag(bookedEnd, 1) over (partition by roomID order by Row asc) as prev_bookedEnd,
    lag(actualEnddate, 1) over (partition by roomID order by Row asc) as prev_actualEnddate
  from data
)
select
  * except (prev_bookedEnd, prev_actualEnddate),
  case
    when prev_bookedEnd = bookedStart
      then timestamp_diff(prev_actualEnddate, actualStartdate, minute)
    else null
  end as difference
from lagged
What you will want to do in this scenario is use the LEAD function.
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#lead
it would look similar to
SELECT bookedEnd
, CASE WHEN bookedEnd = LEAD(bookedStart) OVER (PARTITION BY roomid ORDER BY Row) then XXXX END as actualStartdate
, CASE WHEN bookedEnd = LEAD(bookedStart) OVER (PARTITION BY roomid ORDER BY Row) then XXXX END as difference
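For what it's worth, here is a hedged sketch of how that LEAD version might be filled in. The XXXX placeholders above are replaced with an assumed TIMESTAMP_DIFF expression, the table and column names are borrowed from the other answers, and note this variant attaches the difference to the earlier row rather than to "Row 2":
SELECT
  *,
  CASE
    WHEN bookedEnd = LEAD(bookedStart) OVER (PARTITION BY roomId ORDER BY Row)
    -- assumption: minutes between the next row's actual start and this row's actual end
    THEN TIMESTAMP_DIFF(LEAD(actualStartdate) OVER (PARTITION BY roomId ORDER BY Row),
                        actualEnddate,
                        MINUTE)
  END AS difference
FROM `project.dataset.table`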
SELECT
*,
IF( LAG(bookedEnd) OVER (PARTITION BY roomId ORDER BY bookedStart) = bookedStart,
TIMESTAMP_DIFF( actualStartdate,
LAG(actualEnddate) OVER (PARTITION BY roomId ORDER BY bookedStart),
MINUTE
),
NULL
) AS difference
FROM `project.dataset.table`

SQL - Count new entries based on last date

I have a table with the follow structure
ID ReportDate Object_id
What I need to know is the count of new and the count of old Object_ids.
For example: If I have the data below:
I want the following output grouped by ReportDate:
I thought of a way of doing it using a WHERE clause based on date; however, I need the data for all the dates I have in the table, to see the count of what already existed in previous reports and what is new in each report. Any ideas?
Edit: New/old definition: "new" means records that never appeared before that report run date and appear in this one, whereas "old" is the number of records that had at least one match on previous dates. I'll edit the post to include this info.
I managed to do it using a LEFT JOIN. Below is my solution in case it helps anyone in the future :)
SELECT table.ReportRunDate,
-1*sum(table.ReportRunDate = new_table.init_date) as count_new,
-1*sum(table.ReportRunDate <> new_table.init_date) as count_old,
count(*) as count_total
FROM table LEFT JOIN
((SELECT Object_ID, min(ReportRunDate) as init_date
FROM table
GROUP By OBJECT_ID) as new_table)
ON table.Object_ID = new_table.Object_ID
GROUP BY ReportRunDate
This would work in Oracle, not sure about ms-access:
SELECT ReportDate
,COUNT(CASE WHEN rnk = 1 THEN 1 ELSE NULL END) count_of_new
,COUNT(CASE WHEN rnk <> 1 THEN 1 ELSE NULL END) count_of_old
FROM (SELECT ID
,ReportDate
,Object_id
,RANK() OVER (PARTITION BY Object_id ORDER BY ReportDate) rnk
FROM table_name)
GROUP BY ReportDate
The inner query ranks each occurrence of Object_id based on the ReportDate, so the first occurrence of a given Object_id gets rank = 1, the next one rank = 2, etc.
Then the outer query counts how many records with rank equal / not equal to 1 there are within each group.
I assumed that 1 object_id can appear only once within each reportDate.
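Since the answer above is unsure about MS Access, here is a hedged sketch of the same new/old counts without analytic functions, using a derived table of first-seen dates. It is written in generic SQL; Access would need IIF() instead of CASE and may prefer the inner SELECT saved as a separate query. Table and column names are taken from the question:
SELECT t.ReportDate,
       SUM(CASE WHEN t.ReportDate = f.first_date THEN 1 ELSE 0 END) AS count_of_new,
       SUM(CASE WHEN t.ReportDate <> f.first_date THEN 1 ELSE 0 END) AS count_of_old
FROM table_name t
INNER JOIN (SELECT Object_id, MIN(ReportDate) AS first_date
            FROM table_name
            GROUP BY Object_id) f
        ON t.Object_id = f.Object_id
GROUP BY t.ReportDate;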

SQL Server 2008 query, time in each status

I'm wondering if anybody can help with a query I am working on. I'm trying to gather information for 'Time in each status' from my call activity table.
I need to set up 3 time ranges in days: <3 days, 4-5 days, 6+ days, returning the number of days each CallID is spending in each status.
The trouble I'm having is that I need to identify from the table below when there was a status change. This table records any activity on the call, e.g. changed customer details, not just status changes.
Apologies if this is unclear, let me know if you need further details.
I'm using SQL Server 2008. Here is the table I'm using and related values:
CREATE TABLE Activity ( CallID varchar(30), Call_Date datetime, [User] varchar(30), Status varchar(10) );
INSERT INTO Activity VALUES (366,'2013/09/27 12:24:33',13,9);
INSERT INTO Activity VALUES (366,'2013/09/28 17:36:14',13,9);
INSERT INTO Activity VALUES (366,'2013/09/29 07:29:18',13,10);
INSERT INTO Activity VALUES (366,'2013/09/30 06:22:12',13,-1);
INSERT INTO Activity VALUES (367,'2013/09/27 12:13:16',9,6);
INSERT INTO Activity VALUES (367,'2013/09/27 12:25:03',9,6);
INSERT INTO Activity VALUES (367,'2013/09/29 12:25:29',9,6);
INSERT INTO Activity VALUES (367,'2013/09/30 12:45:55',9,7);
INSERT INTO Activity VALUES (367,'2013/10/01 12:46:04',9,8);
INSERT INTO Activity VALUES (367,'2013/10/02 15:12:27',9,-1);
INSERT INTO Activity VALUES (368,'2013/08/01 15:09:01',5,10);
INSERT INTO Activity VALUES (368,'2013/08/02 14:11:20',5,13);
INSERT INTO Activity VALUES (368,'2013/08/04 16:41:11',5,13);
INSERT INTO Activity VALUES (368,'2013/08/05 01:12:56',5,-1);
Desired Output 1: E.g. if CallID 35931 took 2 days to change from status 1 to status 2, 2 days would be added to the count in the <3 column
Status <3 Days 4-5 days 6+ Days
------ ------- -------- -------
1 10 3 1
2 8 1 2
3 5 3 1
I'm stuck in the first stage trying to identify the rows where there are status changes and ignoring the rest. I'm working on a subquery which selects the top date for each change of status. It's bringing back negative values. See here:
select CallID, T2.[status], Call_Date,
sum(datediff(dd, nextDate, [Call_Date]) - (datediff(wk, nextDate, [Call_Date]) * 2) -
case when datepart(wk, nextDate) = 1 then 1 else 0 end +
case when datepart(wk, [Call_Date]) = 7 then 1 else 0 end) as TotalDays
from (select *,
(select MAX( T0.[Call_Date])
from [Activity] T0
where T0.[Call_Date] > T1.[Call_Date] and
T0.CallID = T1.CallID
) as nextDate
from [Activity] T1
) T2
where T2.[status] <> '-1'
group by Call_Date, T2.[status], CallID
Thanks for your help in advance.
First of all, I think you need only the rows with the minimum date for each ID and status, as those are the ones that mark a status change. This can be done with a CTE and ROW_NUMBER.
Then you should join the results so that the same record holds both the old status date and the new status date. The first status of each call will have NULLs for the old side.
;WITH CallsCTE AS
(
SELECT CallId,
Call_Date,
Status,
ROW_NUMBER() OVER(PARTITION BY CallId, Status ORDER BY Call_Date) AS rn
FROM Activity
),
StatusChangesCTE AS
(
SELECT CallID,
Call_Date,
Status
FROM CallsCTE
WHERE rn = 1
)
SELECT Sold.*,
Snew.*
FROM StatusChangesCTE Snew
LEFT JOIN StatusChangesCTE Sold
ON Snew.CallID = Sold.CallID
AND Sold.Call_Date = (SELECT MAX(Call_Date) FROM StatusChangesCTE WHERE CallID = Sold.CallID AND Call_Date < Snew.Call_Date)
I think that you can find your way using the above, as you could use DateDiff on Snew.Call_Date and Sold.Call_Date to find the time needed for a status change.
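Taking that one step further, here is a hedged sketch of how the day buckets from the question could be built on top of the same CTEs. Whether a gap should be summed as days or counted as one change is an assumption here (days are summed, matching the "2 days would be added" example), and gaps of exactly 3 days fall into none of the question's three ranges:
;WITH CallsCTE AS
(
    SELECT CallID, Call_Date, Status,
           ROW_NUMBER() OVER (PARTITION BY CallID, Status ORDER BY Call_Date) AS rn
    FROM Activity
),
StatusChangesCTE AS
(
    SELECT CallID, Call_Date, Status
    FROM CallsCTE
    WHERE rn = 1
),
Durations AS
(
    -- days spent in the old status = days until the next status change on the same call
    SELECT Sold.Status,
           DATEDIFF(DAY, Sold.Call_Date, Snew.Call_Date) AS DaysInStatus
    FROM StatusChangesCTE Snew
    JOIN StatusChangesCTE Sold
      ON Snew.CallID = Sold.CallID
     AND Sold.Call_Date = (SELECT MAX(Call_Date)
                           FROM StatusChangesCTE
                           WHERE CallID = Sold.CallID
                             AND Call_Date < Snew.Call_Date)
)
SELECT Status,
       SUM(CASE WHEN DaysInStatus < 3             THEN DaysInStatus ELSE 0 END) AS [<3 Days],
       SUM(CASE WHEN DaysInStatus BETWEEN 4 AND 5 THEN DaysInStatus ELSE 0 END) AS [4-5 Days],
       SUM(CASE WHEN DaysInStatus >= 6            THEN DaysInStatus ELSE 0 END) AS [6+ Days]
FROM Durations
GROUP BY Status
ORDER BY Status;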
Let me know if you need any more assistance.

Datediff between two tables

I have those two tables
1-Add to queue table
TransID , ADD date
10 , 10/10/2012
11 , 14/10/2012
11 , 18/11/2012
11 , 25/12/2012
12 , 1/1/2013
2-Removed from queue table
TransID , Removed Date
10 , 15/1/2013
11 , 12/12/2012
11 , 13/1/2013
11 , 20/1/2013
The TransID is the key between the two tables, and I can't modify those tables. What I want is to query the amount of time each transaction spent in the queue.
It's easy when there is one item in each table, but when an item gets queued more than once, how do I calculate that?
Assuming the order TransIDs are entered into the Add table is the same order they are removed, you can use the following:
WITH OrderedAdds AS
( SELECT TransID,
AddDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY AddDate)
FROM AddTable
), OrderedRemoves AS
( SELECT TransID,
RemovedDate,
[RowNumber] = ROW_NUMBER() OVER(PARTITION BY TransID ORDER BY RemovedDate)
FROM RemoveTable
)
SELECT OrderedAdds.TransID,
OrderedAdds.AddDate,
OrderedRemoves.RemovedDate,
[DaysInQueue] = DATEDIFF(DAY, OrderedAdds.AddDate, ISNULL(OrderedRemoves.RemovedDate, CURRENT_TIMESTAMP))
FROM OrderedAdds
LEFT JOIN OrderedRemoves
ON OrderedAdds.TransID = OrderedRemoves.TransID
AND OrderedAdds.RowNumber = OrderedRemoves.RowNumber;
The key part is that each record gets a rownumber based on the transaction id and the date it was entered, you can then join on both rownumber and transID to stop any cross joining.
Example on SQL Fiddle
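For the sample data in the question (reading the dates as DD/MM/YYYY), the row-number pairing would produce something along these lines; TransID 12 has no removal yet, so it is measured against the current date:
TransID | AddDate    | RemovedDate | DaysInQueue
--------+------------+-------------+------------
10      | 2012-10-10 | 2013-01-15  | 97
11      | 2012-10-14 | 2012-12-12  | 59
11      | 2012-11-18 | 2013-01-13  | 56
11      | 2012-12-25 | 2013-01-20  | 26
12      | 2013-01-01 | NULL        | (up to the current date)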
DISCLAIMER: There is probably a problem with this, but I hope it sends you in one possible direction. Expect issues.
You can try something in the following direction (which might work in some form depending on your system, version, etc.):
-- UNIX_TIMESTAMP returns seconds, so dividing by 60*60*24 converts the difference to days
SELECT transId, (SUM(remove_date_sum) - SUM(add_date_sum)) / (60*60*24) AS days_in_queue
FROM
(
    SELECT transId, SUM(UNIX_TIMESTAMP(add_date)) AS add_date_sum, 0 AS remove_date_sum
    FROM add_to_queue
    GROUP BY transId
    UNION ALL
    SELECT transId, 0 AS add_date_sum, SUM(UNIX_TIMESTAMP(remove_date)) AS remove_date_sum
    FROM remove_from_queue
    GROUP BY transId
) q
GROUP BY transId;
A bit of explanation: as far as I know, you cannot sum dates, but you can convert them to some sort of timestamp. Check if UNIX_TIMESTAMP works for you, or figure out something else. Then you can sum within each table, build a union while conveniently leaving the other column as zero, and subtract the aggregated sums.
As for that division at the end of the first SELECT: UNIX_TIMESTAMP returns seconds, so you divide by 60*60*24 to get days, or whatever unit it is that you want.
This all said - I would probably solve this using a stored procedure or some client script. SQL is not a weapon for every battle. Making two separate queries can be much simpler.
Answer 2: after your comments. (As a side note, some of your dates 15/1/2013,13/1/2013 do not represent proper date formats )
select transId, sum(numberOfDays) totalQueueTime
from (
    select a.transId,
           datediff(day, a.addDate, isnull(r.removeDate, a.addDate)) numberOfDays
    from AddTable a
    left join RemoveTable r on a.transId = r.transId
    -- note: ORDER BY is not allowed inside a derived table, so order the outer query instead
) X
group by transId
order by transId
Answer 1: before your comments
Assuming that a new record won't be added for a transaction unless the previous one has been removed. Also note the following query will return numberOfDays as zero for unremoved records:
select a.transId, a.addDate, r.removeDate,
datediff(day,a.addDate,isnull(r.removeDate,a.addDate)) numberOfDays
from AddTable a left join RemoveTable r on a.transId = r.transId
order by a.transId, a.addDate, r.removeDate

SQL query group by nearby timestamp

I have a table with a timestamp column. I would like to be able to group by an identifier column (e.g. cusip), sum over another column (e.g. quantity), but only for rows that are within 30 seconds of each other, i.e. not in fixed 30 second bucket intervals. Given the data:
cusip| quantity| timestamp
============|=========|=============
BE0000310194| 100| 16:20:49.000
BE0000314238| 50| 16:38:38.110
BE0000314238| 50| 16:46:21.323
BE0000314238| 50| 16:46:35.323
I would like to write a query that returns:
cusip| quantity
============|=========
BE0000310194| 100
BE0000314238| 50
BE0000314238| 100
Edit:
In addition, it would greatly simplify things if I could also get the MIN(timestamp) out of the query.
From Sean G's solution, I have removed the GROUP BY on the complete table and readjusted a few parts for Oracle SQL.
First, after finding the previous time, assign a self parent id. If the previous time is NULL, we exclude that row from getting an id.
Then take the nearest non-null self parent id, so that all events within 30 seconds for a cusip fall under one group.
As there is a CUSIP column, I assumed the dataset would be large market transactional data. Instead of using GROUP BY on the complete table, partition by CUSIP and the final group parent id for better performance.
SELECT
    id,
    sub.parent_id,
    sub.cusip,
    timestamp,
    quantity,
    SUM(sub.quantity) OVER (PARTITION BY cusip, parent_id) sum_quantity,
    MIN(sub.timestamp) OVER (PARTITION BY cusip, parent_id) min_timestamp
FROM
    (
        SELECT
            base_sub.*,
            CASE
                WHEN base_sub.self_parent_id IS NOT NULL THEN
                    base_sub.self_parent_id
                ELSE
                    LAG(base_sub.self_parent_id) IGNORE NULLS OVER (
                        PARTITION BY cusip
                        ORDER BY timestamp, id
                    )
            END parent_id
        FROM
            (
                SELECT
                    c.*,
                    CASE
                        WHEN nvl(abs(EXTRACT(SECOND FROM to_timestamp(previous_timestamp, 'yyyy/mm/dd hh24:mi:ss')
                                              - to_timestamp(timestamp, 'yyyy/mm/dd hh24:mi:ss'))), 31) > 30 THEN
                            id
                        ELSE
                            NULL
                    END self_parent_id
                FROM
                    (
                        SELECT
                            my_table.id,
                            my_table.cusip,
                            my_table.timestamp,
                            my_table.quantity,
                            LAG(my_table.timestamp) OVER (
                                PARTITION BY my_table.cusip
                                ORDER BY my_table.timestamp, my_table.id
                            ) previous_timestamp
                        FROM
                            my_table
                    ) c
            ) base_sub
    ) sub
Following may be helpful to you.
Grouping into 30-second periods starting from a given time, here '2012-01-01 00:00:00'. DATEDIFF counts the number of seconds between the timestamp value and the starting time, which is then divided by 30 to get the grouping column.
SELECT MIN(TimeColumn) AS TimeGroup, SUM(Quantity) AS TotalQuantity FROM YourTable
GROUP BY (DATEDIFF(ss, TimeColumn, '2012-01-01') / 30)
Here the minimum timestamp of each group is output as TimeGroup, but you could use the maximum instead, or convert the grouping column value back to a time for display.
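To also bring in the cusip grouping and the MIN(timestamp) asked for in the edit, here is a hedged sketch along the same lines; YourTable and the column names are assumptions, and note these are fixed 30-second buckets rather than the rolling 30-second windows the question describes, so results can differ near bucket edges:
SELECT cusip,
       MIN([timestamp]) AS min_timestamp,  -- earliest timestamp in the bucket
       SUM(quantity)    AS quantity
FROM YourTable
GROUP BY cusip,
         DATEDIFF(ss, '2012-01-01', [timestamp]) / 30;  -- fixed 30-second bucket index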
Looking at the above comments, I'm assuming Chris's first scenario is the one you want (all 3 get grouped even though values 1 and 3 are not within 30 seconds of each other, but are each within 30 seconds of value 2). Also going to assume that each row in your table has some unique ID called 'id'. You can do the following:
Create a new grouping, determining if the preceding row in your partition is more than 30 seconds behind the current row (e.g. determine if you need a new 30 second grouping, or to continue the previous). We'll call that parent_id.
Sum quantity over parent_id (plus any other aggregations)
The code could look like this
select
    sub.parent_id,
    sub.cusip,
    min(sub.timestamp) min_timestamp,
    sum(sub.quantity) quantity
from
    (
        select
            base_sub.*,
            case
                when base_sub.self_parent_id is not null
                    then base_sub.self_parent_id
                else lag(base_sub.self_parent_id) ignore nulls over (
                         partition by base_sub.cusip
                         order by base_sub.timestamp, base_sub.id
                     )
            end parent_id
        from
            (
                select
                    my_table.id,
                    my_table.cusip,
                    my_table.timestamp,
                    my_table.quantity,
                    lag(my_table.timestamp) over (
                        partition by my_table.cusip
                        order by my_table.timestamp, my_table.id
                    ) previous_timestamp,
                    -- the previous_timestamp alias can't be referenced in the same
                    -- select list, so the LAG expression is repeated inside the CASE
                    case
                        when datediff(
                                 second,
                                 nvl(lag(my_table.timestamp) over (
                                         partition by my_table.cusip
                                         order by my_table.timestamp, my_table.id
                                     ),
                                     to_date('1900/01/01', 'yyyy/mm/dd')),
                                 my_table.timestamp) > 30
                            then my_table.id
                        else null
                    end self_parent_id
                from
                    my_table
            ) base_sub
    ) sub
group by
    sub.parent_id,
    sub.cusip