Possible to calculate iterated count of timestamps relative to one another? - sql

This question is a bit complicated but to make it as simple as possible:
I have a list of timestamps (there are millions of them, but for simplicity's sake let's say it is much smaller):
order_times
-----------
2014-10-11 15:00:00
2014-10-11 15:02:00
2014-10-11 15:03:31
2014-10-11 15:07:00
2014-10-11 16:00:00
2014-10-11 16:04:00
I am trying to build a query (in PostgreSQL) that would allow me to determine the number of times an order_time occurs within 10 minutes of the 2 order_times prior to it (and no more).
In the sample data above:
the first timestamp is counted as 0, as there were no orders before it
the second timestamp is counted as 0: it is within 10 minutes of the one prior, but there was only 1 order before it
the third timestamp is counted as 1, because there were at least 2 orders within 10 minutes of it
(and so on)...
I hope this is clear!

You don't need to look at the immediately previous order, just the one 2 prior to each. If that one is within 10 minutes, then the one after it will be as well.
The best way is to get the data that is important to you into a single row, so you can do set operations on it. For that, use the window function ROW_NUMBER() and a self join. This is the MS SQL way of doing what you want.
WITH T1 AS (
    SELECT ID, Order_Time,
           ROW_NUMBER() OVER (ORDER BY Order_Time) AS RowNumber
    FROM myTest)
SELECT T1.ID, T1.Order_Time, T2.ID AS CompareID, T2.Order_Time AS CompareTime
FROM T1
LEFT OUTER JOIN T1 AS T2 ON T1.RowNumber - 2 = T2.RowNumber
WHERE DATEDIFF(n, T2.Order_Time, T1.Order_Time) <= 10
First we create a query that has the row numbers, then use it as an inline table to do a self join, building a row that contains each order together with the one that happened 2 orders prior to it. Then a simple date comparison selects the rows you want.
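Since the question asks for PostgreSQL, it's worth noting the same idea needs no self join there: the lag() window function can fetch the order 2 rows back directly. A minimal sketch, assuming the same myTest table; within_10_min is a column name introduced here for illustration:

SELECT order_time,
       CASE WHEN order_time - lag(order_time, 2) OVER (ORDER BY order_time)
                 <= interval '10 minutes'
            THEN 1 ELSE 0
       END AS within_10_min  -- 1 when the order 2 rows back is within 10 minutes
FROM myTest;

The first two rows get 0 automatically, since lag() returns NULL for them and the CASE falls through to ELSE; summing within_10_min gives the overall count.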

Related

Dynamically generating date range starts in SQL

Imagine you have a set of dates. You want any date which is within X days of the lowest date to be "merged" into that date. Then you want to repeat until you have merged all date points.
For example:
ID  DatePoints
1   2023-01-01
2   2023-01-02
3   2023-01-12
4   2023-01-21
5   2023-02-01
6   2023-02-02
7   2023-03-01
If you applied this rule to this data using 10 days as your X, you would end up with this output:
DateRangeStarts
2023-01-01
2023-01-12
2023-02-01
2023-03-01
IDs 1 and 2 into range 1, IDs 3 and 4 into range 2, IDs 5 and 6 into range 3, and ID 7 into range 4.
Is there any way to do this without a loop? Answer can work in SQL Server or BigQuery. Thanks
You could consider something like the following. It's not pretty and I'm not at all confident it is the best solution, but I do think it works. Maybe it's a good starting point for you to work from.
WITH cte AS
(
    SELECT min(datepoint) datepoint
    FROM test
    UNION ALL
    SELECT min(t.datepoint) OVER() datepoint
    FROM test t
    CROSS APPLY (SELECT max(cte.datepoint) OVER() md FROM cte) c
    WHERE t.datepoint > DATEADD(DAY, 10, c.md)
)
SELECT DISTINCT datepoint
FROM cte
ORDER BY datepoint
(You might want to change the > to a >=, depending on what counts as within X days.)
The basic idea is to get the minimum date from your table into the cte, then recursively get the minimum date from your table that is bigger than the current maximum date in the cte + X days.
It gets messy because of the limitations SQL Server places on recursive CTEs. They can't be used in subqueries, with normal OUTER JOINs, or with aggregate functions. Therefore, I use CROSS APPLY and the window versions of min/max. This gets the correct result, but multiple times, so I'm forced to use DISTINCT to clean it up afterward.
Depending on your data, it might be better to do a loop anyway, but I think this is an option to consider.
Here's a Fiddle of it working.
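For reference, the loop being avoided is itself quite short. A minimal T-SQL sketch, assuming the same test(datepoint) table; @starts is a table variable introduced here for illustration:

DECLARE @starts TABLE (datepoint date);
DECLARE @cur date = (SELECT min(datepoint) FROM test);

WHILE @cur IS NOT NULL
BEGIN
    INSERT INTO @starts (datepoint) VALUES (@cur);
    -- jump to the first datepoint more than 10 days past the current range start
    SET @cur = (SELECT min(datepoint) FROM test
                WHERE datepoint > DATEADD(DAY, 10, @cur));
END;

SELECT datepoint FROM @starts ORDER BY datepoint;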

Rolling count of rows within time interval [duplicate]

This question already has answers here:
Window Functions or Common Table Expressions: count previous rows within range
For an analysis I need to aggregate the rows of a single table depending on their creation time. Basically, I want to know the count of orders that have been created within a certain period of time before the current order. Can't seem to find the solution to this.
Table structure:
order_id    time_created
1           00:00
2           00:01
3           00:03
4           00:05
5           00:10
Expected result:
order_id    count within 3 seconds
1           1
2           2
3           3
4           2
5           1
Sounds like an application for window functions. But, sadly, that's not the case before PostgreSQL 11: there, window frames can only be based on row counts, not on actual column values.
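PostgreSQL 11 and later do support RANGE frames with value offsets, so on a current version a pure window-function query is possible. A minimal sketch, assuming time_created is a timestamp:

SELECT order_id
     , count(*) OVER (ORDER BY time_created
                      RANGE BETWEEN interval '3 sec' PRECEDING
                                AND CURRENT ROW) AS count_within_3_sec
FROM   tbl
ORDER  BY order_id;

Note that RANGE mode treats rows with identical time_created as peers, so exact ties are all included in the frame.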
A simple query with LEFT JOIN can do the job:
SELECT t0.order_id
     , count(t1.time_created) AS count_within_3_sec
FROM   tbl t0
LEFT   JOIN tbl t1 ON t1.time_created BETWEEN t0.time_created - interval '3 sec'
                                          AND t0.time_created
GROUP  BY 1
ORDER  BY 1;
db<>fiddle here
Does not work with the data type time like in your minimal demo, as that wraps around at midnight. I suppose it's reasonable to assume timestamp or timestamptz.
Since you include each row itself in the count, an INNER JOIN would work, too. (LEFT JOIN is still more reliable in the face of possible NULL values.)
Or use a LATERAL subquery and you don't need to aggregate on the outer query level:
SELECT t0.order_id
     , t1.count_within_3_sec
FROM   tbl t0
LEFT   JOIN LATERAL (
   SELECT count(*) AS count_within_3_sec
   FROM   tbl t1
   WHERE  t1.time_created BETWEEN t0.time_created - interval '3 sec'
                              AND t0.time_created
   ) t1 ON true
ORDER  BY 1;
Related:
Rolling sum / count / average over date interval
For big tables and many rows in the time frame, a procedural solution that walks through the table once will perform better. Like:
Window Functions or Common Table Expressions: count previous rows within range
Alternatives to broken PL/ruby: convert a warehouse journal table
GROUP BY and aggregate sequential numeric values

Find list of dates in a table closest to specific date from different table.

I have a list of unique IDs in one table that has a date column. Example:
TABLE1
ID Date
0 2018-01-01
1 2018-01-05
2 2018-01-15
3 2018-01-06
4 2018-01-09
5 2018-01-12
6 2018-01-15
7 2018-01-02
8 2018-01-04
9 2018-02-25
Then in another table I have a list of different values that appear multiple times for each ID with various dates.
TABLE 2
ID Value Date
0 18 2017-11-28
0 24 2017-12-29
0 28 2018-01-06
1 455 2018-01-03
1 468 2018-01-16
2 55 2018-01-03
3 100 2017-12-27
3 110 2018-01-04
3 119 2018-01-10
3 128 2018-01-30
4 223 2018-01-01
4 250 2018-01-09
4 258 2018-01-11
etc
I want to find the value in table 2 that is closest to the unique date in table 1.
Sometimes table 2 does contain a value that matches the date exactly and I have had no problem in pulling through those values. But I can't work out the code to pull through the value closest to the date requested from table 1.
My desired result based on the examples above would be
ID Value Date
0 24 2017-12-29
1 455 2018-01-03
2 55 2018-01-03
3 110 2018-01-04
4 250 2018-01-09
Since I can easily find the IDs with an exact match, one thing I have tried is taking the IDs that don't have an exact date match and placing them with their corresponding values into a temporary table, then trying to find the closest possible match from there. But it's here that I'm not sure where to begin on the coding.
Apologies if I'm missing a basic function or clause for this, I'm still learning!
The below would be one method:
WITH Table1 AS(
    SELECT ID, CONVERT(date, datecolumn) DateColumn
    FROM (VALUES (0,'20180101'),
                 (1,'20180105'),
                 (2,'20180115'),
                 (3,'20180106'),
                 (4,'20180109'),
                 (5,'20180112'),
                 (6,'20180115'),
                 (7,'20180102'),
                 (8,'20180104'),
                 (9,'20180225')) V(ID, DateColumn)),
Table2 AS(
    SELECT ID, [Value], CONVERT(date, datecolumn) DateColumn
    FROM (VALUES (0,18 ,'2017-11-28'),
                 (0,24 ,'2017-12-29'),
                 (0,28 ,'2018-01-06'),
                 (1,455,'2018-01-03'),
                 (1,468,'2018-01-16'),
                 (2,55 ,'2018-01-03'),
                 (3,100,'2017-12-27'),
                 (3,110,'2018-01-04'),
                 (3,119,'2018-01-10'),
                 (3,128,'2018-01-30'),
                 (4,223,'2018-01-01'),
                 (4,250,'2018-01-09'),
                 (4,258,'2018-01-11')) V(ID, [Value], DateColumn))
SELECT T1.ID,
       T2.[Value],
       T2.DateColumn
FROM Table1 T1
CROSS APPLY (SELECT TOP 1 *
             FROM Table2 ca
             WHERE T1.ID = ca.ID
             ORDER BY ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn))) T2;
Note that if the difference in days is the same, the row returned is arbitrary (and could differ each time the query is run). For example, if Table1 had the date 20180804 and Table2 had the dates 20180803 and 20180805, they would both have the value 1 for ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn)). You therefore might need to include additional logic in your ORDER BY to ensure consistent results.
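For instance, a deterministic tie-breaker (preferring the later of two equally close dates; that preference is an arbitrary choice made here):

ORDER BY ABS(DATEDIFF(DAY, ca.DateColumn, T1.DateColumn)), ca.DateColumn DESC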
I'll say a couple of things here for you to consider, since SQL Server is not my comfort zone, while SQL itself is.
First of all, I'd join TABLE1 with TABLE2 per ID. That way, I can specify in my SELECT clause the following tuple:
SELECT ID, Value, DateDiff(d, T1.Date, T2.Date) qt_diff_days
Obviously, depending on the precision of the dates kept there, whether they have times or not, you can change the datepart argument of the DateDiff function.
Going forward, I'd also make this date difference an absolute number (to resolve positive / negative differences and consider only the elapsed time).
After that, and that's where it gets tricky because I don't know which SQL Server version you're using, I'd basically use the ROW_NUMBER window function to rank all my lines by that difference. Something like the following:
SELECT
    T1.ID, T2.Value, Abs(DateDiff(d, T1.Date, T2.Date)) qt_diff_days,
    ROW_NUMBER() OVER(PARTITION BY T1.ID ORDER BY Abs(DateDiff(d, T1.Date, T2.Date)) ASC) nu_row
FROM TABLE1 T1
INNER JOIN TABLE2 T2 ON T2.ID = T1.ID
ROW_NUMBER (Transact-SQL)
Numbers the output of a result set. More specifically, returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.
If you run ROW_NUMBER properly, you should notice the query ranks its data per ID, starting with 1 and increasing the ranking with the difference between both dates, resetting the rank to 1 when the ID changes.
After that, all you need to do is select only those lines where nu_row equals 1. I'd use a CTE for that.
WITH common_table_expression (Transact-SQL)
Specifies a temporary named result set, known as a common table expression (CTE).
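Putting those pieces together, a minimal sketch of the whole query (assuming the TABLE1 and TABLE2 layouts from the question; the ranked CTE name is introduced here for illustration):

WITH ranked AS (
    -- the ROW_NUMBER query from above
    SELECT T1.ID, T2.Value,
           ROW_NUMBER() OVER(PARTITION BY T1.ID
                             ORDER BY Abs(DateDiff(d, T1.Date, T2.Date)) ASC) nu_row
    FROM TABLE1 T1
    INNER JOIN TABLE2 T2 ON T2.ID = T1.ID
)
SELECT ID, Value  -- keep only the closest row per ID
FROM ranked
WHERE nu_row = 1;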

SQL statement to match dates that are the closest?

I have the following table, let's call it Names:
Name Id Date
Dirk 1 27-01-2015
Jan 2 31-01-2015
Thomas 3 21-02-2015
Next I have the another table called Consumption:
Id Date Consumption
1 26-01-2015 30
1 01-01-2015 20
2 01-01-2015 10
2 05-05-2015 20
I think that doing this in SQL is the fastest option, since the table contains about 1.5 million rows.
The problem is as follows: I would like to match each Id from the Names table with the Consumption table, provided that the difference between the dates is the smallest. So we have: Dirk consumes on 27-01-2015 about 30. In case there are two dates that have the same difference, I would like to calculate the average consumption on those two dates.
While I know how to join, I do not know how to code the difference part.
Thanks.
DBMS is Microsoft SQL Server 2012.
I believe that my question differs from the one mentioned in the comments, because it is much more complicated since it involves comparison of dates between two tables rather than having one date and comparing it with the rest of the dates in the table.
This is how you could do it in SQL Server:
SELECT Id, Name, AVG(Consumption)
FROM (
    SELECT n.Id, Name, Consumption,
           RANK() OVER (PARTITION BY n.Id
                        ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date]))) AS rnk
    FROM Names AS n
    INNER JOIN Consumption AS c ON n.Id = c.Id ) t
WHERE t.rnk = 1
GROUP BY Id, Name
Using RANK with PARTITION BY n.Id and ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date])) you can locate all matching records per Id: all records with the smallest difference in days are going to have rnk = 1.
Then, using AVG in the outer query, you are calculating the average value of Consumption between all matching records.
SQL Fiddle Demo

Running a query over past date ranges

I have a rather interesting problem which I first thought would be straight-forward, but it turned out to be more complicated.
I have data like this:
Date User ID
2012-10-11 a
2012-10-11 b
2012-10-12 c
2012-10-12 d
2012-10-13 e
2012-10-14 b
2012-10-14 e
... ...
Each row has a Date, User ID couple which indicates that that user was active on that day. A user can appear on multiple dates and a date will have multiple users -- just like in the example. I have millions of rows like this which cover a time range of about 90 days.
Here's the question: For each day, I want to get the number of users who have not been active for the past 10 days. For instance, if the user "a" was active on 2012-05-31 but hasn't been active on any of the days between 06-01 and 06-10, I want to count this user on 06-10. I wouldn't count him again on the following days, though, unless he becomes active and disappears again.
Can I do this in SQL, or would I need some kind of script to organize the data the way I want? What would be your recommendations? I use Hive.
Thank you so much!
I think you can do this in Hive-compatible SQL. Here is the idea:
For each user/date, get the next active date for the user.
Discard the original record if the next date is less than 10 days after the current one.
Add 10 days to the date.
Aggregate and count.
I am not sure of all the Hive functions for things like date. Here is an example of how to do it:
select date + 10, count(*)
from (select t.userid, t.date,
             min(case when tnext.date > t.date then tnext.date end) as nextdate
      from t left outer join
           t tnext
           on t.userid = tnext.userid
      group by t.userid, t.date
     ) t
where nextdate is null or nextdate - date >= 10
group by date + 10;
Note that the inner subquery would be better written using:
on t.userid = tnext.userid and tnext.date > t.date
However, I don't know if Hive supports such a join (it doesn't support non-equijoins, and it is not clear whether one or all clauses have to be equalities).
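Hive has had window functions since 0.11, so if they are available, the next-date lookup can be done with lead() instead of the self join. A minimal sketch under that assumption, mirroring the logic above (inactive_day is a name introduced here for illustration):

select date_add(`date`, 10) as inactive_day, count(*) as cnt
from (select userid, `date`,
             -- next date on which this user was active, per user, in date order
             lead(`date`) over (partition by userid order by `date`) as nextdate
      from t) s
where nextdate is null or datediff(nextdate, `date`) >= 10
group by date_add(`date`, 10);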