Query to find all timestamps more than a certain interval apart - sql

I'm using Postgres to run some analytics on user activity. I have a table of all requests (pageviews) made by every user, with the timestamp of each request, and I'm trying to find the number of distinct sessions for every user. For the sake of simplicity, I'm treating every set of requests an hour or more apart from the rest as a distinct session. The data looks something like this:
id | request_time               | user_id
---+----------------------------+--------
 1 | 2014-01-12 08:57:16.725533 | 1233
 2 | 2014-01-12 08:57:20.944193 | 1234
 3 | 2014-01-12 09:15:59.713456 | 1233
 4 | 2014-01-12 10:58:59.713456 | 1234
How can I write a query to get the number of sessions per user?

To start a new session after every gap >= 1 hour:
SELECT user_id, count(*) AS distinct_sessions
FROM (
   SELECT user_id
        , (lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
                                                   ORDER BY request_time)
           <= request_time - '1h'::interval) AS step  -- start new session
   FROM   tbl
   ) sub
WHERE  step
GROUP  BY user_id
ORDER  BY user_id;
Assuming request_time NOT NULL.
Explanation:
In the subquery sub, check for every row whether a new session begins. The third parameter of lag() provides the default -infinity, which is lower than any timestamp and therefore always starts a new session for a user's first row.
In the outer query, count how many times a new session started: eliminate rows with step = FALSE and count per user.
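As a hedged extension of the same idea (assuming PostgreSQL 9.4+ for the FILTER clause; table and column names as above), a running count of the step flag turns session starts into an explicit session number per request:

SELECT user_id, request_time
     , count(*) FILTER (WHERE step) OVER (PARTITION BY user_id
                                          ORDER BY request_time) AS session_no
FROM (
   SELECT user_id, request_time
        , (lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
                                                   ORDER BY request_time)
           <= request_time - '1h'::interval) AS step
   FROM   tbl
   ) sub
ORDER  BY user_id, request_time;  -- session_no runs 1, 2, ... per user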
Alternative interpretation
If you really wanted to count hours where at least one request happened (I don't think you do, but another answer assumes as much), you would:
SELECT user_id
     , count(DISTINCT date_trunc('hour', request_time)) AS hours_with_req
FROM   tbl
GROUP  BY 1
ORDER  BY 1;

Related

Azure Stream Analytics - Find Most Recent `n` Events Within Time Interval

I am working with Azure Stream Analytics and, to illustrate my situation, I have streaming events corresponding to buy(+)/sell(-) orders from users of a certain amount. So, key fields in an individual event look like: {UserId: 'u12345', Type: 'Buy', Amt: 14.0}.
I want to write a query which outputs UserIds and the sum of Amt for the most recent (up to) 5 events within a sliding 24 hr period, partitioned by UserId.
To clarify:
If there are more than 5 events for a given UserId in the last 24 hours, I only want the sum of Amt for the most recent 5.
If there are fewer than 5 events, I want either the UserId to be omitted or the sum of the Amt of the events that do exist.
I've tried looking at LIMIT DURATION predicates, but there doesn't seem to be a way to limit the number of events as well as filter on time while PARTITION'ing by UserId. Has anyone done something like this?
Considering the comments, I think this should work:
WITH Last5 AS (
    SELECT
        UserId,
        System.Timestamp() AS windowEnd,
        COLLECTTOP(5) OVER (ORDER BY CAST(EventEnqueuedUtcTime AS DATETIME) DESC) AS Top5
    FROM input1
    TIMESTAMP BY EventEnqueuedUtcTime
    GROUP BY
        SlidingWindow(hour, 24),
        UserId
    HAVING COUNT(*) >= 5 -- we want at least 5
)
SELECT
    L.UserId,
    System.Timestamp() AS ts,
    SUM(C.ArrayValue.value.Amt) AS sumAmt
INTO myOutput
FROM Last5 AS L
CROSS APPLY GetArrayElements(L.Top5) AS C
GROUP BY
    System.Timestamp(), -- snapshot window
    L.UserId
We use a CTE to first build the sliding 24h window. In there we both filter to retain only windows of at least 5 records (HAVING COUNT(*) >= 5) and collect only the most recent 5 of them (COLLECTTOP(5) OVER ...). Note that I had to TIMESTAMP BY and CAST my own timestamp when testing the query; you may not need that in your case.
Next we need to unpack the collected records, which is done via CROSS APPLY GetArrayElements, and sum them. I use a snapshot window for that, as I don't need time grouping at that stage.
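To cover the other branch of the question (users with fewer than 5 events in the window), one sketch is simply to drop the HAVING clause; COLLECTTOP(5) already caps the collection at the 5 most recent events, so smaller windows just sum whatever exists:

WITH Last5 AS (
    SELECT
        UserId,
        System.Timestamp() AS windowEnd,
        COLLECTTOP(5) OVER (ORDER BY CAST(EventEnqueuedUtcTime AS DATETIME) DESC) AS Top5
    FROM input1
    TIMESTAMP BY EventEnqueuedUtcTime
    GROUP BY
        SlidingWindow(hour, 24),
        UserId
    -- no HAVING: windows holding 1-4 events are kept too
)
-- the outer SELECT with CROSS APPLY GetArrayElements(L.Top5) stays unchanged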
Please let me know if you need more details.

SQL query to get bounce rate based on session id and datetime

We have a table with 3 columns: URL of page visited, user session ID, and datetime.
Based on this information we have to generate a result with 2 columns: Date (unique) and Bounce Rate.
It is clear that we need to look for single occurrences of a session ID: if there are 2 entries for the same session ID, it means the user hit another page and didn't bounce, but a single entry means the user bounced.
I cannot work out the SQL query for this. I tried grouping the data by session ID and date but couldn't get the result in the required format.
Can anyone do this?
If you want the number of sessions with only one page per day, you can use aggregation:
select dte,
       avg( (num_pages = 1)::int ) as bounce_rate
from (select sessionid, min(datetime)::date as dte, count(*) as num_pages
      from t
      group by sessionid
     ) t
group by dte;
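An equivalent sketch (same assumed table t; the FILTER clause needs PostgreSQL 9.4+) counts the single-page sessions explicitly instead of averaging a flag. Either way, each session is attributed to the day of its first pageview via min(datetime)::date:

select dte,
       (count(*) filter (where num_pages = 1))::numeric / count(*) as bounce_rate
from (select sessionid, min(datetime)::date as dte, count(*) as num_pages
      from t
      group by sessionid
     ) s
group by dte;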

SQL Query to find out Sequence in next or subsequent rows of a Table based on a condition

I have a SQL table with the following structure:
Timestamp (DATETIME) | AuditEvent
---------------------|-----------
T1                   | Login
T2                   | LogOff
T3                   | Login
T4                   | Execute
T5                   | LogOff
T6                   | Login
T7                   | Login
T8                   | Report
T9                   | LogOff
I want the T-SQL way to find out how long the user was logged into the system, i.e. the time between the Login and LogOff events for each session on a given day:
Day (Date) | UserTime (in hours, LogOff time - Login time)
-----------|----------------------------------------------
Jun 12     | 2
Jun 12     | 3
Jun 13     | 5
I tried using two temporary tables and row numbers but could not get it right, since the comparison is on time, i.e. finding the next LogOff event whose timestamp is greater than the current row's Login event.
You need to group the records. I would suggest counting logins or logoffs. Here is one approach to get the time for each "session":
select min(case when auditevent = 'Login' then timestamp end) as login_time,
       max(timestamp) as logoff_time
from (select t.*,
             sum(case when auditevent = 'LogOff' then 1 else 0 end)
                 over (order by timestamp desc) as grp
      from t
     ) t
group by grp;
You then have to do whatever you want to get the numbers per day. It is unclear what those counts are.
The subquery does a reverse count: it counts the number of "LogOff" records on or after each record. For records in the same "session", this count is the same, which makes it suitable for grouping.
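To get from there to the requested per-day output, a minimal sketch (assuming SQL Server 2012+ for the running window sum, same assumed table t) wraps the session query, keys each session by its login date, and computes the hours:

select cast(login_time as date) as [Day],
       datediff(hour, login_time, logoff_time) as UserTimeInHours
from (select min(case when auditevent = 'Login' then [timestamp] end) as login_time,
             max([timestamp]) as logoff_time
      from (select t.*,
                   sum(case when auditevent = 'LogOff' then 1 else 0 end)
                       over (order by [timestamp] desc) as grp
            from t
           ) t
      group by grp
     ) sessions;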

How to aggregate unique users by hour in Amazon Redshift?

With Amazon Redshift I want to count every unique visitor.
A unique visitor is a visitor whose previous visit, if any, was an hour or more earlier.
So for the following rows of users and timestamps we'd get a total count of 4 unique visitors, with user1 and user2 each counting as 2.
Please note that I do not want to aggregate by the hour of a 24-hour day; I want to aggregate by whether an hour has passed since the timestamp of the user's previous visit.
I'm guessing a straight up SQL expression won't do it.
user1,"2015-07-13 08:28:45.247000"
user1,"2015-07-13 08:30:17.247000"
user1,"2015-07-13 09:35:00.030000"
user1,"2015-07-13 09:54:00.652000"
user2,"2015-07-13 08:28:45.247000"
user2,"2015-07-13 08:30:17.247000"
user2,"2015-07-13 09:35:00.030000"
user2,"2015-07-13 09:54:00.652000"
So user1 arrives at 8:28, which counts as one hit. He comes back at 8:30, which counts as zero. He then comes back at 9:35, which is more than an hour after 8:30, so he gets another hit. Then he comes back at 9:54, which is only 19 minutes after the last visit at 9:35, so this counts as zero. The total is 2 hits for user1. The same happens for user2, meaning two hits each, bringing the final total to 4.
You can use lag() to accomplish this. Depending on your requirements, you may also want to partition by day. The query below is a starting point.
with prev as (
    select user_id,
           datecol,
           lag(datecol) over (partition by user_id order by datecol) as prev_visit
    from tablename
)
select user_id,
       sum(case when prev_visit is null
                  or datediff(minute, prev_visit, datecol) >= 60
                then 1 else 0 end) as totalvisits
from prev
group by user_id
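A sketch of the final step, summing the per-user counts into the overall total (4 in the example), under the same assumed table and column names:

with prev as (
    select user_id,
           datecol,
           lag(datecol) over (partition by user_id order by datecol) as prev_visit
    from tablename
), per_user as (
    select user_id,
           sum(case when prev_visit is null
                      or datediff(minute, prev_visit, datecol) >= 60
                    then 1 else 0 end) as totalvisits
    from prev
    group by user_id
)
select sum(totalvisits) as total_unique_visits
from per_user;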

How do I analyse time periods between records in SQL data without cursors?

The root problem: I have an application which has been running for several months now. Users have been reporting that it's been slowing down over time (so in May it was quicker than it is now). I need to get some evidence to support or refute this claim. I'm not interested in precise numbers (so I don't need to know that a login took 10 seconds), I'm interested in trends - that something which used to take x seconds now takes on the order of y seconds.
The data I have is an audit table which stores a single row each time the user carries out any activity - it includes a primary key, the user id, a date time stamp and an activity code:
create table AuditData (
    AuditRecordID int identity(1,1) not null,
    DateTimeStamp datetime not null,
    DateOnly datetime null,
    UserID nvarchar(10) not null,
    ActivityCode int not null)
(Note: DateOnly (datetime) is the DateTimeStamp with the time stripped off, to make GROUP BY for daily analysis easier - it's effectively duplicate data to make querying faster.)
(Also, for the sake of ease, you can assume that the ID is assigned in datetime order, that is, 1 will always be before 2, which will always be before 3 - if this isn't true I can make it so.)
ActivityCode is an integer identifying the activity which took place, for instance 1 might be user logged in, 2 might be user data returned, 3 might be search results returned and so on.
Sample data for those who like that sort of thing...:
1, 01/01/2009 12:39, 01/01/2009, P123, 1
2, 01/01/2009 12:40, 01/01/2009, P123, 2
3, 01/01/2009 12:47, 01/01/2009, P123, 3
4, 01/01/2009 13:01, 01/01/2009, P123, 3
User data is returned (Activity Code 2) immediately after login (Activity Code 1), so this can be used as a rough benchmark of how long the login takes (as I said, I'm interested in trends, so as long as I'm measuring the same thing for May as for July it doesn't matter so much if this isn't the whole login process - it takes in enough of it to give a rough idea).
(Note: User data can also be returned under other circumstances so it's not a one to one mapping).
So what I'm looking to do is select the average time between a login (say ActivityCode 1) and the first instance after that, for that user on that day, of user data being returned (say ActivityCode 2).
I can do this by going through the table with a cursor, getting each login instance and then for that doing a select to say get the minimum user data return following it for that user on that day but that's obviously not optimal and is slow as hell.
My question is (finally) - is there a "proper" SQL way of doing this using self joins or similar without using cursors or some similar procedural approach? I can create views and whatever to my hearts content, it doesn't have to be a single select.
I can hack something together but I'd like to make the analysis I'm doing a standard product function so would like it to be right.
SELECT TheDay, AVG(TimeTaken) AS AvgTimeTaken
FROM (
    SELECT
        CONVERT(DATE, logins.DateTimeStamp) AS TheDay
      , DATEDIFF(SS, logins.DateTimeStamp,
            (SELECT TOP 1 DateTimeStamp
             FROM AuditData userinfo
             WHERE userinfo.UserID = logins.UserID
               AND userinfo.ActivityCode = 2
               AND userinfo.DateTimeStamp > logins.DateTimeStamp
             ORDER BY userinfo.DateTimeStamp)  -- needed so TOP 1 picks the next record
        ) AS TimeTaken
    FROM AuditData logins
    WHERE logins.ActivityCode = 1
) LogInTimes
GROUP BY TheDay
This might be dead slow in real world though.
In Oracle this would be a cinch, because of analytic functions. In this case, LAG() makes it easy to find the matching pairs of activity codes 1 and 2 and also to calculate the trend. As you can see, things got worse on 2nd JAN and improved quite a bit on the 3rd (I'm working in seconds rather than minutes).
select DateOnly
     , elapsed_time
     , elapsed_time - lag(elapsed_time) over (order by DateOnly) as trend
from (
      select DateOnly
           , avg(databack_time - prior_login_time) as elapsed_time
      from (
            select DateOnly
                 , databack_time
                 , ActivityCode
                 , lag(login_time) over (order by DateOnly, UserID,
                                         AuditRecordID, ActivityCode) as prior_login_time
            from (
                  select a1.AuditRecordID
                       , a1.DateOnly
                       , a1.UserID
                       , a1.ActivityCode
                       , to_number(to_char(a1.DateTimeStamp, 'SSSSS')) as login_time
                       , 0 as databack_time
                  from AuditData a1
                  where a1.ActivityCode = 1
                  union all
                  select a2.AuditRecordID
                       , a2.DateOnly
                       , a2.UserID
                       , a2.ActivityCode
                       , 0 as login_time
                       , to_number(to_char(a2.DateTimeStamp, 'SSSSS')) as databack_time
                  from AuditData a2
                  where a2.ActivityCode = 2
                 )
           )
      where ActivityCode = 2
      group by DateOnly
     );

DATEONLY   ELAPSED_TIME      TREND
---------  ------------  ---------
01-JAN-09           120
02-JAN-09           600        480
03-JAN-09           150       -450
Like I said in my comment, I guess you're working in MSSQL. I don't know whether that product has any equivalent of LAG().
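For what it's worth, SQL Server 2012 and later do support LAG(); a minimal T-SQL sketch of the same pairing idea, assuming the AuditData table above:

SELECT DateOnly,
       AVG(CAST(DATEDIFF(SECOND, prev_time, DateTimeStamp) AS FLOAT)) AS avg_login_seconds
FROM (
    SELECT DateOnly, DateTimeStamp, ActivityCode,
           LAG(DateTimeStamp) OVER (PARTITION BY UserID, DateOnly
                                    ORDER BY DateTimeStamp) AS prev_time,
           LAG(ActivityCode)  OVER (PARTITION BY UserID, DateOnly
                                    ORDER BY DateTimeStamp) AS prev_code
    FROM AuditData
    WHERE ActivityCode IN (1, 2)
) x
WHERE ActivityCode = 2 AND prev_code = 1  -- keep only data-return rows directly after a login
GROUP BY DateOnly
ORDER BY DateOnly;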
If the assumptions are that:
Users will perform various tasks in no mandated order, and
That the difference between any two activities reflects the time it takes for the first of those two activities to execute,
Then why not create a table with two timestamps: the first column containing the activity start time, the second containing the next activity's start time. The difference between these two will always be the total time of the first activity. For the logoff activity, you would just have NULL in the second column.
So it would be kind of weird and interesting: for each activity (other than logging in and logging out), the timestamp would be recorded in two different rows - once for the previous activity (as the time "completed") and again in a new row (as the time started). You would end up with a Jacob's ladder of sorts, but finding the data you are after would be much simpler.
In fact, to get really wacky, you could have each row carry the time the user started activity A and its activity code, plus the time activity B started (which, as mentioned above, gets put down again in the following row). This way each row tells you the exact time difference between any two consecutive activities.
Otherwise, you're stuck with a query that says something like
SELECT TIME_IN_SEC(row2-timestamp) - TIME_IN_SEC(row1-timestamp)
which would be pretty slow, as you have already suggested. By swallowing the redundancy, you end up just querying the difference between the two columns. You probably would have less need of knowing the user info as well, since you'd know that any row shows both activity codes, thus you can just query the average for all users on any given day and compare it to the next day (unless you are trying to find out which users are having the problem as well).
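A minimal sketch of the proposed two-timestamp layout, with hypothetical names:

create table ActivityLog (
    UserID nvarchar(10) not null,
    ActivityCode int not null,
    StartTime datetime not null,
    NextStartTime datetime null  -- NULL for the final activity (e.g. logoff)
)

-- each activity's duration is then a plain column difference:
select UserID, ActivityCode,
       datediff(second, StartTime, NextStartTime) as DurationSeconds
from ActivityLog
where NextStartTime is not null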
This is a faster way to find it: in one pass, each row ends up holding both the current and the previous row's datetime value, after which you can use DATEDIFF(datepart, startdate, enddate). I use @DammyVariable and DammyField because, as I remember, there is a problem if @variable = Field is not the first assignment in the UPDATE statement.
SELECT *, CAST(NULL AS DATETIME) AS LastRowDateTime, CAST(NULL AS INT) AS DammyField
INTO #T FROM AuditData
GO
CREATE CLUSTERED INDEX IX_T ON #T (AuditRecordID)
GO
DECLARE @LastRowDateTime DATETIME
DECLARE @DammyVariable INT

SET @LastRowDateTime = NULL
SET @DammyVariable = 1

UPDATE #T SET
    @DammyVariable = DammyField = @DammyVariable
  , LastRowDateTime = @LastRowDateTime
  , @LastRowDateTime = DateTimeStamp
OPTION (MAXDOP 1)
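Completing the idea - once LastRowDateTime is populated, the gap between consecutive rows is a plain column difference (a sketch):

SELECT AuditRecordID, UserID, ActivityCode,
       DATEDIFF(SECOND, LastRowDateTime, DateTimeStamp) AS SecondsSincePrev
FROM #T
WHERE LastRowDateTime IS NOT NULL
ORDER BY AuditRecordID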