Group timestamps into sessions with a defined minimum gap between sessions - sql

I'm trying to group my timestamp data into user sessions of various lengths where a session end is defined by a minimum time gap between sessions.
So if the threshold is e.g. 5 minutes, two timestamps with a 2 minute gap would be considered the same session, while two timestamps with a 6 minute gap would be considered two sessions.
I've seen several examples where people try to group into sessions of a fixed length, which is fairly easy. But my sessions are open-ended, and I can't figure out what trick to use.
I have a base query defining the timestamps and creating sessions with 1 minute granularity. But it gives me one long session per object and user, merging what should be several sessions into one.
How could I split my long merged session into several ones, with a defined gap of e.g. 5 minutes?
SELECT
    count(distinct timestamps.*) as min_spent,
    timestamps.object_id,
    timestamps.user_id,
    min(created_min) as session_start,
    max(created_min) as session_end
FROM
(
    SELECT
        date_trunc('minute', datetime) as created_min,
        object_id,
        user_id
    FROM timestamp_metrics
    GROUP BY created_min, object_id, user_id
) as timestamps
left join objects o on o.id = timestamps.object_id
left join users u on u.id = timestamps.user_id
GROUP BY object_id, user_id
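For what it's worth, a minimal sketch of the standard "gaps and islands" technique, assuming PostgreSQL (which the date_trunc syntax suggests) and a 5 minute threshold: use LAG to flag each minute whose gap from the previous one exceeds the threshold, turn the flags into session ids with a running sum, then group by that session id.

WITH minutes AS (
    SELECT DISTINCT
        date_trunc('minute', datetime) AS created_min,
        object_id,
        user_id
    FROM timestamp_metrics
),
flagged AS (
    SELECT
        object_id,
        user_id,
        created_min,
        -- 1 when the gap to the previous minute exceeds the threshold, else 0
        CASE
            WHEN created_min - LAG(created_min) OVER (
                     PARTITION BY object_id, user_id ORDER BY created_min
                 ) > INTERVAL '5 minutes'
            THEN 1 ELSE 0
        END AS new_session
    FROM minutes
),
numbered AS (
    SELECT
        *,
        -- running sum of flags = session id within each object/user
        SUM(new_session) OVER (
            PARTITION BY object_id, user_id ORDER BY created_min
        ) AS session_id
    FROM flagged
)
SELECT
    object_id,
    user_id,
    COUNT(*) AS min_spent,
    MIN(created_min) AS session_start,
    MAX(created_min) AS session_end
FROM numbered
GROUP BY object_id, user_id, session_id

The joins to objects and users from the base query are dropped here for brevity; they can be reattached around the final SELECT.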

Related

Azure Stream Analytics - Find Most Recent `n` Events Within Time Interval

I am working with Azure Stream Analytics and, to illustrate my situation, I have streaming events corresponding to buy(+)/sell(-) orders of a certain amount from users. So, the key fields in an individual event look like: {UserId: 'u12345', Type: 'Buy', Amt: 14.0}.
I want to write a query which outputs UserIds and the sum of Amt for the most recent (up to) 5 events within a sliding 24 hr period, partitioned by UserId.
To clarify:
If there are more than 5 events for a given UserId in the last 24 hours, I only want the sum of Amt for the most recent 5.
If there are fewer than 5 events, I either want the UserId to be omitted or the sum of the Amt of the events that do exist.
I've tried looking at LIMIT DURATION predicates, but there doesn't seem to be a way to limit the number of events as well as filter on time while PARTITION'ing by UserId. Has anyone done something like this?
Considering the comments, I think this should work:
WITH Last5 AS (
SELECT
UserId,
System.Timestamp() AS windowEnd,
COLLECTTOP(5) OVER (ORDER BY CAST(EventEnqueuedUtcTime AS DATETIME) DESC) AS Top5
FROM input1
TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY
SlidingWindow(hour,24),
UserId
HAVING COUNT(*) >= 5 --We want at least 5
)
SELECT
L.UserId,
System.Timestamp() AS ts,
SUM(C.ArrayValue.value.Amt) AS sumAmt
INTO myOutput
FROM Last5 AS L
CROSS APPLY GetArrayElements(L.Top5) AS C
GROUP BY
System.Timestamp(), --Snapshot window
L.UserId
We use a CTE to first get the sliding window of 24h. In there we both filter to retain only windows of at least 5 records (HAVING COUNT(*) >= 5) and collect only the last 5 of them (COLLECTTOP(5) OVER ...). Note that I had to TIMESTAMP BY and CAST my own timestamp when testing the query; you may not need that in your case.
Next we need to unpack the collected records (done via CROSS APPLY GetArrayElements) and sum them. I use a snapshot window for that, as I don't need time grouping at that step.
Please let me know if you need more details.

SQL question: count of occurrence greater than N in any given hour

I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of timestamps and check whether the record that came 19 entries earlier falls within an hour of the current one:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show each matching user along with the first time they logged in for the twentieth time within one hour.
I think you might do something like this (I'll use a login table with just user and datetime columns, for the sake of simplicity):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + interval '1 hour')
) as consecutive_logons
from connections ua
It is up to you to adapt this to your own columns (user, datetime).
It is also up to you to adjust the date arithmetic: interval '1 hour' works in PostgreSQL and should work in Netezza, but this is more or less dependent on the DB implementation; for example it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp).
Due to the subquery (select count(*) ...), the whole query will not be the fastest: it is a correlated subquery, so it needs to be re-evaluated for each row.
The WITH clause simply computes a subset of user_logons to minimize the cost of that subquery. It might not be necessary, but it lessens the complexity of the query.
You might get better performance using a stored function or application-side code (e.g. Java, PHP, ...).
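For completeness, a sketch of a window-function alternative that avoids the correlated subquery entirely. It assumes an engine that supports RANGE frames with interval offsets (PostgreSQL 11+ does; whether your Netezza version accepts this needs checking), and reuses the userlog table from the first answer:

select user_id,
       login_time,
       -- number of logins by the same user in the hour starting at this row
       count(*) over (
           partition by user_id
           order by login_time
           range between current row and interval '1 hour' following
       ) as logons_within_hour
from userlog
where login_time >= timestamp '2018-12-01 00:00:00'

Filtering for logons_within_hour above your threshold then takes one outer query, since window functions cannot appear in WHERE.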

How do I design a SQL query that will show me all users who visited at least one page for the last 20 out of 24 hours?

In order to identify human traffic (as opposed to crawlers, bots, etc.), I would like to design an SQL query that will identify all unique visitor IDs that have visited websites in the last 20 of 24 hours (as most humans would not be browsing for that long). I believe I understand how I want to structure it: "how many UNIQUE hours have any activity for each visitor in the past 24 hours, WHERE 20 hours have at least some activity".
While the specifics of such a query would depend on the tables involved, I'm having trouble understanding if my structure is on the right track:
SELECT page_url, affinity, num
FROM (
SELECT AGG GROUP BY visitor_id, pages.page_url, max(v.max_affinity) as affinity, COUNT(*) as num, Row_Number()
OVER (Partition By v.visitor_id ORDER BY COUNT(visitor_id) DESC) AS RowNumber
FROM audience_lab_active_visitors v
SELECT pages ON pages.p_date >= '2017-09-14'
WHERE v.p_date='2017-09-14'
GROUP BY v.vispage_visitors, pages.page_url
) tbl WHERE RowNumber < 20
I don't believe your query is valid SQL, but I have an idea of what you're trying to accomplish. Rather than use a static date, I filtered by the past 24 hours and truncated the current timestamp to the hour, otherwise the query would be considering 25 unique hours. I also removed page_url from the query since it didn't seem relevant to the results based on what you're trying to solve.
For each visitor_id, the query counts the number of unique hours recorded, based on the column (timestamp_col in this example) used to record the timestamp of the page view. HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20 returns those you've identified as humans, meaning they visited the website during at most 19 of the past 24 hours.
SELECT
visitor_id,
COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) AS num,
MAX(v.max_affinity) AS affinity
FROM audience_lab_active_visitors AS v
JOIN pages AS p ON v.page_url = p.page_url
WHERE
v.p_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP) - INTERVAL '24' hour
GROUP BY 1
HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20;

How can I run this except sub-query in one single query?

I am using PostgreSQL. I have two tables: one is user, and one is usertasks.
user has the following fields: userid, username
usertasks has the following fields: id, taskdate, userid
userid and id are the primary keys of the above tables.
I want to find all users who have made less than 3 tasks in last 3 months.
I cannot use WHERE taskdate > (last3months) here because I need all the users, not just those who made tasks in the last 3 months. (Some users might have done their tasks 6 months ago but didn't do any task in the last 3 months, so I need those users as well.)
My query is this:
select userid
from users
EXCEPT
select userid from usertasks
where usertasks.taskdate > CURRENT_DATE - INTERVAL '3 months'
group by usertasks.userid having count(id) >= 3
Problem:
The above query works perfectly and returns the right result. I have also tried NOT IN instead of EXCEPT, and that works fine too, but I am getting performance issues. Can this be done in one single query without using a sub-query? Can it be done using joins or any other method? The use of sub-queries is making it slower.
The test case is 100 thousand users and 1 million tasks; I am searching for the fastest method.
You can use HAVING with a CASE expression:
Select u.userid
from users u
left join usertasks ut
on ut.userid = u.userid
group by u.userid
having count(case when ut.taskdate > CURRENT_DATE - INTERVAL '3 months' then ut.id else null end) < 3 -- count of tasks in the last 3 months < 3
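If you are on PostgreSQL 9.4 or later, the same conditional count can be written with a FILTER clause, which reads a little more directly; a sketch under the same schema:

select u.userid
from users u
left join usertasks ut
on ut.userid = u.userid
group by u.userid
having count(ut.id) filter (where ut.taskdate > CURRENT_DATE - INTERVAL '3 months') < 3

Either way, an index on usertasks (userid, taskdate) may help performance here.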

Finding concurrent users in a sessions table created from log entries

We are exploring using BigQuery to store and analyze hundreds of millions of log entries representing user sessions. The source raw log entries contain a "connect" log type and a "disconnect" log type.
We have the option of processing the logs before they are ingested into BigQuery so that we have one entry per session, containing the session start TIMESTAMP and a "duration" value, or to insert each log entry individually and calculate session times at the analysis stage. Let's imagine our table schema is of the form:
sessionStartTime: TIMESTAMP,
clientId: STRING,
duration: INTEGER
or (in the case we store two log entries per session: one connect and one disconnect):
time: TIMESTAMP,
type: INTEGER, //enum, 0 for connect, 1 for disconnect
clientId: STRING
Our problem is we cannot find a way to get concurrent users using BigQuery: ideally we would be able to write a query that partitions the sessions table by timestamp "buckets" (let's say every minute) and gives us concurrents per minute over a certain time range.
The simple way to think about concurrents with respect to log entries is that at any moment in time they are calculated using the function f(t) = x0 + connects(t) - disconnects(t), where x0 is the initial concurrent users count (at time t0), and t is the "timestamp" bucket (in minutes in this example).
Can anybody recommend a way to do this?
Thanks!
Thanks for the sample data! (Available at https://bigquery.cloud.google.com/table/imgdge:sopub.sessions)
I'll take you up on the offer that "We have the option of processing the logs before they are ingested into BigQuery so that we have one entry per session, containing the session start TIMESTAMP and a 'duration' value". This time, I'll do that processing with BigQuery itself, and leave the results in a table of my own with:
SELECT u, start, MIN(end) end FROM (
SELECT a.f0_ u, a.time start, b.time end
FROM [imgdge:sopub.sessions] a
JOIN EACH [imgdge:sopub.sessions] b
ON a.f0_ = b.f0_
WHERE a.type = 'connect'
AND b.type='disconnect'
AND a.time < b.time
)
GROUP BY 1, 2
That gives me 819,321 rows. Not a big number for BigQuery, but since we are going to be doing combinations of it, it might explode. We'll limit the date range for calculating the concurrent sessions to keep it sane. I'll save the results of this query to [fh-bigquery:public_dump.imgdge_sopub_sessions_startend].
Once I have all the sessions with start and end times, I can find how many concurrent sessions there are at each interesting instant. By minute, you said?
All the interesting minutes happen to be:
SELECT SEC_TO_TIMESTAMP(FLOOR(TIMESTAMP_TO_SEC(time)/60)*60) time
FROM [imgdge:sopub.sessions]
GROUP BY 1
Now let's combine this list of interesting times with all the sessions in my new table. For each minute we'll count all the sessions that started before this time, and ended after it:
SELECT time, COUNT(*) concurrent
FROM (
SELECT u, start, end, 99 x
FROM [fh-bigquery:public_dump.imgdge_sopub_sessions_startend]
WHERE start < '2013-09-30 00:00:00'
) a
JOIN
(
SELECT SEC_TO_TIMESTAMP(FLOOR(TIMESTAMP_TO_SEC(time)/60)*60) time, 99 x FROM [imgdge:sopub.sessions] GROUP BY 1) b
ON a.x = b.x
WHERE b.time < a.end
AND b.time >= a.start
GROUP BY 1
Notice the 99 x. It could be any number; I'm just using it as a join key to generate the cross product of all the sessions * all the interesting times. There are too many sessions for this kind of combinatorial game, so I'm limiting them with the WHERE start < '2013-09-30 00:00:00'.
And that's how you can count concurrent users.
Could you, instead of sessionStartTime, get sessionEndTime (or just compute sessionStartTime + duration)? If you can do that, something like this can be made. It is not perfect, but it will give you somewhat relevant data.
SELECT AVG(perMinute) as avgUsersMin FROM
(
  SELECT COUNT(DISTINCT clientId, 1000000) as perMinute,
         YEAR(sessionEndTime) as y,
         MONTH(sessionEndTime) as m,
         DAY(sessionEndTime) as d,
         HOUR(sessionEndTime) as h,
         MINUTE(sessionEndTime) as mn
  FROM [MyProject:MyTable]
  WHERE sessionEndTime BETWEEN someDate AND someOtherDate
  GROUP BY y, m, d, h, mn
);