Microsoft SQL server count distinct every 30 minutes - sql

We have an activity database that records user interaction to a website, storing a log that includes values such as [UserId] and [LogDate] e.g.
UserId|LogDate
123 |2017-01-01 11:17:35.190
I am trying to find out the count of distinct user sessions over time.
This would be easy enough by counting the distinct users:
SELECT COUNT(DISTINCT UserId) FROM ActivityDatabase.dbo.Logs
However, I need to count a user multiple times if they have a log more than 30 minutes from the previous log as this is then classed as a new session.
A session is defined as having a log in a 30 minute timeframe. For example:
If a user creates a log at 13.30, the value for distinct user sessions over time would be 1.
If the user creates another log at 13.40, the count should still be 1 as it is within 30 minutes of the previous log.
If the user creates another log at 14.20, the count should then be 2 as this is more than 30 minutes after the previous log.
Is this possible in SQL? I would need a way of checking every log for a user against the previous user log, and if the time difference between these is more than 30 minutes, it should count as a unique session.
The output of the SQL should be a number rather than broken down by a time period.
Thank you.

Sessionizing is a bit tricky. Let me show you how to do that. Perhaps this will solve your problem:
select userid, min(logdate) as session_start,
dateadd(minute, 30, max(logdate)) as session_end,
row_number() over (order by userid, min(logdate)) as session_id
from (select l.*,
sum(case when logdate < dateadd(minute, 30, prev_logdate)
then 0 else 1
end) over (partition by userid order by logdate
) as grp
from (select l.*,
lag(logdate) over (partition by userid order by logdate) as prev_logdate
from ActivityDatabase.dbo.Logs l
) l
) l
group by userid, grp;
If you want the number of unique users at a given point in time, then:
with s as (
select userid, min(logdate) as session_start,
dateadd(minute, 30, max(logdate)) as session_end,
row_number() over (order by userid, min(logdate)) as session_id
from (select l.*,
sum(case when logdate < dateadd(minute, 30, prev_logdate)
then 0 else 1
end) over (partition by userid order by logdate
) as grp
from (select l.*,
lag(logdate) over (partition by userid order by logdate) as prev_logdate
from ActivityDatabase.dbo.Logs l
) l
) l
group by userid, grp
)
select count(*)
from s
where @datetime between session_start and session_end;
A more brute force alternative for a given time is:
select count(distinct userid)
from ActivityDatabase.dbo.Logs l
where @datetime between logdate and dateadd(minute, 30, logdate);

If you are using SQL Server 2012 or later, I would use the LAG function to find the previous row. You can then compare the two datetimes to see whether the difference is greater than 30 minutes:
select
userId,
LogDate,
LAG(LogDate) OVER (PARTITION BY userId ORDER BY LogDate) AS PreviousLogDate
from logTbl
You can then add datediff and a case statement to flag a new login where the difference is greater than your threshold.
If no previous row is found, then the lag function will return null.
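As a rough illustration of that flagging step outside the database, here is a minimal Python sketch. The function name `count_sessions` and the in-memory list of `(user_id, log_date)` tuples are assumptions made for the example; they are not part of the original schema. The dictionary lookup plays the role of LAG, and the `if` plays the role of the CASE expression.

```python
from datetime import datetime, timedelta

def count_sessions(logs, gap=timedelta(minutes=30)):
    """Count sessions: a log starts a new session when the same user's
    previous log is missing or more than `gap` earlier."""
    previous = {}  # user_id -> most recent log time seen so far (the "lag" value)
    sessions = 0
    for user_id, log_date in sorted(logs):  # order by user, then time
        prev = previous.get(user_id)
        if prev is None or log_date - prev > gap:
            sessions += 1  # flag: this row opens a new session
        previous[user_id] = log_date
    return sessions

# The example from the question: logs at 13.30, 13.40 and 14.20 (hypothetical date).
logs = [
    (123, datetime(2017, 1, 1, 13, 30)),
    (123, datetime(2017, 1, 1, 13, 40)),
    (123, datetime(2017, 1, 1, 14, 20)),
]
print(count_sessions(logs))  # 2
```

Summing the flags gives the single number the question asks for. Whether a log exactly 30 minutes after the previous one starts a new session is a choice; this sketch treats it as the same session.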

If you play around with the definition you're trying to use, it becomes a lot easier to write the SQL.
What we want to identify are "starting logs" - logs that mark the start of a session. We don't want to identify any other logs.
How do we define a "starting log"? It's a log that doesn't have another log within 30 minutes before it.
SELECT COUNT(*)
FROM ActivityDatabase.dbo.Logs l1
WHERE NOT EXISTS (
SELECT * FROM ActivityDatabase.dbo.Logs l2
WHERE l1.UserId = l2.UserId AND
l2.LogDate < l1.LogDate AND
l2.LogDate >= DATEADD(minute,-30,l1.LogDate)
)

Related

count consecutive number of -1 in a column. count >=14

I'm trying to figure out a query to count "-1" values that have occurred more than 14 times in a row. Can anyone help me here? I tried everything from lead, row number, etc. but nothing is working out.
The BP is recorded every minute and I need to find the IDs whose bp_level was "-1" for more than 14 minutes.
You may try the following:
Select Distinct B.Person_ID, B.[Consecutive]
From
(
Select D.person_ID, COUNT(D.bp_level) Over (Partition By D.grp, D.person_ID Order By D.Time_) [Consecutive]
From
(
Select Time_, Person_ID, bp_level,
DATEADD(Minute, -ROW_NUMBER() Over (Partition By Person_ID Order By Time_), Time_) grp
From mytable Where bp_level = -1
) D
) B
Where B.[Consecutive] >= 14
See a demo from db<>fiddle. Using SQL Server.
DATEADD(Minute, -ROW_NUMBER() Over (Partition By Person_ID Order By Time_), Time_): defines a unique group for each run of consecutive times per person, among the rows where bp_level = -1.
COUNT(D.bp_level) Over (Partition By D.grp, D.person_ID Order By D.Time_): computes a running count of the rows in each group, in time order.
Once a value other than -1 appears, the run splits into two groups and the count starts over in the second group.
NOTE: this solution works only if there are no gaps between consecutive times, i.e. the time increases by exactly one minute per row for each person. Otherwise the query will not work as written, although it can be modified to handle gaps.
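To see why subtracting the row number from the time produces a group key, here is a small Python sketch of the same idea. The function name `longest_runs` and the sample readings are assumptions for illustration only. Like the query, it assumes one reading per minute with no duplicate timestamps.

```python
from collections import Counter
from datetime import datetime, timedelta

def longest_runs(rows):
    """rows: (person_id, time, bp_level) tuples, one reading per minute.
    Returns {person_id: length of the longest run of consecutive -1 minutes}."""
    by_person = {}
    for person_id, time, bp_level in rows:
        if bp_level == -1:  # same filter as "Where bp_level = -1"
            by_person.setdefault(person_id, []).append(time)
    result = {}
    for person_id, times in by_person.items():
        times.sort()
        # time minus row number (in minutes) stays constant inside a gap-free run
        keys = Counter(t - timedelta(minutes=i) for i, t in enumerate(times, start=1))
        result[person_id] = max(keys.values())
    return result

start = datetime(2023, 1, 1, 8, 0)  # hypothetical start time
levels = [-1, -1, -1, 5, -1, -1]    # one reading per minute
rows = [(1, start + timedelta(minutes=i), lvl) for i, lvl in enumerate(levels)]
print(longest_runs(rows))  # {1: 3}
```

Filtering the result for values of 14 or more would correspond to the outer `Where B.[Consecutive] >= 14`.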
with data as (
select *,
count(case when bp_level <> -1 then 1 end) over
(partition by person_id order by time) as grp
from T
)
select distinct person_id
from data
where bp_level = -1
group by person_id, grp
having count(*) > 14; /* = or >= ? */
If you want to rely on timestamps rather than a count of rows then you could use the time difference:
...
-- where 1 = 1 /* all rows */
group by person_id, grp
having datediff(minute, min(time), max(time)) > 14;
The accepted answer would have issues with scenarios where there are multiple rows with the same timestamp if there's any potential for that to happen.
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=2ad6a1b515bb4091efba9b8831e5d579

How to get a count of new values per day in Postgres

I have the following schema -
Date|UserID
"2021-07-29"|1
"2021-07-29"|2
"2021-07-30"|1
"2021-07-30"|4
"2021-08-01"|2
"2021-08-01"|2
It contains the dates of some event, along with the user who triggered that event.
I need to get a count of all the NEW users who triggered the event on every given day until today, ignoring users who have triggered the event in the past.
So after running the query, results would look like this
Date|Count
"2021-07-29"|2
"2021-07-30"|1
"2021-08-01"|0
Because on the 29th, users 1 and 2, who I've never seen before, triggered it.
On the 30th, user 4, who I've never seen before, triggered it.
On the 1st, I've seen user 2 before, so ignore him.
You can use a window function to get the first date for each user. Then use conditional aggregation:
select date, count(*) filter (where seqnum = 1) as num_new_users
from (select t.*,
row_number() over (partition by userid order by date) as seqnum
from t
) t
group by date;
Use the window function FIRST_VALUE in a subquery (or CTE) to get the first trigger for each user and in the outer query count if it's equal to the current date:
SELECT dt,count(*) FILTER (WHERE first_trigger = dt)
FROM (
SELECT *,FIRST_VALUE(dt) OVER w first_trigger FROM t
WINDOW w AS (PARTITION BY userid ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
ORDER BY dt)j
GROUP BY dt;
Demo: db<>fiddle
Use MIN() window function to get the min date for each user and then aggregate and count for each date only the min dates:
SELECT Date, SUM((Date = min_date)::int) Count
FROM (
SELECT *, MIN(date) OVER (PARTITION BY UserID) min_date
FROM tablename
) t
GROUP BY Date;
Or:
SELECT Date, COUNT(*) FILTER (WHERE Date = min_date) Count
FROM (
SELECT *, MIN(date) OVER (PARTITION BY UserID) min_date
FROM tablename
) t
GROUP BY Date;
See the demo.
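All of the queries above rest on the same idea: find each user's first date, then count per day how many users have that day as their first. A minimal Python sketch of that logic, using the sample data from the question (the function name `new_users_per_day` is an assumption for the example):

```python
def new_users_per_day(events):
    """events: (date, user_id) pairs with ISO-formatted date strings.
    Returns {date: number of users whose first event falls on that date}."""
    first_seen = {}
    for date, user_id in events:
        # equivalent of MIN(date) OVER (PARTITION BY UserID)
        if user_id not in first_seen or date < first_seen[user_id]:
            first_seen[user_id] = date
    counts = {date: 0 for date, _ in events}  # keep days with no new users
    for date in first_seen.values():
        counts[date] += 1
    return counts

events = [
    ("2021-07-29", 1), ("2021-07-29", 2),
    ("2021-07-30", 1), ("2021-07-30", 4),
    ("2021-08-01", 2), ("2021-08-01", 2),
]
print(new_users_per_day(events))
# {'2021-07-29': 2, '2021-07-30': 1, '2021-08-01': 0}
```

Days with no new users still appear with a count of 0, which matches the expected output in the question.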

Grouping Consecutive Timestamps (Redshift)

Got something that I can't get my head around.
The raw data shows 15-minute intervals, and I would like to group them based on whether they are consecutive 15-minute intervals (see screenshot below). I would like to do this multiple times for each user and for a lot of users. Any ideas on how to do this using SQL only, in a way that can scale to thousands of users?
Any help would be appreciated.
Thanks
This is a type of gaps-and-islands problem. Use lag() to get the difference, then a cumulative sum to identify the group:
select user_id, min(start_time), max(end_time)
from (select t.*,
sum(case when prev_end_time = start_time then 0 else 1 end) over (partition by user_id order by start_time rows between unbounded preceding and current row) as grp
from (select t.*,
lag(end_time) over (partition by user_id order by start_time) as prev_end_time
from t
) t
) t
group by user_id, grp;
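The grouping rule can be checked on a small sample with a Python sketch. The function name `merge_consecutive` and the sample rows are assumptions; the column names follow the query above. A row joins the current island when its start_time equals the previous row's end_time for the same user; otherwise it starts a new island.

```python
from datetime import datetime

def merge_consecutive(intervals):
    """intervals: (user_id, start_time, end_time) tuples.
    Merges rows whose start_time equals the previous row's end_time."""
    merged = []
    for user_id, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == user_id and last[2] == start:
            merged[-1] = (user_id, last[1], end)  # same island: extend it
        else:
            merged.append((user_id, start, end))  # gap or new user: new island
    return merged

def at(hour, minute):
    return datetime(2021, 1, 1, hour, minute)  # hypothetical date

intervals = [
    ("a", at(9, 0), at(9, 15)),
    ("a", at(9, 15), at(9, 30)),
    ("a", at(10, 0), at(10, 15)),
]
for row in merge_consecutive(intervals):
    print(row)  # two islands: 09:00-09:30 and 10:00-10:15
```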

Define user's sessions (sql)

I have an event table (user_id, timestamp). I need to write a query to define user sessions (every user can have more than one session and every session can have >= 1 event). A session is complete after 30 minutes of inactivity for the user.
The output table should have the following format: (user_id, start_session, end_session). I wrote part of the query, but I have no idea what to do next.
select
t.user_id,
t.ts start_session,
t.next_ts
from ( select
user_id,
ts,
DATEDIFF(SECOND, lag(ts, 1) OVER(partition by user_id order by ts), ts) next_ts
from
events_tabl ) t
You want a cumulative sum to identify the sessions and then aggregation:
select user_id, session_id, min(ts), max(ts)
from (select e.*,
sum(case when prev_ts > dateadd(minute, -30, ts)
then 0 else 1
end) over (partition by user_id order by ts) as session_id
from (select e.*,
lag(ts) over (partition by user_id order by ts) as prev_ts
from events_tabl e
) e
) e
group by user_id, session_id;
Note that I changed the date/time logic from using datediff() to a direct comparison of the times. datediff() counts the number of "boundaries" between two times. So, there is 1 hour between 12:59 a.m. and 1:01 a.m. -- but zero hours between 1:01 a.m. and 1:59 a.m.
Although handling the diffs at the second level produces similar results, you can run into occasions where you are working with seconds or milliseconds but the time spans are too long to fit into an integer, which causes overflow errors. It is just easier to work directly with the date/time values.
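The boundary-counting behaviour is easy to reproduce. The sketch below is a Python imitation of it, not the database's implementation; the helper names `floor_hour` and `datediff_hours` are assumptions for the example. It truncates both times to the hour before subtracting, which is what counting boundaries amounts to, and contrasts that with a direct comparison of the two values.

```python
from datetime import datetime, timedelta

def floor_hour(t):
    """Truncate a datetime to the start of its hour."""
    return t.replace(minute=0, second=0, microsecond=0)

def datediff_hours(start, end):
    """Count the hour boundaries crossed between start and end."""
    return int((floor_hour(end) - floor_hour(start)).total_seconds() // 3600)

a = datetime(2020, 1, 1, 0, 59)   # 12:59 a.m.
b = datetime(2020, 1, 1, 1, 1)    # 1:01 a.m.
c = datetime(2020, 1, 1, 1, 59)   # 1:59 a.m.

print(datediff_hours(a, b))  # 1, although only 2 minutes apart
print(datediff_hours(b, c))  # 0, although 58 minutes apart

# Comparing the values directly avoids the boundary effect:
print(c - b > timedelta(minutes=30))  # True
print(b - a > timedelta(minutes=30))  # False
```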

Vertica Analytic function to count instances in a window

Let's say I have a dataset with two columns: ID and timestamp. My goal is to return IDs that have at least n timestamps in any 30 day window.
Here is an example:
ID Timestamp
1 '2019-01-01'
2 '2019-02-01'
3 '2019-03-01'
1 '2019-01-02'
1 '2019-01-04'
1 '2019-01-17'
So, let's say I want to return a list of IDs that have 3 timestamps in any 30 day window.
Given above, my resultset would just be ID = 1. I'm thinking some kind of windowing function would accomplish this, but I'm not positive.
Any chance you could help me write a query that accomplishes this?
A relatively simple way to do this involves lag()/lead():
select t.*
from (select t.*,
lead(timestamp, 2) over (partition by id order by timestamp) as timestamp_2
from t
) t
where datediff(day, timestamp, timestamp_2) <= 30;
The lead() looks at the third timestamp in a series, counting the current row as the first. The where checks whether this is within 30 days of the original one. The result is the rows where this occurs.
If you just want the ids, then:
select distinct id
from (select t.*,
lead(timestamp, 2) over (partition by id order by timestamp) as timestamp_2
from t
) t
where datediff(day, timestamp, timestamp_2) <= 30;
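The same check generalizes to n timestamps by looking n - 1 rows ahead. Here is a minimal Python sketch of that idea on the sample data from the question; the function name `ids_with_n_in_window` and its parameters are assumptions for illustration.

```python
from datetime import date

def ids_with_n_in_window(rows, n=3, days=30):
    """rows: (id, timestamp) pairs. Returns the set of ids that have
    at least n timestamps within some window of `days` days."""
    by_id = {}
    for id_, ts in rows:
        by_id.setdefault(id_, []).append(ts)
    hits = set()
    for id_, stamps in by_id.items():
        stamps.sort()
        # pair each timestamp with the one n-1 positions later,
        # like lead(timestamp, n - 1) over (partition by id order by timestamp)
        for first, later in zip(stamps, stamps[n - 1:]):
            if (later - first).days <= days:
                hits.add(id_)
                break
    return hits

rows = [
    (1, date(2019, 1, 1)), (2, date(2019, 2, 1)), (3, date(2019, 3, 1)),
    (1, date(2019, 1, 2)), (1, date(2019, 1, 4)), (1, date(2019, 1, 17)),
]
print(ids_with_n_in_window(rows))  # {1}
```

Because the rows are sorted per ID, it is enough to compare each timestamp with the one n - 1 positions later: if those two are within the window, so is everything between them.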