Calculating user retention on a daily basis between dates in SQL

I have a table that stores user IDs and all of their log-in dates to the app:
| User_Id | log_in_dates |
| 1 | 2021-09-01 |
| 1 | 2021-09-03 |
| 2 | 2021-09-02 |
| 2 | 2021-09-04 |
| 3 | 2021-09-01 |
| 3 | 2021-09-02 |
| 3 | 2021-09-03 |
| 3 | 2021-09-04 |
| 4 | 2021-09-03 |
| 4 | 2021-09-04 |
| 5 | 2021-09-01 |
| 6 | 2021-09-01 |
| 6 | 2021-09-09 |
From the above table, I'm trying to understand the users' log-in behavior from the present day back over the past 90 days.
Num_users_no_log_in is the number of users who have not logged in to the app between the previous date (last_log_in_date) and present_date.
I want a table like the one below:
| present_date | days_difference | last_log_in_date | Num_users_no_log_in |
| 2021-09-01 | 0 | 2021-09-01 | 0 |
| 2021-09-02 | 1 | 2021-09-01 | 3 |->(Id = 1,5,6)
| 2021-09-02 | 0 | 2021-09-02 | 3 |->(Id = 1,5,6)
| 2021-09-03 | 2 | 2021-09-01 | 2 |->(Id = 5,6)
| 2021-09-03 | 1 | 2021-09-02 | 1 |->(Id = 2)
| 2021-09-03 | 0 | 2021-09-03 | 3 |->(Id = 2,5,6)
| 2021-09-04 | 3 | 2021-09-01 | 2 |->(Id = 5,6)
| 2021-09-04 | 2 | 2021-09-02 | 0 |
| 2021-09-04 | 1 | 2021-09-03 | 1 |->(Id= 1)
| 2021-09-04 | 0 | 2021-09-04 | 3 |->(Id = 1,5,6)
| .... | .... | .... | ....
I was able to get the first three columns Present_date | days_difference | last_log_in_date using the following query:
with dts as (
select distinct log_in_dates from users_table
)
select x.log_in_dates as present_date,
DATEDIFF(DAY, y.log_in_dates, x.log_in_dates) as Days_since_last_log_in,
y.log_in_dates as log_in_dates
from dts x, dts y
where x.log_in_dates >= y.log_in_dates
I don't understand how I can get the fourth column Num_users_no_log_in

I do not fully understand your need: are the values based on users or on dates? If they are based on dates, as it appears (otherwise you would probably have user_id as the first column), what does it mean to have the same date multiple times? I understand that you would like a recap for all dates from the beginning until the current date, but in my opinion it does not really make sense (imagine your dashboard in a year!).
That said, let's go to the approach.
In such cases, I develop step by step using common table expressions. For your example, it requires 3 steps:
prepare the time series
integrate the connection dates and perform the first calculation (time difference)
calculate the number of connections per day
Then, the final query will display the desired result.
Here is the query I propose, developed with PostgreSQL (you did not specify your DBMS, but converting should not be a big deal here):
with init_calendar as (
-- Prepare date series and count total users
select generate_series(min(log_in_dates), now(), interval '1 day') as present_date,
count(distinct user_id) as nb_users
from users
),
calendar as (
-- Add connections' dates for each period from the beginning to current date in calendar
-- and calculate nb days difference for each of them
-- Syntax may vary depending on the DBMS used
select distinct present_date, log_in_dates as last_date,
extract(day from present_date - log_in_dates) as days_difference,
nb_users
from init_calendar
join users on log_in_dates <= present_date
),
usr_con as (
-- Identify last user connection's dates according to running date
-- Tag the line to be counted as no connection
select c.present_date, c.last_date, c.days_difference, c.nb_users,
u.user_id, max(log_in_dates) as last_con,
case when max(log_in_dates) = present_date then 0 else 1 end as to_count
from calendar c
join users u on u.log_in_dates <= c.last_date
group by c.present_date, c.last_date, c.days_difference, c.nb_users, u.user_id
)
select present_date, last_date, days_difference,
nb_users - sum(to_count) as Num_users_no_log_in
from usr_con
group by present_date, last_date, days_difference, nb_users
order by present_date, last_date
Please note that there is a difference with your own expected result as you forgot user_id = 3 in your calculation.
If you want to play with the query, you can with dbfiddle
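For comparison, here is a minimal Python sketch of one possible reading of the expected table in the question. It is not a translation of the query above. The assumed reading is: for days_difference > 0, count the users whose most recent log-in on or before present_date is exactly last_log_in_date; for days_difference = 0, count the users who have logged in before present_date but not on it. The function name `retention_rows` and this interpretation are assumptions for illustration only.

```python
from datetime import date

# Sample data from the question: user_id -> log-in dates.
logins = {
    1: [date(2021, 9, 1), date(2021, 9, 3)],
    2: [date(2021, 9, 2), date(2021, 9, 4)],
    3: [date(2021, 9, d) for d in (1, 2, 3, 4)],
    4: [date(2021, 9, 3), date(2021, 9, 4)],
    5: [date(2021, 9, 1)],
    6: [date(2021, 9, 1), date(2021, 9, 9)],
}

def retention_rows(logins, present_dates):
    """Return (present_date, days_difference, last_log_in_date, count) tuples."""
    all_days = sorted({d for days in logins.values() for d in days})
    rows = []
    for present in present_dates:
        # Most recent log-in on or before `present`, for users that have one.
        latest = {}
        for uid, days in logins.items():
            seen = [d for d in days if d <= present]
            if seen:
                latest[uid] = max(seen)
        for last in (d for d in all_days if d <= present):
            if last == present:
                # Users seen before, but not on the present date (assumed reading).
                n = sum(1 for d in latest.values() if d < present)
            else:
                # Users whose latest log-in is exactly `last`.
                n = sum(1 for d in latest.values() if d == last)
            rows.append((present, (present - last).days, last, n))
    return rows

rows = retention_rows(logins, [date(2021, 9, d) for d in (1, 2, 3, 4)])
for row in rows:
    print(*row)
```

Under that reading, the counts for 2021-09-01 through 2021-09-04 come out as 0, 3, 3, 2, 1, 3, 2, 0, 1, 3, in the same row order as the table in the question.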


SQL Server - Counting total number of days user had active contracts

I want to count the number of days a user had an active contract, based on a table with start and end dates for each service contract. I want to count the time of any activity, no matter whether the customer had 1 or 5 contracts active at the same time.
| 1 | 14 | 18.02.2021 | 18.04.2022 |
| 1 | 13 | 02.01.2019 | 02.01.2020 |
| 1 | 12 | 01.01.2018 | 01.01.2019 |
| 1 | 11 | 13.02.2017 | 13.02.2019 |
| 2 | 23 | 19.06.2021 | 18.04.2022 |
| 2 | 22 | 01.07.2019 | 01.07.2020 |
| 2 | 21 | 19.01.2019 | 19.01.2020 |
In result I want a table:
| 1 | 1477 |
| 2 | 832 |
1477 stands for 1053 (days from 13.02.2017 to 02.01.2020; the user had active contracts during this whole period) + 424 (days from 18.02.2021 to 18.04.2022).
832 stands for 529 (days from 19.01.2019 to 01.07.2020) + 303 (days from 19.06.2021 to 18.04.2022).
I tried some queries with joins, datediff's, case when conditions but nothing worked. I'll be grateful for any help.
If you don't have a Tally/Numbers table (highly recommended), you can use an ad-hoc tally/numbers table
Example or dbFiddle
Select User_ID
,Days = count(DISTINCT dateadd(DAY,N,Start_Date))
from YourTable A
Join ( Select Top 10000 N=Row_Number() Over (Order By (Select NULL))
From master..spt_values n1, master..spt_values n2
) B
On N<=DateDiff(DAY,Start_Date,End_Date)
Group By User_ID
User_ID Days
1 1477
2 832
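The same totals can be cross-checked outside the database by merging overlapping contract periods and summing the lengths of the merged periods. The following is only a sketch of that alternative technique (interval merging), not of the tally-table query above; the function name `active_days` is made up for the example.

```python
from datetime import date

# Contracts from the question: (user_id, start, end).
contracts = [
    (1, date(2021, 2, 18), date(2022, 4, 18)),
    (1, date(2019, 1, 2), date(2020, 1, 2)),
    (1, date(2018, 1, 1), date(2019, 1, 1)),
    (1, date(2017, 2, 13), date(2019, 2, 13)),
    (2, date(2021, 6, 19), date(2022, 4, 18)),
    (2, date(2019, 7, 1), date(2020, 7, 1)),
    (2, date(2019, 1, 19), date(2020, 1, 19)),
]

def active_days(contracts):
    """Sum the lengths of the merged (non-overlapping) periods per user."""
    by_user = {}
    for user_id, start, end in contracts:
        by_user.setdefault(user_id, []).append((start, end))
    totals = {}
    for user_id, spans in by_user.items():
        spans.sort()  # order by start date
        total = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or touching period: extend the current one.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current period and start a new one.
                total += (cur_end - cur_start).days
                cur_start, cur_end = start, end
        totals[user_id] = total + (cur_end - cur_start).days
    return totals

totals = active_days(contracts)
print(totals)  # {1: 1477, 2: 832}
```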

Select only record until timestamp from another table

I have three tables.
The first one is Device table
| DeviceId | Type |
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
The second one is History table - data received by different devices.
| DeviceId | Temperature | TimeStamp |
| 1 | 31 | 15.08.2020 1:42:00 |
| 2 | 100 | 15.08.2020 1:42:01 |
| 2 | 40 | 15.08.2020 1:43:00 |
| 1 | 32 | 15.08.2020 1:44:00 |
| 1 | 34 | 15.08.2020 1:45:00 |
| 3 | 20 | 15.08.2020 1:46:00 |
| 2 | 45 | 15.08.2020 1:47:00 |
The third one is DeviceStatusHistory table
| DeviceId | State | TimeStamp |
| 1 | 1(OK) | 15.08.2020 1:42:00 |
| 2 | 1(OK) | 15.08.2020 1:43:00 |
| 1 | 1(OK) | 15.08.2020 1:44:00 |
| 1 | 0(FAIL) | 15.08.2020 1:44:30 |
| 1 | 0(FAIL) | 15.08.2020 1:46:00 |
| 2 | 0(FAIL) | 15.08.2020 1:46:10 |
I want to select the last temperature of each device, but take into account only those history records that occur before the first device failure.
Since device1 starts failing from 15.08.2020 1:44:30, I don't want its records that go after that timestamp.
The same for the device2.
So as a final result, I want to have only data of all devices until they get first FAIL status:
| DeviceId | Temperature | TimeStamp |
| 2 | 40 | 15.08.2020 1:43:00 |
| 1 | 32 | 15.08.2020 1:44:00 |
| 3 | 20 | 15.08.2020 1:46:00 |
I can select an appropriate history only if device failed at least once:
(SELECT TOP 1 * FROM History H
WHERE D.Id = H.DeviceId
and H.DeviceTimeStamp <
(select MIN(UpdatedOn) from DeviceStatusHistory Y where [State]=0 and DeviceId=D.Id)
ORDER BY H.DeviceTimeStamp desc) X
The problem is that if a device never fails, I don't get its history at all.
My idea is to use something like this
SELECT * FROM DeviceHardwarePart HP
(SELECT TOP 1 * FROM History H
WHERE HP.Id = H.DeviceId
and H.DeviceTimeStamp <
(select ISNULL((select MIN(UpdatedOn) from DeviceMetadataPart where [State]=0 and DeviceId=HP.Id),
cast('12/31/9999 23:59:59.997' as datetime)))
ORDER BY H.DeviceTimeStamp desc) X
I'm not sure whether it is a good solution
You can use COALESCE: coalesce(min(UpdatedOn), cast('9999-12-31 23:59:59' as datetime)). This ensures you always have an upper bound for your select instead of NULL.
I will treat this as a two-part problem:
First, find the time at which each device first failed; if it has never failed, use a large placeholder value such as a timestamp in 2099.
Once I have that, I can simply join with the History table and take the latest value before the failure timestamp.
For the first part, there are several possible approaches. Off the top of my head, something like the following should work:
select device_id, coalesce(min(failed_timestamps), cast('2099-01-01 01:01:01' as timestamp)) as failed_at
from (select device_id, case when state = 0 then timestamp else null end as failed_timestamps from DeviceStatusHistory) as X
group by device_id
This gives us the earliest failure timestamp for each device, and an arbitrarily large value for devices that have never failed.
After this, the rest of the solution is straightforward.
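To make the two parts concrete, here is a small sketch that runs the whole idea against the sample data using Python's built-in sqlite3 module. SQLite is only a stand-in for the asker's database, the timestamps are stored as ISO-formatted text so that string comparison orders them correctly, and the query layout (a correlated subquery with COALESCE) is one possible way to combine the steps, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE History (DeviceId INTEGER, Temperature INTEGER, TimeStamp TEXT);
CREATE TABLE DeviceStatusHistory (DeviceId INTEGER, State INTEGER, TimeStamp TEXT);
INSERT INTO History VALUES
  (1, 31, '2020-08-15 01:42:00'), (2, 100, '2020-08-15 01:42:01'),
  (2, 40, '2020-08-15 01:43:00'), (1, 32, '2020-08-15 01:44:00'),
  (1, 34, '2020-08-15 01:45:00'), (3, 20, '2020-08-15 01:46:00'),
  (2, 45, '2020-08-15 01:47:00');
INSERT INTO DeviceStatusHistory VALUES
  (1, 1, '2020-08-15 01:42:00'), (2, 1, '2020-08-15 01:43:00'),
  (1, 1, '2020-08-15 01:44:00'), (1, 0, '2020-08-15 01:44:30'),
  (1, 0, '2020-08-15 01:46:00'), (2, 0, '2020-08-15 01:46:10');
""")

# For each device, keep the latest History row that is strictly earlier than
# its first failure; devices that never failed get a far-future upper bound.
query = """
SELECT h.DeviceId, h.Temperature, h.TimeStamp
FROM History h
WHERE h.TimeStamp = (
    SELECT MAX(h2.TimeStamp)
    FROM History h2
    WHERE h2.DeviceId = h.DeviceId
      AND h2.TimeStamp < COALESCE(
            (SELECT MIN(s.TimeStamp)
             FROM DeviceStatusHistory s
             WHERE s.DeviceId = h.DeviceId AND s.State = 0),
            '9999-12-31 23:59:59')
)
ORDER BY h.TimeStamp
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data this returns device 2 at 01:43:00 (40), device 1 at 01:44:00 (32) and device 3 at 01:46:00 (20), which matches the result asked for in the question.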

30 day rolling count of distinct IDs

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query for the number of unique uids per day:
SELECT COUNT(DISTINCT uid)
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
It is also relatively simple to calculate the daily unique uids over a date range:
SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)
I then tried several ways to do a rolling 30 day unique count over a time interval:
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(DISTINCT
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique rather than the unique over 30days.
I am writing this query from Metabase using the SQL query editor. The underlying tables are in Redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
| uid | session_start_time |
| 10 | 2020-01-13T01:46:07.000-05:00 |
| 5 | 2020-01-13T01:46:07.000-05:00 |
| 3 | 2020-01-18T02:49:23.000-05:00 |
| 9 | 2020-03-06T18:18:28.000-05:00 |
| 2 | 2020-03-06T18:18:28.000-05:00 |
| 8 | 2020-03-31T23:13:33.000-04:00 |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| 2 | 2020-08-28T18:23:15.000-04:00 |
| 9 | 2020-08-28T18:23:15.000-04:00 |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| 8 | 2020-09-15T16:40:29.000-04:00 |
| 3 | 2020-09-21T20:49:09.000-04:00 |
| 1 | 2020-11-05T21:31:48.000-05:00 |
| 6 | 2020-11-05T21:31:48.000-05:00 |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| 5 | 2020-12-12T04:42:00.000-05:00 |
Below is what I would like the result to look like:
| 2020-01-13 | 3 |
| 2020-01-18 | 1 |
| 2020-03-06 | 3 |
| 2020-03-31 | 1 |
| 2020-08-28 | 4 |
| 2020-09-15 | 2 |
| 2020-09-21 | 1 |
| 2020-11-05 | 2 |
| 2020-12-12 | 2 |
Thank you
You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:
Unpivoting the data to have an "enters count" and "leaves" count for each session.
Accumulate the count so on each day for each user you know whether they are counted or not.
This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
Now you can simply do a cumulative sum on each date to determine the 30 day session.
In SQL, this looks like:
with t as (
select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
from users_sessions
union all
select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
from users_sessions
),
tt as ( -- increment the ins and outs to determine whether a uid is in or out on a given day
select uid, s_day, sum(inc) as day_inc,
sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
from t
group by uid, s_day
),
ttt as ( -- find the beginning and end of the islands
select tt.uid, tt.s_day,
(case when running_inc > 0 then 1 else -1 end) as in_island
from (select tt.*,
lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
lead(running_inc) over (partition by uid order by s_day) as next_running_inc
from tt
) tt
where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
)
select s_day,
sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
Here is a db<>fiddle.
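The counting idea can also be traced step by step in plain Python. The sketch below follows the same four steps (a +1 event on the session day, a -1 event 31 days later, a running sum per user to find the islands, then a cumulative sum over island starts and ends). The function name `active_counts` and the tiny data set are invented for illustration and are not taken from the fiddle.

```python
from collections import defaultdict
from datetime import date, timedelta

def active_counts(sessions, window_days=31):
    """Return (day, users currently counted) at every day the count changes."""
    # Steps 1-2: +1 when a session starts, -1 window_days later, per user.
    events = defaultdict(lambda: defaultdict(int))
    for uid, day in sessions:
        events[uid][day] += 1
        events[uid][day + timedelta(days=window_days)] -= 1

    # Step 3: a user's island starts when the running sum leaves 0
    # and ends when it returns to 0; everything in between is ignored.
    deltas = defaultdict(int)
    for per_day in events.values():
        running = 0
        for day in sorted(per_day):
            previous = running
            running += per_day[day]
            if previous == 0 and running > 0:
                deltas[day] += 1
            elif previous > 0 and running == 0:
                deltas[day] -= 1

    # Step 4: cumulative sum of island starts and ends over all users.
    result, total = [], 0
    for day in sorted(deltas):
        total += deltas[day]
        result.append((day, total))
    return result

# Hypothetical sessions: uid 10 returns on 2020-01-20, which extends its island.
sessions = [
    (10, date(2020, 1, 13)),
    (5, date(2020, 1, 13)),
    (3, date(2020, 1, 18)),
    (10, date(2020, 1, 20)),
]
result = active_counts(sessions)
for day, count in result:
    print(day, count)
```

The output lists every day on which the count changes, including the days on which users drop out of the window, so it has more rows than a query that only reports days with sessions.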
I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. Then it joins the user list to the date list (one-to-many) and counts the distinct users. The key here is the expanded join criteria, which match a range of dates to a single date via a pair of inequalities.
with users as
(select
distinct uid,
date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01'),
dates as
(select
distinct date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01')
select
count(distinct uid),
dates.dt
from users
join dates
on users.dt >= dates.dt - 29
and users.dt <= dates.dt
group by dates.dt
order by dt desc
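Here is a small runnable sketch of the range-join idea, using Python's built-in sqlite3 module as a stand-in for Redshift. It assumes a table named users_sessions holding the first six rows of the question's sample, with the time zone offsets dropped so that SQLite's date() function truncates the values as written; `date(dt, '-29 days')` plays the role of `dt - 29`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_sessions (uid INTEGER, session_start_time TEXT);
INSERT INTO users_sessions VALUES
  (10, '2020-01-13 01:46:07'), (5, '2020-01-13 01:46:07'),
  (3, '2020-01-18 02:49:23'), (9, '2020-03-06 18:18:28'),
  (2, '2020-03-06 18:18:28'), (8, '2020-03-31 23:13:33');
""")

# Each distinct date is joined to every user/day row in the 30-day window
# that ends on that date; the distinct uids in the window are then counted.
query = """
WITH users AS (
    SELECT DISTINCT uid, date(session_start_time) AS dt
    FROM users_sessions
),
dates AS (
    SELECT DISTINCT dt FROM users
)
SELECT dates.dt, COUNT(DISTINCT users.uid) AS active_30
FROM dates
JOIN users
  ON users.dt >= date(dates.dt, '-29 days')
 AND users.dt <= dates.dt
GROUP BY dates.dt
ORDER BY dates.dt
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On this subset the query yields 2, 3, 2 and 3 distinct users for 2020-01-13, 2020-01-18, 2020-03-06 and 2020-03-31 respectively. Only dates that have at least one session appear in the output.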

SQL: Get an aggregate (SUM) of a calculation of two fields (DATEDIFF) that has conditional logic (CASE WHEN)

I have a dataset that includes a bunch of stay data (at a hotel). Each row contains a start date and an end date, but no duration field. I need to get a sum of the durations.
Sample Data:
| Stay ID | Client ID | Start Date | End Date |
| 1 | 38 | 01/01/2018 | 01/31/2019 |
| 2 | 16 | 01/03/2019 | 01/07/2019 |
| 3 | 27 | 01/10/2019 | 01/12/2019 |
| 4 | 27 | 05/15/2019 | NULL |
| 5 | 38 | 05/17/2019 | NULL |
There are some added complications:
I am using Crystal Reports and this is a SQL Expression, which obeys slightly different rules. Basically, it returns a single scalar value.
Sometimes, the end date field is blank (they haven't booked out yet). If blank, I would like to replace it with the current timestamp.
I only want to count nights that have occurred in the past year. If the start date of a given stay is more than a year ago, I need to adjust it.
I need to get a sum by Client ID
I'm not actually any good at SQL so all I have is guesswork.
The proper syntax for a Crystal Reports SQL Expression is something like this:
And that's giving me the correct value for a single row, if I wanted to do this:
| Stay ID | Client ID | Start Date | End Date | Duration |
| 1 | 38 | 01/01/2018 | 01/31/2019 | 210 | // only days since June 4 2018 are counted
| 2 | 16 | 01/03/2019 | 01/07/2019 | 4 |
| 3 | 27 | 01/10/2019 | 01/12/2019 | 2 |
| 4 | 27 | 05/15/2019 | NULL | 21 |
| 5 | 38 | 05/17/2019 | NULL | 19 |
But I want to get the SUM of Duration per client, so I want this:
| Stay ID | Client ID | Start Date | End Date | Duration |
| 1 | 38 | 01/01/2018 | 01/31/2019 | 229 | // 210+19
| 2 | 16 | 01/03/2019 | 01/07/2019 | 4 |
| 3 | 27 | 01/10/2019 | 01/12/2019 | 23 | // 2+21
| 4 | 27 | 05/15/2019 | NULL | 23 |
| 5 | 38 | 05/17/2019 | NULL | 229 |
I've tried to just wrap a SUM() around my CASE but that doesn't work:
It gives me an error that the StayDateEnd is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. But I don't even know what that means, so I'm not sure how to troubleshoot, or where to go from here. And then the next step is to get the SUM by Client ID.
Any help would be greatly appreciated!
Although the explanation and data set are almost impossible to match, I think this is an approximation to what you want.
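declare @your_data table (StayId int, ClientId int, StartDate date, EndDate date)
-- sample rows taken from the question
insert into @your_data values
(1, 38, '2018-01-01', '2019-01-31'),
(2, 16, '2019-01-03', '2019-01-07'),
(3, 27, '2019-01-10', '2019-01-12'),
(4, 27, '2019-05-15', null),
(5, 38, '2019-05-17', null)
;with data as (
select *,
datediff(day,
case
when datediff(day,StartDate,getdate())>365 then dateadd(year,-1,getdate())
else StartDate
end,
isnull(EndDate,getdate())
) days
from @your_data
)
select *,
sum(days) over (partition by ClientId)
from data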
You need a subquery that computes the sum grouped by client_id, and a join between your table and that subquery, e.g.:
select Stay_id, client_id, Start_date, End_date, t.sum_duration
from your_table
inner join (
select Client_id,
END) sum_duration
from your_table
group by Client_id
) t on t.Client_id = your_table.client_id
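The three rules from the question (open stays end "now", start dates older than a year are clipped, totals are per client) can be checked in a few lines of Python. This is only a sketch: `TODAY` is fixed at 2019-06-05 as an assumption so the result is reproducible, and `nights_per_client` is a made-up name. Because of the fixed date, the totals need not match every figure quoted in the question.

```python
from datetime import date, timedelta

TODAY = date(2019, 6, 5)                     # assumed fixed "current date"
WINDOW_START = TODAY - timedelta(days=365)   # only nights within the past year count

# Sample stays from the question: (stay_id, client_id, start, end or None).
stays = [
    (1, 38, date(2018, 1, 1), date(2019, 1, 31)),
    (2, 16, date(2019, 1, 3), date(2019, 1, 7)),
    (3, 27, date(2019, 1, 10), date(2019, 1, 12)),
    (4, 27, date(2019, 5, 15), None),
    (5, 38, date(2019, 5, 17), None),
]

def nights_per_client(stays):
    """Sum the clipped stay durations per client."""
    totals = {}
    for _stay_id, client_id, start, end in stays:
        end = end or TODAY                 # stay still open: count up to today
        start = max(start, WINDOW_START)   # clip stays that began over a year ago
        nights = max((end - start).days, 0)
        totals[client_id] = totals.get(client_id, 0) + nights
    return totals

totals = nights_per_client(stays)
print(totals)
```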

find every n-th date in a continuous date stream

I would like to find/mark every 4th day in a continuous date stream inserted into my table for each user in a given date range.
CREATE TABLE mytable (
id INTEGER,
myuser INTEGER,
day DATE);
The problem is that only 3 continuous days are valid per user; after that, there has to be a one-day "break".
id | myuser | day |
0 | 200 | 2012-01-12 | }
1 | 200 | 2012-01-13 | }--> 3 continuous days
2 | 200 | 2012-01-14 | }
3 | 200 | 2012-01-15 | <-- not ok, user 200 should get warned and delete this
4 | 200 | 2012-01-16 | }
5 | 200 | 2012-01-17 | }--> 3 continuous days
6 | 200 | 2012-01-18 | }
7 | 200 | 2012-01-19 | <-- not ok, user 200 should get warned and delete this
8 | 201 | 2012-01-12 | }
9 | 201 | 2012-01-13 | }--> 3 continuous days
10 | 201 | 2012-01-14 | }
11 | 201 | 2012-01-16 | <-- ok, there is a one day gap here
12 | 201 | 2012-01-17 |
The main goal is to look at a given date range (usually a month) and identify days which are not allowed. I also have to take care that overlapping dates are handled correctly. For example, if I look at the date range from 2012-02-01 to 2012-02-29, 2012-02-01 could be a "break" day if 2012-01-29 to 2012-01-31 are present in that table for the same user.
I don't have access to PostgreSQL, but hopefully this works...
WITH grouped_data AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY myuser ORDER BY day) - (day - start_date) AS user_group_id
FROM mytable
WHERE day >= start_date - 3
AND day <= end_date
),
sequenced_data AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY myuser, user_group_id ORDER BY day) AS sequence_id
FROM grouped_data
)
SELECT *,
CASE WHEN sequence_id % 4 = 0 THEN 1 ELSE 0 END as should_be_a_break_day
FROM sequenced_data
WHERE day >= start_date
Sorry I didn't explain the workings, I had to jump into a meeting :)
Example with start_date = '2012-01-14'...
id | myuser | day | ROW_NUMBER() | day - start_date | user_group_id
0 | 200 | 2012-01-12 | 1 | -2 | 1 - -2 = 3
1 | 200 | 2012-01-13 | 2 | -1 | 2 - -1 = 3
2 | 200 | 2012-01-14 | 3 | 0 | 3 - 0 = 3
3 | 200 | 2012-01-15 | 4 | 1 | 4 - 1 = 3
4 | 200 | 2012-01-16 | 5 | 2 | 5 - 2 = 3
5 | 201 | 2012-01-12 | 1 | -2 | 1 - -2 = 3
6 | 201 | 2012-01-13 | 2 | -1 | 2 - -1 = 3
7 | 201 | 2012-01-14 | 3 | 0 | 3 - 0 = 3
8 | 201 | 2012-01-16 | 4 | 2 | 4 - 2 = 2
Any sequential dates will have the same user_group_id. Each 'gap' in the days makes that user_group_id decrease by 1 (see row 8, if the record was for the 17th, a 2 day gap, the id would have been 1).
Once you have a group_id, row_number() can easily be used to say which day in the sequence it is. A max of 3 days is the same as "Every 4th day should be a gap", and "x % 4 = 0" identifies every 4th day.
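The grouping trick is easy to try out in Python as well. In this sketch the key is the day number minus the position in the sorted list, which is the same idea as ROW_NUMBER() minus the day offset with the sign flipped: consecutive days share one key, and any gap changes it. The function name `break_days` is invented for the example, and the input is assumed to hold unique days for a single user.

```python
from datetime import date
from itertools import groupby

def break_days(days):
    """Return every 4th day of each run of consecutive days (one user)."""
    days = sorted(days)
    flagged = []
    # Consecutive dates share the same (ordinal - index) value, so that value
    # serves as the group id of a run.
    runs = groupby(enumerate(days), key=lambda pair: pair[1].toordinal() - pair[0])
    for _group_id, run in runs:
        for sequence_id, (_index, day) in enumerate(run, start=1):
            if sequence_id % 4 == 0:
                flagged.append(day)
    return flagged

# Data from the question.
user_200 = [date(2012, 1, d) for d in range(12, 20)]        # 12th .. 19th
user_201 = [date(2012, 1, d) for d in (12, 13, 14, 16, 17)]

flagged_200 = break_days(user_200)
flagged_201 = break_days(user_201)
print(flagged_200)  # the 15th and the 19th
print(flagged_201)  # []
```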
Much simpler and faster with the window function lag():
SELECT myuser
,day
,COALESCE(lag(day, 3) OVER (PARTITION BY myuser ORDER BY day) = (day - 3)
,FALSE) AS break_overdue
FROM mytable
WHERE day BETWEEN ('2012-01-12'::date - 3) AND '2012-01-16'::date;
myuser | day | break_overdue
200 | 2012-01-12 | f
200 | 2012-01-13 | f
200 | 2012-01-14 | f
200 | 2012-01-15 | t
200 | 2012-01-16 | t
201 | 2012-01-12 | f
201 | 2012-01-13 | f
201 | 2012-01-14 | f
201 | 2012-01-16 | f
Major points:
The query marks all days as break_overdue after three consecutive days. It is unclear whether you want all of them marked after the rule has been broken or just every 4th day.
I include 3 days before the start date (not just two) to determine whether the first day is already in violation of the rule.
The test is simple: if the 3rd row before the current row within the partition equals the current day - 3 then the rule has been broken. I wrap it all in COALESCE to fold NULL values to FALSE for cosmetic reasons only. Guaranteed to work as long as (myuser, day) is unique.
In PostgreSQL you can subtract integers from a date, effectively subtracting days.
Can be done in a single query level, no CTE or subquery needed. Should be much faster.
You need PostgreSQL 8.4 or later for window functions.
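The lag() test translates directly to a list lookup, which may help to see why it works. A sketch for a single user, assuming the days are unique (the same precondition as in the query); `break_overdue` here is a hypothetical Python function that mirrors the column name.

```python
from datetime import date, timedelta

def break_overdue(days):
    """Pair each day with True if the 3rd previous entry is exactly 3 days earlier."""
    days = sorted(days)
    return [
        # days[i - 3] plays the role of lag(day, 3); the first three rows have
        # no such entry and are folded to False, like COALESCE(..., FALSE).
        (day, i >= 3 and days[i - 3] == day - timedelta(days=3))
        for i, day in enumerate(days)
    ]

user_200 = [date(2012, 1, d) for d in (12, 13, 14, 15, 16)]
user_201 = [date(2012, 1, d) for d in (12, 13, 14, 16)]

flags_200 = [flag for _day, flag in break_overdue(user_200)]
flags_201 = [flag for _day, flag in break_overdue(user_201)]
print(flags_200)  # [False, False, False, True, True]
print(flags_201)  # [False, False, False, False]
```

As in the query output above, every day after three consecutive days is flagged, not just every 4th one.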