Sliding window of count distinct users for 12 months - sql

I have a fairly basic dataset: a table containing a timestamp for every time a user interacts with an app. An active user is someone who has interacted with the app at least once in the previous 12 months.
I need to produce a table that tells me, day by day (going back n days), how many "active" users there were in the prior 12-month period. I need to run the query in Amazon Athena.
A possible complication is that one user could interact with the app every day, so each user must be counted only once per window. I was wondering what the best window function would be to capture this.
The data is in the format:
A Opened App 10/04/2020
A Opened App 10/02/2020
A Opened App 05/01/2020
B Opened App 12/03/2020
B Opened App 02/01/2019
B Opened App 20/07/2018
C Opened App 19/04/2019
I need a resulting table of
20/04/2020 2 (A and B)
19/04/2020 2 (A and B)
18/04/2020 3 (all three)
...
04/01/2020 1 (Only C)
...

One method is to use count(distinct) with a range window function:
select distinct date,
count(distinct user) over (order by date range between interval '1 year' preceding and current row) as num_active_users
from t;
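Not all databases support this syntax: several (PostgreSQL and SQL Server, for example) reject distinct inside a window function. Note also that the query only returns dates that actually appear in the table.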
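Where that syntax is unavailable, the counting logic itself is simple enough to sketch outside SQL. The Python below is only an illustration, not the answer's method: it hard-codes the sample rows from the question, and the helper names (`one_year_before`, `active_users`, `daily_active_counts`) are made up for this example. It assumes the window excludes the day exactly one year earlier, which is what the expected output in the question implies (C drops out on 19/04/2020).

```python
from datetime import date, timedelta

# Sample rows from the question: (user, interaction date), dates read as DD/MM/YYYY.
EVENTS = [
    ("A", date(2020, 4, 10)), ("A", date(2020, 2, 10)), ("A", date(2020, 1, 5)),
    ("B", date(2020, 3, 12)), ("B", date(2019, 1, 2)), ("B", date(2018, 7, 20)),
    ("C", date(2019, 4, 19)),
]

def one_year_before(day):
    """Same calendar day one year earlier (29 Feb falls back to 28 Feb)."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)

def active_users(events, day):
    """Distinct users seen in the window (day - 1 year, day]."""
    start = one_year_before(day)
    # A set keeps each user once, however many times they opened the app.
    return {user for user, seen in events if start < seen <= day}

def daily_active_counts(events, last_day, n_days):
    """One (day, active user count) pair per day, newest first."""
    days = (last_day - timedelta(days=i) for i in range(n_days))
    return [(d, len(active_users(events, d))) for d in days]
```

For the sample data, `daily_active_counts(EVENTS, date(2020, 4, 20), 3)` gives counts of 2, 2 and 3 for 20/04, 19/04 and 18/04, matching the table in the question.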

Related

Skip the day the email was opened and count 7 days after open in big query SQL

I am trying to block off a window within my script that will attribute a sale to a 7-day window. The issue that I am having is that I want the seven-day window to not include the open date so open date = 0 and the sales window begins on day 1.
Here is the current way that I am creating that window -
and oh.Order_Date >= first_open_date.first_open
and oh.Order_Date <= first_open_date.first_open + 7
If you can provide some example data I can help with a more accurate answer, but for now I hope the below will share some ideas.
Please consider the below approach, where I'm assuming your 'opens' refer to tracking whether a user has opened a marketing campaign.
select orders.*,campaigns.campaign_name
from orders_table as orders
left join
(
select distinct user_id,timestamp as open_date,campaign_name from campaign_data
) as campaigns
on orders.user_id = campaigns.user_id and campaigns.open_date < orders.order_date and campaigns.open_date >= date_sub(orders.order_date,interval 7 day)
This example is based on something similar to what I've created for work in the past, which looks at each order date in the order table and then what campaigns were opened before that date.
You may also want to consider using a window statement like row_number or dense_rank with this if you wish to pull only the first or last campaign that was opened to answer questions like "What was the last google ad a user interacted with before placing an order".
Hope this helps,
Tom
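To make the window boundaries concrete, here is a small Python sketch of the rule the question describes (open date is day 0, sales count on days 1 to 7). The function name `in_attribution_window` is hypothetical, and the sketch assumes both values are plain dates with no time component.

```python
from datetime import date

def in_attribution_window(open_date, order_date, window_days=7):
    """True when the order falls on day 1..window_days after the open date."""
    days_after_open = (order_date - open_date).days
    # Day 0 (the open date itself) and anything earlier are excluded.
    return 1 <= days_after_open <= window_days

opened = date(2021, 3, 1)
print(in_attribution_window(opened, date(2021, 3, 1)))  # same day as the open
print(in_attribution_window(opened, date(2021, 3, 8)))  # seventh day after the open
```

This is the same condition as the join above: the open date must be strictly before the order date and no more than 7 days before it.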

Postgres SQL for loop to count monthly existing users

I have a users database for user sign up time:
id, signup_time
100 2020-09-01
001 2018-01-01
....
How can I find the monthly count of existing users across the whole history? Using the last day of each month as the cut-off, an existing user is one who signed up before the first day of that month: observed on 2020-07-31, the user must have signed up before 2020-07-01; observed on 2020-06-30, before 2020-06-01.
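Similar to a for loop in another language:
observation_year_month_list = ['2020-04', '2020-05', '2020-06']
for month in observation_year_month_list:
    monthly_existing_user_count = 0
    for signup_time in signup_times:  # signup_times: ISO date strings such as '2018-01-01'
        if signup_time < month + '-01':  # signed up before the first day of the month
            monthly_existing_user_count += 1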
While PL/pgSQL has loops, it is a procedural language extension. SQL is a declarative language and does not use loops. Instead, you describe the results you want and the database comes up with a query plan to make it happen.
Your case is handled by group by, which aggregates rows into groups; in this case, by month using date_trunc. You then use the aggregate function count to count how many users are in each group. This gives the number of signups per month; a running total of those counts gives the users who already existed at the start of each later month.
select
count(id) as num_users,
date_trunc('month', signup_time) as signup_month
from users
group by date_trunc('month', signup_time)
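For comparison with the loop in the question, here is a rough Python sketch of the "existing users" definition itself: a user counts for a month if they signed up before the first day of that month. The two rows come from the question's sample; the function name `existing_users` is invented for this illustration.

```python
from datetime import date

# Sample rows from the question: id -> signup date.
SIGNUPS = {"100": date(2020, 9, 1), "001": date(2018, 1, 1)}

def existing_users(signups, year, month):
    """Count users who signed up before the first day of the given month."""
    cutoff = date(year, month, 1)
    return sum(1 for signed_up in signups.values() if signed_up < cutoff)

for year, month in [(2020, 4), (2020, 5), (2020, 6)]:
    print(year, month, existing_users(SIGNUPS, year, month))
```

With only these two users, each of the three observation months reports one existing user (id 001); user 100 first counts as existing in October 2020.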

SQL performance issues with window functions on daily basis

Given ~23 million users, what is the most efficient way to compute the cumulative number of logins within the last X months for any given day (even when no login was performed)? A customer's start date is their first ever login; the end date is today.
Desired output
c_id day nb_logins_past_6_months
----------------------------------------------
1 2019-01-01 10
1 2019-01-02 10
1 2019-01-03 9
...
1 today 5
➔ One line per user per day with the number of logins between current day and 179 days in the past
Approach 1
1. Cross join each customer ID with calendar table
2. Left join on login table on day
3. Compute window function (i.e. `sum(nb_logins) over (partition by c_id order by day rows between 179 preceding and current row)`)
+ Easy to understand and maintain
- Really heavy, nearly impossible to run on a daily basis
- Incremental processing does not bring much benefit: still have to go 179 days in the past
Approach 2
1. Cross join each customer ID with calendar table
2. Left join on login table on day between today and 179 days in the past
3. Group by customer ID and day to get nb logins within 179 days
+ Easier to do incremental
- Table at step 2 is exceeding 300 billion rows
What is the common way to deal with this, knowing that this is not the only use case? We have to compute other columns like this (nb logins in the past 12 months etc.).
In standard SQL, you would use:
select l.*,
count(*) over (partition by customerid
order by login_date
range between interval '6 month' preceding and current row
) as num_logins_180day
from logins l;
This assumes that the logins table has a date of the login with no time component.
I see no reason to multiply 23 million users by 180 days to generate a result set in excess of 4 billion rows to answer this question.
For performance, don't do the entire task all at once. Instead, gather subtotals at the end of each month (or day or whatever makes sense for your data). Then SUM up the subtotals to provide the 'report'.
More discussion (with a focus on MySQL): http://mysql.rjweb.org/doc.php/summarytables
(You should tag questions with the specific product; different products have different syntax/capability/performance/etc.)
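The subtotal idea can be sketched for a single customer in Python: count logins per day once, then keep a running total that adds the day entering the window and subtracts the day leaving it, so each day costs constant work instead of a 180-day rescan. This is a sketch under assumptions, not a tuned implementation: the function name `rolling_login_counts` is invented here, and it assumes `first_day` is the customer's first login, as in the question.

```python
from collections import Counter
from datetime import date, timedelta

def rolling_login_counts(login_days, first_day, last_day, window_days=180):
    """Map each calendar day to the logins in the window_days ending on it."""
    per_day = Counter(login_days)        # daily subtotals, computed once
    window = timedelta(days=window_days)
    counts, running, day = {}, 0, first_day
    while day <= last_day:
        running += per_day[day]          # today's logins enter the window
        leaving = day - window
        if leaving >= first_day:         # that day's logins fall out of it
            running -= per_day[leaving]
        counts[day] = running
        day += timedelta(days=1)
    return counts

# Hypothetical customer: two logins on the first day, one more two days later.
logins = [date(2019, 1, 1), date(2019, 1, 1), date(2019, 1, 3)]
counts = rolling_login_counts(logins, date(2019, 1, 1), date(2019, 7, 31))
```

Days without a login still get a row, because the loop walks the calendar rather than the login table.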

How to create an SQL time-in-location table from location/timestamp SQL data stream

I have a question that I'm struggling with in SQL.
I currently have a series of location and timestamp data. It consists of devices in locations at varying timestamps. The locations are repeated, so while they are lat/long coordinates there are several that repeat. The timestamp comes in irregular intervals (sometimes multiple times a second, sometimes nothing for 30 seconds). For example see the below representational data (I am sorting by device name in this example, but could order by anything if it would help):
Device Location Timestamp
X A 1
X A 1.7
X A 2
X A 3
X B 4
X B 5.2
X B 6
X A 7
X A 8
Y A 2
Y A 4
Y C 6
Y C 7
I wish to create a table based on the above data that would show entry/exit or first/last time in each location, with the total duration of that instance. i.e:
Device Location EntryTime ExitTime Duration
X A 1 3 2
X B 4 6 2
X A 7 8 1
Y A 2 4 2
Y C 6 7 1
From here I could process it further to work out a total time in location for a given day, for example.
This is something I could do in Python or some other language with something like a while loop, but I'm really not sure how to accomplish this in SQL.
It's probably worth noting that this is in Azure SQL and I'm creating this table via a Stream Analytics Query to an Event Hubs instance.
The reason I don't want to just simply total all in a location is because it is going to be streaming data and rolling through for a display for say, the last 24 hrs.
Any hints, tips or tricks on how I might accomplish this would be greatly appreciated. I've looked and haven't been able to find quite what I'm looking for - I can see things like datediff for calculating duration between two timestamps, or max and min for finding the first and last dates, but none quite seem to tick the box. The challenge I have here is that the devices move around and come back to the same locations many times within the period. Taking the first occurrence/timestamp of device X at location A and subtracting it from the last, for example, doesn't take into account the other locations it may have traveled to in between those timestamps. Complicating things further, the timestamps are irregular, so I can't simply count the number of occurrences for each location and add them up either.
Maybe I'm missing something simple or obvious, but this has got me stumped! Help would be greatly appreciated :)
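I believe grouping would work, provided each device visits a location only once; a plain GROUP BY merges repeat visits (such as X returning to A) into a single row.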
SELECT Device, Location, [EntryTime] = MIN(Timestamp), [ExitTime] = Max(Timestamp), [Duration] = MAX(Timestamp)- MIN(Timestamp)
FROM <table>
GROUP BY Device, Location
I was working on a similar issue, to some extent, in my own dataset.
SELECT U.*, TO_DATE(U.WEND,'DD-MM-YY HH24:MI') - TO_DATE(U.WSTART,'DD-MM-YY HH24:MI') AS DURATION
FROM
(
SELECT EMPNAME,TLOC, TRUNC(TO_DATE(T.TDATETIME,'DD-MM-YY HH24:MI')) AS WDATE, MIN(T.TDATETIME) AS WSTART, MAX(T.TDATETIME) AS WEND FROM EMPTRCK_RSMSM T
GROUP BY EMPNAME,TLOC,TRUNC(TO_DATE(T.TDATETIME,'DD-MM-YY HH24:MI'))
) U
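Since the question mentions that a loop in Python would be an option, here is a sketch of the logic the desired output needs: each unbroken run of readings for the same device and location becomes one row, so a return visit starts a new row. It uses the sample readings from the question; the function name `stays` is invented, and this is an illustration of the grouping rule rather than a Stream Analytics query.

```python
from itertools import groupby

# Sample readings from the question: (device, location, timestamp).
READINGS = [
    ("X", "A", 1), ("X", "A", 1.7), ("X", "A", 2), ("X", "A", 3),
    ("X", "B", 4), ("X", "B", 5.2), ("X", "B", 6),
    ("X", "A", 7), ("X", "A", 8),
    ("Y", "A", 2), ("Y", "A", 4), ("Y", "C", 6), ("Y", "C", 7),
]

def stays(readings):
    """Return (device, location, entry, exit, duration) per consecutive run."""
    ordered = sorted(readings, key=lambda r: (r[0], r[2]))  # by device, then time
    rows = []
    # groupby only merges adjacent rows, so X's second visit to A stays separate.
    for (device, location), run in groupby(ordered, key=lambda r: (r[0], r[1])):
        times = [t for _, _, t in run]
        rows.append((device, location, times[0], times[-1], times[-1] - times[0]))
    return rows

for row in stays(READINGS):
    print(row)
```

On the sample data this yields the five rows of the desired table, including two separate rows for device X at location A.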

MDX question

I am developing a cube with Analysis Services 2000 for a web application where users can register and unregister to the site. So, the "user" table has these three fields:
activo (1 or 0)
fechaAlta
fechaBaja
When a user activates their account, the application saves "fechaAlta" and sets the "activo" field to 1.
When a user unsubscribes, the application updates the "activo" field to 0 and saves "fechaBaja".
What I need to know is how many users are active at any given time, through a time dimension. Something like:
Year Month Day Active users
2009 January 1 10 (10 activations this day)
2009 January 2 12 (3 activations this day and 1 unregistered)
2009 January 10 17 (5 activation this day)
Even if I query in February 2009, I need to know that on January 1st there were 10 active users (the user who unsubscribed on the 2nd must still be counted).
I developed a cube where the fact table is the user table, and created two dimensions for the two date fields (fechaAlta and fechaBaja). I also created these calculated fields:
active by month:
Calculation subcube: {[Measures].[Altas]}, [Fecha Alta].[Mes].MEMBERS
Calculation formula: sum({Descendants([Fecha Alta].currentmember,[Fecha Alta].[Día])},[Measures].[Activo])
active to day:
Calculation subcube: {[Measures].[Inscritos]},[Fecha Alta].MEMBERS
Calculation formula: sum({Periodstodate([Fecha Alta].[(Todos)])},[Measures].[Activo])
I don't know how to discount the unregistered users only from the day indicated in fechaBaja onward.
Thanks.
This is a classic slowly changing dimension issue. What you are describing is a type 2 slowly changing dimension.
You need to make sure that your user dimension has a surrogate key. Then you create a new record in your user table each time the user changes status, and you use effective dates to control which surrogate key to insert into your fact table. This will let you report on the user's effective status at any point in time.
I think you need a "User Status" dimension, then you can show this against Time, with the measure being count of users.
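Whichever cube design is used, the rule being modelled is an effective-date check: a user is active on a day if fechaAlta is on or before that day and fechaBaja is either empty or later than that day. The Python sketch below only illustrates that rule, not MDX; the rows are hypothetical, built to reproduce the numbers in the question, and the function name `active_on` is invented.

```python
from datetime import date

def active_on(users, day):
    """Count users activated on or before `day` and not yet unsubscribed on `day`."""
    return sum(
        1 for alta, baja in users
        if alta <= day and (baja is None or baja > day)
    )

# Hypothetical rows: (fechaAlta, fechaBaja or None if still active).
USERS = (
    [(date(2009, 1, 1), None)] * 9
    + [(date(2009, 1, 1), date(2009, 1, 2))]  # unsubscribes on 2 January
    + [(date(2009, 1, 2), None)] * 3
    + [(date(2009, 1, 10), None)] * 5
)

for day in (date(2009, 1, 1), date(2009, 1, 2), date(2009, 1, 10)):
    print(day, active_on(USERS, day))
```

These rows give 10, 12 and 17 active users for 1, 2 and 10 January, and the answer for 1 January stays 10 no matter when the query is run.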