Is there a BigQuery solution to find average time between events? - google-bigquery

I have a request to deliver aggregate GA data that I can't quite wrap my head around. There are three events that do not immediately follow one another:
Event A -> Event B -> Event C
The questions are:
What is the total volume of sessions where Event B is eventually triggered after Event A in the same session? What is the success rate i.e:
(Sessions where Event B follows Event A) / Sessions where Event A is triggered
What is the average time it takes between triggering Event A and event B, in a given time period? I.e.:
(Timestamp B - Timestamp A) / # of Sessions = Y
What is the volume and average number of times (per day) that Event C is triggered within 5 minutes of Y? I.e.
Count of Sessions where (Timestamp C <= Y + 5) / # of Days
Any help here would be great, thanks so much for your time
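The first two questions can be sketched outside BigQuery to pin down the logic before writing the query. This is only an illustration under stated assumptions: the rows are hypothetical (session_id, event_name, timestamp in seconds), "B follows A" is taken to mean the first B after the first A in the same session, and the average is taken over sessions where that happens. The function name a_to_b_stats is made up for this sketch.

```python
from collections import defaultdict

# Hypothetical rows: (session_id, event_name, timestamp_in_seconds).
events = [
    ("s1", "A", 0), ("s1", "B", 60), ("s1", "C", 200),
    ("s2", "A", 10), ("s2", "C", 50),
    ("s3", "B", 5), ("s3", "A", 20), ("s3", "B", 140),
    ("s4", "C", 7),
]

def a_to_b_stats(rows):
    """Return (sessions_with_a, sessions_with_b_after_a, success_rate, avg_seconds)."""
    by_session = defaultdict(list)
    for session_id, name, ts in rows:
        by_session[session_id].append((ts, name))

    sessions_with_a = 0
    gaps = []  # seconds from first A to the first later B, one per converting session
    for hits in by_session.values():
        hits.sort()  # order each session's events by timestamp
        first_a = next((ts for ts, name in hits if name == "A"), None)
        if first_a is None:
            continue  # session never triggered Event A
        sessions_with_a += 1
        first_b_after = next(
            (ts for ts, name in hits if name == "B" and ts > first_a), None
        )
        if first_b_after is not None:
            gaps.append(first_b_after - first_a)

    rate = len(gaps) / sessions_with_a if sessions_with_a else 0.0
    avg = sum(gaps) / len(gaps) if gaps else None
    return sessions_with_a, len(gaps), rate, avg

stats = a_to_b_stats(events)
print(stats)
```

Note that a B fired before A (as in session s3) is ignored, which is one possible reading of "eventually triggered after Event A".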

Related

Sliding window of count distinct users for 12 months

I have a fairly basic dataset, where I have a table containing a timestamp of every time a user interacts with an app. An active user is classified as someone who has in the previous 12 months interacted with the app at least once.
I need to produce a table which tells me, day by day (going back n days), how many "active" users there were in the prior 12-month period. I need to run the query in Amazon Athena.
A possible complexity is the fact that one user could interact with the app every day. I was wondering what the best window function could be to capture this.
The data is in the format:
A Opened App 10/04/2020
A Opened App 10/02/2020
A Opened App 05/01/2020
B Opened App 12/03/2020
B Opened App 02/01/2019
B Opened App 20/07/2018
C Opened App 19/04/2019
I need a resulting table of
20/04/2020 2 (A and B)
19/04/2020 2 (A and B)
18/04/2020 3 (all three)
...
04/01/2020 1 (Only C)
...
One method is to use count(distinct) with a range window function:
select distinct date,
count(distinct user) over (order by date range between interval '1 year' preceding and current row) as num_active_users
from t;
Not all databases support this syntax.
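Where the window syntax is not available, the same count can be reasoned about as "for each report day, the distinct users seen in the preceding year". The sketch below is a plain Python illustration of that definition, not an Athena query. It assumes the sample dates are dd/mm/yyyy and that the window excludes the date exactly one year earlier, which is what the asker's expected table implies; the helper names are made up for this example.

```python
from datetime import date, timedelta

# Sample data from the question, read as dd/mm/yyyy.
events = [
    ("A", date(2020, 4, 10)), ("A", date(2020, 2, 10)), ("A", date(2020, 1, 5)),
    ("B", date(2020, 3, 12)), ("B", date(2019, 1, 2)), ("B", date(2018, 7, 20)),
    ("C", date(2019, 4, 19)),
]

def one_year_before(day):
    """Same calendar day one year earlier (29 Feb falls back to 28 Feb)."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)

def active_users_by_day(rows, last_day, n_days):
    """For each of the n_days ending at last_day, count distinct users
    with at least one event in the window (day - 1 year, day]."""
    result = []
    for offset in range(n_days):
        day = last_day - timedelta(days=offset)
        start = one_year_before(day)
        active = {user for user, seen in rows if start < seen <= day}
        result.append((day, len(active)))
    return result

counts = dict(active_users_by_day(events, date(2020, 4, 20), 120))
print(counts[date(2020, 4, 18)], counts[date(2020, 4, 19)])
```

Unlike a query over the distinct dates in the table, this produces a row for every calendar day, including days on which nobody opened the app. A user who opens the app every day is still counted once per report day because the users are collected in a set.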

How to create an SQL time-in-location table from location/timestamp SQL data stream

I have a question that I'm struggling with in SQL.
I currently have a series of location and timestamp data. It consists of devices in locations at varying timestamps. The locations are repeated, so while they are lat/long coordinates there are several that repeat. The timestamp comes in irregular intervals (sometimes multiple times a second, sometimes nothing for 30 seconds). For example see the below representational data (I am sorting by device name in this example, but could order by anything if it would help):
Device Location Timestamp
X A 1
X A 1.7
X A 2
X A 3
X B 4
X B 5.2
X B 6
X A 7
X A 8
Y A 2
Y A 4
Y C 6
Y C 7
I wish to create a table based on the above data that would show entry/exit or first/last time in each location, with the total duration of that instance. i.e:
Device Location EntryTime ExitTime Duration
X A 1 3 2
X B 4 6 2
X A 7 8 1
Y A 2 4 2
Y C 6 7 1
From here I could process it further to work out a total time in location for a given day, for example.
This is something I could do in Python or some other language with something like a while loop, but I'm really not sure how to accomplish this in SQL.
It's probably worth noting that this is in Azure SQL and I'm creating this table via a Stream Analytics Query to an Event Hubs instance.
The reason I don't want to just simply total all in a location is because it is going to be streaming data and rolling through for a display for say, the last 24 hrs.
Any hints, tips or tricks on how I might accomplish this would be greatly appreciated. I've looked and haven't been able to quite find what I'm looking for. I can see things like datediff for calculating the duration between two timestamps, or max and min for finding the first and last dates, but none quite seem to tick the box. The challenge I have here is that the devices move around and come back to the same locations many times within the period. Taking the first occurrence/timestamp of device X at location A and subtracting it from the last, for example, doesn't take into account the other locations it may have traveled to in between those timestamps. Complicating things further, the timestamps are irregular, so I can't simply count the number of occurrences for each location and add them up either.
Maybe I'm missing something simple or obvious, but this has got me stumped! Help would be greatly appreciated :)
I believe grouping would work:
SELECT Device, Location, [EntryTime] = MIN(Timestamp), [ExitTime] = Max(Timestamp), [Duration] = MAX(Timestamp)- MIN(Timestamp)
FROM <table>
GROUP BY Device, Location
I was working on a similar issue, to some extent, in my own dataset:
SELECT U.*, TO_DATE(U.WEND,'DD-MM-YY HH24:MI') - TO_DATE(U.WSTART,'DD-MM-YY HH24:MI') AS DURATION
FROM
(
SELECT EMPNAME,TLOC, TRUNC(TO_DATE(T.TDATETIME,'DD-MM-YY HH24:MI')) AS WDATE, MIN(T.TDATETIME) AS WSTART, MAX(T.TDATETIME) AS WEND FROM EMPTRCK_RSMSM T
GROUP BY EMPNAME,TLOC,TRUNC(TO_DATE(T.TDATETIME,'DD-MM-YY HH24:MI'))
) U
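Grouping by Device and Location alone merges separate visits: on the sample data it would report device X at location A from 1 to 8, rather than the two visits the question asks for. What is needed is a new group each time the location changes for a device, a pattern often called "gaps and islands". The following is a minimal Python sketch of that grouping on the question's sample data, not a Stream Analytics query; the function name visits is made up for this example.

```python
from itertools import groupby

# Sample readings from the question: (device, location, timestamp).
readings = [
    ("X", "A", 1), ("X", "A", 1.7), ("X", "A", 2), ("X", "A", 3),
    ("X", "B", 4), ("X", "B", 5.2), ("X", "B", 6),
    ("X", "A", 7), ("X", "A", 8),
    ("Y", "A", 2), ("Y", "A", 4), ("Y", "C", 6), ("Y", "C", 7),
]

def visits(rows):
    """Collapse consecutive readings of a device at one location into
    (device, location, entry_time, exit_time, duration) rows."""
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))  # by device, then time
    out = []
    # groupby only merges *adjacent* rows with the same key, so a return
    # to an earlier location starts a new visit.
    for (device, location), run in groupby(ordered, key=lambda r: (r[0], r[1])):
        times = [ts for _, _, ts in run]
        out.append((device, location, times[0], times[-1], times[-1] - times[0]))
    return out

result = visits(readings)
for row in result:
    print(row)
```

In SQL the same idea is usually expressed with window functions, for example by comparing each row's location with the previous row's location per device and numbering the changes, where the engine supports them.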

SQL: how do I maintain a running total that resets after reaching a certain threshold?

If I have a simple table that has a person ID and a date, and many rows per person with different dates, how can I track how many days passed from one date to the next, and reset the count once I reach a certain threshold? I have at least got the running total part working (using a sum over a window function), but if I want to note when it reaches 10 days, I cannot figure out how to reset the counter. I am working in Redshift. Any help would be greatly appreciated. Thank you very much.
For some specific context on the use case, I have a table with medical events and if a person has another event within 10 days of the initial event, I want to not count subsequent events in the 10 day window as a separate event, but once I have found an event that is greater than 10 days after the initial event, I want to count that one as a second event, and so on.
Person A, day 1, mark as new event
Person A, day 5, do not mark as new event
Person A, day 12, mark as new event
Person A, day 15, do not mark as new event
Person A, day 23, mark as new event
Person A, day 100, mark as new event
Thank you. Perhaps this will help? The input table would be:
person_id day
A 1
A 5
A 12
A 15
A 23
A 100
And the desired result would be:
person_id day new_event
A 1 1
A 5 0
A 12 1
A 15 0
A 23 1
A 100 1
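The reset makes each row depend on which earlier row was last marked as new, so the rule is easiest to state as a sequential pass. The sketch below is a Python illustration of the rule on the sample table, not Redshift SQL. It assumes "within 10 days" means a difference of at most 10 days from the last event marked as new; the function name mark_new_events is made up for this example.

```python
# Sample input from the question: (person_id, day).
rows = [("A", 1), ("A", 5), ("A", 12), ("A", 15), ("A", 23), ("A", 100)]

def mark_new_events(rows, window_days=10):
    """Flag a row as a new event when it falls more than window_days after
    the most recent row flagged as new for the same person."""
    out = []
    anchor = {}  # person_id -> day of the most recent event counted as new
    for person_id, day in sorted(rows):
        start = anchor.get(person_id)
        is_new = start is None or day - start > window_days
        if is_new:
            anchor[person_id] = day  # the window restarts from this event
        out.append((person_id, day, int(is_new)))
    return out

flags = mark_new_events(rows)
for row in flags:
    print(row)
```

A plain running sum cannot express this, because the distance is measured from the last event marked as new rather than from the previous row.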

How to query based on moduloed date

I'm writing a task that will run daily, but should only pick up users whose created_at is a whole number of weeks ago. I want to do something along the lines of:
User.where("created_at.days_ago % 7 = 0")
How might I do this?
EDIT
For reference, the task is for verifying a user's email. They can continue using the product without verifying for some amount of time, but I want to email them periodically (once per week) to verify. I'm using the Heroku Scheduler to do this, and the maximum time between runs it allows is 1 day, which is why I need only the people who are at exactly 1-week increments from when they were created.
You could look at generating a list of the dates themselves, using something along the lines of:
((User.minimum(:created_at).to_date)..(Date.today)).to_a.select{|d| (Date.today - d) % 7 == 0}
Since created_at is a timestamp you'd probably need to apply a SQL function to it, to truncate it to a date.
days = ((Date.today-1.years)..(Date.today)).to_a.select{|d| (Date.today - d) % 7 == 0}
User.where("created_at::date in (?)", days)
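The date arithmetic behind that list can be checked in isolation. The sketch below is a Python illustration of the same idea, not Rails code: it builds the dates that are an exact number of weeks before a given day. It assumes users created on the run day itself should be skipped, so the list starts at 7 days ago (the Ruby range above includes today); the function name and the lookback_days parameter are made up for this example.

```python
from datetime import date, timedelta

def weekly_anniversaries(today, lookback_days=365):
    """Dates that are 7, 14, 21, ... days before `today`,
    going back at most lookback_days."""
    return [today - timedelta(days=n) for n in range(7, lookback_days + 1, 7)]

today = date(2024, 3, 15)
dates = weekly_anniversaries(today, lookback_days=28)
print(dates)
```

Each resulting date can then be compared against created_at truncated to a date, as in the query above.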

SQL: Joining sequential events based on time stamps

I have a table containing information about buses driving around a city. Each record represents an event where a bus arrives at a bus stop, with the bus id, stop id, arrival time (military time in seconds), and departure time (military time in seconds). If I can join each event to the subsequent event, then I can compute the time each bus spends driving between stops by subtracting the departure time from stop 1 from the arrival time at stop 2.
But how can I perform this join? How can I easily find the soonest arrival time after a given departure time? Edit: I am using SQL Server 2012.
Sample Data
Expected Result
Use the lead function, which gets the values from the subsequent row based on a specified ordering:
select t.*,
lead(arrival_time) over(partition by busname order by arrival_time) as next_stop_arrival,
lead(departure_time) over(partition by busname order by arrival_time) as next_stop_departure
from tablename t
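To see what pairing each row with the next one gives, here is a small Python sketch of the same idea. The stop records are hypothetical, since the question's sample data is not shown: each is (bus, stop, arrival_seconds, departure_seconds). The function name travel_times is made up for this example.

```python
from collections import defaultdict

# Hypothetical records: (bus, stop, arrival_seconds, departure_seconds).
stops = [
    ("bus1", "s1", 100, 130), ("bus1", "s2", 400, 420), ("bus1", "s3", 700, 710),
    ("bus2", "s1", 150, 160), ("bus2", "s3", 500, 520),
]

def travel_times(rows):
    """Pair each stop with the bus's next stop and return
    (bus, from_stop, to_stop, seconds_driving)."""
    by_bus = defaultdict(list)
    for row in rows:
        by_bus[row[0]].append(row)

    out = []
    for bus, visits in by_bus.items():
        visits.sort(key=lambda r: r[2])  # order by arrival time, like ORDER BY
        for current, nxt in zip(visits, visits[1:]):
            # driving time = next arrival minus current departure
            out.append((bus, current[1], nxt[1], nxt[2] - current[3]))
    return out

legs = travel_times(stops)
for leg in legs:
    print(leg)
```

One difference from the query: lead returns a row with NULL for the last stop of each bus, whereas this sketch simply emits no row for it.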