SQL query to show user session length - sql

I have a table that looks like this:
user_id page happened_at
2 'page3' 2017-10-05 11:31
1 'page2' 2016-02-01 00:02
2 'page1' 2017-10-05 15:24
3 'page3' 2017-03-31 19:35
4 'page1' 2017-07-09 00:24
2 'page3' 2017-10-05 15:28
1 'page3' 2018-02-01 13:02
2 'page2' 2017-10-05 16:14
2 'page3' 2017-10-05 16:34
etc
I have a query that identifies user sessions. A session is a sequence of page1, page2 and page3 opened in that particular order, with each step less than one hour after the previous one (page2 within an hour of page1, page3 within an hour of page2). Any pages opened in between can be ignored. Example of a session from the table above:
user_id page happened_at
2 'page1' 2017-10-05 15:24
2 'page2' 2017-10-05 16:14
2 'page3' 2017-10-05 16:34
My query so far looks like this and returns the user_id of users who had sessions:
select user_id
from (select user_id, page, happened_at,
             lag(page) over (partition by user_id order by happened_at) as prev_page,
             lead(page) over (partition by user_id order by happened_at) as next_page,
             datediff(minute, lag(happened_at) over (partition by user_id order by happened_at), happened_at) as time_diff_with_prev_action,
             datediff(minute, happened_at, lead(happened_at) over (partition by user_id order by happened_at)) as time_diff_with_next_action
      from tbl
     ) t
where page = 'page2' and prev_page = 'page1' and next_page = 'page3'
  and time_diff_with_prev_action <= 60 and time_diff_with_next_action <= 60
What I need is to edit the query so that it adds two columns to the output: session start time and session end time, where the end time is the last action + 1 hour. Please advise how to do this. Temporary tables are forbidden, so it has to be a single query. Example output:
user_id session_start session_end
2 2017-10-05 15:24 2017-10-05 17:34
Thanks for your time!
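One possible way to get the two extra columns, sketched here under several assumptions: the SQL is written in SQLite dialect and run through Python's sqlite3 module so that it can be executed end to end, `strftime` stands in for `datediff`, and a self-join replaces `lag`/`lead`. The self-join is a swapped-in technique, chosen because adjacent-row `lag`/`lead` would not match the sample session: user 2 has a page3 row at 15:28 between page1 and page2. The name `SESSION_QUERY` is illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (user_id integer, page text, happened_at text)")
conn.executemany(
    "insert into tbl values (?, ?, ?)",
    [
        (2, "page3", "2017-10-05 11:31"),
        (1, "page2", "2016-02-01 00:02"),
        (2, "page1", "2017-10-05 15:24"),
        (3, "page3", "2017-03-31 19:35"),
        (4, "page1", "2017-07-09 00:24"),
        (2, "page3", "2017-10-05 15:28"),
        (1, "page3", "2018-02-01 13:02"),
        (2, "page2", "2017-10-05 16:14"),
        (2, "page3", "2017-10-05 16:34"),
    ],
)

# For every page2 row, look for a page1 at most 3600 seconds before it and a
# page3 at most 3600 seconds after it; rows in between are simply not joined.
# max()/min() pick the closest page1 and page3 around each page2.
SESSION_QUERY = """
select p2.user_id,
       max(p1.happened_at) as session_start,
       strftime('%Y-%m-%d %H:%M', min(p3.happened_at), '+1 hour') as session_end
from tbl p2
join tbl p1
  on p1.user_id = p2.user_id
 and p1.page = 'page1'
 and p1.happened_at <= p2.happened_at
 and cast(strftime('%s', p2.happened_at) as integer)
     - cast(strftime('%s', p1.happened_at) as integer) <= 3600
join tbl p3
  on p3.user_id = p2.user_id
 and p3.page = 'page3'
 and p3.happened_at >= p2.happened_at
 and cast(strftime('%s', p3.happened_at) as integer)
     - cast(strftime('%s', p2.happened_at) as integer) <= 3600
where p2.page = 'page2'
group by p2.user_id, p2.happened_at
order by p2.user_id, session_start
"""

sessions = conn.execute(SESSION_QUERY).fetchall()
print(sessions)
```

On the sample rows this yields one session for user 2, from 15:24 to 17:34. In a dialect with `datediff` and `dateadd`, the two `strftime` differences and the `'+1 hour'` modifier would presumably be replaced by those functions.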

Related

How do you get the last entry for each month in SQL?

I am looking to filter very large tables to the latest entry per user per month. I'm not sure I have found the best way to do this. I know I "should" trust the SQL engine (Snowflake), but part of me does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in dbt views, which means it will be run all the time.
To illustrate, my data is of this form:
mytable
userId loginDate year month value
1 2021-01-04 2021 1 41.1
1 2021-01-06 2021 1 411.1
1 2021-01-25 2021 1 251.1
2 2021-01-05 2021 1 4369
2 2021-02-06 2021 2 32
2 2021-02-14 2021 2 731
3 2021-01-20 2021 1 258
3 2021-02-19 2021 2 4251
3 2021-03-15 2021 3 171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a group by and a join as follows:
WITH latest_entry_by_month AS (
    SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
    FROM mytable
    GROUP BY "userId", "year", "month"
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId loginDate year month value
1 2021-01-25 2021 1 251.1
2 2021-01-05 2021 1 4369
2 2021-02-14 2021 2 731
3 2021-01-20 2021 1 258
3 2021-02-19 2021 2 4251
3 2021-03-15 2021 3 171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Using QUALIFY with a window function (ROW_NUMBER):
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER (PARTITION BY userId, year, month
                           ORDER BY loginDate DESC) = 1
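To check the row-numbering logic outside Snowflake, here is a minimal sketch that runs against the sample rows with Python's sqlite3 module. SQLite has no QUALIFY, so the same filter is expressed as a subquery plus `where rn = 1`; this is an equivalent rewrite for illustration, not the Snowflake syntax. Window functions require SQLite 3.25 or newer, and the alias `ranked` and column `rn` are illustrative names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable "
    "(userId integer, loginDate text, year integer, month integer, value real)"
)
conn.executemany(
    "insert into mytable values (?, ?, ?, ?, ?)",
    [
        (1, "2021-01-04", 2021, 1, 41.1),
        (1, "2021-01-06", 2021, 1, 411.1),
        (1, "2021-01-25", 2021, 1, 251.1),
        (2, "2021-01-05", 2021, 1, 4369),
        (2, "2021-02-06", 2021, 2, 32),
        (2, "2021-02-14", 2021, 2, 731),
        (3, "2021-01-20", 2021, 1, 258),
        (3, "2021-02-19", 2021, 2, 4251),
        (3, "2021-03-15", 2021, 3, 171),
    ],
)

# Number the rows of each (userId, year, month) group from newest to oldest,
# then keep only the newest one. QUALIFY does this in a single step.
LATEST_QUERY = """
select userId, loginDate, year, month, value
from (select t.*,
             row_number() over (partition by userId, year, month
                                order by loginDate desc) as rn
      from mytable t
     ) ranked
where rn = 1
order by userId, loginDate
"""

latest = conn.execute(LATEST_QUERY).fetchall()
for row in latest:
    print(row)
```

The six rows returned match the desired output listed in the question, which suggests the window-function form is a drop-in replacement for the group-by-and-join form on this data.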

Creating a new calculated column in SQL

Is there a way to count how many dates have a given number of UDs? In my data, June 24 appears twice, so that date has 2 UDs, and each of the remaining dates has a single UD.
I am showing the expected output here:
Primary key  UD   Date
-------------------------------------------
1            123  2015-06-24 00:00:00.000
6            456  2015-06-24 00:00:00.000
2            123  2015-06-25 00:00:00.000
3            658  2015-06-26 00:00:00.000
4            598  2015-06-27 00:00:00.000
5            156  2015-06-28 00:00:00.000
No of times  Number of days
-----------------------------
4            1
2            2
The logic is that there are 4 users who used the application on 1 day and 2 users who used the application on 2 days.
You can use two levels of aggregation:
select cnt, count(*)
from (select date, count(*) as cnt
      from t
      group by date
     ) d
group by cnt
order by cnt desc;
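As a quick check of what the two-level aggregation returns, the query can be run on the six sample rows. This is a minimal sketch using Python's sqlite3 module; the column names `pk` and `ud` are assumptions standing in for "Primary key" and "UD", and the second count is given the illustrative alias `num_dates`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (pk integer, ud integer, date text)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [
        (1, 123, "2015-06-24"),
        (6, 456, "2015-06-24"),
        (2, 123, "2015-06-25"),
        (3, 658, "2015-06-26"),
        (4, 598, "2015-06-27"),
        (5, 156, "2015-06-28"),
    ],
)

# Inner level: rows per date. Outer level: how many dates share each row count.
TWO_LEVEL_QUERY = """
select cnt, count(*) as num_dates
from (select date, count(*) as cnt
      from t
      group by date
     ) d
group by cnt
order by cnt desc
"""

histogram = conn.execute(TWO_LEVEL_QUERY).fetchall()
print(histogram)
```

On this data the result is one date with 2 rows and four dates with 1 row. If the grouping is meant to be per user rather than per date, replacing `date` with `ud` in the inner query would be the corresponding change.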

History record which came earlier than recent

I have certain ID_num values whose transactions contain history records that are out of order relative to more recent ones.
Below is one example
ID_num Create Datetime Start Datetime Rank_num
1 1/1/19 5:28 NULL 1
1 12/1/18 9:25 1/1/19 9:25 2
1 12/1/18 7:39 12/1/18 9:25 3
1 11/1/18 7:40 12/1/18 13:37 4
1 10/1/18 7:38 11/1/18 13:37 5
1 9/1/18 13:37 9/1/18 13:37 6
1 9/1/18 13:37 10/1/18 13:37 7
Here, rank 4 has a Start Datetime later than that of rank 3.
These incorrect records were created by a system error, and I would like to find out how many such rows exist.
I would like to list all ID_num values that show the same behaviour.
Any suggestions would help.
You can use lag(). For instance:
select t.*
from (select t.*,
             lag(start_datetime) over (partition by id_num order by rank_num) as prev_start_datetime
      from t
     ) t
-- a row is out of order when it starts later than the row ranked just before it
where start_datetime > prev_start_datetime
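A runnable sketch of the lag() approach on the sample rows, using Python's sqlite3 module (window functions need SQLite 3.25 or newer). Two assumptions are made here: the dates in the question are read as month/day/year and rewritten in ISO format so that text comparison orders them correctly, and an anomaly is taken to be a row whose start is later than that of the preceding rank, as described for rank 4 versus rank 3.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t "
    "(id_num integer, create_datetime text, start_datetime text, rank_num integer)"
)
conn.executemany(
    "insert into t values (?, ?, ?, ?)",
    [
        (1, "2019-01-01 05:28", None, 1),
        (1, "2018-12-01 09:25", "2019-01-01 09:25", 2),
        (1, "2018-12-01 07:39", "2018-12-01 09:25", 3),
        (1, "2018-11-01 07:40", "2018-12-01 13:37", 4),
        (1, "2018-10-01 07:38", "2018-11-01 13:37", 5),
        (1, "2018-09-01 13:37", "2018-09-01 13:37", 6),
        (1, "2018-09-01 13:37", "2018-10-01 13:37", 7),
    ],
)

# Compare each row's start with the start of the previous rank for the same id.
# Rows with a NULL start (or a NULL previous start) never satisfy the comparison.
ANOMALY_QUERY = """
select id_num, rank_num
from (select t.*,
             lag(start_datetime) over (partition by id_num
                                       order by rank_num) as prev_start_datetime
      from t
     ) flagged
where start_datetime > prev_start_datetime
order by id_num, rank_num
"""

anomalies = conn.execute(ANOMALY_QUERY).fetchall()
rows_per_id = Counter(id_num for id_num, _ in anomalies)
print(anomalies)
print(dict(rows_per_id))
```

On the sample this flags ranks 4 and 7 for ID_num 1. The keys of `rows_per_id` give the list of affected ID_num values and its values give the number of such rows per ID.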

SQL query with specific order of user actions

I have a table that looks like this:
user_id user_action timestamp
1 action #2 2016-02-01 00:02
2 action #1 2017-10-05 15:24
3 action #3 2017-03-31 19:35
4 action #1 2017-07-09 00:24
1 action #1 2018-11-05 18:28
1 action #3 2018-02-01 13:02
2 action #2 2017-10-05 16:14
2 action #3 2017-10-05 16:34
etc
My task is to write a query that shows user sessions in which a user performs actions #1, #2 and #3 in that specific order, with less than an hour between consecutive actions. For example, user #2 has a session:
2 action #1 2017-10-05 15:24
2 action #2 2017-10-05 16:14
2 action #3 2017-10-05 16:34
Sorry for lack of my own attempt, as I am really stuck and don't know, where to start.
Thanks in advance!
This can be done with the window functions lead and lag, which get the values from the next and previous rows respectively.
select distinct user_id
from (select user_id, user_action, timestamp,
             lag(user_action) over (partition by user_id order by timestamp) as prev_action,
             lead(user_action) over (partition by user_id order by timestamp) as next_action,
             datediff(minute, lag(timestamp) over (partition by user_id order by timestamp), timestamp) as time_diff_with_prev_action,
             datediff(minute, timestamp, lead(timestamp) over (partition by user_id order by timestamp)) as time_diff_with_next_action
      from tbl
     ) t
where user_action = 'action #2' and prev_action = 'action #1' and next_action = 'action #3'
  and time_diff_with_prev_action <= 60 and time_diff_with_next_action <= 60
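The lead/lag query can be tried on the sample rows with a minimal sketch like the one below, which runs it through Python's sqlite3 module (SQLite 3.25 or newer for window functions). SQLite has no `datediff`, so as an assumption the time gaps are computed in seconds with `strftime('%s', ...)` and compared against 3600 instead of 60 minutes. Note that this approach only matches when the three actions are on directly adjacent rows for a user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (user_id integer, user_action text, timestamp text)")
conn.executemany(
    "insert into tbl values (?, ?, ?)",
    [
        (1, "action #2", "2016-02-01 00:02"),
        (2, "action #1", "2017-10-05 15:24"),
        (3, "action #3", "2017-03-31 19:35"),
        (4, "action #1", "2017-07-09 00:24"),
        (1, "action #1", "2018-11-05 18:28"),
        (1, "action #3", "2018-02-01 13:02"),
        (2, "action #2", "2017-10-05 16:14"),
        (2, "action #3", "2017-10-05 16:34"),
    ],
)

# Each row sees its neighbours within the same user, ordered by time.
# A session is detected on the middle row (action #2).
LEAD_LAG_QUERY = """
select distinct user_id
from (select user_id, user_action,
             lag(user_action) over (partition by user_id order by timestamp) as prev_action,
             lead(user_action) over (partition by user_id order by timestamp) as next_action,
             cast(strftime('%s', timestamp) as integer)
               - cast(strftime('%s', lag(timestamp) over (partition by user_id order by timestamp)) as integer)
               as secs_since_prev,
             cast(strftime('%s', lead(timestamp) over (partition by user_id order by timestamp)) as integer)
               - cast(strftime('%s', timestamp) as integer)
               as secs_until_next
      from tbl
     ) t
where user_action = 'action #2' and prev_action = 'action #1' and next_action = 'action #3'
  and secs_since_prev <= 3600 and secs_until_next <= 3600
order by user_id
"""

session_users = [row[0] for row in conn.execute(LEAD_LAG_QUERY)]
print(session_users)
```

Only user 2 qualifies on this data: the gaps are 50 and 20 minutes, and no other user has the three actions in order.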

SQL - Creating a timeline for each ID (Vertica)

I am dealing with the following problem in SQL (using Vertica):
In short -- Create a timeline for each ID (in a table where I have multiple lines, orders in my example, per ID)
What I would like to achieve -- I have a table of historical order dates, and I would like to compute the rates of new customers (first order ever in the past month), active customers (>1 order in the last 1-3 months), passive customers (no order for the last 3-6 months) and inactive customers (no order for >6 months).
Which steps I have taken so far -- I was able to construct a table similar to the example presented below:
CustomerID Current order date Time between current/previous order First order date (all-time)
001 2015-04-30 12:06:58 (null) 2015-04-30 12:06:58
001 2015-09-24 17:30:59 147 05:24:01 2015-04-30 12:06:58
001 2016-02-11 13:21:10 139 19:50:11 2015-04-30 12:06:58
002 2015-10-21 10:38:29 (null) 2015-10-21 10:38:29
003 2015-05-22 12:13:01 (null) 2015-05-22 12:13:01
003 2015-07-09 01:04:51 47 12:51:50 2015-05-22 12:13:01
003 2015-10-23 00:23:48 105 23:18:57 2015-05-22 12:13:01
A little bit of intuition: customer 001 placed three orders from which the second one was 147 days after its first order. Customer 002 has only placed one order in total.
What I think that the next steps should be -- I would like to know for each date (also dates on which a certain user did not place an order), for each CustomerID, how long it has been since his/her last order. This would imply that I would create some sort of timeline for each CustomerID. In the example presented above I would get 287 (days between 1st of May 2015 and 11th of February 2016, the timespan of this table) lines for each CustomerID. I have difficulties solving this previous step. When I have performed this step I want to create a field which shows at each date the last order date, the period between the last order date and the current date, and what state someone is in at the current date. For the example presented earlier, this would look something like this:
CustomerID Last order date Current date Time between current date /last order State
001 2015-04-30 12:06:58 2015-05-01 00:00:00 0 00:00:00 New
...
001 2015-04-30 12:06:58 2015-06-30 00:00:00 60 11:53:02 Active
...
001 2015-09-24 17:30:59 2016-02-01 00:00:00 129 11:53:02 Passive
...
...
002 2015-10-21 17:30:59 2015-10-22 00:00:00 0 06:29:01 New
...
002 2015-10-21 17:30:59 2015-11-30 00:00:00 39 06:29:01 Active
...
...
003 2015-05-22 12:13:01 2015-06-23 00:00:00 31 11:46:59 Active
...
003 2015-07-09 01:04:51 2015-10-22 00:00:00 105 11:46:59 Inactive
...
The dots stand for all the in-between dates, which I have left out of the table for the sake of space.
When I know for each date what the state is of each customer (active/passive/inactive) my plan is to sum the states and group by date which should give me the sum of new, active, passive and inactive customers. From here on I can easily compute the rates at each date.
Anybody that knows how I can possibly achieve this task?
Note -- If anyone has other ideas how to achieve the goal presented above (using some other approach compared to the approach I had in mind) please let me know!
EDIT
Suppose you start from a table like this:
SQL> select * from ord order by custid, ord_date ;
custid | ord_date
--------+---------------------
1 | 2015-04-30 12:06:58
1 | 2015-09-24 17:30:59
1 | 2016-02-11 13:21:10
2 | 2015-10-21 10:38:29
3 | 2015-05-22 12:13:01
3 | 2015-07-09 01:04:51
3 | 2015-10-23 00:23:48
(7 rows)
You can use Vertica's Timeseries Analytic Functions TS_FIRST_VALUE(), TS_LAST_VALUE() to fill gaps and interpolate last_order date to the current date:
Then you just have to join this with a Vertica TIMESERIES generated from the same table, at a one-day interval, starting from the day each customer placed his/her first order up to now (current_date):
select
    custid,
    status_dt,
    last_order_dt,
    case
        when status_dt::date - last_order_dt::date < 30 then
            case when nord = 1 then 'New' else 'Active' end
        when status_dt::date - last_order_dt::date < 90 then 'Active'
        when status_dt::date - last_order_dt::date < 180 then 'Passive'
        else 'Inactive'
    end as status
from (
    select
        custid,
        last_order_dt,
        status_dt,
        conditional_true_event (first_order_dt is null or
                                last_order_dt > lag(last_order_dt))
            over(partition by custid order by status_dt) as nord
    from (
        select
            custid,
            ts_first_value(ord_date) as first_order_dt ,
            ts_last_value(ord_date) as last_order_dt ,
            dt::date as status_dt
        from
            ( select custid, ord_date from ord
              union all
              select distinct(custid) as custid, current_date + 1 as ord_date from ord
            ) z timeseries dt as '1 day' over (partition by custid order by ord_date)
    ) x
) y
where status_dt <= current_date
order by 1, 2
;
And you will get something like this:
custid | status_dt | last_order_dt | status
--------+------------+---------------------+---------
1 | 2015-04-30 | 2015-04-30 12:06:58 | New
1 | 2015-05-01 | 2015-04-30 12:06:58 | New
1 | 2015-05-02 | 2015-04-30 12:06:58 | New
...
1 | 2015-05-29 | 2015-04-30 12:06:58 | New
1 | 2015-05-30 | 2015-04-30 12:06:58 | Active
1 | 2015-05-31 | 2015-04-30 12:06:58 | Active
...
etc.
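For readers without Vertica at hand, the same per-day timeline can be approximated in other engines. The sketch below is a swapped-in technique, not the TIMESERIES approach: it uses a recursive CTE as the daily calendar and correlated subqueries for the last order date, written in SQLite dialect and run through Python's sqlite3 module. The 30/90/180-day thresholds are taken from the query above; the fixed `AS_OF` date is a hypothetical stand-in for current_date so that the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table ord (custid integer, ord_date text)")
conn.executemany(
    "insert into ord values (?, ?)",
    [
        (1, "2015-04-30 12:06:58"),
        (1, "2015-09-24 17:30:59"),
        (1, "2016-02-11 13:21:10"),
        (2, "2015-10-21 10:38:29"),
        (3, "2015-05-22 12:13:01"),
        (3, "2015-07-09 01:04:51"),
        (3, "2015-10-23 00:23:48"),
    ],
)

AS_OF = "2016-02-11"  # hypothetical fixed date used instead of current_date

# days: one row per customer per day, from the first order date up to AS_OF.
# filled: latest order on or before each day, plus how many orders so far (nord).
TIMELINE_QUERY = """
with recursive days(custid, status_dt) as (
    select custid, date(min(ord_date)) from ord group by custid
    union all
    select custid, date(status_dt, '+1 day') from days where status_dt < :as_of
),
filled as (
    select d.custid,
           d.status_dt,
           (select max(o.ord_date) from ord o
             where o.custid = d.custid and date(o.ord_date) <= d.status_dt) as last_order_dt,
           (select count(*) from ord o
             where o.custid = d.custid and date(o.ord_date) <= d.status_dt) as nord
    from days d
)
select custid, status_dt, last_order_dt,
       case
           when julianday(status_dt) - julianday(date(last_order_dt)) < 30 then
               case when nord = 1 then 'New' else 'Active' end
           when julianday(status_dt) - julianday(date(last_order_dt)) < 90 then 'Active'
           when julianday(status_dt) - julianday(date(last_order_dt)) < 180 then 'Passive'
           else 'Inactive'
       end as status
from filled
order by custid, status_dt
"""

timeline = conn.execute(TIMELINE_QUERY, {"as_of": AS_OF}).fetchall()
status_by_day = {(custid, day): status for custid, day, _, status in timeline}
print(status_by_day[(1, "2015-05-29")], status_by_day[(1, "2015-05-30")])
```

As in the output above, customer 1 is still 'New' on 2015-05-29 and becomes 'Active' on 2015-05-30. Counting states per status_dt with a further group by would then give the daily totals the question asks for.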