I hope I can explain this properly. For a quick example of the context, say the Login and Create Page tables below store user-generated events, each with a UserID, a timestamp, and extra info (ABC and XYZ).
I want to join the two tables on UserID, but only keep pairs where the Create Page timestamp is after the Login timestamp, and only the occurrence closest to the Login timestamp. In other words, match each Create Page event to the nearest earlier Login. The time difference also can't be longer than 10 hours. Example data is below.
Login
UserID | Timestamp | ABC
1 | 2022-12-15 07:05:00 | aa
1 | 2022-12-15 07:10:00 | ab
2 | 2022-12-14 05:55:55 | ac
1 | 2022-12-11 17:00:00 | ad
3 | 2022-12-11 05:00:00 | ae
2 | 2022-12-10 05:06:00 | af
2 | 2022-12-10 08:00:00 | ag
Create Page
UserID | Timestamp | XYZ
1 | 2022-12-10 02:22:22 | xa
2 | 2022-12-10 08:10:00 | xb
2 | 2022-12-10 05:15:00 | xc
2 | 2022-12-10 05:20:00 | xd
1 | 2022-12-11 17:10:00 | xe
1 | 2022-12-11 18:00:00 | xf
3 | 2022-12-12 15:00:00 | xg
1 | 2022-12-15 07:15:00 | xh
Expected Result
UserID | XYZ | ABC
2 | xb | ag
2 | xc | af
1 | xe | ad
1 | xh | ab
I hope that made sense. Is anyone able to help?
I'm an SQL novice, and all my attempts failed miserably. I have no idea how to approach this and haven't found anything here yet that helped.
Thanks in advance!
You could do something like this (T-SQL), using OUTER APPLY to pick, for each login, the earliest page creation that follows it within 10 hours:
SELECT
l.UserID,
l.LoginTime 'LoginTime',
cp.CreationTime 'First Page Creation Time Within 10 Hours',
CAST(DATEDIFF(MINUTE, l.LoginTime, cp.CreationTime) AS FLOAT) / 60 'Page Created After Login in Hours'
FROM #Logins l
OUTER APPLY
(SELECT TOP 1 *
FROM #CreatePages c
WHERE c.UserID = l.UserID
AND c.CreationTime > l.LoginTime
AND DATEDIFF(MINUTE, l.LoginTime, c.CreationTime) <= 600
ORDER BY c.CreationTime) cp
WHERE cp.CreationTime IS NOT NULL
ORDER BY l.LoginTime
The 10-hour cutoff is expressed as 600 minutes so that partial hours aren't rounded away. If you want to see rows for when a user logged in but didn't create a page within 10 hours, just drop the WHERE clause: OUTER APPLY keeps unmatched rows the way a LEFT OUTER JOIN does, so you will see all logins, with NULLs for the page creation time and the hours-after-login column when no page was created within 10 hours.
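Note that the expected result pairs xh with ab (not with the earlier login aa), which means each Create Page event should first be assigned to its single closest preceding login, and then only the earliest event per login is kept. Here is a runnable sketch of that two-step matching, using SQLite window functions through Python's sqlite3; the table and column names are made up for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE login (UserID INT, Timestamp TEXT, ABC TEXT);
CREATE TABLE create_page (UserID INT, Timestamp TEXT, XYZ TEXT);
""")
con.executemany("INSERT INTO login VALUES (?,?,?)", [
    (1, "2022-12-15 07:05:00", "aa"), (1, "2022-12-15 07:10:00", "ab"),
    (2, "2022-12-14 05:55:55", "ac"), (1, "2022-12-11 17:00:00", "ad"),
    (3, "2022-12-11 05:00:00", "ae"), (2, "2022-12-10 05:06:00", "af"),
    (2, "2022-12-10 08:00:00", "ag"),
])
con.executemany("INSERT INTO create_page VALUES (?,?,?)", [
    (1, "2022-12-10 02:22:22", "xa"), (2, "2022-12-10 08:10:00", "xb"),
    (2, "2022-12-10 05:15:00", "xc"), (2, "2022-12-10 05:20:00", "xd"),
    (1, "2022-12-11 17:10:00", "xe"), (1, "2022-12-11 18:00:00", "xf"),
    (3, "2022-12-12 15:00:00", "xg"), (1, "2022-12-15 07:15:00", "xh"),
])

rows = con.execute("""
WITH nearest AS (
    -- Step 1: pair every Create Page event with its closest earlier login
    -- (same user, login strictly before, gap at most 10 hours)
    SELECT cp.UserID, cp.Timestamp AS ct, cp.XYZ,
           l.Timestamp AS lt, l.ABC,
           ROW_NUMBER() OVER (PARTITION BY cp.UserID, cp.Timestamp
                              ORDER BY l.Timestamp DESC) AS rn
    FROM create_page cp
    JOIN login l
      ON l.UserID = cp.UserID
     AND l.Timestamp < cp.Timestamp
     AND (julianday(cp.Timestamp) - julianday(l.Timestamp)) * 24 <= 10
)
-- Step 2: per login, keep only the earliest Create Page assigned to it
SELECT UserID, XYZ, ABC
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY UserID, lt
                                   ORDER BY ct) AS rn2
      FROM nearest
      WHERE rn = 1)
WHERE rn2 = 1
ORDER BY lt
""").fetchall()
print(rows)
```

This reproduces the four expected pairs: xa and xg have no login within 10 hours before them, xd loses to xc on login af, xf loses to xe on login ad, and xh goes to ab rather than aa.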
Trying to create a query that will give me the usage time of each car part between the dates when that part is used. For example, say part id 1 is installed on 2018-03-01, runs for 50 min on 2018-04-01, and then runs for 30 min on 2018-05-10; the total usage of this part should be 1:20 as a result.
These are examples of my tables.
Table1
| id | part_id | car_id | part_date |
|----|-------- |--------|------------|
| 1 | 1 | 3 | 2018-03-01 |
| 2 | 1 | 1 | 2018-03-28 |
| 3 | 1 | 3 | 2018-05-10 |
Table2
| id | car_id | run_date | puton_time | putoff_time |
|----|--------|------------|---------------------|---------------------|
| 1 | 3 | 2018-04-01 | 2018-04-01 12:00:00 | 2018-04-01 12:50:00 |
| 2 | 2 | 2018-04-10 | 2018-04-10 15:10:00 | 2018-04-10 15:20:00 |
| 3 | 3 | 2018-05-10 | 2018-05-10 10:00:00 | 2018-05-10 10:30:00 |
| 4 | 1 | 2018-05-11 | 2018-05-11 12:00:00 | 2018-05-11 12:50:00 |
Table1 contains the dates when each part is installed, and table2 contains the usage time of each part; they are joined on car_id. I have tried to write a query, but it does not work well. If somebody can figure out my mistake in this query, that would be helpful.
My SQL query
SELECT SEC_TO_TIME(SUM(TIME_TO_SEC(TIMEDIFF(t1.puton_time, t1.putoff_time)))) AS total_time
FROM table2 t1
LEFT JOIN table1 t2 ON t1.car_id=t2.car_id
WHERE t2.id=1 AND t1.run_date BETWEEN t2.part_date AND
(SELECT COALESCE(MIN(part_date), '2100-01-01') AS NextDate FROM table1 WHERE
id=1 AND t2.part_date > part_date);
Expected result
| part_id | total_time |
|---------|------------|
| 1 | 1:20:00 |
I hope this problem makes sense; in my search I found nothing like it, so I need help.
Solution, thanks to Kota Mori
SELECT t1.id, SEC_TO_TIME(SUM(TIME_TO_SEC(TIMEDIFF(t2.putoff_time, t2.puton_time)))) AS total_time
FROM table1 t1
LEFT JOIN table2 t2 ON t1.car_id = t2.car_id
AND t1.part_date <= t2.run_date
GROUP BY t1.id
You first need to join the two tables by the car_id and also a condition that part_date should be no greater than run_date.
Then compute the total minutes for each part_id separately.
The following is a query example for SQLite (the only SQL engine I have access to right now).
Since SQLite does not have a datetime type, I convert the strings into Unix timestamps with the strftime function. That part should be changed according to the SQL engine you are using. Apart from that, this is fairly standard SQL and mostly valid in other SQL dialects.
SELECT
t1.id,
sum(
cast(strftime('%s', t2.putoff_time) as integer) -
cast(strftime('%s', t2.puton_time) as integer)
) / 60 AS total_minutes
FROM
table1 t1
LEFT JOIN
table2 t2
ON
t1.car_id = t2.car_id
AND t1.part_date <= t2.run_date
GROUP BY
t1.id
The result is something like the below. Note that ID 1 gets 80 minutes (1:20) as expected.
id total_minutes
0 1 80
1 2 80
2 3 30
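To replay this locally, here is the same query wrapped in Python's sqlite3, with the sample data loaded in-memory. One assumption: row 4 of table2 ends before it starts as printed, so its putoff_time is taken to be 2018-05-11 12:50:00, which makes installation id 2 come out to 50 minutes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table1 (id INT, part_id INT, car_id INT, part_date TEXT);
CREATE TABLE table2 (id INT, car_id INT, run_date TEXT,
                     puton_time TEXT, putoff_time TEXT);
""")
con.executemany("INSERT INTO table1 VALUES (?,?,?,?)", [
    (1, 1, 3, "2018-03-01"),
    (2, 1, 1, "2018-03-28"),
    (3, 1, 3, "2018-05-10"),
])
# Row 4's putoff_time corrected to fall after its puton_time (assumed typo).
con.executemany("INSERT INTO table2 VALUES (?,?,?,?,?)", [
    (1, 3, "2018-04-01", "2018-04-01 12:00:00", "2018-04-01 12:50:00"),
    (2, 2, "2018-04-10", "2018-04-10 15:10:00", "2018-04-10 15:20:00"),
    (3, 3, "2018-05-10", "2018-05-10 10:00:00", "2018-05-10 10:30:00"),
    (4, 1, "2018-05-11", "2018-05-11 12:00:00", "2018-05-11 12:50:00"),
])

rows = con.execute("""
SELECT t1.id,
       SUM(CAST(strftime('%s', t2.putoff_time) AS INTEGER)
         - CAST(strftime('%s', t2.puton_time)  AS INTEGER)) / 60 AS total_minutes
FROM table1 t1
LEFT JOIN table2 t2
  ON t1.car_id = t2.car_id
 AND t1.part_date <= t2.run_date
GROUP BY t1.id
ORDER BY t1.id
""").fetchall()
print(rows)  # installation 1: 50 + 30 = 80 minutes
```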
I have a table that holds data about user_ids and all their log-in dates to the app.
Table:
|----------|--------------|
| User_Id | log_in_dates |
|----------|--------------|
| 1 | 2021-09-01 |
| 1 | 2021-09-03 |
| 2 | 2021-09-02 |
| 2 | 2021-09-04 |
| 3 | 2021-09-01 |
| 3 | 2021-09-02 |
| 3 | 2021-09-03 |
| 3 | 2021-09-04 |
| 4 | 2021-09-03 |
| 4 | 2021-09-04 |
| 5 | 2021-09-01 |
| 6 | 2021-09-01 |
| 6 | 2021-09-09 |
|----------|--------------|
From the above table, I'm trying to understand the users' log-in behavior from the present day back over the past 90 days.
Num_users_no_log_in is the count of users who haven't logged in to the app from the present_date back to the given last_log_in_date.
I want the table like below:
|---------------|------------------|--------------------|-------------------------|
| present_date | days_difference | last_log_in_date | Num_users_no_log_in |
|---------------|------------------|--------------------|-------------------------|
| 2021-09-01 | 0 | 2021-09-01 | 0 |
| 2021-09-02 | 1 | 2021-09-01 | 3 |->(Id = 1,5,6)
| 2021-09-02 | 0 | 2021-09-02 | 3 |->(Id = 1,5,6)
| 2021-09-03 | 2 | 2021-09-01 | 2 |->(Id = 5,6)
| 2021-09-03 | 1 | 2021-09-02 | 1 |->(Id = 2)
| 2021-09-03 | 0 | 2021-09-03 | 3 |->(Id = 2,5,6)
| 2021-09-04 | 3 | 2021-09-01 | 2 |->(Id = 5,6)
| 2021-09-04 | 2 | 2021-09-02 | 0 |
| 2021-09-04 | 1 | 2021-09-03 | 1 |->(Id= 1)
| 2021-09-04 | 0 | 2021-09-04 | 3 |->(Id = 1,5,6)
| .... | .... | .... | ....
|---------------|------------------|--------------------|-------------------------|
I was able to get the first three columns Present_date | days_difference | last_log_in_date using the following query:
with dts as
(
select distinct log_in_dates from users_table
)
select x.log_in_dates as present_date,
DATEDIFF(DAY, y.log_in_dates ,x.log_in_dates ) as Days_since_last_log_in,
y.log_in_dates as log_in_dates
from dts x, dts y
where x.log_in_dates >= y.log_in_dates
I don't understand how I can get the fourth column Num_users_no_log_in
I do not really understand your need: are the values based on users or on dates? If it's based on dates, as it looks like (otherwise you would probably have user_id as the first column), what does it mean to have the same date multiple times? I understand that you would like a recap for all dates from the beginning until the current date, but in my opinion it does not really make sense (imagine your dashboard in a year!).
Once this is said, let's go to the approach.
In such cases, I develop step by step using common table expressions (CTEs). Your example requires 3 steps:
prepare the date series
integrate the connection dates and perform the first calculation (the day difference)
finally, calculate the number of users with no connection per day
Then, the final query will display the desired result.
Here is the query I propose, developed with PostgreSQL (you did not specify your DBMS, but converting should not be such a big deal here):
with init_calendar as (
-- Prepare date series and count total users
select generate_series(min(log_in_dates), now(), interval '1 day') as present_date,
count(distinct user_id) as nb_users
from users
),
calendar as (
-- Add connections' dates for each period from the beginning to current date in calendar
-- and calculate nb days difference for each of them
-- Syntax may vary depending on the dbms used
select distinct present_date, log_in_dates as last_date,
extract(day from present_date - log_in_dates) as days_difference,
nb_users
from init_calendar
join users on log_in_dates <= present_date
),
usr_con as (
-- Identify last user connection's dates according to running date
-- Tag the line to be counted as no connection
select c.present_date, c.last_date, c.days_difference, c.nb_users,
u.user_id, max(log_in_dates) as last_con,
case when max(log_in_dates) = present_date then 0 else 1 end as to_count
from calendar c
join users u on u.log_in_dates <= c.last_date
group by c.present_date, c.last_date, c.days_difference, c.nb_users, u.user_id
)
select present_date, last_date, days_difference,
nb_users - sum(to_count) as Num_users_no_log_in
from usr_con
group by present_date, last_date, days_difference, nb_users
order by present_date, last_date
Please note that there is a difference with your own expected result as you forgot user_id = 3 in your calculation.
If you want to play with the query, you can with dbfiddle
I would like to count the number of events that occurred for each user between each login. The logins are stored in one table, and the events are stored in another.
So if a user logged in at 2019-03-10 10:00:00, then any events that occurred after that time are grouped and counted towards that login until a new login occurs (e.g. 2019-03-11 14:44:00), and then we count against the new login time.
Meaning that for 2019-03-10 10:00:00 we count all events between 2019-03-10 10:00:00 and 2019-03-11 14:44:00, and for the last login we count anything after it.
Another way of looking at it:
user_login:
User | Login_Timestamp
1 | 2019-03-10 10:00:00
1 | 2019-03-11 14:44:00
1 | 2019-03-14 08:01:11
user_events:
User | ... | EventTimestamp
1 | ... | 2019-03-10 10:01:00
1 | ... | 2019-03-10 10:10:00
1 | ... | 2019-03-11 13:10:00
1 | ... | 2019-03-11 14:45:11
1 | ... | 2019-03-11 14:46:11
1 | ... | 2019-03-14 10:10:00
The output I would like to get is:
User | LoginTimestamp | NumberOfEvents
1 | 2019-03-10 10:00:00 | 3
1 | 2019-03-11 14:44:00 | 2
1 | 2019-03-14 08:01:11 | 1
Thanks!
Using LEAD to turn each login into a time window:
WITH cte AS (
SELECT user,
loginTimestamp AS loginTimestampStart,
LEAD(loginTimestamp) OVER(PARTITION BY user
ORDER BY loginTimestamp) AS loginTimestampEnd
FROM user_login
)
SELECT c.user, c.loginTimestampStart, COUNT(*) AS NumberOfEvents
FROM cte c
JOIN user_events e
ON c.user = e.user
AND e.EventTimestamp >= c.loginTimestampStart
AND (e.EventTimestamp < c.loginTimestampEnd OR c.loginTimestampEnd IS NULL)
GROUP BY c.user, c.loginTimestampStart
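As a quick sanity check, here is the same LEAD-based query run against the sample data through Python's sqlite3 (column names adjusted to user_id/login_ts/event_ts for the sketch, with the timestamps written as ISO dates):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user_login (user_id INT, login_ts TEXT);
CREATE TABLE user_events (user_id INT, event_ts TEXT);
""")
con.executemany("INSERT INTO user_login VALUES (?,?)", [
    (1, "2019-03-10 10:00:00"),
    (1, "2019-03-11 14:44:00"),
    (1, "2019-03-14 08:01:11"),
])
con.executemany("INSERT INTO user_events VALUES (?,?)", [
    (1, "2019-03-10 10:01:00"), (1, "2019-03-10 10:10:00"),
    (1, "2019-03-11 13:10:00"), (1, "2019-03-11 14:45:11"),
    (1, "2019-03-11 14:46:11"), (1, "2019-03-14 10:10:00"),
])

rows = con.execute("""
WITH cte AS (
    SELECT user_id,
           login_ts AS start_ts,
           -- the next login by the same user closes this window
           LEAD(login_ts) OVER (PARTITION BY user_id
                                ORDER BY login_ts) AS end_ts
    FROM user_login
)
SELECT c.user_id, c.start_ts, COUNT(*) AS number_of_events
FROM cte c
JOIN user_events e
  ON e.user_id = c.user_id
 AND e.event_ts >= c.start_ts
 AND (e.event_ts < c.end_ts OR c.end_ts IS NULL)
GROUP BY c.user_id, c.start_ts
ORDER BY c.start_ts
""").fetchall()
print(rows)  # counts 3, 2, 1 as in the expected output
```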
I have a table storing user logins, which can be simplified to look like this:
| user | logindate |
+------+---------------------+
| 001 | 2018-01-26 10:00:00 |
| 001 | 2018-01-26 11:00:00 |
| 001 | 2018-01-26 12:00:00 |
Similarly, I have a table recording activities completed by the user:
| user | activitydate | activity |
+------+---------------------+-----------+
| 001 | 2018-01-26 10:24:00 | survey |
| 001 | 2018-01-26 10:30:00 | poll |
| 001 | 2018-01-26 11:03:00 | poll |
| 001 | 2018-01-26 12:08:00 | poll |
| 001 | 2018-01-26 12:10:00 | survey |
| 001 | 2018-01-26 12:12:00 | video |
I would like to know the number of activities completed per user per login. Given the above example, I would expect results like this:
| user | latestLogin | activityCount |
+------+---------------------+---------------+
| 001 | 2018-01-26 10:00:00 | 2 |
| 001 | 2018-01-26 11:00:00 | 1 |
| 001 | 2018-01-26 12:00:00 | 3 |
I have found one way to do this, which is to join each activity with the login table (where the login occurred before the activity), and get the max login per activity. I have demonstrated this using SQLFiddle - http://sqlfiddle.com/#!9/c3c90d/8
However, I feel this solution is very slow. When I run it against a production environment, the query runs for far too long. There are nearly 85,000 login records in the time period I'm looking at, and a much larger number of activities.
What are some alternative solutions? Is there any way I could first subquery the login table to figure out the various login segments, and then tie each activity to those segments, for example?
You can get the logins using a correlated subquery:
select a.*,
(select max(l.logindate)
from logins l
where l.user = a.user and a.activitydate >= l.logindate
) as logindate
from activities a;
The rest is just aggregation. I would use a subquery for this:
select user, logindate, count(*) as numactivities
from (select a.*,
(select max(l.logindate)
from logins l
where l.user = a.user and a.activitydate >= l.logindate
) as logindate
from activities a
) a
group by user, logindate;
The data in the SQL Fiddle differs from the data in the question, but here is an example of how this works.
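As a runnable illustration (Python's sqlite3 standing in for MySQL, with hypothetical column names user_id/login_ts/activity_ts), the correlated-subquery approach reproduces the expected counts on the question's data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE logins (user_id TEXT, login_ts TEXT);
CREATE TABLE activities (user_id TEXT, activity_ts TEXT, activity TEXT);
""")
con.executemany("INSERT INTO logins VALUES (?,?)", [
    ("001", "2018-01-26 10:00:00"),
    ("001", "2018-01-26 11:00:00"),
    ("001", "2018-01-26 12:00:00"),
])
con.executemany("INSERT INTO activities VALUES (?,?,?)", [
    ("001", "2018-01-26 10:24:00", "survey"),
    ("001", "2018-01-26 10:30:00", "poll"),
    ("001", "2018-01-26 11:03:00", "poll"),
    ("001", "2018-01-26 12:08:00", "poll"),
    ("001", "2018-01-26 12:10:00", "survey"),
    ("001", "2018-01-26 12:12:00", "video"),
])

rows = con.execute("""
SELECT user_id, login_ts, COUNT(*) AS activity_count
FROM (
    SELECT a.user_id,
           -- latest login at or before this activity
           (SELECT MAX(l.login_ts)
            FROM logins l
            WHERE l.user_id = a.user_id
              AND l.login_ts <= a.activity_ts) AS login_ts
    FROM activities a
) t
GROUP BY user_id, login_ts
ORDER BY login_ts
""").fetchall()
print(rows)  # counts 2, 1, 3 per login
```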
I have an events-based table in Redshift. I want to tie all events to the FIRST event in the series, provided that event was in the N-hours preceding this event.
If all I cared about was the very first row, I'd simply do:
SELECT
event_time
,first_value(event_time)
OVER (ORDER BY event_time rows unbounded preceding) as first_time
FROM
my_table
But because I only want to tie this to the first event in the past N-hours, I want something like:
SELECT
event_time
,first_value(event_time)
OVER (ORDER BY event_time rows between [N-hours ago] and current row) as first_time
FROM
my_table
A little background on my table: it stores user actions, so effectively a user jumps on, performs 1-100 actions, and then leaves. Most users do this 1-10 times per day. Sessions rarely last over an hour, so I could set N=1.
If I just set a PARTITION BY date_trunc('hour', event_time), I'll split sessions that span an hour boundary into two parents.
Assume my_table looks like
id | user_id | event_time
----------------------------------
1 | 123 | 2015-01-01 01:00:00
2 | 123 | 2015-01-01 01:15:00
3 | 123 | 2015-01-01 02:05:00
4 | 123 | 2015-01-01 13:10:00
5 | 123 | 2015-01-01 13:20:00
6 | 123 | 2015-01-01 13:30:00
My goal is to get a result that looks like
id | parent_id | user_id | event_time
----------------------------------
1 | 1 | 123 | 2015-01-01 01:00:00
2 | 1 | 123 | 2015-01-01 01:15:00
3 | 1 | 123 | 2015-01-01 02:05:00
4 | 4 | 123 | 2015-01-01 13:10:00
5 | 4 | 123 | 2015-01-01 13:20:00
6 | 4 | 123 | 2015-01-01 13:30:00
The answer appears to be "no" as of now.
There is functionality in SQL Server for using RANGE instead of ROWS in the frame, which allows the frame to be defined by comparison with the current row's value.
https://www.simple-talk.com/sql/learn-sql-server/window-functions-in-sql-server-part-2-the-frame/
When I attempt this syntax in Redshift I get the error that "Range is not yet supported"
Someone update this when that "yet" changes!
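In the meantime, a common workaround that reproduces the desired parent_id on this sample is gap-based sessionization: flag any event that comes more than N hours after the previous event as a session start, turn the flags into a running session number, then take the first id in each session. Note this chains events (each within N hours of the *previous* event, not of the session start), which matches the session behaviour described above. It only needs LAG, a running SUM, and FIRST_VALUE, so it should also run on Redshift with the julianday arithmetic swapped for DATEDIFF. A sketch, verified with SQLite through Python's sqlite3 and N = 1 hour:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (id INT, user_id INT, event_time TEXT)")
con.executemany("INSERT INTO my_table VALUES (?,?,?)", [
    (1, 123, "2015-01-01 01:00:00"), (2, 123, "2015-01-01 01:15:00"),
    (3, 123, "2015-01-01 02:05:00"), (4, 123, "2015-01-01 13:10:00"),
    (5, 123, "2015-01-01 13:20:00"), (6, 123, "2015-01-01 13:30:00"),
])

rows = con.execute("""
WITH gaps AS (
    -- 1 if this event is more than 1 hour after the previous one
    SELECT id, user_id, event_time,
           CASE WHEN julianday(event_time)
                   - julianday(LAG(event_time) OVER (PARTITION BY user_id
                                                     ORDER BY event_time))
                   > 1.0 / 24
                THEN 1 ELSE 0 END AS new_session
    FROM my_table
),
sessions AS (
    -- running sum of the flags numbers the sessions per user
    SELECT *,
           SUM(new_session) OVER (PARTITION BY user_id
                                  ORDER BY event_time
                                  ROWS UNBOUNDED PRECEDING) AS session_no
    FROM gaps
)
SELECT id,
       FIRST_VALUE(id) OVER (PARTITION BY user_id, session_no
                             ORDER BY event_time) AS parent_id,
       user_id, event_time
FROM sessions
ORDER BY id
""").fetchall()
print(rows)  # parent_id is 1 for ids 1-3 and 4 for ids 4-6
```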