I have a table which tracks user actions on a website. A simplified version is as follows:
user_id | action_time             | module_name
--------+-------------------------+-------------
      1 | 2014-03-02 11:13:08.775 | home
      1 | 2014-03-02 11:13:08.345 | user
      1 | 2014-03-02 11:13:08.428 | discussions
How much time did a user spend on each screen? So take the earliest action_time for a user, get the next one, and find the difference.
I think this calls for a recursive query, but I'm not able to get my head around it. One thing: I wouldn't know when to stop. After some "module" the user could have just closed the browser without bothering to log out, so "closure" is a bit tricky.
This can be surprisingly simple with the window function lead():
SELECT *
, lead(action_time) OVER (PARTITION BY user_id ORDER BY action_time)
- action_time AS time_spent
FROM tbl;
That's all.
time_spent is NULL for the last action of a user, where no other action follows - which seems perfectly adequate.
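A minimal runnable sketch of the lead() approach, using Python's standard sqlite3 driver (SQLite 3.25+ supports the same window-function syntax; the table and column names follow the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (user_id INT, action_time TEXT, module_name TEXT)")
con.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    (1, "2014-03-02 11:13:08.345", "user"),
    (1, "2014-03-02 11:13:08.428", "discussions"),
    (1, "2014-03-02 11:13:08.775", "home"),
])

# For each action, lead() fetches the next action_time of the same user;
# the difference between the two is the time spent on that screen.
rows = con.execute("""
    SELECT module_name, action_time,
           lead(action_time) OVER (PARTITION BY user_id
                                   ORDER BY action_time) AS next_action_time
    FROM tbl
    ORDER BY action_time
""").fetchall()
```

The last row comes back with next_action_time = None, matching the NULL behavior described above.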
I found this example of how to make a "range aggregate" using window functions and a lot of nested subqueries. I just adapted it to partition and group by user_id, and it seems to do what you want:
SELECT user_id, min(login_time) as login_time, max(logout_time) as logout_time
FROM (
SELECT user_id, login_time, logout_time,
max(new_start) OVER (PARTITION BY user_id ORDER BY login_time, logout_time) AS left_edge
FROM (
SELECT user_id, login_time, logout_time,
CASE
WHEN login_time <= max(lag_logout_time) OVER (
PARTITION BY user_id ORDER BY login_time, logout_time
) THEN NULL
ELSE login_time
END AS new_start
FROM (
SELECT
user_id,
login_time,
logout_time,
lag(logout_time) OVER (PARTITION BY user_id ORDER BY login_time, logout_time) AS lag_logout_time
FROM app_log
) AS s1
) AS s2
) AS s3
GROUP BY user_id, left_edge
ORDER BY user_id, min(login_time)
Results in:
user_id | login_time | logout_time
---------+---------------------+---------------------
1 | 2014-01-01 08:00:00 | 2014-01-01 10:49:00
1 | 2014-01-01 10:55:00 | 2014-01-01 11:00:00
2 | 2014-01-01 09:00:00 | 2014-01-01 11:49:00
2 | 2014-01-01 11:55:00 | 2014-01-01 12:00:00
(4 rows)
It works by first detecting the beginning of each new range (partitioned by user_id), then extending and grouping by the detected ranges. I found I had to read that article very carefully to understand it!
The article suggests the query can be simplified on PostgreSQL >= 9.0 by removing the innermost subquery and changing the window frame, but I could not get that to work.
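Under the hood this is a classic gaps-and-islands merge. Here is a runnable sketch via sqlite3 with the same query shape; the input rows are invented here to reproduce user 1's merged ranges from the result table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE app_log (user_id INT, login_time TEXT, logout_time TEXT)")
con.executemany("INSERT INTO app_log VALUES (?, ?, ?)", [
    # hypothetical overlapping sessions; the first two should merge
    (1, "2014-01-01 08:00:00", "2014-01-01 10:00:00"),
    (1, "2014-01-01 09:30:00", "2014-01-01 10:49:00"),
    (1, "2014-01-01 10:55:00", "2014-01-01 11:00:00"),
])

merged = con.execute("""
    SELECT user_id, min(login_time), max(logout_time)
    FROM (
      SELECT user_id, login_time, logout_time,
             max(new_start) OVER (PARTITION BY user_id
                                  ORDER BY login_time, logout_time) AS left_edge
      FROM (
        SELECT user_id, login_time, logout_time,
               CASE WHEN login_time <= max(lag_logout_time) OVER (
                        PARTITION BY user_id ORDER BY login_time, logout_time)
                    THEN NULL ELSE login_time END AS new_start
        FROM (
          SELECT user_id, login_time, logout_time,
                 lag(logout_time) OVER (PARTITION BY user_id
                                        ORDER BY login_time, logout_time) AS lag_logout_time
          FROM app_log) AS s1) AS s2) AS s3
    GROUP BY user_id, left_edge
    ORDER BY user_id, min(login_time)
""").fetchall()
```

The two overlapping sessions collapse into one range ending at 10:49:00, and the 10:55:00 session starts a new one.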
Related
I have a table like the below:
+---------------------+----------------+---------------------+
| prompt              | answer         | step_timestamp      |
+---------------------+----------------+---------------------+
| hi Lary             |                | 2022-04-04 10:00:00 |
| how are you?        |                | 2022-04-04 10:02:00 |
| how is your pet?    | I am fine      | 2022-04-04 10:05:00 |
| what is your hobby? | my pet is good | 2022-04-04 10:15:00 |
| ok thanks           | football       | 2022-04-04 10:25:00 |
+---------------------+----------------+---------------------+
Each answer corresponds to the prompt of the previous row.
Expected result :
hi Lary, how are you?I am fine. how is your pet?my pet is good. what is your hobby? football. ok thanks
To achieve this I wrote the following:
WITH SUPER AS(
SELECT call_id, group_concat(tall,'\t') as dialog_text
FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY tall,call_id
ORDER BY step_timestamp ASC) AS rn,call_id,tall
FROM
(SELECT call_id,step_timestamp, concat(prompt,':',lead(answer) over(PARTITION BY call_id,step_timestamp order by step_timestamp asc)) tall
FROM db.table
ORDER BY step_timestamp ASC
limit 100000000
)as inq
ORDER BY step_timestamp ASC
limit 100000000
) b
WHERE rn =1
GROUP BY call_id,call_ani
)select distinct call_id, dialog_text
from super;
But it does not work as expected. For example, sometimes I get something like this:
hi lary, how are you?I am fine. how is your pet?my pet is good. how is your pet?I am fine. what is your hobby? football. ok thanks
You probably know the reason already: group_concat() in Impala does not honor ORDER BY. Even if you put limit 100000000, it may not put all rows onto the same node, so an ordered concat is not guaranteed.
Use Hive's collect_list() instead.
I couldn't see the relevance of your row_number(), so I removed it to keep the solution simple. Please test the code below with your original data and add row_number() back if needed.
select
id call_id,
concat( concat_ws(',', min(g)) ) dialog_text
from
(
select
s.id,
--use collect_list to concat all dialogues in order of timestamp
collect_list(s.tall) over (partition by s.id order by s.step_timestamp desc rows between unbounded preceding and unbounded following) g
from
(
SELECT call_id id,step_timestamp,
concat(prompt,':',lead(answer) over(PARTITION BY call_id,step_timestamp order by step_timestamp asc)) tall
FROM db.table -- main data
) s
) gs
-- Need to group by 'id' since we have duplicate collect_list values
group by id
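Engine quirks aside, the core transformation is simply "pair each prompt with the answer of the following row". A plain-Python sketch of that lead()-style pairing, using the sample rows from the question:

```python
rows = [  # (prompt, answer, step_timestamp), already sorted by timestamp
    ("hi Lary", "", "2022-04-04 10:00:00"),
    ("how are you?", "", "2022-04-04 10:02:00"),
    ("how is your pet?", "I am fine", "2022-04-04 10:05:00"),
    ("what is your hobby?", "my pet is good", "2022-04-04 10:15:00"),
    ("ok thanks", "football", "2022-04-04 10:25:00"),
]

# pair each prompt with the answer of the *next* row (the lead() idea)
parts = []
for i, (prompt, _, _) in enumerate(rows):
    nxt = rows[i + 1][1] if i + 1 < len(rows) else ""
    parts.append(prompt + nxt if nxt else prompt)
dialog = ". ".join(parts)
```

Because the pairing is done row-by-row in timestamp order, no duplicate fragments can appear, which is exactly what the ordered collect_list() achieves in Hive.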
I am trying to create a query that will give me a column of total time logged in for each month for each user.
username | auth_event_type | time                | credential_id
---------+-----------------+---------------------+--------------
Joe      | 1               | 2021-11-01 09:00:00 | 44
Joe      | 2               | 2021-11-01 10:00:00 | 44
Jeff     | 1               | 2021-11-01 11:00:00 | 45
Jeff     | 2               | 2021-11-01 12:00:00 | 45
Joe      | 1               | 2021-11-01 12:00:00 | 46
Joe      | 2               | 2021-11-01 12:30:00 | 46
Joe      | 1               | 2021-12-06 14:30:00 | 47
Joe      | 2               | 2021-12-06 15:30:00 | 47
The auth_event_type column specifies whether the event was a login (1) or logout (2) and the credential_id indicates the session.
I'm trying to create a query that would have an output like this:
username | year_month | total_time
---------+------------+-----------
Joe      | 2021-11    | 1:30
Jeff     | 2021-11    | 1:00
Joe      | 2021-12    | 1:00
How would I go about doing this in postgres? I am thinking it would involve a window function? If someone could point me in the right direction that would be great. Thank you.
Solution 1: partially working
Not sure that window functions will help you in your case, but aggregate functions will:
WITH list AS
(
SELECT username
, date_trunc('month', time) AS year_month
, max(time ORDER BY time) - min(time ORDER BY time) AS session_duration
FROM your_table
GROUP BY username, date_trunc('month', time), credential_id
)
SELECT username
, to_char (year_month, 'YYYY-MM') AS year_month
, sum(session_duration) AS total_time
FROM list
GROUP BY username, year_month
The first part of the query aggregates the login/logout times per username and credential_id; the second part sums, per year_month, the differences between the login and logout times. This query works well as long as the login and logout times fall in the same month, but it fails when they don't.
Solution 2: fully working
In order to calculate the total_time per username and per month whatever the login and logout times are, we can use a time-range approach that intersects the session ranges [login_time, logout_time) with the monthly ranges [monthly_start_time, monthly_end_time):
WITH monthly_range AS
(
SELECT to_char(m.month_start_date, 'YYYY-MM') AS month
, tsrange(m.month_start_date, m.month_start_date+ interval '1 month' ) AS monthly_range
FROM
( SELECT generate_series(min(date_trunc('month', time)), max(date_trunc('month', time)), '1 month') AS month_start_date
FROM your_table
) AS m
), session_range AS
(
SELECT username
, tsrange(min(time ORDER BY auth_event_type), max(time ORDER BY auth_event_type)) AS session_range
FROM your_table
GROUP BY username, credential_id
)
SELECT s.username
, m.month
, sum(upper(p.period) - lower(p.period)) AS total_time
FROM monthly_range AS m
INNER JOIN session_range AS s
ON s.session_range && m.monthly_range
CROSS JOIN LATERAL (SELECT s.session_range * m.monthly_range AS period) AS p
GROUP BY s.username, m.month
See the result in dbfiddle.
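The tsrange intersection in Solution 2 amounts to clipping each half-open session interval against each half-open month interval. A small Python sketch of that clipping step (the function name is mine):

```python
from datetime import datetime, timedelta

def clip(session_start, session_end, month_start, month_end):
    """Length of the intersection of two half-open intervals [start, end)."""
    start = max(session_start, month_start)
    end = min(session_end, month_end)
    return max(end - start, timedelta(0))  # empty intersection -> zero

# a session spanning the Nov/Dec month boundary
s0 = datetime(2021, 11, 30, 23, 0)
s1 = datetime(2021, 12, 1, 1, 0)
nov = (datetime(2021, 11, 1), datetime(2021, 12, 1))
dec = (datetime(2021, 12, 1), datetime(2022, 1, 1))
print(clip(s0, s1, *nov))  # one hour lands in November
print(clip(s0, s1, *dec))  # one hour lands in December
```

Summing the clipped lengths per (username, month) gives exactly what the range-join query computes.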
Use the window function lag(), partitioned by credential_id and ordered by time, e.g.
WITH j AS (
SELECT username, time, age(time, LAG(time) OVER w)
FROM t
WINDOW w AS (PARTITION BY credential_id ORDER BY time
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
SELECT username, to_char(time,'yyyy-mm'),sum(age) FROM j
GROUP BY 1,2;
Note: the frame ROWS BETWEEN 1 PRECEDING AND CURRENT ROW is pretty much optional in this case, but it is considered good practice to keep window functions as explicit as possible, so that in the future you don't have to read the docs to figure out what your query is doing.
Demo: db<>fiddle
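The same lag()-per-credential_id idea, runnable via sqlite3 (SQLite has no age(), so julianday() arithmetic stands in to get minutes; the data is from the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (username TEXT, auth_event_type INT, time TEXT, credential_id INT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    ("Joe",  1, "2021-11-01 09:00:00", 44),
    ("Joe",  2, "2021-11-01 10:00:00", 44),
    ("Jeff", 1, "2021-11-01 11:00:00", 45),
    ("Jeff", 2, "2021-11-01 12:00:00", 45),
    ("Joe",  1, "2021-11-01 12:00:00", 46),
    ("Joe",  2, "2021-11-01 12:30:00", 46),
    ("Joe",  1, "2021-12-06 14:30:00", 47),
    ("Joe",  2, "2021-12-06 15:30:00", 47),
])

# lag() pairs each logout with its login inside the same credential_id,
# then we sum the per-session minutes per user and month
rows = con.execute("""
    WITH j AS (
      SELECT username, time,
             (julianday(time) - julianday(lag(time) OVER (
                 PARTITION BY credential_id ORDER BY time))) * 24 * 60 AS mins
      FROM t)
    SELECT username, strftime('%Y-%m', time) AS year_month, sum(mins)
    FROM j
    GROUP BY 1, 2
    ORDER BY 2, 1
""").fetchall()
```

This reproduces the expected output: Joe 90 minutes in November, Jeff 60, Joe 60 in December.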
Using SQL, how can you find the time elapsed between each user's sessions? For instance, user_id 1234 had one session on 2017-01-01 and another session on 2017-01-02 (see table below). How can I find the time from the last session_end to the beginning of the next session_start?
user_id | session_start       | session_end
--------+---------------------+--------------------
1234    | 2017-01-01 00:00:00 | 2017-01-01 00:30:30
1236    | 2017-01-01 01:00:00 | 2017-01-01 01:05:30
1234    | 2017-01-02 12:00:09 | 2017-01-02 12:00:30
1234    | 2017-01-01 02:00:00 | 2017-01-01 03:30:30
1236    | 2017-01-01 00:00:00 | 2017-01-01 00:30:30
Thanks.
This can easily be done using window functions:
select user_id, session_start, session_end,
session_start - lag(session_end) over (partition by user_id order by session_start) as time_diff
from the_table
order by user_id, session_start;
Online example: http://rextester.com/NTVH38963
Subtracting one timestamp from another returns an interval. To convert that to minutes, extract the number of seconds the interval represents and divide by 60:
select user_id, session_start, session_end,
extract(epoch from
session_start - lag(session_end) over (partition by user_id order by session_start)
) / 60 as minutes
from the_table
order by user_id, session_start;
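A runnable sketch of the lag(session_end) gap calculation via sqlite3, using the question's data (julianday() arithmetic stands in for Postgres' extract(epoch from ...) / 60):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE the_table (user_id INT, session_start TEXT, session_end TEXT)")
con.executemany("INSERT INTO the_table VALUES (?, ?, ?)", [
    (1234, "2017-01-01 00:00:00", "2017-01-01 00:30:30"),
    (1236, "2017-01-01 01:00:00", "2017-01-01 01:05:30"),
    (1234, "2017-01-02 12:00:09", "2017-01-02 12:00:30"),
    (1234, "2017-01-01 02:00:00", "2017-01-01 03:30:30"),
    (1236, "2017-01-01 00:00:00", "2017-01-01 00:30:30"),
])

# gap in minutes between each session_start and the previous session_end
rows = con.execute("""
    SELECT user_id, session_start,
           (julianday(session_start) - julianday(lag(session_end) OVER (
               PARTITION BY user_id ORDER BY session_start))) * 24 * 60 AS minutes
    FROM the_table
    ORDER BY user_id, session_start
""").fetchall()
```

Each user's first session has no predecessor, so its gap is NULL (None), just like in the Postgres version.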
Here's one way to do it with a subquery (SQL Server syntax):
SELECT dT.user_ID
,dT.max_session_start
,DATEDIFF(minute, (SELECT MAX(session_end)
FROM tablename T
WHERE T.user_ID = dT.user_ID
AND T.session_end < dT.max_session_start)
, dT.max_session_start
) AS minutes
FROM (
SELECT user_ID
,MAX(session_start) AS max_session_start
FROM tablename
GROUP BY user_ID
) AS dT
I want to find the first entry of a user who signed up to my product (with an id), together with their anonymous_id and a timestamp.
Since I know that a user who has already signed up and visits the page again can have multiple anonymous_ids (e.g. using multiple devices, getting new cookies, etc.), I select distinct user_ids.
I wrote a query that looks like this:
SELECT distinct user_id , min(timestamp),anonymous_id
FROM data
group by 1,3
but now it gives me the first mention of the user for every anonymous_id:
user_id | timestamp                  | anonymous_id
--------+----------------------------+--------------
12      | 2016-07-28 16:19:57.101+00 | x-1
12      | 2016-08-24 09:17:21.294+00 | y-23
12      | 2016-07-27 12:03:25.572+00 | y-2345
I want to see only the first mention of user_id 12, in this case the one with timestamp 2016-07-27 12:03:25.572+00.
How do I write the query so that I get the first mention of each user_id?
The fastest way in Postgres is to use its proprietary DISTINCT ON ():
SELECT distinct on (user_id) user_id , timestamp, anonymous_id
FROM data
order by user_id, timestamp;
You can use the row_number() window function:
SELECT user_id, timestamp, anonymous_id
FROM (SELECT user_id, timestamp, anonymous_id,
ROW_NUMBER() OVER (PARTITION BY user_id
ORDER BY timestamp ASC) AS rn
FROM data) t
WHERE rn = 1
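Here's the row_number() variant, runnable via sqlite3 with the question's rows (the column is named ts in this sketch, since "timestamp" is better avoided as an identifier):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (user_id INT, ts TEXT, anonymous_id TEXT)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    (12, "2016-07-28 16:19:57.101+00", "x-1"),
    (12, "2016-08-24 09:17:21.294+00", "y-23"),
    (12, "2016-07-27 12:03:25.572+00", "y-2345"),
])

# rn = 1 picks the earliest row per user_id
first = con.execute("""
    SELECT user_id, ts, anonymous_id
    FROM (SELECT user_id, ts, anonymous_id,
                 ROW_NUMBER() OVER (PARTITION BY user_id
                                    ORDER BY ts ASC) AS rn
          FROM data)
    WHERE rn = 1
""").fetchall()
```

Only the 2016-07-27 row survives, which is exactly the "first mention" the question asks for.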
At my Drupal website users can rate each other and those timestamped ratings are stored in the pref_rep table:
# select id, nice, last_rated from pref_rep where nice=true
order by last_rated desc limit 7;
id | nice | last_rated
------------------------+------+----------------------------
OK152565298368 | t | 2011-07-07 14:26:38.325716
OK452217781481 | t | 2011-07-07 14:26:10.831353
OK524802920494 | t | 2011-07-07 14:25:28.961652
OK348972427664 | t | 2011-07-07 14:25:17.214928
DE11873 | t | 2011-07-07 14:25:05.303104
OK335285460379 | t | 2011-07-07 14:24:39.062652
OK353639875983 | t | 2011-07-07 14:23:33.811986
Also I keep the gender of each user in the pref_users table:
# select id, female from pref_users limit 7;
id | female
----------------+--------
OK351636836012 | f
OK366097485338 | f
OK251293359874 | t
OK7848446207 | f
OK335478250992 | t
OK355400714550 | f
OK146955222542 | t
I'm trying to create 2 Drupal blocks displaying "Miss last month" and "Mister last month", but my question is not about Drupal, so please don't move it to drupal.stackexchange.com ;-)
My question is about SQL: how could I find the user with the highest count of nice - and that for the last month? I would have 2 queries - one for female and one for non-female.
Using PostgreSQL 8.4.8 / CentOS 5.6 and SQL is sometimes so hard :-)
Thank you!
Alex
UPDATE:
I've got a nice suggestion to cast timestamps to strings in order to find records for the last month (not for the last 30 days)
UPDATE2:
I've ended up doing string comparison:
select r.id,
count(r.id),
u.first_name,
u.avatar,
u.city
from pref_rep r, pref_users u where
r.nice=true and
to_char(current_timestamp - interval '1 month', 'IYYY-MM') =
to_char(r.last_rated, 'IYYY-MM') and
u.female=true and
r.id=u.id
group by r.id , u.first_name, u.avatar, u.city
order by count(r.id) desc
limit 1
Say you run it once on the first day of the month, and cache the results, since counting votes on every page is kinda useless.
First, some date arithmetic:
SELECT now(),
date_trunc( 'month', now() ) - '1 MONTH'::INTERVAL,
date_trunc( 'month', now() );
now                           | ?column?               | date_trunc
------------------------------+------------------------+------------------------
2011-07-07 16:24:38.765559+02 | 2011-06-01 00:00:00+02 | 2011-07-01 00:00:00+02
OK, we got the bounds for the "last month" datetime range.
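The same month-bounds arithmetic in plain Python, for checking by hand (the function name is mine):

```python
from datetime import datetime

def last_month_bounds(now: datetime):
    """Half-open [start, end) bounds of the month before `now`."""
    # date_trunc('month', now): first instant of the current month
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    # step back one month, wrapping January to the previous December
    if month_start.month == 1:
        prev = month_start.replace(year=month_start.year - 1, month=12)
    else:
        prev = month_start.replace(month=month_start.month - 1)
    return prev, month_start

print(last_month_bounds(datetime(2011, 7, 7, 16, 24, 38)))
```

Using a half-open range (>= start, < end) is what makes the boundary instants unambiguous, exactly as in the SQL above.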
Now we need a window function to get the first row per gender:
SELECT * FROM (
SELECT *, rank( ) over (partition by female order by score desc )
FROM (
SELECT id, count(*) AS score FROM pref_rep
WHERE nice=true
AND last_rated >= date_trunc( 'month', now() ) - '1 MONTH'::INTERVAL
AND last_rated < date_trunc( 'month', now() )
GROUP BY id) s1
JOIN pref_users USING (id)) s2
WHERE rank=1;
Note this can give you several rows per gender in case of ties.
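The same count-then-rank shape is easy to try out via sqlite3, adapted to the question's schema (pref_rep.id joins pref_users.id and the partition is on the female flag; the sample rows below are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pref_rep (id TEXT, nice INT, last_rated TEXT)")
con.execute("CREATE TABLE pref_users (id TEXT, female INT)")
con.executemany("INSERT INTO pref_users VALUES (?, ?)",
                [("A", 1), ("B", 1), ("C", 0)])
con.executemany("INSERT INTO pref_rep VALUES (?, ?, ?)", [
    ("A", 1, "2011-06-03 10:00:00"), ("A", 1, "2011-06-10 10:00:00"),
    ("A", 1, "2011-06-20 10:00:00"), ("B", 1, "2011-06-05 10:00:00"),
    ("C", 1, "2011-06-06 10:00:00"), ("C", 1, "2011-06-07 10:00:00"),
    ("A", 1, "2011-07-02 10:00:00"),  # outside last month, must be ignored
])

# count nice ratings per user inside the month window, then rank per gender
winners = con.execute("""
    SELECT id, female, score FROM (
      SELECT s.id, u.female, s.score,
             rank() OVER (PARTITION BY u.female ORDER BY s.score DESC) AS rnk
      FROM (SELECT id, count(*) AS score
            FROM pref_rep
            WHERE nice = 1
              AND last_rated >= '2011-06-01'
              AND last_rated <  '2011-07-01'
            GROUP BY id) AS s
      JOIN pref_users AS u USING (id)) AS ranked
    WHERE rnk = 1
    ORDER BY female DESC
""").fetchall()
```

User A tops the female partition with three June ratings (the July rating is excluded by the range filter), and C tops the male partition.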
EDIT :
I've got a nice suggestion to cast timestamps to strings in order to
find records for the last month (not for the last 30 days)
date_trunc() works much better.
If you make 2 queries, you'll have to make the count() twice. Since users can potentially vote many times for other users, that table will probably be the larger one, so scanning it once is a good thing.
You can't "leave joining back onto the users table to the outer part of the query too" because you need genders...
Query above takes about 30 ms with 1k users and 100k votes so you'd definitely want to cache it.