How can I run this except sub-query in one single query? - sql

I am using PostgreSQL and have two tables, users and usertasks.
users has the following fields: userid, username
usertasks has the following fields: id, taskdate, userid
userid and id are the primary keys of these tables, respectively.
I want to find all users who have completed fewer than 3 tasks in the last 3 months.
I cannot simply use WHERE taskdate > (last 3 months) here, because I need all users, not just those who had tasks in the last 3 months. (Some users may have done tasks 6 months ago but none in the last 3 months, and I need those users as well.)
My query is this:
select userid
from users
EXCEPT
select userid from usertasks
where usertasks.taskdate > CURRENT_DATE - INTERVAL '3 months'
group by usertasks.userid having count(id) >= 3
Problem:
The above query returns the right result, and using NOT IN instead of EXCEPT works too, but I am running into performance issues. Can this be done in a single query without a sub-query, using joins or some other method? The sub-queries seem to be making it slower.
The test case has 100 thousand users and 1 million tasks, and I am looking for the fastest method.

You need to use HAVING with a CASE expression, so that only recent tasks are counted while every user is kept by the left join.
select u.userid
from users u
left join usertasks ut
on ut.userid = u.userid
group by u.userid
having count(case when ut.taskdate > CURRENT_DATE - INTERVAL '3 months' then ut.id else null end) < 3 -- count of tasks in the last 3 months < 3
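The conditional count can be tried out on a small data set. The following is only a sketch: it uses Python's sqlite3 module rather than PostgreSQL, so date('now', '-3 months') stands in for CURRENT_DATE - INTERVAL '3 months', and the four users and their tasks are made-up sample rows. It also runs the original EXCEPT query to check that both forms return the same users.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# SQLite's date('now') is in UTC, so build the sample dates in UTC as well.
today = datetime.now(timezone.utc).date()

def days_ago(n):
    """ISO date string for n days before today (hypothetical helper)."""
    return (today - timedelta(days=n)).isoformat()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userid INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE usertasks (id INTEGER PRIMARY KEY, taskdate TEXT, userid INTEGER);
""")
# Made-up sample data: one busy user, one occasional, one lapsed, one with no tasks.
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "busy"), (2, "occasional"), (3, "lapsed"), (4, "never")])
conn.executemany("INSERT INTO usertasks (taskdate, userid) VALUES (?, ?)", [
    (days_ago(10), 1), (days_ago(20), 1), (days_ago(30), 1),     # 3 recent tasks
    (days_ago(15), 2), (days_ago(200), 2),                       # 1 recent, 1 old
    (days_ago(200), 3), (days_ago(210), 3), (days_ago(220), 3),  # 3 old tasks only
])

# LEFT JOIN keeps every user; the CASE makes count() ignore tasks older than 3 months.
join_query = """
SELECT u.userid
FROM users u
LEFT JOIN usertasks ut ON ut.userid = u.userid
GROUP BY u.userid
HAVING count(CASE WHEN ut.taskdate > date('now', '-3 months') THEN ut.id END) < 3
ORDER BY u.userid
"""
low_activity = [row[0] for row in conn.execute(join_query)]

# The EXCEPT formulation from the question, for comparison.
except_query = """
SELECT userid FROM users
EXCEPT
SELECT userid FROM usertasks
WHERE taskdate > date('now', '-3 months')
GROUP BY userid HAVING count(id) >= 3
"""
except_result = sorted(row[0] for row in conn.execute(except_query))

print(low_activity)   # users 2, 3 and 4: fewer than 3 tasks in the last 3 months
print(except_result)  # same users from the EXCEPT version
```

Whether the join form is actually faster than EXCEPT on 100 thousand users and 1 million tasks depends on indexes and the planner, so it is worth comparing both with EXPLAIN ANALYZE on the real data.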

Related

Group timestamps into sessions with a defined minimum gap between sessions

I'm trying to group my timestamp data into user sessions of various lengths where a session end is defined by a minimum time gap between sessions.
So if the threshold is e.g. 5 minutes, two timestamps with a 2 minute gap would be considered the same session, while two timestamps with a gap of 6 minutes would be considered two sessions.
I've seen several examples where people try to group into sessions of a certain length, which is kinda easy. But my case is too "online" and I can't figure out what trick to use.
I have a base query defining the timestamps and creating sessions with 1 minute granularity. But I get one long session for each object, merging several ones into one long one.
How could I split my long merged session into several ones, with a defined gap of e.g. 5 minutes?
SELECT
count(distinct timestamps.*) as min_spent,
timestamps.object_id,
timestamps.user_id,
min(created_min) as session_start,
max(created_min) as session_end
FROM
(
SELECT
date_trunc('minute', datetime) as created_min,
object_id,
user_id
FROM timestamp_metrics
GROUP BY created_min, object_id, user_id
) as timestamps
left join objects o on o.id = timestamps.object_id
left join users u on u.id = timestamps.user_id
GROUP BY object_id, user_id

how to get transactions on a database for a specific time

I want to get the users from a PostgreSQL database whose activity has not been seen for a specific period of time (basically, I am trying to find out which users are not using the application at all).
For example, the following SQL query is meant to find users who have not used it for the last 30 days:
SELECT distinct on (username) username, started_at
FROM projects_user JOIN projects_synclog
ON projects_user.id = projects_synclog.user_id
WHERE started_at BETWEEN '2019-08-15' AND '2019-09-15'
ORDER BY username, started_at DESC
This query shows every user with activity in that range. For example, a user may have logged in once a month ago and again 2 days ago. In that case the user is still active, and I don't want them to be listed.
I have been trying this for countless hours. I searched for solutions a lot in here and other forums listed in google.
I would highly appreciate any help.
Thanks a lot.
I think you want aggregation and having. This answers the question in the title:
SELECT pu.username, max(sl.started_at)
FROM projects_user pu LEFT JOIN
projects_synclog sl
ON pu.id = sl.user_id
GROUP BY pu.username
HAVING MAX(sl.started_at) IS NULL OR
MAX(sl.started_at) < CURRENT_DATE - INTERVAL '7 DAY'

Query to find all timestamps more than a certain interval apart

I'm using postgres to run some analytics on user activity. I have a table of all requests(pageviews) made by every user and the timestamp of the request, and I'm trying to find the number of distinct sessions for every user. For the sake of simplicity, I'm considering every set of requests an hour or more apart from others as a distinct session. The data looks something like this:
id| request_time| user_id
1 2014-01-12 08:57:16.725533 1233
2 2014-01-12 08:57:20.944193 1234
3 2014-01-12 09:15:59.713456 1233
4 2014-01-12 10:58:59.713456 1234
How can I write a query to get the number of sessions per user?
To start a new session after every gap >= 1 hour:
SELECT user_id, count(*) AS distinct_sessions
FROM (
SELECT user_id
,(lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
ORDER BY request_time)
<= request_time - '1h'::interval) AS step -- start new session
FROM tbl
) sub
WHERE step
GROUP BY user_id
ORDER BY user_id;
Assuming request_time NOT NULL.
Explanation:
In subquery sub, check for every row whether a new session begins. The third parameter of lag() provides the default -infinity, which is lower than any timestamp and therefore always starts a new session for the first row.
In the outer query count how many times new sessions started. Eliminate step = FALSE and count per user.
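As a rough check of the lag() approach, here is a sketch that runs it on the four sample rows from the question. It uses Python's sqlite3 module instead of PostgreSQL (window functions need SQLite 3.25 or newer), so two things are substitutions: julianday() converts the timestamps to fractional days, making the one-hour gap 1.0/24, and a very small number (-1e9) plays the role of '-infinity' as the lag() default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, request_time TEXT NOT NULL, user_id INTEGER NOT NULL)"
)
# The sample rows from the question.
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    (1, "2014-01-12 08:57:16.725533", 1233),
    (2, "2014-01-12 08:57:20.944193", 1234),
    (3, "2014-01-12 09:15:59.713456", 1233),
    (4, "2014-01-12 10:58:59.713456", 1234),
])

query = """
SELECT user_id, count(*) AS distinct_sessions
FROM (
    SELECT user_id,
           -- previous request of the same user; -1e9 stands in for '-infinity'
           lag(julianday(request_time), 1, -1e9)
               OVER (PARTITION BY user_id ORDER BY request_time)
           <= julianday(request_time) - 1.0 / 24 AS step  -- gap >= 1 hour starts a session
    FROM tbl
) sub
WHERE step
GROUP BY user_id
ORDER BY user_id
"""
sessions = conn.execute(query).fetchall()
print(sessions)
```

User 1233 has two requests about 19 minutes apart, so they fall into one session; user 1234 has a gap of about two hours, which gives two sessions.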
Alternative interpretation
If you really wanted to count hours where at least one request happened (I don't think you do, but another answer assumes as much), you would:
SELECT user_id
, count(DISTINCT date_trunc('hour', request_time)) AS hours_with_req
FROM tbl
GROUP BY 1
ORDER BY 1;

Running a query over past date ranges

I have a rather interesting problem which I first thought would be straight-forward, but it turned out to be more complicated.
I have data like this:
Date User ID
2012-10-11 a
2012-10-11 b
2012-10-12 c
2012-10-12 d
2012-10-13 e
2012-10-14 b
2012-10-14 e
... ...
Each row has a Date, User ID couple which indicates that that user was active on that day. A user can appear on multiple dates and a date will have multiple users -- just like in the example. I have millions of rows like this which cover a time range of about 90 days.
Here's the question: For each day, I want to get the number of users who have not been active for the past 10 days. For instance, if the user "a" was active on 2012-05-31 but hasn't been active on any of the days between 06-01 and 06-10, I want to count this user on 06-10. I wouldn't count him again on the following days, though, unless he becomes active and disappears again.
Can I do this in SQL, or would I need some kind of script to organize the data the way I want? What would be your recommendations? I use Hive.
Thank you so much!
I think you can do this in Hive-compatible SQL. Here is the idea.
For each user/date get the next date for the user.
Discard the original record if the next is less than 10 days after the current one.
Add 10 to the date
Aggregate and count
I am not sure of all the Hive functions for things like date. Here is an example of how to do it:
select date+10, count(*)
from (select t.userid, t.date,
min(case when tnext.date > t.date then tnext.date end) as nextdate
from t left outer join
t tnext
on t.userid = tnext.userid
group by t.userid, t.date
) t
where nextdate is null or nextdate - date >= 10
group by date+10;
Note that the inner subquery would be better written using:
on t.userid = tnext.userid and tnext.date > t.date
However, I don't know if Hive supports such a join (it doesn't support non-equijoins, and it is not clear whether one or all clauses have to be equalities).
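The four steps above can be sketched on the sample rows from the question. This is not Hive: it is a small illustration using Python's sqlite3 module, where date(x, '+10 days') and julianday() stand in for the date arithmetic, and the column is renamed activity_date (an assumption made here to avoid clashing with SQLite's date() function). Since the question's data is truncated, the counts only reflect the seven rows shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (activity_date TEXT, userid TEXT)")
# The visible sample rows from the question.
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    ("2012-10-11", "a"), ("2012-10-11", "b"),
    ("2012-10-12", "c"), ("2012-10-12", "d"),
    ("2012-10-13", "e"),
    ("2012-10-14", "b"), ("2012-10-14", "e"),
])

query = """
SELECT date(s.activity_date, '+10 days') AS inactive_on,  -- step 3: add 10 to the date
       count(*) AS users                                   -- step 4: aggregate and count
FROM (
    -- step 1: for each user/date, find the user's next active date (NULL if none)
    SELECT t.userid AS userid,
           t.activity_date AS activity_date,
           min(CASE WHEN tnext.activity_date > t.activity_date
                    THEN tnext.activity_date END) AS nextdate
    FROM t
    LEFT JOIN t AS tnext ON t.userid = tnext.userid
    GROUP BY t.userid, t.activity_date
) AS s
-- step 2: keep only records followed by at least 10 days of inactivity
WHERE s.nextdate IS NULL
   OR julianday(s.nextdate) - julianday(s.activity_date) >= 10
GROUP BY inactive_on
ORDER BY inactive_on
"""
inactive_counts = conn.execute(query).fetchall()
print(inactive_counts)
```

Users b and e on their first dates are discarded because they return within 10 days; every other row has no later activity in the sample, so it is counted 10 days after its date.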

How to have GROUP BY and COUNT include zero sums?

I have SQL like this (where $ytoday is 5 days ago):
$sql = 'SELECT Count(*), created_at FROM People WHERE created_at >= "'. $ytoday .'" GROUP BY DATE(created_at)';
I want this to return a value for every day, so it would return 5 results in this case (5 days ago until today).
But say Count(*) is 0 for yesterday, instead of returning a zero it doesn't return any data at all for that date.
How can I change that SQLite query so it also returns data that has a count of 0?
Without convoluted (in my opinion) queries, your output data-set won't include dates that don't exist in your input data-set. This means that you need a data-set with the 5 days to join on to.
The simple version would be to create a table with the 5 dates, and join on that. I typically create and keep (effectively caching) a calendar table with every date I could ever need. (Such as from 1900-01-01 to 2099-12-31.)
SELECT
calendar.calendar_date,
Count(People.created_at)
FROM
Calendar
LEFT JOIN
People
ON Calendar.calendar_date = DATE(People.created_at)
WHERE
Calendar.calendar_date >= '2012-05-01'
GROUP BY
Calendar.calendar_date
You'll need to left join against a list of dates. You can either create a table with the dates you need in it, or you can take the dynamic approach I outlined here:
generate days from date range
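One way to build the list of dates dynamically in SQLite is a recursive common table expression. The sketch below is an illustration rather than the approach from the linked answer: it uses Python's sqlite3 module, made-up People rows, and hypothetical parameter names (first_day, last_day) for the date range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (id INTEGER PRIMARY KEY, created_at TEXT)")
# Made-up rows: nothing was created on 2012-05-02 or 2012-05-04.
conn.executemany("INSERT INTO People (created_at) VALUES (?)", [
    ("2012-05-01 09:30:00",),
    ("2012-05-01 17:45:00",),
    ("2012-05-03 08:00:00",),
    ("2012-05-05 12:15:00",),
])

query = """
WITH RECURSIVE calendar(calendar_date) AS (
    SELECT date(:first_day)
    UNION ALL
    SELECT date(calendar_date, '+1 day')
    FROM calendar
    WHERE calendar_date < date(:last_day)   -- stop once the last day is reached
)
SELECT calendar.calendar_date,
       count(People.created_at)             -- counts only matched rows, so gaps give 0
FROM calendar
LEFT JOIN People ON calendar.calendar_date = date(People.created_at)
GROUP BY calendar.calendar_date
ORDER BY calendar.calendar_date
"""
rows = conn.execute(query, {"first_day": "2012-05-01", "last_day": "2012-05-05"}).fetchall()
for calendar_date, created in rows:
    print(calendar_date, created)
```

Counting People.created_at instead of * matters here: count(*) would report 1 for the days that only exist on the calendar side of the left join.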