a question about sql group by - sql

I have a table named visiting that looks like this:
id | visitor_id | visit_time
---+------------+--------------------
 1 | 1          | 2009-01-06 08:45:02
 2 | 1          | 2009-01-06 08:58:11
 3 | 1          | 2009-01-06 09:08:23
 4 | 1          | 2009-01-06 21:55:23
 5 | 1          | 2009-01-06 22:03:35
I want to write a SQL query that gets how many times a user visits within one session (successive visits' interval less than 1 hour).
So, for the example data, I want to get following result:
visitor_id | count
-------------------
1 | 3
1 | 2
BTW, I use postgresql 8.3.
Thanks!
UPDATE: updated the timestamps in the example data table. sorry for the confusion.
UPDATE: I don't care much if the solution is a single sql query, using store procedure, subquery etc. I only care how to get it done :)

The question is slightly ambiguous because you're assuming, or requiring, that the hours start at a set point; a naive query would also produce a result record of (1, 2) for the visits between 08:58 and 09:58. You would have to "tell" your query that, for some determinable reason, the start times are visits 1 and 4, or you'd get the natural result set:
visitor_id | count
--------------------
1 | 3
1 | 2 <- extra result starting at visit 2
1 | 1 <- extra result starting at visit 3
1 | 2
1 | 1 <- extra result starting at visit 5
That extra logic is going to be expensive and is too complicated for my fragile mind this morning; somebody better than me at Postgres can probably solve this.
I would normally want to solve this by having a session-key column in the table that I could cheaply group by, for performance reasons, but there's also a logical problem, I think. Deriving session info from timings seems dangerous to me because I don't believe the user will definitely be logged out after an hour's activity. Most session systems work by expiring the session after a period of inactivity, i.e. it's very likely that a visit after 9:45 is going to be in the same session because your hourly period is going to be reset at 9:08.

The problem seems a little fuzzy.
It gets more complicated as id 3 is within an hour of id 1 and 2, but if the user had visited at 9:50 then that would have been within an hour of 2 but not 1.
You seem to be after a smoothed total - for a given visit, how many visits are within the following hour?
Perhaps you should be asking how many visits have a succeeding visit less than an hour distant? If a visit is less than an hour from the preceding one, then should it 'count'?
So what you probably want is how many chains do you have where the links are less than an arbitrary amount (so the hypothetical 9:50 visit would be included in the chain that starts with id 1).

no simple solution
There is no way to do this in a single SQL statement.
Below are 2 ideas: one uses a loop to count visits, the other changes the way the visiting table is populated.
loop solution
However, it can be done without too much trouble with a loop.
(I have tried to get the postgresql syntax correct, but I'm no expert)
/* find entries where there is no previous entry for */
/* the same visitor within the previous hour -       */
/* these are the session starts:                     */
select v1.*,
       1 as visits,                  /* the session-starting visit itself counts */
       v1.visit_time as last_time    /* most recent visit found for this session */
into temp_table
from visiting v1
where not exists ( select 1
                   from visiting v2
                   where v2.visitor_id = v1.visitor_id
                     and v2.visit_time < v1.visit_time
                     and v1.visit_time - interval '1 hour' < v2.visit_time
                 )

select @rows = @@rowcount

while @rows > 0
begin
    /* attach the next visit that follows last_time by less than an hour */
    update temp_table
    set visits = visits + 1,
        last_time = v.visit_time
    from temp_table t,
         visiting v
    where t.visitor_id = v.visitor_id
      and v.visit_time > t.last_time
      and v.visit_time - interval '1 hour' < t.last_time
      and not exists ( select 1
                       from visiting v2
                       where v2.visitor_id = t.visitor_id
                         and v2.visit_time > t.last_time
                         and v2.visit_time < v.visit_time
                     )

    select @rows = @@rowcount
end

/* get the result: */
select visitor_id,
       visits
from temp_table
The idea here is to do this:
- get all visits where there is no prior visit inside of an hour; this identifies the sessions
- loop, getting the next visit for each of these "first visits", until there are no more "next visits"
- now you can just read off the number of visits in each session.
best solution?
I suggest:
add a column to the visiting table: session_id int not null
change the process which makes the entries so that it checks whether the previous visit by the current visitor was less than an hour ago. If so, it sets session_id to the same as the session_id of that earlier visit; if not, it generates a new session_id.
you could put this logic in a trigger - a sketch follows below.
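A minimal sketch of such a trigger, assuming the session_id column from step 1 and a new sequence (the names here are hypothetical and this is untested):
/* hypothetical names; untested sketch */
create sequence visiting_session_seq;

create or replace function set_session_id() returns trigger as $$
declare
    prev_session int;
begin
    /* most recent session_id for this visitor within the last hour */
    select session_id into prev_session
    from visiting
    where visitor_id = new.visitor_id
      and visit_time > new.visit_time - interval '1 hour'
      and visit_time <= new.visit_time
    order by visit_time desc
    limit 1;

    if prev_session is null then
        new.session_id := nextval('visiting_session_seq');   /* start a new session */
    else
        new.session_id := prev_session;                      /* continue the earlier session */
    end if;

    return new;
end;
$$ language plpgsql;

create trigger visiting_session_trg
before insert on visiting
for each row execute procedure set_session_id();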
Then your original query can be solved by:
SELECT session_id, visitor_id, count(*)
FROM visiting
GROUP BY session_id, visitor_id
Hope this helps. If I've made mistakes (I'm sure I have), leave a comment and I'll correct it.

PostgreSQL 8.4 will have window functions; by then we can eliminate creating a temporary table just to simulate row numbers (for sequencing purposes).
create table visit
(
visitor_id int not null,
visit_time timestamp not null
);
insert into visit(visitor_id, visit_time)
values
(1, '2009-01-06 08:45:02'),
(2, '2009-02-06 08:58:11'),
(1, '2009-01-06 08:58:11'),
(1, '2009-01-06 09:08:23'),
(1, '2009-01-06 21:55:23'),
(2, '2009-02-06 08:59:11'),
(2, '2009-02-07 00:01:00'),
(1, '2009-01-06 22:03:35');
create temp table temp_visit(visitor_id int not null, sequence serial not null, visit_time timestamp not null);
insert into temp_visit(visitor_id, visit_time) select visitor_id, visit_time from visit order by visitor_id, visit_time;
select reference.visitor_id,
       count(nullif(reference.visit_time - prev.visit_time < interval '1 hour', false))
from temp_visit reference
left join temp_visit prev
       on prev.visitor_id = reference.visitor_id
      and prev.sequence = reference.sequence - 1
group by reference.visitor_id;
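For reference, once window functions arrive in 8.4, the same per-visitor count can be computed without the temporary table - a sketch (untested):
select visitor_id,
       count(nullif(visit_time - prev_time < interval '1 hour', false))
from ( select visitor_id,
              visit_time,
              lag(visit_time) over (partition by visitor_id order by visit_time) as prev_time
       from visit
     ) v
group by visitor_id;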

One or both of these may work. However, both will end up giving you more columns in the result than you are asking for.
SELECT visitor_id,
date_part('year', visit_time),
date_part('month', visit_time),
date_part('day', visit_time),
date_part('hour', visit_time),
COUNT(*)
FROM visiting
GROUP BY 1, 2, 3, 4, 5;
SELECT visitor_id,
EXTRACT(EPOCH FROM visit_time)-(EXTRACT(EPOCH FROM visit_time) % 3600),
COUNT(*)
FROM visiting
GROUP BY 1, 2;

This can't be done in a single SQL statement.
The better option is to handle it in a stored procedure.

If it were T-SQL, I would write something as:
SELECT visitor_id, COUNT(id),
DATEPART(yy, visit_time), DATEPART(m, visit_time),
DATEPART(d, visit_time), DATEPART(hh, visit_time)
FROM visiting
GROUP BY
visitor_id,
DATEPART(yy, visit_time), DATEPART(m, visit_time),
DATEPART(d, visit_time), DATEPART(hh, visit_time)
which gives me:
1 3 2009 1 6 8
1 2 2009 1 6 21
I do not know exactly how you would write this in Postgres, though.
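For reference, a rough PostgreSQL equivalent (a sketch, untested) would group on date_trunc:
SELECT visitor_id, COUNT(id), date_trunc('hour', visit_time)
FROM visiting
GROUP BY visitor_id, date_trunc('hour', visit_time);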

Related

SQL Efficient way to loop through sequential time-series data to identify unique users without double counting

I am working on a project where I have users that have 'signed up' during a period from Day1 to Day8. However, due to the circumstances of the issue, a user can 'sign up' more than once. This results in the same user being able to sign up on DayX and DayZ. Note: I am using the latest stable version of PostgreSQL for Windows.
The goal is to count only the number of unique sign-ups for each day, without double counting any users. This means that total sign-ups on Day8 need to take into account sign-ups on Day1-Day7 as well.
The solution I have at the moment works technically, but it is very clunky, takes forever to query and does not scale well. Ideally, the SQL query needs to scale for any time period between time x and time y without having to manually write a block of code for each individual time period.
As you can see from my code below, it technically gives me the right answer, but it is cumbersome, slow and does not scale. Looking for help finding an elegant, scalable solution that does not take 30 minutes to run.
Note: I could write this much more elegantly in Python but am not sure how well Python scales with large datasets stored in RDBMS (ex: Pull all raw data with SQL and then import the CSV into python where a python script will do the calculations instead of doing it in SQL)
TABLE DATA:
+-----------+--------------+-----------------------------------------------+
| cookie_id | time_created | URL |
+-----------+--------------+-----------------------------------------------+
| 3422erq | 2018-10-1 | https:data.join/4wr08w40rwj/utm_source.com |
| 3421ra | 2018-10-1 | https:data.join/convert/45824234/utm_code.com |
| 321af | 2018-10-2 | https:data.join/utm_source=34342.com |
+-----------+--------------+-----------------------------------------------+
SELECT COUNT(DISTINCT cookie_id), time_created FROM Data WHERE url LIKE ('%join%')
AND time_created IN (SELECT MIN(time_created) FROM Data)
GROUP BY time_created
--Code to get all unique users in Day1 (5,304 unique users)
SELECT COUNT(DISTINCT cookie_id), time_created FROM Data WHERE url LIKE ('%join%')
AND time_created IN (SELECT MIN(time_created +1) FROM Data)
AND cookie_id NOT IN (SELECT DISTINCT cookie_id FROM Data WHERE time_created = '2018-10-01')
GROUP BY time_created
--Code to get all unique users in Day2 (9,218 unique users)
SELECT COUNT(DISTINCT cookie_id), time_created FROM Data WHERE url LIKE ('%join%')
AND time_created IN (SELECT MIN(time_created +2) FROM Data)
AND cookie_id NOT IN (SELECT DISTINCT cookie_id FROM Data WHERE time_created BETWEEN '2018-10-01' AND '2018-10-02')
GROUP BY time_created
--Code to get all unique users in Day3 (8,745 unique users)
Expected & actual results are the same. However the code does not scale and is incredibly slow.
So given this table:
CREATE TABLE data
(
cookie_id text,
time_created date,
url text
)
(Yes, no indexes)
I generated 5.5 million rows with random 5-character [0-9A-F] cookie_ids on a random (2018-10-01::date + (10*random())::int) date, with every 100th row having the https:data.join/.... url while the others were some garbage.
Your second query took around 8.5 minutes. This one, on the other hand, took around 0.2s:
with count_per_day as
(
select time_created, count(*) as unique_users from (
select cookie_id
, time_created
, row_number() over (partition by cookie_id order by time_created) occurrence
from data
where url like 'https:data.join%'
and time_created between '2018-10-01' and '2018-10-08'
) oc
where occurrence = 1
group by time_created
)
select time_created, unique_users, sum(unique_users) over (order by time_created) as running_sum
from count_per_day
Again, with no indexes. If you have counts that are orders of magnitude bigger, adding an index on (left(url, 15), time_created, cookie_id) and changing the url condition to left(url, 15) = 'https:data.join' dropped it to below 50ms.
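For reference, a sketch of that index (the name is arbitrary; the expression column needs its own parentheses):
create index data_join_idx on data ((left(url, 15)), time_created, cookie_id);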

SQL Query to find out Sequence in next or subsequent rows of a Table based on a condition

I have an SQL Table with following structure
Timestamp(DATETIME)|AuditEvent
---------|----------
T1|Login
T2|LogOff
T3|Login
T4|Execute
T5|LogOff
T6|Login
T7|Login
T8|Report
T9|LogOff
I want the T-SQL way to find out how long the user has been logged into the system, i.e. the time between the Login time and the LogOff time for each session in a given day.
Day (Date)|UserTime(In Hours) (Logoff Time - LogIn Time)
--------- | -------
Jun 12 | 2
Jun 12 | 3
Jun 13 | 5
I tried using two temporary tables and row numbers but could not get it working, since the comparison is on time, i.e. finding the next LogOff event whose timestamp is greater than the current row's Login event.
You need to group the records. I would suggest counting logins or logoffs. Here is one approach to get the time for each "session":
select min(case when auditevent = 'login' then timestamp end) as login_time,
       max(timestamp) as logoff_time
from (select t.*,
             sum(case when auditevent = 'logoff' then 1 else 0 end) over (order by timestamp desc) as grp
      from t
     ) t
group by grp;
You then have to do whatever you want to get the numbers per day. It is unclear what those counts are.
The subquery does a reverse count. It counts the number of "logoff" records that come on or after each record. For records in the same "session", this count is the same, and suitable for grouping.
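To turn those sessions into per-day hours like the output in the question, one option is to wrap the grouped query - a sketch, assuming SQL Server 2008+ and the same table alias t as above:
select cast(login_time as date) as day,
       datediff(hour, login_time, logoff_time) as user_time_hours
from ( select min(case when auditevent = 'login' then timestamp end) as login_time,
              max(timestamp) as logoff_time
       from (select t.*,
                    sum(case when auditevent = 'logoff' then 1 else 0 end) over (order by timestamp desc) as grp
             from t
            ) t
       group by grp
     ) sessions;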

in redshift, how can I use window functions to assign a count to a previous row's date

the title would be too wordy if I actually tried to cram it all in there but here's what I need help with...
We are trying to calculate retention of users. Our users have assignment start dates and assignment end dates that may overlap. What I need to do is look at all candidate assignments and determine if they are retained (30 days or less between previous end and new start). The tricky part: I need to assign the retention credit to the previous assignment end date. Here's a preview of the data:
month | user_id | start_date | end_date | rank | days_btw_assignment
------+---------+------------+----------+------+--------------------
1     | 5       | 1-1-16     | 1-31-16  | 1    | NULL
2     | 5       | 2-14-16    | 4-15-16  | 2    | 15
6     | 4       | 6-01-16    | 11-01-16 | 1    | NULL
8     | 4       | 8-01-16    | 11-01-16 | 2    | -81
Therefore, for user 5, I would need to give retention credit to the month of Jan-16 because their assignment end date is 1-31-16. For user 4, where their assignments overlap, I would give retention credit to Nov-16 because their previous assignment end date is 11-01-16.
I've restricted this example to use cases where they only have 2 assignments, though, there could be more. I just need a step in the right direction and I can probably handle all other use cases by myself.
Here's the sample code I'm currently using:
with placement_facts as (
    select date_trunc('month', assignment_start_date) as month,
           user_id,
           assignment_start_date,
           assignment_end_date,
           rank() over (partition by user_id order by assignment_start_date asc),
           extract(day from assignment_start_date
                   - lag(assignment_end_date, 1) over (partition by user_id order by assignment_start_date asc)) as time_btw_placement
    from activations as ca
    join offers on ca.offer_id = offers.id
    where assignment_start_date != assignment_end_date
    order by 2, 4 asc
)
select placement_facts.month,
       count(distinct case when time_btw_placement <= 30 then user_id else null end) as retained_raw
from placement_facts
group by 1;
Appreciate the help, and please let me know if I need to clarify anything!
If I understand your question then I think you can achieve what you want by replacing your use of LAG() with LEAD(). It's basically the same function but it looks at a given number of rows ahead.
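A sketch of what that substitution might look like on the inner query (untested; same tables and columns as your CTE), which attributes the gap to the row whose assignment is ending, so the month now comes from the end date:
select date_trunc('month', assignment_end_date) as month,
       user_id,
       assignment_end_date,
       extract(day from lead(assignment_start_date, 1) over (partition by user_id order by assignment_start_date asc)
               - assignment_end_date) as time_to_next_placement
from activations as ca
join offers on ca.offer_id = offers.id
where assignment_start_date != assignment_end_date;
You would then count rows where time_to_next_placement <= 30, grouped by that end-date month.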

Multiple aggregate sums from different conditions in one sql query

While I believe this is a fairly general SQL question, I am working in PostgreSQL 9.4 without the option to use other database software, and thus request that any answer be compatible with its capabilities.
I need to be able to return multiple aggregate totals from one query, such that each sum is in a new row, and each of the groupings is determined by a unique span of time, e.g. WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'. The number of records that satisfy that WHERE clause is unknown and may be zero, in which case ideally the result is "0". This is what I have worked out so far:
(
SELECT SUM(minutes) AS min
FROM downtime
WHERE time_stamp BETWEEN '2016-02-07' AND '2016-02-14'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-14' AND '2016-02-21'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-02-28' AND '2016-03-06'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-06' AND '2016-03-13'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-13' AND '2016-03-20'
)
UNION ALL
(
SELECT SUM(minutes)
FROM downtime
WHERE time_stamp BETWEEN '2016-03-20' AND '2016-03-27'
)
Result:
min
---+-----
1 | 119
2 | 4
3 | 30
4 |
5 | 62
6 | 350
That query gets me almost the exact result that I want; certainly good enough in that I can do exactly what I need with the results. Time spans with no records are blank, but that was predictable, and while I would prefer "0" I can account for the blank rows in software.
But, while it isn't terrible for the 6 weeks that it represents, I want to be flexible and be able to do the same thing for different time spans and for a different number of data points, such as each day in a week, each week in 3 months, 6 months, each month in 1 year, 2 years, etc. As written above, it feels as if it is going to get tedious fast... for instance, 1-week spans over a 2-year period means 104 sub-queries.
What I'm after is a more elegant way to get the same (or similar) result.
I also don't know if doing 104 iterations of a similar query to the above (vs. the 6 that it does now) is a particularly efficient usage.
Ultimately I am going to write some code which will help me build (and thus abstract away) the long, ugly query--but it would still be great to have a more concise and scale-able query.
In Postgres, you can generate a series of times and then use these for the aggregation:
select g.dte, coalesce(sum(dt.minutes), 0) as minutes
from generate_series('2016-02-07'::timestamp, '2016-03-20'::timestamp, interval '7 day') g(dte) left join
     downtime dt
     on dt.time_stamp >= g.dte and dt.time_stamp < g.dte + interval '7 day'
group by g.dte
order by g.dte;

How do I analyse time periods between records in SQL data without cursors?

The root problem: I have an application which has been running for several months now. Users have been reporting that it's been slowing down over time (so in May it was quicker than it is now). I need to get some evidence to support or refute this claim. I'm not interested in precise numbers (so I don't need to know that a login took 10 seconds), I'm interested in trends - that something which used to take x seconds now takes of the order of y seconds.
The data I have is an audit table which stores a single row each time the user carries out any activity - it includes a primary key, the user id, a date time stamp and an activity code:
create table AuditData (
AuditRecordID int identity(1,1) not null,
DateTimeStamp datetime not null,
DateOnly datetime null,
UserID nvarchar(10) not null,
ActivityCode int not null)
(Notes: DateOnly (datetime) is the DateTimeStamp with the time stripped off, to make grouping by day easier - it's effectively duplicate data to make querying faster. Also, for the sake of ease, you can assume that the ID is assigned in date-time order, that is, 1 will always be before 2, which will always be before 3 - if this isn't true I can make it so.)
ActivityCode is an integer identifying the activity which took place, for instance 1 might be user logged in, 2 might be user data returned, 3 might be search results returned and so on.
Sample data for those who like that sort of thing...:
1, 01/01/2009 12:39, 01/01/2009, P123, 1
2, 01/01/2009 12:40, 01/01/2009, P123, 2
3, 01/01/2009 12:47, 01/01/2009, P123, 3
4, 01/01/2009 13:01, 01/01/2009, P123, 3
User data is returned (Activity Code 2) immediately after login (Activity Code 1), so this can be used as a rough benchmark of how long the login takes (as I said, I'm interested in trends, so as long as I'm measuring the same thing for May as for July it doesn't matter so much if this isn't the whole login process - it takes in enough of it to give a rough idea).
(Note: User data can also be returned under other circumstances so it's not a one to one mapping).
So what I'm looking to do is select the average time between login (say ActivityCode 1) and the first instance of user data being returned (say ActivityCode 2) after that, for that user on that day.
I can do this by going through the table with a cursor, getting each login instance, and then doing a select to get the earliest user-data return following it for that user on that day, but that's obviously not optimal and is slow as hell.
My question is (finally) - is there a "proper" SQL way of doing this using self joins or similar without using cursors or some similar procedural approach? I can create views and whatever to my hearts content, it doesn't have to be a single select.
I can hack something together but I'd like to make the analysis I'm doing a standard product function so would like it to be right.
SELECT TheDay, AVG(TimeTaken) AvgTimeTaken
FROM (
SELECT
CONVERT(DATE, logins.DateTimeStamp) TheDay
, DATEDIFF(SS, logins.DateTimeStamp,
(SELECT TOP 1 DateTimeStamp
FROM AuditData userinfo
WHERE UserID=logins.UserID
and userinfo.ActivityCode=2
and userinfo.DateTimeStamp > logins.DateTimeStamp )
)TimeTaken
FROM AuditData logins
WHERE
logins.ActivityCode = 1
) LogInTimes
GROUP BY TheDay
This might be dead slow in the real world, though.
In Oracle this would be a cinch, because of analytic functions. In this case, LAG() makes it easy to find the matching pairs of activity codes 1 and 2 and also to calculate the trend. As you can see, things got worse on 2nd JAN and improved quite a bit on the 3rd (I'm working in seconds rather than minutes).
select DateOnly
     , elapsed_time
     , elapsed_time - lag(elapsed_time) over (order by DateOnly) as trend
from
(
  select DateOnly
       , avg(databack_time - prior_login_time) as elapsed_time
  from
  ( select DateOnly
         , databack_time
         , ActivityCode
         , lag(login_time) over (order by DateOnly, UserID, AuditRecordID, ActivityCode) as prior_login_time
    from
    (
      select a1.AuditRecordID
           , a1.DateOnly
           , a1.UserID
           , a1.ActivityCode
           , to_number(to_char(a1.DateTimeStamp, 'SSSSS')) as login_time
           , 0 as databack_time
      from AuditData a1
      where a1.ActivityCode = 1
      union all
      select a2.AuditRecordID
           , a2.DateOnly
           , a2.UserID
           , a2.ActivityCode
           , 0 as login_time
           , to_number(to_char(a2.DateTimeStamp, 'SSSSS')) as databack_time
      from AuditData a2
      where a2.ActivityCode = 2
    )
  )
  where ActivityCode = 2
  group by DateOnly
)

DATEONLY    ELAPSED_TIME      TREND
---------   ------------  ---------
01-JAN-09            120
02-JAN-09            600        480
03-JAN-09            150       -450
Like I said in my comment I guess you're working in MSSQL. I don't know whether that product has any equivalent of LAG().
If the assumptions are that:
Users will perform various tasks in no mandated order, and
That the difference between any two activities reflects the time it takes for the first of those two activities to execute,
Then why not create a table with two timestamps: the first column containing the activity start time, and the second column containing the next activity's start time. The difference between these two will then always be the total time of the first activity. So for the logout activity, you would just have NULL for the second column.
It would be kind of weird and interesting: for each activity (other than logging in and logging out), the timestamp would be recorded in two different rows - once for the last activity (as the time "completed") and again in a new row (as the time started). You would end up with a Jacob's ladder of sorts, but finding the data you are after would be much simpler.
In fact, to get really wacky, you could have each row have the time that the user started activity A and the activity code, and the time started activity B and the time stamp (which, as mentioned above, gets put down again for the following row). This way each row will tell you the exact difference in time for any two activities.
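A sketch of the suggested layout (hypothetical table and column names, reusing the AuditData column types described above):
create table ActivityTimings (
    UserID nvarchar(10) not null,
    ActivityCode int not null,
    StartTime datetime not null,
    NextStartTime datetime null   -- NULL for the final (log off) activity
)

-- elapsed time per activity then falls out of a single row:
select ActivityCode, datediff(ss, StartTime, NextStartTime) as ElapsedSeconds
from ActivityTimings
where NextStartTime is not null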
Otherwise, you're stuck with a query that says something like
SELECT TIME_IN_SEC(row2-timestamp) - TIME_IN_SEC(row1-timestamp)
which would be pretty slow, as you have already suggested. By swallowing the redundancy, you end up just querying the difference between the two columns. You probably would have less need of knowing the user info as well, since you'd know that any row shows both activity codes, thus you can just query the average for all users on any given day and compare it to the next day (unless you are trying to find out which users are having the problem as well).
This is the faster query to find out: in one row you will have the current row's and the previous row's datetime values, and after that you can use DATEDIFF(datepart, startdate, enddate). I use @DammyVariable and DammyField because, as I remember, there is some problem if @variable = Field is not first in the update statement.
SELECT *, Cast(NULL AS DateTime) LastRowDateTime, Cast(NULL AS INT) DammyField INTO #T FROM AuditData
GO
CREATE CLUSTERED INDEX IX_T ON #T (AuditRecordID)
GO
DECLARE @LastRowDateTime DateTime
DECLARE @DammyVariable INT
SET @LastRowDateTime = NULL
SET @DammyVariable = 1
UPDATE #T SET
    @DammyVariable = DammyField = @DammyVariable
  , LastRowDateTime = @LastRowDateTime
  , @LastRowDateTime = DateTimeStamp
OPTION (MAXDOP 1)