How to aggregate unique users by hour in Amazon Redshift? - sql

With Amazon Redshift I want to count every unique visitor.
A visit counts as a unique visitor only if the same user has not visited within the previous hour.
So for the following rows of users and timestamps we'd get a total count of 4 unique visitors, with user1 and user2 each counting as 2.
Please note that I do not want to aggregate by clock hour within a 24-hour day. I want to aggregate by the hour following the timestamp of the user's first visit.
I'm guessing a straight-up SQL expression won't do it.
user1,"2015-07-13 08:28:45.247000"
user1,"2015-07-13 08:30:17.247000"
user1,"2015-07-13 09:35:00.030000"
user1,"2015-07-13 09:54:00.652000"
user2,"2015-07-13 08:28:45.247000"
user2,"2015-07-13 08:30:17.247000"
user2,"2015-07-13 09:35:00.030000"
user2,"2015-07-13 09:54:00.652000"
So user1 arrives at 8:28, which counts as one hit. He comes back at 8:30, which counts as zero. He then comes back at 9:35, which is more than an hour after 8:30, so he gets another hit. Then he comes back at 9:54, which is only 19 minutes after the previous visit at 9:35, so this counts as zero. The total is 2 hits for user1. The same thing happens for user2, so each user has two hits, bringing the final total to 4.

You can use lag to accomplish this. However, you may also have to handle the end of day, for example by partitioning on the day as well. The query below would be a starting point.
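with prev as (
select user_id,
datecol,
-- previous visit by the same user; NULL for the user's first visit
lag(datecol) over (partition by user_id order by datecol) as prev_visit
from tablename
)
select user_id,
-- count the first visit, plus every visit at least 60 minutes after the previous one
sum(case when prev_visit is null or datediff(minute, prev_visit, datecol) >= 60 then 1 else 0 end) as totalvisits
from prev
group by user_id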
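The rule from the question (a visit is a hit if it is the user's first, or if it comes at least an hour after that user's previous visit) can be checked against the sample rows with a small Python sketch. The function name `count_hits` and the in-memory row format are assumptions made for illustration; this is not Redshift code.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def count_hits(rows, gap=timedelta(hours=1)):
    """Count hits per user: the first visit, plus any visit at least `gap` after the previous one."""
    visits = defaultdict(list)
    for user, ts in rows:
        visits[user].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"))
    hits = {}
    for user, times in visits.items():
        times.sort()
        count = 0
        prev = None  # plays the role of lag(): None for the user's first visit
        for t in times:
            if prev is None or t - prev >= gap:
                count += 1
            prev = t
        hits[user] = count
    return hits

# Sample data from the question: the same four timestamps for both users.
stamps = [
    "2015-07-13 08:28:45.247000",
    "2015-07-13 08:30:17.247000",
    "2015-07-13 09:35:00.030000",
    "2015-07-13 09:54:00.652000",
]
rows = [(user, ts) for user in ("user1", "user2") for ts in stamps]
hits = count_hits(rows)
print(hits)                # {'user1': 2, 'user2': 2}
print(sum(hits.values()))  # 4
```

Each visit is compared with the immediately preceding visit, as in the question's walkthrough, rather than with the first visit of a fixed one-hour window.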

Related

Running a Count on an Interval

I'm trying to do an alert of sorts for customers joining. The alert needs to run on an interval of one hour, which is possible with an integration we have.
The sample data is this:
Name|Time
---------|----------
John|2022-04-21T13:49:51
Mary|2022-04-23T13:49:51
Dave|2022-04-25T13:49:51
Gregg|2022-04-27T13:49:51
The problem with the query below is that it only captures the count within the last hour, so it will yield no results. What I'm trying to determine is the moment (well, within the hour) at which the total count crosses above 3. Is there something I'm missing?
SELECT COUNT (name)
FROM Table
WHERE Time >= TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -60 MINUTE)
HAVING COUNT (NAME) > 3
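One possible reading of the requirement is: alert when the running total is above the threshold now but was not yet above it an hour ago. The Python sketch below illustrates that comparison on the sample data. The function name `crossed_in_last_hour` and its parameters are hypothetical, and this is only one interpretation of what the alert should do.

```python
from datetime import datetime, timedelta

def crossed_in_last_hour(join_times, now, threshold=3, window=timedelta(hours=1)):
    """True if the running total went above `threshold` during the last `window`."""
    cutoff = now - window
    total_now = sum(1 for t in join_times if t <= now)
    total_before = sum(1 for t in join_times if t < cutoff)
    return total_before <= threshold < total_now

joins = [datetime.fromisoformat(s) for s in (
    "2022-04-21T13:49:51",
    "2022-04-23T13:49:51",
    "2022-04-25T13:49:51",
    "2022-04-27T13:49:51",
)]
# Half an hour after Gregg joins, the total has just gone from 3 to 4.
print(crossed_in_last_hour(joins, datetime(2022, 4, 27, 14, 20)))  # True
# A day later the total is still 4, but nothing crossed in the last hour.
print(crossed_in_last_hour(joins, datetime(2022, 4, 28, 14, 20)))  # False
```

The point of the sketch is that two counts are needed (total so far, and total as of one hour ago), not only the count of rows inside the last hour.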

compute url rating in the particular time period

I am new to Hive and SQL, and I am having difficulty writing a query in Hive. I need to calculate the ranking of visits to URLs on weekends between 08:00 and 17:59 (working hours) and between 18:00 and 07:59 (free hours). I have a database which stores userId, timestamp (when the user visited the URL), and url. I need to select the url and these two rankings of visits, one for working hours and one for free hours on weekends. I don't understand how to do that. I suppose I have to use WHERE to keep only timestamps that fall on a weekend, but I don't understand how to compute the two rankings of visits.
Example of a row in the db:
1234543 1419638963 site.com
where
userId = 1234543
unix timestamp = 1419638963
url = site.com
I would be very grateful for any help!
You can use the SQL below.
I am a little unsure how the ranking should work, because a rank can be based on many things. I simply ordered the data by visit count; see if it suits your requirement.
I used from_unixtime(ux_time) to calculate the date and time from your data. overall_rank should give you the overall rank based on both counts.
select
url,
cnt_working,
cnt_free,
row_number() over (order by cnt_working,cnt_free) overall_rank
from
(
select
sum(case when hour(from_unixtime(ux_time)) >=8 and hour(from_unixtime(ux_time)) < 18 then 1 else 0 end ) cnt_working,
sum(case when hour(from_unixtime(ux_time)) >=8 and hour(from_unixtime(ux_time)) < 18 then 0 else 1 end ) cnt_free,
url
from mytable
where extract(dayofweek FROM (from_unixtime(ux_time) )) IN (1,7) -- to include only sat,sun
group by url
) rs
order by 2,3
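The bucketing that the two CASE expressions perform can be sketched in Python to make the boundaries explicit. The function names `weekend_bucket` and `weekend_counts` are hypothetical, and the sketch assumes timestamps are interpreted in UTC, whereas Hive's from_unixtime uses the session time zone.

```python
from collections import Counter
from datetime import datetime, timezone

def weekend_bucket(unix_ts):
    """Return 'working', 'free', or None (not a weekend) for a Unix timestamp, in UTC."""
    dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    if dt.weekday() < 5:  # Monday=0 ... Friday=4
        return None
    # 08:00-17:59 is working time; everything else on the weekend is free time.
    return "working" if 8 <= dt.hour < 18 else "free"

def weekend_counts(rows):
    """Count weekend visits per (url, bucket) from (user_id, unix_ts, url) rows."""
    counts = Counter()
    for _user_id, unix_ts, url in rows:
        bucket = weekend_bucket(unix_ts)
        if bucket is not None:
            counts[(url, bucket)] += 1
    return counts

# The example row from the question: 2014-12-27 00:09:23 UTC, a Saturday.
print(weekend_bucket(1419638963))  # free
print(weekend_counts([(1234543, 1419638963, "site.com")]))
```

Sorting the resulting counts per bucket, highest first, would then give one ranking for working hours and one for free hours.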

SQL to calculate daily average for a column with records of some dates are missing

I have table like below:
I have to calculate the daily average session count for each user.
First I calculated the total number of sessions per day for every user, and from that I tried to calculate the average daily sessions. I understand that this won't work, since not every user has sessions on every date: some dates are missing for each user. Is there any way to calculate the daily average when there are no entries for some dates?
WITH daily_count AS
(
SELECT user_id, to_char(local_time,'MM/DD/YYYY') AS Date, count(session_id) AS total_count
FROM table_name
GROUP BY user_id, to_char(local_time,'MM/DD/YYYY')
)
SELECT user_id , AVG(total_count) AS average_session_count
FROM daily_count
GROUP BY user_id
For example, the max date in the table above is Feb 04 and the min date is Jan 31, so the total number of days is 5. User 1 has records for only 2 dates, so the query I wrote calculates the average over 2 days, not 5. How can I make it calculate the average over 5 days?
If the number of sessions for one user on dates 1, 2, 3 is 1, 0 (no sessions), 5, what output do you want for the average sessions: 2 or 3?
You need to change your main query as follows:
SELECT USER_ID,
AVG(TOTAL_COUNT) AS AVERAGE_SESSION_COUNT -- if you want output as 3 then use this
--SUM(TOTAL_COUNT)/(MAX(DATE) - MIN(DATE) + 1) -- if you want output as 2 then use this (DATE must be a real date type here, not the to_char string)
FROM DAILY_COUNT
GROUP BY USER_ID
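The difference between the two averages in the 1, 0, 5 example can be made concrete with a short Python sketch. The function name `average_daily_sessions` and the specific dates are hypothetical; the sketch assumes the date range is supplied explicitly (for instance the overall min and max dates in the table), so that days with no rows count as zero.

```python
from datetime import date

def average_daily_sessions(daily_counts, first_day, last_day):
    """Average sessions per day over the whole range, counting missing days as 0."""
    total_days = (last_day - first_day).days + 1
    return sum(daily_counts.values()) / total_days

# Hypothetical user with sessions on day 1 and day 3 only (1, 0, 5).
counts = {date(2020, 2, 1): 1, date(2020, 2, 3): 5}
present_only = sum(counts.values()) / len(counts)  # what AVG over existing rows gives
full_range = average_daily_sessions(counts, date(2020, 2, 1), date(2020, 2, 3))
print(present_only)  # 3.0
print(full_range)    # 2.0
```

Passing the table-wide first and last dates instead of the user's own would give the 5-day denominator described in the question.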

SQL Query to find out Sequence in next or subsequent rows of a Table based on a condition

I have an SQL Table with following structure
Timestamp(DATETIME)|AuditEvent
---------|----------
T1|Login
T2|LogOff
T3|Login
T4|Execute
T5|LogOff
T6|Login
T7|Login
T8|Report
T9|LogOff
I want a T-SQL way to find out how long the user was logged into the system, i.e. the time between Login and LogOff for each session in a given day.
Day (Date)|UserTime(In Hours) (Logoff Time - LogIn Time)
--------- | -------
Jun 12 | 2
Jun 12 | 3
Jun 13 | 5
I tried using two temporary tables and row numbers but could not get it to work, since the comparison is on time, i.e. finding the next LogOff event whose timestamp is greater than the current row's Login event.
You need to group the records. I would suggest counting logins or logoffs. Here is one approach to get the time for each "session":
select min(case when auditevent = 'login' then timestamp end) as login_time,
max(timestamp) as logoff_time
from (select t.*,
sum(case when auditevent = 'logoff' then 1 else 0 end) over (order by timestamp desc) as grp
from t
) t
group by grp;
You then have to do whatever you want to get the numbers per day. It is unclear what those counts are.
The subquery does a reverse count. It counts the number of "logoff" records that come on or after each record. For records in the same "session", this count is the same, and suitable for grouping.
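The reverse count can be traced by hand in a small Python sketch using the table from the question, with T1..T9 represented as the integers 1..9. The function name `sessions` is hypothetical, and the sketch only mirrors the grouping logic; it is not T-SQL.

```python
def sessions(events):
    """Group (timestamp, event) rows into sessions using a reverse count of LogOff rows."""
    groups = {}
    logoffs_seen = 0
    # Walk from latest to earliest, like `over (order by timestamp desc)`.
    for ts, event in sorted(events, reverse=True):
        if event.lower() == "logoff":
            logoffs_seen += 1
        groups.setdefault(logoffs_seen, []).append((ts, event))
    result = []
    for rows in groups.values():
        logins = [ts for ts, event in rows if event.lower() == "login"]
        login_time = min(logins) if logins else None  # earliest Login in the group
        logoff_time = max(ts for ts, _ in rows)       # latest row in the group
        result.append((login_time, logoff_time))
    return sorted(result, key=lambda pair: pair[1])

events = [(1, "Login"), (2, "LogOff"), (3, "Login"), (4, "Execute"), (5, "LogOff"),
          (6, "Login"), (7, "Login"), (8, "Report"), (9, "LogOff")]
print(sessions(events))  # [(1, 2), (3, 5), (6, 9)]
```

Note that for the repeated Login at T6 and T7 the earlier one is taken as the session start, matching the min() in the query.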

Query to find all timestamps more than a certain interval apart

I'm using postgres to run some analytics on user activity. I have a table of all requests(pageviews) made by every user and the timestamp of the request, and I'm trying to find the number of distinct sessions for every user. For the sake of simplicity, I'm considering every set of requests an hour or more apart from others as a distinct session. The data looks something like this:
id|request_time|user_id
1|2014-01-12 08:57:16.725533|1233
2|2014-01-12 08:57:20.944193|1234
3|2014-01-12 09:15:59.713456|1233
4|2014-01-12 10:58:59.713456|1234
How can I write a query to get the number of sessions per user?
To start a new session after every gap >= 1 hour:
SELECT user_id, count(*) AS distinct_sessions
FROM (
SELECT user_id
,(lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
ORDER BY request_time)
<= request_time - '1h'::interval) AS step -- start new session
FROM tbl
) sub
WHERE step
GROUP BY user_id
ORDER BY user_id;
This assumes request_time is NOT NULL.
Explanation:
In subquery sub, check for every row whether a new session begins. The third parameter of lag() provides the default -infinity, which is lower than any timestamp and therefore always starts a new session for the first row.
In the outer query, count how many times a new session started: eliminate rows where step = FALSE and count per user.
Alternative interpretation
If you really wanted to count hours where at least one request happened (I don't think you do, but another answer assumes as much), you would:
SELECT user_id
, count(DISTINCT date_trunc('hour', request_time)) AS hours_with_req
FROM tbl
GROUP BY 1
ORDER BY 1;
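To see what this alternative interpretation returns on the sample data, here is a minimal Python sketch of the same bucketing. The function name `hours_with_requests` is hypothetical; truncating to the hour with `replace` is meant to mirror date_trunc('hour', ...).

```python
from collections import defaultdict
from datetime import datetime

def hours_with_requests(rows):
    """Count distinct clock hours containing at least one request, per user."""
    hours = defaultdict(set)
    for user_id, request_time in rows:
        ts = datetime.strptime(request_time, "%Y-%m-%d %H:%M:%S.%f")
        # Truncate to the start of the clock hour.
        hours[user_id].add(ts.replace(minute=0, second=0, microsecond=0))
    return {user_id: len(buckets) for user_id, buckets in sorted(hours.items())}

rows = [
    (1233, "2014-01-12 08:57:16.725533"),
    (1234, "2014-01-12 08:57:20.944193"),
    (1233, "2014-01-12 09:15:59.713456"),
    (1234, "2014-01-12 10:58:59.713456"),
]
print(hours_with_requests(rows))  # {1233: 2, 1234: 2}
```

User 1233's two requests are less than 19 minutes apart, yet they fall in two different clock hours, which is why this count is not the same thing as counting gaps of an hour or more.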