Count visits on a daily basis with one SELECT - SQL

I have this table in my PostgreSQL:
CREATE TABLE visits(
id BIGSERIAL NOT NULL PRIMARY KEY,
timeslot TSRANGE NOT NULL,
user_id INTEGER NOT NULL REFERENCES users(id),
CONSTRAINT overlapping_timeslots EXCLUDE USING GIST (
user_id WITH =,
timeslot WITH &&
));
With this Data:
| id | timeslot | user_id |
| 1 | 10.02.2014 10:00 - 10.02.2014 17:00 | 2 |
| 2 | 10.02.2014 18:00 - 10.02.2014 19:00 | 2 |
| 3 | 11.02.2014 01:00 - 11.02.2014 02:00 | 2 |
| 4 | 10.02.2014 12:00 - 11.02.2014 17:00 | 2 |
| 5 | 11.02.2014 12:00 - 11.02.2014 12:30 | 2 |
I need to know how many users visited my shop every day. If a user visits the shop twice a day, it should be counted twice.
In the example above it should be:
Users at 10.02 = 3 (ID: 1,2,4)
Users at 11.02 = 3 (ID: 3,4,5)

Assuming an arbitrary period of time for lack of definition for "every day":
SELECT day, count(*) AS visits, array_agg(id) AS ids
FROM generate_series ('2014-02-10'::date
, '2014-02-12'::date
, interval '1 day') AS d(day)
JOIN visits ON tsrange(day::timestamp
, day::timestamp + interval '1 day') && timeslot
GROUP BY 1;
&& is the "overlap" operator for range types.
Use a LEFT JOIN to include days with 0 visits in the result.
-> SQLfiddle demo.
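The day-bucketing logic (a visit counts for every day its timeslot overlaps) can be sketched outside the database in plain Python. The rows below are hand-copied from the sample data, and the half-open interval test mirrors the && operator:

```python
from datetime import datetime, timedelta

# Sample rows from the question: (id, timeslot start, timeslot end), all user_id = 2
visits = [
    (1, datetime(2014, 2, 10, 10), datetime(2014, 2, 10, 17)),
    (2, datetime(2014, 2, 10, 18), datetime(2014, 2, 10, 19)),
    (3, datetime(2014, 2, 11, 1),  datetime(2014, 2, 11, 2)),
    (4, datetime(2014, 2, 10, 12), datetime(2014, 2, 11, 17)),
    (5, datetime(2014, 2, 11, 12), datetime(2014, 2, 11, 12, 30)),
]

def visits_on(day):
    """IDs of visits whose timeslot overlaps [day, day + 1 day), like tsrange &&."""
    end = day + timedelta(days=1)
    return [vid for vid, lo, hi in visits if lo < end and hi > day]

print(visits_on(datetime(2014, 2, 10)))  # [1, 2, 4]
print(visits_on(datetime(2014, 2, 11)))  # [3, 4, 5]
```

Visit 4 spans midnight and is counted on both days, matching the expected output in the question.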

SELECT user_id, lower(timeslot)::date AS date_visit, COUNT(*) AS day_visit
FROM visits
GROUP BY user_id, lower(timeslot)::date
UNION
SELECT user_id, upper(timeslot)::date AS date_visit, COUNT(*) AS day_visit
FROM visits
GROUP BY user_id, upper(timeslot)::date
(Since timeslot is a tsrange, use lower()/upper() to get its bounds; string functions like LEFT() would need a text cast.) Remove the user_id column from the SELECT and GROUP BY if you want the count of all users in a day.

Related

Finding total session time of a user in postgres

I am trying to create a query that will give me a column of total time logged in for each month for each user.
username | auth_event_type | time | credential_id
Joe | 1 | 2021-11-01 09:00:00 | 44
Joe | 2 | 2021-11-01 10:00:00 | 44
Jeff | 1 | 2021-11-01 11:00:00 | 45
Jeff | 2 | 2021-11-01 12:00:00 | 45
Joe | 1 | 2021-11-01 12:00:00 | 46
Joe | 2 | 2021-11-01 12:30:00 | 46
Joe | 1 | 2021-12-06 14:30:00 | 47
Joe | 2 | 2021-12-06 15:30:00 | 47
The auth_event_type column specifies whether the event was a login (1) or logout (2) and the credential_id indicates the session.
I'm trying to create a query that would have an output like this:
username | year_month | total_time
Joe | 2021-11 | 1:30
Jeff | 2021-11 | 1:00
Joe | 2021-12 | 1:00
How would I go about doing this in postgres? I am thinking it would involve a window function? If someone could point me in the right direction that would be great. Thank you.
Solution 1 partially working
Not sure that window functions will help you in your case, but aggregate functions will:
WITH list AS
(
SELECT username
, date_trunc('month', time) AS year_month
, max(time ORDER BY time) - min(time ORDER BY time) AS session_duration
FROM your_table
GROUP BY username, date_trunc('month', time), credential_id
)
SELECT username
, to_char (year_month, 'YYYY-MM') AS year_month
, sum(session_duration) AS total_time
FROM list
GROUP BY username, year_month
The first part of the query aggregates the login/logout times for the same username and credential_id; the second part sums, per year_month, the difference between the login and logout times. This query works well as long as the login time and logout time fall in the same month, but it fails when they don't.
Solution 2 fully working
In order to calculate the total_time per username and per month whatever the login time and logout time are, we can use a time range approach which intersects the session ranges [login_time, logout_time) with the monthly ranges [monthly_start_time, monthly_end_time) :
WITH monthly_range AS
(
SELECT to_char(m.month_start_date, 'YYYY-MM') AS month
, tsrange(m.month_start_date, m.month_start_date+ interval '1 month' ) AS monthly_range
FROM
( SELECT generate_series(min(date_trunc('month', time)), max(date_trunc('month', time)), '1 month') AS month_start_date
FROM your_table
) AS m
), session_range AS
(
SELECT username
, tsrange(min(time ORDER BY auth_event_type), max(time ORDER BY auth_event_type)) AS session_range
FROM your_table
GROUP BY username, credential_id
)
SELECT s.username
, m.month
, sum(upper(p.period) - lower(p.period)) AS total_time
FROM monthly_range AS m
INNER JOIN session_range AS s
ON s.session_range && m.monthly_range
CROSS JOIN LATERAL (SELECT s.session_range * m.monthly_range AS period) AS p
GROUP BY s.username, m.month
see the result in dbfiddle
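The month-clipping idea behind Solution 2 can be sketched in plain Python. The sessions below are hand-built from the sample rows, and the clipping loop stands in for intersecting each session range with the monthly ranges:

```python
from datetime import datetime

# Sessions assembled from the sample rows: (username, login, logout)
sessions = [
    ("Joe",  datetime(2021, 11, 1, 9),      datetime(2021, 11, 1, 10)),
    ("Jeff", datetime(2021, 11, 1, 11),     datetime(2021, 11, 1, 12)),
    ("Joe",  datetime(2021, 11, 1, 12),     datetime(2021, 11, 1, 12, 30)),
    ("Joe",  datetime(2021, 12, 6, 14, 30), datetime(2021, 12, 6, 15, 30)),
]

def month_bounds(dt):
    """[start, end) of the month containing dt."""
    start = dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    nxt = start.replace(year=start.year + 1, month=1) if start.month == 12 \
        else start.replace(month=start.month + 1)
    return start, nxt

totals = {}
for user, login, logout in sessions:
    cur = login
    while cur < logout:  # clip the session against each month it touches
        m_start, m_end = month_bounds(cur)
        chunk_end = min(logout, m_end)
        key = (user, m_start.strftime("%Y-%m"))
        totals[key] = totals.get(key, 0) + (chunk_end - cur).total_seconds() / 60
        cur = chunk_end

print(totals)  # {('Joe', '2021-11'): 90.0, ('Jeff', '2021-11'): 60.0, ('Joe', '2021-12'): 60.0}
```

A session that crosses a month boundary is split and its minutes are credited to both months, which is exactly the case Solution 1 gets wrong.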
Use the window function lag() partitioned by credential_id and ordered by time, e.g.
WITH j AS (
SELECT username, time, age(time, LAG(time) OVER w)
FROM t
WINDOW w AS (PARTITION BY credential_id ORDER BY time
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
SELECT username, to_char(time,'yyyy-mm'),sum(age) FROM j
GROUP BY 1,2;
Note: the frame ROWS BETWEEN 1 PRECEDING AND CURRENT ROW is pretty much optional in this case, but it is considered a good practice to keep window functions as explicit as possible, so that in the future you don't have to read the docs to figure out what your query is doing.
Demo: db<>fiddle

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So for example, for all users with one or more check-ins, I want the percentage of users who did a check-in on their 2nd day, on their 3rd day, and so on.
My SQL skills are pretty basic as it's not a tool that I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date |
=====================================
| 1 | 2020-09-02 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 12:00:00 |
-------------------------------------
| 1 | 2020-09-04 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 11:00:00 |
-------------------------------------
| ... |
-------------------------------------
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
=================================
| 70% | 67 % | 44% | 32% |
---------------------------------
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
select sum( (ci.date = ci.min_date)::numeric ) / u.num_users as day_0,
sum( (ci.date = ci.min_date + interval '1 day')::numeric ) / u.num_users as day_1,
sum( (ci.date = ci.min_date + interval '2 day')::numeric ) / u.num_users as day_2
from (select u.*, count(*) over () as num_users
from users u
) u left join
(select ci.user_id, ci.date::date as date,
min(min(date::date)) over (partition by user_id order by ci.date::date) as min_date
from check_in ci
group by user_id, ci.date::date
) ci on ci.user_id = u.id
group by u.num_users;
Note that this aggregates the check_in table by user id and date, which ensures that there is only one row per user per date.
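As a cross-check, the retention fractions can be computed in plain Python. The check-ins below are the sample rows truncated to dates, and day_n retention is taken as the share of all users with a check-in exactly n days after their first one (an assumption, since the question leaves the denominator open):

```python
from datetime import date

# Sample check_in rows truncated to dates: (user_id, date)
checkins = [
    (1, date(2020, 9, 2)), (4, date(2020, 9, 4)),
    (1, date(2020, 9, 4)), (4, date(2020, 9, 4)),
]

# First check-in date per user (their "day 0")
first = {}
for uid, d in checkins:
    first[uid] = min(first.get(uid, d), d)

n_users = len(first)

def retention(day_n):
    """Fraction of users who checked in exactly day_n days after their first check-in."""
    hit = {uid for uid, d in checkins if (d - first[uid]).days == day_n}
    return len(hit) / n_users

print(retention(0))  # 1.0 -- every user checks in on their own day 0
print(retention(2))  # 0.5 -- only user 1 returned two days later
```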

Calculating working minutes for Normal and Night Shift

I am making a query to fetch the working minutes for employees. The problem I have is the Night Shift. I know that I need to subtract the "ShiftStartMinutesFromMidnight" but I can't find the right logic.
NOTE: I can't change the database, I can only use the data from it.
Let's say I have these records.
+----+--------------------------+----------+
| ID | EventTime | ReaderNo |
-----+--------------------------+----------+
| 1 | 2019-12-04 11:28:46.000 | In |
| 1 | 2019-12-04 12:36:17.000 | Out |
| 1 | 2019-12-04 12:39:23.000 | In |
| 1 | 2019-12-04 12:51:21.000 | Out |
| 1 | 2019-12-05 07:37:49.000 | In |
| 1 | 2019-12-05 08:01:22.000 | Out |
| 2 | 2019-12-04 22:11:46.000 | In |
| 2 | 2019-12-04 23:06:17.000 | Out |
| 2 | 2019-12-04 23:34:23.000 | In |
| 2 | 2019-12-05 01:32:21.000 | Out |
| 2 | 2019-12-05 01:38:49.000 | In |
| 2 | 2019-12-05 06:32:22.000 | Out |
-----+--------------------------+----------+
WITH CT AS (SELECT
EIn.PSNID, EIn.PSNNAME
,CAST(DATEADD(minute, -0, EIn.EventTime) AS date) AS dt
,EIn.EventTime AS LogIn
,CA_Out.EventTime AS LogOut
,DATEDIFF(minute, EIn.EventTime, CA_Out.EventTime) AS WorkingMinutes
FROM
VIEW_EVENT_EMPLOYEE AS EIn
CROSS APPLY
(
SELECT TOP(1) EOut.EventTime
FROM VIEW_EVENT_EMPLOYEE AS EOut
WHERE
EOut.PSNID = EIn.PSNID
AND EOut.ReaderNo = 'Out'
AND EOut.EventTime >= EIn.EventTime
ORDER BY EOut.EventTime
) AS CA_Out
WHERE
EIn.ReaderNo = 'In'
)
SELECT
PSNID
,PSNNAME
,dt
,LogIn
,LogOut
,WorkingMinutes
FROM CT
WHERE dt BETWEEN '2019-11-29' AND '2019-12-05'
ORDER BY LogIn
;
OUTPUT FROM QUERY
+----+------------+-------------------------+-------------------------+----------------+
| ID | date | In | Out | WorkingMinutes |
-----+------------+-------------------------+-------------------------+----------------+
| 1 | 2019-12-04 | 2019-12-04 11:28:46.000 | 2019-12-04 12:36:17.000 | 68 |
| 1 | 2019-12-04 | 2019-12-04 12:39:23.000 | 2019-12-04 12:51:21.000 | 12 |
| 1 | 2019-12-05 | 2019-12-05 07:37:49.000 | 2019-12-05 08:01:22.000 | 24 |
-----+------------+-------------------------+-------------------------+----------------+
I was thinking something like this: when Out is between 06:25 and 06:40. But I also need to check if the employee clocked In the previous day between 21:50 and 22:30. I need that second condition because some first-shift employees may clock Out at, for example, 6:30.
(1310 is the ShiftStartMinutesFromMidnight.)
Line 3 of the query:
CAST(DATEADD(minute, -0, EIn.EventTime) AS date) AS dt
Updating line 3 with this code:
CASE
WHEN CAST(CA_Out.LogDate AS time) BETWEEN '06:25:00' AND '06:40:00'
AND CAST(EIn.LogDate AS time) BETWEEN '21:50:00' AND '22:30:00' THEN CAST(DATEADD(minute, -1310, EIn.LogDate) AS date)
ELSE CAST(DATEADD(minute, -0, EIn.LogDate) AS date)
END as dt
Expected Output
+----+------------+-------------------------+-------------------------+----------------+
| ID | date | In | Out | WorkingMinutes |
-----+------------+-------------------------+-------------------------+----------------+
| 2 | 2019-12-04 | 2019-12-04 22:11:46.000 | 2019-12-04 23:06:17.000 | 55 |
| 2 | 2019-12-04 | 2019-12-04 23:34:23.000 | 2019-12-05 01:32:21.000 | 118 |
| 2 | 2019-12-04 | 2019-12-05 01:38:49.000 | 2019-12-05 06:32:22.000 | 294 |
-----+------------+-------------------------+-------------------------+----------------+
Assuming that total minutes per separate date is enough:
WITH
/* enumerate pairs */
cte1 AS ( SELECT *,
COUNT(CASE WHEN ReaderNo = 'In' THEN 1 END)
OVER (PARTITION BY ID
ORDER BY EventTime) pair
FROM test ),
/* divide by pairs */
cte2 AS ( SELECT ID, MIN(EventTime) starttime, MAX(EventTime) endtime
FROM cte1
GROUP BY ID, pair ),
/* get dates range */
cte3 AS ( SELECT CAST(MIN(EventTime) AS DATE) minDate,
CAST(MAX(EventTime) AS DATE) maxDate
FROM test),
/* generate dates list */
cte4 AS ( SELECT minDate theDate
FROM cte3
UNION ALL
SELECT DATEADD(dd, 1, theDate)
FROM cte3, cte4
WHERE theDate < maxDate ),
/* add overlapped dates to pairs */
cte5 AS ( SELECT ID, starttime, endtime, theDate
FROM cte2, cte4
WHERE theDate BETWEEN CAST(starttime AS DATE) AND CAST(endtime AS DATE) ),
/* adjust borders */
cte6 AS ( SELECT ID,
CASE WHEN starttime < theDate
THEN theDate
ELSE starttime
END starttime,
CASE WHEN CAST(endtime AS DATE) > theDate
THEN DATEADD(dd, 1, theDate)
ELSE endtime
END endtime,
theDate
FROM cte5 )
/* calculate total minutes per date */
SELECT ID,
theDate,
SUM(DATEDIFF(mi, starttime, endtime)) workingminutes
FROM cte6
GROUP BY ID,
theDate
ORDER BY 1,2
fiddle
The solution is deliberately detailed, step by step, so that you can easily follow the logic.
You may freely combine some CTEs into one. You may also combine the next-to-last cte5 with cte2 if you need the output exactly as shown.
The solution assumes that no records are missing from the source data (each 'In' matches exactly one 'Out' and vice versa, and there are no adjacent or overlapping pairs).
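The pairing and midnight-splitting steps (cte1/cte2 and cte5/cte6) can be sketched in Python on the ID 2 rows, under the same assumption of complete, ordered In/Out pairs:

```python
from datetime import datetime, timedelta

# Badge events for ID 2 from the question, alternating In/Out and ordered by time
events = [
    (datetime(2019, 12, 4, 22, 11, 46), "In"), (datetime(2019, 12, 4, 23, 6, 17), "Out"),
    (datetime(2019, 12, 4, 23, 34, 23), "In"), (datetime(2019, 12, 5, 1, 32, 21), "Out"),
    (datetime(2019, 12, 5, 1, 38, 49), "In"),  (datetime(2019, 12, 5, 6, 32, 22), "Out"),
]

minutes_per_day = {}
for (t_in, _), (t_out, _) in zip(events[::2], events[1::2]):
    cur = t_in
    while cur < t_out:  # split each In/Out pair at midnight, like cte5/cte6
        day_end = datetime(cur.year, cur.month, cur.day) + timedelta(days=1)
        chunk_end = min(t_out, day_end)
        key = cur.date()
        minutes_per_day[key] = minutes_per_day.get(key, 0) + (chunk_end - cur).total_seconds() / 60
        cur = chunk_end

print({d.isoformat(): round(m) for d, m in minutes_per_day.items()})
```

Note this credits minutes to the calendar date they fall on, as the CTE answer does; the question's expected output instead attributes the whole night shift to its start date, which is the adjustment discussed above.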
Don't know where you stopped, but here is how I do it:
Night shift 20:00 - 05:00 so in one day 00:00 - 5:00; 22:00 - 24:00
day shift 5:00 - 22:00
To make overlap checking easier, convert all dates to Unix timestamps, so you don't have to split time intervals as shown above.
So generate a map of each work period for the fetched period (date_from to date_till); make sure to add holiday and pre-holiday exceptions where the periods differ.
Something like this (the Unix values are only illustrative):
unix_from_tim, unix_till_tim, shift_type
1580680800, 1580680800, 1 => example 02-02-2020:22:00:00, 03-02-2020:05:00:00, 1
1580680800, 1580680800, 0 => example 03-02-2020:05:00:00, 03-02-2020:22:00:00, 0
1580680800, 1580680800, 1 => example 03-02-2020:22:00:00, 04-02-2020:05:00:00, 1
...
Make sure you don't double-count overlapping minutes at period starts/ends.
And each worker has one row with unix_from_tim, unix_till_tim:
1580680800, 1580680800 => something like 02-02-2020:16:30:00, 03-02-2020:07:10:00
When you check overlapping, you can get the overlapping time like this:
MIN(work_period:till,worker_period:till) - MAX(work_period:from, worker_period:from);
example in simple numbers:
work_period 3 - 7
worker_period 5 - 12
MIN(7,12) - MAX(3,5) = 7 - 5 = 2 //overlap
work_period 3 - 7
worker_period 8 - 12
MIN(7,12) - MAX(3,8) = 7 - 8 = -1 //if negative not overlap!
work_period 3 - 13
worker_period 8 - 12
MIN(13,12) - MAX(3,8) = 12 - 8 = 4 //full overlap!
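The MIN/MAX rule is easy to capture as a tiny helper (plain Python, clamping negatives to zero):

```python
def overlap(a_from, a_till, b_from, b_till):
    """Length of the overlap of two intervals: min(ends) - max(starts), never negative."""
    return max(0, min(a_till, b_till) - max(a_from, b_from))

print(overlap(3, 7, 5, 12))   # 2  (partial overlap)
print(overlap(3, 7, 8, 12))   # 0  (the raw value -1 means no overlap)
print(overlap(3, 13, 8, 12))  # 4  (the worker period lies fully inside the shift)
```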
And you have to check each worker period against all overlapping generated work intervals.
Maybe someone can write a SELECT where you don't have to generate the work-shift intervals, but it's not an easy task once you add holidays, transferred days, reduced-time days, etc.
Hope it helps.

Discard existing dates that are included in the result, SQL Server

In my database I have a Reservation table and it has three columns Initial Day, Last Day and the House Id.
I want to count the total days and omit those who are repeated, for example:
+-------------+------------+------------+
| | Results | |
+-------------+------------+------------+
| House Id | InitialDay | LastDay |
+-------------+------------+------------+
| 1 | 2017-09-18 | 2017-09-20 |
| 1 | 2017-09-18 | 2017-09-22 |
| 19 | 2017-09-18 | 2017-09-22 |
| 20 | 2017-09-18 | 2017-09-22 |
+-------------+------------+------------+
As you can see, the house with Id 1 has two rows, and the dates of the first row fall within the interval of the second row. In total the number of days should be 5, because the first row shouldn't be counted: those days already exist in the second.
The reason why this is happening is that each house has two rooms, and different persons can stay in that house on the same dates.
My question is: how can I omit those cases, and only count the real days the house was occupied?
If you are using SQL Server 2012 or higher, you can use LAG() to get the previous final date and adjust the initial date:
with ReservationAdjusted as (
select *,
lag(LastDay) over(partition by HouseID order by InitialDay, LastDay) as PreviousLast
from Reservation
)
select HouseId,
sum(case when PreviousLast>LastDay then 0 -- fully contained in the previous reservation
when PreviousLast>=InitialDay then datediff(day,PreviousLast,LastDay) -- overlap
else datediff(day,InitialDay,LastDay)+1 -- no overlap
end) as Days
from ReservationAdjusted
group by HouseId
The cases are:
The reservation is fully included in the previous reservation: we only need to compare end dates because the previous row is obtained by ordering by InitialDay, LastDay, so the previous start date is always less than or equal to the current start date.
The current reservation overlaps the previous one: in this case we adjust the start and don't add 1 (the initial day is already counted). This case includes a previous end equal to the current start (a one-day overlap).
There is no overlap: we just calculate the difference and add 1 to count the initial day as well.
Note that we don't need an extra condition for the first reservation of a HouseID: LAG() returns NULL when there is no previous row, and comparisons with NULL are never true.
Sample input and output:
| HouseId | InitialDay | LastDay |
|---------|------------|------------|
| 1 | 2017-09-18 | 2017-09-20 |
| 1 | 2017-09-18 | 2017-09-22 |
| 1 | 2017-09-21 | 2017-09-22 |
| 19 | 2017-09-18 | 2017-09-27 |
| 19 | 2017-09-24 | 2017-09-26 |
| 19 | 2017-09-29 | 2017-09-30 |
| 20 | 2017-09-19 | 2017-09-22 |
| 20 | 2017-09-22 | 2017-09-26 |
| 20 | 2017-09-24 | 2017-09-27 |
| HouseId | Days |
|---------|------|
| 1 | 5 |
| 19 | 12 |
| 20 | 9 |
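The expected counts can be double-checked with a short Python sketch that expands every reservation into its set of occupied days (both endpoints inclusive, as in the sample):

```python
from datetime import date, timedelta

# Sample reservations: (house_id, initial_day, last_day), both ends inclusive
rows = [
    (1,  date(2017, 9, 18), date(2017, 9, 20)),
    (1,  date(2017, 9, 18), date(2017, 9, 22)),
    (1,  date(2017, 9, 21), date(2017, 9, 22)),
    (19, date(2017, 9, 18), date(2017, 9, 27)),
    (19, date(2017, 9, 24), date(2017, 9, 26)),
    (19, date(2017, 9, 29), date(2017, 9, 30)),
    (20, date(2017, 9, 19), date(2017, 9, 22)),
    (20, date(2017, 9, 22), date(2017, 9, 26)),
    (20, date(2017, 9, 24), date(2017, 9, 27)),
]

occupied = {}
for house, lo, hi in rows:
    d = lo
    while d <= hi:  # expand each reservation into its days; the set dedupes overlaps
        occupied.setdefault(house, set()).add(d)
        d += timedelta(days=1)

print({h: len(days) for h, days in occupied.items()})  # {1: 5, 19: 12, 20: 9}
```

This is the calendar-table answer below in miniature: materialize every day, then count distinct (house, day) pairs.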
select house_id, min(InitialDay), max(LastDay)
from Reservation
group by house_id
If I understood correctly!
Try out and let me know how it works out for you.
Ted.
While thinking through your question I came across the wonder that is the idea of a Calendar table. You'd use this code to create one, with whatever range of dates your want for your calendar. Code is from http://blog.jontav.com/post/9380766884/calendar-tables-are-incredibly-useful-in-sql
declare @start_dt as date = '1/1/2010';
declare @end_dt as date = '1/1/2020';
declare @dates as table (
date_id date primary key,
date_year smallint,
date_month tinyint,
date_day tinyint,
weekday_id tinyint,
weekday_nm varchar(10),
month_nm varchar(10),
day_of_year smallint,
quarter_id tinyint,
first_day_of_month date,
last_day_of_month date,
start_dts datetime,
end_dts datetime
)
while @start_dt < @end_dt
begin
insert into @dates(
date_id, date_year, date_month, date_day,
weekday_id, weekday_nm, month_nm, day_of_year, quarter_id,
first_day_of_month, last_day_of_month,
start_dts, end_dts
)
values(
@start_dt, year(@start_dt), month(@start_dt), day(@start_dt),
datepart(weekday, @start_dt), datename(weekday, @start_dt), datename(month, @start_dt), datepart(dayofyear, @start_dt), datepart(quarter, @start_dt),
dateadd(day,-(day(@start_dt)-1),@start_dt), dateadd(day,-(day(dateadd(month,1,@start_dt))),dateadd(month,1,@start_dt)),
cast(@start_dt as datetime), dateadd(second,-1,cast(dateadd(day, 1, @start_dt) as datetime))
)
set @start_dt = dateadd(day, 1, @start_dt)
end
select *
into Calendar
from @dates
Once you have a calendar table your query is as simple as:
select distinct r.House_id, c.date_id
from Reservation as r
inner join Calendar as c
on
c.date_id >= r.InitialDay
and c.date_id <= r.LastDay
Which gives you a row for each unique day each room was occupied. If you need a sum of how many days each room was occupied it becomes:
select a.House_id, count(a.House_id) as Days_occupied
from
(select distinct t.House_id, c.date_id
from Reservation as t
inner join Calendar as c
on
c.date_id >= t.InitialDay
and c.date_id <= t.LastDay) as a
group by a.House_id
Create a table of all the possible dates and then join it to the Reservations table so that you have a list of all days between InitialDay and LastDay. Like this:
DECLARE @i date
DECLARE @last date
CREATE TABLE #temp (Date date)
SELECT @i = MIN(InitialDay) FROM Reservation
SELECT @last = MAX(LastDay) FROM Reservation
WHILE @i <= @last
BEGIN
INSERT INTO #temp VALUES(@i)
SET @i = DATEADD(day, 1, @i)
END
SELECT HouseID, COUNT(*) FROM
(
SELECT DISTINCT HouseID, Date FROM Reservation
LEFT JOIN #temp
ON Reservation.InitialDay <= #temp.Date
AND Reservation.LastDay >= #temp.Date
) AS a
GROUP BY HouseID
DROP TABLE #temp

Filling Out & Filtering Irregular Time Series Data

Using Postgresql 9.4, I am trying to craft a query on time series log data that logs new values whenever the value updates (not on a schedule). The log can update anywhere from several times a minute to once a day.
I need the query to accomplish the following:
Filter too much data by just selecting the first entry for the timestamp range
Fill in sparse data by using the last reading for the log value. For example, if I am grouping the data by hour and there was an entry at 8am with a log value of 10, and the next entry isn't until 11am with a log value of 15, I would want the query to return something like this:
Timestamp | Value
2015-07-01 08:00 | 10
2015-07-01 09:00 | 10
2015-07-01 10:00 | 10
2015-07-01 11:00 | 15
I have got a query that accomplishes the first of these goals:
with time_range as (
select hour
from generate_series('2015-07-01 00:00'::timestamp, '2015-07-02 00:00'::timestamp, '1 hour') as hour
),
ranked_logs as (
select
date_trunc('hour', time_stamp) as log_hour,
log_val,
rank() over (partition by date_trunc('hour', time_stamp) order by time_stamp asc)
from time_series
)
select
time_range.hour,
ranked_logs.log_val
from time_range
left outer join ranked_logs on ranked_logs.log_hour = time_range.hour and ranked_logs.rank = 1;
But I can't figure out how to fill in the nulls where there is no value. I tried using the lag() feature of Postgresql's Window functions, but it didn't work when there were multiple nulls in a row.
Here's a SQLFiddle that demonstrates the issue:
http://sqlfiddle.com/#!15/f4d13/5/0
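The first goal (keep only the first entry per hour, i.e. rank() = 1) can be mimicked in plain Python on hypothetical log rows:

```python
from datetime import datetime

# Hypothetical (time_stamp, log_val) rows; two entries fall in the 08:00 hour
logs = [
    (datetime(2015, 7, 1, 8, 5), 10),
    (datetime(2015, 7, 1, 8, 40), 12),
    (datetime(2015, 7, 1, 11, 20), 15),
]

first_per_hour = {}
for ts, val in sorted(logs):  # setdefault keeps the earliest value per hour, like rank() = 1
    first_per_hour.setdefault(ts.replace(minute=0, second=0, microsecond=0), val)

print({ts.strftime("%H:%M"): v for ts, v in first_per_hour.items()})  # {'08:00': 10, '11:00': 15}
```

The 09:00 and 10:00 buckets are simply absent here, which is exactly the gap the answer below fills in.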
Your columns are log_hour and first_value:
with time_range as (
select hour
from generate_series('2015-07-01 00:00'::timestamp, '2015-07-02 00:00'::timestamp, '1 hour') as hour
),
ranked_logs as (
select
date_trunc('hour', time_stamp) as log_hour,
log_val,
rank() over (partition by date_trunc('hour', time_stamp) order by time_stamp asc)
from time_series
),
base as (
select
time_range.hour lh,
ranked_logs.log_val
from time_range
left outer join ranked_logs on ranked_logs.log_hour = time_range.hour and ranked_logs.rank = 1)
SELECT
log_hour, log_val, value_partition, first_value(log_val) over (partition by value_partition order by log_hour)
FROM (
SELECT
date_trunc('hour', base.lh) as log_hour,
log_val,
sum(case when log_val is null then 0 else 1 end) over (order by base.lh) as value_partition
FROM base) as q
UPDATE
This is what your query returns:
Timestamp | Value
2015-07-01 01:00 | 10
2015-07-01 02:00 | null
2015-07-01 03:00 | null
2015-07-01 04:00 | 15
2015-07-01 05:00 | null
2015-07-01 06:00 | 19
2015-07-01 08:00 | 13
I want this result set to be split into groups like this:
2015-07-01 01:00 | 10
2015-07-01 02:00 | null
2015-07-01 03:00 | null

2015-07-01 04:00 | 15
2015-07-01 05:00 | null

2015-07-01 06:00 | 19

2015-07-01 08:00 | 13
and to assign to every row in a group the value of the first row of that group (done by the last SELECT).
In this case, a way to obtain the grouping is to create a column which holds the number of non-null values counted up to the current row, and to partition by this value (the sum(case) expression):
value | sum(case)
| 10 | 1 |
| null | 1 |
| null | 1 |
| 15 | 2 | <-- new not null, increment
| null | 2 |
| 19 | 3 | <-- new not null, increment
| 13 | 4 | <-- new not null, increment
and now I can partition by sum(case).
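The whole value_partition trick is last-observation-carried-forward. A minimal Python sketch of the same gap-fill, using the hourly values from the question:

```python
hours = ["08:00", "09:00", "10:00", "11:00"]
vals  = [10, None, None, 15]  # None marks hours with no log entry

# sum(case ...) in the query numbers the groups; carrying the last non-null
# value forward reproduces first_value() over that partition.
filled, last = [], None
for v in vals:
    if v is not None:
        last = v
    filled.append(last)

print(list(zip(hours, filled)))  # [('08:00', 10), ('09:00', 10), ('10:00', 10), ('11:00', 15)]
```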