Define sessions not only with a time limit - SQL

I've been trying to group an events-like table (user, action, event_time) into sessions, but the common "idle time" approach is not enough.
I need to check whether the user was idle for more than some period X of time (fine so far), BUT if the user starts a game, they will very likely be idle for a long time, and every action between game start and game end, regardless of the time interval, should not be counted as a new session. When the game finishes, though, a new session begins.
For example (idle time = 5 min):
| action | user | event_at | new_session? (desired output) |
|---------------|------|--------------|-------------------------------|
| random1 | 1 | 1 sec | 1 |
| random3 | 1 | 30 sec | 0 |
| random4 | 1 | 6:00 min | 1 |
| random5 | 1 | 7:00 min | 0 |
| game_started | 1 | 7:30 min | 0 |
| random2 | 1 | 20:00 min | 0 |
| random5 | 1 | 27:00 min | 0 |
| game_finished | 1 | 35:00 min | 0 |
| random5 | 1 | 35:30 min | 1 |
The problem is those random actions between game_started and game_finished: I cannot tell SQL to ignore them and not count them as a new session when applying the idle-time logic (which is still needed for the parts not between start and finish).
In a "proper" programming language I could just set a flag once the loop finds "game_started" and ignore everything until "game_finished" is found. But in SQL this is simply not that easy, even using an auxiliary column.
Any ideas?
Thanks in advance!

Identify the games first, then use lag() for the session start:
select t.*,
       (case when in_game > 0 and
                  (lag(in_game) over (order by time) is null or
                   lag(in_game) over (order by time) = 0)
                 then 1   -- first row of a game block
             when in_game = 0 and
                  (lag(time) over (partition by in_game order by time) is null or
                   lag(time) over (partition by in_game order by time) < time - interval '5 minute')
                 then 1   -- idle-time rule outside games
             else 0
        end) as new_session
from (select t.*,
             -- running count of open games; 'else 0' keeps it at 0 (not null)
             -- for rows before the first game
             sum(case when action = 'game_started' then 1
                      when action = 'game_finished' then -1
                      else 0
                 end) over (order by time) as in_game
      from t
     ) t;
This does not handle every edge case. For instance, this version does not handle nested games. It might also be tricky with games right next to each other. But I think it does do what you want.
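To sanity-check the query, a minimal setup reproducing the sample might look like this (a sketch only; the table name t, the column names, and the literal timestamps are assumptions):

create table t (action text, "user" int, time timestamp);
insert into t values
    ('random1',       1, '2024-01-01 00:00:01'),
    ('random3',       1, '2024-01-01 00:00:30'),
    ('random4',       1, '2024-01-01 00:06:00'),
    ('random5',       1, '2024-01-01 00:07:00'),
    ('game_started',  1, '2024-01-01 00:07:30'),
    ('random2',       1, '2024-01-01 00:20:00'),
    ('random5',       1, '2024-01-01 00:27:00'),
    ('game_finished', 1, '2024-01-01 00:35:00'),
    ('random5',       1, '2024-01-01 00:35:30');

Comparing the query's new_session column against the desired output on this data makes the boundary behavior easy to inspect, for example whether game_started itself should open a new session.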

Related

Counting values with almost the same time

Good afternoon. Here is the essence of the matter: a train carries a geotag that determines its position in space, and the location data is written to a table. I need to count how many times the train was in a certain zone. The problem is that while the train is in a zone, the geotag leaves several time-stamped records in the table. What query can be used to count the number of arrivals?
I created a query that counts how many times the train was at point 270 and at point 289. To do this, I rounded the time to hours, but if the train arrived at the end of one hour and left at the beginning of the next, the query counts it as two arrivals. Below I attach the query itself and the output results.
create temp table tmpTable_1 on commit drop as
    select addr, zone_id, DATE_PART('hour', time) * 100 as IntTime
    from trac_path_rmp
    where time between '2022.04.06' and '2022.04.07';

create temp table tmpTable_2 on commit drop as
    select addr, zone_id, IntTime
    from tmpTable_1
    where addr in (12421, 12422, 12423, 12425)
    group by addr, zone_id, IntTime;

select addr,
       sum(case when zone_id = 289 then 1 else 0 end) as "Zone 289",
       sum(case when zone_id = 270 then 1 else 0 end) as "Zone 270"
from tmpTable_2
group by addr
order by addr;
We can use LAG() OVER () to get the timestamp of the previous row and only return the rows where there is at least a minute's difference; this is easy to change to 5 minutes, for example.
We also keep the first row, where LAG returns null.
We need to use both hours and minutes because DATE_PART returns only the requested component of the interval: using minutes alone would give a 0 difference when there is exactly an hour between rows.
See the db<>fiddle link below.
with CTE as (
    select *,
           time_ - LAG(time_) over (order by id) as dd
    from table_name
)
select id, time_, addr, x, y, z, zone_id, type
from CTE
where 60 * DATE_PART('hours', dd) + DATE_PART('minutes', dd) > 0
   or dd is null;
id | time_ | addr | x | y | z | zone_id | type
--: | :------------------ | ----: | ------: | ------: | ------: | ------: | ---:
138 | 2022-04-06 19:19:11 | 12421 | 9793.50 | 4884.70 | -125.00 | 270 | 1
141 | 2022-04-06 20:37:23 | 12421 | 9736.00 | 4856.90 | -125.00 | 270 | 1
146 | 2022-04-06 22:58:15 | 12421 | 9736.00 | 4856.90 | -125.00 | 270 | 1
db<>fiddle here
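A quick way to see why both components are needed (a minimal illustration; the interval literals are just examples):

-- DATE_PART returns only the requested component of an interval,
-- so an exact one-hour gap reports 0 minutes:
select DATE_PART('minutes', interval '1 hour');                -- 0
-- combining both components yields the true gap in minutes:
select 60 * DATE_PART('hours', interval '1 hour')
       + DATE_PART('minutes', interval '1 hour');              -- 60

An alternative is extract(epoch from dd) > 60, which compares the whole interval in seconds and sidesteps the component arithmetic (including gaps longer than a day, where the days component would otherwise be lost).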

Calculating percentage change of a metric for a group in PostgreSQL

I have a sample table as follows:
id | timestamp | agentid | input_interface | sourceipv4address | totalbytes_sum
-------+------------+-----------------+-----------------------------+-------------------+----------------
10733 | 1593648000 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.10 | 3857
10734 | 1593648000 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.101 | 45960
10731 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.10 | 20579
10736 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.101 | 21384
10737 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.107 | 2094
This table is populated by taking samples from a network every 10 minutes. Basically I am trying to build a view that calculates the percentage change of totalbytes_sum for each group (agentid, input_interface, sourceipv4address) and shows it as:
timestamp | agentid | input_interface | sourceipv4address | totalbytes_sum | percent
The calculation needs to compare the current 10-minute sample with the previous one. I can guarantee that there will be only one row for a particular agentid, input_interface, sourceipv4address combination within the same 10 minutes.
If a combination has no record within the previous 10 minutes, the percentage will be +100%.
I was trying to apply the partition/order logic but had no luck; the offset (lag) function seems promising too, but I am pretty much stuck.
Can someone please assist me? Thanks.
Many of your sample timestamps are the same; I assume consecutive samples for a given combination will be ~600 seconds apart in your actual data.
Please try something like this and let me know in comments if it does not work for you or if you need explanation for any of it:
select timestamp, agentid, input_interface,
       sourceipv4address, totalbytes_sum,
       timestamp - lag(timestamp) over w as elapsed_time,           -- illustration column
       lag(totalbytes_sum) over w as last_totalbytes_sum,           -- illustration column
       case
           when coalesce(lag(timestamp) over w, 0) = 0 then 100.0   -- no previous sample
           when timestamp - lag(timestamp) over w > 600 then 100.0  -- previous sample too old
           else 100.0 * (totalbytes_sum - lag(totalbytes_sum) over w) /
                lag(totalbytes_sum) over w
       end as percent
from sample_table
window w as (partition by agentid, input_interface, sourceipv4address
             order by timestamp);
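Since the goal is a view, the query can be wrapped directly once you are happy with it, dropping the illustration columns (a minimal sketch; the view name is an assumption):

create view percent_change_v as
select timestamp, agentid, input_interface,
       sourceipv4address, totalbytes_sum,
       case
           when coalesce(lag(timestamp) over w, 0) = 0 then 100.0   -- no previous sample
           when timestamp - lag(timestamp) over w > 600 then 100.0  -- previous sample too old
           else 100.0 * (totalbytes_sum - lag(totalbytes_sum) over w) /
                lag(totalbytes_sum) over w
       end as percent
from sample_table
window w as (partition by agentid, input_interface, sourceipv4address
             order by timestamp);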

Aggregate results split by day

I'm trying to write a query that returns summarised data, per day, over many days of data.
For example
| id | user_id | start
|----|---------|------------------------------
| 1 | 1 | 2020-02-01T17:35:37.242+00:00
| 2 | 1 | 2020-02-01T13:25:21.344+00:00
| 3 | 1 | 2020-01-31T16:42:51.344+00:00
| 4 | 1 | 2020-01-30T06:44:55.344+00:00
The outcome I'm hoping for is a function to which I can pass the user id and a timezone (or UTC offset), and get out:
| day | count |
|---------|-------|
| 1/2/20 | 2 |
| 31/1/20 | 1 |
| 30/1/20 | 7 |
Where the count is of all the rows whose start time falls between 00:00:00.0000 and 23:59:59.9999 on each day, taking the supplied UTC offset into consideration.
I don't really know where to start writing a query like this, and the fact that I can't even picture where to start feels like a big gap in my SQL thinking. How should I approach something like this?
You can use:
select date_trunc('day', start) as dte, count(*)
from t
where user_id = ?
group by date_trunc('day', start)
order by dte;
If you want to handle an additional offset, build that into the query:
select dte, count(*)
from t cross join lateral
     (values (date_trunc('day', start + ? * interval '1 hour'))) v(dte)
where user_id = ?
group by v.dte
order by v.dte;
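And if you want the reusable function the question describes, the offset variant can be wrapped up like this (a sketch only; the function name and the assumption that start is a timestamptz column are mine, and with a named time zone you could use start AT TIME ZONE instead of the hour offset):

create function daily_counts(p_user_id int, p_offset_hours int)
returns table (dte timestamptz, cnt bigint) as $$
    select date_trunc('day', start + p_offset_hours * interval '1 hour') as dte,
           count(*) as cnt
    from t
    where user_id = p_user_id
    group by 1
    order by 1 desc;   -- newest day first, as in the desired output
$$ language sql stable;

-- usage: select * from daily_counts(1, 0);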

How to assign a percentage in SQL?

| proj_id | list_date | state | Status |
|---------|-----------|-------|------------|
| 1 | 03/05/10 | CA | Finished |
| 2 | 04/05/10 | WA | Unfinished |
| 3 | 03/05/10 | WA | Finished |
| 4 | 04/05/10 | CA | Finished |
| 5 | 03/05/10 | WA | Unfinished |
| 6 | 04/05/10 | CA | Finished |
What query can I write to determine the percentage of projects in each listing month and state that are finished?
I currently have the following query to count the projects by month and state:
select month(list_date), state_name, count(*) as Projects
from projects
group by month(list_date), state_name;
How do I then incorporate the percentage of projects in each listing month and state that are finished or unfinished?
Desired output:
| list_date | state | Proj_Count | % Finished |
|-----------|-------|------------|------------|
| March | CA | 2 | 10% |
| April | WA | 3 | 20% |
| July | WA | 6 | 50% |
| August | CA | 5 | 40% |
You can easily do this with the avg() aggregation function:
select month(list_date), state_name, count(*) as Projects,
       avg(case when status = 'Finished' then 100.0 else 0 end) as percent_finished
from projects
group by month(list_date), state_name;
To see how this works, just work it out on an example:
Finished 100
Unfinished 0
Finished 100
Unfinished 0
Unfinished 0
The average is (100 + 0 + 100 + 0 + 0) / 5 = 40. That is the percentage you are looking for.
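You can check the arithmetic directly (a quick verification in PostgreSQL-style syntax; the values come from the walk-through above):

select avg(v) as percent_finished
from (values (100.0), (0.0), (100.0), (0.0), (0.0)) t(v);   -- 40.0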
Put this in the select:
100 * count(case when status = 'Finished' then 1 else null end) / cast(count(*) as float)
It wasn't clear to me whether you want the percentage finished by state and month, or just by month. If the latter, it's a bit more involved.
COUNT counts the non-null values in a set. If the status is 'Finished', 1 is returned; if it is anything else, null is returned and the row doesn't contribute to the count.
Note that this calculation would otherwise be done in integer arithmetic, so we cast one of the operands to a float (or whatever decimal-supporting type your database has) so that the division keeps its decimal places. Without the cast, some databases would simply produce 0, because integer division of a smaller integer by a larger one is always 0.
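Dropped into the grouped query from the question, it looks like this (a sketch, reusing the question's month() and state_name names and assuming status holds 'Finished'/'Unfinished' as in the sample data):

select month(list_date), state_name, count(*) as Projects,
       100 * count(case when status = 'Finished' then 1 end)
           / cast(count(*) as float) as percent_finished
from projects
group by month(list_date), state_name;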

How can I group by the difference of a column between rows in SQL?

I have a table of events with a created_at timestamp. I want to divide them into groups of events that are within N seconds of each other, specifically 130 seconds. Then for each group, I just need the lowest and the highest timestamp.
Here's some sample data (ignore the formatting of the timestamp, it's a datetime field):
------------------------
| id | created_at |
------------------------
| 1 | 2013-1-20-08:00 |
| 2 | 2013-1-20-08:01 |
| 3 | 2013-1-20-08:05 |
| 4 | 2013-1-20-08:07 |
| 5 | 2013-1-20-08:09 |
| 6 | 2013-1-20-08:12 |
| 7 | 2013-1-20-08:20 |
------------------------
And what I would like to get as a result is:
-------------------------------------
| started_at | ended_at |
-------------------------------------
| 2013-1-20-08:00 | 2013-1-20-08:01 |
| 2013-1-20-08:05 | 2013-1-20-08:09 |
| 2013-1-20-08:12 | 2013-1-20-08:12 |
| 2013-1-20-08:20 | 2013-1-20-08:20 |
-------------------------------------
I've googled every possible phrasing of this question and experimented for some time, but I can't figure it out. I can already do this in Ruby; I'm just trying to work out whether it can be moved to the database level. If you're curious, or it's easier to visualize, here's what it looks like in Ruby:
groups = SortedSet[*events].divide { |a, b| (a.created_at - b.created_at).abs <= 130 }
groups.map do |group|
  { started_at: group.to_a.first.created_at, ended_at: group.to_a.last.created_at }
end
Does anyone know how to do this in SQL, specifically PostgreSQL?
I think you want to start a new grouping whenever the gap from the previous row is greater than 130 seconds. You can do this with lag() and date arithmetic to flag where a grouping starts, then take a cumulative sum of the flags to get a grouping identifier:
select Grouping, min(created_at) as started_at, max(created_at) as ended_at
from (select t.*,
             sum(GroupStartFlag) over (order by created_at) as Grouping
      from (select t.*,
                   lag(created_at) over (order by created_at) as prevca,
                   (case when extract(epoch from created_at - lag(created_at) over (order by created_at)) <= 130
                         then 0 else 1
                    end) as GroupStartFlag
            from t
           ) t
     ) t
group by Grouping;
The final step is to aggregate by the "grouping" identifier to get the earliest and latest timestamps.
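To try this against the sample data, a minimal setup might look like this (a sketch only; the table name t is assumed, and the question's abbreviated timestamps are expanded to full datetime literals):

create table t (id int, created_at timestamp);
insert into t values
    (1, '2013-01-20 08:00'),
    (2, '2013-01-20 08:01'),
    (3, '2013-01-20 08:05'),
    (4, '2013-01-20 08:07'),
    (5, '2013-01-20 08:09'),
    (6, '2013-01-20 08:12'),
    (7, '2013-01-20 08:20');

With the 130-second threshold, the gaps of 240, 180, and 480 seconds start new groups, reproducing the four (started_at, ended_at) pairs in the desired output.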