Find max logons in an interval - SQL

I have a table with timestamps and states for people.
| user_id | state | start_time          | end_time            |
|---------|-------|---------------------|---------------------|
| 4711    | 1     | 2013-10-30 09:01:23 | 2013-10-30 17:12:03 |
| 4712    | 1     | 2013-10-30 07:01:23 | 2013-10-30 18:12:03 |
| 4713    | 1     | 2013-10-30 08:01:23 | 2013-10-30 16:12:03 |
| 4714    | 1     | 2013-10-30 09:01:24 | 2013-10-30 17:02:03 |
My challenge is to find out the MAX and AVG number of users logged on at the same time per interval. I think I can get there once I can see how many users are logged in simultaneously during each second.
| timestamp  | state | userid |
|------------|-------|--------|
| 1383123683 | 1     | 4711   |
| 1383123684 | 1     | 4711   |
| 1383123684 | 1     | 4712   |
| 1383123685 | 1     | 4711   |
| 1383123685 | 1     | 4712   |
| ...        | ...   | ...    |
By the way, one interval is a quarter of an hour.
The data comes in via INSERT INTO, so my idea was to create a trigger that writes one row per second (UNIX timestamp) between start and end into a helper table, along with the state_id.
In the end it must be possible to group by second and count the rows to find out how many exist in any one second. For the AVG I don't have a formula yet :-). It's a question of time, you know.
But I'm not sure my idea is a good one, because I fear this plan needs a lot of performance and space.
The better idea would be to store just the start time and end time, but then I lose the possibility of grouping by second.
How can I manage that without thousands of rows in my database?

There can be several solutions here; I want to describe one, and I hope you can use/adapt/extend it for your particular needs (NOTE: I'm using the MySQL dialect; for MS SQL the syntax can be a little different, but the approach will work):
1. Create a new table with a structure like:
create table changelog (
    changetime datetime,   -- moment at which the logged-in count changes
    changevalue int,       -- net change at that moment: +1 logon, -1 logoff
    totalsum int,          -- running total: users logged in at that moment
    primary key (changetime)
);
2. Insert the basic data:
insert into changelog
select changet, sum(cnts), 0          -- totalsum is filled in step 3
from
(
    select start_time as changet, 1 as cnts from testlog   -- logon: +1
    union all
    select end_time as changet, -1 from testlog            -- logoff: -1
) as q
group by changet;
3. Update the totalsum column:
update changelog as a
set totalsum = ifnull(
    (select sum(changevalue)
     from (select changet, sum(cnts) as changevalue
           from
           (
               select start_time as changet, 1 as cnts from testlog
               union all
               select end_time as changet, -1 from testlog
           ) as q
           group by changet) as b
     where b.changet <= a.changetime), 0);
NOTE: for MS SQL you can try the WITH (CTE) syntax; there you should be able to do this insert/update as one query.
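For illustration, a minimal single-statement sketch of that idea - my assumption of how it could look on MS SQL 2012+ (or MySQL 8+), using a window SUM for the running total; this is not from the original answer:
insert into changelog (changetime, changevalue, totalsum)
select changet,
       sum(cnts),
       sum(sum(cnts)) over (order by changet)   -- running total of net changes
from
(
    select start_time as changet, 1 as cnts from testlog
    union all
    select end_time as changet, -1 from testlog
) as q
group by changet;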
4. After this you will have (based on the data from the question):
changetime            changevalue   totalsum
2013-10-30 07:01:23        1            1
2013-10-30 08:01:23        1            2
2013-10-30 09:01:23        1            3
2013-10-30 09:01:24        1            4
2013-10-30 16:12:03       -1            3
2013-10-30 17:02:03       -1            2
2013-10-30 17:12:03       -1            1
2013-10-30 18:12:03       -1            0
As you can see, the maximum number of logged-in users is already in there. But there is one problem: imagine you need to select data for the range 08:00-08:01. There are no rows in the table for that range, so a query like this will not work:
SELECT max(totalsum)
FROM changelog
where changetime between cast(#startrange as datetime) and cast(#endrange as datetime)
but you can change it to:
SELECT max(totalsum)
from
(
select max(totalsum) as totalsum FROM changelog
where changetime between cast(#startrange as datetime) and cast(#endrange as datetime)
union all
select totalsum from changelog where changetime=(select max(changetime) from changelog where changetime<cast(#startrange as datetime))
) as q;
So, basically: in addition to your range you need to fetch the last row before the period starts, to find out how many users were logged in at the moment the range begins.
5. Now you want to calculate the average. Average is a tricky function: depending on what you mean by it, there can be different results: average users per second, or average workload.
Here is the difference:
100 users logged in at 09:00
98 users logged out at 09:01
1 user logged out at 09:02
Selection range: 09:00 - 09:59 (inclusive)
The average per minute is the sum of the logged-in user counts for each minute, divided by 60:
(100 + 2 + 1 + 57*1)/60 = 2.6(6) users per minute
But the average workload can be calculated as (max(logged_users) + min(logged_users)) / 2:
(100 + 1)/2 = 50.5 users; this is the average number of simultaneously logged-in users.
Another average can be calculated via SQL's avg (sum(values)/count(values)) over the change rows, which gives us:
(100+98+1)/3 = 66.3(3) - another average workload in persons
The first formula tells us it is only 2.6(6) users at the same time, but the second shows "holy #*&####, it is 50.5 users at the same time".
another example:
100 users logged in at 09:00
99 users logged out at 09:58
1 user logged out at 09:59
Selection range: 09:00 - 09:59 (inclusive)
the first formula will give you (100*58 + 1 + 0)/60 = 96.68(3) users, while the second still gives about half the peak, (100 + 0)/2 = 50, and the third (100+99+1)/3 = 66.6(6).
Which average suits you best?
To calculate the 1st average you need a stored procedure (or query) that gets the data for each minute/second of the period, sums it up, and then divides; a sketch follows below.
To calculate the 2nd variant: just select min and max and divide by 2.
3rd variant: use avg instead of max.
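As an illustration, here is a minimal sketch of the 1st average, assuming MySQL 8.0+ (recursive CTEs) and the changelog table from step 1; the literal range bounds are placeholders, and this is not from the original answer:
with recursive minutes as (
    select cast('2013-10-30 09:00:00' as datetime) as m
    union all
    select m + interval 1 minute from minutes
    where m < cast('2013-10-30 09:59:00' as datetime)
)
select avg(ifnull(
    (select totalsum from changelog      -- logged-in count at minute m:
     where changetime <= minutes.m       -- last change at or before m
     order by changetime desc
     limit 1), 0)) as avg_users_per_minute
from minutes;
For each of the 60 minutes this looks up the last known totalsum at or before that minute (0 if there is none) and averages those values.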
Note #1: of course, all these approaches are quite slow with huge traffic, so I suggest you prepare some "pre-calculated" tables with data that can be fetched fast (for example, you could keep one row per hour, like: YYYY-MM-DD HH, loggedInatStart, min, avg, median, max, loggedInatEnd).
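A possible shape for such an hourly table (the table and column names are illustrative, not from the original answer):
create table hourly_stats (
    stat_hour datetime,          -- truncated to the hour: YYYY-MM-DD HH:00:00
    logged_in_at_start int,
    min_users int,
    avg_users decimal(10,2),
    median_users int,
    max_users int,
    logged_in_at_end int,
    primary key (stat_hour)
);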
Note #2: sometimes the median is more interesting for statistical purposes. To obtain it, calculate for each minute how many users were logged in; then either take the distinct values and pick the middle one (for my examples this gives 2 and 1), or take all values and pick the middle one (for my examples this gives 1 and 100).

Related

Get the difference in time between multiple rows with the same column name

I need to get the time difference between two dates on different rows. That part is okay, but I can have multiple instances of the same title. A quick example will explain things some more.
Let's say we have a table with the following records:
| ID | Title | Date |
| ----- | ------- |--------------------|
| 1 | Down |2021-03-07 12:05:00 |
| 2 | Up |2021-03-07 13:05:00 |
| 3 | Down |2021-03-07 10:30:00 |
| 4 | Up |2021-03-07 11:00:00 |
I basically need to get the time difference between the first "Down" and "Up". So ID 1 & 2 = 1 hour.
Then ID 3 & 4 = 30 mins, and so on for the amount of "Down" and "Up" rows there are.
(These will always be grouped together one after another)
It doesn't matter if the results are separate or a SUM of all the differences.
I'm trying to get this done without a temp table.
Thank you.
This can be done using analytic (window) functions, whose availability depends on your SQL engine. The idea is to get the next row's value onto the same row as the current one, so you can calculate the diff/sum.
In the case above it would look something like this:
SELECT
    id,
    title,
    Date AS startdate,
    LEAD(Date, 1) OVER (ORDER BY id) AS enddate
FROM
    tablename;   -- assumes a table named tablename; "table" itself is a reserved word
Once you have it on the same row, you can carry out your time difference operation.
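For instance, here is a hedged sketch of that final step, assuming MySQL 8.0+ and that Down/Up rows always alternate as described (the table name tablename is a stand-in):
SELECT id,
       title,
       startdate,
       enddate,
       TIMESTAMPDIFF(MINUTE, startdate, enddate) AS minutes_between
FROM (
    SELECT id,
           title,
           Date AS startdate,
           LEAD(Date, 1) OVER (ORDER BY id) AS enddate
    FROM tablename
) AS paired
WHERE title = 'Down';   -- keep only Down rows, so enddate is the matching Up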

Performance on querying only the most recent entries

I made an app that saves when a worker arrives at and departs from the premises.
Over 24 hours, multiple checks are made, so the database can quickly fill with hundreds to thousands of records, depending on the activity.
| user_id | device_id | station_id | arrived_at | departed_at |
|-----------|-----------|------------|---------------------|---------------------|
| 67 | 46 | 4 | 2020-01-03 11:32:45 | 2020-01-03 11:59:49 |
| 254       | 256       | 8          | 2020-01-02 16:29:12 | 2020-01-02 16:44:55 |
| 97 | 87 | 7 | 2020-01-01 09:55:01 | 2020-01-01 11:59:18 |
...
This becomes a problem since the daily report software, which later reports who was absent or who worked extra hours, filters by arrival date.
The query becomes a full table scan:
(I just used SQLite for this example, but you get the idea)
EXPLAIN QUERY PLAN
SELECT * FROM activities
WHERE user_id = 67
AND arrived_at > '2020-01-01 00:00:00'
AND departed_at < '2020-01-01 23:59:59'
ORDER BY arrived_at DESC
LIMIT 10
What I want is to make the query snappier for records created (arrived) on the most recent day, since queries for older days are rarely executed. Otherwise, I'll have to deal with timeouts.
I would use the following index, so that rows whose departed_at doesn't match can be eliminated before probing the table:
CREATE INDEX ON activities (arrived_at, departed_at);
On Postgres, you may use DISTINCT ON:
SELECT DISTINCT ON (user_id) *
FROM activities
ORDER BY user_id, arrived_at::date DESC;
This assumes that you only want to report the latest record, as determined by the arrival date, for each user. If instead you just want to show all records with the latest arrival date across the entire table, then use:
SELECT *
FROM activities
WHERE arrived_at::date = (SELECT MAX(arrived_at::date) FROM activities);
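One hedged aside: the arrived_at::date cast keeps a plain index on arrived_at from being used for this predicate. On Postgres, if arrived_at is a timestamp without time zone, an expression index helps (the index name is illustrative):
CREATE INDEX activities_arrived_date_idx ON activities ((arrived_at::date));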

How to make a query that selects based on a 1 day interval?

How can I get all IDs that have more than 10 entries on one day?
Here is the sample data:
| ID | Time                |
|----|---------------------|
| 4  | 2019-02-14 17:22:43 |
| 2  | 2019-04-27 07:51:09 |
| 83 | 2018-01-07 08:38:37 |
I am having a hard time using count to go through and find all of the entries that fall on the same day. The hour:min:sec part is what is causing problems for me.
For MySQL it would be:
select distinct id from tablename
group by id, date(time)
having count(*) > 10
The date() function discards the time part of the column, so the grouping is done only by the date part.
For SQL Server you would use:
convert(date, time)
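Put together, a sketch of the SQL Server variant (same hypothetical tablename and time column as above):
select distinct id from tablename
group by id, convert(date, time)
having count(*) > 10;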

How to do a sub-select per result entry in postgresql?

Assume I have a table with only two columns: id and maturity. maturity is some date in the future and represents how long a specific entry will be available. Thus it differs between entries but is not necessarily unique, and the number of entries that have not yet reached their maturity date changes over time.
I need to count the number of entries in such a table that were available on a specific date (i.e. entries that had not yet reached their maturity). So I basically need to join these two queries:
SELECT generate_series as date FROM generate_series('2015-10-01'::date, now()::date, '1 day');
SELECT COUNT(id) FROM mytable WHERE mytable.maturity > now()::date;
where instead of now()::date I need to put each entry from the generated series. I'm sure this has to be simple enough, but I can't quite get my head around it. I need the resulting solution to remain a query, so it seems I can't use for loops.
Sample table entries:
id | maturity
---+-------------------
1 | 2015-10-03
2 | 2015-10-05
3 | 2015-10-11
4 | 2015-10-11
Expected output:
date | count
------------+-------------------
2015-10-01 | 4
2015-10-02 | 4
2015-10-03 | 3
2015-10-04 | 3
2015-10-05 | 2
2015-10-06 | 2
NOTE: This count doesn't constantly decrease, since new entries are added and this count increases.
You need to use fields of the outer query in the WHERE clause of a sub-query. This can be done if the subquery is in the SELECT clause of the outer query:
SELECT generate_series AS date,
       (SELECT COUNT(id)
        FROM mytable
        WHERE mytable.maturity > generate_series) AS count
FROM generate_series('2015-10-01'::date, now()::date, '1 day');
More info: http://www.techonthenet.com/sql_server/subqueries.php
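As a hedged alternative (same mytable, Postgres), the correlated subquery can also be rewritten as a LEFT JOIN with GROUP BY, which the planner can sometimes handle better:
SELECT g.d AS date, COUNT(m.id) AS count
FROM generate_series('2015-10-01'::date, now()::date, '1 day') AS g(d)
LEFT JOIN mytable m ON m.maturity > g.d
GROUP BY g.d
ORDER BY g.d;
COUNT(m.id) counts only matched rows, so days with no available entries correctly show 0.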
I think you want to group your data by the maturity date. Check this:
select maturity, count(*) as count
from your_table
group by maturity;

Remove rows that are within 15 minutes from the very first row that meets the criteria

I have data that looks like this
Name | Date | Event | Event_ID
_____________________________________________________________
BRADLEY | 2014-12-01 16:15:26.442 | ACCESSED | 268766
BRADLEY | 2014-12-01 16:15:36.794 | ACCESSED | 268766
BRADLEY | 2014-12-01 16:15:50.618 | DENIED | 268766
BRADLEY | 2014-12-01 16:16:04.89 | DENIED | 268766
BRADLEY | 2014-12-01 16:18:01.036 | DENIED | 268766
BRADLEY | 2014-12-01 16:18:31.335 | DENIED | 268766
CHARLES | 2014-12-01 08:33:34.831 | ACCESSED | 445317
CHARLES | 2014-12-01 08:33:44.041 | ACCESSED | 445317
CHARLES | 2014-12-01 14:56:49.872 | ACCESSED | 10333360
CHARLES | 2014-12-01 14:56:57.549 | ACCESSED | 10333360
CHARLES | 2014-12-01 14:56:59.248 | ACCESSED | 10333360
CHARLES | 2014-12-01 14:57:02.221 | ACCESSED | 10333360
CHARLES | 2014-12-01 14:57:03.226 | ACCESSED | 10333360
My requirement is that I need to remove ACCESSED events that are within 15 minutes of each other. For example, for BRADLEY I would remove the second ACCESSED at timestamp 16:15:36.794. That part is easy for me, as I can just join the table to itself, compare the current row to the next row, and do logic on the Date.
The issue I'm running into is CHARLES. His Event_ID of 10333360 is a bit more complicated than BRADLEY's case. For CHARLES, I need to remove all ACCESSED rows with Event_ID 10333360 except the one with Date 14:56:49.872. That's because I need to remove all dates that are within 15 minutes of the start of a new Event_ID. The real-world issue is that there are too many "duplicates" when a user is ACCESSED, and I'm doing data cleanup to remove all this unnecessary ACCESSED data.
I thought about using window functions in Postgres but there doesn't seem to be anything that can help me with the logic in comparing the dates (http://www.postgresql.org/docs/9.1/static/functions-window.html)
I do have some ideas on how to tackle this problem using stored procedures and temp tables, so that I can actually use variables in a Java-like way. But of course I want it to be efficient, and I'm hoping to learn new techniques for tackling a problem like this.
This is tricky, because which rows are deleted and which rows stay depends on the rest of the table. So you have a moving target. To pin it down, I suggest you apply a grid (of 15 minutes in your case):
SELECT tbl_id, row_number() OVER (PARTITION BY grid_start, event_id, name
ORDER BY date) rn
FROM (
SELECT g AS grid_start, g + interval '15 min' AS grid_end
FROM (SELECT min(date) AS mind, max(date) AS maxd
FROM tbl
WHERE event = 'ACCESSED') t
, generate_series(t.mind, t.maxd, interval '15 min') g
) g
JOIN tbl t ON t.date >= g.grid_start
AND t.date < g.grid_end
WHERE event = 'ACCESSED';
The implicit LATERAL join requires Postgres 9.3+.
To substitute in older versions (see: Record returned from function has columns concatenated):
FROM (SELECT generate_series(min(date)
, max(date)
, interval '15 min') g
FROM tbl WHERE event = 'ACCESSED') g
Now it's simple to DELETE:
DELETE FROM tbl t
USING (<above query>) x
WHERE t.tbl_id = x.tbl_id
AND x.rn > 1;
Only the first row per name in every 15 min interval survives.
Note that you can still have two rows within 15 minutes (neighboring grid cells), but never three. Or generally speaking: never more than n+1 rows per n consecutive intervals.
If that's not good enough, I suggest a procedural solution (sketched after the link below). Iterate through the qualifying rows, remember the date of the first survivor, and return the id (for deletion) of every following row until the date is > 15 minutes later. Remember the date of that next survivor, etc. Similar to:
GROUP BY and aggregate sequential numeric values
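A minimal plpgsql sketch of that iteration, assuming the same tbl; the function name is illustrative, and the loop resets per name as described above:
CREATE OR REPLACE FUNCTION f_rows_to_delete()
  RETURNS SETOF int AS
$$
DECLARE
   r         record;
   last_name text;
   last_kept timestamp;
BEGIN
   FOR r IN
      SELECT tbl_id, name, date
      FROM   tbl
      WHERE  event = 'ACCESSED'
      ORDER  BY name, date
   LOOP
      IF last_name IS DISTINCT FROM r.name
         OR r.date >= last_kept + interval '15 min' THEN
         last_name := r.name;    -- new name or new 15-min window: row survives
         last_kept := r.date;
      ELSE
         RETURN NEXT r.tbl_id;   -- within 15 min of the survivor: delete
      END IF;
   END LOOP;
END
$$ LANGUAGE plpgsql;
Usage: DELETE FROM tbl WHERE tbl_id IN (SELECT f_rows_to_delete());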
Aside: don't call a timestamp "date", that's misleading.