Performance when querying only the most recent entries - SQL

I made an app that saves when a worker arrives at and departs from the premises.
Over 24 hours, multiple checks are made, so the database quickly fills with hundreds to thousands of records, depending on the activity.
| user_id | device_id | station_id | arrived_at | departed_at |
|-----------|-----------|------------|---------------------|---------------------|
| 67 | 46 | 4 | 2020-01-03 11:32:45 | 2020-01-03 11:59:49 |
| 254 | 256 | 8 | 2020-01-02 16:29:12 | 2020-01-02 16:44:56 |
| 97 | 87 | 7 | 2020-01-01 09:55:01 | 2020-01-01 11:59:18 |
...
This becomes a problem since the daily report software, which later reports who was absent or who worked extra hours, filters by arrival date.
The query becomes a full table scan:
(I just used SQLite for this example, but you get the idea)
EXPLAIN QUERY PLAN
SELECT * FROM activities
WHERE user_id = 67
AND arrived_at > '2020-01-01 00:00:00'
AND departed_at < '2020-01-01 23:59:59'
ORDER BY arrived_at DESC
LIMIT 10
What I want is to make the query snappier for records created (arrived) on the most recent day only, since queries for older days are rarely executed. Otherwise, I'll have to deal with timeouts.

I would use the following index, so that departed_at values that don't match can be eliminated before probing the table:
CREATE INDEX activities_arrived_departed_idx ON activities (arrived_at, departed_at);

On Postgres, you may use DISTINCT ON:
SELECT DISTINCT ON (user_id) *
FROM activities
ORDER BY user_id, arrived_at::date DESC;
This assumes that you only want to report the latest record, as determined by the arrival date, for each user. If instead you just want to show all records with the latest arrival date across the entire table, then use:
SELECT *
FROM activities
WHERE arrived_at::date = (SELECT MAX(arrived_at::date) FROM activities);
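If the arrived_at::date comparisons themselves become the bottleneck, a hedged option (a sketch assuming Postgres and a plain timestamp column; the index name is just an example) is an expression index on the cast, so both the MAX() subquery and the outer equality filter can use it:
-- Sketch, assuming Postgres and timestamp WITHOUT time zone: the cast to date
-- is immutable (and therefore indexable) only in that case.
CREATE INDEX activities_arrived_date_idx ON activities ((arrived_at::date));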

Related

SQL ORDER BY with 2 timestamp columns

I'm trying to build a blog-type website for a few of my programmer friends where we can share tips and articles about code, Linux, software, etc.
I'm building the post system in PostgreSQL and so far it's been going quite well. The thing that stumps me, however, is sorting by two timestamptz columns with ORDER BY. I want to be able to have a created timestamp, but also a modified timestamp. This should sort by newest post (created OR modified most recently). I came up with this -- post 135 should be on top, but the modified posts are taking precedence.
I would preferably like to have both modified and created fields available so I can display: "created on xx-xx-xx, last updated yy-yy-yy".
SELECT posts.postid, users.id, posts.modified, posts.created
FROM posts
JOIN users ON posts.userid=users.id
WHERE posts.isdraft=false
ORDER BY posts.modified DESC NULLS LAST, posts.created DESC;
postid | id | modified | created
--------+-----+-------------------------------+-------------------------------
100 | 999 | 2022-11-28 01:57:07.495482-06 | 2022-11-27 21:43:34.132985-06
115 | 111 | 2022-11-28 01:55:05.9358-06 | 2022-11-27 21:43:34.137873-06
135 | 999 | | 2022-11-28 02:28:20.64009-06
130 | 444 | | 2022-11-28 01:42:49.301489-06
110 | 42 | | 2022-11-27 21:43:34.137254-06
(the reason for the JOIN is that I'll need the username attached to the user id but I omitted it here for space)
All help is appreciated, thanks!
Sort by the greatest of the two timestamps. Here is your query with this modification.
SELECT posts.postid, users.id, posts.modified, posts.created
FROM posts
JOIN users ON posts.userid=users.id
WHERE not posts.isdraft
ORDER BY greatest(posts.modified, posts.created) DESC;
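Note that GREATEST in Postgres ignores NULL arguments, so rows where modified is NULL still sort by created. If the posts table grows large, a hedged option (a sketch; the index name is just an example) is an expression index matching the new sort key:
-- Sketch, assuming Postgres: index the same expression used in the ORDER BY so
-- the "newest activity first" sort can potentially use an index scan.
CREATE INDEX posts_last_activity_idx ON posts ((GREATEST(modified, created)) DESC);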

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So for example, for all users with one or more check-ins, I want the percentage of users who checked in on their 2nd day, on their 3rd day, and so on.
My SQL skills are pretty basic as it's not a tool that I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date |
=====================================
| 1 | 2020-09-02 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 12:00:00 |
-------------------------------------
| 1 | 2020-09-04 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 11:00:00 |
-------------------------------------
| ... |
-------------------------------------
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
=================================
| 70% | 67 % | 44% | 32% |
---------------------------------
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
select sum( (ci.date = ci.min_date)::int ) * 1.0 / u.num_users as day_0,
       sum( (ci.date = ci.min_date + interval '1 day')::int ) * 1.0 / u.num_users as day_1,
       sum( (ci.date = ci.min_date + interval '2 day')::int ) * 1.0 / u.num_users as day_2
from (select u.*, count(*) over () as num_users
      from users u
     ) u left join
     (select ci.user_id, ci.date::date as date,
             min(min(ci.date::date)) over (partition by ci.user_id) as min_date
      from check_in ci
      group by ci.user_id, ci.date::date
     ) ci
     on ci.user_id = u.id   -- assuming the users table's primary key column is "id"
group by u.num_users;
Note that this aggregates the check_in table by user id and date. This ensures that there is only one row per user per date.
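If you later need more than a fixed number of day_N columns, a hedged alternative (a sketch assuming Postgres and the check_in / users tables from the question) is to produce one row per day offset instead of one column per day:
-- Sketch, assuming Postgres: retention in "long" form, one row per day offset.
with firsts as (
    select user_id, min(date::date) as first_day
    from check_in
    group by user_id
),
daily as (
    select distinct ci.user_id, ci.date::date - f.first_day as day_offset
    from check_in ci
    join firsts f using (user_id)
)
select day_offset,
       round(100.0 * count(*) / (select count(*) from users), 1) as pct_of_users
from daily
group by day_offset
order by day_offset;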

JOIN or analytic function to match different sensors on nearby timestamps within a large dataset?

I have a large dataset consisting of four sensors in a single stream, but for simplicity's sake let's reduce that to two sensors that transmit at approximately (but not exactly) the same times, like this:
+---------+-------------+-------+
| Sensor | Time | Value |
+---------+-------------+-------+
| SensorA | 10:00:01.14 | 10 |
| SensorB | 10:00:01.06 | 8 |
| SensorA | 10:00:02.15 | 11 |
| SensorB | 10:00:02.07 | 9 |
| SensorA | 10:00:03.14 | 13 |
| SensorA | 10:00:04.09 | 12 |
| SensorB | 10:00:04.13 | 6 |
+---------+-------------+-------+
I am trying to find the difference between SensorA and SensorB when their readings are within a half-second of each other. Like this:
+-------------+-------+
| Trunc_Time | Diff |
+-------------+-------+
| 10:00:01 | 2 |
| 10:00:02 | 2 |
| 10:00:04 | 6 |
+-------------+-------+
I know I could write queries to put each sensor in its own table (say SensorA_table and SensorB_table), and then join those tables like this:
SELECT
TIMESTAMP_TRUNC(a.Time, SECOND) as truncated_sec,
a.Value - b.Value as sensor_diff
FROM SensorA_table AS a JOIN SensorB_Table AS b
ON b.Time BETWEEN TIMESTAMP_SUB(a.Time, INTERVAL 500 MILLISECOND) AND TIMESTAMP_ADD(a.Time, INTERVAL 500 MILLISECOND)
But that seems very expensive to make every row of the SensorA_table compare against every row of the SensorB_table, given that the sensor tables are each about 10 TB. Or does partitioning automatically take care of this and only look at one block of SensorB's table per row of SensorA's table?
Either way, I am wondering if there is a better way to do this than a full JOIN. Since the matching values are all coming from within a few rows of each other in the original table, it seems like an analytic function might be able to look at a smaller amount of data at a time, but because we can't guarantee alternating rows of A & B, there's no clear LAG or LEAD offset that would always return the correct row.
Is it a matter of writing an analytic function to return a few LAG and LEAD rows for each row, then evaluating each of those rows with a CASE statement to see if it is the correct row, and then calculating the value? Or is there a way of doing a join against an analytic function's window?
Thanks for any guidance here.
One method uses lag(). Something like this:
select st.time, st.value - st.prev_value as diff
from (select st.*,
             lag(sensor) over (order by time, sensor) as prev_sensor,
             lag(time) over (order by time, sensor) as prev_time,
             lag(value) over (order by time, sensor) as prev_value
      from sensor_table st
     ) st
where st.sensor <> st.prev_sensor
  and st.prev_time >= timestamp_sub(st.time, interval 500 millisecond);
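If you want the output in exactly the shape shown in the question (truncated to the second, and always SensorA minus SensorB regardless of which row comes first), here is a hedged extension of the same idea, assuming BigQuery Standard SQL and a single combined table named sensor_table:
-- Sketch: orient the difference with a CASE and report the truncated second.
select timestamp_trunc(time, second) as trunc_time,
       case when sensor = 'SensorA' then value - prev_value
            else prev_value - value
       end as diff
from (
    select s.*,
           lag(sensor) over (order by time) as prev_sensor,
           lag(time)   over (order by time) as prev_time,
           lag(value)  over (order by time) as prev_value
    from sensor_table s
) t
where sensor != prev_sensor
  and prev_time >= timestamp_sub(time, interval 500 millisecond);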

SQL Query X Days back excluding date ranges (Confusing!)

Ok, I have a tough SQL query, and I'm not sure how to go about writing it.
I am summing the number of "bananas collected" by an employee within the last X days, but what I could really use help on is determining X.
The "last X days" value is defined to be the last 100 days that the employee was NOT out due to Purple Fever, starting from some ChosenDate (we'll say today, 6/24/14). That is to say, if the person was sick with Purple Fever for 3 days, then I want to look back over the last 103 days from ChosenDate rather than the last 100 days. Any other reason the employee may have been out does not affect our calculation.
Table PersonOutIncident
+----------------------+----------+-------------+
| PersonOutIncidentID | PersonID | ReasonOut |
+----------------------+----------+-------------+
| 1 | Sarah | PurpleFever |
| 2 | Sarah | PaperCut |
| 3 | Jon | PurpleFever |
| 4 | Sarah | PurpleFever |
+----------------------+----------+-------------+
Table PersonOutDetail
+-------------------+----------------------+-----------+-----------+
| PersonOutDetailID | PersonOutIncidentID | BeginDate | EndDate |
+-------------------+----------------------+-----------+-----------+
| 1 | 1 | 1/1/2014 | 1/3/2014 |
| 2 | 1 | 1/7/2014 | 1/13/2014 |
| 3 | 2 | 2/1/2014 | 2/3/2014 |
| 4 | 3 | 1/15/2014 | 1/20/2014 |
| 5 | 4 | 5/1/2014 | 5/15/2014 |
+-------------------+----------------------+-----------+-----------+
The tables are established. Many PersonOutDetail records can be associated with one PersonOutIncident record and there may be multiple PersonOutIncident records for a single employee (That is to say, there could be two or three PersonOutIncident records with an identical ReasonOut column, because they represent a particular incident or event and the not-necessarily-continuous days lost due to that particular incident)
The nature of this requirement complicates things, even conceptually, for me.
The best I can think of is to check for a BeginDate/EndDate pair within the 100-day base period, then determine the number of days from BeginDate to EndDate and add that to the base 100 days. But then I would have to check again that this new range doesn't overlap or contain additional BeginDate/EndDate pairs and, if so, add those days as well. I can tell already that this isn't the method I want to use, but I can't quite wrap my mind around exactly how to start/structure this query. Does anyone have an idea that might steer me in the correct direction? I realize this might not be clear, and I apologize if I'm just confusing things.
One way to do this is to work with a table or WITH clause that contains a list of days. Let's say days is a table with one column that contains the last 200 days. (This means the query will break if the employee had more than 100 sick days in the last 200 days.)
Now you can get a list of all working days of an employee like this (replace ? with the employee id):
WITH t1 AS
(
    SELECT day,
           ROW_NUMBER() OVER (ORDER BY day DESC) AS RowNumber
    FROM days d
    WHERE NOT EXISTS (SELECT * FROM PersonOutDetail pd
                      INNER JOIN PersonOutIncident po ON po.PersonOutIncidentID = pd.PersonOutIncidentID
                      WHERE d.day BETWEEN pd.BeginDate AND pd.EndDate
                        AND po.ReasonOut = 'PurpleFever'
                        AND po.PersonID = ?)
)
SELECT * FROM t1
WHERE RowNumber <= 100;
Alternatively, you can obtain the '100th day' by replacing RowNumber <= 100 with RowNumber = 100.
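If you would rather not maintain a physical days table, here is a hedged sketch (assuming SQL Server; the variable name and dates are just examples) that generates the last 200 days with a recursive CTE and could stand in for the days table above:
-- Sketch: generate ChosenDate and the 199 days before it.
DECLARE @ChosenDate date = '2014-06-24';

WITH days AS
(
    SELECT @ChosenDate AS day
    UNION ALL
    SELECT DATEADD(DAY, -1, day)
    FROM days
    WHERE day > DATEADD(DAY, -199, @ChosenDate)
)
SELECT day
FROM days
OPTION (MAXRECURSION 200);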

find max logons in an interval

I have a table with timestamps and states for people.
| user_id | state | start_time          | end_time            |
|---------|-------|---------------------|---------------------|
| 4711    | 1     | 2013-10-30 09:01:23 | 2013-10-30 17:12:03 |
| 4712    | 1     | 2013-10-30 07:01:23 | 2013-10-30 18:12:03 |
| 4713    | 1     | 2013-10-30 08:01:23 | 2013-10-30 16:12:03 |
| 4714    | 1     | 2013-10-30 09:01:24 | 2013-10-30 17:02:03 |
My challenge is to find out the MAX and AVG number of users logged on at the same time per interval. I think I can get there once I can see how many users are simultaneously logged in each second.
| timestamp  | state | userid |
|------------|-------|--------|
| 1383123683 | 1     | 4711   |
| 1383123684 | 1     | 4711   |
| 1383123684 | 1     | 4712   |
| 1383123685 | 1     | 4711   |
| 1383123685 | 1     | 4712   |
| ...        | ...   | ...    |
By the way, one interval is a quarter of an hour.
The data comes in via INSERT INTO, so my idea was to create a trigger that writes one row per second (UNIX timestamp) between start and end into a helper table, together with the state_id.
In the end, it must be possible to group by second and count the rows to find out how many exist for each second. For the AVG I don't have a formula yet :-). It's a question of time, you know.
But I'm not sure my idea is a good one, because I fear it needs a lot of performance and space.
A better idea would be to store just the start time and end time, but then I lose the possibility of grouping by second.
How can I manage that without thousands of rows in my database?
There can be several solutions; I want to describe one, and I hope you can use/adapt/extend it for your particular needs (NOTE: I'm using MySQL dialect; for MS SQL the syntax can be a little different, but the approach will work):
1. Create a new table with a structure like:
create table changelog (
changetime datetime,
changevalue int,
totalsum int,
primary key (changetime)
);
2. Insert the basic data:
insert into changelog
select changet, sum(cnts), 0
from
(
select start_time as changet, 1 as cnts from testlog
union all
select end_time as changet, -1 from testlog
) as q
group by changet;
3. Update the totalsum column:
update changelog as a set totalsum = ifnull((select sum(changevalue) from (select changet, sum(cnts) as changevalue, 0
from
(
select start_time as changet, 1 as cnts from testlog
union all
select end_time as changet, -1 from testlog
) as q
group by changet) as b where b.changet<=a.changetime),0);
NOTE: for MS SQL you can try the WITH (CTE) syntax; there you should be able to do this insert/update as one query. With window functions you can also do it in one statement, as sketched below.
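Here is a hedged sketch (assuming MySQL 8.0+, which has window functions) of steps 2 and 3 combined into a single INSERT, using SUM() OVER instead of the correlated UPDATE:
-- Sketch: build changevalue and the running total in one pass.
insert into changelog (changetime, changevalue, totalsum)
select changet,
       changevalue,
       sum(changevalue) over (order by changet) as totalsum
from (
    select changet, sum(cnts) as changevalue
    from (
        select start_time as changet, 1 as cnts from testlog
        union all
        select end_time as changet, -1 as cnts from testlog
    ) as q
    group by changet
) as c;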
4. After this you will have (based on the data from the question):
2013-10-30 07:01:23 1 1
2013-10-30 08:01:23 1 2
2013-10-30 09:01:23 1 3
2013-10-30 09:01:24 1 4
2013-10-30 16:12:03 -1 3
2013-10-30 17:02:03 -1 2
2013-10-30 17:12:03 -1 1
2013-10-30 18:12:03 -1 0
As you can see, the max number of logged-in users is already there, but there is one problem: imagine that you need to select data for the range 08:00-08:01. There is no data in the table for that range, so a query like this will not work:
SELECT max(totalsum)
FROM changelog
where changetime between cast(@startrange as datetime) and cast(@endrange as datetime)
but you can change it to:
SELECT max(totalsum)
from
(
select max(totalsum) as totalsum FROM changelog
where changetime between cast(@startrange as datetime) and cast(@endrange as datetime)
union all
select totalsum from changelog where changetime=(select max(changetime) from changelog where changetime<cast(@startrange as datetime))
) as q;
So, basically: in addition to your range, you need to fetch the last row before the period starts, to find out how many users were logged in at the moment the range starts.
5. Now you want to calculate the average. Average is a tricky function: depending on what you mean by it, there can be different results, such as average users per second or average workload.
Here is the difference:
100 users logged in at 09:00
98 users logged out at 09:01
1 user logged out at 09:02
Selection range: 09:00 - 09:59 (inclusive)
The average per minute will be the sum of logged-in users in each minute, divided by 60:
(100 + 2 + 1 + 57*1)/60 = 2.6(6) users per minute
But average workload can be calculated as (max(logged_users) + min(logged_users)) / 2:
(100 + 1)/2 = 50.5 users; this is the average number of simultaneous users logged in to the system.
Another average can be calculated via SQL's avg (sum(values)/count(values)), which gives us
(100+98+1)/3 = 66.3(3), yet another average workload in persons.
The first formula tells us there are only about 2.67 users at a time, but the second shows "holy #*&####, it is 50.5 users at the same time".
Another example:
100 users logged in at 09:00
99 users logged out at 09:58
1 user logged out at 09:59
Selection range: 09:00 - 09:59 (inclusive)
The first formula gives you (100*58 + 2 + 1)/60 = 96.71(6) users, the second still gives 50.5, and the third still gives 66.3(3).
Which average suits you best?
To calculate the 1st average you need a stored procedure (or a query, as sketched after this list) that fetches the data for each minute/second of the period, sums it up, and then divides.
To calculate the 2nd variant: just select min/max and divide by 2.
3rd variant: use avg instead of max.
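Here is a hedged sketch (assuming MySQL 8.0+, which supports recursive CTEs, and the changelog table built above; the interval boundaries are example values) of the 1st average for one quarter-hour interval, without a stored procedure:
-- Sketch: for each minute in the interval, look up the head count in effect,
-- then average those per-minute head counts.
with recursive minutes as (
    select cast('2013-10-30 09:00:00' as datetime) as m
    union all
    select m + interval 1 minute
    from minutes
    where m < cast('2013-10-30 09:14:00' as datetime)
),
per_minute as (
    select coalesce((select c.totalsum
                     from changelog c
                     where c.changetime <= minutes.m
                     order by c.changetime desc
                     limit 1), 0) as logged_in
    from minutes
)
select avg(logged_in) as avg_users_per_minute
from per_minute;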
Note #1: of course, all these approaches are quite slow with huge traffic, so I suggest preparing some "pre-calculated" tables with data that can be fetched fast (for example, data for each hour like: YYYY-MM-DD HH, loggedInAtStart, min, avg, median, max, loggedInAtEnd).
Note #2: sometimes the median is more interesting for statistical purposes. To obtain it, for each minute calculate how many users were logged in, then either select the distinct values and take the middle of that list (for my examples this gives 2 and 2), or select all values and take the middle one (for my examples it gives 1 and 99).