Counting values with almost the same time - SQL

Good afternoon. Here is the situation: a train carries a geotag that reports its position, and the location data is written to a table. I need to count how many times the train was in a certain zone. The problem is that while the train stays in a zone, the geotag writes several records with different timestamps. What query can I use to count the number of arrivals?
I wrote a query that counts how many times the train was at zone 270 and at zone 289. To do this I rounded the time to hours, but if the train arrived at the end of one hour and left at the beginning of the next, the query counts it as two arrivals. Below are the query itself and the output.
Create temp table tmpTable_1 ON COMMIT DROP as
  select addr, zone_id, DATE_PART('hour', time) * 100 as IntTime
  from trac_path_rmp
  where time between '2022.04.06' and '2022.04.07';

Create temp table tmpTable_2 ON COMMIT DROP as
  select addr, zone_id, IntTime
  from tmpTable_1
  where addr in (12421, 12422, 12423, 12425)
  group by addr, zone_id, IntTime;

select addr,
  sum(case when zone_id = 289 then 1 else 0 end) as "Zone 289",
  sum(case when zone_id = 270 then 1 else 0 end) as "Zone 270"
from tmpTable_2
group by addr order by addr;

We can use LAG() OVER () to get the timestamp of the previous row and only return rows where there is at least a minute's difference. This could easily be changed to 5 minutes, for example.
We also keep the first row, where LAG returns null.
We need to combine hours and minutes because if we only looked at the minutes component we would get a 0 difference when rows are exactly an hour apart.
See the db<>fiddle link below.
WITH cte AS (
  SELECT *,
         time_ - LAG(time_) OVER (ORDER BY id) AS dd
  FROM table_name
)
SELECT id, time_, addr, x, y, z, zone_id, type
FROM cte
WHERE 60 * DATE_PART('hours', dd) + DATE_PART('minutes', dd) > 0
   OR dd IS NULL;
id | time_ | addr | x | y | z | zone_id | type
--: | :------------------ | ----: | ------: | ------: | ------: | ------: | ---:
138 | 2022-04-06 19:19:11 | 12421 | 9793.50 | 4884.70 | -125.00 | 270 | 1
141 | 2022-04-06 20:37:23 | 12421 | 9736.00 | 4856.90 | -125.00 | 270 | 1
146 | 2022-04-06 22:58:15 | 12421 | 9736.00 | 4856.90 | -125.00 | 270 | 1
db<>fiddle here
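To get the per-zone arrival counts the original query was after, you can aggregate the filtered rows from the CTE above. A sketch under the same assumptions (same table_name and columns; add the date range and addr filter from the question as needed):

WITH cte AS (
  SELECT *,
         time_ - LAG(time_) OVER (ORDER BY id) AS dd
  FROM table_name
), arrivals AS (
  SELECT *
  FROM cte
  WHERE 60 * DATE_PART('hours', dd) + DATE_PART('minutes', dd) > 0
     OR dd IS NULL
)
SELECT addr,
       SUM(CASE WHEN zone_id = 289 THEN 1 ELSE 0 END) AS "Zone 289",
       SUM(CASE WHEN zone_id = 270 THEN 1 ELSE 0 END) AS "Zone 270"
FROM arrivals
GROUP BY addr
ORDER BY addr;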

Related

What's the best way to optimize a query searching for 1000 rows among 50 million by date? (Oracle)

I have a table:
CREATE TABLE ard_signals
(id, val, str, date_val, dt);
This table records the values of all IDs' signals. There are around 950 unique IDs. At the moment the table contains about 50 million rows with different values of these signals. Each ID can have only numeric values, string values, or date values.
I get the last value for each ID that, by the condition, is not later than an input date:
select ID,
       max(val) keep (dense_rank last order by dt desc) as val,
       max(str) keep (dense_rank last order by dt desc) as str,
       max(date_val) keep (dense_rank last order by dt desc) as date_val,
       max(dt)
from ard_signals
where dt <= to_date(any_date)
group by id;
I have an index on ID. At the moment the query takes about 30 seconds. Please help: what optimizations could be made for this query?
EXPLAIN PLAN: with dt index
Example data (there are about 950-1000 rows like this):
| ID  | VAL | STR | DATE_VAL | DT                 |
|-----|-----|-----|----------|--------------------|
| 920 | 0   |     |          | 20.07.2022 9:59:11 |
| 490 |     | yes |          | 20.07.2022 9:40:01 |
| 565 | 233 |     |          | 20.07.2022 9:32:03 |
| 32  | 1   |     |          | 20.07.2022 9:50:01 |
TL;DR You need your application to maintain a table of distinct id values.
So, you want the last record for each group (distinct id value) in your table, without doing a full table scan. It seems like it should be easy to tell Oracle: iterate through the distinct values for id and then do an index lookup to get the last dt value for each id and then give me that row. But looks are deceiving here -- it's not easy at all, if it is even possible.
Think about what an index on (id) (or (id, dt)) has. It has a bunch of leaf blocks and a structure to find the highest value of id in each block. But Oracle can only use the index to get all the distinct id values by reading every leaf block in the index. So, we might find a way to trade our TABLE FULL SCAN for an INDEX_FFS (index fast full scan) for a marginal benefit, but it's not really what we're hoping for.
What about partitioning? Can't we create ARD_SIGNALS with PARTITION BY LIST (id) AUTOMATIC and use that? Now the data dictionary is guaranteed to have a separate partition for each distinct id value.
But again - think about what Oracle knows (from DBA_TAB_PARTITIONS) -- it knows what the highest partition key value is in each partition. It is true: for a list partitioned table, that highest value is guaranteed to be the only distinct value in the partition. But I think using that guarantee requires special optimizations that Oracle's CBO does not seem to make (yet).
So, unfortunately, you are going to need to modify your application to keep a parent table for ARD_SIGNALS that has a (single) row for each distinct id.
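For illustration, that parent table can be very small. A minimal sketch (the name ARD_IDS matches the queries below; the column type and the one-time initial load are assumptions, and your application would keep it in sync afterwards):

CREATE TABLE ard_ids (id NUMBER PRIMARY KEY);

-- one-time initial load; the application maintains it from then on
INSERT INTO ard_ids (id)
SELECT DISTINCT id FROM ard_signals;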
Even then, it's kind of difficult to get what we want. Because, again, we want Oracle to iterate through the distinct id values, then use the index to find the row with the highest dt for that id... and then stop. So, we're looking for an execution plan that makes use of the INDEX RANGE SCAN (MIN/MAX) operation.
I find that tricky with joins, but not so hard with scalar subqueries. So, assuming we named our parent table ARD_IDS, we can start with this:
SELECT /*+ NO_UNNEST(@ssq) */ i.id,
       ( SELECT /*+ QB_NAME(ssq) */ max(dt)
         FROM ard_signals s
         WHERE s.id = i.id
         AND s.dt <= to_date(trunc(SYSDATE) + 2 + 10/86400) -- replace with your date variable
       ) max_dt
FROM ard_ids i;
---------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.01 | 22 |
| 1 | SORT AGGREGATE | | 1000 | 1 | 1000 |00:00:00.02 | 3021 |
| 2 | FIRST ROW | | 1000 | 1 | 1000 |00:00:00.01 | 3021 |
|* 3 | INDEX RANGE SCAN (MIN/MAX)| ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.01 | 3021 |
| 4 | TABLE ACCESS FULL | ARD_IDS | 1 | 1000 | 1000 |00:00:00.01 | 22 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("S"."ID"=:B1 AND "S"."DT"<=TO_DATE(TO_CHAR(TRUNC(SYSDATE#!)+2+.00011574074074074074
0740740740740740740741)))
Note the use of hints to keep Oracle from merging the scalar subquery into the rest of the query and losing our desired access path.
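As an aside, the INDEX RANGE SCAN (MIN/MAX) in the plan relies on a composite index covering (id, dt). The name ARD_SIGNALS_N1 comes from the plan output; the exact definition below is my assumption:

CREATE INDEX ard_signals_n1 ON ard_signals (id, dt);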
Next, it is a matter of using those (id, max(dt)) combinations to look up the rows from the table to get the other column values. I came up with this; improvements may be possible (especially if (id, dt) is not as selective as I am assuming it is):
with k AS (
  select /*+ NO_UNNEST(@ssq) */ i.id,
         ( SELECT /*+ QB_NAME(ssq) */ max(dt)
           FROM ard_signals s
           WHERE s.id = i.id
           AND s.dt <= to_date(trunc(SYSDATE) + 2 + 10/86400)
         ) max_dt
  from ard_ids i
)
SELECT k.id,
       max(val) keep ( dense_rank first order by dt desc, s.rowid ) val,
       max(str) keep ( dense_rank first order by dt desc, s.rowid ) str,
       max(date_val) keep ( dense_rank first order by dt desc, s.rowid ) date_val,
       max(dt) keep ( dense_rank first order by dt desc, s.rowid ) dt
FROM k
INNER JOIN ard_signals s ON s.id = k.id AND s.dt = k.max_dt
GROUP BY k.id;
--------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1000 |00:00:00.04 | 7009 | | | |
| 1 | SORT GROUP BY | | 1 | 1000 | 1000 |00:00:00.04 | 7009 | 302K| 302K| 268K (0)|
| 2 | NESTED LOOPS | | 1 | 1005 | 1000 |00:00:00.04 | 7009 | | | |
| 3 | NESTED LOOPS | | 1 | 1005 | 1000 |00:00:00.03 | 6009 | | | |
| 4 | TABLE ACCESS FULL | ARD_IDS | 1 | 1000 | 1000 |00:00:00.01 | 3 | | | |
|* 5 | INDEX RANGE SCAN | ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.02 | 6006 | | | |
| 6 | SORT AGGREGATE | | 1000 | 1 | 1000 |00:00:00.02 | 3002 | | | |
| 7 | FIRST ROW | | 1000 | 1 | 1000 |00:00:00.01 | 3002 | | | |
|* 8 | INDEX RANGE SCAN (MIN/MAX) | ARD_SIGNALS_N1 | 1000 | 1 | 1000 |00:00:00.01 | 3002 | | | |
| 9 | TABLE ACCESS BY GLOBAL INDEX ROWID| ARD_SIGNALS | 1000 | 1 | 1000 |00:00:00.01 | 1000 | | | |
--------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("S"."ID"="I"."ID" AND "S"."DT"=)
8 - access("S"."ID"=:B1 AND "S"."DT"<=TO_DATE(TO_CHAR(TRUNC(SYSDATE#!)+2+.000115740740740740740740740740740740740741)))
... 7000 gets and 4/100ths of a second.
You want to see the latest entry per ID at a given time.
If you were usually interested in times near the beginning of the recordings, an index on the time could help limit the rows to work with. But I consider this highly unlikely.
It is much more likely that you are interested in the situation at more recent times. This means, for instance, that when looking for the latest entries up to today, for one ID the latest entry may be found yesterday, while for another ID the latest entry may be from two years ago.
In my opinion, Oracle already chooses the best approach to deal with this: read the whole table sequentially.
If you have several CPUs at hand, parallel execution might speed up things:
select /*+ parallel(4) */ ...
If you really need this to be much faster, then you may want to consider an additional table with one row per ID, which gets a copy of the latest date and value via a trigger. I.e. you'd introduce redundancy for the gain of speed, as is done in data warehouses.
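A rough sketch of that idea, with hypothetical names (ard_signals_latest for the summary table; column types inferred from the description) could look like this:

CREATE TABLE ard_signals_latest (
  id       NUMBER PRIMARY KEY,
  val      NUMBER,
  str      VARCHAR2(400),
  date_val DATE,
  dt       DATE
);

CREATE OR REPLACE TRIGGER trg_ard_signals_latest
AFTER INSERT ON ard_signals
FOR EACH ROW
BEGIN
  -- keep only the most recent row per id
  MERGE INTO ard_signals_latest l
  USING (SELECT :new.id AS id FROM dual) src
  ON (l.id = src.id)
  WHEN MATCHED THEN UPDATE
    SET l.val = :new.val, l.str = :new.str, l.date_val = :new.date_val, l.dt = :new.dt
    WHERE :new.dt >= l.dt
  WHEN NOT MATCHED THEN INSERT (id, val, str, date_val, dt)
    VALUES (:new.id, :new.val, :new.str, :new.date_val, :new.dt);
END;
/

Note that such a summary table only answers "latest as of now"; for an arbitrary historical cut-off date you would still have to query ard_signals itself.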

Calculating percentage change of a metric for a group in PostgreSQL

I have a sample table as follows:
id | timestamp | agentid | input_interface | sourceipv4address | totalbytes_sum
-------+------------+-----------------+-----------------------------+-------------------+----------------
10733 | 1593648000 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.10 | 3857
10734 | 1593648000 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.101 | 45960
10731 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.10 | 20579
10736 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.101 | 21384
10737 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.107 | 2094
This table is populated by taking samples from a network every 10 minutes. Basically I am trying to build a view to calculate the percentage change on totalbytes_sum for each group (agentid,input_interface,sourceipv4address) and show it as:
timestamp | agentid | input_interface | sourceipv4address | totalbytes_sum | percent
The calculation needs to happen based on the current 10-minute interval and the previous one. I can guarantee that there will be only one row for a particular agentid, input_interface, sourceipv4address combination within the same 10 minutes.
If a combination did not have a record within the previous 10 minutes, the percentage should be +100%.
I was trying to apply partition/order logic but had no luck. The lag/offset functions seem like a good fit too, but I am pretty much stuck.
Can someone please assist me.
Thanks
Your timestamps are all the same. I assume they would be ~600 seconds apart in your actual data.
Please try something like this and let me know in comments if it does not work for you or if you need explanation for any of it:
select timestamp, agentid, input_interface,
sourceipv4address, totalbytes_sum,
timestamp - lag(timestamp) over w as elapsed_time, -- illustration column
lag(totalbytes_sum) over w as last_totalbytes_sum, -- illustration column
case
when coalesce(lag(timestamp) over w, 0) = 0 then 100.0
when timestamp - lag(timestamp) over w > 600 then 100.0
else 100.0 * (totalbytes_sum - (lag(totalbytes_sum) over w)) /
(lag(totalbytes_sum) over w)
end as percent
from sample_table
window w as (partition by agentid, input_interface, sourceipv4address
order by timestamp)
;
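Since the stated goal was a view, the same query (minus the illustration columns) can be wrapped directly. A sketch, with a hypothetical view name traffic_percent_change:

create view traffic_percent_change as
select timestamp, agentid, input_interface, sourceipv4address, totalbytes_sum,
       case
         when coalesce(lag(timestamp) over w, 0) = 0 then 100.0
         when timestamp - lag(timestamp) over w > 600 then 100.0
         else 100.0 * (totalbytes_sum - lag(totalbytes_sum) over w) /
              lag(totalbytes_sum) over w
       end as percent
from sample_table
window w as (partition by agentid, input_interface, sourceipv4address
             order by timestamp);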

Aggregate results split by day

I'm trying to write a query that returns summarised data, per day, over many days of data.
For example
| id | user_id | start
|----|---------|------------------------------
| 1 | 1 | 2020-02-01T17:35:37.242+00:00
| 2 | 1 | 2020-02-01T13:25:21.344+00:00
| 3 | 1 | 2020-01-31T16:42:51.344+00:00
| 4 | 1 | 2020-01-30T06:44:55.344+00:00
The outcome I'm hoping for is a function that I can pass a user id and a timezone, or UTC offset, into and get out:
| day | count |
|---------|-------|
| 1/2/20 | 2 |
| 31/1/20 | 1 |
| 30/1/20 | 7 |
Where the count is all the rows that have a start time falling between 00:00:00.0000 and 23:59:59.9999 on each day - taking into consideration the supplied UTC offset.
I don't really know where to start writing a query like this, and the fact that I can't even picture where to start feels like a big gap in my SQL thinking. How should I approach something like this?
You can use:
select date_trunc('day', start) as dte, count(*)
from t
where userid = ?
group by date_trunc('day', start)
order by dte;
If you want to handle an additional offset, build that into the query:
select dte, count(*)
from t cross join lateral
(values (date_trunc('day', start + ? * interval '1 hour'))) v(dte)
where userid = ?
group by v.dte
order by v.dte;
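Since the question asked for a function that takes the user id and a UTC offset, the second query can be wrapped like this. A sketch, assuming the table is t with columns user_id and start, an offset in whole hours, and a hypothetical function name counts_per_day:

create or replace function counts_per_day(p_user_id int, p_offset_hours int)
returns table (day date, cnt bigint)
language sql
as $$
    select date_trunc('day', start + p_offset_hours * interval '1 hour')::date as day,
           count(*) as cnt
    from t
    where user_id = p_user_id
    group by 1
    order by 1 desc;
$$;

It would then be called as select * from counts_per_day(1, 10); for user 1 with a UTC+10 offset.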

Define Sessions not only with time limit

I've been trying to group an events-like table (user, action, event_time) into sessions.
But the common "idle time" approach is not enough.
I need to check whether the user was idle for more than a period X of time (that part is fine), BUT if the user started a game, they will very likely be idle for a long time, and every action between game start and game end, regardless of the time interval, should not be considered a new session. When the game finishes, however, a new session shows up:
For example (idle time 5 min)
| action | user | event_at | new_session? (desired output) |
|---------------|------|--------------|-------------------------------|
| random1 | 1 | 1 sec | 1 |
| random3 | 1 | 30 sec | 0 |
| random4 | 1 | 6:00 min | 1 |
| random5 | 1 | 7:00 min | 0 |
| game_started | 1 | 7:30 min | 0 |
| random2 | 1 | 20:00 min | 0 |
| random5 | 1 | 27:00 min | 0 |
| game_finished | 1 | 35:00 min | 0 |
| random5 | 1 | 35:30 min | 1 |
The problem is those random actions between game_started and game_finished. I cannot tell SQL to ignore them and not count them as new sessions when using the idle-time logic (which is still needed for the parts not between start and finish).
In a "proper" programming language I could just set a flag once a "for" or "while" loop finds "game_started" and ignore everything until "game_finished" is found. But in SQL this is simply not that easy, even using an auxiliary column.
Any ideas?
Thanks in advance!
Identify the games first, then use lag() for the session start:
select t.*,
       (case when in_game > 0 and
                  (lag(in_game) over (order by time) is null or
                   lag(in_game) over (order by time) = 0
                  )
                  then 1
             when in_game = 0 and
                  (lag(time) over (partition by in_game order by time) is null or
                   lag(time) over (partition by in_game order by time) < time - interval '5 minute'
                  )
                  then 1
             else 0
        end) as new_session
from (select t.*,
             sum(case when action = 'game_started' then 1
                      when action = 'game_finished' then -1
                      else 0
                 end) over (order by time) as in_game
      from t
     ) t;
This does not handle every edge case. For instance, this version does not handle nested games. It might also be tricky with games right next to each other. But I think it does do what you want.
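If you then want an actual session number per user rather than just the flag, a cumulative sum over new_session gives it. A sketch that restates the query above with CTEs, adds a per-user partition (the sample only shows user 1, so that partition is an assumption), and numbers the sessions:

with base as (
  select t.*,
         sum(case when action = 'game_started' then 1
                  when action = 'game_finished' then -1
                  else 0
             end) over (partition by "user" order by time) as in_game
  from t
), flagged as (
  select base.*,
         (case when in_game > 0 and
                    coalesce(lag(in_game) over (partition by "user" order by time), 0) = 0
                 then 1
               when in_game = 0 and
                    (lag(time) over (partition by "user", in_game order by time) is null or
                     lag(time) over (partition by "user", in_game order by time) < time - interval '5 minute')
                 then 1
               else 0
          end) as new_session
  from base
)
select flagged.*,
       sum(new_session) over (partition by "user" order by time) as session_id
from flagged;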

How can I group by the difference of a column between rows in SQL?

I have a table of events with a created_at timestamp. I want to divide them into groups of events that are no more than N seconds apart, specifically 130 seconds. Then for each group, I just need to know the lowest timestamp and the highest timestamp.
Here's some sample data (ignore the formatting of the timestamp, it's a datetime field):
------------------------
| id | created_at |
------------------------
| 1 | 2013-1-20-08:00 |
| 2 | 2013-1-20-08:01 |
| 3 | 2013-1-20-08:05 |
| 4 | 2013-1-20-08:07 |
| 5 | 2013-1-20-08:09 |
| 6 | 2013-1-20-08:12 |
| 7 | 2013-1-20-08:20 |
------------------------
And what I would like to get as a result is:
-------------------------------------
| started_at | ended_at |
-------------------------------------
| 2013-1-20-08:00 | 2013-1-20-08:01 |
| 2013-1-20-08:05 | 2013-1-20-08:09 |
| 2013-1-20-08:12 | 2013-1-20-08:12 |
| 2013-1-20-08:20 | 2013-1-20-08:20 |
-------------------------------------
I've googled and searched every possible way of phrasing that question and experimented for some time, but I can't figure it out. I can already do this in Ruby, I'm just trying to figure out if it's possible to move this to the database level. If you're curious or it's easier to visualize, here's what it looks like in Ruby:
groups = SortedSet[*events].divide { |a,b| (a.created_at - b.created_at).abs <= 130 }
groups.map do |group|
{ started_at: group.to_a.first.created_at, ended_at: group.to_a.last.created_at }
end
Does anyone know how to do this in SQL, specifically PostgreSQL?
I think you want to start each new grouping when the difference from the previous is greater than 130 seconds. You can do this with lag and date arithmetic to determine where a grouping starts. Then do a cumulative sum to get the grouping:
select Grouping, min(created_at), max(created_at)
from (select t.*,
             sum(GroupStartFlag) over (order by created_at) as Grouping
      from (select t.*,
                   lag(created_at) over (order by created_at) as prevca,
                   (case when extract(epoch from created_at - lag(created_at) over (order by created_at)) <= 130
                         then 0 else 1
                    end) as GroupStartFlag
            from t
           ) t
     ) t
group by Grouping;
The final step is to aggregate by the "grouping" identifier to get the earliest and latest dates.
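To check it against the sample data, here is a self-contained version for PostgreSQL (the VALUES CTE stands in for the real table; the alias is renamed to grp to sidestep the GROUPING keyword):

with t(id, created_at) as (
  values (1, timestamp '2013-01-20 08:00:00'),
         (2, timestamp '2013-01-20 08:01:00'),
         (3, timestamp '2013-01-20 08:05:00'),
         (4, timestamp '2013-01-20 08:07:00'),
         (5, timestamp '2013-01-20 08:09:00'),
         (6, timestamp '2013-01-20 08:12:00'),
         (7, timestamp '2013-01-20 08:20:00')
)
select grp, min(created_at) as started_at, max(created_at) as ended_at
from (select created_at,
             sum(GroupStartFlag) over (order by created_at) as grp
      from (select created_at,
                   (case when extract(epoch from created_at - lag(created_at) over (order by created_at)) <= 130
                         then 0 else 1
                    end) as GroupStartFlag
            from t
           ) x
     ) y
group by grp
order by started_at;

This returns the four (started_at, ended_at) pairs listed in the question.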