=> \d test_table;
Table "public.test_table"
Column | Type | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
timestamp | bigint | | |
source | inet | | |
destination | inet | | |
type | integer | | |
Let's say there's a table with the schema above that contains network connection information, with many more IP pairs and types than shown below.
rawdb=> select * from test_table order by timestamp;
 timestamp  |   source    | destination | type
------------+-------------+-------------+------
 1586940900 | 192.168.1.1 | 192.168.1.2 |    1
 1586940960 | 192.168.1.1 | 192.168.1.2 |    1
 1586941020 | 192.168.1.1 | 192.168.1.2 |    1
 1586941080 | 192.168.1.1 | 192.168.1.2 |    1
 1586941140 | 192.168.1.1 | 192.168.1.2 |    1
(5 rows)
From this table, I need to find the connection pairs that keep connecting at some interval x. For example, in the rows above, the connection from 192.168.1.1 to 192.168.1.2 happens at 60-second intervals.
The table above would be the answer to the question "how many ip pairs are connecting every 60s over the last 5 min?"
Question
How do I extract those periodic connections for various intervals, with the same type and same IP pair? Connections that recur every 1 min, every 5 min, every 30 min.
The baseline is that I can provide the x to search for (e.g. every 60s, every 5 min, every 1 hour, etc.); the best case is a solution that can find x without it being provided.
The result format I need is the same as the table above.
Is it possible to do all of this in SQL? I have done some research on gap analysis for Postgres tables, but this is not about finding the gaps; rather, it is about finding the continuous sequences.
I'm not sure exactly what output you are after, but the difference between two connections of the same source/destination combination can be calculated using a window function.
Something like:
select distinct source, destination
from (
  select *,
         lead("timestamp") over w - "timestamp" as diff
  from test_table
  window w as (partition by source, destination order by "timestamp")
) t
where diff = 60
lead("timestamp") over w - "timestamp" calculates the difference between the current row's timestamp and the next one for the same source/destination pair. I moved the window definition into the FROM clause to make the expression that calculates the diff more readable.
Related
I have a table of encounters called user_dates, ordered by user and encounter_start, like below. I want to create a column indicating whether an encounter was followed up by another encounter within 30 days. So basically, I want to go row by row, checking whether encounter_stop is within 30 days of the encounter_start in the following row (as long as the following row is for the same user).
user | encounter_start | encounter_stop
A    | 4-16-1989       | 4-20-1989
A    | 4-24-1989       | 5-1-1989
A    | 6-14-1993       | 6-27-1993
A    | 12-24-1999      | 1-2-2000
A    | 1-19-2000       | 1-24-2000
B    | 2-2-2000        | 2-7-2000
B    | 5-27-2001       | 6-4-2001
I want a table like this:
user | encounter_start | encounter_stop | subsequent_encounter_within_30_days
A    | 4-16-1989       | 4-20-1989      | 1
A    | 4-24-1989       | 5-1-1989       | 0
A    | 6-14-1993       | 6-27-1993      | 0
A    | 12-24-1999      | 1-2-2000       | 1
A    | 1-19-2000       | 1-24-2000      | 0
B    | 2-2-2000        | 2-7-2000       | 1
B    | 5-27-2001       | 6-4-2001       | 0
You can select ..., exists (select ... criteria); that returns a boolean (always true or false), but if you really want 1 or 0, just cast the result to integer: true => 1 and false => 0. See Demo.
select ts1.user_id
     , ts1.encounter_start
     , ts1.encounter_stop
     , (exists ( select null
                 from test_set ts2
                 where ts1.user_id = ts2.user_id
                   and ts2.encounter_start
                           between ts1.encounter_stop
                               and (ts1.encounter_stop + interval '30 days')::date
               )
       )::integer subsequent_encounter_within_30_days
from test_set ts1
order by user_id, encounter_start;
Difference: The above (and demo) disagree with your expected result:
B | 2-2-2000 | 2-7-2000| 1
subsequent_encounter (last column) should be 0. This entry starts and ends in Feb 2000; the other B entry starts in May 2001. Please explain how these are within 30 days (other than a simple typo, that is).
Caution: Do not use user as a column name. It is both a Postgres and SQL standard reserved word. You can sometimes get away with it, or double quote it; but if you double quote it, you MUST always do so. The big problem is that it has a predefined meaning (run select user;), and if you forget to double quote it, you do not necessarily get an error or exception; you get something much worse: wrong results.
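For completeness, the same result can be had with a window function, in the spirit of the other answers here. A minimal sketch, again against the test_set table with a user_id column; note that lead() only looks at the immediately following encounter, which matches the row-by-row description in the question, whereas exists checks all later encounters:
select user_id
     , encounter_start
     , encounter_stop
     , coalesce((lead(encounter_start) over (partition by user_id order by encounter_start)
                   <= encounter_stop + interval '30 days')::integer, 0)
         as subsequent_encounter_within_30_days
from test_set
order by user_id, encounter_start;
The coalesce turns the null that lead() produces for each user's last row into 0.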
The column "activitie_time_enter" has the times.
The column "activitie_still" indicates the type of activity.
The column "activitie_walking" indicates the other type of activity.
Table example:
activitie_time_enter | activitie_still | activitie_walking
17:30:20 | Still |
17:31:32 | Still |
17:32:24 | | Walking
17:33:37 | | Walking
17:34:20 | Still |
17:35:37 | Still |
17:45:13 | Still |
17:50:23 | Still |
17:51:32 | | Walking
What I need is to sum up the total minutes for each activity separately.
Any suggestions or solution?
First calculate the duration of each activity (the with CTE) and then do a conditional sum.
with t as
(
select
*, lead(activitie_time_enter) over (order by activitie_time_enter) - activitie_time_enter as duration
from _table
)
select
sum (duration) filter (where activitie_still = 'Still') as total_still,
sum (duration) filter (where activitie_walking = 'Walking') as total_walking
from t;
/** Result:
total_still|total_walking|
-----------+-------------+
00:19:16| 00:01:56|
*/
BTW, do you really need two columns (activitie_still and activitie_walking)? A single activity column with those values will do. This will allow more activities (Running, Sleeping, Working, etc.) without having to change the table structure, as in the sketch below.
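With that single column, the query collapses to a plain group by and needs no per-activity filter clauses. A sketch, assuming a hypothetical activity column holding values like 'Still' and 'Walking':
with t as
(
  select
    activity,  -- hypothetical single column replacing activitie_still/activitie_walking
    lead(activitie_time_enter) over (order by activitie_time_enter) - activitie_time_enter as duration
  from _table
)
select activity, sum(duration) as total
from t
group by activity;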
I have a large dataset consisting of four sensors in a single stream, but for simplicity's sake let's reduce that to two sensors that transmit at approximate (but not exact) same times like this:
+---------+-------------+-------+
| Sensor | Time | Value |
+---------+-------------+-------+
| SensorA | 10:00:01.14 | 10 |
| SensorB | 10:00:01.06 | 8 |
| SensorA | 10:00:02.15 | 11 |
| SensorB | 10:00:02.07 | 9 |
| SensorA | 10:00:03.14 | 13 |
| SensorA | 10:00:04.09 | 12 |
| SensorB | 10:00:04.13 | 6 |
+---------+-------------+-------+
I am trying to find the difference between SensorA and SensorB when their readings are within a half-second of each other. Like this:
+-------------+-------+
| Trunc_Time | Diff |
+-------------+-------+
| 10:00:01 | 2 |
| 10:00:02 | 2 |
| 10:00:04 | 6 |
+-------------+-------+
I know I could write queries to put each sensor in its own table (say SensorA_table and SensorB_table), and then join those tables like this:
SELECT
TIMESTAMP_TRUNC(a.Time, SECOND) as truncated_sec,
a.Value - b.Value as sensor_diff
FROM SensorA_table AS a JOIN SensorB_Table AS b
ON b.Time BETWEEN TIMESTAMP_SUB(a.Time, INTERVAL 500 MILLISECOND) AND TIMESTAMP_ADD(a.Time, INTERVAL 500 MILLISECOND)
But it seems very expensive to compare every row of SensorA_table against every row of SensorB_table, given that the sensor tables are each about 10 TB. Or does partitioning automatically take care of this and only look at one block of SensorB's table per row of SensorA's table?
Either way, I am wondering if there is a better way to do this than a full JOIN. Since the matching values are all coming from within a few rows of each other in the original table, it seems like an analytic function might be able to look at a smaller amount of data at a time, but because we can't guarantee alternating rows of A & B, there's no clear LAG or LEAD offset that would always return the correct row.
Is it a matter of writing an analytic function to return a few LAG and LEAD rows for each row, then evaluating each of those rows with a CASE statement to see whether it is the correct row, then calculating the value? Or is there a way of doing a join against an analytic function's window?
Thanks for any guidance here.
One method uses lag(). Something like this:
select timestamp_trunc(st.time, second) as trunc_time,
       st.value - st.prev_value as diff
from (select st.*,
             lag(sensor) over (order by time, sensor) as prev_sensor,
             lag(time) over (order by time, sensor) as prev_time,
             lag(value) over (order by time, sensor) as prev_value
      from sensor_table st
     ) st
where st.sensor <> st.prev_sensor and
      st.time <= timestamp_add(st.prev_time, interval 500 millisecond)
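The where clause keeps a row only when the previous row in time order came from the other sensor and falls within half a second of it, so the stream is read once instead of joining the two 10 TB tables. One caveat: the subtraction order depends on which sensor happened to transmit second, so if the sign must always be SensorA minus SensorB, a hedged tweak to the outer select:
case when st.sensor = 'SensorA'
     then st.value - st.prev_value
     else st.prev_value - st.value
end as diff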
The background to this question is that we have had to hand-roll replication between a 3rd party Oracle database and our SQL Server database since there are no primary keys defined in the Oracle tables but there are unique indexes.
In most cases the following method works fine: we load the values of the columns in the unique index along with an MD5 hash of all column values from each corresponding table in the Oracle and SQL Server databases and are able to then calculate what records need to be inserted/deleted/updated.
However, in one table the sheer number of rows precludes us from loading all records into memory from the Oracle and SQL Server databases. So we need to do the comparison in blocks.
The method I am considering is to query the first n records from the Oracle table and then, using the same sort order, to query the SQL Server table for all records up to the last record that was returned from the Oracle database, and then compare the two data sets for what needs to be inserted/deleted/updated.
Then, once that has been done, to load the next n records from the Oracle database and query the records in the SQL Server table that, when sorted in the same way, fall between (and include) the first and last records in that data set.
My question is: how to achieve this in SQL Server? If I have the values of the nth record (having queried the table in Oracle with a certain sort order) how can I return the range of records up to and including the record with those values from SQL Server?
Example
I have the following table:
| Id                                                  | SOU_ORDREF | SOU_LINESEQ | SOU_DATOVER             | SOU_TIMEOVER     | SOU_SEQ | SOU_DESC               |
|-----------------------------------------------------|------------|-------------|-------------------------|------------------|---------|------------------------|
| AQ000001_10_25/07/2004 00:00:00_14_1                | AQ000001   | 10          | 2004-07-25 00:00:00.000 | 14               | 1       | Black 2.5mm Cable      |
| AQ000004_91_26/07/2004 00:00:00_15.4833333333333_64 | AQ000004   | 91          | 2004-07-26 00:00:00.000 | 15.4833333333333 | 64      | 2.5mm Yellow Cable     |
| AQ000005_31_26/07/2004 00:00:00_10.8333333333333_18 | AQ000005   | 31          | 2004-07-26 00:00:00.000 | 10.8333333333333 | 18      | Rotary Cam Switch      |
| AQ000012_50_26/07/2004 00:00:00_11.3_17             | AQ000012   | 50          | 2004-07-26 00:00:00.000 | 11.3             | 17      | 3Mtr Heavy Gauge Cable |
The Id field is basically a concatenation of the five fields which make up the unique index on the table i.e. SOU_ORDREF, SOU_LINESEQ, SOU_DATOVER, SOU_TIMEOVER, and SOU_SEQ.
What I would like to do is to be able to query, for example, all the records (when sorted by those columns) up to the record with the Id 'AQ000005_31_26/07/2004 00:00:00_10.8333333333333_18', which would give us the following result (I'll just show the Ids):
| Id |
|-----------------------------------------------------|
| AQ000001_10_25/07/2004 00:00:00_14_1 |
| AQ000004_91_26/07/2004 00:00:00_15.4833333333333_64 |
| AQ000005_31_26/07/2004 00:00:00_10.8333333333333_18 |
So, the query has not included the record with Id 'AQ000012_50_26/07/2004 00:00:00_11.3_17' since it comes after 'AQ000005_31_26/07/2004 00:00:00_10.8333333333333_18' when we order by SOU_ORDREF, SOU_LINESEQ, SOU_DATOVER, SOU_TIMEOVER, and SOU_SEQ.
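T-SQL has no row-value comparison, so the usual approach is to spell out the lexicographic order over the unique-index columns by hand. A sketch under that assumption; the table name, variable names, and types are illustrative, with the variables holding the key values of the last row fetched from Oracle:
-- illustrative names: @variables hold the key of the last Oracle row
declare @ordref   varchar(20) = 'AQ000005',
        @lineseq  int         = 31,
        @datover  datetime    = '2004-07-26',
        @timeover float       = 10.8333333333333,
        @seq      int         = 18;

select Id
from dbo.SourceTable
where SOU_ORDREF < @ordref
   or (SOU_ORDREF = @ordref and SOU_LINESEQ < @lineseq)
   or (SOU_ORDREF = @ordref and SOU_LINESEQ = @lineseq and SOU_DATOVER < @datover)
   or (SOU_ORDREF = @ordref and SOU_LINESEQ = @lineseq and SOU_DATOVER = @datover
       and SOU_TIMEOVER < @timeover)
   or (SOU_ORDREF = @ordref and SOU_LINESEQ = @lineseq and SOU_DATOVER = @datover
       and SOU_TIMEOVER = @timeover and SOU_SEQ <= @seq)
order by SOU_ORDREF, SOU_LINESEQ, SOU_DATOVER, SOU_TIMEOVER, SOU_SEQ;
Each or branch fixes one more leading column to equality before comparing the next, which is exactly "up to and including" the given record under that sort order, and a unique index on the five columns lets the optimizer seek rather than scan.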
I have a view defined in postgres, in a separate schema to the data it is using.
It contains three columns:
mydb=# \d "my_views"."results"
View "my_views.results"
Column | Type | Modifiers
-----------+-----------------------+-----------
Date | date |
Something | character varying(60) |
Result | numeric |
When I query it from psql or adminer, I get results like these:
bb_adminpanel=# select * from "my_views"."results";
Date | Something | Result
------------+-----------------------------+--------------
2015-09-14 | Foo | -3.36000000
2015-09-14 | Bar | -16.34000000
2015-09-12 | Foo | -11.55000000
2015-09-12 | Bar | 11.76000000
2015-09-11 | Bar | 2.48000000
However, querying it through django, I get a different set:
(c is a cursor object on the database)
c.execute('SELECT * from "my_views"."results"')
c.fetchall()
[(datetime.date(2015, 9, 14), 'foo', Decimal('-3.36000000')),
(datetime.date(2015, 9, 14), 'bar', Decimal('-16.34000000')),
(datetime.date(2015, 9, 11), 'foo', Decimal('-11.55000000')),
(datetime.date(2015, 9, 11), 'bar', Decimal('14.24000000'))]
This doesn't match at all: the first two rows are correct, but the last two are really weird. They have a shifted date, and the Result of the last record is the sum of the last two.
I have no idea why that's happening, any suggestions welcome.
Here is the view definition:
SELECT a."Timestamp"::date AS "Date",
a."Something",
sum(a."x") AS "Result"
FROM my_views.another_view a
WHERE a.status::text = ANY (ARRAY['DONE'::character varying::text, 'CLOSED'::character varying::text])
GROUP BY a."Timestamp"::date, a."Something"
ORDER BY a."Timestamp"::date DESC;
and "another_view" looks like this:
Column | Type | Modifiers
---------------------------+--------------------------+-----------
Timestamp | timestamp with time zone |
Something | character varying(60) |
x | numeric |
status | character varying(100) |
(some columns omitted)
The simple explanation of the problem is: time zones.
In detail: you're not declaring any time zone setting when connecting to the PostgreSQL console, but Django does on every connection. That way, the timestamp of some records will fall on a different day depending on the time zone used. For example, with this data
+-------------------------+-----------+-------+--------+
| timestamp | something | x | status |
+-------------------------+-----------+-------+--------+
| 2015-09-11 12:00:00 UTC | foo | 2.48 | DONE |
| 2015-09-12 00:50:00 UTC | foo | 11.76 | DONE |
+-------------------------+-----------+-------+--------+
a query on your view executed with time zone UTC will give you two rows, but the same query executed with time zone GMT-2 will give you only one row, because in the GMT-2 time zone the timestamp of the second row still falls on 2015-09-11.
To fix that, you can edit your view so that it always groups days according to a specified time zone:
SELECT (a."Timestamp" AT TIME ZONE 'UTC')::date AS "Date",
a."Something",
sum(a."x") AS "Result"
FROM my_views.another_view a
WHERE a.status::text = ANY (ARRAY['DONE'::character varying::text, 'CLOSED'::character varying::text])
GROUP BY (a."Timestamp" AT TIME ZONE 'UTC'), a."Something"
ORDER BY (a."Timestamp" AT TIME ZONE 'UTC') DESC;
That way, days will always be counted according to the 'UTC' time zone.
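Alternatively, leave the view as it is and make the two clients agree on a session time zone. Django sets the connection's time zone itself (from its TIME_ZONE setting, when USE_TZ is enabled), so the psql side can be matched manually; a sketch:
-- in psql, before querying the view:
SET TIME ZONE 'UTC';
select * from "my_views"."results";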