examine if one time series column of table has two adjacent time points which have interval larger than certain length - sql

I am dealing with data preprocessing on a table containing time series column
toy example Table A
timestamp value
12:30:24 1
12:32:21 3
12:33:21 4
timestamp is ordered and always go incrementally
Is that possible to define an function or something else to return "True expression" when table has two adjacent time points which have interval larger than certain length and return "False" otherwise?
I am using postgresql, thank you

SQL Fiddle
select bool_or(bigger_than) as bigger_than
from (
select
time - lag(time) over (order by time)
>
interval '1 minute' as bigger_than
from table_a
) s;
bigger_than
-------------
t
bool_or will stop searching as soon as it finds the first true value.
http://www.postgresql.org/docs/current/static/functions-aggregate.html
Your sample data shows a time value. But it works the same for a timestamp

Something like this:
select count(*) > 0
from (
select timestamp,
lag(timestamp) over (order by value) as prev_ts
from table_a
) t
where timestamp - prev_ts < interval '1' minute;
It calculates the difference between a timestamp and it's "previous" timestamp. The order of the timestamps is defined by the value column. The outer query then counts the number of rows where the difference is smaller than 1 minute.
lag() is called a window functions. More details on those can be found in the manual:
http://www.postgresql.org/docs/current/static/tutorial-window.html

Related

How to iterate over table and delete rows based on specific condition on previous row - PostgreSQL

I have a table of ships, which consists of:
row id (number)
ship id (character varying)
timestamp (timestamp in yyyy-mm-dd hh:mm:ss format)
Timestamp is the time that the specific ship (ship id) emitted a signal during its course. The table looks like this:
What I need to do (in PostgreSQL - pgAdmin) is for every ship_id, find if a signal has been emitted 5 seconds or less after another signal from the same ship, and then delete the row with the latter.
In the example table shown above, for the ship "foo" the signals are almost 9 minutes apart so it's all good, but for the ship "bar" the signal with row_id 4 was emitted 3 seconds after the previous one with row_id 3, so it needs to go.
Thanks a lot in advance.
Windowing functions Lag/Lead in this case will do the trick.
Add a LAG to calculate the difference between timestamps for the same ships. This will allow you to calculate the time difference for the same ship and its most recent posting.
Use that to filter out what to delete
SELECT ROW_ID, SHIP_ID, EXTRACT(EPOCH FROM (TIMESTAMP - LAG (TIMESTAMP,1) OVER (PARTITION BY SHIP_ID ORDER BY TIMESTAMP ASC))) AS SECONDS_DIFF
--THEN SOMETHING LIKE THIS TO FIND WHICH ROWS TO DELETE
DELETE FROM SHIP_TABLE WHERE ROW_ID IN
(SELECT ROW_ID FROM
(SELECT ROW_ID, SHIP_ID, EXTRACT(EPOCH FROM (TIMESTAMP - LAG (TIMESTAMP,1) OVER (PARTITION BY SHIP_ID ORDER BY TIMESTAMP ASC))) AS SECONDS_DIFF) SUB_1
WHERE SECONDS_DIFF <= 10 --THRESHOLD
) SUB_2

How to compute window function for each nth row in Presto?

I am working with a table that contains timeseries data, with a row for each minute for each user.
I want to compute some aggregate functions on a rolling window of N calendar days.
This is achieved via
SELECT
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
) as my_col
FROM my_table
However, I am only interested in the result of this at a daily scale.
i.e. I want the window to be computed only at 00:00:00, but I want the window itself to contain all the minute-by-minute data to be passed into my aggregate function.
Right now I am doing this:
WITH agg_results AS (
SELECT
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp_col
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
)
FROM my_table
)
SELECT * FROM agg_results
WHERE
timestamp_col = DATE_TRUNC('day', "timestamp_col")
This works in theory, but it does 60 * 24 more computations that necessary, resulting in the query being super slow.
Essentially, I am trying to find a way to make the right window bound skip rows based on a condition. Or, if it is simpler to implement, for every nth row (as I have a constant number of rows for each day).
I don't think that's possible with window functions. You could switch to a subquery instead, assuming that your aggregate function works as a regular aggregate function too (that is, without an OVER() clause):
select
timestamp_col,
(
select some_aggregate_fun(t1.col)
from my_table t1
where
t1.user_id = t.user_id
and t1.timestamp_col >= t.timestamp_col - interval '1' day
and t1.timestamp_col <= t.timestamp_col
)
from my_table t
where timestamp_col = date_trunc('day', timestamp_col)
I am unsure that this would perform better than your original query though; you might need to assess that against your actual dataset.
You can change interval '1' day to the actual interval you want to use.

How to compare time stamps from consecutive rows

I have a table that I would like to sort by a timestamp desc and then compare all consecutive rows to determine the difference between each row. From there, I would like to find all the rows whose difference is greater than ~2hours.
I'm stuck on how to actually compare consecutive rows in a table. Any help would be much appreciated.
I'm using Oracle SQL Developer 3.2
You didn't show us your table definition, but something like this:
select *
from (
select t.*,
t.timestamp_column,
t.timestamp_column - lag(timestamp_column) over (order by timestamp_column) as diff
from the_table t
) x
where diff > interval '2' hour;
This assumes that timestamp_column is defined as timestamp not date (otherwise the result of the difference wouldn't be an interval)

Every 10th row based on timestamp

I have a table with signal name, value and timestamp. these signals where recorded at sampling rate of 1sample/sec. Now i want to plot a graph on values of months, and it is becoming very heavy for the system to perform it within seconds. So my question is " Is there any way to view 1 value/minute in other words i want to see every 60th row.?"
You can use the row_number() function to enumerate the rows, and then use modulo arithmetic to get the rows:
select signalname, value, timestamp
from (select t.*,
row_number() over (order by timestamp) as seqnum
from table t
) t
where seqnum % 60 = 0;
If your data really is regular, you can also extract the seconds value and check when that is 0:
select signalname, value, timestamp
from table t
where datepart(second, timestamp) = 0
This assumes that timestamp is stored in an appropriate date/time format.
Instead of sampling, you could use the one minute average for your plot:
select name
, min(timestamp)
, avg(value)
from Yourtable
group by
name
, datediff(minute, '2013-01-01', timestamp)
If you are charting months, even the hourly average might be detailed enough.

Efficient way of counting a large content from a cloumn or a two in a database using selected time period

I need to list number of column1 that have been added to the database over the selected time period (since the day the list is requested)-daily, weekly (last 7 days), monthly (last 30 days) and quarterly (last 3 months). for example below is the table I created to perform this task.
Column | Type | Modifiers
------------------+-----------------------------+-----------------------------------------------------
column1 character varying (256) not null default nextval
date timestamp without time zone not null default now()
coloumn2 charater varying(256) ..........
Now, I need the total count of entries in column1 with respect the selected time period.
Like,
Column 1 | Date | Coloumn2
------------------+-----------------------------+-----------------------------------------------------
abcdef 2013-05-12 23:03:22.995562 122345rehr566
njhkepr 2013-04-10 21:03:22.337654 45hgjtron
ffb3a36dce315a7 2013-06-14 07:34:59.477735 jkkionmlopp
abcdefgggg 2013-05-12 23:03:22.788888 22345rehr566
From above data, for daily selected time period it should be count= 2
I have tried doing this query
select count(column1) from table1 where date='2012-05-12 23:03:22';
and have got the exact one record matching the time stamp. But I really needed to do it in proper way I believe this is not an efficient way of retrieving the count. Anyone who could help me know the right and efficient way of writing such query would be great. I am new to the database world, and I am trying to be efficient in writing any query.
Thanks!
[EDIT]
Each query currently is taking 175854ms to get process. What could be the efficient way to lessen the time to have it processed accordingly. Any help would be really great. I am using Postgresql to do the same.
To be efficient, conditions should compare values of the sane type as the columns being compared. In this case, the column being compared - Date - has type timestamp, so we need to use a range of tinestamp values.
In keeping with this, you should use current_timestamp for the "now" value, and as confirmed by the documentation, subtracting an interval from a timestamp yields a timestamp, so...
For the last 1 day:
select count(*) from table1
where "Date" > current_timestamp - interval '1 day'
For the last 7 days:
select count(*) from table1
where "Date" > current_timestamp - interval '7 days'
For the last 30 days:
select count(*) from table1
where "Date" > current_timestamp - interval '30 days'
For the last 3 months:
select count(*) from table1
where "Date" > current_timestamp - interval '3 months'
Make sure you have an index on the Date column.
If you find that the index is not being used, try converting the condition to a between, eg:
where "Date" between current_timestamp - interval '3 months' and current_timestamp
Logically the same, but may help the optimizer to choose the index.
Note that column1 is irrelevant to the question; being unique there is no possibility of the row count being different from the number of different values of column1 found by any given criteria.
Also, the choice of "Date" for the column name is poor, because a) it is a reserved word, and b) it is not in fact a date.
If you want to count number of records between two dates:
select count(*)
from Table1
where "Date" >= '2013-05-12' and "Date" < '2013-05-13'
-- count for one day, upper bound not included
select count(*)
from Table1
where "Date" >= '2013-05-12' and "Date" < '2013-06-13'
-- count for one month, upper bound not included
select count(*)
from Table1
where
"Date" >= current_date and
"Date" < current_date + interval '1 day'
-- current date
What I understand from your wording is
select date_trunc('day', "date"), count(*)
from t
where "date" >= '2013-01-01'
group by 1
order by 1
Replace 'day' for 'week', 'month', 'quarter' as needed.
http://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
Create an index on the "date" column.
select count(distinct column1) from table1 where date > '2012-05-12 23:03:22';
I assume "number of column1" means "number of distinct values in column1.
Edit:
Regarding your second question (speed of the query): I would assume that an index on the date column should speed up the runtime. Depending on the data content, this could even be declared unique.
To throw another option into the mix...
Add a column of type "date" and index that -- named "datecol" for this example:
create index on tbl_datecol_idx on tbl (datecol);
analyze tbl;
Then your query can use an equality operator:
select count(*) from tbl where datecol = current_date - 1; --yesterday
Or if you can't add the date datatype column, you could create a functional index on the existing column:
create index tbl_date_fbi on tbl ( ("date"::DATE) );
analyze tbl;
select count(*) from tbl where "date"::DATE = current_date - 1;
Note1: you do not need to query "column1" directly as every row has that attribute filled due to the NOT NULL.
Note2: Creating a column named "date" is poor form, and even worse that it is of type TIMESTAMP.