Every 10th row based on timestamp - sql

I have a table with signal name, value and timestamp. These signals were recorded at a sampling rate of 1 sample/sec. Now I want to plot a graph of several months of values, and it is becoming too heavy for the system to do within seconds. So my question is: is there any way to view 1 value per minute? In other words, I want to see every 60th row.

You can use the row_number() function to enumerate the rows, and then use modulo arithmetic to get the rows:
select signalname, value, timestamp
from (select t.*,
row_number() over (order by timestamp) as seqnum
from table t
) t
where seqnum % 60 = 0;
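As a rough, runnable sketch of the row_number() approach above: SQLite (3.25 or later, for window functions) stands in for the real database here, and the table name signals, the integer ts column and the generated data are all made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (signalname TEXT, value REAL, ts INTEGER)")

# 300 hypothetical samples at one per second (ts = seconds since start).
conn.executemany(
    "INSERT INTO signals VALUES (?, ?, ?)",
    [("s1", float(i), i) for i in range(1, 301)],
)

# Number the rows by time, then keep every 60th one.
sampled = conn.execute(
    """
    SELECT signalname, value, ts
    FROM (SELECT s.*, ROW_NUMBER() OVER (ORDER BY ts) AS seqnum
          FROM signals s) AS numbered
    WHERE seqnum % 60 = 0
    ORDER BY ts
    """
).fetchall()

print(sampled)
```

If the table holds several signals, adding PARTITION BY signalname inside the OVER clause numbers each signal separately.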
If your data really is regular, you can also extract the seconds value and check when that is 0:
select signalname, value, timestamp
from table t
where datepart(second, timestamp) = 0
This assumes that timestamp is stored in an appropriate date/time format.

Instead of sampling, you could use the one minute average for your plot:
select name
, min(timestamp)
, avg(value)
from Yourtable
group by
name
, datediff(minute, '2013-01-01', timestamp)
If you are charting months, even the hourly average might be detailed enough.
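A small sketch of the per-minute average, under stated assumptions: SQLite stands in for SQL Server, so integer division of an epoch-seconds column (ts / 60) replaces datediff(minute, ...) as the minute bucket. The table name readings and its data are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite with integer epoch-second timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (name TEXT, value REAL, ts INTEGER)")

# 180 hypothetical one-second samples, i.e. three full minutes.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("s1", float(t), t) for t in range(180)],
)

# ts / 60 is integer division here, so it yields one bucket per minute.
per_minute = conn.execute(
    """
    SELECT name, MIN(ts) AS minute_start, AVG(value) AS avg_value
    FROM readings
    GROUP BY name, ts / 60
    ORDER BY minute_start
    """
).fetchall()

print(per_minute)
```

Dividing by 3600 instead of 60 would give the hourly average mentioned above.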

Related

How to iterate over table and delete rows based on specific condition on previous row - PostgreSQL

I have a table of ships, which consists of:
row id (number)
ship id (character varying)
timestamp (timestamp in yyyy-mm-dd hh:mm:ss format)
Timestamp is the time that the specific ship (ship id) emitted a signal during its course. The table looks like this:
What I need to do (in PostgreSQL - pgAdmin) is for every ship_id, find if a signal has been emitted 5 seconds or less after another signal from the same ship, and then delete the row with the latter.
In the example table shown above, for the ship "foo" the signals are almost 9 minutes apart so it's all good, but for the ship "bar" the signal with row_id 4 was emitted 3 seconds after the previous one with row_id 3, so it needs to go.
Thanks a lot in advance.
Windowing functions Lag/Lead in this case will do the trick.
Add a LAG to calculate the difference between timestamps for the same ships. This will allow you to calculate the time difference for the same ship and its most recent posting.
Use that to filter out what to delete
SELECT ROW_ID, SHIP_ID, EXTRACT(EPOCH FROM (TIMESTAMP - LAG (TIMESTAMP,1) OVER (PARTITION BY SHIP_ID ORDER BY TIMESTAMP ASC))) AS SECONDS_DIFF
FROM SHIP_TABLE;
--THEN SOMETHING LIKE THIS TO FIND WHICH ROWS TO DELETE
DELETE FROM SHIP_TABLE WHERE ROW_ID IN
(SELECT ROW_ID FROM
(SELECT ROW_ID, SHIP_ID, EXTRACT(EPOCH FROM (TIMESTAMP - LAG (TIMESTAMP,1) OVER (PARTITION BY SHIP_ID ORDER BY TIMESTAMP ASC))) AS SECONDS_DIFF
FROM SHIP_TABLE) SUB_1
WHERE SECONDS_DIFF <= 5 --THRESHOLD IN SECONDS (5 PER THE QUESTION)
);
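The LAG-then-delete idea can be tried out end to end with a sketch like the following. It is only an illustration: SQLite (3.25+) stands in for PostgreSQL, timestamps are stored as integer seconds so a plain subtraction replaces EXTRACT(EPOCH FROM ...), and the sample rows are invented to mirror the "foo" and "bar" ships from the question.

```python
import sqlite3

# Assumption: in-memory SQLite, ts stored as integer seconds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ship_table (row_id INTEGER PRIMARY KEY, ship_id TEXT, ts INTEGER)"
)
conn.executemany(
    "INSERT INTO ship_table VALUES (?, ?, ?)",
    [
        (1, "foo", 0),
        (2, "foo", 530),   # almost 9 minutes after row 1: kept
        (3, "bar", 100),
        (4, "bar", 103),   # 3 seconds after row 3: should be deleted
        (5, "bar", 400),
    ],
)

# For each ship, compare every signal with the previous one and delete
# the rows that arrived 5 seconds or less after their predecessor.
conn.execute(
    """
    DELETE FROM ship_table
    WHERE row_id IN (
        SELECT row_id
        FROM (SELECT row_id,
                     ts - LAG(ts) OVER (PARTITION BY ship_id ORDER BY ts)
                         AS seconds_diff
              FROM ship_table) AS diffs
        WHERE seconds_diff <= 5
    )
    """
)

remaining = [r[0] for r in conn.execute("SELECT row_id FROM ship_table ORDER BY row_id")]
print(remaining)
```

The first signal of each ship has no predecessor, so its difference is NULL and it is never selected for deletion.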

How to compute window function for each nth row in Presto?

I am working with a table that contains timeseries data, with a row for each minute for each user.
I want to compute some aggregate functions on a rolling window of N calendar days.
This is achieved via
SELECT
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp_col
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
) as my_col
FROM my_table
However, I am only interested in the result of this at a daily scale.
i.e. I want the window to be computed only at 00:00:00, but I want the window itself to contain all the minute-by-minute data to be passed into my aggregate function.
Right now I am doing this:
WITH agg_results AS (
SELECT
user_id,
timestamp_col,
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp_col
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
) AS my_col
FROM my_table
)
SELECT * FROM agg_results
WHERE
timestamp_col = DATE_TRUNC('day', "timestamp_col")
This works in theory, but it does 60 * 24 times more computations than necessary, which makes the query very slow.
Essentially, I am trying to find a way to make the right window bound skip rows based on a condition. Or, if it is simpler to implement, for every nth row (as I have a constant number of rows for each day).
I don't think that's possible with window functions. You could switch to a subquery instead, assuming that your aggregate function works as a regular aggregate function too (that is, without an OVER() clause):
select
timestamp_col,
(
select some_aggregate_fun(t1.col)
from my_table t1
where
t1.user_id = t.user_id
and t1.timestamp_col >= t.timestamp_col - interval '1' day
and t1.timestamp_col <= t.timestamp_col
)
from my_table t
where timestamp_col = date_trunc('day', timestamp_col)
I am unsure that this would perform better than your original query though; you might need to assess that against your actual dataset.
You can change interval '1' day to the actual interval you want to use.
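Here is a minimal sketch of the correlated-subquery approach, with several assumptions: SQLite stands in for Presto, timestamps are integer epoch seconds (so ts % 86400 = 0 marks midnight and 86400 is the one-day interval), SUM plays the role of some_aggregate_fun, and the data is sampled every 6 hours rather than every minute to keep it short.

```python
import sqlite3

# Assumption: in-memory SQLite, ts as integer epoch seconds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (user_id INTEGER, ts INTEGER, col REAL)")

# Hypothetical data: user 1 has a row every 6 hours over 48 hours,
# user 2 has a single row at midnight.
rows = [(1, k * 21600, 1.0) for k in range(9)] + [(2, 0, 10.0)]
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)

# The outer query keeps only midnight rows; the subquery aggregates
# every row of the same user in the preceding day (inclusive).
daily = conn.execute(
    """
    SELECT t.user_id, t.ts,
           (SELECT SUM(t1.col)
            FROM my_table t1
            WHERE t1.user_id = t.user_id
              AND t1.ts >= t.ts - 86400
              AND t1.ts <= t.ts) AS my_col
    FROM my_table t
    WHERE t.ts % 86400 = 0
    ORDER BY t.user_id, t.ts
    """
).fetchall()

print(daily)
```

The aggregate runs once per midnight row instead of once per input row, which is the point of the rewrite; whether it is faster in practice still depends on the engine and the data.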

Saving only unique datapoints in SQL

For simplicity: We have a table with 2 columns, value and date.
Every second a new data point is received, and we want to save it with its timestamp. Since the data can repeat, to lower storage usage we don't save a value if it is the same as the previous entry.
Question: Given that the same value was received during 24 hours, only the first value & date pair is saved. If we want to query 'Average value in last 1 hour', is there a way to have the db (PostgreSQL) see that no values are saved in the last hour and search for the last existing value entry?
It is not as easy as it may seem, and it is not just about retrieving the latest data point when there is none available within the last hour. You want to calculate an average, so you need to rebuild the time-series data of the period, second per second, filling the gaps with the latest available data point.
I think the simplest approach is generate_series() to build the rows, and then a lateral join to recover the data:
select avg(d.value) avg_last_hour
from generate_series(now() - interval '1 hour', now(), '1 second') t(ts)
cross join lateral (
select d.*
from data d
where d.date <= t.ts
order by d.date desc
limit 1
) d
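To make the gap-filling idea concrete, here is a hedged sketch in which SQLite stands in for PostgreSQL. Since SQLite has neither a built-in generate_series() in every build nor lateral joins, a recursive CTE generates the seconds and a correlated scalar subquery fetches the latest stored point; this is a swapped-in equivalent, not the query above. The table points, its columns and the 10-second window are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite, recorded_at as integer seconds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (recorded_at INTEGER, value REAL)")

# Only changes are stored: 10.0 from second 0, then 20.0 from second 105.
conn.executemany("INSERT INTO points VALUES (?, ?)", [(0, 10.0), (105, 20.0)])

# Rebuild one row per second in the window, filling each second with the
# most recent stored value at or before it, then average the result.
avg_value = conn.execute(
    """
    WITH RECURSIVE series(ts) AS (
        SELECT :start
        UNION ALL
        SELECT ts + 1 FROM series WHERE ts < :end
    )
    SELECT AVG(filled.value)
    FROM (SELECT (SELECT p.value
                  FROM points p
                  WHERE p.recorded_at <= s.ts
                  ORDER BY p.recorded_at DESC
                  LIMIT 1) AS value
          FROM series s) AS filled
    """,
    {"start": 100, "end": 109},
).fetchone()[0]

print(avg_value)
```

Seconds 100 to 104 are filled with 10.0 and seconds 105 to 109 with 20.0, so each stored value is weighted by how long it was in effect.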
Hmmm . . . if you simply want the average of values in the most recent hour in the data, you can use:
select date_trunc('hour', date) as ddhh, avg(value)
from t
group by ddhh
order by ddhh desc
limit 1;
If you have a lot of data being collected, it might be faster to add an index on date and use:
select avg(value)
from t
where date >= date_trunc('hour', (select max(t2.date) from t t2));

SQL Server Determine the Amount of Time Above a Threshold

I am trying to determine the amount of time my data spends above a certain threshold. I have a SQL table of values that looks like this:
Where the first column is datetime and the second column is value. This is time series data so it is a large table and cannot be changed. I want to know the first value that crosses over the threshold (say it is 50 for the example) this is my beginning, the last value that crosses back over the threshold which is the end, and the duration spent over the threshold.
In my data example the Beginning would be 9/20/2019 19:18, the end would be 9/20/2019 19:46 and the duration would be 28 minutes.
This needs to be written in one sql statement due to the requirements of the project. I am just wondering if this is possible and how to do it. Thanks!
You can use lead() and some aggregation:
select t.*
from (select t.*,
datediff(minute,
ts, lead(ts) over (order by ts)
) as diff_minutes
from (select t.*,
lead(value) over (order by ts) as next_value
from t
) t
where (value < 50 and next_value >= 50) or
(value >= 50 and next_value < 50)
) t
where value < 50;
Your question is a little tricky because you want the time span to start just before the period in question. That is actually a simplification. The above implements:
Identify the next value.
Keep a row when the current value is below the threshold and the next value is at or above it, or vice versa. These are the row just before the period starts and the last row inside the period.
Then use lead() to get the ending timestamp.
Finally filter down to just the first row.
Another approach is perhaps simpler. Define the groups based on the count of rows under the threshold up to and including each row. This keeps the row just before the period in the same group as the period that follows it.
Then aggregate:
select min(ts), max(ts),
datediff(minute, min(ts), max(ts)) as diff_minute
from (select t.*,
sum(case when value < 50 then 1 else 0 end) over (order by ts) as grp
from t
) t
group by grp;
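The grouping trick can be checked on a few rows with a sketch like this one. Assumptions: SQLite (3.25+) stands in for SQL Server, ts is an integer number of minutes so a subtraction replaces datediff(), the table readings and its values are invented, and the HAVING clause is an addition that drops groups containing only a single below-threshold row.

```python
import sqlite3

# Assumption: in-memory SQLite, ts as integer minutes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(0, 40), (1, 45), (2, 55), (3, 60), (4, 52), (5, 48), (6, 30)],
)

# grp increases on every below-threshold row, so each such row starts a
# group that also contains the above-threshold rows that follow it.
periods = conn.execute(
    """
    SELECT MIN(ts), MAX(ts), MAX(ts) - MIN(ts) AS diff_minutes
    FROM (SELECT r.*,
                 SUM(CASE WHEN value < 50 THEN 1 ELSE 0 END)
                     OVER (ORDER BY ts) AS grp
          FROM readings r) AS g
    GROUP BY grp
    HAVING COUNT(*) > 1
    ORDER BY MIN(ts)
    """
).fetchall()

print(periods)
```

Each result row spans from the last below-threshold row before the period to the last row still above the threshold.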
It looks like you are sampling every 10 seconds. If that interval is reliable, you can simply count the records above 50 during a selected interval and multiply by 10 seconds; the result is the time spent above 50.

examine if one time series column of table has two adjacent time points which have interval larger than certain length

I am dealing with data preprocessing on a table containing a time series column.
toy example Table A
timestamp value
12:30:24 1
12:32:21 3
12:33:21 4
The timestamp column is ordered and always increases.
Is it possible to define a function or something else that returns true when the table has two adjacent time points whose interval is larger than a certain length, and false otherwise?
I am using PostgreSQL, thank you.
select bool_or(bigger_than) as bigger_than
from (
select
time - lag(time) over (order by time)
>
interval '1 minute' as bigger_than
from table_a
) s;
bigger_than
-------------
t
bool_or returns true if at least one input value is true.
http://www.postgresql.org/docs/current/static/functions-aggregate.html
Your sample data shows a time value, but it works the same for a timestamp.
Something like this:
select count(*) > 0
from (
select timestamp,
lag(timestamp) over (order by timestamp) as prev_ts
from table_a
) t
where timestamp - prev_ts > interval '1' minute;
It calculates the difference between each timestamp and its previous timestamp, with the rows ordered by the timestamp column. The outer query then checks whether any row has a difference larger than 1 minute.
lag() is called a window function. More details on those can be found in the manual:
http://www.postgresql.org/docs/current/static/tutorial-window.html
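A possible way to wrap the lag-based check in a reusable function is sketched below. It is only an illustration: SQLite (3.25+) stands in for PostgreSQL, the times from the toy table are stored as integer seconds since midnight, and has_gap is a hypothetical helper name.

```python
import sqlite3

# Assumption: in-memory SQLite, ts as integer seconds since midnight.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (ts INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [
        (12 * 3600 + 30 * 60 + 24, 1),  # 12:30:24
        (12 * 3600 + 32 * 60 + 21, 3),  # 12:32:21
        (12 * 3600 + 33 * 60 + 21, 4),  # 12:33:21
    ],
)


def has_gap(conn, max_gap_seconds):
    """Return True if any two adjacent rows are more than max_gap_seconds apart."""
    row = conn.execute(
        """
        SELECT COUNT(*) > 0
        FROM (SELECT ts, LAG(ts) OVER (ORDER BY ts) AS prev_ts
              FROM table_a) AS t
        WHERE ts - prev_ts > ?
        """,
        (max_gap_seconds,),
    ).fetchone()
    return bool(row[0])


print(has_gap(conn, 60))
print(has_gap(conn, 120))
```

The gaps in the toy data are 117 and 60 seconds, so a 60-second limit is exceeded while a 120-second limit is not.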