I am running Postgres 9.2 and I have a large table something like
CREATE TABLE sensor_values
(
ts timestamp with time zone NOT NULL,
value double precision NOT NULL DEFAULT 'NaN'::real,
sensor_id integer NOT NULL
)
I have values coming into the system constantly ie many per minute. I want to maintain a rolling standard deviation / average for the last 200 values so I can determine if new values entering the system are within say 3 standard deviations of the mean. To do so I would need the current standard deviation and mean to be constantly updated for the last 200 values.
As the table can be hundreds of millions of rows I do not want to get the last say 200 rows for a sensor ordered by time and then do vg(value), var_samp(value) for every new value coming in. I and assuming it will be faster to updated the standard deviation and mean.
I have started writing a PL/pgSQL function to update a rolling variance and mean on each new value entering the system for a particular sensor.
I can do this using code pseudo like
newavg = oldavg + (new_value - old_value)/window_size
new_variance += (new_value-old_value)*(new_value-newavg+old_value-oldavg)/(window_size-1)
This is based on
http://jonisalonen.com/2014/efficient-and-accurate-rolling-standard-deviation/
Basically the window is of size 200 values. The old_value is the first value of the window. When a new value comes in we shift the window forward one. After I get the result I store the following values for the sensor
The first value of the window.
The mean average of the window values.
The variance of the window values.
This way I don't have to constantly get there last 200 value and do a sum etc.I can reuse this values when a new sensor value come in.
My problem is when first running I have no previous window data for a sensor ie the three values above so I have to do it the slow way.
something like
WITH s AS
(SELECT value FROM sensor_values WHERE sensor_values.sensor_id = $1 AND ts >= (NOW() - INTERVAL '2 day')::timestamptz ORDER BY ts DESC LIMIT 200)
SELECT avg(value), var_samp(value) INTO last_window_average, last_window_variance FROM s;
But how could I get the last value (ealiest) to save from that select statement ?
Can I access the first row from s in PL/pgSQL.
I thought PL/pgSQL would be faster / cleaner approach but maybe its better to do this is client code ?
Are there better ways to perform this type on rolling statistic update ?
I assume, that it will not be drastically slow to re-calculated latest 200 entries each time with proper indexing. If you'll do an index, like:
CREATE INDEX i_sensor_values ON sensor_values(sensor_id, ts DESC);
you'll be able to get results fairly quickly doing:
SELECT sum("value") -- add more expressions as required
FROM sensor_values
WHERE sensor_id=$1
ORDER BY ts DESC
LIMIT 200;
You can execute this query in a loop from PL/pgSQL function.
If you'll migrate to 9.3 (or higher) any time soon, you'll be able to also use LATERAL joins for this purpose.
I do not think a covering index will do a good thing here, as table is constantly changing and IndexOnlyScan will not kick in.
It is good to check Loose Index scans also.
P.S. Column name value should be double quoted, as this is an SQL reserved word.
Related
I'd like some advices to know if what I need to do is achievable with timescale functions.
I've just found out I can use time_bucket_gapfill() to complete missing data, which is amazing! I need data each 5 minutes but I can receive 10 minutes, 30 minutes or 1 hour data. So the function helps me to complete the missing points in order to have only 5 minutes points. Also, I use locf() to set the gapfilled value with last value found.
My question is: can I set a max range when I set the last value found with locf() in order to never overpass 1 hour ?
Example: If the last value found is older than 1 hour ago I don't want to fill gaps, I need to leave it empty to say we have real missing values here.
I think I'm close to something with this but apparently I'm not allowed to use locf() in the same case.
ERROR: multiple interpolate/locf function calls per resultset column not supported
Somebody have an idea how I can resolve that?
How to reproduce:
Create table powers
CREATE table powers (
delivery_point_id BIGINT NOT NULL,
at timestamp NOT NULL,
value BIGINT NOT NULL
);
Create hypertable
SELECT create_hypertable('powers', 'at');
Create indexes
CREATE UNIQUE INDEX idx_dpid_at ON powers(delivery_point_id, at);
CREATE INDEX index_at ON powers(at);
Insert data for one day, one delivery point, point 10 minutes
INSERT INTO powers SELECT 1, at, round(random()*10000) FROM generate_series(TIMESTAMP '2021-01-01 00:00:00', TIMESTAMP '2022-01-02 00:00:00', INTERVAL '10 minutes') AS at;
Remove three hours of data from 4am to 7am
DELETE FROM powers WHERE delivery_point_id = 1 AND at < '2021-01-1 07:00:00' AND at > '2021-01-01 04:00:00';
The query that need to be fixed
SELECT
time_bucket_gapfill('5 minutes', at) AS point_five,
avg(value) AS avg,
CASE
WHEN (locf(at) - at) > interval '1 hour' THEN null
ELSE locf(avg(value))
END AS gapfilled
FROM powers
GROUP BY point_five, at
ORDER BY point_five;
Actual: ERROR: multiple interpolate/locf function calls per resultset column not supported
Expected: Gapfilled values each 5 minutes except between 4am and 7 am (real missing values).
This is a great question! I'm going to provide a workaround for how to do this with the current stuff, but I think it'd be great if you'd open a Github issue as well, because there might be a way to add an option for this that doesn't require a workaround like this.
I also think your attempt was a good approach and just requires a few tweaks to get it right!
The error that you're seeing is that we can't have multiple locf calls in a single column, this is a limitation that's pretty easy to work around as we can just shift both of them into a subquery, but that's not enough. The other thing that we need to change is that locf only works on aggregates, right now, you’re trying to use it on a column (at) that isn’t aggregated, which isn’t going to work, because it wouldn’t know which of the values of at in a time_bucket to “pull forward” for the gapfill.
Now you said you want to fill data as long as the previous point wasn’t more than one hour ago, so, we can take the last value of at in the bucket by using last(at, at) this is also the max(at) so either of those aggregates would work. So we put that into a CTE (common table expression or WITH query) and then we do the case statement outside like so:
WITH filled as (SELECT
time_bucket_gapfill('5 minutes', at) AS point_five,
avg(value) AS avg,
locf(last(at, at)) as filled_from,
locf(avg(value)) as filled_avg
FROM powers
WHERE at BETWEEN '2021-01-01 01:30:00' AND '2021-01-01 08:30:00'
AND delivery_point_id = 1
GROUP BY point_five
ORDER BY point_five)
SELECT point_five,
avg,
filled_from,
CASE WHEN point_five - filled_from > '1 hour'::interval THEN NULL
ELSE filled_avg
END as gapfilled
FROM filled;
Note that I’ve tried to name my CTE expressively so that it’s a little easier to read!
Also, I wanted to point out a couple other hyperfunctions that you might think about using:
heartbeat_agg is a new/experimental one that will help you determine periods when your system is up or down, so if you're expecting points at least every hour, you can use it to find the periods where the delivery point was down or the like.
When you have more irregular sampling or want to deal with different data frequencies from different delivery points, I’d take a look a the time_weight family of functions. They can be more efficient than using something like gapfill to upsample, by instead letting you treat all the different sample rates similarly, without having to create more points and more work to do so. Even if you want to, for instance, compare sums of values, you’d use something like integral to get the time weighted sum over a period based on the locf interpolation.
Anyway, hope all that is helpful!
Quick question on optimizing a query type we do a lot in working with time-series data provided by a data logging system.
Database is SQL Server 2019 (v15) and for simplification assume the table is made up of just:
ID (bigint) - unique ID for the row
Timestamp (bigint) - Unix timestamp value.
Sample (float) - Value of sample taken (e.g. temperature measurement).
There is no regular interval or spacing with respect to timestamp as the data logger only logs data on a change to the data point being monitored (i.e. there is no reliable way to determine when in time that a previous sample would have been taken).
Anyway, our queries often involve selecting a range of data between two timestamps, but as expected the timestamps selected as the bounds for the range rarely ever line-up exactly with a timestamp in the data set. Because of this, what we really need to select is all the data in the range plus one record immediately before the range (so we know what the data value is leading into the selected range).
Historically we have done this one of two ways:
Select the rows between the timestamps (inclusive) and union this with a top(1) select of the first row with a timestamp <= to the range start.
OR
Select the top(1) timestamp <= to the range start into a variable and then do a select statement with this new timestamp as the lower bound for the range.
Since I am not an expert, I'm wondering if either one of these methods has better performance over the other or if there is maybe some better, third option we haven't encountered.
Thanks!
I am trying to create a report that will show how long an automated sprinkler system has run for. The system is comprised of several sprinklers, with each one keeping track of only itself, and then sends that information to a database. My problem is that each sprinkler has its own run time (I.E. if 5 sprinklers all ran at the same time for 10 minutes, it would report back a total run time of 50 minutes), and I want to know only the net amount of run time - in this example, it would be 10 minutes.
The database is comprised of a time stamp and a boolean, where it records the time stamp every time a sprinkler is shut on or off (its on/off state is indicated by the 1/0 of the boolean).
So, to figure out the total net time the system was on each day - whether it was 1 sprinkler running or all of them - I need to check the database for time frames where no sprinklers were turned at all (or where ANY sprinkler at all was turned on). I would think the beginning of the query would look something like
SELECT * FROM MyTable
WHERE MyBoolean = 0
AND [ ... ]
But I'm not sure what the conditional statements that would follow the AND would be like to check the time stamps.
Is there a query I can send to the database that will report back this format of information?
EDIT:
Here's the table the data is recorded to - it's literally just a name, a boolean, and a datetime of when the boolean was changed, and that's the entire database
Every time a sprinkler turns on the number of running sprinklers increments by 1, and every time one turns off the number decrements by 1. If you transform the data so you get this:
timestamp on/off
07:00:05 1
07:03:10 1
07:05:45 -1
then you have a sequence of events in order; which sprinklers they refer to is irrelevant. (I've changed the zeros to -1 for reasons that will become evident in a moment. You can do this with "(2 * value) - 1")
Now put a running total together:
select a.timestamp, (SELECT SUM(a.on_off)
FROM sprinkler_events b
WHERE b.timestamp <= a.timestamp) as run_total
from sprinkler_events a
order by a.timestamp;
where sprinkler_events is the transformed data I listed above. This will give you:
timestamp run_total
07:00:05 1
07:03:10 2
07:05:45 1
and so on. Every row in this which has a run total of zeros is a time at which all sprinklers were turned off, which I think is what you're looking for. If you need to sum the time they were on or off, you'll need to do additional processing: search for "date difference between consecutive rows" and you'll see solutions for that.
You might consider looking for whether all the sprinklers are currently off. For example:
SELECT COUNT (DISTINCT s._NAME) AS sprinkers_currently_off
FROM (
SELECT
_NAME,
_VALUE,
_TIMESTAMP,
ROW_NUMBER() OVER (PARTITION BY _NAME ORDER BY _TIMESTAMP DESC, _VALUE) AS latest_rec
FROM sprinklers
) s
WHERE
_VALUE = 0
AND latest_rec = 1
The inner query orders the records so that you can get the latest status of all the sprinklers, and the outer query counts how many are currently off. If you have 10 sprinklers you would report them all off when this query returns 10.
You could modify this by applying a date range to the inner query if you wanted to look into the past, but this should get you on the right track.
In OrientDB I have setup a time series using this use case. However, instead of appending my Vertex as an embedded list to the respective hour I have opted to just create an edge from the hour to the time dependent Vertex.
For arguments sake lets say that each hour has up to 60 time Vertex each identified by a timestamp. This means I can perform the following query to obtain a specific desired Vertex:
SELECT FROM ( SELECT expand( month[5].day[12].hour[0].out() ) FROM Year WHERE year = 2015) WHERE timestamp = 1434146922
I can see from the use case that I can use UNION to get several specified time branches in one go.
SELECT expand( records ) FROM (
SELECT union( month[3].day[20].hour[10].out(), month[3].day[20].hour[11].out() ) AS records
FROM Year WHERE year = 2015
)
This works fine if you only have a small number of branches but it doesn't work very well if you want to get all the records for a given time span. Say you wanted to get all the records between;
month[3].day[20].hour[11] -> month[3].day[29].hour[23]
I could iterate through the timespan and create a huge union query but at some point I guess the query would be too long and my guess is that it wouldn't be very efficient. I could also completely bypass the time branches and query the Vectors directly based on the timestamp.
SELECT FROM Sector WHERE timestamp BETWEEN 1406588622 AND 1406588624
The problem being that you loose all efficiencies gained by the time branches.
By experimenting and reading a bit about data types in orientdb, I found that :
The squared brackets allow to :
filtering by one index, example out()[0]
filtering by multiple indexes, example out()[0,2,4]
filtering by ranges, example out()[0-9]
OPTION 1 (UPDATE) :
Using a union to join on multiple time is the only option if you don't want to create all indexes and if your range is small. Here is a query exemple using union in the documentation.
OPTION 2 :
If you always have all the indexes created for your time and if you filter on wide ranges, you should filter by ranges. This is more performant then option 1 for the cost of having to create all indexes on which you want to filter on. Official documentation about field part.
This is how the query would look like :
select
*
from
(
select
expand(hour[0-23].out())
from
(select
expand(month[3].day[20-29])
from
Year
where
year = 2015)
)
where timestamp > 1406588622
I would highly recommend reading this.
I need to come up with an analysis of simultaneus events, when having only starttime and duration of each event.
Details
I've a standard CDR call detail record, that contains among others:
calldate (timedate of each call start
duration (int, seconds of call duration)
channel (a string)
What I need to come up with is some sort of analysys of simultaneus calls on each second, for a given timedate period. For example, a graph of simultaneous calls we had yesterday.
(The problem is the same if we have visitors logs with duration on a website and wish to obtain simultaneous clients for a group of web-pages)
What would your algoritm be?
I can iterate over records in the given period, and fill an array, where each bucket of the array corresponds to 1 second in the overall period. This works and seems to be fast, but if the timeperiod is big (say..1 year), I would need lots of memory (3600x24x365x4 bytes ~ 120MB aprox).
This is for a web-based, interactive app, so my memory footprint should be small enough.
Edit
By simultaneous, I mean all calls on a given second. Second would be my minimum unit. I cannot use something bigger (hour for example) becuse all calls during an hour do not need to be held at the same time.
I would implement this on the database. Using a GROUP BY clause with DATEPART, you could get a list of simultaneous calls for whatever time period you wanted, by second, minute, hour, whatever.
On the web side, you would only have to display the histogram that is returned by the query.
#eric-z-beard: I would really like to be able to implement this on the database. I like your proposal, and while it seems to lead to something, I dont quite fully understand it. Could you elaborate? Please recall that each call will span over several seconds, and each second need to count. If using DATEPART (or something like it on MySQL), what second should be used for the GROUP BY. See note on simultaneus.
Elaborating over this, I found a way to solve it using a temporary table. Assuming temp holds all seconds from tStart to tEnd, I could do
SELECT temp.second, count(call.id)
FROM call, temp
WHERE temp.second between (call.start and call.start + call.duration)
GROUP BY temp.second
Then, as suggested, the web app should use this as a histogram.
You can use a static Numbers table for lots of SQL tricks like this. The Numbers table simply contains integers from 0 to n for n like 10000.
Then your temp table never needs to be created, and instead is a subquery like:
SELECT StartTime + Numbers.Number AS Second
FROM Numbers
You can create table 'simultaneous_calls' with 3 fields: yyyymmdd Char(8),
day_second Number, -- second of the day,
count Number -- count of simultaneous calls
Your web service can take 'count' value from this table and make some statistics.
Simultaneous_calls table will be filled by some batch program which will be started every day after end of the day.
Assuming that you use Oracle, the batch may start a PL/SQL procedure which does the following:
Appends table with 24 * 3600 = 86400 records for each second of the day, with default 'count' value = 0.
Defines the 'day_cdrs' cursor for the query:
Select to_char(calldate, 'yyyymmdd') yyyymmdd,
(calldate - trunc(calldate)) * 24 * 3600 starting_second,
duration duration
From cdrs
Where cdrs.calldate >= Trunc(Sysdate -1)
And cdrs.calldate
Iterates the cursor to increment 'count' field for the seconds of the call:
For cdr in day_cdrs
Loop
Update simultaneos_calls
Set count = count + 1
Where yyyymmdd = cdr.yyyymmdd
And day_second Between cdr.starting_second And cdr.starting_second + cdr.duration;
End Loop;