Timeseries data query - optimizing query performance - sql

Quick question on optimizing a query type we do a lot in working with time-series data provided by a data logging system.
Database is SQL Server 2019 (v15) and for simplification assume the table is made up of just:
ID (bigint) - unique ID for the row
Timestamp (bigint) - Unix timestamp value.
Sample (float) - Value of sample taken (e.g. temperature measurement).
There is no regular interval or spacing with respect to timestamp as the data logger only logs data on a change to the data point being monitored (i.e. there is no reliable way to determine when in time that a previous sample would have been taken).
Our queries often involve selecting a range of data between two timestamps, but as expected the timestamps chosen as the bounds of the range rarely line up exactly with a timestamp in the data set. Because of this, what we really need to select is all the data in the range plus the one record immediately before the range (so we know what the data value is leading into the selected range).
Historically we have done this one of two ways:
Select the rows between the timestamps (inclusive) and union this with a top(1) select of the latest row with a timestamp <= the range start.
OR
Select the top(1) timestamp <= the range start into a variable and then run a select statement with this new timestamp as the lower bound for the range.
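A minimal sketch of the first approach, using Python's built-in sqlite3 module as a stand-in for SQL Server so that it can be run anywhere. The table name `samples`, the column names `ts` and `sample`, the index and the sample rows are all assumptions made for illustration; in T-SQL the leading-row lookup would use TOP (1) instead of LIMIT 1. The leading row is taken as the latest row at or before the range start, and the range itself starts strictly after it, so UNION ALL cannot return the same row twice.

```python
import sqlite3

# Hypothetical schema mirroring the simplified table in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    "id INTEGER PRIMARY KEY, ts INTEGER NOT NULL, sample REAL NOT NULL)"
)
# An index on the timestamp lets both halves of the query seek instead of scan.
conn.execute("CREATE INDEX ix_samples_ts ON samples (ts)")
conn.executemany(
    "INSERT INTO samples (ts, sample) VALUES (?, ?)",
    [(100, 20.0), (130, 20.5), (170, 21.0), (260, 21.4), (300, 21.1)],
)


def range_with_leading_row(conn, start, end):
    """Return rows with start < ts <= end plus the latest row with ts <= start."""
    sql = """
        SELECT ts, sample FROM (
            SELECT ts, sample FROM samples
            WHERE ts <= :start
            ORDER BY ts DESC
            LIMIT 1
        )
        UNION ALL
        SELECT ts, sample FROM samples
        WHERE ts > :start AND ts <= :end
        ORDER BY ts
    """
    return conn.execute(sql, {"start": start, "end": end}).fetchall()


rows = range_with_leading_row(conn, 150, 270)
```

With the sample rows above, the call returns the row at ts = 130 (the value leading into the range) followed by the rows at 170 and 260.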
Since I am not an expert, I'm wondering if either one of these methods has better performance over the other or if there is maybe some better, third option we haven't encountered.
Thanks!

Related

OrientDB Time Span Search Query

In OrientDB I have set up a time series using this use case. However, instead of appending my Vertex as an embedded list to the respective hour, I have opted to just create an edge from the hour to the time-dependent Vertex.
For argument's sake, let's say that each hour has up to 60 time vertices, each identified by a timestamp. This means I can perform the following query to obtain a specific desired Vertex:
SELECT FROM ( SELECT expand( month[5].day[12].hour[0].out() ) FROM Year WHERE year = 2015) WHERE timestamp = 1434146922
I can see from the use case that I can use UNION to get several specified time branches in one go.
SELECT expand( records ) FROM (
SELECT union( month[3].day[20].hour[10].out(), month[3].day[20].hour[11].out() ) AS records
FROM Year WHERE year = 2015
)
This works fine if you only have a small number of branches, but it doesn't work very well if you want to get all the records for a given time span. Say you wanted to get all the records between:
month[3].day[20].hour[11] -> month[3].day[29].hour[23]
I could iterate through the time span and create a huge union query, but at some point the query would become too long, and my guess is that it wouldn't be very efficient. I could also completely bypass the time branches and query the vertices directly based on the timestamp.
SELECT FROM Sector WHERE timestamp BETWEEN 1406588622 AND 1406588624
The problem is that you lose all the efficiencies gained from the time branches.
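For illustration only, the "iterate through the time span" idea could be sketched as a small Python helper that builds the union query text, one branch per hour. The function names are hypothetical, and the mapping from a datetime to the `month[..].day[..].hour[..]` indexes is an assumption (the calendar values are used directly); if the tree uses zero-based indexes, `branch` is the only place that needs to change.

```python
from datetime import datetime, timedelta


def branch(moment):
    # ASSUMPTION: tree indexes equal the calendar month, day and hour.
    return f"month[{moment.month}].day[{moment.day}].hour[{moment.hour}].out()"


def union_query(start, end):
    """Build one union() query covering every hour branch from start to end."""
    if start.year != end.year:
        raise ValueError("this sketch only handles spans within a single year")
    paths = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        paths.append(branch(current))
        current += timedelta(hours=1)
    return (
        "SELECT expand( records ) FROM ( SELECT union( "
        + ", ".join(paths)
        + f" ) AS records FROM Year WHERE year = {start.year} )"
    )


# The span from the question: month[3].day[20].hour[11] to month[3].day[29].hour[23].
query = union_query(datetime(2015, 3, 20, 11), datetime(2015, 3, 29, 23))
```

For the span in the question this already produces 229 branches in a single statement, which makes the concern about query length concrete.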
By experimenting and reading a bit about data types in OrientDB, I found that square brackets allow:
filtering by one index, for example out()[0]
filtering by multiple indexes, for example out()[0,2,4]
filtering by ranges, for example out()[0-9]
OPTION 1 (UPDATE) :
Using a union to join multiple time branches is the only option if you don't want to create all the indexes and if your range is small. There is a query example using union in the documentation.
OPTION 2 :
If you always have all the indexes created for your time tree and you filter on wide ranges, you should filter by ranges. This performs better than option 1, at the cost of having to create every index you want to filter on. See the official documentation about field parts.
This is what the query would look like:
select *
from (
  select expand(hour[0-23].out())
  from (
    select expand(month[3].day[20-29])
    from Year
    where year = 2015
  )
)
where timestamp > 1406588622
I would highly recommend reading this.

Rolling average postgres

I am running Postgres 9.2 and I have a large table something like
CREATE TABLE sensor_values
(
ts timestamp with time zone NOT NULL,
value double precision NOT NULL DEFAULT 'NaN'::real,
sensor_id integer NOT NULL
)
I have values coming into the system constantly, i.e. many per minute. I want to maintain a rolling standard deviation / average for the last 200 values so I can determine whether new values entering the system are within, say, 3 standard deviations of the mean. To do so I would need the current standard deviation and mean to be constantly updated for the last 200 values.
As the table can hold hundreds of millions of rows, I do not want to fetch the last 200 rows for a sensor ordered by time and then compute avg(value), var_samp(value) for every new value coming in. I am assuming it will be faster to update the standard deviation and mean incrementally.
I have started writing a PL/pgSQL function to update a rolling variance and mean on each new value entering the system for a particular sensor.
I can do this using pseudocode like
newavg = oldavg + (new_value - old_value)/window_size
new_variance += (new_value-old_value)*(new_value-newavg+old_value-oldavg)/(window_size-1)
This is based on
http://jonisalonen.com/2014/efficient-and-accurate-rolling-standard-deviation/
Basically the window is of size 200 values. The old_value is the first value of the window. When a new value comes in we shift the window forward one. After I get the result I store the following values for the sensor
The first value of the window.
The mean average of the window values.
The variance of the window values.
This way I don't have to constantly fetch the last 200 values and do a sum etc. I can reuse these values when a new sensor value comes in.
My problem is that on the first run I have no previous window data for a sensor, i.e. the three values above, so I have to do it the slow way,
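The pseudocode update above can be checked with a short Python sketch. The function name `slide_window`, the sample readings and the window size of 4 are assumptions chosen to keep the example small; the formula itself is the one quoted above, and the result is compared against a full recomputation with the standard library.

```python
from collections import deque
import statistics


def slide_window(mean, variance, old_value, new_value, window_size):
    """Update a rolling mean and sample variance when old_value leaves the
    window and new_value enters it."""
    new_mean = mean + (new_value - old_value) / window_size
    new_variance = variance + (new_value - old_value) * (
        new_value - new_mean + old_value - mean
    ) / (window_size - 1)
    return new_mean, new_variance


# Hypothetical sensor readings.
values = [20.1, 19.8, 20.4, 20.0, 21.2, 19.5, 20.7]
window_size = 4

# Seed the state the slow way, then slide one value at a time.
window = deque(values[:window_size])
mean = statistics.mean(window)
variance = statistics.variance(window)  # sample variance, like var_samp
for new_value in values[window_size:]:
    old_value = window.popleft()
    window.append(new_value)
    mean, variance = slide_window(mean, variance, old_value, new_value, window_size)
```

Note that each step needs the value that is leaving the window, so after a slide the stored "first value" has to be replaced by the next-oldest value; this sketch keeps the whole window in a deque purely to have that value at hand.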
something like
WITH s AS
(SELECT value FROM sensor_values WHERE sensor_values.sensor_id = $1 AND ts >= (NOW() - INTERVAL '2 day')::timestamptz ORDER BY ts DESC LIMIT 200)
SELECT avg(value), var_samp(value) INTO last_window_average, last_window_variance FROM s;
But how can I get the last (earliest) value from that select statement so I can save it?
Can I access the first row from s in PL/pgSQL?
I thought PL/pgSQL would be a faster / cleaner approach, but maybe it's better to do this in client code?
Are there better ways to perform this type of rolling statistic update?
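If the seeding step were done in client code instead, a minimal Python sketch might look like the following. It assumes the rows have already been fetched newest first (as the ORDER BY ts DESC LIMIT 200 query returns them); the function name `seed_window` is hypothetical. The earliest value of the window is then simply the last element of the fetched list.

```python
import statistics


def seed_window(values_newest_first):
    """Return (earliest_value, mean, sample_variance) for the initial window."""
    if len(values_newest_first) < 2:
        raise ValueError("need at least two values for a sample variance")
    earliest_value = values_newest_first[-1]  # newest first, so oldest is last
    return (
        earliest_value,
        statistics.mean(values_newest_first),
        statistics.variance(values_newest_first),
    )


# Hypothetical values as the query would return them, newest first.
earliest, window_mean, window_variance = seed_window([4.0, 3.0, 2.0, 1.0])
```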
I assume that, with proper indexing, it will not be drastically slow to recalculate the latest 200 entries each time. If you create an index like:
CREATE INDEX i_sensor_values ON sensor_values(sensor_id, ts DESC);
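you'll be able to get results fairly quickly with a query like the following (the LIMIT goes in a subquery so that it restricts the rows before they are aggregated):
SELECT sum("value") -- add more expressions as required
FROM (
    SELECT "value"
    FROM sensor_values
    WHERE sensor_id = $1
    ORDER BY ts DESC
    LIMIT 200
) AS last_values;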
You can execute this query in a loop from a PL/pgSQL function.
If you migrate to 9.3 (or higher) any time soon, you will also be able to use LATERAL joins for this purpose.
I do not think a covering index will help here: the table is constantly changing, so an index-only scan will not kick in.
It is also worth looking into loose index scans.
P.S. The column name value should be double-quoted, as it is an SQL reserved word.

Comparing values of type DATE - Oracle

Is there any way of comparing two date values to check if one is before the other?
For example, how do I know which came first in the following rows?
SEQ CREATION_DTM
--------------------
234 2011-03-26 22:59:03
235 2011-03-26 22:59:03
The column for the above data is declared as datatype DATE. Having read around, it appears that the DATE datatype does not store milliseconds. Does this mean I can't compare the above two dates to find out which one is before the other?
EDIT
I am using Oracle 10G on Solaris.
DATE precision only goes to the nearest second, so if you have two dates that are the same to that precision then you can't distinguish between or order them. To get any more precision you'd need to store them as TIMESTAMP.
In the more general case where the dates do differ, you can compare and order them much like numbers. When two are the same, the order is undefined; in your case, if you ordered by CREATION_DTM then you couldn't reliably predict whether the results would be ordered as 234,235 or 235,234. You would need to determine a way to break the tie, as Justin has suggested.
A DATE only stores up to the second. So if two rows are inserted in the same second, you can't determine which came first based on the CREATION_DTM column. If you want that level of resolution, you'd be better served with a TIMESTAMP [WITH [LOCAL] TIME ZONE] column which will store the time component up to 9 decimal digits if the host operating system provides that level of granularity (most Unix systems will provide microsecond resolution).
In your case, assuming that you're not using RAC and that you are using an Oracle sequence to populate the SEQ column, you could use that column to break the tie. If the two rows were inserted in different transactions, haven't been updated, and the table was built with ROWDEPENDENCIES, you could also potentially use the ORA_ROWSCN to break the tie.
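As a language-neutral illustration of the tie-break, here is a small Python sketch that orders rows by the second-precision timestamp first and by SEQ second. Rows 234 and 235 are taken from the question; row 233 is invented sample data. In SQL the same idea would be an ORDER BY on both columns.

```python
from datetime import datetime

# (SEQ, CREATION_DTM) pairs; DATE precision stops at whole seconds.
rows = [
    (235, datetime(2011, 3, 26, 22, 59, 3)),
    (234, datetime(2011, 3, 26, 22, 59, 3)),
    (233, datetime(2011, 3, 26, 22, 58, 47)),  # hypothetical extra row
]

# The two timestamps are equal, so they alone cannot order rows 234 and 235.
same_second = rows[0][1] == rows[1][1]

# Sort by timestamp, then by SEQ to break ties deterministically.
ordered = sorted(rows, key=lambda row: (row[1], row[0]))
ordered_seqs = [seq for seq, _ in ordered]
```

This only reflects insertion order if SEQ really is assigned in insertion order, which is the assumption stated in the answer above.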
It seems the TIMESTAMP data type would be appropriate for your query.
Thanks

How to design SQL tables when column data arrives in multiple types/margins of error?

I've been given a stack of data where a particular value has been collected sometimes as a date (YYYY-MM-DD) and sometimes as just a year.
Depending on how you look at it, this is either a variance in type or margin of error.
This is a subprime situation, but I can't afford to recover or discard any data.
What's the optimal (i.e. least worst :) ) SQL table design that will accept either form while avoiding monstrous queries and allowing maximum use of database features like constraints and keys*?
*i.e. Entity-Attribute-Value is out.
You could store the year, month and day components in separate columns. That way, you only need to populate the columns for which you have data.
If it comes in as just a year, default the month and day to 01, i.e. YYYY-01-01.
This way you can still use a date/datetime data type and don't have to worry about invalid dates.
Either bring it in as a string unmolested, and modify it so it's consistent in another step, or modify the year-only values during the import like SQLMenace recommends.
I'd store the value in a DATETIME type and another value (just an integer will do, or some kind of enumerated type) that signifies its precision.
It would be easier to give more information if you mentioned what kind of queries you will be doing on the data.
Either fix it, then store it (OK, not an option)
Or store it broken with a fixed computed columns
Something like this
CREATE TABLE ...
...
Broken varchar(20),
Fixed AS CAST(CASE WHEN Broken LIKE '[12][0-9][0-9][0-9]' THEN Broken + '0101' ELSE Broken END AS datetime)
This also allows you to detect good from bad source data
If you don't always have a full date, what sort of keys and constraints would you need? Perhaps store two columns of data; a full date, and a year. For data that has only year, the year is stored and date is null. For items with full info, both are populated.
I'd put three columns in the table:
The provided value (YYYY-MM-DD or YYYY)
A date column, Date or DateTime data type, which is nullable
A year column, as an integer or char(4) depending upon your needs.
I'd always populate the year column, populate the date column only when the provided value is a date.
And, because you've kept the provided value, you can always re-process down the road if needs change.
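A minimal sketch of how an import step might fill those three columns, written in Python for illustration. The function name `split_provided_value` is hypothetical, and it assumes the only two input shapes are YYYY and YYYY-MM-DD, as described in the question; anything else raises an error rather than being guessed at.

```python
from datetime import date, datetime


def split_provided_value(raw):
    """Return (provided_value, date_or_None, year) for one input value."""
    text = raw.strip()
    if len(text) == 4 and text.isdigit():
        # Year only: keep the date column NULL (None here).
        return raw, None, int(text)
    # Full date: strptime raises ValueError for anything that is not YYYY-MM-DD.
    parsed = datetime.strptime(text, "%Y-%m-%d").date()
    return raw, parsed, parsed.year


year_only = split_provided_value("2009")
full_date = split_provided_value("2009-12-05")
```

Because the provided value is returned unchanged, the original input stays available for re-processing later.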
An alternative solution would be to that of a date mask (like in IP). Store the date in a regular datetime field, and insert an additional field of type smallint or something, where you could indicate which is present (could go even binary here):
If you have YYYY-MM-DD, you would have 3 bits of data, which will have the values 1 if data is present and 0 if not.
Example:
Date Mask
2009-12-05 7 (111)
2009-12-01 6 (110, only year and month are known, and day is set to default 1)
2009-01-20 5 (101, for some strange reason, only the year and the day are known. January has 31 days, so it will never generate an error)
Which solution is better depends on what you will do with it.
This is better when you want to select those with full dates, which are between a certain period (less to write). Also this way it's easier to compare any dates which have masks like 7,6,4. It may also take up less memory (date + smallint may be smaller than int+int+int, and only if datetime uses 64 bit, and smallint uses up as much as int, it will be the same).
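To make the mask idea concrete, here is a small Python sketch that decodes which parts of a stored date are actually known. The bit assignment (4 for year, 2 for month, 1 for day) is inferred from the three example rows above; the constant and function names are hypothetical.

```python
from datetime import date

# Bit values inferred from the examples: 7 = 111, 6 = 110, 5 = 101.
YEAR_KNOWN = 4
MONTH_KNOWN = 2
DAY_KNOWN = 1


def known_parts(stored, mask):
    """Return only the date components that the mask marks as known."""
    parts = {}
    if mask & YEAR_KNOWN:
        parts["year"] = stored.year
    if mask & MONTH_KNOWN:
        parts["month"] = stored.month
    if mask & DAY_KNOWN:
        parts["day"] = stored.day
    return parts


def is_full_date(mask):
    """True when year, month and day are all known."""
    return mask == (YEAR_KNOWN | MONTH_KNOWN | DAY_KNOWN)


full = known_parts(date(2009, 12, 5), 7)
year_month = known_parts(date(2009, 12, 1), 6)
year_day = known_parts(date(2009, 1, 20), 5)
```

Selecting only rows with complete dates then reduces to a single comparison of the mask against 7.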
I was going to suggest the same solution as @ninesided did above. Additionally, you could have a date field and a field that quantitatively represents your uncertainty. This offers the advantage of being able to represent things like "on or about Sept 23, 2010". The problem is that to represent the case where you only know the year, you'd have to set your date to be the middle of the year, with 182.5 days' uncertainty (assuming non-leap year), which seems ugly.
You could use a similar but distinct approach with a mask that represents what date parts you're confident about - that's what SQLMenace offered in his answer above.
+1 each to recommendations from ninesided, Nikki9696 and Jeff Siver - I support all those answers though none was exactly what I decided upon.
My solution:
a date column used only for complete dates
an int column used for years
a constraint to ensure integrity between the two
a trigger to populate the year if only date is supplied
Advantages:
can run simple (one-column) queries on the date column with missing data ignored (by using NULL for what it was designed for)
can run simple (one-column) queries on the year column for any row with a date (because year is automatically populated)
insert either year or date or both (provided they agree)
no fear of disagreement between columns
self explanatory, intuitive
I would argue that methods using YYYY-01-01 to signify missing data (when flagged as such with a second explanatory column) fail seriously on points 1 and 5.
Example code for Sqlite 3:
create table events
(
  rowid integer primary key,
  event_year integer,
  event_date date,
  -- string literals use single quotes; double quotes denote identifiers
  check (event_year = cast(strftime('%Y', event_date) as integer))
);
create trigger year_trigger after insert on events
begin
  update events set event_year = cast(strftime('%Y', event_date) as integer)
  where rowid = new.rowid and event_date is not null;
end;
-- various methods to insert
insert into events (event_year, event_date) values (2008, '2008-02-23');
insert into events (event_year) values (2009);
insert into events (event_date) values ('2010-01-19');
-- select events in January without expressions on supplementary columns
select rowid, event_date from events where strftime('%m', event_date) = '01';

Nondeterministic functions in sql partitioning functions

How are non-deterministic functions used in SQL partitioning functions and are they useful?
MsSql allows non-deterministic functions in partitioning functions:
CREATE PARTITION FUNCTION MyArchive(datetime)
AS RANGE LEFT FOR VALUES (GETDATE() - 10)
GO
Does that mean that records older than 10 days are automatically moved to the archive (first) partition? Of course not.
The database stores the date when the partitioning schema was set up and uses it in the most logical way.
Let's say one sets up the above schema on 2000-01-11, which makes the delimiting date 2000-01-01.
When you query for data with a date lower than the initial delimiting date (boundary_value - 2000-01-01), you will use only the archive partition.
When you query for data with a date higher than the current day minus 10 days (GETDATE() - 10), you will be using only the current partition.
All other queries will use both partitions, i.e. queries for data with a date lower than the current date minus 10 days but higher than the delimiting date (2000-01-01).
This means that with each passing day, the range of dates for which both partitions are used grows. You would have been better off setting the partition boundary to the delimiting date deterministically.
I don't foresee any scenario where this is useful.