How to set a max range condition with timescale time_bucket_gapfill() in order to not fill real missing values? - sql

I'd like some advices to know if what I need to do is achievable with timescale functions.
I've just found out I can use time_bucket_gapfill() to complete missing data, which is amazing! I need data each 5 minutes but I can receive 10 minutes, 30 minutes or 1 hour data. So the function helps me to complete the missing points in order to have only 5 minutes points. Also, I use locf() to set the gapfilled value with last value found.
My question is: can I set a max range when I set the last value found with locf() in order to never overpass 1 hour ?
Example: If the last value found is older than 1 hour ago I don't want to fill gaps, I need to leave it empty to say we have real missing values here.
I think I'm close to something with this but apparently I'm not allowed to use locf() in the same case.
ERROR: multiple interpolate/locf function calls per resultset column not supported
Somebody have an idea how I can resolve that?
How to reproduce:
Create table powers
CREATE table powers (
delivery_point_id BIGINT NOT NULL,
at timestamp NOT NULL,
value BIGINT NOT NULL
);
Create hypertable
SELECT create_hypertable('powers', 'at');
Create indexes
CREATE UNIQUE INDEX idx_dpid_at ON powers(delivery_point_id, at);
CREATE INDEX index_at ON powers(at);
Insert data for one day, one delivery point, point 10 minutes
INSERT INTO powers SELECT 1, at, round(random()*10000) FROM generate_series(TIMESTAMP '2021-01-01 00:00:00', TIMESTAMP '2022-01-02 00:00:00', INTERVAL '10 minutes') AS at;
Remove three hours of data from 4am to 7am
DELETE FROM powers WHERE delivery_point_id = 1 AND at < '2021-01-1 07:00:00' AND at > '2021-01-01 04:00:00';
The query that need to be fixed
SELECT
time_bucket_gapfill('5 minutes', at) AS point_five,
avg(value) AS avg,
CASE
WHEN (locf(at) - at) > interval '1 hour' THEN null
ELSE locf(avg(value))
END AS gapfilled
FROM powers
GROUP BY point_five, at
ORDER BY point_five;
Actual: ERROR: multiple interpolate/locf function calls per resultset column not supported
Expected: Gapfilled values each 5 minutes except between 4am and 7 am (real missing values).

This is a great question! I'm going to provide a workaround for how to do this with the current stuff, but I think it'd be great if you'd open a Github issue as well, because there might be a way to add an option for this that doesn't require a workaround like this.
I also think your attempt was a good approach and just requires a few tweaks to get it right!
The error that you're seeing is that we can't have multiple locf calls in a single column, this is a limitation that's pretty easy to work around as we can just shift both of them into a subquery, but that's not enough. The other thing that we need to change is that locf only works on aggregates, right now, you’re trying to use it on a column (at) that isn’t aggregated, which isn’t going to work, because it wouldn’t know which of the values of at in a time_bucket to “pull forward” for the gapfill.
Now you said you want to fill data as long as the previous point wasn’t more than one hour ago, so, we can take the last value of at in the bucket by using last(at, at) this is also the max(at) so either of those aggregates would work. So we put that into a CTE (common table expression or WITH query) and then we do the case statement outside like so:
WITH filled as (SELECT
time_bucket_gapfill('5 minutes', at) AS point_five,
avg(value) AS avg,
locf(last(at, at)) as filled_from,
locf(avg(value)) as filled_avg
FROM powers
WHERE at BETWEEN '2021-01-01 01:30:00' AND '2021-01-01 08:30:00'
AND delivery_point_id = 1
GROUP BY point_five
ORDER BY point_five)
SELECT point_five,
avg,
filled_from,
CASE WHEN point_five - filled_from > '1 hour'::interval THEN NULL
ELSE filled_avg
END as gapfilled
FROM filled;
Note that I’ve tried to name my CTE expressively so that it’s a little easier to read!
Also, I wanted to point out a couple other hyperfunctions that you might think about using:
heartbeat_agg is a new/experimental one that will help you determine periods when your system is up or down, so if you're expecting points at least every hour, you can use it to find the periods where the delivery point was down or the like.
When you have more irregular sampling or want to deal with different data frequencies from different delivery points, I’d take a look a the time_weight family of functions. They can be more efficient than using something like gapfill to upsample, by instead letting you treat all the different sample rates similarly, without having to create more points and more work to do so. Even if you want to, for instance, compare sums of values, you’d use something like integral to get the time weighted sum over a period based on the locf interpolation.
Anyway, hope all that is helpful!

Related

Finding the Closest Unbooked Dates Using SQL

Scenario
A user selects a date. Based on the selection I check whether the date & time is booked or not (No issues here).
If a date & time is booked, I need to show them n alternative dates. Based on their date and time parameters, and those proposed alternative dates have to be as close as to their chosen date as possible. The list of alternative dates should start from the date the query is ran on My backend handles this.
My Progress So Far
SELECT alternative_date
FROM GENERATE_SERIES(
TIMESTAMP '2022-08-20 05:00:00',
date_trunc('month', TIMESTAMP '2022-08-20 07:00:00') + INTERVAL '1 month - 1 day',
INTERVAL '1 day'
) AS G(alternative_date)
WHERE NOT EXISTS(
SELECT * FROM events T
WHERE T.bookDate::DATE = G.alternative_date::DATE
)
The code above uses the GENERATE_SERIES(...) function in PSQL. It searches for all dates, starting from 2022-08-20, and up to the end of August. It specifically returns the dates which does not exist in the bookDate column (Meaning it has not yet been booked).
Problems I Need Help With
When searching for alternative dates, I'm providing 3 important things
The user's preferred booking date, so I can suggest which other dates are close to him that he can choose? How would I go about doing this? It's the part where I'm facing most trouble.
The user's start and end times, so when providing a list of alternative dates, I can tell him, hey there's free space between 06 and 07 on the date 2022-08-22 for instance. I'm also facing some issues here, a push in the right track will be great!
I want to add another WHERE but it fails, the current WHERE is a NOT EXISTS so it looks for all dates not equaling to what is given. My other WHERE basically means WHERE the place is open for booking or not.
To get closest free dates, you can ORDER BY your result by "distance" of particular alternative date to user's preferred date - the shortest intervals will be first:
ORDER BY alternative_date - TIMESTAMP '2022-08-20 05:00:00'
If you want to recommend time slots smaller than whole dates (hour range), you need to switch the whole thing from dates to hours, i.e. generate_series from 1 day to 1 hour (or whatever your smallest bookable unit is) and excluse invalid hours (nighttime I assume) in WHERE. From there, it is pretty much the same as with dates.
As for "second where", there can be only one WHERE, but it can be composed from multiple conditions - you can add more conditions using AND operator (and it can also be sub-query if needed):
WHERE NOT EXISTS(
SELECT * FROM events T
WHERE T.bookDate::DATE = G.alternative_date::DATE
) AND NOT EXISTS (
SELECT 1 FROM events WHERE "roomId" = '13b46460-162d-4d32-94c0-e27dd9246c79'
)
(warning: this second sub-query is probably dangerous in real world, since the room will be used more than one time, I assume, so you need to add some time condition to the subquery to check against date)

sqlalchemy select by date column only x newset days

suppose I have a table MyTable with a column some_date (date type of course) and I want to select the newest 3 months data (or x days).
What is the best way to achieve this?
Please notice that the date should not be measured from today but rather from the date range in the table (which might be older then today)
I need to find the maximum date and compare it to each row - if the difference is less than x days, return it.
All of this should be done with sqlalchemy and without loading the entire table.
What is the best way of doing it? must I have a subquery to find the maximum date? How do I select last X days?
Any help is appreciated.
EDIT:
The following query works in Oracle but seems inefficient (is max calculated for each row?) and I don't think that it'll work for all dialects:
select * from my_table where (select max(some_date) from my_table) - some_date < 10
You can do this in a single query and without resorting to creating datediff.
Here is an example I used for getting everything in the past day:
one_day = timedelta(hours=24)
one_day_ago = datetime.now() - one_day
Message.query.filter(Message.created > one_day_ago).all()
You can adapt the timedelta to whatever time range you are interested in.
UPDATE
Upon re-reading your question it looks like I failed to take into account the fact that you want to compare two dates which are in the database rather than today's day. I'm pretty sure that this sort of behavior is going to be database specific. In Postgres, you can use straightforward arithmetic.
Operations with DATEs
1. The difference between two DATES is always an INTEGER, representing the number of DAYS difference
DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19
You may add or subtract an INTEGER to a DATE to produce another DATE
DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
You're probably using timestamps if you are storing dates in postgres. Doing math with timestamps produces an interval object. Sqlalachemy works with timedeltas as a representation of intervals. So you could do something like:
one_day = timedelta(hours=24)
Model.query.join(ModelB, Model.created - ModelB.created < interval)
I haven't tested this exactly, but I've done things like this and they have worked.
I ended up doing two selects - one to get the max date and another to get the data
using the datediff recipe from this thread I added a datediff function and using the query q = session.query(MyTable).filter(datediff(max_date, some_date) < 10)
I still don't think this is the best way, but untill someone proves me wrong, it will have to do...

How to design SQL tables when column data arrives in multiple types/margins of error?

I've been given a stack of data where a particular value has been collected sometimes as a date (YYYY-MM-DD) and sometimes as just a year.
Depending on how you look at it, this is either a variance in type or margin of error.
This is a subprime situation, but I can't afford to recover or discard any data.
What's the optimal (eg. least worst :) ) SQL table design that will accept either form while avoiding monstrous queries and allowing maximum use of database features like constraints and keys*?
*i.e. Entity-Attribute-Value is out.
You could store the year, month and day components in separate columns. That way, you only need to populate the columns for which you have data.
if it comes in as just a year make it default to 01 for month and date, YYYY-01-01
This way you can still use a date/datetime datatype and don't have to worry about invalid dates
Either bring it in as a string unmolested, and modify it so it's consistent in another step, or modify the year-only values during the import like SQLMenace recommends.
I'd store the value in a DATETIME type and another value (just an integer will do, or some kind of enumerated type) that signifies its precision.
It would be easier to give more information if you mentioned what kind of queries you will be doing on the data.
Either fix it, then store it (OK, not an option)
Or store it broken with a fixed computed columns
Something like this
CREATE TABLE ...
...
Broken varchar(20),
Fixed AS CAST(CASE WHEN Broken LIKE '[12][0-9][0-9][0-9]' THEN Broken + '0101' ELSE Broken END AS datetime)
This also allows you to detect good from bad source data
If you don't always have a full date, what sort of keys and constraints would you need? Perhaps store two columns of data; a full date, and a year. For data that has only year, the year is stored and date is null. For items with full info, both are populated.
I'd put three columns in the table:
The provided value (YYYY-MM-DD or YYYY)
A date column, Date or DateTime data type, which is nullable
A year column, as an integer or char(4) depending upon your needs.
I'd always populate the year column, populate the date column only when the provided value is a date.
And, because you've kept the provided value, you can always re-process down the road if needs change.
An alternative solution would be to that of a date mask (like in IP). Store the date in a regular datetime field, and insert an additional field of type smallint or something, where you could indicate which is present (could go even binary here):
If you have YYYY-MM-DD, you would have 3 bits of data, which will have the values 1 if data is present and 0 if not.
Example:
Date Mask
2009-12-05 7 (111)
2009-12-01 6 (110, only year and month are know, and day is set to default 1)
2009-01-20 5 (101, for some strange reason, only the year and the date is known. January has 31 days, so it will never generate an error)
Which solution is better depends on what you will do with it.
This is better when you want to select those with full dates, which are between a certain period (less to write). Also this way it's easier to compare any dates which have masks like 7,6,4. It may also take up less memory (date + smallint may be smaller than int+int+int, and only if datetime uses 64 bit, and smallint uses up as much as int, it will be the same).
I was going to suggest the same solution as #ninesided did above. Additionally, you could have a date field and a field that quantitatively represents your uncertainty. This offers the advantage of being able to represent things like "on or about Sept 23, 2010". The problem is that to represent the case where you only know the year, you'd have to set your date to be the middle of the year, with 182.5 days' uncertainty (assuming non-leap year), which seems ugly.
You could use a similar but distinct approach with a mask that represents what date parts you're confident about - that's what SQLMenace offered in his answer above.
+1 each to recommendations from ninesided, Nikki9696 and Jeff Siver - I support all those answers though none was exactly what I decided upon.
My solution:
a date column used only for complete dates
an int column used for years
a constraint to ensure integrity between the two
a trigger to populate the year if only date is supplied
Advantages:
can run simple (one-column) queries on the date column with missing data ignored (by using NULL for what it was designed for)
can run simple (one-column) queries on the year column for any row with a date (because year is automatically populated)
insert either year or date or both (provided they agree)
no fear of disagreement between columns
self explanatory, intuitive
I would argue that methods using YYYY-01-01 to signify missing data (when flagged as such with a second explanatory column) fail seriously on points 1 and 5.
Example code for Sqlite 3:
create table events
(
rowid integer primary key,
event_year integer,
event_date date,
check (event_year = cast(strftime("%Y", event_date) as integer))
);
create trigger year_trigger after insert on events
begin
update events set event_year = cast(strftime("%Y", event_date) as integer)
where rowid = new.rowid and event_date is not null;
end;
-- various methods to insert
insert into events (event_year, event_date) values (2008, "2008-02-23");
insert into events (event_year) values (2009);
insert into events (event_date) values ("2010-01-19");
-- select events in January without expressions on supplementary columns
select rowid, event_date from events where strftime("%m", event_date) = "01";

Date range intersection in SQL

I have a table where each row has a start and stop date-time. These can be arbitrarily short or long spans.
I want to query the sum duration of the intersection of all rows with two start and stop date-times.
How can you do this in MySQL?
Or do you have to select the rows that intersect the query start and stop times, then calculate the actual overlap of each row and sum it client-side?
To give an example, using milliseconds to make it clearer:
Some rows:
ROW START STOP
1 1010 1240
2 950 1040
3 1120 1121
And we want to know the sum time that these rows were between 1030 and 1100.
Lets compute the overlap of each row:
ROW INTERSECTION
1 70
2 10
3 0
So the sum in this example is 80.
If your example should have said 70 in the first row then
assuming #range_start and #range_end as your condition paramters:
SELECT SUM( LEAST(#range_end, stop) - GREATEST(#range_start, start) )
FROM Table
WHERE #range_start < stop AND #range_end > start
using the greatest/least and date functions you should be able to get what you need directly operating on the date type.
I fear you're out of luck.
Since you don't know the number of rows that you will be "cumulatively intersecting", you need either a recursive solution, or an aggregation operator.
The aggregation operator you need is no option because SQL does not have the data type that it is supposed to operate on (that type being an interval type, as described in "Temporal Data and the Relational Model").
The recursive solution may be possible, but it is likely to be difficult to write, difficult to read to other programmers, and it is also questionable whether the optimizer can turn that query into the optimal data access strategy.
Or I misunderstood your question.
There's a fairly interesting solution if you know the maximum time you'll ever have. Create a table with all the numbers in it from one to your maximum time.
millisecond
-----------
1
2
3
...
1240
Call it time_dimension (this technique is often used in dimensional modelling in data warehousing.)
Then this:
SELECT
COUNT(*)
FROM
your_data
INNER JOIN time_dimension ON time_dimension.millisecond BETWEEN your_data.start AND your_data.stop
WHERE
time_dimension.millisecond BETWEEN 1030 AND 1100
...will give you the total number of milliseconds of running time between 1030 and 1100.
Of course, whether you can use this technique depends on whether you can safely predict the maximum number of milliseconds that will ever be in your data.
This is often used in data warehousing, as I said; it fits well with some kinds of problems -- for example, I've used it for insurance systems, where a total number of days between two dates was needed, and where the overall date range of the data was easy to estimate (from the earliest customer date of birth to a date a couple of years into the future, beyond the end date of any policies that were being sold.)
Might not work for you, but I figured it was worth sharing as an interesting technique!
After you added the example, it is clear that indeed I misunderstood your question.
You are not "cumulatively intersecting rows".
The steps that will bring you to a solution are :
intersect each row's start and end point with the given start and end points. This should be doable using CASE expressions or something of that nature, something in the style of :
SELECT (CASE startdate < givenstartdate : givenstartdate, CASE startdate >= givenstartdate : startdate) as retainedstartdate, (likewise for enddate) as retainedenddate FROM ... Cater for nulls and that sort of stuff as needed.
With the retainedstartdate and retainedenddate, use a date function to compute the length of the retained interval (which is the overlap of your row with the given time section).
SELECT the SUM() of those.

Simultaneous calls from CDR

I need to come up with an analysis of simultaneus events, when having only starttime and duration of each event.
Details
I've a standard CDR call detail record, that contains among others:
calldate (timedate of each call start
duration (int, seconds of call duration)
channel (a string)
What I need to come up with is some sort of analysys of simultaneus calls on each second, for a given timedate period. For example, a graph of simultaneous calls we had yesterday.
(The problem is the same if we have visitors logs with duration on a website and wish to obtain simultaneous clients for a group of web-pages)
What would your algoritm be?
I can iterate over records in the given period, and fill an array, where each bucket of the array corresponds to 1 second in the overall period. This works and seems to be fast, but if the timeperiod is big (say..1 year), I would need lots of memory (3600x24x365x4 bytes ~ 120MB aprox).
This is for a web-based, interactive app, so my memory footprint should be small enough.
Edit
By simultaneous, I mean all calls on a given second. Second would be my minimum unit. I cannot use something bigger (hour for example) becuse all calls during an hour do not need to be held at the same time.
I would implement this on the database. Using a GROUP BY clause with DATEPART, you could get a list of simultaneous calls for whatever time period you wanted, by second, minute, hour, whatever.
On the web side, you would only have to display the histogram that is returned by the query.
#eric-z-beard: I would really like to be able to implement this on the database. I like your proposal, and while it seems to lead to something, I dont quite fully understand it. Could you elaborate? Please recall that each call will span over several seconds, and each second need to count. If using DATEPART (or something like it on MySQL), what second should be used for the GROUP BY. See note on simultaneus.
Elaborating over this, I found a way to solve it using a temporary table. Assuming temp holds all seconds from tStart to tEnd, I could do
SELECT temp.second, count(call.id)
FROM call, temp
WHERE temp.second between (call.start and call.start + call.duration)
GROUP BY temp.second
Then, as suggested, the web app should use this as a histogram.
You can use a static Numbers table for lots of SQL tricks like this. The Numbers table simply contains integers from 0 to n for n like 10000.
Then your temp table never needs to be created, and instead is a subquery like:
SELECT StartTime + Numbers.Number AS Second
FROM Numbers
You can create table 'simultaneous_calls' with 3 fields: yyyymmdd Char(8),
day_second Number, -- second of the day,
count Number -- count of simultaneous calls
Your web service can take 'count' value from this table and make some statistics.
Simultaneous_calls table will be filled by some batch program which will be started every day after end of the day.
Assuming that you use Oracle, the batch may start a PL/SQL procedure which does the following:
Appends table with 24 * 3600 = 86400 records for each second of the day, with default 'count' value = 0.
Defines the 'day_cdrs' cursor for the query:
Select to_char(calldate, 'yyyymmdd') yyyymmdd,
(calldate - trunc(calldate)) * 24 * 3600 starting_second,
duration duration
From cdrs
Where cdrs.calldate >= Trunc(Sysdate -1)
And cdrs.calldate
Iterates the cursor to increment 'count' field for the seconds of the call:
For cdr in day_cdrs
Loop
Update simultaneos_calls
Set count = count + 1
Where yyyymmdd = cdr.yyyymmdd
And day_second Between cdr.starting_second And cdr.starting_second + cdr.duration;
End Loop;