How to add a numerical value into a window frame in SQLite?

I am having some difficulty adding a numerical value into my window frame specification in SQLite. I am using R with SQLite, although if you know how to do this in plain SQL then that's also helpful.
Here is a link to the SQLite window function documentation, although it's a bit hard to understand where I should place my numerical value:
https://www.sqlite.org/windowfunctions.html
In particular I am looking at the frame boundary section.
I keep receiving the error message:
Error: unsupported frame specification
Any ideas?
My code is the following:
"create temp table forward_looking as
SELECT *,
COUNT( CASE channel WHEN 'called_office' THEN 1 ELSE null END)
OVER (PARTITION by special_digs
ORDER BY time
RANGE FOLLOWING 604800)
AS new_count
from my_data
")
Basically, the code should look at the time column (which is in Unix epoch time), look 7 days ahead (which is 604800 seconds), and count the matching rows into new_count, doing this on a row-by-row basis.
I think I may have the numeric in the RANGE FOLLOWING part the wrong way around?

I think that you want:
create temp table forward_looking as
select
    d.*,
    count(*) filter (where channel = 'called_office') over (
        partition by special_digs
        order by time
        range between current row and 604800 following
    ) as new_count
from my_data d
That is, the range clause requires a starting and ending specification (between ... and ...).
Note that I also modified the window function to use the standard filter clause, which makes the logic more obvious.
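If you want to sanity-check the frame behaviour, here is a tiny self-contained SQLite sketch (made-up sample data, reusing the question's column names; RANGE with a numeric offset needs a fairly recent SQLite, 3.28 or later as far as I know):
-- Made-up sample data using the question's column names.
CREATE TEMP TABLE my_data (special_digs TEXT, channel TEXT, time INTEGER);
INSERT INTO my_data VALUES
    ('a', 'called_office', 0),        -- day 0
    ('a', 'called_office', 86400),    -- day 1
    ('a', 'emailed',       172800),   -- day 2
    ('a', 'called_office', 950400);   -- day 11, outside the 7-day window of the earlier rows

SELECT special_digs, channel, time,
       COUNT(*) FILTER (WHERE channel = 'called_office') OVER (
           PARTITION BY special_digs
           ORDER BY time
           RANGE BETWEEN CURRENT ROW AND 604800 FOLLOWING
       ) AS new_count
FROM my_data;
-- The first row gets new_count = 2: the calls at day 0 and day 1 both fall within its 7-day window.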

Related

How to set a max range condition with timescale time_bucket_gapfill() in order to not fill real missing values?

I'd like some advice to know if what I need to do is achievable with Timescale functions.
I've just found out I can use time_bucket_gapfill() to complete missing data, which is amazing! I need data every 5 minutes but I can receive 10-minute, 30-minute or 1-hour data, so the function helps me complete the missing points in order to have only 5-minute points. Also, I use locf() to set the gapfilled value to the last value found.
My question is: can I set a max range when I set the last value found with locf(), so that it never goes back more than 1 hour?
Example: if the last value found is older than 1 hour ago, I don't want to fill the gaps; I need to leave them empty to show that we have real missing values there.
I think I'm close to something with this, but apparently I'm not allowed to use locf() twice in the same CASE expression.
ERROR: multiple interpolate/locf function calls per resultset column not supported
Does somebody have an idea how I can resolve that?
How to reproduce:
Create table powers
CREATE table powers (
delivery_point_id BIGINT NOT NULL,
at timestamp NOT NULL,
value BIGINT NOT NULL
);
Create hypertable
SELECT create_hypertable('powers', 'at');
Create indexes
CREATE UNIQUE INDEX idx_dpid_at ON powers(delivery_point_id, at);
CREATE INDEX index_at ON powers(at);
Insert one day of data for one delivery point, one point every 10 minutes
INSERT INTO powers SELECT 1, at, round(random()*10000) FROM generate_series(TIMESTAMP '2021-01-01 00:00:00', TIMESTAMP '2021-01-02 00:00:00', INTERVAL '10 minutes') AS at;
Remove three hours of data from 4am to 7am
DELETE FROM powers WHERE delivery_point_id = 1 AND at < '2021-01-01 07:00:00' AND at > '2021-01-01 04:00:00';
The query that needs to be fixed
SELECT
time_bucket_gapfill('5 minutes', at) AS point_five,
avg(value) AS avg,
CASE
WHEN (locf(at) - at) > interval '1 hour' THEN null
ELSE locf(avg(value))
END AS gapfilled
FROM powers
GROUP BY point_five, at
ORDER BY point_five;
Actual: ERROR: multiple interpolate/locf function calls per resultset column not supported
Expected: Gapfilled values each 5 minutes except between 4am and 7 am (real missing values).
This is a great question! I'm going to provide a workaround for how to do this with the current functionality, but I think it'd be great if you'd open a GitHub issue as well, because there might be a way to add an option for this that doesn't require a workaround like this.
I also think your attempt was a good approach and just requires a few tweaks to get it right!
The error you're seeing means we can't have multiple locf calls in a single column. That limitation is pretty easy to work around by shifting both of them into a subquery, but that's not enough. The other thing we need to change is that locf only works on aggregates; right now you're trying to use it on a column (at) that isn't aggregated, which isn't going to work, because it wouldn't know which of the values of at in a time bucket to "pull forward" for the gapfill.
Now, you said you want to fill data as long as the previous point wasn't more than one hour ago. We can take the last value of at in the bucket by using last(at, at); this is also the max(at), so either of those aggregates would work. So we put that into a CTE (common table expression, or WITH query) and then do the CASE statement outside, like so:
WITH filled as (SELECT
time_bucket_gapfill('5 minutes', at) AS point_five,
avg(value) AS avg,
locf(last(at, at)) as filled_from,
locf(avg(value)) as filled_avg
FROM powers
WHERE at BETWEEN '2021-01-01 01:30:00' AND '2021-01-01 08:30:00'
AND delivery_point_id = 1
GROUP BY point_five
ORDER BY point_five)
SELECT point_five,
avg,
filled_from,
CASE WHEN point_five - filled_from > '1 hour'::interval THEN NULL
ELSE filled_avg
END as gapfilled
FROM filled;
Note that I’ve tried to name my CTE expressively so that it’s a little easier to read!
Also, I wanted to point out a couple of other hyperfunctions that you might think about using:
heartbeat_agg is a new/experimental one that will help you determine periods when your system is up or down, so if you're expecting points at least every hour, you can use it to find the periods where the delivery point was down, or the like.
When you have more irregular sampling or want to deal with different data frequencies from different delivery points, I'd take a look at the time_weight family of functions. They can be more efficient than using something like gapfill to upsample, by instead letting you treat all the different sample rates similarly, without having to create more points and more work to do so. Even if you want to, for instance, compare sums of values, you'd use something like integral to get the time-weighted sum over a period based on the locf interpolation.
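For example, a time-weighted average per hour over the powers table might look roughly like this (just a sketch, assuming the timescaledb_toolkit extension, which provides time_weight() and its average() accessor, is installed):
-- Sketch: time_weight() builds a time-weighted aggregate using LOCF interpolation;
-- average() extracts the time-weighted mean for each bucket.
SELECT
    time_bucket('1 hour', at) AS bucket,
    average(time_weight('LOCF', at, value)) AS time_weighted_avg
FROM powers
WHERE delivery_point_id = 1
GROUP BY bucket
ORDER BY bucket;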
Anyway, hope all that is helpful!

Calculating Outliers - Nested Aggregate Error

I am currently working in SQL Workbench/J and Amazon Redshift.
I am working on a query with the intent to identify the number of outliers within a data set.
My source data contains one record per day for multiple symbols. I am utilizing 30 days of trailing data. In short, for 30 days there are ten symbols with 30 records each.
I am then utilizing the following query to calculate the mean, standard deviation, and upper/lower control limits for each unique symbol based upon the 30 day data set.
select
symbol,
avg(high) as MEAN,
cast(stddev_samp(high) as dec(14,2)) STDV,
(MEAN+STDV*3) as UCL,
(MEAN-STDV*3) as LCL
from historical
group by symbol
;
My next step will be calculating how many individual values from the 'high' column exceed the calculated upper control limit. I have tried to add the following count(case...) statement, but it is failing:
select
symbol,
avg(high) as MEAN,
cast(stddev_samp(high) as dec(14,2)) STDV,
(MEAN+STDV*3) as UCL,
(MEAN-STDV*3) as LCL,
count(case when high>avg(high) then 1 else 0 end) as outlier
from historical
group by symbol
;
The specific error is
Amazon Invalid operation: aggregate function calls may not have nested aggregate or window function
Is a count(case..) statement the right method to utilize here, or what would the recommended approach or example be?
There are a number of ways to do this, but I think all of them involve a sub-query. This is because you have an aggregate (avg) compared to a per-row value (high), and you then sum up that comparison.
I'd go with a sub-query where you perform an avg() window function partitioned by symbol. This will give you the average of the group on every row; then just do the query as you have it. Kinda like this:
select symbol,
       avg(high) as MEAN,
       cast(stddev_samp(high) as dec(14,2)) STDV,
       (MEAN+STDV*3) as UCL,
       (MEAN-STDV*3) as LCL,
       sum(case when high > group_avg then 1 else 0 end) as outlier
from (
    select *, avg(high) over (partition by symbol) as group_avg
    from historical
) h
group by symbol;
(You could also replace "avg(high) as MEAN" with "min(group_avg) as MEAN" since you already computed the average in the window function. Just a possible slight optimization.)
Use window functions to calculate the values for the standard deviation and mean. Then aggregate:
select symbol, mean, STDV,
       (MEAN+STDV*3) as UCL, (MEAN-STDV*3) as LCL,
       sum(case when high > mean then 1 else 0 end) as outlier
from (select h.*,
             avg(high) over (partition by symbol) as mean,
             cast(stddev_samp(high) over (partition by symbol) as dec(14,2)) as STDV
      from historical h
     ) h
group by symbol, mean, STDV;
Your definition of "outlier" is rather strange -- merely being higher than the average is going to happen (very roughly) about half the time. The more typical definition I have seen is outside the range of 2 standard deviations.
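For what it's worth, a sketch counting values outside mean ± 2 standard deviations per symbol (same table and columns as above) could look something like this:
-- Sketch: flag rows more than 2 standard deviations from the per-symbol mean, then count them.
select symbol,
       sum(case when high > mean + 2 * stdv
                  or high < mean - 2 * stdv
                then 1 else 0 end) as outlier
from (select h.*,
             avg(high) over (partition by symbol) as mean,
             stddev_samp(high) over (partition by symbol) as stdv
      from historical h
     ) h
group by symbol;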
As a comment not directly related to the SQL: it seems unusual to me to be using future data to determine outliers. I would expect that a trailing 30 days would be used for that purpose. However, that is not the question you have asked here.

Cumulative count for calculating daily frequency using SQL query (in Amazon Redshift)

I have a dataset that contains 'UI' (unique id), time, and frequency (the frequency of the given value in the UI column), as shown here:
What I would like is to add a new column named 'daily_frequency' which simply counts each unique value in the UI column for a given day sequentially, as I show in the image below.
For example, if UI=114737 and it is repeated 2 times in one day, we should have 1 and 2 in the daily_frequency column.
I could do that with Python and the pandas package using the groupby and cumcount methods as follows:
df['daily_frequency'] = df.groupby(['UI','day']).cumcount()+1
However, for some reason, I must do this via SQL queries (Amazon Redshift).
I think you want a running count, which could be calculated as:
COUNT(*) OVER (PARTITION BY ui, TRUNC(time) ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS daily_frequency
Although Salman's answer seems to be correct, I think ROW_NUMBER() is simpler:
ROW_NUMBER() OVER (PARTITION BY ui, time::date
                   ORDER BY time
                  ) AS daily_frequency
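In a complete query it would look roughly like this (your_table is a placeholder, since the question doesn't give the actual table name):
-- Sketch: 'your_table' stands in for the actual table, which isn't named in the question.
SELECT ui,
       time,
       frequency,
       ROW_NUMBER() OVER (PARTITION BY ui, time::date
                          ORDER BY time) AS daily_frequency
FROM your_table
ORDER BY ui, time;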

How to read 2 or more records in Hive UDF?

I have a table of toll station logs. My task, "translated" into SQL, is:
Step 1: group these records by station and lane (GROUP BY station, lane).
Step 2: arrange the records within each group using ORDER BY check_time.
Step 3 (that is the problem): for every two consecutive records in each group, judge whether the interval between them is less than 5 seconds or not.
This would be easy in C, Java or another language, but not in SQL.
It seems that a Hive UDF (User Defined Function) could help me do that. I have read the demo UDF from the official documentation, but I still don't know how to pass two consecutive records into my function. Any advice?
You can do it using SQL.
Using the LAG() analytic function you can get the previous row's check_time and other columns if necessary. Then do a calculation with the two timestamps: convert them to seconds using unix_timestamp() and subtract:
select t.*,
       case when time_diff < 5 then ... else ... end  -- do some logic here
from
(
    select t.*,
           -- current row's time minus previous row's time, in seconds
           unix_timestamp(check_time) -
           unix_timestamp(lag(check_time) over (partition by station, lane order by check_time)) as time_diff
    from your_table t  -- replace with your table name
) t
Similarly, the LEAD() analytic function can be used to get the next row's check_time or other columns if necessary.
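A sketch of that forward-looking variant with LEAD() (the labels in the CASE are just placeholders for whatever logic you need):
select t.*,
       case when time_diff_next < 5 then 'within 5 sec of next' else 'gap' end as flag  -- placeholder labels
from
(
    select t.*,
           -- next row's time minus current row's time, in seconds
           unix_timestamp(lead(check_time) over (partition by station, lane order by check_time)) -
           unix_timestamp(check_time) as time_diff_next
    from your_table t  -- replace with your table name
) t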

Rolling average postgres

I am running Postgres 9.2 and I have a large table something like
CREATE TABLE sensor_values
(
ts timestamp with time zone NOT NULL,
value double precision NOT NULL DEFAULT 'NaN'::real,
sensor_id integer NOT NULL
)
I have values coming into the system constantly, i.e. many per minute. I want to maintain a rolling standard deviation / average for the last 200 values so I can determine whether new values entering the system are within, say, 3 standard deviations of the mean. To do so I would need the current standard deviation and mean to be constantly updated for the last 200 values.
As the table can be hundreds of millions of rows, I do not want to fetch the last, say, 200 rows for a sensor ordered by time and then do avg(value), var_samp(value) for every new value coming in. I am assuming it will be faster to incrementally update the standard deviation and mean.
I have started writing a PL/pgSQL function to update a rolling variance and mean on each new value entering the system for a particular sensor.
I can do this using pseudocode like:
newavg = oldavg + (new_value - old_value)/window_size
new_variance += (new_value-old_value)*(new_value-newavg+old_value-oldavg)/(window_size-1)
This is based on
http://jonisalonen.com/2014/efficient-and-accurate-rolling-standard-deviation/
Basically, the window is of size 200 values. The old_value is the first (oldest) value of the window. When a new value comes in, we shift the window forward by one. After I get the result I store the following values for the sensor:
The first value of the window.
The mean average of the window values.
The variance of the window values.
This way I don't have to constantly fetch the last 200 values and do a sum etc. I can reuse these values when a new sensor value comes in.
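For illustration, the stored per-sensor state could live in a small side table like this (a sketch only; the table and column names are made up):
-- Sketch: one row per sensor holding the saved window state (illustrative names).
CREATE TABLE sensor_rolling_state
(
    sensor_id          integer PRIMARY KEY,
    window_first_value double precision NOT NULL,  -- oldest value still inside the 200-value window
    window_mean        double precision NOT NULL,
    window_variance    double precision NOT NULL
);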
My problem is that when first running I have no previous window data for a sensor (i.e. the three values above), so I have to do it the slow way.
Something like:
WITH s AS (
    SELECT value
    FROM sensor_values
    WHERE sensor_values.sensor_id = $1
      AND ts >= (NOW() - INTERVAL '2 day')::timestamptz
    ORDER BY ts DESC
    LIMIT 200
)
SELECT avg(value), var_samp(value) INTO last_window_average, last_window_variance FROM s;
But how could I get the last (earliest) value to save from that SELECT statement?
Can I access the first row from s in PL/pgSQL?
I thought PL/pgSQL would be a faster / cleaner approach, but maybe it's better to do this in client code?
Are there better ways to perform this type of rolling statistic update?
I assume that it will not be drastically slow to re-calculate the latest 200 entries each time with proper indexing. If you create an index like:
CREATE INDEX i_sensor_values ON sensor_values(sensor_id, ts DESC);
you'll be able to get results fairly quickly doing:
SELECT sum("value") -- add more expressions as required
FROM sensor_values
WHERE sensor_id=$1
ORDER BY ts DESC
LIMIT 200;
You can execute this query in a loop from a PL/pgSQL function.
If you migrate to 9.3 (or higher) any time soon, you'll also be able to use LATERAL joins for this purpose.
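A rough sketch of that LATERAL variant (PostgreSQL 9.3+), computing the statistics over the latest 200 readings for every sensor in one statement:
-- Sketch: per-sensor avg/variance over the latest 200 readings using LATERAL (9.3+).
SELECT s.sensor_id, w.avg_value, w.var_value
FROM (SELECT DISTINCT sensor_id FROM sensor_values) s
CROSS JOIN LATERAL (
    SELECT avg(l."value") AS avg_value,
           var_samp(l."value") AS var_value
    FROM (SELECT "value"
          FROM sensor_values
          WHERE sensor_id = s.sensor_id
          ORDER BY ts DESC
          LIMIT 200) l
) w;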
I do not think a covering index will do much good here, as the table is constantly changing and an index-only scan will not kick in.
It is also worth looking at loose index scans.
P.S. The column name value should be double-quoted, as it is an SQL reserved word.