Calculating Outliers - Nested Aggregate Error - SQL

I am currently working in SQL Workbench/J and Amazon Redshift.
I am working on a query with the intent to identify the number of outliers within a data set.
My source data contains one record per day for multiple symbols. I am utilizing 30 days of trailing data. In short, for 30 days there are ten symbols with 30 records each.
I am then utilizing the following query to calculate the mean, standard deviation, and upper/lower control limits for each unique symbol based upon the 30 day data set.
select
symbol,
avg(high) as MEAN,
cast(stddev_samp(high) as dec(14,2)) STDV,
(MEAN+STDV*3) as UCL,
(MEAN-STDV*3) as LCL
from historical
group by symbol
;
My next step will be calculating how many individual values from the 'high' column exceed the calculated upper control limit. I have tried to add the following count(case...) statement, but it is failing:
select
symbol,
avg(high) as MEAN,
cast(stddev_samp(high) as dec(14,2)) STDV,
(MEAN+STDV*3) as UCL,
(MEAN-STDV*3) as LCL,
count(case when high>avg(high) then 1 else 0 end) as outlier
from historical
group by symbol
;
The specific error is
Amazon Invalid operation: aggregate function calls may not have nested aggregate or window function
Is a count(case...) statement the right method to use here, or what would the recommended approach be?

There are a number of ways to do this, but I think all of them involve a sub-query. This is because you are comparing an aggregate (avg) to a per-row value (high) and then summing the result of that comparison.
I'd go with a sub-query where you perform an avg() window function partitioned by symbol. This will give you the average of the group on every row; then just do the query much as you have it. Kinda like this:
select
    symbol,
    avg(high) as MEAN,
    cast(stddev_samp(high) as dec(14,2)) STDV,
    (MEAN+STDV*3) as UCL,
    (MEAN-STDV*3) as LCL,
    count(case when high > group_avg then 1 end) as outlier
from (
    select *, avg(high) over (partition by symbol) as group_avg
    from historical
) t
group by symbol
;
(You could also replace "avg(high) as MEAN" with "min(group_avg) as MEAN" since you already computed the average in the window function. Just a possible slight optimization.)
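A minimal sketch of that variant, keeping everything else the same:
select
    symbol,
    -- group_avg was already computed by the window function,
    -- so min(group_avg) re-uses it instead of recomputing avg(high)
    min(group_avg) as MEAN,
    cast(stddev_samp(high) as dec(14,2)) STDV,
    (MEAN+STDV*3) as UCL,
    (MEAN-STDV*3) as LCL,
    count(case when high > group_avg then 1 end) as outlier
from (
    select *, avg(high) over (partition by symbol) as group_avg
    from historical
) t
group by symbol
;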

Use window functions to calculate the values for the standard deviation and mean. Then aggregate:
select symbol, mean, STDV,
(MEAN+STDV*3) as UCL, (MEAN-STDV*3) as LCL,
sum((high > mean)::int) as outlier
from (select h.*,
avg(high) over (partition by symbol) as mean,
cast(stddev_samp(high) over (partition by symbol) as dec(14,2)) as STDV
from historical h
) h
group by symbol, mean, STDV;
Your definition of "outlier" is rather strange -- merely being higher than the average is going to happen (very roughly) about half the time. The more typical definition I have seen is outside the range of 2 standard deviations.
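For example, a minimal sketch along those lines (an assumption on my part that both tails count as outliers), reusing the same subquery and counting values outside 2 standard deviations on either side of the mean:
select symbol, mean, STDV,
       (mean + STDV*2) as UCL,
       (mean - STDV*2) as LCL,
       -- count rows falling outside mean +/- 2 standard deviations
       sum(case when high > mean + STDV*2
                  or high < mean - STDV*2 then 1 else 0 end) as outlier
from (select h.*,
             avg(high) over (partition by symbol) as mean,
             cast(stddev_samp(high) over (partition by symbol) as dec(14,2)) as STDV
      from historical h
     ) h
group by symbol, mean, STDV;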
As a comment not directly related to the SQL: it seems unusual to me to be using future data to determine outliers. I would expect that a trailing 30 days would be used for that purpose. However, that is not the question you have asked here.

Related

DATETIME_DIFF throwing error when using safe divide

I've created a query that I'm hoping to use to fill a table with daily budgets at the end of every day. To do this, I just need some simple maths:
monthly budget / number of days left in the month.
Then, at the end of the month, I can SUM all of the rows to calculate an accurate budget.
The query is as follows:
SELECT *,
ROUND(SAFE_DIVIDE(budget, DATETIME_DIFF(CURRENT_DATE(), LAST_DAY(CURRENT_DATE()), DAY)),2) AS daily_budget
FROM `mm-demo-project.marketing_hub.budget_manager`
When executing the query, my results present as negative numbers, which, according to the documentation for this function, is likely caused by the result overflowing the result type.
I've made a fool's guess at rounding the calculation. Needless to say, it did not work at all.
How can I stop my query from returning negative numbers?
Use the below:
SELECT *,
ROUND(SAFE_DIVIDE(budget, DATETIME_DIFF(LAST_DAY(CURRENT_DATE()), CURRENT_DATE(), DAY)),2) AS daily_budget
FROM `mm-demo-project.marketing_hub.budget_manager`
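This works because DATETIME_DIFF(a, b, DAY) computes a minus b, so the end of the month has to be the first argument for the number of days left to come out positive. A minimal illustration with hypothetical literal dates:
-- first argument minus second argument, in days
SELECT
  DATETIME_DIFF(DATETIME '2023-01-31', DATETIME '2023-01-01', DAY) AS positive_diff, -- 30
  DATETIME_DIFF(DATETIME '2023-01-01', DATETIME '2023-01-31', DAY) AS negative_diff  -- -30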

How does the group by function work in PostgreSQL? (beginner)

I don't know much at all about SQL, I've just toyed with it here and there through the years but never really 'used' it.
I'm trying to get a list of prices / volumes and aggregate them:
CREATE TABLE IF NOT EXISTS test (
ts timestamp without time zone NOT NULL,
price decimal NOT NULL,
volume decimal NOT NULL
);
What I'd like is to extract:
min price
max price
sum volume
sum (price * volume) / sum (volume)
By 1h slices
If I forget about the last line for now, I have:
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, date_trunc('hour', ts) ts_group FROM test
GROUP BY ts_group;
My understanding is that 'GROUP BY ts_group' will calculate ts_group, build groups of rows, and then apply the MIN / MAX / SUM functions to each group. Since the syntax doesn't make much sense to me (entries on the select line would be treated differently despite being declared together, versus building groups and then declaring operations on the groups), I could be dramatically wrong here.
But that will not return the min_price, max_price and sum_vol results after the grouping; I get ts, price and volume in the results.
If I remove the GROUP BY line to try to see all the output, I get the error:
column "test.ts" must appear in the GROUP BY clause or be used in an aggregate function
Which I don't really understand either...
I looked at:
"must appear in the GROUP BY clause or be used in an aggregate function" but I don't really get it,
and I looked at the doc (https://www.postgresqltutorial.com/postgresql-group-by/), which shows working examples but doesn't really clarify what is wrong with what I'm trying to do here.
While I'd be happy to have a working solution, I'm more looking for an explanation, or pointers toward good resources, that would allow me to understand this.
I have this working solution:
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, (SUM(price * volume)/SUM(volume)) vwap FROM test
GROUP BY date_trunc('hour', ts);
but I still don't understand the error message from my question
All of your expressions in SQL must use data elements and functions that are known to PostgreSQL. In your first example, ts_group is neither an element of your table, nor a defined function, so it complained that it did not know how to calculate it.
Your second example works because date_trunc is a known function and ts is defined as a data element of the test table.
It also gets you the correct grouping (by hour intervals) because date_trunc 'blurs' all of those unique timestamps that by themselves would not combine into groups.
Without a GROUP BY, then having any aggregates in your select list means it will aggregate everything down to just one row. But how does it aggregate date_trunc('hour', ts) down to one row, since there is no aggregating function specified for it? If you were using MySQL, it would just pick some arbitrary value for the column from all the seen values and report that as the "aggregate". But PostgreSQL is not so cavalier with your data. If your query is vague in this way, it refuses to run it. If you just want to see some value from the set without caring which one it is, you could use min or max to aggregate it.
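For instance, a minimal sketch against the test table above, with no GROUP BY at all:
-- Every select-list item is now an aggregate, so PostgreSQL collapses
-- the whole table to a single row without needing a GROUP BY.
SELECT MIN(price)  AS min_price,
       MAX(price)  AS max_price,
       SUM(volume) AS sum_vol,
       MIN(date_trunc('hour', ts)) AS first_hour
FROM test;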
Since the syntax doesn't make any sense to me (entries on the select line would be treated differently while being declared together vs. building groups and then declaring operations on the groups),
You are trying to understand SQL as if it were C. But it is very different. Just learn it for what it is, without trying to force it to be something else. The select list is where you define the columns you want to see in the output. They may be computed in different ways, but what they have in common is that you want each of them to show up in the output, so they are listed together in that spot.

How to add a numerical value into a window frame in SQLite?

I am having some difficulty in adding a numerical digit into my window frame specification in SQLite. I am using R with SQLite, although if you know how to do this in SQL then that's also helpful.
Here is a link to the sqlite window function documentation, although it's a bit hard to understand where I should place my numerical value.
https://www.sqlite.org/windowfunctions.html
In particular i am looking at the frame boundary section.
I keep receiving the error message:
Error: unsupported frame specification
Any ideas?
My code is the following:
"create temp table forward_looking as
SELECT *,
COUNT( CASE channel WHEN 'called_office' THEN 1 ELSE null END)
OVER (PARTITION by special_digs
ORDER BY time
RANGE FOLLOWING 604800)
AS new_count
from my_data
")
Basically, the code should look at the time column, which is in Unix epoch time, then find 7 days in advance (which is 604800 seconds), and then add a count to new_count. And do this on a row-by-row basis.
I think I may have the numeric in the RANGE FOLLOWING part the wrong way around??
I think that you want:
create temp table forward_looking as
select
d.*,
count(*) filter (where channel = 'called_office') over (
partition by special_digs
order by time
range between current row and 604800 following
) as new_count
from my_data d
That is, the range clause requires a starting and ending specification (between ... and ...).
Note that I also modified the window function to use the standard filter clause, which makes the logic more obvious.
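If your SQLite version doesn't accept a filter clause on a window function, a sketch of the same count written with a case expression (same assumed table and columns) would be:
create temp table forward_looking as
select
    d.*,
    -- count() ignores NULLs, so only 'called_office' rows within the
    -- next 604800 seconds (7 days) of each row contribute to the count
    count(case when channel = 'called_office' then 1 end) over (
        partition by special_digs
        order by time
        range between current row and 604800 following
    ) as new_count
from my_data d;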

Presto / Athena: How can I group by one column's categorical values while doing time-series window functions on another column?

I have an Athena table with the following fields:
date (str that I'm date_parse()ing to date format)
entity (str, categorical variable)
value (float, the target metric for my analysis)
Each entity has one value per date.
I'm analyzing variance -- specifically, identifying the entitys for which something unusual is happening in the value field. Previously, I was pulling out a single entity's data and doing some simple anomaly detection in Pandas using the ewm functions.
I'm working with a lot of data, though, and it updates daily. So I would prefer not to run the entire ewm time-series analysis on the thousands of entitys in this table every day. My workaround is to try to calculate a z-score using a window function in Athena, then run the more expensive analysis on the top z-scores for any given day. But I can't seem to figure out how to write the query such that the z-score is only calculated with respect to each entity and the relevant day.
Here's my stab at the initial query, which works for a single entity:
with subquery AS
(SELECT date_parse(date, '%Y-%m-%d') AS day,
value,
entity
FROM mytable
WHERE date_parse(date, '%Y-%m-%d') > date_parse('201-01-01', '%Y-%m-%d')
AND entity = 'sample_entity'),
data_with_stddev AS
(SELECT day,
value,
entity,
(value - avg(value)
OVER ()) / (stddev(value)
OVER ()) AS zscore
FROM subquery
ORDER BY 1)
SELECT *
FROM data_with_stddev
WHERE day > date_parse('2019-12-25', '%Y-%m-%d')
ORDER BY zscore desc
The way I've done this in the past is to run a bash script that iterates over all of the entity variables and executes a separate query for each. I'd like to avoid that. Thanks!
The answer is a partition by clause, like this:
...
OVER (PARTITION BY entity ORDER BY day asc)) / (stddev(value)
OVER (PARTITION BY entity ORDER BY day asc)) AS zscore
...
Docs: https://prestodb.io/docs/current/functions/window.html
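Plugged into the query from the question, a sketch might look like this (the date filters are copied from the question unchanged):
with subquery AS
    (SELECT date_parse(date, '%Y-%m-%d') AS day,
            value,
            entity
     FROM mytable
     WHERE date_parse(date, '%Y-%m-%d') > date_parse('201-01-01', '%Y-%m-%d')),
data_with_stddev AS
    (SELECT day,
            value,
            entity,
            -- per-entity running average and stddev up to each row's day
            (value - avg(value) OVER (PARTITION BY entity ORDER BY day asc))
              / (stddev(value) OVER (PARTITION BY entity ORDER BY day asc)) AS zscore
     FROM subquery)
SELECT *
FROM data_with_stddev
WHERE day > date_parse('2019-12-25', '%Y-%m-%d')
ORDER BY zscore desc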

How to calculate the avg time a tool stays on loan? Oracle SQL Developer

I'm trying to calculate the average time a tool stays on loan. The time a tool stays on loan is the number of days between loan_status_change_date and tool_out_date (table columns). The values in these 2 columns are dates, e.g. 01-SEP-17.
What's the best way to approach this?
We can do arithmetic with Oracle dates. It's not clear from the column names which one is the start of the loan and which the end; in the following example I've assumed loan_status_change_date is when the tool is returned.
select tool
, avg(loan_status_change_date - tool_out_date) as avg_loan_days
from your_table
group by tool
/
The AVG() function is an aggregate function, so it handles the division by the number of rows for us. The subtraction calculates the length of a particular loan, which is the value you want to average. The result of that subtraction is already a number of days, so no further transformation is necessary. If your columns have a time element then the result might not be an integer.
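As a quick illustration of the date arithmetic (hypothetical literal dates; trunc() is just one option if the columns carry a time-of-day component):
-- Subtracting one DATE from another yields the difference as a number of days
select date '2017-09-10' - date '2017-09-01' as days_on_loan
from dual;
-- returns 9

-- If the columns also carry a time of day, truncate them first to count whole days
select tool
     , avg(trunc(loan_status_change_date) - trunc(tool_out_date)) as avg_loan_days
  from your_table
 group by tool;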