Finding standard deviation using basic math functions

I am trying to compute the standard deviation of the income values in a table, using only the basic math functions below in PostgreSQL.
This is what I tried:
SELECT sqrt(sum(power(income - (sum(income) / count(income)), 2)) / (count(*) - 1)) FROM income_data
However, I keep getting the following error:
ERROR: aggregate function calls cannot be nested
Has anyone run into this issue? I feel the logic for obtaining the standard deviation should work, but I haven't had any luck so far. I'd appreciate any suggestions on how to resolve this.

Aggregates cannot be nested, so you should calculate the mean in a separate query, e.g. in a WITH clause (a common table expression):
with mean as (
select sum(income) / count(income) as mean
from income_data
)
select sqrt(sum(power(income - mean, 2)) / (count(*) - 1))
from income_data
cross join mean;
or in a derived table:
select sqrt(sum(power(income - mean, 2)) / (count(*) - 1))
from income_data
cross join (
select sum(income) / count(income) as mean
from income_data
) s;
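As a sanity check of the two-step approach, here is a minimal Python sketch that runs a query of the same shape against an in-memory SQLite table and compares the result with Python's statistics.stdev. The sample values are made up, SQLite only stands in for PostgreSQL for illustration, and the CTE and its column are renamed to stats and mean_income to keep the names distinct. The square root is taken in Python because SQLite's math functions are not always compiled in.

```python
import math
import sqlite3
import statistics

# Hypothetical sample data; the real income_data table is not shown in the question.
incomes = [42000.0, 55000.0, 61000.0, 38000.0, 70500.0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE income_data (income REAL)")
conn.executemany("INSERT INTO income_data VALUES (?)", [(v,) for v in incomes])

# Compute the mean once in a CTE, then cross join it so the outer
# aggregate never contains another aggregate call.
query = """
WITH stats AS (
    SELECT SUM(income) / COUNT(income) AS mean_income
    FROM income_data
)
SELECT SUM((income - mean_income) * (income - mean_income)) / (COUNT(*) - 1)
FROM income_data
CROSS JOIN stats
"""
(sample_variance,) = conn.execute(query).fetchone()
sample_stddev = math.sqrt(sample_variance)

# Reference value from the standard library (sample standard deviation).
reference = statistics.stdev(incomes)
print(sample_stddev, reference)
```

If the restriction to basic math functions is lifted, PostgreSQL's built-in stddev_samp(income) computes the same quantity in one call.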

Related

How to write SQL to calculate running average with some additional formulae?

Following is the image that has running average calculated by me. But the requirement is a bit extra on top of the running average.
Following is the image where the requirement is in the Microsoft Excel sheet.
So, what SQL query could be written to calculate the running average with formulae like =(3*C4+2*C5+1*C6)/6, as used in the Excel sheet?
Also, if it's not feasible through SQL, how could I use column D from the second image as my measure in SSAS?
Use LAG() with an offset and follow your formula accordingly:
avg_val = ( (3.0 * lag(Open_, 2) over (order by M, [WEEK]) )
+ (2.0 * lag(Open_, 1) over (order by M, [WEEK]) )
+ (1.0 * Open_) ) / 6
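To make the weighting explicit, here is a minimal Python sketch of the same calculation. The function name and the sample values are hypothetical; the weights (3, 2, 1) and the divisor 6 come from the Excel formula in the question, with the largest weight on the oldest row in the window, as in the LAG() expression above.

```python
def weighted_running_average(values, weights=(3.0, 2.0, 1.0)):
    """Weighted average of each value and the rows before it.

    weights[0] applies to the oldest row in the window (like LAG(x, 2)),
    weights[-1] to the current row. Rows without enough history yield None,
    mirroring the NULL that LAG() produces at the start of the ordering.
    """
    window = len(weights)
    total_weight = sum(weights)
    result = []
    for i in range(len(values)):
        if i < window - 1:
            result.append(None)
            continue
        chunk = values[i - window + 1 : i + 1]
        result.append(sum(w * v for w, v in zip(weights, chunk)) / total_weight)
    return result


open_prices = [10.0, 20.0, 30.0, 40.0]  # made-up values
averages = weighted_running_average(open_prices)
print(averages)
```

The first two entries are None because there are not yet two preceding rows, which is also what the SQL expression returns for the first two rows of the ordering.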

How to check if a float is between multiple ranges in Postgres?

I'm trying to write a query like this:
SELECT * FROM table t
WHERE ((long_expression BETWEEN -5 AND -2) OR
(long_expression BETWEEN 0 AND 2) OR
(long_expression BETWEEN 4 and 6))
Where long_expression is approximately equal to this:
(((t.s <#> (SELECT s FROM user WHERE user.user_id = $1)) / (SELECT COUNT(DISTINCT cluster_id) FROM cluster) * -1) + 1)
t.s and s are the CUBE datatypes and <#> is the indexed distance operator.
I could just repeat this long expression multiple times in the body, but this would be extremely verbose. An alternative might be to save it in a variable somehow (with a CTE?), but I think this might remove the possibility of using an index in the WHERE clause?
I also found int4range and numrange, but I don't believe they would work here either, because the distance operator returns float8 values, not integers or numerics.
You can use a lateral join:
SELECT t.*
FROM table t CROSS JOIN LATERAL
(VALUES (long_expression)) v(x)
WHERE ((v.x BETWEEN -5 AND -2) OR
(v.x BETWEEN 0 AND 2) OR
(v.x BETWEEN 4 and 6)
);
Of course, a CTE or subquery could be used as well; I like lateral joins because they make it easy to define multiple expressions that depend on previous values.
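For comparison, the subquery variant mentioned above can be sketched as follows. This is a minimal illustration run against SQLite from Python, not the asker's schema: the table items, the column raw and the expression (raw * -1) + 1 are hypothetical stand-ins for the real table and long_expression. It only shows how to name the expression once; it says nothing about index usage in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, raw REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, 4.0), (2, 2.0), (3, -0.5), (4, -2.0), (5, -4.0)],
)

# The long expression is written once in the derived table and
# referenced by its alias x in every range test.
query = """
SELECT s.id, s.x
FROM (
    SELECT i.id, (i.raw * -1) + 1 AS x
    FROM items AS i
) AS s
WHERE (s.x BETWEEN -5 AND -2)
   OR (s.x BETWEEN 0 AND 2)
   OR (s.x BETWEEN 4 AND 6)
ORDER BY s.id
"""
matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)
```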

Get standard errors in BigQuery ML linear regression

I am trying to get the standard errors of the betas in a linear regression in BigQuery ML. Sorry if I have missed something basic, but I cannot find the answer to this question.
#standardSQL
CREATE OR REPLACE MODEL `DATASET.test_lm`
OPTIONS(model_type='LINEAR_REG', input_label_cols= ["y"]) AS
select * from unnest(ARRAY<STRUCT<y INT64, x float64>> [(1,2.028373),
(2,2.347660),(3,3.429958),(4,5.250539),(5,5.976455)])
You can get the weights, without their variance, with
select * from ml.weights(model `DATASET.test_lm`)
Also, you can calculate the standard errors directly like this:
with dat as (
select * from unnest(ARRAY<STRUCT<y INT64, x float64>> [(1,2.028373), (2,2.347660),(3,3.429958),(4,5.250539),(5,5.976455)])),
#get the residual standard error, using simple df-2
rse_dat as (
select sqrt(sum(e2)/((select count(1) from dat)-2)) as rse from (
select pow(y - predicted_y, 2) as e2 from ml.predict(model `DATASET.test_lm`,
(select * from dat)))),
#get the variance of x
xvar_dat as (
select sum(pow(x - (select avg(x) as xbar from dat),2)) as xvar from dat)
#calculate standard error
select sqrt((select pow(rse,2) from rse_dat)/(select xvar from xvar_dat)) as beta_x_se
But this gets to be a heavy lift for many covariates. Is there a direct way to get this pretty basic statistic for confidence intervals?
You could use ML.ADVANCED_WEIGHTS now, which gives standard errors.
https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-advanced-weights
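For reference, the manual calculation in the question (residual standard error divided by the square root of the sum of squared deviations of x) can be sketched in plain Python. This sketch fits the line by closed-form ordinary least squares, which is an assumption: BigQuery ML may fit the model differently, so its numbers need not match exactly. The function name is hypothetical; the data are the five points from the question.

```python
import math


def slope_standard_error(xs, ys):
    """Standard error of the slope in a simple OLS regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Sum of squared residuals of the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(sse / (n - 2))  # residual standard error, df = n - 2
    return rse / math.sqrt(sxx)


# Data from the question.
xs = [2.028373, 2.347660, 3.429958, 5.250539, 5.976455]
ys = [1, 2, 3, 4, 5]
beta_x_se = slope_standard_error(xs, ys)
print(beta_x_se)
```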

How to Calc Exponential Moving Average using SQL Server 2012 Window Functions

I know that it is easy to calculate simple moving average using SQL Server 2012 window functions and OVER() clause. But how can I calculate exponential moving average using this approach? Thanks!
The formula for EMA(x) is:
EMA(x_1) = x_1
EMA(x_n) = α * x_n + (1 - α) * EMA(x_(n-1))
With β := 1 - α that is equivalent to
EMA(x_n) = β^(n-1) * x_1 + α * β^(n-2) * x_2 + α * β^(n-3) * x_3 + ... + α * x_n
In that form it is easy to implement with LAG. For a 4 row EMA it would look like this:
SELECT LAG(x,3)OVER(ORDER BY ?) * POWER(@beta,3) +
LAG(x,2)OVER(ORDER BY ?) * POWER(@beta,2) * @alpha +
LAG(x,1)OVER(ORDER BY ?) * POWER(@beta,1) * @alpha +
x * @alpha
FROM ...
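The equivalence between the recursive definition and the expanded sum is easy to check numerically. The following Python sketch is only an illustration with made-up prices and α = 0.5; both function names are hypothetical.

```python
def ema_recursive(xs, alpha):
    """EMA(x_1) = x_1; EMA(x_n) = alpha * x_n + (1 - alpha) * EMA(x_(n-1))."""
    ema = xs[0]
    for x in xs[1:]:
        ema = alpha * x + (1 - alpha) * ema
    return ema


def ema_expanded(xs, alpha):
    """Same value written as a weighted sum over all rows, with beta = 1 - alpha."""
    beta = 1 - alpha
    n = len(xs)
    total = beta ** (n - 1) * xs[0]  # the first row carries no alpha factor
    for k in range(1, n):
        # xs[k] is x_(k+1), so its weight is alpha * beta^(n - (k + 1))
        total += alpha * beta ** (n - 1 - k) * xs[k]
    return total


prices = [10.0, 11.0, 12.0, 13.0]  # made-up values
recursive_value = ema_recursive(prices, 0.5)
expanded_value = ema_expanded(prices, 0.5)
print(recursive_value, expanded_value)
```

For four rows the expanded sum has exactly the four terms of the LAG() query above, which is why that query only gives the EMA of a fixed-length window.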
OK, as you seem to be after the EWMA_Chart, I created a SQL Fiddle showing how to get there. However, be aware that it uses a recursive CTE that requires one recursion per row returned, so on a big dataset you will most likely get disastrous performance. The recursion is necessary because each row depends on all rows that came before it. While you could get all preceding rows with LAG(), you cannot reference preceding calculations, as LAG() cannot reference itself.
Also, the formula in the spreadsheet you attached below does not make sense. It seems to be trying to calculate the EWMA_Chart value, but it fails at that. In the above SQL Fiddle I included a column [Wrong] that calculates the same value that the spreadsheet is calculating.
Either way, if you need to use this on a big dataset, you are probably better off writing a cursor.
This is the code that does the calculation in the above SQL Fiddle. It references the vSMA view, which calculates the 10-row moving average.
WITH
smooth AS(
SELECT CAST(0.1818 AS NUMERIC(20,5)) AS alpha
),
numbered AS(
SELECT Date, Price, SMA, ROW_NUMBER()OVER(ORDER BY Date) Rn
FROM vSMA
WHERE SMA IS NOT NULL
),
EWMA AS(
SELECT Date, Price, SMA, CAST(SMA AS NUMERIC(20,5)) AS EWMA, Rn
, CAST(SMA AS NUMERIC(20,5)) AS Wrong
FROM numbered
WHERE Rn = 1
UNION ALL
SELECT numbered.Date, numbered.Price, numbered.SMA,
CAST(EWMA.EWMA * smooth.alpha + CAST(numbered.SMA AS NUMERIC(20,5)) * (1 - smooth.alpha) AS NUMERIC(20,5)),
numbered.Rn
, CAST((numbered.Price - EWMA.EWMA) * smooth.alpha + EWMA.EWMA AS NUMERIC(20,5))
FROM EWMA
JOIN numbered
ON EWMA.rn + 1 = numbered.rn
CROSS JOIN smooth
)
SELECT Date, Price, SMA, EWMA
, Wrong
FROM EWMA
ORDER BY Date;

How to select count as a percentage over the total in Oracle using any Oracle function?

I have an SQL statement that calculates the number of active packages (those whose end date is null) as a percentage of the total number of rows. I am currently doing this using (x/y) * 100:
SELECT (SELECT COUNT(*)
FROM packages
WHERE end_dt IS NULL) / (SELECT COUNT(*)
FROM packages) * 100
FROM DUAL;
I wonder if there is a way to make use of any Oracle function to express this more easily?
There's no functionality I'm aware of, but you could simplify the query to:
SELECT SUM(CASE WHEN p.end_dt IS NULL THEN 1 ELSE 0 END) / COUNT(*) * 100
FROM PACKAGES p
So, basically the formula is
COUNT(NULL-valued "end_dt") / COUNT(*) * 100
Now, COUNT(NULL-valued "end_dt") is syntactically wrong, but it can be represented as COUNT(*) - COUNT(end_dt). So, the formula can be like this:
(COUNT(*) - COUNT(end_dt)) / COUNT(*) * 100
If we just simplify it a little, we'll get this:
SELECT (1 - COUNT(end_dt) * 1.0 / COUNT(*)) * 100 AS Percent
FROM packages
The * 1.0 bit converts the integer result of COUNT to a non-integer value to make the division non-integer too.
Correction: the sentence above and the corresponding part of the script turned out to be complete rubbish. Unlike some other database servers, Oracle does not perform integer division, even if both operands are integers. The documentation for the division operator contains no hint of such behaviour.
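The identity used above, that the number of NULL end dates equals COUNT(*) - COUNT(end_dt), can be checked with a small Python sketch against an in-memory SQLite table. The rows are made up, and SQLite is only a stand-in for Oracle. Note one difference: SQLite does perform integer division, so this sketch multiplies by 100.0 before dividing, which Oracle does not need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (id INTEGER, end_dt TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [(1, None), (2, "2020-01-31"), (3, None), (4, "2021-06-30")],
)

# COUNT(end_dt) skips NULLs, so COUNT(*) - COUNT(end_dt) counts the NULL rows.
query = """
SELECT
    SUM(CASE WHEN end_dt IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*),
    (COUNT(*) - COUNT(end_dt)) * 100.0 / COUNT(*)
FROM packages
"""
pct_case, pct_count = conn.execute(query).fetchone()
print(pct_case, pct_count)
```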
The original post is a little long in the tooth but this should work, using the function "ratio_to_report" that's been available since Oracle 8i:
SELECT
NVL2(END_DT, 'NOT NULL', 'NULL') END_DT,
RATIO_TO_REPORT(COUNT(*)) OVER () AS PCT_TOTAL
FROM
PACKAGES
GROUP BY
NVL2(END_DT, 'NOT NULL', 'NULL');