how to compute standard deviation in sql without built-in function - sql

I have a column of numbers in my database. How can I computer the standard deviation? I do not want use the stddev function.

Just because I was curious, I decided to test the actual STDEV(). Now, I could not nail the built in function.
I was close... 0.000141009220002264 or 0.00748% off
Also, The Total Average and Count has to be converted to float (variance was greater with decimal)
The example below is going after my Treasury Rates Table for the 10 Year Yield (not that it matters)
Select SQLFunction = Stdev([TR_Y10])
,ManualCalc = Sqrt(Sum(Power(((cast([TR_Y10] as float)-B.TotalAvg)),2) / B.TotalCnt))
,Variance = Stdev([TR_Y10]) - Sqrt(Sum(Power(((cast([TR_Y10] as float)-B.TotalAvg)),2) / B.TotalCnt))
From [Chinrus-Shared].[dbo].[DS_Treasury_Rates]
Join (Select TotalAvg=Avg(cast([TR_Y10] as float)),TotalCnt=count(*) From [Chinrus-Shared].[dbo].[DS_Treasury_Rates]) B on 1=1
Returns
SQLFunction ManualCalc Variance
1.88409468982299 1.88395368060299 0.000141009220002264

The standard deviation is the square root of the variance divided by n.
The variance is the sum of the squares of the differences between the average and the observed value.
So, in most databases, you can use window functions:
select sqrt(avg(var))
from (select square(t.x - avg(t.x) over ()) as var
from t
) t;
Notes:
The square() function might have some other name (such as power()).
The sqrt() function might have some other name.
This is not a good way to calculate the standard deviation in general. In particular, this is a numerically unstable algorithm (it will work just fine for finite numbers of normal numbers).
The subquery is needed because window functions cannot be the arguments to aggregation functions.

Related

Skewness and Kurtosis Calculation on BigQuery and Standard SQL

How can I create Skewness and Kurtosis statistical functions, which are like Python scipy/pandas on Big query?
I have researched UDFs, but I know that these structures do not allow aggregated and windowed operations. These two statistical calculations are not included in Big Query by default.
You won't need a UDF for that - the definition of the statistical moments isn't so complex.
The first two may have built in versions, but let's cover them as well as the two you're interested in:
mean
The first statistical moment is the mean. As a simple aggregate value: SUM(field)/COUNT(field)
You could create a new column with this value using a window function (which you mentioned)
SELECT
COUNT(field) OVER(w) AS n,
SUM(field) OVER(w) / COUNT(field) OVER(w) AS mean
FROM
some_table
Here w would be the definition of a window. I have added a field n for later convenience.
variance
Okay, so now we have the mean. The variance is the second statistical moment, and builds on the definition of the mean:
SELECT
POW(SUM(field - mean), 2) OVER(w) / n AS variance
FROM
some_table
You can see that defining n previously made this more concise.
The square root of the variance (SQRT(variance) AS sdev) is the standard deviation. Let's also add this sdev column for future convenience.
skewness
On to the third moment! The skewness continues to build on the first two moments:
SELECT
POW(SUM(field - mean), 3) OVER(w) / (n * POW(sdev, 3)) OVER(w) AS skewness,
FROM
some_table
(note how defining sdev makes this more concise)
kurtosis
And so we arrive at my favourite, the fourth statistical moment, the one with a name that makes you sound clever if you know it. There are actually two slightly different definitions, but moving between them is simple.
SELECT
POW(SUM(field - mean), 4) OVER(w) / (n * POW(sdev, 4)) OVER(w) AS kurtosis,
FROM
some_table
And we could define kurtosis - 3 AS x_kurtosis if we prefer that definition (kurtosis of a Normal distribution is 3, so subtracting 3 makes it 0 - then a kurtosis of, say 3.1 is an 'excess kurtosis' of 0.1).

Redshift numeric precision truncating

I have encountered situation that I can't explain how Redshift handles division of SUMs.
There is example table:
create table public.datatype_test(
a numeric(19,6),
b numeric(19,6));
insert into public.datatype_test values(222222.2222, 333333.3333);
insert into public.datatype_test values(444444.4444, 666666.6666);
Now I try to run query:
select sum(a)/sum(b) from public.datatype_test;
I get result 0.6666 (4 decimals). It is not related to tool display, it really returns only 4 decimal places, and it doesn't matter how big or small numbers are in table. In my case 4 decimals is not precise enough.
Same stands true if I use AVG instead of SUM.
If I use MAX instead of SUM, I get : 0.6666666666666666666 (19 decimals).
It also returns correct result (0.6666666666666667) when no phisical table is used:
with t as (
select 222222.2222::numeric(19,6) as a, 333333.3333::numeric(19,6) as b union all
select 444444.4444::numeric(19,6) as a, 666666.6666::numeric(19,6) as b
)
select sum(a)/sum(b) as d from t;
I have looked into Redshift documentation about SUM and Computations with Numeric Values, but I still don't get result according to documentation.
Using float datatype for table columns is not an option as I need to store precise currency amounts and 15 significant digits is not enough.
Using cast on SUM aggregation also gives 0.6666666666666666666 (19 decimals).
select sum(a)::numeric(19,6)/sum(b) from public.datatype_test;
But it looks wrong, and I can't force BI tools to do this workaround, also everyone who uses this data should not use this kind of workaround.
I have tried to use same test in PostgreSQL 10, and it works as it should, returning sufficient amount of decimals for division.
Is there anything I can do with database setup to avoid casting in SQL Query?
Any advice or guidance is highly appreciated.
Redshift version:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.4081
Using dc2.8xlarge nodes
I have run into similar issues, and although I don't have a solution that doesn't require a workaround, I can at least explain it.
The precision/scale of the result of division is defined by the rules in the "computations with numeric values" document.
A consequence of those rules is that a decimal(19,6) divided by another decimal(19,6) will return decimal(38,19).
What's happening to you, though, is that MAX returns the same precision/scale as the underlying column, but SUM returns decimal(38,*) no matter what.
(This is probably a safety precaution to prevent overflow on sums of "big data"). If you divide decimal(38,6) by another, you get decimal(38,4).
AWS support will probably not consider this a defect -- there is no SQL standard for how to treat decimal precision in division, and given that this is documented behavior, it's probably a deliberate decision.
The only way to address this is to typecast the numerator, or multiply it by something like sum(a) * cast(1 as decimal(10,9)) which is portable SQL and will force more decimal places in the numerator and thus the result.
As a convenience I made a calculator in JSFiddle with the rules so you can play around with different options:
scale = Math.max(4, s1 + p2 - s2 + 1)
precision = p1 - s1 + s2 + scale
if (precision > 38) {
scale = Math.max((38 + scale - precision), 4)
precision = 38
}

How does the Average function work in relational databases?

I'm trying to find geometric average of values from a table with millions of rows. For those that don't know, to find the geometric average, you mulitply each value times each other then divide by the number of rows.
You probably already see the problem; The number multiplied number will quickly exceed the maximum allowed system maximum. I found a great solution that uses the natural log.
http://timothychenallen.blogspot.com/2006/03/sql-calculating-geometric-mean-geomean.html
However that got me to wonder wouldn't the same problem apply with the arithmetic mean? If you have N records, and N is very large the running sum can also exceed the system maximum.
So how do RDMS calculate averages during queries?
I don't know an exact implementation for arithmetic mean in an RDBMS, nor did you specify one in your original question. But the RDBMS does not need to sum a million rows in a column in order to obtain the arithmetic mean. Consider the following summation:
sum = (x1 + x2 + x3 + ... + x1000000)
Then the mean can be written as
mean = sum / N = (x1 + x2 + x3 + ... + x1000000) / N, for N = 1,000,000
But this expression can be broken up into pieces like this:
mean = [(x1 + x2 + x3) / N ] + [(x4 + x5 + x6) / N] + ...
In other words, the RDBMS can simply scan down the million rows in a column and find the mean section by section, without running the risk of an overflow. And since each number in the column is presumably within range for the type storing it, there is no chance of the mean value itself overflowing.
Most databases don't support a product() function the way they support an average.
However, you can use do what you want with logs. The product (simplified) is like:
select exp(sum(ln(x)) as product
The average would be:
select power(exp(sum(ln(x))), 1.0 / count(*)) as geoaverage
or
select EXP(AVG(LN(x))) as geoaverage
The LN() function might be LOG() on some platforms...
These are schematics. The functions for exp() and ln() and power() vary, depending on the database. Plus, if you have to take into account zero or negative numbers, the logic is more complicated.
Very easy to check. For example, SQL Server 2008.
DECLARE #T TABLE(i int);
INSERT INTO #T(i) VALUES
(2147483647),
(2147483647);
SELECT AVG(i) FROM #T;
result
(2 row(s) affected)
Msg 8115, Level 16, State 2, Line 7
Arithmetic overflow error converting expression to data type int.
There is no magic. Column type is int, server adds values together using internal variable of the same type int and intermediary result exceeds range for int.
You can run the similar check for any other DBMS that you use. Different engines may behave differently, but I would expect all of them to stick to the original type of the column. For example, averaging two int values 100 and 101 may result in 100 or 101 (still int), but never 100.5.
For SQL Server this behavior is documented. I would expect something similar for all other engines:
AVG () computes the average of a set of values by dividing the sum of
those values by the count of nonnull values. If the sum exceeds the
maximum value for the data type of the return value an error will be
returned.
So, you have to be careful when calculating simple average as well, not just product.
Here is extract from SQL 92 Standard:
6) Let DT be the data type of the < value expression >.
9) If SUM or AVG is specified, then:
a) DT shall not be character string, bit string, or datetime.
b) If SUM is specified and DT is exact numeric with scale S, then the
data type of the result is exact numeric with implementation-defined
precision and scale S.
c) If AVG is specified and DT is exact numeric, then the data type of
the result is exact numeric with implementation- defined precision not
less than the precision of DT and implementation-defined scale not
less than the scale of DT.
d) If DT is approximate numeric, then the data type of the result is
approximate numeric with implementation-defined precision not less
than the precision of DT.
e) If DT is interval, then the data type of the result is inter- val
with the same precision as DT.
So, DBMS can convert int to larger type when calculating AVG, but it has to be an exact numeric type, not floating-point. In any case, depending on the values you can still get arithmetic overflow.
Some DBMS — specifically, the Informix DBMS — convert from an INT type to a floating point type to do the calculation:
SQL[2148]: create table t(i int);
SQL[2149]: insert into t values(214748347);
SQL[2150]: insert into t values(214748347);
SQL[2151]: insert into t values(214748347);
SQL[2152]: select avg(i) from t;
214748347.0
SQL[2153]: types on;
SQL[2154]: select i from t;
INTEGER
214748347
214748347
214748347
SQL[2155]: select avg(i) from t;
DECIMAL(32)
214748347.0
SQL[2156]:
Similarly with other types. This can still end with an overflow under some circumstances; you then get a runtime error. However, it is rather seldom that you exceed the precision — it typically takes a very large number of rows for the sum to exceed the limits, even if you're counting the US deficit over the next century in atto-Zimbabwean dollars circa 2009.

MDX Query SUM PROD to do Weighted Average

I'm building a cube in MS BIDS. I need to create a calculated measure that returns the weighted-average of the rank value weighted by the number of searches. I want this value to be calculated at any level, no matter what dimensions have been applied to break-down the data.
I am trying to do something like the following:
I have one measure called [Rank Search Product] which I want to apply at the lowest level possible and then sum all values of it
IIf([Measures].[Searches] IS NOT NULL, [Measures].[Rank] * [Measures].[Searches], NULL)
And then my weighted average measure uses this:
IIf([Measures].[Rank Search Product] IS NOT NULL AND SUM([Measures].[Searches]) <> 0,
SUM([Measures].[Rank Search Product]) / SUM([Measures].[Searches]),
NULL)
I'm totally new to writing MDX queries and so this is all very confusing to me. The calculation should be
([Rank][0]*[Searches][0] + [Rank][1]*[Searches][1] + [Rank][2]*[Searches][2] ...)
/ SUM([searches])
I've also tried to follow what is explained in this link http://sqlblog.com/blogs/mosha/archive/2005/02/13/performance-of-aggregating-data-from-lower-levels-in-mdx.aspx
Currently loading my data into a pivot table in Excel is return #VALUE! for all calculations of my custom measures.
Please halp!
First of all, you would need an intermediate measure, lets say Rank times Searches, in the cube. The most efficient way to implement this would be to calculate it when processing the measure group. You would extend your fact table by a column e. g. in a view or add a named calculation in the data source view. The SQL expression for this column would be something like Searches * Rank. In the cube definition, you would set the aggregation function of this measure to Sum and make it invisible. Then just define your weighted average as
[Measures].[Rank times Searches] / [Measures].[Searches]
or, to avoid irritating results for zero/null values of searches:
IIf([Measures].[Searches] <> 0, [Measures].[Rank times Searches] / [Measures].[Searches], NULL)
Since Analysis Services 2012 SP1, you can abbreviate the latter to
Divide([Measures].[Rank times Searches], [Measures].[Searches], NULL)
Then the MDX engine will apply everything automatically across all dimensions for you.
In the second expression, the <> 0 test includes a <> null test, as in numerical contexts, NULL is evaluated as zero by MDX - in contrast to SQL.
Finally, as I interpret the link you have in your question, you could leave your measure Rank times Searches on SQL/Data Source View level to be anything, maybe just 0 or null, and would then add the following to your calculation script:
({[Measures].[Rank times Searches]}, Leaves()) = [Measures].[Rank] * [Measures].[Searches];
From my point of view, this solution is not as clear as to directly calculate the value as described above. I would also think it could be slower, at least if you use aggregations for some partitions in your cube.

SQL - STDEVP or STDEV and how to use it?

I have a table:
LocationId OriginalValue Mean
1 0.45 3.99
2 0.33 3.99
3 16.74 3.99
4 3.31 3.99
and so forth...
How would I work out the Standard Deviation using this table and also what would you recommend - STDEVP or STDEV?
To use it, simply:
SELECT STDEVP(OriginalValue)
FROM yourTable
From below, you probably want STDEVP.
From here:
STDEV is used when the group of numbers being evaluated are only a partial sampling of the whole population. The denominator for dividing the sum of squared deviations is N-1, where N is the number of observations ( a count of items in the data set ). Technically, subtracting the 1 is referred to as "non-biased."
STDEVP is used when the group of numbers being evaluated is complete - it's the entire population of values. In this case, the 1 is NOT subtracted and the denominator for dividing the sum of squared deviations is simply N itself, the number of observations ( a count of items in the data set ). Technically, this is referred to as "biased." Remembering that the P in STDEVP stands for "population" may be helpful. Since the data set is not a mere sample, but constituted of ALL the actual values, this standard deviation function can return a more precise result.
Generally, you should use STDEV when you have to estimate standard deviation based on a sample. But if you have entire column-data given as arguments, then use STDEVP.
In general, if your data represents the entire population, use STDEVP; otherwise, use STDEV.
Note that for large samples, the functions return nearly the same value, so better use STDEV in this case.
In statistics, there are two types of standard deviations: one for a sample and one for a population.
The sample standard deviation, generally notated by the letter s, is used as an estimate of the population standard deviation.
The population standard deviation, generally notated by the Greek letter lower case sigma, is used when the data constitutes the complete population.
It is difficult to answer your question directly -- sample or population -- because it is difficult to tell what you are working with: a sample or a population. It often depends on context.
Consider the following example.
If I want to know the standard deviation of the age of students in my class, then I u=would use STDEVP because the class is my population. But if I want the use my class as a sample of the population of all students in the school (this would be what is known as a convenience sample, and would likely be biased, but I digress), then I would use STDEV because my class is a sample. The resulting value would be my best estimate of STDEVP.
As mentioned above (1) for large sample sizes (say, more than thirty), the difference between the two becomes trivial, and (2) generally you should use STDEV, not STDEVP, because in practice we usually don't have access to the population. Indeed, one could argue that if we always had access to populations, then we wouldn't need statistics. The entire point of inferential statistics is to be able to make inferences about a population based on the sample.