I have a table:
LocationId OriginalValue Mean
1 0.45 3.99
2 0.33 3.99
3 16.74 3.99
4 3.31 3.99
and so forth...
How would I work out the Standard Deviation using this table and also what would you recommend - STDEVP or STDEV?
To use it, simply:
SELECT STDEVP(OriginalValue)
FROM yourTable
From below, you probably want STDEVP.
From here:
STDEV is used when the group of numbers being evaluated are only a partial sampling of the whole population. The denominator for dividing the sum of squared deviations is N-1, where N is the number of observations ( a count of items in the data set ). Technically, subtracting the 1 is referred to as "non-biased."
STDEVP is used when the group of numbers being evaluated is complete - it's the entire population of values. In this case, the 1 is NOT subtracted and the denominator for dividing the sum of squared deviations is simply N itself, the number of observations ( a count of items in the data set ). Technically, this is referred to as "biased." Remembering that the P in STDEVP stands for "population" may be helpful. Since the data set is not a mere sample, but constituted of ALL the actual values, this standard deviation function can return a more precise result.
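As a rough illustration of the two denominators described above, here is a minimal Python sketch. The function name `std_dev` is made up for this example, and the data is only the four rows visible in the question, so the results will not match the full table.

```python
import math
import statistics

def std_dev(values, population):
    """Standard deviation with denominator N (population) or N-1 (sample)."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((v - mean) ** 2 for v in values)
    denominator = n if population else n - 1
    return math.sqrt(sum_sq_dev / denominator)

# Only the four rows shown in the question; the real table has more.
original_values = [0.45, 0.33, 16.74, 3.31]

stdevp_like = std_dev(original_values, population=True)   # divides by N, like STDEVP
stdev_like = std_dev(original_values, population=False)   # divides by N-1, like STDEV
print(stdevp_like, stdev_like)
```

The sample version is always the larger of the two, because it divides the same sum of squares by a smaller number.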
Generally, you should use STDEV when you have to estimate standard deviation based on a sample. But if you have entire column-data given as arguments, then use STDEVP.
In general, if your data represents the entire population, use STDEVP; otherwise, use STDEV.
Note that for large samples the two functions return nearly the same value, so STDEV is a safe default in that case.
In statistics, there are two types of standard deviations: one for a sample and one for a population.
The sample standard deviation, generally notated by the letter s, is used as an estimate of the population standard deviation.
The population standard deviation, generally notated by the Greek letter lower case sigma, is used when the data constitutes the complete population.
It is difficult to answer your question directly -- sample or population -- because it is difficult to tell what you are working with: a sample or a population. It often depends on context.
Consider the following example.
If I want to know the standard deviation of the age of students in my class, then I would use STDEVP because the class is my population. But if I want to use my class as a sample of the population of all students in the school (this would be what is known as a convenience sample, and would likely be biased, but I digress), then I would use STDEV because my class is a sample. The resulting value would be my best estimate of STDEVP.
As mentioned above (1) for large sample sizes (say, more than thirty), the difference between the two becomes trivial, and (2) generally you should use STDEV, not STDEVP, because in practice we usually don't have access to the population. Indeed, one could argue that if we always had access to populations, then we wouldn't need statistics. The entire point of inferential statistics is to be able to make inferences about a population based on the sample.
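To put a number on "the difference becomes trivial": for the same data, the sample result is the population result multiplied by sqrt(n / (n - 1)). A small sketch, with a function name invented for illustration:

```python
import math

def stdev_over_stdevp(n):
    """Ratio of the sample (N-1) to the population (N) standard deviation
    for the same n observations."""
    return math.sqrt(n / (n - 1))

for n in (5, 30, 1000):
    print(n, stdev_over_stdevp(n))
```

At thirty observations the two differ by less than two percent, and the gap keeps shrinking as n grows.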
How can I create skewness and kurtosis statistical functions in BigQuery, like the ones in Python's scipy/pandas?
I have researched UDFs, but I know that these structures do not allow aggregated and windowed operations. These two statistical calculations are not included in BigQuery by default.
You won't need a UDF for that - the definition of the statistical moments isn't so complex.
The first two may have built in versions, but let's cover them as well as the two you're interested in:
The first statistical moment is the mean. As a simple aggregate value: SUM(field)/COUNT(field)
You could create a new column with this value using a window function (which you mentioned)
COUNT(field) OVER(w) AS n,
SUM(field) OVER(w) / COUNT(field) OVER(w) AS mean
Here w would be the definition of a window. I have added a field n for later convenience.
Okay, so now we have the mean. The variance is the second statistical moment, and builds on the definition of the mean:
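SUM(POW(field - mean, 2)) OVER(w) / n AS variance
You can see that defining n previously made this more concise. (In practice mean and n need to come from an inner query or CTE, because a column alias cannot be referenced in the same SELECT list that defines it.)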
The square root of the variance (SQRT(variance) AS sdev) is the standard deviation. Let's also add this sdev column for future convenience.
On to the third moment! The skewness continues to build on the first two moments:
SUM(POW(field - mean, 3)) OVER(w) / (n * POW(sdev, 3)) AS skewness,
(note how defining sdev makes this more concise)
And so we arrive at my favourite, the fourth statistical moment, the one with a name that makes you sound clever if you know it. There are actually two slightly different definitions, but moving between them is simple.
SUM(POW(field - mean, 4)) OVER(w) / (n * POW(sdev, 4)) AS kurtosis,
And we could define kurtosis - 3 AS x_kurtosis if we prefer that definition (kurtosis of a Normal distribution is 3, so subtracting 3 makes it 0 - then a kurtosis of, say 3.1 is an 'excess kurtosis' of 0.1).
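As a cross-check for the SQL, the same four moments can be sketched in plain Python. This assumes the population (divide by n) definitions, where each moment is the sum of the deviations raised to the k-th power, divided by n and by the matching power of the standard deviation. The function name `moments` is illustrative only. As far as I know, scipy's defaults follow these biased definitions while pandas applies bias corrections, so pandas output may differ slightly.

```python
import math

def moments(values):
    """Population mean, variance, skewness and kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Sum the squared deviations first, then divide by n.
    variance = sum((v - mean) ** 2 for v in values) / n
    sdev = math.sqrt(variance)
    skewness = sum((v - mean) ** 3 for v in values) / (n * sdev ** 3)
    kurtosis = sum((v - mean) ** 4 for v in values) / (n * sdev ** 4)
    return {
        "mean": mean,
        "variance": variance,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "x_kurtosis": kurtosis - 3,  # excess kurtosis
    }

result = moments([1, 2, 3, 4, 5])
print(result)
```

For symmetric data such as 1 to 5 the skewness comes out as zero, which is a quick sanity check on the SQL version too.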
I have a column of numbers in my database. How can I compute the standard deviation? I do not want to use the stddev function.
Just because I was curious, I decided to test against the actual STDEV(). I could not quite match the built-in function.
I was close... 0.000141009220002264 or 0.00748% off
Also, the total average and count have to be converted to float (the difference was greater with decimal)
The example below is going after my Treasury Rates Table for the 10 Year Yield (not that it matters)
Select SQLFunction = Stdev([TR_Y10])
,ManualCalc = Sqrt(Sum(Power(((cast([TR_Y10] as float)-B.TotalAvg)),2) / B.TotalCnt))
,Variance = Stdev([TR_Y10]) - Sqrt(Sum(Power(((cast([TR_Y10] as float)-B.TotalAvg)),2) / B.TotalCnt))
From [Chinrus-Shared].[dbo].[DS_Treasury_Rates]
Join (Select TotalAvg=Avg(cast([TR_Y10] as float)),TotalCnt=count(*) From [Chinrus-Shared].[dbo].[DS_Treasury_Rates]) B on 1=1
SQLFunction ManualCalc Variance
1.88409468982299 1.88395368060299 0.000141009220002264
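One plausible explanation for the small gap (an assumption on my part, not something the poster confirmed) is that STDEV() divides by n - 1 while the manual calculation divides by TotalCnt, i.e. n. If so, manual / built-in equals sqrt((n - 1) / n), and the posted numbers can be used to back out the row count:

```python
import math

sql_function = 1.88409468982299   # STDEV() result posted above
manual_calc = 1.88395368060299    # manual result posted above

# Hypothesis: STDEV divides by n-1, the manual query divides by n, so
# (manual / builtin)^2 = (n - 1) / n, which rearranges to n = 1 / (1 - ratio^2).
ratio_squared = (manual_calc / sql_function) ** 2
implied_n = 1.0 / (1.0 - ratio_squared)
print(implied_n)
```

The implied n lands almost exactly on a whole number (roughly 6,681 rows), which is what you would expect if the denominator is the only difference. Dividing by (B.TotalCnt - 1) in the manual calculation should then reproduce STDEV().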
The standard deviation is the square root of the variance.
The (population) variance is the sum of the squares of the differences between the average and each observed value, divided by n.
So, in most databases, you can use window functions:
select sqrt(avg(var))
from (select square(t.x - avg(t.x) over ()) as var
from t
) t;
The square() function might have some other name (such as power()).
The sqrt() function might have some other name.
This is not a good way to calculate the standard deviation in general. In particular, this is a numerically unstable algorithm (although it will work just fine for a reasonable number of ordinary values).
The subquery is needed because window functions cannot be the arguments to aggregation functions.
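Where numerical stability matters, one commonly cited alternative is Welford's single-pass algorithm. This is not what the query above does; it is a separate technique, sketched here in Python with an illustrative function name:

```python
import math
import statistics

def welford_std_dev(values, population=True):
    """Standard deviation in one pass using Welford's running updates."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    denominator = count if population else count - 1
    return math.sqrt(m2 / denominator)

# Large offset with a small spread: the kind of data that exposes rounding issues.
data = [1e9 + v for v in (4, 7, 13, 16)]
print(welford_std_dev(data), statistics.pstdev(data))
```

It never forms the large intermediate sums that cause cancellation, and it only needs one pass over the data.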
I understand that BigQuery is providing an estimation of COUNT DISTINCT, but is there any information on how big the error is and what kind of parameters it depends on?
The accuracy of the COUNT DISTINCT estimate depends on the real number of distinct values. If it is small, the algorithm is pretty accurate (for small values it usually returns the exact value), but the larger the number of distinct values, the less accurate it can become. Note that COUNT(DISTINCT) takes a second argument, which trades memory for accuracy, i.e. it will use more memory but be more accurate. For example:
will return fairly accurate results if the total number of distinct values is less than 100,000.
The exact algorithm for COUNT distinct estimate varies, but different variations have similar error estimate - about 1/SQRT(N), where N is the second argument. Default value is 1000, which corresponds to about 3% error. If bumped to 10000 it would be about 1% error.
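The 1/SQRT(N) rule of thumb is easy to tabulate. A tiny sketch (the function name is made up, and this only evaluates the formula quoted above, not BigQuery's actual estimator):

```python
import math

def approx_relative_error(n):
    """Rule-of-thumb relative error for a given second argument N."""
    return 1.0 / math.sqrt(n)

for n in (1000, 10000, 100000):
    print(n, f"{approx_relative_error(n):.2%}")
```

Because the error shrinks with the square root of N, each tenfold increase in the parameter only improves accuracy by a factor of about three.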
I'm trying to select a random row from a table, but there is a column in this table called Rate. I want it to return rows that have a higher rate more often, and only rarely return rows that have a lower rate. Is this possible?
Table :
CREATE TABLE _Random (Code varchar(128), Rate tinyint)
So you want a random row, but weighted towards the ones with higher rates?
It would also be good to know how many rows there are in the table - sorting the whole lot is kinda expensive. You may prefer to use a row_number concept than sorting by N guids.
So... One option could be to generate a single number, and then divide 100 by it. Imagine we generate a number between 0 and 1.
.25 gives us 400, .5 gives us 200, .75 gives us 133... Notice that there's a curve here, so the numbers closer to 100 come up more often (subtract 100 to make the range start at 0).
You could use RAND() for a single value between 0 and 1 (it's probably good enough), and then do the division and subtraction to get a number. If this is higher than the count of records, then maybe repeat? But try to choose a value for your division that suits.
If you need to weight it more, you could raise your RAND() value to some power, to flatten the curve out or steepen it up. Do some experimenting to see how it looks.
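The divide-and-subtract idea can be tried out quickly in Python before committing to SQL. This is only a sketch under a few assumptions: the rows are already sorted by Rate descending, the function name and the `scale` parameter are invented here, and out-of-range results are simply redrawn.

```python
import random

def weighted_row_index(row_count, scale=100.0, rng=random):
    """Return an index in [0, row_count), favouring low indexes."""
    while True:
        r = 1.0 - rng.random()          # uniform in (0, 1], avoids dividing by zero
        index = int(scale / r - scale)  # 0 when r is 1, grows quickly as r shrinks
        if index < row_count:
            return index                # otherwise repeat the draw

# Hypothetical rows, already sorted so the highest Rate comes first.
rows = [("code%d" % i, 255 - i) for i in range(50)]
picked = rows[weighted_row_index(len(rows))]
print(picked)
```

Changing `scale` is the "choose a value for your division that suits" step: a smaller scale concentrates the picks more heavily on the first few rows.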
This query will fetch a random record which has an above average rate
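SELECT TOP (1) * FROM _Random
WHERE Rate>(SELECT AVG(Rate) FROM _Random)
-- Without an ORDER BY, TOP (1) is not random; NEWID() shuffles the qualifying rows
ORDER BY NEWID()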
My database has a directory of about 2,000 locations scattered throughout the United States with zipcode information (which I have tied to lon/lat coordinates).
I also have a table function which takes two parameters (ZipCode & Miles) to return a list of neighboring zip codes (excluding the same zip code searched)
For each location I am trying to get the neighboring location ids. So if location #4 has three nearby locations, the output should look like:
4 5
4 24
4 137
That is, locations 5, 24, and 137 are within X miles of location 4.
I originally tried to use a cross apply with my function as follows:
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist(A.Sl_Zip,7))) AS Q
WHERE A.SL_StoreNum='04'
However that ran for over 20 minutes with no results so I canceled it. I did try hardcoding in the zipcode and it immediately returned a list
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist('12345',7))) AS Q
WHERE A.SL_StoreNum='04'
What is the most efficient way of accomplishing this listing of nearby locations? Keeping in mind while I used "04" as an example here, I want to run the analysis for 2,000 locations.
The "udf_GetLongLatDist" is a function which uses some math to calculate distance between two geographic coordinates and returns a list of zipcodes with a distance of > 0. Nothing fancy within it.
When you use the function you probably have to calculate every single possible distance for each row. That is why it takes so long. Since the actual physical locations don't generally move, what we always did was precalculate the distance from each zipcode to every other zipcode (and update only once a month or so when we added new possible zipcodes). Once the distances are precalculated, all you have to do is run a query like
select zip2 from zipprecalc where zip1 = '12345' and distance <=10
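To make the precalculation idea concrete, here is a small Python sketch. The zip codes and coordinates are invented, the haversine formula stands in for whatever distance math the existing function uses, and the dictionary plays the role of the zipprecalc table.

```python
import math
from itertools import permutations

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Hypothetical zip codes with made-up coordinates.
zip_coords = {
    "00001": (40.00, -75.00),
    "00002": (40.05, -75.00),
    "00003": (41.00, -75.00),
}

# Computed once up front: (zip1, zip2) -> distance, in both directions.
zip_precalc = {
    (z1, z2): haversine_miles(*zip_coords[z1], *zip_coords[z2])
    for z1, z2 in permutations(zip_coords, 2)
}

def zips_within(zip1, miles):
    """Equivalent of: select zip2 from zipprecalc where zip1 = ? and distance <= ?"""
    return sorted(z2 for (z1, z2), d in zip_precalc.items() if z1 == zip1 and d <= miles)

print(zips_within("00001", 10))
```

The expensive trigonometry runs once per pair when the table is built, and each later lookup is a plain filter.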
We have something similar and optimized it by only calculating the distance of other zipcodes whose latitude is within a bounded range. So if you want other zips within #miles, you use a
where latitude >= #targetLat - (#miles/69.2) and latitude <= #targetLat + (#miles/69.2)
Then you are only calculating the great circle distance of a much smaller subset of other zip code rows. We found this fast enough in our use to not require precalculating.
The same thing can't be done for longitude because of the variation between equator and pole of what distance a degree of longitude represents.
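The latitude band can be sketched in Python as a cheap filter in front of the great-circle calculation. The zip codes and coordinates below are invented, the haversine formula is an assumption about the distance math, and 69.2 miles per degree is the approximation from this answer.

```python
import math

EARTH_RADIUS_MILES = 3958.8   # mean Earth radius, an approximation
MILES_PER_DEGREE_LAT = 69.2   # approximation used in the answer above

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def nearby_zips(target_zip, miles, zip_coords):
    """Zip codes within `miles` of target_zip, excluding the target itself."""
    target_lat, target_lon = zip_coords[target_zip]
    lat_margin = miles / MILES_PER_DEGREE_LAT
    result = []
    for zip_code, (lat, lon) in zip_coords.items():
        if zip_code == target_zip:
            continue
        # Cheap check first: skip rows outside the latitude band.
        if not (target_lat - lat_margin <= lat <= target_lat + lat_margin):
            continue
        # Only the survivors pay for the great-circle calculation.
        if haversine_miles(target_lat, target_lon, lat, lon) <= miles:
            result.append(zip_code)
    return sorted(result)

# Hypothetical zip codes with made-up coordinates.
zip_coords = {
    "00001": (40.00, -75.00),
    "00002": (40.05, -75.00),   # a few miles north
    "00003": (41.00, -75.00),   # outside the latitude band
    "00004": (40.00, -75.05),   # a few miles west
    "00005": (40.00, -76.00),   # inside the band but too far west
}
print(nearby_zips("00001", 7, zip_coords))
```

Since 69.2 is only approximate, a point sitting right at the edge of the radius could be filtered out by the band, so widening the margin slightly is the safer choice when exact boundaries matter.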
Other answers here involve re-working the algorithm. I personally advise the pre-calculated map of all zipcodes against each other. It should be possible to embed such optimisations in your existing udf, to minimise code-changes.
A refactoring of the query, however, could be as follows...
SELECT A.SL_StoreNum, C.SL_StoreNum
FROM tbl_store_locations AS A
CROSS APPLY dbo.udf_GetLongLatDist(A.Sl_Zip,7) AS B
INNER JOIN tbl_store_locations AS C
ON C.SL_Zip = B.zipnum
Also, the performance of the CROSS APPLY will benefit greatly if you can ensure that the udf is INLINE rather than MULTI-STATEMENT. This allows the udf to be expanded inline (macro like) for a much cleaner execution plan.
Doing so would also allow you to return additional fields from the udf. The optimiser can then include or exclude those fields from the plan depending on whether you actually use them. Such an example would be to include the SL_StoreNum if it's easily accessible from the query in the udf, and so remove the need for the last join...