How can I create skewness and kurtosis statistical functions in BigQuery, like the ones in Python's scipy/pandas?
I have researched UDFs, but I know that they do not allow aggregated or windowed operations. These two statistical calculations are not included in BigQuery by default.
You won't need a UDF for that - the definitions of the statistical moments aren't so complex.
The first two may have built-in versions, but let's cover them as well as the two you're interested in:
mean
The first statistical moment is the mean. As a simple aggregate value: SUM(field)/COUNT(field)
You could create a new column with this value using a window function (which you mentioned)
SELECT
  COUNT(field) OVER(w) AS n,
  SUM(field) OVER(w) / COUNT(field) OVER(w) AS mean
FROM
  some_table
Here w would be the definition of a window. I have added a field n for later convenience.
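For example, the named window can be declared at the end of the query (before any ORDER BY); the partitioning column here is just a placeholder:
WINDOW w AS (PARTITION BY some_group)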
variance
Okay, so now we have the mean. The variance is the second statistical moment, and builds on the definition of the mean:
SELECT
  SUM(POW(field - mean, 2)) OVER(w) / n AS variance
FROM
  some_table
You can see that defining n previously made this more concise.
The square root of the variance (SQRT(variance) AS sdev) is the standard deviation. Let's also add this sdev column for future convenience.
skewness
On to the third moment! The skewness continues to build on the first two moments:
SELECT
  SUM(POW(field - mean, 3)) OVER(w) / (n * POW(sdev, 3)) AS skewness
FROM
  some_table
(note how defining sdev makes this more concise)
kurtosis
And so we arrive at my favourite, the fourth statistical moment, the one with a name that makes you sound clever if you know it. There are actually two slightly different definitions, but moving between them is simple.
SELECT
  SUM(POW(field - mean, 4)) OVER(w) / (n * POW(sdev, 4)) AS kurtosis
FROM
  some_table
And we could define kurtosis - 3 AS x_kurtosis if we prefer that definition (the kurtosis of a Normal distribution is 3, so subtracting 3 makes it 0; a kurtosis of, say, 3.1 is then an 'excess kurtosis' of 0.1).
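Putting the pieces together, one way the whole chain might look in BigQuery Standard SQL is sketched below. The table some_table, the value column field and the grouping column grp are placeholders, and the inner query leans on the built-in AVG, VAR_POP and STDDEV_POP window functions for the first two moments:
SELECT
  grp,
  ANY_VALUE(mean) AS mean,
  ANY_VALUE(variance) AS variance,
  -- third and fourth standardised moments, built from the per-row deviations
  SUM(POW(field - mean, 3)) / (ANY_VALUE(n) * POW(ANY_VALUE(sdev), 3)) AS skewness,
  SUM(POW(field - mean, 4)) / (ANY_VALUE(n) * POW(ANY_VALUE(sdev), 4)) AS kurtosis
FROM (
  SELECT
    grp,
    field,
    COUNT(field)      OVER (w) AS n,
    AVG(field)        OVER (w) AS mean,
    VAR_POP(field)    OVER (w) AS variance,
    STDDEV_POP(field) OVER (w) AS sdev
  FROM some_table
  WINDOW w AS (PARTITION BY grp)
) t
GROUP BY grp
Subtract 3 from the kurtosis column if you prefer the excess-kurtosis convention mentioned above.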
Related
Suppose I have this query:
select
  x + y as _total,
  abs(x - y) / _total as _err,
  round(100 * _err) as pct_err,
  x,
  y
from foo;
This assumes I have a table with x and y, and calculates an error between them. Note that columns I prefixed with _ are dummy columns - they're only there to show the steps of the calculation more clearly. Is there a way to omit them from the result?
I don't want to simply collapse the three columns into a single expression. It would be messier; also consider a calculation with 10 steps and much longer field names.
I don't want to make this a CTE and then re-select only the columns I want. That seems too much hassle for such a simple thing.
It would be okay if I could just put the dummy columns at the end where they would be out of the way, but SQL doesn't seem to allow referencing a column that comes after.
Note that "no" is an acceptable answer, if you have reasonably comprehensive knowledge of SQL syntax :)
I have a column of numbers in my database. How can I compute the standard deviation? I do not want to use the built-in STDEV function.
Just because I was curious, I decided to test against the actual STDEV(). I could not exactly match the built-in function.
I was close... 0.000141009220002264 or 0.00748% off
Also, the total average and count have to be converted to float (the Variance column was greater with decimal).
The example below queries my Treasury Rates table for the 10 Year Yield (not that it matters).
Select SQLFunction = Stdev([TR_Y10])
      ,ManualCalc  = Sqrt(Sum(Power(cast([TR_Y10] as float) - B.TotalAvg, 2) / B.TotalCnt))
      ,Variance    = Stdev([TR_Y10]) - Sqrt(Sum(Power(cast([TR_Y10] as float) - B.TotalAvg, 2) / B.TotalCnt))
From [Chinrus-Shared].[dbo].[DS_Treasury_Rates]
Join (Select TotalAvg = Avg(cast([TR_Y10] as float)), TotalCnt = count(*)
      From [Chinrus-Shared].[dbo].[DS_Treasury_Rates]) B on 1 = 1
Returns
SQLFunction ManualCalc Variance
1.88409468982299 1.88395368060299 0.000141009220002264
The standard deviation is the square root of the variance.
The variance is the sum of the squared differences between the observed values and the mean, divided by n.
So, in most databases, you can use window functions:
select sqrt(avg(var))
from (select square(t.x - avg(t.x) over ()) as var
      from t
     ) t;
Notes:
The square() function might have some other name (such as power()).
The sqrt() function might have some other name.
This is not a good way to calculate the standard deviation in general. In particular, it is a numerically unstable algorithm (although it will work just fine for a modest number of ordinarily sized values).
The subquery is needed because window functions cannot be the arguments to aggregation functions.
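If you want the sample standard deviation instead (dividing by n - 1, which is what STDEV does in SQL Server), the same pattern needs only a small change. A minimal sketch, assuming a table t with a numeric column x and that square() exists under that name:
select sqrt(sum(var) / (count(*) - 1)) as sample_stddev,  -- denominator n - 1
       sqrt(avg(var)) as population_stddev                -- denominator n
from (select square(t.x - avg(t.x) over ()) as var
      from t
     ) t;
That n versus n - 1 difference is most likely also why the manual calculation in the earlier example comes out slightly below the built-in STDEV.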
I'm building a cube in MS BIDS. I need to create a calculated measure that returns the weighted-average of the rank value weighted by the number of searches. I want this value to be calculated at any level, no matter what dimensions have been applied to break-down the data.
I am trying to do something like the following:
I have one measure called [Rank Search Product] which I want to apply at the lowest level possible and then sum all values of it
IIf([Measures].[Searches] IS NOT NULL, [Measures].[Rank] * [Measures].[Searches], NULL)
And then my weighted average measure uses this:
IIf([Measures].[Rank Search Product] IS NOT NULL AND SUM([Measures].[Searches]) <> 0,
SUM([Measures].[Rank Search Product]) / SUM([Measures].[Searches]),
NULL)
I'm totally new to writing MDX queries and so this is all very confusing to me. The calculation should be
([Rank][0]*[Searches][0] + [Rank][1]*[Searches][1] + [Rank][2]*[Searches][2] ...)
/ SUM([searches])
I've also tried to follow what is explained in this link http://sqlblog.com/blogs/mosha/archive/2005/02/13/performance-of-aggregating-data-from-lower-levels-in-mdx.aspx
Currently, loading my data into a pivot table in Excel returns #VALUE! for all calculations of my custom measures.
Please halp!
First of all, you would need an intermediate measure, let's say Rank times Searches, in the cube. The most efficient way to implement this would be to calculate it when processing the measure group. You would extend your fact table by a column, e.g. in a view, or add a named calculation in the data source view. The SQL expression for this column would be something like Searches * Rank. In the cube definition, you would set the aggregation function of this measure to Sum and make it invisible. Then just define your weighted average as
[Measures].[Rank times Searches] / [Measures].[Searches]
or, to avoid irritating results for zero/null values of searches:
IIf([Measures].[Searches] <> 0, [Measures].[Rank times Searches] / [Measures].[Searches], NULL)
Since Analysis Services 2012 SP1, you can abbreviate the latter to
Divide([Measures].[Rank times Searches], [Measures].[Searches], NULL)
Then the MDX engine will apply everything automatically across all dimensions for you.
In the second expression, the <> 0 test includes a <> null test, as in numerical contexts, NULL is evaluated as zero by MDX - in contrast to SQL.
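As a sketch of the fact-table extension described above - the view, table and column names here are placeholders for whatever your data source view actually contains:
-- hypothetical view extending the fact table with the precomputed product
CREATE VIEW dbo.FactSearchExtended AS
SELECT
    f.*,
    f.Searches * f.[Rank] AS RankTimesSearches
FROM dbo.FactSearch AS f;
You would then bind the Rank times Searches measure to RankTimesSearches, set its aggregation function to Sum, and hide it, as described above.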
Finally, as I interpret the link in your question, you could leave your measure Rank times Searches at the SQL/Data Source View level as anything, maybe just 0 or null, and then add the following to your calculation script:
({[Measures].[Rank times Searches]}, Leaves()) = [Measures].[Rank] * [Measures].[Searches];
From my point of view, this solution is not as clear as directly calculating the value as described above. I would also think it could be slower, at least if you use aggregations for some partitions in your cube.
I have a table:
LocationId OriginalValue Mean
1 0.45 3.99
2 0.33 3.99
3 16.74 3.99
4 3.31 3.99
and so forth...
How would I work out the Standard Deviation using this table and also what would you recommend - STDEVP or STDEV?
To use STDEVP, simply:
SELECT STDEVP(OriginalValue)
FROM yourTable
From below, you probably want STDEVP.
From here:
STDEV is used when the group of numbers being evaluated is only a partial sampling of the whole population. The denominator for dividing the sum of squared deviations is N-1, where N is the number of observations (a count of items in the data set). Technically, subtracting the 1 is referred to as "non-biased."
STDEVP is used when the group of numbers being evaluated is complete - it's the entire population of values. In this case, the 1 is NOT subtracted and the denominator for dividing the sum of squared deviations is simply N itself, the number of observations (a count of items in the data set). Technically, this is referred to as "biased." Remembering that the P in STDEVP stands for "population" may be helpful. Since the data set is not a mere sample, but is constituted of ALL the actual values, this standard deviation function can return a more precise result.
Generally, you should use STDEV when you have to estimate the standard deviation based on a sample. But if the data given as arguments is the entire population, then use STDEVP.
In general, if your data represents the entire population, use STDEVP; otherwise, use STDEV.
Note that for large samples, the functions return nearly the same value, so better use STDEV in this case.
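For example, to see both side by side on the table from the question:
SELECT STDEV(OriginalValue)  AS SampleStdDev,      -- denominator N - 1
       STDEVP(OriginalValue) AS PopulationStdDev   -- denominator N
FROM yourTable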
In statistics, there are two types of standard deviations: one for a sample and one for a population.
The sample standard deviation, generally notated by the letter s, is used as an estimate of the population standard deviation.
The population standard deviation, generally notated by the Greek letter lower case sigma, is used when the data constitutes the complete population.
It is difficult to answer your question directly -- sample or population -- because it is difficult to tell what you are working with: a sample or a population. It often depends on context.
Consider the following example.
If I want to know the standard deviation of the age of students in my class, then I would use STDEVP because the class is my population. But if I want to use my class as a sample of the population of all students in the school (this would be what is known as a convenience sample, and would likely be biased, but I digress), then I would use STDEV because my class is a sample. The resulting value would be my best estimate of STDEVP.
As mentioned above (1) for large sample sizes (say, more than thirty), the difference between the two becomes trivial, and (2) generally you should use STDEV, not STDEVP, because in practice we usually don't have access to the population. Indeed, one could argue that if we always had access to populations, then we wouldn't need statistics. The entire point of inferential statistics is to be able to make inferences about a population based on the sample.
I need to write a query which allows me to find all locations within a range (Miles) from a provided location.
The table is like this:
id | name | lat | lng
So I have been doing research and found this MySQL presentation.
I have tested it on a table with around 100 rows, but it will have plenty more - it must be scalable!
I tried something more simple like this first:
-- just some test data; this would come from user input
set @orig_lat = 55.857807; set @orig_lng = -4.242511; set @dist = 10;
SELECT *, 3956 * 2 * ASIN(
SQRT( POWER(SIN((orig.lat - abs(dest.lat)) * pi()/180 / 2), 2)
+ COS(orig.lat * pi()/180 ) * COS(abs(dest.lat) * pi()/180)
* POWER(SIN((orig.lng - dest.lng) * pi()/180 / 2), 2) ))
AS distance
FROM locations dest, locations orig
WHERE orig.id = '1'
HAVING distance < 1
ORDER BY distance;
This returned rows in around 50ms which is pretty good!
However this would slow down dramatically as the rows increase.
EXPLAIN shows it's only using the PRIMARY key which is obvious.
Then, after reading the article linked above, I tried something like this:
-- defining variables - when this is made into a stored procedure, the
-- values will come from a SELECT query.
set @mylon = -4.242511;
set @mylat = 55.857807;
set @dist = 0.5;
-- calculate lon and lat for the rectangle:
set @lon1 = @mylon - @dist/abs(cos(radians(@mylat))*69);
set @lon2 = @mylon + @dist/abs(cos(radians(@mylat))*69);
set @lat1 = @mylat - (@dist/69);
set @lat2 = @mylat + (@dist/69);
-- run the query:
SELECT *, 3956 * 2 * ASIN(
    SQRT( POWER(SIN((@mylat - abs(dest.lat)) * pi()/180 / 2), 2)
        + COS(@mylat * pi()/180) * COS(abs(dest.lat) * pi()/180)
        * POWER(SIN((@mylon - dest.lng) * pi()/180 / 2), 2) ))
    AS distance
FROM locations dest
WHERE dest.lng BETWEEN @lon1 AND @lon2
  AND dest.lat BETWEEN @lat1 AND @lat2
HAVING distance < @dist
ORDER BY distance;
The time of this query is around 240ms, which is not too bad, but it is slower than the last one. I can imagine, though, that at a much higher number of rows this would work out faster. However, an EXPLAIN shows the possible keys as lat, lng or PRIMARY, and PRIMARY was used.
How can I do this better???
I know I could store the lat/lng as a POINT, but I haven't found much documentation showing whether that would be faster or more accurate.
Any other ideas would be happily accepted!
Thanks very much!
-Stefan
UPDATE:
As Jonathan Leffler pointed out I had made a few mistakes which I hadn't noticed:
I had only put abs() on one of the lat values. I was also using an id search in the WHERE clause in the second query, when there was no need. The first query was purely experimental; the second one is more likely to hit production.
After these changes, EXPLAIN shows the query now uses the lng key, and the average time to respond is around 180ms, which is an improvement.
Any other ideas would be happily accepted!
If you want speed (and simplicity) you'll want some decent geospatial support from your database. This introduces geospatial datatypes, geospatial indexes and (a lot of) functions for processing / building / analyzing geospatial data.
MySQL implements a part of the OpenGIS specifications, although it is / was (last time I checked) very rough around the edges / premature (not useful for any real work).
PostGIS on PostgreSQL would make this trivially easy and readable:
(this finds all points from tableb which are closer than 1000 meters to the point in tablea with id 123)
select
  myvalue
from
  tablea, tableb
where
  st_dwithin(tablea.the_geom, tableb.the_geom, 1000)
  and
  tablea.id = 123
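If the table only stores plain lat / lng columns (as in the question), one rough sketch of getting there with PostGIS is to add, populate and index a geography column; the column and index names below are just illustrative, and ST_DWithin on geography works in metres:
-- add a geography point column, fill it from lat/lng, and index it
ALTER TABLE locations ADD COLUMN geog geography(Point, 4326);
UPDATE locations
   SET geog = ST_SetSRID(ST_MakePoint(lng, lat), 4326)::geography;
CREATE INDEX locations_geog_idx ON locations USING GIST (geog);

-- all locations within ~10 miles (about 16093 metres) of a given point
SELECT id, name
  FROM locations
 WHERE ST_DWithin(geog,
                  ST_SetSRID(ST_MakePoint(-4.242511, 55.857807), 4326)::geography,
                  16093);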
The first query ignores the parameters you set - using 1 instead of @dist for the distance, and using the table alias orig instead of the parameters @orig_lat and @orig_lng.
You then have the query doing a Cartesian product between the table and itself, which is seldom a good idea if you can avoid it. You get away with it because of the filter condition orig.id = 1, which means that there's only one row from orig joined with each of the rows in dest (including the point with dest.id = 1; you should probably have a condition AND orig.id != dest.id). You also have a HAVING clause but no GROUP BY clause, which is indicative of problems. The HAVING clause is not relating any aggregates, but a HAVING clause is (primarily) for comparing aggregate values.
Unless my memory is failing me, COS(ABS(x)) === COS(x), so you might be able to simplify things by dropping the ABS(). Failing that, it is not clear why one latitude needs the ABS and the other does not - symmetry is crucial in matters of spherical trigonometry.
You have a dose of magic numbers - the value 69 is presumably the number of miles in a degree (of longitude, at the equator), and 3956 is the radius of the earth in miles.
I'm suspicious of the box calculated if the given position is close to a pole. In the extreme case, you might need to allow any longitude at all.
The condition dest.id = 1 in the second query is odd; I believe it should be omitted, but its presence should speed things up, because only one row matches that condition. So the extra time taken is puzzling. But using the primary key index is appropriate as written.
You should move the condition in the HAVING clause into the WHERE clause.
But I'm not sure this is really helping...
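Putting those points together, one possible reshaping of the first query - a sketch only, keeping the question's table and variable names, dropping the self-join and the ABS() calls, and moving the distance filter into a WHERE clause by wrapping the calculation in a derived table:
SELECT d.*
FROM (
    SELECT dest.*,
           3956 * 2 * ASIN(
               SQRT( POWER(SIN((@orig_lat - dest.lat) * PI()/180 / 2), 2)
                   + COS(@orig_lat * PI()/180) * COS(dest.lat * PI()/180)
                   * POWER(SIN((@orig_lng - dest.lng) * PI()/180 / 2), 2) )
           ) AS distance
    FROM locations dest
) d
WHERE d.distance < @dist
ORDER BY d.distance;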
The NGS Online Inverse Geodesic Calculator is the traditional reference means to calculate the distance between any two locations on the earth ellipsoid:
http://www.ngs.noaa.gov/cgi-bin/Inv_Fwd/inverse2.prl
But the above calculator is still problematic. Especially between two near-antipodal locations, the computed distance can show an error of some tens of kilometres! The origin of the numerical trouble was identified a long time ago by Thaddeus Vincenty (page 92):
http://www.ngs.noaa.gov/PUBS_LIB/inverse.pdf
In any case, it is preferable to use the reliable and very accurate online calculator by Charles Karney:
http://geographiclib.sourceforge.net/cgi-bin/Geod
Some thoughts on improving performance. It wouldn't simplify things from a maintainability standpoint (makes things more complex), but it could help with scalability.
Since you know the radius, you can add conditions for the bounding box, which may allow the db to optimize the query to eliminate some rows without having to do the trig calcs.
You could pre-calculate some of the trig values of the lat/lon of stored locations and store them in the table. This would shift some of the performance cost when inserting the record, but if queries outnumber inserts, this would be good. See this answer for an idea of this approach:
Query to get records based on Radius in SQLite?
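A minimal sketch of that idea for this table - the lat_rad, lng_rad and cos_lat column names are invented here:
-- store the radian values and the cosine of the latitude once, at write time
ALTER TABLE locations
    ADD COLUMN lat_rad DOUBLE,
    ADD COLUMN lng_rad DOUBLE,
    ADD COLUMN cos_lat DOUBLE;

UPDATE locations
   SET lat_rad = RADIANS(lat),
       lng_rad = RADIANS(lng),
       cos_lat = COS(RADIANS(lat));
The distance expression can then read these columns instead of recomputing RADIANS() and COS() for every row on every query.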
You could look at something like geohashing.
When used in a database, the structure of geohashed data has two advantages. ... Second, this index structure can be used for a quick-and-dirty proximity search - the closest points are often among the closest geohashes.
You could search SO for some ideas on how to implement:
https://stackoverflow.com/search?q=geohash
If you're only interested in rather small distances, you can approximate the geographical grid by a rectangular grid.
SELECT *, SQRT( POWER(RADIANS(@mylat - dest.lat), 2)
              + POWER(RADIANS(@mylon - dest.lng) * COS(RADIANS(@mylat)), 2)
        ) * @radiusOfEarth AS approximateDistance
…
You could make this even more efficient by storing radians instead of (or in addition to) degrees in your database. If your queries may cross the 180° meridian, some extra care would be necessary there, but many applications don't have to deal with those locations. You could also try to change POWER(x, 2) to x*x, which might get computed faster.
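For completeness, the approximation might slot into the earlier bounding-box query roughly like this - a sketch, where 3956 is the earth's radius in miles and @lat1/@lat2/@lon1/@lon2 are computed as in the question:
SELECT dest.*,
       SQRT( POWER(RADIANS(@mylat - dest.lat), 2)
           + POWER(RADIANS(@mylon - dest.lng) * COS(RADIANS(@mylat)), 2)
       ) * 3956 AS approximateDistance
FROM locations dest
WHERE dest.lng BETWEEN @lon1 AND @lon2
  AND dest.lat BETWEEN @lat1 AND @lat2
HAVING approximateDistance < @dist
ORDER BY approximateDistance;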