Generate normally distributed series using BigQuery - google-bigquery

Is there a way to generate a normally distributed series in BQ, ideally specifying the mean and sd of the distribution?
I found a way using the Marsaglia polar method, but it is not ideal: I do not want the polar coordinates of the distribution, but rather to generate an array that is normally distributed with the specified parameters.
Thank you in advance.

This query gives you the (x, y) coordinates of the normal density curve centred at 0. You can adjust the mean (the mean variable), the variance (the variance variable), and the x-axis values (GENERATE_ARRAY(beginning, end, step)):
CREATE TEMPORARY FUNCTION normal(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
var mean = 0;
var variance = 1;
// Normalising constant: 1 / sqrt(2*pi*variance)
var x0 = 1 / Math.sqrt(2 * Math.PI * variance);
// Exponent: -(x - mean)^2 / (2 * variance), since variance = sigma^2
var x1 = -Math.pow(x - mean, 2) / (2 * variance);
return x0 * Math.pow(Math.E, x1);
""";
WITH numbers AS
(SELECT x FROM UNNEST(GENERATE_ARRAY(-10, 10,0.5)) AS x)
SELECT x, normal(x) as normal
FROM numbers;
For doing that, I used "User Defined Functions" [1]. They are useful when you want to reuse a SQL expression or when you want to use JavaScript (as I did).
NOTE: I used the probability density function of the normal distribution; if you want to use another one, you'd need to change the variables x0, x1 and the return value (I wrote them separately so it's clearer).
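As a sanity check outside BigQuery, here is the same density in Python (a sketch, not part of the original answer; note the exponent's denominator is 2 * variance, since variance = sigma^2):

```python
import math

# f(x) = 1/sqrt(2*pi*variance) * exp(-(x - mean)^2 / (2*variance))
def normal_pdf(x, mean=0.0, variance=1.0):
    x0 = 1.0 / math.sqrt(2.0 * math.pi * variance)  # normalising constant
    x1 = -((x - mean) ** 2) / (2.0 * variance)      # exponent
    return x0 * math.exp(x1)
```

For the standard normal, normal_pdf(0.0) is 1/sqrt(2*pi), which matches what the UDF returns at x = 0.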

Earlier answers give the probability density function of a normal random variable. Here I modify the previous answers to give a random number generated with the desired distribution, in BQ standard SQL, using the Box-Muller ('polar coordinates') method. The question asks not to use polar coordinates, which is an odd request, since polar coordinates are not used in the generated normally distributed random number itself.
CREATE TEMPORARY FUNCTION rnorm(mu FLOAT64, sigma FLOAT64) AS
(
  -- Box-Muller: sqrt(-2 ln U1) * cos(2 pi U2) is a standard normal draw
  mu + sigma * SQRT(-2 * LOG(RAND())) * COS(2 * ACOS(-1) * RAND())
);
SELECT
  num,
  rnorm(-1, 5.3) AS rand_norm
FROM UNNEST(GENERATE_ARRAY(1, 17)) AS num
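Outside BigQuery, the same Box-Muller construction can be sketched in Python to check that the mean and standard deviation come out as requested (the rnorm helper name mirrors the SQL function; this is my own sketch, not the answer's code):

```python
import math
import random

def rnorm(mu, sigma, rng=random.random):
    """One normal draw via the Box-Muller transform."""
    u1 = 1.0 - rng()  # shift [0,1) to (0,1] so log(u1) is always defined
    u2 = rng()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z
```

Drawing a large sample with mu = -1 and sigma = 5.3 gives a sample mean near -1 and a sample sd near 5.3.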

The easiest way to do it in BQ is by creating a custom function:
CREATE OR REPLACE FUNCTION
  `your_project.functions.normal_distribution_pdf`
  (x ANY TYPE, mu ANY TYPE, sigma ANY TYPE) AS (
  (
    SELECT
      safe_divide(1, sigma * power(2 * ACOS(-1), 0.5))
        * exp(-0.5 * power(safe_divide(x - mu, sigma), 2))
  )
);
Next you only need to apply the function:
with inputs as (
SELECT 1 as x, 0 as mu, 1 as sigma
union all
SELECT 1.5 as x, 1 as mu, 2 as sigma
union all
SELECT 2 as x , 2 as mu, 3 as sigma
)
SELECT x,
`your_project.functions.normal_distribution_pdf`(x, mu, sigma) as normal_pdf
from
inputs
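As a quick cross-check of the formula, here is the same PDF in Python, evaluated at the three input rows from the query above (the normal_pdf helper is hypothetical, not part of the answer):

```python
import math

# f(x) = 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x - mu)/sigma)^2)
def normal_pdf(x, mu, sigma):
    return (1.0 / (sigma * math.sqrt(2.0 * math.pi))) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# The three rows from the "inputs" CTE
rows = [(1.0, 0.0, 1.0), (1.5, 1.0, 2.0), (2.0, 2.0, 3.0)]
values = [normal_pdf(*r) for r in rows]
```

The first row is the standard normal at x = 1, i.e. exp(-0.5)/sqrt(2*pi); the third is the peak of a sigma = 3 normal, i.e. 1/(3*sqrt(2*pi)).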

Related

Intersection between two tables in DolphinDB server

I'm trying to use the intersection function in DolphinDB as follows:
n=1000000
ID=rand(100, n)
dates=2017.08.07..2017.08.11
date=rand(dates, n)
x=rand(10.0, n)
t=table(ID, date, x)
dbDate = database(, VALUE, 2017.08.07..2017.09.11)
dbID = database(, RANGE, 0 50 100)
db = database("dfs://compodb", COMPO, [dbDate, dbID])
pt = db.createPartitionedTable(t, `pt, `date`ID).append!(t)
dfsTable=loadTable("dfs://compodb","pt")
A = select * from dfsTable where date = 2017.08.07
B = select * from dfsTable where date = 2017.08.08
intersection(A[`x],B[`x])
But I am getting the error:
The both arguments for 'bitAnd'(&) must be integers
Apparently something doesn’t work in this query... any idea?
The documentation says this about creating a vector:
A vector from a table column. For example, trades.qty indicates column qty from table trades.
And it looks like intersection is an alias for &, which for vectors is treated as bitAnd, as described here:
Arguments
Set Operation: X and Y are sets.
Bit Operation: X and Y are equal sized vectors, or Y is a scalar.
So you need to convert the vectors to sets with the set function: intersection(set(A[`x]), set(B[`x])).

BigQuery: external UDFs with standard SQL

Today I tried to write a UDF in standard SQL in the web UI editor. I have already unchecked the option 'Use Legacy SQL', but it returned the following error message:
Not Implemented: You cannot use legacy SQL UDFs with standard SQL queries. See https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#differences_in_user-defined_javascript_functions
Therefore I tried an example of an external UDF provided by Google Cloud Platform: https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions. But it still returns the same error message. Here is the example:
CREATE TEMPORARY FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
return x*y;
""";
WITH numbers AS
(SELECT 1 AS x, 5 as y
UNION ALL
SELECT 2 AS x, 10 as y
UNION ALL
SELECT 3 as x, 15 as y)
SELECT x, y, multiplyInputs(x, y) as product
FROM numbers;
Question: how do I use an external UDF with standard SQL in the web UI?
Make sure not to enter the function in the "UDF Editor" panel; it should go with the rest of your query. See this topic in the migration guide for an example:
#standardSQL
-- Computes the harmonic mean of the elements in 'arr'.
-- The harmonic mean of x_1, x_2, ..., x_n can be expressed as:
-- n / ((1 / x_1) + (1 / x_2) + ... + (1 / x_n))
CREATE TEMPORARY FUNCTION HarmonicMean(arr ARRAY<FLOAT64>)
RETURNS FLOAT64 LANGUAGE js AS """
var sum_of_reciprocals = 0;
for (var i = 0; i < arr.length; ++i) {
  sum_of_reciprocals += 1 / arr[i];
}
return arr.length / sum_of_reciprocals;
""";
WITH T AS (
SELECT GENERATE_ARRAY(1.0, x * 4, x) AS arr
FROM UNNEST([1, 2, 3, 4, 5]) AS x
)
SELECT arr, HarmonicMean(arr) AS h_mean
FROM T;
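The harmonic-mean formula in the UDF above can be cross-checked against Python's statistics.harmonic_mean (a sketch of mine, not part of the answer):

```python
import statistics

# n / ((1/x_1) + (1/x_2) + ... + (1/x_n)), as in the JS UDF
def harmonic_mean(arr):
    sum_of_reciprocals = sum(1.0 / v for v in arr)
    return len(arr) / sum_of_reciprocals

arr = [1.0, 2.0, 3.0, 4.0]  # GENERATE_ARRAY(1.0, 4, 1), i.e. the x = 1 row of T
```

Both implementations agree to floating-point precision.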

Adding constraints to a function using Optim.jl in Julia

I am using the Optim.jl library to maximise the Sharpe ratio:
using Optim, LinearAlgebra

function getSharpeRatioNegative(W, ex_mu, S)
    # Negated so that minimising this maximises the Sharpe ratio
    return -dot(W, ex_mu) / sqrt(dot(W, S * W))
end
f(W::Vector) = getSharpeRatioNegative(W, ex_mu, S)
result = optimize(f, [0.2; 0.2; 0.2; 0.2; 0.2])
How can I add the following constraints?
Every element of W is positive (W[i] > 0).
The values of W sum to 1 (sum(W[1:5]) == 1).
Optim.jl doesn't currently do constrained optimization. There is a PR to add this, but it's not merged quite yet. Check out JuMP for doing constrained optimization.

SQL function table interpolation

I have an SQL table of (x,y) values.
x y
0.0 0.0
0.1 0.4
0.5 1.0
5.0 2.0
6.0 4.0
8.0 4.0
10.0 5.0
The x column is indexed. I am using SQLite.
My ultimate goal is to get y(x) for any x value, using linear interpolation between table values, similar to what is shown in the plot below.
Is there a way to perform the linear interpolation directly using a select query?
Otherwise, getting the interval bounds that x falls between would be enough.
Is there a query that will give me the last smaller and the first bigger pair for a given x, so that I can compute the interpolated y(x) value?
For example if x=2.0 to get:
0.5 1.0
5.0 2.0
In case x is outside the table, I'd like to get the first/last two values, to perform an extrapolation.
For example if x=20.0 to get:
8.0 4.0
10.0 5.0
It would be hard to do this in plain SQLite without analytic functions. In more complex SQL engines, you could use the LAG and LEAD analytic functions to obtain the pairs you want easily enough.
In SQLite, though, I would create two cursors like these:
Cursor C1:
SELECT
x,y
FROM
table
WHERE
x>=2
ORDER BY
x asc
;
Cursor C2:
SELECT
x,y
FROM
table
WHERE
x<=2
ORDER BY
x desc
;
Then perform the rest of the operations in another language: fetch once from each cursor, or, if one cursor returns no row, fetch twice from the other. Some additional edge cases also need handling: what if your set has fewer than two values? And if the given x is already in your set, you do not need to interpolate at all. And so on.
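The two-cursor approach can be sketched in Python with the built-in sqlite3 module (the table and helper names are hypothetical), including the exact-hit and extrapolation cases:

```python
import sqlite3

# In-memory reproduction of the (x, y) table from the question
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pts (x REAL, y REAL)")
conn.executemany("INSERT INTO pts VALUES (?, ?)",
                 [(0.0, 0.0), (0.1, 0.4), (0.5, 1.0), (5.0, 2.0),
                  (6.0, 4.0), (8.0, 4.0), (10.0, 5.0)])

def y_at(conn, x):
    # Cursor C2: last point at or below x; cursor C1: first point at or above x
    lo = conn.execute("SELECT x, y FROM pts WHERE x <= ? ORDER BY x DESC LIMIT 1", (x,)).fetchone()
    hi = conn.execute("SELECT x, y FROM pts WHERE x >= ? ORDER BY x ASC LIMIT 1", (x,)).fetchone()
    if lo is None:        # x below the table: extrapolate from the two smallest points
        lo, hi = conn.execute("SELECT x, y FROM pts ORDER BY x ASC LIMIT 2").fetchall()
    elif hi is None:      # x above the table: extrapolate from the two largest points
        hi, lo = conn.execute("SELECT x, y FROM pts ORDER BY x DESC LIMIT 2").fetchall()
    if lo == hi:          # x is exactly in the table: no interpolation needed
        return lo[1]
    (x0, y0), (x1, y1) = lo, hi
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
```

For x = 2.0 this brackets with (0.5, 1.0) and (5.0, 2.0), as the question expects; for x = 20.0 it extrapolates from (8.0, 4.0) and (10.0, 5.0).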
I would go with a simple subtraction.
You are looking for the two nearest inputs, so:
SELECT x, y
FROM my_table
ORDER BY Abs(:val - x)
LIMIT 2
However, this will lead to a full table scan. Note also that the two nearest points may both lie on the same side of :val, in which case the result is an extrapolation rather than an interpolation.

find ranges to create Uniform histogram

I need to find ranges in order to create a uniform histogram, e.g. ages
into 4 ranges:
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
Is there a function that gives me the ranges so the histogram is uniform?
In this case:
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy; I'd like to know if there is an algorithm that deals with complex data sets as well.
Thanks.
You are looking for the quantiles that split up your data equally. This, combined with cut, should work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts at the ends.
Edit2: I don't quite understand the downvote. But this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
With some minor renaming necessary. For slightly better level names, you could also put in min(x) and max(x) instead of -Inf and Inf.
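For comparison, the same quantile-based binning can be sketched in Python with the standard library (the uniform_bins helper is my own, not from the answer):

```python
import statistics

# Python analogue of R's cut(x, c(-Inf, quantile(x, ...), Inf))
def uniform_bins(data, n):
    cuts = statistics.quantiles(data, n=n)          # n-1 interior cut points
    edges = [float("-inf")] + cuts + [float("inf")]
    groups = [[x for x in data if lo < x <= hi] for lo, hi in zip(edges, edges[1:])]
    return [(min(g), max(g)) for g in groups if g]  # (min, max) of each non-empty group
```

On the question's data_set with n = 4, this yields [(18, 24), (27, 29), (30, 33), (42, 46)], the ranges the question asks for.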