BigQuery: external UDFs with standard SQL - google-bigquery

Today I tried to write a UDF in standard SQL in the BigQuery web UI. I had already unchecked the 'Use Legacy SQL' option, but the query returned the following error message:
Not Implemented: You cannot use legacy SQL UDFs with standard SQL queries. See https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#differences_in_user-defined_javascript_functions
I then tried the external UDF example from the Google Cloud Platform documentation: https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions. It still returns the same error message. Here is the example:
CREATE TEMPORARY FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
return x*y;
""";
WITH numbers AS
(SELECT 1 AS x, 5 as y
UNION ALL
SELECT 2 AS x, 10 as y
UNION ALL
SELECT 3 as x, 15 as y)
SELECT x, y, multiplyInputs(x, y) as product
FROM numbers;
Question: How do I use an external UDF with standard SQL in the web UI?

Make sure not to enter the function definition in the "UDF Editor" panel. The CREATE TEMPORARY FUNCTION statement should go in the main query editor with the rest of your query. See the topic in the migration guide for an example:
#standardSQL
-- Computes the harmonic mean of the elements in 'arr'.
-- The harmonic mean of x_1, x_2, ..., x_n can be expressed as:
-- n / ((1 / x_1) + (1 / x_2) + ... + (1 / x_n))
CREATE TEMPORARY FUNCTION HarmonicMean(arr ARRAY<FLOAT64>)
RETURNS FLOAT64 LANGUAGE js AS """
var sum_of_reciprocals = 0;
for (var i = 0; i < arr.length; ++i) {
sum_of_reciprocals += 1 / arr[i];
}
return arr.length / sum_of_reciprocals;
""";
WITH T AS (
SELECT GENERATE_ARRAY(1.0, x * 4, x) AS arr
FROM UNNEST([1, 2, 3, 4, 5]) AS x
)
SELECT arr, HarmonicMean(arr) AS h_mean
FROM T;
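To sanity-check the numbers the UDF returns, the same computation can be sketched outside BigQuery. This is a minimal Python sketch, not BigQuery code: generate_array is a hypothetical helper that imitates GENERATE_ARRAY for a positive step, and harmonic_mean follows the formula in the SQL comment above.

```python
def generate_array(start, end, step):
    """Imitate BigQuery's GENERATE_ARRAY for a positive step (end is inclusive)."""
    values = []
    current = start
    while current <= end:
        values.append(current)
        current += step
    return values


def harmonic_mean(arr):
    """n / (1/x_1 + ... + 1/x_n), the same formula as the JavaScript UDF body."""
    sum_of_reciprocals = sum(1 / value for value in arr)
    return len(arr) / sum_of_reciprocals


# Same inputs as the WITH T AS (...) clause in the query.
for x in [1, 2, 3, 4, 5]:
    arr = generate_array(1.0, x * 4, x)
    print(arr, harmonic_mean(arr))
```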

Related

Intersection between two tables in DolphinDB server

I'm trying to use the intersection function in DolphinDB as follows:
n=1000000
ID=rand(100, n)
dates=2017.08.07..2017.08.11
date=rand(dates, n)
x=rand(10.0, n)
t=table(ID, date, x)
dbDate = database(, VALUE, 2017.08.07..2017.09.11)
dbID = database(, RANGE, 0 50 100)
db = database("dfs://compodb", COMPO, [dbDate, dbID])
pt = db.createPartitionedTable(t, `pt, `date`ID).append!(t)
dfsTable=loadTable("dfs://compodb","pt")
A = select * from dfsTable where date = 2017.08.07
B = select * from dfsTable where date = 2017.08.08
intersection(A[`x],B[`x])
But I am getting the error:
The both arguments for 'bitAnd'(&) must be integers
Apparently something doesn’t work in this query... any idea?
This document section says this about how to create a vector:
A vector from a table column. For example, trades.qty indicates column qty from table trades.
And it looks like intersection is an alias for &, which for vectors is treated as bitAnd, as said here:
Arguments
Set Operation: X and Y are sets.
Bit Operation: X and Y are equal sized vectors, or Y is a scalar.
So you need to convert each vector to a set with the set function, for example set(A[`x]), before passing it to intersection.
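The same kind of overloading exists in other languages, which may make the error easier to read. The sketch below is only an analogy in Python, not DolphinDB code: Python's & is a bitwise AND on integers, a set intersection on sets, and an error on floating-point values.

```python
a_values = [3, 5, 7, 9]
b_values = [5, 9, 11, 13]

# On integers, & is a bitwise AND applied to each pair of elements.
bitwise = [a & b for a, b in zip(a_values, b_values)]
print(bitwise)  # [1, 1, 3, 9]

# On sets, & is a set intersection.
common = set(a_values) & set(b_values)
print(sorted(common))  # [5, 9]

# On floats, & is not defined at all, much like the bitAnd error above.
x, y = 1.5, 2.5
try:
    x & y
    float_and_failed = False
except TypeError as error:
    float_and_failed = True
    print("TypeError:", error)
```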

Generate normally distributed series using BigQuery

Is there a way to generate a normally distributed series in BQ, ideally specifying the mean and sd of the distribution?
I found a way using the Marsaglia polar method, but it is not ideal: I do not want polar coordinates of the distribution. I want to generate an array that follows the specified parameters so that it is normally distributed.
Thank you in advance.
This query gives you the Euclidean coordinates of the normal distribution centred at 0. You can adjust the mean (mean variable), the spread (variance variable), and the x-axis values (GENERATE_ARRAY(beginning, end, step)):
CREATE TEMPORARY FUNCTION normal(x FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
var mean=0;
var variance=1;
var x0=1/(Math.sqrt(2*Math.PI*variance));
var x1=-Math.pow(x-mean,2)/(2*variance);
return x0*Math.pow(Math.E,x1);
""";
WITH numbers AS
(SELECT x FROM UNNEST(GENERATE_ARRAY(-10, 10,0.5)) AS x)
SELECT x, normal(x) as normal
FROM numbers;
To do that, I used "User Defined Functions" [1]. They are used when you want to wrap another SQL expression or when you want to use JavaScript (as I did).
NOTE: I used the probability density function of the normal distribution, if you want to use another you'd need to change variables x0,x1 and the return (I wrote them separately so it's clearer).
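For reference, the density that this kind of UDF evaluates can be sketched in Python. This is a minimal sketch under the assumption that the spread is given as a standard deviation sd (a parameter name chosen here, not taken from the answer); the variance is then sd squared and appears in both the coefficient and the exponent.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean=0.0, sd=1.0):
    """Probability density of N(mean, sd^2) evaluated at x."""
    variance = sd ** 2
    coefficient = 1 / math.sqrt(2 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2 * variance)
    return coefficient * math.exp(exponent)


# Same x grid as GENERATE_ARRAY(-10, 10, 0.5) in the query.
grid = [-10 + 0.5 * i for i in range(41)]
for x in grid:
    # The standard library's NormalDist serves as a cross-check.
    print(x, normal_pdf(x), NormalDist(0.0, 1.0).pdf(x))
```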
Earlier answers give the probability density function of a normal random variable. Here I modify previous answers to give a random number generated with the desired distribution, in BQ standard SQL, using the 'polar coordinates' method (the expression below is the Box-Muller transform). The question asks not to use polar coordinates, which is an odd request, since polar coordinates are not used in the generation of the normally distributed random number.
CREATE TEMPORARY FUNCTION rnorm ( mu FLOAT64, sigma FLOAT64 ) AS
(
(select mu + sigma*(sqrt( 2*abs(
log( RAND())
)
)
)*cos( 2*ACOS(-1)*RAND())
)
)
;
select
num ,
rnorm(-1, 5.3) as RAND_NORM
FROM UNNEST(GENERATE_ARRAY(1, 17) ) AS num
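The same transform can be sketched in Python to check that the samples have the requested mean and sd. This is a rough sketch, not the answer's code: the function name mirrors rnorm, the rng parameter is an addition for reproducibility, and the first uniform draw is shifted into (0, 1] so that log(0) cannot occur.

```python
import math
import random
import statistics


def rnorm(mu, sigma, rng=random):
    """One normal draw via the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # in (0, 1], so math.log never sees 0
    u2 = rng.random()        # in [0, 1)
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return mu + sigma * radius * math.cos(angle)


rng = random.Random(42)  # fixed seed, only to make the run repeatable
samples = [rnorm(-1, 5.3, rng) for _ in range(100_000)]
print(statistics.mean(samples), statistics.stdev(samples))
```

With this many draws, the printed mean and standard deviation should land close to -1 and 5.3.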
The easiest way to do it in BQ is by creating a custom function:
CREATE OR REPLACE FUNCTION
`your_project.functions.normal_distribution_pdf`
(x ANY TYPE, mu ANY TYPE, sigma ANY TYPE) AS (
(
SELECT
safe_divide(1,sigma * power(2 * ACOS(-1),0.5)) * exp(-0.5 * power(safe_divide(x-mu,sigma),2))
)
);
Next you only need to apply the function:
with inputs as (
SELECT 1 as x, 0 as mu, 1 as sigma
union all
SELECT 1.5 as x, 1 as mu, 2 as sigma
union all
SELECT 2 as x , 2 as mu, 3 as sigma
)
SELECT x,
`your_project.functions.normal_distribution_pdf`(x, mu, sigma) as normal_pdf
from
inputs

Solving Least Squares with orthogonality constraint using Matlab

I need to solve the following Least Squares Problem where A and B and X are all matrices:
cvx_begin quiet;
variable X(len_x) nonnegative;
minimize ( norm(X * A - B , 2));
subject to
X >= 0;
for i=1: size(X,2)
for j= i + 1: size(X,2)
transpose(X(:,i)) * X(:,j) <= epsilon
end
end
cvx_end
I chose CVX because it doesn't require me to transform the problem into standard form. However, with CVX I get the following error:
Error using cvx/quad_form (line 230)
The second argument must be positive or negative semidefinite.
Error in * (line 261)
[ z2, success ] = quad_form( xx, P, Q, R );
Error in sanaz_opt (line 28)
transpose(X(:,i)) * X(:,j) <= 0.1
How can I solve this problem? I have also tried Gurobi and the least-squares functions in Matlab, but it seems they can't handle the transpose(X(:,i)) * X(:,j) constraint.

Conditional Graphing Plot?

I am trying to graph two functions: I want to plot one function when a condition is met and a different function when another condition is met.
A simple example would be:
if x > 0
then sin(x)
else cos(x)
It would then graph cos and sin depending on the x value, there being an obvious gap at x = 0, as cos(0) = 1 and sin(0) = 0.
EDIT: There is a built-in way. I'll leave my original answer below for posterity, but try using the piecewise() function:
plot(piecewise(((cos(x),x<0), (sin(x), 0<x))))
See it here.
I would guess that there's a built-in way to do this, but I don't know it. You can multiply your functions by the Heaviside Step Function to accomplish this task. The step function is 1 if x > 0 and 0 if x < 0, so multiplying this into your functions and then summing them together will select only one of them based on the sign of x, that is to say:
f(x) := heaviside(x) * sin(x) + heaviside(-x) * cos(x)
If x > 0, heaviside(x) = 1 and heaviside(-x) = 0, so f(x) = sin(x).
If x < 0, heaviside(x) = 0 and heaviside(-x) = 1, so f(x) = cos(x).
See it in action here. In general, note that if you want the transition to be at x = a, then you could do heaviside(x-a) and heaviside(-x+a), respectively. If you want N functions, you'll have to have (N-1) multiplied step functions on each term, each with their own (x-a_i) argument. I hope someone else can contribute a cleaner solution.
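The step-function trick can also be sketched in Python. This is a minimal sketch: heaviside and f are written here for illustration, the transition point a is the parameter described above, and the value 0.5 at exactly x = a is a common convention chosen here as an assumption.

```python
import math


def heaviside(x):
    """Step function: 1 for x > 0, 0 for x < 0 (0.5 at x == 0 by convention)."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return 0.5


def f(x, a=0.0):
    """sin(x) to the right of a, cos(x) to the left of a."""
    return heaviside(x - a) * math.sin(x) + heaviside(a - x) * math.cos(x)


for x in [-2.0, -1.0, -0.1, 0.1, 1.0, 2.0]:
    print(x, f(x))
```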

Implementing Wilson Score in SQL

We have a relatively small table that we would like to sort based on rating, using the Wilson interval or a reasonable equivalent. I'm a reasonably smart guy, but my math fu is nowhere near strong enough to understand this:
The above formula, I am told, calculates a score for a positive/negative (thumbs up/thumbs down) voting system. I've never taken a statistics course, and it's been 15 years since I've done any sort of advanced mathematics. I don't have a clue what the little hat that the p is wearing means, or what the backwards Jesus fish beneath z indicates.
I would like to know two things:
Can this formula be altered to accommodate a 5-star rating system? I found this, but the author expresses his doubts as to the accuracy of his formula.
How can this formula be expressed in a SQL function? Note that I do not need to calculate and sort in real-time. The score can be calculated and cached daily.
Am I overlooking something built-in to Microsoft SQL Server?
Instead of trying to adapt Wilson's algorithm to a 5-star rating system, why don't you look into a different algorithm? This is what IMDb uses for their top 250: Bayesian Estimate
As for explaining the math in Wilson's algorithm, the code below was posted at the link in your question. It is written in Ruby.
require 'statistics2'
def ci_lower_bound(pos, n, power)
if n == 0
return 0
end
z = Statistics2.pnormaldist(1-power/2)
phat = 1.0*pos/n
(phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end
If you'd like another example, here is one in PHP:
http://www.derivante.com/2009/09/01/php-content-rating-confidence/
Edit: It seems that derivante.com is no longer around. You can see the original article on archive.org - https://web.archive.org/web/20121018032822/http://derivante.com/2009/09/01/php-content-rating-confidence/ and I've added the code from the article below.
class Rating
{
public static function ratingAverage($positive, $total, $power = '0.05')
{
if ($total == 0)
return 0;
$z = Rating::pnormaldist(1-$power/2,0,1);
$p = 1.0 * $positive / $total;
$s = ($p + $z*$z/(2*$total) - $z * sqrt(($p*(1-$p)+$z*$z/(4*$total))/$total))/(1+$z*$z/$total);
return $s;
}
public static function pnormaldist($qn)
{
$b = array(
1.570796288, 0.03706987906, -0.8364353589e-3,
-0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5,
-0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8,
0.3657763036e-10, 0.6936233982e-12);
if ($qn < 0.0 || 1.0 < $qn)
return 0.0;
if ($qn == 0.5)
return 0.0;
$w1 = $qn;
if ($qn > 0.5)
$w1 = 1.0 - $w1;
$w3 = - log(4.0 * $w1 * (1.0 - $w1));
$w1 = $b[0];
for ($i = 1;$i <= 10; $i++)
$w1 += $b[$i] * pow($w3,$i);
if ($qn > 0.5)
return sqrt($w1 * $w3);
return - sqrt($w1 * $w3);
}
}
As for doing this in SQL, SQL already has all of these math functions in its library. If I were you, I'd do this in your application, though. Have your application update the database every so often (hours? days?) instead of computing this on the fly, or your application will become very slow.
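If the application side happens to be Python, the lower bound computed by the Ruby and PHP versions above can be sketched with the standard library alone. This is a sketch under assumptions: it keeps the Ruby parameter names, and it uses statistics.NormalDist().inv_cdf (Python 3.8+) in place of pnormaldist, which the PHP code approximates by hand.

```python
import math
from statistics import NormalDist


def ci_lower_bound(pos, n, power=0.05):
    """Lower bound of the Wilson score interval for pos successes out of n."""
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - power / 2)  # about 1.96 for power = 0.05
    phat = pos / n
    return (
        phat + z * z / (2 * n)
        - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    ) / (1 + z * z / n)


print(ci_lower_bound(90, 100))  # many votes, mostly positive
print(ci_lower_bound(9, 10))    # same ratio, fewer votes, lower bound
```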
Regarding your first question (adjusting the formula to the 5-stars system) I would agree with Paul Creasey.
conversion formula: [3 +/- i stars -> i up/down-votes] (3 stars -> 0)
example: 4 stars -> +1 up-vote, 5 stars -> +2, 1 -> -2 and so on.
I would note, though, that instead of the lower bound of the interval that both the Ruby and PHP functions compute, I would just compute the much simpler Wilson midpoint:
(x + (z^2)/2) / (n + z^2)
where:
n = Sum(up_votes) + Sum(|down_votes|)
x = number of positive votes = Sum(up_votes)
z = 1.96 (fixed value)
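The star conversion and the midpoint can be sketched together in Python. This is a sketch under assumptions: star_to_votes and wilson_midpoint are names invented here, and x is taken to be the count of up-votes, since the formula needs a count (not a proportion) in the numerator.

```python
def star_to_votes(stars):
    """Map a 1-5 star rating to (up_votes, down_votes): 3 +/- i stars -> i votes."""
    delta = stars - 3
    return (max(delta, 0), max(-delta, 0))


def wilson_midpoint(up_votes, down_votes, z=1.96):
    """Centre of the Wilson interval: (x + z^2/2) / (n + z^2)."""
    n = up_votes + down_votes
    return (up_votes + z * z / 2) / (n + z * z)


ratings = [5, 4, 4, 3, 1]  # made-up sample ratings
votes = [star_to_votes(stars) for stars in ratings]
up = sum(u for u, _ in votes)
down = sum(d for _, d in votes)
print(up, down, wilson_midpoint(up, down))
```

With no votes at all the midpoint is 0.5, so unrated items start in the middle of the ranking rather than at either extreme.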
Taking William's link to the PHP solution http://www.derivante.com/2009/09/01/php-content-rating-confidence/ and making your system simply positive and negative (5 stars could be 2 positive votes, 1 star could be 2 negative votes, perhaps), it would be fairly easy to convert it to T-SQL, but you'd be much better off doing it in the server-side logic.
The author of the first link recently added an SQL implementation to his post.
Here it is:
SELECT widget_id, ((positive + 1.9208) / (positive + negative) -
1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
(positive + negative)) / (1 + 3.8416 / (positive + negative))
AS ci_lower_bound FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
Whether this can be adapted to a 5-star rating system is beyond me too.
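The constants in that query are just z = 1.96 folded into the formula (1.9208 = z^2/2, 0.9604 = z^2/4, 3.8416 = z^2). A small Python sketch can confirm that the SQL expression matches the general lower bound; both function names are invented here for the comparison.

```python
import math


def ci_lower_bound_sql(positive, negative):
    """Line-by-line mirror of the SQL expression, with z = 1.96 hard-coded."""
    n = positive + negative
    return (
        (positive + 1.9208) / n
        - 1.96 * math.sqrt((positive * negative) / n + 0.9604) / n
    ) / (1 + 3.8416 / n)


def ci_lower_bound_general(positive, negative, z=1.96):
    """The same bound written in terms of phat and z."""
    n = positive + negative
    phat = positive / n
    return (
        phat + z * z / (2 * n)
        - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    ) / (1 + z * z / n)


for positive, negative in [(1, 0), (0, 1), (29, 221), (90, 10)]:
    print(positive, negative,
          ci_lower_bound_sql(positive, negative),
          ci_lower_bound_general(positive, negative))
```

One caveat when running the query on SQL Server: if positive and negative are integer columns, (positive * negative) / (positive + negative) is evaluated with integer division, so casting to a float or decimal type first is likely needed to reproduce these values.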
I have uploaded an Oracle PL/SQL implementation to https://github.com/mattgrogan/stats_wilson_score
create or replace function stats_wilson_score(
/*****************************************************************************************************************
Author : Matthew Grogan
Website : https://github.com/mattgrogan
Name : stats_wilson_score.sql
Description : Oracle PL/SQL function to return the Wilson Score Interval for the given proportion.
Citation : Wilson E.B. J Am Stat Assoc 1927, 22, 209-212
Example:
select
round(29 / 250, 4) point_estimate,
stats_wilson_score(29, 250, 0.10, 'LCL') lcl,
stats_wilson_score(29, 250, 0.10, 'UCL') ucl
from dual;
******************************************************************************************************************/
x integer, -- Number of successes
m integer, -- Number of trials
alpha number default 0.05, -- Probability of a Type I error (supported values: 0.10, 0.05, 0.01)
return_value varchar2 default 'LCL' -- LCL = Lower control limit, UCL = upper control limit
)
return number is
z float(10);
phat float(10) := 0.0;
lcl float(10) := 0.0;
ucl float(10) := 0.0;
begin
if m = 0 then
return(0);
end if;
case alpha
when 0.10 then z := 1.644854;
when 0.05 then z := 1.959964;
when 0.01 then z := 2.575829;
else return(null); -- No Z value for this alpha
end case;
phat := x/m;
lcl := (phat + z*z/(2*m) - z * sqrt( (phat * (1-phat) ) / m + z * z / (4 * (m * m)) ) ) / (1 + z * z / m);
ucl := (phat + z*z/(2*m) + z * sqrt((phat*(1-phat)+z*z/(4*m))/m))/(1+z*z/m);
case return_value
when 'LCL' then return(lcl);
when 'UCL' then return(ucl);
else return(null);
end case;
end;
/
grant execute on stats_wilson_score to public;
The Wilson score is actually not a very good way of sorting items by rating. It's certainly better than just sorting by mean review score, but it still has a lot of problems. For example, an item with 1 negative review (whose quality is still very uncertain) will be sorted below an item with 10 negative reviews and 1 positive review (which we can be fairly certain is bad quality).
I would recommend using an adaptation of the SteamDB rating formula instead (by Reddit user /u/tornmandate). In addition to being better suited to this sort of thing than the Wilson score (for reasons that are explained in the linked article), it can also be adapted to a 5-star rating system much more easily than Wilson.
Original SteamDB formula:
\( \text{Total Reviews} = \text{Positive Reviews} + \text{Negative Reviews} \)
\( \text{Review Score} = \frac{\text{Positive Reviews}}{\text{Total Reviews}} \)
\( \text{Rating} = \text{Review Score} - (\text{Review Score} - 0.5) \cdot 2^{-\log_{10}(\text{Total Reviews} + 1)} \)
5-star version (note the change from 0.5 (a 50% score with up/down votes) to 2.5 (a 50% score with 5-star ratings)):
\( \text{Total Reviews} = \text{total count of all reviews} \)
\( \text{Review Score} = \text{mean star rating of all reviews} \)
\( \text{Rating} = \text{Review Score} - (\text{Review Score} - 2.5) \cdot 2^{-\log_{10}(\text{Total Reviews} + 1)} \)
The formula is also much more understandable by non-mathematicians and easy to translate into code.
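As a rough illustration of how little code the formula needs, here is a Python sketch of both variants. The function names are made up for this example, and returning the neutral score for an item with no reviews is an assumption, since the formula leaves the review score undefined in that case.

```python
import math


def steamdb_rating(positive, negative):
    """Up/down-vote version: pulls the score toward 0.5 when reviews are few."""
    total = positive + negative
    if total == 0:
        return 0.5  # assumption: unreviewed items get the neutral score
    score = positive / total
    return score - (score - 0.5) * 2 ** (-math.log10(total + 1))


def five_star_rating(star_ratings, neutral=2.5):
    """5-star version: same shape, with the neutral point moved to 2.5."""
    total = len(star_ratings)
    if total == 0:
        return neutral  # assumption, as above
    score = sum(star_ratings) / total
    return score - (score - neutral) * 2 ** (-math.log10(total + 1))


print(steamdb_rating(0, 1))   # one negative review: still close to neutral
print(steamdb_rating(1, 10))  # mostly negative with more evidence: ranked lower
print(five_star_rating([5] * 9))
```

The first two calls correspond to the ordering problem described above: the item with a single negative review ends up above the item with 10 negative reviews and 1 positive one.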