Factorial in Google BigQuery - sql

I need to compute the factorial of a variable in Google BigQuery - is there a function for this? I cannot find one in the documentation here:
https://cloud.google.com/bigquery/query-reference#arithmeticoperators
My proposed solution at this point is to compute the factorial for numbers 1 through 100 and upload that as a table and join with that table. If you have something better, please advise.
As context may reveal a best solution, the factorial is used in the context of computing a Poisson probability of a random variable (number of events in a window of time). See the first equation here: https://en.wikipedia.org/wiki/Poisson_distribution

Try below. Quick & dirty example
select number, factorial
FROM js(
// input table
(select number from
(select 4 as number),
(select 6 as number),
(select 12 as number)
),
// input columns
number,
// output schema
"[{name: 'number', type: 'integer'},
{name: 'factorial', type: 'integer'}]",
// function
"function(r, emit){
function fact(num)
{
if(num<0)
return 0;
var fact=1;
for(var i=num;i>1;i--)
fact*=i;
return fact;
}
var factorial = fact(r.number)
emit({number: r.number, factorial: factorial});
}"
)

If the direct approach works for the values you need the Poisson distribution calculated on, then cool. If you reach the point where it blows up or gives you inaccurate results, then read on for numerical analysis fun times.
In general you'll get better range and numerical stability if you do the arithmetic on the logarithms, and then exp() as the final operation.
You want: c^k / k! exp(-c).
Compute its log, ln( c^k / k! exp(-c) ),
i.e. k ln(c) - ln(k!) - c
Take exp() of that.
Well but how do we get ln(k!) without computing k!? There's a function called the gamma function, whose practical point here is that its logarithm gammaln() can be approximated directly, and ln(k!) = gammaln(k+1).
There is a Javascript gammaln() in Phil Mainwaring's answer here, which I have not tested, but assuming it works it should fit into a UDF for you.
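To make the log-domain recipe above concrete, here is a minimal sketch in Python rather than in a BigQuery UDF. It assumes the standard library's math.lgamma as the gammaln() implementation; the function name poisson_pmf is just an illustrative choice.

```python
import math

def poisson_pmf(k, c):
    """P(X = k) for a Poisson variable with mean c, computed via logarithms.

    Uses ln(k!) = lgamma(k + 1), so k! itself is never formed.
    """
    if k < 0 or c <= 0:
        raise ValueError("need k >= 0 and c > 0")
    log_p = k * math.log(c) - math.lgamma(k + 1) - c
    return math.exp(log_p)

# Small case, where the direct formula c^k / k! * exp(-c) is also safe.
print(poisson_pmf(3, 2.0))
# Large case: 1000! does not fit in a double, but the log form is fine.
print(poisson_pmf(1000, 1000.0))
```

The same three-term expression should carry over to a JavaScript UDF once a gammaln() function is available there.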

Extending Mikhail's answer to be general and correct for computing the factorial for all numbers 1 to n: because the result is emitted as a float, this holds for n up to 170 (171! overflows a double-precision float), and it can be computed efficiently:
select number, factorial
FROM js(
// input table
(
SELECT
ROW_NUMBER() OVER() AS number,
some_thing_from_the_table
FROM
[any table with at least LIMIT many entries]
LIMIT
100 #Change this to any number to compute factorials from 1 to this number
),
// input columns
number,
// output schema
"[{name: 'number', type: 'integer'},
{name: 'factorial', type: 'float'}]",
// function
"function(r, emit){
function fact(num)
{
if(num<0)
return 0;
var fact=1;
for(var i=num;i>1;i--)
fact*=i;
return fact;
}
// Use toExponential and parseFloat to handle large integers in both Javascript and BigQuery
emit({number: r.number, factorial: parseFloat(fact(r.number).toExponential())});
}"
)
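A quick way to see how far a float-valued factorial can go: the sketch below repeats the same multiply loop with Python floats, which are IEEE double precision like JavaScript numbers, and reports the last n whose factorial is still finite. The function name is made up for this illustration.

```python
import math

def largest_finite_float_factorial():
    """Return the largest n for which n! is still a finite double."""
    value = 1.0
    n = 0
    while True:
        nxt = value * (n + 1)
        if math.isinf(nxt):  # the product overflowed to infinity
            return n
        value = nxt
        n += 1

print(largest_finite_float_factorial())
```

Past that point the loop in the UDF would yield Infinity rather than a usable number, so for larger arguments working with logarithms is the safer route.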

You can get up to 27! using a SQL UDF. Above that value, the NUMERIC type overflows.
CREATE OR REPLACE FUNCTION factorial(integer_expr INT64) AS ( (
SELECT
ARRAY<numeric>[
1,
1,
2,
6,
24,
120,
720,
5040,
40320,
362880,
3628800,
39916800,
479001600,
6227020800,
87178291200,
1307674368000,
20922789888000,
355687428096000,
6402373705728000,
121645100408832000,
2432902008176640000,
51090942171709440000.,
1124000727777607680000.,
25852016738884976640000.,
620448401733239439360000.,
15511210043330985984000000.,
403291461126605635584000000.,
10888869450418352160768000000.][
OFFSET
(integer_expr)] AS val ) );
select factorial(10);
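The 27! cutoff can be checked offline. The sketch below assumes BigQuery's documented NUMERIC range (precision 38, scale 9, so 29 digits before the decimal point) and finds the largest factorial that fits; the names NUMERIC_MAX and largest_factorial_in_numeric are only for this illustration.

```python
import math
from decimal import Decimal

# Assumed upper bound of BigQuery NUMERIC: 29 integer digits, 9 fractional.
NUMERIC_MAX = Decimal("99999999999999999999999999999.999999999")

def largest_factorial_in_numeric():
    """Return the largest n such that n! does not exceed NUMERIC_MAX."""
    n = 0
    while Decimal(math.factorial(n + 1)) <= NUMERIC_MAX:
        n += 1
    return n

print(largest_factorial_in_numeric())
print(math.factorial(27))
```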

Add array of other records from the same table to each record

My project is a Latin language learning app. My DB has all the words I'm teaching, in the table 'words'. It has the lemma (the main form of the word), along with the definition and other information the user needs to learn.
I show one word at a time for them to guess/remember what it means. The correct word is shown along with some wrong words, like:
What does Romanus mean? Greek - /Roman/ - Phoenician - barbarian
What does domus mean? /house/ - horse - wall - senator
The wrong options are randomly drawn from the same table, and must be from the same part of speech (adjective, noun...) as the correct word; but I am only interested in their lemma. My return value looks like this (some properties omitted):
[
{ lemma: 'Romanus', definition: 'Roman', options: ['Greek', 'Phoenician', 'barbarian'] },
{ lemma: 'domus', definition: 'house', options: ['horse', 'wall', 'senator'] }
]
What I am looking for is a more efficient way of doing it than my current approach, which runs a new query for each word:
// All the necessary requires are here
class Word extends Model {
static async fetch() {
const words = await this.findAll({
limit: 10,
order: [Sequelize.literal('RANDOM()')],
attributes: ['lemma', 'definition'], // also a few other columns I need
});
const wordsWithOptions = await Promise.all(words.map(this.addOptions.bind(this)));
return wordsWithOptions;
}
static async addOptions(word) {
const options = await this.findAll({
order: [Sequelize.literal('RANDOM()')],
limit: 3,
attributes: ['lemma'],
where: {
partOfSpeech: word.dataValues.partOfSpeech,
lemma: { [Op.not]: word.dataValues.lemma },
},
});
return { ...word.dataValues, options: options.map((row) => row.dataValues.lemma) };
}
}
So, is there a way I can do this with raw SQL? How about Sequelize? One thing that still helps me is to give a name to what I'm trying to do, so that I can Google it.
EDIT: I have tried the following and at least got somewhere:
const words = await this.findAll({
limit: 10,
order: [Sequelize.literal('RANDOM()')],
attributes: {
include: [[sequelize.literal(`(
SELECT lemma FROM words AS options
WHERE "partOfSpeech" = "options"."partOfSpeech"
ORDER BY RANDOM() LIMIT 1
)`), 'options']],
},
});
Now, there are two problems with this. First, I only get one option, when I need three; but if the query has LIMIT 3, I get: SequelizeDatabaseError: more than one row returned by a subquery used as an expression.
The second error is that while the code above does return something, it always gives the same word as an option! I thought to remedy that with WHERE "partOfSpeech" = "options"."partOfSpeech", but then I get SequelizeDatabaseError: invalid reference to FROM-clause entry for table "words".
So, how do I tell PostgreSQL "for each row in the result, add a column with an array of three lemmas, WHERE existingRow.partOfSpeech = wordToGoInTheArray.partOfSpeech?"
Revised
Well that seems like a different question and perhaps should be posted that way, but...
The main technique remains the same: JOIN instead of sub-select. The difference is generating the list of lemmas and then piping them into the initial query. In a single statement this can get nasty.
As a single statement (actually this turned out not to be too bad):
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma in( select lemma
from words
order by random()
limit 4 --<<< replace with parameter
)
group by w.lemma, w.definition;
The other approach builds a small SQL function to randomly select a specified number of lemmas. This selection is then piped into the (renamed) function from the previous fiddle.
create or replace
function exam_lemma_definition_options(lemma_array_in text[])
returns table (lemma text
,definition text
,option text[]
)
language sql strict
as $$
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(lemma_array_in)
group by w.lemma, w.definition;
$$;
create or replace
function exam_lemmas(num_of_lemmas integer)
returns text[]
language sql
strict
as $$
select string_to_array(string_agg(lemma,','),',')
from (select lemma
from words
order by random()
limit num_of_lemmas
) ll
$$;
Using this approach, your calling code reduces to a single SQL statement:
select *
from exam_lemma_definition_options(exam_lemmas(4))
order by lemma;
This permits you to specify the numbers of lemmas to select (in this case 4) limited only by the number of rows in Words table. See revised fiddle.
Original
Instead of using a sub-select to get the option words just JOIN.
select w.lemma, w.definition, string_to_array(string_agg(o.definition,','), ',') as options
from words w
join lateral
(select definition
from words o
where o.part_of_speech = w.part_of_speech
and o.lemma != w.lemma
order by random()
limit 3
) o on 1=1
where w.lemma = any(array['Romanus', 'domus'])
group by w.lemma, w.definition;
See fiddle. Obviously this will not necessarily produce the same options as your question provides, due to the random() selection. But it will get matching parts of speech. I will leave translation to your source language to you; or you can use the function option and reduce your SQL to a simple "select *".
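For readers who want to see what the lateral join computes, here is a rough in-memory equivalent in Python. It is only a sketch of the semantics, not a replacement for the query: the rows, the add_options name and the dict keys are made up for the illustration.

```python
import random

def add_options(words, chosen_lemmas, n_options=3, rng=random):
    """For each chosen word, attach up to n_options random definitions taken
    from other words with the same part of speech (the role of the LATERAL subquery)."""
    result = []
    for w in words:
        if w["lemma"] not in chosen_lemmas:
            continue
        pool = [o["definition"] for o in words
                if o["part_of_speech"] == w["part_of_speech"]
                and o["lemma"] != w["lemma"]]
        if not pool:
            continue  # an inner join drops words that have no candidates
        options = rng.sample(pool, min(n_options, len(pool)))
        result.append({"lemma": w["lemma"],
                       "definition": w["definition"],
                       "options": options})
    return result

# Hypothetical sample rows, not taken from the real table.
words = [
    {"lemma": "domus", "definition": "house", "part_of_speech": "noun"},
    {"lemma": "equus", "definition": "horse", "part_of_speech": "noun"},
    {"lemma": "murus", "definition": "wall", "part_of_speech": "noun"},
    {"lemma": "senator", "definition": "senator", "part_of_speech": "noun"},
    {"lemma": "Romanus", "definition": "Roman", "part_of_speech": "adjective"},
    {"lemma": "Graecus", "definition": "Greek", "part_of_speech": "adjective"},
]
print(add_options(words, {"domus", "Romanus"}))
```

The point of doing it in SQL instead is that the whole thing is one round trip to the database rather than one query per word.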

Matlab: Explicit or named association of `splitapply` arguments with table VariableNames

I've been biting the bullet on the extra code and bookkeeping needed in Matlab to perform common SQL operations. Here is an example of a typical SQL code pattern for generating metrics that summarize a data table tDat:
SELECT vGrouping, MEAN( x - y ) AS rollup1, VAR(y+z) AS rollup2
INTO tRollups FROM tDat GROUP BY vGrouping
My SQL's a bit rusty, but the general idea should be clear to SQLers. Here is the Matlab equivalent:
% Create test data
tDat = array2table( floor(10*rand(5,3)) , ...
'VariableNames',{'x','y','z'} );
tDat.vGrouping = ( rand(5,1) > 0.5 )
% Calculate summary metrics for each group of data
[vGroup,grps] = findgroups(tDat.vGrouping)
fRollup = @(a,b,c)[ mean(a-b) var(b+c) ] % Calculates summary metric
rollups = splitapply( fRollup, tDat(:,{'x','y','z'}), vGroup )
% Code pattern 1 to assemble results
tRollups = [ array2table( grps , 'VariableNames',{'group'} ) ...
array2table( rollups , ...
'VariableNames',{'rollup1','rollup2'} ) ]
% Code pattern 2 to assemble results
tRollups = array2table( [grps rollups], ...
'VariableNames',{'group','rollup1','rollup2'} )
It's not a fair comparison because the Matlab code contains the data setup, as well as two possible code patterns for assembling the summary metrics. Furthermore, I've added comments -- not to make the Matlab code more voluminous, but because it is so much busier that some cognitive signposts are needed to aid in the reading.
Code volume aside, however, one of the things that bugs me is that the rollup expressions in fRollup are not explicitly associated with the names of the input or output data columns. The arguments are dummy arguments, and the actual input data columns from tDat are specified in the splitapply invocation. The association with the fRollup arguments is positional, so field/variable names themselves aren't able to enforce correct association. Likewise, the output columns in tRollups are specified in the array2table invocation, again positionally associated with the fRollup output.
This makes the rather simple relationships in the SQL statement very difficult to see in the Matlab code. Is there an alternate pattern or design idiom that doesn't have this drawback, but hopefully, doesn't incur much in the way of other drawbacks?
AFTERNOTE: For some reason, even though the following doesn't solve the named/explicit association of splitapply input/output arguments with actual input/output variables, I still find it easier to see the relationships. The code definitely looks less noisy. The key is that the function fRollup for generating summary metrics on the data now returns multiple outputs instead of bundling them into a single arrayed output. This allows me to explicitly name properties of scalar struct ssRollups as the target of the assignment. I don't need to do all sorts of conversions to tables, with the extra code to designate VariableNames, just to concatenate the results with the identified groups. Instead, the group identities start off as just another property grps in the same struct (ssRollups) as the splitapply results -- in fact, it is the first property that brings the struct into existence.
% File tmp.m
%-----------
function tmp
% Create test data
tDat = array2table( floor(10*rand(5,3)) , ...
'VariableNames',{'x','y','z'} );
tDat.vGrouping = ( rand(5,1) > 0.5 )
% Find the groups
[ vGroup, ssRollups.grps ] = findgroups(tDat.vGrouping)
% Calculate summary metrics for each group of data
[ ssRollups.rollup1 ssRollups.rollup2 ] = ...
splitapply( @fRollup, tDat(:,{'x','y','z'}), vGroup );
% Display using nice table formatting
struct2table( ssRollups )
end % function tmp
function [rollup1 rollup2] = fRollup(a,b,c)
rollup1 = mean(a-b);
rollup2 = var(b+c);
end % function fRollup
As a multiple-output function, however, fRollup seems better suited to a non-anonymous function. To me, it actually seems to document the multiple outputs better, despite the less compact code. It may be just one of those situations where more compactness is less readable, causing the data relationships to be harder to see. However, it does require that the entire passage of code be made into a function (tmp in this case), unless you don't mind breaking out fRollup into its own function and m-file. I prefer not to litter my file system with such tiny snippet functions meant to be used in one place.
This "answer" doesn't directly deal with explicit named association between the actual input/output variables and the arguments of the function handle supplied to splitapply. However, it significantly simplifies the code in the initial example, hopefully making it clearer to see the relationship between the function arguments and the input/output variables. This solution was initially included in the AFTERNOTE in the question. Since better answers don't appear to be forthcoming soon, I've decided to hive it out as the answer. It uses deal to implement an anonymous multiple-output function for splitapply to use on the data groups that are delineated by its grouping argument.
% Create test data
tDat = array2table( floor(10*rand(5,3)) , ...
'VariableNames',{'x','y','z'} );
tDat.vGrouping = ( rand(5,1) > 0.5 )
% Find the groups
[ vGroup, ssRollups.grps ] = findgroups(tDat.vGrouping)
% Calculate summary metrics for each group of data
fRollup = @(a,b,c) deal( mean(a-b), var(b+c) )
[ ssRollups.rollup1 ssRollups.rollup2 ] = ...
splitapply( fRollup, tDat(:,{'x','y','z'}), vGroup );
% Display using nice table formatting
struct2table( ssRollups )
Until a better solution emerges, this approach will be my go-to idiom for splitapply.
Here is a variation that uses a table variable for the output of splitapply. This might be more convenient when using multiple grouping variables because findgroups will pass the grouping variable names to the output variable tRollups on the LHS:
% Create test data
tDat = array2table( floor(10*rand(8,3)) , ...
'VariableNames',{'x','y','z'} );
tDat = [ tDat ...
array2table( rand(8,2)>0.5 , ...
'VariableNames',{'vGrpng1','vGrpng2'} ) ];
% Find the groups
[ vGroup, tRollups ] = findgroups(tDat(:,{'vGrpng1','vGrpng2'}));
% Calculate summary metrics for each group of data
fRollup = @(a,b,c) deal( mean(a-b), var(b+c) )
[ tRollups.rollup1 tRollups.rollup2 ] = ...
splitapply( fRollup, tDat(:,{'x','y','z'}), vGroup );
tRollups
And here is a version that uses multiple grouping variables and uses a scalar struct instead of a table for the outputs of findgroups and splitapply:
% Create test data
tDat = array2table( floor(10*rand(8,3)) , ...
'VariableNames',{'x','y','z'} );
tDat.vGrpng1 = rand(8,1)>0.5 ;
tDat.vGrpng2 = rand(8,1)>0.5
% Find the groups
[ vGroup, ssRollups.vGrpng1, ssRollups.vGrpng2 ] = ...
findgroups( tDat.vGrpng1, tDat.vGrpng2 );
% Calculate summary metrics for each group of data
fRollup = @(a,b,c) deal( mean(a-b), var(b+c) )
[ ssRollups.rollup1 ssRollups.rollup2 ] = ...
splitapply( fRollup, tDat(:,{'x','y','z'}), vGroup );
% Display using nice table formatting
struct2table( ssRollups )

Extracting Values from Array in Redshift SQL

I have some arrays stored in Redshift table "transactions" in the following format:
id, total, breakdown
1, 100, [50,50]
2, 200, [150,50]
3, 125, [15, 110]
...
n, 10000, [100,900]
Since this format is useless to me, I need to do some processing on this to get the values out. I've tried using regex to extract it.
SELECT regexp_substr(breakdown, '\[([0-9]+),([0-9]+)\]')
FROM transactions
but I get an error returned that says
Unmatched ( or \(
Detail:
-----------------------------------------------
error: Unmatched ( or \(
code: 8002
context: T_regexp_init
query: 8946413
location: funcs_expr.cpp:130
process: query3_40 [pid=17533]
--------------------------------------------
Ideally I would like to get x and y as their own columns so I can do the appropriate math. I know I can do this fairly easy in python or PHP or the like, but I'm interested in a pure SQL solution - partially because I'm using an online SQL editor (Mode Analytics) to plot it easily as a dashboard.
Thanks for your help!
If breakdown really is an array you can do this:
select id, total, breakdown[1] as x, breakdown[2] as y
from transactions;
If breakdown is not an array but e.g. a varchar column, you can cast it into an array if you replace the square brackets with curly braces:
select id, total,
(translate(breakdown, '[]', '{}')::integer[])[1] as x,
(translate(breakdown, '[]', '{}')::integer[])[2] as y
from transactions;
You can try this :
SELECT REPLACE(SPLIT_PART(breakdown,',',1),'[','') as x,REPLACE(SPLIT_PART(breakdown,',',2),']','') as y FROM transactions;
I tried this with redshift db and this worked for me.
Detailed Explanation:
SPLIT_PART(breakdown,',',1) will give you [50.
SPLIT_PART(breakdown,',',2) will give you 50].
REPLACE(SPLIT_PART(breakdown,',',1),'[','') will replace the [ and will give just 50.
REPLACE(SPLIT_PART(breakdown,',',2),']','') will replace the ] and will give just 50.
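The same two steps (split on the comma, then strip the bracket) can be traced outside the database. This is a small Python sketch of the logic only; split_breakdown is a hypothetical helper name, and unlike the SQL above it also converts the pieces to integers.

```python
def split_breakdown(breakdown):
    """Mimic SPLIT_PART + REPLACE on a string such as '[150,50]'."""
    parts = breakdown.split(",")
    x_text = parts[0].replace("[", "")   # '[150' -> '150'
    y_text = parts[1].replace("]", "")   # '50]'  -> '50'
    # int() tolerates surrounding spaces, e.g. ' 110'
    return int(x_text), int(y_text)

print(split_breakdown("[150,50]"))
print(split_breakdown("[15, 110]"))
```

In the SQL version the two columns come back as text, so a cast (and probably a TRIM for values like '[15, 110]') would be needed before doing math on them.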
I know it's an old post, but in case someone needs a much easier way:
select json_extract_array_element_text('[100,101,102]', 2);
output : 102
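Note that the index is zero-based, which is why position 2 yields the third element. A tiny Python analogue of the same lookup, assuming the column text is valid JSON (array_element is just an illustrative name):

```python
import json

def array_element(text, index):
    """Parse a JSON array string and return the element at a zero-based index."""
    return json.loads(text)[index]

print(array_element("[100,101,102]", 2))
print(array_element("[150,50]", 0), array_element("[150,50]", 1))
```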

SQL to MDX conversion

I have this where clause in sql language:
where (cond1=1 or cond2=1) and cond3=1
How can I get this result in MDX with slicing (putting the condition into the WHERE)?
{[cond1].&[1],[cond2].&[1]} /*and*/ {[cond3].&[1]}
Thanks
Try to use a subcube:
Select
-- YOUR SELECTED MEASURES AND DIMENSIONS
From
(
Select
{[cond1].&[1],[cond2].&[1]} on 0
,{[cond3].&[1]} on 1
-- ,{more slices} on x
From [CubeName]
)
Hope this helps!
You can use a subcube expression as stated above, but this is not the only option. If you use a subcube, you would increase query performance greatly (assuming you don't perform crossjoins in it).
You can also use general WHERE keyword after last expression that returns cube:
select
{ .. } on 0,
{ .. } on 1
from (select { [Dim1].[X].allmembers } on 0)
where ([Dim2].[Y].&[Y1])
Or:
select
{ .. } on 0,
{ .. } on 1
from (select { [Dim1].[X].allmembers } on 0)
where {[DimTime].[Time].[Year].&[2001] : [DimTime].[Time].[Year].&[2015]}
This is applied at the end of execution, which means performance may decrease. However, if you need to apply an external filter to all axes, this is the option you need.
Another way to filter member values is using tuple expressions:
with member LastDateSale as ( (Tail(EXISTING [DimTime].[Time].[Dates].members,1), [Measures].[ActualSales]) )
This will take your DimTime axis, apply external filter, get the last element from it and calculate [ActualSales] for it, if possible.

Find maximum of a function

I need to find a maximum of the function:
a1^x1 * const1 + a2^x2 * const2 +....+ ak^xk * constk = quality
where xk>0 and xk is integer. ak is constant.
constraint:
a1^x1 * const1*func(x1) + a2^x2 * const2*func(x2) +....+ ak^xk * constk*func(xk) < Budget
Where func is a discrete function:
func(x)
{
switch(x)
{
case 1: return 423;
case 2: return 544;
...
etc
}
}
k may be big (over 1000). x is less than 100.
What is the best method?
There are techniques like Nelder-Mead optimization (which I believe GSL implements), but most techniques assume some sort of special structure (e.g. convexity or continuity). Depending on the values of the function, there may not exist a unique optimum or even an optimum that a normal downhill method can find.
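Because every xi is a small integer, one alternative worth considering (this is not the Nelder-Mead approach mentioned above) is to treat the problem as a multiple-choice knapsack: for each i, precompute the pair (ai^x * consti, ai^x * consti * func(x)) for every admissible x, then pick exactly one pair per i. The sketch below is a plain dynamic program over the budget. It assumes the costs have been scaled or rounded to non-negative integers, which is an assumption on top of the question, and best_quality is a hypothetical name.

```python
def best_quality(options, budget):
    """options[i] is a list of (quality, cost) pairs, one per admissible x_i.
    Costs must be non-negative integers. Exactly one pair is chosen per i.
    Returns the maximum total quality with total cost < budget, or None
    if no combination fits."""
    NEG = float("-inf")
    # best[c] = best quality reachable with total cost exactly c so far
    best = [NEG] * budget
    if budget > 0:
        best[0] = 0.0
    for group in options:
        nxt = [NEG] * budget
        for spent, q in enumerate(best):
            if q == NEG:
                continue
            for quality, cost in group:
                total = spent + cost
                if total < budget and q + quality > nxt[total]:
                    nxt[total] = q + quality
        best = nxt
    top = max(best, default=NEG)
    return None if top == NEG else top

# Two variables with two choices each: (quality, cost)
example = [[(1, 2), (3, 5)], [(2, 3), (4, 6)]]
print(best_quality(example, 9))
```

The running time is proportional to k times the budget times the number of choices per variable, so with k over 1000 and up to 99 choices this is only practical when the (scaled) budget is moderate.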