BigQuery Standard SQL - store query or UDF in table - google-bigquery

Is it possible to store data in a table that can then be converted into either a SQL query or a UDF - like a javascript eval()?
The use case is that I have a list of clients where earnings are calculated in quite significantly different ways for each, and this can change over time. So I would like to have a lookup table which can be updated with a formula for calculating this figure rather than having to write not only hundreds of queries (one for each client) but also maintain these.
I have tried to think if there is a way of having a standard formula that would be flexible enough, but I really don't think it's possible unfortunately.

Sure! BigQuery can define and use JS UDFs. The good news is that eval() works as expected:
CREATE TEMP FUNCTION calculate(x FLOAT64, y FLOAT64, formula STRING)
RETURNS FLOAT64
LANGUAGE js AS """
return eval(formula);
""";
WITH table AS (
SELECT 1 AS x, 5 as y, 'x+y' formula
UNION ALL SELECT 2, 10, 'x-y'
UNION ALL SELECT 3, 15, 'x*y'
)
SELECT x, y, formula, calculate(x, y, formula) result
FROM table;
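One caveat with eval(): it runs whatever string is stored in the formula column, so that column has to be trusted. The following is a minimal sketch of the same lookup-table idea outside BigQuery, assuming Python's sqlite3 as a stand-in database and a hypothetical formulas table; instead of eval() it walks the parsed expression and allows only basic arithmetic on x and y.

```python
import ast
import operator
import sqlite3

# Only these arithmetic operators are allowed in a stored formula.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, variables):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, variables)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name) and node.id in variables:
        return variables[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _evaluate(node.left, variables)
        right = _evaluate(node.right, variables)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand, variables)
    raise ValueError("unsupported element in formula")


def calculate(x, y, formula):
    """Evaluate a stored formula such as 'x+y' against one row's values."""
    tree = ast.parse(formula, mode="eval")
    return float(_evaluate(tree, {"x": x, "y": y}))


# Hypothetical in-memory table mirroring the sample rows in the query above.
conn = sqlite3.connect(":memory:")
conn.create_function("calculate", 3, calculate)
conn.execute("CREATE TABLE formulas (x REAL, y REAL, formula TEXT)")
conn.executemany(
    "INSERT INTO formulas VALUES (?, ?, ?)",
    [(1, 5, "x+y"), (2, 10, "x-y"), (3, 15, "x*y")],
)
rows = conn.execute(
    "SELECT x, y, formula, calculate(x, y, formula) FROM formulas ORDER BY x"
).fetchall()
```

Anything other than numbers, x, y and the four arithmetic operators raises ValueError, which is the trade-off against the flexibility of a full eval().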

Related

Is it possible to concisely tell SQL SELECT to omit some temporary/dummy columns?

Suppose I have this query:
select
x + y as _total,
abs(x - y) / _total as _err,
round(100 * _err) as pct_err,
x,
y
from foo;
This assumes I have a table with x and y, and calculates an error between them. Note that columns I prefixed with _ are dummy columns - they're only there to show the steps of the calculation more clearly. Is there a way to omit them from the result?
I don't want to simply collapse the three columns into a single expression. It would be messier, and consider also a calculation with 10 steps and much longer field names.
I don't want to make this a CTE and then re-select only the columns I want. That seems too much hassle for such a simple thing.
It would be okay if I could just put the dummy columns at the end where they would be out of the way, but SQL doesn't seem to allow referencing a column that comes after.
Note that "no" is an acceptable answer, if you have reasonably comprehensive knowledge of SQL syntax :)

Bigquery SQL: convert array to columns

I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1,
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
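If the join semantics are hard to picture, here is a rough plain-Python sketch of what the CROSS JOIN UNNEST does to the toy data. The unnest helper is hypothetical, written only for illustration, and is not a BigQuery API.

```python
# Toy data matching the `raw` CTE above.
raw = [
    {"name": "A", "a": [1, 2, 3, 4, 5]},
    {"name": "B", "a": [5, 4, 3, 2, 1]},
]


def unnest(rows, array_field, value_name="vals"):
    """Emit one output row per array element, repeating the other fields."""
    long_format = []
    for row in rows:
        # Every field except the array is copied onto each output row.
        others = {k: v for k, v in row.items() if k != array_field}
        for value in row[array_field]:
            long_format.append({**others, value_name: value})
    return long_format


long_format = unnest(raw, "a")
```

Each input row is paired only with the elements of its own array, which is why the result has ten rows rather than every name combined with every value.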
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length).
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT operator! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
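To make the pivot step concrete, here is a small plain-Python sketch of the same long-to-wide reshaping, under the same assumption that the list of offsets is known up front. The pivot helper is hypothetical; it mimics ANY_VALUE by keeping the first value seen for each (name, offset) pair and leaves missing cells as None.

```python
# Long-format rows, as produced by UNNEST ... WITH OFFSET on the toy data.
long_format = [
    {"name": name, "vals": value, "offset": offset}
    for name, array in [("A", [1, 2, 3, 4, 5]), ("B", [5, 4, 3, 2, 1])]
    for offset, value in enumerate(array)
]


def pivot(rows, offsets):
    """Build one wide row per name, with one vals_<offset> column per offset."""
    wide = {}
    for row in rows:
        if row["offset"] not in offsets:
            continue  # offsets outside the IN (...) list are dropped
        wide_row = wide.setdefault(
            row["name"],
            {"name": row["name"], **{f"vals_{o}": None for o in offsets}},
        )
        column = f"vals_{row['offset']}"
        if wide_row[column] is None:
            wide_row[column] = row["vals"]  # first value wins, like ANY_VALUE
    return list(wide.values())


wide_format = pivot(long_format, offsets=[0, 1, 2, 3, 4])
```

The column names exist before any row is read, because they come from the offsets list. That is the same constraint PIVOT imposes on the IN (...) clause.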
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the statement EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffixes (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
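As a rough sketch of the "build the query in your scripting language" option, the Python below assembles the same PIVOT statement as a string. The build_pivot_query helper and its parameters are hypothetical, and the table name is interpolated as-is, so it is assumed to come from trusted code rather than user input. Running the resulting string against BigQuery is left to whichever client you use.

```python
def build_pivot_query(table, array_length, value_alias="vals"):
    """Return a PIVOT query with one column per array offset."""
    # Equivalent of the STRING_AGG over GENERATE_ARRAY(0, n - 1) above.
    offsets = ", ".join(str(i) for i in range(array_length))
    return (
        "SELECT *\n"
        f"FROM {table} PIVOT(\n"
        f"  ANY_VALUE(vals) AS {value_alias}\n"
        f"  FOR offset IN ({offsets})\n"
        ")"
    )


query = build_pivot_query("d.long_format", 5)
```

The array length still has to be known when the string is built, for example from a prior query for the maximum offset.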
Consider below approach
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If applied to dummy data like the following (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output has one row per array, with the elements spread across the columns index_0 through index_9.

Normalize column in ClickHouse

Is there a way to normalize a column in ClickHouse?
I was trying to do it by getting the column into an array via groupArray and then using arrayMap with a lambda function
arrayMap(x -> (x - minArray(c)) / (maxArray(c) - minArray(c)), c) to normalize the data in the array.
But it seems a little bit clunky, because it has to be a subquery that repeats the actual query, which then has to be JOINed back to it.
So, is there a better solution to it?
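hmm... just try using the standard aggregate functions, but as window functions over the whole column. (With GROUP BY c, min(c) and max(c) would be computed within each group, where both equal c, so the ratio would be 0/0.)
SELECT c, (c - min(c) OVER ()) / (max(c) OVER () - min(c) OVER ()) AS normalized_c FROM table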
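For reference, min-max normalization needs the minimum and maximum of the whole column, not of each individual value. A minimal Python sketch of the calculation follows; returning 0.0 for a constant column is an assumption made here to avoid dividing by zero, not something ClickHouse does for you.

```python
def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column has no spread, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([2, 4, 6])
```

Applying the same formula to a single value on its own hits the hi == lo branch, because that value is both the minimum and the maximum.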

Using SQL Query to return value from BigQuery User Defined Function

Can I use a query in Google BigQuery User Defined Function to return some value? I've been searching docs and stackoverflow for hours without any luck and I have a very specific use case where I need to return a single scalar value based on the values of multiple columns.
Following will be the use case for the query:
SELECT campaign,source,medium, get_channel(campaign,source,medium)
FROM table_name
The get_channel() UDF will use these parameters and a complex select statement to return a single scalar value for the row. I've prepared the query; I just need to find a way to use that query in the UDF, and I am honestly at a loss.
Is my use case correct? Is this even possible? Are there any alternatives to do this?
It looks like you want to use a UDF to select a scalar value from some lookup table. If so, no: you cannot reference a table in a UDF. See more in Limits and Limitations.
But if you just want to have some complex manipulation with arguments - sure - see dummy example below
#standardSQL
CREATE TEMPORARY FUNCTION get_channel(campaign INT64, source INT64, medium INT64) AS ((
SELECT campaign + source + medium as result_of_complex_select_statement
));
WITH `project.dataset.table_name` AS (
SELECT 1 AS campaign, 2 AS source, 3 AS medium UNION ALL
SELECT 4, 5, 6 UNION ALL
SELECT 7, 8, 9
)
SELECT
campaign,
source,
medium,
get_channel(campaign,source,medium) AS channel
FROM `project.dataset.table_name`
You should use a JOIN instead to achieve your goal.
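A minimal sketch of the JOIN approach, run here with Python's sqlite3 so it is self-contained. The channel_lookup table, its columns, the sample rows and the 'Other' fallback are all hypothetical; the real lookup logic from the question is not known.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_name (campaign TEXT, source TEXT, medium TEXT);
    CREATE TABLE channel_lookup (source TEXT, medium TEXT, channel TEXT);
    INSERT INTO table_name VALUES
      ('spring_sale', 'google', 'cpc'),
      ('newsletter_42', 'mailchimp', 'email'),
      ('brand', 'unknown_site', 'referral');
    INSERT INTO channel_lookup VALUES
      ('google', 'cpc', 'Paid Search'),
      ('mailchimp', 'email', 'Email');
    """
)

# LEFT JOIN keeps rows with no match; COALESCE supplies a fallback channel.
rows = conn.execute(
    """
    SELECT t.campaign, t.source, t.medium,
           COALESCE(l.channel, 'Other') AS channel
    FROM table_name AS t
    LEFT JOIN channel_lookup AS l
      ON l.source = t.source AND l.medium = t.medium
    ORDER BY t.campaign
    """
).fetchall()
```

The same LEFT JOIN ... COALESCE shape should carry over to BigQuery, with the join condition replaced by whatever the complex select statement actually matches on.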

SQL Server 2008 Query Result Formating (changing x and y axis fields)

What is the most efficient way to format query results, whether in the actual SQL Server SQL code or using a different program such as Access or Excel, so that the X axis (first row column headers) and Y axis (first column field values) can be changed, but with the same query result data still being represented, just in a different way?
The way the data is stored in my database and the way my original query results are returned in SQL Server 2008 are as follows:
Original Results Format
And Below is the way I need to have the data look:
How I need the Results to Look
In essence, I need to have the zipcode field go down the Y Axis (first column) and the Coveragecode field to go across the top first Row (X Axis) with the exposures filling in the rest of the data.
The only way I can think of getting this done is by bringing the data into Excel and doing a bunch of VLOOKUPs. I tried using some pivot tables but didn't get too far. I'm going to continue trying to format the data using the VLOOKUPs, but hopefully someone can come up with a better way of doing this.
What you are looking for is a table operator called PIVOT. Docs: http://msdn.microsoft.com/en-us/library/ms177410.aspx
WITH Src AS(
SELECT Coveragecode, Zipcode, EarnedExposers
FROM yourTable
)
SELECT Zipcode, [BI], [PD], [PIP], [UMBI], [COMP], [COLL]
FROM Src
PIVOT(MAX(EarnedExposers) FOR CoverageCode
IN(
[BI],
[PD],
[PIP],
[UMBI],
[COMP],
[COLL]
)
) AS P;
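To see what the PIVOT produces without a SQL Server instance, the sketch below reproduces the same reshaping with conditional aggregation (MAX over CASE), run through Python's sqlite3. The sample rows and the shortened list of coverage codes are made up for illustration; the table and column names follow the query above.

```python
import sqlite3

# Hypothetical subset of the coverage codes used in the PIVOT above.
coverage_codes = ["BI", "PD", "PIP"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (Coveragecode TEXT, Zipcode TEXT, EarnedExposers REAL)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [
        ("BI", "10001", 1.5),
        ("PD", "10001", 2.0),
        ("BI", "10002", 3.0),
        ("PIP", "10002", 0.5),
    ],
)

# One MAX(CASE ...) column per coverage code; codes come from trusted code.
columns = ",\n  ".join(
    f"MAX(CASE WHEN Coveragecode = '{code}' THEN EarnedExposers END) AS \"{code}\""
    for code in coverage_codes
)
query = (
    f"SELECT Zipcode,\n  {columns}\n"
    "FROM yourTable\nGROUP BY Zipcode\nORDER BY Zipcode"
)
rows = conn.execute(query).fetchall()
```

Zip codes with no row for a given coverage code get NULL (None in Python) in that column, which matches how PIVOT fills missing combinations.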