How can I get a cumulative product with Snowflake?

I want to calculate the cumulative product across rows in Snowflake.
Basically I have monthly rates that, multiplied together, accumulate over time.
(Some databases have a product() SQL function for this.)

A trick suggested by Sterling Paramore: add the logs, then exponentiate the sum - this works because exp(ln(a) + ln(b)) = a * b:
with data as (select $1 x from values (1),(2),(3),(4),(5))
select x
, sum(x) over(order by x) sum
, exp(sum(ln(x)) over(order by x)) mult
from data
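For the sample values 1-5 this returns (MULT is exact up to floating-point rounding in the ln()/exp() round-trip):
X  SUM  MULT
1    1     1
2    3     2
3    6     6
4   10    24
5   15   120
One caveat: ln() is undefined for zero and negative inputs, so rates that can be zero or negative need special handling (for example, tracking zeros and signs separately) before applying this trick.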

If a built-in function doesn't exist, it's usually possible to roll something custom using a User-Defined Table Function.
In this case:
CREATE OR REPLACE FUNCTION CUMULATIVE_PRODUCT(VALUE double)
RETURNS TABLE (PRODUCT double)
LANGUAGE JAVASCRIPT
AS '{
  initialize: function(argumentInfo, context) {
    // called once per partition: start at the multiplicative identity
    this.cumulativeProduct = 1;
  },
  processRow: function(row, rowWriter, context) {
    // fold the current row into the running product and emit it
    this.cumulativeProduct = this.cumulativeProduct * row.VALUE;
    rowWriter.writeRow({PRODUCT: this.cumulativeProduct});
  }
}';
Example table:
create temp table sample_numbers as (
  select 1 as index, 5.1::double as current_value
  union all
  select 2 as index, 4.3::double as current_value
  union all
  select 3 as index, 3.7::double as current_value
  union all
  select 4 as index, 3.9::double as current_value
)
invoking the UDTF:
select index, current_value, PRODUCT as cumulative_product
from sample_numbers, table(CUMULATIVE_PRODUCT(current_value) over ())
Note the empty over() clause, which forces Snowflake to do a single sequential pass over the data instead of splitting it into parallel chunks. To compute independent running products per group, partition the data instead, as sketched below.
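A hedged variant (sample_numbers2 and its category column are hypothetical names): with a PARTITION BY in the over() clause, initialize runs once per partition, so the running product restarts for each group:
select category, index, current_value, PRODUCT as cumulative_product
from sample_numbers2,
     table(CUMULATIVE_PRODUCT(current_value) over (partition by category order by index))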

Related

Using SQL Query to return value from BigQuery User Defined Function

Can I use a query in a Google BigQuery User Defined Function to return some value? I've been searching the docs and Stack Overflow for hours without any luck, and I have a very specific use case where I need to return a single scalar value based on the values of multiple columns.
The use case for the query is the following:
SELECT campaign,source,medium, get_channel(campaign,source,medium)
FROM table_name
The get_channel() UDF will use these parameters and a complex select statement to return a single scalar value for the row. I've prepared the query; I just need to find a way to use it inside the UDF, and honestly I'm at a loss.
Is my use case correct? Is this even possible? Are there any alternatives to do this?
It looks like you want to use a UDF to select a scalar value off of some lookup table. If so, no - you cannot reference a table in a UDF - see more in Limits and Limitations.
But if you just want some complex manipulation of the arguments - sure - see the dummy example below:
#standardSQL
CREATE TEMPORARY FUNCTION get_channel(campaign INT64, source INT64, medium INT64) AS ((
  SELECT campaign + source + medium AS result_of_complex_select_statement
));
WITH `project.dataset.table_name` AS (
  SELECT 1 AS campaign, 2 AS source, 3 AS medium UNION ALL
  SELECT 4, 5, 6 UNION ALL
  SELECT 7, 8, 9
)
SELECT
  campaign,
  source,
  medium,
  get_channel(campaign, source, medium) AS channel
FROM `project.dataset.table_name`
You should rather use a JOIN to achieve your goal, for example:
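A minimal sketch of that JOIN approach (the channel_lookup table and its channel column are assumed names for illustration):
#standardSQL
SELECT t.campaign, t.source, t.medium, l.channel
FROM `project.dataset.table_name` t
LEFT JOIN `project.dataset.channel_lookup` l
  ON l.campaign = t.campaign
  AND l.source = t.source
  AND l.medium = t.medium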

SQL Query: get adjusted value of each row

I have several columns in a SQL table, and I would like to compute the result for each row in the table. My data looks something like this:
data
name   value  adjustor1  adjustor2
Comp1  20     0.05       0.08
Comp2  80     -0.07      0.065
The formula for the adjusted value for each row is:
adjusted_value = value*(1 + adjustor1)*(1 + adjustor2)*(100/sum(value))
So the adjusted output should be:
name   adjusted_value
Comp1  22.25
Comp2  77.75
The original values sum to 100, and the adjusted values should also sum to 100. I've tried things such as:
SELECT adjusted*(100/sum(adjusted))
FROM (
  SELECT value*(1+adjustor1)*(1+adjustor2) as adjusted
  FROM data
) as result
which gives me the error: ERROR: column "result.adjusted" must appear in the GROUP BY clause or be used in an aggregate function
Although if I just do:
SELECT sum(adjusted)
FROM (
  SELECT value*(1+adjustor1)*(1+adjustor2) as adjusted
  FROM data
) as result
OR
SELECT adjusted
FROM (
  SELECT value*(1+adjustor1)*(1+adjustor2) as adjusted
  FROM data
) as result
I can get either the sum OR the adjusted value, but not both.
The column name value is not a good choice; it is a reserved word. I changed it to val instead.
This is the query you need:
WITH data(name, val, adjustor1, adjustor2) AS (
  VALUES
    ('Comp1'::text, 20, 0.05, 0.08),
    ('Comp2', 80, -0.07, 0.065)
)
SELECT name, val, adjustor1, adjustor2,
  CAST(val*(1+adjustor1)*(1+adjustor2)*(100/sum(
    val*(1+adjustor1)*(1+adjustor2)
  ) OVER ()) AS numeric(10,3)) adjusted_value
FROM data;
I must admit it is a bit clumsy; a sub-query makes it easier to understand (SQL Fiddle):
WITH data(name, val, adjustor1, adjustor2) AS (
  VALUES
    ('Comp1'::text, 20, 0.05, 0.08),
    ('Comp2', 80, -0.07, 0.065)
)
SELECT name, CAST(adj*(100/sum(adj) OVER ()) AS numeric(8,3)) adjusted_value
FROM (
  SELECT name, val, adjustor1, adjustor2,
    val*(1+adjustor1)*(1+adjustor2) adj
  FROM data
) s;
Some notes:
The original sum(adj) is replaced with sum(adj) OVER (), which is the window-function syntax.
I also added CAST() in order to round the values.
You can get the SUM in a subquery and then do a CROSS JOIN to achieve what you want:
SELECT name,
  value*(1+adjustor1)*(1+adjustor2)*(100/T.adjSum) as adjusted_value
FROM data
CROSS JOIN (
  SELECT SUM(value*(1+adjustor1)*(1+adjustor2)) as adjSum
  FROM data
) T

Pairwise array sum aggregate function?

I have a table with arrays as one column, and I want to sum the array elements together:
> create table regres(a int[] not null);
> insert into regres values ('{1,2,3}'), ('{9, 12, 13}');
> select * from regres;
a
-----------
{1,2,3}
{9,12,13}
I want the result to be:
{10, 14, 16}
that is: {1 + 9, 2 + 12, 3 + 13}.
Does such a function already exist somewhere? The intagg extension looked like a good candidate, but no such function exists there.
The arrays are expected to be between 24 and 31 elements in length, all elements are NOT NULL, and the arrays themselves will also always be NOT NULL. All elements are basic int. There will be more than two rows per aggregate. All arrays in a given query will have the same number of elements; different queries will have different numbers of elements.
My implementation target is: PostgreSQL 9.1.13
General solutions for any number of arrays with any number of elements. Individual elements or the whole array can be NULL, too:
Simpler in 9.4+ using WITH ORDINALITY
SELECT ARRAY (
  SELECT sum(elem)
  FROM tbl t
     , unnest(t.arr) WITH ORDINALITY x(elem, rn)
  GROUP BY rn
  ORDER BY rn
);
See:
PostgreSQL unnest() with element number
Postgres 9.3+
This makes use of an implicit LATERAL JOIN
SELECT ARRAY (
  SELECT sum(arr[rn])
  FROM tbl t
     , generate_subscripts(t.arr, 1) AS rn
  GROUP BY rn
  ORDER BY rn
);
See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Postgres 9.1
SELECT ARRAY (
  SELECT sum(arr[rn])
  FROM (
    SELECT arr, generate_subscripts(arr, 1) AS rn
    FROM tbl t
  ) sub
  GROUP BY rn
  ORDER BY rn
);
The same works in later versions, but set-returning functions in the SELECT list are not standard SQL and were frowned upon by some. Should be OK since Postgres 10, though. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
Related:
Is there something like a zip() function in PostgreSQL that combines two arrays?
If you need better performance and can install Postgres extensions, the agg_for_vecs C extension provides a vec_to_sum function that should meet your need. It also offers various aggregate functions like min, max, avg, and var_samp that operate on arrays instead of scalars.
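A usage sketch, assuming the extension is installed and its vec_to_sum aggregate behaves as documented:
CREATE EXTENSION agg_for_vecs;
SELECT vec_to_sum(a) FROM regres;  -- should return {10,14,16} for the sample data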
I know the original question and answer are pretty old, but for others who find this... The most elegant and flexible solution I've found is to create a custom aggregate function. Erwin's answer presents some great simple solutions if you only need the single resulting array, but they don't translate to a solution that could include other table columns and aggregations, in a GROUP BY for example.
With a custom array_add function and array_sum aggregate function:
CREATE OR REPLACE FUNCTION array_add(_a numeric[], _b numeric[])
RETURNS numeric[]
AS
$$
BEGIN
  -- add the two arrays element-wise; coalesce() treats missing/NULL
  -- elements as 0, so arrays of different lengths still work
  RETURN ARRAY(
    SELECT coalesce(a, 0) + coalesce(b, 0)
    FROM unnest(_a, _b) WITH ORDINALITY AS x(a, b, n)
    ORDER BY n
  );
END
$$ LANGUAGE plpgsql;
CREATE AGGREGATE array_sum(numeric[])
(
  sfunc = array_add,   -- state transition: add each row's array into the running total
  stype = numeric[],   -- the state is itself an array
  initcond = '{}'      -- start from an empty array
);
Then (using the names from your example):
SELECT array_sum(a) a_sums
FROM regres;
Returns your array of sums, and it can just as well be used anywhere other aggregate functions could be used, so if your table also had a column name you wanted to group by, and another array of numbers, column b:
SELECT name, array_sum(a) a_sums, array_sum(b) b_sums
FROM regres
GROUP BY name;
You won't get quite the performance you'd get out of the built-in sum function and just selecting sum(a[1]), sum(a[2]), sum(a[3]); you'd have to implement the array_add function as a compiled C function to get that. But in cases where you don't have the ability to add custom C functions (like a managed cloud database, e.g. AWS RDS), or you're not aggregating huge numbers of rows, the difference probably won't be noticed.
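For comparison, that per-element built-in approach looks like this (only practical when the number of elements is small, fixed, and known up front):
SELECT sum(a[1]), sum(a[2]), sum(a[3]) FROM regres;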

Purposely having a query return blank entries at regular intervals

I want to write a query that returns 3 results followed by blank results followed by the next 3 results, and so on. So if my database had this data:
CREATE TABLE tbl (a integer, b integer, c integer, d integer);
INSERT INTO tbl (a,b,c,d)
VALUES (1,2,3,4),
  (5,6,7,8),
  (9,10,11,12),
  (13,14,15,16),
  (17,18,19,20),
  (21,22,23,24),
  (25,26,27,28);
I would want my query to return this
1,2,3,4
5,6,7,8
9,10,11,12
, , ,
13,14,15,16
17,18,19,20
21,22,23,24
, , ,
25,26,27,28
I need this to work for arbitrarily many rows, with each group of three separated like this.
I'm running PostgreSQL 8.3.
This should work flawlessly in PostgreSQL 8.3
SELECT a, b, c, d
FROM (
  SELECT rn, 0 AS rk, (x[rn]).*
  FROM (
    SELECT x, generate_series(1, array_upper(x, 1)) AS rn
    FROM (SELECT ARRAY(SELECT tbl FROM tbl) AS x) x
  ) y
  UNION ALL
  SELECT generate_series(3, (SELECT count(*) FROM tbl), 3), 1, (NULL::tbl).*
  ORDER BY rn, rk
) z
Major points
Works for a query that selects all columns of tbl.
Works for any table.
For selecting arbitrary columns you have to substitute (NULL::tbl).* with a matching number of NULL columns in the second query.
Assuming that NULL values are ok for "blank" rows.
If not, you'll have to cast your columns to text in the first and substitute '' for NULL in the second SELECT.
Query will be slow with very big tables.
If I had to do it, I would write a plpgsql function that loops through the results and inserts the blank rows (see the sketch below). But you mentioned you had no direct access to the db ...
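A minimal sketch of that idea (the function name is made up; it assumes the tbl from the question and emits an all-NULL row after every third row):
CREATE OR REPLACE FUNCTION rows_with_blanks()
  RETURNS SETOF tbl AS
$$
DECLARE
  r     tbl;
  blank tbl;       -- never assigned, so it stays all-NULL
  n     int := 0;
BEGIN
  FOR r IN SELECT * FROM tbl LOOP
    RETURN NEXT r;
    n := n + 1;
    IF n % 3 = 0 THEN
      RETURN NEXT blank;   -- the "blank" separator row
    END IF;
  END LOOP;
  RETURN;
END
$$ LANGUAGE plpgsql;
SELECT * FROM rows_with_blanks();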
In short, no, there's not an easy way to do this, and generally, you shouldn't try. The database is concerned with what your data actually is, not how it's going to be displayed. It's not an appropriate scope of responsibility to expect your database to return "dummy" or "extra" data so that some down-stream process produces a desired output. The generating script needs to do that.
As you can't change your down-stream process, you could (read that with a significant degree of skepticism and disdain) add things like this:
Select Top 3
    a, b, c, d
From tbl
Union Select Top 1
    '', '', '', ''
From tbl
Union Select Top 3 Skip 3
    a, b, c, d
From tbl
Please, don't actually try to do that.
You can do it (at least on DB2 - there doesn't appear to be equivalent functionality for your version of PostgreSQL).
No looping needed, although there is a bit of trickery involved...
Please note that though this works, it's really best to change your display code.
The statement requires CTEs (although it could be re-written to use other table references) and OLAP functions (I guess you could re-write it to count() previous rows in a subquery, but...).
WITH dataList (rowNum, dataColumn) AS (
  SELECT CAST(CAST(:interval AS REAL) / (:interval - 1) *
              ROW_NUMBER() OVER(ORDER BY dataColumn) AS INTEGER),
         dataColumn
  FROM dataTable
),
blankIncluder (rowNum, dataColumn) AS (
  SELECT rowNum, dataColumn
  FROM dataList
  UNION ALL
  SELECT rowNum - 1, :blankDataColumn
  FROM dataList
  WHERE MOD(rowNum - 1, :interval) = 0
    AND rowNum > :interval
)
SELECT *
FROM blankIncluder
ORDER BY rowNum
This will generate the list of elements from dataTable, with a 'blank' line every :interval lines, as ordered by the initial query. The result set only has 'blank' lines between existing lines - there are no 'blank' lines at the ends.

How to select lines in MySQL while a condition lasts

I have something like this:
Name  Value
A     10
B     9
C     8
Meaning, the values are in descending order. I need to create a new table that will contain the values that make up 60% of the total values. So, this could be a pseudocode:
set Total = sum(value)
set counter = 0
foreach line from table OriginalTable do:
counter = counter + value
if counter > 0.6*Total then break
else insert line into FinalTable
end
As you can see, I'm iterating over the rows here. I know this can be done using handlers, but I can't get it to work, so any solution using handlers or something else creative would be great.
It should also run in reasonable time - the solution in "how to select values that sum up to 60% of the total" works, but it's slow as hell :(
Thanks!!!!
You'll likely need to use the lead() or lag() window function, possibly with a recursive query to merge the rows together. See this related question:
merge DATE-rows if episodes are in direct succession or overlapping
And in case you're using MySQL, you can work around the lack of window functions with user variables, as in this question and the sketch below:
Mysql query problem
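A hedged sketch of that user-variable workaround for pre-8.0 MySQL (table and column names are taken from the pseudocode above; variable assignment combined with ORDER BY in a derived table relies on behavior MySQL does not guarantee, so verify on your version):
INSERT INTO FinalTable (name, value)
SELECT name, value
FROM (
  SELECT name, value,
         @running := @running + value AS cumulative  -- running total in descending value order
  FROM OriginalTable
  CROSS JOIN (SELECT @running := 0) init             -- initialize the variable inline
  ORDER BY value DESC
) t
WHERE cumulative <= 0.6 * (SELECT SUM(value) FROM OriginalTable);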
I don't know which analytical functions SQL Server (which I assume you are using) supports; for Oracle, you could use something like:
select v.*,
       cumulative/overall percent_current,
       previous_cumulative/overall percent_previous
from (
  select
    id,
    name,
    value,
    cumulative,
    lag(cumulative) over (order by id) as previous_cumulative,
    overall
  from (
    select
      id,
      name,
      value,
      sum(value) over (order by id) as cumulative,
      (select sum(value) from mytab) overall
    from mytab
    order by id)
) v
Explanation:
- sum(value) over (order by id) computes a running total
- lag() gives you the cumulative value of the previous row
- you can then combine these to find the first row where percent_current > 0.6 and percent_previous < 0.6, or filter on the running total directly, as sketched below
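To reproduce the pseudocode from the question - keep rows while the running total stays within 60% of the overall sum - a minimal sketch reusing the inner query above (this variant doesn't even need lag()):
select id, name, value
from (
  select
    id,
    name,
    value,
    sum(value) over (order by id) as cumulative,
    (select sum(value) from mytab) overall
  from mytab
) v
where v.cumulative <= 0.6 * v.overall;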