Achieving window function-like behavior using a PostgreSQL user defined function? - sql

Let's say that given a table observations_tbl with attributes date (day) and value, I want to produce the new attribute prev_day_value to get the following table:
|---------------------|-------|----------------|
| date | value | prev_day_value |
|---------------------|-------|----------------|
| 01.01.2015 00:00:00 | 5 | 0 |
| 02.01.2015 00:00:00 | 4 | 5 |
| 03.01.2015 00:00:00 | 3 | 4 |
| 04.01.2015 00:00:00 | 2 | 3 |
|---------------------|-------|----------------|
I am well aware that such an output can typically be obtained using a WINDOW function. But how would I achieve this through a PostgreSQL user-defined function? I should point out that I am in a situation where I must use a function; it is difficult to explain why without going into detail, but these are the restrictions I have and, if anything, it is a technical challenge.
Take into consideration this template query:
SELECT *, lag(value,1) AS prev_day_value -- or lag(record,1) or lag(date,value,1) or lag(date,1) or lag(observations_tbl,1), etc.
FROM observations_tbl
I am using the function lag with parameter 1 to look for the value that comes before the current row by a distance of 1 row. I don't care what other parameters the function lag may take (table name, other attributes) - what could the function lag look like to achieve such functionality? The function can be written in any language: SQL, PL/pgSQL or even C using the PostgreSQL backend API.
I understand that one answer can be wrapping a WINDOW query inside lag user defined function. But I am thinking that would be a rather costly operation if I have to scan the entire table twice (once inside the lag function and once outside). I was thinking that maybe each PostgreSQL record would have a pointer to its previous record which is directly accessible? Or that I can somehow open a cursor at this specific row / row number without having to scan the entire table? Or is what I am asking impossible?
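For completeness, the plain window-function query that produces the desired output would be something like this (assuming value is an integer-like column, with 0 as the default for the first row):
SELECT date, value,
       lag(value, 1, 0) OVER (ORDER BY date) AS prev_day_value
FROM   observations_tbl;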

Your request is not possible to solve with purely relational tools (window functions are an extension that goes beyond the relational core of SQL). In C you can write your own alternative to the lag function, and you can do the same in PL/V8 (JavaScript). Unfortunately the window-function API is not exposed to PL/pgSQL, so you cannot write a simple PL/pgSQL function that has access to a row other than the one currently being processed.
One possible alternative (with some performance risk) is to write a table function. There you have control over the whole processed data set and can implement this operation easily.
CREATE OR REPLACE FUNCTION report()
RETURNS TABLE(d date, v int, prev_v int) AS $$
DECLARE
  r RECORD;
BEGIN
  prev_v := 0;
  FOR r IN SELECT date, value FROM observations_tbl t ORDER BY 1
  LOOP
    d := r.date; v := r.value;
    RETURN NEXT;
    prev_v := v;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
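Call it like a table:
SELECT * FROM report();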
There is no other usable alternative. In the old days these values were calculated with correlated self-joins, but that approach has pretty terrible performance.
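For reference, such a correlated-subquery version (one extra scan per row) might look like this:
SELECT o.date, o.value,
       COALESCE((SELECT o2.value
                 FROM   observations_tbl o2
                 WHERE  o2.date < o.date
                 ORDER  BY o2.date DESC
                 LIMIT  1), 0) AS prev_day_value
FROM   observations_tbl o
ORDER  BY o.date;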

What Pavel posted, just with fewer assignments. Should be faster:
CREATE OR REPLACE FUNCTION report()
  RETURNS TABLE(d date, v int, prev_v int) AS
$func$
BEGIN
   prev_v := 0;
   FOR d, v IN
      SELECT date, value FROM observations_tbl ORDER BY 1
   LOOP
      RETURN NEXT;
      prev_v := v;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
The general idea can pay if it actually replaces multiple scans over the table with a single one. Like here:
GROUP BY and aggregate sequential numeric values


Remove duplicate entries from string array column of postgres

I have a PostgreSQL table with a column that holds an array of strings. Some rows contain only unique strings in the array, while others contain duplicates. I want to remove the duplicate strings from each row where they exist.
I have tried some queries but couldn't make it happen.
Following is the table:
veh_id | vehicle_types
--------+----------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8","viper"}
7 | {"ferrariff","viper","viper","volt"}
I am expecting following output:
veh_id | vehicle_types
--------+----------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8"}
7 | {"ferrariff","viper","volt"}
Since each row's array is independent, a plain correlated subquery with an ARRAY constructor would do the job:
SELECT *, ARRAY(SELECT DISTINCT unnest (vehicle_types)) AS vehicle_types_uni
FROM vehicle;
See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
Note that NULL is converted to an empty array ('{}'). We'd need to special-case it, but it is excluded in the UPDATE below anyway.
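If preserving NULL mattered, a minimal sketch of that special case could be:
SELECT *, CASE WHEN vehicle_types IS NULL THEN NULL
               ELSE ARRAY(SELECT DISTINCT unnest(vehicle_types))
          END AS vehicle_types_uni
FROM   vehicle;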
The original ARRAY(SELECT DISTINCT ...) query is fast and simple, but don't use it. You didn't say so, but typically you'd want to preserve the original order of array elements; your rudimentary sample suggests as much. Use WITH ORDINALITY in the correlated subquery, which becomes a bit more sophisticated:
SELECT *, ARRAY (SELECT v
                 FROM   unnest(vehicle_types) WITH ORDINALITY t(v, ord)
                 GROUP  BY 1
                 ORDER  BY min(ord)
                ) AS vehicle_types_uni
FROM   vehicle;
See:
PostgreSQL unnest() with element number
UPDATE to actually remove dupes:
UPDATE vehicle
SET    vehicle_types = ARRAY (
          SELECT v
          FROM   unnest(vehicle_types) WITH ORDINALITY t(v, ord)
          GROUP  BY 1
          ORDER  BY min(ord)
       )
WHERE  cardinality(vehicle_types) > 1  -- optional
AND    vehicle_types <> ARRAY (
          SELECT v
          FROM   unnest(vehicle_types) WITH ORDINALITY t(v, ord)
          GROUP  BY 1
          ORDER  BY min(ord)
       );  -- suppress empty updates (optional)
Both added WHERE conditions are optional and only there to improve performance. The first one is logically redundant (the second already skips those rows) but cheap to check. Each condition also excludes the NULL case. The second one suppresses all empty updates.
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If you tried to do that without preserving original order, you'd likely update most rows without need, just because the order or elements changed even without dupes.
Requires Postgres 9.4 or later.
db<>fiddle here
I don't claim it's efficient, but something like this might work:
with expanded as (
select veh_id, unnest (vehicle_types) as vehicle_type
from vehicles
)
select veh_id, array_agg (distinct vehicle_type)
from expanded
group by veh_id
If you really want to get fancy, you can write a custom function that makes a single pass over each array (note that the ANY check against the growing output array makes it quadratic in the worst case, though it stays close to linear when there are few distinct elements):
create or replace function unique_array(input_array text[])
returns text[] as $$
DECLARE
    output_array text[];
    i integer;
BEGIN
    output_array := array[]::text[];
    -- walk the input once, appending each element only if not seen yet
    for i in 1..cardinality(input_array) loop
        if not (input_array[i] = any (output_array)) then
            output_array := output_array || input_array[i];
        end if;
    end loop;
    return output_array;
END;
$$ language plpgsql;
Usage example:
select veh_id, unique_array(vehicle_types)
from vehicles

Oracle performance: query executing multiple identical function calls

Is it possible for Oracle to reuse the result of a function when it is called in the same query (transaction?) without the use of the function result cache?
The application I am working with is heavily reliant on Oracle functions. Many queries end up executing the exact same functions multiple times.
A typical example would be:
SELECT my_package.my_function(my_id),
my_package.my_function(my_id) / 24,
my_package.function_also_calling_my_function(my_id)
FROM my_table
WHERE my_table.id = my_id;
I have noticed that Oracle always executes each of these functions, not realizing that the same function was called just a second ago in the same query. It is possible that some elements in the function get cached, resulting in a slightly faster return. This is not relevant to my question as I want to avoid the entire second or third execution.
Assume that the functions are fairly resource-consuming and may call further functions, basing their results on tables that are reasonably large and frequently updated (a million records, with say 1,000 updates per hour). For this reason it is not possible to use Oracle's Function Result Cache.
Even though the data is changing frequently, I expect the result of these functions to be the same when they are called from the same query.
Is it possible for Oracle to reuse the result of these functions and how? I am using Oracle11g and Oracle12c.
Below is an example (just a random non-sense function to illustrate the problem):
-- Takes 200 ms
SELECT test_package.testSpeed('STANDARD', 'REGEXP_COUNT')
FROM dual;
-- Takes 400ms
SELECT test_package.testSpeed('STANDARD', 'REGEXP_COUNT')
, test_package.testSpeed('STANDARD', 'REGEXP_COUNT')
FROM dual;
Used functions:
CREATE OR REPLACE PACKAGE test_package IS
FUNCTION testSpeed (p_package_name VARCHAR2, p_object_name VARCHAR2)
RETURN NUMBER;
END;
/
CREATE OR REPLACE PACKAGE BODY test_package IS
FUNCTION testSpeed (p_package_name VARCHAR2, p_object_name VARCHAR2)
RETURN NUMBER
IS
ln_total NUMBER;
BEGIN
SELECT SUM(position) INTO ln_total
FROM all_arguments
WHERE package_name = p_package_name  -- use the parameters rather than hard-coded literals
AND object_name = p_object_name;
RETURN ln_total;
END testSpeed;
END;
/
Add an inline view and a ROWNUM to prevent Oracle from rewriting the query into a single query block, which is what leads to the functions being executed multiple times.
Sample function and demonstration of the problem
create or replace function wait_1_second return number is
begin
execute immediate 'begin dbms_lock.sleep(1); end;';
-- ...
-- Do something here to make caching impossible.
-- ...
return 1;
end;
/
--1 second
select wait_1_second() from dual;
--2 seconds
select wait_1_second(), wait_1_second() from dual;
--3 seconds
select wait_1_second(), wait_1_second() , wait_1_second() from dual;
Simple query changes that do NOT work
Both of these methods still take 2 seconds, not 1.
select x, x
from
(
select wait_1_second() x from dual
);
with execute_function as (select wait_1_second() x from dual)
select x, x from execute_function;
Forcing Oracle to execute in a specific order
It's difficult to tell Oracle "execute this code by itself, don't do any predicate pushing, merging, or other transformations on it". There are hints for each of those optimizations, but they are difficult to use. There are a few ways to disable those transformations; adding an extra ROWNUM is usually the easiest.
--Only takes 1 second
select x, x
from
(
select wait_1_second() x, rownum
from dual
);
It's hard to see exactly where the functions get evaluated. But these explain plans show how the ROWNUM causes the inline view to run separately.
explain plan for select x, x from (select wait_1_second() x from dual);
select * from table(dbms_xplan.display(format=>'basic'));
Plan hash value: 1388734953
---------------------------------
| Id | Operation | Name |
---------------------------------
| 0 | SELECT STATEMENT | |
| 1 | FAST DUAL | |
---------------------------------
explain plan for select x, x from (select wait_1_second() x, rownum from dual);
select * from table(dbms_xplan.display(format=>'basic'));
Plan hash value: 1143117158
---------------------------------
| Id | Operation | Name |
---------------------------------
| 0 | SELECT STATEMENT | |
| 1 | VIEW | |
| 2 | COUNT | |
| 3 | FAST DUAL | |
---------------------------------
You can try the DETERMINISTIC keyword to mark a function as pure. Whether or not this actually improves performance is another question, though.
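For illustration only, the declaration would look something like the sketch below (add_vat is a made-up example); note that a function reading from frequently changing tables, as in the question, is not truly deterministic, so this may not help there:
CREATE OR REPLACE FUNCTION add_vat(p_amount NUMBER)
  RETURN NUMBER DETERMINISTIC
IS
BEGIN
  -- safe only if the result depends on nothing but the inputs
  RETURN p_amount * 1.2;
END;
/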
Update:
I don't know how realistic your example above is, but in theory you can always try to restructure your SQL so it knows about repeated function calls (actually repeated values). Something like:
select x,x from (
SELECT test_package.testSpeed('STANDARD', 'REGEXP_COUNT') x
FROM dual
)
Factor the function calls out once, using an inline view or a WITH clause:
with get_functions as(
SELECT my_package.my_function(my_id) as func_val,
my_package.function_also_calling_my_function(my_id) func_val_2
FROM my_table
WHERE my_table.id = my_id
)
select func_val,
func_val / 24 as func_val_adj,
func_val_2
from get_functions;
If you want to eliminate the call for item 3, instead pass the result of func_val to the third function.
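If the third function can be refactored to accept the already-computed value instead of re-deriving it from my_id (the my_function_from_value name below is hypothetical), the remaining duplicate work disappears as well:
with get_functions as (
    select my_package.my_function(my_id) as func_val
    from   my_table
    where  my_table.id = my_id
)
select func_val,
       func_val / 24 as func_val_adj,
       my_package.my_function_from_value(func_val) as func_val_2
from   get_functions;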

Alternating true/false flag for current time given step

I have a query being executed every X milliseconds. As part of the result I would like to have an alternating true/false flag. This flag should change whenever the query is executed again.
Example
Sample query: select 1, <<boolean alternator>>;
1st execution returns: 1 | true
2nd execution returns: 1 | false
3rd execution returns: 1 | true
and so on. It does not matter whether it returns true or false the first time.
For cases when X is odd number of seconds I have the following solution:
select
mod(right(extract(epoch from current_timestamp)::int::varchar,1)::int, 2) = 0
as alternator
This extracts the last digit from the epoch and then tests whether it is an even number. Because X is defined as an odd number of seconds, this test alternates from one execution to the next.
What would work in the same way when X is different, i.e. even, or not a whole number of seconds? I would like to make it work for values of X like 500 ms, 1200 ms, 2000 ms, ...
Note: I plan to use this with PostgreSQL.
I suggest a dedicated SEQUENCE.
CREATE SEQUENCE tf_seq MINVALUE 0 MAXVALUE 1 START 0 CYCLE;
Each call with nextval() returns 0 / 1 alternating. You can cast to boolean:
0::bool = FALSE
1::bool = TRUE
So:
SELECT nextval('tf_seq'::regclass)::int::bool;
To keep other roles from messing with the state of the sequence, grant USAGE on it to a single dedicated role and to nobody else:
GRANT USAGE ON SEQUENCE tf_seq TO $dedicated_role;
Run your query as that role, or create a wrapper function with SECURITY DEFINER and ALTER FUNCTION foo() OWNER TO $dedicated_role;
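A minimal sketch of such a wrapper, with next_flag() and flag_owner as placeholder names:
CREATE OR REPLACE FUNCTION next_flag()
  RETURNS bool
  LANGUAGE sql SECURITY DEFINER AS
$$SELECT nextval('tf_seq')::int::bool$$;

ALTER FUNCTION next_flag() OWNER TO flag_owner;  -- flag_owner holds USAGE on tf_seq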
Or, simpler yet, just make it the column default and completely ignore it in your inserts:
ALTER TABLE foo ALTER COLUMN bool_col
SET DEFAULT nextval('tf_seq'::regclass)::int::bool;
You need to grant USAGE on the sequence to roles that can insert.
Every next row gets the flipped value automatically.
The usual notes for sequences apply. Like, if you roll back an INSERT, the sequence stays flipped. Sequence states are never rolled back.
A temporary table can save the boolean state:
create temporary table t (b boolean);
insert into t (b) values (true);
with u as (
   update t
   set    b = not b
   returning b
)
select 1, b
from   t;  -- the data-modifying CTE runs even though it is not referenced;
           -- selecting from t returns the pre-update value, which still
           -- alternates on every execution (select from u for the new value)
I would do the same (boolean flipping with not b) with a PL/Tcl function.
Advantage: the variable is cached in your session's Tcl interpreter. Disadvantage: any PL/* function has call overhead.
Test table, insert some values:
strobel=# create table test (i integer);
CREATE TABLE
strobel=# insert into test (select * from generate_series(1, 10));
INSERT 0 10
The function:
create or replace function flipper () returns boolean as $$
if {![info exists ::flag]} {set ::flag false}
return [set ::flag [expr {! $::flag}]]
$$ language pltcl volatile;
The test:
strobel=# select *, flipper() from test;
i | flipper
----+---------
1 | t
2 | f
3 | t
4 | f
5 | t

How to avoid multiple function evals with the (func()).* syntax in a query?

Context
When a function returns a TABLE or a SETOF composite-type, like this one:
CREATE FUNCTION func(n int) returns table(i int, j bigint) as $$
BEGIN
RETURN QUERY select 1,n::bigint
union all select 2,n*n::bigint
union all select 3,n*n*n::bigint;
END
$$ language plpgsql;
the results can be accessed by various methods:
select * from func(3) will produce these output columns:
i | j
---+---
1 | 3
2 | 9
3 | 27
select func(3) will produce only one output column of ROW type.
func
-------
(1,3)
(2,9)
(3,27)
select (func(3)).* will produce the same output as #1:
i | j
---+---
1 | 3
2 | 9
3 | 27
When the function argument comes from a table or a subquery, the syntax #3 is the only possible one, as in:
select N, (func(N)).*
from (select 2 as N union select 3 as N) s;
or as in this related answer. If we had LATERAL JOIN we could use that, but until PostgreSQL 9.3 is out, it's not supported, and the previous versions will still be used for years anyway.
Problem
The problem with syntax #3 is that the function is called as many times as there are columns in the result. There's no apparent reason for that, but it happens.
We can see it in version 9.2 by adding a RAISE NOTICE 'called for %', n in the function. With the query above, it outputs:
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 3
NOTICE: called for 3
Now, if the function is changed to return 4 columns, like this:
CREATE FUNCTION func(n int) returns table(i int, j bigint,k int, l int) as $$
BEGIN
raise notice 'called for %', n;
RETURN QUERY select 1,n::bigint,1,1
union all select 2,n*n::bigint,1,1
union all select 3,n*n*n::bigint,1,1;
END
$$ language plpgsql stable;
then the same query outputs:
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 3
NOTICE: called for 3
NOTICE: called for 3
NOTICE: called for 3
2 function calls were needed, 8 were actually made. The ratio is the number of output columns.
With syntax #2, which produces the same result except for the layout of the output columns, these multiple calls don't happen:
select N, func(N)
from (select 2 as N union select 3 as N) s;
gives:
NOTICE: called for 2
NOTICE: called for 3
followed by the 6 resulting rows:
n | func
---+------------
2 | (1,2,1,1)
2 | (2,4,1,1)
2 | (3,8,1,1)
3 | (1,3,1,1)
3 | (2,9,1,1)
3 | (3,27,1,1)
Questions
Is there a syntax or a construct with 9.2 that would achieve the expected result by doing only the minimum required function calls?
Bonus question: why do the multiple evaluations happen at all?
You can wrap it up in a subquery but that's not guaranteed safe without the OFFSET 0 hack. In 9.3, use LATERAL. The problem is caused by the parser effectively macro-expanding * into a column list.
Workaround
Where:
SELECT (my_func(x)).* FROM some_table;
will evaluate my_func n times for n result columns from the function, this formulation:
SELECT (mf).* FROM (
SELECT my_func(x) AS mf FROM some_table
) sub;
generally will not, and tends not to add an additional scan at runtime. To guarantee that multiple evaluation won't be performed you can use the OFFSET 0 hack or abuse PostgreSQL's failure to optimise across CTE boundaries:
SELECT (mf).* FROM (
SELECT my_func(x) AS mf FROM some_table OFFSET 0
) sub;
or:
WITH tmp(mf) AS (
SELECT my_func(x) FROM some_table
)
SELECT (mf).* FROM tmp;
In PostgreSQL 9.3 you can use LATERAL to get a saner behaviour:
SELECT mf.*
FROM some_table
LEFT JOIN LATERAL my_func(some_table.x) AS mf ON true;
LEFT JOIN LATERAL ... ON true retains all rows like the original query, even if the function call returns no row.
Demo
Create a function that isn't inlineable as a demonstration:
CREATE OR REPLACE FUNCTION my_func(integer)
RETURNS TABLE(a integer, b integer, c integer) AS $$
BEGIN
RAISE NOTICE 'my_func(%)',$1;
RETURN QUERY SELECT $1, $1, $1;
END;
$$ LANGUAGE plpgsql;
and a table of dummy data:
CREATE TABLE some_table AS SELECT x FROM generate_series(1,10) x;
then try the above versions. You'll see that the first one raises three notices per invocation; the others raise only one.
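For example, using the demo objects above, you can compare the direct expansion with the OFFSET 0 wrapper and count the notices:
-- direct form: expect one NOTICE per output column for every row
SELECT (my_func(x)).* FROM some_table;

-- wrapped form: expect a single NOTICE per row
SELECT (mf).*
FROM  (SELECT my_func(x) AS mf FROM some_table OFFSET 0) sub;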
Why?
Good question. It's horrible.
It looks like:
(func(x)).*
is expanded as:
(func(x)).i, (func(x)).j, (func(x)).k, (func(x)).l
in parsing, according to a look at debug_print_parse, debug_print_rewritten and debug_print_plan. The (trimmed) parse tree looks like this:
:targetList (
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 1
:resulttype 23
:resulttypmod -1
:resultcollid 0
}
:resno 1
:resname i
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 2
:resulttype 20
:resulttypmod -1
:resultcollid 0
}
:resno 2
:resname j
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 3
:...
}
:resno 3
:resname k
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 4
...
}
:resno 4
:resname l
...
}
)
So basically, we're using a dumb parser hack to expand wildcards by cloning nodes.

Selecting positive aggregate value and ignoring negative in Postgres SQL

I must apply a certain transformation fn(argument). Here argument is equal to value, except when value is negative: after the first negative value you "wait" until it sums up with consecutive values and the running sum becomes positive again, and only then apply fn(argument). This is the table I want to get:
value argument
---------------------
2 2
3 3
-10 0
4 0
3 0
10 7
1 1
I could have summed all the values and applied fn to the sum, but fn can differ from row to row and it is essential to know the row number to choose the concrete fn.
As I want a Postgres SQL solution, window functions look like a fit, but I am not experienced enough to write an expression that does this yet. In fact, I am new to "thinking in SQL", unfortunately. I guess this could easily be done in an imperative way, but I do not want to write a stored procedure yet.
I suppose I'm late, but this may help someone:
select
value,
greatest(0, value) as argument
from your_table;
This doesn't really fit any of the predefined aggregation functions. You probably need to write your own. Note that in postgresql, aggregate functions can be used as window functions, and in fact that is the only way to write window functions in anything other than C, as of 9.0.
You can write a function that tracks the state of "summing" the values, except that it always returns the input value if the current "sum" is positive, and just keeps adding while the "sum" is negative. Then you simply need to take the greater of either this sum or zero. To wit:
-- accumulator function: first arg is state, second arg is input
create or replace function ouraggfunc(int, int)
returns int immutable language plpgsql as $$
begin
raise info 'ouraggfunc: %, %', $1, $2; -- to help you see what's going on
-- get started by returning the first value ($1 is null - no state - first row)
if $1 is null then
return $2;
end if;
-- if our state is negative, we're summing until it becomes positive
-- otherwise, we're just returning the input
if $1 < 0 then
return $1 + $2;
else
return $2;
end if;
end;
$$;
You need to create an aggregate function to invoke this accumulator:
create aggregate ouragg(basetype = int, sfunc = ouraggfunc, stype = int);
This defines that the aggregate takes integers as input and stores its state as an integer.
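On current Postgres versions the same aggregate can also be declared with the newer argument-list syntax (equivalent, just avoiding the deprecated basetype form):
create aggregate ouragg(int) (
    sfunc = ouraggfunc,
    stype = int
);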
I copied your example into a table:
steve#steve#[local] =# create table t(id serial primary key, value int not null, argument int not null);
NOTICE: CREATE TABLE will create implicit sequence "t_id_seq" for serial column "t.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t_pkey" for table "t"
CREATE TABLE
steve#steve#[local] =# copy t(value, argument) from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 2 2
>> 3 3
>> -10 0
>> 4 0
>> 3 0
>> 10 7
>> 1 1
>> \.
And you can now have those values produced by using the aggregate function with a window clause:
steve#steve#[local] =# select value, argument, ouragg(value) over(order by id) from t;
INFO: ouraggfunc: <NULL>, 2
INFO: ouraggfunc: 2, 3
INFO: ouraggfunc: 3, -10
INFO: ouraggfunc: -10, 4
INFO: ouraggfunc: -6, 3
INFO: ouraggfunc: -3, 10
INFO: ouraggfunc: 7, 1
value | argument | ouragg
-------+----------+--------
2 | 2 | 2
3 | 3 | 3
-10 | 0 | -10
4 | 0 | -6
3 | 0 | -3
10 | 7 | 7
1 | 1 | 1
(7 rows)
So as you can see, the final step is that you need to take the output of the aggregate if it is positive, and zero otherwise. This can be done by wrapping the query, or by writing a function to do it:
create function positive(int) returns int immutable strict language sql as
$$ select case when $1 > 0 then $1 else 0 end $$;
and now:
select value, argument, positive(ouragg(value) over(order by id)) as raw_agg from t
This produces the arguments for the function that you specified in the question.
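Alternatively, a quick sketch of the wrap-the-query route mentioned above, using greatest() instead of the helper function:
select value, argument,
       greatest(ouragg(value) over(order by id), 0) as computed_argument
from t;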