Oracle performance: query executing multiple identical function calls

Is it possible for Oracle to reuse the result of a function when it is called in the same query (transaction?) without the use of the function result cache?
The application I am working with is heavily reliant on Oracle functions. Many queries end up executing the exact same functions multiple times.
A typical example would be:
SELECT my_package.my_function(my_id),
my_package.my_function(my_id) / 24,
my_package.function_also_calling_my_function(my_id)
FROM my_table
WHERE my_table.id = my_id;
I have noticed that Oracle always executes each of these functions, not realizing that the same function was called just a second ago in the same query. It is possible that some elements in the function get cached, resulting in a slightly faster return. This is not relevant to my question as I want to avoid the entire second or third execution.
Assume that the functions are fairly resource-consuming and that they may call more functions, basing their result on tables that are reasonably large and frequently updated (a million records, with roughly 1000 updates per hour). For this reason it is not possible to use Oracle's Function Result Cache.
Even though the data is changing frequently, I expect the result of these functions to be the same when they are called from the same query.
Is it possible for Oracle to reuse the result of these functions, and if so, how? I am using Oracle 11g and Oracle 12c.
Below is an example (just a random nonsense function to illustrate the problem):
-- Takes 200 ms
SELECT test_package.testSpeed('STANDARD', 'REGEXP_COUNT')
FROM dual;
-- Takes 400ms
SELECT test_package.testSpeed('STANDARD', 'REGEXP_COUNT')
, test_package.testSpeed('STANDARD', 'REGEXP_COUNT')
FROM dual;
Used functions:
CREATE OR REPLACE PACKAGE test_package IS
FUNCTION testSpeed (p_package_name VARCHAR2, p_object_name VARCHAR2)
RETURN NUMBER;
END;
/
CREATE OR REPLACE PACKAGE BODY test_package IS
FUNCTION testSpeed (p_package_name VARCHAR2, p_object_name VARCHAR2)
RETURN NUMBER
IS
ln_total NUMBER;
BEGIN
SELECT SUM(position) INTO ln_total
FROM all_arguments
WHERE package_name = 'STANDARD'
AND object_name = 'REGEXP_COUNT';
RETURN ln_total;
END testSpeed;
END;
/

Add an inline view and a ROWNUM to prevent Oracle from rewriting the query into a single query block and executing the function multiple times.
Sample function and demonstration of the problem
create or replace function wait_1_second return number is
begin
execute immediate 'begin dbms_lock.sleep(1); end;';
-- ...
-- Do something here to make caching impossible.
-- ...
return 1;
end;
/
--1 second
select wait_1_second() from dual;
--2 seconds
select wait_1_second(), wait_1_second() from dual;
--3 seconds
select wait_1_second(), wait_1_second() , wait_1_second() from dual;
Simple query changes that do NOT work
Both of these methods still take 2 seconds, not 1.
select x, x
from
(
select wait_1_second() x from dual
);
with execute_function as (select wait_1_second() x from dual)
select x, x from execute_function;
Forcing Oracle to execute in a specific order
It's difficult to tell Oracle "execute this code by itself, don't do any predicate pushing, merging, or other transformations on it". There are hints for each of those optimizations, but they are difficult to use. There are a few ways to disable those transformations; adding an extra ROWNUM is usually the easiest.
--Only takes 1 second
select x, x
from
(
select wait_1_second() x, rownum
from dual
);
It's hard to see exactly where the functions get evaluated. But these explain plans show how the ROWNUM causes the inline view to run separately.
explain plan for select x, x from (select wait_1_second() x from dual);
select * from table(dbms_xplan.display(format=>'basic'));
Plan hash value: 1388734953
---------------------------------
| Id | Operation | Name |
---------------------------------
| 0 | SELECT STATEMENT | |
| 1 | FAST DUAL | |
---------------------------------
explain plan for select x, x from (select wait_1_second() x, rownum from dual);
select * from table(dbms_xplan.display(format=>'basic'));
Plan hash value: 1143117158
---------------------------------
| Id | Operation | Name |
---------------------------------
| 0 | SELECT STATEMENT | |
| 1 | VIEW | |
| 2 | COUNT | |
| 3 | FAST DUAL | |
---------------------------------
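Applied to the query from the question, the ROWNUM approach might look like the sketch below. It reuses the names from the question (my_package, my_table, my_id); the column aliases and the rn column are only illustrative. Whether the functions really run once per row should be confirmed on your own Oracle version, for example by timing the query.

```sql
-- Sketch only: the inner ROWNUM is there to keep Oracle from merging the
-- inline view into the outer query block.
SELECT func_val,
       func_val / 24 AS func_val_adj,
       func_val_2
FROM (
    SELECT my_package.my_function(my_id) AS func_val,
           my_package.function_also_calling_my_function(my_id) AS func_val_2,
           ROWNUM AS rn  -- illustrative alias, not used by the outer query
    FROM my_table
    WHERE my_table.id = my_id
);
```

Note that function_also_calling_my_function still calls my_function internally, so this removes only the repeated call made by the query itself.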

You can try the deterministic keyword to mark functions as pure. Whether or not this actually improves performance is another question though.
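For reference, the keyword goes between the RETURN clause and IS. The function below is a hypothetical stand-in (double_it is not from the question). Oracle does not verify the declaration, so it is only appropriate for functions whose result depends on nothing but their arguments, which may not hold for functions that read frequently updated tables.

```sql
-- Hypothetical example function, used only to show where DETERMINISTIC goes.
CREATE OR REPLACE FUNCTION double_it (p_value NUMBER)
RETURN NUMBER
DETERMINISTIC
IS
BEGIN
    -- The result depends only on the argument, so the declaration is honest.
    RETURN p_value * 2;
END double_it;
/
```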
Update:
I don't know how realistic your example above is, but in theory you can always try to restructure your SQL so it knows about repeated function calls (actually repeated values). Something like:
select x,x from (
SELECT test_package.testSpeed('STANDARD', 'REGEXP_COUNT') x
FROM dual
)

Use an in-line view.
with get_functions as(
SELECT my_package.my_function(my_id) as func_val,
my_package.function_also_calling_my_function(my_id) func_val_2
FROM my_table
WHERE my_table.id = my_id
)
select func_val,
func_val / 24 as func_val_adj,
func_val_2
from get_functions;
If you want to eliminate the call for item 3, instead pass the result of func_val to the third function.

Related

Achieving window function-like behavior using a PostgreSQL user defined function?

Let's say that given a table observations_tbl with attributes date (day) and value, I want to produce the new attribute prev_day_value to get the following table:
|---------------------|-------|----------------|
| date | value | prev_day_value |
|---------------------|-------|----------------|
| 01.01.2015 00:00:00 | 5 | 0 |
| 02.01.2015 00:00:00 | 4 | 5 |
| 03.01.2015 00:00:00 | 3 | 4 |
| 04.01.2015 00:00:00 | 2 | 3 |
|---------------------|-------|----------------|
I am well aware that such an output can typically be obtained using a WINDOW function. But how would I achieve this through a PostgreSQL user-defined function? I am in a situation where I must use a function. It is difficult to explain why without going into detail, but these are the restrictions I have, and if anything, it is a technical challenge.
Take into consideration this template query:
SELECT *, lag(value,1) AS prev_day_value -- or lag(record,1) or lag(date,value,1) or lag(date,1) or lag(observations_tbl,1), etc.
FROM observations_tbl
I am using function lag with parameter 1 to look for a value which comes before the current row by 1 - a distance of 1 row. I don't care what other parameters the function lag can have (table name, other attributes) - what could the function lag look like to achieve such functionality? The function can be of any language, SQL, PL/pgSQL and even C using PostgreSQL API/backend.
I understand that one answer can be wrapping a WINDOW query inside lag user defined function. But I am thinking that would be a rather costly operation if I have to scan the entire table twice (once inside the lag function and once outside). I was thinking that maybe each PostgreSQL record would have a pointer to its previous record which is directly accessible? Or that I can somehow open a cursor at this specific row / row number without having to scan the entire table? Or is what I am asking impossible?
Your request cannot be solved with purely relational tools (window functions are a non-relational extension to SQL). In C you can write your own alternative to the lag function. You can do the same in PL/v8 (JavaScript). Unfortunately, the window function API is not available in PL/pgSQL, so you cannot write a simple PL/pgSQL function that accesses a row other than the one currently being processed.
The one possible alternative (with some performance risk) is to write a table function. There you have control over the whole dataset being processed, so the operation is simple:
CREATE OR REPLACE FUNCTION report()
RETURNS TABLE(d date, v int, prev_v int) AS $$
DECLARE r RECORD;
BEGIN
prev_v := 0;
FOR r IN SELECT date, value FROM observations_tbl t ORDER BY 1
LOOP
d := r.date; v := r.value;
RETURN NEXT;
prev_v := v;
END LOOP;
END;
$$ LANGUAGE plpgsql;
There is no other usable alternative. In the old days these values were calculated with correlated self-joins, but that approach performs terribly.
What Pavel posted, just with fewer assignments. Should be faster:
CREATE OR REPLACE FUNCTION report()
RETURNS TABLE(d date, v int, prev_v int) AS
$func$
BEGIN
prev_v := 0;
FOR d, v IN
SELECT date, value FROM observations_tbl ORDER BY 1
LOOP
RETURN NEXT;
prev_v := v;
END LOOP;
END
$func$ LANGUAGE plpgsql;
The general idea can pay if it actually replaces multiple scans over the table with a single one. Like here:
GROUP BY and aggregate sequential numeric values
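To try either version of report(), a small demo along these lines should work. It assumes the sample rows from the question and a date column of type date (the question shows timestamps, so that is a simplification). The temporary table is created only for this demo and would shadow a real observations_tbl for the session.

```sql
-- Demo data matching the question's sample (column types are assumptions).
CREATE TEMP TABLE observations_tbl (date date, value int);

INSERT INTO observations_tbl (date, value) VALUES
    ('2015-01-01', 5),
    ('2015-01-02', 4),
    ('2015-01-03', 3),
    ('2015-01-04', 2);

-- The table function is queried like a table.
SELECT d, v, prev_v
FROM report()
ORDER BY d;
-- Returns the (v, prev_v) pairs (5, 0), (4, 5), (3, 4), (2, 3) in date order.
```

The table is scanned once inside the function, which is the point of the approach.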

Remove duplicate entries from string array column of postgres

I have a PostgreSQL table with a column that holds an array of strings. Some rows contain only unique strings, while others contain duplicates. I want to remove the duplicate strings from each row where they exist.
I have tried some queries but couldn't make it work.
Following is the table:
veh_id | vehicle_types
--------+----------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8","viper"}
7 | {"ferrariff","viper","viper","volt"}
I am expecting following output:
veh_id | vehicle_types
--------+----------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8"}
7 | {"ferrariff","viper","volt"}
Since each row's array is independent, a plain correlated subquery with an ARRAY constructor would do the job:
SELECT *, ARRAY(SELECT DISTINCT unnest (vehicle_types)) AS vehicle_types_uni
FROM vehicle;
See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
Note that NULL is converted to an empty array ('{}'). We'd need to special-case it, but it is excluded in the UPDATE below anyway.
Fast and simple. But don't use this. You didn't say so, but typically you'd want to preserve original order of array elements. Your rudimentary sample suggests as much. Use WITH ORDINALITY in the correlated subquery, which becomes a bit more sophisticated:
SELECT *, ARRAY (SELECT v
FROM unnest(vehicle_types) WITH ORDINALITY t(v,ord)
GROUP BY 1
ORDER BY min(ord)
) AS vehicle_types_uni
FROM vehicle;
See:
PostgreSQL unnest() with element number
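The order-preserving subquery can be checked in isolation on a literal array. This is a minimal sketch using the values from row 6 of the sample table; no table is needed.

```sql
-- Deduplicate one array literal while keeping first-occurrence order.
SELECT ARRAY (
    SELECT v
    FROM unnest(ARRAY['viper', 'ferrariff', 'bmwi8', 'viper'])
         WITH ORDINALITY AS t(v, ord)
    GROUP BY v
    ORDER BY min(ord)   -- position of the first occurrence of each value
) AS vehicle_types_uni;
-- Result: {viper,ferrariff,bmwi8}
```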
UPDATE to actually remove dupes:
UPDATE vehicle
SET vehicle_types = ARRAY (
SELECT v
FROM unnest(vehicle_types) WITH ORDINALITY t(v,ord)
GROUP BY 1
ORDER BY min(ord)
)
WHERE cardinality(vehicle_types) > 1 -- optional
AND vehicle_types <> ARRAY (
SELECT v
FROM unnest(vehicle_types) WITH ORDINALITY t(v,ord)
GROUP BY 1
ORDER BY min(ord)
); -- suppress empty updates (optional)
Both added WHERE conditions are optional to improve performance. The 1st one is completely redundant. Each condition also excludes the NULL case. The 2nd one suppresses all empty updates.
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If you tried to do that without preserving original order, you'd likely update most rows without need, just because the order of elements changed even without dupes.
Requires Postgres 9.4 or later.
I don't claim it's efficient, but something like this might work:
with expanded as (
select veh_id, unnest (vehicle_types) as vehicle_type
from vehicles
)
select veh_id, array_agg (distinct vehicle_type)
from expanded
group by veh_id
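If you really want to get fancy and do it in a single pass over each array, you can write a custom function. Note that each element is still compared against the elements kept so far, so the worst case for an array of all-distinct values is O(n²), not O(n):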
create or replace function unique_array(input_array text[])
returns text[] as $$
DECLARE
output_array text[];
i integer;
BEGIN
output_array = array[]::text[];
for i in 1..cardinality(input_array) loop
if not (input_array[i] = any (output_array)) then
output_array := output_array || input_array[i];
end if;
end loop;
return output_array;
END;
$$
language plpgsql
Usage example:
select veh_id, unique_array(vehicle_types)
from vehicles

PostgreSQL efficiently find last descendant in linear list

I am currently trying to efficiently retrieve the last descendant from a linked-list-like structure.
Essentially there's a table with a data series, which I split up by certain criteria to get a list like this:
current_id | next_id
for example
1 | 2
2 | 3
3 | 4
4 | NULL
42 | 43
43 | 45
45 | NULL
etc...
would result in lists like
1 -> 2 -> 3 -> 4
and
42 -> 43 -> 45
Now I want to get the first and the last id from each of those lists.
This is what I have right now:
WITH RECURSIVE contract(rstart_ts, rend_ts) AS ( -- recursive query to traverse the "linked list" of continuous timestamps
SELECT start_ts, end_ts FROM track_caps tc
UNION
SELECT c.rstart_ts, tc.end_ts AS end_ts0 FROM contract c INNER JOIN track_caps tc ON (tc.start_ts = c.rend_ts AND c.rend_ts IS NOT NULL AND tc.end_ts IS NOT NULL)
),
fcontract AS ( --final step, after traversing the "linked list", pick the largest timestamp found as the end_ts and the smallest as the start_ts
SELECT DISTINCT ON(start_ts, end_ts) min(rstart_ts) AS start_ts, rend_ts AS end_ts
FROM (
SELECT rstart_ts, max(rend_ts) AS rend_ts FROM contract
GROUP BY rstart_ts
) sq
GROUP BY end_ts
)
SELECT * FROM fcontract
ORDER BY start_ts
In this case I just used timestamps which work fine for the given data.
Basically I just use a recursive query that walks through all the nodes until it reaches the end, as suggested by many other posts on StackOverflow and other sites. The next query removes all the sub-steps and returns what I want, like in the first list example: 1 | 4
Just for illustration, the produced result set by the recursive query looks like this:
1 | 2
2 | 3
3 | 4
1 | 3
2 | 4
1 | 4
As nicely as it works, it's quite a memory hog, which is absolutely unsurprising when looking at the results of EXPLAIN ANALYZE.
For a dataset of roughly 42,600 rows, the recursive query produces a whopping 849,542,346 rows. It was actually supposed to process around 2,000,000 rows, but with the current solution that seems infeasible.
Did I just use recursive queries improperly? Is there a way to reduce the amount of data it produces (like removing the sub-steps)?
Or are there better single-query solutions to this problem?
The main problem is that your recursive query doesn't properly filter the root nodes, which is caused by the model you have. So the non-recursive part already selects the entire table and then Postgres needs to recurse for each and every row of the table.
To make that more efficient only select the root nodes in the non-recursive part of your query. This can be done using:
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
from track_caps t2
where t2.next_id = t1.current_id)
Now that is still not very efficient (compared to the "usual" where parent_id is null design), but at least it makes sure the recursion doesn't need to process more rows than necessary.
To find the root node of each tree, just select that as an extra column in the non-recursive part of the query and carry it over to each row in the recursive part.
So you wind up with something like this:
with recursive contract as (
select t1.current_id, t1.next_id, t1.current_id as root_id
from track_caps t1
where not exists (select *
from track_caps t2
where t2.next_id = t1.current_id)
union
select c.current_id, c.next_id, p.root_id
from track_caps c
join contract p on c.current_id = p.next_id
and c.next_id is not null
)
select *
from contract
order by current_id;
Online example: http://rextester.com/DOABC98823
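To reduce the traversal to one row per list (first and last id), one option is to keep the terminal row in the recursion and filter on it at the end. The sketch below is a variation on the query above, not the answerer's exact method. It assumes the current_id / next_id layout from the question, that every list ends in a row with a NULL next_id, and that there are no cycles (so UNION ALL is safe). The table track_caps_demo is a hypothetical stand-in filled with the question's sample data.

```sql
-- Hypothetical demo table with the sample rows from the question.
CREATE TEMP TABLE track_caps_demo (current_id int, next_id int);

INSERT INTO track_caps_demo (current_id, next_id) VALUES
    (1, 2), (2, 3), (3, 4), (4, NULL),
    (42, 43), (43, 45), (45, NULL);

WITH RECURSIVE chain AS (
    -- Root nodes: rows that no other row points to.
    SELECT t1.current_id, t1.next_id, t1.current_id AS root_id
    FROM track_caps_demo t1
    WHERE NOT EXISTS (SELECT 1
                      FROM track_caps_demo t2
                      WHERE t2.next_id = t1.current_id)
    UNION ALL
    -- Follow the next_id pointer and carry the root along.
    SELECT c.current_id, c.next_id, p.root_id
    FROM track_caps_demo c
    JOIN chain p ON c.current_id = p.next_id
)
SELECT root_id AS first_id, current_id AS last_id
FROM chain
WHERE next_id IS NULL      -- keep only the last node of each list
ORDER BY first_id;
-- Returns (1, 4) and (42, 45) for the sample data.
```

Each row is visited once per list it belongs to, so the intermediate result grows linearly with the table rather than with the number of sub-paths.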

Random() in Redshift CTE returns wildly incorrect results under certain conditions

(Cross posting this from the AWS forums...)
Need a fairly sizable chunk of dummy data for this. I used this list of English words: http://www.mieliestronk.com/corncob_lowercase.txt
I'm seeing a MASSIVE difference in the number of results I get for seemingly equivalent queries involving the random() function within a CTE in Amazon Redshift. (I'm trying to take a random sample - one query returns an actual sample as expected, the other basically just returns the entire list of items I was trying to sample.)
Can somebody take a look at this? Am I doing something wrong? Is there another issue here?
/* Create tables to hold words */
create table main_words(word varchar(max));
create table couple_words(word varchar(max));
/* Get some words */
copy main_words
from 'S3 LOCATION OF CORNCOB FILE'
credentials 'aws_access_key_id=ID;aws_secret_access_key=KEY'
csv;
/* Put a few in another table */
insert into
couple_words
select top 5000
word
from
main_words;
/* Returns about 500 results */
with the_cte as
(
select
word,
random() as random_value
from
main_words
where
word not in (select word from couple_words)
)
select
count(*)
from
the_cte
where
random_value > .99;
/* Returns about 58,000 results (basically, the whole list) */
with the_cte as
(
select
word
from
main_words
where
word not in (select word from couple_words)
and random() > .99
)
select
count(*)
from
the_cte;
/* Clean up */
drop table if exists main_words;
drop table if exists couple_words;
Have you tried it on a different server?
I just created a sample on SQL Fiddle with 100 rows and random() > 0.9, and the results of the two queries are very similar.
First CTE
| count |
|-------|
| 4 |
Second CTE
| count |
|-------|
| 13 |
Average count(*) with 10 runs
CTE 1 CTE 2
8.3 9.8
I suspect some funky query rewriting. If you have to have the inner query, you can use LIMIT 2147483647 inside and see what comes up.
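As a sketch of where that LIMIT would go, here is the second query from the question with the suggested clause added inside the CTE. This only illustrates the suggestion; whether it actually changes how Redshift evaluates random() here has not been verified.

```sql
with the_cte as
(
    select
        word
    from
        main_words
    where
        word not in (select word from couple_words)
        and random() > .99
    limit 2147483647  -- largest 32-bit integer, so no rows are actually cut off
)
select
    count(*)
from
    the_cte;
```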

How to avoid multiple function evals with the (func()).* syntax in a query?

Context
When a function returns a TABLE or a SETOF composite-type, like this one:
CREATE FUNCTION func(n int) returns table(i int, j bigint) as $$
BEGIN
RETURN QUERY select 1,n::bigint
union all select 2,n*n::bigint
union all select 3,n*n*n::bigint;
END
$$ language plpgsql;
the results can be accessed by various methods:
1. select * from func(3) will produce these output columns:
i | j
---+---
1 | 3
2 | 9
3 | 27
2. select func(3) will produce only one output column of ROW type:
func
-------
(1,3)
(2,9)
(3,27)
3. select (func(3)).* will produce the same output as #1:
i | j
---+---
1 | 3
2 | 9
3 | 27
When the function argument comes from a table or a subquery, the syntax #3 is the only possible one, as in:
select N, (func(N)).*
from (select 2 as N union select 3 as N) s;
or as in this related answer. If we had LATERAL JOIN we could use that, but until PostgreSQL 9.3 is out, it's not supported, and the previous versions will still be used for years anyway.
Problem
The problem with syntax #3 is that the function is called as many times as there are columns in the result. There's no apparent reason for that, but it happens.
We can see it in version 9.2 by adding a RAISE NOTICE 'called for %', n in the function. With the query above, it outputs:
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 3
NOTICE: called for 3
Now, if the function is changed to return 4 columns, like this:
CREATE FUNCTION func(n int) returns table(i int, j bigint,k int, l int) as $$
BEGIN
raise notice 'called for %', n;
RETURN QUERY select 1,n::bigint,1,1
union all select 2,n*n::bigint,1,1
union all select 3,n*n*n::bigint,1,1;
END
$$ language plpgsql stable;
then the same query outputs:
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 2
NOTICE: called for 3
NOTICE: called for 3
NOTICE: called for 3
NOTICE: called for 3
2 function calls were needed, 8 were actually made. The ratio is the number of output columns.
With syntax #2 that produces the same result except for the output columns layout, these multiple calls don't happen:
select N, func(N)
from (select 2 as N union select 3 as N) s;
gives:
NOTICE: called for 2
NOTICE: called for 3
followed by the 6 resulting rows:
n | func
---+------------
2 | (1,2,1,1)
2 | (2,4,1,1)
2 | (3,8,1,1)
3 | (1,3,1,1)
3 | (2,9,1,1)
3 | (3,27,1,1)
Questions
Is there a syntax or a construct with 9.2 that would achieve the expected result by doing only the minimum required function calls?
Bonus question: why do the multiple evaluations happen at all?
You can wrap it up in a subquery but that's not guaranteed safe without the OFFSET 0 hack. In 9.3, use LATERAL. The problem is caused by the parser effectively macro-expanding * into a column list.
Workaround
Where:
SELECT (my_func(x)).* FROM some_table;
will evaluate my_func n times for n result columns from the function, this formulation:
SELECT (mf).* FROM (
SELECT my_func(x) AS mf FROM some_table
) sub;
generally will not, and tends not to add an additional scan at runtime. To guarantee that multiple evaluation won't be performed you can use the OFFSET 0 hack or abuse PostgreSQL's failure to optimise across CTE boundaries:
SELECT (mf).* FROM (
SELECT my_func(x) AS mf FROM some_table OFFSET 0
) sub;
or:
WITH tmp(mf) AS (
SELECT my_func(x) FROM some_table
)
SELECT (mf).* FROM tmp;
In PostgreSQL 9.3 you can use LATERAL to get a saner behaviour:
SELECT mf.*
FROM some_table
LEFT JOIN LATERAL my_func(some_table.x) AS mf ON true;
LEFT JOIN LATERAL ... ON true retains all rows like the original query, even if the function call returns no row.
Demo
Create a function that isn't inlineable as a demonstration:
CREATE OR REPLACE FUNCTION my_func(integer)
RETURNS TABLE(a integer, b integer, c integer) AS $$
BEGIN
RAISE NOTICE 'my_func(%)',$1;
RETURN QUERY SELECT $1, $1, $1;
END;
$$ LANGUAGE plpgsql;
and a table of dummy data:
CREATE TABLE some_table AS SELECT x FROM generate_series(1,10) x;
then try the above versions. You'll see that the first version raises three notices per invocation; the later versions raise only one each.
Why?
Good question. It's horrible.
It looks like:
(func(x)).*
is expanded as:
(func(x)).i, (func(x)).j, (func(x)).k, (func(x)).l
in parsing, according to a look at debug_print_parse, debug_print_rewritten and debug_print_plan. The (trimmed) parse tree looks like this:
:targetList (
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 1
:resulttype 23
:resulttypmod -1
:resultcollid 0
}
:resno 1
:resname i
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 2
:resulttype 20
:resulttypmod -1
:resultcollid 0
}
:resno 2
:resname j
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 3
:...
}
:resno 3
:resname k
...
}
{TARGETENTRY
:expr
{FIELDSELECT
:arg
{FUNCEXPR
:funcid 57168
...
}
:fieldnum 4
...
}
:resno 4
:resname l
...
}
)
So basically, we're using a dumb parser hack to expand wildcards by cloning nodes.