BigQuery overhead of Stored Procedures - google-bigquery

We have a use case to create some sophisticated search predicates against Google BigQuery. We have built some search screens backed by stored procedures that build dynamic SQL, which is then run with EXECUTE IMMEDIATE. Here is a completely synthetic example:
CREATE OR REPLACE PROCEDURE `somewhere`.author_name_search(
  IN firstNamePrefix STRING,
  IN lastNamePrefix STRING
)
BEGIN
  /* note: SELECT * is bad practice */
  DECLARE DYNAMIC_SQL STRING DEFAULT 'SELECT * FROM `somewhere_else.author`';
  DECLARE WHERE_CLAUSE STRING DEFAULT '';
  IF ASCII(firstNamePrefix) != 0 THEN
    SET WHERE_CLAUSE = WHERE_CLAUSE || " firstName like '" || firstNamePrefix || "%'";
  END IF;
  IF ASCII(lastNamePrefix) != 0 THEN
    IF ASCII(WHERE_CLAUSE) != 0 THEN
      SET WHERE_CLAUSE = WHERE_CLAUSE || " OR ";
    END IF;
    SET WHERE_CLAUSE = WHERE_CLAUSE || "lastName like '" || lastNamePrefix || "%'";
  END IF;
  /* if there is no optional input this is a full table scan, which is VERY BAD;
     you need to partition and force all search inputs to touch only some partitions */
  IF ASCII(WHERE_CLAUSE) != 0 THEN
    SET DYNAMIC_SQL = DYNAMIC_SQL || " WHERE " || WHERE_CLAUSE;
  END IF;
  SET DYNAMIC_SQL = DYNAMIC_SQL || " LIMIT 100000";
  EXECUTE IMMEDIATE DYNAMIC_SQL;
END;
Note that this is a really basic example with lots of performance issues. In reality the data is partitioned by date, require_partition_filter is set to true, the UI suggests a reasonable date range, and the UI only selects a subset of fields, not all of them. Where things get complex is that we have a sort column, a sort order, and up to a dozen optional search input fields, which is why we are looking to use dynamic SQL with EXECUTE IMMEDIATE inside a procedure. We are aware that BigQuery is not built for low-latency OLTP queries, so we optimize our datasets for these "interactive search" screens and set user expectations that the response times will be a few seconds depending on the complexity of the search they are running.
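To give a flavour of the real thing, here is a rough sketch (this is not our actual proc; names such as author_by_day, created_date, fromDate/toDate and sortField are invented for this post) of the general ingredients: the partition filter is always present, user-supplied values are bound with EXECUTE IMMEDIATE ... USING rather than concatenated into the SQL text, and the sort column is checked against an allowlist because identifiers cannot be bound:
CREATE OR REPLACE PROCEDURE `somewhere`.author_search_sketch(
  IN fromDate DATE,
  IN toDate DATE,
  IN firstNamePrefix STRING,
  IN sortField STRING
)
BEGIN
  /* normalise the optional input so the bound parameter is never NULL */
  DECLARE first_name_prefix STRING DEFAULT IFNULL(firstNamePrefix, '');
  DECLARE dynamic_sql STRING;
  /* identifiers cannot be passed as query parameters, so validate against an allowlist */
  IF sortField IS NULL OR sortField NOT IN ('firstName', 'lastName') THEN
    RAISE USING MESSAGE = 'unsupported sort field';
  END IF;
  /* the partition filter is always present; user values are referenced as @parameters */
  SET dynamic_sql = '''
    SELECT authorId, firstName, lastName
    FROM `somewhere_else.author_by_day`
    WHERE created_date BETWEEN @fromDate AND @toDate
      AND (@firstNamePrefix = "" OR firstName LIKE CONCAT(@firstNamePrefix, "%"))
    ORDER BY ''' || sortField || ' LIMIT 100000';
  EXECUTE IMMEDIATE dynamic_sql
    USING fromDate AS fromDate, toDate AS toDate, first_name_prefix AS firstNamePrefix;
END;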
We ran some performance tests comparing the overhead of just querying the underlying table/view as opposed to querying it from within a proc. The simplest test proc is hardcoded to just return some fixed data from a view so that we can do a direct comparison such as:
CREATE OR REPLACE PROCEDURE `somewhere`.author_name_search(
  IN firstNamePrefix STRING,
  IN lastNamePrefix STRING
)
BEGIN
  /* note: the logic here is not quite identical; in reality we hardcode the query
     to return a known result so we can compare with other test runs */
  SELECT * FROM `somewhere_else.author` WHERE firstName LIKE firstNamePrefix OR lastName LIKE lastNamePrefix;
END;
The real test proc is more or less hardcoded to pull some exact test rows from the underlying table. If we query the table directly without a proc, everything is sub-second end-to-end. When we simply hardcode the same query statement into a minimal stored proc (as above), everything is one whole second slower. So using the proc seems to add a fixed overhead of more than 100% of the full end-to-end time taken when not going via a proc.
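For anyone who wants to dig into where the extra second goes: a stored procedure call runs as a script, which shows up as a parent job with child jobs for the statements it executes, so a query along these lines against INFORMATION_SCHEMA can break the timings down (the region qualifier, time window and job id are placeholders):
SELECT
  job_id,
  parent_job_id,
  statement_type,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS elapsed_ms,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  AND (job_id = 'script_job_id_here' OR parent_job_id = 'script_job_id_here')
ORDER BY start_time;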
Is this expected, or is there something we can do to tune things so that procs are not a massive overhead?

Related

Save stored procedure output to new table without repeating table type

I want to call an existing procedure and store its table-typed OUT parameters to new physical tables, without having to repeat the definitions of the output types when creating the new tables. For example, if the procedure were
CREATE PROCEDURE MYPROC
(IN X INTEGER, OUT Y TABLE(A INTEGER, B DOUBLE, C NVARCHAR(25)))
LANGUAGE SQLSCRIPT AS BEGIN
...
END;
I would want to create a physical table for the output without repeating the (A INTEGER, B DOUBLE, C NVARCHAR(25)) part.
If I already had a table with the structure I want my result to have, I could CREATE TABLE MY_OUTPUT LIKE EXISTING_TABLE, but I don't.
If I already had a named type defined for the procedure's output type, I could create my table based on that type, but I don't.
If it were a subquery instead of a procedure output parameter, I could CREATE TABLE MY_OUTPUT AS (<subquery>), but it's not a subquery, and I don't know how to express it as a subquery. Also, there could be multiple output parameters, and I don't know how you'd make this work with multiple output parameters.
In my specific case, the functions come from the SAP HANA Predictive Analysis Library, so I don't have the option of changing how the functions are defined. Additionally, I suspect that PAL's unusually flexible handling of parameter types might prevent me from using solutions that would work for ordinary SQLScript procedures, but I'm still interested in solutions that would work for regular procedures, even if they fail on PAL.
Is there a way to do this?
It's possible, with limitations, to do this by using a SQLScript anonymous block:
DO BEGIN
CALL MYPROC(5, Y);
CREATE TABLE BLAH AS (SELECT * FROM :Y);
END;
We store the output to a table variable in the anonymous block, then create a physical table with data taken from the table variable. This even works with PAL! It's a lot of typing, though.
The limitation I've found is that the body of an anonymous block can't refer to local temporary tables created outside the anonymous block, so it's awkward to pass local temporary tables to the procedure this way. It's possible to do it anyway by passing the local temporary table as a parameter to the anonymous block itself, but that requires writing out the type of the local temporary table, and we were trying to avoid writing table types manually.
As far as I understand, you want to use your database tables as output parameter types.
In my default schema, I have a database table named CITY.
I can create a stored procedure as follows, using this table as the output parameter type:
CREATE PROCEDURE MyCityList (
  OUT CITYLIST CITY
)
LANGUAGE SQLSCRIPT
AS
BEGIN
  CITYLIST = SELECT * FROM CITY;
END;
After the procedure is created, you can execute it as follows:
do
begin
  declare myList CITY;
  call MyCityList(:myList);
  select * from :myList;
end;
Here is the result, where the output data comes back in the format of the CITY database table.
I hope this answers your question.
Update after first comment
If the scenario is the opposite, as mentioned in the first comment, you can query the system view PROCEDURE_PARAMETER_COLUMNS and build dynamic SQL statements that generate tables matching the definitions of the procedure's table-type parameters.
Here is the SQL query:
select
  parameter_name,
  'CREATE Column Table ' ||
    procedure_name || '_' || parameter_name || ' ( ' ||
    string_agg(
      column_name || ' ' ||
      data_type_name ||
      case when data_type_name = 'INTEGER' then ''
           else '(' || length || ')'
      end,
      ','
    ) || ' );'
from PROCEDURE_PARAMETER_COLUMNS
where
  schema_name = 'A00077387'
group by procedure_name, parameter_name
You need to replace the WHERE clause according to your case.
Each row will produce an output line like this:
CREATE Column Table LISTCITIESBYCOUNTRYID_CITYLIST ( CITYID INTEGER,NAME NVARCHAR(40) );
The table name format is the concatenation of the procedure name and the parameter name.
One last note: some data types (integer, decimal, etc.) require special handling, such as omitting the length or adding the scale; those cases are not handled in this SQL.
I'll try to enhance the query soon and publish an update
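As a rough starting point for that enhancement, and assuming your HANA version exposes LENGTH and SCALE columns in PROCEDURE_PARAMETER_COLUMNS (worth checking), the CASE expression inside string_agg() could be extended along these lines:
case
  when data_type_name in ('INTEGER', 'BIGINT', 'SMALLINT', 'TINYINT', 'DATE', 'TIMESTAMP') then ''
  when data_type_name = 'DECIMAL' then '(' || length || ',' || scale || ')'
  else '(' || length || ')'
end
Other types (for example those with neither length nor scale) would still need their own branches.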

How to stop a large query if it has too many rows in PostgreSQL using Jdbc?

We run user submitted queries which can potentially return a large result set.
In order to avoid memory issues, we would like to detect these cases and cancel the query. The user then is expected to modify the query.
We already use PreparedStatement#setFetchSize() to scroll the result set and process a large result set incrementally.
However, when the result set is too large, we would like to avoid bringing even the first results over the network or any other unnecessary work as much as possible on the client side and on the database side.
Doing a SELECT COUNT(*)... beforehand just degrades the performance of the expected case where the queries behave nicely in general.
Is there a way for Postgres to tell us the expected result set size?
Take a look here.
They do the estimate with a database function:
CREATE FUNCTION count_estimate(query text) RETURNS INTEGER AS
$func$
DECLARE
  rec  record;
  ROWS INTEGER;
BEGIN
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    ROWS := SUBSTRING(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN ROWS IS NOT NULL;
  END LOOP;
  RETURN ROWS;
END
$func$ LANGUAGE plpgsql;
It uses PostgreSQL's EXPLAIN command to estimate the returned row count.
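A minimal usage sketch (the query text and the 50000 threshold are illustrative, not from the question): call the estimate first from the application and only execute the real query if the estimate stays under your limit:
-- estimate the result size without executing the user's query
SELECT count_estimate('SELECT * FROM user_submitted_view WHERE created_at > now() - interval ''7 days''');
-- if the returned estimate exceeds the application's limit (say 50000 rows),
-- skip executing the real query and ask the user to narrow the search instead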

If I pass a where clause as a parameter will that prevent SQL Injection?

I created an Oracle proc where I create a dynamic sql statement based on the parameters supplied to the proc.
I've done some testing and it appears that I can't perform sql injection.
Is there anything additional I should be safe guarding against?
SELECT 'UPDATE ' || p_table || ' SET MY_FIELD = ''' || p_Value || ''' ' || p_Where
INTO query_string
FROM DUAL;
EDIT:
Scenarios that I've tried.
1. WHERE SOME_VAL IN ('AAA','BBB') - This works
2. WHERE SOME_VAL IN ('AAA','BBB') OR SOME_VAL2 = '123' - This works.
3. WHERE SOME_VAL IN ('AAA','BBB'); DROP TABLE TEST_TABLE; - This errors out.
4. WHERE SOME_VAL IN ('AAA','BBB') OR (DELETE FROM TEST_TABLE) - This errors out.
It depends on how and by whom your procedure is being invoked. Usually you need to worry about SQL injection for something that is open to a large number of users in production, and that should not be the case for any database procedure. If your database procedure is accessible to a large number of users, then you have the potential for malicious use by someone.
In your case, you can mitigate this risk by creating a mapping of parameters to hide the actual schema object names, plus some validation.
For example, change the parameter p_table to table_name as the input parameter, then use a CASE statement to map it to the actual table name. I am giving the table name as the example here because you should really restrict who can access which table in the database.
CREATE OR REPLACE PROCEDURE test_proc(
  table_name IN VARCHAR2,
  p_Value    IN VARCHAR2,
  p_Where    IN VARCHAR2
)
IS
  p_table      VARCHAR2(100);
  query_string VARCHAR2(4000);
BEGIN
  CASE table_name
    WHEN 'A' THEN p_table := 'db_table_a';
    WHEN 'B' THEN p_table := 'db_table_b';
    ELSE RAISE_APPLICATION_ERROR(-20001, 'Invalid table name parameter');
  END CASE;
  SELECT 'UPDATE ' || p_table || ' SET MY_FIELD = ''' || p_Value || ''' ' || p_Where
    INTO query_string
    FROM DUAL;
END;
You should do similar mapping and validation for other parameters too.
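On top of that mapping, the value itself can be passed as a bind variable through EXECUTE IMMEDIATE ... USING, so it never becomes part of the SQL text; only identifiers and free-form clause fragments (which cannot be bound) need the allowlist treatment. A rough sketch along the same lines as the procedure above, assuming the WHERE condition can be reduced to bindable values such as a key column (the column names here are invented):
CREATE OR REPLACE PROCEDURE test_proc_bind(
  table_name IN VARCHAR2,
  p_value    IN VARCHAR2,
  p_id       IN NUMBER
)
IS
  p_table VARCHAR2(100);
BEGIN
  -- map the caller-supplied name to a known table, as above
  CASE table_name
    WHEN 'A' THEN p_table := 'db_table_a';
    WHEN 'B' THEN p_table := 'db_table_b';
    ELSE RAISE_APPLICATION_ERROR(-20001, 'Invalid table name parameter');
  END CASE;
  -- :val and :id are bind placeholders; MY_FIELD and ID are illustrative columns
  EXECUTE IMMEDIATE
    'UPDATE ' || p_table || ' SET MY_FIELD = :val WHERE ID = :id'
    USING p_value, p_id;
END;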
SQL injection always opens Pandora's box.
You should always assume a user can break out of a dynamic SQL statement. With full SQL access you should then assume a user can find a way to escalate privileges and own your database. (Depending on how paranoid you are, it might be safe to assume privilege escalation is impossible as long as your database and schemas are constantly patched and thoroughly hardened. In practice the vast majority of Oracle databases are not sufficiently patched and hardened.)
Below are a few simple examples that should scare you. And you should also assume that there are many hackers who are more clever than I am and have better attacks.
Sample Schema
First let's create a simple table with some data for a realistic test.
drop table test1;
create table test1(my_field varchar2(100), some_val varchar2(100));
insert into test1 values('A', 'AAA');
commit;
Obviously Dangerous Function
Are all of the existing functions safe?
create or replace function dangerous_function return number is
  pragma autonomous_transaction;
begin
  delete from test1;
  commit;
  return 1;
end;
/
If not, what is stopping the user from calling it like this?
--Safe static part:
update test1
set my_field = 'b'
--Dangerous dynamic part:
where some_val IN ('AAA')
and 1 = (select dangerous_function from dual)
Luckily creating an autonomous function is unusual and you can probably check the code. But can you guarantee the application will not create one in the future?
Custom Function in SQL
Even if there are no objects a clever user can turn your UPDATE into other DML:
--Safe static part:
update --+ WITH_PLSQL
  test1
set my_field = 'b'
--Dangerous dynamic part:
where some_val IN ('AAA')
and 1 = (
  with function dangerous_function return number is
    pragma autonomous_transaction;
  begin
    delete from test1;
    commit;
    return 1;
  end;
  select dangerous_function from dual
);
I did cheat a little: the above code only works for me with the --+ WITH_PLSQL hint. Without that hint the code throws the error ORA-32034: unsupported use of WITH clause. But that's only a version limitation that might be lifted in the future. Or there might be some clever way to work around it; sometimes hints can break out of their part of the query and reference other sections.
Why Risk It?
Maybe there is a safe way to do it. But why risk it? Everybody in the IT world understands SQL injection bugs now. If you mess up and cause an exploit there will be no sympathy for you.

Postgresql insert trigger becomes slow when querying current table

When inserting a lot number into a table we count the number of times the base number already exists, and add a -## suffix to the end of the new number based on that count.
I have stripped out most of the logic (we check for other things as well). I am also aware of the logic flaw here that would skip -1.
-- Function: stone._lsuniqueid()
-- DROP FUNCTION stone._lsuniqueid();
CREATE OR REPLACE FUNCTION stone._lsuniqueid()
  RETURNS trigger AS
$BODY$
DECLARE
  _count INTEGER;
BEGIN
  -- Obtain the number of occurrences of this new ls_number
  SELECT COUNT(ls_number) INTO _count
  FROM ls
  WHERE ls_number LIKE CAST(NEW.ls_number || '%' AS text);
  -- Allow new ls_numbers to be entered as is, otherwise add "-#{count + 1}"
  -- to the end of the ls_number
  IF _count > 0 THEN
    NEW.ls_number = NEW.ls_number || '-' || CAST(_count + 1 AS text);
  END IF;
  RETURN NEW;
END
$BODY$
  LANGUAGE plpgsql;
INSERT INTO ls VALUES (NEXTVAL('ls_ls_id_seq'),7285,UPPER('20151012'));
--> Query returned successfully: one row affected, 391 ms execution time.
The count query is plenty fast
SELECT COUNT(ls_number)
FROM ls
WHERE ls_number LIKE CAST('20151012' || '%' AS text);
--> 19ms
For comparison I tried a similar trigger, but ran the count against a different table with the same number of rows and a similar query time.
SELECT COUNT(lsdetail_id)
FROM lsdetail
WHERE lsdetail_id > 2433308
--> 20ms
Running the same insert with the count running against a different table returns the result 20 times faster.
INSERT INTO ls VALUES (NEXTVAL('ls_ls_id_seq'),7285,UPPER('20151012'));
--> Query returned successfully: one row affected, 20 ms execution time.
The ls table has about 2.5 million rows
I've tried a couple of different things and the issue seems to be when selecting from the same table I'm inserting into.
I would like to know why this happening, but I would also be open to a better way to create "sub-lot" numbers.
Thanks!
Found the answer here:
http://www.postgresql.org/message-id/27705.1150381444@sss.pgh.pa.us
Re: How to analyze function performance
"Mindaugas" writes:
Is it possible to somehow analyze function performance? E.g. we are using function cleanup() which takes obviously too much time to execute but I have problems trying to figure what is slowing things down.
When I explain analyze function lines step by step it show quite acceptable performance.
--
Are you sure you are "explain analyze"ing the same queries the function
is really doing? You have to account for the fact that what plpgsql is
issuing is parameterized queries, and sometimes that limits the
planner's ability to pick a good plan. For instance, if you have
declare x int;
begin
...
for r in select * from foo where key = x loop ...
then what is really getting planned and executed is "select * from foo
where key = $1" --- every plpgsql variable gets replaced by a parameter
symbol "$n". You can model this for EXPLAIN purposes with a prepared
statement:
prepare p1(int) as select * from foo where key = $1;
explain analyze execute p1(42);
If you find out that a particular query really sucks when parameterized,
you can work around this by using EXECUTE to force the query to be
planned afresh on each use with literal constants instead of parameters:
Then I looked into this:
http://www.postgresql.org/docs/9.1/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-EXECUTING-DYN
39.5.4. Executing Dynamic Commands
Oftentimes you will want to generate dynamic commands inside your PL/pgSQL functions, that is, commands that will involve different tables or different data types each time they are executed. PL/pgSQL's normal attempts to cache plans for commands (as discussed in Section 39.10.2) will not work in such scenarios. To handle this sort of problem, the EXECUTE statement is provided:
EXECUTE 'SELECT count(*) FROM mytable WHERE inserted_by = $1 AND inserted <= $2'
INTO c
USING checked_user, checked_date;
--
So in the end it was a matter of updating the count select to this:
EXECUTE 'SELECT COALESCE(COUNT(ls_number), 0) FROM ls WHERE ls_number LIKE $1 || ''%'';'
INTO _count
USING NEW.ls_number;
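For completeness, the prepared-statement trick from the quoted message can be used to confirm the diagnosis on this table before changing the trigger (a sketch using the column and pattern from the question):
-- reproduce the parameterized plan the trigger function was getting
PREPARE p1(text) AS
  SELECT COUNT(ls_number) FROM ls WHERE ls_number LIKE $1 || '%';
EXPLAIN ANALYZE EXECUTE p1('20151012');
DEALLOCATE p1;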

Elegant way of handling PostgreSQL exceptions?

In PostgreSQL, I would like to create a safe-wrapping mechanism which returns empty result if an exception occurs. Consider the following:
SELECT * FROM myschema.mytable;
I could do the safe-wrapping in the client application:
try {
result = execute_query('SELECT value FROM myschema.mytable').fetchall();
}
catch(pg_exception) {
result = []
}
But could I do such a thing in SQL directly? I would like to make the following code work, but it seems like it should be put into a DO $$ ... $$ block, and here I'm getting lost.
BEGIN
SELECT * FROM myschema.mytable;
EXCEPTION WHEN others THEN
SELECT unnest(ARRAY[]::TEXT[])
END
Exception handling in PL/pgSQL
PL/pgSQL code is always wrapped into a BEGIN ... END block. That can be inside the body of a DO statement or a function. Blocks can be nested inside, but they cannot exist outside; don't confuse it with plain SQL.
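For orientation, the shape such a block takes is roughly this (a minimal sketch against the table from the question; it can only raise a notice, because, as explained below, a DO statement cannot return rows):
DO
$$
BEGIN
   PERFORM value FROM myschema.mytable;
EXCEPTION WHEN undefined_table THEN
   RAISE NOTICE 'myschema.mytable does not exist';
END
$$;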
Each block can optionally contain an EXCEPTION clause for handling exceptions, but functions that need to trap exceptions are more expensive, so it's best to avoid exceptions a priori. Postgres needs to prepare for the possibility of rolling back to a point in the transaction before the exception happened, similar to an SQL SAVEPOINT. The manual:
A block containing an EXCEPTION clause is significantly more
expensive to enter and exit than a block without one. Therefore, don't
use EXCEPTION without need.
Example:
Is SELECT or INSERT in a function prone to race conditions?
How to avoid an exception in the example
A DO statement can't return anything. Create a function that takes table and schema name as parameters and returns whatever you want:
CREATE OR REPLACE FUNCTION f_tbl_value(_tbl text, _schema text = 'public')
  RETURNS TABLE (value text)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _t regclass := to_regclass(_schema || '.' || _tbl);
BEGIN
   IF _t IS NULL THEN
      value := ''; RETURN NEXT;      -- return single empty string
   ELSE
      RETURN QUERY EXECUTE
      'SELECT value FROM ' || _t;    -- return set of values
   END IF;
END
$func$;
Call:
SELECT * FROM f_tbl_value('my_table');
Or:
SELECT * FROM f_tbl_value('my_table', 'my_schema');
Assuming you want a set of rows with a single text column or an empty string if the table does not exist.
Also assuming that a column value exists if the given table exists. You could test for that, too, but you didn't ask for that.
Both input parameters are only case sensitive if double-quoted. Just like identifiers are handled in SQL statements.
The schema name defaults to 'public' in my example. Adapt to your needs. You could even ignore the schema completely and default to the current search_path.
to_regclass() is new in Postgres 9.4. For older versions substitute:
IF EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = _schema
AND table_name = _tbl
) THEN ...
This is actually more accurate, because it tests exactly what you need. More options and detailed explanation:
Table name as a PostgreSQL function parameter
Always defend against SQL injection when working with dynamic SQL! The cast to regclass does the trick here. More details:
How to check if a table exists in a given schema
If you are selecting only a single column, the COALESCE() function should be able to do the trick for you:
SELECT COALESCE(value, '{}') FROM myschema.mytable;
If you require more rows, you may need to create a function with types.