Robust approach for building SQL queries programmatically - sql

I have to resort to raw SQL where the ORM is falling short (using Django 1.7). The problem is that most of the queries end up being 80-90% similar. I cannot figure out a robust & secure way to build queries without violating re-usability.
Is string concatenation the only way out, i.e. build parameter-less query strings using if-else conditions, then safely include the parameters using prepared statements (to avoid SQL injection)? I want to follow a simple approach for templating SQL for my project instead of re-inventing a mini ORM.
For example, consider this query:
SELECT id, name, team, rank_score
FROM
( SELECT id, name, team
ROW_NUMBER() OVER (PARTITION BY team
ORDER BY count_score DESC) AS rank_score
FROM
(SELECT id, name, team
COUNT(score) AS count_score
FROM people
INNER JOIN scores on (scores.people_id = people.id)
GROUP BY id, name, team
) AS count_table
) AS rank_table
WHERE rank_score < 3
How can I:
a) add optional WHERE constraint on people or
b) change INNER JOIN to LEFT OUTER or
c) change COUNT to SUM or
d) completely skip the OVER / PARTITION clause?

Better query
Fix the syntax, simplify and clarify:
SELECT *
FROM (
SELECT p.person_id, p.name, p.team, sum(s.score)::int AS score
, rank() OVER (PARTITION BY p.team ORDER BY sum(s.score) DESC)::int AS rnk
FROM person p
JOIN score s USING (person_id)
GROUP BY 1
) sub
WHERE rnk < 3;
Building on my updated table layout. See fiddle below.
You do not need the additional subquery. Window functions are executed after aggregate functions, so you can nest it like demonstrated.
While talking about "rank", you probably want to use rank(), not row_number().
Assuming person.person_id is the PK, you can simplify the GROUP BY clause.
Be sure to table-qualify all column names that might be ambiguous.
PL/pgSQL function
I would write a PL/pgSQL function that takes parameters for your variable parts. Implementing a - c of your points. d is unclear, leaving that for you to add.
CREATE TABLE person (
person_id serial PRIMARY KEY
, name text NOT NULL
, team text
);
CREATE TABLE score (
score_id serial PRIMARY KEY
, person_id int NOT NULL REFERENCES person
, score int NOT NULL
);
-- dummy values
WITH ins AS (
INSERT INTO person(name, team)
SELECT 'Jon Doe ' || p, t
FROM generate_series(1,20) p -- 20 guys x
, unnest ('{team1,team2,team3}'::text[]) t -- 3 teams
RETURNING person_id
)
INSERT INTO score(person_id, score)
SELECT i.person_id, (random() * 100)::int
FROM ins i, generate_series(1,5) g; -- 5 scores each
Function:
CREATE OR REPLACE FUNCTION f_demo(_agg text DEFAULT 'sum'
, _left_join bool DEFAULT false
, _where_name text DEFAULT null)
RETURNS TABLE(person_id int, name text, team text, score numeric, rnk bigint)
LANGUAGE plpgsql AS
$func$
DECLARE
_agg_op CONSTANT text[] := '{count, sum, avg}'; -- allowed agg functions
_sql text;
BEGIN
-- assert --
IF _agg ILIKE ANY (_agg_op) THEN
-- all good
ELSE
RAISE EXCEPTION '_agg must be one of %', _agg_op;
END IF;
-- query --
_sql := format('
SELECT *
FROM (
SELECT p.person_id, p.name, p.team, %1$s(s.score)::numeric AS score
, rank() OVER (PARTITION BY p.team ORDER BY %1$s(s.score) DESC) AS rnk
FROM person p
%2$s score s USING (person_id)
%3$s
GROUP BY 1
) sub
WHERE rnk < 3
ORDER BY team, rnk'
, _agg -- %1$s
, CASE WHEN _left_join THEN 'LEFT JOIN' ELSE 'JOIN' END -- %2$s
, CASE WHEN _where_name <> '' THEN 'WHERE p.name LIKE $1' ELSE '' END -- %3$s
);
-- debug -- inspect query first
-- RAISE NOTICE '%', _sql;
-- execute -- unquote when tested ok
RETURN QUERY EXECUTE _sql
USING _where_name; -- $1
END
$func$;
Call:
SELECT * FROM f_demo();
SELECT * FROM f_demo('sum', TRUE, '%2');
SELECT * FROM f_demo('avg', FALSE);
SELECT * FROM f_demo(_where_name := '%1_'); -- named param
fiddle
Old sqlfiddle
You need a firm understanding of PL/pgSQL. Else, there is too much to explain. You'll find related answers here on SO under plpgsql for every detail in the answer.
All parameters are treated safely, no SQL injection possible. See:
Define table and column names as arguments in a plpgsql function?
Table name as a PostgreSQL function parameter
Note in particular how a WHERE clause is added conditionally when _where_name is passed, with the positional parameter $1 in the query string. The value is passed to EXECUTE as a value with the USING clause. No type conversion, no escaping, no chance for SQL injection. Examples:
Row expansion via "*" is not supported here
SQL state: 42601 syntax error at or near "11"
Refactor a PL/pgSQL function to return the output of various SELECT queries
Use DEFAULT values for function parameters, so you are free to provide any or none. More:
Functions with variable number of input parameters
The manual on calling functions
The function format() is instrumental for building complex dynamic SQL strings in a safe and clean fashion.
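Since the question mentions Django, the same idea can also be sketched on the client side in Python. This is only a minimal sketch, not the method of the answer above: it assumes the person / score layout shown earlier, uses the standard-library sqlite3 module as a stand-in for a real PostgreSQL connection, leaves out the rank() step for brevity, and the names ALLOWED_AGGS, build_query, make_demo_db and run are hypothetical. The principle is the same as in the PL/pgSQL function: keywords and function names come from a whitelist only, while user-supplied values travel as bound parameters.

```python
import sqlite3

# Whitelist of aggregate functions, mirroring _agg_op in the PL/pgSQL function.
ALLOWED_AGGS = {"count", "sum", "avg"}


def build_query(agg="sum", left_join=False, name_pattern=None):
    """Return (sql, params). Only whitelisted text is spliced into the SQL."""
    agg = agg.lower()
    if agg not in ALLOWED_AGGS:
        raise ValueError(f"agg must be one of {sorted(ALLOWED_AGGS)}")
    join = "LEFT JOIN" if left_join else "JOIN"
    # The pattern itself never enters the SQL string, only a placeholder does.
    where = "WHERE p.name LIKE ?" if name_pattern else ""
    params = (name_pattern,) if name_pattern else ()
    sql = f"""
        SELECT p.person_id, p.name, p.team, {agg}(s.score) AS score
        FROM person p
        {join} score s ON s.person_id = p.person_id
        {where}
        GROUP BY p.person_id, p.name, p.team
        ORDER BY p.person_id
    """
    return sql, params


def make_demo_db():
    """Hypothetical in-memory data set, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (person_id INTEGER PRIMARY KEY,
                             name TEXT NOT NULL, team TEXT);
        CREATE TABLE score (score_id INTEGER PRIMARY KEY,
                            person_id INTEGER NOT NULL REFERENCES person,
                            score INTEGER NOT NULL);
        INSERT INTO person VALUES (1, 'Ann', 't1'), (2, 'Bob', 't1'), (3, 'Cy', 't2');
        INSERT INTO score (person_id, score) VALUES (1, 10), (1, 20), (2, 5);
    """)
    return conn


def run(conn, **kwargs):
    """Build the query from keyword options and execute it with bound parameters."""
    sql, params = build_query(**kwargs)
    return conn.execute(sql, params).fetchall()
```

Calling run(make_demo_db(), agg="count", left_join=True) also returns people without scores, and anything outside the whitelist is rejected before a query string is ever built.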

Related

SQL Conditional join inside a function

This question has been asked many times on SO, but I never quite found the answer to it; the answers are mostly solutions that avoid the problem altogether.
I'm working with MS SQL Server and I'm trying to build a query inside a function (for security reasons) that will return either a table or its unnested version by country.
That is, the function should return either
SELECT * FROM SALES AS S
or
SELECT
S.*,
C.Country,
C.CountryPercentage * S.AmountWithouthVAT as CountryValue
FROM SALES AS S
INNER JOIN CountryAllocation AS C ON S.CountryAllocationID = C.CountryAllocationID
(The fact that this join turns a single row into many rows is why I don't simply always use the second query. The reason I don't do the join outside the function is that the person running the function will not have access to either of the tables. Also note that, because of the way permissions work in SQL Server, a dynamic query requires permission evaluation, so that is not a feasible option unless I develop a structure around certificates.)
So, now I got 2 problems:
The output table might or might not have the columns Country and CountryValue causing problems when defining the output type of the function
The actual way to have a function parameter to switch between the 2 versions of the table.
I've got a solution, but this code pains my eyes to look upon:
CREATE FUNCTION [dbo].[fn_I_view] (@Type int)
RETURNS @OutTable TABLE
(
SaleID int,
AmountWithouthVAT decimal(18, 2),
Country varchar(50),
AlocationPercentage decimal(18, 2)
)
AS
BEGIN
WITH
Out1 AS
(
SELECT
S.*,
NULL as Country,
NULL as AlocationPercentage
FROM Sales AS S
WHERE @Type = 1
),
Out2 AS
(
SELECT
S.*,
C.Country,
C.CountryPercentage * S.AmountWithouthVAT as CountryValue
FROM SALES AS S
INNER JOIN CountryAllocation AS C ON S.CountryAllocationID = C.CountryAllocationID
WHERE @Type = 2
)
INSERT INTO @OutTable
SELECT * FROM Out1
UNION ALL
SELECT * FROM Out2
RETURN
END
GO
So I can't exactly fix the first problem; I only worked around it by making SELECT * from [INV].[fn_I_ViewAllMyInvoices](1) still return those 2 extra columns with NULL. I didn't fix the second problem either, as I'm calculating both queries when I only need one of them. (As you can expect, this is demo code; the real deal is way more complex.)
Is there any way to improve this code or solve the problem in a different way? Performance, readability, and maintenance improvements are all welcome.
You don't need to calculate both. Just do:
BEGIN
IF @Type = 1
BEGIN
INSERT INTO @OutTable
SELECT S.*, NULL as Country, NULL as AlocationPercentage
FROM Sales s;
END;
ELSE
BEGIN
INSERT INTO @OutTable
SELECT S.*, C.Country, C.CountryPercentage * S.AmountWithouthVAT as CountryValue
FROM SALES S JOIN
CountryAllocation C
ON S.CountryAllocationID = C.CountryAllocationID;
END;
RETURN;
END;
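The same branch-first pattern can be sketched in Python. This is an illustration only: the standard-library sqlite3 module stands in for SQL Server purely so the example is runnable, the tables are simplified, and the names PLAIN_SQL, BY_COUNTRY_SQL, sales_view and make_sales_db are hypothetical. Only one of the two static queries is executed, and both return the same four columns, so the caller always sees one result shape.

```python
import sqlite3

# Two fixed statements with identical output columns; no dynamic SQL involved.
PLAIN_SQL = """
    SELECT s.SaleID, s.AmountWithouthVAT, NULL AS Country, NULL AS CountryValue
    FROM Sales s
    ORDER BY s.SaleID
"""
BY_COUNTRY_SQL = """
    SELECT s.SaleID, s.AmountWithouthVAT, c.Country,
           c.CountryPercentage * s.AmountWithouthVAT AS CountryValue
    FROM Sales s
    JOIN CountryAllocation c ON s.CountryAllocationID = c.CountryAllocationID
    ORDER BY s.SaleID, c.Country
"""


def sales_view(conn, view_type):
    """Run exactly one of the two queries, chosen before touching the database."""
    sql = PLAIN_SQL if view_type == 1 else BY_COUNTRY_SQL
    return conn.execute(sql).fetchall()


def make_sales_db():
    """Hypothetical demo data: one sale split 25% / 75% across two countries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Sales (SaleID INTEGER PRIMARY KEY,
                            AmountWithouthVAT REAL,
                            CountryAllocationID INTEGER);
        CREATE TABLE CountryAllocation (CountryAllocationID INTEGER,
                                        Country TEXT,
                                        CountryPercentage REAL);
        INSERT INTO Sales VALUES (1, 100.0, 10);
        INSERT INTO CountryAllocation VALUES (10, 'DE', 0.25), (10, 'FR', 0.75);
    """)
    return conn
```

With this data, view type 1 yields a single row padded with NULLs, while view type 2 yields one row per country.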

How to find the rank based on height in a table based on heights of players in sql

Q1: I am trying to rank players by height. The line below looks like it may be illegal, because height is a variable that I was trying to use as the column to order by. Couldn't I just order by it? Won't the rank function give each height a rank if I simply call it with an OVER clause, like this:
CREATE OR REPLACE FUNCTION get_rank(fn VARCHAR, ln VARCHAR)
RETURNS FLOAT AS $$
DECLARE height Float=0.0;
BEGIN
SELECT rank() OVER(PARTITION BY height) rank
FROM
(
SELECT INTO height AVG(((p.h_feet*12)+ p.h_inches)*2.54)
FROM Players p
)
WHERE p.firstname=fn AND p.lastname=ln; return coalesce(rank,0.0)
END;
$$LANGUAGE plpgsql;
Q2: Do I need to think about this another way.
Q3: Can I do this: PARTITION BY height ORDER BY height
Q4: I am getting an error: ERROR: syntax error at or near "END"
LINE 1: ...ERE p.firstname=fn AND p.lastname=ln; return rank END; $$LAN...
I wanted to comment on your function, which has multiple structural errors. So let's look at those. First off starting from your subselect:
(
select into height avg(((p.h_feet*12)+ p.h_inches)*2.54)
from players p
)
A subselect can only pass its results to the outer select; it cannot use an INTO clause (at least I cannot figure out how to do it). But it is NOT necessary: you can alias the result and use that alias in the outer query.
Further, even if this were possible, it would produce a compile-time error.
The required format is "select column_list into variable_list"; you have the variable list and the column list reversed. Finally, Postgres requires an alias for a subselect. So, correcting for your needs:
(
select avg(((p.h_feet*12)+ p.h_inches)*2.54) height
from players p
) p1
Moving to the outer query, and replacing the subselect by the result we effectively get:
select rank() over(partition by height) rank
from subselect_results
where p.firstname=fn and p.lastname=ln; return coalesce(rank,0.0)
What do we find here:
The last line contains 2 statements. This is an extremely poor practice: it is acceptable to the compiler, but it often leads to errors and run-time exceptions that are very hard to find. Get into the habit of 1 statement per line.
A select statement in a code block requires the INTO clause, which is missing here.
There is no independent reference to any table, so the ONLY columns available are the subquery results; the table references inside the subquery are not available. In this case that is the single value height. The where clause therefore fails with an undefined-column error.
"rank() over(partition by height)": at this point height is essentially constant. Ranking over a constant always produces a result of 1.
Even when you reference the Players table, your where clause will (or at least should) return a single row. Rank over a single row always returns 1.
Finally, "return coalesce(rank,0.0)": rank is undefined as a variable. This statement will either raise an error or be treated as an invalid-parameters call to rank. Using a Postgres function name as a variable is a poor practice. You may get away with it, but it is also likely to cause hard-to-find errors and/or exceptions.
Correcting all the above, we arrive at a function that will actually run:
create or replace function get_rank(fn varchar, ln varchar)
returns float
language plpgsql
as $$
declare
rank_l float ;
begin
select rank() over(partition by pi.height)
into rank_l
from players p2
, (
select avg(((p.h_feet*12)+ p.h_inches)*2.54) height
from players p
) pi
where p2.firstname = fn
and p2.lastname = ln;
return rank_l;
end;
$$;
Of course, since it returns a single row from players and ranks on a constant, this could be replaced by the following:
create or replace function get_rank(fn varchar, ln varchar)
returns float
language plpgsql
as $$
begin
return 1.0;
end;
$$;
While working through this, it seems what you are asking for is the rank of a specific player based on height alone. So perhaps what you need is:
-- setup
create table players(id serial, firstname text, lastname text, h_feet integer, h_inches integer);
insert into players values (1,'A','B',6,3)
, (2,'C','D',6,4)
, (3,'E','F',6,9);
-- Build function
create or replace function get_rank(fn text, ln text)
returns table( firstname text
, lastname text
, height text
, rank_by_height integer
)
language sql strict
as $$
with player_ranking as
(select firstname
, lastname
, concat(trim(to_char(h_feet, '9')), '''', trim(to_char(h_inches,'00')), '"') size
, 12*h_feet + h_inches height
from players
)
select firstname, lastname, size, rnk
from (
select firstname, lastname, size, rank() over (order by height desc)::integer rnk
from player_ranking pr
) pr
where firstname = fn
and lastname = ln;
$$;
-- test
select *
from get_rank('C','D');
I'll leave it to you to research what each function used does, and the difference between SQL and plpgsql functions. Study/research performed to understand something is never a waste of time.
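To see what rank() over (order by height desc) computes, here is a small Python sketch of the same ranking using the three sample players from the setup above. It illustrates the semantics only, not how Postgres implements window functions, and the names PLAYERS and rank_by_height are hypothetical.

```python
# Sample rows from the setup above: (firstname, lastname, h_feet, h_inches).
PLAYERS = [("A", "B", 6, 3), ("C", "D", 6, 4), ("E", "F", 6, 9)]


def rank_by_height(players, fn, ln):
    """Rank of one player by height, tallest first, with SQL rank() tie rules."""
    # Height in inches for every row, like 12*h_feet + h_inches in the CTE.
    rows = [(f, l, 12 * feet + inches) for f, l, feet, inches in players]
    mine = [h for f, l, h in rows if f == fn and l == ln]
    if not mine:
        return None  # the SQL function would return no row
    # rank() = 1 + number of rows that sort strictly before this one,
    # so tied players share a rank and the next rank is skipped.
    return 1 + sum(1 for _, _, h in rows if h > mine[0])


print(rank_by_height(PLAYERS, "C", "D"))  # 2
```

The ranking is computed over all players first and filtered to one player afterwards, which is exactly why the SQL version ranks in a subquery and applies the name filter outside it.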

Reuse function call inside select statement

Simplified example of the problem:
select p.id,
p.name,
-- other columns from joined tables
decode(get_complicated_number(p.id), null, null, 'The number is: ' || get_complicated_number(p.id))
from some_table p
-- join other tables and WHERE clause
It includes get_complicated_number call which queries multiple tables. I wasn't able to write it as a JOIN statement that would be as fast and as easy to maintain as a separate function so far.
Currently the function is called twice in case its return value is not NULL.
In reality I have an XML generation package that gets the data with a select:
select distinct xmlAgg
(
xmlelement
(
"TestElement",
xmlelement("Id", p.id),
xmlelement("Name", p.name),
-- other elements from joined tables
decode(get_complicated_number(p.id), null, null, xmlelement("ComplicatedNum", get_complicated_number(p.id)))
)
)
from some_table p
-- join other tables and WHERE clause
Is there a way to make it only one call and still avoid creating an empty element on NULL?
You can use WITH Syntax (Common Table Expressions) as:
with complicated_number as (
select get_complicated_number(p.id) as num from some_table p
) select distinct xmlAgg
--...
decode(complicated_number.num, null, null, xmlelement("ComplicatedNum", complicated_number.num))
from complicated_number
A common table expression (CTE) is a named temporary result set that exists within the scope of a single statement and that can be referred to later within that statement, possibly multiple times.
user7294900's answer is good, but if it's hard to combine with your existing joins, here's an alternate version with an inline view instead of a CTE.
select distinct xmlAgg
(
xmlelement
(
"TestElement",
xmlelement("Id", p2.id),
xmlelement("Name", p2.name),
-- other elements from joined tables
decode(p2.num, null, null, xmlelement("ComplicatedNum", p2.num))
)
)
from (
select p.id, p.name, get_complicated_number(p.id) as num
from some_table p
) p2
-- join other tables to p2. or put them inside it.
If you want help with adding your existing joins to these example queries, you might need to edit your question and add your other tables and WHERE clauses.

How to get Records based on multiple columns from a table

Consider the following table.
From the above table I want to select the Middle BFS_SCORE per LN_LOAN_ID and BR_ID. There are some LN_LOAN_ID with single score.
As an example for the above table the output I need is as below.
Please let me know how this can be achieved.
To handle cases where there are two scores for a unique pair of LN_LOAN_ID, BR_ID, you need a median, as there is no middle value for BFS_SCORE.
Postgres solution:
Create a median aggregate function following Postgres wiki:
CREATE OR REPLACE FUNCTION _final_median(NUMERIC[])
RETURNS NUMERIC AS
$$
SELECT AVG(val)
FROM (
SELECT val
FROM unnest($1) val
ORDER BY 1
LIMIT 2 - MOD(array_upper($1, 1), 2)
OFFSET CEIL(array_upper($1, 1) / 2.0) - 1
) sub;
$$
LANGUAGE 'sql' IMMUTABLE;
CREATE AGGREGATE median(NUMERIC) (
SFUNC=array_append,
STYPE=NUMERIC[],
FINALFUNC=_final_median,
INITCOND='{}'
);
Then your query would look as simple as this:
select
ln_loan_id,
median(bfs_score) as bfs_score,
br_id
from yourtable
group by ln_loan_id, br_id
But the tricky part comes with score_order. If a pair has two scores and you really need a median, not the middle value, then no row carries the calculated score, so score_order will be null. Otherwise, join back to your table to retrieve score_order for the "middle" row:
select
t1.ln_loan_id, t1.bfs_score, t1.br_id, t2.score_order
from (
select
ln_loan_id,
median(bfs_score) as bfs_score,
br_id
from yourtable
group by ln_loan_id, br_id
) t1
left join yourtable t2 on
t1.ln_loan_id = t2.ln_loan_id
and t1.br_id = t2.br_id
and t1.bfs_score = t2.bfs_score
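The arithmetic inside _final_median can be traced with a short Python sketch. It mirrors the LIMIT / OFFSET expressions of the SQL function and groups rows the way the query does; the sample rows in the usage note and the names final_median and median_per_group are hypothetical, and each group is assumed to be non-empty.

```python
import math
from collections import defaultdict


def final_median(values):
    """Python rendering of _final_median: average the middle one or two values."""
    ordered = sorted(values)
    n = len(ordered)
    offset = math.ceil(n / 2) - 1   # OFFSET CEIL(n / 2.0) - 1
    limit = 2 - n % 2               # LIMIT 2 - MOD(n, 2)
    middle = ordered[offset:offset + limit]
    return sum(middle) / len(middle)


def median_per_group(rows):
    """rows: iterable of (ln_loan_id, br_id, bfs_score) tuples."""
    groups = defaultdict(list)
    for loan_id, br_id, score in rows:
        groups[(loan_id, br_id)].append(score)
    return {key: final_median(scores) for key, scores in groups.items()}
```

For example, a group with scores 600, 700 and 800 yields 700, which exists as a row, while a group with only 650 and 750 also yields 700, which matches no row. That second case is where the join back to the table finds nothing.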

PLpgSQL (or ANSI SQL?) Conditional calculation on a column

I want to write a stored procedure that performs a conditional calculation on a column. Ideally the implementation of the SP will be db agnostic - if possible. If not the underlying db is PostgreSQL (v8.4), so that takes precedence.
The underlying tables being queried looks like this:
CREATE TABLE treatment_def ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL
);
CREATE TABLE foo_group_def ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL
);
CREATE TABLE foo ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL,
trtmt_id INT REFERENCES treatment_def(id) ON DELETE RESTRICT,
foo_grp_id INT REFERENCES foo_group_def(id) ON DELETE RESTRICT,
is_male BOOLEAN NOT NULL,
cost REAL NOT NULL
);
I want to write a SP that returns the following 'table' result set:
treatment_name, foo_group_name, averaged_cost
where averaged cost is calculated differently, depending on whether the row's is_male flag is set to true or false.
For the purpose of this question, let's assume that if the is_male flag is set to true, then the averaged cost is calculated as the SUM of the cost values for the grouping, and if the is_male flag is set to false, then the averaged cost is calculated as the AVERAGE of the cost values for the grouping.
(Obviously) the data is being grouped by trtmt_id, foo_grp_id (and is_male?).
I have a rough idea about how to write the SQL if there was no conditional test on the is_male flag. However, I could do with some help in writing the SP as defined above.
Here is my first attempt:
CREATE TYPE FOO_RESULT AS (treatment_name VARCHAR(16), foo_group_name VARCHAR(64), averaged_cost DOUBLE);
// Outline plpgsql (Pseudo code)
CREATE FUNCTION somefunc() RETURNS SETOF FOO_RESULT AS $$
BEGIN
RETURN QUERY SELECT t.name treatment_name, g.name group_name, averaged_cost FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY f.trtmt_id, f.foo_grp_id;
END;
$$ LANGUAGE plpgsql;
I would appreciate some help on how to write this SP correctly to implement the conditional calculation in the column results
Could look like this:
CREATE FUNCTION somefunc()
RETURNS TABLE (
treatment_name varchar(16)
, foo_group_name varchar(16)
, averaged_cost double precision)
AS
$BODY$
SELECT t.name -- AS treatment_name
, g.name -- AS group_name
, CASE WHEN f.is_male THEN sum(f.cost)
ELSE avg(f.cost) END -- AS averaged_cost
FROM foo f
JOIN treatment_def t ON t.id = f.trtmt_id
JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
$BODY$ LANGUAGE sql;
Major points
I used an sql function, not plpgsql. You can use either, I just did it to shorten the code. plpgsql might be slightly faster, because the query plan is cached.
I skipped the custom composite type. You can do that simpler with RETURNS TABLE.
I would generally advise to use the data type text instead of varchar(n). Makes your life easier.
Be careful not to use names of the RETURN parameter without table-qualifying (tbl.col) in the function body, or you will create naming conflicts. That is why I commented the aliases.
I adjusted the GROUP BY clause. The original didn't work. (Neither does the one in @Ken's answer.)
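The effect of adding is_male to the GROUP BY can be checked with a small runnable sketch. It assumes the table layout from the question, uses Python's standard-library sqlite3 module as a stand-in for PostgreSQL, and runs the query directly rather than wrapping it in a function; the sample rows and the names QUERY, make_foo_db and averaged_costs are hypothetical.

```python
import sqlite3

# Same shape as the query in the function body above.
QUERY = """
    SELECT t.name AS treatment_name,
           g.name AS foo_group_name,
           CASE WHEN f.is_male THEN sum(f.cost) ELSE avg(f.cost) END AS averaged_cost
    FROM foo f
    JOIN treatment_def t ON t.id = f.trtmt_id
    JOIN foo_group_def g ON g.id = f.foo_grp_id
    GROUP BY t.name, g.name, f.is_male
    ORDER BY t.name, g.name, f.is_male
"""


def make_foo_db():
    """Hypothetical demo data: two male and two female rows in one group."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE treatment_def (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE foo_group_def (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE foo (id INTEGER PRIMARY KEY,
                          name TEXT NOT NULL,
                          trtmt_id INTEGER REFERENCES treatment_def(id),
                          foo_grp_id INTEGER REFERENCES foo_group_def(id),
                          is_male BOOLEAN NOT NULL,
                          cost REAL NOT NULL);
        INSERT INTO treatment_def VALUES (1, 'trt1');
        INSERT INTO foo_group_def VALUES (1, 'grp1');
        INSERT INTO foo (name, trtmt_id, foo_grp_id, is_male, cost) VALUES
            ('m1', 1, 1, 1, 10.0), ('m2', 1, 1, 1, 20.0),
            ('f1', 1, 1, 0, 10.0), ('f2', 1, 1, 0, 30.0);
    """)
    return conn


def averaged_costs(conn):
    """One row per (treatment, group, is_male): avg for false, sum for true."""
    return conn.execute(QUERY).fetchall()
```

Because is_male is part of the grouping, each treatment and group pair produces up to two rows: here the female rows average to 20.0 and the male rows sum to 30.0.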
You should be able to use a CASE statement:
SELECT t.name treatment_name, g.name group_name,
CASE is_male WHEN true then SUM(cost)
ELSE AVG(cost) END AS averaged_cost
FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
I'm not familiar with PLpgSQL, so I'm not sure of the exact syntax for the BOOLEAN column, but the above should at least get you started in the right direction.