PLpgSQL (or ANSI SQL?) Conditional calculation on a column

I want to write a stored procedure that performs a conditional calculation on a column. Ideally the implementation of the SP will be db-agnostic, if possible. If not, the underlying db is PostgreSQL (v8.4), so that takes precedence.
The underlying tables being queried look like this:
CREATE TABLE treatment_def (
  id   SERIAL PRIMARY KEY,
  name VARCHAR(16) NOT NULL
);
CREATE TABLE foo_group_def (
  id   SERIAL PRIMARY KEY,
  name VARCHAR(16) NOT NULL
);
CREATE TABLE foo (
  id         SERIAL PRIMARY KEY,
  name       VARCHAR(16) NOT NULL,
  trtmt_id   INT REFERENCES treatment_def(id) ON DELETE RESTRICT,
  foo_grp_id INT REFERENCES foo_group_def(id) ON DELETE RESTRICT,
  is_male    BOOLEAN NOT NULL,
  cost       REAL NOT NULL
);
I want to write an SP that returns the following 'table' result set:
treatment_name, foo_group_name, averaged_cost
where averaged_cost is calculated differently, depending on whether the row's *is_male* flag is set to true or false.
For the purpose of this question, let's assume that if the is_male flag is set to true, then the averaged cost is calculated as the SUM of the cost values for the grouping, and if the is_male flag is set to false, then the cost value is calculated as the AVERAGE of the cost values for the grouping.
(Obviously) the data is being grouped by trtmt_id, foo_grp_id (and is_male?).
I have a rough idea about how to write the SQL if there were no conditional test on the is_male flag. However, I could do with some help in writing the SP as defined above.
Here is my first attempt:
CREATE TYPE FOO_RESULT AS (treatment_name VARCHAR(16), foo_group_name VARCHAR(64), averaged_cost double precision);
-- Outline plpgsql (pseudo code)
CREATE FUNCTION somefunc() RETURNS SETOF FOO_RESULT AS $$
BEGIN
RETURN QUERY SELECT t.name treatment_name, g.name group_name, averaged_cost FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY f.trtmt_id, f.foo_grp_id;
END;
$$ LANGUAGE plpgsql;
I would appreciate some help on how to write this SP correctly to implement the conditional calculation in the column results.

Could look like this:
CREATE FUNCTION somefunc()
RETURNS TABLE (
treatment_name varchar(16)
, foo_group_name varchar(16)
, averaged_cost double precision)
AS
$BODY$
SELECT t.name -- AS treatment_name
, g.name -- AS group_name
, CASE WHEN f.is_male THEN sum(f.cost)
ELSE avg(f.cost) END -- AS averaged_cost
FROM foo f
JOIN treatment_def t ON t.id = f.trtmt_id
JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
$BODY$ LANGUAGE sql;
Major points
I used an SQL function, not plpgsql. You can use either; I just did it to shorten the code. plpgsql might be slightly faster, because the query plan can be cached. A plpgsql variant is sketched after these points.
I skipped the custom composite type. You can do that simpler with RETURNS TABLE.
I would generally advise to use the data type text instead of varchar(n). Makes your life easier.
Be careful not to use names of the RETURN parameter without table-qualifying (tbl.col) in the function body, or you will create naming conflicts. That is why I commented the aliases.
I adjusted the GROUP BY clause. The original didn't work. (Neither does the one in @Ken's answer.)
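For comparison, here is a minimal plpgsql sketch of the same function (the name somefunc_plpgsql is made up to avoid clashing with the SQL version; the query body is identical, just wrapped in RETURN QUERY):
CREATE FUNCTION somefunc_plpgsql()
RETURNS TABLE (
treatment_name varchar(16)
, foo_group_name varchar(16)
, averaged_cost double precision)
AS
$BODY$
BEGIN
RETURN QUERY
SELECT t.name
, g.name
, CASE WHEN f.is_male THEN sum(f.cost)
ELSE avg(f.cost) END
FROM foo f
JOIN treatment_def t ON t.id = f.trtmt_id
JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
END
$BODY$ LANGUAGE plpgsql;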

You should be able to use a CASE statement:
SELECT t.name treatment_name, g.name group_name,
CASE is_male WHEN true THEN SUM(cost)
ELSE AVG(cost) END AS averaged_cost
FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
I'm not familiar with PLpgSQL, so I'm not sure of the exact syntax for the BOOLEAN column, but the above should at least get you started in the right direction.

Related

Delete rows depending on values from another table

How can I delete rows from my customer table depending on values from another table, let's say orders?
If a customer has no active orders, they should be able to be deleted from the DB along with their rows (done using CASCADE). However, if they have any active orders at all, they can't be deleted.
I thought about a PLPGSQL function, then using it in a trigger to check, but I'm lost. I have a basic block of SQL shown below from my first idea of deleting the record accordingly. But it doesn't work properly: it deletes the customer regardless of status, because a single cancelled order is enough for this function, rather than all orders being cancelled.
CREATE OR REPLACE FUNCTION DelCust(int)
RETURNS void AS $body$
DELETE FROM Customer
WHERE CustID IN (
SELECT CustID
FROM Order
WHERE Status = 'C'
AND CustID = $1
);
$body$
LANGUAGE SQL;
SELECT * FROM Customer;
I have also tried to use a PLPGSQL function returning a trigger then using a trigger to help with the deletion and checks, but I'm lost on it. Any thoughts on how to fix this? Any good sources for further reading?
You were very close. I suggest you use NOT IN instead of IN:
DELETE FROM Customer
WHERE CustID = $1 AND
CustID NOT IN (SELECT DISTINCT CustID
FROM Order
WHERE Status = 'A');
I'm guessing here that Status = 'A' means "active order". If it's something else change the code appropriately.
Your basic function can work like this:
CREATE OR REPLACE FUNCTION del_cust(_custid int)
RETURNS int --③
LANGUAGE sql AS
$func$
DELETE FROM customer c
WHERE c.custid = $1
AND NOT EXISTS ( -- ①
SELECT FROM "order" o -- ②
WHERE o.status = 'C' -- meaning "active"?
AND o.custid = $1
)
RETURNING c.custid; -- ③
$func$;
① Avoid NOT IN if you can. It's inefficient and potentially treacherous. Related:
Select rows which are not present in other table
② "order" is a reserved word in SQL. Better chose a legal identifier, or you have to always double-quote.
③ I made it RETURNS int to signify whether the row was actually deleted. The function returns NULL if the DELETE does not go through. Optional addition.
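A quick usage sketch of that convention (the customer id 123 is made up):
SELECT del_cust(123); -- returns 123 if the row was deleted, NULL if the delete was blocked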
For a trigger implementation, consider this closely related answer:
Foreign key constraints in many-to-many relationships

How to find intersecting geographies between two tables recursively

I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables.
geographies may have 150,000,000 rows, paths may have 10,000,000 rows:
CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id));
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id));
Given an array/set of ids for table geographies, what is the "best" way of finding all intersecting paths and geometries?
In other words, if an initial geography has a corresponding intersecting path we need to also find all other geographies that this path intersects. From there, we need to find all other paths that these newly found geographies intersect, and so on until we've found all possible intersections.
The initial geography ids (our input) may be anywhere from 0 to 700. With an average around 40.
Minimum intersections will be 0, max will be about 1000. Average likely around 20, typically less than 100 connected.
I've created a function that does this, but I'm new to GIS in PostGIS, and Postgres in general. I've posted my solution as an answer to this question.
I feel like there should be a more elegant and faster way of doing this than what I've come up with.
Your function can be radically simplified.
Setup
I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().
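The one-time conversion could look like this (a sketch only; it relies on the path -> geometry cast mentioned above, and you should test it on a copy of the table first):
ALTER TABLE paths
ALTER COLUMN path TYPE geography
USING path::geometry::geography;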
My answer is based on these adapted tables:
CREATE TABLE paths (
id uuid PRIMARY KEY
, path geography NOT NULL
);
CREATE TABLE geographies (
id uuid PRIMARY KEY
, geography geography NOT NULL
, fk_id text NOT NULL
);
Everything works with data type geometry for both columns just as well. geography is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
Solution 1: Your function optimized
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
RETURNS TABLE(id uuid, type text)
LANGUAGE plpgsql AS
$func$
DECLARE
_row_ct int;
_loop_ct int := 0;
BEGIN
CREATE TEMP TABLE _geo ON COMMIT DROP AS -- dropped at end of transaction
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
FROM geographies g
WHERE g.fk_id = ANY(_fk_ids);
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return empty result immediately
RETURN; -- exit function
END IF;
CREATE TEMP TABLE _path ON COMMIT DROP AS
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path); -- no dupes yet
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return _geo immediately
RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
RETURN;
END IF;
ALTER TABLE _geo ADD CONSTRAINT g_uni UNIQUE (id); -- required for UPSERT
ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);
LOOP
_loop_ct := _loop_ct + 1;
INSERT INTO _geo(id, geography, loop_ct)
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
FROM _path p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE p.loop_ct = _loop_ct - 1 -- only use last round!
ON CONFLICT ON CONSTRAINT g_uni DO NOTHING; -- eliminate new dupes
EXIT WHEN NOT FOUND;
INSERT INTO _path(id, path, loop_ct)
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE g.loop_ct = _loop_ct - 1
ON CONFLICT ON CONSTRAINT p_uni DO NOTHING;
EXIT WHEN NOT FOUND;
END LOOP;
RETURN QUERY
SELECT g.id, text 'geo' FROM _geo g
UNION ALL
SELECT p.id, text 'path' FROM _path p;
END
$func$;
Call:
SELECT * FROM public.function_name('{foo,bar}');
Much faster than what you have.
Major points
You based queries on the whole set, instead of the latest additions to the set only. This gets increasingly slower with every loop without need. I added a loop counter (loop_ct) to avoid redundant work.
Be sure to have spatial GiST indexes on geographies.geography and paths.path:
CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
CREATE INDEX paths_path_gix ON paths USING GIST (path);
Since Postgres 9.5, index-only scans are an option for GiST indexes. You might add id as second index column. The benefit depends on many factors; you'd have to test. However, there is no fitting GiST operator class for the uuid type. It would work with bigint after installing the extension btree_gist (a hypothetical sketch follows the links):
Postgres multi-column index (integer, boolean, and array)
Multicolumn index on 3 fields with heterogenous data types
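Purely for illustration, if id were bigint instead of uuid (it is not, so this won't work on the schema above as-is):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX geo_geo_id_gix ON geographies USING GIST (geography, id); -- hypothetical: requires a GiST-indexable id type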
Have a fitting index on g.fk_id, too. Again, a multicolumn index on (fk_id, id, geography) might pay if you can get index-only scans out of it; with a default btree index, fk_id must be the first index column. It pays especially if you run the query often, rarely update the table, and table rows are much wider than the index.
You can initialize variables at declaration time. Only needed once after the rewrite.
ON COMMIT DROP drops the temp tables at the end of the transaction automatically. So I removed dropping tables explicitly. But you get an exception if you call the function in the same transaction twice. In the function I would check for existence of the temp table and use TRUNCATE in this case. Related:
How to check if a table exists in a given schema
Use GET DIAGNOSTICS to get the row count instead of running another query for the count.
Count rows affected by DELETE
You need GET DIAGNOSTICS. CREATE TABLE does not set FOUND (as is mentioned in the manual).
It's faster to add an index or PK / UNIQUE constraint after filling the table, and not before we actually need it.
ON CONFLICT ... DO ... is the simpler and cheaper way for UPSERT since Postgres 9.5.
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.
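A self-contained toy example of both arbiter forms (demo_pkey is the name Postgres gives the PK constraint by default):
CREATE TEMP TABLE demo (id int PRIMARY KEY, val text);
INSERT INTO demo VALUES (1, 'a');
INSERT INTO demo VALUES (1, 'b')
ON CONFLICT (id) DO NOTHING; -- simple form: unique index inference
INSERT INTO demo VALUES (1, 'c')
ON CONFLICT ON CONSTRAINT demo_pkey DO NOTHING; -- explicit arbiter constraint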
It may pay to run ANALYZE on temporary tables manually, so Postgres can find the best query plan. (But I don't think you need it in your case.)
Are regular VACUUM ANALYZE still recommended under 9.1?
_geo_ct - _geographyLength > 0 is an awkward and more expensive way of saying _geo_ct > _geographyLength. But that's gone completely now.
Don't quote the language name. Just LANGUAGE plpgsql.
Your function parameter is varchar[] for an array of fk_id, but you later commented:
It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15).
I don't know s2cell id at level 15, but ideally you pass an array of matching data type, or if that's not an option default to text[].
Also since you commented:
There are always exactly 13 fk_ids passed in.
This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be:
CREATE OR REPLACE FUNCTION public.function_name(VARIADIC _fk_ids text[]) ...
Details:
Pass multiple values in single parameter
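Calls to the VARIADIC version could then look like this (a sketch; the second form passes a ready-made array):
SELECT * FROM public.function_name('foo', 'bar', 'baz');
SELECT * FROM public.function_name(VARIADIC '{foo,bar,baz}'::text[]);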
Solution 2: Plain SQL with recursive CTE
It's hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:
WITH RECURSIVE cte AS (
SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
FROM geographies g
WHERE g.fk_id = ANY($fk_ids) -- your input array here
UNION
SELECT COALESCE(g.id, p.id), g.geography::text, p.path::text
, CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
FROM cte c
LEFT JOIN paths p ON c.type = 'geo'
AND ST_Intersects(c.geography::geography, p.path)
LEFT JOIN geographies g ON c.type = 'path'
AND ST_Intersects(g.geography, c.path::geography)
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
)
SELECT id, type FROM cte;
That's all.
You need the same indexes as above. You might wrap it into an SQL function for repeated use.
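Such a wrapper might look like this (a sketch; the function name intersecting_ids is made up, and the body is the query above with the parameter in place of the array literal):
CREATE OR REPLACE FUNCTION public.intersecting_ids(_fk_ids text[])
RETURNS TABLE (id uuid, type text)
LANGUAGE sql AS
$func$
WITH RECURSIVE cte AS (
SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
FROM geographies g
WHERE g.fk_id = ANY(_fk_ids)
UNION
SELECT COALESCE(g.id, p.id), g.geography::text, p.path::text
, CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END
FROM cte c
LEFT JOIN paths p ON c.type = 'geo'
AND ST_Intersects(c.geography::geography, p.path)
LEFT JOIN geographies g ON c.type = 'path'
AND ST_Intersects(g.geography, c.path::geography)
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
)
SELECT c.id, c.type FROM cte c;
$func$;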
Major additional points
The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by virtue of (id, type) alone, we can ignore the geography columns for this. Cast back to geography for the join. Shouldn't cost too much extra.
We need two LEFT JOINs so as not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done yet:
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
This works because duplicate findings are excluded from the temporary intermediate table. The manual:
For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
So which is faster?
The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.
See the OP's feedback in the comment.
I figured it'd be good to post my own solution here even if it isn't optimal.
Here is what I came up with (using Steve Chambers' advice):
CREATE OR REPLACE FUNCTION public.function_name(
_fk_ids character varying[])
RETURNS TABLE(id uuid, type character varying)
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE
ROWS 1000.0
AS $function$
DECLARE
_pathLength bigint;
_geographyLength bigint;
_currentPathLength bigint;
_currentGeographyLength bigint;
BEGIN
DROP TABLE IF EXISTS _pathIds;
DROP TABLE IF EXISTS _geographyIds;
CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);
-- get all geographies in the specified _fk_ids
INSERT INTO _geographyIds
SELECT g.id
FROM geographies g
WHERE g.fk_id= ANY(_fk_ids);
_pathLength := 0;
_geographyLength := 0;
_currentPathLength := 0;
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- _pathIds := ARRAY[]::uuid[];
WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
_pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- gets all paths that have paths that intersect the geographies that aren't in the current list of path ids
INSERT INTO _pathIds
SELECT DISTINCT p.id
FROM paths p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE
g.id IN (SELECT _geographyIds.id FROM _geographyIds) AND
p.id NOT IN (SELECT _pathIds.id from _pathIds);
-- gets all geographies that have paths that intersect the paths that aren't in the current list of geography ids
INSERT INTO _geographyIds
SELECT DISTINCT g.id
FROM geographies g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE
p.id IN (SELECT _pathIds.id FROM _pathIds) AND
g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);
_currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
END LOOP;
RETURN QUERY
SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
UNION ALL
SELECT _pathIds.id, 'path' AS type FROM _pathIds;
END;
$function$;
Sample plot and data from this script
It can be done in a purely relational way with an aggregate function. This implementation uses one path table and one point table. Both are geometries. A point is easier to create test data with and to test than a generic geography, but it should be simple to adapt.
create table path (
path_text text primary key,
path geometry(linestring) not null
);
create table point (
point_text text primary key,
point geometry(point) not null
);
A type to keep the aggregate function's state:
create type mpath_mpoint as (
mpath geometry(multilinestring),
mpoint geometry(multipoint)
);
The state building function:
create or replace function path_point_intersect (
_i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$
with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
select array_agg((mpath, mpoint)::mpath_mpoint)
from (
select
st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
(
select st_collect(gd)
from (
select gd from st_dump(i.mpath) a (a, gd)
union all
select gd from st_dump(e.mpath) b (a, gd)
) s
) as mpath
from i inner join e on st_intersects(i.mpoint, e.mpoint)
union all
select i.mpoint, i.mpath
from i inner join e on not st_intersects(i.mpoint, e.mpoint)
union all
select e.mpoint, e.mpath
from e
where not exists (
select 1 from i
where st_intersects(i.mpoint, e.mpoint)
)
) s;
$$ language sql;
The aggregate:
create aggregate path_point_agg (mpath_mpoint) (
sfunc = path_point_intersect,
stype = mpath_mpoint[]
);
This query will return a set of multilinestring, multipoint strings containing the matched paths/points:
select st_astext(mpath), st_astext(mpoint)
from unnest((
select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
from (
select path, st_union(point) as mpoint
from
path
inner join
point on st_intersects(path, point)
group by path
) s
)) m (mpath, mpoint)
;
st_astext | st_astext
-----------------------------------------------------------+-----------------------------
MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6)) | MULTIPOINT(-8 -8,2 -8)
MULTILINESTRING((-7 -4,-3 4,-5 6)) | MULTIPOINT(-6 -2)

Robust approach for building SQL queries programmatically

I have to resort to raw SQL where the ORM falls short (using Django 1.7). The problem is that most of the queries end up being 80-90% similar. I cannot figure out a robust and secure way to build queries without sacrificing re-usability.
Is string concatenation the only way out, i.e. building parameter-less query strings using if-else conditions, then safely including the parameters using prepared statements (to avoid SQL injection)? I want to follow a simple approach for templating SQL in my project instead of re-inventing a mini ORM.
For example, consider this query:
SELECT id, name, team, rank_score
FROM
( SELECT id, name, team
ROW_NUMBER() OVER (PARTITION BY team
ORDER BY count_score DESC) AS rank_score
FROM
(SELECT id, name, team
COUNT(score) AS count_score
FROM people
INNER JOIN scores on (scores.people_id = people.id)
GROUP BY id, name, team
) AS count_table
) AS rank_table
WHERE rank_score < 3
How can I:
a) add optional WHERE constraint on people or
b) change INNER JOIN to LEFT OUTER or
c) change COUNT to SUM or
d) completely skip the OVER / PARTITION clause?
Better query
Fix the syntax, simplify and clarify:
SELECT *
FROM (
SELECT p.person_id, p.name, p.team, sum(s.score)::int AS score
, rank() OVER (PARTITION BY p.team ORDER BY sum(s.score) DESC)::int AS rnk
FROM person p
JOIN score s USING (person_id)
GROUP BY 1
) sub
WHERE rnk < 3;
Building on my updated table layout. See fiddle below.
You do not need the additional subquery. Window functions are executed after aggregate functions, so you can nest them as demonstrated.
While talking about "rank", you probably want to use rank(), not row_number(). See the demo after these points.
Assuming person.person_id is the PK, you can simplify the GROUP BY clause.
Be sure to table-qualify all column names that might be ambiguous.
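A quick demo of why rank() fits better than row_number() for tied scores:
SELECT v
, rank() OVER (ORDER BY v DESC) AS rnk
, row_number() OVER (ORDER BY v DESC) AS rn
FROM (VALUES (10), (10), (7)) t(v);
-- v  | rnk | rn
-- 10 |   1 |  1
-- 10 |   1 |  2
--  7 |   3 |  3
With rank(), both tied top scorers pass the filter rnk < 3; row_number() would order them arbitrarily.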
PL/pgSQL function
I would write a PL/pgSQL function that takes parameters for your variable parts. Implementing a - c of your points. d is unclear, leaving that for you to add.
CREATE TABLE person (
person_id serial PRIMARY KEY
, name text NOT NULL
, team text
);
CREATE TABLE score (
score_id serial PRIMARY KEY
, person_id int NOT NULL REFERENCES person
, score int NOT NULL
);
-- dummy values
WITH ins AS (
INSERT INTO person(name, team)
SELECT 'Jon Doe ' || p, t
FROM generate_series(1,20) p -- 20 guys x
, unnest ('{team1,team2,team3}'::text[]) t -- 3 teams
RETURNING person_id
)
INSERT INTO score(person_id, score)
SELECT i.person_id, (random() * 100)::int
FROM ins i, generate_series(1,5) g; -- 5 scores each
Function:
CREATE OR REPLACE FUNCTION f_demo(_agg text DEFAULT 'sum'
, _left_join bool DEFAULT false
, _where_name text DEFAULT null)
RETURNS TABLE(person_id int, name text, team text, score numeric, rnk bigint)
LANGUAGE plpgsql AS
$func$
DECLARE
_agg_op CONSTANT text[] := '{count, sum, avg}'; -- allowed agg functions
_sql text;
BEGIN
-- assert --
IF _agg ILIKE ANY (_agg_op) THEN
-- all good
ELSE
RAISE EXCEPTION '_agg must be one of %', _agg_op;
END IF;
-- query --
_sql := format('
SELECT *
FROM (
SELECT p.person_id, p.name, p.team, %1$s(s.score)::numeric AS score
, rank() OVER (PARTITION BY p.team ORDER BY %1$s(s.score) DESC) AS rnk
FROM person p
%2$s score s USING (person_id)
%3$s
GROUP BY 1
) sub
WHERE rnk < 3
ORDER BY team, rnk'
, _agg -- %1$s
, CASE WHEN _left_join THEN 'LEFT JOIN' ELSE 'JOIN' END -- %2$s
, CASE WHEN _where_name <> '' THEN 'WHERE p.name LIKE $1' ELSE '' END -- %3$s
);
-- debug: uncomment the RAISE to inspect the query string first
-- RAISE NOTICE '%', _sql;
-- execute
RETURN QUERY EXECUTE _sql
USING _where_name; -- $1
END
$func$;
Call:
SELECT * FROM f_demo();
SELECT * FROM f_demo('sum', TRUE, '%2');
SELECT * FROM f_demo('avg', FALSE);
SELECT * FROM f_demo(_where_name := '%1_'); -- named param
fiddle
Old sqlfiddle
You need a firm understanding of PL/pgSQL; else there is too much to explain here. You'll find related answers on SO under the plpgsql tag for every detail in the answer.
All parameters are treated safely, no SQL injection possible. See:
Define table and column names as arguments in a plpgsql function?
Table name as a PostgreSQL function parameter
Note in particular how a WHERE clause is added conditionally when _where_name is passed, with the positional parameter $1 in the query string. The value is passed to EXECUTE as a value with the USING clause. No type conversion, no escaping, no chance for SQL injection. Examples:
Row expansion via "*" is not supported here
SQL state: 42601 syntax error at or near "11"
Refactor a PL/pgSQL function to return the output of various SELECT queries
Use DEFAULT values for function parameters, so you are free to provide any or none. More:
Functions with variable number of input parameters
The manual on calling functions
The function format() is instrumental for building complex dynamic SQL strings in a safe and clean fashion.
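For example, %I quotes identifiers and %L quotes literals, which is what makes format() safe for dynamic SQL (the table name 'person' and the value are arbitrary here):
SELECT format('SELECT * FROM %I WHERE name = %L', 'person', $$O'Brien$$);
-- result: SELECT * FROM person WHERE name = 'O''Brien'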

Optimize INSERT / UPDATE / DELETE operation

I wonder if the following script can be optimized somehow. It writes a lot to disk because it deletes possibly up-to-date rows and reinserts them. I was thinking about applying something like "insert ... on duplicate key update" and found some possibilities for single-row updates, but I don't know how to apply it in the context of an INSERT INTO ... SELECT query.
CREATE OR REPLACE FUNCTION update_member_search_index() RETURNS VOID AS $$
DECLARE
member_content_type_id INTEGER;
BEGIN
member_content_type_id :=
(SELECT id FROM django_content_type
WHERE app_label='web' AND model='member');
DELETE FROM watson_searchentry WHERE content_type_id = member_content_type_id;
INSERT INTO watson_searchentry (engine_slug, content_type_id, object_id
, object_id_int, title, description, content
, url, meta_encoded)
SELECT 'default',
member_content_type_id,
web_member.id,
web_member.id,
web_member.name,
'',
web_user.email||' '||web_member.normalized_name||' '||web_country.name,
'',
'{}'
FROM web_member
INNER JOIN web_user ON (web_member.user_id = web_user.id)
INNER JOIN web_country ON (web_member.country_id = web_country.id)
WHERE web_user.is_active=TRUE;
END;
$$ LANGUAGE plpgsql;
EDIT: Schemas of web_member, watson_searchentry, web_user, web_country: http://pastebin.com/3tRVPPVi.
The main point is to update columns title and content in watson_searchentry. There is a trigger on the table that sets value of column search_tsv based on these columns.
(content_type_id, object_id_int) in watson_searchentry is a unique pair in the table, but at the moment the index is not present (there is no use for it).
This script should be run at most once a day for full rebuilds of search index and occasionally after importing some data.
Modified table definition
If you really need those columns to be NOT NULL and you really need the string 'default' as default for engine_slug, I would advise to introduce column defaults:
COLUMN | TYPE | Modifiers
-----------------+-------------------------+---------------------
id | INTEGER | NOT NULL DEFAULT ...
engine_slug | CHARACTER VARYING(200) | NOT NULL DEFAULT 'default'
content_type_id | INTEGER | NOT NULL
object_id | text | NOT NULL
object_id_int | INTEGER |
title | CHARACTER VARYING(1000) | NOT NULL
description | text | NOT NULL DEFAULT ''
content | text | NOT NULL
url | CHARACTER VARYING(1000) | NOT NULL DEFAULT ''
meta_encoded | text | NOT NULL DEFAULT '{}'
search_tsv | tsvector | NOT NULL
...
DDL statement would be:
ALTER TABLE watson_searchentry ALTER COLUMN engine_slug SET DEFAULT 'default';
Etc.
Then you don't have to insert those values manually every time.
Also: object_id text NOT NULL, object_id_int INTEGER? That's odd. I guess you have your reasons ...
I'll go with your updated requirement:
The main point is to update columns title and content in watson_searchentry
Of course, you must add a UNIQUE constraint to enforce your requirements:
ALTER TABLE watson_searchentry
ADD CONSTRAINT ws_uni UNIQUE (content_type_id, object_id_int);
The accompanying index will be used by this query for starters.
BTW, I almost never use varchar(n) in Postgres. Just text. Here's one reason.
Query with data-modifying CTEs
This could be rewritten as a single SQL query with data-modifying common table expressions, also called "writeable" CTEs. Requires Postgres 9.1 or later.
Additionally, this query only deletes what has to be deleted, and updates what can be updated.
WITH ctyp AS (
SELECT id AS content_type_id
FROM django_content_type
WHERE app_label = 'web'
AND model = 'member'
)
, sel AS (
SELECT ctyp.content_type_id
,m.id AS object_id_int
,m.id::text AS object_id -- explicit cast!
,m.name AS title
,concat_ws(' ', u.email,m.normalized_name,c.name) AS content
-- other columns have column default now.
FROM web_user u
JOIN web_member m ON m.user_id = u.id
JOIN web_country c ON c.id = m.country_id
CROSS JOIN ctyp
WHERE u.is_active
)
, del AS ( -- only if you want to del all other entries of same type
DELETE FROM watson_searchentry w
USING ctyp
WHERE w.content_type_id = ctyp.content_type_id
AND NOT EXISTS (
SELECT 1
FROM sel
WHERE sel.object_id_int = w.object_id_int
)
)
, up AS ( -- update existing rows
UPDATE watson_searchentry w
SET object_id = s.object_id
,title = s.title
,content = s.content
FROM sel s
WHERE w.content_type_id = s.content_type_id
AND w.object_id_int = s.object_id_int
)
-- insert new rows
INSERT INTO watson_searchentry (
content_type_id, object_id_int, object_id, title, content)
SELECT sel.* -- safe to use, because col list is defined accordingly above
FROM sel
LEFT JOIN watson_searchentry w1 USING (content_type_id, object_id_int)
WHERE w1.content_type_id IS NULL;
The subquery on django_content_type always returns a single value? Otherwise, the CROSS JOIN might cause trouble.
The first CTE sel gathers the rows to be inserted. Note how I pick matching column names to simplify things.
In the CTE del I avoid deleting rows that can be updated.
In the CTE up those rows are updated instead.
Accordingly, I avoid inserting rows that were not deleted before in the final INSERT.
Can easily be wrapped into an SQL or PL/pgSQL function for repeated use.
Not secure for heavy concurrent use. Much better than the function you had, but still not 100% robust against concurrent writes. But that's not an issue according to your updated info.
Replacing the UPDATEs with DELETE and INSERT may or may not be a lot more expensive. Internally every UPDATE results in a new row version anyways, due to the MVCC model.
Speed first
If you don't really care about preserving old rows, your simpler approach may be faster: delete everything and insert new rows. Also, wrapping it into a plpgsql function saves a bit of planning overhead. Here is your function, basically, with a couple of minor simplifications and observing the defaults added above:
CREATE OR REPLACE FUNCTION update_member_search_index()
RETURNS VOID AS
$func$
DECLARE
_ctype_id int := (
SELECT id
FROM django_content_type
WHERE app_label='web'
AND model = 'member'
); -- you can assign at declaration time. saves another statement
BEGIN
DELETE FROM watson_searchentry
WHERE content_type_id = _ctype_id;
INSERT INTO watson_searchentry
(content_type_id, object_id, object_id_int, title, content)
SELECT _ctype_id, m.id::text, m.id, m.name
, u.email || ' ' || m.normalized_name || ' ' || c.name
FROM web_member m
JOIN web_user u ON u.id = m.user_id
JOIN web_country c ON c.id = m.country_id
WHERE u.is_active;
END
$func$ LANGUAGE plpgsql;
I even refrain from using concat_ws(): It is safe against NULL values and simplifies code, but a bit slower than simple concatenation.
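The difference shows as soon as a NULL is involved:
SELECT 'a' || NULL || 'b'; -- NULL: plain concatenation swallows the whole string
SELECT concat_ws(' ', 'a', NULL, 'b'); -- 'a b': NULL values are simply skipped
Since the columns here come from INNER JOINs and are presumably NOT NULL, plain concatenation is safe in this case.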
Also:
There is a trigger on the table that sets value of column search_tsv
based on these columns.
It would be faster to incorporate the logic into this function - if this is the only time the trigger is needed. Else, it's probably not worth the fuss.
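A sketch of what incorporating it could look like, assuming (hypothetically) that the trigger only computes search_tsv from title and content with to_tsvector; you would also have to disable that trigger for this statement (its name is not shown in the question):
INSERT INTO watson_searchentry
(content_type_id, object_id, object_id_int, title, content, search_tsv)
SELECT _ctype_id, m.id::text, m.id, m.name
, u.email || ' ' || m.normalized_name || ' ' || c.name
, to_tsvector('english', m.name || ' ' || u.email || ' ' || m.normalized_name || ' ' || c.name)
FROM web_member m
JOIN web_user u ON u.id = m.user_id
JOIN web_country c ON c.id = m.country_id
WHERE u.is_active;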

Using stored procedure returning SETOF record in LEFT OUTER JOIN

I'm trying to call a stored procedure passing parameters in a left outer join like this:
select i.name,sp.*
from items i
left join compute_prices(i.id,current_date) as sp(price numeric(15,2),
discount numeric(5,2), taxes numeric(5,2)) on 1=1
where i.type = 404;
compute_prices() returns a setof record.
This is the message postgres shows:
ERROR: invalid reference to FROM-clause entry for table "i"
...left join compute_prices(i.id,current_date)...
HINT: There is an entry for table "i", but it cannot be referenced
from this part of the query.
This kind of query works in Firebird. Is there a way I could make it work by just using a query? I don't want to create another stored procedure that cycles through items and makes separate calls to compute_prices().
Generally, you can expand well-known row types (a.k.a. record types, complex types, composite types) with the simple syntax @Daniel supplied:
SELECT i.name, (compute_prices(i.id, current_date)).*
FROM items i
WHERE i.type = 404;
However, if your description is accurate ...
The compute_prices sp returns a setof record.
... we are dealing with anonymous records. Postgres does not know how to expand anonymous records and throws an EXCEPTION in despair:
ERROR: a column definition list is required for functions returning "record"
PostgreSQL 9.3
There is a solution for that in Postgres 9.3: LATERAL, as mentioned by @a_horse in the comments:
SELECT i.name, sp.*
FROM items i
LEFT JOIN LATERAL compute_prices(i.id,current_date) AS sp (
price numeric(15,2)
,discount numeric(5,2)
,taxes numeric(5,2)
) ON TRUE
WHERE i.type = 404;
Details in the manual.
PostgreSQL 9.2 and earlier
Things get hairy. Here's a workaround: write a wrapper function that converts your anonymous records into a well-known type:
CREATE OR REPLACE FUNCTION compute_prices_wrapper(int, date)
RETURNS TABLE (
price numeric(15,2)
,discount numeric(5,2)
,taxes numeric(5,2)
) AS
$func$
SELECT * FROM compute_prices($1, $2)
AS t(price numeric(15,2)
,discount numeric(5,2)
,taxes numeric(5,2));
$func$ LANGUAGE sql;
Then you can use the simple solution by @Daniel and just drop in the wrapper function:
SELECT i.name, (compute_prices_wrapper(i.id, current_date)).*
FROM items i
WHERE i.type = 404;
PostgreSQL 8.3 and earlier
PostgreSQL 8.3 has just reached EOL and is unsupported as of now (Feb. 2013).
So you'd better upgrade if at all possible. But if you can't:
CREATE OR REPLACE FUNCTION compute_prices_wrapper(int, date
,OUT price numeric(15,2)
,OUT discount numeric(5,2)
,OUT taxes numeric(5,2))
RETURNS SETOF record AS
$func$
SELECT * FROM compute_prices($1, $2)
AS t(price numeric(15,2)
,discount numeric(5,2)
,taxes numeric(5,2));
$func$ LANGUAGE sql;
Works in later versions, too.
The proper solution would be to fix your function compute_prices() to return a well-known type to begin with. Functions returning SETOF record are generally a PITA. I only poke those with a five-meter pole.
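A sketch of that fix (parameter names are made up and the body is a stub; the point is the declared return type):
CREATE OR REPLACE FUNCTION compute_prices(_item_id int, _day date)
RETURNS TABLE (
price numeric(15,2)
, discount numeric(5,2)
, taxes numeric(5,2))
LANGUAGE sql AS
$func$
SELECT 100.00::numeric(15,2), 5.00::numeric(5,2), 19.00::numeric(5,2); -- stub: real calculation goes here
$func$;
With that in place, the simple (compute_prices(i.id, current_date)).* expansion works without a column definition list.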
Assuming the compute_prices function always returns a record with 3 prices, you could make its return type TABLE (price numeric(15,2), discount numeric(5,2), taxes numeric(5,2)), and then I believe what you want could be expressed as:
SELECT i.name, (compute_prices(i.id,current_date)).*
FROM items i
WHERE i.type=404;
Note that it seems to me that LEFT JOIN ON 1=1 does not differ from an unconstrained normal JOIN (or CROSS JOIN), and I interpreted the question as actually unrelated to the left join.
I believe Daniel's answer will also work, but I haven't tried it yet. I do know that I have an SP called list_failed_jobs2 in a schema called logging, and a dummy table called Dual (like in Oracle), and the following statement works for me:
select * from Dual left join
(select * from logging.list_failed_jobs2()) q on 1=1;
Note, the SP call will not work without the parens, the correlation (q), or the ON clause. My SP returns a SETOF also.
Thus, I suspect something like this will work for you:
select i.name,sp.*
from items i
left join (select * from compute_prices(i.id,current_date)) as sp on 1=1
where i.type = 404;
Hope that helps.