I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables.
geographies may have 150,000,000 rows, paths may have 10,000,000 rows:
CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id))
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id))
Given an array/set of ids for table geographies, what is the "best" way of finding all intersecting paths and geographies?
In other words, if an initial geography has a corresponding intersecting path we need to also find all other geographies that this path intersects. From there, we need to find all other paths that these newly found geographies intersect, and so on until we've found all possible intersections.
The number of initial geography ids (our input) may be anywhere from 0 to 700, with an average around 40.
Minimum intersections will be 0, max will be about 1000. Average likely around 20, typically less than 100 connected.
I've created a function that does this, but I'm new to GIS in PostGIS, and Postgres in general. I've posted my solution as an answer to this question.
I feel like there should be a more elegant and faster way of doing this than what I've come up with.
Your function can be radically simplified.
Setup
I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().
My answer is based on these adapted tables:
CREATE TABLE paths (
id uuid PRIMARY KEY
, path geography NOT NULL
);
CREATE TABLE geographies (
id uuid PRIMARY KEY
, geography geography NOT NULL
, fk_id text NOT NULL
);
Everything works with data type geometry for both columns just as well. geography is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
Solution 1: Your function optimized
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
RETURNS TABLE(id uuid, type text)
LANGUAGE plpgsql AS
$func$
DECLARE
_row_ct int;
_loop_ct int := 0;
BEGIN
CREATE TEMP TABLE _geo ON COMMIT DROP AS -- dropped at end of transaction
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
FROM geographies g
WHERE g.fk_id = ANY(_fk_ids);
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return empty result immediately
RETURN; -- exit function
END IF;
CREATE TEMP TABLE _path ON COMMIT DROP AS
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path); -- no dupes yet
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return _geo immediately
RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
RETURN;
END IF;
ALTER TABLE _geo ADD CONSTRAINT g_uni UNIQUE (id); -- required for UPSERT
ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);
LOOP
_loop_ct := _loop_ct + 1;
INSERT INTO _geo(id, geography, loop_ct)
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
FROM _path p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE p.loop_ct = _loop_ct - 1 -- only use last round!
ON CONFLICT ON CONSTRAINT g_uni DO NOTHING; -- eliminate new dupes
EXIT WHEN NOT FOUND;
INSERT INTO _path(id, path, loop_ct)
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE g.loop_ct = _loop_ct - 1
ON CONFLICT ON CONSTRAINT p_uni DO NOTHING;
EXIT WHEN NOT FOUND;
END LOOP;
RETURN QUERY
SELECT g.id, text 'geo' FROM _geo g
UNION ALL
SELECT p.id, text 'path' FROM _path p;
END
$func$;
Call:
SELECT * FROM public.function_name('{foo,bar}');
Much faster than what you have.
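The control flow of the function above can be sketched outside the database. This is only an illustration in Python, not the PL/pgSQL itself: the `intersections` argument is a hypothetical list of (geo_id, path_id) pairs standing in for rows where ST_Intersects(geography, path) is true, and the sets `new_geos` / `new_paths` play the role of the rows tagged with the previous loop_ct.

```python
from collections import defaultdict


def connected_ids(seed_geo_ids, intersections):
    """Alternate geo -> path -> geo expansion, using only the last round's additions.

    `intersections` is a hypothetical iterable of (geo_id, path_id) pairs that
    stands in for ST_Intersects(geography, path) being true.
    """
    paths_of_geo = defaultdict(set)
    geos_of_path = defaultdict(set)
    for geo_id, path_id in intersections:
        paths_of_geo[geo_id].add(path_id)
        geos_of_path[path_id].add(geo_id)

    geos = set(seed_geo_ids)
    paths = set()
    new_geos = set(geos)  # frontier: rows added in the previous round
    while new_geos:
        # Paths touched by the newest geographies, minus those already known.
        new_paths = {p for g in new_geos for p in paths_of_geo[g]} - paths
        if not new_paths:
            break
        paths |= new_paths
        # Geographies touched by the newest paths, minus those already known.
        new_geos = {g for p in new_paths for g in geos_of_path[p]} - geos
        geos |= new_geos
    return geos, paths


pairs = [("g1", "p1"), ("g2", "p1"), ("g2", "p2"), ("g3", "p2"), ("g9", "p9")]
found_geos, found_paths = connected_ids({"g1"}, pairs)
```

Each round only looks at the frontier, so the work per round is proportional to what was just added rather than to everything found so far.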
Major points
You based queries on the whole set, instead of the latest additions to the set only. This gets increasingly slower with every loop without need. I added a loop counter (loop_ct) to avoid redundant work.
Be sure to have spatial GiST indexes on geographies.geography and paths.path:
CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
CREATE INDEX paths_path_gix ON paths USING GIST (path);
Since Postgres 9.5 index-only scans would be an option for GiST indexes. You might add id as second index column. The benefit depends on many factors, you'd have to test. However, there is no fitting GiST operator class for the uuid type. It would work with bigint after installing the extension btree_gist:
Postgres multi-column index (integer, boolean, and array)
Multicolumn index on 3 fields with heterogenous data types
Have a fitting index on g.fk_id, too. Again, a multicolumn index on (fk_id, id, geography) might pay if you can get index-only scans out of it. Default btree index, fk_id must be first index column. Especially if you run the query often and rarely update the table and table rows are much wider than the index.
You can initialize variables at declaration time. Only needed once after the rewrite.
ON COMMIT DROP drops the temp tables at the end of the transaction automatically. So I removed dropping tables explicitly. But you get an exception if you call the function in the same transaction twice. In the function I would check for existence of the temp table and use TRUNCATE in this case. Related:
How to check if a table exists in a given schema
Use GET DIAGNOSTICS to get the row count instead of running another query for the count.
Count rows affected by DELETE
You need GET DIAGNOSTICS. CREATE TABLE does not set FOUND (as is mentioned in the manual).
It's faster to add an index or PK / UNIQUE constraint after filling the table. And not before we actually need it.
ON CONFLICT ... DO ... is the simpler and cheaper way for UPSERT since Postgres 9.5.
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.
It may help to ANALYZE temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)
Are regular VACUUM ANALYZE still recommended under 9.1?
_geo_ct - _geographyLength > 0 is an awkward and more expensive way of saying _geo_ct > _geographyLength. But that's gone completely now.
Don't quote the language name. Just LANGUAGE plpgsql.
Your function parameter is varchar[] for an array of fk_id, but you later commented:
It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15).
I don't know s2cell id at level 15, but ideally you pass an array of matching data type, or if that's not an option default to text[].
Also since you commented:
There are always exactly 13 fk_ids passed in.
This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be:
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids VARIADIC text[]) ...
Details:
Pass multiple values in single parameter
Solution 2: Plain SQL with recursive CTE
It's hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:
WITH RECURSIVE cte AS (
SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
FROM geographies g
WHERE g.fk_id = ANY($fk_ids) -- your input array here
UNION
SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text -- only one of p / g is matched per row
, CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
FROM cte c
LEFT JOIN paths p ON c.type = 'geo'
AND ST_Intersects(c.geography::geography, p.path)
LEFT JOIN geographies g ON c.type = 'path'
AND ST_Intersects(g.geography, c.path::geography)
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
)
SELECT id, type FROM cte;
That's all.
You need the same indexes as above. You might wrap it into an SQL function for repeated use.
Major additional points
The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by virtue of (id, type) alone, we can ignore the geography columns for this. Cast back to geography for the join. Shouldn't cost too much extra.
We need two LEFT JOIN so not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done, yet:
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
This works because duplicate findings are excluded from the temporary
intermediate table. The manual:
For UNION (but not UNION ALL), discard duplicate rows and rows that
duplicate any previous result row. Include all remaining rows in the
result of the recursive query, and also place them in a temporary
intermediate table.
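The quoted rule can be mimicked in a few lines of Python to see why the recursion stops. This is a rough model, not how Postgres implements it: `step` is a hypothetical stand-in for the recursive term, `working` models the temporary intermediate table, and PAIRS is made-up sample data for intersecting (geo, path) ids.

```python
def recursive_union(seed_rows, step):
    """Evaluate a recursive query with UNION (not UNION ALL) semantics.

    `step` is a hypothetical function mapping the current working table
    (a set of rows) to the rows produced by the recursive term.
    """
    result = set(seed_rows)   # every row emitted so far
    working = set(seed_rows)  # rows added in the previous iteration
    while working:
        produced = set(step(working))
        working = produced - result  # drop duplicates of any earlier row
        result |= working
    return result


PAIRS = [("g1", "p1"), ("g2", "p1"), ("g2", "p2")]  # made-up intersections


def step(rows):
    """Recursive term: geo rows find paths, path rows find geos."""
    out = set()
    for row_id, row_type in rows:
        for geo_id, path_id in PAIRS:
            if row_type == "geo" and geo_id == row_id:
                out.add((path_id, "path"))
            elif row_type == "path" and path_id == row_id:
                out.add((geo_id, "geo"))
    return out


closure = recursive_union({("g1", "geo")}, step)
```

Once an iteration produces nothing but rows that were already emitted, the working table is empty and the loop ends.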
So which is faster?
The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.*
See the OP's feedback in the comment.
I figured it'd be good to post my own solution here even if it isn't optimal.
Here is what I came up with (using Steve Chambers' advice):
CREATE OR REPLACE FUNCTION public.function_name(
_fk_ids character varying[])
RETURNS TABLE(id uuid, type character varying)
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE
ROWS 1000.0
AS $function$
DECLARE
_pathLength bigint;
_geographyLength bigint;
_currentPathLength bigint;
_currentGeographyLength bigint;
BEGIN
DROP TABLE IF EXISTS _pathIds;
DROP TABLE IF EXISTS _geographyIds;
CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);
-- get all geographies in the specified _fk_ids
INSERT INTO _geographyIds
SELECT g.id
FROM geographies g
WHERE g.fk_id= ANY(_fk_ids);
_pathLength := 0;
_geographyLength := 0;
_currentPathLength := 0;
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- _pathIds := ARRAY[]::uuid[];
WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
_pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- gets all paths that have paths that intersect the geographies that aren't in the current list of path ids
INSERT INTO _pathIds
SELECT DISTINCT p.id
FROM paths p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE
g.id IN (SELECT _geographyIds.id FROM _geographyIds) AND
p.id NOT IN (SELECT _pathIds.id from _pathIds);
-- gets all geographies that have paths that intersect the paths that aren't in the current list of geography ids
INSERT INTO _geographyIds
SELECT DISTINCT g.id
FROM geographies g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE
p.id IN (SELECT _pathIds.id FROM _pathIds) AND
g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);
_currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
END LOOP;
RETURN QUERY
SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
UNION ALL
SELECT _pathIds.id, 'path' AS type FROM _pathIds;
END;
$function$;
Sample plot and data from this script
It can be pure relational with an aggregate function. This implementation uses one path table and one point table. Both are geometries. The point is easier to create test data with and to test than a generic geography but it should be simple to adapt.
create table path (
path_text text primary key,
path geometry(linestring) not null
);
create table point (
point_text text primary key,
point geometry(point) not null
);
A type to keep the aggregate function's state:
create type mpath_mpoint as (
mpath geometry(multilinestring),
mpoint geometry(multipoint)
);
The state building function:
create or replace function path_point_intersect (
_i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$
with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
select array_agg((mpath, mpoint)::mpath_mpoint)
from (
select
st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
(
select st_collect(gd)
from (
select gd from st_dump(i.mpath) a (a, gd)
union all
select gd from st_dump(e.mpath) b (a, gd)
) s
) as mpath
from i inner join e on st_intersects(i.mpoint, e.mpoint)
union all
select i.mpoint, i.mpath
from i inner join e on not st_intersects(i.mpoint, e.mpoint)
union all
select e.mpoint, e.mpath
from e
where not exists (
select 1 from i
where st_intersects(i.mpoint, e.mpoint)
)
) s;
$$ language sql;
The aggregate:
create aggregate path_point_agg (mpath_mpoint) (
sfunc = path_point_intersect,
stype = mpath_mpoint[]
);
This query will return a set of multilinestring, multipoint strings containing the matched paths/points:
select st_astext(mpath), st_astext(mpoint)
from unnest((
select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
from (
select path, st_union(point) as mpoint
from
path
inner join
point on st_intersects(path, point)
group by path
) s
)) m (mpath, mpoint)
;
st_astext | st_astext
-----------------------------------------------------------+-----------------------------
MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6)) | MULTIPOINT(-8 -8,2 -8)
MULTILINESTRING((-7 -4,-3 4,-5 6)) | MULTIPOINT(-6 -2)
Related
Provided there is a long list of values, which happen to be values of attributes of records in a postgres-database.
I would like to create a query which finds out which of these values can not be found in the database.
I have no right to execute DDL-Statements and I would like to avoid procedural code.
Example:
the table might be
CREATE TABLE Test (
ID Integer,
attr varchar(30)
)
The list might be something like (but longer, about 240000 values)
ATTR
TestValue0
TestValue1
TestValue2
TestValue3
Using sed I can create and execute a statement
select count(*) from Test where attr in ('TestValue0',
'TestValue1','TestValue2','TestValue3')
This statement shows me, that not all of these values can be found in Test.
How can I formulate a query which tells me which of these unique values can not be found in the postgres-database?
For what you want to do, you can use left join, not in or not exists. But the key is that you need a derived table with the values you care about:
select v.attr
from (values ('TestValue0'), ('TestValue1'), ('TestValue2'), ('TestValue3')
) v(attr)
where not exists (select 1 from test t where t.attr = v.attr);
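The same query shape can be tried without any DDL rights on the real database. The sketch below uses Python's sqlite3 module with an in-memory database as a stand-in for Postgres; the helper name `missing_values` and the sample rows are made up for illustration. SQLite caps the number of bound parameters per statement, so a list of about 240000 values would presumably have to be checked in batches.

```python
import sqlite3


def missing_values(conn, candidates):
    """Return the candidates that have no matching row in test.attr."""
    if not candidates:
        return []
    # One "(?)" per candidate builds the derived table of values.
    placeholders = ", ".join(["(?)"] * len(candidates))
    sql = (
        f"WITH v(attr) AS (VALUES {placeholders}) "
        "SELECT v.attr FROM v "
        "WHERE NOT EXISTS (SELECT 1 FROM test t WHERE t.attr = v.attr) "
        "ORDER BY v.attr"
    )
    return [row[0] for row in conn.execute(sql, list(candidates))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, attr TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(1, "TestValue0"), (2, "TestValue2")],
)
wanted = ["TestValue0", "TestValue1", "TestValue2", "TestValue3"]
not_found = missing_values(conn, wanted)
```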
Suppose I have the following data:
Table some_table:
some_table_id | value | other_table_id
--------------------------------------
1 | foo | 1
2 | bar | 2
Table other_table:
other_table_id | value
----------------------
1 | foo
2 | bar
Here, some_table has a foreign key on its column other_table_id that references the column of the same name in other_table.
With the following query in PostgreSQL:
SELECT *
FROM some_table
WHERE other_table_id = 3;
As you can see, 3 does not exist in other_table, so this query will obviously return 0 results.
Without doing a second query, is there a way to know if the foreign key that I am using as a filter effectively does not exist in the other_table?
Ideally as an error that could later be parsed (as happens when doing an INSERT or an UPDATE with a wrong foreign key, for example).
You can exploit a feature of PL/pgSQL to implement this very cheaply:
CREATE OR REPLACE FUNCTION f_select_from_some_tbl(int)
RETURNS SETOF some_table AS
$func$
BEGIN
RETURN QUERY
SELECT *
FROM some_table
WHERE other_table_id = $1;
IF NOT FOUND THEN
RAISE WARNING 'Call with non-existing other_table_id >>%<<', $1;
END IF;
END
$func$ LANGUAGE plpgsql;
A final RETURN; is optional in this case.
The WARNING is only raised if your query doesn't return any rows. I am not raising an ERROR in the example, since this would roll back the whole transaction (but you can do that if it fits your needs).
We've added a code example to the manual with Postgres 9.3 to demonstrate this.
If you perform an INSERT or UPDATE on some_table, specifying an other_table_id value that does not in fact exist in other_table, then you will get an error arising from violation of the foreign key constraint. SELECT queries are therefore your primary concern.
One way you could address the issue with SELECT queries would be to transform your queries to perform an outer join with other_table, like so:
SELECT st.*
FROM
other_table ot
LEFT JOIN some_table st ON st.other_table_id = ot.other_table_id
WHERE ot.other_table_id = 3;
That query will always return at least one row if any other_table row has other_table_id = 3. In that case, if there is no matching some_table row, then it will return exactly one row, with that row having all columns NULL (given that it selects only columns from some_table), even columns that are declared not null.
If you want such queries to raise an error then you'll probably need to write a custom function to assist, but it can be done. I'd probably implement it in PL/pgSQL, using that language's RAISE statement.
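The three possible outcomes of the outer-join approach can be told apart on the client side. The following is a minimal sketch using Python's sqlite3 module as a stand-in for Postgres; the function name `lookup`, its status labels, and the extra other_table row with id 4 are assumptions added for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE other_table (other_table_id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE some_table (
        some_table_id INTEGER PRIMARY KEY,
        value TEXT,
        other_table_id INTEGER REFERENCES other_table
    );
    INSERT INTO other_table VALUES (1, 'foo'), (2, 'bar'), (4, 'baz');
    INSERT INTO some_table VALUES (1, 'foo', 1), (2, 'bar', 2);
""")


def lookup(conn, other_id):
    """Return (status, rows) for a filter on other_table_id."""
    rows = conn.execute(
        "SELECT st.some_table_id, st.value, st.other_table_id "
        "FROM other_table ot "
        "LEFT JOIN some_table st ON st.other_table_id = ot.other_table_id "
        "WHERE ot.other_table_id = ?",
        (other_id,),
    ).fetchall()
    if not rows:
        return ("missing", [])  # no such row in other_table
    if rows[0][0] is None:
        return ("empty", [])    # key exists; the join produced one all-NULL row
    return ("found", rows)
```

Here `lookup(conn, 3)` reports a missing key, while `lookup(conn, 4)` reports an existing key that no some_table row references.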
I wonder if the following script can be optimized somehow. It does write a lot to disk because it deletes possibly up-to-date rows and reinserts them. I was thinking about applying something like "insert ... on duplicate key update" and found some possibilities for single-row updates but I don't know how to apply it in the context of INSERT INTO ... SELECT query.
CREATE OR REPLACE FUNCTION update_member_search_index() RETURNS VOID AS $$
DECLARE
member_content_type_id INTEGER;
BEGIN
member_content_type_id :=
(SELECT id FROM django_content_type
WHERE app_label='web' AND model='member');
DELETE FROM watson_searchentry WHERE content_type_id = member_content_type_id;
INSERT INTO watson_searchentry (engine_slug, content_type_id, object_id
, object_id_int, title, description, content
, url, meta_encoded)
SELECT 'default',
member_content_type_id,
web_member.id,
web_member.id,
web_member.name,
'',
web_user.email||' '||web_member.normalized_name||' '||web_country.name,
'',
'{}'
FROM web_member
INNER JOIN web_user ON (web_member.user_id = web_user.id)
INNER JOIN web_country ON (web_member.country_id = web_country.id)
WHERE web_user.is_active=TRUE;
END;
$$ LANGUAGE plpgsql;
EDIT: Schemas of web_member, watson_searchentry, web_user, web_country: http://pastebin.com/3tRVPPVi.
The main point is to update columns title and content in watson_searchentry. There is a trigger on the table that sets value of column search_tsv based on these columns.
(content_type_id, object_id_int) in watson_searchentry is unique pair in the table but atm the index is not present (there is no use for it).
This script should be run at most once a day for full rebuilds of search index and occasionally after importing some data.
Modified table definition
If you really need those columns to be NOT NULL and you really need the string 'default' as default for engine_slug, I would advise introducing column defaults:
COLUMN | TYPE | Modifiers
-----------------+-------------------------+---------------------
id | INTEGER | NOT NULL DEFAULT ...
engine_slug | CHARACTER VARYING(200) | NOT NULL DEFAULT 'default'
content_type_id | INTEGER | NOT NULL
object_id | text | NOT NULL
object_id_int | INTEGER |
title | CHARACTER VARYING(1000) | NOT NULL
description | text | NOT NULL DEFAULT ''
content | text | NOT NULL
url | CHARACTER VARYING(1000) | NOT NULL DEFAULT ''
meta_encoded | text | NOT NULL DEFAULT '{}'
search_tsv | tsvector | NOT NULL
...
DDL statement would be:
ALTER TABLE watson_searchentry ALTER COLUMN engine_slug SET DEFAULT 'default';
Etc.
Then you don't have to insert those values manually every time.
Also: object_id text NOT NULL, object_id_int INTEGER? That's odd. I guess you have your reasons ...
I'll go with your updated requirement:
The main point is to update columns title and content in watson_searchentry
Of course, you must add a UNIQUE constraint to enforce your requirements:
ALTER TABLE watson_searchentry
ADD CONSTRAINT ws_uni UNIQUE (content_type_id, object_id_int);
The accompanying index will be used. By this query for starters.
BTW, I almost never use varchar(n) in Postgres. Just text. Here's one reason.
Query with data-modifying CTEs
This could be rewritten as a single SQL query with data-modifying common table expressions, also called "writeable" CTEs. Requires Postgres 9.1 or later.
Additionally, this query only deletes what has to be deleted, and updates what can be updated.
WITH ctyp AS (
SELECT id AS content_type_id
FROM django_content_type
WHERE app_label = 'web'
AND model = 'member'
)
, sel AS (
SELECT ctyp.content_type_id
,m.id AS object_id_int
,m.id::text AS object_id -- explicit cast!
,m.name AS title
,concat_ws(' ', u.email,m.normalized_name,c.name) AS content
-- other columns have column default now.
FROM web_user u
JOIN web_member m ON m.user_id = u.id
JOIN web_country c ON c.id = m.country_id
CROSS JOIN ctyp
WHERE u.is_active
)
, del AS ( -- only if you want to del all other entries of same type
DELETE FROM watson_searchentry w
USING ctyp
WHERE w.content_type_id = ctyp.content_type_id
AND NOT EXISTS (
SELECT 1
FROM sel
WHERE sel.object_id_int = w.object_id_int
)
)
, up AS ( -- update existing rows
UPDATE watson_searchentry w
SET object_id = s.object_id
,title = s.title
,content = s.content
FROM sel s
WHERE w.content_type_id = s.content_type_id
AND w.object_id_int = s.object_id_int
)
-- insert new rows
INSERT INTO watson_searchentry (
content_type_id, object_id_int, object_id, title, content)
SELECT sel.* -- safe to use, because col list is defined accordingly above
FROM sel
LEFT JOIN watson_searchentry w1 USING (content_type_id, object_id_int)
WHERE w1.content_type_id IS NULL;
The subquery on django_content_type always returns a single value? Otherwise, the CROSS JOIN might cause trouble.
The CTE sel gathers the rows to be inserted. Note how I pick matching column names to simplify things.
In the CTE del I avoid deleting rows that can be updated.
In the CTE up those rows are updated instead.
Accordingly, I avoid inserting rows that were not deleted before in the final INSERT.
Can easily be wrapped into an SQL or PL/pgSQL function for repeated use.
Not secure for heavy concurrent use. Much better than the function you had, but still not 100% robust against concurrent writes. But that's not an issue according to your updated info.
Replacing the UPDATEs with DELETE and INSERT may or may not be a lot more expensive. Internally every UPDATE results in a new row version anyways, due to the MVCC model.
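The split that the del, up and final INSERT steps perform can be written down as plain set arithmetic. This is a small Python sketch of the idea only, under the assumption that rows are keyed by object_id_int within one content type; `plan_sync` and the sample dictionaries are hypothetical names and data, not part of the query above.

```python
def plan_sync(existing, desired):
    """Split a refresh into deletes, updates and inserts.

    Both arguments map a key (think object_id_int) to a payload
    (think title and content).
    """
    # Keys only in the table: delete. Keys in both: update. Keys only in
    # the fresh selection: insert.
    to_delete = sorted(existing.keys() - desired.keys())
    to_update = {k: desired[k] for k in existing.keys() & desired.keys()}
    to_insert = {k: desired[k] for k in desired.keys() - existing.keys()}
    return to_delete, to_update, to_insert


existing = {1: ("old title", "x"), 2: ("b", "y")}
desired = {2: ("b2", "y"), 3: ("c", "z")}
deletes, updates, inserts = plan_sync(existing, desired)
```

Like the query, this updates every key present on both sides, whether or not the payload actually changed.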
Speed first
If you don't really care about preserving old rows, your simpler approach may be faster: Delete everything and insert new rows. Also, wrapping into a plpgsql function saves a bit of planning overhead. Your function basically, with a couple of minor simplifications and observing the defaults added above:
CREATE OR REPLACE FUNCTION update_member_search_index()
RETURNS VOID AS
$func$
DECLARE
_ctype_id int := (
SELECT id
FROM django_content_type
WHERE app_label='web'
AND model = 'member'
); -- you can assign at declaration time. saves another statement
BEGIN
DELETE FROM watson_searchentry
WHERE content_type_id = _ctype_id;
INSERT INTO watson_searchentry
(content_type_id, object_id, object_id_int, title, content)
SELECT _ctype_id, m.id, m.id::int,m.name
,u.email || ' ' || m.normalized_name || ' ' || c.name
FROM web_member m
JOIN web_user u ON u.id = m.user_id
JOIN web_country c ON c.id = m.country_id
WHERE u.is_active;
END
$func$ LANGUAGE plpgsql;
I even refrain from using concat_ws(): It is safe against NULL values and simplifies code, but a bit slower than simple concatenation.
Also:
There is a trigger on the table that sets value of column search_tsv
based on these columns.
It would be faster to incorporate the logic into this function - if this is the only time the trigger is needed. Else, it's probably not worth the fuss.
I am trying to run a cursor on full join of two tables but having problem accessing the columns in cursor.
CREATE TABLE APPLE(
MY_ID VARCHAR(2) NOT NULL,
A_TIMESTAMP TIMESTAMP,
A_NAME VARCHAR(10)
);
CREATE TABLE BANANA(
MY_ID VARCHAR(2) NOT NULL,
B_TIMESTAMP TIMESTAMP,
B_NAME VARCHAR(10)
);
I have written a Full join to return all related rows from tables A and B where any of the two timestamps are in future.
i.e. if a row in table APPLE has timestamp in future then fetch row from APPLE joined with row from BANANA on MY_ID
Similarly, if a row in table BANANA has timestamp in future then fetch row from BANANA joined with row from APPLE on MY_ID
This full join works for me.
select * from APPLE a full join BANANA b on a.MY_ID = b.MY_ID where
(
a.A_TIMESTAMP > current_timestamp
or b.B_TIMESTAMP > current_timestamp
);
Now I want to iterate over each joined record and do some processing. I am able to access the columns which are present in only one table, but I get an error when trying to access column names which are the same in both tables, for example MY_ID in this case.
create or replace
PROCEDURE testProc(someDate IN DATE)
AS
CURSOR c1 IS
select * from APPLE a full join BANANA b on a.MY_ID = b.MY_ID where
(
a.A_TIMESTAMP > current_timestamp
or b.B_TIMESTAMP > current_timestamp
);
BEGIN
FOR rec IN c1
LOOP
DBMS_OUTPUT.PUT_LINE(rec.A_NAME);
DBMS_OUTPUT.PUT_LINE(rec.A_TIMESTAMP);
DBMS_OUTPUT.PUT_LINE(rec.MY_ID);
END LOOP;
END testProc;
I get this error when I compile the above proc:
Error(16,28): PLS-00302: component 'MY_ID' must be declared
and I am not sure how would I access the MY_ID element. I am sure it will be pretty
straight forward but I am new to database programming and have been trying but unable to find the right way to do it.
Any help is appreciated.
Thanks
One other thing you can do in this case is to join the tables with the USING clause instead of using ON, as in:
select *
from APPLE a
full join BANANA b
USING (MY_ID)
where a.A_TIMESTAMP > current_timestamp or
b.B_TIMESTAMP > current_timestamp
USING can be used if the columns in both tables have the same name, and the comparison of the key values is made using the equality ('=') operator. In the result set there will be one column named MY_ID along with the other columns from both tables (A_TIMESTAMP, B_TIMESTAMP, etc).
Share and enjoy.
I assume the problem is that MY_ID is defined in both tables, so * gets both of them. Try defining the cursor using this query:
select coalesce(A.MY_ID, B.MY_ID) as MY_ID,
A_TIMESTAMP, A_NAME, B_TIMESTAMP, B_NAME
from APPLE a full join
BANANA b
on a.MY_ID = b.MY_ID
where a.A_TIMESTAMP > current_timestamp or b.B_TIMESTAMP > current_timestamp;
EDIT:
You have two issues with conflicting columns. If this were just an inner join, you could do:
select A.*, B_TIMESTAMP, B_NAME
That is, you can select the columns from one table using * and the rest individually. However, this is a full outer join, so there is a set of columns where you want to use coalesce().
So, the best answer is that you should list out all the columns. This is good coding practice anyway, and helps protect the code from inadvertent mistakes when columns are added or removed from the table.
I want to write a stored procedure that performs a conditional calculation on a column. Ideally the implementation of the SP will be db agnostic - if possible. If not the underlying db is PostgreSQL (v8.4), so that takes precedence.
The underlying tables being queried looks like this:
CREATE TABLE treatment_def ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL
);
CREATE TABLE foo_group_def ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL
);
CREATE TABLE foo ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL,
trtmt_id INT REFERENCES treatment_def(id) ON DELETE RESTRICT,
foo_grp_id INT REFERENCES foo_group_def(id) ON DELETE RESTRICT,
is_male BOOLEAN NOT NULL,
cost REAL NOT NULL
);
I want to write a SP that returns the following 'table' result set:
treatment_name, foo_group_name, averaged_cost
where averaged cost is calculated differently, depending on whether the row's is_male flag is set to true or false.
For the purpose of this question, let's assume that if the is_male flag is set to true, then the averaged cost is calculated as the SUM of the cost values for the grouping, and if the is_male flag is set to false, then the cost value is calculated as the AVERAGE of the cost values for the grouping.
(Obviously) the data is being grouped by trtmt_id, foo_grp_id (and is_male?).
I have a rough idea about how to write the SQL if there was no conditional test on the is_male flag. However, I could do with some help in writing the SP as defined above.
Here is my first attempt:
CREATE TYPE FOO_RESULT AS (treatment_name VARCHAR(16), foo_group_name VARCHAR(64), averaged_cost DOUBLE);
// Outline plpgsql (Pseudo code)
CREATE FUNCTION somefunc() RETURNS SETOF FOO_RESULT AS $$
BEGIN
RETURN QUERY SELECT t.name treatment_name, g.name group_name, averaged_cost FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY f.trtmt_id, f.foo_grp_id;
END;
$$ LANGUAGE plpgsql;
I would appreciate some help on how to write this SP correctly to implement the conditional calculation in the column results
Could look like this:
CREATE FUNCTION somefunc()
RETURNS TABLE (
treatment_name varchar(16)
, foo_group_name varchar(16)
, averaged_cost double precision)
AS
$BODY$
SELECT t.name -- AS treatment_name
, g.name -- AS group_name
, CASE WHEN f.is_male THEN sum(f.cost)
ELSE avg(f.cost) END -- AS averaged_cost
FROM foo f
JOIN treatment_def t ON t.id = f.trtmt_id
JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
$BODY$ LANGUAGE sql;
Major points
I used an sql function, not plpgsql. You can use either, I just did it to shorten the code. plpgsql might be slightly faster, because the query plan is cached.
I skipped the custom composite type. You can do that simpler with RETURNS TABLE.
I would generally advise to use the data type text instead of varchar(n). Makes your life easier.
Be careful not to use names of the RETURN parameter without table-qualifying (tbl.col) in the function body, or you will create naming conflicts. That is why I commented the aliases.
I adjusted the GROUP BY clause. The original didn't work. (Neither does the one in @Ken's answer.)
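The effect of grouping by is_male and switching the aggregate with CASE can be checked on a tiny data set. The sketch below runs an equivalent query through Python's sqlite3 module as a stand-in for Postgres; the sample rows are made up, and is_male is stored as 0/1 because SQLite has no separate boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE treatment_def (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE foo_group_def (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE foo (
        id INTEGER PRIMARY KEY, name TEXT NOT NULL,
        trtmt_id INTEGER REFERENCES treatment_def(id),
        foo_grp_id INTEGER REFERENCES foo_group_def(id),
        is_male INTEGER NOT NULL, cost REAL NOT NULL
    );
    INSERT INTO treatment_def VALUES (1, 't1');
    INSERT INTO foo_group_def VALUES (1, 'g1');
    INSERT INTO foo (name, trtmt_id, foo_grp_id, is_male, cost) VALUES
        ('a', 1, 1, 1, 10.0), ('b', 1, 1, 1, 30.0),
        ('c', 1, 1, 0, 10.0), ('d', 1, 1, 0, 30.0);
""")

# One output row per (treatment, group, is_male): SUM for males, AVG otherwise.
rows = conn.execute("""
    SELECT t.name, g.name, f.is_male,
           CASE WHEN f.is_male THEN sum(f.cost) ELSE avg(f.cost) END
    FROM foo f
    JOIN treatment_def t ON t.id = f.trtmt_id
    JOIN foo_group_def g ON g.id = f.foo_grp_id
    GROUP BY t.name, g.name, f.is_male
    ORDER BY f.is_male
""").fetchall()
```

Because is_male is part of the grouping, each treatment and group pair can yield two rows, one per flag value.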
You should be able to use a CASE statement:
SELECT t.name treatment_name, g.name group_name,
CASE is_male WHEN true then SUM(cost)
ELSE AVG(cost) END AS averaged_cost
FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
I'm not familiar with PL/pgSQL, so I'm not sure of the exact syntax for the BOOLEAN column, but the above should at least get you started in the right direction.