How to join on a nested value from a jsonb column? - sql

I have a PostgreSQL 11 database with these tables:
CREATE TABLE stats (
id integer NOT NULL,
uid integer NOT NULL,
date date NOT NULL,
data jsonb DEFAULT '[]'::json NOT NULL
);
INSERT INTO stats(id, uid, date, data) VALUES
(1, 1, '2020-10-01', '{"somerandomhash":{"source":"thesource"}}');
CREATE TABLE links(
id integer NOT NULL,
uuid uuid NOT NULL,
path text NOT NULL
);
INSERT INTO links(id, uuid, path) VALUES
(1, 'acbd18db-4cc2-f85c-edef-654fccc4a4d8', 'thesource');
My goal is to create a new table reports with data from the stats table, but with a new key from the links table. It will look like this:
CREATE TABLE reports(
id integer NOT NULL,
uid integer NOT NULL,
date date NOT NULL,
data jsonb DEFAULT '[]'::json NOT NULL
);
INSERT INTO reports(id, uid, date, data) VALUES
(1, 1, 2020-10-01, {"uuid":{"source":"thesource"});
To this end, I tried to left join the table links in order to retrieve the uuid column value - without luck:
SELECT s.uid, s.date, s.data->jsonb_object_keys(data)->>'source' as path, s.data->jsonb_object_keys(data) as data, l.uuid
FROM stats s LEFT JOIN links l ON s.data->jsonb_object_keys(data)->>'source' = l.path
I tried to use the result of s.data->jsonb_object_keys(data)->>'source' in the left join, but got the error:
ERROR: set-returning functions are not allowed in JOIN conditions
I tried using LATERAL but still not valid result.
How to make this work?

jsonb_object_keys() is a set-returning function which cannot be used the way you do - as the error messages tells you. What's more, json_object_keys() returns top-level key(s), but it seems you are only interested in the value. Try jsonb_each() instead:
SELECT s.id
, s.uid
, s.date
, jsonb_build_object(l.uuid::text, o.value) AS new_data
FROM stats s
CROSS JOIN LATERAL jsonb_each(s.data) o -- defaults to column names (key, value)
LEFT JOIN links l ON l.path = o.value->>'source';
db<>fiddle here
jsonb_each() returns top-level key and value. Proceed using only the value.
The nested JSON object seems to have the constant key name 'source'. So the join condition is l.path = o.value->>'source'.
Finally, build the new jsonb value with jsonb_build_object().
While this works as demonstrated, a couple of questions remain:
The above assumes there is always exactly one top-level key in stats.data. If not, you'd have to define what to do ...
The above assumes there is always exactly one match in table links. If not, you'd have to define what to do ...
Most importantly:
If data is as regular as you make it out to be, consider a plain "uuid" column (or drop it as the value is in table links anyway) and a plain column "source" to replace the jsonb column. Much simpler and more efficient.

It looks like that you want to join by the "source" key from the JSON column.
Instead of
s.data->jsonb_object_keys(data)->>'source'
Try this
s.data ->> 'source'
If my assumptions are correct the whole query can go like that:
SELECT
s.uid,
s.date,
s.data ->> 'source' AS path,
s.data -> jsonb_object_keys(data) AS data,
l.uuid
FROM stats s
LEFT JOIN links l ON s.data ->> 'source' = l.path

Related

Querying a JSON array in Postgres

I have a table with json field
CREATE TABLE orders (
ID serial NOT NULL PRIMARY KEY,
info json NOT NULL
);
I inserted data to it
INSERT INTO orders (info)
VALUES
(
'{"interestedIn":[11,12,13],"countries":["US", "UK"]}'
);
How can I get rows where intrestedIn in(11, 12) and countries in("US")?
Your use of "in" in the problem specification is confusing, as you seem to be using it in some SQL-like pseudo syntax, but not with the same meaning it has SQL.
If you want rows where one of the countries is US and where both 11 and 12 are contained in "interestedIn", then it can be done in a single containment operation:
select * from orders where info::jsonb #> '{"interestedIn":[11,12],"countries":["US"]}'
If you want something other than that, then please explain in more detail, and provide both examples which should match and examples which should not match (and tell us which is which).
Checking for the country is quite simple as that is a text value:
select *
from orders
where info::jsonb -> 'countries' ? 'US'
Adding the condition for the integer value is a bit more complicated because the ? operator only works with strings. So you need to unnest the array:
select o.*
from orders o
where o.info::Jsonb -> 'countries' ? 'US'
and exists (select *
from json_array_elements_text(o.info -> 'interestedIn') as x(id)
where x.id in ('11','12'));
If you also might need to check for multiple country values, you can use the ?| operator:
select o.*
from orders o
where o.info::jsonb -> 'countries' ?| array['US', 'UK']
and exists (select *
from json_array_elements_text(o.info -> 'interestedIn') as x(id)
where x.id in ('11','12'));
Online example: https://rextester.com/GSDTOH43863

Use value of one column as identifier in another table in SNOWFLAKE

I have two table one of which contains the rule for another
create table t1(id int, query string)
create table t2(id int, place string)
insert into t1 values (1,'id < 10')
insert into t1 values (2,'id == 10')
And the values in t2 are
insert into t2 values (11,'Nevada')
insert into t2 values (20,'Texas')
insert into t2 values (10,'Arizona')
insert into t2 values (2,'Abegal')
I need to select from second table as per the value of first table column value.
like
select * from t2 where {query}
or
with x(query)
as
(select c2 from test)
select * from test where query;
but neither are helping.
There are a couple of problems with storing criteria in a table like this:
First, as has already been noted, you'll likely have to resort to dynamic SQL, which can get messy, and limits how you can use it.
It's going to be problematic (to say the least) to validate and parse your criteria. What if someone writes a rule of [id] *= 10, or [this_field_doesn't_exist] = blah?
If you're just storing potential values for your [id] column, one solution would be to have your t1 (storing your queries) include a min value and max value, like this:
CREATE TABLE t1
(
[id] INT IDENTITY(1,1) PRIMARY KEY,
min_value INT NULL,
max_value INT NULL
)
Note that both the min and max values can be null. Your provided criteria would then be expressed as this:
INSERT INTO t1
([id], min_value, max_value)
VALUES
(1, NULL, 10),
(2, 10, 10)
Note that I've explicitly referenced what attibutes we're inserting into, as you should also be doing (to prevent issues with attributes being added/modified down the line).
A null value on min_value means no lower limit; a null max_value means no upper limit.
To then get results from t2 that meet all your t1 criteria, simply do an INNER JOIN:
SELECT t2.*
FROM t2
INNER JOIN t1 ON
(t2.id <= t1.max_value OR t1.max_value IS NULL)
AND
(t2.id >= t1.min_value OR t1.min_value IS NULL)
Note that, as I said, this will only return results that match all your criteria. If you need to more complex logic (for example, show records that meet Rules 1, 2 and 3, or meet Rule 4), you'll likely have to resort to dynamic SQL (or at the very least some ugly JOINs).
As stated in a comment, however, you want to have more complex rules, which might mean you have to end up using dynamic SQL. However, you still have the problem of validating and parsing your rule. How do you handle cases where a user enters an invalid rule?
A better solution might be to store your rules in a format that can easily be parsed and validated. For example, come up with an XML schema that defines a valid rule/criterion. Then, your Rules table would have a rule XML attribute, tied to that schema, so users could only enter valid rules. You could then either shred that XML document, or create the SQL client-side to come up with your query.
I got the answer myself. And I am putting it below.
I have used python CLI to do the job. (As snowflake does not support dynamic query)
I believe one can use the same for other DB (tedious but doable)
setting up configuration to connect
CONFIG_PATH = "/root/config/snowflake.json"
with open(CONFIG_PATH) as f:
config = json.load(f)
#snowflake
snf_user = config['snowflake']['user']
snf_pwd = config['snowflake']['pwd']
snf_account = config['snowflake']['account']
snf_region = config['snowflake']['region']
snf_role = config['snowflake']['role']
ctx = snowflake.connector.connect(
user=snf_user,
password=snf_pwd,
account=snf_account,
region=snf_region,
role=snf_role
)
--comment
Used multiple cursor as in loop we don't want recursive connection
cs = ctx.cursor()
cs1 = ctx.cursor()
query = "select c2 from test"
cs.execute(query)
for (x) in cs:
y = "select * from test1 where {0}".format(', '.join(x).replace("'",""))
cs1.execute(y)
for (y1) in cs1:
print('{0}'.format(y1))
And boom done

Determine the values of a long list which are not existing in a postgres table

Provided there is a long list of values, which happen to be values of attributes of records in a postgres-database.
I would like to create a query which finds out which of these values can not be found in the database.
I have no right to execute DDL-Statements and I would like to avoid procedural code.
Example:
the table might be
CREATE TABLE Test (
ID Integer,
attr varchar(30)
)
The list might be something like (but longer, about 240000 values)
ATTR
TestValue0
TestValue1
TestValue2
TestValue3
Using sed I can create and execute a statement
select count(*) from Test where attr in ('TestValue0',
'TestValue1','TestValue2','TestValue3')
This statement shows me, that not all of these values can be found in Test.
How can I formulate a query which tells me which of these uniq-values can not be found in the postgres-database?
For what you want to do, you can use left join, not in or not exists. But the key is that you need a derived table with the values you care about:
select v.attr
from (values ('TestValue0'), ('TestValue1'), ('TestValue2'), ('TestValue3')
) v attr
where not exists (select 1 from test t where t.attr = v.attr);

How to find intersecting geographies between two tables recursively

I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables.
geographies may have 150,000,000 rows, paths may have 10,000,000 rows:
CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id))
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id))
Given an array/set of ids for table geographies, what is the "best" way of finding all intersecting paths and geometries?
In other words, if an initial geography has a corresponding intersecting path we need to also find all other geographies that this path intersects. From there, we need to find all other paths that these newly found geographies intersect, and so on until we've found all possible intersections.
The initial geography ids (our input) may be anywhere from 0 to 700. With an average around 40.
Minimum intersections will be 0, max will be about 1000. Average likely around 20, typically less than 100 connected.
I've created a function that does this, but I'm new to GIS in PostGIS, and Postgres in general. I've posted my solution as an answer to this question.
I feel like there should be a more eloquent and faster way of doing this than what I've come up with.
Your function can be radically simplified.
Setup
I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().
My answer is based on these adapted tables:
CREATE TABLE paths (
id uuid PRIMARY KEY
, path geography NOT NULL
);
CREATE TABLE geographies (
id uuid PRIMARY KEY
, geography geography NOT NULL
, fk_id text NOT NULL
);
Everything works with data type geometry for both columns just as well. geography is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
Solution 1: Your function optimized
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
RETURNS TABLE(id uuid, type text)
LANGUAGE plpgsql AS
$func$
DECLARE
_row_ct int;
_loop_ct int := 0;
BEGIN
CREATE TEMP TABLE _geo ON COMMIT DROP AS -- dropped at end of transaction
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
FROM geographies g
WHERE g.fk_id = ANY(_fk_ids);
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return empty result immediately
RETURN; -- exit function
END IF;
CREATE TEMP TABLE _path ON COMMIT DROP AS
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path); -- no dupes yet
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return _geo immediately
RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
RETURN;
END IF;
ALTER TABLE _geo ADD CONSTRAINT g_uni UNIQUE (id); -- required for UPSERT
ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);
LOOP
_loop_ct := _loop_ct + 1;
INSERT INTO _geo(id, geography, loop_ct)
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
FROM _paths p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE p.loop_ct = _loop_ct - 1 -- only use last round!
ON CONFLICT ON CONSTRAINT g_uni DO NOTHING; -- eliminate new dupes
EXIT WHEN NOT FOUND;
INSERT INTO _path(id, path, loop_ct)
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE g.loop_ct = _loop_ct - 1
ON CONFLICT ON CONSTRAINT p_uni DO NOTHING;
EXIT WHEN NOT FOUND;
END LOOP;
RETURN QUERY
SELECT g.id, text 'geo' FROM _geo g
UNION ALL
SELECT p.id, text 'path' FROM _path p;
END
$func$;
Call:
SELECT * FROM public.function_name('{foo,bar}');
Much faster than what you have.
Major points
You based queries on the whole set, instead of the latest additions to the set only. This gets increasingly slower with every loop without need. I added a loop counter (loop_ct) to avoid redundant work.
Be sure to have spatial GiST indexes on geographies.geography and paths.path:
CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
CREATE INDEX paths_path_gix ON paths USING GIST (path);
Since Postgres 9.5 index-only scans would be an option for GiST indexes. You might add id as second index column. The benefit depends on many factors, you'd have to test. However, there is no fitting operator GiST class for the uuid type. It would work with bigint after installing the extension btree_gist:
Postgres multi-column index (integer, boolean, and array)
Multicolumn index on 3 fields with heterogenous data types
Have a fitting index on g.fk_id, too. Again, a multicolumn index on (fk_id, id, geography) might pay if you can get index-only scans out of it. Default btree index, fk_id must be first index column. Especially if you run the query often and rarely update the table and table rows are much wider than the index.
You can initialize variables at declaration time. Only needed once after the rewrite.
ON COMMIT DROP drops the temp tables at the end of the transaction automatically. So I removed dropping tables explicitly. But you get an exception if you call the function in the same transaction twice. In the function I would check for existence of the temp table and use TRUNCATE in this case. Related:
How to check if a table exists in a given schema
Use GET DIAGNOSTICS to get the row count instead of running another query for the count.
Count rows affected by DELETE
You need GET DIAGNOSTICS. CREATE TABLE does not set FOUND (as is mentioned in the manual).
It's faster to add an index or PK / UNIQUE constraint after filling the table. And not before we actually need it.
ON CONFLICT ... DO ... is the simpler and cheaper way for UPSERT since Postgres 9.5.
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.
It may help to ANALYZE temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)
Are regular VACUUM ANALYZE still recommended under 9.1?
_geo_ct - _geographyLength > 0 is an awkward and more expensive way of saying _geo_ct > _geographyLength. But that's gone completely now.
Don't quote the language name. Just LANGUAGE plpgsql.
Your function parameter is varchar[] for an array of fk_id, but you later commented:
It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15).
I don't know s2cell id at level 15, but ideally you pass an array of matching data type, or if that's not an option default to text[].
Also since you commented:
There are always exactly 13 fk_ids passed in.
This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be:
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids VARIADIC text[]) ...
Details:
Pass multiple values in single parameter
Solution 2: Plain SQL with recursive CTE
It's hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:
WITH RECURSIVE cte AS (
SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
FROM geographies g
WHERE g.fk_id = ANY($kf_ids) -- your input array here
UNION
SELECT p.id, g.geography::text, p.path::text
, CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
FROM cte c
LEFT JOIN paths p ON c.type = 'geo'
AND ST_Intersects(c.geography::geography, p.path)
LEFT JOIN geographies g ON c.type = 'path'
AND ST_Intersects(g.geography, c.path::geography)
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
)
SELECT id, type FROM cte;
That's all.
You need the same indexes as above. You might wrap it into an SQL function for repeated use.
Major additional points
The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by virtue of (id, type) alone, we can ignore the geography columns for this. Cast back to geography for the join. Shouldn't cost too much extra.
We need two LEFT JOIN so not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done, yet:
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
This works because duplicate findings are excluded from the temporary
intermediate table. The manual:
For UNION (but not UNION ALL), discard duplicate rows and rows that
duplicate any previous result row. Include all remaining rows in the
result of the recursive query, and also place them in a temporary
intermediate table.
So which is faster?
The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.*
See the OP's feedback in the comment.
I figured it'd be good to post my own solution here even if it isn't optimal.
Here is what I came up with (using Steve Chambers' advice):
CREATE OR REPLACE FUNCTION public.function_name(
_fk_ids character varying[])
RETURNS TABLE(id uuid, type character varying)
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE
ROWS 1000.0
AS $function$
DECLARE
_pathLength bigint;
_geographyLength bigint;
_currentPathLength bigint;
_currentGeographyLength bigint;
BEGIN
DROP TABLE IF EXISTS _pathIds;
DROP TABLE IF EXISTS _geographyIds;
CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);
-- get all geographies in the specified _fk_ids
INSERT INTO _geographyIds
SELECT g.id
FROM geographies g
WHERE g.fk_id= ANY(_fk_ids);
_pathLength := 0;
_geographyLength := 0;
_currentPathLength := 0;
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- _pathIds := ARRAY[]::uuid[];
WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
_pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- gets all paths that have paths that intersect the geographies that aren't in the current list of path ids
INSERT INTO _pathIds
SELECT DISTINCT p.id
FROM paths p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE
g.id IN (SELECT _geographyIds.id FROM _geographyIds) AND
p.id NOT IN (SELECT _pathIds.id from _pathIds);
-- gets all geographies that have paths that intersect the paths that aren't in the current list of geography ids
INSERT INTO _geographyIds
SELECT DISTINCT g.id
FROM geographies g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE
p.id IN (SELECT _pathIds.id FROM _pathIds) AND
g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);
_currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
END LOOP;
RETURN QUERY
SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
UNION ALL
SELECT _pathIds.id, 'path' AS type FROM _pathIds;
END;
$function$;
Sample plot and data from this script
It can be pure relational with an aggregate function. This implementation uses one path table and one point table. Both are geometries. The point is easier to create test data with and to test than a generic geography but it should be simple to adapt.
create table path (
path_text text primary key,
path geometry(linestring) not null
);
create table point (
point_text text primary key,
point geometry(point) not null
);
A type to keep the aggregate function's state:
create type mpath_mpoint as (
mpath geometry(multilinestring),
mpoint geometry(multipoint)
);
The state building function:
create or replace function path_point_intersect (
_i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$
with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
select array_agg((mpath, mpoint)::mpath_mpoint)
from (
select
st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
(
select st_collect(gd)
from (
select gd from st_dump(i.mpath) a (a, gd)
union all
select gd from st_dump(e.mpath) b (a, gd)
) s
) as mpath
from i inner join e on st_intersects(i.mpoint, e.mpoint)
union all
select i.mpoint, i.mpath
from i inner join e on not st_intersects(i.mpoint, e.mpoint)
union all
select e.mpoint, e.mpath
from e
where not exists (
select 1 from i
where st_intersects(i.mpoint, e.mpoint)
)
) s;
$$ language sql;
The aggregate:
create aggregate path_point_agg (mpath_mpoint) (
sfunc = path_point_intersect,
stype = mpath_mpoint[]
);
This query will return a set of multilinestring, multipoint strings containing the matched paths/points:
select st_astext(mpath), st_astext(mpoint)
from unnest((
select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
from (
select path, st_union(point) as mpoint
from
path
inner join
point on st_intersects(path, point)
group by path
) s
)) m (mpath, mpoint)
;
st_astext | st_astext
-----------------------------------------------------------+-----------------------------
MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6)) | MULTIPOINT(-8 -8,2 -8)
MULTILINESTRING((-7 -4,-3 4,-5 6)) | MULTIPOINT(-6 -2)

PLpgSQL (or ANSI SQL?) Conditional calculation on a column

I want to write a stored procedure that performs a conditional calculation on a column. Ideally the implementation of the SP will be db agnostic - if possible. If not the underlying db is PostgreSQL (v8.4), so that takes precedence.
The underlying tables being queried looks like this:
CREATE TABLE treatment_def ( id PRIMARY SERIAL KEY,
name VARCHAR(16) NOT NULL
);
CREATE TABLE foo_group_def ( id PRIMARY SERIAL KEY,
name VARCHAR(16) NOT NULL
);
CREATE TABLE foo ( id PRIMARY SERIAL KEY,
name VARCHAR(16) NOT NULL,
trtmt_id INT REFERENCES treatment_def(id) ON DELETE RESTRICT,
foo_grp_id INT REFERENCES foo_group_def(id) ON DELETE RESTRICT,
is_male BOOLEAN NOT NULL,
cost REAL NOT NULL
);
I want to write a SP that returns the following 'table' result set:
treatment_name, foo_group_name, averaged_cost
where averaged cost is calcluated differently, depending on whether the row field *is_male* flag is set to true or false.
For the purpose of this question, lets assume that if the is_male flag is set to true, then the averaged cost is calculated as the SUM of the cost values for the grouping, and if the is_male flag is set to false, then the cost value is calculated as the AVERAGE of the cost values for the grouping.
(Obviously) the data is being grouped by trmt_id, foo_grp_id (and is_male?).
I have a rough idea about how to to write the SQL if there was no conditional test on the is_male flag. However, I could do with some help in writing the SP as defined above.
Here is my first attempt:
CREATE TYPE FOO_RESULT AS (treatment_name VARCHAR(16), foo_group_name VARCHAR(64), averaged_cost DOUBLE);
// Outline plpgsql (Pseudo code)
CREATE FUNCTION somefunc() RETURNS SETOF FOO_RESULT AS $$
BEGIN
RETURN QUERY SELECT t.name treatment_name, g.name group_name, averaged_cost FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY f.trtmt_id, f.foo_grp_id;
END;
$$ LANGUAGE plpgsql;
I would appreciate some help on how to write this SP correctly to implement the conditional calculation in the column results
Could look like this:
CREATE FUNCTION somefunc()
RETURNS TABLE (
treatment_name varchar(16)
, foo_group_name varchar(16)
, averaged_cost double precision)
AS
$BODY$
SELECT t.name -- AS treatment_name
, g.name -- AS group_name
, CASE WHEN f.is_male THEN sum(f.cost)
ELSE avg(f.cost) END -- AS averaged_cost
FROM foo f
JOIN treatment_def t ON t.id = f.trtmt_id
JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
$BODY$ LANGUAGE sql;
Major points
I used an sql function, not plpgsql. You can use either, I just did it to shorten the code. plpgsql might be slightly faster, because the query plan is cached.
I skipped the custom composite type. You can do that simpler with RETURNS TABLE.
I would generally advise to use the data type text instead of varchar(n). Makes your life easier.
Be careful not to use names of the RETURN parameter without table-qualifying (tbl.col) in the function body, or you will create naming conflicts. That is why I commented the aliases.
I adjusted the GROUP BY clause. The original didn't work. (Neither does the one in #Ken's answer.)
You should be able to use a CASE statement:
SELECT t.name treatment_name, g.name group_name,
CASE is_male WHEN true then SUM(cost)
ELSE AVG(cost) END AS averaged_cost
FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
I'm not familiar with PLpgSQL, so I'm not sure of the exact syntax for the BOOLEAN column, but the above should at least get you started in the right direction.