Remove long key/value pairs in a jsonb column in Postgres with SQL

I am using a materialized view to merge and query 3 json columns, because I want to query all of them together with one GIN index. The view looks similar to this:
CREATE MATERIALIZED VIEW IF NOT EXISTS test_materialized_view AS
SELECT t1.id, (t1.data1 || t1.data2 || COALESCE(t2.data1, '{}'::jsonb)) "data"
FROM table_1 t1 LEFT JOIN table_2 t2 ON (...);
Now it can happen that there are longer key/value pairs in the json data which I never want to query, and which can be stored thousands of times because they are in t2.data1. Is it possible to filter the merged json and only include key/value pairs with a length less than x characters? Does this even make a difference / reduce stored data?
I don't know the json keys of these fields. I basically just want to remove all key/value pairs which are longer than x characters, or arrays / nested objects, but did not really find a good way to do this in Postgres.

There is no built-in function for this. You will need to write your own.
Something along these lines:
create function remove_long_values(p_input jsonb, p_maxlen int)
  returns jsonb
as
$$
select coalesce(jsonb_object_agg(e.ky, e.val), '{}')
from jsonb_each(p_input) as e(ky, val)
where length(e.val::text) <= p_maxlen;  -- keep only values whose text form fits p_maxlen
$$
language sql
immutable
parallel safe;
The above does not deal with nested key/value pairs! It only checks this on the first level.
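If you also want to drop arrays and nested objects outright, as the question asks, a possible variation (the function name and exact filter are my suggestion, untested against your data) additionally filters on jsonb_typeof():

create function remove_long_values_scalar(p_input jsonb, p_maxlen int)
  returns jsonb
as
$$
select coalesce(jsonb_object_agg(e.ky, e.val), '{}')
from jsonb_each(p_input) as e(ky, val)
where jsonb_typeof(e.val) not in ('object', 'array')  -- drop nested structures entirely
and   length(e.val::text) <= p_maxlen;
$$
language sql
immutable
parallel safe;

The rest of this answer sticks with the simpler first-level version.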
Then use it in the query:
CREATE MATERIALIZED VIEW IF NOT EXISTS test_materialized_view
AS
SELECT t1.id, t1.data1 || t1.data2 || remove_long_values(t2.data1,250) as "data"
FROM table_1 t1
LEFT JOIN table_2 t2 ON (...);
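A quick sanity check with a literal (keys and values made up):

SELECT remove_long_values('{"a": "short", "b": "a value well over the limit"}'::jsonb, 10);
-- returns {"a": "short"}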

Related

PostgreSQL JSONB overlaps operator on multiple JSONB columns

I have a table that contains two jsonb columns both consisting of jsonb data that represent an array of strings. These can be empty arrays too.
I am now trying to query this table and retrieve the rows where either (or both) jsonb arrays contain at least one item of an array I pass. I managed to figure out a working query:
SELECT *
FROM TABLE T
WHERE (EXISTS (SELECT *
               FROM JSONB_ARRAY_ELEMENTS_TEXT(T.DATA1) AS DATA1
               WHERE ARRAY[DATA1] && ARRAY['some string','some other string']))
   OR (EXISTS (SELECT *
               FROM JSONB_ARRAY_ELEMENTS_TEXT(T.DATA2) AS DATA2
               WHERE ARRAY[DATA2] && ARRAY['random string', 'another random string']));
But I think this is not optimal at all. I am trying to do it with a cross join, but the issue is that data1 and data2 in the jsonb columns can be empty arrays, and then the join will exclude these rows, even though the other jsonb column may still satisfy the overlaps (&&) condition.
I tried other approaches too, like:
SELECT DISTINCT ID
FROM table,
     JSONB_ARRAY_ELEMENTS_TEXT(data1) data1,
     JSONB_ARRAY_ELEMENTS_TEXT(data2) data2
WHERE data1 IN ('some string', 'some other string')
   OR data2 IN ('random string', 'string');
But this one also does not include rows where data1 or data2 is an empty array. So I thought of a FULL OUTER JOIN, but because this is a lateral reference it does not work:
The combining JOIN type must be INNER or LEFT for a LATERAL reference.
You don't need to unnest the JSON array. The JSONB operator ?| can do that directly - it checks if any of the array elements of the argument on the right hand side is contained as a top-level element in the JSON value on the left hand side.
SELECT *
FROM the_table t
WHERE t.data1 ?| ARRAY['some string','some other string']
   OR t.data2 ?| ARRAY['random string', 'another random string'];
This will not return rows where both arrays are empty (or where neither of the columns contains the searched strings).
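The operator's behavior is easy to verify with literals:

SELECT '["a","b"]'::jsonb ?| ARRAY['b','z'];  -- true: 'b' is a top-level element
SELECT '[]'::jsonb ?| ARRAY['b'];             -- false: an empty array matches nothing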

Construct ARRAY of values from a subquery in Postgres and use it in a WHERE clause

These are samples of the two tables I have:
Table 1

material_id (int)   codes (jsonb)
-----------------   ------------------------------
1                   ['A-12','B-19','A-14','X-22']
2                   ['X-106','A-12','X-22','B-19']
...

Table 2

user_id   material_list (jsonb)
-------   ---------------------
1         [2,3]
2         [1,2]
...
Table 1 contains material IDs and an array of codes associated with that material.
Table 2 contains user IDs. Each user has a list of materials associated with it, and this is saved as an array of material IDs.
I want to fetch a list of user IDs for all materials having certain codes. This is the query I tried, but it threw a syntax error:
SELECT user_id from table2
WHERE material_list ?| array(SELECT material_id
FROM table1 where codes ?| ['A-12','B-19]);
I am unable to figure out how to fix it.
Your query fails for multiple reasons.
First, ['A-12','B-19] isn't a valid Postgres text array. Either use an array constant or an array constructor:
'{A-12,B-19}'
ARRAY['A-12','B-19']
See:
How to pass custom type array to Postgres function
Pass array literal to PostgreSQL function
Next, the operator ?| demands text[] to the right, while you provide int[].
Finally, it wouldn't work anyway, as the operator ?| checks for JSON strings, not numbers. The manual:
Do any of the strings in the text array exist as top-level keys or array elements?
Convert the JSON array to a Postgres integer array, then use the array overlap operator &&:
SELECT user_id
FROM tbl2
WHERE ARRAY(SELECT jsonb_array_elements_text(material_list)::int)
&& ARRAY(SELECT material_id FROM tbl1 where codes ?| array['A-12','B-19']);
I strongly suggest altering your table to convert the JSON array in material_list to a Postgres integer array (int[]) for good (a sketch follows below). See:
Altering JSON column to INTEGER[] ARRAY
How to turn JSON array into Postgres array?
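A sketch of that one-time conversion; Postgres does not allow a subquery in the USING expression, so going through the text representation is one way (assuming the arrays contain only numbers):

ALTER TABLE tbl2
ALTER COLUMN material_list TYPE int[]
USING translate(material_list::text, '[]', '{}')::int[];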
Then the query gets simpler:
SELECT user_id
FROM tbl2
WHERE material_list && ARRAY(SELECT material_id FROM tbl1 where codes ?| '{A-12,B-19}');
db<>fiddle here
Or - dare I say it? - properly normalize your relational design. See:
How to implement a many-to-many relationship in PostgreSQL?
This seems like the process of unnesting json arrays:
select t2.user_id
from table2 t2
where exists (select 1
              from table1 t1
                   join jsonb_array_elements_text(t2.material_list) j(material_id)
                     on t1.material_id = j.material_id::int
                   join jsonb_array_elements_text(t1.codes) j2(code)
                     on j2.code in ('A-12', 'B-19')
             );
Here is a db<>fiddle.

SELECT on JSON operations of Postgres array column?

I have a column of type jsonb[] (a Postgres array of jsonb objects) and I'd like to perform a SELECT on rows where a criteria is met on at least one of the objects. Something like:
-- Schema would be something like
mytable (
  id UUID PRIMARY KEY,
  col2 jsonb[] NOT NULL
);

-- Query I'd like to run
SELECT
  id,
  x->>'field1' AS field1
FROM
  mytable
WHERE
  x->>'field2' = 'user' -- for any x in the array stored in col2
I've looked around at ANY and UNNEST but it's not totally clear how to achieve this, since you can't run unnest in a WHERE clause. I also don't know how I'd specify that I want the field1 from the matching object.
Do I need a WITH table with the values expanded to join against? And how would I achieve that and keep the id from the other column?
Thanks!
You need to unnest the array; then you can access each json value:
SELECT t.id,
       c.x ->> 'field1' AS field1
FROM mytable t
CROSS JOIN unnest(col2) AS c(x)
WHERE c.x ->> 'field2' = 'user';
This will return one row for each json value in the array.
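If you only need each row once, and no fields from the matching object, an EXISTS variant avoids the duplicates (a sketch against the same schema):

SELECT t.id
FROM mytable t
WHERE EXISTS (
   SELECT 1
   FROM unnest(t.col2) AS c(x)
   WHERE c.x ->> 'field2' = 'user'
);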

How to update a nested bigquery column with data from another bigquery table

I have 2 bigquery tables with nested columns. I need to update all the columns in table1 whenever table1.value1 = table2.value, and both tables have a huge amount of data.
I could update a single nested column with a static value like below:
#standardSQL
UPDATE `ck.table1`
SET promotion_id = ARRAY(
  SELECT AS STRUCT * REPLACE (100 AS PromotionId) FROM UNNEST(promotion_id)
)
But when I try to reuse the same approach to update multiple columns based on table2 data, I get exceptions.
I am trying to update table1 with table2 data whenever table1.value1 = table2.value, across all the nested columns.
As of now, both tables have a similar schema.
I need to update all the columns in table1 whenever table1.value1=table2.value
... both tables have a similar schema
I assume by similar you meant same.
Below is for BigQuery Standard SQL
You can use the query below to get the combined result and save it back to table1, either via a destination table or with the CREATE OR REPLACE TABLE syntax:
#standardSQL
SELECT AS VALUE IF(value IS NULL, t1, t2)
FROM `project.dataset.table1` t1
LEFT JOIN `project.dataset.table2` t2
ON value1 = value
I have not tried this approach with UPDATE syntax - but you can try and let us know :o)
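With the CREATE OR REPLACE TABLE route, that could look like this (same query, just written back over table1; the table names are placeholders as above):

#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table1` AS
SELECT AS VALUE IF(value IS NULL, t1, t2)
FROM `project.dataset.table1` t1
LEFT JOIN `project.dataset.table2` t2
ON value1 = value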

How to find intersecting geographies between two tables recursively

I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables.
geographies may have 150,000,000 rows, paths may have 10,000,000 rows:
CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id));
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id));
Given an array/set of ids for table geographies, what is the "best" way of finding all intersecting paths and geometries?
In other words, if an initial geography has a corresponding intersecting path we need to also find all other geographies that this path intersects. From there, we need to find all other paths that these newly found geographies intersect, and so on until we've found all possible intersections.
The initial geography ids (our input) may be anywhere from 0 to 700. With an average around 40.
Minimum intersections will be 0, max will be about 1000. Average likely around 20, typically less than 100 connected.
I've created a function that does this, but I'm new to GIS in PostGIS, and Postgres in general. I've posted my solution as an answer to this question.
I feel like there should be a more elegant and faster way of doing this than what I've come up with.
Your function can be radically simplified.
Setup
I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().
My answer is based on these adapted tables:
CREATE TABLE paths (
id uuid PRIMARY KEY
, path geography NOT NULL
);
CREATE TABLE geographies (
id uuid PRIMARY KEY
, geography geography NOT NULL
, fk_id text NOT NULL
);
Everything works with data type geometry for both columns just as well. geography is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
Solution 1: Your function optimized
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
  RETURNS TABLE(id uuid, type text)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _row_ct  int;
   _loop_ct int := 0;
BEGIN
   CREATE TEMP TABLE _geo ON COMMIT DROP AS  -- dropped at end of transaction
   SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct  -- dupes possible?
   FROM   geographies g
   WHERE  g.fk_id = ANY(_fk_ids);

   GET DIAGNOSTICS _row_ct = ROW_COUNT;
   IF _row_ct = 0 THEN  -- no rows found, return empty result immediately
      RETURN;           -- exit function
   END IF;

   CREATE TEMP TABLE _path ON COMMIT DROP AS
   SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
   FROM   _geo  g
   JOIN   paths p ON ST_Intersects(g.geography, p.path);  -- no dupes yet

   GET DIAGNOSTICS _row_ct = ROW_COUNT;
   IF _row_ct = 0 THEN  -- no rows found, return _geo immediately
      RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
      RETURN;
   END IF;

   ALTER TABLE _geo  ADD CONSTRAINT g_uni UNIQUE (id);  -- required for UPSERT
   ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);

   LOOP
      _loop_ct := _loop_ct + 1;

      INSERT INTO _geo(id, geography, loop_ct)
      SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
      FROM   _path p
      JOIN   geographies g ON ST_Intersects(g.geography, p.path)
      WHERE  p.loop_ct = _loop_ct - 1  -- only use last round!
      ON CONFLICT ON CONSTRAINT g_uni DO NOTHING;  -- eliminate new dupes

      EXIT WHEN NOT FOUND;

      INSERT INTO _path(id, path, loop_ct)
      SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
      FROM   _geo  g
      JOIN   paths p ON ST_Intersects(g.geography, p.path)
      WHERE  g.loop_ct = _loop_ct - 1
      ON CONFLICT ON CONSTRAINT p_uni DO NOTHING;

      EXIT WHEN NOT FOUND;
   END LOOP;

   RETURN QUERY
   SELECT g.id, text 'geo'  FROM _geo  g
   UNION ALL
   SELECT p.id, text 'path' FROM _path p;
END
$func$;
Call:
SELECT * FROM public.function_name('{foo,bar}');
Much faster than what you have.
Major points
You based queries on the whole set, instead of the latest additions to the set only. This gets increasingly slower with every loop without need. I added a loop counter (loop_ct) to avoid redundant work.
Be sure to have spatial GiST indexes on geographies.geography and paths.path:
CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
CREATE INDEX paths_path_gix ON paths USING GIST (path);
Since Postgres 9.5, index-only scans would be an option for GiST indexes. You might add id as a second index column. The benefit depends on many factors; you'd have to test. However, there is no fitting GiST operator class for the uuid type. It would work with bigint after installing the extension btree_gist:
Postgres multi-column index (integer, boolean, and array)
Multicolumn index on 3 fields with heterogenous data types
Have a fitting index on g.fk_id, too. Again, a multicolumn index on (fk_id, id, geography) might pay if you can get index-only scans out of it. Default btree index, fk_id must be first index column. Especially if you run the query often and rarely update the table and table rows are much wider than the index.
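In its simplest form (the index name is mine):

CREATE INDEX geographies_fk_id_idx ON geographies (fk_id);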
You can initialize variables at declaration time. Only needed once after the rewrite.
ON COMMIT DROP drops the temp tables at the end of the transaction automatically. So I removed dropping tables explicitly. But you get an exception if you call the function in the same transaction twice. In the function I would check for existence of the temp table and use TRUNCATE in this case. Related:
How to check if a table exists in a given schema
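A minimal sketch of that check with to_regclass() (available since Postgres 9.4), replacing the plain CREATE TEMP TABLE at the top of the function:

IF to_regclass('pg_temp._geo') IS NULL THEN
   CREATE TEMP TABLE _geo (id uuid, geography geography, loop_ct int) ON COMMIT DROP;
ELSE
   TRUNCATE _geo;  -- left over from an earlier call in the same transaction
END IF;

INSERT INTO _geo
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
FROM   geographies g
WHERE  g.fk_id = ANY(_fk_ids);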
Use GET DIAGNOSTICS to get the row count instead of running another query for the count.
Count rows affected by DELETE
You need GET DIAGNOSTICS. CREATE TABLE does not set FOUND (as is mentioned in the manual).
It's faster to add an index or PK / UNIQUE constraint after filling the table. And not before we actually need it.
ON CONFLICT ... DO ... is the simpler and cheaper way for UPSERT since Postgres 9.5.
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.
It may help to ANALYZE temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)
Are regular VACUUM ANALYZE still recommended under 9.1?
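In this function that would simply be, right after filling the temp tables:

ANALYZE _geo;
ANALYZE _path;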
_geo_ct - _geographyLength > 0 is an awkward and more expensive way of saying _geo_ct > _geographyLength. But that's gone completely now.
Don't quote the language name. Just LANGUAGE plpgsql.
Your function parameter is varchar[] for an array of fk_id, but you later commented:
It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15).
I don't know s2cell id at level 15, but ideally you pass an array of matching data type, or if that's not an option default to text[].
Also since you commented:
There are always exactly 13 fk_ids passed in.
This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be:
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids VARIADIC text[]) ...
Details:
Pass multiple values in single parameter
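Calls then look like this; an actual array can still be passed with the keyword VARIADIC:

SELECT * FROM public.function_name('foo', 'bar', 'baz');
SELECT * FROM public.function_name(VARIADIC '{foo,bar,baz}'::text[]);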
Solution 2: Plain SQL with recursive CTE
It's hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:
WITH RECURSIVE cte AS (
   SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
   FROM   geographies g
   WHERE  g.fk_id = ANY($fk_ids)  -- your input array here

   UNION
   SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text
        , CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
   FROM   cte c
   LEFT   JOIN paths       p ON c.type = 'geo'
                            AND ST_Intersects(c.geography::geography, p.path)
   LEFT   JOIN geographies g ON c.type = 'path'
                            AND ST_Intersects(g.geography, c.path::geography)
   WHERE  (p.path IS NOT NULL OR g.geography IS NOT NULL)
   )
SELECT id, type FROM cte;
That's all.
You need the same indexes as above. You might wrap it into an SQL function for repeated use.
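Such a wrapper might look like this (the function name is my invention; the body is the rCTE above with the parameter spliced in):

CREATE OR REPLACE FUNCTION public.intersecting_ids(_fk_ids text[])
  RETURNS TABLE (id uuid, type text)
  LANGUAGE sql STABLE AS
$func$
WITH RECURSIVE cte AS (
   SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
   FROM   geographies g
   WHERE  g.fk_id = ANY(_fk_ids)

   UNION
   SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text
        , CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
   FROM   cte c
   LEFT   JOIN paths       p ON c.type = 'geo'
                            AND ST_Intersects(c.geography::geography, p.path)
   LEFT   JOIN geographies g ON c.type = 'path'
                            AND ST_Intersects(g.geography, c.path::geography)
   WHERE  (p.path IS NOT NULL OR g.geography IS NOT NULL)
   )
SELECT id, type FROM cte;
$func$;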
Major additional points
The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by virtue of (id, type) alone, we can ignore the geography columns for this. Cast back to geography for the join. Shouldn't cost too much extra.
We need two LEFT JOINs so as not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done, yet:
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
This works because duplicate findings are excluded from the temporary
intermediate table. The manual:
For UNION (but not UNION ALL), discard duplicate rows and rows that
duplicate any previous result row. Include all remaining rows in the
result of the recursive query, and also place them in a temporary
intermediate table.
So which is faster?
The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer; see the OP's feedback in the comments.
I figured it'd be good to post my own solution here even if it isn't optimal.
Here is what I came up with (using Steve Chambers' advice):
CREATE OR REPLACE FUNCTION public.function_name(
   _fk_ids character varying[])
 RETURNS TABLE(id uuid, type character varying)
 LANGUAGE 'plpgsql'
 COST 100.0
 VOLATILE
 ROWS 1000.0
AS $function$
DECLARE
   _pathLength bigint;
   _geographyLength bigint;
   _currentPathLength bigint;
   _currentGeographyLength bigint;
BEGIN
   DROP TABLE IF EXISTS _pathIds;
   DROP TABLE IF EXISTS _geographyIds;
   CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
   CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);

   -- get all geographies in the specified _fk_ids
   INSERT INTO _geographyIds
   SELECT g.id
   FROM geographies g
   WHERE g.fk_id = ANY(_fk_ids);

   _pathLength := 0;
   _geographyLength := 0;
   _currentPathLength := 0;
   _currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
   -- _pathIds := ARRAY[]::uuid[];

   WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
      _pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
      _geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);

      -- gets all paths that intersect the geographies and aren't in the current list of path ids
      INSERT INTO _pathIds
      SELECT DISTINCT p.id
      FROM paths p
      JOIN geographies g ON ST_Intersects(g.geography, p.path)
      WHERE g.id IN (SELECT _geographyIds.id FROM _geographyIds)
        AND p.id NOT IN (SELECT _pathIds.id FROM _pathIds);

      -- gets all geographies that intersect the paths and aren't in the current list of geography ids
      INSERT INTO _geographyIds
      SELECT DISTINCT g.id
      FROM geographies g
      JOIN paths p ON ST_Intersects(g.geography, p.path)
      WHERE p.id IN (SELECT _pathIds.id FROM _pathIds)
        AND g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);

      _currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
      _currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
   END LOOP;

   RETURN QUERY
   SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
   UNION ALL
   SELECT _pathIds.id, 'path' AS type FROM _pathIds;
END;
$function$;
Sample plot and data from this script
It can be pure relational with an aggregate function. This implementation uses one path table and one point table. Both are geometries. The point is easier to create test data with and to test than a generic geography, but it should be simple to adapt.
create table path (
   path_text text primary key,
   path geometry(linestring) not null
);

create table point (
   point_text text primary key,
   point geometry(point) not null
);
A type to keep the aggregate function's state:
create type mpath_mpoint as (
   mpath geometry(multilinestring),
   mpoint geometry(multipoint)
);
The state building function:
create or replace function path_point_intersect (
   _i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$
   with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
        i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
   select array_agg((mpath, mpoint)::mpath_mpoint)
   from (
      select
         st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
         (
            select st_collect(gd)
            from (
               select gd from st_dump(i.mpath) a (a, gd)
               union all
               select gd from st_dump(e.mpath) b (a, gd)
            ) s
         ) as mpath
      from i inner join e on st_intersects(i.mpoint, e.mpoint)
      union all
      select i.mpoint, i.mpath
      from i inner join e on not st_intersects(i.mpoint, e.mpoint)
      union all
      select e.mpoint, e.mpath
      from e
      where not exists (
         select 1 from i
         where st_intersects(i.mpoint, e.mpoint)
      )
   ) s;
$$ language sql;
The aggregate:
create aggregate path_point_agg (mpath_mpoint) (
sfunc = path_point_intersect,
stype = mpath_mpoint[]
);
This query will return a set of multilinestring, multipoint strings containing the matched paths/points:
select st_astext(mpath), st_astext(mpoint)
from unnest((
   select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
   from (
      select path, st_union(point) as mpoint
      from path
      inner join point on st_intersects(path, point)
      group by path
   ) s
)) m (mpath, mpoint);
st_astext | st_astext
-----------------------------------------------------------+-----------------------------
MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6)) | MULTIPOINT(-8 -8,2 -8)
MULTILINESTRING((-7 -4,-3 4,-5 6)) | MULTIPOINT(-6 -2)