PostgreSQL - Rule to create a copy of the primaryID table - sql

In my schema I want to have a primaryID and a SearchID. For every SearchID it is the primaryID plus some text at the start. I need this to look like this:
PrimaryID = 1
SearchID = Search1
Since the PrimaryID is set to autoincrement, I was hoping I could use a postgresql rule to do the following (pseudo code)
IF PRIMARYID CHANGES
{
SEARCHID = SEARCH(PRIMARYID)
}
This would hopefully occure exactly after the primaryID is updated and happen automatically. So, is this the best way of achieving this and can anyone provide an example of how it is done?
Thank you

Postgres 11 introduced genuine generated columns. See:
Computed / calculated / virtual / derived columns in PostgreSQL
For older (or any) versions, you could emulate a "virtual generated column" with a special function. Say your table is named tbl and the serial primary key is named tbl_id:
CREATE FUNCTION search_id(t tbl)
RETURNS text STABLE LANGUAGE SQL AS
$$
SELECT 'Search' || $1.tbl_id;
$$;
Then you can:
SELECT t.tbl_id, t.search_id FROM tbl t;
Table-qualification in t.search_id is needed in this case. Since search_id is not found as column of table tbl, Postgres looks for a function that takes tbl as argument next.
Effectively, t.search_id is just a syntax variant of search_id(t), but makes usage rather intuitive.

Related

How can I create a calculate column in the creation of table in POSTGRESQL, for example in sql server LineTotal AS Price * Quantity [duplicate]

Does PostgreSQL support computed / calculated columns, like MS SQL Server? I can't find anything in the docs, but as this feature is included in many other DBMSs I thought I might be missing something.
Eg: http://msdn.microsoft.com/en-us/library/ms191250.aspx
Postgres 12 or newer
STORED generated columns are introduced with Postgres 12 - as defined in the SQL standard and implemented by some RDBMS including DB2, MySQL, and Oracle. Or the similar "computed columns" of SQL Server.
Trivial example:
CREATE TABLE tbl (
int1 int
, int2 int
, product bigint GENERATED ALWAYS AS (int1 * int2) STORED
);
fiddle
VIRTUAL generated columns may come with one of the next iterations. (Not in Postgres 15, yet).
Related:
Attribute notation for function call gives error
Postgres 11 or older
Up to Postgres 11 "generated columns" are not supported.
You can emulate VIRTUAL generated columns with a function using attribute notation (tbl.col) that looks and works much like a virtual generated column. That's a bit of a syntax oddity which exists in Postgres for historic reasons and happens to fit the case. This related answer has code examples:
Store common query as column?
The expression (looking like a column) is not included in a SELECT * FROM tbl, though. You always have to list it explicitly.
Can also be supported with a matching expression index - provided the function is IMMUTABLE. Like:
CREATE FUNCTION col(tbl) ... AS ... -- your computed expression here
CREATE INDEX ON tbl(col(tbl));
Alternatives
Alternatively, you can implement similar functionality with a VIEW, optionally coupled with expression indexes. Then SELECT * can include the generated column.
"Persisted" (STORED) computed columns can be implemented with triggers in a functionally equivalent way.
Materialized views are a related concept, implemented since Postgres 9.3.
In earlier versions one can manage MVs manually.
YES you can!! The solution should be easy, safe, and performant...
I'm new to postgresql, but it seems you can create computed columns by using an expression index, paired with a view (the view is optional, but makes makes life a bit easier).
Suppose my computation is md5(some_string_field), then I create the index as:
CREATE INDEX some_string_field_md5_index ON some_table(MD5(some_string_field));
Now, any queries that act on MD5(some_string_field) will use the index rather than computing it from scratch. For example:
SELECT MAX(some_field) FROM some_table GROUP BY MD5(some_string_field);
You can check this with explain.
However at this point you are relying on users of the table knowing exactly how to construct the column. To make life easier, you can create a VIEW onto an augmented version of the original table, adding in the computed value as a new column:
CREATE VIEW some_table_augmented AS
SELECT *, MD5(some_string_field) as some_string_field_md5 from some_table;
Now any queries using some_table_augmented will be able to use some_string_field_md5 without worrying about how it works..they just get good performance. The view doesn't copy any data from the original table, so it is good memory-wise as well as performance-wise. Note however that you can't update/insert into a view, only into the source table, but if you really want, I believe you can redirect inserts and updates to the source table using rules (I could be wrong on that last point as I've never tried it myself).
Edit: it seems if the query involves competing indices, the planner engine may sometimes not use the expression-index at all. The choice seems to be data dependant.
One way to do this is with a trigger!
CREATE TABLE computed(
one SERIAL,
two INT NOT NULL
);
CREATE OR REPLACE FUNCTION computed_two_trg()
RETURNS trigger
LANGUAGE plpgsql
SECURITY DEFINER
AS $BODY$
BEGIN
NEW.two = NEW.one * 2;
RETURN NEW;
END
$BODY$;
CREATE TRIGGER computed_500
BEFORE INSERT OR UPDATE
ON computed
FOR EACH ROW
EXECUTE PROCEDURE computed_two_trg();
The trigger is fired before the row is updated or inserted. It changes the field that we want to compute of NEW record and then it returns that record.
PostgreSQL 12 supports generated columns:
PostgreSQL 12 Beta 1 Released!
Generated Columns
PostgreSQL 12 allows the creation of generated columns that compute their values with an expression using the contents of other columns. This feature provides stored generated columns, which are computed on inserts and updates and are saved on disk. Virtual generated columns, which are computed only when a column is read as part of a query, are not implemented yet.
Generated Columns
A generated column is a special column that is always computed from other columns. Thus, it is for columns what a view is for tables.
CREATE TABLE people (
...,
height_cm numeric,
height_in numeric GENERATED ALWAYS AS (height_cm * 2.54) STORED
);
db<>fiddle demo
Well, not sure if this is what You mean but Posgres normally support "dummy" ETL syntax.
I created one empty column in table and then needed to fill it by calculated records depending on values in row.
UPDATE table01
SET column03 = column01*column02; /*e.g. for multiplication of 2 values*/
It is so dummy I suspect it is not what You are looking for.
Obviously it is not dynamic, you run it once. But no obstacle to get it into trigger.
Example on creating an empty virtual column
,(SELECT *
From (values (''))
A("virtual_col"))
Example on creating two virtual columns with values
SELECT *
From (values (45,'Completed')
, (1,'In Progress')
, (1,'Waiting')
, (1,'Loading')
) A("Count","Status")
order by "Count" desc
I have a code that works and use the term calculated, I'm not on postgresSQL pure tho we run on PADB
here is how it's used
create table some_table as
select category,
txn_type,
indiv_id,
accum_trip_flag,
max(first_true_origin) as true_origin,
max(first_true_dest ) as true_destination,
max(id) as id,
count(id) as tkts_cnt,
(case when calculated tkts_cnt=1 then 1 else 0 end) as one_way
from some_rando_table
group by 1,2,3,4 ;
A lightweight solution with Check constraint:
CREATE TABLE example (
discriminator INTEGER DEFAULT 0 NOT NULL CHECK (discriminator = 0)
);

Postgres ATOMIC stored procedure INSERT INTO . . . SELECT with one parameter and one set of rows from a table

I am trying to write a stored procedure to let a dev assign new user identities to a specified group when they don't already have one (i.e. insert a parameter and the output of a select statement into a joining table) without hand-writing every pair of foreign keys as values to do so. I know how I'd do it in T-SQL/SQL Server but I'm working with a preexisting/unfamiliar Postgres database. I would strongly prefer to keep my stored procedures as LANGUAGE SQL/BEGIN ATOMIC and this + online examples being simplified and/or using constants has made it difficult for me to get my bearings.
Apologies in advance for length, this is me trying to articulate why I do not believe this question is a duplicate based on what I've been able to find searching on my own but I may have overcorrected.
Schema (abstracted from the most identifying parts; these are not the original table names and I am not in a position to change what anything is called; I am also leaving out indexing for simplicity's sake) is like:
create table IF NOT EXISTS user_identities (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY NOT NULL,
[more columns not relevant to this query)
)
create table IF NOT EXISTS user_groups (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY NOT NULL,
name TEXT NOT NULL
)
create table IF NOT EXISTS group_identities (
user_id BIGINT REFERENCES user_identities(id) ON DELETE RESTRICT NOT NULL,
group_id BIGINT REFERENCES user_groups(id) ON DELETE RESTRICT NOT NULL
)
Expected dev behavior:
Add all predetermined identities intended to belong to a group in a single batch
Add identifying information for the new group (it is going to take a lot of convincing to bring the people involved around to using nested stored procedures for this if I ever can)
Bring the joining table up to date accordingly (what I've been asked to streamline).
If this were SQL Server I would do (error handling omitted for time and putting aside whether EXCEPT or NOT IN would be best for now, please)
create OR alter proc add_identities_to_group
#group_name varchar(50) NULL
as BEGIN
declare #use_group_id int
if #group_name is NULL
set #use_group_id = (select Top 1 id from user_groups where id not in (select group_id from group_identities) order by id asc)
ELSE set #use_group_id = (select id from user_groups where name = #group_name)
insert into group_identities (user_id, group_id)
select #use_group_id, id from user_identities
where id not in (select user_id from group_identities)
END
GO
Obviously this is not going to fly in Postgres; part of why I want to stick with atomic stored procedures is staying in "neutral" SQL, both to be closer to my comfort zone and because I don't know what other languages the database is currently set up for, but my existing education has played kind of fast and loose with differentiating what was T-SQL specific at any point.
I am aware that this is not going to run for a wide variety of reasons because I'm still trying to internalize the syntax, but the bad/conceptual draft I have written so that I have anything to stare at is:
create OR replace procedure add_identities_to_groups(
group_name text default NULL ) language SQL
BEGIN ATOMIC
declare use_group_id integer
if group_name is NULL
set use_group_id = (select Top 1 id from user_groups
where id not in (select user_id from group_identities)
order by id asc)
ELSE set use_group_id = (select id from user_groups where name = group_name) ;
insert into group_identities (group_id, user_id)
select use_group_id, id from user_identities
where id not in (select user_id from group_identities)
END ;
GO ;
Issues:
Have not found either answers for how to do this with the combination of a single variable and a column with BEGIN ATOMIC or hard confirmation that it wouldn't work (e.g. can atomic stored procedures just not accept parameters? I cannot find an answer to this on my own). (This is part of why existing answers that I can find here and elsewhere haven't been clarifying for me.)
~~Don't know how to compensate for Postgres's not differentiating variables and parameters from column names at all. (This is why examples using a hardcoded constant haven't helped, and they make up virtually all of what I can find off StackOverflow itself.)~~ Not a problem if Postgres will handle that intelligently within the atomic block but that's one of the things I hadn't been able to confirm on my own.
Google results for "vanilla" SQL unpredictably saturated with SQL Server anyway, while my lack of familiarity with Postgres is not doing me any favors but I don't know anyone personally who has more experience than I do.
because I don't know what other languages the database is currently set up for
All supported Postgres versions always include PL/pgSQL.
If you want to use procedural elements like variables or conditional statements like IF you need PL/pgSQL. So your procedure has to be defined with language plpgsql - that removes the possibility to use the ANSI standard BEGIN ATOMIC syntax.
Don't know how to compensate for Postgres's not differentiating variables and parameters from column names at all.
You don't. Most people simply using naming conventions to do that. In my environment we use p_ for parameters and l_ for "local" variables. Use whatever you prefer.
Quote from the manual
By default, PL/pgSQL will report an error if a name in an SQL statement could refer to either a variable or a table column. You can fix such a problem by renaming the variable or column, or by qualifying the ambiguous reference, or by telling PL/pgSQL which interpretation to prefer.
The simplest solution is to rename the variable or column. A common coding rule is to use a different naming convention for PL/pgSQL variables than you use for column names. For example, if you consistently name function variables v_something while none of your column names start with v_, no conflicts will occur.
As documented in the manual the body for a procedure written in PL/pgSQL (or any other language that is not SQL) must be provided as a string. This is typically done using dollar quoting to make writing the source easier.
As documented in the manual, if you want to store the result of a single row query in a variable, use select ... into from ....
As documented in the manual an IF statement needs a THEN
As documented in the manual there is no TOP clause in Postgres (or standard SQL). Use limit or the standard compliant fetch first 1 rows only instead.
To avoid a clash between names of variables and column names, most people use some kind of prefix for parameters and variables. This also helps to identify them in the code.
In Postgres it's usually faster to use NOT EXISTS instead of NOT IN.
In Postgres statements are terminated with ;. GO isn't a SQL command in SQL Server either - it's a client side thing supported by SSMS. To my knowledge, there is no SQL tool that works with Postgres that supports the GO "batch terminator" the same way SSMS does.
So a direct translation of your T-SQL code to PL/pgSQL could look like this:
create or replace procedure add_identities_to_groups(p_group_name text default NULL)
language plpgsql
as
$$ --<< start of PL/pgSQL code
declare --<< start a block for all variables
l_use_group_id integer;
begin --<< start the actual code
if p_group_name is NULL THEN --<< then required
select id
into l_use_group_id
from user_groups ug
where not exists (select * from group_identities gi where gi.id = ug.user_id)
order by ug.id asc
limit 1;
ELSE
select id
into l_use_group_id
from user_groups
where name = p_group_name;
end if;
insert into group_identities (group_id, user_id)
select l_use_group_id, id
from user_identities ui
where not exists (select * from group_identities gi where gi.user_id = ui.id);
END;
$$
;

How to use a temp sequence within a Postgresql function

I have some lines of SQL which will take a set of IDs from the same GROUP_ID that are not contiguous (ex. if some rows got deleted) and will make them contiguous again. I wanted to turn this into a function for reusability purposes. The lines work if executed individually but when I try to create the function I get the error
ERROR: relation "id_seq_temp" does not exist
LINE 10: UPDATE THINGS SET ID=nextval('id_se...
If I create a sequence outside of the function and use that sequence in the function instead then the function is created successfully (schema qualified or unqualified). However I felt like creating the temp sequence inside of the function rather than leaving it in the schema was a cleaner solution.
I have seen this question: Function shows error "relation my_table does not exist"
However, I'm using the public schema and schema qualifying the sequence with public. does not seem to help.
I've also seen this question: How to create a sql function using temp sequences and a SELECT on PostgreSQL8. I probably could use generate_series but this adds a lot of complexity that SERIES solves such as needing to know how big of a series to generate.
Here is my function, I anonymized some of the names - just in case there's a typo.
CREATE OR REPLACE FUNCTION reindex_ids(IN BIGINT) RETURNS VOID
LANGUAGE SQL
AS $$
CREATE TEMPORARY SEQUENCE id_seq_temp
MINVALUE 1
START WITH 1
INCREMENT BY 1;
ALTER SEQUENCE id_seq_temp RESTART;
UPDATE THINGS SET ID=ID+2000 WHERE GROUP_ID=$1;
UPDATE THINGS SET ID=nextval('id_seq_temp') WHERE GROUP_ID=$1;
$$;
Is it possible to use a sequence you create within a function later in the function?
Answer to question
The reason is that SQL functions (LANGUAGE sql) are parsed and planned as one. All objects used must exist before the function runs.
You can switch to PL/pgSQL, (LANGUAGE plpgsql) which plans each statement on demand. There you can create objects and use them in the next command.
See:
Why can PL/pgSQL functions have side effect, while SQL functions can't?
Since you are not returning anything, consider a PROCEDURE. (FUNCTION works, too.)
CREATE OR REPLACE PROCEDURE reindex_ids(IN bigint)
LANGUAGE plpgsql AS
$proc$
BEGIN
IF EXISTS ( SELECT FROM pg_catalog.pg_class
WHERE relname = 'id_seq_temp'
AND relnamespace = pg_my_temp_schema()
AND relkind = 'S') THEN
ALTER SEQUENCE id_seq_temp RESTART;
ELSE
CREATE TEMP SEQUENCE id_seq_temp;
END IF;
UPDATE things SET id = id + 2000 WHERE group_id = $1;
UPDATE things SET id = nextval('id_seq_temp') WHERE group_id = $1;
END
$proc$;
Call:
CALL reindex_ids(123);
This creates your temp sequence if it does not exist already.
If the sequence exists, it is reset. (Remember that temporary objects live for the duration of a session.)
In the unlikely event that some other object occupies the name, an exception is raised.
Alternative solutions
Solution 1
This usually works:
UPDATE things t
SET id = t1.new_id
FROM (
SELECT pk_id, row_number() OVER (ORDER BY id) AS new_id
FROM things
WHERE group_id = $1 -- your input here
) t1
WHERE t.pk_id = t1.pk_id;
And only updates each row once, so half the cost.
Replace pk_id with your PRIMARY KEY column, or any UNIQUE NOT NULL (combination of) column(s).
The trick is that the UPDATE typically processes rows according to the sort order of the subquery in the FROM clause. Updating in ascending order should never hit a duplicate key violation.
And the ORDER BY clause of the window function row_number() imposes that sort order on the resulting set. That's an undocumented implementation detail, so you might want to add an explicit ORDER BY to the subquery. But since the behavior of UPDATE is undocumented anyway, it still depends on an implementation detail.
You can wrap that into a plain SQL function.
Solution 2
Consider not doing what you are doing at all. Gaps in sequential numbers are typically expected and not a problem. Just live with it. See:
Serial numbers per group of rows for compound key

Conditionally replace single value per row in jsonb column

I need a more efficient way to update rows of a single table in Postgres 9.5.
I am currently doing it with pg_dump, and re-import with updated values after search and replace operations in a Linux OS environment.
table_a has 300000 rows with 2 columns: id bigint and json_col jsonb.
json_col has about 30 keys: "C1" to "C30" like in this example:
Table_A
id,json_col
1 {"C1":"Paris","C2":"London","C3":"Berlin","C4":"Tokyo", ... "C30":"Dallas"}
2 {"C1":"Dublin","C2":"Berlin","C3":"Kiev","C4":"Tokyo", ... "C30":"Phoenix"}
3 {"C1":"Paris","C2":"London","C3":"Berlin","C4":"Ankara", ... "C30":"Madrid"}
...
The requirement is to mass search all keys from C1 to C30 then look in
them for the value "Berlin" and replace with "Madrid" and only if
Madrid is not repeated. i.e. id:1 with Key C3, and id:2 with C2. id:3
will be skipped because C30 exists with this value already
It has to be in a single SQL command in PostgreSQL 9.5, one time and considering all keys from the jsonb column.
The fastest and simplest way is to modify the column as text:
update table_a
set json_col = replace(json_col::text, '"Berlin"', '"Madrid"')::jsonb
where json_col::text like '%"Berlin"%'
and json_col::text not like '%"Madrid"%'
It's a practical choice. The above query is rather a find-and-replace operation (like in a text editor) than a modification of objects attributes. The second option is more complicated and surely much more expensive. Even using the fast Javascript engine (example below) more formal solution would be many times slower.
You can try Postgres Javascript:
create extension if not exists plv8;
create or replace function replace_item(data jsonb, from_str text, to_str text)
returns jsonb language plv8 as $$
var found = 0;
Object.keys(data).forEach(function(key) {
if (data[key] == to_str) {
found = 1;
}
})
if (found == 0) {
Object.keys(data).forEach(function(key) {
if (data[key] == from_str) {
data[key] = to_str;
}
})
}
return data;
$$;
update table_a
set json_col = replace_item(json_col, 'Berlin', 'Madrid');
What makes this hard is that you are looking for unknown keys holding values of interest. Postgres infrastructure is optimized to find keys (or array values).
Possibly caused by a sub-optimal table design. The many top-level objects of your jsonb column might be replaced by an array, discarding irrelevant key names altogether. (Or maybe another array for key names.) Or, ideally with a full normalized DB schema to begin with.
Be that as it may, here is a proof of concept, how this can be fast and clean with stock Postgres 9.5 or later anyway.
Additional difficulty 1: it's unknown whether duplicate values are possible.
Additional difficulty 2: value frequencies are unknown, too.
Additional difficulty 3: only the first value found is to be replaced and only if the target value is not there yet. Implementing this with set-based operations is possible, but unwieldy. I wrote a plpgsql function instead:
CREATE OR REPLACE FUNCTION jsonb_replace_value(_j jsonb, _old jsonb, _new jsonb)
RETURNS jsonb AS
$func$
DECLARE
_key text;
_val jsonb;
BEGIN
FOR _key, _val IN
SELECT * FROM jsonb_each(_j)
LOOP
IF _val = _old THEN
RETURN jsonb_set(_j, ARRAY[_key], _new); -- update 1st key
END IF;
END LOOP;
RETURN _j; -- nothing found, return original
END
$func$ LANGUAGE plpgsql IMMUTABLE;
COMMENT ON FUNCTION jsonb_replace_value(jsonb, jsonb, jsonb) IS '
Replace the first occurrence of _old value with _new.
Call:
SELECT jsonb_replace_value('{"C1":"Paris","C3":"Berlin","C4":"Berlin"}', '"Berlin"', '"Madrid"')';
Could be enhanced to optionally replace all occurrences etc. but that's beyond the scope of this question.
Now this would be simple:
UPDATE table_a
SET json_col = jsonb_replace_value(json_col, '"Berlin"', '"Madrid"'); -- note jsonb literal syntax!
If all rows need an update, we can stop here. Won't get faster. (Except possibly with alternatives like demonstrated by #klin.)
If a large percentage of all rows need an update, add a WHERE condition to avoid empty updates:
...
WHERE json_col <> jsonb_replace_value(json_col, '"Berlin"', '"Madrid"');
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
Typically, only very few rows actually need an update. Then iterating through all rows with above query is expensive. We need index support to make it fast. Not easy for the case. I suggest an expression index based on an IMMUTABLE function extracting the array of values:
CREATE OR REPLACE FUNCTION jsonb_object_val_arr(jsonb)
RETURNS text[] LANGUAGE sql IMMUTABLE AS
'SELECT ARRAY (SELECT value FROM jsonb_each_text($1))';
COMMENT ON FUNCTION jsonb_object_val_arr(jsonb) IS '
Generates text array of values in outermost jsonb object.
Of limited use if there can be nested objects.';
CREATE INDEX table_a_val_arr_idx ON table_a USING gin (jsonb_object_val_arr(json_col));
Related, with more explanation:
Find rows containing a key in a JSONB array of records
Query making use of this index:
UPDATE table_a a
SET json_col = jsonb_replace_value(a.json_col, '"Berlin"', '"Madrid"')
WHERE jsonb_object_val_arr(json_col) #> '{Berlin}' -- has Berlin, possibly > 1x ..
-- AND NOT jsonb_object_val_arr(json_col) #> '{Madrid}'
AND NOT EXISTS ( -- .. but not Madrid
SELECT FROM table_a b
WHERE jsonb_object_val_arr(json_col) #> '{Madrid}' -- note array literal syntax
AND b.id = a.id
);
The NOT EXISTS semi-anti-join is carefully drafted to utilize the index a 2nd time.
The commented simpler alternative is faster if there are few rows with 'Berlin' and 'Madrid' - then a filter step in the query plan will be cheaper.
Should be very fast.
db<>fiddle here for Postgres 9.5 demonstrating all.
Ok i have tested all methods and i can say you did a great job
This helped me a lot. Let me share my feedback with you.
Method 1 sugested by Klin. Works perfect and is totally fine, except if
key is named like value, then both will be replaced key and value.
i.e.: "Berlin":"Berlin" becomes "Madrid":"Madrid"
Method 2 with plv8 extension did not worked because i am missing controll file
i had to install it and i just skipped this method, so i have no
feedback regarding this method.
Error that i was getting was this:
ERROR: could not open extension control file
"/usr/pgsql-9.5/share/extension/plv8.control": No such file or directory
Method 3 similar to method 2 with jsonb_replace_value function
works perfect, in replaces rows that contains specific value regardless
of the key. And adding condition
WHERE json_col <> jsonb_replace_value(json_col, '"Berlin"', '"Madrid"')
will avoid empty updates and will skip rows than do not need to be updated
And somethig like this
{"Berlin":"Berlin"} becomes {"Berlin":"Madrid"} i.e. Key is not touched, just value
Method 4 is a little more complicated, it uses Method 3 and Indexes
It works totally awesome and super speedy.
And NOT EXISTS semi-anti-join indeed forced to use Index again.
I was shocked how fast it performed!!!
However i discovered all this methods will work if json string looks like this:
{"key":"value"}
If i have for example to update a value that is a json object it will not update
something like this: {"C30":{"id":10044,"value":"Berlin","created_by":"John Doe"}}
MANY THANKS to you guys. #klin and #erwin-brandstetter. This helped me to learn something new!

Conditionally delete item inside an Array Field PostgreSQL

I'm building a kind of dictionary app and I have a table for storing words like below:
id | surface_form | examples
-----------------------------------------------------------------------
1 | sounds | {"It sounds as though you really do believe that",
| | "A different bell begins to sound midnight"}
Where surface_form is of type CHARACTER VARYING and examples is an array field of CHARACTER VARYING
Since the examples are generated automatically from another API, it might not contain the exact "surface_form". Now I want to keep in examples only sentences that contain the exact surface_form. For instance, in the given example, only the first sentence is kept as it contain sounds, the second should be omitted as it only contain sound.
The problem is I got stuck in how to write a query and/or plSQL stored procedure to update the examples column so that it only has the desired sentences.
This query skips unwanted array elements:
select id, array_agg(example) new_examples
from a_table, unnest(examples) example
where surface_form = any(string_to_array(example, ' '))
group by id;
id | new_examples
----+----------------------------------------------------
1 | {"It sounds as though you really do believe that"}
(1 row)
Use it in update:
with corrected as (
select id, array_agg(example) new_examples
from a_table, unnest(examples) example
where surface_form = any(string_to_array(example, ' '))
group by id
)
update a_table
set examples = new_examples
from corrected
where examples <> new_examples
and a_table.id = corrected.id;
Test it in rextester.
Maybe you have to change the table design. This is what PostgreSQL's documentation says about the use of arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
Documentation:
https://www.postgresql.org/docs/current/static/arrays.html
The most compact solution (but not necessarily the fastest) is to write a function that you pass a regular expression and an array and which then returns a new array that only contains the items matching the regex.
create function get_matching(p_values text[], p_pattern text)
returns text[]
as
$$
declare
l_result text[] := '{}'; -- make sure it's not null
l_element text;
begin
foreach l_element in array p_values loop
-- adjust this condition to whatever you want
if l_element ~ p_pattern then
l_result := l_result || l_element;
end if;
end loop;
return l_result;
end;
$$
language plpgsql;
The if condition is only an example. You need to adjust that to whatever you exactly store in the surface_form column. Maybe you need to test on word boundaries for the regex or a simple instr() would do - your question is unclear about that.
Cleaning up the table then becomes as simple as:
update the_table
set examples = get_matching(examples, surface_form);
But the whole approach seems flawed to me. It would be a lot more efficient if you stored the examples in a properly normalized data model.
In SQL, you have to remember two things.
Tuple elements are immutable but rows are mutable via updates.
SQL is declarative, not procedural
So you cannot "conditionally" "delete" a value from an array. You have to think about the question differently. You have to create a new array following a specification. That specification can conditionally include values (using case statements). Then you can overwrite the tuple with the new array.
Looks like one way could to update the array with array elements that are valid by doing a select using like or some regular expression.
https://www.postgresql.org/docs/current/static/arrays.html
If you want to hold elements from array that have "surface_form" in it you have to use that entries with substring(....,...) is not null
First you unnest the array, hold only items that match, and then array_agg the stored items
Here is a little query you can run to test without any table.
SELECT
id,
surface_form,
(SELECT array_agg(examples_matching)
FROM unnest(surfaces.examples) AS examples_matching
WHERE substring(examples_matching, surfaces.surface_form) IS NOT NULL)
FROM
(SELECT
1 AS id,
'example' :: TEXT AS surface_form,
ARRAY ['example form', 'test test','second example form'] :: TEXT [] AS examples
) surfaces;
You can select data in temp table using
Then update temp table using update query on row number
Merge value using
This merge value you can update in original table
For Example
Suppose you create temp table
Temp (id int, element character varying)
Then update Temp table and nest it.
Finally update original table
Here is the query you can directly try to execute in editor
CREATE TEMP TABLE IF NOT EXISTS temp_element (
id bigint,
element character varying)WITH (OIDS);
TRUNCATE TABLE temp_element;
insert into temp_element select row_number() over (order by p),p from (
select unnest(ARRAY['It sounds as though you really do believe that',
'A different bell begins to sound midnight']) as P)t;
update temp_element set element = 'It sounds as though you really'
where element = 'It sounds as though you really do believe that';
--update table
select array_agg(r) from ( select element from temp_element)r