Conditionally delete item inside an Array Field PostgreSQL - sql

I'm building a kind of dictionary app and I have a table for storing words like below:
id | surface_form | examples
-----------------------------------------------------------------------
1 | sounds | {"It sounds as though you really do believe that",
| | "A different bell begins to sound midnight"}
Where surface_form is of type CHARACTER VARYING and examples is an array field of CHARACTER VARYING
Since the examples are generated automatically from another API, it might not contain the exact "surface_form". Now I want to keep in examples only sentences that contain the exact surface_form. For instance, in the given example, only the first sentence is kept as it contain sounds, the second should be omitted as it only contain sound.
The problem is I got stuck in how to write a query and/or plSQL stored procedure to update the examples column so that it only has the desired sentences.

This query skips unwanted array elements:
select id, array_agg(example) new_examples
from a_table, unnest(examples) example
where surface_form = any(string_to_array(example, ' '))
group by id;
id | new_examples
----+----------------------------------------------------
1 | {"It sounds as though you really do believe that"}
(1 row)
Use it in update:
with corrected as (
select id, array_agg(example) new_examples
from a_table, unnest(examples) example
where surface_form = any(string_to_array(example, ' '))
group by id
)
update a_table
set examples = new_examples
from corrected
where examples <> new_examples
and a_table.id = corrected.id;
Test it in rextester.

Maybe you have to change the table design. This is what PostgreSQL's documentation says about the use of arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
Documentation:
https://www.postgresql.org/docs/current/static/arrays.html

The most compact solution (but not necessarily the fastest) is to write a function that you pass a regular expression and an array and which then returns a new array that only contains the items matching the regex.
create function get_matching(p_values text[], p_pattern text)
returns text[]
as
$$
declare
l_result text[] := '{}'; -- make sure it's not null
l_element text;
begin
foreach l_element in array p_values loop
-- adjust this condition to whatever you want
if l_element ~ p_pattern then
l_result := l_result || l_element;
end if;
end loop;
return l_result;
end;
$$
language plpgsql;
The if condition is only an example. You need to adjust that to whatever you exactly store in the surface_form column. Maybe you need to test on word boundaries for the regex or a simple instr() would do - your question is unclear about that.
Cleaning up the table then becomes as simple as:
update the_table
set examples = get_matching(examples, surface_form);
But the whole approach seems flawed to me. It would be a lot more efficient if you stored the examples in a properly normalized data model.

In SQL, you have to remember two things.
Tuple elements are immutable but rows are mutable via updates.
SQL is declarative, not procedural
So you cannot "conditionally" "delete" a value from an array. You have to think about the question differently. You have to create a new array following a specification. That specification can conditionally include values (using case statements). Then you can overwrite the tuple with the new array.

Looks like one way could to update the array with array elements that are valid by doing a select using like or some regular expression.
https://www.postgresql.org/docs/current/static/arrays.html

If you want to hold elements from array that have "surface_form" in it you have to use that entries with substring(....,...) is not null
First you unnest the array, hold only items that match, and then array_agg the stored items
Here is a little query you can run to test without any table.
SELECT
id,
surface_form,
(SELECT array_agg(examples_matching)
FROM unnest(surfaces.examples) AS examples_matching
WHERE substring(examples_matching, surfaces.surface_form) IS NOT NULL)
FROM
(SELECT
1 AS id,
'example' :: TEXT AS surface_form,
ARRAY ['example form', 'test test','second example form'] :: TEXT [] AS examples
) surfaces;

You can select data in temp table using
Then update temp table using update query on row number
Merge value using
This merge value you can update in original table
For Example
Suppose you create temp table
Temp (id int, element character varying)
Then update Temp table and nest it.
Finally update original table
Here is the query you can directly try to execute in editor
CREATE TEMP TABLE IF NOT EXISTS temp_element (
id bigint,
element character varying)WITH (OIDS);
TRUNCATE TABLE temp_element;
insert into temp_element select row_number() over (order by p),p from (
select unnest(ARRAY['It sounds as though you really do believe that',
'A different bell begins to sound midnight']) as P)t;
update temp_element set element = 'It sounds as though you really'
where element = 'It sounds as though you really do believe that';
--update table
select array_agg(r) from ( select element from temp_element)r

Related

Conditionally replace single value per row in jsonb column

I need a more efficient way to update rows of a single table in Postgres 9.5.
I am currently doing it with pg_dump, and re-import with updated values after search and replace operations in a Linux OS environment.
table_a has 300000 rows with 2 columns: id bigint and json_col jsonb.
json_col has about 30 keys: "C1" to "C30" like in this example:
Table_A
id,json_col
1 {"C1":"Paris","C2":"London","C3":"Berlin","C4":"Tokyo", ... "C30":"Dallas"}
2 {"C1":"Dublin","C2":"Berlin","C3":"Kiev","C4":"Tokyo", ... "C30":"Phoenix"}
3 {"C1":"Paris","C2":"London","C3":"Berlin","C4":"Ankara", ... "C30":"Madrid"}
...
The requirement is to mass search all keys from C1 to C30 then look in
them for the value "Berlin" and replace with "Madrid" and only if
Madrid is not repeated. i.e. id:1 with Key C3, and id:2 with C2. id:3
will be skipped because C30 exists with this value already
It has to be in a single SQL command in PostgreSQL 9.5, one time and considering all keys from the jsonb column.
The fastest and simplest way is to modify the column as text:
update table_a
set json_col = replace(json_col::text, '"Berlin"', '"Madrid"')::jsonb
where json_col::text like '%"Berlin"%'
and json_col::text not like '%"Madrid"%'
It's a practical choice. The above query is rather a find-and-replace operation (like in a text editor) than a modification of objects attributes. The second option is more complicated and surely much more expensive. Even using the fast Javascript engine (example below) more formal solution would be many times slower.
You can try Postgres Javascript:
create extension if not exists plv8;
create or replace function replace_item(data jsonb, from_str text, to_str text)
returns jsonb language plv8 as $$
var found = 0;
Object.keys(data).forEach(function(key) {
if (data[key] == to_str) {
found = 1;
}
})
if (found == 0) {
Object.keys(data).forEach(function(key) {
if (data[key] == from_str) {
data[key] = to_str;
}
})
}
return data;
$$;
update table_a
set json_col = replace_item(json_col, 'Berlin', 'Madrid');
What makes this hard is that you are looking for unknown keys holding values of interest. Postgres infrastructure is optimized to find keys (or array values).
Possibly caused by a sub-optimal table design. The many top-level objects of your jsonb column might be replaced by an array, discarding irrelevant key names altogether. (Or maybe another array for key names.) Or, ideally with a full normalized DB schema to begin with.
Be that as it may, here is a proof of concept, how this can be fast and clean with stock Postgres 9.5 or later anyway.
Additional difficulty 1: it's unknown whether duplicate values are possible.
Additional difficulty 2: value frequencies are unknown, too.
Additional difficulty 3: only the first value found is to be replaced and only if the target value is not there yet. Implementing this with set-based operations is possible, but unwieldy. I wrote a plpgsql function instead:
CREATE OR REPLACE FUNCTION jsonb_replace_value(_j jsonb, _old jsonb, _new jsonb)
RETURNS jsonb AS
$func$
DECLARE
_key text;
_val jsonb;
BEGIN
FOR _key, _val IN
SELECT * FROM jsonb_each(_j)
LOOP
IF _val = _old THEN
RETURN jsonb_set(_j, ARRAY[_key], _new); -- update 1st key
END IF;
END LOOP;
RETURN _j; -- nothing found, return original
END
$func$ LANGUAGE plpgsql IMMUTABLE;
COMMENT ON FUNCTION jsonb_replace_value(jsonb, jsonb, jsonb) IS '
Replace the first occurrence of _old value with _new.
Call:
SELECT jsonb_replace_value('{"C1":"Paris","C3":"Berlin","C4":"Berlin"}', '"Berlin"', '"Madrid"')';
Could be enhanced to optionally replace all occurrences etc. but that's beyond the scope of this question.
Now this would be simple:
UPDATE table_a
SET json_col = jsonb_replace_value(json_col, '"Berlin"', '"Madrid"'); -- note jsonb literal syntax!
If all rows need an update, we can stop here. Won't get faster. (Except possibly with alternatives like demonstrated by #klin.)
If a large percentage of all rows need an update, add a WHERE condition to avoid empty updates:
...
WHERE json_col <> jsonb_replace_value(json_col, '"Berlin"', '"Madrid"');
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
Typically, only very few rows actually need an update. Then iterating through all rows with above query is expensive. We need index support to make it fast. Not easy for the case. I suggest an expression index based on an IMMUTABLE function extracting the array of values:
CREATE OR REPLACE FUNCTION jsonb_object_val_arr(jsonb)
RETURNS text[] LANGUAGE sql IMMUTABLE AS
'SELECT ARRAY (SELECT value FROM jsonb_each_text($1))';
COMMENT ON FUNCTION jsonb_object_val_arr(jsonb) IS '
Generates text array of values in outermost jsonb object.
Of limited use if there can be nested objects.';
CREATE INDEX table_a_val_arr_idx ON table_a USING gin (jsonb_object_val_arr(json_col));
Related, with more explanation:
Find rows containing a key in a JSONB array of records
Query making use of this index:
UPDATE table_a a
SET json_col = jsonb_replace_value(a.json_col, '"Berlin"', '"Madrid"')
WHERE jsonb_object_val_arr(json_col) #> '{Berlin}' -- has Berlin, possibly > 1x ..
-- AND NOT jsonb_object_val_arr(json_col) #> '{Madrid}'
AND NOT EXISTS ( -- .. but not Madrid
SELECT FROM table_a b
WHERE jsonb_object_val_arr(json_col) #> '{Madrid}' -- note array literal syntax
AND b.id = a.id
);
The NOT EXISTS semi-anti-join is carefully drafted to utilize the index a 2nd time.
The commented simpler alternative is faster if there are few rows with 'Berlin' and 'Madrid' - then a filter step in the query plan will be cheaper.
Should be very fast.
db<>fiddle here for Postgres 9.5 demonstrating all.
Ok i have tested all methods and i can say you did a great job
This helped me a lot. Let me share my feedback with you.
Method 1 sugested by Klin. Works perfect and is totally fine, except if
key is named like value, then both will be replaced key and value.
i.e.: "Berlin":"Berlin" becomes "Madrid":"Madrid"
Method 2 with plv8 extension did not worked because i am missing controll file
i had to install it and i just skipped this method, so i have no
feedback regarding this method.
Error that i was getting was this:
ERROR: could not open extension control file
"/usr/pgsql-9.5/share/extension/plv8.control": No such file or directory
Method 3 similar to method 2 with jsonb_replace_value function
works perfect, in replaces rows that contains specific value regardless
of the key. And adding condition
WHERE json_col <> jsonb_replace_value(json_col, '"Berlin"', '"Madrid"')
will avoid empty updates and will skip rows than do not need to be updated
And somethig like this
{"Berlin":"Berlin"} becomes {"Berlin":"Madrid"} i.e. Key is not touched, just value
Method 4 is a little more complicated, it uses Method 3 and Indexes
It works totally awesome and super speedy.
And NOT EXISTS semi-anti-join indeed forced to use Index again.
I was shocked how fast it performed!!!
However i discovered all this methods will work if json string looks like this:
{"key":"value"}
If i have for example to update a value that is a json object it will not update
something like this: {"C30":{"id":10044,"value":"Berlin","created_by":"John Doe"}}
MANY THANKS to you guys. #klin and #erwin-brandstetter. This helped me to learn something new!

Complicated text compare in SQL

Suppose I have a table result
---------------------------------------------------------
coupon id| required_product_ids|used_product_in_this_year
---------------------------------------------------------
1 |1,2,3,10 |2,3,4,5,6,7,8,9,10,12,13
How can I check if used_product_in_this_year has at least one required_product_ids by SQL.
I tried somethings with SQL like keyword but did not success.
There is no native SQL construct for performing this type of comparison.
To find a single value in a comma separated list, MySQL provides a FIND_IN_SET function. But to handle a comma separated list of values, to check each one, to see if it's in a list, each separate value would need to be supplied into FIND_IN_SET. And that would be unweildy.
If the hard and fast requirement is to handle this comparison in a SQL statement, I'd recommend writing a function to do the comparison.
DELIMITER $$
CREATE FUNCTION upity_halo_rpi(upity VARCHAR(4000), rpi VARCHAR(4000))
RETURNS INT DETERMINISTIC
BEGIN
-- TODO: extract first element of upity
-- TODO: check if element is in rpi list
-- if it is found in the list
RETURN TRUE;
-- otherwise, split off next element
-- loop through all elements
-- if loop completes without finding a match is found, fall out
RETURN FALSE;
END$$
DELIMITER ;
With the function written, and thoroughly tested, it could be used in a SQL statement. To return a column that indicates that the row "has at least one"...
SELECT t.coupon id
, t.required_product_ids
, t.used_product_in_this_year
, upity_halo_rpi(t.used_product_in_this_year,t.required_product_ids) AS halo
FROM result t
To return:
coupon id required_product_ids used_product_in_this_year halo
--------- -------------------- ------------------------ -----
1 1,2,3,10 2,3,4,5,6,7,8,9,10,12,13 1
I'm not going to write the function. I'm just demonstrating a possible approach. One possible answer to "how" this type of comparison operation could be performed within a SQL statement.
This is how you can do it, without changing your database structure.
In MYSQL (Tested):
select * from TableName
where concat(',', used_product_in_this_year, ',') regexp concat(',',replace(required_product_ids,',',',|,'),',')
Using this Regex structure with your table and manipulating the data with a some mysql string functions.
I don't recommend your database structure, but I like puzzles and this one was fun, thanks for the challenge.

PostgreSQL function check if field is CSV

I can accomplish this with PHP in the end, but it would be more elegant to have it in the SQL. I have no choice but to use PostgreSQL for this project and I have never used it before, so...
There is a table 'test_results' that contains the columns:
sample_id(text) | test_result(text) | sessiontime(bigint)
Another table has information that includes the sample_id, but some have had multiple tests run. When that happens the sample_id field is populated with a CSV list of sample_ids. Not all of these sample_ids exist in the test_results table. There is also no way of knowing how many tests have been run.
If there is only one sample_id it will be in the table and should be returned. Otherwise the field of CSV needs to split and checked to see if it exists and since only one test_result need be returned the one with the latest sessiontime(which is epochtime) need be returned.
I have been over this many ways and my code has now become a jumble of unworkable ...
Guidance would be appreciated. I can always go back and do it in the PHP if I need...
EDIT TO BE CLEAR.. SOMETHING LIKE THIS:
DROP FUNCTION get_test_results(text);
CREATE OR REPLACE FUNCTION get_test_results(sample_id TEXT) returns
table(test_results text) as $$
BEGIN
IF position("," in sample_id) THEN
-----DO SOMETHING to
ELSE
SELECT test_results FROM test_results WHERE sample_id = sample_id ORDER BY sessiontime DESC;
END IF;
END
$$ LANGUAGE plpgsql;
This not functioning yet.... needs to split_part(sample_id, ','::text, 1) then get all the results but on the one with the most recent sessiontime.
PostgreSQL is an excellent choice and very versatile for things like this.
First of, to determine if your sample_id is a single value or a list of values:
-- (sample_id ~ '^ *\d\+ *$') returns true if there is one number only
SELECT CASE WHEN sample_id ~ '^ *\d\+ *$' THEN sample_id::int END
Then, to open up the list of ids in a comma-separated list of samples you can unnest the array returned by string_to_array:
SELECT i
FROM unnest(string_to_array(sample_id, ',')::int[]) i
You can use that for either single or multiple numbers (since there is just one value, you'll get only one row).

PostgreSQL - Rule to create a copy of the primaryID table

In my schema I want to have a primaryID and a SearchID. For every SearchID it is the primaryID plus some text at the start. I need this to look like this:
PrimaryID = 1
SearchID = Search1
Since the PrimaryID is set to autoincrement, I was hoping I could use a postgresql rule to do the following (pseudo code)
IF PRIMARYID CHANGES
{
SEARCHID = SEARCH(PRIMARYID)
}
This would hopefully occure exactly after the primaryID is updated and happen automatically. So, is this the best way of achieving this and can anyone provide an example of how it is done?
Thank you
Postgres 11 introduced genuine generated columns. See:
Computed / calculated / virtual / derived columns in PostgreSQL
For older (or any) versions, you could emulate a "virtual generated column" with a special function. Say your table is named tbl and the serial primary key is named tbl_id:
CREATE FUNCTION search_id(t tbl)
RETURNS text STABLE LANGUAGE SQL AS
$$
SELECT 'Search' || $1.tbl_id;
$$;
Then you can:
SELECT t.tbl_id, t.search_id FROM tbl t;
Table-qualification in t.search_id is needed in this case. Since search_id is not found as column of table tbl, Postgres looks for a function that takes tbl as argument next.
Effectively, t.search_id is just a syntax variant of search_id(t), but makes usage rather intuitive.

PostgreSQL - best way to return an array of key-value pairs

I'm trying to select a number of fields, one of which needs to be an array with each element of the array containing two values. Each array item needs to contain a name (character varying) and an ID (numeric). I know how to return an array of single values (using the ARRAY keyword) but I'm unsure of how to return an array of an object which in itself contains two values.
The query is something like
SELECT
t.field1,
t.field2,
ARRAY(--with each element containing two values i.e. {'TheName', 1 })
FROM MyTable t
I read that one way to do this is by selecting the values into a type and then creating an array of that type. Problem is, the rest of the function is already returning a type (which means I would then have nested types - is that OK? If so, how would you read this data back in application code - i.e. with a .Net data provider like NPGSQL?)
Any help is much appreciated.
ARRAYs can only hold elements of the same type
Your example displays a text and an integer value (no single quotes around 1). It is generally impossible to mix types in an array. To get those values into an array you have to create a composite type and then form an ARRAY of that composite type like you already mentioned yourself.
Alternatively you can use the data types json in Postgres 9.2+, jsonb in Postgres 9.4+ or hstore for key-value pairs.
Of course, you can cast the integer to text, and work with a two-dimensional text array. Consider the two syntax variants for a array input in the demo below and consult the manual on array input.
There is a limitation to overcome. If you try to aggregate an ARRAY (build from key and value) into a two-dimensional array, the aggregate function array_agg() or the ARRAY constructor error out:
ERROR: could not find array type for data type text[]
There are ways around it, though.
Aggregate key-value pairs into a 2-dimensional array
PostgreSQL 9.1 with standard_conforming_strings= on:
CREATE TEMP TABLE tbl(
id int
,txt text
,txtarr text[]
);
The column txtarr is just there to demonstrate syntax variants in the INSERT command. The third row is spiked with meta-characters:
INSERT INTO tbl VALUES
(1, 'foo', '{{1,foo1},{2,bar1},{3,baz1}}')
,(2, 'bar', ARRAY[['1','foo2'],['2','bar2'],['3','baz2']])
,(3, '}b",a{r''', '{{1,foo3},{2,bar3},{3,baz3}}'); -- txt has meta-characters
SELECT * FROM tbl;
Simple case: aggregate two integer (I use the same twice) into a two-dimensional int array:
Update: Better with custom aggregate function
With the polymorphic type anyarray it works for all base types:
CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
,STYPE = anyarray
,INITCOND = '{}'
);
Call:
SELECT array_agg_mult(ARRAY[ARRAY[id,id]]) AS x -- for int
,array_agg_mult(ARRAY[ARRAY[id::text,txt]]) AS y -- or text
FROM tbl;
Note the additional ARRAY[] layer to make it a multidimensional array.
Update for Postgres 9.5+
Postgres now ships a variant of array_agg() accepting array input and you can replace my custom function from above with this:
The manual:
array_agg(expression)
...
input arrays concatenated into array of one
higher dimension (inputs must all have same dimensionality, and cannot
be empty or NULL)
I suspect that without having more knowledge of your application I'm not going to be able to get you all the way to the result you need. But we can get pretty far. For starters, there is the ROW function:
# SELECT 'foo', ROW(3, 'Bob');
?column? | row
----------+---------
foo | (3,Bob)
(1 row)
So that right there lets you bundle a whole row into a cell. You could also make things more explicit by making a type for it:
# CREATE TYPE person(id INTEGER, name VARCHAR);
CREATE TYPE
# SELECT now(), row(3, 'Bob')::person;
now | row
-------------------------------+---------
2012-02-03 10:46:13.279512-07 | (3,Bob)
(1 row)
Incidentally, whenever you make a table, PostgreSQL makes a type of the same name, so if you already have a table like this you also have a type. For example:
# DROP TYPE person;
DROP TYPE
# CREATE TABLE people (id SERIAL, name VARCHAR);
NOTICE: CREATE TABLE will create implicit sequence "people_id_seq" for serial column "people.id"
CREATE TABLE
# SELECT 'foo', row(3, 'Bob')::people;
?column? | row
----------+---------
foo | (3,Bob)
(1 row)
See in the third query there I used people just like a type.
Now this is not likely to be as much help as you'd think for two reasons:
I can't find any convenient syntax for pulling data out of the nested row.
I may be missing something, but I just don't see many people using this syntax. The only example I see in the documentation is a function taking a row value as an argument and doing something with it. I don't see an example of pulling the row out of the cell and querying against parts of it. It seems like you can package the data up this way, but it's hard to deconstruct after that. You'll wind up having to make a lot of stored procedures.
Your language's PostgreSQL driver may not be able to handle row-valued data nested in a row.
I can't speak for NPGSQL, but since this is a very PostgreSQL-specific feature you're not going to find support for it in libraries that support other databases. For example, Hibernate isn't going to be able to handle fetching an object stored as a cell value in a row. I'm not even sure the JDBC would be able to give Hibernate the information usefully, so the problem could go quite deep.
So, what you're doing here is feasible provided you can live without a lot of the niceties. I would recommend against pursuing it though, because it's going to be an uphill battle the whole way, unless I'm really misinformed.
A simple way without hstore
SELECT
jsonb_agg(to_jsonb (t))
FROM (
SELECT
unnest(ARRAY ['foo', 'bar', 'baz']) AS table_name
) t
>>> [{"table_name": "foo"}, {"table_name": "bar"}, {"table_name": "baz"}]