PostgreSQL function check if field is CSV - sql

I can accomplish this with PHP in the end, but it would be more elegant to have it in the SQL. I have no choice but to use PostgreSQL for this project and I have never used it before, so...
There is a table 'test_results' that contains the columns:
sample_id(text) | test_result(text) | sessiontime(bigint)
Another table has information that includes the sample_id, but some samples have had multiple tests run. When that happens, the sample_id field is populated with a CSV list of sample_ids. Not all of these sample_ids exist in the test_results table, and there is no way of knowing how many tests have been run.
If there is only one sample_id, it will be in the table and its result should be returned. Otherwise the CSV field needs to be split and each id checked to see if it exists; since only one test_result need be returned, the one with the latest sessiontime (which is epoch time) should be returned.
I have been over this many ways and my code has now become a jumble of unworkable ...
Guidance would be appreciated. I can always go back and do it in PHP if I need to...
EDIT TO BE CLEAR.. SOMETHING LIKE THIS:
DROP FUNCTION IF EXISTS get_test_results(text);
CREATE OR REPLACE FUNCTION get_test_results(_sample_id text)  -- renamed parameter to avoid clashing with the column
RETURNS TABLE (test_result text) AS $$
BEGIN
IF position(',' in _sample_id) > 0 THEN  -- single quotes; position() returns int
    NULL;  -----DO SOMETHING to split the CSV and check each id
ELSE
    RETURN QUERY
    SELECT t.test_result
    FROM   test_results t
    WHERE  t.sample_id = _sample_id
    ORDER  BY t.sessiontime DESC;
END IF;
END
$$ LANGUAGE plpgsql;
This is not functioning yet.... it still needs split_part(sample_id, ','::text, 1) (or similar) to handle the CSV case, returning only the result with the most recent sessiontime.

PostgreSQL is an excellent choice and very versatile for things like this.
First off, to determine if your sample_id is a single value or a list of values:
-- (sample_id ~ '^ *\d+ *$') returns true if there is one number only
SELECT CASE WHEN sample_id ~ '^ *\d+ *$' THEN sample_id::int END
Then, to open up the list of ids in a comma-separated list of samples you can unnest the array returned by string_to_array:
SELECT i
FROM unnest(string_to_array(sample_id, ',')::int[]) i
You can use that for either a single number or multiple numbers (if there is just one value, you'll get only one row).
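Putting the pieces together, here is a minimal sketch of the whole function (an assumption on my part: table and column names as in the question, and ids compared as text rather than cast to int):

CREATE OR REPLACE FUNCTION get_test_results(_ids text)
RETURNS TABLE (test_result text) AS
$$
   -- split the (possibly single-element) CSV, join against test_results,
   -- and keep only the row with the most recent sessiontime
   SELECT t.test_result
   FROM   unnest(string_to_array(_ids, ',')) AS s(id)
   JOIN   test_results t ON t.sample_id = trim(s.id)
   ORDER  BY t.sessiontime DESC
   LIMIT  1;
$$ LANGUAGE sql;

-- Call:
-- SELECT * FROM get_test_results('S1,S2,S3');

The same call works for a single id, since string_to_array() simply returns a one-element array in that case.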

Conditionally replace single value per row in jsonb column

I need a more efficient way to update rows of a single table in Postgres 9.5.
I am currently doing it with pg_dump and re-importing with updated values after search-and-replace operations in a Linux OS environment.
table_a has 300000 rows with 2 columns: id bigint and json_col jsonb.
json_col has about 30 keys: "C1" to "C30" like in this example:
Table_A
id,json_col
1 {"C1":"Paris","C2":"London","C3":"Berlin","C4":"Tokyo", ... "C30":"Dallas"}
2 {"C1":"Dublin","C2":"Berlin","C3":"Kiev","C4":"Tokyo", ... "C30":"Phoenix"}
3 {"C1":"Paris","C2":"London","C3":"Berlin","C4":"Ankara", ... "C30":"Madrid"}
...
The requirement is to mass search all keys from C1 to C30, look in
them for the value "Berlin" and replace it with "Madrid", and only if
"Madrid" is not already present. I.e. id:1 matches with key C3, and id:2 with C2; id:3
will be skipped because C30 already holds this value.
It has to be a single SQL command in PostgreSQL 9.5, run one time, considering all keys of the jsonb column.
The fastest and simplest way is to modify the column as text:
update table_a
set json_col = replace(json_col::text, '"Berlin"', '"Madrid"')::jsonb
where json_col::text like '%"Berlin"%'
and json_col::text not like '%"Madrid"%'
It's a practical choice. The above query is more of a find-and-replace operation (like in a text editor) than a modification of object attributes. The second option is more complicated and surely much more expensive. Even using the fast JavaScript engine (example below), a more formal solution would be many times slower.
You can try Postgres Javascript:
create extension if not exists plv8;

create or replace function replace_item(data jsonb, from_str text, to_str text)
returns jsonb language plv8 as $$
    // skip the whole row if the target value is already present
    var found = 0;
    Object.keys(data).forEach(function(key) {
        if (data[key] == to_str) {
            found = 1;
        }
    })
    if (found == 0) {
        // replace every occurrence of the old value
        Object.keys(data).forEach(function(key) {
            if (data[key] == from_str) {
                data[key] = to_str;
            }
        })
    }
    return data;
$$;

update table_a
set json_col = replace_item(json_col, 'Berlin', 'Madrid');
What makes this hard is that you are looking for unknown keys holding values of interest. Postgres infrastructure is optimized to find keys (or array values).
Possibly caused by a sub-optimal table design. The many top-level objects of your jsonb column might be replaced by an array, discarding irrelevant key names altogether. (Or maybe another array for key names.) Or, ideally with a full normalized DB schema to begin with.
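As an illustration of that redesign (an assumption on my part, not part of the original schema), the keyed objects could be collapsed into a plain array of values:

-- hypothetical redesign: keep only the values, discarding key names
ALTER TABLE table_a ADD COLUMN val_arr jsonb;

UPDATE table_a
SET    val_arr = (SELECT jsonb_agg(value) FROM jsonb_each(json_col));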
Be that as it may, here is a proof of concept, how this can be fast and clean with stock Postgres 9.5 or later anyway.
Additional difficulty 1: it's unknown whether duplicate values are possible.
Additional difficulty 2: value frequencies are unknown, too.
Additional difficulty 3: only the first value found is to be replaced and only if the target value is not there yet. Implementing this with set-based operations is possible, but unwieldy. I wrote a plpgsql function instead:
CREATE OR REPLACE FUNCTION jsonb_replace_value(_j jsonb, _old jsonb, _new jsonb)
  RETURNS jsonb AS
$func$
DECLARE
   _key text;
   _val jsonb;
BEGIN
   FOR _key, _val IN
      SELECT * FROM jsonb_each(_j)
   LOOP
      IF _val = _old THEN
         RETURN jsonb_set(_j, ARRAY[_key], _new);  -- update 1st key
      END IF;
   END LOOP;

   RETURN _j;  -- nothing found, return original
END
$func$ LANGUAGE plpgsql IMMUTABLE;
COMMENT ON FUNCTION jsonb_replace_value(jsonb, jsonb, jsonb) IS '
Replace the first occurrence of _old value with _new.
Call:
SELECT jsonb_replace_value(''{"C1":"Paris","C3":"Berlin","C4":"Berlin"}'', ''"Berlin"'', ''"Madrid"'')';
Could be enhanced to optionally replace all occurrences etc. but that's beyond the scope of this question.
Now this would be simple:
UPDATE table_a
SET json_col = jsonb_replace_value(json_col, '"Berlin"', '"Madrid"'); -- note jsonb literal syntax!
If all rows need an update, we can stop here. It won't get faster. (Except possibly with alternatives like the one demonstrated by #klin.)
If a large percentage of all rows needs an update, add a WHERE condition to avoid empty updates:
...
WHERE json_col <> jsonb_replace_value(json_col, '"Berlin"', '"Madrid"');
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
Typically, only very few rows actually need an update. Then iterating through all rows with the above query is expensive. We need index support to make it fast, which is not easy for this case. I suggest an expression index based on an IMMUTABLE function extracting the array of values:
CREATE OR REPLACE FUNCTION jsonb_object_val_arr(jsonb)
RETURNS text[] LANGUAGE sql IMMUTABLE AS
'SELECT ARRAY (SELECT value FROM jsonb_each_text($1))';
COMMENT ON FUNCTION jsonb_object_val_arr(jsonb) IS '
Generates text array of values in outermost jsonb object.
Of limited use if there can be nested objects.';
CREATE INDEX table_a_val_arr_idx ON table_a USING gin (jsonb_object_val_arr(json_col));
Related, with more explanation:
Find rows containing a key in a JSONB array of records
Query making use of this index:
UPDATE table_a a
SET    json_col = jsonb_replace_value(a.json_col, '"Berlin"', '"Madrid"')
WHERE  jsonb_object_val_arr(a.json_col) @> '{Berlin}' -- has Berlin, possibly > 1x ..
-- AND NOT jsonb_object_val_arr(a.json_col) @> '{Madrid}'
AND    NOT EXISTS (  -- .. but not Madrid
   SELECT FROM table_a b
   WHERE  jsonb_object_val_arr(b.json_col) @> '{Madrid}' -- note array literal syntax
   AND    b.id = a.id
   );
The NOT EXISTS semi-anti-join is carefully drafted to utilize the index a 2nd time.
The commented simpler alternative is faster if there are few rows with 'Berlin' and 'Madrid' - then a filter step in the query plan will be cheaper.
Should be very fast.
db<>fiddle here for Postgres 9.5 demonstrating all.
OK, I have tested all the methods and I can say you did a great job.
This helped me a lot. Let me share my feedback with you.
Method 1, suggested by klin: works perfectly and is totally fine, except if a
key is named like a value; then both the key and the value will be replaced.
i.e.: "Berlin":"Berlin" becomes "Madrid":"Madrid"
Method 2, with the plv8 extension, did not work because I am missing the control file.
I would have had to install it, so I just skipped this method and have no
feedback regarding it.
The error I was getting was this:
ERROR: could not open extension control file
"/usr/pgsql-9.5/share/extension/plv8.control": No such file or directory
Method 3, similar to method 2 but with the jsonb_replace_value function,
works perfectly; it replaces rows that contain the specific value regardless
of the key. And adding the condition
WHERE json_col <> jsonb_replace_value(json_col, '"Berlin"', '"Madrid"')
will avoid empty updates and skip rows that do not need to be updated.
And something like this:
{"Berlin":"Berlin"} becomes {"Berlin":"Madrid"} i.e. the key is not touched, just the value
Method 4 is a little more complicated; it uses Method 3 plus the indexes.
It is totally awesome and super speedy.
And the NOT EXISTS semi-anti-join indeed forced the planner to use the index again.
I was shocked how fast it performed!!!
However, I discovered all these methods only work if the json string looks like this:
{"key":"value"}
If I have, for example, to update a value that is itself a json object, it will not update
something like this: {"C30":{"id":10044,"value":"Berlin","created_by":"John Doe"}}
MANY THANKS to you guys, #klin and #erwin-brandstetter. This helped me to learn something new!

Conditionally delete item inside an Array Field PostgreSQL

I'm building a kind of dictionary app and I have a table for storing words like below:
id | surface_form | examples
-----------------------------------------------------------------------
1 | sounds | {"It sounds as though you really do believe that",
| | "A different bell begins to sound midnight"}
Where surface_form is of type CHARACTER VARYING and examples is an array field of CHARACTER VARYING
Since the examples are generated automatically from another API, they might not contain the exact surface_form. Now I want to keep in examples only sentences that contain the exact surface_form. For instance, in the given example, only the first sentence is kept as it contains sounds; the second should be omitted as it only contains sound.
The problem is I got stuck on how to write a query and/or a PL/pgSQL stored procedure to update the examples column so that it only keeps the desired sentences.
This query skips unwanted array elements:
select id, array_agg(example) new_examples
from a_table, unnest(examples) example
where surface_form = any(string_to_array(example, ' '))
group by id;
id | new_examples
----+----------------------------------------------------
1 | {"It sounds as though you really do believe that"}
(1 row)
Use it in update:
with corrected as (
select id, array_agg(example) new_examples
from a_table, unnest(examples) example
where surface_form = any(string_to_array(example, ' '))
group by id
)
update a_table
set examples = new_examples
from corrected
where examples <> new_examples
and a_table.id = corrected.id;
Test it in rextester.
Maybe you have to change the table design. This is what PostgreSQL's documentation says about the use of arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
Documentation:
https://www.postgresql.org/docs/current/static/arrays.html
The most compact solution (but not necessarily the fastest) is to write a function that you pass a regular expression and an array and which then returns a new array that only contains the items matching the regex.
create function get_matching(p_values text[], p_pattern text)
returns text[]
as
$$
declare
l_result text[] := '{}'; -- make sure it's not null
l_element text;
begin
foreach l_element in array p_values loop
-- adjust this condition to whatever you want
if l_element ~ p_pattern then
l_result := l_result || l_element;
end if;
end loop;
return l_result;
end;
$$
language plpgsql;
The if condition is only an example. You need to adjust that to whatever you exactly store in the surface_form column. Maybe you need to test on word boundaries in the regex, or a simple strpos() would do - your question is unclear about that.
Cleaning up the table then becomes as simple as:
update the_table
set examples = get_matching(examples, surface_form);
But the whole approach seems flawed to me. It would be a lot more efficient if you stored the examples in a properly normalized data model.
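For what it's worth, a minimal sketch of such a normalized model (table and column names are my assumptions):

CREATE TABLE word (
   id           serial PRIMARY KEY,
   surface_form character varying NOT NULL
);

CREATE TABLE example (
   word_id  integer NOT NULL REFERENCES word(id),
   sentence character varying NOT NULL
);

-- cleaning up becomes a plain DELETE instead of an array rewrite
-- (\m and \M match word boundaries in Postgres regular expressions)
DELETE FROM example e
USING  word w
WHERE  w.id = e.word_id
AND    e.sentence !~ ('\m' || w.surface_form || '\M');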
In SQL, you have to remember two things.
Tuple elements are immutable but rows are mutable via updates.
SQL is declarative, not procedural
So you cannot "conditionally" "delete" a value from an array. You have to think about the question differently. You have to create a new array following a specification. That specification can conditionally include values (using case statements). Then you can overwrite the tuple with the new array.
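For instance, a sketch of that rebuild-and-overwrite approach (a filtered subquery stands in for the CASE expressions; the word-boundary regex is an assumption about what "contains" should mean, and the table name follows the first answer):

UPDATE a_table t
SET    examples = (
   SELECT coalesce(array_agg(e), '{}')           -- array_agg returns NULL for no rows
   FROM   unnest(t.examples) AS e
   WHERE  e ~ ('\m' || t.surface_form || '\M')   -- keep only matching sentences
);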
Looks like one way could be to update the array with the array elements that are valid, by doing a select using LIKE or some regular expression.
https://www.postgresql.org/docs/current/static/arrays.html
If you want to keep only the elements of the array that contain surface_form, you can keep the entries where substring(...., ...) IS NOT NULL.
First you unnest the array, keep only the items that match, and then array_agg the kept items.
Here is a little query you can run to test without any table.
SELECT
id,
surface_form,
(SELECT array_agg(examples_matching)
FROM unnest(surfaces.examples) AS examples_matching
WHERE substring(examples_matching, surfaces.surface_form) IS NOT NULL)
FROM
(SELECT
1 AS id,
'example' :: TEXT AS surface_form,
ARRAY ['example form', 'test test','second example form'] :: TEXT [] AS examples
) surfaces;
You can select the data into a temp table using unnest,
then update the temp table with an update query keyed on the row number,
merge the values back together using array_agg,
and update the original table with that merged value.
For example:
Suppose you create a temp table
temp_element (id int, element character varying)
Then you update the temp table and re-nest it.
Finally you update the original table.
Here is a query you can try directly in an editor:
CREATE TEMP TABLE IF NOT EXISTS temp_element (
   id bigint,
   element character varying);
TRUNCATE TABLE temp_element;
-- unnest the array into numbered rows
insert into temp_element select row_number() over (order by p), p from (
   select unnest(ARRAY['It sounds as though you really do believe that',
                       'A different bell begins to sound midnight']) as p) t;
-- update individual elements
update temp_element set element = 'It sounds as though you really'
where element = 'It sounds as though you really do believe that';
-- re-aggregate into an array (use the result to update the original table)
select array_agg(element) from temp_element;

Complicated text compare in SQL

Suppose I have a table result
---------------------------------------------------------
coupon id| required_product_ids|used_product_in_this_year
---------------------------------------------------------
1 |1,2,3,10 |2,3,4,5,6,7,8,9,10,12,13
How can I check with SQL if used_product_in_this_year contains at least one of the required_product_ids?
I tried some things with the SQL LIKE keyword but did not succeed.
There is no native SQL construct for performing this type of comparison.
To find a single value in a comma separated list, MySQL provides a FIND_IN_SET function. But to handle a comma separated list of values, checking each one to see if it's in the other list, each separate value would need to be supplied to FIND_IN_SET in turn. And that would be unwieldy.
If the hard and fast requirement is to handle this comparison in a SQL statement, I'd recommend writing a function to do the comparison.
DELIMITER $$

CREATE FUNCTION upity_halo_rpi(upity VARCHAR(4000), rpi VARCHAR(4000))
RETURNS INT DETERMINISTIC
BEGIN
  -- TODO: extract first element of upity
  -- TODO: check if element is in rpi list
  -- if it is found in the list
  RETURN TRUE;
  -- otherwise, split off next element
  -- loop through all elements
  -- if the loop completes without finding a match, fall out
  RETURN FALSE;
END$$

DELIMITER ;
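Purely as an illustrative, untested sketch, those TODOs might be filled in along these lines (SUBSTRING_INDEX, LOCATE and FIND_IN_SET do the list walking; parameter names are kept from the stub):

DELIMITER $$

CREATE FUNCTION upity_halo_rpi(upity VARCHAR(4000), rpi VARCHAR(4000))
RETURNS INT DETERMINISTIC
BEGIN
  DECLARE elem VARCHAR(4000);
  WHILE LENGTH(upity) > 0 DO
    -- extract the first element of the comma separated list
    SET elem = SUBSTRING_INDEX(upity, ',', 1);
    -- check if that element is in the rpi list
    IF FIND_IN_SET(elem, rpi) > 0 THEN
      RETURN TRUE;
    END IF;
    -- split off the next element, or stop when the list is exhausted
    IF LOCATE(',', upity) > 0 THEN
      SET upity = SUBSTRING(upity, LOCATE(',', upity) + 1);
    ELSE
      SET upity = '';
    END IF;
  END WHILE;
  RETURN FALSE;
END$$

DELIMITER ;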
With the function written, and thoroughly tested, it could be used in a SQL statement. To return a column that indicates that the row "has at least one"...
SELECT t.`coupon id`
     , t.required_product_ids
     , t.used_product_in_this_year
     , upity_halo_rpi(t.used_product_in_this_year, t.required_product_ids) AS halo
  FROM result t
To return:
coupon id required_product_ids used_product_in_this_year halo
--------- -------------------- ------------------------ -----
1 1,2,3,10 2,3,4,5,6,7,8,9,10,12,13 1
I'm not going to write and test a production-ready function here; I'm just demonstrating a possible approach, one possible answer to "how" this type of comparison operation could be performed within a SQL statement.
This is how you can do it, without changing your database structure.
In MYSQL (Tested):
select * from TableName
where concat(',', used_product_in_this_year, ',') regexp concat(',',replace(required_product_ids,',',',|,'),',')
This uses a regex alternation built from your table's data with some MySQL string functions.
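To see why this works, here is what the pattern expands to for the sample row (a standalone snippet, runnable without the table; the values are copied from the question):

select concat(',', '2,3,4,5,6,7,8,9,10,12,13', ',')
       regexp concat(',', replace('1,2,3,10', ',', ',|,'), ',') as halo;
-- the pattern becomes ',1,|,2,|,3,|,10,', i.e. the alternation
-- ",1," | ",2," | ",3," | ",10,", which matches ",2," in the used list, so halo = 1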
I don't recommend your database structure, but I like puzzles and this one was fun, thanks for the challenge.

How do you query table for containing a value

I want to know if a table contains at least one entry that meets specific conditions. I don't want to scan all entries but to stop at the first match. Is there a generic way to do this in SQL?
I think a quick search would have given you the answer much more quickly, but anyway, here is what I use:
IF EXISTS (SELECT NULL FROM Table WHERE Field = @value)
BEGIN
PRINT 'Exists!'
END
ELSE
BEGIN
PRINT 'Does not exist!'
END
Bear in mind that when using EXISTS, it doesn't matter what fields you select, whether they are from the table, constants or even NULL values as in this case.
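The example above is T-SQL (SQL Server). In engines without a procedural IF at the top level, such as PostgreSQL, the same idea can be expressed as a scalar query (table, column and value here are placeholders):

SELECT EXISTS (SELECT NULL FROM my_table WHERE my_field = 'some value') AS found;

Either way the database can stop scanning at the first matching row.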

How to get unique values from each column based on a condition?

I have been trying to find an optimal solution to select unique values from each column. My problem is I don't know the column names in advance, since different tables have different numbers of columns. So first I have to find the column names, and I could use the query below to do it:
select column_name from information_schema.columns
where table_name='m0301010000_ds' and column_name like 'c%'
Sample output for column names:
c1, c2a, c2b, c2c, c2d, c2e, c2f, c2g, c2h, c2i, c2j, c2k, ...
Then I would use the returned column names to get the unique/distinct values in each column, and not just distinct rows.
I know the simplest and lousiest way is to write select distinct column_name from table where column_name = 'something' for every single column (around 20-50 times), and it's very time consuming too. Since I can't get distinct values for more than one column in a single query like that, I am stuck with this old school solution.
I am sure there is a faster and more elegant way to achieve this; I just couldn't figure out how. I will really appreciate any help on this.
You can't just return rows, since distinct values don't go together any more.
You could return arrays, which can be had simpler than you may have expected:
SELECT array_agg(DISTINCT c1)  AS c1_arr
      ,array_agg(DISTINCT c2a) AS c2a_arr
      ,array_agg(DISTINCT c2b) AS c2b_arr
      , ...
FROM   m0301010000_ds;
This returns distinct values per column. One array (possibly big) for each column. All connections between values in columns (what used to be in the same row) are lost in the output.
Build SQL automatically
CREATE OR REPLACE FUNCTION f_build_sql_for_dist_vals(_tbl regclass)
  RETURNS text AS
$func$
SELECT 'SELECT ' || string_agg(format('array_agg(DISTINCT %1$I) AS %1$I_arr'
                                    , attname)
                             , E'\n      ,' ORDER BY attnum)
    || E'\nFROM   ' || _tbl
FROM   pg_attribute
WHERE  attrelid = _tbl      -- valid, visible table name
AND    attnum >= 1          -- exclude tableoid & friends
AND    NOT attisdropped     -- exclude dropped columns
$func$ LANGUAGE sql;
Call:
SELECT f_build_sql_for_dist_vals('public.m0301010000_ds');
Returns an SQL string as displayed above.
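If you happen to be calling this from psql (an assumption; any client works), the generated statement can be fed straight back to the server with \gexec, available in psql 9.6 or later, which executes each result string as a query:

SELECT f_build_sql_for_dist_vals('public.m0301010000_ds')\gexec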
I use the system catalog pg_attribute instead of the information schema. And the object identifier type regclass for the table name. More explanation in this related answer:
PLpgSQL function to find columns with only NULL values in a given table
If you need this in "real time", you won't be able to achieve it with an SQL statement that needs a full table scan.
I would advise you to create a separate table containing the distinct values for each column (initialized with the SQL from #Erwin Brandstetter ;) and maintain it using a trigger on the original table.
Your new table will have one column per field. The number of rows will be equal to the maximum number of distinct values for any one field.
On insert: for each field to maintain, check if the value is already there or not. If not, add it.
On update: for each field to maintain whose old value != new value, check if the new value is already there or not. If not, add it. Regarding the old value, check if any other row has that value, and if not, remove it from the list (set the field to null).
On delete: for each field to maintain, check if any other row has that value, and if not, remove it from the list (set the value to null).
This way the load is mostly moved to the trigger, and the SQL on the value-list table will be super fast.
P.S.: Make sure to pass all the SQL from the trigger through EXPLAIN to make sure it uses the best index and execution plan possible. For update/deletion, just check if the old value exists (LIMIT 1). A minimal sketch of such a trigger follows.
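As an assumption on my part, this sketch uses a simpler companion-table layout than the one described above (one row per column/value pair rather than one column per field), because that is easier to maintain from a trigger, and it only covers the insert case for a single column c1:

-- companion table: one row per (column name, distinct value)
CREATE TABLE distinct_vals (
   col_name text NOT NULL,
   val      text NOT NULL,
   PRIMARY KEY (col_name, val)
);

-- insert-only trigger for one column; updates and deletes would add
-- the existence checks described above
CREATE OR REPLACE FUNCTION trg_track_distinct()
  RETURNS trigger AS
$$
BEGIN
   INSERT INTO distinct_vals (col_name, val)
   VALUES ('c1', NEW.c1)
   ON CONFLICT DO NOTHING;   -- value already tracked (needs Postgres 9.5+)
   RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER track_distinct
AFTER INSERT ON m0301010000_ds
FOR EACH ROW EXECUTE PROCEDURE trg_track_distinct();

-- reading the distinct values for c1 is then a simple indexed lookup:
-- SELECT val FROM distinct_vals WHERE col_name = 'c1';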