How to query an array of values within a JSONB field dictionary?

I have a jsonb column that contains a dictionary which has a key that points to an array of string values. I need to query against that array.
The table (called "things") looks like this:
------------------
| my_column      |
|----------------|
| { "a": ["X"] } |
------------------
I need to write two queries:
Does the array contain value "X"?
Does the array not contain value "X"?
my_column has a non-null constraint, but it can contain an empty dictionary. The dictionary can also contain other key/value pairs.
The first query was easy:
SELECT * FROM things
WHERE my_column -> 'a' ? 'X';
The second one is proving to be more challenging. I started there with:
SELECT * FROM things
WHERE NOT my_column -> 'a' ? 'X';
... but that excluded all the records that had dictionaries that didn't include key 'a'. So I modified it like so:
SELECT * FROM things
WHERE my_column -> 'a' IS NULL
   OR NOT my_column -> 'a' ? 'X';
This works, but is there a better way? Also, is it possible to index this query, and if so, how?

I'm not sure if there's any better way -- that honestly looks pretty straightforward to me.
As for indexing, there are a couple things you can do.
First, you can index the jsonb field. Putting a GIN index on that field should help with any use of "exists"-type operators (like ?).
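For reference, a whole-column GIN index would look like this (the index name is illustrative):
CREATE INDEX things_my_column_gin_idx ON things USING GIN (my_column);
Note that such an index supports operators applied to my_column itself (e.g. my_column ? 'a'); a predicate on the extracted value, like my_column -> 'a' ? 'X', calls for an expression index instead (see below).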
If that isn't the solution you want for whatever reason, Postgres supports functional and partial indexes. A functional index might look like:
CREATE INDEX ON things ((my_column -> 'a'));
(note: Postgres requires the extra set of parentheses around the expression in an index definition; without them, the statement is rejected with a syntax error.)
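If the goal is specifically to support the ? test on the extracted array, a GIN variant of the same expression index should work (a sketch, not from the original answer):
CREATE INDEX ON things USING GIN ((my_column -> 'a'));
This indexes the extracted jsonb value with the default jsonb_ops operator class, which supports the ? operator used in the first query.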
A partial index would get even more specific, and could even look like:
CREATE INDEX ON things (my_column)
WHERE my_column -> 'a' IS NULL
   OR NOT my_column -> 'a' ? 'X';
Obviously, that won't help for more general queries.
At a guess, indexing the whole column with a GIN index is the right way to go, or at least the right place to start.

Related

Index on part of a column in SQLite

Is doing something like the following possible in SQLite:
CREATE INDEX idx ON mytable (synopsis(20));
In other words, indexing by something less than the full text field? This is useful on long-text fields where I don't want to index everything into memory (the index itself could take up more space than the entire table).
You seem to be looking for an index on an expression:
Use a CREATE INDEX statement to create a new index on one or more expressions just like you would to create an index on columns. The only difference is that expressions are listed as the elements to be indexed rather than column names.
Consider:
CREATE INDEX idx ON mytable(SUBSTR(synopsis, 1, 20));
Please note that, as explained in the documentation, for this index to be considered by the SQLite query planner, you need to use the exact same expression that was given when creating the index.
So this query would use the index:
SELECT * FROM mytable WHERE SUBSTR(synopsis, 1, 20) = 'a text with 20 chars';
While, typically, this would not:
SELECT * FROM mytable WHERE synopsis LIKE 'a text with 20 chars%';
Note: yes, 'a text with 20 chars' is 20 chars long...

Query JSONB column for any value where =?

I have a jsonb column which has the unfortunate case of being very unpredictable; in some cases its value may be an array with nested values:
["UserMailer", "applicant_setup_3", ["5cbffeb7-8d5e-4b52-a475-3cf320b2cee9"]]
Sometimes it will be something with key/values like this:
[{"reference_id": "5cbffeb7-8d5e-4b52-a475-3cf320b2cee9", "job_dictionary": ["StatusUpdater", "FollowTwitterUsersJob"]}]
Is there a way to write a query which just treats the whole column like text and does a LIKE to see if I can find the UUID in the big text blob? I want to find all the records where a particular UUID string is present in the jsonb column.
The query doesn't need to be fast or efficient.
Postgres has the search operator ? for jsonb, but that would require you to search the JSON content recursively.
A possible, although not very efficient, method would be to stringify the object and use LIKE to search it:
myjsonb::text LIKE '%"5cbffeb7-8d5e-4b52-a475-3cf320b2cee9"%'
myjsonb::text LIKE '%"' || myuuid || '"%'
Demo on DB Fiddle:
The problem with the jsonb operator ? is that it only considers top-level keys (including array elements), not values, and nothing inside nested objects.
You seem to be looking for values and array elements (not keys) on any level. You can get that with a full text search on top of your json(b) column:
SELECT * FROM tbl
WHERE to_tsvector('simple', jsonb_column)
   @@ tsquery '5cbffeb7-8d5e-4b52-a475-3cf320b2cee9';
db<>fiddle here
to_tsvector() extracts values and array elements on all levels - just what you need.
Requires Postgres 10 or later. json(b)_to_tsvector() in Postgres 11 offers more flexibility.
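A sketch of the Postgres 11 variant, restricting the search to string values only (assuming the same table and column as above):
SELECT *
FROM tbl
WHERE jsonb_to_tsvector('simple', jsonb_column, '["string"]')
   @@ tsquery '5cbffeb7-8d5e-4b52-a475-3cf320b2cee9';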
That's attractive for tables of non-trivial size as it can be supported with a full text index very efficiently:
CREATE INDEX tbl_jsonb_column_fts_gin_idx ON tbl USING GIN (to_tsvector('simple', jsonb_column));
I use the 'simple' text search configuration in the example. You might want a language-specific one, like 'english'. Doesn't matter much while you only look for UUID strings, but stemming for a particular language might make the index a bit smaller ...
Related:
LIKE query on elements of flat jsonb array
Does the phrase search operator <-> work with JSONB documents or only relational tables?
Since you are only looking for UUIDs, you might optimize further with a custom (IMMUTABLE) function to extract UUIDs from the JSON document as an array (uuid[]) and build a functional GIN index on top of it. (Yielding a considerably smaller index.) Then:
SELECT * FROM tbl
WHERE my_uuid_extractor(jsonb_column) @> '{5cbffeb7-8d5e-4b52-a475-3cf320b2cee9}';
Such a function can be expensive, but does not matter much with a functional index that stores and operates on pre-computed values.
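For illustration, such an extractor might look like this; the function name, regex, and index name are assumptions for this sketch, not from the original answer:
CREATE OR REPLACE FUNCTION my_uuid_extractor(jsonb)
  RETURNS uuid[] LANGUAGE sql IMMUTABLE AS
$$
SELECT ARRAY (
   SELECT m[1]::uuid  -- with no capture groups, the whole match is returned as a single-element array
   FROM   regexp_matches($1::text
        , '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}', 'g') m
   );
$$;

CREATE INDEX tbl_uuid_arr_idx ON tbl USING GIN (my_uuid_extractor(jsonb_column));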
You can split out the array elements first using jsonb_array_elements(js), and then filter the elements, cast to text, with the LIKE operator:
select q.elm
from (
    select jsonb_array_elements(js) as elm
    from tab
) q
where elm::varchar like '%User%'
elm
----------------------------------------------------------------------------------------------------------------------
"UserMailer"
{"reference_id": "5cbffeb7-8d5e-4b52-a475-3cf320b2cee9", "job_dictionary": ["StatusUpdater", "FollowTwitterUsersJob"]}
Demo

Conditionally replace single value per row in jsonb column

I need a more efficient way to update rows of a single table in Postgres 9.5.
I am currently doing it with pg_dump and re-importing with updated values after search-and-replace operations in a Linux OS environment.
table_a has 300000 rows with 2 columns: id bigint and json_col jsonb.
json_col has about 30 keys: "C1" to "C30" like in this example:
Table_A
id,json_col
1 {"C1":"Paris","C2":"London","C3":"Berlin","C4":"Tokyo", ... "C30":"Dallas"}
2 {"C1":"Dublin","C2":"Berlin","C3":"Kiev","C4":"Tokyo", ... "C30":"Phoenix"}
3 {"C1":"Paris","C2":"London","C3":"Berlin","C4":"Ankara", ... "C30":"Madrid"}
...
The requirement is to mass-search all keys from C1 to C30, look in them for the value "Berlin", and replace it with "Madrid", but only if "Madrid" is not already present. I.e. id:1 matches with key C3, and id:2 with C2; id:3 will be skipped because C30 already holds this value.
It has to be a single SQL command in PostgreSQL 9.5, run one time, considering all keys of the jsonb column.
The fastest and simplest way is to modify the column as text:
update table_a
set json_col = replace(json_col::text, '"Berlin"', '"Madrid"')::jsonb
where json_col::text like '%"Berlin"%'
and json_col::text not like '%"Madrid"%'
It's a practical choice. The above query is a find-and-replace operation (like in a text editor) rather than a modification of object attributes. The second option is more complicated and surely much more expensive. Even using the fast JavaScript engine (example below), a more formal solution would be many times slower.
You can try Postgres Javascript:
create extension if not exists plv8;

create or replace function replace_item(data jsonb, from_str text, to_str text)
returns jsonb language plv8 as $$
    // check whether the target value already exists under any key
    var found = 0;
    Object.keys(data).forEach(function(key) {
        if (data[key] == to_str) {
            found = 1;
        }
    });
    // replace only if the target value is not present yet
    if (found == 0) {
        Object.keys(data).forEach(function(key) {
            if (data[key] == from_str) {
                data[key] = to_str;
            }
        });
    }
    return data;
$$;
update table_a
set json_col = replace_item(json_col, 'Berlin', 'Madrid');
What makes this hard is that you are looking for unknown keys holding values of interest. Postgres infrastructure is optimized to find keys (or array values).
This is possibly caused by a sub-optimal table design: the many top-level objects of your jsonb column might be replaced by an array, discarding irrelevant key names altogether (or maybe another array for the key names). Or, ideally, with a fully normalized DB schema to begin with.
Be that as it may, here is a proof of concept showing how this can be fast and clean with stock Postgres 9.5 or later anyway.
Additional difficulty 1: it's unknown whether duplicate values are possible.
Additional difficulty 2: value frequencies are unknown, too.
Additional difficulty 3: only the first value found is to be replaced and only if the target value is not there yet. Implementing this with set-based operations is possible, but unwieldy. I wrote a plpgsql function instead:
CREATE OR REPLACE FUNCTION jsonb_replace_value(_j jsonb, _old jsonb, _new jsonb)
  RETURNS jsonb AS
$func$
DECLARE
   _key text;
   _val jsonb;
BEGIN
   FOR _key, _val IN
      SELECT * FROM jsonb_each(_j)
   LOOP
      IF _val = _old THEN
         RETURN jsonb_set(_j, ARRAY[_key], _new);  -- update 1st key
      END IF;
   END LOOP;
   RETURN _j;  -- nothing found, return original
END
$func$ LANGUAGE plpgsql IMMUTABLE;
COMMENT ON FUNCTION jsonb_replace_value(jsonb, jsonb, jsonb) IS '
Replace the first occurrence of _old value with _new.
Call:
SELECT jsonb_replace_value(''{"C1":"Paris","C3":"Berlin","C4":"Berlin"}'', ''"Berlin"'', ''"Madrid"'')';
Could be enhanced to optionally replace all occurrences etc. but that's beyond the scope of this question.
Now this would be simple:
UPDATE table_a
SET json_col = jsonb_replace_value(json_col, '"Berlin"', '"Madrid"'); -- note jsonb literal syntax!
If all rows need an update, we can stop here. It won't get faster. (Except possibly with alternatives like the one demonstrated by @klin.)
If a large percentage of all rows need an update, add a WHERE condition to avoid empty updates:
...
WHERE json_col <> jsonb_replace_value(json_col, '"Berlin"', '"Madrid"');
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
Typically, only very few rows actually need an update. Then iterating through all rows with the above query is expensive. We need index support to make it fast, which is not easy for this case. I suggest an expression index based on an IMMUTABLE function extracting the array of values:
CREATE OR REPLACE FUNCTION jsonb_object_val_arr(jsonb)
RETURNS text[] LANGUAGE sql IMMUTABLE AS
'SELECT ARRAY (SELECT value FROM jsonb_each_text($1))';
COMMENT ON FUNCTION jsonb_object_val_arr(jsonb) IS '
Generates text array of values in outermost jsonb object.
Of limited use if there can be nested objects.';
CREATE INDEX table_a_val_arr_idx ON table_a USING gin (jsonb_object_val_arr(json_col));
Related, with more explanation:
Find rows containing a key in a JSONB array of records
Query making use of this index:
UPDATE table_a a
SET    json_col = jsonb_replace_value(a.json_col, '"Berlin"', '"Madrid"')
WHERE  jsonb_object_val_arr(a.json_col) @> '{Berlin}'  -- has Berlin, possibly > 1x ..
-- AND NOT jsonb_object_val_arr(a.json_col) @> '{Madrid}'
AND    NOT EXISTS (   -- .. but not Madrid
   SELECT FROM table_a b
   WHERE  jsonb_object_val_arr(b.json_col) @> '{Madrid}'  -- note array literal syntax
   AND    b.id = a.id
   );
The NOT EXISTS semi-anti-join is carefully drafted to utilize the index a 2nd time.
The simpler alternative (commented out above) is faster if there are only few rows with 'Berlin' and 'Madrid'; then a filter step in the query plan will be cheaper.
Should be very fast.
db<>fiddle here for Postgres 9.5 demonstrating all.
OK, I have tested all the methods and I can say you did a great job. This helped me a lot. Let me share my feedback with you.
Method 1, suggested by klin, works perfectly and is totally fine, except when a key is named like the value; then both key and value are replaced,
i.e. "Berlin":"Berlin" becomes "Madrid":"Madrid".
Method 2 with the plv8 extension did not work because I was missing the control file; I would have had to install it, so I just skipped this method and have no feedback on it.
The error I was getting was this:
ERROR: could not open extension control file
"/usr/pgsql-9.5/share/extension/plv8.control": No such file or directory
Method 3, similar to method 2 but with the jsonb_replace_value function, works perfectly; it replaces rows that contain the specific value, regardless of the key. Adding the condition
WHERE json_col <> jsonb_replace_value(json_col, '"Berlin"', '"Madrid"')
avoids empty updates and skips rows that do not need to be updated.
Something like {"Berlin":"Berlin"} becomes {"Berlin":"Madrid"}, i.e. the key is not touched, just the value.
Method 4 is a little more complicated; it uses method 3 plus indexes. It works totally awesome and is super speedy. The NOT EXISTS semi-anti-join indeed forced the index to be used again. I was shocked how fast it performed!
However, I discovered that all these methods only work if the JSON string looks like this:
{"key":"value"}
If, for example, I have to update a value that is itself a JSON object, it will not update something like this:
{"C30":{"id":10044,"value":"Berlin","created_by":"John Doe"}}
MANY THANKS to you guys, @klin and @erwin-brandstetter. This helped me to learn something new!

check if a jsonb field contains an array

I have a jsonb field in a PostgreSQL table which was supposed to contain dictionary-like data (i.e. {}), but a few of its entries got an array instead due to source data issues.
I want to weed out those entries. One way is to perform the following query:
select json_field from data_table where cast(json_field as text) like '[%]'
But this requires converting each jsonb field into text. With data_table having on the order of 200 million entries, that looks like overkill.
I investigated pg_typeof, but it returns jsonb, which doesn't help differentiate between a dictionary and an array.
Is there a more efficient way to achieve the above?
How about using the jsonb_typeof function? (json_typeof is the variant for plain json; a jsonb column calls for jsonb_typeof.)
select json_field from data_table where jsonb_typeof(json_field) = 'array'
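Conversely, to keep only the well-formed dictionary entries, a trivial variation on the same function works:
select json_field from data_table where jsonb_typeof(json_field) = 'object'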

How to filter a value of any key of json in postgres

I have a table users with a jsonb field called data. I have to retrieve all the users that have a value in that data column matching a given string. For example:
user1 = data: {"property_a": "a1", "property_b": "b1"}
user2 = data: {"property_a": "a2", "property_b": "b2"}
I want to retrieve any user that has a value in data matching 'b2'; in this case that will be 'user2'.
Any idea how to do this in an elegant way? I can retrieve all keys from data of all users and create a query manually but that will be neither fast nor elegant.
In addition, I have to retrieve the key and value matched, but first things first.
There is no easy way. Per documentation:
GIN indexes can be used to efficiently search for keys or key/value
pairs occurring within a large number of jsonb documents (datums)
Emphasis mine. There is no index over all values. (Those can have incompatible data types!) If you do not know the name(s) of all key(s), you have to inspect all JSON values in every row.
If there are just two keys like you demonstrate (or just a few well-known keys), it's still easy enough:
SELECT *
FROM users
WHERE data->>'property_a' = 'b2' OR
data->>'property_b' = 'b2';
Can be supported with a simple expression index:
CREATE INDEX foo_idx ON users ((data->>'property_a'), (data->>'property_b'));
Or with a GIN index:
SELECT *
FROM users
WHERE data @> '{"property_a": "b2"}' OR
      data @> '{"property_b": "b2"}';
CREATE INDEX bar_idx ON users USING gin (data jsonb_path_ops);
If you don't know all key names, things get more complicated ...
You could use jsonb_each() or jsonb_each_text() to unnest all values into a set and then check with an ANY construct:
SELECT *
FROM users
WHERE jsonb '"b2"' = ANY (SELECT (jsonb_each(data)).value);
Or
...
WHERE 'b2' = ANY (SELECT (jsonb_each_text(data)).value);
db<>fiddle here
But there is no index support for the last one. You could instead extract all values into an array, create an expression index on that, and match that expression in queries with array operators ...
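A hedged sketch of that idea (function and index names are illustrative):
CREATE OR REPLACE FUNCTION jsonb_val_arr(jsonb)
  RETURNS text[] LANGUAGE sql IMMUTABLE AS
'SELECT ARRAY (SELECT value FROM jsonb_each_text($1))';

CREATE INDEX users_data_vals_gin_idx ON users USING gin (jsonb_val_arr(data));

SELECT *
FROM users
WHERE jsonb_val_arr(data) @> '{b2}';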
Related:
How do I query using fields inside the new PostgreSQL JSON datatype?
Index for finding an element in a JSON array
Can PostgreSQL index array columns?
Try this query.
SELECT * FROM users
WHERE data::text LIKE '%b2%'
Of course, it won't work if a key contains such a string too.