Postgres JSONB select json objects as columns - sql

The table has id and step_data columns.
id | step_data
--------------
a1 | {...}
a2 | {...}
a3 | {...}
step_data is a nested structure as follows, where the actual metadata keys can appear in any of the events objects.
{
  "events": [
    {
      "timestamp": "2021-04-07T17:46:13.739Z",
      "meta": [
        {
          "key": "length",
          "value": "0.898"
        },
        {
          "key": "height",
          "value": "607023104"
        },
        {
          "key": "weight",
          "value": "33509376"
        }
      ]
    },
    {
      "timestamp": "2021-04-07T17:46:13.781Z",
      "meta": [
        {
          "key": "color",
          "value": "0.007"
        },
        {
          "key": "count",
          "value": "641511424"
        },
        {
          "key": "age",
          "value": "0"
        }
      ]
    }
  ]
}
I can extract one field like length pretty easily.
select cast(metadata ->> 'value' as double precision) as length,
       id
from (
    select jsonb_array_elements(jsonb_array_elements(step_data #> '{events}') #> '{meta}') metadata,
           id
    from table
) as parsed_keys
where metadata @> '{"key": "length"}'::jsonb
id | length
---+-------
a1 | 0.898
a2 | 0.800
But what I really need is to extract the metadata as columns from a couple of known keys, like length and color. Not sure how to get another column efficiently once I split the array with jsonb_array_elements().
Is there an efficient way to do this without having to call jsonb_array_elements() again and do a join for every single key? For example, so that the result set looks like this:
id | length | color | weight
---+--------+-------+---------
a1 | 0.898  | 0.007 | 33509376
a2 | 0.800  | 1.000 | 15812391
Using Postgres 11.7.
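For reference, a minimal setup reproducing this structure (table and column names are assumptions; the answers below refer to the table as tbl or the_table):
-- Hypothetical setup; adjust names to your schema
CREATE TABLE tbl (
    id        text PRIMARY KEY,
    step_data jsonb
);

INSERT INTO tbl (id, step_data) VALUES
('a1', '{"events": [
           {"timestamp": "2021-04-07T17:46:13.739Z",
            "meta": [{"key": "length", "value": "0.898"},
                     {"key": "height", "value": "607023104"},
                     {"key": "weight", "value": "33509376"}]},
           {"timestamp": "2021-04-07T17:46:13.781Z",
            "meta": [{"key": "color", "value": "0.007"},
                     {"key": "count", "value": "641511424"},
                     {"key": "age",   "value": "0"}]}
         ]}');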

With Postgres 11, I can only think of unnesting both levels, then aggregating back into a key/value pair from which you can extract the desired keys:
select t.id,
       (x.att ->> 'length')::numeric as length,
       (x.att ->> 'color')::numeric  as color,
       (x.att ->> 'weight')::numeric as weight
from the_table t
  cross join lateral (
    select jsonb_object_agg(m.item ->> 'key', m.item -> 'value') as att
    from jsonb_array_elements(t.step_data -> 'events') as e(event)
      cross join jsonb_array_elements(e.event -> 'meta') as m(item)
    where m.item ->> 'key' in ('color', 'length', 'weight')
  ) x
;
With Postgres 12 you could write it a bit simpler:
select t.id,
       jsonb_path_query_first(t.step_data, '$.events[*].meta[*] ? (@.key == "length").value') #>> '{}' as length,
       jsonb_path_query_first(t.step_data, '$.events[*].meta[*] ? (@.key == "color").value') #>> '{}' as color,
       jsonb_path_query_first(t.step_data, '$.events[*].meta[*] ? (@.key == "weight").value') #>> '{}' as weight
from the_table t
;
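The values extracted this way come back as text; if double precision is needed, as in the question, each expression can be cast. A minimal sketch for one column:
select t.id,
       (jsonb_path_query_first(t.step_data, '$.events[*].meta[*] ? (@.key == "length").value') #>> '{}')::double precision as length
from the_table t;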

crosstab()
For any Postgres version.
You can feed the result into the crosstab() function to pivot it. You need the additional module tablefunc installed. If you are unfamiliar with it, read basic instructions here first:
PostgreSQL Crosstab Query
SELECT *
FROM crosstab(
   $$
   SELECT id, metadata->>'key', metadata->>'value'
   FROM (SELECT id, jsonb_array_elements(jsonb_array_elements(step_data -> 'events') -> 'meta') metadata FROM tbl) AS parsed_keys
   ORDER BY 1
   $$
 , $$VALUES ('length'), ('color'), ('weight')$$
   ) AS ct (id text, length float, color float, weight float);
db<>fiddle here (demonstrating all)
Should deliver best performance, especially for many columns.
Note how we need no explicit cast to double precision (float). crosstab() processes text input anyway, the result is coerced to the types given in the column definition list.
If one of the keys should appear multiple times, the last row wins. (No error is raised, like you would seem to prefer.) You can add a deterministic sort order to the query in $1 to sort the preferred row last. Example: to get the lowest value per key:
ORDER BY 1, 2, 3 DESC
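Applied to the crosstab() call above, that sort order goes into the source query in $1; a sketch (only the ORDER BY changes):
SELECT *
FROM crosstab(
   $$
   SELECT id, metadata->>'key', metadata->>'value'
   FROM (SELECT id, jsonb_array_elements(jsonb_array_elements(step_data -> 'events') -> 'meta') metadata FROM tbl) AS parsed_keys
   ORDER BY 1, 2, 3 DESC  -- preferred row sorts last and wins
   $$
 , $$VALUES ('length'), ('color'), ('weight')$$
   ) AS ct (id text, length float, color float, weight float);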
Conditional aggregates with FILTER clause
For Postgres 9.4 or newer.
See:
Aggregate columns with additional (distinct) filters
SELECT id
     , min(val) FILTER (WHERE key = 'length') AS length
     , min(val) FILTER (WHERE key = 'color')  AS color
     , min(val) FILTER (WHERE key = 'weight') AS weight
FROM (
   SELECT id, metadata->>'key' AS key, (metadata->>'value')::float AS val
   FROM (SELECT id, jsonb_array_elements(jsonb_array_elements(step_data -> 'events') -> 'meta') metadata FROM tbl) AS parsed_keys
   ) sub
GROUP BY id;
In Postgres 12 or later I would compare performance with jsonb_path_query_first(). a_horse provided a solution.

This query is an option:
SELECT id,
       MAX(CASE WHEN metadata->>'key' = 'length' THEN metadata->>'value' END) AS length,
       MAX(CASE WHEN metadata->>'key' = 'color'  THEN metadata->>'value' END) AS color,
       MAX(CASE WHEN metadata->>'key' = 'weight' THEN metadata->>'value' END) AS weight
FROM (SELECT id, jsonb_array_elements(jsonb_array_elements(step_data #> '{events}') #> '{meta}') as metadata
      FROM table t) AS aux
GROUP BY id;

Related

BigQuery unnest to new columns instead of new rows

I have a BigQuery table with nested and repeated columns that I'm having trouble unnesting to columns instead of rows.
This is what the table looks like:
{
  "dateTime": "Tue Mar 01 2022 11:11:11 GMT-0800 (Pacific Standard Time)",
  "Field1": "123456",
  "Field2": true,
  "Matches": [
    {
      "ID": "abc123",
      "FieldA": "asdfljadsf",
      "FieldB": "0.99"
    },
    {
      "ID": "def456",
      "FieldA": "sdgfhgdkghj",
      "FieldB": "0.92"
    },
    {
      "ID": "ghi789",
      "FieldA": "tfgjyrtjy",
      "FieldB": "0.64"
    }
  ]
},
What I want is to return a table with each one of the nested fields as an individual column so I have a clean dataframe to work with:
{
  "dateTime": "Tue Mar 01 2022 11:11:11 GMT-0800 (Pacific Standard Time)",
  "Field1": "123456",
  "Field2": true,
  "Matches_1_ID": "abc123",
  "Matches_1_FieldA": "asdfljadsf",
  "Matches_1_FieldB": "0.99",
  "Matches_2_ID": "def456",
  "Matches_2_FieldA": "sdgfhgdkghj",
  "Matches_2_FieldB": "0.92",
  "Matches_3_ID": "ghi789",
  "Matches_3_FieldA": "tfgjyrtjy",
  "Matches_3_FieldB": "0.64"
},
I tried using UNNEST as below; however, it creates new rows with only one set of additional columns, so it's not what I'm looking for.
SELECT *
FROM table,
UNNEST(Matches) as items
Any solution for this? Thank you in advance.
Consider below approach
select * from (
select t.* except(Matches), match.*, offset + 1 as offset
from your_table t, t.Matches match with offset
)
pivot (min(ID) as id, min(FieldA) as FieldA, min(FieldB) as FieldB for offset in (1,2,3))
If applied to the sample data in your question, the output matches the desired result.
If there are no matches, that entire row is not included in the output table, whereas I want to keep that record with the fields it does have. How can I modify this to keep all records?
Use a left join instead, as below:
select * from (
select t.* except(Matches), match.*, offset + 1 as offset
from your_table t left join t.Matches match with offset
)
pivot (min(ID) as id, min(FieldA) as FieldA, min(FieldB) as FieldB for offset in (1,2,3))

How to group results by values that are inside a JSON array in PostgreSQL

I have a column of type JSONB that has data like this:
column name: used_filters
row number 1 example:
{ "categories" : ["economic", "Social"], "tags": ["world" ,"eco-friendly"] }
row number 2 example:
{ "categories" : ["economic"], "tags": ["eco-friendly"] , "keywords" : ["2050"] }
I want to group the result to get the most frequent value for each one of the keys, something like this:
key      | most_freq
---------+-------------
category | economic
tags     | eco-friendly
keyword  | 2050
The keys are not constant and could be something other than in the example, but I know that they will be frequent.
You can extract keys and values first by using jsonb_each, and then unnest the generated arrays with jsonb_array_elements_text. The rest is classical aggregation, along with ranking by count through a window function, such as:
SELECT key, value
FROM ( SELECT j.key, jj.value,
              RANK() OVER (PARTITION BY j.key ORDER BY COUNT(*) DESC)
       FROM t,
            LATERAL jsonb_each(js) AS j,
            LATERAL jsonb_array_elements_text(j.value) AS jj
       GROUP BY j.key, jj.value ) AS q
WHERE rank = 1
Demo
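For a self-contained test, a sketch of the setup the query above assumes (the answer uses table t with a jsonb column js; in the question the column is called used_filters):
-- Hypothetical table/column names taken from the answer above
CREATE TABLE t (js jsonb);

INSERT INTO t (js) VALUES
('{ "categories" : ["economic", "Social"], "tags": ["world", "eco-friendly"] }'),
('{ "categories" : ["economic"], "tags": ["eco-friendly"], "keywords": ["2050"] }');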

Percentage matching two JSONB columns

I am trying to compare two JSONB columns in a table. At the moment it is done in the app; however, this doesn't allow proper searching, filtering and ordering without loading the whole data set. It would be better if we could do this comparison in the DB.
The following is an example of the data and calculation.
employer = {
  "autism": "1",
  "social": "1",
  "dementia": "0",
  "domestic": "1",
}
employer_keys = ["autism", "social", "domestic"]

candidate = {
  "autism": "0",
  "social": "1",
  "dementia": "0",
  "domestic": "1",
}
candidate_keys = ["social", "domestic"]

remainder_keys = employer_keys - candidate_keys = ["autism"]
1 - (remainder_keys.length / employer_keys.length) = 1 - (1/3) = 2/3 = 66%
This process is all rather trivial in Ruby: jsonb -> array -> select -> calculation.
However, I would like to perform this in SQL or a function at the DB level, something like
function compare_json(employer, candidate) returning a decimal.
More specifically
Select candidates.id,
st_distance_sphere(st_makepoint(employer.long, employer.lat), st_makepoint(candidates.long, candidates.lat)) /
1000 / 8 * 5 as distance
from (select * from users where id = 8117) employer,
(select * from users where role_id = 5) candidates
where st_distance_sphere(st_makepoint(employer.long, employer.lat), st_makepoint(candidates.long, candidates.lat)) /
1000 / 8 * 5 < 25
order by distance
The above SQL calculates the distance between a single employer and multiple candidates; the inline queries provide employer.skills (1 row) and candidates.skills (n rows).
So the output should be:
Candidate id, Distance, SkillsMatch(employer.skills, candidates.skills)
As before the edit, any guidance will be welcome.
Here is a pure SQL approach: it works by turning the employer object into a record set, and then performing conditional aggregation:
select 1 - avg( ((d.candidate ->> e.k)::int is distinct from 1)::int ) res
from (values(
'{ "autism": "1", "social": "1", "dementia": "0", "domestic": "1" }'::jsonb,
'{ "autism": "0", "social": "1", "dementia": "0", "domestic": "1" }'::jsonb
)) d(employer, candidate)
cross join lateral jsonb_each_text(d.employer) e(k, v)
where e.v::int = 1
You can easily turn this into a function by replacing the literal objects in the values() row constructor with parameters.
Demo on DB Fiddle:
| res |
| ---------------------: |
| 0.66666666666666666667 |
Okay, this is what I got to:
CREATE OR REPLACE FUNCTION JSON_COMPARE(employer_json jsonb, candidate_json jsonb, OUT _result numeric)
AS
$$
BEGIN
select 1 - avg(((d.candidate ->> e.k)::int is distinct from 1)::int)
into _result
from (values (employer_json, candidate_json)) d(employer, candidate)
cross join lateral jsonb_each_text(d.employer) e(k, v)
where e.v::int = 1;
RETURN;
END;
$$
LANGUAGE PLPGSQL;
This is a small variation on GMB's super fast answer. With a few indexes, and by correctly limiting the size of the candidate list, we get reasonable performance.
I'm new to Stack so my upvote for GMB doesn't show, but thanks again.
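For reference, a sketch of how the function could be combined with the distance query from the question (the skills column name is an assumption, since the schema isn't shown):
-- Sketch: distance filter plus skills match (skills jsonb column assumed)
select candidates.id,
       st_distance_sphere(st_makepoint(employer.long, employer.lat),
                          st_makepoint(candidates.long, candidates.lat)) / 1000 / 8 * 5 as distance,
       json_compare(employer.skills, candidates.skills)                                 as skills_match
from (select * from users where id = 8117) employer,
     (select * from users where role_id = 5) candidates
where st_distance_sphere(st_makepoint(employer.long, employer.lat),
                         st_makepoint(candidates.long, candidates.lat)) / 1000 / 8 * 5 < 25
order by distance;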

Extracting data from an array of JSON objects for specific object values

In my table, there is a column of JSON type which contains an array of objects describing time offsets:
[
  {
    "type": "start",
    "time": 1.234
  },
  {
    "type": "end",
    "time": 50.403
  }
]
I know that I can extract these with JSON_EACH() and JSON_EXTRACT():
CREATE TEMPORARY TABLE Items(
    id INTEGER PRIMARY KEY,
    timings JSON
);
INSERT INTO Items(timings) VALUES
    ('[{"type": "start", "time": 12.345}, {"type": "end", "time": 67.891}]'),
    ('[{"type": "start", "time": 24.56}, {"type": "end", "time": 78.901}]');
SELECT
    JSON_EXTRACT(Timings.value, '$.type'),
    JSON_EXTRACT(Timings.value, '$.time')
FROM
    Items,
    JSON_EACH(timings) AS Timings;
This returns a table like:
start 12.345
end 67.891
start 24.56
end 78.901
What I really need though is to:
Find the timings of specific types. (Find the first object in the array that matches a condition.)
Take this data and select it as a column with the rest of the table.
In other words, I'm looking for a table that looks like this:
id start end
-----------------------------
0 12.345 67.891
1 24.56 78.901
I'm hoping for some sort of query like this:
SELECT
id,
JSON_EXTRACT(timings, '$.[type="start"].time'),
JSON_EXTRACT(timings, '$.[type="end"].time')
FROM Items;
Is there some way to use path in the JSON functions to select what I need? Or, some other way to pivot what I have in the first example to apply to the table?
One possibility:
WITH cte(id, json) AS
(SELECT Items.id
, json_group_object(json_extract(j.value, '$.type'), json_extract(j.value, '$.time'))
FROM Items
JOIN json_each(timings) AS j ON json_extract(j.value, '$.type') IN ('start', 'end')
GROUP BY Items.id)
SELECT id
, json_extract(json, '$.start') AS start
, json_extract(json, '$.end') AS "end"
FROM cte
ORDER BY id;
which gives
id start end
---------- ---------- ----------
1 12.345 67.891
2 24.56 78.901
Another one, that uses the window functions added in sqlite 3.25 and avoids creating intermediate JSON objects:
SELECT DISTINCT Items.id
, max(json_extract(j.value, '$.time'))
FILTER (WHERE json_extract(j.value, '$.type') = 'start') OVER ids AS start
, max(json_extract(j.value, '$.time'))
FILTER (WHERE json_extract(j.value, '$.type') = 'end') OVER ids AS "end"
FROM Items
JOIN json_each(timings) AS j ON json_extract(j.value, '$.type') IN ('start', 'end')
WINDOW ids AS (PARTITION BY Items.id)
ORDER BY Items.id;
The key is using the ON clause of the JOIN to limit results to just the two objects in each array that you care about, and then merging those up to two rows for each Items.id into one with a couple of different approaches.

snowflake json lateral subquery

I have the following in snowflake:
create or replace table json_tmp as select column1 as id, parse_json(column2) as c
from VALUES (1,
'{"id": "0x1",
"custom_vars": [
{ "key": "a", "value": "foo" },
{ "key": "b", "value": "bar" }
] }') v;
Based on the FLATTEN docs, I hoped to turn these into a table looking like this:
+-------+---------+-----+-----+
| db_id | json_id | a   | b   |
+-------+---------+-----+-----+
| 1     | 0x1     | foo | bar |
+-------+---------+-----+-----+
Here is the query I tried; it resulted in a SQL compilation error: "Object 'CUSTOM_VARS' does not exist."
select json_tmp.id as dbid,
f.value:id as json_id,
a.v,
b.v
from json_tmp,
lateral flatten(input => json_tmp.c) as f,
lateral flatten(input => f.value:custom_vars) as custom_vars,
lateral (select value:value as v from custom_vars where value:key = 'a') as a,
lateral (select value:value as v from custom_vars where value:key = 'b') as b;
What exactly is the error here? Is there a better way to do this transformation?
Note - your solution doesn't actually perform any joins; flatten is a "streaming" operation: it "explodes" the input, and then selects the rows it wants. If you only have 2 attributes in the data, it should be reasonably fast. However, if not, it can lead to an unnecessary data explosion (e.g. if you have 1000s of attributes).
The fastest solution depends on how your data is structured exactly, and what you can assume about the input. For example, if you know that 'a' and 'b' are always in that order, you can obviously use
select
id as db_id,
c:id,
c:custom_vars[0].value,
c:custom_vars[1].value
from json_tmp;
If you know that custom_vars is always 2 elements, but the order is not known, you could do e.g.
select
id as db_id,
c:id,
iff(c:custom_vars[0].key = 'a', c:custom_vars[0].value, c:custom_vars[1].value),
iff(c:custom_vars[0].key = 'b', c:custom_vars[0].value, c:custom_vars[1].value)
from json_tmp;
If the size of custom_vars is unknown, you could create a JavaScript function like extract_key(custom_vars, key) that would iterate over custom_vars and return the value for the found key (or e.g. null or <empty_string> if not found).
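A sketch of what such a UDF might look like (extract_key is a hypothetical name; Snowflake JavaScript UDFs reference their arguments in upper case):
create or replace function extract_key(custom_vars variant, key varchar)
returns variant
language javascript
as
$$
  // Return the value of the first entry whose key matches, or null if not found.
  for (var i = 0; i < CUSTOM_VARS.length; i++) {
    if (CUSTOM_VARS[i].key === KEY) {
      return CUSTOM_VARS[i].value;
    }
  }
  return null;
$$;

-- usage sketch:
-- select id as db_id, c:id as json_id, extract_key(c:custom_vars, 'a') as a, extract_key(c:custom_vars, 'b') as b from json_tmp;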
Hope this helps. If not, please provide more details about your problem (data, etc).
Update Nov 2019
There seems to be a function that does this sort of thing:
select json_tmp.id as dbid,
json_tmp.c:id as json_id,
object_agg(custom_vars.value:key, custom_vars.value:value):a as a,
object_agg(custom_vars.value:key, custom_vars.value:value):b as b
from
json_tmp,
lateral flatten(input => json_tmp.c, path => 'custom_vars') custom_vars
group by json_tmp.id
Original answer Sept 2017
The following query seems to work:
select json_tmp.id as dbid,
json_tmp.c:id as json_id,
a.value:value a,
b.value:value b
from
json_tmp,
lateral flatten(input => json_tmp.c, path => 'custom_vars') a,
lateral flatten(input => json_tmp.c, path => 'custom_vars') b
where a.value:key = 'a' and b.value:key = 'b'
;
I'd rather filter in a subquery rather than on the join, so I'm still interested in seeing other answers.