snowflake json lateral subquery - sql

I have the following in snowflake:
create or replace table json_tmp as select column1 as id, parse_json(column2) as c
from VALUES (1,
'{"id": "0x1",
"custom_vars": [
{ "key": "a", "value": "foo" },
{ "key": "b", "value": "bar" }
] }') v;
Based on the FLATTEN docs, I hoped to turn these into a table looking like this:
+-------+---------+-----+-----+
| db_id | json_id |  a  |  b  |
+-------+---------+-----+-----+
|     1 | 0x1     | foo | bar |
+-------+---------+-----+-----+
Here is the query I tried; it resulted in a SQL compilation error: "Object 'CUSTOM_VARS' does not exist."
select json_tmp.id as dbid,
f.value:id as json_id,
a.v,
b.v
from json_tmp,
lateral flatten(input => json_tmp.c) as f,
lateral flatten(input => f.value:custom_vars) as custom_vars,
lateral (select value:value as v from custom_vars where value:key = 'a') as a,
lateral (select value:value as v from custom_vars where value:key = 'b') as b;
What exactly is the error here? Is there a better way to do this transformation?

Note - your solution doesn't actually perform any joins: flatten is a "streaming" operation that "explodes" the input, and the query then selects the rows it wants. If you only have 2 attributes in the data, it should be reasonably fast. However, if not, it can lead to an unnecessary data explosion (e.g. if you have 1000s of attributes).
The fastest solution depends on how your data is structured exactly, and what you can assume about the input. For example, if you know that 'a' and 'b' are always in that order, you can obviously use
select
id as db_id,
c:id,
c:custom_vars[0].value,
c:custom_vars[1].value
from json_tmp;
If you know that custom_vars is always 2 elements, but the order is not known, you could do e.g.
select
id as db_id,
c:id,
iff(c:custom_vars[0].key = 'a', c:custom_vars[0].value, c:custom_vars[1].value),
iff(c:custom_vars[0].key = 'b', c:custom_vars[0].value, c:custom_vars[1].value)
from json_tmp;
If the size of custom_vars is unknown, you could create a JavaScript function like extract_key(custom_vars, key) that iterates over custom_vars and returns the value for the given key (or e.g. null or an empty string if not found).
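For illustration, here is a minimal sketch of such a UDF; the name extract_key and the null fallback are assumptions, not something from the original post:
create or replace function extract_key(custom_vars variant, key_name varchar)
returns variant
language javascript
as
$$
  // Hypothetical helper: scan the array and return the value of the first entry
  // whose key matches; return null if the key is not found.
  if (!CUSTOM_VARS) return null;
  for (var i = 0; i < CUSTOM_VARS.length; i++) {
    if (CUSTOM_VARS[i].key === KEY_NAME) {
      return CUSTOM_VARS[i].value;
    }
  }
  return null;
$$;
-- usage sketch
select id as db_id,
       c:id as json_id,
       extract_key(c:custom_vars, 'a') as a,
       extract_key(c:custom_vars, 'b') as b
from json_tmp;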
Hope this helps. If not, please provide more details about your problem (data, etc).

Update Nov 2019
There seems to be a function that does this sort of thing:
select json_tmp.id as dbid,
json_tmp.c:id as json_id,
object_agg(custom_vars.value:key, custom_vars.value:value):a as a,
object_agg(custom_vars.value:key, custom_vars.value:value):b as b
from
json_tmp,
lateral flatten(input => json_tmp.c, path => 'custom_vars') custom_vars
group by json_tmp.id, json_tmp.c:id
Original answer Sept 2017
The following query seems to work:
select json_tmp.id as dbid,
json_tmp.c:id as json_id,
a.value:value a,
b.value:value b
from
json_tmp,
lateral flatten(input => json_tmp.c, path => 'custom_vars') a,
lateral flatten(input => json_tmp.c, path => 'custom_vars') b
where a.value:key = 'a' and b.value:key = 'b'
;
I'd rather filter in a subquery than in the join condition, so I'm still interested in seeing other answers.
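For what it's worth, here is an untested sketch of one way to push the key filter into subqueries and join the flattened rows back on id (this assumes id is unique per row, and it reintroduces joins, so the performance note above still applies):
select j.id as dbid,
       j.c:id as json_id,
       a.v as a,
       b.v as b
from json_tmp j
left join (select t.id, f.value:value as v
           from json_tmp t,
                lateral flatten(input => t.c, path => 'custom_vars') f
           where f.value:key = 'a') a on a.id = j.id
left join (select t.id, f.value:value as v
           from json_tmp t,
                lateral flatten(input => t.c, path => 'custom_vars') f
           where f.value:key = 'b') b on b.id = j.id;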

Related

Postgres JSONB select json objects as columns

The table has an id and step_data columns.
id | step_data
--------------
a1 | {...}
a2 | {...}
a3 | {...}
Where step_data is a nested structure as follows, and the actual metadata keys can be in any of the events objects.
{
"events": [
{
"timestamp": "2021-04-07T17:46:13.739Z",
"meta": [
{
"key": "length",
"value": "0.898"
},
{
"key": "height",
"value": "607023104"
},
{
"key": "weight",
"value": "33509376"
}
]
},
{
"timestamp": "2021-04-07T17:46:13.781Z",
"meta": [
{
"key": "color",
"value": "0.007"
},
{
"key": "count",
"value": "641511424"
},
{
"key": "age",
"value": "0"
}
]
}
]
}
I can extract one field like length pretty easily.
select cast(metadata ->> 'value' as double precision) as length,
id
from (
select jsonb_array_elements(jsonb_array_elements(step_data #> '{events}') #> '{meta}') metadata,
id
from table
) as parsed_keys
where metadata @> '{"key": "length"}'::jsonb
id | length
---+-------
a1 | 0.898
a2 | 0.800
But what I really need is to extract the metadata as columns from a couple of known keys, like length and color. Not sure how to get another column efficiently once I split the array with jsonb_array_elements().
Is there an efficient way to do this without having to call jsonb_array_elements() again and do a join on every single one? For example such that the result set looks like this.
id | length | color | weight
---+--------+-------+---------
a1 | 0.898  | 0.007 | 33509376
a2 | 0.800  | 1.000 | 15812391
Using Postgres 11.7.
With Postgres 11, I can only think of unnesting both levels, then aggregating back into a key/value pair from which you can extract the desired keys:
select t.id,
(x.att ->> 'length')::numeric as length,
(x.att ->> 'color')::numeric as color,
(x.att ->> 'weight')::numeric as weight
from the_table t
cross join lateral (
select jsonb_object_agg(m.item ->> 'key', m.item -> 'value') as att
from jsonb_array_elements(t.step_data -> 'events') as e(event)
cross join jsonb_array_elements(e.event -> 'meta') as m(item)
where m.item ->> 'key' in ('color', 'length', 'weight')
) x
;
With Postgres 12 you could write it a bit simpler:
select t.id,
jsonb_path_query_first(t.step_data, '$.events[*].meta[*] ? (#.key == "length").value') #>> '{}' as length,
jsonb_path_query_first(t.step_data, '$.events[*].meta[*] ? (#.key == "color").value') #>> '{}' as color,
jsonb_path_query_first(t.step_data, '$.events[*].meta[*] ? (#.key == "weight").value') #>> '{}' as weight
from the_table t
;
crosstab()
For any Postgres version.
You could feed the result into the crosstab() function to pivot the result. You need the additional module tablefunc installed. If you are unfamiliar, read basic instructions here first:
PostgreSQL Crosstab Query
SELECT *
FROM crosstab(
$$
SELECT id, metadata->>'key', metadata ->>'value'
FROM (SELECT id, jsonb_array_elements(jsonb_array_elements(step_data -> 'events') -> 'meta') metadata FROM tbl) AS parsed_keys
ORDER BY 1
$$
, $$VALUES ('length'), ('color'), ('weight')$$
) AS ct (id text, length float, color float, weight float);
db<>fiddle here (demonstrating all)
Should deliver best performance, especially for many columns.
Note how we need no explicit cast to double precision (float). crosstab() processes text input anyway, the result is coerced to the types given in the column definition list.
If one of the keys should appear multiple times, the last row wins. (No error is raised, like you would seem to prefer.) You can add a deterministic sort order to the query in $1 to sort the preferred row last. Example: to get the lowest value per key:
ORDER BY 1, 2, 3 DESC
Conditional aggregates with FILTER clause
For Postgres 9.4 or newer.
See:
Aggregate columns with additional (distinct) filters
SELECT id
, min(val) FILTER (WHERE key = 'length') AS length
, min(val) FILTER (WHERE key = 'color') AS color
, min(val) FILTER (WHERE key = 'weight') AS weight
FROM (
SELECT id, metadata->>'key' AS key, (metadata ->>'value')::float AS val
FROM (SELECT id, jsonb_array_elements(jsonb_array_elements(step_data -> 'events') -> 'meta') metadata FROM tbl) AS parsed_keys
) sub
GROUP BY id;
In Postgres 12 or later I would compare performance with jsonb_path_query_first(). a_horse provided a solution.
This query is an option:
SELECT id,
MAX(CASE WHEN metadata->>'key' = 'length' THEN metadata->>'value' END) AS length,
MAX(CASE WHEN metadata->>'key' = 'color' THEN metadata->>'value' END) AS color,
MAX(CASE WHEN metadata->>'key' = 'weight' THEN metadata->>'value' END) AS weight
FROM (SELECT id, jsonb_array_elements(jsonb_array_elements(step_data #> '{events}') #> '{meta}') as metadata
FROM table t) AS aux
GROUP BY id;

Percentage Matching two JSONB columns

I am trying to compare two JSONB columns in a table. At the moment this is done in the app, but that doesn't allow proper searching, filtering and ordering without loading the whole data set. It would be better if we could do this comparison in the DB.
The following is an example of the data and calculation.
employer = {
"autism": "1",
"social": "1",
"dementia": "0",
"domestic": "1",
}
employers_keys = ["autism","social","domestic"]
candidate = {
"autism": "0",
"social": "1",
"dementia": "0",
"domestic": "1",
}
candidate_keys = ["social","domestic"]
remainder_keys = employer_keys - candidate_keys = ["autism"]
1-(remainder_keys.length/employer_keys.length) = 1-(1/3) = 2/3 = 66%
This process is all rather trivial in Ruby: jsonb -> array -> select -> calculation.
However, I would like to perform this in SQL or a function at the DB level, something like
function compare_json(employer, candidate) returning a decimal.
More specifically
Select candidates.id,
st_distance_sphere(st_makepoint(employer.long, employer.lat), st_makepoint(candidates.long, candidates.lat)) /
1000 / 8 * 5 as distance
from (select * from users where id = 8117) employer,
(select * from users where role_id = 5) candidates
where st_distance_sphere(st_makepoint(employer.long, employer.lat), st_makepoint(candidates.long, candidates.lat)) /
1000 / 8 * 5 < 25
order by distance
The above SQL calculates the distance between a single employer and multiple candidates; the inline queries supply employer.skills (1 row) and candidates.skills (n rows).
So the output should be..
Candidate id, Distance, SkillsMatch(employer.skills, candidates.skills)
As before the edit, any guidance will be welcome.
Here is a pure SQL approach: it works by turning the employer object into a record set and then performing conditional aggregation:
select 1 - avg( ((d.candidate ->> e.k)::int is distinct from 1)::int ) res
from (values(
'{ "autism": "1", "social": "1", "dementia": "0", "domestic": "1" }'::jsonb,
'{ "autism": "0", "social": "1", "dementia": "0", "domestic": "1" }'::jsonb
)) d(employer, candidate)
cross join lateral jsonb_each_text(d.employer) e(k, v)
where e.v::int = 1
You can easily turn this into a function by replacing the literal objects in the values() row constructor with parameters.
Demo on DB Fiddle:
| res |
| ---------------------: |
| 0.66666666666666666667 |
Okay this is what I got to.
CREATE OR REPLACE FUNCTION JSON_COMPARE(employer_json jsonb, candidate_json jsonb, OUT _result numeric)
AS
$$
BEGIN
select 1 - avg(((d.candidate ->> e.k)::int is distinct from 1)::int)
into _result
from (values (employer_json, candidate_json)) d(employer, candidate)
cross join lateral jsonb_each_text(d.employer) e(k, v)
where e.v::int = 1;
RETURN;
END;
$$
LANGUAGE PLPGSQL;
Which is a small variation on GMB's super-fast answer. With a few indexes and by correctly limiting the size of the candidate list, we get reasonable performance.
I'm new to Stack so my upvote for GMB doesn't show, but thanks again.
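For completeness, a rough sketch of how the function might plug into the original distance query; the users.skills jsonb column is an assumption, since the question does not show where the skills objects live:
select candidates.id,
       st_distance_sphere(st_makepoint(employer.long, employer.lat),
                          st_makepoint(candidates.long, candidates.lat)) / 1000 / 8 * 5 as distance,
       -- skills is an assumed jsonb column holding the key/value flags shown above
       json_compare(employer.skills, candidates.skills) as skills_match
from (select * from users where id = 8117) employer,
     (select * from users where role_id = 5) candidates
where st_distance_sphere(st_makepoint(employer.long, employer.lat),
                         st_makepoint(candidates.long, candidates.lat)) / 1000 / 8 * 5 < 25
order by distance;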

Querying JSONB array value for sub values?

I have a JSONB Object:
{"name": "Foo", "interfaces": [{"name": "Bar", "status": "up"}]}
It is stored in a jsonb column of a table
create table device (
device_name character varying not null,
device_data jsonb not null
);
So I was trying to get a count by name of devices which have interfaces that are not 'up'. GROUP BY is used for developing counts by name, but I am having issues querying the JSON list for values.
My first attempt was:
select device_name, count(*) from device where device_data -> 'interfaces' -> 'status' != 'up' group by device_name;
Some surrounding data that made me think something was going to be difficult was:
select count(device_data -> 'interfaces') from device;
which I thought was going to get me a count of all interfaces from all devices, but that is not correct. It seems like it is just returning the count from the first item.
I'm thinking I might need to do a subquery or join on the inner content.
I've been thinking it over, and when reading up on PostgreSQL it seems like I haven't found a way to query a list type in a jsonb object. Maybe I'm mistaken. I didn't want to build a business layer on top of this, as I figured the DBMS would be able to handle this heavy lifting.
I saw there is a function, jsonb_array_elements_text(device_data -> 'interfaces')::jsonb -> 'status', which would return the information, but I can't do any sort of count on it, as count(jsonb_array_elements_text(device_data -> 'interfaces')::jsonb -> 'status') returns ERROR: set-valued function called in context that cannot accept a set.
You need a lateral join to unnest the array and count the elements that are down (or not up)
select d.device_name, t.num_down
from device d
cross join lateral (
select count(*) num_down
from jsonb_array_elements(d.device_data -> 'interfaces') as x(i)
where i ->> 'status' = 'down'
) t
To count all interfaces and the down interfaces, you can use filtered aggregation:
select d.device_name, t.*
from device d
cross join lateral (
select count(*) as all_interfaces,
count(*) filter (where i ->> 'status' = 'down') as down_interfaces
from jsonb_array_elements(d.device_data -> 'interfaces') as x(i)
) t
Online example
jsonb_array_elements is the right idea, I think you are looking for an EXISTS condition to match your description "devices which have interfaces that are not 'up'":
SELECT device_name, count(*)
FROM device
WHERE EXISTS (SELECT *
FROM jsonb_array_elements(device_data -> 'interfaces') interface
WHERE interface ->> 'status' != 'up')
GROUP BY device_name;
I would like to know how many interfaces are down
That's a different problem, for this you could use a subquery in the SELECT clause, and probably wouldn't need to do any grouping:
SELECT
device_name,
( SELECT count(*)
FROM jsonb_array_elements(device_data -> 'interfaces') interface
WHERE interface ->> 'status' != 'up'
) AS down_count
FROM device

BigQuery: Store semi-structured JSON data

I have data which can have varying json keys, I want to store all of this data in bigquery and then explore the available fields later.
My structure will be like so:
[
{id: 1111, data: {a:27, b:62, c: 'string'} },
{id: 2222, data: {a:27, c: 'string'} },
{id: 3333, data: {a:27} },
{id: 4444, data: {a:27, b:62, c:'string'} },
]
I wanted to use a STRUCT type but it seems all the fields need to be declared?
I then want to be able to query the data and see how often each key appears, and basically run queries over all records treating, for example, a given key as though it were its own column.
Side note: this data is coming from URL query strings; maybe someone thinks it is best to push the full URL and then use functions to run the analysis?
There are two primary methods for storing semi-structured data as you have in your example:
Option #1: Store JSON String
You can store the data field as a JSON string, and then use the JSON_EXTRACT function to pull out the values it can find, and it will return NULL for any value it cannot find.
Since you mentioned needing to do mathematical analysis on the fields, let's do a simple SUM for the values of a and b:
# Creating an example table using the WITH statement, this would not be needed
# for a real table.
WITH records AS (
SELECT 1111 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
UNION ALL
SELECT 2222 AS id, "{\"a\":27, \"c\": \"string\"}" as data
UNION ALL
SELECT 3333 AS id, "{\"a\":27}" as data
UNION ALL
SELECT 4444 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
)
# Example Query
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum FROM (
SELECT id,
CAST(JSON_EXTRACT(data, "$.a") AS INT64) AS aValue, # Extract & cast as an INT
CAST(JSON_EXTRACT(data, "$.b") AS INT64) AS bValue # Extract & cast as an INT
FROM records
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
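Since the question also asks how often each key appears, here is a minimal sketch against the same data (reuse the WITH records block from above; COUNTIF counts the rows where JSON_EXTRACT finds the key):
# How often does each key appear across all records?
SELECT
  COUNTIF(JSON_EXTRACT(data, "$.a") IS NOT NULL) AS a_count,
  COUNTIF(JSON_EXTRACT(data, "$.b") IS NOT NULL) AS b_count,
  COUNTIF(JSON_EXTRACT(data, "$.c") IS NOT NULL) AS c_count
FROM records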
There are some pros and cons to this approach:
Pros
The syntax is fairly straightforward
Less error prone
Cons
Storage costs will be slightly higher, since you have to store all the characters needed to serialize the JSON.
Queries will run slower than using pure native SQL.
Option #2: Repeated Fields
BigQuery has support for repeated fields, allowing you to take your structure and express it natively in SQL.
Using the same example, here is how we would do that:
## Using a with to create a sample table
WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
(1111, [("a","27"),("b","62"),("c","string")]),
(2222, [("a","27"),("c","string")]),
(3333, [("a","27")]),
(4444, [("a","27"),("b","62"),("c","string")])
])),
## Using another WITH table to take records and unnest them to be joined later
recordsUnnested AS (
SELECT id, key, value
FROM records, UNNEST(records.data) AS keyVals
)
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum
FROM (
SELECT R.id, CAST(RA.value AS INT64) AS aValue, CAST(RB.value AS INT64) AS bValue
FROM records R
LEFT JOIN recordsUnnested RA ON R.id = RA.id AND RA.key = "a"
LEFT JOIN recordsUnnested RB ON R.id = RB.id AND RB.key = "b"
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
As you can see, to perform a similar query it is still rather complex. You also still have to store items like strings and CAST them to other values when necessary, since you cannot mix types in a repeated field.
Pros
Storage size will be smaller than with JSON
Queries will typically execute faster.
Cons
The syntax is more complex and not as straightforward
Hope that helps, good luck.

Optimize SQL query and process result

I am looking for advice on optimizing the following sample query and processing the result. The SQL variant in use is the internal FileMaker ExecuteSQL engine which is limited to the SELECT statement with the following syntax:
SELECT [DISTINCT] {* | column_expression [[AS] column_alias],...}
FROM table_name [table_alias], ...
[ WHERE expr1 rel_operator expr2 ]
[ GROUP BY {column_expression, ...} ]
[ HAVING expr1 rel_operator expr2 ]
[ UNION [ALL] (SELECT...) ]
[ ORDER BY {sort_expression [DESC | ASC]}, ... ]
[ OFFSET n {ROWS | ROW} ]
[ FETCH FIRST [ n [ PERCENT ] ] { ROWS | ROW } {ONLY | WITH TIES } ]
[ FOR UPDATE [OF {column_expression, ...}] ]
The query:
SELECT item1 as val ,interval, interval_next FROM meddata
WHERE fk = 12 AND active1 = 1 UNION
SELECT item2 as val ,interval, interval_next FROM meddata
WHERE fk = 12 AND active2 = 1 UNION
SELECT item3 as val ,interval, interval_next FROM meddata
WHERE fk = 12 AND active3 = 1 UNION
SELECT item4 as val ,interval, interval_next FROM meddata
WHERE fk = 12 AND active4 = 1 ORDER BY val
This may give the following result as a sample:
val,interval,interval_next
Artelac,0,1
Artelac,3,6
Celluvisc,1,3
Celluvisc,12,24
What I am looking to achieve (in addition to suggestions for optimization) is a result formatted like this:
val,interval,interval_next,interval,interval_next,interval,interval_next,interval,interval_next ->etc
Artelac,0,1,3,6
Celluvisc,1,3,12,24
Preferably I would like this processed result to be produced by the SQL engine.
Possible?
Thank you.
EDIT: I included the column names in the result for clarity, though they are not part of the result. I wish to illustrate that there may be an arbitrary number of 'interval' and 'interval_next' columns in the result.
I do not think you need to optimise your query; it looks fine to me.
You are looking for something like PIVOT in T-SQL, which is not supported in FQL. Your biggest issue is going to be the variable number of columns returned.
I think the best approach is to get your intermediate result and use a FileMaker script or Custom Function to pivot it.
An alternative is to get the list of distinct val values and loop through them (with a CF or script), running an FQL statement for each one. You will not be able to combine them with UNION, as it needs the same number of columns.
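If you go that route, a rough, untested sketch of the statement the script might run for each distinct val (binding the current value to the ? placeholders via ExecuteSQL arguments) could look like this; the script would then append the returned interval/interval_next pairs to that value's row:
SELECT interval, interval_next FROM meddata
WHERE fk = 12 AND active1 = 1 AND item1 = ?
UNION
SELECT interval, interval_next FROM meddata
WHERE fk = 12 AND active2 = 1 AND item2 = ?
UNION
SELECT interval, interval_next FROM meddata
WHERE fk = 12 AND active3 = 1 AND item3 = ?
UNION
SELECT interval, interval_next FROM meddata
WHERE fk = 12 AND active4 = 1 AND item4 = ?
ORDER BY interval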