This runs:
val r = sql("SELECT T.* FROM ( VALUES ('xx','xxx','2019-01-01'), ('xxxx','yyyy','2019-01-02') ) T")
but r has unnamed columns, shown as col1|col2|col3. In standard SQL I can supply the names as table-alias parameters, writing T(a,b,c) instead of just T... But this,
val r = sql("SELECT T.* FROM ( VALUES ('xx','xxx','2019-01-01'), ('xxxx','yyyy','2019-01-02') ) T(a,b,c)")
does not work; it fails with an ugly error message that says nothing about the correct Spark syntax for it...
The question is: how do I express the column names? I need an example that I can run in spark-shell v2.2.
Notes
The ugly message:
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting {<EOF>, ',', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'JOIN', 'CROSS', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'ANTI'}(line 1, pos 73)
== SQL ==
SELECT T.*, 'aaa' as chk FROM ( VALUES ('xx','xxx','2019-01-01') ) T (a,b,c)
----------------------------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:637)
... 50 elided
I suspect that the inline dataset (VALUES (),(),...) has predefined column names (col1, col2, ...), so the only way to override them is to use aliases.
Since you need columns a, b, c in val r, it should look like this:
SELECT T.col1 as a, T.col2 as b, T.col3 as c FROM ( VALUES ('xx','xxx','2019-01-01'), ('xxxx','yyyy','2019-01-02') ) T
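Pasted into spark-shell, that is (the same query as above, only wrapped in the sql call from the question):
val r = sql("SELECT T.col1 as a, T.col2 as b, T.col3 as c FROM ( VALUES ('xx','xxx','2019-01-01'), ('xxxx','yyyy','2019-01-02') ) T")
r.show()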
You can also convert the result to a DataFrame and specify your column names:
val r = sql("SELECT T.* FROM ( VALUES ('xx','xxx','2019-01-01'), ('xxxx','yyyy','2019-01-02') ) T").toDF("foo","bar","baz")
Or you can use a different approach to creating your DataFrame:
val r = Seq(("xx","xxx","2019-01-01"),("xxxx","yyyy","2019-01-02")).toDF("foo","bar","baz")
Either way, you will get:
r.show
+----+----+----------+
| foo| bar| baz|
+----+----+----------+
| xx| xxx|2019-01-01|
|xxxx|yyyy|2019-01-02|
+----+----+----------+
Not necessarily any prettier or better than @mangusta's answer, just an alternative approach.
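Side note, untested on v2.2: later Spark versions also accept VALUES directly as an inline table, which allows the standard column-list alias without the derived-table parentheses. I believe inline tables arrived around Spark 2.1, so this may work for you as well:
val r = sql("SELECT T.* FROM VALUES ('xx','xxx','2019-01-01'), ('xxxx','yyyy','2019-01-02') AS T(a,b,c)")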
Related
I have the following data in one column in BigQuery:
{"id": "81", "type": ["{'id2': '12', 'type2': 'main'}", "{'id2': '15', 'type2': 'sub'}"]}
I would like to parse this and have 'id2' and 'type2' as nested fields. I tried JSON_VALUE_ARRAY(data, "$.type"), which correctly creates the nested rows, but I couldn't extract 'id2' and 'type2' from them. I suspect the double quotes inside the list are the issue; how can I get past those?
UPDATE:
This is the format I would like to achieve.
Consider the approach below:
select
  json_value(json, '$.id') id,
  array(
    select as struct
      json_value(trim(type, '"'), '$.id2') as id2,
      json_value(trim(type, '"'), '$.type2') as type2
    from unnest(json_extract_array(json, '$.type')) type
  ) type
from your_table
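To try it without creating a table, you can inline the sample row from the question as a named subquery; a sketch, where your_table is just that stand-in and the rest of the query is unchanged:
with your_table as (
  select '''{"id": "81", "type": ["{'id2': '12', 'type2': 'main'}", "{'id2': '15', 'type2': 'sub'}"]}''' as json
)
select
  json_value(json, '$.id') id,
  array(
    select as struct
      json_value(trim(type, '"'), '$.id2') as id2,
      json_value(trim(type, '"'), '$.type2') as type2
    from unnest(json_extract_array(json, '$.type')) type
  ) type
from your_table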
I am currently figuring out how to do a somewhat complex data migration in my database, and whether it is even possible in SQL (I'm not a very experienced SQL developer myself).
Let's say that I store JSONs in one of the text columns of a Postgres table, with roughly the following format:
{"type":"something","params":[{"value":"00de1be5-f75b-4072-ba30-c67e4fdf2333"}]}
Now, I would like to migrate the value part to a bit more complex format:
{"type":"something","params":[{"value":{"id":"00de1be5-f75b-4072-ba30-c67e4fdf2333","path":"/hardcoded/string"}}]}
Furthermore, I also need to check whether the value matches a UUID pattern, and if not, use a slightly different structure:
{"type":"something-else","params":[{"value":"not-id"}]} ---> {"type":"something-else","params":[{"value":{"value":"not-id","path":""}}]}
I know I can define a procedure and use REGEXP_REPLACE(source, pattern, replacement_string [, flags]), but I have no idea how to approach deciding whether the content contains an ID or not. Could someone suggest at least some direction or hint for how to do this?
You can use jsonb functions to extract the data and change it, and at the end merge the result back into the original data.
Sample data structure and query result: dbfiddle
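For reference, a minimal setup the query below assumes (inferred from the JSON shown in the question; the fiddle may differ slightly):
create table test (data text);
insert into test values
  ('{"type":"something","params":[{"value":"00de1be5-f75b-4072-ba30-c67e4fdf2333"}]}'),
  ('{"type":"something-else","params":[{"value":"not-id"}]}');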
select
  (t.data::jsonb || jsonb_build_object(
    'params',
    jsonb_agg(
      jsonb_build_object(
        'value',
        case
          -- the value matches a v4 UUID: wrap it as {"id": ..., "path": ...}
          when e.value->>'value' ~* '^[0-9A-F]{8}-[0-9A-F]{4}-4[0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$' then
            jsonb_build_object('id', e.value->>'value', 'path', '/hardcoded/string')
          -- otherwise keep the original value and attach an empty path
          else
            jsonb_build_object('value', e.value->>'value', 'path', '')
        end
      )
    )
  ))::text
from
  test t
  cross join jsonb_array_elements(t.data::jsonb->'params') e
group by t.data
PS:
If your table has an id or another unique column, you can change the final group by t.data to group by t.id; the rest of the query stays the same.
To replace values at any depth, you can use a recursive CTE to run replacements for each value of a value key, using a conditional to check if the value is a UUID, and producing the proper JSON object accordingly:
with recursive cte(v, i, js) as (
  -- seed: collect the distinct quoted values of every "value" key, and start the counter at 0
  select (select array_to_json(array_agg(distinct t.i))
          from (select (regexp_matches(js, '"value":("[\w\-]+")', 'g'))[1] i) t),
         0, js
  from (select '{"type":"something","params":[{"value":"00de1be5-f75b-4072-ba30-c67e4fdf2333"}, {"value":"sdfsa"}]}' js) t1
  union all
  -- each step replaces one collected value with the proper JSON object,
  -- then strips the quotes that would otherwise wrap the new object
  select c.v, c.i + 1,
    regexp_replace(
      regexp_replace(c.js,
        regexp_replace((c.v -> c.i)::text, '[\\"]+', '', 'g'),
        case when not ((c.v -> c.i)::text ~ '\w+\-\w+\-\w+\-\w+\-\w+')
          then json_build_object('value', regexp_replace((c.v -> c.i)::text, '[\\"]+', '', 'g'), 'path', '')::text
          else json_build_object('id', regexp_replace((c.v -> c.i)::text, '[\\"]+', '', 'g'), 'path', '/hardcoded/path')::text
        end, 'g'),
      '(")(?=\{)|(?<=\})(")', '', 'g')
  from cte c
  where c.i < json_array_length(c.v)
)
select js from cte order by i desc limit 1
Output:
{"type":"something","params":[{"value":{"id" : "00de1be5-f75b-4072-ba30-c67e4fdf2333", "path" : "/hardcoded/path"}}, {"value":{"value" : "sdfsa", "path" : ""}}]}
On a more complex JSON input string:
{"type":"something","params":[{"value":"00de1be5-f75b-4072-ba30-c67e4fdf2333"}, {"value":"sdfsa"}, {"more":[{"additional":[{"value":"00f41be5-g75b-4072-ba30-c67e4fdf3777"}]}]}]}
Output:
{"type":"something","params":[{"value":{"id" : "00de1be5-f75b-4072-ba30-c67e4fdf2333", "path" : "/hardcoded/path"}}, {"value":{"value" : "sdfsa", "path" : ""}}, {"more":[{"additional":[{"value":{"id" : "00f41be5-g75b-4072-ba30-c67e4fdf3777", "path" : "/hardcoded/path"}}]}]}]}
I'm trying to write an SQL query that finds the rows in a table matching any value of a provided JSON array.
To put it more concretely, I have the following db table:
CREATE TABLE mytable (
name text,
id SERIAL PRIMARY KEY,
config json,
matching boolean
);
INSERT INTO "mytable"(
"name", "id", "config", "matching"
)
VALUES
(
E'Name 1', 50,
E'{"employees":[1,7],"industries":["1","3","4","13","14","16"],"levels":["1110","1111","1112","1113","1114"],"revenue":[0,5],"states":["AK","Al","AR","AZ","CA","CO","CT","DC","DE","FL","GA","HI","IA","ID","IL"]}',
TRUE
),
(
E'Name 2', 63,
E'{"employees":[3,5],"industries":["1"],"levels":["1110"],"revenue":[2,5],"states":["AK","AZ","CA","CO","HI","ID","MT","NM","NV","OR","UT","WA","WY"]}',
TRUE
),
(
E'Name 3', 56,
E'{"employees":[0,0],"industries":["14"],"levels":["1111"],"revenue":[7,7],"states":["AK","AZ","CA","CO","HI","ID","MT","NM","NV","OR","UT","WA","WY"]}',
TRUE
),
(
E'Name 4', 61,
E'{"employees":[3,8],"industries":["1"],"levels":["1110"],"revenue":[0,5],"states":["AK","AZ","CA","CO","HI","ID","WA","WY"]}',
FALSE
);
I need to perform search queries on this table with the given filtering params. The filtering params basically correspond to the json keys in config field. They come from the client side and can look something like this:
{"employees": [1, 8], "industries": ["12", "5"]}
{"states": ["LA", "WA", "CA"], "levels": ["1100", "1100"], "employees": [3]}
And given such filters, I need to find the rows in my table that include any of the array elements from the corresponding filter key, for every filter key provided.
So given the filter {"employees": [1, 8], "industries": ["12", "5"]}, the query would have to return all the rows where the employees key in the config field contains either 1 or 8, AND the industries key in the config field contains either 12 or 5;
I need to generate such a query dynamically from the JavaScript code so that I can include/exclude filtering by a certain parameter by adding/removing the AND operator.
What I have so far is a super long-running query that generates all possible combinations of the array elements in the config field, which feels very wrong:
select * from mytable
cross join lateral json_array_elements(config->'employees') as e1
cross join lateral json_array_elements(config->'states') as e2
cross join lateral json_array_elements(config->'levels') as e3
cross join lateral json_array_elements(config->'revenue') as e4;
I've also tried to do something like this:
select * from mytable
where
matching = TRUE
and (config->'employees')::jsonb #> ANY(ARRAY ['[1, 7, 8]']::jsonb[])
and (config->'states')::jsonb #> ANY(ARRAY ['["AK", "AZ"]']::jsonb[])
and ........;
however this didn't work, although it looked promising.
Also, I've tried playing with the ?| operator, but to no avail.
Basically, what I need is: given an array key in a JSON field, check whether that field contains any of the values provided in another array (my filtering parameter); and I have to do this for multiple filtering parameters dynamically.
So the logic is the following:
select all rows from the table
*where*
matching = TRUE
*and* config->key1 includes any of the keys from [5,6,8,7]
*and* config->key2 includes any of the keys from [8,6,2]
*and* so forth;
Could you help me with implementing such an SQL query?
Or will such SQL queries always be extremely slow, making it best to do this filtering outside of the database?
I'd try something like this. I guess there are certain side effects (e.g. what if the comparison data is empty?) and I didn't test it on larger data sets... It was just the first thing that came to my mind:
demo:db<>fiddle
SELECT
*
FROM
mytable t
JOIN (SELECT '{"states": ["LA", "WA", "CA"], "levels": ["1100", "1100"], "employees": [3]}'::json as data) c
ON
CASE WHEN c.data -> 'employees' IS NOT NULL THEN
ARRAY(SELECT json_array_elements_text(t.config -> 'employees')) && ARRAY(SELECT json_array_elements_text(c.data -> 'employees'))
ELSE TRUE END
AND
CASE WHEN c.data -> 'industries' IS NOT NULL THEN
ARRAY(SELECT json_array_elements_text(t.config -> 'industries')) && ARRAY(SELECT json_array_elements_text(c.data -> 'industries'))
ELSE TRUE END
AND
CASE WHEN c.data -> 'states' IS NOT NULL THEN
ARRAY(SELECT json_array_elements_text(t.config -> 'states')) && ARRAY(SELECT json_array_elements_text(c.data -> 'states'))
ELSE TRUE END
AND
CASE WHEN c.data -> 'revenue' IS NOT NULL THEN
ARRAY(SELECT json_array_elements_text(t.config -> 'revenue')) && ARRAY(SELECT json_array_elements_text(c.data -> 'revenue'))
ELSE TRUE END
AND
CASE WHEN c.data -> 'levels' IS NOT NULL THEN
ARRAY(SELECT json_array_elements_text(t.config -> 'levels')) && ARRAY(SELECT json_array_elements_text(c.data -> 'levels'))
ELSE TRUE END
Explanation of the join condition:
CASE WHEN c.data -> 'levels' IS NOT NULL THEN
ARRAY(SELECT json_array_elements_text(t.config -> 'levels')) && ARRAY(SELECT json_array_elements_text(c.data -> 'levels'))
ELSE TRUE END
If your comparison data does not contain a specific attribute, the condition is true and that attribute is therefore ignored. If it does contain the attribute, the table array and the comparison array for that attribute are compared by transforming both JSON arrays into plain Postgres arrays.
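A tiny standalone illustration of the && overlap check used above (my own example, runnable as-is):
SELECT ARRAY(SELECT json_array_elements_text('[1,7]'::json))
    && ARRAY(SELECT json_array_elements_text('[1,8]'::json));
-- returns true, because both arrays contain the element '1'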
I have a column with string entries in the following format (for example):
"[ { 'state' : 'CA', 'tax_amount' : 3},{ 'state' : 'AZ', 'tax_amount' : 4}]"
I want to sum the tax_amounts in each entry to get a total tax amount for each row. How can I do this in PostgreSQL?
I would use a scalar sub-query:
select t.*,
(select sum((item ->> 'tax_amount')::int)
from jsonb_array_elements(t.the_column) as x(item)) as total_tax
from the_table t
Online example
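One caveat: the column in the question is text holding single-quoted pseudo-JSON, so it may need cleaning before jsonb_array_elements can consume it. A rough sketch (the_table and the_column are the placeholders from above; the replace assumes there are no apostrophes inside the data):
select t.*,
       (select sum((item ->> 'tax_amount')::int)
        from jsonb_array_elements(replace(t.the_column, '''', '"')::jsonb) as x(item)) as total_tax
from the_table t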
You can use the JSONB_POPULATE_RECORDSET() function, like so:
WITH t(js) AS
(
SELECT '[ { "state" : "CA", "tax_amount" : 3},
{ "state" : "AZ", "tax_amount" : 4}]'::JSONB
)
SELECT SUM(tax_amount) AS total_tax_amount
FROM t
CROSS JOIN JSONB_POPULATE_RECORDSET(NULL::record, js)
AS tab(state VARCHAR(10), tax_amount INT)
total_tax_amount
----------------
7
after fixing the syntax of the JSON object by replacing the single quotes inside it with double quotes, and wrapping the whole object in single quotes instead of double quotes.
Demo
Once upon a time, there was one row of data (massively simplified, the actual json data is 10KB+) thus:
ID, json
1, '{
"a1.arr": [1,2,3],
"a1.boo": true,
"a1.str": "hello",
"a1.num": 123
}'
A process was supposed to write another record with predominantly different data:
ID, json
2, '{
"a1.arr": [1,2,3], //from ID 1
"a2.arr": [4,5,6], //new (and so are all below)
"a2.boo": false,
"a2.str": "goodbye",
"a2.num": 456
}'
But due to some external error, the original set of json from ID 1 ended up also being represented in ID 2, so now the table looks like this:
ID, json
1, '{
"a1.arr": [1,2,3],
"a1.boo": true,
"a1.str": "hello",
"a1.num": 123
}'
2, '{
"a1.arr": [1,2,3],
"a1.boo": true, //extraneous
"a1.str": "hello", //extraneous
"a1.num": 123, //extraneous
"a2.arr": [4,5,6],
"a2.boo": false,
"a2.str": "goodbye",
"a2.num": 456
}'
I'd like to know if there's a way to remove the extraneous lines from the ID 2 record.
I believe that the entire JSON string from ID 1 is represented in ID 2 as a contiguous block, so string replacement could work, but there's a chance that some reordering has taken place. It also gets messy around the element that is supposed to remain.
There's also a chance that some of the a1.* nodes' values have been changed slightly (I didn't do a diff), but I'm happy to use just the node names, not their values, in assessing whether a node should be removed. One of the nodes (a1.arr) should be kept in ID 2. The result set should hence look like:
ID, json
1, '{
"a1.arr": [1,2,3],
"a1.boo": true,
"a1.str": "hello",
"a1.num": 123
}'
2, '{
"a1.arr": [1,2,3],
"a2.arr": [4,5,6],
"a2.boo": false,
"a2.str": "goodbye",
"a2.num": 456
}'
I've started playing about with https://dba.stackexchange.com/questions/168303/can-sql-server-2016-extract-node-names-from-json to get the list of node names from ID 1 that I want to remove from ID 2; I'm just not sure how I then strip them out of ID 2's JSON - presumably a deserialize, reduce and reserialize sequence?
You can follow this approach:
get the keys you want to remove with OPENJSON on the row with id = 1
use CROSS APPLY to filter those keys out of the rows with id > 1
rebuild the JSON string without the unwanted keys using STRING_AGG and GROUP BY
This code should work:
declare @tmp table ([id] int, jsonValue nvarchar(max))
declare @source_json nvarchar(max)
insert into @tmp values
(1, '{"a1.arr":[1,2,3],"a1.boo":true,"a1.str":"hello","a1.num":123}')
,(2, '{"a1.arr":[1,2,3],"a1.boo":true,"a1.str":"hello","a1.num":123,"a2.arr":[4,5,6],"a2.boo":false,"a2.str":"goodbye","a2.num":456}')
,(3, '{"a1.arr":[1,2,3],"a1.boo":true,"a1.str":"hello","a1.num":123,"a3.arr":[4,5,6],"a3.boo":false,"a3.str":"goodbye","a3.num":456}')
--get the json string from id = 1
select @source_json = jsonValue from @tmp where [id] = 1
--rebuild the json string for id > 1, dropping every key that appears in id = 1 except a1.arr;
--strings are re-quoted and escaped, while numbers, booleans, arrays and objects are emitted as-is
select t.[id],
'{' + STRING_AGG(
    '"' + g.[key] + '":' +
    case when g.[type] = 1
         then '"' + STRING_ESCAPE(g.[value], 'json') + '"'
         else g.[value]
    end, ',') + '}' as [json]
from @tmp t cross apply openjson(jsonValue) g
where t.id > 1
and g.[key] not in (select [key] from openjson(@source_json) where [key] <> 'a1.arr')
group by t.id
Result: IDs 2 and 3 come back rebuilt with every a1.* key except a1.arr stripped out.