Expanding a JSON array embedded in an object in Postgres 11.7 - sql

Postgres 11.7. An upgrade to PG 12 (JSONPath, I know) is in the planning stages; not sure when we'll get there.
I'm working on passing some data to a PL/pgSQL stored function, and am struggling to unpack an array embedded in an object. I'm dealing with some client libraries that Really Like an Object as the JSON Root: {[]} instead of [].
As a starting point, here's a sample that works when I get the array as the top-level element:
-- Note: jsonb instead of json may save reparsing time, if the results are reused.
-- Or so I think I heard.
with expanded_data AS (
    select *
    from jsonb_to_recordset(
        '[
            {"base_text":"Red Large Special","base_id":1},
            {"base_text":"Blue Small","base_id":5},
            {"base_text":"Green Medium Special","base_id":87}
        ]')
        AS unpacked (base_text citext, base_id citext)
)
select base_text,
       base_id
from expanded_data;
This returns the hoped-for results:
base_text            | base_id
---------------------+--------
Red Large Special    | 1
Blue Small           | 5
Green Medium Special | 87
This variant also works fine on a top-level array:
with expanded_data AS (
    select *
    from json_populate_recordset(
        null::record,
        '[
            {"base_text":"Red Large Special","base_id":1},
            {"base_text":"Blue Small","base_id":5},
            {"base_text":"Green Medium Special","base_id":87}
        ]')
        AS unpacked (base_text citext, base_id citext)
)
select base_text,
       base_id
from expanded_data;
What I've failed to figure out is how to get these same results when the JSON array is embedded as an element within a JSON object:
{"base_strings":[
{"base_text":"Red Large Special","base_id":1},
{"base_text":"Blue Small","base_id":5},
{"base_text":"Green Medium Special","base_id":87}
]}
I've been working through the docs on the extraction syntax and the various available functions, and haven't sorted it out. Can someone suggest a sensible strategy for expanding the embedded array elements into a rowset?

It is simple: extract the embedded array with the -> operator, then feed the result to jsonb_to_recordset as before:
with expanded_data AS (
    select *
    from jsonb_to_recordset(
        '{"base_strings":[
            {"base_text":"Red Large Special","base_id":1},
            {"base_text":"Blue Small","base_id":5},
            {"base_text":"Green Medium Special","base_id":87}
        ]}'::jsonb -> 'base_strings') -- Changes here
        AS unpacked (base_text citext, base_id citext)
)
select base_text,
       base_id
from expanded_data;
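If maintaining the fixed column list becomes awkward, a minimal alternative sketch (standard PG 11 functions, same sample data) is to expand the array with jsonb_array_elements and pull fields out of each element:
select elem ->> 'base_text' as base_text,
       elem ->> 'base_id'   as base_id
from jsonb_array_elements(
        '{"base_strings":[
            {"base_text":"Red Large Special","base_id":1},
            {"base_text":"Blue Small","base_id":5},
            {"base_text":"Green Medium Special","base_id":87}
        ]}'::jsonb -> 'base_strings'
     ) as t(elem);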

Related

How to retrieve the list of dynamic nested keys of BigQuery nested records

My ELT tool imports my data into BigQuery and automatically generates/extends the schema for dynamic nested keys (in the schema below, under properties).
It looks like this:
How can I get the list of nested keys of a repeated record? So that, for example, I can group by properties when those items have said property non-null?
I have tried
select column_name
from my_schema.INFORMATION_SCHEMA.COLUMNS
where table_name = 'my_table'
But it will only list first-level keys.
From the picture above, I want, as a first step, a SQL query that returns
message
user_id
seeker
liker_id
rateable_id
rateable_type
from_organization
likeable_type
company
existing_attempt
...
My real goal, though, is to group/count my data based on a non-null value of a 2nd-level nested property, properties.filters.[filter_type].
The schema may evolve when our application adds more filters, so this needs to be dynamically generated; I can't just hard-code the list of nested keys.
Note: this is very similar to this question How to extract all the keys in a JSON object with BigQuery, but in my case my data is already in a schema and it's not a JSON object.
EDIT:
Suppose I have a list of such records with nested properties; how do I write a SQL query that adds a field "enabled_filters" which aggregates, for each item, the list of properties for which said property is not null?
Example input (properties.x are dynamic and not known by the programmer)
search_id | properties.filters.school | properties.filters.type
----------+---------------------------+-------------------------
1         | MIT                       | master
2         | Princetown                | null
3         | null                      | master
Example output
search_id | enabled_filters
----------+--------------------
1         | ["school", "type"]
2         | ["school"]
3         | ["type"]
Have you looked at COLUMN_FIELD_PATHS? It should give you the paths for all columns.
select field_path
from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
where table_name = '<table>'
https://cloud.google.com/bigquery/docs/information-schema-column-field-paths
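A sketch of how that output could be narrowed to the 2nd-level goal (my_schema and my_table assumed from the question): filter field_path by prefix to list just the keys under properties.filters:
select field_path
from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
where table_name = 'my_table'
  and field_path like 'properties.filters.%'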
The field properties is not nested via arrays, only via structs, so a UDF in JavaScript that parses this field should work fast enough.
CREATE TEMP FUNCTION jsonObjectKeys(input STRING, shownull BOOL, fullname BOOL)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
// Walk the parsed object recursively; 'old' carries the dotted prefix for full key names
function test(input, old) {
  var out = [];
  for (let x in input) {
    let te = input[x];
    out = out.concat(
      te == null ? (shownull ? [x + '==null'] : [])
        : typeof te == 'object' ? test(te, old + x + '.')
        : [fullname ? old + x : x]
    );
  }
  return out;
}
return test(JSON.parse(input), "");
""";
with tbl as (
  select struct(1 as alpha, struct(2 as x, 3 as y, [1,2,3] as z) as B) A
  from unnest(generate_array(1, 10*1))
  union all
  select struct(null, struct(null, 1, [999]))
)
select *,
  TO_JSON_STRING(A) as string_output,
  jsonObjectKeys(TO_JSON_STRING(A), true, false) as output1,
  jsonObjectKeys(TO_JSON_STRING(A), false, true) as output2,
  concat('["', array_to_string(jsonObjectKeys(TO_JSON_STRING(A), false, true), '","'), '"]') as output_string,
  jsonObjectKeys(TO_JSON_STRING(A.B), false, true) as output3
from tbl
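Applied to the question's stated goal, a hedged sketch (table and column names assumed from the question; assumes properties and filters are plain, non-repeated structs, and the TEMP FUNCTION above is created in the same script): pass only the filters struct, with shownull=false and fullname=false so bare key names with non-null values come back as enabled_filters per row:
select search_id,
       jsonObjectKeys(TO_JSON_STRING(properties.filters), false, false) as enabled_filters
from my_schema.my_table
With shownull=false, keys whose value is null are dropped, which matches the expected ["school", "type"] / ["school"] / ["type"] output.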

Presto extract string from array of JSON elements

I am on Presto 0.273 and I have complex JSON data from which I am trying to extract only specific values.
First, I ran SELECT JSON_EXTRACT(library_data, '$.books'), which gets me all the books from a certain library. The problem is that this returns an array of JSON objects that looks like this:
[{
    "book_name": "abc",
    "book_size": "453",
    "requestor": "27657899462",
    "comments": "this is a comment"
}, {
    "book_name": "def",
    "book_size": "354",
    "requestor": "67657496274",
    "comments": "this is a comment"
}, ...
]
I would like the code to return just a list of the JSON objects, not an array. My intention is to later loop through the JSON objects to find the ones from a specific requestor. Currently, when I loop through the returned array in Python, I get a range of errors about this data being a Series, hence trying to extract it properly instead.
I tried SELECT JSON_EXTRACT(JSON_EXTRACT(data, '$.domains'), '$[0]'), but this doesn't work because the index position of the object needed is not known.
I also tried SELECT array_join(array[books], ', ') but got an "Error casting array element to VARCHAR" error.
Can anyone please point me in the right direction?
Cast to array(json):
SELECT CAST(JSON_EXTRACT(library_data, '$.books') AS array(json))
Or you can use it in unnest to flatten it to rows:
SELECT *,
       js_obj -- will contain a single json object
FROM table
CROSS JOIN UNNEST(CAST(JSON_EXTRACT(library_data, '$.books') AS array(json))) AS t(js_obj)
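From there, filtering for a specific requestor is one more predicate; a minimal sketch (the requestor value comes from the sample data, and table is the placeholder name from the query above):
SELECT js_obj
FROM table
CROSS JOIN UNNEST(CAST(JSON_EXTRACT(library_data, '$.books') AS array(json))) AS t(js_obj)
WHERE JSON_EXTRACT_SCALAR(js_obj, '$.requestor') = '27657899462'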

Extracting JSON returns null (Presto Athena)

I'm working with Presto SQL in Athena, and in a table I have a column named "data.input.additional_risk_data.basket" that holds JSON like this:
[
{
"data.input.additional_risk_data.basket.val.brand":null,
"data.input.additional_risk_data.basket.val.category":null,
"data.input.additional_risk_data.basket.val.item_reference":"26484651",
"data.input.additional_risk_data.basket.val.name":"Nike Force 1",
"data.input.additional_risk_data.basket.val.product_name":null,
"data.input.additional_risk_data.basket.val.published_date":null,
"data.input.additional_risk_data.basket.val.quantity":"1",
"data.input.additional_risk_data.basket.val.size":null,
"data.input.additional_risk_data.basket.val.subCategory":null,
"data.input.additional_risk_data.basket.val.unit_price":769.0,
"data.input.additional_risk_data.basket.val.upc":null,
"data.input.additional_risk_data.basket.val.url":null
}
]
I need to extract some of the data there, for example data.input.additional_risk_data.basket.val.item_reference. I'm not used to working with JSON, but I tried a few things:
json_extract("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
json_extract_scalar("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
They all returned null. I'm wondering what the correct way is to get the values from that JSON.
Thank you!
There are multiple "problems" with your data and your JSON path selector. The keys are not conventional (and I have not found a way to tell Athena to escape them), and your JSON is actually an array of JSON objects. What you can do is cast the data to an array and process it. For example:
-- sample data
WITH dataset (json_val) AS (
VALUES (json '[
{
"data.input.additional_risk_data.basket.val.brand":null,
"data.input.additional_risk_data.basket.val.category":null,
"data.input.additional_risk_data.basket.val.item_reference":"26484651",
"data.input.additional_risk_data.basket.val.name":"Nike Force 1",
"data.input.additional_risk_data.basket.val.product_name":null,
"data.input.additional_risk_data.basket.val.published_date":null,
"data.input.additional_risk_data.basket.val.quantity":"1",
"data.input.additional_risk_data.basket.val.size":null,
"data.input.additional_risk_data.basket.val.subCategory":null,
"data.input.additional_risk_data.basket.val.unit_price":769.0,
"data.input.additional_risk_data.basket.val.upc":null,
"data.input.additional_risk_data.basket.val.url":null
}
]')
)
-- query
select arr[1]['data.input.additional_risk_data.basket.val.item_reference'] item_reference -- or use unnest if more than one element is expected in the array
from (
    select cast(json_val as array(map(varchar, json))) arr
    from dataset
)
Output:
item_reference
"26484651"

How do I Process This String?

I have some results in one of my tables and the results vary; each ';' in the value marks a separate entry packed into one column, which I need to split out.
Here is my SQL and the results:
select REGEXP_COUNT(value,';') as cnt,
description
from mytable;
1  {Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};
1  {Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};
2  {Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};
Desired output:
R1:
Managed By: xBoss
Time Requested:2009-10-19 07:53:45.0
Time Arrived: 2009-10-19 07:54:46.0
R2:
Managed By:Own Arrangements
Number: x5876523
Time Requested: 2009-10-19 07:57:46.0
Time Arrived:
R3:
Managed By: xBoss
Time Requested:2009-10-19 08:07:27.0
select
    SPLIT_PART(description, '}', 1),
    SPLIT_PART(description, '}', 2),
    SPLIT_PART(description, '}', 3),
    SPLIT_PART(description, '}', 4),
    SPLIT_PART(description, '}', 5)
    as description_with_tag
from mytable;
This is OK when the count is 1, but when there are multiple ';'-separated entries in the description it doesn't give me the results.
Is it possible to put this into an array based on the count?
First, it's worth pointing out that data in this type of format cannot take advantage of all the benefits that Redshift can offer. Amazon Redshift is a columnar database that can provide amazing performance when data is stored in appropriate columns. However, selecting specific text from a text field will always perform poorly.
Therefore, my main advice would be to pre-process the data into normal rows and columns so that Redshift can provide you the best capabilities.
However, to answer your question, I would recommend making a Scalar User-Defined Function:
CREATE FUNCTION f_extract_curly (s TEXT, key TEXT)
RETURNS TEXT
STABLE
AS $$
    # List of items in {brackets}; drop any trailing semicolon first so the
    # last value does not keep a stray '}'
    items = s.strip().rstrip(';')[1:-1].split('}{')
    # Dictionary of Key|Value from items
    entries = {i.split('|')[0]: i.split('|')[1] for i in items}
    # Return desired value
    return entries.get(key, None)
$$ LANGUAGE plpythonu;
I loaded sample data with:
CREATE TABLE foo (
description TEXT
);
INSERT INTO foo values('{Managed By|xBoss}{xBoss xBoss Number|X0910505569}{Time Requested|2009-04-15 20:47:11.0}{Time Arrived|2009-04-15 21:46:11.0};');
INSERT INTO foo values('{Managed By|Modern Management}{xBoss Number|}{Time Requested|2009-04-16 14:01:29.0}{Time Arrived|2009-04-16 14:44:11.0};');
INSERT INTO foo values('{Managed By|xBoss}{xBoss Number|X091480092}{Time Requested|2009-05-28 08:58:41.0}{Time Arrived|};{Managed By|Jims Allocation}{xBoss xBoss Number|}{Time Requested|}{Time Arrived|};');
Then I tested it with:
SELECT
f_extract_curly(description, 'Managed By'),
f_extract_curly(description, 'Time Requested')
FROM foo
and got the result:
xBoss 2009-04-15 20:47:11.0
Modern Management 2009-04-16 14:01:29.0
xBoss
It doesn't know how to handle lines that have the same field specified twice (with semicolons between the entries). You did not provide enough sample input and output for me to figure out what you want in such situations, but feel free to tweak the code for your requirements.
There is no array data type in Redshift. There are 3 options:
1) First split_part by ';', then union the results separately for every index of the first split_part output, then split_part the results by '}', and finally extract what you need (see the sketch after this list).
2) Create a Python UDF and process these strings with Python. I guess this is the best solution for your use case.
3) Transform your data outside Redshift. From your data structure it seems much better to process it before copying to Redshift, unnesting the arrays into rows and extracting keys from your objects into columns.
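A minimal sketch of option 1, assuming at most two ';'-separated entries per row (mytable and description as in the question); each stacked entry can then be split again on '}' or handed to f_extract_curly above:
select 1 as entry_no, split_part(description, ';', 1) as entry
from mytable
union all
select 2 as entry_no, split_part(description, ';', 2) as entry
from mytable
where split_part(description, ';', 2) <> '';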

Store a 2D array in postgres

I have a 2D array which looks like below
[['2018-05-15', 6.7580658761256265], ['2018-05-16', 7.413963464926804], ['2018-05-17', 8.801776892107043], ['2018-05-18', 10.292505686766823], ['2018-05-19', 10.292505686766823], ['2018-05-20', 10.292505686766823], ['2018-05-21', 10.292505686766823]]
I want to store this 2D array in Postgres. I checked Postgres's documentation (https://www.postgresql.org/docs/9.1/static/arrays.html) and found the example below:
CREATE TABLE sal_emp (
name text,
pay_by_quarter integer[],
schedule text[][]
);
INSERT INTO sal_emp
VALUES ('Bill',
ARRAY[10000, 10000, 10000, 10000],
ARRAY[['meeting', 'lunch'], ['training', 'presentation']]);
For a list that contains both of its values as strings, like ['meeting', 'lunch'], we use text[][]. But as you can see, I have one string and one float as the contents of a list:
['2018-05-15', 6.7580658761256265]
What do I set the column type as, then?
It's not an array anymore if the types are different; arrays can only have one base type. You'll have to use an array of a composite type instead.
CREATE TYPE your_type AS (
    text_member  text,
    float_member float
);

CREATE TABLE your_table (
    your_type_column your_type[]
);

INSERT INTO your_table (your_type_column)
VALUES (array[
    row('2018-05-15', 6.7580658761256265)::your_type,
    row('2018-05-16', 7.413963464926804)::your_type,
    ...
    row('2018-05-21', 10.292505686766823)::your_type
]);
But maybe consider not using arrays at all if possible: have another table and a normalized schema. Also consider using date instead of text if it's actually a date. (I just used text because you said it was a "string", though the sample data suggests otherwise.) A sketch of that alternative follows.
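A minimal sketch of the normalized version, using the sample values above (series_id is a hypothetical discriminator in case you store more than one such array; drop it if you don't need it):

CREATE TABLE your_table (
    series_id integer NOT NULL,
    day       date    NOT NULL,
    value     float   NOT NULL,
    PRIMARY KEY (series_id, day)
);

INSERT INTO your_table (series_id, day, value) VALUES
    (1, '2018-05-15', 6.7580658761256265),
    (1, '2018-05-16', 7.413963464926804),
    (1, '2018-05-21', 10.292505686766823);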