Change null to empty array in Databricks SQL?

I have a value in a JSON column of an Azure Databricks table that is sometimes all null. The full process to get to JSON_TABLE is: read parquet, infer the schema of the JSON column, convert the column from a JSON string to a deeply nested structure, and explode any arrays within it. I am working in SQL with Python-defined UDFs (json_exists() checks the schema to see whether the key can be used, json_get() gets a key from the column or returns a default) and want to do the following:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM
JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.
Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?
Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.
Example table that causes error:
| ID | JSON_COL |
| 1 | {"_corrupt_record": null, "otherInfo": [{"test": 1, "from": 3}]} |
| 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
Example table that does not cause error:
| ID | JSON_COL |
| 1 | {"_corrupt_record": null, "array": [{"test": 1, "from": 3}]} |
| 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
This question seems like it might hold the answer, but I was not able to get anything working from it.

You can filter the table before calling json_get and explode, so that you only explode when json_get returns a non-null value:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM (
SELECT *
FROM JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
)
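Note that the error described here is raised when Spark analyzes the plan, before any rows are read, so a filter alone may not make it go away. An alternative, if you can reach the step where the JSON string is parsed, is to parse that key with an explicit schema so its type no longer depends on inference. A minimal sketch, assuming the raw JSON string column (called JSON_STR here, which is not in the asker's table) is still available and that the array elements have the test and from fields shown in the examples:
SELECT
ID, EXPLODE_OUTER(
from_json(JSON_STR, 'STRUCT<`ARRAY`: ARRAY<STRUCT<`test`: INT, `from`: INT>>>').`ARRAY`
) AS SINGLE_ARRAY_VALUE
FROM JSON_TABLE
WHERE JSON_STR IS NOT NULL
Because the schema is supplied rather than inferred, ARRAY is always an array type even when every row has it null, so EXPLODE never sees a string column (EXPLODE_OUTER keeps rows whose array is null). The trade-off is that a code generator must know, or merge in, the element schema for keys that can be entirely null.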

Related

SQL get the value of a nested key in a jsonb field

Let's suppose I have a table my_table with a field named data, of type jsonb, which thus contains a json data structure.
Let's suppose that if I run
select id, data from my_table where id=10;
I get
id | data
------------------------------------------------------------------------------------------
10 | {
|"key_1": "value_1" ,
|"key_2": ["value_list_element_1", "value_list_element_2", "value_list_element_3" ],
|"key_3": {
| "key_3_1": "value_3_1",
| "key_3_2": {"key_3_2_1": "value_3_2_1", "key_3_2_2": "value_3_2_2"},
| "key_3_3": "value_3_3"
| }
| }
So in pretty formatting, the content of column data is
{
"key_1": "value_1",
"key_2": [
"value_list_element_1",
"value_list_element_2",
"value_list_element_3"
],
"key_3": {
"key_3_1": "value_3_1",
"key_3_2": {
"key_3_2_1": "value_3_2_1",
"key_3_2_2": "value_3_2_2"
},
"key_3_3": "value_3_3"
}
}
I know that if I want to get the value of a key (of "level 1") of the json directly in a column, I can do it with the ->> operator.
For example, if I want to get the value of key_2, what I do is
select id, data->>'key_2' alias_for_key_2 from my_table where id=10;
which returns
id | alias_for_key_2
------------------------------------------------------------------------------------------
10 |["value_list_element_1", "value_list_element_2", "value_list_element_3" ]
Now let's suppose I want to get the value of key_3_2_1, that is value_3_2_1.
How can I do it?
I have tried with
select id, data->>'key_3'->>'key_3_2'->>'key_3_2_1' alias_for_key_3_2_1 from my_table where id=10;
but I get
select id, data->>'key_3'->>'key_3_2'->>'key_3_2_1' alias_for_key_3_2_1 from my_table where id=10;
^
HINT: No operators found with name and argument types provided. Types may need to be converted explicitly.
what am I doing wrong?
The problem in the query
select id, data->>'key_3'->>'key_3_2'->>'key_3_2_1' alias_for_key_3_2_1 --this is wrong!
from my_table
where id=10;
was that by using the ->> operator I was turning the json into a string, so with the next ->> operator I was trying to get the json key key_3_2 out of a string, which makes no sense.
Thus one has to use the -> operator, which does not convert the json into a string, until one gets to the "final" key.
so the query I was looking for was
select id, data->'key_3'->'key_3_2'->>'key_3_2_1' alias_for_key_3_2_1 --final ->> : this gets the value of 'key_3_2_1' as string
from my_table
where id=10;
or alternatively
select id, data->'key_3'->'key_3_2'->'key_3_2_1' alias_for_key_3_2_1 --final -> : this gets the value of 'key_3_2_1' as json / jsonb
from my_table
where id=10;
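If you prefer to give the whole path at once, the #>> operator takes a text-array path and returns the final value as text (use #> to keep it as jsonb), so the same lookup can also be written as:
select id, data#>>'{key_3,key_3_2,key_3_2_1}' alias_for_key_3_2_1
from my_table
where id=10;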
More info on JSON Functions and Operators can be found here

How to store an array of date ranges in Postgres?

I am trying to build a schedule. On the client I generate an array of objects containing date ranges:
[
{start: "2020-07-06 0:0", end: "2020-07-10 23:59"},
{start: "2020-07-13 0:0", end: "2020-07-17 23:59"}
]
I have a column of type daterange[]. What is the proper way to format this data to insert it into my table?
This is what I have so far:
INSERT INTO schedules(owner, name, dates) VALUES (
1,
'work',
'{
{[2020-07-06 0:0,2020-07-10 23:59]},
{[2020-07-13 0:0,2020-07-17 23:59]}
}'
)
I think you want:
insert into schedules(owner, name, dates) values (
1,
'work',
array[
'[2020-07-06, 2020-07-11)'::daterange,
'[2020-07-13, 2020-07-18)'::daterange
]
);
Rationale:
you are using dateranges, so you cannot have time portions (for this, you would need tsrange instead; a tsrange sketch follows the demo below); as your code stands, it seems like you want an inclusive lower bound and an exclusive upper bound (hence [ at the left side, and ) at the right side)
explicit casting is needed so Postgres can recognize that the array elements have the proper datatype (otherwise, they look like text)
then, you can surround the list of ranges with the array[] constructor
Demo on DB Fiddle:
owner | name | dates
----: | :--- | :----------------------------------------------------
1 | work | {"[2020-07-06,2020-07-11)","[2020-07-13,2020-07-18)"}
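If the time portions actually matter, the same pattern works with tsrange instead of daterange; a sketch, assuming the dates column is redefined as tsrange[]:
insert into schedules(owner, name, dates) values (
1,
'work',
array[
'[2020-07-06 00:00, 2020-07-10 23:59]'::tsrange,
'[2020-07-13 00:00, 2020-07-17 23:59]'::tsrange
]
);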

How do I identify problematic documents in S3 when querying data in Athena?

I have a basic Athena query like this:
SELECT *
FROM my.dataset LIMIT 10
When I try to run it I get an error message like this:
Your query has the following error(s):
HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"
How do I identify the S3 document that has the invalid field?
My documents are JSON.
My table looks like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
Inconsistent schema
Inconsistent schema is when values in some rows are of a different data type. Let's assume that we have two JSON files:
// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos", "age":"eleven"}
{"name":"1Fabiana", "age":22}
// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos", "age":11}
{"name":"2Fabiana", "age":22}
Then a simple query SELECT * FROM some_table will fail with
HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
However, we can exclude that file within WHERE clause
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
WHERE "$PATH" != 's3://path/to/bad.json'
Result:
source_s3_file | name | age
---------------------------------------
s3://path/to/good.json | 2Patrick | 35
s3://path/to/good.json | 2Carlos | 11
s3://path/to/good.json | 2Fabiana | 22
Of course, this is the best-case scenario, where we already know which files are bad. However, you can employ this approach to manually narrow down which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.
SELECT
COUNT(*)
FROM some_table
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, then those files are good.
The obvious drawback of this approach is the cost and time of running the queries, especially if it is done file by file.
Malformed records
In the eyes of AWS Athena, good records are those which are formatted as a single JSON object per line:
{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
AWS Athena supports the OpenX JSON SerDe library, which can be set to evaluate malformed records as NULL by specifying
-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
when you create the table. Thus, the following query will reveal files with malformed records:
SELECT
DISTINCT("$PATH")
FROM "some_database"."some_table"
WHERE(
col_1 IS NULL AND
col_2 IS NULL AND
col_3 IS NULL
-- etc
)
Note: you can use only a single col_1 IS NULL if you are 100% sure that it doesn't contain empty fields other than in corrupted rows.
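A variation on the same idea is to aggregate on "$PATH" (a sketch with the same placeholder column names; it assumes Athena lets you group on the pseudo-column), which lists every suspect file together with how many all-NULL rows it contains:
SELECT
"$PATH" AS "source_s3_file",
COUNT(*) AS suspect_rows
FROM "some_database"."some_table"
WHERE col_1 IS NULL AND col_2 IS NULL AND col_3 IS NULL -- etc
GROUP BY "$PATH"
ORDER BY suspect_rows DESC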
In general, malformed records are not that big of a deal provided that 'ignore.malformed.json' = 'true'.
For example, if a file contains:
{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}
the following query will still succeed
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
Result:
source_s3_file | name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35 | North Street
2 s3://path/to/malformed.json| | |
3 s3://path/to/malformed.json| | |
4 s3://path/to/malformed.json| | |
5 s3://path/to/malformed.json| | |
6 s3://path/to/malformed.json| | |
7 s3://path/to/malformed.json| 2Fabiana | 22 | Main Street
With 'ignore.malformed.json' = 'false' (which is the default behaviour), exactly the same query will throw an error
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

How to get JSON value from varchar field

*outdated Oracle version
I have a table for receipt data.
I want to get some data from field EXT_ATTR. such as PAYMENT_RECEIPT_NO
The field "EXT_ATTR" is varchar(4000) and stores a JSON value
SerialId | EXT_ATTR
1 |
{
"PAYMENT_RECEIPT_NO": "PS00000000000000001",
"IS_CORPOR": "1",
"POSTCODE1": "51000",
"POSTCODE2": "51000",
"BILLADDR1PART1": "BILLADDR1PART1_DATA",
"BILLADDR1PART2": "BILLADDR1PART2_DATA",
"NEED_PRINT_WHT": "1",
"WHT_AMT": "0",
"TRXAMT": "2340600",
"LOCATIONID": "02140",
"PAYMENT_METHOD_NAME": "Cash",
"WITH_TAX": "1"
}
2 |
{
"PAYMENT_RECEIPT_NO": "PS00000000000000055",
"IS_CORPOR": "1",
"POSTCODE1": "51000",
"POSTCODE2": "51000",
"BILLADDR1PART1": "BILLADDR1PART1_DATA",
"BILLADDR1PART2": "BILLADDR1PART2_DATA",
"NEED_PRINT_WHT": "1",
"WHT_AMT": "0",
"TRXAMT": "2340600",
"LOCATIONID": "02140",
"PAYMENT_METHOD_NAME": "Cash",
"WITH_TAX": "1"
}
How can I extract the value from the varchar field, to get only the following?
SerialId | PAYMENT_RECEIPT_NO
1 | PS00000000000000001
2 | PS00000000000000055
Thank you very much.
To work with JSON documents you can use PL/JSON.
If you want to parse it without JSON tools, you can use the substr and instr functions in Oracle.
Depending on what your string looks like, you have to adjust the string positions.
create table tab (json varchar2(1000));
insert into tab values('{"PAYMENT_RECEIPT_NO": "PS00000000000000001","IS_CORPOR": "1","POSTCODE1": "51000","POSTCODE2": "51000","BILLADDR1PART1": "BILLADDR1PART1_DATA","BILLADDR1PART2": "BILLADDR1PART2_DATA","NEED_PRINT_WHT": "1","WHT_AMT": "0","TRXAMT": "2340600","LOCATIONID": "02140","PAYMENT_METHOD_NAME": "Cash","WITH_TAX": "1"}');
insert into tab values('{"PAYMENT_RECEIPT_NO": "PS00000000000000055","IS_CORPOR": "1","POSTCODE1": "51000","POSTCODE2": "51000","BILLADDR1PART1": "BILLADDR1PART1_DATA","BILLADDR1PART2": "BILLADDR1PART2_DATA","NEED_PRINT_WHT": "1","WHT_AMT": "0","TRXAMT": "2340600","LOCATIONID": "02140","PAYMENT_METHOD_NAME": "Cash","WITH_TAX": "1"}');
select substr(json,instr(json,': ',1,1)+3,instr(json,',',1,1)-instr(json,': ',1,1)-4)
from tab;
| SUBSTR(JSON,INSTR(JSON,':',1,1)+3,INSTR(JSON,',',1,1)-INSTR(JSON,':',1,1)-4) |
| :--------------------------------------------------------------------------- |
| PS00000000000000001 |
| PS00000000000000055 |
db<>fiddle here
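A slightly more robust string-based variant (a sketch; it assumes at least Oracle 11g, where REGEXP_SUBSTR gained the subexpression argument) locates the key by name instead of by position, so it keeps working if the keys are reordered:
select regexp_substr(json, '"PAYMENT_RECEIPT_NO"\s*:\s*"([^"]+)"', 1, 1, null, 1) as payment_receipt_no
from tab;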
Native JSON functions are available from Oracle Database 12c onwards; for previous releases, the APEX_JSON package (shipped with APEX 5.0+) should be installed. Once it is installed, the JSON can be converted to an XML data type through the APEX_JSON.TO_XMLTYPE() function in order to extract the desired values:
WITH t AS
(
SELECT SerialId, APEX_JSON.TO_XMLTYPE(Ext_Attr) AS xml_data
FROM tab
)
SELECT SerialId, Payment_Receipt_No
FROM t
CROSS JOIN
XMLTABLE('/json'
PASSING xml_data
COLUMNS
Payment_Receipt_No VARCHAR2(100) PATH 'PAYMENT_RECEIPT_NO'
)
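For completeness, on Oracle 12c and later (so not the asker's outdated version) the native JSON_VALUE function extracts the key directly; a sketch, with receipts standing in for the question's unnamed table:
SELECT SerialId,
JSON_VALUE(Ext_Attr, '$.PAYMENT_RECEIPT_NO') AS Payment_Receipt_No
FROM receipts;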

Using Postgres JSON Functions on table columns

I have searched extensively (in Postgres docs and on Google and SO) to find examples of JSON functions being used on actual JSON columns in a table.
Here's my problem: I am trying to extract key values from an array of JSON objects in a column, using jsonb_to_recordset(), but get syntax errors. When I pass the object literally to the function, it works fine:
Passing JSON literally:
select *
from jsonb_to_recordset('[
{ "id": 0, "name": "400MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"},
{ "id": 0, "name": "1000MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"}
]') as f(name text);
results in:
400MB-PDF.pdf
1000MB-PDF.pdf
It extracts the value of the key "name".
Here's the JSON in the column, being extracted using:
select journal.data::jsonb#>>'{context,data,files}'
from journal
where id = 'ap32bbofopvo7pjgo07g';
resulting in:
[ { "id": 0, "name": "400MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"},
{ "id": 0, "name": "1000MB-PDF.pdf", "extension": ".pdf",
"transferId": "ap31fcoqcajjuqml6rng"}
]
But when I try to pass jsonb#>>'{context,data,files}' to jsonb_to_recordset() like this:
select id,
journal.data::jsonb#>>::jsonb_to_recordset('{context,data,files}') as f(name text)
from journal
where id = 'ap32bbofopvo7pjgo07g';
I get a syntax error. I have tried different ways but each time it complains about a syntax error:
Version:
PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
The expressions after select must evaluate to a single value. Since jsonb_to_recordset returns a set of rows and columns, you can't use it there.
The solution is a cross join lateral, which allows you to expand one row into multiple rows using a function. That gives you single rows that select can act on. For example:
select *
from journal j
cross join lateral
jsonb_to_recordset(j.data#>'{context, data, files}') as d(id int, name text)
where j.id = 'ap32bbofopvo7pjgo07g'
Note that the #>> operator returns type text, and the #> operator returns type jsonb. As jsonb_to_recordset expects jsonb as its first parameter I'm using #>.
See it working at rextester.com
jsonb_to_recordset is a set-valued function and can only be invoked in specific places. The FROM clause is one such place, which is why your first example works, but the SELECT clause is not.
In order to turn your JSON array into a "table" that you can query, you need to use a lateral join. The effect is rather like a foreach loop on the source recordset, and that's where you apply the jsonb_to_recordset function. Here's a sample dataset:
create table jstuff (id int, val jsonb);
insert into jstuff
values
(1, '[{"outer": {"inner": "a"}}, {"outer": {"inner": "b"}}]'),
(2, '[{"outer": {"inner": "c"}}]');
A simple lateral join query:
select id, r.*
from jstuff
join lateral jsonb_to_recordset(val) as r("outer" jsonb) on true;
id | outer
----+----------------
1 | {"inner": "a"}
1 | {"inner": "b"}
2 | {"inner": "c"}
(3 rows)
That's the hard part. Note that you have to define what your new recordset looks like in the AS clause -- since each element in our val array is a JSON object with a single field named "outer", that's what we give it. If your array elements contain multiple fields you're interested in, you declare those in a similar manner. Be aware also that your JSON schema needs to be consistent: if an array element doesn't contain a key named "outer", the resulting value will be null.
From here, you just need to pull the specific value you need out of each JSON object using the traversal operator as you were. If I wanted only the "inner" value from the sample dataset, I would specify select id, r.outer->>'inner'. Since it's already JSONB, it doesn't require casting.
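Putting that last step into the sample query gives, for instance:
select id, r."outer"->>'inner' as inner_value
from jstuff
join lateral jsonb_to_recordset(val) as r("outer" jsonb) on true;
Here "outer" is double-quoted only because it happens to be an SQL keyword; an ordinary field name would not need the quotes.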