How to get JSON value from varchar field - sql

*outdated Oracle version
I have a table for receipt data.
I want to get some data from field EXT_ATTR. such as PAYMENT_RECEIPT_NO
The field "EXT_ATTR" is varchar(4000) stored JSON value
SerialId | EXT_ATTR
1 |
{
"PAYMENT_RECEIPT_NO": "PS00000000000000001",
"IS_CORPOR": "1",
"POSTCODE1": "51000",
"POSTCODE2": "51000",
"BILLADDR1PART1": "BILLADDR1PART1_DATA",
"BILLADDR1PART2": "BILLADDR1PART2_DATA",
"NEED_PRINT_WHT": "1",
"WHT_AMT": "0",
"TRXAMT": "2340600",
"LOCATIONID": "02140",
"PAYMENT_METHOD_NAME": "Cash",
"WITH_TAX": "1"
}
2 |
{
"PAYMENT_RECEIPT_NO": "PS00000000000000055",
"IS_CORPOR": "1",
"POSTCODE1": "51000",
"POSTCODE2": "51000",
"BILLADDR1PART1": "BILLADDR1PART1_DATA",
"BILLADDR1PART2": "BILLADDR1PART2_DATA",
"NEED_PRINT_WHT": "1",
"WHT_AMT": "0",
"TRXAMT": "2340600",
"LOCATIONID": "02140",
"PAYMENT_METHOD_NAME": "Cash",
"WITH_TAX": "1"
}
How can I extract varchar filed to get only value.
SerialId | PAYMENT_RECEIPT_NO
1 | PS00000000000000001
2 | PS00000000000000055
Thank you very much.

to work with json documents you can use PL/JSON
if you want to parse it without json Tools, than you can use substr, instr function in Oracle.
depending on what your string looks like, you have to adjust string positions.
create table tab (json varchar2(1000));
insert into tab values('{"PAYMENT_RECEIPT_NO": "PS00000000000000001","IS_CORPOR": "1","POSTCODE1": "51000","POSTCODE2": "51000","BILLADDR1PART1": "BILLADDR1PART1_DATA","BILLADDR1PART2": "BILLADDR1PART2_DATA","NEED_PRINT_WHT": "1","WHT_AMT": "0","TRXAMT": "2340600","LOCATIONID": "02140","PAYMENT_METHOD_NAME": "Cash","WITH_TAX": "1"}');
insert into tab values('{"PAYMENT_RECEIPT_NO": "PS00000000000000055","IS_CORPOR": "1","POSTCODE1": "51000","POSTCODE2": "51000","BILLADDR1PART1": "BILLADDR1PART1_DATA","BILLADDR1PART2": "BILLADDR1PART2_DATA","NEED_PRINT_WHT": "1","WHT_AMT": "0","TRXAMT": "2340600","LOCATIONID": "02140","PAYMENT_METHOD_NAME": "Cash","WITH_TAX": "1"}');
select substr(json,instr(json,': ',1,1)+3,instr(json,',',1,1)-instr(json,': ',1,1)-4)
from tab;
| SUBSTR(JSON,INSTR(JSON,':',1,1)+3,INSTR(JSON,',',1,1)-INSTR(JSON,':',1,1)-4) |
| :--------------------------------------------------------------------------- |
| PS00000000000000001 |
| PS00000000000000055 |
db<>fiddle here

JSON functions are defined for Database Oracle12c+ version. APEX_JSON package with release 5.0+ should be installed for the previous releases. Whenever installation complete, then the following code might be used as an XML data type manner through APEX_JSON.TO_XMLTYPE() function in order to extract the desired values :
WITH t AS
(
SELECT SerialId, APEX_JSON.TO_XMLTYPE(Payment_Receipt_No) AS xml_data
FROM tab
)
SELECT SerialId, Payment_Receipt_No
FROM t
CROSS JOIN
XMLTABLE('/json'
PASSING xml_data
COLUMNS
Payment_Receipt_No VARCHAR2(100) PATH 'PAYMENT_RECEIPT_NO'
)

Related

Change null to empty array in databricks SQL?

I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer schema of JSON column, convert the column from JSON string to deeply nested structure, explode any arrays within. I am working in SQL with python-defined UDFs (json_exists() checks the schema to see if the key is possible to use, json_get() gets a key from the column or returns a default) and want to do the following:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM
JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.
Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?
Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.
Example table that causes error:
| ID | JSON_COL |
| 1 | {"_corrupt_record": null, "otherInfo": [{"test": 1, "from": 3}]} |
| 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
Example table that does not cause error:
| ID | JSON_COL |
| 1 | {"_corrupt_record": null, "array": [{"test": 1, "from": 3}]} |
| 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
This question seems like it might hold the answer, but I was not able to get anything working from it.
You can filter the table before calling json_get and explode, so that you only explode when json_get returns a non-null value:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM (
SELECT *
FROM JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
)

Get value from json dimensional array in oracle

I have below JSON from which i need to fetch the value of issuedIdentValue where issuedIdentType = PANCARD
{
"issuedIdent": [
{"issuedIdentType":"DriversLicense","issuedIdentValue":"9797979797979797"},
{"issuedIdentType":"SclSctyNb","issuedIdentValue":"078-01-8877"},
{"issuedIdentType":"PANCARD","issuedIdentValue":"078-01-8877"}
]
}
I can not hard-code the index value [2] in my below query as the order of these records can be changed. So want to get rid off any hardcoded index.
select json_value(
'{"issuedIdent": [{"issuedIdentType":"DriversLicense","issuedIdentValue":"9797979797979797"},{"issuedIdentType":"SclSctyNb","issuedIdentValue":"078-01-8877"}, {"issuedIdentType":"PANCARDSctyNb","issuedIdentValue":"078-01-8877"}]}',
'$.issuedIdent[2].issuedIdentValue'
) as output
from d1entzendev.ExternalEventLog
where
eventname = 'CustomerDetailsInqSVC'
and applname = 'digitalBANKING'
and requid = '4fe1fa1b-abd4-47cf-834b-858332c31618';
What changes will need to apply in json_value function to achieve the expected result
In Oracle 12c or higher, you can use JSON_TABLE() for this:
select value
from json_table(
'{"issuedIdent": [{"issuedIdentType":"DriversLicense","issuedIdentValue":"9797979797979797"},{"issuedIdentType":"SclSctyNb","issuedIdentValue":"078-01-8877"}, {"issuedIdentType":"PANCARD","issuedIdentValue":"078-01-8877"}]}',
'$.issuedIdent[*]' columns
type varchar(50) path '$.issuedIdentType',
value varchar(50) path '$.issuedIdentValue'
) t
where type = 'PANCARD'
This returns:
| VALUE |
| :---------- |
| 078-01-8877 |

How to store an array of date ranges in Postgres?

I am trying to build a schedule, I generate an array of objects on the client containing date ranges
[
{start: "2020-07-06 0:0", end: "2020-07-10 23:59"},
{start: "2020-07-13 0:0", end: "2020-07-17 23:59"}
]
I have a column of type daterange[] what is the proper way to format this data to insert it into my table?
This is what I have so far:
INSERT INTO schedules(owner, name, dates) VALUES (
1,
'work',
'{
{[2020-07-06 0:0,2020-07-10 23:59]},
{[2020-07-13 0:0,2020-07-17 23:59]}
}'
)
I think you want:
insert into schedules(owner, name, dates) values (
1,
'work',
array[
'[2020-07-06, 2020-07-11)'::daterange,
'[2020-07-13, 2020-07-18)'::daterange
]
);
Rationale:
you are using dateranges, so you cannot have time portions (for this, you would need tsrange instead); as your code stands, it seems like you want an inclusive lower bound and an exclusive upper bound (hence [ at the left side, and ) at the right side)
explicit casting is needed so Postgres can recognize the that array elements have the proper datatype (otherwise, they look like text)
then, you can surround the list of ranges with the array[] constructor
Demo on DB Fiddle:
owner | name | dates
----: | :--- | :----------------------------------------------------
1 | work | {"[2020-07-06,2020-07-11)","[2020-07-13,2020-07-18)"}

How do I identify problematic documents in S3 when querying data in Athena?

I have a basic Athena query like this:
SELECT *
FROM my.dataset LIMIT 10
When I try to run it I get an error message like this:
Your query has the following error(s):
HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"
How do I identify the S3 document that has the invalid field?
My documents are JSON.
My table looks like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
Inconsistent schema
Inconsistent schema is when values in some rows are of different data type. Let's assume that we have two json files
// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos", "age":"eleven"}
{"name":"1Fabiana", "age":22}
// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos", "age":11}
{"name":"2Fabiana", "age":22}
Then a simple query SELECT * FROM some_table will fail with
HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
However, we can exclude that file within WHERE clause
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
WHERE "$PATH" != 's3://path/to/bad.json'
Result:
source_s3_file | name | age
---------------------------------------
s3://path/to/good.json | 1Patrick | 35
s3://path/to/good.json | 1Carlos | 11
s3://path/to/good.json | 1Fabiana | 22
Of course, this is the best case scenario when we know which files are bad. However, you can employ this approach to somewhat manually infer which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.
SELECT
COUNT(*)
FROM some_table
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, that those files are good.
The obvious drawback of such approach is cost to execute query and time spent, especially if it is done file by file.
Malformed records
In the eyes of AWS Athena, good records are those which are formatted as a single JSON per line:
{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
AWS Athena supports OpenX JSON SerDe library which can be set to evaluate malformed records as NULL by specifying
-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
when you create table. Thus, the following query will reveal files with malformed records:
SELECT
DISTINCT("$PATH")
FROM "some_database"."some_table"
WHERE(
col_1 IS NULL AND
col_2 IS NULL AND
col_3 IS NULL
-- etc
)
Note: you can use only a single col_1 IS NULL if you are 100% sure that it doesn't contain empty fields other then in corrupted rows.
In general, malformed records are not that big of a deal provided that 'ignore.malformed.json' = 'true'. For example the following query will still succeed
For example if a file contains:
{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}
the following query will still succeed
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
Result:
source_s3_file | name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35 | North Street
2 s3://path/to/malformed.json| | |
3 s3://path/to/malformed.json| | |
4 s3://path/to/malformed.json| | |
5 s3://path/to/malformed.json| | |
6 s3://path/to/malformed.json| | |
7 s3://path/to/malformed.json| 2Fabiana | 22 | Main Street
While with 'ignore.malformed.json' = 'false' (which is the default behaviour) exactly the same query will throw an error
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

Hive JSON Serde MetaStore Issue

I have an external table with JSON data and I am using JsonSerde to populate data into the table. I am properly getting the data populated and when I query the data I am able to see the results correctly.
But,when I use desc command on that table I am getting from deserializer text for all the column comments.
Below is the table creation ddl.
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
field1 string COMMENT 'This is a field1',
field2 int COMMENT 'This is a field2',
field3 string COMMENT 'This is a field3',
field4 double COMMENT 'This is a field4'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
Location '/user/uszszb6/json_test/data';
Entries in the data file.
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
When I use use the command desc my_table, I get the below output.
+-----------+------------+--------------------+--+
| col_name | data_type | comment |
+-----------+------------+--------------------+--+
| field1 | string | from deserializer |
| field2 | int | from deserializer |
| field3 | string | from deserializer |
| field4 | double | from deserializer |
+-----------+------------+--------------------+--+
JsonSerde is not able to capture the comments properly. I have also tried with other JSONSerde like
org.openx.data.jsonserde.JsonSerDe
org.apache.hive.hcatalog.data.JsonSerDe
com.amazon.elasticmapreduce.JsonSerde
But desc command output is same. There is a JIRA ticket for this bug [https://issues.apache.org/jira/browse/HIVE-6681][1]
According to ticket it's resolved in version 0.13, I am using hive 1.2.1 but still I am facing this issue.
Could anyone share your thoughts on resolving this issue.
Yeah, it looks like it's an hive bug that affects all the Json SerDes, but have you tried using DESCRIBE EXTENDED ?
DESCRIBE EXTENDED my_table;
hive> describe extended json_serde_test;
OK
browser string from deserializer
device_uuid string from deserializer
custom struct<customer_id:string> from deserializer
Detailed Table Information
Table(tableName:json_serde_test,dbName:default, owner:rcongiu,
createTime:1448477902, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:browser, type:string,
comment:hello), FieldSchema(name:device_uuid, type:string, comment:my
name is elder price), FieldSchema(name:custom,
type:struct<customer_id:string>, comment:null)],
location:hdfs://localhost:9000/user/hive/warehouse/json_serde_test,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.openx.data.jsonserde.JsonSerDe, parameters:
{serialization.format=1, mapping.customer_id=Customer ID}),
bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{}), storedAsSubDirectories:false),
partitionKeys:[], parameters:{numFiles=1,
transient_lastDdlTime=1448477903, COLUMN_STATS_ACCURATE=true,
totalSize=128, numRows=0, rawDataSize=0}, viewOriginalText:null,
viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.073 seconds, Fetched: 5 row(s)
Will output a json-ish detailed description that includes comments..kind of hard to read but it is showing me the comments and may be enough for your purposes..or not.