How do I extract all of the values from a JSON element? - sql

How can I extract all of the values for the element account_code? The SELECT statement below lets me extract any single value associated with index [x], but I want to extract all the values (each in its own row) so that the output is:
account_codes
------------
1
2
3
SELECT
JSON_EXTRACT_SCALAR(v, '$.accounting[0].account_code') AS account_codes
FROM (VALUES JSON '
{"accounting":
[
{"account_code": "1", "account_name": "Travel"},
{"account_code": "2", "account_name": "Salary"},
{"account_code": "3", "account_name": "Equipment"},
]
}'
) AS t(v)

The operator you need is UNNEST, which flattens the array and fetches all the column values. Below are the DDL (in the Hive catalog) I used to create the table and the query that fetches all account codes.
DDL :
CREATE EXTERNAL TABLE `sf_73515497`(
`accounting` array<struct<account_code:string,account_name:string>> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='accounting')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://path-to-json-prefix/'
SQL with unnest:
WITH dataset AS (
SELECT accounting from "sf_73515497"
)
SELECT t.accounts.account_code FROM dataset
CROSS JOIN UNNEST(accounting) as t(accounts)
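
For completeness, the same flattening also works directly on the JSON literal from the question, without creating a table first. A minimal sketch, assuming the Athena/Presto engine: cast the accounting array to ARRAY(JSON) and unnest it.
WITH t(v) AS (
VALUES JSON '
{"accounting":
[
{"account_code": "1", "account_name": "Travel"},
{"account_code": "2", "account_name": "Salary"},
{"account_code": "3", "account_name": "Equipment"}
]
}'
)
SELECT
JSON_EXTRACT_SCALAR(elem, '$.account_code') AS account_codes --one row per array element
FROM t
CROSS JOIN UNNEST(CAST(JSON_EXTRACT(v, '$.accounting') AS ARRAY(JSON))) AS u(elem)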

Related

How would I convert this JSON stored in a database column into a table with columns and rows?

Given the below JSON data and table that holds the data, how can I query the values and write to a new table with rows and columns split?
Basic table that contains the JSON:
CREATE TABLE BI.DataTable
(
JsonDataText VARCHAR(MAX)
);
JSON data:
{
"datatable": {
"data": [
[
"ZSFH",
"99999",
"2022-08-31",
571106
],
[
"ZSFH",
"99999",
"2022-07-31",
578530
],
[
"ZSFH",
"99999",
"2022-06-30",
582233
],
[
"ZSFH",
"99999",
"2022-05-31",
581718
]
]
}
}
When I use the JSON_VALUE function, I get a null for each column.
SELECT
JSON_VALUE (JsonDataText, '$.data[0]') AS MetricCode,
JSON_VALUE (JsonDataText, '$.data[1]') AS RegionID,
JSON_VALUE (JsonDataText, '$.data[2]') AS ReportingPeriod,
JSON_VALUE (JsonDataText, '$.data[3]') AS MetricValue
FROM BI.DataTable
WHERE ISJSON(JsonDataText) > 0
Using OPENJSON() with the appropriate column definitions and an additional APPLY operator is a possible approach:
SELECT j.*
FROM DataTable d
OUTER APPLY OPENJSON(d.JsonDataText, '$.datatable.data') WITH (
MetricCode varchar(4) '$[0]',
RegionID varchar(5) '$[1]',
ReportingPeriod varchar(10) '$[2]',
MetricValue int '$[3]'
) j
WHERE ISJSON(d.JsonDataText) = 1
Result:
MetricCode  RegionID  ReportingPeriod  MetricValue
ZSFH        99999     2022-08-31       571106
ZSFH        99999     2022-07-31       578530
ZSFH        99999     2022-06-30       582233
ZSFH        99999     2022-05-31       581718
Note that the reasons for the NULL values returned by your current statement are:
JSON_VALUE() extracts a scalar value from a JSON string, but the $.datatable.data part of the stored JSON is a JSON array with JSON arrays as items, so in this situation you need to use JSON_QUERY().
The $.data[0] path expression is wrong.
The following example (using JSON_QUERY() and the correct path) extracts individual items from the $.datatable.data JSON array:
SELECT
JSON_QUERY(JsonDataText, '$.datatable.data[0]') AS MetricCode,
JSON_QUERY(JsonDataText, '$.datatable.data[1]') AS RegionID,
JSON_QUERY(JsonDataText, '$.datatable.data[2]') AS ReportingPeriod,
JSON_QUERY(JsonDataText, '$.datatable.data[3]') AS MetricValue
FROM DataTable
WHERE ISJSON(JsonDataText) > 0
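
Finally, to write the shredded rows into a new table (the original goal), the OPENJSON() query above can feed an INSERT. A minimal sketch, assuming a hypothetical pre-created target table BI.MetricsTable with matching columns:
INSERT INTO BI.MetricsTable (MetricCode, RegionID, ReportingPeriod, MetricValue)
SELECT j.MetricCode, j.RegionID, j.ReportingPeriod, j.MetricValue
FROM BI.DataTable d
CROSS APPLY OPENJSON(d.JsonDataText, '$.datatable.data') WITH (
MetricCode varchar(4) '$[0]',
RegionID varchar(5) '$[1]',
ReportingPeriod varchar(10) '$[2]',
MetricValue int '$[3]'
) j
WHERE ISJSON(d.JsonDataText) = 1;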

Json Arrays of objects PostgreSQL Table format

I have a JSON file (an array of objects) which I have to convert into table format using a PostgreSQL query.
See the sample data below.
"b", "c", "d", "e" are to be extracted as separate tables, since each is an array containing objects.
I have tried using json_populate_recordset() but it only works if I have a single array.
[{a:"1",b:"2"},{a:"10",b:"20"}]
I have referred to some links and code samples:
jsonb_array_element example
postgreSQL functions
Sample Data:
{
"b":[
{columnB1:value, columnB2:value},
{columnB1:value, columnB2:value},
],
"c":[
{columnC1:value, columnC2:value, columnC3:value},
{columnC1:value, columnC2:value, columnC3:value},
{columnC1:value, columnC2:value, columnC3:value}
],
"d":[
{columnD1:value, columnD2:value},
{columnD1:value, columnD2:value},
],
"e":[
{columnE1:value, columnE2:value},
]
}
Expected output:
b should be one table in which columnB1 and columnB2 are displayed with their values.
Similarly tables c, d, e with their respective columns and values.
You can use jsonb_to_recordset(), but you need to unnest your JSON first. You need to do this inline, as this is a JSON processing function which cannot use derived values.
I am using validated JSON, as simplified and formatted at the end of this answer.
To unnest your JSON, use the notation below, which extracts a JSON object field with the given key.
--one level
select '{"a":1}'::json->'a'
result : 1
--two levels
select '{"a":{"b":[2]}}'::json->'a'->'b'
result : [2]
We now expand this to include json_to_recordset()
select * from
json_to_recordset(
'{"a":{"b":[{"f1":2,"f2":4},{"f1":3,"f2":6}]}}'::json->'a'->'b' --inner table b
)
as x("f1" int, "f2" int); --fields from table b
or use json_array_elements(). Either way we need to list our fields. With the second solution the type will be json, not int, so you can't sum etc.
with b as (select json_array_elements('{"a":{"b":[{"f1":2,"f2":4},{"f1":3,"f2":6}]}}'::json->'a'->'b') as jx)
select jx->'f1' as f1, jx->'f2' as f2 from b;
Output
f1 f2
2 4
3 6
We now use your data structure in jsonb_to_recordset()
select * from
jsonb_to_recordset(
'{"a":{"b":[{"columnname1b":"value1b","columnname2b":"value2b"},{"columnname1b":"value","columnname2b":"value"}],"c":[{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"}]}}'::jsonb->'a'->'b' --inner table b
)
as x(columnname1b text, columnname2b text);
Output:
columnname1b columnname2b
value1b value2b
value value
For table c
select * from
jsonb_to_recordset(
'{"a":{"b":[{"columnname1b":"value1b","columnname2b":"value2b"},{"columnname1b":"value","columnname2b":"value"}],"c":[{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"},{"columnname1":"value","columnname2":"value"}]}}'::jsonb->'a'->'c' --inner table c
)
as x(columnname1 text, columnname2 text);
Output
columnname1 columnname2
value value
value value
value value
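
The same pattern extends to the remaining arrays. A sketch for table d, using the key names from the question's sample data (the values and the JSON snippet here are hypothetical):
select * from
jsonb_to_recordset(
'{"d":[{"columnD1":"x1","columnD2":"y1"},{"columnD1":"x2","columnD2":"y2"}]}'::jsonb->'d' --inner table d
)
as x("columnD1" text, "columnD2" text); --quoted so the mixed-case names match the JSON keys exactly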
Sample JSON
{
"a": {
"b": [
{
"columnname1b": "value1b",
"columnname2b": "value2b"
},
{
"columnname1b": "value",
"columnname2b": "value"
}
],
"c": [
{
"columnname1": "value",
"columnname2": "value"
},
{
"columnname1": "value",
"columnname2": "value"
},
{
"columnname1": "value",
"columnname2": "value"
}
]
}
}
Well, I came up with some ideas, here is one that worked. I was able to get one table at a time.
https://www.postgresql.org/docs/9.5/functions-json.html
I am using json_populate_recordset.
The column used in the first SELECT statement comes from a table whose column is of JSON type, which we are trying to extract into a table.
The 'tablename from column' key passed to json_populate_recordset() is the array we are trying to extract, followed by b with its column names and data types.
WITH input AS(
SELECT cast(column as json) as a
FROM tablename
)
SELECT b.*
FROM input c,
json_populate_recordset(NULL::record,c.a->'tablename from column') as b(columnname1 datatype, columnname2 datatype)
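
For instance, applied to the b array from the question, the pattern would look like this (my_json_table and its doc column are hypothetical names standing in for your table and JSON column):
WITH input AS(
SELECT cast(doc as json) as a
FROM my_json_table
)
SELECT b.*
FROM input c,
json_populate_recordset(NULL::record,c.a->'b') as b("columnB1" text, "columnB2" text)
--column names are quoted so they match the mixed-case JSON keys exactly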

Amazon Athena: create request with partitions

I created a table with partitions: first by year, then month, then day.
Question: I want to get the data for 12/2017 and 03/2018; how can I do this?
What I think I should do:
where (year='2017' and month='12') and ( year ='2018' and month='03')
Is it correct? Won't there be confusion because of the and operator, so that Amazon Athena returns data for:
12/2017 and 03/2018 and 03/2017 and 12/2018?
PS: I can't test it; I only have a free account.
Thanks.
Anyway, I tried it on a small data set and found that Amazon Athena does take the parentheses into account.
My test was as follows.
The DDL of the table as generated:
CREATE EXTERNAL TABLE `manyands`(
`years` int COMMENT 'from deserializer',
`months` int COMMENT 'from deserializer',
`days` int COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/'
My tests:
1- SELECT * FROM "atlasdatabase"."manyands" where month='1';
I got, in CSV format:
"years","months","days","year","month"
"2017","1","21","2017","1"
"2018","1","81","2018","1"
2- SELECT * FROM "atlasdatabase"."manyands" where month='1' and year='2017';
"years","months","days","year","month"
"2017","1","21","2017","1"
3- SELECT * FROM "atlasdatabase"."manyands" where (month='1' and year='2018') and (month='3' and year='2017') ;
empty (zero records returned)
4- SELECT * FROM "atlasdatabase"."manyands" where (month='1' and year='2018') or (month='3' ) ;
"years","months","days","year","month"
"2018","1","81","2018","1"
"2017","3","73","2017","3"
"2018","3","73","2018","3"
Conclusion: add the OR operator between the different partition predicates.
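Applied to the original question, the filter would therefore be written like this (a sketch against the year and month partition keys from the DDL above):
SELECT *
FROM "atlasdatabase"."manyands"
WHERE (year = '2017' AND month = '12')
   OR (year = '2018' AND month = '03');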

How to query data from Amazon S3

I am looking to create a Tableau dashboard with data originating in Amazon DynamoDB. Right now I am sending the data to an Amazon S3 bucket using AWS Lambda, and I am getting this file in the S3 bucket:
{
"Items": [
{
"payload": {
"phase": "T",
"tms_event": "2017-03-16 18:19:50",
"id_UM": 0,
"num_severity_level": 0,
"event_value": 1,
"int_status": 0
},
"deviceId": 6,
"tms_event": "2017-03-16 18:19:50"
}
]
}
I am trying to use Amazon Athena to create a connection with Tableau, but the payload attribute is giving me problems and I am not getting any results when I run the SELECT query.
This is the Athena table:
CREATE EXTERNAL TABLE IF NOT EXISTS default.iot_table_test (
`payload` map<string,string>,
`deviceId` int,
`tms_event` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://iot-logging/'
TBLPROPERTIES ('has_encrypted_data'='false')
Thanks,
Alejandro
Your table does not look like it matches your data, because your data has a top-level Items array. Without restructuring the JSON data files, I think you would need a table definition like this:
CREATE EXTERNAL TABLE IF NOT EXISTS default.iot_table_test_items (
`Items` ARRAY<
STRUCT<
`payload`: MAP<string, string>,
`deviceId`: int,
`tms_event`: string
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://iot-logging/'
TBLPROPERTIES ('has_encrypted_data'='false')
and then query it by unnesting the Items array:
SELECT
item.deviceId,
item.tms_event,
item.payload
FROM
default.iot_table_test_items
CROSS JOIN UNNEST (Items) AS i (item)
LIMIT 10;
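
Since payload is declared as map<string,string>, individual fields can then be pulled out with the map subscript operator. A small sketch along the same lines (key names taken from the sample document above):
SELECT
item.deviceId,
item.payload['phase'] AS phase,
item.payload['event_value'] AS event_value
FROM
default.iot_table_test_items
CROSS JOIN UNNEST (Items) AS i (item)
LIMIT 10;
Note that subscripting fails on missing keys; element_at(item.payload, 'phase') returns NULL instead, which is safer if some documents omit a field.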

Store multiple elements in json files in AWS Athena

I have some JSON files stored in an S3 bucket, where each file has multiple elements of the same structure. For example,
[{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}},{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}},{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}]
I want to create a table in Athena corresponding to above data.
The query I wrote for creating the table:
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.elb_logs2 (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'field.delim' = ' '
) LOCATION 's3://<bucketname>/';
But if I do a SELECT query as follows,
SELECT * FROM sampledb.elb_logs4;
I get the following result:
1 {"eventid":"1","eventversion":"1.0","image":{"id":"101","message":"New item!"},"eventsource":"aws:dynamodb","eventname":"INSERT","awsregion":"us-west-2"} {"eventid":"2","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"MODIFY","awsregion":"us-west-2"} {"eventid":"3","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"REMOVE","awsregion":"us-west-2"}
The entire content of the JSON file is picked up as one entry here.
How can I read each element of the JSON file as one entry?
Edit: How can I read each subcolumn of image, i.e., each element of the map?
Thanks.
Question 1: Store multiple elements in JSON files for AWS Athena
I need to rewrite my JSON file as:
{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}, {"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}, {"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
That means
Remove the square brackets [ ] Keep each element in one line
{.....................}
{.....................}
{.....................}
Question 2: Access nested JSON attributes
CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` struct <`Id` : string,
`Message` : string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
"dots.in.keys" = "true"
) LOCATION 's3://exampletablewithstream-us-west-2/';
Query:
select image.Id, image.message from <tablename>;
Ref:
http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/
https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords