I am uploading data in the JSON format below to an AWS S3 bucket.
{
"time": 1663090620000,
"data": [
[
1,
[
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
]
]
}
The fields in my object have the following data types:
time - Long
data - List<List<Object>>
I tried creating the schema below in Athena with partition projection, pointing it at the S3 bucket, but I am not able to map the data field in my JSON to the Athena-supported data types - https://docs.aws.amazon.com/athena/latest/ug/data-types.html.
CREATE EXTERNAL TABLE `test_data_db`.`test_data_table`(
`time` bigint COMMENT 'from deserializer',
`data` array<array<double>> COMMENT 'from deserializer')
PARTITIONED BY (
`id` string,
`creation_date` date
)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://test-data-bucket/test-data'
TBLPROPERTIES (
'has_encrypted_data'='false',
'projection.creation_date.format'='yyyy-MM-dd',
'projection.creation_date.range'='2010-01-01,2050-12-31',
'projection.creation_date.type'='date',
'projection.enabled'='true',
'projection.id.type'='injected',
'storage.location.template'='s3://test-data-bucket/test-data/id=${id}/creation_date=${creation_date}',
'transient_lastDdlTime'='1663357524')
Could you please help me map the data field in my JSON to an Athena data type?
I have gone through your table definition and the sample data you provided and understood the problem. The actual data has three levels of nested arrays, whereas your table definition declares only two.
So I changed data array<array<double>> to
data array<array<array<double>>> and was then able to query the data field fine. Whether the element type should be double or int is up to you, based on your use case.
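For reference, the revised column definition looks like this; only the data line changes, and the partitioning, SerDe, location, and projection properties stay exactly as in the DDL from the question:
CREATE EXTERNAL TABLE `test_data_db`.`test_data_table`(
  `time` bigint COMMENT 'from deserializer',
  -- three levels of array<> to match the three levels of nesting in "data"
  `data` array<array<array<double>>> COMMENT 'from deserializer')
-- PARTITIONED BY / ROW FORMAT / LOCATION / TBLPROPERTIES as in the question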
How can I extract all of the values for the element account_code? The SELECT statement below lets me extract any single value associated with index [x], but I want to extract all the values (each in its own row) so that the output is:
account_codes
------------
1
2
3
SELECT
JSON_EXTRACT_SCALAR(v, '$.accounting[0].account_code') AS account_codes
FROM (VALUES JSON '
{"accounting":
[
{"account_code": "1", "account_name": "Travel"},
{"account_code": "2", "account_name": "Salary"},
{"account_code": "3", "account_name": "Equipment"},
]
}'
) AS t(v)
The operator you need is UNNEST, which flattens the array and fetches all the values. Below are the DDL (in the Hive catalog) I used to create the table and the query that fetches all account codes.
DDL:
CREATE EXTERNAL TABLE `sf_73515497`(
`accounting` array<struct<account_code:string,account_name:string>> COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'paths'='accounting')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://path-to-json-prefix/'
SQL with UNNEST:
WITH dataset AS (
SELECT accounting from "sf_73515497"
)
SELECT t.accounts.account_code FROM dataset
CROSS JOIN UNNEST(accounting) as t(accounts)
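If you want to experiment without creating a table first, the same UNNEST pattern can be applied directly to the inline JSON literal from the question by casting the accounting array to an array of maps; a minimal sketch:
SELECT account['account_code'] AS account_codes
FROM (VALUES JSON '
{"accounting":
  [
    {"account_code": "1", "account_name": "Travel"},
    {"account_code": "2", "account_name": "Salary"},
    {"account_code": "3", "account_name": "Equipment"}
  ]
}'
) AS t(v)
-- cast the JSON array to ARRAY(MAP(...)) so UNNEST can flatten it into one row per element
CROSS JOIN UNNEST(
  CAST(json_extract(v, '$.accounting') AS ARRAY(MAP(VARCHAR, VARCHAR)))
) AS u(account)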
We have a requirement to move data from Snowflake to Hive. I am able to unload data from Snowflake to AWS S3 and run MSCK REPAIR on the Hive table.
But all records come back as NULL in Hive. What could be the reason? Is there anything wrong here?
To check that the Parquet files were created correctly, I read one with Spark, and it reads fine.
##Snowflake
create or replace stage dev_zone.DAILY_LOG url= 's3://myc-mlb-alpha-us-east-1-drg-322t232/hive/rs_hive_008_test1' storage_integration = DEV_HIVE_INTEGRATION file_format = (type = 'parquet') ENCRYPTION = (TYPE = 'AWS_SSE_S3');
copy into @dev_zone.DAILY_LOG from (select * from dev_zone.DAILY_LOG limit 100) partition by ('as_on_date=' || as_on_date);
##Hive
CREATE EXTERNAL TABLE dev_zone.DAILY_LOG(
dim_id decimal(38,0),
card_type string,
type string,
cntry string)
PARTITIONED BY (
as_on_date date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://myc-mlb-alpha-us-east-1-drg-322t232/hive/rs_hive_008_test1'
What I missed was adding header = true to the COPY command. Without it, Snowflake writes the Parquet files with auto-generated column names instead of the table's column names, so Hive, which resolves Parquet columns by name, matches nothing and returns NULL for every column.
copy into @dev_zone.DAILY_LOG from (select * from dev_zone.DAILY_LOG limit 100) partition by ('as_on_date=' || as_on_date) header = true;
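After re-running the unload with header = true, refresh the partition metadata and spot-check the data; a quick sanity check against the table from the question:
MSCK REPAIR TABLE dev_zone.DAILY_LOG;
-- the columns should now return real values instead of NULL
SELECT dim_id, card_type, type, cntry, as_on_date
FROM dev_zone.DAILY_LOG
LIMIT 10;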
I have this JSON:
'[{key=342, value=someword}, {key=317, value=anotherword}, {key=229, value=yetanotherword}]'
I want to get the value where the key equals 317 via SQL in Amazon Athena. In other words, I want the output to be 'anotherword'. So I tried:
SELECT * where json_extract('[{key=342, value=someword}, {key=317, value=anotherword}, {key=229, value=yetanotherword}]', '$.key')=317;
How can I get the value where the key is equal to 317 via SQL in Amazon Athena?
I am assuming you have the JSON data stored in an S3 bucket; the steps below should get you the expected result.
1) I created a sample JSON file "sampleJson.json" and uploaded it to an S3 bucket.
2) Create an external table that points to the S3 prefix where you stored the "sampleJson.json" file(s):
CREATE EXTERNAL TABLE JsonExtTable(
sampledata string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<s3 bucket path>'
3) Query the external table created above to get the expected result:
SELECT
    items['key'] AS key,
    items['value'] AS value
FROM JsonExtTable
CROSS JOIN UNNEST(CAST(json_extract(sampledata, '$') AS ARRAY<MAP<VARCHAR, VARCHAR>>)) AS t (items)
WHERE items['key'] = '317';
Query result:
 key | value
-----+--------------
 317 | anotherword
The steps above should give you the exact solution, or at least a clue toward it.
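If you prefer to avoid the CROSS JOIN, a lambda-based alternative filters the casted array and reads the value directly; a sketch against the same table, assuming the stored data is valid JSON with quoted keys and values:
SELECT element_at(
         element_at(
           -- keep only the entries whose key is 317
           filter(
             CAST(json_extract(sampledata, '$') AS ARRAY(MAP(VARCHAR, VARCHAR))),
             item -> item['key'] = '317'
           ),
           1            -- first (and only) matching entry; NULL if none
         ),
         'value'
       ) AS value
FROM JsonExtTable;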
I have some JSON files stored in an S3 bucket, where each file has multiple elements of the same structure. For example,
[{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}},{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}},{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}]
I want to create a table in Athena corresponding to the above data.
The query I wrote to create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.elb_logs2 (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'field.delim' = ' '
) LOCATION 's3://<bucketname>/';
But when I run a SELECT query as follows,
SELECT * FROM sampledb.elb_logs2;
I get the following result:
1 {"eventid":"1","eventversion":"1.0","image":{"id":"101","message":"New item!"},"eventsource":"aws:dynamodb","eventname":"INSERT","awsregion":"us-west-2"} {"eventid":"2","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"MODIFY","awsregion":"us-west-2"} {"eventid":"3","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"REMOVE","awsregion":"us-west-2"}
The entire content of the JSON file is picked up as a single row here.
How can I read each element of the JSON file as a separate row?
Edit: And how can I read each subcolumn of image, i.e., each element of the map?
Thanks.
Question 1: Storing multiple elements in JSON files for AWS Athena
You need to rewrite your JSON file as:
{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}
{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
That means: remove the square brackets [ ] and the commas between elements, and keep each element on its own line:
{.....................}
{.....................}
{.....................}
Question 2: Accessing nested JSON attributes
CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` struct <`Id` : string,
`Message` : string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
"dots.in.keys" = "true"
) LOCATION 's3://exampletablewithstream-us-west-2/';
Query:
select image.Id, image.Message from <tablename>;
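To address the edit in the question: if you instead keep image declared as map<string,string>, as in the original DDL, individual elements of the map are read with subscript syntax. Note that the OpenX SerDe lowercases keys by default (visible in the SELECT output above), so the lookups use lowercase keys here:
select image['id'] as id, image['message'] as message from <tablename>;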
Ref:
http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/
https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords