Store multiple elements in json files in AWS Athena - sql

I have some JSON files stored in an S3 bucket, where each file has multiple elements of the same structure. For example,
[{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}},{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}},{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}]
I want to create a table in Athena corresponding to the above data.
This is the query I wrote to create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.elb_logs2 (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'field.delim' = ' '
) LOCATION 's3://<bucketname>/';
But if I do a SELECT query as follows,
SELECT * FROM sampledb.elb_logs2;
I get the following result:
1 {"eventid":"1","eventversion":"1.0","image":{"id":"101","message":"New item!"},"eventsource":"aws:dynamodb","eventname":"INSERT","awsregion":"us-west-2"} {"eventid":"2","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"MODIFY","awsregion":"us-west-2"} {"eventid":"3","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"REMOVE","awsregion":"us-west-2"}
The entire content of the json file is picked as one entry here.
How can I read each element of json file as one entry?
Edit: How can I read each subcolumn of image, i.e., each element of the map?
Thanks.

Question 1: Store multiple elements in JSON files for AWS Athena
I need to rewrite my JSON file as
{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}
{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
That means:
Remove the square brackets [ ].
Keep each element on its own line.
{.....................}
{.....................}
{.....................}
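If the files are generated by a script, the conversion can be done before uploading to S3. The following is only a minimal sketch of that rewrite, assuming the whole array fits in memory; the function name array_to_json_lines is made up for illustration and is not part of any AWS tooling.

```python
import json

def array_to_json_lines(text):
    """Convert a JSON array of objects into one compact JSON object per line."""
    records = json.loads(text)
    # Compact separators keep every record on a single physical line.
    return "\n".join(json.dumps(record, separators=(",", ":")) for record in records)

source = '[{"eventId":"1","image":{"Message":"New item!","Id":101}},{"eventId":"2","image":{"Message":"This item has changed","Id":101}}]'
converted = array_to_json_lines(source)
print(converted)
```

The printed text has no enclosing brackets and one element per line, which is the layout described above.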
Question 2: Access nested JSON attributes
Declare image as a struct instead of a map, so that its fields can be addressed with dot notation:
CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` struct <`Id` : string,
`Message` : string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
"dots.in.keys" = "true"
) LOCATION 's3://exampletablewithstream-us-west-2/';
Query:
select image.Id, image.message from <tablename>;
Ref:
http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/
https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords

Related

Athena returns more rows than are actually present in the data source in S3

I have a data source stored in S3. When I query it in Athena and count the total number of rows, I get more rows than are present in the CSV file stored in S3.
I have also set a separate path for the Athena query results, i.e. different from the S3 folder that holds the data source.
Please help me understand why Athena returns extra rows with unknown values in them, which creates discrepancies in the data.
Below is the query I wrote to create the table in Athena:
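import boto3

# Assumed setup: the original post does not show how athena_client was created
athena_client = boto3.client('athena')
athena_client.start_query_execution(QueryString='create database cms_data',ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})
# Tables created for Athena
context = {'Database': 'cms_data'}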
athena_client.start_query_execution(QueryString='''CREATE EXTERNAL TABLE IF NOT EXISTS `cms_data`.`mpf_data` (
`State` String,
`County` String,
`Org_Name` String,
`Contract_ID` String,
`Plan_ID` double,
`Segment_ID` double,
`Plan_Type_Desc` String,
`Contract_Year` double,
`Category_Name` String,
`Service_Name` String,
`Limit_Flag` double,
`Authorization_Flag` double,
`Referral_Flag` double,
`Network_Description` String,
`Cost_Share` String )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://cms-dashboard-automation/MPF_Data/'
TBLPROPERTIES ('has_encrypted_data'='false');
''',QueryExecutionContext = context,ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})

AWS Athena: line 11:1: mismatched input 'PARTITIONED'. Expecting: 'COMMENT', 'WITH', <EOF> while creating table

I am trying to create an empty table in S3, with columns of specified data types, using a command run in Athena, but it throws the following error:
line 11:1: mismatched input 'PARTITIONED'. Expecting: 'COMMENT', 'WITH', <EOF>
The query I'm executing is the following:
CREATE TABLE IF NOT EXISTS boards_raw_fields_v1 (
"uuid" bigint,
"source" string,
"raw_company_name" string,
"raw_contract_type" string,
"raw_employment_type" string,
"raw_working_hours_type" string,
"raw_all_locations" array<string>,
"raw_categories" array<string>,
"raw_industry" string)
PARTITIONED BY (
"year" string,
"month" string,
"day" string,
"hour" string,
"version" bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://my-data/datalake-raw/boards/raw-fields/v1'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='snappy',
'projection.enabled'='false',
'typeOfData'='file')
In the URI:
s3://my-data/datalake-raw/boards/raw-fields/v1
IMPORTANT NOTE: All the folders are currently created except the last one v1
What am I doing wrong here in the process of creating the table?

Import Logs from S3 to Athena using Regex

I have exported CloudWatch logs to S3 and now want to import those logs into Athena. The format of the logs is as follows (only one log entry is pasted for reference):
2021-07-30T14:30:22.937Z RequestId INFO {"_logLevel":"debug","msg":"Start: Calling All the Data Associates Function","timestamp":1627655422937,"EventSubCategory":"AppSyncService","API":"AppSyncService","function":"XXXXXXXXXXXXXXXXX","Correlation_Id":"XXXXXXXXXXXXXXXXX"}
I am using a regular expression to import the log and using the following query to create the table.
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n'
)
LOCATION 's3://logs/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
Regular Expression:
^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n
There are four columns in the table, and the regular expression also creates four groups and works as I expect. However, the table is still empty when queried.
Can anyone please help to resolve this issue?
I think your problem is that backslashes in the regex need to be double-escaped, and the pattern should end with $ rather than match a newline. Try this pattern:
'input.regex' = '^(.*?)\\t(.*?)\\t(.*?)\\t([\\s\\S]*?)$'
You can see an example in the official docs.
Also, the pattern [\s\S] could be replaced by . (\S means everything not matched by \s, so together they match anything).
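To check the grouping before creating the table, the pattern can be tried locally. This is only a rough sketch: Athena's RegexSerDe uses Java regular expressions, so Python's re module is just a stand-in for checking the logic, and the sample line assumes the fields are tab-separated, as the \t in the pattern implies. The name split_log_line is made up for illustration.

```python
import re

# Same four-group pattern as above, with . in place of [\s\S] and $ instead of \n.
# In a Python raw string the backslashes are written once; the doubling is only
# needed inside the Athena DDL string literal.
LOG_PATTERN = re.compile(r"^(.*?)\t(.*?)\t(.*?)\t(.*?)$")

def split_log_line(line):
    """Return (time, requestid, loglevel, message), or None if the line does not match."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    return match.groups() if match else None

# Shortened sample line, assuming tab separators between the four fields.
sample = '2021-07-30T14:30:22.937Z\tRequestId\tINFO\t{"_logLevel":"debug","msg":"Start"}\n'
fields = split_log_line(sample)
print(fields)
```

A line that returns None here has no match for the pattern at all, so it would not populate the four columns as intended.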
An alternative to the regex serde is Grok, which is less error-prone to write. Using the Grok serde, I think this table would work for you:
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string,
`requestid` string,
`loglevel` string,
`message` string
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format' = '%{TIMESTAMP_ISO8601:time} %{NOTSPACE:requestid} %{NOTSPACE:loglevel} %{GREEDYDATA:message}'
)
LOCATION 's3://logs/test/'
Grok patterns are much easier to read. Check out the documentation and the built-in patterns for more info.

Can't get Hive SerDe to work - returns 0 records

This is my second attempt at using a SerDe. The first one worked quite well, but now I'm really struggling.
I got an XML of this structure:
This is the Hive table I created
CREATE TABLE raw_abc.text_abc
(
publicationid string,
parentid string,
id string,
level string,
usertypeid string,
name string,
assetcrossreferences_ordered string,
assetcrossreferences MAP<string, string>,
attributenames_ordered string,
attributenames map<string,string>,
seo_ordered string,
seo MAP<string, string>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.publicationid"="/ST:ECC-HierarchyMessage/#PublicationID",
"column.xpath.parentid"="/ST:ECC-HierarchyMessage/Product/#ParentID",
"column.xpath.id"="/ST:ECC-HierarchyMessage/Product/#ID",
"column.xpath.level"="/ST:ECC-HierarchyMessage/Product/#Level",
"column.xpath.usertypeid"="/ST:ECC-HierarchyMessage/Product/#UserTypeID",
"column.xpath.name"="/ST:ECC-HierarchyMessage/Product/#Name",
"column.xpath.assetcrossreferences_ordered"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/#Ordered",
"column.xpath.assetcrossreferences"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/AssetCrossReference",
"column.xpath.attributenames_ordered"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.attributenames"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.seo_ordered"="/ST:ECC-HierarchyMessage/Product/SEO/#Ordered",
"column.xpath.seo"="/ST:ECC-HierarchyMessage/Product/SEO"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location 's3a://ec-abc-dev/inbound/abc/abc/'
TBLPROPERTIES (
"xmlinput.start"="<ST:ECC-HierarchyMessage>",
"xmlinput.end"="</ST:ECC-HierarchyMessage>"
)
;
The table is created successfully; however, when I run select * from raw_abc.text_abc, I get no records in return.
Any idea what's wrong here? I've spent the last two days trying to figure it out, with no luck.
Thanks,
G

How to query data from Amazon S3

I am looking to create a Tableau dashboard with data originating in Amazon DynamoDB. Right now I am sending the data to an Amazon S3 bucket using AWS Lambda, and I am getting this file in the bucket:
{
"Items": [
{
"payload": {
"phase": "T",
"tms_event": "2017-03-16 18:19:50",
"id_UM": 0,
"num_severity_level": 0,
"event_value": 1,
"int_status": 0
},
"deviceId": 6,
"tms_event": "2017-03-16 18:19:50"
}
]
}
I am trying to use Amazon Athena to create a connection with Tableau, but the payload attribute is giving me problems and I am not getting any results when I run the SELECT query.
This is the Athena Table,
CREATE EXTERNAL TABLE IF NOT EXISTS default.iot_table_test (
`payload` map<string,string>,
`deviceId` int,
`tms_event` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://iot-logging/'
TBLPROPERTIES ('has_encrypted_data'='false')
Thanks,
Alejandro
Your table does not look like it matches your data, because your data has a top-level Items array. Without restructuring the JSON data files, I think you would need a table definition like this:
CREATE EXTERNAL TABLE IF NOT EXISTS default.iot_table_test_items (
`Items` ARRAY<
STRUCT<
`payload`: MAP<string, string>,
`deviceId`: int,
`tms_event`: string
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://iot-logging/'
TBLPROPERTIES ('has_encrypted_data'='false')
and then query it unnesting the Items array:
SELECT
item.deviceId,
item.tms_event,
item.payload
FROM
default.iot_table_test_items
CROSS JOIN UNNEST (Items) AS i (item)
LIMIT 10;
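To make the effect of the unnest concrete, here is a rough Python equivalent of what the query produces: one output row per element of Items. It is only an illustration of the semantics, not Athena code, and the name unnest_items is made up for this sketch.

```python
import json

def unnest_items(documents):
    """Yield one flat row per element of each document's top-level Items array."""
    for document in documents:
        # A document with an empty or missing Items array contributes no rows.
        for item in document.get("Items", []):
            yield {
                "deviceId": item["deviceId"],
                "tms_event": item["tms_event"],
                "payload": item["payload"],
            }

# Shortened version of the file from the question.
raw = '{"Items":[{"payload":{"phase":"T","event_value":1},"deviceId":6,"tms_event":"2017-03-16 18:19:50"}]}'
rows = list(unnest_items([json.loads(raw)]))
print(rows)
```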