Athena gives extra rows than actual present in datasource in s3 - sql

I have stored data source in s3 and when querying it in athena and querying the total no of rows , its giving me more rows than present in csv file stored in s3 .
I have also given separate path for athena query result i.e different from the data source folder path of s3 .
Please help me with this , why athena is giving me extra rows and unknown values in them ,thus creating discrepancies in the data.
Please find the query i wrote create the table in athena
athena_client.start_query_execution(QueryString='create database cms_data',ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})
\t#Tables created for athena
context = {'Database': 'cms_data'}
athena_client.start_query_execution(QueryString='''CREATE EXTERNAL TABLE IF NOT EXISTS `cms_data`.`mpf_data` (
`State` String,
`County` String,
`Org_Name` String,
`Contract_ID` String,
`Plan_ID` double,
`Segment_ID` double,
`Plan_Type_Desc` String,
`Contract_Year` double,
`Category_Name` String,
`Service_Name` String,
`Limit_Flag` double,
`Authorization_Flag` double,
`Referral_Flag` double,
`Network_Description` String,
`Cost_Share` String )
\t ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
\t 'field.delim' = ','
) LOCATION 's3://cms-dashboard-automation/MPF_Data/'
TBLPROPERTIES ('has_encrypted_data'='false');
''',QueryExecutionContext = context,ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})

Related

AWS Athena: line 11:1: mismatched input 'PARTITIONED'. Expecting: 'COMMENT', 'WITH', <EOF> while creating table

I am trying to create an empty table which contains some columns with determined datatypes in S3 with a command launched in Athena, but it is throwing me the following error:
line 11:1: mismatched input 'PARTITIONED'. Expecting: 'COMMENT', 'WITH', <EOF>
The query I'm executing is the following:
CREATE TABLE IF NOT EXISTS boards_raw_fields_v1 (
"uuid" bigint,
"source" string,
"raw_company_name" string,
"raw_contract_type" string,
"raw_employment_type" string,
"raw_working_hours_type" string,
"raw_all_locations" array<string>,
"raw_categories" array<string>,
"raw_industry" string)
PARTITIONED BY (
"year" string,
"month" string,
"day" string,
"hour" string,
"version" bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://my-data/datalake-raw/boards/raw-fields/v1'
TBLPROPERTIES (
'classification'='parquet',
'compressionType'='snappy',
'projection.enabled'='false',
'typeOfData'='file')
In the URI:
s3://my-data/datalake-raw/boards/raw-fields/v1
IMPORTANT NOTE: All the folders are currently created except the last one v1
What am I doing wrong here in the process of creating the table?

Snowflake to Hive data movement with partition

We have a requirement to move data from Snowflake to Hive. I am able to unload data from snowflake to aws S3 and do and msck repair on Hive.
But all records are coming as null in Hive. What could be the reason ? Is there anything wrong here .
To check the parquet is created correctly , I read the Parquet file using Spark . I am able to read the parquet file.
##Snowflake
create or replace stage dev_zone.DAILY_LOG url= 's3://myc-mlb-alpha-us-east-1-drg-322t232/hive/rs_hive_008_test1' storage_integration = DEV_HIVE_INTEGRATION file_format = (type = 'parquet') ENCRYPTION = (TYPE = 'AWS_SSE_S3');
copy into #dev_zone.DAILY_LOG from (select * from dev_zone.DAILY_LOG limit 100) partition by ('as_on_date=' ||as_on_date);
##Hive
CREATE EXTERNAL TABLE dev_zone.DAILY_LOG(
dim_id decimal(38,0),
card_type string,
type string,
cntry string,
PARTITIONED BY (
as_on_date date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://myc-mlb-alpha-us-east-1-drg-322t232/hive/rs_hive_008_test1'
What I missed was to add header = true
copy into #dev_zone.DAILY_LOG from (select * from dev_zone.DAILY_LOG limit 100) partition by ('as_on_date=' ||as_on_date) header = true;

Can't get Hive SerDe to work - returns 0 records

This is my second attempt of using SerDe. First one worked quiet well but now, I'm really struggling.
I got an XML of this structure:
This is the Hive table I created
CREATE TABLE raw_abc.text_abc
(
publicationid string,
parentid string,
id string,
level string,
usertypeid string,
name string,
assetcrossreferences_ordered string,
assetcrossreferences MAP<string, string>,
attributenames_ordered string,
attributenames map<string,string>,
seo_ordered string,
seo MAP<string, string>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.publicationid"="/ST:ECC-HierarchyMessage/#PublicationID",
"column.xpath.parentid"="/ST:ECC-HierarchyMessage/Product/#ParentID",
"column.xpath.id"="/ST:ECC-HierarchyMessage/Product/#ID",
"column.xpath.level"="/ST:ECC-HierarchyMessage/Product/#Level",
"column.xpath.usertypeid"="/ST:ECC-HierarchyMessage/Product/#UserTypeID",
"column.xpath.name"="/ST:ECC-HierarchyMessage/Product/#Name",
"column.xpath.assetcrossreferences_ordered"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/#Ordered",
"column.xpath.assetcrossreferences"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/AssetCrossReference",
"column.xpath.attributenames_ordered"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.attributenames"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.seo_ordered"="/ST:ECC-HierarchyMessage/Product/SEO/#Ordered",
"column.xpath.seo"="/ST:ECC-HierarchyMessage/Product/SEO"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location 's3a://ec-abc-dev/inbound/abc/abc/'
TBLPROPERTIES (
"xmlinput.start"="<ST:ECC-HierarchyMessage>",
"xmlinput.end"="</ST:ECC-HierarchyMessage>"
)
;
Table is created successfully, however,
when I try select * from raw_abc.text_abc , I get no records in return.
Any idea what's wrong here? I've spent the last 2 days trying to figure it out with no luck.
Thanks,
G

amazon athena create request with partitions

I create a table with partitions as follow : first by year, month, and day.
Question : I hope get data of 12/2017 and 03/2018, how I can do this?
What I think do :
where (year='2017' and month='12') and ( year ='2018' and month='03')
Is it correct? I will not have a confusion so Amazon Athena get data of:
12/2017 and 03/2018 and 03/2017 and 12/2018
because of the and operator ?
PS: I can't test, I have only free account.
Thanks.
Anyway, I tried in a mini set of data and I found that Amazon Athena take into account the parenthesis.
My test is as follow :
The DDl of table as générated :
CREATE EXTERNAL TABLE `manyands`(
`years` int COMMENT 'from deserializer',
`months` int COMMENT 'from deserializer',
`days` int COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/'
My set of data test:
My tests:
1- SELECT * FROM "atlasdatabase"."manyands" where month='1';
I got in CSV format :
"years","months","days","year","month"
"2017","1","21","2017","1"
"2018","1","81","2018","1"
2- SELECT * FROM "atlasdatabase"."manyands" where month='1' and year='2017';
"years","months","days","year","month"
"2017","1","21","2017","1"
3- SELECT * FROM "atlasdatabase"."manyands" where (month='1' and year='2018') and (month='3' and year='2017') ;
empty (Zéro enregistrements renvoyés)
4- SELECT * FROM "atlasdatabase"."manyands" where (month='1' and year='2018') or (month='3' ) ;
"years","months","days","year","month"
"2018","1","81","2018","1"
"2017","3","73","2017","3"
"2018","3","73","2018","3"
Conclusion : add OR operator between many instances of the partitions.

Store multiple elements in json files in AWS Athena

I have some json files stored in a S3 bucket , where each file has multiple elements of same structure. For example,
[{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}},{"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}},{"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}]
I want to create a table in Athena corresponding to above data.
The query I wrote for creating the table:
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.elb_logs2 (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'field.delim' = ' '
) LOCATION 's3://<bucketname>/';
But if I do a SELECT query as follows,
SELECT * FROM sampledb.elb_logs4;
I get the following result:
1 {"eventid":"1","eventversion":"1.0","image":{"id":"101","message":"New item!"},"eventsource":"aws:dynamodb","eventname":"INSERT","awsregion":"us-west-2"} {"eventid":"2","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"MODIFY","awsregion":"us-west-2"} {"eventid":"3","eventversion":"1.0","image":{"id":"101","message":"This item has changed"},"eventsource":"aws:dynamodb","eventname":"REMOVE","awsregion":"us-west-2"}
The entire content of the json file is picked as one entry here.
How can I read each element of json file as one entry?
Edit: How can I read each subcolumn of image, i.e., each element of the map?
Thanks.
Question1: Store multiple elements in json files for AWS Athena
I need to rewrite my json file as
{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}, {"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}, {"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
That means
Remove the square brackets [ ] Keep each element in one line
{.....................}
{.....................}
{.....................}
Question2. Access nonlinear json attributes
CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` struct <`Id` : string,
`Message` : string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
"dots.in.keys" = "true"
) LOCATION 's3://exampletablewithstream-us-west-2/';
Query:
select image.Id, image.message from <tablename>;
Ref:
http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/
https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords