amazon athena create request with partitions - sql

I create a table with partitions as follow : first by year, month, and day.
Question : I hope get data of 12/2017 and 03/2018, how I can do this?
What I think do :
where (year='2017' and month='12') and ( year ='2018' and month='03')
Is it correct? I will not have a confusion so Amazon Athena get data of:
12/2017 and 03/2018 and 03/2017 and 12/2018
because of the and operator ?
PS: I can't test, I have only free account.
Thanks.

Anyway, I tried in a mini set of data and I found that Amazon Athena take into account the parenthesis.
My test is as follow :
The DDl of table as générated :
CREATE EXTERNAL TABLE `manyands`(
`years` int COMMENT 'from deserializer',
`months` int COMMENT 'from deserializer',
`days` int COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/'
My set of data test:
My tests:
1- SELECT * FROM "atlasdatabase"."manyands" where month='1';
I got in CSV format :
"years","months","days","year","month"
"2017","1","21","2017","1"
"2018","1","81","2018","1"
2- SELECT * FROM "atlasdatabase"."manyands" where month='1' and year='2017';
"years","months","days","year","month"
"2017","1","21","2017","1"
3- SELECT * FROM "atlasdatabase"."manyands" where (month='1' and year='2018') and (month='3' and year='2017') ;
empty (Zéro enregistrements renvoyés)
4- SELECT * FROM "atlasdatabase"."manyands" where (month='1' and year='2018') or (month='3' ) ;
"years","months","days","year","month"
"2018","1","81","2018","1"
"2017","3","73","2017","3"
"2018","3","73","2018","3"
Conclusion : add OR operator between many instances of the partitions.

Related

Athena gives extra rows than actual present in datasource in s3

I have stored data source in s3 and when querying it in athena and querying the total no of rows , its giving me more rows than present in csv file stored in s3 .
I have also given separate path for athena query result i.e different from the data source folder path of s3 .
Please help me with this , why athena is giving me extra rows and unknown values in them ,thus creating discrepancies in the data.
Please find the query i wrote create the table in athena
athena_client.start_query_execution(QueryString='create database cms_data',ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})
\t#Tables created for athena
context = {'Database': 'cms_data'}
athena_client.start_query_execution(QueryString='''CREATE EXTERNAL TABLE IF NOT EXISTS `cms_data`.`mpf_data` (
`State` String,
`County` String,
`Org_Name` String,
`Contract_ID` String,
`Plan_ID` double,
`Segment_ID` double,
`Plan_Type_Desc` String,
`Contract_Year` double,
`Category_Name` String,
`Service_Name` String,
`Limit_Flag` double,
`Authorization_Flag` double,
`Referral_Flag` double,
`Network_Description` String,
`Cost_Share` String )
\t ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
\t 'field.delim' = ','
) LOCATION 's3://cms-dashboard-automation/MPF_Data/'
TBLPROPERTIES ('has_encrypted_data'='false');
''',QueryExecutionContext = context,ResultConfiguration={'OutputLocation': 's3://cms-dashboard-automation/Athenaoutput/'})

Import Logs from S3 to Athena using Regex

I have exported CloudWatch logs to S3 and now want to import those logs to Athena. The format of the logs is as follow (pasted only one log for reference):
2021-07-30T14:30:22.937Z RequestId INFO {"_logLevel":"debug","msg":"Start: Calling All the Data Associates Function","timestamp":1627655422937,"EventSubCategory":"AppSyncService","API":"AppSyncService","function":"XXXXXXXXXXXXXXXXX","Correlation_Id":"XXXXXXXXXXXXXXXXX"}
I am using a regular expression to import the log and using the following query to create the table.
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string
`requestid` string
`loglevel` string
`message` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n'
)
LOCATION 's3://logs/test/'
TBLPROPERTIES ('has_encrypted_data'='false');
Regular Expression:
^(.*?)\t(.*?)\t(.*?)\t([\s\S]*?)\n
There are four columns in the table and the regular expression is also creating four groups and working as per my expectation. However, we still get empty table as result.
Can anyone please help to resolve this issue?
I think your problem is that you need to double-escape things in the regex, and you also should not match on a newline at the end, but $. Try this pattern:
'input.regex' = '^(.*?)\\t(.*?)\\t(.*?)\\t([\\s\\S]*?)$'
You can see an example in the official docs.
Also, the pattern [\s\S] could be replaced by . (\S means everything not matched by \s, so together they match anything).
An alternative to the regex serde is Grok, which is less error prone to write. Using the Grok serde I think this table would work for you:
CREATE EXTERNAL TABLE IF NOT EXISTS test1 (
`time` string
`requestid` string
`loglevel` string
`message` string
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
input.format' = '%{TIMESTAMP_ISO8601:time} %{NOTSPACE:requestid} %{NOTSPACE:loglevel} %{NOTSPACE:message}'
)
LOCATION 's3://logs/test/'
Grok patterns are much easier to read. Check out the documentation and the built-in patterns for more info.

Snowflake to Hive data movement with partition

We have a requirement to move data from Snowflake to Hive. I am able to unload data from snowflake to aws S3 and do and msck repair on Hive.
But all records are coming as null in Hive. What could be the reason ? Is there anything wrong here .
To check the parquet is created correctly , I read the Parquet file using Spark . I am able to read the parquet file.
##Snowflake
create or replace stage dev_zone.DAILY_LOG url= 's3://myc-mlb-alpha-us-east-1-drg-322t232/hive/rs_hive_008_test1' storage_integration = DEV_HIVE_INTEGRATION file_format = (type = 'parquet') ENCRYPTION = (TYPE = 'AWS_SSE_S3');
copy into #dev_zone.DAILY_LOG from (select * from dev_zone.DAILY_LOG limit 100) partition by ('as_on_date=' ||as_on_date);
##Hive
CREATE EXTERNAL TABLE dev_zone.DAILY_LOG(
dim_id decimal(38,0),
card_type string,
type string,
cntry string,
PARTITIONED BY (
as_on_date date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://myc-mlb-alpha-us-east-1-drg-322t232/hive/rs_hive_008_test1'
What I missed was to add header = true
copy into #dev_zone.DAILY_LOG from (select * from dev_zone.DAILY_LOG limit 100) partition by ('as_on_date=' ||as_on_date) header = true;

Can't get Hive SerDe to work - returns 0 records

This is my second attempt of using SerDe. First one worked quiet well but now, I'm really struggling.
I got an XML of this structure:
This is the Hive table I created
CREATE TABLE raw_abc.text_abc
(
publicationid string,
parentid string,
id string,
level string,
usertypeid string,
name string,
assetcrossreferences_ordered string,
assetcrossreferences MAP<string, string>,
attributenames_ordered string,
attributenames map<string,string>,
seo_ordered string,
seo MAP<string, string>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.publicationid"="/ST:ECC-HierarchyMessage/#PublicationID",
"column.xpath.parentid"="/ST:ECC-HierarchyMessage/Product/#ParentID",
"column.xpath.id"="/ST:ECC-HierarchyMessage/Product/#ID",
"column.xpath.level"="/ST:ECC-HierarchyMessage/Product/#Level",
"column.xpath.usertypeid"="/ST:ECC-HierarchyMessage/Product/#UserTypeID",
"column.xpath.name"="/ST:ECC-HierarchyMessage/Product/#Name",
"column.xpath.assetcrossreferences_ordered"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/#Ordered",
"column.xpath.assetcrossreferences"="/ST:ECC-HierarchyMessage/Product/AssetCrossReferences/AssetCrossReference",
"column.xpath.attributenames_ordered"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.attributenames"="/ST:ECC-HierarchyMessage/Product/AttributeNames/#Ordered",
"column.xpath.seo_ordered"="/ST:ECC-HierarchyMessage/Product/SEO/#Ordered",
"column.xpath.seo"="/ST:ECC-HierarchyMessage/Product/SEO"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location 's3a://ec-abc-dev/inbound/abc/abc/'
TBLPROPERTIES (
"xmlinput.start"="<ST:ECC-HierarchyMessage>",
"xmlinput.end"="</ST:ECC-HierarchyMessage>"
)
;
Table is created successfully, however,
when I try select * from raw_abc.text_abc , I get no records in return.
Any idea what's wrong here? I've spent the last 2 days trying to figure it out with no luck.
Thanks,
G

hive xml serDe : table is empty

I want to store xml data into hive table, XML data :
<servicestatuslist>
<recordcount>1266</recordcount>
<servicestatus id="435680">
<status_text>/: 61%used(9714MB/15975MB) (<80%) : OK</status_text>
<display_name>/ Disk Usage</display_name>
<host_name>zabbix.vshodc.com</host_name>
</servicestatus>
</servicestatuslist>
I have added jar file to path
hive> add jar /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar ;
Added /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar to class path
Added resource: /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar
I have written a hive serDe query:
create table xml_AIR(id STRING, status_text STRING,display_name STRING ,host_name STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties(
"column.xpath.id"="/servicestatus/#id",
"column.xpath.status_text"="/servicestatus/status_text/text()",
"column.xpath.display_name"="/servicestatus/display_name/text()",
"column.xpath.host_name"="/servicestatus/host_name/text()"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/cloudera/input/air.xml'
tblproperties(
"xmlinput.start"="<servicestatus",
"xmlinput.end"="</servicestatus>"
);
OK
Time taken: 1.609 seconds
When I issued select command , it didn't show the table's data:
hive> select * from xml_AIR;
OK
Time taken: 3.0 seconds
What's wrong in the above code? Please help.
I came out through the same Problem when dealing with XML Serde. After some struggle, I fixed it by using the "Load data" statement separately and avoiding addition of "LOCATION" property in "CREATE" statement.
the following is my XML data.
<record customer_id="0000-JTALA">
<income>200000</income>
<demographics>
<gender>F</gender>
<agecat>1</agecat>
<edcat>1</edcat>
<jobcat>2</jobcat>
<empcat>2</empcat>
<retire>0</retire>
<jobsat>1</jobsat>
<marital>1</marital>
<spousedcat>1</spousedcat>
<residecat>4</residecat>
<homeown>0</homeown>
<hometype>2</hometype>
<addresscat>2</addresscat>
</demographics>
<financial>
<income>18</income>
<creddebt>1.003392</creddebt>
<othdebt>2.740608</othdebt>
<default>0</default>
</financial>
</record>
CREATE TABLE Statement:
CREATE TABLE xml_bank(customer_id STRING, income BIGINT, demographics map<string,string>, financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/#customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);
CREATE Query Result:
OK
Time taken: 0.925 seconds
hive>
for the above create statement, I used the following "LOAD DATA" statement to load the data contained in an XML file in to the above created table.
hive> load data local inpath '/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml' overwrite into table xml_bank6;
LOAD Query Result:
Copying data from file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
Copying file: file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
Loading data to table default.xml_bank6
Table default.xml_bank6 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 500, raw_data_size: 0]
OK
Time taken: 0.879 seconds
hive>
And finally,
SELECT Query and Result:
hive> select * from xml_bank6;
OK
0000-JTALA 200000 {"empcat":"2","jobcat":"2","residecat":"4","retire":"0","hometype":"2","addresscat":"2","homeown":"0","spousedcat":"1","gender":"F","jobsat":"1","edcat":"1","marital":"1","agecat":"1"} {"default":"0","income":"18","othdebt":"2.740608","creddebt":"1.003392"}
Time taken: 0.149 seconds, Fetched: 1 row(s)
hive>
And in the above query i would suggest the value for "xmlinput.start" as "<servicestatus id", instead of "<servicestatus",because the XML start tag is in the pattern <servicestatus id="some data">.I believe this would be helpful for you.
Well, the code looks good. As per the example in this link, it should work for you.
Btw, there is a typo in the code that you have provided. In the table definition status_test STRING should be status_text STRING or vice versa.
The entire XML file should be a single line (i.e. no newlines in the XML). (A simple unix command to strip newlines is tr '\n\r' ' ' < source.xml > processed.xml.)
https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources
According to Hive DDL documentation, LOCATION clause expects an hdfs_path. Hence, try specifying only the directory, not the whole path to your XML file. By using LOAD after CREATE TABLE, you cannot have external tables, which might be an interesting approach in some cases.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable
LOCATION give directory only instead of file
create table xml_AIR(id STRING, status_text STRING,display_name STRING ,host_name STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties(
"column.xpath.id"="/servicestatus/#id",
"column.xpath.status_text"="/servicestatus/status_text/text()",
"column.xpath.display_name"="/servicestatus/display_name/text()",
"column.xpath.host_name"="/servicestatus/host_name/text()"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/cloudera/input'
tblproperties(
"xmlinput.start"="<servicestatus",
"xmlinput.end"="</servicestatus>"
);