Hive partitioning for data on s3

Our data is stored under s3://bucket/YYYY/MM/DD/HH and we are using AWS Firehose to land Parquet data in those locations in near real time. I can query the data using AWS Athena just fine; however, we have a Hive query cluster which has trouble querying the data when partitioning is enabled.
This is what I am doing:
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
This doesn't seem to work when the data on S3 is stored as s3://bucket/YYYY/MM/DD/HH,
however it does work for s3://bucket/year=YYYY/month=MM/day=DD/hour=HH.
Given the rigid bucket paths produced by Firehose, I cannot modify the S3 paths. So my question is: what's the right partitioning scheme in Hive DDL when you don't have an explicitly defined column name in your data path, like year= or month=?

Now you can specify a custom S3 prefix in Firehose: https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
myPrefix/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/

If you can't get folder names that follow the Hive naming convention, you will need to map each partition manually:
ALTER TABLE tableName ADD PARTITION (year='YYYY') LOCATION 's3://bucket/YYYY'
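Note that for a table partitioned by year, month, day and hour as in the question, every partition key has to be supplied in the ADD PARTITION clause. A minimal sketch for mapping a single hour might look like this (the table name and timestamp values are only placeholders, not from the original answer):
ALTER TABLE tableName ADD IF NOT EXISTS
PARTITION (year='2019', month='05', day='14', hour='09')
LOCATION 's3://bucket/2019/05/14/09/';
Each new hour that Firehose lands would need a corresponding statement, which you would typically generate from a small script or scheduled job.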

Related

How can we update existing partition data in aws glue table without running crawler?

When we update data in an existing partition by manually uploading to the S3 bucket, the data shows up in the existing partition of the Athena/Glue table.
But when the data is updated through the API, the object uploaded to the S3 bucket lands in the existing partition, yet in the Glue table the data is stored under a different partition corresponding to the current date [last modified] (August 2, 2022, 17:52:15 (UTC+05:30)),
while in my S3 bucket the partition date is different (s3://aiq-grey-s3-sink-created-at-partition/topics/core.Test.s3/2022/07/19/), i.e. 2022/07/19.
So when I look at the same object in the Glue table, I want it partitioned by that date, 2022/07/19,
but it shows up under the current date's partition unless I run the crawler.
When I run the crawler it writes the data to the correct partition,
but I don't want to run the crawler every single time.
How can I update data in an existing partition of a Glue table using the API?
Am I missing some configuration needed to achieve this?
Please suggest if anybody has an idea on this.
Here are two solutions I'd propose:
Use boto3 to run an Athena query that alters the partition: ALTER TABLE ... ADD PARTITION (an example of the composed statement is sketched after this list).
import boto3

athena = boto3.client('athena')
response = athena.start_query_execution(
    QueryString='ALTER TABLE table ADD PARTITION ... LOCATION ... ',  # compose the query as you need
    QueryExecutionContext={
        'Database': database  # your Glue/Athena database name
    },
    ResultConfiguration={
        'OutputLocation': output,  # an s3:// path where Athena writes query results
    }
)
Use boto3 to create the partition via the Glue Data Catalog: glue.Client.create_partition(**kwargs).
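For reference, the ALTER TABLE statement composed in the first option could look something like the sketch below; the table name and partition columns are hypothetical and assume the table is partitioned by year/month/day matching the 2022/07/19 path from the question:
ALTER TABLE my_glue_table ADD IF NOT EXISTS
PARTITION (year='2022', month='07', day='19')
LOCATION 's3://aiq-grey-s3-sink-created-at-partition/topics/core.Test.s3/2022/07/19/';
Either way, the partition gets registered in the Glue Data Catalog without waiting for a crawler run.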

Which file format should I use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
Since ORC files are read-only and we want to update the file contents every 20 minutes, we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem initially, but the data grows significantly, roughly 2 GB every day, and it is a highly costly process to download ~10 GB of files, read them, append to them, and upload them again.
Question:
Is there another file format that supports appends/inserts and can still be queried with Athena?
From this article it seems Avro is such a file format, but I am not sure
whether Athena can be used to query it,
or whether there are any other issues.
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the table's target S3 path and they will be instantly available for querying via Athena.
If your table is partitioned, you can copy the new files to the paths corresponding to your specific partitions. After copying new files into a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'

Hive External Table with Azure Blob Storage

Is there a way to create a Hive external table using a SerDe, with the location pointing to Azure Storage, organized in such a way that the data uses the fewest number of blobs? For example, if I insert 10,000 records, I would like it to create just 100 page blobs with 100 records each instead of maybe 10,000 blobs with 1 record each. I am deserializing from the blobs, so fewer blobs will require less time. What would be the most optimal format in Hive?
First, there is a way to create a Hive external table using a SerDe with the location pointing to Azure Blob Storage, but not directly; please see the section "Create Hive database and tables" and the HiveQL below.
create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
(
field1 string,
field2 int,
field3 float,
field4 double,
...,
fieldN string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' LINES TERMINATED BY '<line separator>'
STORED AS TEXTFILE LOCATION '<storage location>' TBLPROPERTIES("skip.header.line.count"="1");
The explanation of <storage location> is the part to focus on:
<storage location>: the Azure Storage location where the data of the Hive tables is saved. If you do not specify LOCATION, the database and the tables are stored in the hive/warehouse/ directory in the default container of the Hive cluster. If you want to specify the storage location, it has to be within the default container for the database and tables. The location has to be referred to relative to the default container of the cluster, in the format 'wasb:///<directory 1>/' or 'wasb:///<directory 1>/<directory 2>/', etc. After the query is executed, the relative directories are created within the default container.
This means you can access an Azure Blob Storage location from Hive via the wasb protocol, which requires the hadoop-azure library that lets Hadoop access storage on Azure. If your Hive/Hadoop deployment is not on Azure, you need to refer to the official Hadoop document "Hadoop Azure Support: Azure Blob Storage" to configure it.
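As a minimal sketch of such a relative wasb location (the database, table, column and directory names here are only illustrative, not from the original answer):
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.events
(
field1 string,
field2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb:///hivedata/events/';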
As for using a SerDe, it depends on the file format you use; for example, for the ORC file format, the HQL uses OrcSerde as below.
CREATE EXTERNAL TABLE IF NOT EXISTS <table name> (<column_name column_type>, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS ORC
LOCATION '<orcfile path>';
For your second question, the most optimal format in Hive is the ORC file format.

How to query data from gz file of Amazon S3 using Qubole Hive query?

I need to get specific data from a gz file.
How do I write the SQL?
Can I just query it like a database table, e.g.:
Select * from gz_File_Name where key = 'keyname' limit 10
But it always comes back with an error.
You need to create a Hive external table over this file location (folder) to be able to query it using Hive. Hive will recognize the gzip format. Like this:
create external table hive_schema.your_table (
col_one string,
col_two string
)
stored as textfile --specify your file type, or use serde
LOCATION
's3://your_s3_path_to_the_folder_where_the_file_is_located'
;
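Once the table is created, you can run the kind of query from the question against it; the column name here is just an assumption to mirror the example:
SELECT * FROM hive_schema.your_table WHERE col_one = 'keyname' LIMIT 10;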
See the manual on Hive table here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise, S3 under the hood does not store folders; file names containing slashes ("/") in S3 are presented by tools such as Hive as a folder structure. See here: https://stackoverflow.com/a/42877381/2700344

HIVE script - Specify file name as S3 Location

I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt),
but the way it currently works is that the above script creates a folder named 'MyData.txt'
and then generates a file with a random name under that folder.
Is it at all possible to specify a file name in S3 using Hive?
Thank you!
A few things:
There are two different ways Hadoop can write data to S3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and Hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in Hadoop handle directories pretty seamlessly, so you shouldn't let it bug you too much. Also, you can use things like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external Hive table is always treated as a directory, for the reasons above.
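So, as a hedged sketch building on the question's DDL (the directory name is only illustrative), you would point LOCATION at a directory and let Hive name the files it writes inside it:
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData/';
If you really need a single named object, one option (not covered in the original answer) is to merge or copy the output afterwards, for example with the hadoop fs -getmerge command mentioned above, rather than trying to control the file name from Hive.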