Specify compression type with Athena - hive

I have S3 data which has GZIP compression. I'm trying to create a table in Athena using this file, and my CREATE TABLE statement succeeds - but when I query the table all rows are empty.
create external table mydatabase.table1 (
date date,
week_begin_date date,
week_end_date date,
value float
)
row format delimited fields terminated by ','
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://my-bucket/some/path/'
How can I insist that Athena read my files as GZIP?

While Athena supports TBLPROPERTIES metadata (properties can be set within a CREATE TABLE statement or later via ALTER TABLE, and displayed for any table with SHOW TBLPROPERTIES), it does not respect the TBLPROPERTIES ('compressionType'='gzip') option.
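For illustration, setting and inspecting the property looks roughly like this (both statements run, but they change nothing about how the data is decompressed):
-- Athena accepts the property but ignores it for compression detection
ALTER TABLE mydatabase.table1 SET TBLPROPERTIES ('compressionType' = 'gzip');
SHOW TBLPROPERTIES table1;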
There is no apparent way to force a particular compression / decompression algorithm. Athena attempts to identify compression based on the file extension. A GZIP file with a .gz suffix will be readable; a GZIP file without that suffix will not.
Similarly, an uncompressed file with a .gz suffix will fail. The reported error is
HIVE_CURSOR_ERROR: incorrect header check
Some investigation revealed the following:
The only known way to have Athena recognize a file as a GZIP is to name it with a .gz suffix.
Other similar suffixes that do not work include .gzip, .zip, and a bare gz without the leading dot.
GZIP and uncompressed files can live happily side by side in an Athena table or partition - the compression detection is done at the file level, not at the table level.

Related

Which file format do I have to use that supports appending?

Currently we use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose the ORC file format for the following reasons:
compression
and the ability to query the data using Athena
Problem:
As the ORC files are read-only and we want to update the file contents constantly (every 20 minutes),
this implies we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by roughly 2 GB every day. It is a highly costly process to download ~10 GB of files, read them, write to them, and upload them again.
Question:
Is there any way to use another file format which also offers appends/inserts and can be queried by Athena?
From this article it sounds like Avro is such a file format, but I am not sure
whether Athena can be used for querying it,
or whether there are any other issues.
Note: my skill level with big data technologies is beginner.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files to a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
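An optional sanity check afterwards is to list the table's partitions:
show partitions dataset.tablename;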

query from athena a file in s3

I am a complete newbie at this. I have this file in Amazon S3.
How can I query this .tar.gz from Athena?
I am assuming I have to somehow decompress it and 'restore' it into Athena, but I do not know how to do that.
You can directly query files in AWS Athena that are in .gz format, as well as any flat files. If your tar file contains multiple .gz files of the same file format, you don't need to gunzip them to .tsv.
Since you have already converted to .tsv files, make sure files of the same format are put into one folder, e.g.
s3://bucketname/folder/file1.gz
s3://bucketname/folder/file2.gz
etc. file1 and file2 should have the same structure.
Then define your AWS Athena table on top of this. Sample script below -
CREATE EXTERNAL TABLE table_name (
yr INT,
quarter INT,
month INT,
dayofmonth INT,
dayofweek INT,
flightdate STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://bucketname/folder/';
Keeping homogeneous files is not mandatory but recommended, so that you can add or remove files under the same folder and just update the partition information every time there is a change.
Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table.
MSCK REPAIR TABLE table_name;
Reference - https://docs.aws.amazon.com/athena/latest/ug/lazy-simple-serde.html#tsv-example
You can't query tarballs. Athena requires gzipped or uncompressed text files; other options are ORC or Parquet files. You will need to untar the file and create a gzip file containing just the .txt.

File size in hive with different file formats

I have a small file (2 MB). I created an external Hive table over this file (stored as textfile). I created another table (stored as ORC) and copied the data from the previous table. When I checked the size of the data in the ORC table, it was more than 2 MB.
ORC is a compressed file format, so shouldn't the data size be smaller?
As of Hive 0.14, users can request an efficient merge of small ORC files together by issuing a CONCATENATE command on their table or partition. The files will be merged at the stripe level without reserialization.
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
It's because your source file is too small. ORC has a complex structure with internal indexes, headers, footers and a postscript; compression codecs also add some structures of their own.
See this for details: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileFormat
All these supporting structures consume more space than the data itself. For such a small file you really do not need to store min/max values for columns or Bloom filters, since the file may fit in memory. The best storage for this case is an uncompressed text file. You can also try just gzipping your source file and checking its size: a very small gzipped file may end up bigger than the uncompressed original. The bigger the file, the more benefit you get from compression and from using ORC.
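For reference, a minimal sketch of the experiment described in the question, with hypothetical table and column names:
-- external table over the small (~2 MB) delimited source file
CREATE EXTERNAL TABLE small_text (
id INT,
payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/small_text/';
-- copy into ORC; at this size the ORC indexes, headers, footer and
-- postscript can outweigh the data itself
CREATE TABLE small_orc STORED AS ORC
AS SELECT * FROM small_text;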

Loading zip csv file from S3 into Hive

I have a csv file that's zipped in S3. For unzipped files, I would use the code below. Is there an option I can add so it unzips before loading?
I'm on Hive and am using a SQL editor (DbVisualizer). I googled and saw some Unix steps, but I've never used Unix before, so I'm wondering whether there is another way to do it within the SQL.
create external table abc (
email string,
value int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://path/'
TBLPROPERTIES ("skip.header.line.count"="1");

Create hive table for schema less avro files

I have multiple Avro files and each file has a STRING in it. Each Avro file is a single row. How can I write a Hive table to consume all the Avro files located in a single directory?
Each file has a big number in it, and hence I do not have any JSON-like schema that I can relate it to. I might be wrong when I say schema-less, but I cannot find a way for Hive to understand this data. This might be very simple, but I am lost, since I have tried numerous different ways without success. I created tables pointing to a JSON schema as the Avro URI, but that is not the case here.
For more context, the files were written using the Crunch API:
final Path outcomesVersionPath = ...
pipeline.write(fruit.keys(), To.avroFile(outcomesVersionPath));
I tried the following query, which creates the table but does not read the data properly:
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
If your data set only has one STRING field then you should be able to read it from Hive with a single column called data (or whatever you would like) by changing your DDL to:
CREATE EXTERNAL TABLE test_table
(data STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
And then read the data with:
SELECT data FROM test_table;
Use the avro-tools utility jar to see the Avro schema of any given binary file.
Then just link that schema file while creating the table.
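As a sketch, pointing the table definition at a schema file extracted with avro-tools might look like this (the .avsc path is hypothetical):
-- avro.schema.url points the Avro SerDe at an explicit schema file
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
TBLPROPERTIES ('avro.schema.url'='hdfs:///somePath/schema.avsc');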