Hive parquet compression doesn't work

Hive version 2.3
SET hive.exec.compress.output=true;
CREATE TABLE (
*) STORED AS PARQUET
LOCATION 's3 location'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
I did the above, but the table output in the S3 location is not compressed; I can read the files with cat. I also tried TBLPROPERTIES ('PARQUET.COMPRESS'='ZLIB'), but that didn't work either. Does anyone know the best way to compress Parquet output using Hive? Thank you.

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
CREATE TABLE (*) STORED AS PARQUET LOCATION 's3 location';
You can also set other compression codecs. List of codecs:
gzip - org.apache.hadoop.io.compress.GzipCodec
bzip2 - org.apache.hadoop.io.compress.BZip2Codec
LZO - com.hadoop.compression.lzo.LzopCodec
Snappy - org.apache.hadoop.io.compress.SnappyCodec
Deflate - org.apache.hadoop.io.compress.DeflateCodec
From the above list, Snappy is not the default; DeflateCodec is the default codec.
You can confirm this by running
hive> SET mapred.output.compression.codec;
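For Parquet tables specifically, the table property Hive's Parquet writer reads is parquet.compression, not parquet.compress, which is likely why the property in the question has no effect. A minimal sketch, assuming hypothetical table names parquet_demo and src:
-- Hedged sketch: parquet_demo and src are hypothetical names.
-- 'parquet.compression' typically accepts UNCOMPRESSED, SNAPPY or GZIP
-- (ZLIB is not an accepted Parquet codec value).
CREATE TABLE parquet_demo
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS SELECT * FROM src;

-- Session-level alternative that applies to Parquet writes in this session:
SET parquet.compression=SNAPPY;
Even with Snappy enabled the output files keep the plain Parquet name (no .gz suffix); the compression is applied to the data pages inside the file, not to the file as a whole.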

Related

Will the Hive property hive.exec.compress.output affect tables stored as ORC?

I know how to enable Hive compression correctly, and I ran the experiment below. I tried two table formats: textfile and ORC.
table info                                            data size
textfile, hive compression enabled                    346 B
textfile, hive compression disabled                   568 B
orc, orc.compress=NONE, hive compression enabled      1635 B
orc, orc.compress=NONE, hive compression disabled     1635 B
orc, orc.compress=ZLIB, hive compression enabled      1371 B
orc, orc.compress=ZLIB, hive compression disabled     1371 B
From the table we can see that Hive output compression affects the textfile data files but not the ORC ones. So I guess hive.exec.compress.output works on textfile tables but not on ORC tables. Am I right? Do you have any reliable evidence to support or overturn this? Thanks!
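For what it's worth, ORC compression is governed by the orc.compress table property (or the hive.exec.orc.default.compress default), not by hive.exec.compress.output, which is consistent with the numbers above. A minimal sketch, assuming hypothetical table names orc_demo and src:
-- Hedged sketch: orc_demo and src are hypothetical names.
-- 'orc.compress' accepts NONE, ZLIB or SNAPPY and controls ORC's internal compression;
-- hive.exec.compress.output does not apply to ORC data files.
CREATE TABLE orc_demo
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY')
AS SELECT * FROM src;

-- Default codec for ORC tables that do not set the property themselves:
SET hive.exec.orc.default.compress=ZLIB;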

How to set min and max Parquet file sizes in Hive?

I have created an external table in Hive stored as Parquet with Snappy compression. I want to control the size of the Parquet files created in the target folder, i.e. min file size = 1024 B and max file size = 128 MB, and the number of Parquet files should follow from these settings. Can anyone please let me know how this can be achieved?
Thanks
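Hive has no direct minimum/maximum file size setting for Parquet output, but file sizes are usually steered through the small-file merge settings and the Parquet block size. A minimal sketch of the relevant session settings, with illustrative values that are assumptions rather than recommendations:
-- Hedged sketch: the numeric values below are illustrative only.
-- Merge small output files from map-only and map-reduce (or Tez) jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.tezfiles=true;
-- Target size of merged output files, roughly the upper bound (128 MB).
SET hive.merge.size.per.task=134217728;
-- Merging kicks in when the average output file is smaller than this (1024 B).
SET hive.merge.smallfiles.avgsize=1024;
-- Row group size inside each Parquet file.
SET parquet.block.size=134217728;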

Specify compression type with Athena

I have S3 data which has GZIP compression. I'm trying to create a table in Athena using this file, and my CREATE TABLE statement succeeds - but when I query the table all rows are empty.
create external table mydatabase.table1 (
date date,
week_begin_date date,
week_end_date date,
value float
)
row format delimited fields terminated by ','
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://my-bucket/some/path/'
How can I insist that Athena read my files as GZIP?
While Athena supports TBLPROPERTIES metadata (properties can be set in CREATE TABLE or ALTER TABLE and displayed with SHOW TBLPROPERTIES), it does not respect the TBLPROPERTIES ('compressionType'='gzip') option.
There's no apparent way to force compression / decompression algorithm. Athena attempts to identify compression based on file extension. A GZIP file with a .gz suffix will be readable; a GZIP file without that suffix will not.
Similarly, an uncompressed file with a .gz suffix will fail. The reported error is
HIVE_CURSOR_ERROR: incorrect header check
Some investigation revealed the following:
The only known way to have Athena recognize a file as a GZIP is to name it with a .gz suffix.
Other similar suffixes do not work, including .gzip, .zip, and gz without a leading dot ([^.]gz).
GZIP and uncompressed files can live happily side by side in an Athena table or partition - the compression detection is done at the file level, not at the table level.

Parquet Output File Not Compressed

I am executing the command below in the Hive console.
create table departments_parquet stored as parquet tblproperties("parquet.compression"="GZIP") as select * from departments;
I see the output file created in parquet format as below.
-rwxrwxrwx 1 cloudera supergroup 463 2017-06-17 14:55 /user/hive/warehouse/departments_parquet/000000_0
The relevant Hive properties are set as:
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive.exec.compress.output=true;
I expected the output file name to be 000000_0.gz.
Please help me get the final output as a gzip-compressed file.
Thanks.
Columnar storage uses various compression techniques simultaneously, and page compression is just one of them; therefore, although the files contain gzip-compressed data pages, they are not gzipped files and will never get a .gz extension.
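To see what the table itself declares, the table properties can be listed from Hive; a minimal sketch using the table from the question (checking the compression actually written into the data pages requires inspecting a file, for example with parquet-tools meta, as described in the Impala question below):
-- Lists table-level properties, including parquet.compression when it is set.
SHOW TBLPROPERTIES departments_parquet;
-- Also shows storage and SerDe details for the table.
DESCRIBE FORMATTED departments_parquet;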

Parquet compression type via Impala

We have quite a few Impala tables defined (Parquet files), and we assume we are using Snappy compression.
However nobody really knows what compression type we are actually using on existing tables.
The impala docs don't seem to specify how to get the compression type from an existing table.
Is there a way to find the used compression type via impala?
As of right now there is no command in Impala that would tell you the type of compression being used in a table stored as Parquet, but there is a workaround. What you can do is look at one of the Parquet files within the table and then use the parquet-tools meta command to see the compression being used.
-- step1) run hdfs dfs -ls to determine the location and name for a parquet file
hdfs dfs -ls /yourTableLocationPath
-- step2) parquet-tools really only works locally right now so you will need to copy the file to a local directory
hdfs dfs -get /yourTableLocationPath/yourFileName /yourLocalPath
-- step3) run parquet-tools meta command
parquet-tools meta /yourLocalPath/yourFileName
The output of the parquet-tools meta command will show you the type of compression being used under the row group output.