Will the Hive property hive.exec.compress.output affect tables stored as ORC?

I know how to enable Hive compression correctly, and I ran the test below. I tried two table formats: text format and ORC.
Table info                                      Data size
textformat, Hive compression on                 346B
textformat, Hive compression off                568B
orc, orc.compress=NONE, Hive compression on     1635B
orc, orc.compress=NONE, Hive compression off    1635B
orc, orc.compress=ZLIB, Hive compression on     1371B
orc, orc.compress=ZLIB, Hive compression off    1371B
The table shows that the Hive compression setting affects the text-format data files but not the ORC ones. So my guess is that hive.exec.compress.output works on text-format tables but not on ORC tables. Am I right? Is there reliable evidence to support or refute this? Thanks!
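The comparison in the table could be reproduced with a sketch along these lines. The table names (src, t_text, t_orc_none, t_orc_zlib) and the column list are hypothetical; hive.exec.compress.output and orc.compress are the settings named in the question. ORC tables take their codec from the orc.compress table property, which would be consistent with the ORC sizes above not changing when hive.exec.compress.output is toggled.

```sql
-- Hypothetical tables; `src` is assumed to already hold the test rows.
CREATE TABLE t_text (id INT, name STRING) STORED AS TEXTFILE;
CREATE TABLE t_orc_none (id INT, name STRING)
  STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE");
CREATE TABLE t_orc_zlib (id INT, name STRING)
  STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");

-- Run once with output compression on...
SET hive.exec.compress.output=true;
INSERT OVERWRITE TABLE t_text     SELECT id, name FROM src;
INSERT OVERWRITE TABLE t_orc_none SELECT id, name FROM src;
INSERT OVERWRITE TABLE t_orc_zlib SELECT id, name FROM src;

-- ...then repeat the three INSERTs with it off and compare file sizes.
SET hive.exec.compress.output=false;
```

After each run, the size of the files in each table's directory can be checked (for example with dfs -du from the Hive CLI) and compared against the figures in the table.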

Related

Hive table in Power BI

I want to create a Hive table that stores data in ORC format with Snappy compression. Will Power BI be able to read from that table? Also, would you suggest any other format or compression for my table?
ORC is a file format that works best with Hive and is highly optimized for HDFS read operations. Power BI can connect to Hive through the Hive ODBC data connection. So if you will always access the data through Hive, you can use this format to store it. But if you want the flexibility of both Hive and Impala, using the Cloudera-provided Impala ODBC driver, consider Parquet.
Both ORC and Parquet have their own advantages and disadvantages. The main deciding factors are the tools that access the data, how nested the data is, and how many columns there are.
If you have many columns with nested data and want to access the data from both Hive and Impala, go with Parquet. If you have few columns, a flat data structure, and a huge amount of data, go with ORC.
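For reference, the table described in the question could be declared roughly as follows. The table names and columns are hypothetical, and the Parquet variant assumes the parquet.compression table property is honored by the Hive version in use.

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE sales_orc (
  order_id BIGINT,
  customer STRING,
  amount   DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- The Parquet alternative, also with Snappy (assumed property name).
CREATE TABLE sales_parquet (
  order_id BIGINT,
  customer STRING,
  amount   DOUBLE
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
```

Either table is queried through Hive in the same way, so a client connecting over ODBC sees ordinary rows regardless of the storage format underneath.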

Read Hive table (or HDFS data in Parquet format) in StreamSets DC

Is it possible to read a Hive table (or HDFS data in Parquet format) in StreamSets Data Collector? I don't want to use Transformer for this.
Reading raw Parquet files runs counter to the way Data Collector works, so that would be a better use case for Transformer.
However, I have successfully used the JDBC origin with either Impala or Hive to achieve this, although the JDBC source comes with some additional hurdles.

Are metadata duplicated in Hive tables stored as ORC?

Since ORC is a self-describing format, information about the columns is stored within the files.
When a new table is created and stored as ORC, its metadata is added to the Hive metastore.
Isn't this information duplicated? How does Hive handle this?
A possible explanation:
The column metadata (COLUMN_NAME, TYPE_NAME, COMMENT, etc.) is stored in only a single table in the Hive metastore (COLUMNS_V2).
The Hive metastore consists of dozens of tables with various dependencies.
So removing the column metadata from the metastore would eliminate a small duplication, but it is negligible compared to the entire metastore DB (in our cluster it's a 176KB/530MB ratio).
I guess that saving ~0.03% of redundancy isn't worth the hassle of redesigning the metastore schema.
ORC is a format that is compatible with many technologies other than Hive.
It could be that Hive uses only the columnar compression while ignoring the benefit of the self-describing data format.
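To see the metastore copy of the column metadata, a query along these lines can be run directly against the metastore's backing database. This is a sketch that assumes SQL access to that database and the standard metastore schema (TBLS and SDS are assumed join tables; only COLUMNS_V2 is named above), and my_orc_table is a hypothetical table name.

```sql
-- Run against the metastore's RDBMS (e.g. MySQL), not through Hive itself.
SELECT c.COLUMN_NAME, c.TYPE_NAME, c.COMMENT
FROM TBLS t
JOIN SDS s        ON t.SD_ID = s.SD_ID   -- storage descriptor of the table
JOIN COLUMNS_V2 c ON s.CD_ID = c.CD_ID   -- column descriptor rows
WHERE t.TBL_NAME = 'my_orc_table'        -- hypothetical table name
ORDER BY c.INTEGER_IDX;
```

The result can be compared with the schema embedded in the ORC files themselves to see the overlap being discussed.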

File size in hive with different file formats

I have a small file (2MB). I created an external Hive table over this file (stored as textfile). I created another table (stored as ORC) and copied the data from the first table. When I checked the size of the data in the ORC table, it was more than 2MB.
ORC is a compressed file format, so shouldn't the data size be smaller?
As of Hive 0.14, users can request an efficient merge of small ORC files together by issuing a CONCATENATE command on their table or partition. The files will be merged at the stripe level without reserialization.
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
It's because your source file is too small. ORC has a complex structure with internal indexes, headers, footers, and a postscript, and the compression codecs add some structures of their own.
See this for details: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileFormat
All these supporting structures can consume more space than the data. For such a small file you really do not need to store min/max values for columns, Bloom filters, and so on, since the file may fit in memory. The best storage for this case is an uncompressed text file. You can also try simply gzipping your source file and checking its size: a very small gzipped file may be bigger than the uncompressed one. The bigger the file, the greater the benefit from compression and from using ORC.
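One way to compare the on-disk size of the two tables from within Hive is sketched below. The table names small_text and small_orc are hypothetical, and the sketch assumes that gathering basic statistics populates totalSize for both tables in the Hive version in use.

```sql
-- Hypothetical names: small_text (external, TEXTFILE) and small_orc (ORC).
ANALYZE TABLE small_text COMPUTE STATISTICS;
ANALYZE TABLE small_orc  COMPUTE STATISTICS;

-- Look for totalSize and numFiles under "Table Parameters" in the output.
DESCRIBE FORMATTED small_text;
DESCRIBE FORMATTED small_orc;
```

Repeating the comparison with a much larger source file should show the ORC overhead becoming proportionally smaller.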

Performance improvement for GZ to ORC File

Please let me know if there is any faster way to move *.gz files to an ORC table directly.
1) Another thought on going from a *.gz file to a non-partitioned table: rather than creating an external table and dumping the gz file data into it, is there any other approach for quicker loading from gz to an external table? We are considering two other approaches. The first: can we have ADF with a custom .exe uncompress the *.gz file and upload it to Azure Blob?
For example, if the *.gz file is 10 GB, the uncompressed file is 120 GB, and uncompressing takes 40 minutes, how do we upload this uncompressed 120 GB data file to Azure Blob? Do we need the Azure Blob SDK for uploading, or will ADF execute the .exe where the data is located, i.e. on the cluster that holds the Blob data? (If ADF executes the .exe on a cluster in the Azure Blob Storage data center, there would be no network cost or network latency, and uploading the uncompressed data would take very little time.) Is this possible with ADF? Would it be the right approach?
If the above approach doesn't work, and we instead create an MR solution where the mapper uncompresses the gz file and uploads it to Azure Blob Storage, will there be any performance improvement, given that I only need to create an external table pointing to the uncompressed file? The MR job would execute at the Azure Blob Storage location.
We see ORC and ORC with partitions performing the same (sometimes we see a minimal difference between partitioned and non-partitioned ORC). Will ORC with partitions perform better than plain ORC? Will ORC with partitions and bucketing perform better than ORC with partitions alone? Each partitioned ORC file is close to 50-100 MB, and each non-partitioned ORC file is 30-50 MB.
Note: 120 GB of uncompressed data compresses to 17 GB in ORC file format.
The only way I know to move from gz to the ORC file format is by writing a Hive query. Using a compressed format will always be slower, since the data needs to be decompressed before conversion. You may want to experiment with these parameters, as shown here, to see if they speed up the move from gz to ORC.
For question #1 above, you may want to follow up with the Azure Data Factory team.
For question #3, I have not tried it, but computing on uncompressed data should be faster than computing on compressed data.
For #4, it depends on the field you are partitioning on. Make sure your key is not under-partitioned (i.e. does not result in too few partitions). Also make sure you add SORTED BY to define a secondary partitioning key. Refer to this link for more details.
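As an illustration of the partitioning and sorting advice, an ORC table could be declared along these lines. The schema, the choice of keys, and the bucket count of 32 are all hypothetical and would need tuning against the real data.

```sql
-- Hypothetical schema: partition on a date column, then bucket and sort
-- on a higher-cardinality key that is commonly used in filters and joins.
CREATE TABLE events_orc (
  user_id BIGINT,
  event   STRING,
  payload STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```

Whether this outperforms a plain partitioned table depends on the queries: the layout mainly helps when they filter or join on the bucketing and sort key.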
Hive has native support for compressed formats, including GZIP, BZIP2 and deflate. So you can upload the .gz files to Azure Blob and create an external table over those files directly. You can then create an ORC table and load the data into it. Normally Hive runs faster with compressed files; please refer to Compression in Hadoop by MSIT for details.
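The flow described above might look like the following sketch. The storage location, table names, columns, and tab delimiter are hypothetical placeholders rather than values from the question.

```sql
-- Hypothetical location, table names, and columns.
-- The external table points straight at the uploaded .gz text files.
CREATE EXTERNAL TABLE logs_gz (
  ts      STRING,
  level   STRING,
  message STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'wasb://<container>@<account>.blob.core.windows.net/logs/';

CREATE TABLE logs_orc (
  ts      STRING,
  level   STRING,
  message STRING
)
STORED AS ORC;

-- Decompression and conversion to ORC happen inside this one query.
INSERT OVERWRITE TABLE logs_orc
SELECT ts, level, message FROM logs_gz;
```

Because gzip is not a splittable format, each .gz file is read by a single mapper, so several smaller .gz files may convert faster than one large one.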