I am executing the below command in the Hive console.
create table departments_parquet stored as parquet tblproperties("parquet.compression"="GZIP") as select * from departments;
I see the output file created in parquet format as below.
-rwxrwxrwx 1 cloudera supergroup 463 2017-06-17 14:55 /user/hive/warehouse/departments_parquet/000000_0
The relevant Hive properties are set as:
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive.exec.compress.output=true;
I expected the output file name to be 000000_0.gz.
Please help me get the final output as a compressed gzip file.
Thanks.
Columnar storage uses various compression techniques simultaneously, and page compression is just one of them. Therefore, although the files contain gzipped data parts, they are not gzipped files.
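If you want to confirm that the column chunks really are GZIP-compressed, one quick check (assuming parquet-tools is installed; the local path is just an example) is:
hdfs dfs -get /user/hive/warehouse/departments_parquet/000000_0 /tmp/000000_0   # copy the Parquet file out of HDFS
parquet-tools meta /tmp/000000_0   # the column chunk metadata should report GZIP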
Good morning everyone. I have a GCS bucket which has files that have been transferred from our Amazon S3 bucket. These files are in .gz.parquet format. I am trying to set up a transfer from the GCS bucket to BigQuery with the transfer feature, but I am running into issues with the Parquet file format.
When I create a transfer and specify the file format as Parquet, I receive an error stating that the data is not in Parquet format. When I try specifying the format as CSV, weird values appear in my table, as shown in the linked image.
I have tried the following URIs:
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet. FILE FORMAT: PARQUET. RESULTS: FILE NOT IN PARQUET FORMAT.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet. FILE FORMAT: PARQUET. RESULTS: FILE NOT IN PARQUET FORMAT.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet. FILE FORMAT: CSV. RESULTS: TRANSFER DONE, BUT WEIRD VALUES.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet. FILE FORMAT: CSV. RESULTS: TRANSFER DONE, BUT WEIRD VALUES.
Does anyone have any idea on how I should proceed? Thank you in advance!
There is dedicated documentation explaining how to load Parquet data from a Cloud Storage bucket into BigQuery, linked below. Could you please go through it and let us know if it still does not solve your problem?
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
Regards,
Anbu.
Judging by your URIs, the page you are looking for is this one, for loading Hive-partitioned Parquet files into BigQuery.
You can try something like below in Cloud Shell:
bq load --source_format=PARQUET --autodetect \
--hive_partitioning_mode=STRINGS \
--hive_partitioning_source_uri_prefix=gs://bucket-name/folder-1/folder-2/ \
dataset.table gcs_uris
I am exporting a table of size >1 GB from BigQuery into GCS, but it splits the output into very small files of 2-3 MB. Is there a way to get bigger files, like 40-60 MB per file, rather than 2-3 MB?
I do the export via the API:
https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
https://cloud.google.com/bigquery/docs/reference/v2/jobs
The source table size is 60 GB in BigQuery. I extract the data with format NEWLINE_DELIMITED_JSON and GZIP compression:
destination_cloud_storage_uris=[
'gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz'
]
Are you trying to export a partitioned table? If yes, each partition is exported as a separate table, and that might cause the small files.
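If it is a date-partitioned table and you want one export per partition, a rough sketch using the partition decorator (project, dataset, table and the date are placeholders):
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON 'project:dataset.table$20170601' gs://bucket_name/main_folder/partition_date=20170601/part-*.gz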
I ran the export from the CLI with each of the following commands and in both cases received files of 49 MB:
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
bq extract --compression=GZIP project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
Please add more details to the question so we can provide specific advice: how exactly are you requesting this export?
Nevertheless, if you have many files in GCS and you want to merge them all into one, you can do:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
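For example, to merge the first two shards from the layout in your question into one object (the shard names are assumed; compose accepts at most 32 source objects per call, and since each shard is a complete gzip file, the composed object can still be decompressed with gzip):
gsutil compose \
  gs://bucket_name/main_folder/partition_date=xxxxxxx/part-000000000000.gz \
  gs://bucket_name/main_folder/partition_date=xxxxxxx/part-000000000001.gz \
  gs://bucket_name/main_folder/partition_date=xxxxxxx/merged.gz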
I have a small file (2 MB). I created an external Hive table over this file (stored as TEXTFILE). I created another table (stored as ORC) and copied the data from the previous table. When I checked the size of the data in the ORC table, it was more than 2 MB.
ORC is a compressed file format, so shouldn't the data size be less?
As of Hive 0.14, users can request an efficient merge of small ORC files together by issuing a CONCATENATE command on their table or partition. The files will be merged at the stripe level without reserialization.
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
It's because your source file is too small. ORC has a complex structure with internal indexes, headers, footers and a postscript, and compression codecs also add structures of their own.
See this for details: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileFormat
All these supporting structures consume more space than the data itself. For such a small file you really do not need to store min/max values for columns, Bloom filters, etc., since your file may fit in memory. The best storage for this case is an uncompressed text file. You can also try just gzipping your source file and checking its size: a very small gzipped file may be bigger than the uncompressed one. The bigger the file, the more benefit you will get from compression and from using ORC.
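A quick way to see this on your own file (the path is just an example):
gzip -k /tmp/source_data.txt   # -k keeps the original; use 'gzip -c file > file.gz' if your gzip lacks -k
ls -lh /tmp/source_data.txt /tmp/source_data.txt.gz   # compare the two sizes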
We have quite a few Impala tables defined, and we assume we are using Snappy compression (Parquet files).
However, nobody really knows what compression type we are actually using on the existing tables.
The Impala docs don't seem to specify how to get the compression type from an existing table.
Is there a way to find the used compression type via impala?
As of right now there is no command in Impala that will tell you the type of compression being used in a table stored as Parquet, but there is a workaround. What you can do is look at one of the Parquet files within the table and then use the parquet-tools meta command to see the compression being used.
-- step1) run hdfs dfs -ls to determine the location and name for a parquet file
hdfs dfs -ls /yourTableLocationPath
-- step2) parquet-tools really only works locally right now so you will need to copy the file to a local directory
hdfs dfs -get /yourTableLocationPath/yourFileName /yourLocalPath
-- step3) run parquet-tools meta command
parquet-tools meta /yourLocalPath/yourFileName
The output of the parquet-tools meta command will show you the type of compression being used under the row group output.
Please let me know: is there any faster way to move (*.gz) data into an ORC table directly?
1) Another thought: for loading from a *.gz file into a non-partitioned table, rather than creating an external table and dumping the gz file data into it, is there any other approach for quicker loading from gz to an external table? We are thinking of two other approaches, for example: can we have ADF run a custom .exe to uncompress the *.gz file and upload the result to Azure Blob?
For example: if the *.gz file is 10 GB and the uncompressed file is 120 GB, and uncompressing takes 40 minutes, how do we upload this uncompressed 120 GB data file to Azure Blob? Do we need the Azure Blob SDK for uploading, or will ADF execute the .exe at the location where the data is present, i.e. exactly at the cluster which holds the Blob data? (If ADF executes the .exe at the Azure Blob Storage data center's cluster, then there will be no network cost, no network latency, and the time to upload the uncompressed data will be very low.) So is this possible with ADF? Would it be the right approach?
3) If the above approach doesn't work: if we create an MR solution where the mapper uncompresses the gz file and uploads it to Azure Blob Storage, will there be any performance improvement, since I would just need to create an external table pointing to the uncompressed file? The MR job would be executing at the Azure Blob Storage location.
4) We see ORC and ORC with partitioning performing about the same (sometimes we see a minimal difference between partitioned and non-partitioned ORC). Will partitioned ORC perform better than plain ORC? Will partitioned and bucketed ORC perform better than partitioned ORC? I see each partitioned ORC file is close to 50-100 MB, while without partitioning each file is 30-50 MB.
Note: 120 GB of uncompressed data compresses to 17 GB in ORC file format.
The only way that I know of to move from gz to the ORC file format is by writing a Hive query. Using a compressed format will always be slower, since the data needs to be decompressed before conversion. You may want to play around with these parameters, as shown here, to see if it speeds up moving from gz to ORC.
For question #1 above, you may want to follow up with Azure Data Factory team.
For question #3, I have not tried it but computing on uncompressed data should be faster than using compressed data.
For #4, it depends on which field you are partitioning on. Make sure your key does not under-partition the data (i.e. result in too few partitions). Also ensure you add SORTED BY to get a secondary sort key. Refer to this link for more details.
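As a rough sketch of what a partitioned ORC table with a secondary bucketing/sort key can look like (table name, columns and bucket count are made up for illustration), run from the shell:
hive -e "
CREATE TABLE sales_orc (id BIGINT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS
STORED AS ORC;"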
Hive has native support for compressed formats, including GZIP, BZIP2 and DEFLATE, so you can upload the .gz files to Azure Blob and create an external table over those files directly. You can then create a table stored as ORC and load the data into it. Normally Hive runs faster with compressed files; please refer to Compression in Hadoop by MSIT for details.
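A minimal sketch of that flow (table names, columns and the wasb:// location are placeholders, and the statements are wrapped in hive -e so they can run from the cluster shell):
hive -e "
CREATE EXTERNAL TABLE staging_txt (id BIGINT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://mycontainer@myaccount.blob.core.windows.net/incoming/';
CREATE TABLE final_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM staging_txt;"
Hive reads the .gz text files under that location transparently, and the CTAS writes them back out as compressed ORC.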