Performance improvement for GZ to ORC File

Performance improvement for GZ to ORC File - hive

Please let me know Is there any faster way to move (*.gz) to ORC table directly.
1)Another thought, from *.gz file to NON Partition table, Rather than creating External Table and dumping gz file data to External Table. Is there any other approach for quicker loading from Gz to External Table. We are thinking of 2 other approaches like Can we have ADF with Custom .exe to uncompress *.gz file and upload to Azure Blob.
For Example : If the *.Gz File is 10 GB and Un Compressed File is 120 GB , time it takes to uncompress is 40 Mins, How do we upload this un compressed 120 GB data File to Azure Blob. Do we need to have Azure Blob SDK for uploading or Will ADF Executes .exe at location where data is present i.e. exactly at the cluster which holds Blob Data. ( If ADF executes .exe at Azure Blob Storage Data Center’s Cluster, then there will be no Network cost, No Network latency and upload time to upload Uncompressed data will be very less). So Is it possible with ADF?. Will it be right approach ?
If above approach doesn’t work, If we create MR Solution where Mapper is going to UnCompress Gz File and Uploads to Azure Blob Storage, will there be any performance improvement, since I just need to create External Table pointing to uncompressed File. MR will be executing at Azure Blob storage location.
We see ORC and ORC with Partition are performing at same (sometimes we see minimal difference b/w ORC partition and ORC without partition). Will ORC With Partition perform better than ORC . Will ORC With Partition Bucketing performs better than ORC Partition ?. I see each ORC Partition File is close 50-100 MB and ORC With Out Partition (each File size 30-50 MB).
**Note: 120 GB of Un Compressed Data is compressed to 17 GB of ORC File Format

The only way that I know to move from gz to ORC file format is by writing a Hive query. Using a compressed format will always be slower since it needs to be decompressed before conversion. You may want to play around with these parameters as shown here, to see if it speed up moving from gz to orc.
For question #1 above, you may want to follow up with Azure Data Factory team.
For question #3, I have not tried it but computing on uncompressed data should be faster than using compressed data.
For #4, depends on what the field you are partitioning on. Make sure your key is not under partitioned (i.e. results in too few partitions). Also ensure you add sorted by to add a secondary partitioning key. Refer to this link for more details.

Hive has native support for compressed format, including GZIP, BZIP2 and deflate. So you can upload .gz files to Azure Blob and create external table with those files directly. And then you can create table with ORC and load the data there. Normally Hive runs faster with compressed files, please refer to Compression in Hadoop by MSIT for details.

Related

Which file format I have to use which supports appending?

Currently We use orc file format to store the incoming traffic in s3 for fraud detection analysis
We did choose orc file format for following reasons
compression
and ability to query the data using athena
Problem :
As the orc files are read only as soon and we want to update the file contents constantly every 20 minutes
which implies we
need to download the orc files from s3,
read the file
write to the end of file
and finally upload it back to s3
This was not a problem but as the data grows significantly every day ~2GB every day. It is highly costly process to download 10Gb files read it and write and upload it
Question :
Is there any way to use another file format which also offers appends/inserts and can be used by athena to query?
From this article it says avro is file format, but not sure
If athena can be used for querying ?
any other issues ?
Note: My skill on big data technologies is on beginner level

If your table is not partitioned, can simply copy (aws s3 cp) your new orc files to the target s3 path for the table and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. At the end of copying new files to the partition, you need to add or update that partition into Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'

Export table from Bigquery into GCS split sizes

I am exporting a table of size>1GB from Bigquery into GCS but it splits the files into very small files of 2-3 MB. Is there a way to get bigger files like 40-60MB per files rather than 2-3 MB.
I do the expport via the api
https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
https://cloud.google.com/bigquery/docs/reference/v2/jobs
The source table size is 60 GB on Bigquery. I extract the data with format - NewLine_Delimited_Json and GZIP compression
destination_cloud_storage_uris=[
'gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz'
]

Are you trying to export partitioned table? If yes, each partition is exported as different table and it might cause small files.
I run the export in cli with each of the following commands and received in both cases files of size 49 MB:
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
bq extract --compression=GZIP project:dataset.table gs://bucket_name/path5-component/file-name-*.gz

Please add more details to the question so we can provide specific advice: How are you exactly asking for this export?
Nevertheless, if you have many files in GCS and you want to merge them all into one, you can do:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose

loading a pg_dump off of s3 into redshift

I'm trying to load a complete database dump into Redshift. Is there a single command to restore the data from a pg_dump living on s3 into Redshift? If not, what are the best steps for tackling this?
Thanks

If you have a non compressed pg_dump this should be possible using a psql command (you may need to manually edit to get the right syntax, depending on your versions and options set).
However this is a very inefficient and slow way to load redshift and I do not recommend it. If your tables are large it could take days or weeks!
What you need to do is this:
create target tables on redshift based upon the source table, but
considering sort keys and distribution.
unload you postgres source tables into csv files using postgres
"copy" command
If the source csv files are very big (e.g. more than say 100MB),
consider splitting these into separate files as they will load
faster (redshift will parallelize)
gzip the csv files (recommended but not essential)
upload these csv files to s3, with a separate folder per table
load the data into redshift from s3 by using the redshift copy
command

File size in hive with different file formats

I have a small file (2MB). I created a external hive table over this file (stored as textfile). I created another table (stored as ORC) and copied the data from the previous table. When I checked the size of data in ORC table, it was more than 2MB.
ORC is a compressed file format, so shouldn't the data size be less?

As of Hive 0.14, users can request an efficient merge of small ORC files together by issuing a CONCATENATE command on their table or partition. The files will be merged at the stripe level without reserialization.
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;

It's because your source file is too small. ORC has complex structure with internal indexes, headers, footers, postscript, compressing codecs also add some structures, etc, etc.
See this for details: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileFormat
All these supporting structures consume more space than the data. For such small file you really do not need to store min/max values for columns, do not need blum filters, etc since your file may fit in memory. The best storage for this case is text file uncompressed. You can also try just to gzip your source file and check it's size. Too small gzipped file may be bigger than uncompressed. The bigger the file the more benefit from compressing and using orc will be.

what is the impact of small hive tables?

In one of my usecases, several tables were created out of a bunch of csv files. Each csv file was about 50-80MB.Table is configured to contain 2 buckets. Tables are stored in ORC format. However, when I see in hive warehouse directory in hdfs, it is only about 4 MB - 5MB. I have already brought down the hive block size from its default to 64MB. My concern here is that small files in hdfs put pressure on Namenode. Similarly, is it an issue with small hive tables? Can I still bring down size of hive block?

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas