Apache NiFi 1.7.1 PutHive3Streaming to Hive 3.0 - managed table compression

I am using PutHive3Streaming to load Avro data from NiFi to Hive. As a sample, I am sending 10 MB of JSON data to NiFi, converting it to Avro (which reduces the size to 118 KB) and using PutHive3Streaming to write to a managed Hive table. However, I see that the data is not compressed in Hive.
hdfs dfs -du -h -s /user/hive/warehouse/my_table*
32.1 M /user/hive/warehouse/my_table (<-- replication factor 3)
At the table level, I have:
STORED AS ORC
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.compression.strategy'='SPEED',
'orc.create.index'='true',
'orc.encoding.strategy'='SPEED',
'transactional'='true');
and I have also enabled:
hive.exec.dynamic.partition=true
hive.optimize.sort.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
avro.output.codec=zlib
hive.exec.compress.intermediate=true;
hive.exec.compress.output=true;
It looks like, despite this, compression is not enabled in Hive. Any pointers on enabling it?

Hive does not compress data inserted through the Streaming Data Ingest API.
The data is compressed when compaction runs.
See https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2#StreamingDataIngestV2-APIUsage
If you don't want to wait, use ALTER TABLE your_table PARTITION(key=value) COMPACT "MAJOR".
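For reference, a minimal sketch of forcing and then monitoring a major compaction from Beeline or the Hive CLI (the table and partition names here are placeholders):
-- queue a major compaction for one partition; it runs asynchronously in the metastore's compactor worker threads
ALTER TABLE my_table PARTITION (load_date='2019-01-01') COMPACT 'major';
-- inspect queued/running/finished compactions and their state
SHOW COMPACTIONS;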

Yes, #K.M is correct insofar as compaction needs to be used.
a) Hive compaction strategies need to be used to manage the size of the data. Only after compaction is the data compressed and encoded. Below are the default properties for auto-compaction; per-table overrides are sketched just after this list.
hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1
b) Despite this being the default, one of the challenges I had with compaction is that the delta files written by NiFi were not accessible (deletable) by the compaction cleaner (after the compaction itself). I fixed this by making the hive user the table owner as well as giving the hive user rights to the delta files, as per the standards laid out by Kerberos.
c) Another challenge I continue to face is triggering auto-compaction jobs. In my case, as delta files continue to be streamed into Hive for a given table/partition, the very first major compaction job completes successfully, deletes the deltas and creates a base file. But after that point, auto-compaction jobs are not triggered, and Hive accumulates a huge number of delta files (which have to be cleaned up manually <--- not desirable).
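If changing the global thresholds is not practical, the Hive Transactions documentation also describes per-table overrides set via TBLPROPERTIES. A sketch, with my_table as a placeholder; the exact property names are worth verifying against the docs for your Hive version:
-- lower the compaction trigger for just this table (prefix the global property with "compactorthreshold.")
ALTER TABLE my_table SET TBLPROPERTIES (
  'compactorthreshold.hive.compactor.delta.num.threshold'='5',
  'compactorthreshold.hive.compactor.delta.pct.threshold'='0.05'
);
-- make sure auto-compaction has not been disabled for the table
ALTER TABLE my_table SET TBLPROPERTIES ('NO_AUTO_COMPACTION'='false');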

Related

HIVE 3.1 - Automatic Major compaction triggered only once per partition

I have an ACID-enabled, partitioned, bucketed Hive table to which I am writing using a streaming client. I see that several delta files are created as the records are written into partitions. I wanted to enable auto-compaction and tried the following base and specific params:
hive.support.concurrency=true
hive.enforce.bucketing=true
hive.exec.dynamic.partition.mode=nonstrict
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on=true
hive.compactor.worker.threads=1
with,
hive.compactor.initiator.on=true
hive.compactor.cleaner.run.interval=5000ms
hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1f
hive.compactor.abortedtxn.threshold=1000
hive.compactor.initiator.failed.compacts.threshold=2
I did the above in hopes of enabling major compaction. However, I see that major compaction is triggered automatically only once: it runs once and creates a base file. Once a base file is created for a number of delta files within that partition, major compaction is not scheduled again, even though more delta files have been streamed into the partition since. How do I enable automatic major compaction for a table? Has anyone faced similar issues before?
I have the same issue, and the only solution I found is to run a manual compaction for each partition.
ALTER TABLE myTable PARTITION (myPartitionColumn='myPartitionValue') COMPACT 'major';
I am still trying to figure out why this happens.

Hive Table deletion and query processing

As per my understanding of Hive concepts, if we load a dataset into a Hive table, the data file moves from the source path to the Hive warehouse within HDFS, and HDFS is set to three replicas for the data.
These questions might look silly, but as I am a beginner, I want to clear my doubts.
My questions are:
1) If I delete the Hive table, will it delete the data file from the Hive warehouse only, or the other two replicas in HDFS as well?
2) If we run a query on the Hive table, will that query be processed in a distributed way?
Say one data file is 1 GB in size (i.e. 8 blocks x 128 MB); with a replication factor of three, there would be 24 blocks in total for this file.
Will our Hive query be distributed among all the data blocks, or will it be processed on the Hive warehouse blocks only?
Thanks in advance.
If you do "load data inpath" from a HDFS path the data will be moved from source to destination HDFS path,
If you do "load data local inpath", it doesn't move data from local to HDFS path, instead it copies
For your questions:
If you delete a file in HDFS, all the replicas are deleted.
If you have a 1 GB file (8 blocks) with a replication factor of 3, when you trigger the query in the Hive CLI it converts your query to MapReduce. It processes only 8 blocks; in case of a datanode failure in the triggered job, it accesses the replica on a different node and processes the data there (speculative execution).

How to avoid reading old files from S3 when appending new data?

Once every 2 hours, a Spark job runs to convert some tgz files to Parquet.
The job appends the new data to an existing Parquet dataset in S3:
df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")
In the spark-submit output I can see that significant time is being spent reading old Parquet files, for example:
16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key
'foo.parquet/id=123/day=2016-11-26/part-r-00003-e80419de-7019-4859-bbe7-dcd392f6fcd3.snappy.parquet'
seeking to position '149195444'
It looks like this operation takes less than 1 second per file, but the number of files increases over time (each append adds new files), which makes me think that my code will not be able to scale.
Any ideas how to avoid reading old Parquet files from S3 if I just need to append new data?
I use EMR 4.8.2 and DirectParquetOutputCommitter:
sc._jsc.hadoopConfiguration().set('spark.sql.parquet.output.committer.class', 'org.apache.spark.sql.parquet.DirectParquetOutputCommitter')
I resolved this issue by writing the DataFrame to HDFS on EMR and then using s3-dist-cp to upload the Parquet files to S3.
Switch this over to using dynamic partition overwrite mode with:
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
Also, avoid the DirectParquetOutputCommitter; leave that committer setting at its default instead - you will achieve better results in terms of speed using the EMRFS file committer.
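A hedged Spark SQL sketch of the dynamic partition overwrite suggestion above, assuming the Parquet location is registered as a datasource table foo partitioned by (id, day) and the new batch as a view new_batch (both names are hypothetical). With this mode, only the partitions present in the incoming data are rewritten, so old partitions are neither re-read nor touched:
SET spark.sql.sources.partitionOverwriteMode=dynamic;
-- rewrites only the (id, day) partitions that appear in new_batch
INSERT OVERWRITE TABLE foo PARTITION (id, day)
SELECT col1, col2, id, day FROM new_batch;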

Loading or pointing to multiple parquet paths for data analysis with hive or prestodb

I have a couple of Spark jobs that produce Parquet files in AWS S3. Every once in a while I need to run some ad-hoc queries on a given date range of this data. I don't want to do this in Spark because I want our QA team, which has no knowledge of Spark, to be able to do it. What I would like to do is spin up an AWS EMR cluster, load the Parquet files into HDFS and run my queries against it. I have figured out how to create tables with Hive and point them at one S3 path, but that limits my data to only one day, because each day of data has multiple files under a path like
s3://mybucket/table/date/(parquet files 1 ... n).
So problem one is to figure out how to load multiple days of data into Hive, i.e.
s3://mybucket/table_a/day_1/(parquet files 1 ... n).
s3://mybucket/table_a/day_2/(parquet files 1 ... n).
s3://mybucket/table_a/day_3/(parquet files 1 ... n).
...
s3://mybucket/table_b/day_1/(parquet files 1 ... n).
s3://mybucket/table_b/day_2/(parquet files 1 ... n).
s3://mybucket/table_b/day_3/(parquet files 1 ... n).
I know Hive can support partitions, but my S3 files are not set up that way.
I have also looked into PrestoDB, which looks to be the favorite tool for this type of data analysis. The fact that it supports ANSI SQL makes it a great tool for people who have SQL knowledge but know very little about Hadoop or Spark. I did install it on my cluster and it works great. But it looks like you can't really load data into your tables and you have to rely on Hive to do that part. Is this the right way to use PrestoDB? I watched a Netflix presentation about their use of PrestoDB with S3 in place of HDFS. If this works it's great, but I wonder how the data is moved into memory. At what point are the Parquet files moved from S3 to the cluster? Do I need a cluster that can load the entire dataset into memory? How is this generally set up?
You can install Hive and create Hive tables over your data in S3, as described in this blog post: https://blog.mustardgrain.com/2010/09/30/using-hive-with-existing-files-on-s3/
Then install Presto on AWS and configure it to connect to the Hive catalog you set up previously. You can then query your data on S3 with Presto using SQL.
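To make this concrete, a sketch of an external Hive table laid over the S3 layout from the question (column names and types are placeholders; adjust to your schema, and use s3a:// instead of s3:// outside EMR):
CREATE EXTERNAL TABLE table_a (
  col1 STRING,
  col2 BIGINT
)
PARTITIONED BY (day STRING)
STORED AS PARQUET
LOCATION 's3://mybucket/table_a/';
-- The folders are named day_1, day_2, ... rather than day=..., so Hive cannot
-- auto-discover them; each day is registered explicitly with its own location:
ALTER TABLE table_a ADD IF NOT EXISTS
  PARTITION (day='day_1') LOCATION 's3://mybucket/table_a/day_1/'
  PARTITION (day='day_2') LOCATION 's3://mybucket/table_a/day_2/';
Presto then sees the same table and partitions through the Hive catalog and reads the Parquet files from S3 on demand, rather than loading the whole dataset into memory up front.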
Rather than trying to load multiple files, you could instead use the API to concatenate the days you want into a single object, which you can then load through the means you already mention.
AWS has a blog post highlighting how to do this exact thing purely through the API (without downloading + re-uploading the data):
https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby

Does Parquet predicate pushdown work on S3 using Spark (non-EMR)?

Just wondering if Parquet predicate pushdown also works on S3, not only HDFS, specifically if we use Spark (non-EMR).
Further explanation would be helpful, since it might involve some understanding of distributed file systems.
I was wondering this myself, so I just tested it out. We use EMR clusters and Spark 1.6.1.
I generated some dummy data in Spark and saved it as a parquet file locally as well as on S3.
I created multiple Spark jobs with different kind of filters and column selections. I ran these tests once for the local file and once for the S3 file.
I then used the Spark History Server to see how much data each job had as input.
Results:
For the local parquet file: The results showed that the column selection and filters were pushed down to the read as the input size was reduced when the job contained filters or column selection.
For the S3 parquet file: The input size was always the same as that of the Spark job that processed all of the data. None of the filters or column selections were pushed down to the read; the Parquet file was always completely loaded from S3, even though the query plan (.queryExecution.executedPlan) showed that the filters were pushed down.
I will add more details about the tests and results when I have time.
Yes. Filter pushdown does not depend on the underlying file system. It only depends on spark.sql.parquet.filterPushdown and the type of filter (not all filters can be pushed down).
See https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L313 for the pushdown logic.
Here are the keys I'd recommend for s3a work:
spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
For committing the work, use the S3A "zero-rename committer" (Hadoop 3.1+) or the EMR equivalent. The original FileOutputCommitters are slow and unsafe.
Recently I tried this with Spark 2.4, and it seems that predicate pushdown works with S3.
This is the Spark SQL query:
explain select * from default.my_table where month = '2009-04' and site = 'http://jdnews.com/sports/game_1997_jdnsports__article.html/play_rain.html' limit 100;
And here is the part of output:
PartitionFilters: [isnotnull(month#6), (month#6 = 2009-04)], PushedFilters: [IsNotNull(site), EqualTo(site,http://jdnews.com/sports/game_1997_jdnsports__article.html/play_ra...
Which clearly shows that PushedFilters is not empty.
Note: the table used was created on top of AWS S3.
Spark uses the HDFS Parquet and S3 libraries, so the same logic works.
(And in Spark 1.6 they added an even faster shortcut for flat-schema Parquet files.)