How to merge small files saved to Hive by Spark SQL?

Like the code below, I insert a DataFrame into a Hive table. The resulting HDFS files of the Hive table contain too many small files. How can I merge them when saving to Hive?
myDf.write.format("orc").partitionBy("datestr").insertInto("myHiveTable")
When there are 100 tasks, it will produce 100 small files.
Is using coalesce on the DataFrame a good idea?
myDf.coalesce(3).write.format("orc").partitionBy("datestr").insertInto("myHiveTable")
Why do the Hive configurations below not work?
sqlContext.sql("set hive.merge.mapfiles=true")
sqlContext.sql("set hive.merge.sparkfiles=false")
sqlContext.sql("set hive.merge.smallfiles.avgsize=16000000")
sqlContext.sql("set hive.merge.size.per.task=256000000")
Thanks a lot for any help.

I encountered this problem and found issue-24940.
Using /*+ COALESCE(numPartitions) */ or /*+ REPARTITION(numPartitions) */ in a Spark SQL query will control the number of output files.
In my practice I recommend the second hint (REPARTITION) to users, because it generates a new stage to do this job, whereas the first one doesn't, which can effectively kill the job because the last stage is left with too few tasks.
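For example, a minimal sketch (in Scala) assuming a Spark version that supports these hints, dynamic partitioning enabled, and a hypothetical source table mySourceTable; only myHiveTable and datestr come from the question above:
spark.sql("""
  INSERT OVERWRITE TABLE myHiveTable PARTITION (datestr)
  SELECT /*+ REPARTITION(3) */ *
  FROM mySourceTable
""")
The hint adds a shuffle into 3 tasks before the write, so each datestr directory receives at most 3 output files.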

That's because Spark SQL writes a number of files that corresponds to the number of Spark partitions, even if the dynamic partitioning config is on.
I faced the same problem. In my view, the configurations mentioned above are only applicable to Hive on the MapReduce engine: in my case, HiveQL commands work well (small files are merged).
See Hive architecture for more detail.
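If you need to stay in the DataFrame API, a hedged alternative sketch (assuming Spark 2.x and reusing the question's names) is to repartition by the Hive partition column before the insert, so each datestr directory ends up with a single file; insertInto writes into the table's existing partitioning, so partitionBy isn't needed here:
import org.apache.spark.sql.functions.col

myDf
  .repartition(col("datestr"))   // all rows of one datestr land in one task
  .write
  .format("orc")
  .insertInto("myHiveTable")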

Related

AWS Glue: Is it possible to pull only specific data from a database?

I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of the row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some Built-in Transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
You haven't mentioned the type of database that you are trying to connect to. Anyway, for JDBC connections Spark has a query option, in which you can issue the usual SQL query to fetch only the rows you need.
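A minimal sketch of that approach, in Scala to match the rest of this page (the same options work from PySpark); Spark 2.4+ is assumed for the query option, and the JDBC URL, credentials, table name events, and timestamp column created_at are placeholders whose filter syntax depends on the source database:
val recent = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("user", "etl_user")
  .option("password", "secret")
  .option("query",
    "SELECT * FROM events WHERE created_at >= now() - interval '24 hours'")
  .load()

// Write only those rows out as CSV.
recent.write.option("header", "true").csv("s3://my-bucket/exports/events_last_24h")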

How to fix the block size in external Databricks tables?

I have a SQL notebook to change data and insert it into another table.
I have a situation where I'm trying to change the stored block size in blob storage: I want fewer and bigger files. I have tried changing a lot of parameters.
So I found a behaviour.
When I run the notebook, the command creates files of almost 10 MB each.
If I create the table internally in Databricks and run another command
create table external_table as
select * from internal_table
the files are almost 40 MB each...
So my question is:
Is there a way to fix the minimal block size in external Databricks tables?
Are there best practices for transforming data in a SQL notebook? Like transforming all the data and storing it locally, and after that moving the data to the external source?
Thanks!
Spark doesn't have a straightforward way to control the size of output files. One method people use is to call repartition or coalesce with the number of desired files. To use this to control the size of output files, you need an idea of how many files you want to create; e.g. to create 10 MB files, if your output data is 100 MB, you could call repartition(10) before the write command.
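A minimal sketch of that idea, assuming the internal_table and external_table names from the question and roughly 100 MB of output data (the resulting file sizes are approximate, since they depend on row distribution and compression):
spark.table("internal_table")
  .repartition(10)               // ~10 files of ~10 MB for ~100 MB of data
  .write
  .mode("overwrite")             // with insertInto this behaves like INSERT OVERWRITE
  .insertInto("external_table")
For fewer, bigger files, lower the partition count, e.g. .repartition(3).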
It sounds like you are using Databricks, in which case you can use the OPTIMIZE command for Delta tables. Delta's OPTIMIZE will take your underlying files and compact them for you into approximately 1GB files, which is an optimal size for the JVM in large data use cases.
https://docs.databricks.com/spark/latest/spark-sql/language-manual/optimize.html
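For instance, a short sketch assuming external_table is a Delta table registered in the metastore:
// Compacts the table's small files into larger ones (about 1 GB by default).
spark.sql("OPTIMIZE external_table")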

Spark HDFS Direct Read vs Hive External table read

We have a couple of HDFS directories in which data is stored in delimited format. These directories are created as one directory per ingestion date and added as partitions to a Hive external table.
Directory structure:
/data/table1/INGEST_DATE=20180101
/data/table1/INGEST_DATE=20180102
/data/table1/INGEST_DATE=20180103 etc.
Now we want to process this data in a Spark job. From the program I can read these HDFS directories directly by giving the exact directory path (Option 1), or I can read from Hive into a DataFrame and process it (Option 2).
I would like to know if there is any significant difference between Option 1 and Option 2. Please let me know if you need any other details.
Thanks in Advance
If you want to select a subset of the columns, that is only possible via spark.sql. In your use case I don't think there will be a significant difference.
With Spark SQL you get partition pruning automatically.
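A hedged sketch of the two options, reusing the question's paths, assuming the external table is named table1 and the files are pipe-delimited (the delimiter and schema handling are assumptions):
// Option 1: read the HDFS directory directly; you supply the delimiter and schema
// yourself, and partition pruning is manual (you choose the paths to read).
val direct = spark.read
  .option("delimiter", "|")
  .csv("/data/table1/INGEST_DATE=20180101")

// Option 2: read via the Hive external table; the metastore provides the schema,
// and the filter on INGEST_DATE lets Spark prune to the matching directory.
val viaHive = spark.sql("SELECT * FROM table1 WHERE INGEST_DATE = '20180101'")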

How to handle Hive locking across Hive and Presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts on Presto. Naturally, we're running into windows of time in which users hit an incomplete data set, because Presto ignores the locks.
The options I can think of:
Fork the presto-hive connector to support Hive S and X locks appropriately. This isn't too bad, but time-consuming to do properly.
Swap the table location in the Hive metastore once an insert-overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit their location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; then, once a new partition is fully written, alter the Hive table to see it. We can then have views on top of the data that properly reconcile the latest version of each row (a sketch of this follows the list).
Stop doing insert-overwrite on S3, which has a long copy window from the Hive staging directory to the target table. If we move to HDFS for all insert-overwrites we still have the issue, but only over the span of time it takes to do an hdfs mv, which is significantly faster. (Probably bad: there's still a window where we can get incomplete data.)
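A hedged sketch of the third option, with all table, view, and column names hypothetical: each batch is appended as a brand-new partition, and readers go through a view that keeps only the newest version of each row, so they never observe a half-written overwrite.
// Append the changed rows as a new partition; nothing existing is rewritten.
spark.sql("""
  INSERT INTO my_table_changes PARTITION (batch_id = '20240101T1200')
  SELECT * FROM staging_changes
""")

// Only after the partition is fully written, (re)define the reader-facing view
// to reconcile the latest version of each row across all batches.
spark.sql("""
  CREATE OR REPLACE VIEW my_table AS
  SELECT * FROM (
    SELECT c.*,
           row_number() OVER (PARTITION BY row_key ORDER BY batch_id DESC) AS rn
    FROM my_table_changes c
  ) t
  WHERE rn = 1
""")
Whether a view defined from Spark is readable from Presto depends on how it is stored in the metastore, so this is only an outline of the approach.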
My question is: how do people generally handle this? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third-party tool that can query the Hive metastore and interact with HDFS/S3 directly while not respecting Hive locks.

Using Hive with Pig

My Hive query has multiple outer joins and takes very long to execute. I was wondering if it would make sense to break it into multiple smaller queries and use Pig to perform the transformations.
Is there a way I could query Hive tables or read Hive table data within a Pig script?
Thanks
The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try to work with the data as it is stored in HDFS.
Note that Howl has since been renamed to HCatalog.