How to physically partition data to avoid shuffle in Spark SQL joins


I need to join 5 medium-sized tables (~80 GB each) with a large input table of ~800 GB. All data resides in Hive tables.
I am using Spark SQL 1.6.1 for this.
The join takes 40 minutes to complete with
--num-executors 20 --driver-memory 40g --executor-memory 65g --executor-cores 6. All joins are sort-merge outer joins, and I am seeing a lot of shuffle.
I bucketed all tables in Hive into the same number of buckets so that matching keys from all tables would land in the same Spark partitions when the data is first loaded. But Spark does not seem to understand bucketing.
Is there any other way I can physically partition and sort the data in Hive (number of part files) so that Spark knows the partitioning keys when loading from Hive and can join within the same partition without shuffling data around? This would avoid the additional repartitioning after loading data from Hive.

First of all, Spark SQL 1.6.1 doesn't support Hive buckets yet.
So in this case we are left with Spark-level operations to ensure that matching keys from all tables go to the same Spark partitions when the data is loaded. The Spark API provides repartition and sortWithinPartitions for this, e.g.:
val part1 = df1.repartition(df1("key1")).sortWithinPartitions(df1("key1"))
In the same way, you can repartition and sort the remaining tables, then join them on the key that was sorted within partitions.
This makes the join itself a "shuffle-free" operation, but the repartitioning and sorting come at a major computational cost. Caching the repartitioned DataFrames (call cache on each newly partitioned DataFrame) performs better if the join will be run multiple times. Hope this helps.
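To make the idea behind this answer concrete, here is a minimal sketch in plain Python (not the Spark API) of why co-partitioning helps. It assumes both tables are hash-partitioned on the same key into the same number of partitions and sorted within each partition, so partition i of one table only needs partition i of the other. The function names, the CRC32 hash, and the inner-join semantics are illustrative assumptions; the question itself uses outer joins.

```python
import zlib
from collections import defaultdict


def partition_and_sort(rows, key, num_partitions):
    """Hash-partition rows by `key`, then sort each partition by `key`."""
    parts = defaultdict(list)
    for row in rows:
        # Deterministic hash so both tables route equal keys identically.
        pid = zlib.crc32(str(row[key]).encode()) % num_partitions
        parts[pid].append(row)
    return {pid: sorted(p, key=lambda r: r[key]) for pid, p in parts.items()}


def merge_join(left, right, key):
    """Inner sort-merge join of two lists that are already sorted by `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right row sharing this key, then advance the left side.
            k = j
            while k < len(right) and right[k][key] == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1
    return out


def copartitioned_join(t1, t2, key, num_partitions=4):
    """Join two tables partition by partition, with no cross-partition traffic."""
    p1 = partition_and_sort(t1, key, num_partitions)
    p2 = partition_and_sort(t2, key, num_partitions)
    joined = []
    # Partition i of t1 only ever needs partition i of t2.
    for pid in sorted(p1.keys() & p2.keys()):
        joined.extend(merge_join(p1[pid], p2[pid], key))
    return joined
```

The partitioning step is where the data movement happens; once it is done, each pair of partitions can be joined independently, which is the effect the repartition and sortWithinPartitions calls aim for.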

Related

How to optimise google BigQuery with 17+ tables which contains approx. 55 GB of data?

I have a huge data store containing 20+ tables, each holding gigabytes of data.
I'm exporting all the data to CSV for analysis. My join query covers 17+ tables and processes billions of records. Google says it will process 10 GB of data.
The problem is that the query takes too much time and too many resources, and sometimes it fails with a resource limit error. How can I optimize such a query?
FYI: I'm using LEFT JOIN.

The best way to optimize your query is to implement partitioning and clustering on the fields used in the join conditions.

Hive performance improvement

I want to join a table holding 1 TB of data with another table that also holds 1 TB of data in Hive. Could you please suggest some best practices to follow?
I want to know how performance improves in Hive if both tables are partitioned, and basically how MapReduce works in this case.

Below are a few performance rules to follow when dealing with large data:
Tez Execution Engine in Hive, or Hive on Spark
Use the Tez execution engine (Hortonworks) to increase the performance of Hive queries. Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs of general data processing tasks. We can consider it a much more flexible and powerful successor to the MapReduce framework.
In addition, Tez offers developers an API framework for writing native YARN applications on Hadoop that bridge the spectrum of interactive and batch workloads. More specifically, it allows those data access applications to work with petabytes of data over thousands of nodes.
Or
Use Hive on Spark (Cloudera).
Set whichever engine you chose with one of the following:
SET hive.execution.engine=tez;
SET hive.execution.engine=spark;
Usage of Suitable File Format in Hive
Using a file format appropriate to the data can drastically increase query performance. For query performance, the ORC file format is generally the most suitable. ORC stands for Optimized Row Columnar, which means data can be stored in a more optimized way than in other file formats.
To be more specific, ORC reduces the size of the original data by up to 75%, so data processing speed also increases. Compared to the Text, Sequence and RC file formats, ORC shows better performance. It stores rows in groups called stripes, along with a file footer. As a result, the ORC format improves performance when Hive processes the data.
Hive Partitioning
Without partitioning, Hive reads all the data in the directory and then applies the query filters to it. Since all the data has to be read, this is slow as well as expensive.
Users frequently need to filter the data on specific column values. To apply partitioning in Hive, however, users need to understand the domain of the data they are analyzing.
With partitioning, the rows of the dataset are segregated by the values of the partition columns and stored in their respective partitions. When we write a query to fetch values from the table, only the required partitions are read, which reduces the time the query takes to return its result.
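The effect of partition pruning described above can be illustrated with a small toy model in plain Python. This is only a sketch under simplifying assumptions: the class name PartitionedTable, the rows_scanned counter, and the dt column are hypothetical, and real Hive stores each partition as a directory of files rather than an in-memory list.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model of a table stored as one 'directory' per partition value."""

    def __init__(self, partition_col):
        self.partition_col = partition_col
        self.partitions = defaultdict(list)
        self.rows_scanned = 0  # counts rows read, to compare the two access paths

    def insert(self, row):
        self.partitions[row[self.partition_col]].append(row)

    def select(self, partition_value=None, predicate=lambda r: True):
        # With a partition filter only that partition is read; otherwise all are.
        if partition_value is None:
            targets = list(self.partitions.values())
        else:
            targets = [self.partitions.get(partition_value, [])]
        result = []
        for part in targets:
            for row in part:
                self.rows_scanned += 1
                if predicate(row):
                    result.append(row)
        return result
```

Filtering on the partition column through partition_value touches only the rows in that partition, while expressing the same filter as an ordinary predicate forces a scan of every partition.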
Bucketing in Hive
Suppose a scenario where a huge dataset is available. After partitioning on a particular field or fields, the partitions may still be much larger than expected, and we want to divide each partition's data into further parts. To solve this, Hive offers the bucketing concept, which allows the user to divide table data sets into more manageable parts.
With bucketing, the user can also set the number of these manageable parts, or buckets.
Vectorization In Hive
Vectorized query execution improves the performance of operations such as scans, aggregations, filters, and joins by performing them in batches of 1024 rows at once instead of one row at a time.
This feature was introduced in Hive 0.13. It significantly improves query execution time, and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimization in Hive (CBO)
Hive optimizes each query's logical and physical execution plan before submitting it for final execution. Until recently, these optimizations were not based on the cost of the query.
CBO, a recent addition to Hive, performs further optimizations based on query cost. That results in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.
To use CBO, set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then, prepare the data for CBO by running Hive’s “analyze” command to collect various statistics on the tables for which we want to use CBO.
Hive Indexing
Indexing is one of the best ways to increase query performance. Creating an index on the original table produces a separate table, called the index table, which acts as a reference.
A Hive table can have a large number of rows and columns. Without indexing, a query on only some of the columns takes a long time, because it is executed against all the columns present in the table.
The major advantage of indexing is that a query on an indexed table does not need to scan all the rows in the table. It checks the index first, then goes to the particular column and performs the operation.
Hence, with indexes maintained, a Hive query can look into the index first and then perform the needed operations in less time. And time is, eventually, the factor everyone focuses on.

Spark SQL Performance optimisation

I am running a Spark SQL query over a large amount of data (approx. 50 million records). Because of the data volume, the query is slow on the cluster and takes a long time (20 minutes) to process the entire data. The query uses inner joins and left joins. How can I improve the performance?

Since you are performing joins and the data size is huge, chances are that a lot of shuffling and I/O is involved. If you are not using Kryo serialization, your code is using the default Java serialization. Use Kryo serialization; it gives better performance.
The next point depends on how you are storing your data in HDFS. If it is in a plain file format, try creating Hive tables on top of it. Hive provides many optimization techniques:
a. Partitioning and bucketing: Partitioning speeds up queries since you don't have to scan the entire table while reading data. Bucketing speeds up join operations.
b. Map-side join: The smaller table is loaded into memory and the join is performed in the mapper itself, which speeds up the query.
Apache Spark lets you take advantage of Hive through HiveContext.
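The map-side join from point (b) can be sketched in plain Python as a hash join in which the small table is held in memory and the large table is streamed past it. This is a conceptual illustration only, not Hive or Spark code; the function name and the inner-join semantics are assumptions made for the example.

```python
def map_side_join(large_rows, small_rows, key):
    """Inner join that hashes the small table and streams the large one."""
    # Build an in-memory lookup from the small table (the part each mapper holds).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the large table once; no sorting or shuffling of its rows is needed.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**row, **match}
```

The approach only works when the smaller table fits comfortably in memory, which is why it is offered as an option rather than a general replacement for a shuffle-based join.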

Bigquery Shard Vs Bigquery Partition

I have a table with 340 GB of data, but we use only the last week of data. To minimize cost, I am planning to move this data to a partitioned table or to sharded tables.
I ran an experiment with both. I created a partitioned table and loaded two days' worth of data (two partitions), and created two sharded tables (individual tables). Then I tried to pull the last two days' worth of data.
Full table - 27 sec
Partitioned table - 33 sec
Sharded tables - 91 sec
Please let me know which way is best. In the experiment, the query is quickest against the full table, but that scans the whole table.
Thanks,

According to the official GCP documentation on partitioning versus sharding, you should use partitioned tables:
Partitioned tables perform better than tables sharded by date. When
you create date-named tables, BigQuery must maintain a copy of the
schema and metadata for each date-named table. Also, when date-named
tables are used, BigQuery might be required to verify permissions for
each queried table. This practice also adds to query overhead and
impacts query performance. The recommended best practice is to use
partitioned tables instead of date-sharded tables.

The difference in performance seems to be due to some background optimizations that have run on the non-partitioned table, but are yet to run on the partitioned table (since the data is newer).

Improve apache hive performance

I have 5GB of data in my HDFS sink. When I run any query on Hive it takes more than 10-15 minutes to complete. The number of rows I get when I run,
select count(*) from table_name
is 3,880,900. My VM has 4.5 GB of memory and runs on a 2012 MBP. I would like to know whether creating an index on the table will improve performance. Also, are there any other ways to tell Hive to use only a certain amount of data or rows so as to get results faster? I am OK with queries running on a smaller subset of the data, at least to get a glimpse of the results.

Yes, indexing should help. However, getting a subset of the data (using LIMIT) isn't really helpful, as Hive still scans the whole data before limiting the output.
You can try using the RCFile/ORCFile format for faster results. In my experiments, RCFile-based tables executed queries roughly 10 times faster than textfile/sequence-file-based tables.

Depending on the data you are querying, you can get gains by using a different file format such as ORC or Parquet. What kind of data are you querying: is it structured or unstructured? What kind of queries are you trying to perform? If it is structured data, you can also see gains by using other SQL-on-Hadoop solutions such as InfiniDB, Presto, Impala, etc.
I am an architect for InfiniDB
http://infinidb.co
SQL-on-Hadoop solutions like InfiniDB, Impala and others work by having you load your data through them, at which point they perform calculations, optimizations, etc. to make that data faster to query. This helps tremendously for interactive analytical queries, especially when compared to something like Hive.
That said, you are working with 5 GB of data (but data always grows; someday it could be TBs), which is pretty small, so you can still work with some of the tools that are not intended for high-performance queries. Your best option with Hive is to look at how your data is structured and see whether ORC or Parquet could benefit your queries (columnar formats are good for analytic queries).
Hive is always going to be one of the slower options for running SQL queries on your HDFS data, though. Hortonworks is making it better with their Stinger initiative; you might want to check that out.
http://hortonworks.com/labs/stinger/

The use case sounds like a fit for ORC or Parquet if you are interested in a subset of the columns. ORC with Hive 0.12 comes with PPD (predicate pushdown), which helps discard blocks while running queries, using the metadata it stores for each column.
We did an implementation on top of Hive to support Bloom filters in the metadata indexes for ORC files, which gave a performance gain of 5-6x.
What is the average number of mapper/reducer tasks launched for the queries you execute? Tuning some parameters can definitely help.
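As a rough illustration of how per-column metadata lets a reader discard blocks, the following plain-Python sketch keeps a min/max pair for each block of values and skips any block whose range cannot contain the value being searched for. It is a toy model under stated assumptions, not the ORC format or reader: the names build_stripes and scan_equal and the dictionary layout are hypothetical.

```python
def build_stripes(values, stripe_size):
    """Split values into fixed-size stripes and record min/max metadata for each."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes


def scan_equal(stripes, target):
    """Return matching values and the number of stripes actually read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        # The metadata alone proves this stripe cannot contain the target: skip it.
        if target < stripe["min"] or target > stripe["max"]:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe["rows"] if v == target)
    return matches, stripes_read
```

Skipping works best when the data is sorted or clustered on the filtered column, because the min/max ranges of different blocks then overlap less.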
