I'm trying to improve Hive query speed using the techniques below. These config changes increase speed, and I want to use these settings for all the queries I execute. But I'd like some input on whether these settings will have an adverse impact if applied across all queries.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Vectorized query execution improves the performance of operations like scans, aggregations, filters, and joins by performing them in batches of 1024 rows at a time instead of a single row at a time. Introduced in Hive 0.13, this feature significantly improves query execution time.
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
analyze table tweets compute statistics for columns;
Enable cost-based optimization (CBO).
set hive.execution.engine=tez;
Use the Tez execution engine.
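To apply these settings to every query automatically, one option is to put them in an init file. A minimal sketch, assuming the Hive CLI (which reads a .hiverc file from HIVE_CONF_DIR or your home directory at startup; Beeline instead takes an init script via -i):

-- ~/.hiverc: executed at the start of every Hive CLI session
set hive.execution.engine=tez;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

These are generally safe defaults, but note that CBO only helps if table and column statistics are kept up to date, and in older Hive versions vectorization only takes effect on ORC-backed tables.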
So, it would be a very simple update: UPDATE TableA SET columnB = columnB + #numbervalue WHERE id=#id
Now this seems to be using a lot of the DB right now (possibly we can do it in batches later, but for now, I'm just wondering a few things):
Should I add OPTION (MAXDOP 1) to the end? Would it be faster if the column were a double vs. an int? If it's always adding 1, is there a better way to increment by 1? Is there anything else I should do if this query is the highest load on the DB?
Below are a few tips to increase the performance of the query:
Create an index on the filtered column to avoid a table scan (a sketch follows this list).
Remove indexes on the updated columns before the update. This improves performance, and you can add the indexes again after the update.
Remove triggers on the table, if possible, when you are doing large updates.
Removing indexes on the table that are not being used can also increase your update performance.
Since the query optimizer selects the best execution plan for a query, it is recommended to use hints only as a last resort.
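A minimal sketch of the first two tips against the question's table (the index names here are hypothetical):

-- Index the filtered column so the UPDATE can seek instead of scan.
CREATE INDEX IX_TableA_id ON TableA (id);

-- If columnB itself is indexed, dropping that index before a large batch
-- of updates and recreating it afterwards avoids per-row index maintenance.
DROP INDEX IX_TableA_columnB ON TableA;
-- ... run the updates ...
CREATE INDEX IX_TableA_columnB ON TableA (columnB);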
Max Degree of Parallelism (MAXDOP) sets the number of processors used to run a single statement for each parallel plan execution.
In Azure SQL Database, the default MAXDOP setting for each database is 8. This default prevents unnecessary resource utilization while still allowing queries to execute faster using parallel threads.
Check this MS document, which describes the database behavior with different MAXDOP values.
You can override the MAXDOP value at the query level using the MAXDOP query hint, as below.
OPTION (MAXDOP 4)
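Applied to the question's statement, that would look like the sketch below (MAXDOP 1 simply forces a serial plan; for a single-row update by id the plan will almost certainly be serial anyway, so the hint mostly serves as insurance; #id is the application-side placeholder from the question):

UPDATE TableA
SET columnB = columnB + 1
WHERE id = #id
OPTION (MAXDOP 1);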
An int uses fewer bytes (4) than a double/float (8) and therefore incurs less I/O. It is the better-performing choice if you are storing integer values in a column.
I want to join a table with 1 TB of data with another table that also has 1 TB of data in Hive. Could you please suggest some best practices to follow?
I also want to know how performance is improved in Hive if both tables are partitioned; basically, how MapReduce works in this case.
Below are a few performance-improvement rules to follow when dealing with large data:
Tez Execution Engine in Hive, or Hive on Spark
Use the Tez execution engine (Hortonworks) to increase the performance of your Hive queries. Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs (DAGs) of general data-processing tasks. It can be considered a much more flexible and powerful successor to the MapReduce framework.
Or
Use Hive on Spark (Cloudera)
In addition, Tez offers developers an API framework for writing native YARN applications on Hadoop that bridge the spectrum of interactive and batch workloads. More specifically, it allows data-access applications to work with petabytes of data across thousands of nodes.
SET hive.execution.engine=tez;
SET hive.execution.engine=spark;
Use a Suitable File Format in Hive
ORC file format: using the appropriate file format for your data will drastically increase query performance. For raising query performance, the ORC (Optimized Row Columnar) file format is usually the best choice, because it stores data in a more optimized way than the other file formats.
More specifically, ORC can reduce the size of the original data by up to 75%, which also increases data-processing speed. Compared with the Text, Sequence, and RC file formats, ORC shows better performance. It stores row data in groups called stripes, along with a file footer, so the ORC format improves performance whenever Hive processes the data.
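A minimal sketch of converting an existing table to ORC (the source table name tweets is borrowed from the first question; the target name and compression property are assumptions):

CREATE TABLE tweets_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY')
AS SELECT * FROM tweets;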
Hive Partitioning
Without partitioning, Hive reads all the data in the table's directory and then applies the query filters to it. Since all the data has to be read, this is slow as well as expensive.
Partitioning helps when users frequently filter the data on specific column values. To apply partitioning effectively, though, users need to understand the domain of the data they are analyzing.
With partitioning, the entries for each value of the partition column are segregated and stored in their own partition. When a query fetches values from the table, only the required partitions are read, which reduces the time the query takes to return a result.
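A sketch of a date-partitioned table (the table and column names are hypothetical); a filter on the partition column then reads only the matching partition directories:

CREATE TABLE tweets_part (
  id BIGINT,
  domain STRING,
  body STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Only the 2024-01-01 partition directory is scanned:
SELECT COUNT(*) FROM tweets_part WHERE dt = '2024-01-01';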
Bucketing in Hive
Bucketing: suppose that even after partitioning a huge dataset on a particular field or fields, the partition sizes don't match expectations and remain huge, yet we still want to manage the partitioned data in smaller parts. To solve this, Hive offers the bucketing concept, which allows the user to divide table data sets into more manageable parts.
Hence, to maintain parts that are more manageable, we can use bucketing. Through it, the user can also set the number of buckets, and therefore the size of those parts.
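For the 1 TB join in the question specifically, bucketing both tables on the join key into the same number of buckets lets Hive use a bucketed (sort-merge) map join instead of a full shuffle join. A sketch under those assumptions (table names, column names, and bucket count are hypothetical):

CREATE TABLE big_a (user_id BIGINT, payload STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 256 BUCKETS
STORED AS ORC;

CREATE TABLE big_b (user_id BIGINT, details STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 256 BUCKETS
STORED AS ORC;

-- Allow Hive to use the bucketed sort-merge join:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT a.user_id, b.details
FROM big_a a JOIN big_b b ON a.user_id = b.user_id;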
Vectorization In Hive
Vectorized query execution improves the performance of operations such as scans, aggregations, filters, and joins by performing them in batches of 1024 rows at a time instead of a single row at a time.
This feature was introduced in Hive 0.13. It significantly improves query execution time, and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-Based Optimization in Hive (CBO)
Before submitting a query for final execution, Hive optimizes its logical and physical execution plan. Until recently, however, these optimizations were not based on the cost of the query.
CBO, a recent addition to Hive, performs further optimizations based on query cost. That results in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.
To use CBO, set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then prepare the data for CBO by running Hive's ANALYZE command to collect various statistics on the tables for which you want to use CBO.
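For example, reusing the tweets table from the first question for illustration:

ANALYZE TABLE tweets COMPUTE STATISTICS;
ANALYZE TABLE tweets COMPUTE STATISTICS FOR COLUMNS;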
Hive Indexing
Indexing is one of the best ways to increase query performance. Indexing a table creates a separate index table for it, which acts as a reference. (Note that the index feature was removed in Hive 3.0; materialized views and ORC's built-in indexes are the usual replacements there.)
A Hive table can have a large number of rows and columns. Without indexing, queries that touch only some columns can take a long time, because they are executed against all the rows present in the table.
The major advantage of indexing is that a query on an indexed table does not need to scan all the rows: it checks the index first and then goes to the relevant rows to perform the operation.
Hence, with indexes maintained, it is easier for a Hive query to look into the index first and then perform the needed operations in less time.
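A minimal sketch of Hive's (pre-3.0) index DDL, again borrowing the tweets table (the index name and column are hypothetical):

CREATE INDEX idx_tweets_domain
ON TABLE tweets (domain)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate the index table; re-run after large data loads.
ALTER INDEX idx_tweets_domain ON tweets REBUILD;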
I have a large volume of data, and I'm looking to efficiently (i.e., using a relatively small Spark cluster) perform COUNT and DISTINCT operations on one of the columns.
If I do what seems obvious, i.e. load the data into a dataframe:
df = spark.read.format("csv").load("s3://somebucket/loadsofcsvdata/*")
df.createOrReplaceTempView("someview")
and then attempt to run a query:
domains = spark.sql("""SELECT domain, COUNT(id) FROM someview GROUP BY domain""")
domains.show(1000)
my cluster just crashes and burns, throwing out-of-memory exceptions or otherwise hanging/crashing/not completing the operation.
I'm guessing that somewhere along the way there's some sort of join that blows one of the executors' memory?
What's the ideal method for performing an operation like this, when the source data is at massive scale and the result isn't (the list of domains in the above query is relatively short, and should easily fit in memory)?
related info available at this question: What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?
I would suggest tuning your executor settings. In particular, setting the following parameters correctly can provide a dramatic improvement in performance:
spark.executor.instances
spark.executor.memory
spark.yarn.executor.memoryOverhead
spark.executor.cores
In your case, I would also suggest tuning the number of partitions; in particular, bump up the following parameter from its default of 200 to a higher value, as required (see the sketch after it):
spark.sql.shuffle.partitions
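The executor settings above generally have to be supplied at submit time, but the shuffle-partition count can be changed per session. A minimal Spark SQL sketch (the value 1000 is an assumption; size it to your data volume and total core count):

SET spark.sql.shuffle.partitions = 1000;

SELECT domain, COUNT(id)
FROM someview
GROUP BY domain;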
Currently I am running multiple queries that spool their output to a file. This can be a lengthy process; below are my current settings in SQL*Plus.
set feedback off
set heading off
set echo off
set termout OFF
set linesize 150
set long 1999999
set longchunksize 1999999
set pagesize 0
spool results.sql
@queries.sql
spool off
set termout on
set echo on
set heading on
set feedback on
I was wondering if there is any way I can speed up this process. Or is there a faster way of sending the output of the queries to a file using SQL*Plus?
Thanks
Clarifications
Can you please clarify the following
Is the SQL*Plus client on the same physical box (or virtual machine) as your database management system (more specifically, your .ora and data files)?
Can you test your query after decreasing the size of the result set using FETCH FIRST n ROWS ONLY?
Time Complexity and Specifics
Why do you believe the bottleneck is spooling? Before jumping to any conclusions about the causes of poor performance, you may want to try to gather some timing statistics on your query. I would modify your query to return only a subset of rows:
SELECT * FROM scott.dept FETCH FIRST 10000 ROWS ONLY;
Optimizing your queries depends heavily on how your data is structured and how your routines run, so in order to optimize you need to be able to benchmark your performance changes.
You will then want to check the following two parameters
SHOW PARAMETER statistics_level
SHOW PARAMETER timed_statistics
And you will also want to time your queries
SET TIMING ON
Remember, all of this is simply to benchmark performance, you will need to revert these settings in production.
Keep in mind that running the same query multiple times will give inconsistent performance metrics because of pooling / buffering / caching, etc. You should query v$statistics_level and investigate:
Rows processed
Sorts (memory)
Sorts (disk)
Physical reads
Consistent gets
I would SHUTDOWN IMMEDIATE; STARTUP; and then run an explain on your query multiple times.
EXPLAIN PLAN FOR SELECT * FROM SCOTT.DEPT;
Check if the explain plan is changing on your subsequent calls.
Additionally, in certain circumstances you can encourage caching of a table by running
SELECT /*+ FULL(dept) CACHE(dept) */ * FROM scott.dept;
and then comparing it to the performance metrics of
SELECT /*+ FULL(dept) NOCACHE(dept) */ * FROM scott.dept;
You may also want to keep the table in memory indefinitely:
ALTER TABLE scott.dept STORAGE (BUFFER_POOL KEEP);
If you don't see anything glaring, run an explain plan for your query multiple times in succession to see whether anything changes. Remember that running your query repeatedly without flushing your buffers / clearing the cache is going to give you unreliable readings.
There are also some tricks for increasing performance over time (assuming this routine runs regularly). For example, run ALTER TABLE table_name CACHE, execute the query once without spooling to a file, then run ALTER TABLE table_name NOCACHE and execute again with spooling; you may get a more performant run. Whether this is feasible depends entirely on your use case.
Settings
While I doubt your issue here is going to be solved by any combination of settings, below are a few that may provide marginal benefits
1. set trimspool on
Removes the trailing whitespace that otherwise pads each spooled line out to linesize.
2. set echo off
This may be pointless depending on whether your routine is fully automated or interactive.
3. set verify off
This will remove substitution-variable listings from your spool, but can speed up your process considerably... again, depending on your data.
4. set autoprint off
This is another variable setting. Off is the default, but for assurance I would set it explicitly.
5. set serveroutput off
6. set trimout on
Removes the trailing whitespace that pads lines out to linesize in terminal output.
7. set arraysize N
You are definitely going to tinker with this one. N is the number of rows fetched at a time, and the right size depends on the structure of your table.
While these settings may increase the speed of your routines, performance is generally specific to your architecture and data set.
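Putting it together, a sketch of the original routine with the settings above applied (arraysize 500 is an assumption; tune it to your row width):

set feedback off
set heading off
set echo off
set verify off
set termout off
set trimspool on
set arraysize 500
set linesize 150
set long 1999999
set longchunksize 1999999
set pagesize 0
spool results.sql
@queries.sql
spool off
set termout on
set echo on
set heading on
set feedback on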
I'm currently working on improving the efficiency of a few queries. After looking at the query plan, I found that SQL Server does not use parallelism when a TOP clause is in place, increasing the query time from 1-2 s to several minutes.
The query in question is using a view with various joins and unions, I'm looking for a general answer/understanding as to why this is happening - google has thus far failed me.
Thanks
As you may be aware, SQL Server generally processes queries in parallel in the following cases:
When the number of CPUs is greater than the number of active connections.
When the estimated cost of serial execution of a query is higher than the cost threshold for parallelism (the estimated cost refers to the elapsed time in seconds required to execute the query serially).
However, certain types of statements cannot be processed in parallel unless they contain specific clauses.
For example, UPDATE, INSERT, and DELETE are not normally processed in parallel even if the related query meets the criteria.
But if an UPDATE or DELETE statement contains a WHERE clause, or an INSERT statement contains a SELECT clause, the WHERE and SELECT portions can be executed in parallel. The changes themselves are applied to the database serially in these cases.
To configure parallel processing, simply do the following:
In the Server Properties dialog box, go to the Advanced page.
By default, the Max Degree Of Parallelism setting has a value of 0, which means that the maximum number of processors used for parallel processing is controlled automatically. Essentially, SQL Server uses the actual number of available processors, depending on the workload. To limit the number of processors used for parallel processing to a set amount (up to the maximum supported by SQL Server), change the Max Degree Of Parallelism setting to a value greater than 1. A value of 1 tells SQL Server not to use parallel processing.
Large, complex queries usually can benefit from parallel execution. However, SQL Server performs parallel processing only when the estimated number of seconds required to run a serial plan for the same query is higher than the value set in the cost threshold for parallelism. Set the cost estimate threshold using the Cost Threshold For Parallelism box on the Advanced page of the Server Properties dialog box. You can use any value from 0 through 32,767. On a single CPU, the cost threshold is ignored.
Click OK. These changes are applied immediately. You do not need to restart the server.
You can use the stored procedure sp_configure to configure parallel processing. Both settings are advanced options, so 'show advanced options' must be enabled and RECONFIGURE run to apply the change. The Transact-SQL commands are:
exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'max degree of parallelism', <integer value>;
exec sp_configure 'cost threshold for parallelism', <integer value>;
reconfigure;
Quoted from Technet article Configure Parallel Processing in SQL Server 2008
TOP automatically places the query into a serial (non-parallel) zone of the plan. This is a restriction and cannot be overcome. As a possible alternative to the TOP function, try using a ranking function and filtering where the rank value = 1...
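A sketch of that alternative (the view name and ordering column are hypothetical): instead of SELECT TOP (1), rank the rows in a CTE and filter on the rank, which can leave the underlying scan and join work eligible for a parallel plan:

WITH ranked AS (
    SELECT v.*,
           ROW_NUMBER() OVER (ORDER BY v.CreatedDate DESC) AS rn
    FROM dbo.SomeView AS v
)
SELECT *
FROM ranked
WHERE rn = 1;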