Comparison of Explain Statement Output on Amazon Redshift - sql

I have written a very complicated query in Amazon Redshift which comprises 3-4 temporary tables along with sub-queries. Since the query is slow to execute, I tried to replace it with another query, which uses derived tables instead of temporary tables.
I just want to ask: is there any way to compare the EXPLAIN output for both queries, so that we can conclude which query performs better (in both space and time)?
Also, how helpful is replacing temporary tables with derived tables in Redshift?

When Redshift generates its own temporary tables (visible in the plan), you may be able to tune the query by creating them as temporary tables yourself, specifying compression and adding distribution and sort keys that help with the joins done on the table.
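For illustration, a minimal sketch of what that might look like (table names, column names, and key choices are hypothetical):

CREATE TEMPORARY TABLE stage_orders (
    customer_id BIGINT        ENCODE lzo,   -- compression encoding
    order_date  DATE          ENCODE lzo,
    total       DECIMAL(18,2) ENCODE lzo
)
DISTKEY (customer_id)   -- distribute on the column used in later joins
SORTKEY (order_date);   -- sort on the column used in filters or merge joins

INSERT INTO stage_orders
SELECT customer_id, order_date, SUM(amount)
FROM orders
GROUP BY customer_id, order_date;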
Very slow queries typically use a nested loop join style. The fastest join type is a merge join. If possible, rewrite the query or modify the tables to use merge join or at least hash join. Details here: https://docs.aws.amazon.com/redshift/latest/dg/query-performance-improvement-opportunities.html
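To compare the two candidate queries, run EXPLAIN on each and look at the join operators and relative costs. A sketch (table names are hypothetical; the cost units are relative, so treat them as a guide and confirm with actual run times):

EXPLAIN
SELECT c.customer_id, SUM(o.amount)
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id;
-- In the output, note which join operator was chosen, e.g.
-- "XN Hash Join DS_DIST_NONE" vs. "XN Nested Loop DS_BCAST_INNER",
-- and compare the relative cost numbers between the two plans.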
Resources to better understand Redshift query planning and execution:
Query Planning And Execution Workflow:
https://docs.aws.amazon.com/redshift/latest/dg/c-query-planning.html
Reviewing Query Plan Steps:
https://docs.aws.amazon.com/redshift/latest/dg/reviewing-query-plan-steps.html
Mapping the Query Plan to the Query Summary:
https://docs.aws.amazon.com/redshift/latest/dg/query-plan-summary-map.html
Diagnostic Queries for Query Tuning:
https://docs.aws.amazon.com/redshift/latest/dg/diagnostic-queries-for-query-tuning.html
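To go beyond the plan and compare what the two queries actually did, a sketch using Redshift's system views (the query ID 12345 is a placeholder):

-- Find the query IDs of the two runs:
SELECT query, elapsed, substring
FROM svl_qlog
ORDER BY starttime DESC
LIMIT 5;

-- Inspect each step of one run; is_diskbased = 't' means the step
-- spilled to disk, which speaks to the "space" part of the question:
SELECT query, seg, step, maxtime, rows, is_diskbased, label
FROM svl_query_summary
WHERE query = 12345   -- replace with your query ID
ORDER BY seg, step;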

Related

Without changing the memory limit and without affecting query performance, is there any way to improve the Impala memory issue?

I would like to know -
without affecting SQL query performance
without lowering the memory limit
is there any way to improve the Impala memory error issue?
I got a few suggestions, like changing the join statements in my SQL queries.
Impala uses an in-memory analytics engine, so being minimalistic in every aspect does the trick.
Filters - Use as many filters as you can. Use a subquery and filter inside the subquery if you can.
Joins - The main reason for memory issues; you need to use joins intelligently. As a rule of thumb, in the case of an inner join, use the driving table first, then the tiniest table, then the next tiniest table, and so on. For left joins you can use the same rule of thumb. So, order the tables by their size (columns and row count), as in the sketch after this list.
Operations like distinct, regexp, IN, and concat/functions in a join condition or filter can slow things down. Please make sure they are absolutely necessary and there is no way to avoid them.
Number of columns in select statement, subquery - keep them minimal.
Operations in select statement, subquery - keep them minimal.
Partitions - Keep them optimized so you have optimum performance: more partitions will slow down INSERT, and fewer partitions will slow down SELECT.
Statistics - Create a daily plan to gather statistics of all tables and partitions to make things faster.
Explain Plan - Get the explain plan while the query is running. Query execution gives you a unique query link. You will see lots of insight into the operations of the SQL.
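Putting several of these tips together, a minimal sketch (table and column names are hypothetical):

-- Gather statistics so the planner knows table sizes (run periodically):
COMPUTE STATS sales;
COMPUTE STATS tiny_dim;

-- Driving (largest) table first, smallest table next; filter inside the
-- subquery so fewer rows reach the join; select only the columns needed:
SELECT s.order_id, d.dim_name
FROM (
    SELECT order_id, dim_id
    FROM sales
    WHERE sale_date >= '2017-01-01'   -- filter as early as possible
) s
JOIN tiny_dim d ON d.dim_id = s.dim_id;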

Sql Server Join query

I have two tables: one is a small table and the other is a large table. While joining the two tables, which table should I keep on the left and which on the right so that the query optimiser searches quicker, or does it not matter in which order I join the tables?
For example:
--1
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON smalltable.column1 = largetable.column1 ;
--2
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON largetable.column1 = smalltable.column1 ;
Which query will be faster, or does it not matter?
If you're talking about Microsoft SQL Server, both queries are equivalent to the query optimizer. In fact, to almost any cost-based query optimizer they'll be equivalent. You can try it by looking at the execution plan (see here for details: http://www.simple-talk.com/sql/performance/execution-plan-basics/).
The query optimizer in most decent SQL Server variants will solve that. Some primitive ones don't have a query optimizer (older MySQL and Access come to mind). Some may get overloaded with complex decisions (this one is simple).
But in general - trust the query optimizer first.
It should not matter which order you use, as your SQL Server should optimise the query execution for you. However, (if you are using Microsoft SQL Server) you could use SQL Server Profiler (found under the Tools menu of SQL Server Management Studio) to check the execution plans of both options.
If one of the tables is smaller than the other, place the smaller table first and then the larger table, as the optimizer will have less work to do, and moreover this will help it choose a plan that uses a hash join.
Then run the query profiler and check that a hash join is used, because it is the best and fastest join in this scenario.
If there are no indexes on the joined tables, the optimizer will select a hash join.
You can force a hash join by adding OPTION (HASH JOIN) at the end of the query.
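For example, using the tables from the question:

SELECT smalltable.column1,
       largetable.column1
FROM smalltable
INNER JOIN largetable
    ON smalltable.column1 = largetable.column1
OPTION (HASH JOIN);  -- query-level hint: forces hash joins throughout the query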
From MSDN (http://blogs.msdn.com/b/irenak/archive/2006/03/24/559855.aspx):
The column name that joins the table is called a hash key. In the example above, it’ll be au_id. SQL Server examines the two tables being joined, chooses the smaller table (so called build input), and builds a hash table applying a hash algorithm to the values of a hash key. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If build input is done completely in-memory, the hash join is called an “in-memory hash join”. If SQL Server doesn’t have enough memory to hold the entire build input, the process will be done in chunks, and is called “grace hash join”.
Before running both queries, select 'Include Actual Execution Plan' from the menu and then run the queries. SQL Server will show the execution plan, which is the best tool for creating optimized queries. See more about execution plans here.
The order of the join columns does matter. See this post for more detail. Also there has been no discussion of indexing in this thread. It is the combination of optimal join table order AND useful indexing that results in the fastest executing queries.

Sql joining a table

I have a question regarding SQL joins:
Whenever we join two different tables on some fields, what exactly happens inside Oracle to produce the query output?
Does Oracle create/use a temporary table just for presenting the query output?
There is an overview of join mechanisms used in Oracle and a couple of Oracle wiki pages about join:
Cluster join
Hash join
Nested loops join
Sort merge join
The Cost-Based Optimizer documentation gives plenty of detail pertaining to access paths, how blocks of data are read, which scans are used, etc.
http://download.oracle.com/docs/cd/B10501_01/server.920/a96533/optimops.htm#35891
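To see which of these mechanisms Oracle picks for a given query, a minimal sketch (using the classic SCOTT demo tables):

EXPLAIN PLAN FOR
SELECT e.ename, d.dname
FROM emp e
JOIN dept d ON d.deptno = e.deptno;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- The plan shows the join mechanism chosen, e.g. HASH JOIN,
-- NESTED LOOPS or MERGE JOIN, along with the access paths used.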
I don't think it will be a temporary table; I guess it will be a table in memory, to speed up the operation.
If by "temporary table" you mean an Oracle global temporary table (GTT), the answer is No, Oracle never uses a GTT just for presenting the query output, but on the other hand, Yes, it might use a GTT for storing intermediate results depending on the query plan.

Getting rid of Table Spool in SQL Server Execution plan

I have a query that creates several temporary tables and then inserts data into them. From what I understand this is a potential cause of Table Spool. When I look at my execution plan the bulk of my processing is spent on Table Spool. Are there any good techniques for improving these types of performance problems? Would using a view or a CTE offer me any benefits over the temp tables?
I also noticed that when I mouse over each table spool the output list is from the same temporary table.
Well, with the information you gave, I can tell you only this: the query optimizer has chosen the best possible plan. It uses table spools to speed up execution. Alternatives that don't use table spools would be even slower.
How about showing the query, the table(s) schema and cardinality, and the plan?
Update
I certainly understand if you cannot show us the query. But it is really hard to guess why the spooling is preferred by the engine without knowing any specifics. I recommend you go over Craig Freedman's blog; he is an engineer on the query optimizer team and has explained a lot of the inner workings of the SQL 2005/2008 optimizer. Here are some entries I could quickly find that touch on the topic of spooling in one form or another:
Recursive CTEs
Ranking Functions: RANK, DENSE_RANK, and NTILE
More on TOP
SQL customer support team also has an interesting blog at http://blogs.msdn.com/psssql/
And 'sqltips' (the relational engine's team blog) has some tips, like Spool operators in query plan...
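As a self-contained illustration of one case where a spool is effectively guaranteed: recursive CTE plans in SQL Server use a spool pair to feed each recursion level the rows produced by the previous one. A minimal sketch:

WITH nums AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 100
)
SELECT n FROM nums
OPTION (MAXRECURSION 100);
-- The execution plan for this query contains a lazy Table Spool /
-- Index Spool pair that stores each level of the recursion.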

What are the steps followed by the SQL engine to execute a query?

My question is not how to use an inner join in SQL. I know how it matches rows between table a and table b.
I'd like to ask about the internal working of an inner join. What algorithm does it involve? What happens internally when joining multiple tables?
There are different algorithms, depending on the DB server, indexes and data order (clustered PK), whether calculated values are joined or not etc.
Have a look at a query plan, which most SQL systems can create for a query, it should give you an idea what it does.
In MS Sql, different join algorithms will be used in different situations depending on the tables (their size, what sort of indexes are available, etc). I imagine other DB engines also use a variety of algorithms.
The main types of join used by Ms Sql are:
- Nested loops joins
- Merge joins
- Hash joins
You can read more about them on this page: MSDN - Advanced Query Tuning Concepts
If you get SQL to display the 'execution plan' for your queries you will be able to see what type of join is being used in different situations.
It depends on what database you're using, what you're joining (large/small, in sequence/random, indexed/non-indexed etc).
For example, SQL Server has several different join algorithms; loop joins, merge joins, hash joins. Which one is used is determined by the optimizer when it is working out an execution plan. Sometimes it makes a misjudgement and you can then force a specific join algorithm by using join hints.
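For example, in SQL Server syntax (the grammar is INNER { LOOP | HASH | MERGE } JOIN; table names are hypothetical):

-- Force a merge join for this one join (join hint):
SELECT a.id, b.id
FROM table_a AS a
INNER MERGE JOIN table_b AS b
    ON b.id = a.id;

-- Or force the algorithm for every join in the query (query hint):
SELECT a.id, b.id
FROM table_a AS a
INNER JOIN table_b AS b
    ON b.id = a.id
OPTION (MERGE JOIN);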
You may find the following MSDN pages interesting:
http://msdn.microsoft.com/en-us/library/ms191318.aspx (loop)
http://msdn.microsoft.com/en-us/library/ms189313.aspx (hash)
http://msdn.microsoft.com/en-us/library/ms190967.aspx (merge)
http://msdn.microsoft.com/en-us/library/ms173815.aspx (hints)
In this case you should look at how data is stored in a B-tree; after that, I think you will understand the JOIN algorithm.
It's all based on set theory, which has been around a while.
Try not to link too many tables at any one time, as it seems to exhaust database resources with all the scanning. Indexes help with performance; look at some SQL sites and search on optimising SQL queries to get some insight. SQL Server Management Studio has an inbuilt execution plan utility that's often interesting, especially for large complex queries.
The optimizer will (or should) choose the fastest join algo.
However there are two different kinds of determining what is fast:
You measure the time that it takes to return all the joined rows.
You measure the time that it takes to return the first joined rows.
If you want to return all the rows as fast as possible, the optimizer will often choose a hash join or a merge join. If you want to return the first few rows as fast as possible, the optimizer will choose a nested loops join.
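In SQL Server you can even tell the optimizer which of the two goals to favour: OPTION (FAST n) asks it to optimize for returning the first n rows quickly, which often produces a nested loops plan. A hypothetical example:

SELECT o.order_id, c.name
FROM orders AS o
INNER JOIN customers AS c
    ON c.customer_id = o.customer_id
ORDER BY o.order_date DESC
OPTION (FAST 10);  -- optimize for the first 10 rows, not the full result set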
Conceptually, it creates a Cartesian product of the two tables and then selects the rows out of it. Read the Korth book on databases for more on this.