A Hive query that joins table runs 12+ hours.
This query joins string columns. So for each column, hive has to do a string compare. It would be logical to join on strings.
Does it make sense to convert string columns to int? Or is the impact in general considered as too low?
I would suggest trying to improve the join performance by adding some properties in hive query that will join better.
set hive.auto.convert.join=false;
set hive.auto.convert.join.noconditionaltask=false;
I have seen the above parameters make a difference many times. Maybe if you give us more information as to how many tables are u joining and how big are they. There will be better solution.
Related
I have a query that has joins between 15 tables or even more. I need to optimize the response time. I created some index columns, changed some conditions from NOT IN to NOT EXISTS, but I found myself wondering about this.
Does the order of these joins affect the response time?
The order of JOINs definitely does affect performance, as well as the type of join. INNER JOINs, generally, will yield quicker results than RIGHT or LEFT OUTER JOINs due to the selectivity of the join.
SQL Server also tries its best to optimize every query, but at 15 joins it may have a hard time. Consider breaking the statement up into smaller junks of fewer joins. A strategy to resolve things like this that I've used in the past is to create a temporary table to store the results, then INSERT into it and UPDATE it accordingly through several different statements, with the 15 tables being spread out into the appropriate insert/update spots.
The order of the joins definitely matters. The question is, can you change the order by rewriting the SQL? The query optimizer doesn't necessarily care what order you write the joins in, depending on the type of join. The query optimizer does its best to find the most efficient execution plan based on the SQL you've written. However, it is no where close to perfect. If you notice after looking at the execution plan that it could be done more efficiently another way you can trick it into doing it your way and see if it helps.
One way to trick it is to use temp tables to pare down the result set before joining to large tables. This will allow less records to be selected which will reduce I/O.
Another way, as demonstrated by Adam Machanic, is to use a top in the select clause with an order by.
Any ideas how I can create a join on a table retrieved from data in a previously joined table? For example, I have an objects table that contains existing table Names, the display field of those tables, and the main ID field of those tables. I want to join on the table specified in the objects table as well as the objects table.
select T.field1,T.field2,T.field3,O.DisplayName
from tableT as T
inner join objects as O on T.SystemObjectID = O.SystemObjectID
inner join O.TableName as X on X.O.ID = T.SystemObjectRecordID
I understand the script above is not correct. But what is the easiest most efficient way of accomplishing this task? I hope I'm clear on what I'm asking...
Thank you for any help in advance.
You can't do that.
What you can do is take the results of that and create a string that is a SQL query.
You can even make that query parameterised and execute it using sp_executesql.
- That link has a few good examples to get you started.
This is because SQL is compiled to an execution plan before it is executed. At this point in time it needs to know the tables, decides the indexes to use, etc.
This means that data is data and sql is sql. You can't use late binding scripting tricks to modify the query that is already running.
You're stuck with SQL that writes SQL (Dynamic SQL) I'm afraid.
I have a Sql-Server-2008 database that I am querying from on the regular that was over 30 million entries (joy!). Unfortunately this database cannot be drastically changed because it is still in use for R/D.
When I query from this database, it takes FOREVER. By that I mean I haven't been patient enough to wait for results (after 2 mins I have to cancel to avoid locking the R/D department out). Even if I use a short date range (more than a few months), it is basically impossible to get any results from it. I am querying with requirements from 4 of the columns and unfortunately have to use an inner-join for another table (which I've been told is very costly in terms of query efficiency -- but it unavoidable). This inner joined table has less than 100k entries.
What I was wondering, is it is possible to organize the table to have it defaultly be ordered by date to reduce the number of results it has to search through?
If this is not possible, is there anything I can do to reduce query times? Is there any other useful information that could assist me in coming up with a solution?
I have included a sample of the query that I use:
SELECT DISTINCT N.TestName
FROM [DalsaTE].[dbo].[ResultsUut] U
INNER JOIN [DalsaTE].[dbo].[ResultsNumeric] N
ON N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
AND N.ResultsUutId = U.ResultsUutId
WHERE U.DeviceName = 'BO-32-3HK60-00-R'
AND U.StartDateTime > '2011-11-25 01:10:10.001'
ORDER BY N.TestName
Any help or suggestions are appreciated!
It sounds like datetime may be a text based field and subsequently an index isn't being used?
Could you try the following to see if you have any speed improvement:
select distinct N.TestName
from [DalsaTE].[dbo].[ResultsUut] U
inner join [DalsaTE].[dbo].[ResultsNumeric] N
on N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
and N.ResultsUutId = U.ResultsUutId
where U.DeviceName = 'BO-32-3HK60-00-R'
and U.StartDateTime > cast('2011-11-25 01:10:10.001' as datetime)
order by N.TestName
It would also be worth trying changing your inner join to a left outer join as those occasionally perform faster for no conceivable reason (at least one that I'm not aware of).
you can add an index based on your date column, which should improve your query time. You can either use an alter table command, or use the table designer.
Is the sole purpose of the join to provide sorting? If so, a quick thing to try would be to remove this, and see how much of a difference it makes - at least then you'll know where to focus your attention.
Finally, SQL server management studio has some useful tools such as execution plans that can help diagnose performance issues. Good luck!
There are a number of problems which may be causing delays in the execution of your query.
Indexes (except the primary key) do not reorder the data, they merely create an index (think phonebook) which orders a number of values and points back to the primary key.
Without seeing the type of data or the existing indexes, it's difficult, but at the very least, the following ASCENDING indexes might help:
[DalsaTE].[dbo].[ResultsNumeric] ModeDescription and ResultsUutId and TestName
[DalsaTE].[dbo].[ResultsUut] StartDateTime and DeviceName and ResultsUutId
Without the indexes above, the sample query you gave can be completed without performing a single lookup on the actual table data.
I have a query containing three inner join statements in the Where clause. The query takes roughly 2 minutes to execute. If I simply change the order of two of the inner joins, performance drops to 40 seconds.
How can doing nothing but changing the order of the inner joins have such a drastic impact of query performance? I would have thought the optimizer would figure all this out.
SQL is declarative, that is, the JOIN order should not matter.
However it can in practice, say, if it's a complex query when the optimiser does not explore all options (which in theory could take months).
Another option is that it's a very different query if you reorder and you get different results, but this is usually with OUTER JOINs.
And it could also be the way the ON clause is specified It has to change if you reorder the FROM clause. Unless you are using the older (and bad) JOIN-in-the-WHERE-clause.
Finally, if it's a concern you could use parenthesis to change evaluation order to make your intentions clear, say, filter on a large table first to generate a derived table.
Because by changing the order of the joins, SQL Server is coming up with a different execution plan for your query (chances are it's changing the way it's filtering the tables based on your joins).
In this case, I'm guessing you have several large tables...one of which performs the majority of the filtering.
In one query, your joins are joining several of the large tables together and then filtering the records at the end.
In the other, you are filtering the first table down to a much smaller sub-set of the data...and then joining the rest of the tables in. Since that initial table got filtered before joining the other large recordsets, performance is much better.
You could always verify but running the query with the 'Show query plan' option enabled and see what the query plan is for the two different join orders.
I would have thought it was smart enough to do that as well, but clearly it's still performing the joins in the order you explicitly list them... As to why that affects the performance, if the first join produces an intermediate result set of only 100 records in one ordering scheme, then the second join will be from that 100-record set to the third table.
If putting the other join first produces a first intermediate result set of one million records, then the second join will be from a one million row result set to the third table...
My question is not how to use inner join in sql. I know about how it matches between table a and table b.
I'd like to ask how is the internal working of inner working. What algorithm it involves? What happens internally when joining multiple tables?
There are different algorithms, depending on the DB server, indexes and data order (clustered PK), whether calculated values are joined or not etc.
Have a look at a query plan, which most SQL systems can create for a query, it should give you an idea what it does.
In MS Sql, different join algorithms will be used in different situations depending on the tables (their size, what sort of indexes are available, etc). I imagine other DB engines also use a variety of algorithms.
The main types of join used by Ms Sql are:
- Nested loops joins
- Merge joins
- Hash joins
You can read more about them on this page: Msdn -Advanced Query Tuning Concepts
If you get SQL to display the 'execution plan' for your queries you will be able to see what type of join is being used in different situations.
It depends on what database you're using, what you're joining (large/small, in sequence/random, indexed/non-indexed etc).
For example, SQL Server has several different join algorithms; loop joins, merge joins, hash joins. Which one is used is determined by the optimizer when it is working out an execution plan. Sometimes it makes a misjudgement and you can then force a specific join algorithm by using join hints.
You may find the following MSDN pages interesting:
http://msdn.microsoft.com/en-us/library/ms191318.aspx (loop)
http://msdn.microsoft.com/en-us/library/ms189313.aspx (hash)
http://msdn.microsoft.com/en-us/library/ms190967.aspx (merge)
http://msdn.microsoft.com/en-us/library/ms173815.aspx (hints)
In this case you should see how to saved data in b-tree after it i think you will understand JOIN algorithm.
All based set theory, been around a while.
Try not to link too many table at any one time, seems to conk out database resources with all the scanning. Indices help with performance, look at some sql sites and search on optimising sql queries to get some insight. SQL Management Studio has some inbuilt execution plan utility that's often interesting, especially for large complex queries.
The optimizer will (or should) choose the fastest join algo.
However there are two different kinds of determining what is fast:
You measure the time that it takes to return all the joined rows.
You measure the time that it takes to return the first joined rows.
If you want to return all the rows as fast as possible the optimizer will often choose a hash join or a merge join. If you want to return the first few rows as fast as possible the optimzer will choose a nested loops join.
It creates a Cartesian Product of the two tables and then selects the rows out of it. Read Korth book on Databases for the same.