DRIVING_SITE hint for multiple remote tables - SQL

I have a query of the following format. It uses two remote tables and a local table.
SELECT *
FROM table1@db2 t1 INNER JOIN table2@db2 t2 -- two large remote tables on the same DB
ON t1.id = t2.id
WHERE t1.prop = '1'
AND t2.prop = '2'
AND t1.prop2 IN (SELECT val FROM tinylocaltable)
I'm wondering how to properly use the DRIVING_SITE query hint to push the bulk of the work to db2 (i.e. ensure the join and conditions are applied on db2). Most of the examples I see of DRIVING_SITE reference only one remote table. Is SELECT /*+DRIVING_SITE(t1)*/ * sufficient or do I need to list both remote tables (t1 and t2) in the hint? If the latter, what is the proper syntax?
(If you're wondering why this isn't being executed on db2 to start with, it's because this is actually one UNION ALL section of a larger query, where the other UNION ALL sections use the local DB).

The DRIVING_SITE hint instructs the optimizer to execute the query at a different site than that selected by the database.
Your query uses
FROM table1@db2 t1 INNER JOIN table2@db2 t2
where both tables are on the same "different site", so
SELECT /*+ DRIVING_SITE(t1) */
should be OK (in my opinion; I can't find anything in the documentation that suggests otherwise).
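For illustration, a sketch of the full query with the hint applied (the hint must reference the alias t1, since the FROM clause aliases the table). Because both remote tables live on db2, naming either alias should make db2 the driving site, and the small local table should be shipped there for the IN evaluation:
SELECT /*+ DRIVING_SITE(t1) */ *
FROM table1@db2 t1 INNER JOIN table2@db2 t2
ON t1.id = t2.id
WHERE t1.prop = '1'
AND t2.prop = '2'
AND t1.prop2 IN (SELECT val FROM tinylocaltable)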

Related

SQL - Combining two databases

Trying to connect two active DBs for reporting, both running SQL Server. Recently migrated several customer groups to the new DB, but now need to run comparative historical reports. How do I get all data from both DBs? Joins filter out newly added customers on one side or the other, depending on the join used. Is this possible?
If they are on two separate servers, you will need to link the servers to each other first. This link should set you in the proper direction. Then you need to reference them using four-part naming:
SELECT T1.*
FROM [Server1].[Database1].[dbo].[Table1] T1
LEFT JOIN [Server2].[Database2].[dbo].[Table1] T2
ON T1.MyField = T2.MyField
If they are on the same server, you only need to add the database name (and schema) to your SQL code. So, if you're trying to link data from Table1 in Database1 to Table1 in Database2, you would do this:
SELECT T1.*
FROM [Database1].[dbo].[Table1] T1
LEFT JOIN [Database2].[dbo].[Table1] T2
ON T1.MyField = T2.MyField
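One more thought: since joins were filtering out newly added customers, a FULL OUTER JOIN keeps unmatched rows from both sides, which may be closer to what you need (a sketch using the same placeholder names as above):
SELECT COALESCE(T1.MyField, T2.MyField) AS MyField
FROM [Database1].[dbo].[Table1] T1
FULL OUTER JOIN [Database2].[dbo].[Table1] T2
ON T1.MyField = T2.MyField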

Performance of join vs pre-select on MsSQL

I can do the same query in the two ways below. Will the version without the join be more efficient?
1
select table1.* from table1
inner join table2 on table1.key = table2.key
where table2.id = 1
2
select * from table1
where key = (select key from table2 where id=1)
These are doing two different things. The second will return an error if more than one row is returned by the subquery.
In practice, my guess is that you have an index on table2(id) or table2(id, key), and that id is unique in table2. In that case, both should be doing index lookups and the performance should be very comparable.
And, the general answer to performance question is: try them on your servers with your data. That is really the only way to know if the performance difference makes a difference in your environment.
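For reference, a covering index along the lines described above might look like this (purely a sketch using the question's names; [key] is bracketed because KEY is a reserved word):
CREATE INDEX IX_table2_id ON table2(id) INCLUDE ([key]);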
I executed these two statements after running set statistics io on (on SQL Server 2008 R2 Enterprise - which supposedly has the best optimization compared to Standard).
select top 5 * from x2 inner join ##prices on
x2.LIST_PRICE = ##prices.i1
and
select top 5 * from x2 where LIST_PRICE in (select i1 from ##prices)
and the statistics matched exactly. I have always preferred the first type of join but the second allows me to select just that part and see what rows are being returned.
I was taught that joins vs subqueries are mostly equivalent when it comes to performance. I would also look at the resulting query plans to see if one is better than the other. The query plans matched exactly.
MS SQL Server is smart enough to understand that these amount to the same action in such a simple query.
However, if the subquery can return more than one record, you'll probably use IN. IN is a slow operation and will never be faster than a JOIN; it can match it but never beat it.
The best option for your case is to use EXISTS. It will always be as fast as or faster than JOIN or IN. Example:
select * from table1 t1
where EXISTS (select * from table2 t2 where t2.id = 1 AND t1.key = t2.key)

JOIN on table on another database server vs. JOIN on a SELECT on table on another server

I’m curious if there is any difference in performance between JOINing on a table residing on another SQL Server instance and JOINing on a subset of that same table on the other server instance. In other words, would performance be the same for the following two queries:
SELECT t1.CustomerName, t2.Address, t2.Phone
FROM Table1 t1
LEFT JOIN [Server X].dbo.Table2 t2 on t2.CustomerID = t1.CustomerID
And
SELECT t1.CustomerName, t2.Address, t2.Phone
FROM Table1 t1
LEFT JOIN (SELECT CustomerID, Address, Phone FROM [Server X].dbo.Table2) t2
on t2.CustomerID = t1.CustomerID
We can assume that Table2 contains more than just these two columns. I’m wondering if SELECTing only the columns I need vs. JOINing on the entire table will make any sort of difference, especially given this is a cross server query.
Off the top of my head, I'm not sure, but without testing it, it looks to me like SQL would execute these the same. You can check this yourself in SQL Server Management Studio by comparing the execution plans.
I believe the top query would be the more efficient one (as it would be on the same server) if the inner select were more complex. It's really up to the optimizer.
IMO, both of the queries will have the same performance in this case.
If you look at the execution plans for them, both plans will be identical.
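One way to check, as a sketch: capture the estimated plans for both forms and compare them. SET SHOWPLAN_XML returns the plan instead of executing the query, and must be the only statement in its batch:
SET SHOWPLAN_XML ON;
GO
SELECT t1.CustomerName, t2.Address, t2.Phone
FROM Table1 t1
LEFT JOIN [Server X].dbo.Table2 t2 on t2.CustomerID = t1.CustomerID;
GO
SET SHOWPLAN_XML OFF;
GO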

Why is this SQL statement hanging when group by, sum, or where clause is included?

I have a SQL statement:
select
t3.item1,
t3.item2,
sum(t1.moneys)
from
table t1
inner join table t2 on t1.key = t2.key
inner join table t3 on t1.key2 = t3.key2
where
t2.type = 'thistype'
and t3.type2 = 'thistype'
group by
t3.item1, t3.item2
If I remove the GROUP BY, SUM, or WHERE clause it runs fine, but if I add back any of those it hangs forever... any ideas? This is on SQL Server Management Studio 2008 R2.
Thanks
Further Testing
so I created a view:
select
t3.item1,
t3.item2,
t1.moneys,
t2.type,
t3.type2
from
table t1
inner join table t2 on t1.key = t2.key
inner join table t3 on t1.key2 = t3.key2
and I can select top 1000 from the view fine and see the type I want to specifically look at in the data, but when I add the WHERE type2 = 'thistype' it hangs again...
You're joining three tables with millions of records; it's normal for this to take a while to run.
To answer your question about statistics: they are what the query optimizer uses to decide how best to retrieve records from your tables. Without accurate, up-to-date statistics, indexes can actually slow your queries down.
http://blogs.technet.com/b/rob/archive/2008/05/16/sql-server-statistics.aspx
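As a quick sketch (the table name is a placeholder, since the question's tables are anonymized), statistics can be refreshed manually:
UPDATE STATISTICS dbo.Table1;
-- or, for every table in the database:
EXEC sp_updatestats;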
I think we'd need to see some table structure and know some more things about your DB before we can give a solid answer. First thing, though, is to run a trace on it and see what that tells you.
At first blush, I have found that issues with aggregate functions (SUM, GROUP BY, etc.) tend to stem from (a) overly large data sets (that is, you're just trying to pull back too much data) or (b) overly complicated structure or relationships in the joined tables.
However, those are just my general rules-of-thumb, and may not apply in a specific situation: run a trace and any other form of profiling you can and see what that tells you.
Have you looked at the execution plan you're getting? That will tell you where the problem is. Do you have covering indices on the columns on which you're joining and grouping?
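Purely as an illustration (the real names are anonymized in the question; assume the second table is actually called t2), a covering index supporting the filter on t2 and the join back to t1 might look like:
CREATE INDEX IX_t2_type_key ON dbo.t2 ([type], [key]);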
Is it possible that the execution plan is corrupted?
http://msdn.microsoft.com/en-us/library/aa175244(v=sql.80).aspx
Try recompiling the plan using sp_recompile.
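For example (the object name is a placeholder):
EXEC sp_recompile N'dbo.t1';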

WHERE and JOIN order of operation

My question is similar to this SQL order of operations but with a little twist, so I think it's fair to ask.
I'm using Teradata. And I have 2 tables: table1, table2.
table1 has only an id column.
table2 has the following columns: id, val
I might be wrong but I think these two statements give the same results.
Statement 1.
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table2.val<100
Statement 2.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT *
FROM table2
WHERE val<100
) table3
ON table1.id=table3.id
My question is, will the query optimizer be smart enough to
- execute the WHERE clause first then JOIN later in Statement 1
- know that table 3 isn't actually needed in Statement 2
I'm pretty new to SQL, so please educate me if I'm misunderstanding anything.
This would depend on many, many things (table size, indexes, key distribution, etc.); you should just check the execution plan:
Here are some ways for a few databases:
MySql EXPLAIN
SQL Server SET SHOWPLAN_ALL (Transact-SQL)
Oracle EXPLAIN PLAN
what is explain in teradata?
Teradata Capture and compare plans faster with Visual Explain and XML plan logging
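In Teradata specifically, you prefix the query with EXPLAIN to see the plan the optimizer chooses; for example, using Statement 1 from the question:
EXPLAIN
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table2.val < 100;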
Depending on the availability of statistics and indexes for the tables in question, the query rewrite mechanism in the optimizer may or may not opt to scan Table2 for records where val < 100 before scanning Table1.
In certain situations, based on data demographics, joins, indexing, and statistics, you may find that the optimizer is not eliminating records in the query plan when you feel that it should, even if you have a derived table such as the one in your example. You can force the optimizer to process a derived table by simply placing a GROUP BY in it. The optimizer is then obligated to resolve the GROUP BY aggregate before it can consider resolving the join between the two tables in your example.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT table2.id, table2.val
FROM table2
WHERE val<100
GROUP BY 1,2
) table3
ON table1.id=table3.id
This is not to say that your standard approach should be to run with this throughout your code. It is typically one of my last resorts, for when a query plan simply doesn't eliminate extraneous records early enough and ends up scanning and carrying too much data around through the various SPOOL files. It is simply a technique you can put in your toolkit for when you encounter such a situation.
The query rewrite mechanism is continually being updated from one release to the next and the details about how it works can be found in the SQL Transaction Processing Manual for Teradata 13.0.
Unless I'm missing something, why do you even need Table1?
Just query Table2:
Select id, val
From table2
WHERE val<100
Or are you using the rows in table1 as a filter? That is, does table1 contain only a subset of the ids in Table2?
If so, then this will work as well ...
Select id, val
From table2
Where val<100
And id In (Select id
From table1)
But to answer your question: yes, the query optimizer should be intelligent enough to figure out the best order in which to execute the steps necessary to translate your logical instructions into a physical result. It uses the stored statistics that the database maintains on each table to determine what to do (what type of join logic to use, for example), as well as what order to perform the operations in so as to minimize disk I/Os and processing costs.
Q1. execute the WHERE clause first then JOIN later in Statement 1
The thing is, if you switch the order of the inner join, i.e. table2 INNER JOIN table1, then I'd guess the WHERE clause could be processed before the JOIN operation, during the preparation phase. However, even if you don't change the original query, the optimizer should be able to switch the order itself: if it thinks the join would be too expensive when fetching whole rows, it will apply the WHERE first. Just my guess.
Q2. know that table 3 isn't actually needed in Statement 2
Teradata will interpret your second query as if the derived table were necessary, so it will keep processing the operations involving table 3.