How to eliminate multiple server calls | MS SQL Server

There is a stored procedure that needs to be modified to eliminate a call to another server.
What is the easiest and most feasible way to do this so that the final SP executes faster? Preference goes to solutions that do not involve much change to the application.
Eg:
select *
from dbo.table1 a
inner join server2.dbo.table2 b on a.id = b.id

Cross server JOINs can be problematic as the optimiser doesn't always pick the most effective solution, which may even result in the entire remote table being dragged over your network to be queried for a single row.
Replication is by far the best option, if you can justify it. This will mean you need to have a primary key on the table you want to replicate, which seems a reasonable constraint (ha!), but might become an issue with a third-party system.
If the remote table is small then it might be better to take a temporary local copy, e.g. SELECT * INTO #temp FROM server2.<database>.dbo.table2;. Then you can change your query to something like this: select * from dbo.table1 a inner join #temp b on a.id = b.id;. The temporary table is dropped automatically when your session ends, so there is no need to tidy up after yourself.
If the table is larger then you might want to do the above, but also add an index to your temporary table, e.g. CREATE INDEX ix$temp ON #temp (id);. Note that index names only have to be unique within their table, so a named index on a #temp table won't clash even if the same procedure runs twice simultaneously; explicitly named constraints, on the other hand, can collide in tempdb, so it's safer to let the engine name those.
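Putting the two points above together, a minimal sketch of the local-copy approach could look like the following (the database name, the extra columns and the index name are placeholders; pull only the columns you actually need):
SELECT id, col1, col2                            -- col1/col2 are hypothetical; narrow the list as far as possible
INTO #remote_table2
FROM server2.SomeDatabase.dbo.table2;            -- SomeDatabase stands in for the real remote database name

CREATE INDEX ix$remote ON #remote_table2 (id);   -- only worth it if the copy is large

SELECT a.*, b.col1, b.col2
FROM dbo.table1 AS a
INNER JOIN #remote_table2 AS b ON a.id = b.id;

-- #remote_table2 is dropped automatically when the session (or the procedure that created it) ends.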
If you have a small number of ids that you want to include then OPENQUERY might be the way to go, e.g. SELECT * FROM OPENQUERY(server2, 'SELECT * FROM table2 WHERE id IN (''1'', ''2'')'); (note that the linked server name is passed as an identifier, not a quoted string). The advantage here is that the query now runs on the remote server, so it's more likely to use an efficient query plan.
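One caveat: OPENQUERY only accepts a literal string, so if the id list is not known until runtime you have to build the statement dynamically. A rough sketch, assuming numeric ids to keep the quoting manageable (@idList is illustrative and should be validated before being concatenated, since this is dynamic SQL):
DECLARE @idList nvarchar(200);
DECLARE @sql    nvarchar(max);

SET @idList = N'1, 2';   -- hypothetical: supplied by the caller
SET @sql    = N'SELECT * FROM OPENQUERY(server2, ''SELECT * FROM table2 WHERE id IN (' + @idList + N')'')';

EXEC (@sql);   -- runs: SELECT * FROM OPENQUERY(server2, 'SELECT * FROM table2 WHERE id IN (1, 2)')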
The bottom line is that if you expect to be able to JOIN a remote and local table then you will always have some level of uncertainty; even if the query runs well one day, it might suddenly decide to run a LOT slower the following day. Small things, like adding a single row of data to the remote table, can completely change the way the query is executed.

Related

Single SELECT with linked server makes multiple SELECT by ID

This is my issue. I defined a linked server, let's call it LINKSERV, which has a database called LINKDB. In my server (MYSERV) I've got the MYDB database.
I want to perform the query below.
SELECT *
FROM LINKSERV.LINKDB.LINKSCHEMA.LINKTABLE
INNER JOIN MYSERV.MYDB.MYSCHEMA.MYTABLE ON MYKEYFIELD = LINKKEYFIELD
The problem is that if I take a look at the Profiler, I see that lots of SELECTs are made on the LINKSERV server. They look similar to:
SELECT *
FROM LINKTABLE WHERE LINKKEYFIELD = #1
Where #1 is a parameter that is changed for every SELECT.
This is, of course, unwanted because it performs poorly. I could be wrong, but I suppose the problem is related to the use of different servers in the JOIN. In fact, if I avoid this, the problem disappears.
Am I right? Is there a solution? Thank you in advance.
What you see may well be the optimal solution, as you have no filter statements that could be used to limit the number of rows returned from the remote server.
When you execute a query that draws data from two or more servers, the query optimizer has to decide what to do: pull a lot of data to the requesting server and do the joins there, or somehow send parts of the query to the linked server for evaluation? Depending on the filters and the availability or quality of the statistics on both servers, the optimizer may pick different operations for the join (merge or nested loop).
In your case, it has decided that the local table has fewer rows than the target and requests the target row that corresponds to each of the local rows.
This behavior and ways to improve performance are described in "Linked Server behavior when used on JOIN clauses".
The obvious optimizations are to update your statistics and add a WHERE statement that will filter the rows returned from the remote table.
Another optimization is to return only the columns you need from the remote server, instead of selecting *.
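As a rough illustration of those last two points, using the names from the question (the filter column, the date value and the extra columns are placeholders for whatever actually applies in your schema):
SELECT l.LINKKEYFIELD,
       l.SomeRemoteColumn,                      -- hypothetical: only the remote columns you need
       m.MYKEYFIELD,
       m.SomeLocalColumn                        -- hypothetical local column
FROM LINKSERV.LINKDB.LINKSCHEMA.LINKTABLE AS l
INNER JOIN MYSERV.MYDB.MYSCHEMA.MYTABLE AS m
        ON m.MYKEYFIELD = l.LINKKEYFIELD
WHERE l.SomeDateColumn >= '20240101';           -- hypothetical filter that limits the remote rows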

situations requiring temporary tables in stored procedures

Can anyone explain the situations in which we need to make use of temporary tables in stored procedures?
There are many cases where a complex join can really trip up the optimizer and make it do very expensive things. Sometimes the easiest way to cool the optimizer down is to break the complex query into smaller parts. You'll find a lot of misinformation out there about using a @table variable instead of a #temp table because @table variables supposedly always live in memory - this is a myth; don't believe it.
You may also find this worthwhile if you have an outlier query that would really benefit from a different index that is not on the base table, and you are not permitted (or it may be detrimental) to add that index to the base table (it may be an alternate clustered index, for example). A way to get around that would be to put the data in a #temp table (it may be a limited subset of the base table, acting like a filtered index), create the alternate index on the #temp table, and run the join against the #temp table. This is especially true if the data filtered into the #temp table is going to be used multiple times.
There are also times when you need to make many updates against some data, but you don't want to update the base table multiple times. You may have multiple things you need to do against a variety of other data that can't be done in one query. It can be more efficient to put the affected data into a #temp table, perform your series of calculations / modifications, then update back to the base table once instead of n times. If you use a transaction here against the base tables you could be locking them from your users for an extended period of time.
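A minimal sketch of that pattern, with invented table and column names:
SELECT OrderID, Amount, Status                -- hypothetical base table and columns
INTO #work
FROM dbo.Orders
WHERE Status = 'Pending';

-- Run the series of calculations against the temp table, not the base table.
UPDATE #work SET Amount = Amount * 1.10;                    -- step 1 (illustrative)
UPDATE #work SET Status = 'Processed' WHERE Amount > 100;   -- step 2 (illustrative)

-- Write back to the base table once, in a single short transaction.
BEGIN TRANSACTION;
UPDATE o
SET o.Amount = w.Amount,
    o.Status = w.Status
FROM dbo.Orders AS o
INNER JOIN #work AS w ON w.OrderID = o.OrderID;
COMMIT TRANSACTION;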
Another example is if you are using linked servers and the join across servers turns out to be very expensive. Instead you can stuff the remote data into a local #temp table first, create indexes on it locally, and run the query locally.

Optimize SQL query that uses NOT EXISTS with many columns in not exists' WHERE clause

Edit: using SQL Server 2005.
I have a query that has to check whether rows from a legacy database have already been imported into a new database, and imports them if they are not already there. Since the legacy database was badly designed, there is no unique id for the rows from the legacy table, so I have to use heuristics to decide whether the row has been imported. (I have no control over the legacy database.) The new database has a slightly different structure and I have to check several values, such as whether the create dates match, the group numbers match, etc., to heuristically decide whether the row exists in the new database or not. Not very pretty, but the bad design of the legacy system it has to interface with leaves me little choice.
Anyhow, the users of the system started throwing 10x to 100x more data at the system than I designed for, and now the query is running too slow. Can you suggest a way to make it faster? Here is the code, with some parts redacted for privacy or simplicity, but I think I left the important part:
INSERT INTO [...NewDatabase...]
SELECT [...Bunch of columns...]
FROM [...OldDatabase...] AS t1
WHERE t1.Printed = 0
AND NOT EXISTS(SELECT *
               FROM [...New Database...] AS s3
               WHERE year(s3.dtDatePrinted) = 1850 --This allows for re-importing rows marked for reprint
                 AND CAST(t1.[Group] AS int) = CAST(s3.vcGroupNum AS int)
                 AND RTRIM(t1.Subgroup) = s3.vcSubGroupNum
                 AND RTRIM(t1.SSN) = s3.vcPrimarySSN
                 AND RTRIM(t1.FirstName) = s3.vcFirstName
                 AND RTRIM(t1.LastName) = s3.vcLastName
                 AND t1.CaptureDate = s3.dtDateCreated)
Not knowing what the schema looks like, your first step is to look at the execution plan for those sub-queries (SQL Server's equivalent of EXPLAIN). That should show you where the database is chewing up its time. If there are no indexes, it's likely doing multiple full table scans. If I had to guess, I'd say t1.Printed and s3.dtDatePrinted are the two most vital columns to get indexed, as they'll weed out what's already been converted.
Also, anything which needs to be calculated might cause the database not to use the index - for example, the calls to RTRIM and CAST. That suggests you have dirty data in the new database. Trim it off permanently, and see about changing t1.[Group] to the right type.
year(s3.dtDatePrinted) = 1850 may fool the optimizer into not using an index on s3.dtDatePrinted (the execution plan should let you know). This appears to be just a flag set by you to check whether the row has already been converted, so set it to a specific date (i.e. 1850-01-01 00:00:00) and do a specific match (i.e. s3.dtDatePrinted = '1850-01-01 00:00:00'), and now that's a simple index lookup.
Making your comparison simpler would also help. Essentially what you have here is a 1-to-1 relationship between t1 and s3 (if t1 is the real table name, consider something more descriptive). So rather than matching each individual bit of s3 to t1, just give t1 a column that references the primary key of its corresponding s3 row. Then you have only one thing to check. If you can't alter t1 then you could use a third table to track t1-to-s3 mappings.
Once you have that, all you should have to do is a join to find rows in s3 which are not in t1.
SELECT s3.*
FROM s3
LEFT JOIN t1 ON t1.s3 = s3.id -- or whatever s3's primary key is
WHERE t1.s3 IS NULL
Try replacing this:
year(s3.dtDatePrinted) = 1850
With this:
s3.dtDatePrinted >= '1850-01-01' and s3.dtDatePrinted < '1851-01-01'
In this case, and if there's an index on dtDatePrinted MAYBE the optimizer could use a range index scan.
But I agree with previous posters that you should avoid the RTRIMs. One idea is keeping in s3 the untrimmed (original) value, or creating an intermediate table that maps untrimmed values to trimmed (new) ones, or even creating materialized views (indexed views, in SQL Server terms). But all this work is useless without proper indexes.
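Putting the earlier suggestions together, the check could end up looking something like this; it assumes the values in s3 are stored already trimmed and that t1.[Group] and s3.vcGroupNum now have compatible types (otherwise the original RTRIM/CAST expressions are still needed and will still block index use):
INSERT INTO [...NewDatabase...]
SELECT [...Bunch of columns...]
FROM [...OldDatabase...] AS t1
WHERE t1.Printed = 0
AND NOT EXISTS(SELECT *
               FROM [...New Database...] AS s3
               WHERE s3.dtDatePrinted >= '18500101'      -- sargable replacement for year(...) = 1850
                 AND s3.dtDatePrinted <  '18510101'
                 AND s3.vcGroupNum    = t1.[Group]       -- assumes the types now match, so no CAST
                 AND s3.vcSubGroupNum = t1.Subgroup      -- assumes values are stored pre-trimmed
                 AND s3.vcPrimarySSN  = t1.SSN
                 AND s3.vcFirstName   = t1.FirstName
                 AND s3.vcLastName    = t1.LastName
                 AND s3.dtDateCreated = t1.CaptureDate)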

SQL server compare tables and update if changed

Hi, I am looking for a solution to update values in table1 only if the values have changed. I mean: compare to tableb and update only the changed values.
An alternative to Neil's solution is to use BINARY_CHECKSUM and store that in a field in your table, then compare against that.
Not saying it's a better solution, just giving you some options.
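A rough sketch of the checksum idea, with an invented RowChecksum column added to table1 (note that BINARY_CHECKSUM can collide, so treat it as a cheap change detector rather than a guarantee of equality):
UPDATE t1
SET t1.IntCol      = tb.IntCol,
    t1.varcharCol  = tb.varcharCol,
    t1.RowChecksum = BINARY_CHECKSUM(tb.IntCol, tb.varcharCol)       -- store the new checksum
FROM table1 AS t1
INNER JOIN tableb AS tb ON tb.pk = t1.pk
WHERE t1.RowChecksum <> BINARY_CHECKSUM(tb.IntCol, tb.varcharCol);   -- only rows whose values changed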
For multiple rows at a time, try:
-- dt contains the TableB rows whose values differ from the matching TableA rows
UPDATE a
SET IntCol=dt.IntCol
   ,varcharCol=dt.varcharCol
   ,DatetimeCol=dt.DatetimeCol
FROM TableA a
INNER JOIN (SELECT pk,IntCol,varcharCol,DatetimeCol FROM TableB
            EXCEPT
            SELECT pk,IntCol,varcharCol,DatetimeCol FROM TableA
           ) dt ON a.pk=dt.pk
You could use a trigger on the source table that updates the target table.
However, if there's a large volume, that could slow inserts/updates on the source quite badly. In which case, I'd make the trigger insert into a third table instead. A scheduled job could then process that table and delete the records it has handled (with appropriate use of transactions, of course).
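For the record, a bare-bones version of the trigger-into-a-third-table idea (all object names here are invented):
CREATE TABLE dbo.SourceTable_Changes          -- hypothetical staging table
(
    pk        int      NOT NULL,
    ChangedAt datetime NOT NULL DEFAULT (GETDATE())
);
GO

CREATE TRIGGER trg_SourceTable_Capture
ON dbo.SourceTable                            -- hypothetical source table
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Just record which rows changed; a scheduled job applies them to the target and clears this table.
    INSERT INTO dbo.SourceTable_Changes (pk)
    SELECT pk FROM inserted;
END;
GO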
An entirely different approach would be to move the triggering up one layer into your application, and use a message-based approach. In this situation you get the best of both worlds, because the process listening for messages will process them in order as fast as it can, leading to almost real-time updates of the target table.
So you can have your cake, and eat it.

TSQL Join efficiency

I'm developing an ASP.NET/C#/SQL application. I've created a query for a specific grid-view that involves a lot of joins to get the data needed. On the hosted server, the query has randomly started taking up to 20 seconds to process. I'm sure it's partly an overloaded host-server (because sometimes the query takes <1s), but I don't think the query (which is actually a view reference via a stored procedure) is at all optimal regardless.
I'm unsure how to improve the efficiency of the below query:
(There are currently about 1,500 records matching those joins.)
SELECT dbo.ca_Connections.ID,
dbo.ca_Connections.Date,
dbo.ca_Connections.ElectricityID,
dbo.ca_Connections.NaturalGasID,
dbo.ca_Connections.LPGID,
dbo.ca_Connections.EndUserID,
dbo.ca_Addrs.LotNumber,
dbo.ca_Addrs.UnitNumber,
dbo.ca_Addrs.StreetNumber,
dbo.ca_Addrs.Street1,
dbo.ca_Addrs.Street2,
dbo.ca_Addrs.Suburb,
dbo.ca_Addrs.Postcode,
dbo.ca_Addrs.LevelNumber,
dbo.ca_CompanyConnectors.ConnectorID,
dbo.ca_CompanyConnectors.CompanyID,
dbo.ca_Connections.HandOverDate,
dbo.ca_Companies.Name,
dbo.ca_States.State,
CONVERT(nchar, dbo.ca_Connections.Date, 103) AS DateView,
CONVERT(nchar, dbo.ca_Connections.HandOverDate, 103) AS HandOverDateView
FROM dbo.ca_CompanyConnections
INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID
INNER JOIN dbo.ca_Addrs ON dbo.ca_Connections.AddressID = dbo.ca_Addrs.ID
INNER JOIN dbo.ca_Companies ON dbo.ca_CompanyConnectors.CompanyID = dbo.ca_Companies.ID
INNER JOIN dbo.ca_States ON dbo.ca_Addrs.StateID = dbo.ca_States.ID
It may have nothing to do with your query and everything to do with the data transfer.
How fast does the query run in query analyzer?
How does this compare to the web page?
If you are bringing back the entire data set you may want to introduce paging, say 100 records per page.
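If paging does turn out to be the answer, here is a ROW_NUMBER-based sketch (works on SQL Server 2005 and later; the page variables and the ORDER BY column are up to you, and the joins are the same ones the view already has):
DECLARE @PageNumber int, @PageSize int;
SET @PageNumber = 1;
SET @PageSize   = 100;

WITH numbered AS
(
    SELECT c.ID, c.Date, c.ElectricityID,            -- ...rest of the column list...
           ROW_NUMBER() OVER (ORDER BY c.Date DESC) AS rn
    FROM dbo.ca_Connections AS c
    /* ...same joins as in the view... */
)
SELECT *
FROM numbered
WHERE rn BETWEEN (@PageNumber - 1) * @PageSize + 1
             AND @PageNumber * @PageSize;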
The first thing I normally suggest is to profile and look for potential indexes to help out. But when the problem is sporadic like this and the normal case is for the query to run in <1 sec, it's more likely due to lock contention than a missing index. That means the cause is something else in the system making this query take longer. Perhaps an insert or update, or perhaps another select query, one that you would normally expect to take a little longer, so the extra time on its end isn't noticed.
I would start with indexing, but I have a database that belongs to a third-party application, so creating my own indexes is not an option. I read an article (sorry, can't find the reference) recommending breaking the query up into table variables or temp tables (depending on the number of records) when you have multiple tables in your query (not sure what the magic number is).
Start with dbo.ca_CompanyConnections, dbo.ca_CompanyConnectors and dbo.ca_Connections. Include the fields you need, and then substitute those three joined tables with just the temp table, as in the sketch below.
Not sure what the issue is (I'd like to hear recommendations), but it seems like performance starts to drop once you get over about 5 tables.
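As a sketch of that suggestion, using the tables named above (keep only the columns the grid actually needs):
-- Materialise the three 'connection' tables first...
SELECT cn.ID, cn.Date, cn.ElectricityID, cn.NaturalGasID, cn.LPGID,
       cn.EndUserID, cn.HandOverDate, cn.AddressID,
       cc.ConnectorID, cc.CompanyID
INTO #conn
FROM dbo.ca_CompanyConnections AS cns
INNER JOIN dbo.ca_CompanyConnectors AS cc ON cns.CompanyID = cc.CompanyID
INNER JOIN dbo.ca_Connections AS cn ON cns.ConnectionID = cn.ID;

-- ...then join the temp table to the remaining tables.
SELECT t.*,
       a.LotNumber, a.UnitNumber, a.StreetNumber, a.Street1, a.Street2,
       a.Suburb, a.Postcode, a.LevelNumber,
       co.Name, s.State,
       CONVERT(nchar, t.Date, 103) AS DateView,
       CONVERT(nchar, t.HandOverDate, 103) AS HandOverDateView
FROM #conn AS t
INNER JOIN dbo.ca_Addrs AS a ON t.AddressID = a.ID
INNER JOIN dbo.ca_Companies AS co ON t.CompanyID = co.ID
INNER JOIN dbo.ca_States AS s ON a.StateID = s.ID;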