TSQL Join efficiency

I'm developing an ASP.NET/C#/SQL application. I've created a query for a specific grid-view that involves a lot of joins to get the data needed. On the hosted server, the query has randomly started taking up to 20 seconds to process. I'm sure it's partly an overloaded host-server (because sometimes the query takes <1s), but I don't think the query (which is actually a view reference via a stored procedure) is at all optimal regardless.
I'm unsure how to improve the efficiency of the below query:
(There are currently about 1,500 records matching those joins.)
SELECT dbo.ca_Connections.ID,
dbo.ca_Connections.Date,
dbo.ca_Connections.ElectricityID,
dbo.ca_Connections.NaturalGasID,
dbo.ca_Connections.LPGID,
dbo.ca_Connections.EndUserID,
dbo.ca_Addrs.LotNumber,
dbo.ca_Addrs.UnitNumber,
dbo.ca_Addrs.StreetNumber,
dbo.ca_Addrs.Street1,
dbo.ca_Addrs.Street2,
dbo.ca_Addrs.Suburb,
dbo.ca_Addrs.Postcode,
dbo.ca_Addrs.LevelNumber,
dbo.ca_CompanyConnectors.ConnectorID,
dbo.ca_CompanyConnectors.CompanyID,
dbo.ca_Connections.HandOverDate,
dbo.ca_Companies.Name,
dbo.ca_States.State,
CONVERT(nchar, dbo.ca_Connections.Date, 103) AS DateView,
CONVERT(nchar, dbo.ca_Connections.HandOverDate, 103) AS HandOverDateView
FROM dbo.ca_CompanyConnections
INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID
INNER JOIN dbo.ca_Addrs ON dbo.ca_Connections.AddressID = dbo.ca_Addrs.ID
INNER JOIN dbo.ca_Companies ON dbo.ca_CompanyConnectors.CompanyID = dbo.ca_Companies.ID
INNER JOIN dbo.ca_States ON dbo.ca_Addrs.StateID = dbo.ca_States.ID

It may have nothing to do with your query and everything to do with the data transfer.
How fast does the query run in query analyzer?
How does this compare to the web page?
If you are bringing back the entire data set you may want to introduce paging, say 100 records per page.
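For instance, if the grid only displays a page of rows at a time, a ROW_NUMBER-based page query along these lines might help. This is a sketch only: vw_ConnectionGrid, @PageNumber and @PageSize are made-up names, and it assumes SQL Server 2005 or later.
-- Hypothetical paging sketch: return one page of rows instead of all ~1500.
DECLARE @PageNumber int, @PageSize int;
SET @PageNumber = 1;
SET @PageSize   = 100;

SELECT *
FROM (
    SELECT v.*,
           ROW_NUMBER() OVER (ORDER BY v.Date DESC) AS RowNum
    FROM dbo.vw_ConnectionGrid AS v   -- stands in for the existing view
) AS paged
WHERE paged.RowNum BETWEEN (@PageNumber - 1) * @PageSize + 1
                       AND @PageNumber * @PageSize
ORDER BY paged.RowNum;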

The first thing I normally suggest is to profile to look for potential indexes to help out. But when the problem is sporadic like this and the normal case is for the query to run in <1 sec, it's more likely due to lock contention rather than a missing index. That means the cause is something else in the system making this query take longer. Perhaps an insert or update. Perhaps another select query, one that you would normally expect to take a little longer, so the extra time on its end isn't noticed.

I would normally start with indexing, but I work with a database that belongs to a third-party application, so creating my own indexes is not an option. I read an article (sorry, can't find the reference) recommending breaking the query up into table variables or temp tables (depending on the number of records) when you have many tables in your query (not sure what the magic number is).
Start with dbo.ca_CompanyConnections, dbo.ca_CompanyConnectors and dbo.ca_Connections. Include only the fields you need, and then substitute the temp table for those three joined tables in the main query, as sketched below.
Not sure what the underlying issue is (I'd like to hear other recommendations), but it seems like performance drops off once you get past about five tables in a query.
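A rough sketch of that idea against the query above (untested against your schema; since the original is a view referenced via a stored procedure, this staging would have to live in the stored procedure, because a view can't create temp tables):
-- Stage the first three tables into a temp table.
SELECT  cc.CompanyID,
        cc.ConnectionID,
        ctr.ConnectorID,
        c.ID,
        c.Date,
        c.ElectricityID,
        c.NaturalGasID,
        c.LPGID,
        c.EndUserID,
        c.HandOverDate,
        c.AddressID
INTO    #ConnectionStage
FROM    dbo.ca_CompanyConnections AS cc
INNER JOIN dbo.ca_CompanyConnectors AS ctr ON cc.CompanyID = ctr.CompanyID
INNER JOIN dbo.ca_Connections AS c ON cc.ConnectionID = c.ID;

-- Then join the remaining tables against the staged rows only.
SELECT  s.*,
        a.LotNumber, a.UnitNumber, a.StreetNumber, a.Street1, a.Street2,
        a.Suburb, a.Postcode, a.LevelNumber,
        co.Name, st.State,
        CONVERT(nchar, s.Date, 103) AS DateView,
        CONVERT(nchar, s.HandOverDate, 103) AS HandOverDateView
FROM    #ConnectionStage AS s
INNER JOIN dbo.ca_Addrs     AS a  ON s.AddressID = a.ID
INNER JOIN dbo.ca_Companies AS co ON s.CompanyID = co.ID
INNER JOIN dbo.ca_States    AS st ON a.StateID = st.ID;

DROP TABLE #ConnectionStage;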

Related

Speed Up Processing of SQL in Access 2007

I've been trying to get this SQL query I am using in an Access 2007 database to execute faster. I've already eliminated any distinct queries involved -- and Access doesn't really give me too much information on where the hang-up is. The query is pulling about 150,000 rows and is taking about 2 minutes to complete.
I'm still learning the syntax for SQL in Access, but I think I have it set up correctly. I'd appreciate any insights or hints on what I might be missing.
SELECT OLS_UNITS_GROSS_ACRES.AGMT_NUM,
MAX(OLS_UNITS_GROSS_ACRES.UNIT_GROSS_ACRES) AS UNIT_GROSS,
SUM(MIS_ACREAGES.ACRE_AMT) AS [RELATED ACRES]
FROM ((OLS_UNITS_GROSS_ACRES
INNER JOIN MIS_XREFERENCED_AGMTS_M ON OLS_UNITS_GROSS_ACRES.ARRG_KEY = MIS_XREFERENCED_AGMTS_M.ACTIVE_ARRG_KEY)
INNER JOIN ALL_AGMTS1 ON MIS_XREFERENCED_AGMTS_M.RELATED_ARRG_KEY = ALL_AGMTS1.ARRG_KEY)
INNER JOIN MIS_ACREAGES ON ALL_AGMTS1.ARRG_KEY = MIS_ACREAGES.ARRG_KEY
WHERE (((ALL_AGMTS1.SUBJ_CODE)="LSE")
AND (((MIS_ACREAGES.ACRE_TYPE_CODE)="CNT")
OR ((MIS_ACREAGES.ACRE_TYPE_CODE)="STN")
OR ((MIS_ACREAGES.ACRE_TYPE_CODE)="DNT")))
GROUP BY OLS_UNITS_GROSS_ACRES.AGMT_NUM;
There are several things you could try. Eliminating any DISTINCT queries is a good start.
Another way to speed it up is to create an index on fields that are frequently queried and/or sorted. This will speed up the SELECT query, although it slows down INSERT commands, since the information now has to be written to the index as well as the table.
If your database design is stable and its objects will not be renamed, you can safely turn off Name AutoCorrect to improve performance.
Over time, the performance of a database file can also degrade because of space that remains allocated to deleted or temporary objects. The Compact and Repair command removes this wasted space and can help a database run faster and more efficiently.
I got most of this from the following link; it had several other ideas, but those were the best ones: http://office.microsoft.com/en-us/access-help/help-access-run-faster-HA010235589.aspx
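As a hedged example of the indexing suggestion: the index names below are arbitrary, this assumes MIS_ACREAGES and ALL_AGMTS1 are local Access tables rather than linked queries, and each statement would be run as its own DDL query in Access.
CREATE INDEX idx_MisAcreages_Arrg ON MIS_ACREAGES (ARRG_KEY, ACRE_TYPE_CODE);

CREATE INDEX idx_AllAgmts1_Arrg ON ALL_AGMTS1 (ARRG_KEY, SUBJ_CODE);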

How to speed up joins

I'm using SQL Server 2008 R2. I have a problem returning data to the user because of massive joins (for example, I need to make 5 inner plus 6 left joins in one query, usually against TVFs, sometimes tables). It takes too long.
What are the workarounds for this problem?
Should I denormalize my database?
What are the best practices for avoiding a huge number of joins?
I'd have to see the SQL to troubleshoot specifics, but here are a few things I do when pulling results that have extremely high demand:
Use your tools. Display Estimated Execution Plan can expose some obvious vagaries in your logic.
Learn to love 'where exists' and 'having'. You can sometimes minimize the focus and scope by qualifying in creative ways that don't require hard I/O. This is more true for sub-queries than joins, but I add a clause for every outer join I need.
Most importantly, IMO, don't be afraid of staging your results. You sometimes need to process billions/trillions of transactions against millions of records, and what takes hours with joins can be accomplished in minutes or seconds by staging. If you only need x% of your top 2 or 3 tables, why join every record top to bottom? Sometimes it's just too much overhead.
Pull your simplest result-set down to a stage table (or temp table, whatever you need), index it and then go after the next chunk (there's a rough sketch of this at the end of this answer). That usually saves me a fortune in memory.
Use CTEs when you can. However, my experience has been they degrade beyond a certain point. Nice for ancillary tables but not for serious volume.
Be creative in your combinations. I'll use those exists clauses in Stage 1 (reading Tables a, b and c) to only bring back the records that also exist in tables d, e and f.
A lot of the expert SQL advice is not based on VLDBs - it's based on Customer, Orders, Demographic type schemas.
Are these stored procs run natively?
Here's a good (over-simplified) example of staging:
Let's say you wanted to find all of the high-risk individuals in your city (might as well make it interesting). You have a phone company DB (national) indexed by state, city, last name, first name, address, and an FBI DB (global) indexed by last name, first name, country, region, address. Let's say the FBI DB has multiple records for each individual due to multiple past addresses.
You could join the two DBs on the common elements and then qualify your criteria. Or...
Select RecordID from Phone as P1
Where State = 'MyState' and City = 'MyCity' and
exists (Select 1
From TheMan as M1
Where M1.Last = P1.Last and M1.First = P1.First and M1.Risk > 80)
Now I have a small record-set to qualify and a small result-set to work from. From there I can go get details. That's a good candidate for a CTE, and I could shoot a dozen holes in the logic, but it illustrates the concept. If you bring M1.Risk (a non-indexed field) into the equation with a full join, you're forcing SQL Server to plan against it in certain situations. Not necessarily here, but as your logic gets more complex and subsequent non-indexed criteria come into play.
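And the stage-table variant mentioned above might look something like this (illustrative only, reusing the hypothetical Phone/TheMan tables from the example):
-- Pull the small, well-qualified set first, index it,
-- then join the heavier table against that stage only.
SELECT P1.RecordID, P1.Last, P1.First
INTO #Stage1
FROM Phone AS P1
WHERE P1.State = 'MyState' AND P1.City = 'MyCity';

CREATE INDEX IX_Stage1_Name ON #Stage1 (Last, First);

SELECT S.RecordID, M1.Risk
FROM #Stage1 AS S
INNER JOIN TheMan AS M1
    ON M1.Last = S.Last AND M1.First = S.First
WHERE M1.Risk > 80;

DROP TABLE #Stage1;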

Slow SQL Queries, Order Table by Date?

I have a SQL Server 2008 database that I query regularly and that has over 30 million entries (joy!). Unfortunately this database cannot be drastically changed because it is still in use by R&D.
When I query this database, it takes FOREVER. By that I mean I haven't been patient enough to wait for results (after 2 minutes I have to cancel to avoid locking the R&D department out). Even with a fairly short date range (anything more than a few months), it is basically impossible to get any results from it. I am filtering on 4 of the columns and unfortunately have to use an inner join to another table (which I've been told is very costly in terms of query efficiency, but it is unavoidable). The inner-joined table has fewer than 100k entries.
What I was wondering is: is it possible to organize the table so that it is ordered by date by default, to reduce the number of rows it has to search through?
If this is not possible, is there anything I can do to reduce query times? Is there any other useful information that could assist me in coming up with a solution?
I have included a sample of the query that I use:
SELECT DISTINCT N.TestName
FROM [DalsaTE].[dbo].[ResultsUut] U
INNER JOIN [DalsaTE].[dbo].[ResultsNumeric] N
ON N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
AND N.ResultsUutId = U.ResultsUutId
WHERE U.DeviceName = 'BO-32-3HK60-00-R'
AND U.StartDateTime > '2011-11-25 01:10:10.001'
ORDER BY N.TestName
Any help or suggestions are appreciated!
It sounds like the datetime may be a text-based field, and consequently an index isn't being used?
Could you try the following to see if you have any speed improvement:
select distinct N.TestName
from [DalsaTE].[dbo].[ResultsUut] U
inner join [DalsaTE].[dbo].[ResultsNumeric] N
on N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
and N.ResultsUutId = U.ResultsUutId
where U.DeviceName = 'BO-32-3HK60-00-R'
and U.StartDateTime > cast('2011-11-25 01:10:10.001' as datetime)
order by N.TestName
It would also be worth trying changing your inner join to a left outer join, as those occasionally perform faster for no obvious reason (at least none that I'm aware of).
You can add an index based on your date column, which should improve your query time. You can create it with a CREATE INDEX statement or via the table designer.
Is the sole purpose of the join to provide sorting? If so, a quick thing to try would be to remove this, and see how much of a difference it makes - at least then you'll know where to focus your attention.
Finally, SQL server management studio has some useful tools such as execution plans that can help diagnose performance issues. Good luck!
There are a number of problems which may be causing delays in the execution of your query.
Indexes other than the clustered index (typically the primary key) do not reorder the data; they merely create a separate structure (think phone book) that orders a set of values and points back to the primary key.
Without seeing the type of data or the existing indexes, it's difficult, but at the very least, the following ASCENDING indexes might help:
[DalsaTE].[dbo].[ResultsNumeric] ModeDescription and ResultsUutId and TestName
[DalsaTE].[dbo].[ResultsUut] StartDateTime and DeviceName and ResultsUutId
With the indexes above, the sample query you gave could be completed without performing a single lookup on the actual table data.
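A hedged sketch of those indexes as DDL (the index names are arbitrary; run in the DalsaTE database and check that nothing similar already exists):
-- Key columns cover the join/filter/sort columns so the query
-- can be answered from the indexes alone.
CREATE NONCLUSTERED INDEX IX_ResultsNumeric_Mode_Uut_Test
    ON dbo.ResultsNumeric (ModeDescription, ResultsUutId, TestName);

CREATE NONCLUSTERED INDEX IX_ResultsUut_Device_Start
    ON dbo.ResultsUut (DeviceName, StartDateTime, ResultsUutId);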

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space; joining tables is one of those things, and you'll see more of a performance degradation from the second query than from the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing are the only things that will tell you that. MySQL with moderately large tables seems to be susceptible to this, IME. Performing three separate single-table queries can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it were an ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something
From foo a
Left Join foo2 b On a.foreign_key = b.key
Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that in one database hit you have all the data you need, a single SQL statement will give better performance 99% of the time. I'm not sure whether the connection is being created dynamically in this case, but if so, doing that is expensive. Even if the process is reusing existing connections, the DBMS doesn't get to optimize the queries in the best way and isn't really making use of the relationships.
The only way I could ever see making the calls like this for performance reasons is if the data retrieved via the foreign key is large and only needed in some cases. But in the sample you describe it just grabs the data whenever the key exists, so that is not the case here and nothing is gained.
The only "gotcha" to all of this is if the result set you're working with involves a lot of joins, or even nested joins.
I've had two or three instances now where the original query I inherited consisted of a single query that had so many joins in it that it would take SQL Server a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a number of smaller single-select statements, then constructed the final result set in this manner.
This change dramatically improved the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objection's sake here, just to point out that the code may have been broken down to such a granular level to address a similar issue. A rough sketch of the pattern follows.
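A minimal T-SQL sketch of that breakdown, with entirely hypothetical table and column names:
-- Stage a narrow "one shot" first.
DECLARE @RecentOrders TABLE (OrderID int PRIMARY KEY, CustomerID int, OrderDate datetime);

INSERT INTO @RecentOrders (OrderID, CustomerID, OrderDate)
SELECT o.OrderID, o.CustomerID, o.OrderDate
FROM dbo.Orders AS o
WHERE o.OrderDate >= '20120101';

-- Each subsequent select joins against the small staged set instead of the full tables.
SELECT r.OrderID, r.OrderDate, c.CustomerName
FROM @RecentOrders AS r
INNER JOIN dbo.Customers AS c ON c.CustomerID = r.CustomerID;
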
A single SQL query will generally give better performance, as the SQL server (which sometimes isn't even in the same location) only needs to handle one request. If you use multiple SQL queries you introduce a lot of overhead:
executing more CPU instructions,
sending a second query to the server,
creating a second thread on the server,
executing possibly more CPU instructions on the server,
destroying a second thread on the server,
and sending the second results back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two-table join is usually the best way to go at this problem. Depending on the state of the tables and indexing there are certain cases where it may be better to do the two select statements, but typically I haven't run into that until I start approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records; that is the biggest performance hit a database takes (in my limited experience).
You should always try to minimize the number of queries to the database when you can. Your example is a perfect candidate for a single query. That way you will later be able to cache more easily, or handle more requests at the same time, because instead of always issuing 2-3 queries that each require a connection, you will have only one each time.
There are many cases that require different solutions, and it isn't possible to cover them all together.
A join scans both tables and loops to match each record of the first table against the second table. A simple select query will work faster in many cases, as it only has to deal with the primary/unique key (if one exists) to look up the data internally.

Need tips for optimizing SQL Query using a JOIN

The Query I'm writing runs fine when looking at the past few days, once I go over a week it crawls (~20min). I am joining 3 tables together. I was wondering what things I should look for to make this run faster. I don't really know what other information is needed for the post.
EDIT: More info: db is Sybase 10. Query:
SELECT a.id, a.date, a.time, a.signal, a.noise,
b.signal_strength, b.base_id, b.firmware,
a.site, b.active, a.table_key_id
FROM adminuser.station AS a
JOIN adminuser.base AS b
ON a.id = b.base_id
WHERE a.site = 1234 AND a.date >= '2009-03-20'
I also took out the 3rd JOIN and it still runs extremely slow. Should I try another JOIN method?
I don't know Sybase 10 that well, but try running that query once for, say, a 10-day period, and then 10 times, once for each day of that period, and compare the times. If the time in the first case is much higher, you've probably hit the database cache limits.
The solution is then to simply run the queries for shorter periods in a loop (in the program, not in SQL). It works especially well if table A is partitioned by date.
You can get a lot of information (assuming you're using MSSQL here) by running your query in SQL Server Management Studio with the Include Actual Execution Plan option set (in the Query menu).
This will show you a diagram of the steps that SQLServer performs in order to execute the query - with relative costs against each step.
The next step is to rework the query a little (try doing it a different way) then run the new version and the old version at the same time. You will get two execution plans, with relative costs not only against each step, but against the two versions of the query! So you can tell objectively if you are making progress.
I do this all the time when debugging/optimizing queries.
Make sure you have indexes on the foreign keys.
It sounds more like you have a memory leak or aren't closing database connections in your client code than that there's anything wrong with the query.
[edit]
Nevermind: you mean quering over a date range rather than the duration the server has been active. I'll leave this up to help others avoid the same confusion.
Also, it would help if you could post the sql query, even if you need to obfuscate it some first, and it's a good bet to check if there's an index on your date column and the number of records returned by the longer range.
You may want to look into using a PARTITION for the date ranges, if your DB supports it. I've heard this can help significantly.
Grab the book "Professional SQL Server 2005 Performance Tuning"; it's pretty great.
You didn't mention your database. If it's not SQL Server, the specifics of how to get the data might be different, but the advice is fundamentally the same.
Look at indexing, for sure, but the first thing to do is to follow Blorgbeard's advice and scan for execution plans using Management Studio (again, if you are running SQL Server).
What I'm guessing you'll see is that for small date ranges, the optimizer picks a reasonable query plan, but that when the date range is large, it picks something completely different, likely involving either table scans or index scans, and possibly joins that lead to very large temporary recordsets. The execution plan analyzer will reveal all of this.
A scan means that the optimizer thinks that grinding over the whole table or the whole index is cheaper for what you are trying to do than seeking specific values.
What you eventually want to do is get indexes and the syntax of your query set up such that you keep index seeks in the query plan for your query regardless of the date range, or, failing that, that the scans you require are filtered as well as you can manage to minimize temporary recordset size and thereby avoid excessive reads and I/O.
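For example, if the dominant filters really are site and date, an index along these lines might keep the plan on seeks. This is a sketch only; the index name is arbitrary and the exact options depend on whether this is Sybase or SQL Server.
-- Hypothetical index supporting WHERE site = ... AND date >= ...,
-- with id added so the join to adminuser.base can be resolved from the index.
CREATE INDEX ix_station_site_date
    ON adminuser.station (site, date, id);
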
SELECT
a.id, a.date, a.time, a.signal, a.noise,a.site, b.active, a.table_key_id,
b.signal_strength, b.base_id, b.firmware
FROM
( SELECT * FROM adminuser.station
WHERE site = 1234 AND date >= '2009-03-20') AS a
JOIN
adminuser.base AS b
ON
a.id = b.base_id
I rewrote the query somewhat, so as to first filter the desired rows and then perform the join, rather than performing the join and then filtering the result.
Rather than pulling * from the sub-query, you can select just the columns you want, which might help a little.
Maybe this will be of some help in speeding things up.
While this is valid in MySQL, I am not sure of the Sybase syntax though.