Need tips for optimizing SQL Query using a JOIN

The query I'm writing runs fine when looking at the past few days, but once I go over a week it crawls (~20 min). I am joining 3 tables together. I was wondering what things I should look for to make this run faster. I don't really know what other information is needed for the post.
EDIT: More info: db is Sybase 10. Query:
SELECT a.id, a.date, a.time, a.signal, a.noise,
b.signal_strength, b.base_id, b.firmware,
a.site, b.active, a.table_key_id
FROM adminuser.station AS a
JOIN adminuser.base AS b
ON a.id = b.base_id
WHERE a.site = 1234 AND a.date >= '2009-03-20'
I also took out the 3rd JOIN and it still runs extremely slow. Should I try another JOIN method?

I don't know Sybase 10 that well, but try running that query once for, say, a 10-day period, and then 10 times, once for each day in that period, and compare the times. If the time in the first case is much higher, you've probably hit the database cache limits.
The solution is then simply to run queries for shorter periods in a loop (in the program, not in SQL). It works especially well if table A is partitioned by date.
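A minimal sketch of the "one query per day, in a loop" idea, using Python's built-in sqlite3 module as a stand-in for Sybase. The table layout, the sample rows and the helper name fetch_by_day are assumptions made up for illustration; the real adminuser.station schema isn't shown in the question.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical stand-in for adminuser.station; dates are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE station (id INTEGER, site INTEGER, date TEXT, signal REAL)")
conn.executemany(
    "INSERT INTO station VALUES (?, ?, ?, ?)",
    [(i, 1234, (date(2009, 3, 20) + timedelta(days=i % 10)).isoformat(), 0.5 * i)
     for i in range(50)],
)

def fetch_by_day(conn, site, start, end):
    """Run one small query per day in [start, end] instead of one large range query."""
    rows = []
    day = start
    while day <= end:
        rows.extend(conn.execute(
            "SELECT id, date, signal FROM station WHERE site = ? AND date = ?",
            (site, day.isoformat()),
        ).fetchall())
        day += timedelta(days=1)
    return rows

# Ten single-day queries versus one ten-day range query over the same data.
chunked = fetch_by_day(conn, 1234, date(2009, 3, 20), date(2009, 3, 29))
single = conn.execute(
    "SELECT id, date, signal FROM station WHERE site = ? AND date BETWEEN ? AND ?",
    (1234, "2009-03-20", "2009-03-29"),
).fetchall()
```

Both approaches return the same rows here; whether the loop is actually faster on a real server depends on caching and partitioning, so time both before committing to it.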

You can get a lot of information (assuming you're using MSSQL here) by running your query in SQL Server Management Studio with the Include Actual Execution Plan option set (in the Query menu).
This will show you a diagram of the steps that SQLServer performs in order to execute the query - with relative costs against each step.
The next step is to rework the query a little (try doing it a different way) then run the new version and the old version at the same time. You will get two execution plans, with relative costs not only against each step, but against the two versions of the query! So you can tell objectively if you are making progress.
I do this all the time when debugging/optimizing queries.

Make sure you have indexes on the foreign keys.
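As a rough illustration of what an index on the join column changes, here is a sketch using Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN. SQLite is only a stand-in (Sybase and SQL Server expose plans differently), and the table definitions and the index name idx_base_base_id are assumptions.

```python
import sqlite3

# Hypothetical, simplified versions of the two tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE station (table_key_id INTEGER PRIMARY KEY, id INTEGER, site INTEGER, date TEXT);
    CREATE TABLE base (base_id INTEGER, firmware TEXT);
""")

QUERY = """
    SELECT a.id, a.date, b.firmware
    FROM station AS a
    JOIN base AS b ON a.id = b.base_id
    WHERE a.site = 1234 AND a.date >= '2009-03-20'
"""

def plan(conn, sql):
    """Return the human-readable step descriptions from EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(conn, QUERY)
conn.execute("CREATE INDEX idx_base_base_id ON base (base_id)")
plan_after = plan(conn, QUERY)

for step in plan_before + ["--- after CREATE INDEX ---"] + plan_after:
    print(step)
```

After the index exists, the step for table b should mention idx_base_base_id instead of a scan or a temporary automatic index.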

It sounds more like you have a memory leak or aren't closing database connections in your client code than that there's anything wrong with the query.
[edit]
Never mind: you mean querying over a date range rather than the duration the server has been active. I'll leave this up to help others avoid the same confusion.
Also, it would help if you could post the sql query, even if you need to obfuscate it some first, and it's a good bet to check if there's an index on your date column and the number of records returned by the longer range.

You may want to look into using a PARTITION for the date ranges, if your DB supports it. I've heard this can help significantly.

Grab the book "Professional SQL Server 2005 Performance Tuning"; it's pretty great.

You didn't mention your database. If it's not SQL Server, the specifics of how to get the data might be different, but the advice is fundamentally the same.
Look at indexing, for sure, but the first thing to do is to follow Blorgbeard's advice and scan for execution plans using Management Studio (again, if you are running SQL Server).
What I'm guessing you'll see is that for small date ranges, the optimizer picks a reasonable query plan, but that when the date range is large, it picks something completely different, likely involving either table scans or index scans, and possibly joins that lead to very large temporary recordsets. The execution plan analyzer will reveal all of this.
A scan means that the optimizer thinks that grinding over the whole table or the whole index is cheaper for what you are trying to do than seeking specific values.
What you eventually want to do is get indexes and the syntax of your query set up such that you keep index seeks in the query plan for your query regardless of the date range, or, failing that, that the scans you require are filtered as well as you can manage to minimize temporary recordset size and thereby avoid excessive reads and I/O.

SELECT
a.id, a.date, a.time, a.signal, a.noise,a.site, b.active, a.table_key_id,
b.signal_strength, b.base_id, b.firmware
FROM
( SELECT * FROM adminuser.station
WHERE site = 1234 AND date >= '2009-03-20') AS a
JOIN
adminuser.base AS b
ON
a.id = b.base_id
I rewrote the query so that it first filters the desired rows and then performs the join, rather than performing the join and then filtering the result.
Rather than pulling * from the sub-query, you can select just the columns you want, which might help a little.
This may be of some help in speeding things up.
This is valid in MySQL, but I am not sure about the Sybase syntax.
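Before timing a rewrite like this, it is worth confirming that both forms return the same rows. A small sketch using Python's sqlite3 module as a stand-in database; the sample rows are made up, and many optimizers push the filter below the join on their own, so the two forms may well end up with the same plan.

```python
import sqlite3

# Hypothetical sample data: only two station rows match site 1234 on or after 2009-03-20.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE station (id INTEGER, site INTEGER, date TEXT, signal REAL);
    CREATE TABLE base (base_id INTEGER, firmware TEXT);
    INSERT INTO station VALUES (1, 1234, '2009-03-19', 0.1), (1, 1234, '2009-03-21', 0.2),
                               (2, 1234, '2009-03-22', 0.3), (2, 9999, '2009-03-23', 0.4);
    INSERT INTO base VALUES (1, 'v1'), (2, 'v2');
""")

JOIN_THEN_FILTER = """
    SELECT a.id, a.date, a.signal, b.firmware
    FROM station AS a
    JOIN base AS b ON a.id = b.base_id
    WHERE a.site = 1234 AND a.date >= '2009-03-20'
"""

FILTER_THEN_JOIN = """
    SELECT a.id, a.date, a.signal, b.firmware
    FROM (SELECT id, date, signal FROM station
          WHERE site = 1234 AND date >= '2009-03-20') AS a
    JOIN base AS b ON a.id = b.base_id
"""

# Sort both result sets so the comparison ignores row order.
rows_a = sorted(conn.execute(JOIN_THEN_FILTER).fetchall())
rows_b = sorted(conn.execute(FILTER_THEN_JOIN).fetchall())
print(rows_a == rows_b)
```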

Related

Changing a view definition improves/degrades two different queries

We have many Oracle views that we expose to other teams to work with, and they run queries against the views to extract data.
Recently we realized that a user had run a select * with a date range against one of the views we expose, and the query just didn't return in a timely fashion. After investigation, we decided to 'optimize' the view by converting a select subquery into a left join, something that I know normally improves query performance.
Previous view definition :
select a.date, (select name from table_b b where b.id = a.id), a.id
from table_a a
New view definition :
select a.date, b.name, a.id
from table_a a left join table_b b on a.id = b.id
We tested it with the user and his queries are now much more performant so the change was rolled out to production. A day later we realized another user had been using this view within some complex query, and his query went from running 2 hours everyday to > 7 hours or not completing at all.
So I guess my question is, how do I deal with this tuning issue, where improving one query's performance degrades another query's performance? I'm in the process of rolling back so that I can examine the two different query plans, but I'm not sure what insight I can gain from the plan differences. I checked the table statistics and they all look good.
"user did a select * with a date range ".
Date range scans are notoriously hard to tune. A plan which is great for date '2018-04-01' to date '2018-04-02' may well suck for date '2017-04-01' to date '2018-04-01'. And of course vice versa.
So what you may be suffering from here is that your user is using bind variables for the date range values. Bind variables are normally good for performance because they allow Oracle to re-use the same execution plan for all executions of the query with any value for those variables. Oracle picks that plan by looking at the values supplied on the first hard parse, which is called bind variable peeking. This is a good thing when the pertinent values are evenly distributed: we save the cost of a hard parse and use an efficient path.
However, when the data is unevenly distributed or when we are specifying ranges, we need a different strategy. The overhead of a hard parse is trivial compared to the cost of using an indexed read to retrieve 20% of the rows in a table. So you need a different approach, one which doesn't rely on bind variables. Ideally you can work with your users, understand what they're doing and help them write better queries. However, the Oracle database does have features like Adaptive Cursor Sharing, which allows the database to assess whether the cached plan is still good for new values of bind variables. This doesn't guarantee good performance but can help in situations where we have users running ad hoc queries.
" the underlying tables was partitioned by date and also indexed by date hence I believe the date range should not be an issue."
Belief is not the same as proof. If the date range is within a single partition then maybe it's not the issue. If the queried range spans several partitions then it's a potential culprit. Consider: if your table is partitioned into one day sections then a date range of date '2017-04-01' to date '2018-04-01' would scan 365 partitions. Partition pruning won't do much for you then. But if you don't think it's worth investigating that's cool.
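To put numbers on that, here is a tiny Python sketch that counts how many one-day partitions an inclusive date range touches. The helper name is made up, and it assumes exactly one partition per calendar day.

```python
from datetime import date

def daily_partitions_spanned(start, end):
    """Number of one-day partitions touched by an inclusive [start, end] date range."""
    return (end - start).days + 1

# A two-day range versus a full-year range over daily partitions.
short_range = daily_partitions_spanned(date(2018, 4, 1), date(2018, 4, 2))
year_range = daily_partitions_spanned(date(2017, 4, 1), date(2018, 4, 1))
print(short_range, year_range)
```

Counting both endpoints, the year-long range touches 366 partitions, in line with the roughly 365 mentioned above, while the short range touches only 2.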
"my general question was how to tune one thing without breaking another (that you may not be aware of)"
As I think you know already, this is not possible. The best we can hope for is to tune a query to perform optimally under the conditions we know about. If it were possible to write a query so that it executed perfectly in any scenario then all those Oracle tuning consultants would not make the fine livings that they do.

Why is my SQL query getting disproportionally slow when adding a simple string comparison?

So, I have an SQL query for MSSQL looking like this (simplified for readability):
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
WHERE STATUS NOT IN (107)
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
It works fine, but I need a comparison against another table in the inner query, so I added a left join and said comparison:
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
LEFT JOIN OTHER_TABLE ON MY_DATA.ID=OTHER_TABLE.ID //new JOIN
WHERE STATUS NOT IN (107) AND (DEPARTMENT_ID='SP' OR DEPARTMENT_ID='BL') //new AND branch
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
This query works, but is A LOT slower than the previous one. The weird thing is, commenting out the new AND-branch of the WHERE clause while leaving the JOIN as it is makes it faster again. It's as if it's not the join to another table that is slowing the query down, but the actual string comparisons... I am lost as to why this is so slow, or how I could speed it up... any advice would be appreciated!
Use an INNER JOIN. The outer join is being undone by the WHERE clause anyway:
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA d INNER JOIN
OTHER_TABLE ot
ON d.ID = ot.ID -- new JOIN
WHERE d.STATUS NOT IN (107) AND DEPARTMENT_ID IN ('SP', 'BL') -- new AND branch
GROUP BY ...
(The IN shouldn't make a difference; it is just easier to write.)
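To see why the WHERE clause undoes the outer join, here is a small sketch using Python's sqlite3 module. It assumes DEPARTMENT_ID lives in OTHER_TABLE (the question doesn't say so explicitly), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_data (id INTEGER, status INTEGER, total_time REAL);
    CREATE TABLE other_table (id INTEGER, department_id TEXT);
    INSERT INTO my_data VALUES (1, 100, 86400), (2, 100, 43200), (3, 107, 86400), (4, 100, 86400);
    INSERT INTO other_table VALUES (1, 'SP'), (2, 'BL'), (3, 'SP');  -- id 4 has no match
""")

def ids(sql):
    """Run a query and return the sorted list of ids it yields."""
    return sorted(row[0] for row in conn.execute(sql))

# LEFT JOIN without the department filter keeps the unmatched row (id 4).
left_unfiltered = ids("""
    SELECT d.id FROM my_data d
    LEFT JOIN other_table ot ON d.id = ot.id
    WHERE d.status NOT IN (107)
""")

# Filtering on a column of the right-hand table discards rows where it is NULL,
# so the LEFT JOIN now returns exactly what an INNER JOIN would.
left_filtered = ids("""
    SELECT d.id FROM my_data d
    LEFT JOIN other_table ot ON d.id = ot.id
    WHERE d.status NOT IN (107) AND ot.department_id IN ('SP', 'BL')
""")
inner_filtered = ids("""
    SELECT d.id FROM my_data d
    INNER JOIN other_table ot ON d.id = ot.id
    WHERE d.status NOT IN (107) AND ot.department_id IN ('SP', 'BL')
""")
print(left_unfiltered, left_filtered, inner_filtered)
```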
Next, if this still has slow performance, then look at the execution plans. It means that SQL Server is making a poor decision, probably on the JOIN algorithm. Normally, I fix this by forbidding nested loop joins, but there might be other solutions as well.
It's hard to say definitively what will or won't speed things up without seeing the execution plan. Also, understanding how fast you need it to be affects what steps you might want to (or not want to) consider taking.
What follows is admittedly somewhat vague, but these are a few things that came to mind when I thought about this. Take a look at the execution plan as Philip Couling suggested in that good link to get an idea where the pain points are, and of course, take these suggestions with a grain of salt.
You might consider adding some indexes to either or both of the tables. The execution plan might even give you suggestions on what could be useful, but off the top of my head, something on OTHER_TABLE.DEPARTMENT_ID probably wouldn't hurt.
You might be able to build potential new indexes as Filtered Indexes if you know those hard-coded search terms (like STATUS and DEPARTMENT_ID are always going to be the same).
You could pre-calculate some of this information if it's not changing so rapidly that you need to query it fresh on every call. This comes back to how fast you need it to go, because for just about any query, you can add columns or pre-populated lookup tables to avoid doing work at run time. For example, you could make a new bit field like IsNewOrBranch or IsStatusNot107 (both somewhat egregious steps, but things which could work). Or that might be pre-aggregating the data in the inner query ahead of time.
I know you simplified the query for our benefit, but that also makes it a little hard to know what's going on with the subquery, and the subsequent GROUP BY being performed against that subquery. There might be a way to avoid having to do two group bys.
Along the same vein, you might also look into splitting what you're doing into separate statements if SQL is having a difficult time figuring out how best to return the data. For example, you might populate a temp table or table variable with the results of your inner query, then perform your subsequent GROUP BY on that. While this approach isn't always useful, there are many times where trying to cram all the work into a single query will actually end up being worse than several individual, simple, optimized steps would be.
And as Gordon Linoff suggested, there are a number of query hints which could be used to coax the execution plan into doing things a specific way. But be careful, often that way lies madness.
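The temp-table suggestion above can be sketched like this, using Python's sqlite3 module as a stand-in (in T-SQL the first step would typically be a SELECT ... INTO #temp). The grp and sub columns and the sample rows are assumptions, since the real query elides its column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_data (grp TEXT, sub TEXT, status INTEGER, total_time REAL);
    INSERT INTO my_data VALUES
        ('A', 'x', 100, 86400), ('A', 'x', 100, 86400), ('A', 'y', 100, 43200),
        ('B', 'x', 107, 86400), ('B', 'y', 100, 0);
""")

# Original shape: an aggregate over an aggregating subquery, all in one statement.
nested_rows = conn.execute("""
    SELECT q.grp, COUNT(*) AS n, SUM(q.tdays) AS tdays
    FROM (
        SELECT grp, sub, ROUND(SUM(total_time) / 86400.0, 2) AS tdays
        FROM my_data
        WHERE status NOT IN (107)
        GROUP BY grp, sub
    ) q
    WHERE q.tdays > 0
    GROUP BY q.grp
""").fetchall()

# Step 1: materialize the inner aggregate into a temp table.
conn.execute("""
    CREATE TEMP TABLE inner_result AS
    SELECT grp, sub, ROUND(SUM(total_time) / 86400.0, 2) AS tdays
    FROM my_data
    WHERE status NOT IN (107)
    GROUP BY grp, sub
""")

# Step 2: run the outer aggregate against the much smaller temp table.
staged_rows = conn.execute("""
    SELECT grp, COUNT(*) AS n, SUM(tdays) AS tdays
    FROM inner_result
    WHERE tdays > 0
    GROUP BY grp
""").fetchall()
print(nested_rows, staged_rows)
```

Both versions produce the same result; whether the staged version is faster depends on the engine and the data, so compare the plans and timings.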
Your SQL is fine, and restricting your data with an additional AND clause should usually not make it slower.
As it happens, choosing a fast execution path is a hard problem, and SQL Server sometimes (albeit seldom) gets it wrong.
What you can do to help SQL Server find the best execution path is to:
make sure the statistics on your tables are up-to-date and
make sure that there is an "obviously suitable" index that SQL Server can use. SQL Server Management Studio will usually give you suggestions on missing indexes when selecting the "show actual execution plan" option.

Slow SQL Queries, Order Table by Date?

I have a SQL Server 2008 database that I query regularly and that has over 30 million entries (joy!). Unfortunately this database cannot be drastically changed because it is still in use for R/D.
When I query from this database, it takes FOREVER. By that I mean I haven't been patient enough to wait for results (after 2 mins I have to cancel to avoid locking the R/D department out). Even if I use a short date range (more than a few months), it is basically impossible to get any results from it. I am querying with requirements from 4 of the columns and unfortunately have to use an inner join with another table (which I've been told is very costly in terms of query efficiency, but it is unavoidable). This inner-joined table has less than 100k entries.
What I was wondering is: is it possible to organize the table so that it is ordered by date by default, to reduce the number of rows it has to search through?
If this is not possible, is there anything I can do to reduce query times? Is there any other useful information that could assist me in coming up with a solution?
I have included a sample of the query that I use:
SELECT DISTINCT N.TestName
FROM [DalsaTE].[dbo].[ResultsUut] U
INNER JOIN [DalsaTE].[dbo].[ResultsNumeric] N
ON N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
AND N.ResultsUutId = U.ResultsUutId
WHERE U.DeviceName = 'BO-32-3HK60-00-R'
AND U.StartDateTime > '2011-11-25 01:10:10.001'
ORDER BY N.TestName
Any help or suggestions are appreciated!
It sounds like the datetime column may be a text-based field, and consequently an index isn't being used?
Could you try the following to see if you have any speed improvement:
select distinct N.TestName
from [DalsaTE].[dbo].[ResultsUut] U
inner join [DalsaTE].[dbo].[ResultsNumeric] N
on N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
and N.ResultsUutId = U.ResultsUutId
where U.DeviceName = 'BO-32-3HK60-00-R'
and U.StartDateTime > cast('2011-11-25 01:10:10.001' as datetime)
order by N.TestName
It would also be worth trying a left outer join in place of the inner join, as those occasionally perform faster for no reason that I'm aware of (check that the results are unchanged, since an outer join also keeps unmatched rows).
You can add an index on your date column, which should improve your query time. You can either use a CREATE INDEX statement, or use the table designer.
Is the sole purpose of the join to provide sorting? If so, a quick thing to try would be to remove this, and see how much of a difference it makes - at least then you'll know where to focus your attention.
Finally, SQL server management studio has some useful tools such as execution plans that can help diagnose performance issues. Good luck!
There are a number of problems which may be causing delays in the execution of your query.
Indexes (except the clustered index, which by default is the primary key) do not reorder the data; they merely create an index (think phonebook) which orders a number of values and points back to the primary key.
Without seeing the type of data or the existing indexes, it's difficult, but at the very least, the following ASCENDING indexes might help:
[DalsaTE].[dbo].[ResultsNumeric] ModeDescription and ResultsUutId and TestName
[DalsaTE].[dbo].[ResultsUut] StartDateTime and DeviceName and ResultsUutId
With the indexes above, the sample query you gave can be completed without performing a single lookup on the actual table data.
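Here is a rough sketch of what "no lookup on the actual table data" looks like, using Python's sqlite3 module and EXPLAIN QUERY PLAN as a stand-in for SQL Server's execution plan. The table definition, the index name, and the choice to put the equality column (DeviceName) first are assumptions; SQL Server reports the same idea differently (an index seek with no key lookup).

```python
import sqlite3

# Hypothetical, cut-down version of ResultsUut; Notes stands in for columns the query ignores.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ResultsUut (ResultsUutId INTEGER, DeviceName TEXT, "
    "StartDateTime TEXT, Notes TEXT)"
)

QUERY = """
    SELECT ResultsUutId FROM ResultsUut
    WHERE DeviceName = 'BO-32-3HK60-00-R'
      AND StartDateTime > '2011-11-25 01:10:10.001'
"""

def plan(conn, sql):
    """Return the human-readable step descriptions from EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(conn, QUERY)

# The index holds every column the query touches, so the table itself is never read.
conn.execute(
    "CREATE INDEX idx_uut_device_start "
    "ON ResultsUut (DeviceName, StartDateTime, ResultsUutId)"
)
plan_after = plan(conn, QUERY)

print(plan_before)
print(plan_after)
```

Before the index the plan is a full table scan; afterwards SQLite reports a search using a covering index.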

TSQL Join efficiency

I'm developing an ASP.NET/C#/SQL application. I've created a query for a specific grid-view that involves a lot of joins to get the data needed. On the hosted server, the query has randomly started taking up to 20 seconds to process. I'm sure it's partly an overloaded host-server (because sometimes the query takes <1s), but I don't think the query (which is actually a view reference via a stored procedure) is at all optimal regardless.
I'm unsure how to improve the efficiency of the below query:
(There are about 1500 matching records to those joins, currently)
SELECT dbo.ca_Connections.ID,
dbo.ca_Connections.Date,
dbo.ca_Connections.ElectricityID,
dbo.ca_Connections.NaturalGasID,
dbo.ca_Connections.LPGID,
dbo.ca_Connections.EndUserID,
dbo.ca_Addrs.LotNumber,
dbo.ca_Addrs.UnitNumber,
dbo.ca_Addrs.StreetNumber,
dbo.ca_Addrs.Street1,
dbo.ca_Addrs.Street2,
dbo.ca_Addrs.Suburb,
dbo.ca_Addrs.Postcode,
dbo.ca_Addrs.LevelNumber,
dbo.ca_CompanyConnectors.ConnectorID,
dbo.ca_CompanyConnectors.CompanyID,
dbo.ca_Connections.HandOverDate,
dbo.ca_Companies.Name,
dbo.ca_States.State,
CONVERT(nchar, dbo.ca_Connections.Date, 103) AS DateView,
CONVERT(nchar, dbo.ca_Connections.HandOverDate, 103) AS HandOverDateView
FROM dbo.ca_CompanyConnections
INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID
INNER JOIN dbo.ca_Addrs ON dbo.ca_Connections.AddressID = dbo.ca_Addrs.ID
INNER JOIN dbo.ca_Companies ON dbo.ca_CompanyConnectors.CompanyID = dbo.ca_Companies.ID
INNER JOIN dbo.ca_States ON dbo.ca_Addrs.StateID = dbo.ca_States.ID
It may have nothing to do with your query and everything to do with the data transfer.
How fast does the query run in query analyzer?
How does this compare to the web page?
If you are bringing back the entire data set you may want to introduce paging, say 100 records per page.
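A minimal paging sketch using Python's sqlite3 module with LIMIT/OFFSET. The table contents and the fetch_page helper are made up for illustration; on SQL Server the equivalent would be ROW_NUMBER() on older versions or OFFSET ... FETCH on 2012 and later.

```python
import sqlite3

# Hypothetical cut-down ca_Connections table with 250 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ca_Connections (ID INTEGER PRIMARY KEY, Date TEXT)")
conn.executemany(
    "INSERT INTO ca_Connections (ID, Date) VALUES (?, ?)",
    [(i, "2009-01-01") for i in range(1, 251)],
)

PAGE_SIZE = 100

def fetch_page(conn, page_number, page_size=PAGE_SIZE):
    """Return one page of rows (page_number starts at 0), ordered by ID for stable paging."""
    return conn.execute(
        "SELECT ID, Date FROM ca_Connections ORDER BY ID LIMIT ? OFFSET ?",
        (page_size, page_number * page_size),
    ).fetchall()

# 250 rows split into pages of 100: two full pages, one partial page, then nothing.
page_sizes = [len(fetch_page(conn, n)) for n in range(4)]
first_id_on_second_page = fetch_page(conn, 1)[0][0]
print(page_sizes, first_id_on_second_page)
```

The ORDER BY matters: without a deterministic order, rows can repeat or go missing between pages.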
The first thing I normally suggest is to profile to look for potential indexes to help out. But when the problem is sporadic like this and the normal case is for the query to run in <1 sec, it's more likely due to lock contention than a missing index. That means the cause is something else in the system causing this query to take longer. Perhaps an insert or update. Perhaps another select query, one that you would normally expect to take a little longer so the extra time on its end isn't noted.
I would start with indexing, but I have a database that is a third-party application. Creating my own indexes is not an option. I read an article (sorry, can't find the reference) recommending breaking up the query into table variables or temp tables (depending on number of records) when you have multiple tables in your query (not sure what the magic number is).
Start with dbo.ca_CompanyConnections, dbo.ca_CompanyConnectors, dbo.ca_Connections. Include the fields you need. And then substitute these three joined tables with just the temp table.
Not sure what the issue is (would like to hear recommendations), but it seems that when you get over 5 tables performance drops.

Correlated query vs inner join performance in SQL Server

Let's say that you want to select all rows from one table that have a corresponding row in another one (the data in the other table is not important, only the presence of a corresponding row is important). From what I know about DB2, this kind of query performs better when written as a correlated query with an EXISTS clause rather than an INNER JOIN. Is that the same for SQL Server? Or doesn't it make any difference whatsoever?
I just ran a test query and the two statements ended up with the exact same execution plan. Of course, for just about any performance question I would recommend running the test in your own environment. With SQL Server Management Studio this is easy (or SQL Query Analyzer if you're running 2000). Just type both statements into a query window and select Query|Include Actual Execution Plan. Then run the query. Go to the results tab and you can easily see what the plans are and which one had a higher cost.
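One thing to check before comparing the two forms for speed is that they return the same rows: if the other table can hold several matching rows, a plain INNER JOIN repeats the outer row while EXISTS does not. A small sketch using Python's sqlite3 module; the orders and shipments tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE shipments (order_id INTEGER);
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO shipments VALUES (1), (1), (3);  -- order 1 has two shipments, order 2 none
""")

def ids(sql):
    """Run a query and return the list of ids in result order."""
    return [row[0] for row in conn.execute(sql)]

# Correlated EXISTS: each qualifying order appears exactly once.
exists_ids = ids("""
    SELECT o.id FROM orders o
    WHERE EXISTS (SELECT 1 FROM shipments s WHERE s.order_id = o.id)
    ORDER BY o.id
""")

# Plain INNER JOIN: order 1 appears once per matching shipment.
join_ids = ids("""
    SELECT o.id FROM orders o
    INNER JOIN shipments s ON s.order_id = o.id
    ORDER BY o.id
""")

# DISTINCT makes the join equivalent to the EXISTS form again.
distinct_join_ids = ids("""
    SELECT DISTINCT o.id FROM orders o
    INNER JOIN shipments s ON s.order_id = o.id
    ORDER BY o.id
""")
print(exists_ids, join_ids, distinct_join_ids)
```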
Odd: it's normally more natural for me to write these as a correlated query first, at which point I have to then go back and re-factor to use a join because in my experience the sql server optimizer is more likely to get that right.
But don't take me too seriously. For all I have 26K rep here and one of only 2 current sql topic-specific badges, I'm actually pretty junior in terms of sql knowledge (It's all about the volume! ;) ); certainly I'm no DBA. In practice, you will of course need to profile each method to gauge its actual performance. I would expect the optimizer to recognize what you're asking for and handle either query in the optimal way, but you never know until you check.
As everyone notes, it all boils down to the optimizer. I'd suggest writing it in whatever way feels more natural to you, then making sure the optimizer can figure out the most effective query plan (gather statistics, create an index, whatever). The SQL Server optimizer is pretty good overall, so long as you give it the information it needs to work with.
Use the join. It might not make much of a difference in performance if you have small tables, but if the "outer" table is very large then it will need to do the EXISTS sub-query for each row. If your tables are indexed on the common columns then it should be far quicker to do the INNER JOIN. BTW, if you want to find all rows that are NOT in the second table, use a LEFT JOIN and test for NULL in the second table--it is much faster than using EXISTS when you have very large tables and indexes.
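For the "rows NOT in the second table" case, the two formulations look like this. It is a sketch using Python's sqlite3 module with hypothetical orders and shipments tables; it only shows that both return the same rows, not which is faster on SQL Server, which you would need to measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE shipments (order_id INTEGER);
    INSERT INTO orders VALUES (1), (2), (3);
    INSERT INTO shipments VALUES (1), (1), (3);  -- order 2 has no shipment
""")

def ids(sql):
    """Run a query and return the list of ids in result order."""
    return [row[0] for row in conn.execute(sql)]

# Anti-join written as LEFT JOIN plus a NULL test on the right-hand table.
left_join_null_ids = ids("""
    SELECT o.id FROM orders o
    LEFT JOIN shipments s ON s.order_id = o.id
    WHERE s.order_id IS NULL
    ORDER BY o.id
""")

# The same anti-join written with NOT EXISTS.
not_exists_ids = ids("""
    SELECT o.id FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM shipments s WHERE s.order_id = o.id)
    ORDER BY o.id
""")
print(left_join_null_ids, not_exists_ids)
```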
Probably the best performance is with a join to a derived table. Exists would probably be next (and might be faster). The worst performance would be with a subquery inside the select as it would tend to run row by row instead of as a set.
However, since database performance is very dependent on the database design, I would try out all possible methods and see which is fastest in your circumstances.