Slow SQL Queries, Order Table by Date? - sql

I have a Sql-Server-2008 database that I am querying from on the regular that was over 30 million entries (joy!). Unfortunately this database cannot be drastically changed because it is still in use for R/D.
When I query from this database, it takes FOREVER. By that I mean I haven't been patient enough to wait for results (after 2 mins I have to cancel to avoid locking the R/D department out). Even if I use a short date range (more than a few months), it is basically impossible to get any results from it. I am querying with requirements from 4 of the columns and unfortunately have to use an inner-join for another table (which I've been told is very costly in terms of query efficiency -- but it unavoidable). This inner joined table has less than 100k entries.
What I was wondering, is it is possible to organize the table to have it defaultly be ordered by date to reduce the number of results it has to search through?
If this is not possible, is there anything I can do to reduce query times? Is there any other useful information that could assist me in coming up with a solution?
I have included a sample of the query that I use:
SELECT DISTINCT N.TestName
FROM [DalsaTE].[dbo].[ResultsUut] U
INNER JOIN [DalsaTE].[dbo].[ResultsNumeric] N
ON N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
AND N.ResultsUutId = U.ResultsUutId
WHERE U.DeviceName = 'BO-32-3HK60-00-R'
AND U.StartDateTime > '2011-11-25 01:10:10.001'
ORDER BY N.TestName
Any help or suggestions are appreciated!

It sounds like datetime may be a text based field and subsequently an index isn't being used?
Could you try the following to see if you have any speed improvement:
select distinct N.TestName
from [DalsaTE].[dbo].[ResultsUut] U
inner join [DalsaTE].[dbo].[ResultsNumeric] N
on N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
and N.ResultsUutId = U.ResultsUutId
where U.DeviceName = 'BO-32-3HK60-00-R'
and U.StartDateTime > cast('2011-11-25 01:10:10.001' as datetime)
order by N.TestName
It would also be worth trying changing your inner join to a left outer join as those occasionally perform faster for no conceivable reason (at least one that I'm not aware of).

you can add an index based on your date column, which should improve your query time. You can either use an alter table command, or use the table designer.
Is the sole purpose of the join to provide sorting? If so, a quick thing to try would be to remove this, and see how much of a difference it makes - at least then you'll know where to focus your attention.
Finally, SQL server management studio has some useful tools such as execution plans that can help diagnose performance issues. Good luck!

There are a number of problems which may be causing delays in the execution of your query.
Indexes (except the primary key) do not reorder the data, they merely create an index (think phonebook) which orders a number of values and points back to the primary key.
Without seeing the type of data or the existing indexes, it's difficult, but at the very least, the following ASCENDING indexes might help:
[DalsaTE].[dbo].[ResultsNumeric] ModeDescription and ResultsUutId and TestName
[DalsaTE].[dbo].[ResultsUut] StartDateTime and DeviceName and ResultsUutId
Without the indexes above, the sample query you gave can be completed without performing a single lookup on the actual table data.

Related

Oracle SQL: What is the best way to select a subset of a very large table

I have been roaming these forums for a few years and I've always found my questions had already been asked, and a fitting answer was already present.
I have a pretty generic (and maybe easy) question now though, but I haven't been able to find a thread asking the same one yet.
The situation:
I have a payment table with 10-50M records per day, a history of 10 days and hundreds of columns. About 10-20 columns are indexed. One of the indices is batch_id.
I have a batch table with considerably fewer records and columns, say 10k a day and 30 columns.
If I want to select all payments from one specific sender, I could just do this:
Select * from payments p
where p.sender_id = 'SenderA'
This runs a while, even though sender_id is also indexed. So I figure, it's better to select the batches first, then go into the payments table with the batch_id:
select * from payments p
where p.batch_id in
(select b.batch_id from batches where b.sender_id = 'SenderA')
--and p.sender_id = 'SenderA'
Now, my questions are:
In the second script, should I uncomment the Sender_id in my where clause on the payments table? It doesn't feel very efficient to filter on sender_id twice, even though it's in different tables.
Is it better if I make it an inner join instead of a nested query?
Is it better if I make it a common table expression instead of a nested query or inner join?
I suppose it could all fit into one question: What is the best way to query this?
In the worst case the two queries should run in the same time and in the best case I would expect the first query to run quicker. If it is running slower, there is some problem elsewhere. You don't need the additional condition in the second query.
The first query will retrieve index entries for a single value, so that is going to access less blocks than the second query which has to find index entries for multiple batches (as well as executing the subquery, but that is probably not significant).
But the danger as always with Oracle is that there are a lot of factors determining which query plan the optimizer chooses. I would immediately verify that the statistics on your indexed columns are up-to-date. If they are not, this might be your problem and you don't need to read any further.
The next step is to obtain a query execution plan. My guess is that this will tell you that your query is running a full-table-scan.
Whether or not Oracle choses to perform a full-table-scan on a query such as this is dependent on the number of rows returned and whether Oracle thinks it is more efficient to use the index or to simply read the whole table. The threshold for flipping between the two is not a fixed number: it depends on a lot of things, one of them being a parameter called DB_FILE_MULTIBLOCK_READ_COUNT.
This is set-up by Orale and in theory it should be configured such that the transition between indexed and full-table scan queries should be smooth. In other words, at the transition point where your query is returning enough rows to just about make a full table scan more efficient, the index scan and the table scan should take roughly the same time.
Unfortunately, I have seen systems where this is way out and Oracle flips to doing full table scans far too quickly, resulting in a long query time once the number of rows gets over a certain threshold.
As I said before, first check your statistics. If that doesn't work, get a QEP and start tuning your Oracle instance.
Tuning Oracle is a very complex subject that can't be answered in full here, so I am forced to recommend links. Here is a useful page on the parameter: reducing it might help: Why Change the Oracle DB_FILE_MULTIBLOCK_READ_COUNT.
Other than that, the general Oracle performance tuning guide is here: (Oracle) Configuring a Database for Performance.
If you are still having problems, you need to progress your investigation further and then come up with a more specific question.
EDIT:
Based on your comment where you say your query is returning 4M rows out of 10M-50M in the table. If it is 4M out of 10M there is no way an index will be of any use. Even with 4M out of 50M, it is still pretty certain that a full-table-scan would be the most efficient approach.
You say that you have a lot of columns, so probably this 4M row fetch is returning a huge amount of data.
You could perhaps consider splitting off some of the columns that are not required and putting them into a child table. In particular, if you have columns containing a lot of data (e.g., some text comments or whatever) they might be better being kept outside the main table.
Remember - small is fast, not only in terms of number of rows, but also in terms of the size of each row.
SQL is an declarative language. This means, that you specify what you like not how.
Check your indexes primary and "normal" ones...

Using temp table for sorting data in SQL Server

Recently, I came across a pattern (not sure, could be an anti-pattern) of sorting data in a SELECT query. The pattern is more of a verbose and non-declarative way for ordering data. The pattern is to dump relevant data from actual table into temporary table and then apply orderby on a field on the temporary table. I guess, the only reason why someone would do that is to improve the performance (which I doubt) and no other benefit.
For e.g. Let's say, there is a user table. The table might contain rows in millions. We want to retrieve all the users whose first name starts with 'G' and sorted by first name. The natural and more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have few questions:
Will there be any performance difference between two ways if there is no index on the first name field. If yes, which one would be better.
Will there be any performance difference between two ways if there is an index on the first name field. If yes, which one would be better.
Should not the SQL Server optimizer generate same execution plan for both the ways?
Is there any benefit in writing a verbose way from any other persective like locking/blocking?
Thanks in advance.
Reguzlarly: Anti pattern by people without an idea what they do.
SOMETIMES: ok, because SQL Server has a problem that is not resolvable otherwise - not seen that one in yeas, though.
It makes things slower because it forces the tmpddb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resoled more efficiently.
last time I saw that was like 3 years ago. We got it 3 times as fast by not being smart and using a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly - depends on data amount, but an index seek by index would contain the data in order already (as the index is ordered by content).
3: no. Obviously. Query plan optimization is statement by statement. By cutting the execution in 2, the query optimizer CAN NOT merge the join into the first statement.
4: Only if you run into a query optimizer issue or a limitation of how many tables you can join - not in that degenerate case (degenerate in a technical meaning - i.e. very simplistic). BUt if you need to join MANY MANY tables it may be better to go with an interim step.
If the field you want to do an order by on is not indexed, you could put everything into a temp table and index it and then do the ordering and it might be faster. You would have to test to make sure.
There is never any benefit of the second approach that I can think of.
It means if the data is available pre-ordered SQL Server can't take advantage of this and adds an unnecessary blocking operator and additional sort to the plan.
In the case that the data is not available pre-ordered SQL Server will sort it in a work table either in memory or tempdb anyway and adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be sub optimal. In which case I would resolve that in a different way by either improving statistics or by using hints/query rewrite to avoid the undesired plan.

SQL Server: How can a table scan be so expensive on a tiny table?

I'm looking at an execution plan from a troublesome query.
I can see that 45% of the plan is taken up doing a table scan on a table with seven (7) rows of data.
I am about to put a clustered index to cover the columns in my query on a table with seven rows and it feels...wrong. How can this part of my query take up so much of the plan given the table is so tiny?
I was reading up here and it feel it might just be becuase of non-contiguous data - there are no indexes at all on the table in question. Overall though our database is large-ish (7GB) and busy.
I'd love to know what others think - thanks!
EDIT:
The query is run very frequently and was involved in deadlock (and chosen as the victim). Right now it's taking between 300ms and 500ms to run, but will take longer when the database is busier.
The query:
select l.team1Score, l.team2Score, ls.team1ExternalID, ls.team2ExternalID, et.eventCategoryID, e.eventID, ls.statusCode
from livescoretracking l(nolock)
inner join liveScores ls (nolock) on l.liveScoreID = ls.liveScoreID
inner join db1.dbo.events e on e.gameid = ls.gameid
inner join db1.dbo.eventtype et (nolock) on e.eventTypeID = et.eventTypeID
inner join eventCategoryPayTypeMappings ecb (nolock) on ( et.eventCategoryID = ecb.eventCategoryID and e.payTypeID = ecb.payTypeID and ecb.mainEvent = 1 )
where ls.gameID = 286711 order by l.dateinserted
The problem table is the eventCategoryPayTypeMappings table - thanks!
A percentage cost is meaningless without knowing the total cost in real terms. e.g. if the query takes 1 ms to execute a 45% cost for a table scan is .45 of a milisecond which is not worth trying to optimise, if the query takes 10 seconds to execute then the 45% cost is significant and worth optimising.
A table scan on a seven row table is not expensive. Barring query hints, the query engine will use a table scan on such a small table no matter what indexes exist. Can you show us more about the query in question and the problem with the execution plan?
If there are no indexes on the table, the query engine will always have to do a table scan. There's no other way it can process the data.
Many RDBMS platforms will do a table scan on a table that small even if there are indexes. (I'm not sure about SQL Server specifically.)
I would be more concerned about the actual numbers in the query plan.
Deadlocks are usually more indicative of a resource access ordering issue than a problem with query design in particular. I would look at the other participant(s) in the deadlock and take a look at what objects each transaction had locked that were required by the other(s). If you can reorder to ensure consistent access order you may be able to avoid contention issues entirely.
It really depends how long the query takes from start to finish. 45% doesn't mean its taking a long time if the query is only taking say 10ms. All it really says is most of the time is spent doing the table scan which is understandable.
Having an index may help when the table grows and is probably not a bad idea unless you know this table is not going to grow. However you will find that adding an index to a table with 7 records makes little to no difference to performance.
A table scan on a small table is not a bad thing - If it fits in a single read into the cache the optimizer will calculate that a table scan costs less than reading through an index chain.
I would only recommend a clustered index if you want to help insure that the contents will 'tend' to be sorted that way (though you will need an explicit order by to guarantee that).

TSQL Join efficiency

I'm developing an ASP.NET/C#/SQL application. I've created a query for a specific grid-view that involves a lot of joins to get the data needed. On the hosted server, the query has randomly started taking up to 20 seconds to process. I'm sure it's partly an overloaded host-server (because sometimes the query takes <1s), but I don't think the query (which is actually a view reference via a stored procedure) is at all optimal regardless.
I'm unsure how to improve the efficiency of the below query:
(There are about 1500 matching records to those joins, currently)
SELECT dbo.ca_Connections.ID,
dbo.ca_Connections.Date,
dbo.ca_Connections.ElectricityID,
dbo.ca_Connections.NaturalGasID,
dbo.ca_Connections.LPGID,
dbo.ca_Connections.EndUserID,
dbo.ca_Addrs.LotNumber,
dbo.ca_Addrs.UnitNumber,
dbo.ca_Addrs.StreetNumber,
dbo.ca_Addrs.Street1,
dbo.ca_Addrs.Street2,
dbo.ca_Addrs.Suburb,
dbo.ca_Addrs.Postcode,
dbo.ca_Addrs.LevelNumber,
dbo.ca_CompanyConnectors.ConnectorID,
dbo.ca_CompanyConnectors.CompanyID,
dbo.ca_Connections.HandOverDate,
dbo.ca_Companies.Name,
dbo.ca_States.State,
CONVERT(nchar, dbo.ca_Connections.Date, 103) AS DateView,
CONVERT(nchar, dbo.ca_Connections.HandOverDate, 103) AS HandOverDateView
FROM dbo.ca_CompanyConnections
INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID
INNER JOIN dbo.ca_Addrs ON dbo.ca_Connections.AddressID = dbo.ca_Addrs.ID
INNER JOIN dbo.ca_Companies ON dbo.ca_CompanyConnectors.CompanyID = dbo.ca_Companies.ID
INNER JOIN dbo.ca_States ON dbo.ca_Addrs.StateID = dbo.ca_States.ID
It may have nothing to do with your query and everything to do with the data transfer.
How fast does the query run in query analyzer?
How does this compare to the web page?
If you are bringing back the entire data set you may want to introduce paging, say 100 records per page.
The first thing I normally suggest is to profile to look for potential indexes to help out. But the when the problem is sporadic like this and the normal case is for the query to run in <1sec, it's more likely due to lock contention rather than a missing index. That means the cause is something else in the system causing this query to take longer. Perhaps an insert or update. Perhaps another select query — one that you would normally expect to take a little longer so the extra time on it's end isn't noted.
I would start with indexing, but I have a database that is a third-party application. Creating my own indexes is not an option. I read an article (sorry, can't find the reference) recommending breaking up the query into table variables or temp tables (depending on number of records) when you have multiple tables in your query (not sure what the magic number is).
Start with dbo.ca_CompanyConnections, dbo.ca_CompanyConnectors, dbo.ca_Connections. Include the fields you need. And then subsitute these three joined tables with just the temp table.
Not sure what the issue is (would like to here recommendations) but seems like when you get over 5 tables performance seems to drop.

Need tips for optimizing SQL Query using a JOIN

The Query I'm writing runs fine when looking at the past few days, once I go over a week it crawls (~20min). I am joining 3 tables together. I was wondering what things I should look for to make this run faster. I don't really know what other information is needed for the post.
EDIT: More info: db is Sybase 10. Query:
SELECT a.id, a.date, a.time, a.signal, a.noise,
b.signal_strength, b.base_id, b.firmware,
a.site, b.active, a.table_key_id
FROM adminuser.station AS a
JOIN adminuser.base AS b
ON a.id = b.base_id
WHERE a.site = 1234 AND a.date >= '2009-03-20'
I also took out the 3rd JOIN and it still runs extremely slow. Should I try another JOIN method?
I don't know Sybase 10 that well, but try running that query for say 10-day period and then 10 times, for each day in a period respectively and compare times. If the time in the first case is much higher, you've probably hit the database cache limits.
The solution is than to simply run queries for shorter periods in a loop (in program, not SQL). It works especially well if table A is partitioned by date.
You can get a lot of information (assuming you're using MSSQL here) by running your query in SQL Server Management Studio with the Include Actual Execution Plan option set (in the Query menu).
This will show you a diagram of the steps that SQLServer performs in order to execute the query - with relative costs against each step.
The next step is to rework the query a little (try doing it a different way) then run the new version and the old version at the same time. You will get two execution plans, with relative costs not only against each step, but against the two versions of the query! So you can tell objectively if you are making progress.
I do this all the time when debugging/optimizing queries.
Make sure you have indexes on the foreign keys.
It sounds more like you have a memory leak or aren't closing database connections in your client code than that there's anything wrong with the query.
[edit]
Nevermind: you mean quering over a date range rather than the duration the server has been active. I'll leave this up to help others avoid the same confusion.
Also, it would help if you could post the sql query, even if you need to obfuscate it some first, and it's a good bet to check if there's an index on your date column and the number of records returned by the longer range.
You may want to look into using a PARTITION for the date ranges, if your DB supports it. I've heard this can help significantly.
Grab the book "Professional SQL Server 2005 Performance Tuning" its pretty great.
You didn't mention your database. If it's not SQL Server, the specifics of how to get the data might be different, but the advice is fundamentally the same.
Look at indexing, for sure, but the first thing to do is to follow Blorgbeard's advice and scan for execution plans using Management Studio (again, if you are running SQL Server).
What I'm guessing you'll see is that for small date ranges, the optimizer picks a reasonable query plan, but that when the date range is large, it picks something completely different, likely involving either table scans or index scans, and possibly joins that lead to very large temporary recordsets. The execution plan analyzer will reveal all of this.
A scan means that the optimizer thinks that grinding over the whole table or the whole index is cheaper for what you are trying to do than seeking specific values.
What you eventually want to do is get indexes and the syntax of your query set up such that you keep index seeks in the query plan for your query regardless of the date range, or, failing that, that the scans you require are filtered as well as you can manage to minimize temporary recordset size and thereby avoid excessive reads and I/O.
SELECT
a.id, a.date, a.time, a.signal, a.noise,a.site, b.active, a.table_key_id,
b.signal_strength, b.base_id, b.firmware
FROM
( SELECT * FROM adminuser.station
WHERE site = 1234 AND date >= '2009-03-20') AS a
JOIN
adminuser.base AS b
ON
a.id = b.base_id
Kind of rewrote the query, so as to first filter the desired rows then perform a join rather than perform a join then filter the result.
Rather than pulling * from the sub-query you can just select the columns you want, which might be little helpful.
May be this will of little help, in speeding things.
While this is valid in MySql, I am not sure of the sysbase syntax though.