Avoiding full table scan - SQL

I have the following query, which obtains transactions from the transactions table and the transaction detail table. Both tables have a large number of entries, so this query takes a while to return results.
SELECT * FROM transactions t LEFT JOIN transac_detail tidts ON (tidts.id_transac = t.id);
However, I'm more worried by the fact that, according to the explain plan, Oracle does a full table scan on both tables, even though t.id and tidts.id_transac are both indexed.
Is there any way to optimize this without touching table structure?

I don't think it's true that this SQL would necessarily be best served by full table scans. That would be most obviously true where there is a foreign key between the two tables, but even then there could be exceptions.
I think the key question is this: "what proportion of the rows from each table are expected to be included in the result set?". If the answer is "100% from each" then you have a clear case for full table scans (and a hash join).
However consider the case where tables A and B are joined, with table A containing 5 rows with a foreign key to table B (the parent) containing a million rows. Obviously here you'd look for a full scan of A with a nested loop join to B (the join column on table B would be indexed because it would have to be a primary or unique key).
In the OP's case, though, it looks like you'd expect 100% of rows to be returned from each table. I'd expect to see a full scan of both tables, and a hash join with TRANSACTIONS (probably the smaller table) being accessed first and built into the hash table. This would be the optimum join method, and I'd just be looking out for a situation where TRANSACTIONS is too big for a single-pass hash join. If the join spills to disk, that could be a performance worry, and you'd have to look at increasing the memory allocation or equi-partitioning the two tables to reduce the memory requirement.

Since the query as given just returns everything anyway, a full table scan may actually be the fastest way of arriving at the final result. Since I/O is so expensive compared to CPU time, it may be more efficient to pull everything into memory and make the final join there than to do an index seek in a loop over one table.
To determine whether the query could actually run any faster, you can try the following approaches:
look at the query plan on only a subset of the data (for example, a range of ids)
try the query on subsets of varying sizes and see what kind of curve you're plotting
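A hedged sketch of the first approach in Oracle (the id range here is invented; substitute something representative of your data):
EXPLAIN PLAN FOR
SELECT * FROM transactions t
LEFT JOIN transac_detail tidts ON (tidts.id_transac = t.id)
WHERE t.id BETWEEN 1 AND 10000; -- hypothetical subset
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);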

You don't have a WHERE clause, therefore Oracle is thinking that since it has to return all records from both tables a full table scan will be most efficient.
If you would add a WHERE clause that uses the index, I think you will find that EXPLAIN PLAN will no longer use a full table scan.
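For example (the literal id is purely illustrative):
SELECT * FROM transactions t
LEFT JOIN transac_detail tidts ON (tidts.id_transac = t.id)
WHERE t.id = 12345; -- a selective predicate on the indexed column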

Full table scans are not necessarily bad - it appears that you're not limiting the result set in any way and this might be the most efficient way to execute the query. You can always verify this by using an index hint and determining the resulting performance change.
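A sketch of that check in Oracle; the index name transactions_pk is an assumption, so substitute the real one:
SELECT /*+ INDEX(t transactions_pk) */ *
FROM transactions t
LEFT JOIN transac_detail tidts ON (tidts.id_transac = t.id);
-- compare the timing and plan of this hinted query against the unhinted one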

Related

SQL : Can WHERE clause increase a SELECT DISTINCT query's speed?

So here's the specific situation: I have primary unique indexed keys set on each entry in the database, but each row has a secondID referring to an attribute of the entry, and as such the secondIDs are not unique. There is also another attribute of these rows, let's call it isTitle, which is NULL by default, but each group of entries with the same secondID has at least one entry whose isTitle value is 1.
Considering the conditions above, would a WHERE clause increase the processing speed of the query or not? See the following:
SELECT DISTINCT secondID FROM table;
vs.
SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
EDIT:
The first query, without the WHERE clause, is faster, but could someone explain to me why? Algorithmically the process should be faster with only one 'if' in the loop, no?
In general, to benchmark query performance, you use statements that give you the execution plan of the query they receive as input (every small step that the engine performs to resolve your request).
You don't mention your database engine (e.g. PostgreSQL, SQL Server, MySQL), but in PostgreSQL, for example, the query is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question: since isTitle is not indexed, I think the first thing the engine will do is a full scan of the table to check that attribute and then perform the SELECT. Hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on the isTitle column. In that scenario, the query with the WHERE clause will become faster.
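A minimal sketch, where my_table stands in for the real table name (which presumably isn't literally table):
CREATE INDEX idx_my_table_istitle ON my_table (isTitle);
-- or, to let the engine answer the filtered DISTINCT from the index alone:
CREATE INDEX idx_my_table_istitle_secondid ON my_table (isTitle, secondID);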
This is a very hard question to answer, particularly without specifying the database. Here are three important considerations:
Will the database engine use the index on secondID for select distinct? Any decent database optimizer should, but that doesn't mean that all do.
How wide is the table relative to the index? That is, is scanning the index really that much faster than scanning the table?
What is the ratio of rows with isTitle = 1 to all rows with the same value of secondID?
For the first query, there are essentially two ways to process the query:
Scan the index, taking each unique value as it comes.
Scan the table, sort or hash the table, and choose the unique values.
If it is not obvious, (1) is much faster than (2), except perhaps in trivial cases where there are a small number of rows.
For the second query, the only real option is:
Scan the table, filter out the non-matching values, sort or hash the table, and choose the unique values.
The key issues here are how much data needs to be scanned and how much is filtered out. It is even possible -- if you had, say, zillions of rows per secondID, no additional columns, and a small number of distinct values -- that this might be comparable to or slightly faster than (1) above. There is a little overhead for scanning an index, and sorting a small amount of data is often quite fast.
And, this method is almost certainly faster than (2).
As mentioned in the comments, you should test the queries on your system with your data (use a reasonable amount of data!). Or, update the table statistics and learn to read execution plans.
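In PostgreSQL (mentioned above), for instance, that combination would look like this; my_table is a placeholder:
ANALYZE my_table; -- refresh the planner's statistics
EXPLAIN ANALYZE SELECT DISTINCT secondID FROM my_table;
EXPLAIN ANALYZE SELECT DISTINCT secondID FROM my_table WHERE isTitle = 1;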

Performance for big query in SQL Server view

I have a big query for a view that takes a couple of hours to run, and I feel like it may be possible to work on its performance "a bit".
The problem is that I am not sure what I should do. The query SELECTs 39 values, LEFT OUTER JOINs 25 tables, and each table can have up to a couple of million rows.
Any tip is good. Is there a good way to attack this problem? I tried to look at the actual execution plan on a test with less data (it took about 10 minutes to run), but it's crazy big. Are there any general things I could do to make this faster? Do I have to tackle one small part at a time?
Maybe there is just one join that slows down everything? How do I detect it? So, in short: how do I work on a query like this?
As I said, all feedback is good. If there is more information I should show, tell me!
The query looks something like this:
SELECT DISTINCT
A.something,
A.somethingElse,
B.something,
C.somethingElse,
ISNULL(C.somethingElseElse, ''),
C.somethingElseElseElse,
CASE WHEN *** THEN D.something ELSE 0 END,
E.something,
...
U.something
FROM
TableA A
JOIN
TableB B on ...
JOIN
TableC C on ...
JOIN
TableD D on ...
JOIN
TableE E on ...
JOIN
TableF F on ...
JOIN
TableG G on ...
...
JOIN
TableU U on ...
Break your problem into manageable pieces. If the execution plan is too large for you to analyze, start with a smaller part of the query, check its execution plan and optimize it.
There is no general answer on how to optimize a query, since there is a whole bunch of possible reasons why a query can be slow. You have to check the execution plan.
Generally the most promising ways to improve performance are:
Indexing:
When you see a Clustered Index Scan or - even worse (because then you don't have a clustered index) - a Table Scan in your query plan for a table that you join, you need an index for your JOIN predicate. This is especially true if you have tables with millions of entries and you select only a small subset of those entries. Also check the index suggestions in the execution plan.
You see that the index works when your Clustered Index Scan turns into an Index Seek.
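A hedged sketch; the table and column names are placeholders for whatever your JOIN predicate actually uses:
CREATE NONCLUSTERED INDEX IX_TableB_JoinKey ON dbo.TableB (JoinKey);
-- rerun the query and check whether the scan on TableB has become an Index Seek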
Index includes:
You are probably displaying columns from your joined tables that are different from the fields you use to join (otherwise, why would you need the join?). SQL Server needs to get the fields you need from the table, which you see in the execution plan as a Key Lookup.
Since you are taking 39 values from 25 tables, there will be very few fields per table that you need to get (mostly one or two). SQL Server needs to load entire pages of the respective table to get the values from them.
In this case, you should INCLUDE the column(s) you want to display in your index to avoid the key lookups. This comes at the cost of increased index size, but considering you only include a few columns, that cost should be negligible compared to the size of your tables.
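A sketch with placeholder names:
CREATE NONCLUSTERED INDEX IX_TableC_JoinKey
ON dbo.TableC (JoinKey) -- the column(s) from the JOIN predicate
INCLUDE (somethingElse, somethingElseElse); -- the displayed columns, to avoid Key Lookups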
Checking views that you join:
When you join VIEWs you should be aware that it basically means an extension to your query (which means also of the execution plan). Do the same performance optimizations for the view as you do for your main query. Also, check if you join tables in the view that you already join in the main query. These joins might be unnecessary.
Indexed views (maybe):
In general, you can add indexes to views you are joining to your query or create one or more indexed views for parts of your query. There are some caveats though:
Indexed views take storage space in your DB, because you store parts of the data multiple times.
There are a lot of restrictions to indexed views, most notably in your case that OUTER JOINs are forbidden. If you can transform at least some of your OUTER JOINs to INNER JOINs this might be an option.
When you join indexed views, don't forget to use WITH(NOEXPAND) in your join, otherwise they might be ignored.
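A rough sketch of such a view; all names are invented, and the unique clustered index assumes the join produces one row per a.ID:
CREATE VIEW dbo.vwPart WITH SCHEMABINDING AS -- SCHEMABINDING and two-part names are required
SELECT a.ID, a.something, b.somethingElse
FROM dbo.TableA a
JOIN dbo.TableB b ON b.AID = a.ID; -- INNER JOIN only; OUTER JOINs are not allowed
GO
CREATE UNIQUE CLUSTERED INDEX IX_vwPart ON dbo.vwPart (ID);
GO
SELECT m.something, v.somethingElse
FROM dbo.TableM m
JOIN dbo.vwPart v WITH (NOEXPAND) ON v.ID = m.SomeID; -- NOEXPAND so the index is considered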
Partitioned tables (maybe):
If you are running on the Enterprise Edition of SQL Server, you can partition your tables. That can be useful if the rows you join are always selected from a small subset of the available rows. You can make a partition for this subset and increase performance.
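A rough sketch (boundary values and names are invented, and partitioning by date only helps if your queries filter on that column):
CREATE PARTITION FUNCTION pfByYear (datetime)
AS RANGE RIGHT FOR VALUES ('2019-01-01', '2020-01-01');
CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);
-- a table (re)built ON psByYear(CreatedDate) is then partitioned by CreatedDate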
Summary:
Divide and conquer. Analyze your query bit by bit to optimize it. The most promising options are indexes and index includes. If you still have trouble, go from there.

Estimated Subtree Cost Wildly Off, Terrible Optimization

I am joining a table that has two record id fields (record1, record2) to a view twice (once on each record) and selecting the top 1000. The view consists of several rather large tables, and its id field is a string concatenation of their respective Ids (this was necessary for some third-party software that requires a unique ID for the view; row numbering was abysmally slow). There is also a WHERE clause in the view calling a function that compares dates.
The estimated execution plan produces a "No Join Predicate" warning unless I use OPTION (FORCE ORDER). With forced ordering, the execution plan has multiple nodes displaying 100% cost. In both cases, the estimated subtree cost at the endpoint is thirteen orders of magnitude smaller than that of just one of its nodes (it's doing a lot of nested loop joins, with CPU costs as high as 35927400000000).
What is going on here with the numbers in the execution plan? And why is SQL Server having such a hard time optimizing the query?
Simply adding an index to the view on the concatenated string and using the NOEXPAND table hint fixed the problem entirely. It ran in all of 12 seconds. But why did SQL Server stumble so badly (even requiring the NOEXPAND hint after I added the index)?
Running SQL Server 2008 SP1 with CU 8.
The View:
SELECT
dbo.fnGetCombinedTwoPartKey(N.NameID,A.AddressID) AS NameAddressKey,
[other fields]
FROM
[7 joined tables]
WHERE dbo.fnDatesAreOverlapping(N.dtmValidStartDate,N.dtmValidEndDate,A.dtmValidStartDate,A.dtmValidEndDate) = 1
The Query
SELECT TOP 1000
vw1.strFullName,
vw1.strAddress1,
vw1.strCity,
vw2.strFullName,
vw2.strAddress1,
vw2.strCity
FROM tblMatches M
JOIN vwImportNameAddress vw1 ON vw1.NameAddressKey = M.Record1
JOIN vwImportNameAddress vw2 ON vw2.DetailAddressKey = M.Record2
Looks like you're already pretty close to the explanation. It's because of this:
The view consists of several rather large tables, and its id field is a string concatenation of their respective Ids...
This creates a non-sargable join predicate condition, and prevents SQL server from using any of the indexes on the base tables. Thus, the engine has to perform a full scan of all the underlying tables for each join (two in your case).
Perhaps in order to avoid doing several full table scans (one for each table, multiplied by the number of joins), SQL Server has decided that it will be faster to simply use the cartesian product and filter afterward (hence the "no join predicate" warning). When you FORCE ORDER, it dutifully performs all of the full scans and nested loops that you originally asked it for.
I do agree with some of the comments that this view sits on top of a problematic data model, but the short-term workaround, as you've discovered, is to index the computed ID column in the view, which makes the join sargable again because the generated IDs are now materialized and indexed.
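For completeness, the workaround described would look roughly like this (the index name is invented, and the view must meet the indexed-view requirements, e.g. SCHEMABINDING and deterministic functions):
CREATE UNIQUE CLUSTERED INDEX IX_vwImportNameAddress_Key
ON dbo.vwImportNameAddress (NameAddressKey);
-- then hint NOEXPAND so the materialized rows are used:
SELECT TOP 1000 vw1.strFullName, vw2.strFullName
FROM tblMatches M
JOIN vwImportNameAddress vw1 WITH (NOEXPAND) ON vw1.NameAddressKey = M.Record1
JOIN vwImportNameAddress vw2 WITH (NOEXPAND) ON vw2.NameAddressKey = M.Record2;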
Edit: I also missed this on the first read-through:
WHERE dbo.fnDatesAreOverlapping(N.dtmValidStartDate,N.dtmValidEndDate,A.dtmValidStartDate,A.dtmValidEndDate) = 1
This, again, is a non-sargable predicate which will lead to poor performance. Wrapping any columns in a UDF will cause this behaviour. Indexing the view also materializes it, which may also factor into the speed of the query; without the index, this predicate has to be evaluated every time and forces a full scan on the base tables, even without the composite ID.
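If the function body is just the standard interval-overlap test (an assumption; only the OP can confirm what fnDatesAreOverlapping does), the view's WHERE clause could be inlined like this, keeping the predicate visible to the optimizer:
-- two ranges overlap when each starts no later than the other ends
WHERE N.dtmValidStartDate <= A.dtmValidEndDate
AND A.dtmValidStartDate <= N.dtmValidEndDate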
SQL Server would have to parse your function (fnGetCombinedTwoPartKey) to determine which columns are fetched to create the result column. It can't, so it's going to assume all columns are necessary. If your indexes are covering indexes, then your estimate is going to be wrong.

Question on how to read a SQL Execution plan

I have executed a query and included the Actual Execution Plan. There is one Hash Match that is of interest to me because its subtree uses an Index Scan instead of an Index Seek. When I mouse over this Hash Match, there is a section called "Probe Residual". I had assumed that this is whatever values I am joining on. Am I correct here, or is there a better explanation of what that means?
The second question I had is regarding the indexes it uses. In my example I am pretty sure this particular join is joining on two columns. The index that it is Scanning has both of these columns in it as well as another column that is not used in the join. I was under the impression that this would result in an Index Seek rather than a Scan. Am I mistaken on this?
A Hash Join will generally (always?) use a scan or at least a range scan. A hash join works by scanning both left and right join tables (or a range in the tables) and building an in-memory hash table that contains all values 'seen' by the scans.
What happened in your case is this: the QO noticed that it can obtain all the values of a column C from a non-clustered index that happens to contain this column (as a key or as an included column). Being a non-clustered index, it is probably fairly narrow, so the total amount of IO needed to scan the entire non-clustered index is not excessive. The QO also considered that the system has enough RAM to store a hash table in memory. When it compared the cost of this query (a scan of a non-clustered index end-to-end for, say, 10,000 pages) with the cost of a nested loop that used seeks (say, 5,000 probes at 2-3 pages each), the scan won as requiring less IO. Of course, this is largely speculation on my part, but I'm trying to present the case from the QO's point of view, and the plan is likely optimal.
Factors that contributed to this particular plan choice would be:
a large number of estimated candidates on the right side of the join
availability of the join column in a narrow non-clustered index for the left side
plenty of RAM
For a large estimate of the number of candidates, the only better choice than the hash join is the merge join, and that one requires its inputs to be presorted. If the left side can offer an access path that guarantees an order on the joined column and the right side has a similar possibility, then you may end up with the merge join, which is the fastest join.
This blog post will probably answer your first question.
As for your second, index scans might be selected by the optimizer in a number of situations. Off the top of my head:
If the index is very small
If most of the rows in the index will be selected by the query
If you are using functions in the where clause of your query
For the first two cases, it's more efficient to do a scan, so the optimizer chooses it over a seek. For the third case, the optimizer has no choice.
1/ A Hash Match means that it takes a hash of the columns used in an equality join, but it needs to include all the other columns involved in the join (for >, etc.) so that they can be checked too. This is where residual columns come in.
2/ An Index Seek can be done if the engine can go straight to the rows you want. Perhaps you're applying a calculation to the columns and joining on that? Then it will use the index as a smaller version of the data, but will still need to check every row (applying the calculation to each one).
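For example (table and column names are hypothetical):
-- applying a function to the column defeats the index: scan
SELECT * FROM dbo.Orders WHERE YEAR(OrderDate) = 2020;
-- the same filter as a range on the bare column allows a seek
SELECT * FROM dbo.Orders WHERE OrderDate >= '2020-01-01' AND OrderDate < '2021-01-01';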
Check out those excellent articles on execution plans on simple-talk.com:
Execution Plan Basics
SQL Server Execution Plans
Graphical Execution Plans for simple SQL queries
Understanding more complex execution plans
They also have a free e-book SQL Server execution plans for download.

Does limiting a query to one record improve performance

Will limiting a query to one result record improve performance in a large(ish) MySQL table if the table only has one matching result?
for example
select * from people where name = "Re0sless" limit 1
if there is only one record with that name? And what about if name were the primary key / set to unique? And is it worth updating the query, or will the gain be minimal?
If the column has
a unique index: no, it's no faster
a non-unique index: maybe, because it will prevent sending any additional rows beyond the first matched, if any exist
no index: sometimes
if 1 or more rows match the query, yes, because the full table scan will be halted after the first row is matched.
if no rows match the query, no, because it will need to complete a full table scan
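You can check which case applies with EXPLAIN, e.g.:
EXPLAIN SELECT * FROM people WHERE name = 'Re0sless' LIMIT 1;
-- a listed key with type const/ref means an index is used; type ALL means a full table scan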
If you have a slightly more complicated query, with one or more joins, the LIMIT clause gives the optimizer extra information. If it expects to match two tables and return all rows, a hash join is typically optimal. A hash join is a type of join optimized for matching large numbers of rows.
Now if the optimizer knows you've passed LIMIT 1, it knows that it won't be processing large amounts of data. It can revert to a loop join.
Based on the database (and even database version) this can have a huge impact on performance.
To answer your questions in order:
1) Yes, if there is no index on name. The query will end as soon as it finds the first record. Take off the LIMIT and it has to do a full table scan every time.
2) No. Primary/unique keys are guaranteed to be unique. The query will stop running as soon as it finds the row.
I believe the LIMIT is something done after the data set is found and the result set is being built up so I wouldn't expect it to make any difference at all. Making name the primary key will have a significant positive effect though as it will result in an index being made for the column.
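For instance, in MySQL (assuming name has no duplicates, and no NULLs if it's to be the primary key):
ALTER TABLE people ADD PRIMARY KEY (name);
-- or, if another primary key already exists:
ALTER TABLE people ADD UNIQUE KEY uq_people_name (name);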
If "name" is unique in the table, then there may still be a (very very minimal) gain in performance by putting the limit constraint on your query. If name is the primary key, there will likely be none.
Yes, you will notice a performance difference when dealing with the data. One record takes up less space than multiple records. Unless you are dealing with many rows, the difference won't be large, but once you run the query, the data has to be displayed back to you or handled programmatically, which is costly. Either way, one record is easier to deal with than multiple.