INNER JOINs with WHERE on the joined table - SQL

Let's say we have
SELECT * FROM A INNER JOIN B ON [....]
Assuming A has 2 rows and B contains 1M rows including 2 rows linked to A:
B will be scanned only once with "actual # of rows" of 2 right?
If I add a WHERE on table B:
SELECT * FROM A INNER JOIN B ON [....] WHERE B.Xyz > 10
The WHERE will actually be executed before the join... so if the WHERE returns 1000 rows, the "actual # of rows" of B will be 1000.
I don't get it - shouldn't it be <= 2?
What am I missing? Why does the optimiser proceed that way?
(SQL 2008)
Thanks

The optimizer will proceed whichever way it thinks is faster. That means if the Xyz column is indexed but the join column is not, it will likely do the xyz filter first. Or if your statistics are bad so it doesn't know that the join filter would pare B down to just two rows, it would do the WHERE clause first.

It's based entirely on what indexes are available for the optimizer to use. Also, there is no reason to believe that the db engine will execute the WHERE before another part of the query. The query optimizer is free to execute the query in any order it likes as long as the correct results are returned. Again, the way to properly optimize this type of query is with strategically placed indexes.
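For instance (a sketch only - the join column name AId is an assumption, since the ON clause was elided in the question), indexes along these lines give the optimizer good options either way:

CREATE INDEX IX_B_AId ON B (AId) INCLUDE (Xyz);  -- supports the join and covers the WHERE column
CREATE INDEX IX_B_Xyz ON B (Xyz);                -- supports filtering on Xyz first

With both in place, the plan should seek on whichever side the statistics say is cheaper.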

The "scanned only once" is a bit misleading. A table scan is a horrendously expensive thing in SQL Server. At least up to SS2005, a table scan requires a read of all rows into a temporary table, then a read of the temporary table to find rows matching the join condition. So in the worst case, your query will read and write 1M rows, then try to match 2 rows to 1M rows, then delete the temporary table (that last bit is probably the cheapest part of the query). So if there are no usable indexes on B, you're just in a bad place.
In your second example, if B.Xyz is not indexed, the full table scan happens and there's a secondary match from 2 rows to 1000 rows - even less efficient. If B.Xyz is indexed, there should be an index lookup and a 2:1000 match - much faster & more efficient.
'course, this assumes the table stats are relatively current and no options are in effect that change how the optimizer works.
EDIT: is it possible for you to "unroll" the A rows and use them as a static condition in a no-JOIN query on B? We've used this in a couple of places in our application where we're joining small tables (<100 rows) to large (> 100M rows) ones to great effect.
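As a rough sketch of that "unrolling" (the join column name and key values here are assumptions, since the real ON clause was elided): instead of

SELECT * FROM A INNER JOIN B ON B.AId = A.Id WHERE B.Xyz > 10

you look up the handful of A keys first and run

SELECT * FROM B WHERE B.AId IN (123, 456) AND B.Xyz > 10

so B is only probed with a couple of literal key values and the optimizer has nothing left to estimate on the A side.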

Related

SQL Server filtered index "Estimated number of rows to be read" is size of full table

I have a table in an Azure SQL database with ~2 million rows. On this table I have two pairs of columns that I want to filter for null - so I have a filtered index on each pair, checking for null on each column.
One of the indexes appears to be about twice the size of the other (~400,000 vs ~800,000 rows).
However, it seems to take about 10x as long to query.
I'm running the exact same query on both:
SELECT DomainObjectId
FROM DomainObjects
INNER JOIN
    (SELECT <id1> AS Id
     UNION ALL
     SELECT <id2> ... etc.
    ) AS DomainObjectIds ON DomainObjectIds.Id = DomainObjects.DomainObjectId
WHERE C_1 IS NULL
AND C_2 IS NULL
Where C_1/C_2 are the columns with the filtered index (and get replaced with other columns in my other query).
The query plans both involve an Index Seek - but in the fast one, it's 99% of the runtime. In the slow case, it's a mere 50% - with the other half spent filtering (which seems suspect, given the filtering should be implicit from the filtered index), and then joining to the queried IDs.
In addition, the "estimated number of rows to be read" for the index seek is ~2 million, i.e. the size of the full table. Based on that, it looks like what I'd expect a full table scan to look like. I've checked the index sizes and the "slow" one only takes up twice the space of the "fast" one, which implies it's not just been badly generated.
I don't understand:
(a) Why so much time is spent re-applying the filter from the index
(b) Why the estimated row count is so high for the "slow" index
(c) What could cause these two queries to be so different in speed, given the similarities in the filtered indexes. I would expect the slow one to be twice as slow, based purely on the number of rows matching each filter.
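For reference, filtered indexes of the kind described above look roughly like this, and the per-index row counts can be cross-checked from the catalog views (index and column names here are placeholders rather than the actual schema):

CREATE INDEX IX_DomainObjects_C1_C2_Null
ON DomainObjects (DomainObjectId)
WHERE C_1 IS NULL AND C_2 IS NULL;

SELECT i.name, p.rows
FROM sys.indexes i
JOIN sys.partitions p
  ON p.object_id = i.object_id AND p.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('DomainObjects');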

Two very similar select statements, different performance

I've just come across some weird performance differences.
I have two selects:
SELECT s.dwh_end_date,
t.*,
'-1' as PROMOTION_DROP_EMP_CODE,
trunc(sysdate +1) as PROMOTION_END_DATE,
'K01' as PROMOTION_DROP_REASON,
-1 as PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
Which takes approximately 20 seconds.
And this one:
SELECT s.dwh_end_date,
s.dwh_product_key,
s.promotion_expire_date,
s.PROMOTION_DROP_EMP_CODE,
s.PROMOTION_END_DATE,
s.PROMOTION_DROP_REASON,
s.PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
That takes approximately 400 seconds.
They are basically the same - it's just to verify that I've updated my data correctly (the first select is used to update the FCT tables, the second select is to make sure everything was updated correctly).
The only difference between these two selects is the columns I select. (The STG table has two columns - dwh_p_key and prom_expire_date.)
First select explain plan
Second select explain plan
What can cause this weird behaviour?..
The FCT table has a UNIQUE index on (dwh_product_key, dwh_end_date) and is partitioned by dwh_end_date (250 million records); the STG table doesn't have any indexes (and it's only 15k records).
Thanks in advance.
The plans are not exactly the same. The first query uses a fast full scan of the index on fct_customer_services and doesn't need to access any blocks from the actual table, since you only refer to the two indexed columns.
The second query does have to access the table blocks to get the other unindexed column values. It's doing a full table scan - slower and more expensive than a full index scan. The optimiser doesn't see any improvement from using the index and then accessing specific table rows, presumably because the cardinality is too high - it would need to access too many table rows to save any effort by hitting the index first; doing so would be even slower.
So the second query is slower because it has to read the whole table from disk/cache rather than just the whole index, and the table is much larger than the index. You can look at the segments assigned to both objects (index and table) to see the ratio of their sizes.
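For example, using the standard Oracle dictionary views (the index name here is an assumption - substitute the actual one):

SELECT segment_name, SUM(bytes) / 1024 / 1024 AS size_mb
FROM user_segments
WHERE segment_name IN ('FCT_CUSTOMER_SERVICES', 'FCT_CUSTOMER_SERVICES_UK')
GROUP BY segment_name;

The SUM/GROUP BY is there because the table is partitioned, so it has one segment per partition.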

How to speed up a clustered index scan while selecting all fields on a range of rows or all the rows

I have a table
Books(BookId, Name, ...... , PublishedYear)
I do have about 30 fields in my Books table, where BookId is the primary key (Identity column). I have about 2 million records for this table.
I know select * is an evil performance killer...
I have a situation where I need to select a range of rows, or all the rows, with all the columns.
Select * from Books;
This query takes more than 2 seconds to scan through the data pages and get all the records. Checking the execution plan, it still uses a clustered index scan.
Obviously 2 seconds may not be that bad; however, when this table has to be joined with other tables in a batch execution, it takes over 15 minutes (there are no duplicate records in the final result at completion, as the counts match). The join criteria are pretty simple and yield no duplication.
Excluding this table alone, the batch execution completes in sub-second time.
Is there a way to optimize this, given that I will have to select all the columns? :(
Thanks in advance.
I've just run a batch against my developer instance, one SELECT specifying all columns and one using *. There is no evidence (nor should there be) of any difference aside from the raw parsing of my input. If I remember correctly, that old saying really means: do not SELECT columns you are not using - they use up resources without benefit.
When you try to improve performance in your code, always check your assumptions; they might only apply to some older version (of SQL Server, etc.) or to some other method.
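One straightforward way to check this particular assumption in SQL Server is to compare the I/O and timing statistics yourself (a sketch against the Books table from the question):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT * FROM Books;
SELECT BookId, Name, /* ...all 30 columns listed explicitly... */ PublishedYear FROM Books;

If the explicit list really covers every column, the logical reads and CPU time reported for the two statements should be practically identical.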

Oracle Performance with Index

Let's say I have a table TABLE1 that consists of millions of records.
The table has COLUMN A, B and C.
I have a composite index on A and B.
C is not indexed at all.
After that I run the two queries below:
1. Select * from TABLE1 where A='something' and B='something'
2. Select * from TABLE1 where A='something' and B='something' and C='something'
I understand that both queries will use the index that I have specified. Based on my understanding, the performance of both queries should be the same. However, is there any possibility that one query has better performance / runs faster than the other? Why?
The queries will not necessarily use the index. Oracle makes a decision to use an index for a query based on the "selectivity" of the index. So, if a = 'something' and b = 'something' are true for 90% of the rows, then a full table scan is faster than using the index.
In both cases, the selectivity of the index would be the same (assuming the comparison values are the same). So both should be using the same execution plan.
Even so, the second query would typically run a bit faster, because it would typically have a smaller result set. The size of the result set is another factor in query performance.
By the way, both could take advantage of an index on table1(A, B, C).
Also, on a "cold" database (one just started with no queries run), the second should run faster for the simple reason that some or all of the data will have already been loaded into page and index caches.

Is there any general rule on SQL query complexity Vs performance?

1) Are SQL query execution times O(n) compared to the number of joins, if indexes are not used? If not, what kind of relationship are we likely to expect? And can indexing improve the actual big-O time-complexity, or does it only reduce the entire query time by some constant factor?
Slightly vague question, I'm sure it varies a lot but I'm talking in a general sense here.
2) If you have a query like:
SELECT T1.name, T2.date
FROM T1, T2
WHERE T1.id=T2.id
AND T1.color='red'
AND T2.type='CAR'
Am I right assuming the DB will do single table filtering first on T1.color and T2.type, before evaluating multi-table conditions? In such a case, making the query more complex could make it faster because less rows are subjected to the join-level tests?
This depends on the query plan used.
Even without indexes, modern servers can use a HASH JOIN or a MERGE JOIN, which are faster than O(N * M).
More specifically, the complexity of a HASH JOIN is O(N + M), where N is the hashed table and M is the lookup table. Hashing and hash lookups have constant complexity.
Complexity of a MERGE JOIN is O(N*Log(N) + M*Log(M)): it's the sum of times to sort both tables plus time to scan them.
SELECT T1.name, T2.date
FROM T1, T2
WHERE T1.id=T2.id
AND T1.color='red'
AND T2.type='CAR'
If there are no indexes defined, the engine will select either a HASH JOIN or a MERGE JOIN.
The HASH JOIN works as follows:
The hashed table is chosen (usually it's the table with fewer records). Say it's t1
All records from t1 are scanned. If the record holds color='red', it goes into the hash table with id as the key and name as the value.
All records from t2 are scanned. If the record holds type='CAR', its id is searched in the hash table and the values of name from all hash hits are returned along with the current value of date.
The MERGE JOIN works as follows:
A copy of t1 (id, name) is created, sorted on id.
A copy of t2 (id, date) is created, sorted on id.
The pointers are set to the minimal values in both tables:
>1 2<
 2 3
 2 4
 3 5
The pointers are compared in a loop, and if they match, the records are returned. If they don't match, the pointer with the smaller value is advanced:
>1 2< - no match, the left pointer is less: advance the left pointer
 2 3
 2 4
 3 5

 1 2< - match: return the records and advance both pointers
>2 3
 2 4
 3 5
(after advancing, the second 2 on the left is compared with 3: no match, so the left pointer advances again)

 1 2 - match: return the records and advance both pointers
 2 3<
 2 4
>3 5

 1 2 - the left pointer is out of range; the query is over
 2 3
 2 4<
 3 5
>
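If you want to watch these two strategies in action, most engines let you force the choice for an experiment; in SQL Server, for example, a query-level hint does it (shown purely as an experiment, not something to leave in production code):

SELECT T1.name, T2.date
FROM T1
INNER JOIN T2 ON T1.id = T2.id
WHERE T1.color = 'red'
AND T2.type = 'CAR'
OPTION (HASH JOIN);  -- or OPTION (MERGE JOIN)

Comparing the two estimated plans and their costs makes the N + M vs. sort-then-scan trade-off quite visible.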
In such a case, making the query more complex could make it faster because less rows are subjected to the join-level tests?
Sure.
Your query without the WHERE clause:
SELECT T1.name, T2.date
FROM T1, T2
is simpler but returns more results and runs longer.
Be careful of conflating too many different things. You have a logical cost of the query based on the number of rows to be examined, a (possibly) smaller logical cost based on the number of rows actually returned, and an unrelated physical cost based on the number of pages that have to be examined.
The three are related, but not strongly.
The number of rows examined is the largest of these costs and least easy to control. The rows have to be matched through the join algorithm. This, also, is the least relevant.
The number of rows returned is more costly because that's I/O bandwidth between client application and database.
The number of pages read is the most costly because that's an even larger number of physical I/O's. That's the most costly because that's load inside the database with impact on all clients.
SQL Query with one table is O( n ). That's the number of rows. It's also O( p ) based on the number of pages.
With more than one table, the rows examined is O(nm...). That's the nested-loops algorithm. Depending on the cardinality of the relationship, however, the result set may be as small as O( n ) because the relationships are all 1:1. But each table must be examined for matching rows.
A Hash Join replaces O( n*log(n) ) index + table reads with O( n ) direct hash lookups. You still have to process O( n ) rows, but you bypass some index reads.
A Merge Join replaces O( nm ) nested loops with an O( (n+m)*log(n+m) ) sort operation.
With indexes, the physical cost may be reduced to O(log(n)m) if a table is merely checked for existence. If rows are required, then the index speeds access to the rows, but all matching rows must be processed. O(nm) because that's the size of the result set, irrespective of indexes.
The pages examined for this work may be smaller, depending on the selectivity of the index.
The point of an index isn't to reduce the number of rows examined so much. It's to reduce the physical I/O cost of fetching the rows.
Are SQL query execution times O(n) compared to the number of joins, if indexes are not used?
Generally they're going to be O(n^m), where n is the number of records per table involved and m is the number of tables being joined.
And can indexing improve the actual big-O time-complexity, or does it only reduce the entire query time by some constant factor?
Both. Indexes allow for direct lookup when the joins are heavily filtered (i.e. with a good WHERE clause), and they allow for faster joins when they're on the right columns.
Indexes are no help when they're not on the columns being joined or filtered by.
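As a sketch of "the right columns" for the example query above (the index names are arbitrary):

CREATE INDEX idx_t1_color_id ON T1 (color, id);
CREATE INDEX idx_t2_type_id ON T2 (type, id);

Each index leads with the filtered column and carries the join column, so the filter and the join can both be satisfied from the indexes; the base tables are only touched to fetch name and date.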
Check out how clustered vs non-clustered indexes work
That is from a pure technical point of view... for an easy explanation, my good buddy mladen has written a simple article on understanding indexing.
Indexes definitely help, but I do recommend the reading to understand the pros and cons.