Is there any general rule on SQL query complexity vs. performance?

1) Are SQL query execution times O(n) with respect to the number of joins if indexes are not used? If not, what kind of relationship are we likely to expect? And can indexing improve the actual big-O time complexity, or does it only reduce the entire query time by some constant factor?
Slightly vague question, I'm sure it varies a lot but I'm talking in a general sense here.
2) If you have a query like:
SELECT T1.name, T2.date
FROM T1, T2
WHERE T1.id=T2.id
AND T1.color='red'
AND T2.type='CAR'
Am I right in assuming the DB will do single-table filtering first on T1.color and T2.type, before evaluating multi-table conditions? In such a case, could making the query more complex actually make it faster, because fewer rows are subjected to the join-level tests?

This depends on the query plan used.
Even without indexes, modern servers can use HASH JOIN and MERGE JOIN, which are faster than O(N * M).
More specifically, the complexity of a HASH JOIN is O(N + M), where N is the hashed table and M is the lookup table. Hashing and hash lookups have constant complexity.
The complexity of a MERGE JOIN is O(N*log(N) + M*log(M)): it's the sum of the times to sort both tables plus the time to scan them.
SELECT T1.name, T2.date
FROM T1, T2
WHERE T1.id=T2.id
AND T1.color='red'
AND T2.type='CAR'
If there are no indexes defined, the engine will select either a HASH JOIN or a MERGE JOIN.
The HASH JOIN works as follows:
The hashed table is chosen (usually it's the table with fewer records). Say it's t1.
All records from t1 are scanned. If a record has color='red', it goes into the hash table with id as the key and name as the value.
All records from t2 are scanned. If a record has type='CAR', its id is looked up in the hash table, and the values of name from all hash hits are returned along with the current value of date.
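The two passes can be mimicked in plain SQL, with a temporary table standing in for the in-memory hash table. This is only an illustrative sketch of the build and probe phases (Postgres/MySQL-style syntax), not what the engine literally executes:
-- Build phase: keep only the 'red' rows of T1, keyed by id.
CREATE TEMPORARY TABLE t1_hash AS
SELECT id, name FROM T1 WHERE color = 'red';
-- Probe phase: scan T2, keep the 'CAR' rows, and look each id up on the build side.
SELECT h.name, T2.date
FROM T2
JOIN t1_hash h ON h.id = T2.id
WHERE T2.type = 'CAR';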
The MERGE JOIN works as follows:
A copy of t1 (id, name) is created, sorted on id.
A copy of t2 (id, date) is created, sorted on id.
The pointers are set to the minimal values in both tables (the left column is the sorted t1.id, the right column is the sorted t2.id):
>1  2<
 2  3
 2  4
 3  5
The pointers are compared in a loop: if the values match, the records are returned; if they don't match, the pointer with the lesser value is advanced:
>1  2<  - no match, the left value is less; advance the left pointer
 2  3
 2  4
 3  5
 1  2<  - match, return the records and advance both pointers
>2  3
 2  4
 3  5
 1  2   - no match, the left value is less; advance the left pointer
 2  3<
>2  4
 3  5
 1  2   - match, return the records and advance both pointers
 2  3<
 2  4
>3  5
 1  2   - the left pointer is out of range; the query is over.
 2  3
 2  4<
 3  5
>
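Which strategy the engine actually picks can be checked by asking for the plan. A minimal sketch using Postgres/MySQL-style EXPLAIN (other engines expose the same information through their own tools):
EXPLAIN
SELECT T1.name, T2.date
FROM T1, T2
WHERE T1.id = T2.id
  AND T1.color = 'red'
  AND T2.type = 'CAR';
-- The output names the join operator chosen: hash join, merge join, or nested loop.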
In such a case, making the query more complex could make it faster because fewer rows are subjected to the join-level tests?
Sure.
Your query without the WHERE clause:
SELECT T1.name, T2.date
FROM T1, T2
is simpler, but it returns more rows and runs longer.

Be careful of conflating too many different things. You have a logical cost of the query based on the number of rows to be examined, a (possibly) smaller logical cost based on the number of rows actually returned, and an unrelated physical cost based on the number of pages that have to be examined.
The three are related, but not strongly.
The number of rows examined is the largest of these numbers and the least easy to control: the rows have to be matched through the join algorithm. It is also the least relevant of the three, because it is mostly in-memory comparison work rather than I/O.
The number of rows returned is more costly, because that is I/O bandwidth between the client application and the database.
The number of pages read is the most costly, because it is an even larger number of physical I/Os: that is load inside the database itself, with an impact on all clients.
A SQL query over one table is O(n) in the number of rows; it is also O(p) in the number of pages.
With more than one table, the rows examined are O(nm...): that is the nested-loops algorithm. Depending on the cardinality of the relationships, however, the result set may be as small as O(n) because the relationships are all 1:1, but each table must still be examined for matching rows.
A hash join replaces O(n*log(n)) index-plus-table reads with O(n) direct hash lookups. You still have to process O(n) rows, but you bypass some index reads.
A merge join replaces O(nm) nested loops with O(n*log(n) + m*log(m)) sorts followed by a linear merge of the two sorted inputs.
With indexes, the physical cost may be reduced to O(m*log(n)) if a table is merely checked for existence. If rows are required, the index speeds access to those rows, but all matching rows must still be processed: O(nm) in the worst case, because that is the potential size of the result set, irrespective of indexes.
The pages examined for this work may be smaller, depending on the selectivity of the index.
The point of an index isn't so much to reduce the number of rows examined; it's to reduce the physical I/O cost of fetching the rows.

Are SQL query execution times O(n) compared to the number of joins, if indexes are not used?
Generally they're going to be O(n^m), where n is the number of records per table involved and m is the number of tables being joined.
And can indexing improve the actual big-O time-complexity, or does it only reduce the entire query time by some constant factor?
Both. Indexes allow for direct lookup when the joins are heavily filtered (i.e. with a good WHERE clause), and they allow for faster joins when they're on the right columns.
Indexes are no help when they're not on the columns being joined or filtered by.
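For the sample query at the top of this question, that would mean indexing the filter and join columns together. A hedged sketch (index names and column order are illustrative; the best choice depends on your data):
-- Filter column first, join column second, so each side can be filtered and joined via the index.
CREATE INDEX ix_t1_color_id ON T1 (color, id);
CREATE INDEX ix_t2_type_id ON T2 (type, id);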

Check out how clustered vs. non-clustered indexes work.
That is from a purely technical point of view... for an easier explanation, my good buddy Mladen has written a simple article on understanding indexing.
Indexes definitely help, but I do recommend doing the reading to understand the pros and cons.

Related

How to reduce # of columns SQL has to look through while joining 2 tables?

I'm joining two tables together using an inner join, but given that these tables are billions of rows long, I was hoping to speed up my query by finding a way to reduce the columns SQL has to comb through. Is there a way, in a join, to have SQL search through only certain columns? I understand you can do that through SELECT, but rather than selecting columns from the join, I was hoping I could reduce the # of columns being searched in the first place.
Ex)
SELECT *
FROM table1 t1
JOIN table2 t2
ON t1.suite = t2.suite
AND t1.region = t2.region
Currently table1 and table2 both have over 20 columns, but I only need the 3 columns from each table.
I'm using presto btw. Thanks and stay safe :)
If you create an index on each table with both suite and region in the same index, plus an INCLUDE clause for any additional result columns you need, SQL Server can complete the query using only the indexes. This is called a covering index, and it helps performance by increasing the number of "rows" (index entries) that fit in an 8 KB page versus an entire real row, which in turn reduces the total number of page reads needed to complete the query.
Be aware, though, that you pay for this with extra work at INSERT/UPDATE/DELETE time to keep the indexes up to date, extra storage for the indexes, and extra cache RAM if any part of the indexes ends up in the buffer cache. With potentially billions of index entries, that cost could be significant, and may outweigh the gains for this one query, or may require updates to your server capacity planning.
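A sketch of what such covering indexes might look like, assuming a SQL Server-style engine behind the data; the INCLUDE column names are placeholders for the three columns you actually need from each table:
CREATE NONCLUSTERED INDEX IX_table1_suite_region
    ON table1 (suite, region)
    INCLUDE (COL_1, COL_2, COL_3);  -- placeholder column names
CREATE NONCLUSTERED INDEX IX_table2_suite_region
    ON table2 (suite, region)
    INCLUDE (COL_1, COL_2, COL_3);  -- placeholder column names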
SELECT T1.COL_1, T1.COL_2, T1.COL_3, T2.COL_1, T2.COL_2, T2.COL_3
FROM TABLE_1 T1
JOIN TABLE_2 T2 ON t1.suite = t2.suite AND t1.region = t2.region
And most books about the SQL language carry the warning: "Do not use * in production code."

SQL Server filtered index "Estimated number of rows to be read" is size of full table

I have a table in an Azure SQL database with ~2 million rows. On this table I have two pairs of columns that I want to filter for NULL, so I have a filtered index on each pair, checking for NULL on both of its columns.
One of the indexes appears to be about twice the size of the other (~400,000 vs ~800,000 rows).
However, it seems to take about 10x as long to query.
I'm running the exact same query on both:
SELECT DomainObjectId
FROM DomainObjects
INNER JOIN
(SELECT <id1> AS Id
UNION ALL
SELECT <id2> etc...
) AS DomainObjectIds ON DomainObjectIds.Id = DomainObjects.DomainObjectId
WHERE C_1 IS NULL
AND C_2 IS NULL
Where C_1/C_2 are the columns with the filtered index (and get replaced with other columns in my other query).
The query plans both involve an Index Seek - but in the fast one, it's 99% of the runtime. In the slow case, it's a mere 50% - with the other half spent filtering (which seems suspect, given the filtering should be implicit from the filtered index), and then joining to the queried IDs.
In addition, the "estimated number of rows to be read" for the index seek is ~2 million, i.e. the size of the full table. Based on that, it looks like what I'd expect a full table scan to look like. I've checked the index sizes and the "slow" one only takes up twice the space of the "fast" one, which implies it's not just been badly generated.
I don't understand:
(a) Why so much time is spent re-applying the filter from the index
(b) Why the estimated row count is so high for the "slow" index
(c) What could cause these two queries to be so different in speed, given the similarities in the filtered indexes? I would expect the slow one to be twice as slow, based purely on the number of rows matching each filter.
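For reference, each filtered index is shaped roughly like this (the index name and key column are illustrative):
CREATE NONCLUSTERED INDEX IX_DomainObjects_C1_C2_Null
    ON DomainObjects (DomainObjectId)
    WHERE C_1 IS NULL AND C_2 IS NULL;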

Postgres running parallel queries (cross joins on large tables)

I need to run queries of the following type:
SELECT * FROM A CROSS JOIN B WHERE myfunction(A.x,B.y) = Z
Because the query is slow I would like to use all processors available to speed it up.
I have only very basic knowledge of relational databases so even "obvious" comments are welcome.
Postgres v 9.4.4 (upgrade is not an option due to some constraints)
A has 3 mil rows
B can have 100k rows (but could have like 10M rows in future)
A,B have indexed columns
myfunction(A.x, B.y) takes advantage of the indexes on A.x and B.y; without them it is much slower.
What would be a reasonable solution?
At present, a 10k x 2M query using 50 processors with the naive split suggested below took about 20 min.
I am considering running the cross join on parts of the data in parallel, split by ranges of id (an integer primary key):
SELECT * FROM A CROSS JOIN B WHERE myfunction(A.x,B.y) = Z AND A.id BETWEEN N AND M
and then running multiple "psql -d mydatabase -f subqueryNumberX.sql" commands using GNU parallel.
Some questions:
If I have an indexed table T and use a SELECT from it within another query, will the index of T be used in the search, or does the sub-SELECT prevent that?
In my query above, would selecting only part of the table (WHERE A.id BETWEEN N AND M) prevent the use of an index?
When a (slow) cross join on a table is in progress, is that table accessible for other operations (the next cross join)?
Your question is (still) rather vague.
For a cross join, indexes are not necessarily of much use, but it depends on which columns are indexed, which columns are referenced in the query, and the size of the rows in the table. If the index is on the relevant columns, then maybe the optimizer will do an 'index only' scan instead of a 'full table scan' and benefit from the smaller amount of I/O. However, since you have SELECT *, you are selecting all columns from A and B, so the full rows will need to be read (but see the next point). There isn't a sub-select in the query, so it is mystifying to ask about the sub-select destroying anything.
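If only a handful of columns are actually needed, narrowing the select list to indexed columns is what would make an index-only scan possible. A minimal sketch under that assumption, keeping the question's names:
-- Instead of SELECT *, select only the columns covered by the indexes on A.x and B.y.
SELECT A.x, B.y
FROM A CROSS JOIN B
WHERE myfunction(A.x, B.y) = Z;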
Nominally, you might get some benefit from moving the WHERE clause into a sub-select such as:
SELECT *
FROM (SELECT * FROM A WHERE A.id BETWEEN N AND M) AS A1
CROSS JOIN B
WHERE myFunction(A1.x, B.y) = Z
However, it would be a feeble optimizer that would not do that automatically. The range condition might make an index on A.id attractive, especially if M and N represent a small fraction of the total range of values in A.id. So the optimizer should use an index with A.id as the leading or sole component to speed up the query. The condition won't prevent the use of an index; without it, indexes almost certainly won't be used at all.
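A quick way to verify what the planner does with the range condition is to look at the plan. A sketch for Postgres 9.4 (N, M and Z stay as placeholders from the question):
EXPLAIN
SELECT *
FROM A CROSS JOIN B
WHERE myfunction(A.x, B.y) = Z
  AND A.id BETWEEN N AND M;
-- Look for an Index Scan (or Bitmap Index Scan) on A.id rather than a Seq Scan on A.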
A slow query does not inhibit other queries; it may inhibit updates while it is running, or it may stress the MVCC (multi-version concurrency control) mechanisms of the DBMS.

Two very similar select statements, different performance

I've just come across some weird performance differences.
I have two selects:
SELECT s.dwh_end_date,
t.*,
'-1' as PROMOTION_DROP_EMP_CODE,
trunc(sysdate +1) as PROMOTION_END_DATE,
'K01' as PROMOTION_DROP_REASON,
-1 as PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
Which takes approximately 20 seconds.
And this one:
SELECT s.dwh_end_date,
s.dwh_product_key,
s.promotion_expire_date,
s.PROMOTION_DROP_EMP_CODE,
s.PROMOTION_END_DATE,
s.PROMOTION_DROP_REASON,
s.PROMOTION_DROP_WO_NUMBER
FROM STG_PROMO_EXPIRE_DATE t
INNER JOIN fct_customer_services s
ON(t.dwh_product_key = s.dwh_product_key)
That takes approximately 400 seconds
They are basically the same; the point is just to verify that I've updated my data correctly (the first select is used to update the FCT table, the second select is to make sure everything was updated correctly).
The only difference between these two selects is the columns I select. (The STG table has two columns: dwh_product_key and promotion_expire_date.)
First select explain plan
Second select explain plan
What can cause this weird behaviour?
The FCT table has a UNIQUE index on (dwh_product_key, dwh_end_date) and is partitioned by dwh_end_date (250 million records); the STG table doesn't have any indexes (and it's only 15k records).
Thanks in advance.
The plans are not exactly the same. The first query uses a fast full scan of the index on fct_customer_services and doesn't need to access any blocks from the actual table, since you only refer to the two indexed columns.
The second query does have to access the table blocks to get the other, unindexed column values. It's doing a full table scan, which is slower and more expensive than a full index scan. The optimiser doesn't see any improvement from using the index and then accessing specific table rows, presumably because the cardinality is too high: it would need to access too many table rows to save any effort by hitting the index first, so doing that would be even slower.
So the second query is slower because it has to read the whole table from disk/cache rather than just the whole index, and the table is much larger than the index. You can look at the segments assigned to both objects (index and table) to see the ratio of their sizes.
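For example, in Oracle the data dictionary can give a rough size comparison (the index name is a placeholder; the table is partitioned, so sum across its segments):
SELECT segment_name, SUM(bytes) / 1024 / 1024 AS size_mb
FROM   user_segments
WHERE  segment_name IN ('FCT_CUSTOMER_SERVICES', 'YOUR_INDEX_NAME')
GROUP  BY segment_name;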

INNER JOINs with a WHERE on the joined table

Let's say we have
SELECT * FROM A INNER JOIN B ON [....]
Assuming A has 2 rows and B contains 1M rows including 2 rows linked to A:
B will be scanned only once, with an "actual # of rows" of 2, right?
If I add a WHERE on table B:
SELECT * FROM A INNER JOIN B ON [....] WHERE B.Xyz > 10
The WHERE will actually be executed before the join... so if the WHERE returns 1000 rows, the "actual # of rows" of B will be 1000...
I don't get it... shouldn't it be <= 2?
What am I missing? Why does the optimiser proceed that way?
(SQL 2008)
Thanks
The optimizer will proceed whichever way it thinks is faster. That means if the Xyz column is indexed but the join column is not, it will likely do the xyz filter first. Or if your statistics are bad so it doesn't know that the join filter would pare B down to just two rows, it would do the WHERE clause first.
It's based entirely on what indexes are available for the optimizer to use. Also, there is no reason to believe that the db engine will execute the WHERE before another part of the query. The query optimizer is free to execute the query in any order it likes as long as the correct results are returned. Again, the way to properly optimize this type of query is with strategically placed indexes.
The "scanned only once" is a bit misleading. A table scan is a horrendously expensive thing in SQL Server. At least up to SS2005, a table scan requires a read of all rows into a temporary table, then a read of the temporary table to find rows matching the join condition. So in the worst case, your query will read and write 1M rows, then try to match 2 rows to 1M rows, then delete the temporary table (that last bit is probably the cheapest part of the query). So if there are no usable indexes on B, you're just in a bad place.
In your second example, if B.Xyz is not indexed, the full table scan happens and there's a secondary match from 2 rows to 1000 rows - even less efficient. If B.Xyz is indexed, there should be an index lookup and a 2:1000 match - much faster & more efficient.
Of course, this assumes the table stats are relatively current and no options are in effect that change how the optimizer works.
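If you do add indexes to B, something along these lines is the usual shape; the join column name a_id is an assumption, since the ON clause isn't shown in the question:
CREATE INDEX IX_B_Xyz ON B (Xyz);            -- enables the index lookup on the WHERE clause
CREATE INDEX IX_B_aid_Xyz ON B (a_id, Xyz);  -- optional composite covering the join and the filter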
EDIT: Is it possible for you to "unroll" the A rows and use them as a static condition in a no-JOIN query on B? We've used this in a couple of places in our application, where we're joining small tables (<100 rows) to large ones (>100M rows), to great effect.
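A sketch of that unrolling, with made-up key values and a made-up join column name (a_id), since the real ON clause isn't shown:
-- Hypothetical: if A's two rows are known to have keys 17 and 42, the join becomes a static IN list.
SELECT *
FROM B
WHERE B.a_id IN (17, 42)
  AND B.Xyz > 10;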