Clarification on Indexes - sql

To illustrate my question, I will use the following example:
CREATE INDEX supplier_idx
ON supplier (supplier_name);
Will the searching on this table only be sped up if the supplier_name column is specified in the SELECT clause? What if we select the supplier_name column as well as other columns in the SELECT clause? Is searching sped up if this column is used in a WHERE clause, even if it is not in the SELECT clause?
Do the same rules apply to the following index as well:
CREATE INDEX supplier_idx
ON supplier (supplier_name, city);

Indexes can be complex, so a full explanation would take a lot of writing. There are many resources on the internet. (Helpful link here to Oracle indexes)
However, I can just answer your questions simply.
CREATE INDEX supplier_idx
ON supplier (supplier_name);
This means that any joins (and similar) using the col supplier_name and using the col supplier_name in a WHERE clause will benefit from an index.
For example
SELECT * FROM SomeTable
WHERE supplier_name = 'Smith'
But simply using the supplier_name column in a SELECT clause will not benefit from having an index (unless you add complexity to the SELECT clause, which I will cover...). For example - this will not benefit from an Index on supplier_name
SELECT
supplier_name
FROM SomeTable WHERE ID = 1
However, if you added some complexity to your SELECT statement, your index could indeed speed it up...For example:
SELECT
supplier_name -- no index benefit
,(SELECT TOP 1 somedata FROM Table2 WHERE supplier_name = Table2.name) AS SomeValue
-- the line above uses the index as supplier_name is used in WHERE
, CASE WHEN supplier_name = 'Best Supplier'
THEN 'Best'
ELSE 'Worst'
END AS FindBestSupplier
-- Also the CASE statement will use the index on supplier_name
FROM SomeTable WHERE ID = 1
(The 'complexity' above still basically shows that if the field 'supplier_name' is used in CASE, or WHERE aswell as JOINS and aggregations, then the INDEX is very beneficial...This example above is a combination of many clauses wrapped into one SELECT statement)
But your composite index
CREATE INDEX supplier_idx
ON supplier (supplier_name, city);
would be beneficial in specific and important cases (Eg: where the city is in the SELECT clause and the supplier_name is used in the WHERE clause), for example
SELECT
city
FROM SomeTable WHERE supplier_name = 'Smith'
The reason is that city is stored alongside the supplier_name index values, so when the index finds the supplier_name value, it immediately has a copy of the city value (stored in the index) and does not need to hit the database files to find any more data. (If city was not in the index, it would have to hit the database to pull the city value out, as it does with most data required in the SELECT statement usually)
The joins will benefit from an index also, with the example:
SELECT
* FROM SomeTable T1
LEFT JOIN AnotherTable T2
ON T1.supplier_name = T2.supplier_name_2
AND T1.city = T2.city_2
So in summary, if you use the field in any comparison expression like a WHERE clause or a JOIN , or a GROUP BY clause (and the aggregations SUM, MIN, MAX etc)...then an Index is very beneficial for Tables with over a few thousand rows...
(Usually only makes a big difference when you have at least 10,000 rows in a Table, but this can vary depending on your complexity)
SQL Server (for example) always creates any missing indexes that it needs (and then discards them)..So if you do not create the correct indexes manually - the system can slow down as it creates the missing indexes on the fly each time it needs them. (SQL Server will show you hints on what indexes it thinks you need for a certain query)
Indexes can slow down UPDATES or INSERTS, so they must be used with a little wisdom and balance...(Sometimes indexes are deleted before a batch of UPDATEs is performed and then the index re-created again, although this is kinda extreme)

Related

Best way to get distinct count from a query joining two tables

I have 2 tables, table A & table B.
Table A (has thousands of rows)
id
uuid
name
type
created_by
org_id
Table B (has a max of hundred rows)
org_id
org_name
I am trying to get the best join query to obtain a count with a WHERE clause. I need the count of distinct created_bys from table A with an org_name in Table B that contains 'myorg'. I currently have the below query (producing expected results) and wonder if this can be optimized further?
select count(distinct a.created_by)
from a left join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%';
You don't need a left join:
select count(distinct a.created_by)
from a join
b
on a.org_id = b.org_id
where b.org_name like '%myorg%'
For this query, you want an index on b.org_id, which I assume that you have.
I would use exists for this:
select count(distinct a.created_by)
from a
where exists (select 1 from b where b.org_id = a.org_id and b.org_name like '%myorg%')
An index on b(org_id) would help. But in terms of performance, key points are:
searching using like with a wildcard on both sides is not good for performance (this cannot take advantage of an index); it would be far better to search for an exact match, or at least to not have a wildcard on the left side of the string.
count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.
Your query looks good already. Use a plain [INNER] JOIN instead or LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.
You mention that table B has only ...
a max of hundred rows
while table A has ...
thousands of rows
If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)
Essential ingredient is this multicolumn index:
CREATE INDEX ON a (org_id, created_by);
It can replace a simple index on just (org_id) and works for your simple query as well. See:
Is a composite index also good for queries on the first field?
There are two complications for your case:
DISTINCT
0-n org_id resulting from org_name like '%myorg%'
So the optimization is harder to implement. But still possible with some fancy SQL:
SELECT count(DISTINCT created_by) -- does not count NULL (as desired)
FROM b
CROSS JOIN LATERAL (
WITH RECURSIVE t AS (
( -- parentheses required
SELECT created_by
FROM a
WHERE org_id = b.org_id
ORDER BY created_by
LIMIT 1
)
UNION ALL
SELECT (SELECT created_by
FROM a
WHERE org_id = b.org_id
AND created_by > t.created_by
ORDER BY created_by
LIMIT 1)
FROM t
WHERE t.created_by IS NOT NULL -- stop recursion
)
TABLE t
) a
WHERE b.org_name LIKE '%myorg%';
db<>fiddle here (Postgres 12, but works in Postgres 9.6 as well.)
That's a recursive CTE in a LATERAL subquery, using a correlated subquery.
It utilizes the multicolumn index from above to only retrieve a single row for every (org_id, created_by). With an index-only scans if the table is vacuumed enough.
The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.
Due to the added overhead it can be a bit slower for an unfavorable data distribution (many org_id and/or only few rows per created_by) But it's much faster for favorable conditions and is scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
Related:
Optimize GROUP BY query to retrieve latest row per user
What is the difference between LATERAL and a subquery in PostgreSQL?
Is there a shortcut for SELECT * FROM?

Query equivalence with DISTINCT

Let us have a simple table order(id: int, category: int, order_date: int) created using the following script
IF OBJECT_ID('dbo.orders', 'U') IS NOT NULL DROP TABLE dbo.orders
SELECT TOP 1000000
NEWID() id,
ABS(CHECKSUM(NEWID())) % 100 category,
ABS(CHECKSUM(NEWID())) % 10000 order_date
INTO orders
FROM sys.sysobjects
CROSS JOIN sys.all_columns
Now, I have two equivalent queries (at least I believe that they are equivalent):
-- Q1
select distinct o1.category,
(select count(*) from orders o2 where order_date = 1 and o1.category = o2.category)
from orders o1
-- Q2
select o1.category,
(select count(*) from orders o2 where order_date = 1 and o1.category = o2.category)
from (select distinct category from orders) o1
However, when I run those queries they have a significantly different characteristic. The Q2 is twice faster for my data and it is clearly caused by the fact that the query plan first find unique categories (hash match in the following query plans) before the join.
The difference is still there if add requested index
CREATE NONCLUSTERED INDEX ix_order_date ON orders(order_date)
INCLUDE (category)
Moreover, the Q2 can use efficiently also the following index, whereas, the Q1 remains the same:
CREATE NONCLUSTERED INDEX ix_orders_kat ON orders(category, order_date)
My question are:
Are these queries equivalent?
If yes, what is the obstacle for the SQL Server 2016 query optimizer to find the second query plan in the case of Q1 (I believe that the search space must be quite small in this case)?
If no, could you post a counter example?
EDIT
My motivation for the question is that I would like to understand why query optimizers are so poor in rewriting even simple queries and they rely on SQL syntax so heavily. SQL language is a declarative language, therefore, why SQL query processors are driven by syntax so often even for simple queries like this?
The queries are functionally equivalent, meaning that they should return the same data.
However, they are interpreted differently by the SQL engine. The first (SELECT DISTINCT) generates all the results and then removes the duplicates.
The second extracts the distinct values first, so the subquery is only called on the appropriate subset.
An index might make either query more efficient, but it won't fundamentally affect whether the distinct processing occurs before or after the subquery.
In this case, the results are the same. However, that is not necessarily true depending on the subquery.

Optimizing SQL Stored Procedure Query for better performance

Is there a way to optimize the query below as its took quite awhile to retrieve the massive records from the table (T_School_Class) and (T_School) I had created indexes for Name as well as SchoolCode for T_School. In additional, Temp Table was also created.
SELECT Distinct (S.SchoolCode) As Code, Name from T_STU_School AS S
LEFT JOIN T_STU_School_Class AS SC ON S.SchoolCode = SC.SchoolCode
WHERE S.SchoolCode IN
(SELECT SchoolCode FROM #MainLevelCodeTemp)
AND [Status] = 'A'
AND Name LIKE #Keyword
AND (#AcademicCode = '' OR SC.AcademicLevel IN (#AcademicCode))
Order BY Name ASC;
all the imperatives in the sproc are a waste, you're just forcing SQL to scan T_STU_School multiple times, all that logic should just be added to the where clause:
SELECT Distinct (S.SchoolCode) As Code, Name from T_STU_School AS S
LEFT JOIN T_STU_School_Class AS SC ON S.SchoolCode = SC.SchoolCode
WHERE ((#MainLevelCode LIKE '%J%' AND S.MixLevelType IN ('T1','T2','T6'))
OR (#MainLevelCode LIKE '%S%' AND S.MixLevelType IN ('T1','T2','T5','T6'))
OR (#MainLevelCode LIKE '%P%' AND S.MixLevelType IN ('T1','T2','T6'))
OR (MainLevelCode IN (SELECT Item FROM [dbo].[SplitString](#MainLevelCode, ',')))
OR #MainLevelCode = '')
AND [Status] = 'A'
AND (#Keyword = '' OR Name LIKE #Keyword)
AND (#AcademicCode = '' OR SC.AcademicLevel IN (#AcademicCode))
Order BY Name ASC;
..the reason both tables are still being scanned per your execution plan even though you've created indexes on Name and SchoolCode is because there's no criteria on SchoolCode that would reduce the result set to less than the whole table, and likewise with Name whenever it is blank or starts with a "%". to prevent the full table scans you should create indexes on:
T_STU_School (Status, Name)
T_STU_School_Class (MixLevelType, SchoolCode)
T_STU_School_Class (MainLevelCode, SchoolCode)
..also any time you have stuff like (y='' OR x=y) in the where clause it's a good idea to add an OPTION (RECOMPILE) to the bottom to avoid the eventual bad plan cache nightmare.
..also this line is probably a bug:
AND (#AcademicCode = '' OR SC.AcademicLevel IN (#AcademicCode))
IN won't parse #AcademicCode so this statement is equivalent to SC.AcademicLevel=#AcademicCode
You definitely need an index on T_STU_SCHOOL.SchoolCode. Your query plan shows that 65% of the query time is taken from the index scan that results from the join. An index on the SchoolCode column should turn that into an index seek, which will be much faster.
The Name index is not currently being used, probably because you're passing in values for #keyword that start with a wildcard. Given that Name is on the T_STU_School table, which has a small number rows, you can maybe afford a table scan there in order to use wildcards the way you want to. So you should be able to drop the Name index.

Performance of nested select

I know this is a common question and I have read several other posts and papers but I could not find one that takes into account indexed fields and the volume of records that both queries could return.
My question is simple really. Which of the two is recommended here written in an SQL-like syntax (in terms of performance).
First query:
Select *
from someTable s
where s.someTable_id in
(Select someTable_id
from otherTable o
where o.indexedField = 123)
Second query:
Select *
from someTable
where someTable_id in
(Select someTable_id
from otherTable o
where o.someIndexedField = s.someIndexedField
and o.anotherIndexedField = 123)
My understanding is that the second query will query the database for every tuple that the outer query will return where the first query will evaluate the inner select first and then apply the filter to the outer query.
Now the second query may query the database superfast considering that the someIndexedField field is indexed but say that we have thousands or millions of records wouldn't it be faster to use the first query?
Note: In an Oracle database.
In MySQL, if nested selects are over the same table, the execution time of the query can be hell.
A good way to improve the performance in MySQL is create a temporary table for the nested select and apply the main select against this table.
For example:
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from someTable s2
where s2.Field = 123);
Can have a better performance with:
create temporary table 'temp_table' as (
Select someTable_id
from someTable s2
where s2.Field = 123
);
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from tempTable s2);
I'm not sure about performance for a large amount of data.
About first query:
first query will evaluate the inner select first and then apply the
filter to the outer query.
That not so simple.
In SQL is mostly NOT possible to tell what will be executed first and what will be executed later.
Because SQL - declarative language.
Your "nested selects" - are only visually, not technically.
Example 1 - in "someTable" you have 10 rows, in "otherTable" - 10000 rows.
In most cases database optimizer will read "someTable" first and than check otherTable to have match. For that it may, or may not use indexes depending on situation, my filling in that case - it will use "indexedField" index.
Example 2 - in "someTable" you have 10000 rows, in "otherTable" - 10 rows.
In most cases database optimizer will read all rows from "otherTable" in memory, filter them by 123, and than will find a match in someTable PK(someTable_id) index. As result - no indexes will be used from "otherTable".
About second query:
It completely different from first. So, I don't know how compare them:
First query link two tables by one pair: s.someTable_id = o.someTable_id
Second query link two tables by two pairs: s.someTable_id = o.someTable_id AND o.someIndexedField = s.someIndexedField.
Common practice to link two tables - is your first query.
But, o.someTable_id should be indexed.
So common rules are:
all PK - should be indexed (they indexed by default)
all columns for filtering (like used in WHERE part) should be indexed
all columns used to provide match between tables (including IN, JOIN, etc) - is also filtering, so - should be indexed.
DB Engine will self choose the best order operations (or in parallel). In most cases you can not determine this.
Use Oracle EXPLAIN PLAN (similar exists for most DBs) to compare execution plans of different queries on real data.
When i used directly
where not exists (select VAL_ID FROM #newVals = OLDPAR.VAL_ID) it was cost 20sec. When I added the temp table it costs 0sec. I don't understand why. Just imagine as c++ developer that internally there loop by values)
-- Temp table for IDX give me big speedup
declare #newValID table (VAL_ID int INDEX IX1 CLUSTERED);
insert into #newValID select VAL_ID FROM #newVals
insert into #deleteValues
select OLDPAR.VAL_ID
from #oldVal AS OLDPAR
where
not exists (select VAL_ID from #newValID where VAL_ID=OLDPAR.VAL_ID)
or exists (select VAL_ID from #VaIdInternals where VAL_ID=OLDPAR.VAL_ID);

Does EXCEPT execute faster than a JOIN when the table columns are the same

To find all the changes between two databases, I am left joining the tables on the pk and using a date_modified field to choose the latest record. Will using EXCEPT increase performance since the tables have the same schema. I would like to rewrite it with an EXCEPT, but I'm not sure if the implementation for EXCEPT would out perform a JOIN in every case. Hopefully someone has a more technical explanation for when to use EXCEPT.
There is no way anyone can tell you that EXCEPT will always or never out-perform an equivalent OUTER JOIN. The optimizer will choose an appropriate execution plan regardless of how you write your intent.
That said, here is my guideline:
Use EXCEPT when at least one of the following is true:
The query is more readable (this will almost always be true).
Performance is improved.
And BOTH of the following are true:
The query produces semantically identical results, and you can demonstrate this through sufficient regression testing, including all edge cases.
Performance is not degraded (again, in all edge cases, as well as environmental changes such as clearing buffer pool, updating statistics, clearing plan cache, and restarting the service).
It is important to note that it can be a challenge to write an equivalent EXCEPT query as the JOIN becomes more complex and/or you are relying on duplicates in part of the columns but not others. Writing a NOT EXISTS equivalent, while slightly less readable than EXCEPT should be far more trivial to accomplish - and will often lead to a better plan (but note that I would never say ALWAYS or NEVER, except in the way I just did).
In this blog post I demonstrate at least one case where EXCEPT is outperformed by both a properly constructed LEFT OUTER JOIN and of course by an equivalent NOT EXISTS variation.
In the following example, the LEFT JOIN is faster than EXCEPT by 70%
(PostgreSQL 9.4.3)
Example:
There are three tables. suppliers, parts, shipments.
We need to get all parts not supplied by any supplier in London.
Database(has indexes on all involved columns):
CREATE TABLE suppliers (
id bigint primary key,
city character varying NOT NULL
);
CREATE TABLE parts (
id bigint primary key,
name character varying NOT NULL,
);
CREATE TABLE shipments (
id bigint primary key,
supplier_id bigint NOT NULL,
part_id bigint NOT NULL
);
Records count:
db=# SELECT COUNT(*) FROM suppliers;
count
---------
1281280
(1 row)
db=# SELECT COUNT(*) FROM parts;
count
---------
1280000
(1 row)
db=# SELECT COUNT(*) FROM shipments;
count
---------
1760161
(1 row)
Query using EXCEPT.
SELECT parts.*
FROM parts
EXCEPT
SELECT parts.*
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
;
-- Execution time: 3327.728 ms
Query using LEFT JOIN with table, returned by subquery.
SELECT parts.*
FROM parts
LEFT JOIN (
SELECT parts.id
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
) AS subquery_tbl
ON (parts.id = subquery_tbl.id)
WHERE subquery_tbl.id IS NULL
;
-- Execution time: 1136.393 ms