SQL Server - Indexing and Operator Precedence

I have a table Student as below
Student(id, jdate)
where column id is the primary key. Now I'm writing a query as below
select * from Student where id=2 and (jdate='date1' or jdate='date2')
Will the index work here? Or can I modify it as below?
select * from Student where (id=2) and (jdate='date1' or jdate='date2')

Both your examples will use the PK index for column id.
In case it is not clear: the operator = has precedence over AND, so the extra parentheses around id=2 are not necessary.

Since you are declaring a PK on the id column, you are also defining a unique clustered index on the table. And since you are using the id column in the WHERE clause, the index should be used.
Both queries will use the index; the parentheses around id = 2 don't change anything in the logic / condition evaluation.

Yes, both queries will work, and both will hit any relevant clustered or non-clustered index.
Given that id is your table's PK, you probably won't even hit any index on jdate. (I.e. although at first glance an index on (id, jdate) seems useful, in practice it will be redundant given that id is the PK: queries targeting a single id will either use the clustered index (if the default PK clustering is used) or the PK constraint itself (if the table has different clustering).)
Although the spurious parentheses around id = 2 will be ignored, AND has precedence over OR, so the parentheses surrounding the OR are essential:
... and (... or ...)
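For example, without those parentheses the condition would be parsed as:
select * from Student where (id=2 and jdate='date1') or jdate='date2'
which would also return rows for other ids whenever jdate='date2'.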

As other users said, both queries are the same, and the PK index will be used. If you have any doubts about which index is used (in this or other queries), see the execution plan: http://technet.microsoft.com/en-us/library/ms178071%28v=sql.105%29.aspx
The execution plan is a very useful tool; for example, it may point out missing indexes.
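If you prefer T-SQL to the Management Studio UI, one way to get the estimated plan is SET SHOWPLAN_XML; it must run in its own batch, hence the GO separators:
SET SHOWPLAN_XML ON;
GO
select * from Student where id=2 and (jdate='date1' or jdate='date2');
GO
SET SHOWPLAN_XML OFF;
GO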

Related

Does PostgreSQL use all available indexes to run a query faster?

We are structuring a project where some tables will have many records, and we intend to use 4 numeric foreign keys and 1 numeric primary key. Our assumption is that if we create an index for each foreign key, plus the default index of the primary key, the Postgres planner would use all of the indexes (5 in total) to perform the query.
95% of the time the queries would be providing at least the 4 foreign keys.
Would each index be used to narrow the search faster within the sequential set of records?
Would having 4 indexes increase the speed of the query, or would a single index on the parent level (branch_id) suffice?
Thank you for your time and experience.
Example: if all foreign keys have an index
SELECT * FROM products WHERE
account_id=1 AND
organization_id=2 AND
business_id=3 AND
branch_id=4 AND
product_id=5;
Example: if I only indicate the id of the primary key
SELECT * FROM products WHERE product_id=5;
If all 4 columns are specified by equality, it is possible to combine the single-column indexes using BitmapAnd. However, this would be less efficient than using one multi-column index on all four columns.
Since that will apparently be a very common query, it would make sense to have that multi-column index.
Usually you will want to index each foreign key column. Otherwise, if you want to delete an organization, for example, the database would need to scan the whole table to verify that no records still reference it. Whichever column comes first in the multi-column index will not also need a single-column index, but the other 3, which are not first, probably still need their own indexes.
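Such a multi-column index might look like this (a sketch; the index name is illustrative and the columns are taken from the query above):
CREATE INDEX products_fks_idx ON products (account_id, organization_id, business_id, branch_id);
-- account_id is the leading column, so it no longer needs its own single-column index;
-- organization_id, business_id and branch_id probably still need theirs for the FK checks.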
Indexes are (predominantly) used when filtering or joining tables, so whether the indexes you are proposing are useful is entirely dependent on the SQL you are running and whether the query optimiser determines that using an index would be beneficial.
For example, if you ran SELECT * FROM TABLE then none of the indexes would be used.
I can't comment on PostgreSQL specifically, but many/most DBMSs automatically create indexes when you define PKs/FKs - so you will get those indexes anyway, regardless of any performance tuning you are trying to implement.
Update
Having individual indexes on each column is not going to help with the query you've provided; the optimiser will only use one of them, probably the PK. A compound index on multiple columns would help, but the more columns you add to the index, the more restrictive the pattern of queries that will benefit.
Say you have 3 columns A, B, C and include them all in the WHERE clause; then a compound index on A+B+C would be highly beneficial.
If you keep this index but your WHERE clause only has columns A, B, it will still benefit significantly, as the query can still use the A+B subset of the index.
If your WHERE clause only has columns A, C, then it would benefit only slightly, as it would select all records from the index that start with the A value - but then would have to filter them to find the subset with the C value.
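As a sketch of that behaviour (a hypothetical table t; the comments describe the typical access pattern):
CREATE INDEX t_abc_idx ON t (a, b, c);
-- WHERE a = 1 AND b = 2 AND c = 3  : seeks on all three key columns
-- WHERE a = 1 AND b = 2            : seeks on the (a, b) prefix
-- WHERE a = 1 AND c = 3            : seeks on a, then filters the results on c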

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
    id BIGSERIAL PRIMARY KEY,
    joincol VARCHAR
);
CREATE TABLE test2 (
    joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
    SELECT test1.id,
           test1.joincol AS t1charcol,
           test2.joincol AS t2charcol
    FROM test1, test2
    WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100 ms. As far as I understand the execution plan, the runtime is independent of the row count, since Postgres iterates over the rows one by one (starting at the highest id, using the index) until a row can be joined, and returns immediately.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on row count), since the two tables are "joined completely" before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
In my real environment, test1 contains only a handful of rows (< 100), with unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on a row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use an Index Scan Backward on a row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
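That index, as a sketch (created on the underlying table):
CREATE INDEX ON test1 (id DESC NULLS LAST);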
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly, using the plain PK index on (id). But not when based on the view (or the underlying join query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test1.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be; that could get out of hand quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table directly; regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON test1 (id);
The PK on test1.id is implemented with a unique index on the column, which already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far from the actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL.
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
But that's currently all twisted in the question. Standing by till that's cleaned up.
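For reference, the suggested constraints might be declared like this (a sketch using the table1/table2 naming from this answer; the constraint names are illustrative):
ALTER TABLE table2 ADD CONSTRAINT table2_joincol_uni UNIQUE (joincol);
ALTER TABLE table1 ADD CONSTRAINT table1_joincol_fk
    FOREIGN KEY (joincol) REFERENCES table2 (joincol);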
This is a very good problem, and a good test case.
I tested it on Postgres 9.3; perhaps 13 can do it faster.
I used Occam's razor and excluded some possibilities:
The view (it is slow without the view too)
The JOIN can filter out some rows (unfortunately not in your test, but with a longer md5 prefix, 5-6 characters, it can)
Other basically equivalent SELECT statements (a subquery or EXISTS) do not solve your problem
I managed to get a plan that uses just the indexes, but because the tables aren't bigger than the indexes, that was not the solution.
I think
CREATE INDEX ON test1 (id);
is useless, because of the PK.
If you change this
CREATE INDEX ON test1 (joincol);
to this
CREATE INDEX ON test1 (joincol, id);
then the second query uses just the indexes.
After you run
REINDEX TABLE test1;
REINDEX TABLE test2;
VACUUM ANALYZE test1;
VACUUM ANALYZE test2;
you can achieve some additional performance, because you created the indexes before the inserts.
I think the reason is the two optimization aims of the DB.
The first aim is to optimize for just a few rows, so it runs a Nested Loop. You can force it with LIMIT x.
The second aim is to optimize for the whole table, i.e. run the query fast across all rows.
In this situation the Postgres optimizer didn't notice that a simple MAX can run with a Nested Loop. Or perhaps Postgres cannot push a limit into an aggregate clause (it runs on the whole partial select that the query filters).
And this is very expensive. But you have the possibility to write other aggregates there, like SUM, MIN, AVG, etc.
Perhaps window functions could help you too.
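A sketch of that window-function idea (no claim that it gets the fast plan; it is just an alternative formulation of the MAX):
SELECT id FROM (
    SELECT id, row_number() OVER (ORDER BY id DESC) AS rn
    FROM testview
) sub WHERE rn = 1;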

SQL Index - are both statements going to do the same?

I was wondering: in SQL Server, will these two statements to create a non-clustered index have the same behavior?
create nonclustered index EmpLastname_Incl_Firstname
on employee(lastname) include (firstname);
create nonclustered index EmpLastnameFirstname
on employee(lastname, firstname)
No. The key columns are optimized for things like filtering and grouping, while the included columns are optimized for retrieval of the column only. So if a lot of your queries look like the following:
SELECT firstname, lastname
FROM mytable
WHERE lastname = 'Doe' AND firstname = 'John'
then the second index you showed would be preferred. If you only filter on lastname in your WHERE clause, as in the following query:
SELECT firstname, lastname
FROM mytable
WHERE lastname = 'Doe'
Then the first index would be preferred.
If you have a mix of both queries, you should keep only the second index, as the lastname-only query is also able to make use of the second index.
Absolutely not.
INCLUDE means that the data from the column is stored in the index, but it is not part of the index sorting.
Those statements will not have the same behavior. The index with the INCLUDE will only support seeks on the lastname field, while the index without the INCLUDE will support seeks on both the lastname and firstname fields. See the Microsoft documentation for indexes with included columns. This bit is especially important to your question:
Redesign nonclustered indexes with a large index key size so that only columns used for searching and lookups are key columns. Make all other columns that cover the query into nonkey columns. In this way, you will have all columns needed to cover the query, but the index key itself is small and efficient.
If you ever need to search by the firstname field, your index should have it as a key column.
Adding columns to include will store the respective data only on the leaf-node level of the b-tree (not in the tree itself).
Almost everything that can be accomplished with INCLUDE can also be accomplished by putting the respective columns in the key part of the index. The exceptions are related to the length limits of the key. When in doubt, it might be best to leave it in the key columns.
Having said that, there are some benefits to putting a column in INCLUDE rather than in the key part:
the resulting index is slightly smaller (a few percent)
the tree of the index might be one level smaller
it documents what the columns of that index are used for, which makes extending the index easier in the future
I find the last one the most important.
Have a look at my recent article about this topic for a better understanding:
https://use-the-index-luke.com/blog/2019-04/include-columns-in-btree-indexes

Is the addition of a second ID column beneficial to index?

Let's say I have a table tbl_FacilityOrders with two foreign keys, fk_FacilityID and fk_OrderID, in SQL Server 2005. It could contain orders from a few hundred facilities. I need to query single records and will have both the facility ID and the order ID available to me. Is it better to define an index on fk_FacilityID then fk_OrderID and pass both to the query, or to just use fk_OrderID? Since there will be fewer facility IDs than order IDs, I could see weeding out the other facilities' records first possibly being beneficial.
A second question: if I were using the two-column query above, does the order I write my WHERE clause columns in matter, or is the engine smart enough to evaluate them in the order of the index?
E.G. Is:
WHERE fk_facilityID = #FacilityID AND fk_OrderID = #OrderID
equivalent to:
WHERE fk_OrderID = #OrderID AND fk_FacilityID = #FacilityID
?
Is it better to define an index on fk_FacilityID then fk_OrderID and pass the both to the query or to just use fk_OrderID.
If OrderID is unique, there's no real added benefit to adding the other field for the scenario given. It is a good idea to index your FKs, though, since they will always be a JOIN key.
if I were using the two-column query above, does the order I write my WHERE clause columns in matter, or is the engine smart enough to evaluate them in the order of the index?
Nope, order is irrelevant here. All that matters is that the SETS of fields match, i.e. FieldA and FieldB are both in the index and in the WHERE clause.
The order of fields in the index DOES matter, though. You can't use the second field in an index without knowing the value of the first field.
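A sketch of that point (the index name is illustrative):
create nonclustered index IX_FacilityOrders_Facility_Order
on tbl_FacilityOrders (fk_FacilityID, fk_OrderID);
-- a query filtering on fk_OrderID alone cannot seek on this index,
-- because fk_FacilityID is the leading key column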
You should create an index for each of your foreign keys... not just for the purposes of this question, but because indexing your foreign keys is good practice in general.
To answer your second question: the two statements are equivalent. SQL Server should internally re-order the predicates to arrive at the optimal execution plan... however, you should always validate the generated execution plan, just to make sure that it's behaving as you would expect.

SQL Server index included columns

I need help understanding how to create indexes. I have a table that looks like this
Id
Name
Age
Location
Education
PhoneNumber
My query looks like this:
SELECT *
FROM table1
WHERE name = 'sam'
What's the correct way to create an index for this with included columns?
What if the query has a order by statement?
SELECT *
FROM table1
WHERE name = 'sam'
ORDER BY id DESC
What if I have 2 parameters in my where statement?
SELECT *
FROM table1
WHERE name = 'sam'
AND age > 12
The correct way to create an index with included columns? Either via Management Studio/Toad/etc, or SQL (documentation):
CREATE INDEX idx_table_1 ON db.table_1 (name) INCLUDE (id)
What if the Query has an ORDER BY
The ORDER BY can use indexes if the optimizer sees fit (determined by table statistics and the query). It's up to you to test whether a composite index or an index with INCLUDE columns works best, by reviewing the query cost.
If id is the clustered key (not always the primary key though), I probably wouldn't INCLUDE the column...
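For the ORDER BY variant, a composite alternative worth testing might look like this (a sketch; whether it wins depends on your data):
CREATE INDEX idx_table_1_name_id ON db.table_1 (name, id)
-- with an equality filter on name, the matching index entries are ordered by id,
-- so ORDER BY id DESC can be satisfied by scanning that range backward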
What if I have 2 parameters in my where statement?
Same as above - you need to test what works best for your query. Might be composite, or include, or separate indexes.
But keep in mind that:
tweaking for one query won't necessarily benefit every other query
indexes do slow down INSERT/UPDATE/DELETE statements, and require maintenance
You can use the Database Engine Tuning Advisor (DTA) for index recommendations, including spotting when some are redundant
Recommended reading
I highly recommend reading Kimberly Tripp's "The Tipping Point" for a better understanding of index decisions and impacts.
Since I do not know exactly which tasks your DB is going to perform, or how many records are in it, I would suggest you take a look at the Index Basics MSDN article. It will allow you to decide for yourself which indexes to create.
If ID is your primary and/or clustered index key, just create an index on Name, Age. This will cover all three queries.
Included fields are best used to retrieve row-level values for columns that are not in the filter list, or to retrieve aggregate values where the sorted field is in the GROUP BY clause.
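That suggestion, as a sketch (assuming ID is the clustered primary key, per the answer above):
CREATE INDEX idx_table1_name_age ON table1 (name, age);
-- covers WHERE name = 'sam' and WHERE name = 'sam' AND age > 12;
-- the clustered key id travels with every nonclustered index entry anyway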
If inserts are rare, create as many indexes as you want.
For the first query, create an index on the name column.
The Id column, I think, is already the primary key...
Create a 2nd index on name and age. Alternatively, you can keep only one index, on (name, age), and it will not be much slower for the 1st query.