SQL Server Update Where clause performance with clustered index - sql

I have an update query like below to update AccessDate only when the current date is less then the passed one. The table has a clustered index on Id.
Is there any use to have another non clustered index on Id, AccessDate?
Update Person
Set AccessDate = #NewAccessDate
Where Id = #Id
And AccessDate < #NewAccessDate

Under most circumstances, I would say that the update would be faster without the index. The key consideration is that the index itself would also need to be updated by the statement.
The one mitigating factor is when each id has lots and lots of AccessDates, and very, very few that are less than #NewAccessdate. For instance, if there were 10,000 rows per id and only 1 matched the condition, then updating the index is probably faster than scanning all the access dates.
Or, similarly, if most ids had no matching records for the WHERE clause.
I'm not sure what the cutoff value is for when one is better or not -- it would depend on other factors, such as your hardware and the number of records per page. But given that there is a trade-off, you are probably safe not putting in the index.

Related

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a join on a row is possible and immediately returns.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely", before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
On my real environment test1 contains only a hand full of rows (< 100), having unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use a Index Scan Backward on row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be, that could get out of hands quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table, regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure, whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem, and good testcase.
I tested it in postgres 9.3 perhaps 13 is can it more more fast.
I used Occam's Razor and i excluded some possiblities
View (without view is slow to)
JOIN can filter some rows (unfortunatly in your test not, but more length md5 5-6 yes)
Other basic equivalent select statements not solve yout problem (inner query or exists)
I achieved to use just index, but because the tables isn't bigger than indexes it was not the solution.
I think
CREATE INDEX on "test" ("id");
is useless, because PK!
If you change this
CREATE INDEX on "test" ("joincol");
to this
CREATE INDEX ON TEST (joincol, id);
Than the second query use just indexes.
After you run this
REINDEX table test;
REINDEX table test2;
VACUUM ANALYZE test;
VACUUM ANALYZE test2;
you can achive some performance tuning. Because you created indexes before inserts.
I think the reason is the two aim of DB.
First aim optimalize just some row. So run Nested Loop. You can force it with limit x.
Second aim optimalize whole table. Run this query fast for whole table.
In this situation postgres optimalizer didn't notice that simple MAX can run with NESTED LOOP. Or perhaps postgres cannot use limit in aggregate clause (can run on whole partial select, what is filtered with query).
And this is very expensive. But you have possiblities to write there other aggregates, like SUM, MIN, AVG stb.
Perhaps can help you the Window functions too.

What sort of Index for 'AND' columns?

I have a table to store people and want to select where the person is not marked as "deleted". I have a clustered primary key on the ID column (PersonID).
The 'Deleted' column is a DATETIME, nullable, and is populated when deleted.
My query looks like this:
SELECT *
FROM dbo.Person
WHERE PersonID = 100
AND Deleted IS NULL
This table can grow to around 40,000 people.
Should I have an index that covers the Deleted flag as well?
I may also query things like:
SELECT *
FROM Task t
INNER JOIN Person p
ON p.PersonID = t.PersonID
AND p.Deleted IS NULL
WHERE t.TaskTypeId = 5
AND t.Deleted IS NULL
Task table estimate is about 1.5 million rows.
I think I need one that covers both the pk and the deleted flag on both tables? i.e. on (Task.TaskId, Task.Deleted) and (Person.PersonID, Person.Deleted)?
Reasons for me investigating an index rethink, is due to a number of deadlock occurring in complex procedures. I'd like to reduce the number of rows locked on selects/writes/updates, as well as get a performance gain.
Since you are using SQL Server 2008, the fastest querying might well be using a filtered index. In this Deleted column whose type is DATETIME and nullable, you could try something like this index:
CREATE NONCLUSTERED INDEX Filtered_Deleted_Index
ON dbo.Person(Deleted)
WHERE Deleted IS NOT NULL
This will get you the smallest valid set in both use cases you listed above (for querying dbo.Person and also joining with Tasks).
Your instinct is (generally speaking) sound - an index that contains all columns needed for the query is called a covering index, which in this case would require:
CREATE INDEX Person_PersonID_Deleted ON Person(PersonID, Deleted);
You are unlikely to get much performance benefit on index lookup by adding the Deleted column, since searching for null is (usually) ignored, but having these indexes means that accessing the table can be bypassed entirely for Person.
You could also try creating:
CREATE INDEX Task_TaskTypeId_Deleted ON Task(TaskTypeId, Deleted);
which will avoid accessing Task rows that are marked as "deleted", and Task would then only accessed for non-deleted rows. However, if most of your Tasks are not deleted, I wouldn't bother with this index.
It's worth trying out various combinations of index(es) to see which combination gives the best result.
Since the primary key is PersonID, adding another index with extra columns after PersonID will not improve the "selectability" of the index, although is may prevent the need to lookup the record by rowid for filtering on deleted. With only 3% records filtered, that's nothing, so don't create another index on Person.
As for the Task table, it very much depends on the selectability of TaskTypeId = 5 AND Deleted IS NULL, i.e. how many records match the criteria. In general, a sequential search (full table scan) is faster than an index scan with row lookup if more than 20% of the records are selected. For very larger tables where the data is very distributed (e.g. physically every 10th record is selected), the performance threshold is below 10%.
So, if more than 10-20% of Task records are type 5, and only 3% of records are deleted, no index will improve performance, because the fastest access plan is likely a merge join of two full table scans.

Do clustered index on a column GUARANTEES returning sorted rows according to that column [duplicate]

This question already has an answer here:
Does a SELECT query always return rows in the same order? Table with clustered index
(1 answer)
Closed 8 years ago.
I am unable to get clear cut answers on this contentious question .
MSDN documentation mentions
Clustered
Clustered indexes sort and store the data rows in the table or view
based on their key values. These are the columns included in the
index definition. There can be only one clustered index per table,
because the data rows themselves can be sorted in only one order.
The only time the data rows in a table are stored in sorted order is
when the table contains a clustered index. When a table has a
clustered index, the table is called a clustered table. If a table
has no clustered index, its data rows are stored in an unordered
structure called a heap.
While I see most of the answers
Does a SELECT query always return rows in the same order? Table with clustered index
http://sqlwithmanoj.com/2013/06/02/clustered-index-do-not-guarantee-physically-ordering-or-sorting-of-rows/
answering negative.
What is it ?
Just to be clear. Presumably, you are talking about a simple query such as:
select *
from table t;
First, if all the data on the table fits on a single page and there are no other indexes on the table, it is hard for me to imagine a scenario where the result set is not ordered by the primary key. However, this is because I think the most reasonable query plan would require a full-table scan, not because of any requirement -- documented or otherwise -- in SQL or SQL Server. Without an explicit order by, the ordering in the result set is a consequence of the query plan.
That gets to the heart of the issue. When you are talking about the ordering of the result sets, you are really talking about the query plan. And, the assumption of ordering by the primary key really means that you are assuming that the query uses full-table scan. What is ironic is that people make the assumption, without actually understanding the "why". Furthermore, people have a tendency to generalize from small examples (okay, this is part of the basis of human intelligence). Unfortunately, they see consistently that results sets from simple queries on small tables are always in primary key order and generalize to larger tables. The induction step is incorrect in this example.
What can change this? Off-hand, I think that a full table scan would return the data in primary key order if the following conditions are met:
Single threaded server.
Single file filegroup
No competing indexes
No table partitions
I'm not saying this is always true. It just seems reasonable that under these circumstances such a query would use a full table scan starting at the beginning of the table.
Even on a small table, you can get surprises. Consider:
select NonPrimaryKeyColumn
from table
The query plan would probably decide to use an index on table(NonPrimaryKeyColumn) rather than doing a full table scan. The results would not be ordered by the primary key (unless by accident). I show this example because indexes can be used for a variety of purposes, not just order by or where filtering.
If you use a multi-threaded instance of the database and you have reasonably sized tables, you will quickly learn that results without an order by have no explicit ordering.
And finally, SQL Server has a pretty smart optimizer. I think there is some reluctance to use order by in a query because users think it will automatically do a sort. SQL Server works hard to find the best execution plan for the query. IF it recognizes that the order by is redundant because of the rest of the plan, then the order by will not result in a sort.
And, of course you want to guarantee the ordering of results, you need order by in the outermost query. Even a query like this:
select *
from (select top 100 t.* from t order by col1) t
Does not guarantee that the results are ordered in the final result set. You really need to do:
select *
from (select top 100 t.* from t order by col1) t
order by col1;
to guarantee the results in a particular order. This behavior is documented here.
Without ORDER BY, there is no default sort order even if you have clustered index
in this link there is a good example :
CREATE SCHEMA Data AUTHORIZATION dbo
GO
CREATE TABLE Data.Numbers(Number INT NOT NULL PRIMARY KEY)
GO
DECLARE #ID INT;
SET NOCOUNT ON;
SET #ID = 1;
WHILE #ID < 100000 BEGIN
INSERT INTO Data.Numbers(Number)
SELECT #ID;
SET #ID = #ID+1;
END
CREATE TABLE Data.WideTable(ID INT NOT NULL
CONSTRAINT PK_WideTable PRIMARY KEY,
RandomInt INT NOT NULL,
CHARFiller CHAR(1000))
GO
CREATE VIEW dbo.WrappedRand
AS
SELECT RAND() AS random_value
GO
CREATE ALTER FUNCTION dbo.RandomInt()
RETURNS INT
AS
BEGIN
DECLARE #ret INT;
SET #ret = (SELECT random_value*1000000 FROM dbo.WrappedRand);
RETURN #ret;
END
GO
INSERT INTO Data.WideTable(ID,RandomInt,CHARFiller)
SELECT Number, dbo.RandomInt(), 'asdf'
FROM Data.Numbers
GO
CREATE INDEX WideTable_RandomInt ON Data.WideTable(RandomInt)
GO
SELECT TOP 100 ID FROM Data.WideTable
OUTPUT:
1407
253
9175
6568
4506
1623
581
As you have seen, the optimizer has chosen to use a non-clustered
index to satisfy this SELECT TOP query.
Clearly you cannot assume that your results are ordered unless you
explicitly use ORDER BY clause.
One must specify ORDER BY in the outermost query in order to guarantee rows are returned in a particular order. The SQL Server optimizer will optimize the query and data access to improve performance which may result in rows being returned in a different order. Examples of this are allocation order scans and parallelism. A relational table should always be viewed as an unordered set of rows.
I wish the MSDN documentation were clearer about this "sorting". It is more correct to say that SQL Server b-tree indexes provide ordering by 1) storing adjacent keys in the same page and 2) linking index pages in key order.

Does clustered index sort order have impact on performance

If a PK of a table is a standard auto-increment int (Id) and the retrieved and updated records are almost always the ones closer to the max Id will it make any difference performance-wise whether the PK clustered index is sorted as ascending or descending?
When such PK is created, SSMS by default sets the sort order of the index as ascending and since the rows most accessed are always the ones closer to the current max Id, I'm wondering if changing the sorting to descending would speed up the retrieval since the records will be sorted top-down instead of bottom-up and the records close to the top are accessed most frequently.
I don't think there will be any performance hit. Since, it's going to perform a binary search for the index key to access and then the specific data block with that key. Either way, that binary search will hit O(log N) complexity. So in total O(log N) + 1 and since it's clustered index, it actually should be O(log N) time complexity; since the table records are physically ordered instead of having a separate index page/block.
Indexes use a B-tree structure, so No. But if you have an index that is based off multiple columns, you want the most distinct columns on the outer level, and least distinct on the inner levels. For example, if you had 2 columns (gender and age), you would want age on the outer and gender on the inner, because there are only 2 possible genders, whereas there are many more ages. This will impact performance.

Slow MySQL retreival of the row with the largest value in indexed column

I have a SQL table readings something like:
id int
client_id int
device_id int
unique index(client_id, device_id)
I do not understand why the following query is so slow:
SELECT client_id FROM `readings` WHERE device_id = 10 ORDER BY client_id DESC LIMIT 1
My understanding with the index is that mysql keeps an ordered list (one property of a btree) of each row in the table sorted first by client_id and then by device_id. When I execute an explain on this query it says that it will use the index, but that it will need to look at every row. This makes sense since in the worst case, there may only be one row with device_id = 10 and that may also be the row with the smallest client_id and thus at the end of it's search. However, in practice, this is not true. My table has ~10 million rows, and rows with device_id = 10 are spread fairly evenly throughout that table. Why then doesn't MySQL start at the end of the index and scan until it finds the first row with device_id = 10, stop and return that value? It does not seem possible that this is what is happening since the query takes ~30 seconds to execute.
Is it that my unique key is implemented as a hash somehow and thus not accessible in a list form? PHPMyAdmin is telling me that it is implemented as a b-tree, which makes me think that it should be able to do the scan as I mentioned above and quit with the first instance.
Where is my error and how can I make this query execute more quickly?
Thanks
Try switching the column order in the index:
unique index(device_id, client_id)
Since you are filtering on device_id, you would want that to be the first column in the index.
First, I'm assuming that you have good statistics for this table. If not, you'll want to analyze the table to make sure the optimizer can figure out what the best option is.
Here's another approach you could try that might work better. I could be that MySQL is not understanding your intent well enough to optimize correctly:
SELECT MAX(client_id) from readings where device_id = 10
Otherwise you could modify the index to be by device_id first, then client_id. Or you could add another index by just device_id.
You have a compound index on (client_id, device_id), these will(more or less) be concatenated for the purpose of indexing and the index will only be considered if you use the
first of the column(s). Your query is using 'device_id' which is the last of them, you could provide a separate index on that column, or swap the columns around in the index.
Also, check the output of EXPLAIN on your queries.