How does a query with multiple WHERE conditions work in PostgreSQL?

I have a table account_config where I keep key-value configs for accounts with columns:
id - pk
account_id - fk
key
value
The table may hold configs for thousands of accounts, but each account has at most 10-20 configs. I am using this query:
select id, key, value from account_config t where t.account_id = ? and t.key = ?;
I already have an index on the account_id column. Do I need another index on the key column here? Will the second filter (key = ?) be applied to the already filtered result set (account_id = ?), or will it scan the whole table?

Indexes are used when only a small percentage of the table's rows get accessed and the index helps find those rows quickly.
You say there are thousands of accounts in your table, each with 10 to 20 rows.
Let's say there are 3,000 accounts and 45,000 rows in your table. An index on account_id then narrows the search to about 15 rows, roughly 0.03% of the table, and the key filter is applied to just those rows to find the one row in question. That makes it extremely likely that the index will be used.
Of course, if there were an index on (account_id, key), that index would be preferred, as we would only have to read one row from the table which the index points to.
So, yes, your index should suffice for the query shown, but if you want to get this faster, then provide the two-column index.
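A minimal sketch of the two-column index mentioned above, for PostgreSQL. The index name and the literal parameter values are hypothetical; the table and column names come from the question. EXPLAIN shows which plan the planner actually chooses, which depends on the table's size and statistics.

```sql
-- Sketch only: the index name is hypothetical.
CREATE INDEX account_config_account_id_key_idx
    ON account_config (account_id, key);

-- Inspect the chosen plan (42 and 'timezone' are made-up example values,
-- assuming an integer account_id and a text key).
EXPLAIN
SELECT id, key, value
FROM account_config t
WHERE t.account_id = 42
  AND t.key = 'timezone';
```

With only 10-20 rows per account, the gain over the single-column index is likely to be small, so it is worth comparing both plans before keeping the extra index.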

Related

SQL - Get specific row without a full table scan

I'm using PostgreSQL (CockroachDB) and I want to select a specific row. For example, there are thousands of records and I want to select row number 999.
In this case we would use LIMIT and OFFSET: SELECT * FROM table LIMIT 1 OFFSET 998;
However, using LIMIT and OFFSET can cause performance issues according to this post. So I'm wondering if there is a way to get a specific row without a full table scan.
I feel like it is possible because the database seems to sort data by primary key: when I do SELECT * FROM table; it always shows a sorted result. Since it is sorted by primary key, the database can use the index to access a specific row, right?
If you select rows based on the primary key (e.g. SELECT * FROM table WHERE <primary key> = <value>), no full table scan is needed under the hood. The same is true if you define a secondary index on the table and apply a WHERE clause that filters on the column(s) in the secondary index.
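As a rough illustration of the answer above, here is a sketch using a hypothetical table named items with an integer primary key id (neither name is from the question). The second query is a keyset-style alternative to OFFSET; note that it returns the next row by key, which is "row number 999" only if the ids happen to have no gaps.

```sql
-- Hypothetical table "items" with an integer primary key "id".
-- Point lookup by primary key.
SELECT * FROM items WHERE id = 999;

-- Keyset-style alternative to OFFSET: continue after the last key already read.
SELECT *
FROM items
WHERE id > 998      -- 998 stands for the last id seen so far (example value)
ORDER BY id
LIMIT 1;
```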

What sort of Index for 'AND' columns?

I have a table to store people and want to select where the person is not marked as "deleted". I have a clustered primary key on the ID column (PersonID).
The 'Deleted' column is a DATETIME, nullable, and is populated when deleted.
My query looks like this:
SELECT *
FROM dbo.Person
WHERE PersonID = 100
AND Deleted IS NULL
This table can grow to around 40,000 people.
Should I have an index that covers the Deleted flag as well?
I may also query things like:
SELECT *
FROM Task t
INNER JOIN Person p
ON p.PersonID = t.PersonID
AND p.Deleted IS NULL
WHERE t.TaskTypeId = 5
AND t.Deleted IS NULL
Task table estimate is about 1.5 million rows.
I think I need one that covers both the pk and the deleted flag on both tables? i.e. on (Task.TaskId, Task.Deleted) and (Person.PersonID, Person.Deleted)?
My reason for rethinking the indexes is a number of deadlocks occurring in complex procedures. I'd like to reduce the number of rows locked on selects/writes/updates, as well as get a performance gain.
Since you are using SQL Server 2008, the fastest option might well be a filtered index. For the Deleted column, which is a nullable DATETIME, you could try something like this index:
CREATE NONCLUSTERED INDEX Filtered_Deleted_Index
ON dbo.Person(Deleted)
WHERE Deleted IS NOT NULL
This will get you the smallest valid set in both use cases you listed above (for querying dbo.Person and also joining with Tasks).
Your instinct is (generally speaking) sound - an index that contains all columns needed for the query is called a covering index, which in this case would require:
CREATE INDEX Person_PersonID_Deleted ON Person(PersonID, Deleted);
You are unlikely to get much performance benefit on index lookup by adding the Deleted column, since searching for null is (usually) ignored, but having these indexes means that accessing the table can be bypassed entirely for Person.
You could also try creating:
CREATE INDEX Task_TaskTypeId_Deleted ON Task(TaskTypeId, Deleted);
which will avoid accessing Task rows that are marked as "deleted", so Task would then only be accessed for non-deleted rows. However, if most of your Tasks are not deleted, I wouldn't bother with this index.
It's worth trying out various combinations of index(es) to see which combination gives the best result.
Since the primary key is PersonID, adding another index with extra columns after PersonID will not improve the selectivity of the index, although it may prevent the need to look up the record by rowid for filtering on Deleted. With only 3% of records filtered, that's nothing, so don't create another index on Person.
As for the Task table, it very much depends on the selectivity of TaskTypeId = 5 AND Deleted IS NULL, i.e. how many records match the criteria. In general, a sequential search (full table scan) is faster than an index scan with row lookup if more than 20% of the records are selected. For very large tables where the data is widely distributed (e.g. physically every 10th record is selected), the performance threshold is below 10%.
So, if more than 10-20% of Task records are type 5, and only 3% of records are deleted, no index will improve performance, because the fastest access plan is likely a merge join of two full table scans.
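If TaskTypeId = 5 turns out to be selective enough for an index to pay off, one option is a filtered index whose filter matches the query's own predicate. A filtered index can only serve queries whose WHERE clause falls inside the filter, so for queries on Deleted IS NULL the filter would need to be Deleted IS NULL. This is a sketch for SQL Server 2008 or later; the index name and the dbo schema for Task are assumptions.

```sql
-- Sketch only: index name and dbo schema are assumptions.
-- Indexes just the non-deleted tasks, keyed by task type.
CREATE NONCLUSTERED INDEX IX_Task_TaskTypeId_NotDeleted
    ON dbo.Task (TaskTypeId)
    INCLUDE (PersonID)          -- carries the join column in the index leaf
    WHERE Deleted IS NULL;
```

Because the queries shown use SELECT *, this index does not cover them; each matching row still needs a lookup into the table.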

SQL Server Update Where clause performance with clustered index

I have an update query like the one below that updates AccessDate only when the current date is earlier than the one passed in. The table has a clustered index on Id.
Is there any use to have another non clustered index on Id, AccessDate?
Update Person
Set AccessDate = @NewAccessDate
Where Id = @Id
And AccessDate < @NewAccessDate
Under most circumstances, I would say that the update would be faster without the index. The key consideration is that the index itself would also need to be updated by the statement.
The one mitigating factor is when each id has lots and lots of AccessDates, and very, very few that are less than @NewAccessDate. For instance, if there were 10,000 rows per id and only 1 matched the condition, then updating the index is probably faster than scanning all the access dates.
Or, similarly, if most ids had no matching records for the WHERE clause.
I'm not sure what the cutoff value is for when one is better or not -- it would depend on other factors, such as your hardware and the number of records per page. But given that there is a trade-off, you are probably safe not putting in the index.

What key columns to use on filtered index with covering WHERE clause?

I'm creating a filtered index such that the WHERE filter includes the complete query criteria. With such an index, it seems that a key column would be unnecessary, though SQL requires me to add one. For example, consider the table:
CREATE TABLE Invoice
(
Id INT NOT NULL IDENTITY PRIMARY KEY,
Data VARCHAR(MAX) NOT NULL,
IsProcessed BIT NOT NULL DEFAULT 0,
IsInvalidated BIT NOT NULL DEFAULT 0
)
Queries on the table look for new invoices to process, i.e.:
SELECT *
FROM Invoice
WHERE IsProcessed = 0 AND IsInvalidated = 0
So, I can tune for these queries with a filtered index:
CREATE INDEX IX_Invoice_IsProcessed_IsInvalidated
ON Invoice (IsProcessed)
WHERE (IsProcessed = 0 AND IsInvalidated = 0)
GO
My question: What should the key column(s) for IX_Invoice_IsProcessed_IsInvalidated be? Presumably the key column isn't being used. My intuition leads me to pick a column that is small and will keep the index structure relatively flat. Should I pick the table primary key (Id)? One of the filter columns, or both of them?
Because that table has a clustered index on Id, it doesn't really matter what you put in the key columns of the filtered index: the clustering key is stored in every nonclustered index anyway, so Id is there free of charge. The only thing you can do is add everything to the INCLUDE section of the index so the data is available at the leaf level and key lookups into the table are avoided. Or, if the queue is huge, then perhaps some other column would be useful in the key section.
Now, if that table didn't have a primary key then you would have to include or specify as key columns all the columns that you need for joining or other purposes. Otherwise, RID lookups on heap would occur because on the leaf level of indexes you would have references to data pages.
What percentage of the table does this filtered index cover? If it's small, you may want to cover the entire table to handle the "SELECT *" from the index without hitting the table. If it's a large portion of the table though this would not be optimal. Then I'd recommend using the clustered index or primary key. I'd have to research more because I forget which is optimal right now but if they're the same you should be set.
I suggest you declare it as follows:
CREATE INDEX IX_Invoice_IsProcessed_IsInvalidated
ON Invoice (Id)
INCLUDE (Data)
WHERE (IsProcessed = 0 AND IsInvalidated = 0)
The INCLUDE clause means that the values of the Data column will be stored as part of the index.
If you didn't have an INCLUDE clause then the query plan for
SELECT Id, Data
FROM Invoice
WHERE IsProcessed = 0 AND IsInvalidated = 0
would involve a two-step process:
use the index to find the list of primary key values that match the criteria
get the data from the table that match those primary keys
If, on the other hand, the index includes the [Data] column, then it will properly cover the query, as there will be no need to look up the data using the primary keys.
You don't get something for nothing, though.
The downside is that you will be storing the varchar(MAX) data twice for these records, so more data will be written to the database and more storage will be used. This isn't much of a problem if you're only talking about a small section of the data.
As always the more time and effort you put into putting things away carefully the faster and easier it is to get them back.

SQL non-clustered index

I have a table that maps a user's permissions to a given object. So, it is essentially a join table to 3 different tables. (Object, User, and Permission)
The values of each row will always be unique for all 3 columns, but not any 2.
I need to create a non-clustered index. I want to put the index on the foreign keys to the object and user, but I am wondering if I should put it on all 3 columns.
"The values of each row will always be unique for all 3 columns"
You might be interested to know that SQL Server unique constraints are implemented as indexes. So if you have (or want) a constraint backing up that unique-claim of yours, you already have an index on all 3.
CREATE UNIQUE NONCLUSTERED INDEX idx_unique_perms ON UserPermissions
(
ObjectId ASC,
UserId ASC,
PermissionID ASC
)
If you make one, just remember to order your columns for high selectivity.
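For comparison, the same uniqueness rule can be declared as a constraint rather than as an explicit index; SQL Server then creates the backing unique index itself. This is a sketch only: the constraint name is hypothetical, and the table and column names are taken from the index definition above.

```sql
-- Sketch only: the constraint name is hypothetical.
-- SQL Server builds a unique nonclustered index to enforce this constraint.
ALTER TABLE UserPermissions
    ADD CONSTRAINT UQ_UserPermissions_Object_User_Permission
    UNIQUE NONCLUSTERED (ObjectId, UserId, PermissionID);
```

Either form gives you one index on all three columns, so there is no need to create both.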
If you have some doubts, formulate the queries you intend to execute against these tables, and run them through the Database Engine Tuning Advisor (which can be launched from SSMS). That should help you get started in the right direction.
One thing to consider is the number of rows in these three tables. If the row counts will be small, it might not even be worthwhile adding indexes. A table scan would probably be done anyway.