optimize query with column in where clause - sql

I have an SQL query which fetches the first N rows from a table that is designed as a low-level queue.
select top N * from my_table where status = 0 order by date asc
The intention behind this query is as follows:
First, this question is intended to be database agnostic, as my implementation will support SQL Server, Oracle, DB2 and Sybase. The "top N" syntax above is just an example.
The table can contain millions of rows.
N is a relatively small number in comparison, e.g. 100.
status is 0 when the row is in the queue. Later it is changed to 1 to indicate that it is in processing. After processing the row is deleted. So it is expected that at least 90% of the rows in the table will have status 0.
Rows in the table should be fetched according to their date, hence the order by clause.
What is the optimal index to make this query run as fast as possible?
I initially thought the index should be on (date, status), but I am not sure anymore. Since the status column will contain mostly zeros, is there any added value in including it? Would it be sufficient to index by (date) alone?
Or maybe it should be (status, date)?
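For concreteness, the three candidates under discussion would look roughly like this (generic SQL; index names are purely illustrative):
-- candidate 1: date first, then status
CREATE INDEX ix_my_table_date_status ON my_table (date, status);
-- candidate 2: date alone
CREATE INDEX ix_my_table_date ON my_table (date);
-- candidate 3: status first, then date
CREATE INDEX ix_my_table_status_date ON my_table (status, date);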

I don't think there is an efficient solution that will be RDBMS independent. For example, Oracle has bitmap indexes and SQL Server has filtered (partial) indexes, and I don't see a reason not to use them just because, for instance, MySQL or SQLite has nothing similar. Also, historically SQL Server implements clustered tables (IOTs in the Oracle world) much better than Oracle does, so having a clustered index on the date column may work perfectly for SQL Server, but not for Oracle.
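As an illustration of the SQL Server option, a filtered index over just the queued rows might look like this (a sketch only; the index name is an assumption):
-- SQL Server 2008+ filtered index: only rows still in the queue are indexed,
-- so a "top N ... where status = 0 order by date" query can read them in order
CREATE NONCLUSTERED INDEX ix_my_table_queue
ON my_table (date)
WHERE status = 0;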
I'd rather change the approach a bit. If you say 90% of rows don't satisfy the status=0 condition, why not try refactoring the schema and adding a new table (or materialized view) that holds only the records you are interested in? The number of new programmable objects required to keep that table up to date and merge data with the original table is relatively small, even if the RDBMS doesn't support materialized views directly. Also, if it's possible to redesign the underlying logic so that rows are never updated, only inserted or deleted, it will help avoid lock contention, and as a result the whole system will perform better.
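A minimal sketch of that idea, assuming a separate queue table kept in sync by application code or triggers (all names and types here are hypothetical, and "TOP" follows the SQL Server syntax used above):
-- narrow side table holding only rows that are still queued
CREATE TABLE my_table_queue (
    id   int      NOT NULL PRIMARY KEY,  -- same key as my_table
    date datetime NOT NULL
);
CREATE INDEX ix_my_table_queue_date ON my_table_queue (date);

-- on enqueue: insert into both tables; on pick-up or completion: delete from the queue table
-- the hot query then becomes a simple ordered scan over a small table:
SELECT TOP 100 q.id
FROM my_table_queue q
ORDER BY q.date ASC;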

Have a clustered index on Date and a non-clustered index on Status.

Postgres ignoring clustered index on date query

I have a large table against which I regularly run queries that filter on a date column, e.g. date_att > date '2001-01-01'. I'm trying to increase the speed of these queries by clustering the table on date_att, but when I run them through EXPLAIN ANALYZE it still chooses to sequentially scan the table, even for queries as simple as SELECT date_att FROM table WHERE date_att > date '2001-01-01'. Why is this the case? I understand that since the query returns a large portion of the table, the optimizer will ignore the index, but since the table is clustered by that attribute, shouldn't it be able to very quickly binary search through the table to the point where date > '2001-01-01' and return all results after that? This query still takes as much time as it did without the clustering.
It seems like you are confusing two concepts:
PostgreSQL clustering of a table
Clustering a table according to an index in PostgreSQL aligns the order of table rows (stored in a heap table) to the order in the index at the time of clustering. From the docs:
Clustering is a one-time operation: when the table is subsequently
updated, the changes are not clustered.
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
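For reference, a minimal sketch of how clustering is invoked (table and index names are illustrative):
-- one-time physical reordering of the heap to match the index order
CREATE INDEX my_table_date_att_idx ON my_table (date_att);
CLUSTER my_table USING my_table_date_att_idx;
ANALYZE my_table;  -- refresh planner statistics afterwards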
Clustering potentially (often) improves query speed for range queries because the selected rows happen to be stored close together in the heap table. There is nothing that guarantees this order, however, so the optimizer cannot assume it holds.
For example, if you insert a new row that fulfills your WHERE clause, it might be placed anywhere in the table, e.g. among the rows for 1990. Hence this assumption doesn't hold true:
but since the table is clustered by that attribute, shouldn't it be able to really quickly binary search through the table to the point where date > '2001-01-01' and return all results after that?
This brings us to the other concept you mentioned:
Clustered Indexes
This is something completely different, not supported by PostgreSQL at all but by many other databases (SQL Server, MySQL with InnoDB and also Oracle where it is called 'Index Organized Table').
In that case, the table data itself is stored in an index structure; there is no separate heap structure. As it is an index, the order is also maintained on each insert/update/delete. Hence your assumption would hold true, and indeed I'd expect the above-mentioned databases to behave as you expect (given that the date column is the clustering key!).
Hope that clarifies it.

how does sql count work?

I would like to understand how exactly SQL COUNT works. Does a whole table scan happen, or is some property of the table read? I feel a table scan would be a lot of overhead for huge tables with many records.
In general either a table or index scan is performed. This is chiefly because in an MVCC-supporting engine, different transactions can see different rows, so there is no single "row count" which is simultaneously correct for everyone.
Likewise, if you have a WHERE clause, then the where condition could be different for different clients, so they see different numbers.
If you need to do a lot of counts of large tables, consider storing your own counters in a different table. Exactly how you do this is entirely application specific.
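A minimal sketch of that approach, assuming you maintain the counter yourself from application code or triggers (table and column names are hypothetical):
-- side table holding one pre-computed count per table of interest
CREATE TABLE row_counts (
    table_name varchar(128) NOT NULL PRIMARY KEY,
    row_count  bigint       NOT NULL
);

-- adjust the counter in the same transaction as each insert/delete:
UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'big_table';
UPDATE row_counts SET row_count = row_count - 1 WHERE table_name = 'big_table';

-- reading the count is then a single-row lookup instead of a scan:
SELECT row_count FROM row_counts WHERE table_name = 'big_table';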
This will depend very much on which SQL implementation you are using (MS SQL Server, MySQL, Oracle, PostgreSQL etc), and how clever its optimiser is.
It may also depend on the query. For example, with something like
SELECT COUNT(primary_key) FROM table;
the optimiser may realise that there is no need to scan the table (since there is no filtering with WHERE and no possibility that any values are NULL) and just return the size of the table. With a more complicated query (where there is filtering, or the possibility of NULLs), the database may have to scan the table, or it may be able to do some optimisation with the use of an index.
This is obviously implementation dependent (different RDBMS may employ different strategies) and usage dependent (e.g. select count(*) from mytable and select count(*) from mytable where myfield < somevalue may use different methods even in the same DB).
If you are trying to get the count based on some partitioning that is already expressed by an index, smart DBs will try to use the index alone, or something like the old "Rushmore" optimization used in FoxBase.
So, "it depends", but at the end of the day, if no better methods are available, yes, the DB will perform a table scan.
It is usually some sort of index scan, unless there is no unique index on the table.
Strangely enough, most database engines can only count rows by scanning. That is why they also provide alternate solutions based on table metadata; for instance, SQL Server supports SELECT rowcnt FROM sysindexes .... However, these are usually not 100% accurate.
Yes, the COUNT function does a table scan. Rather than using COUNT on the table to get the total number of rows, you can use:
SELECT Total_Rows = SUM(st.row_count)
FROM sys.dm_db_partition_stats st
WHERE object_name(st.object_id) = 'TABLENAME'
  AND st.index_id < 2  -- heap (0) or clustered index (1) only, so rows aren't double-counted
or
SELECT sysobjects.[name], max(sysindexes.[rows]) AS TableRows
FROM sysindexes INNER JOIN sysobjects ON sysindexes.[id] = sysobjects.[id]
WHERE sysobjects.xtype = 'U' and sysobjects.[name]='tablename'
GROUP BY sysobjects.[name]
ORDER BY max(rows) DESC
Other ways to get the total count: http://www.codeproject.com/Tips/58796/Number-of-different-way-to-get-total-no-of-row-fro.aspx
It depends on the DBMS used.
If there is an index, there should be one index row for each table row. A smart DBMS will likely choose the smallest index and count the index rows.
Finally, if the table is small enough, it may count the table rows and bypass the index.
In PostgreSQL a table scan is performed.
I think it's implementation dependent.
Edit:
See this link
It really doesn't matter!
I assume you want the row count for some sort of paging... so just make sure your paging algorithm follows best practices and forget about how the engine works.
Let the people in the database business care about this; just follow the recommendations of those who are experts in the database you are using.
SQL Server - https://web.archive.org/web/20211020131201/https://www.4guysfromrolla.com/webtech/042606-1.shtml
Oracle - Paging with Oracle
MySQL - http://php.about.com/od/phpwithmysql/ss/php_pagination.htm

effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only a very limited effect on the speed of the query, but would have a potentially larger effect on the transfer speed of the data. The less data you select, the less data needs to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
    ID int IDENTITY NOT NULL PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL,
    Status tinyint NOT NULL
)
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing, often more columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections earlier just to save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced, so that should be fairly unusual, but it's still possible (especially if, for example, one of the servers handles some other data whose usage might vary independently from the rest).
Yes, if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched at all.
To demonstrate what tvanfosson has already written, that there is a "transfer" cost, I ran the following two statements on an MS SQL 2000 DB from Query Analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also, because the fields are the same, I would not expect indexing to be a factor here.

Indexing nulls for fast searching on DB2

It's my understanding that nulls are not indexable in DB2. So assume we have a huge table (Sales) with a date column (sold_on) that normally holds a date but is occasionally (10% of the time) null.
Furthermore, let's assume that it's a legacy application that we can't change, so those nulls are staying there and mean something (let's say sales that were returned).
We can make the following query fast by putting an index on the sold_on and total columns
Select * from Sales
where
Sales.sold_on between date1 and date2
and Sales.total = 9.99
But an index won't make this query any faster:
Select * from Sales
where
Sales.sold_on is null
and Sales.total = 9.99
Because the indexing is done on the value.
Can I index nulls? Maybe by changing the index type? Indexing the indicator column?
Where did you get the impression that DB2 doesn't index NULLs? I can't find anything in the documentation or articles supporting the claim. I just performed a query on a large table using an IS NULL restriction on an indexed column containing a small fraction of NULLs; DB2 certainly used the index (verified by an EXPLAIN, and by observing that the database responded instantly instead of spending time on a table scan).
So: I claim that DB2 has no problem with NULLs in non-primary key indexes.
But as others have written: Your data may be composed in a way where DB2 thinks that using an index will not be quicker. Or the database's statistics aren't up-to-date for the involved table(s).
I'm no DB2 expert, but if 10% of your values are null, I don't think an index on that column alone will ever help your query. 10% is too many to bother using an index for -- it'll just do a table scan. If you were talking about 2-3%, I think it would actually use your index.
Think about how many records are on a page/block -- say 20. The reason to use an index is to avoid fetching pages you don't need. The odds that a given page will contain 0 records that are null are (90%)^20, or about 12%. Those aren't good odds -- you're going to need to fetch 88% of your pages anyway, so using the index isn't very helpful.
If, however, your select clause only included a few columns (and not *) -- say just salesid, you could probably get it to use an index on (sold_on,salesid), as the read of the data page wouldn't be needed -- all the data would be in the index.
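For illustration, such a covering index and the narrow query might look like this (column names other than sold_on are assumptions based on the answer above):
-- index that covers the narrow query entirely, so no data pages need to be read
CREATE INDEX ix_sales_sold_on_salesid ON Sales (sold_on, salesid);

SELECT salesid
FROM Sales
WHERE sold_on IS NULL;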
The rule of thumb is that an index is useful for selecting up to roughly 15% of the records... so an index might be useful here.
If DB2 won't index nulls, then I would suggest adding a boolean field, IsSold, and setting it to true whenever the sold_on date gets set (this could be done in a trigger).
That's not the nicest solution, but it might be what you need.
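A rough sketch of that workaround; the flag column, trigger name and index are all assumptions, and the exact trigger and default-clause syntax varies by DB2 version:
-- flag column maintained alongside sold_on (0 = returned / not sold, 1 = sold)
ALTER TABLE Sales ADD COLUMN is_sold SMALLINT NOT NULL WITH DEFAULT 0;

-- keep the flag in sync on insert (a similar trigger would cover updates of sold_on)
CREATE TRIGGER sales_set_is_sold
  NO CASCADE BEFORE INSERT ON Sales
  REFERENCING NEW AS n
  FOR EACH ROW
  SET n.is_sold = CASE WHEN n.sold_on IS NULL THEN 0 ELSE 1 END;

-- the "returned sales" query can then use an index on the flag plus total
CREATE INDEX ix_sales_is_sold_total ON Sales (is_sold, total);
SELECT * FROM Sales WHERE is_sold = 0 AND total = 9.99;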
Troels is correct; even rows with a SOLD_ON value of NULL will benefit from an index on that column. If you're doing ranged searches on SOLD_ON, you may benefit even more by creating a clustered index that begins with SOLD_ON. In this particular example, it may not require much additional overhead to maintain the clustering order based on SOLD_ON, since newer rows added will most likely have a newer SOLD_ON date.