SQL index on a sum of columns - is that possible? - sql

Given two tables, T1 with column a and T2 with column b, is it possible to apply an index to a sum of columns T1.a + T2.b? I recently got a question involving this index and was quite surprised, as the question was not whether it was possible (which I believe is not), but rather would it speed up some example query.
If it is possible, what exactly are we indexing? Would it be helpful in queries like WHERE T1.a+T2.b = 3 or in some other cases? Thanks!

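Yes, most (not all) database systems allow you to create an index on the result of an expression, so creating an index on the sum of two columns is possible in those systems. Note that an index belongs to a single table, so an expression index covers a sum such as a + b only when both columns live in the same table; indexing T1.a + T2.b across two tables would require something like an indexed (materialized) view over the join, where the product supports it.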
Would it be helpful in queries like WHERE T1.a+T2.b = 3 or in some other cases?
That depends completely on the query and on the plan the optimizer decides to use to evaluate it. If you filter on the sum of two columns, and relatively few records meet that criterion, then yes, an index will reduce the amount of scanning needed to find the matching records.
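As a rough illustration of the single-table case, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for whatever database you use, and the table t, the index idx_sum and the sample data are made up for the example. It creates an index on a + b and asks the engine how it would evaluate a filter on that sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: both summed columns live in the same table.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    [(i % 50, i % 7) for i in range(1000)],
)

# Index on the expression a + b (supported by SQLite 3.9 and later).
conn.execute("CREATE INDEX idx_sum ON t (a + b)")

# Ask the planner how it would run a filter on the same expression.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM t WHERE a + b = 3"
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # column 3 holds the plan detail

matches = conn.execute("SELECT COUNT(*) FROM t WHERE a + b = 3").fetchone()[0]

print(plan_text)
print(matches)
```

The WHERE clause has to use the expression as it was written in CREATE INDEX for the planner to consider the index. Other products have their own syntax and rules for expression or computed-column indexes.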

Depending on your SQL product, it may be possible to index a view, which can contain a GROUP BY, to get persisted summary values.
However, this is a local optimization (search for "no free lunch"): it gives you faster SELECT performance at the expense of slower inserts and updates for others.

Related

Please explain COUNT and GROUP BY/ORDER BY in SQL for a beginner [duplicate]

I would like to understand exactly how SQL COUNT works. Does a whole table scan happen, or is some property of the table read? I feel a table scan would be an overhead in the case of huge tables with lots of records.
In general either a table or index scan is performed. This is chiefly because in an MVCC-supporting engine, different transactions could see different rows, so there is no single "row count" which is simultaneously correct for everyone.
Likewise, if you have a WHERE clause, then the where condition could be different for different clients, so they see different numbers.
If you need to do a lot of counts of large tables, consider storing your own counters in a different table. Exactly how you do this is entirely application specific.
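One possible way to keep such a counter, sketched here with Python's built-in sqlite3 module, is a pair of triggers that adjust a row in a separate table. The names items, row_counts, items_ins and items_del are hypothetical, and a real application may prefer to maintain the counter in application code instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical side table holding one counter per tracked table.
CREATE TABLE row_counts (table_name TEXT PRIMARY KEY, n INTEGER NOT NULL);
INSERT INTO row_counts VALUES ('items', 0);

-- Keep the counter in step with inserts and deletes on items.
CREATE TRIGGER items_ins AFTER INSERT ON items
BEGIN
    UPDATE row_counts SET n = n + 1 WHERE table_name = 'items';
END;

CREATE TRIGGER items_del AFTER DELETE ON items
BEGIN
    UPDATE row_counts SET n = n - 1 WHERE table_name = 'items';
END;
""")

conn.executemany("INSERT INTO items (name) VALUES (?)", [("a",), ("b",), ("c",)])
conn.execute("DELETE FROM items WHERE name = 'b'")

# Reading the counter is a single-row lookup instead of a scan.
stored = conn.execute(
    "SELECT n FROM row_counts WHERE table_name = 'items'"
).fetchone()[0]
actual = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(stored, actual)
```

The trade-off is the usual one: every insert and delete now pays for an extra update, and under heavy concurrency the counter row can become a point of contention.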
This will depend very much on which SQL implementation you are using (MS SQL Server, MySQL, Oracle, PostgreSQL etc), and how clever its optimiser is.
It may also depend on the query. For example, with something like
SELECT COUNT(primary_key) FROM table;
the optimiser may realise that there is no need to scan the table (since there is no filtering with WHERE and no possibility that any values are NULL) and just return the size of the table. With a more complicated query (where there is filtering, or the possibility of NULLs), the database may have to scan the table, or it may be able to do some optimisation with the use of an index.
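The point about NULLs can be seen in a small sketch using Python's built-in sqlite3 module. The people table and its data are made up for the example. COUNT(*) and COUNT of a primary key count every row, while COUNT of a nullable column skips the rows where that column is NULL, so the engine cannot simply return the table size for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a nullable column.
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO people (email) VALUES (?)",
    [("a@example.com",), (None,), ("c@example.com",), (None,)],
)

count_all, count_pk, count_email = conn.execute(
    "SELECT COUNT(*), COUNT(id), COUNT(email) FROM people"
).fetchone()

# COUNT(email) ignores the two rows where email is NULL.
print(count_all, count_pk, count_email)
```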
This is obviously implementation dependent (i.e. different RDBMSs may employ different strategies) and usage dependent (e.g. select count(*) from mytable and select count(*) from mytable where myfield < somevalue may use different methods even in the same DB).
If you are trying to get the count based on some partitioning that is already expressed by an Index, smart DBs will try to use the index alone. Or something like the old "rushmore" used in Foxbase.
So, "it depends", but at the end of the day, if no better methods are available, yes, the DB will perform a table scan.
It is usually some sort of index scan, unless there is no unique index on the table.
Strangely enough, most database engines can only count by scanning. They even provide alternate solutions to count using table metadata. For instance SQL Server supports SELECT rowcnt FROM sysindexes .... However, these are usually not 100% accurate.
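Yes, the COUNT function does a table scan. Rather than using COUNT on the table to get the total number of rows, you can use:
SELECT
Total_Rows = SUM(st.row_count)
FROM
sys.dm_db_partition_stats st
WHERE
object_name(st.object_id) = 'TABLENAME'
AND st.index_id < 2 -- heap (0) or clustered index (1) only, so rows are not counted once per index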
or
SELECT sysobjects.[name], max(sysindexes.[rows]) AS TableRows
FROM sysindexes INNER JOIN sysobjects ON sysindexes.[id] = sysobjects.[id]
WHERE sysobjects.xtype = 'U' and sysobjects.[name]='tablename'
GROUP BY sysobjects.[name]
ORDER BY max(rows) DESC
Other ways to get the total count: http://www.codeproject.com/Tips/58796/Number-of-different-way-to-get-total-no-of-row-fro.aspx
It depends on the DBMS used.
If there is an index, there should be one index row for each table row. A smart DBMS will likely choose the smallest index and count the index rows.
Finally, if the table is small enough, it may count the table rows and bypass the index.
In PostgreSQL a table scan is performed (newer versions can sometimes use an index-only scan instead).
I think it's implementation dependent.
Edit:
See this link
It really doesn't matter!
I assume you want the row count for some sort of paging... so just make sure your paging algorithm follows best practices and forget about how the engine works.
Let the people in the database business care about this, and just follow the recommendations of those who are experts in the database you are using.
SQL Server - https://web.archive.org/web/20211020131201/https://www.4guysfromrolla.com/webtech/042606-1.shtml
Oracle - Paging with Oracle
MySQL - http://php.about.com/od/phpwithmysql/ss/php_pagination.htm

Using temp table for sorting data in SQL Server

Recently, I came across a pattern (not sure, it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a more verbose and non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table and then apply ORDER BY on a field of the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt), with no other benefit.
For example, let's say there is a user table that might contain millions of rows. We want to retrieve all the users whose first name starts with 'G', sorted by first name. The natural and more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
1. Will there be any performance difference between the two ways if there is no index on the first name field? If yes, which one would be better?
2. Will there be any performance difference between the two ways if there is an index on the first name field? If yes, which one would be better?
3. Shouldn't the SQL Server optimizer generate the same execution plan for both ways?
4. Is there any benefit in writing it the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
Regularly: an anti-pattern used by people who have no idea what they are doing.
Sometimes: OK, because SQL Server has a problem that is not resolvable otherwise. I have not seen that one in years, though.
It makes things slower because it forces the tempdb table to be fully populated first, while otherwise the query could possibly be resolved more efficiently.
The last time I saw that was about 3 years ago. We made it 3 times as fast by not trying to be smart with a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly. It depends on the amount of data, but an index seek would return the data already in order (as the index is ordered by content).
3: No, obviously. Query plan optimization is done statement by statement. By cutting the execution in two, the query optimizer cannot merge the second step into the first statement.
4: Only if you run into a query optimizer issue or a limit on how many tables you can join, which is not the case here (this is a degenerate case in the technical sense, i.e. very simplistic). But if you need to join a great many tables, it may be better to go with an interim step.
If the field you want to do an order by on is not indexed, you could put everything into a temp table and index it and then do the ordering and it might be faster. You would have to test to make sure.
There is never any benefit of the second approach that I can think of.
It means if the data is available pre-ordered SQL Server can't take advantage of this and adds an unnecessary blocking operator and additional sort to the plan.
In the case that the data is not available pre-ordered SQL Server will sort it in a work table either in memory or tempdb anyway and adding an explicit #temp table just adds an unnecessary additional step.
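The extra sort step can be made visible with a small sketch. It uses Python's built-in sqlite3 module as a stand-in, so the plans are SQLite's and not SQL Server's, and the names users, temp_users and idx_users_name are made up. The filter is written as a range on name instead of LIKE 'G%' because SQLite's default LIKE is case-insensitive and would not use a plain index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(n,) for n in ["Greg", "Alice", "Gina", "Bob", "Gabe"]],
)
conn.execute("CREATE INDEX idx_users_name ON users (name)")


def plan(sql):
    """Return SQLite's query plan for sql as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Direct way: filter and sort in one statement.
direct_sql = "SELECT name FROM users WHERE name >= 'G' AND name < 'H' ORDER BY name"

# Verbose way: stage the filtered rows in a temp table, then sort them.
conn.execute(
    "CREATE TEMP TABLE temp_users AS "
    "SELECT name FROM users WHERE name >= 'G' AND name < 'H'"
)
staged_sql = "SELECT name FROM temp_users ORDER BY name"

direct_rows = conn.execute(direct_sql).fetchall()
staged_rows = conn.execute(staged_sql).fetchall()
direct_plan = plan(direct_sql)
staged_plan = plan(staged_sql)

print(direct_plan)
print(staged_plan)
```

Both queries return the same rows, but only the staged version needs a separate sort, because the temp table has lost the ordering that the index on users provided.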
Edit
I suppose one case where the second approach could give an apparent benefit is if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be suboptimal. In that case I would resolve it in a different way, either by improving statistics or by using hints or a query rewrite to avoid the undesired plan.

Why didn't SAS use my index?

I have a large SAS dataset sorted by field 'A'. I'd like to do a query that references fields 'A' and 'B'. To speed up performance I created an index on 'B'. This results in an unhelpful message:
INFO: Index B not used. Sorting into index order may help.
Of course sorting on B would help. But that's not the point. Indexes are for the case when you are already sorted on some other field.
In a similar query, SAS gives this message:
INFO: Use of index C for WHERE clause optimization canceled.
Any tips on getting SAS to use my indexes? In one case the query is taking 2 hours to run because SAS doesn't use the index.
If the query is not selective enough, so that most of the source records end up in the result, using the index may not help performance and can even make things worse. That's probably why the optimizer decided not to use the index.
To force the use of an index, try the IDXNAME= data set option (on both tables, probably).
Refer to http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000414058.htm.
Without seeing the query and knowing some characteristics of data (at least record counts of input tables and expected size of the query result) it's hard to tell the optimal approach.
Anyway, for optimal performance when joining tables, both tables need to be indexed similarly and all the join keys need to be part of the index.
Can't answer a question like this without seeing the query you are trying to run. An index will only be useful if the SAS optimizer determines it will improve performance. Can you show a simple example of the code you want to run?

When do sql optimizations become overkill?

I'm updating tables with millions of records and I need to be as efficient as possible. Is there a point at which adding more criteria to the where clause will actually hurt rather than help?
For example, if I know I want to set a column to 3 I could use this query:
update mytable set col = 3
Or I could update the record only if it's different
update mytable set col = 3 where col <> 3
I could also filter it so it only updates records added since the last time I ran this process
update mytable set col = 3 where col <> 3 and createDate > #lastRunDate
And perhaps I could look for more things in additional columns.
I guess my question is if there is a point where the cost of looking at additional columns outweighs the cost of the update itself and if there's a principle you can use to determine where to draw the line.
Update
So here's the principle I'm trying to piece together based on what was said. Feel free to argue with this and I'll update it accordingly:
If there are no indexed columns to filter on, add as many criteria as possible to limit the records being updated, since a full table scan is going to happen anyway.
If the difference in records between filtering on only indexed columns and filtering on all possible columns is marginal, only use the indexed columns and avoid the full table scan.
If you have a mix of indexed and non-indexed columns, definitely use the indexed columns if you can and only use non-index columns if... [[I'm still struggling with this part. What's the threshold for introducing the non-indexed columns in the where clause?]]
Update #2
Sounds like I have my answer.
If you have an index on "col", then running your first query will update millions of rows regardless; your second query would potentially only update a few and find those quickly if there's an index available. If you don't have an index on that column, the effect will be marginal since a full table or index scan must occur to check all rows in your table (you'll just have fewer actual updates, but that's it).
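The difference in the number of rows actually updated is easy to see in a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine, and the sample data (eight rows already at 3, two that are not) is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, col INTEGER)")
# Eight rows already hold 3; two rows hold something else.
conn.executemany(
    "INSERT INTO mytable (col) VALUES (?)",
    [(3,)] * 8 + [(1,), (2,)],
)

# Filtered update: only rows whose value differs are written.
filtered = conn.execute("UPDATE mytable SET col = 3 WHERE col <> 3").rowcount

# Unfiltered update: every row is written, even though all already hold 3.
unfiltered = conn.execute("UPDATE mytable SET col = 3").rowcount

print(filtered, unfiltered)
```

One thing to keep in mind is that col <> 3 does not match rows where col is NULL, so if the column is nullable the filter would need an extra OR col IS NULL to cover them.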
The whole point of restricting your queries using WHERE clauses is to reduce the scope of your query, i.e. the number of rows SQL Server has to look at. Having less data to process is always faster than processing all the millions of rows.
In response to your update: the main goal of using a WHERE clause is to reduce the number of rows you need to inspect / touch. If you have a means (typically an index) to reduce that number from 100% to a few percent, then it's definitely worth it. That's the whole point of having indices (mostly for SELECTs, but applies to other operations, too, of course).
If you have a suitable index, and thus you can pluck out a few hundred rows to check against a criteria versus having to inspect millions of rows, you'll always be faster. If you have a good book index in a bookstore that guides you easily to the two shelves where the books that interest you are located, you'll find what you're looking for more quickly than when you have to criss-cross the whole bookstore since there's no index available.
There obviously is a point where yet another criteria or index doesn't help anymore. If that's the case, typically yet another WHERE clause won't really help much - or at all. But in this case, the SQL query optimizer will find those cases and filter them out (possibly even just ignoring them when deciding on what the best query execution plan is).
This really comes down to index usage and query optimization. I would suggest looking at the query plan before making any decisions.
Adding indexed fields to the where clause will often improve query time, however, adding non-indexed fields can result in table scans which will slow your query.
My suggestion is to write a query that works, look at the execution time, and work to reduce it to an acceptable level by looking at the query plan. Don't over-optimize; go for the acceptable solution.