When do sql optimizations become overkill?

I'm updating tables with millions of records and I need to be as efficient as possible. Is there a point at which adding more criteria to the where clause will actually hurt rather than help?
For example, if I know I want to set a column to 3, I could use this query:
update mytable set col = 3
Or I could update the record only if it's different
update mytable set col = 3 where col <> 3
I could also filter it so it only updates records added since the last time I ran this process
update mytable set col = 3 where col <> 3 and createDate > #lastRunDate
And perhaps I could look for more things in additional columns.
I guess my question is if there is a point where the cost of looking at additional columns outweighs the cost of the update itself and if there's a principle you can use to determine where to draw the line.
Update
So here's the principle I'm trying to piece together based on what was said. Feel free to argue with this and I'll update it accordingly:
If there are no indexed columns to filter on, add as many criteria as possible to limit the records being updated, since a full table scan is going to happen anyway.
If the difference in records between filtering on only indexed columns and filtering on all possible columns is marginal, use only the indexed columns and avoid the full table scan.
If you have a mix of indexed and non-indexed columns, definitely use the indexed columns if you can and only use non-indexed columns if... [[I'm still struggling with this part. What's the threshold for introducing the non-indexed columns in the where clause?]]
Update #2
Sounds like I have my answer.

If you have an index on "col", then running your first query will update millions of rows regardless; your second query would potentially only update a few and find those quickly if there's an index available. If you don't have an index on that column, the effect will be marginal since a full table or index scan must occur to check all rows in your table (you'll just have fewer actual updates, but that's it).
The whole point of restricting your queries using WHERE clauses is to reduce the scope of your query, i.e. the number of rows SQL Server has to look at. Processing less data is always faster than processing all millions of rows.
In response to your update: the main goal of using a WHERE clause is to reduce the number of rows you need to inspect / touch. If you have a means (typically an index) to reduce that number from 100% to a few percent, then it's definitely worth it. That's the whole point of having indices (mostly for SELECTs, but applies to other operations, too, of course).
If you have a suitable index, and thus you can pluck out a few hundred rows to check against a criterion versus having to inspect millions of rows, you'll always be faster. If you have a good book index in a bookstore that guides you easily to the two shelves where the books that interest you are located, you'll find what you're looking for more quickly than when you have to criss-cross the whole bookstore since there's no index available.
There obviously is a point where yet another criterion or index doesn't help anymore. If that's the case, typically yet another WHERE condition won't really help much - or at all. But in this case, the SQL query optimizer will find those cases and filter them out (possibly even just ignoring them when deciding on what the best query execution plan is).
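To make the difference between the first two statements in the question concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in engine, which is an assumption (the answer above is about SQL Server), and the row counts and values are invented for illustration. It only shows how many rows each UPDATE touches, not how long it takes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, col INTEGER)")

# 1,000 hypothetical rows: 990 already hold 3, 10 hold something else.
rows = [(3,)] * 990 + [(7,)] * 10
conn.executemany("INSERT INTO mytable (col) VALUES (?)", rows)

# Filtered update: only the rows that actually differ are written.
filtered = conn.execute("UPDATE mytable SET col = 3 WHERE col <> 3").rowcount

# Unfiltered update: every row is rewritten, even though nothing changes.
unfiltered = conn.execute("UPDATE mytable SET col = 3").rowcount

print(filtered, unfiltered)
```

One caveat if col is nullable: col <> 3 is never true for NULL, so the filtered form leaves NULL rows untouched unless the condition also checks for NULL.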

This really comes down to index usage and query optimization. I would suggest looking at the query plan before making any decisions.
Adding indexed fields to the where clause will often improve query time, however, adding non-indexed fields can result in table scans which will slow your query.
My suggestion is to write a query that works, look at the execution time, and work to reduce it to an acceptable level by looking at the query plan. Don't over-optimize; go for the acceptable solution.
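As a rough sketch of what "looking at the query plan" can look like, the snippet below asks SQLite for its plan via EXPLAIN QUERY PLAN. SQLite is only a stand-in here (SQL Server and Oracle expose plans through their own tools), and the index name ix_mytable_createDate and the date literal are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, col INTEGER, createDate TEXT)"
)
# Hypothetical index on the date column used in the question's third query.
conn.execute("CREATE INDEX ix_mytable_createDate ON mytable (createDate)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Filter on a non-indexed column only: the engine has to read every row.
scan_plan = plan("UPDATE mytable SET col = 3 WHERE col <> 3")

# Filter that also names an indexed column: the index narrows the rows first.
seek_plan = plan(
    "UPDATE mytable SET col = 3 "
    "WHERE col <> 3 AND createDate > '2024-01-01'"
)

print(scan_plan)
print(seek_plan)
```

The exact wording of the plan lines differs between SQLite versions, so it is safer to look for the words SCAN and SEARCH and the index name than to compare whole strings.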

Related

Oracle SQL: What is the best way to select a subset of a very large table

I have been roaming these forums for a few years and I've always found my questions had already been asked, and a fitting answer was already present.
I have a pretty generic (and maybe easy) question now though, but I haven't been able to find a thread asking the same one yet.
The situation:
I have a payment table with 10-50M records per day, a history of 10 days and hundreds of columns. About 10-20 columns are indexed. One of the indices is batch_id.
I have a batch table with considerably fewer records and columns, say 10k a day and 30 columns.
If I want to select all payments from one specific sender, I could just do this:
Select * from payments p
where p.sender_id = 'SenderA'
This runs a while, even though sender_id is also indexed. So I figure, it's better to select the batches first, then go into the payments table with the batch_id:
select * from payments p
where p.batch_id in
(select b.batch_id from batches b where b.sender_id = 'SenderA')
--and p.sender_id = 'SenderA'
Now, my questions are:
In the second script, should I uncomment the Sender_id in my where clause on the payments table? It doesn't feel very efficient to filter on sender_id twice, even though it's in different tables.
Is it better if I make it an inner join instead of a nested query?
Is it better if I make it a common table expression instead of a nested query or inner join?
I suppose it could all fit into one question: What is the best way to query this?

In the worst case the two queries should run in the same time and in the best case I would expect the first query to run quicker. If it is running slower, there is some problem elsewhere. You don't need the additional condition in the second query.
The first query will retrieve index entries for a single value, so it is going to access fewer blocks than the second query, which has to find index entries for multiple batches (as well as executing the subquery, but that is probably not significant).
But the danger as always with Oracle is that there are a lot of factors determining which query plan the optimizer chooses. I would immediately verify that the statistics on your indexed columns are up-to-date. If they are not, this might be your problem and you don't need to read any further.
The next step is to obtain a query execution plan. My guess is that this will tell you that your query is running a full-table-scan.
Whether or not Oracle chooses to perform a full-table-scan on a query such as this depends on the number of rows returned and on whether Oracle thinks it is more efficient to use the index or to simply read the whole table. The threshold for flipping between the two is not a fixed number: it depends on a lot of things, one of them being a parameter called DB_FILE_MULTIBLOCK_READ_COUNT.
This is set up by Oracle, and in theory it should be configured such that the transition between indexed and full-table-scan queries is smooth. In other words, at the transition point where your query is returning enough rows to just about make a full table scan more efficient, the index scan and the table scan should take roughly the same time.
Unfortunately, I have seen systems where this is way out and Oracle flips to doing full table scans far too quickly, resulting in a long query time once the number of rows gets over a certain threshold.
As I said before, first check your statistics. If that doesn't work, get a QEP and start tuning your Oracle instance.
Tuning Oracle is a very complex subject that can't be answered in full here, so I am forced to recommend links. Here is a useful page on the parameter (reducing it might help): Why Change the Oracle DB_FILE_MULTIBLOCK_READ_COUNT.
Other than that, the general Oracle performance tuning guide is here: (Oracle) Configuring a Database for Performance.
If you are still having problems, you need to progress your investigation further and then come up with a more specific question.
EDIT:
Based on your comment, your query is returning 4M rows out of the 10M-50M in the table. If it is 4M out of 10M, there is no way an index will be of any use. Even with 4M out of 50M, it is still pretty certain that a full-table-scan would be the most efficient approach.
You say that you have a lot of columns, so probably this 4M row fetch is returning a huge amount of data.
You could perhaps consider splitting off some of the columns that are not required and putting them into a child table. In particular, if you have columns containing a lot of data (e.g., some text comments or whatever) they might be better being kept outside the main table.
Remember - small is fast, not only in terms of number of rows, but also in terms of the size of each row.

SQL is a declarative language. This means that you specify what you want, not how to get it.
Check your indexes primary and "normal" ones...

Indexing and performance Implications of moving small table into big table

I have a table with approximately 2.5 million rows that I am thinking about moving into a much larger table, 35 million rows, with a boolean flag set on the original 2.5 million.
If I wanted to run lots of queries against the 2.5 million records in the new larger table, would adding an index be useful / not cause a full table scan on every query? I know that traditionally indexes aren't helpful in booleans, but since only 7% of the records will be true, I thought it might not require a table scan on every query.

Perhaps look at using a partial index.
From docs
A partial index is an index built over a subset of a table; the subset
is defined by a conditional expression (called the predicate of the
partial index). The index contains entries for only those table rows
that satisfy the predicate.
A major motivation for partial indexes is to avoid indexing common
values. Since a query searching for a common value (one that accounts
for more than a few percent of all the table rows) will not use the
index anyway, there is no point in keeping those rows in the index at
all. This reduces the size of the index, which will speed up queries
that do use the index. It will also speed up many table update
operations because the index does not need to be updated in all cases.
Example 11-1 shows a possible application of this idea.
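As a rough illustration of the partial-index idea quoted above (this is not the Example 11-1 from the PostgreSQL documentation), the sketch below builds a partial index and checks when the planner can use it. It runs on SQLite through Python's sqlite3 module as a stand-in; the CREATE INDEX ... WHERE form is shared with PostgreSQL. The table big, the columns customer_id and is_legacy, and the index name are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE big (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "is_legacy INTEGER NOT NULL, payload TEXT)"
)
# Partial index: only rows with is_legacy = 1 get an index entry.
conn.execute(
    "CREATE INDEX ix_big_legacy_customer ON big (customer_id) "
    "WHERE is_legacy = 1"
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# The query repeats the index predicate, so the partial index is usable.
with_flag = plan(
    "SELECT payload FROM big WHERE is_legacy = 1 AND customer_id = 42"
)

# Without the predicate the planner cannot use the partial index.
without_flag = plan("SELECT payload FROM big WHERE customer_id = 42")

print(with_flag)
print(without_flag)
```

The practical consequence is that queries against the flagged subset have to spell out the flag condition, otherwise the smaller index is simply ignored.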

I would be looking at partitioning, if you have a substantial proportion of the table that you want to access efficiently.

If you do "insert into big select * from small", then all of the rows that came from the small table are likely to be physically close to each other. After analyzing the table, PostgreSQL will know this, and so will probably choose to use the index on the boolean.
But if there is a lot of churn in the rows, then eventually the "true" rows and the "false" rows will become all jumbled up, making use of the index less and less effective, and PostgreSQL will stop using it.
By using partitioning/inheritance, you can keep the rows physically separate (to make sequential scanning on just the small set faster) while making them look like a single set of data when you want to.
Depending on the nature of the queries you run, you might also benefit from adding other columns to the index, keeping the boolean column as the first column.

What aspects of a sql query are relatively costly to one another? Joins? Num of records? columns selected?

How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column
If you have a SQL query that joins two or three tables together and retrieves 100 rows of data, does performance have anything to say about whether I should select only the columns I need? Or should I write a query that just yanks all the columns?
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Would 1 record vs 10 record vs 100 record matter?

As an extremely generalized version of ranking those factors you mention in terms of performance penalty and occurrence in the queries you write, I would say:
Joins - Especially when joining on tables with no indexes for the fields you're joining on and/or with tables that have a very large amount of data.
# of Rows / Amount of Data - Again, indexes mitigate this quite a bit, just make sure you have the right ones.
# of Fields - I would say the # of fields in the SELECT clause impact performance the least in most situations.
I would say any performance-driving property is always coupled with how much data you have - sure a join might be fast when your tables have 100 rows each, but when millions of rows are in the tables, you have to start thinking about more efficient design.

Several things impact the cost of a query.
First, are there appropriate indexes for it to use? Fields that are used in a join should almost always be indexed, and foreign keys are not indexed by default; the designer of the database must create those indexes. Fields used in the WHERE clauses often need indexes as well.
Next, is the where clause sargable, in other words can it use the indexes even if you have the correct ones? A bad where clause can hurt a query far more than joins or extra columns. You can't get anything but a table scan if you use syntax that prevents the use of an index such as:
LIKE '%test'
Next, are you returning more data than you need? You should never return more columns than you need and you should not be using select * in production code as it has additional work to look up the columns as well as being very fragile and subject to create bad bugs as the structure changes with time.
Are you joining to tables you don't need to be joining to? If a table returns no columns in the select, is not used in the where and doesn't filter out any records if the join is removed, then you have an unnecessary join and it can be eliminated. Unnecessary joins are particularly prevalent when you use a lot of views, especially if you make the mistake of calling views from other views (which is a big performance killer for many reasons). Sometimes if you trace through these views that call other views, you will see the same table joined to multiple times when it would not have been necessary if the query was written from scratch instead of using a view.
Not only does returning more data than you need cause the SQL Server to work harder, it causes the query to use up more of the network resources and more of the memory of the web server if you are holding the results in memory. It is an all-around poor choice.
Finally, are you using known poorly performing techniques when a better one is available? This would include the use of cursors when a set-based alternative is better, the use of correlated subqueries when a join would be better, the use of scalar user-defined functions, and the use of views that call other views (especially if you nest more than one level). Most of these poor techniques involve processing row-by-agonizing-row, which is generally the worst choice in a database. To properly query databases you need to think in terms of data sets, not processing one row at a time.
There are plenty more things that affect performance of queries and the database; to truly get a grip on this subject you need to read some books on the subject. This is too complex a subject to fully discuss in a message board.
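The sargability point made earlier in this answer can be checked against a query plan. The sketch below is an assumption-laden illustration: it uses SQLite via Python's sqlite3 module rather than SQL Server, and the people table and ix_people_name index are invented. It compares a leading-wildcard LIKE, a function wrapped around the column, and a plain range on the bare column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute("CREATE INDEX ix_people_name ON people (name)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Leading wildcard: the sort order of the index is no help, so every row is read.
leading_wildcard = plan("SELECT city FROM people WHERE name LIKE '%test'")

# Function wrapped around the column: the index on name cannot be searched.
wrapped = plan("SELECT city FROM people WHERE substr(name, 1, 1) = 'T'")

# Range on the bare column (same rows for plain ASCII names): index search.
sargable = plan("SELECT city FROM people WHERE name >= 'T' AND name < 'U'")

for label, p in [("wildcard", leading_wildcard), ("wrapped", wrapped),
                 ("range", sargable)]:
    print(label, p)
```

The second and third queries ask for the same rows, which is the usual way out of a non-sargable predicate: rewrite it so the indexed column stands alone on one side of the comparison.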

Or should I write a query that just yanks all the columns..
No. Just today there was another question about that.
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Any useless join or data retrieval costs you time and should be avoided. Retrieving rows from a datastore is costly. Joins can be more or less costly depending on the context, the number of indexes defined... you can examine the query plan of each query to see the estimated cost for each step.

Selecting more columns/rows will have some performance impacts, but honestly why would you want to select more data than you are going to use anyway?
If possible, could you help me
understand what aspects of a query
would be relatively costly compared to
one another?
Build the query you need, THEN worry about optimizing it if the performance doesn't meet your expectations. You are putting the cart before the horse.

To answer the following:
How costly would SELECT One, Two,
Three be compared to SELECT One, Two,
Three, ..... N-Column
This is not a matter of the select performance but of the amount of time it takes to fetch the data. Select * from Table and Select ID from Table perform the same, but the fetch of the data will take longer for the former. This goes hand in hand with the number of rows returned from a query.
As for understanding performance, here is a good link:
http://www.dotnetheaven.com/UploadFile/skrishnasamy/SQLPerformanceTunning03112005044423AM/SQLPerformanceTunning.aspx
Or google tsql Performance

Joins have the potential to be expensive. In the worst case scenario, when no indexes can be used, they require O(M*N) time, where M and N are the number of records in the tables. To speed things up, you can CREATE INDEX on columns that are part of the join condition.
The number of columns has little effect on the time required to find rows, but slows things down by requiring more data to be sent.
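A minimal sketch of the CREATE INDEX advice above, again using SQLite through Python's sqlite3 module as a stand-in engine. The customers and orders tables and the index name ix_orders_cust_no are hypothetical. The plan is captured before and after the index is created on the join column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Deliberately no keys or indexes on the join column to start with.
conn.execute("CREATE TABLE customers (cust_no INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (cust_no INTEGER, amount REAL)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


join_sql = (
    "SELECT c.name, o.amount FROM customers c "
    "JOIN orders o ON o.cust_no = c.cust_no"
)

# No usable index yet: the engine has to fall back on something else
# (SQLite, for instance, may build a temporary automatic index).
before = plan(join_sql)

conn.execute("CREATE INDEX ix_orders_cust_no ON orders (cust_no)")

# With the index in place, matching orders can be looked up per customer.
after = plan(join_sql)

print(before)
print(after)
```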

What others are saying is all true.
But typically, if you are working with tables that already have good indexes, what's most important for performance is what goes into the WHERE clause. There you have to worry more about using a field that has no index or using a condition that can't be optimized.

The difference between SELECT One, Two, Three FROM ... and SELECT One,...,N FROM ... could be like the difference between day and night. To understand the problem, you need to understand the concept of a covering index:
A covering index is a special case
where the index itself contains the
required data field(s) and can return
the data.
As you add more unnecessary columns to the projection list you are forcing the query optimizer to look up the newly added columns in the 'table' (really in the clustered index or in the heap). This can change an execution plan from an efficient narrow index range scan or seek into a bloated clustered index scan, which can result in differences of times from sub-second to +hours, depending on your data. So projecting unnecessary columns is often the most impacting factor of a query.
The number of records pulled is a more subtle issue. With a large number, a query can hit the index tipping point and choose, again, a clustered index scan over a narrower index range scan and lookup. Now the fact that lookups into the clustered index are necessary to start with means the narrow index is not covering, which ultimately may be caused by projecting unnecessary columns.
And finally, joins. The question here is joins, as opposed to what else? If a join is required, there is no alternative, and that's all there is to say about this.
Ultimately, query performance is driven by one factor alone: amount of IO. And the amount of IO is driven ultimately by the access paths available to satisfy the query. In other words, by the indexing of your data. It is impossible to write efficient queries on bad indexes. It is possible to write bad queries on good indexes, but more often than not the optimizer can compensate and come up with a good plan. You should spend all your effort in better understanding index design:
Designing Indexes
SQL Server Optimization

Short answer: Don't select more fields than you need - search for "*" in both your source code and your stored procedures ;)
You always have to consider which parts of the query will cause which costs.
If you have a good DB design, joining a few tables is usually not expensive. (Make sure you have correct indices).
The main issue with "select *" is that it will cause unpredictable behavior in your results. If you write a query like that, AND access the fields by column index, you will be locked into the DB schema forever.
Another thing to consider is the amount of data you have to transfer. You might think it's trivial, but then version 2.0 of your application suddenly adds a ProfilePicture to the User table. And now the query that selects 100 Users will suddenly use up several megabytes of bandwidth.
The second thing you should consider is the number of rows you return. SQL is very powerful at sorting and grouping, so let SQL do its job, and don't move it to the client. Limit the number of records you return. In most applications it makes no sense to return more than 100 rows to a user at once. You might let the user choose to load more, but make it a choice he has to make.
Finally, monitor your SQL Server. Run a profiler against it, and try to find your worst queries. A SQL query should not take longer than half a second; if it does, something is most likely messed up (Yes... there are operations that can take much longer, but those should have a reason)
Edit:
Once you have found the slow query, look at the execution plan... You will see which parts of the query are expensive, and which parts work well... The Optimizer is also a tool that can be used.

I suggest you consider your queries in terms of I/O first. Disk I/O on my SATA II system is 6Gb/sec. My DDR3 memory bandwidth is 12GB/sec. I can move items in memory 16 times faster than I can retrieve from disk. (Ref Wikipedia and Tom's hardware)
The difference between getting a few columns and all the columns for your 100 rows could be the difference between getting a single 8K page from disk and getting two or more pages from disk. Once the pages are finally in memory, moving two columns or all columns to a hash table is faster than any measuring tool I have.
I value the advice of the others on this topic related to database design. The design of narrow indexes, using included columns to make covering indexes, avoiding table or index scans in favor of seeks by using an appropriate WHERE clause, narrow primary keys, etc. is the difference between having a DBA title and being a DBA.

effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.

Reducing the number of columns would, I think, have only very limited effect on the speed of the query but would have a potentially larger effect on the transfer speed of the data. The less data you select, the less data that would need to be transferred over the wire to your application.

I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
    ID int NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL,
    Status tinyint NOT NULL
)
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
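The same effect can be observed in a query plan. The sketch below is a stand-in under stated assumptions: it runs on SQLite through Python's sqlite3 module rather than SQL Server, and because SQLite has no INCLUDE clause, Status is added as a second key column to make the index covering. Table and index names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, "
    "Name TEXT NOT NULL, Status INTEGER NOT NULL)"
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT ID, Name, Status FROM MyTable WHERE Name = 'Aaron'"

# Narrow index: finds the matching entries, then looks up each row for Status.
conn.execute("CREATE INDEX IX_MyTable ON MyTable (Name)")
narrow = plan(query)

# Wider index: SQLite has no INCLUDE, so Status becomes a second key column.
conn.execute("DROP INDEX IX_MyTable")
conn.execute("CREATE INDEX IX_MyTable ON MyTable (Name, Status)")
covering = plan(query)

print(narrow)
print(covering)
```

In SQLite's output the second plan is marked as using a COVERING INDEX, which corresponds to the "never needs to read the row" case described above.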

Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing. More columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections to somehow save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.

It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, one of the servers handles some other data whose usage might vary independently from the rest).

Yes, if your query can be covered by a nonclustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched by the engine.

To demonstrate what tvanfosson has already written, that there is a "transfer" cost I ran the following two statements on a MSSQL 2000 DB from query analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also because the fields are the same I would not expect indexing to factor here.

eligibility for creating index

I have created a script to find the selectivity of each column in every table. Some of those tables have fewer than 100 rows, but the selectivity of a column is more than 50%,
where Selectivity = Distinct Values / Total Number of Rows
So, are those columns eligible for an index?
Or can you tell me the minimum number of rows a table needs to be eligible for an index?

I think I understand what you are trying to accomplish by calculating a 'Selectivity' value for your data but you cannot apply the rule blindly.
In fact, for certain queries the 'Selectivity' value might be really low and an index will still be very beneficial. For example:
Assume an 'inbox' table with millions of rows, where each row has a 'Read' boolean field. In this case the number of distinct values over the number of rows will be really low. If most items are read most of the time, then finding unread items with an index on this field will be very efficient.
Creating indexes comes at a cost. Although you get the benefit for reads, you pay for writes and disk usage.
I would rather recommend that you profile your queries and index accordingly. You can also look at the data from sys.dm_db_missing_index_group_stats and other dynamic management views, which will give you insight into index usage (or missing indexes).
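To put numbers on the inbox example, here is a small sketch that computes the question's Selectivity formula next to the fraction of rows an "unread" query would actually return. It uses Python's sqlite3 module purely as a convenient engine, and the table, column name, and row counts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inbox (id INTEGER PRIMARY KEY, is_read INTEGER NOT NULL)"
)
# 1,000 hypothetical messages, 10 of them unread.
conn.executemany(
    "INSERT INTO inbox (is_read) VALUES (?)", [(1,)] * 990 + [(0,)] * 10
)

# Selectivity as defined in the question: distinct values / total rows.
selectivity = conn.execute(
    "SELECT COUNT(DISTINCT is_read) * 1.0 / COUNT(*) FROM inbox"
).fetchone()[0]

# Fraction of rows a query for unread messages would actually return.
unread_fraction = conn.execute(
    "SELECT SUM(CASE WHEN is_read = 0 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) "
    "FROM inbox"
).fetchone()[0]

print(selectivity, unread_fraction)
```

The column-wide ratio looks hopeless for indexing, yet the query for unread items touches only a small slice of the table, which is why the per-column number cannot be applied blindly.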

You can create a index on a table with 0 rows, 1 row or a 100 million rows. You can create an index where every column has the same value or unique values.
So you can create an index. The question is really should you create an index and no tool is going to tell you that because indexes can also be multi-value and it depends on what queries you run. Creating indexes is something done when performance tuning queries or preemptively when you know that you'll be creating queries that are using it.
Every index comes with a cost in terms of space and time required to do updates, inserts and deletes. You don't want to be creating them spuriously so you're really going to have to do this by hand, not as a result of a script to see how unique the value of a column is.

A general rule of thumb says that if you have a very large table (over 1 million rows), you should only use an index if a WHERE clause based on that index selects at most something in the neighborhood of 1-2% of the data.
If you have a "gender" column and roughly 50% of values are "male" and roughly 50% "female", then having an index on that really doesn't give you much - SQL Server and most other RDBMS will most likely still do a full table scan in this case, since on average, they'd have to scan at least half the table anyway, so the "detour" by using an index first and then looking up the actual full data based on that index value is just not worth it.
An index is excellent if you have something like unique keys (customer number), or a value that is quite selective. An index is not without cost - it uses up disk space, it needs to be maintained, it will slightly slow down all operations besides the SELECT - so tread carefully; it's not the best idea to just blindly index everything. Having too few indices is bad - but having too many, and the wrong ones, can be even worse! :-) Nobody ever claimed getting your indices right was easy.... :-)
But there's definitely help out there - the best source I know are Kimberly Tripp's excellent blog posts on SQL Server indexing (and many other topics).
Marc
