I am using views for query convenience. The view is a join between three tables, using INNER JOIN and RIGHT OUTER JOIN. The overall result set from the view could be 500,000 records. I then run other queries off this view, similar to:
SELECT colA, colB, colC FROM vwMyView WHERE colD = 'ABC'
This query might return only 30 or so results. How will this be for performance? Internally, will the SQL engine always execute the view and then apply the WHERE clause afterwards, or is SQL Server smart enough to apply the WHERE clause first so that the JOIN operations are only done on a subset of records?
If I'm only returning 30 records to the middle tier, do I need to worry too much that the SQL Server had to trawl through 500,000 records to get to those 30 records? I have indexes applied on all important columns on the base tables.
Using MS SQL Server; the view is not materialized.
Usually, a view is treated in much the same way as a macro might be in other languages: the body of the view is "expanded out" into the query it's a part of, before the query is optimized. So your concern about it computing all 500,000 results first is unfounded.
The exception to the above is an indexed view (SQL Server; the query has to use appropriate hints, or you have to be using a high-end edition) or a materialized view (Oracle; I'm not sure of the requirements there). In those cases the view isn't expanded out: the results have already been computed beforehand and are stored much like a real table's rows, so again, there shouldn't be too much concern whilst actually querying.
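For illustration, here is a minimal sketch of a SQL Server indexed view (the table names dbo.TableA and dbo.TableB are hypothetical). Note that indexed views carry restrictions: SCHEMABINDING, two-part names, and no outer joins, among others, so a view using RIGHT OUTER JOIN like yours would need restructuring before it could be indexed.

CREATE VIEW dbo.vwMyViewIndexed
WITH SCHEMABINDING  -- required for indexed views
AS
SELECT b.BID, a.colA, a.colB, b.colC, b.colD
FROM dbo.TableB AS b
INNER JOIN dbo.TableA AS a ON a.ID = b.AID;
GO

-- The unique clustered index is what materializes the view's rows on disk.
-- BID is assumed to be TableB's primary key, so it is unique in the result.
CREATE UNIQUE CLUSTERED INDEX IX_vwMyViewIndexed ON dbo.vwMyViewIndexed (BID);
GO

-- On non-Enterprise editions, the NOEXPAND hint is needed for the stored
-- results to be used instead of expanding the view's definition:
SELECT colA, colB, colC
FROM dbo.vwMyViewIndexed WITH (NOEXPAND)
WHERE colD = 'ABC';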
Without a materialized view, the SQL behind your view will always be executed when the view is used, e.g. inside the FROM clause. Of course, some caching may be possible, but that depends on your DBMS and your configuration.
To see what the database is doing in the background, you might like to start with EXPLAIN ANALYZE <your query> (or your DBMS's equivalent).
Performance of queries on large datasets typically needs clever application of indexes. In your case, a simple index on colD will probably do the trick. Depending on the data, different types of indexes might need scrutiny: hash tables, B-trees, etc. all behave differently depending on the data, so there is no one solution that rules them all. Otherwise, optimization is better left to the query optimizer in your RDBMS; its developers have spent quite some time on it, and the critical segments are in low-level, heavily tuned code.
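For example, a minimal sketch, assuming colD comes from a hypothetical base table dbo.TableA:

CREATE NONCLUSTERED INDEX IX_TableA_colD ON dbo.TableA (colD);

-- Once the view is expanded into the outer query, the optimizer can seek
-- on this index to satisfy WHERE colD = 'ABC' before doing the joins.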
On another note, clever cleaning of the data might be considered as well. And if aggregation is required, consider data warehousing with clever dimensions and pre-aggregated values. Storage is cheap these days; computing time maybe not so much.
Related
So I have this query:
SELECT *
FROM ViewTechnicianStatus
WHERE NotificationClass = 2
AND MachineID IN (SELECT ID FROM MachinesTable WHERE DepartmentID = 1 AND IsMachineActive <> 0)
--ORDER BY ResponseDate DESC
The view is huge and complex, with a lot of joins and subqueries. When I run this query it takes forever to finish; however, if I add the ORDER BY it finishes instantly and returns 20 rows as intended. I don't understand how adding the ORDER BY could have such a huge positive impact on the performance. I'd love it if somebody could explain the phenomenon to me.
EDIT:
Here is the rundown with the SET STATISTICS TIME, IO ON; flags on. Sorry for the hidden table names, but I don't think I can expose those.
Without ORDER BY
With ORDER BY
To answer your question: the reason your query runs faster when adding the ORDER BY is indexing. Probably all the clients you tested on had indexes for those specific fields/tables, and using the ORDER BY allowed them to be exploited, improving performance.
Summary
OK.. I've thought about this for a while as I think it's an interesting issue. I believe it's very much an edge case - which is part of what makes it interesting.
I'm taking an educated guess based on the info provided - obviously, without being able to see it/play with it, I cannot be certain. But I think this explanation matches the evidence based on the info you provide and the statistics.
I think the main issue is a poor query plan. In the version without the sort, it uses an inappropriate nested loop; in the version with the sort, it does (say) a hash match or merge join.
I have found that SQL Server often has issues with query plans within complex views that reference other views, and especially if those sub-views have group bys/sorts/etc.
To demonstrate the difference that could occur, I'll simplify your complex view into two subgroups I'll call 'view groups' (each of which may be one or several views and tables; not a technical term, just a term to summarise them).
The first view group contains most tables,
The second view group contains the views getting data from tables 6 and 7.
For both approaches, how SQL uses the data within the view groups is probably the same (e.g., it uses the same indexes, etc.). However, there's a difference in its approach to the join between the two view groups.
Example - query planner underestimates cost of view group 2 and doesn't care which method is used
I'm guessing that:
The first view group, at the point of the join, is dealing with about 3000 rows (it hasn't filtered them down yet), and
The query planner thinks view group 2 is cheap to run.
In the version without the ORDER BY, the query plan uses a nested loop join. That is, it gets each row in view group 1, and then for each row it runs view group 2 to get the relevant data. This means view group 2 is run roughly 3000 times (once for each row in view group 1).
In the version with the ORDER BY, it decides to do (say) a hash match between view group 1 and view group 2. This means it has to run view group 2 only once, but it spends a bit more time sorting. However, because you asked for it to be sorted anyway, it chooses the hash match.
However, because the query planner underestimated the cost of view group 2, the hash match turns out to be a much, much better query plan for the circumstances.
Example - query planner use of cached plans
I believe (but may be wrong!) that when you reference views within views, it can often just use cached plans for the sub-views rather than trying to get the best plan possible for your current situation.
It may be that one of your views uses the "cached plan" whereas the other one tries to optimise the query plan including the sub-views.
Ironically, it may be that the query version with the ORDER BY is more complex, and in this case it uses the cached plans for view group 2. However, since it knows it hasn't optimised the plan for view group 2, it simply gets the data for view group 2 once, then keeps all the results in memory and uses them in a hash match.
In contrast, in the version without the ORDER BY, it takes a shot at optimising the query plan (including optimising how it uses the views), and makes a mess of it.
Possible solutions
These are all possibilities - they may make it better or may make it worse! Note that SQL is a declarative language (you tell the computer what to do/what you want, but not how to do it).
This is not a comprehensive list of possibilities, but here are some things you can try:
Pre-calculate all or part(s) of the views (e.g., put the pre-calculated stuff from tables 6 and 7 into a temporary table, then use the temporary tables in the views)
Simplify the SQL and/or move all the SQL into a single view that doesn't call other views
Use join hints e.g., instead of INNER JOIN, use INNER HASH JOIN in the appropriate position
Use OPTION (RECOMPILE) (the last two suggestions are sketched below)
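For illustration, a hedged sketch of the last two suggestions applied to your query. Note that this rewrites the IN subquery as a join, which is equivalent only if MachinesTable.ID is unique (as an ID column presumably is):

SELECT vts.*
FROM ViewTechnicianStatus AS vts
INNER HASH JOIN MachinesTable AS m  -- forces a hash join where the bad nested loop was chosen
    ON m.ID = vts.MachineID
WHERE vts.NotificationClass = 2
  AND m.DepartmentID = 1
  AND m.IsMachineActive <> 0
OPTION (RECOMPILE);  -- forces a fresh plan on each run instead of a cached one

Bear in mind that a T-SQL join hint also forces the join order as written, so it can backfire if the forced order is poor; compare the plans before and after.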
We have a table design that consists of 10,000,000 records and 200,000 columns.
The columns are a mixture of:
Binary flags.
Integers.
The queries need to perform AND / OR operations on 1-100 columns at a time, and should complete in under 0.1 seconds, returning only a projection/subset of each matched row.
Around 10 new columns get added per day.
Around 1,000 new rows get added per day.
There are no joins.
Which DBMS is best suited for this?
Reason behind this approach:
The columns are materialized indexes from user-defined queries: that's why new columns get added each day (as more users come up with their own queries). The other option would be to not use materialized views, and have the users' queries perform joins. The problem here is that the queries could take any form, and in aggregate there would be a large number of very different execution plans across everyone's queries... since the user defines the query, it's kind of impossible to optimise a traditional SQL database using indexes, normalised tables, etc.
First, I'd suggest measuring ad-hoc JOINs, and only doing further optimization if you find the performance lacking. I understand it could be difficult to measure every possible query, but you may be able to cover the most common/representative cases, and if they perform well enough, just stop there. There is a lot that can be done with good indexing!
Second, and only if the measurements above warrant it, create a new separate materialized view for each ad-hoc query.
Some databases will be able to maintain such views automatically for you [1], so if the "base" data changes, relevant results will be automatically added or removed from the materialized view (just as they would be from the "live" query result).
Other databases may allow periodic refresh [2].
Be warned though: maintaining materialized views is not free, and having thousands of them (especially if they are constantly kept up-to-date, as opposed to periodically refreshed) will definitely impact the insert/update/delete performance on the base data!
[1] E.g. SQL Server indexed views.
[2] E.g. Oracle materialized views, although it looks like 12c can also do something close to SQL Server's immediate refresh.
Setting aside why you want to go with thousands of columns: see the comparison below for databases that support a very large number of columns.
References: https://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column
If you have a SQL query that has two or three tables joined together and is retrieving 100 rows of data, does performance have anything to say about whether I should be selecting only the columns I need? Or should I write a query that just yanks all the columns?
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Would 1 record vs. 10 records vs. 100 records matter?
As an extremely generalized version of ranking those factors you mention in terms of performance penalty and occurrence in the queries you write, I would say:
Joins - Especially when joining on tables with no indexes for the fields you're joining on and/or with tables that have a very large amount of data.
# of Rows / Amount of Data - Again, indexes mitigate this quite a bit, just make sure you have the right ones.
# of Fields - I would say the # of fields in the SELECT clause impacts performance the least in most situations.
I would say any performance-driving property is always coupled with how much data you have - sure a join might be fast when your tables have 100 rows each, but when millions of rows are in the tables, you have to start thinking about more efficient design.
Several things impact the cost of a query.
First, are there appropriate indexes for it to use? Fields that are used in a join should almost always be indexed, and foreign keys are not indexed by default; the designer of the database must create those indexes. Fields used in the WHERE clause often need indexes as well.
Next, is the WHERE clause sargable? In other words, can it use the indexes even if you have the correct ones? A bad WHERE clause can hurt a query far more than joins or extra columns. You can't get anything but a table scan if you use syntax that prevents the use of an index, such as:
LIKE '%test'
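For example (the Customers and Orders tables here are hypothetical), compare:

-- Not sargable: a leading wildcard defeats an index on LastName (scan).
SELECT * FROM Customers WHERE LastName LIKE '%test';
-- Sargable: a trailing wildcard can use an index seek.
SELECT * FROM Customers WHERE LastName LIKE 'test%';

-- Not sargable: wrapping the indexed column in a function hides it.
SELECT * FROM Orders WHERE YEAR(OrderDate) = 2009;
-- Sargable: the same filter expressed as a range on the bare column.
SELECT * FROM Orders WHERE OrderDate >= '20090101' AND OrderDate < '20100101';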
Next, are you returning more data than you need? You should never return more columns than you need, and you should not use SELECT * in production code: it requires additional work to look up the columns, and it is fragile and prone to creating bad bugs as the structure changes over time.
Are you joining to tables you don't need to join to? If a table contributes no columns to the SELECT, is not used in the WHERE, and doesn't filter out any records when the join is removed, then you have an unnecessary join that can be eliminated. Unnecessary joins are particularly prevalent when you use a lot of views, especially if you make the mistake of calling views from other views (which is a big performance killer for many reasons). Sometimes if you trace through these views that call other views, you will see the same table joined multiple times when it would not have been necessary if the query were written from scratch instead of using a view.
Not only does returning more data than you need make the SQL Server work harder, it also uses up more network resources and more of the web server's memory if you are holding the results in memory. It is an all-around poor choice.
Finally, are you using known poorly performing techniques when a better one is available? This includes using cursors when a set-based alternative is better, using correlated subqueries when a join would be better, using scalar user-defined functions, and using views that call other views (especially if you nest more than one level). Most of these poor techniques involve processing row-by-agonizing-row, which is generally the worst choice in a database. To properly query databases you need to think in terms of data sets, not processing one row at a time.
There are plenty more things that affect the performance of queries and the database; to truly get a grip on this subject you need to read some books on it. It is too complex a subject to fully discuss on a message board.
Or should I write a query that just yanks all the columns..
No. Just today there was another question about that.
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another? Is it the joins? is it the large number of records pulled? is it the number of columns in the select statement?
Any useless join or data retrieval costs you time and should be avoided. Retrieving rows from a datastore is costly. Joins can be more or less costly depending on the context and the indexes defined; you can examine the query plan of each query to see the estimated cost of each step.
Selecting more columns/rows will have some performance impact, but honestly, why would you want to select more data than you are going to use anyway?
If possible, could you help me understand what aspects of a query would be relatively costly compared to one another?
Build the query you need, THEN worry about optimizing it if the performance doesn't meet your expectations. Otherwise you are putting the cart before the horse.
To answer the following:
How costly would SELECT One, Two, Three be compared to SELECT One, Two, Three, ..... N-Column
This is not a matter of the SELECT's performance so much as the time it takes to fetch the data. SELECT * FROM Table and SELECT ID FROM Table perform the same, but the fetch of the data will take longer for the former. This goes hand in hand with the number of rows returned by a query.
As for understanding performance, here is a good link:
http://www.dotnetheaven.com/UploadFile/skrishnasamy/SQLPerformanceTunning03112005044423AM/SQLPerformanceTunning.aspx
Or google "tsql performance".
Joins have the potential to be expensive. In the worst case scenario, when no indexes can be used, they require O(M*N) time, where M and N are the number of records in the tables. To speed things up, you can CREATE INDEX on columns that are part of the join condition.
The number of columns has little effect on the time required to find rows, but slows things down by requiring more data to be sent.
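As a minimal sketch (the Orders/OrderItems tables are hypothetical), indexing the column on the many side of the join is what avoids the O(M*N) behaviour described above:

CREATE INDEX IX_OrderItems_OrderID ON OrderItems (OrderID);

-- With the index, each Orders row costs one index seek into OrderItems
-- instead of a full scan: roughly O(M log N) rather than O(M*N).
SELECT o.OrderID, oi.ProductID
FROM Orders AS o
INNER JOIN OrderItems AS oi ON oi.OrderID = o.OrderID;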
What others are saying is all true.
But typically, if you are working with tables that already have good indexes, what matters most for performance is what goes into the WHERE clause. There you have to worry most about using a field that has no index or using an expression that can't be optimized.
The difference between SELECT One, Two, Three FROM ... and SELECT One,...,N FROM ... could be like the difference between day and night. To understand the problem, you need to understand the concept of a covering index:
A covering index is a special case where the index itself contains the required data field(s) and can return the data.
As you add more unnecessary columns to the projection list, you force the query optimizer to look up the newly added columns in the 'table' (really in the clustered index or the heap). This can change an execution plan from an efficient narrow index range scan or seek into a bloated clustered index scan, which can mean the difference between sub-second times and hours, depending on your data. So projecting unnecessary columns is often the most impactful factor in a query.
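A hedged sketch of the difference, using SQL Server's INCLUDE syntax on a hypothetical Orders table:

-- Covering index: CustomerID is the key, and the INCLUDE columns ride along
-- at the leaf level, so the first query below never touches the base table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON Orders (CustomerID)
INCLUDE (OrderDate, Total);

-- Covered: narrow index seek, no lookups.
SELECT CustomerID, OrderDate, Total
FROM Orders
WHERE CustomerID = 42;

-- Not covered: ShipAddress is not in the index, so each matched row costs
-- a lookup into the clustered index (or the plan flips to a full scan).
SELECT CustomerID, OrderDate, Total, ShipAddress
FROM Orders
WHERE CustomerID = 42;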
The number of records pulled is a more subtle issue. With a large number, a query can hit the index tipping point and choose, again, a clustered index scan over a narrower index range scan and lookup. Now, the fact that lookups into the clustered index are necessary to start with means the narrow index is not covering, which ultimately may be caused by projecting unnecessary columns.
And finally, joins. The question here is joins, as opposed to what else? If a join is required, there is no alternative, and that's all there is to say about this.
Ultimately, query performance is driven by one factor alone: amount of IO. And the amount of IO is driven ultimately by the access paths available to satisfy the query. In other words, by the indexing of your data. It is impossible to write efficient queries on bad indexes. It is possible to write bad queries on good indexes, but more often than not the optimizer can compensate and come up with a good plan. You should spend all your effort in better understanding index design:
Designing Indexes
SQL Server Optimization
Short answer: Don't select more fields than you need. Search for "*" in both your source code and your stored procedures ;)
You always have to consider which parts of the query will cause which costs.
If you have a good DB design, joining a few tables is usually not expensive. (Make sure you have the correct indexes.)
The main issue with "select *" is that it will cause unpredictable behavior in your results. If you write a query like that, AND access the fields with the columnindex, you will be locked into the DB-Schema forever.
Another thing to consider is the amount of data you have to move. You might think it's trivial, but then version 2.0 of your application suddenly adds a ProfilePicture column to the User table, and now the query that selects 100 users suddenly uses up several megabytes of bandwidth.
The second thing you should consider is the number of rows you return. SQL is very powerful at sorting and grouping, so let SQL do its job and don't move it to the client. Limit the number of records you return. In most applications it makes no sense to return more than 100 rows to a user at once. You might let the user choose to load more, but make it a choice they have to make.
Finally, monitor your SQL Server. Run a profiler against it and try to find your worst queries. A SQL query should not take longer than half a second; if it does, something is most likely messed up. (Yes, there are operations that can take much longer, but those should have a reason.)
Edit:
Once you have found the slow query, look at the execution plan... You will see which parts of the query are expensive and which parts work well. The optimizer is also a tool that can be used.
I suggest you consider your queries in terms of I/O first. Disk I/O on my SATA II system is 6 Gbit/s (about 0.75 GB/s); my DDR3 memory bandwidth is 12 GB/s. I can move items in memory about 16 times faster than I can retrieve them from disk. (Ref: Wikipedia and Tom's Hardware.)
The difference between getting a few columns and getting all the columns for your 100 rows could be the difference between reading a single 8K page from disk and reading two or more pages. Once the pages are finally in memory, moving two columns or all columns to a hash table is faster than any measuring tool I have can detect.
I value the advice of the others on this topic related to database design. Designing narrow indexes, using included columns to make covering indexes, avoiding table or index scans in favor of seeks by using an appropriate WHERE clause, narrow primary keys, etc. is the difference between having a DBA title and being a DBA.
I have always hoped and assumed that it is not - that set theory (or something) provides a shortcut to the result.
I have created a non-updateable view that aggregates data from several tables, in a way that produces an exponential number of records. From this view, I query one record at a time. Because the underlying dataset is small, this technique works well - but I'm concerned it won't scale.
I've heard MySQL uses temporary tables to implement views. My heart lurches at the thought of potentially massive temp tables popping into and out of existence for each and every query.
Use EXPLAIN <select query> syntax to see what really happens when your query runs.
Generally speaking, using a view is equivalent to using subquery with the same SQL. No better and no worse, just a shortcut to prevent writing the same subquery over and over again.
Sometimes you'll end up with temporary tables used to resolve some complex queries, but it shouldn't happen often if the DB structure is well optimized, and using views instead of subqueries won't change anything.
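In MySQL specifically, you can see and influence this via the view's ALGORITHM clause (the table and view names below are hypothetical). MERGE expands the view into the outer query like a macro, while views containing aggregation silently fall back to TEMPTABLE, which is where the temporary tables you mention come from:

CREATE ALGORITHM = MERGE VIEW vwOrdersFlat AS
SELECT o.id, o.customer_id, c.name
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id;

-- EXPLAIN reveals whether the view was merged or materialized: a DERIVED
-- row (or "Using temporary") in the output indicates a temp table was built.
EXPLAIN SELECT * FROM vwOrdersFlat WHERE customer_id = 42;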
What techniques can be applied effectively to improve the performance of SQL queries? Are there any general rules that apply?
Use primary keys
Avoid select *
Be as specific as you can when building your conditional statements
De-normalisation can often be more efficient
Table variables and temporary tables (where available) will often be better than using a large source table (see the sketch after this list)
Partitioned views
Employ indices and constraints
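To illustrate the temp-table point above, a hedged T-SQL sketch (dbo.Orders is hypothetical): stage the narrow slice you need once, index it, and run the heavy work against the small staged set.

SELECT OrderID, CustomerID, Total
INTO #RecentOrders
FROM dbo.Orders
WHERE OrderDate >= '20090101';

-- Unlike table variables, temp tables can be indexed after the fact.
CREATE INDEX IX_RecentOrders_CustomerID ON #RecentOrders (CustomerID);

SELECT CustomerID, SUM(Total) AS TotalSpend
FROM #RecentOrders
GROUP BY CustomerID;

DROP TABLE #RecentOrders;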
Learn what's really going on under the hood - you should be able to understand the following concepts in detail:
Indexes (not just what they are but actually how they work).
Clustered indexes vs heap allocated tables.
Text and binary lookups and when they can be in-lined.
Fill factor.
How records are ghosted for update/delete.
When page splits happen and why.
Statistics, and how they affect various query speeds.
The query planner, and how it works for your specific database (for instance on some systems "select *" is slow, on modern MS-Sql DBs the planner can handle it).
The biggest thing you can do is to look for table scans in SQL Server Query Analyzer (make sure you turn on "Show Execution Plan"). Otherwise, there are a myriad of articles at MSDN and elsewhere that will give good advice.
As an aside, when I started learning to optimize queries I ran SQL Server Query Profiler against a trace, looked at the generated SQL, and tried to figure out why that was an improvement. Query Profiler is far from optimal, but it's a decent start.
There are a couple of things you can look at to optimize your query performance.
Ensure that you just have the minimum of data. Make sure you select only the columns you need. Reduce field sizes to a minimum.
Consider de-normalising your database to reduce joins
Avoid loops (i.e. fetch cursors), stick to set operations.
Implement the query as a stored procedure as this is pre-compiled and will execute faster.
Make sure that you have the correct indexes set up. If your database is used mostly for searching then consider more indexes.
Use the execution plan to see how the processing is done. What you want to avoid is a table scan as this is costly.
Make sure that Auto Statistics is set to on. SQL Server needs this to help decide on optimal execution plans. See Mike Gunderloy's great post for more info: Basics of Statistics in SQL Server 2005
Make sure your indexes are not fragmented (a query for checking this follows this list). Reducing SQL Server Index Fragmentation
Make sure your tables are not fragmented. How to Detect Table Fragmentation in SQL Server 2000 and 2005
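A sketch for finding the fragmented indexes mentioned above, using the sys.dm_db_index_physical_stats DMV (SQL Server 2005 and later):

SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
INNER JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC;
-- Common rule of thumb: REORGANIZE between 5% and 30% fragmentation, REBUILD above 30%.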
Use a WITH statement to handle query filtering.
Limit each subquery to the minimum number of rows possible, then join the subqueries.
WITH
master AS
(
SELECT SSN, FIRST_NAME, LAST_NAME
FROM MASTER_SSN
WHERE STATE = 'PA' AND
GENDER = 'M'
),
taxReturns AS
(
SELECT SSN, RETURN_ID, GROSS_PAY
FROM MASTER_RETURNS
WHERE YEAR < 2003 AND
YEAR > 2000
)
SELECT *
FROM master
INNER JOIN taxReturns
    ON master.SSN = taxReturns.SSN
Subqueries within a WITH statement may end up being the same as inline views, or automatically generated temp tables. I find, in the work I do with retail data, that about 70-80% of the time there is a performance benefit.
100% of the time, there is a maintenance benefit.
I think using SQL Query Analyzer would be a good start.
In Oracle you can look at the explain plan to compare variations on your query
Make sure that you have the right indexes on the table. If you frequently use a column to order or limit your dataset, an index can make a big difference. I saw in a recent article that SELECT DISTINCT can really slow down a query, especially if you have no index.
The obvious optimization for SELECT queries is ensuring you have indexes on columns used for joins or in WHERE clauses.
Since adding indexes can slow down data writes, you do need to monitor performance to ensure you don't kill the DB's write performance; that's where a good query analysis tool can help you balance things accordingly.
Indexes
Statistics
On the Microsoft stack, the Database Engine Tuning Advisor
Some other points (mine are based on SQL Server; since each DB backend has its own implementation, they may or may not hold true for all databases):
Avoid correlated subqueries in the SELECT part of a statement; they are essentially cursors.
Design your tables to use the correct datatypes to avoid having to apply functions on them to get the data out. It is far harder to do date math when you store your data as varchar for instance.
If you find that you are frequently doing joins that have functions in them, then you need to think about redesigning your tables.
If your WHERE or JOIN conditions include OR statements (which are slower) you may get better speed using a UNION statement.
UNION ALL is faster than UNION if (and only if) the two statements are mutually exclusive and return the same results either way.
NOT EXISTS is usually faster than NOT IN or using a LEFT JOIN with a WHERE clause of ID IS NULL.
In an UPDATE query, add a WHERE condition to make sure you are not updating values that are already equal (see the sketch at the end of this answer). The difference between updating 10,000,000 records and 4 can be quite significant!
Consider pre-calculating some values if you will be querying them frequently or for large reports. A sum of the values in an order only needs to be computed when the order is made or adjusted, rather than when you are summarizing the results of 10,000,000 orders in a report. Pre-calculations should be done in triggers so that they are always up to date as the underlying data changes. And it doesn't have to be just numbers: we have a calculated field that concatenates names that we use in reports.
Be wary of scalar UDFs; they can be slower than putting the code inline.
Temp tables tend to be faster for large data sets, and table variables faster for small ones. In addition, you can index temp tables.
Formatting is usually faster in the user interface than in SQL.
Do not return more data than you actually need.
This one seems obvious, but you would not believe how often I end up fixing it. Do not join to tables that you are not using either to filter the records or to supply one of the fields in the SELECT part of the statement. Unnecessary joins can be very expensive.
It is a very bad idea to create views that call other views that call other views. You may find you are joining to the same table 6 times when you only need to once, and creating millions of records in an underlying view in order to get the 6 that are in your final result.
In designing a database, think about reporting not just the user interface to enter data. Data is useless if it is not used, so think about how it will be used after it is in the database and how that data will be maintained or audited. That will often change the design. (This is one reason why it is a poor idea to let an ORM design your tables, it is only thinking about one use case for the data.) The most complex queries affecting the most data are in reporting, so designing changes to help reporting can speed up queries (and simplify them) considerably.
Database-specific implementations of features can be faster than using standard SQL (That's one of the ways they sell their product), so get to know your database features and find out which are faster.
And because it can't be said too often: use indexes correctly, not too many or too few. And make your WHERE clauses sargable (able to use indexes).
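As promised above, a sketch of the UPDATE tip, on a hypothetical dbo.Products table:

-- Without the last predicate this would rewrite (and log, and lock) every
-- row for the supplier; with it, only rows that actually change are touched.
UPDATE dbo.Products
SET Discontinued = 1
WHERE SupplierID = 42
  AND Discontinued <> 1;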