Is it possible to somehow optimize the performance of the queries (apart from playing with hardware and OS settings) under these conditions:
1) You can't add indexes.
2) You can't alter the queries themselves.
These are common constraints when benchmarking the performance of a database.
I understand that the DBMS has a query optimizer that plays a number game with all the statistics pertaining to accessing the tables touched by the query. Are there cases when the query optimizer comes up with suboptimal solutions? I know that you can force the optimizer to use a particular query plan. Not sure how to cache it though without altering the query plan. The DB in question is Sybase.
Independent of the specific case here (Sybase), there are multiple ways to optimize a query under the given conditions. Syntax is system-specific.
Most systems rely on statistics for finding the best query plan. So updating the statistics could help improve performance.
Many systems allow you to set an optimization level independently of the application. This can have a positive impact on performance.
Many systems allow query plans to be re-used for similar ad-hoc queries (dynamic SQL). Usually this has a positive impact.
Allowing the database system (independently of the OS) to assign more memory to bottlenecks can also help.
What privileges do you have, and what are the benchmark rules?
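As a rough illustration of the statistics point above, here is a minimal sketch that uses SQLite through Python's sqlite3 module. SQLite is only a stand-in chosen so the example runs anywhere; Sybase has its own commands (such as UPDATE INDEX STATISTICS), and the table and index names below are made up. It shows that the optimizer has no collected statistics until they are gathered explicitly, and that gathering them touches neither the queries nor the indexes.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 50, i * 1.5) for i in range(5000)],
)
conn.commit()


def collected_stats(connection):
    """Return the rows of sqlite_stat1, or [] if statistics were never gathered."""
    exists = connection.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE name = 'sqlite_stat1'"
    ).fetchone()[0]
    if not exists:
        return []
    return connection.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()


stats_before = collected_stats(conn)  # nothing has been gathered yet
conn.execute("ANALYZE")               # gather statistics for all tables and indexes
stats_after = collected_stats(conn)

print("before:", stats_before)
print("after: ", stats_after)
```

On a real benchmark the equivalent step would be run once before the timed queries, so that every run sees the same statistics.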
Data Henrik mentions optimisation level - you can set this system-wide for Sybase, or per session.
You can even have a flexible method that sets the level according to application name or login Id (see Rob Verschoor's Sybase site - login triggers.) I'd guess if you're not allowed to change queries or indexes you'd not likely be allowed to do this.
As far as I can tell you don't have a specific problem - you just mention benchmarking.
You should be sure all tables have UPDATE INDEX STATISTICS run on them, and you could then do your benchmarks with the 3 Sybase optimisation levels - OLTP, MIX, DSS.
If you have specific problems, that's another subject.
Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but its effect on performance is nearly impossible for us to predict. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDBMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, provided your system has the IO/CPU capacity to handle both tasks at the same time. Inserting, however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.
I'm hoping to catch the eye of someone with experience in both SQL Server and DB2. I thought I'd ask to see if anyone could comment on these from the top of their head. The following is a list of features with SQL Server, that I'd like to do with DB2 as well.
Configuration option "optimize for ad hoc workloads", which saves first-time query plans as stubs, to avoid memory pressure from heavy-duty one-time queries (especially helpful with an extreme number of parameterized queries). What - if any - is the equivalent for this with DB2?
On a similar note, what would be the equivalents for the SQL Server configuration options auto create statistics, auto update statistics and auto update statistics async, all of which are fundamental for creating and maintaining proper statistics without causing too much overhead during business hours?
Indexes. MSSQL standard for index maintenance is REORGANIZE when fragmentation is between 5 - 35%, REBUILD (technically identical to DROP & RECREATE) when over 35%. As importantly, MSSQL supports ONLINE index rebuilds which keeps the associated data accessible by read / write operations. Anything similar with DB2?
Statistics. In SQL Server the standard statistics update procedure is all but useless in larger DB's, as the sample ratio is far too low. Is there an equivalent to UPDATE STATISTICS X WITH FULLSCAN in DB2, or a similarly functioning consideration?
In MSSQL, REBUILD index operations also fully recreate the underlying statistics, which is important to consider with maintenance operations in order to avoid overlapping statistics maintenance. The best method for statistics updates in larger DB's also involves targeting them on a per-statistic basis, since full table statistics maintenance can be extremely heavy when for example only a few of the dozens of statistics on a table actually need to be updated. How would this relate to DB2?
Show execution plan is an invaluable tool for analyzing specific queries and potential index / statistic issues with SQL Server. What would be the best similar method to use with DB2 (Explain tools? Or something else)?
Finding the bottlenecks: SQL Server has system views such as sys.dm_exec_query_stats and sys.dm_exec_sql_text, which make it extremely easy to see the most run, and most resource-intensive (number of logical reads, for instance) queries that need tuning, or proper indexing. Is there an equivalent query in DB2 you can use to instantly recognize problems in a clear and easy to understand manner?
All these questions represent a big chunk of where many of the problems are with SQL Server databases. I'd like to take that know-how, and translate it to DB2.
I'm assuming this is about DB2 for Linux, Unix and Windows.
Configuration option "optimize for ad hoc workloads", which saves first-time query plans as stubs, to avoid memory pressure from heavy-duty one-time queries (especially helpful with an extreme number of parameterized queries). What - if any - is the equivalent for this with DB2?
There is no equivalent; DB2 will evict least recently used plans from the package cache. One can enable automatic memory management for the package cache, where DB2 will grow and shrink it on demand (taking into account other memory consumers of course).
what would be the equivalents for SQL Server configuration options auto create statistics, auto update statistics and auto update statistics async.
Database configuration parameters auto_runstats and auto_stmt_stats
MSSQL standard for index maintenance is REORGANIZE when fragmentation is between 5 - 35%, REBUILD (technically identical to DROP & RECREATE) when over 35%. As importantly, MSSQL supports ONLINE index rebuilds
You have an option of automatic table reorganization (which includes indexes); the trigger threshold is not documented. Additionally you have a REORGCHK utility that calculates and prints a number of statistics that allow you to decide what tables/indexes you want to reorganize manually. Both table and index reorganization can be performed online with read-only or full access.
Is there an equivalent to UPDATE STATISTICS X WITH FULLSCAN in DB2, or a similarly functioning consideration? ... The best method for statistics updates in larger DB's also involves targeting them on a per-statistic basis, since full table statistics maintenance can be extremely heavy when for example only a few of the dozens of statistics on a table actually need to be updated.
You can configure automatic statistics collection to use sampling or not (configuration parameter auto_sampling). When updating statistics manually using the RUNSTATS utility you have full control over the sample size and what statistics to collect.
Show execution plan is an invaluable tool for analyzing specific queries and potential index / statistic issues with SQL Server. What would be the best similar method to use with DB2
You have both GUI (Data Studio, Data Server Manager) and command-line (db2expln, db2exfmt) tools to generate query plans, including plans for statements that are in the package cache or are currently executing.
Finding the bottlenecks: SQL Server has system views such as sys.dm_exec_query_stats and sys.dm_exec_sql_text, which make it extremely easy to see the most run, and most resource-intensive (number of logical reads, for instance) queries that need tuning
There is an extensive set of monitor procedures, views and table functions, e.g. MONREPORT.DBSUMMARY(), TOP_DYNAMIC_SQL, SNAP_GET_DYN_SQL, MON_CURRENT_SQL, MON_CONNECTION_SUMMARY etc.
I'm currently researching a very large table (~100 million rows, 35 columns). It's currently stored in a SQL db, but the queries I'm running (and they're various) run very, very slowly.
So I gather I should probably move to a NoSQL db. My questions are:
How can I tell which (NoSQL) db is best for me?
How can I move my current SQL table to the new NoSQL scheme?
OR should I stay in SQL and just fine tune it?
A few more details: rows will not be added or removed. This is historical data and all of the analysis will be done on that table. I plan to run various queries on it. The data is numerical.
I routinely work with a SQL Server 2012 table that has 900 million rows. This table has rows being added to it about every 2 minutes with a total of about 200K per day. I can query this table and get rows back in a couple seconds (using the clustered index / PK). I can also query on one of the other indexes and get results back in seconds or less.
So, it's all a matter of making sure your indexes are set up correctly, AND BEING USED!! Check your queries against the query plan being generated and make sure seeks are being done.
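To make "check your queries against the query plan" concrete, here is a small sketch. It uses SQLite via Python's sqlite3 module purely as a runnable stand-in (SQL Server has its own execution plan tools), and the readings table and index name are hypothetical. The same query is planned before and after an index exists, so the change from a full scan to an index search is visible.

```python
import sqlite3


def query_plan(connection, sql, params=()):
    """Return the optimizer's plan for a statement as one line of text."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor_id INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO readings (sensor_id, value) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10000)],
)

sql = "SELECT * FROM readings WHERE sensor_id = ?"

# No index on sensor_id yet: the engine has to read the whole table.
plan_without_index = query_plan(conn, sql, (42,))

# With an index in place the same, unchanged query can seek instead.
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor_id)")
plan_with_index = query_plan(conn, sql, (42,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact wording of the plan text differs between SQLite versions, but the first plan reports a scan of the table and the second a search that names the index.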
There could be good reasons for moving to NoSQL, or something similar. But moving to NoSQL because you think you can't get good performance in SQL Server, before making sure you've done everything you can do to improve performance first, is not a good reason.
Some food for thought:
100M rows is well within SQL's "sweet spot". You can grow by x10 and still be assured that SQL will be able to support you with fairly trivial effort.
NoSQL is not a silver bullet for solving performance problems at scale. It offers a set of tradeoffs which, with careful planning, can provide better results. But it sounds like you don't fully understand your performance issues in SQL, and without that your chances of making the correct design decisions in a NoSQL environment are slim.
One of the common tradeoffs in NoSQL systems is that they typically provide less flexibility in querying, in return for greater flexibility in schema management. You mentioned your queries are "various". If they are truly varied or, more importantly, frequently changing, then moving to a NoSQL system can put you in a world of pain, especially if you are not familiar with the technology yet.
Bottom line- You aren't doing anything which is clearly "beyond" the capabilities of SQL, and your problems are probably caused more by inefficient implementation than by any inherent platform limitations. Moving to a NoSQL system won't magically solve any of your problems, and will probably introduce new ones.
If you are running a query on columns that are not indexed, it will be very slow. You can add more indexes to speed such queries up. If your DB is static this should work.
One major speed up is the usage of map-reduce queries, where aggregations are carried out by multiple processes or computers. NoSQL databases like MongoDB can be used in such ways. But even MySQL has Cluster capabilities nowadays: http://www.mysql.de/products/cluster/scalability.html. SQL Server can be clustered as well.
So I guess the best first shot would be to tailor the indexes on the table to the queries. Each column the query uses as an argument (for comparing, counting, etc.) should be indexed.
If this is not doing any better you probably count and calculate a lot and you should use map-reduce jobs and a DB which can handle this like MongoDB: http://docs.mongodb.org/manual/aggregation/
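The map-reduce idea mentioned here can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not of MongoDB's API: the data is a made-up list of numbers, the helper names are hypothetical, and a thread pool stands in for the separate processes or machines a real cluster would use. Each worker computes a partial aggregate over its own chunk (map), and the partial results are then combined into the final answer (reduce).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_partial(chunk):
    """Map step: compute a partial (sum, count) for one chunk."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two partial (sum, count) results."""
    return (left[0] + right[0], left[1] + right[1])


# Made-up numerical column: the integers 1..1000.
values = list(range(1, 1001))

# Threads stand in for the worker nodes of a real map-reduce system.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_partial, chunked(values, 100)))

total, count = reduce(combine, partials, (0, 0))
mean = total / count

print(total, count, mean)
```

The pattern works for aggregates that can be merged from partial results (sum, count, min, max, and a mean derived from sum and count), which is why the map step returns both the sum and the count rather than a per-chunk mean.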
I hope this helps
I am an advocate of ORM-solutions and from time to time I am giving a workshop about Hibernate.
When talking about framework-generated SQL, people usually start talking about how they need to be able to use "hints", and this is supposedly not possible with ORM frameworks.
Usually something like: "We tried Hibernate. It looked promising in the beginning, but when we let it loose on our very very complex production database it broke down because we were not able to apply hints!".
But when asked for a concrete example, the memory of those people is suddenly not so clear any more ...
I usually feel intimidated, because the whole "hints"-topic sounds like voodoo to me...
So can anybody enlighten me? What is meant by SQL-hints or DB-Hints?
The only thing I know, that is somehow "hint-like" is SELECT ... FOR UPDATE. But this is supported by the Hibernate-API...
A SQL statement, especially a complex one, can actually be executed by the DB engine in any number of different ways (which table in the join to read first, which index to use based on many different parameters, etc).
An experienced dba can use hints to encourage the DB engine to choose a particular method when it generates its execution plan. You would only normally need to do this after extensive testing and analysis of the specific queries (because the DB engines are usually pretty darn good at figuring out the optimum execution plan).
Some MSSQL-specific discussion and syntax here:
http://msdn.microsoft.com/en-us/library/ms181714.aspx
Edit: some additional examples at http://geeks.netindonesia.net/blogs/kasim.wirama/archive/2007/12/31/sql-server-2005-query-hints.aspx
Query hints are used to guide the query optimiser when it doesn't produce sensible query plans by default. First, a small background in query optimisers:
Database programming is different from pretty much all other software development because it has a mechanical component. Disk seeks and rotational latency (waiting for a particular sector to arrive under the disk head) are very expensive in comparison to CPU. Different query resolution strategies will result in different amounts of I/O, often radically different amounts. Getting this right or wrong can make a major difference to the performance of the query. For an overview of query optimisation, see this paper.
SQL is declarative: you specify the logic of the query and let the DBMS figure out how to resolve it. A modern cost-based query optimiser (some systems, such as Oracle, also have a legacy query optimiser retained for backward compatibility) will run a series of transformations on the query. These maintain semantic equivalence but differ in the order and choice of operations. Based on statistics collected on the tables (sizes, distribution histograms of keys) the optimiser computes an estimate of the amount of work needed for each query plan. It selects the plan with the lowest estimated cost.
Cost-based optimisation is heuristic, and is dependent on accurate statistics. As query complexity goes up the heuristics can produce incorrect plans, which can potentially be wildly inefficient.
Query hints can be used in this situation to force certain strategies in the query plan, such as a type of join. For example, on a query that usually returns very small result sets you may wish to force a nested loops join. You may also wish to force a certain join order of tables.
O/R mappers (or any tools that generate SQL) generate their own queries, which will typically not have hinting information. If such a query runs inefficiently you have limited options, some of which are:
Examine the indexing on the tables. Possibly you can add an index. Some systems (recent versions of Oracle, for example) allow you to index joins across more than one table.
Some database management systems (again, Oracle comes to mind) allow you to manually associate a query plan with a specific query string. Query plans are cached by a hash value of the query. If the queries are parameterised, the base query string is constant and will resolve to the same hash value.
As a last resort, you can modify the database schema, but this is only possible if you control the application.
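The point about plans being cached by a hash of the query text can be illustrated with a toy cache. This is a deliberately simplified sketch, not how any particular DBMS implements its plan cache: the PlanCache class, the SHA-256 key and the "plan" strings are all assumptions made for the example. It shows why parameterised queries share one cached plan while queries with inlined literals each need their own.

```python
import hashlib


class PlanCache:
    """Toy plan cache keyed by a hash of the exact query text (hypothetical)."""

    def __init__(self):
        self.plans = {}
        self.compilations = 0

    def get_plan(self, sql):
        key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if key not in self.plans:
            # Cache miss: a real system would run the optimizer here.
            self.compilations += 1
            self.plans[key] = "plan-%d" % self.compilations
        return self.plans[key]


# Literals inlined into the text: every statement hashes differently.
literal_cache = PlanCache()
for customer_id in range(100):
    literal_cache.get_plan(
        "SELECT * FROM orders WHERE customer_id = %d" % customer_id
    )

# Parameterised: the text is constant, only the bound value changes.
param_cache = PlanCache()
for customer_id in range(100):
    param_cache.get_plan("SELECT * FROM orders WHERE customer_id = ?")

print(literal_cache.compilations, param_cache.compilations)
```

In this toy model the literal variant compiles once per distinct statement, while the parameterised variant compiles once in total. That constant text is also what makes it possible to attach a stored plan to a query string the application keeps sending.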
If you control the SQL you can hint queries. In practice it's fairly uncommon to actually need to do this. A more common failure mode on O/R mappers with complex database schemas is they can make it difficult to express complex query predicates or do complex operations over large bodies of data.
I tend to advocate using the O/R mapper for the 98% of work that it's suited for and dropping to stored procedures where they are the appropriate solution. If you really need to hint a query then this might be the appropriate strategy. Unless there is something unusual about your application (for example some sort of DSS) you should only need to escape from the O/R mapper in a minority of situations. You might also find (again, an example would be DSS tools working with the data in aggregate) that an O/R mapper is not really the appropriate strategy for the application.
While HINTS do as the other answers describe, you should only use them in rare, researched circumstances. 9 times out of 10 a HINT will result in a poor query plan. Unless you really know what you are doing, don't use them.
There is no such thing as "optimized SQL code", because SQL code is never executed.
SQL code is translated into an execution plan by the Optimizer. The Optimizer will use the information it has to choose (among other things):
the order in which tables are involved
the join method for each involved table (nested/merge/hash)
how to access a table's data (direct table access/ index with bookmark lookup/direct index access) (scan/seek)
should parallelism be used, and when to end parallelism (gather streams)
Query hints allow a programmer to over-ride (in most cases) or suggest politely (in other cases) the optimizer's choices.
Query hints can let you force off parallelism, force all joins to be implemented as nested loop, force one index to be used over another... as a few examples.
Since the optimizer is really good, if one over-rides the optimizer, one is generally asking for a non-optimal plan. Query hints are best served when the optimizer does not have the required information to make a good choice.
One place I use query hints is for table variables. Table variables are assumed to have 0 rows by the Optimizer, and so the Optimizer always joins table variables using nested loop (the best join implementation for small numbers of rows). If I have a large table variable - already ordered in a favorable way for merge join, I can specify a merge join be used by applying a query hint.
All modern RDBMSs have some sort of query optimizer that calculates the best query plan, which is the sequence of read/write operations needed to execute an SQL query.
Sometimes plans can be suboptimal, so RDBMS designers included "hints" in SQL. Hints are instructions you can embed in your SQL that affect the query optimizer. With hints you can tell the query optimizer, for example, which indexes it should use, in what order data should be read from tables, ...
So, with hints you can resolve some bottlenecks that the query optimizer cannot solve by itself.
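For a feel of what a hint looks like in practice, here is a minimal sketch using SQLite through Python's sqlite3 module. SQLite is only a runnable stand-in: its INDEXED BY and NOT INDEXED clauses are loosely comparable to the index hints of Oracle or SQL Server, whose own hint syntax differs. The events table and index names are hypothetical. The query text is identical apart from the hint, yet the plan the optimizer reports changes.

```python
import sqlite3


def query_plan(connection, sql, params=()):
    """Return the optimizer's plan for a statement as one line of text."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail


# Hypothetical table with two candidate indexes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.execute("CREATE INDEX idx_events_a ON events (a)")
conn.execute("CREATE INDEX idx_events_b ON events (b)")
conn.executemany(
    "INSERT INTO events (a, b) VALUES (?, ?)",
    [(i % 10, i % 1000) for i in range(5000)],
)

template = "SELECT * FROM events {hint} WHERE a = ? AND b = ?"
params = (3, 123)

# No hint: the optimizer is free to pick whichever access path it prefers.
default_plan = query_plan(conn, template.format(hint=""), params)

# Hint: require the use of one specific index.
forced_index_plan = query_plan(
    conn, template.format(hint="INDEXED BY idx_events_b"), params
)

# Hint: forbid the use of any index, forcing a full table scan.
no_index_plan = query_plan(conn, template.format(hint="NOT INDEXED"), params)

print("default:     ", default_plan)
print("forced index:", forced_index_plan)
print("no index:    ", no_index_plan)
```

Forcing a scan here is almost certainly worse than the default, which mirrors the warning in the other answers: a hint overrides the optimizer whether or not the override is a good idea.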
For example, here is list of Oracle hints.
I have never clearly understood the usage of MAXDOP. I do know that it makes the query faster and that it is the last item that I can use for Query Optimization.
However, my question is, when and where it is best suited to use in a query?
As Kaboing mentioned, MAXDOP(n) actually controls the number of CPU cores that are being used in the query processor.
On a completely idle system, SQL Server will attempt to pull the tables into memory as quickly as possible and join between them in memory. It could be that, in your case, it's best to do this with a single CPU. This might have the same effect as using OPTION (FORCE ORDER), which forces the query optimizer to use the order of joins that you have specified. In some cases, I have seen OPTION (FORCE ORDER) reduce a query from 26 seconds to 1 second of execution time.
Books Online goes on to say that possible values for MAXDOP are:
0 - Uses the actual number of available CPUs depending on the current system workload. This is the default value and recommended setting.
1 - Suppresses parallel plan generation. The operation will be executed serially.
2-64 - Limits the number of processors to the specified value. Fewer processors may be used depending on the current workload. If a value larger than the number of available CPUs is specified, the actual number of available CPUs is used.
I'm not sure what the best usage of MAXDOP is, however I would take a guess and say that if you have a table with 8 partitions on it, you would want to specify MAXDOP(8) due to I/O limitations, but I could be wrong.
Here are a few quick links I found about MAXDOP:
Books Online: Degree of Parallelism
General guidelines to use to configure the MAXDOP option
This is a general rambling on Parallelism in SQL Server, it might not answer your question directly.
From Books Online, on MAXDOP:
Sets the maximum number of processors the query processor can use to execute a single index statement. Fewer processors may be used depending on the current system workload.
See Rickie Lee's blog on parallelism and CXPACKET wait type. It's quite interesting.
Generally, in an OLTP database, my opinion is that if a query is so costly it needs to be executed on several processors, the query needs to be re-written into something more efficient.
Why do you get better results when adding MAXDOP(1)? Hard to tell without the actual execution plans, but it might be as simple as the execution plan being totally different than without the OPTION, for instance using a different index or (more likely) JOINing differently, using MERGE or HASH joins.
As something of an aside, MAXDOP can apparently be used as a workaround to a potentially nasty bug:
Returned identity values not always correct
There are a couple of parallelization bugs in SQL Server with abnormal input. OPTION(MAXDOP 1) will sidestep them.
EDIT: Old. My testing was done largely on SQL 2005. Most of these seem to not exist anymore, but every once in awhile we question the assumption when SQL 2014 does something dumb and we go back to the old way and it works. We never managed to demonstrate that it wasn't just a bad plan generation on more recent cases though since SQL server can be relied on to get the old way right in newer versions. Since all cases were IO bound queries MAXDOP 1 doesn't hurt.
Adding my two cents, based on a performance issue I observed.
If simple queries are getting parallelized unnecessarily, it can create more problems than it solves. However, before adding MAXDOP to the query as a "knee-jerk" fix, there are some server settings to check.
In Jeremiah Peschka - Five SQL Server Settings to Change, MAXDOP and "COST THRESHOLD FOR PARALLELISM" (CTFP) are mentioned as important settings to check.
Note: Paul White also mentioned max server memory as a setting to check, in a response to Performance problem after migration from SQL Server 2005 to 2012. A good KB article to read is Using large amounts of memory can result in an inefficient plan in SQL Server
Jonathan Kehayias - Tuning ‘cost threshold for parallelism’ from the Plan Cache helps to find a good value for CTFP.
Why is cost threshold for parallelism ignored?
Aaron Bertrand - Six reasons you should be nervous about parallelism has a discussion of some scenarios where MAXDOP is the solution.
Parallelism-Inhibiting Components are mentioned in Paul White - Forcing a Parallel Query Execution Plan