How to get a list of tables that need tuning - sql

I have a database with tables that grow every day. I cannot predict which tables are going to grow and which are not as I'm not the one who is putting the data into them.
Is there a way to find tables that need indexes at a particular point in time? Is there a way, in SQL Server, to notify me if a database needs tuning on certain tables?
This is a product we have deployed at different client locations and we cannot go onto their servers every time to check if they have a performance issue. What I was thinking about is something that can notify me if there are performance issues on certain tables, so as the new patches go to the clients we can add these indexes or tuned queries.
After reading "Insertion of data after creating index on empty table or creating unique index after inserting data on oracle?", I'm reluctant to create indexes while installing databases or when the tables have few rows or are empty.

As per my understanding we must not create indexes on a smaller table as it can affect write performance.
This is only a real concern if you're bulk loading or otherwise generating a hundred million records each day and write performance is a problem. Indexes do increase write times because they have to be updated when data is written, but unless you're running on a potato or running very high loads it's unlikely to be a problem. You'd know it was a problem before you encountered it.
If we're talking about small tables (less than 100 pages) then it's much more likely that indexes won't be useful because the data set is so small, but you shouldn't be concerned about impacting write performance.
Overall, your application should ship with indexes that support the queries you expect to be run, as exercised in your unit testing and staging. You will need feedback from your customers or clients, but until you really know how people use their data, you're going to have to make a best guess.
The general question of "How do I know what indexes I need when I don't know what queries will be run?" is better suited to DBA Stack Exchange. Briefly, you'll need to use the dynamic management views for that, specifically the three missing index views. The example query given isn't horrible:
-- List each missing-index suggestion together with the columns it would cover.
SELECT mig.*, statement AS table_name,
    column_id, column_name, column_usage
FROM sys.dm_db_missing_index_details AS mid
-- Expand each suggestion into one row per column (EQUALITY, INEQUALITY or INCLUDE).
CROSS APPLY sys.dm_db_missing_index_columns (mid.index_handle)
INNER JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_handle = mid.index_handle
ORDER BY mig.index_group_handle, mig.index_handle, column_id;
You shouldn't just blindly follow what this view says, however. It's a good lead on what to look at, but you have to look at the column order and queries actually being used to tell.
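One way to prioritise those leads is to join in sys.dm_db_missing_index_group_stats and weight each suggestion by how often it would have been used. This is a minimal sketch: the estimated_benefit formula is a widely used rule of thumb and an assumption on my part, not something SQL Server defines.

```sql
-- Rank missing-index suggestions for the current database.
-- estimated_benefit is a heuristic score (assumption), useful only for ordering.
SELECT TOP (20)
    mid.statement            AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.user_scans,
    migs.avg_total_user_cost,
    migs.avg_user_impact,
    migs.avg_total_user_cost
        * (migs.avg_user_impact / 100.0)
        * (migs.user_seeks + migs.user_scans) AS estimated_benefit
FROM sys.dm_db_missing_index_group_stats AS migs
INNER JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_group_handle = migs.group_handle
INNER JOIN sys.dm_db_missing_index_details AS mid
    ON mid.index_handle = mig.index_handle
WHERE mid.database_id = DB_ID()
ORDER BY estimated_benefit DESC;
```

Treat the top rows as candidates to review against the real queries, not as indexes to create verbatim.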
You should also monitor index usage statistics and examine how much and in what way indexes are used compared to how much they have to be updated. Indexes that are updated a million times a day but are used once or twice should be considered for removal.
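As a rough sketch of that comparison, the query below lists nonclustered indexes in the current database with their read and write counts from sys.dm_db_index_usage_stats. Leaving out primary keys and unique constraints is my own assumption about what you would want to keep regardless of usage.

```sql
-- Reads versus writes per nonclustered index since the last instance restart.
SELECT
    OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
    OBJECT_NAME(i.object_id)        AS table_name,
    i.name                          AS index_name,
    ISNULL(us.user_seeks, 0)
        + ISNULL(us.user_scans, 0)
        + ISNULL(us.user_lookups, 0) AS total_reads,
    ISNULL(us.user_updates, 0)       AS total_writes
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
    ON  us.object_id   = i.object_id
    AND us.index_id    = i.index_id
    AND us.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND i.index_id > 1                 -- skip heaps (0) and clustered indexes (1)
  AND i.is_primary_key = 0
  AND i.is_unique_constraint = 0
ORDER BY total_writes DESC, total_reads ASC;
```

Indexes at the top with many writes and almost no reads are the ones worth a closer look.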
You will also want to monitor query stats to look for queries that run for a long time. This may be poor development on the part of your client, but can also be a sign of design problems.
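For the query stats, something along these lines is a reasonable starting point. It is a sketch that only sees plans still in the plan cache, so it will miss anything that has been evicted or recompiled.

```sql
-- Top cached statements by total elapsed time (times are in microseconds).
SELECT TOP (20)
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
    qs.total_logical_reads / qs.execution_count         AS avg_logical_reads,
    -- Offsets are in bytes of Unicode text, hence the division by 2.
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1)    AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;
```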
This is not even a comprehensive overview of things to look for, however. There's a lot to database maintenance and operations. That's why DBAs make a good living. This is just the tip of the iceberg. Just the tip for indexes, even.
What I'd do if you want to maintain this is consider asking your customers to allow you to send feedback for performance analysis. Set up a broker that monitors the management views and sends compiled and sanitized information back to yourselves. You'll need to be very careful about what you send because you don't want to be sending actual customer data, of course.
Keep in mind that dynamic management views typically reset when the instance does, so the results will not typically represent the entire lifespan of the database.
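A quick way to see how much history the views can possibly hold is to check when the instance last started. This sketch assumes SQL Server 2008 or later, where sys.dm_os_sys_info exposes sqlserver_start_time.

```sql
-- How many days of DMV history can exist at most on this instance.
SELECT
    sqlserver_start_time,
    DATEDIFF(DAY, sqlserver_start_time, SYSDATETIME()) AS days_since_restart
FROM sys.dm_os_sys_info;
```

If you ship any automated collection, it is worth sending this value along so the numbers can be read in context.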

Related

Is it a good idea to create tables dynamically to store user-content?

I'm currently designing an application where users can create/join groups, and then post content within a group. I'm trying to figure out how best to store this content in a RDBMS.
Option 1: Create a single table for all user content. One of the columns in this table will be the groupID, designating which group the content was posted in. Create an index using the groupID, to enable fast searching of content within a specific group. All content reads/writes will hit this single table.
Option 2: Whenever a user creates a new group, we dynamically create a new table. Something like group_content_{groupName}. All content reads/writes will be routed to the group-specific dynamically created table.
Pros for Option 1:
It's easier to search for content across multiple groups, using a single simple query, operating on a single table.
It's easier to construct simple cross-table queries, since the content table is static and well-defined.
It's easier to implement schema changes and changes to indexing/triggers etc, since there's only one table to maintain.
Pros for Option 2:
All reads and writes will be distributed across numerous tables, thus avoiding any bottlenecks that can result from a lot of traffic hitting a single table (though admittedly, all these tables are still in a single DB)
Each table will be much smaller in size, allowing for faster lookups, faster schema-changes, faster indexing, etc
If we want to shard the DB in future, the transition would be easier if all the data is already "sharded" across different tables.
What are the general recommendations between the above 2 options, from performance/development/maintenance perspectives?
One of the cardinal sins in computing is optimizing too early. It is the opinion of this DBA of 20+ years that you're overestimating the IO that's going to happen to these groups. RDBMSs are very good at querying and writing this type of info within a standard set of tables. Worst case, you can partition them later. You'll have a lot more search capability and management ease with 1 set of tables instead of a set per user.
Imagine the schema needs to change. Do you really want to update hundreds or thousands of tables or write some long script to fix a mundane issue? Stick with a single set of tables and ignore sharding. Instead, think "maybe we'll partition the tables someday, if necessary".
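To make the single-table option concrete, here is a minimal sketch in PostgreSQL-flavoured SQL. The table, column and index names are hypothetical; the question does not give a schema.

```sql
-- One table for all groups; group_id says which group a post belongs to.
CREATE TABLE group_content (
    content_id  BIGINT    NOT NULL PRIMARY KEY,
    group_id    BIGINT    NOT NULL,
    author_id   BIGINT    NOT NULL,
    body        TEXT      NOT NULL,
    created_at  TIMESTAMP NOT NULL
);

-- Composite index so "latest posts in a group" is an index range scan.
CREATE INDEX ix_group_content_group_created
    ON group_content (group_id, created_at);

-- Typical read: the 50 most recent posts in one group.
SELECT content_id, author_id, body, created_at
FROM group_content
WHERE group_id = 42
ORDER BY created_at DESC
LIMIT 50;
```

A schema change is then a single ALTER TABLE, instead of one per group.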
It is a no-brainer. (1) is the way to go.
You list these as optimizations for the second method. All of these are misconceptions. See comments below:
"All reads and writes will be distributed across numerous tables, thus avoiding any bottlenecks that can result from a lot of traffic hitting a single table (though admittedly, all these tables are still in a single DB)"
Reads and writes can just as easily be distributed within a table. The only issue would be write conflicts within a page. That is probably a pretty minor consideration, unless you are dealing with more than dozens of transactions per second.
Because of the next item (partially filled pages), you are actually much better off with a single table and pages that are mostly filled.
"Each table will be much smaller in size, allowing for faster lookups, faster schema-changes, faster indexing, etc"
Smaller tables can be a performance disaster. Tables are stored on data pages, so each small table ends up occupying at least one partially filled page. What you end up with is:
A lot of wasted space on disk.
A lot of wasted space in your page cache -- space that could be used to store records.
A lot of wasted I/O reading in partially filled pages.
"If we want to shard the DB in future, the transition would be easier if all the data is already "sharded" across different tables."
Postgres supports table partitioning, so you can store different parts of a table in different places. That should be sufficient for your purpose of spreading the I/O load.
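As a sketch of what that could look like, the following uses declarative hash partitioning, which assumes PostgreSQL 11 or later. The table and partition names are hypothetical, and four partitions is an arbitrary choice.

```sql
-- Parent table, split by a hash of group_id.
-- The primary key must include the partition key.
CREATE TABLE group_content (
    content_id  bigint      NOT NULL,
    group_id    bigint      NOT NULL,
    body        text        NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (group_id, content_id)
) PARTITION BY HASH (group_id);

-- Four partitions; each could be placed in its own tablespace if needed.
CREATE TABLE group_content_p0 PARTITION OF group_content
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE group_content_p1 PARTITION OF group_content
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE group_content_p2 PARTITION OF group_content
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE group_content_p3 PARTITION OF group_content
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```

The application still reads and writes group_content as one table, which keeps the Option 1 advantages.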
Option 1: Performance=Normal Development=Easy Maintenance=Easy
Option 2: Performance=Fast Development=Complex Maintenance=Hard
I suggest choosing Option 1. For the big table you can manage performance with better indexes or cached indexes (on some databases). Finally, nothing justifies Option 2, because development and maintenance time is the decisive factor.

Is a Data-filled SQL table queryable while setting up a new index?

Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but its effect on performance is nearly impossible for us to predict. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDBMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, provided your system has the IO/CPU capacity to handle both tasks at the same time. Inserting, however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.
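Some systems also let you ask explicitly for a build that keeps the table available. The two statements below are alternatives for different products, not a script to run as one, and the table and index names are hypothetical. Note that ONLINE = ON in SQL Server depends on the edition (traditionally Enterprise), and CREATE INDEX CONCURRENTLY in PostgreSQL cannot run inside a transaction block.

```sql
-- SQL Server: build the index while readers and writers keep working.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
    ON dbo.Orders (CustomerId)
    WITH (ONLINE = ON);

-- PostgreSQL: build without taking a lock that blocks writes.
CREATE INDEX CONCURRENTLY ix_orders_customer_id
    ON orders (customer_id);
```

Both variants trade a longer build for less blocking, so the extra IO and CPU load described above still applies.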

Dynamic Index Creation - SQL Server

Ok, so I work for a company who sells a web product which has a MS SQL Server back end (can be any version, we've just changed our requirements to 2008+ now that 05 is out of extended support). All databases are owned by the company who purchases the product but we have VPN access and have a tech support department to deal with any issues. One part of my role is to act as 3rd line support for SQL issues.
When performance is a concern one of the usual checks is unused/missing indexes. We've got the usual standard indexes but depending on which modules or how a company utilises the system then it will require different indexes (there's an accounting module and a document management module amongst others). With hundreds of customers it's not possible to remote onto each on a regular basis in order to carry out optimisation work. I'm wondering if anybody else in my position has considered a scheduled task that may be able to drop and create indexes when needed?
I've got concerns (obviously), any changes that this procedure makes would also be stored in a table with full details of the change and a time stamp. I'd need this to be bullet proof, can't be sending something out into the wild if it may cause issues. I'm thinking an overnight or (probably) weekly task.
Dropping Indexes:
Would require the server to be up for a minimum amount of time to ensure all relevant server statistics are up to date (say 2 weeks or 1 month).
Only drop unused indexes for tables that are being actively used (indexes on unused parts of the system aren't a concern).
Log it.
This won't highlight duplicate indexes (that will have to be manual), just the quick wins (unused indexes with writes).
Creating Indexes
Only look for indexes with a value above a certain threshold.
Would have to check whether any similar indexes could be modified to cover the requirement. This could be on a ranking (check all indexed fields are the same and then score the included fields to see if additional would be needed).
Limit to a maximum number of indexes to be created (say 5 per week) to ensure it doesn't get carried away and create a bunch at once. This should help focus only on the most important indexes.
Log it.
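The threshold, the weekly cap and the logging could be sketched roughly as below. Everything named here is hypothetical: the dbo.IndexChangeLog table, the @min_benefit value and the benefit formula are assumptions for illustration, and the script only records candidates, it does not create anything.

```sql
-- Hypothetical log table for proposed or applied index changes.
IF OBJECT_ID(N'dbo.IndexChangeLog', N'U') IS NULL
    CREATE TABLE dbo.IndexChangeLog (
        log_id             int IDENTITY(1,1) PRIMARY KEY,
        logged_at          datetime       NOT NULL DEFAULT (GETDATE()),
        change_action      varchar(10)    NOT NULL,  -- 'CREATE' or 'DROP'
        table_name         nvarchar(4000) NOT NULL,
        equality_columns   nvarchar(4000) NULL,
        inequality_columns nvarchar(4000) NULL,
        included_columns   nvarchar(4000) NULL,
        estimated_benefit  float          NOT NULL
    );

DECLARE @min_benefit float = 10000;  -- assumed threshold, tune per system
DECLARE @max_per_run int   = 5;      -- cap from the list above

-- Record at most @max_per_run candidates whose score clears the threshold.
INSERT INTO dbo.IndexChangeLog
    (change_action, table_name, equality_columns,
     inequality_columns, included_columns, estimated_benefit)
SELECT TOP (@max_per_run)
    'CREATE', mid.statement, mid.equality_columns,
    mid.inequality_columns, mid.included_columns, score.estimated_benefit
FROM sys.dm_db_missing_index_group_stats AS migs
INNER JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_group_handle = migs.group_handle
INNER JOIN sys.dm_db_missing_index_details AS mid
    ON mid.index_handle = mig.index_handle
CROSS APPLY (SELECT migs.avg_total_user_cost
                    * (migs.avg_user_impact / 100.0)
                    * (migs.user_seeks + migs.user_scans) AS estimated_benefit
            ) AS score
WHERE mid.database_id = DB_ID()
  AND score.estimated_benefit >= @min_benefit
ORDER BY score.estimated_benefit DESC;
```

Writing candidates to the log first, and applying them in a separate reviewed step, keeps the "bullet proof" requirement easier to meet.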
This would need to be dynamic as we've got customers on different versions of the system with different usage patterns.
Just to clarify: I'm not expecting anybody to code for this, it's more a question relating to the feasibility and concerns for a task like this.
Edit: I've put a bounty on this to gather some further opinions and to get feedback from anybody who may have tried this before. I'll award it to the answer with the most upvotes by the time the bounty duration ends.
I can't recommend what you're contemplating, but you might be able to simplify your life by gathering the inputs to your contemplated program and making them available to clients and the support team.
If the problem were as simple as you suppose, surely the server itself or the tuning advisor would have solved it by now. You're making at least one unwarranted assumption:
"require the server to be up for a minimum amount of time to ensure all relevant server statistics are up to date."
Table statistics are only as good as the last time they were updated after a significant change. Uptime won't guarantee anything about truncate table or a bulk insert.
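You can check how stale the table statistics actually are, independently of uptime, with STATS_DATE. A minimal sketch for the current database:

```sql
-- When each statistics object on a user table was last updated.
-- NULL means it has never been updated; those rows sort first.
SELECT
    OBJECT_SCHEMA_NAME(s.object_id)      AS schema_name,
    OBJECT_NAME(s.object_id)             AS table_name,
    s.name                               AS stats_name,
    STATS_DATE(s.object_id, s.stats_id)  AS last_updated
FROM sys.stats AS s
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY last_updated;
```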
This won't highlight duplicate indexes
But that's something you can do in a single query using the system tables. (It would be disappointing if the tuning gadget didn't help with those.) You could similarly look for overlapping indexes, such as for columns {a,b} and {a}; the second won't be useful unless {b} is selective and there are queries that don't mention {b}.
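A sketch of such a query is below. It builds an ordered key-column list per index and reports pairs on the same table where one list is a leading prefix of the other. It deliberately ignores included columns, filtered indexes and uniqueness, which is a simplification on my part, so treat the output as candidates only.

```sql
-- Find indexes whose key columns are a leading prefix of another index's keys.
WITH index_keys AS (
    SELECT
        i.object_id,
        i.index_id,
        i.name,
        -- e.g. '3A,5D,' = column 3 ascending, then column 5 descending
        (SELECT CAST(ic.column_id AS varchar(10))
                + CASE WHEN ic.is_descending_key = 1 THEN 'D' ELSE 'A' END
                + ','
         FROM sys.index_columns AS ic
         WHERE ic.object_id = i.object_id
           AND ic.index_id  = i.index_id
           AND ic.is_included_column = 0
         ORDER BY ic.key_ordinal
         FOR XML PATH('')) AS key_list
    FROM sys.indexes AS i
    WHERE i.index_id > 0
      AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
)
SELECT
    OBJECT_NAME(a.object_id) AS table_name,
    a.name                   AS narrower_index,
    b.name                   AS wider_or_equal_index
FROM index_keys AS a
INNER JOIN index_keys AS b
    ON  a.object_id = b.object_id
    AND a.index_id <> b.index_id
    AND b.key_list LIKE a.key_list + '%'
ORDER BY table_name, narrower_index;
```

Exact duplicates show up twice, once in each direction, because each is a prefix of the other.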
To look for new indexes, I would be tempted to try to instrument query use frequency and automate the analysis of query plan output. If you can identify frequently used, long-running queries and map their physical operations (table scan, hash join, etc.) onto the tables and existing indexes, you would have good input for adding and removing indexes. But you have to allow for the infrequently run quarterly report that, without its otherwise unused index, would take days to complete.
I must tell you that when I did that kind of analysis once some years ago, I was disappointed to learn that most problem children were awful queries, usually prompted by awful table design. No index will help the SQL mule. Hopefully that will not be your experience.
An aspect you didn't touch on that might be just as important is machine capacity. You might look into gathering, say, hourly snapshots of SQL Server stats, like disk queue depth and paging. Hardly a server exists that can't be improved with more RAM, and sometimes that's really the best answer.
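For the snapshot idea, two views give a cheap first look at memory and IO pressure. This is a sketch of what an hourly job might record; which counters matter is an assumption, and the IO figures are cumulative since the instance started, so you would compare successive snapshots.

```sql
-- Memory pressure indicators.
SELECT
    RTRIM(object_name)  AS counter_object,
    RTRIM(counter_name) AS counter_name,
    cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Page life expectancy', N'Memory Grants Pending');

-- Cumulative IO stall time per database file.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    vfs.file_id,
    vfs.num_of_reads,
    vfs.io_stall_read_ms,
    vfs.num_of_writes,
    vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```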
SQL perf tuning advisor worth a check: https://msdn.microsoft.com/en-us/library/ms186232.aspx
another way could be to get performance data, start here: https://www.experts-exchange.com/articles/17780/Monitoring-table-level-activity-in-a-SQL-Server-database-by-using-T-SQL.html and generate indexes based on the performance table data
check this too : https://msdn.microsoft.com/en-us/library/dn817826.aspx

Should I move to NoSQL? (big data)

I'm currently researching a very large table (~100 million rows, 35 columns). It's currently stored in a SQL db, but the queries I'm running (and they're various) run very, very slowly.
So I gather I should probably move to a NoSQL db. My questions are:
How can I tell which (NoSQL) db is best for me?
How can I move my current SQL table to the new NoSQL scheme?
OR should I stay in SQL and just fine tune it?
A few more details: rows will not be added/removed, this is historical data and all of the analysis will be done on that table. I plan to run various queries on it. The data is numerical.
I routinely work with a SQL Server 2012 table that has 900 million rows. This table has rows being added to it about every 2 minutes with a total of about 200K per day. I can query this table and get rows back in a couple seconds (using the clustered index / PK). I can also query on one of the other indexes and get results back in seconds or less.
So, it's all a matter of making sure your indexes are set up correctly, AND BEING USED!! Check your queries against the query plan being generated and make sure seeks are being done.
There could be good reasons for moving to NoSQL, or something similar. But moving to NoSQL because you think you can't get good performance in SQL Server, before making sure you've done everything you can do to improve performance first, is not a good reason.
Some food for thought:
100M rows is well within SQL's "sweet spot". You can grow by x10 and still be assured that SQL will be able to support you with fairly trivial effort.
NoSQL is not a silver bullet for solving performance problems at scale. It offers a set of tradeoffs which, with careful planning, can provide better results. But it sounds like you don't fully understand your performance issues in SQL, and without that your chances of making the correct design decisions in a NoSQL environment are slim.
One of the common tradeoffs in NoSQL systems is that they typically provide less flexibility in querying, in return for greater flexibility in schema management. You mentioned your queries are "various". If they are truly varied, or more importantly, frequently changing, then moving to a NoSQL system can put you in a world of pain, especially if you are not familiar with the technology yet.
Bottom line- You aren't doing anything which is clearly "beyond" the capabilities of SQL, and your problems are probably caused more by inefficient implementation than by any inherent platform limitations. Moving to a NoSQL system won't magically solve any of your problems, and will probably introduce new ones.
If you are running a query on columns that are not indexed, it will be very slow. You can add more indexes to speed such queries up. If your DB is static this should work.
One major speed up is the usage of map-reduce queries, where aggregations are carried out by multiple processes or computers. NoSQL databases like MongoDB can be used in such ways. But even MySQL has Cluster capabilities nowadays: http://www.mysql.de/products/cluster/scalability.html. SQL Server can be clustered as well.
So I guess the best first shot would be to optimize your indexes in the table to the query. Each argument column to the query (compare, count ...) etc. should be indexed.
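As an illustration, assuming SQL Server and a hypothetical dbo.Measurements table filtered by sensor and time range, a single composite index that matches the query is usually more useful than one index per column. The table, columns and values are made up for the example.

```sql
-- Query to support: one sensor, one time range, aggregate a numeric column.
SELECT AVG(reading_value)
FROM dbo.Measurements
WHERE sensor_id = 17
  AND measured_at >= '20130101'
  AND measured_at <  '20140101';

-- Equality column first, range column second; INCLUDE makes the index
-- cover the query so the base table does not have to be read.
CREATE NONCLUSTERED INDEX IX_Measurements_Sensor_Time
    ON dbo.Measurements (sensor_id, measured_at)
    INCLUDE (reading_value);
```

Since the data is static, the write cost of extra indexes is a one-time build rather than an ongoing penalty.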
If this is not doing any better you probably count and calculate a lot and you should use map-reduce jobs and a DB which can handle this like MongoDB: http://docs.mongodb.org/manual/aggregation/
I hope this helps

What is the best way to partition large tables in SQL Server?

In a recent project the "lead" developer designed a database schema where "larger" tables would be split across two separate databases with a view on the main database which would union the two separate database-tables together. The main database is what the application was driven off of so these tables looked and felt like ordinary tables (except some quirky things around updating). This seemed like a HUGE performance problem. We do see problems with performance around these tables but nothing to make him change his mind about his design. Just wondering what is the best way to do this, or if it is even worth doing?
I don't think that you are really going to gain anything by partitioning the table across multiple databases in a single server. All you have essentially done there is increased the overhead in working with the "table" in the first place by having several instances (i.e. open in two different DBs) of it under a single SQL Server instance.
How large a dataset do you have? I have a client with a 6 million row table in SQL Server that contains 2 years' worth of sales data. They use it transactionally and for reporting without any noticeable speed problems.
Tuning the indexes and choosing the correct clustered index is crucial to performance of course.
If your dataset is really large and you are looking to partition, you will get more bang for your buck partitioning the table across physical servers.
Partitioning is not something to be undertaken lightly as there can be many subtle performance implications.
My first question is are you referring simply to placing larger table objects in separate filegroups (on separate spindles) or are you referring to data partitioning inside of a table object?
I suspect that the situation described is an attempt to have the physical storage of certain large tables on different spindles from the rest of the tables. In this case, adding the extra overhead of separate databases, losing any ability to enforce referential integrity across databases, and the security implications of enabling cross-database ownership chaining does not provide any benefit over using multiple filegroups within a single database. If, as is quite possible, the separate databases you refer to in your question are not even stored on separate spindles but are all stored on the same spindle then you negate even the slight performance benefit you could have gained by physically separating your disk activity and have received absolutely no benefit.
I would suggest instead of using additional databases to hold large tables you look into the Filegroup topic in SQL Server Books Online or for a quick review see this article:
If you are interested in data partitioning (including partitioning into multiple file groups) then I recommend reading articles by Kimberly Tripp, who gave an excellent presentation at the time SQL Server 2005 came out about the improvements available there. A good place to start is this whitepaper
Which version of SQL Server are you using? SQL Server 2005 has partitioned tables, but in 2000 (or 7.0) you needed to use partitioned views.
Also, what was the reasoning for putting the table partitions in a separate database?
When I've had to partition tables in the past (pre-2005), it's usually by a date column or something similar, with a view over the various partitions. Books Online has a section that talks about how to do this and all of the rules around it. You need to follow the rules to make it work how it's supposed to work.
The key thing to remember is that your partitioning column must be part of the primary key and you want to try to always use that column in any access against the table so that the optimizer can ignore partitions that shouldn't be affected by the query.
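For the SQL Server 2005 style of partitioned table, a minimal sketch looks like this. The object names and boundary dates are hypothetical, and everything is mapped to the PRIMARY filegroup just to keep the example short.

```sql
-- Boundary values split the data into four ranges by year.
CREATE PARTITION FUNCTION pf_SalesByYear (datetime)
AS RANGE RIGHT FOR VALUES ('20070101', '20080101', '20090101');

-- Map every partition to a filegroup (all to PRIMARY here).
CREATE PARTITION SCHEME ps_SalesByYear
AS PARTITION pf_SalesByYear ALL TO ([PRIMARY]);

-- The partitioning column is part of the primary key.
CREATE TABLE dbo.Sales (
    sale_id   int            NOT NULL,
    sale_date datetime       NOT NULL,
    amount    decimal(12, 2) NOT NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (sale_date, sale_id)
) ON ps_SalesByYear (sale_date);

-- Filtering on sale_date lets the optimizer touch only the 2008 partition.
SELECT SUM(amount)
FROM dbo.Sales
WHERE sale_date >= '20080101'
  AND sale_date <  '20090101';
```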
Look up "partitioned table" in MSDN and you should be able to find a more complete tutorial for SQL Server 2005 partitioned tables as well as advice on how to set them up for maximum performance.
Are you asking about best practices in terms of database design, or convincing your lead to change his mind? :)
In terms of design... Back in the goode olde days, vertical partitioning was sometimes needed to work around database engine limitations, where the number of columns in a table was a hard limit, like 255 columns. These days the main benefits are purely for performance: putting rarely used columns, or blobs on a separate disk array. But if you're regularly pulling things from both tables it will likely be a loss. It sounds like your lead is suffering from a case of premature optimisation.
In terms of telling your lead he is wrong... that requires diplomacy. If he's aware of mutterings of discontent in terms of performance, a benchmark is probably the best way to show the difference.
Create a new physical table somewhere with 'create table t1 as select * from view1' (in SQL Server the equivalent is 'select * into t1 from view1') and then run some lengthy batch with the vertically partitioned table and your new table. If it's as bad as you say, the difference should be evident.
But this too may be premature optimisation. Find out what the end-users think of the performance. If the performance is good enough, for some definition of good, then don't fix what ain't broke.
There is a definite benefit for table partitioning (regardless whether it's on same or different filegroups /disks). If the partition column is correctly selected, you'll realize that your queries will hit only the required partition. So imagine if you have 100 million records (I've partitioned tables much bigger than that - about 20+ Billion rows) and if for the most part, more than 70% of your data access is only a certain category or timeline or type of data then it helps to keep the most accessed data in a separate partition. Plus you can align the partition with separate file groups with various type of disks (SATA, Fiber channel, SSDs) so that the most accessed/busy data are on the fastest storage and the least/rarely accessed are virtually on slower disks.
That said, SQL Server's partitioning ability is limited compared with Oracle's. You can choose only one column for partitioning (even in SQL 2008), so you have to choose wisely: pick a column that is also part of most of your frequent queries. For the most part, people find it easy to partition by a date column. However, although it seems logical to partition that way, if your queries do not have that column as part of the condition, you won't gain much from partitioning (in other words, your query will hit all the partitions regardless).
It's much easier to partition for data warehouse/data mining type databases than OLTP as most DW database queries are limited by time period.
That's why these days, given the volume of data being handled by databases, it's wise to design the application in such a way that every query is limited by some broader group such as time or geographical location, so that when such columns are chosen for partitioning you'll gain maximum benefit.
I would disagree with the assumption that nothing can be gained by partitioning.
If the partition data is physically and logically aligned, then the potential IO of queries should be dramatically reduced.
For example, we have a table with a batch field stored as an INT.
If we partition the data by this field and then re-run a query for a particular batch, we should be able to run set statistics io ON before and after partitioning and see a reduction in IO.
If we have a million rows per partition and each partition is written to a separate device, the query should be able to eliminate the nonessential partitions.
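The before-and-after measurement can be sketched as below. The table dbo.BatchData, the column batch_id and the value 42 are hypothetical; run the same script once against the unpartitioned table and once after partitioning.

```sql
-- Report page reads for each statement in the Messages output.
SET STATISTICS IO ON;

SELECT COUNT(*)
FROM dbo.BatchData
WHERE batch_id = 42;

SET STATISTICS IO OFF;
```

Compare the "logical reads" figure between the two runs. If partition elimination is working, it should drop to roughly the pages of a single partition.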
I've not done a lot of partitioning on SQL Server, but I do have experience of partitioning on Sybase ASE, and this is known as partition elimination. When I have time I'm going to test out the scenario on a SQL Server 2005 machine.