Insert Based on Primary Key - sql

In our data warehouse (SQL Server 2005), we attempt to insert/update records in order of the primary key. In other words, we pull from the source table and issue an ORDER BY primary key in DW. This is a standard practice to keep the data reads/writes in logical order on the hard drive and improve performance. (If this is not accurate, please let me know).
When issuing an ORDER BY on a very large source table, this really kills performance. Is there another way to get the same result? I am thinking some combination of index rebuilds and computing stats?
Hope that makes sense! I'm not a DBA! Thanks.

If you want to force a specific ordering, you must have an order by clause. No ifs, ands, or buts. A clustered index on the column you want to sort by would probably make the select run faster, but building that index may end up taking just as long as your original query.
I would look into alternative methods of re-ordering your target table (as suggested in Johnny Bones's reply).

If that table does not have an index, then the pages aren't going to be stored with any particular order on the disk. In the case of a large table without any index, a SELECT.....ORDER BY from said table will have a performance issue.
It sounds like you need an index on your primary key.

If the table you are pulling from has a clustered index on the PK (which it should), then bringing back records in the order of the PK should have zero performance impact. Any performance issues you are having is probably due to the sheer number of records you are returning (not to the ORDER BY).
I don't think I understand what you're trying to do, exactly. But bringing back the records in that order should not be a problem.

I don't know if this is "standard practice", but I worked in a warehouse where our typical table had close to a billion records. We would always drop the indexes, insert the new data, and then rebuild the indexes. Someone, at some point, determined this was the most efficient way for us to do it. I'm sure someone here can chime in about page size and physical attributes (which is probably more than you need to know, since you say you're not a DBA), but the short answer is to do it that way.
If you decide to go that route, always remember to drop the non-clustered indexes first, and then drop the clustered. When you rebuild them, rebuild them in reverse order (clustered first and then non-clustered).

Related

Correct SQL Server Index When Selecting one parameter

I have a query in SQL Server that looks something like this:
SELECT m.id
FROM Message m
WHERE m.id IN (someIds)
AND m.creationTime >= someDate
AND m.partition_number IN (0,1)
My Question is which kind of index should be best for this case. Thanks a lot.
I'm going to assume that id is the primary key on the Message table (if it's not, you're going to get a lot of people having trouble following your code).
The PK will be a Clustered index: that is, it's not separate from the data the index points to. Find the key and the data associated with it is stored with it. For a non-clustered index (such as might exist on creationTime) what's stored with the key is a pointer to data, and the server then has to go away and do another disc access to read the data. Thus clustered indexes are more efficient than non-clustered ones as you only have to find the key.
If you can specify the PK values (the someIds in your query) it will be as efficient as possible, and adding an index on creationTime will not improve it. Indeed, if SQL Server used it as well as the PK then it would degrade the query.
So as long as id is the PK you are doing as well as possible.
Final comment: what you are doing is called Premature Optimisation and is generally a no-no. With databases, do not try too hard to improve performance until you KNOW you have a problem and can measure it. Then you get out the Query Plan for the query which is slow and start playing about it. Until then, knowing what queries your app is issuing will give a very good first approximation of what indexes you will need and you shouldn't try too hard to optimise them until you know there is a problem.

How to know when to use indexes and which type?

I've searched a bit and didn't see any similar question, so here goes.
How do you know when to put an index in a table? How do you decide which columns to include in the index? When should a clustered index be used?
Can an index ever slow down the performance of select statements? How many indexes is too many and how big of a table do you need for it to benefit from an index?
EDIT:
What about column data types? Is it ok to have an index on a varchar or datetime?
Well, the first question is easy:
When should a clustered index be used?
Always. Period. Except for a very few, rare, edge cases. A clustered index makes a table faster, for every operation. YES! It does. See Kim Tripp's excellent The Clustered Index Debate continues for background info. She also mentions her main criteria for a clustered index:
narrow
static (never changes)
unique
if ever possible: ever increasing
INT IDENTITY fulfills this perfectly - GUID's do not. See GUID's as Primary Key for extensive background info.
Why narrow? Because the clustering key is added to each and every index page of each and every non-clustered index on the same table (in order to be able to actually look up the data row, if needed). You don't want to have VARCHAR(200) in your clustering key....
Why unique?? See above - the clustering key is the item and mechanism that SQL Server uses to uniquely find a data row. It has to be unique. If you pick a non-unique clustering key, SQL Server itself will add a 4-byte uniqueifier to your keys. Be careful of that!
Next: non-clustered indices. Basically there's one rule: any foreign key in a child table referencing another table should be indexed, it'll speed up JOINs and other operations.
Furthermore, any queries that have WHERE clauses are a good candidate - pick those first which are executed a lot. Put indices on columns that show up in WHERE clauses, in ORDER BY statements.
Next: measure your system, check the DMV's (dynamic management views) for hints about unused or missing indices, and tweak your system over and over again. It's an ongoing process, you'll never be done! See here for info on those two DMV's (missing and unused indices).
Another word of warning: with a truckload of indices, you can make any SELECT query go really really fast. But at the same time, INSERTs, UPDATEs and DELETEs which have to update all the indices involved might suffer. If you only ever SELECT - go nuts! Otherwise, it's a fine and delicate balancing act. You can always tweak a single query beyond belief - but the rest of your system might suffer in doing so. Don't over-index your database! Put a few good indices in place, check and observe how the system behaves, and then maybe add another one or two, and again: observe how the total system performance is affected by that.
Rule of thumb is primary key (implied and defaults to clustered) and each foreign key column
There is more but you could do worse than using SQL Server's missing index DMVs
An index may slow down a SELECT if the optimiser makes a bad choice, and it is possible to have too many. Too many will slow writes but it's also possible to overlap indexes
Answering the ones I can I would say that every table, no matter how small, will always benefit from at least one index as there has to be at least one way in which you are interested in looking up the data; otherwise why store it?
A general rule for adding indexes would be if you need to find data in the table using a particular field, or set of fields. This leads on to how many indexes are too many, generally the more indexes you have the slower inserts and updates will be as they also have to modify the indexes but it all depends on how you use your data. If you need fast inserts then don't use too many. In reporting "read only" type data stores you can have a number of them to make all your lookups faster.
Unfortunately there is no one rule to guide you on the number or type of indexes to use, although the query optimiser of your chosen DB can give hints based on the queries you are executing.
As to clustered indexes they are the Ace card you only get to use once, so choose carefully. It's worth calculating the selectivity of the field you are thinking of putting it on as it can be wasted to put it on something like a boolean field (contrived example) as the selectivity of the data is very low.
This is really a very involved question, though a good starting place would be to index any column that you will filter results on. ie. If you often break products into groups by sale price, index the sale_price column of the products table to improve scan times for that query, etc.
If you are querying based on the value in a column, you probably want to index that column.
i.e.
SELECT a,b,c FROM MyTable WHERE x = 1
You would want an index on X.
Generally, I add indexes for columns which are frequently queried, and I add compound indexes when I'm querying on more than one column.
Indexes won't hurt the performance of a SELECT, but they may slow down INSERTS (or UPDATES) if you have too many indexes columns per table.
As a rule of thumb - start off by adding indexes when you find yourself saying WHERE a = 123 (in this case, an index for "a").
You should use an index on columns that you use for selection and ordering - i.e. the WHERE and ORDER BY clauses.
Indexes can slow down select statements if there are many of them and you are using WHERE and ORDER BY on columns that have not been indexed.
As for size of table - several thousands rows and upwards would start showing real benefits to index usage.
Having said that, there are automated tools to do this, and SQL server has an Database Tuning Advisor that will help with this.

unique index slows down?

does it slow down the query time to use a lot of unique indexes? i dont have that many im just curious, i think i have heard this some where
id (primary auto_increment)
username (unique)
password
salt (unique)
email (unique)
It depends on the database server software and table/index type you are using.
Inserts will be slower any time you have indexes of any sort, but not necessarily by a lot - this will depend on the table size as well.
In general, unique indexes should speed up any SELECT queries you use that are able to take advantage of the index(es)
They will slow down your inserts and updates (since each has to be checked to see if the constraint is violated), but should not slow down selects. In fact, they could speed them up, since there are more choices for the optimizer to use to find your data.
In general, having more indexes will slow down inserts, updates and deletes, but can speed up queries - assuming that you query based on these fields.
If you need a unique index to guarantee consistency in your application then you should usually add it even if it will result in a slight performance hit. It's better to be correct than fast but wrong.
Each time you add a unique key to a database table it adds a non-clustered index. Clustered indexes are on primary keys and sort the data in the table physically by that column(s). Each non-clustered index creates a number of leaves external to the table that sort based off of the unique key. This allows for a much more consistent search on the unique key because a full table scan is no longer required. The down side is every time you insert, update, or delete a row from the table, the server must go back and update all of the leaves external to the table. This is usually not an issue if you have a small number of unique keys but when you have many it can slow down the response times. For more info, read the Wikipedia article here. Msdn also has a good article.
Query time, no. They will slow down INSERT and DELETE time though, since each index value has to be calculated and then inserted, or removed.

When should you consider indexing your sql tables?

How many records should there be before I consider indexing my sql tables?
There's no good reason to forego obvious indexes (FKs, etc.) when you're creating the table. It will never noticeably affect performance to have unnecessary indexes on tiny tables, and it's good to take a first cut when your mind is into schema design. Also, some indexes serve to prevent duplicates, which can be useful regardless of table size.
I guess the proper answer to your question is that the number of records in the table should have nothing to do with when to create indexes.
I would create the index entries when I create my table. If you decide to create indices after the table has grown to 100, 1000, 100000 entries it can just take alot of time and perhaps make your database unavailable while you are doing it.
Think about the table first, create the indices you think you'll need, and then move on.
In some cases you will discover that you should have indexed a column, if thats the case, fix it when you discover it.
Creating an index on a searched field is not a pre-optimization, its just what should be done.
When the query time is unacceptable. Better yet, create a few indexes now that are likely to be useful, and run an EXPLAIN or EXPLAIN ANALYZE on your queries once your database is populated by representative data. If the indexes aren't helping, drop them. If there are slow queries that could benefit from more or different indexes, change the indexes.
You are not going to be locked in to an initial choice of indexes. Experiment, and make sure you measure performance!
In general I agree with the previous advice.
Always declare the referential integrity for the tables (Primary Key, Foreign Keys), column constraints (not null, check). Saves you from nightmares when apps put bad data into the tables (even in development).
I'd consider adding indexes for the common access columns (columns in your where clauses which are used in =, <> tests), as well.
Most of the modern RDBMS implementations are quite good at keeping you indexes up to date, without hitting your performance. So, the cost of having indexes is minimal.
Also, most RDBMS's have query plan evaluators which look at the relative costs going to the data rows via the index, or using some sort of table scan. So, again the performance hits are minimal.
Two.
I'm serious. If there are two rows now, and there will always be two rows, the cost of indexing is almost zero. It's quicker to index than to ponder whether you should. It won't take the optimizer very long to figure out that scanning the table is quicker than using the index.
If there are two rows now, but there will be 200,000 in the near future, the cost of not indexing could become prohibitively high. The right time to consider indexing is now.
Having said this, remember that you get an index automatically when you declare a primary key. Creating a table with no primary key is asking for trouble in most cases. So the only time you really need to consider indexing is when you want an index other than the index on the primary key. You need to know the traffic, and the anticipated volume to make this call. If you get it wrong, you'll know, and you can reverse the decision.
I once saw a reference table that had been created with no index when it contained 20 rows. Due to a business change, this table had grown to about 900 rows, but the person who should have noticed the absence of an index didn't. The time to insert a new order had grown from about 10 seconds to 15 minutes.
As a matter of routine I perform the following on read heavy tables:
Create indexes on common join fields such as Foreign Keys when I create the table.
Check the query plan for Views or Stored Procedures and add indexes wherever a table scan is indicated.
Check the query plan for queries by my application and add indexes wherever a table scan is indicated. (and often try to make them into Stored Procedures)
On write heavy tables (like activity logs) I avoid indexes unless they are absolutely necessary. I also tend to archive such data into indexed tables at regular intervals.
It depends.
How much data is in the table? How often is data inserted? A lot of indexes can slow down insertion time. Do you always query all the rows of the table? In this case indexes probably won't help much.
Those aren't common usages though. In most cases, you know you're going to be querying a subset of data. ON what fields? Are there common fields that are always joined on? Look at query plans for common or typical queries, it will generally show you where it's spending all of its time.
If there's a unique constraint on the table (and there should be at least one), then that will usually be enforced by a unique index.
Otherwise, you add indexes when the query performance is bad and adding the index will demonstrably improve the performance. There are books on the subject of how to create good sets of indexes on tables, including Relational Database Index Design and the Optimizers. It will give you a lot of ideas and the reasons why they are good.
See also:
No indexes on small tables
When to create a new SQL Server index
Best Practices and Anti-Patterns in Creating Indexes
and, no doubt, a host of others.

Table Scan vs. Add Index - which is quicker?

I have a table with many millions of rows. I need to find all the rows with a specific column value. That column is not in an index, so a table scan results.
But would it be quicker to add an index with the column at the head (prime key following), do the query, then drop the index?
I can't add an index permanently as the user is nominating what column they're looking for.
Two questions to think about:
How many columns could be nominated for the query?
Does the data change frequently? A lot of it?
If you have a small number of candidate columns, and the data doesn't change a lot, then you might want to consider adding a permanent index on any or even all candidate column.
"Blasphemy!", I hear. Most sources tell you to "never" index every column of a table, but that advised is rooted on the generic assumption that tables are modified frequently.
You will pay a price in additional storage, as well as a performance hit when the data changes.
How small is small and how much is a lot, and is the tradeoff worth it?
There is no way to tell a priory because "too slow" is usually a subjective measurement.
You will have to try it, measure the size of your indexes and then the effect they have in the searches. You will have to balance the costs against the increase in satisfaction of your customers.
[Added] Oh, one more thing: temporary indexes are not only physically slower than a table scan, but they would destroy your concurrency. Re-indexing a table usually (always?) requires a full table lock, so in effect only one user search could be done at a time.
Good luck.
I'm no DBA, but I would guess that building the index would require scanning the table anyway.
Unless there are going to be multiple queries on that column, I would recommend not creating the index.
Best to check the explain plans/execution times for both ways, though!
As everyone else has said, it most certainly would not be faster to add an index than it would be to do a full scan of that column.
However, I would suggest tracking the query pattern and find out which column(s) are searched for the most, and add indexes at least for them. You may find out that 3-4 indexes speeds up 90% of your queries.
Adding an index requires a table scan, so if you can't add a permanent index it sounds like a single scan will be (slightly) faster.
No, that would not be quicker. What would be quicker is to just add the index and leave it there!
Of course, it may not be practical to index every column, but then again it may. How is data added to the table?
It wouldn't be. Creating an index is more complex than simply scanning the column, even if the computational complexity is the same.
That said - how many columns do you have? Are you sure you can't just create an index for each of them if the query time for a single find is too long?
It depends on the complexity of your query. If you're retrieving the data once, then doing a table scan is faster. However, if you're going back to the table more than once for related information in the same query, then the index is faster.
Another related strategy is to do the table scan, and put all the data in a temporary table. Then index THAT and then you can do all your subsequent selects, groupings, and as many other queries on the subset of indexed data. The benefit being that looking up related information in related tables using the temp table is MUCH faster.
However, space is cheap these days, so you'd probably best be served by examining how your users actually USE your system and adding indexes on those frequent columns. I have yet to see users use ALL the search parameters ALL the time.
Your solution will not scale unless you add a permanent index to each column, with all of the columns that are returned in the query in the list of included columns (a covering index). These indexes will be very large, and inserts and updates to that table will be a bit slower, but you don't have much of a choice if you are allowing a user to arbitrarily select a search column.
How many columns are there? How often does the data get updated? How fast do inserts and updates need to run? There are trade-offs involved, depending on the answers to those questions. Do plenty of experimentation and testing so you know for sure how things will perform.
But to your original question, adding and dropping an index for the purpose of a single query is only beneficial if you do more than one select during the query (for example, the select is in a sub-query that gets run for each row returned).