Is performance of SQL insert dependent on number of rows in table?

I want to log some information from the users of a system to a special statistics table.
There will be a lot of inserts into this table, but no reads (not by the users; only I will read from it).
Will I get better performance by having two tables, where I move the rows into a second table I can use for querying, just to keep the "insert" table as small as possible?
E.g. if the user behavior results in 20,000 inserts per day, the table will grow rapidly, and I am afraid the inserts will get slower and slower as more and more rows are inserted.

Inserts get slower if SQL Server has to update indexes. If your indexes mean reorganisation isn't required then inserts won't be particularly slow even for large amounts of data.
I don't think you should prematurely optimise in the way you're suggesting unless you're totally sure it's necessary. Most likely you're fine just taking a simple approach, then possibly adding a periodically-updated stats table for querying if you don't mind that being somewhat out of date ... or something like that.
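For example, here is a minimal sketch of that approach (all table and column names are made up for illustration): keep the raw log table free of extra indexes so inserts stay cheap, and refresh a small summary table on a schedule for your own reporting.

-- raw log table: no extra indexes, so inserts stay cheap
create table UserActionLog (
    UserId     int         not null,
    ActionCode varchar(20) not null,
    LoggedAt   datetime    not null
)

-- small summary table, refreshed periodically (e.g. by a nightly job)
create table UserActionStats (
    ActionCode  varchar(20) not null,
    ActionDate  date        not null,
    ActionCount int         not null
)

-- nightly job: rebuild yesterday's numbers from the raw log
declare @yesterday date = cast(dateadd(day, -1, getdate()) as date)
declare @today     date = cast(getdate() as date)

delete from UserActionStats where ActionDate = @yesterday

insert into UserActionStats (ActionCode, ActionDate, ActionCount)
select ActionCode, @yesterday, count(*)
from UserActionLog
where LoggedAt >= @yesterday and LoggedAt < @today
group by ActionCode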

The performance of the inserts depends on whether there are indexes on the table. You say that users aren't going to be reading from the table. If you yourself only need to query it rarely, then you could just leave off the indexes entirely, in which case insert performance won't change.
If you do need indexes on the table, you're right that performance will start to suffer. In this case, you should look into partitioning the table and performing minimally-logged inserts. That's done by inserting the new records into a new table that has the same structure as your main table but without indexes on it. Then, once the data has been inserted, adding it to the main table becomes nothing more than a metadata operation; you simply redefine that table as a new partition of your main table.
Here's a good writeup on table partitioning.
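If you go the partitioning route, the switch itself looks roughly like this in SQL Server (a sketch only; the table names and partition number are placeholders, and in practice the staging table needs a check constraint and matching indexes before the switch is allowed):

-- 1. bulk-load the new rows into a staging table with the same column layout
insert into StatsLog_Staging (UserId, ActionCode, LoggedAt)
select UserId, ActionCode, LoggedAt
from SomeSourceTable

-- 2. attach the staging table as a partition of the main table;
--    this is a metadata-only operation, no rows are physically copied
alter table StatsLog_Staging
    switch to StatsLog partition 5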


Limit MS SQL DB Index Size using Archival Column

In the project I'm working with, we have a table which sees a lot of read/write activity.
It's sort of a "visibilities" table; a background job is constantly running to generate records into this table based on the creation of other business domain entities.
This table needs to get searched against (and is being updated) on a regular basis, and we're running into performance problems because of it.
When we introduce indexes to improve the speed of search, it causes timeout issues with writing to the table when people perform updates. The table is relatively large and the search criteria are a bit complex, so the indexes are large.
What I'm wondering is: if I added an "archived" bit column to the table and regularly marked sufficiently old records as archived, could I restructure the indexes to only index data where Archived=0? Would that allow me to reduce the size of the indexes (and thus the performance impact of writing to those tables)?
I would assume no since the indexes must still consider which records are archived or not, but I'm not a SQL expert so I wanted to check.
If that would not be an ideal setup, what might I do to accomplish a similar result?
You can create a Filtered Index, which indexes only the rows where Archived=0, and can be used only in queries that specify WHERE Archived=0 and ....
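For example, a minimal sketch (the table and extra column names are made up; only Archived is from the question):

-- only rows with Archived = 0 are kept in this index, so it stays small
create nonclustered index IX_Visibility_Active
    on VisibilityRecords (EntityId, VisibleFrom)
    where Archived = 0

-- queries must repeat the filter for the optimizer to consider the filtered index
select EntityId, VisibleFrom
from VisibilityRecords
where Archived = 0 and EntityId = 12345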

Creating index for a query

I have a table Person with two columns, Name and Gender, and suppose in my application I have a query which is called frequently:
select * from Person where Gender = 'M'
So is it advisable to create an index on the column Gender?
It's not advisable unless there are loads of one and only a few of the other, and your query only looks at the few. A full table scan would give you a much more efficient result than diving through an index. In fact, even if you created the index, it's highly unlikely the optimiser would use it.
The points below might give you an idea:
From Documentation
In general, index access paths are more efficient for statements that retrieve a small subset of table rows, whereas full table scans are more efficient when accessing a large portion of a table.
Do not index columns that are modified frequently. UPDATE statements that modify indexed columns and INSERT and DELETE statements that modify indexed tables take longer than if there were no index. Such SQL statements must modify data in indexes as well as data in tables. They also generate additional undo and redo.
When choosing to index a key, consider whether the performance gain for queries is worth the performance loss for INSERTs, UPDATEs, and DELETEs and the use of the space required to store the index. You might want to experiment by comparing the processing times of the SQL statements with and without indexes. You can measure processing time with the SQL trace facility.
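If you would rather measure than guess, one quick way to compare (SQL Server syntax; Person and Gender are from the question, the index name is made up) is:

create index IX_Person_Gender on Person (Gender)

-- report logical reads and elapsed time in the messages output
set statistics io on
set statistics time on

select * from Person where Gender = 'M'
-- with a roughly 50/50 split the optimizer will usually ignore
-- IX_Person_Gender and scan the table anyway

drop index IX_Person_Gender on Person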

Index all columns

Knowing that an indexed column leads to better performance, is it worthwhile to index all columns in all tables of the database? What are the advantages/disadvantages of such an approach?
If it is worthwhile, is there a way to auto-create indexes in SQL Server? My application dynamically adds tables and columns (depending on the user configuration) and I would like to have them auto-indexed.
It is difficult to imagine real-world scenarios where indexing every column would be useful, for the reasons mentioned in the other answers. The kind of scenario that would require it is a bunch of different queries, each accessing exactly one column of the table, with each query accessing a different column.
The other answers don't address the issues during the select side of the query. Obviously, maintaining indexes is an issue, but if you are creating the table/s once and then reading many, many times, the overhead of updates/inserts/deletes is not a consideration.
An index contains the original data along with pointers to the records/pages where the data resides. The structure of an index makes it fast to do things like: find a single value, retrieve values in order, count the number of distinct values, and find the minimum and maximum values.
An index does not only take up space on disk. More importantly, it occupies memory, and memory contention is often the factor that determines query performance. In general, building an index on every column will occupy more space than the original data. (One exception would be a column that is relatively wide and has relatively few values.)
In addition, to satisfy many queries you may need one or more indexes plus the original data. Your page cache gets rather filled with data, which can increase the number of cache misses, which in turn incurs more overhead.
I wonder if your question is really a sign that you have not modelled your data structures adequately. There are few cases where you want users to build ad hoc permanent tables. More typically, their data would be stored in a pre-defined format, which you can optimize for the access requirements.
No, because you have to take into consideration that every time you add or update a record, the indexes have to be updated as well, and having indexes on all columns would take a lot of time and lead to bad performance.
So for databases like data warehouses, where only select queries are used, it can be a good idea, but on a normal database it's a bad idea.
Also, just because you use a column in a where clause doesn't mean you have to add an index on it.
Try to find a column whose values are almost all unique, like a primary key, and that you don't edit often.
A bad idea would be to index the sex of a person, because there are only 2 possible values, so the index would merely split the data in half and a lookup would still have to search almost every record.
No, you should not index all of your columns, and there are several reasons for this:
There is a cost to maintain each index during an insert, update or delete statement, that will cause each of those transactions to take longer.
It will increase the storage required since each index takes up space on disk.
If the column values are not selective enough, the index will simply be ignored (e.g. a gender flag).
Composite indexes (indexes with more than one column) can greatly benefit performance for frequently run WHERE, GROUP BY, ORDER BY or JOIN clauses (see the sketch after this answer), and multiple single-column indexes are rarely combined as effectively.
You are much better off using Explain plans and data access and adding indexes when necessary (and only when necessary, IMHO), rather than creating them all up front.
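To illustrate the composite-index point with a sketch (the table and columns are hypothetical):

-- one composite index that matches a common query shape
create index IX_Orders_Customer_Date on Orders (CustomerId, OrderDate)

-- the query below can seek on CustomerId and read the rows
-- already ordered by OrderDate, with no separate sort step
select OrderId, OrderDate
from Orders
where CustomerId = 42
order by OrderDate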
No, there is overhead in maintaining the indexes, so indexing all columns would slow down all of your insert, update and delete operations. You should index the columns that you are frequently referencing in WHERE clauses, and you will see a benefit.
Indexes take up space. And they take up time to create, rebuild, maintain, etc. So there's not a guaranteed return on performance for indexing just any old column. You should index the columns that give the performance for the operations you'll use. Indexes help reads, so if you're mostly reading, index columns that will be searched on, sorted by, or joined to other tables relationally. Otherwise, it's more expensive than what benefit you may see.
Every index requires additional CPU time and disk I/O overhead during inserts and deletions.
Indices on non-primary keys might have to be changed on updates, although an index on the primary key might not (this is because updates typically do not modify the primary-key attributes).
Each extra index requires additional storage space.
For queries which involve conditions on several search keys, efficiency might not be bad even if only some of the keys have indices on them.
Therefore, database performance is improved less by adding indices when many indices already exist.

Is it quicker to insert sorted data into a Sybase table?

A table in Sybase has a unique varchar(32) column, and a few other columns. It is indexed on this column too.
At regular intervals, I need to truncate it, and repopulate it with fresh data from other tables.
insert into MyTable
select list_of_columns
from OtherTable
where some_simple_conditions
order by MyUniqueId
If we are dealing with a few thousand rows, would it help speed up the insert if we have the order by clause for the select? If so, would this gain in time compensate for the extra time needed to order the select query?
I could try this out, but currently my data set is small and the results don’t say much.
With only a few thousand rows, you're not likely to see much difference even if it is a little faster. If you anticipate approaching 10,000 rows or so, that's when you'll probably start seeing a noticeable difference -- try creating a large test data set and doing a benchmark to see if it helps.
Since you're truncating, though, deleting and recreating the index should be faster than inserting into a table with an existing index. Again, for a relatively small table, it shouldn't matter -- if everything can fit comfortably in the amount of RAM you have available, then it's going to be pretty quick.
One other thought -- depending on how Sybase does its indexing, passing a sorted list could slow it down. Try benchmarking against an ORDER BY RANDOM() to see if this is the case.
I don't believe ordering speeds up an INSERT, so don't add ORDER BY in a vain attempt to improve performance.
I'd say that it doesn't really matter in which order you execute these functions.
Just use the normal way of inserting INSERT INTO, and do the rest afterwards.
I can't speak for Sybase, but MS SQL inserts faster if records are carefully sorted. Sorting can minimize the number of index page splits. As you know, it is better to populate the table and then create the index; sorting the data before insertion has a similar effect.
The order in which you insert data will generally not improve performance. The issues that affect insert speed have more to do with your database's mechanisms for data storage than the order of inserts.
One performance problem you may experience when inserting a lot of data into a table is the time it takes to update indexes on the table. However again in this case the order in which you insert data will not help you.
If you have a lot of data, and by a lot I mean hundreds of thousands, perhaps millions, of records, you could consider dropping the indexes on the table, inserting the records, then recreating the indexes.
Dropping and recreating indexes (at least in SQL Server) is by far the best way to do the inserts. At least some of the time ;-) Seriously though, if you aren't noticing any major performance problems, don't mess with it.
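A sketch of that drop-and-recreate pattern using the names from the question (Sybase-style syntax; the index name is a placeholder):

-- drop the unique index before the bulk load
drop index MyTable.MyUniqueId_idx

truncate table MyTable

insert into MyTable
select list_of_columns
from OtherTable
where some_simple_conditions

-- rebuild the index once, after all the rows are in place
create unique index MyUniqueId_idx on MyTable (MyUniqueId)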

When should you consider indexing your sql tables?

How many records should there be before I consider indexing my sql tables?
There's no good reason to forego obvious indexes (FKs, etc.) when you're creating the table. It will never noticeably affect performance to have unnecessary indexes on tiny tables, and it's good to take a first cut when your mind is into schema design. Also, some indexes serve to prevent duplicates, which can be useful regardless of table size.
I guess the proper answer to your question is that the number of records in the table should have nothing to do with when to create indexes.
I would create the indexes when I create my table. If you decide to create indices after the table has grown to 100, 1,000, or 100,000 entries, it can just take a lot of time and perhaps make your database unavailable while you are doing it.
Think about the table first, create the indices you think you'll need, and then move on.
In some cases you will discover that you should have indexed a column; if that's the case, fix it when you discover it.
Creating an index on a searched field is not a premature optimization; it's just what should be done.
When the query time is unacceptable. Better yet, create a few indexes now that are likely to be useful, and run an EXPLAIN or EXPLAIN ANALYZE on your queries once your database is populated by representative data. If the indexes aren't helping, drop them. If there are slow queries that could benefit from more or different indexes, change the indexes.
You are not going to be locked in to an initial choice of indexes. Experiment, and make sure you measure performance!
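For example (PostgreSQL-style syntax; the table, column and index names are illustrative):

-- run the query against representative data and look at the actual plan
explain analyze
select * from orders where customer_id = 42;

-- if the plan shows a slow sequential scan, try an index and re-check
create index idx_orders_customer_id on orders (customer_id);

explain analyze
select * from orders where customer_id = 42;

-- if the index never shows up in the plans, drop it again
drop index idx_orders_customer_id;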
In general I agree with the previous advice.
Always declare the referential integrity for the tables (Primary Key, Foreign Keys), column constraints (not null, check). Saves you from nightmares when apps put bad data into the tables (even in development).
I'd consider adding indexes for the common access columns (columns in your where clauses which are used in =, <> tests), as well.
Most of the modern RDBMS implementations are quite good at keeping your indexes up to date without hitting your performance. So, the cost of having indexes is minimal.
Also, most RDBMS's have query plan evaluators which look at the relative costs going to the data rows via the index, or using some sort of table scan. So, again the performance hits are minimal.
Two.
I'm serious. If there are two rows now, and there will always be two rows, the cost of indexing is almost zero. It's quicker to index than to ponder whether you should. It won't take the optimizer very long to figure out that scanning the table is quicker than using the index.
If there are two rows now, but there will be 200,000 in the near future, the cost of not indexing could become prohibitively high. The right time to consider indexing is now.
Having said this, remember that you get an index automatically when you declare a primary key. Creating a table with no primary key is asking for trouble in most cases. So the only time you really need to consider indexing is when you want an index other than the index on the primary key. You need to know the traffic, and the anticipated volume to make this call. If you get it wrong, you'll know, and you can reverse the decision.
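For example (SQL Server syntax, hypothetical table): the primary key below is backed by an index automatically, so only additional access paths need an explicit create index.

create table Customer (
    CustomerId int          not null primary key,  -- backed by a unique (clustered, by default) index
    Name       varchar(100) not null
)

-- an extra index only for a query path you actually expect, e.g. frequent lookups by Name
create index IX_Customer_Name on Customer (Name)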
I once saw a reference table that had been created with no index when it contained 20 rows. Due to a business change, this table had grown to about 900 rows, but the person who should have noticed the absence of an index didn't. The time to insert a new order had grown from about 10 seconds to 15 minutes.
As a matter of routine I perform the following on read heavy tables:
Create indexes on common join fields such as Foreign Keys when I create the table (see the sketch after this list).
Check the query plan for Views or Stored Procedures and add indexes wherever a table scan is indicated.
Check the query plan for queries by my application and add indexes wherever a table scan is indicated. (and often try to make them into Stored Procedures)
On write heavy tables (like activity logs) I avoid indexes unless they are absolutely necessary. I also tend to archive such data into indexed tables at regular intervals.
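A sketch of the first habit (SQL Server syntax; the tables are hypothetical):

create table Invoice (
    InvoiceId   int      not null primary key,
    InvoiceDate datetime not null
)

create table InvoiceLine (
    InvoiceLineId int not null primary key,
    InvoiceId     int not null references Invoice (InvoiceId),
    Quantity      int not null
)

-- the foreign key column is not indexed automatically, so add the join index up front
create index IX_InvoiceLine_InvoiceId on InvoiceLine (InvoiceId)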
It depends.
How much data is in the table? How often is data inserted? A lot of indexes can slow down insertion time. Do you always query all the rows of the table? In this case indexes probably won't help much.
Those aren't common usages though. In most cases, you know you're going to be querying a subset of the data. On what fields? Are there common fields that are always joined on? Look at query plans for common or typical queries; they will generally show you where the time is being spent.
If there's a unique constraint on the table (and there should be at least one), then that will usually be enforced by a unique index.
Otherwise, you add indexes when the query performance is bad and adding the index will demonstrably improve the performance. There are books on the subject of how to create good sets of indexes on tables, including Relational Database Index Design and the Optimizers. It will give you a lot of ideas and the reasons why they are good.
See also:
No indexes on small tables
When to create a new SQL Server index
Best Practices and Anti-Patterns in Creating Indexes
and, no doubt, a host of others.