I am inserting around 2 million records into a SQL Server 2005 table. The table currently has clustered as well as non-clustered indexes. I want to improve the performance of the insert query on that table. Does anyone have any ideas?
Drop all the indexes (including the primary key, if the data to insert is not pre-sorted by that key)
Insert the data
Recreate all the dropped indexes
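The three steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a runnable stand-in (not SQL Server), and the table `records`, the index `ix_records_name`, and the function name are hypothetical. On SQL Server the DROP/CREATE INDEX statements and the loading mechanism would differ.

```python
import sqlite3

def bulk_load_with_index_rebuild(conn, rows):
    """Drop the secondary index, insert all rows, then recreate the index."""
    cur = conn.cursor()
    # Step 1: drop the index so it is not maintained row by row during the load.
    cur.execute("DROP INDEX IF EXISTS ix_records_name")
    # Step 2: insert the data.
    cur.executemany("INSERT INTO records (id, name) VALUES (?, ?)", rows)
    # Step 3: recreate the index once, over the fully loaded table.
    cur.execute("CREATE INDEX ix_records_name ON records (name)")
    conn.commit()

# Hypothetical table and index, created only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX ix_records_name ON records (name)")

bulk_load_with_index_rebuild(conn, [(i, f"name-{i}") for i in range(10_000)])

row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('records')")]
```

Whether this actually pays off depends on the data volume and the engine, so it is worth timing both variants on a copy of the real table.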
You can try disabling the indexes on the table before inserting and enabling them again afterwards. It can be a huge timesaver if you're inserting large amounts of data into a table.
Check out this article for SQL Server on how to do such a thing: http://msdn.microsoft.com/en-us/library/ms177406.aspx
Unless there is a good reason you can't use bulk insert, I'd say that is your best option. I.e., select the rows into a format you can then bulk re-insert.
By doing ordinary inserts in this amount, you are putting a huge strain on your transaction logs.
If bulk-insert is not an option, you might win a little bit by splitting up the inserts into chunks - so that you don't go row-by-row, but don't try to insert and update it all in one fell swoop either.
I've experimented a bit with this myself, but haven't had the time to get close to a conclusive answer. (I've started the question Performance for RBAR vs. set-based processing with varying transactional sizes for the same reason.)
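A minimal sketch of the chunking idea, assuming Python's built-in sqlite3 as a runnable stand-in rather than SQL Server. The table `events`, the function name, and the chunk size are hypothetical; the right chunk size has to be found by measurement.

```python
import sqlite3
from itertools import islice

def insert_in_chunks(conn, rows, chunk_size=1000):
    """Insert rows in batches, committing once per batch.

    Returns the number of batches written.
    """
    it = iter(rows)
    batches = 0
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # One multi-row statement execution and one commit per chunk,
        # instead of one per row or a single huge transaction.
        conn.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", chunk)
        conn.commit()
        batches += 1
    return batches

# Hypothetical table, created only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

batches = insert_in_chunks(conn, ((i, f"row {i}") for i in range(2500)), chunk_size=1000)
total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Each commit ends a transaction, so no single transaction holds more than `chunk_size` rows.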
You should drop the indexes and then insert data and then recreate indexes.
On SQL Server 2008 and later, you can insert up to 1000 rows in one INSERT statement:
values (a,b,c), (d,f,h)
Sort the data by the primary key before the insert.
Use WITH (HOLDLOCK)
Related
I have a table of 50 million records with 17 columns. I want to publish the data into several other tables, which I have already built.
I wrote a SQL script for this, but it runs very slowly.
The main problem is that before I insert a record into a table, I must check that the record does not already exist in that table.
Of course, I have already done some optimization in my code. For example, I replaced a cursor with a WHILE loop, but the script is still very slow.
What can I do to speed it up?
I must check that the record does not already exist in that table.
Let the database do the work via a unique constraint or index. Decide on the columns that cannot be identical and run something like:
create unique index unq_t_col1_col2_col3 on t(col1, col2, col3);
The database will then return an error if you attempt to insert a duplicate.
This is standard functionality and should be available in any database. But, you should tag your question with the database you are using and provide more information about what you mean by a duplicate.
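To illustrate how a unique index takes over the existence check, here is a small sketch. It assumes Python's built-in sqlite3 as a runnable stand-in; the column types and the helper `try_insert` are hypothetical, and the exact error raised on a duplicate differs from one database to another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical column types; only the index definition mirrors the answer above.
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER, col3 TEXT)")
conn.execute("CREATE UNIQUE INDEX unq_t_col1_col2_col3 ON t (col1, col2, col3)")

def try_insert(conn, row):
    """Return True if the row was inserted, False if the unique index rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO t (col1, col2, col3) VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        # The database detected the duplicate; no prior SELECT was needed.
        return False

results = [try_insert(conn, r) for r in [(1, 2, "a"), (1, 2, "a"), (1, 2, "b")]]
stored = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

The second row is an exact duplicate of the first and is rejected; the third differs in `col3` and is accepted.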
I am wondering if having a UNIQUE column constraint will slow down the performance of insert operations on a table.
Yes, indexes will slow it down marginally (perhaps not even noticeably).
HOWEVER, do not forgo proper database design because you want it to be as fast as possible. Indexes will slow down an insert a tiny amount; if this amount is unacceptable, your design is almost certainly wrong in the first place and you are attacking the issue from the wrong angle.
When in doubt, test. If you need to be able to insert 100,000 rows a minute, test that scenario.
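A test of that kind could look roughly like this. It is a sketch only, assuming Python's built-in sqlite3 as a runnable stand-in; the table `t`, the index `unq_t_code`, and the function `time_inserts` are hypothetical, and timings from SQLite say nothing definite about SQL Server. The point is the shape of the experiment: same data, same load, with and without the unique index.

```python
import sqlite3
import time

def time_inserts(n_rows, unique_index):
    """Load n_rows into a fresh table; return (elapsed seconds, rows stored)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, code TEXT)")
    if unique_index:
        conn.execute("CREATE UNIQUE INDEX unq_t_code ON t (code)")
    rows = [(i, f"code-{i}") for i in range(n_rows)]

    start = time.perf_counter()
    with conn:  # a single transaction around the whole load
        conn.executemany("INSERT INTO t (id, code) VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start

    inserted = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, inserted

plain_seconds, plain_rows = time_inserts(100_000, unique_index=False)
unique_seconds, unique_rows = time_inserts(100_000, unique_index=True)
print(f"no index: {plain_seconds:.3f}s, unique index: {unique_seconds:.3f}s")
```

Running it several times and comparing medians gives a more trustworthy picture than a single run.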
I would say, yes.
The difference between inserting into a heap with and without such a constraint will be visible, since the general rule applies: more indexes, slower inserts. Also, a unique index has to be checked to see whether the row can be inserted or whether that value (or combination) already exists, so it is double work.
The slowdown will be more visible on bulk inserts of large numbers of rows. Conversely, on occasional single-row inserts the impact is going to be smaller.
The other thing is that unique constraints and indexes help the query optimizer build better SELECT plans...
Typically no. SQL Server creates an index, so it can quickly determine whether the value already exists. Maybe in an enormous table (billions of rows), but I have never seen it. Unique constraints are a very useful and convenient way to guarantee data consistency.
I have a SQL Server table with a nvarchar(50) column. This column must be unique and can't be the PK, since another column is the PK. So I have set a non-clustered unique index on this column.
During a large number of insert statements in a serializable transaction, I want to perform select queries based on this column only, in a different transaction. But these inserts seem to lock the table. If I change the datatype of the unique column to bigint, for example, no locking occurs.
Why isn't nvarchar working, whereas bigint does? How can I achieve the same, using nvarchar(50) as the datatype?
After all, mystery solved! Rather stupid situation I guess..
The problem was in the select statement. The where clause was missing the quotes, but by a devilish coincidence the existing data were all numeric, so the select wasn't failing; it simply didn't execute until the inserts committed. When the first alphanumeric data were inserted, the select statement began failing with 'Error converting data type nvarchar to numeric'.
e.g
Instead of
SELECT [my_nvarchar_column]
FROM [dbo].[my_table]
WHERE [my_nvarchar_column] = '12345'
the select statement was
SELECT [my_nvarchar_column]
FROM [dbo].[my_table]
WHERE [my_nvarchar_column] = 12345
I guess an implicit conversion was performed, so the unique index was not being used, which resulted in the blocking.
Fixed the statement and everything works as expected now.
Thanks everyone for their help, and sorry for the rather stupid issue!
First, you can change the PK to be a non-clustered index; then you could create a clustered index on this field. Of course, that may be a bad idea based on your usage, or may simply not help.
You might have a use case for a covering index, see previous question re: covering index
You might be able to change your "other queries" to non-blocking by changing the isolation level of those queries.
It is relatively uncommon to need to insert a large number of rows in a single transaction. You may be able to simply not use a transaction, or split the work into a set of smaller transactions to avoid locking large sections of the table. E.g., you can insert the records into a pending table (that is not otherwise used in normal activity) in a transaction, then migrate these records in smaller transactions to the main table if real-time posting to the main table is not required.
ADDED
Perhaps the most obvious question: are you sure you have to use a serializable transaction to insert a large number of records? These are relatively rarely necessary outside of financial transactions, and impose a high concurrency cost compared to the other isolation levels.
ADDED
Based on your comment about "all or none", you are describing atomicity, not serializability. I.e., you might be able to use a different isolation level for your large insert transaction, and still get atomicity.
Second thing, I notice you specify a large amount of insert statements. This just sounds like you should be able to push these inserts into a pending/staging table, then perform a single insert or batches of inserts from the staging table into the production table. Yes, it is more work, but you may just have an existing problem that requires the extra effort.
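The pending/staging approach described above might be sketched like this. It assumes Python's built-in sqlite3 as a runnable stand-in, not SQL Server; the tables `main` and `staging`, the function `migrate_in_batches`, and the batch size are all hypothetical. The idea shown is only the pattern: one big load into a table nobody reads, then short transactions into the production table.

```python
import sqlite3

def migrate_in_batches(conn, batch_size=500):
    """Move rows from staging to main in small transactions.

    Returns the number of batches migrated.
    """
    batches = 0
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM staging ORDER BY id LIMIT ?", (batch_size,))]
        if not ids:
            return batches
        lo, hi = ids[0], ids[-1]  # lowest remaining ids, so the range holds exactly this batch
        with conn:  # one short transaction per batch: copy, then remove from staging
            conn.execute(
                "INSERT INTO main (id, val) "
                "SELECT id, val FROM staging WHERE id BETWEEN ? AND ?", (lo, hi))
            conn.execute("DELETE FROM staging WHERE id BETWEEN ? AND ?", (lo, hi))
        batches += 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")

# The large load touches only the staging table.
with conn:
    conn.executemany("INSERT INTO staging (id, val) VALUES (?, ?)",
                     [(i, f"v{i}") for i in range(1200)])

batches = migrate_in_batches(conn, batch_size=500)
main_rows = conn.execute("SELECT COUNT(*) FROM main").fetchone()[0]
staging_rows = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
```

Note that this gives up "all or none" for the migration as a whole: if it stops midway, the remaining rows are still in the staging table and the loop can simply be resumed.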
You may want to add the NOLOCK hint (a.k.a. READUNCOMMITTED) to your query. It will allow you to perform a "dirty read" of the data that has already been inserted.
e.g.
SELECT
[my_nvarchar_column]
FROM
[dbo].[locked_table] WITH (NOLOCK)
Take a look at a better explanation here:
http://www.mssqltips.com/sqlservertip/2470/understanding-the-sql-server-nolock-hint/
And the READUNCOMMITTED section here:
http://technet.microsoft.com/en-us/library/ms187373.aspx
I want to log some information from the users of a system to a special statistics table.
There will be a lot of inserts into this table, but no reads (not by the users; only I will read it).
Will I get better performance by having two tables, where I move the rows into a table I can use for querying, just to keep the "insert" table as small as possible?
E.g., if the user behavior results in 20,000 inserts per day, the table will grow rapidly, and I am afraid the inserts will get slower and slower as more and more rows are inserted.
Inserts get slower if SQL Server has to update indexes. If your indexes mean reorganisation isn't required then inserts won't be particularly slow even for large amounts of data.
I don't think you should prematurely optimise in the way you're suggesting unless you're totally sure it's necessary. Most likely you're fine just taking a simple approach, then possibly adding a periodically-updated stats table for querying if you don't mind that being somewhat out of date ... or something like that.
The performance of the inserts depends on whether there are indexes on the table. You say that users aren't going to be reading from the table. If you yourself only need to query it rarely, then you could just leave off the indexes entirely, in which case insert performance won't degrade as the table grows.
If you do need indexes on the table, you're right that performance will start to suffer. In this case, you should look into partitioning the table and performing minimally-logged inserts. That's done by inserting the new records into a new table that has the same structure as your main table but without indexes on it. Then, once the data has been inserted, adding it to the main table becomes nothing more than a metadata operation; you simply redefine that table as a new partition of your main table.
Here's a good writeup on table partitioning.
A table in Sybase has a unique varchar(32) column, and a few other columns. It is indexed on this column too.
At regular intervals, I need to truncate it, and repopulate it with fresh data from other tables.
insert into MyTable
select list_of_columns
from OtherTable
where some_simple_conditions
order by MyUniqueId
If we are dealing with a few thousand rows, would it help speed up the insert if we have the order by clause for the select? If so, would this gain in time compensate for the extra time needed to order the select query?
I could try this out, but currently my data set is small and the results don’t say much.
With only a few thousand rows, you're not likely to see much difference even if it is a little faster. If you anticipate approaching 10,000 rows or so, that's when you'll probably start seeing a noticeable difference -- try creating a large test data set and doing a benchmark to see if it helps.
Since you're truncating, though, deleting and recreating the index should be faster than inserting into a table with an existing index. Again, for a relatively small table, it shouldn't matter -- if everything can fit comfortably in the amount of RAM you have available, then it's going to be pretty quick.
One other thought -- depending on how Sybase does its indexing, passing a sorted list could slow it down. Try benchmarking against an ORDER BY RANDOM() to see if this is the case.
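A benchmark along those lines could be sketched as follows. It assumes Python's built-in sqlite3 as a runnable stand-in, so the numbers say nothing definite about Sybase; the index name `ux_mytable_id`, the function `time_load`, and the generated 32-character keys are hypothetical. The same keys are loaded once in sorted order and once shuffled.

```python
import random
import sqlite3
import time

def time_load(keys):
    """Insert keys into a fresh uniquely indexed table; return (seconds, rows stored)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MyTable (MyUniqueId TEXT)")
    conn.execute("CREATE UNIQUE INDEX ux_mytable_id ON MyTable (MyUniqueId)")

    start = time.perf_counter()
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO MyTable (MyUniqueId) VALUES (?)",
                         ((k,) for k in keys))
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM MyTable").fetchone()[0]
    conn.close()
    return elapsed, count

# Zero-padded 32-character keys, so string order matches numeric order.
keys = [f"{i:032d}" for i in range(50_000)]
shuffled = keys[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are repeatable

sorted_seconds, sorted_count = time_load(keys)
random_seconds, random_count = time_load(shuffled)
print(f"sorted: {sorted_seconds:.3f}s, shuffled: {random_seconds:.3f}s")
```

For a fair comparison against the original query, the cost of the ORDER BY itself would have to be timed as well, since here the keys are already sorted in memory before the clock starts.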
I don't believe ordering speeds up an INSERT, so don't run ORDER BY in a vain attempt to improve performance.
I'd say that it doesn't really matter in which order you execute these functions.
Just use the normal way of inserting INSERT INTO, and do the rest afterwards.
I can't say about Sybase, but MS SQL inserts faster if the records are sorted carefully. Sorting can minimize the number of index expansions. As you know, it is better to populate the table and then create the index. Sorting the data before insertion leads to a similar effect.
The order in which you insert data will generally not improve performance. The issues that affect insert speed have more to do with your database's mechanisms for data storage than with the order of inserts.
One performance problem you may experience when inserting a lot of data into a table is the time it takes to update the indexes on the table. However, again, in this case the order in which you insert data will not help you.
If you have a lot of data, and by a lot I mean hundreds of thousands or perhaps millions of records, you could consider dropping the indexes on the table, inserting the records, then recreating the indexes.
Dropping and recreating indexes (at least in SQL Server) is by far the best way to do the inserts. At least some of the time ;-) Seriously though, if you aren't noticing any major performance problems, don't mess with it.