Are there any performance issues when inserting into a large SQL Server table which is being queried? - sql

I use SQL Server. I have a large table - millions of rows - and I iterate through them (SELECT .. WHERE ..). This is a long operation (and I assume it can't be made shorter).
So what I am asking is: will there be any problems if I insert data into that table while the SELECT is in progress? If yes, what should I do to reduce them? Same question for the UPDATE command (with indexed parameters, of course).

Yes, you will have performance issues, and more specifically, locking and blocking issues. If your SELECT statements are using indexes, which they should be, those indexes will be locked every time you INSERT data into the table. Since the table is relatively large, the locks will probably be held long enough to block your SELECT statements, and deadlocks are likely as well.
This might be a scenario where you need to re-evaluate your table structure, and possibly even consider denormalizing to avoid this.
You might also consider Enabling Row Versioning-Based Isolation Levels, assuming that you can thoroughly test the rest of your system to understand the impact.
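A minimal sketch of enabling it, assuming a hypothetical database name (switching READ_COMMITTED_SNAPSHOT on needs a moment with no other active connections):
-- run from another database, e.g. master; YourDatabase is a placeholder
ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE YourDatabase SET READ_COMMITTED_SNAPSHOT ON;  -- readers see the last committed row version instead of blocking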

The answer is yes, absolutely. A simple solution (if it's an acceptable trade-off within your application) is to specify the NOLOCK locking hint, i.e.:
select * from your_table with (NOLOCK)
The trade-off is that you won't get a consistent read, but in many cases this isn't a problem.

It's generally not a good idea to have long-running queries on a database with frequent updates. This decreases performance significantly because of locking.
It might be a good idea to look into data warehouses and see if that is something you could use. That would let you keep the transactions in one database and bulk load from it into another database that holds your warehouse.
This would greatly improve performance for both inserts and queries. The transactional database could have no indexes, and the warehouse could have all the indexes you want.
You could also put the warehouse in a column store database. That would give you the best query time with minimal effort, because there is no need to create indexes in a column store; all you have to do is design the schema properly. The drawback with column stores, however, is that inserts, updates and deletes are very slow compared to row-oriented databases. But bulk loading from the transactional database should do the trick. If you require the data to be very up to date, you could bulk load every few minutes. If you just need data from the previous day, you could bulk load into the warehouse each night.
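As a rough illustration of such an incremental bulk load in T-SQL (the database, table and column names are hypothetical, and it assumes both databases live on the same instance):
-- copy only the rows the warehouse has not seen yet
INSERT INTO warehouse.dbo.orders_fact (order_id, customer_id, amount, created_at)
SELECT o.order_id, o.customer_id, o.amount, o.created_at
FROM oltp.dbo.orders AS o
WHERE o.created_at > (SELECT COALESCE(MAX(created_at), '19000101') FROM warehouse.dbo.orders_fact);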
The possibilities are endless. If you want to look into column store warehouses, you could try MonetDB. It's an open-source column store, so you can try it out and see if it suits you.

Do not assume the execution time can't be shorter. If you query a date range, an index on the date column is a must!
Solve your problem by indexing on the date field:
-- please use correct names for your_table and date_field --
CREATE INDEX index_name ON your_table (date_field)

Warehousing, as per #Gisli, is a good option: build a copy of the data elsewhere, and run your long-running queries there, freeing up the "main" database for OLTP processing.
If this is not an option, you can mess around with snapshot isolation (something I know about, but have never worked with personally). Essentially, this will take a "snapshot" of the database at the point in time you start the query, and will execute the query as if no subsequent changes were made to the database, even if changes are made while the query is running. More importantly, any such changes are still "real" and permanent. Think of it like a short-term branching of your database.
The duration of the branch (snapshot) is where I get weak. I believe you can have the snapshot last for the duration of the query, which means you'd (possibly) never be able to get the same results for a given query twice (if the data changes while you are running it); or you can create a "saved" snapshot that can be re-used over and over until you get around to deleting it. Be wary with this, you don't want your system to get cluttered up with old forgotten branches of past data!
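A sketch of both flavours in SQL Server terms (all names are hypothetical; the per-query variant is snapshot isolation, the "saved" variant is a database snapshot):
-- per-query: requires ALLOW_SNAPSHOT_ISOLATION ON for the database
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
SELECT COUNT(*) FROM dbo.big_table WHERE created_at >= '20200101';  -- every read in this transaction sees one consistent point-in-time view
COMMIT;
-- "saved": a database snapshot that persists until you drop it
CREATE DATABASE YourDb_Snap ON (NAME = YourDb_Data, FILENAME = 'C:\Snapshots\YourDb_Snap.ss') AS SNAPSHOT OF YourDb;
-- ...point the long-running queries at YourDb_Snap...
DROP DATABASE YourDb_Snap;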

There is no PROBLEM. SQL Server is built to deal with this kind of situation; you just need to set the correct isolation level on the transactions.
There are several possible scenarios. For example, if you don't mind reading data that is still being inserted, set the isolation level of your read transaction to READ UNCOMMITTED. If you are inserting values in one range and reading values in another range, you can use SERIALIZABLE.
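For instance, a sketch of the first scenario (the query itself is hypothetical); this is equivalent in effect to adding NOLOCK to every table in the session's queries:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;  -- the long SELECT no longer waits on concurrent inserts
SELECT id, created_at FROM dbo.your_table WHERE created_at >= '20200101';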
Take a look at the possible isolation levels:
http://msdn.microsoft.com/en-us/library/ms173763.aspx

Related

avoiding write conflicts while re-sorting a table

I have a large table that I need to re-sort periodically. I am partly basing this on a suggestion I was given to stay away from using cluster keys since I am inserting data ordered differently (by time) from how I need it clustered (by ID), and that can cause re-clustering to get a little out of control.
Since I am writing to the table on an hourly basis, I am wary of causing problems with these two processes conflicting: if I CTAS to a newly sorted temp table and then swap the table name, it seems like I am opening the door to having a write on the source table not make it into the temp table.
I figure I can trigger a flag when I am re-sorting that causes the ETL to pause writing, but that seems a bit hacky and maybe fragile.
I was considering leveraging locking and transactions, but this doesn't seem to be the right use case for this since I don't think I'd be locking the table I am copying from while I write to a new table. Any advice on how to approach this?
I've asked some clarifying questions in the comments regarding the clustering that you are avoiding, but in regards to your sort, have you considered creating a nice 4XL warehouse and leveraging the INSERT OVERWRITE option back into itself? It'd look something like:
INSERT OVERWRITE INTO my_table SELECT * FROM my_table ORDER BY id;
Assuming that your table isn't hundreds of TB in size, this will complete rather quickly (inside an hour, I would guess), and any inserts into the table during that period will queue up and wait for it to finish.
There are some reasons to avoid automatic reclustering, but they're basically all the same reasons why you shouldn't set up a job to re-cluster frequently. You're making the database do all the same work, but without the built-in management of it.
If your table is big enough that you are seeing performance issues with the clustering by time, and you know that the ID column is the main way that this table is filtered (in JOINs and WHERE clauses) then this is probably a good candidate for automatic clustering.
So I would recommend at least testing out a cluster key on the ID and then monitoring/comparing performance.
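A sketch of what that test could look like in Snowflake, with hypothetical table and column names:
ALTER TABLE my_table CLUSTER BY (id);                      -- let automatic clustering maintain the order going forward
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(id)');  -- then compare clustering depth and query times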
To give a brief answer to the question about re-sorting without conflicts as written:
I might recommend using a time column to re-sort records older than a given time (probably into a separate table). While it's sorting, you may get some new records, but you will be able to use that time column to marry up those new records with the now-sorted older records.
You might consider revoking INSERT, UPDATE, DELETE privileges on the original table within the same script or procedure that performs the CTAS creating the newly sorted copy of the table. After a successful swap you can re-enable the privileges for the roles that are used to perform updates.
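A rough Snowflake sketch of that flow, with hypothetical table and role names:
REVOKE INSERT, UPDATE, DELETE ON TABLE my_table FROM ROLE etl_role;             -- pause the hourly writes
CREATE OR REPLACE TABLE my_table_sorted AS SELECT * FROM my_table ORDER BY id;  -- CTAS the sorted copy
ALTER TABLE my_table_sorted SWAP WITH my_table;                                 -- atomic name swap
GRANT INSERT, UPDATE, DELETE ON TABLE my_table TO ROLE etl_role;                -- resume writes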

SQL Server table locking within a stored procedure without using a transaction

I need to execute a stored procedure which includes 3 queries referring to the same table:
MERGE …
SELECT X FROM TABLE WHERE { BLA BLA }
UPDATE Y FROM TABLE WHERE { BLA BLA }
The stored procedure should be thread-safe and should execute as an atomic operation.
Currently I am using a transaction with serializable isolation level and a WITH ( XLOCK, TABLOCK ) hint on every query.
Is there a way to sustain a table lock for the duration of the stored procedure without using a transaction, which causes a performance penalty?
Cheers,
Doron
Whenever a developer chooses TABLOCKX or (XLOCK, TABLOCK), they have instantly lost the right to ask questions about performance.
It's not the transaction that is causing the performance penalty; it's holding the locks. So your question really is:
Is there a way to produce performance penalties without the effect of producing performance penalties?
The answer is no, there is no such way.
Doron, what you want to accomplish can only be done using a transaction. I don't understand why you say a transaction causes a performance overhead; that is a completely untrue statement. Unless what you mean is that your transaction takes e.g. 10 seconds, and in those 10 seconds other transactions are blocked.
Now, I regularly work with and design databases that have to sustain around 80K transactions per second, and doing this you learn a few tricks. What I would suggest is that you take a step back and re-evaluate your query and table architecture. If this is a highly transactional database, the first thing I suggest is to get rid of any foreign key constraints; they are a performance hit on any transactional DB.
The other thing is to look at indexes: do you have the right indexes, and are you perhaps over-indexing tables that have to be inserted into and updated? This can cause massive performance impacts!
If you cannot re-architect the tables, may I suggest thinking outside the box a little: perhaps select the data you want (with NOLOCK) into temp tables, and then perform your merges etc.
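As a sketch only, with hypothetical table and column names:
SELECT id, amount
INTO #staging
FROM dbo.source_table WITH (NOLOCK)   -- the dirty read is the accepted trade-off here
WHERE amount > 0;
MERGE dbo.target_table AS t
USING #staging AS s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);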
Perhaps if you give me a more concrete example, I can assist more.
But for now, tell me what you can and cannot do. Hope it helps!

SQL transaction affecting a big amount of rows

The situation is as follows:
A big production client/server system where one central database table has a certain column that used to have NULL as its default value but now has 0 as the default. All the rows created before that change of course still have NULL there, and that generates a lot of unnecessary error messages in this system.
The solution is of course as simple as this:
update theTable set theColumn = 0 where theColumn is null
But I guess it's gonna take a lot of time to complete this transaction? Apart from that, will there be any other issues I should think of before I do this? Will this big transaction block the whole database, or that particular table during the whole update process?
This particular table has about 550k rows; 500k of them have a NULL value and will be affected by the above SQL statement.
The impact on the performance of other connected clients depends on:
How fast the servers hardware is
How many indexes containing that column your update statement has to maintain
Which transaction isolation settings the other clients use when connecting to the database
The DB engine will acquire write locks, so if your clients only need read access to the table, it should not be a big problem.
500,000 records does not sound like too much to me, but as I said, the time and resources the update takes depend on many factors.
Do you have a similar test system, where you can try out the update?
Another solution is to split the one big update into many small ones and call them in a loop.
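A minimal sketch of such a loop in T-SQL (the batch size is an arbitrary assumption):
WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) theTable
    SET theColumn = 0
    WHERE theColumn IS NULL;       -- each batch commits on its own, so locks are held only briefly
    IF @@ROWCOUNT = 0 BREAK;
END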
If you have clients writing frequently to that table, your update statement might be blocked "forever". I have seen databases where performing the update row by row was the only way to get it through. But that was a table with about 200,000,000 records and about 500 very active clients!
it's gonna take a lot of time to complete this transaction
There's no definite way to say. It depends a lot on the hardware, the number of concurrent sessions, whether the table already has locks on it, the number of interdependent triggers, et al.
Will this big transaction block the whole database, or that particular table during the whole update process
If the "whole database" is dependent on this table then it might.
will there be any other issues I should think of before I do this
If the table has been locked by another transaction, you might run into a row-lock situation, and in rare cases perhaps a deadlock. Best would be to ensure that no one is using the table, check for any pre-existing locks, and then run the statement.
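One way to check for existing locks on the table beforehand in SQL Server (a sketch, using the table name from the question):
SELECT request_session_id, request_mode, request_status
FROM sys.dm_tran_locks
WHERE resource_type = 'OBJECT'
  AND resource_database_id = DB_ID()
  AND resource_associated_entity_id = OBJECT_ID('dbo.theTable');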
Locking issues are vendor specific.
Assuming no triggers on the table, half a million rows is not much for a dedicated database server, even with many indexes on the table.

When it comes to updating all rows in a table, does the method of locking matter for performance?

Question is a follow up to this.
The SQL in question was
UPDATE stats SET visits = (visits+1)
And the question is, for the purpose of performance, does it matter if you lock all rows in stats in comparison to locking the table stats? Or, if the database uses a page-lock rather than a table/row lock?
There is no predicate on this. Any self-respecting DB engine should work this out and realise that all rows need to be updated.
Generally, don't second-guess the DB engine: performance is subjectively the same.
Personally,
I'd not use table or locking hints unless I have to and know why I'm doing it.
I'd not issue a query like this anyway from an application without a WHERE clause
In theory you should lock the table, because 1 lock is cheaper than 1M locks.
Many DBs, though, will escalate locks for operations like this. As they see the number of locks growing, they'll automatically escalate to page and table locks.
But, as with anything, "it depends", and it's better to be specific and lock the table yourself.
Edit:
sigh
Postgres example:
BEGIN;
LOCK TABLE mytable IN EXCLUSIVE MODE;
UPDATE mytable SET field = field + 1;
COMMIT;
Here's the deal: this is going to happen ANYWAY. The LOCK TABLE command makes it explicit and ensures that your intent, locking the table, is clear and that the lock is in place before the update runs.
Would I do this on a 10 row table? No.
Would I do this on a database that I KNEW I had exclusive access to? No, there's no need.
Would I do this on an operational database with a table with a large number of rows? You bet.

Best practice for archiving a huge table of over 1,000,000,000 rows

I'm using SQL Server 2005. There is an audit trail table containing over 1,000,000,000 rows. I'm planning to archive this table. When I run a simple SELECT with NOLOCK, I can still see blocking (probably I/O blocking with another process?). So are there any best practices for this kind of situation?
For a table that large you will want to find an effective sharding/partitioning strategy. Archiving in this sense tends to be a form of partitioning, but not a good one, since you often want to query over the current and archived data anyway. In the worst case you end up with a SELECT over a UNION of the archive and current tables, which is worse than if you hadn't split them at all.
You will often do better by finding some other means to slice the data, say by record type or something. But if you are going to split it by date, make absolutely sure you won't query over the archive+current data set.
Also, SQL Server 2005+ does not enable MVCC by default. It can do so, however, if you enable what MS calls Snapshot Isolation. See Serializable vs. Snapshot Isolation Level.
The effect of not having this enabled is that an uncommitted INSERT or UPDATE will block a SELECT in another transaction until the first transaction commits or rolls back. That can cause unnecessary locks and limit your scalability.
Create a backup of the database and restore it in the archive location.
Selecting 1 billion rows all at once is going to strain the server no matter how you do it.
Do it in batches instead, say 1000 rows at a time. The bcp tool does this automatically. Or use SSIS to copy the data into another database - it does pretty much the same thing.
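If you want to stay in plain T-SQL, a sketch of a batched move (the names, cutoff date and batch size are hypothetical, and it assumes an archive table with the same columns as the source):
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) FROM dbo.audit_trail
    OUTPUT DELETED.* INTO dbo.audit_trail_archive   -- copy each removed batch into the archive table
    WHERE audit_date < '20120101';
    IF @@ROWCOUNT = 0 BREAK;
END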