PostgreSQL: Autovacuum partitioned tables - sql

We have a very large table that was partitioned into monthly tables. We have no autovacuum parameters set in the postgresql.conf file, so it's on by default with default parameters.
The past months' tables (table_201404, table_201403) are not written to, updated, or deleted from once those months have passed; they are only read for historical data. Why are we seeing autovacuum processes running on these tables? Is it because they are part of a main partitioned table and PostgreSQL sees them all as one?
We are toying with the idea of setting autovacuum_enabled to off for these past tables, but I wanted to consult the wisdom of Stack Overflow first.
Thanks all...

Even read-only tables need to be vacuumed for wrap-around once every 2 billion transactions, and under the default settings are vacuumed for wrap-around once every 150 million transactions.
The transaction IDs stored with each row are 32 bits, so they wrap around eventually. To prevent this from causing problems, any very old transaction ID has to be replaced with a magic value meaning "older than all other IDs", so the table has to be scanned to do that replacement. If the table never changes, eventually every transaction ID will have been replaced with the magic value, and conceptually that table no longer needs to be scanned. But that fact is not stored anywhere, so the table still needs to be scanned every now and then so that the system can observe that the rows are all still OK. Fortunately the scan is done sequentially and only needs to read, not write, so it should be fairly efficient.
It is possible that the whole thing will be redone in 9.5 so that tables like that would no longer need to be scanned.
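To see how close each monthly table is to a forced anti-wraparound vacuum, a query along these lines may help. It is only a sketch: the `table_2014%` pattern is assumed from the table names in the question, and whether a manual freeze is worthwhile depends on your version and workload.

```sql
-- PostgreSQL. Age (in transactions) of the oldest unfrozen transaction ID in each
-- monthly table, next to the setting at which autovacuum forces a wraparound scan.
SELECT c.relname,
       age(c.relfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age') AS forced_vacuum_at
FROM pg_class AS c
WHERE c.relkind = 'r'
  AND c.relname LIKE 'table_2014%'   -- assumed naming pattern
ORDER BY age(c.relfrozenxid) DESC;

-- Optional: freeze a finished month once, at a quiet time you choose.
VACUUM FREEZE table_201403;
```

Note that setting autovacuum_enabled to off for a table does not stop these scans: PostgreSQL still launches autovacuum on it when that is needed to prevent transaction ID wraparound.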

Related

Why is the transaction log growing so large?

I'm performing an update on a DB that is inserting a 15 digit number into 270,000,000 rows of a single column. I think the space required should be around 4GB but it is still running and the transaction log has just hit 180GB.
Transactions have to store a lot of information just in case the changes need to be rolled back.
There needs to be a sequential value to know which order the records were updated/inserted. It needs to store the original value for the column (some RDBMSs might even store the whole row!). It needs a unique identifier to tie the data back to the row's location.
It has to store so much data because if something catastrophic happens -- like the database crashing -- it needs to be able to return to a consistent state.
Yes, 15 digits * 270 mil may come out to 4 GB, but that completely ignores all of the very important metadata required.
If this is a one-off update that doesn't need to be repeated, it may be faster to simply recreate the table with the column updated. Compared to inserts/updates/deletes, table creates from selects require almost no transaction logging.
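A minimal sketch of the recreate-the-table approach, assuming SQL Server and a database in SIMPLE or BULK_LOGGED recovery (where SELECT ... INTO can be minimally logged). All table and column names here are hypothetical.

```sql
-- T-SQL sketch; dbo.BigTable, Id, OtherColumn and NewNumber are hypothetical names.
-- Build a new copy of the table with the column already populated.
SELECT t.Id,
       t.OtherColumn,
       CAST(123456789012345 AS BIGINT) AS NewNumber  -- placeholder for the real 15-digit value
INTO dbo.BigTable_New
FROM dbo.BigTable AS t;

-- After checking the copy and recreating indexes, constraints and permissions,
-- swap the names.
EXEC sp_rename 'dbo.BigTable', 'BigTable_Old';
EXEC sp_rename 'dbo.BigTable_New', 'BigTable';
```

SELECT ... INTO does not copy indexes, constraints or triggers, so those have to be recreated on the new table before the swap.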
Probably, all pages split due to the significant amount of data added (4/180 = 2.2%; might not seem significant but probably pushes many pages over the edge).
Rebuild the clustered index with a fillfactor (probably 90 is enough). Then, you will not have any page splits when updating.
If this does not help we need to dig deeper.
In any case there will be significant log growth, and it will be bigger than 4GB for sure. Still, 180GB sounds like too much. That sounds like whole pages are being stored.
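The fill factor rebuild suggested above could look roughly like this. The index and table names are hypothetical; 90 is the value proposed in the answer, not a measured optimum.

```sql
-- T-SQL sketch; PK_BigTable and dbo.BigTable are hypothetical names.
-- Rebuild the clustered index leaving about 10% free space in each leaf page,
-- so rows that grow during the update are less likely to split pages.
ALTER INDEX PK_BigTable ON dbo.BigTable
REBUILD WITH (FILLFACTOR = 90);
```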

Slow queries on 'transaction' table - sql partition as a solution?

I have a table with 281,433 records in it, ranging from March 2010 to the current date (Sept 2014). It's a transaction table which consists of records that determine stock which is currently in and out of the warehouse.
When making picks from the warehouse, the system needs to look over every transaction from a particular customer that was ever made (based on the AccountListID field, which determines the customer, a customer might on average have about 300 records in the table). This happens 2-3 times per request from the particular .NET application when a picking run is done.
There are times when the database seemingly locks out. Some requests complete no bother, within about 3 seconds. Others hang for 'up to 4 minutes' according to the end users.
My guess is with 4-5 requests at the same time all looking at this one transaction table things are getting locked up.
I'm thinking about partitioning this table so that the primary transaction table only contains records from the last 2 years. The end user has agreed that any records past this date are unnecessary.
But I can't just delete them, they're used elsewhere in the system. I have indexes already in place and they make a massive difference (going from >30 seconds to <2, on the accountlistid field). It seems partitioning is the next step.
1) Am I going down the right route as a solution to my 'locking' problem?
2) When moving a set of records (e.g. records where the field DateTimeCheckedIn is more than 2 years old) is this a manual process or does partitioning automatically do this?
Partitioning shouldn't be necessary on a table with fewer than 300,000 rows, unless each record is really big. If each record occupies more than 4k bytes, only one fits on an 8 KB page, so you have 300,000 pages (about 2,400,000,000 bytes), and that is getting large.
Indexes are usually the solution for something like this. Taking more than a second to return 300 records in an indexed database seems like a long time (unless the records are really big and the network overhead adds to the time). Your table and index should both fit into memory. Check your memory configuration.
The next question is about the application code. If it uses cursors, then these might be the culprit by locking rows under certain circumstances. Read-only cursors declared as "FAST_FORWARD" or "FORWARD_ONLY READ_ONLY" should be fast. If the application code is locking all the historical records, then you might get contention, since this can occur whenever two records (for different customers) are on the same data page. The solution is to not lock the historical records as you read them, or to avoid using cursors altogether.
I don't think partitioning will be necessary here. You can probably fix this with a well-placed index: I'm thinking of a single index covering (in order) company, part number, and quantity. Or, if it's an old server, possibly just add RAM. Finally, since this is reading a lot of older data for transactions, where individual transactions themselves are likely never (or at most very rarely) updated once written, you might do better with a READ UNCOMMITTED isolation level for this query.
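As a rough illustration of that index and isolation level: AccountListID comes from the question, while the table name and the PartNumber and Quantity columns are hypothetical stand-ins for whatever the real schema uses.

```sql
-- T-SQL sketch; dbo.Transactions, PartNumber and Quantity are assumed names.
-- One index that covers the per-customer lookup: customer, then part, then quantity.
CREATE NONCLUSTERED INDEX IX_Transactions_Account_Part_Qty
ON dbo.Transactions (AccountListID, PartNumber, Quantity);

DECLARE @AccountListID INT = 42;  -- example value

-- Read one customer's history without taking shared locks.
-- READ UNCOMMITTED can return uncommitted (dirty) rows, so use it only
-- where that is acceptable.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT PartNumber, SUM(Quantity) AS QuantityOnHand
FROM dbo.Transactions
WHERE AccountListID = @AccountListID
GROUP BY PartNumber;
```

Because every column the query touches is in the index, the lookup should not need to visit the base table at all.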

T-SQL Optimize DELETE of many records

I have a table that can grow to millions of records (50 million, for example). Every 20 minutes, records that are older than 20 minutes are deleted.
The problem is that when the table has that many records, the deletion can take a long time, and I want to make it faster.
I cannot use "truncate table" because I want to remove only the records that are older than 20 minutes. I suppose that while doing the "delete" and filtering the rows that need to be deleted, the server is writing to a log file or something similar, and this takes a lot of time?
Am I right? Is there a flag or option I can turn off to speed up the delete, and then turn back on afterwards?
To expand on the batch delete suggestion, I'd suggest you do this far more regularly (every 20 seconds perhaps). Batch deletions are easy:
-- Delete expired rows in small batches so each transaction and its locks stay short.
WHILE 1 = 1
BEGIN
    DELETE TOP (4000)
    FROM YOURTABLE
    WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE());
    -- Stop once a pass finds nothing left to delete.
    IF @@ROWCOUNT = 0
        BREAK;
END
Your inserts may lag slightly whilst they wait for the locks to release but they should insert rather than error.
In regards to your table though, I'd expect a table with this much traffic to be on a very fast RAID 10 array, perhaps even partitioned. Are your disks up to it? Are your transaction logs on different disks from your data files? They should be.
EDIT 1 - Response to your comment
To put a database into SIMPLE recovery:
ALTER DATABASE [Database Name] SET RECOVERY SIMPLE;
This basically turns off transaction logging on the given database, meaning that in the event of data loss you would lose all data since your last full backup. If you're OK with that, this should save a lot of time when running large transactions. (NOTE that while a transaction is running, logging still takes place in SIMPLE, to enable the transaction to be rolled back.)
If there are tables within your database where you can't afford to lose data, you'll need to leave your database in FULL recovery mode (i.e. every transaction gets logged, and hopefully flushed to *.trn files by your server's maintenance plans). As I stated earlier though, there is nothing stopping you from having two databases, one in FULL and one in SIMPLE. The FULL database would be for tables where you can't afford to lose any data (i.e. you could apply the transaction logs to restore data to a specific time), and the SIMPLE database would be for these massive high-traffic tables where you can tolerate data loss in the event of a failure.
All of this is relevant assuming you're creating full (*.bak) backups every night and flushing your log files to *.trn files every half hour or so.
In regards to your index question, it's imperative your date column is indexed, if you check your execution plan and see any "TABLE SCAN" - that would be an indicator of a missing index.
Your date column, I presume, is DATETIME with a constraint setting the DEFAULT to getdate()?
You may find that you get better performance by replacing that with a BIGINT YYYYMMDDHHMMSS and then applying a CLUSTERED index to that column. Note however that you can only have one clustered index per table, so if that table already has one you'll need to use a non-clustered index. (In case you didn't know, a clustered index basically tells SQL Server to store the rows in that order, meaning that when you delete rows older than 20 minutes it can delete them sequentially rather than hopping from page to page.)
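A sketch of the clustered index suggestion, reusing the placeholder names from the delete loop earlier in this answer. It assumes the table does not already have a clustered index (for example from a clustered primary key).

```sql
-- T-SQL sketch; CIX_YOURTABLE_Date is a hypothetical index name.
-- Stores the rows in date order, so the oldest rows sit together at one end.
CREATE CLUSTERED INDEX CIX_YOURTABLE_Date
ON YOURTABLE (YourIndexedDateColumn);
```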
The log problem is probably due to the number of records deleted in the transaction. To make things worse, the engine may be requesting a lock per record (or per page, which is not so bad).
The one big thing here is how you determine the records to be deleted. I'm assuming you use a datetime field; if so, make sure you have an index on the column, otherwise it's a sequential scan of the table that will really penalize your process.
There are two things you may do, depending on the concurrency of users and the time of the delete:
If you can guarantee that no one is going to read or write while you delete, you can lock the table in exclusive mode, delete (this takes only one lock from the engine), and release the lock.
You can use batch deletes: write a script with a cursor that provides the rows you want to delete, begin a transaction, and commit every X records (ideally 5000), so you keep the transactions short and do not take that many locks.
Take a look at the query plan for the delete process and see what it shows; a sequential scan of a big table is never good.
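The first option (one exclusive table lock instead of many row or page locks) might look like this in T-SQL. The table and column names are the placeholders used in the batch delete example earlier in this thread.

```sql
-- T-SQL sketch. TABLOCKX requests a single exclusive lock on the whole table,
-- held until the statement's transaction completes. Other readers and writers
-- are blocked for that time, so this only suits a window with no concurrent use.
DELETE FROM YOURTABLE WITH (TABLOCKX)
WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE());
```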
Unfortunately for the purpose of this question and fortunately for the sake of consistency and recoverability of the databases in SQL server, putting a database into Simple recovery mode DOES NOT disable logging.
Every transaction still gets logged before committing it to the data file(s), the only difference would be that the space in the log would get released (in most cases) right after the transaction is either rolled back or committed in the Simple recovery mode, but this is not going to affect the performance of the DELETE statement in one way or another.
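To check this on your own server, you could look at the recovery model and at how full the log is before and after a large DELETE. This is only a way of observing the behaviour described above, not a tuning recipe.

```sql
-- T-SQL: recovery model of each database on the instance.
SELECT name, recovery_model_desc
FROM sys.databases;

-- Log size and percentage in use, per database.
DBCC SQLPERF(LOGSPACE);
```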
I had a similar problem when I needed to delete more than 70% of the rows from a big table with 3 indexes and a lot of foreign keys.
For this scenario, I saved the rows I wanted in a temp table, truncated the original table and reinserted the rows, something like:
SELECT * INTO #tempuser FROM [User] WHERE [Status] >= 600;
TRUNCATE TABLE [User];
INSERT [User] SELECT * FROM #tempuser;
I learned this technique with this link that explains:
DELETE is a fully logged operation, and can be rolled back if something goes wrong
TRUNCATE removes all rows from a table without logging the individual row deletions
In the article you can explore other strategies for speeding up the deletion of many records; this one worked for me.

Table with no more than 30k records needs index rebuilding after a handful of inserts

I have a table with 20 or so columns. I have approximately 7 non-clustered indexes in that table on the columns that users filter by more often. The active records (those that the users see on their screen) are no more than 700-800. Twice a day a batch job runs and inserts a few records in that table - maybe 30 - 100 - and may update the existing ones as well.
I have noticed that the indexes need rebuilding EVERY time the batch operation completes. Their fragmentation level doesn't go step by step from 0-1% to, say, 50%. Instead, it jumps from 0-1% to approximately 99% after the batch operation completes. A zillion selects can happen on this table between batch operations, but I don't think that matters.
Is this normal? I don't think it is. What do you think the problem is here? The indexed columns are mostly strings and floats.
A few changes could easily change fragmentation levels.
An insert on a page can cause a page split
Rows can overflow
Rows can be moved (forwarding pointers)
You'll have quite wide rows too, so your data density (rows per page) is lower. DML on existing rows will cause fragmentation quite quickly if the DML is distributed across many pages.
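To see what the batch job actually does to each index, a query like the following can be run before and after it. The table name dbo.YourTable is a hypothetical placeholder. page_count is included because an index with only a handful of pages can report a very high percentage after a few page splits.

```sql
-- T-SQL sketch; replace dbo.YourTable with the real table name.
SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.YourTable'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;
```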

Fastest way to do mass update

Let’s say you have a table with about 5 million records and a nvarchar(max) column populated with large text data. You want to set this column to NULL if SomeOtherColumn = 1 in the fastest possible way.
The brute force UPDATE does not work very well here because it will create a large implicit transaction and take forever.
Doing updates in small batches of 50K records at a time works, but it's still taking 47 hours to complete on a beefy 32 core/64GB server.
Is there any way to do this update faster? Are there any magic query hints / table options that sacrifices something else (like concurrency) in exchange for speed?
NOTE: Creating temp table or temp column is not an option because this nvarchar(max) column involves lots of data and so consumes lots of space!
PS: Yes, SomeOtherColumn is already indexed.
From everything I can see it does not look like your problems are related to indexes.
The key seems to be in the fact that your nvarchar(max) field contains "lots" of data. Think about what SQL has to do in order to perform this update.
Since the column you are updating is likely more than 8000 characters it is stored off-page, which implies additional effort in reading this column when it is not NULL.
When you run a batch of 50000 updates SQL has to place this in an implicit transaction in order to make it possible to roll back in case of any problems. In order to roll back it has to store the original value of the column in the transaction log.
Assuming (for simplicity's sake) that each column contains on average 10,000 bytes of data, that means 50,000 rows will contain around 500MB of data, which has to be stored temporarily (in simple recovery mode) or permanently (in full recovery mode).
There is no way to disable the logs as it will compromise the database integrity.
I ran a quick test on my dog slow desktop, and running batches of even 10,000 becomes prohibitively slow, but bringing the size down to 1000 rows, which implies a temporary log size of around 10MB, worked just nicely.
I loaded a table with 350,000 rows and marked 50,000 of them for update. This completed in around 4 minutes on my dog slow desktop (1 processor, 2GB), and since it scales linearly, your entire 5 million rows would take around 6 to 7 hours on that machine (100 times 4 minutes is 400 minutes). I would expect something much better on your beefy server backed by a SAN or something.
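A minimal sketch of the small-batch update described above. SomeOtherColumn is from the question; the table name dbo.BigTable and the column name BigTextColumn are hypothetical, and 1000 is simply the batch size that worked in the test above.

```sql
-- T-SQL sketch; dbo.BigTable and BigTextColumn are assumed names.
DECLARE @BatchSize INT = 1000;

WHILE 1 = 1
BEGIN
    -- Each pass is its own small transaction, so the log only has to hold
    -- about @BatchSize old values at a time.
    UPDATE TOP (@BatchSize) dbo.BigTable
    SET BigTextColumn = NULL
    WHERE SomeOtherColumn = 1
      AND BigTextColumn IS NOT NULL;  -- skip rows already done, so the loop makes progress

    -- Stop once a pass finds nothing left to update.
    IF @@ROWCOUNT = 0
        BREAK;
END
```

The IS NOT NULL filter is what lets the loop terminate: without it, every pass would keep matching the same rows.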
You may want to run your update statement as a select, selecting only the primary key and the large nvarchar column, and ensure this runs as fast as you expect.
Of course the bottleneck may be other users locking things or contention on your storage or memory on the server, but since you did not mention other users I will assume you have the DB in single user mode for this.
As an optimization you should ensure that the transaction logs are on a different physical disk /disk group than the data to minimize seek times.
Hopefully you already dropped any indexes on the column you are setting to null, including full text indexes. As said before, turning off transactions and the log file temporarily would do the trick. Backing up your data will usually truncate your log files too.
You could set the database recovery mode to Simple to reduce logging, BUT do not do this without considering the full implications for a production environment.
What indexes are in place on the table? Given that batch updates of approx. 50,000 rows take so long, I would say you require an index.
Have you tried placing an index or statistics on someOtherColumn?
This really helped me. I went from 2 hours to 20 minutes with this.
/* I'm using database recovery mode to Simple */
/* Update table statistics */
set transaction isolation level read uncommitted
/* Your 50k update, just to have a measures of the time it will take */
set transaction isolation level READ COMMITTED
In my experience, working in MSSQL 2005, automatically moving 4 million 46-byte records (no nvarchar(max) though) every day from one table in a database to another table in a different database takes around 20 minutes on a quad-core, 8GB, 2GHz server, and it doesn't hurt application performance. By moving I mean INSERT INTO SELECT and then DELETE. The CPU usage never goes over 30%, even when the table being deleted from has 28M records and constantly receives around 4K inserts per minute but no updates. Well, that's my case; yours may vary depending on your server load.
READ UNCOMMITTED
"Specifies that statements (your updates) can read rows that have been modified by other transactions but not yet committed." In my case, the records are readonly.
I don't know what rg-tsql means but here you'll find info about transaction isolation levels in MSSQL.
Try indexing 'SomeOtherColumn'...50K records should update in a snap. If there is already an index in place see if the index needs to be reorganized and that statistics have been collected for it.
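If the index already exists, reorganizing it and refreshing its statistics could look like this. The table and index names are hypothetical.

```sql
-- T-SQL sketch; dbo.BigTable and IX_BigTable_SomeOtherColumn are assumed names.
-- Defragment the leaf level of the existing index in place.
ALTER INDEX IX_BigTable_SomeOtherColumn ON dbo.BigTable REORGANIZE;

-- Refresh the statistics the optimizer uses to estimate how many rows match.
UPDATE STATISTICS dbo.BigTable IX_BigTable_SomeOtherColumn;
```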
If you are running a production environment with not enough space to duplicate all your tables, I believe that you are looking for trouble sooner or later.
If you provide some info about the number of rows with SomeOtherColumn=1, perhaps we can think of another way, but I suggest:
0) Back up your table
1) Index the flag column
2) Set the table option to "no log transactions" ... if possible
3) Write a stored procedure to run the updates