How to delete large data from Firebird SQL database

I have a very large database (at least for me) - above 1,000,000 records - and I need to delete all records with a timestamp lower than some value. For example:
DELETE FROM table WHERE TS < '2020-01-01';
The problem I'm facing is that after the transaction finishes, if it finishes at all, the database is unresponsive and unusable. How can I delete so many records without this problem?
I'm new to this; as of now I've only worked with databases that had 1,000-10,000 rows, and the command I used to delete records never caused problems.

Based on your description in the question and your comments, the problem has to do with how garbage collection works in Firebird. Firebird is a so-called Multi-Version Concurrency Control (MVCC) database: each change you make to a row (record), including a deletion, creates a new version of that record, and previous versions remain available to transactions that started before the changing transaction committed.
If no transactions are 'interested' in a previous version of a record anymore, that previous version becomes eligible for garbage collection. Firebird has two options for garbage collection: cooperative (supported by all server modes) and background (supported by SuperServer), plus a third, combined mode which does both (this is the default for SuperServer).
In background mode, a dedicated thread cleans up the garbage; it is signaled by active statements when they see garbage.
In the cooperative mode, a statement that sees garbage is also the one that has to clean it up. This can be especially costly when the statement performs a full table scan just after a large update or delete. Instead of just finding and returning rows, that statement will also rewrite database pages to get rid of that garbage.
See also the slides Garbage collection mechanism and sweep in details.
There are some possible solutions:
If you're using SuperServer, change the policy by setting GCPolicy in firebird.conf to background.
The downside of this solution is that it might take longer before all garbage is collected, but the big benefit is that transactions are not slowed down by doing garbage collection work.
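For example, the relevant line in firebird.conf would look like this (the server needs a restart to pick it up):

    GCPolicy = background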
After committing a transaction that produced a lot of garbage, execute a statement that performs a full table scan (e.g. select count(*) from table) to trigger garbage collection, using a separate worker thread so it doesn't block the rest of your process.
This option only really works if there are no active transactions that are still interested in those old record versions.
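A minimal sketch of such a trigger statement, using the table name from the question; run it on its own connection in a worker thread so the main application isn't held up:

    -- the full table scan visits every row, which lets Firebird collect the garbage
    SELECT COUNT(*) FROM table;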
Create a backup of the database (there is no need to restore, except to verify if the backup worked correctly).
By default (unless you specify the -g option to disable garbage collection), the gbak tool will perform garbage collection during a backup. This has the same restriction as option 2, as it works because gbak does the equivalent of a select * from table.
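For example, a backup run purely for its garbage-collecting side effect might look like this (paths and credentials are placeholders):

    gbak -b -user SYSDBA -password masterkey /data/mydb.fdb /backups/mydb.fbk

Note that adding -g to this command line would disable the garbage collection and defeat the purpose here.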
Perform a 'sweep' of the database using gfix -sweep.
This has similar restrictions to the previous two options.
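A sweep invocation would look something like this (again with a placeholder path and credentials):

    gfix -sweep -user SYSDBA -password masterkey /data/mydb.fdb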
For connections that cannot incur the slowdown of a garbage collection, specify the connection option isc_dpb_no_garbage_collect (details vary between drivers and connection libraries).
If you specify this for all connections, and your policy is cooperative (either because it is configured, or you're using Classic or SuperClassic server mode), then no garbage collection will take place, which can cause an eventual slowdown as well, because the engine will have to scan longer chains of record versions. This can be mitigated by using the previous two options to perform a garbage collection.
Instead of really deleting records, introduce a soft-delete in your application to mark records as deleted instead of really deleting them.
Either keep those records permanently, or really delete them at a later time, for example with a scheduled job that runs when the database is not under load, combined with one of the previous options to trigger a garbage collection.
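A rough sketch of the soft-delete idea, with hypothetical table and column names (BOOLEAN requires Firebird 3; on older versions a SMALLINT flag works the same way):

    ALTER TABLE orders ADD deleted BOOLEAN DEFAULT FALSE;

    -- the application 'deletes' by marking:
    UPDATE orders SET deleted = TRUE WHERE ts < '2020-01-01';

    -- a scheduled off-peak job performs the real delete:
    DELETE FROM orders WHERE deleted;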

Actually, background garbage collection is exactly what can cause the "unresponsive database" behavior, because of heavy contention between the garbage collector and worker threads. Cooperative GC may slow down operations but keeps the database responsive. At least for version 2.5.
Another reason is bad indexes which have a lot of duplicates. Such indexes are often useless for queries and should simply be dropped. If that is not an option, they can be deactivated before the delete and reactivated after it, in separate transactions (as a side effect, the reactivation will cause a full garbage collection).
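For example, assuming an index named idx_mytable_ts, each statement committed in its own transaction around the mass delete:

    ALTER INDEX idx_mytable_ts INACTIVE;
    -- ... perform and commit the mass delete ...
    ALTER INDEX idx_mytable_ts ACTIVE;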
But of course the best option is to keep all the data. A million records is not that much for a well-designed database on decent hardware.

Related

Firebird lock table / lock record

Suppose you have one table for a desktop application and several users.
When a user opens a record, I want to lock this record. I have tried the WITH LOCK statement, and it works fine.
But when a second user wants to update the same record, I want to show a message: "Sorry, you cannot work on this order because it is locked. Somebody else has opened this record before you." Firebird waits for the first user to commit/rollback. I don't want to wait; I want to show an error message. Is there a simple way to ask Firebird for a record's lock status?
Is there a way to lock a full table? Or to use a semaphore/mutex (like GET_LOCK in MySQL)?
I have tried RESERVING on the SET TRANSACTION statement, but it does not work.
My wish is to display a message to the user, not to wait.
Thanks
If you don't want to wait, then configure your transaction to use NO WAIT, or a wait timeout. However, controlling business rules like this through database transactions is not advisable, as it requires long-running transactions, which inhibit garbage collection, increase the chain of interesting transactions, and increase the chance of update conflicts.
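For completeness, this is what those transaction options look like in Firebird SQL (LOCK TIMEOUT requires Firebird 2.0 or later):

    SET TRANSACTION NO WAIT;              -- report a lock conflict immediately
    SET TRANSACTION WAIT LOCK TIMEOUT 5;  -- or: give up after 5 seconds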
I'd advise to use different options like:
First one to update wins
Change detection (e.g. by a timestamp or record version counter which is also used as a condition in the update statement), allowing the user to overwrite or abandon their update (or maybe merge); see the sketch after this list
Explicit reservation by updating the record (setting the username) in a separate transaction. This might require cleanup, or the ability for a user to break the reservation (e.g. if someone has had it open for too long).
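A sketch of the change-detection option, using a hypothetical record_version column as the optimistic-lock token:

    UPDATE orders
       SET status = ?, record_version = record_version + 1
     WHERE id = ? AND record_version = ?;

If the statement reports zero updated rows, someone else modified the record first, and the user can be offered the overwrite/abandon/merge choice.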
Note that Firebird uses multi version concurrency control (MVCC), so explicit locking is not really natural. See also this answer to Locking tables firebird, delphi.
Locking tables using RESERVING should be possible, but I have never used it, so I am not entirely sure how to use it, although you probably also need to specify FOR PROTECTED READ (see the InterBase 6.0 Embedded SQL Guide, pages 70/71).

.NET SqlDataReader isolated read

I have a SQL Server database which stores accounts with credits (about 200,000 records), and a separate table which stores the transactions (about 20,000,000).
Whenever a transaction is added to the database the credits are updated.
What I need to do is update client programs (using a web service) to store the credits locally, and whenever new transactions are added to the server they are sent to the clients as well (using timestamps for the delta). My main problem is creating the first data set for the client. I need to supply the list of all accounts and the last timestamp on the transaction table.
This would mean I have to create this list and the last timestamp within a snapshot, because any updates during creating this list would mean a mismatch in credits total and last transaction timestamp.
I've researched the ALLOW_SNAPSHOT_ISOLATION setting and using snapshot isolation on the SqlCommand transaction, but from what I've read this will induce a significant performance penalty. Is this true, and can this problem be solved using other means?
but from what I've read this will induce a significant performance penalty.
I don't even want to know where you read that. I'll refer you to the official documentation. The costs come from additional tempdb space used for row versions and from traversing old row versions. These problems do not concern you if you have a low write rate.
Snapshot isolation is a boon for solving blocking and consistency issues. It is a perfect match for your scenario.
Many SQL Server questions on Stack Overflow lead me to comment "did you investigate snapshot isolation yet?". Most underused feature.
Oracle and Postgres have it always on.
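A minimal sketch of enabling and using snapshot isolation; the database, table and column names are placeholders:

    ALTER DATABASE CreditsDb SET ALLOW_SNAPSHOT_ISOLATION ON;

    -- in the session that builds the initial client data set:
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    BEGIN TRAN;
    SELECT AccountId, Credits FROM dbo.Accounts;
    SELECT MAX(TransactionTime) FROM dbo.Transactions;
    COMMIT;

Both SELECTs read from the same consistent snapshot, so the credit totals and the last transaction timestamp are guaranteed to match.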
Don't jump onto the SI wagon hastily. Like everything else, it has its benefits and its drawbacks.
As far as the drawbacks are concerned, for example, the application might count on blocking behaviour and/or be willing to wait for the last version of the data. You should thoroughly test the application under SI to be sure it behaves correctly. Further, an uncommitted transaction can make a mess of the version store and lead to dramatic tempdb growth, so monitoring is a must.
Also, SI might be overkill for you if you don't normally have blocking issues.
Instead, if what you need is a one-off or close to it, create a database snapshot of your database, build the initial list from that snapshot, and then simply drop it.
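A sketch of that snapshot-and-drop approach; the logical data file name must match the source database's, so all names here are placeholders:

    CREATE DATABASE CreditsDb_snap
        ON (NAME = CreditsDb_data, FILENAME = 'C:\Snapshots\CreditsDb.ss')
        AS SNAPSHOT OF CreditsDb;

    -- build the initial list against CreditsDb_snap, then:
    DROP DATABASE CreditsDb_snap;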

Programmatically purge document deletion

I have a database with an agent that periodically deletes (via a Java agent, using the removePermanently method) all documents in a view and re-creates them.
After some months, I've noticed that the database size has increased considerably.
Showing the database information through this command
sh database <dbpath>
it turns out that I have a lot of deleted documents (I suppose they are deletion stubs):

    Document Type    Live     Deleted
    Documents        1,922    817,378

After compacting the database, 80% of the space was recovered.
Is there a way to programmatically delete the stubs definitively, to avoid this "database explosion"? Or is there a way to correctly manage this scenario (deletion and creation of documents)?
Don't delete the documents! Re-use them. That's the best answer. Seriously.
Take the existing documents, clear the fields, and set Form := "Obsolete". Modify the selection formula for all your views by appending & Form != "Obsolete". Create a new hidden view called "Obsolete" with selection formula Form = "Obsolete", and instead of creating new documents, change your code to go to the Obsolete view, grab an available document, and set new field values (including changing the Form field). Only create new documents if there are not enough available in the Obsolete view.
Any performance that you lose by doing this, which really should be minimal with the number of documents that you seem to have, will be more than offset by what you will gain by avoiding the growth and fragmentation of the NSF file that you are creating by doing all the deletions and creating new documents.
If, however, there's no possible way for you to do that -- maybe some third-party tool that is outside of your control is creating the documents -- then it's important to know whether the database you are talking about is replicated. If it is replicated, then you must be very careful, because purging deletion stubs before all replicas are brought up to date will cause deleted documents to "come back to life" if a replica that has been off-line since before the delete occurred comes back on-line.
If the database is not replicated at all, or is reliably replicated across all replicas quickly, then you can reduce the purge interval. Go to the Replication Settings dialog and find the checkbox labeled "Remove documents not modified in the last __ days". Do not check the box, but enter a small number into the number of days. The purge interval for deletion stubs will be set to 1/3 of this number. So if you set it to 3, the effect will be that stubs are kept for 1 day and then purged, giving you 24 hours to assure that all replicas are up to date. If you need more, set the interval higher, maintaining the 3x multiple as needed.
If a server is down for an extended period of time (longer than your purge interval), adjust your operations procedures so that you are sure to disable replication of the database to that server before it comes back on line, so the replica can be deleted and recreated.
Be aware, though, that user replicas pose the same problem, and it's not really possible to control or be aware of user replicas that might go off-line for longer than the purge interval.
In any case, remember: do not check the box. To reduce the purge interval for deletion stubs only, just reduce the number.
Apart from this, the only way to programmatically delete deletion stubs requires use of the Notes C API. It's possible to call the required routines from LotusScript, but in my experience once the total number of stubs plus documents gets too high you will likely run into an error and may have to create and deploy a new non-replica copy of the database to get past it. You can find code along with my explanation in the answer to this previous question.
I have to second Richard's recommendation to reuse documents. I recently had a similar project, and started the way you did with deleting everything and importing half a million records every night. Deletion stubs and the growth of the FT index quickly became problems, eating up huge amounts of disk space and slowing performance significantly. I tried to manage the deletion stubs, but I was clearly going against the grain of Domino's architecture.
I read Richard's suggestion here, and adopted that approach. Here's what I did:
1) create 2 views based on form - one for 'active' records, and another for 'inactive' records
2) start the agent by setting autoupdate = false for both views
3) use stampall("form", "inactive") to change all of the active records to inactive
4) manually refresh the 2 views using notesview.refresh()
5) start importing data. For each record, pull a document out of the pool of inactive records (by walking the 'inactive' view)
6) if I run out of inactive records in the pool, create new ones
7) when the import is complete, manually refresh the views again
8) use db.createftindex(0, true) to re-create the FT index
The code is really not that complex, and it runs in about the same amount of time as my original approach, if not faster.
Thanks Richard!
Also, look at the advanced database properties - there are several things there that will help optimize the database.
It sounds like you are "refreshing" the contents of the database by periodically deleting all the documents and creating new ones from some other source. Cut that out. If the data are already in the Notes database, leave the documents alone. What you're doing is very inefficient.

Firebird backup restore is frustrating, is there a way to avoid it?

I am using Firebird, but lately the database has been growing really seriously.
There are a lot of delete statements running, as well as updates/inserts, and the database file size grows really fast.
After deleting tons of records the database size doesn't decrease, and even worse, I have the feeling that queries are actually getting a bit slower.
In order to fix this, a daily backup/restore process has been introduced, but because of the time it takes to complete, I could say that it is really frustrating to use Firebird.
Any ideas on workarounds or solutions for this will be welcome.
I am also considering switching to InterBase because I heard from a friend that it does not have this issue - is that so?
We have a lot of huge Firebird databases in production but have never had an issue with database growth. Yes, every time a record is deleted or updated, an old version of it is kept in the file. But sooner or later the garbage collector will sweep it away. Once the two processes balance each other, the database file will grow only by the size of new data and indices.
As a general precaution against enormous database growth, try to make your transactions as short as possible. In our applications we use one READ ONLY transaction for reading all the data; this transaction stays open for the whole application lifetime. For every batch of insert/update/delete statements we use short, separate transactions.
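Note that a long-lived read-only transaction only stays out of the garbage collector's way if it also runs in READ COMMITTED mode; started like this, it can be kept open for the application's lifetime:

    SET TRANSACTION READ ONLY ISOLATION LEVEL READ COMMITTED;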
Slow database operations can also result from stale index statistics. Here you can find an example of how to recalculate statistics for all indices: http://www.firebirdfaq.org/faq167/
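Recalculating the statistics for a single index is one statement (the index name here is a placeholder; the FAQ above shows how to generate this for all indices):

    SET STATISTICS INDEX idx_orders_ts;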
Check if you have unfinished transactions in your applications. If a transaction is started but never committed or rolled back, the database will keep record versions for every transaction after the oldest active one.
You can check the database statistics (with gstat or an external tool); look at the oldest transaction and the next transaction. If the difference between those numbers keeps growing, you have a stuck-transaction problem.
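For example, the header page summary printed by gstat shows the transaction counters (the path is a placeholder):

    gstat -h /data/mydb.fdb

Compare the 'Oldest transaction' and 'Next transaction' lines in the output; a steadily growing gap between the two points to a stuck transaction.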
There are also monitoring tools that check the situation; one I've used is Sinatica Monitor for Firebird.
Edit: Also, the database file never shrinks automatically. Parts of it get marked as unused (after a sweep operation) and will be reused. http://www.firebirdfaq.org/faq41/
The space occupied by deleted records will be re-used as soon as it is garbage collected by Firebird.
If GC is not happening (transaction problems?), the DB will keep growing until GC can do its job.
Also, there is a problem when you do a massive delete in a table (e.g. millions of records): the next select on that table will "trigger" the garbage collection, and performance will drop until GC finishes. The only way to work around this is to do massive deletes at a time when the server is not heavily used, and to run a sweep after that, making sure there are no stuck transactions.
Also, keep in mind that if you are using "standard" tables to hold temporary data (i.e. info that is inserted and deleted several times), you can get a corrupted database in some circumstances. I strongly suggest you start using the Global Temporary Tables feature.
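A sketch of such a table, with hypothetical names (global temporary tables require Firebird 2.1+):

    CREATE GLOBAL TEMPORARY TABLE tmp_import (
        id      INTEGER,
        payload VARCHAR(100)
    ) ON COMMIT DELETE ROWS;

Each transaction sees only its own rows (with ON COMMIT PRESERVE ROWS the rows live for the connection instead), and the contents are cleaned up automatically.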

Regarding SQL Server Locking Mechanism

I would like to ask a couple of questions regarding the SQL Server locking mechanism.
If I am not using a lock hint in a SQL statement, SQL Server uses the PAGLOCK hint by default - am I right? If yes, then why? Maybe it's due to the cost of managing too many locks; that is the only drawback I can think of, but please let me know if there are others. Also, tell me if we can change this default behavior, if it's reasonable to do so.
I am writing a server-side application, a sync server (not using the Sync Framework), and I have written the database queries in a C# code file, using an ODBC connection to execute them. Now the question is: what is the best way to change the default locking from page to row, keeping the drawbacks in mind (e.g. adding lock hints to the queries, which is what I am planning)?
If a SQL query (SELECT/DML) is executed outside the scope of a transaction and the statement contains a lock hint, what kind of lock will be acquired (e.g. shared, update, exclusive)? And within a transaction scope, does the transaction's isolation level have an impact on the lock type if the ROWLOCK hint is being used?
Lastly, it would help if someone could give me a sample so I could test and experience all the above scenarios myself (e.g. .NET code or a SQL script).
Thanks
Mubashar
1) No. It locks as it sees fit and escalates locks as needed.
2) Let the DB engine manage it.
3) See point 2.
4) See point 2.
I'd only use lock hints if you want specific and certain behaviours, e.g. queues or non-blocking (dirty) reads.
More generally, why do you think the DB engine can't do what you want by default?
The default locking is row locks, not page locks, although the way the locking mechanism works means you will be placing locks on all the objects within the hierarchy: e.g. reading a single row places an intent shared lock on the table, an intent shared lock on the page, and then a shared lock on the row.
This enables an action requesting an exclusive lock on the table to know that it may not take it yet, since an intent lock is present (otherwise it would have to check every page / row for locks).
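One way to see that hierarchy for yourself, using a hypothetical table and the sys.dm_tran_locks view (HOLDLOCK keeps the shared locks until commit so they remain visible):

    BEGIN TRAN;
    SELECT * FROM dbo.Accounts WITH (ROWLOCK, HOLDLOCK) WHERE AccountId = 1;

    -- inspect the locks held by this session
    SELECT resource_type, request_mode, request_status
    FROM sys.dm_tran_locks
    WHERE request_session_id = @@SPID;

    COMMIT;

You should see intent shared (IS) locks on the OBJECT and PAGE resources and a shared (S) lock on the KEY or RID, depending on whether the table has a clustered index.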
If you issue too many locks for an individual query, however, SQL Server performs lock escalation, which reduces the granularity of the locks - so that it is managing fewer locks.
This can be turned off using a trace flag but I wouldn't consider it.
Until you know you actually have a locking / lock escalation issue, you risk prematurely optimizing a non-existent problem.