Is calling Lucene IndexWriter.close() a must?

My program was running so slowly that I had to terminate it halfway to optimize part of the code. That was right after 40,000 documents had been inserted into the database, but before Lucene's IndexWriter.close() was called. Afterwards I couldn't find any results for some of the records, and the missing ones seem to be limited to the 40,000 documents from that particular run.
Does that mean the records I indexed during that run were lost? Must IndexWriter always be closed cleanly for the data to be written to the index?
Thanks in advance!

It's not close that you need to call but commit. addDocument only analyzes the document and buffers data in memory, while commit flushes pending changes and performs an fsync.
close calls commit internally, which I think is why you assume close is required.
However, be careful not to call commit too often, as this operation is very costly compared to addDocument.
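The difference between buffering and committing can be illustrated with a small toy model. This is only a sketch in Python, not the Lucene API: the class ToyIndexWriter, the commit_every batch size and the helper index_until_killed are all hypothetical names made up for the illustration.

```python
class ToyIndexWriter:
    """Toy stand-in for a buffering index writer (NOT the Lucene API)."""

    def __init__(self, commit_every=1000):
        self.commit_every = commit_every  # hypothetical batch size
        self.pending = []                 # documents buffered in memory only
        self.committed = []               # documents that survived a commit

    def add_document(self, doc):
        # Adding only buffers the document; nothing is durable yet.
        self.pending.append(doc)
        if len(self.pending) >= self.commit_every:
            self.commit()

    def commit(self):
        # Commit moves everything buffered so far into the durable store.
        self.committed.extend(self.pending)
        self.pending.clear()


def index_until_killed(writer, docs, kill_after):
    """Add documents, but stop abruptly after `kill_after` of them (no close)."""
    for i, doc in enumerate(docs):
        if i == kill_after:
            return  # simulated termination: whatever is still pending is lost
        writer.add_document(doc)


writer = ToyIndexWriter(commit_every=10)
index_until_killed(writer, [f"doc-{n}" for n in range(100)], kill_after=25)
print(len(writer.committed), "committed,", len(writer.pending), "lost")
```

Committing every N documents, rather than after each one, bounds how much work an aborted run can lose while keeping the number of expensive commits low.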

Related

How to delete large data from Firebird SQL database

I have a very large database (at least for me), with more than 1,000,000 records, and I need to delete all records with a timestamp lower than some value. For example:
DELETE FROM table WHERE TS < '2020-01-01';
The problem I'm facing is that after the transaction finishes, if it finishes at all, the database is unresponsive and unusable. How can I delete so many records without running into this problem?
I'm new to this. Until now I've only worked with databases that had 1,000 to 10,000 rows, and the command I used to delete records never caused problems.
Based on your description in the question, and your comments, the problem has to do with how garbage collection works in Firebird. Firebird is a so-called Multi-Version Concurrency Control (MVCC) database: each change you make to a row (record), including deletions, creates a new version of that record and keeps previous versions available for other transactions that were started before the transaction that made the change was committed.
If there are no more transactions 'interested' in a previous version of a record, that previous version becomes eligible for garbage collection. Firebird has two options for garbage collection: cooperative (supported by all server modes) and background (supported by SuperServer), and a third combined mode which does both (this is the default for SuperServer).
The background mode is a dedicated thread which cleans up garbage; it is signaled by active statements when they see garbage.
In the cooperative mode, a statement that sees garbage is also the one that has to clean it up. This can be especially costly when the statement performs a full table scan just after a large update or delete. Instead of just finding and returning rows, that statement will also rewrite database pages to get rid of that garbage.
See also the slides Garbage collection mechanism and sweep in details.
There are some possible solutions:
1) If you're using SuperServer, change the policy by setting GCPolicy in firebird.conf to background.
The downside of this solution is that it might take longer before all garbage is collected, but the big benefit is that transactions are not slowed down by doing garbage collection work.
2) After committing a transaction that produced a lot of garbage, execute a statement that performs a full table scan (e.g. select count(*) from table) to trigger garbage collection, using a separate worker thread so as not to block the rest of your process.
This option only really works if there are no active transactions that are still interested in those old record versions.
3) Create a backup of the database (there is no need to restore, except to verify that the backup worked correctly).
By default (unless you specify the -g option to disable garbage collection), the gbak tool will perform garbage collection during a backup. This has the same restriction as option 2, as this works because gbak does the equivalent of a select * from table.
4) Perform a 'sweep' of the database using gfix -sweep.
This has similar restrictions to the previous two options.
5) For connections that cannot incur the slowdown of a garbage collection, specify the connection option isc_dpb_no_garbage_collect (details vary between drivers and connection libraries).
If you specify this for all connections, and your policy is cooperative (either because it is configured, or because you're using Classic or SuperClassic server mode), then no garbage collection will take place, which can cause an eventual slowdown as well, because the engine will have to scan longer chains of record versions. This can be mitigated by using the previous two options to perform a garbage collection.
6) Instead of really deleting records, introduce a soft-delete in your application to mark records as deleted.
Either keep those records permanently, or really delete them at a later time, for example by a scheduled job running at a time the database is not under load, and include one of the previous options to trigger a garbage collection.
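The soft-delete idea from the last option could look roughly like the following. This is a minimal sketch, assuming a hypothetical events table with an is_deleted flag column; it uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere, and it is not Firebird-specific code.

```python
import sqlite3

# sqlite3 is only a stand-in so the sketch runs anywhere; it is not Firebird.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO events (ts) VALUES (?)",
    [("2019-06-01",), ("2019-12-31",), ("2020-03-15",)],
)
conn.commit()


def soft_delete_before(conn, cutoff):
    """Flag rows older than `cutoff` instead of physically deleting them."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE events SET is_deleted = 1 WHERE ts < ? AND is_deleted = 0",
            (cutoff,),
        )
    return cur.rowcount


def visible_count(conn):
    """What the application sees: flagged rows are filtered out."""
    return conn.execute(
        "SELECT COUNT(*) FROM events WHERE is_deleted = 0"
    ).fetchone()[0]


def purge_soft_deleted(conn):
    """Scheduled job for a quiet period: physically remove the flagged rows."""
    with conn:
        cur = conn.execute("DELETE FROM events WHERE is_deleted = 1")
    return cur.rowcount


flagged = soft_delete_before(conn, "2020-01-01")
visible = visible_count(conn)
purged = purge_soft_deleted(conn)
print(flagged, visible, purged)
```

Every query in the application then has to filter on the flag, which is the main cost of this design; the physical delete, and the garbage collection it triggers, moves to a time of your choosing.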
Actually, background garbage collection is exactly what can cause "unresponsive database" behavior, because of heavy contention between the garbage collector and the working threads. Cooperative GC may slow down operations but keeps the database "responsive". At least that is the case for version 2.5.
Another reason is bad indexes which have a lot of duplicates. Such indexes are often useless for queries and should simply be dropped. If that is not an option, they could be deactivated before the delete and reactivated afterwards in separate transactions (as a side effect, the activation will cause a full garbage collection).
But of course the best option is to keep all the data. One million records is not that much for a well-designed database on decent hardware.

Load balancing SQL reads while batch-processing?

Given an SQL table with timestamped records. Every once in a while an application App0 does something like foreach record in since(certainTimestamp) do process(record); commitOffset(record.timestamp), i.e. periodically it consumes a batch of "fresh" data, processes it sequentially, commits success after each record, and then just sleeps for a reasonable time (to accumulate yet another batch). That works perfectly with a single instance. However, how do I load balance multiple instances?
In exactly the same environment, App0 and App1 concurrently compete for the fresh data. The idea is that the read query executed by App0 must not overlap with the same read query executed by App1, so that they never try to process the same item. In other words, I need SQL-based guarantees that concurrent read queries return different data. Is that even possible?
P.S. Postgres is the preferred option.
The problem description is rather vague on what App1 should do while App0 is processing the previously selected records.
In this answer, I make the following assumptions:
all Apps somehow know what the last certainTimestamp is, and it is the same for all Apps whenever they start a DB query.
while App0 is processing, say, the 10 records it found when it started working, new records come in. That means the pile of new records with respect to certainTimestamp grows.
when App1 (or any further App) starts, it should process only those new records with respect to certainTimestamp that are not yet being handled by other Apps.
yet, if one App fails or crashes, the unfinished records should be picked up the next time another App runs.
This can be achieved by locking records in many SQL databases.
One way to go about this is to use
SELECT ... FOR UPDATE SKIP LOCKED
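This statement, in combination with the range selection since(certainTimestamp), selects and locks all records that match the condition and are not currently locked. The locks are held until the transaction ends, so the processing has to happen inside that transaction; if an App crashes, its transaction is rolled back and the records become available again.
Whenever a new App instance runs this query, it only gets "what's left" to do and can work on that.
This solves the problem of "overlap", i.e. working on the same data.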
What's left is then the definition and update of the certainTimestamp.
In order to keep this answer short, I don't go into that here and just leave the pointer to the OP that this needs to be thought through properly to avoid situations where e.g. a single record that cannot be processed for some reason keeps the certainTimestamp at a permanent minimum.
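The skip-locked behaviour can be illustrated with an in-memory analogy. This is only a sketch of the semantics in Python, not database code: the Record class and the claim_batch and release_batch helpers are hypothetical, and a threading.Lock stands in for a row lock.

```python
import threading


class Record:
    def __init__(self, rec_id, ts):
        self.id = rec_id
        self.ts = ts
        self.lock = threading.Lock()  # stands in for a row lock


def claim_batch(records, since):
    """Return the records newer than `since` that nobody else has locked.

    Mirrors the intent of (table and column names are hypothetical):
        SELECT * FROM records WHERE ts > :since FOR UPDATE SKIP LOCKED
    """
    claimed = []
    for rec in records:
        # Non-blocking acquire: a record locked by another worker is skipped.
        if rec.ts > since and rec.lock.acquire(blocking=False):
            claimed.append(rec)
    return claimed


def release_batch(batch):
    """Analogue of ending the transaction: all row locks are released."""
    for rec in batch:
        rec.lock.release()


table = [Record(i, ts=i) for i in range(1, 7)]
batch_app0 = claim_batch(table[:4], since=0)  # App0 saw records 1..4 when it started
batch_app1 = claim_batch(table, since=0)      # App1 starts later and sees 1..6
print([r.id for r in batch_app0], [r.id for r in batch_app1])  # [1, 2, 3, 4] [5, 6]

release_batch(batch_app0)                     # App0 "crashes": its locks go away
retry = claim_batch(table, since=0)           # another App picks up the leftovers
print([r.id for r in retry])                  # [1, 2, 3, 4]
```

In the real database the release step is not something the application calls; it happens when the transaction commits or rolls back.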

How can I recreate a blocking process which uses FETCH API_CURSOR?

My organization has recently had trouble with some SQL Server blocking processes. dbWarden has successfully reported blocking to us, but we often have the blocking SQL text reported as 'FETCH API_CURSOR'.
So, we're looking to alter the blocking alerts trigger in dbWarden to use sys.dm_exec_cursors and sys.dm_exec_sql_text to retrieve the text in the case where we find 'FETCH API_CURSOR' reported.
Trouble is, I cannot seem to come up with a way to recreate/simulate a blocking situation on our development server that will report as 'FETCH API_CURSOR'. I've started from the VB script here on SQL Authority to recreate the open cursor, but I cannot for the life of me figure out how to make it blocking.
I've seen many methods for recreating blocking transactions (open a transaction in one window, but do not commit/close, then try an update on the same table in another), but none that would utilize FETCH API_CURSOR in a way that would allow us to successfully test. I'm somewhat at a loss here.
Has anyone had success in simulating blocking cursors in the past and can offer suggestions?
I'd suggest using the Profiler tool to capture the actual code that creates the cursor and fetches from it. That way you'd see exactly what's going on in the application. It is not so difficult to reproduce similar blocking on a development server.
Let's say one thread fetches rows from a cursor and another thread tries to UPDATE the same rows. Here is what's going on under the hood. The reading thread creates a cursor to fetch the result of a SELECT back to the application. This technology is ancient and extremely slow, and nowadays only some old (mostly Java) applications use cursors for this purpose. Rows get fetched one by one and the client drives the process, so it takes time. During this time, the reading thread holds shared locks on the data it reads. This is by design: SQL Server is a locking engine and relies on locks to function properly. If another thread tries to update a row that is held under a shared lock, it gets blocked. The updating thread takes update (U) locks while searching for the rows and then has to convert them to exclusive (X) locks to modify them, but an X lock cannot be granted while another thread owns an S lock on the same row. So I'd try to create a cursor, fetch several rows, and then try to update them in another tab. If you have difficulty reproducing the blocking, try raising the isolation level of the reading transaction.
But, seriously, I don't think you have to reproduce this or a similar scenario on a development server. Stop using cursors for recordset fetching! Reads will be much faster and blocking issues will be reduced a lot. It's been a while since 1989, when cursors had their heyday. Client DB-access libraries have evolved a lot, and it is worth picking the fruits of that progress. Even in Java, whether to use cursors or not is a configuration option.
I apologize if the cursor is being used on purpose in this case. It is very improbable but possible. I haven't seen such 'proper' cursor usage for ages! I'll be delighted to run into one more proper case.

Firebird backup restore is frustrating, is there a way to avoid it?

I am using Firebird, but lately the database has been growing seriously.
There are a lot of delete statements running, as well as updates and inserts, and the database file size grows really fast.
After deleting tons of records the database size doesn't decrease, and even worse, I have the feeling that the queries are actually getting a bit slower.
In order to fix this, a daily backup/restore process has been introduced, but because of the time it takes to complete, I could say that it is really frustrating to use Firebird.
Any ideas on workarounds or solutions for this will be welcome.
Also, I am considering switching to Interbase because I heard from a friend that it does not have this issue. Is that so?
We have a lot of huge Firebird databases in production but have never had an issue with database growth. Yes, every time a record is deleted or updated, an old version of it is kept in the file. But sooner or later the garbage collector will sweep it away. Once the two processes balance each other, the database file will grow only by the size of new data and indices.
As a general precaution to prevent enormous database growth, try to make your transactions as short as possible. In our applications we use one READ ONLY transaction for reading all the data. This transaction is open for the whole lifetime of the application. For every batch of insert/update/delete statements we use short separate transactions.
A slowdown of database operations could result from obsolete index statistics. Here you can find an example of how to recalculate statistics for all indices: http://www.firebirdfaq.org/faq167/
Check whether you have unfinished transactions in your applications. If a transaction is started but never committed or rolled back, the database has to keep its own record versions for every transaction after the oldest active transaction.
You can check the database statistics (with gstat or an external tool); they show the oldest transaction and the next transaction. If the difference between those numbers keeps growing, you have the stuck transaction problem.
There are also monitoring tools that check the situation; one I've used is Sinatica Monitor for Firebird.
Edit: Also, the database file never shrinks automatically. Parts of it get marked as unused (after a sweep operation) and will be reused. http://www.firebirdfaq.org/faq41/
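The check described above can be scripted. The following is a minimal sketch, assuming the header statistics contain lines of the form "Oldest transaction <number>" and "Next transaction <number>"; the function name transaction_gap is hypothetical and the sample text is made up for illustration, not real output from any database.

```python
import re


def transaction_gap(gstat_header: str) -> int:
    """Return 'Next transaction' minus 'Oldest transaction' from header text."""
    counters = {}
    for line in gstat_header.splitlines():
        match = re.match(
            r"\s*(Oldest transaction|Next transaction)\s+(\d+)\s*$", line
        )
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters["Next transaction"] - counters["Oldest transaction"]


# Made-up sample in the assumed format (numbers are illustrative only).
sample = """\
        Oldest transaction      1562
        Oldest active           1563
        Oldest snapshot         1563
        Next transaction        48211
"""
print(transaction_gap(sample))
```

Recording this value at regular intervals and watching whether it keeps growing is what points to a stuck transaction; a single reading says little on its own.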
The space occupied by deleted records will be reused as soon as it is garbage collected by Firebird.
If GC is not happening (transaction problems?), the DB will keep growing until GC can do its job.
Also, there is a problem when you do a massive delete in a table (e.g. millions of records): the next select on that table will "trigger" the garbage collection, and performance will drop until GC finishes. The only way to work around this is to do the massive deletes at a time when the server is not heavily used and run a sweep afterwards, making sure that there are no stuck transactions.
Also, keep in mind that if you are using "standard" tables to hold temporary data (i.e. rows are inserted and deleted several times), you can get a corrupted database in some circumstances. I strongly suggest you start using the Global Temporary Tables feature.

LockObtainFailedException updating Lucene search index using solr

I've googled this a lot. Most of these issues are caused by a lock being left around after a JVM crash. This is not my case.
I have an index with multiple readers and writers. I'm trying to do a mass index update (delete and add, which is how Lucene does updates). I'm using Solr's embedded server (org.apache.solr.client.solrj.embedded.EmbeddedSolrServer). Other writers are using the remote, non-streaming server (org.apache.solr.client.solrj.impl.CommonsHttpSolrServer).
I kick off this mass update, it runs fine for a while, then dies with a
Caused by:
org.apache.lucene.store.LockObtainFailedException:
Lock obtain timed out:
NativeFSLock#/.../lucene-ff783c5d8800fd9722a95494d07d7e37-write.lock
I've adjusted my lock timeouts in solrconfig.xml
<writeLockTimeout>20000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
I'm about to start reading the lucene code to figure this out. Any help so I don't have to do this would be great!
EDIT: All my updates go through the following code (Scala):
val req = new UpdateRequest
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false)
req.add(docs)
val rsp = req.process(solrServer)
solrServer is an instance of org.apache.solr.client.solrj.impl.CommonsHttpSolrServer, org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer, or org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.
ANOTHER EDIT:
I stopped using EmbeddedSolrServer and it works now. I have two separate processes that update the solr search index:
1) Servlet
2) Command line tool
The command line tool was using the EmbeddedSolrServer and it would eventually crash with the LockObtainFailedException. When I started using StreamingUpdateSolrServer, the problems went away.
I'm still a little confused that the EmbeddedSolrServer would work at all. Can someone explain this? I thought that it would play nicely with the Servlet process and that each would wait while the other is writing.
I'm assuming that you're doing something like:
writer1.writeSomeStuff();
writer2.writeSomeStuff(); // this one doesn't write
The reason this won't work is that the writer stays open unless you close it. So writer1 writes and holds on to the lock, even after it's done writing. (Once a writer gets a lock, it never releases it until it's destroyed.) writer2 can't get the lock, since writer1 is still holding on to it, so it throws a LockObtainFailedException.
If you want to use two writers, you'd need to do something like:
writer1.writeSomeStuff();
writer1.close();
writer2.open();
writer2.writeSomeStuff();
writer2.close();
Since you can only have one writer open at a time, this pretty much negates any benefit you would get from using multiple writers. (It's actually much worse to open and close them all the time since you'll be constantly paying a warmup penalty.)
So the answer to what I suspect is your underlying question is: don't use multiple writers. Use a single writer with multiple threads accessing it (IndexWriter is thread safe). If you're connecting to Solr via REST or some other HTTP API, a single Solr writer should be able to handle many requests.
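The "one writer, many threads" arrangement can be sketched as follows. This is a toy illustration in Python, not Lucene or Solr code: SharedWriter is a hypothetical class whose internal lock plays the role of the thread safety that IndexWriter provides.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedWriter:
    """Toy thread-safe writer: one instance is shared by all worker threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.docs = []

    def add_document(self, doc):
        # The writer serializes access internally, so callers need no locking.
        with self._lock:
            self.docs.append(doc)


writer = SharedWriter()  # created once, never a second instance
with ThreadPoolExecutor(max_workers=4) as pool:
    # Draining the iterator surfaces any exception raised in a worker.
    list(pool.map(writer.add_document, range(1000)))

print(len(writer.docs))
```

All threads go through the same object, so there is never a second writer competing for the index lock.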
I'm not sure what your use case is, but another possible answer is to see Solr's Recommendations for managing multiple indices. Particularly the ability to hot-swap cores might be of interest.
>> But you have multiple Solr servers writing to the same location, right?
No, wrong. Solr is using the Lucene libraries and it is stated in "Lucene in Action" * that there can only be one process/thread writing to the index at a time. That is why the writer takes a lock.
Your concurrent processes that are trying to write could, perhaps, check for the org.apache.lucene.store.LockObtainFailedException exception when instantiating the writer.
You could, for instance, put the process that instantiates writer2 in a waiting loop until the active writing process finishes and calls writer1.close(), which will release the lock and make the Lucene index available for writing again. Alternatively, you could have multiple Lucene indexes (in different locations) being written to concurrently, and when doing a search you would need to search through all of them.
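A waiting loop of that kind might be sketched like this. It is a toy analogy in Python built on a plain lock file created with exclusive-create semantics; it is not how Lucene implements its native file-system lock, and the names obtain_write_lock, release_write_lock and LockObtainFailed are hypothetical.

```python
import os
import tempfile
import time


class LockObtainFailed(Exception):
    """Raised when the lock could not be obtained before the timeout."""


def obtain_write_lock(index_dir, timeout=5.0, poll=0.1):
    """Create <index_dir>/write.lock atomically, retrying until the timeout."""
    path = os.path.join(index_dir, "write.lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return path
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockObtainFailed(path)
            time.sleep(poll)  # another writer is active: wait and retry


def release_write_lock(path):
    """Analogue of closing the writer: the lock file goes away."""
    os.remove(path)


with tempfile.TemporaryDirectory() as index_dir:
    lock = obtain_write_lock(index_dir)  # "writer1" opens the index
    try:
        obtain_write_lock(index_dir, timeout=0.2, poll=0.05)  # "writer2" waits
    except LockObtainFailed:
        second_writer_blocked = True
    else:
        second_writer_blocked = False
    release_write_lock(lock)  # "writer1.close()"
    reacquired = obtain_write_lock(index_dir, timeout=0.2, poll=0.05)
    release_write_lock(reacquired)

print(second_writer_blocked)
```

The timeout matters: without it, a writer that never closes would leave every other process waiting forever.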
* "In order to enforce a single writer at a time, which means an IndexWriter or an IndexReader doing deletions or changing norms, Lucene uses a file-based lock: If the lock file (write.lock, by default) exists in your index directory, a writer currently has the index open. Any attempt to create another writer on the same index will hit a LockObtainFailedException. This is a vital protection mechanism, because if two writers are accidentally created on a single index, it will very quickly lead to index corruption."
Section 2.11.3, Lucene in Action, Second Edition, Michael McCandless, Erik Hatcher, and Otis Gospodnetić, 2010