How do I remove logically deleted documents from a Solr index?

I am implementing Solr for a free text search for a project where the records available to be searched will need to be added and deleted on a large scale every day.
Because of the scale I need to make sure that the size of the index is appropriate.
On my test installation of Solr, I index a set of 10 documents. Then I make a change to one of the documents and want to replace the document with the same ID in the index. This works correctly and behaves as expected when I search.
I am using this code to update the document:
getSolrServer().deleteById(document.getIndexId());
getSolrServer().add(document.getSolrInputDocument());
getSolrServer().commit();
What I noticed, though, is that when I look at the stats page for the Solr server, the figures are not what I expect.
After the initial index, numDocs and maxDocs both equal 10 as expected. When I update the document however, numDocs is still equal to 10 (expected) but maxDocs equals 11 (unexpected).
When reading the documentation I see that
maxDoc may be larger as the maxDoc count includes logically deleted documents that have not yet been removed from the index.
So the question is, how do I remove logically deleted documents from the index?
If these documents still exist in the index do I run the risk of performance penalties when this is run with a very large volume of documents?
Thanks :)

You have to optimize your index.
Note that an optimize is expensive; you probably should not do it more than once a day.
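For reference, a minimal sketch of triggering that optimize from SolrJ, reusing the getSolrServer() helper from the question (run it on a schedule, not after every update):
// Merge segments and physically remove the logically deleted documents
getSolrServer().optimize();
// Afterwards maxDocs should drop back to numDocs on the stats page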
Here is some more info on optimize:
http://www.lucidimagination.com/search/document/CDRG_ch06_6.3.1.3
http://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations

Related

Are docids constant if the index is not manipulated in Lucene 8.6.1?

Say I update my index once a day, everyday, at the same time. During the time between updates (for 21 hours or so), will the docids remain constant?
As @andrewjames mentioned, docIds only change when a merge happens. A docId is basically the array index position of the doc within a particular segment.
A side effect of that is that if you have multiple segments, a given docId might be assigned to multiple docs: one in one segment, one in another segment, and so on. If that's a problem, you can do a force merge once you are done building your index so that there is only a single segment; then no two docs will share the same docId.
The docId for a given document will not change if a merge does not happen. And a merge won't happen unless you call force merge, add or delete documents, or upgrade your index.
So... if you build your index and don't add docs, delete docs, call force merge, or upgrade your index, then the docIds will be stable. But the next time you build your index, a given doc may receive a totally different docId. And as @andrewjames said, the docId assignments and the timing of those assignments are an internal affair in Lucene, so you should be cautious about relying on them even when you know when and how they are currently assigned.
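If you do want that single-segment guarantee, a minimal sketch of the force merge (the directory and analyzer here are placeholders, not taken from the question):
// Collapse the index to a single segment so no two live docs share a docId
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
try (IndexWriter writer = new IndexWriter(directory, config)) {
    writer.forceMerge(1);   // expensive: rewrites the whole index
    writer.commit();
}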

Put and Delete with CouchDB + Lucene

I'm running CouchDB (1.2.1) + Lucene on Linux (https://github.com/rnewson/couchdb-lucene/), and I have a few questions.
I index everything - one index for all documents. I've got around 20,000,000 documents.
How fast are puts/deletes done on the index? I have about 10-50 puts/deletes a second.
Is there a rule, like after 10,000 updates you have to optimize the index?
Are changes in documents immediately visible in the index? If not, is there a delay or a temporary table for these updates/deletes?
Thanks in advance - Brandon
Use a profiler to measure the put/delete performance. That's the only way you'll get reasonably accurate numbers.
Optimization depends on how quickly the index is changing -- again, you will need to experiment and profile.
Changes are immediately visible in the index, but not to already-open IndexReaders.
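couchdb-lucene manages its readers for you, but for context, the usual plain-Lucene pattern for making those changes visible is to reopen the reader; a rough sketch for recent Lucene versions (oldReader stands in for an already-open DirectoryReader):
// openIfChanged returns a fresh reader if the index has changed, or null if it has not
DirectoryReader newReader = DirectoryReader.openIfChanged(oldReader);
if (newReader != null) {
    oldReader.close();
    oldReader = newReader;   // searches via this reader now see the puts/deletes
}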

Bulk update strategy lucene?

For a project I am working on, I have a index of nearly 10 million documents. For sets of documents, ranging from 100k to 5m, I need to regularly add fields.
Lucene 4 supports updating documents (basically remove and add). What would be a good approach to add the field to a larger set of documents?
What I have tried so far is using the SearcherManager to wrap an IndexWriter, and making small searches for documents that do not yet contain the field but do match the Query I am interested in, by wrapping these in a BooleanQuery. I then iterate over the ScoreDocs, retrieve the documents, add my new field and call writer.updateDocument with the uuid I stored with each document. Then I call commit and maybeRefreshBlocking, reacquire the IndexSearcher and search again. This is rather slow and seems like a naive approach.
You only need to reacquire the IndexSearcher when your searches would return different results because of the fields that you add.
In the case where your searches are never affected by the fields that you add, you only need to reacquire the IndexSearcher when documents are added to the index.
So it will simplify and speed things up at least a little if you only reacquire the IndexSearcher when necessary rather than before each search.
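For illustration, a rough sketch of one batch of that loop with the refresh deferred to the end; the uuid field comes from the question, while batchSize, myNewField and the computeValue helper are hypothetical:
// Find docs matching the query that still lack the field, update them, refresh once per batch
IndexSearcher searcher = searcherManager.acquire();
try {
    TopDocs hits = searcher.search(query, batchSize);
    for (ScoreDoc hit : hits.scoreDocs) {
        Document doc = searcher.doc(hit.doc);
        doc.add(new StringField("myNewField", computeValue(doc), Field.Store.YES));
        // updateDocument deletes the old doc with this uuid and adds the modified one
        writer.updateDocument(new Term("uuid", doc.get("uuid")), doc);
    }
} finally {
    searcherManager.release(searcher);
}
writer.commit();
searcherManager.maybeRefreshBlocking();   // reacquire a fresh searcher once per batch, not per search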

Apache Solr - indexing a DB table appears to retrieve more records than contained in the table

I'm very new to Solr so If I am saying something that doesn't make sense please let me know.
I've recently set up Solr 4.0 beta and it is working quite well. It is set up with DIH to read a view from a MySQL DB. The view contains about 20 million rows and 16 columns. A number of the columns have a lot of NULL values. The performance of the DB is quite good - I get sub-second query times against the view when I run a query manually.
I pointed Solr at the view and it began the index process. I came back four hours later to check on it and discovered that not only was it still indexing but that it reported having fetched 200+ million rows.
Am I misunderstanding how Solr works? I was under the assumption that it would fetch the same number of rows as what is in the DB - which is about 20 million. Or is it actually counting each field as an item fetched? Or, even worse, is it in some kind of loop?
I did some prior testing with a small sub-set of the data from the very same view by limiting the query to 100,000 records. On completion, it reported as having fetched exactly 100,000. I am not getting any warnings/errors in the logs either.
Any ideas on what's happening?
That number represents rows fetched from the DB, not documents indexed. With DIH, every query it runs adds to the fetched count, so if your db-data-config.xml has nested entities (a sub-query per row) the count can be many times the number of root rows. Could you post your db-data-config.xml file? I think you should check your SQL again.

How to deal with constantly changing data and SOLR indexes?

Afternoon guys,
I'm using a SOLR index for searching through items on my site. The search results contain the average rating of the item and the number of comments the item has. The results can be sorted by both rating and number of comments.
But obviously with the Solr index, these numbers aren't updated until the DB (~2 million rows) is reindexed (done nightly, probably).
What would you guys think is the best way to approach this?
Well, I think you should change your DB - index sync policy:
First approach: when committing database changes, also post those changes (a batch of them) to the index. You should write a mapper tier to map your domain objects to Solr docs (remember: persist first, and if that goes OK then index - this works fine for us ;-)). A rough sketch of this is shown after the second approach. If you want to achieve near real-time index updates you should look at solutions like Zoie (LinkedIn's Lucene-based search framework).
Second approach: take a look at delta-import (and schedule more frequent index updates).
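A rough sketch of that first approach with SolrJ, assuming a domain object with illustrative getters and field names (not your actual schema): after the DB write commits, push just the changed item and let commitWithin batch the commits for you.
// Push the changed item to Solr right after the DB transaction succeeds
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", item.getId());
doc.addField("avg_rating", item.getAverageRating());
doc.addField("num_comments", item.getCommentCount());
solrServer.add(doc, 60000);   // commitWithin 60 seconds, so Solr batches the commits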