lucene documents disappeared (index size shrinks)

I have a cluster of servers indexing documents with Lucene; the indexes are located on a network drive. I synchronize IndexWriter creation so that only one instance exists at a time. About 50 documents are added to the index every second, so the index is actively updated. But within a day or so, the index suddenly shrinks, going from a few GB to almost empty, and then starts growing again. It appears that all documents in the index are suddenly deleted and a new index is being built. I checked the index with Luke and could not find any documents marked as deleted. I have not been able to catch exactly when this happens. But when it does, all the files in the index directory have a new timestamp within the past few minutes, except the write.lock file, which keeps the timestamp from when the index was first created. There are no errors during indexing, and the indexes (before and after the shrink) are normal (searchable).
Has anyone seen anything like this? Any idea what is causing this behavior? I am using Lucene 6.1 on Windows Server 2012.
Thanks a lot in advance!

Related

Memory consumption in my Jersey application keeps growing with time

Memory consumption in my application keeps growing with time. The app uses Lucene, and searches are performed through REST endpoints that search Lucene directories. Around 10 different directories are created, and multiple users can search one or more directories at the same time. During a search, the app also checks whether any record has been added or modified in the DB; if so, the directories are updated by deleting and re-adding the documents. I could not find anything wrong in the Lucene configuration or in the code written for search; for example, the IndexWriter on each directory is flushed and committed after documents are deleted and added. I am just wondering whether search can also consume memory. I can provide more details if required.
I would appreciate any clue about what exactly might be wrong.

Sitecore_master_index has stopped updating in Solr

We are using Solr, hosted on Tomcat, with Sitecore 8.0. We are having an odd issue where sitecore_master_index incrementally updates an index for about half an hour after Tomcat or server start but then stops the indexing after that. This has been the case for the last 4 days now.
In the crawling logs for Sitecore there are never any entries found related to sitecore_master_index but other indexes including sitecore_core_index and sitecore_web_index get updated. Entries for the latter indexes appear in the crawling logs.
I have checked everything including the eventqueue tables size, index update strategy (which is syncmaster), indexing not paused, etc. but there is no obvious answer to this puzzle.
Any clues where to look?

Solr tlog extremely large, not merging with index after commits

I am in the process of bulk indexing into a Solr 5.0 collection that currently holds approximately 200 million documents. I am noticing that the tlog keeps building up and is not being deleted; in addition, indexing performance has become really slow. I am wondering why the tlog is not being removed. This is what the data directory looks like:
du -sh *
4.0K data
69G index
109G tlog
I've tried multiple variations of:
update?commit=true&expungeDeletes=true&openSearcher=true
I see in the log file that Solr is picking it up, but there are no changes.
The commit settings in solrconfig are:
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1500000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>900000</maxTime>
  <maxDocs>2000000</maxDocs>
</autoSoftCommit>
One thing to keep in mind is that I had soft commit commented out during the indexing process. Also, these values are pretty high because this is a relatively index-heavy collection with pretty controlled querying, so the commit strategy is pretty relaxed.
I restarted Solr, and naturally it is taking forever to start because it is replaying the tlog; I am not sure whether this will clear up once it has fully started. I am under the impression that Solr keeps some tlogs around in case it needs to replicate the data to another collection, but this is a standalone instance, so that is not really necessary. Also, since the tlog is larger than the index folder, I am assuming there are items not yet committed to the main index. Is that right?
Any idea what's happening here?

So I thought I'd pass along an update, even though it's a bit late.
I restarted the Solr instance; naturally, it took about 4 hours to start up since the tlogs had to be replayed. They were then purged after a commit.

Solr Index slow after a while

I use SolrJ to send data to my Solr server.
When I start my program, it indexes at a rate of about 1000 docs per second (I commit every 250,000 docs).
I have noticed that once my index fills up with about 5 million docs, it starts crawling, not just at commit time but at add time too.
My Solr server and indexing program run on the same machine.
Here are some of the relevant portions from my solrconfig:
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>1024</ramBufferSizeMB>
<mergeFactor>150</mergeFactor>
Any suggestions about how to fix this?

That merge factor seems really, really (really) high.
Do you really want that?
If you aren't using compound files, that could easily lead to a ulimit problem (if you are on Linux).

Best way to keep index real time?

I have a Solr/Lucene index of approximately 700 GB. The documents that I need to index arrive in real time: roughly 1000 docs are submitted every 30 minutes and need to be indexed. In my scenario, a script runs every 30 minutes and indexes the documents that are not yet indexed, since it is a requirement that new documents be searchable as soon as possible, but this process slows down searching.
Is this the best way to index the latest documents, or is there a better way?

First, remember that Solr is not a real-time search engine (yet). There is still work to be done.
You can use a master/slave setup, where indexing is done on the master and searching on the slave. With this, indexing does not affect search performance. After the commit is done on the master, force the slave to fetch the latest index from the master. While the new index is being replicated to the slave, the slave keeps processing queries with the previous index.
Also, check your cache warming settings. Remember that warming might slow down searches if those settings are too aggressive. Also check the queries launched on the new searcher event.
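To illustrate the point that queries keep running against the previous index while a new one is being copied, here is a minimal sketch in plain Java. It is a toy model, not Solr's replication code: the class `ReplicaSwap`, its methods, and the use of a list of strings as an "index" are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Toy model (hypothetical names): a "slave" that serves queries from the
// last fully copied index version while the next version is being copied.
public class ReplicaSwap {

    // The index version that queries currently read from.
    private final AtomicReference<List<String>> serving =
            new AtomicReference<>(Collections.<String>emptyList());

    // Each query grabs one version up front and uses it throughout.
    public List<String> search(String term) {
        List<String> snapshot = serving.get();
        List<String> hits = new ArrayList<>();
        for (String doc : snapshot) {
            if (doc.contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Copy the master's documents first (the slow part), then publish the
    // finished copy in one atomic step (the fast part).
    public void replicate(List<String> masterDocs) {
        List<String> copy =
                Collections.unmodifiableList(new ArrayList<>(masterDocs));
        serving.set(copy);
    }
}
```

A query that started before `replicate` finishes simply completes against the older version; only later queries see the new one.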

You can do this easily with Lucene. Split the index into multiple parts (or, to be precise, create "smaller" parts while building the index). Create a searcher for each part and store a reference to it. You can then create a MultiSearcher on top of these individual parts.
Only one of these parts will receive the new documents. At regular intervals, add documents to this part, commit, and re-open its searcher.
After this last part is updated, you can create a new MultiSearcher, reusing the previously opened searchers for the other parts.
Thus, at any point, you will be re-opening only one searcher, and that will be quite fast.
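The idea in this answer can be sketched in plain Java as follows. This is a toy model under stated assumptions, not Lucene code: `PartitionedSearch`, `PartSearcher`, and `commitAndReopen` are hypothetical names, and each "part" is just a list of strings. In a real setup the parts would be Lucene indexes; note that MultiSearcher was removed in Lucene 4.0, where an IndexSearcher over a MultiReader fills that role.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model (hypothetical names): several frozen index parts plus one live
// part. Only the live part's searcher is ever re-opened.
public class PartitionedSearch {

    // Point-in-time view of one part; stands in for an opened searcher.
    static final class PartSearcher {
        private final List<String> docs;

        PartSearcher(List<String> docs) {
            // Copy so that later writes are not visible through this view.
            this.docs = Collections.unmodifiableList(new ArrayList<>(docs));
        }

        List<String> search(String term) {
            List<String> hits = new ArrayList<>();
            for (String doc : docs) {
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
            return hits;
        }
    }

    private final List<PartSearcher> frozenParts;             // opened once
    private final List<String> liveDocs = new ArrayList<>();  // receives writes
    private volatile PartSearcher liveSearcher =
            new PartSearcher(Collections.<String>emptyList());

    public PartitionedSearch(List<List<String>> frozen) {
        List<PartSearcher> parts = new ArrayList<>();
        for (List<String> part : frozen) {
            parts.add(new PartSearcher(part));
        }
        this.frozenParts = Collections.unmodifiableList(parts);
    }

    // New documents go to the live part only; not searchable until reopen.
    public synchronized void add(String doc) {
        liveDocs.add(doc);
    }

    // "Commit and re-open": rebuild only the live part's searcher.
    public synchronized void commitAndReopen() {
        liveSearcher = new PartSearcher(liveDocs);
    }

    // Search every frozen part, then the current view of the live part.
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (PartSearcher part : frozenParts) {
            hits.addAll(part.search(term));
        }
        hits.addAll(liveSearcher.search(term));
        return hits;
    }

    public static void main(String[] args) {
        PartitionedSearch index = new PartitionedSearch(Arrays.asList(
                Arrays.asList("lucene basics", "solr tuning"),
                Arrays.asList("lucene merging")));
        index.add("lucene real time");
        // prints [lucene basics, lucene merging]
        System.out.println(index.search("lucene"));
        index.commitAndReopen();
        // prints [lucene basics, lucene merging, lucene real time]
        System.out.println(index.search("lucene"));
    }
}
```

The cost of `commitAndReopen` depends only on the size of the live part, which is why keeping that part small keeps re-opening fast.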

Check out Zoie (http://code.google.com/p/zoie/), a wrapper around Lucene that makes it real time; the code was donated by LinkedIn.

^^ I do this with plain Lucene (not Solr), and it works really well. However, I am not sure whether there is a Solr way to do it at the moment. Twitter recently went with Lucene for search and has effectively real-time search by simply writing to their index on every update. Their index resides completely in memory, so updating and reading the index is of no consequence and happens instantly. A Lucene index can always be read while it is being written to, as long as there is only one writer at a time.

Check out this wiki page
