Sitecore syncMaster index strategy not working - lucene

I have a rendering component that runs a search using the Lucene index to populate itself.
We have two indexes defined: Master and Web. The component uses the Master index in the experience editor and the Web index on the actual site.
We've configured the Web index strategy as onPublishEndAsync, and we've configured the Master index strategy as syncMaster, the idea being that CMS users can add/edit Sitecore items that power this component, and see them straight away in the experience editor.
However, it seems that the master index is not being updated as we change data in Sitecore. The experience editor only shows the data once I've manually run an index rebuild.
<strategies hint="list:AddStrategy">
  <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/syncMaster" />
</strategies>
Why doesn't the index update itself upon data changes?
UPDATE
So I've compared the files suggested to a clean install and they are the same.
I should add, I'm not using the standard sitecore_master_index. We have multiple sites running off the same instance of Sitecore, so we have added a config include for websitename_master_index. I have compared the config for this within the <index> node against sitecore_master_index in Sitecore.ContentSearch.Lucene.Index.Master.config. The only differences are the crawler's <root> element, which points to the particular site's content node, and some custom fields we've added. I assume these fields wouldn't be causing a problem, since we can manually rebuild the index fine?
One other interesting thing I found when looking at the showconfig.aspx was this:
<agent type="Sitecore.ContentSearch.Tasks.Optimize" method="Run" interval="12:00:00" patch:source="Sitecore.ContentSearch.config">
  <indexes hint="list">
    <index>sitecore_master_index</index>
  </indexes>
</agent>
I'm not sure if this has any significance, but there was not a matching entry for our custom websitename_master_index?
UPDATE
I've also enabled debug-level logging for the crawler.
In the crawling.log I only see the following:
14416 08:55:10 INFO [Index=website_master_index] Initializing SitecoreItemCrawler. DB:master / Root:/sitecore/Content/Website/Home
14416 08:55:10 INFO [Index=website_master_index] Initializing SynchronousStrategy.
After editing and saving items, there is no further mention of the index in the log. The same is true of the standard sitecore_master_index, whose config we haven't altered.

To guarantee that Lucene index files are not modified concurrently, Lucene uses a .lock file: whichever process is about to write has to create the file first.
If a lock file already exists, the process waits for it to be removed.
If a writer process is terminated abruptly, the lock file is never removed, and so the index is never updated.
The solution was to clean the index folder manually.
To make a more precise diagnosis, a memory snapshot of the process is needed to see what is happening inside (that is, what each thread is doing).
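Before cleaning the folder by hand, it can help to see which lock files are actually lying around. The following is a minimal sketch in Python, not part of Sitecore or Lucene: the function name, the age threshold and the index folder path are all assumptions for illustration. It only lists files ending in .lock (Lucene's default lock file is named write.lock) whose modification time is older than a threshold. It cannot tell whether a writer still holds the lock, so treat the output as a hint and stop the site before deleting anything.

```python
import os
import time


def find_stale_locks(index_root, max_age_seconds=600, now=None):
    """Return sorted paths of *.lock files under index_root older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(index_root):
        for name in filenames:
            # Assumption: lock files end in ".lock" (e.g. Lucene's "write.lock").
            if not name.endswith(".lock"):
                continue
            path = os.path.join(dirpath, name)
            age = now - os.path.getmtime(path)
            if age > max_age_seconds:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical location; use the index folder of your own installation.
    for lock_path in find_stale_locks(r"C:\inetpub\wwwroot\Website\Data\indexes"):
        print(lock_path)
```

The script deliberately does not delete anything; removing a lock that a live writer still holds could corrupt the index.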

Related

How to fix empty semantic media wiki query results after restore?

After restoring a semantic media wiki installation from backup, the SMW engine no longer returns any query results. I have (re)inserted all regular pages, all form pages and all property pages into the new MW instance. So all the content is there, but query results remain empty. It seems as if the internal data structures maintained by SMW are not filled. How can this be fixed? Are there any specific scripts that need to be run manually?
Indeed the internal SMW cache is not filled after restore. The solution is simple: You need to go to extensions/SemanticMediaWiki/maintenance and run the script rebuildData.php which will reparse every single Wiki page and fill the SMW database accordingly.
Be aware that for this to work your Wiki needs to be configured properly. By default SMW will not process additional namespaces! You need to enable this manually for every single namespace you add yourself in LocalSettings.php after the line where you enable SMW in this file. (This configuration will only have effect if you do this after the line that enables SMW.)

How to disable versioning in Jackrabbit?

I am working on a legacy application currently incorporating Jackrabbit 2.6, which at some point used the jackrabbit versioning (I am not even sure if it was with this or another jackrabbit version). Currently the versioning is still present in the configuration and its corresponding DB tables (*_BINVAL, *_BUNDLE, *_NAMES, *_REFS) are still there.
I would like to have the versioning disabled and completely removed as it takes up space in our database and slows down the Jackrabbit garbage collection with an empty run over the versioning persistence manager. I cannot find any information though about how to proceed with it.
Is it safe to simply remove the <Versioning>...</Versioning> tag from the xml configuration and to drop the related tables? How should I proceed?
Unfortunately, versioning is mandatory. Therefore we needed to clean as much of the version information as possible. In my case it turned out that somehow the mix:versionable mixins disappeared (probably due to changes in the custom node types and OCM), leaving the version related properties behind. What I ended up doing:
1. Iterate over the whole repository, deleting the version history for each node (either by removing the mixin or, in my case, the versioning properties), and save the session after every X changed nodes.
2. Close the Jackrabbit repository and rename the versioning tables (*_BINVAL, *_BUNDLE, *_NAMES, *_REFS) in the database to hide them from Jackrabbit.
3. Start Jackrabbit again. The tables in the database are recreated and, apart from three default nodes, are empty.
4. After confirming that the repository is intact, drop the hidden tables.
The garbage collection has become faster - we went down from two weeks to 4 hours. The version history contained millions of entries, which were completely unnecessary.
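The "save the session after every X changed nodes" part of the first step is a plain batching pattern. Here is a minimal sketch of it in Python, purely for illustration: real code would use the JCR API from Java, and nodes, strip_node and save are hypothetical stand-ins for the repository traversal, the removal of the versioning data from one node, and the session save.

```python
def strip_versioning_in_batches(nodes, strip_node, save, batch_size=500):
    """Apply strip_node to each node, calling save() after every batch_size changes.

    Returns the number of nodes that were actually changed.
    """
    changed = 0
    pending = 0
    for node in nodes:
        # strip_node returns True only when it modified the node.
        if strip_node(node):
            changed += 1
            pending += 1
            if pending >= batch_size:
                save()
                pending = 0
    if pending:
        save()  # flush the final, partial batch
    return changed
```

Saving in batches keeps the number of unsaved changes bounded, which matters when the version history contains millions of entries.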

What is a good practice to entirely replace an existing Lucene index?

We use Lucene as a search engine. Our Lucene index is created by a master server, which is then deployed to slave instances.
This deployment is currently done by a script that deletes the files and copies the new ones.
We needed to know if there was any good practice to do a "hot deployment" of a Lucene index. Do we need to stop or suspend Lucene? Do we need to inform Lucene the index has changed?
Thanks
The first step is to open the index in append mode for writing. You can achieve this by creating the IndexWriter with the open mode IndexWriterConfig.OpenMode.CREATE_OR_APPEND.
Once this is done, you are ready to both update existing documents and add new documents. To update documents, each document needs some kind of unique identifier (it could be the URL or anything else that is guaranteed to be unique). To update the document with the id "Doc001", call Lucene's updateDocument function, passing a Term that identifies "Doc001" as the first argument.
By this you can update an existing index without deleting it.
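What updateDocument does can be pictured as "delete every document matching the term, then add the new one". The following toy model in Python sketches only those semantics. It is not Lucene's API, and the class and method names are made up for illustration.

```python
class ToyIndex:
    """In-memory stand-in for an index opened in create-or-append mode."""

    def __init__(self, existing_docs=None):
        # Existing documents are kept, mirroring "append" rather than "create".
        self.docs = list(existing_docs or [])

    def update_document(self, field, value, new_doc):
        """Delete every document whose `field` equals `value`, then add new_doc."""
        self.docs = [d for d in self.docs if d.get(field) != value]
        self.docs.append(new_doc)


index = ToyIndex([{"id": "Doc001", "body": "old"}, {"id": "Doc002", "body": "keep"}])
index.update_document("id", "Doc001", {"id": "Doc001", "body": "new"})
print(index.docs)
```

If no document matches the identifier, the call simply adds the new document, which is why the same function covers both updates and additions.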

Sitecore "Indexing is paused"

I seem to be having some issues with the IntervalAsynchronousStrategy for updating content items.
Sometimes, the indexes will not be automatically updated with this strategy, and a manual index rebuild is required.
These are the corresponding log file entries:
8404 09:20:24 INFO [Index=artscentre_web_index] IntervalAsynchronousUpdateStrategy executing.
8404 09:20:24 INFO [Index=artscentre_web_index] History engine is empty. Incremental rebuild returns
8032 09:20:21 WARN [Index=artscentre_web_index] IntervalAsynchronousUpdateStrategy triggered but muted. Indexing is paused.
I see this every time the index rebuilds, even though content is being edited and published in that time.
I have previously swapped from the OnPublishEnd rebuild strategy to the interval strategy as I was finding that publishing content would not trigger an index rebuild either.
Our environment is a single instance setup only, so the single IIS website handles both CM and CD. Therefore I can eliminate anything to do with remote events, I think?
Has anyone else had this much trouble getting Sitecore to maintain index updates?
Cheers,
Justin

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow and I'd like to implement Lucene querying for content to speed up things.
I am wondering if I can just use the System index instead of having to setup a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4->6.6)
As with everything .. it depends .. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes you may want a more specialised or restricted list of items, for example only items based on a certain template, or you may need a checkbox field indexed (as the system index by default only indexes text fields).
Setting up your own search index is pretty easy.. It does require some fiddling with the web.config though (and I'd recommend adding it as a .include file).
Create a new <index> node with its own id, which defines the name of the collection and the folder it will go into. (You can check it's working by looking for the dir in the /data/indexes directory of your installation.)
.. next you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published stuff) and where to start the search from (in this example I am indexing only the news section). You can tag, boost and tell it whether to IndexAllFields (otherwise it will only index fields it understands as text .. rich-text / multi-line text / text etc).
.. Finally, you can tell the indexer which template types to include or exclude.
The indexer works by subscribing to item events within Sitecore .. so every time an item is changed, moved or deleted the index is updated automatically. Obviously if you are indexing the web db the items will need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that you are seeing things slow down when there is a lot of content, indexes slow down when there is a lot of content in them as well.
I prefer targeted indexes that are meant specifically to drive the functionality I need and contain only the data that is required. This allows for smaller and more efficient index usage on your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this lucene indexing module.
A separate index is always a wise decision, because you can keep it light. In big environments the system index can grow to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally: the system index is for the master database, you'll be querying the web database, possibly on a content delivery server.