We use Lucene as a search engine. Our Lucene index is built on a master server and then deployed to slave instances.
This deployment is currently done by a script that deletes the old files and copies the new ones.
We would like to know whether there is a good practice for doing a "hot deployment" of a Lucene index. Do we need to stop or suspend Lucene? Do we need to inform Lucene that the index has changed?
Thanks
The first step is to open the index in append mode for writing. You can achieve this by creating the IndexWriter with an IndexWriterConfig whose open mode is IndexWriterConfig.OpenMode.CREATE_OR_APPEND.
Once this is done, you are ready to both update existing documents and add new ones. To update documents, you need some kind of unique identifier for each document (it could be the URL or anything else that is guaranteed to be unique). If you want to update the document with id "Doc001", say, simply call IndexWriter's updateDocument method, passing a Term that matches "Doc001" as the first argument.
This way you can update an existing index without deleting it.
I use bin/post to index all my files in /documents (mounted volume). It works and full-text search works fine.
I do an atomic update of a specific metadata field that I added to the schema BEFORE posting all docs; that works too.
I do a full-text search to find the document whose metadata was updated, and it DOESN'T work anymore: the updates are there, but the full-text index for that document seems to have disappeared.
I do a full re-index, and it overwrites my added metadata for the doc, resetting it to the default value, even though the metadata field I added is both stored and indexed.
I'm not sure what to do. It means that every re-index will reset my added metadata, which is not great.
The update, under the hood, reconstructs the document from its stored fields, applies the changes, and writes the result back to disk. At the Lucene level there is no "document update"; it is a higher-level concept. That's how the search indexes stay fast in this architecture.
So your full-text field, which is not stored, does not show up in the reconstructed document and is not written again in the "updated document".
If you have such a mix of stored and non-stored fields, you have to merge your updates outside of Solr, starting from the original full content.
Alternatively, depending on your use case, if you are just returning those updated values, you could inject them with a custom SearchComponent, use ExternalFileField, or similar. The user mailing list could be a good place to ask about the various options.
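To make the effect concrete, here is a toy model of that reconstruction step. It is not Solr or Lucene code: the class name, the field names ("id", "status", "content") and the choice of which fields count as stored are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model only: plain maps stand in for documents; nothing here is Solr API.
class ToyAtomicUpdate {
    // Hypothetical schema: these fields are stored, everything else is index-only.
    static final Set<String> STORED = Set.of("id", "status");

    // What can be read back for a document: its stored fields and nothing else.
    static Map<String, String> storedView(Map<String, String> indexedDoc) {
        Map<String, String> view = new HashMap<>();
        for (Map.Entry<String, String> e : indexedDoc.entrySet()) {
            if (STORED.contains(e.getKey())) {
                view.put(e.getKey(), e.getValue());
            }
        }
        return view;
    }

    // "Atomic update": rebuild from stored fields, apply the change, re-index the result.
    static Map<String, String> atomicUpdate(Map<String, String> indexedDoc,
                                            String field, String value) {
        Map<String, String> rebuilt = storedView(indexedDoc);
        rebuilt.put(field, value);
        return rebuilt;
    }

    public static void main(String[] args) {
        Map<String, String> original = new HashMap<>();
        original.put("id", "doc1");
        original.put("status", "new");
        original.put("content", "full text extracted from the file"); // not stored

        Map<String, String> updated = atomicUpdate(original, "status", "reviewed");
        // The non-stored "content" field is missing from the re-indexed document.
        System.out.println(updated.containsKey("content"));
    }
}
```

In this model the only ways to keep "content" searchable after the update are to make it stored or to rebuild the whole document from the original source.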
I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow and I'd like to implement Lucene querying for content to speed up things.
I am wondering if I can just use the System index instead of having to set up a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4->6.6)
As with everything .. it depends .. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes you may want a more specialised or restricted list of items, such as only items based on a certain template, or you may need a checkbox field indexed (the system index by default only indexes text fields).
Setting up your own search index is pretty easy.. It does require some fiddling with the web.config though (and I'd recommend adding it as a .include file).
Create a new <index> node with its own id, which will define the name of the collection and the folder it will go into. (You can check it's working by looking for the dir in the /data/indexes directory of your installation.)
.. next you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published stuff) and where to start the search from (in this example I am indexing only the news section). You can tag, boost, and specify whether to IndexAllFields (otherwise it will only index fields it understands as text .. rich-text / multi-line text / text etc).
.. Finally, you can tell the indexer which template types to include or exclude.
The indexer works by subscribing to item events within Sitecore .. so every time an item is changed, moved, or deleted, the index is updated automatically. Obviously, if you are indexing the web db, the items will need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that you are seeing things slow down when there is a lot of content, indexes slow down when there is a lot of content in them as well.
I prefer targeted indexes that are meant specifically to drive the functionality I need and contain only the data that is required. This allows for smaller and more efficient index usage on your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.
A separate index is always a wise decision, because you can keep it light. In big environments the system index can grow to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally: the system index is for the master database, you'll be querying the web database, possibly on a content delivery server.
I am creating a tagging system for my site.
I've got the basics of adding a document to Lucene, but I can't seem to figure out how to delete a document, or update one when the user changes the tags of something. I found pages that say to use the document index and that I need to optimize before the change takes effect, but how do I get the document index? I've also seen another page that said to use IndexWriter to delete, but I couldn't figure out how to do it with that either.
I am using C# ASP.NET and I don't have Java installed on that machine.
What version of Lucene are you using? The IndexWriter class has an updateDocument method that lets you update a document (BTW, an update under the hood is really a delete followed by an add). You will need some identifier (such as a document ID) to update by. When you index the document, add a unique document identifier such as a URL, a counter, etc. Then the "Term" will be the ID of the document you wish to update. For example, using a URL, you can update like this:
IndexWriter writer = ...
writer.updateDocument(new Term("id", "http://somedomain.org/somedoc.htm"), doc);
You need an IndexReader to delete a document. I'm not sure about the .NET version, but the Java and C++ versions of the Lucene API have an IndexModifier class that hides the differences between the IndexReader and IndexWriter classes and just uses the appropriate one as you call addDocument() and removeDocument().
Also, there is no concept of updating a document in Lucene; you have to remove it and then re-add it. To do this, you need to make sure that every document has a unique stored ID in the index.
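The "remove it, then re-add it" idea can be sketched with a toy in-memory index. This is not the Lucene or Lucene.NET API: ToyIndex and its methods are hypothetical stand-ins, and a list of maps plays the role of the index. It only shows why each document needs a unique ID field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only: a list of field maps standing in for a Lucene index.
class ToyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    // Delete every document whose field equals the given value; returns the count removed.
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    // An "update" is a delete by term followed by an add of the replacement.
    void updateDocument(String field, String value, Map<String, String> replacement) {
        deleteDocuments(field, value);
        addDocument(replacement);
    }

    int size() {
        return docs.size();
    }

    // Look up one field of the document with the given "id" value, or null if absent.
    String get(String idValue, String field) {
        for (Map<String, String> d : docs) {
            if (idValue.equals(d.get("id"))) {
                return d.get(field);
            }
        }
        return null;
    }
}
```

For the tagging case, changing a post's tags would mean calling updateDocument("id", postId, newDoc) with a freshly built document that carries the new tags; without the unique "id" field there is nothing reliable to delete by.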
I can't find a straightforward yes or no answer to this!
I know I can send multiple reads in parallel, but can I query an index while a separate process/thread is updating it?
It's been a while since I used Lucene. However, assuming you are talking about the Java version, the FAQ has this to say:
Does Lucene allow searching and indexing simultaneously?
Yes. However, an IndexReader only searches the index as of the "point in time" that it was opened. Any updates to the index, either added or deleted documents, will not be visible until the IndexReader is re-opened. So your application must periodically re-open its IndexReaders to see the latest updates. The IndexReader.isCurrent() method allows you to test whether any updates have occurred to the index since your IndexReader was opened.
See also Lucene's near-real-time feature, which gives fast turnaround upon changes (adds, deletes, updates) to the index to being able to search those changes. For example, using near-real-time you could make changes to the index and then reopen the reader every few seconds.
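The "point in time" behaviour described in the FAQ can be illustrated with a toy model. This is not Lucene's implementation or API: ToySearchableIndex and its Reader are hypothetical, and a version counter stands in for whatever Lucene uses internally. Only the isCurrent() name is borrowed from the FAQ text above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: demonstrates snapshot semantics, not how Lucene stores data.
class ToySearchableIndex {
    private final List<String> committed = new ArrayList<>();
    private long version = 0;

    // A write changes the index and bumps its version.
    synchronized void add(String doc) {
        committed.add(doc);
        version++;
    }

    // A reader sees a copy of the index as of the moment it was opened.
    synchronized Reader openReader() {
        return new Reader(new ArrayList<>(committed), version);
    }

    synchronized long currentVersion() {
        return version;
    }

    class Reader {
        private final List<String> snapshot;
        private final long openedAtVersion;

        Reader(List<String> snapshot, long openedAtVersion) {
            this.snapshot = snapshot;
            this.openedAtVersion = openedAtVersion;
        }

        int numDocs() {
            return snapshot.size();
        }

        // False once the index has changed since this reader was opened.
        boolean isCurrent() {
            return openedAtVersion == currentVersion();
        }
    }
}
```

A search thread in this model would keep using its reader while writes continue, check isCurrent() every so often, and call openReader() again when it returns false.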
I have a .NET Windows service which generates Lucene search indexes every night.
I first get all the records from the database and add them to the Lucene index using IndexWriter's AddDocument method, and then call the Optimize method before returning from the method.
Since the set of records fetched is fairly large, indexing takes around 2-3 minutes to complete.
As you already know, Lucene generates intermediate segment files while it is building the index, and it compresses the whole index into 3 files when Optimize is called.
Is there any way I can know that Lucene has finished this index generation process and that the index is available for search?
I need to know this because I want to call another method when the process is completed.
You can check for the existence of the write.lock file.
http://wiki.apache.org/lucene-java/LuceneFAQ#head-733eab8f4000ba0f6c9f4ea052dea77d3d541857
I don't understand why you would need to know when Lucene finishes indexing. You can perform searches while Lucene is indexing. In fact, I think you can search while it's optimizing.
I personally do not like the idea of searching for the lock file. Can you not set a boolean and toggle it after you call writer.optimize()?
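A minimal sketch of that flag idea, written in Java rather than C# and using only the standard library. IndexBuildJob is a hypothetical name; the buildIndex argument stands in for the "add all documents, then optimize" work, and onDone stands in for the method to call afterwards.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: tracks completion of an index build without inspecting lock files.
class IndexBuildJob {
    private final AtomicBoolean finished = new AtomicBoolean(false);

    // True only after the build work has returned normally.
    boolean isFinished() {
        return finished.get();
    }

    // Runs buildIndex in the background, then sets the flag, then runs onDone.
    // If buildIndex throws, the flag stays false and onDone is never called.
    CompletableFuture<Void> start(Runnable buildIndex, Runnable onDone) {
        return CompletableFuture.runAsync(buildIndex)
                .thenRun(() -> finished.set(true))
                .thenRun(onDone);
    }
}
```

If the indexing already runs synchronously inside one method, the simpler form is to call the follow-up method on the line after the optimize call returns; the flag is only needed when other threads have to ask whether the build is done.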