How do I delete/update a doc with Lucene?

I am creating a tagging system for my site.
I got the basics of adding a document to Lucene, but I can't seem to figure out how to delete a document or update one when the user changes the tags of something. I found pages that say to use the document index and that I need to optimize before the change takes effect, but how do I get the document index? I also saw another page that said to use IndexWriter to delete, but I couldn't figure out how to do it with that either.
I am using C# ASP.NET and I don't have Java installed on that machine.

What version of Lucene are you using? The IndexWriter class has an updateDocument method (UpdateDocument in Lucene.Net) that lets you update a document (BTW, an update under the hood is really a delete followed by an add). You will need some identifier (such as a document ID) that tells Lucene which document to update. When you index the document, add a unique identifier field such as a URL or a counter. The Term you pass is then the ID of the document you wish to update. For example, using a URL as the ID, you can update like this:
IndexWriter writer = ...
writer.updateDocument(new Term("id", "http://somedomain.org/somedoc.htm"), doc);

You need an IndexReader to delete a document. I'm not sure about the .NET version, but the Java and C++ versions of the Lucene API have an IndexModifier class that hides the differences between the IndexReader and IndexWriter classes and just uses the appropriate one as you call addDocument() and deleteDocuments().
Also, Lucene has no concept of updating a document in place: you have to remove it and then re-add it. To do this, you will need to make sure that every document has a unique stored ID in the index.
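To make the "delete, then re-add, keyed on a unique ID" idea concrete, here is a minimal plain-Java sketch. It is a toy model only, not the Lucene API: ToyIndex, its methods, and the docNum counter are all hypothetical names invented for illustration. It assumes the ID field is the only thing that identifies a document, and it shows that a plain add never deduplicates while an update replaces every document with that ID.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an index (hypothetical; this is NOT the Lucene API). */
final class ToyIndex {
    /** A stored document: an internal number plus an id field and a tags field. */
    static final class Doc {
        final int docNum;
        final String id;
        final String tags;

        Doc(int docNum, String id, String tags) {
            this.docNum = docNum;
            this.id = id;
            this.tags = tags;
        }

        @Override
        public String toString() {
            return "Doc#" + docNum + "[" + id + " -> " + tags + "]";
        }
    }

    private final List<Doc> docs = new ArrayList<>();
    private int nextDocNum = 0; // assumption: internal numbers are never reused

    /** Always appends; nothing here enforces uniqueness of the id field. */
    void addDocument(String id, String tags) {
        docs.add(new Doc(nextDocNum++, id, tags));
    }

    /** Removes every document whose id field equals the given value. */
    int deleteDocuments(String id) {
        int before = docs.size();
        docs.removeIf(d -> d.id.equals(id));
        return before - docs.size();
    }

    /** An update is modelled as delete-by-id followed by add. */
    void updateDocument(String id, String tags) {
        deleteDocuments(id);
        addDocument(id, tags);
    }

    /** Returns all documents whose id field equals the given value. */
    List<Doc> search(String id) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (d.id.equals(id)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        String url = "http://somedomain.org/somedoc.htm";
        index.addDocument(url, "lucene search");
        index.updateDocument(url, "lucene tagging"); // replaces the old entry
        System.out.println(index.search(url));
    }
}
```

In this model the replacement document gets a fresh internal number, which is why the answers above suggest keying on your own stored ID field rather than on an internal document number.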

Related

Couchbaselite - Is it possible not to create revision document for standalone application

I am building a "standalone" mobile app with ReactNative and CouchbaseLite using the library react-native-couchbase-lite.
Is it possible to keep only one document (i.e. only the original document) without any revision documents, even if I update the document multiple times? For example, if I make multiple updates to a ToDo task, only the original document should be updated and no extra revision document should be created.
Yes. You can tune the maxRevTreeDepth parameter. Set it via a Database object instance. It defaults to 20.
Edit: An alternative approach might be to create a new document every time, and delete the old one. This would be appropriate in a case where one wants to save only a single revision of some documents. It would require creating a new document ID each time, too.

What is a good practice to entirely replace an existing Lucene index?

We use Lucene as a search engine. Our Lucene index is created by a master server, which is then deployed to slave instances.
This deployment is currently done by a script that deletes the files and copies the new ones.
We needed to know if there was any good practice to do a "hot deployment" of a Lucene index. Do we need to stop or suspend Lucene? Do we need to inform Lucene the index has changed?
Thanks
The first step is to open the index in append mode for writing. You can achieve this by constructing the IndexWriter with an IndexWriterConfig whose open mode is IndexWriterConfig.OpenMode.CREATE_OR_APPEND.
Once this is done, you are ready to both update existing documents and add new ones. To update documents, you need some kind of unique identifier for each document (it could be the URL or anything else that is guaranteed to be unique). If you then want to update the document with the ID "Doc001", simply call Lucene's updateDocument function, passing a Term for "Doc001" as the first argument.
This way you can update an existing index without deleting it.

Using Lucene Highlighter infrastructure to mark up arbitrary text

I am using Lucene 3.5 in a client-server architecture as follows: the client issues a query to the server. The server returns a list of terms used in the query, and a list of hits, including snippets generated by the application of a Highlighter to the retrieved documents. The user can then request that the full document be displayed. This document comes from another service that is part of the system I am building.
When the requested document is displayed, I would like to highlight the same terms that were used to retrieve it. I can write some other code to do this without involving the Lucene infrastructure, but since I already have code to generate the snippets, I was hoping to be able to re-use it. (DRY and all that.)
So my question is how best to do this: When the need to mark up a document with search results occurs, the client has the set of terms that were used to retrieve the document and the id of the document that was retrieved. It also knows which fields in the document can be marked up with query terms.
Some possible strategies:
Create a query filter that selects only the needed document and then re-run the query only on that document.
Somehow (how?) construct a Scorer that doesn't depend on a Query but that can be seeded with the terms I already have.
Skip the Lucene infrastructure entirely.
What else?
I believe you could index your documents with a TermVector, which will tell you the position of each term in the original document and make highlighting trivial. Or simply reuse the contrib Highlighter.
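As a rough illustration of why known term positions make highlighting straightforward, here is a plain-Java sketch that does not use Lucene at all. OffsetHighlighter, findOffsets, and markUp are hypothetical names. The character offsets are found by a simple whole-word, case-insensitive scan, which is only a stand-in for offsets you might instead obtain from stored term vectors, and the sketch assumes the text needs no HTML escaping.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: marks up known terms in arbitrary text. */
final class OffsetHighlighter {

    /** Returns [start, end) character spans of whole-word, case-insensitive matches, sorted by start. */
    static List<int[]> findOffsets(String text, List<String> terms) {
        List<int[]> spans = new ArrayList<>();
        for (String term : terms) {
            int len = term.length();
            if (len == 0) {
                continue;
            }
            for (int i = 0; i + len <= text.length(); i++) {
                if (!text.regionMatches(true, i, term, 0, len)) {
                    continue;
                }
                // Require a non-alphanumeric character (or the text edge) on both sides.
                boolean startOk = i == 0 || !Character.isLetterOrDigit(text.charAt(i - 1));
                boolean endOk = i + len == text.length()
                        || !Character.isLetterOrDigit(text.charAt(i + len));
                if (startOk && endOk) {
                    spans.add(new int[] {i, i + len});
                }
            }
        }
        spans.sort(Comparator.comparingInt(s -> s[0]));
        return spans;
    }

    /** Wraps each span in open/close markers; spans must be sorted by start. */
    static String markUp(String text, List<int[]> spans, String open, String close) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] s : spans) {
            if (s[0] < cursor) {
                continue; // skip spans that overlap one already emitted
            }
            out.append(text, cursor, s[0]);
            out.append(open).append(text, s[0], s[1]).append(close);
            cursor = s[1];
        }
        out.append(text, cursor, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Lucene indexes text; lucene also highlights.";
        List<String> terms = new ArrayList<>();
        terms.add("lucene");
        terms.add("text");
        System.out.println(markUp(text, findOffsets(text, terms), "<b>", "</b>"));
    }
}
```

Note that this matches raw surface forms only. If the index uses stemming or other analysis, the terms returned by the server may not appear verbatim in the document, which is one argument for reusing the Highlighter instead.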

Apache Lucene Index Writer

I'm new to Apache Lucene and have run into an issue. I indexed all the files in a directory but didn't close the IndexWriter, and then tried to open the index in Luke. It reported the error "Index not closed", even though the code had already finished executing. How do I unlock the index? If I instantiate a new IndexWriter with the same directory, will it overwrite the existing index?
I am not an expert either..
If I were you, I'd do the following..
1) Add the following snippet to code at the end, which is a must at any cost.
myIndexWriter.close();
2) Delete the existing directory manually, and rerun the whole code.
If you instantiate a new IndexWriter without deleting the directory, it will add docs to the existing index (assuming the writer is opened in append mode rather than create mode, which would overwrite the index). So yes, rerunning the code will result in duplicate index entries.
However, from Lucene's perspective, all those entries are still distinct, i.e. every addDocument() creates a new entry in the index with a new Lucene-internal doc ID.

Lucene index updation and performance

I am working on a job portal site and have been using Lucene for job search functionality.
Users will be posting a number of jobs on our site on a daily basis. We need to make sure that a newly posted job is searchable on the site as soon as possible.
In this context, how do I update Lucene index when a new job is posted or when an existing job is edited?
Can lucene index updating and search work in parallel?
Also, are there any tips or best practices with respect to Lucene indexing, optimizing, performance, etc.?
Appreciate your help!
Thanks!
Yes, Lucene can search from and write to an index at the same time, as long as no more than one IndexWriter writes to it. If you want the new records visible ASAP, call commit() on the IndexWriter often (see IndexWriter's JavaDoc for details).
These Wiki pages might also help:
ImproveIndexingSpeed
ImproveSearchingSpeed
I have used Lucene.Net on a web site similar to what you are building. Yes, you can keep a live index and update it to keep everything up to date. What platform are you using Lucene on: .NET or Java?
Make sure you create a new IndexSearcher, as any additions made after an IndexSearcher has been created are not visible to that instance.
A better approach may be to reopen the IndexReader if you want to reuse the same index searcher.
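The visibility rule described in these answers (changes show up only after a commit, and only to searchers opened afterwards) can be sketched with a small plain-Java model. This is a toy under stated assumptions, not Lucene or Lucene.Net code: SnapshotDemo, ToyWriter, and ToySearcher are hypothetical names, and a "document" is just a string.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of point-in-time searchers (hypothetical; NOT the Lucene API). */
final class SnapshotDemo {

    /** Toy writer: buffers additions until commit() publishes them. */
    static final class ToyWriter {
        private final List<String> pending = new ArrayList<>();
        private List<String> committed = Collections.emptyList();

        void addDocument(String doc) {
            pending.add(doc);
        }

        /** Publishes buffered documents as a new immutable snapshot. */
        void commit() {
            List<String> next = new ArrayList<>(committed);
            next.addAll(pending);
            pending.clear();
            committed = Collections.unmodifiableList(next);
        }

        /** Opens a searcher over whatever is committed right now. */
        ToySearcher openSearcher() {
            return new ToySearcher(committed);
        }
    }

    /** Toy searcher: sees only the snapshot it was opened on. */
    static final class ToySearcher {
        private final List<String> snapshot;

        ToySearcher(List<String> snapshot) {
            this.snapshot = snapshot;
        }

        boolean contains(String doc) {
            return snapshot.contains(doc);
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.addDocument("job-1");
        writer.commit();
        ToySearcher old = writer.openSearcher();

        writer.addDocument("job-2");
        writer.commit();

        // The old searcher keeps its snapshot; a freshly opened one sees the new job.
        System.out.println("old sees job-2: " + old.contains("job-2"));
        System.out.println("new sees job-2: " + writer.openSearcher().contains("job-2"));
    }
}
```

In this model opening a searcher is cheap because it only captures a reference to the current snapshot. With a real index, opening is more expensive, which is the motivation for reopening the reader instead of starting from scratch.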