I am using Luke to add a new field to all existing documents in the index. I can easily do this for one document by using the "Reconstruct and Edit" option. However, I want to do this for all documents in one go. Right now, I have to do it one by one, and with 2000 documents in the index this would take a lot of time.
Some data is taken from the database, then indexed and stored using Lucene.
Later, more data is added to the DB, and I need to index only the newly added data and append it to the existing index files.
Can you explain with a program?
What you are asking about is incremental indexing. It is less about the indexing side and more about how you select the data (the target documents) from the database.
You need to make your SQL SELECT query flexible enough to use a column that identifies newly added or updated rows.
That column is usually a date or timestamp column, something like LAST_ADDED_DT or LAST_UPDT_DT, so you can fetch records added or updated in the last x days, x hours, etc.
For example, on DB2, WHERE DATE(LAST_UPDT_DT) >= CURRENT DATE - 2 DAY will give you the records updated in the last two days.
Then use the updateDocument(...) method of the Lucene IndexWriter instead of addDocument(...). updateDocument(...) deletes any existing documents that match the term you pass in (typically a unique ID field) and then adds the new document, so a new row is simply added and an existing row is replaced.
So this approach takes care of updated existing rows as well as new rows.
Whether Lucene creates new files or appends to existing ones is not your concern; Lucene organizes its files according to its settings and the index format of that version.
You should open your writer in OpenMode.CREATE_OR_APPEND mode.
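The selection and update steps above can be sketched in plain Java without any Lucene dependency. This is only a model of the behaviour, not Lucene code: the Row class, the "id" key field, and the in-memory map standing in for the index are all assumptions made for illustration. In a real indexer, the update method would instead call updateDocument(...) on an IndexWriter with a Term on the unique key field.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IncrementalIndexSketch {

    /** Hypothetical row: unique key, text to index, and its LAST_UPDT_DT value. */
    static final class Row {
        final String id;
        final String text;
        final Instant lastUpdated;

        Row(String id, String text, Instant lastUpdated) {
            this.id = id;
            this.text = text;
            this.lastUpdated = lastUpdated;
        }
    }

    /** Stand-in for the index: at most one entry per unique key. */
    private final Map<String, String> index = new LinkedHashMap<>();

    /** Mirrors the WHERE clause: keep rows changed at or after the cutoff. */
    static List<Row> changedSince(List<Row> rows, Instant cutoff) {
        List<Row> changed = new ArrayList<>();
        for (Row row : rows) {
            if (!row.lastUpdated.isBefore(cutoff)) {
                changed.add(row);
            }
        }
        return changed;
    }

    /**
     * Models the effect of writer.updateDocument(new Term("id", row.id), doc):
     * anything already stored under the key is dropped, then the new version is added.
     */
    void update(Row row) {
        index.remove(row.id);
        index.put(row.id, row.text);
    }

    int size() {
        return index.size();
    }

    String textOf(String id) {
        return index.get(id);
    }

    public static void main(String[] args) {
        Instant lastRun = Instant.parse("2024-01-02T00:00:00Z");
        List<Row> table = List.of(
                new Row("1", "unchanged row", Instant.parse("2024-01-01T08:00:00Z")),
                new Row("2", "edited row", Instant.parse("2024-01-02T09:00:00Z")),
                new Row("3", "new row", Instant.parse("2024-01-03T10:00:00Z")));

        // State of the index after the previous run.
        IncrementalIndexSketch index = new IncrementalIndexSketch();
        index.update(new Row("1", "unchanged row", lastRun));
        index.update(new Row("2", "old text", lastRun));

        // Only rows 2 and 3 are selected; row 2 is replaced, row 3 is added.
        for (Row row : changedSince(table, lastRun)) {
            index.update(row);
        }
        System.out.println(index.size() + " documents, row 2 = " + index.textOf("2"));
    }
}
```

The point of keying every update on a unique ID is that re-running the same batch cannot create duplicates, which is what makes the "last x days" window safe to overlap between runs.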
Hope this helps!
I have indexed approximately 1000 documents in Solr. But all of them are missing a field. I need to add a field to all these documents, and this field will have the same value for all of them. I do not have access to these documents to index them again. Is there any way to do this without re-indexing all the data again?
Unless you've configured your schema to store all values, no, there is no usable way to add a field to the documents without reindexing. If all of your fields are stored, you can use atomic updates to add a new field to a document, so you could query Solr for the IDs of all existing documents and update each one that way.
Otherwise you will have to go with the suggestion from #michielvoo and return a static value from the query string. But then you could also just append it in your application before returning results to the user, or add the field as a default value for the request handler in solrconfig.xml, so that you can edit and change it server side.
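For the atomic-update route, a minimal sketch of building the request body might look like the following. It assumes the unique key field is called "id"; the field name "source" and the value "legacy-import" are made up for illustration. The resulting JSON uses Solr's "set" modifier and would be POSTed to the core's /update handler as application/json; the HTTP call itself is left out here.

```java
import java.util.List;

class AtomicUpdatePayload {

    /** Escapes the characters that must not appear raw inside a JSON string. */
    static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.append('"').toString();
    }

    /** Builds one {"id": ..., field: {"set": value}} entry per document ID. */
    static String build(List<String> ids, String field, String value) {
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) {
                json.append(',');
            }
            json.append("{\"id\":").append(quote(ids.get(i)))
                .append(',').append(quote(field))
                .append(":{\"set\":").append(quote(value)).append("}}");
        }
        return json.append(']').toString();
    }

    public static void main(String[] args) {
        // The IDs would come from a query that returns only the key field of every document.
        System.out.println(build(List.of("doc1", "doc2"), "source", "legacy-import"));
    }
}
```

Sending the IDs in batches rather than one request per document keeps the number of round trips small for an index of this size.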
I only want my Lucene search to give the highest-scoring highlighted fragment per document. Say I have 5 documents, each containing the word "performance" three times; I still want only 5 results to be printed and highlighted on the results page. How can I go about doing that? Thanks!
You get only one fragment per document by calling the highlighter's getBestFragment rather than getBestFragments.
If your call to search is returning the same documents more than once, you very likely have more than one copy of the same document in your index. Make sure that if you intend to create a new index, you open your IndexWriter with its OpenMode set to IndexWriterConfig.OpenMode.CREATE.
I'm trying to index a database table with Lucene 4. I index all fields of a table entry into a document as TextFields (one document per table entry) and then try to search over the directory.
My problem is that I need all the field names that are in the directory in order to use a MultiFieldQueryParser.
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_42, !FIELDS! , analyzer);
How do I get them? I could save them while indexing, but it wouldn't be very performant to log them alongside the index :/
Thank You
Alex
You can get the field names from AtomicReader.getFieldInfos().
That returns a FieldInfos instance. Loop through FieldInfos.iterator() and read each field name from FieldInfo.name.
I don't see why it wouldn't be performant to store them somewhere ahead of time, though.
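To illustrate how cheap the "store them ahead of time" option is, here is a small plain-Java sketch that collects the distinct field names while rows are being indexed. The FieldNameCollector class and the sample column names are hypothetical; the only Lucene-related assumption is that the resulting String[] is what you would pass as the fields argument of MultiFieldQueryParser.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FieldNameCollector {

    /** Distinct field names, kept in the order they were first recorded. */
    private final Set<String> names = new LinkedHashSet<>();

    /** Call once per table entry, at the point where its fields are added to a Document. */
    void record(Map<String, String> row) {
        names.addAll(row.keySet());
    }

    /** Returns the names as an array, the shape a fields parameter would take. */
    String[] toArray() {
        return names.toArray(new String[0]);
    }

    public static void main(String[] args) {
        FieldNameCollector collector = new FieldNameCollector();
        // Hypothetical table entries, column name -> value.
        List<Map<String, String>> rows = List.of(
                Map.of("title", "first entry"),
                Map.of("title", "second entry", "author", "unknown"));
        for (Map<String, String> row : rows) {
            collector.record(row);
        }
        System.out.println(String.join(", ", collector.toArray()));
    }
}
```

The set only ever holds one string per column, so its cost is negligible next to the indexing work itself.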
I am working on an application where the need is to index data without storing it in the database.
When I initialize an object, it should be indexed. Consider a pages table with the fields page_title, tags, and content.
The last field, content, may hold a large amount of text data (sometimes several MB), which is not going to be used for processing at all.
My objective is to index that data without saving it to the database. I mean that for pages, only page_title and tags will be saved to the DB and indexed as well, while content will only be indexed.
I am open to using any full-text search plugin/gem.
I implemented this using Ultrasphinx. I index by manually generating XML documents for Sphinx.